Advertisements
I am trying to extract the highlighted "numeric information" from a Pandas DataFrame column:
Text |
---|
Dimensions: 23"/60 |
Dimensions: 23" / 60 |
Dimensions: 48" |
Dimensions: 22.5X8.25 |
Dimensions: 80IN |
Dimensions: 567 S |
Dimensions: 22.5X8.25 |
Dimensions: 26INNP |
Dimensions: 24" x 55" with pipe 16 x 7 |
I am using regex and is as follows: regex = r"(\d([^\s]*)\s.\s\d*[^\s])|(\d([^\s])*)"
I am using to the below script to create a new column with all the numeric information extracted from each sentence and combined with ‘_’
df2['Numeric_Info'] = df2['Text'].apply(lambda x: '_'.join([i[0] for i in re.findall(regex, str(x))]))
Can someone tell me what I am doing wrong?
>Solution :
You can use
regex = r'\d+(?:\.\d+)?"?(?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)?(?:\s*[A-Z]+\b)?'
df2['Numeric_Info'] = df2['Text'].str.findall(regex).str.join('_')
See the regex demo.
Details:
\d+(?:\.\d+)?
– an int or float value"?
– an optional"
char(?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)?
– an optional sequence of\s*[Xx/.]\s*
–X
,x
,/
or.
enclosed with zero or more whitespaces\d+(?:\.\d+)?
– an int or float value"?
– an optional"
char
(?:\s*[A-Z]+\b)?
– an optional sequence of\s*
– zero or more whitespaces[A-Z]+
– one or more ASCII uppercase letters\b
– a word boundary.
Here is a Pandas test:
import pandas as pd
df = pd.DataFrame({'Text':['No data here','Dimensions: 23"/60', 'Dimensions: 23" / 60', 'Dimensions: 48"', 'Dimensions: 22.5X8.25', 'Dimensions: 80IN', 'Dimensions: 567 S','Dimensions: 22.5X8.25', 'Dimensions: 26INNP','Dimensions: 24" x 55" with pipe 16 x 7']})
regex = r'\d+(?:\.\d+)?"?(?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)?(?:\s*[A-Z]+\b)?'
df['Numeric_Info'] = df['Text'].str.findall(regex).str.join('_')
Output:
>>> df
Text Numeric_Info
0 No data here
1 Dimensions: 23"/60 23"/60
2 Dimensions: 23" / 60 23" / 60
3 Dimensions: 48" 48"
4 Dimensions: 22.5X8.25 22.5X8.25
5 Dimensions: 80IN 80IN
6 Dimensions: 567 S 567 S
7 Dimensions: 22.5X8.25 22.5X8.25
8 Dimensions: 26INNP 26INNP
9 Dimensions: 24" x 55" with pipe 16 x 7 24" x 55"_16 x 7