Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Extract Numeric info from Pandas column using regex

I am trying to extract the highlighted "numeric information" from a Pandas DataFrame column:

Text
Dimensions: 23"/60
Dimensions: 23" / 60
Dimensions: 48"
Dimensions: 22.5X8.25
Dimensions: 80IN
Dimensions: 567 S
Dimensions: 22.5X8.25
Dimensions: 26INNP
Dimensions: 24" x 55" with pipe 16 x 7

I am using regex and is as follows: regex = r"(\d([^\s]*)\s.\s\d*[^\s])|(\d([^\s])*)"

I am using to the below script to create a new column with all the numeric information extracted from each sentence and combined with ‘_’

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df2['Numeric_Info'] = df2['Text'].apply(lambda x: '_'.join([i[0] for i in re.findall(regex, str(x))]))

Can someone tell me what I am doing wrong?

>Solution :

You can use

regex = r'\d+(?:\.\d+)?"?(?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)?(?:\s*[A-Z]+\b)?'
df2['Numeric_Info'] = df2['Text'].str.findall(regex).str.join('_')

See the regex demo.

Details:

  • \d+(?:\.\d+)? – an int or float value
  • "? – an optional " char
  • (?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)? – an optional sequence of
    • \s*[Xx/.]\s*X, x, / or . enclosed with zero or more whitespaces
    • \d+(?:\.\d+)? – an int or float value
    • "? – an optional " char
  • (?:\s*[A-Z]+\b)? – an optional sequence of
    • \s* – zero or more whitespaces
    • [A-Z]+ – one or more ASCII uppercase letters
    • \b – a word boundary.

Here is a Pandas test:

import pandas as pd
df = pd.DataFrame({'Text':['No data here','Dimensions: 23"/60', 'Dimensions: 23" / 60', 'Dimensions: 48"', 'Dimensions: 22.5X8.25', 'Dimensions: 80IN', 'Dimensions: 567 S','Dimensions: 22.5X8.25', 'Dimensions: 26INNP','Dimensions: 24" x 55" with pipe 16 x 7']})
regex = r'\d+(?:\.\d+)?"?(?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)?(?:\s*[A-Z]+\b)?'
df['Numeric_Info'] = df['Text'].str.findall(regex).str.join('_')

Output:

>>> df
                                     Text      Numeric_Info
0                            No data here                  
1                      Dimensions: 23"/60            23"/60
2                    Dimensions: 23" / 60          23" / 60
3                         Dimensions: 48"               48"
4                   Dimensions: 22.5X8.25         22.5X8.25
5                        Dimensions: 80IN              80IN
6                       Dimensions: 567 S             567 S
7                   Dimensions: 22.5X8.25         22.5X8.25
8                      Dimensions: 26INNP            26INNP
9  Dimensions: 24" x 55" with pipe 16 x 7  24" x 55"_16 x 7
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading