Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Panda df to clean free form text and extract certain word or characters

Can you extract a series of letters and numbers from bad freeform data in a dataframe?

I want to create a new column in the data frame with data that contains ‘NEX’ and a series of numbers after it.

import pandas as pd

#Create a Dataframe
data = {
    'ID':[1,2,3,4,5],
    'PROGRAM': [ 'nbu 123456',
                'NBU-123456',
                'nex999999 b12',
                'NXE999999 123',
                'NBU123456 NEX999999']
}

df = pd.DataFrame(data)

I think I’m on the right lines with the below, but I somehow need to combine their functionality:-

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

print(df['PROGRAM'].str.contains('NEX', na=False))
print(df['PROGRAM'].str.extract(r'([NEX]+\d+)', expand=False)) #does not deal with lower case & contains letters NEX not nessary in that order

The result should only bring back NEX999999 (including converting lowercase to uppercase)

df['NEX'] = df['PROGRAM'].str.blahblahblah

>Solution :

You do not need to pre-filter the rows with NEX, your regex can already do that.

One issue with your regex is [NEX], this allows a single character from the N/E/X set, which is not what you want. [NEX]+ would match N or EEEN.

You should use:

df['out'] = df['PROGRAM'].str.upper().str.extract(r'(NEX\d+)', expand=False)

And if you do not care about lower/uppercase:

df['out2'] = df['PROGRAM'].str.extract(r'(NEX\d+)', expand=False, flags=re.I)

And if you just want the digits:

df['out3'] = df['PROGRAM'].str.extract(r'NEX(\d+)', expand=False, flags=re.I)

Output:

   ID              PROGRAM        out       out2    out3
0   1           nbu 123456        NaN        NaN     NaN
1   2           NBU-123456        NaN        NaN     NaN
2   3        nex999999 b12  NEX999999  nex999999  999999
3   4        NXE999999 123        NaN        NaN     NaN
4   5  NBU123456 NEX999999  NEX999999  NEX999999  999999
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading