How to correctly extract a substring from a string in a Pandas data frame?

I have a list of about 300 concepts and a pandas dataframe with two columns: Abstract and Title. Some concepts from the list appear in the Abstract as substrings. I would like to extract those concepts from the Abstract and use them as labels for each record.

I am using the .str.extract function. Manually entering one of the concepts from the list,

dataset["Indexes"]= dataset["Abstract"].str.extract("(land reform)")

works as expected. However, since the list has about 300 concepts, I would like to keep it in a pandas dataframe and use that dataframe as the reference when extracting concepts from the abstract.

I have defined the list as a column of a dataframe:

tofind = landvoc["English"]

Then I modified my earlier call as follows:

dataset["Indexes"]= dataset["Abstract"].str.extract(tofind)

but I got this error:

TypeError: unhashable type: 'Series'

Any suggestions about how to solve this issue?

Thanks

>Solution :

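str.extract expects a single regular-expression string, not a Series, which is why passing tofind directly fails. Join the concepts into one alternation pattern and use str.findall to collect every match: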

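# Build one alternation pattern from the concepts, e.g. "(small|black)"
concepts = fr"({'|'.join(tofind)})"
# Collect every matching concept per row and join them into one string
df['Indexes'] = df['Abstract'].str.findall(concepts).str.join(', ')
print(df)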

# Output
              Abstract       Indexes
0  black and small cat  black, small

Setup:

import pandas as pd

tofind = ['small', 'black']
df = pd.DataFrame({'Abstract': ['black and small cat']})
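
With around 300 real concepts, two things may need extra care: concepts containing regex metacharacters (parentheses, dots, and so on), and short concepts that are prefixes of longer ones ("land" vs "land reform"). Below is a minimal sketch of one way to handle both, assuming the concepts live in a Series like landvoc["English"]. The sample concepts and abstracts are made up for illustration, and the case-insensitive flag is an optional assumption, not something the question asked for.

```python
import re
import pandas as pd

# Hypothetical stand-ins for landvoc["English"] and dataset from the question.
tofind = pd.Series(["land", "land reform", "tenure (customary)"])
dataset = pd.DataFrame({
    "Abstract": [
        "A study of land reform and tenure (customary) rights.",
        "Nothing relevant here.",
    ]
})

# Drop missing values and duplicates, then put longer concepts first so that
# "land reform" is tried before the shorter alternative "land".
terms = sorted(tofind.dropna().astype(str).unique(), key=len, reverse=True)

# Escape each concept so characters such as "(" and ")" are matched literally.
pattern = "|".join(re.escape(term) for term in terms)

# Collect every match per row (ignoring case) and join into a single label string.
dataset["Indexes"] = (
    dataset["Abstract"]
    .str.findall(pattern, flags=re.IGNORECASE)
    .str.join(", ")
)
print(dataset["Indexes"].tolist())
```

Rows without any matching concept end up with an empty string. If concepts should only match whole words, wrapping the pattern in word boundaries (\b) is one option, though it behaves unexpectedly for concepts that start or end with punctuation.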