Create a new column in pandas using multiple exact string matches with str.contains() or str.match()

I have a DataFrame (df) of journals with many different titles.

I want to extract only the journals with the titles below:

Blood, Cancer, Chest, Circulation, Diabetes, JAMA, Endocrinology, Gastroenterology, Gut, Medicine, Neurology, Pediatrics, Physical therapy, Radiology, Surgery, Geriatrics

Some journal titles contain the same words (Blood circulation, Cancer History, etc.). I do not want to select those.

Example

Id Title
1  Blood
2  Blood
3  Blood purification
4  Blood transfusion
5  Cancer
6  Chest
7  Cancer History
8  Chest Analysis

I want to keep only the exact journal titles and create a new column, "Influential", but I cannot find a way to do this with str.contains or str.match.

I am trying two approaches:

df.loc[df['Title'].str.contains("Blood", case = True, na = False), 'Influential'] = 'Blood'
df.loc[df['Title'].str.match("Blood", case = True, na = False), 'Influential'] = 'Blood'

Expected output with the exact title of the journal:

Id Title              Influential
1  Blood              Blood
2  Blood              Blood
3  Blood purification NA
4  Blood transfusion  NA
5  Cancer             Cancer
6  Chest              Chest
7  Cancer History     NA
8  Chest Analysis     NA

Should I do it somehow via regex? Thanks.

>Solution :

If you want to set the Influential column to the value of the Title column whenever the title exactly matches one of the entries in a list (lst below), you can use Series.isin together with numpy.where:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Id':[1,2,3,4,5,6,7,8], 'Title': ['Blood','Blood', 'Blood purification', 'Blood transfusion', 'Cancer', 'Chest', 'Cancer History', 'Chest Analysis']})
lst = ['Blood', 'Chest', 'Cancer']
# isin() tests whole-value equality, so 'Blood purification' is not selected
df['Influential'] = np.where(df['Title'].isin(lst), df['Title'], np.nan)
# >>> df
#    Id               Title Influential
# 0   1               Blood       Blood
# 1   2               Blood       Blood
# 2   3  Blood purification         NaN
# 3   4   Blood transfusion         NaN
# 4   5              Cancer      Cancer
# 5   6               Chest       Chest
# 6   7      Cancer History         NaN
# 7   8      Chest Analysis         NaN
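
As a minimal sketch of the same idea applied to the full list of titles from the question, the snippet below uses Series.where instead of numpy.where, which avoids the extra NumPy import. The names journals and selected are illustrative and do not come from the original post.

```python
import pandas as pd

# Full list of journal titles from the question.
journals = [
    "Blood", "Cancer", "Chest", "Circulation", "Diabetes", "JAMA",
    "Endocrinology", "Gastroenterology", "Gut", "Medicine", "Neurology",
    "Pediatrics", "Physical therapy", "Radiology", "Surgery", "Geriatrics",
]

df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8],
    "Title": ["Blood", "Blood", "Blood purification", "Blood transfusion",
              "Cancer", "Chest", "Cancer History", "Chest Analysis"],
})

# Series.where keeps values where the condition is True and puts a
# missing value everywhere else.
df["Influential"] = df["Title"].where(df["Title"].isin(journals))

# To extract only the rows whose title is an exact match:
selected = df[df["Title"].isin(journals)]
```

Note that isin compares values exactly, so titles that differ in case or carry stray whitespace (for example "blood " instead of "Blood") would not be selected unless they are normalized first.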

If you have a specific word like Blood and you want to set Influential column values with this word if the whole title text equals this word, you can use

df = pd.DataFrame({'Id':[1,2,3,4], 'Title': ['Blood','Blood', 'Blood purification', 'Blood transfusion']})
# Row-wise comparison: only titles equal to 'Blood' in full are labelled
df['Influential'] = df.apply(lambda x: "Blood" if x['Title'] == 'Blood' else np.nan, axis=1)
# >>> df
#    Id               Title Influential
# 0   1               Blood       Blood
# 1   2               Blood       Blood
# 2   3  Blood purification         NaN
# 3   4   Blood transfusion         NaN

If the Title column value is equal to Blood (the x['Title'] == 'Blood' check), the Influential column value is set to Blood; otherwise it is set to np.nan.

Or, just use numpy.where (also suggested in the comments):

import numpy as np
#...
df['Influential'] = np.where(df['Title']=='Blood', df['Title'], np.nan)
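
On the regex part of the question: str.contains matches anywhere in the string and str.match anchors only at the start, so both also select "Blood purification". The sketch below assumes pandas 1.1 or later, where Series.str.fullmatch is available, and builds an alternation pattern from the list. The names titles, pattern, prefix_hits, exact_hits and influential are illustrative only.

```python
import re
import pandas as pd

titles = pd.Series(["Blood", "Blood purification", "Cancer", "Cancer History", None])
lst = ["Blood", "Chest", "Cancer"]

# Escape each title so characters such as "." or "(" are matched literally,
# and wrap the alternation in a non-capturing group.
pattern = "(?:" + "|".join(re.escape(t) for t in lst) + ")"

# str.match only anchors at the start, so "Blood purification" also matches.
prefix_hits = titles.str.match(pattern, na=False)

# str.fullmatch requires the entire string to match the pattern.
exact_hits = titles.str.fullmatch(pattern, na=False)

# Keep the title where it is an exact match, missing value otherwise.
influential = titles.where(exact_hits)
```

For plain literal titles, isin is simpler and gives the same result. A regex is mainly worth it when the matching rule itself needs a pattern, such as optional punctuation or case-insensitive matching.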