Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Filter a dataframe's rows if a list of words or phrases are not included in the column even if its not exact

I understand that to filter a dataframe if a word is not in a dataframe, you employ the below:

df[~df['Job Name'].isin(remover_rows)]

The issue is in my case, the dataframe may contain multiple words like so:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

 'Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Finance and Office Administrator',
 'Office/Accounts Administrator',
 'Office Administrator',
 'Temporary Office Administrator',
 'Accounts and Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Office Admin - Customer Support',
 'Office Administrator',
 'Office Administrator - London - Spanish Speaking'

The above solution only works if the specific words are indicated in the list of words to be excluded so for example, in the list above, if "Office Administrator" is in the remover_rows variable, then those rows will not show. But what if I want to remove rows in a Job Name column that contains ['Finance', 'Spanish'] even if it contains other words too? For example, Id be expecting rows containing 'Finance and Office Administrator' and 'Office Administrator - London - Spanish Speaking' in the Job Name column to not show up.

>Solution :

You can use Series.str.contains withe regular expression:

import re

words = ["Finance", "Spanish"]

x = df["Job Name"].str.contains(
    "|".join(map(re.escape, words)), flags=re.IGNORECASE
)
print(df[~x])

Prints:

                             Job Name
0                Office Administrator
1                Office Administrator
2                Office Administrator
4       Office/Accounts Administrator
5                Office Administrator
6      Temporary Office Administrator
7   Accounts and Office Administrator
8                Office Administrator
9                Office Administrator
10               Office Administrator
11               Office Administrator
12    Office Admin - Customer Support
13               Office Administrator

Or without re:

words = ["Finance", "Spanish"]

x = df["Job Name"].apply(lambda x: any(w in x for w in words))
print(df[~x])
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading