Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Delete rows based on the multiple words stored in a list as fast as possible

I have a dataframe which consist a column named Keyword. There are around 1M Keywords. I want to delete all the rows where the Keywords consist of the words I stored in the list.

Here is some words stored in the list:

excluded_words = ['nz','ca']

I have tried the follwing code:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df[~df['Keyword'].str.contains('|'.join(exclude_words), regex = True)]

This code is blazing fast. Doing its job but with a little issue.

It is deleting any keywords which contains any words including "ca". I want to delete only those keywords where "ca" is a seperate word.

For example let’s say we have two below Keywords

cast iron sump pump

sump pump repair service near ca

The first keyword shouldn’t be deleted as "ca" is just a part of the keyword "cast", not just a word itself. Where the second keyword should be surely deleted as "ca" is a word there.

How to modify the code so that it can deal with it? Thank you in advance.

>Solution :

You can surround each word to exclude with r'\b', a raw Python string which represents the regular expression special sequence for a word boundary (re.py docs):

excluded_words = ['nz', 'ca']
excluded_words = [r'\b' + x + r'\b' for x in excluded_words]
df[~df['Keyword'].str.contains('|'.join(excluded_words), regex=True)]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading