I have a dataframe which consist a column named Keyword. There are around 1M Keywords. I want to delete all the rows where the Keywords consist of the words I stored in the list.
Here is some words stored in the list:
excluded_words = ['nz','ca']
I have tried the follwing code:
df[~df['Keyword'].str.contains('|'.join(exclude_words), regex = True)]
This code is blazing fast. Doing its job but with a little issue.
It is deleting any keywords which contains any words including "ca". I want to delete only those keywords where "ca" is a seperate word.
For example let’s say we have two below Keywords
cast iron sump pump
sump pump repair service near ca
The first keyword shouldn’t be deleted as "ca" is just a part of the keyword "cast", not just a word itself. Where the second keyword should be surely deleted as "ca" is a word there.
How to modify the code so that it can deal with it? Thank you in advance.
>Solution :
You can surround each word to exclude with r'\b', a raw Python string which represents the regular expression special sequence for a word boundary (re.py docs):
excluded_words = ['nz', 'ca']
excluded_words = [r'\b' + x + r'\b' for x in excluded_words]
df[~df['Keyword'].str.contains('|'.join(excluded_words), regex=True)]