How do I identify rows containing list of words with no repetition?

January 2, 2022

I have a list of words as well as a dataset. I would like to identify rows within the dataset that have at least two of the words in the list.

I am able to identify rows containing at least two of the list words, but my code problematically also identifies rows where a single list word is repeated.

Here’s my code:

import pandas as pd
    
data={'Name':['Redred','redblue','redgreen','blue']}

df=pd.DataFrame(data)

df['Good colours'] = (df['Name'].str.contains("(red.*|blue.*|green.*){2,}",case=False, regex=True))

print(df)

In the resulting dataset, the row ‘redred’ returns true when it should not.

In an effort to exclude repeated words I’ve tried (red.*){1,}(blue.*){1,}(green.*){1} but this means that ‘redblue’ and ‘redgreen’ is no longer recognized (and it is essential for the code to recognize these rows.)

How do I write the regex so it identifies two instances of the list words while excluding repeated instances of the list word?

>Solution :

You can use

(df['Name'].str.contains(r"^(?!(red|blue|green)\1+$)(?:.*(?:red|blue|green)){2}",case=False, regex=True))

Details:

^ – start of string
(?!(red|blue|green)\1+$) – a negative lookahead that fails the match if there is
- (red|blue|green) – Group 1: any of the substrings defined in the group
- \1+ – one or more repetitions of Group 1 value and then
- $ – end of string
(?:.*(?:red|blue|green)){2} – two occurrences of zero or more chars other than line break chars, as many as possible and then an single occurrence of the substrings defined in the group.

See the regex demo.

Pandas demo:

>>> import pandas as pd
>>> data={'Name':['Redred','redblue','redgreen','blue','redredgreen']}
>>> df=pd.DataFrame(data)
>>> (df['Name'].str.contains(r"^(?!(red|blue|green)\1+$)(?:.*(?:red|blue|green)){2}",case=False, regex=True))
0    False
1     True
2     True
3    False
4     True