Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How do I identify rows containing list of words with no repetition?

I have a list of words as well as a dataset. I would like to identify rows within the dataset that have at least two of the words in the list.

I am able to identify rows containing at least two of the list words, but my code problematically also identifies rows where a single list word is repeated.

Here’s my code:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

import pandas as pd
    
data={'Name':['Redred','redblue','redgreen','blue']}

df=pd.DataFrame(data)

df['Good colours'] = (df['Name'].str.contains("(red.*|blue.*|green.*){2,}",case=False, regex=True))

print(df)

In the resulting dataset, the row ‘redred’ returns true when it should not.

In an effort to exclude repeated words I’ve tried (red.*){1,}(blue.*){1,}(green.*){1} but this means that ‘redblue’ and ‘redgreen’ is no longer recognized (and it is essential for the code to recognize these rows.)

How do I write the regex so it identifies two instances of the list words while excluding repeated instances of the list word?

>Solution :

You can use

(df['Name'].str.contains(r"^(?!(red|blue|green)\1+$)(?:.*(?:red|blue|green)){2}",case=False, regex=True))

Details:

  • ^ – start of string
  • (?!(red|blue|green)\1+$) – a negative lookahead that fails the match if there is
    • (red|blue|green) – Group 1: any of the substrings defined in the group
    • \1+ – one or more repetitions of Group 1 value and then
    • $ – end of string
  • (?:.*(?:red|blue|green)){2} – two occurrences of zero or more chars other than line break chars, as many as possible and then an single occurrence of the substrings defined in the group.

See the regex demo.

Pandas demo:

>>> import pandas as pd
>>> data={'Name':['Redred','redblue','redgreen','blue','redredgreen']}
>>> df=pd.DataFrame(data)
>>> (df['Name'].str.contains(r"^(?!(red|blue|green)\1+$)(?:.*(?:red|blue|green)){2}",case=False, regex=True))
0    False
1     True
2     True
3    False
4     True
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading