How to make a regex orderless when validating a list of texts?

My input is this dataframe (but it could be a simple list):

import pandas as pd

df = pd.DataFrame({'description': ['ij edf m-nop ij abc', 'abc ij mnop yz', 'yz yz mnop aa abc', 'i j y y abc xxx mnop y z', 'yz mnop ij kl abc uvwxyz', 'aaabc ijij uuu yz mnop']})

I also have a list of keywords (between 3 and 7 items) that I need to validate. A row should be valid only if it contains every keyword as an exact whole word; any characters in between are ignored. The problem is that the keywords don’t necessarily appear in the order I put them in my list (here keywords).

I searched on Google and here too but couldn’t find any post about a similar topic. So I wrote the code below, which builds every permutation of the keywords and joins them into one regex string.

import re
import itertools

keywords = ['abc', 'ij', 'mnop', 'yz']

regex = ''
for perm in list(itertools.permutations(keywords)):
    perm = [fr'\b{key}\b' for key in perm]
    regex += f'(?:{".*".join(perm)})|'

regex = regex.rstrip('|')

Here is the resulting regex (wrapped for display):

# (?:\babc\b.*\bij\b.*\bmnop\b.*\byz\b)|(?:\babc\b.*\bij\b.*\byz\b.*\bmnop\b)|(?:\
# babc\b.*\bmnop\b.*\bij\b.*\byz\b)|(?:\babc\b.*\bmnop\b.*\byz\b.*\bij\b)|(?:\babc
# \b.*\byz\b.*\bij\b.*\bmnop\b)|(?:\babc\b.*\byz\b.*\bmnop\b.*\bij\b)|(?:\bij\b.*\
# babc\b.*\bmnop\b.*\byz\b)|(?:\bij\b.*\babc\b.*\byz\b.*\bmnop\b)|(?:\bij\b.*\bmno
# p\b.*\babc\b.*\byz\b)|(?:\bij\b.*\bmnop\b.*\byz\b.*\babc\b)|(?:\bij\b.*\byz\b.*\
# babc\b.*\bmnop\b)|(?:\bij\b.*\byz\b.*\bmnop\b.*\babc\b)|(?:\bmnop\b.*\babc\b.*\b
# ij\b.*\byz\b)|(?:\bmnop\b.*\babc\b.*\byz\b.*\bij\b)|(?:\bmnop\b.*\bij\b.*\babc\b
# .*\byz\b)|(?:\bmnop\b.*\bij\b.*\byz\b.*\babc\b)|(?:\bmnop\b.*\byz\b.*\babc\b.*\b
# ij\b)|(?:\bmnop\b.*\byz\b.*\bij\b.*\babc\b)|(?:\byz\b.*\babc\b.*\bij\b.*\bmnop\b
# )|(?:\byz\b.*\babc\b.*\bmnop\b.*\bij\b)|(?:\byz\b.*\bij\b.*\babc\b.*\bmnop\b)|(?
# :\byz\b.*\bij\b.*\bmnop\b.*\babc\b)|(?:\byz\b.*\bmnop\b.*\babc\b.*\bij\b)|(?:\by
# z\b.*\bmnop\b.*\bij\b.*\babc\b)

While it works on the example I gave, it takes 5-15 minutes on my real dataset (50k rows and very long descriptions with line breaks), and I’m not sure my approach handles all the rows correctly. There is another problem: sometimes I have to validate a list of 6 keywords, which gives 720 permutations!

Can you help me solve this? Is regex the right way to approach my problem?

My expected output is this:

                description  valid
0       ij edf m-nop ij abc
1            abc ij mnop yz   True
2         yz yz mnop aa abc
3  i j y y abc xxx mnop y z
4  yz mnop ij kl abc uvwxyz   True
5    aaabc ijij uuu yz mnop

>Solution :

A regex can be useful, but generating all permutations is not appropriate.
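
To put a number on it, the permutation approach produces one alternative per ordering, so the count grows as the factorial of the number of keywords. A quick sketch using only the standard library (the helper name branch_count is hypothetical, used here for illustration only):

```python
import math

def branch_count(n_keywords: int) -> int:
    """Number of alternatives in a regex that spells out every keyword order."""
    return math.factorial(n_keywords)

# The question mentions lists of 3 to 7 keywords.
for n in range(3, 8):
    print(n, branch_count(n))
```

With 6 keywords this gives the 720 alternatives mentioned in the question, and each alternative may have to be tried against every row.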

I would use a regex to extract the words, then check that the keywords are a subset of the extracted words with set.issubset:

import re

keywords = {'abc', 'ij', 'mnop', 'yz'} # this is a SET

reg = re.compile(r'\b[a-z]+\b', flags=re.I)

df['valid'] = [keywords.issubset(reg.findall(x)) for x in df['description']]

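NB. re.I only makes the word extraction case-insensitive; the set comparison itself is still case-sensitive. If the match should ignore case, you might want to add a casefold step on both the keywords and the extracted words.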
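
A minimal sketch of such a casefold step, assuming the same keywords and word pattern as above. The names folded_keywords, word_re and has_all_keywords are illustrative and not part of the original answer:

```python
import re

keywords = {'abc', 'ij', 'mnop', 'yz'}
# Fold the keywords once, up front.
folded_keywords = {k.casefold() for k in keywords}

word_re = re.compile(r'\b[a-z]+\b', flags=re.I)

def has_all_keywords(text: str) -> bool:
    """True if every keyword occurs as a whole word, ignoring case."""
    words = {w.casefold() for w in word_re.findall(text)}
    return folded_keywords.issubset(words)

print(has_all_keywords('ABC Ij mnop YZ'))  # all four keywords present
print(has_all_keywords('abc ij mnop'))     # 'yz' is missing
```

On the dataframe this could be applied with df['description'].map(has_all_keywords). Because it takes a plain string, the same function also works on a simple list of texts.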

Output:

                description  valid
0       ij edf m-nop ij abc  False
1            abc ij mnop yz   True
2         yz yz mnop aa abc  False
3  i j y y abc xxx mnop y z  False
4  yz mnop ij kl abc uvwxyz   True
5    aaabc ijij uuu yz mnop  False

For fun, by tweaking the code you could even get the set of missing words instead of False:

df['valid'] = [keywords.issubset(S:=set(reg.findall(x))) or keywords-S
               for x in df['description']]

                description       valid
0       ij edf m-nop ij abc  {mnop, yz}
1            abc ij mnop yz        True
2         yz yz mnop aa abc        {ij}
3  i j y y abc xxx mnop y z    {yz, ij}
4  yz mnop ij kl abc uvwxyz        True
5    aaabc ijij uuu yz mnop   {abc, ij}

# or
df['missing'] = [keywords-set(reg.findall(x)) for x in df['description']]
df['valid'] = df['missing'].eq(set())

                description     missing  valid
0       ij edf m-nop ij abc  {mnop, yz}  False
1            abc ij mnop yz          {}   True
2         yz yz mnop aa abc        {ij}  False
3  i j y y abc xxx mnop y z    {yz, ij}  False
4  yz mnop ij kl abc uvwxyz          {}   True
5    aaabc ijij uuu yz mnop   {abc, ij}  False
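
If a single regex is still preferred, an alternative technique (not the approach used in this answer) is one lookahead per keyword, which makes the pattern independent of keyword order without enumerating permutations. The sketch below assumes keywords made of word characters, so that \b behaves as in the question; the names pattern and matches_all are illustrative:

```python
import re

keywords = ['abc', 'ij', 'mnop', 'yz']

# One lookahead per keyword; each one scans from the start of the string.
# re.escape guards against regex metacharacters in a keyword, and
# re.DOTALL lets '.' cross line breaks in multi-line descriptions.
pattern = re.compile(
    ''.join(fr'(?=.*\b{re.escape(k)}\b)' for k in keywords),
    flags=re.DOTALL,
)

def matches_all(text: str) -> bool:
    """True if every keyword appears as a whole word, in any order."""
    return pattern.match(text) is not None

descriptions = [
    'ij edf m-nop ij abc', 'abc ij mnop yz', 'yz yz mnop aa abc',
    'i j y y abc xxx mnop y z', 'yz mnop ij kl abc uvwxyz',
    'aaabc ijij uuu yz mnop',
]
print([matches_all(d) for d in descriptions])
```

The pattern length grows linearly with the number of keywords rather than factorially. Whether it beats the set-based approach on very long descriptions is not something this sketch establishes; that would need to be measured on the real data.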