How to make a regex orderless when validating a list of texts?

March 19, 2024

My input is this dataframe (but it could be a simple list):

import pandas as pd

df = pd.DataFrame({'description': ['ij edf m-nop ij abc', 'abc ij mnop yz', 'yz yz mnop aa abc', 'i j y y abc xxx mnop y z', 'yz mnop ij kl abc uvwxyz', 'aaabc ijij uuu yz mnop']})

I also have a list of keywords (between 3 and 7 items) that I need to validate. A description should be valid only if it contains every keyword as an exact whole word, whatever characters sit in between. The problem is that the keywords don’t necessarily appear in the order I put them in my list (here keywords).

I searched on Google and here too but couldn’t find any post on a similar topic. So I wrote the code below, which builds every permutation of the keywords and joins them into one regex string.

import re
import itertools

keywords = ['abc', 'ij', 'mnop', 'yz']

regex = ''
for perm in list(itertools.permutations(keywords)):
    perm = [fr'\b{key}\b' for key in perm]
    regex += f'(?:{".*".join(perm)})|'

regex = regex.rstrip('|')

Here is what my regex looks like:

# (?:\babc\b.*\bij\b.*\bmnop\b.*\byz\b)|(?:\babc\b.*\bij\b.*\byz\b.*\bmnop\b)|(?:\
# babc\b.*\bmnop\b.*\bij\b.*\byz\b)|(?:\babc\b.*\bmnop\b.*\byz\b.*\bij\b)|(?:\babc
# \b.*\byz\b.*\bij\b.*\bmnop\b)|(?:\babc\b.*\byz\b.*\bmnop\b.*\bij\b)|(?:\bij\b.*\
# babc\b.*\bmnop\b.*\byz\b)|(?:\bij\b.*\babc\b.*\byz\b.*\bmnop\b)|(?:\bij\b.*\bmno
# p\b.*\babc\b.*\byz\b)|(?:\bij\b.*\bmnop\b.*\byz\b.*\babc\b)|(?:\bij\b.*\byz\b.*\
# babc\b.*\bmnop\b)|(?:\bij\b.*\byz\b.*\bmnop\b.*\babc\b)|(?:\bmnop\b.*\babc\b.*\b
# ij\b.*\byz\b)|(?:\bmnop\b.*\babc\b.*\byz\b.*\bij\b)|(?:\bmnop\b.*\bij\b.*\babc\b
# .*\byz\b)|(?:\bmnop\b.*\bij\b.*\byz\b.*\babc\b)|(?:\bmnop\b.*\byz\b.*\babc\b.*\b
# ij\b)|(?:\bmnop\b.*\byz\b.*\bij\b.*\babc\b)|(?:\byz\b.*\babc\b.*\bij\b.*\bmnop\b
# )|(?:\byz\b.*\babc\b.*\bmnop\b.*\bij\b)|(?:\byz\b.*\bij\b.*\babc\b.*\bmnop\b)|(?
# :\byz\b.*\bij\b.*\bmnop\b.*\babc\b)|(?:\byz\b.*\bmnop\b.*\babc\b.*\bij\b)|(?:\by
# z\b.*\bmnop\b.*\bij\b.*\babc\b)

While it works on the example I gave, it takes 5-15 minutes on my real dataset (50k rows and very long descriptions with line breaks), and I’m not sure my approach handles all the rows correctly. There is another problem: sometimes I have to validate a list of 6 keywords, which gives 720 permutations!

Can you help me solve this? Is regex the right way to approach my problem?

My expected output is this:

                description  valid
0       ij edf m-nop ij abc
1            abc ij mnop yz   True
2         yz yz mnop aa abc
3  i j y y abc xxx mnop y z
4  yz mnop ij kl abc uvwxyz   True
5    aaabc ijij uuu yz mnop

>Solution :

A regex can be useful, but generating all permutations is not appropriate.
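To put numbers on that, here is a quick sketch of how many alternatives the permutation-based pattern needs for the 3 to 7 keywords mentioned in the question (the name `alternatives` is just for illustration):

```python
import math

# One alternative per ordering of the keywords, so n keywords need n! alternatives
alternatives = {n: math.factorial(n) for n in range(3, 8)}
print(alternatives)  # {3: 6, 4: 24, 5: 120, 6: 720, 7: 5040}
```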

I would use a regex to extract the words, then check that the keywords are a subset of the extracted words with set.issubset:

import re

keywords = {'abc', 'ij', 'mnop', 'yz'} # this is a SET

reg = re.compile(r'\b[a-z]+\b', flags=re.I)

df['valid'] = [keywords.issubset(reg.findall(x)) for x in df['description']]

NB. re.I only affects which words are extracted; the subset test itself is still case-sensitive, so you might want to add a casefold step to ignore case.
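A minimal sketch of what that casefold step could look like, written against a plain list rather than the dataframe. The helper name `has_all_keywords` and the sample `descriptions` are made up for illustration:

```python
import re

keywords = {'abc', 'ij', 'mnop', 'yz'}
word_re = re.compile(r'\b[a-z]+\b', flags=re.I)

def has_all_keywords(text, keywords):
    """Return True if every keyword occurs as a whole word, ignoring case."""
    wanted = {k.casefold() for k in keywords}
    found = {w.casefold() for w in word_re.findall(text)}
    return wanted <= found  # same test as wanted.issubset(found)

# Hypothetical inputs: mixed case, line breaks, and one row that should fail
descriptions = ['ABC ij Mnop yz', 'abc ij\nmnop\nYZ', 'aaabc ijij uuu yz mnop']
results = [has_all_keywords(d, keywords) for d in descriptions]
print(results)  # [True, True, False]
```

Since the words are extracted with findall, line breaks inside a description need no special handling here.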

Output:

                description  valid
0       ij edf m-nop ij abc  False
1            abc ij mnop yz   True
2         yz yz mnop aa abc  False
3  i j y y abc xxx mnop y z  False
4  yz mnop ij kl abc uvwxyz   True
5    aaabc ijij uuu yz mnop  False

For fun, by tweaking the code you could even get the set of missing words instead of False:

df['valid'] = [keywords.issubset(S:=set(reg.findall(x))) or keywords-S
               for x in df['description']]

                description       valid
0       ij edf m-nop ij abc  {mnop, yz}
1            abc ij mnop yz        True
2         yz yz mnop aa abc        {ij}
3  i j y y abc xxx mnop y z    {yz, ij}
4  yz mnop ij kl abc uvwxyz        True
5    aaabc ijij uuu yz mnop   {abc, ij}

# or
df['missing'] = [keywords-set(reg.findall(x)) for x in df['description']]
df['valid'] = df['missing'].eq(set())

                description     missing  valid
0       ij edf m-nop ij abc  {mnop, yz}  False
1            abc ij mnop yz          {}   True
2         yz yz mnop aa abc        {ij}  False
3  i j y y abc xxx mnop y z    {yz, ij}  False
4  yz mnop ij kl abc uvwxyz          {}   True
5    aaabc ijij uuu yz mnop   {abc, ij}  False
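If a single order-independent regex is still wanted, one common alternative to permutations (not part of the answer above) is to chain one lookahead per keyword, so the pattern grows linearly with the number of keywords instead of factorially. A sketch under that assumption, reusing the question's sample descriptions; `pattern` and `valid` are illustrative names:

```python
import re

keywords = ['abc', 'ij', 'mnop', 'yz']

# One lookahead per keyword: each keyword must appear somewhere as a whole word,
# in any order. re.escape guards against keywords containing regex metacharacters.
lookaheads = ''.join(fr'(?=.*\b{re.escape(k)}\b)' for k in keywords)

# re.S lets "." cross line breaks, which matters for multi-line descriptions
pattern = re.compile(lookaheads, flags=re.S)

descriptions = ['ij edf m-nop ij abc', 'abc ij mnop yz', 'yz yz mnop aa abc',
                'i j y y abc xxx mnop y z', 'yz mnop ij kl abc uvwxyz',
                'aaabc ijij uuu yz mnop']

# match() anchors at the start of the string, where all lookaheads are evaluated
valid = [bool(pattern.match(d)) for d in descriptions]
print(valid)  # [False, True, False, False, True, False]
```

Note that \b treats digits and underscores as word characters, so this can differ from the [a-z]+ extraction on text such as "abc1".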