My input is this dataframe (but it could be a simple list):
import pandas as pd
df = pd.DataFrame({'description': ['ij edf m-nop ij abc', 'abc ij mnop yz', 'yz yz mnop aa abc', 'i j y y abc xxx mnop y z', 'yz mnop ij kl abc uvwxyz', 'aaabc ijij uuu yz mnop']})
I also have a list of keywords (between 3 and 7 items) that I need to validate. A row should be valid only if it contains every keyword as an exact whole word, whatever characters appear in between. The problem is that the keywords don't necessarily appear in the order I put them in my list (here keywords).
I searched on Google and here too but couldn't find any post about a similar topic. So I wrote the code below, which builds every permutation of the keywords and joins them into a single regex string.
import re
import itertools
keywords = ['abc', 'ij', 'mnop', 'yz']
regex = ''
for perm in list(itertools.permutations(keywords)):
    perm = [fr'\b{key}\b' for key in perm]
    regex += f'(?:{".*".join(perm)})|'
regex = regex.rstrip('|')
Here is a snippet of my regex:
# (?:\babc\b.*\bij\b.*\bmnop\b.*\byz\b)|(?:\babc\b.*\bij\b.*\byz\b.*\bmnop\b)|(?:\
# babc\b.*\bmnop\b.*\bij\b.*\byz\b)|(?:\babc\b.*\bmnop\b.*\byz\b.*\bij\b)|(?:\babc
# \b.*\byz\b.*\bij\b.*\bmnop\b)|(?:\babc\b.*\byz\b.*\bmnop\b.*\bij\b)|(?:\bij\b.*\
# babc\b.*\bmnop\b.*\byz\b)|(?:\bij\b.*\babc\b.*\byz\b.*\bmnop\b)|(?:\bij\b.*\bmno
# p\b.*\babc\b.*\byz\b)|(?:\bij\b.*\bmnop\b.*\byz\b.*\babc\b)|(?:\bij\b.*\byz\b.*\
# babc\b.*\bmnop\b)|(?:\bij\b.*\byz\b.*\bmnop\b.*\babc\b)|(?:\bmnop\b.*\babc\b.*\b
# ij\b.*\byz\b)|(?:\bmnop\b.*\babc\b.*\byz\b.*\bij\b)|(?:\bmnop\b.*\bij\b.*\babc\b
# .*\byz\b)|(?:\bmnop\b.*\bij\b.*\byz\b.*\babc\b)|(?:\bmnop\b.*\byz\b.*\babc\b.*\b
# ij\b)|(?:\bmnop\b.*\byz\b.*\bij\b.*\babc\b)|(?:\byz\b.*\babc\b.*\bij\b.*\bmnop\b
# )|(?:\byz\b.*\babc\b.*\bmnop\b.*\bij\b)|(?:\byz\b.*\bij\b.*\babc\b.*\bmnop\b)|(?
# :\byz\b.*\bij\b.*\bmnop\b.*\babc\b)|(?:\byz\b.*\bmnop\b.*\babc\b.*\bij\b)|(?:\by
# z\b.*\bmnop\b.*\bij\b.*\babc\b)
While it works on the example I gave, it takes 5-15 minutes on my real dataset (50k rows and very long descriptions with line breaks), and I'm not sure my approach handles all the rows correctly. There is also another problem: sometimes I have to validate a list of 6 keywords, which gives 720 permutations!
Can you help me solve this? Is regex the right way to approach my problem?
My expected output is this:
                description  valid
0       ij edf m-nop ij abc
1            abc ij mnop yz   True
2         yz yz mnop aa abc
3  i j y y abc xxx mnop y z
4  yz mnop ij kl abc uvwxyz   True
5    aaabc ijij uuu yz mnop
>Solution :
A regex can be useful, but generating all permutations is not appropriate.
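To make the cost of the permutation approach concrete, here is a minimal sketch that counts how many alternatives such a regex needs for each keyword-list size mentioned in the question (3 to 7). The name counts is only illustrative.

```python
import math

# Number of permutations (regex alternatives) needed for n keywords
counts = {n: math.factorial(n) for n in range(3, 8)}

for n, total in counts.items():
    print(f'{n} keywords -> {total} permutations')
```

The count grows factorially, so each extra keyword multiplies the size of the pattern.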
I would use a regex to extract the words, then check that the keywords are a subset of the extracted words with set.issubset:
import re
keywords = {'abc', 'ij', 'mnop', 'yz'} # this is a SET
reg = re.compile(r'\b[a-z]+\b', flags=re.I)
df['valid'] = [keywords.issubset(reg.findall(x)) for x in df['description']]
NB. you might want to add a casefold step to ignore case.
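A possible way to add that casefold step is sketched below. It assumes matching should ignore letter case; the helper name has_all_keywords and the sample strings are hypothetical.

```python
import re

keywords = {'abc', 'ij', 'mnop', 'yz'}
# Assumption: keywords should match regardless of letter case
keywords_cf = {k.casefold() for k in keywords}
reg = re.compile(r'\b[a-z]+\b', flags=re.I)

def has_all_keywords(text):
    """Return True if every keyword occurs as a whole word, ignoring case."""
    words = {w.casefold() for w in reg.findall(text)}
    return keywords_cf.issubset(words)

# Hypothetical sample strings
print(has_all_keywords('ABC Ij mnop YZ'))          # True
print(has_all_keywords('aaabc ijij uuu yz mnop'))  # False
```

On the dataframe this could be applied as df['valid'] = [has_all_keywords(x) for x in df['description']].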
Output:
                description  valid
0       ij edf m-nop ij abc  False
1            abc ij mnop yz   True
2         yz yz mnop aa abc  False
3  i j y y abc xxx mnop y z  False
4  yz mnop ij kl abc uvwxyz   True
5    aaabc ijij uuu yz mnop  False
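Since the question mentions that the input could be a simple list and that the real descriptions contain line breaks, here is a small sketch of the same check without pandas. The descriptions list is made up for illustration. Note that re.findall scans the whole string, so line breaks are not an issue, whereas the `.*` in the permutation regex does not cross a newline unless re.DOTALL is set.

```python
import re

keywords = {'abc', 'ij', 'mnop', 'yz'}
reg = re.compile(r'\b[a-z]+\b', flags=re.I)

# Hypothetical descriptions; the second one spans several lines
descriptions = [
    'abc ij mnop yz',
    'yz mnop\nij kl\nabc uvwxyz',
    'aaabc ijij uuu yz mnop',
]

# One boolean per description: True when all keywords are present
valid = [keywords.issubset(reg.findall(text)) for text in descriptions]
print(valid)  # [True, True, False]
```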
For fun, by tweaking the code you could even get the set of missing words instead of False:
# the walrus operator (:=) requires Python 3.8+
df['valid'] = [keywords.issubset(S:=set(reg.findall(x))) or keywords-S
               for x in df['description']]
                description       valid
0       ij edf m-nop ij abc  {mnop, yz}
1            abc ij mnop yz        True
2         yz yz mnop aa abc        {ij}
3  i j y y abc xxx mnop y z    {yz, ij}
4  yz mnop ij kl abc uvwxyz        True
5    aaabc ijij uuu yz mnop   {abc, ij}
# or
df['missing'] = [keywords-set(reg.findall(x)) for x in df['description']]
df['valid'] = df['missing'].eq(set())
                description     missing  valid
0       ij edf m-nop ij abc  {mnop, yz}  False
1            abc ij mnop yz          {}   True
2         yz yz mnop aa abc        {ij}  False
3  i j y y abc xxx mnop y z    {yz, ij}  False
4  yz mnop ij kl abc uvwxyz          {}   True
5    aaabc ijij uuu yz mnop   {abc, ij}  False