Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Reverse lookup of a sub-string in a dict of lists

I am struggling to find a "pythonic" way to make this logic work:

I have a dataframe of payment transactions with a column of strings( "beneficiary"):

index beneficiary
12 REWE SAGT DANKE. …
13 NaN
14 OBI BAU- U. HEIMWER//BERLIN/DE / OBI SAGT DANKE
15 NETFLIX INTERNATIONAL B.V.

I need to create another column in my dataframe which would be the category that every line belongs to.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

index beneficiary category
12 REWE SAGT DANKE. … Groceries
14 OBI BAU- U. HEIMWER//BERLIN/DE / OBI SAGT DANKE Groceries
15 NETFLIX INTERNATIONAL B.V. Entertainment

I am thinking to make a dictionary like this below and somehow reverse look-up the sub-string values from the categories dict with the column above:

categories = {"Groceries": ["EDEKA", "REWE", "OBI"],
            "Entertainment": ["NETFLIX"]}

The logic is: if sub-string "REWE" is in df[‘beneficiary’] then df[‘category’] equals the key of the dict element where the sub-string is.

I am open to other mapping logic.

>Solution :

You can use a regex, for this you need to rework your dictionary to have the matched strings as key and categories as values:

categories = {"Groceries": ["EDEKA", "REWE", "OBI"],
              "Entertainment": ["NETFLIX"]}

cat_sub = {v:k for k,l in categories.items() for v in l}
regex = r'(%s)' % '|'.join(fr'\b{c}\b' for c in cat_sub)
# regex looks like this: (\bEDEKA\b|\bREWE\b|\bOBI\b|\bNETFLIX\b)

df['category'] = df['beneficiary'].str.extract(regex)[0].map(cat_sub)

NB. I used word boundaries (\b) to ensure matching full words, use regex = r'(%s)' % '|'.join(cat_sub) if you don’t want this behaviour

output:

   index                                       beneficiary       category
0      12                             REWE SAGT DANKE. ...      Groceries
1      13                                              NaN            NaN
2      14  OBI BAU- U. HEIMWER//BERLIN/DE / OBI SAGT DANKE      Groceries
3      15                       NETFLIX INTERNATIONAL B.V.  Entertainment

NB. if needed to drop the NaNs, use dropna

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading