How do I create a regex dynamically using strings in a list for use in a pandas dataframe search?

The following code successfully identifies the 2nd and 3rd texts, and only those texts, in a pandas dataframe by searching for rows that contain the whole word "cod" or "i":

import numpy as np
import pandas as pd
texts_df = pd.DataFrame({"id":[1,2,3,4],
                      "text":["she loves coding", 
                              "he was eating cod",
                              "i do not like fish",
                              "fishing is not for me"]})

texts_df.loc[texts_df["text"].str.contains(r'\b(cod|i)\b', regex=True)]

I would like to build the list of words up dynamically by inserting words from a long list but I can’t figure out how to do that successfully.

I’ve tried the following, but I get an error saying "r is not defined". I expected that, since r is not a variable, but I can’t put it inside the string either and I don’t know what to do instead:

kw_list = ["cod", "i"]

kw_regex_string = "\b("
for kw in kw_list:
  kw_regex_string = kw_regex_string + kw + "|"
kw_regex_string = kw_regex_string[:-1]  # remove the final "|" at the end
kw_regex_string = kw_regex_string + ")\b"

myregex = r + kw_regex_string
texts_df.loc[texts_df["text"].str.contains(myregex, regex=True)]

How can I build the ‘or’ condition from the list of keywords and then insert it into the regex in a way that works in the pandas dataframe search?

Solution:

When I do this, I pass each keyword through re.escape (using map) so that any character with a special regex meaning is matched literally. Then I join the results with | as the separator and insert them into a non-capturing group, (?:...), with string formatting:

import re

kw_list = ['cod', 'i']

my_regex = r'\b(?:%s)\b' % '|'.join(map(re.escape, kw_list))

texts_df.loc[texts_df['text'].str.contains(my_regex, regex=True)]
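The snippet above relies on texts_df from the question. As a self-contained sketch of the same approach (the names pattern, mask and matches are only illustrative), the whole thing can be run like this:

```python
import re

import pandas as pd

# Same sample data as in the question.
texts_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "she loves coding",
        "he was eating cod",
        "i do not like fish",
        "fishing is not for me",
    ],
})

kw_list = ["cod", "i"]

# Escape each keyword, join with "|", and wrap in \b(?:...)\b
# so that only whole words match.
pattern = r"\b(?:%s)\b" % "|".join(map(re.escape, kw_list))

# Boolean mask of rows whose text contains any keyword as a whole word.
mask = texts_df["text"].str.contains(pattern, regex=True)
matches = texts_df.loc[mask]

print(matches["id"].tolist())  # [2, 3]
```

Because the group is non-capturing, pandas should not emit the warning about match groups that it gives for a pattern such as \b(cod|i)\b.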

Variant:

my_regex = fr'\b(?:{"|".join(map(re.escape, kw_list))})\b'

Crafted regex: '\\b(?:cod|i)\\b'
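On the "r is not defined" error from the question: r is not an object that can be added to a string. It is a prefix that only has meaning directly in front of a string literal. Without it, "\b" is a single backspace character rather than the regex word-boundary token. A minimal sketch of the question's loop, changed only as much as needed to work (raw literals for the fixed parts, plus re.escape on each keyword), might look like this:

```python
import re

kw_list = ["cod", "i"]

# In a normal string, "\b" is one character: a backspace (ASCII 8).
print(len("\b"))   # 1
# In a raw string, r"\b" is two characters: a backslash and "b",
# which the regex engine reads as a word boundary.
print(len(r"\b"))  # 2

# The loop from the question, with raw string literals for the fixed parts.
kw_regex_string = r"\b("
for kw in kw_list:
    kw_regex_string = kw_regex_string + re.escape(kw) + "|"
kw_regex_string = kw_regex_string[:-1]  # remove the final "|"
kw_regex_string = kw_regex_string + r")\b"

print(kw_regex_string)  # \b(cod|i)\b
```

The resulting string can be passed to str.contains directly, with no separate "r +" step. This version keeps the capturing group from the question; replacing "(" with "(?:" gives the same pattern as the join-based one-liner.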

Example of escaping special characters:

kw_list = ['10.00$', '*word*', '(A)']

# crafted regex
'\\b(?:10\\.00\\$|\\*word\\*|\\(A\\))\\b'
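One caveat with keywords like these: \b only matches between a word character and a non-word character, so a keyword that begins or ends with punctuation (such as "10.00$" or "(A)") may not be found even though it is escaped correctly. The sketch below shows the effect using the re module directly. The lookaround form is an alternative I am suggesting, not part of the answer above, and the sample sentence is made up:

```python
import re

kw_list = ["10.00$", "*word*", "(A)"]
alternation = "|".join(map(re.escape, kw_list))

# Pattern from the answer: word boundaries on both sides.
boundary_regex = r"\b(?:%s)\b" % alternation

# Alternative (assumption, not from the answer): require that the match
# is not preceded or followed by a word character.
lookaround_regex = r"(?<!\w)(?:%s)(?!\w)" % alternation

text = "it costs 10.00$ today"  # made-up sample sentence

# "$" and the following space are both non-word characters,
# so there is no \b between them and the first pattern fails.
print(bool(re.search(boundary_regex, text)))    # False
print(bool(re.search(lookaround_regex, text)))  # True
```

For plain alphanumeric keywords such as "cod" and "i", both forms behave the same way, so the lookaround version is only worth considering when the keyword list can contain punctuation.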