Extract text by keyword between delimiters

Please help me solve the problem with clearing text from unnecessary parts.

Here is an example dataset:

df = pd.DataFrame({'addressfrom': ['Hüseyinağa, Rexee Hotel, Büyük Bayram Sokak', 'Rixos Premium', '123 Main St, Hotel Hilton Antalya', 'Residence Hotel & SPA, 1234']})

and a list of keywords:

 keywords = ['hotel', 'resort', 'hilton', 'novotel', 'rixos', 'palace', 'residence', 'radisson', 'holiday', 'apartments', 'plaza', 'inn', 'club', 'spa']

I'm trying to extract the part of each string that contains a keyword and drop the text around it. The parts are separated by ',' and in some cases by '-'. In the end, I want the following result:

index addressfrom
0 Rexee Hotel
1 Rixos Premium
2 Hotel Hilton Antalya
3 Residence Hotel & SPA

The best I have managed so far is this:

`df = pd.DataFrame({'addressfrom': ['Hüseyinağa, Rexee Hotel, Büyük Bayram Sokak', 'Rixos Premium', '123 Main St, Hotel Hilton Antalya', 'Residence Hotel & SPA, 1234']})

keywords = ['hotel', 'resort', 'hilton', 'novotel', 'rixos', 'palace', 'residence', 'radisson', 'holiday', 'apartments', 'plaza', 'inn', 'club', 'spa']

pattern = f'[^,]*({"|".join(keywords)})[^,]*'

df['addressfrom'] = df['addressfrom'].str.extract(pattern, flags=re.IGNORECASE)

print(df)`

Output:

index addressfrom
0 Hotel
1 Resort
2 Hilton
3 Rixos

>Solution :

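One way to do this is to match the entire comma-delimited segment that contains a keyword, rather than the keyword alone. Your attempt returns only the keyword because str.extract returns the contents of the capture group, not the full match. Using re.search and taking match.group(0) gives the whole segment, and the \b word boundaries keep short keywords such as 'inn' from matching inside longer words. Something like: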

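import re

def extract_keywords(s, keywords):
    # Match the whole comma-delimited segment that contains a keyword as a whole word.
    pattern = rf'[^,]*\b(?:{"|".join(keywords)})\b[^,]*'
    match = re.search(pattern, s, flags=re.IGNORECASE)
    # group(0) is the full segment; strip() drops the space left after a comma.
    return match.group(0).strip() if match else None

df['addressfrom'] = df['addressfrom'].apply(lambda x: extract_keywords(x, keywords))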
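The question also mentions that the separator is sometimes '-'. As a rough sketch of one way to cover that case, you could split each address on the delimiters first and return the first part that contains a keyword. The names `extract_segment`, `KEYWORD_RE` and `DELIMITER_RE` are made up for this illustration. The delimiter rule (a comma, or a hyphen with whitespace on both sides) is an assumption, and so is the last sample row. It uses plain Python so it runs without pandas:

```python
import re

# Keyword list from the question.
KEYWORDS = ['hotel', 'resort', 'hilton', 'novotel', 'rixos', 'palace',
            'residence', 'radisson', 'holiday', 'apartments', 'plaza',
            'inn', 'club', 'spa']

# Whole-word, case-insensitive match for any keyword.
KEYWORD_RE = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, KEYWORDS)) + r')\b',
    flags=re.IGNORECASE,
)

# Assumption: a delimiter is a comma, or a hyphen with whitespace on both
# sides, so hyphenated words are not split apart.
DELIMITER_RE = re.compile(r'\s*,\s*|\s+-\s+')


def extract_segment(address):
    """Return the first delimited part that contains a keyword, else None."""
    for part in DELIMITER_RE.split(address):
        part = part.strip()
        if KEYWORD_RE.search(part):
            return part
    return None


addresses = [
    'Hüseyinağa, Rexee Hotel, Büyük Bayram Sokak',
    'Rixos Premium',
    '123 Main St, Hotel Hilton Antalya',
    'Residence Hotel & SPA, 1234',
    'Kemer - Club Marina',  # hypothetical row using the '-' separator
]
results = [extract_segment(a) for a in addresses]
print(results)
```

With the DataFrame from the question, the same function could be applied as `df['addressfrom'].apply(extract_segment)`. If your data uses hyphens without surrounding spaces as separators, the delimiter pattern would need adjusting.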
