Extract strings based on custom list of items

by MR

September 21, 2022

Say we have this df:

import pandas as pd
df = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color']})

    a
0   hair color other family, friends
1   family, friends hair color

I want to extract strings using my own list of items:

items = ['hair color', 'other', 'family, friends']

I want to do this because there is no consistent delimiter or pattern in the raw data.

Desired output:

import numpy as np
desired_output = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color'],
                                   'hair color': ['hair color', 'hair color'],
                                   'other': ['other', np.nan],
                                   'family, friends': ['family, friends', 'family, friends']
                                  })


                                  a     hair color  other   family, friends
0   hair color other family, friends    hair color  other   family, friends
1   family, friends hair color          hair color  NaN     family, friends

Solution:

You can craft a regex to use with str.extractall:

import re

regex = '|'.join([f'({re.escape(i)})' for i in items])
# '(hair\\ color)|(other)|(family,\\ friends)'

df.join(df['a'].str.extractall(regex)
                   .set_axis(items, axis=1)
                   .groupby(level=0).first())

output:

                                   a  hair color  other  family, friends
0  hair color other family, friends   hair color  other  family, friends
1         family, friends hair color  hair color   None  family, friends
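To see why this works, it can help to look at the intermediate result. The sketch below reuses the df and items from the question and splits the chain into two steps; the names matches and collapsed are only illustrative and do not appear in the answer above.

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color']})
items = ['hair color', 'other', 'family, friends']

# One capture group per item, joined with alternation.
regex = '|'.join(f'({re.escape(i)})' for i in items)

# extractall returns one row per match, indexed by (original row, match number).
# In each of those rows only the column of the group that matched is filled.
matches = df['a'].str.extractall(regex).set_axis(items, axis=1)

# first() keeps the first non-null value per column within each original row,
# collapsing the per-match rows back to one row per input row.
collapsed = matches.groupby(level=0).first()

print(matches)
print(collapsed)
```

Because every item has its own capture group, the position of a group in the regex matches the position of its name in items, which is what makes set_axis(items, axis=1) safe here.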

Update: to prefix the new column names and replace None with NaN:

df.join(df['a'].str.extractall(regex)
                   .set_axis(items, axis=1)
                   .groupby(level=0).first()
                   .add_prefix('item1_')
                   .replace({None: np.nan})
       )

output:

                                   a item1_hair color item1_other item1_family, friends
0  hair color other family, friends        hair color       other       family, friends
1         family, friends hair color       hair color         NaN       family, friends
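As a possible alternative, not part of the answer above, the same columns can be built one item at a time with str.extract. This is a minimal sketch that assumes each item should be searched for independently; the name out is hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color']})
items = ['hair color', 'other', 'family, friends']

out = df.copy()
for item in items:
    # A single capture group with expand=False yields a Series:
    # the matched text where the item is found, a missing value otherwise.
    out[item] = df['a'].str.extract(f'({re.escape(item)})', expand=False)

print(out)
```

Searching per item means one item can still be found when it overlaps another in the text, at the cost of one pass over the column per item.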
