This is my sample code:
import pandas as pd
df = pd.DataFrame({'A':
['btcrr',
'You have crypto here',
'coinbase.com was there ',
'hotwalletint']
})
regex = r"(^|\W)(?:btc|crypto|coinbase|hotwallet)[^A-Za-z0-9]"
tagged_df = df[df['A'].str.contains(regex, na=False, regex=True, case=False)]
The output of tagged_df:
A
1 You have crypto here
2 coinbase.com was there
This returns only the rows that match the regex I gave. However, I want pandas to return the matched keyword instead of the whole row. I am expecting tagged_df to contain something like this:
The expected output of tagged_df:
A
1 crypto
2 coinbase.com
If pandas does not have this ability, please suggest alternatives that can solve this case.
>Solution :
Use pandas.Series.str.extract(). For each capture group in the regular expression, a new column is created containing the value matched by that group for each row; rows with no match get NaN. A non-capturing group, written with ?: at the beginning, e.g. (?:abc), does not produce a column. You can also add ?P<your_name> to the very beginning of a capture group to name the output column associated with that group:
new_df = df['A'].str.extract(r'(?:^|\W)(?P<A>btc|crypto|coinbase|hotwallet)[^A-Za-z0-9]')
Output:
>>> new_df
A
0 NaN
1 crypto
2 coinbase
3 NaN
>>> new_df.dropna()
A
1 crypto
2 coinbase
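Two things differ from the question: the original str.contains() call used case=False, and the expected output showed coinbase.com rather than coinbase. Below is a minimal sketch of how both could be handled with str.extract(). It assumes that passing flags=re.IGNORECASE is an acceptable stand-in for case=False, and that a "keyword" may optionally carry a dot-suffix such as .com. The names keywords, domain_pattern and domain_df, and the extra 'Send BTC now' row, are made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({'A': ['btcrr',
                         'You have crypto here',
                         'coinbase.com was there ',
                         'hotwalletint',
                         'Send BTC now']})  # hypothetical extra row to show case-insensitivity

keywords = r'btc|crypto|coinbase|hotwallet'

# Same pattern as in the answer, but matched case-insensitively.
pattern = rf'(?:^|\W)(?P<A>{keywords})[^A-Za-z0-9]'
tagged_df = df['A'].str.extract(pattern, flags=re.IGNORECASE).dropna()
print(tagged_df)

# Variant (assumption): also capture an optional ".suffix" after the keyword,
# so that "coinbase.com" is returned instead of "coinbase".
domain_pattern = rf'(?:^|\W)(?P<A>(?:{keywords})(?:\.\w+)?)[^A-Za-z0-9]'
domain_df = df['A'].str.extract(domain_pattern, flags=re.IGNORECASE).dropna()
print(domain_df)
```

Note that both patterns still require a non-alphanumeric character after the keyword, as in the original regex, so a keyword at the very end of a string will not match.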