I need a list of all non-duplicated regex matches for each row.
Consider the dataframe below:
Letter Actions
r1 a30,a30
r2 a30,a12-rf,a15,a15
r3 0
r4 a10,a93
r5 a13
I expect:
Letter Actions
r1 ['a30']
r2 ['a30','a12','a15']
r3 0
r4 ['a10','a93']
r5 ['a13']
I have the code below, but it returns all pattern matches, whereas I don't want the duplicated ones:
import pandas as pd
df = pd.DataFrame(
[['r1', 'a30,a30'],
['r2', 'a30,a12-rf,a15,a15'],
['r3', '0'],
['r4', 'a10,a93'],
['r5', 'a13']],
columns=['Letter', 'Actions'])
df['Action_list'] = df['Actions'].str.findall(r'([a]\d{2})')
>Solution :
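You can use set to remove duplicates. Note that a set does not preserve order, so the elements of each list may come out in a different order than in the input: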
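import numpy as np
# True for rows that contain at least one match; other rows keep their original value
mask = df["Actions"].str.contains(r"a\d+", regex=True)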
df["new_Actions"] = np.where(
mask, df["Actions"].str.findall(r"a\d+").apply(set).apply(list), df["Actions"]
)
print(df)
Prints:
Letter Actions new_Actions
0 r1 a30,a30 [a30]
1 r2 a30,a12-rf,a15,a15 [a30, a15, a12]
2 r3 0 0
3 r4 a10,a93 [a93, a10]
4 r5 a13 [a13]
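If the order shown in the question's expected output matters (for example ['a30', 'a12', 'a15'] for r2), one possible variation is to deduplicate with dict.fromkeys, which keeps the first occurrence of each match in its original position. This is only a sketch: unique_matches is a hypothetical helper name, not part of pandas, and it assumes the a\d+ pattern from the answer above and that rows without a match should keep their original string.

```python
import re

import pandas as pd

df = pd.DataFrame(
    [["r1", "a30,a30"],
     ["r2", "a30,a12-rf,a15,a15"],
     ["r3", "0"],
     ["r4", "a10,a93"],
     ["r5", "a13"]],
    columns=["Letter", "Actions"],
)


def unique_matches(text):
    """Hypothetical helper: return unique matches in first-seen order."""
    matches = re.findall(r"a\d+", text)
    if not matches:
        # No match: keep the original value, as in the expected output for r3
        return text
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(matches))


df["new_Actions"] = df["Actions"].apply(unique_matches)
print(df)
```

Compared with the set-based version, this goes through a plain Python function per row instead of np.where, which is usually fine for small frames but may be slower on very large ones.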