Pandas findall by pattern but not duplicated ones

I need a list of all non-duplicated regex matches.

Consider the dataframe below:

Letter      Actions
r1          a30,a30
r2          a30,a12-rf,a15,a15
r3          0
r4          a10,a93
r5          a13

I expect:

Letter      Actions
r1          ['a30']
r2          ['a30','a12','a15']
r3          0
r4          ['a10','a93']
r5          ['a13']

I have the code below, but it returns every pattern match, whereas I don't want the duplicated ones:

import pandas as pd

df = pd.DataFrame(
    [['r1', 'a30,a30'],
     ['r2', 'a30,a12-rf,a15,a15'],
     ['r3', '0'],
     ['r4', 'a10,a93'],
     ['r5', 'a13']],
    columns=['Letter', 'Actions'])

df['Action_list'] = df['Actions'].str.findall(r'([a]\d{2})')

Solution:

You can use set to remove duplicates (note that a set does not preserve the original order of the matches):

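import numpy as np

# True for rows that contain at least one match such as "a30"
mask = df["Actions"].str.contains(r"a\d+", regex=True)

# Matching rows get a de-duplicated list; other rows keep their original value
df["new_Actions"] = np.where(
    mask, df["Actions"].str.findall(r"a\d+").apply(set).apply(list), df["Actions"]
)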
print(df)

Prints:

  Letter             Actions      new_Actions
0     r1             a30,a30            [a30]
1     r2  a30,a12-rf,a15,a15  [a30, a15, a12]
2     r3                   0                0
3     r4             a10,a93       [a93, a10]
4     r5                 a13            [a13]
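
If the order of first appearance matters, as in the expected output in the question, one possible variation is to de-duplicate with dict.fromkeys instead of set, since dictionaries keep insertion order. This is only a sketch: the helper name unique_matches is hypothetical, and the pattern a\d+ is assumed from the answer above rather than the stricter [a]\d{2} used in the question.

```python
import re

import pandas as pd

df = pd.DataFrame(
    [["r1", "a30,a30"],
     ["r2", "a30,a12-rf,a15,a15"],
     ["r3", "0"],
     ["r4", "a10,a93"],
     ["r5", "a13"]],
    columns=["Letter", "Actions"],
)


def unique_matches(text, pattern=r"a\d+"):
    """Hypothetical helper: unique regex matches in order of first appearance.

    Returns the original text unchanged when nothing matches, mirroring
    the np.where fallback used in the answer above.
    """
    matches = re.findall(pattern, text)
    if not matches:
        return text
    # dict.fromkeys drops repeated keys while keeping insertion order
    return list(dict.fromkeys(matches))


df["new_Actions"] = df["Actions"].map(unique_matches)
print(df)
```

With this approach the row r2 keeps its matches in source order (a30, a12, a15), and a row without matches such as r3 keeps its original string.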