Keep only the characters between two hyphens in a DataFrame cell that holds a comma-separated list

I have a DataFrame, e.g.:

Name    Subset
Apple   -AI-,-BI-A,-XC-,ZX-
Bat     -po-,-IJ-,-IA-B

and want to convert it to:

Name    Subset
Apple   AI,BI,XC,ZX
Bat     po,IJ,IA

i.e., removing the first hyphen, then removing the second hyphen and any characters following it until the next comma appears.

Solution:

IIUC, use str.extractall with the regex -([^-]+)- to capture only the values enclosed between two dashes, then GroupBy.agg with ','.join to concatenate the matches of each row. Note that this pattern skips ZX-, which has no leading dash:

df['Subset'] = (df['Subset'].str.extractall(r'-([^-]+)-')[0]
                .groupby(level=0).agg(','.join)
                )

output:

    Name    Subset
0  Apple  AI,BI,XC
1    Bat  po,IJ,IA
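
To make the two steps easier to follow, here is a minimal, self-contained sketch that rebuilds the sample frame from the question and keeps the intermediate result. The variable names matches and joined are only illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Name": ["Apple", "Bat"],
    "Subset": ["-AI-,-BI-A,-XC-,ZX-", "-po-,-IJ-,-IA-B"],
})

# extractall returns one row per regex match; the result is indexed by
# (original row label, match number), and column 0 holds the capture group
matches = df["Subset"].str.extractall(r"-([^-]+)-")[0]

# Grouping on the first index level collects the matches of each original row
joined = matches.groupby(level=0).agg(",".join)

print(matches)
print(joined)
```

Because joined keeps the original row labels, assigning it back to df['Subset'] aligns on the index.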

To handle the case where the left dash can be missing, use [-,]+([^-,]+)-, which accepts a dash or a comma (or both) before the value:

df['Subset'] = (df['Subset'].str.extractall(r'[-,]+([^-,]+)-')[0]
                .groupby(level=0).agg(','.join))

output:

    Name       Subset
0  Apple  AI,BI,XC,ZX
1    Bat     po,IJ,IA
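
One caveat worth sketching: extractall drops rows that have no match at all, so assigning the grouped result back would leave NaN in those rows. The example below is a hedged illustration, not part of the original answer. The third row ("Cat"), the Subset_clean column, and the clean_cell helper are hypothetical additions, and the plain re.findall version is only a cross-check of the same pattern.

```python
import re
import pandas as pd

# Sample data from the question plus one hypothetical row with no match
df = pd.DataFrame({
    "Name": ["Apple", "Bat", "Cat"],
    "Subset": ["-AI-,-BI-A,-XC-,ZX-", "-po-,-IJ-,-IA-B", "none"],
})

PATTERN = r"[-,]+([^-,]+)-"

extracted = (df["Subset"].str.extractall(PATTERN)[0]
             .groupby(level=0).agg(",".join))

# Rows without any match are absent from `extracted`;
# reindex restores them with an empty string instead of NaN
df["Subset_clean"] = extracted.reindex(df.index, fill_value="")

def clean_cell(cell: str) -> str:
    """Apply the same pattern to one cell with the standard re module."""
    return ",".join(re.findall(PATTERN, cell))

# Cross-check: the per-cell version should agree with the vectorised one
check = df["Subset"].map(clean_cell)

print(df)
```

The extractall route stays within pandas string methods, while the per-cell function can be easier to unit test; which one to prefer depends on the data size and on how the edge cases should behave.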