Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to aggregate only non duplicates values using Pandas

I have the dataframe below:

    ID  COL1    COL2
0   id001   val1    xxxxx
1   id001   val1    yyyyy
2   id002   val2    yyyyy
3   id003   val3    zzzzz
4   id003   val4    zzzzz

And this is the expected output:

ID  COL1    COL2
0   id001   val1    xxxxx|yyyyy
1   id002   val2    yyyyy
2   id003   val3|val4   zzzzz

I made the code below but unfortunately val1 (first row) and zzzzz (last row) are repeated.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df_gr = df[['COL1', 'COL2']].astype(str).groupby(df['ID']).agg('|'.join).reset_index()

    ID  COL1    COL2
0   id001   val1|val1   xxxxx|yyyyy
1   id002   val2    yyyyy
2   id003   val3|val4   zzzzz|zzzzz

Do you know how to fix this, please ?

>Solution :

Note that I needed to convert an array to a list with string elements before applying join:

df.groupby(['ID']).agg({'COL1': 'unique', 'COL2':'unique'}).applymap(lambda x: '|'.join(map(str, x))).reset_index()
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading