so I have this dataframe here
>>> df
uniprot_id protein_group protein_family protein_subfamily
0 Q8TAS1 Other KIS NaN
1 P35916 TK VEGFR NaN
2 Q96SB4 CMGC SRPK NaN
3 Q6P3W7 Other SCY1 NaN
4 Q9UKI8 Other TLK NaN
.. ... ... ... ...
561 Q96S53 TKL LISK TESK
562 Q13163 STE STE7 NaN
563 P45985 STE STE7 NaN
564 Q5VT25 AGC DMPK GEK
565 O00141 AGC SGK NaN
There are some duplicate values in the uniprot_id column
and I want to combine them and make idenitcal values merge but different values seperated by a semicolon because the rows for these duplicate uniprot_id values are similar but not identical
after applying the code below I don’t get the result I am looking for, and I’m wondering what i’m doing wrong
df2 = df.groupby(['uniprot_id'])['protein_group','protein_family','protein_subfamily'].apply(lambda x: '; '.join(set(x))).reset_index()
>>> print(df2)
uniprot_id 0
0 A0A0B4J2F2 protein_subfamily; protein_family; protein_group
1 A4QPH2 protein_subfamily; protein_family; protein_group
2 B5MCJ9 protein_subfamily; protein_family; protein_group
3 O00141 protein_subfamily; protein_family; protein_group
4 O00238 protein_subfamily; protein_family; protein_group
.. ... ...
547 Q9Y616 protein_subfamily; protein_family; protein_group
548 Q9Y6E0 protein_subfamily; protein_family; protein_group
549 Q9Y6M4 protein_subfamily; protein_family; protein_group
550 Q9Y6R4 protein_subfamily; protein_family; protein_group
551 Q9Y6S9 protein_subfamily; protein_family; protein_group
I need duplicate rows to combine and to look like this
uniprot_id protein_group protein_family protein_subfamily
133 Q9UK32 Other RSK; RSKb RSKp90; RSKb
>Solution :
Use GroupBy.agg with remove missing values by Series.dropna:
df2 = (df.groupby(['uniprot_id'])[['protein_group','protein_family','protein_subfamily']]
.agg(lambda x: '; '.join(set(x.dropna())))
.reset_index())
print (df2)
uniprot_id protein_group protein_family protein_subfamily
0 O00141 AGC SGK
1 P35916 TK VEGFR
2 P45985 STE STE7
3 Q13163 STE STE7
4 Q5VT25 AGC DMPK GEK
5 Q6P3W7 Other SCY1
6 Q8TAS1 Other KIS
7 Q96S53 TKL LISK TESK
8 Q96SB4 CMGC SRPK
9 Q9UKI8 Other TLK
If order is important dont use sets, because there is order not defined, use dict.fromkeys trick:
df2 = (df.groupby(['uniprot_id'])[['protein_group','protein_family','protein_subfamily']]
.agg(lambda x: '; '.join(dict.fromkeys(x.dropna()).keys()))
.reset_index())