using python I would like to remove duplicate rows based on first column but would like to keep all the values in second column

I have below dataframe

              CVE ID             Product Versions
0      CVE-2022-46689                   Mac OS 12
1      CVE-2022-42856                      Safari
2      CVE-2022-46689             Windows 10 21h1
3      CVE-2022-41121             Windows 10 21h2
4      CVE-2022-42856                      Safari

I would like to remove duplicates based on the column CVE ID but also want to make sure that I store the value present in the 2nd column Product Versions (but remove the value if already present)

Something like this below:

              CVE ID             Product Versions
0      CVE-2022-46689            Mac OS 12, Windows 10 21h1
1      CVE-2022-42856            Safari
2      CVE-2022-41121            Windows 10 21h2

How should I do it?

Any help is appreciated

>Solution :

here is one way to do it

# drop duplicates (in memory)
# groupby CVE ID and join the resulting list of product version

out=(df.drop_duplicates(subset=['CVE ID','Product Versions'])
 .groupby(['CVE ID'],as_index=False)['Product Versions']
 .agg(','.join ))

out
            CVE ID  Product Versions
0   CVE-2022-41121  Windows 10 21h2
1   CVE-2022-42856  Safari
2   CVE-2022-46689  Mac OS 12, Windows 10 21h1

Leave a Reply