How to remove duplicate strings after grouping?

February 9, 2023

I would like to group the following Pandas DataFrame by ID column:


|----+----------------------------------------+-----------------|
| ID | Elements                               | Colors          |
|----+----------------------------------------+-----------------|
| A  | '1st element, 2d element, 3d element'  | 'red, blue'     |
| A  | '2d element, 4th element'              | 'blue, green'   |
| B  | '3d element, 5th element, 6th element' | 'white, purple' |
| B  | '3d element, 5th element, 7th element' | 'white, teal'   |
| B  | '3d element, 5th element, 8th element' | 'white, black'  |
|----+----------------------------------------+-----------------|

In order to obtain the following Pandas DataFrame:

|----+-----------------------------------------------------------------+------------------------------|
| ID | Elements                                                        | Colors                       |
|----+-----------------------------------------------------------------+------------------------------|
| A  | '1st element, 2d element, 3d element, 4th element'              | 'red, blue, green'           |
| B  | '3d element, 5h element, 6th element, 7th element, 8th element' | 'white, purple, teal, black' |
|----+-----------------------------------------------------------------+------------------------------|

>Solution :

To remove the duplicates in the "Elements" and "Colors" columns after grouping by the "ID" column, you can use the following code:

def clean_column(col):
    elements = ', '.join([str(x) for x in col if str(x) != 'nan'])
    unique_elements = list(set(elements.split(', ')))
    sorted_elements = sorted(unique_elements)
    return ', '.join(sorted_elements)

grouped_df = df.groupby('ID').agg({'Elements': clean_column, 'Colors': clean_column})

grouped_df