How to remove rows with duplicate values in some columns, while keeping values that are different?

I have a pandas dataframe in the following format.

| id | name    | last_name | address | x       | y | x_list    | y_list|
| -- | ------- | --------- | ------- | ------- | - | --------- | ----- |
| 1  | 'John'  | 'Smith'   | 'add_1' | 'one'   | 1 | ['one']   | [1]   |
| 2  | 'Tom'   | 'Davis'   | 'add_2' | 'two'   | 2 | ['two']   | [2]   |
| 3  | 'John'  | 'Smith'   | 'add_1' | 'three' | 3 | ['three'] | [3]   |
| 4  | 'Tom'   | 'Davis'   | 'add_2' | 'four'  | 4 | ['four']  | [4]   |
| 5  | 'Susan' | 'Jones'   | 'add_1' | 'one'   | 1 | ['one']   | [1]   |

I have no idea how to approach this problem. I need this output:

| id | name    | last_name  | address | x_list           | y_list |
| -- | ------- | ---------- | ------- | ---------------- | ------ |
| 1  | 'John'  | 'Smith'    | 'add_1' | ['one', 'three'] | [1, 3] |
| 2  | 'Tom'   | 'Davis'    | 'add_2' | ['two', 'four']  | [2, 4] |
| 3  | 'Susan' | 'Jones'    | 'add_1' | ['one']          | [1]    |

Basically, I need to return a new DataFrame, or modify the existing one so the columns with the same name, last_name, and address have their x_list and y_list merged. Can anyone help me how to do this in pandas? This needs to be done on a dataframe of about 58 000 rows.

>Solution :

then use following code:

df.groupby(['name', 'last_name', 'address'])[['x_list', 'y_list']].sum().reset_index()

output:

    name    last_name   address x_list          y_list
0   John    Smith       add_1   [ one , three ] [ 1 , 3 ]
1   Susan   Jones       add_1   [ one ]         [ 1 ]
2   Tom     Davis       add_2   [ two , four ]  [ 2 , 4 ]

Leave a Reply