I have a big dataframe with columns including ID and multiple values and different rows can have same or different ID values. I would like to create a new dataframe so, that every row has only one ID and the specific column values are just appended next to the ID. The Dataframe also has other columns with additional values that are same for same ID rows that i would like to keep
| ID | type1 | type2 | value1 | value2 | value3 |
|---|---|---|---|---|---|
| 1 | dog | yellow | 1 | 2 | 3 |
| 1 | dog | yellow | 5 | 6 | 7 |
| 2 | cat | brown | 1 | 1 | 1 |
| 3 | mouse | blue | 1 | 1 | 1 |
| 1 | dog | yellow | 1 | 2 | 3 |
expected output:
| ID | type1 | type2 | value |
|---|---|---|---|
| 1 | dog | yellow | 1 2 3 5 6 7 1 2 3 |
| 2 | cat | brown | 1 1 1 |
| 3 | mouse | blue | 1 1 1 |
I have been exploring the groupby option, can’t get it to have this kind of output
>Solution :
You can melt and groupby.agg:
group = ['ID', 'type1', 'type2']
out = df.melt(group).groupby(group, as_index=False)['value'].agg(list)
Output:
ID type1 type2 value
0 1 dog yellow [1, 5, 1, 2, 6, 2, 3, 7, 3]
1 2 cat brown [1, 1, 1]
2 3 mouse blue [1, 1, 1]
If order matters:
out = (df.set_index(group).stack().groupby(group).agg(list)
.reset_index(name='value')
)
Output:
ID type1 type2 value
0 1 dog yellow [1, 2, 3, 5, 6, 7, 1, 2, 3]
1 2 cat brown [1, 1, 1]
2 3 mouse blue [1, 1, 1]