My input is this dataframe :
df = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'C'],
'id': ['id002', 'id001', 'id003', 'id004', 'id005', 'id006'],
'total': [None, 25.0, 10.0, 20.0, None, 5.0],
'origin': ['x', 'z', 'y', 'y', 'x', 'y']})
print(df)
category id total origin
0 A id002 NaN x
1 A id001 25.0 z
2 A id003 10.0 y
3 B id004 20.0 y
4 B id005 NaN x
5 C id006 5.0 y
I want in each category to get the minimum id, the sum of total column and the origin that correspond to the minimum id that was choosen.
For that I made the code below :
result = df.groupby('category', as_index=False).agg(
{'id': 'min', 'total': 'sum', 'origin': 'first'}
)
print(result)
category id total origin
0 A id001 35.0 x
1 B id004 20.0 y
2 C id006 5.0 y
Unfortunately there is two problems. The origin is wrong for A and B and the index is gone.
My expected output is this :
category id total origin
1 A id001 35.0 z
3 B id004 20.0 y
5 C id006 5.0 y
Have you guys dealed with a situation like this ? Is there a solution ?
>Solution :
Use DataFrame.sort_values with DataFrame.drop_duplicates for rows by minimal id per groups and rewrite total column with mappping aggregation sum:
s = df.groupby('category')['total'].sum()
result = (df.sort_values(['category','id'])
.drop_duplicates('category')
.assign(total=lambda x: x['category'].map(s)))
print(result)
category id total origin
1 A id001 35.0 z
3 B id004 20.0 y
5 C id006 5.0 y