How to make an aggregation that depends on another one?

January 30, 2024

My input is this dataframe :

df = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'C'],
 'id': ['id002', 'id001', 'id003', 'id004', 'id005', 'id006'],
 'total': [None, 25.0, 10.0, 20.0, None, 5.0],
 'origin': ['x', 'z', 'y', 'y', 'x', 'y']})

print(df)

  category     id  total origin
0        A  id002    NaN      x
1        A  id001   25.0      z
2        A  id003   10.0      y
3        B  id004   20.0      y
4        B  id005    NaN      x
5        C  id006    5.0      y

I want in each category to get the minimum id, the sum of total column and the origin that correspond to the minimum id that was choosen.

For that I made the code below :

result = df.groupby('category', as_index=False).agg(
    {'id': 'min', 'total': 'sum', 'origin': 'first'}
)

print(result)

  category     id  total origin
0        A  id001   35.0      x
1        B  id004   20.0      y
2        C  id006    5.0      y

Unfortunately there is two problems. The origin is wrong for A and B and the index is gone.

My expected output is this :

  category     id  total origin
1        A  id001   35.0      z
3        B  id004   20.0      y
5        C  id006    5.0      y

Have you guys dealed with a situation like this ? Is there a solution ?

>Solution :

Use DataFrame.sort_values with DataFrame.drop_duplicates for rows by minimal id per groups and rewrite total column with mappping aggregation sum:

s = df.groupby('category')['total'].sum()

result = (df.sort_values(['category','id'])
            .drop_duplicates('category')
            .assign(total=lambda x: x['category'].map(s)))
print(result)
  category     id  total origin
1        A  id001   35.0      z
3        B  id004   20.0      y
5        C  id006    5.0      y