How to make a customized count of occurences in a grouped dataframe

September 11, 2022

Please find below my input/output :

INPUT :

      Id  Values   Status
0  Id001     red   online
1  Id002   brown  running
2  Id002   white      off
3  Id003    blue   online
4  Id003   green    valid
5  Id003  yellow  running
6  Id004    rose      off
7  Id004  purple      off

OUTPUT :

      Id             Values                Status  Id_occ  Val_occ  Sta_occ
0  Id001                red                online       1        1        1
1  Id002        brown|white           running|off       2        2        2
2  Id003  blue|green|yellow  online|valid|running       3        3        3
3  Id004        rose|purple                   off       2        2        1

I was able to re-calculate the columns Values and Status but I don’t know how to create the three columns of occurences.

import pandas as pd

df = pd.DataFrame({'Id': ['Id001', 'Id002', 'Id002', 'Id003', 'Id003', 'Id003', 'Id004', 'Id004'],
                   'Values': ['red', 'brown','white','blue', 'green', 'yellow', 'rose', 'purple'],
                   'Status': ['online', 'running', 'off', 'online', 'valid', 'running', 'off', 'off']})

out = (df.groupby(['Id'])
         .agg({'Values': 'unique', 'Status': 'unique'})
         .applymap(lambda x: '|'.join([str(val) for val in list(x)]))
         .reset_index()
        )

Do you have any suggestions of how to create the three columns of occurences ? Also, is there a better way to re-calculate the columns Values and Status ?

>Solution :

You can use named aggregation and a custom function:

ujoin = lambda s: '|'.join(dict.fromkeys(s))

out = (df
         .assign(N_off=df['Status'].eq('off'))
         .groupby(['Id'], as_index=False)
         .agg(**{'Values': ('Values', ujoin),
                 'Status': ('Status', ujoin),
                 'Id_occ': ('Values', 'size'),
                 'Val_occ': ('Values', 'nunique'),
                 'Stat_occ': ('Status', 'nunique'),
                 'N_off': ('N_off', 'sum')
                })
        )

Output:

      Id             Values                Status  Id_occ  Val_occ  Stat_occ  N_off
0  Id001                red                online       1        1         1      0
1  Id002        brown|white           running|off       2        2         2      1
2  Id003  blue|green|yellow  online|valid|running       3        3         3      0
3  Id004        rose|purple                   off       2        2         1      2