Succinct / pythonic method of groupby percentage calculation

I want to group a DataFrame and calculate, within each group, the percentage of rows whose value is over a certain threshold. I have achieved this with two groupbys into temporary dataframes (the per-group count of rows over the threshold, and the total per-group count), then merging the dataframes before calculating the percentage. I assume there is a more succinct method that doesn’t require creating temporary dataframes.

e.g.

import pandas as pd

df = pd.DataFrame({'n': ['a', 'b', 'c', 'a', 'c', 'a', 'a', 'c', 'a'],
                   'm': ['j', 'j', 'j', 'j', 'j', 'j', 'f', 'f', 'f'],
                   'v': [1, 1, 1, 2, 1, 2, 3, 6, 5]})
  1. Group by ‘n’ and ‘m’ and count the rows in each group
dfcount = df.groupby(['n', 'm'])['v'].agg('count').reset_index().rename(columns={'v': 'count'})
  2. Filter values over 1, then group by ‘n’ and ‘m’ and count
dfover1 = df[df['v'] > 1].groupby(['n','m']).count().reset_index().rename(columns={'v': 'over1count'})
  3. Merge the two new dataframes on the groupby columns ‘n’ and ‘m’, then calculate the percentage over 1 for each group
dfnew = dfcount.merge(dfover1, on=['n', 'm'])
dfnew['late%'] = (100 * dfnew['over1count'] / dfnew['count']).round(1)
   n  m  count  over1count  late%
0  a  f      2           2  100.0
1  a  j      3           2   66.7
2  c  f      1           1  100.0

Solution:

You can use method chaining (pipeline) to avoid using named intermediates:

out = (df
   .groupby(['n', 'm'], as_index=False)
   # compute count and latecount
   .agg(**{'count': ('v', 'count'),
           'latecount': ('v', lambda v: len(v[v>1]))
          })
   # filter rows with latecount > 0
   .query('latecount > 0')
   # compute late%
   .assign(**{'late%': lambda d: d['latecount'].div(d['count']).mul(100).round(2)})
)

output:

   n  m  count  latecount   late%
0  a  f      2          2  100.00
1  a  j      3          2   66.67
3  c  f      1          1  100.00
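
A possible variant, offered as a sketch rather than a definitive method: since the mean of a boolean column is the fraction of True values, the percentage can come straight out of the aggregation. The helper column name `late` and the result name `alt` are chosen here for illustration and are not part of the original code. Unlike the chain above, this version keeps groups with no values over 1 (they show 0.0); append `.query('latecount > 0')` to drop them.

```python
import pandas as pd

df = pd.DataFrame({'n': ['a', 'b', 'c', 'a', 'c', 'a', 'a', 'c', 'a'],
                   'm': ['j', 'j', 'j', 'j', 'j', 'j', 'f', 'f', 'f'],
                   'v': [1, 1, 1, 2, 1, 2, 3, 6, 5]})

alt = (df
   # flag rows over the threshold ('late' is an illustrative helper name)
   .assign(late=lambda d: d['v'] > 1)
   .groupby(['n', 'm'], as_index=False)
   # count rows, sum the flags, and take their mean (fraction of True)
   .agg(**{'count': ('late', 'count'),
           'latecount': ('late', 'sum'),
           'late%': ('late', 'mean')})
   # convert the fraction to a percentage
   .assign(**{'late%': lambda d: d['late%'].mul(100).round(2)})
)

print(alt)
```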
