Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Succinct / pythonic method of groupby percentage calculation

I want to groupby and calculate the percentage within each respective group over a certain value. I have achieved this by two groupbys into temporary dataframes (group count over a value, and group count), then merging the dataframes before calculating the percentage. I assume there is a more succinct method that doesn’t require creating temporary dataframes.

e.g.

df = pd.DataFrame({'n': ['a', 'b', 'c', 'a', 'c', 'a', 'a', 'c', 'a'],
                    'm': ['j', 'j', 'j', 'j', 'j', 'j', 'f', 'f', 'f'],
                    'v': [1, 1, 1, 2, 1, 2, 3, 6, 5]})
  1. Count groupby on ‘n’ and ‘m’ and count
dfcount = df.groupby(['n', 'm'])['v'].agg('count').reset_index().rename(columns={'v': 'count'})
  1. Filter values over 1, then groupby ‘n’ and ‘m’ and count
dfover1 = df[df['v'] > 1].groupby(['n','m']).count().reset_index().rename(columns={'v': 'over1count'})
  1. Merge two new dataframes on the groupby columns ‘n’ and ‘m’, then calculate the percentage over 1 for those groupbys.
dfnew = dfcount.merge(dfover1, on=['n', 'm'])
dfnew['late%'] = (100 * dfnew['over1count'] / dfnew['count']).round(1)
n m count latecount late%
0 a f 2 2 100.0
1 a j 3 2 66.7
2 c f 1 1 100.0

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

You can use method chaining (pipeline) to avoid using named intermediates:

out = (df
   .groupby(['n', 'm'], as_index=False)
   # compute count and latecount
   .agg(**{'count': ('v', 'count'),
           'latecount': ('v', lambda v: len(v[v>1]))
          })
   # filter rows with latecount > 0
   .query('latecount > 0')
   # compute late%
   .assign(**{'late%': lambda d: d['latecount'].div(d['count']).mul(100).round(2)})
)

output:

   n  m  count  latecount   late%
0  a  f      2          2  100.00
1  a  j      3          2   66.67
3  c  f      1          1  100.00
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading