Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to slice a PandasGroupByObject and use multiple functions in agg

I have a pandas DataFrame with Data and Groups and I want to perform multiple functions using the agg-method.

from scipy.stats import iqr, norm
import pandas as pd
df = pd.DataFrame({'Data':[1,2,3,5,10,5,3,3,4,1], 'Group':[1,2,2,1,1,1,2,2,2,1]})
df.groupby('Group')['Data'].agg(['median', iqr])

which works fine. But now I want to slice the groups befor I perform the operations.

The problem is, that

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df.groupby('Group')['Data'].iloc[2:-2].agg(['median', iqr])

thows an error because slicing is not supported for PandasGroupByObjects. An error comes for

df.groupby('Group')['Data'].nth[2:-2].agg(['median', iqr])

too, because the combinations is not allowed.

This post shows that a lambda function can be used to slice the values.

I tried to use this, but

df.groupby('Group')['Data'].agg(lambda x: [np.median(x.iloc[2:-2]), iqr(x.iloc[2:-2])])

returns a DataFrames with lists in it, which is not what I want.

How to apply the slicing and get a well formated DataFrame as return value?

Wanted output:

       median  iqr
Group             
1        10.0  0.0
2         3.0  0.0

>Solution :

If you have many groups or many functions to apply, one option is a double groupby:

(df.groupby('Group', as_index=False, group_keys=False)
   .apply(lambda g: g.iloc[2:-2])
   .groupby('Group')['Data']
   .agg(['median', iqr])
)

Efficiency will depend on the number of functions and groups.

output:

       median  iqr
Group             
1        10.0  0.0
2         3.0  0.0

timings

The double groupby is less efficient than named aggregation for few functions and few groups, but more efficient for many of either, which is expected as the slicing is perform once for the double groupby vs as many times as there are groups × aggregations functions (although on a smaller size) in the other case.

Provided example:

# double groupby
2.43 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# lambdas with named aggregation
2.39 ms ± 867 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

100k rows, 2 groups, 2 aggregations:

# double groupby
12.1 ms ± 913 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# lambdas with named aggregation
8.12 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

100k rows, 2 groups, 100 aggregations:

# double groupby
149 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# lambdas with named aggregation
300 ms ± 34.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

100k rows, 52 groups, 5 aggregations:

# double groupby
30.1 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# lambdas with named aggregation
52.2 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

100k rows, 52 groups, 100 aggregations:

# double groupby
220 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# lambdas with named aggregation
837 ms ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading