How to slice a PandasGroupByObject and use multiple functions in agg

August 26, 2022

I have a pandas DataFrame with Data and Groups and I want to perform multiple functions using the agg-method.

from scipy.stats import iqr, norm
import pandas as pd
df = pd.DataFrame({'Data':[1,2,3,5,10,5,3,3,4,1], 'Group':[1,2,2,1,1,1,2,2,2,1]})
df.groupby('Group')['Data'].agg(['median', iqr])

which works fine. But now I want to slice the groups befor I perform the operations.

The problem is, that

df.groupby('Group')['Data'].iloc[2:-2].agg(['median', iqr])

thows an error because slicing is not supported for PandasGroupByObjects. An error comes for

df.groupby('Group')['Data'].nth[2:-2].agg(['median', iqr])

too, because the combinations is not allowed.

This post shows that a lambda function can be used to slice the values.

I tried to use this, but

df.groupby('Group')['Data'].agg(lambda x: [np.median(x.iloc[2:-2]), iqr(x.iloc[2:-2])])

returns a DataFrames with lists in it, which is not what I want.

How to apply the slicing and get a well formated DataFrame as return value?

Wanted output:

       median  iqr
Group             
1        10.0  0.0
2         3.0  0.0

>Solution :

If you have many groups or many functions to apply, one option is a double groupby:

(df.groupby('Group', as_index=False, group_keys=False)
   .apply(lambda g: g.iloc[2:-2])
   .groupby('Group')['Data']
   .agg(['median', iqr])
)

Efficiency will depend on the number of functions and groups.

output:

       median  iqr
Group             
1        10.0  0.0
2         3.0  0.0

timings

The double groupby is less efficient than named aggregation for few functions and few groups, but more efficient for many of either, which is expected as the slicing is perform once for the double groupby vs as many times as there are groups × aggregations functions (although on a smaller size) in the other case.

Provided example:

# double groupby
2.43 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# lambdas with named aggregation
2.39 ms ± 867 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

100k rows, 2 groups, 2 aggregations:

# double groupby
12.1 ms ± 913 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# lambdas with named aggregation
8.12 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

100k rows, 2 groups, 100 aggregations:

# double groupby
149 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# lambdas with named aggregation
300 ms ± 34.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

100k rows, 52 groups, 5 aggregations:

# double groupby
30.1 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# lambdas with named aggregation
52.2 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

100k rows, 52 groups, 100 aggregations:

# double groupby
220 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# lambdas with named aggregation
837 ms ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)