Pandas DataFrame groupby.mean() including string columns

I have a DataFrame that I need to aggregate, and the data can be of mixed types. For numeric columns this is easy with a simple groupby.mean().

Example:

import pandas as pd
import numpy as np

d = {'ID':[1,1,1,2,2,2,3,3,3,3],
     'ID2':['A','A','A','B','B','B','C','C','C','C'],
     'p1':[1,2,1,100,1,1,10,3,2,1],
     'p2':[1,2,1,100,1,1,10,3,2,2],
     'p3':[1,2,1,100,1,1,10,3,2,5],
     'p4':['A','A','A','B','B','B','C','C','X','X']
    }

df = pd.DataFrame(d)

    ID  ID2 p1  p2  p3  p4
0   1   A   1   1   1   A
1   1   A   2   2   2   A
2   1   A   1   1   1   A
3   2   B   100 100 100 B
4   2   B   1   1   1   B
5   2   B   1   1   1   B
6   3   C   10  10  10  C
7   3   C   3   3   3   C
8   3   C   2   2   2   X
9   3   C   1   2   5   X

I then apply df2 = df.groupby(['ID','ID2']).mean() (in recent pandas versions this requires mean(numeric_only=True), since the frame contains a string column), which returns:


        p1          p2          p3
ID  ID2         
1   A   1.333333    1.333333    1.333333
2   B   34.000000   34.000000   34.000000
3   C   4.000000    4.250000    5.000000

This is great; however (not unexpectedly), it does not accommodate 'p4', which is of type str. I would like to aggregate the string values as follows: if all values within a group are the same, keep that value; otherwise replace it with NaN. The column should then be included in the result set:

    p1  p2  p3  p4
ID  ID2             
1   A   1.333333    1.333333    1.333333    A
2   B   34.000000   34.000000   34.000000   B
3   C   4.000000    4.250000    5.000000    NaN

Additionally, I potentially have an unknown number of columns, so I would like to do this via some form of type checking rather than explicitly stating the column names.

Is there a Pythonic way to achieve this without splitting the DataFrame into string and numeric parts?

>Solution :

You can use a custom aggregation function:

dct = {
    'p1': 'mean',
    'p2': 'mean',
    'p3': 'mean',
    # keep the group's value only when it is constant, else NaN
    # (col.iloc[0] is used rather than col.mode(), which returns a
    # Series and can fail to reduce inside agg)
    'p4': lambda col: col.iloc[0] if col.nunique() == 1 else np.nan,
}
agg = df.groupby(['ID','ID2']).agg(**{k: (k, v) for k, v in dct.items()})

Or, by type:

dct = {
    'number': 'mean',
    'object': lambda col: col.iloc[0] if col.nunique() == 1 else np.nan,
}

groupby_cols = ['ID','ID2']
# expand the per-dtype mapping into a per-column mapping
dct = {
    col: fn
    for tp, fn in dct.items()
    for col in df.select_dtypes(tp).columns.difference(groupby_cols)
}
agg = df.groupby(groupby_cols).agg(**{k: (k, v) for k, v in dct.items()})

Output for both:

>>> agg
               p1         p2         p3   p4
ID ID2                                      
1  A     1.333333   1.333333   1.333333    A
2  B    34.000000  34.000000  34.000000    B
3  C     4.000000   4.250000   5.000000  NaN
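The same idea can also be expressed by choosing each column's aggregation with an explicit dtype check at dictionary-construction time. The following is a self-contained sketch combining both steps; the helper name `same_or_nan` is my own, not from the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'ID2': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'p1':  [1, 2, 1, 100, 1, 1, 10, 3, 2, 1],
    'p2':  [1, 2, 1, 100, 1, 1, 10, 3, 2, 2],
    'p3':  [1, 2, 1, 100, 1, 1, 10, 3, 2, 5],
    'p4':  ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'X', 'X'],
})

def same_or_nan(col):
    # keep the group's value only when it is constant, else NaN
    return col.iloc[0] if col.nunique() == 1 else np.nan

group_cols = ['ID', 'ID2']

# pick the aggregation per column based on its dtype,
# so no column names need to be hard-coded
agg_map = {
    col: ('mean' if pd.api.types.is_numeric_dtype(df[col]) else same_or_nan)
    for col in df.columns.difference(group_cols)
}

out = df.groupby(group_cols).agg(agg_map)
```

This produces the same result as the two variants above for any number of columns, as long as every non-grouping column is either numeric or string-like.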