Pandas DataFrame groupby.mean() including string columns

I have a DataFrame that I need to aggregate, and the data can be of mixed types. For numeric columns this is easy with a simple groupby.mean().

Example:

import pandas as pd
import numpy as np

d = {'ID':[1,1,1,2,2,2,3,3,3,3],
     'ID2':['A','A','A','B','B','B','C','C','C','C'],
     'p1':[1,2,1,100,1,1,10,3,2,1],
     'p2':[1,2,1,100,1,1,10,3,2,2],
     'p3':[1,2,1,100,1,1,10,3,2,5],
     'p4':['A','A','A','B','B','B','C','C','X','X']
    }

df = pd.DataFrame(d)

    ID  ID2 p1  p2  p3  p4
0   1   A   1   1   1   A
1   1   A   2   2   2   A
2   1   A   1   1   1   A
3   2   B   100 100 100 B
4   2   B   1   1   1   B
5   2   B   1   1   1   B
6   3   C   10  10  10  C
7   3   C   3   3   3   C
8   3   C   2   2   2   X
9   3   C   1   2   5   X

I then apply df2 = df.groupby(['ID','ID2']).mean() (in recent pandas versions this requires mean(numeric_only=True), since the frame contains a string column), which returns:


        p1          p2          p3
ID  ID2         
1   A   1.333333    1.333333    1.333333
2   B   34.000000   34.000000   34.000000
3   C   4.000000    4.250000    5.000000

This is great; however (not unexpectedly), it does not accommodate 'p4', which is of type str. I would like to aggregate the string values as follows: if all values within a group are the same, keep that value; otherwise replace it with NaN. The column should then be included in the result set:

    p1  p2  p3  p4
ID  ID2             
1   A   1.333333    1.333333    1.333333    A
2   B   34.000000   34.000000   34.000000   B
3   C   4.000000    4.250000    5.000000    NaN

Additionally, I potentially have an unknown number of columns, so I would like to do this via some form of type checking rather than explicitly stating the column names.

Is there a Pythonic way to achieve this without splitting the DataFrame into string and numeric parts?

>Solution :

You can use a custom aggregation function:

dct = {
    'p1': 'mean',
    'p2': 'mean',
    'p3': 'mean',
    # keep the group's value only when it is constant, else NaN
    # (col.iloc[0] is used rather than col.mode(), which returns a
    # Series and can fail to reduce inside agg)
    'p4': lambda col: col.iloc[0] if col.nunique() == 1 else np.nan,
}
agg = df.groupby(['ID','ID2']).agg(**{k: (k, v) for k, v in dct.items()})

Or, by type:

dct = {
    'number': 'mean',
    'object': lambda col: col.iloc[0] if col.nunique() == 1 else np.nan,
}

groupby_cols = ['ID','ID2']
# expand the per-dtype mapping into a per-column mapping
dct = {
    col: fn
    for tp, fn in dct.items()
    for col in df.select_dtypes(tp).columns.difference(groupby_cols)
}
agg = df.groupby(groupby_cols).agg(**{k: (k, v) for k, v in dct.items()})

Output for both:

>>> agg
               p1         p2         p3   p4
ID ID2                                      
1  A     1.333333   1.333333   1.333333    A
2  B    34.000000  34.000000  34.000000    B
3  C     4.000000   4.250000   5.000000  NaN
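The same idea can also be expressed by choosing each column's aggregation with an explicit dtype check at dictionary-construction time. The following is a self-contained sketch combining both steps; the helper name `same_or_nan` is my own, not from the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'ID2': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'p1':  [1, 2, 1, 100, 1, 1, 10, 3, 2, 1],
    'p2':  [1, 2, 1, 100, 1, 1, 10, 3, 2, 2],
    'p3':  [1, 2, 1, 100, 1, 1, 10, 3, 2, 5],
    'p4':  ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'X', 'X'],
})

def same_or_nan(col):
    # keep the group's value only when it is constant, else NaN
    return col.iloc[0] if col.nunique() == 1 else np.nan

group_cols = ['ID', 'ID2']

# pick the aggregation per column based on its dtype,
# so no column names need to be hard-coded
agg_map = {
    col: ('mean' if pd.api.types.is_numeric_dtype(df[col]) else same_or_nan)
    for col in df.columns.difference(group_cols)
}

out = df.groupby(group_cols).agg(agg_map)
```

This produces the same result as the two variants above for any number of columns, as long as every non-grouping column is either numeric or string-like.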