pd.groupby in two columns sum() function does not work

March 22, 2023

I am trying to do the next example. I want to get the total number of survivors (survived) and the total amount paid per class (fared) using the Titanic dataset.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/bvalgard/data/main/titanic.csv')

df.groupby(['pclass'])[['survived', 'fare']].sum()

When I run this code, I only get the total number of survivors, but not the total amount paid. However, if I use other functions such as .min(), .max() etc it works.

What is the problem with the .sum() function then?

>Solution :

This is because you don’t have numerical values but objects (df['fare'].dtype gives dtype('O'), df['survived'].dtype gives dtype('int64')), and groupby.sum silently drops invalid columns. You must convert to_numeric:

(df.assign(fare=pd.to_numeric(df['fare'], errors='coerce'))
   .groupby(['pclass'])[['survived', 'fare']].sum()
)

Output:

        survived        fare
pclass                      
1            200  28265.4043
2            119   5866.6374
3            181   9418.4452

Why does it fail?

groupby.sum only works with numeric data by default (numeric_only=True), while groupby.max has a default of False as it’s more flexible to handle a wider range of data types (you can for instance compare strings, max(['A', 'B']) give 'B').

Can’t I use `numeric_only=False`?

Yes, you can, but this would concatenate the strings rather than adding numerical values, which is not what you want:

df.groupby(['pclass'])[['survived', 'fare']].sum(numeric_only=False)

        survived                                               fare
pclass                                                             
1            200  211.3375151.55151.55151.55151.5526.5577.958305...
2            119  24241311.510.526261311.510.51310.512.525262626...
3            181  7.5520.2520.2520.257.657.657.9257.22927.258.05...