I am trying to do the next example. I want to get the total number of survivors (survived) and the total amount paid per class (fared) using the Titanic dataset.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/bvalgard/data/main/titanic.csv')
df.groupby(['pclass'])[['survived', 'fare']].sum()
When I run this code, I only get the total number of survivors, but not the total amount paid. However, if I use other functions such as .min(), .max() etc it works.
What is the problem with the .sum() function then?
>Solution :
This is because you don’t have numerical values but objects (df['fare'].dtype gives dtype('O'), df['survived'].dtype gives dtype('int64')), and groupby.sum silently drops invalid columns. You must convert to_numeric:
(df.assign(fare=pd.to_numeric(df['fare'], errors='coerce'))
.groupby(['pclass'])[['survived', 'fare']].sum()
)
Output:
survived fare
pclass
1 200 28265.4043
2 119 5866.6374
3 181 9418.4452
Why does it fail?
groupby.sum only works with numeric data by default (numeric_only=True), while groupby.max has a default of False as it’s more flexible to handle a wider range of data types (you can for instance compare strings, max(['A', 'B']) give 'B').
Can’t I use numeric_only=False?
Yes, you can, but this would concatenate the strings rather than adding numerical values, which is not what you want:
df.groupby(['pclass'])[['survived', 'fare']].sum(numeric_only=False)
survived fare
pclass
1 200 211.3375151.55151.55151.55151.5526.5577.958305...
2 119 24241311.510.526261311.510.51310.512.525262626...
3 181 7.5520.2520.2520.257.657.657.9257.22927.258.05...