Home Summing values in column of dataframe according to nonunique values in another column

Questions

Summing values in column of dataframe according to nonunique values in another column

October 10, 2023

I am learning pandas, and have come across a super easy problem, but not sure how to solve it. I have a dataframe with 2 columns. One column is categorical with three levels i.e. the values are non-unique, and one column is quantitative. For example,

df = pd.DataFrame({"X": list("AAABBC"), "Y": range(6)})

I want to sum each value in the Y column according to each unique-value in the X column. I.e. I should obtain a dataframe with 2 columns; one column being the unique values in X i.e. ["A","B","C"] and the other column being the summed values in Y corresponding to each level in X i.e.[3,7,5].

This is obviously quite a basic thing to do, but I have tried googling and I couldn’t find the answer, so it’s quite frustrating. I think the answer should be quite simply, probably a one-liner, but I just don’t know the command. I am quite new to pandas, so please go easy 🙂

>Solution :

You are looking for the .groupby method. Relevant docs:

Usage example:

import pandas as pd

df = pd.DataFrame({"X": list("AAABBC"), "Y": range(6)})

df_by_x = df.groupby("X").sum()

print(df_by_x)

Result:

   Y
X
A  3
B  7
C  5

As I’m sure you can imagine, there are many many other things you can do with grouping in Pandas. There are a handful of different ways to solve the problem you proposed here, although the "one obvious way to do it" would be the .sum() method as above. However, it might be a useful learning exercise to work through some of the other ways you might accomplish the same task.

Another related search term for this operation is "split-apply-combine", but that was somewhat of a short-lived trend in terminology (mostly confined to R users), while "group by" is a long-standing term inherited from SQL.