Summing values in column of dataframe according to nonunique values in another column

I am learning pandas, and have come across a super easy problem, but not sure how to solve it. I have a dataframe with 2 columns. One column is categorical with three levels i.e. the values are non-unique, and one column is quantitative. For example,

df = pd.DataFrame({"X": list("AAABBC"), "Y": range(6)})

I want to sum each value in the Y column according to each unique-value in the X column. I.e. I should obtain a dataframe with 2 columns; one column being the unique values in X i.e. ["A","B","C"] and the other column being the summed values in Y corresponding to each level in X i.e.[3,7,5].

This is obviously quite a basic thing to do, but I have tried googling and I couldn’t find the answer, so it’s quite frustrating. I think the answer should be quite simply, probably a one-liner, but I just don’t know the command. I am quite new to pandas, so please go easy 🙂

>Solution :

You are looking for the .groupby method. Relevant docs:

Usage example:

import pandas as pd

df = pd.DataFrame({"X": list("AAABBC"), "Y": range(6)})

df_by_x = df.groupby("X").sum()



A  3
B  7
C  5

As I’m sure you can imagine, there are many many other things you can do with grouping in Pandas. There are a handful of different ways to solve the problem you proposed here, although the "one obvious way to do it" would be the .sum() method as above. However, it might be a useful learning exercise to work through some of the other ways you might accomplish the same task.

Another related search term for this operation is "split-apply-combine", but that was somewhat of a short-lived trend in terminology (mostly confined to R users), while "group by" is a long-standing term inherited from SQL.

Leave a Reply