I am learning pandas, and have come across a super easy problem, but not sure how to solve it. I have a dataframe with 2 columns. One column is categorical with three levels i.e. the values are non-unique, and one column is quantitative. For example,
df = pd.DataFrame({"X": list("AAABBC"), "Y": range(6)})
I want to sum each value in the Y
column according to each unique-value in the X
column. I.e. I should obtain a dataframe with 2 columns; one column being the unique values in X
i.e. ["A","B","C"]
and the other column being the summed values in Y corresponding to each level in X i.e.[3,7,5]
.
This is obviously quite a basic thing to do, but I have tried googling and I couldn’t find the answer, so it’s quite frustrating. I think the answer should be quite simply, probably a one-liner, but I just don’t know the command. I am quite new to pandas, so please go easy 🙂
>Solution :
You are looking for the .groupby
method. Relevant docs:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Usage example:
import pandas as pd
df = pd.DataFrame({"X": list("AAABBC"), "Y": range(6)})
df_by_x = df.groupby("X").sum()
print(df_by_x)
Result:
Y
X
A 3
B 7
C 5
As I’m sure you can imagine, there are many many other things you can do with grouping in Pandas. There are a handful of different ways to solve the problem you proposed here, although the "one obvious way to do it" would be the .sum()
method as above. However, it might be a useful learning exercise to work through some of the other ways you might accomplish the same task.
Another related search term for this operation is "split-apply-combine", but that was somewhat of a short-lived trend in terminology (mostly confined to R users), while "group by" is a long-standing term inherited from SQL.