I have this dataframe
df = pd.DataFrame({
"col1": ["Kev", "Kev", "Fr"],
"col2": ["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"], }, index=["1", "2", "3"])
It’ll look like this
col1 col2
1 Kev Red; Purple
2 Kev Yellow; Purple; Red
3 Fr Red; Yellow
I want to count all the items in col2 according to col1. In this case the final df will be like this:
col1 col2 count
1 Kev Red 2
2 Kev Purple 2
3 Kev Yellow 1
4 Fr Red 1
5 Fr Yellow 1
I tried using explode:
df2 = (df.explode(df.columns.tolist())
.apply(lambda col: col.str.split(';'))
.explode('col1')
.explode('col2'))
but that only gives me col1 and col2 of my desired dataframe, not the count. If I use crosstab on df2, I’ll get a very different result.
I managed to get the desired output with 2 nested for loops, but my dataframe is so big that it takes almost a minute loading the function. I want to avoid this solution.
>Solution :
After pd.crosstab, you can try melt
df2 = (df.explode(df.columns.tolist())
.apply(lambda col: col.str.split('; ')) # <-- space here
.explode('col1')
.explode('col2'))
out = (pd.crosstab(df2['col1'], df2['col2'])
.melt(value_name='count', ignore_index=False)
.reset_index())
print(out)
col1 col2 count
0 Fr Purple 0
1 Kev Purple 2
2 Fr Red 1
3 Kev Red 2
4 Fr Yellow 1
5 Kev Yellow 1