Count columns with multiple values

May 29, 2022

I have this dataframe

df = pd.DataFrame({
    "col1": ["Kev", "Kev", "Fr"],
    "col2": ["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"], }, index=["1", "2", "3"])

It’ll look like this

    col1 col2
1   Kev  Red; Purple
2   Kev  Yellow; Purple; Red
3   Fr   Red; Yellow

I want to count all the items in col2 according to col1. In this case the final df will be like this:

    col1 col2   count
1   Kev  Red    2
2   Kev  Purple 2
3   Kev  Yellow 1
4   Fr   Red    1
5   Fr   Yellow 1

I tried using explode:

df2 = (df.explode(df.columns.tolist())
      .apply(lambda col: col.str.split(';'))
      .explode('col1')
      .explode('col2'))

but that only gives me col1 and col2 of my desired dataframe, not the count. If I use crosstab on df2, I’ll get a very different result.

I managed to get the desired output with 2 nested for loops, but my dataframe is so big that it takes almost a minute loading the function. I want to avoid this solution.

>Solution :

After pd.crosstab, you can try melt

df2 = (df.explode(df.columns.tolist())
      .apply(lambda col: col.str.split('; ')) # <-- space here
      .explode('col1')
      .explode('col2'))


out = (pd.crosstab(df2['col1'], df2['col2'])
       .melt(value_name='count', ignore_index=False)
       .reset_index())

print(out)

  col1    col2  count
0   Fr  Purple      0
1  Kev  Purple      2
2   Fr     Red      1
3  Kev     Red      2
4   Fr  Yellow      1
5  Kev  Yellow      1