I have this sample dataset:
mydf = pd.DataFrame({'city':['Porto','Loa','Porto','Porto','Loa'],\
'town':['A','C','A','B','C']})
mydf['city'] = pd.Categorical(mydf['city'])
mydf['town'] = pd.Categorical(mydf['town'])
mydf
city town
0 Porto A
1 Loa C
2 Porto A
3 Porto B
4 Loa C
And I want to count the occurrences grouped by city and town. So I tried this:
mydf.groupby(['city','town']).size().to_frame()
0
city town
Loa A 0
B 0
C 2
Porto A 2
B 1
C 0
But this is wrong, since city C is located only in Loa, not in Porto, and cities A and B are located only in Porto. My expected result is this:
0
city town
Loa C 2
Porto A 2
B 1
Sure I can avoid the pd.Categorical conversion in ‘city’ and ‘town’, but I don’t understand that behavior. Is there a parameter I should use to avoid this and get the right and simplified expected result?
>Solution :
Yes, the groupby + size behavior is expected.
By default, if any of the grouping columns are categorical then it will show all the values for categorical columns regardless whether they appear in a particular group or not.
To turn this default behaviors off, you can set the optional parameter observed=True in groupby which will show only observed values(actual appearing values) of categorical columns:
mydf.groupby(['city','town'], observed=True).size().to_frame()
0
city town
Porto A 2
B 1
Loa C 2