I have the following code that uses np.bincount to count occurrences:
print(df[attribute].cat.categories)
>>> Int64Index([0, 1, 2], dtype='int64')
print(df[attribute].to_numpy())
>>> [0 1 0 1 1]
partition = np.bincount(df[attribute].to_numpy())
print(partition)
>>> [2 3]
What I want is for the counts to use bins based on the categories array, so that the result would be [2 3 0], because there are no 2's in the array. Is there any way to do this? My dataframes are always set up so that categorical data types are integer encoded, starting from 0 and going up to the number of classes. I want to avoid using df[attribute].value_counts() because profiling suggests it is a bottleneck, though I'm not entirely sure.
Solution:
You can use np.unique with return_counts=True:
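import numpy as np
import pandas as pd
df = pd.DataFrame({'attribute': [0, 0, 1, 1, 1]})
df = df.astype({'attribute': pd.CategoricalDtype([0, 1, 2])})
# np.unique only reports values that actually occur, so category 2 is absent here
cat, count = np.unique(df['attribute'], return_counts=True)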
Output:
>>> cat, count
(array([0, 1]), array([2, 3]))
As suggested by @jezrael, you can get your expected output by reindexing the counts on the categories and filling the missing ones with 0:
>>> pd.Series(count, index=cat).reindex(df['attribute'].cat.categories, fill_value=0)
0 2
1 3
2 0
dtype: int64
However, you should compare its performance with value_counts, which already includes categories that never occur:
>>> df['attribute'].value_counts(sort=False)
0 2
1 3
2 0
Name: attribute, dtype: int64
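If you would rather stay with np.bincount as in the question, one possible sketch is to pass its minlength argument so the result is padded with zeros up to the number of categories. This assumes the column has no missing values (a missing value gets code -1 in cat.codes, which np.bincount rejects). The example data and the names col, codes and n_categories are illustrative only:

```python
import numpy as np
import pandas as pd

# Illustrative data mirroring the question: categories 0, 1, 2, with no 2's present.
df = pd.DataFrame({"attribute": [0, 1, 0, 1, 1]})
df = df.astype({"attribute": pd.CategoricalDtype([0, 1, 2])})

col = df["attribute"]
n_categories = len(col.cat.categories)

# cat.codes holds each value's integer position within cat.categories.
# Assumption: no missing values, since those would appear as code -1.
codes = col.cat.codes.to_numpy()

# minlength pads the counts with zeros up to the number of categories.
partition = np.bincount(codes, minlength=n_categories)
print(partition)  # [2 3 0]
```

Counting cat.codes rather than the raw values means the result lines up with cat.categories by position, even if the categories were not encoded as 0, 1, 2. Whether this is actually faster than value_counts on your data is something to measure, for example with timeit.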