I have a dataset where i show ticket purchases by number of days out from event (NDO).
| NDO | Purch | Country |
|---|---|---|
| 0 | 3 | UK |
| 1 | 3 | UK |
| 1 | 3 | UK |
| 1 | 3 | USA |
| 2 | 3 | USA |
I want to split the data in groups by NDO, eg 0 days, 1-5 days, 5-10 days, 50-100 days, and add an extra column, ‘NDOGroup’to let me then create a pivot table by NDOGroup, and ultimately a heatmap.
The code I used for this is:
bins = [-1,5,10,16,21,26,31,36,41,46,51,56,61,66,71,76,81,86,91,96,101]
labels = [f'{a0}-{a1 - 1}' for a0, a1 in zip(bins[:-2], bins[1:-1])] + [f'≥{bins[-2]}']
df['NDOGroup'] = pd.cut(df['NDO'], bins=bins, labels=labels, right=True)
countrypt = df.pivot_table(index = 'NDOGroup',columns = 'Country', values ='Purch', aggfunc='sum', observed=False)
plt.figure(figsize = (15,9))
sns.heatmap(countrypt, cmap = 'viridis', vmin=0, linewidths=0.5, linecolor='black')
but the heatmap produced is like so:
and when I examine the data I see that for NDO=0, the NDOGroup = NaN
| NDO | Purch | Country | NDOGroup |
|---|---|---|---|
| 0 | 3 | UK | Nan |
| 1 | 3 | UK | 0-4 |
| 1 | 3 | UK | 0-4 |
| 1 | 3 | USA | 0-4 |
| 2 | 3 | USA | 0-4 |
I have tried different indexing in the bins e.g. [-1,0,5], [-1,4,5] but can’t seem to get it to show the correct groups.
Is this an indexing issue? Or is there a much cleaner way possible to fix this so i get a table like below:
| NDO | Purch | Country | NDOGroup |
|---|---|---|---|
| 0 | 3 | UK | 0 |
| 1 | 3 | UK | 1-5 |
| 3 | 3 | UK | 1-5 |
| 4 | 3 | USA | 1-5 |
| 5 | 3 | USA | 1-5 |
| 6 | 3 | USA | 6-10 |
| 12 | 3 | USA | 11-15 |
>Solution :
You could use:
df = pd.DataFrame({'NDO': [0, 1, 2, 5, 6, 10, 11, 15, 16, 91, 92, 96, 97, 1000]})
bins = [-np.inf,0,5,10,16,21,26,31,36,41,46,51,56,61,66,71,76,81,86,91,96,np.inf]
labels = ['0']+[f'{a0+1}-{a1}' for a0, a1 in zip(bins[1:-2], bins[2:-1])] + [f'≥{bins[-2]}']
df['NDOGroup'] = pd.cut(df['NDO'], bins=bins, labels=labels, right=True)
Variant:
from itertools import pairwise
bins = [-np.inf,0,5,10,16,21,26,31,36,41,46,51,56,61,66,71,76,81,86,91,96,np.inf]
labels = [f'{a0+1}-{a1}' for a0, a1 in pairwise(bins)]
labels[0] = '0'
labels[-1] = f'≥{bins[-2]}'
Output:
NDO NDOGroup
0 0 0
1 1 1-5
2 2 1-5
3 5 1-5
4 6 6-10
5 10 6-10
6 11 11-16
7 15 11-16
8 16 11-16
9 91 87-91
10 92 92-96
11 96 92-96
12 97 ≥96
13 1000 ≥96
