I have a simple data frame as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'CUS_NO': ['900636229', '900636229', '900636080', '900636080', '900636052', '900636052',
'900636053', '900636054', '900636055', '900636056'],
'indicator': ['both', 'left_only', 'both', 'left_only', 'both', 'left_only',
'both', 'left_only', 'both', 'left_only'],
'Nationality': ['VN', 'VN', 'KR', 'KR', 'VN', 'VN',
'KR', 'VN', 'KR', 'VN']})
CUS_NO indicator Nationality
0 900636229 both VN
1 900636229 left_only VN
2 900636080 both KR
3 900636080 left_only KR
4 900636052 both VN
5 900636052 left_only VN
6 900636053 both KR
7 900636054 left_only VN
8 900636055 both KR
9 900636056 left_only VN
I want to count unique value of CUS_NO
so I used pd.Series.nunique
by below code:
df2 = pd.pivot_table(df, values='CUS_NO',
index='Nationality',
columns='indicator',
aggfunc=pd.Series.nunique,
margins=True).reset_index()
df2
And here is the result:
indicator Nationality both left_only All
0 KR 3 1 3
1 VN 2 4 4
2 All 5 5 7
But I my expectation is if CUS_NO
was same and indicator was different, I just need to count both
indicator. So below is my expected Output:
indicator Nationality both left_only All
0 KR 3 0 3
1 VN 2 2 4
2 All 5 2 7
Thank you.
>Solution :
You can sort_values
to have "both" on top (if more categories, use a Categorical
to define a custom order), then drop_duplicates
:
tmp = (df
.sort_values(by='indicator')
.drop_duplicates(subset=['CUS_NO', 'Nationality'], keep='first')
)
df2 = pd.pivot_table(tmp, values='CUS_NO',
index='Nationality',
columns='indicator',
aggfunc=pd.Series.nunique,
margins=True,
fill_value=0).reset_index()
Output:
indicator Nationality both left_only All
0 KR 3 0 3
1 VN 2 2 4
2 All 5 2 7
Intermediate tmp
:
CUS_NO indicator Nationality
0 900636229 both VN
2 900636080 both KR
4 900636052 both VN
6 900636053 both KR
8 900636055 both KR
7 900636054 left_only VN
9 900636056 left_only VN