Count unique value with prioritize value in pandas

November 21, 2023

I have a simple data frame as below:

import pandas as pd
import numpy as np
df = pd.DataFrame({'CUS_NO': ['900636229', '900636229', '900636080', '900636080', '900636052', '900636052', 
                              '900636053', '900636054', '900636055', '900636056'], 
                   'indicator': ['both', 'left_only', 'both', 'left_only', 'both', 'left_only', 
                                 'both', 'left_only', 'both', 'left_only'],
                   'Nationality': ['VN', 'VN', 'KR', 'KR', 'VN', 'VN', 
                                   'KR', 'VN', 'KR', 'VN']})

        CUS_NO      indicator   Nationality
0       900636229   both        VN
1       900636229   left_only   VN
2       900636080   both        KR
3       900636080   left_only   KR
4       900636052   both        VN
5       900636052   left_only   VN
6       900636053   both        KR
7       900636054   left_only   VN
8       900636055   both        KR
9       900636056   left_only   VN

I want to count unique value of CUS_NO so I used pd.Series.nunique by below code:

df2 = pd.pivot_table(df, values='CUS_NO', 
                     index='Nationality', 
                     columns='indicator', 
                     aggfunc=pd.Series.nunique, 
                     margins=True).reset_index()
df2

And here is the result:

indicator   Nationality both    left_only   All
0           KR          3       1           3
1           VN          2       4           4
2           All         5       5           7

But I my expectation is if CUS_NO was same and indicator was different, I just need to count both indicator. So below is my expected Output:

indicator   Nationality both    left_only   All
0           KR          3       0           3
1           VN          2       2           4
2           All         5       2           7

Thank you.

>Solution :

You can sort_values to have "both" on top (if more categories, use a Categorical to define a custom order), then drop_duplicates:

tmp = (df
   .sort_values(by='indicator')
   .drop_duplicates(subset=['CUS_NO', 'Nationality'], keep='first')
)

df2 = pd.pivot_table(tmp, values='CUS_NO', 
                     index='Nationality', 
                     columns='indicator', 
                     aggfunc=pd.Series.nunique, 
                     margins=True,
                     fill_value=0).reset_index()

Output:

indicator Nationality  both  left_only  All
0                  KR     3          0    3
1                  VN     2          2    4
2                 All     5          2    7

Intermediate tmp:

      CUS_NO  indicator Nationality
0  900636229       both          VN
2  900636080       both          KR
4  900636052       both          VN
6  900636053       both          KR
8  900636055       both          KR
7  900636054  left_only          VN
9  900636056  left_only          VN