Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Count unique value with prioritize value in pandas

I have a simple data frame as below:

import pandas as pd
import numpy as np
df = pd.DataFrame({'CUS_NO': ['900636229', '900636229', '900636080', '900636080', '900636052', '900636052', 
                              '900636053', '900636054', '900636055', '900636056'], 
                   'indicator': ['both', 'left_only', 'both', 'left_only', 'both', 'left_only', 
                                 'both', 'left_only', 'both', 'left_only'],
                   'Nationality': ['VN', 'VN', 'KR', 'KR', 'VN', 'VN', 
                                   'KR', 'VN', 'KR', 'VN']})

        CUS_NO      indicator   Nationality
0       900636229   both        VN
1       900636229   left_only   VN
2       900636080   both        KR
3       900636080   left_only   KR
4       900636052   both        VN
5       900636052   left_only   VN
6       900636053   both        KR
7       900636054   left_only   VN
8       900636055   both        KR
9       900636056   left_only   VN

I want to count unique value of CUS_NO so I used pd.Series.nunique by below code:

df2 = pd.pivot_table(df, values='CUS_NO', 
                     index='Nationality', 
                     columns='indicator', 
                     aggfunc=pd.Series.nunique, 
                     margins=True).reset_index()
df2

And here is the result:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

indicator   Nationality both    left_only   All
0           KR          3       1           3
1           VN          2       4           4
2           All         5       5           7

But I my expectation is if CUS_NO was same and indicator was different, I just need to count both indicator. So below is my expected Output:

indicator   Nationality both    left_only   All
0           KR          3       0           3
1           VN          2       2           4
2           All         5       2           7

Thank you.

>Solution :

You can sort_values to have "both" on top (if more categories, use a Categorical to define a custom order), then drop_duplicates:

tmp = (df
   .sort_values(by='indicator')
   .drop_duplicates(subset=['CUS_NO', 'Nationality'], keep='first')
)

df2 = pd.pivot_table(tmp, values='CUS_NO', 
                     index='Nationality', 
                     columns='indicator', 
                     aggfunc=pd.Series.nunique, 
                     margins=True,
                     fill_value=0).reset_index()

Output:

indicator Nationality  both  left_only  All
0                  KR     3          0    3
1                  VN     2          2    4
2                 All     5          2    7

Intermediate tmp:

      CUS_NO  indicator Nationality
0  900636229       both          VN
2  900636080       both          KR
4  900636052       both          VN
6  900636053       both          KR
8  900636055       both          KR
7  900636054  left_only          VN
9  900636056  left_only          VN
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading