
pandas value_counts() method counts missing values inconsistently

Please consider this simple dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 10]}, index=range(5))

df:
    x
0   1
1   2
2   3
3   4
4   10

Some indices:

ff_idx = [1, 2]

sd_idx = [3, 4]

One way of creating a new column by filtering df based on the above indices:

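df['ff_sd_indicator'] = np.nan
# Use .loc with a boolean mask; chained indexing (df[col][mask] = ...) may not write back to df
df.loc[df.index.isin(ff_idx), 'ff_sd_indicator'] = 'ff_count'
df.loc[df.index.isin(sd_idx), 'ff_sd_indicator'] = 'sd_count'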

Another way of doing the same thing:

df['ff_sd_indicator2'] = np.select([df.index.isin(ff_idx), df.index.isin(sd_idx)], ['ff_count', 'sd_count'], default=np.nan)

Notice that while the non-missing values of ff_sd_indicator and ff_sd_indicator2 are the same, as expected, the missing values are printed differently (NaN vs nan):

df: 

    x   ff_sd_indicator  ff_sd_indicator2
0   1   NaN              nan
1   2   ff_count         ff_count
2   3   ff_count         ff_count
3   4   sd_count         sd_count
4   10  sd_count         sd_count

The difference in how they print does not matter to me, but surprisingly the missing values do not show up in the output of:

df['ff_sd_indicator'].value_counts()

which is:

ff_sd_indicator
ff_count    2
sd_count    2

But they do show up in the output of:

df['ff_sd_indicator2'].value_counts()

which is:

ff_sd_indicator2
ff_count    2
sd_count    2
nan         1

So why does value_counts() skip the missing values in ff_sd_indicator but count them in ff_sd_indicator2, when both were created from the same np.nan?

Edit: output of df.info():

RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   x                 5 non-null      int64 
 1   ff_sd_indicator   5 non-null      object
 2   ff_sd_indicator2  5 non-null      object

Solution:

By default, value_counts() drops NaN values; pass dropna=False to keep them:

df['ff_sd_indicator'].value_counts(dropna=False)

ff_sd_indicator
ff_count    2
sd_count    2
NaN         1
Name: count, dtype: int64
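
A minimal, self-contained sketch of the dropna behaviour described above. The Series is built by hand to mirror ff_sd_indicator, and the variable names (indicator, default_counts, full_counts) are illustrative rather than taken from the original post:

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for ff_sd_indicator: one genuine NaN plus four labels.
indicator = pd.Series(
    [np.nan, "ff_count", "ff_count", "sd_count", "sd_count"],
    dtype=object,
    name="ff_sd_indicator",
)

default_counts = indicator.value_counts()            # NaN is left out
full_counts = indicator.value_counts(dropna=False)   # NaN gets its own row

print(default_counts.sum())  # number of non-missing entries
print(full_counts.sum())     # total number of entries, missing included
```

Comparing the two sums is a quick way to see how many entries the default call silently leaves out.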

If you check the output of:

np.select([df.index.isin(ff_idx), df.index.isin(sd_idx)],
          ['ff_count', 'sd_count'], default=np.nan)

You will see however that you don’t have a NaN but a string:

array(['nan', 'ff_count', 'ff_count', 'sd_count', 'sd_count'],
      dtype='<U32')

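The result array has a single string dtype, so np.nan was converted to the string 'nan'. To value_counts() that is an ordinary value, so it is not dropped automatically.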
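
One possible way to confirm this and repair the column, sketched under assumptions: the Series below is a hand-built stand-in for ff_sd_indicator2, the variable names are hypothetical, and Series.mask is just one of several ways to turn the literal string back into a real missing value.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for ff_sd_indicator2, holding the literal string 'nan'.
indicator2 = pd.Series(
    ["nan", "ff_count", "ff_count", "sd_count", "sd_count"],
    dtype=object,
    name="ff_sd_indicator2",
)

# The string 'nan' is ordinary data, so pandas does not flag it as missing.
n_missing_before = int(indicator2.isna().sum())

# Replace the literal string with a real missing value (NaN).
repaired = indicator2.mask(indicator2 == "nan")
n_missing_after = int(repaired.isna().sum())

# With a genuine NaN in place, the default dropna=True skips it again.
counts_after = repaired.value_counts()

print(n_missing_before, n_missing_after)
print(counts_after)
```

Note that this would also blank out a genuine label spelled 'nan', so it only suits data where that string cannot be a real value.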
