pandas value_counts() method counting missing values inconsistently

July 12, 2024

Please consider this simple dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 10]}, index=range(5))

df:
    x
0   1
1   2
2   3
3   4
4  10

Some indices:

ff_idx = [1, 2]

sd_idx = [3, 4]

One way of creating a new column by filtering df based on the above indices:

df['ff_sd_indicator'] = np.nan
df['ff_sd_indicator'][df.index.isin(ff_idx)] = 'ff_count' 
df['ff_sd_indicator'][df.index.isin(sd_idx)] = 'sd_count'
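
The chained form df[col][mask] = value used above depends on pandas' copy semantics and may emit a warning or leave df unchanged in newer versions. A minimal sketch of the same assignment through a single .loc call follows; creating the column with object dtype first is my own assumption (so that strings and NaN can share the column without an upcast), not part of the original question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 10]}, index=range(5))
ff_idx = [1, 2]
sd_idx = [3, 4]

# Assumption: start from an object column holding a real NaN in every row
df['ff_sd_indicator'] = pd.Series(np.nan, index=df.index, dtype=object)

# One indexing step per assignment, so pandas writes into df itself
df.loc[df.index.isin(ff_idx), 'ff_sd_indicator'] = 'ff_count'
df.loc[df.index.isin(sd_idx), 'ff_sd_indicator'] = 'sd_count'

print(df)
```

Row 0 keeps a genuine missing value here, which is what value_counts() later drops.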

Another way of doing the same thing:

df['ff_sd_indicator2'] = np.select([df.index.isin(ff_idx), df.index.isin(sd_idx)], ['ff_count', 'sd_count'], default=np.nan)

Notice that while the values of ff_sd_indicator and ff_sd_indicator2 are naturally the same, the missing values are printed differently (NaN vs nan):

df: 

    x ff_sd_indicator ff_sd_indicator2
0   1             NaN              nan
1   2        ff_count         ff_count
2   3        ff_count         ff_count
3   4        sd_count         sd_count
4  10        sd_count         sd_count

I don’t mind the difference in how they print, but surprisingly the missing value does not show up in the output of:

df['ff_sd_indicator'].value_counts()

which is:

ff_sd_indicator
ff_count    2
sd_count    2

But they do show up in the output of:

df['ff_sd_indicator2'].value_counts()

which is:

ff_sd_indicator2
ff_count    2
sd_count    2
nan         1

So why does value_counts() skip the missing value in ff_sd_indicator but count it in ff_sd_indicator2, when both were created from the same np.nan?

Edit:
df.info() :

RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   x                 5 non-null      int64 
 1   ff_sd_indicator   5 non-null      object
 2   ff_sd_indicator2  5 non-null      object

Solution:

By default, value_counts() drops NaN. To include it in the counts, pass dropna=False:

df['ff_sd_indicator'].value_counts(dropna=False)

ff_sd_indicator
ff_count    2
sd_count    2
NaN         1
Name: count, dtype: int64
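
As a small self-contained check of the dropna behaviour, the sketch below builds a throwaway Series (the name s and its values are illustrative, not taken from the question) and compares the two calls:

```python
import numpy as np
import pandas as pd

# Illustrative data: three labels and one real missing value
s = pd.Series(['ff_count', 'ff_count', 'sd_count', np.nan], dtype=object)

default_counts = s.value_counts()           # NaN is excluded
full_counts = s.value_counts(dropna=False)  # NaN gets its own row

print(default_counts.sum())  # 3
print(full_counts.sum())     # 4
```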

If you check the output of:

np.select([df.index.isin(ff_idx) , df.index.isin(sd_idx)],
          ['ff_count','sd_count'], default=np.nan)

You will see however that you don’t have a NaN but a string:

array(['nan', 'ff_count', 'ff_count', 'sd_count', 'sd_count'],
      dtype='<U32')

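Because the choices are strings, NumPy stored the default as the text 'nan' in a string array. That is an ordinary value rather than a missing one, so value_counts() counts it like any other label.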
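
If the goal is a real missing value in the np.select version, one possible approach is sketched below. It is only an illustration: the empty-string sentinel and the intermediate names labels and indicator are my own choices. Keeping every argument a string also sidesteps mixing a float default with string choices, which, as far as I know, not every NumPy version accepts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 10]}, index=range(5))
ff_idx = [1, 2]
sd_idx = [3, 4]

# The text 'nan' is a regular string, so pandas does not treat it as missing
text_nan = pd.Series(['nan', 'ff_count'], dtype=object)
print(text_nan.isna().tolist())  # [False, False]

# Assumption: use '' as a sentinel default so np.select only sees strings
labels = np.select(
    [df.index.isin(ff_idx), df.index.isin(sd_idx)],
    ['ff_count', 'sd_count'],
    default='',
)

# Replace the sentinel with a real missing value on the pandas side
indicator = pd.Series(labels, index=df.index)
df['ff_sd_indicator2'] = indicator.mask(indicator == '')

counts = df['ff_sd_indicator2'].value_counts()
print(counts)
```

With a real missing value in row 0, both indicator columns behave the same way under value_counts(), with or without dropna=False.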

numpy

by MR

Published July 12, 2024
