Please consider this simple dataframe:
df = pd.DataFrame({'x': [1, 2, 3, 4, 10]}, index=range(5))
df:
    x
0   1
1   2
2   3
3   4
4  10
Some indices:
ff_idx = [1, 2]
sd_idx = [3, 4]
One way of creating a new column by filtering df based on the above indices:
df['ff_sd_indicator'] = np.nan
df.loc[df.index.isin(ff_idx), 'ff_sd_indicator'] = 'ff_count'
df.loc[df.index.isin(sd_idx), 'ff_sd_indicator'] = 'sd_count'
Another way of doing the same thing:
df['ff_sd_indicator2'] = np.select([df.index.isin(ff_idx), df.index.isin(sd_idx)], ['ff_count', 'sd_count'], default=np.nan)
Notice that although ff_sd_indicator and ff_sd_indicator2 hold the same labels, the missing values are printed differently (NaN vs nan):
df:
    x  ff_sd_indicator  ff_sd_indicator2
0   1              NaN               nan
1   2         ff_count          ff_count
2   3         ff_count          ff_count
3   4         sd_count          sd_count
4  10         sd_count          sd_count
I don’t mind the different display, but surprisingly the missing values do not show up in the output of:
df['ff_sd_indicator'].value_counts()
which is:
ff_sd_indicator
ff_count 2
sd_count 2
But they do show up in the output of:
df['ff_sd_indicator2'].value_counts()
which is:
ff_sd_indicator2
ff_count 2
sd_count 2
nan 1
So why does value_counts() not count the missing values in ff_sd_indicator, even though they were created from the same np.nan as the missing values in ff_sd_indicator2?
Edit:
df.info() :
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   x                 5 non-null      int64
 1   ff_sd_indicator   5 non-null      object
 2   ff_sd_indicator2  5 non-null      object
>Solution :
By default, value_counts drops NaN values. Pass dropna=False to keep them:
df['ff_sd_indicator'].value_counts(dropna=False)
ff_sd_indicator
ff_count 2
sd_count 2
NaN 1
Name: count, dtype: int64
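As a quick check of the dropna behaviour, here is a minimal self-contained sketch. The series s is only an illustrative stand-in for df['ff_sd_indicator'], assuming an object column that holds one real np.nan:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['ff_sd_indicator']: one real missing value.
s = pd.Series([np.nan, 'ff_count', 'ff_count', 'sd_count', 'sd_count'],
              dtype=object)

# Default behaviour: the missing value is left out of the counts.
counts_default = s.value_counts()

# With dropna=False the missing value gets its own row.
counts_with_nan = s.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```

Comparing the lengths of the two results (or their sums against len(s)) is a simple way to spot missing values that were dropped silently.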
If you check the output of:
np.select([df.index.isin(ff_idx) , df.index.isin(sd_idx)],
['ff_count','sd_count'], default=np.nan)
You will see, however, that the result does not contain a real NaN but the string 'nan':
array(['nan', 'ff_count', 'ff_count', 'sd_count', 'sd_count'],
dtype='<U32')
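A NumPy array has a single dtype, so np.select converted the np.nan default to text to match the string choices. The string 'nan' is an ordinary value rather than a missing one, so value_counts does not drop it.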
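If you want np.select to produce a real missing value, one possible approach is sketched below. It assumes that passing default=None makes NumPy fall back to an object array, so the placeholder is not converted to text. Please verify that on your NumPy version. The names text_nan, values and indicator are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 10]}, index=range(5))
ff_idx = [1, 2]
sd_idx = [3, 4]

# The string 'nan' is ordinary text, so isna() does not flag it.
text_nan = pd.Series(['nan', 'ff_count', 'ff_count', 'sd_count', 'sd_count'],
                     dtype=object)
n_text_missing = int(text_nan.isna().sum())

# Assumption: default=None yields an object array, so the placeholder
# stays a real missing value instead of becoming the string 'nan'.
values = np.select(
    [df.index.isin(ff_idx), df.index.isin(sd_idx)],
    ['ff_count', 'sd_count'],
    default=None,
)
indicator = pd.Series(values, index=df.index, dtype=object)
n_real_missing = int(indicator.isna().sum())

# The placeholder is now treated as missing and dropped by default.
counts = indicator.value_counts()
print(n_text_missing, n_real_missing)
print(counts)
```

Checking isna().sum() on a column is a quick way to tell a real missing value from the text 'nan' before relying on value_counts.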