Winsorizing on column with NaN does not change the max value

December 15, 2021

Please note that a similar question was asked a while back but never answered (see Winsorizing does not change the max value).

I am trying to winsorize a column in a dataframe using winsorize from scipy.stats.mstats. If there are no NaN values in the column then the process works correctly.

However, NaN values seem to prevent the process from working on the top (but not the bottom) of the distribution. Regardless of what value I set for nan_policy, the NaN values are set to the maximum value in the distribution. I feel like a must be setting the option incorrectly some how.

Below is an example that can be used to reproduce both correct winsorizing when there are no NaN values and the problem behavior I am experiencing when there NaN values are present. Any help on sorting this out would be appreciated.

#Import
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize

# initialise data of lists.
data = {'Name':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T'], 'Age':[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]}
 
# Create 2 DataFrames
df = pd.DataFrame(data)
df2 = pd.DataFrame(data)

# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan

# Winsorize Age in both DataFrames
winsorize(df['Age'], limits=[0.1, 0.1], inplace = True, nan_policy='omit')
winsorize(df2['Age'], limits=[0.1, 0.1], inplace = True, nan_policy='omit')

# Check min and max values of Age in both DataFrames
print('Max/min value of Age from dataframe without NaN values')
print(df['Age'].max())
print(df['Age'].min())

print()

print('Max/min value of Age from dataframe with NaN values')
print(df2['Age'].max())
print(df2['Age'].min())

>Solution :

It looks like the nan_policy is being ignored. But winsorization is just clipping, so you can handle this with pandas.

def winsorize_with_pandas(s, limits):
    """
    s : pd.Series
        Series to winsorize
    limits : tuple of float
        Tuple of the percentages to cut on each side of the array, 
        with respect to the number of unmasked data, as floats between 0. and 1
    """
    return s.clip(lower=s.quantile(limits[0], interpolation='lower'), 
                  upper=s.quantile(1-limits[1], interpolation='higher'))


winsorize_with_pandas(df['Age'], limits=(0.1, 0.1))
0      3.0
1      3.0
2      3.0
3      4.0
4      5.0
5      6.0
6      7.0
7      8.0
8      9.0
9     10.0
10    11.0
11    12.0
12    13.0
13    14.0
14    15.0
15    16.0
16    17.0
17    18.0
18    18.0
19    18.0
Name: Age, dtype: float64

winsorize_with_pandas(df2['Age'], limits=(0.1, 0.1))
0      2.0
1      2.0
2      3.0
3      4.0
4      5.0
5      NaN
6      7.0
7      8.0
8      NaN
9     10.0
10    11.0
11    12.0
12    13.0
13    14.0
14    15.0
15    16.0
16    17.0
17    18.0
18    19.0
19    19.0
Name: Age, dtype: float64