Please note that a similar question was asked a while back but never answered (see Winsorizing does not change the max value).
I am trying to winsorize a column in a dataframe using winsorize from scipy.stats.mstats. If there are no NaN values in the column then the process works correctly.
However, NaN values seem to prevent the process from working on the top (but not the bottom) of the distribution. Regardless of what value I set for nan_policy, the NaN values are set to the maximum value in the distribution. I feel like I must be setting the option incorrectly somehow.
Below is an example that reproduces both the correct winsorizing when there are no NaN values and the problem behavior I am experiencing when NaN values are present. Any help on sorting this out would be appreciated.
#Import
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize
# initialise data of lists.
data = {'Name':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T'], 'Age':[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]}
# Create 2 DataFrames
df = pd.DataFrame(data)
df2 = pd.DataFrame(data)
# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan
# Winsorize Age in both DataFrames
winsorize(df['Age'], limits=[0.1, 0.1], inplace = True, nan_policy='omit')
winsorize(df2['Age'], limits=[0.1, 0.1], inplace = True, nan_policy='omit')
# Check min and max values of Age in both DataFrames
print('Max/min value of Age from dataframe without NaN values')
print(df['Age'].max())
print(df['Age'].min())
print()
print('Max/min value of Age from dataframe with NaN values')
print(df2['Age'].max())
print(df2['Age'].min())
>Solution :
It looks like the nan_policy is being ignored. But winsorization is just clipping, so you can handle this with pandas.
def winsorize_with_pandas(s, limits):
    """
    s : pd.Series
        Series to winsorize
    limits : tuple of float
        Tuple of the percentages to cut on each side of the array,
        with respect to the number of unmasked data, as floats between 0. and 1
    """
    return s.clip(lower=s.quantile(limits[0], interpolation='lower'),
                  upper=s.quantile(1-limits[1], interpolation='higher'))
winsorize_with_pandas(df['Age'], limits=(0.1, 0.1))
0      3.0
1      3.0
2      3.0
3      4.0
4      5.0
5      6.0
6      7.0
7      8.0
8      9.0
9     10.0
10    11.0
11    12.0
12    13.0
13    14.0
14    15.0
15    16.0
16    17.0
17    18.0
18    18.0
19    18.0
Name: Age, dtype: float64
winsorize_with_pandas(df2['Age'], limits=(0.1, 0.1))
0      2.0
1      2.0
2      3.0
3      4.0
4      5.0
5      NaN
6      7.0
7      8.0
8      NaN
9     10.0
10    11.0
11    12.0
12    13.0
13    14.0
14    15.0
15    16.0
16    17.0
17    18.0
18    19.0
19    19.0
Name: Age, dtype: float64
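On small samples, quantile interpolation and a count-based rule may pick different clipping bounds. Below is a minimal sketch of the count-based rule, assuming that k = int(n * limit) of the n non-NaN values should be replaced on each side. The name winsorize_by_count is hypothetical (it is not part of pandas or scipy), and the example builds fresh Series rather than reusing df and df2.

```python
import numpy as np
import pandas as pd


def winsorize_by_count(s, limits):
    """Hypothetical helper: clip the k lowest and k highest non-NaN values.

    Assumes k = int(n * limit) on each side, where n counts only the
    non-NaN entries. NaN entries are left untouched.
    """
    values = np.sort(s.dropna().to_numpy())
    n = len(values)
    if n == 0:
        return s.copy()
    k_low = int(n * limits[0])
    k_high = int(n * limits[1])
    if k_low + k_high >= n:
        raise ValueError("limits remove every value")
    lower = values[k_low]            # smallest value that is kept as-is
    upper = values[n - k_high - 1]   # largest value that is kept as-is
    return s.clip(lower=lower, upper=upper)


# Fresh data: 1.0 .. 20.0, plus a copy with two NaN values
ages = pd.Series(np.arange(1.0, 21.0), name='Age')
ages_nan = ages.copy()
ages_nan.loc[[5, 8]] = np.nan

clean = winsorize_by_count(ages, (0.1, 0.1))
with_nan = winsorize_by_count(ages_nan, (0.1, 0.1))

print(clean.min(), clean.max())
print(with_nan.min(), with_nan.max(), with_nan.isna().sum())
```

Because the bounds are taken from the sorted non-NaN values and Series.clip leaves NaN in place, the missing entries neither shift the cut points nor get overwritten.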