The problem:
I would like to impute missing values and set a flag in another column that the value is imputed – if that value itself is not missing.
Masking would do a great job with replacing the missing values but I don’t know how to simultaneously set up also a flag in a different column.
Here is an example of a dataframe containing missing values and of a series that contains imputed values that can be used to replace the missing values – however the series itself contains some missing values.
df_orig = pd.DataFrame({'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'period': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
'volume': [1, 2, np.NaN, 4, 5, 6, 7, np.NaN, 9, np.NaN, 11, np.NaN],
'is_imputed': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
imputed_vol = pd.Series([np.NaN, np.NaN, 3, 4, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN])
In the above case only the missing value in the 3rd row in "volume" can be replaced by an imputed value. So the end result would be:
df_orig = pd.DataFrame({'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'period': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
'volume': [1, 2, 3, 4, 5, 6, 7, np.NaN, 9, np.NaN, 11, np.NaN],
'is_imputed': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
I can easily do the replacing by masking
df_orig.mask(df_orig.volume.isna(), imputed_vol, axis=0)
but how do I set up also the flag ‘is_imputed’?
>Solution :
Use fillna and boolean arithmetic:
# keep copy of original column
original = df_orig['volume']
# fill value
df_orig['volume'] = df_orig['volume'].fillna(imputed_vol)
# check if value has changed
df_orig['is_imputed'] = (df_orig['volume'].ne(original)
& imputed_vol.notna() # required as NaN != NaN
).astype(int)
Output:
cust_id period volume is_imputed
0 1 1 1.0 0
1 1 2 2.0 0
2 1 3 3.0 1
3 1 4 4.0 0
4 1 5 5.0 0
5 1 6 6.0 0
6 2 1 7.0 0
7 2 2 NaN 0
8 2 3 9.0 0
9 2 4 NaN 0
10 2 5 11.0 0
11 2 6 NaN 0