Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Masking a value in a column of a pandas dataframe and setting a flag in the same row (but different column)

The problem:

I would like to impute missing values and set a flag in another column that the value is imputed – if that value itself is not missing.

Masking would do a great job with replacing the missing values but I don’t know how to simultaneously set up also a flag in a different column.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Here is an example of a dataframe containing missing values and of a series that contains imputed values that can be used to replace the missing values – however the series itself contains some missing values.

df_orig = pd.DataFrame({'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'period': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
                   'volume': [1, 2, np.NaN, 4, 5, 6, 7, np.NaN, 9, np.NaN, 11, np.NaN],
                   'is_imputed': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]})

imputed_vol = pd.Series([np.NaN, np.NaN, 3, 4, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN])

In the above case only the missing value in the 3rd row in "volume" can be replaced by an imputed value. So the end result would be:

df_orig = pd.DataFrame({'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'period': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
                   'volume': [1, 2, 3, 4, 5, 6, 7, np.NaN, 9, np.NaN, 11, np.NaN],
                   'is_imputed': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]})

I can easily do the replacing by masking

df_orig.mask(df_orig.volume.isna(), imputed_vol, axis=0)

but how do I set up also the flag ‘is_imputed’?

>Solution :

Use fillna and boolean arithmetic:

# keep copy of original column
original = df_orig['volume']

# fill value
df_orig['volume'] = df_orig['volume'].fillna(imputed_vol)

# check if value has changed
df_orig['is_imputed'] = (df_orig['volume'].ne(original)
                         & imputed_vol.notna() # required as NaN != NaN
                         ).astype(int)

Output:

    cust_id  period  volume  is_imputed
0         1       1     1.0           0
1         1       2     2.0           0
2         1       3     3.0           1
3         1       4     4.0           0
4         1       5     5.0           0
5         1       6     6.0           0
6         2       1     7.0           0
7         2       2     NaN           0
8         2       3     9.0           0
9         2       4     NaN           0
10        2       5    11.0           0
11        2       6     NaN           0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading