How to create a Dummy Variable in Python if Missing Values are included?

How can I create a dummy variable when the data contains missing values? I have the following data and want to create a dummy variable based on several conditions. My problem is that the missing values are automatically converted to 0, but I want to keep them as missing values.

import pandas as pd
import numpy as np

mydata = {'x' : [10, 50, np.nan, 32, 47, np.nan, 20, 5, 100, 62], 
          'y' : [10, 1, 5,  np.nan, 47, np.nan, 8, 5, 100, 3]}
df = pd.DataFrame(mydata)

df["z"] = ((df["x"] >= 50) & (df["y"] <= 20)).astype(int)

print(df)

Solution:

When you build the boolean mask, you are comparing numbers with NaN. A comparison such as >= or <= against NaN always evaluates to False, so for every row where df['x'] is np.nan the mask df['x'] >= 50 is False, and it becomes 0 when you convert it to an integer. To keep the missing values, create a second boolean mask that is True for all rows that contain np.nan in any of the columns ['x', 'y'], and then assign np.nan to z in those rows.
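To see why the missing values end up as 0, here is a minimal sketch of how NaN behaves in a comparison. The small Series `x` and the name `mask` are illustrative only and are not part of the question's data.

```python
import numpy as np
import pandas as pd

# Illustrative three-row Series; the last value is missing.
x = pd.Series([10, 50, np.nan])

# The comparison against NaN yields False, not NaN.
mask = x >= 50
print(mask.tolist())              # [False, True, False]

# Casting the boolean mask to int therefore turns the NaN row into 0.
print(mask.astype(int).tolist())  # [0, 1, 0]
```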

Code:

import pandas as pd
import numpy as np

mydata = {'x' : [10, 50, np.nan, 32, 47, np.nan, 20, 5, 100, 62], 
          'y' : [10, 1, 5,  np.nan, 47, np.nan, 8, 5, 100, 3]}
df = pd.DataFrame(mydata)

df["z"] = ((df["x"] >= 50) & (df["y"] <= 20)).astype(float)  # float, so that NaN can be stored in the column
df.loc[df[["x", "y"]].isna().any(axis=1), "z"] = np.nan

Output:

    x       y       z
0   10.0    10.0    0.0
1   50.0    1.0     1.0
2   NaN     5.0     NaN
3   32.0    NaN     NaN
4   47.0    47.0    0.0
5   NaN     NaN     NaN
6   20.0    8.0     0.0
7   5.0     5.0     0.0
8   100.0   100.0   0.0
9   62.0    3.0     1.0

Alternatively, if you want a one-liner, you could use nested np.where calls:

df["z"] = np.where(
    # check only the input columns, so other columns in df cannot affect the result
    df[["x", "y"]].isnull().any(axis=1),
    np.nan,
    np.where((df["x"] >= 50) & (df["y"] <= 20), 1, 0),
)
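Another option, which is not part of the answer above and is shown only as a sketch, is to use `Series.where` so that the dummy is kept only for rows where both inputs are present. The column name `z_int` is made up for illustration; it shows one way to get integer 0/1 values with `<NA>` by casting to pandas' nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

mydata = {'x': [10, 50, np.nan, 32, 47, np.nan, 20, 5, 100, 62],
          'y': [10, 1, 5, np.nan, 47, np.nan, 8, 5, 100, 3]}
df = pd.DataFrame(mydata)

# True only for rows where both x and y are present.
complete = df[["x", "y"]].notna().all(axis=1)

# Build the 0/1 dummy as float, then keep it only for complete rows;
# Series.where replaces every other row with NaN.
dummy = ((df["x"] >= 50) & (df["y"] <= 20)).astype(float)
df["z"] = dummy.where(complete)

# Optional (illustrative): nullable integer column, with <NA> instead of NaN.
df["z_int"] = df["z"].astype("Int64")

print(df)
```

The `z` column should match the output shown earlier, and `z_int` holds the same information as integers.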