How to create a dummy variable if missing values are included? I have the following data and I want to create a dummy variable based on several conditions. My problem is that the conversion to integer automatically turns my missing values into 0, but I want to keep them as missing values.

```
import pandas as pd
import numpy as np

mydata = {'x' : [10, 50, np.nan, 32, 47, np.nan, 20, 5, 100, 62],
          'y' : [10, 1, 5, np.nan, 47, np.nan, 8, 5, 100, 3]}
df = pd.DataFrame(mydata)
df["z"] = ((df["x"] >= 50) & (df["y"] <= 20)).astype(int)
print(df)
```

### Solution:

When you create your boolean mask, you are comparing numbers with `NaN`. An ordering comparison such as `>=` against `np.nan` always evaluates to `False`, so in rows where `df['x']` is `np.nan` the mask `df['x'] >= 50` is `False`, which becomes `0` when you convert it to an integer. To keep the missing values, create a second boolean mask that is `True` for every row containing `np.nan` in the columns `['x', 'y']`, and then assign `np.nan` to `z` in those rows.
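To see the effect in isolation, here is a minimal sketch on a three-element Series (the values are picked for illustration, and the names `s`, `mask` and `as_int` are not part of the original question):

```python
import numpy as np
import pandas as pd

# Illustrative values only: one below the threshold, one missing, one above
s = pd.Series([10, np.nan, 62])

mask = s >= 50             # the NaN row compares as False
as_int = mask.astype(int)  # False becomes 0, so the missing row looks like a real 0

print(mask.tolist())    # [False, False, True]
print(as_int.tolist())  # [0, 0, 1]
```

After the cast, the missing row can no longer be told apart from a row that simply failed the condition, which is why the missing rows have to be restored explicitly.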

Code:

```
import pandas as pd
import numpy as np

mydata = {'x' : [10, 50, np.nan, 32, 47, np.nan, 20, 5, 100, 62],
          'y' : [10, 1, 5, np.nan, 47, np.nan, 8, 5, 100, 3]}
df = pd.DataFrame(mydata)
# Build the dummy as float so the column can hold NaN
df["z"] = ((df["x"] >= 50) & (df["y"] <= 20)).astype(float)
# Overwrite the rows where x or y is missing
df.loc[df[["x", "y"]].isna().any(axis=1), "z"] = np.nan
print(df)
```

Output:

```
       x      y    z
0   10.0   10.0  0.0
1   50.0    1.0  1.0
2    NaN    5.0  NaN
3   32.0    NaN  NaN
4   47.0   47.0  0.0
5    NaN    NaN  NaN
6   20.0    8.0  0.0
7    5.0    5.0  0.0
8  100.0  100.0  0.0
9   62.0    3.0  1.0
```

Alternatively, if you want a one-liner, you could use nested `np.where` calls:

```
df["z"] = np.where(
    # Only check x and y, so an existing z column cannot affect the result
    df[["x", "y"]].isnull().any(axis=1),
    np.nan,
    np.where((df["x"] >= 50) & (df["y"] <= 20), 1, 0),
)
```
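If you would rather keep `z` as an integer column instead of floats, one possible variant (not part of the answer above) is pandas' nullable `"Int64"` dtype, which stores missing entries as `<NA>`. The sketch below assumes a pandas version that supports nullable integer dtypes; the names `cond` and `missing` are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [10, 50, np.nan, 32, 47, np.nan, 20, 5, 100, 62],
    "y": [10, 1, 5, np.nan, 47, np.nan, 8, 5, 100, 3],
})

cond = (df["x"] >= 50) & (df["y"] <= 20)    # False wherever x or y is NaN
missing = df[["x", "y"]].isna().any(axis=1)  # True for rows with a missing input

# Cast to the nullable integer dtype, then blank out the rows with missing inputs
df["z"] = cond.astype("Int64").mask(missing)
print(df)
```

With this approach `z` prints as `0`, `1` and `<NA>` rather than `0.0`, `1.0` and `NaN`. Whether that is preferable depends on what later code expects from the column.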