According to the reference page for pandas.DataFrame.fillna, NA/NaN values are filled using the specified method.
However, it does not seem to work when the missing values are pd.NA.
In the following code block, I try to replace missing booleans (marked with pd.NA) with the column's mode, but the missing value is left unfilled:
import pandas as pd
import numpy as np
# create dataframe
df = pd.DataFrame({"a": [True, pd.NA, False, True], "b": [0, np.nan, 2, 3]})
# convert types (a becomes boolean, b becomes Int64)
df = df.convert_dtypes()
# get boolean columns
bool_cols = df.select_dtypes(include=bool).columns.tolist()
# get most frequent values
most_frequent_values = df[bool_cols].mode()
# replace missing content with column's mode
df[bool_cols] = df[bool_cols].fillna(most_frequent_values)
# print
print(df)
This is the current output:
id | a | b |
---|---|---|
0 | True | 0 |
1 | | |
2 | False | 2 |
3 | True | 3 |
And this is the expected output:
id | a | b |
---|---|---|
0 | True | 0 |
1 | True | |
2 | False | 2 |
3 | True | 3 |
What am I missing? Should I convert all pd.NA values into NaN?
Side note: I am using pandas 1.5.2.
Solution:
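The issue is not pd.NA, so no conversion to NaN is needed. The issue is that DataFrame.mode doesn't return a single value per column but a 2D output (a DataFrame), because a column can have more than one mode.
When fillna receives a DataFrame, it aligns on both index and columns. The mode output only has a row labelled 0, so the missing value in row 1 has nothing to be filled with.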
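To see the alignment problem in isolation, here is a minimal sketch. The variable names modes and unfilled are mine and not part of the question; only column a from the question is used.

```python
import pandas as pd

# Same boolean column as in the question, converted to the nullable "boolean" dtype
df = pd.DataFrame({"a": [True, pd.NA, False, True]}).convert_dtypes()

# DataFrame.mode returns a DataFrame with one row per mode, not a scalar
modes = df[["a"]].mode()
print(type(modes).__name__, modes.shape)  # DataFrame (1, 1)
print(modes.index.tolist())               # [0]

# fillna with a DataFrame aligns on row labels: there is no row 1 in modes,
# so the missing value in row 1 stays missing
unfilled = df[["a"]].fillna(modes)
print(unfilled["a"].isna().tolist())      # [False, True, False, False]
```

The only row label in modes is 0, and row 0 of the original column is not missing, so nothing gets filled.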
You need to take the first row of the mode output, which gives one value per column:
most_frequent_values = df[bool_cols].mode().loc[0] # take the first mode
# then fillna
df[bool_cols] = df[bool_cols].fillna(most_frequent_values)
Then the output is correct:
a b
0 True 0
1 True <NA>
2 False 2
3 True 3
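If you need this in several places, the same idea can be wrapped in a small function. This is only a sketch under my own assumptions: fill_bool_with_mode is a hypothetical helper name, it detects boolean columns with pd.api.types.is_bool_dtype instead of select_dtypes, and it skips a column when mode() returns nothing.

```python
import numpy as np
import pandas as pd


def fill_bool_with_mode(df):
    """Hypothetical helper: fill missing values in boolean columns
    with the first mode of each column, leaving other columns untouched."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_bool_dtype(out[col]):
            continue
        modes = out[col].mode()  # Series of modes, NA dropped by default
        if not modes.empty:      # skip columns that have no mode at all
            out[col] = out[col].fillna(modes.iloc[0])
    return out


df = pd.DataFrame({"a": [True, pd.NA, False, True], "b": [0, np.nan, 2, 3]})
df = df.convert_dtypes()
result = fill_bool_with_mode(df)
print(result)
```

Working column by column with Series.mode and a scalar fill value avoids the index alignment issue entirely, at the cost of a Python-level loop over the columns.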