While working on a data cleaning project in Python, I found this code:
# let's see if there is any missing data
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing, 2)))
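For context, here is a self-contained sketch of that loop on a small made-up DataFrame (the column names and values are purely an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing value in 'name' and 'age', none in 'city'
df = pd.DataFrame({
    'name': ['Ann', 'Bob', None, 'Dan'],  # 1 of 4 missing -> 0.25
    'age': [25, None, 31, 40],            # 1 of 4 missing -> 0.25
    'city': ['NY', 'LA', 'SF', 'DC'],     # 0 of 4 missing -> 0.0
})

for col in df.columns:
    # isnull() gives a boolean Series; its mean is the fraction of missing rows
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing, 2)))
```

Note that the printed value is a fraction (e.g. 0.25), not a true percentage, despite the % sign in the format string; multiply by 100 if you want 25.0%.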
This actually works fine, giving back the percentage of null values per column in the dataframe, but I'm a little confused about how it works:
First we loop over each column in the dataframe, then we take a mean, but the mean of what, exactly? The mean, for each column, of the number of null cells, or something else?
Just for reference, I’ve worked around it with this:
NullValues = df.isnull().sum() / len(df)
print('{} - {}%'.format(col, round(NullValues, 2)))
That gives me back basically the same results, but I wrote it just to understand the mechanism; it's the first block of code that confuses me.
> Solution:
It's something that's very intuitive once you're used to it. The steps leading to this kind of code could be as follows:
- To get the percentage of null values, we need to count the null rows and divide that count by the total number of rows.
- So, first we need to detect the null rows. This is easy, as there is a provided method: df[col].isnull().
- The result of df[col].isnull() is a new column consisting of booleans, True or False.
- Now we need to count the Trues. Here we can realize that counting Trues in a boolean array is the same as summing the array: True can be converted to 1, and False to 0.
- So we would be left with df[col].isnull().sum() / len(df[col]).
- But summing and dividing by the length is just the arithmetic mean! Therefore, we can shorten this to arrive at the final result: np.mean(df[col].isnull()).
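
The steps above can be checked concretely; here is a minimal sketch (the sample Series is made up) showing that the sum-divided-by-length form and the mean form give the same number:

```python
import numpy as np
import pandas as pd

# Hypothetical column: 2 of 4 values are missing
s = pd.Series([1.0, None, 3.0, None])

mask = s.isnull()             # boolean Series: [False, True, False, True]
manual = mask.sum() / len(s)  # True counts as 1, False as 0 -> 2 / 4 = 0.5
shorthand = np.mean(mask)     # arithmetic mean of the booleans -> 0.5

print(manual, shorthand)  # 0.5 0.5
```

The same equivalence is why df.isnull().mean() gives the per-column fractions for a whole DataFrame in one call.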