What does np.mean(data.isnull()) do exactly?

While building a data-cleaning project in Python, I found this code:

# let's see if there is any missing data

for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing,2)))

Which actually works fine, giving back the % of null values per column in the dataframe, but I’m a little confused on how it works:

First we loop over each column in the dataframe, then we take that mean, but the mean of what exactly? Is it the mean, for each column, of the number of null cells, or something else?

Just for reference, I’ve worked around it with this:

# fraction of null values per column, returned as a Series indexed by column name
NullValues = df.isnull().sum() / len(df)
print(NullValues.round(2))

That gives me basically the same results, but I'd like to understand the mechanism: I'm confused about the first block of code.

Solution:

It’s something that’s very intuitive once you’re used to it. The steps leading to this kind of code could be like the following:

  1. To get the percentage of null values, we need to count all null rows, and divide the count by the total number of rows.
  2. So, first we need to detect the null rows. This is easy, as there is a provided method: df[col].isnull().
  3. The result of df[col].isnull() is a new column consisting of booleans — True or False.
  4. Now we need to count the Trues. Here we can realize that counting Trues in a boolean array is the same as summing the array: True can be converted to 1, and False to zero.
  5. So we would be left with df[col].isnull().sum() / len(df[col]).
  6. But summing and dividing by the length is just the arithmetic mean! Therefore, we can shorten this to arrive at the final result: np.mean(df[col].isnull()). The mean of a boolean column is the fraction of its values that are True.
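The steps above can be checked on a small example. This is only a sketch: the dataframe and its values are invented for illustration, and the variable names (mask, fraction_by_hand, and so on) are not from the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, made up for illustration only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
})

col = "a"
mask = df[col].isnull()            # boolean Series: True where the cell is null

count_nulls = mask.sum()           # True counts as 1, False as 0
fraction_by_hand = count_nulls / len(df[col])
fraction_by_mean = np.mean(mask)   # sum divided by length, in one call

print(fraction_by_hand, fraction_by_mean)

# The same idea for every column at once, using the pandas mean method.
fraction_per_column = df.isnull().mean()
print(fraction_per_column)
```

Note that both approaches return a fraction between 0 and 1, so multiply by 100 if you want the number printed next to the % sign to be a true percentage.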