While working on a data cleaning project in Python, I found this code:
# let's see if there is any missing data
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing, 2)))
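For context, here is a self-contained sketch of that loop on a small made-up DataFrame (the column names and values are purely an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing value in 'name' and 'age', none in 'city'
df = pd.DataFrame({
    'name': ['Ann', 'Bob', None, 'Dan'],  # 1 of 4 missing -> 0.25
    'age': [25, None, 31, 40],            # 1 of 4 missing -> 0.25
    'city': ['NY', 'LA', 'SF', 'DC'],     # 0 of 4 missing -> 0.0
})

for col in df.columns:
    # isnull() gives a boolean Series; its mean is the fraction of missing rows
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing, 2)))
```

Note that the printed value is a fraction (e.g. 0.25), not a true percentage, despite the % sign in the format string; multiply by 100 if you want 25.0%.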
This actually works fine, giving back the percentage of null values per column in the dataframe, but I'm a little confused about how it works:
First we loop over each column in the dataframe, then we take a mean, but the mean of what, exactly? The mean, for each column, of the number of null cells, or something else?
Just for reference, I’ve worked around it with this:
NullValues = df.isnull().sum() / len(df)
print('{} - {}%'.format(col, round(NullValues, 2)))
That gives me back basically the same results, but I wrote it just to understand the mechanism; it's the first block of code that confuses me.
> Solution:
It's something that's very intuitive once you're used to it. The steps leading to this kind of code could be as follows:
- To get the percentage of null values, we need to count the null rows and divide that count by the total number of rows.
- So, first we need to detect the null rows. This is easy, as there is a provided method: df[col].isnull().
- The result of df[col].isnull() is a new column consisting of booleans, True or False.
- Now we need to count the Trues. Here we can realize that counting Trues in a boolean array is the same as summing the array: True can be converted to 1, and False to 0.
- So we would be left with df[col].isnull().sum() / len(df[col]).
- But summing and dividing by the length is just the arithmetic mean! Therefore, we can shorten this to arrive at the final result: np.mean(df[col].isnull()).
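
The steps above can be checked concretely; here is a minimal sketch (the sample Series is made up) showing that the sum-divided-by-length form and the mean form give the same number:

```python
import numpy as np
import pandas as pd

# Hypothetical column: 2 of 4 values are missing
s = pd.Series([1.0, None, 3.0, None])

mask = s.isnull()             # boolean Series: [False, True, False, True]
manual = mask.sum() / len(s)  # True counts as 1, False as 0 -> 2 / 4 = 0.5
shorthand = np.mean(mask)     # arithmetic mean of the booleans -> 0.5

print(manual, shorthand)  # 0.5 0.5
```

The same equivalence is why df.isnull().mean() gives the per-column fractions for a whole DataFrame in one call.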