How to flag/count missing values in RStudio on a row by row basis?

November 10, 2021

Taking the airquality dataset from the MASS library as an example:

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

I want to create three columns:

Missing_Ozone, Missing_Total and Missing_Percent, such that:

Missing_Ozone = 1 if there is a missing value in the Ozone column in the current row.

Missing_total = total count of missing values in the current row

Missing_Percent = percentage of missing values in a row.

So for example, in row 1:

Missing_Ozone = 0, Missing_total = 0, Missing_percent = 0

In row 5:

Missing_Ozone = 1, Missing_total = 2, Missing percent = 100*(2/6)

In row 6:

Missing_Ozone = 0, Missing_total = 1, Missing percent = 100*(1/6)

I tried two approaches, without any luck:

The first was to iterate over each row and use an if statement:

library(MASS)
df_test = airquality

df_test$Missing_Ozone <- 0
for(i in 1:nrow(df_test)){
  if (is.na(df_test$Ozone)) {
    df_test$Missing_Ozone <- 1
  } 
}

The second was to just use the if-statement inside that for-loop.

Neither work, and I just get:

> df_test
    Ozone Solar.R Wind Temp Month Day Missing_Ozone
1      41     190  7.4   67     5   1             0
2      36     118  8.0   72     5   2             0
3      12     149 12.6   74     5   3             0
4      18     313 11.5   62     5   4             0
5      NA      NA 14.3   56     5   5             0

Any help is appreciated.

Edit: Also, does this type of data manipulation have a certain name? I found it hard to search online for a guide that goes through this type of data manipulation.

>Solution :

A solution using the dplyr package. rowwise and c_cross allow us to do calculation by each row.

library(dplyr)

dat <- airquality %>%
  mutate(Missing_Ozone = as.integer(is.na(Ozone))) %>%
  rowwise() %>%
  mutate(Missing_Total = sum(is.na((c_across(-Missing_Ozone))))) %>%
  mutate(Missing_Percent = Missing_Total/ncol(airquality)) %>%
  ungroup()
dat
# # A tibble: 153 x 9
#    Ozone Solar.R  Wind  Temp Month   Day Missing_Ozone Missing_Total Missing_Percent
#    <int>   <int> <dbl> <int> <int> <int>         <int>         <int>           <dbl>
#  1    41     190   7.4    67     5     1             0             0           0    
#  2    36     118   8      72     5     2             0             0           0    
#  3    12     149  12.6    74     5     3             0             0           0    
#  4    18     313  11.5    62     5     4             0             0           0    
#  5    NA      NA  14.3    56     5     5             1             2           0.333
#  6    28      NA  14.9    66     5     6             0             1           0.167
#  7    23     299   8.6    65     5     7             0             0           0    
#  8    19      99  13.8    59     5     8             0             0           0    
#  9     8      19  20.1    61     5     9             0             0           0    
# 10    NA     194   8.6    69     5    10             1             1           0.167
# # ... with 143 more rows