
How to flag/count missing values in RStudio on a row by row basis?

Taking the built-in airquality dataset (it ships with R's datasets package, not MASS) as an example:

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

I want to create three columns:

Missing_Ozone, Missing_Total and Missing_Percent, such that:

Missing_Ozone = 1 if there is a missing value in the Ozone column in the current row, and 0 otherwise.

Missing_Total = total count of missing values in the current row.

Missing_Percent = percentage of missing values in the current row.

So, for example, in row 1:

Missing_Ozone = 0, Missing_Total = 0, Missing_Percent = 0

In row 5:

Missing_Ozone = 1, Missing_Total = 2, Missing_Percent = 100*(2/6)

In row 6:

Missing_Ozone = 0, Missing_Total = 1, Missing_Percent = 100*(1/6)

I tried two approaches, without any luck:

The first was to iterate over each row and use an if statement:

library(MASS)
df_test = airquality

df_test$Missing_Ozone <- 0
for(i in 1:nrow(df_test)){
  if (is.na(df_test$Ozone)) {
    df_test$Missing_Ozone <- 1
  } 
}

The second was to just use the if-statement inside that for-loop.

Neither works, and I just get:

> df_test
    Ozone Solar.R Wind Temp Month Day Missing_Ozone
1      41     190  7.4   67     5   1             0
2      36     118  8.0   72     5   2             0
3      12     149 12.6   74     5   3             0
4      18     313 11.5   62     5   4             0
5      NA      NA 14.3   56     5   5             0

Any help is appreciated.

Edit: Also, does this type of data manipulation have a certain name? I found it hard to search online for a guide that goes through this type of data manipulation.

Solution:

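A solution using the dplyr package. rowwise and c_across let us do the calculation one row at a time. Note that Missing_Percent below is computed as a proportion (for example 0.333 for row 5); multiply it by 100 to get the percentage described in the question.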

library(dplyr)

dat <- airquality %>%
  mutate(Missing_Ozone = as.integer(is.na(Ozone))) %>%
  rowwise() %>%
  mutate(Missing_Total = sum(is.na((c_across(-Missing_Ozone))))) %>%
  mutate(Missing_Percent = Missing_Total/ncol(airquality)) %>%
  ungroup()
dat
# # A tibble: 153 x 9
#    Ozone Solar.R  Wind  Temp Month   Day Missing_Ozone Missing_Total Missing_Percent
#    <int>   <int> <dbl> <int> <int> <int>         <int>         <int>           <dbl>
#  1    41     190   7.4    67     5     1             0             0           0    
#  2    36     118   8      72     5     2             0             0           0    
#  3    12     149  12.6    74     5     3             0             0           0    
#  4    18     313  11.5    62     5     4             0             0           0    
#  5    NA      NA  14.3    56     5     5             1             2           0.333
#  6    28      NA  14.9    66     5     6             0             1           0.167
#  7    23     299   8.6    65     5     7             0             0           0    
#  8    19      99  13.8    59     5     8             0             0           0    
#  9     8      19  20.1    61     5     9             0             0           0    
# 10    NA     194   8.6    69     5    10             1             1           0.167
# # ... with 143 more rows
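
As for why the loop in the question leaves every flag at 0: is.na(df_test$Ozone) tests the whole column at once, so the if condition receives 153 values instead of one. Older versions of R used only the first value (41, which is not missing) and gave a warning; since R 4.2.0 this is an error. The assignment df_test$Missing_Ozone <- 1 would also overwrite the entire column rather than the current row. The following is a minimal base R sketch, not part of the original answer, that first indexes row i in the loop and then computes the other two columns without any loop. The name n_cols is only an illustrative helper.

```r
# airquality ships with base R (datasets package); no library() call is needed
df_test <- airquality
n_cols <- ncol(df_test)  # 6 original columns, captured before adding new ones

# Corrected loop: index row i in both the test and the assignment
df_test$Missing_Ozone <- 0L
for (i in seq_len(nrow(df_test))) {
  if (is.na(df_test$Ozone[i])) {
    df_test$Missing_Ozone[i] <- 1L
  }
}

# Vectorized alternative for the row totals: is.na() accepts a whole data frame
na_flags <- is.na(airquality)                 # logical matrix, TRUE where NA
df_test$Missing_Total   <- rowSums(na_flags)  # number of NAs in each row
df_test$Missing_Percent <- 100 * df_test$Missing_Total / n_cols

head(df_test)
```

The loop is kept only to show the fix; as.integer(is.na(df_test$Ozone)) produces the same Missing_Ozone column in one step.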