Taking the airquality dataset from the MASS library as an example:
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
I want to create three columns:
Missing_Ozone, Missing_Total and Missing_Percent, such that:
Missing_Ozone = 1 if there is a missing value in the Ozone column in the current row.
Missing_total = total count of missing values in the current row
Missing_Percent = percentage of missing values in a row.
So for example, in row 1:
Missing_Ozone = 0, Missing_total = 0, Missing_percent = 0
In row 5:
Missing_Ozone = 1, Missing_total = 2, Missing percent = 100*(2/6)
In row 6:
Missing_Ozone = 0, Missing_total = 1, Missing percent = 100*(1/6)
I tried two approaches, without any luck:
The first was to iterate over each row and use an if statement:
library(MASS)
df_test = airquality
df_test$Missing_Ozone <- 0
for(i in 1:nrow(df_test)){
if (is.na(df_test$Ozone)) {
df_test$Missing_Ozone <- 1
}
}
The second was to just use the if-statement inside that for-loop.
Neither work, and I just get:
> df_test
Ozone Solar.R Wind Temp Month Day Missing_Ozone
1 41 190 7.4 67 5 1 0
2 36 118 8.0 72 5 2 0
3 12 149 12.6 74 5 3 0
4 18 313 11.5 62 5 4 0
5 NA NA 14.3 56 5 5 0
Any help is appreciated.
Edit: Also, does this type of data manipulation have a certain name? I found it hard to search online for a guide that goes through this type of data manipulation.
>Solution :
A solution using the dplyr package. rowwise and c_cross allow us to do calculation by each row.
library(dplyr)
dat <- airquality %>%
mutate(Missing_Ozone = as.integer(is.na(Ozone))) %>%
rowwise() %>%
mutate(Missing_Total = sum(is.na((c_across(-Missing_Ozone))))) %>%
mutate(Missing_Percent = Missing_Total/ncol(airquality)) %>%
ungroup()
dat
# # A tibble: 153 x 9
# Ozone Solar.R Wind Temp Month Day Missing_Ozone Missing_Total Missing_Percent
# <int> <int> <dbl> <int> <int> <int> <int> <int> <dbl>
# 1 41 190 7.4 67 5 1 0 0 0
# 2 36 118 8 72 5 2 0 0 0
# 3 12 149 12.6 74 5 3 0 0 0
# 4 18 313 11.5 62 5 4 0 0 0
# 5 NA NA 14.3 56 5 5 1 2 0.333
# 6 28 NA 14.9 66 5 6 0 1 0.167
# 7 23 299 8.6 65 5 7 0 0 0
# 8 19 99 13.8 59 5 8 0 0 0
# 9 8 19 20.1 61 5 9 0 0 0
# 10 NA 194 8.6 69 5 10 1 1 0.167
# # ... with 143 more rows