I just discovered that case_when might not behave as I expect when a variable is recoded based on multiple variables.
Reproducible data:
data <- data.frame(f103 = c(2, NA, NA, 1, 2, 2),
                   f76 = c(2, NA, NA, NA, 3, 3),
                   f4 = c(1, 3, 3, 1, 1, 2))
The following code produces the same results for var1 and var2 (which is not what I want):
library(dplyr)
reprdata <- data %>%
  mutate(var1 = f4) %>%
  mutate(var1 = case_when(f103 == 2 ~ 3, TRUE ~ as.numeric(var1))) %>%
  mutate(var2 = f4) %>%
  mutate(var2 = case_when(f103 == 2 ~ 3, f76 == 1 ~ 1, f76 == 2 ~ 2, f76 == 3 ~ 3, TRUE ~ as.numeric(var2)))
The following produces the correct result (i.e., the solution to my problem):
reprdata <- data %>%
  mutate(var1 = f4) %>%
  mutate(var1 = case_when(f103 == 2 ~ 3, TRUE ~ as.numeric(var1))) %>%
  mutate(var2 = f4) %>%
  mutate(var2 = case_when(f103 == 2 ~ 3, TRUE ~ as.numeric(var2))) %>%
  mutate(var2 = case_when(f76 == 1 ~ 1, f76 == 2 ~ 2, f76 == 3 ~ 3, TRUE ~ as.numeric(var2)))
(I am aware that in this snippet of my data the f103 condition for var1 is superfluous; still, I wouldn’t expect it to cause this issue.)
I’d be interested to know if someone can explain to me why this problem occurs and how to prevent it in the future.
>Solution :
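It has to do with how case_when evaluates its conditions: they are checked in order, from top to bottom, and for each row the first condition that is TRUE determines the result. Later conditions are ignored for that row; they do not overwrite an earlier match, which is contrary to what many people intuitively expect (in my experience). So the condition that should take priority has to be listed first. I.e.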
f76 wins (what you expect!)
library(dplyr)
data |>
  mutate(var1 = case_when(f103 == 2 ~ 3,
                          TRUE ~ f4)) |>
  mutate(var2 = case_when(f76 %in% 1:3 ~ f76,
                          f103 == 2 ~ 3, # NB!
                          TRUE ~ f4))
  f103 f76 f4 var1 var2
1    2   2  1    3    2
2   NA  NA  3    3    3
3   NA  NA  3    3    3
4    1  NA  1    1    1
5    2   3  1    3    3
6    2   3  2    3    3
f103 wins (what you don’t expect)
library(dplyr)
data |>
  mutate(var1 = case_when(f103 == 2 ~ 3,
                          TRUE ~ f4)) |>
  mutate(var2 = case_when(f103 == 2 ~ 3, # NB!
                          f76 %in% 1:3 ~ f76,
                          TRUE ~ f4))
  f103 f76 f4 var1 var2
1    2   2  1    3    3
2   NA  NA  3    3    3
3   NA  NA  3    3    3
4    1  NA  1    1    1
5    2   3  1    3    3
6    2   3  2    3    3
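To see the ordering rule in isolation, here is a minimal sketch on a toy vector (the vector x and the labels are made up for illustration and are not part of the question's data). The value 10 satisfies both conditions, so whichever condition is listed first decides its result; the NA matches neither condition and falls through to the TRUE branch, just like rows 2 and 3 above.

```r
library(dplyr)

# Hypothetical toy input, not taken from the question's data
x <- c(1, 10, NA)

# x > 5 is listed first, so it wins for x = 10
first_big <- case_when(x > 5 ~ "big",
                       x > 0 ~ "positive",
                       TRUE ~ "other")

# x > 0 is listed first, so it wins for x = 10 and "big" is never reached
first_positive <- case_when(x > 0 ~ "positive",
                            x > 5 ~ "big",
                            TRUE ~ "other")

print(first_big)       # "positive" "big"      "other"
print(first_positive)  # "positive" "positive" "other"
```

If one recode should override another, put its conditions earlier in the same case_when call rather than relying on later lines to overwrite earlier ones.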