I have been trying to create a variable based on multiple conditions from other variables, while referencing its previous value. Unfortunately nothing I tried seemed to work.
Any help would be greatly appreciated!
I have a dataframe like this:
df <- data.frame(
  ID = (rep(c(1, 2, 3), times = c(3, 6, 4))),
  threshold = c(NA, 2, 6,
                NA, 2, 3, 7, 3, 7,
                NA, 7, 7, 2)
)
I am trying to create a new variable new_var that is 1 in the first row and keeps that value until threshold is greater than or equal to 5. When that happens, new_var should increase by one and stay at the new value until the next time threshold is greater than or equal to 5. Additionally, this rule should reset for every participant, so that the first entry for each participant starts at 1.
This is what the result should look like for the current example:
df <- data.frame(
  ID = (rep(c(1, 2, 3), times = c(3, 6, 4))),
  threshold = c(NA, 2, 6,
                NA, 2, 3, 7, 3, 7,
                NA, 7, 7, 2),
  new_var = c(1, 1, 2,
              1, 1, 1, 2, 2, 3,
              1, 2, 3, 3)
)
I have tried grouping by ID and using case_when to define the different options based on threshold. It is worth mentioning that the value of threshold is always missing for the first row of each participant.
I have used this code:
library(tidyverse)
df1 <- df %>%
  group_by(ID) %>%
  mutate(
    new_var = NA, # first had to create an empty variable so later I can refer to it
    new_var = case_when(
      is.na(threshold) == TRUE ~ 1,
      threshold < 5 ~ new_var[-1],
      threshold >= 5 ~ new_var[-1] + 1
    )
  ) %>%
  ungroup()
However, I keep getting this error:
Error in mutate():
! Problem while computing new_var = case_when(...).
ℹ The error occurred in group 1: ID = 1.
Caused by error in case_when():
! threshold < 5 ~ new_var[-1], threshold >= 5 ~ new_var[-1] + 1 must be length 3 or one, not 2.
Backtrace:
- … %>% ungroup()
- dplyr::case_when(…)
So as far as I understand, the problem is that when I try to refer to the previous value of new_var, the program treats it as a whole vector instead of as a single value at each row. But maybe I'm wrong…
Is there a better way to refer to the previous value of a vector than by new_var[-1]?
Or perhaps a better approach to solving this whole puzzle?
I would be grateful to hear any insight!
Thank you!
>Solution :
You could try:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(new_var = cumsum(coalesce(threshold, 0L) >= 5) + 1) %>%
  ungroup()
Output:
# A tibble: 13 × 3
      ID threshold new_var
   <dbl>     <dbl>   <dbl>
 1     1        NA       1
 2     1         2       1
 3     1         6       2
 4     2        NA       1
 5     2         2       1
 6     2         3       1
 7     2         7       2
 8     2         3       2
 9     2         7       3
10     3        NA       1
11     3         7       2
12     3         7       3
13     3         2       3
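To see why this works, it can help to split the expression into its intermediate steps. The sketch below does that with the df from the question. The column names filled, crossed and n_crossed are made up for illustration and are not part of the code above.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  ID = rep(c(1, 2, 3), times = c(3, 6, 4)),
  threshold = c(NA, 2, 6,
                NA, 2, 3, 7, 3, 7,
                NA, 7, 7, 2)
)

steps <- df %>%
  group_by(ID) %>%
  mutate(
    filled    = coalesce(threshold, 0), # NA becomes 0, so it never counts as a crossing
    crossed   = filled >= 5,            # TRUE on rows where threshold is >= 5
    n_crossed = cumsum(crossed),        # running count of crossings within each ID
    new_var   = n_crossed + 1           # shift so every ID starts at 1
  ) %>%
  ungroup()

print(steps)
```

Because threshold is always NA on a participant's first row, filled is 0 there, so n_crossed starts at 0 and new_var starts at 1 in every group. Collapsing the four steps gives back the single cumsum(coalesce(threshold, 0L) >= 5) + 1 expression.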
The same thing can be written as a one-liner with the .by argument if you're using dplyr 1.1.0 or later:
mutate(df, new_var = cumsum(coalesce(threshold, 0L) >= 5) + 1, .by = ID)
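On the narrower question about new_var[-1]: a negative index in R removes elements rather than looking backwards, which would explain the length 2 in the error for a group of 3 rows. A small sketch of the difference, using a made-up vector x and dplyr::lag(), which is the usual way to get the previous element:

```r
library(dplyr)

x <- c(10, 20, 30)         # made-up example vector

dropped <- x[-1]           # removes the first element: 20 30 (length 2)
shifted <- dplyr::lag(x)   # shifts values down by one: NA 10 20 (length 3)

print(dropped)
print(shifted)
```

Note that lag() would only help for columns that already exist. Inside a single mutate() call it cannot read values of new_var that are still being computed, which is why a running total such as cumsum() is the simpler fit here.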