I am trying to improve the efficiency of code that is already working. For instance, consider the toy dataset below:
df = data.frame(id = c(1,1,2,2,2,3,3,3),
date = c(12,13,1,4,5,9,10,12),
visit = c("out","in","in","out","out","out","in","in"))
df
id date visit
1 12 out
1 13 in
2 1 in
2 4 out
2 5 out
3 9 out
3 10 in
3 12 in
What I am trying to do is create separate variables that contain the minimum date (here date is just day of the month) for both in and out visits grouped by ID.
If I try grouping on both variables, the output is correct but only one variable:
df %>% group_by(id,visit) %>%
mutate(m_date = min(date))
# A tibble: 8 × 4
# Groups: id, visit [6]
id date visit m_date
<dbl> <dbl> <chr> <dbl>
1 1 12 out 12
2 1 13 in 13
3 2 1 in 1
4 2 4 out 4
5 2 5 out 4
6 3 9 out 9
7 3 10 in 10
8 3 12 in 10
I’ve also tried mutate(case_when()) which seems to have slow performance and results in the same issue (all results under on variable) as would mutate(if_else()) if I’m not mistaken. The way I’m currently achieving this is a two-step process to filter and then rejoin the results:
df = full_join(df,
df %>% group_by(id) %>%
filter(visit == "out") %>%
mutate(m_date_out = min(date)) %>%
select(id, m_date_out) %>% distinct(id, .keep_all=TRUE),
by = "id")
df = full_join(df,
df %>% group_by(id) %>%
filter(visit == "in") %>%
mutate(m_date_in = min(date)) %>%
select(id, m_date_in) %>% distinct(id, .keep_all=TRUE),
by = "id")
df
id date visit m_date_out m_date_in
1 1 12 out 12 13
2 1 13 in 12 13
3 2 1 in 4 1
4 2 4 out 4 1
5 2 5 out 4 1
6 3 9 out 9 10
7 3 10 in 9 10
8 3 12 in 9 10
However this seems overly complicated for what I"m trying to do. Can the more experienced R coders show me a simpler way to achieve the same output? Speed is also relevant as this will be used on a large dataset.
>Solution :
It looks like you’re kind of making a long and wide table at the same, which seems weird to me. But you can just filter directly inside the mutate
call like so:
df %>%
group_by(id) %>%
mutate(
m_date_out = min(date[visit == "out"]),
m_date_in = min(date[visit == "in"])
)
# A tibble: 8 × 5 # Groups: id [3] id date visit m_date_out m_date_in <dbl> <dbl> <chr> <dbl> <dbl> 1 1 12 out 12 13 2 1 13 in 12 13 3 2 1 in 4 1 4 2 4 out 4 1 5 2 5 out 4 1 6 3 9 out 9 10 7 3 10 in 9 10 8 3 12 in 9 10