Efficiently assigning multiple variables created from a subset of grouped data in R

Advertisements

I am trying to improve the efficiency of code that is already working. For instance, consider the toy dataset below:

df = data.frame(id = c(1,1,2,2,2,3,3,3),
                date = c(12,13,1,4,5,9,10,12),
                visit = c("out","in","in","out","out","out","in","in"))

df
 id date visit
  1   12   out
  1   13    in
  2    1    in
  2    4   out
  2    5   out
  3    9   out
  3   10    in
  3   12    in

What I am trying to do is create separate variables that contain the minimum date (here date is just day of the month) for both in and out visits grouped by ID.

If I try grouping on both variables, the output is correct but only one variable:

df %>% group_by(id,visit) %>%
  mutate(m_date = min(date))
# A tibble: 8 × 4
# Groups:   id, visit [6]
     id  date visit m_date
  <dbl> <dbl> <chr>  <dbl>
1     1    12 out       12
2     1    13 in        13
3     2     1 in         1
4     2     4 out        4
5     2     5 out        4
6     3     9 out        9
7     3    10 in        10
8     3    12 in        10

I’ve also tried mutate(case_when()) which seems to have slow performance and results in the same issue (all results under on variable) as would mutate(if_else()) if I’m not mistaken. The way I’m currently achieving this is a two-step process to filter and then rejoin the results:

df = full_join(df,
    df %>% group_by(id) %>%
    filter(visit == "out") %>%
    mutate(m_date_out = min(date)) %>%
    select(id, m_date_out) %>% distinct(id, .keep_all=TRUE),
    by = "id")

df = full_join(df,
     df %>% group_by(id) %>%
     filter(visit == "in") %>%
     mutate(m_date_in = min(date)) %>%
     select(id, m_date_in) %>% distinct(id, .keep_all=TRUE),
     by = "id")
df
  id date visit m_date_out m_date_in
1  1   12   out         12        13
2  1   13    in         12        13
3  2    1    in          4         1
4  2    4   out          4         1
5  2    5   out          4         1
6  3    9   out          9        10
7  3   10    in          9        10
8  3   12    in          9        10

However this seems overly complicated for what I"m trying to do. Can the more experienced R coders show me a simpler way to achieve the same output? Speed is also relevant as this will be used on a large dataset.

>Solution :

It looks like you’re kind of making a long and wide table at the same, which seems weird to me. But you can just filter directly inside the mutate call like so:

df %>% 
  group_by(id) %>%
  mutate(
    m_date_out = min(date[visit == "out"]),
    m_date_in = min(date[visit == "in"])
  )
# A tibble: 8 × 5
# Groups:   id [3]
     id  date visit m_date_out m_date_in
  <dbl> <dbl> <chr>      <dbl>     <dbl>
1     1    12 out           12        13
2     1    13 in            12        13
3     2     1 in             4         1
4     2     4 out            4         1
5     2     5 out            4         1
6     3     9 out            9        10
7     3    10 in             9        10
8     3    12 in             9        10

Leave a Reply Cancel reply