How to use an "or" conditional in an n_distinct function, in dplyr?

January 10, 2022

Suppose we start with this data frame:

mydat <- 
  data.frame(
    ID = c(115,115,115,88,88,88,100,100),
    Period = c(1, 2, 3, 1, 2, 3, 1, 2),
    Status_1 = c(1,2,1,1,2,3,2,1),
    Status_2 = c("Open","Open","Terminus","Open","Open","Closed","Open","Open")
  )

> mydat
   ID Period Status_1 Status_2
1 115      1        1     Open
2 115      2        2     Open
3 115      3        1 Terminus
4  88      1        1     Open
5  88      2        2     Open
6  88      3        3   Closed
7 100      1        2     Open
8 100      2        1     Open

Next, we run the following dplyr grouping to sum the number of instances by Period and Status_1, where Status_2 = "Open":

mydat %>%
  group_by(Period,Status_1) %>%
  summarize(StatusCount = n_distinct(ID[Status_2 == "Open"]))

  Period Status_1 StatusCount
   <dbl>    <dbl>       <int>
1      1        1           2
2      1        2           1
3      2        1           1
4      2        2           2
5      3        1           0
6      3        3           0

I’ve been trying to expand the above n_distinct() function to also include Status_2 = "Terminus" (in addition to the "Open" in the above code). I’ve tried various iterations of "or" conditions, and summing tricks, with no luck yet. Any ideas how to do this?

The result, including the Status_2 = "Terminus", would look like this:

Period Status_1 StatusCount
   <dbl>    <dbl>       <int>
1      1        1           2
2      1        2           1
3      2        1           1
4      2        2           2
5      3        1           1
6      3        3           0

>Solution :

This might work for you. I added a conditional/logical or in the selection.

mydat %>% 
      group_by(Period,Status_1) %>% 
      summarize(StatusCount = n_distinct(ID[Status_2 == "Open"|Status_2 == "Terminus"])) %>% 
      ungroup()
`summarise()` has grouped output by 'Period'. You can override using the `.groups` argument.
# A tibble: 6 x 3
  Period Status_1 StatusCount
   <dbl>    <dbl>       <int>
1      1        1           2
2      1        2           1
3      2        1           1
4      2        2           2
5      3        1           1
6      3        3           0