Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to use an "or" conditional in an n_distinct function, in dplyr?

Suppose we start with this data frame:

mydat <- 
  data.frame(
    ID = c(115,115,115,88,88,88,100,100),
    Period = c(1, 2, 3, 1, 2, 3, 1, 2),
    Status_1 = c(1,2,1,1,2,3,2,1),
    Status_2 = c("Open","Open","Terminus","Open","Open","Closed","Open","Open")
  )

> mydat
   ID Period Status_1 Status_2
1 115      1        1     Open
2 115      2        2     Open
3 115      3        1 Terminus
4  88      1        1     Open
5  88      2        2     Open
6  88      3        3   Closed
7 100      1        2     Open
8 100      2        1     Open

Next, we run the following dplyr grouping to sum the number of instances by Period and Status_1, where Status_2 = "Open":

mydat %>%
  group_by(Period,Status_1) %>%
  summarize(StatusCount = n_distinct(ID[Status_2 == "Open"]))

  Period Status_1 StatusCount
   <dbl>    <dbl>       <int>
1      1        1           2
2      1        2           1
3      2        1           1
4      2        2           2
5      3        1           0
6      3        3           0

I’ve been trying to expand the above n_distinct() function to also include Status_2 = "Terminus" (in addition to the "Open" in the above code). I’ve tried various iterations of "or" conditions, and summing tricks, with no luck yet. Any ideas how to do this?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

The result, including the Status_2 = "Terminus", would look like this:

Period Status_1 StatusCount
   <dbl>    <dbl>       <int>
1      1        1           2
2      1        2           1
3      2        1           1
4      2        2           2
5      3        1           1
6      3        3           0

>Solution :

This might work for you. I added a conditional/logical or in the selection.

mydat %>% 
      group_by(Period,Status_1) %>% 
      summarize(StatusCount = n_distinct(ID[Status_2 == "Open"|Status_2 == "Terminus"])) %>% 
      ungroup()
`summarise()` has grouped output by 'Period'. You can override using the `.groups` argument.
# A tibble: 6 x 3
  Period Status_1 StatusCount
   <dbl>    <dbl>       <int>
1      1        1           2
2      1        2           1
3      2        1           1
4      2        2           2
5      3        1           1
6      3        3           0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading