Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

In R, conditionally count IDs who meet an ANY condition on a certain attribute

Background

I’ve got a dataframe df:

> df <- data.frame(ID = c("a","a","a", "b", "c","c","c","c","d","d","d","d"),
                 treatment = c(0,1,1,0,0,0,0,0,0,0,0,1),
                 event = c(0,1,0,0,1,1,1,1,0,0,1,0),
                 stringsAsFactors=FALSE) 
> df
   ID treatment event
1   a         0     0
2   a         1     1
3   a         1     0
4   b         0     0
5   c         0     1
6   c         0     1
7   c         0     1
8   c         0     1
9   d         0     0
10  d         0     0
11  d         0     1
12  d         1     0

It’s got 4 people (ID) in it: a, b, c, and d. Each ID can have one record, like b does, or several records, like the other 3 do. They each have values for treatment, which is binary, and event, which is binary too.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

The problem

I want to count how many distinct IDs who have ANY treatment==1 or NO treatment==1 ever (in other words, all treatment==0) also have event==1 and event==0. Here’s what I’d like the result to look like:

> [some R code]

event  treatment  n
    0          0  1           
    0          1  0
    1          0  1
    1          1  2

To break this out into English:

  • IDs a and d and have got ANY treatment==1 and ANY event==1
  • ID b has got no treatment EVER and no event EVER
  • ID c has no treatment==1 ever (i.e. all treatment==0) but does have any event==1 (as you can see, he’s actually got ALL treatment==1)

What I’ve tried

I know I’m close, but the dplyr code I’m using is getting tripped up by the fact that IDs have rows with both treatment==1 and treatment==0, and the same for event — so R is double-counting everyone, showing me with 8 people instead of 4:

df %>% 
  group_by(event, treatment) %>%
  distinct(ID) %>%
  count(event) %>%
  as.data.frame() %>%
  mutate(Percent = round((n/sum(n))*100, digits = 1))

  event treatment n Percent
1     0         0 3    37.5
2     0         1 2    25.0
3     1         0 2    25.0
4     1         1 1    12.5

I’m not married to dplyr on this, but it’s what I’m most familiar with (as opposed to data.table, for instance). The real dataset this code is going to be run on has several million rows, just FYI.

>Solution :

You may try

df %>%
  group_by(ID) %>%
  summarize(treatment = as.numeric(sum(treatment) > 0), 
            event = as.numeric(sum(event) > 0)) %>%
  select(-ID) %>%
  count(treatment, event)

  treatment event     n
      <dbl> <dbl> <int>
1         0     0     1
2         0     1     1
3         1     1     2
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading