Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to count the number of successive occurrences within groups while ignoring NA?

I would like to count the number of successive occurrences by two groups while ignoring missing values. Suppose I have a dataset with 3 identifiers, one identifies the country, another variable that identifies the person, another refers to time, and a dummy variable that identifies the occurences. Something like this:

df <- data.frame(country = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                 person = c("a", "a", "a", "b", "b", "a", "a", "b", "b", "b","b"),
                 time = c(1,2,3,1,2,1,2,1,2,3,4),
                 ocurrences = c(1,1,NA,1,0,0,NA,1,1,0,1))

Which will look like:

   country person time ocurrences
1        A      a    1          1
2        A      a    2          1
3        A      a    3         NA
4        A      b    1          1
5        A      b    2          0
6        B      a    1          0
7        B      a    2         NA
8        B      b    1          1
9        B      b    2          1
10       B      b    3          0
11       B      b    4          1

I would like to generate another column with the number of consecutive occurrences by these two groups (country, person) across time. At this point, I am not sure what is the best way to deal with missing values, whether to account them as 0 or leave it as missings. For now, let’s just say I will leave them as missing values. So the column should begin counting the number of successive ocurrences from the first 0 by person in a country. If the person has a 0 in between, it should stop computing and begin again on the next occurrence. So the output should look like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

   country person time ocurrences count
1        A      a    1          1     1
2        A      a    2          1     2
3        A      a    3         NA    NA
4        A      b    1          1     1
5        A      b    2          0     0
6        B      a    1          0     0
7        B      a    2         NA    NA
8        B      b    1          1     1
9        B      b    2          1     2
10       B      b    3          0     0
11       B      b    4          1     1

I edited and added another row as an example. Notice that rows 8 and 9 the column counts the number of successive occurrences, but it stops in row 10 because there was no occurrence. Then it starts computing again in row 11. What is the best way to accomplish it?

>Solution :

Well, in order to group and cumsum with your conditions, i must create first an auxiliary variable rleid (note that a function with the same name that do this can be found in package data.table)
Then group by country,person, relid, then cumsum ocurrences. NA’s are propagated naturally.

library(dplyr)
df <- data.frame(country = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                 person = c("a", "a", "a", "b", "b", "a", "a", "b", "b", "b","b"),
                 time = c(1,2,3,1,2,1,2,1,2,3,4),
                 ocurrences = c(1,1,NA,1,0,0,NA,1,1,0,1))
df$rleid = with(rle(ifelse(df$ocurrences==0,NA,df$ocurrences)), rep(seq_along(values), lengths))
df <- mutate(group_by(df, country, person, rleid), count=if_else(ocurrences>0,cumsum(ocurrences),0))
df <- ungroup(df)
df$rleid<-NULL
df
#> # A tibble: 11 × 5
#>    country person  time ocurrences count
#>    <chr>   <chr>  <dbl>      <dbl> <dbl>
#>  1 A       a          1          1     1
#>  2 A       a          2          1     2
#>  3 A       a          3         NA    NA
#>  4 A       b          1          1     1
#>  5 A       b          2          0     0
#>  6 B       a          1          0     0
#>  7 B       a          2         NA    NA
#>  8 B       b          1          1     1
#>  9 B       b          2          1     2
#> 10 B       b          3          0     0
#> 11 B       b          4          1     1

See this for an explanation of rleid, which basically allows to create groups when data changes with respect to previous row.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading