Here is a mock data frame of what I’m dealing with:
df_in <- data.frame(
  time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  state = c(1, 1, 1, 2, 2, 3, 1, 1, 1, 2),
  returns = c(0.5, 0.2, 0.3, 0.4, 0.1, 0.2, 1.1, 0.8, 0.5, 0.2)
)
time state returns
1 1 1 0.5
2 2 1 0.2
3 3 1 0.3
4 4 2 0.4
5 5 2 0.1
6 6 3 0.2
7 7 1 1.1
8 8 1 0.8
9 9 1 0.5
10 10 2 0.2
I want to aggregate the data so that returns in consecutive rows with the same ‘state’ are calculated together; compounded returns would be preferable, but a simple sum is also fine for this application.
What I would like the output to look like would be something like this:
df_out <- data.frame(
  time_start = c(1, 4, 6, 7, 10),
  time_end = c(3, 5, 6, 9, 10),
  state = c(1, 2, 3, 1, 2),
  returns = c(1.0, 0.5, 0.2, 2.4, 0.2)
)
time_start time_end state returns
1 1 3 1 1.0
2 4 5 2 0.5
3 6 6 3 0.2
4 7 9 1 2.4
5 10 10 2 0.2
In the example I have the returns simply summed together, but as said above, compounded returns would be optimal if reasonably manageable.
There is another thread that is fairly close to what I’m trying to achieve, but it relies on an identifying ‘ID’ column, which the states here don’t have.
I haven’t really tried much yet, but I was thinking of using the dplyr package, as it seems well suited to something like this. I’m completely new to the package, though.
>Solution:
You may use the consecutive_id() function (available since dplyr 1.1.0) to create a new grouping variable that assigns a unique id to each run of consecutive identical values. You can then calculate your preferred summary statistic for each of these groups.
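As a quick illustration of the mechanism, consecutive_id() increments its id every time the value changes, so repeated runs of the same state get distinct ids:

```r
library(dplyr)

# A run of 1s, a run of 2s, then 1 again: the final 1 gets a NEW id (3),
# which is what keeps non-adjacent runs of the same state separate.
consecutive_id(c(1, 1, 1, 2, 2, 1))
# [1] 1 1 1 2 2 3
```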
library(dplyr)

df_in %>%
  mutate(consecutive_group = consecutive_id(state)) %>%
  summarise(time_start = min(time, na.rm = TRUE),
            time_end = max(time, na.rm = TRUE),
            state = first(state),
            returns = sum(returns, na.rm = TRUE),
            .by = consecutive_group) %>%
  select(-consecutive_group)
# time_start time_end state returns
#1 1 3 1 1.0
#2 4 5 2 0.5
#3 6 6 3 0.2
#4 7 9 1 2.4
#5 10 10 2 0.2
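Since you mentioned compounding would be preferable: assuming the returns are simple per-period returns, you can swap the sum for prod(1 + returns) - 1 to compound within each run. A sketch under that assumption:

```r
library(dplyr)

df_in <- data.frame(
  time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  state = c(1, 1, 1, 2, 2, 3, 1, 1, 1, 2),
  returns = c(0.5, 0.2, 0.3, 0.4, 0.1, 0.2, 1.1, 0.8, 0.5, 0.2)
)

df_in %>%
  mutate(consecutive_group = consecutive_id(state)) %>%
  summarise(time_start = min(time, na.rm = TRUE),
            time_end = max(time, na.rm = TRUE),
            state = first(state),
            # compound: (1 + r1)(1 + r2)...(1 + rn) - 1
            returns = prod(1 + returns, na.rm = TRUE) - 1,
            .by = consecutive_group) %>%
  select(-consecutive_group)
```

For the first run, this gives 1.5 * 1.2 * 1.3 - 1 = 1.34 instead of the summed 1.0. If your returns are log returns instead, a plain sum() already compounds them correctly.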