Grouping data in R only for consecutive rows in same category

Here is a mock data frame of what I’m dealing with:

df_in <- data.frame(
  time = c(1,2,3,4,5,6,7,8,9,10),
  state = c(1,1,1,2,2,3,1,1,1,2),
  returns = c(0.5,0.2,0.3,0.4,0.1,0.2,1.1,0.8,0.5,0.2)
)

   time state returns
1     1     1     0.5
2     2     1     0.2
3     3     1     0.3
4     4     2     0.4
5     5     2     0.1
6     6     3     0.2
7     7     1     1.1
8     8     1     0.8
9     9     1     0.5
10   10     2     0.2

I want to aggregate the data so that returns in consecutive rows with the same ‘state’ are calculated together, preferably compounded, but a simple sum is also fine for this application.

What I would like the output to look like would be something like this:

df_out <- data.frame(
  time_start = c(1,4,6,7,10),
  time_end = c(3,5,6,9,10),
  state = c(1,2,3,1,2),
  returns = c(1.0,0.5,0.2,2.4,0.2)
)

  time_start time_end state returns
1          1        3     1     1.0
2          4        5     2     0.5
3          6        6     3     0.2
4          7        9     1     2.4
5         10       10     2     0.2

In the example the returns are simply summed, but as mentioned above, compounded returns would be ideal if that is manageable.

There is another thread which is fairly close to what I’m trying to achieve, but it uses an identifying ‘ID’, which the states here don’t have.

I haven’t really tried too much, but was thinking of using methods in the dplyr package, as they seem fairly efficient for something like this. I’m completely new to the package though.

Solution:

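You can use dplyr’s consecutive_id() function (available since dplyr 1.1.0) to create a grouping key that gets a new value every time ‘state’ changes, so each run of consecutive rows with the same state forms its own group. You can then calculate whatever summary statistics you need for each group.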

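library(dplyr)  # consecutive_id() and .by require dplyr >= 1.1.0

df_in %>%
  # New id each time `state` differs from the previous row
  mutate(consecutive_group = consecutive_id(state)) %>%
  summarise(
    time_start = min(time, na.rm = TRUE),
    time_end = max(time, na.rm = TRUE),
    state = first(state),
    returns = sum(returns, na.rm = TRUE),
    .by = consecutive_group
  ) %>%
  # Drop the helper column once the runs are summarised
  select(-consecutive_group)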

#  time_start time_end state returns
#1          1        3     1     1.0
#2          4        5     2     0.5
#3          6        6     3     0.2
#4          7        9     1     2.4
#5         10       10     2     0.2
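
Since the question says compounded returns would be preferable, here is a minimal sketch of that variant. It assumes the ‘returns’ column holds simple per-period returns expressed as fractions (so 0.5 means +50%), which the question does not state explicitly; under that assumption the compounded return of a run is prod(1 + r) - 1. The names run and df_comp are only illustrative.

```r
library(dplyr)  # consecutive_id() and .by require dplyr >= 1.1.0

df_in <- data.frame(
  time = c(1,2,3,4,5,6,7,8,9,10),
  state = c(1,1,1,2,2,3,1,1,1,2),
  returns = c(0.5,0.2,0.3,0.4,0.1,0.2,1.1,0.8,0.5,0.2)
)

df_comp <- df_in %>%
  # `run` is a hypothetical helper column: one id per run of equal states
  mutate(run = consecutive_id(state)) %>%
  summarise(
    time_start = min(time),
    time_end = max(time),
    state = first(state),
    # Assumption: returns are simple fractional returns, so they
    # compound as (1 + r1) * (1 + r2) * ... - 1
    returns = prod(1 + returns) - 1,
    .by = run
  ) %>%
  select(-run)

df_comp
```

With the mock data, the compounded returns for the five runs should come out at roughly 1.34, 0.54, 0.20, 4.67 and 0.20. If the real data can contain missing values, prod() accepts na.rm = TRUE in the same way sum() does.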