Subset data frame based on column values containing factors of a value

July 22, 2022

I have a data frame that contains a column with time differences (in hours) between the previous row and the current row. These differences are NA, 4, 8, or 16. The rows therefore form one long sequence.

I want to subset the data frame so that the time differences are regular (16h), and delete rows that are not part of a 16-hour ‘step’. However, an NA time difference should not simply be skipped in the calculation (e.g. two rows with time differences 8, 8 together form a step of 16h, but three rows with 8, NA, 8 do not). Instead, the calculation of 16h steps should start again from this NA. Where one or more time differences of 4 or 8 hours cannot be combined into a single step of 16h, the individual values should be turned into NA too, see below.
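The difference between skipping an NA and restarting at it can be illustrated with plain sums. This is only a small sketch of the rule described above, not part of the data; the variable names step_ok, step_broken and step_skipped are made up for illustration.

```r
# Two consecutive 8h differences add up to one 16h step
step_ok <- sum(c(8, 8)) == 16                         # TRUE

# An NA in between breaks the step: the plain sum is NA, not 16
step_broken <- sum(c(8, NA, 8))                       # NA

# Dropping the NA would report a 16h step, which is the behaviour to avoid
step_skipped <- sum(c(8, NA, 8), na.rm = TRUE) == 16  # TRUE, but not wanted here
```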

Below is an example of the current data frame and of what I would like to achieve.

CURRENT

Input

df1 <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A","B","B","B","B"), dist = c(12L, 10L, 18L, 4L, 6L, 2L, 6L, 7L, 4L, 1L, 
2L), timediff = c(NA, 4L, 4L, 4L, 4L, 16L, 8L, NA, 8L, 4L, 4L)), class = "data.frame", row.names = c(NA, -11L))

df1
   ID dist timediff
1   A   12       NA
2   A   10        4
3   A   18        4
4   A    4        4
5   A    6        4
6   A    2       16
7   A    6        8
8   B    7       NA
9   B    4        8
10  B    1        4
11  B    2        4



Desired output:
  ID dist timediff
1  A   12       NA
2  A    6       16
3  A    2       16
4  A    6       NA
5  B    7       NA
6  B    2       16

Solution:

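We could create a grouping column with rleid() from data.table applied to timediff %/% 16, so that each run of consecutive rows with the same quotient (NA, 0 or 1) gets its own group. Within each ‘ID’ and grp we then take the cumulative sum of timediff, set a lone value that is not 16 to NA, and filter to keep only the rows where the running total is 16 or NA.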

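library(dplyr)
library(data.table)  # for rleid()

df1 %>%
  # one group per ID and per run of equal values of timediff %/% 16
  group_by(ID, grp = rleid(timediff %/% 16)) %>%
  # running total of hours within each run
  mutate(timediff = cumsum(timediff),
         # a lone 4h or 8h row cannot complete a 16h step, so set it to NA
         timediff = replace(timediff, n() == 1 & timediff != 16, NA_real_)) %>%
  ungroup() %>%
  select(-grp) %>%
  # keep completed 16h steps and the NA rows
  filter(timediff == 16 | is.na(timediff))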

-output

# A tibble: 6 × 3
  ID     dist timediff
  <chr> <int>    <dbl>
1 A        12       NA
2 A         6       16
3 A         2       16
4 A         6       NA
5 B         7       NA
6 B         2       16
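
To see why the grouping works, it can help to look at the intermediate values before the filter is applied. The following is a minimal sketch using the same df1 as in the question; the names steps, bucket and running are introduced here for illustration only and are not part of the answer above.

```r
library(dplyr)
library(data.table)  # for rleid()

# Same data as in the question
df1 <- data.frame(
  ID = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
  dist = c(12L, 10L, 18L, 4L, 6L, 2L, 6L, 7L, 4L, 1L, 2L),
  timediff = c(NA, 4L, 4L, 4L, 4L, 16L, 8L, NA, 8L, 4L, 4L)
)

steps <- df1 %>%
  mutate(bucket = timediff %/% 16,  # 1 for a 16h row, 0 for 4h/8h rows, NA stays NA
         grp = rleid(bucket)) %>%   # new id every time bucket changes
  group_by(ID, grp) %>%
  mutate(running = cumsum(timediff)) %>%  # running total of hours within each run
  ungroup()

steps$grp
# 1 2 2 2 2 3 4 5 6 6 6
steps$running
# NA 4 8 12 16 16 8 NA 8 12 16
```

Each NA row and each 16h row ends up in a group of its own, while consecutive 4h and 8h rows share a group, so the running total reaches 16 exactly on the row that completes a step. Note that the running total is only compared against 16: in a longer run of 4h or 8h rows (for example eight consecutive 4h rows, which add up to 32) only the row where the total equals 16 would pass the filter, so data with such runs may need a different treatment.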