Subset data frame based on column values containing factors of a value

I have a data frame with a column of time differences (in hours) between the previous row and the current row. Each difference is NA, 4, 8, or 16. Taken together, the rows form one long sequence.

I want to subset the data frame so that the time differences are regular (16 h), deleting rows that are not part of a 16-hour ‘step’. An NA time difference should not simply be skipped in the calculation: two consecutive rows with differences 8, 8 form a 16 h step, but three rows with 8, NA, 8 do not. Instead, the accumulation towards 16 h should start again from the NA. Where one or more differences of 4 or 8 hours cannot be combined into a single 16 h step, those values should be turned into NA as well; see below.
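The NA rule can be illustrated on plain vectors. This is only a minimal sketch of the requirement, not part of any solution; the variable names are made up for the illustration.

```r
# Two 8-hour gaps in a row add up to one 16-hour step
step_ok <- sum(c(8, 8)) == 16

# Dropping the NA would wrongly merge the gaps on either side of it
wrongly_merged <- sum(c(8, NA, 8), na.rm = TRUE) == 16

# cumsum() propagates the NA instead, so nothing after it reaches 16
running <- cumsum(c(8, NA, 8))

print(step_ok)
print(wrongly_merged)
print(running)
```

Both comparisons are TRUE, which is exactly why na.rm-style skipping is not wanted here: it would treat 8, NA, 8 as a valid step.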

Below is an example of the current data frame and of the result I would like to achieve.


Input:

df1 <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A","B","B","B","B"), dist = c(12L, 10L, 18L, 4L, 6L, 2L, 6L, 7L, 4L, 1L, 
2L), timediff = c(NA, 4L, 4L, 4L, 4L, 16L, 8L, NA, 8L, 4L, 4L)), class = "data.frame", row.names = c(NA, -11L))

df1
   ID dist timediff
1   A   12       NA
2   A   10        4
3   A   18        4
4   A    4        4
5   A    6        4
6   A    2       16
7   A    6        8
8   B    7       NA
9   B    4        8
10  B    1        4
11  B    2        4



Desired output:
  ID dist timediff
1  A   12       NA
2  A    6       16
3  A    2       16
4  A    6       NA
5  B    7       NA
6  B    2       16

Solution:

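We could create a grouping column with rleid (from data.table) that marks runs of consecutive rows sharing the same value of timediff %/% 16, which is 0 for 4 or 8, 1 for 16, and NA for NA. Within each combination of ‘ID’ and grp we then take the cumulative sum of timediff, set single-row groups that are not 16 to NA, and finally filter only the rows where the result is 16 or NA.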

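library(dplyr)
library(data.table)  # for rleid()

df1 %>% 
  # consecutive rows with the same timediff %/% 16 (0, 1 or NA) share one run id
  group_by(ID, grp = rleid(timediff %/% 16)) %>% 
  # running total within each run; a lone row that is not 16 becomes NA
  mutate(timediff = cumsum(timediff), 
         timediff = replace(timediff, n() == 1 & timediff != 16, NA_real_)) %>% 
  ungroup() %>% 
  select(-grp) %>%
  # keep the completed 16 h steps and the NA markers
  filter(timediff == 16 | is.na(timediff))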

Output:

# A tibble: 6 × 3
  ID     dist timediff
  <chr> <int>    <dbl>
1 A        12       NA
2 A         6       16
3 A         2       16
4 A         6       NA
5 B         7       NA
6 B         2       16
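
To see why this works, it can help to look at the intermediate values before any rows are dropped. The following is a minimal sketch that rebuilds the sample data and keeps the run id next to the running total; the columns run_total and run_len are hypothetical helper names added only for inspection and are not part of the answer above.

```r
library(dplyr)
library(data.table)  # for rleid()

# Same sample data as in the question
df1 <- data.frame(
  ID = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
  dist = c(12L, 10L, 18L, 4L, 6L, 2L, 6L, 7L, 4L, 1L, 2L),
  timediff = c(NA, 4L, 4L, 4L, 4L, 16L, 8L, NA, 8L, 4L, 4L)
)

# Keep the grouping and the running total visible instead of overwriting timediff
steps <- df1 %>%
  mutate(grp = rleid(timediff %/% 16)) %>%
  group_by(ID, grp) %>%
  mutate(run_total = cumsum(timediff),  # running sum inside each run
         run_len = n()) %>%             # number of rows in the run
  ungroup()

print(as.data.frame(steps))
```

Rows 2 to 5 share one grp and their run_total reaches 16 on row 5, which is the row that survives the filter. Row 7 is a run of length 1 with a total of 8, so it is the case that gets turned into NA. One caveat, under the assumption that longer runs can occur in real data: because the cumulative sum keeps growing within a run, a run of small gaps adding up to more than 16 (for example eight consecutive 4s) would match only at its first 16. The sample data contains no such run.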