I’m working on a dataset of time intervals. Some intervals overlap. I would like to take the raw interval data, and break it into consecutive intervals, by number of overlaps. In the toy data below, there are 3 intervals. My desired output is a data frame that contains the start and stop of where there’s only one ID, then start and stop where ID 1 and ID 2 intersect, then start and stop of where IDs 1-3 intersect, then start and stop of where IDs 1 and 3 intersect, and finally the start and stop of the remainder of ID 1.
library(lubridate)
library(ggplot2)
df <- structure(list(ID = 1:3, Start = structure(c(1690740180, 1690740480,
1690741380), class = c("POSIXct", "POSIXt"), tzone = "America/Iqaluit"),
End = structure(c(1690751520, 1690742140, 1690742280), class = c("POSIXct",
"POSIXt"), tzone = "America/Iqaluit")), row.names = 3:5, class = "data.frame")
ggplot(df) + geom_segment(aes(x = Start, xend = End, y = as.factor(ID), yend = as.factor(ID)))
Desired output should look like this:
Intervals Start End
1 2023-07-30 14:03:00 2023-07-30 14:07:59
2 2023-07-30 14:08:00 2023-07-30 14:22:59
3 2023-07-30 14:23:00 2023-07-30 14:35:40
2 2023-07-30 14:35:40 2023-07-30 14:38:00
1 2023-07-30 14:38:00 2023-07-30 15:06:40
I can do this by interpolating the data down to 1 sec and checking for intersections, but I was hoping for a cleaner solution.
>Solution :
Here’s a base R solution:
alltimes <- unique(sort(c(df$Start, df$End)))
intervals <- sapply(intervals[-length(intervals)],
function(tm) df$Start <= tm & tm < df$End)
intervals
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE TRUE TRUE TRUE TRUE
# [2,] FALSE TRUE TRUE FALSE FALSE
# [3,] FALSE FALSE TRUE TRUE FALSE
In intervals, each row is a row from the original df, each column is a time segment, and the value indicates if the original df row was found in that time segment. We can take the sums of each column to create the Intervals column, and then the Start and End columns are merely the pairs of our alltimes vector.
data.frame(
Intervals = colSums(intervals),
Start = alltimes[-length(alltimes)],
End = alltimes[-1]
)
# Intervals Start End
# 1 1 2023-07-30 14:03:00 2023-07-30 14:08:00
# 2 2 2023-07-30 14:08:00 2023-07-30 14:23:00
# 3 3 2023-07-30 14:23:00 2023-07-30 14:35:40
# 4 2 2023-07-30 14:35:40 2023-07-30 14:38:00
# 5 1 2023-07-30 14:38:00 2023-07-30 17:12:00
I’m not certain if the new End should be the same as the next Start or offset by one second, your expected output uses both. Also, I don’t know how your last row is 15:06:40 (is not in your raw data), I suspect it’s an artifact of your real data and not the sample.