I have a column with dates (Time), I consider one duration as consecutive times of 1s:
data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"))
I would like to generate an Id for each duration like this:
data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"),Id = c(1,1,1,2,2))
With dplyr…
Thank you
>Solution :
Up front:
cumsum(c(TRUE, as.numeric(diff(as.POSIXct(data$Time)), units = "secs") > 1L))
# [1] 1 1 1 2 2
First, you should really be working with real timestamps and not strings. If you’re doing anything else with your Time field, it is almost certainly going to be a number-like operation, so you should do this up-front with
data$Time <- as.POSIXct(data$Time)
This works easily here because they are well-formed along the default format of "%Y-%m-%d %H:%M:%S"; see ?strptime for the %-codes.
From here, you want to keep track of when a difference in time is more than 1 second. The differencing is easy enough with:
as.numeric(diff(data$Time), units = "secs")
# [1] 1 1 2 1
Really, the key operator is diff, but it can report minutes or hours or such if the data is widely-enough spaced; there’s an internal heuristic for that. Wrapping it in as.numeric(., units="secs") forces it to always be in seconds.
From here, we need a cumulative sum of when it is above 1, ergo > 1L, so cumsum(. > 1L).
Note that we have input length 5 but output length 4, this makes sense realizing that differences are between two elements. We force the first difference-test to be TRUE. If you have since changed to POSIXt-class, then the original code is reduced slightly to be
cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))
and therefore to store it as Id,
data$Id <- cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))