My dataset records presentations to a location by an individual. This is tabulated below but included as a dput further down.
Identifier | date |
---|---|
"A1" | "28/01/2020" |
"A1" | "01/04/2020" |
"A1" | "16/08/2020" |
"A1" | "20/08/2020" |
"A1" | "30/08/2020" |
"A1" | "31/10/2020" |
"A1" | "14/11/2020" |
"A1" | "26/11/2020" |
"A1" | "25/12/2020" |
"A1" | "04/05/2021" |
"A1" | "08/05/2021" |
"A1" | "26/07/2021" |
The individual attends sporadically and on occasions returns several times within a 28 day period.
Working in date order I want to identify index cases that occur at least 28 days apart and reattendances that occur within the 28 day window.
The resulting data when processed correctly would look like this:
Identifier | date | index_visit | index_id |
---|---|---|---|
"A1" | "28/01/2020" | TRUE | 1 |
"A1" | "01/04/2020" | TRUE | 2 |
"A1" | "16/08/2020" | TRUE | 3 |
"A1" | "20/08/2020" | FALSE | 3 |
"A1" | "30/08/2020" | FALSE | 3 |
"A1" | "31/10/2020" | TRUE | 4 |
"A1" | "14/11/2020" | FALSE | 4 |
"A1" | "26/11/2020" | FALSE | 4 |
"A1" | "25/12/2020" | TRUE | 5 |
"A1" | "04/05/2021" | TRUE | 6 |
"A1" | "08/05/2021" | FALSE | 6 |
"A1" | "26/07/2021" | TRUE | 7 |
I would prefer solutions that make use of dplyr as this is what I am most familiar with.
Dput:
data <- structure(list(identifier = c("A1", "A1", "A1", "A1", "A1", "A1",
"A1", "A1", "A1", "A1", "A1", "A1", "A1"),
date = structure(c(18520, 18504, 18621, 18580, 18353, 18751,
18289, 18494, 18592, 18490, 18755, 18834, 18566),
class = "Date")),
row.names = c(NA, -13L), class = "data.frame")
>Solution :
My function time_episodes()
was originally written to identify episodes of disease reinfection but is applicable to any episodic event analysis.
It also supports more complex time units like months through lubridate
.
# Uncomment to install
# remotes::install_github("NicChr/timeplyr")
library(tidyverse)
library(timeplyr)
data <- as_tibble(data)
episodic_data <- data %>%
group_by(identifier) %>%
time_episodes(date, time_by = "day", window = 28, switch_on_boundary = TRUE)
episodic_data %>%
arrange(identifier, date) %>%
mutate(index_visit = ep_id_new > 0)
# A tibble: 13 × 7
# Groups: identifier [1]
identifier date t_elapsed ep_start ep_id ep_id_new index_visit
<chr> <date> <dbl> <date> <int> <int> <lgl>
1 A1 2020-01-28 0 2020-01-28 1 1 TRUE
2 A1 2020-04-01 64 2020-04-01 2 2 TRUE
3 A1 2020-08-16 137 2020-08-16 3 3 TRUE
4 A1 2020-08-20 4 2020-08-16 3 0 FALSE
5 A1 2020-08-30 10 2020-08-16 3 0 FALSE
6 A1 2020-09-15 16 2020-08-16 3 0 FALSE
7 A1 2020-10-31 46 2020-10-31 4 4 TRUE
8 A1 2020-11-14 14 2020-10-31 4 0 FALSE
9 A1 2020-11-26 12 2020-10-31 4 0 FALSE
10 A1 2020-12-25 29 2020-12-25 5 5 TRUE
11 A1 2021-05-04 130 2021-05-04 6 6 TRUE
12 A1 2021-05-08 4 2021-05-04 6 0 FALSE
13 A1 2021-07-26 79 2021-07-26 7 7 TRUE
Alternative method
data %>%
group_by(identifier) %>%
arrange(identifier, date) %>%
mutate(elapsed = time_elapsed(date, "days")) %>%
mutate(index_visit = row_number() == 1 | elapsed >= 28)
# A tibble: 13 × 4
# Groups: identifier [1]
identifier date elapsed index_visit
<chr> <date> <dbl> <lgl>
1 A1 2020-01-28 NA TRUE
2 A1 2020-04-01 64 TRUE
3 A1 2020-08-16 137 TRUE
4 A1 2020-08-20 4 FALSE
5 A1 2020-08-30 10 FALSE
6 A1 2020-09-15 16 FALSE
7 A1 2020-10-31 46 TRUE
8 A1 2020-11-14 14 FALSE
9 A1 2020-11-26 12 FALSE
10 A1 2020-12-25 29 TRUE
11 A1 2021-05-04 130 TRUE
12 A1 2021-05-08 4 FALSE
13 A1 2021-07-26 79 TRUE