I have a data table with 7 columns: date, time, temperature, id, treatment, year, and month. I have data from 3 winters from 2019 – 2022. The data were taken from November to April.
But I am interested in data from December to March only. So the first year is from 2019 December to 2020 March, the second year is from 2020 December to 2021 March, and the third year is from 2021 December to 2022 March.
I managed to use this code to obtain the months of interest.
filter(month(ymd(date)) %in% c(1,2,3,12))
But how do I select specific date ranges (start and end) for each year?
2019-12-05 to 2020-03-05 AND
2020-12-07 to 2021-03-15 AND
2021-12-01 to 2021-03-28
Here is a dummy dataset to work with with just two columns (if it helps.)
library(tidyverse)
set.seed(123)
date <- sample(seq(as.Date('2019-10-01'), as.Date('2023-04-25'), by="day"), 1000)
df <- data.frame( date)
df <- df %>% mutate(id = row_number())
Most of the example posts focus on one date range, but not disjointed date ranges like this. Any help is appreciated.
>Solution :
We can do a range join (given dplyr_1.1.0 or newer).
Reducing your sample a bit,
set.seed(123)
date <- sample(seq(as.Date('2019-10-01'), as.Date('2023-04-25'), by="day"), 30)
df <- data.frame( date)
df <- df %>% mutate(id = row_number())
head(df)
# date id
# 1 2020-11-18 1
# 2 2021-01-05 2
# 3 2020-03-27 3
# 4 2021-03-09 4
# 5 2020-04-12 5
# 6 2022-04-25 6
We can define the ranges in a separate frame:
ranges <- tibble(startdate=c("2019-12-05", "2020-12-07", "2021-12-01"), enddate=c("2020-03-05", "2021-03-15", "2021-03-28")) %>%
mutate(across(c(startdate, enddate), as.Date), group = row_number())
ranges
# # A tibble: 3 × 3
# startdate enddate group
# <date> <date> <int>
# 1 2019-12-05 2020-03-05 1
# 2 2020-12-07 2021-03-15 2
# 3 2021-12-01 2021-03-28 3
and do the join/filter using inner_join:
inner_join(df, ranges, join_by(between(date, startdate, enddate)))
# date id startdate enddate group
# 1 2021-01-05 2 2020-12-07 2021-03-15 2
# 2 2021-03-09 4 2020-12-07 2021-03-15 2
# 3 2021-03-02 21 2020-12-07 2021-03-15 2
# 4 2021-02-01 30 2020-12-07 2021-03-15 2
where only four rows are found in the three ranges we specified.
Assuming the ranges date ranges are non-overlapping, this is analogous to
df %>%
filter(
between(date, ranges$startdate[1], ranges$enddate[1]) |
between(date, ranges$startdate[2], ranges$enddate[2]) |
between(date, ranges$startdate[3], ranges$enddate[3])
)
# date id
# 1 2021-01-05 2
# 2 2021-03-09 4
# 3 2021-03-02 21
# 4 2021-02-01 30
Note that if any of the ranges date ranges overlap, you may have multiples. For instance,
ranges <- tibble(startdate=c("2019-12-05", "2020-12-07", "2021-01-01"), enddate=c("2020-03-05", "2021-03-15", "2021-01-31")) %>%
mutate(across(c(startdate, enddate), as.Date), group = row_number())
ranges
# # A tibble: 3 × 3
# startdate enddate group
# <date> <date> <int>
# 1 2019-12-05 2020-03-05 1
# 2 2020-12-07 2021-03-15 2
# 3 2021-01-01 2021-01-31 3 # <--- this overlaps with id=2
then we’ll see repeats of ids:
inner_join(df, ranges, join_by(between(date, startdate, enddate)))
# date id startdate enddate group
# 1 2021-01-05 2 2020-12-07 2021-03-15 2
# 2 2021-01-05 2 2021-01-01 2021-01-31 3
# 3 2021-03-09 4 2020-12-07 2021-03-15 2
# 4 2021-03-02 21 2020-12-07 2021-03-15 2
# 5 2021-02-01 30 2020-12-07 2021-03-15 2
If this is a concern, then you’ll need to find some way to deal with multiple rows per id, such as first or min(group) or something else.