I have this df1
library(dplyr)
library(tidyverse)
df1 = data.frame(ID = c(100,101,101,102,102,103,103,104,104,105,106),
x_line = c(1,1,2,1,2,1,2,1,2,1,1),
start_date = c('04/01/2018','05/01/2019','25/08/2021','08/03/2017','07/08/2018',
'09/04/2016','29/12/2018','04/08/2018','03/05/2022','04/01/2018','04/01/2018'),
end_date = c('04/05/2019','07/02/2020','27/09/2021','18/07/2018','17/10/2019',
'19/12/2018','22/12/2019','14/09/2021','26/12/2022','15/02/2020','24/08/2020')
)
and the following df2
df2 = data.frame(ID = c(100,100,100,101,101,102,102,103,103,104,104,105,105,106,106,106),
product_name = c('AA','BB','CC','AA','CC','DD','EE','DD','FF',
'AA','FF','DD','AA','CC','AA','BB'),
start_taken_date = c('04/05/2018','25/08/2018','27/09/2018','18/07/2019','25/11/2019',
'29/01/2018','07/09/2018','14/09/2017','01/01/2019','15/02/2019','24/08/2020',
'04/03/2019','04/08/2018',
'05/05/2018','06/06/2019','08/09/2018'),
end_taken_date = c('01/05/2019','26/09/2018','25/03/2019','25/09/2019','02/01/2020',
'19/06/2018','22/09/2019','16/01/2018','04/03/2019','25/06/2022','23/07/2022',
'05/04/2019','05/09/2018',
'29/03/2019','07/07/2019','04/05/2020'))
df3 is the result of merging df1 and df2
df3 = df2%>%left_join(df1,.by=ID)
and now I wish to create df4 with the following condition (the problem is that it does not give me the desired output)
df4 = df3%>%mutate(line_m = ifelse(start_taken_date >=start_date & end_taken_date <= end_date,
x_line,NA))
The desired final output is as follows
ID product_name start_taken_date end_taken_date x_line
1 100 AA 04/05/2018 01/05/2019 1
2 100 BB 25/08/2018 26/09/2018 1
3 100 CC 27/09/2018 25/03/2019 1
4 101 AA 18/07/2019 25/09/2019 1
5 101 CC 25/11/2019 02/01/2020 1
6 102 DD 29/01/2018 19/06/2018 1
7 102 EE 07/09/2018 22/09/2019 2
8 103 DD 14/09/2017 16/01/2018 1
9 103 FF 01/01/2019 04/03/2019 2
10 104 AA 15/02/2019 25/06/2022 1
11 104 FF 24/08/2020 23/07/2022 1
12 105 DD 04/03/2019 05/04/2019 1
13 105 AA 04/08/2018 05/09/2018 1
14 106 CC 05/05/2018 29/03/2019 1
15 106 AA 06/06/2019 07/07/2019 1
16 106 BB 08/09/2018 04/05/2020 1
>Solution :
The main issue is that your left_join function isn’t accounting for date ranges or the x_line variable. This causes an issue when one ID has more than one row in df1 because it matches all possible combinations without considering the date ranges.
Please convert the date columns to Date type , as currently, they are characters. This could lead to incorrect comparisons. you can use lubridate to do the date conversion.
# Make sure these libraries are installed
library(dplyr)
library(fuzzyjoin)
library(lubridate)
# Convert date columns from characters to Date
df1$start_date <- dmy(df1$start_date)
df1$end_date <- dmy(df1$end_date)
df2$start_taken_date <- dmy(df2$start_taken_date)
df2$end_taken_date <- dmy(df2$end_taken_date)
# Use fuzzyjoin to merge based on the condition of start_date and end_date
df4 <- df2 %>%
fuzzy_left_join(df1,
by = c("ID" = "ID",
"start_taken_date" = "start_date",
"end_taken_date" = "end_date"),
match_fun = list(`==`, `>=`, `<=`)) %>%
select(ID = ID.x, product_name, start_taken_date, end_taken_date, x_line)
print(df4)
The match_fun argument is a list of functions that should return TRUE for matches in the corresponding position in the by argument. In this case, we check that IDs are equal, start_taken_date is after start_date and end_taken_date is before end_date.