Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to merge two dataframes based on the condition of a startdate and enddate?

I have this df1

library(dplyr)
library(tidyverse)

df1 = data.frame(ID = c(100,101,101,102,102,103,103,104,104,105,106),
                 x_line = c(1,1,2,1,2,1,2,1,2,1,1),
                 start_date = c('04/01/2018','05/01/2019','25/08/2021','08/03/2017','07/08/2018',
                                '09/04/2016','29/12/2018','04/08/2018','03/05/2022','04/01/2018','04/01/2018'),
                 end_date = c('04/05/2019','07/02/2020','27/09/2021','18/07/2018','17/10/2019',
                              '19/12/2018','22/12/2019','14/09/2021','26/12/2022','15/02/2020','24/08/2020') 
                 )

and the following df2

df2 = data.frame(ID = c(100,100,100,101,101,102,102,103,103,104,104,105,105,106,106,106),
                 product_name = c('AA','BB','CC','AA','CC','DD','EE','DD','FF',
                                  'AA','FF','DD','AA','CC','AA','BB'),
                 start_taken_date = c('04/05/2018','25/08/2018','27/09/2018','18/07/2019','25/11/2019',
                                      '29/01/2018','07/09/2018','14/09/2017','01/01/2019','15/02/2019','24/08/2020',
                                      '04/03/2019','04/08/2018',
                                      '05/05/2018','06/06/2019','08/09/2018'),
                 end_taken_date = c('01/05/2019','26/09/2018','25/03/2019','25/09/2019','02/01/2020',
                                    '19/06/2018','22/09/2019','16/01/2018','04/03/2019','25/06/2022','23/07/2022',
                                    '05/04/2019','05/09/2018',
                                    '29/03/2019','07/07/2019','04/05/2020'))

df3 is the result of merging df1 and df2

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df3 = df2%>%left_join(df1,.by=ID)

and now I wish to create df4 with the following condition (the problem is that it does not give me the desired output)

df4 = df3%>%mutate(line_m = ifelse(start_taken_date >=start_date & end_taken_date <= end_date,
                                   x_line,NA))

The desired final output is as follows

   ID     product_name start_taken_date end_taken_date x_line
1  100           AA       04/05/2018     01/05/2019      1
2  100           BB       25/08/2018     26/09/2018      1
3  100           CC       27/09/2018     25/03/2019      1
4  101           AA       18/07/2019     25/09/2019      1
5  101           CC       25/11/2019     02/01/2020      1
6  102           DD       29/01/2018     19/06/2018      1
7  102           EE       07/09/2018     22/09/2019      2
8  103           DD       14/09/2017     16/01/2018      1
9  103           FF       01/01/2019     04/03/2019      2
10 104           AA       15/02/2019     25/06/2022      1
11 104           FF       24/08/2020     23/07/2022      1
12 105           DD       04/03/2019     05/04/2019      1
13 105           AA       04/08/2018     05/09/2018      1
14 106           CC       05/05/2018     29/03/2019      1
15 106           AA       06/06/2019     07/07/2019      1
16 106           BB       08/09/2018     04/05/2020      1

>Solution :

The main issue is that your left_join function isn’t accounting for date ranges or the x_line variable. This causes an issue when one ID has more than one row in df1 because it matches all possible combinations without considering the date ranges.

Please convert the date columns to Date type , as currently, they are characters. This could lead to incorrect comparisons. you can use lubridate to do the date conversion.

# Make sure these libraries are installed
library(dplyr)
library(fuzzyjoin)
library(lubridate)

# Convert date columns from characters to Date
df1$start_date <- dmy(df1$start_date)
df1$end_date <- dmy(df1$end_date)
df2$start_taken_date <- dmy(df2$start_taken_date)
df2$end_taken_date <- dmy(df2$end_taken_date)

# Use fuzzyjoin to merge based on the condition of start_date and end_date
df4 <- df2 %>%
  fuzzy_left_join(df1, 
                  by = c("ID" = "ID", 
                         "start_taken_date" = "start_date", 
                         "end_taken_date" = "end_date"), 
                  match_fun = list(`==`, `>=`, `<=`)) %>%
  select(ID = ID.x, product_name, start_taken_date, end_taken_date, x_line)

print(df4)

The match_fun argument is a list of functions that should return TRUE for matches in the corresponding position in the by argument. In this case, we check that IDs are equal, start_taken_date is after start_date and end_taken_date is before end_date.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading