Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Joining dates are changed to certain key in data.table

I have the following dataframes df and df_dates (dput below):

> df
   group      start        end
1      A 2022-12-01 2022-12-04
2      A 2022-12-04 2022-12-07
3      A 2022-12-07 2022-12-10
4      A 2022-12-10 2022-12-13
5      A 2022-12-13 2022-12-16
6      A 2022-12-16 2022-12-19
7      B 2022-12-01 2022-12-04
8      B 2022-12-04 2022-12-07
9      B 2022-12-07 2022-12-10
10     B 2022-12-10 2022-12-13
11     B 2022-12-13 2022-12-16
12     B 2022-12-16 2022-12-19
> df_dates
  group       date value
1     A 2022-12-02     1
2     A 2022-12-14     3
3     B 2022-12-06     2
4     B 2022-12-13     4

I would like to join the rows of df_dates by group where the date column is between the column start and end of df. When I join these two dataframes, the dates of the date column return the same as the dates from the start column of df. Here is the code with output:

df <- data.frame(group = rep(c('A', 'B'), each = 6),
                 start = c(seq.Date(as.Date('2022-12-01'), as.Date('2022-12-16'), '3 days')),
                 end = c(seq.Date(as.Date('2022-12-04'), as.Date('2022-12-19'), '3 days')))
df_dates <- data.frame(group = c('A', 'A', 'B', 'B'),
                       date = as.Date(c('2022-12-02', '2022-12-14', '2022-12-06', '2022-12-13')),
                       value = c(1,3,2,4))
library(data.table)
setDT(df)
setDT(df_dates)
df_dates[df, 
         .(group, date, start, end, value), 
         on = .(group, date >= start, date <= end)]
#>     group       date      start        end value
#>  1:     A 2022-12-01 2022-12-01 2022-12-04     1
#>  2:     A 2022-12-04 2022-12-04 2022-12-07    NA
#>  3:     A 2022-12-07 2022-12-07 2022-12-10    NA
#>  4:     A 2022-12-10 2022-12-10 2022-12-13    NA
#>  5:     A 2022-12-13 2022-12-13 2022-12-16     3
#>  6:     A 2022-12-16 2022-12-16 2022-12-19    NA
#>  7:     B 2022-12-01 2022-12-01 2022-12-04    NA
#>  8:     B 2022-12-04 2022-12-04 2022-12-07     2
#>  9:     B 2022-12-07 2022-12-07 2022-12-10    NA
#> 10:     B 2022-12-10 2022-12-10 2022-12-13     4
#> 11:     B 2022-12-13 2022-12-13 2022-12-16     4
#> 12:     B 2022-12-16 2022-12-16 2022-12-19    NA

Created on 2022-12-12 with reprex v2.0.2

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

As you can see, the dates of the date column are now changed to the dates of the start column, while I want them to be the same as in df_dates dataframe. The desired output should look like this:

#>     group       date      start        end value
#>  1:     A 2022-12-02 2022-12-01 2022-12-04     1
#>  2:     A         NA 2022-12-04 2022-12-07    NA
#>  3:     A         NA 2022-12-07 2022-12-10    NA
#>  4:     A         NA 2022-12-10 2022-12-13    NA
#>  5:     A 2022-12-14 2022-12-13 2022-12-16     3
#>  6:     A         NA 2022-12-16 2022-12-19    NA
#>  7:     B         NA 2022-12-01 2022-12-04    NA
#>  8:     B 2022-12-06 2022-12-04 2022-12-07     2
#>  9:     B         NA 2022-12-07 2022-12-10    NA
#> 10:     B 2022-12-13 2022-12-10 2022-12-13     4
#> 11:     B 2022-12-13 2022-12-13 2022-12-16     4
#> 12:     B         NA 2022-12-16 2022-12-19    NA

So I was wondering if anyone knows how to join these two dataframes in the right way using data.table?


dput of df and df_dates:

df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "B", "B", 
"B", "B", "B", "B"), start = structure(c(19327, 19330, 19333, 
19336, 19339, 19342, 19327, 19330, 19333, 19336, 19339, 19342
), class = "Date"), end = structure(c(19330, 19333, 19336, 19339, 
19342, 19345, 19330, 19333, 19336, 19339, 19342, 19345), class = "Date")), class = "data.frame", row.names = c(NA, 
-12L))

df_dates <- structure(list(group = c("A", "A", "B", "B"), date = structure(c(19328, 
19340, 19332, 19339), class = "Date"), value = c(1, 3, 2, 4)), class = "data.frame", row.names = c(NA, 
-4L))

>Solution :

library(data.table)
setDT(df)
setDT(df_dates)

Just specifying you want to the original date (you can use prefix x. or i. (to reference df in this case) )

df_dates[df, 
         .(group, x.date, start, end, value), 
         on = .(group, date >= start, date <= end)]

Or modifying the original:

df[, c("date", "value") := 
       df_dates[.SD, on = .(group, date >= start, date <= end), .(x.date, value)]]
#      group      start        end       date value
#     <char>     <Date>     <Date>     <Date> <num>
#  1:      A 2022-12-01 2022-12-04 2022-12-02     1
#  2:      A 2022-12-04 2022-12-07       <NA>    NA
#  3:      A 2022-12-07 2022-12-10       <NA>    NA
#  4:      A 2022-12-10 2022-12-13       <NA>    NA
#  5:      A 2022-12-13 2022-12-16 2022-12-14     3
#  6:      A 2022-12-16 2022-12-19       <NA>    NA
#  7:      B 2022-12-01 2022-12-04       <NA>    NA
#  8:      B 2022-12-04 2022-12-07 2022-12-06     2
#  9:      B 2022-12-07 2022-12-10       <NA>    NA
# 10:      B 2022-12-10 2022-12-13 2022-12-13     4
# 11:      B 2022-12-13 2022-12-16 2022-12-13     4
# 12:      B 2022-12-16 2022-12-19       <NA>    NA
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading