Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Getting summary values from other dataset up to some date value for each observation

I currently have two datasets. The first one has weekly observations for each ID (the number of weeks observed may not be the same across different IDs, but all are within a fixed range, the year of 2022):

ID Date Var1
1 2022-07-04 200
1 2022-07-11 45
1 2022-07-18 70
1 2022-08-01 90
2 2022-06-27 50
2 2022-07-18 70
2 2022-08-01 90
3 2022-09-26 70
3 2022-10-03 90

The second one has the same IDs, but with values recorded not according to some fixed interval (i.e. can be any date and number of observations)

ID Date Var2 Var3
1 2010-12-06 200 0.5
1 2022-07-29 300 0.6
2 2015-06-27 100 0.1
2 2016-07-04 150 0.1
2 2016-07-05 200 0.4
2 2018-09-21 600 0.0

My goal is to include in the first dataset summaries of Var2 and Var3 from the second dataset UP TO the value specified in Date (i.e. the average of all values of Var2 and Var3 prior to Date). The desired output would be:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ID Date Var1 Mean_Var2 Mean_Var3 N
1 2022-07-04 200 200 0.5 1
1 2022-07-11 45 200 0.5 1
1 2022-07-18 70 200 0.5 1
1 2022-08-01 90 250 0.55 2
2 2022-06-27 50 262.5 0.15 4
2 2022-07-18 70 262.5 0.15 4
2 2022-08-01 90 262.5 0.15 4
3 2022-12-01 70 NA NA 0
3 2022-12-03 90 NA NA 0

where N contains a count of the values which were averaged.

df1 <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
                  Date = as.Date(c('2022-07-04', '2022-07-11', '2022-07-18', '2022-08-01',
                                   '2022-06-27', '2022-07-18', '2022-08-01',
                                   '2022-12-01', '2022-12-03')),
                  Var1 = c(200, 45, 70, 90, 50, 70, 90, 70, 90))

df2 <- data.frame(ID = c(1, 1, 2, 2, 2, 2),
                  Date = as.Date(c('2010-12-06', '2022-07-29', 
                                   '2015-06-27', '2016-07-04', '2016-07-05', '2018-09-21')),
                  Var2 = c(200, 300, 100, 150, 200, 600),
                  Var3 = c(0.5, 0.6, 0.1, 0.1, 0.4, 0.0))

I am a bit stumped as to how to go about this. I was thinking of just iterating by row across the first dataset and constructing the measure for each row, but the datasets themselves are also quite large ( > 1 million obs each), so this seems inefficient. Does anyone know a faster way?

>Solution :

You want to join by ID in all cases where the Date in df2 is less than the Date in df1. Sounds like a case for the data.table non-equi join!

library(data.table)

setDT(df1)
setDT(df2)

df2[, date_for_join_df2 := Date]

df2[df1,
    on = .(
        ID,
        date_for_join_df2 < Date
    )
][, .(
        ID, 
        Date = date_for_join_df2, 
        Mean_Var2 = mean(Var2),
        Mean_Var3 = mean(Var3),
        .N
    )
, by = .(ID, date_for_join_df2)
]


#    ID date_for_join_df2 ID       Date Mean_Var2 Mean_Var3 N
# 1:  1        2022-07-04  1 2022-07-04     200.0      0.50 1
# 2:  1        2022-07-11  1 2022-07-11     200.0      0.50 1
# 3:  1        2022-07-18  1 2022-07-18     200.0      0.50 1
# 4:  1        2022-08-01  1 2022-08-01     250.0      0.55 2
# 5:  2        2022-06-27  2 2022-06-27     262.5      0.15 4
# 6:  2        2022-07-18  2 2022-07-18     262.5      0.15 4
# 7:  2        2022-08-01  2 2022-08-01     262.5      0.15 4
# 8:  3        2022-12-01  3 2022-12-01        NA        NA 1
# 9:  3        2022-12-03  3 2022-12-03        NA        NA 1

Thanks to the magic of data.table this should be quite fast. Note the df2[, date_for_join_df2 := Date]. It’s important to copy the column and join on that as you lose one of the columns you join on – which is fine with an equi join, but not when it’s not equal to the column it has joined to.

I linked to the data.table vignette above. This blog post by David Selby contains a more in-depth explanation and examples.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading