Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Combination of approx and map2 is surprisingly slow

I have a dataset that looks like the below:

> head(mydata)
   id value1 value2
1:  1 200001 300001
2:  2 200002 300002
3:  3 200003 300003
4:  4 200004 300004
5:  5 200005 300005
6:  6 200006 300006

value1 and value2 represent amounts at the beginning and the end of a given year. I would like to linearly interpolate the value for a given month, for each id (i.e. rowwise).

After trying different options that were slower, I am currently using map2 from the purrr package in combination with approx from base R. I create the new variable using assignment by reference from the data.table package. This is still surprisingly slow, as it takes approximately 2.2 min for my code to run on my data (1.7 million rows).

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Note that I also use get() to access the variables for the interpolation, as their names need to be dynamic. This is slowing down my code, but it doesn’t seem to be the bottleneck. Also, I have tried to use the furrr package to speed up map2 by making the code parallel, but the speed gains were not material.

Below is reproducible example with 1000 rows of data. Any help to speed up the code is greatly appreciated!

mydata <- data.table(id = 1:1000, value1= 2001:3000, value2= 3001:4000)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7

monthly_sum_assured <- function(a, b, m) {
  monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}

mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
                                    ~ monthly_sum_assured(.x, .y, m))]

>Solution :

Just use the formula for linear interpolation to vectorize over the whole data.table.

mydata <- data.table(id = 0:1e6, value1= 2e6:3e6, value2= 3e6:4e6)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7

monthly_sum_assured <- function(a, b, m) {
  monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}

system.time({
  mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
                                      ~ monthly_sum_assured(.x, .y, m))]
})
#>    user  system elapsed 
#>   41.50    0.53   42.05

system.time({
  mydata[, interpolated_value2 := get(floor_value) + m*(get(ceiling_value) - get(floor_value))/12]
})
#>    user  system elapsed 
#>       0       0       0

identical(unlist(mydata$interpolated_value), mydata$interpolated_value2)
#> [1] TRUE

It also works just as fast when m is a vector.

m <- sample(12, 1e6 + 1, 1)
system.time({
  mydata[, interpolated_value2 := get(floor_value) + m*(get(ceiling_value) - get(floor_value))/12]
})
#>    user  system elapsed 
#>    0.01    0.00    0.02
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading