Which of the two is more efficient for adding new columns to a data.table, and why?

Consider the two methods below for adding columns to an existing data.table. One chains data.table calls using [], and the other is the classic approach of adding one variable at a time with :=, each in its own statement.

Both methods use the same amount of memory (1.66 GB) at the end, but one of the two appears to be around 15 to 20% faster.
My questions are:
Is this speed increase a fluke or real?
And if it is real, why?

library(pacman)
p_load(reprex, data.table,magrittr,pryr)
mem_used()
#> 53.7 MB
# initialise a large data.table 
dt1 <- data.table(x= 1:5e7, y = 5e7:1)
mem_used()
#> 454 MB

Method 1

system.time(dt2 <- dt1[,a:=log(x)][,b:=log(y)][,c := a + b])
#>    user  system elapsed 
#>   1.379   0.589   1.968
mem_used()
#> 1.66 GB

Release the memory to start again:

rm(dt1,dt2)
gc()
#>           used (Mb) gc trigger   (Mb)  max used (Mb)
#> Ncells  795598 42.5    1291680   69.0   1291680   69
#> Vcells 1424205 10.9  234120480 1786.2 201449048 1537
mem_used()
#> 55.9 MB
dt1 <- data.table(x= 1:5e7, y = 5e7:1)
mem_used()
#> 456 MB

Method 2

system.time({
dt1[,a:=log(x)]
dt1[,b:=log(y)]
dt1[,c := a + b]
})
#>    user  system elapsed 
#>   1.207   0.472   1.679
mem_used()
#> 1.66 GB

As you can see, Method 2 is about 15 to 17% faster. Why?

Created on 2022-07-08 by the reprex package (v2.0.1)

Solution:

The difference is not significant: repeating both versions many times with microbenchmark (100 runs each by default) shows no measurable difference.

microbenchmark::microbenchmark(chain={dt2 <- dt1[,a:=log(x)][,b:=log(y)][,c := a + b]},
                               seq = {dt1[,a:=log(x)]
                                 dt1[,b:=log(y)]
                                 dt1[,c := a + b]})

Unit: seconds
  expr      min       lq     mean   median       uq      max neval
 chain 3.056398 3.123273 3.207696 3.204743 3.270068 3.500883   100
   seq 3.060816 3.131185 3.208122 3.222308 3.273654 3.483277   100
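
One likely reason the two forms time the same, offered as a sketch rather than as part of the original answer: `:=` updates the table by reference and returns it invisibly, so every link in the chain works on the same object that the separate statements do. The small table size below is an assumption to keep the check quick, and `same_object` is a name introduced only for illustration; `address()` is the data.table helper that reports an object's memory address.

```r
library(data.table)

# Small illustrative table (size is an assumption, not the question's 5e7 rows)
dt1 <- data.table(x = 1:10, y = 10:1)

# Chained form: each [] receives the table returned by the previous :=
dt2 <- dt1[, a := log(x)][, b := log(y)][, c := a + b]

# Compare memory addresses to see whether dt2 is a copy or the same table
same_object <- identical(address(dt1), address(dt2))
same_object

# Column names of dt1 after the chained call
names(dt1)
```

If `same_object` is TRUE, the chained call did the same three in-place column additions as the sequential version, which is consistent with the identical memory use reported in the question.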

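A single system.time() call measures only one run, which is too noisy to resolve a 15 to 20% difference: run-to-run variation from garbage collection and other system activity can easily be that large.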
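
To get a feel for that noise, the following minimal sketch times the same sequential code several times on a fresh table and looks at the spread. The helper `time_once()`, the row count `n`, and the choice of ten repetitions are all assumptions made for illustration, with the table kept much smaller than in the question so the loop finishes quickly.

```r
library(data.table)

# Smaller table than in the question (an assumption, to keep the run short)
n <- 1e6

# Hypothetical helper: time one run of the sequential method on a fresh table
time_once <- function() {
  dt <- data.table(x = 1:n, y = n:1)
  system.time({
    dt[, a := log(x)]
    dt[, b := log(y)]
    dt[, c := a + b]
  })[["elapsed"]]
}

# Ten runs of identical code: any spread here is measurement noise
elapsed <- replicate(10, time_once())
range(elapsed)
```

If the gap between the smallest and largest elapsed time is comparable to the gap between the two methods, a single pair of system.time() results cannot distinguish them, which is why a repeated benchmark such as microbenchmark is the better tool.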
