Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Coalesce column values based on mapping to a vector of values

If I have the following two objects:

> set.seed(100)
> lookup <- sample(1:3, 20, replace=T)
> lookup
[1] 2 3 2 3 1 2 2 3 2 2 3 2 2 3 3 3 3 2 1 3

and

> tb <- tibble(A=runif(20,0,1), B=runif(20,0,1), C= runif(20,0,1))
> tb

> tb
# A tibble: 20 Ă— 3
       A     B      C
   <dbl> <dbl>  <dbl>
 1 0.770 0.780 0.456 
 2 0.882 0.884 0.445 
 3 0.549 0.208 0.245 
 4 0.278 0.307 0.694 
 5 0.488 0.331 0.412 
 6 0.929 0.199 0.328 
 7 0.349 0.236 0.573 
 8 0.954 0.275 0.967 
 9 0.695 0.591 0.662 
10 0.889 0.253 0.625 
11 0.180 0.123 0.857 
12 0.629 0.230 0.775 
13 0.990 0.598 0.834 
14 0.130 0.211 0.0915
15 0.331 0.464 0.460 
16 0.865 0.647 0.599 
17 0.778 0.961 0.920 
18 0.827 0.676 0.983 
19 0.603 0.445 0.0378
20 0.491 0.358 0.578

How do I use lookup to select the value of the corresponding row/column from tb?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

i.e.

  • if the first element of lookup = 1 then I would like to select the value in A from the first row of tb
  • if the second element of lookup = 2 then I would like to select the value in B from the second row of tb

So I should end up with a 1d vector that is the same size as lookup. It will look like this:

> new data
> [1] 0.780 0.445 0.208 0.694 0.488 ... 0.578

Thanks!

>Solution :

data.frame (but not tibble or data.table) supports indexing on a matrix, so with this data,

set.seed(42)
lookup <- sample(1:3, 20, replace=T)
lookup
#  [1] 1 1 1 1 2 2 2 1 3 3 1 1 2 2 2 3 3 1 1 3
tb <- tibble(A=runif(20,0,1), B=runif(20,0,1), C= runif(20,0,1))
head(tb)
# # A tibble: 6 x 3
#       A     B      C
#   <dbl> <dbl>  <dbl>
# 1 0.514 0.958 0.189 
# 2 0.390 0.888 0.271 
# 3 0.906 0.640 0.828 
# 4 0.447 0.971 0.693 
# 5 0.836 0.619 0.241 
# 6 0.738 0.333 0.0430

We can do

as.data.frame(tb)[cbind(seq_along(lookup), lookup)]
#  [1] 0.514211784 0.390203467 0.905738131 0.446969628 0.618838207 0.333427211 0.346748248 0.388108283 0.479398564
# [10] 0.197410342 0.832916080 0.007334147 0.171264330 0.261087964 0.514412935 0.581604003 0.157905208 0.037431033
# [19] 0.973539914 0.775823363

A less-efficient method can be done without as.data.frame:

mapply(`[[`, list(tb), seq_along(lookup), lookup)
#  [1] 0.514211784 0.390203467 0.905738131 0.446969628 0.618838207 0.333427211 0.346748248 0.388108283 0.479398564
# [10] 0.197410342 0.832916080 0.007334147 0.171264330 0.261087964 0.514412935 0.581604003 0.157905208 0.037431033
# [19] 0.973539914 0.775823363
## also works with `list(as.data.table(tb))`

Though it does take a big hit in performance (not a surprise):

bench::mark(
  sindri_baldur1 = unlist(tb, use.names = FALSE)[seq_along(lookup) + (lookup - 1L)*nrow(tb)], 
  sindri_baldur2 = unlist(tb)[seq_along(lookup) + (lookup - 1L)*nrow(tb)], 
  base = as.data.frame(tb)[cbind(seq_along(lookup), lookup)], 
  mapply = mapply(`[[`, list(tb), seq_along(lookup), lookup), 
  paulsmith2 = {
tb %>% 
  mutate(lookup = lookup) %>% 
  rowwise %>% 
  mutate(new = c_across(A:C)[lookup]) %>% 
  pull(new)
},
  check = FALSE)
# # A tibble: 5 x 13
#   expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory  time   gc    
#   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>  <list> <list>
# 1 sindri_baldur1    4.5us    5.3us   159430.      736B    15.9   9999     1     62.7ms <NULL> <Rprof~ <benc~ <tibb~
# 2 sindri_baldur2   13.2us   14.7us    56723.    1.44KB     0    10000     0    176.3ms <NULL> <Rprof~ <benc~ <tibb~
# 3 base             78.3us   91.6us     7334.      944B     8.59  3414     4    465.5ms <NULL> <Rprof~ <benc~ <tibb~
# 4 mapply          612.4us 779.45us      942.      720B     6.39   442     3    469.4ms <NULL> <Rprof~ <benc~ <tibb~
# 5 paulsmith2       4.37ms   5.85ms      147.    20.3KB     6.51    68     3    461.1ms <NULL> <Rprof~ <benc~ <tibb~

(I have to use check=FALSE to work with the names introduced in sindri_baldur2, otherwise all results are numerically identical.)

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading