Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

join two data.table based on pattern matching

Given:

library(data.table)
dat1 <- setDT(data.frame(pat=c("A.C",".BC"),val=c(1,2)))
dat2 <- setDT(data.frame(q=c("ABC","AXC","XBC"),val2=c(10,11,12)))

I want the result to be:

dat3 <- setDT(data.frame(pat=c("A.C","A.C",".BC",".BC"),val=c(1,1,2,2),q=c("ABC","AXC","ABC","XBC"),val2=c(10,11,10,12)))
dat3
   pat val   q val2
1: A.C   1 ABC   10
2: A.C   1 AXC   11
3: .BC   2 ABC   10
4: .BC   2 XBC   12

In other words, a left join of dat1 to dat2 for each match of regex pattern pat to query string q
I was wondering if this is possible with a concise data.table merge expression, i.e.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

dat1[dat2, .(pat, val, q, val2), on= .(grepl(pat,q))] # this does not work

or any other data.table trick for efficiency. In reality dat1 is hundreds of rows and dat2 can be 10-100 thousand rows.
This sounds simple enough, but I haven’t found a post that quite covers it.

The closest I’ve gotten to is the following:

match_pat_to_q <- function(pattern, data, data.col="q"){
  ret<-lapply(pattern, function(x){
    data[grepl(x,get(data.col))]
  })
  names(ret) <- pattern # becomes an .id column in next step
  rbindlist(ret, idcol=TRUE)
}

match_pat_to_q(dat1$pat, dat2)[dat1, on=.(.id==pat)]

   .id   q val2 val
1: A.C ABC   10   1
2: A.C AXC   11   1
3: .BC ABC   10   2
4: .BC XBC   12   2

>Solution :

You can solve your probelm as follows:

dat1[, dat2[grep(pat, q), .(val, x=q, val2)], by=pat]

      pat   val      x  val2
   <char> <num> <char> <num>
1:    A.C     1    ABC    10
2:    A.C     1    AXC    11
3:    .BC     2    ABC    10
4:    .BC     2    XBC    12

You group dat1 by pat. This allows to logically create one group for each pattern in dat1. Then for each group, you filter the corresponding data from dat2 using pattern matching while selecting the variables needed at the same time.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading