dplyr solution to exact and partial-string join

June 5, 2023

I need to join two datasets by two columns, one column by an exact match and the other one by a partial match:

pf=data.frame('exact'=c('s1','s2','s3','s2','s4'),'id_part'=c('a','a','a','b','c'), 'value'=c(1,2,3,4,5))
> pf
  exact id_part value
1    s1       a     1
2    s2       a     2
3    s3       a     3
4    s2       b     4
5    s4       c     5

and

cj=data.frame('exact'=c('s1','s1','s4','s2','s4'), 'id_part'=c('saf','r2a@ff','k5-a','6b4-d','ab1'))
> cj
  exact id_part
1    s1     saf
2    s1  r2a@ff
3    s4    k5-a
4    s2   6b4-d
5    s4     ab1

the desired outcome should be a datasets with the same rows as cj plus an additional columns with values form pf, in a way that if pf$exact==cj$exact & pf$id_part %in% cj$id_part,then output$value<- pf$value, else output$value<-0:

> output
  exact id_part value
1    s1     saf     1
2    s1  r2a@ff     1
3    s4    k5-a     0
4    s2   6b4-d     4
5    s4     ab1     0

i.e. an exact match on the exact column and a partial match on the id_part column. I tried with stringdist_inner_join(cj,pf, by=c('exact','id_part'), method='lv') and similar, but got nowhere.

any help appreciated.

>Solution :

You can use regex_left_join:

library(fuzzyjoin)
library(dplyr)
library(tidyr)

regex_left_join(cj, pf) |>
  mutate(value = replace_na(value, 0))

#   exact.x id_part.x exact.y id_part.y value
# 1      s1       saf      s1         a     1
# 2      s1    r2a@ff      s1         a     1
# 3      s4      k5-a    <NA>      <NA>     0
# 4      s2     6b4-d      s2         b     4
# 5      s4       ab1    <NA>      <NA>     0