Is there a way to join two dataframes of unequal lengths by str_detect() or other string match function?

Advertisements

I have two data frames and I want to join them by detecting a string in one of the columns. Say I have these sample data frame columns:

df <- tibble(value = c("a <- 1:3", "b <- function()", "c <- rnorm(1:10)", "d <- c(x, y, z)"), 
             line = 1:4)

dfSearch <- c("a", "b", "c") %>% as_tibble()

And I want to join them where the values of dfSearch can be found in the strings of df, so that it looks like this:

value  line
  a     1
  b     2
  c     3
  d     NA             

However, str_detect() doesn’t work with vectors of unequal lengths. This is what I’ve tried:

new_df <- dfSearch %>%
filter(str_detect(value, df$value))

or

new_df <- dfSearch %>%
inner_join(., df, by=str_detect(value, df$value))

And each time I get the same error message: Error in str_detect(): ! Can't recycle string (size 3) to match pattern (size 4).

Any ideas how I can accomplish this?

>Solution :

I don’t think we need a join here, we can do this:

df %>%
  mutate(line2 = if_else(sub(" .*", "", value) %in% dfSearch$value, line, line[NA]))
# # A tibble: 4 × 3
#   value             line line2
#   <chr>            <int> <int>
# 1 a <- 1:3             1     1
# 2 b <- function()      2     2
# 3 c <- rnorm(1:10)     3     3
# 4 d <- c(x, y, z)      4    NA

If you need to join for other reasons, then …

dfSearch <- tibble(value = c("a", "b", "c"), insearch = TRUE)
df %>%
  mutate(value = sub(" .*", "", value)) %>%
  left_join(dfSearch, by = "value")
# # A tibble: 4 × 3
#   value  line insearch
#   <chr> <int> <lgl>   
# 1 a         1 TRUE    
# 2 b         2 TRUE    
# 3 c         3 TRUE    
# 4 d         4 NA      

where you can use insearch to NA-out the line if needed.

Another option is a fuzzy join:

dfSearch %>%
  mutate(re = paste0("^", value, " ")) %>%
  fuzzyjoin::regex_full_join(df, ., by = c("value" = "re"))
# # A tibble: 4 × 5
#   value.x           line value.y insearch re   
#   <chr>            <int> <chr>   <lgl>    <chr>
# 1 a <- 1:3             1 a       TRUE     "^a "
# 2 b <- function()      2 b       TRUE     "^b "
# 3 c <- rnorm(1:10)     3 c       TRUE     "^c "
# 4 d <- c(x, y, z)      4 NA      NA        NA  

Leave a ReplyCancel reply