I have two data frames and I want to join them by detecting a string in one of the columns. Say I have these sample data frame columns:
df <- tibble(value = c("a <- 1:3", "b <- function()", "c <- rnorm(1:10)", "d <- c(x, y, z)"),
line = 1:4)
dfSearch <- c("a", "b", "c") %>% as_tibble()
And I want to join them where the values of dfSearch
can be found in the strings of df
, so that it looks like this:
value line
a 1
b 2
c 3
d NA
However, str_detect()
doesn’t work with vectors of unequal lengths. This is what I’ve tried:
new_df <- dfSearch %>%
filter(str_detect(value, df$value))
or
new_df <- dfSearch %>%
inner_join(., df, by=str_detect(value, df$value))
And each time I get the same error message: Error in str_detect(): ! Can't recycle string (size 3) to match pattern (size 4).
Any ideas how I can accomplish this?
>Solution :
I don’t think we need a join here, we can do this:
df %>%
mutate(line2 = if_else(sub(" .*", "", value) %in% dfSearch$value, line, line[NA]))
# # A tibble: 4 × 3
# value line line2
# <chr> <int> <int>
# 1 a <- 1:3 1 1
# 2 b <- function() 2 2
# 3 c <- rnorm(1:10) 3 3
# 4 d <- c(x, y, z) 4 NA
If you need to join for other reasons, then …
dfSearch <- tibble(value = c("a", "b", "c"), insearch = TRUE)
df %>%
mutate(value = sub(" .*", "", value)) %>%
left_join(dfSearch, by = "value")
# # A tibble: 4 × 3
# value line insearch
# <chr> <int> <lgl>
# 1 a 1 TRUE
# 2 b 2 TRUE
# 3 c 3 TRUE
# 4 d 4 NA
where you can use insearch
to NA
-out the line
if needed.
Another option is a fuzzy join:
dfSearch %>%
mutate(re = paste0("^", value, " ")) %>%
fuzzyjoin::regex_full_join(df, ., by = c("value" = "re"))
# # A tibble: 4 × 5
# value.x line value.y insearch re
# <chr> <int> <chr> <lgl> <chr>
# 1 a <- 1:3 1 a TRUE "^a "
# 2 b <- function() 2 b TRUE "^b "
# 3 c <- rnorm(1:10) 3 c TRUE "^c "
# 4 d <- c(x, y, z) 4 NA NA NA