How can I filter a data frame by all values similar to the contents of a string vector?

I am trying to filter a dataset by a vector (the column of another dataset), but instead of matching the items using %in%, I am looking to return items with a similar pattern to the items in the vector.

By similar, I mean if an item in the vector has 2 words e.g. "Orange juice", I would want to filter the data frame for all items with the word "Orange" i.e. the first word.

Below is an example, which hopefully explains better what I’m looking for!
Thank you so much in advance.

# Here is some sample data
Data <- data.frame(
col_1=c("Orange juice", "Orange cake", "Lemon curd", "Lemon pie", "Strawberry", "Lime tree"), 
col_2=c("food", "food", "food", "food", "fruit", "tree"))


# I want to filter this data by a vector (taken from another data frame) to return items that are similar to the first word of items in the vector 
vector <- c("Orange", "Lemon ltd", "Grapefruit", "Peach juice")

# I'm looking for something like this:
Data %>% filter(col_1 %like% vector)

# or something like this:
Data %>% filter(str_detect(col_1, pattern = "first word of items in vector" ))

To get this output:

  • col_1 <- "Orange juice", "Orange cake", "Lemon curd", "Lemon pie"
  • col_2 <- "food", "food", "food", "food"

Solution:

Something like this?

library(dplyr, warn.conflicts = FALSE)
library(stringr)

df <- tibble(
  food_name = c("Orange juice", "Orange cake", "Lemon curd", "Lemon pie", "Strawberry", "Lime tree"), 
  food_category = c("food", "food", "food", "food", "fruit", "tree")
)

patterns <- c("Orange", "Lemon ltd", "Grapefruit", "Peach juice")


df %>% 
  filter(
    # First word of `food_name` is 'in' first words of `patterns`
    str_extract(food_name, "[^\\s]+") %in% str_extract(patterns, "[^\\s]+")
  )
#> # A tibble: 4 × 2
#>   food_name    food_category
#>   <chr>        <chr>        
#> 1 Orange juice food         
#> 2 Orange cake  food         
#> 3 Lemon curd   food         
#> 4 Lemon pie    food

Created on 2022-10-18 with reprex v2.0.2
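
If you would rather match the first word of each pattern anywhere in `food_name`, and not only as its first word, one option is to build a single regular expression and pass it to `str_detect()`, which is closer to the second attempt in the question. This is only a sketch: the names `first_words`, `rx` and `result` are made up for illustration, and it assumes the first words contain no regex metacharacters (such as `.` or `(`).

```r
library(dplyr, warn.conflicts = FALSE)
library(stringr)

df <- tibble(
  food_name = c("Orange juice", "Orange cake", "Lemon curd", "Lemon pie", "Strawberry", "Lime tree"),
  food_category = c("food", "food", "food", "food", "fruit", "tree")
)

patterns <- c("Orange", "Lemon ltd", "Grapefruit", "Peach juice")

# First word of each pattern, de-duplicated
first_words <- unique(str_extract(patterns, "\\S+"))

# One alternation wrapped in word boundaries, so "Lemon" does not match "Lemonade"
# (assumes the words contain no regex metacharacters)
rx <- str_c("\\b(?:", str_c(first_words, collapse = "|"), ")\\b")

# Keep rows whose food_name contains any of the first words as a whole word
result <- df %>%
  filter(str_detect(food_name, rx))

result
```

On the sample data this keeps the same four rows as the `%in%` approach. The two differ only when the word appears later in the string: a hypothetical "Blood Orange" row would be kept here but dropped by the first-word comparison.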
