How to search a vector of words for words containing two specific letters

I’ve got a vector of 5-letter words, and I want to write a function that extracts the words containing ALL of the letters in a given pattern.

For example, if my vector is ("aback", "abase", "abate", "agate", "allay") and I’m looking for words that contain BOTH "a" and "b", I want the function to return ("aback", "abase", "abate"). I don’t care what position or how many times these letters occur in the words, only that the words contain both of them.

I’ve tried to do this by writing a function that combines several grepl() calls with &, but the problem is that grepl() doesn’t accept a vector as the pattern. My plan was for the function to compute grepl("a", word_vec) & grepl("b", word_vec). It also needs to scale, so that I can search for, say, all words containing "a" AND "b" AND "c".

grepl_cat <- function(str, words_vec) {
      
      pat <- str_split(str, "")
      
      first_let = TRUE

      for (i in 1:length(pat)) {
        if (first_let){
          result <- sapply(pat[i], grepl, x = word_vec)
          first_let <- FALSE
        } 
        print(pat[i])
        result <- result & sapply(pat[i], grepl, x = word_vec)
        
      }
      
      return(result)
}

word_vec[grepl_cat("abc", word_vec)]

The function I’ve written above definitely isn’t doing what it’s intended to do.

I’m wondering if there is an easier way to do this with regex patterns, or a way to pass each letter in str to grepl() individually rather than as a vector.

>Solution :

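A possible solution with stringr, using one lookahead per required letter so that the letters can appear in any order and at any position:

s <- c("aback", "abase", "abate", "agate", "allay")

subset(s, stringr::str_detect(s, "(?=.*a)(?=.*b)"))
#> [1] "aback" "abase" "abate"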
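
To scale this to any number of letters, the lookahead pattern can be built from the search string instead of typed by hand. The sketch below uses base R only; `contains_all` is a hypothetical helper name, and it assumes the search string holds plain letters (regex metacharacters such as "." would need escaping).

```r
# Hypothetical helper: TRUE for each word that contains every letter in letters_str.
# Assumes letters_str contains only plain letters (no regex metacharacters).
contains_all <- function(letters_str, words) {
  # Split "abc" into c("a", "b", "c")
  chars <- strsplit(letters_str, "")[[1]]
  # One lookahead per letter, e.g. "(?=.*a)(?=.*b)(?=.*c)"
  pattern <- paste0("(?=.*", chars, ")", collapse = "")
  # Lookaheads require the PCRE engine, hence perl = TRUE
  grepl(pattern, words, perl = TRUE)
}

s <- c("aback", "abase", "abate", "agate", "allay")

s[contains_all("ab", s)]
#> [1] "aback" "abase" "abate"

s[contains_all("abc", s)]
#> [1] "aback"
```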

Another possible solution, based on tidyverse:

library(tidyverse)

s <- c("aback", "abase", "abate", "agate", "allay")

s %>% 
  data.frame(s = .) %>% 
  filter(str_detect(s, "(?=.*a)(?=.*b)")) %>% 
  pull(s)

#> [1] "aback" "abase" "abate"
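
The original plan of combining one grepl() per letter with & can also be made to work without any regex tricks. In the question's function, str_split() returns a list of length one, so the loop runs only once and grepl() receives the whole letter vector as its pattern (it uses just the first element). The body also refers to word_vec while the argument is named words_vec. A minimal sketch of a repaired version, with the hypothetical name `grepl_all`, might look like this:

```r
# Hypothetical helper: AND together one grepl() result per letter.
grepl_all <- function(letters_str, words) {
  # strsplit() returns a list; [[1]] extracts the character vector of letters
  chars <- strsplit(letters_str, "")[[1]]
  # One logical vector per letter; fixed = TRUE treats each letter literally
  hits <- lapply(chars, grepl, x = words, fixed = TRUE)
  # Element-wise AND across all letters, starting from all TRUE
  Reduce(`&`, hits, rep(TRUE, length(words)))
}

s <- c("aback", "abase", "abate", "agate", "allay")

s[grepl_all("ab", s)]
#> [1] "aback" "abase" "abate"
```

Because each letter is matched with fixed = TRUE, this variant needs no escaping, at the cost of one pass over the vector per letter.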