Find two keywords if they are between 0 and 3 words apart

I want to identify strings which feature two keywords that have between 0 and 3 words between them. What I have works in most cases:

strings <- c(
  "Today is my birthday",
  "Today is not yet my birthday",
  "Today birthday",
  "Today maybe?",
  "Today: birthday"
)


grepl("Today(\\s\\w+){0,3}\\sbirthday", strings, ignore.case = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE FALSE

Created on 2021-11-24 by the reprex package (v2.0.1)

My issue is with the string "Today: birthday". The problem is that a word is defined as (\\s\\w+), which requires whitespace directly before each word and so leaves no room for punctuation in the sentence. How can I better define the regex for a word so that punctuation does not prevent a match (ideally it would simply be ignored)?

Solution:

You can use

grepl("Today(\\W+\\w+){0,3}\\W+birthday", strings, ignore.case = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

Also, consider using word boundaries, non-capturing groups, and the more stable PCRE regex engine:

grepl("\\bToday(?:\\W+\\w+){0,3}\\W+birthday\\b", strings, ignore.case = TRUE, perl=TRUE)

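The (?:\W+\w+){0,3}\W+ part matches zero to three repetitions of one or more non-word chars (\W+) followed by one or more word chars (\w+), that is, up to three intervening words, each preceded by whitespace and/or punctuation. It then matches one or more non-word chars before the second keyword. Because \W matches any non-word character, punctuation such as the colon in "Today: birthday" is skipped just like a space.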
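
To make the pattern above easier to reuse, here is a minimal sketch that wraps it in a function. The name `keywords_within` and its `max_gap` argument are hypothetical and not part of the original answer. It also assumes both keywords are plain words with no regex metacharacters.

```r
# Hypothetical helper (not from the original answer): TRUE where `first` is
# followed by `second` with at most `max_gap` words in between.
# Assumes `first` and `second` contain no regex metacharacters.
keywords_within <- function(x, first, second, max_gap = 3) {
  pattern <- sprintf(
    "\\b%s(?:\\W+\\w+){0,%d}\\W+%s\\b",
    first, as.integer(max_gap), second
  )
  grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
}

strings <- c(
  "Today is my birthday",
  "Today is not yet my birthday",
  "Today birthday",
  "Today maybe?",
  "Today: birthday"
)

keywords_within(strings, "Today", "birthday")
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

# The trailing \\b rejects partial-word hits such as "birthdays"
keywords_within("Today is my birthdays", "Today", "birthday")
#> [1] FALSE
```

The second call shows what the word boundaries add: without the trailing \\b, "birthdays" would count as a match for "birthday".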
