Find two keywords if they are between 0 and 3 words apart


I want to identify strings in which two keywords are separated by zero to three words. What I have works in most cases:

strings <- c(
  "Today is my birthday",
  "Today is not yet my birthday",
  "Today birthday",
  "Today maybe?",
  "Today: birthday"
)


grepl("Today(\\s\\w+){0,3}\\sbirthday", strings, ignore.case = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE FALSE

Created on 2021-11-24 by the reprex package (v2.0.1)

My issue is with the string "Today: birthday". The problem is that a word is defined as (\\s\\w+), which leaves no room for punctuation in the sentence. How can I define a word in the regex so that punctuation is not excluded (ideally, it would simply be ignored)?

>Solution :

You can use

grepl("Today(\\W+\\w+){0,3}\\W+birthday", strings, ignore.case = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

Also, consider using word boundaries (\\b), a non-capturing group ((?:...)), and the PCRE regex engine (perl = TRUE), which tends to behave more predictably than R's default TRE engine:

grepl("\\bToday(?:\\W+\\w+){0,3}\\W+birthday\\b", strings, ignore.case = TRUE, perl = TRUE)

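The (?:\W+\w+){0,3}\W+ part matches zero to three repetitions of one or more non-word characters (\W+) followed by one or more word characters (\w+), that is, up to three intervening words, each preceded by whitespace and/or punctuation. It then matches one or more non-word characters (\W+) before the second keyword.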
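To see what the word boundaries add, here is a small sketch that runs the pattern with and without \\b on a few extra strings. The test strings and the names extra, loose, strict, res_loose and res_strict are made up for illustration and are not part of the question. Both patterns use perl = TRUE so that the boundaries are the only difference.

```r
# Hypothetical test strings (not from the original question)
extra <- c(
  "Today, it is my birthday!",     # 3 words between, with punctuation
  "Today is not yet my birthday",  # 4 words between
  "Today's birthday",              # "'s" counts as one intervening word
  "Today is my birthdayparty",     # second keyword is only a prefix
  "Yestertoday birthday"           # first keyword is only a suffix
)

# Same pattern twice; only the \\b word boundaries differ
loose  <- "Today(?:\\W+\\w+){0,3}\\W+birthday"
strict <- "\\bToday(?:\\W+\\w+){0,3}\\W+birthday\\b"

res_loose  <- grepl(loose,  extra, ignore.case = TRUE, perl = TRUE)
res_strict <- grepl(strict, extra, ignore.case = TRUE, perl = TRUE)

res_loose
#> [1]  TRUE FALSE  TRUE  TRUE  TRUE
res_strict
#> [1]  TRUE FALSE  TRUE FALSE FALSE
```

Without the boundaries, the keywords also match inside longer words such as "birthdayparty". Note that an apostrophe is a non-word character, so "Today's" is read as the keyword followed by the one-letter word "s".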
