I would like to check whether a text contains a) three consonants in a row or b) four identical letters in a row. Can someone please help me with the regular expressions?
library(tidyverse)
df <- data.frame(text = c("Completely valid", "abcdefg", "blablabla", "flahaaaa", "asdf", "another text", "a last one", "sj", "ngbas"))
consonants <- c("q", "w", "r", "t", "z", "p", "s", "d", "f", "g", "h", "k", "l", "m", "n", "b", "x")
df %>% mutate(
invalid = FALSE,
# Length too short
invalid = ifelse(nchar(text)<3, TRUE, invalid),
# Contains three consonants in a row: e.g. "ngbas"
invalid = ifelse(str_detect(text,"???"), TRUE, invalid), # <--- Regex missing; keep earlier result otherwise
# More than 3 identical characters in a row: e.g. "flahaaaa"
invalid = ifelse(str_detect(text,"???"), TRUE, invalid) # <--- Regex missing; keep earlier result otherwise
)
>Solution :
Three consonants in a row:
[qwrtzpsdfghklmnbx]{3}
Four or more identical characters in a row:
([a-z])(\\1){3}
# The backslash is doubled because it is also the escape character in R string literals; the regex engine itself only sees \1.
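To see what the regex engine actually receives, the string can be printed with writeLines(). The sketch below also shows a raw string as an alternative way to avoid the doubling; raw strings assume R 4.0 or later, which the original answer does not mention.

```r
# The R string literal contains an escaped backslash ...
pattern <- "([a-z])(\\1){3}"

# ... but the string itself holds a single backslash
writeLines(pattern)

# Assumption: R >= 4.0, where raw strings r"(...)" need no escaping
raw_pattern <- r"(([a-z])(\1){3})"

identical(pattern, raw_pattern)
```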
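The latter uses a backreference. The number identifies the capture group (the expression in parentheses) by its position, counting from 1. Here the group is the character class of Latin lowercase letters, and the backreference matches exactly the letter that the group captured, not just any letter from the class. The pattern therefore finds one letter followed by three more copies of itself.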
For simplicity, both patterns match lowercase letters only; uppercase letters are not handled here.
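Plugged into the data frame from the question, the two patterns could be used roughly as follows. This is a sketch, not the only way to write it: wrapping the patterns in stringr's regex() with ignore_case = TRUE is an addition to cover uppercase letters, the names three_consonants, four_identical and result are made up for illustration, and the consonant set is copied unchanged from the question (it omits c, j, v and y).

```r
library(dplyr)
library(stringr)

df <- data.frame(text = c("Completely valid", "abcdefg", "blablabla", "flahaaaa",
                          "asdf", "another text", "a last one", "sj", "ngbas"))

# Consonant set taken as-is from the question (c, j, v and y are not in it)
three_consonants <- regex("[qwrtzpsdfghklmnbx]{3}", ignore_case = TRUE)

# One letter followed by three more copies of itself
four_identical <- regex("([a-z])\\1{3}", ignore_case = TRUE)

result <- df %>%
  mutate(
    # A row is invalid as soon as any one of the three checks fires
    invalid = nchar(text) < 3 |
      str_detect(text, three_consonants) |
      str_detect(text, four_identical)
  )

result
```

Combining the checks with | avoids the chained ifelse() calls, where a later check can accidentally overwrite an earlier result. Note that with this consonant set "Completely valid" is flagged as well, because "mpl" is three listed consonants in a row, so the rule may need refining for real text.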
Without backreferences, the solution gets a bit lengthy:
(aaaa|bbbb|cccc|dddd|eeee|ffff|gggg|hhhh|iiii|jjjj|kkkk|llll|mmmm|nnnn|oooo|pppp|qqqq|rrrr|ssss|tttt|uuuu|vvvv|wwww|xxxx|yyyy|zzzz)
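If the backreference-free form is needed, it does not have to be typed by hand. A minimal base R sketch that builds the same alternation from the built-in letters vector (the variable names are illustrative):

```r
# "aaaa", "bbbb", ..., "zzzz" from the built-in lowercase alphabet
alternatives <- strrep(letters, 4)

# Join with | and wrap in parentheses
pattern <- paste0("(", paste(alternatives, collapse = "|"), ")")

grepl(pattern, c("flahaaaa", "blablabla"))
```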
The relevant docs can be found here.