Flawed logic with RegEx and numeric ranges

April 25, 2024

I’m trying to create a new variable called ‘group’ in a dataset called ‘data’. The variable ‘group’ should take the value "A" or "B" depending on how another variable in the dataset (of character type) ends. It just so happens that they end in a number from 7 to 24 after an underscore, as follows:

So, I want the new variable ‘group’ to be "A" when the ending number is 7 to 15 both inclusive, and "B" when the ending number is 16 to 24, again both inclusive.

I tried this mutate() function using str_detect() to discriminate within the character variable of interest:

data %>%
mutate(group = case_when(str_detect(string = year, pattern = "[7-9]|1[0-5]$") ~ "A",
                         str_detect(string = year, pattern = "1[6-9]|2[0-4]$") ~ "B"))

However, the resulting output is not quite right, as you can see below.

What is wrong with the logic of case_when(), or with the RegEx itself, such that the numbers 16 to 19 also get the value "A"?

Thanks in advance!

>Solution :

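The issue you are experiencing is described in Regular expression pipe confusion: alternation has the lowest precedence, so [7-9]|1[0-5]$ means "a 7, 8, or 9 anywhere in the string, or a number from 10 to 15 at the end of the string". The $ anchor applies only to the second alternative.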

However, adding parentheses alone does not help here: ([7-9]|1[0-5])$ still matches the final digit of 17, 18, or 19. You need to check the whole number, and its left boundary is the underscore.
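The precedence problem can be checked directly on a few strings. The following is a minimal sketch, assuming stringr is installed; the "yr_" prefix and the vector x are hypothetical stand-ins, since the real values are not shown in the question.

```r
library(stringr)

# Hypothetical sample values; the real prefix is not shown in the question.
x <- c("yr_7", "yr_15", "yr_16", "yr_17", "yr_24")

# Original pattern: "$" binds only to the second alternative,
# so a 7, 8, or 9 anywhere in the string is enough to match.
original <- str_detect(x, "[7-9]|1[0-5]$")

# Grouped pattern: "[7-9]$" still matches the last digit of "yr_17".
grouped <- str_detect(x, "([7-9]|1[0-5])$")

# Anchored pattern: the underscore fixes the left boundary of the number.
anchored <- str_detect(x, "_([7-9]|1[0-5])$")

print(data.frame(x, original, grouped, anchored))
```

Of the three patterns, only the anchored one rejects "yr_17"; the original and the grouped pattern both accept it.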

Thus, you can implement the following logic: classify all strings ending with _ followed by a number from 16 to 24 as "B", and all the rest as "A".

You can use

data %>%
  mutate(group = case_when(
    # "A": does not end with _16 to _24; "B": ends with _16 to _24
    str_detect(string = year, pattern = "$(?<!_1[6-9]|_2[0-4])") ~ "A",
    str_detect(string = year, pattern = "_(1[6-9]|2[0-4])$") ~ "B"))

See Regex "A" demo and Regex "B" demo.

Details:

_(1[6-9]|2[0-4])$ – matches a _, then either 1 followed by a digit from 6 to 9 or 2 followed by a digit from 0 to 4, at the end of the string
$(?<!_1[6-9]|_2[0-4]) – matches the end of the string, and then fails the match if, immediately to the left, there is _1 followed by a digit from 6 to 9, or _2 followed by a digit from 0 to 4.
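
To check both patterns together, here is a minimal sketch that runs them on every ending from 7 to 24. It assumes dplyr and stringr are installed; the "yr_" prefix and the toy data frame are hypothetical, since the real values are not shown in the question.

```r
library(dplyr)
library(stringr)

# Hypothetical data: one string per ending number from 7 to 24.
data <- data.frame(year = paste0("yr_", 7:24))

result <- data %>%
  mutate(group = case_when(
    # Does not end with _16 to _24.
    str_detect(year, "$(?<!_1[6-9]|_2[0-4])") ~ "A",
    # Ends with _16 to _24.
    str_detect(year, "_(1[6-9]|2[0-4])$") ~ "B"
  ))

# Count how many strings fall into each group.
print(table(result$group))
```

With this input, the endings 7 to 15 get "A" and the endings 16 to 24 get "B".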