Flawed logic with RegEx and numeric ranges

I’m trying to create a new variable called ‘group’ in a dataset called ‘data’. The variable ‘group’ should take the value "A" or "B" depending on how another variable in the dataset (of character type) ends. It just so happens that they end in a number from 7 to 24 after an underscore, as follows:

[Image: sample of the dataset]

So, I want the new variable ‘group’ to be "A" when the ending number is 7 to 15 both inclusive, and "B" when the ending number is 16 to 24, again both inclusive.

I tried this mutate() function using str_detect() to discriminate within the character variable of interest:

data %>%
mutate(group = case_when(str_detect(string = year, pattern = "[7-9]|1[0-5]$") ~ "A",
                         str_detect(string = year, pattern = "1[6-9]|2[0-4]$") ~ "B")) 

However, the resulting output is not quite right, as you can see below.

[Image: resulting output]

What’s wrong in either the logic of case_when() or the RegEx itself that it gives the value "A" also to the numbers 16 to 19?

Thanks in advance!

Solution:

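The issue you are experiencing is described in "Regular expression pipe confusion": alternation has the lowest precedence, so [7-9]|1[0-5]$ means "a digit from 7 to 9 anywhere in the string, or 1 followed by a digit from 0 to 5 at the end of the string".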

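However, adding parentheses alone does not help here: ([7-9]|1[0-5])$ still matches a string ending in _17, because the final 7 satisfies [7-9]$. The pattern has to check the whole number, and the left boundary of that number is the underscore.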
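The difference between the three variants can be checked on a few made-up values. This is only a sketch: the year_N strings below are an assumption, since the real column is shown only in a screenshot.

```r
library(stringr)

# Hypothetical values; the real data is not available in the post.
x <- c("year_7", "year_15", "year_17", "year_19", "year_24")

# Original pattern: `[7-9]` anywhere in the string, OR `1[0-5]` at the end.
original <- str_detect(x, "[7-9]|1[0-5]$")
# TRUE TRUE TRUE TRUE FALSE  (17 and 19 match because they contain 7 and 9)

# Grouped: `$` now applies to both branches, but the last digit of
# 17 and 19 still satisfies `[7-9]$`.
grouped <- str_detect(x, "([7-9]|1[0-5])$")
# TRUE TRUE TRUE TRUE FALSE

# Anchored on the left by the underscore: the whole number must match.
anchored <- str_detect(x, "_([7-9]|1[0-5])$")
# TRUE TRUE FALSE FALSE FALSE
```

Only the last variant restricts the match to the complete number after the underscore.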

Thus, you can implement the following logic: classify all strings ending with _ followed by a number from 16 to 24 as "B", and all the rest as "A".

You can use

library(dplyr)
library(stringr)

data %>%
  mutate(group = case_when(str_detect(string = year, pattern = "$(?<!_1[6-9]|_2[0-4])") ~ "A",
                           str_detect(string = year, pattern = "_(1[6-9]|2[0-4])$") ~ "B"))

See Regex "A" demo and Regex "B" demo.
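To try both patterns together without the original dataset, here is a minimal sketch on a made-up data frame. The year_N values are an assumption; only the column name year and the two patterns come from the post.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the asker's `data`.
data <- data.frame(year = c("year_7", "year_15", "year_16", "year_17", "year_24"))

result <- data %>%
  mutate(group = case_when(
    # End of string NOT preceded by _16.._19 or _20.._24
    str_detect(year, "$(?<!_1[6-9]|_2[0-4])") ~ "A",
    # End of string preceded by _16.._19 or _20.._24
    str_detect(year, "_(1[6-9]|2[0-4])$") ~ "B"
  ))

print(result)
```

With these sample values, year_7 and year_15 fall into "A" and the other three into "B".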

Details:

  • _(1[6-9]|2[0-4])$ – matches a _, then either 1 and a digit from 6 to 9 or 2 and then a digit from 0 to 4 at the end of the string
  • $(?<!_1[6-9]|_2[0-4]) – matches the end of the string, and then fails the match if, immediately to the left, there is a _1 followed by a digit from 6 to 9, or a _2 followed by a digit from 0 to 4.
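As a cross-check, the same grouping can be done without encoding numeric ranges in a regex at all. This is a different technique from the one in the answer: it extracts the trailing number and compares it numerically. The sample values and the helper column n are assumptions made for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the asker's `data`.
data <- data.frame(year = c("year_7", "year_15", "year_16", "year_24"))

result <- data %>%
  mutate(
    # Digits at the end of the string that directly follow an underscore.
    n = as.integer(str_extract(year, "(?<=_)\\d+$")),
    group = case_when(
      n >= 7 & n <= 15  ~ "A",
      n >= 16 & n <= 24 ~ "B"
    )
  )

print(result)
```

Values outside 7 to 24, or strings without a trailing number, are left as NA here, which makes unexpected input easy to spot.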