Why does R `stringr::str_extract('word 42', pattern = '\\d*')` not produce `"42"`?

I have a vector of strings of the form "letters numbers", and I want to extract the numbers using the regex support in stringr::str_extract with the pattern "\\d*". The results are very confusing:

# R 4.2.3
# install.packages('stringr')
library(stringr)

# case 1
str_extract('word 42', '\\d*')
# ""

# case 2 (?)
str_extract('42 word', '\\d*')
# "42"

# case 3
str_extract('word 42', '\\d+')
# "42"

# case 4 (?!)
str_extract('word 42', '\\d*$')
# "42"

# case 5
str_extract('42 word', '\\d*$')
# ""

In all the cases the expected result is "42".
I am a novice with RegEx’s, but the pattern = '\\d*' seems pretty straightforward – I understand it as "match any number of consecutive numeric characters".

The fact that it doesn’t work for case 1 but does for case 2 is quite counterintuitive by itself. And then the roles seem to be reversed when using pattern = '\\d*$' (cases 4 and 5).

I have experimented more with other functions (str_match and str_match_all), but the results were still not clear.

I couldn’t find such a specific thing elsewhere, so I hoped more experienced R/RegEx users could provide a clarification on what is going on under the hood.

Solution:

> I understand it as "match any number of consecutive numeric characters".

Any number including zero. And it will match at the first position where the pattern succeeds. Because \d* can successfully match zero digits, it will never look anywhere besides the beginning of the string. If there are no digits there, then you get "".
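A small sketch of the difference between a zero-length match and no match at all may help here. The input vector `x` and the variable names are made up for illustration; only `str_extract` from stringr is used.

```r
library(stringr)

# Illustrative inputs (not from the original question except the first two)
x <- c('word 42', '42 word', 'a1b22')

# '\\d*' succeeds at the very first position of every string,
# matching zero digits unless the string happens to start with digits.
star <- str_extract(x, '\\d*')
star
# "" "42" ""

# A zero-length match is still a match, so the result is "" and never NA.
str_extract('word', '\\d*')
# ""

# With '\\d+' a string without digits gives no match, which stringr reports as NA.
str_extract('word', '\\d+')
# NA
```

Seeing `""` rather than `NA` is a useful hint that the pattern matched, just not where you hoped.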

Most likely, you want \d+ instead, which matches one or more digits. Then, the match will fail at positions where there aren’t any digits, and you will get the first string of digits in the string.
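As a minimal sketch of that suggestion, assuming the goal is the first run of digits in each string (the vector `x` below is invented for the example):

```r
library(stringr)

# Illustrative inputs; the last one has no digits at all
x <- c('word 42', '42 word', 'no digits')

# '\\d+' needs at least one digit, so the search keeps moving right
# until it finds one, and reports NA when there is none.
digits <- str_extract(x, '\\d+')
digits
# "42" "42" NA

# Convert to integers if numbers are what you need downstream.
numbers <- as.integer(digits)
numbers
# 42 42 NA
```

If a string can contain several numbers, `str_extract_all(x, '\\d+')` returns all of them as a list with one character vector per input string.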

But \d*$ works for you in case 4 because, again, it’s looking for the first position where there are zero or more digits followed by end of string. It could match zero digits at the end of string, but it doesn’t get a chance to because it finds the position right before the 42 before it finds the position right at the end of the string. In case 5 there are no digits at the end of the string so it has to wait until the end, where it can successfully match zero digits.
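The left-to-right search described above can be imitated by hand. The helper `first_success` below is hypothetical and only a rough model of what the regex engine does, not how stringr is implemented: it tries each start position in turn and stops at the first one where the rest of the string consists solely of digits, which is what `\d*$` requires of a match starting there.

```r
library(stringr)

# Hypothetical helper: walk the start positions from left to right and
# return the first one from which '\\d*$' could match.
first_success <- function(s) {
  n <- nchar(s)
  for (start in seq_len(n + 1)) {
    rest <- substr(s, start, n)  # "" when start == n + 1 (end of string)
    if (str_detect(rest, '^\\d*$')) {
      return(list(start = start, match = rest))
    }
  }
}

case4 <- first_success('word 42')  # succeeds at position 6, just before "42"
case5 <- first_success('42 word')  # only succeeds at position 8, the end of the string

case4$match
# "42"
case5$match
# ""
```

Both results agree with cases 4 and 5 in the question. If the intent is "digits at the end of the string, or nothing", `'\\d+$'` makes that explicit: it returns `NA` for `'42 word'` instead of an empty string.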
