I have the following vector of strings:
strings <- c("ABC0001", "ABC02", "ABC10", "ABC01010", "ABC11", "ABC011", "ABC0120")
"ABC0001" "ABC02" "ABC10" "ABC01010" "ABC11" "ABC011" "ABC0120"
Desired output:
[1] "ABC1" "ABC2" "ABC10" "ABC1010" "ABC11" "ABC11" "ABC120"
My question: what regular expression pattern matches the zeros that precede the first non-zero digit in a string?
So far I have tried:
library(stringr)
str_replace(strings, "0+", "")
which gives:
[1] "ABC1" "ABC2" "ABC1" "ABC1010" "ABC11" "ABC11" "ABC120"
Note: the ABC1 in position 3 is not desired; it should be ABC10.
I suspect this might be easy, but I can’t get it.
I want to learn the regular expression pattern!
>Solution :
Here is a base R option using sub, with lookarounds:
strings <- c("ABC0001", "ABC02", "ABC10", "ABC01010", "ABC11", "ABC011", "ABC0120")
output <- sub("(?<=.)0+(?=.)", "", strings, perl=TRUE)
output
[1] "ABC1" "ABC2" "ABC10" "ABC1010" "ABC11" "ABC11" "ABC120"
Here is an explanation of the regex pattern being used:
(?<=.) assert that some character precedes
0+ match one or more zeroes
(?=.) assert that some character follows
The (?<=.) and (?=.) constructs are called lookarounds. Here they ensure that the zeros we remove are neither at the very start nor at the very end of the input value. For an input like ABC110, we want the output to remain ABC110, i.e. the final zero should not be removed.
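One caveat worth checking: since . matches any character, including digits, the pattern above only looks at where the zeros sit, not at what surrounds them. It works on the sample vector because the first run of zeros in each string is either the leading one or at the very end. The sketch below is an addition to the answer, not part of it. It compares the original pattern with a stricter variant that requires a non-digit before the zeros and a digit after them. The vector extra and the names loose and strict are hypothetical, chosen only for illustration.

```r
# Hypothetical extra inputs (not from the question) with zeros inside
# or at the end of the numeric part.
extra <- c("ABC0001", "ABC10", "ABC101", "ABC100", "ABC000")

# Original pattern: any character may precede and follow the zeros.
loose <- sub("(?<=.)0+(?=.)", "", extra, perl = TRUE)

# Stricter sketch: the zeros must follow a non-digit (\\D)
# and be followed by a digit (\\d).
strict <- sub("(?<=\\D)0+(?=\\d)", "", extra, perl = TRUE)

print(loose)
print(strict)
```

On these inputs the original pattern turns ABC101 into ABC11 and ABC100 into ABC10, while the stricter variant leaves both unchanged. Both variants reduce ABC000 to ABC0, because the lookahead needs one digit left to match. Applied to the vector from the question, the stricter pattern gives the same desired output.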