I have the following dataframe, and I need to manipulate column a to get to column a_clean:
df=data.frame(a=c("1234-12;23456-123","12345-1234",NA,"1234-013;1234-014"),a_clean=c("01234-0012;23456-0123","12345-1234",NA,"1234-0013;1234-0014"))
I need to pad the number before each hyphen with leading zeros to five digits, and the number after each hyphen to four digits.
I don’t want to split a into separate rows and then concatenate them back together. My dataframe is very large, and I want the string manipulation to be as fast as possible.
>Solution :
gsubfn is like gsub, except that the replacement argument can be a function. The function receives the capture groups (the parts of the match that correspond to the parenthesized portions of the regular expression) as separate arguments, and its return value replaces the entire match. Here the pattern matches each pair of digit strings separated by a hyphen and passes them as x and y to a function written in formula notation. The function converts them to numeric, and sprintf pads them with leading zeros to 5 and 4 digits.
If you are using dplyr, replace transform with mutate.
library(gsubfn)
# x and y are the two capture groups: the digits before and after the hyphen
transform(df, clean =
  gsubfn("(\\d+)-(\\d+)", ~ sprintf("%05d-%04d", as.numeric(x), as.numeric(y)), a))
giving
                  a               a_clean                 clean
1 1234-12;23456-123 01234-0012;23456-0123 01234-0012;23456-0123
2        12345-1234            12345-1234            12345-1234
3              <NA>                  <NA>                    NA
4 1234-013;1234-014   1234-0013;1234-0014 01234-0013;01234-0014
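Since speed matters here, a base R variant that avoids calling an R function for every match may be worth trying. The following is only a sketch, not part of the gsubfn approach above: it over-pads both numbers with zeros using gsub and then trims each one back to its last 5 or 4 digits. It assumes no number is already longer than 5 digits before the hyphen or 4 after it (longer values would be truncated, unlike with sprintf), and the column name clean2 is made up for illustration. Whether it is actually faster on your data has not been measured here.

```r
# Same sample input as in the question.
df <- data.frame(a = c("1234-12;23456-123", "12345-1234", NA, "1234-013;1234-014"))

# Step 1: over-pad both parts of every pair with leading zeros.
padded <- gsub("(\\d+)-(\\d+)", "0000\\1-000\\2", df$a)

# Step 2: keep only the last 5 digits before the hyphen and the last 4 after it.
# The greedy \\d* swallows the surplus leading digits; NA stays NA.
# clean2 is a hypothetical column name used only for this sketch.
df$clean2 <- gsub("\\d*(\\d{5})-\\d*(\\d{4})", "\\1-\\2", padded)
```

Both steps are vectorized calls to gsub, so nothing has to be split into rows and pasted back together.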