Extract leading numbers from a string when their length varies (R)

I have a column of character strings containing letters and digits. Each string always starts with one or two digits, followed by several other characters. I am trying to split each string at the position of the first non-digit character.

library(tidyverse)  # provides tribble(), separate(), mutate(), and %>%

have <-
  tribble(
    ~string,
    '12main',
    '6six',
    '42go',
    '5to9'
  )

want <- 
  tribble(
    ~prefix, ~rest,
    '12', 'main',
    '6', 'six',
    '42', 'go',
    '5', 'to9'
  )

I’m sure there is a solution using separate() with a regex, but I’m having trouble getting it to work.

want <-
  have %>%
  separate(string,
           into = c('prefix', 'rest'),
           sep = "(?=[0-9])(?<=[a-zA-Z])")

Solution:

You were close. We can do it with one look-behind (for a digit) and one look-ahead (for a non-digit):

have %>%
  separate(string, sep = "(?<=[0-9])(?=[^0-9])", into = c("prefix", "rest"))
# # A tibble: 4 × 2
#   prefix rest 
#   <chr>  <chr>
# 1 12     main 
# 2 6      six  
# 3 42     go   
# 4 5      to9  

I think you had the look-arounds reversed: (?<=...) is a look-behind that tests the text before the split point (so it should be paired with [0-9]), and (?=...) is a look-ahead that tests the text after it (so it should be paired with [^0-9] or [A-Za-z]).

Personally I find this a bit intriguing: we are splitting on a zero-length pattern. The regex matches the position where the previous character is a digit and the next one is not, and there is nothing between those two characters, so the split consumes no text.
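To see the zero-length match for yourself, here is a minimal base-R sketch (no tidyverse needed). The names x, m, pos, prefix and rest are just illustrative; regexpr() with perl = TRUE reports where the boundary is and how long the match is:

```r
x <- c("12main", "6six", "42go", "5to9")

# Find the first digit/non-digit boundary in each string.
m <- regexpr("(?<=[0-9])(?=[^0-9])", x, perl = TRUE)

pos <- as.integer(m)            # 1-based position of the boundary
len <- attr(m, "match.length")  # 0 for every string: nothing is consumed

# Cut each string by hand at the boundary position.
prefix <- substr(x, 1, pos - 1)
rest   <- substr(x, pos, nchar(x))

data.frame(string = x, pos, len, prefix, rest)
```

Because the match has length 0, every character of the original string ends up in either prefix or rest.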

FYI, this does produce a warning if a string contains more than one such boundary, as in 5to9to5:

have <- structure(list(string = c("12main", "6six", "42go", "5to9", "5to9to5")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
have
# # A tibble: 5 × 1
#   string 
#   <chr>  
# 1 12main 
# 2 6six   
# 3 42go   
# 4 5to9   
# 5 5to9to5
have %>%
  separate(string, sep = "(?<=[0-9])(?=[^0-9])", into = c("prefix", "rest"))
# Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [5].
# # A tibble: 5 × 2
#   prefix rest 
#   <chr>  <chr>
# 1 12     main 
# 2 6      six  
# 3 42     go   
# 4 5      to9  
# 5 5      to9  

The warning tells you that some information is being discarded; it's up to you whether you want or need to guard against that.
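If you do want to guard against it, one option is to check up front which strings contain more than one digit/non-digit boundary. This is a rough base-R sketch; x, m and n_boundaries are names I made up for illustration:

```r
x <- c("12main", "6six", "42go", "5to9", "5to9to5")

# Find every digit/non-digit boundary in each string.
m <- gregexpr("(?<=[0-9])(?=[^0-9])", x, perl = TRUE)

# gregexpr() returns -1 when there is no match, so count only positive positions.
n_boundaries <- vapply(m, function(p) sum(p > 0), integer(1))

# Strings that would lose information under a two-column split.
x[n_boundaries > 1]
```

separate() also has an extra argument ("warn", "drop" or "merge") that controls what happens to additional pieces, which may be worth a look if you would rather handle this inside the separate() call.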

An alternative, in case your real data contains strings like 5to9to5:

have %>%
  mutate(strcapture("^([0-9]+)([^0-9].*)", string, list(prefix="", rest="")))
# # A tibble: 5 × 3
#   string  prefix rest  
#   <chr>   <chr>  <chr> 
# 1 12main  12     main  
# 2 6six    6      six   
# 3 42go    42     go    
# 4 5to9    5      to9   
# 5 5to9to5 5      to9to5

You can then drop the string column if you no longer need it.

Another note: if you intend to convert prefix to an integer or a number, you can skip that step by using list(prefix=0L, rest="") (or just =0) instead. That is the proto= argument: its data is discarded, but its names and classes are used for the resulting columns.
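As a small illustration of that, here is strcapture() on its own in base R, with an integer prototype for prefix. The vector x and the name parts are just for the example:

```r
x <- c("12main", "6six", "42go", "5to9", "5to9to5")

# proto supplies only column names and target classes; 0L means "integer".
parts <- strcapture(
  "^([0-9]+)([^0-9].*)",
  x,
  proto = list(prefix = 0L, rest = "")
)

str(parts)  # prefix comes back as an integer column, rest as character
```

Since the pattern is anchored at the start and the second group takes everything after the leading digits, 5to9to5 keeps its full remainder in rest.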
