Extract leading numbers from string, but length varies R

April 19, 2023

I have a column which contains a character string containing letters and numbers. The string always starts with one or two numbers, followed by multiple characters. I am trying to separate the string based on where that first character is.

have <-
  tribble(
    ~string,
    '12main',
    '6six',
    '42go',
    '5to9'
  )

want <- 
  tribble(
    ~prefix, ~rest,
    '12', 'main',
    '6', 'six',
    '42', 'go',
    '5', 'to9'
  )

I’m sure there is a regex with separate solution but having trouble getting it working.

want <-
  have %>%
  separate(string,
           into = c('prefix', 'rest'),
           sep = "(?=[0-9])(?<=[a-zA-Z])")

>Solution :

You were close, we can achieve it with one look-behind (for a number) and one look-ahead (for a non-number):

have %>%
  separate(string, sep = "(?<=[0-9])(?=[^0-9])", into = c("prefix", "rest"))
# # A tibble: 4 × 2
#   prefix rest 
#   <chr>  <chr>
# 1 12     main 
# 2 6      six  
# 3 42     go   
# 4 5      to9

I think you had the look-around reversed: ?<= is for preceding string (should be used with [0-9]), and ?= is for following string (should be used with [^0-9] or [A-Za-z]).

Personally I find this a bit intriguing: we are splitting strings based on a 0-length pattern: there is nothing between where the previous is a number and the following is a non-number, so the split is effectively 0-length.

FYI, this does run into warnings if there are two such places in a string, such as 5to9to5:

have <- structure(list(string = c("12main", "6six", "42go", "5to9", "5to9to5")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
have
# # A tibble: 5 × 1
#   string 
#   <chr>  
# 1 12main 
# 2 6six   
# 3 42go   
# 4 5to9   
# 5 5to9to5
have %>%
  separate(string, sep = "(?<=[0-9])(?=[^0-9])", into = c("prefix", "rest"))
# Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [5].
# # A tibble: 5 × 2
#   prefix rest 
#   <chr>  <chr>
# 1 12     main 
# 2 6      six  
# 3 42     go   
# 4 5      to9  
# 5 5      to9

It’s a warning that you are discarding some information, over to you if you want/need to guard against this.

An alternative, since you have 5to9to5 in your real data:

have %>%
  mutate(strcapture("^([0-9]+)([^0-9].*)", string, list(prefix="", rest="")))
# # A tibble: 5 × 3
#   string  prefix rest  
#   <chr>   <chr>  <chr> 
# 1 12main  12     main  
# 2 6six    6      six   
# 3 42go    42     go    
# 4 5to9    5      to9   
# 5 5to9to5 5      to9to5

where you can now remove string if you want.

Another note: if you are intending to convert prefix into an integer or a number, then you can preclude that need by using list(prefix=0L, rest="") (or just =0) instead. That’s the proto= argument, and while its data is discarded, it is used for its names and target classes for each resulting column).