Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Split String at First Occurrence of an Integer using R

Note I have already read Split string at first occurrence of an integer in a string however my request is different because I would like to use R.

Suppose I have the following example data frame:

> df = data.frame(name_and_address =
      c("Mr. Smith12 Some street",
        "Mr. Jones345 Another street",
        "Mr. Anderson6 A different street"))
> df
                  name_and_address
1          Mr. Smith12 Some street
2      Mr. Jones345 Another street
3 Mr. Anderson6 A different street

I would like to split the string at the first occurrence of an integer. Notice that the integers are of varying length.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

The desired output can be like the following:

[[1]]
[1] "Mr. Smith"
[2] "12 Some street",

[[2]]
[1] "Mr. Jones"
[2] "345 Another street",

[[3]]
[1] "Mr. Anderson"
[2] "6 A different street"

I have tried the following but I can not get the regular expression correct:

# Attempt 1 (Does not work)
library(data.table)
tstrsplit(df,'(?=\\d+)', perl=TRUE, type.convert=TRUE)

# Attempt 2 (Does not work)
library(stringr)
str_split(fha_ltc, "\\d+")

>Solution :

You can use tidyr::extract:

library(tidyr)
df <- df %>% 
    extract("name_and_address", c("name", "address"), "(\\D*)(\\d.*)")
## => df
##           name              address
## 1    Mr. Smith       12 Some street
## 2    Mr. Jones   345 Another street
## 3 Mr. Anderson 6 A different street

The (\D*)(\d.*) regex matches the following:

  • (\D*) – Group 1: any zero or more non-digit chars
  • (\d.*) – Group 2: a digit and then any zero or more chars as many as possible.

Another solution with stringr::str_split is also possible:

str_split(df$name_and_address, "(?=\\d)", n=2)
## => [[1]]
## [1] "Mr. Smith"      "12 Some street"

## [[2]]
## [1] "Mr. Jones"          "345 Another street"

## [[3]]
## [1] "Mr. Anderson"         "6 A different street"

The (?=\d) positive lookahead finds a location before a digit, and n=2 tells stringr::str_split to only split into 2 chunks max.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading