Home Separating a column based on a string pattern in R using tidyverse functions

Questions

Separating a column based on a string pattern in R using tidyverse functions

April 5, 2024

I want to create two columns from the DATA given below. I want the first column (say ID1) to have all the letters or numbers before the first "-" starting from left to right and the second column (ID2) to have the first letters or numbers before the first "-" starting from right to left and to print them as character. The problem is I don’t know how to handle multiple "-" on the cell. Because I want the letters before the first "-" and the letters after the last "-". All the other "-" and letters or numbers or anything to be excluded.

Ideally, the result must look like this :

val	ID1	ID2	food
1	0001	U.S.A1	pizza
2	0023	?2MALTA	pasta
3	009876	!HUNGARY	berger
4	NA	NA	pizza

val = c(1,2,3,4)
tor = c( "0001-NEW YORK - M.AVENUE, NY-U.S.A1",
         "0023-cARAVAGGIO-NICOLO-PERUGIA-?2MALTA",
         "009876-bUDAPEST-!HUNGARY" ,NA)
food = c("pizza","pasta","berger","pizza")
DATA = tibble(val,tor,food)

>Solution :

The tidyverse way to do this is with tidyr::separate_wider_regex():

We can use named capture groups to specify the parts of tor that we want to keep. As the docs state, patterns is:

A named character vector where the names become column names and the values are regular expressions that match the contents of the vector. Unnamed components will match, but not be included in the output.

Emphasis mine. In this case:

tor_pattern <- c(
    ID1 = "\\d+", # any digits at the start
    "-.+-", # discard the part between hyphens
    ID2 = ".+$" # whatever appears after this until the end
)

Which gives us:

DATA |>
    tidyr::separate_wider_regex(
        cols = tor,
        patterns = tor_pattern
    )

# # A tibble: 4 × 4
#     val ID1    ID2      food  
#   <dbl> <chr>  <chr>    <chr> 
# 1     1 0001   U.S.A1   pizza 
# 2     2 0023   ?2MALTA  pasta 
# 3     3 009876 !HUNGARY berger
# 4     4 NA     NA       pizza