I want to create two columns from the DATA given below. I want the first column (say ID1) to have all the letters or numbers before the first "-" starting from left to right and the second column (ID2) to have the first letters or numbers before the first "-" starting from right to left and to print them as character. The problem is I don’t know how to handle multiple "-" on the cell. Because I want the letters before the first "-" and the letters after the last "-". All the other "-" and letters or numbers or anything to be excluded.
Ideally, the result must look like this :
| val | ID1 | ID2 | food |
|---|---|---|---|
| 1 | 0001 | U.S.A1 | pizza |
| 2 | 0023 | ?2MALTA | pasta |
| 3 | 009876 | !HUNGARY | berger |
| 4 | NA | NA | pizza |
val = c(1,2,3,4)
tor = c( "0001-NEW YORK - M.AVENUE, NY-U.S.A1",
"0023-cARAVAGGIO-NICOLO-PERUGIA-?2MALTA",
"009876-bUDAPEST-!HUNGARY" ,NA)
food = c("pizza","pasta","berger","pizza")
DATA = tibble(val,tor,food)
>Solution :
The tidyverse way to do this is with tidyr::separate_wider_regex():
We can use named capture groups to specify the parts of tor that we want to keep. As the docs state, patterns is:
A named character vector where the names become column names and the values are regular expressions that match the contents of the vector. Unnamed components will match, but not be included in the output.
Emphasis mine. In this case:
tor_pattern <- c(
ID1 = "\\d+", # any digits at the start
"-.+-", # discard the part between hyphens
ID2 = ".+$" # whatever appears after this until the end
)
Which gives us:
DATA |>
tidyr::separate_wider_regex(
cols = tor,
patterns = tor_pattern
)
# # A tibble: 4 × 4
# val ID1 ID2 food
# <dbl> <chr> <chr> <chr>
# 1 1 0001 U.S.A1 pizza
# 2 2 0023 ?2MALTA pasta
# 3 3 009876 !HUNGARY berger
# 4 4 NA NA pizza