Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Separating a column based on a string pattern in R using tidyverse functions

I want to create two columns from the DATA given below. I want the first column (say ID1) to have all the letters or numbers before the first "-" starting from left to right and the second column (ID2) to have the first letters or numbers before the first "-" starting from right to left and to print them as character. The problem is I don’t know how to handle multiple "-" on the cell. Because I want the letters before the first "-" and the letters after the last "-". All the other "-" and letters or numbers or anything to be excluded.

Ideally, the result must look like this :

val ID1 ID2 food
1 0001 U.S.A1 pizza
2 0023 ?2MALTA pasta
3 009876 !HUNGARY berger
4 NA NA pizza
val = c(1,2,3,4)
tor = c( "0001-NEW YORK - M.AVENUE, NY-U.S.A1",
         "0023-cARAVAGGIO-NICOLO-PERUGIA-?2MALTA",
         "009876-bUDAPEST-!HUNGARY" ,NA)
food = c("pizza","pasta","berger","pizza")
DATA = tibble(val,tor,food)

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

The tidyverse way to do this is with tidyr::separate_wider_regex():

We can use named capture groups to specify the parts of tor that we want to keep. As the docs state, patterns is:

A named character vector where the names become column names and the values are regular expressions that match the contents of the vector. Unnamed components will match, but not be included in the output.

Emphasis mine. In this case:

tor_pattern <- c(
    ID1 = "\\d+", # any digits at the start
    "-.+-", # discard the part between hyphens
    ID2 = ".+$" # whatever appears after this until the end
)

Which gives us:

DATA |>
    tidyr::separate_wider_regex(
        cols = tor,
        patterns = tor_pattern
    )

# # A tibble: 4 × 4
#     val ID1    ID2      food  
#   <dbl> <chr>  <chr>    <chr> 
# 1     1 0001   U.S.A1   pizza 
# 2     2 0023   ?2MALTA  pasta 
# 3     3 009876 !HUNGARY berger
# 4     4 NA     NA       pizza 
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading