Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

split numbers and dates into seperate columns

My data contains text strings with three important features, an id number separated by":" and a starting date and an end date. I need to get these tree numbers into three separate columns. I have tried different solutions, everything from unnest_tokens, grepl/grep, to separate, but can’t seem to get it right, I may get one date, but I can’t seem to get them in the correct order or into a data frame.

input data

input<- data.frame(
  id=c(1,2,3),
  value=c("a long title containing all sorts - off `characters` 2022:03 29.10.2021 
  21.02.2022",
  "but the strings always end with the same - document id, start date: and end date  2022:02 
  30.04.2020 18.02.2022",
  "so I need to split document id, start and end dates into separate columns 2000:01 
  07.10.2000 15.02.2021")
  )

desired output

output <-data.frame(
 id=c(1,2,3),
 value=c("a long title containing all sorts - off `characters`",
 "but the strings allwasys end with the same - document id, start date: and end date",
 "so i need to split document id, start and end dates into seperate collumns"),
 docid=c("2022:03", "2022:02", "2000:01"),
 start=c("29.10.2021", "30.04.2020", "07.10.2000"),
 end=c("21.02.2022", "18.02.2022", "15.02.2021")
  )

>Solution :

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

This is most conveniently accomplished by extract: in its regex argument we exhaustively describe the strings we want to split into columns as a complex pattern in which the parts that need to go into the columns are wrapped into capture groups (...):

library(tidyr)
input %>%
  extract(value,
          into = c("value", "docid", "start", "end"),
          regex = "(.*)\\s(\\d{4}:\\d{2})\\s{1,}(.*)\\s{1,}(.*)")
  id                                                                             value   docid      start
1  1                              a long title containing all sorts - off `characters` 2022:03 29.10.2021
2  2 but the strings always end with the same - document id, start date: and end date  2022:02 30.04.2020
3  3         so I need to split document id, start and end dates into separate columns 2000:01 07.10.2000
         end
1 21.02.2022
2 18.02.2022
3 15.02.2021
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading