Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Separate string column of varying lengths and multiple delimiters into new columns using tidyr

I would like to separate the prod_code_date column (a string of varying lengths, from one "pair" to multiple "pairs") into multiple columns based on the delimiter ":" and ",".

The ":" delimiter separates related information into product code and date of purchase – which can be thought of as paired information, while the "," delimiter separates the different pairs of information for the same product number (prod_no).

Expected intermediate result from separate_wider_delim
The number of columns created should be based on the number of delimiters in the column.
The column names should be code_1, date_1, code_2, date_2, code_x, date_x, etc.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Final expected result (sample data below): A long table with the following columns: prod_no, code, date, with repeated prod_no rows for as many pairs as there are in the prod_code_date column.

library(tidyverse)

# Data
df <- tibble(prod_no = 1:4, prod_code_date = c("' ZB10.90 : 2013-04-29'", "' XJ11.90 : 2016-10-20, ZB25.22 : 2013-10-16, ZB25.29 : 2011-12-06, XJ14.20 : 2022-03-23, ZB10.90 : 2022-12-16, ZB10.90 : 2011-12-06, QP50.19 : 2016-03-11, QP12.90 : 2012-01-20, MS44.9 : 2022-03-23'", "' MS34.3 : 2022-10-04, QP13.20 : 1998-05-26, QP50.13 : 2008-10-10, MS44.9 : 2017-05-16'", "' QP10.90 : 2008-08-11, QP11.90 : 2019-04-15'"))

# Attempt (failed) using separate_wider_delim() function. After which I would have pivoted the data to a long format.
intermediate_result <- df %>% separate_wider_delim(prod_code_date, delim = c(":", ","), names = c("code_1", "date_1", "code_2", "date_2", "code_x", "date_x"))

# Expected output: A long table with the following columns: prod_no, code, date, with repeated prod_no rows for as many pairs as there are in the prod_code_date column.
final_result <- tibble(prod_no = c(1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4), code = c("QP10.90", "XJ11.90", "QP25.22", "QP25.29", "XJ14.20", "QP10.90", "QP10.90", "QP50.19", "QP12.90", "MS34.3", "QP13.20", "QP50.13", "MS44.9", "QP10.90", "QP11.90"), date = c("2013-04-29", "2016-10-20", "2013-10-16", "2011-12-06", "2022-03-23", "2022-12-16", "2011-12-06", "2016-03-11", "2012-01-20", "2022-10-04", "1998-05-26", "2008-10-10", "2017-05-16", "2008-08-11", "2019-04-15"))

>Solution :

An approach using separate_longer_delim and a consecutive separate_wider_delim

library(dplyr)
library(tidyr)

df %>% 
  separate_longer_delim(prod_code_date, ",") %>% 
  mutate(prod_code_date = gsub(" +|'", "", prod_code_date)) %>% 
  separate_wider_delim(prod_code_date, delim=":", names=c("code", "date"))
# A tibble: 16 × 3
   prod_no code    date
     <int> <chr>   <chr>
 1       1 ZB10.90 2013-04-29
 2       2 XJ11.90 2016-10-20
 3       2 ZB25.22 2013-10-16
 4       2 ZB25.29 2011-12-06
 5       2 XJ14.20 2022-03-23
 6       2 ZB10.90 2022-12-16
 7       2 ZB10.90 2011-12-06
 8       2 QP50.19 2016-03-11
 9       2 QP12.90 2012-01-20
10       2 MS44.9  2022-03-23
11       3 MS34.3  2022-10-04
12       3 QP13.20 1998-05-26
13       3 QP50.13 2008-10-10
14       3 MS44.9  2017-05-16
15       4 QP10.90 2008-08-11
16       4 QP11.90 2019-04-15
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading