Separate string column of varying lengths and multiple delimiters into new columns using tidyr

September 10, 2024

I would like to separate the prod_code_date column (a string of varying length, holding anywhere from one "pair" to many "pairs") into multiple columns based on the delimiters ":" and ",".

The ":" delimiter separates a product code from its date of purchase, which together form one pair of information, while the "," delimiter separates the different pairs recorded for the same product number (prod_no).

Expected intermediate result from separate_wider_delim:
The number of columns created should be based on the number of pairs in the column (two columns per pair).
The column names should be code_1, date_1, code_2, date_2, ..., code_x, date_x.

Final expected result (sample data below): A long table with the following columns: prod_no, code, date, with repeated prod_no rows for as many pairs as there are in the prod_code_date column.

library(tidyverse)

# Data
df <- tibble(prod_no = 1:4, prod_code_date = c("' ZB10.90 : 2013-04-29'", "' XJ11.90 : 2016-10-20, ZB25.22 : 2013-10-16, ZB25.29 : 2011-12-06, XJ14.20 : 2022-03-23, ZB10.90 : 2022-12-16, ZB10.90 : 2011-12-06, QP50.19 : 2016-03-11, QP12.90 : 2012-01-20, MS44.9 : 2022-03-23'", "' MS34.3 : 2022-10-04, QP13.20 : 1998-05-26, QP50.13 : 2008-10-10, MS44.9 : 2017-05-16'", "' QP10.90 : 2008-08-11, QP11.90 : 2019-04-15'"))

# Attempt (failed) using separate_wider_delim(); afterwards I would have pivoted the data to a long format.
intermediate_result <- df %>% separate_wider_delim(prod_code_date, delim = c(":", ","), names = c("code_1", "date_1", "code_2", "date_2", "code_x", "date_x"))

# Expected output: A long table with the following columns: prod_no, code, date, with repeated prod_no rows for as many pairs as there are in the prod_code_date column.
final_result <- tibble(prod_no = c(1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4), code = c("ZB10.90", "XJ11.90", "ZB25.22", "ZB25.29", "XJ14.20", "ZB10.90", "ZB10.90", "QP50.19", "QP12.90", "MS44.9", "MS34.3", "QP13.20", "QP50.13", "MS44.9", "QP10.90", "QP11.90"), date = c("2013-04-29", "2016-10-20", "2013-10-16", "2011-12-06", "2022-03-23", "2022-12-16", "2011-12-06", "2016-03-11", "2012-01-20", "2022-03-23", "2022-10-04", "1998-05-26", "2008-10-10", "2017-05-16", "2008-08-11", "2019-04-15"))

Solution:

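One approach is to skip the wide intermediate step: first split on "," into rows with separate_longer_delim(), strip the spaces and quotes, and then split each pair on ":" into two columns with separate_wider_delim(). Each call takes a single delimiter, which is why passing c(":", ",") in one call does not work.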

library(dplyr)
library(tidyr)

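df %>%
  # one row per "code : date" pair
  separate_longer_delim(prod_code_date, ",") %>%
  # drop all spaces and the literal single quotes
  mutate(prod_code_date = gsub(" +|'", "", prod_code_date)) %>%
  # split each pair into its two parts
  separate_wider_delim(prod_code_date, delim = ":", names = c("code", "date"))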
# A tibble: 16 × 3
   prod_no code    date
     <int> <chr>   <chr>
 1       1 ZB10.90 2013-04-29
 2       2 XJ11.90 2016-10-20
 3       2 ZB25.22 2013-10-16
 4       2 ZB25.29 2011-12-06
 5       2 XJ14.20 2022-03-23
 6       2 ZB10.90 2022-12-16
 7       2 ZB10.90 2011-12-06
 8       2 QP50.19 2016-03-11
 9       2 QP12.90 2012-01-20
10       2 MS44.9  2022-03-23
11       3 MS34.3  2022-10-04
12       3 QP13.20 1998-05-26
13       3 QP50.13 2008-10-10
14       3 MS44.9  2017-05-16
15       4 QP10.90 2008-08-11
16       4 QP11.90 2019-04-15
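
If the wide intermediate layout (code_1, date_1, code_2, date_2, ...) is still wanted, it can probably be derived from the long result rather than produced directly. The following is only a sketch under a few assumptions: df_small is a shortened, hypothetical stand-in for the question's df, pair is a helper column introduced here for illustration, and names_vary requires tidyr 1.2.0 or later.

```r
library(dplyr)
library(tidyr)

# Shortened sample in the same shape as the question's df (assumption: illustrative only)
df_small <- tibble(
  prod_no = 1:2,
  prod_code_date = c(
    "' ZB10.90 : 2013-04-29'",
    "' QP10.90 : 2008-08-11, QP11.90 : 2019-04-15'"
  )
)

# Long format, same steps as in the answer above
long <- df_small %>%
  separate_longer_delim(prod_code_date, ",") %>%
  mutate(prod_code_date = gsub(" +|'", "", prod_code_date)) %>%
  separate_wider_delim(prod_code_date, delim = ":", names = c("code", "date"))

# Number the pairs within each prod_no, then spread them into columns
wide <- long %>%
  group_by(prod_no) %>%
  mutate(pair = row_number()) %>%   # hypothetical helper column
  ungroup() %>%
  pivot_wider(
    names_from = pair,
    values_from = c(code, date),
    names_vary = "slowest"          # interleave: code_1, date_1, code_2, date_2
  )

wide
```

Products with fewer pairs than the maximum get NA in the unused code_x / date_x columns, so the number of columns follows the longest string without having to be specified in advance.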