Home how to extract strings and create varying number of columns with the extracted values?

Questions

how to extract strings and create varying number of columns with the extracted values?

July 20, 2023

I have the following list of file paths. The data folder has sub folders with names as ID and each subfolder contains data files.

library(dplyr)
library(stringr)
library(tidyr)

paths <- c(
  "data/A101/usage.xlsx",
  "data/A101/cost.xls",
  "data/A101/A101 cost-1.csv",
  "data/B202/B202 data.csv",
  "data/B202/cost1.xlsx",
  "data/C303/usage1.xls",
  "data/D404/cost1.csv",
  "data/D404/D404 data1.xlsx",
  "data/E/E data1.csv"
)

I want to create a data frame that looks like this: site_id from the names of sub folders, file_name_1 … file_name_n with the extracted file names from each folder. file_name_1 should contain all names of the first files from each site, file_name_2 should contain all of the 2nd extracted file names (if exists), and so on.

df <- data.frame(
  site_id = unique(str_extract(paths, "[A-D]\\d+|E")),
  file_name_1 = c("usage.xlsx", "B202 data.csv", "usage1.xls", "cost1.csv", "data1.csv"),
  file_name_2 = c("cost.xls", "cost1.xlsx", NA, "D404 data1.xlsx", NA),
  file_name_3 = c("A101 cost-1.csv", NA, NA, NA, NA)
)

And I have tried the following:

test <- data.frame(
  site_id = str_extract(paths, "[A-D]\\d+|E"),
  file_name = str_extract(paths, "[^\\/]+$")
  ) %>%
  gather(key=key, value=value, -site_id) %>% 
  pivot_wider(
    id_cols = "site_id",
    names_from = key,
    values_from = value,
    names_glue = "{key}_{seq_along({key})}",
    values_fn = list
  )
View(test)

The result has only one column with the file names combined into a vector. What am I missing here?

>Solution :

Please try the below code

df <- data.frame(paths=paths) %>% 
separate(paths, into = c('new','site_id', 'file'), sep = '/') %>% 
  mutate(row=paste0('file_name_',row_number()),
         .by=site_id) %>% 
  pivot_wider(id_cols = site_id, names_from = row,values_from = file)

^{Created on 2023-07-19 with reprex v2.0.2}

# A tibble: 5 × 4
  site_id file_name_1   file_name_2     file_name_3    
  <chr>   <chr>         <chr>           <chr>          
1 A101    usage.xlsx    cost.xls        A101 cost-1.csv
2 B202    B202 data.csv cost1.xlsx      <NA>           
3 C303    usage1.xls    <NA>            <NA>           
4 D404    cost1.csv     D404 data1.xlsx <NA>           
5 E       E data1.csv   <NA>            <NA>