Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

how to extract strings and create varying number of columns with the extracted values?

I have the following list of file paths. The data folder has sub folders with names as ID and each subfolder contains data files.

library(dplyr)
library(stringr)
library(tidyr)

paths <- c(
  "data/A101/usage.xlsx",
  "data/A101/cost.xls",
  "data/A101/A101 cost-1.csv",
  "data/B202/B202 data.csv",
  "data/B202/cost1.xlsx",
  "data/C303/usage1.xls",
  "data/D404/cost1.csv",
  "data/D404/D404 data1.xlsx",
  "data/E/E data1.csv"
)

I want to create a data frame that looks like this: site_id from the names of sub folders, file_name_1file_name_n with the extracted file names from each folder. file_name_1 should contain all names of the first files from each site, file_name_2 should contain all of the 2nd extracted file names (if exists), and so on.

df <- data.frame(
  site_id = unique(str_extract(paths, "[A-D]\\d+|E")),
  file_name_1 = c("usage.xlsx", "B202 data.csv", "usage1.xls", "cost1.csv", "data1.csv"),
  file_name_2 = c("cost.xls", "cost1.xlsx", NA, "D404 data1.xlsx", NA),
  file_name_3 = c("A101 cost-1.csv", NA, NA, NA, NA)
)

And I have tried the following:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

test <- data.frame(
  site_id = str_extract(paths, "[A-D]\\d+|E"),
  file_name = str_extract(paths, "[^\\/]+$")
  ) %>%
  gather(key=key, value=value, -site_id) %>% 
  pivot_wider(
    id_cols = "site_id",
    names_from = key,
    values_from = value,
    names_glue = "{key}_{seq_along({key})}",
    values_fn = list
  )
View(test)

enter image description here

The result has only one column with the file names combined into a vector. What am I missing here?

>Solution :

Please try the below code

df <- data.frame(paths=paths) %>% 
separate(paths, into = c('new','site_id', 'file'), sep = '/') %>% 
  mutate(row=paste0('file_name_',row_number()),
         .by=site_id) %>% 
  pivot_wider(id_cols = site_id, names_from = row,values_from = file)

Created on 2023-07-19 with reprex v2.0.2

# A tibble: 5 × 4
  site_id file_name_1   file_name_2     file_name_3    
  <chr>   <chr>         <chr>           <chr>          
1 A101    usage.xlsx    cost.xls        A101 cost-1.csv
2 B202    B202 data.csv cost1.xlsx      <NA>           
3 C303    usage1.xls    <NA>            <NA>           
4 D404    cost1.csv     D404 data1.xlsx <NA>           
5 E       E data1.csv   <NA>            <NA>           
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading