Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Splitting strings containing multiple delimiters in R

I have a vector tissue which contains strings delimited by multiple characters. The constituent strings of the vector belong broadly to four classes:

  1. Strings which only consist of term(s) (e.g. Thymus Thyroid) separated by ,

  2. Strings which contain identifier(s) (e.g. ECO:0000313|RefSeq:XP_014046664.1) ending with }, followed by term(s) separated by ,

    MEDevel.com: Open-source for Healthcare and Education

    Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

    Visit Medevel

  3. Strings which contain a term followed by identifier(s)

  4. Strings which contain a term followed by an identifier and then term(s) separated by ,

    tissue <- c("Head kidney,Thymus,Thyroid,", 
                "Red blood cell,", 
                "ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
                "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,", 
                "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
                "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
                "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
                "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")
    

For strings belonging to category 1, I could split the terms using a simple strsplit() function

unlist(strsplit("Head kidney,Thymus,Thyroid,", ","))
[1] "Head kidney" "Thymus"      "Thyroid" 

unlist(strsplit("Red blood cell,", ","))
[1] "Red blood cell"

For strings belonging to category 2, this is what I came up with and it works fine

unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_014046664.1},Muscle,"), ","))
[1] "Muscle"

unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,"), ","))
[1] "Leaf"

unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,"), ","))
[1] "Head kidney"  "Muscle"       "White muscle"

For strings belonging to category 3, this worked for me

sub(',ECO:.*', "", "Blood,ECO:0000313|RefSeq:XP_017326031.1},")
[1] "Blood"

sub(',ECO:.*', "", "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},")
[1] "Spleen"

For category 4, this is what I tried and it works fine

unlist(strsplit(sub(',ECO:.*},', ",", "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,"), ","))
[1] "Brain"        "Head kidney"  "Muscle"       "Spleen"       "White muscle"

I’m looking for a solution, a single regex if possible, which can handle all these conditions and can be used directly on the vector.

>Solution :

We may remove some of the substring and then use strsplit

library(stringr)
lapply(strsplit(str_remove_all(tissue, "ECO:[^\\}]+\\}"), ","), 
     function(x) x[nzchar(x)])

-output

[[1]]
[1] "Head kidney" "Thymus"      "Thyroid"    

[[2]]
[1] "Red blood cell"

[[3]]
[1] "Muscle"

[[4]]
[1] "Leaf"

[[5]]
[1] "Head kidney"  "Muscle"       "White muscle"

[[6]]
[1] "Blood"

[[7]]
[1] "Spleen"

[[8]]
[1] "Brain"        "Head kidney"  "Muscle"       "Spleen"       "White muscle"

Or with a tidyverse work flow

library(dplyr)
library(tidyr)
str_remove_all(tissue, "ECO:[^\\}]+\\}") %>% 
  trimws(whitespace = ",+") %>%
  str_replace_all(',{2,}', ",") %>% 
  tibble(col1 = .) %>% 
  tidyr::separate(col1, into = str_c('V', 
    seq(max(str_count(.$col1, ",")) + 1)), sep = ",", fill = "right")

-output

# A tibble: 8 × 5
  V1             V2          V3           V4     V5          
  <chr>          <chr>       <chr>        <chr>  <chr>       
1 Head kidney    Thymus      Thyroid      <NA>   <NA>        
2 Red blood cell <NA>        <NA>         <NA>   <NA>        
3 Muscle         <NA>        <NA>         <NA>   <NA>        
4 Leaf           <NA>        <NA>         <NA>   <NA>        
5 Head kidney    Muscle      White muscle <NA>   <NA>        
6 Blood          <NA>        <NA>         <NA>   <NA>        
7 Spleen         <NA>        <NA>         <NA>   <NA>        
8 Brain          Head kidney Muscle       Spleen White muscle

Or using only base R

read.csv(text = gsub(",{2,}", ",", trimws(gsub("ECO:[^\\}]+\\}", 
    "", tissue), whitespace = ",+")), header = FALSE, fill = TRUE, sep=",")
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading