Regex to split a column in R after the second pipe and after the second T

February 23, 2024

I have a 1 column dataframe of thousands of lines all built on the same pattern, for example:

ids <- c("ETC|HMPI01000001|HMPI01000001.1 TAG: Genus Species, T05X3Ml2_CL10007Cordes1_1","ETC|HMPI31000002|HMPI31000002.1 TAG: Genus Species, T3X3Ml2_CL10157Cordes1_1", "ETC|HMPI01000007|HMPI01000007.1 TAG: Genus Species, T1X3Ml2_CL11231Cordes1_1")

df <- as.data.frame(ids)

> df
                                                                            ids
1  ETC|HMPI01000001|HMPI01000001.1 TAG: Genus Species, T05X3Ml2_CL10007Cordes1_1
2  ETC|HMPI31000002|HMPI31000002.1 TAG: Genus Species, T3X3Ml2_CL10157Cordes1_1
3  ETC|HMPI01000007|HMPI01000007.1 TAG: Genus Species, T1X3Ml2_CL11231Cordes1_1

I want to split these characters in 2 columns: var1 and var2 such and keep the text after the second pipe and before the first space, and the text from the second T after the space onwards. These will be the common patterns for all the lines. The expected results is:

> df
            var1                      var2
1 HMPI01000001.1 T05X3Ml2_CL10007Cordes1_1
2 HMPI31000002.1  T3X3Ml2_CL10157Cordes1_1
3 HMPI01000007.1  T1X3Ml2_CL11231Cordes1_1

I have tried several regex inspired from here, there or there.. But I cannot figure it out.

I have currently this, but it is not giving the expected results:

df2 <- df %>% separate(col = "ids", into = c("var1", "var2"), sep = "\\|([^|]+)$")
> df2
              var1 var2
1 ETC|HMPI01000001     
2 ETC|HMPI31000002     
3 ETC|HMPI01000007

Any help, preferably using a regex and the tidyverse, would be appreciated.

>Solution :

library(stringr)
library(dplyr)

df |>
  mutate(var1 = str_extract(ids, ".*\\|(\\S+)", group = 1),
         var2 = str_extract(ids, ".*, (.*$)", group = 1))

You can also use separate_wider_position, if their placement is fixed:

library(tidyr)

df |>
  separate_wider_position(ids, 
                          widths = c(17, var1 = 14, 21, var2 = max(nchar(ids))), 
                          too_few = "align_start")

#  var1           var2                     
#  <chr>          <chr>                    
# 1 HMPI01000001.1 T05X3Ml2_CL10007Cordes1_1
# 2 HMPI31000002.1 T3X3Ml2_CL10157Cordes1_1 
# 3 HMPI01000007.1 T1X3Ml2_CL11231Cordes1_1