Remove part of a string with multiple occurrences inside a cell

I have the following dataframe:

bla = data.frame(mycol = "bla_v2_2072|ID:61462952|;bla_v2_0113|ID:61460993|")

The cell contains two substrings separated by ';', and within each substring I want to remove everything from the first '|' onward.

Now, I tried

gsub("\\|.*","",bla$mycol)

which gives me bla_v2_2072, but what I expect is

bla_v2_2072;bla_v2_0113

Solution:

Using gsub() (perl = TRUE is required because the pattern uses a lookahead):

bla$mycol <- gsub("(\\|.*?(?=;))|(\\|[^;]*$)", "", bla$mycol, perl = TRUE)

Or using the same regex pattern in tidyverse:

library(dplyr)
library(stringr)

bla %>% 
  mutate(mycol = str_remove_all(mycol, "(\\|.*?(?=;))|(\\|[^;]*$)"))

Result:

                    mycol
1 bla_v2_2072;bla_v2_0113

Explanation:

"(\\|.*?(?=;))              # literal '|' and following characters up to next ';'
              |             # or
               (\\|[^;]*$)" # literal '|' through end of string if no intervening ';'
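
Why the original attempt fails, and a possible shorter pattern: `.*` is greedy and does not stop at ';', so "\\|.*" matches from the first '|' to the end of the string. The sketch below is not part of the original answer. It assumes every unwanted part starts at a '|' and runs to the next ';' or to the end of the string, in which case a negated character class should be enough and needs neither the lookahead nor perl = TRUE.

```r
# Same example data as in the question
bla <- data.frame(mycol = "bla_v2_2072|ID:61462952|;bla_v2_0113|ID:61460993|")

# Greedy '.*' runs from the first '|' to the end of the string,
# so the second substring is lost as well
greedy <- gsub("\\|.*", "", bla$mycol)

# '[^;]*' matches any run of characters other than ';', so each match
# stops at the next ';' (or at the end of the string)
trimmed <- gsub("\\|[^;]*", "", bla$mycol)

print(greedy)
print(trimmed)
```

The same pattern should also work with str_remove_all(mycol, "\\|[^;]*") in the tidyverse version.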
