I have the following dataframe:
bla = data.frame(mycol = "bla_v2_2072|ID:61462952|;bla_v2_0113|ID:61460993|")
and I want to remove everything from the first ‘|’ onward in each substring. The cell contains two substrings separated by ‘;’.
Now, I tried
gsub("\\|.*","",bla$mycol)
which gives me bla_v2_2072, but what I expect is
bla_v2_2072;bla_v2_0113
>Solution :
Using gsub() (perl = TRUE is required because the pattern uses a lookahead):
bla$mycol <- gsub("(\\|.*?(?=;))|(\\|[^;]*$)", "", bla$mycol, perl = TRUE)
Or using the same regex pattern in tidyverse:
library(dplyr)
library(stringr)
bla %>%
  mutate(mycol = str_remove_all(mycol, "(\\|.*?(?=;))|(\\|[^;]*$)"))
Result:
mycol
1 bla_v2_2072;bla_v2_0113
Explanation:
"(\\|.*?(?=;)) # literal '|' and the following characters, up to but not including the next ';'
| # or
(\\|[^;]*$)" # literal '|' through the end of the string when no ';' follows
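Both alternatives in the pattern above stop at the next ';' or at the end of the string, so for this input a single negated character class should do the same job without a lookahead. This is a minimal sketch, not part of the original answer: the names cleaned, pieces and rejoined are made up for illustration, and the split-and-rejoin step is only there as a cross-check.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps mycol a character vector on older R versions)
bla <- data.frame(
  mycol = "bla_v2_2072|ID:61462952|;bla_v2_0113|ID:61460993|",
  stringsAsFactors = FALSE
)

# Remove each '|' together with every following character that is not ';'
cleaned <- gsub("\\|[^;]*", "", bla$mycol)

# Cross-check: split on ';', cut each piece at its first '|', then rejoin with ';'
pieces <- strsplit(bla$mycol, ";", fixed = TRUE)
rejoined <- vapply(
  pieces,
  function(p) paste(sub("\\|.*", "", p), collapse = ";"),
  character(1)
)

print(cleaned)                      # [1] "bla_v2_2072;bla_v2_0113"
print(identical(cleaned, rejoined)) # [1] TRUE
```

Because [^;]* can never cross a ';', the match stays inside one substring, which is what the original attempt with the greedy .* failed to do. No perl = TRUE is needed here since the pattern has no lookahead.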