I want to extract substrings from a list of identifiers of different lengths. Essentially, I want to keep every character of each identifier up to the third occurrence of "-", minus any letter that trails the number, and remove the rest. An example of the list is below:
mylist <- c("abc-nop-7a-2","abc-nop-7b-3p", "abc-nop-18a-5p/18c-5p", "abc-xyz-198_5p")
I want the resulting list to look like:
result <- c("abc-nop-7","abc-nop-7", "abc-nop-18", "abc-xyz-198")
I have tried splitting the strings and then taking the section I want, but I was not sure how to call sections up to a certain point. I tried:
mylist <- gsub("-", "_", mylist) # "-" was not acceptable as a character
mylist <- strsplit(mylist, "_")
sapply(mylist, `[`, 3)
But of course, the above only gives me something like this:
"7","7", "18", "198"
Is there a way to extract sections 1 through 3 of the strings I split in the method above? Or, if there is a more efficient way to do the task through stringr or something similar, I’d appreciate that as well.
Thanks in advance.
>Solution :
We can capture the part to keep as a group and replace the whole string with the backreference (\\1):
sub("^(([^-]+-){2}[0-9]+).*", "\\1", mylist)
[1] "abc-nop-7" "abc-nop-7" "abc-nop-18" "abc-xyz-198"
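The pattern matches, from the start (^) of the string, two ({2}) instances of one or more characters that are not a - ([^-]+) followed by a -, and then one or more digits ([0-9]+). That whole prefix is captured by the outer parentheses ((...)), and the trailing .* matches the rest of the string. In the replacement, the backreference \\1 refers to the outer captured group, so only the prefix is kept.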
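For comparison, here is a minimal base R sketch of two other routes to the same result. The first follows the split-based idea from the question; the second extracts the match directly instead of replacing around it. The helper name keep_first_three is made up for illustration, and both options assume every identifier has at least three "-"-separated sections with the third one starting with digits.

```r
mylist <- c("abc-nop-7a-2", "abc-nop-7b-3p", "abc-nop-18a-5p/18c-5p", "abc-xyz-198_5p")

# Option 1: split on "-", trim the third piece to its leading digits,
# then paste the first three pieces back together.
# keep_first_three is a hypothetical helper, not part of any package.
keep_first_three <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)[[1]]
  parts[3] <- sub("^([0-9]+).*", "\\1", parts[3])
  paste(parts[1:3], collapse = "-")
}
from_split <- vapply(mylist, keep_first_three, character(1), USE.NAMES = FALSE)

# Option 2: pull out the matching prefix directly with regexpr/regmatches.
from_extract <- regmatches(mylist, regexpr("^([^-]+-){2}[0-9]+", mylist))

print(from_split)
print(from_extract)
```

One difference worth keeping in mind: sub() returns a non-matching string unchanged, whereas regmatches() silently drops elements with no match, so the result can be shorter than the input. If stringr is available, str_extract(mylist, "^([^-]+-){2}[0-9]+") should behave like Option 2 but return NA for non-matching elements.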