Remove first occurrence of special characters until the first word or word character in R using regex

July 27, 2022

For my project I am looking into removing parts of text based on the pattern of special characters. I have a long .txt file that has the below structure:

mycharobj=c("---------Some text is here.---------More text is here - [3548]----- Even more text is here.-----------More text is here - [408]--------- Even more text is here again.")

The string continues in the same pattern.

My goal is to remove the parts that start with a run of - characters and end with - [number], such as:

"---------More text is here - [3548]"
"-----------More text is here - [408]"

I am planning to use the code below to remove these parts (it will be run in a loop later):

library(stringr)
library(qdapRegex)

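# Collect every run of digits found in the string
temp=unlist(regmatches(mycharobj, gregexpr("[[:digit:]]+", mycharobj)))
# Remove the text between a "-" and the first number followed by "]"
mycharobj=rm_between(mycharobj, "-", paste(temp[1],"]", sep=""))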

For this to work, I need a regular expression that removes the first occurrence of "-----------" in the text, up to the first word or word character. If the string starts with text (a word or word characters), the regex needs to skip that text and find the first occurrence of "-----------", so that my planned loop works.

Can this be done with regular expressions? Any help is appreciated. I already have a solution, but it is computationally demanding: it splits the string on the special character "-" and then picks out the parts of the text I need through a set of conditionals. Because it takes much more processing time, it does not scale to a large number of such .txt files.

Solution:

You can use

gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", mycharobj, perl=TRUE)

Details:

-{9,} – nine or more - chars
(?:(?!-{9}).)*? – any single char other than a line break char, repeated zero or more times but as few as possible, where no matched char is the start of a run of nine hyphens
- \[ – a - [ string
\d+ – one or more digits
] – a ] char.
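
As a quick check, here is a minimal sketch that runs the pattern on the sample string from the question. The second pattern, stored in naive, is not part of the answer; it is a hypothetical variant without the (?!-{9}) lookahead, included only to show what that lookahead prevents. The variable names tempered, cleaned, naive and overmatched are made up for this illustration.

```r
# Sample string copied from the question.
mycharobj <- "---------Some text is here.---------More text is here - [3548]----- Even more text is here.-----------More text is here - [408]--------- Even more text is here again."

# Pattern from the answer: a match cannot run across another run of nine hyphens.
tempered <- "-{9,}(?:(?!-{9}).)*?- \\[\\d+]"
cleaned <- gsub(tempered, "", mycharobj, perl = TRUE)
print(cleaned)
# "---------Some text is here.----- Even more text is here.--------- Even more text is here again."

# Hypothetical naive variant (assumption, for comparison only): a plain lazy dot.
naive <- "-{9,}.*?- \\[\\d+]"
overmatched <- gsub(naive, "", mycharobj, perl = TRUE)
print(overmatched)
# "----- Even more text is here.--------- Even more text is here again."
```

With the naive variant, the first match starts at the very first run of hyphens and runs through "[3548]", so "Some text is here." is lost as well. The lookahead makes that first attempt fail, so matching restarts at the hyphen run that directly precedes "More text is here - [3548]".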