Remove first occurrence of special characters until the first word or word character in R using regex

For my project, I need to remove parts of a text based on a pattern of special characters. I have a long .txt file with the following structure:

mycharobj=c("---------Some text is here.---------More text is here - [3548]----- Even more text is here.-----------More text is here - [408]--------- Even more text is here again.")

The string continues in the same pattern.

My goal is to remove the parts that start with a run of - characters and end with - [number], such as:

"-----------------------More text is here - [3548]"
"-----------More text is here - [408]"

I am planning to use the code below to remove these parts (it will be run in a loop later):

library(stringr)
library(qdapRegex)

temp=unlist(regmatches(mycharobj, gregexpr("[[:digit:]]+", mycharobj)))
mycharobj=rm_between(mycharobj, "-", paste(temp[1],"]", sep=""))

For this to work, however, I need a regular expression that removes the first occurrence of "-----------" in the text, up to the first word or word character. If a string starts with text (a word or word characters), the expression needs to skip that text and find the first occurrence of "-----------", so that my planned loop works.

Can this be done with regular expressions? Any help is appreciated. My current solution is computationally expensive: I split the string on the special character "-" and then pick out the parts of the text I need through a set of conditionals. Because this takes much more processing time, it does not scale to a large number of such .txt files.

Solution:

You can use

gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", mycharobj, perl=TRUE)

See the regex demo.

Details:

  • -{9,} – nine or more - chars
  • (?:(?!-{9}).)*? – any char other than a line break char that does not start a sequence of nine hyphens, repeated zero or more times but as few times as possible
  • - \[ – a - [ string
  • \d+ – one or more digits
  • ] – a ] char.
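
To make the breakdown above concrete, here is a minimal sketch that applies the same pattern to the sample string from the question. The helper name strip_marked_segments is hypothetical and only wraps the gsub call shown in the answer; everything else is base R.

```r
# Hypothetical helper: wraps the gsub() call from the answer above.
strip_marked_segments <- function(x) {
  # -{9,}            : a run of nine or more hyphens
  # (?:(?!-{9}).)*?  : as few chars as possible, none starting a nine-hyphen run
  # - \[\d+]         : the closing "- [number]" marker
  gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", x, perl = TRUE)
}

# Sample string from the question.
mycharobj <- "---------Some text is here.---------More text is here - [3548]----- Even more text is here.-----------More text is here - [408]--------- Even more text is here again."

cleaned <- strip_marked_segments(mycharobj)
print(cleaned)

# gsub() is vectorized, so a character vector (for example the lines of a
# file read with readLines()) can be cleaned in a single call.
batch <- strip_marked_segments(c(mycharobj, "plain text without markers"))
print(length(batch))
```

The first hyphen run (before "Some text is here.") is kept, because the tempered token refuses to cross the next nine-hyphen run and no "- [number]" marker appears before it. Since the whole vector is handled in one call, no explicit loop over segments should be needed, which may help with the scalability concern raised in the question.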