Remove all punctuation AND the values after it at end of string in R

I have an ID variable that comes from 35 different hospitals, so it is formatted in several different ways. Sometimes the same root ID number appears with a secondary line number appended, e.g. -1, /a, _1.

I want to remove the punctuation, and whatever comes after that punctuation, leaving just the root ID number.

So far I have written an individual line of code for each variation, but is there a more elegant way, so that when next year’s data comes in I don’t need to check for new arrangements?

In an answer to someone else’s question I found a way to remove brackets and all the text within them, but I can’t figure out how to adapt it for my purposes:

df$patid<- gsub("\\s*\\([^\\)]+\\)","",df$patid)

I tried these two lines without success:

df$patid<- gsub("\\[:punct:]s*$","", df$patid)
df$patid<- gsub("\\[:alnum:]s*$","", df$patid)

I also tried the clean function, which removed all the punctuation, but kept the numbers/characters after them, so that wasn’t it.

Example of my current code (not all possible variations). These do work:

df$patid<- gsub("\\-1$", "", df$patid)
df$patid<- gsub("\\-2$", "", df$patid)
df$patid<- gsub("\\-3$", "", df$patid)
df$patid<- gsub("\\-a$", "", df$patid)
df$patid<- gsub("\\-A$", "", df$patid)
df$patid<- gsub("\\-b$", "", df$patid)
df$patid<- gsub("\\-B$", "", df$patid)
df$patid<- gsub("\\b", "", df$patid)
df$patid<- gsub("\\/dd", "", df$patid)

I am not tied to gsub and am open to different methods.

Example of ID numbers

patid<- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")

Apologies if this has been answered somewhere already.

Solution:

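Read literally, "punctuation and whatever comes after it" is too broad, because root IDs such as MB-13-169454 contain punctuation themselves. In your sample, every secondary suffix is a punctuation character followed by a single letter or digit at the end of the string, so a regex for that would be:

[[:punct:]][[:alnum:]]$

This matches a punctuation character followed by exactly one alphanumeric character at the end of the string. IDs without such a suffix are left unchanged.

# Sample IDs from the question
patid <- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")
# sub() is enough here: at most one match is possible at the end of each string
output <- sub("[[:punct:]][[:alnum:]]$", "", patid)
output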

 [1] "MB-13-169454" "MB-13-179455" "MB-13-212235" "MB-13-212235" "MB-13-224683"
 [6] "570548260"    "570548260"    "1458629P"     "1139093D"     "8253015N"    
[11] "8253015N"     "M255858"      "M255858"      "8494392Q"     "9296741B"    
[16] "04152341421"  "04152341421"  "04152640475"  "04152821164"  "G140381883"  
[21] "G140381883"   "G140880774"   "G140880774"  
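
If next year's data contains longer suffixes, the pattern has to be widened with some care. The sketch below is only an illustration: the `ids` vector is hypothetical (the value "8253015N/12" does not appear in the original data), and the upper limit of two suffix characters is an assumption you would need to check against your own IDs. It contrasts a greedy pattern, which removes everything after the last punctuation character, with a bounded pattern that only removes short suffixes.

```r
# Hypothetical IDs; "8253015N/12" is an invented two-character suffix
ids <- c("MB-13-169454", "MB-13-212235.1", "570548260-2",
         "8253015N/12", "G140381883_1")

# Greedy variant: removes the last punctuation character and everything
# after it, which also truncates root IDs that contain punctuation.
greedy <- sub("[[:punct:]][^[:punct:]]*$", "", ids)

# Bounded variant: removes a punctuation character followed by one or two
# alphanumeric characters at the end of the string (the limit of two is
# an assumption).
bounded <- sub("[[:punct:]][[:alnum:]]{1,2}$", "", ids)

print(greedy)
print(bounded)
```

With the greedy pattern, "MB-13-169454" is cut down to "MB-13", while the bounded pattern leaves it intact. The trade-off is that a bounded pattern would still strip a genuine root ID that happens to end in punctuation plus one or two characters, so it is worth comparing the number of unique IDs before and after.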