Using gsub to extract only capital letters of a certain length

I have a string from which I wish to extract the country code, which will always be exactly three capital letters.

mystring
"Bloggs, Joe GBR London (1)/Bloggs, Joe London (2)" 
"Bloggs, Joe London (1)/Bloggs, Joe  GBR London (2)"  
"Bloggs, Joe London (1)/Bloggs, Joe London (2)" 
"Bloggs, Joe GBR London (1)/Bloggs, Joe GBR London (2)" 
 "Bloggs, J-S GBR London (1)/Bloggs, J-S GBR London (2)" 

What I’m trying to get

mystring
GBR/
/GBR
/
GBR/GBR
GBR/GBR

Blanks are fine where there is no country; I can deal with them.

I’ve tried a couple of things that I have seen on here. One removed all characters that aren’t capitals, but that leaves me with letters I don’t want, such as the capitals from the name and location. I then tried something similar, removing everything that doesn’t start and end with a capital, which also failed because of the names:

gsub("[^A-Z$]", "", mystring)

Keeping only the runs of exactly three capital letters might work, but I can’t quite get the code right. I think it would look something like the line below. Does anyone know how to fix it, or know a more robust method?

gsub("[^A-Z$]{3}", "", mystring)

>Solution :

I like stringr::str_extract for extracting patterns from strings. This lets you simply enter the pattern you want, rather than trying to replace everything else:

mystring = c("Bloggs, Joe GBR London (1)/Bloggs, Joe London (2)", 
"Bloggs, Joe London (1)/Bloggs, Joe  GBR London (2)"  ,
"Bloggs, Joe London (1)/Bloggs, Joe London (2)" ,
"Bloggs, Joe GBR London (1)/Bloggs, Joe GBR London (2)", 
 "Bloggs, J-S GBR London (1)/Bloggs, J-S GBR London (2)" 
)

## extract first matches
stringr::str_extract(mystring, "[A-Z]{3}")
# [1] "GBR" "GBR" NA    "GBR" "GBR"

## or get all matches with `str_extract_all`
stringr::str_extract_all(mystring, "[A-Z]{3}")
# [[1]]
# [1] "GBR"
# 
# [[2]]
# [1] "GBR"
# 
# [[3]]
# character(0)
# 
# [[4]]
# [1] "GBR" "GBR"
# 
# [[5]]
# [1] "GBR" "GBR"
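
Neither call above returns the `GBR/`-style output asked for in the question, and `[A-Z]{3}` would also match the first three letters of a longer run of capitals (`LONDON` would give `LON`). Below is a minimal sketch of one way to handle both, assuming every string contains exactly one `/` and at most one code on each side. The `\\b` word boundaries and the names `halves`, `pattern`, `left`, `right` and `country_codes` are illustrative additions, not part of the original answer:

```r
library(stringr)

mystring <- c(
  "Bloggs, Joe GBR London (1)/Bloggs, Joe London (2)",
  "Bloggs, Joe London (1)/Bloggs, Joe  GBR London (2)",
  "Bloggs, Joe London (1)/Bloggs, Joe London (2)",
  "Bloggs, Joe GBR London (1)/Bloggs, Joe GBR London (2)",
  "Bloggs, J-S GBR London (1)/Bloggs, J-S GBR London (2)"
)

# Split each string into the text before and after the first "/"
# (assumption: exactly one "/" per string)
halves <- str_split_fixed(mystring, "/", 2)

# \\b ... \\b keeps only stand-alone runs of exactly three capitals
pattern <- "\\b[A-Z]{3}\\b"
left  <- str_extract(halves[, 1], pattern)
right <- str_extract(halves[, 2], pattern)

# Replace NA (no code found) with "" and rejoin the two sides
country_codes <- paste(
  ifelse(is.na(left), "", left),
  ifelse(is.na(right), "", right),
  sep = "/"
)
country_codes
# [1] "GBR/"    "/GBR"    "/"       "GBR/GBR" "GBR/GBR"
```

A three-letter surname written entirely in capitals (for example `LEE`) would still match, so this relies on names being in mixed case as in the sample data.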

It is possible to do the same in base R using substring, or regmatches with regexpr, as shown in other answers.
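
A minimal base R sketch of the `regmatches` route, using `gregexpr` to collect every match per string. The pattern `\\b[A-Z]{3}\\b` (three capitals bounded by word edges), the `perl = TRUE` argument, and the names `all_codes` and `first_code` are assumptions added for illustration. One thing to watch: `regmatches` combined with `regexpr` drops strings that have no match, so the result can be shorter than the input. Going through the `gregexpr` list keeps one entry per string:

```r
mystring <- c(
  "Bloggs, Joe GBR London (1)/Bloggs, Joe London (2)",
  "Bloggs, Joe London (1)/Bloggs, Joe  GBR London (2)",
  "Bloggs, Joe London (1)/Bloggs, Joe London (2)",
  "Bloggs, Joe GBR London (1)/Bloggs, Joe GBR London (2)",
  "Bloggs, J-S GBR London (1)/Bloggs, J-S GBR London (2)"
)

# One list element per input string; character(0) where nothing matches
all_codes <- regmatches(
  mystring,
  gregexpr("\\b[A-Z]{3}\\b", mystring, perl = TRUE)
)

# First match per string, with NA where there is none
first_code <- vapply(
  all_codes,
  function(m) if (length(m) > 0) m[1] else NA_character_,
  character(1)
)
first_code
# [1] "GBR" "GBR" NA    "GBR" "GBR"
```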
