Extract anything between first underscore and last underscore with exceptions

I have a list of strings:

str1 <- "core21_ap_202003.xlsx"
str2 <- "core21_ap_thailand_202004.xlsx"
str3 <- "core17_eay_201008_b.xlsx"

strings <- list(str1, str2, str3)

I want to extract "ap", "ap_thailand", and "eay". I have tried:

gsub("_[^_]*$|^[^_]*_","", strings, perl=T)

Output:

[1] "ap"  "ap_thailand"  "eay_201008" 

This works for the first two strings, but not the last one: I need "eay", not "eay_201008".

In other words, a country name (here, Thailand) should be extracted only if it is present, and the date should never be extracted.

Desired output:

[1] "ap"  "ap_thailand"  "eay" 

Solution:

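Match everything up to the first underscore, then capture everything up to an underscore followed by six digits, then match whatever follows. Keep only the captured part between those underscores. Note that the `_` placeholder used with the native pipe requires R 4.2 or later.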

strings |>
  unlist() |>
  sub(".*?_(.*)_\\d{6}.*", "\\1", x = _) 
## [1] "ap"          "ap_thailand" "eay"   
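
For R versions older than 4.2, or when some file names might not contain a six-digit date, the same idea can be sketched without the pipe. The helper name `extract_middle` and the choice to return NA for non-matching input are assumptions for illustration, not part of the original answer; the pattern is also anchored and uses `[^_]*` in place of the lazy `.*?`.

```r
# Hypothetical helper (not from the original answer): extract the text between
# the first underscore and the underscore that precedes a six-digit date.
extract_middle <- function(x) {
  x <- unlist(x)
  # ^[^_]*_   everything up to and including the first underscore
  # (.*)      the part to keep
  # _\\d{6}   an underscore followed by six digits (the date)
  # .*$       whatever follows the date
  pattern <- "^[^_]*_(.*)_\\d{6}.*$"
  # sub() returns the input unchanged when nothing matches, so check with
  # grepl() first and return NA for those cases (an assumed design choice).
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

strings <- list("core21_ap_202003.xlsx",
                "core21_ap_thailand_202004.xlsx",
                "core17_eay_201008_b.xlsx")
extract_middle(strings)
## [1] "ap"          "ap_thailand" "eay"
```

Returning NA makes a file name without a date easy to spot, whereas plain `sub()` would silently hand back the original string.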