I have a list of strings:
str1 <- "core21_ap_202003.xlsx"
str2 <- "core21_ap_thailand_202004.xlsx"
str3 <- "core17_eay_201008_b.xlsx"
strings <- list(str1, str2, str3)
I want to extract "ap", "ap_thailand", and "eay". I have tried:
gsub("_[^_]*$|^[^_]*_", "", strings, perl = TRUE)
Output:
[1] "ap" "ap_thailand" "eay_201008"
This works for the first two strings, but not the last one. I need "eay", not "eay_201008".
In other words, the country name (here, Thailand) should be extracted only if it is present, and the date should never be extracted.
Desired output:
[1] "ap" "ap_thailand" "eay"
>Solution :
Match everything up to the first underscore, then capture everything up to an underscore followed by 6 digits, then match the rest of the string. Keep only the captured part between those underscores.
strings |>
unlist() |>
sub(".*?_(.*)_\\d{6}.*", "\\1", x = _)
## [1] "ap" "ap_thailand" "eay"
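As a variation on the same idea, here is a minimal sketch that anchors the pattern and wraps it in a function. The name `extract_region` is hypothetical (it is not from the answer above), and the sketch assumes the date is always a 6-digit block preceded by an underscore. It uses `perl = TRUE` so that the lazy quantifier follows PCRE rules.

```r
# Hypothetical helper, not part of the original answer.
# Assumes every file name looks like <prefix>_<part to keep>_<6-digit date>...
extract_region <- function(x) {
  x <- unlist(x)  # accept a list or a character vector
  # ^[^_]*_  : prefix up to and including the first underscore
  # (.*?)    : the part to keep, matched as short as possible
  # _\\d{6}  : an underscore followed by six digits (the date)
  # .*$      : the rest of the string
  sub("^[^_]*_(.*?)_\\d{6}.*$", "\\1", x, perl = TRUE)
}

strings <- list(
  "core21_ap_202003.xlsx",
  "core21_ap_thailand_202004.xlsx",
  "core17_eay_201008_b.xlsx"
)
extract_region(strings)
## [1] "ap"          "ap_thailand" "eay"
```

Note that `sub()` returns a string unchanged when the pattern does not match, so file names without an underscore-plus-6-digit date would pass through as-is.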