Extracting alphanumeric patterns from a character string whose values vary in R

Thanks in advance for any help, and apologies for not being able to figure this out from other examples.

I have a vector containing names of files such as:

vec = c("Img_1_(set1)_2L4_s.ext", "Img_37_(set19)_2R4_s.ext", "Img_187_(set94)_4L4_s.ext", "Img_77_(set39)_4R2_s.ext")

I want to create two separate, additional vectors by extracting:

1. The key letter (either L or R) between the two adjacent numbers, which vary from case to case, e.g., result: L, R, L, R

2. The "set" string plus the number attached to it between brackets (the number varies across cases), both with and without the brackets, e.g., result1: (set1), (set19), (set94), (set39); result2: set1, set19, set94, set39

Ideally using stringr, but I’m open to other (simpler?) libraries/functions.

For case 1, I tried str_extract(vec, "(?<= \\)_)[0-9]*") as a way to get the ")_" pattern followed by a number [0-9], but all I get in return are NAs (I think I’m not passing the ")" pattern correctly).

For case 2, I had to make do by simply extracting the set numbers with str_extract(vec, "(?<=set)[0-9]*") and creating another variable by pasting on the word "set"; obviously not ideal with large data frames.

Solution:

The set pattern is nice and easy: the letters "set" followed by one or more digits, "[0-9]+".

At least for your examples, it seems like the letters L and R don’t show up anywhere else, so we can do a very simple pattern for them too, just look for an L or an R: "L|R".

library(stringr)

# "set" followed by one or more digits
set = str_extract(vec, pattern = "set[0-9]+")
# the first L or R anywhere in the string
main = str_extract(vec, pattern = "L|R")
set
# [1] "set1"  "set19" "set94" "set39"
main
# [1] "L" "R" "L" "R"
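
As for why the lookbehind attempt in the question returned NAs, here is a small sketch of what seems to be going on. The pattern there contains a literal space before the escaped ")", so the lookbehind asks for " )_", which never occurs in these file names. Removing the space makes it match, but "[0-9]*" then returns the digit rather than the letter. The last pattern, which skips the digits inside the lookbehind, is only an illustration: the "{1,3}" bound is an assumption about how many digits can precede the letter (a lookbehind needs a bounded length).

```r
library(stringr)

vec <- c("Img_1_(set1)_2L4_s.ext", "Img_37_(set19)_2R4_s.ext",
         "Img_187_(set94)_4L4_s.ext", "Img_77_(set39)_4R2_s.ext")

# Pattern from the question: note the literal space before "\\)".
# The lookbehind requires " )_" before the match, which never occurs.
attempt <- str_extract(vec, "(?<= \\)_)[0-9]*")

# Without the space the lookbehind succeeds, but "[0-9]*" picks up the
# digit that follows ")_" instead of the letter.
no_space <- str_extract(vec, "(?<=\\)_)[0-9]*")

# Moving the digits into the lookbehind reaches the letter itself.
# "{1,3}" is an assumed upper bound on the number of leading digits.
letter <- str_extract(vec, "(?<=\\)_[0-9]{1,3})[LR]")

print(attempt)
print(no_space)
print(letter)
```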

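If you’re worried about potentially getting false hits on the L or R because they might show up elsewhere in the input, you could make the pattern more specific, for example looking behind for a digit "(?<=[0-9])" and looking ahead for a digit "(?=[0-9])". Use a character class "[LR]" (or a group "(L|R)") between the two, because a bare "L|R" would split the whole pattern at the "|" and attach each lookaround to only one of the letters:

main2 = str_extract(vec, pattern = "(?<=[0-9])[LR](?=[0-9])")
main2
# [1] "L" "R" "L" "R"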
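
To illustrate how "|" interacts with lookarounds, here is a minimal sketch using a made-up file name (the `tricky` string is hypothetical, not from the question) in which an "R" followed by a digit appears before the real key letter. Alternation has the lowest precedence in a regex, so "(?<=[0-9])L|R(?=[0-9])" is read as "(?<=[0-9])L" or "R(?=[0-9])".

```r
library(stringr)

# Hypothetical file name: "R2" appears before the real key letter "L" in "2L4".
tricky <- "R2_Img_1_(set1)_2L4_s.ext"

# Ungrouped: either "L preceded by a digit" or "R followed by a digit",
# so the leading "R" in "R2" is accepted.
ungrouped <- str_extract(tricky, "(?<=[0-9])L|R(?=[0-9])")

# Character class: both lookarounds apply to whichever letter is found.
grouped <- str_extract(tricky, "(?<=[0-9])[LR](?=[0-9])")

print(ungrouped)  # "R"
print(grouped)    # "L"
```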

And if you do want the parentheses with the set, escape them with a double backslash to include them in the pattern:

set_with_paren = str_extract(vec, pattern = "\\(set[0-9]+\\)")
set_with_paren
# [1] "(set1)"  "(set19)" "(set94)" "(set39)"
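
If you would rather get all three results from a single pattern, one possible approach is str_match() with capture groups. This is only a sketch: it assumes every file name has the shape "(setN)_<digits><L or R><digits>", as in the examples, and the variable names are just for illustration.

```r
library(stringr)

vec <- c("Img_1_(set1)_2L4_s.ext", "Img_37_(set19)_2R4_s.ext",
         "Img_187_(set94)_4L4_s.ext", "Img_77_(set39)_4R2_s.ext")

# Group 1: "(setN)" with parentheses; group 2: "setN"; group 3: the key letter.
m <- str_match(vec, "(\\((set[0-9]+)\\))_[0-9]+([LR])[0-9]+")

# Column 1 of the result is the full match; columns 2 to 4 are the groups.
set_with_paren <- m[, 2]
set_only <- m[, 3]
main <- m[, 4]

print(set_with_paren)
print(set_only)
print(main)
```

Rows that do not fit the assumed shape come back as NA in every column, which makes malformed file names easy to spot.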