How to extract the unique letter from each run of consecutive letters in a string?

My question might not be clear, so I'll explain my problem with a simple example.

Suppose I have the character string x = "AAATTTGGAA".

I want to split x into runs of consecutive identical letters: "AAA", "TTT", "GG", "AA".

Then the unique letter of each chunk is "A", "T", "G", "A", so the expected output is ATGA.

How can I get this?

I apologize if this is a duplicate, but I could not find an existing question about this problem.

>Solution :

Here is a useful regex trick approach:

x <- "AAATTTGGAA"
out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
out

[1] "AAA" "TTT" "GG"  "AA"
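
The split above yields the chunks, while the question asks for "ATGA". A minimal sketch of the remaining step, assuming every chunk is non-empty and made of one repeated letter, so its first character is its unique letter (the names chunks, first_letters and result are illustrative and not part of the original answer):

```r
x <- "AAATTTGGAA"

# Split into runs of identical letters, as in the answer above
chunks <- strsplit(x, "(?<=(.))(?!\\1)", perl = TRUE)[[1]]

# substr() is vectorised, so this takes the first character of every chunk
first_letters <- substr(chunks, 1, 1)

# Join the letters back into a single string
result <- paste(first_letters, collapse = "")
result
# [1] "ATGA"
```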

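The pattern passed to strsplit matches a zero-width position wherever the preceding character differs from the following one, so the string is cut at every boundary between two different letters.

(?<=(.))  lookbehind that also captures the preceding character in \1
(?!\\1)   negative lookahead asserting that the next character is not that captured character (\\1 is \1 escaped inside an R string)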
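
If only the final "ATGA" is needed and not the intermediate chunks, the same backreference idea can be applied without splitting. This is an alternative sketch and not part of the original answer; the variable names collapsed, runs and via_rle are illustrative. The first variant collapses each run with gsub, the second uses base R's run-length encoding function rle:

```r
x <- "AAATTTGGAA"

# Variant 1: replace any character followed by one or more copies of
# itself with a single copy of that character
collapsed <- gsub("(.)\\1+", "\\1", x, perl = TRUE)
collapsed
# [1] "ATGA"

# Variant 2: split into single characters, run-length encode them,
# and keep one value per run
runs <- rle(strsplit(x, "")[[1]])
via_rle <- paste(runs$values, collapse = "")
via_rle
# [1] "ATGA"
```

The rle variant also keeps the length of each run in runs$lengths, which may be useful if the chunk sizes matter later.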