Regular Expression Nucleotide Search

I am trying to find a regular expression that tells me whether a dinucleotide (two letters) appears twice in a row in my sequence. Here is an example:

Suppose I have this sequence (the ‘;’ characters are only there to make the dinucleotide boundaries clear):

"AT;GC;TA;CC;AG;AG;CC;CA;TA;TA"

I expect it to match AGAG and TATA.

I have already tried the following, but it fails because it matches any two consecutive dinucleotides, not the same dinucleotide repeated:

([ATGC]{2}){2}

>Solution :

You will need to use backreferences.

Start with matching one pair:

[ATGC]{2}

will match any pair of two of the four letters.

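You need to put that in capturing parentheses and refer back to the captured text with \1. Unlike ([ATGC]{2}){2}, which repeats the pattern, \1 matches only the exact text the group just captured:

([ATGC]{2});\1

This matches the sequence as written above, with a literal ';' between pairs. If the ';' characters are not actually part of your data, drop the separator:

([ATGC]{2})\1

Note that without separators the match is no longer tied to the dinucleotide boundaries, so it can also hit out-of-frame repeats (in the example, ATAT starting at the A of CA).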
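As a rough illustration of the backreference approach, here is a minimal Python sketch. It assumes the real sequence contains no ';' separators and that only repeats aligned to the dinucleotide boundaries (even offsets) are wanted. The function name `repeated_dinucleotides` and the `in_frame` flag are hypothetical, introduced only for this example. The pattern is wrapped in a lookahead so that every start position is tested, which makes it possible to filter by offset afterwards.

```python
import re

# Backreference pattern without the ';' separator, wrapped in a lookahead
# so that overlapping candidates at every position are examined.
# Group 1 is the full 4-letter repeat, group 2 is the dinucleotide.
REPEAT = re.compile(r"(?=(([ATGC]{2})\2))")


def repeated_dinucleotides(seq, in_frame=True):
    """Return (start, repeat) for each dinucleotide that is immediately repeated.

    With in_frame=True (an assumption about what the asker wants), only
    repeats starting at an even offset are kept, i.e. repeats that respect
    the dinucleotide boundaries.
    """
    hits = []
    for m in REPEAT.finditer(seq):
        if in_frame and m.start() % 2 != 0:
            continue  # skip repeats that straddle dinucleotide boundaries
        hits.append((m.start(), m.group(1)))
    return hits


# The example sequence from the question, with the separators stripped.
seq = "AT;GC;TA;CC;AG;AG;CC;CA;TA;TA".replace(";", "")

print(repeated_dinucleotides(seq))                  # in-frame repeats only
print(repeated_dinucleotides(seq, in_frame=False))  # includes out-of-frame ATAT
```

If the data really does contain the ';' separators, the pattern from the answer can be used directly, for example `re.findall(r"([ATGC]{2});\1", text)`, which returns the repeated dinucleotides themselves (here `AG` and `TA`).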