Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Capture integer in string and use it as part of regular expression

I’ve got a string:

s = ".,-2gg,,,-2gg,-2gg,,,-2gg,,,,,,,,t,-2gg,,,,,,-2gg,t,,-1gtt,,,,,,,,,-1gt,-3ggg"

and a regular expression I’m using

import re
delre = re.compile('-[0-9]+[ACGTNacgtn]+') #this is almost correct
print (delre.findall(s))

This returns:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

['-2gg', '-2gg', '-2gg', '-2gg', '-2gg', '-2gg', '-1gtt', '-1gt', '-3ggg']

But -1gtt and -1gt are not desired matches. The integer in this case defines how many subsequent characters to match, so the desired output for those two matches would be -1g and -1g, respectively.

Is there a way to grab the integer after the dash and dynamically define the regex so that it matches that many and only that many subsequent characters?

>Solution :

You can’t do this with the regex pattern directly, but you can use capture groups to separate the integer and character portions of the match, and then trim the character portion to the appropriate length.

import re

# surround [0-9]+ and [ACGTNacgtn]+ in parentheses to create two capture groups
delre = re.compile('-([0-9]+)([ACGTNacgtn]+)')  

s = ".,-2gg,,,-2gg,-2gg,,,-2gg,,,,,,,,t,-2gg,,,,,,-2gg,t,,-1gtt,,,,,,,,,-1gt,-3ggg"

# each match should be a tuple of (number, letter(s)), e.g. ('1', 'gtt') or ('2', 'gg')
for number, bases in delre.findall(s):
    # print the number, then use slicing to truncate the string portion
    print(f'-{number}{bases[:int(number)]}')

This prints

-2gg
-2gg
-2gg
-2gg
-2gg
-2gg
-1g
-1g
-3ggg

You’ll more than likely want to do something other than print, but you can format the matched strings however you need!

NOTE: this does fail in cases where the integer is followed by fewer matching characters than it specifies, e.g. -10agcta is still a match even though it only contains 5 characters.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading