Python Regex Lookaround newline behavior

I am using Python in Google Colab. Below are three strings that each repeat ‘abcd’, and I am trying to extract only ‘5678’ from each of them. For string2 I pressed Enter and then indented the second line with tabs and spaces. For string3 I only pressed Enter to move to the next line.

import re

string1 = 'abcd1234ppppabcd5678oooo'
string2 = '''abcd1234pppp
             abcd5678oooo'''
string3 = '''abcd1234pppp
abcd5678oooo'''

# Same pattern for all three strings: text preceded by 'abcd' and followed by 'oooo'
reg1 = re.search('(?<=abcd)(.*)(?=oooo)', string1)
print(reg1.group(0))
reg2 = re.search('(?<=abcd)(.*)(?=oooo)', string2)
print(reg2.group(0))
reg3 = re.search('(?<=abcd)(.*)(?=oooo)', string3)
print(reg3.group(0))

Here is the output:

1234ppppabcd5678
5678
5678

I can understand why I got the result I did for the first string, but why did the code ‘work’ for string2 and string3? Will regex automatically try to shorten the result if the string is broken up over multiple lines?

Solution:

According to the Python docs,

The special characters are:

. (Dot.) In the default mode, this matches any character except a newline. If the
DOTALL flag has been specified, this matches any character including a newline.
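
The quoted behaviour is easy to check directly. Here is a minimal sketch using a made-up two-line string (the string and variable names are only for illustration):

```python
import re

text = "ab\ncd"

# Default mode: "." does not match "\n", so "b.c" cannot match across the line break.
default_match = re.search(r"b.c", text)

# With re.DOTALL, "." also matches "\n".
dotall_match = re.search(r"b.c", text, re.DOTALL)

print(default_match)                 # None
print(repr(dotall_match.group(0)))   # 'b\nc'
```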

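Because the middle group is (.*) and . does not match a newline by default, a match can never span two lines. In string2 and string3, the attempt that starts after the first ‘abcd’ runs into the newline before it reaches ‘oooo’, so it fails and the engine moves on to the second ‘abcd’, which sits on the same line as ‘oooo’. That is why you get ‘5678’; nothing is being shortened. To let the match span lines, pass the re.DOTALL flag: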

reg1 = re.search('(?<=abcd)(.*)(?=oooo)', string1, re.DOTALL)
print(reg1.group(0))
reg2 = re.search('(?<=abcd)(.*)(?=oooo)', string2, re.DOTALL)
print(reg2.group(0))
reg3 = re.search('(?<=abcd)(.*)(?=oooo)', string3, re.DOTALL)
print(reg3.group(0))

or, alternatively,

pattern = re.compile('(?<=abcd)(.*)(?=oooo)', re.DOTALL)
for string in (string1, string2, string3):
    reg = pattern.search(string)
    print(reg.group(0))

This outputs

1234ppppabcd5678
1234pppp
             abcd5678
1234pppp
abcd5678
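
If the goal is to get only ‘5678’ from all three strings regardless of line breaks, one possible approach is to restrict what the middle part may contain instead of using .* there. The sketch below assumes the wanted value consists only of digits, which the question does not state, so adjust the character class if that does not hold:

```python
import re

string1 = 'abcd1234ppppabcd5678oooo'
string2 = '''abcd1234pppp
             abcd5678oooo'''
string3 = '''abcd1234pppp
abcd5678oooo'''

# Assumption: the value between 'abcd' and 'oooo' is made up of digits only.
# \d+ cannot run through 'pppp' or a newline, so only the second 'abcd' can match.
pattern = re.compile(r'(?<=abcd)\d+(?=oooo)')

results = [pattern.search(s).group(0) for s in (string1, string2, string3)]
print(results)  # ['5678', '5678', '5678']
```

Note that merely making the quantifier lazy, (.*?), would not help for string1: the leftmost match still starts after the first ‘abcd’.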