I am using Python in Google Colab. Shown below are three strings that each contain ‘abcd’ twice. I am trying to extract only ‘5678’ from each string. For string2, I pressed Enter and then indented the second line with tabs and spaces. For string3, I only pressed Enter to move to the next line.
import re

string1 = 'abcd1234ppppabcd5678oooo'
string2 = '''abcd1234pppp
abcd5678oooo'''
string3 = '''abcd1234pppp
abcd5678oooo'''
reg1 = re.search('(?<=abcd)(.*)(?=oooo)', string1)
print(reg1.group(0))
reg2 = re.search('(?<=abcd)(.*)(?=oooo)', string2)
print(reg2.group(0))
reg3 = re.search('(?<=abcd)(.*)(?=oooo)', string3)
print(reg3.group(0))
Here is the output:
1234ppppabcd5678
5678
5678
I can understand why I got the result I did for string1, but why did the code ‘work’ for string2 and string3? Will regex automatically try to shorten the match if the string is broken up over multiple lines?
>Solution :
From the Python documentation for the re module, in the list of special characters:
. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
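The rule quoted above can be checked in isolation. This is a minimal sketch; the two-line sample string `text` is made up purely for illustration and is not one of the strings from the question.

```python
import re

# Hypothetical two-line sample, chosen only for illustration.
text = "ab\ncd"

# Default mode: '.' stops at the newline, so '.*' covers only the first line.
default_match = re.search(r".*", text).group(0)

# With re.DOTALL, '.' also matches '\n', so '.*' covers the whole string.
dotall_match = re.search(r".*", text, re.DOTALL).group(0)

print(repr(default_match))  # 'ab'
print(repr(dotall_match))   # 'ab\ncd'
```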
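Because (.*) cannot cross a newline by default, the match in string2 and string3 has to begin and end on the same line. Only the second line contains oooo, so the match starts after the abcd on that line and yields 5678. Regex is not shortening anything; the first line simply cannot take part in the match. To let . match newlines as well, pass the re.DOTALL flag: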
reg1 = re.search('(?<=abcd)(.*)(?=oooo)', string1, re.DOTALL)
print(reg1.group(0))
reg2 = re.search('(?<=abcd)(.*)(?=oooo)', string2, re.DOTALL)
print(reg2.group(0))
reg3 = re.search('(?<=abcd)(.*)(?=oooo)', string3, re.DOTALL)
print(reg3.group(0))
Or, alternatively, compile the pattern once and loop over the strings:
pattern = re.compile('(?<=abcd)(.*)(?=oooo)', re.DOTALL)
for string in (string1, string2, string3):
    reg = pattern.search(string)
    print(reg.group(0))
This outputs:
1234ppppabcd5678
1234pppp
abcd5678
1234pppp
abcd5678
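Since the stated goal in the question was to get only 5678 from every string, here is one possible way to do that. It is not part of the answer above, just a sketch under two assumptions: the wanted text sits between the last abcd and oooo, and oooo occurs only once in the string. The idea is to forbid the match from running across another abcd.

```python
import re

string1 = 'abcd1234ppppabcd5678oooo'
string3 = '''abcd1234pppp
abcd5678oooo'''

# (?:(?!abcd).)* consumes one character at a time but refuses to step onto
# the start of another 'abcd', so a match that begins after the first
# 'abcd' can never reach 'oooo' and the search moves on to the second one.
# re.DOTALL lets '.' cross the newline, so both strings behave the same way.
pattern = re.compile(r'(?<=abcd)(?:(?!abcd).)*(?=oooo)', re.DOTALL)

results = [pattern.search(s).group(0) for s in (string1, string3)]
print(results)  # ['5678', '5678']
```

If oooo could appear more than once, the greedy * would extend to the last occurrence, so the pattern would need adjusting for that case.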