Why is the closing parenthesis missing when matching a multi-character group 0 or 1 times in a Python regex?

I only know how to match a single character 0 or 1 times in a regex, for example:

import re

content = "abc"
print(re.match(r'abc?', content))  # Match object, matched 'abc'
content = "ab"
print(re.match(r'abc?', content))  # Match object, matched 'ab'

Now I have two real cases:

content = "民国4年(1915年)2至3月"  # includes parentheses
#content = "民国4年2至3月"  # does not include parentheses
print(re.match(r'.*年(\(.{1,5}\))?', content).group())

The problem is that the actual result is 民国4年(1915年. I don't know why the closing parenthesis is missing.

Solution:

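.*年 is greedy and matches 民国4年(1915年 all by itself, because .* consumes everything up to the last 年. The trailing ? in (\(.{1,5}\))? makes the parenthesized part optional, so when the group is tried at the remaining text, which starts with ")" rather than "(", it simply matches zero times. The final result is therefore only what .*年 matched.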
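To see this directly, here is a small sketch that prints both the whole match and the optional group for the greedy pattern from the question. It uses only the standard re module; the variable name greedy is just for illustration.

```python
import re

content = "民国4年(1915年)2至3月"

# Greedy .* first runs to the end of the string, then backs off
# only as far as the LAST 年, which is the one inside the parentheses.
greedy = re.match(r'.*年(\(.{1,5}\))?', content)

print(greedy.group())   # the whole match
print(greedy.group(1))  # the optional group; None when it matched zero times
```

The second print shows None: the optional group never took part in the match, which is why the closing parenthesis is not in the result.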

Make .*年 non-greedy by using .*?年 and it will only match up to the first 年:

import re

content1 = "民国4年(1915年)2至3月"  # includes parentheses
content2 = "民国4年2至3月"  # does not include parentheses

print(re.match(r'.*?年(\(.{1,5}\))?', content1).group())
print(re.match(r'.*?年(\(.{1,5}\))?', content2).group())

Output:

民国4年(1915年)
民国4年
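If the pattern is used in more than one place, it could be wrapped in a small helper. This is only a sketch: the name leading_year is hypothetical, and the non-capturing group (?:...) is a substitution made here on the assumption that the parenthesized part is needed only for optionality, not for extraction. The helper also returns None instead of raising when nothing matches.

```python
import re

# Non-greedy prefix up to the first 年, then an optional "(...)" of 1-5 characters.
# (?:...) groups without capturing; leading_year is a hypothetical helper name.
PATTERN = re.compile(r'.*?年(?:\(.{1,5}\))?')

def leading_year(text):
    """Return the matched prefix, or None if the text contains no 年."""
    m = PATTERN.match(text)
    return m.group() if m else None

print(leading_year("民国4年(1915年)2至3月"))
print(leading_year("民国4年2至3月"))
print(leading_year("no year here"))
```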