
Why does my regex miss a date in the same format as one it finds later?

This is a line from the text

46318 16May2022 31May2022

These are my regex patterns:

date_day_month_year = (
    ' [0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4} |'
    '^[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4} |'
    '^[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}$|'
    ' [0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}$|'
    ' [0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4} |'  # at work here
    '^[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4} |'
    '^[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}$|'  # at work here
    ' [0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}$|'
    ' [A-Za-z]{3}[/., -]?[A-Za-z]{3}[/., -]+[0-9]{2,4}$|'
    ' [A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4} |'
    '^[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4} |'
    '^[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}$|'
    ' [A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}$'
)

Sorry it’s this big; the alternatives at work here are marked with a comment.

This is the code:

import re

text = '46318 16May2022 31May2022'  # the sample line from above
x = re.findall(date_day_month_year, text, flags=re.I | re.M)
for match in x:
    print(match)

This is the output:

46318 
 31May2022

Question:
How come it found 31May2022 but ignored 16May2022, which has basically the same format?
My best guess is that ‘46318 ‘ took the space after it and ‘ 31May2022’ took the space before it.
That leaves 16May2022 with no surrounding spaces and no beginning or end of line, so it doesn’t match.
But why does this happen, and how can I avoid it?
Should I now make every pattern a separate one and run them in a for loop?
And how could 31May2022, which comes after the 16th, ‘take’ a space from it?

P.S. If I add another alternative to my regex without the spaces (something like '[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}'), it matches many unnecessary things. This is why I included the spaces in the first place.

>Solution :

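It’s exactly as you said: 16May2022 has no space left around it. re.findall scans left to right and returns non-overlapping matches, so once ‘46318 ‘ has consumed its trailing space, the search resumes at the ‘1’ of 16May2022, where no alternative can start. The next position that can start a match is the space before 31May2022.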
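To see the consumption concretely, here is a minimal sketch. The pattern is a simplified stand-in for the alternation in the question (an assumption for illustration, not the original pattern): any token delimited by a literal space or by the start or end of the string. Printing the spans shows which characters each match owns.

```python
import re

text = "46318 16May2022 31May2022"

# Simplified stand-in for the original alternation (illustrative only):
# a run of word characters delimited by a literal space or by the
# start/end of the string. The delimiting spaces are part of the match.
consuming = r"(?:^| )\w+(?: |$)"

matches = [(m.group(), m.span()) for m in re.finditer(consuming, text)]
for matched, span in matches:
    print(repr(matched), span)
# '46318 ' (0, 6)
# ' 31May2022' (15, 25)
```

The space at index 5 belongs to the first match, so scanning resumes at index 6, where 16May2022 has neither a leading space nor a line start in front of it.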

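I suggest you use \b instead of spaces to anchor on word boundaries. This anchor is zero-width, so it does not ‘consume’ the space and leaves it available to the next match. In Python, write the pattern as raw strings (r'...'); in an ordinary string literal, '\b' is a backspace character rather than a word boundary.

date_day_month_year = r'\b[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}\b|' \
                      r'\b[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}\b|' \
                      r'\b[A-Za-z]{3}[/., -]?[A-Za-z]{3}[/., -]+[0-9]{2,4}\b|' \
                      r'\b[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}\b'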
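As a quick check, the word-boundary version can be run against the sample line from the question. This is a sketch using the same flags as the original call; the variable name `matches` is mine.

```python
import re

# Raw strings (r'...') keep \b as the regex word-boundary anchor;
# in a plain string literal '\b' would be a backspace character.
date_day_month_year = (
    r'\b[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}\b|'
    r'\b[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}\b|'
    r'\b[A-Za-z]{3}[/., -]?[A-Za-z]{3}[/., -]+[0-9]{2,4}\b|'
    r'\b[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}\b'
)

text = '46318 16May2022 31May2022'
matches = re.findall(date_day_month_year, text, flags=re.I | re.M)
print(matches)  # ['46318', '16May2022', '31May2022']
```

Both dates are now found, and the matches no longer carry surrounding spaces. Note that 46318 is still reported, as in the original output, because it fits the all-digits alternative (46, 3, 18); filtering that out would need a stricter pattern or a validation step afterwards.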

Also, I’d recommend using a regex debugging tool, e.g. https://regex101.com/
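If \b turns out to be too permissive for your data (it also fires next to punctuation such as ‘-‘), one possible alternative is to keep the "must be surrounded by whitespace or line edges" rule but express it with zero-width lookarounds, which do not consume anything either. The sketch below uses only the day / three-letter month / year alternative, and the sample text with the `id-` prefix is made up for illustration.

```python
import re

text = "id-16May2022 16May2022 31May2022"

# Only the day / three-letter month / year alternative, for brevity.
core = r"[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}"

word_boundary = re.compile(r"\b" + core + r"\b")
# (?<!\S) : not preceded by a non-whitespace character
# (?!\S)  : not followed by a non-whitespace character
space_delimited = re.compile(r"(?<!\S)" + core + r"(?!\S)")

wb_matches = word_boundary.findall(text)
sd_matches = space_delimited.findall(text)
print(wb_matches)  # ['16May2022', '16May2022', '31May2022']
print(sd_matches)  # ['16May2022', '31May2022']
```

The \b version picks the date out of `id-16May2022`, while the lookaround version only accepts tokens that stand on their own, which is closer to what the spaces in the original pattern were meant to enforce.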
