Why does a regex miss a match in the same format as one it finds later?

This is a line from the text:

46318 16May2022 31May2022

These are my regex patterns:

date_day_month_year = ' [0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4} |' \
                      '^[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4} |' \
                      '^[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}$|' \
                      ' [0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}$|' \
                      **' [0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4} |'** \
                      '^[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4} |' \
                      **'^[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}$|'** \
                      ' [0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}$|' \
                      ' [A-Za-z]{3}[/., -]?[A-Za-z]{3}[/., -]+[0-9]{2,4}$|' \
                      ' [A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4} |' \
                      '^[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4} |' \
                      '^[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}$|' \
                      ' [A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}$'

Sorry it's so long; I've marked the alternatives that are relevant here with **.

This is the code:

import re

# text holds the input, e.g. the line shown above
x = re.findall(date_day_month_year, text, flags=re.I | re.M)
for match in x:
    print(match)

This is the output:

46318 
 31May2022

Question:
How come it found 31May2022 but ignored 16May2022, which has basically the same format?
My best guess is that '46318 ' took the space after it, and ' 31May2022' took the space before it.
That leaves 16May2022 with no space and no beginning or end of line next to it, so it doesn't match.
But why does this happen, and how can I avoid it?
Should I make every pattern a separate regex and run them in a for loop?
And how could 31May2022, which comes after the 16th, 'take' a space from it?

P.S. If I add another alternative without the spaces (something like '[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}'), it will match many unnecessary things. This is why I included the spaces in the first place.

>Solution :

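It's exactly as you said: 16May2022 has no free space around it. re.findall scans left to right and returns non-overlapping matches, so the space after 46318 is consumed by the first match ('46318 '). When the scan reaches 16May2022, there is no space or line start left in front of it, so every alternative fails there. 31May2022 did not take anything from it; the space between the two dates was simply still free when the scan reached it.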
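
To see the consumption happen, here is a minimal sketch. The pattern `consuming` is a simplified stand-in for the question's pattern (any alphanumeric token with a literal space or a line start/end on each side), not the original regex. Printing the match spans shows that the first match ends after the space, so the second token has nothing left in front of it.

```python
import re

text = "46318 16May2022 31May2022"

# Simplified stand-in: a token that must be preceded by a space or line
# start and followed by a space or line end. The spaces are part of the match.
consuming = r'(?:^| )[0-9A-Za-z]+(?: |$)'

# Collect (span, matched text) so the consumed positions are visible.
matches = [(m.span(), m.group()) for m in re.finditer(consuming, text, flags=re.M)]
for span, group in matches:
    print(span, repr(group))
```

The first match covers positions 0 to 6, which includes the space at index 5. The scan resumes at index 6, directly on the '1' of 16May2022, where neither a space nor a line start is available.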

I suggest using \b instead of spaces to anchor on word boundaries. \b is a zero-width assertion: it does not consume the space, so the same space can delimit two adjacent matches.

# Raw strings (r'...') are needed: in a normal Python string, '\b' is a backspace character.
date_day_month_year = r'\b[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}\b|' \
                      r'\b[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}\b|' \
                      r'\b[A-Za-z]{3}[/., -]?[A-Za-z]{3}[/., -]+[0-9]{2,4}\b|' \
                      r'\b[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}\b'
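
As a quick check, the sketch below runs the word-boundary version against the line from the question. It also shows a stricter variant that is my own assumption, not part of the answer above: the lookarounds `(?<!\S)` and `(?!\S)` require whitespace or a line edge on both sides, like the original spaces did, but without consuming anything. The names `alternatives`, `word_boundary` and `whitespace_delimited` are made up for illustration.

```python
import re

text = "46318 16May2022 31May2022"

# The four alternatives from the pattern above, without their anchors.
alternatives = [
    r'[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}',
    r'[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}',
    r'[A-Za-z]{3}[/., -]?[A-Za-z]{3}[/., -]+[0-9]{2,4}',
    r'[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}',
]
body = '|'.join(alternatives)

# Word boundaries on both sides (equivalent to the pattern above).
word_boundary = re.compile(r'\b(?:' + body + r')\b', re.I | re.M)

# Assumed stricter variant: no non-whitespace character directly before
# or after the match. Lookarounds are zero-width, so no space is consumed.
whitespace_delimited = re.compile(r'(?<!\S)(?:' + body + r')(?!\S)', re.I | re.M)

wb_matches = word_boundary.findall(text)
ws_matches = whitespace_delimited.findall(text)
print(wb_matches)
print(ws_matches)

# The two differ when a date is glued to punctuation:
print(word_boundary.findall("id#12.05.2022"))         # \b accepts '#' as a boundary
print(whitespace_delimited.findall("id#12.05.2022"))  # requires whitespace instead
```

Both versions now return all three tokens, without surrounding spaces. Note that 46318 still matches the first alternative (it can be split as 4, 63, 18 or similar digit groups), so if that is unwanted the digit-only alternative would need mandatory separators. The lookaround variant is closer to the original intent from the P.S., since \b also fires next to punctuation.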

I'd also recommend using a regex debugging tool, e.g. https://regex101.com/
