This is a line from the text:
46318 16May2022 31May2022
These are my regex patterns:
date_day_month_year = ' [0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4} |' \
'^[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4} |' \
'^[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}$|' \
' [0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}$|' \
**' [0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4} |'** \
'^[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4} |' \
**'^[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}$|'** \
' [0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}$|' \
' [A-Za-z]{3}[/., -]?[A-Za-z]{3}[/., -]+[0-9]{2,4}$|' \
' [A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4} |' \
'^[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4} |' \
'^[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}$|' \
' [A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}$'
Sorry it’s this big; I’ve highlighted the alternatives that are in play here with (**).
This is code:
import re
x = re.findall(date_day_month_year, text, flags=re.I | re.M)
for match in x:
    print(match)
This is output:
46318
31May2022
Question:
How come it found 31May2022 and ignored 16May2022, when the two are basically the same?
My best guess is that ‘46318 ’ took its trailing space and ‘ 31May2022’ took its leading space.
That leaves 16May2022 with no spaces and no beginning or end of line around it, so it doesn’t match.
But why does this happen, and how can I avoid it?
Should I now make every pattern a separate one and run them in a for loop?
And how come 31May2022, which comes after 16May2022, could ‘take’ a space from it?
P.S. If I add another alternative to my regex (something like '[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}'), it will match a lot of unnecessary things. This is why I included the spaces in the first place.
>Solution :
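It’s exactly as you said. The space before 16May2022 was already consumed by the match ‘46318 ’, and re.findall only returns non-overlapping matches, so 16May2022 has neither a leading space nor a line start left to match against.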
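To make the consumed space visible, here is a minimal sketch. The `token` and `spaced` patterns are hypothetical, trimmed-down stand-ins for the original alternatives (not the asker's full regex); they only keep the "space or line edge on both sides" idea and print where each match starts and ends.

```python
import re

text = "46318 16May2022 31May2022"

# Hypothetical simplified token: 1-2 digits, an optional 3-letter month, 2-4 digits.
token = r"[0-9]{1,2}(?:[A-Za-z]{3})?[0-9]{2,4}"

# Delimiters are matched as real characters, so they become part of the match.
spaced = r"(?:^| )" + token + r"(?: |$)"

spans = [(m.start(), m.end(), m.group()) for m in re.finditer(spaced, text)]
for start, end, matched in spans:
    print(start, end, repr(matched))
# 0 6 '46318 '
# 15 25 ' 31May2022'
```

The first match ends at index 6, after the space, so the scan resumes at the `1` of 16May2022. From there no alternative can start until the next space at index 15, which is why only 31May2022 is found.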
I suggest you use \b instead of spaces to anchor on word boundaries. \b is a zero-width assertion, so it does not ‘consume’ the space and leaves it available for the next match. Note the r prefix on each string: in a normal Python string, '\b' is a backspace character rather than a word boundary.
date_day_month_year = r'\b[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}\b|' \
r'\b[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}\b|' \
r'\b[A-Za-z]{3}[/., -]?[A-Za-z]{3}[/., -]+[0-9]{2,4}\b|' \
r'\b[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}\b'
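As a quick check, the word-boundary pattern can be run against the sample line from the question. This is a sketch that assumes the pattern is written with raw strings and grouped in parentheses instead of backslash continuations; the `matches` name is only for illustration.

```python
import re

text = "46318 16May2022 31May2022"

# In a normal string "\b" is a backspace; the raw string keeps backslash + b for the regex engine.
assert "\b" == "\x08" and len(r"\b") == 2

date_day_month_year = (
    r"\b[0-9]{1,2}[/., -]?[0-9]{1,2}[/., -]?[0-9]{2,4}\b|"
    r"\b[0-9]{1,2}[/., -]?[A-Za-z]{3}[/., -]?[0-9]{2,4}\b|"
    r"\b[A-Za-z]{3}[/., -]?[A-Za-z]{3}[/., -]+[0-9]{2,4}\b|"
    r"\b[A-Za-z]{3}[/., -]?[0-9]{2}[/., -]{0,2}[0-9]{2,4}\b"
)

matches = re.findall(date_day_month_year, text, flags=re.I | re.M)
print(matches)
# ['46318', '16May2022', '31May2022']
```

Both dates are now found, and the matches no longer carry surrounding spaces. Note that 46318 is still reported, since it fits the all-digit alternative; filtering that out would need a stricter pattern.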
Also, I’d recommend using a regex debugging tool, e.g. https://regex101.com/