I’m using the following pattern to identify values in a document in Python:
\d{2}\/\d{4}\s{1,}\d{1,}(\.?)\d{1,},\d{1,}
I tested this pattern on https://regexr.com/, with this string:
11/2003 480,00 12/2003 480,00 12/2003 480.00,00
There it matches the three dates and values, but when I run it in Python with the same string, I get this result:
['.', '.', '.']
Only dots.
What could I possibly be missing?
>Solution :
Assumption: you're using re.findall.
If the pattern contains capture groups, re.findall returns only the text captured by those groups, not the whole match:
\d{2}\/\d{4}\s{1,}\d{1,}(\.?)\d{1,},\d{1,}
                        ^^^^^
                        Capturing group for an optional dot
11/2003 480,00 12/2003 480,00 12/2003 480.00,00
                                         ^
                                         This gets matched
Hence, this gives us:
re.findall(
    r"\d{2}\/\d{4}\s{1,}\d{1,}(\.?)\d{1,},\d{1,}",
    "11/2003 480,00 12/2003 480,00 12/2003 480.00,00")
-> ['', '', '.']
Removing the capture group:
re.findall(
    r"\d{2}\/\d{4}\s{1,}\d{1,}\.?\d{1,},\d{1,}",
    "11/2003 480,00 12/2003 480,00 12/2003 480.00,00")
-> ['11/2003 480,00', '12/2003 480,00', '12/2003 480.00,00']
Sidenote 1:
You can reduce x{1,} to x+
\d{2}\/\d{4}\s+\d+\.?\d+,\d+
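As a quick check, assuming the sample string from the question, the two spellings should produce identical matches. The variable names below are only illustrative:

```python
import re

text = "11/2003 480,00 12/2003 480,00 12/2003 480.00,00"

# Original quantifiers, with the capture group already removed.
verbose = r"\d{2}\/\d{4}\s{1,}\d{1,}\.?\d{1,},\d{1,}"
# Same pattern with every x{1,} written as x+.
compact = r"\d{2}\/\d{4}\s+\d+\.?\d+,\d+"

verbose_matches = re.findall(verbose, text)
compact_matches = re.findall(compact, text)
print(verbose_matches == compact_matches)  # True
```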
Sidenote 2:
I assume you added this group to make the separator and the digits after it optional, while still not allowing something like 123.,45. If so, you can wrap both in an optional non-capturing group:
re.findall(
    r"\d{2}\/\d{4}\s+\d+(?:\.\d+)?,\d+",
    "11/2003 480,00 12/2003 480,00 12/2003 480.00,00")
-> ['11/2003 480,00', '12/2003 480,00', '12/2003 480.00,00']
Sidenote 3:
If you want to refer to the capturing groups and still keep the whole match each time, use re.finditer instead of re.findall. It gives you an iterator over a Match object for every match, instead of just the captured groups.
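A minimal sketch of the re.finditer approach, assuming the pattern from the question with its capture group kept (written with + instead of {1,}, and without the unneeded backslash before the slash). The names pattern, text and results are only illustrative:

```python
import re

# The capture group (\.?) is kept on purpose here.
pattern = re.compile(r"\d{2}/\d{4}\s+\d+(\.?)\d+,\d+")
text = "11/2003 480,00 12/2003 480,00 12/2003 480.00,00"

# group(0) is the whole match, group(1) is the first capture group.
results = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]

for whole, dot in results:
    print(repr(whole), repr(dot))
# '11/2003 480,00' ''
# '12/2003 480,00' ''
# '12/2003 480.00,00' '.'
```

Each Match object also exposes m.start() and m.end() if you need the positions of the matches in the string.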