Unexpected behavior with regular expressions

I am trying to write a parser that detects bibliography footnotes using regular expressions, but one particular pattern is not working and I cannot figure out why. Here is the code where I isolated the problem.

import re
PATTERN = "[\\w ]+, [\\w ]+, (\\d+(\\-\\d+)?)\\."

match_A = re.search(PATTERN, "Author, Some Book, 51–66.")
match_B = re.search(PATTERN, "Author, Some Book, 60-61.")

print(match_A != None)
print(match_B != None)

SUB_PATTERN = "\\d+(\\-\\d+)?"

match_C = re.search(SUB_PATTERN, "51–66")
match_D = re.search(SUB_PATTERN, "60–61")

print(match_C != None)
print(match_D != None)

The result is:

False
True
True
True

But I expected all four to be True.
Can anybody reproduce this issue, or explain to me what is happening?

I am working on Windows 10. My Python version:

Python 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32

>Solution :

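Your dashes are different: the first string contains "–" (en dash, U+2013) and the second contains "-" (hyphen-minus, U+002D), and your pattern only accepts the hyphen. The SUB_PATTERN tests both print True only because the range part is optional and re.search is happy to match just the leading digits ("51" or "60"), so they never actually exercise the dash. To accept both characters, put them into a character class: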

PATTERN = "[\\w ]+, [\\w ]+, (\\d+([–-]\\d+)?)\\."
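If you want to confirm this yourself, here is a minimal sketch that prints the code point of each dash and re-runs both test strings against the fixed pattern. The helper name `describe_dashes` and the name `FIXED_PATTERN` are made up for this illustration; everything else is standard library.

```python
import re
import unicodedata

# Same fixed pattern as above, written as a raw string to avoid double backslashes.
FIXED_PATTERN = r"[\w ]+, [\w ]+, (\d+([–-]\d+)?)\."
SUB_PATTERN = r"\d+(\-\d+)?"  # the original sub-pattern from the question


def describe_dashes(text):
    """Return (char, code point, Unicode name) for every character that is
    not alphanumeric and not one of the plain separators ' ', ',', '.'."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if not ch.isalnum() and ch not in " ,."
    ]


samples = ["Author, Some Book, 51–66.", "Author, Some Book, 60-61."]

for s in samples:
    # Show which dash each string really contains, and whether the fix matches it.
    print(describe_dashes(s), re.search(FIXED_PATTERN, s) is not None)

# re.fullmatch must consume the whole string, so unlike re.search it
# exposes the fact that the original sub-pattern rejects the en dash.
print(re.fullmatch(SUB_PATTERN, "51–66") is not None)
print(re.fullmatch(SUB_PATTERN, "60-61") is not None)
```

Using re.fullmatch (or anchoring with ^ and $) in isolated tests like your SUB_PATTERN check avoids the false positives you got from re.search.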
