Why is this Python regular expression not ignoring accents?

I am using the following regular expression for a filter of an application that connects to a MongoDB database:

{"$regex": re.compile(r'\b' + re.escape(value) + r'\b', re.IGNORECASE | re.UNICODE)}

The regular expression meets my search criteria, but it does not ignore accents. For example:

The database entry is: "Escobar, el patrón del mal Colombia historia".

And I search for "El patron".

I do not get any result, because the accent on the letter "o" prevents the record from matching. How can I fix this? I thought the re.UNICODE flag would make the search ignore accents.

Solution:

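Because "o" and "ó" are different characters, and the regex compares them as such. re.UNICODE does not do what you think it does: it only makes classes such as \w, \b and \d follow Unicode rules instead of ASCII ones, and it is already the default for str patterns in Python 3. It never folds accented letters onto their base letters. You can read about it here: https://docs.python.org/3/library/re.html#re.ASCII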
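A quick way to see this is to run the pattern from the question directly in Python, outside MongoDB. This is only a sketch using the two strings from the question; the variable names entry and value are chosen here for illustration.

```python
import re

# Strings taken from the question.
entry = "Escobar, el patrón del mal Colombia historia"
value = "El patron"

# Same pattern as in the question; re.UNICODE changes nothing here
# because it is already the default for str patterns in Python 3.
pattern = re.compile(r'\b' + re.escape(value) + r'\b', re.IGNORECASE | re.UNICODE)

# "o" and "ó" are distinct code points, so the search finds nothing.
print(pattern.search(entry))

# The accented spelling does match, since IGNORECASE only folds case.
accented = re.compile(r'\b' + re.escape("El patrón") + r'\b', re.IGNORECASE)
print(accented.search(entry) is not None)
```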

You can solve this by preprocessing the strings, both the search value and the text being searched, so that accented characters are converted to their ASCII counterparts before the regex is applied. See: What is the best way to remove accents (normalize) in a Python unicode string?
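One common way to do this preprocessing is with the standard unicodedata module: decompose each character (NFD) and drop the combining marks. The sketch below assumes that approach; strip_accents is a helper name invented for this example, not something from the question or from any library.

```python
import re
import unicodedata

def strip_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFD splits "ó" into "o" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Strings taken from the question.
entry = "Escobar, el patrón del mal Colombia historia"
value = "El patron"

# Normalize BOTH sides, then build the same word-bounded pattern.
pattern = re.compile(r"\b" + re.escape(strip_accents(value)) + r"\b", re.IGNORECASE)
match = pattern.search(strip_accents(entry))

print(match.group(0))
```

Note that this normalizes the text in Python. With a MongoDB $regex filter the match runs against the stored values, so the accent-free text would also need to exist in the database, for example in an additional normalized field written at insert time. That field is an assumption of this sketch, not part of the original schema.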
