Why is this Python regular expression not ignoring accents?

I am using the following regular expression for a filter of an application that connects to a MongoDB database:

{"$regex": re.compile(r'\b' + re.escape(value) + r'\b', re.IGNORECASE | re.UNICODE)}

The regular expression meets my search criteria, but it does not ignore accents. For example:

The database entry is: "Escobar, el patrón del mal Colombia historia".

And I search for "El patron".

I do not get any results because the accent on the letter "o" prevents the record from matching. How can I fix this? I thought the re.UNICODE flag would make the search ignore accents.

>Solution :

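Because o and ó are different characters, and a regular expression only matches the characters you give it. re.UNICODE does not make matching accent-insensitive: it only makes \w, \d, \s and the \b word boundary follow Unicode rules instead of ASCII-only ones. In Python 3 that is already the default for str patterns, so the flag is redundant here. The documentation covers it under re.ASCII: https://docs.python.org/3/library/re.html#re.ASCII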
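To see this for yourself, here is a minimal sketch that runs the pattern from the question directly in Python against the example entry. The variable names are just for illustration, and it only exercises Python's re module, not the MongoDB query itself.

```python
import re

entry = "Escobar, el patrón del mal Colombia historia"
value = "El patron"

# The pattern from the question, with and without re.UNICODE.
with_flag = re.compile(r'\b' + re.escape(value) + r'\b', re.IGNORECASE | re.UNICODE)
without_flag = re.compile(r'\b' + re.escape(value) + r'\b', re.IGNORECASE)

# Neither pattern finds the entry, because "o" and "ó" are different characters.
found_with_flag = with_flag.search(entry) is not None
found_without_flag = without_flag.search(entry) is not None
chars_equal = "o" == "ó"

print(found_with_flag, found_without_flag, chars_equal)
```

The flag makes no difference to the result, which is why the search for "El patron" comes back empty either way.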

You can solve this by preprocessing the strings so that accented characters are converted to their unaccented ASCII counterparts before you run the regex. Both the search term and the text being searched need the same treatment, otherwise they still will not match. See: What is the best way to remove accents (normalize) in a Python unicode string?
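A minimal sketch of that preprocessing, using only the standard unicodedata module. The helper name strip_accents is hypothetical, not something from the question. Note that this demonstrates the idea in Python only. For the MongoDB filter, the assumption is that you would also have to store or query an accent-stripped copy of the text, since the database still holds "patrón".

```python
import re
import unicodedata

def strip_accents(text):
    """Return text with accents removed (hypothetical helper)."""
    # NFD splits "ó" into "o" plus a separate combining accent character.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

entry = "Escobar, el patrón del mal Colombia historia"
value = "El patron"

# Normalize both sides, then build the same pattern as in the question.
pattern = re.compile(r'\b' + re.escape(strip_accents(value)) + r'\b', re.IGNORECASE)
match = pattern.search(strip_accents(entry))

print(match.group(0) if match else None)
```

With both strings normalized, the word-boundary and case-insensitive behaviour of the original pattern stays the same, and only the accents are taken out of the comparison.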
