Python’s default encoding got me confused.
There is an á character in a text file’s content.
The file is saved as UTF-8 in Notepad.
When I omit the encoding='utf-8' argument in:
    with open(filename, encoding='utf-8') as f:
        for line in f:
            print(line)
it shows up as Ã¡.
When I do add the encoding='utf-8' part it shows up as á.
I am wondering what sys.getdefaultencoding() is useful for: it reports utf-8, yet I still had to pass encoding='utf-8' for the á to show up correctly in the output.
I’m using Python 3.
Extra edit:
I think the encoding that is used is probably latin-1 extended, since:
á in UTF-8 is the two bytes 0xC3 0xA1, and in latin-1 extended 0xC3 maps to Ã and 0xA1 maps to ¡
How could I verify that latin-1 extended will be used when not specifying encoding?
>Solution :
Read the docs in Built-in Functions -> open():
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
…
In text mode, if encoding is not specified the encoding used
is platform dependent: locale.getpreferredencoding(False) is
called to get the current locale encoding.
…
where locale.getpreferredencoding(do_setlocale=True) is documented as:
Return the encoding used for text data, according to user preferences.
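To verify which encoding open() will actually pick on your machine, you can ask both the locale module and an opened file object. The following is a minimal sketch, not an official recipe: default_open_encoding() and normalized() are hypothetical helper names made up for this illustration, and the temporary file only exists so there is something to open.

```python
import codecs
import locale
import os
import tempfile


def normalized(name: str) -> str:
    """Map encoding aliases (e.g. 'UTF-8', 'utf8') to one canonical codec name."""
    return codecs.lookup(name).name


def default_open_encoding() -> str:
    """Return the encoding open() chooses in text mode when none is passed."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path) as f:  # deliberately no encoding argument
            return f.encoding
    finally:
        os.remove(path)


print("open() default:                    ", normalized(default_open_encoding()))
print("locale.getpreferredencoding(False):", normalized(locale.getpreferredencoding(False)))
```

Both lines should report the same codec. On a Western European Windows installation that is typically cp1252 rather than utf-8, which would explain the output in the question, but the exact value depends on your system settings.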
sys.getdefaultencoding() is different (and independent):
Return the name of the current default string encoding used by the
Unicode implementation.
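In Python 3, sys.getdefaultencoding() is always utf-8 and is what str.encode() and bytes.decode() use when called without arguments; it plays no part in how open() decodes a file. The sketch below reproduces the symptom from the question without touching any file, under the assumption that the file was read with a single-byte Western encoding (the variable names are only illustrative).

```python
import sys

# "\u00e1" is the character á; Notepad stored it as two UTF-8 bytes.
utf8_bytes = "\u00e1".encode("utf-8")    # b'\xc3\xa1'

# Decoding those bytes with a one-byte-per-character codec yields two characters.
misread = utf8_bytes.decode("latin-1")   # 0xC3 -> Ã, 0xA1 -> ¡
correct = utf8_bytes.decode("utf-8")     # back to the single character á

print(utf8_bytes, repr(misread), repr(correct))

# cp1252 (the usual Western Windows code page) maps these two bytes the same way.
print(utf8_bytes.decode("cp1252") == misread)

# Independent of the locale encoding used by open():
print(sys.getdefaultencoding())
```

Since latin-1 and cp1252 agree on these two bytes, the Ã¡ output alone cannot tell you which of the two was used; checking f.encoding or locale.getpreferredencoding(False) can.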