How to make Python treat literal string as UTF-8 encoded string

I have some strings in Python loaded from a file. They look like lists, but are actually strings, for example:

example_string = '["hello", "there", "w\\u00e5rld"]'

I can easily convert it into an actual list of strings:

from typing import List

def string_to_list(string_list: str) -> List[str]:
    # Strip quotes and brackets, then split on commas
    converted = string_list.replace('"', '').replace('[', '').replace(']', '').split(',')
    return [s.strip() for s in converted]

as_list = string_to_list(example_string)
print(as_list)

This prints the following list of strings: ['hello', 'there', 'w\\u00e5rld']
The problem is the encoding of the last element of the list. It looks like this when I run print(as_list), but if I run

for element in as_list:
    print(element)

it returns

hello
there
w\u00e5rld

I don't know what happens to the first backslash; it seems to me like it is there to escape the second one in the encoding. How do I make Python just resolve the escaped character and print "wårld"? The problem is that it is a string, not bytes, so as_list[2].decode("UTF-8") does not work.

I tried using string.decode(), and I tried plain printing.

>Solution:

The correct way to decode that to a list of strings is not the insane set of string operations you’re performing. It’s just ast.literal_eval(example_string), which will handle Unicode escapes just fine:

    import ast
    
    example_string = '["hello", "there", "w\\u00e5rld"]'
    example_list = ast.literal_eval(example_string)
    for word in example_list:
        print(word)

which, assuming you have appropriate font support for the character, outputs:

hello
there
wårld
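
As for the backslash that seems to vanish in the question: the element only ever contains one backslash. Printing a list shows each element's repr, which doubles backslashes, while printing the element itself shows the raw text. A minimal sketch of that difference, with `element` as an illustrative name standing in for as_list[2]:

```python
# Same text as as_list[2] in the question: one backslash followed by u00e5
element = "w\\u00e5rld"

# The backslash counts once: w \ u 0 0 e 5 r l d
print(len(element))

# repr() is what print(a_list) uses for each item; it doubles the backslash
print(repr(element))

# str form, used by print(element), shows the single backslash as-is
print(element)
```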

If you absolutely needed to just fix Unicode escapes, the codecs module can be used for unicode_escape decoding, but in this case, you have a legal Python literal in a string, and ast.literal_eval can do all the work.
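
For completeness, a sketch of that codecs route, applied to the already-split list from the question. The helper name `decode_unicode_escapes` is made up for illustration, and the sketch assumes the input is pure ASCII apart from the escapes; text that already contains non-ASCII characters would need different handling:

```python
import codecs
from typing import List

def decode_unicode_escapes(text: str) -> str:
    # Hypothetical helper, not part of any library.
    # Assumes `text` is ASCII-only; encode() raises UnicodeEncodeError otherwise.
    return codecs.decode(text.encode("ascii"), "unicode_escape")

# The list produced by string_to_list in the question
as_list: List[str] = ["hello", "there", "w\\u00e5rld"]

decoded = [decode_unicode_escapes(s) for s in as_list]
for word in decoded:
    print(word)
```

This only resolves the escapes; it does not parse the list syntax, which is why ast.literal_eval is the simpler choice when the whole string is a valid Python literal.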
