How to convert UTF-8 notation to python unicode notation

Using Python 3.8, I would like to convert Unicode code point notation (such as U+00A0) to Python's escape notation:

s = 'U+00A0'
result = s.lower() # output  'u+00a0'

I want to replace u+ with \u:

result = s.lower().replace('u+','\u') 

But I get the error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

How can I convert the notation U+00A0 to \u00a0?

EDIT:

The reason I wanted \u00a0 is so that I can then call the encode method to get b'\xc2\xa0'.

My question: given a string in the notation U+00A0, I would like to convert it to the UTF-8 bytes b'\xc2\xa0'.

Solution:

You are mixing up the representation of a value with the value itself. A regular expression can turn the code point notation into the actual character:

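import re
s = 'U+00A0'
# raw string so that \+ reaches the regex engine intact; lower() so the pattern matches
re.sub(r"u\+([0-9a-f]{4})", lambda m: chr(int(m.group(1), 16)), s.lower())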

For u+00a0 the result is displayed as '\xa0', but that is exactly the same character you get from the literal \u00a0:

s = "\u00a0"
print(repr(s))  # '\xa0'
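
To make the representation-versus-value point concrete, here is a small sketch (the variable names are only illustrative) checking that the different spellings all produce the same one-character string:

```python
# Three spellings of the same character, NO-BREAK SPACE (U+00A0).
a = "\u00a0"        # 4-digit Unicode escape
b = "\xa0"          # 2-digit hex escape
c = chr(0x00A0)     # built from the integer code point

same = (a == b == c)   # the escapes differ only in the source code
length = len(a)        # a single character, not six
shown = repr(a)        # repr() chooses the \xa0 form for display

print(same, length, shown)
```

The escape you type in the source only affects how the literal is written; the resulting str object holds the code point itself.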

Once you have the proper value as a Unicode string, you can encode it to UTF-8:

s = "\xa0"
print(s.encode('utf8'))
# b'\xc2\xa0'
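
For comparison, the replace-based approach from the question can also be made to work, assuming the input is plain ASCII text. The backslash has to be escaped (or a raw string used) so that Python does not try to parse `\u` as an incomplete escape when compiling the literal, and the resulting six-character text then has to be interpreted at runtime, for example with the `unicode_escape` codec. A sketch:

```python
s = "U+00A0"

# "\\u" is a two-character string (backslash + u). A bare "\u" is a syntax
# error because the parser expects four hex digits right after it.
escaped_text = s.lower().replace("u+", "\\u")   # the 6 characters \u00a0

# Interpret the escape sequence at runtime to get the real character.
char = escaped_text.encode("ascii").decode("unicode_escape")

utf8_bytes = char.encode("utf-8")
print(len(escaped_text), repr(char), utf8_bytes)
```

This takes a detour through bytes and back, so the regex version is arguably the more direct of the two.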

So the final answer is:

import re
s = "u+00a0"
s2 = re.sub(r"u\+([0-9a-f]{4})", lambda m: chr(int(m.group(1), 16)), s)
s_bytes = s2.encode('utf8') # b'\xc2\xa0'
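
The pattern above only matches exactly four lowercase hex digits. As one possible generalization, the sketch below accepts upper or lower case and four to six hex digits, so code points outside the Basic Multilingual Plane also work. The function name `codepoints_to_utf8` is hypothetical and not part of the original answer.

```python
import re

# Matches "U+" followed by 4 to 6 hex digits, in either case.
CODEPOINT = re.compile(r"u\+([0-9a-f]{4,6})", re.IGNORECASE)

def codepoints_to_utf8(text: str) -> bytes:
    """Replace every U+XXXX token in text with its character, then encode as UTF-8."""
    chars = CODEPOINT.sub(lambda m: chr(int(m.group(1), 16)), text)
    return chars.encode("utf-8")

print(codepoints_to_utf8("U+00A0"))    # b'\xc2\xa0'
print(codepoints_to_utf8("U+1F600"))   # b'\xf0\x9f\x98\x80'
```

Note that the match is greedy, so hex digits immediately following a token are absorbed into it, and values above 10FFFF or in the surrogate range will raise an exception rather than be converted.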