Why is UTF-8 encoded the way it is?

If I understand correctly, UTF-8 uses the following pattern to tell the decoder how many bytes are used to encode a character:

Byte 1 Byte 2 Byte 3 Byte 4
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Etc. But aren’t there more compact patterns? For instance, what is stopping us from using something like this:

Byte 1 Byte 2 Byte 3 Byte 4
0xxxxxxx
10xxxxxx xxxxxxxx
110xxxxx xxxxxxxx xxxxxxxx
1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx
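To make the standard pattern concrete, here is a minimal Python sketch of an encoder that follows those bit layouts exactly (illustration only: it ignores the surrogate range and the U+10FFFF ceiling):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the standard UTF-8 bit patterns shown above.
    Illustration only: surrogates and the U+10FFFF limit are not checked."""
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert utf8_encode(ord("é")) == "é".encode("utf-8")    # 0xC3 0xA9
assert utf8_encode(ord("€")) == "€".encode("utf-8")    # 0xE2 0x82 0xAC
```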


Solution:

Your proposed encoding wouldn’t be self-synchronizing. If you landed in the middle of a stream on one of the unrestricted xxxxxxxx payload bytes, you’d have no way to tell whether you were in the middle of a character or not. If that byte happened to match 10xxxxxx, you could mistake it for the start of a two-byte character. The only way to avoid that mistake is to read the entire stream, error-free, from the beginning.
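To see the failure concretely, here is a hedged Python sketch of the proposed scheme (the encoder and the chosen code point are made up for illustration). A payload byte that happens to match 10xxxxxx is indistinguishable from a lead byte:

```python
def proposed_encode(cp: int) -> bytes:
    """Hypothetical encoder for the scheme in the question:
    only the first byte carries length information, later bytes are raw payload."""
    if cp < 0x80:                       # 0xxxxxxx                            -> 7 bits
        return bytes([cp])
    if cp < 0x4000:                     # 10xxxxxx xxxxxxxx                   -> 14 bits
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp < 0x200000:                   # 110xxxxx xxxxxxxx xxxxxxxx          -> 21 bits
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    # 1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx                                     -> 28 bits
    return bytes([0xE0 | (cp >> 24), (cp >> 16) & 0xFF, (cp >> 8) & 0xFF, cp & 0xFF])

stream = proposed_encode(0x8F41)        # three bytes: 0xC0 0x8F 0x41
# A reader that joins the stream at index 1 sees 0x8F = 0b10001111,
# which looks exactly like the lead byte of a two-byte character,
# so it would decode 0x8F 0x41 as a character that was never written.
```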

It’s an explicit design goal of UTF-8 to be self-synchronizing. If you land anywhere in a UTF-8 stream, the current byte alone tells you whether you’re in the middle of a character or not, and you need to skip at most 3 continuation bytes to reach the start of the next full character.
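In code, that resynchronization step is trivial precisely because every continuation byte matches 10xxxxxx. A minimal Python sketch (the helper name is invented for illustration):

```python
def next_boundary(buf: bytes, i: int) -> int:
    """Return the index of the next character boundary at or after i.
    In well-formed UTF-8 this skips at most 3 continuation bytes."""
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:   # 10xxxxxx -> continuation byte
        i += 1
    return i

text = "héllo".encode("utf-8")          # 'é' occupies indices 1 and 2
assert next_boundary(text, 2) == 3      # landing mid-character finds the next start
assert next_boundary(text, 3) == 3      # landing on a boundary stays put
```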
