Why is UTF-8 encoded the way it is?

If I understood correctly, UTF-8 uses the following patterns to let the computer know how many bytes are going to be used to encode a character:

    Byte 1     Byte 2     Byte 3     Byte 4
    0xxxxxxx
    110xxxxx   10xxxxxx
    1110xxxx   10xxxxxx   10xxxxxx
    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
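
To make the table concrete, here is a minimal sketch (my own Python, not part of the original question) that encodes a single code point by hand following exactly those bit patterns; it skips real-world validation such as rejecting surrogates and overlong forms.

    def encode_utf8(cp: int) -> bytes:
        # Hand-rolled encoder following the byte patterns above.
        # Sketch only: does not reject surrogates (U+D800..U+DFFF).
        if cp < 0x80:                      # 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:                     # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6),
                          0x80 | (cp & 0x3F)])
        if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        if cp <= 0x10FFFF:                 # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        raise ValueError("code point out of range")

    assert encode_utf8(0x20AC) == "\u20ac".encode("utf-8")   # U+20AC euro sign -> E2 82 AC

Note that every byte after the first starts with 10, which is exactly the redundancy the question asks about.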

The pattern continues for longer sequences. But aren't there more compact patterns? For instance, what is stopping us from using something like this:

    Byte 1     Byte 2     Byte 3     Byte 4
    0xxxxxxx
    10xxxxxx   xxxxxxxx
    110xxxxx   xxxxxxxx   xxxxxxxx
    1110xxxx   xxxxxxxx   xxxxxxxx   xxxxxxxx

Solution:

Your proposed encoding wouldn't be self-synchronizing. If you landed in the middle of a stream on one of the free-form xxxxxxxx bytes, you'd have no idea whether it was in the middle of a character or not, and if that byte happened to be 10xxxxxx, you could mistake it for the start of a character. The only way to avoid this mistake is to read the entire stream, error-free, from the beginning.
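
To see the failure concretely, here is a small sketch (my own Python; it assumes the 2-byte form of the proposed scheme simply packs 14 payload bits big-endian, which the question leaves unspecified) of a reader landing one byte into a stream:

    # Proposed scheme, 2-byte form only: 10xxxxxx xxxxxxxx (14 payload bits).
    def encode_proposed_2byte(cp: int) -> bytes:
        assert 0x80 <= cp < 0x4000
        return bytes([0x80 | (cp >> 8), cp & 0xFF])

    stream = encode_proposed_2byte(0x20AC) + b"A"   # U+20AC then 'A' -> A0 AC 41

    # A reader landing at offset 1 sees 0xAC, which matches 10xxxxxx, so it
    # treats it as the *start* of a 2-byte character and swallows 'A' too:
    lead, cont = stream[1], stream[2]
    bogus = ((lead & 0x3F) << 8) | cont             # 0x2C41: neither U+20AC nor 'A'

Nothing in the bytes themselves flags the mistake.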

It's an explicit design goal for UTF-8 to be self-synchronizing: if you land anywhere in a UTF-8 stream, you can tell immediately whether you're in the middle of a character, and you need to read at most 3 bytes to find the start of the next full character.
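
For instance, resynchronizing only requires skipping bytes of the form 10xxxxxx (byte & 0xC0 == 0x80); since at most three of them can appear in a row, the scan below (a sketch of mine, not from the original answer) never inspects more than 3 bytes:

    def next_char_start(buf: bytes, pos: int) -> int:
        # Skip continuation bytes (10xxxxxx) until the start of the next
        # character or the end of the buffer; at most 3 iterations.
        while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
            pos += 1
        return pos

    data = "a\u20acb".encode("utf-8")     # 61 E2 82 AC 62
    print(next_char_start(data, 2))       # landed mid-euro on 0x82 -> 4, the start of 'b'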
