Why is UTF-8 encoded the way it is?

If I understand correctly, UTF-8 uses the following pattern to tell the decoder how many bytes are used to encode a character:

Byte 1 Byte 2 Byte 3 Byte 4
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Etc. But aren’t there more compact patterns? For instance, what is stopping us from using something like this:

Byte 1 Byte 2 Byte 3 Byte 4
0xxxxxxx
10xxxxxx xxxxxxxx
110xxxxx xxxxxxxx xxxxxxxx
1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx
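To make the standard pattern concrete, here is a minimal Python sketch of an encoder that follows those bit layouts exactly (illustration only: it ignores the surrogate range and the U+10FFFF ceiling):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the standard UTF-8 bit patterns shown above.
    Illustration only: surrogates and the U+10FFFF limit are not checked."""
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert utf8_encode(ord("é")) == "é".encode("utf-8")    # 0xC3 0xA9
assert utf8_encode(ord("€")) == "€".encode("utf-8")    # 0xE2 0x82 0xAC
```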


Solution:

Your proposed encoding wouldn’t be self-synchronizing. If you landed in the middle of a stream on one of the unrestricted xxxxxxxx payload bytes, you’d have no way to tell whether you were in the middle of a character or not. If that byte happened to match 10xxxxxx, you could mistake it for the start of a two-byte character. The only way to avoid that mistake is to read the entire stream, error-free, from the beginning.
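To see the failure concretely, here is a hedged Python sketch of the proposed scheme (the encoder and the chosen code point are made up for illustration). A payload byte that happens to match 10xxxxxx is indistinguishable from a lead byte:

```python
def proposed_encode(cp: int) -> bytes:
    """Hypothetical encoder for the scheme in the question:
    only the first byte carries length information, later bytes are raw payload."""
    if cp < 0x80:                       # 0xxxxxxx                            -> 7 bits
        return bytes([cp])
    if cp < 0x4000:                     # 10xxxxxx xxxxxxxx                   -> 14 bits
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp < 0x200000:                   # 110xxxxx xxxxxxxx xxxxxxxx          -> 21 bits
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    # 1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx                                     -> 28 bits
    return bytes([0xE0 | (cp >> 24), (cp >> 16) & 0xFF, (cp >> 8) & 0xFF, cp & 0xFF])

stream = proposed_encode(0x8F41)        # three bytes: 0xC0 0x8F 0x41
# A reader that joins the stream at index 1 sees 0x8F = 0b10001111,
# which looks exactly like the lead byte of a two-byte character,
# so it would decode 0x8F 0x41 as a character that was never written.
```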

It’s an explicit design goal of UTF-8 to be self-synchronizing. If you land anywhere in a UTF-8 stream, the current byte alone tells you whether you’re in the middle of a character or not, and you need to skip at most 3 continuation bytes to reach the start of the next full character.
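In code, that resynchronization step is trivial precisely because every continuation byte matches 10xxxxxx. A minimal Python sketch (the helper name is invented for illustration):

```python
def next_boundary(buf: bytes, i: int) -> int:
    """Return the index of the next character boundary at or after i.
    In well-formed UTF-8 this skips at most 3 continuation bytes."""
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:   # 10xxxxxx -> continuation byte
        i += 1
    return i

text = "héllo".encode("utf-8")          # 'é' occupies indices 1 and 2
assert next_boundary(text, 2) == 3      # landing mid-character finds the next start
assert next_boundary(text, 3) == 3      # landing on a boundary stays put
```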
