If I understand correctly, UTF-8 uses the following pattern to tell the decoder how many bytes will be used to encode a character:
| Byte 1   | Byte 2   | Byte 3   | Byte 4   |
|----------|----------|----------|----------|
| 0xxxxxxx |          |          |          |
| 110xxxxx | 10xxxxxx |          |          |
| 1110xxxx | 10xxxxxx | 10xxxxxx |          |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Etc. But aren’t there more compact patterns? For instance, what is stopping us from using something like this:
| Byte 1   | Byte 2   | Byte 3   | Byte 4   |
|----------|----------|----------|----------|
| 0xxxxxxx |          |          |          |
| 10xxxxxx | xxxxxxxx |          |          |
| 110xxxxx | xxxxxxxx | xxxxxxxx |          |
| 1110xxxx | xxxxxxxx | xxxxxxxx | xxxxxxxx |
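
To make the comparison concrete, here is a rough sketch (my own illustration of the two tables above, not any real decoder) of how the sequence length would be read from the first byte under each scheme:

```python
def utf8_seq_len(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte (first table)."""
    if lead & 0b1000_0000 == 0b0000_0000: return 1  # 0xxxxxxx
    if lead & 0b1110_0000 == 0b1100_0000: return 2  # 110xxxxx
    if lead & 0b1111_0000 == 0b1110_0000: return 3  # 1110xxxx
    if lead & 0b1111_1000 == 0b1111_0000: return 4  # 11110xxx
    raise ValueError("10xxxxxx is a continuation byte, never a lead byte")

def compact_seq_len(lead: int) -> int:
    """Sequence length under the proposed compact scheme (second table)."""
    if lead & 0b1000_0000 == 0b0000_0000: return 1  # 0xxxxxxx
    if lead & 0b1100_0000 == 0b1000_0000: return 2  # 10xxxxxx
    if lead & 0b1110_0000 == 0b1100_0000: return 3  # 110xxxxx
    if lead & 0b1111_0000 == 0b1110_0000: return 4  # 1110xxxx
    raise ValueError("1111xxxx is unused in this 4-byte sketch")
```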
> Solution:
Your proposed encoding wouldn't be self-synchronizing. If you landed in the middle of a stream on an `xxxxxxxx` byte, you'd have no idea whether it's in the middle of a character or not. If that random byte happened to look like `10xxxxxx`, you could mistake it for the start of a character. The only way to avoid this mistake is to read the entire stream, error-free, from the beginning.
It's an explicit goal for UTF-8 to be self-synchronizing: if you land anywhere in a UTF-8 stream, you know immediately whether you're in the middle of a character or not, and you need to read at most 3 more bytes to find the start of the next full character.
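
For example, here is a minimal sketch (not a validating decoder, just the resynchronization step) showing how a reader dropped at an arbitrary offset finds the next character boundary by skipping continuation bytes, which in UTF-8 are exactly the bytes matching `10xxxxxx`:

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Index of the first byte at or after pos that starts a character."""
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1  # skip continuation bytes (10xxxxxx); at most 3 in a row
    return pos

stream = "a🙂b".encode("utf-8")    # bytes: 0x61, F0 9F 99 82, 0x62
print(next_char_start(stream, 2))  # 5: lands on 'b' after skipping 3 continuation bytes
```

With your compact encoding the loop above has nothing to test for: the trailing `xxxxxxxx` bytes can hold any value, including values that look like a lead byte, so no local check can tell you where the next character starts.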