What/Where is the formal specification of BaseN (e.g., Base64) encodings?


TL;DR: Is there such a thing as formal specifications for these encodings (such as ISO or other national/international standards), or is it mostly up to the developers as a general technique?


I started to go down this rabbit hole when I came across this passage (from this PhD thesis):

That is, interpreting c as a base-256 encoding of some number, with digits from least significant to most significant (i.e., a little-endian number), we print the number in base-32 with digits from most significant to least significant. (Note that the outer summation denotes string concatenation, while the inner summation denotes integer addition.) The set of digits is

digits32 = "0123456789abcdfghijklmnpqrsvwxyz"

i.e., the alphanumerics excepting the letters e, o, u, and t. This is to reduce the possibility that hash representations contain character sequences that are potentially offensive to some users (a known possibility with alphanumeric representations of numbers [11]).
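
To check my understanding, here is a rough Python sketch of how I read that passage (my own code, not the thesis's), using the digits32 alphabet quoted above:

    digits32 = "0123456789abcdfghijklmnpqrsvwxyz"

    def hash_to_base32(c: bytes) -> str:
        # Interpret c as a little-endian base-256 number (byte i contributes
        # c[i] * 256**i), then print that number in base-32, most significant
        # digit first.
        n = sum(byte * 256**i for i, byte in enumerate(c))
        out = ""
        while True:
            out = digits32[n % 32] + out
            n //= 32
            if n == 0:
                return out

    print(hash_to_base32(b"\x13\x88"))  # 0x13 + 0x88*256 = 34835 -> "120k"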

Not having much experience with the topic, I started out with the basics in the following order:

None of these mention Base256, but so far this is how I would sum up what BaseN encodings are (in a very oversimplified and sloppy way):

Encoding schemes to represent binary data in textual format, based on a set
of characters (e.g., chosen arbitrarily by the developer, or defined by a
standard/specification), where the size of the set forms the base of the
encoding scheme (e.g., Base64 - 64 characters).

I chose the word "arbitrarily" because RFC4648's Base32 definition differs from the Base32 used in the paper (i.e., the character sets obviously differ, at least).

As for Base256, the paper doesn't mention it again, and when I searched for "Base256", "base-256", "base 256", etc., I only found implementations but no formal specifications whatsoever. These also seem similar in name only (another reason I used the word "arbitrarily" above):

  • base256-encoding: "Base256 encoding, a.k.a. latin1 encoding, the most memory-efficient encoding possible in JavaScript."

    I couldn't find much about "latin1 Base256 encoding", but I presume that the Base256 implementation in this project uses the Latin-1 character set as a basis.

  • base-256: "encode and decode base256 encoding as gnu-tar does (supported range is -9007199254740991 to 9007199254740991)."

    I looked up the GNU tar manual's "GNU Extensions to the Archive Format" section, where the relevant paragraph states the following (a decode sketch follows the quote):

    For fields containing numbers or timestamps that are out of range for the basic format, the GNU format uses a base-256 representation instead of an ASCII octal number. If the leading byte is 0xff (255), all the bytes of the field (including the leading byte) are concatenated in big-endian order, with the result being a negative number expressed in two’s complement form. If the leading byte is 0x80 (128), the non-leading bytes of the field are concatenated in big-endian order, with the result being a positive number expressed in binary form. Leading bytes other than 0xff, 0x80 and ASCII octal digits are reserved for future use, as are base-256 representations of values that would be in range for the basic format.
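
Based only on that paragraph (I haven't checked it against GNU tar's actual source), a minimal decode sketch in Python would look something like this:

    def decode_tar_base256(field: bytes) -> int:
        # Decode a GNU tar numeric header field, as described in the quoted
        # paragraph (a sketch of my understanding, not tar's implementation).
        if field[0] == 0x80:
            # Positive number: the non-leading bytes, concatenated big-endian.
            return int.from_bytes(field[1:], byteorder="big", signed=False)
        if field[0] == 0xFF:
            # Negative number: all bytes (leading 0xff included), big-endian,
            # interpreted as two's complement.
            return int.from_bytes(field, byteorder="big", signed=True)
        raise ValueError("not a base-256 field (leading byte is neither 0x80 nor 0xff)")

    # Example: a 12-byte size field holding 2**33, which is too large for the
    # traditional 11-digit octal representation.
    field = b"\x80" + (2**33).to_bytes(11, byteorder="big")
    print(decode_tar_base256(field))  # -> 8589934592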

Solution:

When looking for a formal specification, RFCs, ISO standards, or IEEE standards are what you normally want. The specification for base-N encodings is RFC4648.

That being said, base-256 encodings serve a completely different purpose than the base-N ones you linked.

Base-16 through base-64 are designed to encode binary data when we only have a limited character set available. Quoting RFC4648:

Base encoding of data is used in many situations to store or transfer
data in environments that, perhaps for legacy reasons, are restricted
to US-ASCII [1] data. Base encoding can also be used in new
applications that do not have legacy restrictions, simply because it
makes it possible to manipulate objects with text editors.
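
For illustration, here is what these encodings look like with Python's standard base64 module, which implements the RFC4648 alphabets (the sample bytes are picked arbitrarily):

    import base64

    data = bytes([0xFB, 0xEF, 0xBE])  # arbitrary binary data

    # Only characters from the respective RFC4648 alphabets appear in the output.
    print(base64.b16encode(data))  # b'FBEFBE'
    print(base64.b32encode(data))  # b'7PX34==='
    print(base64.b64encode(data))  # b'++++'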

There is no standard base-N encoding other than the ones described in the RFC because, in practice, other bases didn't really matter: we might be able to pack slightly more data by using every character allowed in a given environment, but we would lose a lot of portability and risk our code breaking after updates.

However, base-256 encodings generally serve to store code points. A byte can already hold 256 different values, so in a way, binary data is already stored in base-256.

Code points are what we commonly think of as a character. For example, each Unicode character is a single code point. However, the issue we run into is that we can't just store code points as-is. Any code point fits into 4 bytes, but it would be quite inefficient to store them that way, considering most languages won't need nearly that much space per character. Generally, base-256 encodings are ways to encode a list of code points into as few bytes as possible.
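
As a rough illustration of that trade-off (the sample string is picked arbitrarily), compare a fixed 4-bytes-per-code-point encoding such as UTF-32 with UTF-8:

    text = "hello, κόσμος"  # 7 ASCII code points plus 6 Greek code points

    print(len(text))                   # 13 code points
    print(len(text.encode("utf-32")))  # 56 bytes: 4 per code point plus a 4-byte BOM
    print(len(text.encode("utf-8")))   # 19 bytes: 1 per ASCII char, 2 per Greek char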

UTF-8 is generally the most popular approach for encoding code points, since it gives a decent solution for any value and allows us to quickly distinguish between characters no matter where we start reading. Here is a rough summary of how it works, from RFC3629:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
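
For example, picking one code point from each row of the table above:

    for ch in ("A",     # U+0041  -> 1 byte
               "é",     # U+00E9  -> 2 bytes
               "€",     # U+20AC  -> 3 bytes
               "𐍈"):    # U+10348 -> 4 bytes
        print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")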
