Zooko O'Whielacronx
November 2002
edited for clarity February 2007
corrected November 2009

human-oriented base-32 encoding


INTRO

The base-32 encoding implemented in this library differs from that described
in RFC 3548 [1] in several ways. This document describes why we made each
different choice, and also includes a section at the end on "COMPATIBILITY
AND INTEROPERATION". This encoding is implemented in a project named
z-base-32 [2].

This is version 0.9.4.6 of this document. The latest version should always
be available at:

http://zooko.com/repos/z-base-32/base32/DESIGN


RATIONALE

The rationale for base-32 encoding in RFC 3548 [1] is as written therein:
"The Base 32 encoding is designed to represent arbitrary sequences of octets
in a form that needs to be case insensitive but need not be humanly
readable."

The rationale for our encoding is different -- it is to represent arbitrary
sequences of octets in a form that is as convenient as possible for human
users to manipulate. In particular, z-base-32 was created in order to serve
the Mnet project [3], where 30-octet cryptographic values are encoded into
URIs for humans to manipulate. Anticipated uses of these URIs include
cut-and-paste, text editing (e.g. in HTML files), manual transcription via a
keyboard, manual transcription via pen-and-paper, vocal transcription over
phone or radio, etc.

The desiderata for such an encoding are:

 * minimizing transcription errors -- e.g. the well-known problem of
   confusing `0' with `O'
 * embedding into other structures -- e.g. search engines, structured or
   marked-up text, file systems, command shells
 * brevity -- shorter URLs are better than longer ones
 * ergonomics -- human users (especially non-technical ones) should find
   the URIs as easy and pleasant as possible; the uglier the URI looks,
   the worse


DESIGN

Base

The first decision we made was to use base-32 instead of base-64. An
earlier version of this project used base-64, but a discussion on the
p2p-hackers mailing list [4] convinced us that the added length of base-32
encoding was worth the added clarity provided by: case-insensitivity, the
absence of non-alphanumeric characters, and the ability to omit a few of
the most troublesome alphanumeric characters. In particular, we realized
that it would probably be faster and more comfortable to vocally read off a
base-32 encoded 30-octet string (240 bits at 5 bits per character = 48
characters; case-insensitive, no non-alphanumeric characters) than a
base-64 encoded one (240 bits at 6 bits per character = 40 characters;
case-sensitive, plus two non-alphanumeric characters).

Alphabet

There are 26 alphabet characters and 10 digits, for a total of 36
characters available. We need only 32 characters for our base-32 alphabet,
so we can choose four characters to exclude. This is where we part company
with traditional base-32 encodings. For example, [1] eliminates `0', `1',
`8', and `9'. This choice eliminates two characters that are relatively
unambiguous (`8' and `9') while retaining others that are potentially
confusing. Others have suggested eliminating `0', `1', `O', and `L', which
is likewise suboptimal.

Our choice of confusing characters to eliminate is: `0', `l', `v', and `2'.
Our reasoning is that `0' is potentially mistaken for `o', that `l' is
potentially mistaken for `1' or `i', that `v' is potentially mistaken for
`u' or `r' (especially in handwriting), and that `2' is potentially
mistaken for `z' (especially in handwriting).

Note that we choose to focus on typed and written transcription more than
on vocal, since humans already have a well-established system of
disambiguating spoken alphanumerics, such as the United States military's
"Alpha Bravo Charlie Delta" and telephone operators' "Is that `d' as in
`dog'?".

Order of Alphabet

Some of the alphabet characters will appear more frequently than others in
the final position of the encoded strings, assuming an even distribution of
binary inputs. (This is true whether the lengths of your inputs are evenly
distributed over all integer numbers of bits *or* evenly distributed over
all integer numbers of octets.)

Here is a table showing which length-in-bits (modulo 5) results in which
possible trailing characters in the "abcdefghijklmnopqrstuvwxyz234567"
encoding:

 1: aq
 2: aqiy
 3: aqiyemu4
 4: aqiyemu4cgkosw26
 0: abcdefghijklmnopqrstuvwxyz234567

Here is the same table for our alphabet:

 1: yo
 2: yoea
 3: yoearcwh
 4: yoearcwhngkq1s46
 0: ybndrfg8ejkmcpqxot1uwisza345h769  (the whole alphabet)

We have permuted the alphabet to make the more commonly occurring
characters also be those that we think are easier to read, write, speak,
and remember.
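These tables can be derived mechanically: if the final quintet carries only
n significant bits, its unused low-order bits are zero, so the alphabet
index of the trailing character must be a multiple of 2^(5-n). A minimal
sketch in Python 3 (illustrative only, not part of the z-base-32 library):

    ALPHABET = "ybndrfg8ejkmcpqxot1uwisza345h769"

    for bits_mod_5 in (1, 2, 3, 4, 0):
        n = bits_mod_5 or 5      # a length divisible by 5 fills the final quintet
        step = 2 ** (5 - n)      # trailing index must be a multiple of this
        print(bits_mod_5, ALPHABET[::step])

This prints the same character sets as the table above, though in
alphabet-index order rather than the table's order (e.g. `2 yeoa' for the
table's `2: yoea').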
Length Encoding and Sub-Octet Data

Suppose you have 10 bits of data to transmit, and the recipient (the
decoder) is expecting 10 bits of data. All previous base-32 encoding
schemes assume that the binary data to be encoded is in 8-bit octets, so
you would have to pad the data out to 2 octets and encode it in base-32,
resulting in a string 4 characters long. The decoder will decode that into
2 octets (16 bits), ignoring the least significant 4 bits of the encoded
string, and then ignore the least significant 6 bits of the decoded data.

In the base-32 encoding described here, if the encoder and decoder both
know the exact length of the data in bits (modulo 40), then they can use
this shared information to optimize the size of the transmitted (encoded)
string. In the example above, where you have 10 bits of data to transmit,
z-base-32 allows you to transmit the optimal encoded string: two
characters.

If the encoder and decoder aren't both designed with the requirement that
they will know the exact length of the data in bits, then they can of
course assume that it is "the smallest number of octets which would have
required this many quintets to encode", which is simple and unambiguous but
can occasionally cost an extra byte or two of encoded size.

You can always use this encoding the same way you would use the other
encodings -- with an "input is in 8-bit octets" assumption. This would be
appropriate if the length in bits is always a multiple of 8, if both sides
are not sure of the length in bits modulo 40, or if this encoding is being
used in a way such that optimizing one or two characters out of the encoded
string isn't worth the potential confusion.
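As a concrete illustration, here is a minimal bit-length-aware encoder,
sketched in Python 3 (the function name and signature are ours, not the
z-base-32 library's API; it assumes the data arrives as whole octets
together with a separate count of significant bits, most significant bit
first):

    ALPHABET = "ybndrfg8ejkmcpqxot1uwisza345h769"

    def encode(data: bytes, bits: int) -> str:
        """Encode the `bits` most significant bits of `data` into z-base-32."""
        nchars = (bits + 4) // 5             # one character per 5 bits, rounded up
        value = int.from_bytes(data, "big")  # treat the octets as one big integer
        value >>= len(data) * 8 - bits       # drop the insignificant low-order bits
        value <<= nchars * 5 - bits          # left-align onto a quintet boundary
        return "".join(ALPHABET[(value >> 5 * i) & 31]
                       for i in reversed(range(nchars)))

    # The 10-bit examples need only two characters, not the four that an
    # octet-oriented encoding would require (cf. the EXAMPLES table below):
    assert encode(b"\x00\x00", 10) == "yy"   # bits 0000000000
    assert encode(b"\x80\x80", 10) == "on"   # bits 1000000010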
Padding

Traditionally, base-32 encodings have specified trailing padding to round
out the number of characters to an even multiple of 8. This is apparently
intended as an error detection code, but we do not consider the error
detection capabilities of this code to be worth the increased length of the
encoded strings, so we do not do this.

Letter case

Lower case is easier to read. That's why people have been using it
preferentially since around the 9th century CE. Isn't it about time that
software engineers took advantage of mankind's millennia-old knowledge of
typography instead of thoughtlessly aping the hardware limitations of
terminals used in the second half of the 20th century?


EXAMPLES

 #bits  base-2                          base-32   base-64   z-base-32
 -----  ------                          -------   -------   ---------
 1      0                               AA======  AA==      y
 1      1                               QA======  gA==      o
 2      01                              IA======  QA==      e
 2      11                              YA======  wA==      a
 10     0000000000                      AAAA====  AAA=      yy
 10     1000000010                      QCAA====  gIA=      on
 20     10001011100010001000            ROEIQ===  i4iI      tqre
 24     111100001011111111000111        6C74O===  8L/H      6n9hq
 24     110101000111101000000100        2R5AI===  1HoE      4t7ye
 30     111101010101011110111101000011  6VL32DA=  9Ve9DA==  6im54d

(In each row the bits are left-aligned: the base-32 and base-64 columns pad
the input on the right with zero bits to fill whole octets.)


A NOTE ON COMPATIBILITY AND INTEROPERATION

If your application could interoperate with an extant standard, then you
should use RFC 3548 base-32 in order to facilitate interoperation by
encoding semantically identical objects into syntactically identical
representations. For example, many current systems include the SHA-1 hash
of the contents of a file, and this hash value can be represented for user
or programmatic sharing in base-32 encoded form [5, 6, 7, 8]. These four
systems all use RFC 3548 base-32 encoding. If your system will expose the
SHA-1 hash of the contents of a file, then you should make sure those hash
values are easily exchangeable with those systems by using the same
encoding (including base, alphabet, permutation of alphabet,
length-encoding, padding, and treatment of illegal characters and
line-breaks).

If, however, the semantic meaning of the objects that you are exposing is
not something that can be understood by another extant system, due to
semantic differences, then you gain nothing with regard to interoperation
by using the same ASCII encoding, and in fact by doing so you *hamper*
interoperation by making it impossible for the applications to use
syntactic features to disambiguate between semantic features. Lucas Gonze
has suggested [9] that different schemes could in fact deliberately add
characters which would be illegal in another scheme in order to enable
syntactic differentiation. (This would be morally similar to the "check
digit" included in most credit card numbers.)

Obviously the most reliable semantic differentiation is an unambiguous one
that is transmitted out-of-band (outside of the encoded string, that is),
such as URI scheme names (e.g.: SHA1:blahblahblah or mnet://blahblahblah).
However, users might not always pay the cost to preserve those.


REFERENCES

[1] http://www.faqs.org/rfcs/rfc3548.html
[2] http://zooko.com/repos/z-base-32
[3] http://mnetproject.org/
[4] http://zgp.org/pipermail/p2p-hackers/2001-October/
[5] Gnutella [need URL for SHA1 and base-32 encoding stuff]
[6] Bitzi [need URL for specification stuff]
[7] CAW [need URL]
[8] http://open-content.net/specs/draft-jchapweske-thex-01.html
[9] http://zgp.org/pipermail/p2p-hackers/2002-November/000924.html
[10] http://zgp.org/pipermail/p2p-hackers/2002-November/000927.html


NEEDED TO ADD

 * possible new design element: optional check characters (e.g. a Luhn-like
   algorithm; see the sketch below)
 * full spec, including the issues named by
   draft-josefsson-base-encoding-04.txt such as treatment of illegal
   chars, etc.
   + also issues of length-encoding, scheme-encoding (e.g. URIs), etc.
 * explanation of why we avoid non-alphanumerics
 * mention of the myriad other clarity issues, such as those Gojomo posted?
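Regarding the first item above: a check character could, for example, be
computed with the standard Luhn mod N algorithm, taking N = 32 over the
z-base-32 alphabet. The following Python 3 sketch is purely illustrative
-- z-base-32 itself specifies no check characters, and the function names
are ours:

    ALPHABET = "ybndrfg8ejkmcpqxot1uwisza345h769"

    def check_char(payload: str) -> str:
        """Compute a Luhn mod 32 check character for a z-base-32 string."""
        total, factor = 0, 2                     # double every second value, from the right
        for ch in reversed(payload):
            addend = factor * ALPHABET.index(ch)
            total += addend // 32 + addend % 32  # sum the two base-32 digits of the addend
            factor = 3 - factor                  # alternate the factor between 2 and 1
        return ALPHABET[(32 - total % 32) % 32]

    def is_valid(s: str) -> bool:
        """True iff `s` ends with a correct check character for its prefix."""
        return s[-1:] == check_char(s[:-1])

    assert is_valid("tqre" + check_char("tqre"))

Like the credit-card check digit mentioned above, this catches all
single-character transcription errors and most transpositions of adjacent
characters, at the cost of one extra character.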