Next: From Bytes to Characters, Previous: Saving Buffers in Files, Up: Basic Editing [Contents][Index]
Computers understand text as a sequence of codes, which are numbers identifying some particular character. A character can represent things like letters, digits, ideograms, word separators, religious symbols, etc. Collections of character codes are called character sets.
Some character sets try to cover one or a few similar written languages. This is the case of ASCII and ISO Latin-1, for example. These character sets are small, i.e. just a few hundred codes.
Other character sets are much more ambitious. This is the case of Unicode, that tries to cover the entire totality of human languages in the globe, including the fictitious ones, like klingon. Unicode is a really big character set.
In order to store character codes in a computer’s memory, or a file, we need to encode each character code in one or more bytes. The number of bytes needed to encode a given character code depends on the range of codes in the containing set.
ASCII, for example, defines 128 character codes: a single byte is enough to encode every possible ASCII character. It is very easy to encode ASCII.
Unicode, on the contrary, defines many thousand of character codes,
and has room for many more: we would need 31 bits in order to encode
any conceivable Unicode character code. However, it would be wasteful
to use that many bits per character: most used character codes tend to
be in lower regions of the code space. For example, the code
corresponding to the Latin letter 'a'
is a fairly small number,
whereas the codes corresponding to the Klingon alphabet are really big
numbers. Consequently, some systems opt to just encode a subset of
Unicode, like the first 16 bits of the Unicode space, which is called
the Basic Multilingual Plane, and contains all the characters that
most people will ever need. There are also variable-length encodings
of Unicode, that try to use as less bytes as possible to encode any
given code. A particularly clever encoding of Unicode, designed by
Rob Pike, is backwards compatible with the ASCII encoding, i.e. it
encodes all the ASCII codes in one byte each, and the values of these
byte are the same than in ASCII. This clever encoding is called
UTF-8.
Next: From Bytes to Characters, Previous: Saving Buffers in Files, Up: Basic Editing [Contents][Index]