What is a "character encoding", "character map", "code page"?
Although these terms are not strictly equivalent, they all relate to the same problem: how to represent a symbol (or character) in a computer system. Such a representation is characterized by two properties:
- a character map that associates each symbol with a unique numerical ID (or code point). For instance, US-ASCII defines 128 positions to represent the letters, digits and punctuation commonly used in English: "exclamation mark" (!) has code point 33, "zero" (0) has code point 48, "capital letter A" has code point 65, etc.
- an encoding method to actually represent each code point in memory. With ASCII, 7 bits are sufficient to encode the entire code set, so each character is usually stored in a single byte.
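The two properties above can be observed directly in Python, whose built-in `ord()` and `chr()` functions expose code points, and whose `str.encode()` method applies an encoding. A minimal sketch:

```python
# Character map: ord() gives the code point of a symbol, chr() reverses it.
print(ord("!"))   # 33
print(ord("0"))   # 48
print(ord("A"))   # 65
print(chr(65))    # 'A'

# Encoding method: ASCII stores each code point in a single byte.
data = "A!".encode("ascii")
print(data, len(data))  # b'A!' 2
```

Note that `ord()` returns the abstract code point, while `encode()` produces the concrete bytes; for ASCII the two happen to coincide.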
Various character encodings have been invented to satisfy local requirements around the world. For example, ISO-8859-1 is an 8-bit extension of ASCII (i.e. the first 128 code points of this encoding are the same as ASCII) designed for a group of Western European languages: it adds a set of accented letters to standard ASCII. Another variant, ISO-8859-7, is suitable for Greek but cannot represent accented letters such as those used in French.
When the number of code points exceeds 256, a multi-byte encoding is required. Shift-JIS (used in Japan) is an example of a multi-byte encoding: each character is encoded using either 1 or 2 bytes.
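The variable width of Shift-JIS is visible by comparing the byte length of an ASCII character with that of a Japanese one, as in this sketch:

```python
# ASCII characters take 1 byte in Shift-JIS...
print(len("A".encode("shift_jis")))    # 1

# ...while Japanese characters take 2 bytes.
print(len("日".encode("shift_jis")))   # 2
print("日".encode("shift_jis"))        # b'\x93\xfa'
```

This means the byte length of a Shift-JIS string is not the same as its character count, a frequent source of bugs in code that assumes one byte per character.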
Typically a computer system is set up with a national encoding suitable for the symbols required by the local language. For instance, a Windows system installed in Germany uses the encoding CP1252 (where CP stands for "code page"), which supports symbols like 'ß' or 'ö' but will not be able to display any Greek (e.g. 'θ') or Hebrew (e.g. 'א') characters.
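Python ships a codec for CP1252, so the limits of such a national encoding can be checked directly. A small sketch:

```python
# German letters are part of the CP1252 character map.
print("ß".encode("cp1252"))   # b'\xdf'
print("ö".encode("cp1252"))   # b'\xf6'

# Greek letters are not, so encoding them fails.
try:
    "θ".encode("cp1252")
except UnicodeEncodeError:
    print("'θ' is not representable in CP1252")
```

This is exactly why text files exchanged between systems configured with different code pages often end up garbled: the same byte value maps to different symbols in each character map.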