What is the meaning of UTF-8, UTF-16, UTF-32 and UCS-2?
Unicode defines the mapping between code points and symbols. How those code points are actually encoded as bytes is specified by a Unicode Transformation Format (UTF). The most commonly used UTF encodings are:
- UTF-32
  - every character is represented by a single 4-byte integer
- UTF-16
  - a character is represented by one or two 2-byte integers
- UTF-8
  - a character requires between 1 and 4 bytes
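The sizes listed above can be checked with a short Python sketch. The helper name `encoded_lengths` and the sample characters are chosen here for illustration only. The little-endian codec variants are used so that Python does not prepend a byte order mark, which would otherwise inflate the counts.

```python
# Minimal sketch: how many bytes one character takes in each UTF encoding.
# `encoded_lengths` is a hypothetical helper, not part of any library.

def encoded_lengths(ch):
    """Return (UTF-8, UTF-16, UTF-32) byte counts for a single character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # "-le" avoids the 2-byte BOM
        len(ch.encode("utf-32-le")),  # "-le" avoids the 4-byte BOM
    )

# U+0041 (A), U+00E9 (é), U+20AC (€), U+1F600 (😀)
for ch in ["A", "é", "€", "😀"]:
    u8, u16, u32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8={u8} UTF-16={u16} UTF-32={u32}")

# U+0041: UTF-8=1 UTF-16=2 UTF-32=4
# U+00E9: UTF-8=2 UTF-16=2 UTF-32=4
# U+20AC: UTF-8=3 UTF-16=2 UTF-32=4
# U+1F600: UTF-8=4 UTF-16=4 UTF-32=4
```

UTF-32 is always 4 bytes, UTF-16 is 2 bytes except for characters beyond U+FFFF, and UTF-8 grows from 1 to 4 bytes as the code point gets larger.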
Compared to the other UTF encodings, UTF-8 has the advantage of being compatible with ASCII: a text that consists only of ASCII characters has the same byte representation in UTF-8 and in ASCII. As a consequence, UTF-8 is also more compact than the other UTF encodings for English and most European languages, because most of the characters in such texts belong to the ASCII set and therefore take a single byte each.
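As a rough illustration of both points, the following Python sketch compares the encoded size of two sample strings. The helper `byte_sizes` and the sample texts are assumptions made for this example.

```python
# Minimal sketch: ASCII compatibility and compactness of UTF-8.
# `byte_sizes` is a hypothetical helper, not part of any library.

def byte_sizes(text):
    """Return the encoded size of `text` in bytes for each UTF encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")  # "-le": no BOM added
    return {enc: len(text.encode(enc)) for enc in encodings}

english = "Hello, world"   # ASCII only
french = "déjà vu"         # 7 characters, 2 of them outside ASCII

# Pure ASCII text yields exactly the same bytes in ASCII and UTF-8.
print(english.encode("utf-8") == english.encode("ascii"))  # True

print(byte_sizes(english))  # {'utf-8': 12, 'utf-16-le': 24, 'utf-32-le': 48}
print(byte_sizes(french))   # {'utf-8': 9, 'utf-16-le': 14, 'utf-32-le': 28}
```

Even when a few accented letters need 2 bytes each, the UTF-8 form stays smaller than the UTF-16 and UTF-32 forms in this example.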
UCS-2 (2-byte Universal Character Set) is an obsolete encoding originally used in Windows and Java. It encodes each character as a single 2-byte integer and is therefore limited to the first 65,536 code points of Unicode (U+0000 to U+FFFF). This is why it has gradually been replaced by UTF-16, which can represent the remaining code points using two 2-byte integers.
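The difference can be sketched in Python. Python has no UCS-2 codec, so `fits_in_ucs2` below is a hypothetical check based only on the 65,536 code point limit described above. `utf16_code_units` is likewise an illustrative helper.

```python
# Minimal sketch: why UCS-2 cannot represent every Unicode character.
# Both helpers are hypothetical and written for this example only.

def utf16_code_units(ch):
    """Return the 2-byte integers UTF-16 uses to encode `ch`."""
    data = ch.encode("utf-16-be")  # big-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def fits_in_ucs2(ch):
    """Assumption: a character fits in UCS-2 if its code point is below 65536."""
    return ord(ch) <= 0xFFFF

euro = "€"      # U+20AC, within the first 65,536 code points
emoji = "😀"    # U+1F600, beyond them

print(fits_in_ucs2(euro), [hex(u) for u in utf16_code_units(euro)])
# True ['0x20ac']
print(fits_in_ucs2(emoji), [hex(u) for u in utf16_code_units(emoji)])
# False ['0xd83d', '0xde00']
```

For characters below U+10000, UTF-16 produces the same single 2-byte value that UCS-2 would, which is what made the transition from UCS-2 to UTF-16 practical.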