Character encoding conversion
The main feature of the XPRNLS library is character encoding conversion: it provides dedicated routines that make it easier to write programs handling text data in several different encodings. The reference encoding of the XPRNLS library is UTF-8, and the functions it publishes convert between UTF-8 and other encodings.
The functionality of the library is also accessible from the command line via the xprnls command tool, which enables shell scripts to convert text files from one encoding to another.
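The role of UTF-8 as the reference encoding can be illustrated with a minimal sketch. It uses Python's built-in codecs rather than XPRNLS routines, and the helper name convert is hypothetical: a conversion between two non-UTF-8 encodings is carried out in two steps, with UTF-8 as the intermediate form.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from encoding src to encoding dst via UTF-8.

    Hypothetical helper for illustration only; it is not part of XPRNLS.
    """
    # Step 1: source encoding -> UTF-8 (the reference encoding).
    utf8 = data.decode(src).encode("utf-8")
    # Step 2: UTF-8 -> destination encoding.
    return utf8.decode("utf-8").encode(dst)


# The euro sign (U+20AC) is byte 0xA4 in ISO-8859-15 and byte 0x80 in CP1252.
latin9 = "10 \u20ac".encode("iso-8859-15")
cp1252 = convert(latin9, "iso-8859-15", "cp1252")
```

The same character ends up with a different byte value in each encoding, which is why text cannot simply be copied between files that use different encodings.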
Character encodings are identified by encoding names. The library natively supports the encodings UTF-8, UTF-16, UTF-32, ISO-8859-1, ISO-8859-15, CP1252 and US-ASCII; these encodings are therefore available on all systems. By default, UTF-16 and UTF-32 use the byte order of the architecture of the running system (e.g. big endian on a SPARC processor), but a byte order may be selected explicitly by appending LE (Little Endian) or BE (Big Endian) to the encoding name (e.g. UTF-16LE).
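The effect of the byte order suffix can be seen in the following sketch. It relies on Python's own codec names (utf-16-le, utf-16-be and so on), which are an assumption of this example and not the names used by XPRNLS; the principle is the same.

```python
import sys

text = "A"  # U+0041

# Explicit byte order: the suffix fixes the layout regardless of the host.
utf16_le = text.encode("utf-16-le")  # b"A\x00"
utf16_be = text.encode("utf-16-be")  # b"\x00A"
utf32_le = text.encode("utf-32-le")  # b"A\x00\x00\x00"
utf32_be = text.encode("utf-32-be")  # b"\x00\x00\x00A"

# Native byte order: choose the variant that matches the running architecture.
native_name = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
utf16_native = text.encode(native_name)
```

Data written with the native byte order on one machine may be misread on a machine with the opposite byte order, so an explicit LE or BE suffix is the safer choice for files that are exchanged between systems.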
The availability and names of other encodings depend on the operating system:
- On Windows, the library relies on the Win32 API routines MultiByteToWideChar and WideCharToMultiByte. The encoding names (which are not case sensitive) can be either the code page number prefixed by CP (like CP28605) or the usual name (e.g. ISO-8859-15). Except for GB18030 (which is a variable-size encoding), only single-byte and 2-byte encodings are supported.
- On POSIX systems, the library is based on the iconv function of the standard C library. Depending on the implementation, the encoding names may be case sensitive.
An encoding name may also be one of the following aliases: RAW (no encoding), SYS (default system encoding), WCHAR (wide character for the C library), FNAME (encoding used for file names), TTY (encoding of the output stream of the console), TTYIN (encoding of the input stream of the console), STDIN, STDOUT, STDERR (encoding of the default input, output and error streams respectively).
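To give an idea of what such aliases stand for, the sketch below queries roughly analogous values from the Python runtime. This is only an analogy under stated assumptions: the function name rough_alias_equivalents is hypothetical, the mapping is approximate, and the way XPRNLS itself resolves these aliases is not shown here.

```python
import locale
import sys


def rough_alias_equivalents() -> dict:
    """Return Python's closest counterparts for some of the aliases.

    Hypothetical helper; the values come from the Python runtime,
    not from XPRNLS, and may differ from what the library reports.
    """
    return {
        # Default encoding of the current locale.
        "SYS": locale.getpreferredencoding(False),
        # Encoding used to convert file names.
        "FNAME": sys.getfilesystemencoding(),
        # Encodings of the standard streams (None if a stream is unavailable).
        "STDIN": getattr(sys.stdin, "encoding", None),
        "STDOUT": getattr(sys.stdout, "encoding", None),
        "STDERR": getattr(sys.stderr, "encoding", None),
    }


aliases = rough_alias_equivalents()
```

The values depend on the operating system and on the environment of the running process, which is the reason for referring to them through aliases rather than fixed encoding names.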