Introduction
The Xpress Natural Language Support (XPRNLS) component comes as a command line tool and a library.
- The command line tool supports various commands for transcoding text files (such as converting a text file from one encoding to another encoding) and managing message catalogs (for message translations).
- The library offers a system-independent set of routines for converting text buffers and streams between UTF-8 and other encodings. It also implements a mechanism to handle message translation.
Character encoding conversion
The main feature of the XPRNLS library is character encoding conversion: it provides dedicated routines to ease the writing of programs that work with text data encoded with heterogeneous encodings. The reference encoding used by the XPRNLS library is UTF-8 and the functions it publishes handle conversions between UTF-8 and other encodings.
The functionality of the library is made accessible from the command line via the xprnls command tool that enables shell scripts to convert text files from one encoding to another.
Character encodings are identified by encoding names. The library natively supports the encodings UTF-8, UTF-16, UTF-32, ISO-8859-1, ISO-8859-15, CP1252 and US-ASCII: these encodings are therefore available on all systems. By default, UTF-16 and UTF-32 use the byte order of the architecture of the running system (e.g. big endian on a SPARC processor), but the byte order may be selected explicitly by appending LE (Little Endian) or BE (Big Endian) to the encoding name (e.g. UTF-16LE).
The availability and names of other encodings depend on the operating system:
- On Windows the library relies on the Win32 API routines MultiByteToWideChar and WideCharToMultiByte. The encoding names (which are not case sensitive) can be either the code page number prefixed by CP (like CP28605) or the usual name (e.g. ISO-8859-15). Except for GB18030 (which is a variable-size encoding), only single-byte and 2-byte encodings are supported.
- On POSIX systems the library is based on the iconv function of the standard C library. Depending on the implementation, the encoding names may be case sensitive.
An encoding name may also be one of the following aliases: RAW (no encoding), SYS (default system encoding), WCHAR (wide character for the C library), FNAME (encoding used for file names), TTY (encoding of the output stream of the console), TTYIN (encoding of the input stream of the console), STDIN, STDOUT, STDERR (encoding of the default input/output/error stream).
Message translation with XPRNLS
The creation of message translations typically involves three steps:
- extraction of the message texts to be translated from a program source (⇒ Portable Object Template (POT) file)
- instantiation with the translations for a particular language (⇒ Portable Object (PO) file)
- compilation of the message translations (⇒ Machine Object (MO) file)
Translations are stored in a set of message catalog files: each of these files is specific to a language and a domain. A domain is a collection of messages; typically, all messages of an application are grouped under a single domain. The xprnls command tool supports the necessary operations for building and managing these message catalog files.
Translations are applied in programs via the XPRNLS gettext framework for message translation: in a program using this system all strings to be translated are passed to a translation function (XNLSgettext). During the execution of the program this function returns a version of the message for the current language or the original English text itself if no translation can be found.
Example
The following example shows a minimal program using the message translation functionality:
    #include <stdio.h>
    #include "xprnls.h"   /* XPRNLS declarations (header name assumed) */

    int main(int argc,char *argv[])
    {
     XNLSdomain dom;

     dom=XNLSopenmsgdom("myprg",NULL);      /* Open domain 'myprg' */
     printf(XNLSgettext(dom,"Hello!\n"));   /* Display translation of "Hello!" */
     XNLSclosemsgdom(dom);                  /* Close domain */
     return 0;
    }
This example requires message catalog files for the domain "myprg". The first step in the generation of the message catalogs is to produce a Portable Object Template (POT) file for the domain: this text file collects all messages to be translated. For our example the file myprg.pot includes only one message (the generation of such a file can be automated using tools like GNU xgettext):
    msgid "Hello!\n"
    msgstr ""
From this template one Portable Object (PO) file per supported language has to be produced. The creation of an initial PO file can be done using the XPRNLS command tool (see Section XPRNLS Command line tool). For instance, to generate the file myprg.fr.po (we assume that the operating system is configured for French):
xprnls init -o myprg.fr.po myprg.pot
The contents of the file myprg.fr.po generated by this command look as follows:
    msgid ""
    msgstr ""
    "Project-Id-Version: PACKAGE VERSION\n"
    "POT-Creation-Date: 2015-12-01 16:03+0100\n"
    "PO-Revision-Date: 2015-12-01 16:03+0100\n"
    "Last-Translator: Your name\n"
    "Language: fr\n"
    "Content-Type: text/plain; charset=ISO8859-15\n"

    msgid "Hello!\n"
    msgstr ""
In addition to the message to translate, the command tool has created a header record. Most of this record is purely informative (it is nevertheless recommended to complete the missing entries), but XPRNLS relies on two of its entries, which must therefore be correct: the language of the translation and the content type (the encoding of the file). Here, the language is French (fr) and the encoding is ISO-8859-15; these settings correspond to the configuration of the system on which the xprnls command was run. These values must be edited when preparing translations for a language that differs from the configuration of the operating system.
Completing a PO file consists of entering a translation for each of the messages (i.e. making sure that every msgid entry is followed by a non-empty msgstr record). Our example has only one message, so the minimal form of our translation file for French is the following:
    msgid ""
    msgstr ""
    "Language: fr\n"
    "Content-Type: text/plain; charset=ISO8859-15\n"

    msgid "Hello!\n"
    msgstr "Bonjour!\n"
Once all translations have been prepared, the message catalogs are created by running the following command:
xprnls mogen -d locale myprg.*.po
This command creates the directory locale (if necessary), which contains one subdirectory per language. In each of these language-specific subdirectories another directory (LC_MESSAGES) is created to store the message catalogs. Each message catalog is named after the domain name with the extension .mo (Machine Object): this is the binary version of the corresponding PO file. When our example program is executed on a system configured for French, the function XNLSgettext looks for the file locale/fr/LC_MESSAGES/myprg.mo to locate the required translations.
PO, POT and MO file formats
Portable Object (PO) and Portable Object Template (POT) files are text files consisting of a list of pairs of entries of the form:
    msgid message
    msgstr translation
Here, message is a text identifying the message to translate (it is usually expressed in English) and translation is its translation into the language associated with the file. A POT file has only empty msgstr entries and serves as a basis for the creation of the PO files.
The '#' symbol marks the beginning of a comment: any text following it up to the end of the line is ignored.
Both the message and its translation must be expressed in the form of lists of double-quoted strings separated by spaces or newlines (each list is merged into a single text string by the PO processor). Text strings may contain C escape sequences (like "\n") as well as format markers (e.g. "%s"). A translation must include the same format markers as the original text and they must appear in the same order (otherwise the translation will be ignored).
PO files usually include an entry for the special message consisting of the empty string (""). The translation of this entry does not hold an actual translation but records management information (such as the name of the author, the date of creation, etc.). The syntax of such a record is a succession of definitions of the form:
property: value\n
Although the command xprnls init (used to create an initial PO file from a POT file) generates a number of assignments for this header, only two of them are actually used by XPRNLS: the language associated with the file (e.g. "Language: it\n") and its encoding (e.g. "Content-Type: text/plain; charset=ISO8859-1\n").
A Machine Object (MO) file is a compiled version of a PO file that is created by the command xprnls mogen: this is the format required by the translation routines of XPRNLS. The binary MO format used by XPRNLS is platform independent and compatible with the GNU version of gettext (GMO).
© 2001-2024 Fair Isaac Corporation. All rights reserved. This documentation is the property of Fair Isaac Corporation (“FICO”). Receipt or possession of this documentation does not convey rights to disclose, reproduce, make derivative works, use, or allow others to use it except solely for internal evaluation purposes to determine whether to purchase a license to the software described in this documentation, or as otherwise set forth in a written software license agreement between you and FICO (or a FICO affiliate). Use of this documentation and the software described in it must conform strictly to the foregoing permitted uses, and no other use is permitted.