Character encoding in Mosel

This chapter addresses a number of questions relating to character encoding, in particular:

What is a 'character encoding', 'character map', 'code page'?

Although these terms are not strictly equivalent, they all relate to the same problem: how to represent a symbol (or character) in a computer system. Such a representation is characterized by two properties:

  1. a character map that associates each symbol with a unique numerical ID (or code point). For instance, US-ASCII defines 128 positions to represent the letters, digits and punctuation commonly used in English: "exclamation mark" (!) has code point 33, "zero" has code point 48, "Capital Letter A" has code point 65, etc.
  2. an encoding method to actually represent each code point in memory. With ASCII, 7 bits are sufficient to encode the entire code set: each character is usually encoded on a single byte.
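
The two properties can be observed directly in most programming languages. The following minimal sketch uses Python, chosen here purely for illustration (it is not part of Mosel), to look up the code points quoted above and the bytes that ASCII uses to store them:

```python
# Character map: each symbol is associated with a numerical code point.
code_points = {ch: ord(ch) for ch in "!0A"}

# Encoding method: how these code points are represented in memory.
ascii_bytes = "!0A".encode("ascii")

print(code_points)        # {'!': 33, '0': 48, 'A': 65}
print(list(ascii_bytes))  # [33, 48, 65]
```

With ASCII the byte values are identical to the code points, which is not the case for multi-byte encodings.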

Various character encodings have been invented to satisfy local requirements around the world. For example, ISO-8859-1 is an 8-bit extension of ASCII (i.e. the first 128 code points of this encoding are the same as ASCII) specifically designed for a group of European languages: it adds a set of accented letters to standard ASCII. Another variant, ISO-8859-7, is suitable for Greek but cannot represent accented letters such as those used in French.
When the number of code points exceeds 256, a multi-byte encoding is required. Shift-JIS (used in Japan) is an example of a multi-byte encoding: each character is encoded using either 1 or 2 bytes.

Typically a computer system is set up with some national encoding suitable to handle the symbols required by the local language. For instance a Windows system installed in Germany uses encoding CP1252 (where CP stands for Code page) that supports symbols like 'ß' or 'ö' but will not be able to display any Greek (e.g. 'θ') or Hebrew characters (e.g. 'ℵ').
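
The limits of national encodings described above can be checked with a short experiment. This is an illustrative Python sketch, not Mosel code; the helper name can_encode is hypothetical and the encoding names are those of Python's standard codecs:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# The same byte value denotes different symbols in different 8-bit encodings.
raw = b"\xe9"
as_latin1 = raw.decode("iso-8859-1")  # an accented Latin letter
as_greek = raw.decode("iso-8859-7")   # a Greek letter

# CP1252 covers German letters but contains no Greek letters.
german_ok = can_encode("\u00df\u00f6", "cp1252")  # 'ß' and 'ö'
greek_ok = can_encode("\u03b8", "cp1252")         # 'θ'

# Shift-JIS is multi-byte: 1 byte for 'A', 2 bytes for a hiragana letter.
sjis_lengths = [len(ch.encode("shift_jis")) for ch in "A\u3042"]

print(german_ok, greek_ok, sjis_lengths)
```

Because a byte sequence carries no indication of the encoding that produced it, the encoding must be known (or declared) in order to read a text file correctly.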

What is Unicode?

Unicode is a universal encoding aimed at representing all known symbols such that a single encoding can be used for any country/language. Unicode is widely adopted and most computer systems use it internally to store character strings: the Windows operating system (and file system) uses this encoding as well as most Unix/Linux systems. Programming environments like Java or .NET are also based on Unicode.

Note that in China the GB18030 encoding is preferred to Unicode: this is a universal encoding published as a Chinese national standard.

What is the meaning of UTF-8,16,32 and UCS-2?

Unicode defines the mapping between code points and symbols; the effective encoding is specified by a Unicode Transformation Format (UTF). The most commonly used UTF encodings are:

  • UTF-32: a character is represented by a 4-byte integer
  • UTF-16: a character is represented by one or two 2-byte integers
  • UTF-8: a character requires between 1 and 4 bytes

Compared to the other UTF encodings UTF-8 has the advantage of being compatible with ASCII: a text that consists only of ASCII characters has the same representation in UTF-8 and ASCII. As a consequence UTF-8 is also more compact than the other UTF encodings for English and most European languages (because the majority of symbols are included in the ASCII set).
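
To make the size differences concrete, the following Python sketch (an illustration only, independent of Mosel) counts the bytes needed by each UTF encoding for a few sample characters. The function name encoded_sizes is hypothetical:

```python
# 'A', 'é', the euro sign, and an emoji located beyond U+FFFF
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes used by each UTF encoding for ch."""
    # The explicit "-le" variants are used so that no BOM is added.
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-le")),
        "UTF-32": len(ch.encode("utf-32-le")),
    }

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
same_bytes = "Mosel".encode("ascii") == "Mosel".encode("utf-8")
```

The ASCII letter takes 1, 2 and 4 bytes respectively, whereas the emoji takes 4 bytes in all three encodings.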

UCS-2 (2-byte Universal Character Set) is a deprecated encoding originally used in Windows and Java: it encodes each character as a single 2-byte integer and is therefore limited to the first 65536 code points of Unicode. For this reason it has gradually been replaced by plain UTF-16.
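
The difference between UCS-2 and UTF-16 lies in how code points above 65535 are handled: UTF-16 splits them into two 2-byte integers (a "surrogate pair"), which UCS-2 cannot do. The following Python sketch illustrates the standard UTF-16 computation; the function name utf16_code_units is hypothetical, and the sketch does not validate its input (for example, it does not reject the reserved range D800 to DFFF):

```python
def utf16_code_units(code_point: int) -> list:
    """Return the 16-bit integers that UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]           # one unit, exactly as in UCS-2
    offset = code_point - 0x10000     # 20 significant bits remain
    high = 0xD800 + (offset >> 10)    # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits
    return [high, low]

print([hex(u) for u in utf16_code_units(0x41)])     # ['0x41']
print([hex(u) for u in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
```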

What is a BOM?

For UTF-16 and UTF-32 the byte ordering has to be known (in fact we should refer to UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE to take into account the endianness of the encoding). In order to avoid incorrect interpretation of these encodings a Byte Order Mark (BOM) may be put at the beginning of documents: it consists of a sequence of bytes that identifies both the encoding (UTF-16 or UTF-32) and the byte ordering used (Little Endian or Big Endian).

Although UTF-8 does not require any byte ordering information, a dedicated BOM can be used with this encoding: its primary purpose is to differentiate UTF-8 from other byte-oriented encodings. If not interpreted, this marker takes the form of the 3-character sequence "ï»¿" (in ISO-8859-1 or CP1252): a document starting with this sequence must be read with UTF-8 enabled software.
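
As an illustration of how a BOM can be recognized, here is a minimal Python sketch (not Mosel code). The BOM constants come from Python's standard codecs module; the function name detect_bom is hypothetical:

```python
import codecs

# The UTF-32 BOMs are tested first because the UTF-32LE BOM starts
# with the same two bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

def detect_bom(data: bytes):
    """Return the encoding announced by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

with_bom = codecs.BOM_UTF8 + "text".encode("utf-8")
misread = with_bom.decode("cp1252")      # BOM appears as three characters
stripped = with_bom.decode("utf-8-sig")  # BOM is recognized and removed

print(detect_bom(with_bom), repr(misread), repr(stripped))
```

A file without a BOM yields None here, in which case the encoding has to be obtained by other means.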

Which character encoding is configured on my computer?

You can use the command 'xprnls info' of the XPRNLS command tool to identify which encoding is used on your system. The following example shows the output produced on a western European Windows system ('latin' encoding) with UK English as the selected language:

> xprnls info
Language: en
Default encodings:
System: CP1252
Console: CP437
File names: CP1252
Wide chars: UTF-16LE

Note that Xpress Workbench works with UTF-8 character encoding, independent of the system settings.

Which files are concerned by character encoding in Mosel?

Starting with version 4.0, Mosel works in UTF-8. This concerns:

  • the internal representation of text
  • all external APIs (i.e. all Mosel libraries)
  • the communication with the system via Unicode (Windows) or system encoding (Posix)

All streams and text files default to UTF-8. There is no impact on applications that only use pure ASCII (the first 128 characters), but text data files and source code using other encodings might require conversions or tagging. Note that no changes are required for other file types such as spreadsheets or databases.

Model source and text data files in Mosel format: Specify the encoding with the annotation !@encoding. For example if you are editing your model with an editor that employs the encoding CP1252:

!@encoding CP1252
model "my testmodel"
 ...

Other text/string input or output: Convert the encoding via the enc: prefix to file names and streams or by using the conversion routines of the XPRNLS library or command tool (see paragraph 'How can I convert the character encoding of a text file' below).

How can I convert the character encoding of a text file?

Text format data files that do not use UTF-8 encoding, such as CSV files or files accessed via fopen, need to be converted with the 'enc:' prefix when accessing them from within a Mosel model. (For files in the Mosel initializations format the !@encoding marker can be used instead.)
Example:

! Encoding names are operating system dependent, e.g. CP1252, ISO88591
fopen("enc:GB18030,testdata.txt", F_INPUT)

! Apply an encoding along with a BOM to a text file, saving it with a new name
fcopy("myfile.csv", F_INPUT, "enc:UTF-8+bom,mynewfile.csv", F_OUTPUT)

It is usually preferable to specify the encoding used by a data file as shown above, but Mosel also implements shorthands for encodings configured on the system running the model.

! Encoding aliases:
! raw, sys, wchar, fname, tty, ttyin, stdin, stdout, stderr
initializations to "mmsheet.csv:enc:sys,testoutput.csv"
  ...
end-initializations

Using the prefix enc:sys means that the default system encoding is employed (which corresponds to the behaviour of Mosel versions prior to Mosel 4).

On the API level, you can use the XPRNLS library to convert to/from UTF-8 encoding (please see the reference manual XPRNLS command tool and library for the full documentation of its functionality):

  • this library is platform independent and has no external dependency
  • it handles encoding conversions between UTF-8 and local encodings
  • it implements UTF-8/16/32(LE+BE), ISO-8859-1/15, ASCII, CP1252
  • other supported encodings depend on the operating system

// Open a file using the C function 'fopen' with a file name coming from Mosel
f = fopen(XNLSconvstrto(XNLS_ENC_SYS,filename,-1,NULL),"r");

Alternatively, you can use the XPRNLS command tool for converting the character encoding of text files between any two supported encodings:

xprnls conv -f CP1252 -t UTF8 -o outfile.txt myfile.txt
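
Conceptually, such a conversion decodes the bytes of the input file using the source encoding and writes the resulting text in the target encoding. The following Python sketch shows this principle only; it is not the implementation of the xprnls tool, and the function name convert_encoding is hypothetical:

```python
def convert_encoding(src: str, dst: str, from_enc: str, to_enc: str) -> None:
    """Re-encode a text file: decode with from_enc, write with to_enc."""
    # newline="" leaves the line endings of the source file unchanged.
    with open(src, "r", encoding=from_enc, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding=to_enc, newline="") as fout:
        fout.write(text)
```

Under these assumptions, a call such as convert_encoding("myfile.txt", "outfile.txt", "cp1252", "utf-8") would correspond to the command shown above. The sketch reads the whole file into memory, which is adequate for small data files only.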

Note: you can display the list of the available xprnls commands by entering

xprnls

at the command prompt.


© 2001-2024 Fair Isaac Corporation. All rights reserved. This documentation is the property of Fair Isaac Corporation (“FICO”). Receipt or possession of this documentation does not convey rights to disclose, reproduce, make derivative works, use, or allow others to use it except solely for internal evaluation purposes to determine whether to purchase a license to the software described in this documentation, or as otherwise set forth in a written software license agreement between you and FICO (or a FICO affiliate). Use of this documentation and the software described in it must conform strictly to the foregoing permitted uses, and no other use is permitted.