Text handling and regular expressions
The module mmsystem provides a large set of text handling functionality, including
- the types text, parsectx, and textarea
- text formatting routines (number format, upper/lower case)
- parsing routines
- regular expressions
The following subsections show some examples of text handling with Mosel. For a full description of the available functionality, please refer to the chapter mmsystem of the 'Mosel Language Reference Manual'.
text vs. string
Although they appear to denote similar concepts, the types string and text serve quite distinct purposes in Mosel models. Any string defined in a model is added to the model's names dictionary and is only freed when the model run terminates; this is not the case for model objects of the type text. The type string should therefore be used whenever objects need to be identified, in particular for index sets.
The type text is in general the more appropriate choice for descriptive or editable texts, including reporting or logging messages, and any texts generated via (partial) copies or concatenation. A text object can be altered, which allows for a considerably wider set of operations (such as insertion and deletion) than strings do. Furthermore, with the I/O driver text:, a public text object can be used as an input or output file in a model (see Section List of I/O drivers).
It is, however, not always possible to draw a clear line between where to use string and where to use text. A number of module subroutines therefore define multiple versions that accept either string or text arguments. Note further that, if required, Mosel automatically converts from the type string to text, but not the other way round.
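As a minimal sketch of these differences, the following model converts a string to a text, edits the text in place, and converts it back. The variable names and sample values are illustrative. The argument order of inserttext and deltext and the explicit conversion string(t) are assumptions that should be checked against the mmsystem chapter of the reference manual.

```mosel
model "text vs string"
 uses "mmsystem"

 declarations
  s: string
  t: text
 end-declarations

 s:= "item"                  ! Strings are suited for identifiers and indices
 t:= s                       ! Automatic conversion from string to text
 t+= "_copy"                 ! Concatenation modifies the text object
 inserttext(t, "my_", 1)     ! Assumed signature: inserttext(text, source, position)
 writeln("After insertion: ", t)
 deltext(t, 1, 3)            ! Assumed signature: deltext(text, start, end)
 writeln("After deletion: ", t)
 s:= string(t)               ! Assumed explicit conversion from text to string
 writeln("As string: ", s)
end-model
```

The editing steps operate on the text object only; the string is touched just once at the start and once at the end, which keeps intermediate values out of the names dictionary.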
Parsing text
In the example below we configure the global parser settings to read real numbers from a text that has fields separated by commas.
 declarations
  values: list of real
  comma=getchar(",",1)             ! ASCII value for ","
 end-declarations

 txt:= text(", , 123.4 , 345.6 ,")

 ! Parsing without context
 setparam("sys_sepchar", comma)    ! Comma as separation character
 setparam("sys_trim", true)        ! Trim blanks around separation character
 while (nextfield(txt)) do         ! Get next field
  values+= [parsereal(txt)]        ! Read a real number from the field
  writeln("Read up to position ", getparam("sys_endparse"))
 end-do
 writeln("Values read: ", values)  ! Output: [0,0,123.4,345.6,0]
The same behavior can be achieved with a parser context. Here we do not modify any global settings, which prevents possible interactions with other parser settings that may be used elsewhere in the model.
 declarations
  pctx: parsectx
  values: list of real
  comma=getchar(",",1)                 ! ASCII value for ","
 end-declarations

 txt:= text(", , 123.4 , 345.6 ,")

 ! Parsing real numbers with context
 setsepchar(pctx, comma)               ! Comma as separation character
 settrim(pctx, true)                   ! Trim blanks around separation character
 while (nextfield(txt,pctx)) do        ! Get next field
  values+= [parsereal(txt, pctx)]      ! Read a real number from the field
  writeln("Read up to position ", pctx.endparse)
 end-do
 writeln("Values read: ", values)      ! Output: [0,0,123.4,345.6,0]
When implementing data handling for optimization applications, it is good practice to add error handling to the parsing loop, for example to check whether the fields are formatted as expected:
 pctx.endparse:=0                      ! Start at the beginning of text
 pctx.sepchar:=comma                   ! Comma as separation character
 pctx.trim:=true                       ! Trim blanks around separation character
 while (nextfield(txt,pctx)) do        ! Get next field
  if getchar(txt, pctx.endparse)=comma or pctx.endparse>=txt.size then
   values+=[0.0]                       ! The field is empty
  else
   r:=parsereal(txt, pctx)             ! Read a real number from the field
   if getsysstat=0 then
    values+= [r]
   else
    writeln("Malformed field contents at position ", pctx.endparse,
            " (", copytext(txt, pctx.endparse,pctx.endparse+2), ")")
   end-if
  end-if
  writeln("Read up to position ", pctx.endparse)
 end-do
 writeln("Values read: ", values)      ! Output: [0,0,123.4,345.6,0]
One might also choose to work with multiple parser contexts, for example using an 'inner' context pctxi for reading some part of each field from the original text (here, an integer number that is read from a field containing a real).
 declarations
  pctx,pctxi: parsectx                 ! Outer and inner parser contexts
  ivalues: list of integer
  comma=getchar(",",1)                 ! ASCII value for ","
 end-declarations

 txt:= text(", , 123.4 , 345.6 ,")

 setsepchar(pctx, comma)               ! Comma as separation character
 settrim(pctx, true)                   ! Trim blanks around separation character
 while (nextfield(txt,pctx)) do        ! Get next field
  tt:=parsetext(txt, pctx)             ! Get contents of the field
  pctxi.endparse:=1                    ! Reset start to beginning of the text
  i:=parseint(tt,pctxi)                ! Read an integer number from the field
  if getsysstat=0 then ivalues+= [i]; end-if
  writeln("Read up to position ", pctx.endparse)
 end-do
 writeln("Values read: ", ivalues)     ! Output: [123,345]
Regular expressions
A regular expression (in the following abbreviated to regex) is a sequence of characters that forms a search pattern. Regexes are used to describe or match a set of strings according to certain syntax rules. Mosel supports the Basic Regular Expressions syntax (BRE) and the Extended Regular Expressions syntax (ERE) of the POSIX standard. The implementation of regular expression matching relies on the TRE library.
Here are some examples of regular expression matching and replacement, with some explanations of the meaning of the employed regex. For a complete description of the supported regex syntax the reader is referred to the documentation of the TRE library (see http://laurikari.net/tre). Another useful resource is the set of examples provided on the page en.wikipedia.org/wiki/Regular_expression.
The following example (regex.mos) displays all strings containing 'My' that occur in a text. The first matching statement uses BRE syntax; it displays all strings starting with 'My' irrespective of upper/lower case spelling (option REG_ICASE). The second matching statement uses ERE syntax (option REG_EXTENDED) to retrieve all strings containing 'My' other than at their beginning. We have chosen to retrieve different individual portions of the matching string (specified via the parentheses in the regular expression statement); their positions are stored in their order of occurrence in the array m (of type textarea).
 declarations
  m: array(range) of textarea
  t: text
 end-declarations

 t:="MyValue=10,Sometext Mytext MoretextMytext2, MYVAL=1.5 mYtext3"
 m(0).succ:=1
 while (regmatch(t, '\<My\(\w*\)', m(0).succ, REG_ICASE, m))
  writeln("Word starting with 'My': ", copytext(t,m(0)))
 ! Output: MyValue Mytext MYVAL mYtext3

 m(0).succ:=1
 while (regmatch(t, '\w+((My)(\w*))', m(0).succ, REG_ICASE+REG_EXTENDED, m))
  writeln("String containing 'My' (not at beginning): ",
          copytext(t,m(0)), " (", copytext(t,m(1)), "=",
          copytext(t,m(2)) , "+", copytext(t,m(3)), ")")
 ! Output: MoretextMytext2 (Mytext2=My+text2)
The special characters used in the formulation of the regular expressions above have the following meaning: \< marks the beginning of a word, \w denotes alphanumeric or underscore characters, * means 0 or more times and + stands for 1 or more times.
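To illustrate the + quantifier and a single capture group in isolation, here is a small sketch in ERE syntax that lists the names on the left-hand side of each "=" sign. The sample text and the model name are hypothetical; the regmatch call follows the same argument pattern as in the example above.

```mosel
model "regex names"
 uses "mmsystem"

 declarations
  m: array(range) of textarea
  t: text
 end-declarations

 t:= "x1=3, total_cost=12, y=7"   ! Hypothetical sample input
 m(0).succ:=1                     ! Start searching at the first character
 ! ERE syntax: (\w+) captures one or more word characters followed by '='
 while (regmatch(t, '(\w+)=', m(0).succ, REG_EXTENDED, m))
  writeln("Name: ", copytext(t, m(1)))
end-model
```

Here m(0) covers the whole match including the "=" sign, whereas m(1) holds only the captured name, which is why the output statement copies m(1).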
The following Mosel code snippet shows how to replace matching expressions in a text that contains dates with different formats:
 t:="date1=20/11/2010,date2=1-Oct-2013,date3=2014-6-30"
 numr:= regreplace(t, '([[:digit:]]{4})-([01]?[[:digit:]])-([0-3]?[[:digit:]])',
                   '\3/\2/\1', 1, REG_EXTENDED)
 if numr>0 then
  writeln(numr, " replacements: ", t)
 end-if
This is the output produced by the code above:
1 replacements: date1=20/11/2010,date2=1-Oct-2013,date3=30/6/2014
There are alternative ways of stating the same regular expression with BRE or ERE syntax, for example:
 ! BRE syntax
 numr:= regreplace(t, '\(\d\{4\}\)-\([01]\{0,1\}\d\)-\([0-3]\{0,1\}\d\)', '\3/\2/\1')
 ! ERE syntax
 numr:= regreplace(t, '(\d{4})-([01]{0,1}\d)-([0-3]{0,1}\d)', '\3/\2/\1', 1, REG_EXTENDED)
In these replacement statements we have used the following special characters for stating regular expressions: \d or [:digit:] indicates a numerical character, square brackets contain a set of possible character matches, {M,N} means minimum M and maximum N match count and ? stands for 0 times or once.
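The same building blocks can be recombined for the opposite conversion. The following sketch rewrites dates of the form day/month/year into year-month-day; the sample text and the model name are hypothetical, and the sketch assumes that regreplace replaces every match it finds from the given start position and returns the number of replacements.

```mosel
model "regex dates"
 uses "mmsystem"

 declarations
  t: text
  numr: integer
 end-declarations

 t:= "start=20/11/2010,end=5/1/2011"   ! Hypothetical sample input
 ! Day: optional leading digit 0-3; month: optional leading 0 or 1; year: 4 digits
 numr:= regreplace(t, '([0-3]?[[:digit:]])/([01]?[[:digit:]])/([[:digit:]]{4})',
                   '\3-\2-\1', 1, REG_EXTENDED)
 writeln(numr, " replacements: ", t)
end-model
```

Note that the pattern only checks the shape of a date, not its validity: a field such as 39/19/2011 would also match.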