Text handling and regular expressions

The module mmsystem provides a large set of text handling functionality, including:

the types text, parsectx, and textarea
text formatting routines (number format, upper/lower case)
parsing routines
regular expressions

In the following subsections we show some examples of text handling with Mosel. For a full description of the available functionality please refer to the chapter mmsystem of the 'Mosel Language Reference Manual'.

text vs. string

Although they appear to denote similar concepts, the purpose and usage recommendations for the types string and text in Mosel models are quite distinct. Any string defined in a model is added to the model's names dictionary and is only freed when the model run terminates; this is not the case for model objects of the type text. The type string should therefore be used whenever objects need to be identified, so in particular for index sets.

The type text is in general the more appropriate choice for descriptive or editable texts, including reporting or logging messages, and any texts generated via (partial) copies or concatenation. A text object can be altered, allowing for a considerably wider set of operations (such as insertion, deletion) in comparison with strings. Furthermore, with the IO driver text: a public text object can be used as input or output file in a model (see Section List of I/O drivers).

It is, however, not always possible to draw a clear line between where to use string or text. A number of module subroutines therefore define multiple versions, accepting either string or text arguments. Note further that, if required, Mosel automatically converts from the type string to text, but not the other way round.
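The distinction described above can be illustrated with a minimal sketch. The model name, the set PLANTS and the message contents are made up for illustration, and the routines inserttext and deltext as well as the explicit conversion string(...) are assumed to behave as described in the chapter mmsystem of the reference manual:

```mosel
model "text vs string sketch"
 uses "mmsystem"

 declarations
   PLANTS = {"North", "South"}       ! Strings identify objects (index set)
   msg: text                         ! Editable message text
   s: string
 end-declarations

 msg:= "capacity report"             ! String is converted to text automatically
 inserttext(msg, "Plant ", 1)        ! Insert at the beginning of the text
 msg:= msg + " (draft)"              ! Concatenation produces a new text
 deltext(msg, 1, 6)                  ! Delete the 6 characters inserted above
 forall(p in PLANTS) writeln(p, ": ", msg)

 s:= string(msg)                     ! text -> string needs an explicit conversion
 writeln(s)
end-model
```

The editing operations are applied to the text object only; the string s is created once at the end, so just a single additional entry ends up in the names dictionary.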

Parsing text

In the example below we configure the global parser settings to read real numbers from a text that has fields separated by commas.

 declarations
   values: list of real
   comma=getchar(",",1)              ! ASCII value for ","
 end-declarations

 txt:= text(", , 123.4 , 345.6 ,")

! Parsing without context
 setparam("sys_sepchar", comma)      ! Comma as separation character
 setparam("sys_trim", true)          ! Trim blanks around separation character
 while (nextfield(txt)) do           ! Get next field
   values+= [parsereal(txt)]         ! Read a real number from the field
   writeln("Read up to position ", getparam("sys_endparse"))
 end-do
 writeln("Values read: ", values)    ! Output: [0,0,123.4,345.6,0]

The same behavior can be achieved with a parser context—here we do not modify any global settings, which has the advantage of preventing possible interactions with other parser settings that may be used elsewhere in our model.

 declarations
   pctx: parsectx
   values: list of real
   comma=getchar(",",1)              ! ASCII value for ","
 end-declarations

 txt:= text(", , 123.4 , 345.6 ,")

! Parsing real numbers with context
 setsepchar(pctx, comma)             ! Comma as separation character
 settrim(pctx, true)                 ! Trim blanks around separation character
 while (nextfield(txt,pctx)) do      ! Get next field
   values+= [parsereal(txt, pctx)]   ! Read a real number from the field
   writeln("Read up to position ", pctx.endparse)
 end-do
 writeln("Values read: ", values)    ! Output: [0,0,123.4,345.6,0]

When implementing data handling for optimization applications, it is good practice to add error handling to the parsing loop, for example to check whether the fields are formatted as expected:

 pctx.endparse:=0                    ! Start at the beginning of text
 pctx.sepchar:=comma                 ! Comma as separation character
 pctx.trim:=true                     ! Trim blanks around separation character
 while (nextfield(txt,pctx)) do      ! Get next field
   if getchar(txt, pctx.endparse)=comma or pctx.endparse>=txt.size then
     values+=[0.0]                   ! The field is empty
   else
     r:=parsereal(txt, pctx)         ! Read a real number from the field
     if getsysstat=0 then values+= [r]
     else
       writeln("Malformed field contents at position ", pctx.endparse,
       " (", copytext(txt, pctx.endparse,pctx.endparse+2), ")")
     end-if
   end-if
   writeln("Read up to position ", pctx.endparse)
 end-do
 writeln("Values read: ", values)    ! Output: [0,0,123.4,345.6,0]

One might also choose to work with multiple parser contexts, for example using an 'inner' context pctxi for reading some part of each field from the original text (here an integer number that is read from a text containing a real).

 declarations
   pctx,pctxi: parsectx
   ivalues: list of integer
   comma=getchar(",",1)              ! ASCII value for ","
 end-declarations

 txt:= text(", , 123.4 , 345.6 ,")

 setsepchar(pctx, comma)             ! Comma as separation character
 settrim(pctx, true)                 ! Trim blanks around separation character
 while (nextfield(txt,pctx)) do      ! Get next field
   tt:=parsetext(txt, pctx)          ! Get contents of the field
   pctxi.endparse:=1                 ! Reset start to beginning of the text
   i:=parseint(tt,pctxi)             ! Read an integer number from the field
   if getsysstat=0 then ivalues+= [i]; end-if
   writeln("Read up to position ", pctx.endparse)
 end-do
 writeln("Values read: ", ivalues)    ! Output: [123,345]

Regular expressions

A regular expression (in the following abbreviated to regex) is a sequence of characters that forms a search pattern. Regexes are used to describe or match a set of strings according to certain syntax rules. Mosel supports the Basic Regular Expressions syntax (BRE) and the Extended Regular Expressions syntax (ERE) of the POSIX standard; the implementation of regular expression matching relies on the TRE library.

Here are some examples of regular expression matching and replacement with some explanations of the meaning of the employed regex. For a complete description of the supported regex syntax the reader is referred to the documentation of the TRE library (see http://laurikari.net/tre); another useful resource is the set of examples provided on the page en.wikipedia.org/wiki/Regular_expression.

The following example (regex.mos) displays all strings containing 'My' that occur in a text. The first matching statement uses BRE syntax; it displays all strings starting with 'My' irrespective of upper/lower case spelling (option REG_ICASE). The second matching statement uses ERE syntax (option REG_EXTENDED) to retrieve all strings containing 'My' other than at their beginning. We have chosen to retrieve different individual portions of the matching string (specified via the parentheses in the regular expression statement), the positions of which are stored in their order of occurrence in the array m (of type textarea).

declarations
  m: array(range) of textarea
  t: text
end-declarations
t:="MyValue=10,Sometext Mytext MoretextMytext2, MYVAL=1.5 mYtext3"
m(0).succ:=1
while (regmatch(t, '\<My\(\w*\)', m(0).succ, REG_ICASE, m))
  writeln("Word starting with 'My': ", copytext(t,m(0)))
! Output:  MyValue  Mytext  MYVAL  mYtext3

m(0).succ:=1
while (regmatch(t, '\w+((My)(\w*))', m(0).succ, REG_ICASE+REG_EXTENDED, m))
  writeln("String containing 'My' (not at beginning): ",
   copytext(t,m(0)), " (", copytext(t,m(1)), "=", copytext(t,m(2)) ,
   "+", copytext(t,m(3)), ")")
! Output:  MoretextMytext2 (Mytext2=My+text2)

The special characters used in the formulation of the regular expressions above have the following meaning: \< marks the beginning of a word, \w denotes alphanumeric or underscore characters, * means 0 or more times and + stands for 1 or more times.
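The difference between * and + can be seen in a small sketch that applies two patterns to the same text. The model name and the sample text are made up for illustration; the calls follow the five-argument form of regmatch used in the example above.

```mosel
model "regex quantifier sketch"
 uses "mmsystem"

 declarations
   m: array(range) of textarea
   t: text
 end-declarations

 t:= "a ab abb"

 ! 'ab*': an 'a' followed by 0 or more 'b', so a lone 'a' also qualifies
 m(0).succ:=1
 while (regmatch(t, '\<ab*', m(0).succ, REG_EXTENDED, m))
   writeln("Match for ab*: ", copytext(t,m(0)))

 ! 'ab+': an 'a' followed by at least one 'b', so a lone 'a' does not qualify
 m(0).succ:=1
 while (regmatch(t, '\<ab+', m(0).succ, REG_EXTENDED, m))
   writeln("Match for ab+: ", copytext(t,m(0)))
end-model
```

Both patterns are stated in ERE syntax, since + is not part of the standard BRE operators.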

The following Mosel code snippet shows how to replace matching expressions in a text that contains dates with different formats:

t:="date1=20/11/2010,date2=1-Oct-2013,date3=2014-6-30"
numr:= regreplace(t, '([[:digit:]]{4})-([01]?[[:digit:]])-([0-3]?[[:digit:]])',
              '\3/\2/\1', 1, REG_EXTENDED)
if numr>0 then
 writeln(numr, " replacements: ", t)
end-if

This is the output produced by the code above:

1 replacements: date1=20/11/2010,date2=1-Oct-2013,date3=30/6/2014

There are alternative ways of stating the same regular expression with BRE or ERE syntax, for example:

numr:= regreplace(t, '\(\d\{4\}\)-\([01]\{0,1\}\d\)-\([0-3]\{0,1\}\d\)',
         '\3/\2/\1' )
numr:= regreplace(t, '(\d{4})-([01]{0,1}\d)-([0-3]{0,1}\d)',
         '\3/\2/\1', 1, REG_EXTENDED )

In these replacement statements we have used the following special characters for stating regular expressions: \d or [:digit:] indicates a numerical character, square brackets contain a set of possible character matches, {M,N} means minimum M and maximum N match count and ? stands for 0 times or once.
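As a further sketch built from the same constructs, the reverse transformation (from day/month/year to year-month-day) might be written as follows. The model name and the sample text are made up for illustration, and the code does not assume a particular number of replacements; numr reports what regreplace actually did.

```mosel
model "regex date sketch"
 uses "mmsystem"

 declarations
   t: text
   numr: integer
 end-declarations

 t:= "start=20/11/2010,end=5/1/2011"

 ! Day:   optional leading digit 0-3 followed by one digit
 ! Month: optional leading digit 0 or 1 followed by one digit
 ! Year:  exactly four digits
 numr:= regreplace(t,
          '([0-3]?[[:digit:]])/([01]?[[:digit:]])/([[:digit:]]{4})',
          '\3-\2-\1', 1, REG_EXTENDED)
 writeln(numr, " replacement(s): ", t)
end-model
```

The three parenthesized groups are referenced in the replacement text as \1 (day), \2 (month) and \3 (year), so reordering the date components only requires changing the replacement string.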
