TTT: Text Tokenisation Tool
Prev	Chapter 5. The Program fsgmatch	Next

Character Level fsgmatch

In this section we give a brief overview of fsgmatch at the character level. In Getting Started we start with some very simple examples designed to illustrate the basic rule formalism. In More Complex Rules we give examples of more complex rules. Using a Lexicon describes the use of lexicons.

Getting Started

The examples in this section can be found in the grammar $TTT/GRAM/char/toy.gr and can be tried out by piping the example file $TTT/EGS/plain/toy-egs through the pipeline in $TTT/runtoy, as in the following command:

cat $TTT/EGS/plain/toy-egs | $TTT/runtoy | more

At the character level, fsgmatch processes a stream of characters searching for strings which match regular expressions specified in the rules of the grammar file. When a match is found, the string is rewritten according to the specification given by the rule. For example, the following rule

  <RULE name="eng-am1" targ="colour">
    <REL match="color"></REL>
  </RULE>

will match any instance of the string "color" and will rewrite it as "colour". This is a very simple instance of an fsgmatch rule. A rule is an XML RULE element which contains a number of REL elements. A RULE element must always have the attribute name and may additionally have other attributes. In this case it has the attribute targ which is where the rewrite is specified. With character level fsgmatch, the rewrite can be of any kind. In the following rule, not only is "color" rewritten as "colour" but it is enclosed in braces:

  <RULE name="eng-am2" targ="{colour}">
    <REL match="color"></REL>
  </RULE>

When a rule contains more than one REL, it must be specified whether the RELs are to be thought of as a disjunction or whether they indicate a sequence. The attribute type on the RULE element should be given one of the following values: SEQ, DISJF, DISJ. When type is left unspecified the default value is SEQ. To illustrate, the following rule finds the sequence "color of eggplants" and rewrites it as "colour of aubergines":

  <RULE name="eng-am3" targ="colour of aubergines">
    <REL match="color"></REL>
    <REL match="(\n|[ ])+of(\n|[ ])+"></REL>
    <REL match="eggplants"></REL>
  </RULE>

The second REL contains a more complex regular expression than a simple string. Regular expressions in fsgmatch are very similar to the standard regular expressions used by Perl and many Unix programs. In our example, the second REL matches a sequence of whitespaces or newlines followed by the string "of" followed by more whitespaces or newlines. This means the rule as a whole will match a wide range of strings which vary in the nature and size of the gaps between the words. The rewrite, however, specifies just one whitespace between each word. If we want to preserve the whitespace from input to output, the rule must be formulated as follows:

  <RULE name="eng-am4" targ="&A-REW;&B-VAL;&C-REW;">
    <REL var="A" match="color" rewrite="colour"></REL>
    <REL var="B" match="&WSORNL;+of&WSORNL;+"></REL>
    <REL var="C" match="eggplants" rewrite="aubergines"></REL>
  </RULE>

Here we specify on each relevant REL what the rewrite for that portion of the string is, using the attribute rewrite. We also uniquely identify each REL using variable names as encoded in the attribute var. In the targ attribute of the RULE we use some built-in XML entities which refer either to the rewrite of a string or its actual value, i.e. the string itself. Thus the rewrite for the whole rule is the rewrite of the first REL (&A-REW;) followed by whatever the second REL matches (&B-VAL;) followed by the rewrite of the third REL (&C-REW;).

Notice also the use of the XML entity &WSORNL; in the middle REL. This entity is defined in the grammar as follows:

  <!ENTITY WSORNL       "(\n|[ ])">

i.e. it expands as the regular expression which matches either a whitespace or a newline. Thus XML entities can be used to define shorthands for regular expressions which are frequently used.
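The behaviour of eng-am4 can be approximated outside fsgmatch with an ordinary regular expression. The following is a minimal Python sketch, not part of TTT: the names WSORNL, PATTERN and eng_am4 are our own, the three capture groups stand in for the RELs A, B and C, and the replacement function plays the role of the targ attribute.

```python
import re

# Counterpart of the WSORNL entity: one space or one newline.
WSORNL = r"(?:\n|[ ])"

# The three capture groups play the role of the RELs named A, B and C.
PATTERN = re.compile(rf"(color)({WSORNL}+of{WSORNL}+)(eggplants)")


def eng_am4(text: str) -> str:
    """Rewrite A and C, but copy B (the gaps around "of") unchanged."""
    # "colour" ~ &A-REW;   m.group(2) ~ &B-VAL;   "aubergines" ~ &C-REW;
    return PATTERN.sub(lambda m: "colour" + m.group(2) + "aubergines", text)


if __name__ == "__main__":
    print(repr(eng_am4("color  of\neggplants")))
```

The two spaces and the newline of the input survive in the output because the middle group is copied rather than rewritten.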

The following rule demonstrates the use of disjunctive RELs. The type values DISJF and DISJ indicate that only one of the RELs should match. In the case of DISJF, the search stops at the first match while in the case of DISJ the longest match succeeds. In this case there is no conflict between the strings so the choice between DISJ and DISJF has no consequences.

  <RULE name="eng-am5" type="DISJ" targ="&S-REW;">
    <REL match="color"    rewrite="colour"></REL>
    <REL match="center"   rewrite="centre"></REL>
    <REL match="theater"  rewrite="theatre"></REL>
    <REL match="realize"  rewrite="realise"></REL>
    <REL match="flavor"   rewrite="flavour"></REL>
    <REL match="aluminum" rewrite="aluminium"></REL>
    <REL match="chip"     rewrite="crisp"></REL>
    <REL match="cookie"   rewrite="biscuit"></REL>
  </RULE>

In other cases, the choice is important. In the following example the intention is to convert every instance of the word "eight" to the digit "8" and every instance of the word "eighteen" to "18". If care is not taken, the "eight" in "eighteen" might be matched by the wrong REL and "eighteen" would be converted to "8een".

  <RULE name="eights" type="DISJ" targ="&S-REW;">
    <REL match="eight"    rewrite="8"></REL>
    <REL match="eighteen" rewrite="18"></REL>
  </RULE>

The rule as it is given works correctly since it looks for the longest match. An alternative way of writing the rule is as follows:

  <RULE name="eights" type="DISJF" targ="&S-REW;">
    <REL match="eighteen" rewrite="18"></REL>
    <REL match="eight"    rewrite="8"></REL>
  </RULE>

Here the REL with the longer string is listed first and in combination with type="DISJF" this ensures the same behaviour as with the previous version of the rule.
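The difference between the two disjunction types can be sketched in a few lines of Python. This is an illustration only and makes an assumption about scanning that the text above does not spell out: the alternatives are tried at each position from left to right, and unmatched characters are copied through. The function name rewrite and its arguments are hypothetical.

```python
import re


def rewrite(text, alternatives, mode):
    """Scan text left to right, trying the alternatives at each position.

    alternatives: list of (regex, replacement) pairs in rule order.
    mode: "DISJF" takes the first alternative that matches,
          "DISJ" takes the alternative with the longest match.
    """
    compiled = [(re.compile(pattern), repl) for pattern, repl in alternatives]
    out, pos = [], 0
    while pos < len(text):
        # Collect (end position, replacement) for every alternative that matches here.
        hits = []
        for regex, repl in compiled:
            m = regex.match(text, pos)
            if m and m.end() > pos:
                hits.append((m.end(), repl))
        if not hits:
            out.append(text[pos])  # no rule applies: copy the character
            pos += 1
            continue
        if mode == "DISJF":
            end, repl = hits[0]  # first match in rule order
        else:
            end, repl = max(hits, key=lambda hit: hit[0])  # longest match
        out.append(repl)
        pos = end
    return "".join(out)


if __name__ == "__main__":
    eights = [("eight", "8"), ("eighteen", "18")]
    print(rewrite("eight or eighteen", eights, "DISJ"))
    print(rewrite("eight or eighteen", eights, "DISJF"))
    print(rewrite("eight or eighteen", eights[::-1], "DISJF"))
```

With the shorter string listed first, only the DISJ mode avoids producing "8een"; reversing the list makes DISJF safe as well.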

In the following sections we look at more examples of character level fsgmatch rules and demonstrate other aspects of the rule formalism. We use examples from the grammars in $TTT/GRAM/char/ to illustrate.

More Complex Rules

A first point to make about our character level grammars is that we use them to produce XML markup: the output of a rule does not alter the input string except to wrap start and end tags around it. However, at the character level the rewrite mechanism is not specifically tuned to XML markup, and there is an issue relating to the open angle bracket character <. This character has special status in XML and can never appear except as part of an XML construct. Whenever it needs to appear just as a character, an XML entity must be used. XML has a built-in entity &lt; which expands as the character-denoting entity &#60;, thus either of the following two expressions should be used to encode the expression x < y.

  x &lt; y
  x &#60; y
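The equivalence of the two encodings can be checked with any XML parser. The sketch below uses the Python standard library; the element name p and the helper is_well_formed are our own choices for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(document: str) -> bool:
    """Return True if the string parses as an XML document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Both encodings yield the same character data once parsed.
named = ET.fromstring("<p>x &lt; y</p>").text
numeric = ET.fromstring("<p>x &#60; y</p>").text

if __name__ == "__main__":
    print(is_well_formed("<p>x < y</p>"))  # a bare < is rejected by the parser
    print(named == numeric)
    print(escape("x < y"))  # produces the &lt; form
```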

Our character level grammar $TTT/GRAM/char/words.gr searches the input stream for sequences of characters that count as words and marks them up with the XML tag W. Thus if the input to the grammar consists of this sentence, we want the output to look like this: (We have added linebreaks to make the example readable.)

<W C='W'>Thus</W> <W C='W'>if</W> <W C='W'>the</W> <W C='W'>input</W>
<W C='W'>to</W> <W C='W'>the</W> <W C='W'>grammar</W> <W C='W'>consists</W> 
<W C='W'>of</W> <W C='W'>this</W> <W C='W'>sentence</W><W C='CM'>,</W> 
<W C='W'>we</W> <W C='W'>want</W> <W C='W'>the</W> <W C='W'>output</W> 
<W C='W'>to</W> <W C='W'>look</W> <W C='W'>like</W> <W C='W'>this</W><W C='CM'>:</W>

In order to get this output, however, we cannot use the open angle bracket character directly since fsgmatch at the character level simply deals with character sequences. Instead we use the entity &lt; in the rules and use a Perl program $TTT/SCRIPTS/openangle.perl to convert it to <. To give an example, the following rule matches a single instance of a comma, a colon or a semi-colon.

  <RULE  name="comma" targ="&lt;W C='CM'>&S-VAL;&lt;/W>">
    <REL match="(,|:|;)"></REL>
  </RULE>

When the match succeeds, the string is rewritten according to the specification in the attribute targ. The string itself, &S-VAL;, is preceded by the sequence <W C='CM'> and followed by </W>. All the other rules in $TTT/GRAM/char/words.gr perform a similar kind of rewrite, so the actual output of a call to fsgmatch with words.gr is as follows (again with added linebreaks):

&#60;W C='W'>Thus&#60;/W> &#60;W C='W'>if&#60;/W> &#60;W C='W'>the&#60;/W>
&#60;W C='W'>input&#60;/W> &#60;W C='W'>to&#60;/W> &#60;W C='W'>the&#60;/W>
&#60;W C='W'>grammar&#60;/W> &#60;W C='W'>consists&#60;/W> &#60;W C='W'>of&#60;/W>
&#60;W C='W'>this&#60;/W> &#60;W C='W'>sentence&#60;/W>&#60;W C='CM'>,&#60;/W>
&#60;W C='W'>we&#60;/W> &#60;W C='W'>want&#60;/W> &#60;W C='W'>the&#60;/W>
&#60;W C='W'>output&#60;/W> &#60;W C='W'>to&#60;/W> &#60;W C='W'>look&#60;/W>
&#60;W C='W'>like&#60;/W> &#60;W C='W'>this&#60;/W>&#60;W C='CM'>:&#60;/W>

This output must then be converted to the desired form using the program $TTT/SCRIPTS/openangle.perl. It is an artificial constraint imposed by the designers of XML that < cannot appear just as a character and this has led to this clumsy extra step. We plan to eliminate this in the next release by changing fsgmatch so that XML mark-up is a properly supported option at the character level, just as it is at the SGML level.
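The two-step process can be pictured with the following Python sketch. It is a stand-in for illustration only: mark_punct imitates the effect of the "comma" rule, and open_angle is a hypothetical substitute for openangle.perl, whose actual implementation is not shown in this section. We assume only that the script's job is to turn each &#60; into a real open angle bracket.

```python
import re


def mark_punct(text: str) -> str:
    """Imitate the "comma" rule: wrap , : ; in W markup, writing &#60; for <."""
    return re.sub(
        r"(,|:|;)",
        lambda m: "&#60;W C='CM'>" + m.group(1) + "&#60;/W>",
        text,
    )


def open_angle(text: str) -> str:
    """Hypothetical stand-in for openangle.perl: restore the open angle brackets."""
    return text.replace("&#60;", "<")


if __name__ == "__main__":
    intermediate = mark_punct("sentence, we")
    print(intermediate)
    print(open_angle(intermediate))
```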

In the rules we have seen so far, the RELs have contained a regular expression which must be matched. It is possible to make a REL call another rule and thereby structure the rules into a grammar. For example, the following rule defines a disjunction of two RELs where the type="REF" specifications indicate that the strings to be matched are defined by other rules. The names of these other rules are encoded as the value of the attribute match, in this case, "quote" and "bracket".

  <RULE name="quote-or-br" type="DISJF">
    <REL type="REF" match="quote"></REL>
    <REL type="REF" match="bracket"></REL>
  </RULE>

When a rule calls another rule in this way, the rule being called must be defined earlier in the grammar file than the rule that calls it. If the rules are not ordered in the file correctly then an error will occur. Note that recursion is disallowed: a rule cannot call itself nor can it call a rule that calls it.
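The ordering constraint amounts to a define-before-use check, which also rules out recursion: a rule that calls itself, or two rules that call each other, cannot both be defined before they are referenced. The following Python sketch illustrates the idea; it is not how fsgmatch itself reports the error, and the function name and data layout are our own.

```python
def check_rule_order(rules):
    """Check that every referenced rule is defined earlier in the list.

    rules: list of (name, [names of rules referenced by REF]) in file order.
    Raises ValueError at the first reference to a rule not yet defined.
    """
    defined = set()
    for name, refs in rules:
        for ref in refs:
            if ref not in defined:
                raise ValueError(
                    f"rule {name!r} refers to {ref!r} before it is defined"
                )
        defined.add(name)


if __name__ == "__main__":
    good = [("quote", []), ("bracket", []), ("quote-or-br", ["quote", "bracket"])]
    check_rule_order(good)  # passes silently

    bad = [("quote-or-br", ["quote", "bracket"]), ("quote", []), ("bracket", [])]
    try:
        check_rule_order(bad)
    except ValueError as err:
        print(err)
```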

By grouping the quote and bracket rules together in the rule "quote-or-br" we can then refer to their disjunction in other rules, as for example in the following rule which describes a sequence of an optional quote or bracket followed by either a word-punctuation sequence or just a word or a symbol sequence.

  <RULE name="words" targ="&A-REW;&B-REW;">
    <REL var="A" type="REF" match="quote-or-br" m_mod="QUEST"></REL>
    <REL var="B" type="GROUP"  match="DISJF">
       <REL type="REF" match="word-punct"></REL>
       <REL type="REF" match="word"></REL>
       <REL type="REF" match="symbol-word"></REL>
    </REL> 
  </RULE>

Other aspects of this rule are described below, but for the moment we draw attention to the m_mod="QUEST" specification on the "quote-or-br" REL. The m_mod attribute is used to control the way the match defined in the REL is used. When m_mod is not specified it receives the default value PLAIN which indicates that a simple match is to be made. Other values for m_mod are QUEST, STAR, PLUS, TEST and TEST-NO. QUEST is like the symbol `?' in regular expressions in that it indicates that the match is optional. STAR and PLUS are the equivalents of Kleene star and Kleene plus: they specify zero or more iterations and one or more iterations respectively. TEST is a way of constraining left or right context in that the string that a TEST REL defines must be present in the position it specifies but, when the rule is successful, the test substring is not counted as part of the match. The following rule provides an example:

  <RULE name="word-ws" targ="&A-REW;">
    <REL var="A" type="REF" match="words"></REL>
    <REL var="B"            match="&WSORNL;+" m_mod="TEST"></REL>
  </RULE>

Here the first REL is searching for a word as defined by the rule words and the second REL is searching for the whitespace following the word. The whitespace, however, is not part of the successful match. Rather, it is a constraint on the material in the right context to ensure that a full word rather than a partial word is identified. Notice that &A-REW; in the targ picks up the rewrite value of the substring A as defined by the rule "words".

TEST-NO is like TEST except that it tests for the absence of a substring rather than its presence, i.e. it can be used to require that a certain string is not in the context.
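Readers familiar with Perl-style regular expressions may find it helpful to compare the m_mod values with the usual operators. The Python sketch below is an analogy only, not a description of how fsgmatch is implemented: QUEST corresponds roughly to `?`, TEST in right context to a lookahead `(?=...)`, and TEST-NO in right context to a negative lookahead `(?!...)`. The patterns themselves are invented for illustration.

```python
import re

# QUEST: an optional opening quote or bracket before a word.
QUEST = re.compile(r"[\"(]?[a-z]+")

# TEST: whitespace must follow the word but is not part of the match.
TEST = re.compile(r"[a-z]+(?=(?:\n|[ ])+)")

# TEST-NO: "een" must not follow "eight".
TEST_NO = re.compile(r"eight(?!een)")

if __name__ == "__main__":
    print(QUEST.match("(word").group())
    print(TEST.match("word next").group())  # the space is tested, not consumed
    print(TEST.match("word."))              # no whitespace follows: no match
    print(TEST_NO.search("eighteen"))       # blocked by the negative test
    print(TEST_NO.search("eight").group())
```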

While rules usually define either concatenations or disjunctions, it is sometimes useful to be able to combine these in one rule. For example, the rule words discussed above and reproduced here

  <RULE name="words" targ="&A-REW;&B-REW;">
    <REL var="A" type="REF" match="quote-or-br" m_mod="QUEST"></REL>
    <REL var="B" type="GROUP"  match="DISJF">
       <REL type="REF" match="word-punct"></REL>
       <REL type="REF" match="word"></REL>
       <REL type="REF" match="symbol-word"></REL>
    </REL> 
  </RULE>

describes a concatenation of two elements, first an optional quote or bracket and then a string which has a disjunctive definition. The method of combining these is to use the type="GROUP" attribute on a REL. This allows the REL to contain a number of other RELs which are interpreted either sequentially (match="SEQ") or disjunctively (match="DISJF").

To illustrate a sequence embedded in a disjunction, we update the rule eng-am5 from the Getting Started section. The initial rule consisted of a disjunctive list of word strings and to this we add a REL at the beginning that describes a sequence as one of the disjuncts:

  <RULE name="eng-am5" type="DISJF" targ="&S-REW;">
    <REL type="GROUP" match="SEQ" rewrite="chips">
       <REL match="french"></REL>
       <REL match="&WSORNL;+"></REL>
       <REL match="fries"></REL>
    </REL>
    <REL match="color"    rewrite="colour"></REL>
    <REL match="center"   rewrite="centre"></REL>
    <REL match="theater"  rewrite="theatre"></REL>
    <REL match="realize"  rewrite="realise"></REL>
    <REL match="flavor"   rewrite="flavour"></REL>
    <REL match="aluminum" rewrite="aluminium"></REL>
    <REL match="chip"     rewrite="crisp"></REL>
    <REL match="cookie"   rewrite="biscuit"></REL>
  </RULE>

Using a Lexicon

For many practical tasks it will be convenient or necessary to build a lexicon containing words or strings of elements that a rule-set is being designed to handle. Lexicon files for fsgmatch are fairly simple files: they contain entries which consist of the lexical item followed by one or more tags to indicate their category. The following are some entries from the lexicon $TTT/LEX/numbers.lex.

five   UNIT
fifteen   TEEN
fifty     TY
fifth     UNITH ORDFRAC
fifths    UNITH FRAC
fifteenth    TEENTH ORDFRAC
fifteenths   TEENTH FRAC
fiftieth    TIETH ORDFRAC
fiftieths   TIETH FRAC

Multi-word lexical entries may occasionally be required and there is a special double colon delimiter for these to indicate where the entry ends and where the tag(s) begin:

a few        :: QUANT QUSING
quite a few  :: QUANT QUSING

In order for a grammar to use information from a lexicon, a LEX element must be included. The following two LEX elements are included in $TTT/GRAM/char/words.gr:

<LEX type="PHRASE"
     file_name="&TTTDIR;/LEX/numbers.lex" 
     alias="NUMBERLEX"></LEX>

<LEX type="PHRASE"
     file_name="&TTTDIR;/LEX/numex.lex" 
     alias="NUMEXLEX"></LEX>

These identify and name two lexicons to be used by the grammar. The type="PHRASE" attribute indicates that the lexicon may contain multi-word entries. If it is not included then multi-word entries will not get used. Remember that the entity &TTTDIR; in the file_name specification is defined in the file $TTT/RES/common-ent and that it expands to the pathname to your TTT directory as specified by you when installing the system (see Installing TTT).

To perform lexical look-up, a CONSTR element is included in a REL. The following is the rule in $TTT/GRAM/char/words.gr which looks up a string in either of the two lexicons. This rule will match any alphabetic string with the constraint that it can be looked up successfully in one or other of the lexicons. The check_in attribute names the lexicon to be consulted. The check_tags attribute indicates which tag should appear in the entry. In this case, the "*" value indicates that any tag is acceptable, since it is sufficient in this case just to know that the string occurs in one of the lexicons. The check_mod attribute allows one to abstract away from upper/lower case distinctions. Here, the check_mod="LOWERCASE" specification indicates that a lowercase version of the string should be looked up. This allows a word, e.g. "five", to be specified once in the lexicon but to be able to match any kind of casing, e.g. "five", "Five", "FIVE", "fIVe". Other values for check_mod are UPPERCASE, FIRST-LOWERCASE, FIRST-UPPERCASE and NO (which is the default).

  <RULE name="in-num-lexs" type="DISJF">
    <REL match="[A-z]+">
      <CONSTR  check_in="NUMBERLEX" check_tags="*"  
               check_mod="LOWERCASE">
      </CONSTR>
    </REL>
    <REL match="[A-z]+">
      <CONSTR  check_in="NUMEXLEX" check_tags="*"
               check_mod="LOWERCASE">
      </CONSTR>
    </REL>
  </RULE>
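The effect of the constraint can be sketched in Python as a regular expression match followed by a dictionary look-up. This is an illustration under stated assumptions rather than the fsgmatch implementation: the small NUMBERLEX dictionary is a made-up sample, only the LOWERCASE and NO values of check_mod are modelled, and the sketch uses the pattern [A-Za-z]+ to pick out alphabetic strings.

```python
import re

# Made-up sample standing in for the lexicon aliased as NUMBERLEX.
NUMBERLEX = {"five": ["UNIT"], "fifteen": ["TEEN"], "fifty": ["TY"]}

ALPHABETIC = re.compile(r"[A-Za-z]+")


def in_lexicon(candidate, lexicon, check_mod="NO"):
    """Look the candidate up, lowercasing it first if check_mod is LOWERCASE."""
    key = candidate.lower() if check_mod == "LOWERCASE" else candidate
    return key in lexicon  # any tag is acceptable, as with check_tags="*"


def match_lexical_word(text, pos=0):
    """Return the alphabetic string at pos if the lexicon constraint holds."""
    m = ALPHABETIC.match(text, pos)
    if m and in_lexicon(m.group(), NUMBERLEX, "LOWERCASE"):
        return m.group()
    return None


if __name__ == "__main__":
    for sample in ["five", "Five", "FIVE", "fIVe", "six"]:
        print(sample, match_lexical_word(sample))
```

Note that the matched string keeps its original casing; only the look-up key is lowercased.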

Once the general look-up rule is in place, other rules can call it. In $TTT/GRAM/char/words.gr there are a number of rules which decide whether to split a hyphenated word or not. One of these deals with cases such as "twenty-five", "four-dollar", "three-quarters" where the fact that one of the sub-words is in one of the lexicons is grounds for making a split.

  <RULE name="special-hyphen-rule1" targ=
 "&lt;W C='W'>&A-VAL;&lt;/W>&lt;W C='DASH'>&B-VAL;&lt;/W>&lt;W C='W'>&C-VAL;&lt;/W>">
    <REL var="A" type="REF" match="in-num-lexs"></REL>
    <REL var="B" match="[\-]"></REL>
    <REL var="C" type="GROUP" match="SEQ">
       <REL type="REF" match="in-num-lexs"></REL>
       <REL type="REF" match="fullstop" m_mod="QUEST"></REL>
    </REL>
  </RULE>

This rule matches first a word which occurs in one of the lexicons, then a hyphen, and then a word from one of the lexicons followed by an optional full stop. The rewrite of the entire string (abstracting away from the open angle bracket issue) is the first word wrapped as a <W C='W'> element, followed by the hyphen wrapped as a <W C='DASH'> element, followed by the second word (together with any full stop) wrapped as a <W C='W'> element.
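The overall effect of the rule on a single token can be imitated as follows. This Python sketch is only an approximation for illustration: NUM_WORDS is an invented stand-in for the contents of the two lexicons, the function name is hypothetical, and the output uses real open angle brackets rather than the entity form.

```python
import re

# Invented stand-in for the union of the two lexicons.
NUM_WORDS = {"twenty", "five", "four", "dollar", "three", "quarters"}

# A: word, B: hyphen, C: word plus optional full stop.
HYPHENATED = re.compile(r"([A-Za-z]+)(-)([A-Za-z]+\.?)")


def split_hyphenated(token):
    """Return the marked-up split, or None if the rule does not apply."""
    m = HYPHENATED.fullmatch(token)
    if not m:
        return None
    a, b, c = m.groups()
    # Both sub-words must be found in the lexicon (lowercased look-up).
    if a.lower() not in NUM_WORDS or c.rstrip(".").lower() not in NUM_WORDS:
        return None
    return f"<W C='W'>{a}</W><W C='DASH'>{b}</W><W C='W'>{c}</W>"


if __name__ == "__main__":
    print(split_hyphenated("twenty-five"))
    print(split_hyphenated("well-known"))
```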

Further information about using lexicons can be found in Lexicons at the SGML Level in the SGML level fsgmatch section.
