TTT: Text Tokenisation Tool
Chapter 5. The Program fsgmatch

SGML Level fsgmatch

In this section we describe fsgmatch at the SGML level. The differences between character level and SGML level fsgmatch are not complex. The basic difference is that character level fsgmatch views its input stream as a sequence of characters, while SGML level fsgmatch views its input stream as a sequence of XML elements. As a consequence, rewriting operations are less flexible at the SGML level. For the most part, SGML level fsgmatch is used to add XML structure by wrapping sequences of smaller XML elements into larger ones. There is also some scope for changing attribute values.

Our use of fsgmatch at the two levels embodies a distinction between low-level and higher-level processing. Character level fsgmatch is used up to the point where paragraphs and words have been marked up. SGML level fsgmatch is then used to bundle specific types of word sequences into larger units. This higher level of processing enables the grammar writer to describe units such as dates or bibliographical references in a way that abstracts away from nitty-gritty details such as where whitespace or linebreaks occur.

SGML Level Grammars

In order for the fsgmatch program to process at the SGML level, the RULES element of the grammar must have a type="SGML" specification. Apart from this, the major difference from character level fsgmatch is in the match attribute of RELs. At the character level the value of match is a regular expression describing character sequences. At the SGML level it first specifies an XML element to be matched and then optionally specifies the PCDATA content of that element, described either as the full exact string or by a regular expression. The REL below matches a W element whose PCDATA content is exactly the string "a". Notice the syntax: the XML element name is given first and is separated by "/" from the description of its content. The "#" stands for PCDATA and the "=" indicates that what follows is the entire and exact string.

  <REL match="W/#=a"></REL>

This REL will match the element <W>a</W> and no other. If we want it to match not just the word "a" but also its uppercase variant "A", we need to use a regular expression:

  <REL match="W/#~^[Aa]$"></REL>

Here the "~" indicates that the definition of the content is a regular expression rather than an exact string. The initial "^" and final "$" in the regular expression are special characters that signal the start and end of the content of the element. (Thus they are just like "^" and "$" in other Unix regular expressions except that they signal the start and end of an XML element instead of the start and end of a line.)
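The two forms of match value described above can be illustrated with a small Python sketch. It is not part of TTT and is not how fsgmatch is implemented: the function name matches is hypothetical, only the bare "W", "#=" and "#~" forms shown in this section are handled, and Python's re module is assumed to behave like the fsgmatch regular expression dialect for these simple patterns.

```python
import re
import xml.etree.ElementTree as ET


def matches(spec, element):
    """Return True if `element` satisfies a match value such as "W/#=a"."""
    # The element name comes first, separated by "/" from the content part.
    name, _, content = spec.partition("/")
    if element.tag != name:
        return False
    if not content:
        # A bare element name such as "W" places no condition on the content.
        return True
    text = element.text or ""
    if content.startswith("#="):
        # "#=": the PCDATA must be exactly the string that follows.
        return text == content[2:]
    if content.startswith("#~"):
        # "#~": the PCDATA must match the regular expression that follows.
        return re.search(content[2:], text) is not None
    raise ValueError("unsupported match value: " + spec)


lower_a = ET.fromstring("<W>a</W>")
upper_a = ET.fromstring("<W>A</W>")
word_an = ET.fromstring("<W>an</W>")

print(matches("W/#=a", lower_a), matches("W/#=a", upper_a))
print(matches("W/#~^[Aa]$", upper_a), matches("W/#~^[Aa]$", word_an))
```

In this sketch the exact form accepts only <W>a</W>, the regular expression form also accepts <W>A</W>, and neither accepts <W>an</W> because of the "^" and "$" anchors.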

Certain match attributes will be frequently used and in such cases it is often useful to define them as XML entities. For example, the match value from the previous REL is defined as the following entity in the grammar $TTT/GRAM/sgml/numbers.gr.

  <!ENTITY JUSTA          "W/#~^[Aa]$">

This entity can then be used in rules, such as the following which groups word sequences such as "a thousand", "four thousand", "27 thousand", "two hundred and thirty-nine thousand" etc.

  <RULE name="thou" targ_sg="PHR[C='CD']">
    <REL type="GROUP" match="DISJF">
       <REL            match="&JUSTA;"></REL>
       <REL type="REF" match="to-999-mix"></REL>
    </REL> 
    <REL match="W/#~^([Tt]housand|THOUSAND)$"></REL>
  </RULE>

The first main REL is a GROUP type of REL (as described above) which specifies a disjunction: either the word "a" (or "A"), matched using the entity &JUSTA;, or a sequence as defined by the rule to-999-mix (i.e. numbers such as "four", "27", "two hundred and thirty-nine"). The second main REL uses a regular expression to pick up the word "thousand" and upper-case variants of it. Notice that there is no need to say anything about the whitespace that occurs between the elements that this rule combines: at the SGML level, character data between elements is effectively invisible to the rules and is simply preserved from input to output.

Notice that this rule has the attribute targ_sg rather than the targ that we saw at the character level. The targ_sg attribute indicates which XML element is to be wrapped around the sequence that the rule matches, and optionally it also specifies attribute values for that element. In the current rule, the sequence is to be wrapped in a PHR element whose C attribute has the value CD (for "cardinal"). If the input contains the following sequence

  <W C='W'>more</W> <W C='W'>than</W> <W C='W'>three</W> 
  <W C='W'>thousand</W> <W C='W'>people</W><W C='.'>.</W>

the output will be as follows

  <W C='W'>more</W> <W C='W'>than</W> <PHR C='CD'><W C='CD'>three</W> 
  <W C='W'>thousand</W></PHR> <W C='W'>people</W><W C='.'>.</W>
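The wrapping step, and the way whitespace between elements is carried through untouched, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the fsgmatch algorithm: the enclosing P element, the NUMBER_WORDS set (a stand-in for &JUSTA; and to-999-mix) and the helper names find_thou and wrap are all hypothetical. The sketch only performs the wrapping and does not model any attribute changes made by referenced rules.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the grammar's &JUSTA; entity and to-999-mix rule.
NUMBER_WORDS = {"a", "three", "four"}
THOUSAND = re.compile(r"^([Tt]housand|THOUSAND)$")


def find_thou(parent):
    """Return (start, end) child indices of the first number + "thousand" pair."""
    kids = list(parent)
    for i in range(len(kids) - 1):
        first, second = kids[i], kids[i + 1]
        if (first.tag == "W" and second.tag == "W"
                and (first.text or "").lower() in NUMBER_WORDS
                and THOUSAND.search(second.text or "")):
            return i, i + 2
    return None


def wrap(parent, start, end, name, attrs):
    """Wrap parent[start:end] in a new element, preserving inter-element text."""
    kids = list(parent)[start:end]
    wrapper = ET.Element(name, attrs)
    # Text after the last wrapped child must follow the wrapper instead.
    wrapper.tail, kids[-1].tail = kids[-1].tail, None
    for kid in kids:
        parent.remove(kid)
        wrapper.append(kid)
    parent.insert(start, wrapper)
    return wrapper


para = ET.fromstring(
    "<P><W C='W'>more</W> <W C='W'>than</W> <W C='W'>three</W> "
    "<W C='W'>thousand</W> <W C='W'>people</W><W C='.'>.</W></P>"
)
span = find_thou(para)
if span is not None:
    wrap(para, span[0], span[1], "PHR", {"C": "CD"})
result = ET.tostring(para, encoding="unicode")
print(result)
```

The matching code never inspects the spaces between the W elements. They are stored as trailing text on each element and simply move along with it, which mirrors the behaviour described above.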

It is also possible to use targ_sg just to add to or alter the attributes on an existing XML element as, for example, in the following rule (also from $TTT/GRAM/sgml/numbers.gr) which recognises the numbers "one" to "nine".

  <RULE name = "unit" targ_sg="@[C='CD']">
    <REL match = "W">
      <CONSTR  check_in="LEX" check_tags="UNIT *"  
               check_mod="LOWERCASE">
      </CONSTR>
    </REL>
  </RULE>

In the input, "nine" will already be marked as a W element with the attribute C='W' (i.e. <W C='W'>nine</W>). This rule looks the word up in the lexicon and succeeds if it has the tag UNIT. The @ in targ_sg then indicates that the element name should be kept as it was (i.e. W) but that the element should be given the attribute-value pair C='CD'. If this attribute is already specified, as it is in this case, then the new value overwrites the old. If the attribute was previously unspecified then the new attribute-value pair is simply added. (N.B. Changing attribute values as described here would be particularly useful for making corrections to systematic errors produced by a part-of-speech tagger, especially given the m_mod='TEST' facility which allows left and right context to be constrained.)
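The overwrite-or-add behaviour of an "@" target can be sketched in Python. The function apply_targ_sg is hypothetical and handles only the in-place "@[...]" form shown above. Support for several attribute-value pairs inside the brackets is an assumption of the sketch, not something this section documents.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of an in-place target: "@" followed by an optional "[...]" part.
TARG_SG = re.compile(r"^@(?:\[(?P<attrs>.*)\])?$")
ATTR = re.compile(r"(\w+)='([^']*)'")


def apply_targ_sg(spec, element):
    """Apply an in-place targ_sg value such as "@[C='CD']" to `element`."""
    found = TARG_SG.match(spec)
    if found is None:
        raise ValueError("this sketch only handles '@' targets: " + spec)
    for key, value in ATTR.findall(found.group("attrs") or ""):
        # set() overwrites an existing attribute and adds a missing one.
        element.set(key, value)
    return element


tagged = apply_targ_sg("@[C='CD']", ET.fromstring("<W C='W'>nine</W>"))
untagged = apply_targ_sg("@[C='CD']", ET.fromstring("<W>nine</W>"))

print(ET.tostring(tagged, encoding="unicode"))
print(ET.tostring(untagged, encoding="unicode"))
```

Both elements keep the name W and end up with C set to CD: the first by overwriting the old value, the second by gaining a new attribute.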

Note that it is advisable to ensure that mark-up added using SGML level fsgmatch is consistent with the DTD of the document being processed, even though XML does not require the use of DTDs. Many of the pipelines distributed with the TTT release convert files to XML format using the DTD $TTT/RES/general.dtd,xml. A glance at this will show that the PHR element added by the rule above is indeed included in the DTD.

Lexicons at the SGML Level

The method of using lexicons is essentially the same at the two fsgmatch levels, though they tend to be more extensively used at the SGML level and for this reason there may be more aspects to lexicon use exemplified in our SGML level grammars.

In the grammar $TTT/GRAM/sgml/numbers.gr, two slightly different methods of performing lexical look-up were included for illustrative purposes. The lexicon is identified by the same means as described above, namely by adding a LEX element to the RULES element in the grammar file:

  <LEX type="PHRASE"
       file_name="&TTTDIR;/LEX/numbers.lex"
       alias="LEX"></LEX>

The first rule in the grammar, unit, which we discussed above and which is reproduced here, performs look-up in the way we have already described. A W element is found and its content is looked up as specified by the CONSTR element: the lexicon is the one named LEX as defined in the LEX element above; the word must occur with the tag UNIT; and the look-up mode is LOWERCASE.

  <RULE name = "unit" targ_sg="@[C='CD']">
    <REL match = "W">
      <CONSTR  check_in="LEX" check_tags="UNIT *"  
               check_mod="LOWERCASE">
      </CONSTR>
    </REL>
  </RULE>

An alternative way to perform look-up is to define a general purpose look-up rule which can then be called by a large number of other rules. The rule below, check-num-in-lex, uses a new attribute arg with a variable value shared with check_tags in the CONSTR element. This can then be called by other rules which bind the variable to a particular tag that they are looking for. For example, the teen rule calls check-num-in-lex and binds the argument $1 to the tag specification "TEEN *". This enables the user to avoid unnecessary repetition of the check_in and check_mod aspects of lexical look-up which are invariant across a large number of rules.

  <RULE name="check-num-in-lex"  arg="$1">
     <REL match="W">
        <CONSTR   
           check_in="LEX" check_tags="$1"  check_mod="LOWERCASE">    
        </CONSTR>
     </REL>
  </RULE>

  <RULE name="teen" targ_sg="@[C='CD']">  
     <REL type="REF" match="check-num-in-lex">
        <ARG bind='$1'>TEEN  *</ARG>
     </REL>
  </RULE>
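The idea of one general look-up rule whose tag argument is bound by its callers resembles partial function application. The Python sketch below is only an analogy: the LEXICON dictionary and its entries are made up for illustration, LOWERCASE mode is assumed to mean that the word is lowercased before look-up, and the tag test corresponds to the "<tag> *" form of check_tags.

```python
from functools import partial

# Hypothetical lexicon contents: each word maps to its set of tags.
LEXICON = {
    "nine": {"UNIT"},
    "thirteen": {"TEEN"},
}


def check_num_in_lex(word, tag):
    """Generic look-up: is `word` in the lexicon with `tag` among its tags?"""
    # Assumed meaning of check_mod="LOWERCASE": lowercase before look-up.
    return tag in LEXICON.get(word.lower(), set())


# Callers bind the tag once, much as <ARG bind='$1'> does in the teen rule.
is_teen = partial(check_num_in_lex, tag="TEEN")
is_unit = partial(check_num_in_lex, tag="UNIT")

print(is_teen("Thirteen"), is_teen("nine"), is_unit("NINE"))
```

The look-up source and mode are written once in check_num_in_lex, and each caller supplies only the tag it is interested in.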

We have so far glossed over details of the ways in which the check_tags attribute may be specified. In most cases in our grammars we use specifications of the form check_tags="<tag> *" where <tag> is the category we are looking for and "*" ranges over all other tags that the entry might have associated with it. If we want to ensure that the entry only has the one tag and no others, we would omit the "*" in the specification.

When more than one tag is specified they have a conjunctive interpretation unless the "*" specification is included, in which case the interpretation is disjunctive. Thus check_tags="UNITH ORD" means that the entry must have both a UNITH tag and an ORD tag and no others. On the other hand, check_tags="UNITH ORD *" would match an entry which had one or other of the UNITH and ORD tags and which may possibly have other tags as well.

It is possible to give a negative specification using -v to introduce the tag. For example, check_tags="-v UNIT" would succeed if the word was in the lexicon but didn't have a UNIT tag. When -v is used in multiple specifications it gets a conjunctive interpretation. For example, check_tags="UNITH -v ORD *" would match an entry that did have a UNITH tag and didn't have an ORD tag (though it may have others).
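These interpretations can be summarised in a short Python sketch. It encodes only the cases spelled out above and is not the fsgmatch implementation. The function name check_tags and the treatment of combinations the text does not cover (for example a negative specification alongside several positive tags) are assumptions.

```python
def check_tags(spec, entry_tags):
    """Return True if a lexicon entry's tag set satisfies `spec`."""
    required, forbidden, open_ended = set(), set(), False
    tokens = iter(spec.split())
    for token in tokens:
        if token == "*":
            open_ended = True
        elif token == "-v":
            # "-v" negates the tag that follows it.
            tag = next(tokens, None)
            if tag is None:
                raise ValueError("-v must be followed by a tag")
            forbidden.add(tag)
        else:
            required.add(token)
    entry_tags = set(entry_tags)
    if forbidden & entry_tags:
        return False                       # a negated tag is present
    if not required:
        return True                        # only negative conditions were given
    if open_ended:
        return bool(required & entry_tags)  # with "*": any one listed tag will do
    return entry_tags == required           # without "*": exactly the listed tags


print(check_tags("UNIT *", {"UNIT", "ORD"}))
print(check_tags("UNITH ORD", {"UNITH"}))
print(check_tags("UNITH -v ORD *", {"UNITH", "ORD"}))
```

The entry is assumed to have been found in the lexicon already, so entry_tags is the set of tags stored with it.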
