TTT: Text Tokenisation Tool

Chapter 5. The Program fsgmatch

Table of Contents
Running fsgmatch
Character Level fsgmatch
SGML Level fsgmatch
Summary of fsgmatch rule formalism

The core program in the TTT system is called fsgmatch (Fast SGml MATCH). It is a general-purpose transducer which processes an input stream and rewrites it according to a set of rules provided in a grammar file (by convention, a file with a ".gr" extension). It can be used to alter the input in a variety of ways; however, the grammars provided with the TTT system are all used simply to add mark-up information.

fsgmatch can be thought of as having two modes of operation, according to whether the input stream is treated as a stream of characters (character level fsgmatch) or as a stream of SGML/XML elements (SGML level fsgmatch). The directory structure we have provided reflects this distinction: grammars for use at the character level are in the directory $TTT/GRAM/char/ and grammars for use at the SGML level are in the directory $TTT/GRAM/sgml/. The default mode of operation is the character level; to select SGML level operation, the RULES element in a grammar file must be given the attribute TYPE="SGML".

In the next section we briefly explain how to run the fsgmatch program, and in the following two sections we describe the rule formalism at the character level and at the SGML level.
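As a quick check of the mode distinction described above, the following shell sketch reports whether a grammar file requests SGML level operation. The function name grammar_level is hypothetical and not part of TTT. The sketch assumes that the RULES start tag sits on a single line and that the attribute is written exactly as TYPE="SGML" with double quotes; a grammar laid out differently would need a proper XML tool.

```shell
# Hypothetical helper (not part of TTT): print "sgml" if the grammar's
# RULES element carries TYPE="SGML", otherwise print "char" (the default mode).
# Assumption: the RULES start tag is on one line and uses double quotes.
grammar_level() {
    if grep -q '<RULES[^>]*TYPE="SGML"' "$1"; then
        echo sgml
    else
        echo char
    fi
}
```

For example, grammar_level $TTT/GRAM/char/toy.gr would be expected to print char, since that grammar lives in the character level directory.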

Running fsgmatch

The input files to fsgmatch must be XML files, although at the character level the transductions specified in the grammar do not have to bear any relation to XML mark-up. In the pipelines we have provided, a first step in the process converts either plain ASCII or SGML files into a minimal XML format. Optionally this conversion can be reversed at the end of the pipeline. The grammar files are also XML files with a format defined in the DTD $TTT/RES/RuleSpec.dtd.

When running fsgmatch, the command must specify which XML elements of the input stream are to be processed and which grammar file is to be used. Thus the following command

cat <input> | fsgmatch -q ".*/TEXT" toy.gr

pipes the input stream to fsgmatch and specifies that the program is to work on the contents of XML TEXT elements using the grammar called toy.gr. (N.B. For simplicity, pathnames have been omitted in this example. Also note that the 'query' part of the command, -q ".*/TEXT", is expressed in the standard LT XML query language: see chapter 4 of the LT XML User Guide for details.)

A third, optional argument to fsgmatch specifies which rules from the grammar file are to be applied. If, as in the example above, the third argument is omitted, then the system applies the rule-set specified as the value of the apply attribute in the RULES element of the grammar. A glance at toy.gr (actually, $TTT/GRAM/char/toy.gr) will show that the rule-set all is the value of apply. If the grammar writer wishes to observe the performance of just one of the rules, this can be achieved by using the third argument (indicated by the option -targ) as follows:

cat <input> | fsgmatch -q ".*/TEXT" -targ "eights" toy.gr

Here the -targ option selects a specific rule, eights, as the rule to be applied. The -targ option can also be repeated to select more than one rule, as in the following example.

cat <input> | fsgmatch -q ".*/TEXT" -targ "eng-am4" -targ "eights" toy.gr
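When experimenting with different rule selections, it may be convenient to generate the command line rather than type each -targ option by hand. The following is a minimal sketch: the function name fsgmatch_cmd is hypothetical and not part of TTT, and it uses only the -q and -targ options shown above. It prints the command rather than running it, and it assumes rule names contain no quotes or other shell metacharacters.

```shell
# Hypothetical helper (not part of TTT): print an fsgmatch command line
# that applies the named rules from a grammar file.
# Usage: fsgmatch_cmd GRAMMAR [RULE ...]
fsgmatch_cmd() {
    grammar=$1
    shift
    cmd='fsgmatch -q ".*/TEXT"'
    # Add one -targ option per selected rule. With no rules given, no -targ
    # is emitted, so fsgmatch would use the grammar's apply attribute.
    for rule in "$@"; do
        cmd="$cmd -targ \"$rule\""
    done
    printf '%s %s\n' "$cmd" "$grammar"
}
```

Calling fsgmatch_cmd toy.gr eng-am4 eights prints the same command as in the example above, without the leading cat <input> | part.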

Note that a limited amount of help information about fsgmatch is available by typing:

$TTT/bin/fsgmatch -h
