The TTT system provides a flexible means of tokenising texts and
adding markup at various levels. The main component of the TTT
system is a program called fsgmatch. This is a general-purpose
cascaded transducer which processes an input stream deterministically
and rewrites it according to a set of rules provided in a grammar
file. Although it can be used to alter the input in a variety of ways,
the grammars provided with the TTT system are all used simply to add
markup information. We have provided grammars to segment texts into
paragraphs, segment paragraphs into words, recognise numerical
expressions, mark up money, date and time expressions in newspaper
texts, and mark up bibliographic information in academic texts.
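The cascaded-rule idea can be illustrated with a minimal sketch. The Python below is not the fsgmatch rule formalism (the patterns, tag names, and two-stage layout are purely illustrative): each stage applies rules that wrap matches in markup, and later stages operate on the output of earlier ones, so a money rule can build on numbers already marked up.

```python
import re

# Hypothetical two-stage cascade; patterns and tags are illustrative,
# not fsgmatch grammar syntax.
STAGES = [
    # Stage 1: mark up numerical expressions.
    [(re.compile(r"\b\d+(?:\.\d+)?\b"), r"<NUM>\g<0></NUM>")],
    # Stage 2: mark up money expressions built on already-marked numbers.
    [(re.compile(r"\$<NUM>[^<]+</NUM>"), r"<MONEY>\g<0></MONEY>")],
]

def cascade(text: str) -> str:
    """Apply each stage's rules in order; later stages see earlier output."""
    for rules in STAGES:
        for pattern, replacement in rules:
            text = pattern.sub(replacement, text)
    return text

print(cascade("It cost $25 for 3 tickets."))
# It cost <MONEY>$<NUM>25</NUM></MONEY> for <NUM>3</NUM> tickets.
```

The ordering of stages matters: running the money rule first would find nothing, since it depends on the markup the number rule introduces.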
This documentation provides a description of the rule formalism and
the grammars so that users can alter existing grammars to
suit their own needs or develop new rule sets for particular purposes.
While the bulk of tokenisation is performed using hand-crafted rules,
the TTT system also contains components whose rules result from
machine learning. Two components were trained using maximum
entropy modelling. The first is a part-of-speech
tagger which assigns syntactic category labels to words, and the
second is a sentence boundary disambiguator which determines whether a
full stop is part of an abbreviation or a marker of a sentence
boundary.
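For the binary boundary decision, a maximum entropy model reduces to logistic regression over contextual features of the full stop. The sketch below is not the TTT disambiguator: the feature set, abbreviation list, and weights are invented for illustration, where a real model would estimate the weights from training data.

```python
import math

# Hypothetical feature weights; a trained maximum entropy model would
# estimate these from data rather than hard-code them.
WEIGHTS = {
    "prev_is_known_abbrev": -2.5,   # e.g. "etc.", "Dr."
    "prev_is_single_letter": -1.0,  # initials such as "J."
    "next_is_capitalised": 1.8,
    "next_is_lowercase": -1.5,
}

ABBREVS = {"dr", "mr", "mrs", "etc", "vs", "prof"}  # illustrative list

def features(prev_token: str, next_token: str) -> list:
    """Features of the context around a full stop."""
    f = []
    if prev_token.lower() in ABBREVS:
        f.append("prev_is_known_abbrev")
    if len(prev_token) == 1 and prev_token.isalpha():
        f.append("prev_is_single_letter")
    if next_token[:1].isupper():
        f.append("next_is_capitalised")
    if next_token[:1].islower():
        f.append("next_is_lowercase")
    return f

def p_boundary(prev_token: str, next_token: str) -> float:
    """Logistic probability that the full stop marks a sentence boundary."""
    score = sum(WEIGHTS[f] for f in features(prev_token, next_token))
    return 1.0 / (1.0 + math.exp(-score))

print(p_boundary("house", "The"))  # strong evidence for a boundary
print(p_boundary("etc", "the"))    # abbreviation before lowercase: not a boundary
```

Note that the hard cases are exactly those where evidence conflicts, such as an abbreviation followed by a capitalised word, which is why a weighted model is preferable to a fixed abbreviation list alone.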
The following related tools are also provided:

sgmltrans — translates XML files into another format.
sgmlperl — a version of sgmltrans which allows embedded Perl code as rule bodies.
sggrep — works like the grep program in searching a file for regular expressions but, unlike grep, is aware of the tree structure of XML files.
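The difference between line-oriented and tree-aware search can be sketched as follows. This is not sggrep itself (its actual query language and options are documented with the tool); the sketch, using Python's standard XML library, only shows the principle of matching a pattern within elements of a given type rather than within raw lines.

```python
import re
import xml.etree.ElementTree as ET

def xml_grep(xml_text: str, element: str, pattern: str) -> list:
    """Return the text of elements of the given type whose content
    matches the pattern, using the document tree rather than raw lines."""
    regex = re.compile(pattern)
    root = ET.fromstring(xml_text)
    results = []
    for el in root.iter(element):
        text = "".join(el.itertext())  # element's full text content
        if regex.search(text):
            results.append(text)
    return results

doc = "<DOC><S>The cat sat.</S><S>Share prices fell.</S></DOC>"
print(xml_grep(doc, "S", r"prices"))
# ['Share prices fell.']
```

Because the search unit is an element rather than a line, a match is found even when the element's text is wrapped across several physical lines, which is where plain grep falls short on XML input.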