TTT: Text Tokenisation Tool
Chapter 3. Pipelines

Variants of runplain

Several of the pipelines that we provide are variants of runplain, each included to demonstrate a particular aspect of our tools. These are: runsgml, runmuc, runplain-lttok, runplain-wsj, runltpos and runplain-xt. Each pipeline file contains detailed comments, so we provide only an outline here.

The runsgml pipeline is almost identical to runplain, except that it takes a particular, simple kind of SGML marked-up text as input. This requires a different first step in the pipeline to convert the input to XML, but after that the pipeline is the same as runplain.

The runmuc pipeline demonstrates some aspects of conversion from the kind of SGML provided for the MUC-7 competition to XML, and back again after mark-up has been added. This pipeline produces three output files. The first is the XML file resulting from the calls to fsgmatch. The second is an SGML file which is identical to the input except for the new NUMEX and TIMEX mark-up; this is the format needed for the MUC-7 scorer. The third is an HTML version for viewing in a browser.
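To make the relationship between the SGML and HTML outputs more concrete, the following Python sketch turns inline NUMEX and TIMEX elements into HTML spans. It is not how runmuc performs the conversion. The function name, the CSS class names and the sample sentence are invented for illustration, and the sketch assumes non-nested elements of the form <NUMEX TYPE="...">...</NUMEX>.

```python
import html
import re

# Assumption: entities are non-nested NUMEX or TIMEX elements carrying a
# TYPE attribute, e.g. <TIMEX TYPE="DATE">3 July 1997</TIMEX>.
ENTITY = re.compile(
    r'<(NUMEX|TIMEX)\s+TYPE="([^"]*)">(.*?)</\1>',
    re.DOTALL,
)


def muc_to_html(text):
    """Return an HTML fragment with each NUMEX/TIMEX element as a <span>.

    Hypothetical helper; the class names (e.g. "numex-money") are made up.
    """
    pieces = []
    pos = 0
    for m in ENTITY.finditer(text):
        # Escape the plain text that precedes this entity.
        pieces.append(html.escape(text[pos:m.start()]))
        tag, etype, content = m.group(1), m.group(2), m.group(3)
        css = html.escape(f"{tag.lower()}-{etype.lower()}")
        pieces.append(f'<span class="{css}">{html.escape(content)}</span>')
        pos = m.end()
    # Escape whatever follows the last entity.
    pieces.append(html.escape(text[pos:]))
    return "".join(pieces)


sample = (
    'It paid <NUMEX TYPE="MONEY">$5 million</NUMEX> '
    'on <TIMEX TYPE="DATE">3 July 1997</TIMEX>.'
)
print(muc_to_html(sample))
```

A stylesheet could then colour each class differently so that the added mark-up stands out when the page is viewed.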

The runplain-lttok pipeline is the same as runplain except that words are recognised by lttok (see Chapter 6: lttok) rather than by a call to character-level fsgmatch using $TTT/GRAM/char/words.gr. The output of lttok is mostly the same as that produced with $TTT/GRAM/char/words.gr, though the two differ in whether hyphenated words get split and in which alphanumeric sequences get bundled together. It may be preferable to use lttok when processing large quantities of text, since it uses finite-state automata and may well be faster than fsgmatch. It may also be preferable to use lttok if the input file already contains some mark-up at the word level.
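The kind of difference described above can be illustrated with two toy tokenisers written in Python. This is only a sketch: both patterns are hypothetical and are not taken from lttok or from words.gr, and nothing here indicates which of the two tools behaves in which way.

```python
import re

# Hypothetical patterns for illustration only.
# KEEP_HYPHENS treats a hyphen-joined alphanumeric sequence as one token.
KEEP_HYPHENS = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")
# SPLIT_HYPHENS makes every hyphen a token of its own.
SPLIT_HYPHENS = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenise(text, pattern):
    """Return the list of tokens that the given pattern finds in text."""
    return pattern.findall(text)


sample = "A state-of-the-art B-52 bomber."
print(tokenise(sample, KEEP_HYPHENS))
print(tokenise(sample, SPLIT_HYPHENS))
```

With the first pattern "state-of-the-art" and "B-52" each come out as a single token. With the second they are broken at every hyphen. Differences of this sort matter for any later pipeline step that matches on whole words.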

The runplain-wsj pipeline is just like runplain except that it demonstrates another variety of input: the raw text files of the Wall Street Journal part of the Penn Treebank.

The runltpos pipeline demonstrates how ltpos, the part-of-speech tagger, can be used in a TTT pipeline. The example input is plain text as with runplain but this time, after words and sentence boundaries have been identified, the words are tagged by ltpos. The output is converted to HTML with the verbs highlighted.
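The final step, highlighting verbs in HTML, can be sketched in Python as follows. This is not the stylesheet or program that runltpos uses. The function name and sample data are invented, and the sketch assumes Penn Treebank-style tags in which every verb tag begins with "VB".

```python
import html


def highlight_verbs(tagged):
    """Join (word, tag) pairs into an HTML fragment with verbs in bold.

    Assumption: verb tags start with "VB" (VB, VBD, VBZ, ...), as in the
    Penn Treebank tagset. Adjust the test if the tagger uses other tags.
    """
    parts = []
    for word, tag in tagged:
        escaped = html.escape(word)
        if tag.startswith("VB"):
            parts.append(f"<b>{escaped}</b>")
        else:
            parts.append(escaped)
    return " ".join(parts)


# Hypothetical tagger output for a short sentence.
tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("down", "RP"), (".", ".")]
print(highlight_verbs(tagged))
```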

The runplain-xt pipeline demonstrates an alternative way of mapping XML output to HTML. It uses James Clark's XT, which is available free of charge from http://www.jclark.com/xml/xt.html. Note that this pipeline will only work correctly if you have installed XT in the location suggested in the comments in the pipeline file.
