TTT: Text Tokenisation Tool
Chapter 2. Installing TTT

Contents of the Distribution

After the tar file has been unpacked, the following files and directories will be found in the top-level directory ($TTT).

00README         OUTPUT/          runbibtutorial*  runplain-lttok*
DOC/             RES/             runcitations*    runplain-wsj*
EGS/             SCRIPTS/         runltpos*        runplain-xt*
GRAM/            bin/             runmuc*          runsgml*
LEX/             runbiblio*       runplain*        runtoy*
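A quick way to confirm that the unpacking succeeded is to compare the top-level directory against the listing above. The following is only a minimal sketch: the names are copied from the listing, but the function name missing_entries and the fallback to the current directory when $TTT is unset are illustrative assumptions, not part of the distribution.

```python
import os
from pathlib import Path

# Names copied from the directory listing above.
EXPECTED_DIRS = ["DOC", "EGS", "GRAM", "LEX", "OUTPUT", "RES", "SCRIPTS", "bin"]
EXPECTED_FILES = [
    "00README", "runbiblio", "runbibtutorial", "runcitations", "runltpos",
    "runmuc", "runplain", "runplain-lttok", "runplain-wsj", "runplain-xt",
    "runsgml", "runtoy",
]


def missing_entries(ttt_root):
    """Return the expected top-level entries that are absent under ttt_root."""
    root = Path(ttt_root)
    missing = [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing


if __name__ == "__main__":
    # Assumption: the TTT environment variable points at the unpacked directory.
    gaps = missing_entries(os.environ.get("TTT", "."))
    if gaps:
        print("missing: " + ", ".join(gaps))
    else:
        print("all expected entries present")
```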

The 00README file contains the unpacking and installation information from the previous section. Each of the executable files (those with names beginning "run") contains an example pipeline which demonstrates a different aspect of the system (see the Pipelines chapter for discussion).

The DOC subdirectory contains this document in both single-file and multi-file HTML versions (tttdoc.html and book1.htm respectively). The multi-file version is best for on-line use; the single-file version is mainly for printing.

The EGS subdirectory contains example texts which can be used as input when running our pipelines. (The comments at the top of each pipeline file suggest which example file is suitable as input.) EGS is split into two subdirectories, plain and sgml. EGS/plain contains plain ASCII text files with no mark-up at all. These need a very basic initial conversion to XML as the first step in the pipeline, since the input to the main TTT program fsgmatch must be an XML file. EGS/sgml contains files which are already either SGML or XML. XML files are immediately ready for processing, but SGML files need an initial conversion to XML.

The four texts in EGS/sgml/smallmuc7.sgml are newspaper texts taken from the first set of training data provided for the Named Entity Recognition task in the Seventh Message Understanding Conference (MUC-7) competition. We are grateful to Nancy Chinchor at SAIC for giving us permission to release these texts with our system. Note that the files EGS/sgml/texts.sgml and EGS/plain/texts contain the same four texts but in different formats.

The GRAM subdirectory is split into GRAM/char and GRAM/sgml. This reflects the two modes of operation of the fsgmatch program, character level fsgmatch and SGML level fsgmatch (see The Program fsgmatch for discussion). Character level grammars are located in GRAM/char. These perform the lower-level mark-up, such as the identification of headings and paragraphs and the initial segmentation of text into words. SGML level grammars are located in GRAM/sgml. These include bibliographical grammars which mark up both in-text citations and end-of-text reference list items (Finding and Structuring Bibliographic Information) and grammars which add a subset of the MUC-7 Named Entity mark-up (The NUMEX and TIMEX Grammars).

The LEX subdirectory contains lexicon files which are called by the grammar files. numbers.lex, numex.lex and timex.lex are used by the NUMEX and TIMEX grammars while the other lexicons are used by the bibliographical grammars. Note that names.lex is created "on-the-fly" by the pipeline in runbibtutorial (see Tutorial: Extracting a Lexicon) and its contents will change according to the most recent input to the pipeline.

The OUTPUT subdirectory is our suggested location for the output of our pipelines, though output can, of course, be directed to any part of your filespace or to the screen. Many of our pipelines include a final step where the marked-up XML document is converted to HTML with special style conventions to highlight the new mark-up. (We have found that this method of displaying mark-up makes it relatively easy to spot errors.) The OUTPUT/HTML subdirectory is the location we use for HTML output. We use the OUTPUT/MUC subdirectory as the location for the three different output files that result from the runmuc pipeline.

The OUTPUT/SCRIPTS subdirectory contains the rule files that are used by the program sgmltrans to convert output from XML to HTML. There is also a small Perl program specifically for making some conversions in the MUC-7 format texts.

The subdirectory OUTPUT/xt has been included in case users wish to use James Clark's XT as an alternative method of converting from XML to HTML. XT is available free of charge from http://www.jclark.com/xml/xt.html and is an implementation in Java of the tree construction part of XSL. Should you wish to use XT, it can be downloaded and unpacked in the OUTPUT/xt directory into the two subdirectories OUTPUT/xt/xt and OUTPUT/xt/xp, which are currently empty. The file OUTPUT/xt/general.xsl contains rules for converting from XML to HTML, and the executable file OUTPUT/xt/runxt is one which works for us given the directory structure just described. See the pipeline runplain-xt for further details on how to run XT.

The bin subdirectory contains the TTT programs in their compiled form (binaries). These are:

fsgmatch     ltstop       sgdelmarkup  sgmlperl
ltpos        lttok        sggrep       sgmltrans

The program fsgmatch is the core TTT program, described in Chapter 5 (The Program fsgmatch). The program ltstop is a statistical sentence boundary disambiguator which determines whether a full stop is part of an abbreviation or a marker of a sentence boundary (see Chapter 6: ltstop).

In most of our pipelines we use character level fsgmatch to segment paragraphs into words, but the program lttok performs the same task and may be preferable in some circumstances. For this reason we have included it in the release (see Chapter 6: lttok).

The LTG part-of-speech (POS) tagger LT POS adds POS information to words in XML documents in a format that is entirely consistent with TTT processing. Since POS information may be useful in the tokenisation process, we have included ltpos in our release (see Part of Speech Tagging: ltpos). We do not use POS information in any of our grammars, but we have created the example pipeline runltpos to demonstrate how to use it.

The programs sgmltrans and sgmlperl are both part of the LT XML toolkit but are released with TTT since they are used in our pipelines. We use sgmltrans to convert output from XML to HTML and sgmlperl to create the "on-the-fly" lexicon in the runbibtutorial pipeline. We have also included the LT XML program sggrep because it is very useful: it is similar to grep except that it searches for XML elements in XML files. Documentation for sgmltrans, sgmlperl and sggrep is included in the appendix (User callable programs).

The program sgdelmarkup is a TTT-specific program which removes XML mark-up. This is useful for deleting mark-up which has been created en route but which should not appear in the final output (see The Program sgdelmarkup).
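To give a feel for what searching for XML elements rather than lines of text means, the sketch below collects every element with a given name from an XML string. It is only an illustration of the idea using Python's standard library. It does not reproduce sggrep's query syntax or options, and the function name find_elements and the element name W in the usage note are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def find_elements(xml_text, tag):
    """Return each element named `tag` serialised as a string, in document order."""
    root = ET.fromstring(xml_text)
    matches = []
    for elem in root.iter(tag):
        # Work on a shallow copy so that dropping the trailing text
        # (the "tail" after the closing tag) leaves the parsed tree untouched.
        clone = copy.copy(elem)
        clone.tail = None
        matches.append(ET.tostring(clone, encoding="unicode"))
    return matches
```

Given a document in which words are wrapped in a hypothetical W element, find_elements(document, "W") would return one string per W element, in much the way that grep returns one line per match.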

The SCRIPTS subdirectory contains a number of Perl programs which make small changes to documents that cannot be performed by other TTT programs. For example, in order for a plain ASCII text to be processed by fsgmatch it must first be converted to XML. We do this with the program SCRIPTS/plain2xml.perl, which adds initial XML and DOCTYPE lines and wraps the text as a TEXT element inside a DOCS element. Other Perl programs in SCRIPTS convert back from XML to the original format.
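The conversion just described can be sketched as follows. This is not the Perl script itself, only an approximation in Python of the behaviour stated above: an XML declaration line, a DOCTYPE line, and the text wrapped as TEXT inside DOCS. The DTD file name docs.dtd is a placeholder assumption, and the escaping of "&" and "<" is an addition made here so that the result is well-formed; the real script may handle either differently.

```python
from xml.sax.saxutils import escape


def plain_to_xml(text, dtd_path="docs.dtd"):
    """Wrap plain text as a TEXT element inside a DOCS element.

    dtd_path is a placeholder; the actual DOCTYPE line written by
    SCRIPTS/plain2xml.perl is not reproduced here.
    """
    return (
        '<?xml version="1.0"?>\n'
        f'<!DOCTYPE DOCS SYSTEM "{dtd_path}">\n'
        "<DOCS>\n<TEXT>"
        + escape(text)  # replaces &, < and > with XML character entities
        + "</TEXT>\n</DOCS>\n"
    )
```

For instance, plain_to_xml(open("EGS/plain/texts").read()) would produce a string that an XML parser accepts, with the original text as the content of the TEXT element.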

The RES subdirectory contains resource files of various kinds. The DTDs for our various XML files are located at the top level of RES; these are the three files with the dtd,xml extension. Also at the top level is the file RuleSpec.dtd, which is the DTD for the fsgmatch rule formalism. (As with other TTT resource files, the grammars are themselves XML documents.) The final top-level file in RES is common-ent, which contains a single XML entity defining the full pathname of the $TTT directory. This file is accessed by other resource files and serves as a central definition of the location of $TTT. Note that this is the one file that needs to be edited when installing TTT (see Unpacking and Installing). The directories RES/POS-RES and RES/TOK-RES contain resource files for the ltpos part-of-speech tagger and the lttok and ltstop programs (Chapter 6).
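Editing common-ent by hand is straightforward, but the change can also be scripted. The sketch below assumes that the file holds one declaration of the usual XML form, an ENTITY keyword followed by a name and a quoted value. The entity name used in the distribution is not given here, so the pattern matches any name, and the function name set_entity_path is hypothetical.

```python
import re

# Matches an entity declaration up to and including its quoted value,
# e.g. <!ENTITY name "value"  (the closing ">" is left untouched).
ENTITY_RE = re.compile(r'(<!ENTITY\s+(?:%\s+)?[\w.-]+\s+)(["\'])(.*?)\2')


def set_entity_path(declaration_text, new_path):
    """Return declaration_text with the first entity value replaced by new_path."""
    def swap(match):
        # Keep the keyword and the entity name, substitute only the value.
        return f'{match.group(1)}"{new_path}"'

    updated, count = ENTITY_RE.subn(swap, declaration_text, count=1)
    if count == 0:
        raise ValueError("no entity declaration found")
    return updated
```

A replacement function is used rather than a replacement string so that backslashes in new_path are inserted literally. Paths containing a double quote are not handled by this sketch.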
