TTT: Text Tokenisation Tool

Claire Grover

Colin Matheson

Andrei Mikheev

Language Technology Group
Human Communication Research Centre
University of Edinburgh
2 Buccleuch Place
Edinburgh
EH8 9LW
Scotland

http://www.ltg.ed.ac.uk/

Abstract

The TTT system provides a flexible means of tokenising texts and adding markup at various levels. The main component of the TTT system is a program called fsgmatch. This is a general-purpose cascaded transducer which processes an input stream deterministically and rewrites it according to a set of rules provided in a grammar file. Although it can be used to alter the input in a variety of ways, the grammars provided with the TTT system are all used simply to add markup information. We have provided grammars to segment texts into paragraphs, segment paragraphs into words, recognise numerical expressions, mark up money, date and time expressions in newspaper texts, and mark up bibliographic information in academic texts. This documentation provides a description of the rule formalism and the grammars so that users will be able to alter existing grammars to suit their own needs or develop new rule sets for particular purposes. While the bulk of tokenisation is performed using hand-crafted rules, the TTT system also contains components whose rules result from machine learning. The system has two components which were trained by the maximum entropy modelling method. The first is a part-of-speech tagger which assigns syntactic category labels to words, and the second is a sentence boundary disambiguator which determines whether a full stop is part of an abbreviation or a marker of a sentence boundary.
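The cascaded-transducer idea can be sketched in miniature: each stage scans the text for a pattern and wraps its matches in markup, and later stages run over the output of earlier ones. The following Python sketch illustrates only the general technique, not the fsgmatch rule formalism itself; the tag names and patterns here are invented for the example.

```python
import re

# Each cascade stage pairs a regex with a tag to wrap matches in.
# These patterns and tag names are illustrative only; fsgmatch
# uses its own grammar-file rule formalism.
STAGES = [
    (re.compile(r"\$\d+(?:\.\d+)?"), "MONEY"),  # e.g. $42.50
    (re.compile(r"\b\d{4}\b"), "YEAR"),         # e.g. 1999
]

def cascade(text, stages=STAGES):
    """Apply each rewrite stage in order over the whole stream."""
    for pattern, tag in stages:
        text = pattern.sub(lambda m, t=tag: f"<{t}>{m.group(0)}</{t}>", text)
    return text

print(cascade("Tickets cost $42.50 in 1999."))
# Tickets cost <MONEY>$42.50</MONEY> in <YEAR>1999</YEAR>.
```

Each stage here leaves the rest of the stream untouched, so the stages compose: the second pattern operates on text that already contains the markup added by the first.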


Table of Contents
Acknowledgments
1. Introduction
What is Tokenisation?
The XML Background
The TTT Programs and Grammars
Statistical Modelling
The MUC-7 Competition
2. Installing TTT
Unpacking and Installing
Contents of the Distribution
3. Pipelines
The runplain pipeline
Variants of runplain
4. Rule Editing and Output Display
Rule Editing
Output Display
5. The Program fsgmatch
Running fsgmatch
Character Level fsgmatch
Getting Started
More Complex Rules
Using a Lexicon
SGML Level fsgmatch
SGML Level Grammars
Lexicons at the SGML Level
Summary of fsgmatch rule formalism
6. The Programs lttok and ltstop
lttok
ltstop
7. Part of Speech Tagging: ltpos
8. The NUMEX and TIMEX Grammars
A Grammar for Numbers
The NUMEX Grammar
The TIMEX Grammar
9. Finding and Structuring Bibliographic Information
Converting Plain Text to XML
Character-Level Processing
Reference Lists
Processing the Input in Stages
Grammars for Reference Lists
The AUTHOR field
The DATE field
Publication Information
Journal names
Volume Specifications
Page Ranges
In-Text Citations
Dates in citations
Extent information
Pipelines
Pipeline for Reference Lists
Pipeline for Citations
Tutorial: Extracting a Lexicon
10. The Program sgdelmarkup
I. User callable programs
sgmltrans — translate XML files into another format.
sgmlperl — a version of sgmltrans which allows embedded Perl code as rule bodies.
sggrep — works like the grep program in searching a file for regular string expressions. However, unlike grep, it is aware of the tree structure of XML files.
References