TTT: Text Tokenisation Tool

Claire Grover

Colin Matheson

Andrei Mikheev

Language Technology Group
Human Communication Research Centre
University of Edinburgh
2 Buccleuch Place
Edinburgh
EH8 9LW
Scotland

http://www.ltg.ed.ac.uk/

Abstract

The TTT system provides a flexible means of tokenising texts and adding markup at various levels. The main component of the TTT system is a program called fsgmatch. This is a general-purpose cascaded transducer which processes an input stream deterministically and rewrites it according to a set of rules provided in a grammar file. Although it can be used to alter the input in a variety of ways, the grammars provided with the TTT system are all used simply to add markup information. We have provided grammars to segment texts into paragraphs, segment paragraphs into words, recognise numerical expressions, mark up money, date and time expressions in newspaper texts, and mark up bibliographic information in academic texts. This documentation provides a description of the rule formalism and the grammars so that users will be able to alter existing grammars to suit their own needs or develop new rule sets for particular purposes. While the bulk of tokenisation is performed using hand-crafted rules, the TTT system also contains components whose rules result from machine learning. The system has two components which were trained by the maximum entropy modelling method. The first is a part-of-speech tagger which assigns syntactic category labels to words, and the second is a sentence boundary disambiguator which determines whether a full stop is part of an abbreviation or a marker of a sentence boundary.
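The cascaded-transducer idea can be sketched in miniature: each stage scans the text for a pattern and wraps its matches in markup, and later stages run over the output of earlier ones. The following Python sketch illustrates only the general technique, not the fsgmatch rule formalism itself; the tag names and patterns here are invented for the example.

```python
import re

# Each cascade stage pairs a regex with a tag to wrap matches in.
# These patterns and tag names are illustrative only; fsgmatch
# uses its own grammar-file rule formalism.
STAGES = [
    (re.compile(r"\$\d+(?:\.\d+)?"), "MONEY"),  # e.g. $42.50
    (re.compile(r"\b\d{4}\b"), "YEAR"),         # e.g. 1999
]

def cascade(text, stages=STAGES):
    """Apply each rewrite stage in order over the whole stream."""
    for pattern, tag in stages:
        text = pattern.sub(lambda m, t=tag: f"<{t}>{m.group(0)}</{t}>", text)
    return text

print(cascade("Tickets cost $42.50 in 1999."))
# Tickets cost <MONEY>$42.50</MONEY> in <YEAR>1999</YEAR>.
```

Each stage here leaves the rest of the stream untouched, so the stages compose: the second pattern operates on text that already contains the markup added by the first.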


Table of Contents
Acknowledgments
1. Introduction
What is Tokenisation?
The XML Background
The TTT Programs and Grammars
Statistical Modelling
The MUC-7 Competition
2. Installing TTT
Unpacking and Installing
Contents of the Distribution
3. Pipelines
The runplain pipeline
Variants of runplain
4. Rule Editing and Output Display
Rule Editing
Output Display
5. The Program fsgmatch
Running fsgmatch
Character Level fsgmatch
Getting Started
More Complex Rules
Using a Lexicon
SGML Level fsgmatch
SGML Level Grammars
Lexicons at the SGML Level
Summary of fsgmatch rule formalism
6. The Programs lttok and ltstop
lttok
ltstop
7. Part of Speech Tagging: ltpos
8. The NUMEX and TIMEX Grammars
A Grammar for Numbers
The NUMEX Grammar
The TIMEX Grammar
9. Finding and Structuring Bibliographic Information
Converting Plain Text to XML
Character-Level Processing
Reference Lists
Processing the Input in Stages
Grammars for Reference Lists
The AUTHOR field
The DATE field
Publication Information
Journal names
Volume Specifications
Page Ranges
In-Text Citations
Dates in citations
Extent information
Pipelines
Pipeline for Reference Lists
Pipeline for Citations
Tutorial: Extracting a Lexicon
10. The Program sgdelmarkup
I. User callable programs
sgmltrans — translate XML files into another format.
sgmlperl — a version of sgmltrans which allows embedded Perl code as rule bodies.
sggrep — works like the grep program in searching a file for regular string expressions. However, unlike grep, it is aware of the tree structure of XML files.
References