TTT: Text Tokenisation Tool

Chapter 6. The Programs lttok and ltstop

Table of Contents
lttok
ltstop

The two programs lttok and ltstop identify and mark up words and sentence-final full stops respectively. They are separate programs but the default mode for running lttok (without the option -no_split) actually causes both programs to be run at once. Thus in our pipeline runplain-lttok there is no separate ltstop process. The following two sections describe the two programs separately but it should be borne in mind that by default lttok includes ltstop.

lttok

The program lttok provides an alternative to fsgmatch for identifying words in a text. It uses finite state techniques to process the input stream and it marks up words as XML elements. The rules for lttok are specified in the resource file $TTT/RES/TOK-RES/lttok_res.xml. The FSA element in the resource file contains the rules which recognise strings as potential words: every REX element specifies a regular expression which must be matched to yield a token of a certain type. Most of the REX elements classify strings as words (name=W), although a more specific type, such as punctuation (name=PUNCT), is assigned where possible. In our example pipeline, runplain-lttok, all strings matched are tagged as <W> elements with the type encoded as the value of the attribute C. Thus, a word like "recognise" will be marked up as <W C='W'>recognise</W> while a punctuation mark such as "?" will be marked up as <W C='.'>?</W>. The following (simplified) example searches for a string matching the regular expression [0-9]+[ \-]*((th)|(rd)).

   <REX name=ORD>[0-9]+[ \-]*((th)|(rd))</REX>

When this is satisfied the matching string is classified as an ordinal (name=ORD). The REX rules are applied in the order in which they are specified in the resource file and the longest matching rule is selected.
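
The selection of the longest matching rule can be illustrated with a small Python sketch. This is not how lttok is implemented: the rule list, the NUM rule and the function name best_rule are hypothetical, Python's re module stands in for the finite state matcher, and the choice to let the earlier rule win a tie is an assumption.

```python
import re

# Hypothetical rule list standing in for the REX elements of a resource file.
# Only the ORD pattern comes from the example above; NUM and W are invented.
RULES = [
    ("ORD", re.compile(r"[0-9]+[ \-]*((th)|(rd))")),
    ("NUM", re.compile(r"[0-9]+")),
    ("W", re.compile(r"[A-Za-z]+")),
]


def best_rule(text, pos=0):
    """Return (rule name, matched string) for the longest match at pos, or None."""
    best = None
    for name, pattern in RULES:  # rules are tried in the order they are listed
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            continue
        # Only a strictly longer match replaces the current best, so the
        # earlier rule keeps a tie (an assumption, not documented behaviour).
        if best is None or len(m.group(0)) > len(best[1]):
            best = (name, m.group(0))
    return best
```

With these rules, best_rule("4th of July") selects the ORD rule because "4th" is longer than the "4" matched by NUM, while best_rule("42 apples") falls back to NUM.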

It is also possible to specify a lookahead for a rule using the bo attribute of a REX element:

   <REX name=ORD bo=1>[0-9]+[ \-]*((th)|(rd))[, ]</REX>

This regular expression allows for whitespace after the number (e.g. "5 th"), but in order to prevent false tokenisation in e.g. "5 thousand" the right context is specified as "[, ]". However, we do not want this right context to be included in the matched string, so we specify that after matching the processor should back off one character (bo=1).
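
The effect of the right context and the back-off can be imitated in Python as follows. This is only a sketch of the idea: the function name match_with_backoff is hypothetical and the code says nothing about how lttok itself processes the bo attribute.

```python
import re

# Pattern text copied from the REX rule above; the final [, ] is the right context.
ORD_WITH_CONTEXT = re.compile(r"[0-9]+[ \-]*((th)|(rd))[, ]")


def match_with_backoff(text, bo=1, pos=0):
    """Match the rule at pos, then give back the last bo characters.

    Returns the token string without its right context, or None if the
    rule (including the right context) does not match.
    """
    m = ORD_WITH_CONTEXT.match(text, pos)
    if m is None:
        return None
    matched = m.group(0)
    return matched[: len(matched) - bo]
```

Here match_with_backoff("5 th, then") yields "5 th", whereas "5 thousand" is rejected because the character after "th" is neither a comma nor a space.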

In the pipeline runplain-lttok, the call to lttok is as follows:

$TTT/bin/lttok -q ".*/P|TITLE" -mark W -attr C $TTT/RES/TOK-RES/lttok_res.xml

This uses the LT QUERY query -q ".*/P|TITLE" to specify that only the text inside paragraphs and titles should be processed by lttok. The filename argument at the end is the name of the resource file that is to be used. The -mark option designates W as the name of the element in which words are marked up, and the -attr option designates C as the attribute whose value is the type assigned by lttok. Note that it is possible to give different values to the -mark and -attr options, but if you do so you must ensure that the DTD for your document contains a definition of the new element and its attribute. In our pipeline lttok is used without the option -no_split, which means that the call to lttok will also cause ltstop to be run. Thus the following input (from $TTT/EGS/plain/fullstops)

    This is an example text. It contains abbreviations like U.S. and e.g. and it
    contains words etc. The idea is to see what ltstop does with it.

used in the following pipeline

    cat $TTT/EGS/plain/fullstops \
    | $TTT/SCRIPTS/plain2xml.perl \
    | $TTT/bin/fsgmatch -q ".*/TEXT" $TTT/GRAM/char/paras.gr \
    | $TTT/SCRIPTS/openangle.perl \
    | $TTT/bin/lttok -q ".*/P|TITLE" -mark W -attr C $TTT/RES/TOK-RES/lttok_res.xml

will be marked up like this (note the empty <W C='.'></W> after "etc."):

    <W C='W'>This</W> <W C='W'>is</W> <W C='W'>an</W> <W C='W'>example</W>
    <W C='W'>text</W><W C='.'>.</W> <W C='W'>It</W> <W C='W'>contains</W> 
    <W C='W'>abbreviations</W> <W C='W'>like</W> <W C='W'>U.S.</W> <W C='W'>and</W>
    <W C='W'>e.g.</W> <W C='W'>and</W> <W C='W'>it</W> <W C='W'>contains</W>
    <W C='W'>words</W> <W C='W'>etc.</W><W C='.'></W> <W C='W'>The</W> 
    <W C='W'>idea</W> <W C='W'>is</W> <W C='W'>to</W> <W C='W'>see</W> 
    <W C='W'>what</W> <W C='W'>ltstop</W> <W C='W'>does</W> <W C='W'>with</W> 
    <W C='W'>it</W><W C='.'>.</W>
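
Output of this kind can be read back with any XML parser. The following Python sketch uses the standard library to list the tokens and to find those whose class is '.'. The enclosing <P> element and the function names are assumptions made so that the fragment is well-formed and self-contained; the W elements are copied from two lines of the output above.

```python
import xml.etree.ElementTree as ET

# W elements taken from the output above, wrapped in an assumed <P> element.
FRAGMENT = (
    "<P><W C='W'>words</W> <W C='W'>etc.</W><W C='.'></W> <W C='W'>The</W> "
    "<W C='W'>it</W><W C='.'>.</W></P>"
)


def tokens(xml_text):
    """Return (class, text) for every W element, in document order."""
    root = ET.fromstring(xml_text)
    # An empty element has text None; report it as the empty string.
    return [(w.get("C"), w.text or "") for w in root.iter("W")]


def stop_positions(xml_text):
    """Return the index of every token whose C attribute is '.'."""
    return [i for i, (cls, _) in enumerate(tokens(xml_text)) if cls == "."]
```

For the fragment above, the empty element after "etc." and the final full stop are both reported, so a sentence boundary is recoverable even where the full stop remains part of the abbreviation.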

If lttok is used with the -no_split option then ltstop is not also called and the output looks as follows:

    <W C='W'>This</W> <W C='W'>is</W> <W C='W'>an</W> <W C='W'>example</W> 
    <W C='W'>text.</W> <W C='W'>It</W> <W C='W'>contains</W> 
    <W C='W'>abbreviations</W> <W C='W'>like</W> <W C='W'>U.S.</W> <W C='W'>and</W> 
    <W C='W'>e.g.</W> <W C='W'>and</W> <W C='W'>it</W> <W C='W'>contains</W>
    <W C='W'>words</W> <W C='W'>etc.</W> <W C='W'>The</W> <W C='W'>idea</W> 
    <W C='W'>is</W> <W C='W'>to</W> <W C='W'>see</W> <W C='W'>what</W> 
    <W C='W'>ltstop</W> <W C='W'>does</W> <W C='W'>with</W> <W C='W'>it.</W>

In this case, a separate call to ltstop would be needed to decide the status of the full stops.
