TTT: Text Tokenisation Tool
Chapter 6. The Programs lttok and ltstop

ltstop

The program ltstop is a maximum entropy sentence boundary disambiguator. For each full stop, it tries to determine whether it is an end-of-sentence marker ("I went home. I was tired ..."), part of an abbreviation ("I asked Mr. Brown ..."), or both ("I come from San Francisco, Cal. This is ...").

The lttok resource file $TTT/RES/TOK-RES/lttok_res.xml is also the resource file for ltstop. Here you will find the element SPLITTER which specifies the locations of various other resource files. ltstop uses both a list of known abbreviations and a list of known non-abbreviations, and these are specified in the LEX element inside SPLITTER. If ltstop incorrectly separates a full stop from an abbreviation, it is likely that this abbreviation is absent from the abbreviation list. In this case, you can add the new abbreviation to the list in $TTT/RES/TOK-RES/abbr.lex.
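
Before adding an entry, it can help to check whether the abbreviation is already listed. The following is a minimal sketch, not part of TTT: the helper name has_abbr is hypothetical, and the stand-in list is made up only so that the sketch runs. It does not reflect the real entry format of abbr.lex, which you should inspect before editing.

```shell
# has_abbr ABBR FILE: succeed if ABBR occurs as a literal substring on any
# line of FILE. (Hypothetical helper; a substring match can over-match.)
has_abbr() {
    grep -qF -- "$1" "$2"
}

# Stand-in list for demonstration only. The real list is
# $TTT/RES/TOK-RES/abbr.lex and its entry format may differ.
demo=$(mktemp)
printf '%s\n' 'etc.' 'e.g.' 'Mr.' > "$demo"

if has_abbr 'approx.' "$demo"; then
    status=present
else
    status=missing
fi
echo "approx. is $status"

rm -f "$demo"
```

Against an installed release, the same check would be has_abbr 'approx.' "$TTT/RES/TOK-RES/abbr.lex". Fixed-string matching (-F) is used so that the full stop in the abbreviation is not treated as a regular expression wildcard.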

The model attribute of the outer FSME element inside the SPLITTER element specifies the location of the statistical model. This model cannot be changed since the software for training models is not included in this release.

All our pipelines that use ltstop call it as follows:

$TTT/bin/ltstop -q ".*/P" -mark "W[C='.']" $TTT/RES/TOK-RES/lttok_res.xml

Here the filename argument at the end is the name of the resource file to be used. The LT QUERY query -q ".*/P" specifies that ltstop should operate on paragraph elements. The -mark "W[C='.']" option specifies that when ltstop decides that a full stop is a sentence boundary, it marks that full stop as a W element with a C='.' attribute. Notice that when a full stop is both part of an abbreviation and a sentence boundary marker, it is left attached to the abbreviation and a new empty <W C='.'></W> is inserted as the sentence boundary marker (see "etc." in the example output below).

Note that processes applied to the input stream before the call to ltstop may either have marked full stops as separate tokens or have left them attached to the preceding word. In pipelines such as runplain, where we use character level fsgmatch with $TTT/GRAM/char/words.gr to segment the text into words, the full stops are usually left attached to the preceding word. Similarly, using lttok with the -no_split option leaves full stops attached to preceding words. ltstop examines each full stop and decides whether it is a sentence boundary. If it is, it is marked as a separate token; if it is not, it is attached to the preceding word, which is an abbreviation. If it is both, an empty full stop (<W C='.'></W>) is inserted to mark the sentence boundary.

With the input file $TTT/EGS/plain/fullstops, introduced in the previous section, and the following pipeline

    cat $TTT/EGS/plain/fullstops \
    | $TTT/SCRIPTS/plain2xml.perl \
    | $TTT/bin/fsgmatch -q ".*/TEXT" $TTT/GRAM/char/paras.gr \
    | $TTT/SCRIPTS/openangle.perl \
    | $TTT/bin/fsgmatch -q ".*/P|TITLE" $TTT/GRAM/char/words.gr \
    | $TTT/SCRIPTS/openangle.perl \
    | $TTT/bin/ltstop -q ".*/P" -mark "W[C='.']" $TTT/RES/TOK-RES/lttok_res.xml

we get the following output, which is identical to the first example output shown in the previous section.

    <W C='W'>This</W> <W C='W'>is</W> <W C='W'>an</W> <W C='W'>example</W> 
    <W C='W'>text</W><W C='.'>.</W> <W C='W'>It</W> <W C='W'>contains</W> 
    <W C='W'>abbreviations</W> <W C='W'>like</W> <W C='W'>U.S.</W> <W C='W'>and</W>
    <W C='W'>e.g.</W> <W C='W'>and</W> <W C='W'>it</W> <W C='W'>contains</W>
    <W C='W'>words</W> <W C='W'>etc.</W><W C='.'></W> <W C='W'>The</W> 
    <W C='W'>idea</W> <W C='W'>is</W> <W C='W'>to</W> <W C='W'>see</W> 
    <W C='W'>what</W> <W C='W'>ltstop</W> <W C='W'>does</W> <W C='W'>with</W> 
    <W C='W'>it</W><W C='.'>.</W>
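
As a quick sanity check on output of this kind, the sentence boundary markers can be counted with standard tools. The sketch below is not part of TTT: it copies the example output above into a temporary file and assumes the markers are written exactly as <W C='.'> with single quotes, as shown. The variable names are illustrative.

```shell
# Save the example output shown above to a temporary file.
out=$(mktemp)
cat > "$out" <<'EOF'
<W C='W'>This</W> <W C='W'>is</W> <W C='W'>an</W> <W C='W'>example</W>
<W C='W'>text</W><W C='.'>.</W> <W C='W'>It</W> <W C='W'>contains</W>
<W C='W'>abbreviations</W> <W C='W'>like</W> <W C='W'>U.S.</W> <W C='W'>and</W>
<W C='W'>e.g.</W> <W C='W'>and</W> <W C='W'>it</W> <W C='W'>contains</W>
<W C='W'>words</W> <W C='W'>etc.</W><W C='.'></W> <W C='W'>The</W>
<W C='W'>idea</W> <W C='W'>is</W> <W C='W'>to</W> <W C='W'>see</W>
<W C='W'>what</W> <W C='W'>ltstop</W> <W C='W'>does</W> <W C='W'>with</W>
<W C='W'>it</W><W C='.'>.</W>
EOF

# Count all sentence boundary markers (opening W tags with C='.').
boundaries=$(grep -o "<W C='\.'>" "$out" | wc -l | tr -d ' ')

# Count the empty markers, i.e. those inserted after an abbreviation
# whose full stop also ends the sentence.
empty=$(grep -o "<W C='\.'></W>" "$out" | wc -l | tr -d ' ')

echo "sentence boundaries: $boundaries (empty: $empty)"

rm -f "$out"
```

For the example text this reports three sentence boundaries, one of which is the empty marker after "etc.". The full stops in "U.S." and "e.g." are not counted because they remain inside their W C='W' tokens.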

If a process applying to the input stream prior to a call to ltstop has marked full stops as separate tokens then ltstop will still determine the status of each full stop and mark it appropriately. Thus ltstop will still produce the output in the same format as above, even though this may involve re-attaching some full stops to preceding words.

By using the -mark option for ltstop, it is possible to assign a tag other than <W C='.'> to a sentence boundary marker, though you must remember to define your new sentence boundary element in the DTD for the document. In the case where the input full stops are attached to the preceding word, this will result in all sentence boundary full stops being marked with the new tag. If the full stops are already split off in the input, however, ones which remain split off will not be reassigned to the new tag. Thus the output may have some full stops marked as <W C='.'> and some marked with your new tag (e.g. empty sentence boundaries after abbreviations like "etc." in the output above).
