TTT: Text Tokenisation Tool

Chapter 3. Pipelines

Table of Contents
The runplain pipeline
Variants of runplain

The TTT release includes the following pipelines:

runbiblio       runltpos        runplain-lttok  runsgml
runbibtutorial  runmuc          runplain-wsj    runtoy
runcitations    runplain        runplain-xt

The next section describes the runplain pipeline, which is the main pipeline for demonstrating MUC-7-style mark-up using the NUMEX and TIMEX grammars. The section after that describes the pipelines runmuc, runsgml, runplain-lttok, runplain-wsj, runplain-xt and runltpos, which are all variants of runplain. The pipelines runbiblio, runbibtutorial and runcitations demonstrate the bibliographical grammars; a detailed discussion of these can be found in Pipelines in Chapter 9. The pipeline runtoy is a very small illustration of how to use fsgmatch (see Getting Started).

The runplain pipeline

The full runplain pipeline is as follows:

    $TTT/SCRIPTS/plain2xml.perl \
    | $TTT/bin/fsgmatch -q ".*/TEXT" $TTT/GRAM/char/paras.gr \
    | $TTT/SCRIPTS/openangle.perl \
    | $TTT/bin/fsgmatch -q ".*/P|TITLE" $TTT/GRAM/char/words.gr \
    | $TTT/SCRIPTS/openangle.perl \
    | $TTT/bin/ltstop -q ".*/P" -mark "W[C='.']" $TTT/RES/TOK-RES/lttok_res.xml \
    | $TTT/bin/fsgmatch  -q ".*/P|TITLE" $TTT/GRAM/sgml/numbers.gr \
    | $TTT/bin/fsgmatch  -q ".*/P|TITLE" $TTT/GRAM/sgml/numex.gr \
    | $TTT/bin/fsgmatch  -q ".*/P|TITLE" $TTT/GRAM/sgml/timex.gr \
    | $TTT/bin/sgmltrans -r $TTT/OUTPUT/SCRIPTS/generaltrans
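The intermediate outputs shown in the rest of this section can be obtained by cutting the pipeline short after the stage of interest. An alternative, sketched below, is to save a copy of the stream part-way through with the standard tee utility. The helper names tap and demo_tap are hypothetical and not part of TTT, and tr and sed merely stand in for real pipeline stages.

```shell
# Hypothetical helper: write a copy of the stream to the named file and
# pass the stream on unchanged.
tap() {
  tee "$1"
}

# Stand-in pipeline: tr and sed are placeholders for real pipeline stages.
# The text as it looks between the two stages is saved in the file "$1".
demo_tap() {
  printf 'one two\n' \
    | tr 'a-z' 'A-Z' \
    | tap "$1" \
    | sed 's/ /_/'
}
```

In runplain itself, the same idea would amount to adding a line of the form `| tee <some file> \` between two existing stages; the choice of file name is left to the user.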

The first step in the pipeline uses a Perl program to convert plain text to XML. Suppose we input the file $TTT/EGS/plain/tiny which looks like this:

    In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share.

    Late last night the company announced a growth of 20%.

Then the output of $TTT/SCRIPTS/plain2xml.perl is this:

    <?xml version='1.0'?>
    <!DOCTYPE DOCS SYSTEM '<full pathname to TTT dir>/RES/general.dtd,xml'>
    <DOCS>
    <TEXT>
    In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share.

    Late last night the company announced a growth of 20%.
    </TEXT>
    </DOCS>

Thus the conversion to XML involves the addition of an XML header plus a DOCTYPE line which points to a general purpose DTD for the document. The pathname in this line is added by the Perl program using the environment variable $TTT which should be set when the system is installed. The Perl program also wraps a TEXT element around the contents of the input file and a DOCS element around the TEXT element. If a different kind of XML format is needed then the program $TTT/SCRIPTS/plain2xml.perl can be changed but equivalent changes will also be needed in the DTD $TTT/RES/general.dtd,xml.
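To make the wrapping concrete, here is a minimal shell sketch of the behaviour just described. It is not the actual plain2xml.perl: the function name wrap_plain_as_xml is hypothetical, and the sketch assumes the input needs no escaping of special characters.

```shell
# Sketch only: approximates the wrapping described above. It is not the
# actual plain2xml.perl and performs no escaping of special characters.
wrap_plain_as_xml() {
  # Abort with a message if TTT is unset or empty.
  : "${TTT:?TTT must be set}"
  printf "<?xml version='1.0'?>\n"
  printf "<!DOCTYPE DOCS SYSTEM '%s/RES/general.dtd,xml'>\n" "$TTT"
  printf '<DOCS>\n<TEXT>\n'
  cat    # copy standard input through unchanged
  printf '</TEXT>\n</DOCS>\n'
}
```

Running `wrap_plain_as_xml < $TTT/EGS/plain/tiny` would print the input file between the header and the closing tags, with the DOCTYPE path built from the current value of $TTT.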

The next step in the pipeline uses character level fsgmatch (see Character Level fsgmatch) with the grammar $TTT/GRAM/char/paras.gr to mark up paragraphs and titles. As we discuss in the section More Complex Rules, there is an issue concerning the character "<" when processing at the character level, and the Perl program $TTT/SCRIPTS/openangle.perl must currently be used to correct how "<" is represented in the output. (We hope to eliminate this step in future releases.) The result of the second and third lines of the pipeline looks like this:

    <?xml version='1.0'?>
    <!DOCTYPE DOCS SYSTEM "<full pathname to TTT dir>/RES/general.dtd,xml" >
    <DOCS>
    <TEXT>
    <P>In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share.</P>

    <P>Late last night the company announced a growth of 20%.</P>
    </TEXT>
    </DOCS>

The next step in the pipeline uses character level fsgmatch again, this time with the grammar $TTT/GRAM/char/words.gr, to segment paragraphs and titles into individual words. Again, since this is character level, the output is piped through $TTT/SCRIPTS/openangle.perl to sort out the "<" character. The output of the fourth and fifth steps is as follows, except that we have inserted extra linebreaks to make it readable. Note that all of the calls to fsgmatch in this pipeline add XML elements but do not alter the input in any other way.

    <?xml version='1.0'?>
    <!DOCTYPE DOCS SYSTEM "<full pathname to TTT dir>/RES/general.dtd,xml" >
    <DOCS>
    <TEXT>
    <P><W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W> <W C='W'>CEG</W>
    <W C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W>
    <W C='W'>$</W><W C='CD'>102</W> <W C='W'>million</W><W C='CM'>,</W> 
    <W C='W'>or</W> <W C='CD'>34</W> <W C='W'>cents</W> <W C='W'>a</W>
    <W C='W'>share.</W></P> 

    <P><W C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W> <W C='W'>the</W>
    <W C='W'>company</W> <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W>
    <W C='W'>of</W> <W C='CD'>20</W><W C='W'>%</W><W C='.'>.</W></P>
    </TEXT>
    </DOCS>

The sixth step in the pipeline uses the maximum entropy sentence boundary disambiguator, ltstop, as described in Chapter 6: ltstop. This examines all full stop characters, whether they have been marked up as separate words, as at the end of our second sentence, or are included with the preceding word, as with both full stops in our first sentence. For each full stop, the system determines whether it is a sentence boundary marker or part of an abbreviation. If it is a sentence boundary marker, it is wrapped in a separate <W C='.'> element; if it is part of an abbreviation, it is not separated off (as with Corp. in our example). In cases where it is both, a new empty <W C='.'> is created to indicate the sentence boundary, though we do not have such a case in our example. The output of this stage is the same as the input except that the full stop after "share" is separated off. (Again, here and below we add extra linebreaks to make the example readable.)

    <?xml version='1.0'?>
    <!DOCTYPE DOCS SYSTEM "<full pathname to TTT dir>/RES/general.dtd,xml" >
    <DOCS>
    <TEXT>
    <P><W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W> <W C='W'>CEG</W>
    <W C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W>
    <W C='W'>$</W><W C='CD'>102</W> <W C='W'>million</W><W C='CM'>,</W>
    <W C='W'>or</W> <W C='CD'>34</W> <W C='W'>cents</W> <W C='W'>a</W> 
    <W C='W'>share</W><W C='.'>.</W></P>

    <P><W C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W> <W C='W'>the</W>
    <W C='W'>company</W> <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W>
    <W C='W'>of</W> <W C='CD'>20</W><W C='W'>%</W><W C='.'>.</W></P>
    </TEXT>
    </DOCS>

Once words and sentence boundaries have been marked up, processing with fsgmatch at the SGML level can proceed. From now on, we use fsgmatch to group sequences of XML elements into larger XML elements. The first of these stages (the seventh line of the pipeline) uses the grammar $TTT/GRAM/sgml/numbers.gr to identify multi-word numbers. In our example text, the only effect that this stage has is to group the words "102" and "million" into a <PHR C='CD'> element.

    <?xml version='1.0'?>
    <!DOCTYPE DOCS SYSTEM "<full pathname to TTT dir>/RES/general.dtd,xml" >
    <DOCS>
    <TEXT>
    <P><W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W> <W C='W'>CEG</W>
    <W C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W> 
    <W C='W'>$</W><PHR C='CD'><W C='CD'>102</W> <W C='W'>million</W></PHR>
    <W C='CM'>,</W> <W C='W'>or</W> <W C='CD'>34</W> <W C='W'>cents</W> 
    <W C='W'>a</W> <W C='W'>share</W><W C='.'>.</W></P>

    <P><W C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W> <W C='W'>the</W>
    <W C='W'>company</W> <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W>
    <W C='W'>of</W> <W C='CD'>20</W><W C='W'>%</W><W C='.'>.</W></P>
    </TEXT>
    </DOCS>

The next stage adds MUC-7-style NUMEX mark-up (see Chapter 8). The output of this stage is as follows:

    <?xml version='1.0'?>
    <!DOCTYPE DOCS SYSTEM "<full pathname to TTT dir>/RES/general.dtd,xml" >
    <DOCS>
    <TEXT>
    <P><W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W> <W C='W'>CEG</W>
    <W C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W> 
    <NUMEX TYPE='MONEY'><W C='W'>$</W><PHR C='CD'><W C='CD'>102</W> 
    <W C='W'>million</W></PHR></NUMEX><W C='CM'>,</W> <W C='W'>or</W> 
    <NUMEX TYPE='MONEY'><W C='CD'>34</W> <W C='W'>cents</W></NUMEX> <W C='W'>a</W>
    <W C='W'>share</W><W C='.'>.</W></P>

    <P><W C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W> <W C='W'>the</W>
    <W C='W'>company</W> <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W>
    <W C='W'>of</W> <NUMEX TYPE='PERCENT'><W C='CD'>20</W><W C='W'>%</W></NUMEX>
    <W C='.'>.</W></P>
    </TEXT>
    </DOCS>

The final call to fsgmatch adds MUC-7-style TIMEX mark-up (see Chapter 8):

    <?xml version='1.0'?>
    <!DOCTYPE DOCS SYSTEM "<full pathname to TTT dir>/RES/general.dtd,xml" >
    <DOCS>
    <TEXT>
    <P><W C='W'>In</W> <TIMEX TYPE='DATE'><W C='W'>July</W> 
    <W C='CD'>1995</W></TIMEX> <W C='W'>CEG</W> <W C='W'>Corp.</W> 
    <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W> 
    <NUMEX TYPE='MONEY'><W C='W'>$</W><PHR C='CD'><W C='CD'>102</W> 
    <W C='W'>million</W></PHR></NUMEX><W C='CM'>,</W> <W C='W'>or</W> 
    <NUMEX TYPE='MONEY'><W C='CD'>34</W> <W C='W'>cents</W></NUMEX>
    <W C='W'>a</W> <W C='W'>share</W><W C='.'>.</W></P>

    <P><TIMEX TYPE='TIME'><W C='W'>Late</W> <W C='W'>last</W> 
    <W C='W'>night</W></TIMEX> <W C='W'>the</W> <W C='W'>company</W> 
    <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W> <W C='W'>of</W> 
    <NUMEX TYPE='PERCENT'><W C='CD'>20</W><W C='W'>%</W></NUMEX><W C='.'>.</W>
    </P>
    </TEXT>
    </DOCS>
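As a quick check on output like the above, the NUMEX and TIMEX elements can be tallied by type with standard Unix tools. The following is a sketch only: count_entity_types is a hypothetical helper, not part of TTT, and it assumes the single-quoted TYPE attribute style shown in the examples and a grep that supports the -o option.

```shell
# Hypothetical helper: tally NUMEX/TIMEX opening tags by element and type.
# Reads marked-up XML on standard input.
count_entity_types() {
  # Pull out each opening tag, reduce it to "ELEMENT TYPE", then count.
  grep -oE "<(NUMEX|TIMEX) TYPE='[A-Z]+'" \
    | sed -E "s/<([A-Z]+) TYPE='([A-Z]+)'/\1 \2/" \
    | sort \
    | uniq -c \
    | awk '{ print $2, $3, $1 }'
}
```

Each output line gives the element name, its TYPE value and the number of occurrences.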

At this point, the mark-up is complete and all that is left is the question of how to display the output. Although a number of different kinds of element have been added, not all of them will be needed in the output. In the runplain pipeline we convert the output to HTML and in the process we lose all of the new mark-up that we do not explicitly retain. The last line in the pipeline uses the program sgmltrans to perform the conversion to HTML using the rules specified in the file $TTT/OUTPUT/SCRIPTS/generaltrans. This converts NUMEX and TIMEX elements to HTML spans and retains the <P> element, since this has an interpretation in HTML. The conversion process also associates CSS style information with the various subtypes of span so that the different NUMEX and TIMEX elements can be highlighted in different colours. The final output of the pipeline, then, is the following, which can be viewed in a browser (http://www.ltg.ed.ac.uk/software/ttt/tinyout.html). (For more details on how to convert output for viewing in a browser, see Output Display in Chapter 4.)

    <HTML>
    <HEAD>
    <TITLE>TTT Output</TITLE>
    <STYLE>
    H2  {color:black}
    SPAN.PHR-CD {background:CCCCFF}
    SPAN.WRD-CD {background:CCCCFF}
    SPAN.PHR-ORD {background:CCCCFF}
    SPAN.WRD-ORD {background:CCCCFF}
    SPAN.PHR-FRAC {background:CCCCFF}
    SPAN.WRD-FRAC {background:CCCCFF}
    SPAN.PHR-FRACORD {background:CCCCFF}
    SPAN.WRD-FRACORD {background:CCCCFF}
    SPAN.PHR-RANGE {background:CCCCFF}
    SPAN.PHR-QUANT {background:CCCCFF}
    SPAN.DATE    {background:CCFFCC}
    SPAN.TIME    {background:CCFFFF}
    SPAN.MONEY   {background:FFFFCC}
    SPAN.PERCENT {background:FFCCFF}
    </STYLE>
    </HEAD>
    <BODY>

    <P>In <SPAN CLASS='DATE'>July 1995</SPAN> CEG Corp. posted net of 
    <SPAN CLASS='MONEY'>$102 million</SPAN>, or <SPAN CLASS='MONEY'>34 cents</SPAN>
    a share.</P>

    <P><SPAN CLASS='TIME'>Late last night</SPAN> the company announced a growth of 
    <SPAN CLASS='PERCENT'>20%</SPAN>.</P>

    </BODY>
    </HTML>
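The element mapping performed by this conversion can be approximated with sed for simple input such as our example. This is only an illustration of the mapping: it is not sgmltrans, it does not use the generaltrans rule file, and it does not produce the HTML header or style sheet. The function name to_html_spans is hypothetical.

```shell
# Sketch only: mimics the NUMEX/TIMEX-to-SPAN mapping with regular
# expressions. It assumes the attribute layout shown in the examples.
to_html_spans() {
  sed -E \
    -e "s/<(NUMEX|TIMEX) TYPE='([A-Z]+)'>/<SPAN CLASS='\2'>/g" \
    -e 's#</(NUMEX|TIMEX)>#</SPAN>#g' \
    -e 's#</?(W|PHR)( [^>]*)?>##g'
}
```

The first two expressions rename the opening and closing tags, carrying the TYPE value over as the CLASS; the third drops the word and phrase mark-up while leaving <P> untouched.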

This method of displaying output is very useful for debugging purposes but, if the output is to be used in an application, it may well be preferable to keep it as an XML document but strip out the mark-up that is not needed in the end product. For example, if the paragraph, phrase and word mark-up is not required it can be removed using the program sgdelmarkup. If the last line of runplain is replaced with the following three calls to sgdelmarkup

    | $TTT/bin/sgdelmarkup -q ".*/P" \
    | $TTT/bin/sgdelmarkup -q ".*/PHR" \
    | $TTT/bin/sgdelmarkup -q ".*/W"

then we get output that looks like this:

    <?xml version='1.0'?>
    <!DOCTYPE DOCS SYSTEM "<full pathname to TTT dir>/RES/general.dtd,xml" >
    <DOCS>
    <TEXT>
    In <TIMEX TYPE='DATE'>July 1995</TIMEX> CEG Corp. posted net of 
    <NUMEX TYPE='MONEY'>$102 million</NUMEX>, or <NUMEX TYPE='MONEY'>34 
    cents</NUMEX> a share.

    <TIMEX TYPE='TIME'>Late last night</TIMEX> the company announced a growth of 
    <NUMEX TYPE='PERCENT'>20%</NUMEX>.
    </TEXT>
    </DOCS>
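For readers without the tools to hand, the effect of removing one kind of element while keeping its content can be imitated with sed. The sketch below is not sgdelmarkup and does not understand its -q queries; strip_element is a hypothetical helper that simply deletes the opening and closing tags of the named element wherever they occur.

```shell
# Sketch only: delete the tags of element $1 (e.g. W, PHR or P) but keep
# the text between them. Works on tag syntax alone, not on document structure.
strip_element() {
  sed -E "s#</?$1( [^>]*)?>##g"
}
```

Chaining `strip_element P | strip_element PHR | strip_element W` mirrors the three calls above. Because the pattern requires a space or ">" straight after the element name, stripping P leaves PHR tags alone.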

As a final step, we can opt to convert the text back into its plain ASCII format while leaving the NUMEX and TIMEX mark-up in place. To do this we can use the Perl program $TTT/SCRIPTS/xml2plain.perl at the end of the pipeline and the output will look as follows:

    In <TIMEX TYPE='DATE'>July 1995</TIMEX> CEG Corp. posted net of 
    <NUMEX TYPE='MONEY'>$102 million</NUMEX>, or <NUMEX TYPE='MONEY'>34 
    cents</NUMEX> a share.

    <TIMEX TYPE='TIME'>Late last night</TIMEX> the company announced a growth of 
    <NUMEX TYPE='PERCENT'>20%</NUMEX>.
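The unwrapping step can be pictured as the reverse of the initial conversion to XML. The following sed sketch is an assumption about the visible effect only, not a description of how xml2plain.perl works: unwrap_xml is a hypothetical name, and the sketch assumes that the XML declaration, the DOCTYPE line and the DOCS and TEXT tags each sit on a line of their own, as in the examples.

```shell
# Sketch only: drop the XML declaration, the DOCTYPE line and the
# DOCS/TEXT wrapper tags, leaving everything else untouched.
unwrap_xml() {
  sed -E \
    -e '/^<\?xml /d' \
    -e '/^<!DOCTYPE /d' \
    -e '/^<\/?(DOCS|TEXT)>$/d'
}
```

Any NUMEX and TIMEX tags inside the text pass through unchanged, since only whole wrapper lines are deleted.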
