TTT: Text Tokenisation Tool

Chapter 7. Part of Speech Tagging: ltpos

The ltpos program assigns part-of-speech labels (POS tags) to words in a text. It is a probabilistic part-of-speech tagger based on Hidden Markov Models using Maximum Entropy probability estimators (see the papers at http://www.ltg.ed.ac.uk/~mikheev/papers.html for discussion of the statistical methods used in this and other LTG programs). The tagger's resources are pre-trained on the Brown Corpus, which provides a good model of general English. ltpos is also available separately as LT POS and as part of LT CHUNK, both accessible from the LTG web pages. Users who want to make significant use of a tagger might want to acquire the full ltpos distribution with its accompanying resource files and documentation. Tagger re-training utilities are not provided with ltpos, but the full distribution allows the tagger to be extended to cover new lexica.

ltpos can handle both plain ASCII and XML marked-up files. In our example pipeline, runltpos, the input is an XML file. In XML mode the words in the input file may or may not already be marked up as words. If they are not, ltpos can add the word mark-up itself (the -q+ option without -tok, described below), in which case the DTD for the file must allow for the new mark-up.

The resource files for ltpos can be found in $TTT/RES/POS-RES. The lexical coverage of the tagger is supported by a lexicon and by unknown-word guessing rules, whose locations are specified in the resource file $TTT/RES/POS-RES/resource.penn,xml. The full set of options for ltpos is listed in its help file, which is accessed by typing "$TTT/bin/ltpos -h". Further information is also provided with the full LT POS distribution. In our runltpos pipeline, the call to ltpos is as follows:

  $TTT/bin/ltpos -q+ ".*/P" -tok -element_label "W" \
  -pos_attr P $TTT/RES/POS-RES/resource.penn,xml

Here, the query option -q+ ".*/P" specifies that ltpos should tag words inside <P> elements. The -tok option indicates that the input has already been tokenised into words, and the -element_label "W" option specifies that the words have been wrapped as <W> elements. (In fact <W> is the default for this option, so it could be omitted from the command line; we include it for discussion.) The -pos_attr P option in our example specifies that the POS tag should be made the value of the P attribute. (The DTD in $TTT/RES/general.dtd,xml permits P as an attribute of W.) As a result, a word such as <W C='W'>days</W> in the input will appear as <W P='NNS' C='W'>days</W> in the output. If the -pos_attr option had been used to select C as the POS tag attribute, then the previous value of C would have been overwritten by the POS tag: <W C='NNS'>days</W>. Note also that C is the default location for the POS tag, so omitting the -pos_attr option would give this same result. The $TTT/RES/POS-RES/resource.penn,xml argument to the ltpos command provides a resource file for tagging using the Penn Treebank tagset.
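To illustrate the attribute encoding described above, the following shell sketch lists each word next to its tag. It is not part of TTT: the function name w_pos_pairs is hypothetical, and the sketch assumes that the tag is held in the P attribute, that attribute values are single-quoted as in the example above, and that <W> elements contain no nested mark-up. It relies on grep -o, which GNU and BSD grep provide.

```shell
# Print "word<TAB>tag" for every <W ... P='TAG' ...>word</W> element on stdin.
# Assumptions (not guaranteed by ltpos): single-quoted attribute values,
# the tag stored in the P attribute, and no mark-up nested inside <W>.
w_pos_pairs() {
  tab=$(printf '\t')
  # Step 1: put each <W> element on its own line.
  # Step 2: keep only elements carrying a P attribute, and print word then tag.
  grep -o "<W [^>]*>[^<]*</W>" |
    sed -n "s|^<W[^>]* P='\([^']*\)'[^>]*>\([^<]*\)</W>\$|\2${tab}\1|p"
}
```

The output of the runltpos pipeline could be piped through w_pos_pairs to obtain one word and tag per line. Elements without a P attribute, such as the untagged <W C='W'>days</W>, are silently skipped.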

The -q+ and -q options differ as follows: -q+ is used when the POS tag is to be encoded as an attribute value of the XML element that wraps the word; -q is used when words are not marked up as XML elements, and it causes the POS tag to be attached to the end of the word.

For the -q+ case there are two possibilities: (1) if the input has already been tokenised into words, then the POS tag is assigned as an attribute value of the XML element that already wraps the word (in our case, the <W> element). Note that in this case the element name in the input must match the one given in the -element_label option; (2) if the words have not already been identified, then ltpos will mark them up as <W> elements and make the POS tag the value of the C attribute of the word.

The -q option can only be used when the input has not already been tokenised into words. It causes the POS tag to be attached to the word, separated by an underscore (e.g. "The_DT").
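Output in the underscore format is easy to post-process with standard tools. The sketch below is not part of TTT: the function name split_tagged is hypothetical, and it assumes that any XML mark-up has already been stripped so that the input consists only of whitespace-separated word_TAG tokens. The tag is taken to be whatever follows the last underscore, so that words which themselves contain underscores are handled.

```shell
# Split whitespace-separated word_TAG tokens into "word<TAB>TAG" lines.
# Assumption: the input holds only word_TAG tokens (no XML mark-up).
# The tag is everything after the LAST underscore in each token.
split_tagged() {
  awk '{
    for (k = 1; k <= NF; k++) {
      i = match($k, /_[^_]*$/)          # position of the last underscore
      if (i)                            # tokens without "_" are skipped
        print substr($k, 1, i - 1) "\t" substr($k, i + 1)
    }
  }'
}
```

For instance, feeding the text "The_DT days_NNS" to split_tagged gives one line per token, with the word and its tag separated by a tab.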

If you wish to explore ltpos further, we suggest experimenting with the following pipelines in order to familiarise yourself with the different options available.

cat $TTT/EGS/plain/toy-egs | $TTT/SCRIPTS/plain2xml.perl \
| $TTT/bin/fsgmatch -q ".*/TEXT" $TTT/GRAM/char/paras.gr \
| $TTT/SCRIPTS/openangle.perl \
| $TTT/bin/fsgmatch -q ".*/P|TITLE" $TTT/GRAM/char/words.gr \
| $TTT/SCRIPTS/openangle.perl \
| $TTT/bin/ltstop -q ".*/P" -mark "W[C='.']" $TTT/RES/TOK-RES/lttok_res.xml \
| $TTT/bin/ltpos -q+ ".*/P" -tok $TTT/RES/POS-RES/resource.penn,xml

cat $TTT/EGS/plain/toy-egs | $TTT/SCRIPTS/plain2xml.perl \
| $TTT/bin/fsgmatch -q ".*/TEXT" $TTT/GRAM/char/paras.gr \
| $TTT/SCRIPTS/openangle.perl \
| $TTT/bin/ltpos -q+ ".*/P" $TTT/RES/POS-RES/resource.penn,xml

cat $TTT/EGS/plain/toy-egs | $TTT/SCRIPTS/plain2xml.perl \
| $TTT/bin/fsgmatch -q ".*/TEXT" $TTT/GRAM/char/paras.gr \
| $TTT/SCRIPTS/openangle.perl \
| $TTT/bin/ltpos -q+ ".*/P" -element_label "PHR" $TTT/RES/POS-RES/resource.penn,xml

cat $TTT/EGS/plain/toy-egs | $TTT/SCRIPTS/plain2xml.perl \
| $TTT/bin/fsgmatch -q ".*/TEXT" $TTT/GRAM/char/paras.gr \
| $TTT/SCRIPTS/openangle.perl \
| $TTT/bin/ltpos -q ".*/P" $TTT/RES/POS-RES/resource.penn,xml

cat $TTT/EGS/plain/toy-egs | $TTT/bin/ltpos $TTT/RES/POS-RES/resource.penn,xml
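The XML pipelines above differ mainly in the options passed to ltpos in the final step. As a convenience for experimenting, these variants could be collected in a small wrapper such as the sketch below. The function name run_ltpos, its mode names and the LTPOS override variable are all hypothetical; only the ltpos options themselves are taken from the examples above. The sketch assumes that $TTT is set and that the input arrives on standard input.

```shell
# Run ltpos in one of the three XML modes used in the pipelines above.
# The mode names and the LTPOS override are hypothetical conveniences;
# the ltpos options are the ones shown in this chapter.
run_ltpos() {
  ltpos=${LTPOS:-$TTT/bin/ltpos}
  res=$TTT/RES/POS-RES/resource.penn,xml
  case "${1:-}" in
    tokenised) "$ltpos" -q+ ".*/P" -tok "$res" ;;  # words already in <W>
    wrap)      "$ltpos" -q+ ".*/P" "$res" ;;       # ltpos adds <W> itself
    inline)    "$ltpos" -q ".*/P" "$res" ;;        # word_TAG output
    *) echo "usage: run_ltpos tokenised|wrap|inline" >&2
       return 2 ;;
  esac
}
```

With this in place, the last step of the fourth pipeline above could be written as "| run_ltpos inline".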
