TTT: Text Tokenisation Tool

Chapter 8. The NUMEX and TIMEX Grammars

Table of Contents
A Grammar for Numbers
The NUMEX Grammar
The TIMEX Grammar

We tested the TTT system by using it to participate in the Named Entity task of the MUC-7 competition. This involved developing extensive rulesets to mark up named entities (ENAMEX: persons, organisations and locations), temporal expressions (TIMEX: dates and times) and numerical expressions (NUMEX: sums of money, percentages). Detailed information about precisely which expressions should be marked up can be found in the MUC-7 Named Entity Task Definition (http://www.muc.saic.com/proceedings/ne_task.html).

In this release we distribute a general-purpose grammar for recognising numbers ($TTT/GRAM/sgml/numbers.gr) and slightly modified versions of our MUC-7 grammars for NUMEX and TIMEX expressions ($TTT/GRAM/sgml/numex.gr and $TTT/GRAM/sgml/timex.gr respectively). We use these grammars in the runplain pipeline and its variants, where they apply in the order numbers.gr, numex.gr, timex.gr. The grammar numbers.gr must apply before the other two because it identifies numerical expressions that may form part of larger NUMEX or TIMEX expressions. For example, it will identify the string "two and a half" as a number, and this could occur in money expressions such as "two and a half cents" or dates such as "two and a half years ago".

In the following sections we provide brief descriptions of the three grammars. More detailed information can be found in the comments in the grammar files themselves.

A Grammar for Numbers

One of the example grammars, $TTT/GRAM/sgml/numbers.gr, contains a fairly extensive analysis of numbers in various forms. Most cases of decimals and of character-based number expressions (`digit numbers' like 1,256,387) are handled quite straightforwardly. Describing `text numbers', such as one thousand and twenty-six, is clearly more complicated. Combinations of digits and text are also possible, as in 30 million, and the grammar allows various forms of these. A relatively extensive grammar of text fractions (three-fourths, forty-nine hundredths, and the like) is also included.

The comments in the grammar file give details of what each rule is looking for. A short description of the overall assumptions behind the grammar of text numbers is nevertheless appropriate here, since the same basic approach also covers the mixed forms (digit and text) and the fractions. Generally, the grammar should recognise almost any form of text number up to a trillion trillion, so all of the examples below are accepted:

one hundred and sixty-five thousand
three hundred million seventy five thousand four hundred and one
eighteen thousand five hundred and ninety-two
nine hundred and ninety-nine thousand billion

The grammar assumes a central position in text numbers, the `fulcrum', which characterises the whole number. In the four examples above the fulcrum is thousand, million, thousand, and billion respectively. The grammar then assumes a `modifying' number on the left of the fulcrum; in the above examples these are:

one hundred and sixty-five
three hundred
eighteen
nine hundred and ninety-nine thousand

The element following the fulcrum is optional, and it is absent in the first and fourth examples. In the second and third examples, the parts of the numbers that follow the fulcrum are:

seventy five thousand four hundred and one
five hundred and ninety-two
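
The decomposition described above can be illustrated with a small Python sketch. This is not how numbers.gr is implemented; it is only a minimal illustration that assumes the fulcrum is the magnitude word with the largest value (taking the rightmost one when there is a tie). The function name split_at_fulcrum and the MAGNITUDES table are hypothetical names introduced for this example.

```python
# Hypothetical table of magnitude words; not taken from numbers.gr.
MAGNITUDES = {
    "hundred": 100,
    "thousand": 10**3,
    "million": 10**6,
    "billion": 10**9,
    "trillion": 10**12,
}


def split_at_fulcrum(text):
    """Split a text number into (modifier, fulcrum, remainder).

    Assumption: the fulcrum is the magnitude word with the largest
    value; if that word occurs more than once, the rightmost wins.
    Returns (text, None, "") when no magnitude word is present.
    """
    words = text.lower().split()
    # (value, position) pairs for every magnitude word in the input.
    candidates = [(MAGNITUDES[w], i) for i, w in enumerate(words) if w in MAGNITUDES]
    if not candidates:
        return (text, None, "")
    # max() compares by value first, then by position, so ties go rightmost.
    _, pos = max(candidates)
    modifier = " ".join(words[:pos])
    remainder = " ".join(words[pos + 1:])
    return (modifier, words[pos], remainder)


if __name__ == "__main__":
    example = "three hundred million seventy five thousand four hundred and one"
    print(split_at_fulcrum(example))
    # ('three hundred', 'million', 'seventy five thousand four hundred and one')
```

An empty remainder corresponds to the cases where nothing follows the fulcrum.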

Note that no attempt is made to provide a `semantics' for the numbers; they are simply recognised as numerical expressions. However, it should be possible to use the targ values to build up numerical representations of the text expressions if this is required.
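
As a rough indication of what such a numerical representation could look like, the following Python sketch computes a value directly from the text of a number. It does not use the targ values or any part of TTT; it is a standalone illustration under the assumption that the value is (modifier x fulcrum) + remainder, with the fulcrum taken to be the largest magnitude word. All names (text_number_value, SMALL, MAGNITUDES) are hypothetical, and forms such as "a hundred" or fractions are not handled.

```python
# Hypothetical lookup tables for this sketch; not taken from numbers.gr.
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
MAGNITUDES = {
    "hundred": 100,
    "thousand": 10**3,
    "million": 10**6,
    "billion": 10**9,
    "trillion": 10**12,
}


def _value(words):
    """Evaluate a list of number words recursively around the fulcrum."""
    if not words:
        return 0
    candidates = [(MAGNITUDES[w], i) for i, w in enumerate(words) if w in MAGNITUDES]
    if not candidates:
        # Only units and tens remain, e.g. ["ninety", "nine"].
        return sum(SMALL[w] for w in words)
    # Largest magnitude word is the fulcrum; ties resolve to the rightmost.
    magnitude, pos = max(candidates)
    # A bare magnitude word ("thousand") is treated as having modifier 1.
    modifier = _value(words[:pos]) if pos > 0 else 1
    return modifier * magnitude + _value(words[pos + 1:])


def text_number_value(text):
    """Return the integer value of a text number such as 'eighteen thousand'."""
    # Split hyphenated words and drop the connective "and".
    words = [w for w in text.lower().replace("-", " ").split() if w != "and"]
    return _value(words)


if __name__ == "__main__":
    print(text_number_value("eighteen thousand five hundred and ninety-two"))  # 18592
```

Words outside the two tables raise a KeyError, which keeps the sketch short but means it accepts far less than the grammar itself recognises.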
