We tested the TTT system by using it to participate in the Named Entity task of the MUC-7 competition. This involved developing extensive rulesets to mark up named entities (ENAMEX: persons, organisations and locations), temporal expressions (TIMEX: dates and times) and numerical expressions (NUMEX: sums of money, percentages). Detailed information about precisely which expressions should be marked up can be found in the MUC-7 Named Entity Task Definition (http://www.muc.saic.com/proceedings/ne_task.html). In this release we distribute a general purpose grammar for recognising numbers ($TTT/GRAM/sgml/numbers.gr) and slightly modified versions of our MUC-7 grammars for NUMEX and TIMEX expressions ($TTT/GRAM/sgml/numex.gr and $TTT/GRAM/sgml/timex.gr respectively). We use these grammars in the runplain pipeline and its variants, where they are applied in the order numbers.gr, numex.gr, timex.gr. The grammar numbers.gr must apply before the other two, since it identifies numerical expressions that may form part of larger NUMEX or TIMEX expressions. For example, it will identify the string "two and a half" as a number, and this could occur in money expressions such as "two and a half cents" or dates such as "two and a half years ago". In the following sections we provide brief descriptions of the three grammars. More detailed information can be found in the comments in the grammar files themselves.
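The reason for the ordering can be illustrated with a small Python sketch. This is a toy model, not the TTT grammar formalism: the `<NUM>` tag, the two regular expressions and the function names are all assumptions made for the illustration. The second pass only looks for money expressions built on an already marked number, so it finds nothing when run on raw text.

```python
import re

# Toy stand-in for numbers.gr: recognises one text number and plain digits.
NUMBER = re.compile(r"\b(?:two and a half|\d+)\b")
# Toy stand-in for numex.gr: a marked-up number followed by a currency word.
# The <NUM> tag is hypothetical and chosen only for this sketch.
MONEY = re.compile(r"<NUM>[^<]*</NUM> (?:cents|dollars)")

def mark_numbers(text):
    """First pass: wrap every recognised number in <NUM>...</NUM>."""
    return NUMBER.sub(lambda m: f"<NUM>{m.group(0)}</NUM>", text)

def mark_money(text):
    """Second pass: wrap number-plus-currency sequences as money NUMEX."""
    return MONEY.sub(lambda m: f'<NUMEX TYPE="MONEY">{m.group(0)}</NUMEX>', text)

def pipeline(text):
    """Apply the passes in the required order: numbers first, then money."""
    return mark_money(mark_numbers(text))
```

Calling `pipeline("It costs two and a half cents.")` marks the whole money expression, whereas `mark_money` applied directly to the same sentence leaves it unchanged because no number has been marked yet.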
One of the example grammars, $TTT/GRAM/sgml/numbers.gr, contains a fairly extensive analysis of numbers in various forms. Most cases of decimals, and character-based number expressions (`digit numbers' like 1,256,387), are handled quite straightforwardly. Describing `text numbers', such as one thousand and twenty-six, is clearly more complicated. Combinations of digits and text are also possible, as in 30 million, and the grammar does allow various forms of these. A relatively extensive grammar of text fractions (three-fourths, forty-nine hundredths, and suchlike) is also included.
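As a rough illustration of why `digit numbers' are the straightforward case, a single regular expression is enough to cover comma-grouped integers and decimals. The pattern below is a hypothetical Python sketch and is not taken from numbers.gr, which is written in the TTT grammar formalism.

```python
import re

# Hypothetical pattern, not copied from numbers.gr: either digits with
# comma-separated groups of three, or a plain run of digits, each with
# an optional decimal part.
DIGIT_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")

def find_digit_numbers(text):
    """Return every digit-number substring found in text, in order."""
    return DIGIT_NUMBER.findall(text)
```

In a mixed form such as 30 million, this sketch finds only the digit part (30); combining it with the following text number is what the grammar's mixed-form rules are for.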
The comments in the grammar file contain details on what the rules are looking for; however, a short description of the overall assumptions behind the grammar of text numbers is useful here, since the same basic approach also covers the mixed forms (digit and text) and fractions. In general, the grammar should recognise almost any form of text number up to a trillion trillion, so all of the examples below are accepted:
one hundred and sixty-five thousand
three hundred million seventy five thousand four hundred and one
eighteen thousand five hundred and ninety-two
nine hundred and ninety-nine thousand billion
The grammar assumes a central position in each text number, the fulcrum, which characterises the whole number. In the four examples above this is thousand, million, thousand, and billion respectively. The grammar then assumes a `modifying' number on the left of the fulcrum; in the above examples these are:
one hundred and sixty-five
three hundred
eighteen
nine hundred and ninety-nine thousand
The element following the fulcrum is optional; it is absent in the first and fourth examples. In the other two, the parts of the number that follow the fulcrum are:
seventy five thousand four hundred and one
five hundred and ninety-two
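The decomposition into modifier, fulcrum and following element can be sketched in Python as follows. This is only an approximation under a stated assumption: it treats the largest magnitude word in the expression as the fulcrum. The function name and the word list are invented for the sketch and do not correspond to rule names in numbers.gr.

```python
# Magnitude words from largest to smallest. The list is an assumption
# made for this sketch; numbers.gr may organise its rules differently.
MAGNITUDES = ["trillion", "billion", "million", "thousand", "hundred"]

def split_at_fulcrum(text_number):
    """Split a text number into (modifier, fulcrum, remainder)."""
    words = text_number.split()
    for magnitude in MAGNITUDES:
        if magnitude in words:
            # Use the last occurrence, so that a repeated magnitude word
            # leaves the earlier occurrence inside the modifier.
            index = len(words) - 1 - words[::-1].index(magnitude)
            modifier = " ".join(words[:index])
            remainder = " ".join(words[index + 1:])
            return modifier, magnitude, remainder
    # No magnitude word at all: there is no fulcrum, only a small number.
    return "", None, text_number
```

For nine hundred and ninety-nine thousand billion the sketch returns the modifier nine hundred and ninety-nine thousand, the fulcrum billion and an empty remainder.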
Note that no attempt is made to provide a `semantics' for the numbers; they are simply recognised as numerical expressions. However, it should be possible to use the targ values to build up numerical representations of the text expressions if this is required.
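One possible way to build such a numerical representation is sketched below in Python. It works directly on the recognised string rather than on targ values, whose format is not described here, and it assumes the short scale (billion = 10^9, trillion = 10^12). All names are hypothetical. The value is computed recursively as modifier times fulcrum plus the following element.

```python
# Word values for numbers below one hundred (hyphens are split first).
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
# Magnitudes from largest to smallest; short scale is an assumption.
MAGNITUDES = [
    ("trillion", 10**12), ("billion", 10**9), ("million", 10**6),
    ("thousand", 10**3), ("hundred", 10**2),
]

def _value(words):
    """Value of a word list: modifier * fulcrum + remainder."""
    if not words:
        return 0
    for name, factor in MAGNITUDES:
        if name in words:
            index = len(words) - 1 - words[::-1].index(name)
            # A bare magnitude word such as "thousand" counts as one thousand.
            modifier = _value(words[:index]) if index > 0 else 1
            return modifier * factor + _value(words[index + 1:])
    # No magnitude word left: add up the small number words.
    return sum(SMALL[word] for word in words)

def text_number_value(text_number):
    """Convert an already recognised text number to an integer."""
    words = text_number.lower().replace("-", " ").split()
    return _value([word for word in words if word != "and"])
```

The sketch does no validation of its own: it relies on the input having already been accepted as a text number, and it raises KeyError on any word it does not know.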