A TAGGER FOR ESTONIAN

Project Description

Abstract:

After deciding on a suitable tagset for Estonian and a specific text genre to use, we will start training Eric Brill's tagger for Estonian. The preliminary results of our training look promising, so you are welcome to try the online version of this Estonian tagger.

Introduction:

Many wordforms can belong to different wordclasses. For example, the English word run can be a noun or a verb. In most cases, however, these wordforms can be disambiguated by the context:

  (1) We run to the beach.
  (2) The run was a success.

In (1) run is a verb; in (2) it is a noun. If we attach wordclass information, so-called 'tags', to the individual words, the 'tagged' version of these two sentences will look something like this:

  (1) We/PRONOUN run/VERB to/PREP the/ARTICLE beach/NOUN
  (2) The/ARTICLE run/NOUN was/VERB a/ARTICLE success/NOUN
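
For readers who want to experiment, the word/TAG notation above is easy to read into a program. The following Python sketch is only an illustration; the function name parse_tagged is our own choice and not part of the Brill tagger.

```python
def parse_tagged(sentence):
    """Split a sentence in word/TAG notation into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # Split at the last slash, so a word that itself contains '/' still parses.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged_1 = parse_tagged("We/PRONOUN run/VERB to/PREP the/ARTICLE beach/NOUN")
tagged_2 = parse_tagged("The/ARTICLE run/NOUN was/VERB a/ARTICLE success/NOUN")

print(tagged_1[1])  # ('run', 'VERB')
print(tagged_2[1])  # ('run', 'NOUN')
```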

The Brill tagger is a program which can learn context rules from a correctly tagged text, and which can employ the acquired knowledge to predict tags in untagged texts.

The stage in which the tagger learns from tagged texts is called training. In training, the tagger learns which tags are possible for each word, and it spots and remembers statistical correlations. In our example, a tagger will find out that the word run can be either /NOUN or /VERB. It will also spot that when a word like run (or any other word which can be noun or verb, e.g. can) occurs, it will (almost) always be /NOUN after a word which is /ARTICLE, but (almost) always /VERB after a /PRONOUN. The tagger remembers this fact in a rule.
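
The kind of information gathered in training can be sketched in a few lines of Python. This is a simplified illustration, not Brill's actual learning procedure: it merely counts which tags each word receives and which tag it receives after a given preceding tag. The names collect_statistics, lexicon and context, and the "START" marker for the beginning of a sentence, are our own assumptions.

```python
from collections import Counter, defaultdict

# The two example sentences, already tagged by hand.
corpus = [
    [("We", "PRONOUN"), ("run", "VERB"), ("to", "PREP"),
     ("the", "ARTICLE"), ("beach", "NOUN")],
    [("The", "ARTICLE"), ("run", "NOUN"), ("was", "VERB"),
     ("a", "ARTICLE"), ("success", "NOUN")],
]


def collect_statistics(sentences):
    """Count the tags seen for each word, overall and per preceding tag."""
    lexicon = defaultdict(Counter)   # word -> tag counts
    context = defaultdict(Counter)   # (word, previous tag) -> tag counts
    for sentence in sentences:
        prev_tag = "START"           # hypothetical sentence-start marker
        for word, tag in sentence:
            key = word.lower()
            lexicon[key][tag] += 1
            context[(key, prev_tag)][tag] += 1
            prev_tag = tag
    return lexicon, context


lexicon, context = collect_statistics(corpus)
print(sorted(lexicon["run"]))  # ['NOUN', 'VERB']
print(context[("run", "ARTICLE")].most_common(1))  # [('NOUN', 1)]
```

With only two sentences the counts are of course tiny; the point is that run after /ARTICLE was seen as /NOUN, and run after /PRONOUN as /VERB.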

The stage in which a tagger employs the acquired knowledge to predict tags is called tagging. In our run example, the tagger will provisionally tag every run as /VERB, because this is the more frequent case. But whenever run is preceded by an article, the tag will be changed to /NOUN, by virtue of the rule learned as described above.
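
The two steps of tagging, a provisional tag followed by a context rule, can be sketched as follows. The small dictionary of default tags, the single rule and the fallback tag for unknown words are hand-written assumptions for this example; a real tagger would acquire them in training.

```python
# Most frequent tag per word (hand-written here for illustration).
DEFAULT_TAG = {
    "we": "PRONOUN", "run": "VERB", "to": "PREP", "the": "ARTICLE",
    "a": "ARTICLE", "beach": "NOUN", "was": "VERB", "success": "NOUN",
}

# Each rule reads: change from_tag to to_tag if the previous tag is prev_tag.
RULES = [("VERB", "NOUN", "ARTICLE")]


def tag(words):
    """Assign provisional tags, then correct them with the context rules."""
    # Step 1: provisional tagging. Unknown words get NOUN (an assumption).
    tags = [DEFAULT_TAG.get(word.lower(), "NOUN") for word in words]
    # Step 2: apply each rule from left to right.
    for from_tag, to_tag, prev_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(words, tags))


print(tag("We run to the beach".split()))
print(tag("The run was a success".split()))
```

In the first sentence run keeps its provisional tag VERB; in the second it is preceded by an article, so the rule changes it to NOUN.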

As the tagger automatically extracts and learns hundreds of such rules from an ever larger training corpus, the tagging accuracy will increase.

Procedure:

We will first have to decide on a suitable tagset for Estonian. When doing so, one has to address questions like:

How fine-grained should a tagset be? In English, for example, should one use different tags for the definite article the and the indefinite article a? Or, in Estonian, should noun case information be included, or should we use only one tag /NOUN for all cases?
Is there a widely accepted tagset for Estonian? If so, should we also use it, or perhaps alter it in places? Could we use a Finnish tagset and alter it where appropriate?

Secondly, we need to find and select an Estonian corpus (most likely via the WWW), or at least a collection of Estonian texts of the genre we want to use.

Thirdly, we will begin by manually tagging a small training text. We will then automatically tag a second portion of the same corpus, using the rules learnt from the first training text. We will correct the (still numerous) mistakes and retrain the tagger on the resulting larger text. As we repeat this procedure, both the size of our training text and the tagging accuracy will increase step by step.
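
To check whether the accuracy does increase from round to round, the automatically assigned tags can be compared with the manually corrected ones. The Python sketch below is only an illustration; the function name accuracy and the two example tag lists are our own assumptions.

```python
def accuracy(predicted, corrected):
    """Return the fraction of tags that agree with the corrected version."""
    if len(predicted) != len(corrected):
        raise ValueError("both tag sequences must have the same length")
    if not corrected:
        return 0.0
    matches = sum(p == c for p, c in zip(predicted, corrected))
    return matches / len(corrected)


# "The run was a success": the automatic tags contain one mistake (run/VERB).
predicted = ["ARTICLE", "VERB", "VERB", "ARTICLE", "NOUN"]
corrected = ["ARTICLE", "NOUN", "VERB", "ARTICLE", "NOUN"]

print(accuracy(predicted, corrected))  # 0.8
```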

If we reach satisfactory results, we will put an online version of the tagger on the web, as the Zurich group has done for their German Brill Tagger at http://www.ifi.unizh.ch/CL/tagger/. If you know any German, feel free to visit this address to see what we can expect.

Prerequisites:

None. At least half of the participants should know Estonian, of course.

I am looking forward to welcoming you in Zurich.

Gerold Schneider (gschneid@ifi.unizh.ch)

Source: http://www.ifi.unizh.ch