When processing large corpora, it is often necessary to lemmatize the wordforms. This is usually done by a morphological analyser, which always undoes inflection and sometimes also derivation and compounding. The latter is especially useful for German, which exhibits very productive compounding. But when using such a system we notice that lemmatization is a frequent source of ambiguity. Some wordforms genuinely belong to two lemmas of the same part of speech, such as rasten, which can be a form of rasen (to race) or rasten (to rest). Others belong to two lemmas of different word classes, such as meinen, which can be either a possessive pronoun (my) or a verb (to mean). This latter ambiguity can be easily resolved by a part-of-speech tagger or a parser.
The former ambiguity is much harder to deal with. In the case of verbs a parser might be able to distinguish between the two lemmas if they subcategorize for different complements. It gets more difficult for nouns which often do not have clear subcategorization requirements. But our corpus studies show that nouns are a frequent source of ambiguous lemmas, in particular if different segmentations are taken into account. When analysing the Neue Zürcher Zeitung we found that close to 10% of all noun types are assigned more than one lemma by our lemmatizer, the Gertwol system. The following examples show a number of ambiguous German noun forms whose lemma alternatives correspond to very different word meanings.
Abteilungen ==> (die) Abt~ei#lunge OR (die) Ab|teil~ung
Ministern ==> (der) Mini|stern OR (der) Minister
Flugzeuge ==> (der) Flug#zeug~e OR (das) Flug|zeug
Verbrechen ==> (der) Verb#rechen OR (das) Ver|brech~en
Some of these ambiguities may be resolved by using the gender information. But for many others this criterion cannot be employed since the variants show the same gender. We therefore investigated a method to use the segmentation information to decide on the correct lemma for German nouns. Our method relies on the segmentation information of the Gertwol system [Gertwol 94]. Gertwol distinguishes four types of segmentation:
Hoffnungsträger ==> Hoffn~ung\s#träge     noun derived from adjective
                    Hoffn~ung\s#träg~er   regular noun
Lohneinbussen ==> Lohn#ein#bus = 8 points
                  Lohn#ein|buß~e = 7 points
Geldwäschereibestimmung ==> Geld#wäsch~e#reib~e#stimm~ung = 15 points
                            Geld#wäsch~er#eib~e#stimm~ung = 15 points
                            Geld#wäsch~er~ei#be|stimm~ung = 13 points
It is important that a strong composition boundary counts more than the sum of a weak boundary and a derivation boundary, since often these lead to alternative lemmas. When manually checking the result on 400 ambiguous nouns we found that these two rules lead to the correct lemma for around 85% of the ambiguously segmented noun lemmas. But we noticed that some of the remaining errors were due to some rarely used morphemes. Consider the following example where the rarely used word Stag (a hemp rope) is a possible compound segment.
Arbeitstag ==> Arbeit\s#tag = 4 points
               Arbeit#stag = 4 points
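The point totals in the examples above can be reproduced with a short sketch. The weights used here (4 for `#`, 2 for `|`, 1 for `~`, 0 for `\`) are not quoted from the Gertwol documentation; they are inferred from the example totals and from the requirement that a strong composition boundary outweighs a weak boundary plus a derivation boundary. The assumption that the reading with the fewest points wins is likewise read off the examples. The function names are hypothetical.

```python
# Boundary weights inferred from the example totals (an assumption,
# not values taken from the Gertwol documentation).
WEIGHTS = {
    "#": 4,   # strong composition boundary
    "|": 2,   # weak composition boundary
    "~": 1,   # derivation boundary
    "\\": 0,  # symbol before a linking element; adds nothing in the examples
}


def score(segmentation: str) -> int:
    """Sum the weights of all boundary symbols in a segmented lemma."""
    return sum(WEIGHTS.get(ch, 0) for ch in segmentation)


def best_readings(candidates):
    """Return every candidate that shares the lowest score (ties are kept)."""
    scored = [(score(c), c) for c in candidates]
    lowest = min(s for s, _ in scored)
    return [c for s, c in scored if s == lowest]
```

Under these assumptions, `best_readings` keeps only `Lohn#ein|buß~e` for Lohneinbussen, but returns both readings of Arbeitstag, since each contains exactly one strong boundary. Such ties are the cases that need a further criterion.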
`Bad' segment | Preferred segment | Example |
buchs | buch | Liederbuchs ==> Lieder#buch |
port | sport | Motorsport ==> Motor#sport |
reis | reis~e | Ferienreise ==> Ferien#reis~e |
samt | amt | Arbeitsamt ==> Arbeit\s#amt |
stag | tag | Arbeitstag ==> Arbeit\s#tag |
tand | stand | Wohlstand ==> Wohl#stand |
tuba | stube | Badestube ==> Bad\e#stube |
Obviously the preference list needs to be adapted to the text type. For our newspaper corpus we found it advantageous to prefer Reis~e (trip) over Reis (rice), whereas for a cookbook the opposite will be true.
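One way to apply such a preference list is as a filter over the candidate readings, sketched below. The set of dispreferred segments is copied from the table for the newspaper corpus; the function names, the splitting at `#` and `|`, and the fallback when every candidate contains a bad segment are assumptions made for illustration.

```python
import re

# Dispreferred segments from the table above (newspaper corpus).
# A cookbook corpus would need a different set, e.g. without "reis".
BAD_SEGMENTS = {"buchs", "port", "reis", "samt", "stag", "tand", "tuba"}


def segments(segmentation: str):
    """Split a segmented lemma at composition boundaries (# and |), lower-cased."""
    return [s.lower() for s in re.split(r"[#|]", segmentation)]


def drop_dispreferred(candidates, bad=BAD_SEGMENTS):
    """Remove candidates that contain a bad segment, unless none would remain."""
    kept = [c for c in candidates if not any(s in bad for s in segments(c))]
    return kept or list(candidates)
```

For the two readings of Arbeitstag, the filter removes `Arbeit#stag` and keeps `Arbeit\s#tag`. Passing the list as a parameter keeps the text-type dependence explicit: the same function can be called with a different `bad` set for another corpus.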
As a refinement we will use frequency information of the segments to improve the preference list. That is, we will check for all compound segments how often they occur in a given corpus and build the preference list automatically according to the most frequent segments.
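Since this refinement is only planned, the following is one possible reading of it rather than the implemented method. It counts compound segments over a list of segmented lemmas and, among tied candidates, prefers the one whose rarest segment is most frequent. The function names, the choice of the rarest-segment criterion, and the stripping of everything after `\` as a linking element are all assumptions.

```python
import re
from collections import Counter


def segments(lemma: str):
    """Split at # and |, drop a trailing linking element (after a backslash), lower-case."""
    parts = re.split(r"[#|]", lemma)
    return [re.sub(r"\\.*$", "", p).lower() for p in parts]


def segment_counts(segmented_lemmas):
    """Count how often each compound segment occurs in the given lemmas."""
    counts = Counter()
    for lemma in segmented_lemmas:
        counts.update(segments(lemma))
    return counts


def prefer_by_frequency(candidates, counts):
    """Pick the candidate whose least frequent segment has the highest count."""
    def rarest(candidate):
        return min(counts[s] for s in segments(candidate))
    return max(candidates, key=rarest)
```

With counts taken from a corpus in which tag is a common segment and stag is unseen, `prefer_by_frequency` would select `Arbeit\s#tag` over `Arbeit#stag` without any hand-written list.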