Choosing the right lemma when analysing German nouns

Martin Volk
University of Zurich
Department of Computer Science
Computational Linguistics Group
Winterthurerstr. 190, CH-8057 Zurich
volk@ifi.unizh.ch

Introduction

When processing large corpora, it is often necessary to lemmatize the wordforms. This is usually done by a morphological analyser, which always undoes inflection and sometimes also derivation and compounding. Undoing compounding is especially useful for German, where compounding is very productive. But when using such a system we notice that lemmatisation is a frequent source of ambiguities. Some wordforms genuinely belong to two lemmas of the same part of speech, such as rasten, which can be a form of rasen (to race) or of rasten (to rest). Others belong to two lemmas of different word classes, such as meinen, which can be a form of the possessive pronoun mein (my) or of the verb meinen (to mean). This latter ambiguity can be easily resolved by a part-of-speech tagger or a parser.

The former ambiguity is much harder to deal with. In the case of verbs, a parser might be able to distinguish between the two lemmas if they subcategorize for different complements. It gets more difficult for nouns, which often do not have clear subcategorization requirements. Yet our corpus studies show that nouns are a frequent source of ambiguous lemmas, in particular if different segmentations are taken into account. When analysing the Neue Zürcher Zeitung we found that close to 10% of all noun types are assigned more than one lemma by our lemmatizer, the Gertwol system. The following examples show a number of ambiguous German noun forms whose lemma alternatives correspond to very different word meanings.

Abteilungen ==>  (die) Abt~ei#lunge  OR (die) Ab|teil~ung
Ministern   ==>  (der) Mini|stern    OR (der) Minister
Flugzeuge   ==>  (der) Flug#zeug~e   OR (das) Flug|zeug
Verbrechen  ==>  (der) Verb#rechen   OR (das) Ver|brech~en

Some of these ambiguities may be resolved by using the gender information. But for many others this criterion cannot be employed since the variants show the same gender. We therefore investigated a method that uses segmentation information to decide on the correct lemma for German nouns. The method relies on the segmentation provided by the Gertwol system [Gertwol 94]. Gertwol distinguishes four types of segmentation:
  1. "Elements that can occur as independent words are separated with a strong boundary character (#). Verb stems occurring as first elements are still an exception to this rule. Examples: Berg#wiese, Schreib#maschine"
  2. "Prepositions, prefixes and non-independent elements are separated with a weak boundary character (|)." Examples: Vor|schule, geo|morpho|log~isch. The weak boundary is also used before non-independent second parts of the compounds. Examples: Mensch\en|recht~ler, zwei|jähr~ig. Half-suffixes ("a productive word whose meaning in compounds has changed from its meaning as an independent word") are also separated with a weak boundary. Examples: Laub|werk but: Nach|schlag\e#werk.
  3. "Linking elements may occur before the boundaries of compound words or suffixes. They are separated with a backslash (\). On the left-hand side of the linking element is the stem of the word." Examples: Büch\er#bus, Fried~e\ns#freund
  4. Derivational suffixes are separated by a tilde character (~). Examples: Ab|teil~ung, Hoffn~ung\s#träg~er

Building on these segmentations we have developed a disambiguation method that determines the correct lemma for around 90% of the ambiguously lemmatized nouns. It does not rely on tagging or parsing but only on the internal structure of the competing lemmas. The method is therefore well suited for shallow corpus investigations.
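
The four boundary markers can be pulled apart mechanically. The following Python sketch is only an illustration and not part of the original program; the names split_segments and count_boundaries are assumptions introduced here. It splits a segmented lemma into its morphemes and names the boundary between each pair.

```python
import re
from collections import Counter

# Boundary characters as described in the four segmentation types above.
BOUNDARY_NAMES = {
    "#": "strong",       # between elements that can occur as independent words
    "|": "weak",         # prepositions, prefixes, non-independent elements
    "\\": "linking",     # introduces a linking element
    "~": "derivation",   # introduces a derivational suffix
}

def split_segments(lemma):
    """Return (morphemes, boundaries) for a segmented lemma string.

    The capture group in the pattern makes re.split keep the boundary
    characters, so morphemes and boundaries alternate in the result.
    """
    parts = re.split(r"([#|\\~])", lemma)
    morphemes = parts[0::2]
    boundaries = [BOUNDARY_NAMES[p] for p in parts[1::2]]
    return morphemes, boundaries

def count_boundaries(lemma):
    """Count how often each boundary type occurs in a segmented lemma."""
    return Counter(split_segments(lemma)[1])
```

For example, split_segments(r"Fried~e\ns#freund") yields the morphemes Fried, e, ns and freund together with the boundary types derivation, linking and strong.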

The disambiguation method

The disambiguation method is based on the observation that in most cases the noun lemma with the least internal complexity is the preferred lemma. Complexity depends on strong and weak composition boundaries as well as on derivation boundaries. If, for example, one lemma has a strong composition boundary and the other has a weak boundary then the lemma with the weak boundary is preferred. The linking elements do not influence the complexity. The disambiguation method works in three steps.
  1. Gertwol distinguishes regular nouns and derived nouns, where derived nouns are marked if they are derived from adjectives (e.g. das Gute) or participles (das Geschehene, der Sehende). Our first rule: if at least one of the competing lemmas of a noun is marked by Gertwol as a regular noun, discard all lemmas marked as derived nouns. Example:
    Hoffnungsträger ==>
    	Hoffn~ung\s#träge     noun derived from adjective 
    	Hoffn~ung\s#träg~er   regular noun
    
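  2. Gertwol distinguishes strong and weak composition boundaries as well as derivational boundaries. These are counted for every lemma according to the following scores: a strong composition boundary (#) counts 4 points, a weak composition boundary (|) 2 points, and a derivation boundary (~) 1 point; linking elements (\) do not count. Our second rule: the lemma with the smallest overall point score is the best lemma. Examples: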
    Lohneinbussen ==>
    	Lohn#ein#bus     = 8 points
    	Lohn#ein|buß~e   = 7 points
    	
    Geldwäschereibestimmung ==>
    	Geld#wäsch~e#reib~e#stimm~ung   = 15 points
    	Geld#wäsch~er#eib~e#stimm~ung   = 15 points
    	Geld#wäsch~er~ei#be|stimm~ung   = 13 points
    
    It is important that a strong composition boundary counts more than the sum of a weak boundary and a derivation boundary, since a strong boundary in one lemma often competes with a weak boundary plus a derivation boundary in the alternative (as in Lohn#ein#bus versus Lohn#ein|buß~e). When manually checking the result on 400 ambiguous nouns we found that these two rules lead to the correct lemma for around 85% of the ambiguously segmented noun lemmas. But we noticed that some of the remaining errors were due to rarely used morphemes. Consider the following example, where the rarely used word Stag (a hemp rope) is a possible compound segment and both lemmas receive the same score.
    Arbeitstag ==>
    	Arbeit\s#tag   = 4 points
    	Arbeit#stag    = 4 points
    
  3. We therefore use preferences to exclude the most exotic compound segments. With a list of 14 preferences our method improved to 90% correct lemmas for our newspaper corpus. Some examples from the preference list:

    `Bad' segment   Preferred segment   Example
    buchs           buch                Liederbuchs ==> Lieder#buch
    port            sport               Motorsport  ==> Motor#sport
    reis            reis~e              Ferienreise ==> Ferien#reis~e
    samt            amt                 Arbeitsamt  ==> Arbeit\s#amt
    stag            tag                 Arbeitstag  ==> Arbeit\s#tag
    tand            stand               Wohlstand   ==> Wohl#stand
    tuba            stube               Badestube   ==> Bad\e#stube

    Obviously the preference list needs to be adapted to the text type. For our newspaper corpus we found it advantageous to prefer Reis~e (trip) over Reis (rice), whereas for a cookbook the opposite will be true.
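
The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original Perl program: the weights 4, 2 and 1 are the values that reproduce the point totals in the examples above; the input format (pairs of segmented lemma and a regular-noun flag) and all function names are hypothetical; and the preference filter is applied before scoring, an ordering assumed here so that a preference such as reis~e over reis can override the point score.

```python
import re

# Assumed weights, chosen to reproduce the point totals in the examples.
# The linking character "\" is deliberately absent: it scores nothing.
WEIGHTS = {"#": 4, "|": 2, "~": 1}

# The preference entries listed in the table above: bad -> preferred segment.
PREFERENCES = {
    "buchs": "buch", "port": "sport", "reis": "reis~e", "samt": "amt",
    "stag": "tag", "tand": "stand", "tuba": "stube",
}

def score(lemma):
    """Sum the boundary weights of a segmented lemma (rule 2)."""
    return sum(WEIGHTS.get(ch, 0) for ch in lemma)

def compound_segments(lemma):
    """Lower-cased parts between strong and weak composition boundaries."""
    return [part.lower() for part in re.split(r"[#|]", lemma)]

def apply_preferences(lemmas, preferences):
    """Drop a lemma with a 'bad' segment if a competitor offers the
    preferred segment (step 3). Never discards every candidate."""
    present = {seg for lemma in lemmas for seg in compound_segments(lemma)}
    kept = [
        lemma for lemma in lemmas
        if not any(preferences.get(seg) in present
                   for seg in compound_segments(lemma))
    ]
    return kept or lemmas

def choose_lemma(readings, preferences=PREFERENCES):
    """readings: list of (segmented_lemma, is_regular_noun) pairs.

    Returns the list of surviving lemmas; more than one element means
    the form is still ambiguous.
    """
    # Rule 1: if any reading is a regular noun, discard derived nouns.
    if any(regular for _, regular in readings):
        readings = [r for r in readings if r[1]]
    lemmas = [lemma for lemma, _ in readings]
    # Step 3, applied first in this sketch (an assumption, see above).
    lemmas = apply_preferences(lemmas, preferences)
    # Rule 2: keep the lemma(s) with the smallest point score.
    best = min(score(lemma) for lemma in lemmas)
    return [lemma for lemma in lemmas if score(lemma) == best]
```

With the two readings of Arbeitstag, both marked as regular nouns, choose_lemma returns only Arbeit\s#tag; with an empty preference list both readings survive, since each scores 4 points.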

Conclusion

We have developed a method to find the correct noun lemma in cases of segmentation ambiguity. The method is based on some simple heuristics which use the Gertwol composition and derivation boundaries. The program is written in Perl and can be used as a filter on the Gertwol output, thus reducing the ambiguity for further processing steps.

As a refinement we will use frequency information of the segments to improve the preference list. That is, we will check for all compound segments how often they occur in a given corpus and build the preference list automatically according to the most frequent segments.
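
One way this refinement might look is sketched below in Python. It is an assumption-laden illustration rather than a description of existing work: the function names, the idea of counting segments over a list of already lemmatized nouns, and the restriction to competing lemmas that differ in exactly one segment are all choices made here for brevity.

```python
import re
from collections import Counter
from itertools import combinations

def compound_segments(lemma):
    """Lower-cased compound segments of a segmented lemma.

    Splits at strong (#) and weak (|) boundaries and removes linking
    elements, i.e. a backslash and the letters that follow it.
    """
    parts = re.split(r"[#|]", lemma)
    return [re.sub(r"\\[^#|~\\]*", "", part).lower() for part in parts]

def count_segments(lemmas):
    """Count how often each compound segment occurs in a list of lemmas."""
    return Counter(seg for lemma in lemmas for seg in compound_segments(lemma))

def build_preferences(ambiguous_groups, segment_freq):
    """Derive a preference list (bad segment -> preferred segment).

    ambiguous_groups: lists of competing segmented lemmas for one form.
    segment_freq: mapping from segment to its corpus frequency.
    Only pairs that differ in exactly one segment are considered.
    """
    preferences = {}
    for group in ambiguous_groups:
        for a, b in combinations(group, 2):
            seg_a, seg_b = set(compound_segments(a)), set(compound_segments(b))
            only_a, only_b = seg_a - seg_b, seg_b - seg_a
            if len(only_a) != 1 or len(only_b) != 1:
                continue
            (x,), (y,) = only_a, only_b
            freq_x, freq_y = segment_freq.get(x, 0), segment_freq.get(y, 0)
            if freq_x < freq_y:
                preferences[x] = y
            elif freq_y < freq_x:
                preferences[y] = x
    return preferences
```

Given segment counts in which tag is frequent and stag does not occur, the competing lemmas Arbeit\s#tag and Arbeit#stag produce the preference stag -> tag, matching the manually compiled entry.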

References

[Gertwol 94]
Mariikka Haapalainen and Ari Majorin: GERTWOL: Ein System zur automatischen Wortformerkennung deutscher Wörter. Lingsoft, Inc. 1994.