Any computer system for natural language processing has to cope with the problem of ambiguity. If the system is meant to extract precise information from a text, these ambiguities must be resolved. One of the most frequent ambiguities arises from the attachment of prepositional phrases (PPs): a PP that follows a noun can be attached either to the noun or to the verb. In this book we propose a method to resolve such ambiguities in German sentences, based on cooccurrence values derived from a shallow-parsed corpus.
Corpus processing is therefore an important preliminary step. We introduce the modules for proper name recognition and classification, Part-of-Speech tagging, lemmatization, phrase chunking, and clause boundary detection. We processed a corpus of more than 5 million words from the Computer-Zeitung, a weekly computer science newspaper. All information compiled through corpus processing is added to the corpus as annotation.
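The annotation layers produced by these modules can be pictured as attributes attached to each token. The following sketch is only an illustration of such a layered annotation; the field names and label inventory are assumptions, not the book's actual data format:

```python
from dataclasses import dataclass

@dataclass
class Token:
    """One corpus token with the annotation layers described above.
    All field names and label values are illustrative assumptions."""
    form: str        # surface word form
    pos: str         # Part-of-Speech tag
    lemma: str       # result of lemmatization
    ne_class: str    # proper name class (e.g. "COMPANY"), "" if none
    chunk: str       # phrase chunk label (e.g. "NP", "PP")
    clause_id: int   # clause the token belongs to (clause boundary detection)

# Example: one annotated token from a hypothetical sentence
t = Token(form="Vertrag", pos="NN", lemma="Vertrag",
          ne_class="", chunk="NP", clause_id=1)
```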
In addition to the training corpus, we prepared a 3000-sentence test corpus with manually annotated syntax trees. From this treebank we extracted over 4000 test cases with ambiguously positioned PPs for the evaluation of the disambiguation method. We also extracted test cases from the NEGRA treebank in order to check the domain dependency of the method.
The disambiguation method is based on the idea that frequent cooccurrence of two words in a corpus indicates binding strength. In particular, we measure the cooccurrence strength between nouns (N) and prepositions (P) on the one hand and between verbs (V) and prepositions on the other. The competing cooccurrence values of N+P versus V+P are compared to decide whether to attach a prepositional phrase (PP) to the noun or to the verb. A language with variable word order like German poses special problems for determining the cooccurrence value between verb and preposition, since the verb may occur at different positions in a sentence. We tackle this problem with the help of a clause boundary detector to delimit the verb's access range.
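The core comparison can be sketched as follows. This is a minimal illustration, assuming that cooccurrence strength is estimated as the relative frequency freq(word, preposition) / freq(word); the book's actual measure and thresholds may differ:

```python
from collections import Counter

def cooc(pair_counts, word_counts, word, prep):
    """Cooccurrence strength of a word with a preposition,
    estimated here as freq(word, prep) / freq(word)."""
    if word_counts[word] == 0:
        return 0.0
    return pair_counts[(word, prep)] / word_counts[word]

def attach(pair_counts, word_counts, noun, verb, prep, noun_factor=1.0):
    """Compare the (optionally factor-adjusted) N+P value against
    the V+P value and return the winning attachment site."""
    n_score = noun_factor * cooc(pair_counts, word_counts, noun, prep)
    v_score = cooc(pair_counts, word_counts, verb, prep)
    return "noun" if n_score >= v_score else "verb"
```

With toy counts where "Vertrag mit" is relatively more frequent than "schließen mit", the PP would be attached to the noun.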
Still, the cooccurrence values for V+P are in general much higher than those for N+P. We counterbalance this inequality with a noun factor, which is computed from the general tendency of all prepositions to attach to verbs rather than to nouns. We show that this noun factor leads to the best attachment accuracy.
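One plausible way to compute such a factor is as the ratio of the overall V+P cooccurrence strength to the overall N+P cooccurrence strength across all prepositions. The following sketch is an assumption about the computation, not the book's exact formula:

```python
def noun_factor(np_pair_counts, vp_pair_counts, noun_total, verb_total):
    """Global ratio of verb-preposition to noun-preposition
    cooccurrence strength. Multiplying noun scores by this factor
    counterbalances the general bias of prepositions toward verbs.
    (Illustrative formula; the book's computation may differ.)"""
    n_strength = sum(np_pair_counts.values()) / noun_total
    v_strength = sum(vp_pair_counts.values()) / verb_total
    return v_strength / n_strength
```

If, say, prepositions cooccur with verbs five times as strongly as with nouns overall, every N+P value is boosted by a factor of 5 before the comparison.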
The method for determining the cooccurrence values is gradually refined by distinguishing sure and possible attachments, different verb readings, idiomatic and non-idiomatic usage, deverbal versus regular nouns, as well as by using the head noun within the prepositional phrase. In parallel we increase the coverage of the method by using various clustering techniques: lemmatization, the core of compounds, proper name classes, and the GermaNet thesaurus.
In order to evaluate the method we used both test sets. We also varied the training corpus to determine its influence on the cooccurrence values. Finally, as the largest conceivable corpus, we tried cooccurrence frequencies obtained from the WWW.
Finally, we compared our method to another unsupervised method and to two supervised methods for PP attachment disambiguation. We show that intertwining our cooccurrence-based method with the supervised Back-off model leads to the best results: 81% correct attachments for the Computer-Zeitung test set.
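One simple way to intertwine the two approaches is a cascade: use the supervised Back-off model's decision where it has seen the relevant words in its annotated training data, and fall back to the unsupervised cooccurrence decision otherwise. This sketch is an assumed simplification; the actual interleaving in the book may be more fine-grained:

```python
def decide(verb, noun, prep, supervised_table, cooc_decision):
    """Prefer the supervised Back-off estimate when the triple is
    covered by the annotated training data; otherwise fall back to
    the unsupervised cooccurrence-based decision.
    (Illustrative cascade, not the book's exact procedure.)"""
    key = (verb, noun, prep)
    if key in supervised_table:
        return supervised_table[key]
    return cooc_decision(verb, noun, prep)
```

Combining a high-precision supervised table with a high-coverage unsupervised fallback is what yields the reported 81% accuracy on the Computer-Zeitung test set.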