Evaluation of Machine Translation Systems

People involved:
Martin Volk, Britta Heidemann
People previously involved:
Melchior Amgarten, Toni Arnold, Catherine Ayer, Geneviève Härdi, Marc Luder, Mattia Mastropietro, Dominic A. Merz, Angela Niederbäumer, Jeannette Roth, Gerold Schneider, Mirjam Oberholzer, Wei Peng
Duration: early 1996 to 1999
Funding: no external funding; MT systems donated by the distributors


In spring 1996 we (a group of CL students under the guidance of Dr. Martin Volk) started to evaluate commercial Machine Translation (MT) systems. Our aim was to find out whether MT systems in the low price range (< SFr. 2000) could be used in a real-life situation, that is, whether such a system would increase a translator's productivity. As a first step we concentrated on systems that translate between English and German. We evaluated 6 systems and presented our findings at a seminar in late September 1996.

The evaluation consisted of compiling a list of criteria for self-evaluation and running three experiments with external volunteers, mostly students from the local interpreter school. The list of criteria covered technical, linguistic and ergonomic issues. Within the linguistic criteria, much care was taken to measure the lexical and grammatical coverage of the systems.

Lexicon size is an important factor for a successful translation. Unfortunately, only a few distributors state the lexicon size of their system, and even if they do, their counting methods may differ considerably. It was therefore necessary to develop a method for a more objective measurement. The group worked with adjectives, nouns, and verbs extracted from a lexical database. The words were compiled from three frequency classes: the 100 most frequent words, 100 words with a medium frequency (25 or less), and 100 words with frequency 1. Translating the words from these classes with every system gave a clear picture of each system's lexical coverage. Recognition of the words from the low frequency class ranged between 20% and 82%.
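
The sampling and scoring steps described above can be sketched roughly as follows. This is a minimal illustration and not the group's actual tooling: the function names, the toy frequency list and the toy lexicon are hypothetical. It also assumes that the medium class excludes words with frequency 1, and that a system signals an unknown word by returning nothing.

```python
import random

def frequency_classes(freq, n=100, medium_max=25, seed=0):
    """Draw three word samples from a word -> corpus frequency mapping."""
    rng = random.Random(seed)
    # Most frequent words first; ties broken alphabetically for repeatability.
    by_freq = sorted(freq, key=lambda w: (-freq[w], w))
    high = by_freq[:n]
    taken = set(high)
    # Assumption: "medium" means 1 < frequency <= medium_max.
    medium_pool = [w for w in by_freq
                   if 1 < freq[w] <= medium_max and w not in taken]
    low_pool = [w for w in by_freq if freq[w] == 1 and w not in taken]
    return {
        "high": high,
        "medium": rng.sample(medium_pool, min(n, len(medium_pool))),
        "low": rng.sample(low_pool, min(n, len(low_pool))),
    }

def coverage(words, translate):
    """Share of words for which translate() returns something other than None."""
    if not words:
        return 0.0
    return sum(1 for w in words if translate(w) is not None) / len(words)

# Toy data standing in for a lexical database and for one MT system's lexicon.
toy_freq = {"haus": 500, "gehen": 300, "schnell": 20,
            "tisch": 10, "zaudern": 1, "fahl": 1}
toy_lexicon = {"haus": "house", "gehen": "go",
               "schnell": "fast", "zaudern": "hesitate"}

classes = frequency_classes(toy_freq, n=2)
for name, words in classes.items():
    print(name, coverage(words, toy_lexicon.get))
```

Running the same word samples through every system keeps the comparison independent of how each distributor counts lexicon entries.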

Grammatical coverage is often directly related to translation quality. Therefore grammatical coverage was evaluated with a test suite. It contained 120 sentences displaying the most important syntactic phenomena for translation from German to English. Coverage of these phenomena varied between 50% and 80%.
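
A coverage figure of this kind could be tallied as in the sketch below. The text does not say how the sentences were scored, so this assumes that each test sentence is tagged with one syntactic phenomenon and manually judged as handled correctly or not. The function name and the sample phenomena are hypothetical.

```python
from collections import defaultdict

def phenomenon_coverage(judgments):
    """Tally pass rates from (phenomenon, handled_correctly) pairs.

    Returns a dict of per-phenomenon pass rates and the overall pass rate.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for phenomenon, ok in judgments:
        totals[phenomenon] += 1
        if ok:
            passed[phenomenon] += 1
    per_phenomenon = {p: passed[p] / totals[p] for p in totals}
    overall = sum(passed.values()) / sum(totals.values()) if totals else 0.0
    return per_phenomenon, overall

# Hypothetical judgments for one system on four test sentences.
sample_judgments = [
    ("passive", True),
    ("passive", False),
    ("relative clause", True),
    ("separable verb prefix", False),
]
per_phenomenon, overall = phenomenon_coverage(sample_judgments)
print(per_phenomenon, overall)
```

Reporting the per-phenomenon rates alongside the overall figure shows which constructions a system fails on, which a single percentage hides.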

Three experiments were performed to judge

  1. the information content of the translations. People were asked to answer questions on the basis of the translations.
  2. the user-friendliness. People were asked to acquaint themselves with a system, translate a given text, add items to the system lexicon and post-edit the outcome. Afterwards they were interviewed to judge the system's performance.
  3. the translation quality. People were asked to rank the translations from different systems all produced from the same source text.
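
For the third experiment, one simple way to combine the volunteers' rankings is the mean rank per system. The text does not state how the rankings were aggregated, so the sketch below is only an assumption, and the function and system names are hypothetical.

```python
from statistics import mean

def mean_ranks(rankings):
    """Average the rank each system received; lower means better.

    rankings: list of lists, each ordering system names from best to worst.
    Returns (mean_rank, system) pairs sorted from best to worst.
    """
    positions = {}
    for ranking in rankings:
        for place, system in enumerate(ranking, start=1):
            positions.setdefault(system, []).append(place)
    return sorted((mean(places), system) for system, places in positions.items())

# Three hypothetical volunteers ranking three systems on the same source text.
votes = [
    ["System A", "System B", "System C"],
    ["System B", "System A", "System C"],
    ["System A", "System C", "System B"],
]
for score, system in mean_ranks(votes):
    print(system, round(score, 2))
```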

Since then we have continued our tests with the latest versions of all (serious) PC-based machine translation systems for the language pairs German - English and German - French. We have refined our methods for testing lexical coverage and extended our test suite for grammar tests. The results of the latest evaluation (together with detailed system descriptions) are available in a documentation brochure.

Author: Martin Volk
Date of last modification: 20.01.02