Excursus: BLEU Score


10.6. Excursus: BLEU Score

Automatic Evaluation

Idea

Automatic evaluation measures the quality of a machine translation by comparing it with one or, preferably, several human reference translations.

Advantages

Human evaluation is costly and slow; automatically computing a quality metric is cheap and fast.

Definition 10.6.1 (Bilingual Evaluation Understudy (BLEU)). One of the most important metrics currently used for automatic bilingual evaluation is the BLEU score.

BLEU: Unigram Precision

MT1: It is a guide to action which ensures that the military always obeys the commands of the party.
MT2: It is to insure the troops forever hearing the activity guidebook that party direct.

HT1: It is a guide to action that ensures that the military will forever heed Party commands.
HT2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
HT3: It is the practical guide for the army always to heed the directions of the party.

Definition 10.6.2 (Unigram precision P₁). The unigram precision (token precision) of a translation candidate measures what share of the candidate's token occurrences also appear in any of the reference translations: $P_1 = \frac{C}{N}$

N = number of tokens in the candidate; C = number of candidate tokens that appear in a reference translation

Unigram Evaluation

Question

What is P₁ for MT1 and for MT2?

Token occurrences (candidate tokens that appear in a reference translation, sorted)

MT1: . a action always commands ensures guide is it military of party that the the the to which
MT2: . forever is it party that the the to
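
The question above can be checked with a minimal sketch. The tokenizer is an assumption made here, not part of the script: it lowercases the text and treats the final period as a token of its own, which is what the sorted token lists suggest. The names `tokenize` and `unigram_precision` are illustrative.

```python
import re

def tokenize(text):
    # Assumption: lowercase everything, punctuation marks are separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def unigram_precision(candidate, references):
    """Return (C, N) for the unclipped unigram precision P_1 = C / N."""
    cand_tokens = tokenize(candidate)
    ref_vocab = set()
    for ref in references:
        ref_vocab.update(tokenize(ref))
    # C counts every candidate token that occurs in at least one reference.
    matched = [tok for tok in cand_tokens if tok in ref_vocab]
    return len(matched), len(cand_tokens)

MT1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
MT2 = "It is to insure the troops forever hearing the activity guidebook that party direct."
REFS = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]

for name, cand in [("MT1", MT1), ("MT2", MT2)]:
    c, n = unigram_precision(cand, REFS)
    print(f"{name}: P1 = {c}/{n} = {c / n:.3f}")
```

Under these tokenization assumptions MT1 reaches 18/19 (only "obeys" is missing from the references) and MT2 reaches 9/15; a tokenizer that drops punctuation or keeps case would shift these counts.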

The Need for Clipping

The repetition problem

Candidate: the the the the the the the
HT1: the cat sat on the mat
HT2: there is a cat on the mat

What is P₁ of the “idiotic” candidate? $P_1 = \frac{7}{7} = 1$

Clipping the Candidate Occurrences

A token may be counted at most as often as it occurs in any single reference translation. What is P₁ of the candidate with clipping?
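
The clipping rule can be sketched as follows. The function name `clipped_unigram_precision` is illustrative, and the sentences are split on whitespace only, which is sufficient for this toy example.

```python
from collections import Counter

def clipped_unigram_precision(candidate_tokens, reference_token_lists):
    """Return (clipped matches, N) for the unigram precision with clipping."""
    cand_counts = Counter(candidate_tokens)
    # For each token, the highest count observed in any single reference.
    max_ref_counts = Counter()
    for ref in reference_token_lists:
        for tok, cnt in Counter(ref).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], cnt)
    # A candidate token is credited at most max_ref_counts[tok] times.
    clipped = sum(min(cnt, max_ref_counts[tok]) for tok, cnt in cand_counts.items())
    return clipped, len(candidate_tokens)

candidate = "the the the the the the the".split()
ht1 = "the cat sat on the mat".split()
ht2 = "there is a cat on the mat".split()

print(clipped_unigram_precision(candidate, [ht1, ht2]))  # (2, 7)
```

Because "the" occurs at most twice in a single reference (HT1), only two of the seven candidate tokens are credited, so P₁ drops from 7/7 to 2/7.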

Unigrams, Bigrams, Trigrams and 4-grams

Comparing longer stretches of text

Which n-grams from the reference texts can be found in the MT candidate?

MT1: It is a guide to action which ensures that the military always obeys the commands of the party.

HT1: It is a guide to action that ensures that the military will forever heed Party commands.
HT2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
HT3: It is the practical guide for the army always to heed the directions of the party.
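
One way to explore this question is to enumerate the candidate's n-grams and keep those that occur in some reference. This is a sketch under the same assumed tokenization as for the token lists (lowercased, period as its own token); `ngrams` and `shared_ngrams` are illustrative names, and no clipping is applied here.

```python
import re

def tokenize(text):
    # Assumption: lowercase everything, punctuation marks are separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def shared_ngrams(candidate, references, n):
    """Candidate n-grams that also occur in at least one reference."""
    ref_ngrams = set()
    for ref in references:
        ref_ngrams.update(ngrams(tokenize(ref), n))
    return [g for g in ngrams(tokenize(candidate), n) if g in ref_ngrams]

MT1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
REFS = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]

for n in range(1, 5):
    hits = shared_ngrams(MT1, REFS, n)
    total = len(ngrams(tokenize(MT1), n))
    print(f"{n}-grams: {len(hits)}/{total}", [" ".join(g) for g in hits])
```

The share of matching n-grams shrinks as n grows, which is why longer n-grams reward locally fluent word sequences rather than isolated word matches.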

Geometric Mean of the N-gram Precisions

The precision values of a candidate's 1- to 4-grams are combined by taking their geometric mean: $P = (P_1 \times P_2 \times P_3 \times P_4)^{1/4}$
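
A small sketch of the geometric mean, computed in log space. The four precision values below are hypothetical numbers chosen only for illustration; they are not taken from the examples in this section.

```python
import math

def geometric_mean(precisions):
    """Geometric mean of n-gram precisions; 0 if any precision is 0."""
    if any(p == 0 for p in precisions):
        # A single zero precision forces the whole product to zero.
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Hypothetical values for P1..P4, for illustration only.
example = [0.9, 0.6, 0.4, 0.3]
print(geometric_mean(example))
print(geometric_mean([0.9, 0.6, 0.4, 0.0]))
```

The second call shows a practical consequence of the geometric mean: a candidate without a single matching 4-gram receives P = 0, no matter how good its unigram precision is.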

The Brevity Problem

Candidate: of the
HT1: It is the guiding principle which guarantees the military forces always being under the command of the Party.

What is P₁ of the candidate?

Compensating for the Missing Recall Measure

Normally a precision measure would be combined with recall to mitigate such effects. Here, however, there are several reference translations, so recall is not straightforward to define. As a workaround, unusual brevity of the candidate is penalized.

Brevity Penalty over the Corpus

Step 1: Determine the total length c of the candidate translation.
Step 2: Determine the total length r of the reference translations by taking, for each sentence, either the shortest reference translation (NIST variant) or the one that leads to the highest score.
Step 3: Determine the brevity: brevity = r∕c
Step 4: Determine the brevity penalty: $BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - \mathit{brevity})} & \text{if } c \le r \end{cases}$

Example 10.6.3 (Realistic factor).
If the candidate translation has 1000 tokens (c = 1000) and the reference length is 1100 tokens (r = 1100), then $BP = e^{1 - 1.1} = e^{-0.1} \approx 0.905$
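
The four steps translate directly into a short function. This is a minimal sketch; the name `brevity_penalty` and the guard against an empty candidate are additions made here for illustration.

```python
import math

def brevity_penalty(c, r):
    """Brevity penalty for candidate length c and reference length r."""
    if c <= 0:
        raise ValueError("candidate length c must be positive")
    if c > r:
        return 1.0            # candidate is longer than the reference: no penalty
    brevity = r / c           # Step 3
    return math.exp(1 - brevity)  # Step 4

print(round(brevity_penalty(1000, 1100), 3))  # 0.905
print(brevity_penalty(1200, 1100))            # 1.0
```

Note that the penalty only ever lowers the score: candidates longer than the reference are not rewarded, since overly long output is already punished by lower n-gram precision.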

BLEU as a Formula
The BLEU score is the product of the brevity penalty and the geometric mean of the 1- to 4-gram precisions.

$BLEU = BP \times (P_1 \times P_2 \times P_3 \times P_4)^{1/4} = BP \times P$

A value of 1 means “perfect” agreement; a value of 0 means no agreement.
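
Putting the pieces together, the formula can be sketched for a single candidate sentence. Several choices here are assumptions rather than part of the script: the tokenizer (lowercased, punctuation as separate tokens), the use of clipped n-gram counts for every n, the shortest reference as r (the NIST variant), and scoring one sentence instead of a whole corpus. Production implementations differ in tokenization and smoothing, so the absolute numbers are not comparable with published scores.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase everything, punctuation marks are separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(cand, refs, n):
    """Clipped n-gram precision of a tokenized candidate against tokenized references."""
    cand_counts = ngram_counts(cand, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, cnt in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
    return clipped / total

def bleu(candidate, references, max_n=4):
    cand = tokenize(candidate)
    refs = [tokenize(ref) for ref in references]
    precisions = [clipped_precision(cand, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # geometric mean is zero if any precision is zero
    p = math.exp(sum(math.log(x) for x in precisions) / max_n)
    c = len(cand)
    r = min(len(ref) for ref in refs)  # shortest reference (NIST variant)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * p

MT1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
MT2 = "It is to insure the troops forever hearing the activity guidebook that party direct."
REFS = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]

print(f"MT1: {bleu(MT1, REFS):.3f}")
print(f"MT2: {bleu(MT2, REFS):.3f}")
```

In this sketch MT2 scores exactly 0 because none of its trigrams occurs in a reference. That is one reason BLEU is usually computed over a whole corpus, where the n-gram counts are pooled across sentences before the precisions are formed.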

Properties

BLEU emphasizes close local agreement and neglects discrepancies that arise beyond it: “Ensures that the military it is a guide to action which always obeys the commands of the party.” would be rated as good as candidate MT1.

How reliably does BLEU reflect human judgment?

Lexical variation (synonyms) is only taken into account if it is contained in the reference translations
Unimportant and important content words are treated alike
For the same BLEU score there are millions of combinations with widely differing translation quality
Rule-based translation systems tend to be penalized relative to statistical ones

Figure 10.12: Correlation between human and BLEU ratings, after [Callison-Burch et al. 2006]
