BLEU - Machine Translation Evaluation techniques

5.2 Machine Translation Evaluation techniques

5.2.1 BLEU

BLEU (Papineni et al. 2002) (BiLingual Evaluation Understudy) is an automated MT evaluation technique and was used as an understudy of skilled human judges in translation. The idea is to measure the closeness of a machine text to its human equivalent using weighted average of phrases matched with variable length (n-grams). It enables the use of

5.2. Machine Translation Evaluation techniques 92

multiple reference solutions from different experts and this allows for legitimate differences in word choice and sequence. The BLEU score is a precision-based metric which can use multiple reference solutions and aggregates the precision scores from different word lengths; this concept is known as modified n-gram precision. BLEU also ensures that the machine text’s length is comparable to at least one of the reference solutions using a brevity penalty. Modified n-gram precision matches position-independent n-grams; where n ≥ 1 and each uni-gram (1-gram) is typically a keyword but can also be a stand-alone special charac- ter or punctuation. This is similar to precision measure in TCBR. However, it is modified to ensure that n-grams can be matched across multiple reference solutions. Each n-gram is matched to the reference solution with the maximum count for the n-gram. The overall precision is a geometric average of all individual n-gram precisions from 1 to N . Us- ing N = 4 has been found to give the best correlation to human judgements (Papineni et al. 2002). In comparison to the criteria used in human evaluation, uni-gram precision (i.e. n = 1) measures adequate coverage of a machine text while n-gram precision (when

n > 1) shows grammatical correctness.

Brevity Penalty (BP) on the other hand ensures that the length of machine text is penalized if it is shorter than all the reference solutions. This is because a machine text of shorter length might have a very high n-gram precision if most of its keywords occur in any of the reference solutions. Therefore, modified n-gram precision alone fails to enforce proper translation length. BP focuses solely on penalizing shorter machine texts as unnecessarily long texts will have been penalized by the modified n-gram precision. Although recall has been combined with precision to overcome problems with text lengths to give measures like f-measure (Baeza-Yates & Ribeiro-Neto 1999), it cannot be used in BLEU because it employs the use of multiple references and each reference might use different word choices and order. Also, recalling all choices gives a bad translation since a good translation will only use one of the possible choices. BP is formulated as a decaying exponential function which gives a value of 1 when machine text’s length is at least identical to any of the reference solutions length otherwise it gives a value less than 1. The BLEU

5.2. Machine Translation Evaluation techniques 93

metric is calculated as follows.

BLEU = BP · exp(1 N N n=1 log p_n) (5.1) p_n= i ⎛ ⎜

⎝ # of n-grams in segment i of machine text matched in segment i of any of the reference solutions

⎞ ⎟ ⎠

i(# of n-grams in segment i of machine text)

BP = ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ 1 if l_sys > l∗_ref

exp(1−l∗reflsys) _{if l}

sys ≤ l∗ref

where

p_n= n-gram precision BP = brevity penalty l_sys = length of machine text i = 1 for TCBR l∗_ref = nearest reference solution length to machine text

N = maximum size of n-gram (i.e. n = 1 . . . N )

It is important to note that the entire text is typically regarded as one segment in TCBR (i.e. i = 1) when calculating precision, p_n. This is because there is usually no knowledge of aligned segments between proposed and reference texts unlike MT where translations are done segment by segment. Figures 5.2 and 5.3 show a sample retrieval from our H&S dataset and its corresponding BLEU calculation. Here, we compare the proposed solution with the three reference solutions and show sample BLEU calculation for both single and multiple references as well as when N= 1 and N=2 in Figure 5.3(a)-(d). Precision with a single reference solution (say Ref Solution1) when N=1 matches only keywords “nursing” and “staﬀ” from the machine text resulting in a precision of 0.6 as shown in Figure 5.3(a). However, the keyword “examined” is also matched when multiple reference solutions are in use; see Figure 5.3(b). A bigger, more accurate and reliable BLEU score of 1.0 is obtained with multiple references. The expectation is that this value from the use of multiple references will correlate better with human judgements since the meaning of the proposed solution is identical to the three reference solutions. Figures 5.3(c) and (d) also repeats the BLEU calculation for N=2. Here, just one (“nursing staﬀ”) of

5.2. Machine Translation Evaluation techniques 94

Problem: “paent fell to ﬂoor when geng out of bed.”

QUERY

Problem: “paent slid off of her bed and fell to the floor.” Soluon: “examined by nursing staff.”

RETRIEVED CASE

Ref Soluon1: “paent checked by nursing staﬀ.” Ref Soluon2: “ﬁrst aid.”

Ref Soluon3: “examined by medical staﬀ.”

mulple reference soluons

retrieved soluon/ machine text

Figure 5.2: A test case with multiple reference solutions

SINGLE REFERENCE (N=1, i.e. unigram)

# of keywords in machine text (l_sys)= 3 Ref soluon length (l*_ref)= 4 [Ref Soln 1] Number of matches with reference= 2 Unigram precision (p1)= 2/3 = 0.67

BP= exp(1- 4/3)= 0.719 [i.e. l_sys<l*_ref]

MULTIPLE REFERENCE (N=1)

# of keywords in machine text (l_sys)= 3

Closest Reference length (l*_ref)= 3 [Ref Soln 3] Number of matches with reference= 3 Unigram precision (p1)= 3/3 = 1.0

BP= exp(1- 1/1)= 1.0 [i.e. l_sys=l*_ref]

SINGLE REFERENCE (N=2, uni/bigrams)

# of bigrams in machine text (l_sys)= 2 Ref soluon length (l*_ref)= 4 [Ref Soln 1] Number of bigram matches with ref = 1 Unigram precision (p1) = 0.67

Bigram precision (p2)= 1/2 = 0.5

BP= exp(1- 4/3)= 0.719 [i.e. l_sys<l*_ref]

MULTIPLE REFERENCE (N=2)

# of bigrams in machine text (l_sys)= 2

Closest Reference length (l*_ref)= 3 [Ref Soln 3] Number of bigram matches with refs= 1 Unigram precision (p1) = 1

Bigram precision (p2)= 1/2 = 0.5

BP= exp(1- 1/1)= 1.0 [i.e. l_sys=l*_ref]

(a) (b)

(c) (d)

Figure 5.3: Sample BLEU calculation with H&S dataset

the two bi-grams (“examined nursing” and “nursing staﬀ”) is matched in the reference solutions (single or multiple). The bi-gram precision and previously calculated uni-gram precision are then used to calculate the BLEU2 score. Notice that the brevity penalty (BP) value remains the same since the length of the solutions (reference and proposed) are based only on counts of keywords. Also, the BLEU2 scores are less than their BLEU1 scores since BLEU uses an arithmetic average and the number of n-gram matches typically decreases with increasing n. Another important observation is that the same uni-gram

5.2. Machine Translation Evaluation techniques 95

precision can be obtained if the multiple references are concatenated together. However, this will not make sense for n > 1 because n-grams created from the end of one reference and the beginning of another might be incorrectly matched during precision calculation.

In document Case reuse in textual case-based reasoning. (Page 106-110)