The F-score metric - A new implementation of the Tree-DOT model


6.2.4 The F-score metric

Melamed et al. (2003) and Turian et al. (2003) apply the standard measures of precision

and recall to the evaluation of MT output. In general terms, precision and recall scores for

candidate item C with respect to reference item R are calculated according to equations 6.21 and 6.22 respectively.

$$\mathrm{precision}(C \mid R) = \frac{|C \cap R|}{|C|} \tag{6.21}$$

$$\mathrm{recall}(C \mid R) = \frac{|C \cap R|}{|R|} \tag{6.22}$$

A method of calculating the intersection between two sentences must be defined in order to apply these measures to candidate and

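     C  B  A  I  C  D  E
A    .  .  •  .  .  .  .
B    .  •  .  .  .  .  .
C    •  .  .  .  •  .  .
D    .  .  .  .  .  •  .
E    .  .  .  .  .  .  •
F    .  .  .  .  .  .  .
B    .  •  .  .  .  .  .
A    .  .  •  .  .  .  .
I    .  .  .  •  .  .  .
C    •  .  .  .  •  .  .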

Figure 6.1: Bitext grid illustrating the relationship between an example candidate translation and its corresponding reference translation. The words of the candidate translation are shown from left to right across the top of the grid and the words of the reference translation from top to bottom down the left-hand side of the grid. Each bullet, called a hit, indicates a word contained in both the candidate and reference strings. (This illustration is adapted from Figure 1 of (Melamed et al., 2003; Turian et al., 2003).)

reference translations. Precisely such a definition is given in (Melamed et al., 2003; Turian

et al., 2003), where a bitext grid is used to show the intersection of two texts. An example

of a bitext grid – adapted from Figure 1 of (Melamed et al., 2003) and (Turian et al.,

2003) – is given in Figure 6.1, where the candidate string reads from left to right across

the top of the grid and the reference string from top to bottom down the left-hand side

of the grid. The intersections between these strings are marked by bullets (termed hits), i.e. each cell in the grid referring to the same candidate and reference word constitutes a

point of intersection.
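The grid construction described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `bitext_hits` is hypothetical, and the two word sequences are read off the axes of Figure 6.1.

```python
# Candidate (top of the grid) and reference (left-hand side), as read from Figure 6.1.
candidate = "C B A I C D E".split()
reference = "A B C D E F B A I C".split()

def bitext_hits(candidate, reference):
    """Return every (row, column) cell whose reference and candidate words are identical."""
    return [
        (row, col)
        for row, ref_word in enumerate(reference)
        for col, cand_word in enumerate(candidate)
        if ref_word == cand_word
    ]

hits = bitext_hits(candidate, reference)

# Number of hits in each column, i.e. for each candidate word.
hits_per_column = [
    sum(1 for _, col in hits if col == c) for c in range(len(candidate))
]

print(len(hits))          # 11
print(hits_per_column)    # [2, 2, 2, 1, 2, 1, 1]
```

The grid contains 11 hits even though the candidate has only 7 words, because several candidate words occur more than once in the reference.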

If we simply take |C ∩ R| to be the number of hits in the grid, then the count is over-estimated as some words will be counted more than once. For example, in Figure

6.1 the first candidate word ‘C’ gets two hits as this word appears twice in the reference

translation – the total number of hits for each candidate word can be seen at a glance by

simply counting the number of hits in its column in the grid. The concept of a matching

(Melamed et al., 2003; Turian et al., 2003) is used to avoid this problem, where a matching

is a reduced grid such that there is at most one hit in each row and each column. Examples of such matchings for the grid in Figure 6.1 are given in Figure 6.2. A maximum matching

is a matching in which there are hits for as many candidate words as possible; in Figure 6.2, (b) and (c) are maximum matchings.

[Panels (a), (b) and (c): the grid of Figure 6.1 (candidate C B A I C D E across the top, reference A B C D E F B A I C down the left-hand side), with retained hits marked • and discarded hits marked ′.]

Figure 6.2: (a), (b) and (c) are examples of matchings for the grid in Figure 6.1. Hits which were in the original grid but are not contained in the matching are marked ′. In each matching, each row and column in the grid contains at most a single hit. (This illustration is adapted from Figure 1 of (Melamed et al., 2003; Turian et al., 2003).)

The maximum match size (MMS) is the number of hits in a maximum matching – in Figure 6.2 (b) and (c), the MMS is 7 – and the MMS can never exceed the length of the shorter of the strings being compared.
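For this unweighted notion of match size, a hit can only pair identical words, so each word type can contribute at most the smaller of its two frequencies. Under that observation the hit count of a maximum matching equals the size of the multiset intersection of the two word sequences. The sketch below assumes this shortcut; the function name `max_match_size` is hypothetical and the strings are those of Figure 6.1.

```python
from collections import Counter

candidate = "C B A I C D E".split()
reference = "A B C D E F B A I C".split()

def max_match_size(candidate, reference):
    """Hit count of a maximum matching: per word type, min(candidate count, reference count)."""
    overlap = Counter(candidate) & Counter(reference)  # multiset intersection
    return sum(overlap.values())

mms = max_match_size(candidate, reference)
print(mms)  # 7

# The MMS is bounded by the length of the shorter string.
assert mms <= min(len(candidate), len(reference))
```

Note that this shortcut only counts hits; it says nothing about which cells are chosen, so it cannot distinguish between different maximum matchings of the same grid.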

The intersection between candidate and reference translations can be computed as

the MMS (Melamed et al., 2003; Turian et al., 2003) and precision and recall calculated

according to formulae (6.23) and (6.24).

$$\mathrm{precision}(C \mid R) = \frac{\mathrm{MMS}(C, R)}{|C|} \tag{6.23}$$

$$\mathrm{recall}(C \mid R) = \frac{\mathrm{MMS}(C, R)}{|R|} \tag{6.24}$$
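As a worked instance, take the example of Figure 6.1, assuming the lengths read off its axes (a candidate of 7 words and a reference of 10 words) together with the MMS of 7. Equations (6.23) and (6.24) then give:

```latex
\mathrm{precision}(C \mid R) = \frac{\mathrm{MMS}(C, R)}{|C|} = \frac{7}{7} = 1.0
\qquad
\mathrm{recall}(C \mid R) = \frac{\mathrm{MMS}(C, R)}{|R|} = \frac{7}{10} = 0.7
```

The candidate receives perfect precision because every one of its words finds a partner in the reference, irrespective of where that partner occurs.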

However, these measurements penalise neither incorrect word order nor non-contiguous hits: grids (b) and (c) in Figure 6.2 both contain the same number of hits and so receive exactly the same precision and recall scores, despite the fact that grid (c) shows a matched 4-word sequence whereas the largest correct sequence shown in grid (b) has only 2 words. In order to reward correct word order, the definition of match size is generalised by treating runs, i.e. contiguous sequences of matching words that appear in the same order in both strings, as atomic units (Melamed et al., 2003; Turian et al., 2003). Each run is converted to an aligned block which is its minimum enclosing square; this is

illustrated in Figure 6.3 (b) and (c) where the blocks of cells marked with circles correspond to the runs in Figure 6.2 (b) and (c).

[Panels (b) and (c): the grid of Figure 6.1 with the aligned blocks of each matching marked by circles (°).]

Figure 6.3: (b) and (c) are examples of maximum matchings for the grid in Figure 6.1. (This illustration is adapted from Figure 1 of (Melamed et al., 2003; Turian et al., 2003).)

Match size is then defined in terms of the area of the aligned blocks by defining the weight of any single run as the

square of its length. Thus, the match size MMS for a particular maximum matching M is calculated according to equation (6.25) (Melamed et al., 2003; Turian et al., 2003):

$$\mathrm{MMS}(M) = \sqrt{\sum_{r \in M} \mathrm{length}(r)^2} \tag{6.25}$$
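Equation (6.25) can be illustrated with a short Python sketch. It assumes a matching is represented as a list of 0-based (candidate index, reference index) pairs, which is a representation chosen here for illustration, and the names `run_lengths` and `match_size` are hypothetical. The pairs for matching (c) are as read from Figure 6.2 (c).

```python
import math

def run_lengths(matching):
    """Split a matching into runs and return the length of each run.

    A run continues while both the candidate index and the reference index
    advance by exactly one from the previous hit.
    """
    lengths = []
    previous = None
    for cand_i, ref_i in sorted(matching):
        if previous is not None and (cand_i, ref_i) == (previous[0] + 1, previous[1] + 1):
            lengths[-1] += 1      # extend the current run
        else:
            lengths.append(1)     # start a new run
        previous = (cand_i, ref_i)
    return lengths

def match_size(matching):
    """Square root of the summed squared run lengths, as in equation (6.25)."""
    return math.sqrt(sum(n ** 2 for n in run_lengths(matching)))

# Matching (c): candidate C B A I C D E against reference A B C D E F B A I C.
matching_c = [(0, 2), (1, 6), (2, 7), (3, 8), (4, 9), (5, 3), (6, 4)]

print(run_lengths(matching_c))           # [1, 4, 2]
print(round(match_size(matching_c), 2))  # 4.58
```

Squaring the run lengths before summing is what makes one run of four words worth more than two runs of two words.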

According to this definition of match size, the grid in Figure 6.3 (b) is of size $\sqrt{1^2 + 2^2 + 2^2 + 2^2} \approx 3.61$ whereas grid (c) is of size $\sqrt{1^2 + 4^2 + 2^2} \approx 4.58$. As precision and recall are calculated according to equations 6.23 and 6.24 as before, grid (c) now scores higher than grid (b). As computing the MMS for any candidate and reference pair is NP-hard, Turian et al.

(2003) use an approximation which finds the true maximum match size 99% of the time.
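The text does not describe the approximation itself. Purely as an illustration of how an approximate search might work, the sketch below uses a greedy strategy: repeatedly take the longest run of matching words among positions not yet used, then score the collected runs with equation (6.25). This greedy strategy is an assumption made here, not necessarily the procedure used by Turian et al. (2003), and the function names are hypothetical.

```python
import math

def longest_free_run(cand, ref, used_c, used_r):
    """Return (length, i, j) of the longest run starting at cand[i], ref[j]
    that uses only unmatched positions; length is 0 if no hit remains."""
    best = (0, -1, -1)
    for i in range(len(cand)):
        for j in range(len(ref)):
            k = 0
            while (i + k < len(cand) and j + k < len(ref)
                   and not used_c[i + k] and not used_r[j + k]
                   and cand[i + k] == ref[j + k]):
                k += 1
            if k > best[0]:
                best = (k, i, j)
    return best

def greedy_match_size(cand, ref):
    """Greedy approximation: pick the longest available run until none is left."""
    used_c = [False] * len(cand)
    used_r = [False] * len(ref)
    runs = []
    while True:
        length, i, j = longest_free_run(cand, ref, used_c, used_r)
        if length == 0:
            break
        for k in range(length):       # mark the cells of this run as consumed
            used_c[i + k] = True
            used_r[j + k] = True
        runs.append(length)
    return math.sqrt(sum(n * n for n in runs)), runs

candidate = "C B A I C D E".split()
reference = "A B C D E F B A I C".split()

size, runs = greedy_match_size(candidate, reference)
print(runs)             # [4, 2, 1]
print(round(size, 2))   # 4.58
```

On the example of Figure 6.1 this greedy search recovers the run lengths of grid (c). A greedy choice of this kind carries no optimality guarantee in general, which is why it is presented only as a sketch of an approximate method.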
