
5.1 Promising ideas, poor performance

5.1.3 Conclusions

As outlined in section 4.2, the Tree-DOT model embodies many desirable MT system characteristics and, consequently, would seem to be an MT paradigm worthy of investigation. Poutsma’s pilot experiments yielded disappointing results, however, and he puts forward some possible explanations as to why this was the case (Poutsma, 2000):

• the dataset used was extremely small, containing only 226 tree pairs; better performance would be expected with a larger corpus;

• the quality of the translations in the example base was poor as they were translated into English by non-native speakers and, as a result, the output translations were also poor;

• the trees in the dataset were wide and shallow rather than deep, meaning that varying tree depth did not have a great deal of impact on output quality, and Poutsma suggests that varying tree width would be more appropriate;

• it was frequently the case that two or more translations had roughly the same probability and the less preferred translation would have scored better;

• in the German language, case, number and gender influence the choice of word form, and translation quality would improve if this information were integrated into the model.

We agree with Poutsma that the small number of sentence pairs and poor translation quality in his dataset contributed to the poor performance. However, it is also the case that the trees themselves were lacking in linguistic complexity. The total number of fragments yielded from 226 tree pairs in his dataset was 33479. In contrast, the 810 tree pairs contained in the English-French section of the HomeCentre corpus (which will be described more fully in section 6.1) yield a maximum number of fragments in excess of [...]. Although Poutsma’s dataset contains just under 28% of the number of pairs in the HomeCentre corpus, it yields just 0.01% of the number of fragments. We suggest that not only does the dataset need to be larger, but the analyses in the dataset also need to provide a greater level of linguistic detail in order to fully assess the capabilities of the model.³ We agree with Poutsma that incorporating information on features such as number and gender into the model is likely to significantly improve translation quality. This issue is discussed further in chapter 8. However, we also feel that the Tree-DOT model merits further investigation and evaluation before moving on to more linguistically-complex models.
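The comparison of the two datasets above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python using only the figures quoted in the text (226 and 810 tree pairs, 33479 fragments); the variable names are ours, and the final figure is merely the order of magnitude implied by the rounded 0.01% ratio, not a count reported for the HomeCentre corpus.

```python
# Figures quoted in the text above.
poutsma_pairs = 226
poutsma_fragments = 33479
homecentre_pairs = 810

# Share of the HomeCentre tree pairs that Poutsma's dataset amounts to.
pair_share = poutsma_pairs / homecentre_pairs

# Average number of fragments contributed per tree pair in Poutsma's dataset.
fragments_per_pair = poutsma_fragments / poutsma_pairs

# ASSUMPTION: treating the quoted 0.01% as exact (it is a rounded figure),
# the HomeCentre fragment count would be roughly 10,000 times Poutsma's.
implied_homecentre_fragments = poutsma_fragments * 10_000

print(f"pair share: {pair_share:.1%}")
print(f"fragments per pair (Poutsma): {fragments_per_pair:.1f}")
print(f"implied HomeCentre fragments: about {implied_homecentre_fragments:.1e}")
```

The pair share comes out at just under 28%, in line with the text, while the gap in fragment counts spans several orders of magnitude; this is the sense in which the trees, and not only the corpus size, differ between the two datasets.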

Implementing the Tree-DOT model is analogous to implementing Tree-DOP. Given the discussion in sections 2.4 and 2.5.1 on the difficulties of efficiently implementing the Tree-DOP model, however, it is clear that Poutsma’s implementation lacks the sophistication needed to handle a dataset significantly larger and more complex than the one used previously. Furthermore, the situations Poutsma mentions, in which two or more translations had roughly the same probability and the less preferred translation would have scored better, may be at least partly attributable to the inadequacies of his sampling methodology.
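To illustrate why sampling can matter when candidates are nearly tied, the sketch below estimates how often a sample-based ranking fails to prefer the more probable of two translations. It is a toy illustration only, not a reconstruction of Poutsma’s sampler: the two-candidate setup, the probabilities 0.52 and 0.48, the sample sizes and the function name `misrank_rate` are all assumptions made for the example.

```python
import random


def misrank_rate(p_best, n_samples, n_trials, seed=0):
    """Estimate how often sampling fails to prefer the more probable candidate.

    Two hypothetical candidates are assumed: one with probability p_best
    and one with probability 1 - p_best. Each trial draws n_samples
    samples and ranks the candidates by how often each was drawn.
    """
    rng = random.Random(seed)
    misranked = 0
    for _ in range(n_trials):
        best_count = sum(rng.random() < p_best for _ in range(n_samples))
        other_count = n_samples - best_count
        # A tie counts as a failure: the sampler cannot separate the two.
        if best_count <= other_count:
            misranked += 1
    return misranked / n_trials


# Hypothetical near-tie: 0.52 versus 0.48.
few = misrank_rate(p_best=0.52, n_samples=10, n_trials=200)
many = misrank_rate(p_best=0.52, n_samples=5000, n_trials=200)
print(f"misrank rate with 10 samples:   {few:.2f}")
print(f"misrank rate with 5000 samples: {many:.2f}")
```

With few samples the near-tied candidates are frequently misordered, whereas a much larger sample separates them far more reliably; how many samples suffice depends on how close the two probabilities are.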

In conclusion, the algorithms developed to make Tree-DOP more efficient must be adapted for Tree-DOT and a more robust implementation built to facilitate experiments using larger, more complex datasets. Until such a system is in place and these experiments carried out, it will not be possible to fully demonstrate the merits of the Tree-DOT approach to translation.

³ Much of Poutsma’s discussion as to the quality of translations produced by the DOT model centers on manual comparison with the translations output by the Systran MT system for the same sets of test sentences. Clearly, the poor quality of the translations in Poutsma’s training data meant that the DOT model yielded poor translations compared to those generated by Systran. Consequently, we agree that poor training translation quality contributed to the disappointing evaluation outcome. However, we do not feel that this is the most appropriate way to evaluate data-driven MT systems; we believe that automatic evaluation gives a better picture of how such models perform. If a data-driven system is trained on poor quality translations, then we expect to get poor translations out, but if the reference translations are also poor then we still expect to score well. In a data-driven system, the aim is to model the data supplied, and if evaluation is over a held-out portion of this data then we expect to score well even if the output translations are, in human terms, of poor quality. Although he also performed automatic evaluation against reference translations using a metric he defined himself (Poutsma, 2000: 58), Poutsma’s work predates the development of the automatic evaluation metrics currently in use: Bleu (Papineni et al., 2001, 2002), NIST (NIST, 2002; Doddington, 2002) and F-score (Melamed et al., 2003; Turian et al., 2003). Poutsma’s metric, termed ‘Largest Translation Part’, is far less sophisticated than these newer metrics, and so it is difficult to draw meaningful conclusions from the automatic evaluation he presents.
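The point about scoring against held-out references can be made concrete with a sketch of clipped n-gram precision, the core ingredient of Bleu. This is a simplified illustration, not the full Bleu metric (it omits the brevity penalty and the combination of several n-gram orders), and it is not Poutsma’s ‘Largest Translation Part’ metric; the function name and the example sentences are invented for the illustration.

```python
from collections import Counter


def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (clipping). Inputs are lists of tokens.
    """
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    ref_counts = ngrams(reference)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


# Hypothetical data: the reference is itself a poor, non-native translation.
reference = "i have at tuesday no time".split()
faithful_output = "i have at tuesday no time".split()
fluent_output = "i am not free on tuesday".split()

print(clipped_ngram_precision(faithful_output, reference, n=2))
print(clipped_ngram_precision(fluent_output, reference, n=2))
```

An output that reproduces the style of the poor reference scores higher here than a more fluent paraphrase, which is the behaviour described in the footnote: the metric rewards modelling the data supplied, not quality in human terms.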
