5.1 Promising ideas, poor performance
5.1.3 Conclusions
As outlined in section 4.2, the Tree-DOT model embodies many desirable MT system
characteristics and, consequently, would seem to be an MT paradigm worthy of investigation. Poutsma’s pilot experiments yielded disappointing results, however, and he puts
forward some possible explanations as to why this was the case (Poutsma, 2000):
• the dataset used was extremely small, containing only 226 tree pairs; better performance would be expected with a larger corpus;
• the quality of the translations in the example base was poor as they were translated into English by non-native speakers and, as a result, the output translations were
also poor;
• the trees in the dataset were wide and shallow rather than deep, meaning that varying tree depth did not have a great deal of impact on output quality, and Poutsma
suggests that varying treewidth would be more appropriate;
• it was frequently the case that two or more translations had roughly the same probability and the less preferred translation would have scored better;
• in the German language, case, number and gender influence the choice of word form and translation quality would improve if this information were integrated into the
model.
We agree with Poutsma that the small number of sentence pairs and poor translation
quality in his dataset contributed to the poor performance. However, it is also the case
that the trees themselves were lacking in linguistic complexity. The total number of
fragments yielded from 226 tree pairs in his dataset was 33479. In contrast, the 810 tree
pairs contained in the English-French section of the HomeCentre corpus (which will be
described more fully in section 6.1) yield a maximum number of fragments in excess of [...]. Although Poutsma’s dataset contains
just under 28% of the number of pairs in the HomeCentre corpus, it yields just 0.01% of the number of fragments. We suggest that not only does the dataset need to be larger, but
the analyses in the dataset need to provide a greater level of linguistic detail in order to
fully assess the capabilities of the model.3 We agree with Poutsma that incorporating
information on features such as number and gender into the model is likely to significantly
improve translation quality. This issue is discussed further in chapter 8. However, we also
feel that the Tree-DOT model merits further investigation and evaluation before moving
on to more linguistically complex models.
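
The corpus comparison above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (226 tree pairs and 33479 fragments for Poutsma's dataset, 810 tree pairs for the English-French HomeCentre section). The variable names are illustrative, and the HomeCentre fragment count is deliberately left out.

```python
# Figures quoted in the text; variable names are illustrative only.
poutsma_pairs = 226
poutsma_fragments = 33479
homecentre_pairs = 810

# Size of Poutsma's dataset relative to the HomeCentre section, in tree pairs.
pair_ratio = poutsma_pairs / homecentre_pairs

# Average number of fragments extracted per tree pair in Poutsma's dataset.
fragments_per_pair = poutsma_fragments / poutsma_pairs

print(f"pair ratio: {pair_ratio:.1%}")
print(f"fragments per tree pair: {fragments_per_pair:.1f}")
```

The pair ratio comes out just under 28%, as stated, and the average of roughly 148 fragments per tree pair reflects how little structure each of Poutsma's trees contributes.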
Implementing the Tree-DOT model is analogous to implementing Tree-DOP. Given the
discussion in sections 2.4 and 2.5.1 on the difficulties of efficiently implementing the Tree-DOP model, however, it is clear that Poutsma’s implementation lacks the sophistication
to handle a dataset significantly larger and more complex than the one used previously. Furthermore, it is possible that in the situations
Poutsma mentions where two or more translations had roughly the same probability and
the less preferred translation would have scored better, this is at least partly attributable
to the inadequacies of his sampling methodology.
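
To illustrate why sampling can matter when candidate translations are nearly tied, the following minimal sketch estimates how often a sampling-based selection fails to return the most probable candidate. It does not reproduce Poutsma's implementation: the toy distribution, the candidate names and the functions `pick_by_sampling` and `error_rate` are all assumptions made for illustration.

```python
import random
from collections import Counter


def pick_by_sampling(probs, n_samples, rng):
    """Return the candidate drawn most often in n_samples random draws."""
    candidates = list(probs)
    weights = [probs[c] for c in candidates]
    draws = rng.choices(candidates, weights=weights, k=n_samples)
    return Counter(draws).most_common(1)[0][0]


def error_rate(probs, n_samples, trials=300, seed=0):
    """Fraction of trials in which sampling misses the truly most probable candidate."""
    rng = random.Random(seed)
    best = max(probs, key=probs.get)
    wrong = sum(
        pick_by_sampling(probs, n_samples, rng) != best for _ in range(trials)
    )
    return wrong / trials


# Hypothetical distribution over three candidate translations, two of them nearly tied.
toy = {"translation_a": 0.36, "translation_b": 0.34, "translation_c": 0.30}

for n in (10, 100, 1000):
    print(n, error_rate(toy, n))
```

Under these assumptions, the chance of selecting a less probable candidate is high for small sample sizes and falls as the number of samples grows, which is consistent with the suggestion that near-ties are sensitive to how sampling is carried out.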
In conclusion, the algorithms developed to make Tree-DOP more efficient must be
adapted for Tree-DOT and a more robust implementation built to facilitate experiments
using larger, more complex datasets. Until such a system is in place and these experiments carried out, it will not be possible to fully demonstrate the merits of the Tree-DOT
approach to translation.
3 Much of Poutsma’s discussion of the quality of translations produced by the DOT model centers around manual comparison with the translations output by the Systran MT system for the same sets of test sentences. Clearly, the poor quality of the translations in Poutsma’s training data meant that the DOT model yielded poor translations compared to those generated by Systran. Consequently, we agree that poor training translation quality contributed to the disappointing evaluation outcome. However, we do not feel that this is the most appropriate way to evaluate data-driven MT systems; we believe that automatic evaluation gives a better picture of how such models perform. If a data-driven system is trained on poor-quality translations, then we expect to get poor translations out, but if the reference translations are also poor then we still expect to score well. In a data-driven system, the aim is to model the data supplied, and if evaluation is over a held-out portion of this data then we expect to score well even if the output translations are, in human terms, of poor quality. Although he also performed automatic evaluation against reference translations using a metric he defined himself (Poutsma, 2000: 58), Poutsma’s work predates the development of the automatic evaluation metrics currently in use: Bleu (Papineni et al., 2001, 2002), NIST (NIST, 2002; Doddington, 2002) and F-score (Melamed et al., 2003; Turian et al., 2003). Poutsma’s metric, termed ‘Largest Translation Part’, is far less sophisticated than these newer metrics, and so it is difficult to draw meaningful conclusions from the automatic evaluation he presents.