Machine Translation Evaluation

Part I: Research Context

2.4.5 Machine Translation Evaluation

Traditionally, the evaluation of MT output has been carried out by human evaluators with linguistic competence who have been trained to some extent to measure concepts such as fluency, accuracy, and overall quality. Fluency is understood here as the extent to which the target text ‘reads well’ in the target language, and accuracy is understood as the extent to which the target reflects the meaning of the source text. Drawbacks to such a method of evaluation are that it can be resource intensive, and may produce different results from one evaluator to another and even from the same evaluator on separate occasions. Along with the growth in the development of MT systems, a need arose to ascertain if changes to a system resulted in quantifiable improvements. While human evaluation would be an ideal method for such an evaluation, it may simply not be possible, especially as systems may be changed many times in short succession. Therefore, a resource-cheap means of evaluation was sought to assist MT developers in the evaluation of their systems. This was the motivation for the development of automatic evaluation metrics or AEMs. The basic premise of AEMs is that the “closer a machine translation is to a professional human translation, the better it is” (Papineni et al. 2002, p. 311). To make this comparison, AEMs are given a reference translation created by a human translator, which is typically assumed to be the ‘gold standard’ or ideal translation.

The most commonly used AEMs are string-based in that they compare strings of the MT output text to those of the reference translation. String-based AEMs include General Text Matcher or GTM (Turian et al. 2003) and Bilingual Evaluation Understudy or BLEU (Papineni et al. 2002). Such metrics can be useful for charting the development of an MT system over time; however, AEM scores are difficult to interpret outside the MT research community, in that it remains unclear whether a higher score truly equates to a better translation. Nevertheless, AEMs are in widespread use in MT and provide valuable information, which is often used for the comparison of MT systems (e.g. Callison-Burch et al. 2006, Huang and Papineni 2007). Other notable AEMs in the literature are Meteor (Banerjee and Lavie 2005) and Translation Edit Rate or TER (Snover et al. 2009), both of which allow the two strings being compared to differ in their use of synonyms without being penalised, and both of which allow for multiple reference translations.
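The idea of a string-based comparison can be illustrated with a minimal sketch in Python. It is not the implementation of any of the metrics cited above: `clipped_precision` shows only the clipped n-gram precision at the core of BLEU (the full metric also combines several n-gram orders and applies a brevity penalty), and `unigram_f_measure` is a simplified word-overlap F-measure in the spirit of GTM, without any reward for longer contiguous matches. The function names, the whitespace tokenisation, and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n=1):
    """Clipped n-gram precision of an MT output against one or more references.

    Each n-gram in the hypothesis is credited at most as many times as it
    occurs in any single reference, so repeating a matching word earns
    no extra credit.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    # Highest count of each n-gram observed in any one reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


def unigram_f_measure(hypothesis, reference):
    """Harmonic mean of word-level precision and recall (simplified)."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences only; they are not drawn from any corpus.
mt_output = "the cat sat on the mat"
reference = "the cat is on the mat"
unigram_score = clipped_precision(mt_output, [reference], n=1)
bigram_score = clipped_precision(mt_output, [reference], n=2)
f_score = unigram_f_measure(mt_output, reference)
```

Because `clipped_precision` accepts a list of references, the same sketch also shows how a metric can draw on multiple reference translations.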

The link between AEM results and human evaluations has been the subject of much debate and research (Coughlin 2003). There is evidence to support the belief that AEMs correlate well with human judgment in certain contexts (Kuleska and Shieber 2004) but not in others (Och and Ney 2004). BLEU has been shown to correlate well with human evaluation at the corpus and document level (Specia et al. 2010), although its accuracy at sentence level is thought to be questionable (Callison-Burch et al. 2006).

Working on similar corpora to the current study, Tatsumi (2009) found GTM to have a stronger correlation than either TER or BLEU with post-editing speed, in that a higher GTM score was reflected in faster post-editing. Similarly, Sun (2010) found GTM to have the strongest correlation with post-editing speed. BLEU and TER were again used but showed weaker correlations, and Sun postulated that GTM scores are best suited to simple sentences rather than more complex or incomplete sentences.
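To make the notion of a correlation between metric scores and post-editing speed concrete, the following Python sketch computes a Pearson correlation coefficient by hand. It is an illustration only: the studies cited above may have used a different coefficient, and the scores and speeds below are invented values, not data from Tatsumi (2009) or Sun (2010). The function name `pearson` and the variable names are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of co-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented per-sentence values, for illustration only.
metric_scores = [0.35, 0.50, 0.62, 0.71, 0.90]  # hypothetical AEM scores
words_per_minute = [9.0, 12.5, 13.0, 17.5, 21.0]  # hypothetical post-editing speed

r = pearson(metric_scores, words_per_minute)
print(f"Pearson r = {r:.2f}")
```

A value of r close to 1 would indicate that higher metric scores go together with faster post-editing, a value close to -1 the reverse, and a value close to 0 no linear relationship.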

Other studies have found GTM to correlate best with human evaluation involving European languages, e.g. Cahill (2009), Agarwal and Lavie (2008). In the context of this study, GTM, BLEU and TER are used in the evaluation of the MT output and will be described in more detail in the next chapter.

2.4.6 Section Summary

This section focused on the topic of machine translation. It described how machine translation systems have developed and introduced the rule-based and statistical machine translation systems employed in the current study, which are operationalised in the next chapter. This was followed by an exploration of the use of controlled language in conjunction with machine translation systems and, finally, a review of the evaluation of machine translation systems by means of automatic evaluation metrics.

2.5 Eye Tracking

2.5.1 Section Overview

While an increasing number of studies using eye tracking have been carried out in translation studies and the related areas of translation process studies and audio-visual translation, much relevant information is available from earlier eye tracking studies conducted in related fields such as cognitive psychology, psycholinguistics, and usability research. In the following, the literature related to translation studies will first be reviewed as it is most relevant to the present research. This will be followed by a review of the most relevant work from the much larger body of literature in the latter domains.

Fundamentally, an eye tracker monitors and records activity/movements of the eyes, and the pupil in particular, by means of video and infrared cameras. The data gathered by the hardware are processed and made available for examination by supporting software. An example of such a setup is the Tobii 1750 (www.tobii.se) and its supporting software Tobii Studio.

To facilitate comprehension of the following paragraphs, a brief explanation of common eye tracking terminology is provided below:

• Area of interest (AOI): an arbitrary area defined by the researcher and usually intended to coincide with a specific visual or textual phenomenon (e.g. the headline of a text) in the material being examined by participants in an eye-tracking study;

• Fixation time/duration/length: the duration of time the eye focuses on an item;

• Fixation count: the number of occasions on which the eye focuses on an item;

• Gaze time/observation length: the duration of time spent gazing within a particular AOI;

• Pupil dilation/pupil size: the size of the pupil in millimetres and its constriction and dilation in response to stimuli (e.g. external stimulus, internal cognitive effort);

• Regression: “any eye movement that begins at the right-most point the reader has fixated and leaves the currently fixated region to the left” (Pickering and Traxler, p. 945);

• Scan path: the way in which the eyes look at items (e.g. a line of text);

• Saccade: a movement from one point of fixation to another; this movement is not fluid and is typically made in a series of short jumps which humans are unaware of.
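How some of these measures relate to one another can be sketched in Python. The sketch assumes that the eye tracking software has already exported a list of fixations with screen coordinates and durations, and that an AOI is a simple rectangle. The class and field names (`Fixation`, `AOI`, `duration_ms`) are hypothetical and do not correspond to the export format of Tobii Studio or any other package. Gaze time is approximated here as the sum of fixation durations inside the AOI, which ignores any time spent in saccades within it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixation:
    """One fixation: screen position in pixels and duration in milliseconds."""
    x: float
    y: float
    duration_ms: float


@dataclass(frozen=True)
class AOI:
    """A rectangular area of interest, in the same pixel coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, fixation):
        """True if the fixation falls inside (or on the edge of) the rectangle."""
        return (self.left <= fixation.x <= self.right
                and self.top <= fixation.y <= self.bottom)


def aoi_measures(fixations, aoi):
    """Fixation count, approximate gaze time, and mean fixation duration for an AOI."""
    inside = [f for f in fixations if aoi.contains(f)]
    count = len(inside)
    gaze_time = sum(f.duration_ms for f in inside)
    mean_duration = gaze_time / count if count else 0.0
    return {
        "fixation_count": count,
        "gaze_time_ms": gaze_time,
        "mean_fixation_duration_ms": mean_duration,
    }


# Invented fixations, for illustration only.
fixations = [
    Fixation(x=10, y=10, duration_ms=200),
    Fixation(x=50, y=10, duration_ms=300),
    Fixation(x=500, y=400, duration_ms=250),  # falls outside the AOI below
]
headline = AOI(left=0, top=0, right=100, bottom=50)
measures = aoi_measures(fixations, headline)
```

With these invented values, two of the three fixations fall inside the headline AOI, giving a fixation count of 2 and an approximate gaze time of 500 ms.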

In document Investigating the effects of controlled language on the reading and comprehension of machine translated texts: A mixed-methods approach (Page 57-62)