Evaluation – Measurement Validity and Reliability

Chapter 4: Methodology

4.2 Evaluation – Measurement Validity and Reliability

existence, characteristics, size and/or quantity of some variable through systematic recording and organization of observation” (1991: 100). Developing valid measurement is a primary concern for researchers: it ensures that they are indeed measuring the concepts they intended to measure and that the variable is measured in a consistent and stable manner (Frey et al. 1991).

Measuring the effect of an approach on the translation of prepositions is one of the core research objectives of the study. To choose appropriate human and automatic evaluation methods, specific problems have to be taken into consideration. Obtaining an overview of the quality of the prepositions translated by the RBMT system requires human examination, so that questions such as which error is the most frequent can be answered. Evaluating the errors in translated prepositions is therefore the first step taken in this study, before any approach is applied to improve the Baseline translation. The details of this evaluation are reported in Chapter 5.

As for measuring the effects of an approach, translations can be compared and evaluated both by humans and by automatic evaluation metrics.

Automatic evaluation metrics report quantitative scores for the overall translations, which can indicate whether or not there is a difference between two translations. For this study, we selected three of the most widely used metrics, namely BLEU, GTM and TER. Several factors influenced this decision: they are widely used in the area of MT evaluation; they are able to evaluate Chinese output; they are reported to correlate with human evaluation to some extent; they are straightforward to use; and they require no additional large datasets of linguistic information. One particular reason for the choice of GTM was that it is the default evaluation metric embedded in SymEval (an evaluation software program used in Symantec) (Roturier 2009). In Section 2.2.2.2, where GTM was explained, we mentioned that the weight of GTM can be changed. The higher the weight, the heavier the penalty on word order differences between an MT output and its reference translation. The most commonly used weight is the default setting (e=1), which applies no penalty to word order differences. Turian et al. (2003) concluded from their evaluation of Chinese output that GTM (e=1) correlated better with human evaluation than GTM (e=2). Another common weight is e=1.2, which is used in some evaluation campaigns (Callison-Burch et al. 2007). In addition, GTM (e=1.2) is also employed internally by Symantec (Roturier 2009). Therefore, in this study, both GTM (e=1) and GTM (e=1.2) are reported throughout all experiments. Note that automatic evaluation scores were reported and examined only at system level and at sentence level. We did not extract isolated translations of prepositions to be scored, as most automatic metrics are designed to work at text or sentence level rather than on short syntactic constituents.
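The effect of the GTM weight can be illustrated with a minimal sketch. It assumes the commonly cited formulation in which the size of a matching is the e-th root of the sum of its contiguous run lengths, each raised to the power e, and in which precision and recall are that size divided by the candidate and reference lengths respectively. The function names are hypothetical, and the run lengths are supplied by hand rather than computed by a maximum-matching search, so this is not the GTM tool itself.

```python
def match_size(run_lengths, e=1.0):
    """Size of a matching made of contiguous runs: (sum of length**e) ** (1/e)."""
    return sum(length ** e for length in run_lengths) ** (1.0 / e)


def gtm_f_measure(run_lengths, candidate_len, reference_len, e=1.0):
    """Harmonic mean of precision and recall based on the weighted match size.

    run_lengths is an assumed input: the lengths of the contiguous word runs
    shared by the candidate and the reference.
    """
    size = match_size(run_lengths, e)
    precision = size / candidate_len
    recall = size / reference_len
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical six-word candidate and reference that share all six words.
in_order = [6]          # one unbroken run: identical word order
reordered = [2, 2, 2]   # three shorter runs: same words, different order

for weight in (1.0, 1.2):
    print(weight,
          gtm_f_measure(in_order, 6, 6, e=weight),
          gtm_f_measure(reordered, 6, 6, e=weight))
```

With e=1 the two hypothetical outputs receive the same score, because only the number of matched words counts. With e=1.2 the reordered output scores lower, which is the word order penalty referred to above.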

Qualitative comparison of the differences between two translations requires human evaluation, particularly for preposition evaluation. Although the focus of the study is to improve the translation of prepositions, it is not desirable to obtain better translations of prepositions at the expense of overall translation quality. Therefore, besides preposition evaluation, sentence-level evaluation is also indispensable. Scoring an output at sentence level according to its adequacy and fluency is a pervasive evaluation approach (Flanagan 2009; Callison-Burch et al. 2007; LDC 2005; Hovy et al. 2002). However, more recent work has revealed that ranking is more intuitive, reliable and evaluator-friendly (Duh 2008; Vilar et al. 2007). This type of evaluation, in which human judges are asked only to rank the candidate translations from best to worst, is also widely employed in MT evaluation campaigns (Callison-Burch et al. 2009, 2008, 2007). For the purpose of this study, ranking is conducted at both preposition and sentence levels. The results are complemented by a detailed qualitative analysis of the outputs by the author.
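As an illustration of how ranking judgements can be summarised, the following sketch counts how often one system is ranked better than, worse than, or equal to another. The data layout, the system name "modified" and the function name are assumptions made for this example only; they do not describe the procedure actually used in the study.

```python
def tally_pairwise(judgements, system_a, system_b):
    """Count how often system_a is ranked better, worse, or equal to system_b.

    Each judgement maps a system name to its rank, where 1 is the best rank.
    """
    counts = {"a_better": 0, "b_better": 0, "tie": 0}
    for ranks in judgements:
        if ranks[system_a] < ranks[system_b]:
            counts["a_better"] += 1
        elif ranks[system_a] > ranks[system_b]:
            counts["b_better"] += 1
        else:
            counts["tie"] += 1
    return counts


# Hypothetical judgements from one evaluator over four sentences.
judgements = [
    {"baseline": 2, "modified": 1},
    {"baseline": 1, "modified": 2},
    {"baseline": 2, "modified": 1},
    {"baseline": 1, "modified": 1},
]
result = tally_pairwise(judgements, "modified", "baseline")
print(result)
```

The same tally can be applied to rankings of individual prepositions or of whole sentences, since only the relative order of the candidates is used.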

The literature to date does not describe an ideal profile for evaluators. However, using professional translators who are familiar with the technical documents of this study would increase consistency and validity (Aranberri 2009). As to the adequate number of evaluators, researchers have pointed out that at least three or four evaluators should be used (Arnold et al. 1994; Carroll 1966). On this basis, a minimum of four evaluators was employed in this study. A further consideration was the research budget, as evaluation becomes more costly as more evaluators are involved.

The reliability of the results of automatic and human evaluation also needs to be examined. The reliability of human evaluation can be reported through inter-evaluator and/or intra-evaluator agreement (such as Kappa scores), introduced in Section 2.2.3. The reliability of automatic evaluation can be verified by examining its correlation with human evaluation. Using both automatic and human evaluation in a study draws on the advantages of each and increases the overall validity of the results.
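A Kappa score of the kind mentioned above can be sketched as follows. The example assumes Cohen's kappa for two evaluators, computed as (observed agreement - chance agreement) / (1 - chance agreement); the variant introduced in Section 2.2.3 may differ, and a study with more than two evaluators might instead use a multi-rater variant or average over evaluator pairs. The labels and the function name are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two evaluators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both evaluators must label the same non-empty item set")
    n = len(labels_a)
    # Proportion of items on which the two evaluators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each evaluator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both evaluators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgements on four sentences by two evaluators.
evaluator_1 = ["better", "better", "worse", "equal"]
evaluator_2 = ["better", "worse", "worse", "equal"]
print(cohen_kappa(evaluator_1, evaluator_2))
```

Intra-evaluator agreement can be estimated with the same function by passing two sets of judgements that one evaluator made on the same items at different times.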

Frey et al. (1991) pointed out that “a researcher who intends to use a technique must make sure it has been validated at some point by its originators and has been used previously in research” (p. 122). The measurement technique used in this study meets this requirement, as both the selected automatic metrics and the form of human evaluation in this study have been, and remain, widely used in the field of machine translation.

In document An Investigation into Automatic Translation of Prepositions in IT Technical Documentation from English to Chinese (Page 89-92)