Human evaluation based on usability judgements

Chapter 3 Translation Quality Assessment (TQA)

3.3 Quality assessment of MT

3.3.1 Human evaluation

3.3.1.3 Human evaluation based on usability judgements

Castilho et al. (2018) call for usability tests for translated content. They point out that both academia and industry rarely take end users into account when evaluating translation quality, so the acceptance of translated content has rarely been measured. Van Slype (1979) states that MT acceptability can only be measured effectively by its end users. Roturier (2006) echoes this point and argues that the evaluator of quality has to be the real user of the translated text in order to maximize the ecological validity of the evaluation study. Suojanen et al. (2015) point out that traditional TQA focuses only on the end product, which means that any subsequent changes can lead to a substantial loss of money and time. They therefore suggest the user-centred translation (UCT) approach, which emphasizes the central role of users in both the production and the evaluation of translation. They assert that the UCT model can help the translation industry compete in the market and meet the needs of clients.

The usability of MT output has been investigated in a number of studies (see Section 2.2.4.2). Similarly, performance-based measures (objective or subjective) are used to assess how users use the final product or service with translated content. They provide real usage data and are often adopted in the localization industry. Usability, performance, and acceptability are user-centred concepts. Castilho et al. (2018, p.20) treat acceptability as a part of the concept of usability and define it as “the degree to which the target or output text meets the needs and expectations of its reader(s) or user(s)”. A related term, more commonly used in AVT research, is ‘reception’. Section 2.4 has elaborated on viewers’ reception of subtitles, which is similar to usability evaluation to some extent.

An example of usability evaluation for subtitles is the transLectures project (see Section 2.3.2). According to Orlič et al. (2014), the quality of the subtitles was evaluated by five undergraduate and graduate students of translation studies at a Slovenian university through the transLectures player. The students participated in a two-phase evaluation process with the help of a Spanish university. Whenever they spotted an error of transcription or translation, they could pause to post-edit it. In addition, the power users and authors of transLectures were free to edit the translations and transcriptions. After the two testing phases, the results were evaluated using RTF and WER (see Section 2.3.2 for their definitions).

Another example is the ALST project, where Ortiz-Boix and Matamala (2015) propose a quality assessment of post-edited and human-translated wildlife documentary films from English to Spanish at three levels: language experts’ assessment, dubbing studio’s assessment, and end users’ assessment. The end users’ assessment is the highlight of their research. Based on a mixture of holistic and analytic approaches (Lommel et al., 2014), the language experts carried out the experiment in three evaluation rounds. The dubbing studio received all the scripts and videos, and made a professional recording according to standard procedures. A researcher made observations and collected data on changes made by the dubbing director during the recording process. Following Gambier’s (2009) reception model, the last level of the research was carried out with end users from three perspectives: understanding, enjoyment, and preferences. Data was collected through both pre-task and post-task questionnaires. The pre-task questionnaire collected end users’ demographic information, while the post-task questionnaire collected data on their comprehension, enjoyment, and preferences. Results show that human-translated texts performed better than post-edited texts, but the difference was not significant and could not be considered meaningful.

While usability evaluation is important, Castilho et al. (2018) acknowledge that, because of limited resources, evaluators for research-based MT quality assessment are mostly students and amateurs who have never received any professional training. In fact, “it appears from the data available that professional or trained evaluators are the exception in MTE tasks, rather than the rule” (p.23). They report that TQA may be conducted individually, in groups, or in crowds (specifically referring to large-scale crowdsourced evaluation campaigns). The more people involved, the more time and money will be required, but the evaluation will have higher validity. They also admit that it is not really practical to conduct large-scale acceptability tests with the real end users of MT, so crowdsourced evaluation is often the chosen option. Graham et al. (2017) agree that crowdsourced evaluation is now more common for MT, especially for large-scale projects. Mitchell (2015) also agrees that crowd evaluation is gaining popularity, usually with the aid of Amazon Mechanical Turk.

However, evaluation can be considerably subjective and inconsistent when evaluators have no professional training. In this case, the results of their TQA usually have low intra- and inter-annotator agreement (see the discussion at the end of Section 3.3.1.2). Even evaluators who have received professional training can show low inter-annotator agreement. For example, Jia et al. (2019b) recruited 30 first-year postgraduate students specializing in translation to compare post-edited NMT output with human translation (English to Chinese). The kappa scores for fluency and accuracy were 0.0334 and 0.0744 respectively, which indicates only slight agreement among the 30 evaluators according to Landis and Koch’s (1977) standard. The low agreement is perhaps because postgraduate translation students are still novices rather than experienced professionals.
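To make the interpretation of such kappa scores concrete, the following minimal Python sketch computes a multi-rater agreement coefficient and maps it onto the Landis and Koch (1977) bands. It assumes Fleiss' kappa, since the text does not state which kappa variant Jia et al. (2019b) used; the function names and the toy rating matrix are hypothetical and serve only as illustration.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (assumed kappa variant)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1 - p_e)


def landis_koch_label(kappa: float) -> str:
    """Map a kappa value onto the Landis and Koch (1977) agreement bands."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical toy data: 2 segments, 3 raters, 2 quality categories.
toy_counts = [[2, 1], [1, 2]]
toy_kappa = fleiss_kappa(toy_counts)
print(toy_kappa, landis_koch_label(toy_kappa))

# The two scores reported in the text both fall into the lowest positive band.
print(landis_koch_label(0.0334), landis_koch_label(0.0744))
```

Under this banding, any value between 0 and 0.20 counts as slight agreement, which is why both reported scores receive the same label despite one being roughly twice the other.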

Human evaluation can thus result in low inter-annotator agreement. This does not mean that such TQA is useless, but rather that the validity of the study will be affected to some extent. To better assess the quality of MT output, triangulation is necessary. Saldanha and O’Brien (2013, p.23) endorse the practice of triangulation in translation studies and acknowledge that it can help cross-check the results from different sets of data. Therefore, a combination of human evaluation and automatic evaluation can be a good choice. A discussion of automatic evaluation is thus provided in the next section.

In document A reception study of machine translated subtitles for MOOCs (Page 90-94)