
Chapter 6: Statistical Post-editing

6.2.4 Further Exploration

To use automatic metrics to filter out equal or indistinguishable translations, we further explored the data gathered from the human evaluation to answer the following question: how large must the difference between two automatic scores be for the number of ties assigned by human evaluators to be reduced to a minimum?

Generally speaking, we assume that the greater the difference between two automatic scores, the less likely human evaluators are to judge the two translations as tied. To check whether this is true, we broke down the score difference into ten intervals (such as 0-0.1) and calculated the percentage of times that humans assigned ties to the pairs within each interval. For example, if the GTM (e=1) scores for two translations are 0.64 and 0.53 respectively, the difference is 0.11, which falls into the second interval, (0.1, 0.2].

Figure 6.7 plots the percentage of ties assigned by human evaluators to the translations within each score difference interval.

[Plot not reproduced. x-axis: Score Difference Intervals (0 to 1); y-axis: Percent of Ties (10 to 50); series: GTM (e=1), GTM (e=1.2), BLEU, TER]

Figure 6.7: Percentage of ties assigned within each score interval

For every automatic metric, when the difference between two scores is no greater than 0.1, about 50% of the translations are evaluated as ties by human evaluators. Even when the difference between two scores is greater than 0.2, around 30% of the translations are still evaluated as ties. For TER and BLEU, even when the scores of two translations differ by more than 0.5, a considerable percentage of translations may still be judged as ties by human evaluators.

The detailed percent of ties in each difference interval for each automatic metric is reported below in Table 6.20.

Score Difference Interval   GTM (e=1)   GTM (e=1.2)   TER      BLEU
[0, 0.1]                    49.63       50.89         51.50    51.17
(0.1, 0.2]                  30.77       32.43         32.56    32.47
(0.2, 0.3]                  24.41       25.98         26.77    31.36
(0.3, 0.4]                  22.95       25.21         21.28    29.91
(0.4, 0.5]                  8.33        18.03         24.22    26.37
(0.5, 0.6]                  /           12.5          13.33    21.48
(0.6, 0.7]                  /           10            33.33    12.80
(0.7, 0.8]                  /           /             /        9.17
(0.8, 0.9]                  /           /             /        14.29

Table 6.20: Percent of ties within each score difference interval
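The per-interval tie percentages can be computed along the following lines. This is only a minimal sketch: the function name, the input layout (tuples of two scores and a tie flag) and the rounding used to suppress floating-point noise are assumptions made for illustration, not the procedure actually used in the study. The bins are closed on the right, following the interval notation of Table 6.20.

```python
import math
from collections import defaultdict


def tie_percent_by_interval(pairs, width=0.1):
    """Percentage of human-assigned ties per score-difference interval.

    pairs: iterable of (score_a, score_b, is_tie) tuples (hypothetical layout).
    Returns {(low, high): percent_of_ties} for [0, 0.1], (0.1, 0.2], ...
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [ties, total]
    for score_a, score_b, is_tie in pairs:
        # Round to suppress floating-point noise such as 0.10999999999999999.
        diff = round(abs(score_a - score_b), 6)
        # Right-closed bins: a difference of exactly 0.2 belongs to (0.1, 0.2].
        index = max(math.ceil(round(diff / width, 6)) - 1, 0)
        counts[index][0] += int(is_tie)
        counts[index][1] += 1
    return {
        (round(i * width, 1), round((i + 1) * width, 1)): 100.0 * ties / total
        for i, (ties, total) in sorted(counts.items())
    }


# Toy data, not taken from the study.
example = [
    (0.64, 0.53, False),  # difference 0.11 -> (0.1, 0.2]
    (0.60, 0.55, True),   # difference 0.05 -> [0, 0.1]
    (0.60, 0.58, False),  # difference 0.02 -> [0, 0.1]
    (0.70, 0.50, True),   # difference 0.20 -> (0.1, 0.2]
]
print(tie_percent_by_interval(example))
```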

Based on Table 6.20 and Figure 6.7, we conclude that GTM (e=1), which yields the lowest percentage of ties in most intervals, is the most suitable metric in this study for selecting distinguishable translation pairs for evaluation. This is also supported by our previous examination, which showed that GTM (e=1) correlates best with human evaluation at sentence level. In the analysis below, GTM refers to GTM (e=1) unless otherwise specified.

From Table 6.20 we can see that only when two GTM scores differ by more than 0.2 are fewer than one third of the evaluated translations judged as ties by human evaluators. Therefore, if an evaluation focuses on the differences between translations from different systems, GTM scores at sentence level can be compared first, and the pairs whose scores differ by more than 0.2 can be selected and evaluated. The proposal can be stated as follows:

If an evaluation is to compare translations (for example, A and B) and the purpose is to reveal the improvements and degradations of A compared to B, in order to avoid a lot of ties from human evaluation which are of little value to the purpose of the evaluation, the GTM scores of the two translations at sentence level can be compared first, and then only those sentence pairs where the GTM scores of A and B differ by more than 0.2 should be retained for human evaluation.
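The proposal amounts to a simple filter over sentence-level score pairs. The sketch below assumes that each sentence pair is a dictionary with the hypothetical keys "gtm_a" and "gtm_b" holding the sentence-level GTM scores of translations A and B; the function name and the optional inclusive flag are likewise illustrative.

```python
def select_pairs(sentence_pairs, threshold=0.2, inclusive=False):
    """Keep sentence pairs whose GTM scores differ by more than `threshold`.

    sentence_pairs: iterable of dicts with hypothetical keys 'gtm_a', 'gtm_b'.
    inclusive: if True, also keep pairs that differ by exactly `threshold`.
    """
    selected = []
    for pair in sentence_pairs:
        # Round to suppress floating-point noise in the subtraction.
        diff = round(abs(pair["gtm_a"] - pair["gtm_b"]), 6)
        if diff > threshold or (inclusive and diff == threshold):
            selected.append(pair)
    return selected


# Toy data, not taken from the study.
pairs = [
    {"id": 1, "gtm_a": 0.80, "gtm_b": 0.50},  # difference 0.30
    {"id": 2, "gtm_a": 0.64, "gtm_b": 0.53},  # difference 0.11
    {"id": 3, "gtm_a": 0.70, "gtm_b": 0.50},  # difference 0.20
]
kept = select_pairs(pairs)
print([p["id"] for p in kept], f"{100 * len(kept) / len(pairs):.2f}% retained")
```

Setting inclusive=True corresponds to a variant that also retains pairs whose scores differ by exactly the threshold value.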

Undoubtedly, which score difference interval is most suitable for a study depends on the purpose, size, time and budget of the study. If researchers or system developers want to compare the similarities between translations, then sentences with a score difference below 0.1 can be selected for comparison. Alternatively, if one wants to focus on the benefits and drawbacks of one approach compared to a Baseline system, then translations with a score difference above 0.1 (or 0.2) can be selected.

We ran a post-hoc test on the effect of this method before applying it to our further experiments. As mentioned, all the sentences and translations were expanded into pairwise comparisons. We extracted translation pairs and their human rankings if the GTM scores of the two translations in a pair differed by 0.2 or more. We chose the upper bound (0.2) of the second interval ((0.1, 0.2]) because, while 30.77% of the translations within this interval are ties, most of these ties occur below the 0.2 value: just 6.6% of the ties within this interval were assigned when the two scores differed by exactly 0.2. Using 0.2 as the threshold, we can select more translations to be evaluated while keeping the number of possible ties to a minimum.

Using this filtering criterion, only 11.54% of all translations were left for evaluation. Since we already had their automatic scores and their human rankings, we could verify the validity of this filtering method by posing the following two questions. First, are the overall rankings of the four systems (Baseline, SPED, SPEP and SPEF) on the extracted pairs consistent with the conclusions we drew from the whole sample? Second, are the inter-evaluator correlation and the correlation between automatic and human evaluation improved?

As described above, we calculated the percentage of times that each system was evaluated as better than any other system. The numbers are reported in Table 6.21.

System      Percentage of times
SPED        40.63%
SPEP        33.98%
SPEF        19.53%
Baseline    5.86%

Table 6.21: Percent of times that one system was evaluated as better than the others in the post-hoc test
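A figure of this kind can be obtained by counting, over all pairwise comparisons, how often each system was preferred. The sketch below assumes that ties are excluded and that the counts are normalised over all non-tied comparisons, which is one reading consistent with the percentages in Table 6.21 summing to 100%; the input layout and function name are hypothetical.

```python
from collections import Counter


def share_of_wins(comparisons):
    """Percentage of non-tied comparisons won by each system.

    comparisons: iterable of (system_x, system_y, winner) tuples, where
    winner is one of the two system names, or None for a tie (assumed layout).
    """
    wins = Counter(winner for _, _, winner in comparisons if winner is not None)
    total = sum(wins.values())
    # most_common() orders systems from most to fewest wins.
    return {system: 100.0 * n / total for system, n in wins.most_common()}


# Toy data, not taken from the study.
example = [
    ("SPED", "Baseline", "SPED"),
    ("SPED", "SPEP", "SPED"),
    ("SPEP", "Baseline", "SPEP"),
    ("SPEF", "Baseline", "SPEF"),
    ("SPEF", "SPEP", None),  # tie: ignored
]
print(share_of_wins(example))
```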

The sequence of the four systems shown in Table 6.21 is consistent with our previous conclusion, i.e. SPED is the best and the Baseline is the worst. There is a smaller difference between SPED and SPEP compared to the differences between the other pairs. In other words, evaluating 11.54% of all translations leads to the same conclusion as evaluating all translations. The results verify the usability of our proposal with respect to saving evaluation time and cost.

The second aim of reducing the number of ties is to increase the inter-evaluator correlation in order to obtain more valid results. To test this, we calculated the inter-evaluator correlation of the four evaluators on the selected translation pairs. Agreement among the four evaluators was high in terms of majority votes: 66.07% of all pairs received a majority vote. The Kappa value for the inter-evaluator agreement is 0.336, compared to 0.273 obtained from all translations. Although both Kappa values fall into the category of Fair agreement, there is a noticeable improvement. The advantage is that no evaluation result has to be discarded in order to increase the correlation level.
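For readers who wish to reproduce an agreement figure of this kind, the sketch below computes Fleiss' kappa, a common agreement measure for more than two raters. The text does not state which Kappa variant was used, so the choice of Fleiss' kappa, the three-category layout (A better, B better, tie) and the function name are all assumptions.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a fixed number of raters per item.

    ratings: list of rows, one per item; each row gives the number of raters
    who chose each category, e.g. [A better, B better, tie] (assumed layout).
    Every row must sum to the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Share of all assignments that fall into each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_items) / n_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data for four raters, not taken from the study.
print(fleiss_kappa([[4, 0, 0], [0, 4, 0]]))  # complete agreement
```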

Finally, we also checked the correlation between automatic and human evaluation. Since we used GTM (e=1) to filter out sentences, we can only check the consistency level between GTM and human evaluation on the extracted pairs of translations. Following the same procedure described in Section 6.2.3.3, we checked the number of sentences for which the automatic and human evaluation agreed with each other (Table 6.22). This consistency level is slightly higher than that reported in Table 6.19 (66%).

             H1        H2        H3        H4        Average
GTM (e=1)    79.08%    90.56%    32.91%    67.65%    67.55%

Table 6.22: Consistency level between automatic score and human evaluation in the post-hoc test

In summary, using our proposed criterion to reduce the number of sentences to be evaluated by humans not only saves time and resources but also improves validity. This proposal can easily be applied to other studies. However, as mentioned, the specific values depend on the purpose of the study and the available resources.

6.3 Summary

We conclude from our analysis that the unmodified general SPE system can produce better translations than the Baseline RBMT system at both sentence level and preposition level. A modification to the phrase table of the SPE system failed to outperform the unmodified SPE system but generated significantly better translations than the Baseline system, especially at sentence level. The preposition error most frequently corrected by the general SPE system was Incorrect Position. Incomplete translation of prepositions, especially "in" or "on", was the second most frequently corrected error. One of the modified SPE modules did bring some unique improvements. The main difference between SPED, SPEP and SPEF is the size of their phrase tables. Results show that the more phrases an SPE module contains (i.e. the more information that is present in the phrase table), the better the translation of prepositions. In other words, the translation of prepositions does not happen in isolation but is closely related to the translation of other parts of a sentence. Therefore, in our further research, we proposed general approaches instead of ones that are preposition focused.

We also found that one major factor influencing the correlation between human evaluators is the presence of indistinguishable translations. Some studies propose discarding the results of the evaluator that correlates worst with the others in order to obtain reliable results. However, this approach is not suitable for experiments with a limited number of evaluators. Moreover, it does not help to save time and cost. Therefore, we propose making the evaluation purpose-specific and reducing the number of evaluations so that the evaluation task is simplified. This approach is advantageous in the following ways. First, it could save time and resources. Second, by simplifying the evaluation process, the reliability of the results could be enhanced without discarding data. Finally, it could help system developers to determine whether an improvement in an automatic score is significant and whether, or how, human evaluation should be conducted.

Another important by-product of the current study is the finding that, for Chinese IT document evaluation, all three automatic metrics involved in this study correlated well with human evaluation at system level. However, at sentence level, GTM (e=1) stands out as the best automatic metric.

Chapter 7: Dictionary Customisation and Source