
4.5.6 User Feedback collected from web users

The previous experiment evaluated the proposed strategy with real feedback, i.e. the Augmented Corpus was built using post-editions of the output translations instead of simulating feedback with a target reference. However, this feedback came from machine translation researchers, who tried to write post-editions that were as close as possible to the original translation in order to learn new features for the WMT 2012 evaluation metrics campaign. Therefore, all the data provided could be safely used to improve the SMT system, just like the target references in the previous experiments.

This section describes a more realistic scenario, extracted from the results of Barrón-Cedeño et al. [7], in which the system was made available to the general public through a website and users from all over the world translated sentences of any domain and provided feedback as they wished, producing a very noisy collection of data [69]. The paper proposes a Support Vector Machine (SVM) classifier for feedback filtering, and we collaborated in that research by applying the Derived Units strategy to the filtered data in order to evaluate its effect on translation quality.

                          BLEU    NIST   TER     METEOR
    FAUST Raw
      Baseline            34.47   8.28   51.76   54.87
      Feedback filtered   35.22†  8.41*  50.19   55.83
      Feedback w/o filter 34.41   8.27   51.15   55.32
    FAUST Clean
      Baseline            38.64   8.68   46.91   58.42
      Feedback filtered   39.49†  8.80*  45.42   59.31
      Feedback w/o filter 38.61   8.64   46.80   58.60

Table 4.15: Results for the feedback experiments. '*' and '†' indicate confidence levels of 0.99 and 0.95, respectively.
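The details of the classifier belong to [7]; purely as a rough illustration of the idea, the sketch below trains an SVM to separate usable from noisy feedback. The feature set (length ratio, similarity to the system output, share of non-alphanumeric characters) and the toy data are invented for this sketch and are not the features used in [7].

    # Hypothetical sketch of an SVM-based feedback filter; features and
    # training data are illustrative only, not those of [7].
    from difflib import SequenceMatcher
    from sklearn import svm

    def feedback_features(system_output, user_feedback):
        """Map one (system output, user feedback) pair to a feature vector."""
        length_ratio = len(user_feedback) / max(len(system_output), 1)
        similarity = SequenceMatcher(None, system_output, user_feedback).ratio()
        noise = sum(not (c.isalnum() or c.isspace()) for c in user_feedback)
        noise_ratio = noise / max(len(user_feedback), 1)
        return [length_ratio, similarity, noise_ratio]

    # Toy training pairs labelled 1 (useful correction) or 0 (noise).
    train = [
        ("the house is blue", "the house is light blue", 1),
        ("the house is blue", "asdf!!! visit my site", 0),
        ("he go to school", "he goes to school", 1),
        ("he go to school", "????", 0),
    ]
    X = [feedback_features(out, fb) for out, fb, _ in train]
    y = [label for _, _, label in train]
    clf = svm.SVC(kernel="rbf").fit(X, y)

    def keep(system_output, user_feedback):
        """True if the classifier predicts the feedback is usable."""
        return clf.predict([feedback_features(system_output, user_feedback)])[0] == 1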

The filtering was applied over a collection of 6.6K user feedback instances provided by Reverso from its MT weblogs. After receiving the filtered feedback, we used it to build an Augmented Corpus to enhance the translation model of the baseline Phrase-based SMT system by means of the Derived Units strategy. For the alpha parameter of the Derived Units strategy (used to weight the contributions of the baseline and derived phrase tables) we chose α = 0.60, based on the experience of the previous experiments. Also, as in the previous experiments, the language model was a combination of language models from news, UN and Europarl data, with the addition of a new language model built with monolingual data provided by Reverso. The development set was the FAUST clean corpus. The systems were evaluated with two test sets: the raw and the clean input of the FAUST test corpus, each with its corresponding translation as reference. Details on the Augmented Corpus, the monolingual data for the language model and the FAUST corpus used for development and testing can be found in Appendix A.5.
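As an illustration of how the alpha parameter can act, the following sketch linearly interpolates two Moses-style phrase tables, weighting the baseline scores by α = 0.60. The file names and the simple linear combination are assumptions of this sketch; the exact combination used by the Derived Units strategy is the one defined earlier in this thesis.

    # Hypothetical sketch: linear interpolation of a baseline and a derived
    # phrase table, with alpha weighting the baseline contribution.
    ALPHA = 0.60

    def load_phrase_table(path):
        """Read Moses-style 'src ||| tgt ||| s1 s2 ...' lines into a dict."""
        table = {}
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                src, tgt, scores = line.split(" ||| ")[:3]
                table[(src, tgt)] = [float(s) for s in scores.split()]
        return table

    baseline = load_phrase_table("baseline.phrase-table")  # assumed file names
    derived = load_phrase_table("derived.phrase-table")

    merged = {}
    for pair in set(baseline) | set(derived):
        width = len(baseline.get(pair) or derived.get(pair))
        b = baseline.get(pair, [0.0] * width)  # pairs unseen in a table get 0
        d = derived.get(pair, [0.0] * width)
        merged[pair] = [ALPHA * pb + (1.0 - ALPHA) * pd for pb, pd in zip(b, d)]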

Table 4.15 presents the results on the FAUST test sets for three system configurations: the baseline system (Baseline), a derived system using the filtered feedback (Feedback filtered) and a derived system using all the feedback, i.e. without the filtering proposed in [7] (Feedback w/o filter). The automatic metrics used to measure the performance of the different systems were BLEU, NIST, METEOR and TER. In the table, the symbols '*' and '†' indicate confidence levels of 0.99 and 0.95, respectively.
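The significance test behind these marks is not restated here; paired bootstrap resampling over the test set is a common choice for MT metrics, and the sketch below shows the idea for BLEU, using sacrebleu as a stand-in scorer. Treat it as illustrative, not as the exact procedure used in this chapter.

    # Sketch of paired bootstrap resampling for BLEU; whether this exact
    # test produced the '*'/'†' marks is an assumption of this sketch.
    import random
    import sacrebleu  # any corpus-level BLEU implementation would do

    def paired_bootstrap(sys_a, sys_b, refs, trials=1000):
        """Fraction of resampled test sets on which system A beats system B."""
        n = len(refs)
        wins = 0
        for _ in range(trials):
            idx = [random.randrange(n) for _ in range(n)]
            bleu_a = sacrebleu.corpus_bleu([sys_a[i] for i in idx],
                                           [[refs[i] for i in idx]]).score
            bleu_b = sacrebleu.corpus_bleu([sys_b[i] for i in idx],
                                           [[refs[i] for i in idx]]).score
            wins += bleu_a > bleu_b
        return wins / trials  # > 0.95 / > 0.99 corresponds to '†' / '*'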

From the results obtained in this experiment, we observe that introducing feedback without any filtering (the "Feedback w/o filter" rows in Table 4.15) does not improve the MT system performance: noisy feedback cannot simply be added as-is. In the previous experiments we were confident in using all the feedback (and obtained significant improvements in translation quality) because it was either simulated with translation references or collected from post-editions made by human experts for research purposes. From these results we can conclude that the feedback filtering approach proposed by Barrón-Cedeño et al. [7], together with the Derived Units strategy described in this thesis, introduces a significant improvement over the baseline system, providing a learning methodology well suited to the problem of user feedback.

Qualitative Analysis of the Adapted System

In addition to the automatic results reported in [7], we now present a human analysis focused on the different phenomena that the system was able to fix and adapt to.

The analysis was done by five experts who studied 414 translation triplets from the 50% selected feedback (FAUST Clean), each including: the source language sentence, the baseline translation and the translation produced by the improved system. For each sentence, the annotators were asked to compare the "baseline" and "improved" translations. The possible outcomes of the comparison were "better" (if the "improved" output was better than the baseline), "worse" (in the opposite case), "same" (if the change did not alter the general quality) and "ambiguous / can't say" (if it was not possible to determine whether the change was for better or worse).

The comparison was done at two levels (see the sketch after this list):

1. Overall adequacy and fluency.

2. Detailed level, identifying changes in different phenomena:

   (a) Function words: insertion, replacement or deletion of function words.

   (b) Word fertility: insertion, replacement or deletion of non-function words.

   (c) Lexical changes: choosing a different translation for the same source phrase, and Out-Of-Vocabulary (OOV) translations that the baseline could not translate.

   (d) Reordering.

   (e) Morphology: changes in person, gender, number and tense of verbs and nouns.

   (f) Harmful elements introduced into the model (e.g. a mistranslation caused by bad feedback).
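To make the bookkeeping concrete, the following sketch tallies such judgments per phenomenon and computes the share of triplets a phenomenon touches at all (Better+Same+Worse), the figure quoted in the discussion below. The record format is an assumption of the sketch, not the annotation tool actually used.

    # Hypothetical record format: one (phenomenon, label) pair per judgment.
    from collections import Counter

    LABELS = ("better", "worse", "same", "cant_say")
    PHENOMENA = ("function_words", "word_fertility", "lexical_changes",
                 "reordering", "morphology", "harmful_element")

    def tally(judgments):
        """Count labels per phenomenon from (phenomenon, label) pairs."""
        counts = {p: Counter() for p in PHENOMENA}
        for phenomenon, label in judgments:
            counts[phenomenon][label] += 1
        return counts

    def affected_share(counts, phenomenon, total_triplets):
        """Share of triplets a phenomenon touched (better + same + worse)."""
        c = counts[phenomenon]
        return (c["better"] + c["same"] + c["worse"]) / total_triplets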

Finally, we selected 10 translation triplets common to all annotators in order to compute inter-annotator agreement according to Cohen's kappa, obtaining κ = 0.57. Results are presented in Table 4.16.
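Cohen's kappa is defined for a pair of annotators; with five annotators, one common convention (an assumption here, since the exact aggregation is not restated) is to average the pairwise kappas over the shared triplets. A minimal sketch:

    # Cohen's kappa for one annotator pair, plus a pairwise average.
    from collections import Counter
    from itertools import combinations

    def cohens_kappa(a, b):
        """(observed agreement - chance agreement) / (1 - chance agreement)."""
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n
        count_a, count_b = Counter(a), Counter(b)
        chance = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
        return (observed - chance) / (1 - chance)

    def mean_pairwise_kappa(annotations):
        """Average kappa over all annotator pairs (one label list each)."""
        pairs = list(combinations(annotations, 2))
        return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)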

From the qualitative analysis results, we observed lexical translation to be the main aspect affected by the use of feedback (≈ 60% Better+Same+Worse). These changes mostly came from corrections of mistranslations rather than from minimizing the impact of OOVs, which were reduced by only 0.3%. Other aspects significantly affected by our methodology are reordering (22%) and morphology (20%). In third place come the function words (14%) and the insertion/deletion
