System Combination Approaches - Hybrid machine translation using binary classification models t

As we have seen in the previous section, machine translation is possible using various methodological paradigms, each having its individual strengths and weaknesses. Considering the wide range of available methods, it makes sense to think about combining translation output, which is also called multi-engine machine translation (MEMT) or hybrid MT. Think, for instance, about lexical coverage in a machine translation model. For statistical systems, it depends only on the contents of the parallel training data. “The more data, the better” is a phrase often heard in this respect. The situation is different for rule-based systems. As RBMT engines rely on linguistically informed rules and knowledge bases whose production is both expensive and time-consuming, their lexical coverage cannot adapt to new trends as quickly as that of SMT systems.

16 Typically, a system is considered hybrid if it combines translation output which has been generated by MT systems implementing different technological paradigms, e.g., statistical and rule-based MT. We use the term hybrid in a broader sense and include system combination approaches which work on only one methodological paradigm.

[Figure 1.2 labels: candidate translations T1, T2, T3, ..., TN; system combination; source S; output T0]

Figure 1.2: Overview on the basic problem setting for system combination

                       Linguistic phenomena
Paradigm   Syntax,      Structural   Lexical     Lexical      Lexical
           Morphology   Semantics    Semantics   Adaptivity   Reliability
RBMT       ++           +            −           −−           +
SMT        −            −−           +           +            −

Table 1.1: Informal comparison of RBMT and SMT methodologies w.r.t. their strengths and weaknesses. Adapted from a EuroMatrix Plus presentation.

Complementary Strengths and Weaknesses

Quite often, the errors made by different MT methods are complementary. Thus, in theory, a clever combination of multiple candidate translations could yield a translation of improved overall quality. Table 1.1 gives a quick overview of how (1) rule-based and (2) statistical machine translation methods handle a selection of linguistic phenomena. Note how strengths and weaknesses are indeed complementary. The table has been adapted from the FP7-funded research project EuroMatrix Plus (ICT 231720), which conducted research on hybrid machine translation. More information on activities in EM+ WP2 can be found in the corresponding project reports and publications, e.g., [Federmann et al., 2011] or [Wolf et al., 2011]. One of the lessons learnt during our work on the combination of RBMT and SMT technology was that integration on a deeper linguistic level is a complex task. The addition of parallel data extracted from a given SMT system proved to be problematic, as these phrases contained too little linguistic annotation to be included in the lexical resources of the rule-based engine. This also limited the amount of data which could be integrated into the rule-based MT system, as we had to augment phrase pairs with additional linguistic annotation, which in turn meant losing one of the advantages of statistical learning, namely the power of inference from very large data sets. While the research in the EuroMatrix Plus project was successful and resulted in several interesting extensions of the given rule-based system, there remains a lot of potential for future improvements.

Algorithm 1: Decision problem for system combination approaches
Require: set of source sentences S
Require: translation output from N systems, T = {T1, T2, ..., TN}
Ensure: |S| = |T1| = |T2| = ... = |TN|
1: T0 ← compute_best_translation(S, T)    ▷ Compute best translation given input data
2: return T0                              ▷ Return combined translation output

Combination Approaches

There exists a wide range of system combination approaches. Many of these aim at solving the decision problem shown in Algorithm 1. Other approaches may work sequentially or follow some other methodology. In this thesis, we focus on parallel combination by sentence selection.
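The decision problem of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the name compute_best_translation follows Algorithm 1, while the per-sentence callback combine_sentence, the sentence-by-sentence decomposition, and the placeholder strategy pick_longest are hypothetical and only serve to make the interface concrete.

```python
from typing import Callable, List, Sequence

# Hypothetical per-sentence strategy: (source sentence, candidates) -> output
SentenceCombiner = Callable[[str, Sequence[str]], str]


def compute_best_translation(
    S: Sequence[str],
    T: Sequence[Sequence[str]],
    combine_sentence: SentenceCombiner,
) -> List[str]:
    """Combine the outputs of N systems into one translation T0.

    S is the list of source sentences, T[i] is the output of system i.
    Assumption: the combination is decided independently per sentence.
    """
    # "Ensure" clause of Algorithm 1: |S| = |T1| = ... = |TN|
    if any(len(T_i) != len(S) for T_i in T):
        raise ValueError("every system must translate every source sentence")

    T0 = []
    for idx, source in enumerate(S):
        candidates = [T_i[idx] for T_i in T]  # parallel candidates
        T0.append(combine_sentence(source, candidates))
    return T0


def pick_longest(source: str, candidates: Sequence[str]) -> str:
    """Placeholder strategy for demonstration only; not a sensible metric."""
    return max(candidates, key=len)
```

Any concrete combination approach then amounts to a different choice of combine_sentence; sentence selection restricts this callback to returning one of its candidates unchanged.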

System    Candidate translation
Google    Understanding a language is an old dream .
Bing      Understanding a language is an old dream of humanity .
Systran   Understanding a language is an old dream of mankind .
Lucy      The understanding of a language is an old mankind dream .

System combination approaches differ in the design and implementation of compute_best_translation, which computes the final combination result. A common solution to this problem is the application of confusion network decoding. Using word alignment techniques, the individual candidate translations T_i are aligned to a pre-selected, designated translation “backbone” or “skeleton”. The candidate translations form a network, i.e., a connected graph. Edges between different target words are labeled with transition probabilities (in this case, translation probabilities or estimated future costs obtained from the decoding engine), thus spanning the network of all possible generations given the alignment and vocabulary of the set of candidate translations. An example of a confusion network is shown in Figure 1.3.

Note that the possibility of “empty” or so-called ε-transitions allows for the generation of a large number of translations which were not originally contained in the set of candidates. Many of these, however, do not represent valid sentences due to combinatorial effects such as double prepositions, wrong agreement, or other phenomena. On the other hand, confusion network decoding is able to generate translations which contain good parts from several candidate translations, provided their transition probabilities promote the corresponding decoding paths through the network.
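To make the combinatorial effect concrete, the following sketch enumerates all paths through a small, hand-built confusion network for the example sentence. It is a simplification under stated assumptions: the network is modelled as a sequence of independent slots, the slot layout only loosely follows Figure 1.3, and the weights are plain vote shares over the four candidates rather than probabilities from any real decoder. The names EPS, slots, and enumerate_paths are illustrative.

```python
from itertools import product

EPS = ""  # empty transition: contributes no word to the output

# Assumed slot layout; weights are vote shares among four candidates.
slots = [
    {"the": 0.25, EPS: 0.75},
    {"understanding": 1.0},
    {"of": 0.25, EPS: 0.75},
    {"a": 1.0},
    {"language": 1.0},
    {"is": 1.0},
    {"an": 1.0},
    {"old": 1.0},
    {"mankind": 0.25, EPS: 0.75},
    {"dream": 1.0},
    {"of": 0.5, EPS: 0.5},
    {"humanity": 0.25, "mankind": 0.25, EPS: 0.5},
    {".": 1.0},
]

# The four candidate translations, lowercased for comparison.
candidates = {
    "understanding a language is an old dream .",
    "understanding a language is an old dream of humanity .",
    "understanding a language is an old dream of mankind .",
    "the understanding of a language is an old mankind dream .",
}


def enumerate_paths(network):
    """Return {sentence: score} for every path through the network."""
    paths = {}
    for choice in product(*(slot.items() for slot in network)):
        words = [word for word, _ in choice if word != EPS]
        score = 1.0
        for _, weight in choice:
            score *= weight  # path score = product of edge weights
        sentence = " ".join(words)
        paths[sentence] = paths.get(sentence, 0.0) + score
    return paths


paths = enumerate_paths(slots)
novel = set(paths) - candidates  # sentences no system actually produced
```

In this toy network, paths contains 48 distinct sentences, of which only four are the original candidates. Among the 44 novel ones are clearly invalid strings such as "understanding a language is an old dream of ." and "understanding a language is an old mankind dream of mankind .", which illustrates why decoding must rely on the transition weights to avoid such paths.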

We will provide a more detailed discussion of this technique as part of our literature review in Chapter 2. As a side note, it is also possible to apply confusion network decoding to the task of n-best list re-ranking for a single machine translation system. System combination using confusion networks has become the dominant methodological approach in recent years, e.g., in the system combination tasks undertaken as part of the Workshop on Statistical Machine Translation (WMT ’09–’11).

[Figure 1.3 graph: nodes 0 to 12, with edges labeled the, understanding, of, a, language, is, an, old, mankind, dream, of, humanity/mankind, and the final period]

Figure 1.3: Example of a confusion network graph encoding four different translations of the German sentence “Das Verstehen einer Sprache ist ein alter Menschheitstraum”. Note how many invalid sentences can be produced. Dashed arrows illustrate the effect of additional, general transitions between nodes which would contribute further derivations, many of which are invalid.

Sentence Selection

As we have remarked in the previous section, transitions in a confusion network may lead to ungrammatical and erroneous translations. This holds even if all given candidate sentences were perfect translations, due to the “generative” nature of the decoding process. It is also clear that any errors introduced in the word alignment phase can propagate and may degrade the resulting translation. The selection of the underlying translation backbone has an influence on the outcome as well.

Considering these shortcomings, there has also been (somewhat scarce) research on the problem of sentence selection. Given a set of candidate translations, the combined translation is computed by selecting the best among the given candidates, in unaltered form: e pluribus unum, immutatum. It is obvious that such an approach cannot “fuse” phrasal phenomena from multiple sentences into the final translation. On the other hand, it is also impossible for the chosen translation to be altered in any way, i.e., potentially degraded in terms of translation quality.
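Sentence selection can be sketched as follows. The scoring heuristic used here, picking the candidate with the highest average word overlap (Jaccard similarity) with the other candidates, is an assumption made purely for illustration; it is neither the classification-based method developed in this thesis nor the approach of the cited prior work. The function names tokens, jaccard, and select_sentence are hypothetical.

```python
from typing import Sequence, Set


def tokens(sentence: str) -> Set[str]:
    """Lowercased whitespace tokens of a sentence, as a set."""
    return set(sentence.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_sentence(candidates: Sequence[str]) -> str:
    """Return one candidate, unaltered, chosen by a consensus heuristic."""
    if not candidates:
        raise ValueError("need at least one candidate translation")
    if len(candidates) == 1:
        return candidates[0]

    token_sets = [tokens(c) for c in candidates]

    def consensus(i: int) -> float:
        # Mean similarity of candidate i to all other candidates
        others = [
            jaccard(token_sets[i], token_sets[j])
            for j in range(len(candidates))
            if j != i
        ]
        return sum(others) / len(others)

    best = max(range(len(candidates)), key=consensus)
    return candidates[best]  # selection only: the sentence is not modified


candidates = [
    "Understanding a language is an old dream .",
    "Understanding a language is an old dream of humanity .",
    "Understanding a language is an old dream of mankind .",
    "The understanding of a language is an old mankind dream .",
]
chosen = select_sentence(candidates)
```

For the four example candidates, this heuristic returns the third sentence ("... an old dream of mankind ."), since it shares the most vocabulary with the others. Whatever scoring function is used, the result is always one of the input sentences, so no new errors can be introduced, but good fragments of the other candidates cannot be merged in either.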

Selection mechanisms have been studied by [Hildebrand and Vogel, 2008; Hildebrand and Vogel, 2009; Hildebrand and Vogel, 2010]. Overall, however, interest in this research topic has been limited, most likely due to the prevalence of the aforementioned confusion-network-based methods. Improved techniques able to solve the selection problem on the sentence level could also be applied on the level of sub-phrases, making them an interesting area for further research in our view.

In document Hybrid machine translation using binary classification models trained on joint, binarised feature vectors (Page 35-40)