
2.2 Statistical Machine Translation

2.2.3 Phrase-Based Model

Phrase-based machine translation (PBMT) models [Koehn et al., 2003] partially overcome the strong independence assumptions that are imposed to make word-based models efficient. By using phrases instead of words as atomic translation units, local context can be encoded efficiently.

The starting point for the phrase-based approach is a set of symmetrized many-to-many word alignments, generated by running a word alignment algorithm in both translation directions. From this joint alignment, phrase-pairs (aligned sub-sequences of source and target words) can be extracted. A standard requirement for phrase-pairs is consistency with the word alignment, as defined in Och et al. [1999]: a phrase-pair is consistent iff all source words of the phrase are aligned only to words of the target phrase (including Null), and vice versa.
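The consistency criterion can be sketched in a few lines of Python. This is a minimal illustration, not the extraction procedure of any particular toolkit: the function names, the span representation, and the `max_len` limit are assumptions made here, and the additional requirement that a phrase-pair contains at least one alignment link is a common convention rather than something stated above.

```python
def is_consistent(alignment, i1, i2, j1, j2):
    """Check a candidate phrase-pair against a word alignment.

    alignment: set of (i, j) links (source index i, target index j).
    The source span is [i1, i2], the target span is [j1, j2], both inclusive.
    Unaligned words (aligned to Null) never violate consistency.
    """
    has_link = False
    for i, j in alignment:
        src_in = i1 <= i <= i2
        tgt_in = j1 <= j <= j2
        if src_in != tgt_in:
            # A link leaves the phrase-pair on exactly one side.
            return False
        if src_in:
            has_link = True
    # Assumption: require at least one link inside the pair.
    return has_link


def extract_phrase_pairs(src_len, tgt_len, alignment, max_len=4):
    """Enumerate all consistent phrase-pairs up to max_len words per side."""
    pairs = set()
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            for j1 in range(tgt_len):
                for j2 in range(j1, min(j1 + max_len, tgt_len)):
                    if is_consistent(alignment, i1, i2, j1, j2):
                        pairs.add(((i1, i2), (j1, j2)))
    return pairs


# Toy example (hypothetical data): "das Haus" <-> "the house", aligned word by word.
toy_alignment = {(0, 0), (1, 1)}
toy_pairs = extract_phrase_pairs(2, 2, toy_alignment)
```

In the toy example, only the two single-word pairs and the full two-word pair survive; every other span combination cuts through an alignment link. The brute-force enumeration is chosen for readability and is far slower than what a real extractor would use.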

Estimation of the translation model is trivial for this model, since an explicit alignment is given. It can simply be estimated through MLE:

$$p(\tilde{f} \mid \tilde{e}) = \frac{c(\tilde{f}, \tilde{e})}{\sum_{\tilde{f}'} c(\tilde{f}', \tilde{e})}, \qquad (2.25)$$

slightly abusing notation, with $\tilde{\cdot}$ denoting phrases and $c(\cdot, \cdot)$ counting how often a phrase-pair was extracted. After extracting phrases from a large sentence-aligned corpus (bitext), the result of this process is a so-called phrase-table, which contains all phrase-pairs that could be extracted and stores them in an efficient manner.
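The relative-frequency estimate of Equation 2.25 amounts to counting and normalizing. The following sketch assumes the extracted phrase-pairs are available as a list of (source phrase, target phrase) string tuples. The nested-dictionary layout and the function name are illustrative choices, not the storage format of an actual phrase-table.

```python
from collections import Counter, defaultdict


def estimate_phrase_table(extracted_pairs):
    """Relative-frequency (MLE) estimate of p(f | e) from extracted phrase-pairs.

    extracted_pairs: iterable of (source_phrase, target_phrase) tuples,
    one entry per extraction event.
    Returns a nested dict: table[target_phrase][source_phrase] = p(f | e).
    """
    pair_counts = Counter(extracted_pairs)

    # Denominator of Equation 2.25: total count of each target phrase.
    target_totals = Counter()
    for (_, e), count in pair_counts.items():
        target_totals[e] += count

    table = defaultdict(dict)
    for (f, e), count in pair_counts.items():
        table[e][f] = count / target_totals[e]
    return table


# Toy extraction events (hypothetical data).
events = [("das Haus", "the house")] * 3 + [("das Heim", "the house")]
phrase_table = estimate_phrase_table(events)
```

Here "the house" was extracted four times in total, three of them with "das Haus", so `phrase_table["the house"]["das Haus"]` is 3/4.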

Decoding in this model breaks down into two phases: First, the source sentence to be translated is segmented according to the source sides of phrases in the phrase-table. All segmentations are considered equally likely at first. In a second step, each source phrase is translated into a target phrase according to the entries of the phrase-table, making sure that each source phrase is translated exactly once. Source phrases can be processed in any order, and the target translation hypothesis is built left to right. This implies that reordering of phrases is allowed on the target side. Since most natural languages do not demand this level of flexibility, Koehn et al. [2003] propose to incorporate a distortion model:

$$d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1} - 1|}, \qquad (2.26)$$

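where $\alpha$ is a free parameter, $a_i$ is the start position of the source phrase yielding the $i$-th target phrase, and $b_{i-1}$ is the end position of the source phrase yielding the $(i-1)$-th target phrase.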

This allows penalizing large "jumps" in the source sentence. Furthermore, another parameter $\omega$, the so-called word penalty, is added to the model; it is applied once for each word produced on the target side.
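As a small numerical illustration of Equation 2.26 and the word penalty, consider the sketch below. It assumes 1-based source positions with $b_0 = 0$, following the convention of Koehn et al. [2003], where $a_i$ is the start of the source phrase translated at step $i$ and $b_{i-1}$ the end of the one translated before it. The values chosen for $\alpha$ and $\omega$ are arbitrary.

```python
def distortion(a_i, b_prev, alpha):
    """Distortion score alpha ** |a_i - b_prev - 1| as in Equation 2.26.

    a_i:    start position of the source phrase translated at step i.
    b_prev: end position of the source phrase translated at step i - 1.
    With 0 < alpha < 1, monotone continuation costs nothing and the
    score decays exponentially with the jump distance.
    """
    return alpha ** abs(a_i - b_prev - 1)


def word_penalty(num_target_words, omega):
    """Word penalty omega ** |e|: one factor omega per produced target word."""
    return omega ** num_target_words


# Arbitrary illustrative parameter value.
ALPHA = 0.5

monotone = distortion(4, 3, ALPHA)       # next phrase starts right after the previous one
forward_jump = distortion(7, 3, ALPHA)   # skips three source words
backward_jump = distortion(1, 5, ALPHA)  # jumps back to the sentence start
```

With $\alpha = 0.5$, the monotone step scores 1, the forward jump over three words scores $0.5^3$, and the backward jump scores $0.5^5$. A value $\omega > 1$ favors longer output and $\omega < 1$ favors shorter output.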

The different models and sub-models in the original phrase-based approach of Koehn et al. [2003] are combined without explicit weighting during decoding (omitting source segmentation):

$$\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e \; p(e)\, \omega^{|e|} \left[ \prod_i p(f_i \mid e_i)\, d(a_i - b_{i-1}) \right], \qquad (2.27)$$

where $|e|$ is the length of the current (partial) translation hypothesis, and $f_i$ is the $i$-th source phrase with its corresponding target side $e_i$. The log-linear model, as introduced by Och and Ney [2002] for phrase-based statistical machine translation, enables the use of a (learned) weighted combination of the different models:

$$\hat{e} = \arg\max_e \; w_1 \log p(e) + w_2 \log \omega^{|e|} + \left[ \sum_i w_3 \log p(f_i \mid e_i) + w_4 \log d(a_i - b_{i-1}) \right]. \qquad (2.28)$$

On the one hand, this allows discriminatively learning task-specific weights [Och, 2003] as well as efficient weighted decoding; on the other hand, from a modeling perspective, it enables adding arbitrary sub-models as features to make the model more expressive. The decision function can, analogously to Equation 2.16, be written more compactly as a single dot product:

$$\hat{e} = \arg\max_e \, \langle w, \phi(e, f) \rangle, \qquad (2.29)$$

where $\phi$ is a function mapping the hypothesis to a joint feature space, just as previously described for the word-based models.
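The relation between Equations 2.27, 2.28, and 2.29 can be checked numerically: with all weights set to 1, the dot product equals the logarithm of the unweighted product model. The sketch below uses invented sub-model scores for a single hypothesis. The feature order and function names are assumptions for illustration only.

```python
import math


def feature_vector(lm_prob, target_len, phrase_probs, distortions, omega):
    """Map one hypothesis to the four log-domain features of Equation 2.28."""
    return [
        math.log(lm_prob),                       # log p(e)
        target_len * math.log(omega),            # log omega^{|e|}
        sum(math.log(p) for p in phrase_probs),  # sum_i log p(f_i | e_i)
        sum(math.log(d) for d in distortions),   # sum_i log d(a_i - b_{i-1})
    ]


def score(weights, phi):
    """Log-linear score <w, phi(e, f)> as in Equation 2.29."""
    return sum(w * x for w, x in zip(weights, phi))


# Invented sub-model scores for one hypothesis with two phrase applications.
lm_prob, target_len, omega = 0.01, 3, 0.9
phrase_probs = [0.5, 0.25]
distortions = [1.0, 0.5]

phi = feature_vector(lm_prob, target_len, phrase_probs, distortions, omega)

# Unweighted product model of Equation 2.27, evaluated for the same hypothesis.
product_model = lm_prob * omega ** target_len
for p, d in zip(phrase_probs, distortions):
    product_model *= p * d

uniform_score = score([1.0, 1.0, 1.0, 1.0], phi)
```

Up to floating-point rounding, `uniform_score` equals `math.log(product_model)`. Changing the weights rescales the influence of each sub-model without touching the sub-models themselves, and adding a further sub-model only requires appending one entry to the feature vector and one to the weight vector.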

It is important to note that the combinatorial issues when decoding are similar to the ones with word-based models. This is why a variety of approximate search techniques have to be applied in order to efficiently find a good translation hypothesis $\hat{e}$, such as beam search and hypothesis recombination for stack-based decoding [Och et al., 2001; Koehn, 2004a]. Most of the decoding algorithms for statistical machine translation are instances of dynamic programming, and as such, they produce a search graph in the form of a probabilistic finite-state transducer (FST), which can be exploited using general graph algorithms, such as efficient

In a search graph, each vertex represents a (partial) hypothesis and its internal state. Edges are associated with phrase applications, covering parts of the source. Complete hypotheses form a complete path through the search graph, covering the whole source.
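A vertex of such a search graph, together with hypothesis recombination, might be represented as in the following sketch. The components of the internal state used here (source coverage, end position of the last translated source phrase, and a short language-model context) are an assumption about what a typical decoder needs to score future extensions. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hypothesis:
    coverage: frozenset   # source positions translated so far
    last_end: int         # end of the last translated source phrase (for distortion)
    lm_context: tuple     # last target words, as needed by the language model
    score: float          # log-score of the best path reaching this vertex
    back: Optional["Hypothesis"] = None  # predecessor vertex (incoming edge)

    def state(self):
        """Everything that influences the score of future extensions."""
        return (self.coverage, self.last_end, self.lm_context)


def recombine(hypotheses):
    """Keep only the best-scoring hypothesis for each internal state."""
    best = {}
    for hyp in hypotheses:
        key = hyp.state()
        if key not in best or hyp.score > best[key].score:
            best[key] = hyp
    return list(best.values())


# Two hypotheses share the same state and differ only in score; a third differs in state.
h1 = Hypothesis(frozenset({0, 1}), 1, ("the", "house"), -2.0)
h2 = Hypothesis(frozenset({0, 1}), 1, ("the", "house"), -3.5)
h3 = Hypothesis(frozenset({0}), 0, ("the",), -1.0)
survivors = recombine([h1, h2, h3])
```

Because `h1` and `h2` are indistinguishable for all future extensions, only the better-scoring `h1` needs to be expanded further. This sketch simply discards `h2`; a decoder that outputs a full search graph would instead keep its incoming edge as an alternative path into the same vertex.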

It is important to note that the approximations for decoding in phrase-based statistical machine translation deviate significantly from the originally proposed model: since the sum in $\arg\max_e \sum_a p(f, a \mid e)$ is intractable, we instead seek $\arg\max_e \max_a p(f, a \mid e)$, maximizing over all possible segmentations and alignments. However, as Och and Ney [2002] note, this can be alleviated by including $a$ in the joint feature map, defining feature functions using the alignment $a$, e.g. binary word translation features or identifiers for phrase-pairs.
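That the max approximation can change the decision is easy to see on a toy case. The probabilities below are invented purely for illustration: translation `e1` is reachable through three equally likely derivations, while `e2` has a single, individually more probable one.

```python
# Invented values of p(f, a | e) for each derivation a of two candidate translations.
derivations = {
    "e1": [0.2, 0.2, 0.2],  # three derivations (segmentations/alignments)
    "e2": [0.3],            # a single derivation
}

# Original criterion: sum over all derivations of each candidate.
best_by_sum = max(derivations, key=lambda e: sum(derivations[e]))

# Decoding approximation: only the single best derivation counts.
best_by_max = max(derivations, key=lambda e: max(derivations[e]))
```

Summing prefers `e1` (total mass 0.6 against 0.3), whereas the max approximation prefers `e2` (0.3 against 0.2). Probability mass spread over many derivations of the same output string is ignored by the approximation.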

The most widely used implementation of phrase-based statistical machine translation is the Moses toolkit [Koehn et al., 2007], which implements all of the discussed algorithms.

In document Preference Learning for Machine Translation (Page 40-42)