Combining Phrase-Based and Template-Based Alignment Models in Statistical Translation
Jesús Tomás1 and Francisco Casacuberta2
1 Escuela Politécnica Superior de Gandia, Universidad Politécnica de Valencia,
46730 Gandia, Spain [email protected]
2 Institut Tecnològic d’Informàtica, Universidad Politécnica de Valencia,
46071 Valencia, Spain [email protected]
Abstract. In statistical machine translation, single-word based models have an important deficiency: they do not take contextual information into account for the translation decision. A possible solution, called phrase-based translation, consists of translating a sequence of words instead of a single word. We show how this approach obtains interesting results in some corpora. One shortcoming of phrase-based alignment models is that they lack generalization capability in word reordering. A possible solution could be the template-based approach, which uses sequences of classes of words instead of sequences of words. We present a template-based alignment model that uses a Part Of Speech tagger for word classes. We also propose an improved model that combines both models. The basic idea is that if a sequence of words has been seen in training, the phrase-based model can be used; otherwise, the template-based model can be used. We present the results from different tasks.
1 Introduction
Statistical machine translation has been formalized in [1]. This approach defines a translation model by introducing an alignment, which defines the correspondence between the words of source sentences and target sentences. The optimal parameters of this model are estimated using statistical theory.
The most common statistical translation models can be classified as single-word based (SWB) alignment models. Models of this kind assume that an input word is generated by only one output word [1][10]. This assumption does not correspond to the nature of natural language; in some cases, we need to know a word group in order to obtain a correct translation.
classes. However, the lexical model continues to be based on word-to-word correspondence. An example of this kind of model is the alignment-template model [5]. In this model, the word classes are learned automatically using a bilingual corpus. In this paper, we present a monotone TB model that uses a Part Of Speech tagger for word classes.
Recent works present a simple alternative to these models, the phrase-based (PB) approach [3][7][12]. These methods explicitly learn the probability of a sequence of words in a source sentence being translated to another sequence of words in the target sentence. One shortcoming of the PB alignment models is their generalization capability. If a sequence of words has not been seen in training, the model cannot reorder it properly.
We also propose an improved model that combines PB and TB approaches. The basic idea is that if a sequence of words has been seen in training, the PB model can be used; otherwise, the generalization capability of the TB model can be used.
The organization of the paper is as follows. First, we review the statistical approach to machine translation. Second, we introduce a direct approach to the problem. Then, we review the PB approach and propose different methods to estimate the parameters. Afterwards, we propose a TB approach that uses a POS tagger for word classes and we show a possible combination of two of the models in question. Finally, we report some experimental results. The system was tested on three different translation tasks.
2 Statistical Translation
2.1 Noisy-channel approach
The goal of statistical translation is to translate a given source language sentence f = f1...f|f| into a target sentence e = e1...e|e|. The methodology used [1] is based on the definition of a function Pr(e|f) that returns the probability of translating the input sentence f into the output sentence e. Once this function is estimated, the problem can be formulated as computing a sentence e that maximizes the probability Pr(e|f) for a given f. Using Bayes’ theorem, we can write:
$$\Pr(e \mid f) = \frac{\Pr(e)\,\Pr(f \mid e)}{\Pr(f)} \tag{1}$$
And, therefore, statistical translation can be presented as:
$$e' = \arg\max_{e} \Pr(e)\,\Pr(f \mid e) \tag{2}$$
Equation (2) summarizes the three following matters to be solved:
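• An output language model, Pr(e), is needed to distinguish valid sentences from invalid sentences in the target language.
• A translation model, Pr(f|e), is needed to relate source sentences to their possible translations.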
• An algorithm must be designed to search for the sentence e that maximizes this product.
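The decision rule in equation (2) can be illustrated with a minimal sketch. It assumes a finite list of candidate translations and two scoring functions; the names `noisy_channel_decode`, `LM` and `TM`, and all the probabilities, are made up for illustration and are not part of the system described here.

```python
import math

def noisy_channel_decode(f, candidates, lm_logprob, tm_logprob):
    """Return the candidate e that maximizes log Pr(e) + log Pr(f|e), as in (2)."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))

# Toy, made-up distributions (assumptions for illustration only).
LM = {"the program": 0.6, "program the": 0.1}             # Pr(e)
TM = {("el programa", "the program"): 0.5,                # Pr(f|e)
      ("el programa", "program the"): 0.5}

best = noisy_channel_decode(
    "el programa",
    list(LM),
    lambda e: math.log(LM[e]),
    lambda f, e: math.log(TM[(f, e)]),
)
print(best)
```

Here the translation model cannot separate the two candidates, so the language model decides. A real search cannot enumerate all candidate sentences, which is why a dedicated search algorithm is listed as a separate matter.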
We focus our attention on the translation model, Pr(f|e). This probability distribution is too general to be used in a table look-up approach, because there is a huge number of possible values of f. Therefore, we have to reduce the number of free parameters by reducing the dependencies.
Some models assume that an input word fj, in position j, is generated by one target word ei, in position i. Models of this kind are referred to as single-word based alignment models. A possible alignment between f and e is represented by the variable a = a1...a|f|. If aj = i, the input word in position j is aligned to the output word in position i. As a result, the translation probability can be broken down into a lexicon probability and an alignment probability. In a more general definition of alignment, we can assume that an input word is generated by several words in the output. We can express this kind of alignment by assigning several output words to each input position instead of a single output word: A = A1...A|f|, Aj ⊆ {i: i = 1,..,|e|}.
2.2 Direct Approach in Statistical Translation
Our approach does not follow equation (2). We estimate Pr(e|f) directly without using Bayes’ theorem [6]. As usual, we can express this likelihood in terms of alignments:
$$\Pr(e \mid f) = \sum_{a} \Pr(e, a \mid f) \tag{3}$$
Without loss of generality, we can write [1]:
$$\Pr(e, a \mid f) = \Pr(I \mid f) \prod_{i=1}^{I} \Pr(a_i \mid a_1^{i-1}, e_1^{i-1}, I, f)\, \Pr(e_i \mid a_1^{i}, e_1^{i-1}, I, f) \tag{4}$$
where I = |e|. Now, we define an alignment as a = a1...a|e|; if ai = j, the output word in position i is aligned to the input word in position j. If we assume that $e_1^{i-1}$ is independent of $(a_1^{i}, I, f)$ in the third term of (4), we can write¹:
$$\Pr(e, a \mid f) = \Pr(I \mid f) \prod_{i=1}^{I} \Pr(e_i \mid e_1^{i-1})\, \Pr(a_i \mid a_1^{i-1}, e_1^{i-1}, I, f)\, \frac{\Pr(e_i \mid a_1^{i}, I, f)}{\Pr(e_i)} \tag{5}$$
¹ It is easy to demonstrate that P(a|b,c) = P(a|b)P(a|c)/P(a) when b and c are independent events (and conditionally independent given a). Thus, if we suppose that $e_1^{i-1}$ is independent of $(a_1^{i}, I, f)$, we can write:
$$\Pr(e_i \mid a_1^{i}, e_1^{i-1}, I, f) = \frac{\Pr(e_i \mid e_1^{i-1})\, \Pr(e_i \mid a_1^{i}, I, f)}{\Pr(e_i)}$$
where $\Pr(e_i \mid e_1^{i-1})$ is the language model.
As in the standard approach, we have broken down the initial distribution into three principal distributions: the language model, the alignment model and the lexical model.
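The identity P(a|b,c) = P(a|b)P(a|c)/P(a) that underlies (5) can be checked with Bayes' rule. The short derivation below is a sketch that relies on b and c being independent and also conditionally independent given a; in the decomposition, a stands for $e_i$, b for $e_1^{i-1}$ and c for $(a_1^{i}, I, f)$.

```latex
\begin{align*}
P(a \mid b, c)
  &= \frac{P(b, c \mid a)\, P(a)}{P(b, c)}
     && \text{(Bayes' rule)} \\
  &= \frac{P(b \mid a)\, P(c \mid a)\, P(a)}{P(b)\, P(c)}
     && \text{(both independence assumptions)} \\
  &= \frac{P(a \mid b)\, P(b)}{P(a)} \cdot \frac{P(a \mid c)\, P(c)}{P(a)} \cdot \frac{P(a)}{P(b)\, P(c)}
     && \text{(Bayes' rule on each factor)} \\
  &= \frac{P(a \mid b)\, P(a \mid c)}{P(a)}
\end{align*}
```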
3 Monotone Phrase-Based Translation
The principal innovation of the phrase-based (PB) alignment model [3][7][12] is that it attempts to calculate the translation probabilities of word sequences (phrases) rather than of only single words. Figure 1 shows the same sentence written in five different languages.
Se requerirá una acción de la Comunidad para la puesta en práctica
É necessária uma acção por parte da Comunidade para pôr plenamente em prática
Sarà necessaria un'azione della Comunità per dare piena attuazione
Une action est nécessaire au niveau communautaire afin de mettre pleinement en œuvre
Action is required by the Community in order to implement fully
Fig. 1. Equivalent phrases in a sentence in Spanish, Portuguese, Italian, French and English
As can be seen from this example, we join phrases that are translated together in a natural way. The other property of this translation model is that the alignment between the phrases is monotone-constrained. The example shows how the sentences are monotone-translated.
The generative process, which allows for the translation of a sentence, can be broken down into the following steps: First, the input sentence is segmented into phrases. Then, each phrase is translated to the corresponding output phrase. The output sentence is made by concatenating the output phrases in the same order as in the input phrases.
This model uses a particular kind of alignment that we call monotone alignment using phrases.
f: the configuration program is loaded
e: el programa de configuración está cargado
A: {1}, {2,3}, {2,3}, {2,3}, {4,5}, {4,5}
Fig. 2. Example of monotone alignment using phrases
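The generative steps above (segment, translate each phrase, concatenate in order) can be sketched as follows. This is a toy illustration under stated assumptions: the phrase table `PHRASES` and its probabilities are invented, no language model is used, and the search simply enumerates every segmentation and keeps the best-scoring one.

```python
from math import prod  # Python 3.8+

def segmentations(words, max_len):
    """Yield every split of `words` into consecutive phrases of at most max_len words."""
    if not words:
        yield []
        return
    for k in range(1, min(max_len, len(words)) + 1):
        head = tuple(words[:k])
        for rest in segmentations(words[k:], max_len):
            yield [head] + rest

def translate_monotone(sentence, phrase_table, max_len=3):
    """Segment, translate each phrase, and concatenate the outputs in input order."""
    best_score, best_output = 0.0, None
    for seg in segmentations(sentence.split(), max_len):
        if not all(ph in phrase_table for ph in seg):
            continue  # some phrase of this segmentation was never seen in training
        # Most probable output phrase for each input phrase.
        choices = [max(phrase_table[ph].items(), key=lambda kv: kv[1]) for ph in seg]
        score = prod(p for _, p in choices)
        if score > best_score:
            best_score = score
            best_output = " ".join(e for e, _ in choices)
    return best_output, best_score

# Hypothetical phrase table p(output phrase | input phrase) for the sentence in Fig. 2.
PHRASES = {
    ("the",): {"el": 0.7, "la": 0.3},
    ("configuration", "program"): {"programa de configuración": 0.9},
    ("is", "loaded"): {"está cargado": 0.8},
    ("is",): {"es": 0.6, "está": 0.4},
    ("loaded",): {"cargado": 0.5},
}

output, score = translate_monotone("the configuration program is loaded", PHRASES)
print(output, score)
```

Because "configuration" and "program" have no single-word entries in this toy table, only segmentations that keep them together survive, which mirrors the phrase grouping shown in Fig. 2.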
Using the definition of generalized alignment (A = A1...A|e|, Ai ⊆ {j: j = 1,..,|f|}), we say that A is a monotone alignment using phrases if the following holds:
$$(A_i = A_{i+1}) \lor (\forall j_1 \in A_i,\ j_2 \in A_{i+1} : j_1 < j_2) \qquad \forall i = 1, \ldots, |e| - 1 \tag{6}$$
A monotone alignment entails a segmentation of the sentences f and e into K phrases. We denote these sequences of phrases as $\tilde{f} = \tilde{f}_1 \ldots \tilde{f}_K$ and $\tilde{e} = \tilde{e}_1 \ldots \tilde{e}_K$. Thus, the phrase $\tilde{e}_i$ is aligned with the phrase $\tilde{f}_i$, with $i = 1 \ldots |\tilde{e}|$. As usual, we can obtain the translation probability by adding the probabilities of all possible monotone alignments A:
$$\Pr(e \mid f) = \sum_{A} \Pr(A \mid f)\, \Pr(e \mid A, f) = \alpha(e) \sum_{A} \Pr(e \mid A, f) \tag{7}$$
$$\Pr(e \mid f) = \alpha(e) \sum_{\tilde{f}:\, \tilde{f} = f}\ \sum_{\tilde{e}:\, \tilde{e} = e,\ |\tilde{e}| = |\tilde{f}|} \Pr(\tilde{e} \mid \tilde{f}) \tag{8}$$
In (7), we assume that all alignments have the same probability α(e). This parameter is not relevant for translation and will be omitted. Equation (8) is an alternative expression of equation (7), in which the monotone alignment is explicitly indicated. If we assume that the phrase $\tilde{e}_i$ is produced only by the phrase $\tilde{f}_i$, we can write:
$$\Pr(\tilde{e} \mid \tilde{f}) = \prod_{i=1}^{|\tilde{e}|} p(\tilde{e}_i \mid \tilde{f}_i) \tag{9}$$
where the parameter $p(\tilde{e} \mid \tilde{f})$ estimates the probability that the word group $\tilde{f}$ is translated to the word group $\tilde{e}$. These are the only parameters of this model.
3.1 Training with a phrase-aligned corpus
In the training phase, we can estimate the parameters of the model by using a parallel corpus, which is aligned sentence to sentence [3][7]. We need to maximize equation (8) subject to the constraints that hold for each $\tilde{f}$:
$$\sum_{\tilde{e}} p(\tilde{e} \mid \tilde{f}) = 1 \tag{10}$$
Using standard maximization techniques, we obtain [7]:
$$p(\tilde{e} \mid \tilde{f}) = \lambda_{\tilde{f}}^{-1} \sum_{\tilde{f}':\, \tilde{f}' = f}\ \sum_{\tilde{e}':\, \tilde{e}' = e,\ |\tilde{e}'| = |\tilde{f}'|}\ \prod_{i=1}^{|\tilde{e}'|} p(\tilde{e}'_i \mid \tilde{f}'_i)\ \sum_{i=1}^{|\tilde{e}'|} \delta(\tilde{f}'_i = \tilde{f})\, \delta(\tilde{e}'_i = \tilde{e}) \tag{11}$$
where $\tilde{f}'$ and $\tilde{e}'$ range over the segmentations of the training pair (f, e), $\lambda_{\tilde{f}}$ is a normalization factor, and δ is the Kronecker delta function, defined as δ(true) = 1 and δ(false) = 0.
The parameters that we are interested in estimating, $p(\tilde{e} \mid \tilde{f})$, appear on both sides of equation (11). Thus, we need to use the EM algorithm in an iterative procedure [1].
3.2 Training with a word-aligned corpus
The parameters of the model can also be obtained using a word-aligned corpus [12]. We do not have a word-aligned corpus, so we align the corpus automatically using single-word models trained with the free software GIZA++ [6].
In the first step, we extract a set of bilingual phrases from the word-aligned corpus. Basically, a bilingual phrase consists of a pair of m consecutive source words that have been aligned with n consecutive target words. Different criteria can define the set of bilingual phrases BP of the sentence pair (f; e) with alignment a. The criterion used, (12), is illustrated in Figure 3.
$$BP(f, e, a) = \{ (f_{j_1} \ldots f_{j_2},\ e_{i_1} \ldots e_{i_2}) :\ \forall j : j_1 \le j \le j_2 : (i_1 \le a_j \le i_2) \lor (a_j = 0);\ \ \forall j : (j < j_1) \lor (j > j_2) : (a_j < i_1) \lor (a_j > i_2) \} \tag{12}$$
e: configuration program
f: programa de configuración
a: 2 0 1
BP = {configuration-configuración, program-programa, configuration-de configuración, program-programa de, configuration program-programa de configuración}
Fig. 3. Example of extracting a set of bilingual phrases from two aligned sentences
In the second step, we estimate the parameters of the model. This can be done via relative frequencies:
$$t(\tilde{e} \mid \tilde{f}) = \frac{N(\tilde{f}, \tilde{e})}{N(\tilde{f})} \tag{13}$$
where $N(\tilde{f})$ denotes the number of times that phrase $\tilde{f}$ has appeared and $N(\tilde{f}, \tilde{e})$ is the number of times that the bilingual phrase $\tilde{f}$-$\tilde{e}$ has appeared. In [12], the noisy-channel approach is used. Therefore, they use the function:
$$t(\tilde{f} \mid \tilde{e}) = \frac{N(\tilde{f}, \tilde{e})}{N(\tilde{e})} \tag{14}$$
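The two training steps can be sketched as follows, using the example of Figure 3. This is a brute-force illustration rather than the implementation used in the experiments: the function names and the `max_len` limit are assumptions, the alignment is read as one entry per word of f (0 meaning unaligned), and N(f̃) is taken to be the number of extracted bilingual phrases whose source side is f̃.

```python
from collections import Counter

def bilingual_phrases(f, e, a, max_len=3):
    """Enumerate the phrase pairs admitted by criterion (12).

    f, e: lists of words. a[j-1] is the 1-based position in e aligned to the
    j-th word of f, or 0 if that word is unaligned.
    """
    pairs = set()
    J, I = len(f), len(e)
    for j1 in range(1, J + 1):
        for j2 in range(j1, min(J, j1 + max_len - 1) + 1):
            for i1 in range(1, I + 1):
                for i2 in range(i1, min(I, i1 + max_len - 1) + 1):
                    # Words inside the f span are unaligned or point into [i1, i2].
                    inside_ok = all(
                        a[j - 1] == 0 or i1 <= a[j - 1] <= i2
                        for j in range(j1, j2 + 1)
                    )
                    # Words outside the f span never point into [i1, i2].
                    outside_ok = all(
                        not (i1 <= a[j - 1] <= i2)
                        for j in range(1, J + 1)
                        if j < j1 or j > j2
                    )
                    if inside_ok and outside_ok:
                        pairs.add((" ".join(f[j1 - 1:j2]), " ".join(e[i1 - 1:i2])))
    return pairs

def relative_frequencies(corpus):
    """Estimate t(e_phrase | f_phrase) = N(f, e) / N(f), as in (13)."""
    pair_counts, source_counts = Counter(), Counter()
    for f, e, a in corpus:
        for fp, ep in bilingual_phrases(f, e, a):
            pair_counts[(fp, ep)] += 1
            source_counts[fp] += 1
    return {(fp, ep): n / source_counts[fp] for (fp, ep), n in pair_counts.items()}

f = ["programa", "de", "configuración"]
e = ["configuration", "program"]
a = [2, 0, 1]
bp = bilingual_phrases(f, e, a)
t = relative_frequencies([(f, e, a)])
print(sorted(bp))
```

On this sentence pair the sketch yields the five bilingual phrases listed in Figure 3; the unaligned word "de" is only extracted together with a neighbouring aligned word.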
We tried another way of estimating the parameters. We needed to maximize (8) subject to the constraint that the monotone alignment in this function be consistent with the training alignment. Using standard maximization techniques we obtain:
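$$t(\tilde{e} \mid \tilde{f}) = \lambda_{\tilde{f}}^{-1}\, \delta\big(\tilde{f}\text{-}\tilde{e} \in BP(f, e, a)\big) \sum_{\tilde{f}':\, \tilde{f}' = f}\ \sum_{\tilde{e}':\, \tilde{e}' = e,\ |\tilde{e}'| = |\tilde{f}'|}\ \prod_{i=1}^{|\tilde{e}'|} t(\tilde{e}'_i \mid \tilde{f}'_i)\ \sum_{i=1}^{|\tilde{e}'|} \delta(\tilde{f}'_i = \tilde{f})\, \delta(\tilde{e}'_i = \tilde{e}) \tag{15}$$
As in Section 3.1, the sums range over the segmentations of the training pair (f, e), and the parameters can be estimated using the EM algorithm.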
4 Monotone Template-Based Translation
The PB approach has one obvious drawback: it does not have generalization capability in word reordering. If we try to translate the English phrase “configuration program” to Spanish, and this phrase has not been seen in training, the model cannot output the correct phrase “programa de configuración”. If we analyze the example, the reason for this word reordering is the POS that each word has in the sentence. Two nouns in English are frequently translated into Spanish as «2nd noun + de + 1st noun». A possible solution could be the template-based (TB) approach, which uses sequences of classes of words instead of sequences of words. We present a TB model that uses a POS tagger for word classes.
The generative process, which allows for the translation of a sentence, can be broken down into the following steps: First, each word is tagged with the corresponding class. Second, the input sentence is segmented into phrases. Then, the corresponding reordering template for each class sequence is selected. The order of the output phrases is the same as the order of the input phrases. Finally, a word is selected for each output position.
As in the PB model, we add all possible monotone alignments to obtain the translation probability (as shown in (8)), and we assume that the phrase $\tilde{e}_i$ is produced only by the phrase $\tilde{f}_i$, as shown in (9).
To estimate the translation probability between two phrases, $\Pr(\tilde{e} \mid \tilde{f})$, we introduce the concept of template. In our implementation, a template z is the pair (F, R), which describes a possible word reordering R of a word class sequence F.
$$z = (F, R):\quad F_j \in C,\ j = 1, \ldots, |F|;\qquad R_i \in E \cup \mathbb{N}^{+},\ i = 1, \ldots, |R| \tag{16}$$
where C is the set of input word classes, E is the output vocabulary and $\mathbb{N}^{+}$ is the set of natural numbers greater than 0.
F = JJ NN NN
R = 3 de 2 1
Fig. 4. Template example between an English phrase and a Spanish phrase
The template example in Figure 4 is used as follows. The input phrase “new configuration program” can be tagged as JJ NN NN (adjective, noun, noun). To translate this phrase, we can use a template whose class sequence F matches this sequence. If we use the template in the example, we obtain four words: the first is aligned with the third word in the input, the second is the word de, the third is aligned with the second, and the last is aligned with the first. Using the most probable translation for each aligned word, the output phrase “programa de configuración nuevo” can be obtained. We introduce the hidden variable of template z in the phrase translation probability:
$$\Pr(\tilde{e} \mid \tilde{f}) = \sum_{z} \Pr(z \mid \tilde{f})\, \Pr(\tilde{e} \mid z, \tilde{f}) \tag{17}$$
where $\Pr(z \mid \tilde{f})$ is the probability of applying a template given the input phrase and $\Pr(\tilde{e} \mid z, \tilde{f})$ is the probability of the output phrase given the template and the input phrase [5]. We assume that the use of a template depends only on the word classes in the input phrase:
$$\Pr(z \mid \tilde{f}) = p(z \mid C(\tilde{f})) \tag{18}$$
where $C(\tilde{f})$ maps a word sequence to its classes. To estimate the second distribution, we assume that the generation of the word $\tilde{e}_i$ depends only on the position $R_i$:
$$\Pr(\tilde{e} \mid z, \tilde{f}) = \delta(|\tilde{e}| = |R|) \prod_{i=1}^{|\tilde{e}|} \Pr(\tilde{e}_i \mid R_i, \tilde{f}) \tag{19}$$
Each factor is defined according to the kind of element in R:
$$\Pr(\tilde{e}_i \mid R_i, \tilde{f}) = \begin{cases} p(\tilde{e}_i \mid \tilde{f}_{R_i}) & \text{if } R_i \in \mathbb{N}^{+} \\ \delta(\tilde{e}_i = R_i) & \text{if } R_i \in E \end{cases} \tag{20}$$
If $R_i$ is an alignment index ($R_i \in \mathbb{N}^{+}$), we use the word-to-word translation probability; if $R_i$ is an output word ($R_i \in E$), the word $\tilde{e}_i$ must be the same as $R_i$. This model has only two distributions: p(z|F), the probability of applying a template given a class sequence, and p(e|f), the word-to-word translation probability.
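The use of a template can be illustrated with the example of Figure 4. The sketch below picks, for each alignment index in R, the most probable word-to-word translation and copies literal output words unchanged; the lexicon `LEXICON` and its probabilities are invented for illustration, and `apply_template` is a hypothetical helper name.

```python
def apply_template(words, R, lexicon):
    """Build the most probable output phrase for a reordering R, following (19)-(20).

    words: input phrase as a list of words.
    R: sequence whose items are 1-based input positions (int) or literal output words (str).
    lexicon: word-to-word probabilities p(e | f) as nested dicts.
    """
    output, prob = [], 1.0
    for r in R:
        if isinstance(r, int):
            # Alignment index: translate the r-th input word with p(e | f).
            target, p = max(lexicon[words[r - 1]].items(), key=lambda kv: kv[1])
            output.append(target)
            prob *= p
        else:
            # Literal output word: the delta in (20) contributes a factor of 1.
            output.append(r)
    return " ".join(output), prob

# Hypothetical word-to-word lexicon p(e | f).
LEXICON = {
    "new": {"nuevo": 0.6, "nueva": 0.4},
    "configuration": {"configuración": 0.9},
    "program": {"programa": 0.8, "programar": 0.2},
}

phrase, prob = apply_template(
    ["new", "configuration", "program"], [3, "de", 2, 1], LEXICON
)
print(phrase, prob)
```

The returned probability corresponds to Pr(ẽ|z, f̃) for the chosen words only; the template probability p(z|F) would still have to be multiplied in, as in (17) and (18).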
4.1 Training
We obtain the parameters of this model using a word-aligned corpus. Given a pair of aligned sentences (f, e, a): First, each word in the input sentence is tagged with the corresponding POS. Then, the set of bilingual phrases is extracted using the function BP(f, e, a). For each bilingual phrase $(\tilde{f}, \tilde{e})$, we extract a template z = (F, R) where:
$$F = C(\tilde{f}), \qquad R_i = \begin{cases} a_i & \text{if } a_i \neq 0 \\ \tilde{e}_i & \text{if } a_i = 0 \end{cases} \qquad \forall i = 1, \ldots, |\tilde{e}| \tag{21}$$
Figure 5 shows an example of template extraction from a bilingual phrase. Then, the parameters can be estimated using relative frequency:
$$p(z \mid F) = \frac{n(z)}{n(F)}, \qquad p(e \mid f) = \frac{n(f, e)}{n(f)} \tag{22}$$
f̃: configuration program        F: NN NN
ẽ: programa de configuración    R: 2 de 1
a: 2 0 1
Fig. 5. Example of template extraction from a bilingual phrase
p(2 de 1 | NN NN) = 0.226    p(1 | NN NN) = 0.041        p(2 1 | NN NN) = 0.031
p(1 2 | NN NN) = 0.109       p(2 | NN NN) = 0.039        p(2 2 1 | NN NN) = 0.029
p(2 1 | NN NN) = 0.074       p(1 1 2 | NN NN) = 0.035    p(2 del 1 | NN NN) = 0.028

p(2 1 | JJ NN) = 0.294       p(1 | JJ NN) = 0.058        p(2 de 1 | JJ NN) = 0.026
p(1 2 | JJ NN) = 0.197       p(2 | JJ NN) = 0.038        p(1 1 | JJ NN) = 0.026
p(de 2 1 | JJ NN) = 0.068    p(1 de 2 | JJ NN) = 0.028   p(1 2 2 | JJ NN) = 0.022
Fig. 6. Some template probabilities for two word-class sequences in an English-Spanish task (NN – noun, JJ – adjective)
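Template extraction and the relative-frequency estimate of p(z|F) can be sketched as follows. The helper names are assumptions, the alignment is read as one entry per output word (0 meaning unaligned), and the small list of templates used for counting is invented so that the arithmetic is easy to follow.

```python
from collections import Counter

def extract_template(tags, e_phrase, a):
    """Return z = (F, R) for one bilingual phrase, following (21).

    tags: POS tags of the input phrase.
    e_phrase: output phrase as a list of words.
    a: for each output word, the 1-based aligned input position, or 0 if unaligned.
    """
    F = tuple(tags)
    # Aligned output words become position indices; unaligned ones are kept literally.
    R = tuple(a_i if a_i != 0 else word for word, a_i in zip(e_phrase, a))
    return F, R

def template_probabilities(templates):
    """Estimate p(z | F) = n(z) / n(F) from a list of extracted templates."""
    z_counts = Counter(templates)
    F_counts = Counter(F for F, _ in templates)
    return {z: n / F_counts[z[0]] for z, n in z_counts.items()}

# The bilingual phrase of Fig. 5.
z = extract_template(["NN", "NN"], ["programa", "de", "configuración"], [2, 0, 1])

# Made-up extraction counts: the template above seen twice, a monotone one seen once.
probs = template_probabilities([z, z, (("NN", "NN"), (1, 2))])
print(z, probs[z])
```

Because unaligned output words such as "de" are stored literally in R, frequent function-word insertions end up as separate templates, as the entries "2 de 1" and "2 del 1" in Fig. 6 suggest.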
5 Combining Phrase-Based and Template-Based Translation
The above models present advantages and disadvantages. The TB model can generalize while the PB model cannot. On the other hand, the PB model explicitly translates each phrase and is capable of translating some phrases that the TB model cannot.
We consider the two models to be complementary, and so we combine them. The basic idea is that when a phrase is translated, if it has been seen in the training, the PB model is used. Otherwise, the TB model is preferred.
The new model can be defined as a linear combination of these two models:
$$\Pr(\tilde{e} \mid \tilde{f}) = (1 - \alpha)\, p(\tilde{e} \mid \tilde{f}) + \alpha \sum_{z} p(z \mid C(\tilde{f}))\, \Pr(\tilde{e} \mid z, \tilde{f}) \tag{23}$$
where α is a factor to adjust the relevance of the two models.
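The combination in (23) can be sketched in a few lines. The phrase table, the constant template-based score and the value of α below are made-up numbers, and `combined_probability` is a hypothetical helper: in a real system the template-based term would be the sum over templates in (23).

```python
def combined_probability(e_phrase, f_phrase, alpha, pb_table, tb_score):
    """Linear combination of the PB and TB phrase probabilities, as in (23).

    pb_table: dict mapping (f_phrase, e_phrase) to p(e_phrase | f_phrase).
    tb_score: callable returning the template-based probability of e_phrase given f_phrase.
    """
    pb = pb_table.get((f_phrase, e_phrase), 0.0)  # zero for unseen phrase pairs
    return (1.0 - alpha) * pb + alpha * tb_score(e_phrase, f_phrase)

# Made-up values for illustration only.
PB_TABLE = {("configuration program", "programa de configuración"): 0.9}
tb = lambda e_phrase, f_phrase: 0.2

seen = combined_probability(
    "programa de configuración", "configuration program", 0.3, PB_TABLE, tb
)
unseen = combined_probability(
    "programa de instalación", "installation program", 0.3, PB_TABLE, tb
)
print(seen, unseen)
```

For a phrase pair seen in training the PB term dominates, while for an unseen pair the PB term vanishes and only the template-based term contributes, which is the behaviour the combination is meant to achieve.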
6 Search
hypotheses obtained, and we take the bigger one (we will refer to this criterion as Add all alignments). [12] proposes extending the search to allow for non-monotone translation. In this extension, all possible reorderings in the output sequence of phrases have the same probability as the monotone one (we will refer to it as Non-monotone).
7 Experimental Results
In order to evaluate the performance of these approaches, we carried out several experiments using three tasks. The Xerox task was compiled using some Xerox technical manuals [9]. This is a reduced-domain task and many phrases in the text have been seen in the training. The El Periódico task was obtained from the Internet edition of a general newspaper published daily in Catalan and Spanish. The training corpus was made up of 10 months of the newspaper. The test was obtained from different tasks. The Hansards task consists of debates in the Canadian Parliament. This task has a very large vocabulary. Table 1 presents some statistical information about these corpora after the pre-processing phase.
Table 1. Statistical information of the selected corpora for experiments
Task:                         Xerox                El Periódico          Hansards
Translation direction:        English   Spanish    Spanish   Catalan     English   French
Training   Sentence pairs          45,493               643,961               137,381
           Running words      517K      575K       7,180K    7,435K      1,941K    2,130K
           Vocabulary         7,272     9,947      129K      128K        29,479    37,554
Test       Sentences                 500                  120                   250
           Running words      5,719     6,425      2,211     2,179       2,633     2,805
We selected the trigram model with linear interpolation for the language model. The results of the translation experiments using the phrase-based approach are summarized in Tables 2 and 3. The main conclusions are that the direct approach produces better translations than the source-channel approach and that adding all possible alignments is better than using the best alignment.
When translating between Romance languages, word reordering is not essential and therefore the PB approach is appropriate. For an unrestricted task, in Spanish-Catalan translation, we obtained better results than some rule-based commercial systems.
Table 2. Translation word-error rate (TWER) of the PB model for different configurations in the Xerox corpus. Default: max. length phrases=14, direct model, add all alignments
Max. length phrases    TWER    Parameters
6                      38.8    678K
8                      31.2    933K
10                     30.6    1,149K
12                     33.3    1,328K

Estimation of parameters    TWER
Phrase-aligned (11)         33.6
Direct model (13)           30.1
Source channel (14)         33.9
Using EM alg. (15)          31.3

Search criteria       TWER
Best alignment        30.5
Add all alignments    30.1
Non-monotone          33.2
Table 3. Comparison of Spanish-Catalan commercial translators with the PB approach (max. length phrase=3, direct model, monotone search adding all alignments). All references WER corresponds to the percentage of words that must be changed in the output in order to obtain a correct translation
Translators           TWER    All references WER
Internostrum          11.9    4.9
Salt                  9.9     3.0
Incyta                10.9    3.1
Phrase Based Model    10.7    3.8
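A word-error rate of the kind reported in these tables can be computed with a word-level edit distance. The sketch below is one plausible reading, not the evaluation script used for the experiments: it assumes that errors are counted as insertions, deletions and substitutions, that the closest reference is used when several are available, and that the total is normalized by the length of that reference.

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions turning hyp into ref."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, start=1):
        cur = [i]
        for j, rw in enumerate(r, start=1):
            cur.append(min(prev[j] + 1,                # delete a hypothesis word
                           cur[j - 1] + 1,             # insert a reference word
                           prev[j - 1] + (hw != rw)))  # substitute or match
        prev = cur
    return prev[-1]

def word_error_rate(hyps, refs_per_sentence):
    """Percentage of word errors against the closest reference of each sentence."""
    errors = words = 0
    for hyp, refs in zip(hyps, refs_per_sentence):
        best = min(refs, key=lambda ref: word_edit_distance(hyp, ref))
        errors += word_edit_distance(hyp, best)
        words += len(best.split())
    return 100.0 * errors / words

hyp = "el programa está cargado"
ref = "el programa de configuración está cargado"
distance = word_edit_distance(hyp, ref)
wer = word_error_rate([hyp], [[ref]])
print(distance, wer)
```

With a single reference this reduces to the usual single-reference error rate; adding further acceptable references can only lower the figure, which is consistent with the second column of Table 3 being lower than the first.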
Table 4 shows the results obtained in the Hansards task using the PB model, the TB model and the integration of the two models. The integration of the two models obtains the best results.
Table 4. Translation word-error rate of the PB model, TB model and integrated model in the Hansards corpus (max. length phrases=3, max. length templates=3, direct model, all alignments)
Model             TWER    Parameters
Phrase Based      64.9    1,185K
Template Based    68.2    235K
Integrated        63.8    1,420K
8 Conclusion
We have described several approaches for performing statistical machine translation. They are based on monotone alignments using sequences of words. The phrase-based approach is very simple and the search can be performed in reduced time. This method can obtain good translation results in certain tasks such as some reduced-domain tasks or between Romance languages. For an unrestricted task, in Spanish-Catalan translation, we obtained better results than some rule-based commercial systems. An important drawback of this method is that it does not have generalization capability in word reordering.
We also present a template-based alignment model. This model presents three main differences with the alignment-template model presented in [5]: We use POS for word classes; we use word classes only in the source language; and we translate the sequence of templates in a monotone way.
We think these two models are complementary, and we propose an improved model that consists of a linear combination of these models. We obtained the best results with the combined model.
Acknowledgements
This work was partially funded by the Spanish CICYT under grant TIC2000-1599-C02 and the IST Programme of the European Union under grant IST-2001-32091.
References
1. Brown, P.F., Della Pietra, S., Della Pietra, V., Mercer, R.L.: The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (1993) 263–311
2. Casacuberta, F.: Inference of finite-state transducers by using regular grammars and morphisms. In A.L. Oliveira, editor, Grammatical Inference: Algorithms and Applications, volume 1891 of Lecture Notes in Computer Science, pages 1-14. Springer-Verlag, 2000. 5th International Colloquium Grammatical Inference -ICGI2000-. Lisboa. Portugal. September.
3. Marcu, D., Wong W.: A Phrase-Based, Joint Probability Model for Statistical Machine Translation. Proc. of the Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA, July (2002)
4. Marcus, M., Kim, G., Marcinkiewicz, M., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., Schasberger, B.: The Penn treebank: Annotating predicate argument structure. In ARPA Human Language Technology Workshop. (1994)
5. Och, F.J., Tillmann, C., Ney, H.: Improved alignment models for statistical machine translation. In Proc. of the Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora, Maryland, USA. (1999) 20-28
6. Och, F.J., Ney, H.: Improved Statistical Alignment Models. Proc. of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, China, October (2000)
7. Tomás, J., Casacuberta, F.: Monotone Statistical Translation using Word Groups. Proceedings of the Machine Translation Summit VIII, Santiago, Spain (2001)
8. Tomás, J., Casacuberta, F.: Binary Feature Classification for Word Disambiguation in Statistical Machine Translation. Proceedings of the 2nd International Workshop on Pattern Recognition in Information Systems, Alicante, Spain (2002)
9. TT2: TransType2-Computer-Assisted Translation (TT2). Information Society Technologies (IST) Programme. IST-2001-32091 (2002)
10. Vogel, S., Ney, H., Tillmann, C.: HMM-Based Alignment in Statistical Word Translation. International Conference on Computational Linguistics, Copenhagen, Denmark. (1996)
11. Wang, Y., Waibel, A.: Modeling with structures in statistical machine translation. In COLING-ACL’98: 36th Annual Meeting of the Association for Computational Linguistics and 17th Int. Conf. on Computational Linguistics, V2, Montreal, Canada (1998) 1357-1363
12. Zens, R., Och, F.J., Ney, H.: Phrase-Based Statistical Machine Translation. In Proc.