A Comparison of Alignment Models for Statistical Machine Translation

Franz Josef Och and Hermann Ney

Lehrstuhl für Informatik VI, Computer Science Department

RWTH Aachen - University of Technology

D-52056 Aachen, Germany

foch,[email protected] ache n.de

Abstract

In this paper, we present and compare various alignment models for statistical machine translation. We propose to measure the quality of an alignment model using the quality of the Viterbi alignment compared to a manually-produced alignment and describe a refined annotation scheme to produce suitable reference alignments. We also compare the impact of different alignment models on the translation quality of a statistical machine translation system.

1 Introduction

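In statistical machine translation (SMT) it is necessary to model the translation probability $Pr(f_1^J | e_1^I)$. Here $f_1^J = f$ denotes the (French) source and $e_1^I = e$ denotes the (English) target string. Most SMT models (Brown et al., 1993; Vogel et al., 1996) try to model word-to-word correspondences between source and target words using an alignment mapping from source position $j$ to target position $i = a_j$.

We can rewrite the probability $Pr(f_1^J | e_1^I)$ by introducing the 'hidden' alignments $a_1^J := a_1 \ldots a_j \ldots a_J$ with $a_j \in \{0, \ldots, I\}$:

\[ Pr(f_1^J | e_1^I) = \sum_{a_1^J} Pr(f_1^J, a_1^J | e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} Pr(f_j, a_j | f_1^{j-1}, a_1^{j-1}, e_1^I) \]

To allow for French words which do not directly correspond to any English word, an artificial 'empty' word $e_0$ is added to the target sentence at position $i = 0$.

The different alignment models we present provide different decompositions of $Pr(f_1^J, a_1^J | e_1^I)$. An alignment $\hat{a}_1^J$ for which

\[ \hat{a}_1^J = \arg\max_{a_1^J} Pr(f_1^J, a_1^J | e_1^I) \]

holds for a specific model is called the Viterbi alignment of this model.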

In this paper we will describe extensions to the Hidden-Markov alignment model from (Vogel et al., 1996) and compare these to Models 1-4 of (Brown et al., 1993). We propose to measure the quality of an alignment model using the quality of the Viterbi alignment compared to a manually-produced alignment. This has the advantage that once a reference alignment has been produced, the evaluation itself can be performed automatically. In addition, it results in a very precise and reliable evaluation criterion which is well suited to assess various design decisions in modeling and training of statistical alignment models.

It is well known that manually performing a word alignment is a complicated and ambiguous task (Melamed, 1998). Therefore, to produce the reference alignment we use a refined annotation scheme which reduces the complications and ambiguities occurring in the manual construction of a word alignment. As we use the alignment models for machine translation purposes, we also evaluate the resulting translation quality of different models.

2 Alignment with HMM

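In the Hidden-Markov alignment model we assume a first-order dependence for the alignments $a_j$ and that the translation probability depends only on $a_j$ and not on $a_{j-1}$:

\[ Pr(f_j, a_j | f_1^{j-1}, a_1^{j-1}, e_1^I) = p(a_j | a_{j-1}, I) \cdot p(f_j | e_{a_j}) \]

Later, we will describe a refinement with a dependence on $e_{a_{j-1}}$ in the alignment model. Putting everything together, we have the following basic HMM-based model:

\[ Pr(f_1^J | e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} \left[ p(a_j | a_{j-1}, I) \cdot p(f_j | e_{a_j}) \right] \]

with the alignment probability $p(i | i', I)$ and the translation probability $p(f | e)$. To find a Viterbi alignment for the HMM-based model we resort to dynamic programming (Vogel et al., 1996).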
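The dynamic-programming search for the Viterbi alignment can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name, the dictionary `p_trans`, the callable `p_align`, the uniform initial distribution, and the probability floor are all assumptions made for the example, and the empty word is left out.

```python
import math


def viterbi_alignment(src, tgt, p_trans, p_align):
    """Return the most probable alignment a_1..a_J as 0-based target indices.

    src, tgt : lists of source (French) and target (English) words
    p_trans  : dict mapping (f, e) -> p(f|e)          (hypothetical container)
    p_align  : callable (i, i_prev, I) -> p(i|i', I)  (hypothetical interface)
    """
    I, J = len(tgt), len(src)
    floor = 1e-12  # assumed floor for unseen events, avoids log(0)

    def lex(j, i):
        return math.log(max(p_trans.get((src[j], tgt[i]), 0.0), floor))

    # Q[j][i]: best log score of any partial alignment of src[0..j] with a_j = i.
    # Assumption: uniform distribution over the first aligned position.
    Q = [[-math.log(I) + lex(0, i) for i in range(I)]]
    back = []
    for j in range(1, J):
        row, ptr = [], []
        for i in range(I):
            cands = [
                Q[j - 1][ip] + math.log(max(p_align(i, ip, I), floor))
                for ip in range(I)
            ]
            best_prev = max(range(I), key=cands.__getitem__)
            row.append(cands[best_prev] + lex(j, i))
            ptr.append(best_prev)
        Q.append(row)
        back.append(ptr)

    # Trace back from the best final state.
    a = [max(range(I), key=Q[-1].__getitem__)]
    for ptr in reversed(back):
        a.append(ptr[a[-1]])
    return a[::-1]
```

The search costs O(J I^2) operations per sentence pair, since every source position considers all pairs of previous and current target positions.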

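The HMM is trained with the EM algorithm. In the E-step, the lexical and alignment counts for one sentence pair $(f, e)$ are calculated:

\[ c(f | e; f, e) = \sum_{a} Pr(a | f, e) \sum_{j} \delta(f, f_j) \, \delta(e, e_{a_j}) \]

\[ c(i | i', I; f, e) = \sum_{a} Pr(a | f, e) \sum_{j} \delta(i', a_{j-1}) \, \delta(i, a_j) \]

In the M-step the lexicon and alignment probabilities are:

\[ p(f | e) \propto \sum_{s} c(f | e; f^{(s)}, e^{(s)}) \]

\[ p(i | i', I) \propto \sum_{s} c(i | i', I; f^{(s)}, e^{(s)}) \]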

To avoid the summation over all possible alignments $a$, (Vogel et al., 1996) use the maximum approximation where only the Viterbi alignment path is used to collect counts. We used the Baum-Welch algorithm (Baum, 1972) to train the model parameters in our experiments. Thereby it is possible to perform an efficient training using all alignments.
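To illustrate how counts can be collected over all alignments rather than only along the Viterbi path, the following Python sketch computes the posteriors Pr(a_j = i | f, e) with a forward-backward pass and turns them into fractional lexical counts. It is a simplified sketch under stated assumptions: the names, the dictionary and callable interfaces, the uniform initial distribution, and the floor for unseen word pairs are hypothetical, and no rescaling is done, so very long sentences could underflow.

```python
def alignment_posteriors(src, tgt, p_trans, p_align):
    """Posterior Pr(a_j = i | f, e) under the HMM, via forward-backward.

    p_trans : dict (f, e) -> p(f|e)                  (hypothetical container)
    p_align : callable (i, i_prev, I) -> p(i|i', I)  (hypothetical interface)
    """
    I, J = len(tgt), len(src)
    emit = [[p_trans.get((f, e), 1e-12) for e in tgt] for f in src]
    trans = [[p_align(i, ip, I) for i in range(I)] for ip in range(I)]

    # Forward pass; the uniform initial distribution is an assumption.
    alpha = [[emit[0][i] / I for i in range(I)]]
    for j in range(1, J):
        alpha.append([
            emit[j][i] * sum(alpha[j - 1][ip] * trans[ip][i] for ip in range(I))
            for i in range(I)
        ])

    # Backward pass.
    beta = [[1.0] * I for _ in range(J)]
    for j in range(J - 2, -1, -1):
        beta[j] = [
            sum(trans[i][n] * emit[j + 1][n] * beta[j + 1][n] for n in range(I))
            for i in range(I)
        ]

    total = sum(alpha[J - 1])  # sentence likelihood Pr(f | e)
    return [[alpha[j][i] * beta[j][i] / total for i in range(I)]
            for j in range(J)]


def lexical_counts(src, tgt, posteriors):
    """Fractional counts for each (f, e) pair, summed over all alignments."""
    counts = {}
    for j, f in enumerate(src):
        for i, e in enumerate(tgt):
            counts[(f, e)] = counts.get((f, e), 0.0) + posteriors[j][i]
    return counts
```

Under the maximum approximation, each source word would instead contribute a count of exactly one to the single target word chosen by the Viterbi path.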

To make the alignment parameters independent from absolute word positions we assume that the alignment probabilities $p(i | i', I)$ depend only on the jump width $(i - i')$. Using a set of non-negative parameters $\{c(i - i')\}$, we can write the alignment probabilities in the form:

\[ p(i | i', I) = \frac{c(i - i')}{\sum_{i''=1}^{I} c(i'' - i')} \]

This form ensures that for each word position $i'$, $i' = 1, \ldots, I$, the alignment probabilities satisfy the normalization constraint.
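As a small illustration of the jump-width parameterization, the following Python sketch normalizes a table of non-negative jump-width parameters into alignment probabilities. The function name and the dictionary representation of c are assumptions made for the example; positions are 1-based as in the text.

```python
def alignment_prob(i, i_prev, I, c):
    """p(i | i', I) from non-negative jump-width parameters.

    c : dict mapping a jump width d = i - i' to a non-negative value
        (hypothetical container; missing widths count as zero)
    """
    denom = sum(c.get(k - i_prev, 0.0) for k in range(1, I + 1))
    return c.get(i - i_prev, 0.0) / denom
```

For example, with c = {-1: 1.0, 0: 2.0, 1: 5.0, 2: 2.0} and I = 3, a jump of +1 from position 1 gets probability 5/9, and the probabilities over i sum to one for every previous position i'.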

Extension: refined alignment model

The count table $c(i - i')$ has only $2 \cdot I_{max} - 1$ entries. This might be suitable for small corpora, but for large corpora it is possible to make a more refined model of $Pr(a_j | f_1^{j-1}, a_1^{j-1}, e_1^I)$. In particular, we analyzed the effect of a dependence on $e_{a_{j-1}}$ or $f_j$. As a dependence on all English words would result in a huge number of alignment parameters, we use, as in (Brown et al., 1993), equivalence classes $G$ over the English and the French words. Here $G$ is a mapping of words to classes. This mapping is trained automatically using a modification of the method described in (Kneser and Ney, 1991). We use 50 classes in our experiments. The most general form of alignment distribution that we consider in the HMM is

$p(a_j - a_{j-1} | G(e_{a_{j-1}}), G(f_j), I)$.

Extension: empty word

In the original formulation of the HMM alignment model there is no 'empty' word which generates French words having no directly aligned English word. A direct inclusion of an empty word in the HMM model by adding an $e_0$ as in (Brown et al., 1993) is not possible if we want to model the jump distances $i - i'$, as the position $i = 0$ of the empty word is chosen arbitrarily. Therefore, to introduce the empty word we extend the HMM network by $I$ empty words $e_{I+1}^{2I}$. The English word $e_i$ has a corresponding empty word $e_{i+I}$. The position of the empty word encodes the previously visited English word.

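We enforce the following constraints for the transitions in the HMM network ($i \le I$, $i' \le I$):

\[ p(i + I | i', I) = p_0^H \cdot \delta(i, i') \]

\[ p(i + I | i' + I, I) = p_0^H \cdot \delta(i, i') \]

\[ p(i | i' + I, I) = p(i | i', I) \]

The parameter $p_0^H$ is the probability of a transition to the empty word. In our experiments we set $p_0^H = 0.2$.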
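The extended network with I additional empty-word states can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: it assumes that a transition into an empty word is allowed only for the empty word attached to the last visited English word, that leaving an empty word behaves like leaving that English word, and (an extra assumption of this sketch) that the transitions to real words are scaled by (1 - p0) so that each row sums to one. The function name and 0-based state layout are hypothetical.

```python
def extend_with_empty_words(p, I, p0=0.2):
    """Build a 2I x 2I transition table T[prev][next] with 0-based states.

    States 0..I-1 are the English words; state i + I is the empty word
    that remembers that word i was visited last.
    p  : callable (i, i_prev, I) -> p(i | i', I) over the real words
    p0 : probability of a transition to the empty word
    """
    T = [[0.0] * (2 * I) for _ in range(2 * I)]
    for prev in range(2 * I):
        last_word = prev % I  # English word encoded by this state
        for i in range(I):
            # Assumption: scale by (1 - p0) to keep the row normalized.
            T[prev][i] = (1.0 - p0) * p(i, last_word, I)
        # Only the empty word attached to the last visited word is reachable.
        T[prev][last_word + I] = p0
    return T
```

Because the empty-word state keeps the index of the last visited English word, the jump width of the next real transition remains well defined after any number of empty-word steps.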

Smoothing

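For a better estimation of infrequent events we introduce the following smoothing of alignment probabilities, which interpolates with a uniform distribution using an interpolation weight $\alpha$:

\[ p'(a_j | a_{j-1}, I) = \alpha \cdot \frac{1}{I} + (1 - \alpha) \cdot p(a_j | a_{j-1}, I) \]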

3 Model 1 and Model 2

Replacing the dependence on $a_{j-1}$ in the HMM alignment model by a dependence on $j$, we obtain a model which can be seen as a zero-order Hidden-Markov Model and which is similar to Model 2 proposed by (Brown et al., 1993). Assuming a uniform alignment probability $p(i | j, I) = 1/I$, we obtain Model 1.
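With a uniform alignment probability, EM training reduces to re-estimating the lexicon probabilities alone. The following Python sketch shows this for a toy corpus. It is a minimal illustration rather than the training program used in the paper: the function name and data layout are hypothetical, and the empty word is omitted for brevity.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """EM training of p(f|e) with a uniform alignment probability 1/I.

    corpus : list of (src_words, tgt_words) pairs
             (the empty word is omitted, an assumption of this sketch)
    """
    src_vocab = {f for src, _ in corpus for f in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e)
        total = defaultdict(float)  # expected count of e
        for src, tgt in corpus:
            for f in src:
                # E-step: the uniform factor 1/I cancels in the posterior.
                norm = sum(t[(f, e)] for e in tgt)
                for e in tgt:
                    post = t[(f, e)] / norm
                    count[(f, e)] += post
                    total[e] += post
        # M-step: renormalize the expected counts per target word.
        t = defaultdict(
            float, {fe: c / total[fe[1]] for fe, c in count.items()}
        )
    return t
```

On a corpus such as ("das haus", "the house"), ("das buch", "the book"), ("ein buch", "a book"), the probability mass for "the" moves toward "das" over the iterations, because "das" is the only source word that co-occurs with "the" in both sentence pairs.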

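Assuming that the dominating factor in the alignment model of Model 2 is the distance relative to the diagonal line of the $(j, i)$ plane, the model $p(i | j, I)$ can be structured as follows (Vogel et al., 1996):

\[ p(i | j, I) = \frac{r(i - j \cdot \frac{I}{J})}{\sum_{i'=1}^{I} r(i' - j \cdot \frac{I}{J})} \]

This model will be referred to as diagonal-oriented Model 2.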
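The diagonal-oriented structure can be sketched in Python as below. The sketch assumes that the alignment probability is a normalized non-negative function r of the distance between target position i and the diagonal point j * I / J; the function name and the choice of r in the example are hypothetical.

```python
import math


def diagonal_alignment_prob(i, j, I, J, r):
    """p(i | j, I) driven by the distance of (j, i) from the diagonal.

    r : callable mapping a real-valued distance to a non-negative weight
        (hypothetical; in training these weights would be estimated)
    Positions i and j are 1-based, as in the text.
    """
    diagonal = j * I / J
    denom = sum(r(k - diagonal) for k in range(1, I + 1))
    return r(i - diagonal) / denom


def example_r(d):
    """Illustrative weight that decays with distance from the diagonal."""
    return math.exp(-abs(d))
```

With `example_r`, a source word at position j = 2 of a sentence pair with I = J = 4 is most likely aligned to target position i = 2, and the probabilities fall off symmetrically on either side.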

4 Model 3 and Model 4

The fertility models of (Brown et al., 1993) explicitly model the probability $p(\phi | e)$ that the English word $e_i$ is aligned to $\phi_i = \sum_j \delta(i, a_j)$ French words. Model 3 of (Brown et al., 1993) is a zero-order alignment model like Model 2 that additionally includes fertility parameters. Model 4 of (Brown et al., 1993) is a first-order alignment model (along the source positions) like the HMM, but also includes fertilities. In Model 4 the alignment position $j$ of an English word depends on the alignment position $j'$ of the previous English word (with non-zero fertility). It models a jump distance $j - j'$ (for consecutive English words), while in the HMM a jump distance $i - i'$ (for consecutive French words) is modeled. The full description of Model 4 (Brown et al., 1993) is rather complicated, as it has to cover the cases in which English words have fertility larger than one and in which English words have fertility zero.
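The fertility of each English word follows directly from an alignment: it is the number of French positions aligned to that word. A minimal Python sketch, with a hypothetical function name and with index 0 reserved for the empty word:

```python
def fertilities(alignment, I):
    """Return phi, where phi[i] is the number of source positions j with a_j = i.

    alignment : list of target positions a_1..a_J, each in 0..I
                (0 denotes the empty word)
    I         : length of the English sentence
    """
    phi = [0] * (I + 1)
    for i in alignment:
        phi[i] += 1
    return phi
```

For the alignment [1, 1, 0, 3] and I = 3, the first English word has fertility two, the second has fertility zero, the third has fertility one, and one French word is generated by the empty word.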

For training of Model 3 and Model 4, we use an extension of the program GIZA (Al-Onaizan et al., 1999). Since there is no efficient way in these models to avoid the explicit summation over all alignments in the EM algorithm, the counts are collected only over a subset of promising alignments. No efficient algorithm is known for computing the Viterbi alignment for Models 3 and 4. Therefore, the Viterbi alignment is computed only approximately using the method described in (Brown et al., 1993). Models 1-4 are trained in succession, with the final parameter values of one model serving as the starting point for the next.

A special problem in Model 3 and Model 4 concerns the deficiency of the model. This results in problems in re-estimation of the parameter which describes the fertility of the empty word. In normal EM training, this parameter is steadily decreasing, producing too many alignments with the empty word. Therefore we set the probability for aligning a source word with the empty word at a suitably chosen constant value.

As in the HMM, we can easily extend the dependencies in the alignment model of Model 4 using the word class of the previous English word $E = G(e_{i'})$, or the word class of the French word $F = G(f_j)$ (Brown et al., 1993).

5 Including a Manual Dictionary

We propose here a simple method to make use of a bilingual dictionary as an additional knowledge source in the training process by extending the training corpus with the dictionary entries. Thereby, the dictionary is used already in EM training and can improve not only the alignment for words which are in the dictionary but indirectly also for other words. The additional sentences in the training corpus are weighted with a factor $F_{lex}$ during the EM training of the lexicon probabilities.

We assign the dictionary entries which really co-occur in the training corpus a high weight $F_{lex}$ and the remaining entries a very low weight. In our experiments we use $F_{lex} = 10$ for the co-occurring dictionary entries, which is equivalent to adding every such dictionary entry ten times to the training corpus.
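The corpus extension can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the function name and data layout are hypothetical, dictionary entries are taken to be single word pairs, and the weight `f_low` for entries that never co-occur in the corpus is a placeholder value chosen for the example.

```python
def build_weighted_corpus(corpus, dictionary, f_lex=10.0, f_low=0.0):
    """Return (src_words, tgt_words, weight) triples for EM training.

    corpus     : list of (src_words, tgt_words) sentence pairs (weight 1)
    dictionary : list of (src_word, tgt_word) entries
    f_lex      : weight for entries that co-occur in some sentence pair
    f_low      : weight for all other entries (hypothetical default)
    """
    cooccurring = {(f, e) for src, tgt in corpus for f in src for e in tgt}
    weighted = [(src, tgt, 1.0) for src, tgt in corpus]
    for f, e in dictionary:
        weight = f_lex if (f, e) in cooccurring else f_low
        weighted.append(([f], [e], weight))
    return weighted
```

During the E-step, the fractional lexicon counts collected from each pair would then be multiplied by its weight before they are accumulated.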

6 The Alignment Template System

The statistical machine-translation method described in (Och et al., 1999) is based on a word-aligned training corpus and thereby makes use of single-word based alignment models. The key elements of this approach are the alignment templates, which are pairs of phrases together with an alignment between the words within the phrases. The advantage of the alignment template approach over word-based statistical translation models is that word context and local re-orderings are explicitly taken into account. We typically observe that this approach produces better translations than the single-word based models. The alignment templates are automatically trained using a parallel training corpus. For more information about the alignment template approach see (Och et al., 1999).

7 Results

We present results on the Verbmobil Task, which is a speech translation task in the domain of appointment scheduling, travel planning, and hotel reservation (Wahlster, 1993).

We measure the quality of the above-mentioned alignment models with respect to alignment quality and translation quality.

To obtain a reference alignment for evaluating alignment quality, we manually aligned about 1.4 percent of our training corpus. We allowed the humans who performed the alignment to specify two different kinds of alignments: an S (sure) alignment, which is used for alignments which are unambiguous, and a P (possible) alignment, which is used for alignments which might or might not exist. The P relation is used especially to align words within idiomatic expressions, free translations, and missing function words. It is guaranteed that $S \subseteq P$. Figure 1 shows an example of a manually aligned sentence with S and P relations. The human-annotated alignment does not prefer any translation direction and may therefore contain many-to-one and one-to-many relationships. The annotation has been performed by two annotators, producing sets $S_1$, $P_1$, $S_2$, $P_2$. The reference alignment is produced by forming the intersection of the sure alignments ($S = S_1 \cap S_2$) and the union of the possible alignments ($P = P_1 \cup P_2$).

The quality of an alignment $A = \{(j, a_j)\}$ is measured using the following alignment error rate:

\[ AER(S, P; A) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|} \]
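The merging of the two annotations and the alignment error rate can be sketched in Python with sets of (j, i) links. The function names are hypothetical, and the denominator |A| + |S| is the commonly used form of this measure.

```python
def reference_alignment(s1, p1, s2, p2):
    """Merge two annotators: intersect the sure links, unite the possible links."""
    sure = s1 & s2
    possible = p1 | p2 | sure  # guards the property that S is a subset of P
    return sure, possible


def alignment_error_rate(sure, possible, hypothesis):
    """AER(S, P; A) = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    All arguments are sets of (j, i) links; A and S must not both be empty.
    """
    a_s = len(hypothesis & sure)
    a_p = len(hypothesis & possible)
    return 1.0 - (a_s + a_p) / (len(hypothesis) + len(sure))
```

An alignment that contains every sure link and otherwise only possible links reaches an error rate of zero, so links inside P are never penalized, while missing sure links and links outside P both raise the error.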

Table: Alignment error rate [%] of the alignment models for the translation directions English into German (German words have fertilities) and German into English.

                 English -> German        German -> English
Dictionary       no      no      yes      no      no      yes
Empty word       no      yes     yes      no      yes     yes
Model 1          17.8    16.9    16.0     22.9    21.7    20.3
Model 2          12.8    12.5    11.7     17.5    17.1    15.7
Model 2 (diag)   11.8    10.5     9.8     16.4    15.1    13.3
Model 3          10.5     9.3     8.5     15.7    14.5    12.1
HMM              10.5     9.2     8.0     14.1    12.9    11.5
Model 4           9.0     7.8     6.5     14.0    12.5    10.8

Table 5: Effect of different alignment models on translation quality.

Alignment model in training    WER [%]    SSER [%]
Model 1                        49.8       22.2
HMM                            47.7       19.3
Model 4                        48.6       16.8

The results are shown in Table 5. We see a clear improvement in translation quality as measured by SSER, whereas WER is more or less the same for all models. The improvement is due to better lexicons and better alignment templates extracted from the resulting alignments.

8 Conclusion

We have evaluated various statistical alignment models by comparing the Viterbi alignment of the model with a human-made alignment. We have shown that by using more sophisticated models the quality of the alignments improves significantly. Further improvements in producing better alignments are expected from using the HMM alignment model to bootstrap the fertility models, from making use of cognates, and from statistical alignment models that are based on word groups rather than single words.

Acknowledgment

This article has been partially supported as part of the Verbmobil project (contract number 01 IV 701 T4) by the German Federal Ministry of Education, Science, Research and Technology.

References

Y. Al-Onaizan, J. Curin, M. Jahr, K. Knight, J. Lafferty, D. Melamed, F. J. Och, D. Purdy, N. A. Smith, and D. Yarowsky. 1999. Statistical machine translation, final report, JHU workshop. http://www.clsp.jhu.edu/ws99/projects/mt/finalreport/mt-final-report.ps.

L. E. Baum. 1972. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes. Inequalities, 3:1-8.

P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

R. Kneser and H. Ney. 1991. Forming Word Classes by Statistical Clustering for Statistical Language Modelling. In 1. Quantitative Linguistics Conf.

I. D. Melamed. 1998. Manual annotation of translational equivalence: The Blinker project. Technical Report 98-07, IRCS.

S. Nießen, F. J. Och, G. Leusch, and H. Ney. 2000. An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the Second International Conference on Language Resources and Evaluation, pages 39-45, Athens, Greece, May-June.

F. J. Och, C. Tillmann, and H. Ney. 1999. Improved alignment models for statistical machine translation. In Proc. of the Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20-28, University of Maryland, College Park, MD, USA, June.

S. Vogel, H. Ney, and C. Tillmann. 1996. HMM-based word alignment in statistical translation. In COLING '96: The 16th Int. Conf. on Computational Linguistics, pages 836-841, Copenhagen, August.

W. Wahlster. 1993. Verbmobil: Translation of face-to-face dialogs. In Proc. of the MT Summit IV,