Translation
Franz Josef Och and Hermann Ney
Lehrstuhl für Informatik VI, Computer Science Department
RWTH Aachen - University of Technology
D-52056 Aachen, Germany
foch,[email protected] ache n.de
Abstract
In this paper, we present and compare various alignment
models for statistical machine translation. We propose to
measure the quality of an alignment model using the quality
of the Viterbi alignment compared to a manually-produced
alignment and describe a refined annotation scheme to
produce suitable reference alignments. We also compare the
impact of different alignment models on the translation
quality of a statistical machine translation system.
1 Introduction
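In statistical machine translation (SMT) it is necessary to
model the translation probability $\Pr(f_1^J \mid e_1^I)$.
Here $f_1^J = \mathbf{f}$ denotes the (French) source and
$e_1^I = \mathbf{e}$ denotes the (English) target string.
Most SMT models (Brown et al., 1993; Vogel et al., 1996)
try to model word-to-word correspondences between source
and target words using an alignment mapping from source
position $j$ to target position $i = a_j$.
We can rewrite the probability $\Pr(f_1^J \mid e_1^I)$ by
introducing the 'hidden' alignments
$a_1^J := a_1 \ldots a_j \ldots a_J$ ($a_j \in \{0, \ldots, I\}$):

$$\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} \Pr(f_j, a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I)$$
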
To allow for French words which do not directly correspond
to any English word, an artificial 'empty' word $e_0$ is
added to the target sentence at position $i = 0$.
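The different alignment models we present provide different
decompositions of $\Pr(f_1^J, a_1^J \mid e_1^I)$. An
alignment $\hat{a}_1^J$ for which

$$\hat{a}_1^J = \arg\max_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I)$$

holds for a specific model is called the Viterbi alignment
of this model.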
In this paper we will describe extensions to the
Hidden-Markov alignment model of (Vogel et al., 1996) and
compare these to Models 1-4 of (Brown et al., 1993). We
propose to measure the quality of an alignment model using
the quality of the Viterbi alignment compared to a
manually-produced alignment. This has the advantage that,
once a reference alignment has been produced, the
evaluation itself can be performed automatically. In
addition, it results in a very precise and reliable
evaluation criterion which is well suited to assess various
design decisions in modeling and training of statistical
alignment models.
It is well known that manually performing a word alignment
is a complicated and ambiguous task (Melamed, 1998).
Therefore, to produce the reference alignment we use a
refined annotation scheme which reduces the complications
and ambiguities occurring in the manual construction of a
word alignment. As we use the alignment models for machine
translation purposes, we also evaluate the resulting
translation quality of different models.
2 Alignment with HMM
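In the Hidden-Markov alignment model we assume a
first-order dependence for the alignments $a_j$ and that
the translation probability depends only on $a_j$ and not
on $a_{j-1}$:

$$\Pr(f_j, a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I) = p(a_j \mid a_{j-1}, I) \cdot p(f_j \mid e_{a_j})$$

Later, we will describe a refinement with a dependence on
$e_{a_{j-1}}$ in the alignment model. Putting everything
together, we have the following basic HMM-based model:

$$\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} \left[ p(a_j \mid a_{j-1}, I) \cdot p(f_j \mid e_{a_j}) \right]$$

with the alignment probability $p(i \mid i', I)$ and the
translation probability $p(f \mid e)$. To find a Viterbi
alignment for the HMM-based model we resort to dynamic
programming (Vogel et al., 1996).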
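
The dynamic-programming search for the Viterbi alignment can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables `trans_prob` and `align_prob` are hypothetical stand-ins for trained tables of p(f|e) and p(i|i',I), positions are 0-based, the first alignment position is assumed to be uniformly distributed, and the empty word is ignored.

```python
import math


def _log(p):
    """Logarithm that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else -math.inf


def viterbi_alignment(src, tgt, trans_prob, align_prob):
    """Return the best alignment a_1..a_J as 0-based target positions.

    trans_prob(f, e) stands for p(f | e), align_prob(i, i_prev, I) for
    p(i | i', I); both are hypothetical callables supplied by the caller.
    Assumptions of this sketch: non-empty sentences, uniform start.
    """
    J, I = len(src), len(tgt)
    # score[j][i]: best log-probability of a path that ends in a_j = i
    score = [[-math.inf] * I for _ in range(J)]
    back = [[0] * I for _ in range(J)]
    for i in range(I):
        score[0][i] = _log(1.0 / I) + _log(trans_prob(src[0], tgt[i]))
    for j in range(1, J):
        for i in range(I):
            cands = [score[j - 1][ip] + _log(align_prob(i, ip, I))
                     for ip in range(I)]
            best_prev = max(range(I), key=cands.__getitem__)
            score[j][i] = cands[best_prev] + _log(trans_prob(src[j], tgt[i]))
            back[j][i] = best_prev
    # Follow the back-pointers from the best final state.
    alignment = [0] * J
    alignment[-1] = max(range(I), key=score[-1].__getitem__)
    for j in range(J - 1, 0, -1):
        alignment[j - 1] = back[j][alignment[j]]
    return alignment


# Toy lexicon (made-up numbers) and uniform jump probabilities.
toy_lexicon = {("das", "the"): 0.9, ("das", "house"): 0.1,
               ("Haus", "the"): 0.1, ("Haus", "house"): 0.9}
toy_result = viterbi_alignment(["das", "Haus"], ["the", "house"],
                               lambda f, e: toy_lexicon[(f, e)],
                               lambda i, ip, I: 1.0 / I)
print(toy_result)  # [0, 1]
```

The search costs O(J * I^2) operations per sentence pair, since every target position is compared with every possible predecessor at each source position.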
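The training of the HMM is done with the EM algorithm. In
the E-step the lexical and alignment counts for one
sentence pair $(\mathbf{f}, \mathbf{e})$ are calculated:

$$c(f \mid e; \mathbf{f}, \mathbf{e}) = \sum_{a} \Pr(a \mid \mathbf{f}, \mathbf{e}) \sum_{j} \delta(f, f_j)\, \delta(e, e_{a_j})$$

$$c(i \mid i', I; \mathbf{f}, \mathbf{e}) = \sum_{a} \Pr(a \mid \mathbf{f}, \mathbf{e}) \sum_{j} \delta(i', a_{j-1})\, \delta(i, a_j)$$

In the M-step the lexicon and alignment probabilities are:

$$p(f \mid e) \propto \sum_{s} c(f \mid e; \mathbf{f}^{(s)}, \mathbf{e}^{(s)})$$

$$p(i \mid i', I) \propto \sum_{s} c(i \mid i', I; \mathbf{f}^{(s)}, \mathbf{e}^{(s)})$$
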
To avoid the summation over all possible alignments $a$,
(Vogel et al., 1996) use the maximum approximation where
only the Viterbi alignment path is used to collect counts.
We used the Baum-Welch algorithm (Baum, 1972) to train the
model parameters in our experiments. Thereby it is possible
to perform an efficient training using all alignments.
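
To illustrate how counts can be collected over all alignments without enumerating them, the following sketch computes the posterior probability of each source position being aligned to each target position with the forward-backward recursions. It is a simplified illustration under stated assumptions: `trans_prob` and `align_prob` are hypothetical callables for p(f|e) and p(i|i',I), the start position is assumed uniform, and no scaling is applied, so very long sentences would underflow.

```python
def state_posteriors(src, tgt, trans_prob, align_prob):
    """Return gamma[j][i] = Pr(a_j = i | f, e) for 0-based j and i.

    These posteriors are the fractional counts that replace the single
    Viterbi path when all alignments are taken into account.
    """
    J, I = len(src), len(tgt)
    # Forward pass: alpha[j][i] = Pr(f_1..f_j, a_j = i | e)
    alpha = [[0.0] * I for _ in range(J)]
    for i in range(I):
        alpha[0][i] = (1.0 / I) * trans_prob(src[0], tgt[i])
    for j in range(1, J):
        for i in range(I):
            reach = sum(alpha[j - 1][ip] * align_prob(i, ip, I)
                        for ip in range(I))
            alpha[j][i] = reach * trans_prob(src[j], tgt[i])
    # Backward pass: beta[j][i] = Pr(f_{j+1}..f_J | a_j = i, e)
    beta = [[1.0] * I for _ in range(J)]
    for j in range(J - 2, -1, -1):
        for ip in range(I):
            beta[j][ip] = sum(align_prob(i, ip, I)
                              * trans_prob(src[j + 1], tgt[i])
                              * beta[j + 1][i]
                              for i in range(I))
    total = sum(alpha[J - 1])  # Pr(f | e)
    return [[alpha[j][i] * beta[j][i] / total for i in range(I)]
            for j in range(J)]
```

Summing `gamma[j][i]` over all positions where the source word is `f` and the target word is `e` gives the expected lexical count for that word pair in one sentence pair.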
To make the alignment parameters independent from absolute
word positions we assume that the alignment probabilities
$p(i \mid i', I)$ depend only on the jump width $(i - i')$.
Using a set of non-negative parameters $\{c(i - i')\}$, we
can write the alignment probabilities in the form:

$$p(i \mid i', I) = \frac{c(i - i')}{\sum_{i''=1}^{I} c(i'' - i')}$$

This form ensures that for each word position $i'$,
$i' = 1, \ldots, I$, the alignment probabilities satisfy
the normalization constraint.
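
The jump-width parametrization can be made concrete with a short sketch. The count values in the usage note are made up for illustration, positions are 0-based here, and the fallback to a uniform row when all counts are zero is an assumption of this sketch rather than part of the model description.

```python
def jump_alignment_table(jump_counts, I):
    """Build table[i_prev][i] = c(i - i_prev) / sum_k c(k - i_prev).

    jump_counts: dict mapping a jump width (int) to a non-negative
    parameter c; missing jump widths are treated as zero.
    """
    table = []
    for i_prev in range(I):
        weights = [jump_counts.get(i - i_prev, 0.0) for i in range(I)]
        norm = sum(weights)
        if norm == 0.0:
            # Assumption: fall back to a uniform row if nothing was observed.
            table.append([1.0 / I] * I)
        else:
            table.append([w / norm for w in weights])
    return table
```

For example, with `jump_counts = {-1: 1.0, 0: 2.0, 1: 5.0, 2: 2.0}` and `I = 3`, the row for the first position is `[2/9, 5/9, 2/9]`. Each row is normalized separately because the set of reachable jump widths depends on the previous position.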
Extension: refined alignment model
The count table $c(i - i')$ has only $2 \cdot I_{\max} - 1$
entries. This might be suitable for small corpora, but for
large corpora it is possible to make a more refined model
of $\Pr(a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I)$. In
particular, we analyzed the effect of a dependence on
$e_{a_{j-1}}$ or $f_j$. As a dependence on all English
words would result in a huge number of alignment
parameters, we use, as in (Brown et al., 1993), equivalence
classes $G$ over the English and the French words. Here $G$
is a mapping of words to classes. This mapping is trained
automatically using a modification of the method described
in (Kneser and Ney, 1991). We use 50 classes in our
experiments. The most general form of alignment
distribution that we consider in the HMM is
$p(a_j - a_{j-1} \mid G(e_{a_{j-1}}), G(f_j), I)$.
Extension: empty word
In the original formulation of the HMM alignment model
there is no 'empty' word which generates French words
having no directly aligned English word. A direct inclusion
of an empty word in the HMM model by adding an $e_0$ as in
(Brown et al., 1993) is not possible if we want to model
the jump distances $i - i'$, as the position $i = 0$ of the
empty word is chosen arbitrarily. Therefore, to introduce
the empty word we extend the HMM network by $I$ empty words
$e_{I+1}^{2I}$. The English word $e_i$ has a corresponding
empty word $e_{i+I}$. The position of the empty word
encodes the previously visited English word.
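We enforce the following constraints for the transitions
in the HMM network ($i \le I$, $i' \le I$):

$$p(i + I \mid i', I) = p_0^H \cdot \delta(i, i')$$
$$p(i + I \mid i' + I, I) = p_0^H \cdot \delta(i, i')$$
$$p(i \mid i' + I, I) = p(i \mid i', I)$$

The parameter $p_0^H$ is the probability of a transition
to the empty word. In our experiments we set $p_0^H = 0.2$.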
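
One way to realize a network in which each empty word remembers the previously visited English word is sketched below. States `0..I-1` are the English words and states `I..2I-1` their empty-word copies (0-based). The rescaling of the word-to-word transitions by `1 - p0`, which keeps every row normalized, is an assumption of this sketch and is not stated in the text; the function name and argument layout are likewise hypothetical.

```python
def extend_with_empty_words(base, p0=0.2):
    """Extend an I x I transition table to a 2I x 2I table.

    base[i_prev][i] stands for p(i | i_prev, I). From word i_prev or from
    its empty copy i_prev + I, the only reachable empty word is i_prev + I
    (probability p0); real words are reached as if coming from i_prev.
    """
    I = len(base)
    ext = [[0.0] * (2 * I) for _ in range(2 * I)]
    for i_prev in range(I):
        # The real word and its empty copy share the same outgoing row.
        for state in (i_prev, i_prev + I):
            for i in range(I):
                ext[state][i] = (1.0 - p0) * base[i_prev][i]
            ext[state][i_prev + I] = p0
    return ext
```

Because the empty state `i_prev + I` has the same outgoing row as `i_prev`, the jump distance of the next real word is still measured from the last aligned English word, no matter how many French words were generated by the empty word in between.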
Smoothing
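For a better estimation of infrequent events we introduce
the following smoothing of alignment probabilities, with
an interpolation weight $\alpha$:

$$p'(a_j \mid a_{j-1}, I) = \alpha \cdot \frac{1}{I} + (1 - \alpha) \cdot p(a_j \mid a_{j-1}, I)$$
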
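
A common way to smooth such a distribution is linear interpolation with a uniform distribution over the I target positions. The sketch below assumes this form; the interpolation weight `alpha` is a hypothetical parameter whose value would have to be tuned.

```python
def smooth_alignment_table(table, alpha):
    """Interpolate every row of an I x I table with the uniform distribution.

    table[i_prev][i] stands for p(i | i_prev, I); alpha in [0, 1] is the
    weight given to the uniform distribution (hypothetical parameter).
    """
    I = len(table)
    return [[alpha / I + (1.0 - alpha) * p for p in row] for row in table]
```

With `alpha = 0.4`, a row `[1.0, 0.0]` becomes `[0.8, 0.2]`, so transitions that were never observed in training keep a small non-zero probability.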
3 Model 1 and Model 2
Replacing the dependence on $a_{j-1}$ in the HMM alignment
model by a dependence on $j$, we obtain a model which can
be seen as a zero-order Hidden-Markov Model and which is
similar to Model 2 proposed by (Brown et al., 1993).
Assuming a uniform alignment probability
$p(i \mid j, I) = 1/I$, we obtain Model 1.
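Assuming that the dominating factor in the alignment model
of Model 2 is the distance relative to the diagonal line of
the $(j, i)$ plane, the model $p(i \mid j, I)$ can be
structured as follows (Vogel et al., 1996):

$$p(i \mid j, I) = \frac{r(i - j \cdot \frac{I}{J})}{\sum_{i'=1}^{I} r(i' - j \cdot \frac{I}{J})}$$

This model will be referred to as diagonal-oriented
Model 2.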
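
The diagonal-oriented alignment probability can be sketched as a function of the distance between target position i and the diagonal position j * I / J. The weighting function `r` is passed in by the caller; the exponential decay used in the usage note is a made-up example, and treating the distance as a real number without rounding is an assumption of this sketch.

```python
import math


def diagonal_model2(r, I, J):
    """Return table[(i, j)] = r(i - j*I/J) / sum_k r(k - j*I/J).

    Positions i = 1..I and j = 1..J are 1-based, as in the text.
    r is a non-negative function of the distance to the diagonal.
    """
    table = {}
    for j in range(1, J + 1):
        weights = [r(i - j * I / J) for i in range(1, I + 1)]
        norm = sum(weights)
        for i, w in enumerate(weights, start=1):
            table[(i, j)] = w / norm
    return table


# Hypothetical weighting: probability decays with distance to the diagonal.
toy_table = diagonal_model2(lambda d: math.exp(-abs(d)), I=3, J=3)
```

For sentences of equal length the most probable target position for source position j is i = j, and the probabilities for each j sum to one over all target positions.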
4 Model 3 and Model 4
Model: The fertility models of (Brown et al., 1993)
explicitly model the probability $p(\phi \mid e)$ that the
English word $e_i$ is aligned to
$\phi_i = \sum_j \delta(a_j, i)$ French words. Model 3 of
(Brown et al., 1993) is a zero-order alignment model like
Model 2 including in addition fertility parameters. Model 4
of (Brown et al., 1993) is also a first-order alignment
model (along the source positions) like the HMM, but
includes also fertilities. In Model 4 the alignment
position $j$ of an English word depends on the alignment
position of the previous English word (with non-zero
fertility) $j'$. It models a jump distance $j - j'$ (for
consecutive English words) while in the HMM a jump distance
$i - i'$ (for consecutive French words) is modeled. The
full description of Model 4 (Brown et al., 1993) is rather
complicated, as the cases in which English words have
fertility larger than one and in which English words have
fertility zero have to be considered.
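
The fertility of an English word, i.e. the number of French words aligned to it, follows directly from an alignment. The short sketch below assumes the alignment is given as a list of target positions, one per French word, with position 0 reserved for the empty word; the function name is hypothetical.

```python
def fertilities(alignment, I):
    """Return phi[0..I], where phi[i] counts the source words aligned to i.

    alignment[j] is the target position a_j of the j-th source word;
    position 0 denotes the empty word, positions 1..I the English words.
    """
    phi = [0] * (I + 1)
    for a_j in alignment:
        phi[a_j] += 1
    return phi
```

For the alignment `[1, 1, 0, 3]` with `I = 3`, the result is `[1, 2, 0, 1]`: the first English word has fertility two, the second has fertility zero, and one French word is generated by the empty word.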
For training of Model 3 and Model 4, we use an extension of
the program GIZA (Al-Onaizan et al., 1999). Since there is
no efficient way in these models to avoid the explicit
summation over all alignments in the EM algorithm, the
counts are collected only over a subset of promising
alignments. No efficient algorithm is known for computing
the Viterbi alignment for Models 3 and 4. Therefore, the
Viterbi alignment is computed only approximately using the
method described in (Brown et al., 1993). Models 1-4 are
trained in succession, with the final parameter values of
one model serving as the starting point for the next.
A special problem in Model 3 and Model 4 concerns the
deficiency of the model. This results in problems in
re-estimation of the parameter which describes the
fertility of the empty word. In normal EM-training, this
parameter is steadily decreasing, producing too many
alignments with the empty word. Therefore we set the
probability for aligning a source word with the empty word
at a suitably chosen constant value.
As in the HMM, we can easily extend the dependencies in the
alignment model of Model 4 using the word class of the
previous English word $E = G(e_{i'})$, or the word class of
the French word $F = G(f_j)$ (Brown et al., 1993).
5 Including a Manual Dictionary
We propose here a simple method to make use of a bilingual
dictionary as an additional knowledge source in the
training process by extending the training corpus with the
dictionary entries. Thereby, the dictionary is used already
in EM-training and can improve not only the alignment for
words which are in the dictionary but indirectly also for
other words. The additional sentences in the training
corpus are weighted with a factor $F_{lex}$ during the
EM-training of the lexicon probabilities.
We assign the dictionary entries which really co-occur in
the training corpus a high weight $F_{lex}$ and the
remaining entries a very low weight. In our experiments we
use $F_{lex} = 10$ for the co-occurring dictionary entries,
which is equivalent to adding every dictionary entry ten
times to the training corpus.
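
The corpus extension can be sketched as a list of weighted sentence pairs whose weights later multiply the counts collected in EM-training. This is an illustration under stated assumptions: dictionary entries are single-word pairs, an entry counts as co-occurring if both words appear in the same sentence pair, and the default `low_weight` of 0.0 for the remaining entries is a hypothetical choice.

```python
def weighted_training_pairs(corpus, dictionary, f_lex=10.0, low_weight=0.0):
    """Return (source_words, target_words, weight) triples for training.

    corpus: list of (source_words, target_words) sentence pairs.
    dictionary: list of (source_word, target_word) entries.
    Regular sentence pairs get weight 1.0; dictionary entries get f_lex
    if they co-occur in some sentence pair, otherwise low_weight.
    """
    pairs = [(src, tgt, 1.0) for src, tgt in corpus]
    for f, e in dictionary:
        cooccurs = any(f in src and e in tgt for src, tgt in corpus)
        weight = f_lex if cooccurs else low_weight
        pairs.append(([f], [e], weight))
    return pairs
```

A weight of 10 on a one-word pair has the same effect on the lexicon counts as adding that pair ten times to the corpus.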
6 The Alignment Template System
The statistical machine-translation method described in
(Och et al., 1999) is based on a word-aligned training
corpus and thereby makes use of single-word based alignment
models. The key elements of this approach are the alignment
templates, which are pairs of phrases together with an
alignment between the words within the phrases. The
advantage of the alignment template approach over
word-based statistical translation models is that word
context and local re-orderings are explicitly taken into
account. We typically observe that this approach produces
better translations than the single-word based models. The
alignment templates are automatically trained using a
parallel training corpus. For more information about the
alignment template approach see (Och et al., 1999).
7 Results
We present results on the Verbmobil Task, which is a speech
translation task in the domain of appointment scheduling,
travel planning, and hotel reservation (Wahlster, 1993).
We measure the quality of the above-mentioned alignment
models with respect to alignment quality and translation
quality.
To obtain a reference alignment for evaluating alignment
quality, we manually aligned about 1.4 percent of our
training corpus. We allowed the humans who performed the
alignment to specify two different kinds of alignments: an
S (sure) alignment, which is used for alignments which are
unambiguous, and a P (possible) alignment, which is used
for alignments which might or might not exist. The P
relation is used especially to align words within idiomatic
expressions, free translations, and missing function words.
It is guaranteed that $S \subseteq P$. Figure 1 shows an
example of a manually aligned sentence with S and P
relations. The human-annotated alignment does not prefer
any translation direction and may therefore contain
many-to-one and one-to-many relationships. The annotation
has been performed by two annotators, producing sets $S_1$,
$P_1$, $S_2$, $P_2$. The reference alignment is produced by
forming the intersection of the sure alignments
($S = S_1 \cap S_2$) and the union of the possible
alignments ($P = P_1 \cup P_2$).
The quality of an alignment $A = \{(j, a_j)\}$ is measured
using the following alignment error rate:

$$\mathrm{AER}(S, P; A) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$$

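
The construction of the reference alignment and the error rate can be sketched with Python sets of (j, i) links. The function names are hypothetical, and the sketch assumes that each annotator's sure links are contained in that annotator's possible links and that the hypothesis and sure sets are not both empty.

```python
def reference_alignment(s1, p1, s2, p2):
    """Combine two annotations: intersect the sure links, unite the possible links."""
    return s1 & s2, p1 | p2


def alignment_error_rate(sure, possible, hypothesis):
    """AER(S, P; A) = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    All arguments are sets of (j, i) links with S a subset of P.
    """
    a = set(hypothesis)
    return 1.0 - (len(a & sure) + len(a & possible)) / (len(a) + len(sure))
```

With `S = {(1, 1), (2, 2)}`, `P = S | {(3, 2)}` and a hypothesis `A = {(1, 1), (2, 2), (3, 3)}`, the error rate is `1 - (2 + 2) / (3 + 2) = 0.2`. A hypothesis that contains every sure link and only possible links has an error rate of zero.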
[...] into German (German words have fertilities) and German into English.

                 English→German        German→English
Dictionary       no     no     yes     no     no     yes
Empty Word       no     yes    yes     no     yes    yes
Model 1          17.8   16.9   16.0    22.9   21.7   20.3
Model 2          12.8   12.5   11.7    17.5   17.1   15.7
Model 2 (diag)   11.8   10.5    9.8    16.4   15.1   13.3
Model 3          10.5    9.3    8.5    15.7   14.5   12.1
HMM              10.5    9.2    8.0    14.1   12.9   11.5
Model 4           9.0    7.8    6.5    14.0   12.5   10.8

Table 5: Effect of different alignment models on
translation quality.

Alignment Model in Training   WER[%]   SSER[%]
Model 1                       49.8     22.2
HMM                           47.7     19.3
Model 4                       48.6     16.8

The results are shown in Table 5. We see a clear
improvement in translation quality as measured by SSER,
whereas WER is more or less the same for all models. The
improvement is due to better lexicons and better alignment
templates extracted from the resulting alignments.
8 Conclusion
We have evaluated various statistical alignment models by
comparing the Viterbi alignment of the model with a
human-made alignment. We have shown that by using more
sophisticated models the quality of the alignments improves
significantly. Further improvements in producing better
alignments are expected from using the HMM alignment model
to bootstrap the fertility models, from making use of
cognates, and from statistical alignment models that are
based on word groups rather than single words.
Acknowledgment
This article has been partially supported as part of the
Verbmobil project (contract number 01 IV 701 T4) by the
German Federal Ministry of Education, Science, Research and
Technology.
References
Y. Al-Onaizan, J. Curin, M. Jahr, K. Knight, J. Lafferty,
I. D. Melamed, F. J. Och, D. Purdy, N. A. Smith, and
D. Yarowsky. 1999. Statistical machine translation, final
report, JHU workshop.
http://www.clsp.jhu.edu/ws99/projects/mt/finalreport/mt-final-report.ps.
L. E. Baum. 1972. An Inequality and Associated Maximization
Technique in Statistical Estimation for Probabilistic
Functions of Markov Processes. Inequalities, 3:1-8.
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and
R. L. Mercer. 1993. The mathematics of statistical machine
translation: Parameter estimation. Computational
Linguistics, 19(2):263-311.
R. Kneser and H. Ney. 1991. Forming Word Classes by
Statistical Clustering for Statistical Language Modelling.
In 1. Quantitative Linguistics Conf.
I. D. Melamed. 1998. Manual annotation of translational
equivalence: The Blinker project. Technical Report 98-07,
IRCS.
S. Nießen, F. J. Och, G. Leusch, and H. Ney. 2000. An
evaluation tool for machine translation: Fast evaluation
for MT research. In Proceedings of the Second International
Conference on Language Resources and Evaluation, pages
39-45, Athens, Greece, May-June.
F. J. Och, C. Tillmann, and H. Ney. 1999. Improved
alignment models for statistical machine translation. In
Proc. of the Joint SIGDAT Conf. on Empirical Methods in
Natural Language Processing and Very Large Corpora, pages
20-28, University of Maryland, College Park, MD, USA, June.
S. Vogel, H. Ney, and C. Tillmann. 1996. HMM-based word
alignment in statistical translation. In COLING '96: The
16th Int. Conf. on Computational Linguistics, pages
836-841, Copenhagen, August.
W. Wahlster. 1993. Verbmobil: Translation of face-to-face
dialogs. In Proc. of the MT Summit IV,