toEvaluateStatistial MahineTranslation
MannyRayner,PaulaEstrella,PierretteBouillon
UniversityofGeneva,TIM/ISSCO
40bvdduPont-d'Arve,CH-1211Geneva4,Switzerland
fEmmanuel.Rayner,Paula.Estrella,Pierrette.Bouillongunige.h
BethAnnHokey MailStop19-26,UCSCUARC
NASAAmesResearhCenter,MoffettField,CA940351000
bahokeyus.edu
YukieNakao
LINA,NantesUniversity,2,ruedelaHoussiniere,BP9220844322NantesCedex03
yukie.nakaouniv-nantes.fr
Abstrat
Although Statistial Mahine Translation (SMT) is now the dominant paradigm withinMahineTranslation,wearguethat
itis farfrom learthat it anoutperform Rule-BasedMahineTranslation(RBMT) on small-to medium-voabulary
applia-tionswherehighpreision ismore impor-tant than reall. Apartiularly important pratialexampleismedialspeeh trans-lation. We report the results of
exper-iments where we ongured the various grammarsandrule-setsinanOpenSoure medium-voabularymulti-lingualmedial
speehtranslationsystemtogeneratelarge aligned bilingual orpora for English !
Frenh and English ! Japanese, whih werethenusedtotrainSMTmodelsbased on the ommon ombination of Giza++,
Moses and SRILM. The resulting SMTs were unable fully to reprodue the per-formaneoftheRBMT,withperformane topping out, even for English ! Frenh,
withlessthan70%oftheSMTtranslations of previously unseen sentenes agreeing with RBMTtranslations. When the
out-puts of the two systems differed, human judges reported the SMT result as fre-quently being worse than the RBMT
re-sult,andhardly everbetter; moreover,the addedrobustnessoftheSMTonlyyielded asmallimprovementinreall,withalarge
penaltyinpreision.
1 Introdution
WhenStatistialMahineTranslation(SMT)was
rst introdued inthe early 90s, itenountered a hostilereeption,andmanypeopleintheresearh ommunitywereunwillingtobelieveitouldever
be a serious ompetitor to symboli approahes (f.forexample(Arnoldetal.,1994)).The pendu-lumhasnowswungallthewaytotheotherendof
thesale;rightnow,theprevailingwisdomwithin the researh ommunity is that SMT is the only truly viablearhiteture, and that rule-based
ma-hinetranslation(RBMT)isultimatelydoomedto failure. Inthis paper, one of our initialonerns willbetoargueforaompromiseposition. Inour
opinion,theinitialseptiismaboutSMTwasnot groundless;theargumentspresentedagainstit of-tentooktheformofexamplesinvolvingdeep
lin-guistireasoning,whih,itwaslaimed,wouldbe hardtoaddressusingsurfaemethods.Proponents of RBMT had, however, greatly underestimated
theextent towhihSMTwould beable totakle theproblemof robustness, whereitappearstobe farmorepowerfulthanRBMT.Formostmahine
translation appliations, robustness is the entral issue,soSMT'surrentpreemineneishardly sur-prising.
Evenforthelarge-voabularytaskswhereSMT doesbest,thesituationisbynomeansaslearas one might imagine: aording to (Wilks, 2007),
purely statistial systems are still unable to out-perform SYSTRAN.In this paper, we will
how-everbe moreonernedwithlimited-domain MT tasks,whererobustnessisnotthekeyrequirement, and aurayisparamount. Animmediate
lishingitselfasananappliationareaofsome sig-niane (Bouillon et al., 2006; Bouillon et al., 2008a). Translationinmedialappliationsneeds
to be extremely aurate, sine mistranslations an have serious oreven fatal onsequenes. At the panel disussionat the2008 COLING
work-shoponsafety-ritial speehtranslation (Rayner etal.,2008), theonsensusopinion, basedon in-putfrompratisingphysiians,wasthatan
appro-priate evaluation metri for medial appliations wouldbeheavilyslantedtowardsauray,as op-posedtorobustness. Ifthemetriisnormalisedso
astoaward0pointsfornotranslation,and1point for a orret translation, the estimate wasthat a suitable sore for an inorret translation would
besomethingbetween25and100 points. With theserequirements,itseemsunlikelythatarobust, broad-overage arhiteture has muh hane of
suess.Theobviousstrategyistobuilda limited-domainontrolled-languag esystem,andtuneitto thepointwhereaurayreahesthedesiredlevel.
Forsystems ofthis kind, itis atleast
oneiv-ablethatRBMTmaybeabletooutperformSMT. Thenextquestionishowtoinvestigatetheissues in a methodologially even-handed way. A few
studies, notably(Seneffetal.,2006),suggestthat rule-based translationmayinfatbepreferablein these ases. (Another related experiment is
de-sribed in (Dugast et al., 2008), though this was arried out ina large-voabulary system). These studies,however,havenotbeenwidelyited.One
possible explanation is suspiion about method-ologial issues. Seneff andherolleagues trained their SMT system on 20000 sentene pairs, a smallnumberbythestandardsofSMT.Itisa
pri-orinotimplausiblethat moretraining datawould have enabled themtoreate anSMTsystem that wasasgoodas,orbetterthan,therule-based
sys-tem.
In this paper, our primary goal is to take this kindofobjetionseriously,anddevelopa
method-ology designed toenable a tight omparison be-tween rule-based and statistial arhitetures. In partiular, we wish to examine the widely be-lieved laim that SMT is now inherently better
than RBMT. Inorder to do this, we start with a limited-domain RBMTsystem;weuseitto
auto-matiallygeneratealargeorpusofalignedpairs, whih isusedtotrain aorresponding SMT sys-tem.Wethenomparetheperformaneofthetwo
Ourargumentwill be that this situation
essen-tiallyrepresentsanupperboundforwhatis possi-bleusingtheSMTapproah inalimiteddomain. Ithas beenwidely remarked that quality, aswell asquantity,oftrainingdata isimportant forgood
SMT; in many projets, signiant effort is ex-pended tolean theoriginal training data. Here, sinethedataisautomatiallygeneratedbya
rule-based system, we an be sure that it is already ompletely lean(inthesenseof beinginternally onsistent), and weangenerate aslarge a
quan-tity of it as we require. The appliation, more-over,usesonlyasmallishvoabulary andafairly onstrainedsyntax. IfthederivedSMTsystem is
unabletomaththeoriginal RBMTsystem's per-formane, it seems reasonable to laim that this shows that there are types ofappliations where
RBMTarhiteturesaresuperior.
The experiments desribed have been arried
out using MedSLT, an Open Soure interlingua-based limited-domain medial speeh translation system. The restofthepaperisorganisedas
fol-lows. Setion2providesbakgroundonthe Med-SLTsystem. Setion3 desribes the experimen-talframework,andSetion4theresultsobtained.
Setion5onludes.
2 TheMedSLTSystem
MedSLT (Bouillon et al., 2005; Bouillon et al.,
2008b)isamedium-voabulary interlingua-based OpenSourespeehtranslationsystemfor dotor-patient medial examination questions, whih
provides any-language-to-any-l an gua ge transla-tion apabilities for all languages in the set En-glish, Frenh, Japanese, Arabi, Catalan. Both
speehreognition andtranslation arerule-based. SpeehreognitionrunsontheNuane8.5 reog-nition platform, with grammar-based language
modelsbuiltusingtheOpenSoureRegulus om-piler. As desribed in (Rayner et al., 2006), eahdomain-speilanguage modelisextrated
from a general resoure grammar using orpus-basedmethodsdrivenbyaseedorpusof domain-spei examples. The seed orpus, whih typi-ally ontains between 500 and 1500 utteranes,
is then used a seond time to add probabilisti weights to the grammar rules; this substantially
lan-riedoutareshowninFigure1.
Language Voab WER SemER
English 447 6% 11%
Frenh 1025 8% 10%
Japanese 422 3% 4%
Table 1: Reognition performane for English, Frenh and Japanese MedSLT reognisers. V o-ab = number of surfae words in soure
lan-guagereogniservoabulary; WER=Word Er-ror Rate for soure language reogniser, on in-overagematerial;SemER=semantierrorrate
(proportionofutteranesfailingtoprodueorret interlingua)forsourelanguagereogniser,on in-overagematerial.
Atrun-time, thereogniser produesa soure-langage semanti representation. This is rst translated by one set of rulesinto an interlingual
form, and then by a seond setinto a target lan-guage representation. A target-language Regu-lusgrammar,ompiledintogenerationform,turns
this intoone ormorepossible surfaestrings, af-ter whih a set of generation preferenes piks one out. Finally, theseleted string isrealised in
spokenform. Robustnessissuesareaddressedby means of a bak-up statistial reogniser, whih drivesa robust embedded help system. The
pur-pose of the help system (Chatzihrisas et al., 2006)istoguidetheusertowardssupported ov-erage; it performs approximate mathing of
out-put fromthestatistial reogniseragain a library ofsenteneswhihhavebeenmarkedasorretly proessed during system development, and then
presentsthelosestmathestotheuser.
Examplesoftypial Englishdomain sentenes andtheirtranslationsintoFrenhandJapaneseare showninFigure2.
3 Experimentalframework
In the literature on language modelling, there is
a known tehnique for bootstrapping a statisti-allanguagemodel(SLM)fromagrammar-based language model (GLM). The grammar whih formsthebasis oftheGLMissampledrandomly
inordertoreateanarbitrarilylargeorpusof ex-amples; these examples are then used asa
train-ingorpustobuildtheSLM(Jurafskyetal.,1995; Jonson,2005).Weadaptthisproessina straight-forward way to onstrut an SMT for a given
mar,thesoure-to-interlingu atranslationrules,the interlingua-to-targ et-l an gua ge rules, and the tar-getlanguagegenerationgrammar. Westartinthe
same way,usingthesoure languagegrammarto build a randomly generated soure language or-pus; asshown in (Hokey et al., 2008), it is
im-portanttohaveaprobabilisti grammar. Wethen use the omposition of the other omponents to attempttotranslateeahsourelanguagesentene
into a target language equivalent, disarding the examples for whih no translation is produed. The result is an aligned bilingual orpus of
ar-bitrary size, whih anbe used to train an SMT model.
We used this method to generate aligned or-porafor thetwoMedSLTlanguagepairs English
!FrenhandEnglish!Japanese. Foreah lan-guagepair,werstgeneratedone million soure-languageutteranes;wenextlteredthemtokeep
only examples whih werefullsentenes, as op-posed to elliptial phrases, and nally used the translationrulesandtarget-languagegeneratorsto
attempt to translate eah sentene. This reated approximately 305K aligned sentene-pairs for English!Frenh(1901KwordsEnglish,1993K
words Frenh), and 311K aligned sentene-pairs for English ! Japanese (1941K words English, 2214K words Japanese). We held out 2.5% of
eah set as development data, and 2.5% as test data. UsingGiza++,MosesandSRILM(Ohand Ney,2000;Koehnetal.,2007;Stolke,2002),we
trainedSMTmodelsfrominreasingly large sub-setsofthetrainingportion,usingthedevelopment portionintheusualwaytooptimizeparameter
val-ues.Finally,weusedtheresultingmodelsto trans-latethetestportion.
Ourprimary goalwastomeasure theextent to
whihthederived versionsof theSMTwereable toapproximate theoriginal RBMTondatawhih waswithintheRBMT'soverage. Thereisa sim-pleandnaturalwaytoperformthismeasurement:
weapplytheBLEUmetri(Papinenietal.,2001), with the RBMT's translation taken as the refer-ene. Thismeansthat perfetorrespondene
be-tween the two translations would yield a BLEU soreof1.0.
This raises an important point. The BLEU
Frenh Avez-vousmalaudessusdesyeux?
Japanese Itamiwamenouenoataridesuka?
English Haveyouhadthepainformorethanamonth?
Frenh Avez-vousmaldepuisplusd'unmois? Japanese Ikkagetsuijouitamiwatsuzukimashitaka?
English Isthepainassoiatedwithnausea?
Frenh Avez-vousdesnaus´eesquandvousavezladouleur?
Japanese Itamutohakikewaokorimasuka?
English Doesbrightlightmakethepainworse?
Frenh Ladouleurest-elleaggrav´eeparunelumiereforte? Japanese Akaruihikariwomirutozutsuwahidokunarimasuka?
Table2:ExamplesofEnglishdomainsentenes,andthesystem'stranslationsintoFrenhandJapanese.
theextent towhihitapproximates human trans-lations. It is important tobring in human judge-ment, to evaluate the ases where the SMT and
RBMTdiffer. If,inthese ases, ittranspiredthat humanjudgestypiallythoughtthattheSMTwas asgood as theRBMT,then thedifferene would
bepurely aademi. Weneedtosatisfyourselves thathumanjudgestypiallyasribedifferenes be-tween SMT and RBMT to shortomings in the
SMTratherthanintheRBMT.
Conretely, we olleted all the different hSoure, SMT-translation, RBMT-translationi triples produed during the ourse of the
ex-periments, and extrated those where the two translationsweredifferent. Werandomlyseleted a set of examples for eah language pair, and
asked human judges to lassifythem into one of thefollowingategories:
RBMT better: The RBMTtranslation was
better,intermsofpreservingmeaningand/or beinggrammatiallyorret;
SMTbetter: The SMTtranslation was bet-ter,intermsofpreservingmeaningand/or
be-inggrammatiallyorret;
Similar: Both translations were about equally good OR the soure sentene was meaninglessinthedomain.
In order to show that our metris are intuitively
meaningful,itissufienttodemonstrate thatthe frequenyofourrene ofRBMTbetterisboth
large in omparison to that of SMTbetter, and aounts for a substantial proportion of thetotal population.
Finally, we onsider the question of whether the SMT, whihis apable of translating out-of-grammar sentenes, anadd usefulrobustness to
thebasesystem.Weolleted,fromthesetusedin theexperimentsdesribedin(Rayneretal.,2005), alltheEnglishsenteneswhihfailedtobe
trans-lated into Frenh. We used the best version of the English ! Frenh SMT to translate eah of these sentenes, andasked humanjudges to
eval-uate the translations as being learly aeptable, learlyunaeptable,orborderline.
Inthenextsetion,wepresenttheresultsofthe variousexperimentswehavejustdesribed.
4 Results
We begin with Figure 1, whih shows non-standardBLEUsoresforversionsoftheEnglish ! Frenh SMT system trained on quantities of
datainreasing from14287to285740pairs. As anbeseen,translation performaneimprovesup to about 175000 pairs. After this, it levels out
at around BLEU = 0.90, well below that of the RBMTsystem with whih it is being ompared. Amorediretwaytoreporttheresultissimplyto ounttheproportion oftestsentenes thatarenot
inthetrainingdata,whiharetranslatedsimilarly bytheSMTandtheRBMT.Thisguretopsoutat around68%.
The results strongly suggest that the SMT is unable to repliate the RBMT's performane at all losely even in an easy language-pair,
irre-spetive ofthe amountoftraining dataavailable. Outofuriosity,andtoreassureourselvesthatthe
ut-numberofpairsoftraining sentenesfor English ! Frenh; training and test data both indepen-dentlygenerated,heneoverlapping.
terane seedorpus usedtogeneratethe
gram-mar (f. Setion 2). This produed utterly dis-malperformane,withBLEU=0.52.Theresultis moreinteresting thanitmayrstappear, sine,in
speehreognition, thediffereneinperformane betweentheSLMstrainedfromseedorporaand largegeneratedorporaisfairlysmall(Hokeyet
al.,2008).
Itseemedpossiblethattheimprovementin per-formanewithinreasedquantitiesoftrainingdata might, in effet, only be due to the SMT
fun-tioning as a translation memory; sine training and testdata are independently generated by the same random proess, they overlap,with the
de-gree ofoverlap inreasingasthetraining setgets larger. In order to investigate this hypothesis, werepeatedtheexperimentswithdatawhihhad
been uniqued, so that the training and test sets were ompletely disjoint, and neither ontained any dupliate sentenes
1
. Infat, Figure 2show
thatthegraphforuniqued English!Frenhdata are fairly similar tothe one forthe original non-uniqueddatashowninFigures1. Themain
differ-eneisthat thenon-standardBLEUsore forthe
1
Ouropinionisthatthisisnotarealistiwaytoevaluate theperformaneofasmall-voabularysystem;forexample, inMedSLT,oneexpetsthatatleastsometrainingsentenes, e.g.Whereisthepain?,willalsoourfrequentlyintest data.
numberof pairsoftraining sentenesfor English ! Frenh; training and test data both indepen-dently generated, then uniqued to remove
dupli-atesandoverlappingitems.
uniqued data, unsurprisingly, tops out at a lower level, reeting thefatthat atranslation
mem-oryeffetdoesindeedourtosomeextent.
Results for English ! Japanese showed the same trendsasEnglish!Frenh, butweremore
pronouned. Table 3 ompares the performane of the best versions of the SMTs for the two language-pairs, using both plain and artiially
uniqued data. We see that, with plain data, the English!JapaneseSMTfallseven furthershort ofrepliating theperformaneoftheRBMTthan was the ase for English ! Frenh; BLEU is
only 0.76. The differene between the plain and uniqued versions is also more extreme. BLEU
(0.64)isonsiderablylowerfortheversiontrained onuniqueddata,suggesting thattheSMTforthis language pair is nding it harder to generalise,
and is in effet loser to funtioning asa trans-lation memory. This is onrmed by ounting thesentenes intestdataand notintraining data whih weretranslated similarly by theSMT and
theRBMT;wend thatthegure topsoutatthe verylowvalueof26%.
Asnotedinourdisussion oftheexperimental
English!Frenh
Generated Generated 0.90
Gen/uniqued Gen/uniqued 0.85
English!Japanese
Generated Generated 0.76
Gen/uniqued Gen/uniqued 0.64
Table3:Translationperformane,intermsof
non-standard BLEU metri, for different ongura-tions, training on all available data of the
spe-ied type. Generated = data randomly gener-ated; Gen/uniqued = data randomly generated, then uniqued so that dupliates areremoved and
testandtrainingpairsdonotoverlap.
neessary toestablish whatthe differenes mean intermsofhumanjudgements. Weonsequently
turntoevaluation ofthepairsforwhihtheSMT and theRBMTsystems produed different trans-lationresults.
Table 4 shows theategorisation, aording to theriteriaoutlinedattheendofSetion3,for500
English ! Frenh pairs randomly seleted from thesetofexamples whereRBMTand SMTgave different results; we asked three judges to
evalu-atethemindependently,and ombinedtheir judg-mentsbymajoritydeisionwhereappropriate.We observed a very heavy bias towards the RBMT,
withunanimousagreementamongthejudgesthat theRBMTtranslationwasbetterin201/500ases, and 2-1agreement in a further 127. In ontrast,
there were only 4/500 ases where the judges unanimouslythoughtthattheSMTtranslationwas preferable, with a further 12 supported by a
ma-jority deision. The rest of the table gives the aseswheretheRBMTandSMTtranslationswere judgedthesameorasesinwhihthejudges
dis-agreed; there were only 41/500 ases where no majority deision wasreahed. Our overall on-lusion is that we are justied in evaluating the
SMTbyusingtheBLEUsoreswiththeRBMTas thereferene. Oftheaseswherethetwosystems differ, only a tiny fration, atmost 16/500, indi-ate a better translation from theSMT, and well
overhalf aretranslated better by theRBMT. Ta-ble5presentstypialexamplesofbadSMT
trans-lations in the English ! Frenh pair, ontrasted withthetranslationsproduedbytheRBMT.The rsttwoaregrammatialerrors(asuperuous
ex-tra verb in the rst, and agreement errors inthe
seond). The thirdis an bad hoie of tense and preposition; althoughgrammatial,thetarget lan-guagesentenefailstopreservethemeaning,and,
rather than referring to a 20 day period ending now, insteadrefers toa20 day period sometime inthepast.
Result Agreement Count
RBMTbetter alljudges 201
RBMTbetter majority 127 SMTbetter alljudges 4
SMTbetter majority 12
Similar alljudges 34 Similar majority 81
Unlear disagree 41
Total 500
Table4: ComparisonofRBMTand SMT
perfor-maneon500randomlyhosenEnglish!Frenh translation examples, evaluated independently by threejudges.
Table 6shows asimilar evaluation forthe
En-glish ! Japanese. Here, the differene between theSMTandRBMTversionswassopronouned thatwefeltjustiedintakingasmallersample,of
only150sentenes. Thistime,92/150aseswere unanimously judged as having a better RBMT
translation,andthere wasnotasingleasewhere even a majority found that the SMT was better. Agreement was good here too, with only 8/150 asesnotyieldingatleastamajoritydeision.
Result Agreement Count
RBMTbetter alljudges 92
RBMTbetter majority 32
SMTbetter alljudges 0 SMTbetter majority 0
Similar alljudges 2 Similar majority 16
Unlear disagree 8
Total 150
Table 6: Comparison of RBMT and SMT
per-formane on 150 randomly hosen English ! Japanesetranslationexamples,evaluated indepen-dentlybythreejudges.
RBMTFrenh vosmauxdetetesont-ilsaus´espardeshangementsdetemp´erature
(yourheadahesare-theyausedbyhangesoftemperature)
SMTFrenh avez-vousvosmauxdetetesont-ilsaus´espardeshangementsdetemp´erature (have-youyourheadahesare-theyausedbyhangesoftemperature)
English areheadahesrelievedintheafternoon RBMTFrenh vosmauxdetetediminuent-ilsl'apres-midi
(yourheadahes(MASC-PLUR)derease-MASC-PLURtheafternoon) SMTFrenh vosmauxdetetediminue-t-ellel'apres-midi
(yourheadahes(MASC-PLUR)derease-FEM-SINGtheafternoon)
English haveyouhadthemfortwentydays
RBMTFrenh avez-vousvosmauxdetetedepuisvingtjours
(have-youyourheadahessinetwentydays) SMTFrenh avez-vouseuvosmauxdetetependantvingtjours
(have-youhadyourheadahesduringtwentydays)
Table 5: Examples ofinorretSMTtranslations from Englishinto Frenh. Errors arehighlighted in
bold.
the SMTould have an advantage; robustness is generally astrength ofstatistial approahes. We
return to English ! Frenh in Table 7, whih presentstheresultofrunningthebestSMTmodel on the357 examplesfromthe testsetin(Rayner
et al., 2005) whih failedto be translated by the RBMT.Wedividethesetintoategoriesbasedon thereasonforfailureoftheRBMT.
In the most populous group, translations that
failed due to out of voabulary items, the SMT was, more or less by onstrution, also unable to produe a translation. For the 110 items that
wereoutofgrammaroveragefortheRBMT,the SMTprodued38goodtranslations,andanother4 borderline translations. Therewere50items that
were withinthe soure grammar overage of the RBMT,butfailedsomewhereintransfer and gen-eration proessing. Of those, the majority (32)
representedbadsouresentenes,onsideredas ill-formedforthepurposesofthisexperiment.Out of the remaining items that were within RBMT
grammaroverage,theSMTmanagedtoprodue 5goodtranslationsand1borderlinetranslation.In total, onthemostlenientinterpretation, theSMT
produed 48 additional translations out of 357. Whilethisimprovementinreallisarguablyworth having,itwouldomeattheprieofasubstantial
delineinpreision.
5 DisussionandConlusions
Wehavepresentedanovelmethodologyfor om-paring RBMT and SMT, and tested it on a
spe-Result Count
Outofvoabulary
Badtranslation 187
Outofsouregrammaroverage
Goodtranslation 38
Badtranslation 44 Borderlinetranslation 4 Badsouresentene 34
Insouregrammaroverage
Goodtranslation 5 Badtranslation 12
Borderlinetranslation 1 Badsouresentene 32
Total 357
Table7: English!Frenh SMTperformaneon examplesfromthetestsetwhihfailedtobe
trans-latedbytheRBMT,evaluatedbyonejudge.
i pair ofRBMTand SMT arhitetures. Our laim is that these results show that the version of SMTusedhereisnotinfatapableof
repro-duingtheoutputoftheRBMTsystem. Although therehasbeensomeinterestinattempting totrain SMTsystems fromRBMToutput,theevaluation issuesthatarisewhenomparingSMTandRBMT
versions of a high-preision limited-domain sys-tem aredifferent from those arising inmost MT
Thisisnot,infat,true.
Whenwehavedisussedthemethodologywith people who work primarily with SMT, we have heard two main objetions. The rst is that the
SMTisbeingtrainedonRBMToutput,andhene an only be worse; a ommon suggestion is that a system trainedon human-produed translations
ould yield better results. It isnotat all implau-sible that an SMT trained on this kind of data mightperformbetteronmaterialwhihisoutside
the overage of the RBMT system. In this do-main, however, the important issue is preision, not reall; what is ritial is the ability to
trans-late auratelyonmaterial that iswithinthe on-strainedlanguagedenedbytheRBMToverage. The RBMTengine givesverygood performane
on in-overage data, as hasbeen shown in other evaluationsoftheMedSLTsystem,e.g.(Rayneret al., 2005); over97%ofall in-overage sentenes
areorretlytranslated. Human-generated transla-tionswouldoften,nodoubt,bemore naturalthan thoseproduedbytheRBMT,andtherewouldbe
slightly fewer outright mistranslations. But the primary reason why the SMT is doing badly is not that the training material ontains bad
trans-lations, but rather that the SMT is inapable of orretlyreproduingthetranslationsitseesinthe training data. EvenintheeasyEnglish!Frenh
language-pair,theSMToftenproduesadifferent translation fromtheRBMT.Itouldapriorihave been oneivable that the differenes were
unin-teresting, inthesense thatSMToutputs different fromRBMToutputswereasgood,orevenbetter. Infat,Table4showthatthisisnottrue;whenthe twotranslationsdiffer, althoughtheSMT
transla-tionanoasionallybebetter,itisusuallyworse. Table 6 shows that this problem is onsiderably more aute in English ! Japanese. Thus the
SMTsystem's inabilitytomodelthe RBMT sys-tempointstoareallimitation.
IftheSMThadinsteadbeentrainedon
human-generated data, its performane on in-overage materialouldonlyhaveimprovedsubstantiallyif theSMTforsomereasonfounditeasiertolearnto reprodue patterns inhuman-generated data than
in RBMT-generated data. This seems unlikely. TheSMTisbeingtrainedfromasetoftranslation
pairswhihareguaranteedtobeompletely on-sistent,sinetheyhavebeenautomatially gener-atedbytheRBMT;thefatthattheRBMTsystem
its favour. IftheSMTisunable toreproduethe RBMT'soutput,itisreasonable toassumeitwill have even greater difulty reproduing
transla-tions presentinnormalhuman-generated training data,whihisalwaysfarfromonsistent,andwill havealargervoabulary.
Theseondobjetion wehaveheard isthatthe non-standardBLEUsoreswhihwehaveusedto measure performane usetheRBMTtranslations
asareferene. People arequiktopointoutthat, ifrealhumantranslationsweresoredinthisway, theywoulddoless wellon thenon-standard
met-ris thantheRBMTtranslations. Thisis,indeed, absolutelytrue, andexplainswhyitwasessential toarry outtheomparisonjudging shownin
Ta-bles4and 6. Ifwehad omparedhuman transla-tionswithRBMTtranslationsinthesameway,we wouldhave found that human translations whih
differedfromRBMTtranslationsweresometimes better, and hardly ever worse. This would have shown that the non-standard metris were
inap-propriate for thetask ofevaluating human trans-lations. Intheatualaseonsideredinthispaper, wenda ompletelydifferentpattern: the differ-enes are one-sided intheopposite diretion,
in-diating that the non-standard metris do in fat agreewithhumanjudgementshere.
Ageneral objetion toall these experiments is
that there may be more powerful SMT arhite-tures. We used the Giza++/Moses/SRILM om-bininationbeause itisthedefato standard. We
have posted the datawe usedat http://www. bahr.net/geaf2009; this will allow other groupstoexperimentwithalternatearhitetures,
and determine whether they do in fat yield sig-niantimprovements. Forthemoment,however, wethinkitisreasonabletolaimthat,indomains
wherehigh auray isrequired, itremains tobe shownthatSMTapproahesareapableof ahiev-ingthelevelsofperformane thatrule-based
D.Arnold,L.Balkan,S.Meijer,R.L.Humphreys,and L.Sadler.1994.MahineTranslation:An Introdu-toryGuide.Blakwell,Oxford.
P. Bouillon, M. Rayner, N. Chatzihrisas, B.A. Hokey,M. Santaholma,M.Starlander,Y.Nakao, K.Kanzaki,andH.Isahara. 2005.Ageneri multi-lingual open soure platform for limited-domain medialspeehtranslation. InProeedingsofthe 10th Conferene of the European Assoiation for Mahine Translation (EAMT), pages 5058, Bu-dapest,Hungary.
P.Bouillon,F.Ehsani,R.Frederking,andM.Rayner, editors. 2006. ProeedingsoftheHLT-NAACL In-ternationalWorkshop onMedial Speeh Transla-tion,NewYork.
P. Bouillon, F. Ehsani, R. Frederking, M. MTear, and M. Rayner, editors. 2008a. Proeedings of the COLING Workshop on Speeh Proessing for SafetyCritial TranslationandPervasive Applia-tions,Manhester.
P. Bouillon, G. Flores, M. Georgesul, S. Halimi, B.A. Hokey, H. Isahara, K. Kanzaki, Y. Nakao, M. Rayner, M. Santaholma, M. Starlander, and N.Tsourakis. 2008b. Many-to-manymultilingual medialspeehtranslationonaPDA. In Proeed-ings of The Eighth Confereneof the Assoiation forMahineTranslationintheAmerias, Waikiki, Hawaii.
N. Chatzihrisas, P. Bouillon, M. Rayner, M. San-taholma, M. Starlander, andB.A. Hokey. 2006. Evaluating task performane for a unidiretional ontrolledlanguagemedialspeehtranslation sys-tem. In Proeedings of the HLT-NAACL Interna-tional Workshop on Medial Speeh Translation, pages916,NewYork.
L.Dugast,J.Senellart,andP.Koehn. 2008. Canwe relearnanRBMT system? InProeedingsofthe ThirdWorkshoponStatistialMahineTranslation, pages175178,Columbus,Ohio.
B.A. Hokey, M. Rayner, and G. Christian. 2008. Trainingstatistiallanguagemodelsfrom grammar-generateddata: Aomparativease-study. In Pro-eedingsofthe6thInternationalConfereneon Nat-uralLanguageProessing,Gothenburg,Sweden.
R.Jonson.2005.Generatingstatistiallanguage mod-elsfrom interpretation grammars in dialogue sys-tems. In Proeedings of the 11th EACL, Trento, Italy.
A.Jurafsky,C.Wooters,J.Segal,A.Stolke, E. Fos-ler, G. Tajhman, and N. Morgan. 1995. Us-ingastohastiontext-freegrammarasalanguage modelfor speeh reognition. In Proeedings of the IEEE International Conferene on Aoustis, SpeehandSignalProessing,pages189192.
M. Federio, N. Bertoldi, B. Cowan, W. Shen, C.Moran,R.Zens,etal.2007.Moses:Opensoure toolkit forstatistial mahine translation. In AN-NUAL MEETING-ASSOCIATION FOR COMPU-TATIONALLINGUISTICS,volume45,page2.
F.J.OhandH.Ney. 2000. Improvedstatistial align-mentmodels. InProeedings of the 38thAnnual Meetingofthe Assoiationfor Computational Lin-guistis,HongKong.
K.Papineni,S.Roukos,T.Ward,andW.-J.Zhu.2001. BLEU: amethod for automati evaluation of ma-hinetranslation. ResearhReport,Computer Si-eneRC22176(W0109-022),IBMResearh Divi-sion,T.J.WatsonResearhCenter.
M. Rayner, P. Bouillon, N. Chatzihrisas, B.A. Hokey,M.Santaholma,M.Starlander,H.Isahara, K.Kanzaki, and Y. Nakao. 2005. A methodol-ogyfor omparing grammar-basedand robust ap-proahestospeehunderstanding. InProeedings ofthe9thInternationalConfereneonSpoken Lan-guageProessing(ICSLP),pages11031107, Lis-boa,Portugal.
M. Rayner, B.A. Hokey, and P. Bouillon. 2006. Putting Linguistis into Speeh Reognition: The RegulusGrammarCompiler. CSLIPress,Chiago.
M.Rayner,P.Bouillon,G.Flores,F.Ehsani,M. Star-lander,B.A.Hokey,J.Brotanek,andL.Biewald. 2008. Asmall-voabularysharedtask formedial speehtranslation. InProeedingsoftheCOLING Workshop on Speeh Proessing for Safety Criti-alTranslationandPervasiveAppliations, Manh-ester.
S.Seneff,C.Wang,andJ.Lee. 2006. Combining lin-guistiandstatistialmethodsforbi-diretional En-glishChinese translation inthe ightdomain. In ProeedingsofAMTA2006.
A.Stolke. 2002. SRILM- an extensible language modelingtoolkit. InSeventhInternational Confer-eneonSpokenLanguageProessing.ISCA.