Using Artificially Generated Data to Evaluate Statistical Machine Translation

(1)

toEvaluateStatistial MahineTranslation

MannyRayner,PaulaEstrella,PierretteBouillon

UniversityofGeneva,TIM/ISSCO

40bvdduPont-d'Arve,CH-1211Geneva4,Switzerland

fEmmanuel.Rayner,Paula.Estrella,Pierrette.Bouillongunige.h

BethAnnHokey MailStop19-26,UCSCUARC

NASAAmesResearhCenter,MoffettField,CA940351000

bahokeyus.edu

YukieNakao

LINA,NantesUniversity,2,ruedelaHoussiniere,BP9220844322NantesCedex03

yukie.nakaouniv-nantes.fr

Abstrat

Although Statistial Mahine Translation (SMT) is now the dominant paradigm withinMahineTranslation,wearguethat

itis farfrom learthat it anoutperform Rule-BasedMahineTranslation(RBMT) on small-to medium-voabulary

applia-tionswherehighpreision ismore impor-tant than reall. Apartiularly important pratialexampleismedialspeeh trans-lation. We report the results of

exper-iments where we ongured the various grammarsandrule-setsinanOpenSoure medium-voabularymulti-lingualmedial

speehtranslationsystemtogeneratelarge aligned bilingual orpora for English !

Frenh and English ! Japanese, whih werethenusedtotrainSMTmodelsbased on the ommon ombination of Giza++,

Moses and SRILM. The resulting SMTs were unable fully to reprodue the per-formaneoftheRBMT,withperformane topping out, even for English ! Frenh,

withlessthan70%oftheSMTtranslations of previously unseen sentenes agreeing with RBMTtranslations. When the

out-puts of the two systems differed, human judges reported the SMT result as fre-quently being worse than the RBMT

re-sult,andhardly everbetter; moreover,the addedrobustnessoftheSMTonlyyielded asmallimprovementinreall,withalarge

penaltyinpreision.

1 Introdution

WhenStatistialMahineTranslation(SMT)was

rst introdued inthe early 90s, itenountered a hostilereeption,andmanypeopleintheresearh ommunitywereunwillingtobelieveitouldever

be a serious ompetitor to symboli approahes (f.forexample(Arnoldetal.,1994)).The pendu-lumhasnowswungallthewaytotheotherendof

thesale;rightnow,theprevailingwisdomwithin the researh ommunity is that SMT is the only truly viablearhiteture, and that rule-based

ma-hinetranslation(RBMT)isultimatelydoomedto failure. Inthis paper, one of our initialonerns willbetoargueforaompromiseposition. Inour

opinion,theinitialseptiismaboutSMTwasnot groundless;theargumentspresentedagainstit of-tentooktheformofexamplesinvolvingdeep

lin-guistireasoning,whih,itwaslaimed,wouldbe hardtoaddressusingsurfaemethods.Proponents of RBMT had, however, greatly underestimated

theextent towhihSMTwould beable totakle theproblemof robustness, whereitappearstobe farmorepowerfulthanRBMT.Formostmahine

translation appliations, robustness is the entral issue,soSMT'surrentpreemineneishardly sur-prising.

Evenforthelarge-voabularytaskswhereSMT doesbest,thesituationisbynomeansaslearas one might imagine: aording to (Wilks, 2007),

purely statistial systems are still unable to out-perform SYSTRAN.In this paper, we will

how-everbe moreonernedwithlimited-domain MT tasks,whererobustnessisnotthekeyrequirement, and aurayisparamount. Animmediate

(2)

lishingitselfasananappliationareaofsome sig-niane (Bouillon et al., 2006; Bouillon et al., 2008a). Translationinmedialappliationsneeds

to be extremely aurate, sine mistranslations an have serious oreven fatal onsequenes. At the panel disussionat the2008 COLING

work-shoponsafety-ritial speehtranslation (Rayner etal.,2008), theonsensusopinion, basedon in-putfrompratisingphysiians,wasthatan

appro-priate evaluation metri for medial appliations wouldbeheavilyslantedtowardsauray,as op-posedtorobustness. Ifthemetriisnormalisedso

astoaward0pointsfornotranslation,and1point for a orret translation, the estimate wasthat a suitable sore for an inorret translation would

besomethingbetween25and100 points. With theserequirements,itseemsunlikelythatarobust, broad-overage arhiteture has muh hane of

suess.Theobviousstrategyistobuilda limited-domainontrolled-languag esystem,andtuneitto thepointwhereaurayreahesthedesiredlevel.

Forsystems ofthis kind, itis atleast

oneiv-ablethatRBMTmaybeabletooutperformSMT. Thenextquestionishowtoinvestigatetheissues in a methodologially even-handed way. A few

studies, notably(Seneffetal.,2006),suggestthat rule-based translationmayinfatbepreferablein these ases. (Another related experiment is

de-sribed in (Dugast et al., 2008), though this was arried out ina large-voabulary system). These studies,however,havenotbeenwidelyited.One

possible explanation is suspiion about method-ologial issues. Seneff andherolleagues trained their SMT system on 20000 sentene pairs, a smallnumberbythestandardsofSMT.Itisa

pri-orinotimplausiblethat moretraining datawould have enabled themtoreate anSMTsystem that wasasgoodas,orbetterthan,therule-based

sys-tem.

In this paper, our primary goal is to take this kindofobjetionseriously,anddevelopa

method-ology designed toenable a tight omparison be-tween rule-based and statistial arhitetures. In partiular, we wish to examine the widely be-lieved laim that SMT is now inherently better

than RBMT. Inorder to do this, we start with a limited-domain RBMTsystem;weuseitto

auto-matiallygeneratealargeorpusofalignedpairs, whih isusedtotrain aorresponding SMT sys-tem.Wethenomparetheperformaneofthetwo

Ourargumentwill be that this situation

essen-tiallyrepresentsanupperboundforwhatis possi-bleusingtheSMTapproah inalimiteddomain. Ithas beenwidely remarked that quality, aswell asquantity,oftrainingdata isimportant forgood

SMT; in many projets, signiant effort is ex-pended tolean theoriginal training data. Here, sinethedataisautomatiallygeneratedbya

rule-based system, we an be sure that it is already ompletely lean(inthesenseof beinginternally onsistent), and weangenerate aslarge a

quan-tity of it as we require. The appliation, more-over,usesonlyasmallishvoabulary andafairly onstrainedsyntax. IfthederivedSMTsystem is

unabletomaththeoriginal RBMTsystem's per-formane, it seems reasonable to laim that this shows that there are types ofappliations where

RBMTarhiteturesaresuperior.

The experiments desribed have been arried

out using MedSLT, an Open Soure interlingua-based limited-domain medial speeh translation system. The restofthepaperisorganisedas

fol-lows. Setion2providesbakgroundonthe Med-SLTsystem. Setion3 desribes the experimen-talframework,andSetion4theresultsobtained.

Setion5onludes.

2 TheMedSLTSystem

MedSLT (Bouillon et al., 2005; Bouillon et al.,

2008b)isamedium-voabulary interlingua-based OpenSourespeehtranslationsystemfor dotor-patient medial examination questions, whih

provides any-language-to-any-l an gua ge transla-tion apabilities for all languages in the set En-glish, Frenh, Japanese, Arabi, Catalan. Both

speehreognition andtranslation arerule-based. SpeehreognitionrunsontheNuane8.5 reog-nition platform, with grammar-based language

modelsbuiltusingtheOpenSoureRegulus om-piler. As desribed in (Rayner et al., 2006), eahdomain-speilanguage modelisextrated

from a general resoure grammar using orpus-basedmethodsdrivenbyaseedorpusof domain-spei examples. The seed orpus, whih typi-ally ontains between 500 and 1500 utteranes,

is then used a seond time to add probabilisti weights to the grammar rules; this substantially

(3)

lan-riedoutareshowninFigure1.

Language Voab WER SemER

English 447 6% 11%

Frenh 1025 8% 10%

Japanese 422 3% 4%

Table 1: Reognition performane for English, Frenh and Japanese MedSLT reognisers. V o-ab = number of surfae words in soure

lan-guagereogniservoabulary; WER=Word Er-ror Rate for soure language reogniser, on in-overagematerial;SemER=semantierrorrate

(proportionofutteranesfailingtoprodueorret interlingua)forsourelanguagereogniser,on in-overagematerial.

Atrun-time, thereogniser produesa soure-langage semanti representation. This is rst translated by one set of rulesinto an interlingual

form, and then by a seond setinto a target lan-guage representation. A target-language Regu-lusgrammar,ompiledintogenerationform,turns

this intoone ormorepossible surfaestrings, af-ter whih a set of generation preferenes piks one out. Finally, theseleted string isrealised in

spokenform. Robustnessissuesareaddressedby means of a bak-up statistial reogniser, whih drivesa robust embedded help system. The

pur-pose of the help system (Chatzihrisas et al., 2006)istoguidetheusertowardssupported ov-erage; it performs approximate mathing of

out-put fromthestatistial reogniseragain a library ofsenteneswhihhavebeenmarkedasorretly proessed during system development, and then

presentsthelosestmathestotheuser.

Examplesoftypial Englishdomain sentenes andtheirtranslationsintoFrenhandJapaneseare showninFigure2.

3 Experimentalframework

In the literature on language modelling, there is

a known tehnique for bootstrapping a statisti-allanguagemodel(SLM)fromagrammar-based language model (GLM). The grammar whih formsthebasis oftheGLMissampledrandomly

inordertoreateanarbitrarilylargeorpusof ex-amples; these examples are then used asa

train-ingorpustobuildtheSLM(Jurafskyetal.,1995; Jonson,2005).Weadaptthisproessina straight-forward way to onstrut an SMT for a given

mar,thesoure-to-interlingu atranslationrules,the interlingua-to-targ et-l an gua ge rules, and the tar-getlanguagegenerationgrammar. Westartinthe

same way,usingthesoure languagegrammarto build a randomly generated soure language or-pus; asshown in (Hokey et al., 2008), it is

im-portanttohaveaprobabilisti grammar. Wethen use the omposition of the other omponents to attempttotranslateeahsourelanguagesentene

into a target language equivalent, disarding the examples for whih no translation is produed. The result is an aligned bilingual orpus of

ar-bitrary size, whih anbe used to train an SMT model.

We used this method to generate aligned or-porafor thetwoMedSLTlanguagepairs English

!FrenhandEnglish!Japanese. Foreah lan-guagepair,werstgeneratedone million soure-languageutteranes;wenextlteredthemtokeep

only examples whih werefullsentenes, as op-posed to elliptial phrases, and nally used the translationrulesandtarget-languagegeneratorsto

attempt to translate eah sentene. This reated approximately 305K aligned sentene-pairs for English!Frenh(1901KwordsEnglish,1993K

words Frenh), and 311K aligned sentene-pairs for English ! Japanese (1941K words English, 2214K words Japanese). We held out 2.5% of

eah set as development data, and 2.5% as test data. UsingGiza++,MosesandSRILM(Ohand Ney,2000;Koehnetal.,2007;Stolke,2002),we

trainedSMTmodelsfrominreasingly large sub-setsofthetrainingportion,usingthedevelopment portionintheusualwaytooptimizeparameter

val-ues.Finally,weusedtheresultingmodelsto trans-latethetestportion.

Ourprimary goalwastomeasure theextent to

whihthederived versionsof theSMTwereable toapproximate theoriginal RBMTondatawhih waswithintheRBMT'soverage. Thereisa sim-pleandnaturalwaytoperformthismeasurement:

weapplytheBLEUmetri(Papinenietal.,2001), with the RBMT's translation taken as the refer-ene. Thismeansthat perfetorrespondene

be-tween the two translations would yield a BLEU soreof1.0.

This raises an important point. The BLEU

(4)

Frenh Avez-vousmalaudessusdesyeux?

Japanese Itamiwamenouenoataridesuka?

English Haveyouhadthepainformorethanamonth?

Frenh Avez-vousmaldepuisplusd'unmois? Japanese Ikkagetsuijouitamiwatsuzukimashitaka?

English Isthepainassoiatedwithnausea?

Frenh Avez-vousdesnaus´eesquandvousavezladouleur?

Japanese Itamutohakikewaokorimasuka?

English Doesbrightlightmakethepainworse?

Frenh Ladouleurest-elleaggrav´eeparunelumiereforte? Japanese Akaruihikariwomirutozutsuwahidokunarimasuka?

Table2:ExamplesofEnglishdomainsentenes,andthesystem'stranslationsintoFrenhandJapanese.

theextent towhihitapproximates human trans-lations. It is important tobring in human judge-ment, to evaluate the ases where the SMT and

RBMTdiffer. If,inthese ases, ittranspiredthat humanjudgestypiallythoughtthattheSMTwas asgood as theRBMT,then thedifferene would

bepurely aademi. Weneedtosatisfyourselves thathumanjudgestypiallyasribedifferenes be-tween SMT and RBMT to shortomings in the

SMTratherthanintheRBMT.

Conretely, we olleted all the different hSoure, SMT-translation, RBMT-translationi triples produed during the ourse of the

ex-periments, and extrated those where the two translationsweredifferent. Werandomlyseleted a set of examples for eah language pair, and

asked human judges to lassifythem into one of thefollowingategories:

RBMT better: The RBMTtranslation was

better,intermsofpreservingmeaningand/or beinggrammatiallyorret;

SMTbetter: The SMTtranslation was bet-ter,intermsofpreservingmeaningand/or

be-inggrammatiallyorret;

Similar: Both translations were about equally good OR the soure sentene was meaninglessinthedomain.

In order to show that our metris are intuitively

meaningful,itissufienttodemonstrate thatthe frequenyofourrene ofRBMTbetterisboth

large in omparison to that of SMTbetter, and aounts for a substantial proportion of thetotal population.

Finally, we onsider the question of whether the SMT, whihis apable of translating out-of-grammar sentenes, anadd usefulrobustness to

thebasesystem.Weolleted,fromthesetusedin theexperimentsdesribedin(Rayneretal.,2005), alltheEnglishsenteneswhihfailedtobe

trans-lated into Frenh. We used the best version of the English ! Frenh SMT to translate eah of these sentenes, andasked humanjudges to

eval-uate the translations as being learly aeptable, learlyunaeptable,orborderline.

Inthenextsetion,wepresenttheresultsofthe variousexperimentswehavejustdesribed.

4 Results

We begin with Figure 1, whih shows non-standardBLEUsoresforversionsoftheEnglish ! Frenh SMT system trained on quantities of

datainreasing from14287to285740pairs. As anbeseen,translation performaneimprovesup to about 175000 pairs. After this, it levels out

at around BLEU = 0.90, well below that of the RBMTsystem with whih it is being ompared. Amorediretwaytoreporttheresultissimplyto ounttheproportion oftestsentenes thatarenot

inthetrainingdata,whiharetranslatedsimilarly bytheSMTandtheRBMT.Thisguretopsoutat around68%.

The results strongly suggest that the SMT is unable to repliate the RBMT's performane at all losely even in an easy language-pair,

irre-spetive ofthe amountoftraining dataavailable. Outofuriosity,andtoreassureourselvesthatthe

(5)

ut-numberofpairsoftraining sentenesfor English ! Frenh; training and test data both indepen-dentlygenerated,heneoverlapping.

terane seedorpus usedtogeneratethe

gram-mar (f. Setion 2). This produed utterly dis-malperformane,withBLEU=0.52.Theresultis moreinteresting thanitmayrstappear, sine,in

speehreognition, thediffereneinperformane betweentheSLMstrainedfromseedorporaand largegeneratedorporaisfairlysmall(Hokeyet

al.,2008).

Itseemedpossiblethattheimprovementin per-formanewithinreasedquantitiesoftrainingdata might, in effet, only be due to the SMT

fun-tioning as a translation memory; sine training and testdata are independently generated by the same random proess, they overlap,with the

de-gree ofoverlap inreasingasthetraining setgets larger. In order to investigate this hypothesis, werepeatedtheexperimentswithdatawhihhad

been uniqued, so that the training and test sets were ompletely disjoint, and neither ontained any dupliate sentenes

1

. Infat, Figure 2show

thatthegraphforuniqued English!Frenhdata are fairly similar tothe one forthe original non-uniqueddatashowninFigures1. Themain

differ-eneisthat thenon-standardBLEUsore forthe

1

Ouropinionisthatthisisnotarealistiwaytoevaluate theperformaneofasmall-voabularysystem;forexample, inMedSLT,oneexpetsthatatleastsometrainingsentenes, e.g.Whereisthepain?,willalsoourfrequentlyintest data.

numberof pairsoftraining sentenesfor English ! Frenh; training and test data both indepen-dently generated, then uniqued to remove

dupli-atesandoverlappingitems.

uniqued data, unsurprisingly, tops out at a lower level, reeting thefatthat atranslation

mem-oryeffetdoesindeedourtosomeextent.

Results for English ! Japanese showed the same trendsasEnglish!Frenh, butweremore

pronouned. Table 3 ompares the performane of the best versions of the SMTs for the two language-pairs, using both plain and artiially

uniqued data. We see that, with plain data, the English!JapaneseSMTfallseven furthershort ofrepliating theperformaneoftheRBMTthan was the ase for English ! Frenh; BLEU is

only 0.76. The differene between the plain and uniqued versions is also more extreme. BLEU

(0.64)isonsiderablylowerfortheversiontrained onuniqueddata,suggesting thattheSMTforthis language pair is nding it harder to generalise,

and is in effet loser to funtioning asa trans-lation memory. This is onrmed by ounting thesentenes intestdataand notintraining data whih weretranslated similarly by theSMT and

theRBMT;wend thatthegure topsoutatthe verylowvalueof26%.

Asnotedinourdisussion oftheexperimental

(6)

English!Frenh

Generated Generated 0.90

Gen/uniqued Gen/uniqued 0.85

English!Japanese

Generated Generated 0.76

Gen/uniqued Gen/uniqued 0.64

Table3:Translationperformane,intermsof

non-standard BLEU metri, for different ongura-tions, training on all available data of the

spe-ied type. Generated = data randomly gener-ated; Gen/uniqued = data randomly generated, then uniqued so that dupliates areremoved and

testandtrainingpairsdonotoverlap.

neessary toestablish whatthe differenes mean intermsofhumanjudgements. Weonsequently

turntoevaluation ofthepairsforwhihtheSMT and theRBMTsystems produed different trans-lationresults.

Table 4 shows theategorisation, aording to theriteriaoutlinedattheendofSetion3,for500

English ! Frenh pairs randomly seleted from thesetofexamples whereRBMTand SMTgave different results; we asked three judges to

evalu-atethemindependently,and ombinedtheir judg-mentsbymajoritydeisionwhereappropriate.We observed a very heavy bias towards the RBMT,

withunanimousagreementamongthejudgesthat theRBMTtranslationwasbetterin201/500ases, and 2-1agreement in a further 127. In ontrast,

there were only 4/500 ases where the judges unanimouslythoughtthattheSMTtranslationwas preferable, with a further 12 supported by a

ma-jority deision. The rest of the table gives the aseswheretheRBMTandSMTtranslationswere judgedthesameorasesinwhihthejudges

dis-agreed; there were only 41/500 ases where no majority deision wasreahed. Our overall on-lusion is that we are justied in evaluating the

SMTbyusingtheBLEUsoreswiththeRBMTas thereferene. Oftheaseswherethetwosystems differ, only a tiny fration, atmost 16/500, indi-ate a better translation from theSMT, and well

overhalf aretranslated better by theRBMT. Ta-ble5presentstypialexamplesofbadSMT

trans-lations in the English ! Frenh pair, ontrasted withthetranslationsproduedbytheRBMT.The rsttwoaregrammatialerrors(asuperuous

ex-tra verb in the rst, and agreement errors inthe

seond). The thirdis an bad hoie of tense and preposition; althoughgrammatial,thetarget lan-guagesentenefailstopreservethemeaning,and,

rather than referring to a 20 day period ending now, insteadrefers toa20 day period sometime inthepast.

Result Agreement Count

RBMTbetter alljudges 201

RBMTbetter majority 127 SMTbetter alljudges 4

SMTbetter majority 12

Similar alljudges 34 Similar majority 81

Unlear disagree 41

Total 500

Table4: ComparisonofRBMTand SMT

perfor-maneon500randomlyhosenEnglish!Frenh translation examples, evaluated independently by threejudges.

Table 6shows asimilar evaluation forthe

En-glish ! Japanese. Here, the differene between theSMTandRBMTversionswassopronouned thatwefeltjustiedintakingasmallersample,of

only150sentenes. Thistime,92/150aseswere unanimously judged as having a better RBMT

translation,andthere wasnotasingleasewhere even a majority found that the SMT was better. Agreement was good here too, with only 8/150 asesnotyieldingatleastamajoritydeision.

Result Agreement Count

RBMTbetter alljudges 92

RBMTbetter majority 32

SMTbetter alljudges 0 SMTbetter majority 0

Similar alljudges 2 Similar majority 16

Unlear disagree 8

Total 150

Table 6: Comparison of RBMT and SMT

per-formane on 150 randomly hosen English ! Japanesetranslationexamples,evaluated indepen-dentlybythreejudges.

(7)

RBMTFrenh vosmauxdetetesont-ilsaus´espardeshangementsdetemp´erature

(yourheadahesare-theyausedbyhangesoftemperature)

SMTFrenh avez-vousvosmauxdetetesont-ilsaus´espardeshangementsdetemp´erature (have-youyourheadahesare-theyausedbyhangesoftemperature)

English areheadahesrelievedintheafternoon RBMTFrenh vosmauxdetetediminuent-ilsl'apres-midi

(yourheadahes(MASC-PLUR)derease-MASC-PLURtheafternoon) SMTFrenh vosmauxdetetediminue-t-ellel'apres-midi

(yourheadahes(MASC-PLUR)derease-FEM-SINGtheafternoon)

English haveyouhadthemfortwentydays

RBMTFrenh avez-vousvosmauxdetetedepuisvingtjours

(have-youyourheadahessinetwentydays) SMTFrenh avez-vouseuvosmauxdetetependantvingtjours

(have-youhadyourheadahesduringtwentydays)

Table 5: Examples ofinorretSMTtranslations from Englishinto Frenh. Errors arehighlighted in

bold.

the SMTould have an advantage; robustness is generally astrength ofstatistial approahes. We

return to English ! Frenh in Table 7, whih presentstheresultofrunningthebestSMTmodel on the357 examplesfromthe testsetin(Rayner

et al., 2005) whih failedto be translated by the RBMT.Wedividethesetintoategoriesbasedon thereasonforfailureoftheRBMT.

In the most populous group, translations that

failed due to out of voabulary items, the SMT was, more or less by onstrution, also unable to produe a translation. For the 110 items that

wereoutofgrammaroveragefortheRBMT,the SMTprodued38goodtranslations,andanother4 borderline translations. Therewere50items that

were withinthe soure grammar overage of the RBMT,butfailedsomewhereintransfer and gen-eration proessing. Of those, the majority (32)

representedbadsouresentenes,onsideredas ill-formedforthepurposesofthisexperiment.Out of the remaining items that were within RBMT

grammaroverage,theSMTmanagedtoprodue 5goodtranslationsand1borderlinetranslation.In total, onthemostlenientinterpretation, theSMT

produed 48 additional translations out of 357. Whilethisimprovementinreallisarguablyworth having,itwouldomeattheprieofasubstantial

delineinpreision.

5 DisussionandConlusions

Wehavepresentedanovelmethodologyfor om-paring RBMT and SMT, and tested it on a

spe-Result Count

Outofvoabulary

Badtranslation 187

Outofsouregrammaroverage

Goodtranslation 38

Badtranslation 44 Borderlinetranslation 4 Badsouresentene 34

Insouregrammaroverage

Goodtranslation 5 Badtranslation 12

Borderlinetranslation 1 Badsouresentene 32

Total 357

Table7: English!Frenh SMTperformaneon examplesfromthetestsetwhihfailedtobe

trans-latedbytheRBMT,evaluatedbyonejudge.

i pair ofRBMTand SMT arhitetures. Our laim is that these results show that the version of SMTusedhereisnotinfatapableof

repro-duingtheoutputoftheRBMTsystem. Although therehasbeensomeinterestinattempting totrain SMTsystems fromRBMToutput,theevaluation issuesthatarisewhenomparingSMTandRBMT

versions of a high-preision limited-domain sys-tem aredifferent from those arising inmost MT

(8)

Thisisnot,infat,true.

Whenwehavedisussedthemethodologywith people who work primarily with SMT, we have heard two main objetions. The rst is that the

SMTisbeingtrainedonRBMToutput,andhene an only be worse; a ommon suggestion is that a system trainedon human-produed translations

ould yield better results. It isnotat all implau-sible that an SMT trained on this kind of data mightperformbetteronmaterialwhihisoutside

the overage of the RBMT system. In this do-main, however, the important issue is preision, not reall; what is ritial is the ability to

trans-late auratelyonmaterial that iswithinthe on-strainedlanguagedenedbytheRBMToverage. The RBMTengine givesverygood performane

on in-overage data, as hasbeen shown in other evaluationsoftheMedSLTsystem,e.g.(Rayneret al., 2005); over97%ofall in-overage sentenes

areorretlytranslated. Human-generated transla-tionswouldoften,nodoubt,bemore naturalthan thoseproduedbytheRBMT,andtherewouldbe

slightly fewer outright mistranslations. But the primary reason why the SMT is doing badly is not that the training material ontains bad

trans-lations, but rather that the SMT is inapable of orretlyreproduingthetranslationsitseesinthe training data. EvenintheeasyEnglish!Frenh

language-pair,theSMToftenproduesadifferent translation fromtheRBMT.Itouldapriorihave been oneivable that the differenes were

unin-teresting, inthesense thatSMToutputs different fromRBMToutputswereasgood,orevenbetter. Infat,Table4showthatthisisnottrue;whenthe twotranslationsdiffer, althoughtheSMT

transla-tionanoasionallybebetter,itisusuallyworse. Table 6 shows that this problem is onsiderably more aute in English ! Japanese. Thus the

SMTsystem's inabilitytomodelthe RBMT sys-tempointstoareallimitation.

IftheSMThadinsteadbeentrainedon

human-generated data, its performane on in-overage materialouldonlyhaveimprovedsubstantiallyif theSMTforsomereasonfounditeasiertolearnto reprodue patterns inhuman-generated data than

in RBMT-generated data. This seems unlikely. TheSMTisbeingtrainedfromasetoftranslation

pairswhihareguaranteedtobeompletely on-sistent,sinetheyhavebeenautomatially gener-atedbytheRBMT;thefatthattheRBMTsystem

its favour. IftheSMTisunable toreproduethe RBMT'soutput,itisreasonable toassumeitwill have even greater difulty reproduing

transla-tions presentinnormalhuman-generated training data,whihisalwaysfarfromonsistent,andwill havealargervoabulary.

Theseondobjetion wehaveheard isthatthe non-standardBLEUsoreswhihwehaveusedto measure performane usetheRBMTtranslations

asareferene. People arequiktopointoutthat, ifrealhumantranslationsweresoredinthisway, theywoulddoless wellon thenon-standard

met-ris thantheRBMTtranslations. Thisis,indeed, absolutelytrue, andexplainswhyitwasessential toarry outtheomparisonjudging shownin

Ta-bles4and 6. Ifwehad omparedhuman transla-tionswithRBMTtranslationsinthesameway,we wouldhave found that human translations whih

differedfromRBMTtranslationsweresometimes better, and hardly ever worse. This would have shown that the non-standard metris were

inap-propriate for thetask ofevaluating human trans-lations. Intheatualaseonsideredinthispaper, wenda ompletelydifferentpattern: the differ-enes are one-sided intheopposite diretion,

in-diating that the non-standard metris do in fat agreewithhumanjudgementshere.

Ageneral objetion toall these experiments is

that there may be more powerful SMT arhite-tures. We used the Giza++/Moses/SRILM om-bininationbeause itisthedefato standard. We

have posted the datawe usedat http://www. bahr.net/geaf2009; this will allow other groupstoexperimentwithalternatearhitetures,

and determine whether they do in fat yield sig-niantimprovements. Forthemoment,however, wethinkitisreasonabletolaimthat,indomains

wherehigh auray isrequired, itremains tobe shownthatSMTapproahesareapableof ahiev-ingthelevelsofperformane thatrule-based

(9)

D.Arnold,L.Balkan,S.Meijer,R.L.Humphreys,and L.Sadler.1994.MahineTranslation:An Introdu-toryGuide.Blakwell,Oxford.

P. Bouillon, M. Rayner, N. Chatzihrisas, B.A. Hokey,M. Santaholma,M.Starlander,Y.Nakao, K.Kanzaki,andH.Isahara. 2005.Ageneri multi-lingual open soure platform for limited-domain medialspeehtranslation. InProeedingsofthe 10th Conferene of the European Assoiation for Mahine Translation (EAMT), pages 5058, Bu-dapest,Hungary.

P.Bouillon,F.Ehsani,R.Frederking,andM.Rayner, editors. 2006. ProeedingsoftheHLT-NAACL In-ternationalWorkshop onMedial Speeh Transla-tion,NewYork.

P. Bouillon, F. Ehsani, R. Frederking, M. MTear, and M. Rayner, editors. 2008a. Proeedings of the COLING Workshop on Speeh Proessing for SafetyCritial TranslationandPervasive Applia-tions,Manhester.

P. Bouillon, G. Flores, M. Georgesul, S. Halimi, B.A. Hokey, H. Isahara, K. Kanzaki, Y. Nakao, M. Rayner, M. Santaholma, M. Starlander, and N.Tsourakis. 2008b. Many-to-manymultilingual medialspeehtranslationonaPDA. In Proeed-ings of The Eighth Confereneof the Assoiation forMahineTranslationintheAmerias, Waikiki, Hawaii.

N. Chatzihrisas, P. Bouillon, M. Rayner, M. San-taholma, M. Starlander, andB.A. Hokey. 2006. Evaluating task performane for a unidiretional ontrolledlanguagemedialspeehtranslation sys-tem. In Proeedings of the HLT-NAACL Interna-tional Workshop on Medial Speeh Translation, pages916,NewYork.

L.Dugast,J.Senellart,andP.Koehn. 2008. Canwe relearnanRBMT system? InProeedingsofthe ThirdWorkshoponStatistialMahineTranslation, pages175178,Columbus,Ohio.

B.A. Hokey, M. Rayner, and G. Christian. 2008. Trainingstatistiallanguagemodelsfrom grammar-generateddata: Aomparativease-study. In Pro-eedingsofthe6thInternationalConfereneon Nat-uralLanguageProessing,Gothenburg,Sweden.

R.Jonson.2005.Generatingstatistiallanguage mod-elsfrom interpretation grammars in dialogue sys-tems. In Proeedings of the 11th EACL, Trento, Italy.

A.Jurafsky,C.Wooters,J.Segal,A.Stolke, E. Fos-ler, G. Tajhman, and N. Morgan. 1995. Us-ingastohastiontext-freegrammarasalanguage modelfor speeh reognition. In Proeedings of the IEEE International Conferene on Aoustis, SpeehandSignalProessing,pages189192.

M. Federio, N. Bertoldi, B. Cowan, W. Shen, C.Moran,R.Zens,etal.2007.Moses:Opensoure toolkit forstatistial mahine translation. In AN-NUAL MEETING-ASSOCIATION FOR COMPU-TATIONALLINGUISTICS,volume45,page2.

F.J.OhandH.Ney. 2000. Improvedstatistial align-mentmodels. InProeedings of the 38thAnnual Meetingofthe Assoiationfor Computational Lin-guistis,HongKong.

K.Papineni,S.Roukos,T.Ward,andW.-J.Zhu.2001. BLEU: amethod for automati evaluation of ma-hinetranslation. ResearhReport,Computer Si-eneRC22176(W0109-022),IBMResearh Divi-sion,T.J.WatsonResearhCenter.

M. Rayner, P. Bouillon, N. Chatzihrisas, B.A. Hokey,M.Santaholma,M.Starlander,H.Isahara, K.Kanzaki, and Y. Nakao. 2005. A methodol-ogyfor omparing grammar-basedand robust ap-proahestospeehunderstanding. InProeedings ofthe9thInternationalConfereneonSpoken Lan-guageProessing(ICSLP),pages11031107, Lis-boa,Portugal.

M. Rayner, B.A. Hokey, and P. Bouillon. 2006. Putting Linguistis into Speeh Reognition: The RegulusGrammarCompiler. CSLIPress,Chiago.

M.Rayner,P.Bouillon,G.Flores,F.Ehsani,M. Star-lander,B.A.Hokey,J.Brotanek,andL.Biewald. 2008. Asmall-voabularysharedtask formedial speehtranslation. InProeedingsoftheCOLING Workshop on Speeh Proessing for Safety Criti-alTranslationandPervasiveAppliations, Manh-ester.

S.Seneff,C.Wang,andJ.Lee. 2006. Combining lin-guistiandstatistialmethodsforbi-diretional En-glishChinese translation inthe ightdomain. In ProeedingsofAMTA2006.

A.Stolke. 2002. SRILM- an extensible language modelingtoolkit. InSeventhInternational Confer-eneonSpokenLanguageProessing.ISCA.