Adam Meyers
New York University
Michiko Kosaka
Monmouth University
Ralph Grishman
New York University
Abstract
Transfer-basedMachineTranslationsystems
re-quireaprocedureforchoosingthesetoftransfer
rules for generating a target language
transla-tion from a given sourcelanguage sentence. In
an MT system with many competing transfer
rules,choosingthe best setof transfer rulesfor
translationmayinvolvetheevaluationofan
ex-plosivenumberofcompetingsets. Weproposea
solutionto thisproblembased on current
best-rst chartparsing algorithms.
1 Introduction
Transfer-basedMachineTranslationsystems
re-quirea procedureforchoosingtheset of
trans-ferrulesforgeneratingatarget language
trans-lation from a given source language sentence.
This procedure is trivial for a system if, given
a context, one transferrulecan be selected
un-ambiguously. Otherwise, choosing the best set
of transfer rules may involve the evaluation of
numerous competing sets. In fact, thenumber
of possible transfer rule combinations
increas-es exponentially with the length of the source
language sentence. This situation mirrors the
problemof choosingproductionsina
nondeter-ministic parser. In this paper, we describe a
system for choosing transfer rules, based on
s-tatisticalchart parsing(Bobrow, 1990; Chitrao
and Grishman, 1990; Caraballo and Charniak,
1997; Charniaketal., 1998).
In our Machine Translation system, transfer
rules are generated automatically from parsed
parallel text along the lines of (Matsumoto et
al., 1993; Meyers et al., 1996; Meyers et al.,
1998b). Our system tends to acquire a large
numberoftransferrules,duemainlyto
alterna-tive ways of translating the same sequences of
words, non-literal translations in parallel text
oursystemchoosethebestsetofrules
eÆcient-ly. Whilethetechniquediscussedhereobviously
appliestosimilarsuchsystems,itcouldalso
ap-plyto hand-codedsystems inwhich each word
or group of words is related to more than one
transferrule. For example, both Multra(Hein,
1996)andtheEurotrasystemdescribedin(Way
et al., 1997) require components for deciding
whichcombinationoftransferrulestouse. The
proposed technique may be used with systems
liketheseprovidedthatalltransferrulesare
as-signedinitialscoresratingtheirappropriateness
for translation. These appropriateness ratings
couldbedependentorindependentofcontext. 1
2 Previous Work
The MTliteraturedescribesseveral techniques
forderivingtheappropriatetranslation.
Statis-ticalsystems that do notincorporatelinguistic
analysis (Brown et al., 1993) typically choose
the most likely translation based on a
statis-tical model, i.e.., translation probability
deter-minesthetranslation. (Hein,1996)reportsaset
of(hand-coded) featurestructurebased
prefer-encerulestochooseamongalternativesin
Mul-tra. There is some discussion about adding
sometransferrulesautomaticallyacquiredfrom
corpora to Multra. 2
Assuming that they
over-generaterules(aswedid),asystemliketheone
weproposeshouldbebenecial. In(Wayetal.,
1997),manydierentcriteriaareusedtochoose
transfer rulesto execute including: preferences
forspecicrulesovergeneralones, andcomplex
rulenotationthatinsuresthatfewrulescan
ap-plyto thesame set of words.
The Pangloss Mark III system (Nirenburg
1
Thistranslation procedurewouldprobably
comple-ment,notreplaceexistingproceduresinthesesystems.
2
and Frederking, 1995) uses a chart-walk
algo-rithm to combine the results of three MT
en-gines: an example-based engine, a
knowledge-based engine, and a lexical-transfer engine.
Each engine contributes its best edges and the
chart-walk algorithm uses dynamic
program-mingto ndthecombination of edges withthe
best overall score that covers the input string.
Scoresofedgesarenormalizedsothatthescores
from the dierent engines are comparable and
weightedtofavorengineswhichtendtoproduce
better results. Pangloss's algorithm combines
whole MT systems. In contrast, our
algorith-m combines output of individualtransfer rules
withinasingleMTsystem. Also,weusea
best-rst search that incorporates a
probabilistic-basedgure ofmerit,whereasPangloss usesan
empirically based weighting scheme and what
appears to be atop-down search.
Best-rst probabilistic chart parsers
(Bo-brow,1990;ChitraoandGrishman,1990;
Cara-balloandCharniak,1997;Charniaketal.,1998)
strive to nd the best parse, without
exhaus-tivelytryingall possibleproductions. A
proba-bilisticgureofmerit(CaraballoandCharniak,
1997;Charniaketal.,1998)isdevisedfor
rank-ing edges. The highest ranking edges are
pur-suedrst andthe parserhaltsafter itproduces
a completeparse. We proposeanalgorithm for
choosing and applying transfer rules based on
probability. Each nal translation is derived
from a specic set of transfer rules. If the
pro-cedureimmediatelyselectedthesetransferrules
andappliedtheminthecorrectorder,wewould
arriveatthenaltranslationwhilecreatingthe
minimumnumberofedges. Ourprocedureuses
about 4 times this minimumnumber of edges.
Withrespectto chartparsing,(Charniaketal.,
1998)reportthattheirparsercanachieve good
results while producing about three times the
minimumnumberof edges requiredto produce
thenal parse.
3 Test Data
Weconductedtwoexperiments. For
experimen-t1, we parsed a sentence-aligned pair of
Span-ish and English corpora, each containing 1155
sentences of Microsoft Excel Help Text. These
pairsof parsedsentenceswere dividedinto
dis-tinct training and test sets, ninety percent for
Obj
B = valores
en
E = calcular
C = libro
de
F = trabajo
D = volver
a
Subj
Target Tree
B’ = values
Obj
C’ = workbook
D’ = recalculate
A = Excel
Subj
in
A’ = Excel
Excel recalculates
values in workbook
Excel vuelve a calcular
valores en libro de trabajo
Source Tree
Figure 1: Spanish and English Regularized
Parse Trees
set was used to acquire transfer rules (Meyers
etal.,1998b)whichwerethenusedtotranslate
thesentences inthetest set. Thispaper
focus-eson ourtechnique for applyingthese transfer
rulesinorder to translatethetest sentences.
The test and training sets in experiment1
were rotated, assigning a dierent tenth of the
sentencestothetestsetineachrotation. Inthis
waywetestedtheprogramontheentirecorpus.
Onlyone test set(onetenth of thecorpus)was
usedfortuningthesystemduringdevelopment.
Transfer rules, 1109 on average, were acquired
from each trainingset and used for translation
of the corresponding test set. For Experiment
2,weparsed2617pairsofalignedsentencesand
used the same rotation procedure for dividing
test and training corpora. The Experiment 2
corpusincludedtheexperiment1corpus. An
av-erageof 2191 transferrules were acquiredfrom
agiven setof Experiment2 trainingsentences.
Experiment1isorchestratedinacareful
man-ner that may not be practical for extremely
largecorpora,and Experiment2showshowthe
program performs ifwe scale up and eliminate
someofthene-tuning. Apartfromcorpussize,
there are two main dierence between the two
experiments: (1) the experiment1 corpus was
alignedcompletelybyhand,whereasthe
Exper-iment2corpuswasalignedautomaticallyusing
thesystem described in(Meyers etal., 1998a);
and (2) the parsers were tuned to the
sen-C’ = workbook
1
2
3
Subj
Obj
in
A = Excel
B = valores
de
F = trabajo
C = libro
a
en
Obj
E = calcular
D = volver
3
2
1
Subj
A’ = Excel
B’ = values
4)
1)
2)
3)
D’ = recalculate
Figure2: A Setof TransferRules
4 Parses and Transfer Rules
Figure 1 is a pair of \regularized" parses for a
correspondingpair of Spanishand English
sen-tences from Microsoft Excel help text. These
areF-structure-likedependencyanalysesof
sen-tencesthatrepresent predicateargument
struc-ture. This representation serves to neutralize
somedierencesbetweenrelatedsentencetypes,
e.g., theregularized parse of related active and
passive sentences are identical, except for the
featurevaluepairfMood,Passiveg. Nodes
(val-ues)are labeledwithheadwordsand arcs
(fea-tures) are labeled with grammatical functions
(subject, object), prepositions (in) and
subor-dinate conjunctions (before). 3
. For
demonstra-tionpurposes,thesourcetreeinFigure1 isthe
inputto our translationsystem and the target
tree istheoutput.
The transfer rules in Figure 2 can be
used to convert the input tree into the
out-put tree. These transfer rules are pairs of
corresponding rooted substructures, where a
substructure (Matsumoto et al., 1993) is a
connected set of arcs and nodes. A rule
3
Morphological features and their values
(Gram-consistsofeitherapairof\open"substructures
(rule4)orapairof\closed"substructures(rules
1, 2 and 3). Closed substructures consist of
s-inglenodes(A,A',B,B',C') orsubtrees(the left
hand side of rule 3). Open substructures
con-tainone ormore open arcs,arcs withoutheads
(bothsubstructuresinrule4).
5 Simplied Translation with
Tree-based Transfer Rules
The rules in Figure 2 could combine by lling
intheopen arcs in rule4 withthe rootsof the
substructures in rules 1, 2 and 3. The result
wouldbeaclosededgewhichmapsthelefttree
inFigure1intotherighttree. Justasedgesofa
chart parserarebasedon thecontext freerules
used by the chart parser, edges of our
trans-lation systemare basedon these transfer rules.
Initialedgesareidenticaltotransferrules.
Oth-er edges result from combiningone closededge
withone openedge. Figure3liststhesequence
ofedgeswhichwouldresultfromcombiningthe
initialedgesbasedonRules1{4to replicatethe
trees in Figure 1. The translation proceeds by
incrementally matching the left hand sides of
Rules1{4withtheinputtree(andinsuringthat
the tree is completely covered by these rules).
The right-hand sides of these compatible rules
are also combined to produce the translation.
Thisisanidealizedviewofoursysteminwhich
each node in the input tree matches the
left-handside of exactly one transfer rule: there is
no ambiguity and no combinatorial explosion.
Therealityis thatmorethanone transferrules
may be activated for each node, as suggested
in Figure 4. 4
If each of the six nodes of the
source tree corresponded to ve transfer rules,
there are 5 6
= 15625 possible combinations of
rulestoconsider. Toproducetheoutputin
Fig-ure 3, a minimum of seven edges would be
re-quired: four initial edges derived from the
o-riginaltransferrulesplusthreeadditionaledges
representing thecombination of edges (steps2,
3and4inFigure3). Thespeedofoursystemis
measuredbythenumberofactualedgesdivided
bythisminimum.
4
Thethirdexamplelistedwouldactuallyinvolvetwo
1)
4)
3)
2)
a
en
Obj
E = calcular
D’ = recalculate
D = volver
2
3
1
1
a
Obj
D’ = recalculate
D = volver
A’ = Excel
E = calcular
A = Excel
C’ = workbook
de
F = trabajo
C = libro
a
en
Obj
D’ = recalculate
D = volver
3
A’ = Excel
E = calcular
A = Excel
3
B = valores
a
en
Obj
D’ = recalculate
D = volver
3
3
2
A’ = Excel
E = calcular
2
A = Excel
Subj
Subj
Obj
in
in
Obj
Subj
Subj
Obj
in
B’ = values
Subj
Obj
in
B’ = values
B = valores
en
Subj
Subj
Subj
2
3
Figure3: An Idealized Translation Procedure
6 Best First Translation Procedure
The following is an outline of our best rst
searchprocedureforndingasingletranslation:
1. Foreach node N, ndT
N
, theset of
com-patibletransfer rules
2. Create initialedges forallT
N
3. Repeat until a \nished" edge is found or
anedge limitisreached:
(b) Ifcomplete, combine E with
compati-bleincomplete edges
(c) If incomplete, combine E with
com-patible completeedges
(d) Incomplete edge + complete edge =
new edge
The procedure creates one initial edge
for each matching transfer rule in the
database 5
and puts these edges in a
5
com-..
.
a
Obj
D = volver
3
2
E = calcular
en
1
Subj
D’ = recalculate
Obj
in
Subj
1
2
3
D’ = calculate
1
Obj
in
Subj
D = repeat
3
2
1
of
Subj
3
in
Obj
2
Adv
again
E = calculation
Figure4: MultipleTransferRulesforEach
Sub-structure
queue prioritized by score. The
pro-cedure iteratively combines the best
scoring edge with some other compatible
edgetoproduceanewedgeandinsertsthenew
edgeinthequeue. Thescore foreachnew edge
is a function of the scores of theedges used to
produce it. The process continues untileither
an edge limit is reached (the system looks like
itwilltaketoolongtoterminate)oracomplete
edge is produced whose left-hand side is the
inputtree: we callthisedgea \nishededge".
Weusethefollowingtechniqueforcalculating
the score for initial edges. 6
The score for each
initial edge E rooted at N, based on rule R is
calculatedasfollows:
1. SCOR E1(S)=log
2 (
Freq(R)
Freq(Al l Rul esat N) )
Wherethefrequency(Freq)ofa ruleisthe
numberof times itmatchedan examplein
thetrainingcorpus,duringruleacquisition.
Thedenominator isthecombined
frequen-ciesofall rulesthat matchN.
6
Thisissomewhatdependentonthewaythese
trans-ferrulesarederived.Othersystemswouldprobablyhave
Experiment 1: 1155 sentences
Norm No Norm
Total Translations 1153 1127
OverEdge Limit 2 28
ActualEdges 93,719 579,278
MinimumEdges 22,125 20,125
EdgeRatio 3.3 14.8
Accuracy 70.9 70.9
Experiment 2: 2617 sentences
Norm No Norm
Total Translations 2610 2544
OverEdge Limit 7 73
ActualEdges 262,172 1,398,796
MinimumEdges 48,570 42,770
EdgeRatio 4.0 15.5
Accuracy 62.6 61.5
Figure 5: Results
2. Score(S)=Score1(S) Norm Factor
Where the Norm (normalization) factor is
equal to the highest SCORE1 forany rule
matching N.
Since the log
2
of probabilities are necessarily
negative, thishas the eect of setting the E of
each ofthe mostprobable initialedges to zero.
The scores for non-initial edges are calculated
by adding up the scores of the initial edges of
whichthey arecomposed. 7
Without any normalization (Score(S) =
SCOR E1(S)),smalltreesarefavoredoverlarge
trees. Thisslowsdowntheprocessofndingthe
nal result. The normalization we use insures
that themostprobable setof transfer rulesare
considered earlyon.
7 Results
Figure 5 givesour resultsforbothexperiments
1 and 2, both with normalization (Norm) and
without (NoNorm). \Total Translations"refer
to thenumberofsentenceswhichwere
translat-ed successfully by the system and \Over Edge
Limit" refersto thenumberof sentences which
caused thesystemtoexceedtheedgelimit,i.e.,
once the system produces over 10,000 edges,
translationfailure isassumed. The system
cur-7
Scoringforspecialcasesisnotincludedinthispaper.
These casesincluderules for conjunctionsand rulesfor
tion for any input if the edge limit is
exceed-ed. \Actual Edges" refers to the total number
of edges used forattempting to translate every
sentenceinthecorpus. \MinimumEdges"refer
tothetotal minimumnumberof edgesrequired
for successful translations. The \Edge Ratio"
is a ratio between: (1) \Total Edges" less the
numberofedgesusedinfailedtranslations;and
(2)The\MinimumEdges". Thisratio,in
com-binationwith,thenumberof\OverEdgeLimit"
measurestheeÆciency ofa given system.
\Ac-curacy" is an assessment of translation quality
which we willdiscussinthenext section.
Normalizationcausedsignicantspeed-upfor
both experiments. If you compare the total
number of edges used with and without
nor-malization, speed-up is a factor of 6.2 for
Ex-periment 1 and 5.3 for Experiment 2. If you
compareactualedgeratios,speed-upisafactor
of4.5 forExperiment1 and3.9forExperiment
2. Inaddition,thenumberoffailedparseswent
downbyafactorof10forbothexperiments. As
shouldbe expected, accuracy wasvirtually the
samewithandwithoutnormalization,although
normalization did cause a slight
improvemen-t. Normalizationshouldproducetheessentially
thesame result inlesstime.
These results suggest that we can probably
count on a speed-up of at least 4 and a
signif-icant decline in failed parses by using
normal-ization. The dierences in performanceon the
twocorporaaremostlikelyduetothedegreeof
hand-tuningforExperiment 1.
7.1 Our Accuracy Measure
\Accuracy" in Figure 5 is the average of the
followingscore foreach translatedsentence:
jT
NYU T
T
MS j
1=2(jT
NYU j+jT
MS j)
T
NYU
is the set of words in NYU's translation
and T
MS
istheset of words intheoriginal
Mi-crosoft translation. If T
NYU
= \A B C D E"
and T
MS
= \A B C F", then the intersection
set \A B C" is length 3 (the numerator) and
the average length of T
NYU
and T
MS
is 4 1/2
(the denominator). The accuracy score equals
34 1=2=2=3. Thisis aDicecoeÆcient
com-parisonofourtranslationwiththeoriginal. Itis
mentsintheaverageaccuracyscoreforour
sam-ple set of sentences usuallyreect an
improve-ment inoverall translationquality. While it is
signicant thatthe accuracy scores inFigure 5
didnotgodownwhenwenormalizedthescores,
theslight improvement in accuracy shouldnot
be given much weight. Our accuracy score is
awed inthat itcannot account forthe
follow-ingfacts: (1)goodparaphrasesareperfectly
ac-ceptable;(2)some dierences inwordselection
aremore signicant thanothers; and (3)errors
insyntax arenotdirectly accounted for.
NYU's system translates the Spanish
sen-tence \1. Seleccion la celda en la que desea
introduciruna referencia"as \1. select the
cel-l that you want to enter a reference in".
Mi-crosofttranslatesthissentenceas\1. Selectthe
cellin which you want to enter the reference".
Our system gives NYU's translation an
accu-racy score of .75 due to the degree of overlap
withMicrosoft'stranslation. Ahumanreviewer
wouldprobablyrateNYU'stranslationas
com-pletely acceptable. In contrast, NYU's system
producedthefollowingunacceptabletranslation
which also received a score of .75: theSpanish
sentence\Elijalafuncionquedesea pegarenla
formula en el cuadro de dialogo Asistentepara
funciones"is translatedas \ \Choose the
func-tionthatwantstopasteFunctionWizardinthe
formulainthedialogbox",incontrastwith
Mi-crosoft's translation \Choose the function you
want to paste into the formula from the F
unc-tion Wizard dialog box". In fact, some good
translations will get worse scores than some
bad ones, e.g., an acceptable one word
trans-lation can even get a score of 0, e.g.,\SUPR"
was translated as \DEL" by Microsoft and as
\Delete" by NYU. Nevertheless, by averaging
thisaccuracy score overmanyexamples, ithas
provedavaluablemeasureforcomparing
dier-ent versionsof a particular system: better
sys-temsget better results. Similarly,after
tweak-ingthesystem,a better translationofa
partic-ularsentence willusuallyyield abetter score.
8 Future Work
Future work should address two limitations of
our current system: (1) Bad parses yield bad
\bad parse"problem, we are considering using
ourMTsystemwithless-detailedparsers,since
these parsers typicallyproducelesserror-prone
output. We will have to conduct experiments
to determine the minimum level of detail that
isneeded. 8
Previous to thework reported in this paper,
we ran ourMT system on bilingual corpora in
whichthesentenceswerealignedmanually. The
costofmanualalignmentlimitedthesizeofthe
corporawe could use. A lot of our recent MT
researchhasbeenfocusedonsolvingthissparse
dataproblemthroughourdevelopmentofa
sen-tencealignmentprogram(Meyersetal.,1998a).
Wenowhave300,000automaticallyaligned
sen-tencesintheMicrosofthelptext domainfor
fu-ture experiments. In addition to providing us
withmanymoretransferrules,thisshouldallow
us to collect transfer rule co-occurrence
infor-mationwhichwecanthenusetoapplytransfer
rulesmoreeectively,perhapsimproving
trans-lation quality. In a preliminary experiment
a-longthese linesusingthe Experiment1corpus,
co-occurrence informationhadnonoticeable
ef-fect. However, we are hopeful that future
ex-periments with 300,000 aligned sentences (300
times asmuchdata) willbemore successful.
References
Robert J. Bobrow. 1990. Statistical agenda
parsing. In DARPA Speech and Language
Workshop, pages 222{224.
Peter Brown, Stephen A. Della Pietra,
Vin-cent J. Della Pietra, and Robert L.
Mer-cer. 1993. The Mathematics of Statistical
Machine Translation: Parameter Estimation.
Computational Linguistics,19:263{312.
Sharon A. Caraballo and Eugene Charniak.
1997. Newguresofmeritforbest-rst
prob-abilistic chart parsing. Computational
Lin-guistics, 24:275{298.
EugeneCharniak,SharonGoldwater,andMark
Johnson. 1998. Edge-BasedBest-First Chart
Parsing. In Proceedings of the Sixth Annual
Workshop forVery LargeCorpora,Montreal.
8
Onecouldsetupacontinuumfromdetailed
parser-s like Proteus down to shallow verb-group/noun-group
recognizers, with the Penn treebank based parsers
ly-ing somewhereinthe middle. Asonetravels downthe
continuumtothelowerdetailparsers,theerrorrate
nat-Statistical parsing of messages. In DARPA
Speech and Language Workshop, pages 263{
266.
Anna Sagvall Hein. 1996. Preference
Mecha-nismsoftheMultraMachineTranslation
Sys-tem. In Barbara H. Partee and Petr Sgall,
editors, Discourse and Meaning: Papers in
HonorofEvaHaji^cova.JohnBenjamins
Pub-lishingCompany, Amsterdam.
Y. Matsumoto, H. Ishimoto, T. Utsuro, and
M. Nagao. 1993. Structural Matching of
Parallel Texts. In 31st Annual Meeting of
the Association for Computational
Linguis-tics: \Proceedings of the Conference".
Adam Meyers, Roman Yangarber, and Ralph
Grishman. 1996. Alignment of Shared
ForestsforBilingualCorpora. InProceedings
of Coling 1996: The16th International
Con-ference on Computational Linguistics, pages
460{465.
AdamMeyers, Michiko Kosaka,and Ralph
Gr-ishman. 1998a. A Multilingual Procedure
forDictionary-BasedSentenceAlignment. In
Proceedings of AMTA'98: Machine T
ransla-tion and the Information Soup, pages 187{
198.
Adam Meyers, Roman Yangarber, Ralph
Gr-ishman, Catherine Macleod, and Antonio
Moreno-Sandoval. 1998b. Deriving T
rans-fer Rulesfrom Dominance-Preserving
Align-ments. InProceedingsof Coling-ACL98: The
17th International Conference on
Computa-tionalLinguistics andthe 36thMeetingof the
Association for Computational Linguistics.
Sergei Nirenburg and Robert E. Frederking.
1995. ThePanglossMarkIIIMachineT
rans-lation System: Multi-Engine System
Archi-tecture. Technical report, NMSU CRL,USC
ISI,and CMUCMT.
AndrewWay,IanCrookston,andJaneShelton.
1997. A Typology of Translation Problems
for Eurotra Translation Machines. Machine