Chart Based Transfer Rule Application in Machine Translation

(1)

Adam Meyers

New York University

[email protected]

Michiko Kosaka

Monmouth University

[email protected]

Ralph Grishman

New York University

[email protected]

Abstract

Transfer-basedMachineTranslationsystems

re-quireaprocedureforchoosingthesetoftransfer

rules for generating a target language

transla-tion from a given sourcelanguage sentence. In

an MT system with many competing transfer

rules,choosingthe best setof transfer rulesfor

translationmayinvolvetheevaluationofan

ex-plosivenumberofcompetingsets. Weproposea

solutionto thisproblembased on current

best-rst chartparsing algorithms.

1 Introduction

Transfer-basedMachineTranslationsystems

re-quirea procedureforchoosingtheset of

trans-ferrulesforgeneratingatarget language

trans-lation from a given source language sentence.

This procedure is trivial for a system if, given

a context, one transferrulecan be selected

un-ambiguously. Otherwise, choosing the best set

of transfer rules may involve the evaluation of

numerous competing sets. In fact, thenumber

of possible transfer rule combinations

increas-es exponentially with the length of the source

language sentence. This situation mirrors the

problemof choosingproductionsina

nondeter-ministic parser. In this paper, we describe a

system for choosing transfer rules, based on

s-tatisticalchart parsing(Bobrow, 1990; Chitrao

and Grishman, 1990; Caraballo and Charniak,

1997; Charniaketal., 1998).

In our Machine Translation system, transfer

rules are generated automatically from parsed

parallel text along the lines of (Matsumoto et

al., 1993; Meyers et al., 1996; Meyers et al.,

1998b). Our system tends to acquire a large

numberoftransferrules,duemainlyto

alterna-tive ways of translating the same sequences of

words, non-literal translations in parallel text

oursystemchoosethebestsetofrules

eÆcient-ly. Whilethetechniquediscussedhereobviously

appliestosimilarsuchsystems,itcouldalso

ap-plyto hand-codedsystems inwhich each word

or group of words is related to more than one

transferrule. For example, both Multra(Hein,

1996)andtheEurotrasystemdescribedin(Way

et al., 1997) require components for deciding

whichcombinationoftransferrulestouse. The

proposed technique may be used with systems

liketheseprovidedthatalltransferrulesare

as-signedinitialscoresratingtheirappropriateness

for translation. These appropriateness ratings

couldbedependentorindependentofcontext. 1

2 Previous Work

The MTliteraturedescribesseveral techniques

forderivingtheappropriatetranslation.

Statis-ticalsystems that do notincorporatelinguistic

analysis (Brown et al., 1993) typically choose

the most likely translation based on a

statis-tical model, i.e.., translation probability

deter-minesthetranslation. (Hein,1996)reportsaset

of(hand-coded) featurestructurebased

prefer-encerulestochooseamongalternativesin

Mul-tra. There is some discussion about adding

sometransferrulesautomaticallyacquiredfrom

corpora to Multra. 2

Assuming that they

over-generaterules(aswedid),asystemliketheone

weproposeshouldbebenecial. In(Wayetal.,

1997),manydierentcriteriaareusedtochoose

transfer rulesto execute including: preferences

forspecicrulesovergeneralones, andcomplex

rulenotationthatinsuresthatfewrulescan

ap-plyto thesame set of words.

The Pangloss Mark III system (Nirenburg

1

Thistranslation procedurewouldprobably

comple-ment,notreplaceexistingproceduresinthesesystems.

2

(2)

and Frederking, 1995) uses a chart-walk

algo-rithm to combine the results of three MT

en-gines: an example-based engine, a

knowledge-based engine, and a lexical-transfer engine.

Each engine contributes its best edges and the

chart-walk algorithm uses dynamic

program-mingto ndthecombination of edges withthe

best overall score that covers the input string.

Scoresofedgesarenormalizedsothatthescores

from the dierent engines are comparable and

weightedtofavorengineswhichtendtoproduce

better results. Pangloss's algorithm combines

whole MT systems. In contrast, our

algorith-m combines output of individualtransfer rules

withinasingleMTsystem. Also,weusea

best-rst search that incorporates a

probabilistic-basedgure ofmerit,whereasPangloss usesan

empirically based weighting scheme and what

appears to be atop-down search.

Best-rst probabilistic chart parsers

(Bo-brow,1990;ChitraoandGrishman,1990;

Cara-balloandCharniak,1997;Charniaketal.,1998)

strive to nd the best parse, without

exhaus-tivelytryingall possibleproductions. A

proba-bilisticgureofmerit(CaraballoandCharniak,

1997;Charniaketal.,1998)isdevisedfor

rank-ing edges. The highest ranking edges are

pur-suedrst andthe parserhaltsafter itproduces

a completeparse. We proposeanalgorithm for

choosing and applying transfer rules based on

probability. Each nal translation is derived

from a specic set of transfer rules. If the

pro-cedureimmediatelyselectedthesetransferrules

andappliedtheminthecorrectorder,wewould

arriveatthenaltranslationwhilecreatingthe

minimumnumberofedges. Ourprocedureuses

about 4 times this minimumnumber of edges.

Withrespectto chartparsing,(Charniaketal.,

1998)reportthattheirparsercanachieve good

results while producing about three times the

minimumnumberof edges requiredto produce

thenal parse.

3 Test Data

Weconductedtwoexperiments. For

experimen-t1, we parsed a sentence-aligned pair of

Span-ish and English corpora, each containing 1155

sentences of Microsoft Excel Help Text. These

pairsof parsedsentenceswere dividedinto

dis-tinct training and test sets, ninety percent for

Obj

B = valores

en

E = calcular

C = libro

de

F = trabajo

D = volver

a

Subj

Target Tree

B’ = values

Obj

C’ = workbook

D’ = recalculate

A = Excel

Subj

in

A’ = Excel

Excel recalculates

values in workbook

Excel vuelve a calcular

valores en libro de trabajo

Source Tree

Figure 1: Spanish and English Regularized

Parse Trees

set was used to acquire transfer rules (Meyers

etal.,1998b)whichwerethenusedtotranslate

thesentences inthetest set. Thispaper

focus-eson ourtechnique for applyingthese transfer

rulesinorder to translatethetest sentences.

The test and training sets in experiment1

were rotated, assigning a dierent tenth of the

sentencestothetestsetineachrotation. Inthis

waywetestedtheprogramontheentirecorpus.

Onlyone test set(onetenth of thecorpus)was

usedfortuningthesystemduringdevelopment.

Transfer rules, 1109 on average, were acquired

from each trainingset and used for translation

of the corresponding test set. For Experiment

2,weparsed2617pairsofalignedsentencesand

used the same rotation procedure for dividing

test and training corpora. The Experiment 2

corpusincludedtheexperiment1corpus. An

av-erageof 2191 transferrules were acquiredfrom

agiven setof Experiment2 trainingsentences.

Experiment1isorchestratedinacareful

man-ner that may not be practical for extremely

largecorpora,and Experiment2showshowthe

program performs ifwe scale up and eliminate

someofthene-tuning. Apartfromcorpussize,

there are two main dierence between the two

experiments: (1) the experiment1 corpus was

alignedcompletelybyhand,whereasthe

Exper-iment2corpuswasalignedautomaticallyusing

thesystem described in(Meyers etal., 1998a);

and (2) the parsers were tuned to the

(3)

sen-C’ = workbook

1

2

3

Subj

Obj

in

A = Excel

B = valores

de

F = trabajo

C = libro

a

en

Obj

E = calcular

D = volver

3

2

1

Subj

A’ = Excel

B’ = values

4)

1)

2)

3)

D’ = recalculate

Figure2: A Setof TransferRules

4 Parses and Transfer Rules

Figure 1 is a pair of \regularized" parses for a

correspondingpair of Spanishand English

sen-tences from Microsoft Excel help text. These

areF-structure-likedependencyanalysesof

sen-tencesthatrepresent predicateargument

struc-ture. This representation serves to neutralize

somedierencesbetweenrelatedsentencetypes,

e.g., theregularized parse of related active and

passive sentences are identical, except for the

featurevaluepairfMood,Passiveg. Nodes

(val-ues)are labeledwithheadwordsand arcs

(fea-tures) are labeled with grammatical functions

(subject, object), prepositions (in) and

subor-dinate conjunctions (before). 3

. For

demonstra-tionpurposes,thesourcetreeinFigure1 isthe

inputto our translationsystem and the target

tree istheoutput.

The transfer rules in Figure 2 can be

used to convert the input tree into the

out-put tree. These transfer rules are pairs of

corresponding rooted substructures, where a

substructure (Matsumoto et al., 1993) is a

connected set of arcs and nodes. A rule

3

Morphological features and their values

(Gram-consistsofeitherapairof\open"substructures

(rule4)orapairof\closed"substructures(rules

1, 2 and 3). Closed substructures consist of

s-inglenodes(A,A',B,B',C') orsubtrees(the left

hand side of rule 3). Open substructures

con-tainone ormore open arcs,arcs withoutheads

(bothsubstructuresinrule4).

5 Simplied Translation with

Tree-based Transfer Rules

The rules in Figure 2 could combine by lling

intheopen arcs in rule4 withthe rootsof the

substructures in rules 1, 2 and 3. The result

wouldbeaclosededgewhichmapsthelefttree

inFigure1intotherighttree. Justasedgesofa

chart parserarebasedon thecontext freerules

used by the chart parser, edges of our

trans-lation systemare basedon these transfer rules.

Initialedgesareidenticaltotransferrules.

Oth-er edges result from combiningone closededge

withone openedge. Figure3liststhesequence

ofedgeswhichwouldresultfromcombiningthe

initialedgesbasedonRules1{4to replicatethe

trees in Figure 1. The translation proceeds by

incrementally matching the left hand sides of

Rules1{4withtheinputtree(andinsuringthat

the tree is completely covered by these rules).

The right-hand sides of these compatible rules

are also combined to produce the translation.

Thisisanidealizedviewofoursysteminwhich

each node in the input tree matches the

left-handside of exactly one transfer rule: there is

no ambiguity and no combinatorial explosion.

Therealityis thatmorethanone transferrules

may be activated for each node, as suggested

in Figure 4. 4

If each of the six nodes of the

source tree corresponded to ve transfer rules,

there are 5 6

= 15625 possible combinations of

rulestoconsider. Toproducetheoutputin

Fig-ure 3, a minimum of seven edges would be

re-quired: four initial edges derived from the

o-riginaltransferrulesplusthreeadditionaledges

representing thecombination of edges (steps2,

3and4inFigure3). Thespeedofoursystemis

measuredbythenumberofactualedgesdivided

bythisminimum.

4

Thethirdexamplelistedwouldactuallyinvolvetwo

(4)

1)

4)

3)

2)

a

en

Obj

E = calcular

D’ = recalculate

D = volver

2

3

1

a

Obj

D’ = recalculate

D = volver

A’ = Excel

E = calcular

A = Excel

C’ = workbook

de

F = trabajo

C = libro

a

en

Obj

D’ = recalculate

D = volver

3

A’ = Excel

E = calcular

A = Excel

3

B = valores

a

en

Obj

D’ = recalculate

D = volver

3

2

A’ = Excel

E = calcular

2

A = Excel

Subj

Obj

in

Obj

Subj

Obj

in

B’ = values

Subj

Obj

in

B’ = values

B = valores

en

Subj

2

3

Figure3: An Idealized Translation Procedure

6 Best First Translation Procedure

The following is an outline of our best rst

searchprocedureforndingasingletranslation:

1. Foreach node N, ndT

N

, theset of

com-patibletransfer rules

2. Create initialedges forallT

N

3. Repeat until a \nished" edge is found or

anedge limitisreached:

(b) Ifcomplete, combine E with

compati-bleincomplete edges

(c) If incomplete, combine E with

com-patible completeedges

(d) Incomplete edge + complete edge =

new edge

The procedure creates one initial edge

for each matching transfer rule in the

database 5

and puts these edges in a

5

(5)

com-..

.

a

Obj

D = volver

3

2

E = calcular

en

1

Subj

D’ = recalculate

Obj

in

Subj

1

2

3

D’ = calculate

1

Obj

in

Subj

D = repeat

3

2

1

of

Subj

3

in

Obj

2

Adv

again

E = calculation

Figure4: MultipleTransferRulesforEach

Sub-structure

queue prioritized by score. The

pro-cedure iteratively combines the best

scoring edge with some other compatible

edgetoproduceanewedgeandinsertsthenew

edgeinthequeue. Thescore foreachnew edge

is a function of the scores of theedges used to

produce it. The process continues untileither

an edge limit is reached (the system looks like

itwilltaketoolongtoterminate)oracomplete

edge is produced whose left-hand side is the

inputtree: we callthisedgea \nishededge".

Weusethefollowingtechniqueforcalculating

the score for initial edges. 6

The score for each

initial edge E rooted at N, based on rule R is

calculatedasfollows:

1. SCOR E1(S)=log

2 (

Freq(R)

Freq(Al l Rul esat N) )

Wherethefrequency(Freq)ofa ruleisthe

numberof times itmatchedan examplein

thetrainingcorpus,duringruleacquisition.

Thedenominator isthecombined

frequen-ciesofall rulesthat matchN.

6

Thisissomewhatdependentonthewaythese

trans-ferrulesarederived.Othersystemswouldprobablyhave

Experiment 1: 1155 sentences

Norm No Norm

Total Translations 1153 1127

OverEdge Limit 2 28

ActualEdges 93,719 579,278

MinimumEdges 22,125 20,125

EdgeRatio 3.3 14.8

Accuracy 70.9 70.9

Experiment 2: 2617 sentences

Norm No Norm

Total Translations 2610 2544

OverEdge Limit 7 73

ActualEdges 262,172 1,398,796

MinimumEdges 48,570 42,770

EdgeRatio 4.0 15.5

Accuracy 62.6 61.5

Figure 5: Results

2. Score(S)=Score1(S) Norm Factor

Where the Norm (normalization) factor is

equal to the highest SCORE1 forany rule

matching N.

Since the log

2

of probabilities are necessarily

negative, thishas the eect of setting the E of

each ofthe mostprobable initialedges to zero.

The scores for non-initial edges are calculated

by adding up the scores of the initial edges of

whichthey arecomposed. 7

Without any normalization (Score(S) =

SCOR E1(S)),smalltreesarefavoredoverlarge

trees. Thisslowsdowntheprocessofndingthe

nal result. The normalization we use insures

that themostprobable setof transfer rulesare

considered earlyon.

7 Results

Figure 5 givesour resultsforbothexperiments

1 and 2, both with normalization (Norm) and

without (NoNorm). \Total Translations"refer

to thenumberofsentenceswhichwere

translat-ed successfully by the system and \Over Edge

Limit" refersto thenumberof sentences which

caused thesystemtoexceedtheedgelimit,i.e.,

once the system produces over 10,000 edges,

translationfailure isassumed. The system

cur-7

Scoringforspecialcasesisnotincludedinthispaper.

These casesincluderules for conjunctionsand rulesfor

(6)

tion for any input if the edge limit is

exceed-ed. \Actual Edges" refers to the total number

of edges used forattempting to translate every

sentenceinthecorpus. \MinimumEdges"refer

tothetotal minimumnumberof edgesrequired

for successful translations. The \Edge Ratio"

is a ratio between: (1) \Total Edges" less the

numberofedgesusedinfailedtranslations;and

(2)The\MinimumEdges". Thisratio,in

com-binationwith,thenumberof\OverEdgeLimit"

measurestheeÆciency ofa given system.

\Ac-curacy" is an assessment of translation quality

which we willdiscussinthenext section.

Normalizationcausedsignicantspeed-upfor

both experiments. If you compare the total

number of edges used with and without

nor-malization, speed-up is a factor of 6.2 for

Ex-periment 1 and 5.3 for Experiment 2. If you

compareactualedgeratios,speed-upisafactor

of4.5 forExperiment1 and3.9forExperiment

2. Inaddition,thenumberoffailedparseswent

downbyafactorof10forbothexperiments. As

shouldbe expected, accuracy wasvirtually the

samewithandwithoutnormalization,although

normalization did cause a slight

improvemen-t. Normalizationshouldproducetheessentially

thesame result inlesstime.

These results suggest that we can probably

count on a speed-up of at least 4 and a

signif-icant decline in failed parses by using

normal-ization. The dierences in performanceon the

twocorporaaremostlikelyduetothedegreeof

hand-tuningforExperiment 1.

7.1 Our Accuracy Measure

\Accuracy" in Figure 5 is the average of the

followingscore foreach translatedsentence:

jT

NYU T

T

MS j

1=2(jT

NYU j+jT

MS j)

T

NYU

is the set of words in NYU's translation

and T

MS

istheset of words intheoriginal

Mi-crosoft translation. If T

NYU

= \A B C D E"

and T

MS

= \A B C F", then the intersection

set \A B C" is length 3 (the numerator) and

the average length of T

NYU

and T

MS

is 4 1/2

(the denominator). The accuracy score equals

34 1=2=2=3. Thisis aDicecoeÆcient

com-parisonofourtranslationwiththeoriginal. Itis

mentsintheaverageaccuracyscoreforour

sam-ple set of sentences usuallyreect an

improve-ment inoverall translationquality. While it is

signicant thatthe accuracy scores inFigure 5

didnotgodownwhenwenormalizedthescores,

theslight improvement in accuracy shouldnot

be given much weight. Our accuracy score is

awed inthat itcannot account forthe

follow-ingfacts: (1)goodparaphrasesareperfectly

ac-ceptable;(2)some dierences inwordselection

aremore signicant thanothers; and (3)errors

insyntax arenotdirectly accounted for.

NYU's system translates the Spanish

sen-tence \1. Seleccion la celda en la que desea

introduciruna referencia"as \1. select the

cel-l that you want to enter a reference in".

Mi-crosofttranslatesthissentenceas\1. Selectthe

cellin which you want to enter the reference".

Our system gives NYU's translation an

accu-racy score of .75 due to the degree of overlap

withMicrosoft'stranslation. Ahumanreviewer

wouldprobablyrateNYU'stranslationas

com-pletely acceptable. In contrast, NYU's system

producedthefollowingunacceptabletranslation

which also received a score of .75: theSpanish

sentence\Elijalafuncionquedesea pegarenla

formula en el cuadro de dialogo Asistentepara

funciones"is translatedas \ \Choose the

func-tionthatwantstopasteFunctionWizardinthe

formulainthedialogbox",incontrastwith

Mi-crosoft's translation \Choose the function you

want to paste into the formula from the F

unc-tion Wizard dialog box". In fact, some good

translations will get worse scores than some

bad ones, e.g., an acceptable one word

trans-lation can even get a score of 0, e.g.,\SUPR"

was translated as \DEL" by Microsoft and as

\Delete" by NYU. Nevertheless, by averaging

thisaccuracy score overmanyexamples, ithas

provedavaluablemeasureforcomparing

dier-ent versionsof a particular system: better

sys-temsget better results. Similarly,after

tweak-ingthesystem,a better translationofa

partic-ularsentence willusuallyyield abetter score.

8 Future Work

Future work should address two limitations of

our current system: (1) Bad parses yield bad

(7)

\bad parse"problem, we are considering using

ourMTsystemwithless-detailedparsers,since

these parsers typicallyproducelesserror-prone

output. We will have to conduct experiments

to determine the minimum level of detail that

isneeded. 8

Previous to thework reported in this paper,

we ran ourMT system on bilingual corpora in

whichthesentenceswerealignedmanually. The

costofmanualalignmentlimitedthesizeofthe

corporawe could use. A lot of our recent MT

researchhasbeenfocusedonsolvingthissparse

dataproblemthroughourdevelopmentofa

sen-tencealignmentprogram(Meyersetal.,1998a).

Wenowhave300,000automaticallyaligned

sen-tencesintheMicrosofthelptext domainfor

fu-ture experiments. In addition to providing us

withmanymoretransferrules,thisshouldallow

us to collect transfer rule co-occurrence

infor-mationwhichwecanthenusetoapplytransfer

rulesmoreeectively,perhapsimproving

trans-lation quality. In a preliminary experiment

a-longthese linesusingthe Experiment1corpus,

co-occurrence informationhadnonoticeable

ef-fect. However, we are hopeful that future

ex-periments with 300,000 aligned sentences (300

times asmuchdata) willbemore successful.

References

Robert J. Bobrow. 1990. Statistical agenda

parsing. In DARPA Speech and Language

Workshop, pages 222{224.

Peter Brown, Stephen A. Della Pietra,

Vin-cent J. Della Pietra, and Robert L.

Mer-cer. 1993. The Mathematics of Statistical

Machine Translation: Parameter Estimation.

Computational Linguistics,19:263{312.

Sharon A. Caraballo and Eugene Charniak.

1997. Newguresofmeritforbest-rst

prob-abilistic chart parsing. Computational

Lin-guistics, 24:275{298.

EugeneCharniak,SharonGoldwater,andMark

Johnson. 1998. Edge-BasedBest-First Chart

Parsing. In Proceedings of the Sixth Annual

Workshop forVery LargeCorpora,Montreal.

8

Onecouldsetupacontinuumfromdetailed

parser-s like Proteus down to shallow verb-group/noun-group

recognizers, with the Penn treebank based parsers

ly-ing somewhereinthe middle. Asonetravels downthe

continuumtothelowerdetailparsers,theerrorrate

nat-Statistical parsing of messages. In DARPA

Speech and Language Workshop, pages 263{

266.

Anna Sagvall Hein. 1996. Preference

Mecha-nismsoftheMultraMachineTranslation

Sys-tem. In Barbara H. Partee and Petr Sgall,

editors, Discourse and Meaning: Papers in

HonorofEvaHaji^cova.JohnBenjamins

Pub-lishingCompany, Amsterdam.

Y. Matsumoto, H. Ishimoto, T. Utsuro, and

M. Nagao. 1993. Structural Matching of

Parallel Texts. In 31st Annual Meeting of

the Association for Computational

Linguis-tics: \Proceedings of the Conference".

Adam Meyers, Roman Yangarber, and Ralph

Grishman. 1996. Alignment of Shared

ForestsforBilingualCorpora. InProceedings

of Coling 1996: The16th International

Con-ference on Computational Linguistics, pages

460{465.

AdamMeyers, Michiko Kosaka,and Ralph

Gr-ishman. 1998a. A Multilingual Procedure

forDictionary-BasedSentenceAlignment. In

Proceedings of AMTA'98: Machine T

ransla-tion and the Information Soup, pages 187{

198.

Adam Meyers, Roman Yangarber, Ralph

Gr-ishman, Catherine Macleod, and Antonio

Moreno-Sandoval. 1998b. Deriving T

rans-fer Rulesfrom Dominance-Preserving

Align-ments. InProceedingsof Coling-ACL98: The

17th International Conference on

Computa-tionalLinguistics andthe 36thMeetingof the

Association for Computational Linguistics.

Sergei Nirenburg and Robert E. Frederking.

1995. ThePanglossMarkIIIMachineT

rans-lation System: Multi-Engine System

Archi-tecture. Technical report, NMSU CRL,USC

ISI,and CMUCMT.

AndrewWay,IanCrookston,andJaneShelton.

1997. A Typology of Translation Problems

for Eurotra Translation Machines. Machine