Using Language and Translation Models to Select the Best among Outputs from Multiple MT Systems


Yasuhiro Akiba, Taro Watanabe and Eiichiro Sumita

ATR Spoken Language Translation Research Laboratories

2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan

fyasuhiro.akiba,taro.watanabe,[email protected]

Abstract

This paper addresses the problem of automatically selecting the best among outputs from multiple machine translation (MT) systems. Existing approaches select the output assigned the highest score according to a target language model. In some cases, the existing approaches do not work well. This paper proposes two methods to improve performance. The first method is based on a multiple comparison test and checks whether a score from language and translation models is significantly higher than the others. The second method is based on the probability that a translation is not inferior to the others, which is predicted from the above scores. Experimental results show that the proposed methods achieve an improvement of 2 to 6% in performance.

1 Introduction

This paper addresses the challenging problem of automatically selecting the best among outputs from multiple machine translation (MT) systems (Figure 1). In combinations of multiple MT systems, some component MT systems can translate a source sentence well while others cannot. In such a case, correct selection of the best output can obviously boost performance.

[Figure 1: The selection system. An input is translated by MTa, MTb, and MTc into Outputa, Outputb, and Outputc, from which the selection system chooses the best output.]

ATR has been developing such multiple MT systems, including three Japanese-to-English (J-E) MT systems: TDMT (Furuse and Iida, 1996), D3 (Sumita, 2001), and SMT (Watanabe et al., 2002), and three English-to-Japanese (E-J) MT systems: TDMT (Furuse and Iida, 1996), HPAT (Imamura, 2002), and SMT (Watanabe et al., 2002). In order to evaluate each MT system, the MT outputs were manually assigned one of four ranks [1], A, B, C, and D, by native speakers of the target language. The ideal selection for J-E MT systems is the highest-ranked output from the three J-E MT systems: TDMT, D3, and SMT. The ideal selection for E-J MT systems is the highest-ranked output from the three E-J MT systems: TDMT, HPAT, and SMT.

Figure 2 shows the individual performances of the three J-E MT systems and the ideal selection system derived from their combination. Figure 3 shows the individual performances of the three E-J MT systems and the ideal selection system derived from their combination. The left-hand group of bars indicates the ratio of the number of sentences ranked as A to the total number of sentences translated by each MT system (hereafter, performance for Rank A). The middle group of bars indicates the ratio of the number of sentences ranked as A or B to the total number of sentences translated by each MT system (hereafter, performance for Rank A+B). The right-hand group of bars indicates the ratio of the number of sentences ranked as A, B, or C to the total number of sentences translated by each MT system (hereafter, performance for Rank A+B+C). The black bars indicate the performance of the ideal selection system. As Figures 2 and 3 show, the performance of the ideal J-E and E-J selection systems is much better than that of each component MT system.

[1] The four ranks are defined as follows: (A) Perfect: no problem in either information or grammar; (B) Fair: easy to understand, with either some unimportant information missing or flawed grammar; (C) Acceptable: broken, but understandable with effort; (D) Nonsense: important information has been translated incorrectly.

[Figure 2: Performance (%) for Ranks A, A+B, and A+B+C of the ideal selection system, TDMT, D3, and SMT.]
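The three performance measures and the ideal selection defined above can be illustrated with a minimal Python sketch. The function names and the toy rank lists are assumptions made for illustration only; they are not data from the experiments.

```python
from collections import Counter


def rank_performance(ranks):
    """Return the performance (%) for Rank A, A+B, and A+B+C.

    `ranks` holds one human-assigned rank ("A", "B", "C", or "D")
    per translated sentence.
    """
    counts = Counter(ranks)
    total = len(ranks)
    a = counts["A"]
    ab = a + counts["B"]
    abc = ab + counts["C"]
    return {
        "A": 100.0 * a / total,
        "A+B": 100.0 * ab / total,
        "A+B+C": 100.0 * abc / total,
    }


def ideal_selection(ranks_per_system):
    """For each sentence, keep the best rank among the component systems."""
    # "A" < "B" < "C" < "D" in string order, so min() picks the best rank.
    return [min(sentence_ranks) for sentence_ranks in zip(*ranks_per_system)]


# Toy ranks for four sentences (illustrative values only).
tdmt = ["A", "C", "D", "B"]
d3 = ["B", "A", "D", "D"]
smt = ["D", "D", "C", "A"]

ideal = ideal_selection([tdmt, d3, smt])
print(rank_performance(tdmt))
print(rank_performance(ideal))
```

In this toy case the ideal selection reaches a higher Rank A performance than any single system, which is the effect Figures 2 and 3 report for the real systems.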

Conventional approaches to the selection problem include methods (Callison-Burch and Flournoy, 2001; Kaki et al., 1999) that automatically select the output assigned the highest probability $P(t)$ (hereafter, LM-score), according to a language model (LM) for the translation target language. As a preliminary experiment, the authors applied this LM-score to selecting the best among the outputs from the three J-E MT systems. In order to make a comparison, the authors also used a score based on a translation model (TM) called IBM4 (Brown et al., 1993) (hereafter, TM-score) and a score based on the product of the TM-score and the LM-score (hereafter, TM*LM-score) to select the best output. Table 1 shows the results of this preliminary experiment. Each number indicates the difference between the performance for Rank A of each selection system and that of D3 (the best MT system, i.e., the one with the highest performance for Rank A). The LM-score and TM-score based selections did not improve the performance for Rank A, whereas the TM*LM-score did. The preliminary experiment appears to indicate that the TM*LM-score works better than the LM-score in selecting the best output.

[Figure 3: Performance (%) for Ranks A, A+B, and A+B+C of the ideal selection system, TDMT, HPAT, and SMT.]

Table 1: Difference in performance for Rank A between each selection system and D3.

Scoring method               TM*LM    TM      LM
Difference in performance    4.1      -1.5    -0.5
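The score-based selection used in the preliminary experiment amounts to taking the output with the highest score. The following Python sketch assumes the scores are kept as log-probabilities, so that the product of the TM-score and the LM-score becomes a sum; the numeric values and the dictionary keys are hypothetical.

```python
def lm_score(candidate):
    """LM-score: log-probability of the target sentence under the LM."""
    return candidate["log_lm"]


def tm_score(candidate):
    """TM-score: log-probability assigned by the translation model."""
    return candidate["log_tm"]


def tm_lm_score(candidate):
    """Product of TM-score and LM-score, computed as a sum of logs."""
    return candidate["log_tm"] + candidate["log_lm"]


def select_by_score(candidates, score_fn):
    """Return the name of the system whose output has the highest score."""
    return max(candidates, key=lambda name: score_fn(candidates[name]))


# Hypothetical log-scores for the outputs of one source sentence.
candidates = {
    "TDMT": {"log_tm": -12.0, "log_lm": -9.5},
    "D3": {"log_tm": -10.0, "log_lm": -11.0},
    "SMT": {"log_tm": -11.5, "log_lm": -8.0},
}

for score_fn in (lm_score, tm_score, tm_lm_score):
    print(score_fn.__name__, select_by_score(candidates, score_fn))
```

With these values the TM-score alone prefers D3, while the LM-score and the combined score prefer SMT, which shows how the choice of score can change the selected output.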

As can be easily guessed, the scores from a language model, a translation model, or both models combined have two problems. The first problem is that the TM*LM-score, the TM-score, and the LM-score are statistical variables. Even when the TM and LM are trained on corpora of the same size, changing the training corpus also changes the TM-score, the LM-score, and the TM*LM-score. Figure 4 shows this phenomenon. In the figure, TRN1, TRN2, and TRN3 correspond, respectively, to (2), (3), and (4) below, which are translations of (1) by different J-E MT systems.

(1)            Konputa-no    shisutemu   enjinia    desu
     I-subj    computer-of   system      engineer   am
     "I'm a computer engineer."

(2) I'm a computer systems engineer. (TRN1)

(3) I'm a computer salesman. (TRN2)

(4) It's computer. (TRN3)

The LM and TM were trained in ten ways on training sets, which were subsets of the ATR broad-coverage bilingual basic expression (BE) corpus (Takezawa et al., 2002), according to ten-fold cross validation (Mitchell, 1997). For each i (i = 1, 2, 3), TRNi is scored in ten ways by the TM*LM-score. Some TM*LM-scores place TRNi (i = 1, 2, 3) in a different order. Even if a huge corpus is prepared to train a good TM and LM, this phenomenon remains.

[Figure 4: Logarithmic scaled scores of TRN1, TRN2, and TRN3 for each of the ten pairs of LMs and TMs.]

Table 2: Each matrix shows agreement and disagreement between the ideal selection and the score-based selection using the TM*LM-score. 1 and 0 indicate "selected" and "not selected", respectively.

                    TM*LM-score-based selection
Ideal        TDMT            D3              SMT
selection    1       0       1       0       1       0
1            22.7    44.5    37.3    23.7    31.4    5.5
0            1.0     31.8    7.1     32.0    36.7    26.5

In order to solve this first problem, this paper proposes a statistical-test-based selection system. Here, the statistical test used is a multiple comparison test based on the Kruskal-Wallis test (Hochberg and Tamhane, 1983). The proposed method checks whether the highest score is significantly different from the others.

The second problem is that the translations with the highest TM*LM-score tend to differ from those ranked highest by human evaluators. Table 2 shows this phenomenon. The table consists of three 2×2 confusion matrices for three J-E MT systems: J-E TDMT, D3, and J-E SMT. Each matrix shows agreement and disagreement between the ideal selection by a human evaluator and the selection by the TM*LM-score. The (1,1)-element and the (0,0)-element indicate the percentage of agreement, and the (1,0)-element and the (0,1)-element indicate the percentage of disagreement. In the confusion matrix for J-E SMT, the number in the (1,0)-element is larger than that in the (0,1)-element. This means that the TM*LM-score tends to give the highest score to the translation from J-E SMT when it is not the translation assigned the best rank. On the other hand, in the confusion matrices for J-E TDMT and D3, the number in the (0,1)-element is larger than that in the (1,0)-element. This means that the TM*LM-score tends not to give the highest score to the translation from the MT systems other than J-E SMT, even if that translation is assigned the best rank.
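A confusion matrix of this kind can be tabulated per MT system from two 0/1 flags per sentence. The sketch below is only an illustration: the function name and the flag lists are assumptions, and the cells are keyed as (ideal selection, score-based selection), following the row and column layout of Table 2.

```python
def confusion_percentages(ideal_flags, score_flags):
    """Return a 2x2 table of percentages for one MT system.

    Each flag is 1 if the system's output was selected for that sentence
    and 0 otherwise. Keys are (ideal, score_based) pairs.
    """
    table = {(i, s): 0 for i in (1, 0) for s in (1, 0)}
    for ideal, chosen in zip(ideal_flags, score_flags):
        table[(ideal, chosen)] += 1
    n = len(ideal_flags)
    return {cell: 100.0 * count / n for cell, count in table.items()}


# Hypothetical flags for five sentences.
ideal_flags = [1, 1, 0, 0, 1]
score_flags = [1, 0, 0, 1, 0]

matrix = confusion_percentages(ideal_flags, score_flags)
for cell in sorted(matrix, reverse=True):
    print(cell, matrix[cell])
```

The diagonal cells (1, 1) and (0, 0) give the agreement, and the two off-diagonal cells give the two kinds of disagreement discussed above.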

To solve this second problem, this paper proposes a selection system based on the conditional probability that a translation is not inferior to the other translations when the translation, encoded by using the TM*LM-score, the TM-score, or the LM-score, satisfies some conditions. For each MT system, the conditional probability is learned as a regression tree (Chambers and Hastie, 1992; Breiman et al., 1984) from the vector-encoding of the translations labeled as "not inferior" or "inferior".

The next section presents our two proposed methods. Experimental results are shown and discussed in Section 3. Finally, our conclusions are presented in Section 4.

2 Proposed Method

2.1 Proposed Method (1)

To solve the first problem described in Section 1, this subsection proposes a method that selects the translation according to whether the scores of outputs from each MT system significantly differ from each other.

In order to detect a significant difference, the proposed method first prepares multiple subsets of the full parallel corpus according to k-fold cross validation (Mitchell, 1997) and trains both TM and LM on each subset. For example, the full parallel corpus $C$ is divided into ten subsets $V_i$ ($i = 0, 1, \ldots, 9$). For each $i$ ($i = 0, 1, \ldots, 9$), the proposed method trains a translation model $TM_i$ on $C_i$ ($= C - V_i$) and a language model $LM_i$ on the target-language part of $C_i$ (Figure 5).
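The construction of the ten training sets $C_i = C - V_i$ can be sketched in Python as follows. The round-robin assignment of sentence pairs to folds and the function name are assumptions for illustration; the paper does not state how the subsets $V_i$ were drawn.

```python
def leave_one_fold_out(corpus, k=10):
    """Split `corpus` into k folds V_i and build C_i = C - V_i for each i.

    `corpus` is a list of (source, target) sentence pairs. Pair j is put
    into fold j mod k (an illustrative choice, not the paper's).
    """
    folds = [corpus[i::k] for i in range(k)]  # V_0 ... V_{k-1}
    training_sets = []
    for i in range(k):
        held_out = set(range(i, len(corpus), k))  # indices of V_i
        training_sets.append(
            [pair for j, pair in enumerate(corpus) if j not in held_out]
        )
    return folds, training_sets


# Toy parallel corpus of 20 sentence pairs.
corpus = [(f"ja{i}", f"en{i}") for i in range(20)]
folds, training_sets = leave_one_fold_out(corpus, k=10)

# TM_i would be trained on training_sets[i], and LM_i on its target side.
target_side_0 = [target for _, target in training_sets[0]]
print(len(folds[0]), len(training_sets[0]), len(target_side_0))
```

Each of the k model pairs therefore sees a slightly different training corpus of the same size, which is what makes the k scores of one translation comparable as a sample.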

Hereafter, let $L_i(t)$ denote the tri-gram statistics of a translation $t$ computed by using $LM_i$. Also, let $T_i(s, t)$ denote $\sum_{a \in S} P(s, a \mid t)$, where $s$ is a translation source sentence, $t$ is a translation target sentence, and $S$ is the alignment set [2] (Brown et al., 1993) that includes the best alignment, the neighboring alignments, and the pegged alignments.

[2] Note that the definition of $S$ changes depending on the TM-training algorithms used.

[Figure 5: The method to train multiple pairs of ...: the parallel corpus is split by 10-fold cross validation into $C_0, \ldots, C_9$, and a pair ($TM_i$, $LM_i$) is trained on each $C_i$.]

The proposed method scores each output in k ways. In the example, as shown in Figure 6, the proposed method scores translation $t_a$ from translation system $MT_a$ as $L_0(t_a), \ldots, L_9(t_a)$ when the translations are scored by using a language model of the translation target language. On the other hand, the proposed method scores translation $t_a$ as $T_0(s, t_a), \ldots, T_9(s, t_a)$ when the translations are scored by using a translation model. Finally, the proposed method scores translation $t_a$ as $T_0(s, t_a) \cdot L_0(t_a), \ldots, T_9(s, t_a) \cdot L_9(t_a)$ when the translations are scored by using the products of the scores of a language model and a translation model. Translations $t_b$ and $t_c$ are scored in the same way.

Then the proposed method compares the means of the scores. In the example (Figure 6), the means $\sum_{i=0}^{9} L_i(t_j)/10$ ($j = a, b, c$) are compared when the translations are scored by using a language model of the translation target language. The means $\sum_{i=0}^{9} T_i(s, t_j)/10$ are compared when the translations are scored by using a translation model. The means $\sum_{i=0}^{9} T_i(s, t_j) \cdot L_i(t_j)/10$ are compared when the translations are scored by using the products of the scores of a language model and a translation model.

[Figure 6: Preparation for multiple comparison. Each pair $(s, t_a)$ is scored by every $LM_i$ and $TM_i$. Q: Which pair of averages of scores is significantly different? A: Apply the multiple comparison test based on the Kruskal-Wallis test.]

The proposed method checks whether the highest mean is significantly different from the others by using a multiple comparison test [3] with the Kruskal-Wallis test [4], which is known as a Tukey-Kramer-type modification of the Dunn test (Hochberg and Tamhane, 1983). If the highest mean is significantly different, the proposed method selects the translation with the highest score. If not, the proposed method selects, from among the translations whose scores are not significantly different from the highest score, the translation from the MT system whose performance is the best.
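The selection rule of the first method can be sketched in Python as below. This is a simplified stand-in, not the paper's exact test: it compares mean ranks with a Dunn-style normal approximation and a Bonferroni correction instead of the Tukey-Kramer-type modification, and it applies no tie correction. The function names, the `system_priority` argument (systems listed best-first), and the toy scores are all assumptions.

```python
from statistics import NormalDist, mean


def mean_ranks(groups):
    """Rank all scores jointly (average ranks for ties); return mean rank per group."""
    pooled = sorted((score, name) for name, scores in groups.items() for score in scores)
    rank_sum = {name: 0.0 for name in groups}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for _, name in pooled[i:j]:
            rank_sum[name] += avg_rank
        i = j
    return {name: rank_sum[name] / len(groups[name]) for name in groups}


def significantly_different(groups, a, b, alpha=0.05):
    """Dunn-style pairwise test on mean ranks with a Bonferroni correction."""
    n_total = sum(len(scores) for scores in groups.values())
    ranks = mean_ranks(groups)
    se = (n_total * (n_total + 1) / 12.0
          * (1.0 / len(groups[a]) + 1.0 / len(groups[b]))) ** 0.5
    z = abs(ranks[a] - ranks[b]) / se
    p = 2.0 * (1.0 - NormalDist().cdf(z))
    n_pairs = len(groups) * (len(groups) - 1) // 2
    return min(1.0, p * n_pairs) < alpha


def select_translation(groups, system_priority, alpha=0.05):
    """`groups` maps each system to the k scores of its output."""
    best = max(groups, key=lambda name: mean(groups[name]))
    tied = [name for name in groups
            if name != best and not significantly_different(groups, best, name, alpha)]
    if not tied:
        return best  # the highest mean is significantly different from all others
    # Otherwise fall back to the best-performing system among the tied ones.
    return min([best] + tied, key=system_priority.index)


priority = ["D3", "TDMT", "SMT"]  # hypothetical performance order, best first
clear = {"SMT": [-10 - i for i in range(10)],
         "D3": [-20 - i for i in range(10)],
         "TDMT": [-30 - i for i in range(10)]}
close = {"SMT": [-10 - 2 * i for i in range(10)],
         "D3": [-11 - 2 * i for i in range(10)],
         "TDMT": [-40 - i for i in range(10)]}
print(select_translation(clear, priority))
print(select_translation(close, priority))
```

In the first toy case the scores of SMT are clearly separated, so its output is kept; in the second case SMT and D3 are statistically indistinguishable, so the rule falls back to the system listed first in `system_priority`.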

2.2 Proposed Method (2)

To solve the second problem described in Section 1, this subsection proposes a selection system based on the conditional probability that a translation is not inferior to other translations when the translation, encoded by using the TM*LM-score, the TM-score, or the LM-score, satisfies some conditions. For each MT system, the conditional probability is learned as a regression tree from the vector-encoding of the translations labeled as "not inferior" or "inferior", which is the criterion variable.

In order to learn the conditional probability mentioned above, translations from the component translation systems, $MT_a$, $MT_b$, and $MT_c$, are ranked by human evaluators in advance (Figure 7). Let $r_a$ denote the rank assigned to translation $t_a$ from MT system $MT_a$. Also, let $r_{best}$ denote the best rank among $r_a$, $r_b$, and $r_c$. $R_i(t_i)$ is equal to 1 if $r_i = r_{best}$; otherwise $R_i(t_i)$ is equal to 0. Therefore, when $R_a(t_a)$ is equal to 1, the translation $t_a$ is superior to or as good as the other translations.

[3] It is well known that repeating a simple t-test multiple times increases the chance of incorrectly finding a significant difference. Multiple comparison is designed to avoid such a phenomenon.

[4] The Kruskal-Wallis test is a non-parametric one-way Analysis of Variance (ANOVA). This test does not assume the data distribution.

[Figure 7: Preparation for regression tree. Each triple $(s, t_a, r_a)$, where $r_a$ is the rank assigned to $t_a$ by a human evaluator, is encoded with the TM and LM trained on $C$ as $(T(s, t_a) \cdot L(t_a), T(s, t_a), L(t_a), R_a(t_a))$ and passed to an RT learner with pruning or shrinking.]
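The labeling rule for $R_i(t_i)$ is short enough to state directly in code. The sketch below assumes ranks are given as the letters A to D; the function name and system names are illustrative.

```python
def not_inferior_labels(ranks):
    """Map {system: rank} to {system: R}, with R = 1 if rank equals the best rank.

    Ranks are the letters "A" (best) to "D" (worst), so min() on the
    letters yields the best rank r_best.
    """
    r_best = min(ranks.values())
    return {system: int(rank == r_best) for system, rank in ranks.items()}


# Two systems share the best rank, so both are labeled "not inferior".
print(not_inferior_labels({"MTa": "B", "MTb": "B", "MTc": "D"}))
```

Because ties receive the label 1, several systems can be "not inferior" for the same source sentence.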

The proposed method trains a translation model TM on the full parallel corpus $C$ and a language model LM on the translation-target-language part of $C$ (Figure 7).

Hereafter, let $L(t)$ denote the tri-gram statistics of a translation $t$ computed by using LM. Also, let $T(s, t)$ denote $\sum_{a \in S} P(s, a \mid t)$, where $s$ is a translation source sentence, $t$ is a translation target sentence, and $S$ is the alignment set (Brown et al., 1993).

Next, the proposed method encodes three vectors $(s, t_a, r_a)$, $(s, t_b, r_b)$, and $(s, t_c, r_c)$ (Figure 7) into three score-vectors with non-inferiority or inferiority, respectively: $(T(s, t_i) \cdot L(t_i), T(s, t_i), L(t_i), R_i(t_i))$ for $i = a, b, c$. For each $MT_i$ ($i = a$, $b$, or $c$), the proposed method learns from the set $\{(T(s, t_i) \cdot L(t_i), T(s, t_i), L(t_i), R_i(t_i))\}$ the conditional probability, which is expressed by the regression tree (Chambers and Hastie, 1992; Breiman et al., 1984) $RT_i$ (Figure 7). A regression tree (RT) learner is known as recursive binary partitioning. In growing a tree, an RT learner recursively splits the training data in each node so as to reduce variance within partitions as much as possible.
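One step of recursive binary partitioning can be sketched as follows: for a single score feature, try every threshold and keep the one that minimizes the summed within-partition squared deviation of the 0/1 labels. This is a minimal illustration of the variance-reduction criterion, not the RT learner used in the experiments; the function names and data are assumptions.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean (within-partition variance times n)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, cost) of the best binary split on one feature.

    `xs` are feature values (e.g. log TM*LM-scores) and `ys` are the
    0/1 "not inferior" labels. Returns None if no split is possible.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sum_of_squares(left) + sum_of_squares(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical log-scores and labels: high scores tend to be "not inferior".
xs = [-30, -25, -20, -12, -10, -8]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # (-16.0, 0.0)
```

A full learner would apply this search over all three score features and recurse into each partition; the mean label in a leaf then serves as the predicted conditional probability.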

In general, the learned RT over-fits the training data. As post-processing, the learned RT is simplified by using one of two procedures: pruning and shrinking (Chambers and Hastie, 1992; Breiman et al., 1984). Pruning successively snips off the least important splits. The importance of a rooted subtree is determined by the cost-complexity measure, $D_k(T') = D(T') + k \cdot \mathrm{size}(T')$, where $D(T')$ denotes the deviance of the subtree $T'$, $\mathrm{size}(T')$ is the number of terminal nodes of $T'$, and $k$ is a cost-complexity parameter. Shrinking reduces the number of effective nodes by shrinking the fitted value of each node towards its parent node. Shrunken fitted values, for a shrinking parameter $k$, are computed according to the recursion

$$\hat{y}(node) = k \cdot \tilde{y}(node) + (1 - k) \cdot \hat{y}(parent),$$

where $\tilde{y}(node)$ denotes the usual fitted value for a node, $\hat{y}(parent)$ is the shrunken fitted value for the node's parent, and $k$ is a shrinking parameter such that $0 < k < 1$. The parameter $k$ in each of the two procedures is fixed so as to minimize cross-validation estimates of the deviance. Therefore, after growing each $RT_i$ ($i = a$, $b$, or $c$), the proposed method performs one of the simplification procedures, pruning or shrinking.

[Figure 8: The method to select a translation by ... Q: Which translation is selected? A: The translation with the maximal score.]
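The shrinking recursion can be traced on a small tree with the Python sketch below. The nested-dictionary tree representation is hypothetical, and the sketch assumes that the root, which has no parent, keeps its usual fitted value.

```python
def shrink(node, k, parent_value=None):
    """Return {node name: shrunken fitted value} for the subtree rooted at `node`.

    Applies y_hat(node) = k * y_fit(node) + (1 - k) * y_hat(parent).
    Assumption: the root keeps its usual fitted value.
    """
    if parent_value is None:
        shrunken = node["fitted"]
    else:
        shrunken = k * node["fitted"] + (1 - k) * parent_value
    result = {node["name"]: shrunken}
    for child in node.get("children", []):
        result.update(shrink(child, k, shrunken))
    return result


# Hypothetical regression tree; "fitted" is the usual fitted value of a node.
tree = {
    "name": "root", "fitted": 0.5,
    "children": [
        {"name": "left", "fitted": 0.2,
         "children": [{"name": "left.left", "fitted": 0.0},
                      {"name": "left.right", "fitted": 0.4}]},
        {"name": "right", "fitted": 0.8},
    ],
}

for name, value in shrink(tree, k=0.5).items():
    print(name, round(value, 3))
```

With $k = 0.5$ every node moves halfway towards its parent's shrunken value, so deep leaves with extreme fitted values are pulled towards the overall mean; a $k$ close to 1 leaves the tree almost unchanged.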

In the selection phase, the proposed method encodes three pairs of a source sentence and its translation, $(s, t_a)$, $(s, t_b)$, and $(s, t_c)$, into three vectors $(T(s, t_i) \cdot L(t_i), T(s, t_i), L(t_i))$ for $i = a, b, c$. The proposed method predicts the conditional probability that each $t_i$ ($i = a$, $b$, or $c$) is not inferior to the others by using $RT_i$ and selects [5] the translation with the highest conditional probability.
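The selection phase can be sketched in Python as follows. The regression trees are replaced by hand-written threshold functions, the scores are assumed to be log-probabilities (so the product becomes a sum), and all names and numbers are hypothetical.

```python
def encode(log_tm, log_lm):
    """Encode one (source, translation) pair as (TM*LM, TM, LM) in the log domain."""
    return (log_tm + log_lm, log_tm, log_lm)


def select_by_regression_trees(scores, trees):
    """Pick the system whose tree predicts the highest "not inferior" probability.

    `scores` maps system -> (log_tm, log_lm); `trees` maps system -> a
    callable taking the encoded vector and returning a probability.
    """
    probabilities = {
        system: trees[system](encode(*scores[system])) for system in scores
    }
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


# Stand-in "trees": one threshold test each, for illustration only.
trees = {
    "MTa": lambda v: 0.7 if v[0] > -20 else 0.2,
    "MTb": lambda v: 0.6 if v[1] > -11 else 0.3,
    "MTc": lambda v: 0.4 if v[2] > -9 else 0.1,
}
scores = {
    "MTa": (-12.0, -9.5),
    "MTb": (-10.0, -11.0),
    "MTc": (-11.5, -8.0),
}

best, probabilities = select_by_regression_trees(scores, trees)
print(best, probabilities)
```

Because each system has its own tree, a raw score that would win a direct comparison can still lose here if that system's scores are known to be unreliable indicators of quality.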

3 Experimental Comparison

3.1 Experimental Method

The authors evaluated the proposed methods in order to answer the following question: Which selection system improves performance best in comparison with that of the best MT system, i.e., the MT system with the highest performance as shown in Figures 2 and 3?

In order to answer the above question, the authors used a set of three J-E component MT systems (TDMT, D3, and SMT) and a set of three E-J component MT systems (TDMT, HPAT, and SMT). Bilingual English and Japanese data were taken from the ATR broad-coverage bilingual basic expression (BE) corpus (Takezawa et al., 2002), which is split into three parts: a training set of 125,537 sentence pairs, a verification set of 9,872 pairs, and a test set of 10,023 pairs.

The full corpus $C$ used in training the translation target language model and the translation model is the training set. Ten subsets of the full corpus were [...] The translation model and language model are learned by using GIZA++ (Och and Ney, 2000) and the CMU-Cambridge Toolkit (Clarkson and Rosenfeld, 1997), respectively. The translation model is learned from IBM 1 to 4, including the HMM model, as suggested by Och and Ney (2000), and its training loop was terminated when the perplexity for the validation set indicated the lowest scores. The word classes used in TM learning are the part-of-speech (POS) classes in TDMT. The P-value used for the multiple comparison test is 0.05.

[5] Preparing multiple RTs for each component MT system enables the second method to be extended so as to select the best output according to the multiple [...]

Four sets of about five hundred pairs of English and Japanese sentences were randomly selected from the test set. The English sentences in the four sets were translated by the E-J component MT systems and ranked by a native speaker of Japanese; likewise, the Japanese sentences in the four sets were translated by the J-E component MT systems and ranked by a native speaker of English. Each performance was calculated as the average of the performance over the four sets. In particular, the performance of the second proposed method is calculated according to four-fold cross validation (Mitchell, 1997).

3.2 Experimental Results

In order to evaluate the point mentioned at the beginning of Section 3.1, the authors compared the performance of each selection system with that of the best MT system. As shown in Figure 2, among the J-E component MT systems, D3 had the best performance for Rank A, and TDMT had the best performance for both Rank A+B (equal to or better than B) and Rank A+B+C (equal to or better than C). As shown in Figure 3, among the E-J component MT systems, TDMT had the best performance for Rank A, Rank A+B, and Rank A+B+C. Figures 9, 10, and 11 show the results of the comparisons. The vertical axis in each figure shows the difference in the performances.

Each bar corresponds to a selection system. The first three bars in left-to-right order correspond to TM*LM-score-based selection, TM-score-based selection, and LM-score-based selection, which were used in the preliminary experiment described in Section 1. The next three bars correspond to the first proposed method based on TM*LM-score, TM-score, and LM-score. The next three bars correspond to the second proposed method in which predic[...] both TM*LM-score and TM-score, to all scores: TM*LM-score, TM-score, and LM-score. In these selection methods, the regression trees are simplified by using the shrinking procedure. The last three bars also correspond to the second proposed method, but in these selection methods, the regression trees are simplified by using the pruning procedure. Accuracy means the percentage of trials in which the output assigned the highest rank was correctly selected.

Figure 9 shows that the first proposed system based on TM*LM-score achieved the greatest improvement, of a little under 6%, in the performance for Rank A. On the other hand, the existing selection system simply using the LM-score (language model of the translation target) could not improve and even degraded performance for Rank A.

Figure 10 shows that the second proposed system with the pruning procedure based on both TM*LM-score and TM-score (marked RT12+PRN in the graph) achieved the greatest improvement, of about 5%, for Rank A+B (equal to or higher than B). On the other hand, Figure 10 shows that the existing selection system [...] performance for Rank A+B, with a degradation of 6%.

[Figure 9: Difference in performance between each selection system and D3. Vertical axis: difference in performance (%). Bars for Ranks A, A+B, A+B+C, and Accuracy are shown for each of LM*TM, TM, LM, LM*TM+KWMC, TM+KWMC, LM+KWMC, RT1+SHR, RT12+SHR, RT123+SHR, RT1+PRN, RT12+PRN, and RT123+PRN.]

[Figure 10: Vertical axis: difference in performance (%). Bars for Ranks A, A+B, A+B+C, and Accuracy are shown for each of LM*TM, TM, LM, LM*TM+KWMC, TM+KWMC, LM+KWMC, RT1+SHR, RT12+SHR, RT123+SHR, RT1+PRN, RT12+PRN, and RT123+PRN.]

Figure 11 shows that the second proposed systems, with either the pruning procedure or the shrinking procedure, based on only TM*LM-score, on both TM*LM-score and TM-score, or on all scores, achieved an improvement of about 2% for Rank A in all cases. Figure 11 also shows that the second proposed system with the shrinking procedure based on all scores (marked RT123+SHR in the graph) achieved an improvement of a little more than 2% for Rank A+B.

4 Conclusions

This paper addressed the challenging problem of automatically selecting the best among outputs from multiple MT systems to improve translation quality. This paper proposed two methods. The first method is based on a multiple comparison test based on the Kruskal-Wallis test and checks whether the highest score from the language model, the translation model, or both models combined is significantly different from the others. The second method is based on the conditional probability that a translation is not inferior to the others when the translation satisfies some conditions. The conditional probability is predicted by a regression tree learned from the above scores. The proposed methods were evaluated using an ATR travel corpus. Experimental results showed that the performance of the proposed methods is much better than that of the existing methods, achieving an improvement of 2 to 6% in performance.

Acknowledgment

This research was supported in part by the Telecommunications Advancement Organization of Japan.

[Figure 11: Difference in performance between ... Vertical axis: difference in performance (%). Bars for Ranks A, A+B, A+B+C, and Accuracy are shown for each of LM*TM, TM, LM, LM*TM+KWMC, TM+KWMC, LM+KWMC, RT1+SHR, RT12+SHR, RT123+SHR, RT1+PRN, RT12+PRN, and RT123+PRN.]

References

Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification and Regression Trees. Chapman and Hall, California, USA.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Chris Callison-Burch and Raymond S. Flournoy. 2001. A program for automatically selecting the best output from multiple machine translation engines. In Proceedings of MT Summit VIII, pages 63-66.

John M. Chambers and Trevor J. Hastie. 1992. Statistical Models in S. Wadsworth & Brooks/Cole Advanced Books & Software, A Division of Wadsworth, Inc., California, USA.

Philip Clarkson and Ronald Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings of the European Conference on Speech Communication and Technology: EUROSPEECH-97, pages 2707-2710.

Osamu Furuse and Hitoshi Iida. 1996. Incremental translation utilizing constituent boundary patterns. In Proceedings of the 16th International Conference on Computational Linguistics: COLING-96, pages 412-417.

Y. Hochberg and A. C. Tamhane. 1983. Multiple Comparison Procedures. Wiley, New York, USA.

Kenji Imamura. 2002. Application of translation knowledge acquired by hierarchical phrase alignment for pattern-based MT. In Proceedings of the 9th Conference on Theoretical and Methodological Issues in Machine Translation, pages 74-84.

Satoshi Kaki, Setsuo Yamada, and Eiichiro Sumita. 1999. Scoring multiple translations using character n-gram. In Proceedings of the 5th Natural Language Processing Pacific Rim Symposium: NLPRS-99, pages 298-302.

Tom M. Mitchell. 1997. Machine Learning. The McGraw-Hill Companies Inc., New York, USA.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics: ACL00, pages 440-447.

Eiichiro Sumita. 2001. Example-based machine translation using DP-matching between word sequences. In Proceedings of the ACL 2001 Workshop on Data-Driven Methods in Machine Translation: DDMT-2001, pages 1-8.

Toshiyuki Takezawa, Eiichiro Sumita, Fumiaki Sugaya, Hirofumi Yamamoto, and Seiichi Yamamoto. 2002. Toward a broad-coverage bilingual corpus for speech translation of travel conversations in the real world. In Proceedings of Third International Conference on Language Resources and Evaluation: LREC2002, pages 147-152.

Taro Watanabe, Kenji Imamura, and Eiichiro Sumita. 2002. Statistical machine translation system based on hierarchical phrase alignment. In Proceedings of the 9th Conference on Theoretical and Methodological