A Stochastic Parser Based on a Structural Word Prediction Model

(1)

Shinsuke MORI, Masafumi NISHIMURA, Nobuyasu ITOH,

Shiho OGINO, Hideo WATANABE

IBM Research,Tokyo ResearchLab oratory, IBM Japan,Ltd.

1623-14 ShimotsurumaYamatoshi,242-8502, Japan

[email protected]

Abstract

In this pap er, we present a sto chastic language

mo del using dep endency. This mo del considers a

sentence asawordsequenceandpredictseach word

fromleft to right. Thehistory at each step of

pre-dictionis asequence of partialparse trees covering

the preceding words. First our mo del predicts the

partialparsetreeswhichhaveadep endencyrelation

with thenext word amongthem and then predicts

thenext wordfromonlythetrees whichhave a

de-p endencyrelationwiththenextword. Ourmo delis

agenerativesto chasticmo del,thusthiscanb e used

not only as a parser but also as a languagemo del

of asp eech recognizer. Inour exp eriment, we

pre-paredab out1,000syntacticallyannotatedJapanese

sentences extracted fromanancialnewspap er and

estimatedthe parametersof ourmo del. We builta

parserbased onourmo delandtestediton

approx-imately100sentences of thesamenewspap er. The

accuracyofthedep endencyrelationwas89.9%,the

highestaccuracylevelobtainedbyJapanesesto

chas-ticparsers.

1 Introduction

The sto chastic language mo deling, imp orted from

thesp eech recognitionarea,isone ofthe successful

metho dologies of natural language pro cessing. In

fact,alllanguagemo delsforsp eech recognition are,

as far as we know, based onan n-grammo del and

mostpracticalpart-of-sp eech(POS)taggersarealso

basedonawordorPOSn-grammo delorits

exten-sion (Church, 1988;Cutting et al.,1992; Merialdo,

1994; Dermatas and Kokkinakis, 1995). POS

tag-ging is the rst step of natural language pro

cess-ing,andsto chastictaggershavesolvedthisproblem

withsatisfyingaccuracyformanyapplications. The

next step is parsing, or that is to say discovering

the structure of a given sentence. Recently, many

parsers based onthesto chasticapproachhaveb een

prop osed. Although their rep orted accuracies are

high,they arenotaccurate enough formany

appli-cationsatthisstage,andmoreattemptshavetob e

madetoimprovethemfurther.

parse the sp oken text recognized by a sp eech

rec-ognizer. This attempt is clearly aiming at sp oken

languageunderstanding. Ifweconsiderhowto

com-bineaparser andasp eech recognizer,it isb etterif

theparserisbasedonagenerativesto chasticmo del,

asrequired for thelanguagemo delofasp eech

rec-ognizer. Here, \generative" meansthat the sumof

probabilitiesover all p ossible sentences is equal to

orless than 1. Ifthelanguagemo delisgenerative,

it allowsa seamless combination of the parser and

thesp eech recognizer. This means that the sp eech

recognizer hasthesto chastic parseras its language

mo del and b enets richer information than a

nor-maln-gram mo del. Even though such a

combina-tionisnotp ossibleinpractices,therecognizer

out-puts N-b est sentences with theirprobabilities, and

theparser, takingthem asinput,parsesallofthem

and outputs the sentence with its parse tree that

hasthehighest probabilityofall p ossible

combina-tions. As a result, a parser based on agenerative

sto chastic language mo del may help a sp eech

rec-ognizer to select the most syntactically reasonable

sentence amongcandidates. Therefore, it isb etter

ifthelanguagemo delof aparserisgenerative.

In thispap er, taking Japanese as theobject

lan-guage,we prop ose agenerative sto chastic language

mo delandaparserbasedonit. Thismo deltreatsa

sentence asawordsequence andpredictseachword

fromlefttoright. Thehistoryateachstepof

predic-tionisasequence ofpartialparsetrees coveringthe

precedingwords. Topredictaword,ourmo delrst

predictswhichofthepartialparsetreesatthisstage

have dep endency relation withthe word, and then

predicts the word from the selected partial parse

trees. In Japanese each worddep ends on a

subse-quentword,thatistosay,eachdep endencyrelation

isleftto right,it isnotnecessary to predictthe

di-rectionofeachdep endency relation. Soinorder to

extendourmo deltootherlanguages,themo delmay

havetopredictthedirectionofeachdep endency. We

builtaparser based onthis mo del, whose

parame-ters are estimated from1,072sentences ina

(2)

anyJapanesesto chastic parsers.

2 Stochastic Language Model based

on Dependency

In this section, we prop ose a sto chastic language

mo delbased on dep endency. Unlike most sto

chas-ticlanguagemo delsfor aparser, ourmo delis

the-oretically based onahidden Markovmo del. Inour

mo delasentenceispredictedwordbywordfromleft

to right and thestate at each step of prediction is

basicallyasequence ofwordswhosemo dicandhas

not app eared yet. According to apsycholinguistic

rep ortonlanguagestructure (Yngve,1960),thereis

an upp er limiton the numb er of the words whose

mo dicandshavenotapp earedyet. Thislimitis

de-terminedbythenumb erofslotsinshort-term

mem-ory, 72(Miller, 1956). With this limitation,we

candesignaparserbasedonanitestatemo del.

2.1 SentenceMo del

Thebasicideaofourmo delisthateachwordwould

b e b etter predicted fromthe words that have a

de-p endency relation with the word to b e predicted

thanfromtheprecedingtwowords(tri-grammo del).

Let us consider the complete structure of the

sen-tence inFigure1andahyp otheticalstructure after

the prediction of the fth word at the top of

Fig-ure2. Inthishyp otheticalstructure,therearethree

trees: one ro ot-only tree (t

b

comp osed of w

3 ) and

twotwo-no detrees(t

a

b e predicted fromthese twotrees. Fromthisp oint

ofview,ourmo delrstpredictsthetrees dep ending

on the next wordand then predicts the next word

fromthesetrees.

Now,letusmakethefollowingdenitionsinorder

toexplainourmo delformally.

w=w

: asequence ofwords. Herea

wordisdenedasapairconsistingofastringof

alphab eticcharacters andapart ofsp eech(e.g.

the/DT).

: a sequence of partial parse

trees coveringthei-prexwords (w

1

: subsequences of t

i

having and

nothavingadep endencyrelationwiththenext

wordresp ectively. InJapanese,likemanyother

languages, no two dep endency relations cross

eachother; thus t

i

sequence ofall subtreesconnected to thero ot.

After w

i+1

has b een predicted from the trees

dep ending on it (t +

i

), there are trees

remain-ing(t

he

that

ano

blue

ao

ending

i

wa

he

that

ano

blue

ao

ending

i

wa

he

that

ano

blue

ao

ending

i

wa

subj

apple

ringo

w

1

w

2

w

3

w

4

w

5

w

6

Figure2: Wordpredictionfromapartialparse

y

max

: upp er limiton the numb er numb er of

words whose mo dicands have not app eared

yet.

Under these denitions, our sto chastic language

mo delisdened asfollows:

P(w ) =

is all p ossible binary trees with n no des.

Here,therstfactor,(P(w

i jt

+

i 1

)),iscalledtheword

predictionmo delandthesecond,(P(t +

i 1 jt

i 1 )),the

state prediction mo del. Let us consider Figure 2

again. Atthe topisthestate just after the

predic-tionof the fth word. The state predictionmo del

then predicts the partial parse trees dep ending on

thenextwordamongallpartialparsetrees,asshown

in the second gure. Finally, the word prediction

mo delpredictsthenextwordfromthepartialparse

treesdep endingonit.

As describ ed ab ove,there mayb e anupp er limit

onthenumb erofwordswhosemo dicandshavenot

yet app eared. Toput it inanotherway,thelength

ofthesequence ofpartialparsetrees (t

i

(3)

kare

he

that

ano

blue

ao

wo

obj

tabe

eat

i

Figure1: Dep endency structure ofasentence.

Therefore,ifthedepthofthepartialparsetreeisalso

limited,thenumb erofp ossiblestatesislimited.

Un-derthisconstraint,ourmo delcan b e consideredas

ahiddenMarkovmo del. InahiddenMarkovmo del,

therst factor iscalled theoutput probability and

thesecond,thetransitionprobability.

Sinceweassumethatnotwodep endencyrelations

cross each other, the state prediction mo del only

has to predict the numb er of the trees dep ending

on the next word. Thus P(t

whereyis thenumb eroftreesinthesequence t +

i 1 .

Accordingto theab ove assumption,thelast y

par-tialparse trees dep end onthei-th word. Since the

numb erof p ossible parse trees for awordsequence

growsexp onentiallywith the numb er ofthe words,

the space of the sequence of partial parse trees is

huge even if the length of the sequence is limited.

This inevitably causes a data-sparseness problem.

To avoid this problem, we limited the numb er of

levelsof no desused todistinguishtrees. Inour

ex-p eriment,onlythero otanditschildrenarechecked

toidentifyapartialparsetree. Hereafter,we

repre-sentP

LL

todenotethismo del,inwhich thelexicon

oftherstlevelandthatofthesecondlevelare

con-sidered. Thus,inourexp erimenteachwordandthe

numb er of partial parse trees dep ending on it are

predicted by asequence of partialparse trees that

takeaccountoftheno deswhosedepthistwoorless.

It isworth notingthat ifthe dep endencystructure

ofasentence islinear|thatistosay,ifeach word

dep ends on the next word, | then our mo delwill

b e equivalenttoawordtri-grammo del.

Weintro duce aninterp olation technique(Jelinek

etal.,1991)intoourmo dellikethoseusedinn-gram

mo dels. Bylo oseningtreeidenticationregulations,

we obtain a more general mo del. For example, if

we check only the POS of the ro ot and the POS

of its children, we willobtain amo del similarto a

POS tri-gram mo del (denoted P

PP

hereafter). If

we check thelexicon ofthero ot, but notthat ofits

children,themo delwillb elikeawordbi-grammo del

(denoted P

NL

hereafter). Asa smo othingmetho d,

wecaninterp olatethemo delP

LL

,similartoaword

tri-grammo del,withamoregeneralmo del,P

PP or

P

NL

. In our exp eriment, as the following formula

generalizationlevels:

isthe checklevelof therstlevel

ofthe tree (N: none, P: POS, L:lexicon) and Y is

thatofthesecondlevel,andP

w ;0 gram

istheuniform

distributionoverthevo cabularyW(P

w ;0 gram (w )=

1=jWj).

Thestate predictionmo del(P(y jt

i 1

))isalso

in-terp olatedinthe sameway. Inthis case,thep

ossi-bleeventsarey =1;2;:::;y

2.2 ParameterEstimation

Sinceourmo delisahiddenMarkovmo del,the

pa-rameters of a mo del can b e estimated from a row

corpus by EMalgorithm (Baum, 1972). With this

algorithm,the probability of the row corpus is

ex-p ectedtob emaximizedregardlessofthestructureof

eachsentence. Sotheobtainedmo delis notalways

appropriateforaparser.

In order to develop a mo del appropriate for a

parser,itisb etterthattheparametersareestimated

from a syntactically annotated corpus by a

maxi-mumlikeliho o destimation(MLE)(Merialdo,1994)

asfollows:

wheref(x)representsthefrequencyofaneventxin

thetrainingcorpus.

The interp olation co eÆcients in the formula (2)

are estimated by the deleted interp olation metho d

(Jelineketal.,1991).

2.3 SelectingWords to b eLexicalized

(4)

frequentwordsmayb eharmfulb ecauseitmaycause

a data-sparseness problem. In a practical tagger

(Kupiec,1989),onlythemostfrequent100wordsare

lexicalized. Also,inastate-of-the-artEnglishparser

(Collins,1997)onlythewordsthato ccurmorethan

4timesintrainingdataarelexicalized.

Forthisreason,ourparserselectsthewordstob e

lexicalized at the time of learning. In the

lexical-izedmo delsdescrib ed ab ove (P

LL , P

PL andP

NL ),

only the selected words are lexicalized. The

selec-tion criterion isparsing accuracy (see section4) of

aheld-out corpus, asmallpart ofthe learning

cor-pusexcludedfromparameterestimation. Thusonly

thewordsthatarepredictedtoimprovetheparsing

accuracy ofthe testcorpus, or unknown input, are

lexicalized. Thealgorithmis as follows(see Figure

3):

1. Intheinitialstate allwordsare intheclassof

theirPOS.

2. Allwordsaresortedindescendingorderoftheir

frequency,andthefollowingpro cessisexecuted

foreachwordinthisorder:

(a) The word is lexicalized provisionally and

theaccuracyof theheld-outcorpus is

cal-culated.

(b) Ifanimprovementisobserved,thewordis

lexicalizeddenitively.

Theresultofthislexicalizationalgorithmisusedto

identifyapartialparsetree. Thatistosay,only

lex-icalized words aredistinguishedinlexicalized mo

d-els. Ifnowordswereselected tob elexicalized,then

P

NL = P

NP and P

LL = P

PL =P

PP

. It is worth

noting that if we try to join a word with another

word,thenthisalgorithmwillb eanormaltop-down

clusteringalgorithm.

2.4 UnknownWordMo del

Toenable oursto chastic languagemo delto handle

unknownwords,we addedanunknownwordmo del

based on a character bi-gram mo del. If the next

word is not in the vo cabulary, the mo del predicts

itsPOS andthe unknownwordmo delpredicts the

stringofthewordasfollows:

P(w jPO S)= m+1

Y

i=1 P

PO S (x

i jx

i 1 )

wherew=x

1 x

2 x

m ; x

0 =x

m+1 =BT :

BT, a sp ecial character corresp onding to a word

b oundary,isintro ducedsothatthesumofthe

prob-abilityoverallstrings isequalto1.

In the parameter estimation describ ed ab ove, a

learningcorpus is divided into k parts. In our

ex-1

w

lexicalize

yes

no

yes

2

w

POS

2

w

6

2

w

3

POS

2

w

6

1

w

1

w

2

w

3

POS

2

w

6

1

POS

w

4

w

5

w

2

w

3

POS

2

w

6

3

w

4

1

POS

w

5

w

4

1

POS

w

5

yes

1

w

4

1

POS

w

5

Figure 3: A conceptual gure of the lexicalization

algorithm.

o ccurring inmore than one partial corpus and the

other words are used for parameter estimation of

theunknownwordmo del. Thebi-gramprobability

ofthe unknownwordmo delof aPOS is estimated

fromthe words amongthem and b elonging to the

POSasfollows:

P

POS (x

i jx

i 1 )

MLE

= f

POS (x

i ;x

i 1 )

f

POS (x

i 1 )

:

The character bi-gram mo del is also interp olated

with a uni-gram mo del and a zero-gram mo del.

Theinterp olation co eÆcients are estimated by the

deletedinterp olationmetho d(Jelinek etal.,1991).

3 Syntactic Analysis

Generally,aparser mayb e consideredas amo dule

thatreceives asequence of words annotatedwitha

POS and outputs its structure. Our parser, which

includesasto chasticunknownwordmo del,however,

isabletoacceptacharactersequenceasaninputand

execute segmentation, POS tagging, and syntactic

analysissimultaneously 1

. Inthissection,weexplain

our parser, which is based on the languagemo del

describ ed intheprecedingsection.

3.1 Sto chastic SyntacticAnalyzer

Asyntacticanalyzer,basedonasto chasticlanguage

mo del,calculates theparsetree (see Figure1) with

thehighestprobabilityforagivensequenceof

char-actersxaccordingtothefollowingformula:

^

T = argmax

w (T)=x

P(Tjx)

(5)

w(T)=x

= argmax

w(T)=x

P(xjT)P(T) (

Bayes'formula)

= argmax

w(T)=x

P(T) (

P(xjT)=1);

where w (T) represents the concatenation of the

wordstringinthesyntactictreeT. P(T)inthelast

line is a sto chastic language mo del. In ourparser,

itistheprobabilityofaparsetree T denedbythe

sto chasticdep endencymo delincludingtheunknown

wordmo deldescrib ed insection2.

P(T)= n

Y

i=1 P(w

i jt

+

i 1 )P(t

+

i 1 jt

i 1

); (3)

wherew

1 w

2 w

n

=w(T).

3.2 SolutionSearchAlgorithm

Asshowninformula(3),ourparserisbasedona

hid-denMarkovmo del. ItfollowsthatViterbialgorithm

isapplicabletosearchtheb estsolution. Viterbi

al-gorithmiscapableofcalculatingtheb estsolutionin

O (n) time,where nis thenumb er ofinput

charac-ters.

The parser rep eats a state transition, reading

charactersoftheinputsentence fromlefttoright. In

orderthatthestructureoftheinputsentencemayb e

atree,thenumb eroftreesofthenalstatet

n must

b e1andnomore.Amongthestatesthatsatisfythis

condition,theparserselectsthestatewiththe

high-estprobability. Sinceourlanguagemo deluses only

thero otanditschildrenofapartialparsetreeto

dis-tinguishstates,thelast state do es nothave enough

informationtoconstruct theparsetree. Theparser

can, however, calculate the parse tree fromthe

se-quenceofstates,orb oththewordsequence andthe

sequence of y , the numb er of trees that dep end on

the next word. Thus it memorizes these values at

each step of prediction. After the most probable

last state has b een selected, the parser constructs

theparse tree byreading these sequences fromtop

tob ottom.

4 Evaluation

We develop ed a POS-based mo del and its

lexical-izedversionexplainedinsection2to evaluatetheir

predictivep ower,andimplementedparsersbasedon

them that calculate the mostprobabledep endency

tree fromagiven character sequence, using the

so-lutionsearchalgorithmexplainedinsection3to

ob-servetheiraccuracy. Inthissection,wepresent and

discusstheexp erimentalresults.

4.1 Conditionson the Exp eriments

The corpus used inour exp erimentsconsists of

ar-Table1: Corpus.

#sentences #words #chars

learning 1,072 30,292 46,212

test 119 3,268 4,909

Keizai Shinbun). Each sentence in the articles is

segmentedintowords anditsdep endency structure

is annotated by linguists using an editor sp ecially

designed for this task at our site. The corpus was

dividedinto tenparts;theparameters ofthemo del

were estimated from nine of them and the mo del

was tested on therest (see Table1). A smallpart

of each leaning corpus is withheld fromparameter

estimationand used to select the words to b e

lex-icalized. After checking the learning corpus, the

maximumnumb erofpartialparsetrees isset to10

(y

max =10).

Toevaluatethepredictivep owerofourmo del,we

calculatedtheircross entropyonthetestcorpus. In

thispro cess,theannotatedtree inthetestcorpusis

usedasthestructureofthesentences. Thereforethe

probabilityofeachsentence inthetestcorpusisnot

thesummationover all itsp ossible derivations. To

compare the POS-based mo del and the lexicalized

mo del,weconstructed thesemo delsusing thesame

learning corpus and calculated their cross entropy

onthesametestcorpus. ThePOS-basedmo deland

thelexicalizedmo delhave thesameunknownword

mo del,thus its contributionto the cross entropy is

constant.

We implemented a parser based on the dep

en-dency mo dels. Since our mo dels, including a

character-based unknown word mo del, can return

theb est parse tree with its probability for any

in-put,we can buildaparser thatreceives acharacter

sequence as input. It is noteasy to evaluate,

how-ever, b ecause errors may o ccur in segmentation of

thesequence into words and in estimationof their

POSs. For thisreason,inthefollowingdescription,

we assumeawordsequence as theinput.

Thecriterionforaparseristheaccuracyofits

out-put dep endency relations. This criterion is widely

usedtoevaluateJapanesedep endency parsers. The

accuracyistheratioofthenumb erofthewords

an-notatedwiththesamedep endencytothenumb erof

thewordsasinthecorpus:

accur acy

=

#wordsdep endingonthecorrect word

#words

Thelastwordandthesecond-to-lastwordofa

sen-tence are excluded, b ecause there is no ambiguity.

(6)

Table2: Crossentorpyandaccuracyofeachmo del.

languagemo del

cross

entropy

accuracy

selectivelylexicalized 6.927 89.9%

completelylexicalized 6.651 87.1%

POS-based 7.000 87.5%

linearstructure

| 78.7%

*Eachworddep ends onthenextword.

4.2 Evaluation

Table2 shows thecross entropy and parsing

accu-racyofthebaseline,whereallwords dep end onthe

next word, the POS-based dep endency mo del and

two lexicalized dep endency mo dels. In the

selec-tivelylexicalized mo del,words to b elexicalized are

selected bythealgorithmdescrib ed insection2. In

thecompletelylexicalizedmo del,allwords are

lexi-calized. Thisresult attests exp erimentallythat the

parser based on theselectively lexicalized mo del is

the b est parser. As for predictive p ower, however,

thecompletelylexicalizedmo delhasthelowestcross

entropy. Thusthismo delisestimatedtob etheb est

language mo del for sp eech recognition. Although

there is nodirect relation b etween cross entropyof

the languagemo deland error rate of asp eech

rec-ognizer, ifwe consider asp oken languageparser, it

mayb e b etter to select the words to b e lexicalized

usingothercriterion.

We calculatedthe cross entropy and the parsing

accuracy of the mo del whose parameters are

esti-mated from 1/4, 1/16, and 1/64 of the learning

corpus. The relation b etween the learning corpus

sizeandthecross entropyortheparsingaccuracyis

showninFigure4. Thecross entropyhasastronger

tendency to decrease as the corpus size increases.

Asforaccuracy, thereisalsoatendency forparsers

tob ecomemoreaccurate asthesizeofthelearning

increases. The size of the corpus we have at this

stageisnotatalllarge. However,itsaccuracyisat

thetoplevelofJapaneseparsers,whichuse50,000

-190,000sentences. Therefore,weconclude that our

approachis quitepromising.

5 Related Works

Historically, structures of natural languages have

b een describ ed byacontext-free grammarand

am-biguities have b een resolved by parsers based on a

context-free grammar(Fujisakiet al.,1989). In

re-cent years, some attempts have b een made in the

areaofparsingbyanitestatemo del(Oazer,1999)

etc. Ourparserisalsobasedonanitestatemo del.

Unlike these mo dels, we fo cused on rep orts on a

limitonlanguagestructure caused by the capacity

0

20

40

60

80

100

0

1

₀

1

₀

2

1

₀

3

1

₀

4

1

₀

5

0

1

%

cross entropy

#characters in learning corpus

cross entropy

accuracy

parsing accuracy

20

0

16

12

8

4

84.0%

86.2%

88.1% 89.9%

Figure4: Relationb etween cross entropyand

pars-ingaccuracy.

mo delispsycholinguisticallymoreappropriate.

Recently,intheareaofparsersbasedonasto

chas-ticcontext-free grammar(SCFG),someresearchers

have p ointed out the imp ortance of the lexicon

and prop osed lexicalized mo dels (Charniak, 1997;

Collins,1997). In these pap ers, they rep orted

sig-nicant improvement of parsing accuracy. Taking

theserep ortsintoaccount, we intro ducedametho d

ofpartiallexicalizationandrep ortedsignicant

im-provement of parsing accuracy. Our lexicalization

metho dis also applicable to a SCFG-based parser

andimprovesitsparsingaccuracy.

The mo del we present in this pap er is a

genera-tive sto chastic languagemo del. Chelbaand Jelinek

(1998) presented a similar mo del. In their mo del,

each word is predicted from two right-most head

words regardless of dep endency relation b etween

thesehead words and theword. Eisner(1996)also

presentedasto chasticstructurallanguagemo del,in

which each word is predicted from its head word

and thenearest one. This mo delis very similarto

theparser presented byCollins (1996). The

great-est dierence b etween our mo delandthese mo dels

isinthatourmo delpredictsthenext wordfromthe

headwords,or partialparsetrees, dep endingonit.

Clearly,itis notalwaystworight-mostheadwords

thathave dep endency relationwith thenext word.

It followsthat our mo delis linguisticallymore

ap-propirate.

There have b een some attempts at sto chastic

Japanese parser (Haruno et al., 1998) (Fujio and

Matsumoto,1998) (Mori andNagao, 1998). These

Japaneseparsersarebasedonaunitcalledbunsetsu,

asequenceofoneormorecontentwordsfollowedby

zero or more function words. The parsers take a

(7)

can easily b e extended to other languages. As for

the accuracy, although a direct comparison is not

easy b etween our parser (89.9%; 1,072 sentences)

andtheseparsers (82%-85%;50,000-190,000

sen-tences) b ecause of the dierence of the units and

thecorpus, ourparser isone of thestate-of-the-art

parsers for Japanese language. It should b e noted

that our mo del describ es relations amongthree or

moreunits (case frame,consecutive dep endency

re-lations,etc.);thusourmo delb enetsagreaterdeal

fromincrease ofcorpussize.

6 Conclusion

In this pap er we have presented a sto chastic

lan-guage mo del based on dep endency structure. This

mo deltreatsasentenceasawordsequenceand

pre-dicts each word fromleft to right. The history at

eachstepofpredictionisasequence ofpartialparse

trees covering the preceding words. To predict a

word, ourmo delrstselects the partialparse trees

thathaveadep endencyrelationwiththeword,and

thenpredictsthenextwordfromtheselectedpartial

parsetrees. Wealsopresentedanalgorithmfor

lexi-calization. WebuiltparsersbasedonthePOS-based

mo delanditslexicalizedversion,whoseparameters

are estimated from 1,072 sentences of a nancial

newspap er. We testedtheparsers on119sentences

fromthesamenewspap er,whichwereexcludedfrom

thelearning. Theaccuracyof thedep endency

rela-tionofthelexicalizedparserwas89.9%,thehighest

obtainedbyanyJapanesesto chastic parser.

References

L. E. Baum. 1972. An inequality and asso ciated

maximization technique in statistical estimation

forprobabilisticfunctionsofMarkovpro cess.

In-equalities,3:1{8.

Eugene Charniak. 1997. Statistical parsing with a

context-freegrammarandwordstatistics. In

Pro-ceedings ofthe 14thNationalConferenceon

Arti-cialIntelligence,pages598{603.

CiprianChelbaandFredericJelinek. 1998.

Exploit-ingsyntacticstructure forlanguagemo deling. In

Proceedings of the 17th International Conference

onComputational Linguistics,pages225{231.

Kenneth Ward Church. 1988. A sto chastic parts

programand noun phraseparser forunrestricted

text. InProceedings of the Second Conferenceon

AppliedNaturalLanguage Processing,pages136{

143.

MichaelJohnCollins.1996. Anewstatisticalparser

basedonbigramlexicaldep endencies. InPro

ceed-ingsofthe34thAnnualMeetingoftheAssociation

forComputationalLinguistics,pages184{191.

MichaelCollins. 1997.Three generative,lexicalised

Computational Linguistics,pages16{23.

Doug Cutting, Julian Kupiec, Jan Pedersen, and

Penelop e Sibun. 1992. A practicalpart-of-sp eech

tagger. InProceedingsoftheThirdConferenceon

AppliedNaturalLanguage Processing,pages133{

140.

EvangelosDermatasandGeorge Kokkinakis. 1995.

Automaticsto chastic taggingofnaturallanguage

texts. ComputationalLinguistics,21(2):137{163.

Jason M. Eisner. 1996. Three new probabilistic

mo dels for dep endency parsing: An exploration.

In Proceedings of the 16th International

Confer-ence on Computational Linguistics, pages 340{

345.

Masakazu Fujio and Yuji Matsumoto. 1998.

Japanese dep endencystructure analysis basedon

lexicalized statistics. InProceedings of the Third

ConferenceonEmpiricalMethodsinNatural

Lan-guage Processing,pages87{96.

T. Fujisaki, F. Jelinek, J. Co cke, E. Black, and

T.Nishino. 1989. Aprobabilisticparsingmetho d

forsentencedisambiguation. InProceedingsofthe

International Parsing Workshop.

Masahiko Haruno, Satoshi Shirai, and Yoshifumi

Ooyama.1998.Usingdecisiontreestoconstructa

practicalparser. InProceedingsof the17th

Inter-national Conference on Computational

Linguis-tics,pages505{511.

Fredelick Jelinek, Rob ert L. Mercer, and Salim

Roukos. 1991. Principles of lexical language

mo deling for sp eech recognition. InAdvances in

Speech SignalProcessing, chapter 21, pages651{

699.Dekker.

JulianKupiec. 1989. AugmentingahiddenMarkov

mo delforphrase-dep endentwordtagging.In

Pro-ceedings of the DARPA Speech and Natural

Lan-guage Workshop,pages92{98.

BernardMerialdo. 1994. TaggingEnglishtextwith

aprobabilisticmo del. ComputationalLinguistics,

20(2):155{171.

GeorgeA.Miller. 1956. Themagicalnumb erseven,

plus or minus two: Some limits on our capacity

forpro cessinginformation.The Psychological

Re-view,63:81{97.

ShinsukeMoriandMakotoNagao.1998.Asto

chas-ticlanguagemo delusingdep endency andits

im-provement by wordclustering. In Proceedings of

the 17th International Conference on

Computa-tional Linguistics,pages898{904.

KemalOazer. 1999. Dep endency parsing withan

extended nitestate approach. InProceedings of

the 37th Annual Meeting of the Association for

Computational Linguistics,pages254{260.

VictorH. Yngve. 1960. A mo deland ahyp othesis