Shinsuke MORI, Masafumi NISHIMURA, Nobuyasu ITOH,
Shiho OGINO, Hideo WATANABE
IBM Research,Tokyo ResearchLab oratory, IBM Japan,Ltd.
1623-14 ShimotsurumaYamatoshi,242-8502, Japan
Abstract
In this pap er, we present a sto chastic language
mo del using dep endency. This mo del considers a
sentence asawordsequenceandpredictseach word
fromleft to right. Thehistory at each step of
pre-dictionis asequence of partialparse trees covering
the preceding words. First our mo del predicts the
partialparsetreeswhichhaveadep endencyrelation
with thenext word amongthem and then predicts
thenext wordfromonlythetrees whichhave a
de-p endencyrelationwiththenextword. Ourmo delis
agenerativesto chasticmo del,thusthiscanb e used
not only as a parser but also as a languagemo del
of asp eech recognizer. Inour exp eriment, we
pre-paredab out1,000syntacticallyannotatedJapanese
sentences extracted fromanancialnewspap er and
estimatedthe parametersof ourmo del. We builta
parserbased onourmo delandtestediton
approx-imately100sentences of thesamenewspap er. The
accuracyofthedep endencyrelationwas89.9%,the
highestaccuracylevelobtainedbyJapanesesto
chas-ticparsers.
1 Introduction
The sto chastic language mo deling, imp orted from
thesp eech recognitionarea,isone ofthe successful
metho dologies of natural language pro cessing. In
fact,alllanguagemo delsforsp eech recognition are,
as far as we know, based onan n-grammo del and
mostpracticalpart-of-sp eech(POS)taggersarealso
basedonawordorPOSn-grammo delorits
exten-sion (Church, 1988;Cutting et al.,1992; Merialdo,
1994; Dermatas and Kokkinakis, 1995). POS
tag-ging is the rst step of natural language pro
cess-ing,andsto chastictaggershavesolvedthisproblem
withsatisfyingaccuracyformanyapplications. The
next step is parsing, or that is to say discovering
the structure of a given sentence. Recently, many
parsers based onthesto chasticapproachhaveb een
prop osed. Although their rep orted accuracies are
high,they arenotaccurate enough formany
appli-cationsatthisstage,andmoreattemptshavetob e
madetoimprovethemfurther.
parse the sp oken text recognized by a sp eech
rec-ognizer. This attempt is clearly aiming at sp oken
languageunderstanding. Ifweconsiderhowto
com-bineaparser andasp eech recognizer,it isb etterif
theparserisbasedonagenerativesto chasticmo del,
asrequired for thelanguagemo delofasp eech
rec-ognizer. Here, \generative" meansthat the sumof
probabilitiesover all p ossible sentences is equal to
orless than 1. Ifthelanguagemo delisgenerative,
it allowsa seamless combination of the parser and
thesp eech recognizer. This means that the sp eech
recognizer hasthesto chastic parseras its language
mo del and b enets richer information than a
nor-maln-gram mo del. Even though such a
combina-tionisnotp ossibleinpractices,therecognizer
out-puts N-b est sentences with theirprobabilities, and
theparser, takingthem asinput,parsesallofthem
and outputs the sentence with its parse tree that
hasthehighest probabilityofall p ossible
combina-tions. As a result, a parser based on agenerative
sto chastic language mo del may help a sp eech
rec-ognizer to select the most syntactically reasonable
sentence amongcandidates. Therefore, it isb etter
ifthelanguagemo delof aparserisgenerative.
In thispap er, taking Japanese as theobject
lan-guage,we prop ose agenerative sto chastic language
mo delandaparserbasedonit. Thismo deltreatsa
sentence asawordsequence andpredictseachword
fromlefttoright. Thehistoryateachstepof
predic-tionisasequence ofpartialparsetrees coveringthe
precedingwords. Topredictaword,ourmo delrst
predictswhichofthepartialparsetreesatthisstage
have dep endency relation withthe word, and then
predicts the word from the selected partial parse
trees. In Japanese each worddep ends on a
subse-quentword,thatistosay,eachdep endencyrelation
isleftto right,it isnotnecessary to predictthe
di-rectionofeachdep endency relation. Soinorder to
extendourmo deltootherlanguages,themo delmay
havetopredictthedirectionofeachdep endency. We
builtaparser based onthis mo del, whose
parame-ters are estimated from1,072sentences ina
anyJapanesesto chastic parsers.
2 Stochastic Language Model based
on Dependency
In this section, we prop ose a sto chastic language
mo delbased on dep endency. Unlike most sto
chas-ticlanguagemo delsfor aparser, ourmo delis
the-oretically based onahidden Markovmo del. Inour
mo delasentenceispredictedwordbywordfromleft
to right and thestate at each step of prediction is
basicallyasequence ofwordswhosemo dicandhas
not app eared yet. According to apsycholinguistic
rep ortonlanguagestructure (Yngve,1960),thereis
an upp er limiton the numb er of the words whose
mo dicandshavenotapp earedyet. Thislimitis
de-terminedbythenumb erofslotsinshort-term
mem-ory, 72(Miller, 1956). With this limitation,we
candesignaparserbasedonanitestatemo del.
2.1 SentenceMo del
Thebasicideaofourmo delisthateachwordwould
b e b etter predicted fromthe words that have a
de-p endency relation with the word to b e predicted
thanfromtheprecedingtwowords(tri-grammo del).
Let us consider the complete structure of the
sen-tence inFigure1andahyp otheticalstructure after
the prediction of the fth word at the top of
Fig-ure2. Inthishyp otheticalstructure,therearethree
trees: one ro ot-only tree (t
b
comp osed of w
3 ) and
twotwo-no detrees(t
a
b e predicted fromthese twotrees. Fromthisp oint
ofview,ourmo delrstpredictsthetrees dep ending
on the next wordand then predicts the next word
fromthesetrees.
Now,letusmakethefollowingdenitionsinorder
toexplainourmo delformally.
w=w
: asequence ofwords. Herea
wordisdenedasapairconsistingofastringof
alphab eticcharacters andapart ofsp eech(e.g.
the/DT).
: a sequence of partial parse
trees coveringthei-prexwords (w
1
: subsequences of t
i
having and
nothavingadep endencyrelationwiththenext
wordresp ectively. InJapanese,likemanyother
languages, no two dep endency relations cross
eachother; thus t
i
sequence ofall subtreesconnected to thero ot.
After w
i+1
has b een predicted from the trees
dep ending on it (t +
i
), there are trees
remain-ing(t
he
that
ano
blue
ao
ending
i
wa
he
that
ano
blue
ao
ending
i
wa
he
that
ano
blue
ao
ending
i
wa
subj
apple
ringo
w
1
w
2
w
3
w
4
w
5
w
6
Figure2: Wordpredictionfromapartialparse
y
max
: upp er limiton the numb er numb er of
words whose mo dicands have not app eared
yet.
Under these denitions, our sto chastic language
mo delisdened asfollows:
P(w ) =
is all p ossible binary trees with n no des.
Here,therstfactor,(P(w
i jt
+
i 1
)),iscalledtheword
predictionmo delandthesecond,(P(t +
i 1 jt
i 1 )),the
state prediction mo del. Let us consider Figure 2
again. Atthe topisthestate just after the
predic-tionof the fth word. The state predictionmo del
then predicts the partial parse trees dep ending on
thenextwordamongallpartialparsetrees,asshown
in the second gure. Finally, the word prediction
mo delpredictsthenextwordfromthepartialparse
treesdep endingonit.
As describ ed ab ove,there mayb e anupp er limit
onthenumb erofwordswhosemo dicandshavenot
yet app eared. Toput it inanotherway,thelength
ofthesequence ofpartialparsetrees (t
i
kare
he
that
ano
blue
ao
wo
obj
tabe
eat
i
Figure1: Dep endency structure ofasentence.
Therefore,ifthedepthofthepartialparsetreeisalso
limited,thenumb erofp ossiblestatesislimited.
Un-derthisconstraint,ourmo delcan b e consideredas
ahiddenMarkovmo del. InahiddenMarkovmo del,
therst factor iscalled theoutput probability and
thesecond,thetransitionprobability.
Sinceweassumethatnotwodep endencyrelations
cross each other, the state prediction mo del only
has to predict the numb er of the trees dep ending
on the next word. Thus P(t
whereyis thenumb eroftreesinthesequence t +
i 1 .
Accordingto theab ove assumption,thelast y
par-tialparse trees dep end onthei-th word. Since the
numb erof p ossible parse trees for awordsequence
growsexp onentiallywith the numb er ofthe words,
the space of the sequence of partial parse trees is
huge even if the length of the sequence is limited.
This inevitably causes a data-sparseness problem.
To avoid this problem, we limited the numb er of
levelsof no desused todistinguishtrees. Inour
ex-p eriment,onlythero otanditschildrenarechecked
toidentifyapartialparsetree. Hereafter,we
repre-sentP
LL
todenotethismo del,inwhich thelexicon
oftherstlevelandthatofthesecondlevelare
con-sidered. Thus,inourexp erimenteachwordandthe
numb er of partial parse trees dep ending on it are
predicted by asequence of partialparse trees that
takeaccountoftheno deswhosedepthistwoorless.
It isworth notingthat ifthe dep endencystructure
ofasentence islinear|thatistosay,ifeach word
dep ends on the next word, | then our mo delwill
b e equivalenttoawordtri-grammo del.
Weintro duce aninterp olation technique(Jelinek
etal.,1991)intoourmo dellikethoseusedinn-gram
mo dels. Bylo oseningtreeidenticationregulations,
we obtain a more general mo del. For example, if
we check only the POS of the ro ot and the POS
of its children, we willobtain amo del similarto a
POS tri-gram mo del (denoted P
PP
hereafter). If
we check thelexicon ofthero ot, but notthat ofits
children,themo delwillb elikeawordbi-grammo del
(denoted P
NL
hereafter). Asa smo othingmetho d,
wecaninterp olatethemo delP
LL
,similartoaword
tri-grammo del,withamoregeneralmo del,P
PP or
P
NL
. In our exp eriment, as the following formula
generalizationlevels:
isthe checklevelof therstlevel
ofthe tree (N: none, P: POS, L:lexicon) and Y is
thatofthesecondlevel,andP
w ;0 gram
istheuniform
distributionoverthevo cabularyW(P
w ;0 gram (w )=
1=jWj).
Thestate predictionmo del(P(y jt
i 1
))isalso
in-terp olatedinthe sameway. Inthis case,thep
ossi-bleeventsarey =1;2;:::;y
2.2 ParameterEstimation
Sinceourmo delisahiddenMarkovmo del,the
pa-rameters of a mo del can b e estimated from a row
corpus by EMalgorithm (Baum, 1972). With this
algorithm,the probability of the row corpus is
ex-p ectedtob emaximizedregardlessofthestructureof
eachsentence. Sotheobtainedmo delis notalways
appropriateforaparser.
In order to develop a mo del appropriate for a
parser,itisb etterthattheparametersareestimated
from a syntactically annotated corpus by a
maxi-mumlikeliho o destimation(MLE)(Merialdo,1994)
asfollows:
wheref(x)representsthefrequencyofaneventxin
thetrainingcorpus.
The interp olation co eÆcients in the formula (2)
are estimated by the deleted interp olation metho d
(Jelineketal.,1991).
2.3 SelectingWords to b eLexicalized
frequentwordsmayb eharmfulb ecauseitmaycause
a data-sparseness problem. In a practical tagger
(Kupiec,1989),onlythemostfrequent100wordsare
lexicalized. Also,inastate-of-the-artEnglishparser
(Collins,1997)onlythewordsthato ccurmorethan
4timesintrainingdataarelexicalized.
Forthisreason,ourparserselectsthewordstob e
lexicalized at the time of learning. In the
lexical-izedmo delsdescrib ed ab ove (P
LL , P
PL andP
NL ),
only the selected words are lexicalized. The
selec-tion criterion isparsing accuracy (see section4) of
aheld-out corpus, asmallpart ofthe learning
cor-pusexcludedfromparameterestimation. Thusonly
thewordsthatarepredictedtoimprovetheparsing
accuracy ofthe testcorpus, or unknown input, are
lexicalized. Thealgorithmis as follows(see Figure
3):
1. Intheinitialstate allwordsare intheclassof
theirPOS.
2. Allwordsaresortedindescendingorderoftheir
frequency,andthefollowingpro cessisexecuted
foreachwordinthisorder:
(a) The word is lexicalized provisionally and
theaccuracyof theheld-outcorpus is
cal-culated.
(b) Ifanimprovementisobserved,thewordis
lexicalizeddenitively.
Theresultofthislexicalizationalgorithmisusedto
identifyapartialparsetree. Thatistosay,only
lex-icalized words aredistinguishedinlexicalized mo
d-els. Ifnowordswereselected tob elexicalized,then
P
NL = P
NP and P
LL = P
PL =P
PP
. It is worth
noting that if we try to join a word with another
word,thenthisalgorithmwillb eanormaltop-down
clusteringalgorithm.
2.4 UnknownWordMo del
Toenable oursto chastic languagemo delto handle
unknownwords,we addedanunknownwordmo del
based on a character bi-gram mo del. If the next
word is not in the vo cabulary, the mo del predicts
itsPOS andthe unknownwordmo delpredicts the
stringofthewordasfollows:
P(w jPO S)= m+1
Y
i=1 P
PO S (x
i jx
i 1 )
wherew=x
1 x
2 x
m ; x
0 =x
m+1 =BT :
BT, a sp ecial character corresp onding to a word
b oundary,isintro ducedsothatthesumofthe
prob-abilityoverallstrings isequalto1.
In the parameter estimation describ ed ab ove, a
learningcorpus is divided into k parts. In our
ex-1
w
lexicalize
yes
no
yes
2
w
POS
2
w
6
2
w
w
3
POS
2
w
6
1
w
1
w
2
w
w
3
POS
2
w
6
1
POS
w
4
w
5
w
2
w
3
POS
2
w
6
3
w
w
4
1
POS
w
5
w
4
1
POS
w
5
yes
1
w
w
4
1
POS
w
5
Figure 3: A conceptual gure of the lexicalization
algorithm.
o ccurring inmore than one partial corpus and the
other words are used for parameter estimation of
theunknownwordmo del. Thebi-gramprobability
ofthe unknownwordmo delof aPOS is estimated
fromthe words amongthem and b elonging to the
POSasfollows:
P
POS (x
i jx
i 1 )
MLE
= f
POS (x
i ;x
i 1 )
f
POS (x
i 1 )
:
The character bi-gram mo del is also interp olated
with a uni-gram mo del and a zero-gram mo del.
Theinterp olation co eÆcients are estimated by the
deletedinterp olationmetho d(Jelinek etal.,1991).
3 Syntactic Analysis
Generally,aparser mayb e consideredas amo dule
thatreceives asequence of words annotatedwitha
POS and outputs its structure. Our parser, which
includesasto chasticunknownwordmo del,however,
isabletoacceptacharactersequenceasaninputand
execute segmentation, POS tagging, and syntactic
analysissimultaneously 1
. Inthissection,weexplain
our parser, which is based on the languagemo del
describ ed intheprecedingsection.
3.1 Sto chastic SyntacticAnalyzer
Asyntacticanalyzer,basedonasto chasticlanguage
mo del,calculates theparsetree (see Figure1) with
thehighestprobabilityforagivensequenceof
char-actersxaccordingtothefollowingformula:
^
T = argmax
w (T)=x
P(Tjx)
w(T)=x
= argmax
w(T)=x
P(xjT)P(T) (
Bayes'formula)
= argmax
w(T)=x
P(T) (
P(xjT)=1);
where w (T) represents the concatenation of the
wordstringinthesyntactictreeT. P(T)inthelast
line is a sto chastic language mo del. In ourparser,
itistheprobabilityofaparsetree T denedbythe
sto chasticdep endencymo delincludingtheunknown
wordmo deldescrib ed insection2.
P(T)= n
Y
i=1 P(w
i jt
+
i 1 )P(t
+
i 1 jt
i 1
); (3)
wherew
1 w
2 w
n
=w(T).
3.2 SolutionSearchAlgorithm
Asshowninformula(3),ourparserisbasedona
hid-denMarkovmo del. ItfollowsthatViterbialgorithm
isapplicabletosearchtheb estsolution. Viterbi
al-gorithmiscapableofcalculatingtheb estsolutionin
O (n) time,where nis thenumb er ofinput
charac-ters.
The parser rep eats a state transition, reading
charactersoftheinputsentence fromlefttoright. In
orderthatthestructureoftheinputsentencemayb e
atree,thenumb eroftreesofthenalstatet
n must
b e1andnomore.Amongthestatesthatsatisfythis
condition,theparserselectsthestatewiththe
high-estprobability. Sinceourlanguagemo deluses only
thero otanditschildrenofapartialparsetreeto
dis-tinguishstates,thelast state do es nothave enough
informationtoconstruct theparsetree. Theparser
can, however, calculate the parse tree fromthe
se-quenceofstates,orb oththewordsequence andthe
sequence of y , the numb er of trees that dep end on
the next word. Thus it memorizes these values at
each step of prediction. After the most probable
last state has b een selected, the parser constructs
theparse tree byreading these sequences fromtop
tob ottom.
4 Evaluation
We develop ed a POS-based mo del and its
lexical-izedversionexplainedinsection2to evaluatetheir
predictivep ower,andimplementedparsersbasedon
them that calculate the mostprobabledep endency
tree fromagiven character sequence, using the
so-lutionsearchalgorithmexplainedinsection3to
ob-servetheiraccuracy. Inthissection,wepresent and
discusstheexp erimentalresults.
4.1 Conditionson the Exp eriments
The corpus used inour exp erimentsconsists of
ar-Table1: Corpus.
#sentences #words #chars
learning 1,072 30,292 46,212
test 119 3,268 4,909
Keizai Shinbun). Each sentence in the articles is
segmentedintowords anditsdep endency structure
is annotated by linguists using an editor sp ecially
designed for this task at our site. The corpus was
dividedinto tenparts;theparameters ofthemo del
were estimated from nine of them and the mo del
was tested on therest (see Table1). A smallpart
of each leaning corpus is withheld fromparameter
estimationand used to select the words to b e
lex-icalized. After checking the learning corpus, the
maximumnumb erofpartialparsetrees isset to10
(y
max =10).
Toevaluatethepredictivep owerofourmo del,we
calculatedtheircross entropyonthetestcorpus. In
thispro cess,theannotatedtree inthetestcorpusis
usedasthestructureofthesentences. Thereforethe
probabilityofeachsentence inthetestcorpusisnot
thesummationover all itsp ossible derivations. To
compare the POS-based mo del and the lexicalized
mo del,weconstructed thesemo delsusing thesame
learning corpus and calculated their cross entropy
onthesametestcorpus. ThePOS-basedmo deland
thelexicalizedmo delhave thesameunknownword
mo del,thus its contributionto the cross entropy is
constant.
We implemented a parser based on the dep
en-dency mo dels. Since our mo dels, including a
character-based unknown word mo del, can return
theb est parse tree with its probability for any
in-put,we can buildaparser thatreceives acharacter
sequence as input. It is noteasy to evaluate,
how-ever, b ecause errors may o ccur in segmentation of
thesequence into words and in estimationof their
POSs. For thisreason,inthefollowingdescription,
we assumeawordsequence as theinput.
Thecriterionforaparseristheaccuracyofits
out-put dep endency relations. This criterion is widely
usedtoevaluateJapanesedep endency parsers. The
accuracyistheratioofthenumb erofthewords
an-notatedwiththesamedep endencytothenumb erof
thewordsasinthecorpus:
accur acy
=
#wordsdep endingonthecorrect word
#words
Thelastwordandthesecond-to-lastwordofa
sen-tence are excluded, b ecause there is no ambiguity.
Table2: Crossentorpyandaccuracyofeachmo del.
languagemo del
cross
entropy
accuracy
selectivelylexicalized 6.927 89.9%
completelylexicalized 6.651 87.1%
POS-based 7.000 87.5%
linearstructure
| 78.7%
*Eachworddep ends onthenextword.
4.2 Evaluation
Table2 shows thecross entropy and parsing
accu-racyofthebaseline,whereallwords dep end onthe
next word, the POS-based dep endency mo del and
two lexicalized dep endency mo dels. In the
selec-tivelylexicalized mo del,words to b elexicalized are
selected bythealgorithmdescrib ed insection2. In
thecompletelylexicalizedmo del,allwords are
lexi-calized. Thisresult attests exp erimentallythat the
parser based on theselectively lexicalized mo del is
the b est parser. As for predictive p ower, however,
thecompletelylexicalizedmo delhasthelowestcross
entropy. Thusthismo delisestimatedtob etheb est
language mo del for sp eech recognition. Although
there is nodirect relation b etween cross entropyof
the languagemo deland error rate of asp eech
rec-ognizer, ifwe consider asp oken languageparser, it
mayb e b etter to select the words to b e lexicalized
usingothercriterion.
We calculatedthe cross entropy and the parsing
accuracy of the mo del whose parameters are
esti-mated from 1/4, 1/16, and 1/64 of the learning
corpus. The relation b etween the learning corpus
sizeandthecross entropyortheparsingaccuracyis
showninFigure4. Thecross entropyhasastronger
tendency to decrease as the corpus size increases.
Asforaccuracy, thereisalsoatendency forparsers
tob ecomemoreaccurate asthesizeofthelearning
increases. The size of the corpus we have at this
stageisnotatalllarge. However,itsaccuracyisat
thetoplevelofJapaneseparsers,whichuse50,000
-190,000sentences. Therefore,weconclude that our
approachis quitepromising.
5 Related Works
Historically, structures of natural languages have
b een describ ed byacontext-free grammarand
am-biguities have b een resolved by parsers based on a
context-free grammar(Fujisakiet al.,1989). In
re-cent years, some attempts have b een made in the
areaofparsingbyanitestatemo del(Oazer,1999)
etc. Ourparserisalsobasedonanitestatemo del.
Unlike these mo dels, we fo cused on rep orts on a
limitonlanguagestructure caused by the capacity
0
20
40
60
80
100
0
1
0
1
1
0
2
1
0
3
1
0
4
1
0
5
0
1
%
cross entropy
#characters in learning corpus
cross entropy
accuracy
parsing accuracy
20
0
16
12
8
4
84.0%
86.2%
88.1% 89.9%
Figure4: Relationb etween cross entropyand
pars-ingaccuracy.
mo delispsycholinguisticallymoreappropriate.
Recently,intheareaofparsersbasedonasto
chas-ticcontext-free grammar(SCFG),someresearchers
have p ointed out the imp ortance of the lexicon
and prop osed lexicalized mo dels (Charniak, 1997;
Collins,1997). In these pap ers, they rep orted
sig-nicant improvement of parsing accuracy. Taking
theserep ortsintoaccount, we intro ducedametho d
ofpartiallexicalizationandrep ortedsignicant
im-provement of parsing accuracy. Our lexicalization
metho dis also applicable to a SCFG-based parser
andimprovesitsparsingaccuracy.
The mo del we present in this pap er is a
genera-tive sto chastic languagemo del. Chelbaand Jelinek
(1998) presented a similar mo del. In their mo del,
each word is predicted from two right-most head
words regardless of dep endency relation b etween
thesehead words and theword. Eisner(1996)also
presentedasto chasticstructurallanguagemo del,in
which each word is predicted from its head word
and thenearest one. This mo delis very similarto
theparser presented byCollins (1996). The
great-est dierence b etween our mo delandthese mo dels
isinthatourmo delpredictsthenext wordfromthe
headwords,or partialparsetrees, dep endingonit.
Clearly,itis notalwaystworight-mostheadwords
thathave dep endency relationwith thenext word.
It followsthat our mo delis linguisticallymore
ap-propirate.
There have b een some attempts at sto chastic
Japanese parser (Haruno et al., 1998) (Fujio and
Matsumoto,1998) (Mori andNagao, 1998). These
Japaneseparsersarebasedonaunitcalledbunsetsu,
asequenceofoneormorecontentwordsfollowedby
zero or more function words. The parsers take a
can easily b e extended to other languages. As for
the accuracy, although a direct comparison is not
easy b etween our parser (89.9%; 1,072 sentences)
andtheseparsers (82%-85%;50,000-190,000
sen-tences) b ecause of the dierence of the units and
thecorpus, ourparser isone of thestate-of-the-art
parsers for Japanese language. It should b e noted
that our mo del describ es relations amongthree or
moreunits (case frame,consecutive dep endency
re-lations,etc.);thusourmo delb enetsagreaterdeal
fromincrease ofcorpussize.
6 Conclusion
In this pap er we have presented a sto chastic
lan-guage mo del based on dep endency structure. This
mo deltreatsasentenceasawordsequenceand
pre-dicts each word fromleft to right. The history at
eachstepofpredictionisasequence ofpartialparse
trees covering the preceding words. To predict a
word, ourmo delrstselects the partialparse trees
thathaveadep endencyrelationwiththeword,and
thenpredictsthenextwordfromtheselectedpartial
parsetrees. Wealsopresentedanalgorithmfor
lexi-calization. WebuiltparsersbasedonthePOS-based
mo delanditslexicalizedversion,whoseparameters
are estimated from 1,072 sentences of a nancial
newspap er. We testedtheparsers on119sentences
fromthesamenewspap er,whichwereexcludedfrom
thelearning. Theaccuracyof thedep endency
rela-tionofthelexicalizedparserwas89.9%,thehighest
obtainedbyanyJapanesesto chastic parser.
References
L. E. Baum. 1972. An inequality and asso ciated
maximization technique in statistical estimation
forprobabilisticfunctionsofMarkovpro cess.
In-equalities,3:1{8.
Eugene Charniak. 1997. Statistical parsing with a
context-freegrammarandwordstatistics. In
Pro-ceedings ofthe 14thNationalConferenceon
Arti-cialIntelligence,pages598{603.
CiprianChelbaandFredericJelinek. 1998.
Exploit-ingsyntacticstructure forlanguagemo deling. In
Proceedings of the 17th International Conference
onComputational Linguistics,pages225{231.
Kenneth Ward Church. 1988. A sto chastic parts
programand noun phraseparser forunrestricted
text. InProceedings of the Second Conferenceon
AppliedNaturalLanguage Processing,pages136{
143.
MichaelJohnCollins.1996. Anewstatisticalparser
basedonbigramlexicaldep endencies. InPro
ceed-ingsofthe34thAnnualMeetingoftheAssociation
forComputationalLinguistics,pages184{191.
MichaelCollins. 1997.Three generative,lexicalised
Computational Linguistics,pages16{23.
Doug Cutting, Julian Kupiec, Jan Pedersen, and
Penelop e Sibun. 1992. A practicalpart-of-sp eech
tagger. InProceedingsoftheThirdConferenceon
AppliedNaturalLanguage Processing,pages133{
140.
EvangelosDermatasandGeorge Kokkinakis. 1995.
Automaticsto chastic taggingofnaturallanguage
texts. ComputationalLinguistics,21(2):137{163.
Jason M. Eisner. 1996. Three new probabilistic
mo dels for dep endency parsing: An exploration.
In Proceedings of the 16th International
Confer-ence on Computational Linguistics, pages 340{
345.
Masakazu Fujio and Yuji Matsumoto. 1998.
Japanese dep endencystructure analysis basedon
lexicalized statistics. InProceedings of the Third
ConferenceonEmpiricalMethodsinNatural
Lan-guage Processing,pages87{96.
T. Fujisaki, F. Jelinek, J. Co cke, E. Black, and
T.Nishino. 1989. Aprobabilisticparsingmetho d
forsentencedisambiguation. InProceedingsofthe
International Parsing Workshop.
Masahiko Haruno, Satoshi Shirai, and Yoshifumi
Ooyama.1998.Usingdecisiontreestoconstructa
practicalparser. InProceedingsof the17th
Inter-national Conference on Computational
Linguis-tics,pages505{511.
Fredelick Jelinek, Rob ert L. Mercer, and Salim
Roukos. 1991. Principles of lexical language
mo deling for sp eech recognition. InAdvances in
Speech SignalProcessing, chapter 21, pages651{
699.Dekker.
JulianKupiec. 1989. AugmentingahiddenMarkov
mo delforphrase-dep endentwordtagging.In
Pro-ceedings of the DARPA Speech and Natural
Lan-guage Workshop,pages92{98.
BernardMerialdo. 1994. TaggingEnglishtextwith
aprobabilisticmo del. ComputationalLinguistics,
20(2):155{171.
GeorgeA.Miller. 1956. Themagicalnumb erseven,
plus or minus two: Some limits on our capacity
forpro cessinginformation.The Psychological
Re-view,63:81{97.
ShinsukeMoriandMakotoNagao.1998.Asto
chas-ticlanguagemo delusingdep endency andits
im-provement by wordclustering. In Proceedings of
the 17th International Conference on
Computa-tional Linguistics,pages898{904.
KemalOazer. 1999. Dep endency parsing withan
extended nitestate approach. InProceedings of
the 37th Annual Meeting of the Association for
Computational Linguistics,pages254{260.
VictorH. Yngve. 1960. A mo deland ahyp othesis