A Stochastic Parser Based on an SLM with Arboreal Context Trees

(1)

Based on an SLM with Arboreal Context Trees

Shinsuke MORI

IBM Research,TokyoResearchLaboratory, IBM Japan,Ltd.

1623-14 Shimotsuruma Yamato-shi, 242-8502, Japan

[email protected]

Abstract

Inthispaper,wepresentaparserbasedona

stochas-ticstructuredlanguagemodel(SLM)withaexible

historyreferencemechanism. AnSLMisan

alterna-tivetoan n-grammodelas a language model fora

speechrecognizer. TheadvantageofanSLMagainst

ann-gram modelis theabilityto return the

struc-ture of a given sentence. Thus SLMs are expected

toplayanimportantpartinspokenlanguage

under-standingsystems. ThecurrentSLMsrefertoaxed

partofthehistoryforpredictionjustlikeann-gram

model. We intro duce a exible history reference

mechanism called an ACT (arboreal context tree;

anextension ofthecontext treeto tree-shaped

his-tories)anddescribeaparserbasedonanSLM with

ACTs. Inthe experiment, we built anSLM-based

parserwitha xedhistoryandonewithACTs,and

comparedtheirparsingaccuracies. Theaccuracyof

our parser was 92.8%, which was higher than that

for theparser withthe xed history (89.8%). This

resultshowsthattheexiblehistoryreference

mech-anismimprovestheparsingabilityofanSLM,which

hasgreatimportanceforlanguage understanding.

1 Introduction

Currently, the state-of-the-art speech recognizers

can take dictation with satisfactory accuracy.

Al-though continuing attempts for improvements in

predictive power are needed in the language

mod-eling area for speech recognizers, another research

topic,understandingofthedictationresults,is

com-inginto focus. Structuredlanguagemodels (SLMs)

(Chelba andJelinek,1998;Charniak,2001; Moriet

al., 2001) were proposed for these purposes. Their

predictivepowersarereportedtobeslightlyhigher

thananorthodoxwordtri-grammodeliftheSLMs

are interp olated with a word tri-gram model. In

contrast with word n-gram models, SLMs use the

syntactic structure (a partial parse tree) covering

the preceding words at each step of word

predic-tion. The syntacticstructure also growsin parallel

with the word prediction. Thus after the

predic-tionofthelastwordofasentence,SLMsareableto

inputsentence(parse trees)with associated

proba-bilities. Thoughtheimpactonthepredictivepower

isnot major, this ability, which is indispensable to

spokenlanguageunderstanding,isaclearadvantage

ofSLMs over word n-gram models. With an SLM

asa language model, a speech recognizeris ableto

directlyoutputa recognitionresultwithits

syntac-ticstructureafterbeinggivenasequenceofacoustic

signals.

The early SLMs refer to only a limited and

xed part of the histories for each step of word

and structure prediction in order to avoid a

data-sparseness problem. For example, in an English

model (Chelba and Jelinek,2000) thenextwordis

predicted from the two right-most exposed heads.

Also in a Japanese model (Mori et al., 2000) the

next word is predicted from 1) all exposed heads

depending on the next word and 2) the words

de-pendingonthoseexposedheads. Oneofthenatural

improvementsin predictivepower for an SLM can

beachievedbyaddingsomeexibilitytothehistory

referencemechanism. Fora linearhistory, which is

referred to by using word n-gram models, we can

use a context tree (Ron et al., 1996) as a exible

history reference mechanism. In an n-gram model

withacontexttree,thelengthofeachn-gramis

in-creased selectively according to an estimate of the

resultingimprovement in predictive quality. Thus,

ingeneral,ann-grammodelwithacontexttreehas

morepredictivepowerin asmallermodel.

In SLMs, the history is not a simple word

se-quence but a sequence of partial parse trees. For

atree-shapedcontext,thereisalsoaexiblehistory

referencemechanismcalledanarborealcontexttree

(ACT) (Mori et al., 2001). 1

Similar to a context

tree,an SLMwith ACTs selects, depending onthe

context,theregionof thetree-shapedhistory to be

referredtoforthenextwordpredictionandthenext

structureprediction. Moriet al. (2001)report that

anSLMwithACTshasmorepredictivepowerthan

anSLM with a xed history reference mechanism.

Therefore,ifaparser basedonanSLMwith ACTs

outperformsan SLM withoutACTs, an SLM with

(2)

searchmilestoneforspokenlanguageunderstanding

systems.

Inthispaper,rstwedescribeanSLMwithACTs

for a Japanese dependency grammar. Next, we

presentourstochasticparserbasedontheSLM.

Fi-nally,wereporttwoexperimentalresults: a

compar-isonwithanSLM withoutACTs andanother

com-parisonwithastate-of-the-artJapanesedependency

parser. Theparametersofourparserwereestimated

from9,108syntacticallyannotatedsentencesfroma

nancial newspaper. Wethen tested the parser on

1,011 sentences from thesame newspaper. The

ac-curacyofthe dependencyrelationships reported by

our parser was 92.8%, higher than theaccuracy of

theparserbasedonanSLMwithoutACTs(89.8%).

This provedexperimentally that anACT improves

a parserbasedonanSLM.

2 Structured Language Model based

on Dependency

Themostpopularlanguage modelfora speech

rec-ognizerisawordn-grammodel,inwhicheachword

ispredictedfromthelast(n01)words. Thismodel

works so well that the current recognizer can take

dictationwithanalmostsatisfactoryaccuracy. Now

theresearchfocusinthelanguagemodelareais

un-derstandingthedictationresults. Inthissituation,a

structured languagemodel(SLM) was proposed by

Chelba and Jelinek (1998). In this section, we

de-scribethedependencygrammarversionofanSLM.

2.1 Structured Language Model

Thebasic ideaofan SLM is that each wordwould

bebetter predicted from the wordsthat may have

a dependencyrelationshipwiththewordtobe

pre-dictedthanfromtheproceeding(n01)words. Thus

theprobabilityPofasentencew=w

1

itsparsetreeT isgivenasfollows:

P(T)=

isthei-thpartialparsetreesequence. The

partial parse tree depicted at the top of Figure 1

shows the status before the 9th wordis predicted.

Fromthisstatus,forexample,rstthe9thwordw

9

ispredictedfromthe8thpartialparsetreesequence

t

,andthen the9thpartialparse tree

sequence t

9

is predictedfrom the9th wordw

9 and

the 8thpartial parse tree sequence t

8

to getready

for the 10th word prediction. The problem here

is how to classify the conditional parts of the two

conditionalprobabilitiesin Equation(1)inorderto

predictthenextwordand thenextstructure

with-outencounteringa data-sparsenessproblem. In an

English model (Chelba and Jelinek,2000) thenext P(t

8 )

w2

w4

w6

w

1

w

3

w5

w7

w8

t8,3

t8,2

t8,1

?

Figure1: Wordpredictionfromapartial parse

wordis predicted from the two right-most exposed

heads(forexamplew

6 andw

8

inFigure1)asfollows:

P(w

where root(t) is a function returning the root

la-bel wordof the tree t. A similar approximation is

adaptedtotheprobabilityfunctionforstructure

pre-diction. InaJapanesemodel(Morietal.,2000)the

next word is predicted from 1) all exposed heads

depending on the next word and 2) the words

de-pendingonthoseexposedheads.

Itisclear,however,thatinsomecasessomechild

nodes of the tree t

i01;2 or t

i01;1

are useful for the

next word prediction and in other cases even the

consideration of an exposed head (root of the tree

t

i01;1 ort

i01;2

)suersfroma data-sparseness

prob-lem becauseofthelimitationofthelearningcorpus

size. Thereforeamoreexiblemechanismforhistory

classicationshouldimprovethepredictivepowerof

theSLM.

2.2 SLM forDependency Grammar

Since in a dependencygrammar ofJapanese,every

dependency relationshipis in a uniquedirection as

shownin Figure1 andsincenotwodependency

re-lationshipscrosseachother,thestructureprediction

modelonlyhastopredictthenumb eroftrees. Thus,

thesecondconditionalprobabilityintherighthand

side of Equation (1) is rewritten as P(l

i

is the length (numb er of elements) of the

(3)

depen-w2

Figure2: Ahistorytree.

dencygrammarisdenedas follows:

P(T)=

According to a psycholinguistic report on

lan-guage structure (Yngve, 1960), there is an upper

limitonl

i

,thenumb erofwordswhosemodicands

havenotappearedyet. Wesettheupperlimitto9,

themaximumnumb erofslots inhumanshort-term

memory (Miller, 1956). With this limitation, our

SLMbecomesahiddenMarkovmodel.

3 Arboreal Context Tree

A variable memory length Markov model (Ron et

al.,1996),a naturalextensionofthen-grammodel,

is a exible mechanism for a linear context (word

sequence) which selects, depending onthe context,

the length of the history to be referred to for the

nextwordprediction. Thismodelisrepresentedby

a suxtree,called a contexttree, whosenodes are

labeled witha sux ofthe context. Inthis model,

thelengthofeachn-gramisincreasedselectively

ac-cordingtoanestimateoftheresultingimprovement

inpredictivequality.

In SLMs, the history is not a simple word

se-quence but a sequence of partial parse trees. For

atree-shapedcontext,thereisalsoaexiblehistory

referencemechanismcalledanarborealcontexttree

(ACT)(Mori et al.,2001)whichselects, depending

onthecontext,theregionofthetree-shapedhistory

to be referred to for the next word predictionand

forthenextstructureprediction. Inthissection,we

explainACTsandtheirapplicationtoSLMs.

3.1 DataStructure

As we mentioned above, in SLMs the history is a

sequenceofpartialparsetrees. Thiscanberegarded

as a single tree, called a history tree, by adding a

virtualrootnodehavingthesepartialtreesunderit.

a

Figure 3: Anarborealcontexttree(ACT).

9thwordpredictionbasedonthestatusdepictedat

thetop of Figure 1. An arboreal context tree is a

datastructureforexiblehistorytreeclassication.

Each node of an ACT is labeled with a subtree of

thehistory tree. Thelabelof therootis anulltree

and if a node has child nodes, their labels are the

seriesoftrees madebyexpandinga leafof thetree

labeling the parent node. For example, each child

node of the root in Figure 3 is labeledwith a tree

producedbyaddingtheright-mostchildtothelabel

oftheroot. EachnodeofanACThasa probability

distributionP(xjt),wherexisansymbolandtisthe

label of the node. For example, let ha

k

represent a tree consisting of theroot labeled with

a

0

andkchildnodeslabeledwitha

k

sotheright-mostnodeatthebottomoftheACTin

Figure3hasaprobabilitydistributionofthesymb ol

xunder theconditionthat thehistory matchesthe

partial parse trees hhz?iaihbi, where \?" matches

withanarbitrarysymb ol. Puttingitinanotherway,

thenextwordispredictedfromthehistoryhavingb

astheheadoftheright-mostpartialparsetree,aas

theheadofthesecondright-mostpartialparsetree,

and z as the second right-most child of the second

right-mostpartialparsetree. Forexample,inFigure

2thesubtreeconsistingofw

4

to for the prediction of the 9thword w

9

in Figure

1 under the following set of conditions: a = w

6

3.2 AnSLM with ACTs

(4)

P(T) =

is an ACT for word prediction and

ACT

s

isanACTforstructureprediction. Notethat

thisisageneralizationofthepredictionfromthetwo

right-mostexposedheads(w

6 andw

8

)intheEnglish

model(ChelbaandJelinek,2000). Ingeneral,SLMs

with ACTs includes SLMs with xed history

refer-encemechanismsasspecialcases.

4 Parser

Inthis section, weexplain ourparser basedon the

SLMwithACTs wedescribedinSections2 and3.

4.1 StochasticParserBased on anSLM

Asyntacticanalyzer,basedonastochasticlanguage

model, calculates the parse tree with the highest

probability ^

T for a given sequence of wordsw

ac-cordingto

wheretheconcatenation ofthewordsinthe

syntac-tic tree T is equal to w. P(T) is an SLM. In our

parser,P(T)istheprobabilityofaparsetreeT

de-nedbytheSLMbasedonthedependencywiththe

ACTs(seeEquation(3)).

4.2 Solution SearchAlgorithm

As shownin Equation(3),ourparser isbasedona

hidden Markov model. It follows that the Viterbi

algorithm is applicable to search for the best

solu-tion. TheViterbialgorithmiscapableofndingthe

best solutionin O(n)time, where nis the number

ofinputwords.

The parser repeats state transitions, reading

wordsof theinputsentencefrom beginningto end.

So that the structure of the inputsentence will be

a singleparse tree, thenumber of treesin thenal

statet

n

mustbe 1 (l

n

=1). Among thenal

pos-sible states that satisfy this constraint, the parser

selectsthestatewiththehighestprobability. Since

our language model uses only a limited part of a

partial parse tree to distinguish among states, the

nal statedoesnot contain enough information to

constructthe parse tree. The parsercan, however,

Table1: Corpus.

#sentences #words #chars

learning 9,108 260,054 400,318

test 1,011 28,825 44,667

Table2: Word-basedparsingaccuracy.

language model parsing accuracy

SLMwithACTs 92.8%(24,867/26,803)

SLM withxedhistory 89.8%(24,060/26,803)

baseline 3

79.4%(21,278/26,803)

*Eachworddependsonthenextone.

or from the combination ofthe wordsequence and

thesequenceofl

i

,thenumberofwordswhose

modi-candshavenotappearedyet. Thereforeourparser

recordsthese values at each prediction step. After

themostprobable last statehasbeenselected, the

parserconstructstheparsetreebyreadingthese

se-quencesfrombeginningtoend.

5 Evaluation

Wedevelop edan SLM with a constant history

ref-erence (Mori et al., 2000) and one with ACTs as

explainedinSection3,andthenimplemented

SLM-based parsers using the solution search algorithm

presented in Section 4. In this section, we report

the results of the parsing experiments and discuss

them.

5.1 Conditions onthe Experiments

Thecorpususedinourexperimentsconsistedof

ar-ticles extracted from a nancial newspaper (Nihon

KeizaiShinbun). Eachsentenceinthearticlesis

seg-mentedintowordsandeachwordisannotatedwith

apart-of-speech(POS)andtheworditdependson.

There are 16 basic POSs in this corpus. Table 1

showsthecorpussize. Thecorpus was dividedinto

tenparts,andtheparametersofthemodelwere

es-timatedfromnineofthem(learning)andthemodel

wastestedontheremainingone(test).

In parameter estimation and parsing, the SLM

withACTs distinguisheslexicons offunctionwords

(4POSs)and ignoreslexiconsofcontentwords(12

POSs) in order to avoid the data-sparseness

prob-lem. As a result, the alphabet of the SLM with

ACTsconsistsof192 functionwords,4 symb olsfor

unknownfunctionwords,and12symb olsforcontent

words. The SLM of the constant history reference

selects words to be lexicalized referring to the

(5)

85%

90%

95%

80%

100%

0

1

₀

1

₀

2

1

₀

3

1

₀

4

1

₀

5

1

₀

6

1

₀

7

0

1

#characters in learning corpus

accuracy

83.15%

86.98%

89.17%

90.97%

92.78%

Figure4: Relationbetweencorpus size andparsing

accuracy.

5.2 Evaluation

Oneofthemajorcriteriaforadependencyparseris

theaccuracyofitsoutputdependencyrelationships.

Forthiscriterion,theinputofaparserisasequence

ofwords,eachannotatedwithaPOS.Theaccuracy

istheratioof correctdependencies(matchesin the

corpus)tothenumber ofthewordsin theinput:

accuracy

=

#wordsdependingonthecorrectword

#words

:

Thelastwordandthesecond-to-lastwordofa

sen-tence are excluded, because there is no ambiguity.

The last word has no word to depend on and the

second-to-lastwordalwaysdependsonthelastword.

Table 2 shows the accuracies of the SLM with

ACTs, the SLM of the constant history reference,

and a baselinein which each worddepends on the

next one. This result shows that the variable

his-tory reference mechanism based on ACTs reduces

30% of theerrorsof theSLM ofa constanthistory

reference. This proves experimentally that ACTs

improve an SLM for use as a spoken language

un-derstandingengine.

Wecalculatedtheparsingaccuracyofthemodels

whose parameters were estimated from 1/4, 1/16,

and1/64ofthelearningcorpusandplottedthemin

Figure4. Thegradientoftheaccuracycurveatthe

pointofthemaximumlearningcorpussizeisstill

im-portant. Itsuggeststhatanaccuracyof95%should

beachievedbyannotatingabout30,000sentences.

Similartomostoftheparsersformanylanguages,

ourparseris basedonwords. However,mostother

parsersfor Japanese arebased ona uniquephrasal

unit called a bunsetsu, a concatenation of one or

w4

w6

w

1

w3

w

5

w7

w

8

w9

1

b

2

b

3

b4

b

w2

Figure 5: Conversion from word dependencies to

bunsetsu dependencies.

Table3: Bunsetsu-basedparsingaccuracy.

languagemodel parsingaccuracy

SLM withACTs 87.8%(674/768)

JUMAN+KNP 85.3%(655/768)

baseline 3

62.4%(479/768)

*Eachbunsetsudependsonthenext one.

functionwords. Inordertocompareourparserwith

oneofthestate-of-the-artparsers,wecalculatedthe

bunsetsu-based accuracies of our model and KNP

(Kurohashi and Nagao, 1994) on the rst 100

sen-tencesof thetest corpus. First the sentenceswere

segmentedintowordsbyJUMAN(Kurohashietal.,

1994)andtheoutputwordsequencesareparsedby

KNP.Next,theword-baseddependenciesoutputby

our parser were changed into bunsetsu as used by

KNP, where the bunsetsu which is depended upon

bya bunsetsu is dened as thebunsetsu containing

the word depended upon by the last word of the

source bunsetsu (see Figure 5). Table3 shows the

bunsetsu-basedaccuraciesofourmodelandKNP.In

accuracy,ourparseroutperformedKNP,butthe

dif-ferencewasnotstatisticallysignicant. Inaddition,

there were dierences in theexperimental

environ-ment:

Thetestcorpus sizewaslimited.

ThePOSsystemfortheKNPinputisdetailed,

soithasmuchmoreinformationthanour

SLM-basedparser.

KNPinthisexperimentwasnotequippedwith

commercialdictionaries.

As we mentionedabove, our current model does

not attempt to use lexical information about

con-tentwordsbecauseofthe data-sparsenessproblem.

If we select the content words to be lexicalized by

referringtotheaccuracyofthewithheldcorpus,the

accuracy increases slightly to 92.9%. This means,

however, our method is not able to eciently use

lexical information aboutthecontent wordsat this

(6)

Historically,thestructuresofnaturallanguageshave

beendescribedbyaCFGandmostparsers(Fujisaki

et al., 1989; Pereira and Schabes, 1992; Charniak,

1997)arebasedonit. AnSLM forEnglish(Chelba

andJelinek,2000),proposedasalanguagemodelfor

speech recognition,is alsobasedonaCFG.Onthe

otherhand,anSLMforJapanese(Morietal.,2000)

isbasedonaMarkovmodelbyintro ducingalimiton

language structures caused by our human memory

limitations (Yngve, 1960; Miller, 1956). We

intro-duced thesame limitationintoourlanguage model

andourparserisalsobasedonaMarkovmodel.

In the last decade, the importance of the

lexi-con has come into focus in the area of stochastic

parsers. Nowadays, many state-of-the-art parsers

are based on lexicalized models (Charniak, 1997;

Collins, 1997). In these papers, theyreported

sig-nicant improvement in parsing accuracy by

lexi-calization. Our model is also lexicalized, the

lexi-calization islimited to grammatical functionwords

becauseofthesparsenessofdataatthestepofnext

word prediction. The greatest dierence between

ourparserandmanystate-of-the-artparsersisthat

ourparserisbasedona generativelanguagemodel,

whichworksasalanguage modelofaspeech

recog-nizer. Therefore,a speechrecognizerequippedwith

our parser as its language model should be useful

for a spoken language understanding system. The

greatest advantage of our model over other

struc-turedlanguagemodelsistheablitytorefertoa

vari-ablepartofthestructuredhistorybyusingACTs.

There have been several attempts at Japanese

parsers(KurohashiandNagao,1994;Harunoet al.,

1998; Fujio and Matsumoto, 1998; Kudo and

Mat-sumoto,2000). TheseJapaneseparsershaveallbeen

basedona uniquephrasal unitcalled a bunsetsu,a

concatenationofoneormorecontentwordsfollowed

bysome grammatical functionwords. Unlikethese

parsers, our model describes dependencies between

words.Thusourparsercanmoreeasilybeextended

to other languages. In addition, since almost all

pasersin other languagesthan Japanese output

re-lationshipsbetweenwords,theoutputofourparser

canbeusedbypost-parserlanguageprocessing

sys-temsproposed formanyotherlanguages (suchas a

word-level structural alignment of sentences in

dif-ferentlanguages).

7 Conclusion

In this paper we have described a structured

lan-guage model (SLM) based on a dependency

gram-mar. AnSLM treatsa sentenceas awordsequence

andpredictseachwordfrombeginningtoend. The

history at each step of prediction is a sequence of

partial parse trees covering the preceding words.

ingdata-sparsenessproblems. Asananswer,we

pro-pose to apply arborealcontext trees (ACTs) to an

SLM. AnACTis anextension of a context tree to

a tree-shaped history. We built a parser based on

an SLM with ACTs, whose parameters were

esti-matedfrom9,108syntacticallyannotatedsentences

from a nancial newspaper. We then tested the

parser on 1,011 sentences from the same

newspa-per. Theaccuracy of thedependency relationships

oftheparserwas92.8%,higherthantheaccuracyof

a parser basedon anSLM withoutACTs (89.8%).

This proved experimentally that ACTs improve a

parser basedonanSLM.

References

Eugene Charniak. 1997. Statistical parsing with a

context-freegrammarandwordstatistics.In

Pro-ceedingsof the14thNationalConferenceon

Arti-cial Intelligence,pages598{603.

Eugene Charniak. 2001. Immediate-head parsing

for language models. In Proceedings of the 39th

Annual Meetingof the Association for

Computa-tional Linguistics,pages124{131.

CiprianChelbaandFredericJelinek. 1998.

Exploit-ing syntacticstructureforlanguage modeling. In

Proceedingsof the 17th International Conference

on ComputationalLinguistics,pages225{231.

Ciprian Chelba and Frederic Jelinek. 2000.

Struc-tured language modeling. Computer Speech and

Language,14:283{332.

MichaelCollins. 1997. Threegenerative,lexicalised

models for statistical parsing. In Proceedings of

the 35th Annual Meeting of the Association for

Computational Linguistics,pages16{23.

Masakazu Fujio and Yuji Matsumoto. 1998.

Japanese dependencystructureanalysis basedon

lexicalized statistics. InProceedings of the Third

ConferenceonEmpiricalMethodsinNatural

Lan-guage Processing,pages87{96.

T. Fujisaki, F. Jelinek, J. Cocke, E. Black, and

T.Nishino. 1989. Aprobabilisticparsingmethod

forsentencedisambiguation.InProceedingsofthe

International ParsingWorkshop.

Masahiko Haruno, Satoshi Shirai, and Yoshifumi

Ooyama.1998. Usingdecisiontreestoconstructa

practical parser. InProceedingsof the17th

Inter-national Conference on Computational

Linguis-tics,pages505{511.

Taku Kudo and Yuji Matsumoto. 2000. Japanese

dependency structure analysis based on support

vectormachines. InProceedingsofthe2000Joint

SIGDAT Conference on Empirical Methods in

NaturalLanguageProcessingandVeryLarge

Cor-pora.

(7)

syn-ComputationalLinguistics,20(4):507{534.

Sadao Kurohashi, Toshihisa Nakamura, Yuji

Mat-sumoto,andMakotoNagao. 1994. Improvements

of Japanese morphological analyzer JUMAN. In

Proceedings of the International Workshop on

Sharable Natural Language Resources, pages 22{

28.

GeorgeA.Miller. 1956. Themagicalnumb erseven,

plus or minus two: Some limits on our capacity

forprocessinginformation. ThePsychological

Re-view,63:81{97.

Shinsuke Mori, Masafumi Nishimura, Nobuyasu

Itoh,ShihoOgino,andHideoWatanab e. 2000. A

stochasticparserbasedona structuralword

pre-dictionmodel. InProceedingsofthe18th

Interna-tional Conference on Computational Linguistics,

pages558{564.

ShinsukeMori,MasafumiNishimura,andNobuyasu

Itoh. 2001. Improvementofastructuredlanguage

model: Arbori-contexttree. InProceedingsof the

SeventhEuropeanConferenceonSpeech

Commu-nication andTechnology.

Fernando Pereira and YvesSchabes. 1992.

Inside-outsidereestimationfrompartiallybracketed

cor-pora. In Proceedingsof the 30thAnnual Meeting

of theAssociationforComputationalLinguistics,

pages128{135.

DanaRon,YoramSinger,andNaftaliTishby. 1996.

Thepowerofamnesia: Learningprobabilistic

au-tomata with variable memory length. Machine

Learning,25:117{149.

Victor H. Yngve. 1960. A model anda hypothesis

forlanguagestructure. The American