Tagging and Chunking with Bigrams

(1)

Ferran Pla, Antonio Molina and Natividad Prieto

Universitat Politecnicade Valencia

Departament de Sistemes Informaticsi Computacio

Cam de Vera s/n

46020 Valencia

ffpla,amolina,[email protected]

Abstract

In this paper we present an integrated system for

taggingandchunkingtextsfromacertainlanguage.

The approach is based on stochastic nite-state

models thatarelearntautomatically. Thisincludes

bigrammodelsornite-stateautomatalearntusing

grammaticalinferencetechniques.Asthemodels

in-volved in our system are learnt automatically, this

isaveryexibleandportablesystem.

Inordertoshowtheviabilityofourapproachwe

present results for tagging and chunking using

bi-grammodelsontheWallStreetJournalcorpus. We

haveachievedanaccuracyratefortaggingof96.8%,

andaprecisionrateforNP chunks of94.6%witha

recallrateof93.6%.

1 Introduction

PartofSpeechTaggingandShallowParsingaretwo

well-knownproblemsin Natural Language

Process-ing. ATaggercanbeconsideredasatranslatorthat

readssentencesfromacertainlanguageandoutputs

thecorrespondingsequencesofpartofspeech(POS)

tags,takingintoaccountthecontextin which each

wordofthesentenceappears. AShallowParser

in-volvesdividing sentences into non-overlapping

seg-ments on the basis of very supercial analysis. It

includes discovering the main constituents of the

sentences (NPs, VPs, PPs, ...) and their heads.

ShallowParsingusuallyidentiesnon-recursive

con-stituents, alsocalledchunks (Abney, 1991)(suchas

non-recursive Noun Phrases or base NP, base VP,

and so on). It can include determining syntactical

relationshipssuchas subject-verb,verb-object,etc.,

Shallow parsing which always follows the tagging

process,isusedasafastandreliablepre-processing

phasefor fullor partial parsing. It can beused for

InformationRetrievalSystems,Information

Extrac-tion,TextSummarizationandBilingualAlignment.

In addition, it is also used to solve computational

linguisticstaskssuchasdisambiguationproblems.

1.1 POS Tagging Approaches

The dierent approaches for solving this problem

onthetendencies followedforestablishingthe

Lan-guage Model (LM): the linguistic approach, based

onhand-coded linguisticrulesand thelearning

ap-proach derived from a corpora (labelled or

non-labelled). Other approximations that use hybrid

methods havealsobeenproposed (Voutilainen and

Padro,1997).

In the linguistic approach, an expert linguist is

neededtoformalisetherestrictionsofthelanguage.

Thisimplies averyhigh costand it is very

depen-dent on each particular language. We can nd an

importantcontribution(Voutilainen,1995)thatuses

Constraint Grammarformalism. Supervised

learn-ingmethodswere proposed in(Brill,1995)to learn

a set of transformation rules that repair the error

committedby aprobabilistictagger. The main

ad-vantageofthelinguisticapproachisthatthemodel

is constructed from a linguistic point of view and

containsmanyandcomplexkindsofknowledge.

In the learning approach, the most extended

formalism is based on n-grams or HMM. In this

case, the language model can be estimated from

a labelled corpus (supervised methods) (Church,

1988)(Weischedel et al., 1993) or from a

non-labelledcorpus(unsupervisedmethods)(Cutting et

al.,1992). Intherstcase,themodelistrainedfrom

therelativeobservedfrequencies. Inthesecondone,

the model is learned using the Baum-Welch

algo-rithmfromaninitialmodelwhichisestimatedusing

labelled corpora (Merialdo, 1994). The advantages

oftheunsupervisedapproacharethefacilitytobuild

language models, the exibility of choice of

cate-goriesandtheeaseofapplicationtootherlanguages.

Wecanndsomeothermachine-learningapproaches

that use moresophisticated LMs, such asDecision

Trees (Marquez and Rodrguez, 1998)(Magerman,

1996),memory-basedapproachestolearnspecial

de-cisiontrees(Daelemans et al.,1996),maximum

en-tropy approaches that combine statistical

informa-tionfrom dierentsources (Ratnaparkhi, 1996),

-nitestateautomatainferred usingGrammatical

In-ference(PlaandPrieto, 1998),etc.

(2)

dif-the size ofthe vocabulary,the ambiguity, the

diÆ-culty ofthe testset,etc. The best resultsreported

ontheWallStreetJournal(WSJ)Treebank(Marcus

etal.,1993),usingstatisticallanguagemodels,have

anaccuracyratiobetween95%and97%(depending

on the dierent factorsmentioned above). For the

linguisticapproachtheresultsarebetter. For

exam-ple, in (Voutilainen, 1995)an accuracy of 99.7%is

reported, but certain ambiguities in the output

re-mainunsolved. Someworkshaverecentlybeen

pub-lished(BrillandWu, 1998)inwhichasetoftaggers

arecombinedinordertoimprovetheirperformance.

Insomecases,thesemethodsachieveanaccuracyof

97.9%(Halterenetal.,1998).

1.2 Shallow Parsing Approaches

Since the early 90's, several techniques for

carry-ingoutshallowparsinghavebeendeveloped. These

techniques can also be classied into two main

groups: based on hand-coded linguistic rules and

based on learning algorithms. These approaches

have a common characteristic: they take the

se-quence oflexicaltagsproposedbyaPOStaggeras

input, for boththe learningand thechunking

pro-cesses.

1.2.1 Techniques based on hand-coded

linguisticrules

These methodsuseahand-writtenset ofrulesthat

are dened using POS as terminals of the

gram-mar. Most of these works use nite statemethods

fordetectingchunks orforaccomplishingother

lin-guistic tasks (Ejerhed, 1988), (Abney, 1996),

(At-Mokhtar and Chanod, 1997). Other works use

dif-ferent grammatical formalisms, such as constraint

grammars(Voutilainen,1993),orcombinethe

gram-marruleswithasetofheuristics(Bourigault,1992).

Theseworksusuallyuseasmalltestsetthatis

man-uallyevaluated, sotheachieved resultsarenot

sig-nicant. The regular expressions dened in

(Ejer-hed,1988)identiedbothnon-recursiveclausesand

non-recursiveNPs inEnglishtext. The

experimen-tationontheBrowncorpusachievedaprecisionrate

of 87% (for clauses) and 97.8 % (for NPs).

Ab-neyintroducedtheconceptof chunk(Abney, 1991)

andpresentedanincrementalpartialparser(Abney,

1996). This parser identies chunks base on the

parts of speech, and it then chooses how to

com-bine them for higherlevel analysisusing lexical

in-formation. Theaverageprecisionandrecallratesfor

chunkswere87.9%and87.1%,respectively,onatest

set of 1000 sentences. An incremental architecture

ofnite-statetransducersforFrenchispresentedin

(At-Mokhtar and Chanod, 1997). Eachtransducer

performs a linguistic task such as identifying

seg-mentsorsyntacticstructuresanddetectingsubjects

cision rate varied between 99.2% and 92.6%. The

recallratevariedbetween97.8%and82.6%.

The NPTool parser described in (Voutilainen,

1993) identied maximal-length noun phrases.

NPtool gave a precision rate of 95-98% and a

re-callrateof98.5-100%. Theseresultswerecriticised

in (Ramshaw and Marcus, 1995) due to some

in-consistenciesandapparentmistakeswhichappeared

onthe samplegiven in (Voutilainen, 1993).

Bouri-gaultdevelopedtheLECTER parserforFrench

us-ing grammatical rules and some heuristics

(Bouri-gault, 1992). It achievedarecall rateof 95%

iden-tifyingmaximallengthterminologicalnounphrases,

buthedidnotgiveaprecisionrate,soitisdiÆcult

toevaluatetheactualperformanceof theparser.

1.2.2 Learning Techniques

These approaches automatically construct a

lan-guagemodel from alabelledand bracketedcorpus.

The rst probabilistic approach was proposed in

(Church,1988). Thismethodlearntabigrammodel

fordetectingsimplenounphrasesontheBrown

cor-pus. Givena sequence ofparts ofspeech asinput,

theChurchprograminsertsthemostprobable

open-ings and endings of NPs, using a Viterbi-like

dy-namicprogrammingalgorithm. Churchdidnotgive

precisionandrecall rates. He showedthat 5outof

243NP wereomitted,but in averysmall testwith

aPOStaggingaccuracyof99.5%.

Transformation-basedlearning(TBL)wasusedin

(Ramshaw and Marcus, 1995) to detect base NP.

Inthis work chunking was considered asa tagging

technique, so that each POS could be tagged with

I(inside baseNP),O(outsidebaseNP)orB(inside

a baseNP, but the preceding word was in another

baseNP).Thisapproachresultedinaprecisionrate

of 91.8% and a recall rate of 92.3%. This result

wasautomatically evaluated on a test set (200,000

words)extractedfromtheWSJTreebank. Themain

drawbacktothisapproacharethehighrequirements

fortimeandspacewhichareneededtotrainthe

sys-tem;itneedstotrain100templatesofcombinations

ofwords.

Thereareseveralworksthatuseamemory-based

learning algorithm. These approaches construct a

classier fora task by storinga set of examples in

memory. Eachexampleisdenedbyasetoffeatures

thathavetobelearntfromabracketedcorpus. The

Memory-Based Learning (MBL) algorithm

(Daele-mansetal.,1999)takesintoaccountlexicalandPOS

information. It stores the following features: the

wordformandPOStagofthetwowordstotheleft,

thefocuswordandonewordtotheright. This

sys-temachieved aprecisionrate of93.7% andarecall

rateof94.0%ontheWSJTreebank. However,when

(3)

de-recall rateof 90.1%. TheMemory-Based Sequence

Learning(MBSL) algorithm(Argamonetal.,1998)

learnssubstringsorsequencesofPOSandbrackets.

Precision and recall rates were 92.4% on the same

datausedin (RamshawandMarcus,1995).

A simple approach is presented in (Cardie and

Pierce,1998)called Treebank Approach(TA).This

technique matches POS sequences from an initial

nounphrasegrammarwhich wasextractedfrom an

annotated corpus. The precision achieved for each

ruleisused torankandprune therules,discarding

those rules whose score is lower than a predened

threshold. It uses alongest match heuristicto

de-termine base NP. Precisionand recall on the WSJ

Treebankwas89.4%and90.0%, respectively.

Itis diÆculttocomparethedierentapproaches

due for various reasons. Each one uses a dierent

denition of base NP. Each one is evaluated on a

dierent corpus or on dierent parts of the same

corpus. Somesystemshaveevenbeenevaluated by

hand ona verysmall test set. Table 1summarizes

theprecisionandrecallratesforlearningapproaches

thatusedataextractedfrom theWSJ Treebank.

Method NP-Precision NP-Recall

TBL 91.8 92.3

MBSL 92.4 92.4

TA 89.4 90.9

MBL 93.7 94.0

MBL(onlyPOS) 90.3 90.1

Table1: Precisionand recallratesfordierentNP

parsers.

2 General Description of our

Integrated approach to Tagging

and Chunking

We propose an integrated system (Figure 1) that

combines dierent knowledge sources(lexical

prob-abilities, LM for chunks and Contextual LM for

the sentences) in order to obtain the

correspond-ing sequence of POS tags and the shallow parsing

([SUW

1 =C

1 W

2 =C

2 SU]W

3 =C

3

:::[SUW

n =C

n SU])

from a certain input string (W

1 ;W

2 ;:::;W

n ). Our

systemisatransducercomposed bytwolevels: the

upper one represents the Contextual LM for the

sentences, and the lower one modelize the chunks

considered. Theformalismthat wehaveusedin all

levels are nite-state automata. To be exact, we

have used models of bigrams which are smoothed

usingthebackotechnique(Katz,1987)inorderto

achieve fullcoverageof thelanguage. Thebigrams

LMs (bigramprobabilities)wasobtainedby means

TAGGING

and

CHUNKING

Lexical Probabilities

Contextual LM

W1 W2 ... Wn

[SU W1/C1 W2/C2 SU] ... Wn/Cn

DECODING

LEARNING

CORPUS

LM for Chunks

Learning LM

Figure 1: OverviewoftheSystem.

1997) from the sequences of categories in the

trainingset. Then,theyhavebeenrepresentedlike

nite-stateautomata.

2.1 The learningphase.

Themodels havebeen estimatedfrom labelled and

bracketedcorpora. Thetrainingset iscomposedby

sentenceslike:

[SU W1=C1W2=C2 SU]W3=C3:::[SU Wn=CnSU]:=:

whereW

i

arethe words,C

i

arepart-of-speechtags

andSU arethechunksconsidered.

Themodelslearntare:

ContextualLM:itisasmoothedbigrammodel

learntfromthesequencesofpart-of-speechtags

(C

i

)andchunk descriptors(SU)presentinthe

trainingcorpus(seeFigure2a).

Models for the chunks: they are smoothed

bi-grammodelslearntfromthesequencesof

part-of-speech tags corresponding to each chunk of

thetrainingcorpus(seeFigure2b).

Lexical Probabilities: they areestimated from

the word frequencies, the tag frequencies and

the word per tag frequencies. A tag

dictio-nary is used which is built from the full

cor-puswhichgivesusthepossiblelexicalcategories

(POS tags)for each word;this isequivalentto

having an ideal morphological analyzer. The

probabilities foreach possible tagare assigned

from this information taking into account the

obtained statistics. Due to the fact that the

word cannot have been seen at training, or it

hasonlybeenseenin someofthepossible

cat-egories, itis compulsoryto apply asmoothing

mechanism. In our case, if the word has not

previously been seen, the same probability is

(4)

dic-<s>

C

k

</s>

C

j

C

i

<s>

C

k

</s>

C

j

C

i

C

i

P(

C

j

|

)

</SU>

<SU>

C

i

C

m

C

k

C

n

C

j

P(

W

j

|

)

...

C

i

P(

<SU>

|

)

SU

</SU>

<SU>

C

i

C

m

C

k

C

n

(a) Contextual LM

(b) LM for Chunks

(c) Integrated LM

Figure2: IntegratedLanguageModelforTaggingandChunking.

categories, the smoothing called "add one" is

applied. Afterwards,arenormalizationprocess

iscarriedout.

OncetheLMshavebeenlearnt,aregular

substi-tution of the lowermodel(s) into the upper oneis

made. In this way, we get a single Integrated LM

which shows the possible concatenations of lexical

tagsandsyntacticalunits,withtheirowntransition

probabilitieswhichalsoincludethelexical

probabil-ities aswell (see Figure 2c). Note that themodels

inFigure 2arenotsmoothed).

2.2 The DecodingProcess: Taggingand

Parsing

Thetaggingandshallowparsingprocessconsistsof

ndingoutthesequenceofstatesofmaximum

prob-abilityontheIntegratedLM foraninputsentence.

Therefore, this sequence must be compatible with

the contextual, syntactical and lexical constraints.

This process can be carried out by Dynamic

Pro-grammingusingtheViterbialgorithm,whichis

con-veniently modied to allow for transitions between

certain states of the automata without consuming

any symbols(epsilontransitions). Aportionof the

DynamicProgrammingtrellisforagenericsentence

usingtheIntegratedLMshowninFigure 2ccanbe

seenin Figure 3. Thestates of the automata that

can be reached and that are compatible with the

lexical constraints are marked with a black circle

(i.e., from the state C

k

it is possible to reach the

stateC

i

ifthetransitionisintheautomataandthe

lexical probabilityP(W

i jC

i

) is notnull). Also, the

transitions to initial and nal statesof the models

for chunks (i.e., from C

i

to < SU >) are allowed;

thesestatesaremarkedinFigure3withawhite

cir-cle and in this caseno symbolis consumed. In all

ducetransitionstotheirsuccessors(thedottedlines

inFigure3)wherenowsymbolsmustbeconsumed.

Once theDynamic Programingtrellisisbuilt,we

can obtain the maximum probability path for the

inputsentence,andthusthebestsequenceoflexical

tagsandthebestsegmentationinchunks.

</SU>

<s>

Ci

Cj

Ck

</s>

Ci

Ck

Cm

Cn

<SU>

...

Input:

Output:

...

Wn-1

Wn-2

Wn-2/Ci

</s>

SU] Wn/Ck

Wn

[SU Wn-1/Cn

Final

State

...

Figure3: PartialTrellisforProgrammingDecoding

basedontheIntegratedLM.

3 Experimental Work

Inthissectionwewill describeasetofexperiments

thatwecarriedoutin ordertodemonstratethe

(5)

on theWSJ corpus, using thePOS tag set dened

in (Marcus et al., 1993), considering only the NP

chunks dened by (Church, 1988) and using the

modelsthatwehavepresentedabove. Nevertheless,

the use of this approach on other corpora

(chang-ingthereferencelanguage),otherlexicaltagsetsor

otherkindsofchunkscanbedoneinadirect way.

3.1 Corpus Description.

We used a portion of the WSJ corpus (900,000

words), which was tagged according to the Penn

Treebank tag set and bracketed with NP markers,

totrainandtestthesystem.

The tag set contained 45 dierent tags. About

36.5% of the wordsin the corpuswere ambiguous,

with an ambiguity ratio of 2.44 tag/word over the

ambiguouswords,1.52overall.

3.2 ExperimentalResults.

Inordertotrainthemodelsandtotestthesystem,

werandomlydividedthecorporaintotwoparts:

ap-proximately800,000wordsfortrainingand 100,000

wordsfortesting.

Both thebigram modelsfor representing

contex-tualinformationandsyntacticdescriptionoftheNP

chunk and the lexical probabilities were estimated

fromtrainingsetsofdierentsizes. Due tothefact

thatwedidnotuseamorphologicalanalyserfor

En-glish, weconstructedatagdictionarywith the

lex-iconof thetrainingset and thetest set used. This

dictionarygaveus thepossiblelexical tagsforeach

wordfromthecorpus. Innocase,wasthetestused

toestimatethelexicalprobabilities.

92

93

94

95

96

97

98

99

100

0

100

200

300

400

500

600

700

800

POS Accuracy (%)

#Words x 1000

BIG

BIG-BIG

LEX

Figure 4: Accuracy Rate of Tagging on WSJ for

incrementaltrainingsets.

InFigure4,weshowtheresultsoftaggingonthe

testset in termsofthetrainingsetsize usingthree

approaches: thesimplest(LEX)isataggingprocess

whichdoesnottakecontextualinformationinto

ac-92

93

94

95

96

97

98

99

100

0

100

200

300

400

500

600

700

800

(%)

#Words x 1000

Precision

Recall

Figure5: NP-chunkingresultsonWSJfor

incremen-taltrainingsets.

Tagging NP-Chunking

Tagger Accuracy Precision Recall

BIG-BIG 96.8 94.6 93.6

Lex 94.3 90.8 91.3

BIG 96.9 94.9 94.1

IDEAL 100(assumed) 95.5 94.7

Table2: Taggingand NP-Chunking resultsfor

dif-ferentstaggers(training setof800,000words).

bethatwhichhasappearedmoreofteninthe

train-ingset. Thesecondmethodcorrespondstoatagger

basedonabigrammodel(BIG).Thethirdoneuses

the Integrated LM described in this paper

(BIG-BIG).Thetagging accuracyfor BIGand BIG-BIG

was close, 96.9% and 96.8% respectively, whereas

withoutthe use of the languagemodel (LEX), the

taggingaccuracywas2.5pointslower. Thetrendin

allthecaseswasthatanincrementinthesizeofthe

training set resulted in an increase in the tagging

accuracy. After 300,000training words, the result

becamestabilized.

In Figure 5, we show the precision (#correct

proposed NP/#proposed NP) and recall (#correct

proposed NP/#NP in the reference) rates for NP

chunking. TheresultsobtainedusingtheIntegrated

LMwereverysatisfactoryachievingaprecisionrate

of 94.6% and a recall rate of 93.6%. The

perfor-mance of the NP chunker improves as the

train-ingset size increases. This is obviously due to the

fact that the model is better learnt when the size

of thetraining set increases, and the tagging error

decreasesaswehaveseenabove.

The usual sequential process forchunking a

sen-tencecanalsobeused. Thatis,rstwetagthe

sen-tence and then we use theIntegrated LM to carry

(6)

sultsthatweobtainedfortaggingandforNP

chunk-ing. Therstrowshowstheresultwhenthetagging

andthechunkingaredoneinaintegratedway. The

followingrowsshowtheperformanceofthe

sequen-tialprocessusingdierenttaggers:

LEX: it takesinto account only lexical

proba-bilities. In this case, the tagging accuracy was

94.3%.

BIG: it is based on a bigram model that

achievedanaccuracyof96.9%.

IDEAL:itsimulatesataggerwithanaccuracy

rate of 100%. Todo this, we used the tagged

sentencesoftheWSJcorpusdirectly.

These results conrm that precision and recall

rates increase when the accuracy of the tagger is

better. The performance of the sequentialprocess

(using the BIG tagger) is slightly better than the

performance of the integrated process (BIG-BIG).

We think that this is probably becauseof the way

wecombinedtheprobabilitiesof thedierent

mod-els.

4 Conclusions and Future Work

Inthis paper,we havepresentedasystem for

Tag-ging and Chunking based on an Integrated

Lan-guage Model that uses a homogeneous formalism

(nite-state machine) to combine dierent

knowl-edge sources: lexical, syntactical and contextual

models. It isfeasiblebothin termsof performance

andalsoin termsofcomputationaleÆciency.

All themodels involvedare learntautomatically

fromdata,sothesystemisveryexibleandportable

and changes in the reference language, lexical tags

orotherkindsofchunkscanbemadeinadirectway.

The tagging accuracy (96.9% using BIG and

96.8% usingBIG-BIG) is higherthan othersimilar

approaches. This is because we haveused the tag

dictionary (including the test set in it) to restrict

the possible tags for unknown words, this

assump-tionobviouslyincreasetheratesoftagging(wehave

notdoneaquantitativestudyofthisfactor).

Aswehavementionedabove,thecomparisonwith

otherapproachesisdiÆcultdueamongotherreasons

tothefollowingones: thedenitionsofbaseNPare

notalwaysthesame, thesizes ofthe trainand the

testsetsaredierentandtheknowledgesourcesused

in the learningprocess are also dierent. The

pre-cisionforNP-chunkingissimilartootherstatistical

approaches presented in section 1, for boththe

in-tegratedprocess(94.6%)andthesequentialprocess

usingataggerbasedonbigrams(94.9%). Therecall

rateisslightlylowerthanforsomeapproachesusing

quentialsystemtakinganerrorfreeinput(IDEAL),

the performance of the system obviously increased

(95.5% precision and 94.7% recall). These results

showtheinuenceof taggingerrorsontheprocess.

Nevertheless, we are studying why the results

be-tweentheintegratedprocessandthesequential

pro-cessaredierent. Wearetestinghowthe

introduc-tionof someadjustmentfactors among the models

for weighting the dierent probability distribution

canimprovetheresults.

Themodelsthatwehaveusedinthiswork,are

bi-grams,buttrigramsoranystochasticregularmodel

canbe used. Inthis respect, we haveworked ona

morecomplex LMs, formalizedasanite-state

au-tomatawhichislearntusingGrammaticalInference

techniques. Also, ourapproach would benet from

the inclusion of lexical-contextual information into

theLM.

5 Acknowledgments

This work has been partially supported by the

SpanishResearchProjectCICYT

(TIC97-0671-C02-01/02).

References

S.Abney. 1991. Parsingby Chunks. R.Berwick,S.

AbneyandC.Tenny(eds.)Principle{based

Pars-ing.KluwerAcademicPublishers,Dordrecht.

S. Abney. 1996. Partial Parsing via Finite-State

Cascades. In Proceedings of the ESSLLI'96

Ro-bust ParsingWorkshop,Prague,CzechRepublic.

S. Argamon,I.Dagan, and Y. Krymolowski. 1998.

A Memory{basedApproach to Learning Shallow

Natural Language Patterns. In Proceedings of

the joint 17th International Conference on

Com-putational Linguistics and 36th Annual Meeting

of the Association for ComputationalLinguistics,

COLING-ACL,pages67{73,Montreal,Canada.

S. At-Mokhtar and J.P. Chanod. 1997.

Incremen-talFinite-StateParsing. InProceedingsof the5th

ConferenceonAppliedNaturalLanguagePro

cess-ing,Washington D.C.,USA.

D. Bourigault. 1992. Surface Grammatical

Anal-ysis for the Extraction of Terminological Noun

Phrases. InProceedings of the 15th International

Conference on Computational Linguistics, pages

977{981.

Eric Brill and Jun Wu. 1998. Classier

Combi-nation for Improved Lexical Disambiguation. In

Proceedings of the joint 17th International

Con-ference on Computational Linguistics and 36th

Annual Meetingof the Association for

Computa-tionalLinguistics,COLING-ACL,pages191{195,

Montreal,Canada.

(7)

tationalLinguistics,21(4):543{565.

C.Cardie andD.Pierce. 1998. Error-Driven

Prun-ningofTreebankGrammarsforBaseNounPhrase

Identication. In Proceedings of the joint 17th

International Conference on Computational

Lin-guistics and 36th Annual Meeting of the

Asso-ciation for ComputationalLinguistics,

COLING-ACL,pages218{224,Montreal,Canada,August.

K. W. Church. 1988. A Stochastic Parts Program

and Noun Phrase Parser for Unrestricted Text.

In Proceedings of the 1st Conference on Applied

Natural Language Processing, ANLP,pages 136{

143.ACL.

P. Clarksond and R. Ronsenfeld. 1997. Statistical

Language Modeling using the CMU-Cambridge

Toolkit. In Proceedings of Eurospeech, Rhodes,

Greece.

D. Cutting, J. Kupiec, J. Pederson, and P. Sibun.

1992. APracticalPart{of{speechTagger. In

Pro-ceedings of the 3rd Conference on Applied

Natu-ral Language Processing, ANLP, pages 133{140.

ACL.

W. Daelemans, J. Zavrel, P. Berck, and S. Gillis.

1996. MBT: A Memory{Based Part{of{speech

Tagger Generator. In Proceedings of the 4th

Workshop on Very Large Corpora, pages 14{27,

Copenhagen,Denmark.

W.Daelemans,S.Buchholz,andJ.Veenstra. 1999.

Memory-Based Shallow Parsing. In Proceedings

ofEMNLP/VLC-99,pages239{246,Universityof

Maryland,USA,June.

E. Ejerhed. 1988. Finding Clauses in Unrestricted

TextbyFinitaryandStochasticMethods. In

Pro-ceedings ofSecondConferenceonAppliedNatural

LanguageProcessing,pages219{227.ACL.

H.vanHalteren,J.Zavrel,andW.Daelemans.1998.

Improving Data Driven Wordclass Tagging by

System Combination. InProceedings of the joint

17th International Conference on Computational

Linguisticsand36thAnnualMeetingofthe

Asso-ciation for ComputationalLinguistics,

COLING-ACL,pages491{497,Montreal,Canada,August.

S. M.Katz. 1987. EstimationofProbabilitiesfrom

SparseData fortheLanguageModelComponent

of a Speech Recognizer. IEEE Transactions on

Acoustics,Speech andSignalProcessing,35.

D. M. Magerman. 1996. Learning Grammatical

Structure Using Statistical Decision{Trees. In

Proceedings of the 3rd International Colloquium

on Grammatical Inference, ICGI, pages 1{21.

Springer-VerlagLectureNotes Seriesin Articial

Intelligence1147.

M. P. Marcus, M. A. Marcinkiewicz, and B.

San-torini. 1993. Building aLargeAnnotatedCorpus

of English: The Penn Treebank. Computational

of-Speech Tagging Using Decision Trees. In C.

Nedellec and C. Rouveirol, editor, LNAI 1398:

Proceedings of the 10th European Conference

on Machine Learning, ECML'98, pages 25{36,

Chemnitz, Germany.Springer.

B. Merialdo. 1994. Tagging English Text with a

Probabilistic Model. Computational Linguistics,

20(2):155{171.

F. Pla and N. Prieto. 1998. Using Grammatical

Inference Methods forAutomatic Part{of{speech

Tagging. InProceedingsof1stInternational

Con-ference on Language Resources and Evaluation,

LREC, Granada,Spain.

L.Ramshawand M.Marcus. 1995. Text Chunking

Using Transformation-Based Learning. In

Pro-ceedings of third Workshop on Very Large

Cor-pora,pages82{94,June.

A.Ratnaparkhi. 1996. AMaximumEntropyPart{

of{speech Tagger. InProceedings of the 1st

Con-ference on Empirical Methods in Natural

Lan-guage Processing, EMNLP.

Atro Voutilainen and Llus Padro. 1997.

Develop-ingaHybridNPParser. InProceedings ofthe5th

ConferenceonAppliedNaturalLanguagePro

cess-ing, ANLP,pages80{87,Washington DC.ACL.

Atro Voutilainen. 1993. NPTool,aDetectorof

En-glish Noun Phrases. InProceedings of the

Work-shop onVery LargeCorpora.ACL,June.

Atro Voutilainen. 1995. A Syntax{Based Part{of{

speech Analyzer. InProceedings of the 7th

Con-ferenceoftheEuropeanChapteroftheAssociation

for Computational Linguistics, EACL, Dublin,

Ireland.

R.Weischedel,R.Schwartz,J.Palmucci,M.Meteer,

and L. Ramshaw. 1993. Copingwith Ambiguity

and UnknownWordsthroughProbabilistic