Ferran Pla, Antonio Molina and Natividad Prieto
Universitat Politecnicade Valencia
Departament de Sistemes Informaticsi Computacio
Cam de Vera s/n
46020 Valencia
ffpla,amolina,[email protected]
Abstract
In this paper we present an integrated system for
taggingandchunkingtextsfromacertainlanguage.
The approach is based on stochastic nite-state
models thatarelearntautomatically. Thisincludes
bigrammodelsornite-stateautomatalearntusing
grammaticalinferencetechniques.Asthemodels
in-volved in our system are learnt automatically, this
isaveryexibleandportablesystem.
Inordertoshowtheviabilityofourapproachwe
present results for tagging and chunking using
bi-grammodelsontheWallStreetJournalcorpus. We
haveachievedanaccuracyratefortaggingof96.8%,
andaprecisionrateforNP chunks of94.6%witha
recallrateof93.6%.
1 Introduction
PartofSpeechTaggingandShallowParsingaretwo
well-knownproblemsin Natural Language
Process-ing. ATaggercanbeconsideredasatranslatorthat
readssentencesfromacertainlanguageandoutputs
thecorrespondingsequencesofpartofspeech(POS)
tags,takingintoaccountthecontextin which each
wordofthesentenceappears. AShallowParser
in-volvesdividing sentences into non-overlapping
seg-ments on the basis of very supercial analysis. It
includes discovering the main constituents of the
sentences (NPs, VPs, PPs, ...) and their heads.
ShallowParsingusuallyidentiesnon-recursive
con-stituents, alsocalledchunks (Abney, 1991)(suchas
non-recursive Noun Phrases or base NP, base VP,
and so on). It can include determining syntactical
relationshipssuchas subject-verb,verb-object,etc.,
Shallow parsing which always follows the tagging
process,isusedasafastandreliablepre-processing
phasefor fullor partial parsing. It can beused for
InformationRetrievalSystems,Information
Extrac-tion,TextSummarizationandBilingualAlignment.
In addition, it is also used to solve computational
linguisticstaskssuchasdisambiguationproblems.
1.1 POS Tagging Approaches
The dierent approaches for solving this problem
onthetendencies followedforestablishingthe
Lan-guage Model (LM): the linguistic approach, based
onhand-coded linguisticrulesand thelearning
ap-proach derived from a corpora (labelled or
non-labelled). Other approximations that use hybrid
methods havealsobeenproposed (Voutilainen and
Padro,1997).
In the linguistic approach, an expert linguist is
neededtoformalisetherestrictionsofthelanguage.
Thisimplies averyhigh costand it is very
depen-dent on each particular language. We can nd an
importantcontribution(Voutilainen,1995)thatuses
Constraint Grammarformalism. Supervised
learn-ingmethodswere proposed in(Brill,1995)to learn
a set of transformation rules that repair the error
committedby aprobabilistictagger. The main
ad-vantageofthelinguisticapproachisthatthemodel
is constructed from a linguistic point of view and
containsmanyandcomplexkindsofknowledge.
In the learning approach, the most extended
formalism is based on n-grams or HMM. In this
case, the language model can be estimated from
a labelled corpus (supervised methods) (Church,
1988)(Weischedel et al., 1993) or from a
non-labelledcorpus(unsupervisedmethods)(Cutting et
al.,1992). Intherstcase,themodelistrainedfrom
therelativeobservedfrequencies. Inthesecondone,
the model is learned using the Baum-Welch
algo-rithmfromaninitialmodelwhichisestimatedusing
labelled corpora (Merialdo, 1994). The advantages
oftheunsupervisedapproacharethefacilitytobuild
language models, the exibility of choice of
cate-goriesandtheeaseofapplicationtootherlanguages.
Wecanndsomeothermachine-learningapproaches
that use moresophisticated LMs, such asDecision
Trees (Marquez and Rodrguez, 1998)(Magerman,
1996),memory-basedapproachestolearnspecial
de-cisiontrees(Daelemans et al.,1996),maximum
en-tropy approaches that combine statistical
informa-tionfrom dierentsources (Ratnaparkhi, 1996),
-nitestateautomatainferred usingGrammatical
In-ference(PlaandPrieto, 1998),etc.
dif-the size ofthe vocabulary,the ambiguity, the
diÆ-culty ofthe testset,etc. The best resultsreported
ontheWallStreetJournal(WSJ)Treebank(Marcus
etal.,1993),usingstatisticallanguagemodels,have
anaccuracyratiobetween95%and97%(depending
on the dierent factorsmentioned above). For the
linguisticapproachtheresultsarebetter. For
exam-ple, in (Voutilainen, 1995)an accuracy of 99.7%is
reported, but certain ambiguities in the output
re-mainunsolved. Someworkshaverecentlybeen
pub-lished(BrillandWu, 1998)inwhichasetoftaggers
arecombinedinordertoimprovetheirperformance.
Insomecases,thesemethodsachieveanaccuracyof
97.9%(Halterenetal.,1998).
1.2 Shallow Parsing Approaches
Since the early 90's, several techniques for
carry-ingoutshallowparsinghavebeendeveloped. These
techniques can also be classied into two main
groups: based on hand-coded linguistic rules and
based on learning algorithms. These approaches
have a common characteristic: they take the
se-quence oflexicaltagsproposedbyaPOStaggeras
input, for boththe learningand thechunking
pro-cesses.
1.2.1 Techniques based on hand-coded
linguisticrules
These methodsuseahand-writtenset ofrulesthat
are dened using POS as terminals of the
gram-mar. Most of these works use nite statemethods
fordetectingchunks orforaccomplishingother
lin-guistic tasks (Ejerhed, 1988), (Abney, 1996),
(At-Mokhtar and Chanod, 1997). Other works use
dif-ferent grammatical formalisms, such as constraint
grammars(Voutilainen,1993),orcombinethe
gram-marruleswithasetofheuristics(Bourigault,1992).
Theseworksusuallyuseasmalltestsetthatis
man-uallyevaluated, sotheachieved resultsarenot
sig-nicant. The regular expressions dened in
(Ejer-hed,1988)identiedbothnon-recursiveclausesand
non-recursiveNPs inEnglishtext. The
experimen-tationontheBrowncorpusachievedaprecisionrate
of 87% (for clauses) and 97.8 % (for NPs).
Ab-neyintroducedtheconceptof chunk(Abney, 1991)
andpresentedanincrementalpartialparser(Abney,
1996). This parser identies chunks base on the
parts of speech, and it then chooses how to
com-bine them for higherlevel analysisusing lexical
in-formation. Theaverageprecisionandrecallratesfor
chunkswere87.9%and87.1%,respectively,onatest
set of 1000 sentences. An incremental architecture
ofnite-statetransducersforFrenchispresentedin
(At-Mokhtar and Chanod, 1997). Eachtransducer
performs a linguistic task such as identifying
seg-mentsorsyntacticstructuresanddetectingsubjects
cision rate varied between 99.2% and 92.6%. The
recallratevariedbetween97.8%and82.6%.
The NPTool parser described in (Voutilainen,
1993) identied maximal-length noun phrases.
NPtool gave a precision rate of 95-98% and a
re-callrateof98.5-100%. Theseresultswerecriticised
in (Ramshaw and Marcus, 1995) due to some
in-consistenciesandapparentmistakeswhichappeared
onthe samplegiven in (Voutilainen, 1993).
Bouri-gaultdevelopedtheLECTER parserforFrench
us-ing grammatical rules and some heuristics
(Bouri-gault, 1992). It achievedarecall rateof 95%
iden-tifyingmaximallengthterminologicalnounphrases,
buthedidnotgiveaprecisionrate,soitisdiÆcult
toevaluatetheactualperformanceof theparser.
1.2.2 Learning Techniques
These approaches automatically construct a
lan-guagemodel from alabelledand bracketedcorpus.
The rst probabilistic approach was proposed in
(Church,1988). Thismethodlearntabigrammodel
fordetectingsimplenounphrasesontheBrown
cor-pus. Givena sequence ofparts ofspeech asinput,
theChurchprograminsertsthemostprobable
open-ings and endings of NPs, using a Viterbi-like
dy-namicprogrammingalgorithm. Churchdidnotgive
precisionandrecall rates. He showedthat 5outof
243NP wereomitted,but in averysmall testwith
aPOStaggingaccuracyof99.5%.
Transformation-basedlearning(TBL)wasusedin
(Ramshaw and Marcus, 1995) to detect base NP.
Inthis work chunking was considered asa tagging
technique, so that each POS could be tagged with
I(inside baseNP),O(outsidebaseNP)orB(inside
a baseNP, but the preceding word was in another
baseNP).Thisapproachresultedinaprecisionrate
of 91.8% and a recall rate of 92.3%. This result
wasautomatically evaluated on a test set (200,000
words)extractedfromtheWSJTreebank. Themain
drawbacktothisapproacharethehighrequirements
fortimeandspacewhichareneededtotrainthe
sys-tem;itneedstotrain100templatesofcombinations
ofwords.
Thereareseveralworksthatuseamemory-based
learning algorithm. These approaches construct a
classier fora task by storinga set of examples in
memory. Eachexampleisdenedbyasetoffeatures
thathavetobelearntfromabracketedcorpus. The
Memory-Based Learning (MBL) algorithm
(Daele-mansetal.,1999)takesintoaccountlexicalandPOS
information. It stores the following features: the
wordformandPOStagofthetwowordstotheleft,
thefocuswordandonewordtotheright. This
sys-temachieved aprecisionrate of93.7% andarecall
rateof94.0%ontheWSJTreebank. However,when
de-recall rateof 90.1%. TheMemory-Based Sequence
Learning(MBSL) algorithm(Argamonetal.,1998)
learnssubstringsorsequencesofPOSandbrackets.
Precision and recall rates were 92.4% on the same
datausedin (RamshawandMarcus,1995).
A simple approach is presented in (Cardie and
Pierce,1998)called Treebank Approach(TA).This
technique matches POS sequences from an initial
nounphrasegrammarwhich wasextractedfrom an
annotated corpus. The precision achieved for each
ruleisused torankandprune therules,discarding
those rules whose score is lower than a predened
threshold. It uses alongest match heuristicto
de-termine base NP. Precisionand recall on the WSJ
Treebankwas89.4%and90.0%, respectively.
Itis diÆculttocomparethedierentapproaches
due for various reasons. Each one uses a dierent
denition of base NP. Each one is evaluated on a
dierent corpus or on dierent parts of the same
corpus. Somesystemshaveevenbeenevaluated by
hand ona verysmall test set. Table 1summarizes
theprecisionandrecallratesforlearningapproaches
thatusedataextractedfrom theWSJ Treebank.
Method NP-Precision NP-Recall
TBL 91.8 92.3
MBSL 92.4 92.4
TA 89.4 90.9
MBL 93.7 94.0
MBL(onlyPOS) 90.3 90.1
Table1: Precisionand recallratesfordierentNP
parsers.
2 General Description of our
Integrated approach to Tagging
and Chunking
We propose an integrated system (Figure 1) that
combines dierent knowledge sources(lexical
prob-abilities, LM for chunks and Contextual LM for
the sentences) in order to obtain the
correspond-ing sequence of POS tags and the shallow parsing
([SUW
1 =C
1 W
2 =C
2 SU]W
3 =C
3
:::[SUW
n =C
n SU])
from a certain input string (W
1 ;W
2 ;:::;W
n ). Our
systemisatransducercomposed bytwolevels: the
upper one represents the Contextual LM for the
sentences, and the lower one modelize the chunks
considered. Theformalismthat wehaveusedin all
levels are nite-state automata. To be exact, we
have used models of bigrams which are smoothed
usingthebackotechnique(Katz,1987)inorderto
achieve fullcoverageof thelanguage. Thebigrams
LMs (bigramprobabilities)wasobtainedby means
TAGGING
and
CHUNKING
Lexical Probabilities
Contextual LM
W1 W2 ... Wn
[SU W1/C1 W2/C2 SU] ... Wn/Cn
DECODING
LEARNING
CORPUS
LM for Chunks
Learning LM
Figure 1: OverviewoftheSystem.
1997) from the sequences of categories in the
trainingset. Then,theyhavebeenrepresentedlike
nite-stateautomata.
2.1 The learningphase.
Themodels havebeen estimatedfrom labelled and
bracketedcorpora. Thetrainingset iscomposedby
sentenceslike:
[SU W1=C1W2=C2 SU]W3=C3:::[SU Wn=CnSU]:=:
whereW
i
arethe words,C
i
arepart-of-speechtags
andSU arethechunksconsidered.
Themodelslearntare:
ContextualLM:itisasmoothedbigrammodel
learntfromthesequencesofpart-of-speechtags
(C
i
)andchunk descriptors(SU)presentinthe
trainingcorpus(seeFigure2a).
Models for the chunks: they are smoothed
bi-grammodelslearntfromthesequencesof
part-of-speech tags corresponding to each chunk of
thetrainingcorpus(seeFigure2b).
Lexical Probabilities: they areestimated from
the word frequencies, the tag frequencies and
the word per tag frequencies. A tag
dictio-nary is used which is built from the full
cor-puswhichgivesusthepossiblelexicalcategories
(POS tags)for each word;this isequivalentto
having an ideal morphological analyzer. The
probabilities foreach possible tagare assigned
from this information taking into account the
obtained statistics. Due to the fact that the
word cannot have been seen at training, or it
hasonlybeenseenin someofthepossible
cat-egories, itis compulsoryto apply asmoothing
mechanism. In our case, if the word has not
previously been seen, the same probability is
dic-<s>
C
k
</s>
C
j
C
i
<s>
C
k
</s>
C
j
C
i
C
i
P(
C
j
|
)
</SU>
<SU>
C
i
C
m
C
k
C
n
C
j
P(
W
j
|
)
...
...
C
i
P(
<SU>
|
)
SU
</SU>
<SU>
C
i
C
m
C
k
C
n
(a) Contextual LM
(b) LM for Chunks
(c) Integrated LM
Figure2: IntegratedLanguageModelforTaggingandChunking.
categories, the smoothing called "add one" is
applied. Afterwards,arenormalizationprocess
iscarriedout.
OncetheLMshavebeenlearnt,aregular
substi-tution of the lowermodel(s) into the upper oneis
made. In this way, we get a single Integrated LM
which shows the possible concatenations of lexical
tagsandsyntacticalunits,withtheirowntransition
probabilitieswhichalsoincludethelexical
probabil-ities aswell (see Figure 2c). Note that themodels
inFigure 2arenotsmoothed).
2.2 The DecodingProcess: Taggingand
Parsing
Thetaggingandshallowparsingprocessconsistsof
ndingoutthesequenceofstatesofmaximum
prob-abilityontheIntegratedLM foraninputsentence.
Therefore, this sequence must be compatible with
the contextual, syntactical and lexical constraints.
This process can be carried out by Dynamic
Pro-grammingusingtheViterbialgorithm,whichis
con-veniently modied to allow for transitions between
certain states of the automata without consuming
any symbols(epsilontransitions). Aportionof the
DynamicProgrammingtrellisforagenericsentence
usingtheIntegratedLMshowninFigure 2ccanbe
seenin Figure 3. Thestates of the automata that
can be reached and that are compatible with the
lexical constraints are marked with a black circle
(i.e., from the state C
k
it is possible to reach the
stateC
i
ifthetransitionisintheautomataandthe
lexical probabilityP(W
i jC
i
) is notnull). Also, the
transitions to initial and nal statesof the models
for chunks (i.e., from C
i
to < SU >) are allowed;
thesestatesaremarkedinFigure3withawhite
cir-cle and in this caseno symbolis consumed. In all
ducetransitionstotheirsuccessors(thedottedlines
inFigure3)wherenowsymbolsmustbeconsumed.
Once theDynamic Programingtrellisisbuilt,we
can obtain the maximum probability path for the
inputsentence,andthusthebestsequenceoflexical
tagsandthebestsegmentationinchunks.
</SU>
<s>
Ci
Cj
Ck
</s>
Ci
Ck
Cm
Cn
<SU>
...
Input:
Output:
...
Wn-1
Wn-2
Wn-2/Ci
</s>
</s>
SU] Wn/Ck
Wn
[SU Wn-1/Cn
Final
State
...
Figure3: PartialTrellisforProgrammingDecoding
basedontheIntegratedLM.
3 Experimental Work
Inthissectionwewill describeasetofexperiments
thatwecarriedoutin ordertodemonstratethe
on theWSJ corpus, using thePOS tag set dened
in (Marcus et al., 1993), considering only the NP
chunks dened by (Church, 1988) and using the
modelsthatwehavepresentedabove. Nevertheless,
the use of this approach on other corpora
(chang-ingthereferencelanguage),otherlexicaltagsetsor
otherkindsofchunkscanbedoneinadirect way.
3.1 Corpus Description.
We used a portion of the WSJ corpus (900,000
words), which was tagged according to the Penn
Treebank tag set and bracketed with NP markers,
totrainandtestthesystem.
The tag set contained 45 dierent tags. About
36.5% of the wordsin the corpuswere ambiguous,
with an ambiguity ratio of 2.44 tag/word over the
ambiguouswords,1.52overall.
3.2 ExperimentalResults.
Inordertotrainthemodelsandtotestthesystem,
werandomlydividedthecorporaintotwoparts:
ap-proximately800,000wordsfortrainingand 100,000
wordsfortesting.
Both thebigram modelsfor representing
contex-tualinformationandsyntacticdescriptionoftheNP
chunk and the lexical probabilities were estimated
fromtrainingsetsofdierentsizes. Due tothefact
thatwedidnotuseamorphologicalanalyserfor
En-glish, weconstructedatagdictionarywith the
lex-iconof thetrainingset and thetest set used. This
dictionarygaveus thepossiblelexical tagsforeach
wordfromthecorpus. Innocase,wasthetestused
toestimatethelexicalprobabilities.
92
93
94
95
96
97
98
99
100
0
100
200
300
400
500
600
700
800
POS Accuracy (%)
#Words x 1000
BIG
BIG-BIG
LEX
Figure 4: Accuracy Rate of Tagging on WSJ for
incrementaltrainingsets.
InFigure4,weshowtheresultsoftaggingonthe
testset in termsofthetrainingsetsize usingthree
approaches: thesimplest(LEX)isataggingprocess
whichdoesnottakecontextualinformationinto
ac-92
93
94
95
96
97
98
99
100
0
100
200
300
400
500
600
700
800
(%)
#Words x 1000
Precision
Recall
Figure5: NP-chunkingresultsonWSJfor
incremen-taltrainingsets.
Tagging NP-Chunking
Tagger Accuracy Precision Recall
BIG-BIG 96.8 94.6 93.6
Lex 94.3 90.8 91.3
BIG 96.9 94.9 94.1
IDEAL 100(assumed) 95.5 94.7
Table2: Taggingand NP-Chunking resultsfor
dif-ferentstaggers(training setof800,000words).
bethatwhichhasappearedmoreofteninthe
train-ingset. Thesecondmethodcorrespondstoatagger
basedonabigrammodel(BIG).Thethirdoneuses
the Integrated LM described in this paper
(BIG-BIG).Thetagging accuracyfor BIGand BIG-BIG
was close, 96.9% and 96.8% respectively, whereas
withoutthe use of the languagemodel (LEX), the
taggingaccuracywas2.5pointslower. Thetrendin
allthecaseswasthatanincrementinthesizeofthe
training set resulted in an increase in the tagging
accuracy. After 300,000training words, the result
becamestabilized.
In Figure 5, we show the precision (#correct
proposed NP/#proposed NP) and recall (#correct
proposed NP/#NP in the reference) rates for NP
chunking. TheresultsobtainedusingtheIntegrated
LMwereverysatisfactoryachievingaprecisionrate
of 94.6% and a recall rate of 93.6%. The
perfor-mance of the NP chunker improves as the
train-ingset size increases. This is obviously due to the
fact that the model is better learnt when the size
of thetraining set increases, and the tagging error
decreasesaswehaveseenabove.
The usual sequential process forchunking a
sen-tencecanalsobeused. Thatis,rstwetagthe
sen-tence and then we use theIntegrated LM to carry
sultsthatweobtainedfortaggingandforNP
chunk-ing. Therstrowshowstheresultwhenthetagging
andthechunkingaredoneinaintegratedway. The
followingrowsshowtheperformanceofthe
sequen-tialprocessusingdierenttaggers:
LEX: it takesinto account only lexical
proba-bilities. In this case, the tagging accuracy was
94.3%.
BIG: it is based on a bigram model that
achievedanaccuracyof96.9%.
IDEAL:itsimulatesataggerwithanaccuracy
rate of 100%. Todo this, we used the tagged
sentencesoftheWSJcorpusdirectly.
These results conrm that precision and recall
rates increase when the accuracy of the tagger is
better. The performance of the sequentialprocess
(using the BIG tagger) is slightly better than the
performance of the integrated process (BIG-BIG).
We think that this is probably becauseof the way
wecombinedtheprobabilitiesof thedierent
mod-els.
4 Conclusions and Future Work
Inthis paper,we havepresentedasystem for
Tag-ging and Chunking based on an Integrated
Lan-guage Model that uses a homogeneous formalism
(nite-state machine) to combine dierent
knowl-edge sources: lexical, syntactical and contextual
models. It isfeasiblebothin termsof performance
andalsoin termsofcomputationaleÆciency.
All themodels involvedare learntautomatically
fromdata,sothesystemisveryexibleandportable
and changes in the reference language, lexical tags
orotherkindsofchunkscanbemadeinadirectway.
The tagging accuracy (96.9% using BIG and
96.8% usingBIG-BIG) is higherthan othersimilar
approaches. This is because we haveused the tag
dictionary (including the test set in it) to restrict
the possible tags for unknown words, this
assump-tionobviouslyincreasetheratesoftagging(wehave
notdoneaquantitativestudyofthisfactor).
Aswehavementionedabove,thecomparisonwith
otherapproachesisdiÆcultdueamongotherreasons
tothefollowingones: thedenitionsofbaseNPare
notalwaysthesame, thesizes ofthe trainand the
testsetsaredierentandtheknowledgesourcesused
in the learningprocess are also dierent. The
pre-cisionforNP-chunkingissimilartootherstatistical
approaches presented in section 1, for boththe
in-tegratedprocess(94.6%)andthesequentialprocess
usingataggerbasedonbigrams(94.9%). Therecall
rateisslightlylowerthanforsomeapproachesusing
quentialsystemtakinganerrorfreeinput(IDEAL),
the performance of the system obviously increased
(95.5% precision and 94.7% recall). These results
showtheinuenceof taggingerrorsontheprocess.
Nevertheless, we are studying why the results
be-tweentheintegratedprocessandthesequential
pro-cessaredierent. Wearetestinghowthe
introduc-tionof someadjustmentfactors among the models
for weighting the dierent probability distribution
canimprovetheresults.
Themodelsthatwehaveusedinthiswork,are
bi-grams,buttrigramsoranystochasticregularmodel
canbe used. Inthis respect, we haveworked ona
morecomplex LMs, formalizedasanite-state
au-tomatawhichislearntusingGrammaticalInference
techniques. Also, ourapproach would benet from
the inclusion of lexical-contextual information into
theLM.
5 Acknowledgments
This work has been partially supported by the
SpanishResearchProjectCICYT
(TIC97-0671-C02-01/02).
References
S.Abney. 1991. Parsingby Chunks. R.Berwick,S.
AbneyandC.Tenny(eds.)Principle{based
Pars-ing.KluwerAcademicPublishers,Dordrecht.
S. Abney. 1996. Partial Parsing via Finite-State
Cascades. In Proceedings of the ESSLLI'96
Ro-bust ParsingWorkshop,Prague,CzechRepublic.
S. Argamon,I.Dagan, and Y. Krymolowski. 1998.
A Memory{basedApproach to Learning Shallow
Natural Language Patterns. In Proceedings of
the joint 17th International Conference on
Com-putational Linguistics and 36th Annual Meeting
of the Association for ComputationalLinguistics,
COLING-ACL,pages67{73,Montreal,Canada.
S. At-Mokhtar and J.P. Chanod. 1997.
Incremen-talFinite-StateParsing. InProceedingsof the5th
ConferenceonAppliedNaturalLanguagePro
cess-ing,Washington D.C.,USA.
D. Bourigault. 1992. Surface Grammatical
Anal-ysis for the Extraction of Terminological Noun
Phrases. InProceedings of the 15th International
Conference on Computational Linguistics, pages
977{981.
Eric Brill and Jun Wu. 1998. Classier
Combi-nation for Improved Lexical Disambiguation. In
Proceedings of the joint 17th International
Con-ference on Computational Linguistics and 36th
Annual Meetingof the Association for
Computa-tionalLinguistics,COLING-ACL,pages191{195,
Montreal,Canada.
tationalLinguistics,21(4):543{565.
C.Cardie andD.Pierce. 1998. Error-Driven
Prun-ningofTreebankGrammarsforBaseNounPhrase
Identication. In Proceedings of the joint 17th
International Conference on Computational
Lin-guistics and 36th Annual Meeting of the
Asso-ciation for ComputationalLinguistics,
COLING-ACL,pages218{224,Montreal,Canada,August.
K. W. Church. 1988. A Stochastic Parts Program
and Noun Phrase Parser for Unrestricted Text.
In Proceedings of the 1st Conference on Applied
Natural Language Processing, ANLP,pages 136{
143.ACL.
P. Clarksond and R. Ronsenfeld. 1997. Statistical
Language Modeling using the CMU-Cambridge
Toolkit. In Proceedings of Eurospeech, Rhodes,
Greece.
D. Cutting, J. Kupiec, J. Pederson, and P. Sibun.
1992. APracticalPart{of{speechTagger. In
Pro-ceedings of the 3rd Conference on Applied
Natu-ral Language Processing, ANLP, pages 133{140.
ACL.
W. Daelemans, J. Zavrel, P. Berck, and S. Gillis.
1996. MBT: A Memory{Based Part{of{speech
Tagger Generator. In Proceedings of the 4th
Workshop on Very Large Corpora, pages 14{27,
Copenhagen,Denmark.
W.Daelemans,S.Buchholz,andJ.Veenstra. 1999.
Memory-Based Shallow Parsing. In Proceedings
ofEMNLP/VLC-99,pages239{246,Universityof
Maryland,USA,June.
E. Ejerhed. 1988. Finding Clauses in Unrestricted
TextbyFinitaryandStochasticMethods. In
Pro-ceedings ofSecondConferenceonAppliedNatural
LanguageProcessing,pages219{227.ACL.
H.vanHalteren,J.Zavrel,andW.Daelemans.1998.
Improving Data Driven Wordclass Tagging by
System Combination. InProceedings of the joint
17th International Conference on Computational
Linguisticsand36thAnnualMeetingofthe
Asso-ciation for ComputationalLinguistics,
COLING-ACL,pages491{497,Montreal,Canada,August.
S. M.Katz. 1987. EstimationofProbabilitiesfrom
SparseData fortheLanguageModelComponent
of a Speech Recognizer. IEEE Transactions on
Acoustics,Speech andSignalProcessing,35.
D. M. Magerman. 1996. Learning Grammatical
Structure Using Statistical Decision{Trees. In
Proceedings of the 3rd International Colloquium
on Grammatical Inference, ICGI, pages 1{21.
Springer-VerlagLectureNotes Seriesin Articial
Intelligence1147.
M. P. Marcus, M. A. Marcinkiewicz, and B.
San-torini. 1993. Building aLargeAnnotatedCorpus
of English: The Penn Treebank. Computational
of-Speech Tagging Using Decision Trees. In C.
Nedellec and C. Rouveirol, editor, LNAI 1398:
Proceedings of the 10th European Conference
on Machine Learning, ECML'98, pages 25{36,
Chemnitz, Germany.Springer.
B. Merialdo. 1994. Tagging English Text with a
Probabilistic Model. Computational Linguistics,
20(2):155{171.
F. Pla and N. Prieto. 1998. Using Grammatical
Inference Methods forAutomatic Part{of{speech
Tagging. InProceedingsof1stInternational
Con-ference on Language Resources and Evaluation,
LREC, Granada,Spain.
L.Ramshawand M.Marcus. 1995. Text Chunking
Using Transformation-Based Learning. In
Pro-ceedings of third Workshop on Very Large
Cor-pora,pages82{94,June.
A.Ratnaparkhi. 1996. AMaximumEntropyPart{
of{speech Tagger. InProceedings of the 1st
Con-ference on Empirical Methods in Natural
Lan-guage Processing, EMNLP.
Atro Voutilainen and Llus Padro. 1997.
Develop-ingaHybridNPParser. InProceedings ofthe5th
ConferenceonAppliedNaturalLanguagePro
cess-ing, ANLP,pages80{87,Washington DC.ACL.
Atro Voutilainen. 1993. NPTool,aDetectorof
En-glish Noun Phrases. InProceedings of the
Work-shop onVery LargeCorpora.ACL,June.
Atro Voutilainen. 1995. A Syntax{Based Part{of{
speech Analyzer. InProceedings of the 7th
Con-ferenceoftheEuropeanChapteroftheAssociation
for Computational Linguistics, EACL, Dublin,
Ireland.
R.Weischedel,R.Schwartz,J.Palmucci,M.Meteer,
and L. Ramshaw. 1993. Copingwith Ambiguity
and UnknownWordsthroughProbabilistic