Natural Language Processing
JoakimNivre
1 Introduction
Whatis astatisticalmethod andhowcanit beused innatural language
pro-cessing(NLP)?In this paper, we willtry tothrowsomelighton this question
by examining the dierent ways in which NLP methods deserve to be called
statistical, an exercise that willhopefullythrowsomelight also on methods
thatdonotdeserve tobesocalled.
2 NLP: Problems, Models and Methods
NLPproblemshavetodowith naturallanguageinputand output. Herearea
fewtypicalanduncontroversialexamplesofNLPproblems:
Part-of-speech tagging: Annotating natural language sentences or texts
for parts-of-speech.
Natural language generation: Producing natural language sentences or
textsfromnon-linguisticrepresentations.
Machinetranslation: Translating sentences ortextsina sourcelanguage
tosentencesortextsina targetlanguage.
Inpart-of-speechtaggingwehavenaturallanguageinput,ingenerationwehave
naturallanguageoutput,and intranslationwe havebothinputand outputin
naturallanguage.
Ifouraim istobuild eective components forcomputationalsystems,then
we must develop algorithms for solving these problems. However, this is not
alwayspossible,simplybecausetheproblemsarenotwell-denedenough. The
wayoutofthisdilemmaisthesameasinmostotherbranchesofscience. Instead
ofattackingreal worldproblemsdirectly withalltheirmessydetails, webuild
mathematical models ofrealityand solveabstractproblems withinthemodels
instead. Provided that the models are worth their salt, these solutions will
provide adequateapproximations for the real problems. Anabstract problem
Q isa binaryrelation on aset I of probleminstances and aset S ofproblem
solutions. The abstract problems that are relevant to NLP are those where
either I or S (or both) are linguistic entities or representations of linguistic
entities. More precisely, an NLP problem P can be modeled by an abstract
problemQ iftheinstancesetI isasubsetofthesetofpermissibleinputstoP
AmethodforsolvinganNLPproblemP typicallyconsistsoftwoelements:
1. A mathematical model M dening an abstract problem Q that can be
used tomodelP.
2. AnalgorithmAthateectivelycomputesQ.
WewillsaythatM andAtogetherconstitutesanapplicationmethod for
prob-lem P with Q as the model problem. For example, let G be a context-free
grammar intendedtomodelthesyntaxofanaturallanguageNLandletQ be
theparsingproblemforG. ThenGtogetherwith,say,Earley'salgorithmisan
applicationmethodforsyntacticanalysisofNLwithQasthemodelproblem.
2.2 Acquisition Methods
The term acquisition method will be used to refer to any procedure for
con-structing a mathematical model that can be used in an application method.
For example, any procedure for developing a context-free grammar modeling
a naturallanguageor ahiddenMarkovmodelfor part-of-speech taggingisan
acquisition method inthis sense. Inthe following,we willconcentratealmost
exclusivelyonacquisitionmethodsthatmakeuseofmachinelearningtechniques
in orderto induce models (ormodelparameters) fromempiricaldata,
speci-callycorpus data. An empirical and algorithmicacquisition method typically
consistsoftwoelements:
1. A parameterizedmathematical model M
such thatprovidingvaluesfor
theparameters willyieldamathematical modelM thatcanbeusedin
an applicationmethodforsomeNLPproblemP.
2. AnalgorithmAthateectivelycomputesvaluesfortheparameterswhen
givenasampleofdatafromP.
Ifthedatasamplemustcontainbothinputsand(correct)outputsfromP,then
Aissaidtobeasupervised learningalgorithm. Ifitissucientwithasample
ofinputs,we haveanunsupervised learningalgorithm.
2.3 Evaluation Methods
Ifacquisitionandapplicationmethodswereinfallible,noothermethodswould
be needed. In practice, however, there are many factorswhich maycause an
NLPsystemtoperformlessthanoptimallyandwe thereforeneedmethodsfor
evaluating NLP systems. We willuse the term evaluation method to refer to
anyprocedure forevaluatingNLPsystems. However, thediscussionwillfocus
onextrinsicevaluationofsystemsintermsoftheiraccuracy. Forexample,letP
beanNLPproblem,andlet(M
1 ;A
1
)and(M
2 ;A
2
)betwodierentapplication
methods for P. A common wayof evaluating and comparing the accuracy of
thesetwomethodsistoapplythemtoarepresentativesampleofinputsfromP
and measurethe accuracyoftheoutputs producedbytherespective methods.
Aspecial caseofthisevaluationschemeiswhere A
1 =A
2
andthe modelsM
1
andM
2
Forthepurpose ofthis paper,wewillsaythatamodelormethodisstatistical
(orstochastic)if itinvolves theconcept ofprobability(orrelatednotions such
as entropyand mutual information)or if ituses conceptsof statisticaltheory
(suchasstatisticalestimationandhypothesistesting).
4 Statistical Methods in NLP
4.1 Application Methods
Mostexamplesofstatisticalapplicationmethodsintheliteratureare methods
thatmake use of a stochastic model,but where the algorithm appliedto this
modelisentirelydeterministic. Typically,theabstractmodelproblemcomputed
bythealgorithm isan optimizationproblem whichconsistsin maximizingthe
probabilityoftheoutputgiventheinput. Hereare someexamples:
Language modelingfor automaticspeech recognition using smoothed
n-gramstondthemostprobablestringofwordsw
1 ;:::;w
n
outofasetof
candidate stringscompatiblewiththeacousticdata[Jelinek1990].
Part-of-speechtaggingusinghiddenMarkovmodelstondthemost
prob-abletagsequencet
1 ;:::t
n
givenawordsequencew
1 ;:::w
n
[Merialdo1994].
Syntacticparsingusingprobabilisticgrammarstondthemostprobable
parsetreeT givenawordsequencew
1 ;:::;w
n
(ortagsequencet
1 ;:::;t
n )
[Stolcke1995].
Word sense disambiguation using Bayesian classiers to nd the most
probablesensesforwordw incontextC [Galeet al 1992].
Machine translation using probabilistic models to nd the most
proba-ble target language sentence T for a given source language sentence S
[Brownetal 1990a].
4.2 Acquisition Methods
Statisticalacquisitionmethodsaremethodsthatrelyonstatisticalinference to
inducemodels(ormodelparameters)fromempiricaldata,inparticularcorpus
data. Themodelinduced mayormaynotbea stochasticmodel,whichmeans
thatthereareasmanyvariationsinthisareaastherearedierentNLPmodels.
Wewillthereforelimitourselvestoafewrepresentativeexamples:
SupervisedlearningofHMMtaggers[Merialdo1994].
UnsupervisedlearningofHMMtaggers [Cuttingetal 1992]
Transformation-basedlearning[Brill1995].
EvaluationofNLPsystemscanhavedierentpurposesandconsidermany
dif-ferentdimensionsofasystem. Consequently,thereareawidevarietyofmethods
thatcanbeusedforevaluation.Manyofthesemethodsinvolveempirical
exper-iments orquasi-experiments inwhichthesystem isappliedtoarepresentative
sampleofdata inordertoprovidequantitativemeasures ofaspectssuchas
ef-ciency, accuracy and robustness. These evaluation methodscanmake use of
statisticsinatleastthreedierentways:
Descriptive statistics is often used to derive measures such as accuracy
rate,recallandprecision.
Statistical estimation may be used to derive condence intervals for
de-scriptive measures.
Hypothesistestingmaybeused totestthesignicanceofanydierences
foundwhencomparingalternativemethods.
5 Conclusion
Inthis paper, wehave discussedthreedierentkindsofmethodsthatare
rele-vantinnaturallanguageprocessing:
An application method is used to solve an NLP problem P, usually by
applyinganalgorithmAtoamathematicalmodelM inordertosolvean
abstractproblemQapproximatingP.
AnacquisitionmethodforanNLPproblemP isusedtoconstructamodel
M thatcanbeused in anapplication method forP. Ofspecial interest
here are empirical and algorithmic acquisition methodsthat allowus to
constructM fromaparameterizedmodelM
byapplyinganalgorithmA
toarepresentative sampleofP.
Anevaluationmethod foranNLPproblemP isusedtoevaluate
applica-tionmethodsforP. Ofspecial interesthereareexperimental(or
empiri-cal)evaluationmethodsthatallowustoevaluateapplicationmethodsby
applying themtoarepresentativesampleofP.
We have argued that statistics, in the wide sense including both stochastic
modelsandstatisticaltheory,canplayaroleinallthreekindsofmethodsandwe
havesuppliednumerousexamplestosubstantiatethisclaim. 1
Wehavealsotried
toshowthattherearemanywaysinwhichstatisticalmethodscanbecombined
withtraditionallinguisticrulesandrepresentation,bothinapplicationmethods
[Brill1995] Brill, E. (1995) Transformation-Based Error-DrivenLearning and
NaturalLanguageProcessing: ACaseStudyinPart-of-SpeechTagging.
Com-putational Linguistics, 21(4),543566.
[Brownet al 1990a] Brown,P.,Cocke,J.,DellaPietra,S.,DellaPietra,V.,
Je-linek,F.,Laerty,J.,Mercer,R.andRossin,P.(1990)AStatisticalApproach
toMachineTranslation.ComputationalLinguistics 16(2),7985.
[Cuttinget al 1992] Cutting,D.,Kupiec,J.,Pedersen,J.andSibun,P.(1992).
A PracticalPart-of-speech Tagger. In Third Conference onApplied Natural
LanguageProcessing,ACL,133140.
[Galeet al 1992] Gale, W. A., Church, K. W. and Yarowsky, D. (1992) A
MethodforDisambiguatingWordSensesinaLargeCorpus.Computersand
the Humanities 26,415439.
[Jelinek1990] Jelinek,F. (1990)Self-OrganizedLanguageModelingforSpeech
Recognition. InWaibel,A.andLee,K.-F.(eds)ReadingsinSpeech Re
cogni-tion,pp.450506.LosAltos,CA: MorganKaufman.
[Magerman1995] Magerman, D. (1995) Statistical Decision-Tree Models for
Parsing. In Proceedings of the 33rd Annual Meeting of the Association for
Computational Linguistics,276283.
[Merialdo1994] Merialdo, B.(1994)TaggingEnglishTextwithaProbabilistic
Model.ComputationalLinguistics 20(2),155172.
[Stolcke 1995] Stolcke,A.(1995) AnEcientProbabilistic Context-Free
Pars-ing Algorithm ThatComputesPrex Probabilities. Computational