On Statistical Methods in Natural Language Processing

(1)

Natural Language Processing

JoakimNivre

1 Introduction

Whatis astatisticalmethod andhowcanit beused innatural language

pro-cessing(NLP)?In this paper, we willtry tothrowsomelighton this question

by examining the dierent ways in which NLP methods deserve to be called

statistical, an exercise that willhopefullythrowsomelight also on methods

thatdonotdeserve tobesocalled.

2 NLP: Problems, Models and Methods

NLPproblemshavetodowith naturallanguageinputand output. Herearea

fewtypicalanduncontroversialexamplesofNLPproblems:

Part-of-speech tagging: Annotating natural language sentences or texts

for parts-of-speech.

Natural language generation: Producing natural language sentences or

textsfromnon-linguisticrepresentations.

Machinetranslation: Translating sentences ortextsina sourcelanguage

tosentencesortextsina targetlanguage.

Inpart-of-speechtaggingwehavenaturallanguageinput,ingenerationwehave

naturallanguageoutput,and intranslationwe havebothinputand outputin

naturallanguage.

Ifouraim istobuild eective components forcomputationalsystems,then

we must develop algorithms for solving these problems. However, this is not

alwayspossible,simplybecausetheproblemsarenotwell-denedenough. The

wayoutofthisdilemmaisthesameasinmostotherbranchesofscience. Instead

ofattackingreal worldproblemsdirectly withalltheirmessydetails, webuild

mathematical models ofrealityand solveabstractproblems withinthemodels

instead. Provided that the models are worth their salt, these solutions will

provide adequateapproximations for the real problems. Anabstract problem

Q isa binaryrelation on aset I of probleminstances and aset S ofproblem

solutions. The abstract problems that are relevant to NLP are those where

either I or S (or both) are linguistic entities or representations of linguistic

entities. More precisely, an NLP problem P can be modeled by an abstract

problemQ iftheinstancesetI isasubsetofthesetofpermissibleinputstoP

(2)

AmethodforsolvinganNLPproblemP typicallyconsistsoftwoelements:

1. A mathematical model M dening an abstract problem Q that can be

used tomodelP.

2. AnalgorithmAthateectivelycomputesQ.

WewillsaythatM andAtogetherconstitutesanapplicationmethod for

prob-lem P with Q as the model problem. For example, let G be a context-free

grammar intendedtomodelthesyntaxofanaturallanguageNLandletQ be

theparsingproblemforG. ThenGtogetherwith,say,Earley'salgorithmisan

applicationmethodforsyntacticanalysisofNLwithQasthemodelproblem.

2.2 Acquisition Methods

The term acquisition method will be used to refer to any procedure for

con-structing a mathematical model that can be used in an application method.

For example, any procedure for developing a context-free grammar modeling

a naturallanguageor ahiddenMarkovmodelfor part-of-speech taggingisan

acquisition method inthis sense. Inthe following,we willconcentratealmost

exclusivelyonacquisitionmethodsthatmakeuseofmachinelearningtechniques

in orderto induce models (ormodelparameters) fromempiricaldata,

speci-callycorpus data. An empirical and algorithmicacquisition method typically

consistsoftwoelements:

1. A parameterizedmathematical model M

such thatprovidingvaluesfor

theparameters willyieldamathematical modelM thatcanbeusedin

an applicationmethodforsomeNLPproblemP.

2. AnalgorithmAthateectivelycomputesvaluesfortheparameterswhen

givenasampleofdatafromP.

Ifthedatasamplemustcontainbothinputsand(correct)outputsfromP,then

Aissaidtobeasupervised learningalgorithm. Ifitissucientwithasample

ofinputs,we haveanunsupervised learningalgorithm.

2.3 Evaluation Methods

Ifacquisitionandapplicationmethodswereinfallible,noothermethodswould

be needed. In practice, however, there are many factorswhich maycause an

NLPsystemtoperformlessthanoptimallyandwe thereforeneedmethodsfor

evaluating NLP systems. We willuse the term evaluation method to refer to

anyprocedure forevaluatingNLPsystems. However, thediscussionwillfocus

onextrinsicevaluationofsystemsintermsoftheiraccuracy. Forexample,letP

beanNLPproblem,andlet(M

1 ;A

1

)and(M

2 ;A

2

)betwodierentapplication

methods for P. A common wayof evaluating and comparing the accuracy of

thesetwomethodsistoapplythemtoarepresentativesampleofinputsfromP

and measurethe accuracyoftheoutputs producedbytherespective methods.

Aspecial caseofthisevaluationschemeiswhere A

1 =A

2

andthe modelsM

1

andM

2

(3)

Forthepurpose ofthis paper,wewillsaythatamodelormethodisstatistical

(orstochastic)if itinvolves theconcept ofprobability(orrelatednotions such

as entropyand mutual information)or if ituses conceptsof statisticaltheory

(suchasstatisticalestimationandhypothesistesting).

4 Statistical Methods in NLP

4.1 Application Methods

Mostexamplesofstatisticalapplicationmethodsintheliteratureare methods

thatmake use of a stochastic model,but where the algorithm appliedto this

modelisentirelydeterministic. Typically,theabstractmodelproblemcomputed

bythealgorithm isan optimizationproblem whichconsistsin maximizingthe

probabilityoftheoutputgiventheinput. Hereare someexamples:

Language modelingfor automaticspeech recognition using smoothed

n-gramstondthemostprobablestringofwordsw

1 ;:::;w

n

outofasetof

candidate stringscompatiblewiththeacousticdata[Jelinek1990].

Part-of-speechtaggingusinghiddenMarkovmodelstondthemost

prob-abletagsequencet

1 ;:::t

n

givenawordsequencew

1 ;:::w

n

[Merialdo1994].

Syntacticparsingusingprobabilisticgrammarstondthemostprobable

parsetreeT givenawordsequencew

1 ;:::;w

n

(ortagsequencet

1 ;:::;t

n )

[Stolcke1995].

Word sense disambiguation using Bayesian classiers to nd the most

probablesensesforwordw incontextC [Galeet al 1992].

Machine translation using probabilistic models to nd the most

proba-ble target language sentence T for a given source language sentence S

[Brownetal 1990a].

4.2 Acquisition Methods

Statisticalacquisitionmethodsaremethodsthatrelyonstatisticalinference to

inducemodels(ormodelparameters)fromempiricaldata,inparticularcorpus

data. Themodelinduced mayormaynotbea stochasticmodel,whichmeans

thatthereareasmanyvariationsinthisareaastherearedierentNLPmodels.

Wewillthereforelimitourselvestoafewrepresentativeexamples:

SupervisedlearningofHMMtaggers[Merialdo1994].

UnsupervisedlearningofHMMtaggers [Cuttingetal 1992]

Transformation-basedlearning[Brill1995].

(4)

EvaluationofNLPsystemscanhavedierentpurposesandconsidermany

dif-ferentdimensionsofasystem. Consequently,thereareawidevarietyofmethods

thatcanbeusedforevaluation.Manyofthesemethodsinvolveempirical

exper-iments orquasi-experiments inwhichthesystem isappliedtoarepresentative

sampleofdata inordertoprovidequantitativemeasures ofaspectssuchas

ef-ciency, accuracy and robustness. These evaluation methodscanmake use of

statisticsinatleastthreedierentways:

Descriptive statistics is often used to derive measures such as accuracy

rate,recallandprecision.

Statistical estimation may be used to derive condence intervals for

de-scriptive measures.

Hypothesistestingmaybeused totestthesignicanceofanydierences

foundwhencomparingalternativemethods.

5 Conclusion

Inthis paper, wehave discussedthreedierentkindsofmethodsthatare

rele-vantinnaturallanguageprocessing:

An application method is used to solve an NLP problem P, usually by

applyinganalgorithmAtoamathematicalmodelM inordertosolvean

abstractproblemQapproximatingP.

AnacquisitionmethodforanNLPproblemP isusedtoconstructamodel

M thatcanbeused in anapplication method forP. Ofspecial interest

here are empirical and algorithmic acquisition methodsthat allowus to

constructM fromaparameterizedmodelM

byapplyinganalgorithmA

toarepresentative sampleofP.

Anevaluationmethod foranNLPproblemP isusedtoevaluate

applica-tionmethodsforP. Ofspecial interesthereareexperimental(or

empiri-cal)evaluationmethodsthatallowustoevaluateapplicationmethodsby

applying themtoarepresentativesampleofP.

We have argued that statistics, in the wide sense including both stochastic

modelsandstatisticaltheory,canplayaroleinallthreekindsofmethodsandwe

havesuppliednumerousexamplestosubstantiatethisclaim. 1

Wehavealsotried

toshowthattherearemanywaysinwhichstatisticalmethodscanbecombined

withtraditionallinguisticrulesandrepresentation,bothinapplicationmethods

(5)

[Brill1995] Brill, E. (1995) Transformation-Based Error-DrivenLearning and

NaturalLanguageProcessing: ACaseStudyinPart-of-SpeechTagging.

Com-putational Linguistics, 21(4),543566.

[Brownet al 1990a] Brown,P.,Cocke,J.,DellaPietra,S.,DellaPietra,V.,

Je-linek,F.,Laerty,J.,Mercer,R.andRossin,P.(1990)AStatisticalApproach

toMachineTranslation.ComputationalLinguistics 16(2),7985.

[Cuttinget al 1992] Cutting,D.,Kupiec,J.,Pedersen,J.andSibun,P.(1992).

A PracticalPart-of-speech Tagger. In Third Conference onApplied Natural

LanguageProcessing,ACL,133140.

[Galeet al 1992] Gale, W. A., Church, K. W. and Yarowsky, D. (1992) A

MethodforDisambiguatingWordSensesinaLargeCorpus.Computersand

the Humanities 26,415439.

[Jelinek1990] Jelinek,F. (1990)Self-OrganizedLanguageModelingforSpeech

Recognition. InWaibel,A.andLee,K.-F.(eds)ReadingsinSpeech Re

cogni-tion,pp.450506.LosAltos,CA: MorganKaufman.

[Magerman1995] Magerman, D. (1995) Statistical Decision-Tree Models for

Parsing. In Proceedings of the 33rd Annual Meeting of the Association for

Computational Linguistics,276283.

[Merialdo1994] Merialdo, B.(1994)TaggingEnglishTextwithaProbabilistic

Model.ComputationalLinguistics 20(2),155172.

[Stolcke 1995] Stolcke,A.(1995) AnEcientProbabilistic Context-Free

Pars-ing Algorithm ThatComputesPrex Probabilities. Computational