Estimation of Stochastic Attribute Value Grammars using an Informative Sample

(1)

Informative Sample

Miles Osborne

[email protected]

Rijksuniversiteit Groningen, The Netherlands

Abstract

Wearguethatsomeofthecomputationalcomplexity

associated with estimation of stochastic

attribute-valuegrammarscanbereducedbytraininguponan

informativesubset of thefull trainingset. Results

using the parsed Wall Street Journal corpus show

that in somecircumstances,it is possibleto obtain

betterestimationresultsusing aninformative

sam-ple than when training upon all the available

ma-terial. Further experimentation demonstrates that

with unlexicalisedmodels, aGaussian priorcan

re-duce overtting. However, when models are

lexi-calisedandcontainoverlappingfeatures, overtting

doesnotseemtobeaproblem,andaGaussianprior

makesminimal dierenceto performance. Our

ap-proach is applicable for situations when there are

aninfeasiblylargenumberof parsesin thetraining

set, or else for when recoveryof these parses from

apackedrepresentationisitselfcomputationally

ex-pensive.

1 Introduction

Abney showed that attribute-value grammars

can-not be modelled adequately using statistical

tech-niques which assume that statistical dependencies

are accidental (Abney, 1997). Instead of using a

modelclassthatassumedindependence,Abney

sug-gested using Random Fields Models (RFMs) for

attribute-value grammars. RFMs deal with the

graphicalstructureofaparse. Becausetheydonot

makeindependence assumptionsaboutthe

stochas-tic generation process that might have produced

someparse,theyareabletomodelcorrectly

depen-denciesthat existwithin parses.

Whenestimatingstandardly-formulatedRFMs,it

is necessary to sum overall parses licensed by the

grammar. For many broad coverage natural

lan-guage grammars, this might involvesumming over

anexponentialnumberof parses. This wouldmake

the task computationally intractable. Abney,

fol-lowingtheleadof Laertyetal, suggestedaMonte

Current address: [email protected], University of

Edinburgh,DivisionofInformatics,2BuccleuchPlace,EH8

Carlosimulationasawayofreducingthe

computa-tionalburdenassociatedwithRFMestimation

(Laf-ferty et al., 1997). However, Johnson et al

consid-ered the form of sampling used in this simulation

(Metropolis-Hastings) intractable (Johnson et al.,

1999). Instead, theyproposedan alternative

strat-egythatredenedtheestimationtask. Itwasargued

thatthisredenitionmadeestimation

computation-ally simple enough that a Monte Carlo simulation

was unnecessary. They presented results obtained

usingasmallunlexicalisedmodeltrainedona

mod-estcorpus.

Unfortunately,Johnsonetalassumeditwas

possi-bletoretrieveallparseslicensedbyagrammarwhen

parsing a given training set. For us, this was not

thecase. Inourexperimentswith amanually

writ-tenbroadcoverageDeniteClauseGrammar(DCG)

(BriscoeandCarroll,1996),wewereonlyableto

re-cover all parses for Wall Street Journal sentences

thatwereat most13 tokenslongwithin acceptable

timeand space bounds on computation. When we

used an incremental Minimum Description Length

(MDL)basedlearnerto extendthecoverageofour

manually written grammar (from roughly 60% to

around90%oftheparsedWallStreetJournal),the

situationbecameworse. Sentenceambiguity

consid-erablyincreased. Wewerethenonlyabletorecover

allparsesforWallStreetJournalsentencesthatwere

atmost6tokenslong(Osborne,1999).

Wecanhowever,andusuallyinpolynomialtime,

recoverupto30parsesforsentencesupto30tokens

longwhen weuseaprobabilisticunpacking

mecha-nism(CarrollandBriscoe,1992). (Longersentences

than 30 tokens can be parsed, but the number of

parseswecanrecoverfor them dropso rapidly). 1

However,30 is far lessthan the maximumnumber

1

Wemadean attemptto determinethe maximum

num-berof parses our grammarmight assign to sentences. On

a 450MHz Ultra Sparc80 with2 Gbof realmemory,with

a limit of at most 1000 parses per sentence, and allowing

nomorethan100CPUsecondspersentence, wefoundthat

sentence ambiguity increased exponentially withrespect to

sentencelength. Sentenceswith30tokenshadanestimated

averageof 866 parses (standard deviation 290:4). Without

(2)

WallStreet Journalsentences. Any trainingsetwe

have access to will therefore be necessarily limited

insize.

We therefore need an estimation strategy that

takesseriouslytheissueof extractingthe best

per-formancefromalimitedsizetrainingset. Alimited

size trainingset meansonecreatedbyretrievingat

mostnparsespersentence. Althoughwecannot

re-coverall possible parses, wedo haveachoice asto

whichparsesestimationshouldbebasedupon.

OurapproachtotheproblemofmakingRFM

es-timationfeasible for ourhighly ambiguous DCGis

to seek out an informative sample and train upon

that. We do not redene the estimation task in a

non-standardway,nordoweuseaMonteCarlo

sim-ulation.

We call a sample informative if it both leads to

the selection of a model that does not undert or

overt,andalsoistypicaloffuturesamples. Despite

one's intuitions, an informativesample might be a

proper subset of the full training set. This means

thatestimationusing theinformativesamplemight

yield better resultsthan estimation usingallof the

trainingset.

The rest of this paper is as follows. Firstly we

introduceRFMs. Then weshowhowthey maybe

estimatedand howaninformativesamplemightbe

identied. Next, we give details of the

attribute-valuegrammar we use,and show howwego about

modelling it. We then present two sets of

experi-ments. Therstsetissmallscale,andaredesigned

toshowtheexistenceofaninformativesample. The

second set of experiments are larger in scale, and

build upon the computational savings we are able

to achieveusing aprobabilisticunpackingstrategy.

Theyshowhowlargemodels(twoordersof

magni-tude larger than those reported by Johnson et al)

canbeestimatedusingtheparsedWallStreet

Jour-nalcorpus. Overttingisshowntotakeplace. They

alsoshowhowthisoverttingcanbe(partially)

re-duced by using a Gaussian prior. Finally, we end

withsomecommentsonourwork.

2 Random Field Models

Hereweshowhowattribute-valuegrammarsmaybe

modelledusingRFMs. Althoughourcommentaryis

in termsof RFMs and grammars, it should be

ob-viousthatRFM technologycanbeappliedtoother

estimationscenarios.

LetG bean attribute-valuegrammar,D theset

of sentences within the string-set dened by L(G)

and the union of the set of parses assigned to

eachsentencein D bythegrammar G. ARandom

FieldModel,M,consistoftwocomponents: asetof

features,F andasetof weights,.

aspects of what it takes to dierentiate one parse

fromanotherparse. Eachfeatureisafunction from

a parse to an integer. Here, the integer value

as-sociated with a feature is interpreted as the

num-ber of times a feature `matches' (is `active') with

aparse. Note featuresshould notbeconfused with

featuresasfoundinfeature-valuebundles(thesewill

be called attributes instead). Features are usually

manuallyselectedbythesystemdesigner.

The other component of a RFM, , is a set of

weights. Informally,weightstellushowfeaturesare

tobeused whenmodellingparses. Forexample,an

activefeaturewithalargeweightmightindicatethat

someparsehadahighprobability. Eachweight

i is

associatedwithafeaturef

i

. Weightsarereal-valued

numbersandareautomaticallydeterminedbyan

es-timationprocess(forexampleusingImproved

Itera-tiveScaling(Laertyet al.,1997)). Oneof thenice

properties of RFMs is that the likelihood function

ofaRFMisstrictlyconcave. Thismeansthatthere

arenolocalminima,and sowecanbebesurethat

scaling will result in estimation of a RFM that is

globallyoptimal.

The (unnormalised) total weight of a parse x,

(x),isafunction ofthekfeaturesthatare`active'

onaparse:

(x)=exp( k

X

i=1

i f

i

(x)) (1)

The probability of a parse, P(x j M), is simply

theresultofnormalisingthetotalweightassociated

withthatparse:

P(xjM)= 1

Z

(x) (2)

Z= X

y2

(y) (3)

Theinterpretationofthisprobabilitydependsupon

theapplicationoftheRFM.Here,weuseparse

prob-abilitiestoreectpreferencesforparses.

When using RFMs for parse selection, we

sim-ply selectthe parse that maximises (x). In these

circumstances,there is no needto normalise

(com-puteZ). Also,whencomputing (x)forcompeting

parses,there is nobuilt-in biastowardsshorter (or

longer)derivations,andsononeedtonormalisewith

respecttoderivationlength. 2

2

Thereasonthereisnoneedtonormalisewithrespectto

derivationlengthisthatfeaturescan havepositiveor

nega-tiveweights. Theweightofaparsewillthereforenotalways

(3)

the Informative Sample

We now sketch how RFMs may be estimated and

thenoutlinehowweseekoutaninformativesample.

We use Improved Iterative Scaling (IIS) to

esti-mate RFMs. Inoutline,theIISalgorithmis as

fol-lows:

1. Start with areference distribution R , a set of

features F and aset of weights . Let M be

theRFMdened usingF and.

2. Initialise all weights to zero. This makes the

initialmodel uniform.

3. Computethe expectation of each feature w.r.t

R .

4. Foreachfeature f

i

(a) Findaweight

i

thatequatesthe

expecta-tionoff

i

w.r.tRandtheexpectationoff

i

w.r.tM.

(b) Replacetheoldvalueof

i with

i .

5. IfthemodelhasconvergedtoR ,outputM.

6. Otherwise,gotostep4

Thekeystephereis4a,computingtheexpectations

offeaturesw.r.ttheRFM.Thisinvolvescalculating

the probability of a parse, which, as we saw from

equation2, requiresasummation overall parsesin

.

We seekout aninformative sample

t (

t

)

asfollows:

1. Pickoutfromasampleofsize n.

2. Estimateamodelusingthat sampleand

evalu-ateit.

3. Ifthemodeljustestimatedshowssignsof

over-tting(withrespecttoanunseenheld-outdata

set),haltandoutputthemodel.

4. Otherwise,increasenandgobacktostep1.

Ourapproachismotivatedbythefollowing

(par-tiallyrelated)observations:

Because we use a non-parametric model class

and select an instance of it in terms of some

sample (section 5 gives details), a stochastic

complexityargumenttellsusthatanoverly

sim-ple model (resulting from a small sample) is

likelyto undert. Likewise, anoverlycomplex

model (resulting from alarge sample)is likely

toovert. Aninformativesamplewilltherefore

relatetoamodelthatdoesnotunderorovert.

Onaverage,aninformativesamplewillbe

`typ-ical'offuture samples. Formany real-life

situ-ations, this set islikely to besmall relativeto

searchmechanism. Becausewestartwithsmall

sam-ples and gradually increase their size, we remain

withinthedomainofeÆcientlyrecoverablesamples.

The second observation is (largely) incorporated

inthe waywepicksamples. Theexperimental

sec-tionofthispapergoesintotherelevantdetails.

Note our approach isheuristic: we cannotaord

toevaluateall2 jj

possibletrainingsets. Theactual

sizeof the informativesample

t

will depend both

the upon the model class used and the maximum

sentence length we can deal with. We would

ex-pect richer,lexicalised modelsto exhibitovertting

with smaller samples than would be the case with

unlexicalisedmodels. We would expect the size of

aninformativesampleto increaseasthemaximum

sentencelengthincreased.

There are similaritiesbetween our approach and

withestimationusingMDL(Rissanen,1989).

How-ever,ourimplementationdoesnotexplicitlyattempt

to minimise code lengths. Also, there are

similari-ties with importance sampling approaches to RFM

estimation (such as(Chen and Rosenfeld, 1999a)).

However, such attempts do not minimise under or

overtting.

4 The Grammar

ThegrammarwemodelwithRandomFields,(called

the Tag Sequence Grammar (Briscoe and Carroll,

1996),orTSGforshort)wasdevelopedwithregard

tocoverage,andwhencompiledconsistsof455

Def-inite Clause Grammar (DCG) rules. It does not

parse sequences of words directly, but instead

as-signsderivationstosequencesofpart-of-speechtags

(using the CLAWS2 tagset. The grammar is

rela-tivelyshallow,(for example,it does notfully

anal-yseunbounded dependencies) but it does makean

attempttodealwithcommonconstructions,suchas

datesornames,commonlyfoundin corpora,but of

littletheoreticalinterest. Furthermore,itintegrates

intothesyntaxatextgrammar,groupingutterances

intounitsthat reducetheoverall ambiguity.

5 Modelling the Grammar

ModellingtheTSGwithrespecttotheparsedWall

Street Journal consists of two steps: creation of a

feature set and denition of thereference

distribu-tion.

Ourfeatureset iscreatedbyparsingsentencesin

the training set (

T

), and using each parse to

in-stantiatetemplates. Each templatedenes afamily

of features. At present, the templates we use are

somewhatad-hoc. However,theyare motivatedby

theobservations that linguistically-stipulated units

(DCGrules) are informative, and that many DCG

(4)

us-AP/a1:unimpeded

A1/app1:unimpeded

a

a !

!

unimpeded PP/p1:by

P1/pn1:by

b

b "

"

by N1/n:traÆc

traÆc

Figure1: TSGParseFragment

The rst template creates features that count

thenumberof timesa DCGinstantiationis present

within a parse. 3

For example, suppose we parsed

theWallStreetJournalAP:

1 unimpededbytraÆc

AparsetreegeneratedbyTSGmightbeasshown

ingure1. Here,tosaveonspace,wehavelabelled

each interiornode in the parse tree with TSGrule

names, and not attribute-value bundles.

Further-more, we have annotatedeach node with the head

word of the phrase in question. Within our

gram-mar, heads are (usually) explicitly marked. This

means we do not have to make any guesses when

identifying the headof a local tree. With head

in-formation,weareabletolexicalisemodels. Wehave

suppressedtagginginformation.

Forexample,afeaturedenedusingthistemplate

mightcountthenumberoftimesthewesaw:

AP/a1

A1/app1

inaparse. Suchfeaturesrecordsomeofthecontext

oftheruleapplication,inthatruleapplicationsthat

dier in termsof howattributes are bound will be

modelled bydierentfeatures.

Oursecondtemplatecreatesfeaturesthatare

par-tiallylexicalised. Foreach localtree (ofdepthone)

that has a PP daughter, we create a feature that

countsthenumberoftimesthatlocaltree,decorated

with thehead-wordofthePP,wasseenin aparse.

Anexampleofsuchalexicalisedfeaturewouldbe:

A1/app1

PP/p1:by

3

Note,allourfeaturessuppressanyterminalsthatappear

inalocaltree.Lexicalinformationisincludedwhenwedecide

ments that can be resolved using the head of the

PP.

Thethirdandnaltemplatecreatesfeaturesthat

areagainpartiallylexicalised. Thistime, wecreate

localtreesofdepth onethataredecoratedwiththe

headword. Forexample,hereisonesuchfeature:

AP/a1:unimpeded

A1/app1

Notethesecondandthirdtemplatesresultin

fea-turesthat overlap with features resulting from

ap-plicationsofthersttemplate.

WecreatethereferencedistributionR(an

associ-ationofprobabilitieswithTSGparsesof sentences,

suchthattheprobabilitiesreectparsepreferences)

usingthefollowingprocess:

1. Extract some sample

T

(using the approach

mentionedinsection3).

2. Foreachsentencein thesample,foreachparse

of that sentence, compute the `distance'

be-tween the TSG parse and the WSJ reference

parse. In our approach, distance is calculated

intermsofaweightedsumofcrossingrates,

re-callandprecision. Minimisingitmaximisesour

denitionofparseplausibility. 4

However,there

isnothinginherentlycrucialaboutthisdecision.

Any otherobjectivefunction (that canbe

rep-resented asan exponential distribution) could

beusedinstead.

3. Normalisethedistances,suchthatforsome

sen-tence, the sum of the distances of all

recov-eredTSGparsesforthatsentenceisaconstant

across allsentences. Normalisingin this

man-ner ensuresthat each sentence isequiprobable

(rememberthatRFMprobabilitiesareinterms

of parsepreferences,and notprobabilityof

oc-currencein somecorpus).

4. Map the normalised distances into

probabili-ties. If d(p)isthe normaliseddistance ofTSG

parse p, then associate with parse p the

refer-ence probability givenby the maximum

likeli-hoodestimator:

d(p)

P

x2t d(x)

(4)

Ourapproach therefore givespartial credit (a

non-zeroreferenceprobability)to allparsesin

t . R is

thereforenotasdiscontinuousastheequivalent

dis-tributionusedbyJohnsonetal. Wethereforedonot

needtousesimulatedannealingorothernumerically

intensivetechniques toestimatemodels.

4

(5)

Here wepresent two sets ofexperiments. The rst

setdemonstratetheexistenceofaninformative

sam-ple. Italsoshowssomeofthecharacteristicsofthree

sampling strategies. Thesecond setof experiments

islarger inscale, and showRFMs (bothlexicalised

and unlexicalised) estimated using sentences up to

30tokenslong. Also,theeectsofaGaussianprior

aredemonstratedasawayof(partially)dealingwith

overtting.

6.1 Testing theVarious Sampling

Strategies

Inordertoseehowvarioussizesofsamplerelatedto

estimation accuracy and whether wecould achieve

similar levelsof performance withoutrecoveringall

possibleparses,weranthefollowingexperiments.

Weusedamodelconsisting offeaturesthat were

dened using all three templates. We also threw

awayallfeaturesthatoccurredlessthantwotimesin

thetrainingset. Werandomlysplit theWallStreet

Journalinto disjoint training, held-out and testing

sets. All sentencesin thetrainingandheld-outsets

were atmost14 tokenslong. Sentences inthe

test-ing set were at most 30 tokens long. There were

6626 sentences in the training set, 98 sentences in

theheld-outsetand441sentencesinthetestingset.

Sentences in the held-out set had on average 12:6

parses,whilstsentencesinthetesting-sethadon

av-erage60:6parsespersentence.

Theheld-outsetwasusedto decidewhichmodel

performedbest. Actual performance of themodels

shouldbejudgedwithrespect tothetestingset.

Evaluationwasin termsofexactmatch: foreach

sentence in the test set, we awarded ourselves a

pointiftheRFMrankedhighestthesameparsethat

wasrankedhighestusingthereferenceprobabilities.

When evaluating with respect to the held-out set,

werecoveredallparsesforsentencesintheheld-out

set. Whenevaluatingwithrespecttothetesting-set,

werecoveredatmost100parsespersentence.

For each run, we ran IIS for the same number

of iterations (20). In each case, we evaluated the

RFMaftereachotheriterationandrecordedthebest

classication performance. This step was designed

toavoidoverttingdistortingourresults.

Figure2showstheresultsweobtainedwith

pos-sible ways of picking `typical' samples. The rst

column shows the maximum number of parses per

sentencesthat weretrievedineachsample.

Thesecond column showsthe size of the sample

(inparses).

Theothercolumnsgiveclassicationaccuracy

re-sults(a percentage) withrespecttothe testingset.

Inparentheses,wegiveperformancewithrespectto

theheld-outset.

Maxparses Size Rand SCFG Ref

1 6626 25.2(51.7) 23.3 (50.0) 23.4(50.0)

2 12331 37.9(63.0) 40.4 (60.3) 40.4(60.0)

3 17026 43.2(65.5) 43.7 (63.8) 43.7(63.8)

5 24878 43.7(70.2) 45.8 (69.5) 45.8(69.5)

10 39581 47.4 (72.0) 47.0 (70.0) 46.9(70.0)

100 119694 45.0(68.7) 45.0 (68.0) 45.0(68.0)

1000 246686 44.4(67.4) 43.0 (67.0) 43.0(67.0)

1 267400 43.0(66.0) 43.0 (66.0) 43.0(66.0)

Figure2: Resultswithvarioussamplingstrategies

of runs that used a sample that contained parses

which wererandomlyand uniformlyselectedoutof

theset of allpossibleparses. Theclassication

ac-curacyresultsforthissamplerareaveragedover10

runs.

The column marked SCFGshowstheresults

ob-tained when using a sample that contained parses

thatwereretrievedusingtheprobabilisticunpacking

strategy. Thisdidnotinvolveretrievingallpossible

parses for each sentence in the training set. Since

thereisnorandomcomponent,theresultsarefroma

singlerun. Here,parseswererankedusinga

stochas-ticcontextfreebackboneapproximationofTSG.

Pa-rameterswereestimatedusingsimplecounting.

Finally, the column marked Ref shows the

re-sultsobtained whenusing asample that contained

theoveralln-bestparses persentence, asdened in

termsofthereferencedistribution.

As a baseline, a model containing randomly

as-signedweightsproducedaclassicationaccuracyof

45%onthe held-out sentences. These resultswere

averagedover10runs.

As can be seen, increasing the sample size

pro-duces better results (for each sampling strategy).

Aroundasamplesizeof40kparses,overttingstarts

tomanifest, andperformance bottoms-out. Oneof

theseisthereforeourinformativesample. Notethat

thebestsample(40kparses)islessthan20%ofthe

totalpossibletrainingset.

The dierence between the various samplers is

marginal, with aslight preference for Rand.

How-everthefactthatSCFGsamplingseemstodoalmost

aswellasRandsampling,andfurthermoredoesnot

requireunpackingallparses,makesitthesampling

strategyofchoice.

SCFG sampling is biased in the sense that the

sample produced using it will tend to concentrate

around those parses that are all close to the best

parses. Rand sampling is unbiased, and, apart

fromthepracticalproblemsofhavingtorecoverall

parses,mightin somecircumstancesbebetterthan

SCFGsampling. Atthetimeof writingthis paper,

(6)

tionwithoutunpackingallparses. Wesuspectthat

for probabilistic unpacking to be eÆcient, it must

relyupon somenon-uniform distribution.

Unpack-ing randomly and uniformly would probably result

inalargelossincomputationaleÆciency.

6.2 Larger Scale Evaluation

Hereweshowresultsusingalargersampleand

test-ing set. We also show the eects of lexicalisation,

overtting,andoverttingavoidanceusinga

Gaus-sianprior. Strictlyspeakingthissectioncouldhave

beenomittedfromthepaper. However,ifoneviews

estimation using an informative sample as

overt-ting avoidance, then estimation using a Gaussian

prior can be seen as another, complementary take

ontheproblem.

The experimental setup wasas follows. We

ran-domly split the Wall Street Journal corpus into a

trainingset and atesting set. Both sets contained

sentencesthat wereat most 30tokens long. When

creatingthesetofparsesusedtoestimateRFMs,we

used the SCFG approach, and retained the top 25

parsespersentence. Withinthetrainingset(arising

from 16;200sentences), there were 405;020parses.

The testingset consisted of 466sentences, with an

averageof60:6parsespersentence.

Whenevaluating,weretrievedatmost100parses

persentenceinthetestingsetandscoredthemusing

our reference distribution. As before, we awarded

ourselvesapointifthe mostprobabletestingparse

(intermsoftheRMF)coincidedwiththemost

prob-ableparse(intermsofthereferencedistribution). In

allcases,weranIISfor100iterations.

For the rst experiment, we used just the rst

template (features that related to DCG

instantia-tions)tocreatemodel1;thesecondexperimentused

the rst and second templates (additional features

relatingto PPattachment)tocreate model2. The

nalexperimentusedallthreetemplates(additional

featuresthat werehead-lexicalised)tocreatemodel

3.

The three models contained 39;230, 65;568and

278;127featuresrespectively,

As a baseline, a model containing randomly

as-signed weights achieved a 22% classication

accu-racy. Theseresultswereaveragedover10runs.

Fig-ure3showstheclassicationaccuracyusingmodels

1,2and3.

As can be seen, the larger scale experimental

results were better than those achieved using the

smallersamples(mentionedinsection6.1). The

rea-son for this was becausewe used longer sentences.

Theinformativesamplederivablefromsucha

train-42

44

46

48

50

52

54

0

10

20

30

40

50

60

70

80

90

100 Accuracy

Iterations

model1

model2

model3

Figure3: ClassicationAccuracyforThreeModels

EstimatedusingBasicIIS

42

44

46

48

50

52

54

56

0

10

20

30

40

50

60

70

80

90

100 Accuracy

Iterations

model1

model2

model3

Figure4: ClassicationAccuracyforThreeModels

EstimatedusingaGaussianPriorand IIS

thepopulation) than theinformativesample

deriv-abledfromatrainingsetusingshorter, less

syntac-tically complex sentences. With the unlexicalised

model, we see clear signs of overtting. Model 2

overtsevenmoreso. Forreasonsthatare unclear,

we see that the larger model 3 does notappear to

exhibitovertting.

We next used the Gaussian Prior method of

Chen and Rosenfeld to reduce overtting (Chen

and Rosenfeld, 1999b). This involved integrating

a Gaussian prior (with a zero mean) into IIS and

searching for the model that maximised the

prod-uctofthelikelihoodandpriorprobabilities. Forthe

experiments reported here, we used a single

vari-anceoverthe entire model (betterresults mightbe

achievable ifmultiple varianceswere used, perhaps

with one variance per template type). The actual

valueof the variance wasfound by trial-and-error.

(7)

time using a Gaussian prior. Figure 4 shows the

classication accuracy of the models when using a

GaussianPrior.

WhenweusedaGaussianprior,wefoundthatall

models showed signs of improvement (allbeit with

varying degrees): performance either increased, or

else did not decrease with respect to the number

of iterations. Still,model2continued to

underper-form. Model 3 seemedmost resistent to the prior.

It therefore appears that a Gaussian prior is most

useful for unlexicalised models, and that for

mod-els built from complex, overlapping features, other

formsofsmoothingmustbeusedinstead.

7 Comments

WearguedthatRFMestimationforbroad-coverage

attribute-valued grammars could be made

compu-tationally tractable by training upon an

informa-tivesample. Oursmall-scaleexperimentssuggested

thatusing thoseparsesthatcouldbeeÆciently

un-packed(SCFGsampling)wasalmost aseectiveas

sampling from allpossibleparses (Randsampling).

Also, wesawthat models should notbe bothbuilt

andalsoestimated usingallpossibleparses. Better

resultscan be obtainedwhen models are builtand

trainedusing aninformativesample.

Given the relationship between sample size and

model complexity, weseethat whenthere isa

dan-gerofovertting,oneshouldbuildmodelsonthe

ba-sisof aninformativeset. However,thisleavesopen

thepossibilityof training such amodel upona

su-persetoftheinformativeset. Althoughwehavenot

testedthisscenario,webelievethatthiswouldlead

tobetterresultsthanthoseachievedhere.

The largerscale experimentsshowed that RFMs

can be estimated using relatively long sentences.

TheyalsoshowedthatasimpleGaussianpriorcould

reduce theeects ofovertting. However,theyalso

showedthat excessiveovertting probablyrequired

analternativesmoothingapproach.

The smallerand larger experimentscan be both

viewed as (complementary) ways of dealing with

overtting. We conjecture that of the two

ap-proaches,theinformativesampleapproachis

prefer-ableasit dealswith overtting directly: overtting

resultsfromttingtocomplexamodelwithtoo

lit-tledata.

Our ongoing research will concentrate upon

stronger ways of dealing with overtting in

lexi-calised RFMs. Oneline weare pursuingisto

com-bineacompression-basedpriorwith anexponential

model. ThisblendsMDLwithMaximumEntropy.

We arealso looking at alternativetemplate sets.

Forexample,wewould probablybenetfromusing

templatesthatcapturemoreofthesyntacticcontext

We would like to thank Rob Malouf, Donnla Nic

Gearailt and the anonymous reviewers for

com-ments. This work was supported by the TMR

ProjectLearning ComputationalGrammars.

References

Steven P. Abney. 1997. Stochastic

Attribute-Value Grammars. Computational Linguistics,

23(4):597{618,December.

Miles Osborne 1999. DCG Induction using MDL

and Parsed Corpora. In James Cussens, editor,

Learning Language in Logic, pages 63{71, Bled,

Slovenia,June.

Ted Briscoe and John Carroll. 1996. Automatic

Extraction of Subcategorization from Corpora.

In Proceedings of the 5 th

Conference on Applied

NLP,pages356{363,Washington,DC.

John Carroll and Ted Briscoe. 1992.

Probabilis-ticNormalisationandUnpackingofPackedParse

Forestsfor Unication-basedGrammars. In

Pro-ceedings of the AAAI Fall Symposium on

Prob-abilistic Approaches to Natural Language, pages

33{38,Cambridge,MA.

Stanley Chen and Ronald Rosenfeld. 1999a.

EÆ-cient Sampling and Feature Selection in Whole

SentenceMaximumEntropyLanguageModels. In

ICASSP'99.

Stanley F. Chen and Ronald Rosenfeld. 1999b.

A Gaussian Prior for Smoothing Maximum

En-tropyModels. TechnicalReportCMU-CS-99-108,

CarnegieMellonUniversity.

Eirik Hektoen. 1997. Probabilistic Parse Selection

Based on Semantic Cooccurrences. In

Proceed-ingsofthe5thInternationalWorskhoponParsing

Technologies, Cambridge, Massachusetts, pages

113{122.

Mark Johnson, Stuart Geman, Stephen Cannon,

ZhiyiChi,andStephanRiezler. 1999. Estimators

for Stochastic\Unication-based"Grammars. In

37 th

AnnualMeetingofthe ACL.

J. Laerty, S. Della Pietra, and V. Della Pietra.

1997. InducingFeaturesofRandomFields. IEEE

Transactions on Pattern Analysis and Machine

Intelligence,19(4):380{393,April.

Jorma Rissanen. 1989. Stochastic Complexity in

Statistical Inquiry. Seriesin ComputerScience