Informative Sample
Miles Osborne
Rijksuniversiteit Groningen, The Netherlands
Abstract
Wearguethatsomeofthecomputationalcomplexity
associated with estimation of stochastic
attribute-valuegrammarscanbereducedbytraininguponan
informativesubset of thefull trainingset. Results
using the parsed Wall Street Journal corpus show
that in somecircumstances,it is possibleto obtain
betterestimationresultsusing aninformative
sam-ple than when training upon all the available
ma-terial. Further experimentation demonstrates that
with unlexicalisedmodels, aGaussian priorcan
re-duce overtting. However, when models are
lexi-calisedandcontainoverlappingfeatures, overtting
doesnotseemtobeaproblem,andaGaussianprior
makesminimal dierenceto performance. Our
ap-proach is applicable for situations when there are
aninfeasiblylargenumberof parsesin thetraining
set, or else for when recoveryof these parses from
apackedrepresentationisitselfcomputationally
ex-pensive.
1 Introduction
Abney showed that attribute-value grammars
can-not be modelled adequately using statistical
tech-niques which assume that statistical dependencies
are accidental (Abney, 1997). Instead of using a
modelclassthatassumedindependence,Abney
sug-gested using Random Fields Models (RFMs) for
attribute-value grammars. RFMs deal with the
graphicalstructureofaparse. Becausetheydonot
makeindependence assumptionsaboutthe
stochas-tic generation process that might have produced
someparse,theyareabletomodelcorrectly
depen-denciesthat existwithin parses.
Whenestimatingstandardly-formulatedRFMs,it
is necessary to sum overall parses licensed by the
grammar. For many broad coverage natural
lan-guage grammars, this might involvesumming over
anexponentialnumberof parses. This wouldmake
the task computationally intractable. Abney,
fol-lowingtheleadof Laertyetal, suggestedaMonte
Current address: [email protected], University of
Edinburgh,DivisionofInformatics,2BuccleuchPlace,EH8
Carlosimulationasawayofreducingthe
computa-tionalburdenassociatedwithRFMestimation
(Laf-ferty et al., 1997). However, Johnson et al
consid-ered the form of sampling used in this simulation
(Metropolis-Hastings) intractable (Johnson et al.,
1999). Instead, theyproposedan alternative
strat-egythatredenedtheestimationtask. Itwasargued
thatthisredenitionmadeestimation
computation-ally simple enough that a Monte Carlo simulation
was unnecessary. They presented results obtained
usingasmallunlexicalisedmodeltrainedona
mod-estcorpus.
Unfortunately,Johnsonetalassumeditwas
possi-bletoretrieveallparseslicensedbyagrammarwhen
parsing a given training set. For us, this was not
thecase. Inourexperimentswith amanually
writ-tenbroadcoverageDeniteClauseGrammar(DCG)
(BriscoeandCarroll,1996),wewereonlyableto
re-cover all parses for Wall Street Journal sentences
thatwereat most13 tokenslongwithin acceptable
timeand space bounds on computation. When we
used an incremental Minimum Description Length
(MDL)basedlearnerto extendthecoverageofour
manually written grammar (from roughly 60% to
around90%oftheparsedWallStreetJournal),the
situationbecameworse. Sentenceambiguity
consid-erablyincreased. Wewerethenonlyabletorecover
allparsesforWallStreetJournalsentencesthatwere
atmost6tokenslong(Osborne,1999).
Wecanhowever,andusuallyinpolynomialtime,
recoverupto30parsesforsentencesupto30tokens
longwhen weuseaprobabilisticunpacking
mecha-nism(CarrollandBriscoe,1992). (Longersentences
than 30 tokens can be parsed, but the number of
parseswecanrecoverfor them dropso rapidly). 1
However,30 is far lessthan the maximumnumber
1
Wemadean attemptto determinethe maximum
num-berof parses our grammarmight assign to sentences. On
a 450MHz Ultra Sparc80 with2 Gbof realmemory,with
a limit of at most 1000 parses per sentence, and allowing
nomorethan100CPUsecondspersentence, wefoundthat
sentence ambiguity increased exponentially withrespect to
sentencelength. Sentenceswith30tokenshadanestimated
averageof 866 parses (standard deviation 290:4). Without
WallStreet Journalsentences. Any trainingsetwe
have access to will therefore be necessarily limited
insize.
We therefore need an estimation strategy that
takesseriouslytheissueof extractingthe best
per-formancefromalimitedsizetrainingset. Alimited
size trainingset meansonecreatedbyretrievingat
mostnparsespersentence. Althoughwecannot
re-coverall possible parses, wedo haveachoice asto
whichparsesestimationshouldbebasedupon.
OurapproachtotheproblemofmakingRFM
es-timationfeasible for ourhighly ambiguous DCGis
to seek out an informative sample and train upon
that. We do not redene the estimation task in a
non-standardway,nordoweuseaMonteCarlo
sim-ulation.
We call a sample informative if it both leads to
the selection of a model that does not undert or
overt,andalsoistypicaloffuturesamples. Despite
one's intuitions, an informativesample might be a
proper subset of the full training set. This means
thatestimationusing theinformativesamplemight
yield better resultsthan estimation usingallof the
trainingset.
The rest of this paper is as follows. Firstly we
introduceRFMs. Then weshowhowthey maybe
estimatedand howaninformativesamplemightbe
identied. Next, we give details of the
attribute-valuegrammar we use,and show howwego about
modelling it. We then present two sets of
experi-ments. Therstsetissmallscale,andaredesigned
toshowtheexistenceofaninformativesample. The
second set of experiments are larger in scale, and
build upon the computational savings we are able
to achieveusing aprobabilisticunpackingstrategy.
Theyshowhowlargemodels(twoordersof
magni-tude larger than those reported by Johnson et al)
canbeestimatedusingtheparsedWallStreet
Jour-nalcorpus. Overttingisshowntotakeplace. They
alsoshowhowthisoverttingcanbe(partially)
re-duced by using a Gaussian prior. Finally, we end
withsomecommentsonourwork.
2 Random Field Models
Hereweshowhowattribute-valuegrammarsmaybe
modelledusingRFMs. Althoughourcommentaryis
in termsof RFMs and grammars, it should be
ob-viousthatRFM technologycanbeappliedtoother
estimationscenarios.
LetG bean attribute-valuegrammar,D theset
of sentences within the string-set dened by L(G)
and the union of the set of parses assigned to
eachsentencein D bythegrammar G. ARandom
FieldModel,M,consistoftwocomponents: asetof
features,F andasetof weights,.
aspects of what it takes to dierentiate one parse
fromanotherparse. Eachfeatureisafunction from
a parse to an integer. Here, the integer value
as-sociated with a feature is interpreted as the
num-ber of times a feature `matches' (is `active') with
aparse. Note featuresshould notbeconfused with
featuresasfoundinfeature-valuebundles(thesewill
be called attributes instead). Features are usually
manuallyselectedbythesystemdesigner.
The other component of a RFM, , is a set of
weights. Informally,weightstellushowfeaturesare
tobeused whenmodellingparses. Forexample,an
activefeaturewithalargeweightmightindicatethat
someparsehadahighprobability. Eachweight
i is
associatedwithafeaturef
i
. Weightsarereal-valued
numbersandareautomaticallydeterminedbyan
es-timationprocess(forexampleusingImproved
Itera-tiveScaling(Laertyet al.,1997)). Oneof thenice
properties of RFMs is that the likelihood function
ofaRFMisstrictlyconcave. Thismeansthatthere
arenolocalminima,and sowecanbebesurethat
scaling will result in estimation of a RFM that is
globallyoptimal.
The (unnormalised) total weight of a parse x,
(x),isafunction ofthekfeaturesthatare`active'
onaparse:
(x)=exp( k
X
i=1
i f
i
(x)) (1)
The probability of a parse, P(x j M), is simply
theresultofnormalisingthetotalweightassociated
withthatparse:
P(xjM)= 1
Z
(x) (2)
Z= X
y2
(y) (3)
Theinterpretationofthisprobabilitydependsupon
theapplicationoftheRFM.Here,weuseparse
prob-abilitiestoreectpreferencesforparses.
When using RFMs for parse selection, we
sim-ply selectthe parse that maximises (x). In these
circumstances,there is no needto normalise
(com-puteZ). Also,whencomputing (x)forcompeting
parses,there is nobuilt-in biastowardsshorter (or
longer)derivations,andsononeedtonormalisewith
respecttoderivationlength. 2
2
Thereasonthereisnoneedtonormalisewithrespectto
derivationlengthisthatfeaturescan havepositiveor
nega-tiveweights. Theweightofaparsewillthereforenotalways
the Informative Sample
We now sketch how RFMs may be estimated and
thenoutlinehowweseekoutaninformativesample.
We use Improved Iterative Scaling (IIS) to
esti-mate RFMs. Inoutline,theIISalgorithmis as
fol-lows:
1. Start with areference distribution R , a set of
features F and aset of weights . Let M be
theRFMdened usingF and.
2. Initialise all weights to zero. This makes the
initialmodel uniform.
3. Computethe expectation of each feature w.r.t
R .
4. Foreachfeature f
i
(a) Findaweight
i
thatequatesthe
expecta-tionoff
i
w.r.tRandtheexpectationoff
i
w.r.tM.
(b) Replacetheoldvalueof
i with
i .
5. IfthemodelhasconvergedtoR ,outputM.
6. Otherwise,gotostep4
Thekeystephereis4a,computingtheexpectations
offeaturesw.r.ttheRFM.Thisinvolvescalculating
the probability of a parse, which, as we saw from
equation2, requiresasummation overall parsesin
.
We seekout aninformative sample
t (
t
)
asfollows:
1. Pickoutfromasampleofsize n.
2. Estimateamodelusingthat sampleand
evalu-ateit.
3. Ifthemodeljustestimatedshowssignsof
over-tting(withrespecttoanunseenheld-outdata
set),haltandoutputthemodel.
4. Otherwise,increasenandgobacktostep1.
Ourapproachismotivatedbythefollowing
(par-tiallyrelated)observations:
Because we use a non-parametric model class
and select an instance of it in terms of some
sample (section 5 gives details), a stochastic
complexityargumenttellsusthatanoverly
sim-ple model (resulting from a small sample) is
likelyto undert. Likewise, anoverlycomplex
model (resulting from alarge sample)is likely
toovert. Aninformativesamplewilltherefore
relatetoamodelthatdoesnotunderorovert.
Onaverage,aninformativesamplewillbe
`typ-ical'offuture samples. Formany real-life
situ-ations, this set islikely to besmall relativeto
searchmechanism. Becausewestartwithsmall
sam-ples and gradually increase their size, we remain
withinthedomainofeÆcientlyrecoverablesamples.
The second observation is (largely) incorporated
inthe waywepicksamples. Theexperimental
sec-tionofthispapergoesintotherelevantdetails.
Note our approach isheuristic: we cannotaord
toevaluateall2 jj
possibletrainingsets. Theactual
sizeof the informativesample
t
will depend both
the upon the model class used and the maximum
sentence length we can deal with. We would
ex-pect richer,lexicalised modelsto exhibitovertting
with smaller samples than would be the case with
unlexicalisedmodels. We would expect the size of
aninformativesampleto increaseasthemaximum
sentencelengthincreased.
There are similaritiesbetween our approach and
withestimationusingMDL(Rissanen,1989).
How-ever,ourimplementationdoesnotexplicitlyattempt
to minimise code lengths. Also, there are
similari-ties with importance sampling approaches to RFM
estimation (such as(Chen and Rosenfeld, 1999a)).
However, such attempts do not minimise under or
overtting.
4 The Grammar
ThegrammarwemodelwithRandomFields,(called
the Tag Sequence Grammar (Briscoe and Carroll,
1996),orTSGforshort)wasdevelopedwithregard
tocoverage,andwhencompiledconsistsof455
Def-inite Clause Grammar (DCG) rules. It does not
parse sequences of words directly, but instead
as-signsderivationstosequencesofpart-of-speechtags
(using the CLAWS2 tagset. The grammar is
rela-tivelyshallow,(for example,it does notfully
anal-yseunbounded dependencies) but it does makean
attempttodealwithcommonconstructions,suchas
datesornames,commonlyfoundin corpora,but of
littletheoreticalinterest. Furthermore,itintegrates
intothesyntaxatextgrammar,groupingutterances
intounitsthat reducetheoverall ambiguity.
5 Modelling the Grammar
ModellingtheTSGwithrespecttotheparsedWall
Street Journal consists of two steps: creation of a
feature set and denition of thereference
distribu-tion.
Ourfeatureset iscreatedbyparsingsentencesin
the training set (
T
), and using each parse to
in-stantiatetemplates. Each templatedenes afamily
of features. At present, the templates we use are
somewhatad-hoc. However,theyare motivatedby
theobservations that linguistically-stipulated units
(DCGrules) are informative, and that many DCG
us-AP/a1:unimpeded
A1/app1:unimpeded
a
a
a
a !
!
!
!
unimpeded PP/p1:by
P1/pn1:by
b
b
b "
"
"
by N1/n:traÆc
traÆc
Figure1: TSGParseFragment
The rst template creates features that count
thenumberof timesa DCGinstantiationis present
within a parse. 3
For example, suppose we parsed
theWallStreetJournalAP:
1 unimpededbytraÆc
AparsetreegeneratedbyTSGmightbeasshown
ingure1. Here,tosaveonspace,wehavelabelled
each interiornode in the parse tree with TSGrule
names, and not attribute-value bundles.
Further-more, we have annotatedeach node with the head
word of the phrase in question. Within our
gram-mar, heads are (usually) explicitly marked. This
means we do not have to make any guesses when
identifying the headof a local tree. With head
in-formation,weareabletolexicalisemodels. Wehave
suppressedtagginginformation.
Forexample,afeaturedenedusingthistemplate
mightcountthenumberoftimesthewesaw:
AP/a1
A1/app1
inaparse. Suchfeaturesrecordsomeofthecontext
oftheruleapplication,inthatruleapplicationsthat
dier in termsof howattributes are bound will be
modelled bydierentfeatures.
Oursecondtemplatecreatesfeaturesthatare
par-tiallylexicalised. Foreach localtree (ofdepthone)
that has a PP daughter, we create a feature that
countsthenumberoftimesthatlocaltree,decorated
with thehead-wordofthePP,wasseenin aparse.
Anexampleofsuchalexicalisedfeaturewouldbe:
A1/app1
PP/p1:by
3
Note,allourfeaturessuppressanyterminalsthatappear
inalocaltree.Lexicalinformationisincludedwhenwedecide
ments that can be resolved using the head of the
PP.
Thethirdandnaltemplatecreatesfeaturesthat
areagainpartiallylexicalised. Thistime, wecreate
localtreesofdepth onethataredecoratedwiththe
headword. Forexample,hereisonesuchfeature:
AP/a1:unimpeded
A1/app1
Notethesecondandthirdtemplatesresultin
fea-turesthat overlap with features resulting from
ap-plicationsofthersttemplate.
WecreatethereferencedistributionR(an
associ-ationofprobabilitieswithTSGparsesof sentences,
suchthattheprobabilitiesreectparsepreferences)
usingthefollowingprocess:
1. Extract some sample
T
(using the approach
mentionedinsection3).
2. Foreachsentencein thesample,foreachparse
of that sentence, compute the `distance'
be-tween the TSG parse and the WSJ reference
parse. In our approach, distance is calculated
intermsofaweightedsumofcrossingrates,
re-callandprecision. Minimisingitmaximisesour
denitionofparseplausibility. 4
However,there
isnothinginherentlycrucialaboutthisdecision.
Any otherobjectivefunction (that canbe
rep-resented asan exponential distribution) could
beusedinstead.
3. Normalisethedistances,suchthatforsome
sen-tence, the sum of the distances of all
recov-eredTSGparsesforthatsentenceisaconstant
across allsentences. Normalisingin this
man-ner ensuresthat each sentence isequiprobable
(rememberthatRFMprobabilitiesareinterms
of parsepreferences,and notprobabilityof
oc-currencein somecorpus).
4. Map the normalised distances into
probabili-ties. If d(p)isthe normaliseddistance ofTSG
parse p, then associate with parse p the
refer-ence probability givenby the maximum
likeli-hoodestimator:
d(p)
P
x2t d(x)
(4)
Ourapproach therefore givespartial credit (a
non-zeroreferenceprobability)to allparsesin
t . R is
thereforenotasdiscontinuousastheequivalent
dis-tributionusedbyJohnsonetal. Wethereforedonot
needtousesimulatedannealingorothernumerically
intensivetechniques toestimatemodels.
4
Here wepresent two sets ofexperiments. The rst
setdemonstratetheexistenceofaninformative
sam-ple. Italsoshowssomeofthecharacteristicsofthree
sampling strategies. Thesecond setof experiments
islarger inscale, and showRFMs (bothlexicalised
and unlexicalised) estimated using sentences up to
30tokenslong. Also,theeectsofaGaussianprior
aredemonstratedasawayof(partially)dealingwith
overtting.
6.1 Testing theVarious Sampling
Strategies
Inordertoseehowvarioussizesofsamplerelatedto
estimation accuracy and whether wecould achieve
similar levelsof performance withoutrecoveringall
possibleparses,weranthefollowingexperiments.
Weusedamodelconsisting offeaturesthat were
dened using all three templates. We also threw
awayallfeaturesthatoccurredlessthantwotimesin
thetrainingset. Werandomlysplit theWallStreet
Journalinto disjoint training, held-out and testing
sets. All sentencesin thetrainingandheld-outsets
were atmost14 tokenslong. Sentences inthe
test-ing set were at most 30 tokens long. There were
6626 sentences in the training set, 98 sentences in
theheld-outsetand441sentencesinthetestingset.
Sentences in the held-out set had on average 12:6
parses,whilstsentencesinthetesting-sethadon
av-erage60:6parsespersentence.
Theheld-outsetwasusedto decidewhichmodel
performedbest. Actual performance of themodels
shouldbejudgedwithrespect tothetestingset.
Evaluationwasin termsofexactmatch: foreach
sentence in the test set, we awarded ourselves a
pointiftheRFMrankedhighestthesameparsethat
wasrankedhighestusingthereferenceprobabilities.
When evaluating with respect to the held-out set,
werecoveredallparsesforsentencesintheheld-out
set. Whenevaluatingwithrespecttothetesting-set,
werecoveredatmost100parsespersentence.
For each run, we ran IIS for the same number
of iterations (20). In each case, we evaluated the
RFMaftereachotheriterationandrecordedthebest
classication performance. This step was designed
toavoidoverttingdistortingourresults.
Figure2showstheresultsweobtainedwith
pos-sible ways of picking `typical' samples. The rst
column shows the maximum number of parses per
sentencesthat weretrievedineachsample.
Thesecond column showsthe size of the sample
(inparses).
Theothercolumnsgiveclassicationaccuracy
re-sults(a percentage) withrespecttothe testingset.
Inparentheses,wegiveperformancewithrespectto
theheld-outset.
Maxparses Size Rand SCFG Ref
1 6626 25.2(51.7) 23.3 (50.0) 23.4(50.0)
2 12331 37.9(63.0) 40.4 (60.3) 40.4(60.0)
3 17026 43.2(65.5) 43.7 (63.8) 43.7(63.8)
5 24878 43.7(70.2) 45.8 (69.5) 45.8(69.5)
10 39581 47.4 (72.0) 47.0 (70.0) 46.9(70.0)
100 119694 45.0(68.7) 45.0 (68.0) 45.0(68.0)
1000 246686 44.4(67.4) 43.0 (67.0) 43.0(67.0)
1 267400 43.0(66.0) 43.0 (66.0) 43.0(66.0)
Figure2: Resultswithvarioussamplingstrategies
of runs that used a sample that contained parses
which wererandomlyand uniformlyselectedoutof
theset of allpossibleparses. Theclassication
ac-curacyresultsforthissamplerareaveragedover10
runs.
The column marked SCFGshowstheresults
ob-tained when using a sample that contained parses
thatwereretrievedusingtheprobabilisticunpacking
strategy. Thisdidnotinvolveretrievingallpossible
parses for each sentence in the training set. Since
thereisnorandomcomponent,theresultsarefroma
singlerun. Here,parseswererankedusinga
stochas-ticcontextfreebackboneapproximationofTSG.
Pa-rameterswereestimatedusingsimplecounting.
Finally, the column marked Ref shows the
re-sultsobtained whenusing asample that contained
theoveralln-bestparses persentence, asdened in
termsofthereferencedistribution.
As a baseline, a model containing randomly
as-signedweightsproducedaclassicationaccuracyof
45%onthe held-out sentences. These resultswere
averagedover10runs.
As can be seen, increasing the sample size
pro-duces better results (for each sampling strategy).
Aroundasamplesizeof40kparses,overttingstarts
tomanifest, andperformance bottoms-out. Oneof
theseisthereforeourinformativesample. Notethat
thebestsample(40kparses)islessthan20%ofthe
totalpossibletrainingset.
The dierence between the various samplers is
marginal, with aslight preference for Rand.
How-everthefactthatSCFGsamplingseemstodoalmost
aswellasRandsampling,andfurthermoredoesnot
requireunpackingallparses,makesitthesampling
strategyofchoice.
SCFG sampling is biased in the sense that the
sample produced using it will tend to concentrate
around those parses that are all close to the best
parses. Rand sampling is unbiased, and, apart
fromthepracticalproblemsofhavingtorecoverall
parses,mightin somecircumstancesbebetterthan
SCFGsampling. Atthetimeof writingthis paper,
tionwithoutunpackingallparses. Wesuspectthat
for probabilistic unpacking to be eÆcient, it must
relyupon somenon-uniform distribution.
Unpack-ing randomly and uniformly would probably result
inalargelossincomputationaleÆciency.
6.2 Larger Scale Evaluation
Hereweshowresultsusingalargersampleand
test-ing set. We also show the eects of lexicalisation,
overtting,andoverttingavoidanceusinga
Gaus-sianprior. Strictlyspeakingthissectioncouldhave
beenomittedfromthepaper. However,ifoneviews
estimation using an informative sample as
overt-ting avoidance, then estimation using a Gaussian
prior can be seen as another, complementary take
ontheproblem.
The experimental setup wasas follows. We
ran-domly split the Wall Street Journal corpus into a
trainingset and atesting set. Both sets contained
sentencesthat wereat most 30tokens long. When
creatingthesetofparsesusedtoestimateRFMs,we
used the SCFG approach, and retained the top 25
parsespersentence. Withinthetrainingset(arising
from 16;200sentences), there were 405;020parses.
The testingset consisted of 466sentences, with an
averageof60:6parsespersentence.
Whenevaluating,weretrievedatmost100parses
persentenceinthetestingsetandscoredthemusing
our reference distribution. As before, we awarded
ourselvesapointifthe mostprobabletestingparse
(intermsoftheRMF)coincidedwiththemost
prob-ableparse(intermsofthereferencedistribution). In
allcases,weranIISfor100iterations.
For the rst experiment, we used just the rst
template (features that related to DCG
instantia-tions)tocreatemodel1;thesecondexperimentused
the rst and second templates (additional features
relatingto PPattachment)tocreate model2. The
nalexperimentusedallthreetemplates(additional
featuresthat werehead-lexicalised)tocreatemodel
3.
The three models contained 39;230, 65;568and
278;127featuresrespectively,
As a baseline, a model containing randomly
as-signed weights achieved a 22% classication
accu-racy. Theseresultswereaveragedover10runs.
Fig-ure3showstheclassicationaccuracyusingmodels
1,2and3.
As can be seen, the larger scale experimental
results were better than those achieved using the
smallersamples(mentionedinsection6.1). The
rea-son for this was becausewe used longer sentences.
Theinformativesamplederivablefromsucha
train-42
44
46
48
50
52
54
0
10
20
30
40
50
60
70
80
90
100
Accuracy
Iterations
model1
model2
model3
Figure3: ClassicationAccuracyforThreeModels
EstimatedusingBasicIIS
42
44
46
48
50
52
54
56
0
10
20
30
40
50
60
70
80
90
100
Accuracy
Iterations
model1
model2
model3
Figure4: ClassicationAccuracyforThreeModels
EstimatedusingaGaussianPriorand IIS
thepopulation) than theinformativesample
deriv-abledfromatrainingsetusingshorter, less
syntac-tically complex sentences. With the unlexicalised
model, we see clear signs of overtting. Model 2
overtsevenmoreso. Forreasonsthatare unclear,
we see that the larger model 3 does notappear to
exhibitovertting.
We next used the Gaussian Prior method of
Chen and Rosenfeld to reduce overtting (Chen
and Rosenfeld, 1999b). This involved integrating
a Gaussian prior (with a zero mean) into IIS and
searching for the model that maximised the
prod-uctofthelikelihoodandpriorprobabilities. Forthe
experiments reported here, we used a single
vari-anceoverthe entire model (betterresults mightbe
achievable ifmultiple varianceswere used, perhaps
with one variance per template type). The actual
valueof the variance wasfound by trial-and-error.
time using a Gaussian prior. Figure 4 shows the
classication accuracy of the models when using a
GaussianPrior.
WhenweusedaGaussianprior,wefoundthatall
models showed signs of improvement (allbeit with
varying degrees): performance either increased, or
else did not decrease with respect to the number
of iterations. Still,model2continued to
underper-form. Model 3 seemedmost resistent to the prior.
It therefore appears that a Gaussian prior is most
useful for unlexicalised models, and that for
mod-els built from complex, overlapping features, other
formsofsmoothingmustbeusedinstead.
7 Comments
WearguedthatRFMestimationforbroad-coverage
attribute-valued grammars could be made
compu-tationally tractable by training upon an
informa-tivesample. Oursmall-scaleexperimentssuggested
thatusing thoseparsesthatcouldbeeÆciently
un-packed(SCFGsampling)wasalmost aseectiveas
sampling from allpossibleparses (Randsampling).
Also, wesawthat models should notbe bothbuilt
andalsoestimated usingallpossibleparses. Better
resultscan be obtainedwhen models are builtand
trainedusing aninformativesample.
Given the relationship between sample size and
model complexity, weseethat whenthere isa
dan-gerofovertting,oneshouldbuildmodelsonthe
ba-sisof aninformativeset. However,thisleavesopen
thepossibilityof training such amodel upona
su-persetoftheinformativeset. Althoughwehavenot
testedthisscenario,webelievethatthiswouldlead
tobetterresultsthanthoseachievedhere.
The largerscale experimentsshowed that RFMs
can be estimated using relatively long sentences.
TheyalsoshowedthatasimpleGaussianpriorcould
reduce theeects ofovertting. However,theyalso
showedthat excessiveovertting probablyrequired
analternativesmoothingapproach.
The smallerand larger experimentscan be both
viewed as (complementary) ways of dealing with
overtting. We conjecture that of the two
ap-proaches,theinformativesampleapproachis
prefer-ableasit dealswith overtting directly: overtting
resultsfromttingtocomplexamodelwithtoo
lit-tledata.
Our ongoing research will concentrate upon
stronger ways of dealing with overtting in
lexi-calised RFMs. Oneline weare pursuingisto
com-bineacompression-basedpriorwith anexponential
model. ThisblendsMDLwithMaximumEntropy.
We arealso looking at alternativetemplate sets.
Forexample,wewould probablybenetfromusing
templatesthatcapturemoreofthesyntacticcontext
We would like to thank Rob Malouf, Donnla Nic
Gearailt and the anonymous reviewers for
com-ments. This work was supported by the TMR
ProjectLearning ComputationalGrammars.
References
Steven P. Abney. 1997. Stochastic
Attribute-Value Grammars. Computational Linguistics,
23(4):597{618,December.
Miles Osborne 1999. DCG Induction using MDL
and Parsed Corpora. In James Cussens, editor,
Learning Language in Logic, pages 63{71, Bled,
Slovenia,June.
Ted Briscoe and John Carroll. 1996. Automatic
Extraction of Subcategorization from Corpora.
In Proceedings of the 5 th
Conference on Applied
NLP,pages356{363,Washington,DC.
John Carroll and Ted Briscoe. 1992.
Probabilis-ticNormalisationandUnpackingofPackedParse
Forestsfor Unication-basedGrammars. In
Pro-ceedings of the AAAI Fall Symposium on
Prob-abilistic Approaches to Natural Language, pages
33{38,Cambridge,MA.
Stanley Chen and Ronald Rosenfeld. 1999a.
EÆ-cient Sampling and Feature Selection in Whole
SentenceMaximumEntropyLanguageModels. In
ICASSP'99.
Stanley F. Chen and Ronald Rosenfeld. 1999b.
A Gaussian Prior for Smoothing Maximum
En-tropyModels. TechnicalReportCMU-CS-99-108,
CarnegieMellonUniversity.
Eirik Hektoen. 1997. Probabilistic Parse Selection
Based on Semantic Cooccurrences. In
Proceed-ingsofthe5thInternationalWorskhoponParsing
Technologies, Cambridge, Massachusetts, pages
113{122.
Mark Johnson, Stuart Geman, Stephen Cannon,
ZhiyiChi,andStephanRiezler. 1999. Estimators
for Stochastic\Unication-based"Grammars. In
37 th
AnnualMeetingofthe ACL.
J. Laerty, S. Della Pietra, and V. Della Pietra.
1997. InducingFeaturesofRandomFields. IEEE
Transactions on Pattern Analysis and Machine
Intelligence,19(4):380{393,April.
Jorma Rissanen. 1989. Stochastic Complexity in
Statistical Inquiry. Seriesin ComputerScience