models with radically diering performance levels
Iain Bancarz and Miles Osborne
fiainrb,[email protected]
Division of Informatics
University of Edinburgh
2Buccleuch Place
Edinburgh EH89LW
Scotland
Abstract
Log-linear models can be eÆciently estimated
us-ing algorithms such as Improved Iterative Scaling
(IIS)(Laertyetal.,1997). Undercertainconditions
andforaparticularclassofproblems,IISis
guaran-teedto approachboththemaximum-likelihoodand
maximum entropy solution. This solution, in
like-lihood space, is unique. Unfortunately, in realistic
situations,multiplesolutionsmayexist,allofwhich
are equivalent to eachother in termsof likelihood,
but radically dierent from each other in terms of
performance. Weshowthatthisbehaviourcanoccur
whenamodelcontainsoverlappingfeaturesandthe
training material is sparse. Experimental results,
fromthedomainofparseselectionforstochastic
at-tribute value grammars, shows the wide variation
in performance that can be found whenestimating
models using IIS. Further resultsshow that the
in-uence of the initial model can be diminished by
selecting either uniform weights, or else by model
averaging.
1 Background
When statistically modelling linguistic phenomena
of onesortoranother, researcherstypically t
log-linearmodels to thedata(forexample(Johnson et
al., 1999)). There are (at least) three reasons for
thepopularityofsuchmodels: theydonotmake
un-warrantedindependenceassumptions,themaximum
likelihoodsolutionofsuchmodelscoincideswiththe
maximumentropysolution,andnally,theycanbe
eÆciently estimated (using algorithms such as
Im-provedIterativeScaling(IIS)(Laertyetal.,1997)).
Now, the solutionfound by IIS is guaranteed to
approachaglobal maximumforbothlikelihoodand
entropyunder certain conditions. Although this is
appealing, in realistic situations it turns out that
multiple modelsexist,allofwhichareequivalentin
termsof likelihood but dierent fromeachother in
termsoftheirperformanceatsometask. In
particu-lar,theinitialweightsettingscaninuencethe
qual-ityofthenal model, even thoughthisnal model
isthemaximumentropysolution(asfoundbyIIS).
algorithmisguaranteedtoconvergetoaglobally
op-timal solution regardlessof the initial parameters.
If the initial weights assigned to some of the
fea-tures are wildly inappropriate then the algorithm
maytakelongerto converge,but onewould expect
thenaldestination to remain thesame. However,
asweshowlater, what isunique in terms of
likeli-hood need not be unique in terms of performance,
andsoIIScanbesensitiveto theinitialweight
set-tings.
Some of the reason for this behaviour may lie
in a relatively subtle eect which we call
overlap-pingfeatures. 1
Ifsomefeaturesbehaveidenticallyin
thetrainingset (but notnecessarilyin future
sam-ples), IIS cannot distinguish between dierent sets
of weights for those features unless the sum of all
weights in each set is dierent. In fact, the nal
weightsassignedto such featureswillbedependent
ontheirinitialvalues. Undertheseconditions,there
willbeafamilyofmodels,allofwhichareidentical
asfar asIIS isconcerned, but distinguishable from
eachotherintermsofperformance. Thismeansthat
in terms of performance space, the landscape will
containlocal maxima. Intermsof likelihood space,
thelandscapewillcontinuetocontainjustthesingle
(global)maximum.
This indeterminacy is clearly undesirable.
How-ever,thereare(atleast)twopracticalwaysof
deal-ing with it: one either initialises IIS with weights
that are identical (0 is a good choice), or else one
takes a Bayesian approach, and averages over the
space of models. In this paper, we show that the
performance of IIS is sensitiveto the initial weight
settings. We showthat setting theweightsto zero
yields performance that is better than a large set
ofrandomlyinitialised models. Also, weshow that
modelaveragingcanalsoreduceindeterminacies
in-troducedthroughthechoiceofinitialweights.
The rest of this paper is as follows. Section
2 is a restatement of the theory behind IIS.
Sec-tion3showshowsparse trainingmaterial, coupled
1
Wedo not claimthat overlapping features are the sole
IIS produces suboptimal results. We then move
on to experimental support for our theoretical
re-sults. Our domain is parse selection for
broad-coverage stochastic attribute-valuegrammars.
Sec-tion 4 shows how we model the broad coverage
attribute-value grammar (mentioned in section 5
used in our experiments),and then present our
re-sults. Therst set ofexperiments(section7) deals
withhowwellIIS,withuniforminitialsettings.
out-performs models with randomised initial settings.
The second set of experiments shows how model
avergingcandealwiththeproblemofinitialisingthe
model. Weconcludethepaperwithsomecomments
onourwork.
2 The Duality Lemma
How can we be certain that IIS seeks out aglobal
maximumforbothlikelihoodandentropy? The
an-sweristhatfortheclassofproblemsunder
consider-ation,thereisonlyasinglemaximum-whichisthen
necessarily theglobal one. TheIIS algorithm
sim-ply `hill-climbs' andseeksto increasethelikelihood
ofthemodelbeingtrained. Itssuccessisguaranteed
becauseofaresultwhich werefertoas theDuality
Lemma 2
.
The proof of theDuality Lemma is containedin
(Laerty et al., 1997)but will be omittedhere. In
ordertostatethelemma,werstdenethesetting
andestablishsomenotation.
Supposethatwehaveaprobabilitymeasurespace
(;F;P),whereas usualis aset madeupof
ele-ments!, F isasigma-eldon ,and P is a
prob-abilitymeasure onF. Supposefurther that X is a
simplerandomvariableon-thatis tosay,itisa
real-valued function on , havingnite range, and
suchthat[!:X(!)=x]2F. Asusual,wewillomit
theargument!,sothatXindicatesageneralvalue
of thefunction as wellasthe function itself, and x
denotes[!:X(!)=x]. Wewillalsoabusenotation
byletting X denotethe set ofpossiblevaluesof x;
themeaningshouldbeclearfromcontext.
Wenowconsiderastochasticprocesson(;F;P).
Wearegivenasampleofpastoutputs(datapoints)
from thisprocesswhichmakeupthetrainingdata.
Weagainabusenotationbyusingp~torefertoboth
thesetofdatapointsandthedistributiondenedby
thatset. Letdenotethesetofallpossible
proba-bilitydistributions overX; weseekamodelq
2
whichisin somesense thebest possibleprobability
distributionoverfutureoutputs. Wespecically
ex-aminegeneralizedGibbsdistributions,whichareof
theform:
q
h (x)=
1
(Z
q (h))
exph(x) q(x) (1)
2
ThisresultwascalledProposition4(Laertyetal.,1997).
initialprobabilitydistributionoverX(whichmaybe
theuniformdistribution),andZ
q
(h)isanormalising
constant,takingavaluesuchthat P
x2X q
h
(x)=1.
Thefunction hheretakestheform:
h(x)= n
X
i=1
i f
i
(x)=(f)(x) (2)
where the f
i
(x) are integerfeature functions. The
realnumbers
i
are adjustableparameters.
Suppose that we are given the data p,~ an initial
model q
0
, and a set of features f. Laerty et al.
(1997)describetwonaturalsetsofmodels. Therst
isthesetP(f;p)~ ofalldistributionsthatagreewith
~
pastotheexpectedvalueofthefeaturefunctionf:
P(f;p)~ =[p2:p[f]=p~[f]] (3)
where, asusual,p[f]denotestheexpectationofthe
function f underthedistribution p.
The second is Q(f;q
0
), the set of generalized
Gibbsdistributionsbasedonq
0
andwithfeatureset
f:
Q(f;q
0
)=[(f)Æq
0
] (4)
Let Q denote the closure of Q in , with respect
tothetopology inheritsasasubsetof Euclidean
space.
These in turn determine two natural candidates
for the `best' model q
. Let D(pkq) denote the
Kullback-Leibler divergence between two
distribu-tionspandq. Thesuitablemodelsare:
MaximumLikelihoodGibbsDistribution. A
dis-tribution in Q with maximum likelihood with
respectto p:~ q ML
=argmin
q2Q
D(~pkq).
MaximumEntropyConstrainedDistribution. A
distribution in P with maximum entropy
rela-tiveto q
0 : q
ME
=argmin
q2P(f;~p ) D(pkq
0 )
The key result of Laerty et al. (1997) is that
there is a unique q
satisfying q
= q
ML
= q
ME
.
InAppendix 1ofthatpaper,thefollowingresultis
proved:
The Duality Lemma. SupposethatD(~pkq
0 )<
1. Thenthereexists auniqueq
2satisfying:
1. q
2P\Q
2. D(pkq)=D(pkq
)+D(q
kq)8p2P;q2Q
3. q
=argmin
q2Q
D(~pkq)
4. q
=argmin
q2P(f;~p) D(pkq
0 )
TheDualityLemmaisaveryusefulresult,
funda-mentaltotheabilityofIIStoconvergeuponthe
sin-glemaximumlikelihood,maximumentropysolution.
TheIISalgorithmitselfusesafairlystraightforward
techniquetolookforanymaximumofthelikelihood
function;becauseofthelemma,IISisguaranteedto
Data
3.1 Overlapping Features
Inthispaper,allmodelshavethesametrainingdata
and a uniform initial distribution q
0
. This ensures
that D(pkq
0
)1 andthat,forallmodels, theIIS
algorithmapproachesthesameoptimaldistribution
q
. However,ourexperimentsshowthatnotall
dis-tributionsobtainedbyrunningIIS(toconvergence)
areequallygoodatmodellingasetoftestdata.
Per-formanceappearstodepend(atleast)onthe
start-ingvaluesfor theweights
i
. Thismaybearesult
ofthefollowingsituation:
Consider two features, f
i and f
j
with i < j,
with weights
i and
j
respectively. Suppose that
f
i (x)=f
j
(x)forallvaluesofxin p,~ butthereexist
valuesof x outside p~such that f
i
(x) 6=f
j
(x); that
is, the two functions take exactly the same values
on the set of training data but dier outsideof it.
Thisphenomenoncanbecalledoverlapping. 3
Over-lapping features are commonly found in maximum
entropymodels,andareoneofthemainreasonsfor
their popularity. In fact, one could argue that all
maximumentropymodelsfoundinnaturallanguage
applications containoverlappingfeatures.
Overlap-pingfeaturesmaybepresentbyexplicitdesign(for
examplewhenemulatingbacking-osmoothing),or
else naturally, for example when using features to
treatwordsco-occurringwitheachother.
Weassigninitialweights (0)
i
and (0)
j
tothe
fea-tures, and after n iterationsof thealgorithm, they
havebeenadjustedto (n) i and (n) j respectively.
Nowconsiderthetargetsolutionq
,asdetermined
byq
0
andp. Letusassumeforconveniencethat q
isoftheform:
q = 1 Z exp( n X k =1 k f k ) (5) where Z
istheusual normalisingconstant. Notice
thatq
maynotbelongtothefamilyofexponential
models Q, but instead may be part of the larger
set Q. Indeed, della Pietra et al. state that P \
Q may be empty. This is not a serious limitation
asonecancomearbitrarily closeto any elementof
Q while remaining inside Q. Thus, if q
= 2 Q, we
mayconsidertheaboveexpressiontobeaveryclose
approximation to q
, such ascould beobtained by
runningIIStoconvergence.
Becausetheith andjthfeaturesareequalonthe
training data, the exact values of i and j have
no eect on the likelihood of the model as longas
3
Wecanrelaxthisdenitionofoverlappingandallow
fea-turestolargelyco-occur together. Dependingupon the
de-greeofco-occurrence,wewouldexpecttocontinuetondour
is nota singlemodel at all, but rather afamily of
models satisfying the condition i + j = ij for aparticular ij . 4
All models in this family assign
equallikelihoodtothetrainingdata,andsotheIIS
algorithmisunabletodistinguishbetweenthem.
Under thesecircumstancesweshouldensurethat
i = j
. Since we haveno way to distinguish
be-tween the two features, our model should assign
themequal importance. However,this is not
guar-anteedbytheIISalgorithm. Inparticular,itis less
likelytooccuriftheinitialweights (0) i and (0) j are notequal.
3.2 IISand Overlapping Features - A
SimpleExample
Supposethat ourmodel hasavectorof parameters
= [
i
:i=1:::n] and we wish to change it to
+Æ, where Æ=[Æ
i
:i=1:::n] TheIISalgorithm
considerseach
i
in turn. ItchoosesÆ
i
tomaximize
anauxiliaryfunctionB(Æj),whichprovidesalower
bound onthechangein log-likelihoodofthemodel.
Eachadjustment i i +Æ i
is guaranteedto
in-creasethelikelihoodofthemodelandthusapproach
ourideal solution q
. This process is repeated
un-til convergence. There are no inherentrestrictions
on the value of Æ
i
; it may be positive or negative,
andlargeorsmallcomparedtothevaluestakenfor
dierentfeatures.
Now, suppose that the features f
i and f
j
over-lap and we halt the algorithm after t iterations.
Clearly,thealgorithmcannotguaranteethatthe
-nalweights t i and t j
will be equal. The following
highly simplied example should demonstrate why
thisisthecase.
Example 1: Imagine that our model has two
overlapping features f
1 and f
2 , and q
is the
fam-ily of models in which
1 +
2
= 5. Suppose that
theinitialweightsare (0)
1
=5; 0
2
=0. Recall that
at the ith step of an iteration, the IIS algorithm
considersthe change in log-likelihood of the model
which can be made by only adjusting the ith
pa-rameter. In this case no change is possible asthe
sum
1 +
2
isalreadyatitsoptimumvalue,sothe
algorithmterminates with t
1
=5and t
2
=0. We
haveassigned fargreater importance to f
1
than to
f
2
with no justication for doing so. Clearly, this
assumptionmightnotbewarranted.
3.3 IISand Overlapping Features in
Practice
Thesituation will obviouslybemuch more
compli-catedinpractice. Mostimportantly,thereisnosuch
4
The number
ij
is not a `constant' as such, since any
vectorofweightsisinasenseuniqueonlyuptomultiplication
. As aresult, nooneparametercanberegarded
asxeduntilthealgorithmhasconvergedforall
pa-rameters. In the aboveexample, our \true" set of
targetsolutionsisthosefor which
1 +
2
isa
con-stant. Sincetherearenootherfeatures,IISwouldin
factterminateatonceforany pairofinitialweights.
Suppose that weadded a third feature f
3 which
didnotoverlapthersttwo. Oursetoftarget
solu-tionswouldthenbeoftheform
1 +
2 =
12 k;
3 =
3
k for some xed
12 and
3
and for any k > 0.
Theadjustmentsmadeto
1 and
2
willdependon
thosemadeto
3
,andthealgorithmwillnot
neces-sarilyterminate aftertherstpass.
Thefollowingexamplewilldescribewhathappens
in this morerealistic situation. It will also explain
thepossiblebenets ofsetting allinitial weightsto
thesamevalue.
Example 2: Let the model have the three
fea-tures described above, with initial weights 0
1 =
5; 0
2 =0;
0
3
=1andafamilyoftargetsolutionsq
dened by
1 +
2
=5k;
3
=k foranyk>0. IIS
againterminates at once,becausetheinitial model
is already partof the family q
. Againthere is an
unjustied dierence between the nal weights for
f
1 andf
2 .
Now suppose that all three initial weights are
set to zero. Imagine for simplicity that we have
restricted the algorithm to adjust weights only by
zeroor1. Onthe rstpass, allthree weights are
changedto1. Onthesecond,
1 and
2
areincreased
to2;
3
remainsat1. Inthethirditeration,
1 isset
to3,
2
remainsat2and
3
at1,andthealgorithm
terminates. 5
Noticethat,althoughthereisstilladierence
be-tweenthenalweightsforf
1 andf
2
,itismuchless
thanbefore. Thismorecloselyapproachestheideal
situationin which theweightsareequal.
This concludes our theoretical treatment of IIS.
Wenowshowexperimentallytheinuencethat the
initial weight settings have, and how it can be
minimised. Our strategy is to use plausible
fea-tures, as found in a realistic domain (parse
selec-tion for stochastic attribute-value grammars), and
rstly show what happens when the initial weight
settings areset uniformly (to zero). We thenshow
what happens when these initial settings are
ran-domly set, and nally, what happens when we
av-erageoverrandomlyinitialisedmaximumlikelihood
solutions.
5
Theexactbehaviourofthealgorithmwilldependonthe
trainingdata,sothisisnottheonlyimaginableoutcome,but
Attribute-Value Grammars
Hereweshowhowattribute-valuegrammarsmaybe
modelledusinglog-linearmodels. Abneygivesfuller
details(Abney,1997).
LetGbeanattribute-valuegrammar,andDaset
of sentences within the string-set dened byL(G).
Alog-linear model, M, consist oftwo components:
asetoffeatures,F andaset ofweights,.
The (unnormalised) total weight of a parse x,
(x),isafunction ofthekfeaturesthatare`active'
onaparse:
(x)=exp( k
X
i=1
i f
i
(x)) (6)
The probability of a parse, P(x j M), is simply
theresultofnormalisingthetotalweightassociated
withthatparse:
P(xjM)= 1
Z
(x) (7)
Z= X
y2
(y) (8)
istheunionof theset of parsesassigned toeach
sentence in D by the grammar G, such that each
parseinisuniqueintermsofthefeaturesthatare
activeonit. Normallyaparsecanbeviewedasthe
set offeaturesthatareactiveonit.
Theinterpretationofthisprobability(equation7)
depends upon the application of the model. Here,
we useparse probabilities to reect preferences for
parses.
5 The Grammar
The grammar we model with log-linear models
(called the Tag Sequence Grammar (Briscoe and
Carroll, 1996),orTSGforshort) wasmanually
de-velopedwithregardtocoverage,andwhencompiled
consists of 455 Denite Clause Grammar (DCG)
rules. Itdoesnotparsesequencesofwordsdirectly,
butinsteadassignsderivationstosequencesof
part-of-speech tags (using the CLAWS2 tagset). The
grammar is relativelyshallow(for example, itdoes
notfullyanalyseunboundeddependencies).
6 Modelling the Grammar
ModellingtheTSGwithrespecttotheparsedWall
Street Journal consists of two steps: creation of a
featuresetanddenitionofareferencedistribution
(thetargetmodel, p).~
6.1 Feature Set
AP/a1:unimpeded
A1/app1:unimpeded
a
a
a
a !
!
!
!
unimpeded PP/p1:by
P1/pn1:by
b
b
b "
"
"
by N1/n:traÆc
traÆc
Figure1: TSGParseFragment
feature set and denition of the reference
distribu-tion.
Our feature set is created by parsing sentences
in thetraining set,and usingeach parse to
instan-tiate templates. Each template denes afamily of
features. Our templates are motivated by the
ob-servationsthatlinguistically-stipulatedunits (DCG
rules) are informative, and that many DCG
appli-cations in preferred parses can be predicted using
lexicalinformation.
The rst template creates features that count
thenumberof timesa DCGinstantiationis present
within a parse. 6
For example, suppose we parsed
theWallStreetJournalAP:
1 unimpededbytraÆc
AparsetreegeneratedbyTSGmightbeasshown
ingure1. Here,tosaveonspace,wehavelabelled
each interiornode in the parse tree with TSGrule
names, and not attribute-value bundles.
Further-more, we have annotatedeach node with the head
word of the phrase in question. Within our
gram-mar, heads are (usually) explicitly marked. This
means we do not have to make any guesses when
identifying the headof a local tree. With head
in-formation,weareabletolexicalisemodels. Wehave
suppressedtagginginformation.
Forexample,afeaturedenedusingthistemplate
mightcountthenumberoftimesthewesaw:
AP/a1
A1/app1
inaparse. Suchfeaturesrecordsomeofthecontext
oftheruleapplication,inthatruleapplicationsthat
6
Note,allourfeaturessuppressanyterminalsthatappear
inalocaltree.Lexicalinformationisincludedwhenwedecide
modelledbydierentfeatures.
Oursecondtemplatecreatesfeaturesthatare
par-tiallylexicalised. Foreachlocal tree(ofdepth one)
that has a PP daughter, we create a feature that
countsthenumberoftimesthatlocaltree,decorated
withthehead-wordof thePP, wasseenin aparse.
Anexampleof suchalexicalisedfeaturewould be:
A1/app1
PP/p1:by
These features are designed to model PP
attach-ments that can be resolved using the head of the
PP.
Thethirdandnaltemplatecreatesfeaturesthat
are againpartiallylexicalised. Thistime, wecreate
localtreesofdepth onethataredecoratedwiththe
headword. Forexample,hereisonesuchfeature:
AP/a1:unimpeded
A1/app1
Notethesecondandthirdtemplatesresultin
fea-turesthat overlap with features resulting from
ap-plicationsofthersttemplate.
6.2 ReferenceDistribution
Wecreate thereferencedistribution R (an
associa-tionof probabilities with TSG parses of sentences,
suchthattheprobabilitiesreectparsepreferences)
usingthefollowingprocess:
1. Take the training set of parses and for each
parse, compare the structural dierences
be-tweenitandareferencetreebankparse.
2. Mapthese treesimilarityscoresinto
probabili-ties(wherethesumofallreferenceprobabilities
forallparsessumstoone).
Again,seeAnon formoredetails.
This concludes our discussion of how we model
grammars. Wenowgoontopresentour
experimen-talinvestigationoftheinuenceoftheinitialweight
settings.
7 Experiments
Here we present three sets of experiments. The
rstsetshowstheperformanceofmaximumentropy
when the initial weight setting are zero. The
sec-ondsetshowtheeectsofrandomisedinitialsetting,
andso establishes (anestimateof) thevariation in
performance space. The third set of experiments
showed how the inuence of the initial weight
Throughout,weused thesametrainingset. This
consistedofasampleof53795parses(producedfrom
sentencesat most 15 tokenslong, with at most 15
parses per sentence). The sentences were drawn
from the parsedWall Street Journal,and all could
beparsed using our grammar. Themotivation for
this choice of training set came from thefact that
whenthesampleofsentencesistoosmall,the
result-ingmodel willtendtoundert. Likewise, whenthe
trainingsetistoolarge,themodelwilltendto
over-t. Asampleofappropriatesize(whichcanbefound
using a simple search, as Osborne (2000)
demon-strated)will thereforeneither signicantly undert
nor overt. Quiteapart from estimation issues
re-latedtosamplesize,becausewerepeatedlyestimate
models,usingasamplethatisjustsuÆcientlylarge
(andnolarger)allowsustomakesignicant
compu-tationalsavings.
We used a disjoint development set and testing
set. Thedevelopment set consisted of 2620parses,
derived from parsing sentences at most 30 tokens
long,withatmost100parsespersentence. The
test-ingsetwasrandomlysampledfrom theWallStreet
Journal, and consisted of 469 sentences, with each
sentenceat most30 tokens long,with at most100
parses persentence. Each sentence hason average
60:0parsespersentence.
The model we used throughout contained 75171
features.
Forallexperiments,weranIISforthesame
num-berofiterations(75). Thisnumberwasselectedas
thenumberofiterationsthatproducedthebest
per-formanceformaximumentropy(ouryardstick).
Evaluationwasin termsofexactmatch: foreach
sentence in the test set, we awarded ourselves a
pointiftheestimatedmodelrankedhighestthesame
parse that was ranked highest using the reference
probabilities. 7
Tieswererandomlybroken.
7.1 MaximumEntropy Resultsusing
Uniform Initialisation
Amodeltrainedusingweightsallinitialisedtozero
yieldedanexactmatchscoreof52:0%.
7.2 RandomisedModels
A poolof 10;000 models was createdby randomly
setting the initial weights to values in the range
[-0.3,0.3]and thenestimating thenal weightsusing
thetrainingset.
The histogram in gure 2 shows the number of
models that producedthesameresultsonthe
test-ingmaterial. Ascanbeseen,performanceisroughly
normally distributed,with somelocal minima
hav-ingamuchwiderbasinofattractionthanother
min-ima. Also, note that all of these models
underper-7
Weuse exact match as an arbitraryevaluation metric.
Theparticularchoiceofmetricisnotcrucialtotheresultsof
forms amodel cratedusing uniform initialisation.
The performance of the worst model was 24:1%.
Thisis our lowerbound,and is less thanhalf that
ofbasicmaxent. Thebestrandomlyselectedmodel
hadaperformanceof46:5%.
0
50
100
150
200
250
300
350
400
450
20
25
30
35
40
45
50
Number of Models
Exact Match
"rand-hist04"
Figure2: Distributionofmodelswithrandomly
ini-tialisedstartingconditions
7.3 Averaging over models
To see whether an ensemble of such randomised
models could cancel-out the inuence of the
ini-tial weight settings, we createda pool of 600
ran-domised models, and then combined then together
using equation9:
P(xjM
1 :::M
n )=
Q
n
i=1
P(xjM
i )
P
y Q
n
j=1
P(yjM
j )
(9)
Becauseitispossiblethatsomesubsetofthe
mod-els outperforms an ensemble using all models, we
uniformlysampled,withreplacement,fromthispool
models for inclusion into the nal ensemble.
Ran-dom selection introduces variation, so we repeated
ourresultstentimeandaveragedtheresults.
Figure 3 shows our results (N is the number of
modelsineachensemble,xisthemeanperformance
and 2
is the standarddeviation). The nal entry
(markedall)showstheperformanceobtainedusing
anensemble consisting of allmodels in the model,
equallyweighted.
As can be seen, increasing the number of models
in the ensemble reduces the inuence of the initial
weight settings. However, even with a large pool
ofmodels,westillmarginallyunderperformuniform
initialisation.
8
Running IISuntilconvergence didnot narrow the gap,
N x N x
1 38.78 0.12 50 49.94 1.24
2 41.41 1.04 100 50.55 0.70
3 44.24 1.63 150 50.62 0.88
5 46.57 1.69 200 50.66 0.75
10 47.21 1.71 300 50.81 0.73
20 49.23 1.19 600 51.04 0.36
all 51.39
-Figure3: Modelaveragingresults
Notethatwehavenotcarriedoutthecontrol
ex-perimentofrepeatingourrunsusingfeatureswhich
donotoverlapwitheachother. Unfortunately,doing
this would probablymean having to create models
that are unnatural. We therefore cannot be
abso-lutelycertain that thesolereasonfor theexistance
of local optima is the presence of overlapping
fea-tures. However,wecanbesurethat they doexist,
and that varying theinitial weightsettings will
re-vealthem.
8 Comments
Wehaveestablishedthat,inthepresenceof
overlap-ping features, the values of the initial weights can
aect the utility of the nal model. Our
experi-mental resultssupport thisnding. Ingeneral,one
mightencounterasimilarproblem(overlapping
fea-tures) with what mightbecalled `semi-overlapping
features': features which are very similar (but not
identical) on the training data and very dierent
outsideof it.
IIS could be made more robust to the choice of
initialparametersinanumberofways:
The simplest course of action is to set all
ini-tialweights to the samevalue. Although zero
is often a convenient initial value, in principle
any real number would do, since theIIS
algo-rithmcan reach agivenoptimal solutionfrom
anystartingpointinthespaceofinitial
param-eters.
We could also examine all the features to
de-terminewhichonesoverlap,andforceitto
bal-ancethenalweightsofthesefeatures. Forvery
largemodels,thismaybeprohibitivelydiÆcult
andtime-consuming.
Model averagingcanalso canceloutvariations
causedbyaparticularchoiceofinitialsettings.
However, this implies a greater computational
burdenasIISwillneedtoberunmanytimesin
ordertogainarepresentativesampleofmodels.
Thenumber offeatures in themodel couldbe
reducedusingfeatureselectionmethods(for
ex-ample(MullenandOsborne,2000)).
Although IIS is a useful tool for estimating
log-modelsusinglimited-memoryvariable-metric
meth-ods (Malouf,2002). Ourndingsshowthat
conver-gence,for arange of problems, is faster. An
inter-esting questionis seeing the extent to which other
numericalmethods forestimating log-linearmodels
aresensitive toinitial parametervalues. Finally, it
shouldbenotedthatourtheoreticalresultsapplyto
amoregeneralsetting thanthat oflog-linear
mod-elstrainedusingtheIISalgorithm. Theproblemof
overlappingfeaturescouldin principleoccurin any
situationinwhichamodelhasalinearcombination
offeatures,anda`hill-climbing'algorithmisusedto
seekamaximum-likelihoodsolution.
Acknowledgements
WewishtothankSteveClarkforusefuldiscussions
aboutIIS, RobMalouffor supplyingtheIIS
imple-mentation and the twoanonymous reviewers. Iain
BancarzwassupportedbytheEPSRCgrantPOEM.
References
Steven P. Abney. 1997. Stochastic
Attribute-Value Grammars. Computational Linguistics,
23(4):597{618,December.
Ted Briscoe and John Carroll. 1996. Automatic
Extraction of Subcategorization from Corpora.
In Proceedings of the 5 th
Conference on Applied
NLP,pages356{363,Washington,DC.
Mark Johnson, Stuart Geman, Stephen Cannon,
ZhiyiChi,andStephanRiezler. 1999. Estimators
forStochastic\Unication-Based"Grammars. In
37 th
AnnualMeetingofthe ACL.
J. Laerty, S. Della Pietra, and V. Della Pietra.
1997. Inducing features of random elds. IEEE
Transactions on Pattern Analysis and Machine
Intelligence,19(4):380{393,April.
Robert Malouf. 2002. A comparison of algorithms
for maximum entropy parameter estimation. In
Proceedings of the joint CoNLL-WVLC Meeting,
Taipei,Taiwan.ACL. Toappear.
TonyMullenand MilesOsborne. 2000. Overtting
Avoidance for Stochastic Modeling of
Attribute-ValueGrammars. InClaireCardie,Walter
Daele-mans,ClaireNedellec,andErikTjongKimSang,
editors,Proceedingsofthe ComputationalNatural
Language learning 2000, pages 49{54.ACL,
Lis-bon,Portugal.
Kamal Nigam, John Laerty, , and Andrew
Mc-Callum. 1999. Using maximum entropy for text
classication. InIJCAI-99WorkshoponMachine
Learning for InformationFiltering,.
Miles Osborne. 2000. Estimation of Stochastic
Attribute-Value Grammars using anInformative
Sample. InThe18 th
International Conferenceon