Improved Iterative Scaling Can Yield Multiple Globally Optimal Models with Radically Differing Performance Levels

(1)

models with radically diering performance levels

Iain Bancarz and Miles Osborne

fiainrb,[email protected]

Division of Informatics

University of Edinburgh

2Buccleuch Place

Edinburgh EH89LW

Scotland

Abstract

Log-linear models can be eÆciently estimated

us-ing algorithms such as Improved Iterative Scaling

(IIS)(Laertyetal.,1997). Undercertainconditions

andforaparticularclassofproblems,IISis

guaran-teedto approachboththemaximum-likelihoodand

maximum entropy solution. This solution, in

like-lihood space, is unique. Unfortunately, in realistic

situations,multiplesolutionsmayexist,allofwhich

are equivalent to eachother in termsof likelihood,

but radically dierent from each other in terms of

performance. Weshowthatthisbehaviourcanoccur

whenamodelcontainsoverlappingfeaturesandthe

training material is sparse. Experimental results,

fromthedomainofparseselectionforstochastic

at-tribute value grammars, shows the wide variation

in performance that can be found whenestimating

models using IIS. Further resultsshow that the

in-uence of the initial model can be diminished by

selecting either uniform weights, or else by model

averaging.

1 Background

When statistically modelling linguistic phenomena

of onesortoranother, researcherstypically t

log-linearmodels to thedata(forexample(Johnson et

al., 1999)). There are (at least) three reasons for

thepopularityofsuchmodels: theydonotmake

un-warrantedindependenceassumptions,themaximum

likelihoodsolutionofsuchmodelscoincideswiththe

maximumentropysolution,andnally,theycanbe

eÆciently estimated (using algorithms such as

Im-provedIterativeScaling(IIS)(Laertyetal.,1997)).

Now, the solutionfound by IIS is guaranteed to

approachaglobal maximumforbothlikelihoodand

entropyunder certain conditions. Although this is

appealing, in realistic situations it turns out that

multiple modelsexist,allofwhichareequivalentin

termsof likelihood but dierent fromeachother in

termsoftheirperformanceatsometask. In

particu-lar,theinitialweightsettingscaninuencethe

qual-ityofthenal model, even thoughthisnal model

isthemaximumentropysolution(asfoundbyIIS).

algorithmisguaranteedtoconvergetoaglobally

op-timal solution regardlessof the initial parameters.

If the initial weights assigned to some of the

fea-tures are wildly inappropriate then the algorithm

maytakelongerto converge,but onewould expect

thenaldestination to remain thesame. However,

asweshowlater, what isunique in terms of

likeli-hood need not be unique in terms of performance,

andsoIIScanbesensitiveto theinitialweight

set-tings.

Some of the reason for this behaviour may lie

in a relatively subtle eect which we call

overlap-pingfeatures. 1

Ifsomefeaturesbehaveidenticallyin

thetrainingset (but notnecessarilyin future

sam-ples), IIS cannot distinguish between dierent sets

of weights for those features unless the sum of all

weights in each set is dierent. In fact, the nal

weightsassignedto such featureswillbedependent

ontheirinitialvalues. Undertheseconditions,there

willbeafamilyofmodels,allofwhichareidentical

asfar asIIS isconcerned, but distinguishable from

eachotherintermsofperformance. Thismeansthat

in terms of performance space, the landscape will

containlocal maxima. Intermsof likelihood space,

thelandscapewillcontinuetocontainjustthesingle

(global)maximum.

This indeterminacy is clearly undesirable.

How-ever,thereare(atleast)twopracticalwaysof

deal-ing with it: one either initialises IIS with weights

that are identical (0 is a good choice), or else one

takes a Bayesian approach, and averages over the

space of models. In this paper, we show that the

performance of IIS is sensitiveto the initial weight

settings. We showthat setting theweightsto zero

yields performance that is better than a large set

ofrandomlyinitialised models. Also, weshow that

modelaveragingcanalsoreduceindeterminacies

in-troducedthroughthechoiceofinitialweights.

The rest of this paper is as follows. Section

2 is a restatement of the theory behind IIS.

Sec-tion3showshowsparse trainingmaterial, coupled

1

Wedo not claimthat overlapping features are the sole

(2)

IIS produces suboptimal results. We then move

on to experimental support for our theoretical

re-sults. Our domain is parse selection for

broad-coverage stochastic attribute-valuegrammars.

Sec-tion 4 shows how we model the broad coverage

attribute-value grammar (mentioned in section 5

used in our experiments),and then present our

re-sults. Therst set ofexperiments(section7) deals

withhowwellIIS,withuniforminitialsettings.

out-performs models with randomised initial settings.

The second set of experiments shows how model

avergingcandealwiththeproblemofinitialisingthe

model. Weconcludethepaperwithsomecomments

onourwork.

2 The Duality Lemma

How can we be certain that IIS seeks out aglobal

maximumforbothlikelihoodandentropy? The

an-sweristhatfortheclassofproblemsunder

consider-ation,thereisonlyasinglemaximum-whichisthen

necessarily theglobal one. TheIIS algorithm

sim-ply `hill-climbs' andseeksto increasethelikelihood

ofthemodelbeingtrained. Itssuccessisguaranteed

becauseofaresultwhich werefertoas theDuality

Lemma 2

.

The proof of theDuality Lemma is containedin

(Laerty et al., 1997)but will be omittedhere. In

ordertostatethelemma,werstdenethesetting

andestablishsomenotation.

Supposethatwehaveaprobabilitymeasurespace

(;F;P),whereas usualis aset madeupof

ele-ments!, F isasigma-eldon ,and P is a

prob-abilitymeasure onF. Supposefurther that X is a

simplerandomvariableon-thatis tosay,itisa

real-valued function on , havingnite range, and

suchthat[!:X(!)=x]2F. Asusual,wewillomit

theargument!,sothatXindicatesageneralvalue

of thefunction as wellasthe function itself, and x

denotes[!:X(!)=x]. Wewillalsoabusenotation

byletting X denotethe set ofpossiblevaluesof x;

themeaningshouldbeclearfromcontext.

Wenowconsiderastochasticprocesson(;F;P).

Wearegivenasampleofpastoutputs(datapoints)

from thisprocesswhichmakeupthetrainingdata.

Weagainabusenotationbyusingp~torefertoboth

thesetofdatapointsandthedistributiondenedby

thatset. Letdenotethesetofallpossible

proba-bilitydistributions overX; weseekamodelq

2

whichisin somesense thebest possibleprobability

distributionoverfutureoutputs. Wespecically

ex-aminegeneralizedGibbsdistributions,whichareof

theform:

q

h (x)=

1

(Z

q (h))

exph(x) q(x) (1)

2

ThisresultwascalledProposition4(Laertyetal.,1997).

initialprobabilitydistributionoverX(whichmaybe

theuniformdistribution),andZ

q

(h)isanormalising

constant,takingavaluesuchthat P

x2X q

h

(x)=1.

Thefunction hheretakestheform:

h(x)= n

X

i=1

i f

i

(x)=(f)(x) (2)

where the f

i

(x) are integerfeature functions. The

realnumbers

i

are adjustableparameters.

Suppose that we are given the data p,~ an initial

model q

0

, and a set of features f. Laerty et al.

(1997)describetwonaturalsetsofmodels. Therst

isthesetP(f;p)~ ofalldistributionsthatagreewith

~

pastotheexpectedvalueofthefeaturefunctionf:

P(f;p)~ =[p2:p[f]=p~[f]] (3)

where, asusual,p[f]denotestheexpectationofthe

function f underthedistribution p.

The second is Q(f;q

0

), the set of generalized

Gibbsdistributionsbasedonq

0

andwithfeatureset

f:

Q(f;q

0

)=[(f)Æq

0

] (4)

Let Q denote the closure of Q in , with respect

tothetopology inheritsasasubsetof Euclidean

space.

These in turn determine two natural candidates

for the `best' model q

. Let D(pkq) denote the

Kullback-Leibler divergence between two

distribu-tionspandq. Thesuitablemodelsare:

MaximumLikelihoodGibbsDistribution. A

dis-tribution in Q with maximum likelihood with

respectto p:~ q ML

=argmin

q2Q

D(~pkq).

MaximumEntropyConstrainedDistribution. A

distribution in P with maximum entropy

rela-tiveto q

0 : q

ME

=argmin

q2P(f;~p ) D(pkq

0 )

The key result of Laerty et al. (1997) is that

there is a unique q

satisfying q

= q

ML

= q

ME

.

InAppendix 1ofthatpaper,thefollowingresultis

proved:

The Duality Lemma. SupposethatD(~pkq

0 )<

1. Thenthereexists auniqueq

2satisfying:

1. q

2P\Q

2. D(pkq)=D(pkq

)+D(q

kq)8p2P;q2Q

3. q

=argmin

q2Q

D(~pkq)

4. q

=argmin

q2P(f;~p) D(pkq

0 )

TheDualityLemmaisaveryusefulresult,

funda-mentaltotheabilityofIIStoconvergeuponthe

sin-glemaximumlikelihood,maximumentropysolution.

TheIISalgorithmitselfusesafairlystraightforward

techniquetolookforanymaximumofthelikelihood

function;becauseofthelemma,IISisguaranteedto

(3)

Data

3.1 Overlapping Features

Inthispaper,allmodelshavethesametrainingdata

and a uniform initial distribution q

0

. This ensures

that D(pkq

0

)1 andthat,forallmodels, theIIS

algorithmapproachesthesameoptimaldistribution

q

. However,ourexperimentsshowthatnotall

dis-tributionsobtainedbyrunningIIS(toconvergence)

areequallygoodatmodellingasetoftestdata.

Per-formanceappearstodepend(atleast)onthe

start-ingvaluesfor theweights

i

. Thismaybearesult

ofthefollowingsituation:

Consider two features, f

i and f

j

with i < j,

with weights

i and

j

respectively. Suppose that

f

i (x)=f

j

(x)forallvaluesofxin p,~ butthereexist

valuesof x outside p~such that f

i

(x) 6=f

j

(x); that

is, the two functions take exactly the same values

on the set of training data but dier outsideof it.

Thisphenomenoncanbecalledoverlapping. 3

Over-lapping features are commonly found in maximum

entropymodels,andareoneofthemainreasonsfor

their popularity. In fact, one could argue that all

maximumentropymodelsfoundinnaturallanguage

applications containoverlappingfeatures.

Overlap-pingfeaturesmaybepresentbyexplicitdesign(for

examplewhenemulatingbacking-osmoothing),or

else naturally, for example when using features to

treatwordsco-occurringwitheachother.

Weassigninitialweights (0)

i

and (0)

j

tothe

fea-tures, and after n iterationsof thealgorithm, they

havebeenadjustedto (n) i and (n) j respectively.

Nowconsiderthetargetsolutionq

,asdetermined

byq

0

andp. Letusassumeforconveniencethat q

isoftheform:

q = 1 Z exp( n X k =1 k f k ) (5) where Z

istheusual normalisingconstant. Notice

thatq

maynotbelongtothefamilyofexponential

models Q, but instead may be part of the larger

set Q. Indeed, della Pietra et al. state that P \

Q may be empty. This is not a serious limitation

asonecancomearbitrarily closeto any elementof

Q while remaining inside Q. Thus, if q

= 2 Q, we

mayconsidertheaboveexpressiontobeaveryclose

approximation to q

, such ascould beobtained by

runningIIStoconvergence.

Becausetheith andjthfeaturesareequalonthe

training data, the exact values of i and j have

no eect on the likelihood of the model as longas

3

Wecanrelaxthisdenitionofoverlappingandallow

fea-turestolargelyco-occur together. Dependingupon the

de-greeofco-occurrence,wewouldexpecttocontinuetondour

is nota singlemodel at all, but rather afamily of

models satisfying the condition i + j = ij for aparticular ij . 4

All models in this family assign

equallikelihoodtothetrainingdata,andsotheIIS

algorithmisunabletodistinguishbetweenthem.

Under thesecircumstancesweshouldensurethat

i = j

. Since we haveno way to distinguish

be-tween the two features, our model should assign

themequal importance. However,this is not

guar-anteedbytheIISalgorithm. Inparticular,itis less

likelytooccuriftheinitialweights (0) i and (0) j are notequal.

3.2 IISand Overlapping Features - A

SimpleExample

Supposethat ourmodel hasavectorof parameters

= [

i

:i=1:::n] and we wish to change it to

+Æ, where Æ=[Æ

i

:i=1:::n] TheIISalgorithm

considerseach

i

in turn. ItchoosesÆ

i

tomaximize

anauxiliaryfunctionB(Æj),whichprovidesalower

bound onthechangein log-likelihoodofthemodel.

Eachadjustment i i +Æ i

is guaranteedto

in-creasethelikelihoodofthemodelandthusapproach

ourideal solution q

. This process is repeated

un-til convergence. There are no inherentrestrictions

on the value of Æ

i

; it may be positive or negative,

andlargeorsmallcomparedtothevaluestakenfor

dierentfeatures.

Now, suppose that the features f

i and f

j

over-lap and we halt the algorithm after t iterations.

Clearly,thealgorithmcannotguaranteethatthe

-nalweights t i and t j

will be equal. The following

highly simplied example should demonstrate why

thisisthecase.

Example 1: Imagine that our model has two

overlapping features f

1 and f

2 , and q

is the

fam-ily of models in which

1 +

2

= 5. Suppose that

theinitialweightsare (0)

1

=5; 0

2

=0. Recall that

at the ith step of an iteration, the IIS algorithm

considersthe change in log-likelihood of the model

which can be made by only adjusting the ith

pa-rameter. In this case no change is possible asthe

sum

1 +

2

isalreadyatitsoptimumvalue,sothe

algorithmterminates with t

1

=5and t

2

=0. We

haveassigned fargreater importance to f

1

than to

f

2

with no justication for doing so. Clearly, this

assumptionmightnotbewarranted.

3.3 IISand Overlapping Features in

Practice

Thesituation will obviouslybemuch more

compli-catedinpractice. Mostimportantly,thereisnosuch

4

The number

ij

is not a `constant' as such, since any

vectorofweightsisinasenseuniqueonlyuptomultiplication

(4)

. As aresult, nooneparametercanberegarded

asxeduntilthealgorithmhasconvergedforall

pa-rameters. In the aboveexample, our \true" set of

targetsolutionsisthosefor which

1 +

2

isa

con-stant. Sincetherearenootherfeatures,IISwouldin

factterminateatonceforany pairofinitialweights.

Suppose that weadded a third feature f

3 which

didnotoverlapthersttwo. Oursetoftarget

solu-tionswouldthenbeoftheform

1 +

2 =

12 k;

3 =

3

k for some xed

12 and

3

and for any k > 0.

Theadjustmentsmadeto

1 and

2

willdependon

thosemadeto

3

,andthealgorithmwillnot

neces-sarilyterminate aftertherstpass.

Thefollowingexamplewilldescribewhathappens

in this morerealistic situation. It will also explain

thepossiblebenets ofsetting allinitial weightsto

thesamevalue.

Example 2: Let the model have the three

fea-tures described above, with initial weights 0

1 =

5; 0

2 =0;

0

3

=1andafamilyoftargetsolutionsq

dened by

1 +

2

=5k;

3

=k foranyk>0. IIS

againterminates at once,becausetheinitial model

is already partof the family q

. Againthere is an

unjustied dierence between the nal weights for

f

1 andf

2 .

Now suppose that all three initial weights are

set to zero. Imagine for simplicity that we have

restricted the algorithm to adjust weights only by

zeroor1. Onthe rstpass, allthree weights are

changedto1. Onthesecond,

1 and

2

areincreased

to2;

3

remainsat1. Inthethirditeration,

1 isset

to3,

2

remainsat2and

3

at1,andthealgorithm

terminates. 5

Noticethat,althoughthereisstilladierence

be-tweenthenalweightsforf

1 andf

2

,itismuchless

thanbefore. Thismorecloselyapproachestheideal

situationin which theweightsareequal.

This concludes our theoretical treatment of IIS.

Wenowshowexperimentallytheinuencethat the

initial weight settings have, and how it can be

minimised. Our strategy is to use plausible

fea-tures, as found in a realistic domain (parse

selec-tion for stochastic attribute-value grammars), and

rstly show what happens when the initial weight

settings areset uniformly (to zero). We thenshow

what happens when these initial settings are

ran-domly set, and nally, what happens when we

av-erageoverrandomlyinitialisedmaximumlikelihood

solutions.

5

Theexactbehaviourofthealgorithmwilldependonthe

trainingdata,sothisisnottheonlyimaginableoutcome,but

Attribute-Value Grammars

Hereweshowhowattribute-valuegrammarsmaybe

modelledusinglog-linearmodels. Abneygivesfuller

details(Abney,1997).

LetGbeanattribute-valuegrammar,andDaset

of sentences within the string-set dened byL(G).

Alog-linear model, M, consist oftwo components:

asetoffeatures,F andaset ofweights,.

The (unnormalised) total weight of a parse x,

(x),isafunction ofthekfeaturesthatare`active'

onaparse:

(x)=exp( k

X

i=1

i f

i

(x)) (6)

The probability of a parse, P(x j M), is simply

theresultofnormalisingthetotalweightassociated

withthatparse:

P(xjM)= 1

Z

(x) (7)

Z= X

y2

(y) (8)

istheunionof theset of parsesassigned toeach

sentence in D by the grammar G, such that each

parseinisuniqueintermsofthefeaturesthatare

activeonit. Normallyaparsecanbeviewedasthe

set offeaturesthatareactiveonit.

Theinterpretationofthisprobability(equation7)

depends upon the application of the model. Here,

we useparse probabilities to reect preferences for

parses.

5 The Grammar

The grammar we model with log-linear models

(called the Tag Sequence Grammar (Briscoe and

Carroll, 1996),orTSGforshort) wasmanually

de-velopedwithregardtocoverage,andwhencompiled

consists of 455 Denite Clause Grammar (DCG)

rules. Itdoesnotparsesequencesofwordsdirectly,

butinsteadassignsderivationstosequencesof

part-of-speech tags (using the CLAWS2 tagset). The

grammar is relativelyshallow(for example, itdoes

notfullyanalyseunboundeddependencies).

6 Modelling the Grammar

ModellingtheTSGwithrespecttotheparsedWall

Street Journal consists of two steps: creation of a

featuresetanddenitionofareferencedistribution

(thetargetmodel, p).~

6.1 Feature Set

(5)

AP/a1:unimpeded

A1/app1:unimpeded

a

a !

!

unimpeded PP/p1:by

P1/pn1:by

b

b "

"

by N1/n:traÆc

traÆc

Figure1: TSGParseFragment

feature set and denition of the reference

distribu-tion.

Our feature set is created by parsing sentences

in thetraining set,and usingeach parse to

instan-tiate templates. Each template denes afamily of

features. Our templates are motivated by the

ob-servationsthatlinguistically-stipulatedunits (DCG

rules) are informative, and that many DCG

appli-cations in preferred parses can be predicted using

lexicalinformation.

The rst template creates features that count

thenumberof timesa DCGinstantiationis present

within a parse. 6

For example, suppose we parsed

theWallStreetJournalAP:

1 unimpededbytraÆc

AparsetreegeneratedbyTSGmightbeasshown

ingure1. Here,tosaveonspace,wehavelabelled

each interiornode in the parse tree with TSGrule

names, and not attribute-value bundles.

Further-more, we have annotatedeach node with the head

word of the phrase in question. Within our

gram-mar, heads are (usually) explicitly marked. This

means we do not have to make any guesses when

identifying the headof a local tree. With head

in-formation,weareabletolexicalisemodels. Wehave

suppressedtagginginformation.

Forexample,afeaturedenedusingthistemplate

mightcountthenumberoftimesthewesaw:

AP/a1

A1/app1

inaparse. Suchfeaturesrecordsomeofthecontext

oftheruleapplication,inthatruleapplicationsthat

6

Note,allourfeaturessuppressanyterminalsthatappear

inalocaltree.Lexicalinformationisincludedwhenwedecide

modelledbydierentfeatures.

Oursecondtemplatecreatesfeaturesthatare

par-tiallylexicalised. Foreachlocal tree(ofdepth one)

that has a PP daughter, we create a feature that

countsthenumberoftimesthatlocaltree,decorated

withthehead-wordof thePP, wasseenin aparse.

Anexampleof suchalexicalisedfeaturewould be:

A1/app1

PP/p1:by

These features are designed to model PP

attach-ments that can be resolved using the head of the

PP.

Thethirdandnaltemplatecreatesfeaturesthat

are againpartiallylexicalised. Thistime, wecreate

localtreesofdepth onethataredecoratedwiththe

headword. Forexample,hereisonesuchfeature:

AP/a1:unimpeded

A1/app1

Notethesecondandthirdtemplatesresultin

fea-turesthat overlap with features resulting from

ap-plicationsofthersttemplate.

6.2 ReferenceDistribution

Wecreate thereferencedistribution R (an

associa-tionof probabilities with TSG parses of sentences,

suchthattheprobabilitiesreectparsepreferences)

usingthefollowingprocess:

1. Take the training set of parses and for each

parse, compare the structural dierences

be-tweenitandareferencetreebankparse.

2. Mapthese treesimilarityscoresinto

probabili-ties(wherethesumofallreferenceprobabilities

forallparsessumstoone).

Again,seeAnon formoredetails.

This concludes our discussion of how we model

grammars. Wenowgoontopresentour

experimen-talinvestigationoftheinuenceoftheinitialweight

settings.

7 Experiments

Here we present three sets of experiments. The

rstsetshowstheperformanceofmaximumentropy

when the initial weight setting are zero. The

sec-ondsetshowtheeectsofrandomisedinitialsetting,

andso establishes (anestimateof) thevariation in

performance space. The third set of experiments

showed how the inuence of the initial weight

(6)

Throughout,weused thesametrainingset. This

consistedofasampleof53795parses(producedfrom

sentencesat most 15 tokenslong, with at most 15

parses per sentence). The sentences were drawn

from the parsedWall Street Journal,and all could

beparsed using our grammar. Themotivation for

this choice of training set came from thefact that

whenthesampleofsentencesistoosmall,the

result-ingmodel willtendtoundert. Likewise, whenthe

trainingsetistoolarge,themodelwilltendto

over-t. Asampleofappropriatesize(whichcanbefound

using a simple search, as Osborne (2000)

demon-strated)will thereforeneither signicantly undert

nor overt. Quiteapart from estimation issues

re-latedtosamplesize,becausewerepeatedlyestimate

models,usingasamplethatisjustsuÆcientlylarge

(andnolarger)allowsustomakesignicant

compu-tationalsavings.

We used a disjoint development set and testing

set. Thedevelopment set consisted of 2620parses,

derived from parsing sentences at most 30 tokens

long,withatmost100parsespersentence. The

test-ingsetwasrandomlysampledfrom theWallStreet

Journal, and consisted of 469 sentences, with each

sentenceat most30 tokens long,with at most100

parses persentence. Each sentence hason average

60:0parsespersentence.

The model we used throughout contained 75171

features.

Forallexperiments,weranIISforthesame

num-berofiterations(75). Thisnumberwasselectedas

thenumberofiterationsthatproducedthebest

per-formanceformaximumentropy(ouryardstick).

Evaluationwasin termsofexactmatch: foreach

sentence in the test set, we awarded ourselves a

pointiftheestimatedmodelrankedhighestthesame

parse that was ranked highest using the reference

probabilities. 7

Tieswererandomlybroken.

7.1 MaximumEntropy Resultsusing

Uniform Initialisation

Amodeltrainedusingweightsallinitialisedtozero

yieldedanexactmatchscoreof52:0%.

7.2 RandomisedModels

A poolof 10;000 models was createdby randomly

setting the initial weights to values in the range

[-0.3,0.3]and thenestimating thenal weightsusing

thetrainingset.

The histogram in gure 2 shows the number of

models that producedthesameresultsonthe

test-ingmaterial. Ascanbeseen,performanceisroughly

normally distributed,with somelocal minima

hav-ingamuchwiderbasinofattractionthanother

min-ima. Also, note that all of these models

underper-7

Weuse exact match as an arbitraryevaluation metric.

Theparticularchoiceofmetricisnotcrucialtotheresultsof

forms amodel cratedusing uniform initialisation.

The performance of the worst model was 24:1%.

Thisis our lowerbound,and is less thanhalf that

ofbasicmaxent. Thebestrandomlyselectedmodel

hadaperformanceof46:5%.

0

50

100

150

200

250

300

350

400

450

20

25

30

35

40

45

50 Number of Models

Exact Match

"rand-hist04"

Figure2: Distributionofmodelswithrandomly

ini-tialisedstartingconditions

7.3 Averaging over models

To see whether an ensemble of such randomised

models could cancel-out the inuence of the

ini-tial weight settings, we createda pool of 600

ran-domised models, and then combined then together

using equation9:

P(xjM

1 :::M

n )=

Q

n

i=1

P(xjM

i )

P

y Q

n

j=1

P(yjM

j )

(9)

Becauseitispossiblethatsomesubsetofthe

mod-els outperforms an ensemble using all models, we

uniformlysampled,withreplacement,fromthispool

models for inclusion into the nal ensemble.

Ran-dom selection introduces variation, so we repeated

ourresultstentimeandaveragedtheresults.

Figure 3 shows our results (N is the number of

modelsineachensemble,xisthemeanperformance

and 2

is the standarddeviation). The nal entry

(markedall)showstheperformanceobtainedusing

anensemble consisting of allmodels in the model,

equallyweighted.

As can be seen, increasing the number of models

in the ensemble reduces the inuence of the initial

weight settings. However, even with a large pool

ofmodels,westillmarginallyunderperformuniform

initialisation.

8

Running IISuntilconvergence didnot narrow the gap,

(7)

N x N x

1 38.78 0.12 50 49.94 1.24

2 41.41 1.04 100 50.55 0.70

3 44.24 1.63 150 50.62 0.88

5 46.57 1.69 200 50.66 0.75

10 47.21 1.71 300 50.81 0.73

20 49.23 1.19 600 51.04 0.36

all 51.39

-Figure3: Modelaveragingresults

Notethatwehavenotcarriedoutthecontrol

ex-perimentofrepeatingourrunsusingfeatureswhich

donotoverlapwitheachother. Unfortunately,doing

this would probablymean having to create models

that are unnatural. We therefore cannot be

abso-lutelycertain that thesolereasonfor theexistance

of local optima is the presence of overlapping

fea-tures. However,wecanbesurethat they doexist,

and that varying theinitial weightsettings will

re-vealthem.

8 Comments

Wehaveestablishedthat,inthepresenceof

overlap-ping features, the values of the initial weights can

aect the utility of the nal model. Our

experi-mental resultssupport thisnding. Ingeneral,one

mightencounterasimilarproblem(overlapping

fea-tures) with what mightbecalled `semi-overlapping

features': features which are very similar (but not

identical) on the training data and very dierent

outsideof it.

IIS could be made more robust to the choice of

initialparametersinanumberofways:

The simplest course of action is to set all

ini-tialweights to the samevalue. Although zero

is often a convenient initial value, in principle

any real number would do, since theIIS

algo-rithmcan reach agivenoptimal solutionfrom

anystartingpointinthespaceofinitial

param-eters.

We could also examine all the features to

de-terminewhichonesoverlap,andforceitto

bal-ancethenalweightsofthesefeatures. Forvery

largemodels,thismaybeprohibitivelydiÆcult

andtime-consuming.

Model averagingcanalso canceloutvariations

causedbyaparticularchoiceofinitialsettings.

However, this implies a greater computational

burdenasIISwillneedtoberunmanytimesin

ordertogainarepresentativesampleofmodels.

Thenumber offeatures in themodel couldbe

reducedusingfeatureselectionmethods(for

ex-ample(MullenandOsborne,2000)).

Although IIS is a useful tool for estimating

log-modelsusinglimited-memoryvariable-metric

meth-ods (Malouf,2002). Ourndingsshowthat

conver-gence,for arange of problems, is faster. An

inter-esting questionis seeing the extent to which other

numericalmethods forestimating log-linearmodels

aresensitive toinitial parametervalues. Finally, it

shouldbenotedthatourtheoreticalresultsapplyto

amoregeneralsetting thanthat oflog-linear

mod-elstrainedusingtheIISalgorithm. Theproblemof

overlappingfeaturescouldin principleoccurin any

situationinwhichamodelhasalinearcombination

offeatures,anda`hill-climbing'algorithmisusedto

seekamaximum-likelihoodsolution.

Acknowledgements

WewishtothankSteveClarkforusefuldiscussions

aboutIIS, RobMalouffor supplyingtheIIS

imple-mentation and the twoanonymous reviewers. Iain

BancarzwassupportedbytheEPSRCgrantPOEM.

References

Steven P. Abney. 1997. Stochastic

Attribute-Value Grammars. Computational Linguistics,

23(4):597{618,December.

Ted Briscoe and John Carroll. 1996. Automatic

Extraction of Subcategorization from Corpora.

In Proceedings of the 5 th

Conference on Applied

NLP,pages356{363,Washington,DC.

Mark Johnson, Stuart Geman, Stephen Cannon,

ZhiyiChi,andStephanRiezler. 1999. Estimators

forStochastic\Unication-Based"Grammars. In

37 th

AnnualMeetingofthe ACL.

J. Laerty, S. Della Pietra, and V. Della Pietra.

1997. Inducing features of random elds. IEEE

Transactions on Pattern Analysis and Machine

Intelligence,19(4):380{393,April.

Robert Malouf. 2002. A comparison of algorithms

for maximum entropy parameter estimation. In

Proceedings of the joint CoNLL-WVLC Meeting,

Taipei,Taiwan.ACL. Toappear.

TonyMullenand MilesOsborne. 2000. Overtting

Avoidance for Stochastic Modeling of

Attribute-ValueGrammars. InClaireCardie,Walter

Daele-mans,ClaireNedellec,andErikTjongKimSang,

editors,Proceedingsofthe ComputationalNatural

Language learning 2000, pages 49{54.ACL,

Lis-bon,Portugal.

Kamal Nigam, John Laerty, , and Andrew

Mc-Callum. 1999. Using maximum entropy for text

classication. InIJCAI-99WorkshoponMachine

Learning for InformationFiltering,.

Miles Osborne. 2000. Estimation of Stochastic

Attribute-Value Grammars using anInformative

Sample. InThe18 th

International Conferenceon