Using Maximum Entropy for Sentence Extraction

Miles Osborne

[email protected]

Division of Informatics

University of Edinburgh

2 Buccleuch Place

Edinburgh EH8 9LW

United Kingdom.

Abstract

A maximum entropy classifier can be used to extract sentences from documents. Experiments using technical documents show that such a classifier tends to treat features in a categorical manner. This results in performance that is worse than when extracting sentences using a naive Bayes classifier. Addition of an optimised prior to the maximum entropy classifier improves performance over and above that of naive Bayes (even when naive Bayes is also extended with a similar prior). Further experiments show that, should we have at our disposal extremely informative features, then maximum entropy is able to yield excellent results. Naive Bayes, in contrast, cannot exploit these features and so fundamentally limits sentence extraction performance.

1 Introduction

Sentence extraction (the recovery of a given set of sentences from some document) is useful for tasks such as document summarisation (where the extracted sentences can form the basis of a summary) or question-answering (where the extracted sentences can form the basis of an answer). In this paper, we concentrate upon extraction of sentences for inclusion into a summary. From a machine learning perspective, sentence extraction is interesting because typically, the number of sentences to be extracted is a very small fraction of the total number of sentences in the document. Furthermore, those clues which determine whether a sentence should be extracted or not tend to be either extremely specific, or very weak, and furthermore interact together in non-obvious ways. From a linguistic perspective, the [...] ability to integrate together diverse levels of linguistic description.

Frequently (see section 6 for examples), sentence extraction systems are based around simple algorithms which assume independence between those features used to encode the task. A consequence of this assumption is that such approaches are fundamentally unable to exploit dependencies which presumably exist in the features that would be present in an ideal sentence extraction system. This situation may be acceptable when the features used to model sentence extraction are simple. However, it will rapidly become unacceptable when more sophisticated heuristics, with complicated interactions, are brought to bear upon the problem. For example, Boguraev and Neff (2000a) argue that the quality of summarisation can be increased if lexical cohesion factors (rhetorical devices which help achieve cohesion between related document utterances) are modelled by a sentence extraction system. Clearly such devices (for example, lexical repetition, ellipsis, co-reference and so on) all contribute towards the general discourse structure of some text and furthermore are related to each other in non-obvious ways.

Maximum entropy (log-linear) models, on the other hand, do not make unnecessary independence assumptions. Within the maximum entropy framework, we are able to optimally integrate together whatever sources of knowledge we believe potentially to be useful for the task. Should we use features that are beneficial, then the model will be able to exploit this fact. Should we use features that are irrelevant, then again, the model will be able to notice this, and effectively ignore them. Models based on maximum entropy are therefore well suited to the sentence extraction task, and furthermore, yield competitive results on a variety of language tasks (Ratnaparkhi, 1996; Berger et al., 1996; Charniak, 1999; Nigam et al., 1999).


Our model works incrementally, and does not always need to process the entire document before assigning classification.[1] It discriminates between those sentences which should and should not be extracted. This contrasts with ranking approaches which need to process the entire document before extracting sentences. Because we model whether a sentence should be extracted or not in terms of features that are extracted from the sentence (and its context in the document), we do not need to specify the size of the summary. Again, this contrasts with ranking approaches which need to specify a priori the summary size.

Our maximum entropy approach for sentence extraction does not come without problems. Using reasonably standard features, and when extracting sentences from technical papers, we find that precision levels are high, but recall is very low. This arises from the fact that those features which predict whether a sentence should be extracted tend to be very specific and occur infrequently. Features for sentences that should not be extracted tend to be much more abundant, and so more likely to be seen in the future. A simple prior probability is shown to help counteract this tendency. Using our prior, we find that the maximum entropy approach is able to yield results that are better than a naive Bayes classifier.

Our final set of experiments looks more closely at the differences between maximum entropy and naive Bayes. We show that when we have access to an oracle that is able to tell us when to extract a sentence, then in the situation when that information is encoded in dependent features, maximum entropy easily outperforms naive Bayes. Furthermore, we also show that even when that information is encoded in terms of independent features, naive Bayes can be incapable of fully utilising this information, and so produces worse results than maximum entropy.[2]

[1] Incremental classification means that a document is processed from start to finish and decisions are made as soon as sentences are encountered. Some of our features (in particular, those which encode sentence position in a document) do require processing the entire document. Using such features prevents true incremental processing. However, it is trivial to remove such features and so ensure true incrementality.

[2] As a reviewer commented, under certain circumstances, naive Bayes can do well even when there are strong dependencies within features (Domingos and Pazzani, 1997). For example, when the sample size is small, naive Bayes can be competitive with more sophisticated approaches such as maximum entropy. Given this, a fuller comparison of naive Bayes and maximum entropy for sentence extraction requires considering sample size [...]

[...] Section 2 outlines the general framework for sentence extraction using maximum entropy modelling. Section 3 presents our naive Bayes classifier (which is used as a comparison with maximum entropy). We then show in section 4 how both our maximum entropy and naive Bayes classifiers can be extended with an (optimised) prior. The issue of summary size is touched upon in section 5. Section 6 discusses related work. We then present our main results (section 7). Finally, section 8 discusses our results and considers future work.

2 Maximum Entropy for Sentence Extraction

2.1 Conditional Maximum Entropy

The parametric form for a conditional maximum entropy model is as follows (Nigam et al., 1999):

$$P(c \mid s) = \frac{1}{Z(s)} \exp\Big(\sum_i \lambda_i f_i(c, s)\Big) \qquad (1)$$

$$Z(s) = \sum_{c' \in C} \exp\Big(\sum_i \lambda_i f_i(c', s)\Big) \qquad (2)$$

Here, c is a label (from the set of labels C) and s is the item we are interested in labelling (from the set of items S). In our domain, C simply consists of two labels: one indicating that a sentence should be in the summary (`keep'), and another label indicating that the sentence should not be in the summary (`reject'). S consists of a training set of sentences, linked to their originating documents. This means that we can recover the position of any given sentence in any given document.

Within maximum entropy models, the training set is viewed in terms of a set of features. Each feature expresses some characteristic of the domain. For example, a feature might capture the idea that abstract-worthy sentences contain the words "in this paper". In equation 1, $f_i(c, s)$ is a feature. In this paper we restrict ourselves to integer-valued functions. An example feature might be as follows:

$$f(c, s) = \begin{cases} 1 & \text{if } s \text{ contains the phrase \emph{in this paper} and } c \text{ is the label \emph{keep}} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

Features are related to each other through weights (as can be seen in equation 1, where some feature $f_i$ has a weight $\lambda_i$). [...] numerical [...] techniques. In this paper, we use conjugate gradient descent to find the optimal set of weights. Conjugate gradient descent converges faster than Improved Iterative Scaling (Lafferty et al., 1997), and empirically we find that it is numerically more stable.

2.2 Maximum Entropy Classification

When classifying sentences with maximum entropy, we use the equation:

$$\mathrm{label}(s) = \arg\max_{c \in C} P(c \mid s) \qquad (4)$$

In practice, we are not interested in the probability of a label given a sentence. Instead we use the unnormalised score:

$$\mathrm{label}(s) = \arg\max_{c \in C} \exp\Big(\sum_i \lambda_i f_i(c, s)\Big) \qquad (5)$$

Note that this maximum entropy classifier assumes a uniform prior. Section 4 shows how a non-uniform prior is used in place of this uniform prior.
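As a rough illustration of equations 1, 4 and 5, the following Python sketch scores a sentence under a conditional maximum entropy model and shows that the unnormalised score selects the same label as the normalised probability. The feature names and weights are invented for the example; they are not taken from the experiments reported here.

```python
import math

LABELS = ("keep", "reject")

# Hypothetical weights (lambda_i), indexed by (feature name, label).
# Only features that are active in a sentence contribute to its score.
WEIGHTS = {
    ("contains:in this paper", "keep"): 1.2,
    ("contains:in this paper", "reject"): -0.4,
    ("short_sentence", "keep"): -0.7,
    ("short_sentence", "reject"): 0.3,
}


def unnormalised_score(active_features, label, weights=WEIGHTS):
    """exp(sum_i lambda_i * f_i(c, s)) for binary features active in s."""
    total = sum(weights.get((feat, label), 0.0) for feat in active_features)
    return math.exp(total)


def probability(active_features, label, weights=WEIGHTS):
    """P(c | s): the unnormalised score divided by the normaliser Z(s)."""
    z = sum(unnormalised_score(active_features, c, weights) for c in LABELS)
    return unnormalised_score(active_features, label, weights) / z


def classify(active_features, weights=WEIGHTS):
    """argmax over labels of the unnormalised score (uniform prior)."""
    return max(LABELS, key=lambda c: unnormalised_score(active_features, c, weights))
```

Since Z(s) is the same for every label of a given sentence, dropping it cannot change which label attains the maximum, which is why the cheaper unnormalised score suffices for classification.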

We now present our basic naive Bayes classifier. Afterwards, we extend this classifier with a non-uniform prior.

3 Naive Bayes Classification

As an alternative to maximum entropy, we also investigated a naive Bayes classifier. Unlike maximum entropy, naive Bayes assumes features are conditionally independent of each other. So, comparing the two together will give an indication of the level of statistical dependencies which exist between features in the sentence extraction domain. For our experiments, we used a variant of the multi-variate Bernoulli event model (McCallum and Nigam, 1998). In particular, we did not consider features that are absent in some example. This allows us to avoid summing over all features in the model for each example. Note that our maximum entropy model also did not consider absent features.

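Within our naive Bayes approach, the probability of a label given the sentence is as follows:

$$P(c \mid s) = \frac{P(c) \prod_i P(g_i \mid c)}{P(s)} \qquad (6)$$

As before, s is some sentence, c the label, and $g_i$ is some active feature describing sentence s. Naive Bayes models can be estimated in a closed form by simple counting. For features which have zero counts, we use add-k smoothing (where k is a small [...]) [...] constant:

$$P(c \mid s) \propto P(c) \prod_i P(g_i \mid c) \qquad (7)$$

If we assume a uniform prior (in which case P(c) is a constant for all c), this can be further simplified to:

$$P(c \mid s) \propto \prod_i P(g_i \mid c) \qquad (8)$$

Our basic naive Bayes classifier is as follows:

$$\mathrm{label}(s) = \arg\max_{c \in C} \prod_i P(g_i \mid c) \qquad (9)$$

As with the maximum entropy classifier, we later replace the uniform prior with a non-uniform prior.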
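The counting-based estimation and the uniform-prior classifier described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the value of k and the Bernoulli-style smoothing denominator (label count plus 2k) are assumptions, and the feature names in the usage note are hypothetical.

```python
from collections import Counter
import math


def train_naive_bayes(examples, k=0.1):
    """Estimate P(g | c) by counting, with add-k smoothing.

    examples: iterable of (active_features, label) pairs.
    The smoothing denominator (n_c + 2k) is an assumption of this sketch.
    """
    label_counts = Counter()
    feature_counts = Counter()
    for features, label in examples:
        label_counts[label] += 1
        for g in set(features):
            feature_counts[(g, label)] += 1

    def likelihood(g, label):
        return (feature_counts[(g, label)] + k) / (label_counts[label] + 2 * k)

    return sorted(label_counts), likelihood


def classify(features, labels, likelihood):
    """argmax_c prod_i P(g_i | c), over active features only (uniform prior)."""
    def log_score(label):
        # Sum of logs rather than a product, to avoid numerical underflow.
        return sum(math.log(likelihood(g, label)) for g in set(features))
    return max(labels, key=log_score)
```

For instance, after training on a handful of labelled sentences, `classify(["in_this_paper"], labels, likelihood)` returns whichever label gives the active features the higher smoothed likelihood; absent features play no part, mirroring the variant described in the text.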

4 Maximum a Posteriori Classification

In this section, we show how our classifiers can be extended with a non-uniform prior. We also describe how such a prior can be optimised.

4.1 Adding a non-uniform prior

Now, the two classifiers mentioned previously (equations 9 and 5) are both based on maximum likelihood estimation. However, as we describe later, for sentence extraction, the maximum entropy classifier tends to over-select labels. In particular, it tends to reject too many sentences for inclusion into the summary. So, it is useful to extend the two previous classifiers with a non-uniform prior. For the naive Bayes classifier, we have:

$$\mathrm{label}(s) = \arg\max_{c \in C} P(c) \prod_i P(g_i \mid c) \qquad (10)$$

Here, P(c) is our prior. The probability of the data (P(s)) is constant and so can be dropped.

For the maximum entropy case, we are not interested in the actual probability:

$$\mathrm{label}(s) = \arg\max_{c \in C} F(c) \exp\Big(\sum_i \lambda_i f_i(c, s)\Big) \qquad (11)$$

F(c) is a function equivalent to the prior when using the unnormalised classifier. When this prior distribution (or equivalent function) is uniform, classification is as before (namely as outlined in sections 2 and 3), and depends upon the maximum entropy or naive Bayes component. When the prior is non-uniform, the classifier behaviour will change. This prior therefore allows us to affect the performance [...]


We treat the problem of selecting a prior as an optimisation task: select some P(c) (or F(c)) such that performance, as measured by some objective function of the overall classifier, is maximised. Since the choice of objective function is up to us, we can easily optimise the classifier in any way we decide. For example, we could optimise for recall by using as our objective function an f-measure that weighted recall more highly than precision. In this paper, we optimise the prior using as an objective function the f2 score of the classifier (section 7 details this score). Our prior therefore does not reflect relative frequencies of labels (as found in some corpus).

We now need to optimise our prior. Brent's one-dimensional function minimisation method is well suited to this task (Press et al., 1993), since for a random variable taking two values, the probability of one value can be defined in terms of the other value. Section 7 describes the held-out optimisation strategy used in our experiments.
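To make the optimisation concrete, the sketch below picks the prior for the `keep' label that maximises the f2 score on held-out data. It is only an illustration under stated assumptions: a plain grid search stands in for Brent's method, the function names are hypothetical, and each held-out sentence is assumed to come with a pair of (unnormalised) scores, one per label, from either classifier.

```python
def f2_score(predicted, gold):
    """2pr / (p + r), computed for the 'keep' label."""
    j = sum(1 for p, g in zip(predicted, gold) if p == g == "keep")
    k = sum(1 for p in predicted if p == "keep")
    m = sum(1 for g in gold if g == "keep")
    if j == 0:
        return 0.0
    precision, recall = j / k, j / m
    return 2 * precision * recall / (precision + recall)


def classify_with_prior(scores, prior_keep):
    """scores: (score_keep, score_reject); P(reject) is 1 - P(keep)."""
    keep, reject = scores
    return "keep" if prior_keep * keep > (1.0 - prior_keep) * reject else "reject"


def optimise_prior(held_out_scores, gold, steps=99):
    """Return (prior, f2) for the prior on 'keep' with the best held-out f2.

    A grid search over (0, 1) is used here in place of Brent's method.
    """
    best_prior, best_f2 = 0.5, -1.0
    for i in range(1, steps + 1):
        prior_keep = i / (steps + 1)
        predicted = [classify_with_prior(s, prior_keep) for s in held_out_scores]
        score = f2_score(predicted, gold)
        if score > best_f2:
            best_prior, best_f2 = prior_keep, score
    return best_prior, best_f2
```

Because there are only two labels, a single number fixes the whole prior, which is what makes a one-dimensional search applicable. The f2 objective is a step function of the prior, so a grid search is a simple, robust choice for a sketch; it is not a claim about how Brent's method behaves on this objective.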

Should we decide to use a more elaborate prior (for example, one which was also sensitive to properties of documents) then we would need to use a multi-dimensional function minimisation method.

Note that we have not simultaneously optimised the likelihood and prior probabilities. This means that we do not necessarily find the optimal maximum a posteriori (MAP) solution. It is possible to integrate into maximum entropy estimation (simple) conjugate priors that do allow MAP solutions to be found (Chen and Rosenfeld, 1999). Although it is an open question whether more complex priors can be directly integrated, future work ought to consider the efficacy of such approaches in the context of summarisation.

5 Summary size

Determining the size of the summary is an important consideration for summarisation. Frequently, this is carried out dynamically, and specified by the user. For example, when there is limited opportunity to display long summaries a user might want a terse summary. Alternatively, when recall is important, a user might prefer a longer summary. Usually, systems rank all sentences in terms of how abstract-worthy they are, and then take the top n most highly ranked sentences. This always requires the size of summary to be specified.

In our classification framework, sentences are processed (largely) independently of each other, and so there is no direct way of controlling the size of the [...] summary size, we can rank sentences using our classifiers (we not only label but can also assign label probabilities) and select the top n most highly ranked sentences.

Within our classification approach, the optimised prior plays a similar role to the user-defined number of sentences that a ranking approach might return.

Experiments (not reported here) showed that ranking sentences using our maximum entropy classifier, and then selecting the top n most highly ranked sentences produced slightly worse results than when selecting sentences in terms of classification.
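The two selection strategies contrasted in this section can be sketched as follows. The function names are hypothetical, and the 0.5 threshold is an assumption that corresponds to two labels with a uniform prior.

```python
def select_by_ranking(keep_probabilities, n):
    """Indices of the n sentences with the highest P(keep | s), in document order."""
    ranked = sorted(range(len(keep_probabilities)),
                    key=lambda i: keep_probabilities[i], reverse=True)
    return sorted(ranked[:n])


def select_by_classification(keep_probabilities, threshold=0.5):
    """Indices of sentences labelled 'keep'; the summary size is not fixed in advance."""
    return [i for i, p in enumerate(keep_probabilities) if p > threshold]
```

Ranking always returns exactly n sentences (when at least n are available), whereas classification returns however many sentences clear the decision boundary, so the summary length varies from document to document.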

6 Related Work

The summarisation literature is large. Here we consider only a representative sample.

Kupiec et al. (1995) used Naive Bayes for sentence extraction. They did not consider the role of the prior, nor did they use Naive Bayes for classification. Instead, they used it to rank sentences and selected the top n sentences. The TEXTRACT system included a sentence extraction component that is frequency-based (Boguraev and Neff, 2000b). Whilst the system uses a wide variety of linguistic cues when scoring sentences, it does not combine these scores in an optimal manner. Also, it does not consider interactions between the linguistic cues. Goldstein et al. (1999) used a centroid similarity measure to score sentences. They do not appear to have optimised their metric, nor do they deal with statistical dependencies between their features.

7 Experiments

Summarisation evaluation is a hard task, principally because the notion of an objective summary is ill-defined. That aside, in order to compare our various systems, we used an intrinsic evaluation approach. Our summaries were evaluated using the standard f2 score:

$$r = \frac{j}{m} \qquad p = \frac{j}{k} \qquad f2 = \frac{2pr}{p + r}$$

where:

r = Recall
p = Precision
j = Number of correct sentences in summary
k = Number of sentences in summary
m = Number of correct sentences in the document

A sentence being `correct' means that it was [...] (abstract-[...] will therefore attempt to mimic the process of selecting what it means for a sentence to be important in a document.

Naturally this premise (that an annotator can decide a priori whether a sentence is abstract-worthy or not) is open to question. That aside, in other sentence extraction scenarios, it may well be the case that sentences can be reliably annotated.

The f2 score treats recall and precision equally. This is a sensible metric to use as we have no a priori reason to believe in some other non-equal ratio of the two components.
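As a small worked example of the score (the counts are hypothetical and chosen only for illustration), suppose a document has m = 8 abstract-worthy sentences and a system returns k = 10 sentences, of which j = 4 are correct. Then:

$$r = \frac{4}{8} = 0.5, \qquad p = \frac{4}{10} = 0.4, \qquad f2 = \frac{2 \times 0.4 \times 0.5}{0.4 + 0.5} \approx 0.44$$

Because the score is a harmonic mean, it lies closer to the smaller of the two values, so a system with very high precision but very low recall still scores poorly.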

Our evaluation results are based on the following approach:

1. Split the set of documents into two disjoint sets (T1 and T2), with 70 documents in T1 and 10 documents in T2.

2. Further split T1 into two disjoint sets T3 and T4. T3 is used to train a model, and T4 is a held-out set. The prior is estimated using Brent's line minimisation method, when training using T3 and evaluating on T4. T3 consisted of 60 documents and T4 consisted of 10 documents.

3. Results are then presented using a model trained on T1, with the prior just found, and evaluated using T2. T1 is therefore the training set and T2 is the testing set. Results are also presented using a flat prior.

4. The whole process is then repeated after randomising the documents. The final results are then averaged over these n runs. We set n to 40.
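The steps above can be sketched as a generator of document splits. This is a minimal illustration only: the function names, the fixed seed and the use of `random.shuffle` for the randomisation step are assumptions, and training and scoring are left out.

```python
import random


def make_splits(documents, runs=40, seed=0):
    """Yield (T3, T4, T1, T2) document lists, one tuple per randomised run.

    T1 (70 documents) is the training set and T2 (10 documents) the test set;
    T1 is further split into T3 (60 documents) and the held-out set T4 (10).
    """
    rng = random.Random(seed)
    docs = list(documents)
    for _ in range(runs):
        rng.shuffle(docs)
        t1, t2 = docs[:70], docs[70:80]
        t3, t4 = t1[:60], t1[60:]
        yield t3, t4, t1, t2


def average(values):
    """Mean of per-run results (for example precision or recall)."""
    values = list(values)
    return sum(values) / len(values)
```

In each run, the prior would be tuned by training on T3 and scoring on T4, after which a model trained on all of T1 is evaluated once on T2; the reported figures are then the averages over the runs.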

7.1 Document set

For data, we used the same documents that Teufel (2001) used in her experiments.[3] In brief, these were 80 conference papers, taken from the Comp-lang preprint archive, and semi-automatically converted from LaTeX to XML. The XML annotated documents were then additionally manually marked-up with tags indicating the status of various sentences. This document set is modest in size. On the other hand, the actual documents are longer than newswire messages typically used for summarisation tasks. Also, the documents show variation in style. For example, some documents are written by non-native speakers, some by students, some by multiple authors and so on. Summarisation is therefore hard.

[3] A superset of the documents is described in (Teufel [...]

[...] On average, each document contained 8 sentences that were marked as being abstract-worthy (standard deviation of 3.1). The documents on average each contained in total 174 sentences (standard deviation 50.7). Here, a `sentence' is either any sequence of words that happened to be in a title, or else any sequence of words in the rest of the document. As can be seen, the summaries are not uniformly long. Also, the documents vary considerably in length. Summary size is therefore not constant.

7.2 Features

We used the following, fairly standard features when describing all sentences in the documents:

Word pairs. Word pairs are consecutive words as found in a sentence. A word pair feature simply indicates whether a particular word pair is present. All words were reduced: truncated to be at most 10 characters long. Stemming (as for example carried out by the Porter stemmer) produced worse results. We extracted all word pairs found in all sentences, and for any given sentence, found the set of (reduced) word pairs.

Sentence length. We encoded in three binary features whether a sentence was less than 6 words in length, whether it was greater than 20 words in length, or whether it was in between these two ranges. We also used a feature which encoded whether a previous sentence was less than 5 words or longer. This captured the idea that summary sentences tend to follow headings (which are short).

Sentence position. Summary sentences tend to occur either at the start, or the end of a document. We used three features: whether a given sentence was within the first 8 paragraphs of a document, whether a sentence was in the last 3 paragraphs, or whether the sentence was in a paragraph between these two ranges to encode sentence position. Note that this feature requires the whole document to be processed before classification can take place.

(Limited) discourse features. Our features described whether a sentence immediately followed typical headings such as conclusion or introduction, whether a sentence was at the start of a paragraph, or whether a sentence followed some generic heading.

[...] to be typical of those found in sentence extraction systems. Note that some of our features exploit the fact that the documents are annotated with structural information (such as headers etc).

Experiments with removing stop words from documents resulted in decreased performance. We conjecture that this is because our word pairs are extremely crude syntax approximations. Removing stop words from sentences and then creating word pairs makes these pairs even worse syntax approximations. However, using stop words increased the number of features in our model, and so again reduced performance. We therefore compromised between these two positions, and mapped all stop words to the same symbol prior to creation of word pair features. We also found it useful to remove word pairs which consisted solely of stop words. Finally, for maximum entropy, we deleted any feature that occurred less than 4 times. Naive Bayes did not benefit from a frequency-based cutoff.
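The word pair processing described above (truncation to 10 characters, mapping stop words to one symbol, dropping pairs made up solely of stop words, and the frequency cutoff) might look roughly like this. The stop word list, the placeholder symbol, the whitespace tokenisation and the lower-casing are all assumptions made for the sketch.

```python
from collections import Counter

STOP = "<stop>"  # single symbol that every stop word is mapped to (hypothetical)
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and"}  # illustrative only


def reduce_word(word, max_len=10):
    """Lower-case a word and truncate it to at most max_len characters."""
    return word.lower()[:max_len]


def word_pair_features(sentence):
    """Return the set of consecutive (reduced) word pairs in a sentence."""
    tokens = [STOP if w.lower() in STOP_WORDS else reduce_word(w)
              for w in sentence.split()]
    pairs = set()
    for left, right in zip(tokens, tokens[1:]):
        if left == STOP and right == STOP:
            continue  # drop pairs consisting solely of stop words
        pairs.add((left, right))
    return pairs


def frequent_features(feature_sets, min_count=4):
    """Keep only features seen at least min_count times across all sentences."""
    counts = Counter(f for feats in feature_sets for f in feats)
    return {f for f, n in counts.items() if n >= min_count}
```

Mapping stop words to one symbol keeps the position of function words visible in each pair while collapsing many rare pairs into a few common ones, which is the compromise the text describes.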

7.3 Classifier comparison

Here we report on our classifiers.

As a baseline model, we simply extracted the first n sentences from a given document. Figure 1 summarises our results as n varies. In this table, as in all subsequent tables, P and R are averaged precision and recall values, whilst F2 is the f2 score of these averaged values.

 n  F2   P   R      n  F2   P   R
 1   0   0   0     26  16  10  36
 6   3   3   2     31  18  12  45
11  19  15  26     36  18  11  53
16  20  16  29     41  17  10  58
21  23  16  38     46  16   9  58

Figure 1: Results for the baseline model

Figure 2 shows our results for maximum entropy, both with and without the prior. Prior optimisation was with respect to the f2 score. As in subsequent tables, we show system performance when adding more and more features.

Performance without the prior is heavily skewed towards precision. This is because our features are largely acting categorically: the sheer presence of some feature is sufficient to influence labelling choice. Further support for this analysis comes from inspecting one of the models produced when using the [...] feature instances in the model, the vast majority are deemed irrelevant by maximum entropy, and assigned a zero weight. Only 7086 features (roughly 10% in total) had non-zero weights.

Features             Flat prior      Optimised prior
                     F2   P   R      F2   P   R
Word pairs            8   5  30      20  40  14
and sent length      25  63  16      36  36  36
and sent position    28  62  18      39  35  45
and discourse        35  63  24      42  43  41

Figure 2: Results for the maximum entropy model

Performance using the optimised prior shows more balanced results, with an increase in F2 score. Clearly optimising the prior has helped counter the categorical behaviour of features in our maximum entropy classifier.

Figure 3 shows the results we obtained when using a naive Bayes classifier. As before, the results show performance with and without the addition of the optimised prior. Naive Bayes outperforms maximum entropy when neither classifier uses a prior. Performance with and without the prior, however, is worse than the performance of our maximum entropy classifier with the prior. Evidently, even our relatively simple features interact with each other, and so approaches such as maximum entropy are required to fully exploit them.

Features          Flat prior      Optimised prior
                  F2   P   R      F2   P   R
Word pairs        26  29  23      29  26  32
and discourse     38  39  37      39  38  40

Figure 3: Results for the naive Bayes model

7.4 Using informative features

Our previous results showed that maximum entropy could outperform naive Bayes. However, the differences, though present, were not large. Clearly, our feature set was imperfect.[4] It is therefore instructive to see what happens if we had access to an oracle who always told us the true status of some unseen sentence. To make things more interesting, we encoded this information in terms of dependent features. We simulated this oracle by using two features which were active whenever a sentence should not be in the summary; for sentences that should be included in the summary, we let either one of those two features be active, but on a random basis. Our features therefore are only informative when the learner is capable of noting that there are dependencies. We then repeated our previous maximum entropy and naive Bayes experiments. Figure 4 summarises our results.

[4] Another possible reason for the closeness of the results is the small sample size. There may just not be [...]

Features          Naive Bayes     Maxent
                  F2   P   R      F2   P    R
Word pairs        30  34  26      32  93   19
and discourse     43  44  41      99 100   97

Figure 4: Results for basic naive Bayes and maximum entropy models using dependent informative features

Features          Naive Bayes     Maxent
                  F2   P   R      F2   P    R
Word pairs        84  74  97      25  15   91
and sent length   85  75  97     100 100  100
and discourse     84  74  97     100 100  100

Figure 5: Results for basic naive Bayes and maximum entropy models using independent informative features

Unsurprisingly, we see that when features are highly dependent upon each other, maximum entropy easily outperforms naive Bayes.
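The dependent-feature oracle can be simulated in a few lines, which also shows why a per-feature product gets no help from it. The feature names `A` and `B`, the function names and the idealised likelihood table are assumptions made for this sketch; the real experiments also included the ordinary features, which are omitted here.

```python
import random


def oracle_features(label, rng):
    """Two hypothetical oracle features, 'A' and 'B'.

    Both are active on 'reject' sentences; exactly one of them,
    chosen at random, is active on 'keep' sentences.
    """
    if label == "reject":
        return {"A", "B"}
    return {rng.choice(["A", "B"])}


def naive_bayes_scores(features, likelihood):
    """Product of per-feature likelihoods for each label (uniform prior)."""
    scores = {}
    for label in ("keep", "reject"):
        score = 1.0
        for g in features:
            score *= likelihood[(g, label)]
        scores[label] = score
    return scores


# Idealised likelihoods P(g | c) implied by the construction above.
LIKELIHOOD = {
    ("A", "keep"): 0.5, ("B", "keep"): 0.5,
    ("A", "reject"): 1.0, ("B", "reject"): 1.0,
}
```

Taken one at a time, each oracle feature is more likely under `reject' than under `keep', so the naive Bayes product favours `reject' even for a sentence where only one feature is active. The label is revealed only by the joint pattern (one feature active versus both), which is exactly the kind of dependency an independence assumption discards.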

Even when we have access to features that are independent of each other, naive Bayes can still do worse than maximum entropy. To demonstrate this, we used a feature that was active whenever a sentence should be in the summary. This feature was not active on sentences that should not be in the summary. Figure 5 summarises our results.

As can be seen (figure 5), even when naive Bayes has access to a perfectly reliable informative feature, the fact that the other features are not suitably discounted means that performance is worse than that of maximum entropy. Maximum entropy can discount the other features, and so can take advantage [...]

8 Comments and Future Work

We showed how maximum entropy could be used for sentence extraction, and in particular, that adding a prior could deal with the categorical nature of the features. Maximum entropy, with an optimised prior, did yield marginally better results than naive Bayes (with and without a similarly optimised prior). However, the differences were not that great. Our further experiments with informative features showed that this lack of difference was probably due (at least in part) to the actual features used, and not due to the technique itself.

Our oracle results are an idealisation. A fuller comparison should use more sophisticated features, along with more data. As a result of this, we conjecture that should we use a much more sophisticated feature set, we would expect that the differences between maximum entropy and naive Bayes would become greater.

Our approach treated sentences largely independently of each other. However, abstract-worthy sentences tend to bunch together, particularly at the beginning and end of a document. We intend capturing this idea by making our approach sequence-based: future decisions should also be conditioned on previous choices.

A problem with supervised approaches (such as ours) is that we need annotated material (Marcu, 1999). This is costly to produce. Future work will consider weakly supervised approaches (for example cotraining) as a way of bootstrapping labelled material from unlabelled documents (Blum and Mitchell, 1998). Note that there is a close connection between multi-document summarisation (where many alternative documents all consider similar issues) and the concept of a view in cotraining. We expect that this redundancy could be exploited as a means of providing more annotated training material, and so yield better results.

In summary, maximum entropy can be beneficially used in sentence extraction. However, one needs to guard against categorical features. An optimised prior can provide such help.

Acknowledgement

We would like to thank Rob Malouf for supplying the excellent log-linear estimation code, Simone Teufel for providing the annotated data, Karen Spärck Jones for a discussion about summarisation, Steve Clark for spotting textual bugs and the anonymous [...]

References

Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 21-22.

Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers.

Branimir K. Boguraev and Mary S. Neff. 2000a. The effects of analysing cohesion on document summarisation. In Proceedings of the 18th International Conference on Computational Linguistics, volume 1, pages 76-82, Saarbrücken.

Branimir K. Boguraev and Mary S. Neff. 2000b. Discourse Segmentation in Aid of Document Summarization. In Proceedings of the 33rd Hawaii International Conference on Systems Science.

Eugene Charniak. 1999. A maximum-entropy-inspired parser. Technical Report CS99-12, Department of Computer Science, Brown University.

Stanley F. Chen and Ronald Rosenfeld. 1999. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Carnegie Mellon University.

Pedro Domingos and Michael J. Pazzani. 1997. On the optimality of the simple bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130.

Jade Goldstein, Mark Kantrowitz, Vibhu O. Mittal, and Jaime G. Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval, pages 121-128.

Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A Trainable Document Summarizer. In Proceedings of the 18th ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 68-73.

J. Lafferty, S. Della Pietra, and V. Della Pietra. 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, April.

Daniel Marcu. 1999. The automatic construction of large-scale corpora for summarization research. In Research and Development in Information Retrieval, pages 137-144.

A. McCallum and K. Nigam. 1998. A comparison of event models for naive bayes text classification. In AAAI-98 Workshop on Learning for Text Catego[...]

[...] McCallum. 1999. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1993. Numerical Recipes in C: the Art of Scientific Computing. Cambridge University Press, second edition.

Adwait Ratnaparkhi. 1996. A Maximum Entropy Part-Of-Speech Tagger. In Proceedings of Empirical Methods in Natural Language, University of Pennsylvania, May. Tagger: ftp://ftp.cis.upenn.edu/pub/adwait/jmx.

S. Teufel and M. Moens. 1997. Sentence extraction as a classification task. In ACL/EACL-97 Workshop on Intelligent and Scalable Text Summarization, Madrid, Spain.

Simone Teufel. 2001. Task-Based Evaluation of Summary Quality: Describing Relationships Between Scientific Papers. In NAACL Workshop on Automatic Summarization, Pittsburgh,