Using Maximum Entropy for Sentence Extraction
Miles Osborne
Division of Informatics
University of Edinburgh
2 Buccleuch Place
Edinburgh EH8 9LW
United Kingdom.
Abstract
A maximum entropy classifier can be used to extract sentences from documents. Experiments using technical documents show that such a classifier tends to treat features in a categorical manner. This results in performance that is worse than when extracting sentences using a naive Bayes classifier. Addition of an optimised prior to the maximum entropy classifier improves performance over and above that of naive Bayes (even when naive Bayes is also extended with a similar prior). Further experiments show that, should we have at our disposal extremely informative features, then maximum entropy is able to yield excellent results. Naive Bayes, in contrast, cannot exploit these features and so fundamentally limits sentence extraction performance.
1 Introduction
Sentence extraction, the recovery of a given set of sentences from some document, is useful for tasks such as document summarisation (where the extracted sentences can form the basis of a summary) or question-answering (where the extracted sentences can form the basis of an answer). In this paper, we concentrate upon extraction of sentences for inclusion into a summary. From a machine learning perspective, sentence extraction is interesting because typically, the number of sentences to be extracted is a very small fraction of the total number of sentences in the document. Furthermore, those clues which determine whether a sentence should be extracted or not tend to be either extremely specific, or very weak, and they also interact together in non-obvious ways. From a linguistic perspective, the [...] ability to integrate together diverse levels of linguistic description.
Frequently (see section 6 for examples), sentence extraction systems are based around simple algorithms which assume independence between those features used to encode the task. A consequence of this assumption is that such approaches are fundamentally unable to exploit dependencies which presumably exist in the features that would be present in an ideal sentence extraction system. This situation may be acceptable when the features used to model sentence extraction are simple. However, it will rapidly become unacceptable when more sophisticated heuristics, with complicated interactions, are brought to bear upon the problem. For example, Boguraev and Neff (2000a) argue that the quality of summarisation can be increased if lexical cohesion factors (rhetorical devices which help achieve cohesion between related document utterances) are modelled by a sentence extraction system. Clearly such devices (for example, lexical repetition, ellipsis, co-reference and so on) all contribute towards the general discourse structure of some text and furthermore are related to each other in non-obvious ways.
Maximum entropy (log-linear) models, on the other hand, do not make unnecessary independence assumptions. Within the maximum entropy framework, we are able to optimally integrate together whatever sources of knowledge we believe potentially to be useful for the task. Should we use features that are beneficial, then the model will be able to exploit this fact. Should we use features that are irrelevant, then again, the model will be able to notice this, and effectively ignore them. Models based on maximum entropy are therefore well suited to the sentence extraction task, and furthermore, yield competitive results on a variety of language tasks (Ratnaparkhi, 1996; Berger et al., 1996; Charniak, 1999; Nigam et al., 1999).
Our model works incrementally, and does not always need to process the entire document before assigning a classification.¹ It discriminates between those sentences which should and should not be extracted. This contrasts with ranking approaches which need to process the entire document before extracting sentences. Because we model whether a sentence should be extracted or not in terms of features that are extracted from the sentence (and its context in the document), we do not need to specify the size of the summary. Again, this contrasts with ranking approaches which need to specify a priori the summary size.
Our maximum entropy approach for sentence extraction does not come without problems. Using reasonably standard features, and when extracting sentences from technical papers, we find that precision levels are high, but recall is very low. This arises from the fact that those features which predict whether a sentence should be extracted tend to be very specific and occur infrequently. Features for sentences that should not be extracted tend to be much more abundant, and so more likely to be seen in the future. A simple prior probability is shown to help counteract this tendency. Using our prior, we find that the maximum entropy approach is able to yield results that are better than a naive Bayes classifier.
Our final set of experiments looks more closely at the differences between maximum entropy and naive Bayes. We show that when we have access to an oracle that is able to tell us when to extract a sentence, then in the situation when that information is encoded in dependent features, maximum entropy easily outperforms naive Bayes. Furthermore, we also show that even when that information is encoded in terms of independent features, naive Bayes can be incapable of fully utilising this information, and so produces worse results than maximum entropy.²
¹ Incremental classification means that a document is processed from start to finish and decisions are made as soon as sentences are encountered. Some of our features (in particular, those which encode sentence position in a document) do require processing the entire document. Using such features prevents true incremental processing. However, it is trivial to remove such features and so ensure true incrementality.
² As a reviewer commented, under certain circumstances, naive Bayes can do well even when there are strong dependencies within features (Domingos and Pazzani, 1997). For example, when the sample size is small, naive Bayes can be competitive with more sophisticated approaches such as maximum entropy. Given this, a fuller comparison of naive Bayes and maximum entropy for sentence extraction requires considering sample size [...]
Section 2 outlines the general framework for sentence extraction using maximum entropy modelling. Section 3 presents our naive Bayes classifier (which is used as a comparison with maximum entropy). We then show in section 4 how both our maximum entropy and naive Bayes classifiers can be extended with an (optimised) prior. The issue of summary size is touched upon in section 5. Section 6 discusses related work. We then present our main results (section 7). Finally, section 8 discusses our results and considers future work.
2 Maximum Entropy for Sentence Extraction
2.1 Conditional Maximum Entropy
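The parametric form for a conditional maximum entropy model is as follows (Nigam et al., 1999):
\[ P(c \mid s) = \frac{1}{Z(s)} \exp\Big( \sum_i \lambda_i f_i(c, s) \Big) \tag{1} \]
\[ Z(s) = \sum_{c' \in C} \exp\Big( \sum_i \lambda_i f_i(c', s) \Big) \tag{2} \]
Here, c is a label (from the set of labels C) and s is the item we are interested in labelling (from the set of items S). In our domain, C simply consists of two labels: one indicating that a sentence should be in the summary (`keep'), and another label indicating that the sentence should not be in the summary (`reject'). S consists of a training set of sentences, linked to their originating documents. This means that we can recover the position of any given sentence in any given document.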
Within maximum entropy models, the training set is viewed in terms of a set of features. Each feature expresses some characteristic of the domain. For example, a feature might capture the idea that abstract-worthy sentences contain the words "in this paper". In equation 1, f_i(c, s) is a feature. In this paper we restrict ourselves to integer-valued functions. An example feature might be as follows:
\[ f(c, s) = \begin{cases} 1 & \text{if } s \text{ contains the phrase ``in this paper'' and } c \text{ is the label \emph{keep}} \\ 0 & \text{otherwise} \end{cases} \tag{3} \]
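As an illustrative sketch of equation 1 and of the example feature above (it is not the implementation used in the experiments), the following Python computes P(c|s) from a list of binary feature functions. The second feature, the weight values and all function names are assumptions made up for the example.

```python
import math

LABELS = ("keep", "reject")

def f_in_this_paper(c, s):
    """Example feature: fires when s contains 'in this paper' and c is 'keep'."""
    return 1 if "in this paper" in s.lower() and c == "keep" else 0

def f_short_sentence(c, s):
    """Hypothetical second feature: fires for sentences under 6 words labelled 'reject'."""
    return 1 if len(s.split()) < 6 and c == "reject" else 0

FEATURES = [f_in_this_paper, f_short_sentence]
WEIGHTS = [1.5, 0.8]  # made-up weights; real ones come from training

def unnormalised(c, s):
    """exp of the weighted sum of the features that fire for (c, s)."""
    return math.exp(sum(w * f(c, s) for w, f in zip(WEIGHTS, FEATURES)))

def prob(c, s):
    """P(c | s): the unnormalised score divided by the sum over all labels."""
    z = sum(unnormalised(label, s) for label in LABELS)
    return unnormalised(c, s) / z
```

For a sentence such as "In this paper we describe a new parser for English", only the first feature fires (for the label keep), so the model prefers keep.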
Features are related to each other through weights (as can be seen in equation 1, where some feature f_i has a weight \lambda_i). [...] techniques. In this paper, we use conjugate gradient descent to find the optimal set of weights. Conjugate gradient descent converges faster than Improved Iterative Scaling (Lafferty et al., 1997), and empirically we find that it is numerically more stable.
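2.2 Maximum Entropy Classification
When classifying sentences with maximum entropy, we use the equation:
\[ \mathrm{label}(s) = \arg\max_{c \in C} P(c \mid s) \tag{4} \]
In practice, we are not interested in the probability of a label given a sentence. Instead we use the unnormalised score:
\[ \mathrm{label}(s) = \arg\max_{c \in C} \exp\Big( \sum_i \lambda_i f_i(c, s) \Big) \tag{5} \]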
Note that this maximum entropy classifier assumes a uniform prior. Section 4 shows how a non-uniform prior is used in place of this uniform prior.
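Classification with the unnormalised score can be sketched as below. This is a minimal illustration under stated assumptions: features are represented as strings naming the active features of a sentence, weights are stored in a dictionary keyed by (feature, label), and the feature names and weight values are hypothetical. Because exp is monotonic, comparing the weighted sums directly selects the same label as comparing their exponentials.

```python
def classify(active_features, weights, labels=("keep", "reject")):
    """Return the label with the highest unnormalised (log-domain) score.

    active_features: set of feature names that fire for the sentence
    weights: dict mapping (feature_name, label) to a real-valued weight
    """
    def raw_score(c):
        # Absent features and unseen (feature, label) pairs contribute nothing.
        return sum(weights.get((g, c), 0.0) for g in active_features)
    return max(labels, key=raw_score)

# Hypothetical weights for two features.
example_weights = {
    ("pair:in_this", "keep"): 2.0,
    ("len:short", "reject"): 0.5,
}
```

Only active features are visited, which mirrors the decision not to consider absent features.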
We now present our basic naive Bayes classifier. Afterwards, we extend this classifier with a non-uniform prior.
3 Naive Bayes Classification
As an alternative to maximum entropy, we also investigated a naive Bayes classifier. Unlike maximum entropy, naive Bayes assumes features are conditionally independent of each other. So, comparing the two together will give an indication of the level of statistical dependencies which exist between features in the sentence extraction domain. For our experiments, we used a variant of the multi-variate Bernoulli event model (McCallum and Nigam, 1998). In particular, we did not consider features that are absent in some example. This allows us to avoid summing over all features in the model for each example. Note that our maximum entropy model also did not consider absent features.
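Within our naive Bayes approach, the probability of a label given the sentence is as follows:
\[ P(c \mid s) = \frac{P(c) \prod_i P(g_i \mid c)}{P(s)} \tag{6} \]
As before, s is some sentence, c the label, and g_i is some active feature describing sentence s. Naive Bayes models can be estimated in a closed form by simple counting. For features which have zero counts, we use add-k smoothing (where k is a small [...] constant:
\[ P(c \mid s) \propto P(c) \prod_i P(g_i \mid c) \tag{7} \]
If we assume a uniform prior (in which case P(c) is a constant for all c), this can be further simplified to:
\[ P(c \mid s) \propto \prod_i P(g_i \mid c) \tag{8} \]
Our basic naive Bayes classifier is as follows:
\[ \mathrm{label}(s) = \arg\max_{c \in C} \prod_i P(g_i \mid c) \tag{9} \]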
As with the maximum entropy classifier, we later replace the uniform prior with a non-uniform prior.
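A naive Bayes classifier of this kind, restricted to active features and estimated by counting with add-k smoothing, might look like the following sketch. The value k = 0.1, the smoothed Bernoulli estimate (count + k) / (N_c + 2k) and the function names are assumptions for illustration, not details taken from the experiments.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples, k=0.1):
    """examples: list of (set_of_active_features, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    feat_counts = defaultdict(Counter)
    for feats, label in examples:
        for g in feats:
            feat_counts[label][g] += 1

    def log_p(g, c):
        # Add-k smoothed estimate of P(g is active | c).
        return math.log((feat_counts[c][g] + k) / (label_counts[c] + 2 * k))

    return label_counts, log_p

def classify_nb(feats, model):
    """Uniform-prior classifier: argmax over labels of the product of P(g | c),
    computed as a sum of logs over the active features only."""
    label_counts, log_p = model
    return max(label_counts, key=lambda c: sum(log_p(g, c) for g in feats))

toy_data = [
    ({"a", "b"}, "keep"), ({"a"}, "keep"),
    ({"c"}, "reject"), ({"c", "b"}, "reject"),
]
toy_model = train_nb(toy_data)
```

Working in the log domain avoids underflow when many features are active.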
4 Maximum a Posteriori Classification
In this section, we show how our classifiers can be extended with a non-uniform prior. We also describe how such a prior can be optimised.
4.1 Adding a non-uniform prior
Now, the two classifiers mentioned previously (equations 9 and 5) are both based on maximum likelihood estimation. However, as we describe later, for sentence extraction, the maximum entropy classifier tends to over-select one label. In particular, it tends to reject too many sentences for inclusion into the summary. So, it is useful to extend the two previous classifiers with a non-uniform prior. For the naive Bayes classifier, we have:
\[ \mathrm{label}(s) = \arg\max_{c \in C} P(c) \prod_i P(g_i \mid c) \tag{10} \]
Here, P(c) is our prior. The probability of the data (P(s)) is constant and so can be dropped.
For the maximum entropy case, we are not interested in the actual probability:
\[ \mathrm{label}(s) = \arg\max_{c \in C} F(c) \exp\Big( \sum_i \lambda_i f_i(c, s) \Big) \tag{11} \]
F(c) is a function equivalent to the prior when using the unnormalised classifier. When this prior distribution (or equivalent function) is uniform, classification is as before (namely as outlined in sections 2 and 3), and depends upon the maximum entropy or naive Bayes component. When the prior is non-uniform, the classifier behaviour will change. This prior therefore allows us to affect the performance [...]
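The effect of a non-uniform prior on either classifier can be sketched in the log domain, where multiplying by P(c) or F(c) becomes adding its logarithm. The scores and prior values below are invented for illustration, and the function name is hypothetical.

```python
import math

def classify_with_prior(log_scores, prior):
    """log_scores: label -> log-domain score from the classifier component
                   (sum of weights for maximum entropy, sum of log P(g|c) for naive Bayes)
    prior:      label -> positive value playing the role of P(c) or F(c)"""
    return max(log_scores, key=lambda c: math.log(prior[c]) + log_scores[c])

scores = {"keep": 1.0, "reject": 1.2}      # made-up component scores
uniform = {"keep": 0.5, "reject": 0.5}
favour_keep = {"keep": 0.7, "reject": 0.3}
```

With the uniform prior the decision depends only on the component score; shifting prior mass towards keep flips a borderline reject decision.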
We treat the problem of selecting a prior as an optimisation task: select some P(c) (or F(c)) such that performance, as measured by some objective function of the overall classifier, is maximised. Since the choice of objective function is up to us, we can easily optimise the classifier in any way we decide. For example, we could optimise for recall by using as our objective function an f-measure that weighted recall more highly than precision. In this paper, we optimise the prior using as an objective function the f2 score of the classifier (section 7 details this score). Our prior therefore does not reflect relative frequencies of labels (as found in some corpus).
We now need to optimise our prior. Brent's one-dimensional function minimisation method is well suited to this task (Press et al., 1993), since for a random variable taking two values, the probability of one value can be defined in terms of the other value. Section 7 describes the held-out optimisation strategy used in our experiments.
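Because the prior over two labels has a single free parameter p = P(keep), with P(reject) = 1 - p, tuning it is a one-dimensional search. The sketch below substitutes a plain grid search for Brent's method, purely to keep the example short and dependency-free; the held-out scores and gold labels are invented, and all names are hypothetical.

```python
import math

def f2_score(predicted, gold):
    """Balanced f-measure over 'keep' decisions."""
    j = sum(1 for p, g in zip(predicted, gold) if p == g == "keep")
    k = sum(1 for p in predicted if p == "keep")
    m = sum(1 for g in gold if g == "keep")
    if j == 0:
        return 0.0
    precision, recall = j / k, j / m
    return 2 * precision * recall / (precision + recall)

def optimise_prior(heldout_scores, gold, steps=99):
    """heldout_scores: list of (keep_score, reject_score) pairs in the log domain.
    Returns the grid value of p = P(keep) with the best held-out f2, and that f2."""
    best_p, best_f2 = 0.5, -1.0
    for i in range(1, steps + 1):
        p = i / (steps + 1)
        predicted = [
            "keep" if math.log(p) + ks > math.log(1 - p) + rs else "reject"
            for ks, rs in heldout_scores
        ]
        score = f2_score(predicted, gold)
        if score > best_f2:
            best_p, best_f2 = p, score
    return best_p, best_f2

# Invented held-out data: the component alone rejects two sentences that should be kept.
heldout = [(0.0, 1.0), (0.0, 1.0), (0.0, 3.0), (1.0, 0.0)]
gold_labels = ["keep", "keep", "reject", "keep"]
best_p, best_f2 = optimise_prior(heldout, gold_labels)
```

The f2 objective is a step function of p, so a derivative-free one-dimensional method (Brent's method, or the grid used here) is a natural fit.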
Should we decide to use a more elaborate prior (for example, one which was also sensitive to properties of documents) then we would need to use a multi-dimensional function minimisation method.
Note that we have not simultaneously optimised the likelihood and prior probabilities. This means that we do not necessarily find the optimal maximum a posteriori (MAP) solution. It is possible to integrate into maximum entropy estimation (simple) conjugate priors that do allow MAP solutions to be found (Chen and Rosenfeld, 1999). Although it is an open question whether more complex priors can be directly integrated, future work ought to consider the efficacy of such approaches in the context of summarisation.
5 Summary size
Determining the size of the summary is an important consideration for summarisation. Frequently, this is carried out dynamically, and specified by the user. For example, when there is limited opportunity to display long summaries a user might want a terse summary. Alternatively, when recall is important, a user might prefer a longer summary. Usually, systems rank all sentences in terms of how abstract-worthy they are, and then take the top n most highly ranked sentences. This always requires the size of the summary to be specified.
In our classification framework, sentences are processed (largely) independently of each other, and so there is no direct way of controlling the size of the [...] summary size, we can rank sentences using our classifiers (we not only label but can also assign label probabilities) and select the top n most highly ranked sentences.
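Ranking by label probability and keeping the top n sentences can be sketched as follows. The probabilities are invented, the function name is hypothetical, and returning the chosen sentences in document order is an assumption of this example rather than something stated above.

```python
def top_n_sentences(sentences, keep_probs, n):
    """Select the n sentences with the highest P(keep | s), returned in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: keep_probs[i], reverse=True)
    chosen = sorted(ranked[:n])  # restore original document order
    return [sentences[i] for i in chosen]

summary = top_n_sentences(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], 2)
```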
Within our classification approach, the optimised prior plays a similar role to the user-defined number of sentences that a ranking approach might return.
Experiments (not reported here) showed that ranking sentences using our maximum entropy classifier, and then selecting the top n most highly ranked sentences, produced slightly worse results than when selecting sentences in terms of classification.
6 Related Work
The summarisation literature is large. Here we consider only a representative sample.
Kupiec et al. (1995) used Naive Bayes for sentence extraction. They did not consider the role of the prior, nor did they use Naive Bayes for classification. Instead, they used it to rank sentences and selected the top n sentences. The TEXTRACT system included a sentence extraction component that is frequency-based (Boguraev and Neff, 2000b). Whilst the system uses a wide variety of linguistic cues when scoring sentences, it does not combine these scores in an optimal manner. Also, it does not consider interactions between the linguistic cues. Goldstein et al. (1999) used a centroid similarity measure to score sentences. They do not appear to have optimised their metric, nor do they deal with statistical dependencies between their features.
7 Experiments
Summarisation evaluation is a hard task, principally because the notion of an objective summary is ill-defined. That aside, in order to compare our various systems, we used an intrinsic evaluation approach. Our summaries were evaluated using the standard f2 score:
\[ r = \frac{j}{m} \qquad p = \frac{j}{k} \qquad f2 = \frac{2pr}{p + r} \]
where:
r = Recall
p = Precision
j = Number of correct sentences in summary
k = Number of sentences in summary
m = Number of correct sentences in the document
A sentence being `correct' means that it was (abstract-[...] will therefore attempt to mimic the process of selecting what it means for a sentence to be important in a document.
Naturally this premise (that an annotator can decide a priori whether a sentence is abstract-worthy or not) is open to question. That aside, in other sentence extraction scenarios, it may well be the case that sentences can be reliably annotated.
The f2 score treats recall and precision equally. This is a sensible metric to use as we have no a priori reason to believe in some other non-equal ratio of the two components.
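As a worked illustration of the f2 score, using hypothetical counts that are not taken from our data: suppose a document has m = 8 correct sentences and a summary of k = 10 sentences contains j = 4 of them.

```latex
\[
r = \frac{j}{m} = \frac{4}{8} = 0.5, \qquad
p = \frac{j}{k} = \frac{4}{10} = 0.4, \qquad
f2 = \frac{2pr}{p + r} = \frac{2 \times 0.4 \times 0.5}{0.4 + 0.5} \approx 0.44
\]
```

The score sits between precision and recall and is pulled towards the smaller of the two.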
Our evaluation results are based on the following approach:
1. Split the set of documents into two disjoint sets (T1 and T2), with 70 documents in T1 and 10 documents in T2.
2. Further split T1 into two disjoint sets T3 and T4. T3 is used to train a model, and T4 is a held-out set. The prior is estimated using Brent's line minimisation method, when training using T3 and evaluating on T4. T3 consisted of 60 documents and T4 consisted of 10 documents.
3. Results are then presented using a model trained on T1, with the prior just found, and evaluated using T2. T1 is therefore the training set and T2 is the testing set. Results are also presented using a flat prior.
4. The whole process is then repeated after randomising the documents. The final results are then averaged over these n runs. We set n to 40.
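The four steps above can be sketched as a small harness. This is an illustration only: `evaluate` is a hypothetical callback standing in for training on T3, tuning the prior on T4, retraining on T1 and scoring on T2, and the sketch simply averages whatever single number that callback returns.

```python
import random

def make_splits(documents, rng, n_test=10, n_heldout=10):
    """Shuffle and split into T1 (train), T2 (test), T3 (train for prior), T4 (held-out)."""
    docs = list(documents)
    rng.shuffle(docs)
    t2, t1 = docs[:n_test], docs[n_test:]
    t4, t3 = t1[:n_heldout], t1[n_heldout:]
    return t1, t2, t3, t4

def run_experiment(documents, evaluate, n_runs=40, seed=0):
    """Repeat the split-and-evaluate cycle n_runs times and average the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        t1, t2, t3, t4 = make_splits(documents, rng)
        scores.append(evaluate(t1, t2, t3, t4))
    return sum(scores) / len(scores)
```

With 80 documents the defaults reproduce the 70/10 and 60/10 split sizes described in steps 1 and 2.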
7.1 Document set
For data, we used the same documents that Teufel (2001) used in her experiments.³ In brief, these were 80 conference papers, taken from the Comp-lang preprint archive, and semi-automatically converted from LaTeX to XML. The XML annotated documents were then additionally manually marked-up with tags indicating the status of various sentences. This document set is modest in size. On the other hand, the actual documents are longer than newswire messages typically used for summarisation tasks. Also, the documents show variation in style. For example, some documents are written by non-native speakers, some by students, some by multiple authors and so on. Summarisation is therefore hard.
³ A superset of the documents is described in (Teufel [...]
On average, each document contained 8 sentences that were marked as being abstract-worthy (standard deviation of 3.1). The documents on average each contained in total 174 sentences (standard deviation 50.7). Here, a `sentence' is either any sequence of words that happened to be in a title, or else any sequence of words in the rest of the document. As can be seen, the summaries are not uniformly long. Also, the documents vary considerably in length. Summary size is therefore not constant.
7.2 Features
We used the following, fairly standard features when describing all sentences in the documents:
Word pairs. Word pairs are consecutive words as found in a sentence. A word pair feature simply indicates whether a particular word pair is present. All words were reduced: truncated to be at most 10 characters long. Stemming (as for example carried out by the Porter stemmer) produced worse results. We extracted all word pairs found in all sentences, and for any given sentence, found the set of (reduced) word pairs.
Sentence length. We encoded in three binary features whether a sentence was less than 6 words in length, whether it was greater than 20 words in length, or whether it was in between these two ranges. We also used a feature which encoded whether a previous sentence was less than 5 words or longer. This captured the idea that summary sentences tend to follow headings (which are short).
Sentence position. Summary sentences tend to occur either at the start, or the end of a document. We used three features to encode sentence position: whether a given sentence was within the first 8 paragraphs of a document, whether a sentence was in the last 3 paragraphs, or whether the sentence was in a paragraph between these two ranges. Note that this feature requires the whole document to be processed before classification can take place.
(Limited) discourse features. Our features described whether a sentence immediately followed typical headings such as conclusion or introduction, whether a sentence was at the start of a paragraph, or whether a sentence followed some generic heading.
[...] to be typical of those found in sentence extraction systems. Note that some of our features exploit the fact that the documents are annotated with structural information (such as headers etc).
Experiments with removing stop words from documents resulted in decreased performance. We conjecture that this is because our word pairs are extremely crude syntax approximations. Removing stop words from sentences and then creating word pairs makes these pairs even worse syntax approximations. However, using stop words increased the number of features in our model, and so again reduced performance. We therefore compromised between these two positions, and mapped all stop words to the same symbol prior to creation of word pair features. We also found it useful to remove word pairs which consisted solely of stop words. Finally, for maximum entropy, we deleted any feature that occurred less than 4 times. Naive Bayes did not benefit from a frequency-based cutoff.
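The word pair and sentence length features, together with the stop word mapping and frequency cutoff just described, can be sketched as follows. Whitespace tokenisation, lowercasing, the `<stop>` symbol, the feature name prefixes and the stop word list are all assumptions made for this example.

```python
from collections import Counter

STOP = "<stop>"  # single symbol that every stop word is mapped to (name is hypothetical)

def reduce_word(word, stopwords):
    """Map stop words to one symbol; truncate other words to at most 10 characters."""
    w = word.lower()
    return STOP if w in stopwords else w[:10]

def word_pair_features(sentence, stopwords):
    """Set of consecutive (reduced) word pairs, skipping pairs made only of stop words."""
    tokens = [reduce_word(w, stopwords) for w in sentence.split()]
    pairs = set()
    for left, right in zip(tokens, tokens[1:]):
        if left == STOP and right == STOP:
            continue
        pairs.add(f"pair:{left}_{right}")
    return pairs

def length_features(sentence):
    """One of three binary length features: under 6 words, over 20 words, or in between."""
    n = len(sentence.split())
    if n < 6:
        return {"len:short"}
    if n > 20:
        return {"len:long"}
    return {"len:medium"}

def prune_rare(feature_sets, min_count=4):
    """Delete features seen fewer than min_count times across all sentences."""
    counts = Counter(g for feats in feature_sets for g in feats)
    return [{g for g in feats if counts[g] >= min_count} for feats in feature_sets]

stop = {"the", "of", "in", "this"}
feats = word_pair_features("In this paper the summarisation of documents", stop)
```

In the example, "summarisation" is truncated to "summarisat" and the pair formed by "In this" is dropped because both words are stop words.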
7.3 Classifier comparison
Here we report on our classifiers.
As a baseline model, we simply extracted the first n sentences from a given document. Figure 1 summarises our results as n varies. In this table, as in all subsequent tables, P and R are averaged precision and recall values, whilst F2 is the f2 score of these averaged values.
 n  F2   P   R      n  F2   P   R
 1   0   0   0     26  16  10  36
 6   3   3   2     31  18  12  45
11  19  15  26     36  18  11  53
16  20  16  29     41  17  10  58
21  23  16  38     46  16   9  58
Figure 1: Results for the baseline model
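The baseline is simple enough to state in a few lines. The sketch below scores the first n sentences of one document against gold labels; the label sequence is invented, and the function name is hypothetical.

```python
def baseline_scores(gold_labels, n):
    """Precision and recall of extracting the first n sentences of a document.

    gold_labels: per-sentence labels ('keep' or 'reject') in document order."""
    selected = gold_labels[:n]
    j = sum(1 for label in selected if label == "keep")   # correct sentences in summary
    k = len(selected)                                      # sentences in summary
    m = sum(1 for label in gold_labels if label == "keep") # correct sentences in document
    precision = j / k if k else 0.0
    recall = j / m if m else 0.0
    return precision, recall

toy_gold = ["reject", "keep", "keep", "reject", "keep", "reject"]
```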
Figure 2 shows our results for maximum entropy, both with and without the prior. Prior optimisation was with respect to the f2 score. As in subsequent tables, we show system performance when adding more and more features.
Features            Flat prior      Optimised prior
                    F2   P   R      F2   P   R
Word pairs           8   5  30      20  40  14
and sent length     25  63  16      36  36  36
and sent position   28  62  18      39  35  45
and discourse       35  63  24      42  43  41
Figure 2: Results for the maximum entropy model
Performance without the prior is heavily skewed towards precision. This is because our features are largely acting categorically: the sheer presence of some feature is sufficient to influence labelling choice. Further evidence for this analysis is provided by inspecting one of the models produced when using the [...] feature instances in the model, the vast majority are deemed irrelevant by maximum entropy, and assigned a zero weight. Only 7086 features (roughly 10% in total) had non-zero weights.
Performance using the optimised prior shows more balanced results, with an increase in F2 score. Clearly optimising the prior has helped counter the categorical behaviour of features in our maximum entropy classifier.
Figure 3 shows the results we obtained when using a naive Bayes classifier. As before, the results show performance with and without the addition of the optimised prior. Naive Bayes outperforms maximum entropy when neither classifier uses a prior. Performance with and without the prior, however, is worse than the performance of our maximum entropy classifier with the prior. Evidently, even our relatively simple features interact with each other, and so approaches such as maximum entropy are required to fully exploit them.
Features            Flat prior      Optimised prior
                    F2   P   R      F2   P   R
Word pairs          26  29  23      29  26  32
and sent length     31  33  28      32  29  35
and sent position   33  34  33      36  31  43
and discourse       38  39  37      39  38  40
Figure 3: Results for the naive Bayes model
7.4 Using informative features
Our previous results showed that maximum entropy could outperform naive Bayes. However, the differences, though present, were not large. Clearly, our feature set was imperfect.⁴ It is therefore instructive to see what happens if we had access to an oracle who always told us the true status of some unseen sentence. To make things more interesting, we encoded this information in terms of dependent features. We simulated this oracle by using two features which were active whenever a sentence should not be in the summary; for sentences that should be included in the summary, we let either one of those two features be active, but on a random basis. Our features therefore are only informative when the learner is capable of noting that there are dependencies. We then repeated our previous maximum entropy and naive Bayes experiments. Figure 4 summarises our results.
⁴ Another possible reason for the closeness of the results is the small sample size. There may just not be [...]
Features            Naive Bayes     Maxent
                    F2   P   R      F2   P   R
Word pairs          30  34  26      32  93  19
and sent length     35  38  32      99 100  99
and sent position   40  41  39     100 100 100
and discourse       43  44  41      99 100  97
Figure 4: Results for basic naive Bayes and maximum entropy models using dependent informative features
Features            Naive Bayes     Maxent
                    F2   P   R      F2   P   R
Word pairs          84  74  97      25  15  91
and sent length     85  75  97     100 100 100
and sent position   84  73  97     100 100 100
and discourse       84  74  97     100 100 100
Figure 5: Results for basic naive Bayes and maximum entropy models using independent informative features
Unsurprisingly, we see that when features are highly dependent upon each other, maximum entropy easily outperforms naive Bayes.
Even when we have access to features that are independent of each other, naive Bayes can still do worse than maximum entropy. To demonstrate this, we used a feature that was active whenever a sentence should be in the summary. This feature was not active on sentences that should not be in the summary. Figure 5 summarises our results.
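The two simulated oracles can be sketched as feature generators. The feature names and the decoding helper are hypothetical; the sketch only shows how the oracle information is encoded, not how either classifier was trained.

```python
import random

def dependent_oracle(label, rng):
    """Two oracle features: both fire on 'reject'; exactly one, chosen at random, fires on 'keep'."""
    if label == "reject":
        return {"oracle:a", "oracle:b"}
    return {rng.choice(["oracle:a", "oracle:b"])}

def independent_oracle(label):
    """A single oracle feature that fires exactly when the sentence should be kept."""
    return {"oracle:keep"} if label == "keep" else set()

def decode_dependent(features):
    """Only the conjunction of both features identifies 'reject'; either one alone does not."""
    return "reject" if {"oracle:a", "oracle:b"} <= features else "keep"
```

In the dependent encoding each feature on its own appears with both labels, so the label is recoverable with certainty only by a learner that takes the co-occurrence of the two features into account.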
As can be seen (figure 5), even when naive Bayes has access to a perfectly reliable informative feature, the fact that the other features are not suitably discounted means that performance is worse than that of maximum entropy. Maximum entropy can discount the other features, and so can take advantage [...]
8 Comments and Future Work
We showed how maximum entropy could be used for sentence extraction, and in particular, that adding a prior could deal with the categorical nature of the features. Maximum entropy, with an optimised prior, did yield marginally better results than naive Bayes (with and without a similarly optimised prior). However, the differences were not that great. Our further experiments with informative features showed that this lack of difference was probably due (at least in part) to the actual features used, and not due to the technique itself.
Our oracle results are an idealisation. A fuller comparison should use more sophisticated features, along with more data. We conjecture that, should we use a much more sophisticated feature set, the differences between maximum entropy and naive Bayes would become greater.
Our approach treated sentences largely independently of each other. However, abstract-worthy sentences tend to bunch together, particularly at the beginning and end of a document. We intend to capture this idea by making our approach sequence-based: future decisions should also be conditioned on previous choices.
A problem with supervised approaches (such as ours) is that we need annotated material (Marcu, 1999). This is costly to produce. Future work will consider weakly supervised approaches (for example cotraining) as a way of bootstrapping labelled material from unlabelled documents (Blum and Mitchell, 1998). Note that there is a close connection between multi-document summarisation (where many alternative documents all consider similar issues) and the concept of a view in cotraining. We expect that this redundancy could be exploited as a means of providing more annotated training material, and so yield better results.
In summary, maximum entropy can be beneficially used in sentence extraction. However, one needs to guard against categorical features. An optimised prior can provide such help.
Acknowledgement
We would like to thank Rob Malouf for supplying the excellent log-linear estimation code, Simone Teufel for providing the annotated data, Karen Spark Jones for a discussion about summarisation, Steve Clark for spotting textual bugs and the anonymous [...]
References
Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 21–22.
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers.
Branimir K. Boguraev and Mary S. Neff. 2000a. The effects of analysing cohesion on document summarisation. In Proceedings of the 18th International Conference on Computational Linguistics, volume 1, pages 76–82, Saarbrücken.
Branimir K. Boguraev and Mary S. Neff. 2000b. Discourse Segmentation in Aid of Document Summarization. In Proceedings of the 33rd Hawaii International Conference on Systems Science.
Eugene Charniak. 1999. A maximum-entropy-inspired parser. Technical Report CS99-12, Department of Computer Science, Brown University.
Stanley F. Chen and Ronald Rosenfeld. 1999. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Carnegie Mellon University.
Pedro Domingos and Michael J. Pazzani. 1997. On the optimality of the simple bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103–130.
Jade Goldstein, Mark Kantrowitz, Vibhu O. Mittal, and Jaime G. Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval, pages 121–128.
Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A Trainable Document Summarizer. In Proceedings of the 18th ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 68–73.
J. Lafferty, S. Della Pietra, and V. Della Pietra. 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, April.
Daniel Marcu. 1999. The automatic construction of large-scale corpora for summarization research. In Research and Development in Information Retrieval, pages 137–144.
A. McCallum and K. Nigam. 1998. A comparison of event models for naive bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization [...]
[...] McCallum. 1999. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering.
William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1993. Numerical Recipes in C: the Art of Scientific Computing. Cambridge University Press, second edition.
Adwait Ratnaparkhi. 1996. A Maximum Entropy Part-Of-Speech Tagger. In Proceedings of Empirical Methods in Natural Language, University of Pennsylvania, May. Tagger: ftp://ftp.cis.upenn.edu/pub/adwait/jmx.
S. Teufel and M. Moens. 1997. Sentence extraction as a classification task. In ACL/EACL-97 Workshop on Intelligent and Scalable Text Summarization, Madrid, Spain.
Simone Teufel. 2001. Task-Based Evaluation of Summary Quality: Describing Relationships Between Scientific Papers. In NAACL Workshop on Automatic Summarization, Pittsburgh,