An Evaluation Corpus For Temporal Summarization
Vikash Khandelwal, Rahul Gupta, and James Allan
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
fvikas,rgupta,[email protected]
ABSTRACT
Inrecent years,alot ofworkhasbeendone intheeld of Topic Tracking. Thefocusof this workhas beenon iden-tifyingstoriesbelongingto thesametopic. Thismight re-sultinaverylargenumberofstoriesbeingreportedtothe user. Itmightbemoreusefultoauserifasummaryofthe maineventsinthetopicratherthantheentirecollectionof stories related to the topic werepresented. Though work onsuchane-grainedlevelhas beenstarted, thereis cur-rentlynostandardevaluationtestbedavailabletomeasure theaccuracyofsuchtechniques. Wedescribeaschemefor developing atestbedofuserjudgmentswhichcanbeused to evaluate the above mentioned techniques. The corpus thatwehavecreatedcanalsobeusedtoevaluatesingleor multi-documentsummaries.
1.
THE PROBLEM
Inrecent years, alot of progress hasbeenmade in the eldofTopic Tracking([2],[3], [8],etc). Thefocusof this workhasbeenonidentifyingnewsstoriesbelongingtothe same topic. This might result in avery large numberof storiesbeingreportedtotheuser. Itwouldbemoreuseful toauserifasummaryofthemainevents/developmentsin thetopicratherthantheentirecollectionofstoriesrelated tothetopicwerepresented. Wecanformulatetheproblem asfollows.
Wearegivenastreamofchronologicallyorderedand top-icallyrelatedstories. Westrivetoidentifytheshiftsinthe topic which represent the developments within the topic. For example, consider the topic \2000 Presidential Elec-tions". OnthenightofNovember7,therewerereportsof Gore concedingdefeat to Bush. Thenextmorning, there werereportsclaiminghisretractionoftheprevious conces-sion.Mostofthestoriesonthenextdaywouldalsocontain oldinformation including detailsof Gore's rstphonecall to Bush. We want to present only the new development (i.e.,Gore'sretraction)onthenextday.
.
Weassumethatsentenceextractscanidentifysuchtopic shifts. Attheveryleast,theycanconveyenough informa-tiontoausertokeeptrackofthedevelopmentswithinthat topic.Forexample,inFigure1,themappingsindicatehow thesentencesinastory correspondtoevents.
Human judgments are required to evaluate accuracy of extracts. Theapproachusuallytakenistohaveeachsuch extractevaluatedbyhumanbeingsbutsuchaprocessis ex-pensiveandtimeconsuming.Weneedanevaluationcorpus similartotheTDTorTRECcorporathatcanbeusedover and over again to do suchevaluations automatically. We proposeanewschemeforbuildingsuchacorpus.
SummarizationevaluationisdiÆcultbecausesummaries canbecreatedforarangeofpurposes. TheTipster SUM-MAC evaluation [7] requiredhuman assessors to evaluate each summary, and most other evaluations have also re-quired human checking of every summary [6]. There are otherswho haveattemptedautomaticevaluations([5],[9]) butnoneoftheseevaluationschemescapturesallthe desir-ablepropertiesinasummary.
Theparticular problemof summarizingshifts ina news topic was attacked slightly dierently at a Summer 1999 workshop onNovelty Detection[4]. Those eorts towards \new informationdetection" werea dead endbecause the granularityofnewinformation was too small,e.g., a men-tionofaperson'sagemightcountasnewinformationeven whenitis notthefocus ofthestory. SwanandAllanalso createdanevent-levelsummary\timeline" ([10], [11])but theydidnotdevelopanyevaluationcorpusfor theirwork.
Thispaperis organisedasfollows. InSection2,we dis-cussthedesirableproperties ofsuchanevaluationcorpus. Section 3 discusses the entire annotation process, as well asthe interesting practical issues,the problemsfaced and thenthestatisticsofthecorpusthatwehavebuilt. Finally, inSection 4, we discuss one possible way of utilising this corpus.
2.
DESIRABLE PROPERTIES OF THE
EVALUATION CORPUS
Any evaluation corpus of sentence extracts and events whichistobeusedforthepurposeofevaluatingsummaries oftopic shiftsina news streamshould havethe following properties:
The Navy has ordered the discharge of sailor Timothy Mcveigh
after he described himself gay in his America online user profile.
Civil rights groups and gay campaigners are outraged.
Mcveigh, who’s no relation of the convicted Oklahoma bomber,
is Lodging an appeal.
Paul Miller has more.
Timothy R. Mcveigh put "gay" in the marital status part of an aol
For "the world," I am Paul Miller in Washington.
user profile.
He did not use his full name or say he was in the Navy, referring
to himself only as "Tim of Honolulu".
The Navy’s personnel department says that’s violation of the
Clinton administration’s Don’t ask/Don’t tell policy of
Homo-sexuals in the military and Mcveigh has been dismissed.
Many people are upset that the Navy asked aol for information
about the supposedly anonymous user and, a Naval
investigat-or says, the online service provided it.
Gay rights groups say it’s discrimination.
Privacy advocates say it’s a breach of confidentiality.
Sailor Mcveigh
dis-charged from Navy
Navy claims that he
violated "Don’t ask/
Don’t tell" policy.
Discrimination against
gays.
Breach of privacy by
Navy.
event
Not related to any
Figure1: Anexampleshowinghowsentenceextractscanindicateevents
It should be possible to quantify the precision of a summary, i.e., it should be possible to nd the pro-portionofrelevantsentencesinthesummary, It should be possible to identify redundancy in the
systemoutputbeingevaluated. Thereshouldbesome way of assigning amarginalutility tosentences con-tainingrelevantbutredundantinformation
It should be possible to quantifythe \usefulness" of a summarytaking recall, precision as well as redun-dancyintoaccount.
Sentence boundaries should be uniquely identied (though they need not be perfect) because the aim of the system is to identify the relevant portions in thesummary.
3.
BUILDING AN EVALUATION CORPUS
3.1
The annotation process
We collect astream ofstories related toa certain topic fromthe TDT-2corpus ofstories fromJanuary 1toJune 301998. We used stories that were judged \on-topic"by annotators from LDC. The topics were selected from the TDT1998and1999 evaluations. Thestoriesare parsedto obtainsentenceboundariesandall thesentencesare given uniqueidentiers. We proceed withcollecting the human judgmentsinthefollowingfoursteps.
1. Eachjudgereadsallthestoriesandidentiesthe im-portantevents.
2. Thejudgessittogethertomergetheeventsidentied bythem,toformasinglelistofeventsforthattopic. Alltheeventsaregivenuniqueidentiers.
3. Eachjudgegoesthroughthestoriesagain,connecting sentences to the relevant events. Obviously, not all sentences needtobe related to any event. However, ifsomesentenceisrelevanttomorethanoneevent,it islinkedtoallthoseevents.
4. Anotherjudgenowveriesthe mappingbetweenthe sentencesandtheevents.Thisgivesusthenal map-pingfromsentencestoevents.
This way we obtain all the events mentioned within a storyand we canalsond outtheeventswhichndtheir rstmentionwithin thisstory. Theadvantageof building theevaluation corpus inthis way is that these judgments can be used both for summarizing topic shifts as well as summarizinganygivenstorybyitself.
WehavebuiltauserinterfaceinJavatoallowjudgestodo theabove worksystematically. Figure2shows asnapshot oftheinterfaceusedbythejudges.
3.2
Statistics of the judgments obtained
Wehaveobtainedjudgmentsfor22topics. Threejudges worked on each topic. We summarize the results of the annotation process for a subset of the topics in Table 1. Wedenethe interjudgeagreement foraneventtobe the ratio of the numberof sentences linked to that event, as agreeduponbythethirdjudge,tothenumberofsentences inthe union of the sentences individually marked by the rsttwojudges forthat event. For atopic,theinterjudge agreementisdenedtobetheaverageoftheagreementfor alltheeventsinthattopic. ItistobenotedthattheKappa statisticisnotapplicable hereinanystandardform.
Topicid #of #of Timetaken Inter-judge stories events (inhours) Agreement 20008 49 10 4.5 0.91 20020 34 23 4.5 0.98 20021 48 9 2.5 0.97 20022 27 10 3.5 0.85 20024 38 12 2.75 0.98 20026 68 11 2.5 0.87 20031 34 15 2.5 0.62
20041 24 11 2 0.94
20042 28 14 2.5 0.66
20057 19 9 2 0.66
20065 57 16 2.33 0.94
20074 51 13 3 0.96
Average 39.75 12.75 2.88 0.86
Table1: Annotationstatisticsforsomeofthetopics
thedamageduetotornadosinFlorida. Weseethatevent5 (\Reliefagenciesneededmorethan$300,000toprovide re-lief")islinkedto4sentenceswhileevent1(\Atleast40 peo-plediedinFloridadueto10-15tornados.") islinkedto43 sentences. Wemaybeableto usethenumberofsentences linkedtoaeventas anindicatoroftheweight/importance oftheevent.
Wehavedividedourcorpusintotwoparts-oneeachfor trainingandtestingrespectively. Eachpart consists of11 topics. Care was takento ensurethatboth theparts had topicsofroughlythesamesizeandtimeofoccurrence. The statisticsofbothpartsofthecorpusaregiveninTable3.
3.3
Problems faced
Sometimesoursentenceparser brokeup avalid sen-tenceinto multiple parts. Onejudgelinkedonlythe
Eventid #of Inter-judge sentences Agreement
1 43 1.0
2 9 1.0
3 33 0.97
4 8 1.0
5 4 0.8
6 5 1.0
7 14 1.0
8 19 1.0
9 9 1.0
Table2: Varianceinthenumberofsentenceslinked todierent eventsfor topic 20021
relevant part of the sentence to the corresponding eventwhileanotherlinkedallthepartstothatevent. Thishappenedinthecaseofthreeofthetopics (top-ics 20031, 20042 and 20057) before we detected the problem.
Sometimes whensimilar sentences occur indierent stories, one of the judges neglected the later occur-rencesofthesentence.
3.4
Interesting issues/judges’ comments
Weaskedthejudgesforfeedbackontheannotation pro-cessand thediÆculties faced. Hereare someof the inter-estingissueswhichcroppedup:
sen-Numberoftopics 11 11 22 Numberofstories 474 470 944 pertopic 43.1 42.7 42.9 Numberofevents 162 181 343 pertopic 14.7 16.5 15.6 Numberofsentences 8043 9006 17049 pertopic 731.2 818.7 775.0 perstory 17.0 19.2 18.1 O-eventsentences 72% 70% 71% Single-eventsentences 24% 26% 25% Multi-eventsentences 4% 4% 4%
Table3: Characteristicsofthecorpus. Allnumbers except for the number of topics are averaged over alltopicsincludedin thatcolumn.
this did happen." and \america online is sayingthis never happened." Clearly, any onesentence doesnot adequately represent the event. This can be easily takencareofbyconsideringgroupsofsentencesrather thansinglesentences.
Abstract ideas : Sometimes the meaningof individ-ualsentencesistotallydierentfromoverallideathey convey.Satiricalarticlesareanexampleofthis. These kind of ideas cannot be represented by sentence ex-tracts. Weomittedsuchevents.
Sometimes dierent stories totally contradict each other. For example, some stories (on thesame day) claimaleadforBushwhileothersclaimGoretobefar ahead. Thisis moreofasummarizationissue though andneednotbedealtwithwhilebuildingthe evalua-tioncorpus.
4.
USING THE EVALUATION CORPUS
Wehaveusedthecorpusforevaluatingoursystemwhich produces temporal summaries in news stream ([1]). The problem of temporal summarization canbe formalized as follows. A news topic is made up of a set of events and isdiscussed inasequenceof newsstories. Most sentences ofthenewsstoriesdiscussoneormoreoftheeventsinthe topic.Somesentencesarenotgermanetoanyoftheevents. Those sentences are called \o-event" sentencesand con-trastwith\on-event"sentences. Thetaskofthe systemis toassignascoretoeverysentencethatindicatesthe impor-tanceof thesentenceinthesummary. Thisscoring yields arankingon all sentences inthe topic,including o- and on-eventsentences.
We will use measures that are analogues of recall and precision. Weareinterestedinmultipleproperties:
Useful sentencesarethose thathavethepotentialto beameaningful partofthesummary. O-event sen-tencesarenotuseful,butallothersentencesare. Novel sentences are thosethat are not redundant|
i.e., are new in the presentation. The rst sentence aboutaneventis clearlynovel,butallfollowing
sen-Figure3: nu-recallvsnu-precisionplot forthetask ofsummarizingtopic shiftsinanewsstream
Sizeofthesummaryisatypicalmeasureusedin sum-marizationresearchandweincludeithere.
Basedonthoseproperties, wecould denethefollowing measuretocapturethecombinationofusefulnessand nov-elty:
nu recal l = P
I(r(e i
)>0) E nu precision =
P
I(r(ei)>0) Sr
where Sr is the number of sentences retrieved, E is the numberofeventsinthetopic,eiiseventnumberi(1i E),r(e
i
)is thenumberofsentencesretrievedforevente i
, I(exp)is1ifexp istrueand0if not. Allsummationsare asirangesoverthe setofevents. NotethatSr 6=
P r(ei) sincecompletelyo-topicsentencesmightberetrieved.
Thenu-recallmeasureistheproportionoftheeventsthat havebeenmentionedinthe summary,and nu-precision is theproportionofsentencesretrievedthataretherst men-tionsofanevent.
We used this measure to evaluate the performance of oursystem overtheentiretrainingcorpus. Theresultsfor thetrainingcorpusareshowninthenu-recall/nu-precision graph in gure 3. This work is described in detail else-where([1]).
5.
FUTURE WORK
Weintendtocompletecollectinguserjudgmentsformore topics soon. Afteranalyzing the reliablity of these judg-mentsand correcting the few mistakesthat we had made initially, we will collect annotations for more topics. Ini-tially,wehadusedasimplebarebonessentenceparser,since thatismostlysuÆcientfortheworksuchacorpuswouldbe putto. Nevertheless,infutureannotations,wewillneedto improvethe sentenceparser. Weintend tocontinueusing thesejudgmentstoevaluatetheperformanceofthesystems that we are currently building to identify and summarize topicshiftsinnewsstreams.
Acknowledgements
This material is based on worksupported in part by the Library of Congress and Department of Commerceunder cooperative agreement number EEC-9209623 and in part bySPAWARSYSCEN-SDgrantnumberN66001-99-1-8912. Anyopinions,ndingsandconclusionsorrecommendations expressedinthismaterialaretheauthor(s)anddonot nec-essarilyreectthoseofthesponsor.
6.
REFERENCES
[1] J.Allan,R.Gupta,andV.Khandelwal.Temporal SummariesofNewsTopics.ProceedingsofSIGIR 2001Conference,NewOrleans,LA.,2001.
[2] J.Allan,V.Lavrenko,D.Frey,andV.Khandelwal. UMASSatTDT2000.TDT2000Workshopnotebook, 2000.
[3] J.Allan,R.Papka,andV.Lavrenko.On-lineNew EventDetectionandTracking.ProceedingsofSIGIR 1998,pp. 37-45,1998.
[4] J.Allanetal.Topic-basednoveltydetection.1999 SummerWorkshopatCLSPFinalReport.Available athttp://www.clsp.jhu.edu/ws99/tdt,1999.
[5] J.Goldstein, M.Kantrowitz,V.Mittal,and
J.Carbonell.Summarizingtextdocuments: Sentence SelectionandEvaluationMetrics.Proceedingsof SIGIR1999,August1999.
[6] H.Jing,R.Barzilay,K.McKeown,andM.Elhadad. SummarizationEvaluationMethods: Experiments andAnalysis.Workingnotes, AAAISpring SymposiumonIntelligentTextSummarization, Stanford, CA,April,1998.
[7] InderjeetManiandetal.TheTIPSTERSUMMAC TextSummarizationEvaluationFinalReport.1998. [8] R.Papka,J.Allan, andV.Lavrenko.UMASS
ApproachestoDetectionandTrackingatTDT2. Proceedings oftheDARPA Broadcast News Workshop,Herndon,VA,pp. 111-125,1999. [9] D.R.Radev,H.Jing,andM.Budzikowska.
Summarizationofmultipledocuments:clustering, sentenceextraction,andevaluation.ANLP/NAACL WorkshoponSummarization, Seattle,WA,2000. [10] R.SwanandJ.Allan. ExtractingSignicantTime
VaryingFeaturesfromText.ProceedingsoftheEighth InternationalConferenceonInformationand
KnowledgeManagement,pp.38-45,1999. [11] R.SwanandJ.Allan. AutomaticGenerationof