Contents lists available at ScienceDirect
Artificial Intelligence in Medicine
journal homepage: www.elsevier.com/locate/aiim
Comparison of automatic summarisation methods for clinical free text notes
Hans Moen a,b,c,∗, Laura-Maria Peltonen c,d, Juho Heimonen b,e, Antti Airola b, Tapio Pahikkala b,e, Tapio Salakoski b,e, Sanna Salanterä c,d
a Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Saelands vei 9, 7491 Trondheim, Norway
b Department of Information Technology, University of Turku, Joukahaisenkatu 3–5, 20520 Turku, Finland
c Department of Nursing Science, University of Turku, Lemminkäisenkatu 1, 20520 Turku, Finland
d Turku University Hospital, Kiinamyllynkatu 4–8, 20521 Turku, Finland
e Turku Centre for Computer Science (TUCS), Joukahaisenkatu 3–5, 20520 Turku, Finland
Article info
Article history: Received 4 May 2015; received in revised form 14 December 2015; accepted 5 January 2016
Keywords: Automatic text summarisation; Summarisation evaluation; Distributional semantics; Word space models; Clinical text processing; Electronic health records
Abstract
Objective: A major source of information available in electronic health record (EHR) systems is the clinical free text notes documenting patient care. Managing this information is time-consuming for clinicians. Automatic text summarisation could assist clinicians in obtaining an overview of the free text information in ongoing care episodes, as well as in writing final discharge summaries. We present a study of automated text summarisation of clinical notes. It looks to identify which methods are best suited for this task and whether it is possible to automatically evaluate the quality differences of summaries produced by different methods in an efficient and reliable way.
Methods and materials: The study is based on material consisting of 66,884 care episodes from EHRs of heart patients admitted to a university hospital in Finland between 2005 and 2009. We present novel extractive text summarisation methods for summarising the free text content of care episodes. Most of these methods rely on word space models constructed using distributional semantic modelling. The summarisation effectiveness is evaluated using an experimental automatic evaluation approach incorporating well-known ROUGE measures. We also developed a manual evaluation scheme to perform a meta-evaluation on the ROUGE measures to see if they reflect the opinions of health care professionals.
Results: The agreement between the human evaluators is good (ICC = 0.74, p < 0.001), demonstrating the stability of the proposed manual evaluation method. Furthermore, the correlation between the manual and automated evaluations is high (> 0.90 Spearman’s rho). Three of the presented summarisation methods (‘Composite’, ‘Case-Based’ and ‘Translate’) significantly outperform the other methods for all ROUGE measures (p < 0.05, Wilcoxon signed-rank test and Bonferroni correction).
Conclusion: The results indicate the feasibility of the automated summarisation of care episodes. Moreover, the high correlation between manual and automated evaluations suggests that the less labour-intensive automated evaluations can be used as a proxy for human evaluations when developing summarisation methods. This is of significant practical value for summarisation method development, because manual evaluation cannot be afforded for every variation of the summarisation methods. Instead, one can resort to automatic evaluation during the method development process.
© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
∗ Corresponding author at: Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Saelands vei 9, 7491 Trondheim, Norway. Tel.: +47 97502647.
E-mail addresses: [email protected] (H. Moen), lmemur@utu.fi (L.-M. Peltonen), juaheim@utu.fi (J. Heimonen), ajairo@utu.fi (A. Airola), aatapa@utu.fi (T. Pahikkala), tapio.salakoski@utu.fi (T. Salakoski), sansala@utu.fi (S. Salanterä).
1. Introduction
1.1. Background
Information overload in the health sector is becoming an increasing problem for clinicians [1,2]. They have to read masses of text (such as clinical notes, guidelines and scientific literature) to satisfy their information needs. Lack of time and resources to do this properly causes problems such as errors, frustration, inefficiency and communication failures [3].
http://dx.doi.org/10.1016/j.artmed.2016.01.003
ISSN 0933-3657.
The contents of electronic health record (EHR) systems are largely composed of clinical notes (or clinical narratives) in the form of unstructured and unclassified text. The clinical notes written during a single care episode, i.e. a stay in a hospital, can be quite voluminous, especially for patients suffering from more complex and long-term health problems. Knowing the medical history of a patient is vital for a clinician, but scanning through clinical notes consumes precious time that could be better spent treating the patient.
Automatic summarisation of the free text content in care episodes could assist clinicians in at least two ways. First, it could provide an (indicative) overview of the documentation of a care episode. Together with structured data (such as laboratory test results, images, diagnostic codes and personal information) it could help clinicians to familiarise themselves with the content of the care episode and the patient’s problems, which is particularly useful if the information is needed urgently. Second, it may help in writing a discharge summary of a care episode. Discharge summaries are crucial in communication between different health care providers and they are needed to ensure continuity of care. However, there are a number of challenges with them, ranging from being produced late to having insufficient information. For example, Kripalani et al. [4] showed that discharge summaries exchanged between hospitals and primary care physicians are often lacking some of the expected information, such as that related to treatment progression, counselling and follow-up proposals. Computer-assisted discharge summaries and standardised templates are measures for improving transfer time and the quality of discharge information between hospitals and primary care physicians [4]. The utilisation of automatic text summarisation could improve the timeliness and quality of discharge summaries even further.
Central to this work is the focus on resource-lean¹ and language-independent methods. Such methods are important for languages such as Finnish, for which no major manually constructed lexical resources suited for the comprehensive semantic analysis of clinical text are available.
1.2. Related work
This study focuses on the extraction-based summarisation approach, in which the summary is generated by selecting a subset of sentences from the relevant text. This approach is viable because a sizeable portion of a clinical text summary is created by copying or deriving information from clinical notes [2,5–7]. See [8,9], for example, for more information on extraction-based text summarisation.
A central issue in extraction-based summarisation is how to determine what the most relevant content to be included in a summary is. Common techniques of extraction-based summarisation include topic-based sentence extraction [10,11], where the relevance of a sentence is computed with respect to one or more topics of interest; and centrality-based sentence extraction [12,13], where the sentences that are the most (strongly) associated with others are selected on the assumption that they constitute the best coverage of the documents. In order to avoid including redundant information, it is common to apply the maximal marginal relevance criterion [10] or similar techniques that take sentence overlap into account. Purely statistical (data-driven) approaches to text summarisation are often referred to as ‘knowledge poor’, whereas those using knowledge resources are considered ‘knowledge rich’. The latter could, for example, include the use of an ontology that models medical and clinical concepts as well as their relationships.
1 We are striving towards using as little manual labour as possible.
In their recent review, Mishra et al. [14] indicated that there is a growing interest in knowledge-rich approaches in the biomedical domain, coinciding with the increased availability of comprehensive lexical resources, such as the WordNet ontology [15] and the UMLS compendium [16] (including SNOMED-CT [17], ICD [18] and MeSH [19]). There are several language tools that rely on these resources, such as MetaMap [20], cTAKES [21] and SemRep [22]. Other commonly used resource types include, for example, annotated corpora designed for machine learning (ML) algorithms (see e.g. [6,23]). However, one disadvantage to approaches that rely on manually constructed resources is that they are often not applicable across domains or languages [24,25]. WordNet and UMLS (SNOMED-CT, in particular), for example, are only available in a few languages. The cost of adapting existing resources to new languages, domains or tasks, or constructing new resources, is often high.
The use of distributional semantic methods represents a resource-light approach to capturing terminology in clinical texts [26–31]. These methods rely on the distributional hypothesis [32] for constructing distributional semantic models from word co-occurrence statistics in an unsupervised manner, typically using a very large corpus of unannotated text. The aim is to model similarities, or relatedness, between linguistic items (e.g. words) in a way that reflects their relative semantic meaning. Distributional semantic models representing word-level semantic similarity are often referred to as word space models (WSMs). In a WSM, a word context vector is created for each unique word in the underlying corpus. Further, each context vector represents a point in the ‘word space’ and their internal distances reflect their semantic similarities. Similarities between context vectors are then calculated to quantify the semantic similarity as a numeric value (for example, using the cosine similarity function). Popular techniques and frameworks for constructing WSMs include latent semantic analysis (LSA) [33], random indexing (RI) [34] and Word2vec (W2V) [35]. The domain-specificity of the corpus used for constructing the model has been shown to be important for the usefulness of semantic similarities to the intended task [27]. Distributional semantic models in various forms have been extensively used in text summarisation, e.g. [9,13,36].
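The cosine similarity function mentioned above can be illustrated with a minimal sketch. The three toy vectors are made-up values chosen for illustration only; they are not taken from any of the trained models, and returning 0.0 for an all-zero vector is a convention assumed here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional context vectors (illustrative values only).
pain = [0.9, 0.1, 0.3]
ache = [0.8, 0.2, 0.4]
foot = [0.1, 0.9, 0.2]

sim_pain_ache = cosine_similarity(pain, ache)
sim_pain_foot = cosine_similarity(pain, foot)
```

In this toy space the vector for ‘pain’ lies closer to ‘ache’ than to ‘foot’, which is the kind of relation a WSM is intended to capture.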
To the best of our knowledge, the task of automatically generating textual summaries from clinical notes has been pursued by relatively few researchers, which is also evident in recent reviews and related works, for example [14,24]. We have identified several pieces of work focusing on the task of automatically generating textual summaries from unstructured clinical notes. Liu [37] used the MEAD summarisation toolkit. Van Vleck et al. [2] performed structured interviews to identify and classify phrases that clinicians considered relevant to explaining a patient’s history. Meng et al. [6] used an annotated training corpus together with tailored semantic patterns to determine what information should be repeated in a new clinical note or summary. Velupillai and Kvist [23] focused on recognising diagnostic statements in clinical text, learning from an annotated training corpus, and classifying these based on the level of certainty they have in them. Extracted diagnostic statements are then used to produce a text summary. Others have worked on more conceptual models for understanding and supporting the generation of information summaries in the clinical domain [38,39].
The evaluation of computer-generated summaries is typically performed by comparing the generated summary with a gold standard (or reference summary), which represents the ideal manually constructed summary or summaries. The ROUGE² evaluation package [40] has become a de facto evaluation metric in text summarisation. The required gold standard summaries are costly to create given the manual work required. This is particularly the case in specialised domains where domain experts are required. Lissauer et al. [3] evaluated computer-generated discharge summaries from neonate reports by manually comparing them to dictated summaries, as well as analysing them to see whether they contained the required information according to guidelines. Liu [37] performed automatic evaluation of computer-generated summaries of clinical nursing notes by using the original discharge reports as gold standard summaries. Moen et al. [41] applied both manual and automatic evaluation to the assessment of the reliability of automatic evaluation; the manual evaluation was performed by domain experts and the automatic evaluation was performed by using ROUGE to calculate the similarity between the computer-generated summaries and the original discharge summaries (produced by clinicians).
Fig. 1. Experimental set-up. The figure shows how the experiment was conducted.
1.3. Objectives
The main contributions of this study can be summarised as follows:
• Proposal and implementation of four novel automatic summarisation methods designed for summarising the free text in care episodes;
• Proposal and implementation of a methodology for conducting the manual evaluation of automatically generated care episode summaries;
• Empirical analysis of automatic evaluation measures through comparison with manual evaluations;
• Performance assessment of the four novel automatic summarisation methods along with four baseline methods.
The data used in this study is Finnish clinical text, but since the applied methods are language-independent, the contributions should also be relevant to other languages. The overall set-up is illustrated in Fig. 1.
2. Material and methods
2.1. Data
The data set used in this study consists of EHRs from patients with any type of heart-related problem who were admitted to a single university hospital in Finland between 2005 and 2009. Of these, the clinical notes written by physicians on the various wards that the patients visited were used. Notes written by nurses were not included. Fig. 2 shows an example of a clinical note.
Ethical approval for the research was obtained from the ethics committee of the hospital district (17.2.2009 §67) and permission to conduct the research was obtained from the medical director of the hospital district (2/2009). The total set consisted of 66,884 care episodes, which amounts to 398,040 notes and 64 million words³ in total. This full set, minus 308 care episodes reserved for optimisation and evaluation (see below), was used for constructing the WSMs (see Section 2.2).
The notes are mostly unstructured, consisting of clinical free text in Finnish. Various subheadings do occur in the clinical notes, but these are not standardised, structured, or uniquely recognised in our corpus. Thus, we treat these in the same way as the rest of the text. Some of the sentences are, according to the EHR system, considered to be metadata, such as names of the authors, dates, wards and so on. We treat the free text sentences and the meta-text sentences as two separate text types, so these are not mixed in the sense that they cannot belong to the same ‘sentence topic’ clusters, which are described in Section 2.3.
Each care episode has been manually labelled with ICD-10 codes by clinicians as a part of the original care process. These are normally applied at the end of the patient’s hospital stay, or after they are discharged from hospital. Care episodes commonly have one primary ICD-10 code attached to them, and a number of optional secondary codes. In this study the primary ICD-10 code is used in constructing the RI-ICD WSM, as described in Section 2.2.
Fig. 2. Example of a clinical note. This is a fake case originally created in Finnish by domain experts, then translated into English. Common misspellings are included intentionally.
In the presented experiments, we restrict our evaluation to the care episodes which have the primary ICD code I25 (chronic ischemic heart disease), including subcodes (I25.0, I25.1, etc.). As a further restriction, to justify the use of text summarisation, we consider only care episodes consisting of seven or more clinical notes written by physicians. In order to guarantee that the methods are tested on independent test data that is not used for developing the summarisation models, the 308 care episodes are split into two subsets:
• A summarisation optimisation set, consisting of 152 care episodes, used for optimising parameters related to the summarisation methods.
• A summarisation evaluation set, consisting of 156 care episodes, used for evaluation in the conducted experiments. This is further split into two subsets of 20 and 136 care episodes, the former subset being evaluated in Experiment 1 and the latter in Experiment 2.
This splitting is performed according to the year in which the care episodes were carried out.
2.2. Word space models used for sentence similarity and summarisation
We use a method based on the RI technique, utilising the ICD-10 codes attached to care episodes (RI-ICD) to construct a WSM for the purpose of calculating (semantic) similarity between care episodes. We also use the RI technique in constructing a ‘cross-text’ translation model (RI-Translate), and the W2V method is used to construct a WSM for the purpose of calculating sentence-to-sentence similarities, as well as sentence-to-document⁴ similarities. The cosine similarity metric is used to calculate vector similarities.
Random indexing and RI-ICD
RI [34] is a technique for constructing a (pre-)compressed WSM with a fixed dimensionality, done in an incremental fashion. This is achieved by initiating index vectors for each unique word in the corpus. An index vector is a vector of a fixed dimensionality, containing mainly zeros along with a small number of randomly assigned non-zeros, typically 1 or −1. During training, context vectors for words are constructed by adding index vectors to them. In this way, the dimensionality of the context vectors remains constant. In this work we use a version of RI where context features are based on the ICD-10 code classifications of care episodes. We have called this RI-ICD,⁵ previously introduced in [42].
RI-translate
Another RI-based method used here is one intended for cross-lingual translation purposes, described in [43]. We refer to it as RI-Translate. The method constructs a bilingual WSM that connects words in the source language (SL) to their translated counterparts in the target language (TL). In practice, we operate with two models, one for SL and one for TL, where both belong to the same semantic space in that they are constructed with the same set of index vectors. For training, pre-aligned translation pairs (in this case, aligned sentences) connecting the SL to the TL are used as training instances. The training takes place as follows: for each translation pair (SL–TL), a unique index vector is generated and added to the corresponding context vectors for words in the SL and TL models. This will result in high cosine similarity between words in the SL model and the TL model that have often occurred in the same translation pairs.
4 Documents are in this case the clinical notes.
5 A vector dimensionality of 800 was used, and the number of non-zeros for the index vectors was set to four.
Table 1
Top ten most similar words according to the W2V-based WSM for the query words ‘pain’ and ‘foot’, together with the corresponding cosine similarity scores. The words have been translated from Finnish to English.
Pain (kipu)                          cossim   Foot (jalka)                   cossim
Pain sensation (kiputuntemus)        0.5097   Lower limb (alaraaja)          0.5905
Ache (särky)                         0.4835   Ankle (nilkka)                 0.3731
Pain symptom (kipuoire)              0.4173   Limb (raaja)                   0.3454
Chest pain (rintakipu)               0.4042   Shin (sääri)                   0.3405
Dull pain (jomotus)                  0.4000   Peripheral (periferisia)       0.3112
Back pain (selkäkipu)                0.3953   Callus (känsä)                 0.3059
Pain seizure/attack (kipukohtaus)    0.3904   Top of the foot (jalkapöytä)   0.2909
Pain status (kiputila)               0.3685   Upper limb (yläraaja)          0.2879
Abdominal pain (vatsakipu)           0.3653   Peripheral (perifer)           0.2875
Discomfort (vaiva)                   0.3614   In lower limb (alaraajassa)    0.2707
When querying the system, the context vector(s) corresponding to the query in the SL model is used as the query. This query vector is then matched against the units in the TL model, using cosine similarity, in order to find the most likely translation(s). This method is used for summarisation purposes in the Translate method, as described below.
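The training and querying steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function names, the sparse dictionary representation and the toy word pairs are hypothetical, and the dimensionality (800) and number of non-zeros (four) are borrowed from the values reported for RI-ICD, since the settings for RI-Translate are not given here.

```python
import math
import random
from collections import defaultdict

DIM = 800       # assumed: value reported for RI-ICD, reused here
NON_ZEROS = 4   # assumed: value reported for RI-ICD, reused here

def make_index_vector(rng, dim=DIM, non_zeros=NON_ZEROS):
    """Sparse random index vector: a few +1/-1 entries, zeros elsewhere."""
    positions = rng.sample(range(dim), non_zeros)
    return {p: rng.choice((1.0, -1.0)) for p in positions}

def train(pairs, seed=0):
    """pairs: iterable of (source_words, target_words) training instances."""
    rng = random.Random(seed)
    sl_model = defaultdict(lambda: defaultdict(float))
    tl_model = defaultdict(lambda: defaultdict(float))
    for sl_words, tl_words in pairs:
        index = make_index_vector(rng)  # one unique index vector per pair
        for model, words in ((sl_model, sl_words), (tl_model, tl_words)):
            for word in words:
                for pos, val in index.items():
                    model[word][pos] += val  # add index vector to context vector
    return sl_model, tl_model

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(val * v.get(pos, 0.0) for pos, val in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def translate(word, sl_model, tl_model, top_n=1):
    """Rank target-side words by cosine similarity to the source-side query."""
    query = sl_model.get(word, {})
    ranked = sorted(tl_model, key=lambda t: cosine(query, tl_model[t]), reverse=True)
    return ranked[:top_n]

# Hypothetical toy training pairs (source words, target words).
toy_pairs = [
    (["chest", "pain"], ["angina"]),
    (["fever"], ["infection"]),
    (["chest", "pain", "worse"], ["angina", "unstable"]),
]
sl, tl = train(toy_pairs)
best_for_pain = translate("pain", sl, tl)
best_for_fever = translate("fever", sl, tl)
```

Because words that co-occur in the same training pairs accumulate the same index vectors, their context vectors end up pointing in similar directions across the two models.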
Word2vec
Word2vec [35] is a framework for constructing WSMs using a neural network. In this work we utilise the W2V CBOW architecture.⁶ Table 1 shows an example of how the W2V-based model captures semantic similarity relations. We use this model in the various summarisation methods for computing sentence-to-sentence similarities and sentence-to-document similarities.
Composing sentence and document vectors
Sentence context vectors are composed by normalising and summing (pointwise summation) the constituent word context vectors weighted by their sentence term frequency multiplied by their global inverted sentence frequency (TF*ISF). A similar approach is used for constructing context vectors representing clinical notes, but here weighting is based on term frequency multiplied with global inverted document frequency (TF*IDF) [44], where each clinical note is considered as a document.
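A minimal sketch of the TF*ISF-weighted composition is given below. The exact ISF formula is not stated in the text, so the common logarithmic form log(N / n_w) is assumed, where N is the number of sentences and n_w the number of sentences containing word w. The function names and toy vectors are hypothetical.

```python
import math
from collections import Counter

def unit(vector):
    """Scale a vector to unit length (the normalisation step)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def inverse_sentence_frequency(sentences):
    """Assumed form: isf(w) = log(N / number of sentences containing w)."""
    n = len(sentences)
    counts = Counter(w for s in sentences for w in set(s))
    return {w: math.log(n / c) for w, c in counts.items()}

def sentence_vector(sentence, word_vectors, isf):
    """Pointwise sum of normalised word vectors weighted by TF*ISF."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word, tf in Counter(sentence).items():
        if word not in word_vectors:
            continue  # words without a context vector are skipped
        weight = tf * isf.get(word, 0.0)
        for i, x in enumerate(unit(word_vectors[word])):
            total[i] += weight * x
    return total

# Hypothetical tokenised sentences and 2-dimensional word vectors.
toy_sentences = [["patient", "has", "pain"], ["patient", "stable"]]
toy_vectors = {"patient": [1.0, 0.0], "has": [0.0, 1.0],
               "pain": [0.0, 2.0], "stable": [3.0, 0.0]}
toy_isf = inverse_sentence_frequency(toy_sentences)
first_sentence_vector = sentence_vector(toy_sentences[0], toy_vectors, toy_isf)
```

With this assumed formula, a word that occurs in every sentence (here ‘patient’) receives zero weight. Document vectors for clinical notes would follow the same pattern with inverted document frequency in place of ISF.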
2.3. Summarisation methods
We evaluated eight different summarisation methods. Oracle is an (unrealistic) reference method that has access to the true/original discharge summary when selecting sentences to extract, providing an upper boundary to how well an extractive summarisation method can work for our data. LastSentences and Random are simple reference methods that a successful summarisation method should be able to outperform. Centrality is a standard baseline approach that is commonly used in the field, and the remaining four methods, called RepeatedSentences, Case-Based, Translate and Composite, are proposed methods developed specifically with the clinical domain in mind.
6 For W2V a window size of 5+5 and a dimensionality of 800 was used.
For each care episode, the length of the summary generated by each summarisation method is set to have a fixed size equal to the word count of the accompanying discharge summary (i.e. the ‘gold standard summary’ for the care episode). Sentences are iteratively extracted from the clinical notes until the total word count becomes equal to or just exceeds the word count of the discharge summary. Therefore, generated summaries can have a word count equal to the discharge summary, or exceed this limit by a subset of the words in the last extracted sentence. This way of dynamically selecting the summarisation length is mainly done to enable the calculation of the automatic evaluation scores (F-score), described in Section 2.4.2, which assumes equal length of the target summary (the summary being evaluated) and the gold standard summary.
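The length rule described above can be sketched in a few lines. The function name and the whitespace-based word count are assumptions made for illustration.

```python
def extract_until_length(ranked_sentences, target_word_count):
    """Take sentences in ranked order until the word count reaches or just
    exceeds the target (the word count of the discharge summary)."""
    summary, words = [], 0
    for sentence in ranked_sentences:
        if words >= target_word_count:
            break
        summary.append(sentence)
        words += len(sentence.split())  # assumed: whitespace tokenisation
    return summary

# Hypothetical ranked sentences with 3, 2 and 4 words.
ranked = ["a b c", "d e", "f g h i"]
summary_for_4 = extract_until_length(ranked, 4)
summary_for_6 = extract_until_length(ranked, 6)
```

With a target of four words the second sentence is still added, so the summary overshoots the target by one word, which corresponds to the "just exceeds" case in the text.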
In the summarisation methods RepeatedSentences, Case-Based, Translate and Composite, a type of topic clustering is used to perform redundancy reduction. We found that each sentence is typically informative, self-sustaining in information content, and independent of other sentences within a single note. All sentences are first clustered into unlabelled sentence topics in an unsupervised way using the W2V model. A cosine similarity threshold, optimised on the summarisation optimisation set, is used for determining whether two sentences can be considered similar or not, that is, whether or not they belong to the same topic based on their cosine similarity. The underlying approach is somewhat comparable to how similar paragraphs are detected and merged in McKeown et al. [45] with the aim of reducing sentence redundancy. Since we know when each sentence was written, and if we assume that we are able to cluster sentences that discuss the same topic across clinical notes (e.g. the state of a patient’s pain), we can also assume that the latest written sentence belonging to a topic is the most representative of the latest information concerning that topic. Therefore, we allow the latest written sentence belonging to each topic cluster to be the representative sentence. In an attempt to most effectively model sentence topic clusters, the clustering approach is done as follows: first, we assume that all sentences in the first clinical note of a care episode belong to different topics (see Note 1 in Fig. 3 for an illustration). Then we iterate through the care episode, from the first to the last written note, and assign each sentence to either existing topics (cossim ≥ threshold) or new topics (cossim < threshold) based on their cosine similarities in relation to the similarity threshold. In the utilised similarity comparison, the latest added sentence of a topic always represents that topic. A sentence can only belong to one topic, so if a sentence is similar to two or more topics, i.e. cossim > threshold, the sentence is assigned to the most similar topic.⁷
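The clustering procedure can be sketched as below. Several details are assumptions: a simple word-overlap (Jaccard) measure stands in for the W2V-based cosine similarity, sentences are compared against all topics existing at that moment, and ties are resolved by keeping the first of the tied topics, whereas the paper assigns a random one. All names and example sentences are hypothetical.

```python
def word_overlap(a, b):
    """Stand-in similarity (Jaccard word overlap); the study uses cosine
    similarity between W2V-based sentence vectors instead."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def cluster_sentences(notes, similarity, threshold):
    """notes: notes in chronological order, each a list of sentences.
    Returns topics; each topic is a list of (note_index, sentence)."""
    topics = []
    for note_index, note in enumerate(notes):
        for sentence in note:
            best = None
            if note_index > 0:  # first note: every sentence opens a new topic
                scored = [(similarity(sentence, t[-1][1]), t) for t in topics]
                scored = [(s, t) for s, t in scored if s >= threshold]
                if scored:
                    # topic[-1] is the latest sentence, i.e. the representative
                    best = max(scored, key=lambda st: st[0])[1]
            if best is None:
                topics.append([(note_index, sentence)])
            else:
                best.append((note_index, sentence))
    return topics

# Hypothetical care episode with three short notes.
toy_notes = [
    ["chest pain at rest", "ecg normal"],
    ["chest pain at night", "started aspirin"],
    ["ecg normal today"],
]
toy_topics = cluster_sentences(toy_notes, word_overlap, threshold=0.5)
representatives = [topic[-1][1] for topic in toy_topics]
```

The representative of each topic is simply its last element, which matches the rule that the latest written sentence represents the topic.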
Original discharge summary
The original discharge summary is a text written by a clinician, typically a physician, to summarise a care episode. These summaries are thus written at the end of each care episode, and often contain extracts from the accompanying clinical notes. They also typically contain a certain amount of as-yet undocumented information which focuses on follow-up treatment, and are meant for the receiving ward (if any) or the primary care sector.
7 In the unlikely event that the cosine similarities between a sentence and two or more topics are the same, the sentence is assigned to a random topic among these.
Fig. 3. Summarisation method RepeatedSentences. The example illustrates how summaries are produced by sentence topic clustering and topic scoring. The highest scoring topics from highest to lowest are B, E, A, C, D, G, F and H. In the generated summary displayed here, the topics are sorted by the post-processing step, and the three lowest scoring topics, G, F and H, are excluded.
In this work, the original discharge summary serves as the gold standard summary for its accompanying care episode in the automatic evaluation approach that is used (see Section 2.4.2). In addition, some of the summarisation methods use these in their underlying training (Translate) or in the summarisation phase (Case-Based). Naturally, for a care episode that is to be summarised, the accompanying original discharge summary will not be available to the summarisation system in a realistic scenario.
Summarisation method: Oracle
This is a control method that has access to the original discharge summary during the summarisation process. It optimises the ROUGE-N2 F-scores (see Section 2.4.2) for the generated summary according to the gold summary, using a greedy search strategy. That is, the method extracts sentences one by one from the clinical notes until it reaches the length threshold, always picking sentences that result in the highest possible ROUGE-N2 score. This method is cheating, since it has access to the original discharge summary in the summarisation process. Still, it represents the upper limit for what is achievable in terms of ROUGE-N2 scores for an extraction-based summary.
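The greedy strategy can be sketched as follows. The bigram F-score below is a simplified stand-in written for this illustration and is not the ROUGE package itself; whitespace tokenisation, counting bigrams within sentences only, and all names and example sentences are assumptions.

```python
from collections import Counter

def bigram_counts(sentences):
    """Bigram multiset over a list of sentences (no cross-sentence bigrams)."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def bigram_f_score(candidate, reference):
    """Simplified bigram-overlap F-score between two lists of sentences."""
    cand, ref = bigram_counts(candidate), bigram_counts(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def oracle_summary(sentences, reference):
    """Greedily add the sentence that maximises the F-score of the summary so
    far, until the word count reaches that of the reference summary."""
    target = sum(len(s.split()) for s in reference)
    remaining, chosen, words = list(sentences), [], 0
    while remaining and words < target:
        best = max(remaining, key=lambda s: bigram_f_score(chosen + [s], reference))
        chosen.append(best)
        remaining.remove(best)
        words += len(best.split())
    return chosen

# Hypothetical clinical-note sentences and gold standard summary.
toy_note_sentences = ["patient admitted with chest pain",
                      "weather was sunny",
                      "discharged on aspirin therapy"]
toy_gold = ["patient admitted with chest pain", "discharged on aspirin"]
toy_oracle = oracle_summary(toy_note_sentences, toy_gold)
```

Greedy selection does not guarantee the globally optimal subset of sentences, so a sketch like this gives a greedy estimate of the upper limit.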
Summarisation method: LastSentences
The latest written clinical note at any point should supposedly represent the current state of the patient. One can therefore intuitively assume that the most recently written information is important in a (discharge) summary. In this method, the summary is simply constructed from the N last written sentences during the care episode, where N is the number of sentences needed to reach the length threshold. Intuitively, this represents a strong baseline.
Summarisation method: Random
This baseline method constructs summaries by randomly selecting sentences from the care episode until the length threshold is reached. It provides a lower boundary to the performance, which any meaningful summarisation approach should aim to significantly outperform.
Summarisation method: RepeatedSentences
Meng et al. [6] argue that information being repeated across clinical notes is an indicator of its relevance with respect to inclusion in subsequent notes in the sequence. The use of time features is also explored by Lim et al. [46] in the task of multi-document summarisation of news article documents. The underlying hypothesis for the RepeatedSentences method is that information that is repeated in multiple clinical notes throughout a care episode, with the emphasis on when it was written, is the most important information to include in a summary. Features from the initial sentence topic clustering step are used for scoring. A topic is assigned a score based on the sum of the order of when, in the care episode, its underlying sentences were written. For example, if a topic contains sentences from clinical note numbers 3, 5 and 6 (numbered relative to the dates they were written), the topic score becomes 14. The N highest scoring sentence topics (i.e. their representative sentences) are included in the final summary. The RepeatedSentences summarisation method is illustrated in Fig. 3.
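The topic scoring can be illustrated with the worked example from the text (note numbers 3, 5 and 6). The data structure and function names are hypothetical.

```python
def topic_score(note_numbers):
    """Sum of the (chronological) note numbers of a topic's sentences."""
    return sum(note_numbers)

def top_topics(topics, n):
    """topics: dict mapping a topic label to a list of (note_number, sentence)
    in the order written. Returns the n highest scoring topics together with
    their representative (latest) sentences."""
    ranked = sorted(
        topics,
        key=lambda label: topic_score([num for num, _ in topics[label]]),
        reverse=True,
    )
    return [(label, topics[label][-1][1]) for label in ranked[:n]]

# Hypothetical topics; topic "A" mirrors the example in the text.
toy_topic_map = {
    "A": [(3, "a1"), (5, "a2"), (6, "a3")],
    "B": [(6, "b1")],
    "C": [(1, "c1"), (2, "c2")],
}
score_a = topic_score([num for num, _ in toy_topic_map["A"]])
best_two = top_topics(toy_topic_map, 2)
```

The score rewards both repetition (more sentences contribute more terms) and recency (later notes contribute larger numbers), so a single sentence from note 6 outranks two sentences from notes 1 and 2.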
Summarisation method: Case-Based
Case-Based, or ‘case-based summarisation’, is here an analogy to case-based reasoning (CBR) [47] which performs a type of textual case-based reasoning (TCBR) [48]. CBR involves retrieving existing or older ‘cases’ with similar content as the target problem, and then reusing the solution of the retrieved case (or cases) to solve the target problem. In a similar manner, this principle is applied here in text summarisation. The underlying hypothesis is that patients with (the most) similar care episodes (according to the documented text in their clinical notes) have similar content in their discharge summaries. The sentences from these discharge summaries are then treated as the central ‘topics’ for what to include in the summary. This is in line with evidence-based practice (EBP) in that relevant care episodes are identified and the information found there is relied upon as ‘evidence’ for what should be included in the summary.
Given a target care episode that is to be summarised, we first retrieve the top five most similar care episodes using information retrieval on care episode level (i.e. ‘care episode retrieval’). For this the RI-ICD method is used (explained in [42], Section 4.1). Then we reuse these by iterating through each sentence in their discharge summaries. The representative/last sentences from each sentence topic in the target care episode (as described earlier) are then scored by their cosine similarity to each of these using the W2V model. Out of these, the N highest scoring sentences are included in the generated summary. Fig. 4 illustrates this using a modification of the ‘CBR cycle’ from [47].⁸
Summarisation method: Translate
Here we use the RI-Translate method, as explained in Section 2.2, for the purpose of text summarisation. Instead of translation between languages, it is used for ‘cross-text-type translation’, that is, translating from the text in clinical notes (care episodes without discharge summaries) to the text found in the discharge summaries, while limiting the translation candidates (i.e. sentences) to also come from the sentence topics in the clinical notes. The aim is thus to construct a type of translation system that can map sentences in clinical notes to the most probable sentences to be found in an accompanying discharge summary, based on translation statistics learnt from a large clinical corpus.
First, a translation, or cross-text-type WSM is constructed using the RI-Translate method. Here the source language (SL) consists of the text in the clinical notes, while the discharge summaries constitute the text target language (TL). Training instances are rather coarse, as each care episode represents a single training instance. More precisely, for each care episode, the context vectors of the words in the underlying clinical notes (SL) and those in its accompanying discharge summary (TL) have a unique index vector added to them.
8 Fig. 4 also contains the steps revise and retain, but these are outside the scope of this work.
Fig. 4. The Case-Based summarisation method illustrated as a ‘CBR cycle’, based on the CBR cycle introduced by Aamodt and Plaza [47]. The left side of the dashed line is not utilised in this work, but illustrates how the full CBR cycle can be used in a hospital setting.
When summarising a care episode, each sentence (in the corresponding clinical notes, pre-clustered into sentence topics) has two sentence vectors constructed, one using word context vectors from the SL model, and the other using the TL model. Then, each sentence vector built with the SL model is iteratively used to query the system. Sentences represented by the TL model are then ranked by their overall max cosine similarity scores to these queries, and the top N sentences are included in the final summary.
Summarisation method: Composite
In this composite method, the sentence-scoring features from the methods RepeatedSentences, Case-Based and Translate are combined. We found that the best automatic evaluation scores (F-scores) from the summarisation optimisation set were achieved when the scores by Case-Based and Translate were kept as their initial cosine scores, while for RepeatedSentences, the sentence scores were first normalised by dividing on the max scoring sentence. This normalisation converts the scores to be within the same range as the cosine-based scores, ranging from 0 to 1. These three feature scores are then simply totalled to create the final feature score for each sentence. Finally the top N sentences are selected for the final summary.
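The combination step can be sketched as follows. The dictionaries of per-sentence scores and their values are hypothetical and serve only to show the normalisation and summation.

```python
def composite_scores(repeated, case_based, translate):
    """Sum three per-sentence feature scores. RepeatedSentences scores are
    first divided by the maximum score; the two cosine-based scores are
    used as they are."""
    max_repeated = max(repeated.values())
    scores = {}
    for sentence in repeated:
        normalised = repeated[sentence] / max_repeated if max_repeated else 0.0
        scores[sentence] = normalised + case_based[sentence] + translate[sentence]
    return scores

# Hypothetical feature scores for two candidate sentences.
toy_repeated = {"s1": 14, "s2": 7}
toy_case_based = {"s1": 0.2, "s2": 0.9}
toy_translate = {"s1": 0.5, "s2": 0.5}
toy_composite = composite_scores(toy_repeated, toy_case_based, toy_translate)
toy_ranking = sorted(toy_composite, key=toy_composite.get, reverse=True)
```

In this toy case s2 is ranked first even though s1 has the higher RepeatedSentences score, because the three features contribute on the same 0 to 1 scale after normalisation.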
Summarisation method: Centrality
The centrality (or centroid) principle is the most commonly relied on summarisation technique for many generic text types and domains. It is based on ranking sentences by how representative they are of the central information of the text that is to be summarised. In existing work, a range of methods have been used to compute sentence centrality in extraction-based summarisation. The PageRank method [49] has been extensively used for this purpose as a graph-based approach. We decided to base our implementation on the method presented in [13], which relies on a graph representation together with a WSM (RI). To construct the WSM, we used W2V instead of RI because preliminary testing indicated that this model performed better. Here, weighted PageRank for text is used, referred to as ‘TextRank’ [50]. Edges between nodes, i.e. between sentences, are weighted according to the pre-calculated sentence similarity using W2V. Each sentence also has an initial score similar to the cosine similarity between the sentence and the corresponding clinical note, represented as sentence vectors and document vectors, respectively. In addition, to adapt this approach to multiple documents, i.e. multiple clinical notes, we have extended this method with a sentence centrality ranking that works on multiple notes, in a similar way to how it is done in [51]. This is done by multiplying edge weights by one of two preset constants. Constant δ is multiplied with intra-note edge weights, i.e. edges going between sentences within the same clinical note; and inter-note edges are multiplied with the second constant.⁹
2.3.1. Post-processing
Simple post-processing is applied to each summary for the purposeofrearrangingthesentenceorder.Sentencesaresorted accordingtothedatetheywerewritten(i.e.usingthedateofthe clinicalnotetheybelongto).Internalrankingbetweensentences
9IntheexperimentweusedaPageRank˛valueof0.90,ıwas0.3,whilewas 1.0.
fromthesamedateiscarriedoutaccordingtointernalsentence order.Iftwosentencesfromtwodifferentnoteshavethesame datestamp,rankingisperformedaccordingtotheirchronological noteIDs.Meta-sentences(describedinSection2.1)areplacedfirst andrearrangedinternally.
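The ordering rules of the post-processing step can be sketched as a single sort key. The `Sentence` record and its field names are hypothetical, introduced only for this example; it is also an assumption that note IDs increase chronologically and that meta-sentences are ordered among themselves with the same key, since the internal rearrangement is not specified further.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Sentence:
    text: str
    note_date: date      # date of the clinical note the sentence belongs to
    note_id: int         # assumed to increase chronologically
    position: int        # index of the sentence within its note
    is_meta: bool = False


def order_summary(selected: List[Sentence]) -> List[Sentence]:
    """Order selected sentences: meta-sentences first, then by date,
    note ID and position within the note."""
    def key(s: Sentence):
        return (s.note_date, s.note_id, s.position)

    meta = sorted((s for s in selected if s.is_meta), key=key)
    body = sorted((s for s in selected if not s.is_meta), key=key)
    return meta + body
```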
2.4. Experiment and evaluation
The following two experiments were conducted:
• Experiment 1: The first experiment focuses on determining the reliability of the automatic evaluation. This is done by comparing how the manual and automatic evaluations (four ROUGE measures) correlate in terms of the relative rankings of the eight summarisation methods. Here, 20 care episodes (a subset of the 156 care episodes in the summarisation evaluation set) are evaluated both manually and automatically. Spearman’s rank correlation coefficient (Spearman’s rho) [52] is calculated between the average manual evaluation scores and the average scores for each of the automatic evaluation metrics for each summarisation method.
• Experiment 2: In the second experiment, the summarisation methods are tested on a larger evaluation set of 136 care episodes (the 156 care episodes in the summarisation evaluation set minus the 20 used in Experiment 1). The evaluation is performed in an automated manner using four ROUGE measures. The aim is primarily to determine which summarisation method produces the best summaries. In order to test whether the differences between the scores achieved by the different summarisation methods were statistically significant, we performed the Wilcoxon signed-rank test [53] based on the scores from each ROUGE measure, for each summarisation method pair.
In both experiments, we use the same eight summarisation methods described above to construct summaries for each care episode. The utilised WSMs are first constructed using the full corpus described in Section 2.1, minus the optimisation and evaluation sets mentioned.
A preliminary version of our evaluation set-up has been described in [41].¹⁰ Our comparison of manual and automatic evaluation is similar to the analysis conducted by Chin-Yew Lin in [40] on English newswire data when introducing the ROUGE measures.
One goal is to see if our automatic evaluation set-up is reliable, given the uncertainties related to using the original discharge summaries as gold standard summaries. This is done by independently analysing whether or not there is a correlation between how human evaluators rank the performance of the summarisation methods and how automatic evaluation metrics rank these same summarisation methods. Furthermore, we aim to reliably establish which of the tested summarisation methods (and underlying features) perform best.
2.4.1. Manual evaluation
The manual evaluation is conducted by three domain experts in the clinical field: two physicians and one nurse, all professionals in specialised care and each with over five years’ experience of working with patients suffering from heart-related health problems.
¹⁰ The F-scores from the automatic evaluation are on average noticeably lower in this study than those reported in [41]. This is primarily because here we excluded a specific type of note from all care episodes, a type of summary for the patients, which is often written at the same time as the final discharge summary and whose contents tend to be very similar, sometimes identical. In addition, some of the methods used in this experiment are new or different.
Table 2
Scheme used for the manual evaluation.
| Evaluation criteria | Rating |
| --- | --- |
| Sender | yes = 1, no = 0 |
| Reason for admission | yes = 1, no = 0 |
| Long-term diagnosis | yes = 1, no = 0 |
| Procedures (e.g. operation, angioplasty) | yes = 1, no = 0 |
| Tests (e.g. lab-test, X-ray, EKG) | yes = 1, no = 0 |
| Medication | yes = 1, no = 0 |
| Health status at discharge | yes = 1, no = 0 |
| Plans for the future | yes = 1, no = 0 |
| Readability: how good is the flow of the text? | 0.0–1.0, 0.0 = bad to 1.0 = excellent |
| Readability: how good is the content of the summary? | 0.0–1.0, 0.0 = bad to 1.0 = excellent |
A pre-study focusing on the same type of manual evaluation was conducted in [41]. In this pre-study, a 30-item evaluation scheme (or tool) for manual evaluation was developed based on the hospital districts’ guidelines for writing discharge summaries. It used a 4-point scale ranging from −1 to 2, where −1 = not relevant, 0 = not included, 1 = partially included and 2 = fully included. However, using this scheme turned out to be difficult and extremely time-consuming. One reason for this is that quite a few of the items were somewhat overlapping and very fine-grained, like ‘conclusions’, ‘assessment of the future’ and ‘status of the disease at the end of the treatment period’. Other items were rarely documented by clinicians (physicians) in the clinical notes written during an ongoing care episode, such as ‘status of the disease at the end of the treatment period’, ‘ability to work’ and ‘continued care plan’. In addition, a couple of the items were redundant as they concern what we refer to as structured information in the EHR system, such as ‘care place’ and ‘care period’, and there is little value in trying to extract this from the text. Therefore, this manual evaluation scheme was further developed into a simplified version with only ten criteria items.¹¹ Eight of the ten criteria were rated dichotomously as ‘yes’ or ‘no’. These criteria items concern the contents of the discharge summary, where ‘yes’ means that the summary includes content related to the criterion. Moving from a 4-point scale to a 2-point scale was done to simplify the evaluation further. The two remaining criteria concern the readability of the summary and were rated on a scale of 0.0–1.0, where 0.0 was poor and 1.0 excellent. The scheme used in the manual evaluation is shown in Table 2. Information about what type of note each sentence belongs to, and when it was written, was presented as metadata for the manual evaluators.
Each evaluator evaluated the same 20 care episodes, with eight summaries per care episode. The inter-rater agreement between the three evaluators was calculated with the intraclass correlation coefficient (ICC) [54] with a two-way mixed model using IBM SPSS Statistics version 22. Based on the existing literature, we found no fixed limit regarding the interpretation of ICC values; one suggestion is that values below 0.4 are poor, values from 0.4 to 0.59 are fair, values from 0.6 to 0.74 are good, and values from 0.75 to 1.0 are excellent [55]. The inter-rater agreement between the evaluators in this study was good (ICC = 0.744, 95% CI 0.722–0.766, p < 0.001).
Given the quite concrete evaluation criteria in Table 2, one could intuitively assume that the best summarisation approach would be to focus on extracting those exact ten criteria items. As a result, we experimented with one summarisation method that aimed to do just that. However, this performed poorly in both manual and automatic evaluation. The main reason for this is that we do not have any good way of mapping the criteria descriptions to the content in the clinical notes. For example, there is no straightforward way of mapping ‘long-term diagnosis’ to a sentence not explicitly containing these exact or similar words. A sentence mentioning long-term diagnosis could be: ‘the patient has been suffering from high blood pressure for the last four years.’
¹¹ A pilot test was conducted in the process of developing the manual evaluation scheme.
Fig. 5. Graph illustrating the trend for how the automatic evaluation metrics correlate with the manual evaluation of summaries from 20 care episodes. All evaluation entries have been normalised by dividing the scores by their max scores. The summarisation methods are arranged according to the manual evaluation scores, and the lines visualise how ROUGE measures follow the trend of the manual evaluations.
2.4.2. Automatic evaluation
Automated evaluation of summaries generated from a care episode is performed by using the accompanying original discharge summary as a gold standard. This exploratory approach circumvents the need for manually constructing such a gold standard.
The ROUGE evaluation toolkit [40] contains multiple n-gram-based evaluation metrics that are commonly used for automatic summarisation scoring, such as in the document understanding conferences (DUC) and the text analysis conferences (TAC) [56]. ROUGE basically works by calculating the n-gram overlap between a target summary (the summary that is to be evaluated), and one or more gold standard summaries. The outputs from these metrics are precision, recall and F-score, reflecting the overlap between the target and gold standard summaries. The average F-scores are what we report here. As there are several metrics to choose from, we use the following¹²:
• ROUGE-N1: unigram co-occurrence statistics.
• ROUGE-N2: bigram co-occurrence statistics.
• ROUGE-L: longest common sub-sequence co-occurrence statistics.
• ROUGE-SU4: skip-bigram and unigram co-occurrence statistics.
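To make the n-gram overlap concrete, the following is a simplified sketch of a ROUGE-N style computation against a single gold standard summary. It is not the ROUGE toolkit itself: tokenisation by whitespace, the single-reference setting, the balanced F-score and the function names are all assumptions made for illustration, and the toolkit offers options (stemming, stop-word removal, multiple references) that are omitted here.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: List[str], reference: List[str], n: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) of clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


candidate = "the patient has high blood pressure".split()
gold = "patient has had high blood pressure for years".split()
print(rouge_n(candidate, gold, 1))  # unigram overlap (ROUGE-N1 style)
print(rouge_n(candidate, gold, 2))  # bigram overlap (ROUGE-N2 style)
```

In this toy example five of the six candidate unigrams occur in the gold standard, while only three of the five candidate bigrams do, which illustrates why bigram-based scores are typically lower than unigram-based ones.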
3. Results
3.1. Results for Experiment 1
To visualise how the evaluations correlate, we have plotted the scores from the manual and automatic evaluations in a graph, shown in Fig. 5.
¹² We found the listed ROUGE metrics to be the most commonly used metrics in the literature.
Table 3
Spearman’s rho results, indicating how the automatic evaluation metrics correlate with the manual evaluation scores over 20 care episodes.
| Evaluation metric | Spearman’s rho (p-value) |
| --- | --- |
| ROUGE-N1 | 0.9048 (0.00201) |
| ROUGE-N2 | 0.9524 (0.00026) |
| ROUGE-L | 0.9524 (0.00026) |
| ROUGE-SU4 | 0.9048 (0.00201) |
The correlations between manual and automatic evaluations were calculated using Spearman’s rho. The results are shown in Table 3. Based on the statistical analysis and p-values in Table 3, the four ROUGE measures have a high correlation with the manual evaluations.
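For readers who want to relate the reported coefficients to rank differences, the sketch below computes Spearman’s rho from two rankings using the textbook formula for untied ranks. The example rankings are hypothetical and are not the rankings obtained in the experiment; they only show that, for eight methods and assuming no ties, a rho of about 0.9048 corresponds to a sum of squared rank differences of 8 and a rho of about 0.9524 to a sum of 4.

```python
from typing import Sequence


def spearman_rho(ranks_a: Sequence[int], ranks_b: Sequence[int]) -> float:
    """Spearman's rho for two rankings without ties:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("rankings must have the same length")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("at least two items are required")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


manual = [1, 2, 3, 4, 5, 6, 7, 8]            # hypothetical manual ranking
four_swaps = [2, 1, 4, 3, 6, 5, 8, 7]        # sum of d^2 = 8
two_swaps = [2, 1, 4, 3, 5, 6, 7, 8]         # sum of d^2 = 4
print(round(spearman_rho(manual, four_swaps), 4))  # 0.9048
print(round(spearman_rho(manual, two_swaps), 4))   # 0.9524
```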
3.2. Results for Experiment 2
The results from the automatic evaluation of 136 care episodes are shown in Table 4. The r columns show the internal ranking of each summarisation method for each evaluation measure. A more illustrative representation is shown in Fig. 6.
We calculated significance levels using the Wilcoxon signed-rank test, with p < 0.05 (with Bonferroni correction for multiple hypothesis testing). Based on the p-values the methods could be divided into three groups. First, Oracle significantly outperformed all the other methods against all of the ROUGE measures (highest p-value: 2.12·10⁻²², ROUGE-N1, Oracle vs. Translate).
Second, Composite, Case-Based and Translate significantly outperformed the methods in the third group (Repeated Sentences, Last Sentences, Centrality and Random) against all ROUGE measures (highest p-value: 3.74·10⁻⁴, ROUGE-N2, Translate vs. Last Sentences). In this third group, every method failed to differ significantly from the Random method on at least one ROUGE measure. The p-values for all comparisons are included in the supplementary materials.
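The effect of the Bonferroni correction can be illustrated with a short sketch. The number of tests the correction was applied over is not stated in the text; the example below assumes one test per pair of the eight methods for a single ROUGE measure, i.e. 28 tests, purely for illustration. The first two p-values are the ones quoted above, and the third is a made-up value included to show a comparison that passes the uncorrected threshold but fails the corrected one.

```python
from math import comb
from typing import Dict, Optional


def bonferroni_significant(p_values: Dict[str, float],
                           alpha: float = 0.05,
                           n_tests: Optional[int] = None) -> Dict[str, bool]:
    """Flag each comparison as significant if p < alpha / n_tests."""
    m = n_tests if n_tests is not None else len(p_values)
    threshold = alpha / m
    return {name: p < threshold for name, p in p_values.items()}


n_pairs = comb(8, 2)  # 28 method pairs (assumed number of tests)
p_values = {
    "Oracle vs. Translate (ROUGE-N1)": 2.12e-22,
    "Translate vs. Last Sentences (ROUGE-N2)": 3.74e-4,
    "hypothetical pair": 0.01,  # invented value, for illustration only
}
print(bonferroni_significant(p_values, alpha=0.05, n_tests=n_pairs))
```

Under this assumption the corrected threshold is 0.05 / 28, roughly 0.0018, so the two reported p-values remain significant while the invented value of 0.01 does not.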
Based on the analysis we can divide the methods (not counting the Oracle method) into two groups: Composite, Case-Based and Translate are successful at producing summaries that outperform the simple baseline methods in all comparisons, whereas the Centrality and Repeated Sentences methods fall in the same group with the simple baseline approaches.
Table 4
F-scores from the automatic evaluation of 136 care episodes. The order of the summarisation methods is the same as in Fig. 5.
| Summarisation method | ROUGE-N1 | r | ROUGE-N2 | r | ROUGE-L | r | ROUGE-SU4 | r |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Composite | 0.3820 | 2 | 0.1849 | 2 | 0.3678 | 2 | 0.1865 | 2 |
| Oracle | 0.4819 | 1 | 0.2865 | 1 | 0.4683 | 1 | 0.2694 | 1 |
| Case-Based | 0.3634 | 4 | 0.1741 | 3 | 0.3497 | 4 | 0.1764 | 3 |
| Translate | 0.3703 | 3 | 0.1661 | 4 | 0.3551 | 3 | 0.1720 | 4 |
| Random | 0.3043 | 7 | 0.1177 | 7 | 0.2949 | 7 | 0.1241 | 7 |
| Repeated Sentences | 0.3301 | 5 | 0.1408 | 5 | 0.3196 | 5 | 0.1463 | 5 |
| Last Sentences | 0.3287 | 6 | 0.1398 | 6 | 0.3184 | 6 | 0.1462 | 6 |
| Centrality | 0.2862 | 8 | 0.1027 | 8 | 0.2743 | 8 | 0.1151 | 8 |
Fig. 6. Graph illustrating how the various summarisation methods perform against a set of 136 care episodes. All evaluation entries have been normalised by dividing scores by their max scores. The order of the summarisation methods is the same as in Fig. 5; the lines highlight the similar trends for the ROUGE measures over all the summarisation methods.
Without the Bonferroni correction, the significantly differing groups would be as follows:
1. Oracle
2. Composite
3. Case-Based, Translate
4. Repeated Sentences, Last Sentences
5. Centrality, Random

4. Discussion
In this work we consider a variety of resource-lean and language-independent summarisation methods for clinical text. These methods circumvent the need for tailored language resources and tools. The proposed summarisation methods utilise WSMs constructed from word co-occurrence statistics in a large corpus of clinical text (see Section 2.1). This enables us to capture various semantic similarity relations in the clinical text in an automatic, data-driven way. The aim is not to construct perfect summaries that can fully replace individual clinical notes or completely automate the process of producing discharge summaries, for example. Rather, this work is a step towards exploring ways of automatically constructing indicative clinical text summaries by relying on purely statistical features for determining a sentence’s significance.
We introduce a scheme that domain experts can use to manually compare the relative quality of different automatically produced summaries (and the underlying summarisation methods). The proposed scheme consists of a 10-item questionnaire measuring the expert’s opinion of the readability of the summary, and whether it has relevant content. The scheme has been developed based on experiences from our preliminary study on evaluating clinical summarisation methods [41], resulting in a more streamlined tool that is easier to use consistently.
However, such manual evaluation requires human input and is thus impractical to use during summarisation method development, where rapid feedback is required when testing different method variations. Therefore, we also use the ROUGE toolkit for performing automated evaluation. We also seek to establish whether the automated ROUGE-based evaluation can be used in place of human evaluation in the context of clinical summarisation. This meta-evaluation is performed through rank correlation coefficient analysis between the manual and automated evaluation. Finally, we aim to establish which summarisation method performs best in the task of clinical summarisation.
The results from Experiment 1 show that there is a correlation between how the manual and automatic evaluations rank the different summarisation methods. This indicates that using an automated ROUGE-based evaluation set-up is feasible. Further, it shows that the automatic evaluation scores, with the applied evaluation set-up, are reliable for determining which summarisation method performs best. The observation that the manual evaluators preferred the Composite method to the Oracle method indicates that the greedy search strategy, based on the original discharge summary, does not necessarily produce the best possible extraction-based (discharge) summary.
The results from Experiment 2 show that the methods Composite, Case-Based and Translate all work better than the basic baseline methods (p < 0.05) (not taking into consideration the Oracle method), whereas the Centrality baseline fails to outperform even the Random baseline with this data. Composite, which consists of combined features from Repeated Sentences, Case-Based and Translate, consistently has the highest ROUGE performances. However, the difference between Composite and the next best methods is not statistically significant against all ROUGE measures following the Bonferroni correction.
When producing the summaries, the Composite method combines the following basic principles:
• The importance of a sentence depends on how many times the same or similar information has been mentioned throughout a care episode (Repeated Sentences).
• By looking at discharge summaries of other similar care episodes, one can assess the importance of a sentence based on whether or not the same or similar information has been written in these summaries (Case-Based).
• If, using a WSM-based translation system, a sentence (its vector representation) can be ‘translated’ into a vector representation that is similar to how this same sentence would look in the translated word space, it should be considered for inclusion in the final summary (Translate).
• Clustering sentences into topics that span across clinical notes in a care episode allows for the removal of redundancy.
Centrality is evaluated as being one of the lowest-scoring summarisation methods. Given its broad usage in text summarisation for other domains, this deserves a closer analysis. We asked the evaluators to comment on the structure and content of the summaries that this method produced using open-ended questions. The three questions were:
1. What important information is missing from the summary?
2. What information in the summary is unnecessary?
3. How logical is the structure of the summary?
The following sums up what they wrote based on the analysis of five summaries:
• Disorganised structure of text, confusing, illogical order or structure.
• The end is missing.
• Cannot get an overall view of patients’ care episode.
• Important information is missing.
• Information is diffuse and fragmented.
• Sentences are not connected.
• Too many details about unimportant stuff.
This seems to indicate that the most ‘central’ information, independent of when it was written, is not a good indicator of the information that clinicians want to have in the discharge summary. This method did not include sentence topic clustering, which was used in several of the other methods. This further supports the importance of such topic clustering despite the relatively poor performance of Repeated Sentences. In future work, other variations and implementations of centrality-based methods should be evaluated, e.g. through the use of the MEAD system [57], similarly to how it is done in [37].
Last Sentences performs relatively poorly in comparison to many of the other summarisation methods. This is an interesting observation in that it suggests that reading only the latest written information or note(s) is suboptimal when the task is to write a discharge summary. It also suggests that there are reasons to believe that it is beneficial for clinicians to use text summarisation systems in their work, e.g. to assist in highlighting relevant information documented earlier in a care episode.
Even with our rather coarse-grained manual evaluation, when applied to a limited number of care episodes, a high correlation is seen with the automatic evaluation. Hence, this automatic evaluation approach can be used to rank the different summarisation methods in order of effectiveness. And since such manual evaluation is not affordable every time a summarisation method has been modified, or when a new method is developed, it should be possible to resort to this automatic evaluation during the method development process.
This study raises questions about the usability, reliability and usefulness of such (imperfect) automatic summarisation systems, particularly when used at the point of care. This is difficult to assess based on the utilised evaluation approach and scores achieved here. The question is: what does it actually mean to have a system that is able to generate textual summaries containing parts of or all (i.e. perfect evaluation score) the content one would expect to find in a manually-created discharge summary? One answer is that it would likely provide a good starting point for a clinician who is about to write the actual discharge summary. It is also likely that the same automatically generated summary would provide an indicative overview of the information having been documented during the corresponding care episode, from a clinician’s perspective. However, patient safety issues must be considered before this kind of system is taken into practice. On the one hand, it is important that the most relevant information needed for safe care provision is assured in automatically generated summaries. On the other hand, as long as clinicians treat the generated summaries as an indicative summary, this could be a helpful feature in EHR systems, particularly in situations where time is of the essence. Future research including more user-centred evaluation is required to answer this question in more detail.
A weakness of this study is the validity of the evaluation. The utilised manual evaluation scheme is quite coarse-grained in that it contains ten criteria items, and the rating is done on the level of ‘yes’ or ‘no’. However, in the previously mentioned pre-study [41], a 30-item evaluation scheme was tested, using a four-point scale, but was found to be too detailed and time-consuming to use.
The automatic evaluation is performed using the original discharge summary as a gold standard summary, despite the fact that these discharge summaries are not themselves produced in a purely extractive way. This is reflected in the fact that the ROUGE scores are arguably quite low compared to scores reported in various other studies on text summarisation (see e.g. [40]). The scores achieved by Oracle indicate the maximum ROUGE-N2 scores achievable with an extraction-based summarisation system for our data. However, it is encouraging to see that there is a correlation in terms of relative goodness between manual and automatic evaluation, both here and in the pre-study [41], which is promising for future work in this direction.
An alternative evaluation approach would be to manually develop gold standard summaries in a purely extractive way for a set of care episodes, replacing the original discharge summaries as gold standard summaries in the automatic evaluation. This approach was not pursued here, as it is more resource-intensive, but it would possibly give us more reliable results. Another approach would be to use the summarisation system in a (simulated) clinical setting with clinicians as users. Such an evaluation approach is referred to as extrinsic evaluation, and could potentially shed light on the impact on documentation speed and quality, as well as on health care quality and patient outcomes. This type of evaluation could also potentially provide directions for future work on improving the summarisation system.
Currently it is difficult to assess the usefulness and potential impact that this type of summarisation system could have in a real clinical setting. On the one hand, it could be a convenient tool for clinicians in terms of providing an indicative textual overview of ongoing care episodes, for example. On the other hand, the possible imperfection of the information presented in the generated summaries must be considered in relation to potential patient safety issues. Future work should focus more on extrinsic evaluation by evaluating how the use of automatic text summarisation systems in a clinical setting will impact on documentation speed and quality, as well as health care quality and patient outcomes. Here we believe that a more user-guided summarisation system is needed, enabling real-time incremental summary generation, similar to the methods proposed in [58]. This would mean that the computer-generated summary, or the sentences that it suggests for inclusion in the final summary, are calculated based on analysing what content the user has already written in (or imported into) the summary.
5. Conclusion
This work on the automated summarisation of free text in care episodes introduces and evaluates both a framework for evaluating summarisation methods and novel methods for performing the summarisation. Most of the presented summarisation methods rely on statistical information derived from a large corpus of clinical text; this includes various WSMs. The best performing summarisation methods, according to the applied evaluation, are Composite, Case-Based and Translate. The ROUGE-based evaluation measures are shown to correlate highly with the manual evaluation in terms of relative ranking. Based on these results, we believe that the explored sentence features, especially those in the Composite method, provide useful directions on how to approach this summarisation task in a resource-lean fashion. Further studies are needed to assess the applicability of such methods in real-world clinical settings.
Conflict of interest
The authors declare that they have no conflicts of interest.
Acknowledgements
This study was partly supported by the Research Council of Norway through the EviCare project (project no. 193022), Turku University Hospital (EVO 2014), and the Academy of Finland (project no. 140323). The study is part of the research projects of the IKITIK consortium (http://www.ikitik.fi). We would like to thank the manual evaluators for their contributions. We would also like to thank Filip Ginter for assisting us in the work on Word2vec. Finally, we would like to thank the reviewers for their insightful comments. This paper has been proofread by Lingsoft Language Services Oy, and this was financed by the Department of Computer and Information Science, Norwegian University of Science and Technology.
Appendix A. Supplementary data
Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.artmed.2016.01.003.
References
[1] Hall A, Walton G. Information overload within the health care system: a literature review. Health Inf Libr J 2004;21(2):102–8, http://dx.doi.org/10.1111/j.1471-1842.2004.00506.x.
[2] Van Vleck TT, Stein DM, Stetson PD, Johnson SB. Assessing data relevance for automated generation of a clinical summary. In: Teich JM, Suermondt J, Hripcsak G, editors. AMIA annual symposium proceedings. 2007. p. 761–5.
[3] Lissauer T, Paterson C, Simons A, Beard R. Evaluation of computer generated neonatal discharge summaries. Arch Dis Child 1991;66(4 Spec No.):433–6.
[4] Kripalani S, LeFevre F, Phillips CO, Williams MV, Basaviah P, Baker DW. Deficits in communication and information transfer between hospital-based and primary care physicians: implications for patient safety and continuity of care. J Am Med Assoc 2007;297(8):831–41.
[5] Sørby ID, Nytrø Ø. Does the electronic patient record support the discharge process? A study on physicians’ use of clinical information systems during discharge of patients with coronary heart disease. Health Inf Manag J 2005;34(4):112–9.
[6] Meng F, Taira RK, Bui AA, Kangarloo H, Churchill BM. Automatic generation of repeated patient information for tailoring clinical notes. Int J Med Inform 2005;74(7–8):663–73.
[7] Wrenn JO, Stein DM, Bakken S, Stetson PD. Quantifying clinical narrative redundancy in an electronic health record. J Am Med Inform Assoc 2010;17(1):49–53, http://dx.doi.org/10.1197/jamia.M3390.
[8] Afantenos S, Karkaletsis V, Stamatopoulos P. Summarization from medical documents: a survey. Artif Intell Med 2005;33(2):157–77.
[9] Nenkova A, McKeown K. Automatic Summarization. Found Trends Inf Retr 2011;5(2–3):103–233, http://dx.doi.org/10.1561/1500000015.
[10] Carbonell J, Goldstein J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: Croft WB, Moffat A, van Rijsbergen CJ, Wilkinson R, Zobel J, editors. Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval. 1998. p. 335–6.
[11] Goldstein J, Mittal V, Carbonell J, Kantrowitz M. Multi-document summarization by sentence extraction. In: Proceedings of the 2000 NAACL-ANLP workshop on automatic summarization – volume 4, NAACL-ANLP-AutoSum ’00. 2000. p. 40–8, http://dx.doi.org/10.3115/1117575.1117580.
[12] Patil K, Brazdil P. SumGraph: text summarization using centrality in the pathfinder network. Int J Comput Sci Inf Syst 2007;2(1):18–32.
[13] Chatterjee N, Mohan S. Extraction-based single-document summarization using random indexing. In: Proceedings of the 19th IEEE international conference on tools with artificial intelligence – volume 02, ICTAI ’07. 2007. p. 448–55, http://dx.doi.org/10.1109/ICTAI.2007.28.
[14] Mishra R, Bian J, Fiszman M, Weir CR, Jonnalagadda S, Mostafa J, et al. Text summarization in the biomedical domain: a systematic review of recent research. J Biomed Inform 2014;52:457–67, http://dx.doi.org/10.1016/j.jbi.2014.06.009.
[15] Miller GA. WordNet: a lexical database for English. Commun ACM 1995;38(11):39–41, http://dx.doi.org/10.1145/219717.219748.
[16] Unified medical language system [cited 10th August 2015]. http://www.nlm.nih.gov/research/umls.
[17] International Health Terminology Standards Development Organisation: supporting different languages [cited 10th August 2015]. http://www.ihtsdo.org/snomed-ct/snomed-ct0/different-languages.
[18] World Health Organization, International Classification of Diseases (ICD).
[19] U.S. National Library of Medicine, MeSH (Medical Subject Headings) [cited 10th August 2015]. http://www.ncbi.nlm.nih.gov/mesh.
[20] Aronson AR, Lang F-M. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17(3):229–36.
[21] Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17(5):507–13.
[22] Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform 2003;36(6):462–77.
[23] Velupillai S, Kvist M. Fine-grained certainty level annotations used for coarser-grained e-health scenarios. In: Gelbukh A, editor. Computational linguistics and intelligent text processing, vol. 7182 of lecture notes in computer science. Berlin/Heidelberg: Springer; 2012. p. 450–61, http://dx.doi.org/10.1007/978-3-642-28601-838.
[24] Kvist M, Skeppstedt M, Velupillai S, Dalianis H. Modeling human comprehension of Swedish medical records for intelligent access and summarization systems – future vision, a physician’s perspective. In: Fensli R, Dale J, editors. 9th Scandinavian conference on health informatics. 2011.
[25] Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform 2009;42(5):760–72.
[26] Pedersen T, Pakhomov SV, Patwardhan S, Chute CG. Measures of semantic similarity and relatedness in the biomedical domain. J Biomed Inform 2007;40(3):288–99.
[27] Koopman B, Zuccon G, Bruza P, Sitbon L, Lawley M. An evaluation of corpus-driven measures of medical concept similarity for information retrieval. In: Chen X, Lebanon G, Wang H, Zaki MJ, editors. 21st ACM international conference on information and knowledge management, CIKM ’12. 2012. p. 2439–42, http://dx.doi.org/10.1145/2396761.2398661.
[28] Henriksson A, Moen H, Skeppstedt M, Daudaravičius V, Duneld M. Synonym extraction and abbreviation expansion with ensembles of semantic spaces. J Biomed Semant 2014;5(1):6.
[29] Cohen T, Widdows D. Empirical distributional semantics: methods and biomedical applications. J Biomed Inform 2009;42(2):390–405.
[30] Cohen R, Aviram I, Elhadad M, Elhadad N. Redundancy-aware topic modeling for patient record notes. PLOS ONE 2014;9(2), http://dx.doi.org/10.1371/journal.pone.0114677.
[31] Vine LD, Zuccon G, Koopman B, Sitbon L, Pruza P. Medical semantic similarity with a neural language model. In: Li J, Wang XS, Garofalakis MN, Soboroff I, Suel T, Wang M, editors. Proceedings of the 23rd ACM international conference on conference on information and knowledge management, CIKM 2014. 2014. p. 1819–22, http://dx.doi.org/10.1145/2661829.2661974.
[32] Harris ZS. Distributional structure. Word 1954;10:146–62.
[33] Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R. Indexing by latent semantic analysis. J Am Soc Inf Sci 1990;41(6):391–407.
[34] Kanerva P, Kristofersson J, Holst A. Random Indexing of text samples for latent semantic analysis. In: Proceedings of 22nd annual conference of the Cognitive Science Society. 2000. p. 1036.
[35] Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Burges C, Bottou L, Welling M, Ghahramani Z, Weinberger K, editors. Advances in Neural Information Processing Systems 26. Neural Information Processing Systems Foundation; 2013. p. 3111–9.
[36] Luhn HP. The automatic creation of literature abstracts. IBM J Res Dev 1958;2(2):159–65.
[37] Liu S. Experiences and reflections on text summarization tools. Int J Comput Intell Syst 2009;2(3):202–18.
[38] Sarkar K, Nasipuri M, Ghose S. Using machine learning for medical document summarization. Int J Database Theory Appl 2011;4:31–49.
[39] Abulkhair M, ALHarbi N, Fahad A, Omair S, ALHosaini H, AlAffari F. Intelligent integration of discharge summary: a formative model. In: Al-Dabass D, Uthayopas P, Sa-nguanpong S, Niramitranon J, editors. 4th international conference on intelligent systems modelling & simulation. IEEE; 2013. p. 99–104.
[40] Lin C-Y. Rouge: a package for automatic evaluation of summaries. In: Marie-Francine Moens SS, editor. Text summarization branches out: proceedings of the ACL-04 workshop. 2004. p. 74–81.
[41] Moen H, Heimonen J, Murtola L-M, Airola A, Pahikkala T, Terävä V, et al. On evaluation of automatically generated clinical discharge summaries. In: Proceedings of the 2nd European Workshop on Practical Aspects of Health Informatics (PAHI). 2014. p. 101–14.
[42] Moen H, Marsi E, Ginter F, Murtola L-M, Salakoski T, Salanterä S. Care episode retrieval. In: Velupillai S, Duneld M, Kvist M, Dalianis H, Skeppstedt M, Henriksson A, editors. Proceedings of the 5th international workshop on health text mining and information analysis (Louhi) @ EACL. 2014. p. 116–24.
[43] Karlgren J, Sahlgren M, Järvinen T, Cöster R. Dynamic lexica for query translation. In: Peters C, Clough P, Gonzalo J, Jones G, Kluck M, Magnini B, editors. Multilingual information access for text, speech and images, vol. 3491 of lecture notes in computer science. Berlin/Heidelberg: Springer; 2005. p. 150–5, http://dx.doi.org/10.1007/1151964515.
[44] Jones K. A statistical interpretation of term specificity and its application in retrieval. J Doc 1972;28(1):11–21.
[45] McKeown K, Klavans J, Hatzivassiloglou V, Barzilay R, Eskin E. Towards multidocument summarization by reformulation: progress and prospects. In: Hendler J, Subramanian D, editors. Proceedings of the sixteenth national conference on artificial intelligence and eleventh conference on innovative applications of artificial intelligence. 1999. p. 453–60.
[46] Lim J-M, Kang I-S, Bae J-H, Lee J-H. Sentence extraction using time features in multi-document summarization. In: Myaeng S, Zhou M, Wong K-F, Zhang H-J, editors. Information retrieval technology, vol. 3411 of lecture notes in computer science. Berlin/Heidelberg: Springer; 2005. p. 82–93, http://dx.doi.org/10.1007/978-3-540-31871-28.
[47] Aamodt A, Plaza E. Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Commun 1994;7(1):39–59.
[48] Lenz M, Hübner A, Kunze M. Textual CBR. In: Lenz M, Burkhard H-D, Bartsch-Spörl B, Wess S, editors. Case-based reasoning technology, vol. 1400 of lecture notes in computer science. Berlin/Heidelberg: Springer; 1998. p. 115–37, http://dx.doi.org/10.1007/3-540-69351-35.
[49] Brin S, Page L. The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst 1998;30(1):107–17, http://dx.doi.org/10.1016/S0169-7552(98)00110-X.
[50] Mihalcea R. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In: Proceedings of the ACL 2004 on interactive poster and demonstration sessions, ACLdemo ’04. 2004.
[51] Wan X, Yang J. Improved affinity graph based multi-document summarization. In: Proceedings of the human language technology conference of the NAACL, companion volume: short papers, NAACL-Short ’06. 2006. p. 181–4.
[52] Lehman A. JMP for basic univariate and multivariate statistics: a step-by-step guide. Cary, NC, USA: SAS Institute; 2005.
[53] Wilcoxon F. Individual comparisons by ranking methods. Biometrics 1945;1:80–3, http://dx.doi.org/10.2307/3001968.
[54] Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86(2):420–8.
[55] Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess 1994;6(4):284–90.
[56] Dang HT, Owczarzak K. Overview of the TAC 2008 update summarization task. In: Proceedings of text analysis conference 2008 workshop – notebook papers and results. 2008. p. 1–16.
[57] Radev DR, Jing H, Budzikowska M. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In: Proceedings of the 2000 NAACL-ANLP workshop on automatic summarization, vol. 4 of NAACL-ANLP-AutoSum ’00. 2000. p. 21–30, http://dx.doi.org/10.3115/1117575.1117578.
[58] Sankarasubramaniam Y, Ramanathan K, Ghosh S. Text summarization using Wikipedia. Inf Process Manag 2014;50(3):443–61.