Comparison of automatic summarisation methods for clinical free text notes


Contents lists available at ScienceDirect

Artificial Intelligence in Medicine

journal homepage: www.elsevier.com/locate/aiim


Hans Moen (a,b,c,∗), Laura-Maria Peltonen (c,d), Juho Heimonen (b,e), Antti Airola (b), Tapio Pahikkala (b,e), Tapio Salakoski (b,e), Sanna Salanterä (c,d)

a Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Saelands vei 9, 7491 Trondheim, Norway

b Department of Information Technology, University of Turku, Joukahaisenkatu 3–5, 20520 Turku, Finland

c Department of Nursing Science, University of Turku, Lemminkäisenkatu 1, 20520 Turku, Finland

d Turku University Hospital, Kiinamyllynkatu 4–8, 20521 Turku, Finland

e Turku Centre for Computer Science (TUCS), Joukahaisenkatu 3–5, 20520 Turku, Finland

Article info

Article history: Received 4 May 2015; Received in revised form 14 December 2015; Accepted 5 January 2016

Keywords: Automatic text summarisation; Summarisation evaluation; Distributional semantics; Word space models; Clinical text processing; Electronic health records

Abstract

Objective: A major source of information available in electronic health record (EHR) systems is the clinical free text notes documenting patient care. Managing this information is time-consuming for clinicians. Automatic text summarisation could assist clinicians in obtaining an overview of the free text information in ongoing care episodes, as well as in writing final discharge summaries. We present a study of automated text summarisation of clinical notes. It looks to identify which methods are best suited for this task and whether it is possible to automatically evaluate the quality differences of summaries produced by different methods in an efficient and reliable way.

Methods and materials: The study is based on material consisting of 66,884 care episodes from EHRs of heart patients admitted to a university hospital in Finland between 2005 and 2009. We present novel extractive text summarisation methods for summarising the free text content of care episodes. Most of these methods rely on word space models constructed using distributional semantic modelling. The summarisation effectiveness is evaluated using an experimental automatic evaluation approach incorporating well-known ROUGE measures. We also developed a manual evaluation scheme to perform a meta-evaluation on the ROUGE measures to see if they reflect the opinions of health care professionals.

Results: The agreement between the human evaluators is good (ICC = 0.74, p < 0.001), demonstrating the stability of the proposed manual evaluation method. Furthermore, the correlation between the manual and automated evaluations is high (> 0.90 Spearman's rho). Three of the presented summarisation methods ('Composite', 'Case-Based' and 'Translate') significantly outperform the other methods for all ROUGE measures (p < 0.05, Wilcoxon signed-rank test and Bonferroni correction).

Conclusion: The results indicate the feasibility of the automated summarisation of care episodes. Moreover, the high correlation between manual and automated evaluations suggests that the less labour-intensive automated evaluations can be used as a proxy for human evaluations when developing summarisation methods. This is of significant practical value for summarisation method development, because manual evaluation cannot be afforded for every variation of the summarisation methods. Instead, one can resort to automatic evaluation during the method development process.

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

∗ Corresponding author at: Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Saelands vei 9, 7491 Trondheim, Norway. Tel.: +47 97502647.

E-mail addresses: [email protected] (H. Moen), lmemur@utu.fi (L.-M. Peltonen), juaheim@utu.fi (J. Heimonen), ajairo@utu.fi (A. Airola), aatapa@utu.fi (T. Pahikkala), tapio.salakoski@utu.fi (T. Salakoski), sansala@utu.fi (S. Salanterä).

1. Introduction

1.1. Background

Information overload in the health sector is becoming an increasing problem for clinicians [1,2]. They have to read masses of text (such as clinical notes, guidelines and scientific literature) to satisfy their information needs. Lack of time and resources to do this properly causes problems such as errors, frustration, inefficiency and communication failures [3].

http://dx.doi.org/10.1016/j.artmed.2016.01.003

0933-3657/© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

The contents of electronic health record (EHR) systems are largely composed of clinical notes (or clinical narratives) in the form of unstructured and unclassified text. The clinical notes written during a single care episode, i.e. a stay in a hospital, can be quite voluminous, especially for patients suffering from more complex and long-term health problems. Knowing the medical history of a patient is vital for a clinician, but scanning through clinical notes consumes precious time that could be better spent treating the patient.

Automatic summarisation of the free text content in care episodes could assist clinicians in at least two ways. First, it could provide an (indicative) overview of the documentation of a care episode. Together with structured data (such as laboratory test results, images, diagnostic codes and personal information) it could help clinicians to familiarise themselves with the content of the care episode and the patient's problems, which is particularly useful if the information is needed urgently. Second, it may help in writing a discharge summary of a care episode. Discharge summaries are crucial in communication between different health care providers and they are needed to ensure continuity of care. However, there are a number of challenges with them, ranging from being produced late to having insufficient information. For example, Kripalani et al. [4] showed that discharge summaries exchanged between hospitals and primary care physicians are often lacking some of the expected information, such as that related to treatment progression, counselling and follow-up proposals. Computer-assisted discharge summaries and standardised templates are measures for improving transfer time and the quality of discharge information between hospitals and primary care physicians [4]. The utilisation of automatic text summarisation could improve the timeliness and quality of discharge summaries even further.

Central to this work is the focus on resource-lean¹ and language-independent methods. Such methods are important for languages such as Finnish, for which no major manually constructed lexical resources suited for the comprehensive semantic analysis of clinical text are available.

1.2. Related work

This study focuses on the extraction-based summarisation approach, in which the summary is generated by selecting a subset of sentences from the relevant text. This approach is viable because a sizeable portion of a clinical text summary is created by copying or deriving information from clinical notes [2,5–7]. See, for example, [8,9] for more information on extraction-based text summarisation.

A central issue in extraction-based summarisation is how to determine what the most relevant content to be included in a summary is. Common techniques of extraction-based summarisation include topic-based sentence extraction [10,11], where the relevance of a sentence is computed with respect to one or more topics of interest; and centrality-based sentence extraction [12,13], where the sentences that are the most (strongly) associated with others are selected on the assumption that they constitute the best coverage of the documents. In order to avoid including redundant information, it is common to apply the maximal marginal relevance criterion [10] or similar techniques that take sentence overlap into account. Purely statistical (data-driven) approaches to text summarisation are often referred to as 'knowledge poor', whereas those using knowledge resources are considered 'knowledge rich'. The latter could, for example, include the use of an ontology that models medical and clinical concepts as well as their relationships.

¹ We are striving towards using as little manual labour as possible.

In their recent review, Mishra et al. [14] indicated that there is a growing interest in knowledge-rich approaches in the biomedical domain, coinciding with the increased availability of comprehensive lexical resources, such as the WordNet ontology [15] and the UMLS compendium [16] (including SNOMED-CT [17], ICD [18] and MeSH [19]). There are several language tools that rely on these resources, such as MetaMap [20], cTAKES [21] and SemRep [22]. Other commonly used resource types include, for example, annotated corpora designed for machine learning (ML) algorithms (see e.g. [6,23]). However, one disadvantage to approaches that rely on manually constructed resources is that they are often not applicable across domains or languages [24,25]. WordNet and UMLS (SNOMED-CT, in particular), for example, are only available in a few languages. The cost of adapting existing resources to new languages, domains or tasks, or constructing new resources, is often high.

The use of distributional semantic methods represents a resource-light approach to capturing terminology in clinical texts [26–31]. These methods rely on the distributional hypothesis [32] for constructing distributional semantic models from word co-occurrence statistics in an unsupervised manner, typically using a very large corpus of unannotated text. The aim is to model similarities, or relatedness, between linguistic items (e.g. words) in a way that reflects their relative semantic meaning. Distributional semantic models representing word-level semantic similarity are often referred to as word space models (WSMs). In a WSM, a word context vector is created for each unique word in the underlying corpus. Further, each context vector represents a point in the 'word space' and their internal distances reflect their semantic similarities. Similarities between context vectors are then calculated to quantify the semantic similarity as a numeric value (for example, using the cosine similarity function). Popular techniques and frameworks for constructing WSMs include latent semantic analysis (LSA) [33], random indexing (RI) [34] and Word2vec (W2V) [35]. The domain-specificity of the corpus used for constructing the model has been shown to be important for the usefulness of semantic similarities to the intended task [27]. Distributional semantic models in various forms have been extensively used in text summarisation, e.g. [9,13,36].
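To make the similarity computation concrete, the following is a minimal sketch of the cosine similarity function applied to context vectors. The three-dimensional vectors and their values are invented for illustration only; they are not taken from the models built in this study.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy context vectors (made-up numbers, purely illustrative).
pain = [0.9, 0.1, 0.3]
ache = [0.8, 0.2, 0.4]
foot = [0.1, 0.9, 0.2]

# Words used in similar contexts end up closer in the word space.
print(cosine_similarity(pain, ache) > cosine_similarity(pain, foot))
```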

To the best of our knowledge, the task of automatically generating textual summaries from clinical notes has been pursued by relatively few researchers, which is also evident in recent reviews and related works, for example [14,24]. We have identified several pieces of work focusing on the task of automatically generating textual summaries from unstructured clinical notes. Liu [37] used the MEAD summarisation toolkit. Van Vleck et al. [2] performed structured interviews to identify and classify phrases that clinicians considered relevant to explaining a patient's history. Meng et al. [6] used an annotated training corpus together with tailored semantic patterns to determine what information should be repeated in a new clinical note or summary. Velupillai and Kvist [23] focused on recognising diagnostic statements in clinical text, learning from an annotated training corpus, and classifying these based on the level of certainty expressed in them. Extracted diagnostic statements are then used to produce a text summary. Others have worked on more conceptual models for understanding and supporting the generation of information summaries in the clinical domain [38,39].

The evaluation of computer-generated summaries is typically performed by comparing the generated summary with a gold standard (or reference summary), which represents the ideal manually constructed summary or summaries. The ROUGE evaluation package [40] has become a de facto evaluation metric in text summarisation. The required gold standard summaries are costly to create given the manual work required. This is particularly the case in specialised domains where domain experts are required. Lissauer et al. [3] evaluated computer-generated discharge summaries from neonate reports by manually comparing them to dictated summaries, as well as analysing them to see whether they contained the required information according to guidelines. Liu [37] performed automatic evaluation of computer-generated summaries of clinical nursing notes by using the original discharge reports as gold standard summaries. Moen et al. [41] applied both manual and automatic evaluation to the assessment of the reliability of automatic evaluation; the manual evaluation was performed by domain experts and the automatic evaluation was performed by using ROUGE to calculate the similarity between the computer-generated summaries and the original discharge summaries (produced by clinicians).

Fig. 1. Experimental set-up. The figure shows how the experiment was conducted.

1.3. Objectives

The main contributions of this study can be summarised as follows:

• Proposal and implementation of four novel automatic summarisation methods designed for summarising the free text in care episodes;

• Proposal and implementation of a methodology for conducting the manual evaluation of automatically generated care episode summaries;

• Empirical analysis of automatic evaluation measures through comparison with manual evaluations;

• Performance assessment of the four novel automatic summarisation methods along with four baseline methods.

The data used in this study is Finnish clinical text, but since the applied methods are language-independent, the contributions should also be relevant to other languages. The overall set-up is illustrated in Fig. 1.

2. Material and methods

2.1. Data

The data set used in this study consists of EHRs from patients with any type of heart-related problem who were admitted to a single university hospital in Finland between 2005 and 2009. Of these, the clinical notes written by physicians on the various wards that the patients visited were used. However, notes written by nurses were not included. Fig. 2 shows an example of a clinical note.

Ethical approval for the research was obtained from the ethics committee of the hospital district (17.2.2009 §67) and permission to conduct the research was obtained from the medical director of the hospital district (2/2009). The total set consisted of 66,884 care episodes, which amounts to 398,040 notes and 64 million words in total. This full set, minus 308 care episodes reserved for optimisation and evaluation (see below), was used for constructing the WSMs (see Section 2.2).

The notes are mostly unstructured, consisting of clinical free text in Finnish. Various subheadings do occur in the clinical notes, but these are not standardised, structured, or uniquely recognised in our corpus. Thus, we treat these in the same way as the rest of the text. Some of the sentences are, according to the EHR system, considered to be metadata, such as names of the authors, dates, wards and so on. We treat the free text sentences and the meta-text sentences as two separate text types, so these are not mixed in the sense that they cannot belong to the same 'sentence topic' clusters, which are described in Section 2.3.

Each care episode has been manually labelled with ICD-10 codes by clinicians as a part of the original care process. These are normally applied at the end of the patient's hospital stay, or after they are discharged from hospital. Care episodes commonly have one primary ICD-10 code attached to them, and a number of optional secondary codes. In this study the primary ICD-10 code is used in constructing the RI-ICD WSM, as described in Section 2.2.

Fig. 2. Example of a clinical note. This is a fake case originally created in Finnish by domain experts, then translated into English. Common misspellings are included intentionally.

In the presented experiments, we restrict our evaluation to the care episodes which have the primary ICD code I25 (chronic ischemic heart disease), including subcodes (I25.0, I25.1, etc.). As a further restriction, to justify the use of text summarisation, we consider only care episodes consisting of seven or more clinical notes written by physicians. In order to guarantee that the methods are tested on independent test data that is not used for developing the summarisation models, the 308 care episodes are split into two subsets:

• A summarisation optimisation set, consisting of 152 care episodes, used for optimising parameters related to the summarisation methods.

• A summarisation evaluation set, consisting of 156 care episodes, used for evaluation in the conducted experiments. This is further split into two subsets of 20 and 136 care episodes, the former subset being evaluated in Experiment 1 and the latter in Experiment 2.

This splitting is performed according to the year in which the care episodes were carried out.

2.2. Word space models used for sentence similarity and summarisation

We use a method based on the RI technique, utilising the ICD-10 codes attached to care episodes (RI-ICD) to construct a WSM for the purpose of calculating (semantic) similarity between care episodes. We also use the RI technique in constructing a 'cross-text' translation model (RI-Translate), and the W2V method is used to construct a WSM for the purpose of calculating sentence-to-sentence similarities, as well as sentence-to-document⁴ similarities. The cosine similarity metric is used to calculate vector similarities.

Random indexing and RI-ICD

RI [34] is a technique for constructing a (pre-)compressed WSM with a fixed dimensionality, done in an incremental fashion. This is achieved by initiating index vectors for each unique word in the corpus. An index vector is a vector of a fixed dimensionality, containing mainly zeros along with a small number of randomly assigned non-zeros, typically 1 or −1. During training, context vectors for words are constructed by adding index vectors to them. In this way, the dimensionality of the context vectors remains constant. In this work we use a version of RI where context features are based on the ICD-10 code classifications of care episodes. We have called this RI-ICD,⁵ previously introduced in [42].

RI-Translate

Another RI-based method used here is one intended for cross-lingual translation purposes, described in [43]. We refer to it as RI-Translate. The method constructs a bilingual WSM that connects words in the source language (SL) to their translated counterparts in the target language (TL). In practice, we operate with two models, one for SL and one for TL, where both belong to the same semantic space in that they are constructed with the same set of index vectors. For training, pre-aligned translation pairs (in this case, aligned sentences) connecting the SL to the TL are used as training instances. The training takes place as follows: for each translation pair (SL–TL), a unique index vector is generated and added to the corresponding context vectors for words in the SL and TL models. This will result in high cosine similarity between words in the SL model and the TL model that have often occurred in the same translation pairs.

⁴ Documents are in this case the clinical notes.

⁵ A vector dimensionality of 800 was used, and the number of non-zeros for the index vectors was set to four.

Table 1
Top ten most similar words according to the W2V-based WSM for the query words 'pain' and 'foot', together with the corresponding cosine similarity scores. The words have been translated from Finnish to English.

Pain (kipu)                          cossim   Foot (jalka)                    cossim
Pain sensation (kiputuntemus)        0.5097   Lower limb (alaraaja)           0.5905
Ache (särky)                         0.4835   Ankle (nilkka)                  0.3731
Pain symptom (kipuoire)              0.4173   Limb (raaja)                    0.3454
Chest pain (rintakipu)               0.4042   Shin (sääri)                    0.3405
Dull pain (jomotus)                  0.4000   Peripheral (periferisia)        0.3112
Back pain (selkäkipu)                0.3953   Callus (känsä)                  0.3059
Pain seizure/attack (kipukohtaus)    0.3904   Top of the foot (jalkapöytä)    0.2909
Pain status (kiputila)               0.3685   Upper limb (yläraaja)           0.2879
Abdominal pain (vatsakipu)           0.3653   Peripheral (perifer)            0.2875
Discomfort (vaiva)                   0.3614   In lower limb (alaraajassa)     0.2707

When querying the system, the context vector(s) corresponding to the query in the SL model is used as the query. This query vector is then matched against the units in the TL model, using cosine similarity, in order to find the most likely translation(s). This method is used for summarisation purposes in the Translate method, as described below.
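The training and querying steps of RI-Translate can be sketched roughly as follows. This is an illustrative toy, not the implementation from [43]: the dimensionality and number of non-zeros simply reuse the values reported for RI-ICD, adding the index vector once per distinct word in a pair is an assumption, and all function names are hypothetical.

```python
import math
import random

DIM, NONZEROS = 800, 4  # values reported for RI-ICD; reused here as an assumption

def make_index_vector(rng):
    """Sparse random index vector: mostly zeros, a few +1/-1 entries."""
    vec = [0] * DIM
    for pos in rng.sample(range(DIM), NONZEROS):
        vec[pos] = rng.choice((1, -1))
    return vec

def train(pairs, seed=0):
    """pairs: list of (source_words, target_words). Returns (sl_model, tl_model)."""
    rng = random.Random(seed)
    sl_model, tl_model = {}, {}
    for source_words, target_words in pairs:
        index = make_index_vector(rng)  # one unique index vector per translation pair
        for model, words in ((sl_model, source_words), (tl_model, target_words)):
            for word in set(words):  # assumption: add once per distinct word
                ctx = model.setdefault(word, [0] * DIM)
                for i, value in enumerate(index):
                    ctx[i] += value
    return sl_model, tl_model

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def translate(word, sl_model, tl_model):
    """Return the TL word whose context vector is closest to the SL query vector."""
    query = sl_model[word]
    return max(tl_model, key=lambda t: cosine(query, tl_model[t]))

# Tiny made-up example: words that share a translation pair get the same index vector added.
sl, tl = train([(["kipu", "rinta"], ["pain"]), (["jalka"], ["foot"])])
print(translate("kipu", sl, tl))
```

Because both models are built from the same index vectors, SL and TL context vectors live in one space and can be compared directly, which is the property the Translate summarisation method relies on.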

Word2vec

Word2vec [35] is a framework for constructing WSMs using a neural network. In this work we utilise the W2V CBOW architecture.⁶ Table 1 shows an example of how the W2V-based model captures semantic similarity relations. We use this model in the various summarisation methods for computing sentence-to-sentence similarities and sentence-to-document similarities.

Composing sentence and document vectors

Sentence context vectors are composed by normalising and summing (pointwise summation) the constituent word context vectors weighted by their sentence term frequency multiplied by their global inverted sentence frequency (TF*ISF). A similar approach is used for constructing context vectors representing clinical notes, but here weighting is based on term frequency multiplied with global inverted document frequency (TF*IDF) [44], where each clinical note is considered as a document.
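As a rough sketch of the composition step, the function below sums unit-length word vectors weighted by TF*ISF. The text does not spell out the exact ISF formula, so the logarithmic form log(N / sentence frequency) is an assumption, as are the function names and the choice to skip out-of-vocabulary words.

```python
import math
from collections import Counter

def normalise(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else list(vec)

def sentence_vector(sentence, word_vectors, sentence_freq, n_sentences):
    """Pointwise sum of normalised word vectors, each weighted by TF * ISF.

    sentence: list of tokens; sentence_freq: word -> number of sentences containing it.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word, tf in Counter(sentence).items():
        if word not in word_vectors:
            continue  # assumption: out-of-vocabulary words are ignored
        isf = math.log(n_sentences / sentence_freq[word])  # assumed ISF definition
        for i, x in enumerate(normalise(word_vectors[word])):
            total[i] += tf * isf * x
    return total

# Made-up two-dimensional vectors and counts, for illustration only.
vectors = {"pain": [3.0, 4.0], "the": [1.0, 0.0]}
freqs = {"pain": 10, "the": 100}
print(sentence_vector(["the", "pain", "pain"], vectors, freqs, n_sentences=100))
```

A word that occurs in every sentence (here "the") receives an ISF of zero and so contributes nothing, which is the intended effect of the weighting. Note vectors would follow the same pattern with document frequencies in place of sentence frequencies.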

2.3. Summarisation methods

We evaluated eight different summarisation methods. Oracle is an (unrealistic) reference method that has access to the true/original discharge summary when selecting sentences to extract, providing an upper boundary to how well an extractive summarisation method can work for our data. LastSentences and Random are simple reference methods that a successful summarisation method should be able to outperform. Centrality is a standard baseline approach that is commonly used in the field, and the remaining four methods, called RepeatedSentences, Case-Based, Translate and Composite, are proposed methods developed specifically with the clinical domain in mind.

⁶ For W2V a window size of 5+5 and a dimensionality of 800 was used.

For each care episode, the length of the summary generated by each summarisation method is set to have a fixed size equal to the word count of the accompanying discharge summary (i.e. the 'gold standard summary' for the care episode). Sentences are iteratively extracted from the clinical notes until the total word count becomes equal to or just exceeds the word count of the discharge summary. Therefore, generated summaries can have a word count equal to the discharge summary, or exceed this limit by a subset of the words in the last extracted sentence. This way of dynamically selecting the summarisation length is mainly done to enable the calculation of the automatic evaluation scores (F-score), described in Section 2.4.2, which assumes equal length of the target summary (the summary being evaluated) and the gold standard summary.
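The length rule can be illustrated with a few lines of code. This is a minimal sketch that assumes whitespace tokenisation and that the sentences arrive already ranked by whichever summarisation method is in use; `build_summary` is a hypothetical name.

```python
def build_summary(ranked_sentences, target_word_count):
    """Take sentences in rank order until the word count reaches or just exceeds the target."""
    summary, words = [], 0
    for sentence in ranked_sentences:
        if words >= target_word_count:
            break
        summary.append(sentence)
        words += len(sentence.split())  # assumption: words are whitespace-separated
    return summary

# With a target of 4 words, the second sentence pushes the total to 5 and extraction stops.
print(build_summary(["a b c", "d e", "f g h i"], 4))  # -> ['a b c', 'd e']
```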

In the summarisation methods RepeatedSentences, Case-Based, Translate and Composite, a type of topic clustering is used to perform redundancy reduction. We found that each sentence is typically informative, self-sustaining in information content, and independent of other sentences within a single note. All sentences are first clustered into unlabelled sentence topics in an unsupervised way using the W2V model. A cosine similarity threshold, optimised on the summarisation optimisation set, is used for determining whether two sentences can be considered similar or not, that is, whether or not they belong to the same topic based on their cosine similarity. The underlying approach is somewhat comparable to how similar paragraphs are detected and merged in McKeown et al. [45] with the aim of reducing sentence redundancy. Since we know when each sentence was written, and if we assume that we are able to cluster sentences that discuss the same topic across clinical notes (e.g. the state of a patient's pain), we can also assume that the latest written sentence belonging to a topic is the most representative of the latest information concerning that topic. Therefore, we allow the latest written sentence belonging to each topic cluster to be the representative sentence. In an attempt to most effectively model sentence topic clusters, the clustering approach is done as follows: first, we assume that all sentences in the first clinical note of a care episode belong to different topics (see Note 1 in Fig. 3 for an illustration). Then we iterate through the care episode, from the first to the last written note, and assign each sentence to either existing topics (cossim ≥ threshold) or new topics (cossim < threshold) based on their cosine similarities in relation to the threshold. In the utilised similarity comparison, the latest added sentence of a topic always represents that topic. A sentence can only belong to one topic, so if a sentence is similar to two or more topics, i.e. cossim > threshold, the sentence is assigned to the most similar topic.⁷
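The clustering procedure described above can be sketched as follows. This is a simplified illustration, not the authors' code: a word-overlap measure stands in for the W2V cosine similarity, the threshold value is invented, ties are resolved by taking the last best match rather than a random one, and sentences in later notes may join a topic opened earlier in the same note.

```python
def word_overlap(a, b):
    """Stand-in similarity (Jaccard overlap of words); the study uses W2V cosine instead."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cluster_topics(notes, similarity, threshold):
    """notes: chronologically ordered list of notes, each a list of sentences.

    Returns a list of topics; each topic is a list of (note_number, sentence)
    whose last element is the representative (latest written) sentence.
    """
    topics = []
    for note_number, note in enumerate(notes, start=1):
        for sentence in note:
            best, best_sim = None, threshold
            if note_number > 1:  # every sentence of the first note opens its own topic
                for topic in topics:
                    sim = similarity(sentence, topic[-1][1])  # compare with latest sentence
                    if sim >= best_sim:
                        best, best_sim = topic, sim
            if best is None:
                topics.append([(note_number, sentence)])
            else:
                best.append((note_number, sentence))
    return topics

notes = [["pain in chest", "fever today"], ["chest pain reduced", "new rash"]]
for topic in cluster_topics(notes, word_overlap, threshold=0.4):
    print(topic)
```

Keeping the note number alongside each sentence is a design convenience: the same structure can feed a score that depends on when the sentences of a topic were written.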

Original discharge summary

The original discharge summary is a text written by a clinician, typically a physician, to summarise a care episode. These summaries are thus written at the end of each care episode, and often contain extracts from the accompanying clinical notes. They also typically contain a certain amount of as-yet undocumented information which focuses on follow-up treatment, and are meant for the receiving ward (if any) or the primary care sector.

⁷ In the unlikely event that the cosine similarities between a sentence and two or more topics are the same, the sentence is assigned to a random topic among these.

Fig. 3. Summarisation method RepeatedSentences. The example illustrates how summaries are produced by sentence topic clustering and topic scoring. The highest scoring topics from highest to lowest are B, E, A, C, D, G, F and H. In the generated summary displayed here, the topics are sorted by the post-processing step, and the three lowest scoring topics, G, F and H, are excluded.

In this work, the original discharge summary serves as the gold standard summary for its accompanying care episode in the automatic evaluation approach that is used (see Section 2.4.2). In addition, some of the summarisation methods use these in their underlying training (Translate) or in the summarisation phase (Case-Based). Naturally, for a care episode that is to be summarised, the accompanying original discharge summary will not be available to the summarisation system in a realistic scenario.

Summarisation method: Oracle

This is a control method that has access to the original discharge summary during the summarisation process. It optimises the ROUGE-N2 F-scores (see Section 2.4.2) for the generated summary according to the gold summary, using a greedy search strategy. That is, the method extracts sentences one by one from the clinical notes until it reaches the length threshold, always picking sentences that result in the highest possible ROUGE-N2 score. This method is cheating, since it has access to the original discharge summary in the summarisation process. Still, it represents the upper limit for what is achievable in terms of ROUGE-N2 scores for an extraction-based summary.
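A greedy search of this kind might look like the sketch below. The bigram-overlap F-score used here is a simplified stand-in for the ROUGE-N2 measure of the ROUGE package (no stemming or stop-word handling, and bigrams are counted across sentence boundaries), and the function names are hypothetical.

```python
from collections import Counter

def bigrams(text):
    """Multiset of adjacent word pairs in a whitespace-tokenised, lower-cased text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f(candidate, reference):
    """Simplified bigram-overlap F-score (an approximation of ROUGE-N2)."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def oracle_summary(sentences, reference):
    """Greedily add the sentence that maximises the score until the length threshold is met."""
    target = len(reference.split())
    chosen, remaining = [], list(sentences)
    while remaining and sum(len(s.split()) for s in chosen) < target:
        best = max(remaining, key=lambda s: rouge2_f(" ".join(chosen + [s]), reference))
        chosen.append(best)
        remaining.remove(best)
    return chosen

notes = ["patient has chest pain", "weather was nice", "chest pain resolved after treatment"]
print(oracle_summary(notes, "chest pain resolved after treatment"))
```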

Summarisation method: LastSentences

The latest written clinical note at any point should supposedly represent the current state of the patient. One can therefore intuitively assume that the most recently written information is important in a (discharge) summary. In this method, the summary is simply constructed from the N last written sentences during the care episode, where N is the number of sentences needed to reach the length threshold. Intuitively, this represents a strong baseline.

Summarisation method: Random

This baseline method constructs summaries by randomly selecting sentences from the care episode until the length threshold is reached. It provides a lower boundary to the performance, which any meaningful summarisation approach should aim to significantly outperform.

Summarisation method: RepeatedSentences

Meng et al. [6] argue that information being repeated across clinical notes is an indicator of its relevance with respect to inclusion in subsequent notes in the sequence. The use of time features is also explored by Lim et al. [46] in the task of multi-document summarisation of news article documents. The underlying hypothesis for the RepeatedSentences method is that information that is repeated in multiple clinical notes throughout a care episode, with the emphasis on when it was written, is the most important information to include in a summary. Features from the initial sentence topic clustering step are used for scoring. A topic is assigned a score based on the sum of the order of when, in the care episode, its underlying sentences were written. For example, if a topic contains sentences from clinical note numbers 3, 5 and 6 (numbered relative to the dates they were written), the topic score becomes 14 (3 + 5 + 6). The N highest scoring sentence topics (i.e. their representative sentences) are included in the final summary. The RepeatedSentences summarisation method is illustrated in Fig. 3.
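The topic score reduces to a sum of note numbers, as the short sketch below shows. The topic labels and note numbers other than the 3, 5, 6 example are made up, and the function names are hypothetical.

```python
def topic_score(note_numbers):
    """Score of a topic: sum of the (chronological) numbers of the notes its sentences come from."""
    return sum(note_numbers)

def top_topics(topics, n):
    """topics: topic id -> list of note numbers. Returns the n highest scoring topic ids."""
    return sorted(topics, key=lambda t: topic_score(topics[t]), reverse=True)[:n]

print(topic_score([3, 5, 6]))  # -> 14, the example from the text

# Topics that recur, and recur late in the care episode, rank highest.
print(top_topics({"A": [1, 2], "B": [3, 5, 6], "C": [6]}, n=2))  # -> ['B', 'C']
```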

Summarisation method: Case-Based

Case-Based, or 'case-based summarisation', is here an analogy to case-based reasoning (CBR) [47] which performs a type of textual case-based reasoning (TCBR) [48]. CBR involves retrieving existing or older 'cases' with similar content as the target problem, and then reusing the solution of the retrieved case (or cases) to solve the target problem. In a similar manner, this principle is applied here in text summarisation. The underlying hypothesis is that patients with (the most) similar care episodes (according to the documented text in their clinical notes) have similar content in their discharge summaries. The sentences from these discharge summaries are then treated as the central 'topics' for what to include in the summary. This is in line with evidence-based practice (EBP) in that relevant care episodes are identified and the information found there is relied upon as 'evidence' for what should be included in the summary.

Given a target care episode that is to be summarised, we first retrieve the top five most similar care episodes using information retrieval on care episode level (i.e. 'care episode retrieval'). For this the RI-ICD method is used (explained in [42], Section 4.1). Then we reuse these by iterating through each sentence in their discharge summaries. The representative (last) sentences from each sentence topic in the target care episode (as described earlier) are then scored by their cosine similarity to each of these using the W2V model. Out of these, the N highest scoring sentences are included in the generated summary. Fig. 4 illustrates this using a modification of the 'CBR cycle' from [47].⁸

Summarisation method: Translate

Here we use the RI-Translate method, as explained in Section 2.2, for the purpose of text summarisation. Instead of translation between languages, it is used for 'cross-text-type translation', that is, translating from the text in clinical notes (care episodes without discharge summaries) to the text found in the discharge summaries, while limiting the translation candidates (i.e. sentences) to also come from the sentence topics in the clinical notes. The aim is thus to construct a type of translation system that can map sentences in clinical notes to the most probable sentences to be found in an accompanying discharge summary, based on translation statistics learnt from a large clinical corpus.

First, a translation, or cross-text-type WSM is constructed using the RI-Translate method. Here the source language (SL) consists of the text in the clinical notes, while the discharge summaries constitute the target language (TL). Training instances are rather coarse, as each care episode represents a single training instance. More precisely, for each care episode, the context vectors of the words in the underlying clinical notes (SL) and those in its accompanying discharge summary (TL) have a unique index vector added to them.

⁸ Fig. 4 also contains the steps revise and retain, but these are outside the scope of this work.

Fig. 4. The Case-Based summarisation method illustrated as a 'CBR cycle', based on the CBR cycle introduced by Aamodt and Plaza [47]. The left side of the dashed line is not utilised in this work, but illustrates how the full CBR cycle can be used in a hospital setting.

When summarising a care episode, each sentence (in the corresponding clinical notes, pre-clustered into sentence topics) has two sentence vectors constructed, one using word context vectors from the SL model, and the other using the TL model. Then, each sentence vector built with the SL model is iteratively used to query the system. Sentences represented by the TL model are then ranked by their overall max cosine similarity scores to these queries, and the top N sentences are included in the final summary.

Summarisation method: Composite

In this composite method, the sentence-scoring features from the methods RepeatedSentences, Case-Based and Translate are combined. We found that the best automatic evaluation scores (F-scores) from the summarisation optimisation set were achieved when the scores by Case-Based and Translate were kept as their initial cosine scores, while for RepeatedSentences, the sentence scores were first normalised by dividing by the score of the highest-scoring sentence. This normalisation converts the scores to be within the same range as the cosine-based scores, ranging from 0 to 1. These three feature scores are then simply totalled to create the final feature score for each sentence. Finally the top N sentences are selected for the final summary.
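The score combination can be written down in a few lines. This sketch assumes that each method delivers a dictionary of scores keyed by a sentence identifier; the identifiers, the score values and the function name are invented for illustration.

```python
def composite_scores(repeated, case_based, translate):
    """Combine three per-sentence score dictionaries (sentence id -> score).

    RepeatedSentences scores are divided by their maximum so they fall in [0, 1];
    the cosine-based scores are used as they are. The three values are then summed.
    """
    max_repeated = max(repeated.values(), default=0)
    combined = {}
    for sid, raw in repeated.items():
        rep = raw / max_repeated if max_repeated else 0.0
        combined[sid] = rep + case_based.get(sid, 0.0) + translate.get(sid, 0.0)
    return combined

# Made-up scores for two representative sentences.
scores = composite_scores(
    repeated={"s1": 14, "s2": 7},
    case_based={"s1": 0.5, "s2": 0.9},
    translate={"s1": 0.2, "s2": 0.1},
)
print(sorted(scores, key=scores.get, reverse=True))  # ranking used to pick the top N
```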

Summarisationmethod:Centrality

The centrality (or centroid) principle is the most commonly relied-on summarisation technique for many generic text types and domains. It is based on ranking sentences by how representative they are of the central information of the text that is to be summarised. In existing work, a range of methods have been used to compute sentence centrality in extraction-based summarisation. The PageRank method [49] has been extensively used for this purpose as a graph-based approach. We decided to base our implementation on the method presented in [13], which relies on a graph representation together with a WSM (RI). To construct the WSM, we used W2V instead of RI because preliminary testing indicated that this model performed better. Here, weighted PageRank for text is used, referred to as ‘TextRank’ [50]. Edges between nodes, i.e. between sentences, are weighted according to the pre-calculated sentence similarity using W2V. Each sentence also has an initial score similar to the cosine similarity between the sentence and the corresponding clinical note, represented as sentence vectors and document vectors, respectively. In addition, to adapt this approach to multiple documents, i.e. multiple clinical notes, we have extended this method with a sentence centrality ranking that works on multiple notes, in a similar way to how it is done in [51]. This is done by multiplying edge weights by one of two preset constants. Constant δ is multiplied with intra-note edge weights, i.e. edges going between sentences within the same clinical note, and inter-note edges are multiplied with the second constant (see footnote 9).
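One way to picture the multi-note edge weighting is the sketch below, a plain power-iteration PageRank over a scaled similarity graph. It is an illustration under several assumptions rather than the implementation used in the study: the function and parameter names are hypothetical, the default constants mirror the values given in footnote 9 (with the PageRank value read as a damping factor, which is our assumption), similarities are assumed to be non-negative, and the per-sentence initial scores mentioned above are left out for brevity.

```python
def weighted_pagerank(similarity, note_ids, intra=0.3, inter=1.0,
                      damping=0.9, iterations=50):
    """Score sentences by weighted PageRank over a sentence-similarity graph.

    similarity[i][j] is the similarity between sentences i and j, and
    note_ids[i] identifies the clinical note that sentence i belongs to.
    Edges inside one note are scaled by `intra`, edges across notes by `inter`.
    """
    n = len(similarity)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loops
            factor = intra if note_ids[i] == note_ids[j] else inter
            weights[i][j] = factor * similarity[i][j]

    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            updated.append((1.0 - damping) / n + damping * incoming)
        scores = updated
    return scores


# Hypothetical example: sentences 0 and 1 share a note, sentence 2 does not.
toy_similarity = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
toy_scores = weighted_pagerank(toy_similarity, note_ids=[0, 0, 1])
```

With an intra-note constant below the inter-note constant, a sentence that is similar to sentences in other notes gains more centrality than one that is only similar to its neighbours in the same note.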

2.3.1. Post-processing

9 In the experiment we used a PageRank α value of 0.90, δ was 0.3, while the inter-note constant was 1.0.

Simple post-processing is applied to each summary for the purpose of rearranging the sentence order. Sentences are sorted according to the date they were written (i.e. using the date of the clinical note they belong to). Internal ranking between sentences from the same date is carried out according to internal sentence order. If two sentences from two different notes have the same date stamp, ranking is performed according to their chronological note IDs. Meta-sentences (described in Section 2.1) are placed first and rearranged internally.
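The ordering rules of the post-processing step can be expressed as a single sort key. The sketch below is illustrative only: the `Sentence` record and its field names are hypothetical, and meta-sentences are assumed to be ordered among themselves by the same date, note ID and position rules, which the text does not spell out.

```python
from datetime import date
from typing import NamedTuple


class Sentence(NamedTuple):
    text: str
    note_date: date   # date of the clinical note the sentence belongs to
    note_id: int      # chronological note ID
    position: int     # position of the sentence inside its note
    is_meta: bool = False


def order_summary(sentences):
    """Meta-sentences first, then by note date, note ID and sentence position."""
    return sorted(
        sentences,
        key=lambda s: (not s.is_meta, s.note_date, s.note_id, s.position),
    )


# Hypothetical selected sentences, listed in ranking order rather than time order.
selected_sentences = [
    Sentence("b", date(2015, 1, 2), 11, 0),
    Sentence("a2", date(2015, 1, 1), 10, 1),
    Sentence("a1", date(2015, 1, 1), 10, 0),
    Sentence("c", date(2015, 1, 1), 12, 0),
    Sentence("meta", date(2015, 1, 3), 13, 0, True),
]
ordered_texts = [s.text for s in order_summary(selected_sentences)]
```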

2.4. Experiment and evaluation

The following two experiments were conducted:

Experiment 1: The first experiment focuses on determining the reliability of the automatic evaluation. This is done by comparing how the manual and automatic evaluations (four ROUGE measures) correlate in terms of the relative rankings of the eight summarisation methods. Here, 20 care episodes (a subset of the 156 care episodes in the summarisation evaluation set) are evaluated both manually and automatically. Spearman’s rank correlation coefficient (Spearman’s rho) [52] is calculated between the average manual evaluation scores and the average scores for each of the automatic evaluation metrics for each summarisation method.

Experiment 2: In the second experiment, the summarisation methods are tested on a larger evaluation set of 136 care episodes (the 156 care episodes in the summarisation evaluation set minus the 20 used in Experiment 1). The evaluation is performed in an automated manner using four ROUGE measures. The aim is primarily to determine which summarisation method produces the best summaries. In order to test whether the differences between the scores achieved by the different summarisation methods were statistically significant, we performed the Wilcoxon signed-rank test [53] based on the scores from each ROUGE measure, for each summarisation method pair.

In both experiments, we use the same eight summarisation methods described to construct summaries for each care episode. The utilised WSMs are first constructed using the full corpus described in Section 2.1, minus the optimisation and evaluation sets mentioned.

A preliminary version of our evaluation set-up has been described in [41] (see footnote 10). Our comparison of manual and automatic evaluation is similar to the analysis conducted by Chin-Yew Lin in [40] on English newswire data when introducing the ROUGE measures.

One goal is to see if our automatic evaluation set-up is reliable, given the uncertainties related to using the original discharge summaries as gold standard summaries. This is done by independently analysing whether or not there is a correlation between how human evaluators rank the performance of the summarisation methods and how automatic evaluation metrics rank these same summarisation methods. Furthermore, we aim to reliably establish which of the tested summarisation methods (and underlying features) perform best.

2.4.1. Manual evaluation

The manual evaluation is conducted by three domain experts in the clinical field: two physicians and one nurse, all professionals in specialised care and each with over five years’ experience of working with patients suffering from heart-related health problems.

10 The F-scores from the automatic evaluation are on average noticeably lower in this study than those reported in [41]. This is primarily because here we excluded a specific type of note from all care episodes: a type of summary for the patient that is often written at the same time as the final discharge summary and whose content tends to be very similar, sometimes identical. In addition, some of the methods used in this experiment are new or different.

Table 2

Scheme used for the manual evaluation.

Evaluation criteria                                     Rating
Sender                                                  yes = 1, no = 0
Reason for admission                                    yes = 1, no = 0
Long-term diagnosis                                     yes = 1, no = 0
Procedures (e.g. operation, angioplasty)                yes = 1, no = 0
Tests (e.g. lab-test, X-ray, EKG)                       yes = 1, no = 0
Medication                                              yes = 1, no = 0
Health status at discharge                              yes = 1, no = 0
Plans for the future                                    yes = 1, no = 0
Readability: how good is the flow of the text?          0.0–1.0, 0.0 = bad to 1.0 = excellent
Readability: how good is the content of the summary?    0.0–1.0, 0.0 = bad to 1.0 = excellent

A pre-study focusing on the same type of manual evaluation was conducted in [41]. In this pre-study, a 30-item evaluation scheme (or tool) for manual evaluation was developed based on the hospital districts’ guidelines for writing discharge summaries. It used a 4-point scale ranging from −1 to 2, where −1 = not relevant, 0 = not included, 1 = partially included and 2 = fully included. However, using this scheme turned out to be difficult and extremely time-consuming. One reason for this is that quite a few of the items were somewhat overlapping and very fine-grained, like ‘conclusions’, ‘assessment of the future’ and ‘status of the disease at the end of the treatment period’. Other items were rarely documented by clinicians (physicians) in the clinical notes written during an ongoing care episode, such as ‘status of the disease at the end of the treatment period’, ‘ability to work’ and ‘continued care plan’. In addition, a couple of the items were redundant as they concern what we refer to as structured information in the EHR system, such as ‘care place’ and ‘care period’, and there is little value in trying to extract this from the text. Therefore, this manual evaluation scheme was further developed into a simplified version with only ten criteria items (see footnote 11). Eight of the ten criteria were rated dichotomously as ‘yes’ or ‘no’. These criteria items concern the contents of the discharge summary, where ‘yes’ means that the summary includes content related to the criterion. Moving from a 4-point scale to a 2-point scale was done to simplify the evaluation further. The two remaining criteria concern the readability of the summary and were rated on a scale of 0.0–1.0, where 0.0 was poor and 1.0 excellent. The scheme used in the manual evaluation is shown in Table 2. Information about what type of note each sentence belongs to, and when it was written, was presented as metadata for the manual evaluators.

Each evaluator evaluated the same 20 care episodes, with eight summaries per care episode. The inter-rater agreement between the three evaluators was calculated with the intraclass correlation coefficient (ICC) [54] with a two-way mixed model using IBM SPSS Statistics version 22. Based on the existing literature, we found no fixed limit regarding the interpretation of ICC values; one suggestion is that values below 0.4 are poor, values from 0.4 to 0.59 are fair, values from 0.6 to 0.74 are good, and values from 0.75 to 1.0 are excellent [55]. The inter-rater agreement between the evaluators in this study was good (ICC = 0.744, 95% CI 0.722–0.766, p < 0.001).

Given the quite concrete evaluation criteria in Table 2, one could intuitively assume that the best summarisation approach would be to focus on extracting those exact ten criteria items. We therefore experimented with one summarisation method that aimed to do just that. However, this performed poorly in both manual and automatic evaluation. The main reason for this is that we do not have any good way of mapping the criteria descriptions to the content in the clinical notes. For example, there is no straightforward way of mapping ‘long-term diagnosis’ to a sentence not explicitly containing these exact or similar words. A sentence mentioning long-term diagnosis could be: ‘the patient has been suffering from high blood pressure for the last four years.’

11 A pilot test was conducted in the process of developing the manual evaluation scheme.

Fig. 5. Graph illustrating the trend for how the automatic evaluation metrics correlate with the manual evaluation of summaries from 20 care episodes. All evaluation entries have been normalised by dividing the scores by their max scores. The summarisation methods are arranged according to the manual evaluation scores, and the lines visualise how ROUGE measures follow the trend of the manual evaluations.

2.4.2. Automatic evaluation

Automated evaluation of summaries generated from a care episode is performed by using the accompanying original discharge summary as a gold standard. This exploratory approach circumvents the need for manually constructing such a gold standard.

The ROUGE evaluation toolkit [40] contains multiple n-gram-based evaluation metrics that are commonly used for automatic summarisation scoring, such as in the Document Understanding Conferences (DUC) and the Text Analysis Conferences (TAC) [56]. ROUGE basically works by calculating the n-gram overlap between a target summary (the summary that is to be evaluated) and one or more gold standard summaries. The outputs from these metrics are precision, recall and F-score, reflecting the overlap between the target and gold standard summaries. The average F-scores are what we report here. As there are several metrics to choose from, we use the following (see footnote 12):

• ROUGE-N1: unigram co-occurrence statistics.
• ROUGE-N2: bigram co-occurrence statistics.
• ROUGE-L: longest common sub-sequence co-occurrence statistics.
• ROUGE-SU4: skip-bigram and unigram co-occurrence statistics.
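To make the n-gram overlap idea concrete, the following sketch computes a simplified ROUGE-N precision, recall and F-score against a single gold standard summary. It is not the ROUGE toolkit itself: the function names are hypothetical, tokenisation is plain whitespace splitting, and toolkit options such as stemming or stop-word removal are not modelled.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of the n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: returns (precision, recall, F-score)."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical target summary and gold standard.
target = "patient has high blood pressure"
gold = "the patient has high blood pressure and diabetes"
p1, r1, f1 = rouge_n(target, gold, n=1)
p2, r2, f2 = rouge_n(target, gold, n=2)
```

In this toy case every unigram of the target occurs in the gold standard, so precision is 1.0, while recall is 5/8 because the gold standard contains three further tokens.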

3. Results

3.1. Results for Experiment 1

To visualise how the evaluations correlate, we have plotted the scores from the manual and automatic evaluations in a graph, shown in Fig. 5.

12 We found the listed ROUGE metrics to be the most commonly used metrics in the literature.

Table 3

Spearman’s rho results, indicating how the automatic evaluation metrics correlate with the manual evaluation scores over 20 care episodes.

Evaluation metric    Spearman’s rho (p-value)
ROUGE-N1             0.9048 (0.00201)
ROUGE-N2             0.9524 (0.00026)
ROUGE-L              0.9524 (0.00026)
ROUGE-SU4            0.9048 (0.00201)

The correlations between manual and automatic evaluations were calculated using Spearman’s rho. The results are shown in Table 3. Based on the statistical analysis and p-values in Table 3, the four ROUGE measures have a high correlation with the manual evaluations.
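For readers who want to reproduce this kind of rank comparison, the sketch below computes Spearman’s rho as the Pearson correlation of rank values. The function names and the toy scores are hypothetical and are not the study’s data. As a point of orientation, with eight methods and no tied ranks, rho = 1 − 6Σd²/(n(n² − 1)), so a value of 0.9524 corresponds to Σd² = 4 (for example, two pairs of adjacent methods swapped) and 0.9048 to Σd² = 8.

```python
import math


def average_ranks(values):
    """Ranks starting at 1 (smallest value); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical average scores for eight methods (not the reported results).
manual_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
automatic_scores = [0.45, 0.50, 0.40, 0.35, 0.30, 0.25, 0.10, 0.15]
rho = spearman_rho(manual_scores, automatic_scores)
```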

3.2. Results for Experiment 2

The results from the automatic evaluation of 136 care episodes are shown in Table 4. The r columns show the internal ranking of each summarisation method for each evaluation measure. A more illustrative representation is shown in Fig. 6.

We calculated significance levels using the Wilcoxon signed-rank test, with p < 0.05 (with Bonferroni correction for multiple hypothesis testing). Based on the p-values, the methods could be divided into three groups. First, Oracle significantly outperformed all the other methods against all of the ROUGE measures (highest p-value: 2.12·10^−22, ROUGE-N1, Oracle vs. Translate). Second, Composite, Case-Based and Translate significantly outperformed the methods in the third group (RepeatedSentences, LastSentences, Centrality and Random) against all ROUGE measures (highest p-value: 3.74·10^−4, ROUGE-N2, Translate vs. LastSentences). For each method in this third group, there was at least one ROUGE measure on which it did not differ significantly from the Random method. The p-values for all comparisons are included in the supplementary materials.
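The effect of a Bonferroni correction on the significance threshold can be illustrated as follows. How many comparisons the correction was applied over is not stated in this section, so the count used below (every method pair on every ROUGE measure) is an assumption made purely for illustration; the names in the sketch are likewise hypothetical.

```python
from itertools import combinations

METHODS = [
    "Composite", "Oracle", "Case-Based", "Translate",
    "Random", "RepeatedSentences", "LastSentences", "Centrality",
]
MEASURES = ["ROUGE-N1", "ROUGE-N2", "ROUGE-L", "ROUGE-SU4"]


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Assumption: one test per unordered method pair and per measure.
method_pairs = list(combinations(METHODS, 2))
n_tests = len(method_pairs) * len(MEASURES)
threshold = bonferroni_threshold(0.05, n_tests)

# The two highest p-values quoted in the text, checked against this threshold.
reported_p_values = {
    ("Oracle", "Translate", "ROUGE-N1"): 2.12e-22,
    ("Translate", "LastSentences", "ROUGE-N2"): 3.74e-4,
}
still_significant = {
    key: p < threshold for key, p in reported_p_values.items()
}
```

Under this assumed count there are 28 method pairs and 112 tests, giving a per-test threshold of roughly 4.5·10^−4, which both quoted p-values stay below.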

Based on the analysis we can divide the methods (not counting the Oracle method) into two groups: Composite, Case-Based and Translate are successful at producing summaries that outperform the simple baseline methods in all comparisons, whereas the Centrality and RepeatedSentences methods fall in the same group with the simple baseline approaches.

Table 4

F-scores from the automatic evaluation of 136 care episodes. The order of the summarisation methods is the same as in Fig. 5.

Sum. method          ROUGE-N1   r   ROUGE-N2   r   ROUGE-L   r   ROUGE-SU4   r
Composite            0.3820     2   0.1849     2   0.3678    2   0.1865      2
Oracle               0.4819     1   0.2865     1   0.4683    1   0.2694      1
Case-Based           0.3634     4   0.1741     3   0.3497    4   0.1764      3
Translate            0.3703     3   0.1661     4   0.3551    3   0.1720      4
Random               0.3043     7   0.1177     7   0.2949    7   0.1241      7
RepeatedSentences    0.3301     5   0.1408     5   0.3196    5   0.1463      5
LastSentences        0.3287     6   0.1398     6   0.3184    6   0.1462      6
Centrality           0.2862     8   0.1027     8   0.2743    8   0.1151      8

Fig. 6. Graph illustrating how the various summarisation methods perform against a set of 136 care episodes. All evaluation entries have been normalised by dividing scores by their max scores. The order of the summarisation methods is the same as in Fig. 5; the lines highlight the similar trends for the ROUGE measures over all the summarisation methods.

Without the Bonferroni correction, the significantly differing groups would be as follows:

1. Oracle
2. Composite
3. Case-Based, Translate
4. RepeatedSentences, LastSentences
5. Centrality, Random

4. Discussion

In this work we consider a variety of resource-lean and language-independent summarisation methods for clinical text. These methods circumvent the need for tailored language resources and tools. The proposed summarisation methods utilise WSMs constructed from word co-occurrence statistics in a large corpus of clinical text (see Section 2.1). This enables us to capture various semantic similarity relations in the clinical text in an automatic, data-driven way. The aim is not to construct perfect summaries that can fully replace individual clinical notes or completely automate the process of producing discharge summaries, for example. Rather, this work is a step towards exploring ways of automatically constructing indicative clinical text summaries by relying on purely statistical features for determining a sentence’s significance.

We introduce a scheme that domain experts can use to manually compare the relative quality of different automatically produced summaries (and the underlying summarisation methods). The proposed scheme consists of a 10-item questionnaire measuring the expert’s opinion of the readability of the summary, and whether it has relevant content. The scheme has been developed based on experiences from our preliminary study on evaluating clinical summarisation methods [41], resulting in a more streamlined tool that is easier to use consistently.

However, such manual evaluation requires human input and is thus impractical to use during summarisation method development, where rapid feedback is required when testing different method variations. Therefore, we also use the ROUGE toolkit for performing automated evaluation. We also seek to establish whether the automated ROUGE-based evaluation can be used in place of human evaluation in the context of clinical summarisation. This meta-evaluation is performed through rank correlation coefficient analysis between the manual and automated evaluation. Finally, we aim to establish which summarisation method performs best in the task of clinical summarisation.

The results from Experiment 1 show that there is a correlation between how the manual and automatic evaluations rank the different summarisation methods. This indicates that using an automated ROUGE-based evaluation set-up is feasible. Further, it shows that the automatic evaluation scores, with the applied evaluation set-up, are reliable for determining which summarisation method performs best. The observation that the manual evaluators preferred the Composite method to the Oracle method indicates that the greedy search strategy, based on the original discharge summary, does not necessarily produce the best possible extraction-based (discharge) summary.


The results from Experiment 2 show that the methods Composite, Case-Based and Translate all work better than the basic baseline methods (p < 0.05) (not taking into consideration the Oracle method), whereas the Centrality baseline fails to outperform even the Random baseline with this data. Composite, which consists of combined features from RepeatedSentences, Case-Based and Translate, consistently has the highest ROUGE performances. However, the difference between this and the next best methods is not statistically significant against all ROUGE measures following the Bonferroni correction.

When producing the summaries, the Composite method combines the following basic principles:

• The importance of a sentence depends on how many times the same or similar information has been mentioned throughout a care episode (RepeatedSentences).
• By looking at discharge summaries of other similar care episodes, one can assess the importance of a sentence based on whether or not the same or similar information has been written in these summaries (Case-Based).
• If, using a WSM-based translation system, a sentence (its vector representation) can be ‘translated’ into a vector representation that is similar to how this same sentence would look in the translated word space, it should be considered for inclusion in the final summary (Translate).
• Clustering sentences into topics that span across clinical notes in a care episode allows for the removal of redundancy.

Centrality is evaluated as being one of the lowest-scoring summarisation methods. Given its broad usage in text summarisation for other domains, this deserves a closer analysis. We asked the evaluators to comment on the structure and content of the summaries that this method produced using open-ended questions. The three questions were:

1. What important information is missing from the summary?
2. What information in the summary is unnecessary?
3. How logical is the structure of the summary?

The following sums up what they wrote based on the analysis of five summaries:

• Disorganised structure of text, confusing, illogical order or structure.
• The end is missing.
• Cannot get an overall view of patients’ care episode.
• Important information is missing.
• Information is diffuse and fragmented.
• Sentences are not connected.
• Too many details about unimportant stuff.

This seems to indicate that the most ‘central’ information, independent of when it was written, is not a good indicator of the information that clinicians want to have in the discharge summary. This method did not include sentence topic clustering, which was used in several of the other methods. This further supports the importance of such topic clustering despite the relatively poor performance of RepeatedSentences. In future work, other variations and implementations of centrality-based methods should be evaluated, e.g. through the use of the MEAD system [57], similarly to how it is done in [37].

LastSentences performs relatively poorly in comparison to many of the other summarisation methods. This is an interesting observation in that it suggests that reading only the latest written information or note(s) is suboptimal when the task is to write a discharge summary. It also suggests that clinicians could benefit from using text summarisation systems in their work, e.g. to assist in highlighting relevant information documented earlier in a care episode.

Even with our rather coarse-grained manual evaluation, when applied to a limited number of care episodes, a high correlation is seen with the automatic evaluation. Hence, this automatic evaluation approach can be used to rank the different summarisation methods in order of effectiveness. And since such manual evaluation is not affordable every time a summarisation method has been modified, or when a new method is developed, it should be possible to resort to this automatic evaluation during the method development process.

This study raises questions about the usability, reliability and usefulness of such (imperfect) automatic summarisation systems, particularly when used at the point of care. This is difficult to assess based on the utilised evaluation approach and scores achieved here. The question is: what does it actually mean to have a system that is able to generate textual summaries containing parts of or all (i.e. perfect evaluation score) the content one would expect to find in a manually-created discharge summary? One answer is that it would likely provide a good starting point for a clinician who is about to write the actual discharge summary. It is also likely that the same automatically generated summary would provide an indicative overview of the information having been documented during the corresponding care episode, from a clinician’s perspective. However, patient safety issues must be considered before this kind of system is taken into practice. On the one hand, it is important that the most relevant information needed for safe care provision is assured in automatically generated summaries. On the other hand, as long as clinicians treat the generated summaries as an indicative summary, this could be a helpful feature in EHR systems, particularly in situations where time is of the essence. Future research including more user-centred evaluation is required to answer this question in more detail.

A weakness of this study is the validity of the evaluation. The utilised manual evaluation scheme is quite coarse-grained in that it contains ten criteria items, and the rating is done on the level of ‘yes’ or ‘no’. However, in the previously mentioned pre-study [41], a 30-item evaluation scheme was tested, using a four-point scale, but was found to be too detailed and time-consuming to use.

The automatic evaluation is performed using the original discharge summary as a gold standard summary, despite the fact that these discharge summaries are not themselves produced in a purely extractive way. This is reflected in the fact that the ROUGE scores are arguably quite low compared to scores reported in various other studies on text summarisation (see e.g. [40]). The scores achieved by Oracle indicate the maximum ROUGE-N2 scores achievable with an extractive-based summarisation system for our data. However, it is encouraging to see that there is a correlation in terms of relative goodness between manual and automatic evaluation, both here and in the pre-study [41], which is promising for future work in this direction.

An alternative evaluation approach would be to manually develop gold standard summaries in a purely extractive way for a set of care episodes, replacing the original discharge summaries as gold standard summaries in the automatic evaluation. This approach was not pursued here, as it is more resource-intensive, but it would possibly give us more reliable results. Another approach would be to use the summarisation system in a (simulated) clinical setting with clinicians as users. Such an evaluation approach is referred to as extrinsic evaluation, and could potentially shed light on the impact on documentation speed and quality, as well as on health care quality and patient outcomes. This type of evaluation could also potentially provide directions for future work on improving the summarisation system.


Currently it is difficult to assess the usefulness and potential impact that this type of summarisation system could have in a real clinical setting. On the one hand, it could be a convenient tool for clinicians in terms of providing an indicative textual overview of ongoing care episodes, for example. On the other hand, the possible imperfection of the information presented in the generated summaries must be considered in relation to potential patient safety issues. Future work should focus more on extrinsic evaluation by evaluating how the use of automatic text summarisation systems in a clinical setting will impact on documentation speed and quality, as well as health care quality and patient outcomes. Here we believe that a more user-guided summarisation system is needed, enabling real-time incremental summary generation, similar to the methods proposed in [58]. This would mean that the computer-generated summary, or the sentences that it suggests for inclusion in the final summary, are calculated based on analysing what content the user has already written in (or imported into) the summary.

5. Conclusion

This work on the automated summarisation of free text in care episodes introduces and evaluates both a framework for evaluating summarisation methods and novel methods for performing the summarisation. Most of the presented summarisation methods rely on statistical information derived from a large corpus of clinical text; this includes various WSMs. The best performing summarisation methods, according to the applied evaluation, are Composite, Case-Based and Translate. The ROUGE-based evaluation measures are shown to correlate highly with the manual evaluation in terms of relative ranking. Based on these results, we believe that the explored sentence features, especially those in the Composite method, provide useful directions on how to approach this summarisation task in a resource-lean fashion. Further studies are needed to assess the applicability of such methods in real-world clinical settings.

Conflict of interest

The authors declare that they have no conflicts of interest.

Acknowledgements

This study was partly supported by the Research Council of Norway through the EviCare project (project no. 193022), Turku University Hospital (EVO 2014), and the Academy of Finland (project no. 140323). The study is part of the research projects of the IKITIK consortium (http://www.ikitik.fi). We would like to thank the manual evaluators for their contributions. We would also like to thank Filip Ginter for assisting us in the work on Word2vec. Finally, we would like to thank the reviewers for their insightful comments. This paper has been proofread by Lingsoft Language Services Oy, and this was financed by the Department of Computer and Information Science, Norwegian University of Science and Technology.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.artmed.2016.01.003.

References

[1] Hall A, Walton G. Information overload within the health care system: a literature review. Health Inf Libr J 2004;21(2):102–8, http://dx.doi.org/10.1111/j.1471-1842.2004.00506.x.

[2] Van Vleck TT, Stein DM, Stetson PD, Johnson SB. Assessing data relevance for automated generation of a clinical summary. In: Teich JM, Suermondt J, Hripcsak G, editors. AMIA annual symposium proceedings. 2007. p. 761–5.

[3] Lissauer T, Paterson C, Simons A, Beard R. Evaluation of computer generated neonatal discharge summaries. Arch Dis Child 1991;66(4 Spec No.):433–6.

[4] Kripalani S, LeFevre F, Phillips CO, Williams MV, Basaviah P, Baker DW. Deficits in communication and information transfer between hospital-based and primary care physicians: implications for patient safety and continuity of care. J Am Med Assoc 2007;297(8):831–41.

[5] Sørby ID, Nytrø Ø. Does the electronic patient record support the discharge process? A study on physicians’ use of clinical information systems during discharge of patients with coronary heart disease. Health Inf Manag J 2005;34(4):112–9.

[6] Meng F, Taira RK, Bui AA, Kangarloo H, Churchill BM. Automatic generation of repeated patient information for tailoring clinical notes. Int J Med Inform 2005;74(7–8):663–73.

[7] Wrenn JO, Stein DM, Bakken S, Stetson PD. Quantifying clinical narrative redundancy in an electronic health record. J Am Med Inform Assoc 2010;17(1):49–53, http://dx.doi.org/10.1197/jamia.M3390.

[8] Afantenos S, Karkaletsis V, Stamatopoulos P. Summarization from medical documents: a survey. Artif Intell Med 2005;33(2):157–77.

[9] Nenkova A, McKeown K. Automatic Summarization. Found Trends Inf Retr 2011;5(2–3):103–233, http://dx.doi.org/10.1561/1500000015.

[10] Carbonell J, Goldstein J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: Croft WB, Moffat A, van Rijsbergen CJ, Wilkinson R, Zobel J, editors. Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval. 1998. p. 335–6.

[11] Goldstein J, Mittal V, Carbonell J, Kantrowitz M. Multi-document summarization by sentence extraction. In: Proceedings of the 2000 NAACL-ANLP workshop on automatic summarization – volume 4, NAACL-ANLP-AutoSum ’00. 2000. p. 40–8, http://dx.doi.org/10.3115/1117575.1117580.

[12] Patil K, Brazdil P. SumGraph: text summarization using centrality in the pathfinder network. Int J Comput Sci Inf Syst 2007;2(1):18–32.

[13] Chatterjee N, Mohan S. Extraction-based single-document summarization using random indexing. In: Proceedings of the 19th IEEE international conference on tools with artificial intelligence – volume 02, ICTAI ’07. 2007. p. 448–55, http://dx.doi.org/10.1109/ICTAI.2007.28.

[14] Mishra R, Bian J, Fiszman M, Weir CR, Jonnalagadda S, Mostafa J, et al. Text summarization in the biomedical domain: a systematic review of recent research. J Biomed Inform 2014;52:457–67, http://dx.doi.org/10.1016/j.jbi.2014.06.009.

[15] Miller GA. WordNet: a lexical database for English. Commun ACM 1995;38(11):39–41, http://dx.doi.org/10.1145/219717.219748.

[16] Unified medical language system [cited 10th August 2015]. http://www.nlm.nih.gov/research/umls.

[17] International Health Terminology Standards Development Organisation: supporting different languages [cited 10th August 2015]. http://www.ihtsdo.org/snomed-ct/snomed-ct0/different-languages.

[18] World Health Organization, International Classification of Diseases (ICD).

[19] U.S. National Library of Medicine, MeSH (Medical Subject Headings) [cited 10th August 2015]. http://www.ncbi.nlm.nih.gov/mesh.

[20] Aronson AR, Lang F-M. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17(3):229–36.

[21] Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17(5):507–13.

[22] Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform 2003;36(6):462–77.

[23] Velupillai S, Kvist M. Fine-grained certainty level annotations used for coarser-grained e-health scenarios. In: Gelbukh A, editor. Computational linguistics and intelligent text processing, vol. 7182 of lecture notes in computer science. Berlin/Heidelberg: Springer; 2012. p. 450–61, http://dx.doi.org/10.1007/978-3-642-28601-838.

[24] Kvist M, Skeppstedt M, Velupillai S, Dalianis H. Modeling human comprehension of Swedish medical records for intelligent access and summarization systems – future vision, a physician’s perspective. In: Fensli R, Dale J, editors. 9th Scandinavian conference on health informatics. 2011.

[25] Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform 2009;42(5):760–72.

[26] Pedersen T, Pakhomov SV, Patwardhan S, Chute CG. Measures of semantic similarity and relatedness in the biomedical domain. J Biomed Inform 2007;40(3):288–99.

[27] Koopman B, Zuccon G, Bruza P, Sitbon L, Lawley M. An evaluation of corpus-driven measures of medical concept similarity for information retrieval. In: Chen X, Lebanon G, Wang H, Zaki MJ, editors. 21st ACM international conference on information and knowledge management, CIKM ’12. 2012. p. 2439–42, http://dx.doi.org/10.1145/2396761.2398661.

[28] Henriksson A, Moen H, Skeppstedt M, Daudaravi V, Duneld M. Synonym extraction and abbreviation expansion with ensembles of semantic spaces. J Biomed Semant 2014;5(1):6.


[29] Cohen T, Widdows D. Empirical distributional semantics: methods and biomedical applications. J Biomed Inform 2009;42(2):390–405.

[30] Cohen R, Aviram I, Elhadad M, Elhadad N. Redundancy-aware topic modeling for patient record notes. PLOS ONE 2014;9(2), http://dx.doi.org/10.1371/journal.pone.0114677.

[31] Vine LD, Zuccon G, Koopman B, Sitbon L, Bruza P. Medical semantic similarity with a neural language model. In: Li J, Wang XS, Garofalakis MN, Soboroff I, Suel T, Wang M, editors. Proceedings of the 23rd ACM international conference on conference on information and knowledge management, CIKM 2014. 2014. p. 1819–22, http://dx.doi.org/10.1145/2661829.2661974.

[32] Harris ZS. Distributional structure. Word 1954;10:146–62.

[33] Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R. Indexing by latent semantic analysis. J Am Soc Inf Sci 1990;41(6):391–407.

[34] Kanerva P, Kristofersson J, Holst A. Random Indexing of text samples for latent semantic analysis. In: Proceedings of 22nd annual conference of the Cognitive Science Society. 2000. p. 1036.

[35] Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Burges C, Bottou L, Welling M, Ghahramani Z, Weinberger K, editors. Advances in Neural Information Processing Systems 26. Neural Information Processing Systems Foundation; 2013. p. 3111–9.

[36] Luhn HP. The automatic creation of literature abstracts. IBM J Res Dev 1958;2(2):159–65.

[37] Liu S. Experiences and reflections on text summarization tools. Int J Comput Intell Syst 2009;2(3):202–18.

[38] Sarkar K, Nasipuri M, Ghose S. Using machine learning for medical document summarization. Int J Database Theory Appl 2011;4:31–49.

[39] Abulkhair M, ALHarbi N, Fahad A, Omair S, ALHosaini H, AlAffari F. Intelligent integration of discharge summary: a formative model. In: Al-Dabass D, Uthayopas P, Sa-nguanpong S, Niramitranon J, editors. 4th international conference on intelligent systems modelling & simulation. IEEE; 2013. p. 99–104.

[40] Lin C-Y. Rouge: a package for automatic evaluation of summaries. In: Marie-Francine Moens SS, editor. Text summarization branches out: proceedings of the ACL-04 workshop. 2004. p. 74–81.

[41]MoenH,HeimonenJ,MurtolaL-M,AirolaA,PahikkalaT,TeräväV,etal. Onevaluationofautomaticallygeneratedclinicaldischargesummaries.In: Proceedingsofthe2ndEuropeanWorkshoponPracticalAspectsofHealth Informatics(PAHI).2014.p.101–14.

[42]MoenH,MarsiE,GinterF,MurtolaL-M,SalakoskiT,SalanteräS.Careepisode retrieval.In:VelupillaiS,DuneldM,KvistM,DalianisH,SkeppstedtM, Hen-rikssonA,editors.Proceedingsofthe5thinternationalworkshoponhealthtext miningandinformationanalysis(Louhi)@EACL.2014.p.116–24.

[43]KarlgrenJ,SahlgrenM,JärvinenT,CösterR.Dynamiclexicaforquery trans-lation.In:PetersC,CloughP,GonzaloJ,JonesG,KluckM,MagniniB,editors. Multilingualinformationaccessfortext,speechandimages,vol.3491of lec-turenotesincomputerscience.Berlin/Heidelberg:Springer;2005.p.150–5, http://dx.doi.org/10.1007/1151964515.

[44]JonesK.Astatisticalinterpretationoftermspecificityanditsapplicationin retrieval.JDoc1972;28(1):11–21.

[45]McKeownK,KlavansJ,HatzivassiloglouV,BarzilayR,EskinE.Towards mul-tidocumentsummarizationby reformulation:progressand prospects.In: HendlerJ, SubramanianD, editors.Proceedings ofthe sixteenthnational conferenceonartificialintelligenceandeleventhconferenceoninnovative applicationsofartificialintelligence.1999.p.453–60.

[46]LimJ-M,Kang I-S,Bae J-H,Lee J-H.Sentence extractionusingtime fea-turesinmulti-documentsummarization.In:MyaengS,ZhouM,WongK-F, ZhangH-J, editors.Informationretrieval technology,vol. 3411oflecture notes in computer science.Berlin/Heidelberg: Springer; 2005. p. 82–93, http://dx.doi.org/10.1007/978-3-540-31871-28.

[47]AamodtA,PlazaE.Case-basedreasoning:foundationalissues,methodological variations,andsystemapproaches.AICommun1994;7(1):39–59.

[48]LenzM,HübnerA,KunzeM.TextualCBR.In:LenzM,BurkhardH-D, Bartsch-SpörlB,WessS,editors.Case-basedreasoningtechnology,vol.1400oflecture notesincomputerscience.Berlin/Heidelberg:Springer; 1998. p. 115–37, http://dx.doi.org/10.1007/3-540-69351-35.

[49] Brin S, Page L. The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst 1998;30(1):107–17, http://dx.doi.org/10.1016/S0169-7552(98)00110-X.

[50] Mihalcea R. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In: Proceedings of the ACL 2004 on interactive poster and demonstration sessions, ACLdemo ’04. 2004.

[51] Wan X, Yang J. Improved affinity graph based multi-document summarization. In: Proceedings of the human language technology conference of the NAACL, companion volume: short papers, NAACL-Short ’06. 2006. p. 181–4.

[52] Lehman A. JMP for basic univariate and multivariate statistics: a step-by-step guide. Cary, NC, USA: SAS Institute; 2005.

[53] Wilcoxon F. Individual comparisons by ranking methods. Biometrics 1945;1:80–3, http://dx.doi.org/10.2307/3001968.

[54] Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86(2):420–8.

[55] Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess 1994;6(4):284–90.

[56] Dang HT, Owczarzak K. Overview of the TAC 2008 update summarization task. In: Proceedings of text analysis conference 2008 workshop – notebook papers and results. 2008. p. 1–16.

[57] Radev DR, Jing H, Budzikowska M. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In: Proceedings of the 2000 NAACL-ANLP workshop on automatic summarization, vol. 4 of NAACL-ANLP-AutoSum ’00. 2000. p. 21–30, http://dx.doi.org/10.3115/1117575.1117578.

[58] Sankarasubramaniam Y, Ramanathan K, Ghosh S. Text summarization using Wikipedia. Inf Process Manag 2014;50(3):443–61.

Figure

Fig. 1. Experimental set-up. The figure shows how the experiment was conducted.
Fig. 3. Summarisation method RepeatedSentences. The example illustrates how summaries are produced by sentence topic clustering and topic scoring.
Fig. 4. The Case-Based summarisation method illustrated as a ‘CBR cycle’, based on the CBR cycle introduced by Aamodt and Plaza [47].
Fig. 5. Graph illustrating the trend for how the automatic evaluation metrics correlate with the manual evaluation of summaries from 20 care episodes.