Download Download PDF

(1)

Amplifying

Data

Curation

Efforts

to

Improve

the

Quality

of

Life

Science

Data

MariamAlqasab SchoolofComputerScience

UniversityofManchester

SuzanneM.Embury SchoolofComputerScience

SandradeF.MendesSampaio SchoolofComputerScience

Abstract

Intheeraofdatascience,datasetsaresharedwidelyandusedformanypurposesunforeseen bytheoriginalcreatorsofthedata.Inthiscontext,defectsindatasetscanhavefarreaching consequences, spreadingfromdatasettodataset, andaffecting theconsumersof data inwaysthatarehardtopredictorquantify. Someformofwasteisoftentheresult. For example,scientistsusingdefectivedatatoproposehypothesesforexperimentationmay wastetheirlimitedwetlabresourceschasingthewrongexperimentaltargets. Scarcedrug trialresourcesmaybeusedtotestdrugsthatactuallyhavelittlechanceofgivingacure. Becauseofthepotentialrealworldcosts,databaseownerscareaboutprovidinghighquality data.Automatedcurationtoolscanbeusedtoanextenttodiscoverandcorrectsomeforms ofdefect. However,insomeareashumancuration,performedbyhighly-traineddomain experts,isneededtoensurethatthedatarepresentsourcurrentinterpretationofreality accurately.Humancuratorsareexpensive,andthereisfarmorecurationworktobedone thantherearecuratorsavailabletoperformit. Toolsandtechniquesareneededtoenable thefullvaluetobeobtainedfromthecurationeffortcurrentlyavailable.

Inthispaper,weexploreonepossibleapproachtomaximisingthe valueobtained from humancurators,byautomaticallyextractinginformationaboutdatadefectsandcorrections fromtheworkthatthecuratorsdo.Thisinformationispackagedinasourceindependent form,toallowittobeusedbytheownersofotherdatabases(forwhichhumancuration effortisnotavailableorisinsufficient). Thisamplifiestheeffortsofthehumancurators, allowingtheirworktobeappliedtoothersources,withoutrequiringanyadditionaleffortor changeintheirprocessesortoolsets.Weshowthatthisapproachcandiscoversignificant numbersofdefects,whichcanalsobefoundinothersources.

Received17October2016 | Revisionreceived7June2017 | Accepted20June2017

CorrespondenceshouldbeaddressedtoMariamAlqasab,SchoolofComputerScience,OxfordRoad,Manchester,M13 9PL.Email:[email protected]

Anearlierversionofthispaperwaspresentedatthe12thInternationalDigitalCurationConference.

TheInternationalJournalofDigitalCurationisaninternationaljournalcommittedtoscholarlyexcellenceanddedicated totheadvancementofdigitalcurationacrossawiderangeofsectors. TheIJDCispublishedbytheUniversityof EdinburghonbehalfoftheDigitalCurationCentre.ISSN:1746-8256.URL:http://www.ijdc.net/

Copyrightrestswiththeauthors.ThisworkisreleasedunderaCreativeCommonsAttribution4.0 InternationalLicence.Fordetailspleaseseehttp://creativecommons.org/licenses/by/4.0/

InternationalJournalofDigitalCuration

(2)

Introduction

ManydatabaseownersmaketheirdataavailableontheWebforotherstouse. Whilein mostcasesnoassuranceisprovidedregardingthequalityofWebdata, dataisgenerally consideredtrustworthyifitsqualitylevelishigh. Akeypart ofthistaskismakingsure thatdatastoredinthedatabaseisanaccuratereflectionofreality,butotheraspectsofdata qualityshouldalsobetakenintoaccount,suchascompleteness,consistency,believability andcurrency.

Dataindatabasesarenotfixed: databaseproviderswillmake changesasandwhen theyarerequired,tomirrorthepartoftherealworldmodelledbythedatabase. Tokeep dataqualityhigh, databaseproviders useautomatic andmanualcuration techniquesto findandremovedefectsindata. Someofthesedefects canbedetectedusingautomatic tools,thoughthesetoolstendtobelimitedtothedetectionofsystematicerrorsofasimple andgenerallyapplicableform. Anexample wouldbeatool forcompletenesschecking thatdetermineswhetherallrequiredfieldscontaindata andwarnstheuseraboutthose thatareempty. However, theseautomatictools tendtobe limitedtosystematic errors, leavingthedetectionofnon-systematicerrorsunsolved. Typically,non-systematicerrors requireknowledgeandexpertisein thedomain tobedetected andfixed. Forexample, inbiomedicalarea,errorsintherecordedfunctionofaproteinneedahumanexpertand detailedanalysistodiscoverbyexaminationofnewpublications.

In many domains, this expertiseis not widely available and is not free. Domain expertshavetospendalotof timesearchingfor defectsindataandfixingthem. Some expertsmayvolunteertheirtimetocuratedataintheirdomainofinterest;inotherareas, ownersofcorecommunityresourceshavetohiredatacuratorstomaintaintheaccuracy, concurrency, completeness and consistency of their data. For example, the UniProt database1hasanumberofdatacuratorswhoarepaidfull-timetomanuallycurateUniProt

data. However,therearemanysmallerdatasourcesthataremanagedonavoluntarybasis andhavenofundingtoemploydatacuration. Thesedatabasescanbecomeout-of-date overtime.

Wewouldliketofindanautomatedwaytopackagedatacurationworkinaformthat canbecheaplyand easilyappliedto theseotherdatabasesanddataconsumerswithout requiring any extra work from the data curators. In this paper we presentIQBot, an approachtosolvingthisproblem. Ouraimistofindawaytodetectdataupdatesmade bycurators,determinethereasonbehindthechangeandsharethosethatmayhavevalue morewidely. WealsopresentexperimentalresultsconfirmingthatIQBotcanfinddefects thatarealsopresentinotherdatabases.

Work

Sincedatabaseconsumerscareaboutusinghighqualitydatatoproducecorrectresults, databaseprovidersmustalso careaboutdataquality andmusttry tokeep itas highas their(oftenscarce)resourcesallow. This hasleadtoresearch ondatacurationranging fromproposals for semi-automatictools requiring interactionfrom human experts, to

1 UniProt:http://www.uniprot.org

(3)

completelyautomatedcurationapproaches.

Abrams etal. provided atoolfor databaseproviderstoshare theirdatawithothers by storing themin a Merrittrepository2. Each time adatabase provider curates their

data,theyupdatetheversionstoredintherepository. Byupdatingtherepositoryversion, databaseusersareabletoget themostup-to-dateversion ofthedata. Agraphicaluser interfacehelps data consumersto find and download datasets that match their needs (Abramsetal.,2014).

Ravagli et al. proposed OntoBrowser, a collaborative tool for the curation of ontologies. Thetoolallowscuratorstoworkwithasinglecopyofdata,storedinacentral databaseto avoidredundancy (Ravagli, Pognan and Marc, 2016). The tool can also pre-mapunmappedontologytermsautomaticallybyusingfuzzymatching.

Somecurationtoolsarenotfullyautomatic,butrequirehumanstoparticipateinthe curationprocess. Buntetal. proposedasemi-automaticcurationtoolthatsendse-mails toauthors whohavea new publicationin thearea, andasks themto fillin a specially designedform. Inreturn,theprovidedinformationhelpscuratorstospeedupthecuration process,astheywillhaveallthebasicinformationneededtoeasilyidentifythedatatobe curated(Buntetal., 2012). However, authorsresponsestothesee-mailsarevoluntary. Thereisnoguaranteethatallauthorsemailedbythetoolwillresponseandparticipatein theprocess.

Other research proposed different approaches for automatic data curation tools. Kumarietal. evaluatedtoolssupportingon-demandcurationbyusingacasestudytosee howthepresentationofinformationaboutuncertaintyofdatacanaffecttheperformance ofthecuration. Theirstudyshowedthatthewayinwhichtheuncertaintywaspresented affected the way users interpreted the results. The authors suggested user interface guidelinesfor thepresentationofuncertaintytocuratorstoaidinefficientandaccurate curationdecisions(Kumari,AchmizandKennedy,2016).

Some researchers have tried to find ways to minimize the time needed for data curationbyreusingandsharingcurationeffortswithothers. Orchard etal. proposedthe InternationalMolecularExchange(IMEx),whichallowstheparticipantproteindatabases topooltheirliteraturecurationefforts(Orchardetal.,2012). Redundantcurationworkis preventedbytherathercoarse-grainedmechanismofassigningcertainacademicjournals toeachpartnerorganisation. Eachpartnerextractsproteininteractionrecordsonlyfrom thejournalstowhichtheyareassigned. Thecuratorsalsoneedtofollowasetofcuration rulesto make thework applicabletootherdatabases. Anotherresearch project, called MIntAct, also focused on sharing curation efforts between databases by providing a commoncuration platform (Orchard et al., 2013). Although bothIMEx and MIntAct share curation efforts with other databases, the sharing of curation efforts is limited betweenthedatabasesrunbythepartnerorganisations.

Asthisbriefsurveyshows,theresearchliteraturecontainssomeproposalsforcuration toolsaimedatimproving dataquality, includingsomeefforts withasimilar aimtothe researchdescribedinthispaper: theaimofmakingthebenefitsoflimitedcurationefforts stretch beyond the direct users of the database under curation. However, at present, approachestosharing curationeffort arelimitedtogroupsofparticipantswhoagree in advancetocollaborateandusethesamecurationtools. Inourownworkweaimtocreate bridgesbetweenthecurateddatasourcesandusersofthatsamedatainother,unconnected

2 UC3MerrittRepository:http://www.cdlib.org/services/uc3/merritt/

(4)

databases. Inaddition, ourapproach doesnotrequiredatacuratorstochangetheirtool setormakeanyadditionalefforttocollaboratewithothers.

Extracting

Defects

from

Changes

Toachieveouraimofincreasingthevalueofdatacurationwork,weproposeacomponent called an IQBot. IQBots monitor curated data sources, looking for the changes the curatorsmake from one version ofthe data tothe next. These changes were madeto addresssome problem acurator perceived in the data(to add missing dataor correct errors, orchange the representation of datato one that is moreuseful to current data consumers,forexample). Thesechangesareausefulsourceofinformationonthequality ofthedata. TheIQBotexamineseachchange,attemptstoinferthereasonforthechange, andexports any changes deemed to be of general interest for sharing with other data owners.Inthissection,weexplainhowthegeneralIQBotcomponentworks. Afterwards, wewillexplainhowthisgeneralcomponentisinstantiatedtoallowittogatherdefectand correctioninformationfromarealdatabase.

Themain tasksof theIQBot are: to detectchanges madetodata inthe monitored database, toinfer thereason for thechanges (i.e. to infer whenachange is aresult of a generallyuseful defect), to decide from that reason whetherthe change is of wider interest,andto publishitasadefect-correction pair. An IQBotdoesnotonly focuson themostrecentchanges madetodata. Ifthecurateddatasourcemaintainsahistoryof versions,theIQBotcanextractthefullhistoryofcurator-ledchangesoverthelifetimeof thesource.

ExtractingchangesusingIQBot

Algorithm1showsthemainprocessingbehaviourofIQBotaspseudocode. IQBotstarts bycomparingtheversionofthedataspecifiedbytheowner(typicallythelatestversion) forchanges. Ifthisisthefirsttimethissourcehasbeenmonitored,thenIQBotwillwork backthroughall theversions,lookingforandpublishingchanges. However,in normal useonlythetwomostrecentversionswillbecompared.

As the IQBot’s job is to extract changes in data by comparing the contents of consecutiveversions,thecurateddatabaseneedstohaveatleasttwoaccessibleversions ofdata. Ifthedatasourcetobemonitoreddoesnotexplicitlyprovide accesstoversion information,theowneroftheIQBotcanarrangetotakeperiodicsnapshotsofthesource, and to compare them for changes (Embury, Jin, Sampaio and Eleftheriou, 2014). It shouldbefurthernotedthatitisnotpartoftheIQBot’sjobtocheckwhethertheassigned curateddatabaseisversionedornot. Thistypeofinformationisprovidedbythepublisher ofthecurateddatabaseanditisoutsidethescopeofthispaper.

Whencomparingtwoversions,theIQBotfirstattemptstolocatetheIDsofallrecords thathavebeenmodifiedintheversion beingmonitored. Ifthemonitoredsourceoffers versioningfacilities,itmay bepossibletoqueryfor thissetofrecordIDsdirectly. Ifit doesnot, IQBotwillextractthefullsetofIDsinbothversions,andwillcheckeachone forchanges(thisisslow,butIQBotisnotintendedtoworkintime-criticalapplications). To extract these IDs (and the details of the records they refer to), IQBotneeds to be configuredtoaccessthedatasourcebeingmonitored,throughwhateverAPIisprovided

(5)

Algorithm1ThegeneralalgorithmoftheIQBot

Require: v: theversionofthedatasetwearemonitoringforchanges 1: vPrev=predecessorversiontov

2: whilevPrevexistsdo

3: ifvhasnotyetbeenmonitoredthen

4: changedRecs=getIDsofrecordschangedbetweenvandvPrev 5: foreachrecIDinchangedRecsdo

6: recV1=readtherecordwithidrecIDinversionvofthedata 7: recV0=readtherecordwithidrecIDinversionvPrevofthedata 8: ifrecV1!=recV0then

9: type=findChangeType(recV1,recV0)

10: reason=findTheReasonForTheChange(type,recV1,recV0) 11: publish(recID,recV1,recV0,reason)

12: endif

13: endfor

14: markversionvashavingbeenmonitored 15: endif

16: v=vPrev

17: vPrev=predecessorversiontov 18: endwhile

forprogrammaticaccesstodata. Ifnecessary, thedatamayneedtobedownloadedand hostedlocally,at thesitewhere theIQBotis running. Aplug-in mustbeprovided for IQBot,allowing ittoexecutethedataaccessqueriesit needstodoitswork againstthe monitoredsource, andtoinsulate IQBotfromdifferences indata model/representation (e.g. relationaltablesvstextfiles).

IQBotthencomparestheversionsofindividualrecords, lookingfordetailsofwhat haschanged,toidentifythetypeofthechange. Recordsmaybeaddedordeleted,attribute valuesmaybe changed,addedtoordeletedfrom. Recordsmay bemergedorsplitinto two. Differentdefectswillrequiredifferenttypesofchangetoaddressthem,andtherefore thetypeofchangegivesaclueastothedefectthatthechangecorrects.

Afurtherissuetoconsideristhetypeofchangethatshouldbemonitored. Curators maymakeagreatmanychangesofdifferentsortstodifferentpartsofthedata. Notallof thesemaybeofwiderinterest. AswithotheraspectsoftheIQBot,theownercandecide whichchanges are tobe extracted by defining a plug-incomponent, indicatingwhich partsoftheschemaaretobemonitoredforwhichtypesofchange.

FindingtheReasonfortheChange

Whenthe IQBotdetects achange in data, thenthe reasonbehind thechange needs to beinferred. However,thechangesfirstneedtobeclassified. Todothis, wedividedthe changesinto a numberof types: simple, partial and completechanges. Weidentified thesetypesbyobservingacollectionof1499changesdetectedbytheIQBotinasample curateddatasource.

Simple changes are those whichchange the form of a represented value, but not its content or meaning. Examples are corrections to spelling mistakes, changes of

(6)

punctuationand changes of letter case. Suchchanges are normally made topreserve consistency within thecurated database, or becauseof changing conventions used by curators. Wehypothesisethatthesechangesarenotofgreatinterestbeyondthecurated sourceitself.

Partialand completechangesboth involvea changeof contentormeaning. Partial changesare appliedin only some ofthe placesin whichthe change couldbe applied, whilecompletechangescovertheentiredataset. Theyoccurwhencuratorsneedtomake correctionstorecordedmeasurements, andupdatestointerpretation data, toreflectthe changing understandingof thescientificfield coveredby thedatasource. The reasons behindthemaretypicallyhighlydomainspecific,andtypicallycannotbegeneralised to applytodifferentdatabasesinotherareas. ToallowIQBottoinfer thereasonsforthese changes,aplug-incomponentmustbeprovided.

The

UniProt

Protein

Names

IQBot

Aswehaveseen, tocreateafunctioningIQBot,itisnecessary toconfigurethe general componentwithplug-incomponentsprovidingtherequireddomainandsystemspecific logic. To test our ideas, we created an IQBot to monitor changes made to UniProt: theUniversalProteinresource. UniProt isacorereferencefor biomedicalscientists. It containsaround66millionproteinentries,somemanuallycuratedandsomeautomatically curatedthroughsoftware tools. Due totheimportance ofthe dataandthescale ofthe curationtask,UniProt employsateamofexpertdata curators. Theirjobistoread and analyseallthepublicationinthearea,andapplytheirknowledgeinordertodetectandfix defectsindatatoprovidehighdataqualityfortheconsumers.

WeselectedUniProtasthemonitoredcurateddatabaseforthefollowingreasons: • UniProtdataisextensivelymanuallycuratedbydomainexperts.

• The data is continuously curated, with new releases being frequent(every four weeks). Thisallowsustogatherexperimentaldatafrequently.

• Alonghistoryofpreviousversionscanbeaccessed,allowingustoexperimentwith theuseofIQBotonalongdatalifetime.

• UniProthasaprogrammaticinterface,throughwhichdatacanbeaccessed. In the previous section, we explained thatcertain decisionsneeded to be madein ordertoconfigureanIQBottoworkwithaspecificsystem. First,weneededtoprovide adataaccessplug-inthatcanqueryUniProt dataprogrammatically(includingscraping versioninginformationfromtheUniProtwebpages). UniProtcontainsaversioned“entry” foreachprotein. Eachentryhasauniqueidentifier,knownastheaccessionnumber,and inordertoaccessversionedproteinentrydataweconstructaURIusingtheaccessionand therevision number. TheURIconsistsofafixedpart(http://www.uniprot.org/uniprot/), followedby theproteinaccessionnumberand theentry versionnumber. Forexample, http://www.uniprot.org/uniprot/P42645.txt?version=117. Eachprotein hasat leastone entryversion,andanewentryversionisproducedwhenevertheentrydataischangedin arelease.

UniProtcanprovideproteindatainvariousformats. Weusethetext-basedformat,in whicheachlineofaproteinentryfilerepresentsaspecificattribute/datatype. Lineshave

(7)

aninitialcodeoftwoletters,givinginformationaboutthecontentofthedataontheline. Forexample,theproteinnamecanbefoundonalinewhichhastheinitialcode“DE”.In ourcode,we extractedthe proteinnamebyaccessingtheproteinentry andreadingthe contentofthelinewiththiscode.

Second, we musttell the IQBot whatkinds of change it should monitor. UniProt providesarange ofinformation aboutproteins, suchastheproteinname, publications, organism classification, two- and three-dimensional structure and much more. As mentionedpreviously,theIQBotwillextractchangesindataandprovidethefullhistory ofchanges if it is applicable. To keep our experiment achievable, we currentlywork withjustonetypeofchange: curatedchangestoproteinnames. Proteinnamesarevery importanttoscientistsaskeyidentifyinginformation. However, theyarealsosubjectto importantchanges,asourunderstanding ofwhatproteinsexistandwhattheirrolesare changes.Forexample,ifaproteinisdiscoveredsimultaneouslyintwodifferentlabs,they willverylikelyuploadittwiceinUniProt,withdifferentnames. Whenacuratordiscovers this,andmergestheentries,asinglenewnamemustbeselectedforthecombinedprotein. Other scientists who continue to use the old names will cause ambiguities and other qualityissuesintheirdata.

Theprocessofcomparingchangesandassigningthechangetyperemainsthesameas describedinAlgorithm1. Proteinnamesareextractedfromconsecutiveversionsofeach changedproteinentry. The extractednames arecompared. If thenamesare different, thenthetypeofthechangeisassigned.

Discoveringthereasonsforthechangesismorecomplex. Asmentionedbefore,they arehighly domainspecific, and soneed tobe specifiedina plug-incomponent before theycanbeworkedwith. Ingeneral,additionalinformationmustbesoughtfromthedata todetermine whichkind ofreason is behindeachchange. Forexample, animportant pieceofsupportinginformationfornamechangesaretheevidencecodes(UniProt,2015). Evidencecodeswere introducedintoUniProtinSeptember2015. They aretermsfrom anontology3thatdescribethestrengthandtypeofevidencethatexistsinsupportofthe

proteintheentrymodels. Currently,UniProtisusingtenECOevidencecodes: sevenfor manualcurationandthreeforautomaticcuration,asshowninTable1.

EachECOevidencecodereferstoaspecifictypeofevidence. Forexample,ifanentry containsthecode ECO:0000250,it meansthatthe features oftheprotein asdescribed intheentrywerecopiedmanually fromanotherproteinentrythathasbeenfoundtobe highlysimilarintermsofitssequencetothenewlycurationprotein. Thisisaweakerform ofevidence thancodeECO:0000305,whichmeansthatthedescriptionoftheproteinin thatentrywasformulatedbasedonthescientificknowledgeofthecurator.

Byexaminingtheevidencecodesintheproteinentriesbeforeandafterachange,we canhypothesiseabout thereasonfor thechange. Forexample, if evidencecodesgrew stronger,thenwecanguessthatthechangewas duetothearrivalofbetterevidencethat overruledtheearlier,weakerclaim.

ForproteinentryversionsmadebeforeSeptember2015,wecannotcountontheECO evidencecodetoidentifythereasonforthechange,asitisnotprovided. Asaresult,we investigatedotherinformationresourcestobeabletoidentifythereasonforthechange. Aswellasexaminingthedataourselves,weusedanassociationruleminingtooloverthe changestoseewhichotherpartsoftheentries changedalongsideproteinnamechange.

3 TheEvidenceCodeOntology(ECO):http://www.evidenceontology.org

(8)

Table1. ECOevidencecodesusedinUniProtentriesandtheirmeaning(UniProt,2015).

EntryType ECO Meaning

Manual ECO:0000269 providedbyexperimentalevidence. Manual ECO:0000303 non-traceableauthorstatement. Manual ECO:0000250 sequencesimilarityevidence. Manual ECO:0000312 importedfromanotherdatabasemanually. Manual ECO:0000305 basedonscientificknowledgeofthecurator. Manual ECO:0000255 matchtosequencemodelevidence.

Manual ECO:0000244 acombinationofexperimentaland computationalevidence. Automatic ECO:0000256,ECO:0000259 matchtosequencemodelevidence. Automatic ECO:0000313 importedinformation.

Automatic ECO:0000213 acombinationofexperimentaland computationalevidence.

Wefoundfivereasonsfornamechangesfromobservingthecollectionof1499changesin proteinnames,whichwerementionedintheprevioussection. Thereasons,orderedfrom themostcommonreasontotheleast,are:

1. ProteinentriesarestoredinoneoftwosubsetsofUniProt: SwissProtorTrEMBL. The TrEMBLdatabasecontains proteinentries whichareautomatically curated, whiletheSwissProtdatabasehasthemanuallycuratedentries.

In our sample set of 1499 protein names, we found that the majority of name changes weremadewhenanentry wasmovedfromTrEMBLtoSwissProt. This isthepointatwhichaproteinentryisfirstsubjecttomanualcuration. Normally, inthefirstmanualcuration,thecuratorreviewsalltheinformationinanentryand fixes any errors found, includingchanging theprotein name if required. In our proteinnamesIQBot,weautomaticallycheckiftheproteinnamewaschangeddue tofirstmanualcurationbycomparingthenameofthedatabasewheretheentry is stored, fortheversionwherethechangeoccurredandthepreviousversion. Ifthe databasesare different,thenit meansthattheentry wascuratedmanually forthe firsttime.

This conclusion was backed upby theresults from theassociation rules, which indicatedthat anamechangehappenedwhenaproteinentry iscuratedmanually forthefirsttimebyacurator.

2. There are a number of cases where protein names changed due to merging of protein entries. Merging proteinentriesmeanscombiningseveralproteinentries into one entry. In manual curation, curatorsare requiredto follow the database curationguidelines,andoneoftheseguidelinesistomergeallentrieswhichhave the same gene name or same speciesinto one entry. This resultsin changes to someproteinnames(UniProt,2014). Wecheckedthisbycomparingthevalue of theaccessionnumberinbothversions,theonewherethenamechangedisdetected

(9)

and theprevious one. Ifthemore recentversionhas moresubnamesthan inthe previousentrywiththesameaccessionnumber,thenitmeansamergewasmade. 3. Curators manually review the literature on proteins, and when they find new

publicationrelatedtoaproteintheymaymake changestothatproteinentry,such as changing theproteinname. Inour code, welookat thepublicationsectionof theproteinentryversionwherethenamechangeisdetected. Ifitdiffersfromthe publicationsectionintheprevious entryversion,andnewpublicationshavebeen added,thenitmeansthatproteinnamechangeduetonewpublicationsinthearea. 4. It shouldbenoticedthat, inmanycases wecameacross, whensomeparts ofthe

namewereremoved,theremovedpartwasaddedasaflagtotheproteinentry. After investigating, werealizedthatthischangeoccurredduetochangesintheUniProt namingguidelines4. Theguidelinesdonotallowsometerminologiestobeusedin

proteinnames. Forexample,iftheproteinnamecontainstheword“Putative”,then the wordshouldberemoved fromtheproteinnameandaddedto asectioninthe entrycalledflags,asthiswordisnolongerconsideredacceptableinproteinnames. 5. Inraresituations,nothingtosupportanyoftheabovereasonscanbefoundinthe

entry data,wherenamechangeisfound. Fromreviewingthecurationmanualfor UniProt (UniProt,2014), itcanbe saidthatthecuratorsmadechangesto protein entries basedontheirknowledge, expertiseandanalysis ofthepublicationinthe domain. Sothiswouldbegivenasadefaultreasonifnoneothersaresuitable.

Experimental

Results

Inthiswork,wehypothesised thattheIQBotapproachcandiscoverdefectsin dataand findcorrectionsforthe defects,basedonaninterpretationofthechanges madebydata curators. Also,we setouttoprovidethereasonsbehindthedetectedchanges. Whenwe usedIQBottomonitorUniProt,itdetected1499changesinproteinnamefromatarget setof249proteinentries.

Totestthishypothesis,weaccessedacollectionofdatabaseswhichdealwithproteins tocheckwhetheranyofthemwereusingout-of-dateproteinnames. WeusedDBGET5,

whichisanonlinetoolthatgrantsaccesstoacollectionofbiomedicaldatabases. Using DBGETsavestimeandeffortaswedonotneedtobefamiliarwiththewayofaccessing eachdatabaseindividually;DBGETprovidesthesamemechanismtoaccessallprovided databases. We implemented a java program that accessed the biomedical databases inDBGET automaticallyand checked ifeachdatabase stillcontains any ofthe oldor updatedprotein names. We searched in eightdatabases: KEGG BRITE, GO, KEGG GENES,KEGGDGENES,KEGGORTHOLOGY,KEGGMGENES, NCBI-Geneand KEGGENZYME.

Algorithm2showsthepseudocodeoffindingthepreviousandcurrentproteinnames intheDBGET databases. Foreachdatabase, we check theavailabilityof bothprotein names(current andprevious). If thedatabasecontainsboth proteinnamesor only the

4 UniProtnamingguidelines:http://www.uniprot.org/docs/nameprot 5 DBGET:http://www.genome.jp/dbget/

(10)

Algorithm2PseudocodetosearchdifferentdatabasesinDBGET. Require: currentName: thenamethechangedproteinentrycurrenthas

Require: previousName : the name the changed protein entry current had before the change

1: results=queryEachDB(previousName) 2: forresultinresultsdo

3: if!result.has(currentName)then 4: db=result.getDB()

5: recordcurrentNameasbeingout-of-dateindb 6: endif

7: endfor

Figure1. TheresultsofcheckingtheavailabilityofoldandnewproteinnamesinDBGET

currentproteinname, thenitwillbeconsideredasanupdateddatabase. However,ifthe databasehasonly theprevious proteinname, thenthedatabasewillbeconsideredasan out-of-datedatabase.

Figure1displaystheresultsoftheexperiment. Avalueof1foradatabaseindicates thatthedatabasecontainsproteinnames,whethercurrent(new)orprevious(old)names. Theoldreferstothepreviousproteinnameandthenewreferstothecurrentproteinname. Avalueof0meansthatthedatabasedoesnotcontainanyoftheproteinnames. Figure1 showsthatonly onedatabase,KEGGGENES,containsthemostupdatedproteinnames. However,fourofthedatabaseshaveout-of-dateproteinnames.

Conclusion

and

Future

Work

In this paper, we proposed a means to amplify the curation efforts made on curated databasestogivevaluetotheownersofotherdatabasescontainingoverlapping(butnot identical)data. To date, we havetested ourIQBot onthe well-known (andwell-used) UniProtdatabase. Theresults demonstrated thatIQBot canidentify changes madeby curatorsanditcanfindthereasonbehindthechange,andthusthedefect. Wealsolocated examplesofdatabasescontainingout-of-datedata,whichweidentifiedthroughanalysis

(11)

ofthechangesmadebytheUniProtcurators.

Thereareseveralpossiblefuturedirectionstothiswork:

ECOevidencecode We mentionedpreviously that the ECOevidence code ontology containahugecollectionofevidencetermsandusesbyotherbiomedicaldatabases. This raises the possibility of linking our approach more tightly to the whole ontology ofECO evidencecode. Bydoing this, we aimto give theopportunity toallbiomedicaldatabases,whichuseECOevidencecode,tousetheIQBotwith minimal changestothe code. They wouldneedonly tospecifywhichdata tobe extractedandhowitisaccessed. Thereasoninferenceplug-incouldbereused.

Publishingtheresults Nowthat we have shown we can extract defects from curated databases,andfindthereasonsbehindthem,itistimetopublishtheresultsinaway whichmakethemapplicableforothersourcestouse. Makingtheresultsavailable ontheweb willhelpusersandownersofdatabasesthatarestillusingout-of-date datatoraisethequalityoftheirdatabyapplyingthepublishedcorrections.

Creatingamodelforcuration Another important next stepis to test IQBot against differentdatabases,toseewhethertheresultscanstillhold. Ideally,wewouldlike totryitwithsomenon-biomedicaldatabases,toallowustocheckthegeneralityof ourmodel.

References

Abrams,S.,Cruse,P.,Strasser,C.,Willet,P.,Boushey,G.,Kochi,J.,... Rizk-Jackson,A. (2014).Datashare:Empoweringresearcherdatacuration.InternationalJournalof DigitalCuration,9(1),110–118.

Bunt,S.M.,Grumbling,G.B.,Field,H.I.,Marygold,S.J.,Brown,N.H.,Millburn,G.H., Consortium,F.etal. (2012).Directlye-mailingauthorsofnewlypublishedpapers encouragescommunitycuration.Database,2012,bas024.

Embury,S.,Jin,B.,Sampaio, S.&Eleftheriou,I.(2014).Onthefeasibilityofcrawling linkeddatasetsforreusabledefectcorrections.InProceedingsofthe1stworkshop onlinkeddataquality(ldq2014)(Vol.1215).

Kumari,P.,Achmiz,S.&Kennedy,O.(2016).Communicatingdataqualityinon-demand curation.arXivpreprintarXiv:1606.02250.

Orchard, S., Ammari, M., Aranda, B., Breuza, L., Briganti, L., Broackes-Carter, F., ... Del-Toro,N.etal. (2013).Themintactproject—intactasacommoncuration platformfor11molecularinteractiondatabases.Nucleicacidsresearch,gkt1115.

Orchard,S., Kerrien, S., Abbani,S., Aranda, B., Bhate, J., Bidwell, S.,... Cesareni, G. et al. (2012). Protein interactiondata curation: The international molecular exchange(imex)consortium.Naturemethods,9(4),345–350.

(12)

Ravagli,C.,Pognan,F.&Marc,P.(2016).Ontobrowser:Acollaborativetoolforcuration ofontologiesbysubjectmatterexperts.Bioinformatics,btw579.

UniProt.(2014).Uniprotmanualcurationsop.http://www.uniprot.org/docs/sop_manual _curation.pdf.[Online;accessed30July2016].

UniProt.(2015).Evidences. http://www.uniprot.org/help/evidences. [Online;accessed 23May2016].

Download Download PDF

Amplifying

Data

Curation

Efforts

to

Improve

the

Quality

of

Life

Science

Data

Introduction

Related

Work

Extracting

Defects

from

Changes

The

UniProt

Protein

Names

IQBot

Experimental

Results

Conclusion

and

Future

Work

References