
An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods

Contents lists available at ScienceDirect

Artificial Intelligence in Medicine

journal homepage: www.elsevier.com/locate/aiim


Giorgio Valentini a,∗, Alberto Paccanaro b, Horacio Caniza b, Alfonso E. Romero b, Matteo Re a

a AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, via Comelico 39/41, 20135 Milano, Italy

b Department of Computer Science and Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham TW20 0EX, UK

Article info

Article history:
Received 11 September 2013
Received in revised form 5 March 2014
Accepted 10 March 2014

Keywords:
Gene disease prioritization
Network integration
Heterogeneous data fusion
MeSH descriptors
Node label ranking

Abstract

Objective: In the context of “network medicine”, gene prioritization methods represent one of the main tools to discover candidate disease genes by exploiting the large amount of data covering different types of functional relationships between genes. Several works proposed to integrate multiple sources of data to improve disease gene prioritization, but to our knowledge no systematic studies focused on the quantitative evaluation of the impact of network integration on gene prioritization. In this paper, we aim at providing an extensive analysis of gene-disease associations not limited to genetic disorders, and a systematic comparison of different network integration methods for gene prioritization.

Materials and methods: We collected nine different functional networks representing different functional relationships between genes, and we combined them through both unweighted and weighted network integration methods. We then prioritized genes with respect to each of the considered 708 medical subject headings (MeSH) diseases by applying classical guilt-by-association, random walk and random walk with restart algorithms, and the recently proposed kernelized score functions.

Results: The results obtained with classical random walk algorithms and the best single network achieved an average area under the curve (AUC) across the 708 MeSH diseases of about 0.82, while kernelized score functions and network integration boosted the average AUC to about 0.89. Weighted integration, by exploiting the different “informativeness” embedded in different functional networks, outperforms unweighted integration at 0.01 significance level, according to the Wilcoxon signed rank sum test. For each MeSH disease we provide the top-ranked unannotated candidate genes, available for further biomedical investigation.

Conclusions: Network integration is necessary to boost the performance of gene prioritization methods. Moreover the methods based on kernelized score functions can further enhance disease gene ranking results, by adopting both local and global learning strategies, able to exploit the overall topology of the network.

© 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/3.0/).

1. Introduction

The rising awareness that a disease is rarely a consequence of an abnormality in a single gene, but is usually the result of complex interactions and perturbations involving large sets of genes and their relationships with several cellular components, led to the development of “network medicine”, a network-based approach to human disease [1]. In this context, gene prioritization methods have progressed quickly with the aim of discovering candidate “disease” genes by exploiting the large amount of available “omics” data covering different types of relationships between genes [2].

According to [3], automatic gene prioritization methods typically produce their outputs either by filtering the candidate genes into smaller subsets or by ranking the candidate genes.

Filtering methods are based on the definition of a set of criteria motivated by the available knowledge of the molecular basis of the disease under investigation. Their main objective is to reduce the set of potential disease genes by comparing all the candidate genes with a sort of gene template, which encodes the selection criteria in a set of rules [4,5]. Despite having been proved effective [6,7], the hard filtering policy underlying their functioning is a double-edged sword. Indeed, when a relevant gene fails to meet just one of the criteria encoded in the filter, it becomes a false negative, and this prevents the detection of genes that are actually involved in the disease, but through mechanisms not previously reported in the literature.

∗ Corresponding author at: Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, Milano, Italy. Tel.: +39 02 50316225; fax: +39 02 50316373.
E-mail address: [email protected] (G. Valentini).
http://dx.doi.org/10.1016/j.artmed.2014.03.003

The second class of gene prioritization methods (ranking-based) avoids the limitations of filtering methods simply by ranking candidates from the most to the least promising. As in the case of filtering methods, ranking-based methods can integrate multiple sources of evidence in the gene prioritization process. These methods can be further classified into three main categories [3]: text mining [8,9], similarity profiling and network analysis-based [10–13].

Although powerful in their ability to make very effective use of the available knowledge, text mining approaches show a strong bias toward the identification of straightforward candidates for which abundant knowledge is already available [14]. By contrast, similarity profiling [15] and network analysis-based gene prioritization systems are not affected by this limitation. Indeed they can exploit both knowledge bases (increasing the specificity of predictions) and raw data (for novel predictions).

In particular, network-based methods are gaining increasing popularity in disease gene prioritization (see [16,17] for recent specific reviews). According to this approach, nodes represent genes and edges encode some notion of functional similarity between genes, e.g. direct molecular interactions, transcriptional co-expression/regulation, sequence or structure similarity or paralogy [18]; the prioritization list is then constructed by exploiting the topology and the edge weights of the network and a set of “core” genes known to be associated with the disease under study. In this category some methods used a random walk or a heat kernel [19], while others applied Web and social network methods to a protein–protein interaction (PPI) network [20], and other approaches exploited PPI and pathway information to prioritize candidate genes [21,15].

Most gene prioritization methods exploited different sources of information and gene networks [22,23], ranging from phenotypic similarities between diseases and functional similarity between genes [24], to GO ontology and InterPro domain annotations [25] and protein–protein interactions, gene expression and common membership to KEGG pathways [26], and also to several other sets of data sources [15,27,28] (see [22] for a more detailed presentation of the different combinations of sources of evidence exploited by recent disease gene prioritization methods).

Despite the large availability of works describing specific combinations of data sets to develop tools suitable for disease gene prioritization, “our understanding of how to perform useful predictions using multiple data sources or across biological networks is still rudimentary” [3], and in particular, to our knowledge, no systematic studies focused on the comparison of different network integration methods.

To help fill this gap, in this paper we propose, compare and analyze different network integration strategies to combine multiple gene networks constructed with different sources of single or heterogeneous data. In particular we apply simple unweighted integration methods, which combine gene networks solely on the basis of the structural characteristics of the nets, and we propose weighted integration methods that combine networks according to the “predictiveness strength” of each type of network, estimated through the assessment of the accuracy of the learning algorithm trained on each of the combined networks. We constructed and integrated nine different gene networks, including also semantic similarity-based gene networks, since it has been recently shown that they improve gene-disease prioritization [29,30].

Another contribution of this work consists in the application of the kernelized score functions to the gene-disease prioritization problem. This novel semi-supervised network method for node label ranking adopts both local and global learning strategies to learn from the neighborhood of each node and, at the same time, from the overall topology of the network [31,32].

Another open issue is represented by the choice of the “seed genes” to characterize the diseases involved in the gene prioritization analysis [22]. Previous methods focused on specific diseases [33,34] or on genetic diseases [23,35] according e.g. to the Online Mendelian Inheritance in Man (OMIM) database [36]. In order to extend the analysis to a larger set of diseases, not limited to genetic disorders, in this work we used “seed genes” borrowed from the MeSH taxonomy of diseases [37], by exploiting gene-MeSH disease associations provided by the Comparative Toxicogenomics Database (CTD) [38].

Summarizing, our main contributions can be schematized as follows:

• We propose one of the widest gene-disease prioritization studies, involving gene-MeSH disease associations covering more than 700 diseases, not limited to genetic disorders.

• We propose novel weighted integration methods able to combine multiple networks according to the “predictiveness strength” of each source of data.

• A comparative analysis of different network-integration methods, and a quantitative evaluation of their impact on gene-disease prioritization.

• An extensive application of the kernelized score functions, a recently proposed semi-supervised network-based method that embeds local and global learning strategies, to the gene disease prioritization problem.

This paper is structured as follows. In Section 2.1 we introduce MeSH and the pipeline we applied to annotate the “seed genes” used in our experiments. Section 2.2 describes the functional networks considered in our experiments. Then in Section 2.4 the unweighted and weighted integration methods and in Section 2.5 the gene prioritization methods used in our experiments are introduced. The overall experimental setting is described in Section 3.1, and the results relative to the application of the gene prioritization methods to the single functional networks are discussed in Section 3.3. These results are then quantitatively compared with those obtained through unweighted (Section 3.4) and weighted (Section 3.5) network integration methods, while in Section 3.6 the top-ranked unannotated genes and the AUC and p-value associated to each of the 708 MeSH diseases analyzed in this work are presented. The conclusions outline the main findings of this work and suggest novel research lines in the context of the gene prioritization and network integration problems.

2. Materials and methods

2.1. MeSH: medical subject headings

MeSH is a controlled vocabulary produced by the National Library of Medicine for indexing, cataloging, and searching biomedical and health-related information and documents (http://www.nlm.nih.gov/mesh, accessed 30 November 2013). The descriptors or subject headings of MeSH are arranged in a hierarchy. MeSH covers a broad range of topics and its current version consists of 16 top level categories. The MeSH thesaurus is used for indexing articles from the world’s leading biomedical journals for the MEDLINE/PubMED database. One of the MeSH top level terms (Diseases) is used to label the gene sets used in our experiments and to evaluate the impact of network integration on the inference of relationships between genes and diseases.

The associations between the genes and the MeSH disease terms have been downloaded from the CTD [38], a public resource that provides information about the interaction of environmental chemicals with gene products and their effects on human diseases. These relationships are annotated from the scientific literature by professional biocurators who manually curate a triad of core interactions including chemical-gene, chemical-disease and gene-disease relationships. The CTD integrates these core data to generate inferred chemical-gene-disease networks.

Fig. 1. Pipeline of the gene–MeSH disease annotation process.

To provide a “gold standard” of “seed genes” to infer novel gene-disease associations, we first downloaded the associations between the human genes considered in our experiments (Section 2.2) and all the MeSH disease terms available in CTD. We then filtered out all the diseases associated with fewer than five or more than 200 genes, in order both to ensure a minimum amount of a priori information for our prediction tasks and to avoid classes whose associated gene sets are too heterogeneous. This led to the definition of a set composed of 708 MeSH diseases (Fig. 1).
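The size filter described above can be sketched in a few lines of Python. This is only an illustration, assuming the gene–disease associations are already available as (gene, disease) pairs; the function name `filter_diseases`, the toy identifiers and the choice of inclusive bounds are assumptions and not part of the original pipeline.

```python
from collections import defaultdict

def filter_diseases(associations, min_genes=5, max_genes=200):
    """Keep diseases annotated with min_genes..max_genes genes (inclusive)."""
    genes_by_disease = defaultdict(set)
    for gene, disease in associations:
        genes_by_disease[disease].add(gene)
    return {
        disease: genes
        for disease, genes in genes_by_disease.items()
        if min_genes <= len(genes) <= max_genes
    }

# Toy data (hypothetical identifiers): D1 has five genes, D2 only two.
toy = [(f"g{i}", "D1") for i in range(5)] + [("g0", "D2"), ("g1", "D2")]
kept = filter_diseases(toy)
```

Here only D1 survives the filter, since D2 has too few annotated genes to provide a usable set of seed genes.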

The full set of the “gold standard” seed genes–MeSH disease associations is available from http://homes.di.unimi.it/valentini/DATA/DiseaseGeneNetworks (accessed 30.11.13).

It is worth noting that the MeSH controlled vocabulary of diseases has recently been proposed in the context of text-mining-based gene prioritization [39], but those results cannot be safely generalized to network-based methods, since text-mining approaches show a bias toward genes for which a large amount of “a priori” knowledge is actually available in the literature [14].

2.2. Functional networks

We collected different sources of data to represent different functional relationships between genes. More precisely, we constructed gene networks using physical and genetic interactions, transcriptional co-expression/regulation and localization, protein domain and gene chemical interactions, co-occurrence of disease-gene pairs in scientific texts, homologues implicated in generating similar phenotypes in other organisms, common molecular pathways between gene products, and common GO annotations.

Table 1 summarizes the main characteristics of the nine gene functional networks used in our experiments. Each gene network includes a set S of 8449 genes (or a subset of them) selected according to the procedures described in [40]. We considered a set of genes for which sufficient functional data are available, and for which a relatively comparable coverage across gene networks can be assured. In this way, on the one hand a certain amount of functional information is ensured for each gene, and on the other hand the available information for each considered gene is comparable.

In the rest of this section we provide a brief description of each gene network. The full data sets are downloadable from: http://homes.di.unimi.it/valentini/DATA/DiseaseGeneNetworks (accessed 30.11.13).

2.2.1. Functional interaction network – finet

In [41] Wu and colleagues constructed a functional protein interaction network based on functional interactions predicted by a Naive Bayes classifier trained on pairwise relationships extracted from curated pathways and non-curated sources of information, including protein–protein interactions, gene co-expression, protein domain interaction, Gene Ontology (GO) annotations and text-mined protein interactions. From the original network we extracted the subnetwork including the subset S of genes used in our experiments.

2.2.2. HumanNet – hnnet

Similar in spirit to the approach in [41], the functional network construction method presented in [27] by Lee and colleagues integrates diverse lines of evidence in order to produce a functional human gene network. It has been used in several tests to predict causal genes for human diseases and to increase the power of genome-wide association studies. Also in this case we extracted from HumanNet the subnetwork including the subset S of genes.

Table 1
Characteristics of the gene networks used in our experiments.

Network  Description                                  Type         Nodes  Edges     Density
finet    Obtained from multiple sources of evidence   Binary       8449   271466    0.0038
hnnet    Obtained from multiple sources of evidence   Binary       8449   502222    0.0070
cmnet    Network projections from cancer modules      Binary       8449   3414722   0.0478
gcnet    Network projections from CTD                 Binary       7649   1421298   0.0242
bgnet    Network projections from BioGRID             Binary       8449   120169    0.0016
dbnet    Direct relationships obtained from BioGRID   Binary       8449   3023084   0.0423
bpnet    Semantic similarity network from GO BP       Real valued  6923   44506147  0.9286
mfnet    Semantic similarity network from GO MF       Real valued  6145   26611887  0.7047

Fig. 2. Simplified representation of bipartite network projections into homogeneous gene networks. (a) Binary projection to construct the cmnet network; (b) sum projection to construct gcnet. Circles represent genes, squares represent cancer modules (a) and chemicals (b).

2.2.3. Cancer module network – cmnet

By exploiting gene expression profiling, Segal and colleagues constructed a functional module map for cancer to investigate commonalities and variations between different types of tumor [42]. In their work the authors analyzed a collection of expression profiles with the aim of identifying sets of genes that act in concert to carry out specific functions in different cancer types, and then produced a module map constituted by a collection of the gene sets associated with specific cancer gene modules.

We used the relationships between the human genes and Segal’s cancer modules [42] to construct a bipartite network. This network has been projected onto the gene space, thus originating the cmnet network. The type of projection used in the construction of cmnet is a binary bipartite network projection, meaning that the weight of the edge linking two genes in the projected network is 1 if the two genes share at least one neighbour in the original bipartite network and 0 otherwise (Fig. 2a).
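A binary bipartite projection of this kind can be illustrated with a short Python sketch. It assumes the bipartite network is given as a mapping from each module to the set of genes it contains; the function name `binary_projection` and the toy module and gene labels are hypothetical.

```python
from itertools import combinations

def binary_projection(memberships):
    """Project a module-gene bipartite network onto the gene space.

    memberships maps each module to its set of genes. Two genes get an edge
    of weight 1 if they share at least one module; absent pairs have weight 0.
    """
    edges = set()
    for genes in memberships.values():
        for a, b in combinations(sorted(genes), 2):
            edges.add((a, b))
    return {edge: 1 for edge in edges}

# Toy bipartite network: genes A, B, C in module m1; genes C, D in module m2.
modules = {"m1": {"A", "B", "C"}, "m2": {"C", "D"}}
cmnet_toy = binary_projection(modules)
```

In the toy example A and D are not connected, because they have no module in common, while C is linked to all the other genes.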

2.2.4. Gene chemical network – gcnet

The CTD stores information mined from literature about the interactions between genes, chemicals and diseases in many species. Since one of the objectives of this work is the evaluation of the capabilities of heterogeneous network integration in the prediction of gene–disease relationships, we used the gene–chemical relationships available in the CTD to construct a gene interaction network (gcnet). To this end we downloaded from CTD the chemical–gene interactions file (http://ctdbase.org/reports/CTDchemgeneixns.csv.gz, accessed 30.11.13) and we constructed a bipartite network. We then performed a SUM projection onto the gene space, by which the weight of an edge linking two genes equals the number of the common neighbors of the genes in the bipartite network. The resulting network has finally been binarized using a cutoff of five or more common chemical interactors to set a binary interaction between a pair of genes (Fig. 2b).
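The SUM projection followed by binarization can be sketched as follows. The sketch assumes the bipartite network is a mapping from each chemical to the set of genes it interacts with; the names `sum_projection` and `binarize` and the toy data are illustrative assumptions, and no parsing of the CTD file is attempted.

```python
from collections import Counter
from itertools import combinations

def sum_projection(neighbours):
    """Weight of (a, b) = number of bottom nodes (chemicals) shared by a and b."""
    weights = Counter()
    for genes in neighbours.values():
        for a, b in combinations(sorted(genes), 2):
            weights[(a, b)] += 1
    return weights

def binarize(weights, cutoff=5):
    """Keep an edge only if the two genes share at least `cutoff` neighbours."""
    return {edge for edge, w in weights.items() if w >= cutoff}

# Toy data: genes A and B share five chemicals, B and C share only one.
chemicals = {f"c{i}": {"A", "B"} for i in range(5)}
chemicals["c5"] = {"B", "C"}
weights = sum_projection(chemicals)
gcnet_toy = binarize(weights)
```

With the cutoff of five used in the text, only the A–B edge is retained in the binarized network.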

2.2.5. BioGRID database network – dbnet

This is a protein–protein interaction network constructed using direct physical and genetic interactions obtained from BioGRID [43] (v. 3.2.96 – January 2013).

2.2.6. BioGRID projected network – bgnet

Instead of setting up a binary interaction network based on the direct interactions between the S genes, we constructed a bipartite network based on the content of BioGRID, but using as top nodes the S genes and as bottom nodes all the human genes B available in BioGRID. More precisely, if in BioGRID there exists an interaction between a node $a \in S$ and $x \in B$, we added the $(a, x)$ edge to the bipartite network. Then, according to a binary projection to the S space, an edge $(a, b)$, $a \in S$, $b \in S$, is added to the projected network if $a$ and $b$ share at least one common node $x \in B$ in their neighborhoods of the bipartite network. In this way we can capture indirect interactions between pairs of genes.

2.2.7. Semantic similarity-based networks: bpnet, mfnet and ccnet

The last three networks considered in this work have been constructed by computing the Resnik semantic similarities [44] between the terms of each division of the Gene Ontology: biological process, molecular function and cellular component. We obtained a pairwise gene similarity measure by choosing the maximum Resnik semantic similarity between all the terms for which the two genes are annotated. The resulting networks were named bpnet, mfnet and ccnet respectively. The semantic similarity measures have been computed using a MATLAB application implementing methods described in [45].

2.3. Basic notation

Gene networks for disease prioritization can be represented through an undirected weighted graph $G = (V, E)$, where $V$ is the set of vertices corresponding to genes and $E$ the set of edges corresponding to some notion of functional relationship between pairs of genes/vertices. Vertices of the graph and genes can be denoted with natural numbers $1, 2, \ldots, n$, since each vertex of $G$ is uniquely associated to a gene. The corresponding adjacency matrix $W$ with weights $w_{ij}$ represents the “strength” of the relationship between vertices $i, j \in V$; $V_M \subset V$ denotes a subset of “positive” vertices belonging to a specific MeSH subject heading $M$ (e.g. a MeSH descriptor of a disease – Section 2.1).

We considered the integration of $n$ gene networks, $G^d = (V^d, E^d)$, $1 \le d \le n$, and we denote by $\bar{G}$ the integrated network $\bar{G} = (\bar{V}, \bar{E})$, with $\bar{V} = \bigcup_d V^d$ and $\bar{E} \subseteq \bigcup_d E^d$. The weights of the edges $(i, j) \in E^d$ are represented with $w^d_{ij}$. Finally a set of features $x_i \in X$ can be associated to a gene $i$. For instance, $x_i$ could represent the genetic or protein interactions, the expression profile or whatever available data for a given gene/vertex $i$.

2.4. Network integration methods

We designed and applied different network integration methods to combine different sources of evidence of functional relationships between genes. Our aim consists in providing an analysis of the impact of network integration on gene prioritization, in order to understand whether the combination of multiple networks, constructed from different sources of information, can significantly enhance the performance of gene prioritization methods, and to provide a quantitative assessment of this hypothesized improvement. To this end we deliberately considered relatively simple methods, ranging from unweighted to weighted network integration algorithms, excluding more complex algorithms proposed in the literature, to allow us to perform an extensive analysis involving a large set of diseases, a large set of human genes and a significant subset of the integration methods applied to gene prioritization problems.

Unweighted methods are characterized by network combinations depending only on the structure of the network itself, while weighted ones depend on an estimate of the learning capabilities of network algorithms or on the assessment of the “informativeness” of the available data. The methods proposed in Section 2.4.2 (unweighted integration) and in Section 2.4.3 (weighted integration) share several general characteristics with previously proposed methods applied in gene prioritization problems or in other computational biology problems such as gene function prediction [46–49].

For instance, unweighted approaches such as the simple union of networks have been applied to the prioritization of genes in Alzheimer’s disease using a guilt-by-association inference rule [47], or to the integration of PPI data of model organisms mapped to human through homology [19], or in the context of the functional interpretation of genomic variants to the integration of gene interaction networks [50], or to find functional modules in networks integrated from multiple public databases [51]. Other unweighted approaches for gene prioritization average the scaled Gram matrices obtained from different sources of functional information using suitable kernels [46].

Weighted approaches differ in the way the weights associated to each network are estimated. For instance, weights can be obtained through an iterative algorithm shown to be equivalent to an expectation-maximization (EM) optimization algorithm [52], or weights are learnt by solving a quadratically constrained linear program in a novelty detection setting of the gene prioritization problem [46], or in the context of the gene function prediction problem weights can be interpreted from a probabilistic standpoint [49] or estimated using the PPV (positive predictive value) associated to the edges of the graph [48].

In the following sections, we describe the network pre-processing and the unweighted and weighted network integration methods that we tested in our experiments.

2.4.1. Network pre-processing

Before the combination phase each network underwent a pre-processing step to allow for networks having different numbers of nodes, to filter some edges in too dense graphs, and to make the weights comparable across different networks. In particular, to deal with genes missing in some networks, we filled the corresponding rows/columns of the symmetric adjacency matrix $W$ with zeros. To reduce the complexity of the network and the noise introduced by too small edge weights, we eliminated edges below a given threshold. In this way we removed very weak similarities between genes, but at the same time we chose relatively low thresholds to avoid the generation of “singletons” with no connections with other nodes. In brief, we tuned the threshold for each network to guarantee that each vertex has at least one connection: for each node/gene we computed the maximum of the weights associated to its edges, and among the selected maxima we chose the minimum as a general threshold for the network. Finally, to make the weights comparable across different networks, avoiding the undesirable effect that a certain network could dominate the others because of the high values of its weights, we applied both Laplacian regularization [53] and a simple linear regularization to obtain weights $\hat{w}_{ij} \in [0, 1]$:

$$\hat{w}_{ij} = \frac{w_{ij} - \min_{x,y} w_{xy}}{\max_{x,y} w_{xy} - \min_{x,y} w_{xy}} \tag{1}$$

where indices $x, y \in V$ refer to the vertices/genes of the underlying graph.

In our experiments we adopted the regularization shown in (1), since the results were comparable with Laplacian regularization (data not shown).
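The threshold selection and the linear rescaling of Eq. (1) can be sketched in plain Python on a small dense matrix. This is a minimal sketch under stated assumptions: the matrix is symmetric with a zero diagonal, the minimum and maximum in Eq. (1) are taken over all entries of the pruned matrix, and the matrix contains at least two distinct values. The function names are hypothetical.

```python
def singleton_safe_threshold(W):
    """Minimum over nodes of each node's largest edge weight (diagonal ignored)."""
    n = len(W)
    return min(max(W[i][j] for j in range(n) if j != i) for i in range(n))

def prune_and_rescale(W):
    """Drop edges below the threshold, then apply the min-max rescaling of Eq. (1)."""
    t = singleton_safe_threshold(W)
    n = len(W)
    P = [[W[i][j] if i != j and W[i][j] >= t else 0.0 for j in range(n)]
         for i in range(n)]
    lo = min(min(row) for row in P)
    hi = max(max(row) for row in P)
    return [[(w - lo) / (hi - lo) for w in row] for row in P]

# Toy network: the per-node maxima are 0.9, 0.9 and 0.4, so the threshold is 0.4.
W = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.4],
     [0.1, 0.4, 0.0]]
R = prune_and_rescale(W)
```

Because the threshold is the smallest of the per-node maxima, every node keeps at least its strongest edge, which is why no singletons are produced.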

2.4.2. Unweighted network integration

In the unweighted network integration the combination of different networks depends only on the structure and the characteristics of each network, and no learning is involved in the computation of the integrated network.

2.4.2.1. Unweighted average (UA). One of the most widely applied approaches is represented by the UA method [46,32]. The weight of each edge of the combined network is computed simply by averaging across the available $n$ networks:

$$\bar{w}_{ij} = \frac{1}{n} \sum_{d=1}^{n} w^d_{ij} \tag{2}$$

Note that in this integration approach weights $w^d_{ij} = 0$ also contribute to the average, independently of whether the measure of functional relationship between genes $i$ and $j$ underlying the evidence source is available or not.
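As an illustration of Eq. (2), the sketch below averages edge weights across networks stored as dictionaries. The sparse dictionary representation, the function name `unweighted_average` and the toy networks are assumptions made for the example; an edge missing from a network is treated as having weight 0.

```python
def unweighted_average(networks):
    """Eq. (2): average each edge over all n networks; missing edges count as 0."""
    n = len(networks)
    edges = set().union(*networks)
    return {e: sum(net.get(e, 0.0) for net in networks) / n for e in edges}

# Two toy networks, each mapping an edge (i, j) to its weight.
net1 = {("A", "B"): 1.0, ("B", "C"): 0.5}
net2 = {("A", "B"): 0.5}
ua = unweighted_average([net1, net2])
```

The edge (B, C) appears only in the first network, so its averaged weight is halved to 0.25, which is the penalization effect discussed above.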

2.4.2.2. Per-edge unweighted average (PUA). We propose a novel method, similar to UA, that assures a high coverage of the genes included in the integrated functional network, without penalizing genes for which a specific source of data is unavailable. With respect to the UA method, PUA takes into account the fact that a given functional relationship between a pair of genes could be missing, averaging that edge only over the networks containing both genes.

More precisely, given a set of $n$ gene networks the weight $\bar{w}_{ij}$ of the edge $(i, j) \in \bar{E}$ is computed as follows:

$$\bar{w}_{ij} = \frac{1}{|D(i,j)|} \sum_{d \in D(i,j)} w^d_{ij} \tag{3}$$

where $D(i,j) = \{ d \mid i \in V^d \wedge j \in V^d \}$.
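Eq. (3) can be sketched by storing, for each network, both its node set and its edge weights. The (nodes, edges) pair representation and the function name are assumptions made for this example; it is also assumed that every edge of a network connects two genes of that network's node set.

```python
def per_edge_unweighted_average(networks):
    """Eq. (3): average edge (i, j) only over networks containing both i and j.

    Each network is a (nodes, edges) pair: a set of genes and a dict mapping
    an edge (i, j) to its weight.
    """
    all_edges = set().union(*(edges for _, edges in networks))
    integrated = {}
    for (i, j) in all_edges:
        weights = [edges.get((i, j), 0.0) for nodes, edges in networks
                   if i in nodes and j in nodes]
        integrated[(i, j)] = sum(weights) / len(weights)
    return integrated

# Gene C is not covered by the second network at all.
net1 = ({"A", "B", "C"}, {("A", "B"): 1.0, ("B", "C"): 0.5})
net2 = ({"A", "B"}, {("A", "B"): 0.5})
pua = per_edge_unweighted_average([net1, net2])
```

Unlike the plain unweighted average, the edge (B, C) keeps its weight of 0.5 here, because the second network contains no information about gene C.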

2.4.2.3. Network maximum integration (MAX). The MAX integration selects the largest weight among all the available sources of data:

$$\bar{w}_{ij} = \max_d w^d_{ij} \tag{4}$$

This approach performs the union of all the available sources of evidence [47,51,50], and when multiple edges $(i, j)$ for a given pair of genes $i$ and $j$ are available, selects the one with the largest weight.

2.4.2.4. Network minimum integration (MIN). Analogously, the MIN integration selects the minimum weight:

$$\bar{w}_{ij} = \min_d w^d_{ij} \tag{5}$$

In practice it realizes the intersection between multiple networks. It can be implemented in two different flavours: the “drastic” algorithm (5), for which a single $w^d_{ij} = 0$ is sufficient to set $\bar{w}_{ij} = 0$, and a less drastic variant (6), in which weights set to 0 are discarded and $\bar{w}_{ij} = 0$ if and only if the weights for the edge $(i, j)$ in all the available networks are set to 0:

$$\bar{w}_{ij} = \begin{cases} 0 & \text{if } \sum_d w^d_{ij} = 0 \\ \min_d \{ w^d_{ij} \mid w^d_{ij} \neq 0 \} & \text{otherwise} \end{cases} \tag{6}$$

It is worth noting that this approach could be highly affected by noisy data. It could be reliable when a large amount of evidence is shared among different sources of data.
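The MAX and MIN rules of Eqs. (4)–(6) can be sketched on the same dictionary representation of networks. The function names and the `drastic` flag are hypothetical, weights are assumed to be non-negative, and edges whose integrated weight is 0 are simply omitted from the result.

```python
def max_integration(networks):
    """Eq. (4): largest weight per edge, i.e. the union of the networks."""
    edges = set().union(*networks)
    return {e: max(net.get(e, 0.0) for net in networks) for e in edges}

def min_integration(networks, drastic=True):
    """Eq. (5) if drastic, otherwise Eq. (6), which ignores zero weights."""
    edges = set().union(*networks)
    result = {}
    for e in edges:
        weights = [net.get(e, 0.0) for net in networks]
        if not drastic:
            weights = [w for w in weights if w != 0] or [0.0]
        if min(weights) > 0:
            result[e] = min(weights)
    return result

net1 = {("A", "B"): 1.0, ("B", "C"): 0.5}
net2 = {("A", "B"): 0.5}
mx = max_integration([net1, net2])
mn_drastic = min_integration([net1, net2])
mn_relaxed = min_integration([net1, net2], drastic=False)
```

In the toy example the drastic rule removes the edge (B, C), which is missing from the second network, while the variant of Eq. (6) keeps it with weight 0.5.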

2.4.3. Weighted network integration

The unweighted methods do not require learning any parameters from the data, while the weighted integration learns the “weight” associated to each network. The basic idea behind these approaches consists in associating a parameter to the “predictiveness strength” of each type of network. This can be realized by using a learning algorithm and identifying the “predictiveness strength” of a network with the assessed accuracy of the learning algorithm trained on the network itself.

Different weighted approaches have been proposed in the literature [46,52,48,54]. In our experiments, considering that in gene prioritization the main objective consists in effectively ranking the genes with respect to a given disease, we computed the weights according to the AUC obtained for a given MeSH descriptor. More precisely, having $n$ networks and $c$ MeSH descriptors, we can compute the weight $\alpha_d(k)$ for the $d$th network and the $k$th MeSH disease in the following way:

$$\alpha_d(k) = \frac{M_d(k)}{\sum_{j=1}^{n} M_j(k)} \tag{7}$$

where $M_d(k)$ represents the metric applied to measure the accuracy of the prediction (e.g. the AUC or the precision at a fixed recall) with respect to the $k$th MeSH descriptor and the $d$th network. The denominator in (7) simply assures that $\sum_{d=1}^{n} \alpha_d(k) = 1$. The $\alpha_d(k)$ can be computed for each MeSH descriptor $k$ by estimating the corresponding AUC by leave-one-out on the training data, that is to say, an “internal” cross validation is performed to optimize the weights, by subdividing each fold of an “external” cross validation applied to evaluate the method on the whole data set.
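The normalization in Eq. (7) is a simple rescaling of the per-network scores, as the following sketch shows. The AUC values are made-up numbers for illustration, the function name is hypothetical, and the internal cross-validation that would produce the scores is not reproduced here.

```python
def linear_weights(metrics):
    """Eq. (7): normalise the per-network scores M_d(k) so that they sum to 1."""
    total = sum(metrics)
    return [m / total for m in metrics]

# Hypothetical AUCs of three networks for one MeSH disease k.
aucs = [0.80, 0.70, 0.50]
alphas = linear_weights(aucs)
```

With these values the weights are about 0.40, 0.35 and 0.25, so a network that performs at chance level (AUC 0.5) still receives a substantial share of the total weight.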

2.4.3.1. Weighted average per class (WAP). By using the $\alpha_d(k)$ computed according to (7), the WAP method integrates the networks by assigning to each network used in the integration a weight proportional to the performance of a given learning algorithm on it:

$$\bar{w}_{ij}(k) = \sum_{d=1}^{n} \alpha_d(k)\, w^d_{ij} \tag{8}$$

It is worth noting that in this way we construct a different weighted integrated network for each MeSH descriptor.

In order to emphasize the weight of the most informative networks and, at the same time, to reduce the weights of the least informative ones, a monotonic logarithmic transformation of the weights can be applied, instead of using the one proposed in (7):

$$\alpha_d(k) = \frac{\log(1 - M_d(k))}{\sum_{j=1}^{n} \log(1 - M_j(k))} \tag{9}$$

We assume that the metric $M$ has values in $[0, 1]$ (consider, e.g. the AUC). Note that in a practical implementation, to avoid $\alpha_d(k) \to \infty$, we need to set an upper bound $b < 1$ for $M$. For instance, in our experiments we used the AUC and we set $b = 0.99$.
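The logarithmic weights of Eq. (9) and the weighted sum of Eq. (8) can be sketched as follows. The function names and toy values are assumptions; the sketch clips each score at the upper bound b and assumes that at least one score is strictly positive, so that the denominator is not zero.

```python
import math

def log_weights(metrics, upper_bound=0.99):
    """Eq. (9): weights proportional to log(1 - M_d(k)), with M clipped at b."""
    logs = [math.log(1.0 - min(m, upper_bound)) for m in metrics]
    total = sum(logs)
    return [value / total for value in logs]

def weighted_average_per_class(networks, alphas):
    """Eq. (8): edge-wise weighted sum of the networks for one MeSH descriptor."""
    edges = set().union(*networks)
    return {e: sum(a * net.get(e, 0.0) for a, net in zip(alphas, networks))
            for e in edges}

lw = log_weights([0.9, 0.6])  # hypothetical AUCs of two networks
net1 = {("A", "B"): 1.0, ("B", "C"): 0.5}
net2 = {("A", "B"): 0.5}
wap = weighted_average_per_class([net1, net2], [0.75, 0.25])
```

For AUCs of 0.9 and 0.6 the linear rule of Eq. (7) would give weights 0.6 and 0.4, whereas the logarithmic rule gives roughly 0.72 and 0.28, which illustrates how it emphasizes the most informative network.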

2.4.3.2. Weighted average (WA). The WAP method adapts the weights $\alpha_d(k)$ according to the performance of a learning algorithm on each specific class $k$ under study. On the one hand, this could lead to a set of networks well fitted to the characteristics of each class $k$, but on the other hand this approach is likely to overfit the data. For this reason we introduce a sort of “regularized” version to reduce possible overfitting problems in the learning process. More precisely we compute a regularized weight $\alpha_d$, by averaging across classes, in the spirit of the approach proposed in [55] in the context of gene function prediction problems. In this way we obtain a unique weight $\alpha_d$ for each network:

$$\alpha_d = \frac{1}{c} \sum_{k=1}^{c} \alpha_d(k) \tag{10}$$

The WA method, using the weights estimated in (10), builds a unique integrated network, independently of the MeSH disease considered:

$$\bar{w}_{ij} = \sum_{d=1}^{n} w^d_{ij} \frac{\sum_{k=1}^{c} \alpha_d(k)}{c} = \sum_{d=1}^{n} \alpha_d\, w^d_{ij} \tag{11}$$
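Eqs. (10) and (11) can be sketched by first averaging the per-class weights and then reusing the weighted sum. The table layout `alpha_by_class[k][d]`, the function names and the toy values are assumptions made for this example.

```python
def average_weights(alpha_by_class):
    """Eq. (10): alpha_by_class[k][d] holds alpha_d(k); return the mean over classes."""
    c = len(alpha_by_class)
    n = len(alpha_by_class[0])
    return [sum(row[d] for row in alpha_by_class) / c for d in range(n)]

def weighted_average(networks, alpha_by_class):
    """Eq. (11): a single integrated network shared by all classes."""
    alphas = average_weights(alpha_by_class)
    edges = set().union(*networks)
    return {e: sum(a * net.get(e, 0.0) for a, net in zip(alphas, networks))
            for e in edges}

# Hypothetical per-class weights for two classes (rows) and two networks (columns).
alpha_by_class = [[0.6, 0.4], [0.8, 0.2]]
net1 = {("A", "B"): 1.0, ("B", "C"): 0.5}
net2 = {("A", "B"): 0.5}
wa = weighted_average([net1, net2], alpha_by_class)
```

The averaged weights are 0.7 and 0.3, and the resulting network is the same whichever disease is being ranked, in contrast with the per-class networks produced by WAP.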

Note that in this section we considered the integration of graphs represented through their corresponding adjacency matrices $W$, but it is easy to see that the same methods can be applied to kernel matrices $K$ derived from $W$, by simply substituting in each equation the $w_{ij}$ elements of the adjacency matrix with the $k_{ij}$ elements of the corresponding kernel matrix (see Section 2.5.1).

2.5. Gene prioritization methods

In this section we introduce the gene prioritization methods applied in our experiments. We focused on kernelized score functions, since it has been recently shown that they are among the most competitive methods in the related problem of cancer module gene ranking [40], and on random walk algorithms, since they have been successfully applied to prioritize genes with respect to genetic diseases [19]. As a baseline method we used a simple implementation of the guilt-by-association (GBA) principle [56].

2.5.1. Kernelized score functions

Kernel-based ranking methods have been recently proposed in the context of cancer module gene ranking [40], drug ranking [57] and gene function prediction problems [58,31]. Methods based on kernelized score functions are very fast (their time complexity is approximately linear in sparse graphs, once the kernel matrix is computed) [31], and their accuracy is at least comparable with state-of-the-art gene prioritization methods [40].

The score functions $S: V \to \mathbb{R}^+$ are based on properly chosen kernels, by which we can directly rank vertices according to the values of $S(i)$: the higher the score, the higher the likelihood that a gene belongs to a given MeSH disease.

Kernelized score functions rely on distance measures defined in a suitable Hilbert space $\mathcal{H}$. More precisely, let $X$ be a general nonempty set, $\phi: X \to \mathcal{H}$ a mapping to a given universal reproducing kernel Hilbert space $\mathcal{H}$, and $K: X \times X \to \mathbb{R}$ its associated kernel function, such that $\langle \phi(\cdot), \phi(\cdot) \rangle_{\mathcal{H}} = K(\cdot, \cdot)$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ represents the inner product in $\mathcal{H}$. By choosing a distance measure on a Hilbert space, we can exploit the classical “kernel trick” [59] and we can embed any valid kernel into the distance measure itself.

It is worth noting that we extend the notion of neighbour through the kernel $K$: by choosing an appropriate kernel, node $j$ can be in the neighbourhood of node $i$ even if there is no edge between them in the original graph $G$: i.e. $w_{ij} = 0$, but $K(x_i, x_j) > 0$. From this standpoint the Gram matrix $K$ can be interpreted as a novel “weighted adjacency matrix” in the projected Hilbert space induced by the mapping $\phi: X \to \mathcal{H}$.

(7)

IfwechoosetheminimumdistanceDNNbetweeniandVM(the

setofgenesannotatedforagivenMeSHdiseaseM),wecanobtain thenearest-neighboursscoreSNN:

DNN(i,VM)=min j∈VM 1

2(xi)−(xj)

2 (12)

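By developing the square in (12) we obtain:

$$D_{NN}(i, V_M) = \min_{j \in V_M} \left[ \frac{1}{2} \langle \phi(x_i), \phi(x_i) \rangle + \frac{1}{2} \langle \phi(x_j), \phi(x_j) \rangle - \langle \phi(x_i), \phi(x_j) \rangle \right] \tag{13}$$

By substituting in (13) the inner product $\langle \phi(\cdot), \phi(\cdot) \rangle$ with a suitable kernel $K(\cdot,\cdot)$, we can obtain a similarity measure simply by changing the sign: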

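$$Sim_{NN}(i, V_M) = -\min_{j \in V_M} \left[ \frac{1}{2} K(x_i, x_i) - K(x_i, x_j) + \frac{1}{2} K(x_j, x_j) \right] \tag{14}$$

If the values $K(x_j, x_j)$ are equal for all $j \in V$, the first and last terms in (14) are constants that do not affect the ranking, and we can simplify (14), thus achieving the nearest-neighbours score $S_{NN}$:

$$S_{NN}(i, V_M) = -\min_{j \in V_M} \left[ -K(x_i, x_j) \right] = \max_{j \in V_M} K(x_i, x_j) \tag{15}$$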

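A natural extension of the $S_{NN}$ score can be obtained by introducing the k-nearest neighbours distance:

$$D_{kNN}(i, V_M) = \frac{1}{2} \sum_{j \in I_k(i)} \left\| \phi(x_i) - \phi(x_j) \right\|^2, \tag{16}$$

where $I_k(i) = \{ j \in V_M \mid j \text{ is ranked among the first } k \text{ in } V_M \}$. By adopting a procedure similar to that used to derive the $S_{NN}$ score, we can obtain from (16) the k-nearest neighbours score $S_{kNN}$:

$$S_{kNN}(i, V_M) = \sum_{j \in I_k(i)} K(x_i, x_j) \tag{17}$$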
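As an illustration of scores (15) and (17), the following minimal sketch computes $S_{NN}$ and $S_{kNN}$ from a precomputed Gram matrix. The function names, the toy matrix and the reading of $I_k(i)$ as the k annotated genes most similar to gene $i$ are assumptions made for this example, not the implementation used in the experiments.

```python
import numpy as np

def s_nn(K, pos):
    """Nearest-neighbour score (15): largest kernel value towards the annotated genes."""
    return K[:, pos].max(axis=1)

def s_knn(K, pos, k):
    """k-nearest-neighbour score (17): sum of the k largest kernel values
    between each gene and the annotated genes (assumed reading of I_k(i))."""
    sub = K[:, pos]
    k = min(k, sub.shape[1])                 # cannot use more neighbours than positives
    top = np.sort(sub, axis=1)[:, -k:]       # k largest values of each row
    return top.sum(axis=1)

# Hypothetical Gram matrix over 4 genes; genes 0 and 1 are annotated for the disease.
K = np.array([[2.0, 0.5, 0.3, 0.0],
              [0.5, 2.0, 0.4, 0.1],
              [0.3, 0.4, 2.0, 0.6],
              [0.0, 0.1, 0.6, 2.0]])
pos = np.array([0, 1])

nn_scores = s_nn(K, pos)
knn_scores = s_knn(K, pos, k=2)
```

In a cross-validation setting the scores of interest are those of genes whose labels are hidden, so the self-similarity on the diagonal does not enter the score of a test gene.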

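Defining the distance $D_{AV}(i, V_M)$ of a vertex $i \in V$ with respect to a set of nodes $V_M$ simply as the average distance in the Hilbert space between $i$ and the set of nodes included in $V_M$:

$$D_{AV}(i, V_M) = \frac{1}{2} \left\| \phi(x_i) - \frac{1}{|V_M|} \sum_{j \in V_M} \phi(x_j) \right\|^2 \tag{18}$$

we can derive from (18) the average score $S_{AV}$:

$$S_{AV}(i, V_M) = -\frac{1}{2} K(x_i, x_i) + \frac{1}{|V_M|} \sum_{j \in V_M} K(x_i, x_j) \tag{19}$$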

This score represents the average similarity of the gene $i$ with respect to the genes belonging to the set $V_M$. If all $K(x_i, x_i)$ are equal for each $i \in V$ (i.e. the "self-similarity" of genes does not matter), we can further simplify (19) by removing its first term.
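The average score can be sketched in the same way. Expanding the square in (18) also yields a term that depends only on the annotated genes and is therefore constant across the genes being ranked, which is why it does not appear in (19). The function name `s_av`, the flag `drop_self_term` and the toy Gram matrix below are hypothetical and serve only as an example.

```python
import numpy as np

def s_av(K, pos, drop_self_term=False):
    """Average score (19) of every gene with respect to the annotated genes `pos`."""
    avg_sim = K[:, pos].mean(axis=1)          # (1/|V_M|) * sum_j K(x_i, x_j)
    if drop_self_term:                        # admissible when all K(x_i, x_i) are equal
        return avg_sim
    return -0.5 * np.diag(K) + avg_sim

# Hypothetical Gram matrix with constant diagonal; genes 0 and 1 are annotated.
K = np.array([[2.0, 0.5, 0.3, 0.0],
              [0.5, 2.0, 0.4, 0.1],
              [0.3, 0.4, 2.0, 0.6],
              [0.0, 0.1, 0.6, 2.0]])
pos = np.array([0, 1])

full = s_av(K, pos)
simplified = s_av(K, pos, drop_self_term=True)
```

With a constant diagonal the two variants differ by the same offset for every gene, so they induce the same ranking.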

Although any valid kernel $K$ can be applied to compute the above proposed scores, in the context of network-based gene prioritization we used random walk kernels [53], since they can capture the similarity between genes, taking into account the topology of the overall functional interaction network.

The Gram matrix $K$ associated to the one-step random walk kernel can be derived from the symmetric adjacency matrix $W$ of the undirected functional interaction graph $G$:

$$K = (a - 1) I + D^{-\frac{1}{2}} W D^{-\frac{1}{2}} \tag{20}$$

where $I$ is the identity matrix, $D$ is a diagonal matrix with elements $d_{ii} = \sum_j w_{ij}$, and $a$ is a value larger than 1.

The q-step random walk kernels $K_{q\text{-step}} = K^q$ can be easily obtained by matrix multiplication from the one-step random walk kernel matrix (20), where $q$ represents the number of random walk steps in the underlying graph [53]. In this way, by setting $q = 2$ or $q = 3$, two vertices are considered similar if they are directly connected or if they are connected through a path including one or two vertices. Longer paths can also be considered by setting $q > 3$: in this way we can explore the graph more deeply to find similarities between genes mediated through long paths in the graph.
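A minimal sketch of the one-step kernel (20) and of its q-step power is given below. The function name `rw_kernel`, the default `a = 2` (the text only requires a value larger than 1) and the treatment of isolated vertices are assumptions made for the example.

```python
import numpy as np

def rw_kernel(W, a=2.0, q=1):
    """q-step random walk kernel: ((a - 1) I + D^{-1/2} W D^{-1/2})^q."""
    d = W.sum(axis=1)
    inv_sqrt = np.zeros_like(d, dtype=float)
    nz = d > 0                                  # assumption: isolated vertices keep zero weights
    inv_sqrt[nz] = 1.0 / np.sqrt(d[nz])
    norm_W = inv_sqrt[:, None] * W * inv_sqrt[None, :]
    K = (a - 1.0) * np.eye(W.shape[0]) + norm_W
    return np.linalg.matrix_power(K, q)

# Toy path graph 0 - 1 - 2: vertices 0 and 2 are not directly connected.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

K1 = rw_kernel(W, q=1)
K2 = rw_kernel(W, q=2)
```

On this toy graph the one-step kernel assigns zero similarity to the pair (0, 2), while the two-step kernel assigns it a positive value through the intermediate vertex 1, which is the effect described above for $q = 2$.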

2.5.2. Random walks and random walks with restart

Kernelized score functions presented in the previous section can be interpreted as a generalization of the random walk algorithms, which have been successfully applied to gene prioritization problems [19,60]. Random walk (RW) algorithms [61] rank genes by exploring and exploiting the topology of the gene network: random walks across the network are performed starting from a subset $V_M \subset V$ of genes belonging to a specific MeSH descriptor $M$ by using a transition probability matrix $Q = D^{-1} W$, where $W$ is the adjacency matrix, and $D$ is a diagonal matrix with diagonal elements $d_{ii} = \sum_j w_{ij}$.

Starting from the initial set of probabilities $p^o$ of the genes $1 \ldots n$ of belonging to $M$, where $p^o_i = 1/|V_M|$ if $i \in V_M$, otherwise $p^o_i = 0$, the RW update rule:

$$p^{t+1} = Q^T p^t \tag{21}$$

is repeated until convergence or for a fixed number of iterations.

We can observe that the random walker could progressively "forget" the a priori information available for the MeSH descriptor $M$, by iteratively walking across the overall network. To avoid this problem, we can stop the RW algorithm after a few iterations, as outlined above, or we can apply the random walk with restart (RWR) method: at each step the random walker can move to one of its neighbours or can restart from its initial condition with probability $\theta$:

$$p^{t+1} = (1 - \theta) Q^T p^t + \theta p^o \tag{22}$$

With both RW and RWR methods, at the steady state we can rank the vector $p$ to prioritize genes according to their likelihood to belong to the MeSH disease under study.
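The two update rules (21) and (22) can be sketched as follows. The function names, the convergence tolerance, the iteration cap and the toy graph are assumptions of this example, and the code assumes that every vertex has at least one edge so that $Q$ is well defined.

```python
import numpy as np

def transition_matrix(W):
    """Q = D^{-1} W (assumes no isolated vertices)."""
    return W / W.sum(axis=1, keepdims=True)

def random_walk(W, pos, steps):
    """Plain RW, update rule (21), run for a fixed number of steps."""
    Q = transition_matrix(W)
    p = np.zeros(W.shape[0])
    p[pos] = 1.0 / len(pos)                    # initial probabilities p^o
    for _ in range(steps):
        p = Q.T @ p
    return p

def random_walk_restart(W, pos, restart_prob, tol=1e-10, max_iter=1000):
    """RWR, update rule (22), iterated until the L1 change falls below `tol`."""
    Q = transition_matrix(W)
    p0 = np.zeros(W.shape[0])
    p0[pos] = 1.0 / len(pos)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart_prob) * (Q.T @ p) + restart_prob * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy path graph 0 - 1 - 2 - 3 with gene 0 as the only annotated gene.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
pos = np.array([0])

p_rw = random_walk(W, pos, steps=1)
p_rwr = random_walk_restart(W, pos, restart_prob=0.6)
```

On this toy graph the RWR scores decrease with the distance from the seed gene, while a plain RW run for many steps would progressively lose the dependence on the starting set.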

2.5.3. Guilt by association methods

As a baseline gene prioritization method we applied a simple implementation of the guilt-by-association (GBA) principle. According to this general biological principle, a biomolecular entity that interacts or shares some features with another biomolecular entity can also share some specific biological property (for instance, its membership to a given MeSH category). In computational biology this basic biological principle has been exploited to develop methods able to assign a given biological or molecular property on the basis of the labeling of neighborhoods in biomolecular networks [56,62]. In the context of gene prioritization problems, we can assess the likelihood that a given gene belongs to a given MeSH category $M$ on the basis of the M-labeled genes directly connected to the gene under study.

We implemented a simple version of the GBA approach, in which the score for each gene is computed by choosing the maximum of the weights $w_{ij} \in W$ of the edges connecting the gene $i$ to positive labeled genes $j \in V_M$ in the neighborhood $N(i)$ of $i$:

$$S(i, M) = \max_{j \in N(i)} w_{ij} \tag{23}$$
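A minimal sketch of the GBA score (23) follows. It assumes a dense weighted adjacency matrix with zero diagonal and restricts the maximum to the positively labeled neighbours, as stated in the text; the function name and the toy weights are hypothetical.

```python
import numpy as np

def gba_scores(W, pos):
    """GBA score (23): largest edge weight between each gene and the annotated genes.
    Genes with no annotated neighbour receive a score of 0."""
    return W[:, pos].max(axis=1)

# Hypothetical weighted network over 4 genes; genes 0 and 3 are annotated.
W = np.array([[0.0, 0.8, 0.2, 0.0],
              [0.8, 0.0, 0.5, 0.0],
              [0.2, 0.5, 0.0, 0.9],
              [0.0, 0.0, 0.9, 0.0]])
pos = np.array([0, 3])

scores = gba_scores(W, pos)
```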


3. Results and discussion

3.1. Experimental set-up

One of the main goals of this work consists in performing an extensive analysis of gene-disease associations, considering a large set of diseases.

Moreover, we experimentally investigated the impact of network integration on gene prioritization, by performing a quantitative comparison of the accuracy achieved by the methods described in Section 2.5 using each of the single gene networks considered in Section 2.2 with that obtained through the network integration methods introduced in Section 2.4.

More precisely, we first assessed the "informativeness" of each single gene network by analyzing the performance of GBA, RW, RWR and kernelized score function methods. Then we performed a systematic analysis of both unweighted and weighted network integration methods, by first combining the six binary gene interaction networks and then by also exploiting the real-valued semantic similarity-based gene networks through the integration of all the nine available networks (Table 1).

Moreover, we indicated some unannotated genes as reliable "disease gene" candidates for a selected set of MeSH diseases for which we obtained robust and accurate predictions.

3.2. Evaluation of the gene prioritization and network integration methods

The generalization performance of each gene prioritization and network integration method has been assessed through a classical cross-validation procedure [63], setting to five the number of the folds. More precisely, the nodes of the graph have been randomly partitioned in five folds, and in turn a fold is selected as the test fold, while the remaining are the training folds. The labels of the test fold are removed, and the labels of the training folds are used to infer the scores to be assigned to the nodes of the test fold (in our setting we deal with gene prioritization, i.e. a ranking problem). Finally, having the scores predicted for each of the five folds (that is for the entire set of the available genes) we can apply standard measures to evaluate the correctness of the obtained gene ranking with respect to each disease. In particular we applied the AUC to evaluate the ranking of the genes. Moreover, we also applied the precision at a given recall to take into account that for several MeSH diseases we have a relatively low number of known disease genes (positive examples).
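The cross-validation protocol described above can be sketched as follows. The sketch uses the simplified average score as a stand-in scorer and computes the AUC through the rank-sum statistic with average ranks for ties; the function names, the random seed and the toy block-structured kernel are assumptions of this example.

```python
import numpy as np

def auc(scores, labels):
    """AUC from the rank-sum statistic, using average ranks for tied scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0   # average rank of the tied group
        i = j + 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def cross_validated_scores(K, labels, n_folds=5, seed=0):
    """Score every gene while its own fold's labels are hidden."""
    labels = np.asarray(labels, dtype=bool)
    n = len(labels)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    scores = np.zeros(n)
    for test_idx in folds:
        train_labels = labels.copy()
        train_labels[test_idx] = False                # remove the labels of the test fold
        pos = np.flatnonzero(train_labels)
        if len(pos) > 0:
            # simplified average score towards the training positives
            scores[test_idx] = K[np.ix_(test_idx, pos)].mean(axis=1)
    return scores

# Toy kernel with two blocks of 10 genes; the first block is the annotated set.
n = 20
block = np.arange(n) < 10
K = np.where(block[:, None] == block[None, :], 0.9, 0.1)
labels = block.copy()

cv_scores = cross_validated_scores(K, labels)
cv_auc = auc(cv_scores, labels)
```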

After the assessment of the generalization performance of the gene prioritization and network integration methods, we reported for each of the considered 708 MeSH diseases the p-value obtained through a nonparametric statistical test based on the "shuffling" of the gene labels (Section 3.6). Then we reported the 10 top-ranked unannotated genes for each MeSH disease, and we also performed an analysis of the unannotated genes as reliable "disease gene" candidates on the basis of the distribution of the scores of the annotated genes for the MeSH diseases for which we obtained a very high estimated cross-validated AUC value.

We point out that the reported results are based, in line with the literature on gene prioritization, on retrospective benchmarks, and for this reason usually offer optimistic estimates of the generalization performance, since disease associations are likely to be directly or indirectly incorporated in the gene-prioritization data sources [3]. As outlined in [64], this problem is difficult to address in an initial study and can be resolved only by long-term prospective benchmarks, wherein predictions are made on the current state of knowledge (that is the currently available annotations) and validated in future studies, that is once novel experimental evidence of disease associations becomes available.

3.3. Gene prioritization with single networks

We performed an assessment of the "informativeness" of each gene network through an extensive experimental evaluation of the average AUC results across 708 MeSH diseases, using different gene prioritization methods (Table 2). The first column of Table 2 shows the gene prioritization methods and their main associated learning parameters (see Section 2.5 for details). For each column the best average AUC results achieved by the gene prioritization methods are highlighted in bold. $S_{AV}$ and $S_{kNN}$ kernelized score functions usually achieve the best results, but RW and RWR algorithms are also sometimes comparable with kernelized score functions. The difference is statistically significant (Wilcoxon rank sum test, $\alpha = 0.01$) in favor of kernelized score functions for the data sets dbnet, finet, hnnet, bpnet and ccnet, while for the other four functional networks no statistically significant difference has been detected.

The last row of Table 2 shows the average results across methods for each gene network. We can observe that on average gene prioritization methods achieve the best results with finet and gcnet, but the AUC performances are relatively high also with hnnet and bpnet. The other nets appear to be less informative on average, but note that a certain degree of learning is achieved with each of the considered networks, since the average AUC is always significantly larger than 0.5.

It is not surprising that finet, gcnet (and also hnnet) are the most "informative" networks, since they are constructed by integrating different sources of information (Section 2.2). We only observe that with gcnet the results refer only to a subset of the genes used in our experiments (Table 1). It is also worth noting the good results obtained with semantic similarity-based networks constructed from biological process GO annotations (bpnet), even if also in this case the results are computed with respect to a subset of the S genes, and hence the comparison must be considered with a certain caution. Summarizing, the results show that all the considered gene networks bear a certain amount of information about gene prioritization with MeSH diseases. In particular, networks already constructed through the integration of different sources of evidence seem to be the most "informative" for this gene ranking task.

3.4. Gene prioritization with unweighted network integration

Our network integration experiments started with the combination of the six binary gene networks described in Section 2.2 (that is all the available gene networks excluding real-valued semantic similarity-based nets), using the unweighted combination methods presented in Section 2.4.2. Table 3 reports the average AUC results across MeSH diseases with UA, PUA and MAX integration methods. Note that we did not perform "soft" MIN integration since it is easy to see that with binary networks this method is indistinguishable from MAX, while "drastic" MIN leads to highly disconnected networks.
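The exact definitions of the unweighted combination methods are given in Section 2.4.2 and are not repeated here. Purely as an illustration, the sketch below assumes that all networks are adjacency matrices indexed over the same gene set, that UA is the element-wise mean, that MAX is the element-wise maximum and that "drastic" MIN is the element-wise minimum; these readings, the function names and the toy networks are assumptions of this example.

```python
import numpy as np

def integrate_ua(nets):
    """Assumed UA: element-wise unweighted average of the adjacency matrices."""
    return np.mean(np.stack(nets), axis=0)

def integrate_max(nets):
    """Assumed MAX: an edge takes the largest weight observed in any network."""
    return np.max(np.stack(nets), axis=0)

def integrate_min(nets):
    """Assumed 'drastic' MIN: an edge survives only if present in every network."""
    return np.min(np.stack(nets), axis=0)

# Two hypothetical binary networks over the same three genes.
A = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]])
B = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])

ua = integrate_ua([A, B])
mx = integrate_max([A, B])
mn = integrate_min([A, B])
```

Under these assumptions the minimum keeps only the edges shared by all the networks, which illustrates why an intersection-like strategy can lead to highly disconnected graphs.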

Comparing Tables 2 and 3, we can observe that unweighted integration improves the performance. This is true especially with UA and PUA methods (the difference is almost always statistically significant at $\alpha = 0.01$ significance level), but in several cases also with MAX. The improvement also depends on the gene prioritization method used. For instance unweighted integration degrades performance with $S_{NN}$ (at least with respect to the most informative single gene networks), while with the other kernelized score functions and with GBA, RW and RWR algorithms unweighted integration often improves AUC results. While a larger number of steps improves the performance of kernelized score functions, with the classical RW algorithm we observe a degradation of the performance. These results show that the classical RW tends to "forget" the initial "a priori" knowledge, while kernelized score functions retain the prior information and are able to exploit the overall topology of the network, confirming previous results [40,31].

Table 2

Single gene networks: AUC results averaged across 708 MeSH diseases. The last row shows the average results across methods for each gene network.

| Method | cmnet | bgnet | dbnet | finet | hnnet | gcnet | bpnet | mfnet | ccnet |
|---|---|---|---|---|---|---|---|---|---|
| GBA | 0.6620 | 0.6389 | 0.6683 | 0.7542 | 0.7323 | 0.7346 | 0.7134 | 0.6395 | 0.6250 |
| RW, 1 step | **0.6922** | **0.6590** | 0.6037 | 0.7356 | 0.7269 | 0.8418 | 0.7646 | 0.6985 | 0.6845 |
| RW, 2 steps | 0.6829 | 0.6462 | 0.6761 | 0.8194 | 0.7802 | 0.8220 | 0.7635 | 0.7013 | 0.6812 |
| RW, 3 steps | 0.6768 | 0.6406 | 0.6531 | 0.8157 | 0.7531 | 0.8145 | 0.7611 | 0.6985 | 0.6745 |
| RW, 5 steps | 0.6718 | 0.6316 | 0.6426 | 0.7993 | 0.6973 | 0.8089 | 0.7610 | 0.6834 | 0.6711 |
| RW, 10 steps | 0.6694 | 0.6224 | 0.6222 | 0.7575 | 0.6249 | 0.8075 | 0.7411 | 0.6790 | 0.6684 |
| RWR, θ = 0.6 | 0.6871 | 0.6515 | 0.6781 | 0.8271 | 0.7889 | 0.8401 | 0.7825 | 0.7112 | 0.6856 |
| RWR, θ = 0.9 | 0.6878 | 0.6513 | 0.6750 | 0.8242 | 0.7870 | 0.8453 | 0.7789 | 0.7085 | 0.6825 |
| $S_{AV}$, 1 step | 0.6894 | 0.6574 | 0.6717 | 0.7669 | 0.7596 | 0.8167 | 0.7889 | 0.7139 | **0.6916** |
| $S_{AV}$, 2 steps | 0.6842 | 0.6414 | **0.6831** | 0.8226 | 0.7872 | 0.8328 | 0.7888 | 0.7142 | 0.6914 |
| $S_{AV}$, 3 steps | 0.6845 | 0.6417 | 0.6752 | 0.8255 | 0.7897 | 0.8417 | 0.7879 | 0.7146 | 0.6913 |
| $S_{AV}$, 5 steps | 0.6850 | 0.6418 | 0.6778 | 0.8287 | 0.7943 | **0.8471** | 0.7839 | **0.7151** | 0.6907 |
| $S_{AV}$, 10 steps | 0.6849 | 0.6408 | 0.6804 | **0.8312** | 0.7983 | 0.8407 | 0.7640 | 0.7117 | 0.6882 |
| $S_{NN}$, 1 step | 0.6296 | 0.6263 | 0.6667 | 0.7561 | 0.7374 | 0.7308 | 0.6971 | 0.6485 | 0.6565 |
| $S_{NN}$, 2 steps | 0.6235 | 0.6105 | 0.6764 | 0.8031 | 0.7624 | 0.7316 | 0.7032 | 0.6478 | 0.6567 |
| $S_{NN}$, 3 steps | 0.6228 | 0.6105 | 0.6683 | 0.8044 | 0.7638 | 0.7365 | 0.7103 | 0.6475 | 0.6574 |
| $S_{NN}$, 5 steps | 0.6213 | 0.6107 | 0.6708 | 0.8052 | 0.7674 | 0.7481 | 0.7280 | 0.6475 | 0.6593 |
| $S_{NN}$, 10 steps | 0.6197 | 0.6136 | 0.6744 | 0.8029 | 0.7729 | 0.7774 | 0.7703 | 0.6493 | 0.6659 |
| $S_{kNN}$, 1 step, k = 3 | 0.6439 | 0.6336 | 0.6705 | 0.7635 | 0.7523 | 0.7370 | 0.7645 | 0.6812 | 0.6712 |
| $S_{kNN}$, 2 steps, k = 3 | 0.6377 | 0.6179 | 0.6817 | 0.8149 | 0.7788 | 0.7403 | 0.7705 | 0.6937 | 0.6725 |
| $S_{kNN}$, 3 steps, k = 3 | 0.6371 | 0.6183 | 0.6737 | 0.8168 | 0.7805 | 0.7482 | 0.7765 | 0.6999 | 0.6756 |
| $S_{kNN}$, 5 steps, k = 3 | 0.6362 | 0.6191 | 0.6763 | 0.8182 | 0.7845 | 0.7647 | 0.7815 | 0.7003 | 0.6788 |
| $S_{kNN}$, 10 steps, k = 3 | 0.6366 | 0.6225 | 0.6798 | 0.8172 | 0.7898 | 0.7993 | 0.7695 | 0.7021 | 0.6803 |
| $S_{kNN}$, 1 step, k = 19 | 0.6811 | 0.6523 | 0.6717 | 0.7668 | 0.7596 | 0.7860 | 0.7702 | 0.6997 | 0.6798 |
| $S_{kNN}$, 2 steps, k = 19 | 0.6756 | 0.6364 | **0.6831** | 0.8222 | 0.7871 | 0.8004 | 0.7763 | 0.7001 | 0.6799 |
| $S_{kNN}$, 3 steps, k = 19 | 0.6755 | 0.6368 | 0.6752 | 0.8249 | 0.7895 | 0.8125 | 0.7819 | 0.7008 | 0.6801 |
| $S_{kNN}$, 5 steps, k = 19 | 0.6757 | 0.6373 | 0.6779 | 0.8276 | 0.7940 | 0.8286 | **0.7902** | 0.7025 | 0.6807 |
| $S_{kNN}$, 10 steps, k = 19 | 0.6766 | 0.6373 | 0.6810 | 0.8292 | **0.7986** | 0.8402 | 0.7774 | 0.7063 | 0.6820 |
| Average | 0.6625 | 0.6338 | 0.6691 | 0.8029 | 0.7657 | 0.7955 | 0.7624 | 0.6899 | 0.6751 |

Hereinafter we limited the integration experiments to kernelized score functions only, since they usually perform as well as or better than the other compared methods, and their empirical time complexity is significantly lower than that of RW and RWR algorithms: for instance, while an entire cycle of cross-validation on the 708 MeSH classes with UA integration requires hours with RWR, the same task requires only some minutes with kernelized score functions, using an Intel i7 2.80 GHz processor with 16 GB of RAM and a Linux system.

Table 3

Unweighted integration of the six binary gene networks (without semantic similarity-based nets): AUC results averaged across 708 MeSH categories.

| Method | UA | PUA | MAX |
|---|---|---|---|
| GBA | 0.8313 | 0.8291 | 0.6589 |
| RW, 1 step | 0.8566 | 0.8563 | 0.8501 |
| RW, 2 steps | 0.8186 | 0.8178 | 0.8154 |
| RW, 3 steps | 0.7937 | 0.7925 | 0.7897 |
| RW, 5 steps | 0.7773 | 0.7760 | 0.7746 |
| RW, 10 steps | 0.7720 | 0.7704 | 0.7706 |
| RWR, θ = 0.6 | 0.8533 | 0.8528 | 0.8520 |
| RWR, θ = 0.9 | 0.8565 | 0.8531 | 0.8476 |
| $S_{AV}$, 1 step | 0.8538 | 0.8530 | 0.8286 |
| $S_{AV}$, 2 steps | 0.8562 | 0.8554 | 0.8353 |
| $S_{AV}$, 3 steps | 0.8580 | 0.8571 | 0.8405 |
| $S_{AV}$, 5 steps | 0.8596 | 0.8587 | 0.8470 |
| $S_{AV}$, 10 steps | 0.8548 | 0.8540 | 0.8485 |
| $S_{NN}$, 1 step | 0.6934 | 0.6921 | 0.6352 |
| $S_{NN}$, 2 steps | 0.6950 | 0.6936 | 0.6331 |
| $S_{NN}$, 3 steps | 0.6968 | 0.6954 | 0.6315 |
| $S_{NN}$, 5 steps | 0.7020 | 0.7004 | 0.6314 |
| $S_{NN}$, 10 steps | 0.7251 | 0.7230 | 0.6546 |
| $S_{kNN}$, 1 step, k = 3 | 0.7280 | 0.7266 | 0.6593 |
| $S_{kNN}$, 2 steps, k = 3 | 0.7304 | 0.7289 | 0.6581 |
| $S_{kNN}$, 3 steps, k = 3 | 0.7332 | 0.7317 | 0.6580 |
| $S_{kNN}$, 5 steps, k = 3 | 0.7405 | 0.7389 | 0.6627 |
| $S_{kNN}$, 10 steps, k = 3 | 0.7636 | 0.7616 | 0.6987 |
| $S_{kNN}$, 1 step, k = 19 | 0.8138 | 0.8124 | 0.7598 |
| $S_{kNN}$, 2 steps, k = 19 | 0.8170 | 0.8155 | 0.7639 |
| $S_{kNN}$, 3 steps, k = 19 | 0.8199 | 0.8183 | 0.7680 |
| $S_{kNN}$, 5 steps, k = 19 | 0.8251 | 0.8233 | 0.7785 |
| $S_{kNN}$, 10 steps, k = 19 | 0.8374 | 0.8356 | 0.8093 |

By adding the real-valued networks based on semantic similarity measures (Section 2.2), we observe a further significant enhancement of the overall performance, showing that the integration of different sources of evidence leads to better results (Table 4). For instance the performance of the UA approach with $S_{AV}$ using a five-step random walk kernel is boosted from 0.8596 to 0.8831 average AUC (the increment is significant at $\alpha = 10^{-30}$ significance level according to the Wilcoxon signed rank sum test). Note that the MIN integration fails on this task, since an "intersection" strategy in this context leads to a significant loss of information, thus not allowing the exploitation of the topological information underlying the entire network.

Table 4

Unweighted integration methods: AUC results averaged across 708 MeSH categories including all the available nine gene networks.

| Method | UA-all | PUA-all | MAX-all | MIN-all |
|---|---|---|---|---|
| $S_{AV}$, 1 step | 0.8765 | 0.8667 | 0.8286 | 0.6541 |
| $S_{AV}$, 2 steps | 0.8792 | 0.8701 | 0.8353 | 0.6694 |
| $S_{AV}$, 3 steps | 0.8811 | 0.8722 | 0.8405 | 0.6824 |
| $S_{AV}$, 5 steps | 0.8831 | 0.8744 | 0.8470 | 0.7023 |
| $S_{AV}$, 10 steps | 0.8761 | 0.8708 | 0.8485 | 0.7264 |
| $S_{NN}$, 1 step | 0.6950 | 0.7050 | 0.6352 | 0.6045 |
| $S_{NN}$, 2 steps | 0.6980 | 0.7080 | 0.6331 | 0.6087 |
| $S_{NN}$, 3 steps | 0.7014 | 0.7108 | 0.6315 | 0.6129 |
| $S_{NN}$, 5 steps | 0.7106 | 0.7185 | 0.6314 | 0.6212 |
| $S_{NN}$, 10 steps | 0.7437 | 0.7490 | 0.6546 | 0.6349 |
| $S_{kNN}$, 1 step, k = 19 | 0.8322 | 0.8331 | 0.7598 | 0.6413 |
| $S_{kNN}$, 2 steps, k = 19 | 0.8368 | 0.8372 | 0.7639 | 0.6520 |
| $S_{kNN}$, 3 steps, k = 19 | 0.8413 | 0.8404 | 0.7680 | 0.6619 |
| $S_{kNN}$, 5 steps, k = 19 | 0.8500 | 0.8465 | 0.7785 | 0.6789 |
| $S_{kNN}$, 10 steps, k = 19 | 0.8665 | 0.8576 | 0.8093 | 0.7093 |

Fig. 3 provides a visual clue of the differences of average AUC across MeSH categories between unweighted integration methods and the best single gene network (finet). Fig. 3(d) confirms that also in this task MIN integration fails, for the same reasons explained above. On the contrary UA and PUA integration provide significant enhancements with both $S_{AV}$ and $S_{kNN}$ (Fig. 3(a) and (b)). Note that unweighted integration with $S_{NN}$ results in a degradation of the performance (Fig. 3). We do not have a clear explanation of this fact, but we think that the instability of scores computed by using only one of the neighbours, combined with the impossibility of weighting or choosing the best sources of information, may add noise to the prediction process.

Fig. 3. Unweighted integration methods: differences of average AUC across MeSH diseases with respect to the best single gene network (finet). (a) UA (b) PUA (c) MAX (d) MIN. [Each panel reports the AUC difference ("AUC diff.") for $S_{AV}$, $S_{NN}$ and $S_{kNN}$ with 1, 2, 3, 5 and 10 steps.]

Summarizing, the results show that unweighted integration, and especially UA and PUA methods, significantly enhances gene prioritization results. All the considered gene prioritization methods, ranging from random walks to kernelized score functions (with the exception of $S_{NN}$), derive a benefit from unweighted integration. Moreover, the integration of semantic similarity-based networks further improves the performance of gene prioritization. Note that with these networks, considered individually, gene prioritization methods do not attain high average AUC scores (at least with mfnet and ccnet, Table 2), but their integration significantly enhances gene prioritization results (Table 4), since they convey complementary information with respect to the other sources of evidence.

3.5. Gene prioritization with weighted network integration

We also experimented with WA and WAP network integration to explicitly take into account the "informativeness" of each gene network (Section 2.4.3). Table 5 shows that weighted integration significantly boosts the performance of kernelized score functions. In particular five-step $S_{AV}$ with weighted integration of all the nine available nets (WA-all, Table 5) reaches the highest average AUC score, but almost all the gene prioritization algorithms achieve their best results with WA and WAP integration.

This is more evident in Fig. 4, where we register a very high increment of the average AUC score with respect to the best single gene network. This is true for both $S_{AV}$ and $S_{kNN}$, while for $S_{NN}$ this behavior is limited to WAP methods only (Fig. 4(b) and (d)). Note that, on the contrary, $S_{NN}$ behaves badly with unweighted integration, independently of the combination method applied (Table 3).

Table 5

Weighted integration methods: AUC results averaged across 708 MeSH categories. WA and WAP include only the first six functional networks, while WA-all and WAP-all include all the nine functional networks.

| Method | WA | WAP | WA-all | WAP-all |
|---|---|---|---|---|
| $S_{AV}$, 1 step | 0.8649 | 0.8680 | 0.8778 | 0.8768 |
| $S_{AV}$, 2 steps | 0.8733 | 0.8727 | 0.8828 | 0.8802 |
| $S_{AV}$, 3 steps | 0.8774 | 0.8763 | 0.8866 | 0.8830 |
| $S_{AV}$, 5 steps | 0.8817 | 0.8807 | 0.8904 | 0.8861 |
| $S_{AV}$, 10 steps | 0.8812 | 0.8823 | 0.8868 | 0.8850 |
| $S_{NN}$, 1 step | 0.7602 | 0.8080 | 0.7042 | 0.8165 |
| $S_{NN}$, 2 steps | 0.7692 | 0.8126 | 0.7155 | 0.8213 |
| $S_{NN}$, 3 steps | 0.7709 | 0.8159 | 0.7193 | 0.8240 |
| $S_{NN}$, 5 steps | 0.7753 | 0.8206 | 0.7303 | 0.8278 |
| $S_{NN}$, 10 steps | 0.7807 | 0.8241 | 0.7707 | 0.8328 |
| $S_{kNN}$, 1 step, k = 19 | 0.8394 | 0.8570 | 0.8325 | 0.8650 |
| $S_{kNN}$, 2 steps, k = 19 | 0.8476 | 0.8614 | 0.8427 | 0.8684 |
| $S_{kNN}$, 3 steps, k = 19 | 0.8527 | 0.8651 | 0.8489 | 0.8716 |
| $S_{kNN}$, 5 steps, k = 19 | 0.8614 | 0.8703 | 0.8611 | 0.8762 |
| $S_{kNN}$, 10 steps, k = 19 | 0.8744 | 0.8768 | 0.8819 | 0.8784 |

Fig. 4. Weighted integration methods: differences of average AUC across MeSH categories with respect to the best single gene network (finet). Integration of six networks: (a) WA (b) WAP. Integration with nine networks including semantic similarity-based nets: (c) WA (d) WAP. [Each panel reports the AUC difference ("AUC diff.") for $S_{AV}$, $S_{NN}$ and $S_{kNN}$ with 1, 2, 3, 5 and 10 steps.]

To get more insight into the results obtained with unweighted and weighted integration methods, Fig. 5 compares the AUC scores for each class achieved by five-step $S_{AV}$ (one of the best gene prioritization methods) between unweighted and weighted integration with respect to the best single network finet. A point in Fig. 5 represents the AUC score, relative to a MeSH disease, attained by the integration method and by the best single gene network. More precisely, the AUC value obtained by the integration method is represented in ordinate, while in abscissa we have the AUC value achieved with finet, i.e. the best single network. Points that lie above the bisector of the first quadrant angle represent MeSH diseases for which the integration method achieves better results than the single best gene network. In Fig. 5(a) most of the points lie above the bisector, showing that UA enhances the results obtained with finet. By adding semantic similarity-based gene networks several more points move above the bisector line (Fig. 5(b)), confirming that these networks add novel useful information for the gene prioritization task. Looking at Fig. 5(c) we observe that with WA integration, even without semantic similarity-based gene networks, most of the points lie above the bisector, and the results are even better when we integrate all the available networks (Fig. 5(d)).

Fig. 6 provides an overall picture of the distributions of AUC scores compared between different unweighted and weighted integration methods using five-step $S_{AV}$ as gene prioritization algorithm. White boxplots refer to weighted integration methods, light gray boxplots to unweighted integration methods without semantic similarity-based gene networks, and dark gray boxplots to unweighted methods integrating all the nine available gene networks. Weighted methods show the best results (especially when all the networks are integrated), but also UAall, that is UA integrating all the nine available nets, achieves quite similar results. All the considered methods behave better than the best single gene network (last boxplot in Fig. 6), except for MIN, that clearly fails on this task, as discussed above.

To obtain a more reliable comparison of the results obtained with different gene network integration methods, we applied to each pair of them the Wilcoxon signed rank sum test, to estimate whether a statistically significant difference exists using the best performing gene prioritization method (five-step $S_{AV}$). Table 6 summarizes the main results: a "+" entry means that a statistically significant difference at 0.01 significance level is registered in favor of the method in the row with respect to the method in the column; a "−" entry means that the opposite holds, and a "=" entry stands for no significant difference between the methods.

We observe that weighted integration is always significantly better than or equal to all the other compared methods. In particular WA-all integration (that is, WA integrating all the available nets) is significantly better than all the other considered integration approaches. Note that also UA-all is always better than or equal to all the others (except WA-all), showing that also a simple unweighted integration, if a sufficiently large set of sources of evidence is provided, can achieve results comparable with the more computationally expensive weighted integration (recall that the weights of the integration are obtained by evaluating the AUC on each single gene network by internal cross-validation, see Section 2.4.3). Quite interestingly, WAP does not outperform WA: even if we construct a specific weighted network for each MeSH disease this does not introduce a significant advantage (at least, on average). This fact could be explained by considering that the per-class integration (WAP) may introduce a certain overfitting to the data, while WA, by averaging the weights across classes and thus resulting in a single integrated network, could reduce the overfitting, acting as a sort of "regularization", confirming previous results obtained in the context of gene function prediction [55].

Fig. 5. Comparison of AUC results between network integration methods and the best single gene network (finet). Each point represents the AUC score obtained by five-step $S_{AV}$ with network integration methods (ordinate) and with the best single network finet (abscissa) on each of the 708 MeSH diseases. (a) UA with six networks; (b) UA with all the nine networks; (c) WA with six networks; (d) WA with all nine networks.

Fig. 6. [Boxplots for WA, WAP, WAall, WAPall, UA, PUA, MAX, UAall, PUAall, MAXall, MINall and finet; axis range from 0.6 to 1.0.]

Table 6

Comparison between network integration methods: methods whose AUC performance is significantly better are marked with "+", significantly worse with "−" and with no significant difference with "=" (0.01 significance level, Wilcoxon signed rank sum test). The comparisons are in the sense rows vs. columns.

| | WAP-all | WA | WAP | UA-all | PUA-all | MAX-all | MIN-all | UA | PUA | MAX | finet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WA-all | + | + | + | + | + | + | + | + | + | + | + |
| WAP-all | | = | = | = | + | + | + | + | + | + | + |
| WA | | | = | = | + | + | + | + | + | + | + |
| WAP | | | | = | + | + | + | + | + | + | + |
| UA-all | | | | | + | + | + | + | + | + | + |
| PUA-all | | | | | | + | + | + | + | + | + |
| MAX-all | | | | | | | + | − | − | = | + |
| MIN-all | | | | | | | | − | − | − | − |
| UA | | | | | | | | | + | + | + |
| PUA | | | | | | | | | | + | + |
| MAX | | | | | | | | | | | + |

Considering that for a large number of diseases we have a relatively low number of annotated genes, we also compared the precision at different recall levels between different unweighted and weighted integration methods, using two-step $S_{AV}$ as gene prioritization algorithm (Fig. 7). With both the integration of the six basic networks (Fig. 7(a)) and the integration of the six basic networks plus the three semantic similarity-based networks (Fig. 7(b)) we achieve significantly better results with any of the considered integrated networks with respect to the best "single" network (finet), except for the MIN integration that obtains the worst results. Also in this case weighted integration outperforms unweighted integration, but observe that when we integrate all the available networks UAall, i.e. the unweighted average integration, achieves better results than the weighted per-class integration (WAPall), confirming that WAP integration undergoes a certain overfitting to the data. Note that when semantic similarity-based networks are added, all the integration methods improve their precision/recall results (the scale of the ordinate, that is the precision, is equal in Fig. 7(a) and (b)). For instance WA, the best performing network integration method, improves its average precision at 20% recall from 0.26 to 0.30, with a relative increment of about 15% in precision. As a final observation, note that all the considered network integration methods (except MIN integration) significantly outperform the results obtained with the best single network, confirming that also simple unweighted integration algorithms are sufficient to boost the performance of gene prioritization methods.
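The precision at a fixed recall level for a single disease can be sketched as follows. The text does not specify how the recall levels are matched to cut-offs of the ranked list, so the sketch assumes the first cut-off at which the requested recall is reached; the function name and the toy scores are hypothetical.

```python
import numpy as np

def precision_at_recall(scores, labels, recall_levels):
    """Precision at the first cut-off of the ranked list reaching each recall level."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="mergesort")      # best-scored genes first
    hits = np.cumsum(labels[order])                    # annotated genes found so far
    recall = hits / hits[-1]
    precision = hits / np.arange(1, len(labels) + 1)
    idx = np.searchsorted(recall, recall_levels, side="left")
    return precision[idx]

# Hypothetical ranking of five genes, two of which are annotated for the disease.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
labels = np.array([1, 0, 1, 0, 0])

prec = precision_at_recall(scores, labels, [0.5, 1.0])
```

Averaging such vectors across diseases gives curves of the kind compared in Fig. 7. As a check of the figures quoted above, moving from 0.26 to 0.30 corresponds to a relative increment of (0.30 − 0.26)/0.26, that is about 15%.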

3.6. Finding novel associations between genes and MeSH diseases

The common usage of gene ranking scores in gene-disease prioritization experiments consists in the selection of the top ranked unannotated genes and in their further characterization as possible "candidate" genes actually implicated in the onset and progression of the considered disease.

To this end we provide for each of the 708 MeSH diseases the AUC obtained by five-fold cross-validation, the p-value achieved through a nonparametric randomized test (see below), and the 10 top ranked genes currently not annotated for the MeSH disease under study. A table summarizing this information is available at http://homes.di.unimi.it/re/suppmat/genesmeshnetwpred/supmatTBL1.html (accessed 30 November 2013).

Moreover, we also provide a preliminary analysis of the top ranked most reliable unannotated genes for the MeSH diseases predicted with high robustness and accuracy by the best network integration, i.e. WA integrating all the available nets using five-step $S_{AV}$ to prioritize genes.

Fig. 7. Comparison of the average precision at fixed levels of recall across the 708 MeSH diseases between network integration methods and the best single gene network [Panel (a): WA, WAP, UA, PUA, MAX, MIN and finet; panel (b): WAall, WAPall, UAall, PUAall, MAXall, MINall and finet. Axes: recall (0.01 to 1) and precision (0.0 to 0.5).]

Table 7

List of 24 selected diseases and of the corresponding top ranked unannotated genes.

| Disease id. | Disease name | Top ranked unannotated genes |
|---|---|---|
| C535579 | Cardiofaciocutaneous syndrome | KSR2, PILRA, KSR1 |
| C536436 | Coffin-Siris syndrome | PYGO1, ARID2, SMARCC2 |
| C536664 | Peroxisome biogenesis disorders | PEX5, PEX7, LONP2 |
| C536783 | T-Lymphocytopenia | BIRC8, CASP10, NAIP |
| C536928 | Turcot syndrome | MLH3, PMS2L5, MSH3 |
| C537345 | Sitosterolemia | UGT1A5, UGT2B17, SLCO1B1 |
| C538169 | Acitretin embryopathy | CASP10, PEA15, SLCO3A1 |
| D000562 | Amebiasis | DCLRE1C, IL19, CYP2C8 |
| D001404 | Babesiosis | DCLRE1C, IL19, FCGR2C |
| D002062 | Bursitis | UGT2B4, UGT2B15, UGT1A4 |
| D006958 | Hyperostosis, Cortical, Congenital | NPPC, NPR1, ACE |
| D007888 | Leigh Disease | NDUFB10, NDUFB4, NDUFA12 |
| D008118 | Loiasis | FCGR2C, CYP3A43, CYP8B1 |
| D008375 | Maple Syrup Urine Disease | ACAD8, PDHX, PDHB |
| D009196 | Myeloproliferative Disorders | PTPN1, CISH, SLC25A40 |
| D009634 | Noonan Syndrome | KSR2, KSR1, MRAS |
| D010483 | Periapical Diseases | MMP13, IL12B, IL8 |
| D012214 | Rheumatic Heart Disease | CYP21A2, CYP8B1, CYP3A43 |
| D014353 | Trypanosomiasis, African | DCLRE1C, BCL2, STAT1 |
| D015823 | Acanthamoeba Keratitis | DCLRE1C, IL19, CYP2C8 |
| D018235 | Smooth Muscle Tumor | NFKB1, IL8, IL6 |
| D020299 | Intracranial Hemorrhage, Hypertensive | NPPC, NPPB, CRH |
| D056685 | Costello Syndrome | KSR2, PILRA, KSR1 |
| D056824 | Upper Extremity Deep Vein Thrombosis | FGGCX, PROZ, F11 |

To evaluate the robustness of the method we performed a nonparametric statistical test by randomly shuffling 1000 times the labels for each MeSH disease and counting how many times $m$ the AUC computed with randomly shuffled labels is larger than the AUC computed with the true labels. The resulting p-value is just the ratio $m/1000$. Interestingly enough, we achieve a p-value < 0.01 for 649 and a p-value < 0.05 for 676 of the 708 MeSH diseases. To choose MeSH diseases both robustly and accurately predicted we selected MeSH descriptors with an average AUC ≥ 0.975 and p-value < 0.01, resulting in a set of 24 diseases. For each of the selected diseases, we extracted the lowest score $c$ from the set of positive (annotated) genes. Then, we computed the empirical cumulative distribution of all the scores equal to or larger than $c$, considering both annotated and unannotated genes. As a final step, using the distribution computed at the previous step, we computed the k-percentiles of the three top ranked unannotated genes within each selected MeSH term. Considering that we selected 24 MeSH diseases, this procedure led to a collection of 72 k-percentiles whose frequency is plotted in Fig. 8.
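The label-shuffling test can be sketched as follows. The sketch assumes that the scores are recomputed for every shuffled labelling through a user-supplied scoring function; here a leave-one-out variant of the simplified average score is used as a stand-in for the cross-validated scorer. All names, the seed and the toy kernel are assumptions of this example.

```python
import numpy as np

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def loo_average_score(K, labels):
    """Average kernel similarity towards the annotated genes, excluding the gene itself."""
    total = K[:, labels].sum(axis=1) - np.where(labels, np.diag(K), 0.0)
    count = labels.sum() - labels.astype(int)
    return total / np.maximum(count, 1)

def shuffle_p_value(score_fn, labels, n_shuffles=1000, seed=0):
    """p-value = m / n_shuffles, with m the number of shuffles whose AUC exceeds the true one."""
    rng = np.random.default_rng(seed)
    observed = auc(score_fn(labels), labels)
    m = 0
    for _ in range(n_shuffles):
        shuffled = rng.permutation(labels)
        if auc(score_fn(shuffled), shuffled) > observed:
            m += 1
    return m / n_shuffles

# Toy kernel with two blocks of 10 genes; the first block is the annotated set.
n = 20
block = np.arange(n) < 10
K = np.where(block[:, None] == block[None, :], 0.9, 0.1)
labels = block.copy()

observed_auc = auc(loo_average_score(K, labels), labels)
p_value = shuffle_p_value(lambda lab: loo_average_score(K, lab), labels)
```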

Fig. 8 shows that most of the top ranked unannotated genes are concentrated close to the 100-percentile, showing that these top ranked "false positive" genes are "strongly predicted" as possible candidate disease genes, since their scores are close to that of the top ranked annotated genes. Consider also that this is supported by the fact that we selected only diseases for which gene prioritization achieved a very high AUC and "robust" predictions (AUC > 0.975 and p-value < 0.01). The gene symbols of the top three false positives, along with the disease identifiers and disease names for the selected 24 MeSH descriptors, are listed in Table 7.
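The percentile computation behind Fig. 8 can be illustrated for a single disease with the sketch below. It assumes that the k-percentile of a gene is the percentage of reference scores (all scores not smaller than the lowest annotated score $c$) that are less than or equal to the gene's score; this reading, the function name and the toy scores are assumptions of this example.

```python
import numpy as np

def top_unannotated_percentiles(scores, labels, n_top=3):
    """Indices and k-percentiles of the n_top best-scored unannotated genes."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    c = scores[labels].min()                              # lowest score among annotated genes
    reference = np.sort(scores[scores >= c])              # annotated and unannotated genes
    unannotated = np.flatnonzero(~labels)
    best = np.argsort(-scores[unannotated], kind="mergesort")[:n_top]
    top = unannotated[best]
    # empirical cumulative distribution evaluated at each top score, as a percentage
    pct = 100.0 * np.searchsorted(reference, scores[top], side="right") / len(reference)
    return top, pct

# Hypothetical scores for seven genes; genes 0, 2 and 4 are annotated.
scores = np.array([0.95, 0.90, 0.85, 0.80, 0.60, 0.30, 0.20])
labels = np.array([1, 0, 1, 0, 1, 0, 0])

top, pct = top_unannotated_percentiles(scores, labels)
```

An unannotated gene whose score falls below $c$ receives a percentile of 0 under this convention, while genes scoring close to the best annotated genes approach 100.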

Of course the proposed top ranked genes are only disease gene candidates, and these results need to be biologically interpreted and should undergo a rigorous biomedical analysis prior to being actually associated to the disease itself.

Fig. 8. Frequency of the k-percentiles of the three top ranked unannotated genes (top 3 FP scores) within each selected MeSH term; x-axis: k-percentile, y-axis: frequency.
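The k-percentile computation summarized in Fig. 8 can be illustrated with a minimal sketch. It assumes that the k-percentile of an unannotated gene is the percentage of annotated genes whose score does not exceed its own; this reading is an assumption made for illustration, and the function name, gene names and scores below are hypothetical toy values, not data from this study.

```python
from bisect import bisect_right


def top_unannotated_percentiles(scores, annotated, n_top=3):
    """Return (gene, k-percentile) for the n_top best-scoring unannotated genes.

    Assumption (not taken from the paper): the k-percentile of a gene is the
    percentage of annotated genes whose score is <= that gene's score.
    """
    # Sorted scores of the annotated (positive) genes for this disease.
    pos = sorted(s for g, s in scores.items() if g in annotated)
    if not pos:
        raise ValueError("no annotated gene has a score")
    unannotated = [(g, s) for g, s in scores.items() if g not in annotated]
    # Highest score first; ties are broken by gene name for determinism.
    unannotated.sort(key=lambda gs: (-gs[1], gs[0]))
    return [
        (g, 100.0 * bisect_right(pos, s) / len(pos))
        for g, s in unannotated[:n_top]
    ]


# Toy example with hypothetical gene names and scores.
toy_scores = {
    "GENE_A": 0.90, "GENE_B": 0.80, "GENE_C": 0.85, "GENE_D": 0.50,
    "GENE_E": 0.20, "GENE_F": 0.95, "GENE_G": 0.10,
}
toy_annotated = {"GENE_A", "GENE_B", "GENE_D", "GENE_E"}
toy_result = top_unannotated_percentiles(toy_scores, toy_annotated)
```

Under this assumption, repeating the computation for each of the 24 selected MeSH terms would yield the 72 values whose frequency is plotted in the figure.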

4. Conclusions

We performed an extensive analysis of gene-disease associations not limited to genetic disorders, including more than 700 MeSH diseases.

By using network integration and gene prioritization methods, we reported for each disease the 10 unannotated top-ranked genes, available for further bio-medical analysis. Moreover, by analyzing the top-ranked predictions relative to the 24 best and robustly predicted MeSH diseases, we showed that our approach can detect reliable candidate disease genes.

It is well-known that the integration of multiple omics sources of evidence is of paramount importance in several application domains in computational biology [65–68]. In this work we performed a systematic comparison of unweighted integration and our proposed weighted combination methods to provide an evaluation of the impact of network integration on gene prioritization. We quantitatively showed that network integration is necessary to boost gene prioritization results, in accordance with previous results published in the literature [15,69,27,28,46,47].

In particular, we showed that the proposed weighted integration methods, by exploiting the different “informativeness” embedded in different gene interaction networks, significantly outperform unweighted integration. Moreover, our experimental results show that the performance strongly depends on the selection of the sources of evidence and on the characteristics of the gene networks. For instance, even a simple UA integration can significantly improve the performance of gene prioritization methods if a sufficient number of diverse and complementary gene interaction networks are combined. From this standpoint, a novel research line could be represented by an adaptation of test and select methods, originally proposed in the context of supervised ensembles [70], to appropriately choose the most predictive sources of evidence and gene networks for each MeSH disease through an adaptive learning process.
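The difference between unweighted and weighted integration can be sketched as follows. This is only an illustrative example, not the implementation used in this work: networks are assumed to be square adjacency matrices over the same ordered gene set, and the weights are assumed to be any non-negative per-network estimate of “informativeness” supplied by the caller. The function name and the toy matrices are hypothetical.

```python
def integrate_networks(networks, weights=None):
    """Combine same-sized adjacency matrices by (weighted) averaging.

    With weights=None every network counts equally (unweighted average).
    Otherwise the weights are normalised to sum to 1 before averaging.
    How the weights are estimated is left to the caller (an assumption of
    this sketch, not a description of the paper's weighting scheme).
    """
    if not networks:
        raise ValueError("at least one network is required")
    if weights is None:
        weights = [1.0] * len(networks)
    if len(weights) != len(networks):
        raise ValueError("one weight per network is required")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(networks[0])
    combined = [[0.0] * n for _ in range(n)]
    for net, w in zip(networks, weights):
        for i in range(n):
            for j in range(n):
                # Each edge contributes in proportion to its network's weight.
                combined[i][j] += (w / total) * net[i][j]
    return combined


# Toy example: two hypothetical 2-gene networks.
net_1 = [[0, 1], [1, 0]]  # the two genes are connected
net_2 = [[0, 0], [0, 0]]  # no edge
unweighted = integrate_networks([net_1, net_2])
weighted = integrate_networks([net_1, net_2], weights=[3, 1])
```

In the toy example the edge present only in the first network keeps a larger weight in the combined network when that network is considered more informative, which is the intuition behind preferring weighted over unweighted averaging.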

Confirming previous results [30], semantic similarity-based networks, combined with other sources of evidence, boost the performance of gene prioritization methods. A possible improvement of the proposed approach could consist in combining networks based on semantic similarity measures that embed the ontology

References
