An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods


Contents lists available at ScienceDirect

Artificial Intelligence in Medicine

journal homepage: www.elsevier.com/locate/aiim


Giorgio Valentini a,∗, Alberto Paccanaro b, Horacio Caniza b, Alfonso E. Romero b, Matteo Re a

a AnacletoLab – Dipartimento di Informatica, Università degli Studi di Milano, via Comelico 39/41, 20135 Milano, Italy

b Department of Computer Science and Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham TW20 0EX, UK

Article info

Article history:
Received 11 September 2013
Received in revised form 5 March 2014
Accepted 10 March 2014

Keywords:
Gene disease prioritization
Network integration
Heterogeneous data fusion
MeSH descriptors
Node label ranking

Abstract

Objective: In the context of “network medicine”, gene prioritization methods represent one of the main tools to discover candidate disease genes by exploiting the large amount of data covering different types of functional relationships between genes. Several works proposed to integrate multiple sources of data to improve disease gene prioritization, but to our knowledge no systematic studies focused on the quantitative evaluation of the impact of network integration on gene prioritization. In this paper, we aim at providing an extensive analysis of gene-disease associations not limited to genetic disorders, and a systematic comparison of different network integration methods for gene prioritization.

Materials and methods: We collected nine different functional networks representing different functional relationships between genes, and we combined them through both unweighted and weighted network integration methods. We then prioritized genes with respect to each of the considered 708 medical subject headings (MeSH) diseases by applying classical guilt-by-association, random walk and random walk with restart algorithms, and the recently proposed kernelized score functions.

Results: The results obtained with classical random walk algorithms and the best single network achieved an average area under the curve (AUC) across the 708 MeSH diseases of about 0.82, while kernelized score functions and network integration boosted the average AUC to about 0.89. Weighted integration, by exploiting the different “informativeness” embedded in different functional networks, outperforms unweighted integration at 0.01 significance level, according to the Wilcoxon signed rank sum test. For each MeSH disease we provide the top-ranked unannotated candidate genes, available for further bio-medical investigation.

Conclusions: Network integration is necessary to boost the performance of gene prioritization methods. Moreover, the methods based on kernelized score functions can further enhance disease gene ranking results by adopting both local and global learning strategies, able to exploit the overall topology of the network.

1. Introduction

The rising awareness that a disease is rarely the consequence of an abnormality in a single gene, but is usually the result of complex interactions and perturbations involving large sets of genes and their relationships with several cellular components, led to the development of “network medicine”, a network-based approach to human disease [1]. In this context, gene prioritization methods have progressed quickly with the aim of discovering candidate “disease” genes by exploiting the large amount of available “omics” data covering different types of relationships between genes [2].

∗ Corresponding author at: Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, Milano, Italy. Tel.: +39 02 50316225; fax: +39 02 50316373.
E-mail address: [email protected] (G. Valentini).

According to [3], automatic gene prioritization methods typically produce their outputs either by filtering the candidate genes into smaller subsets or by ranking the candidate genes.

Filtering methods are based on the definition of a set of criteria motivated by the available knowledge of the molecular basis of the disease under investigation. Their main objective is to reduce the set of potential disease genes by comparing all the candidate genes with a sort of gene template, which encodes the selection criteria in a set of rules [4,5]. Despite having been proved effective [6,7], the hard filtering policy underlying their functioning is a double-edged sword. Indeed, when a relevant gene fails to meet just one of the criteria encoded in the filter, it becomes a false negative, and this prevents the detection of genes that are actually involved in the disease, but through mechanisms not previously reported in the literature.

http://dx.doi.org/10.1016/j.artmed.2014.03.003

The second class of gene prioritization methods (ranking based) avoids the limitations of filtering methods simply by ranking candidates from the most to the least promising. As in the case of filtering methods, ranking-based methods can integrate multiple sources of evidence in the gene prioritization process. These methods can be further classified into three main categories [3]: text mining [8,9], similarity profiling and network analysis-based [10–13].

Although powerful in their ability to make very effective use of the available knowledge, text mining approaches show a strong bias toward the identification of straightforward candidates for which abundant knowledge is already available [14]. In contrast, similarity profiling [15] and network analysis-based gene prioritization systems are not affected by this limitation. Indeed, they can exploit both knowledge bases (increasing the specificity of predictions) and raw data (for novel predictions).

In particular, network-based methods are gaining increasing popularity in disease gene prioritization (see [16,17] for recent specific reviews). According to this approach, nodes represent genes and edges encode some notion of functional similarity between genes, e.g. direct molecular interactions, transcriptional co-expression/regulation, sequence or structure similarity or paralogy [18]; the prioritization list is then constructed by exploiting the topology and the edge weights of the network and a set of “core” genes known to be associated to the disease under study. In this category some methods used a random walk or a heat kernel [19], while others applied Web and social network methods on a protein–protein interaction (PPI) network [20], and other approaches exploited PPI and pathway information to prioritize candidate genes [21,15].

Most gene prioritization methods exploited different sources of information and gene networks [22,23], ranging from phenotypic similarities between diseases and functional similarity between genes [24], to Gene Ontology (GO) and InterPro domain annotations [25] and protein–protein interactions, gene expression and common membership to KEGG pathways [26], and also to several other sets of data sources [15,27,28] (see [22] for a more detailed presentation of the different combinations of sources of evidence exploited by recent disease gene prioritization methods).

Despite the large availability of works describing specific combinations of data sets to develop tools suitable for disease gene prioritization, “our understanding of how to perform useful predictions using multiple data sources or across biological networks is still rudimentary” [3], and in particular, to our knowledge, no systematic studies focused on the comparison of different network integration methods.

To help fill this gap, in this paper we propose, compare and analyze different network integration strategies to combine multiple gene networks constructed with different sources of single or heterogeneous data. In particular we apply simple unweighted integration methods, which combine gene networks solely on the basis of the structural characteristics of the nets, and we propose weighted integration methods that combine networks according to the “predictiveness strength” of each type of network, estimated through the assessment of the accuracy of the learning algorithm trained on each of the combined networks. We constructed and integrated nine different gene networks, including semantic similarity-based gene networks, since it has been recently shown that they improve gene-disease prioritization [29,30].

Another contribution of this work consists in the application of the kernelized score functions to the gene-disease prioritization problem. This novel semi-supervised network method for node label ranking adopts both local and global learning strategies to learn from both the neighborhood of each node and, at the same time, from the overall topology of the network [31,32].

Another open issue is represented by the choice of the “seed genes” to characterize the diseases involved in the gene prioritization analysis [22]. Previous methods focused on specific diseases [33,34] or on genetic diseases [23,35] according e.g. to the Online Mendelian Inheritance in Man (OMIM) database [36]. In order to extend the analysis to a larger set of diseases, not limited to genetic disorders, in this work we used “seed genes” borrowed from the MeSH taxonomy of diseases [37], by exploiting gene-MeSH disease associations provided by the Comparative Toxicogenomics Database (CTD) [38].

Summarizing, our main contributions are as follows:

• We propose one of the widest gene-disease prioritization studies, involving gene-MeSH disease associations covering more than 700 diseases, not limited to genetic disorders.

• We propose novel weighted integration methods able to combine multiple networks according to the “predictiveness strength” of each source of data.

• We provide a comparative analysis of different network integration methods, and a quantitative evaluation of their impact on gene-disease prioritization.

• We present an extensive application of the kernelized score functions, a recently proposed semi-supervised network-based method that embeds local and global learning strategies, to the gene-disease prioritization problem.

This paper is structured as follows. In Section 2.1 we introduce MeSH and the pipeline we applied to annotate the “seed genes” used in our experiments. Section 2.2 describes the functional networks considered in our experiments. Section 2.4 then introduces the unweighted and weighted integration methods, and Section 2.5 the gene prioritization methods used in our experiments. The overall experimental setting is described in Section 3.1, and the results relative to the application of the gene prioritization methods to the single functional networks are discussed in Section 3.3. These results are then quantitatively compared with those obtained through unweighted (Section 3.4) and weighted (Section 3.5) network integration methods, while in Section 3.6 the top-ranked unannotated genes and the AUC and p-value associated to each of the 708 MeSH diseases analyzed in this work are presented. The conclusions outline the main findings of this work and suggest novel research lines in the context of the gene prioritization and network integration problems.

2. Materials and methods

2.1. MeSH: medical subject headings

MeSH is a controlled vocabulary produced by the National Library of Medicine for indexing, cataloging, and searching biomedical and health-related information and documents (http://www.nlm.nih.gov/mesh, accessed 30 November 2013). The descriptors or subject headings of MeSH are arranged in a hierarchy. MeSH covers a broad range of topics and its current version consists of 16 top level categories. The MeSH thesaurus is used for indexing articles from the world's leading biomedical journals for the MEDLINE/PubMed database. One of the MeSH top level terms (Diseases) is used to label the gene sets used in our experiments and to evaluate the impact of network integration on the inference of relationships between genes and diseases.

The associations between the genes and the MeSH disease terms have been downloaded from the CTD [38], a public resource that provides information about the interaction of environmental chemicals with gene products and their effects on human diseases. These relationships are annotated from the scientific literature by professional biocurators who manually curate a triad of core interactions including chemical-gene, chemical-disease and gene-disease relationships. The CTD integrates these core data to generate inferred chemical-gene-disease networks.

Fig. 1. Pipeline of the gene–MeSH disease annotation process.

To provide a “gold standard” of “seed genes” to infer novel gene-disease associations, we first downloaded the associations between the human genes considered in our experiments (Section 2.2) and all the MeSH disease terms available in CTD. We then filtered out all the diseases associated with fewer than five or more than 200 genes, in order both to ensure a minimum amount of a priori information for our prediction tasks and to avoid classes whose associated gene sets are too heterogeneous. This led to the definition of a set composed of 708 MeSH diseases (Fig. 1).
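The size filter described above can be sketched in a few lines. This is only an illustration under stated assumptions: the input is taken to be a list of (gene, disease) pairs, the bounds are treated as inclusive (5 to 200 genes kept), and the function name and toy identifiers are hypothetical rather than part of the CTD files or of the authors' pipeline.

```python
from collections import defaultdict


def filter_diseases(associations, min_genes=5, max_genes=200):
    """Keep diseases annotated with min_genes..max_genes genes (inclusive).

    `associations` is an iterable of (gene, disease) pairs; this layout is
    an assumption made for the example, not the CTD file format.
    """
    genes_by_disease = defaultdict(set)
    for gene, disease in associations:
        genes_by_disease[disease].add(gene)
    return {
        disease: genes
        for disease, genes in genes_by_disease.items()
        if min_genes <= len(genes) <= max_genes
    }


# Toy data: one disease with too few genes, one inside the accepted range.
toy = [(f"g{i}", "D_small") for i in range(3)]
toy += [(f"g{i}", "D_ok") for i in range(6)]
kept = filter_diseases(toy)
```

In this toy run only `D_ok` survives, because `D_small` has three annotated genes.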

The full set of the “gold standard” seed genes–MeSH disease associations is available from http://homes.di.unimi.it/valentini/DATA/DiseaseGeneNetworks (accessed 30.11.13).

It is worth noting that the MeSH controlled vocabulary of diseases has recently been proposed in the context of text-mining-based gene prioritization [39], but those results cannot be safely generalized to network-based methods, since text-mining approaches show a bias toward genes for which a large “a priori” knowledge is actually available in the literature [14].

2.2. Functional networks

We collected different sources of data to represent different functional relationships between genes. More precisely, we constructed gene networks using physical and genetic interactions, transcriptional co-expression/regulation and localization, protein domain and gene chemical interactions, co-occurrence of disease-gene pairs in scientific texts, homologues implicated in generating similar phenotypes in other organisms, common molecular pathways between gene products, and common GO annotations.

Table 1 summarizes the main characteristics of the nine gene functional networks used in our experiments. Each gene network includes a set S of 8449 genes (or a subset of them) selected according to the procedures described in [40]. We considered a set of genes for which sufficient functional data are available, and for which a relatively comparable coverage across gene networks can be assured. In this way, on the one hand a certain amount of functional information is ensured for each gene, and on the other hand the information available for each considered gene is comparable.

In the rest of this section we provide a brief description of each gene network. The full data sets are downloadable from: http://homes.di.unimi.it/valentini/DATA/DiseaseGeneNetworks (accessed 30.11.13).

2.2.1. Functional interaction network – finet

In [41] Wu and colleagues constructed a functional protein interaction network based on functional interactions predicted by a Naive Bayes classifier trained on pairwise relationships extracted from curated pathways and non-curated sources of information, including protein–protein interactions, gene co-expression, protein domain interaction, Gene Ontology (GO) annotations and text-mined protein interactions. From the original network we extracted the subnetwork including the subset S of genes used in our experiments.

2.2.2. HumanNet – hnnet

Similar in spirit to the approach in [41], the functional network construction method presented in [27] by Lee and colleagues integrates diverse lines of evidence in order to produce a functional human gene network. It has been used in several tests to predict causal genes for human diseases and to increase the power of genome-wide association studies. Also in this case we extracted from HumanNet the subnetwork including the subset S of genes.

Table 1
Characteristics of the gene networks used in our experiments.

Network  Description                                  Type         Nodes  Edges     Density
finet    Obtained from multiple sources of evidence   Binary       8449   271466    0.0038
hnnet    Obtained from multiple sources of evidence   Binary       8449   502222    0.0070
cmnet    Network projections from cancer modules      Binary       8449   3414722   0.0478
gcnet    Network projections from CTD                 Binary       7649   1421298   0.0242
bgnet    Network projections from BioGRID             Binary       8449   120169    0.0016
dbnet    Direct relationships obtained from BioGRID   Binary       8449   3023084   0.0423
bpnet    Semantic similarity network from GO BP       Real valued  6923   44506147  0.9286
mfnet    Semantic similarity network from GO MF       Real valued  6145   26611887  0.7047

Fig. 2. Simplified representation of bipartite network projections into homogeneous gene networks. (a) Binary projection to construct the cmnet network; (b) sum projection to construct gcnet. Circles represent genes, squares represent cancer modules (a) and chemicals (b).

2.2.3. Cancer module network – cmnet

By exploiting gene expression profiling, Segal and colleagues constructed a functional module map for cancer to investigate commonalities and variations between different types of tumor [42]. In their work the authors analyzed a collection of expression profiles with the aim of identifying sets of genes that act in concert to carry out specific functions in different cancer types, and then produced a module map constituted by a collection of the gene sets associated to specific cancer gene modules.

We used the relationships between the human genes and Segal's cancer modules [42] to construct a bipartite network. This network has been projected onto the gene space, thus originating the cmnet network. The type of projection used in the construction of cmnet is a binary bipartite network projection, meaning that the weight of the edge linking two genes in the projected network is 1 if the two genes share at least one neighbour in the original bipartite network and 0 otherwise (Fig. 2a).
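A binary bipartite projection of this kind can be illustrated as follows. The sketch assumes the bipartite network is given as a mapping from each module to its member genes; the function name and the toy modules are hypothetical and are not taken from the cancer module data.

```python
from itertools import combinations


def binary_projection(memberships):
    """Project a bipartite module-gene network onto the gene space.

    `memberships` maps each module to the genes it contains. Two genes are
    linked (weight 1) when they share at least one module. Edges are
    returned as a set of alphabetically ordered gene pairs.
    """
    edges = set()
    for genes in memberships.values():
        for pair in combinations(sorted(set(genes)), 2):
            edges.add(pair)
    return edges


toy_modules = {"m1": ["A", "B", "C"], "m2": ["C", "D"]}
cm_edges = binary_projection(toy_modules)
```

Genes A and D share no module in this toy example, so the pair is absent from `cm_edges`, while C is linked to all the other genes.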

2.2.4. Gene chemical network – gcnet

The CTD stores information mined from literature about the interactions between genes, chemicals and diseases in many species. Since one of the objectives of this work is the evaluation of the capabilities of heterogeneous network integration in the prediction of gene–disease relationships, we used the gene–chemical relationships available in the CTD to construct a gene interaction network (gcnet). To this end we downloaded from CTD the chemical–gene interactions file (http://ctdbase.org/reports/CTDchemgeneixns.csv.gz, accessed 30.11.13) and we constructed a bipartite network. We then performed a SUM projection onto the gene space, by which the weight of an edge linking two genes equals the number of common neighbors of the genes in the bipartite network. The resulting network was finally binarized using a cutoff of five or more common chemical interactors to set a binary interaction between a pair of genes (Fig. 2b).
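The SUM projection followed by binarization can be sketched in the same spirit. The input layout (a mapping from each chemical to the genes it interacts with), the function names and the toy data are assumptions made for illustration; only the cutoff of five shared chemicals comes from the text.

```python
from collections import Counter
from itertools import combinations


def sum_projection(neighbours):
    """Count, for each gene pair, the chemicals that interact with both genes."""
    counts = Counter()
    for genes in neighbours.values():
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


def binarize(counts, cutoff=5):
    """Keep only gene pairs sharing at least `cutoff` chemicals."""
    return {pair for pair, shared in counts.items() if shared >= cutoff}


# Toy data: A and B share five chemicals, B and C share only one.
toy_chem = {f"c{i}": ["A", "B"] for i in range(5)}
toy_chem["c5"] = ["B", "C"]
shared_counts = sum_projection(toy_chem)
gc_edges = binarize(shared_counts)
```

Only the pair (A, B) reaches the cutoff here, so it is the single edge of the binarized toy network.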

2.2.5. BioGRID database network – dbnet

This is a protein–protein interaction network constructed using direct physical and genetic interactions obtained from BioGRID [43] (v. 3.2.96 – January 2013).

2.2.6. BioGRID projected network – bgnet

Instead of setting up a binary interaction network based on the direct interactions between the genes in S, we constructed a bipartite network based on the content of BioGRID, using the genes in S as top nodes and all the human genes B available in BioGRID as bottom nodes. More precisely, if BioGRID contains an interaction between a node a ∈ S and x ∈ B, we added the edge (a, x) to the bipartite network. Then, according to a binary projection to the S space, an edge (a, b), a ∈ S, b ∈ S, is added to the projected network if a and b share at least one common node x ∈ B in their neighborhoods of the bipartite network. In this way we can capture indirect interactions between pairs of genes.

2.2.7. Semantic similarity-based networks: bpnet, mfnet and ccnet

The last three networks considered in this work have been constructed by computing the Resnik semantic similarities [44] between the terms of each division of the Gene Ontology: biological process, molecular function and cellular component. We obtained a pairwise gene similarity measure by choosing the maximum Resnik semantic similarity between all the terms for which the two genes are annotated. The resulting networks were named bpnet, mfnet and ccnet respectively. The semantic similarity measures have been computed using a MATLAB application implementing methods described in [45].

2.3. Basic notation

Gene networks for disease prioritization can be represented through an undirected weighted graph $G = (V, E)$, where $V$ is the set of vertices corresponding to genes and $E$ the set of edges corresponding to some notion of functional relationship between pairs of genes/vertices. Vertices of the graph and genes can be denoted with natural numbers $1, 2, \ldots, n$, since each vertex of $G$ is univocally associated to a gene. The corresponding adjacency matrix $W$ with weights $w_{ij}$ represents the “strength” of the relationship between vertices $i, j \in V$; $V_M \subset V$ denotes a subset of “positive” vertices belonging to a specific MeSH subject heading $M$ (e.g. a MeSH descriptor of a disease, Section 2.1).

We considered the integration of $n$ gene networks, $G^d = (V^d, E^d)$, $1 \le d \le n$, and we denote by $\bar{G}$ the integrated network $\bar{G} = (\bar{V}, \bar{E})$, with $\bar{V} = \bigcup_d V^d$ and $\bar{E} \subseteq \bigcup_d E^d$. The weights of the edges $(i, j) \in E^d$ are represented with $w^d_{ij}$. Finally, a set of features $x_i \in X$ can be associated to a gene $i$. For instance, $x_i$ could represent the genetic or protein interactions, the expression profile or whatever data are available for a given gene/vertex $i$.

2.4. Network integration methods

We designed and applied different network integration methods to combine different sources of evidence of functional relationships between genes. Our aim is to analyze the impact of network integration on gene prioritization, in order to understand whether the combination of multiple networks, constructed from different sources of information, can significantly enhance the performance of gene prioritization methods, and to provide a quantitative assessment of this hypothesized improvement. To this end we deliberately considered relatively simple methods, ranging from unweighted to weighted network integration algorithms, and excluded more complex algorithms proposed in the literature. This allows us to perform an extensive analysis involving a large set of diseases, a large set of human genes and a significant subset of the integration methods applied to gene prioritization problems.

Unweighted methods are characterized by network combinations that depend only on the structure of the networks themselves, while weighted ones depend on an estimate of the learning capabilities of network algorithms or on the assessment of the “informativeness” of the available data. The methods proposed in Section 2.4.2 (unweighted integration) and in Section 2.4.3 (weighted integration) share several general characteristics with previously proposed methods applied in gene prioritization problems or in other computational biology problems such as gene function prediction [46–49].

For instance, unweighted approaches such as the simple union of networks have been applied to the prioritization of genes in Alzheimer's disease using a guilt-by-association inference rule [47], to the integration of PPI data of model organisms mapped to human through homology [19], to the integration of gene interaction networks in the context of the functional interpretation of genomic variants [50], and to find functional modules in networks integrated from multiple public databases [51]. Other unweighted approaches for gene prioritization average the scaled Gram matrices obtained from different sources of functional information using suitable kernels [46].

Weighted approaches differ in the way the weights associated to each network are estimated. For instance, weights can be obtained through an iterative algorithm shown to be equivalent to an expectation-maximization (EM) optimization algorithm [52], or learnt by solving a quadratically constrained linear program in a novelty detection setting of the gene prioritization problem [46]; in the context of the gene function prediction problem, weights can be interpreted from a probabilistic standpoint [49] or estimated using the PPV (positive predictive value) associated to the edges of the graph [48].

In the following sections, we describe the network pre-processing and the unweighted and weighted network integration methods that we tested in our experiments.

2.4.1. Network pre-processing

Before the combination phase each network underwent a pre-processing step to handle networks with different numbers of nodes, to filter some edges in too dense graphs, and to make the weights comparable across different networks. In particular, to deal with genes missing in some networks, we filled the corresponding rows/columns of the symmetric adjacency matrix $W$ with zeros. To reduce the complexity of the network and the noise introduced by too small edge weights, we eliminated edges below a given threshold. In this way we removed very weak similarities between genes, but at the same time we chose relatively low thresholds to avoid the generation of “singletons” with no connections with other nodes. In brief, we tuned the threshold for each network to guarantee that each vertex has at least one connection: for each node/gene we computed the maximum of the weights associated to its edges, and among the selected maxima we chose the minimum as a general threshold for the network. Finally, to make the weights comparable across different networks, avoiding the undesirable effect that a certain network could overcome the others because of the high values of its weights, we applied both Laplacian regularization [53] and a simple linear regularization to obtain weights $\hat{w}_{ij} \in [0, 1]$:

$$\hat{w}_{ij} = \frac{w_{ij} - \min_{x,y} w_{xy}}{\max_{x,y} w_{xy} - \min_{x,y} w_{xy}} \tag{1}$$

where indices $x, y \in V$ refer to the vertices/genes of the underlying graph.

In our experiments we adopted the regularization shown in (1), since the results were comparable with Laplacian regularization (data not shown).
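The threshold selection and the linear rescaling of (1) can be illustrated on a small dense matrix. This is a minimal sketch under explicit assumptions: the adjacency matrix is a symmetric list of lists with a zero diagonal, edges are kept when their weight is at least the threshold, and the minimum and maximum in (1) are taken over all entries of the filtered matrix. The function names are hypothetical.

```python
def network_threshold(W):
    """Minimum over nodes of each node's maximum edge weight (diagonal ignored)."""
    n = len(W)
    maxima = [max(W[i][j] for j in range(n) if j != i) for i in range(n)]
    return min(maxima)


def filter_and_rescale(W):
    """Drop edges below the threshold, then rescale the weights to [0, 1]."""
    n = len(W)
    t = network_threshold(W)
    kept = [
        [W[i][j] if i != j and W[i][j] >= t else 0.0 for j in range(n)]
        for i in range(n)
    ]
    values = [v for row in kept for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:  # degenerate network: nothing to rescale
        return kept
    return [[(v - lo) / (hi - lo) for v in row] for row in kept]


W_toy = [
    [0.0, 4.0, 1.0],
    [4.0, 0.0, 2.0],
    [1.0, 2.0, 0.0],
]
W_hat = filter_and_rescale(W_toy)
```

In the toy matrix the per-node maxima are 4, 4 and 2, so the threshold is 2: the weakest edge (weight 1) is removed, and every node still keeps at least one connection.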

2.4.2. Unweighted network integration

In unweighted network integration the combination of different networks depends only on the structure and the characteristics of each network, and no learning is involved in the computation of the integrated network.

2.4.2.1. Unweighted average (UA). One of the most widely applied approaches is the UA method [46,32]. The weight of each edge of the combined network is computed by simply averaging across the available $n$ networks:

$$\bar{w}_{ij} = \frac{1}{n} \sum_{d=1}^{n} w^d_{ij} \tag{2}$$

Note that in this integration approach weights $w^d_{ij} = 0$ also contribute to the average, regardless of whether the measure of functional relationship between genes $i$ and $j$ underlying the evidence source is available or not.

2.4.2.2. Per-edge unweighted average (PUA). We propose a novel method, similar to UA, that assures a high coverage of the genes included in the integrated functional network without penalizing genes for which a specific source of data is unavailable. With respect to the UA method, PUA takes into account the fact that a given functional relationship between a pair of genes could be missing, averaging that edge only over the networks containing both genes.

More precisely, given a set of $n$ gene networks, the weight $\bar{w}_{ij}$ of the edge $(i, j) \in \bar{E}$ is computed as follows:

$$\bar{w}_{ij} = \frac{1}{|D(i,j)|} \sum_{d \in D(i,j)} w^d_{ij} \tag{3}$$

where $D(i,j) = \{ d \mid i \in V^d \wedge j \in V^d \}$.
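The difference between (2) and (3) is easiest to see on a single edge. The sketch below assumes each network is represented as a pair (set of genes, dictionary of edge weights keyed by ordered gene pairs); this representation, the function names and the toy weights are illustrative assumptions.

```python
def unweighted_average(networks, i, j):
    """Eq. (2): average over all networks; a missing edge counts as 0."""
    return sum(w.get((i, j), 0.0) for _, w in networks) / len(networks)


def per_edge_unweighted_average(networks, i, j):
    """Eq. (3): average only over the networks that contain both genes."""
    covering = [w for nodes, w in networks if i in nodes and j in nodes]
    if not covering:
        return 0.0
    return sum(w.get((i, j), 0.0) for w in covering) / len(covering)


# Gene C is absent from the second network; the third contains C but no B-C edge.
net1 = ({"A", "B", "C"}, {("A", "B"): 0.8, ("B", "C"): 0.4})
net2 = ({"A", "B"}, {("A", "B"): 0.6})
net3 = ({"A", "B", "C"}, {("A", "B"): 0.1})
networks = [net1, net2, net3]

ua_bc = unweighted_average(networks, "B", "C")            # 0.4 / 3
pua_bc = per_edge_unweighted_average(networks, "B", "C")  # 0.4 / 2
```

For the edge (B, C), UA divides by all three networks, whereas PUA divides only by the two networks that contain both genes, so the edge is not penalized for the missing data source.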

2.4.2.3. Network maximum integration (MAX). The MAX integration selects the largest weight among all the available sources of data:

$$\bar{w}_{ij} = \max_d w^d_{ij} \tag{4}$$

This approach performs the union of all the available sources of evidence [47,51,50], and when multiple edges $(i, j)$ for a given pair of genes $i$ and $j$ are available, selects the one with the largest weight.

2.4.2.4. Network minimum integration (MIN). Analogously, the MIN integration selects the minimum weight:

$$\bar{w}_{ij} = \min_d w^d_{ij} \tag{5}$$

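In practice it realizes the intersection between multiple networks. It can be implemented in two different flavours: the “drastic” algorithm (5), for which a single $w^d_{ij} = 0$ is sufficient to set $\bar{w}_{ij} = 0$, and a second variant in which the weights set to 0 are discarded, and $\bar{w}_{ij} = 0$ if and only if the weights for the edge $(i, j)$ in all the available networks are set to 0:

$$\bar{w}_{ij} = \begin{cases} 0 & \text{if } \forall d \;\; w^d_{ij} = 0 \\ \min_d \{ w^d_{ij} \mid w^d_{ij} \neq 0 \} & \text{otherwise} \end{cases} \tag{6}$$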

It is worth noting that this approach could be highly affected by noisy data. It could be reliable when a large amount of evidence is shared among different sources of data.
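For a single edge, the MAX rule (4) and the two MIN variants (5) and (6) reduce to simple operations on the list of weights observed across networks. The function names below are hypothetical, and the list-of-weights input is an assumption made for illustration.

```python
def max_integration(weights):
    """Eq. (4): largest weight across the networks."""
    return max(weights)


def min_integration_drastic(weights):
    """Eq. (5): smallest weight; a single 0 sets the integrated edge to 0."""
    return min(weights)


def min_integration_nonzero(weights):
    """Eq. (6): smallest non-zero weight; 0 only if all the weights are 0."""
    nonzero = [w for w in weights if w != 0]
    return min(nonzero) if nonzero else 0.0


edge_weights = [0.7, 0.0, 0.3]  # the same edge observed in three networks
w_max = max_integration(edge_weights)
w_min_drastic = min_integration_drastic(edge_weights)
w_min_nonzero = min_integration_nonzero(edge_weights)
```

With these toy weights the drastic variant removes the edge, while the variant of (6) keeps it with weight 0.3.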

2.4.3. Weighted network integration

The unweighted methods do not require learning any parameters from the data, while weighted integration learns the “weight” associated to each network. The basic idea behind these approaches consists in associating a parameter to the “predictiveness strength” of each type of network. This can be realized by estimating the “predictiveness strength” of a network from the accuracy of a learning algorithm trained on the network itself.

Different weighted approaches have been proposed in the literature [46,52,48,54]. In our experiments, considering that in gene prioritization the main objective consists in effectively ranking the genes with respect to a given disease, we computed the weights according to the AUC obtained for a given MeSH descriptor. More precisely, having $n$ networks and $c$ MeSH descriptors, we can compute the weight $\alpha^d(k)$ for the $d$th network and the $k$th MeSH disease in the following way:

$$\alpha^d(k) = \frac{M^d(k)}{\sum_{j=1}^{n} M^j(k)} \tag{7}$$

where $M^d(k)$ represents the metric applied to measure the accuracy of the prediction (e.g. the AUC or the precision at a fixed recall) with respect to the $k$th MeSH descriptor and the $d$th network. The denominator in (7) simply assures that $\sum_{d=1}^{n} \alpha^d(k) = 1$. The $\alpha^d(k)$ can be computed for each MeSH descriptor $k$ by estimating the corresponding AUC by leave-one-out on the training data; that is to say, an “internal” cross validation is performed to optimize the weights, by subdividing each fold of an “external” cross validation applied to evaluate the method on the whole data set.
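The normalization in (7) is a plain rescaling of the per-network metrics so that they sum to one. A minimal sketch, assuming the AUC values have already been estimated on the training data; the function name and the toy values are hypothetical.

```python
def linear_weights(metrics):
    """Eq. (7): each network's metric divided by the sum over all networks."""
    total = sum(metrics)
    return [m / total for m in metrics]


# Hypothetical AUC values of three networks for one MeSH descriptor.
aucs = [0.9, 0.6, 0.5]
alphas = linear_weights(aucs)
```

The resulting weights sum to one and preserve the ordering of the networks by AUC.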

2.4.3.1. Weighted average per class (WAP). By using the $\alpha^d(k)$ computed according to (7), the WAP method integrates the networks by assigning to each network a weight proportional to the performance of a given learning algorithm on that network:

$$\bar{w}_{ij}(k) = \sum_{d=1}^{n} \alpha^d(k)\, w^d_{ij} \tag{8}$$

It is worth noting that in this way we construct a different weighted integrated network for each MeSH descriptor.

In order to emphasize the weight of the most informative networks and, at the same time, to reduce the weights of the least informative ones, a monotonic logarithmic transformation of the weights can be applied, instead of using the one proposed in (7):

$$\alpha^d(k) = \frac{\log(1 - M^d(k))}{\sum_{j=1}^{n} \log(1 - M^j(k))} \tag{9}$$

We assume that the metric $M$ has values in $[0, 1]$ (consider, e.g. the AUC). Note that in a practical implementation, to avoid $\alpha^d(k) \to \infty$, we need to set an upper bound $b < 1$ for $M$. For instance, in our experiments we used the AUC and we set $b = 0.99$.
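The logarithmic weights of (9) and the per-class combination of (8) can be sketched as follows. The cap at `upper_bound` mirrors the bound b = 0.99 mentioned in the text; the function names and toy values are assumptions, and the sketch additionally assumes that at least one metric is greater than 0 (otherwise the denominator would be zero).

```python
import math


def log_weights(metrics, upper_bound=0.99):
    """Eq. (9): log-transformed weights, with metrics capped at upper_bound.

    Assumes metrics lie in [0, 1] and at least one of them is > 0.
    """
    logs = [math.log(1.0 - min(m, upper_bound)) for m in metrics]
    total = sum(logs)
    return [value / total for value in logs]


def weighted_average_per_class(edge_weights, alphas):
    """Eq. (8): combine one edge's weights across networks using alphas."""
    return sum(a * w for a, w in zip(alphas, edge_weights))


aucs = [0.9, 0.6, 0.5]
log_alphas = log_weights(aucs)
# One edge observed with weights 1.0, 0.5 and 0.0 in the three networks,
# combined here with the linear weights 0.45, 0.30 and 0.25 of Eq. (7).
wap_edge = weighted_average_per_class([1.0, 0.5, 0.0], [0.45, 0.30, 0.25])
```

Compared with the linear weights of (7), which give 0.45 to the best toy network, the logarithmic transformation assigns it a larger share, which is the emphasis effect described above.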

2.4.3.2. Weighted average (WA). The WAP method adapts the weights $\alpha^d(k)$ according to the performance of a learning algorithm on each specific class $k$ under study. On one hand, this could lead to a set of networks well fitted to the characteristics of each class $k$, but on the other hand this approach is likely to overfit the data. For this reason we introduce a sort of “regularized” version to reduce possible overfitting problems in the learning process. More precisely, we compute a regularized weight $\alpha^d$ by averaging across classes, in the spirit of the approach proposed in [55] in the context of gene function prediction problems. In this way we obtain a unique weight $\alpha^d$ for each network:

$$\alpha^d = \frac{1}{c} \sum_{k=1}^{c} \alpha^d(k) \tag{10}$$

The WA method, using the weights estimated in (10), builds a unique integrated network, independently of the MeSH disease considered:

$$\bar{w}_{ij} = \sum_{d=1}^{n} w^d_{ij}\, \frac{\sum_{k=1}^{c} \alpha^d(k)}{c} = \sum_{d=1}^{n} \alpha^d w^d_{ij} \tag{11}$$
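Averaging the per-class weights as in (10) and applying them as in (11) can be sketched as below. The nested-list layout (one row of network weights per class), the function names and the toy numbers are assumptions for illustration.

```python
def average_weights(alpha_per_class):
    """Eq. (10): average each network's weight over the classes.

    alpha_per_class[k][d] holds the weight of network d for class k.
    """
    c = len(alpha_per_class)
    n = len(alpha_per_class[0])
    return [sum(row[d] for row in alpha_per_class) / c for d in range(n)]


def weighted_average(edge_weights, alphas):
    """Eq. (11): a single integrated edge weight shared by all the classes."""
    return sum(a * w for a, w in zip(alphas, edge_weights))


# Two classes, three networks (toy weights; each row sums to 1).
alpha_per_class = [
    [0.5, 0.3, 0.2],
    [0.3, 0.3, 0.4],
]
alpha_bar = average_weights(alpha_per_class)
wa_edge = weighted_average([1.0, 0.5, 0.0], alpha_bar)
```

Because each row of per-class weights sums to one, the averaged weights also sum to one, and a single integrated network is obtained for all the classes.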

Note that in this section we considered the integration of graphs represented through their corresponding adjacency matrices $W$, but it is easy to see that the same method can be applied to kernel matrices $K$ derived from $W$, by simply substituting in each equation the $w_{ij}$ elements of the adjacency matrix with the $k_{ij}$ elements of the corresponding kernel matrix (see Section 2.5.1).

2.5. Gene prioritization methods

In this section we introduce the gene prioritization methods applied in our experiments. We focused on kernelized score functions, since they have recently been shown to be among the most competitive methods in the related problem of cancer module gene ranking [40], and on random walk algorithms, since they have been successfully applied to prioritize genes with respect to genetic diseases [19]. As a baseline method we used a simple implementation of the guilt-by-association (GBA) principle [56].

2.5.1. Kernelized score functions

Kernel-based ranking methods have been recently proposed in the context of cancer module gene ranking [40], drug ranking [57] and gene function prediction problems [58,31]. Methods based on kernelized score functions are very fast (their time complexity is approximately linear in sparse graphs, once the kernel matrix is computed) [31], and their accuracy is at least comparable with state-of-the-art gene prioritization methods [40].

The score functions $S : V \to \mathbb{R}^+$ are based on properly chosen kernels, by which we can directly rank vertices according to the values of $S(i)$: the higher the score, the higher the likelihood that a gene belongs to a given MeSH disease.

Kernelized score functions rely on distance measures defined in a suitable Hilbert space $\mathcal{H}$. More precisely, let $X$ be a general nonempty set, $\phi : X \to \mathcal{H}$ a mapping to a given universal reproducing kernel Hilbert space $\mathcal{H}$, and $K : X \times X \to \mathbb{R}$ its associated kernel function, such that $\langle \phi(\cdot), \phi(\cdot) \rangle_{\mathcal{H}} = K(\cdot, \cdot)$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ represents the inner product in $\mathcal{H}$. By choosing a distance measure on a Hilbert space, we can exploit the classical “kernel trick” [59] and embed any valid kernel into the distance measure itself.

It is worth noting that we extend the notion of neighbour through the kernel $K$: by choosing an appropriate kernel, node $j$ can be in the neighbourhood of node $i$ even if there is no edge between them in the original graph $G$, i.e. $w_{ij} = 0$ but $K(x_i, x_j) > 0$. From this standpoint the Gram matrix $K$ can be interpreted as a novel “weighted adjacency matrix” in the projected Hilbert space induced by the mapping $\phi : X \to \mathcal{H}$.

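If we choose the minimum distance $D_{NN}$ between $i$ and $V_M$ (the set of genes annotated for a given MeSH disease $M$), we can obtain the nearest-neighbours score $S_{NN}$:

$$D_{NN}(i, V_M) = \min_{j \in V_M} \frac{1}{2} \| \phi(x_i) - \phi(x_j) \|^2 \tag{12}$$

By expanding the square in (12) we obtain:

$$D_{NN}(i, V_M) = \min_{j \in V_M} \left[ \frac{1}{2} \langle \phi(x_i), \phi(x_i) \rangle + \frac{1}{2} \langle \phi(x_j), \phi(x_j) \rangle - \langle \phi(x_i), \phi(x_j) \rangle \right] \tag{13}$$

By substituting in (13) the inner product $\langle \phi(\cdot), \phi(\cdot) \rangle$ with a suitable kernel $K(\cdot, \cdot)$, we can obtain a similarity measure simply by changing the sign:

$$Sim_{NN}(i, V_M) = -\min_{j \in V_M} \left[ \frac{1}{2} K(x_i, x_i) - K(x_i, x_j) + \frac{1}{2} K(x_j, x_j) \right] \tag{14}$$

If $K(x_j, x_j)$ are equal for all $j \in V$, we can simplify (14), thus achieving the nearest-neighbours score $S_{NN}$:

$$S_{NN}(i, V_M) = -\min_{j \in V_M} -K(x_i, x_j) = \max_{j \in V_M} K(x_i, x_j) \tag{15}$$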

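A natural extension of the $S_{NN}$ score can be obtained by introducing the k-nearest neighbours distance:

$$D_{kNN}(i, V_M) = \frac{1}{2} \sum_{j \in I_k(i)} \left\| \phi(x_i) - \phi(x_j) \right\|^2, \tag{16}$$

where $I_k(i) = \{ j \in V_M \mid j \text{ is ranked among the first } k \text{ in } V_M \}$. By adopting a procedure similar to that used to derive the $S_{NN}$ score, we can obtain from (16) the k-nearest neighbours score $S_{kNN}$:

$$S_{kNN}(i, V_M) = \sum_{j \in I_k(i)} K(x_i, x_j) \tag{17}$$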

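If we define the distance $D_{AV}(i, V_M)$ of a vertex $i \in V$ with respect to a set of nodes $V_M$ simply as the average distance in the Hilbert space between $i$ and the set of nodes included in $V_M$:

$$D_{AV}(i, V_M) = \frac{1}{2} \left\| \phi(x_i) - \frac{1}{|V_M|} \sum_{j \in V_M} \phi(x_j) \right\|^2 \tag{18}$$

we can derive from (18) the average score $S_{AV}$:

$$S_{AV}(i, V_M) = -\frac{1}{2} K(x_i, x_i) + \frac{1}{|V_M|} \sum_{j \in V_M} K(x_i, x_j) \tag{19}$$

This score represents the average similarity of the gene $i$ with respect to the genes belonging to the set $V_M$. If all $K(x_i, x_i)$ are equal for each $i \in V$ (i.e. the "self-similarity" of genes does not matter), we can further simplify (19) by removing its first term.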
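The step from (18) to (19) can be written out explicitly. The following is a sketch, assuming the feature mapping is denoted $\phi$ and writing $c_M$ (a symbol introduced here only for illustration) for the term that depends on $V_M$ but not on $i$:

$$
\begin{aligned}
D_{AV}(i, V_M) &= \frac{1}{2} \langle \phi(x_i), \phi(x_i) \rangle - \frac{1}{|V_M|} \sum_{j \in V_M} \langle \phi(x_i), \phi(x_j) \rangle + \frac{1}{2 |V_M|^2} \sum_{j \in V_M} \sum_{l \in V_M} \langle \phi(x_j), \phi(x_l) \rangle \\
&= \frac{1}{2} K(x_i, x_i) - \frac{1}{|V_M|} \sum_{j \in V_M} K(x_i, x_j) + c_M
\end{aligned}
$$

Changing the sign and dropping $c_M$, which is the same for every gene $i$ and therefore does not affect the ranking, gives the two terms of (19).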

Even if any valid kernel $K$ can be applied to compute the above proposed scores, in the context of network-based gene prioritization we used random walk kernels [53], since they can capture the similarity between genes, taking into account the topology of the overall functional interaction network.
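Given any precomputed Gram matrix, the three scores defined above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the matrix is a plain list of lists, and $I_k(i)$ is assumed to contain the $k$ annotated genes with the largest kernel value with respect to gene $i$.

```python
def s_av(K, i, positives):
    """Average score (19): -K[i][i]/2 plus the mean kernel value towards annotated genes."""
    return -0.5 * K[i][i] + sum(K[i][j] for j in positives) / len(positives)


def s_nn(K, i, positives):
    """Nearest-neighbour score (15): largest kernel value towards an annotated gene."""
    return max(K[i][j] for j in positives)


def s_knn(K, i, positives, k):
    """k-nearest-neighbours score (17): sum of the k largest kernel values.

    Assumption: the k nearest neighbours are the annotated genes with the
    highest kernel value with respect to gene i.
    """
    top_k = sorted((K[i][j] for j in positives), reverse=True)[:k]
    return sum(top_k)


def rank_genes(K, positives, score=s_av, **kwargs):
    """Return (gene indices sorted by decreasing score, list of scores)."""
    scores = [score(K, i, positives, **kwargs) for i in range(len(K))]
    order = sorted(range(len(K)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

For example, `rank_genes(K, positives, score=s_knn, k=19)` would rank all genes with the k-nearest neighbours score. Once the kernel matrix is available, each score only needs the entries of one row restricted to the annotated genes, which is in line with the low running time reported for these methods.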

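The Gram matrix $\mathbf{K}$ associated to the one-step random walk kernel can be derived from the symmetric adjacency matrix $\mathbf{W}$ of the undirected functional interaction graph $G$:

$$\mathbf{K} = (a - 1)\mathbf{I} + \mathbf{D}^{-\frac{1}{2}} \mathbf{W} \mathbf{D}^{-\frac{1}{2}} \tag{20}$$

where $\mathbf{I}$ is the identity matrix, $\mathbf{D}$ is a diagonal matrix with elements $d_{ii} = \sum_j w_{ij}$, and $a$ is a value larger than 1.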

The q-step random walk kernels $\mathbf{K}_{q\text{-step}} = \mathbf{K}^q$ can be easily obtained by matrix multiplication from the one-step random walk kernel matrix (20), where $q$ represents the number of random walk steps in the underlying graph [53]. In this way, by setting $q = 2$ or $q = 3$, two vertices are considered similar if they are directly connected or if they are connected through a path including one or two vertices. Longer paths can also be considered by setting $q > 3$: in this way we can deeply explore the graph to find similarities between genes mediated through long paths in the graph.
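The construction in (20) and its q-step extension can be sketched as follows. This is an illustrative pure-Python version for small dense matrices; the function names and the default `a=2.0` are assumptions made here (the text only requires $a > 1$), and a real experiment on thousands of genes would use a numerical library instead.

```python
import math


def one_step_rw_kernel(W, a=2.0):
    """K = (a - 1) I + D^{-1/2} W D^{-1/2}, as in (20)."""
    n = len(W)
    degree = [sum(row) for row in W]
    # Assumption: isolated nodes (zero degree) get no off-diagonal similarity.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    K = [[inv_sqrt[i] * W[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += a - 1.0
    return K


def matmul(A, B):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def q_step_rw_kernel(W, q, a=2.0):
    """K^q obtained by repeated multiplication of the one-step kernel."""
    K1 = one_step_rw_kernel(W, a)
    Kq = K1
    for _ in range(q - 1):
        Kq = matmul(Kq, K1)
    return Kq
```

On the path graph 0-1-2, nodes 0 and 2 share no edge, so their one-step kernel value is zero, while the two-step kernel assigns them a positive similarity through node 1. This is the sense in which the kernel extends the notion of neighbourhood.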

2.5.2. Random walks and random walks with restart

Kernelized score functions presented in the previous section can be interpreted as a generalization of the random walk algorithms, which have been successfully applied to gene prioritization problems [19,60]. Random walk (RW) algorithms [61] rank genes by exploring and exploiting the topology of the gene network: random walks across the network are performed starting from a subset $V_M \subset V$ of genes belonging to a specific MeSH descriptor $M$ by using a transition probability matrix $\mathbf{Q} = \mathbf{D}^{-1}\mathbf{W}$, where $\mathbf{W}$ is the adjacency matrix, and $\mathbf{D}$ is a diagonal matrix with diagonal elements $d_{ii} = \sum_j w_{ij}$.

Starting from the initial set of probabilities $\mathbf{p}^o$ of the genes $1 \ldots n$ of belonging to $M$, where $p_i^o = 1/|V_M|$ if $i \in V_M$, otherwise $p_i^o = 0$, the RW update rule:

$$\mathbf{p}^{t+1} = \mathbf{Q}^T \mathbf{p}^t \tag{21}$$

is repeated until convergence or for a fixed number of iterations.

We can observe that the random walker could progressively "forget" the a priori information available for the MeSH descriptor $M$, by iteratively walking across the overall network. To avoid this problem, we can stop the RW algorithm after a few iterations, as outlined above, or we can apply the random walk with restart (RWR) method: at each step the random walker can move to one of its neighbours or can restart from its initial condition with probability $\theta$:

$$\mathbf{p}^{t+1} = (1 - \theta)\, \mathbf{Q}^T \mathbf{p}^t + \theta\, \mathbf{p}^o \tag{22}$$

With both RW and RWR methods, at the steady state we can rank the genes according to the values of the vector $\mathbf{p}$, to prioritize them according to their likelihood to belong to the MeSH disease under study.
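The update rules (21) and (22) can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the restart symbol `theta`, the convergence tolerance and the iteration cap are all choices made here for the example and are not taken from the original experiments.

```python
def transition_matrix(W):
    """Q = D^{-1} W: each row of W divided by its degree."""
    Q = []
    for row in W:
        d = sum(row)
        Q.append([w / d if d > 0 else 0.0 for w in row])
    return Q


def initial_probabilities(n, positives):
    """p^o: uniform mass 1/|V_M| on annotated genes, zero elsewhere."""
    p0 = [0.0] * n
    for i in positives:
        p0[i] = 1.0 / len(positives)
    return p0


def step(Q, p):
    """One update p <- Q^T p."""
    n = len(p)
    return [sum(Q[i][j] * p[i] for i in range(n)) for j in range(n)]


def random_walk(W, positives, n_steps):
    """RW update rule (21) repeated for a fixed number of steps."""
    Q = transition_matrix(W)
    p = initial_probabilities(len(W), positives)
    for _ in range(n_steps):
        p = step(Q, p)
    return p


def random_walk_with_restart(W, positives, theta, tol=1e-10, max_iter=1000):
    """RWR update rule (22) iterated until the L1 change falls below tol."""
    Q = transition_matrix(W)
    p0 = initial_probabilities(len(W), positives)
    p = p0[:]
    for _ in range(max_iter):
        walked = step(Q, p)
        new_p = [(1.0 - theta) * w + theta * r for w, r in zip(walked, p0)]
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p
```

On a three-node path graph with node 0 annotated, one RW step moves all the probability mass to node 1, which illustrates how the plain walk quickly loses the prior information, whereas RWR with a large restart probability keeps most of the mass on node 0.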

2.5.3. Guilt by association methods

As a baseline gene prioritization method we applied a simple implementation of the guilt-by-association (GBA) principle. According to this general biological principle, a biomolecular entity that interacts or shares some features with another biomolecular entity can also share some specific biological property (for instance, its membership to a given MeSH category). In computational biology this basic biological principle has been exploited to develop methods able to assign a given biological or molecular property on the basis of the labeling of neighborhoods in biomolecular networks [56,62]. In the context of gene prioritization problems, we can assess the likelihood that a given gene belongs to a given MeSH category $M$ on the basis of the $M$-labeled genes directly connected to the gene under study.

We implemented a simple version of the GBA approach, in which the score for each gene is computed by choosing the maximum of the weights $w_{ij} \in \mathbf{W}$ of the edges connecting the gene $i$ to positive labeled genes $j \in V_M$ in the neighborhood $N(i)$ of $i$:

$$S(i, M) = \max_{j \in N(i)} w_{ij} \tag{23}$$
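The GBA score (23) reduces to a single pass over the annotated neighbours of a gene. The sketch below assumes a dense weight matrix, a hypothetical function name, and two conventions that the text does not specify: a gene is not counted as its own neighbour, and a gene with no annotated neighbour receives score 0.

```python
def gba_score(W, i, positives):
    """Largest edge weight between gene i and its positively labelled neighbours."""
    # Assumption: self-loops are ignored and missing edges have weight 0.
    weights = [W[i][j] for j in positives if j != i and W[i][j] > 0]
    return max(weights) if weights else 0.0
```

With binary networks the score is simply 1 for genes linked to at least one annotated gene and 0 otherwise, which explains why GBA produces many ties on unweighted networks.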


3. Results and discussion

3.1. Experimental set-up

One of the main goals of this work consists in performing an extensive analysis of gene-disease associations, considering a large set of diseases.

Moreover, we experimentally investigated the impact of network integration on gene prioritization, by performing a quantitative comparison of the accuracy achieved by the methods described in Section 2.5 using each of the single gene networks considered in Section 2.2 with that obtained through the network integration methods introduced in Section 2.4.

More precisely, at first we assessed the "informativeness" of each single gene network by analyzing the performance of GBA, RW, RWR and kernelized score function methods. Then we performed a systematic analysis of both unweighted and weighted network integration methods, by combining at first the six binary gene interaction networks and then by exploiting also the real-valued semantic similarity-based gene networks through the integration of all the available nine different nets (Table 1).

Moreover we indicated some unannotated genes as reliable "disease gene" candidates for a selected set of MeSH diseases for which we obtained robust and accurate predictions.

3.2. Evaluation of the gene prioritization and network integration methods

The generalization performance of each gene prioritization and network integration method was assessed through a classical cross-validation procedure [63], setting the number of folds to five. More precisely, the nodes of the graph were randomly partitioned into five folds, and in turn each fold is selected as the test fold, while the remaining ones are the training folds. The labels of the test fold are removed, and the labels of the training folds are used to infer the scores to be assigned to the nodes of the test fold (in our setting we deal with gene prioritization, i.e. a ranking problem). Finally, having the scores predicted for each of the five folds (that is, for the entire set of the available genes), we can apply standard measures to evaluate the correctness of the obtained gene ranking with respect to each disease. In particular we applied the AUC to evaluate the ranking of the genes. Moreover, we also applied the precision at a given recall to take into account that for several MeSH diseases we have a relatively low number of known disease genes (positive examples).
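The evaluation protocol described above can be sketched as follows. All names are hypothetical, `score_fn(i, train_pos)` stands for any of the scoring methods of Section 2.5 and is expected to cope with an empty list of training positives, and the AUC is computed with a simple quadratic pairwise count that is adequate for an illustration but not optimized.

```python
import random


def kfold_partition(n, n_folds=5, seed=0):
    """Randomly split node indices 0..n-1 into n_folds disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def cross_validated_scores(n, positives, score_fn, n_folds=5, seed=0):
    """Score every node using only the labels of the other folds."""
    scores = [0.0] * n
    positives = set(positives)
    for fold in kfold_partition(n, n_folds, seed):
        train_pos = sorted(positives - set(fold))  # labels of the test fold are removed
        for i in fold:
            scores[i] = score_fn(i, train_pos)
    return scores


def auc(scores, positives):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = set(positives)
    p = [s for i, s in enumerate(scores) if i in pos]
    n = [s for i, s in enumerate(scores) if i not in pos]
    wins = sum((a > b) + 0.5 * (a == b) for a in p for b in n)
    return wins / (len(p) * len(n))


def precision_at_recall(scores, positives, recall):
    """Precision at the first rank where the requested recall is reached."""
    pos = set(positives)
    needed = recall * len(pos)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    found = 0
    for rank, i in enumerate(order, start=1):
        found += i in pos
        if found > 0 and found >= needed:
            return found / rank
    return 0.0
```

Precision at a fixed recall complements the AUC when positives are rare, because it only looks at the top of the ranking, where the few known disease genes are expected to appear.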

After the assessment of the generalization performance of the gene prioritization and network integration methods, we reported for each of the considered 708 MeSH diseases the p-value obtained through a nonparametric statistical test based on the "shuffling" of the gene labels (Section 3.6). Then we reported the 10 top-ranked unannotated genes for each MeSH disease, and we also performed an analysis of the unannotated genes as reliable "disease gene" candidates on the basis of the distribution of the scores of the annotated genes for the MeSH diseases for which we obtained a very high estimated cross-validated AUC value.

We point out that the reported results are based, in accordance with the literature on gene prioritization, on retrospective benchmarks, and for this reason usually offer optimistic estimates of the generalization performance, since disease associations are likely to be directly or indirectly incorporated in the gene prioritization data sources [3]. As outlined in [64], this problem is difficult to address in an initial study and can be resolved only by long-term prospective benchmarks, wherein predictions are made on the current state of knowledge (that is, the currently available annotations) and validated in future studies, that is, once novel experimental evidence of disease associations becomes available.

3.3. Gene prioritization with single networks

We performed an assessment of the "informativeness" of each gene network through an extensive experimental evaluation of the average AUC results across 708 MeSH diseases, using different gene prioritization methods (Table 2). The first column of Table 2 shows the gene prioritization methods and their main associated learning parameters (see Section 2.5 for details). For each column the best average AUC results achieved by the gene prioritization methods are highlighted in bold. $S_{AV}$ and $S_{kNN}$ kernelized score functions usually achieve the best results, but RW and RWR algorithms are also sometimes comparable with kernelized score functions. The difference is statistically significant (Wilcoxon rank sum test, $\alpha = 0.01$) in favor of kernelized score functions for the data sets dbnet, finet, hnnet, bpnet and ccnet, while for the other four functional networks no statistically significant difference has been detected.

The last row of Table 2 shows the average results across methods for each gene network. We can observe that on average gene prioritization methods achieve the best results with finet and gcnet, but the AUC performances are relatively high also with hnnet and bpnet. The other nets appear to be less informative on average, but note that a certain amount of learning is achieved with each of the considered networks, since the average AUC is always significantly larger than 0.5.

It is not surprising that finet, gcnet (and also hnnet) are the most "informative" networks, since they are constructed by integrating different sources of information (Section 2.2). We only observe that with gcnet the results refer only to a subset of the genes used in our experiments (Table 1). It is also worth noting the good results obtained with semantic similarity-based networks constructed from biological process GO annotations (bpnet), even if also in this case the results are computed with respect to a subset of the genes, and hence the comparison must be considered with a certain caution. Summarizing, the results show that all the considered gene networks carry some information about gene prioritization with respect to MeSH diseases. In particular, networks constructed through the integration of different sources of evidence seem to be the most "informative" for this gene ranking task.

3.4. Gene prioritization with unweighted network integration

Our network integration experiments started with the combination of the six binary gene networks described in Section 2.2 (that is, all the available gene networks excluding real-valued semantic similarity-based nets), using the unweighted combination methods presented in Section 2.4.2. Table 3 reports the average AUC results across MeSH diseases with UA, PUA and MAX integration methods. Note that we did not perform "soft" MIN integration since it is easy to see that with binary networks this method is indistinguishable from MAX, while "drastic" MIN leads to highly disconnected networks.

Comparing Tables 2 and 3, we can observe that unweighted integration improves the performance. This is true especially with UA and PUA methods (the difference is almost always statistically significant at $\alpha = 0.01$ significance level), but in several cases also with MAX. The improvement depends also on the gene prioritization method used. For instance unweighted integration degrades performance with $S_{NN}$ (at least with respect to the most informative single gene networks), while with the other kernelized score functions and with GBA, RW and RWR algorithms unweighted integration often improves AUC results. While a larger number of steps improves the performance of kernelized score functions, with the classical RW algorithm we observe a degradation of the performance. These results show that the classical RW tends to "forget" the initial "a priori" knowledge, while kernelized score functions retain


Table 2

Single gene networks: AUC results averaged across 708 MeSH diseases. The last row shows the average results across methods for each gene network.

| Method | cmnet | bgnet | dbnet | finet | hnnet | gcnet | bpnet | mfnet | ccnet |
|---|---|---|---|---|---|---|---|---|---|
| GBA | 0.6620 | 0.6389 | 0.6683 | 0.7542 | 0.7323 | 0.7346 | 0.7134 | 0.6395 | 0.6250 |
| RW 1 step | 0.6922 | 0.6590 | 0.6037 | 0.7356 | 0.7269 | 0.8418 | 0.7646 | 0.6985 | 0.6845 |
| RW 2 steps | 0.6829 | 0.6462 | 0.6761 | 0.8194 | 0.7802 | 0.8220 | 0.7635 | 0.7013 | 0.6812 |
| RW 3 steps | 0.6768 | 0.6406 | 0.6531 | 0.8157 | 0.7531 | 0.8145 | 0.7611 | 0.6985 | 0.6745 |
| RW 5 steps | 0.6718 | 0.6316 | 0.6426 | 0.7993 | 0.6973 | 0.8089 | 0.7610 | 0.6834 | 0.6711 |
| RW 10 steps | 0.6694 | 0.6224 | 0.6222 | 0.7575 | 0.6249 | 0.8075 | 0.7411 | 0.6790 | 0.6684 |
| RWR $\theta$=0.6 | 0.6871 | 0.6515 | 0.6781 | 0.8271 | 0.7889 | 0.8401 | 0.7825 | 0.7112 | 0.6856 |
| RWR $\theta$=0.9 | 0.6878 | 0.6513 | 0.6750 | 0.8242 | 0.7870 | 0.8453 | 0.7789 | 0.7085 | 0.6825 |
| $S_{AV}$ 1 step | 0.6894 | 0.6574 | 0.6717 | 0.7669 | 0.7596 | 0.8167 | 0.7889 | 0.7139 | 0.6916 |
| $S_{AV}$ 2 steps | 0.6842 | 0.6414 | 0.6831 | 0.8226 | 0.7872 | 0.8328 | 0.7888 | 0.7142 | 0.6914 |
| $S_{AV}$ 3 steps | 0.6845 | 0.6417 | 0.6752 | 0.8255 | 0.7897 | 0.8417 | 0.7879 | 0.7146 | 0.6913 |
| $S_{AV}$ 5 steps | 0.6850 | 0.6418 | 0.6778 | 0.8287 | 0.7943 | 0.8471 | 0.7839 | 0.7151 | 0.6907 |
| $S_{AV}$ 10 steps | 0.6849 | 0.6408 | 0.6804 | 0.8312 | 0.7983 | 0.8407 | 0.7640 | 0.7117 | 0.6882 |
| $S_{NN}$ 1 step | 0.6296 | 0.6263 | 0.6667 | 0.7561 | 0.7374 | 0.7308 | 0.6971 | 0.6485 | 0.6565 |
| $S_{NN}$ 2 steps | 0.6235 | 0.6105 | 0.6764 | 0.8031 | 0.7624 | 0.7316 | 0.7032 | 0.6478 | 0.6567 |
| $S_{NN}$ 3 steps | 0.6228 | 0.6105 | 0.6683 | 0.8044 | 0.7638 | 0.7365 | 0.7103 | 0.6475 | 0.6574 |
| $S_{NN}$ 5 steps | 0.6213 | 0.6107 | 0.6708 | 0.8052 | 0.7674 | 0.7481 | 0.7280 | 0.6475 | 0.6593 |
| $S_{NN}$ 10 steps | 0.6197 | 0.6136 | 0.6744 | 0.8029 | 0.7729 | 0.7774 | 0.7703 | 0.6493 | 0.6659 |
| $S_{kNN}$ 1 step, k=3 | 0.6439 | 0.6336 | 0.6705 | 0.7635 | 0.7523 | 0.7370 | 0.7645 | 0.6812 | 0.6712 |
| $S_{kNN}$ 2 steps, k=3 | 0.6377 | 0.6179 | 0.6817 | 0.8149 | 0.7788 | 0.7403 | 0.7705 | 0.6937 | 0.6725 |
| $S_{kNN}$ 3 steps, k=3 | 0.6371 | 0.6183 | 0.6737 | 0.8168 | 0.7805 | 0.7482 | 0.7765 | 0.6999 | 0.6756 |
| $S_{kNN}$ 5 steps, k=3 | 0.6362 | 0.6191 | 0.6763 | 0.8182 | 0.7845 | 0.7647 | 0.7815 | 0.7003 | 0.6788 |
| $S_{kNN}$ 10 steps, k=3 | 0.6366 | 0.6225 | 0.6798 | 0.8172 | 0.7898 | 0.7993 | 0.7695 | 0.7021 | 0.6803 |
| $S_{kNN}$ 1 step, k=19 | 0.6811 | 0.6523 | 0.6717 | 0.7668 | 0.7596 | 0.7860 | 0.7702 | 0.6997 | 0.6798 |
| $S_{kNN}$ 2 steps, k=19 | 0.6756 | 0.6364 | 0.6831 | 0.8222 | 0.7871 | 0.8004 | 0.7763 | 0.7001 | 0.6799 |
| $S_{kNN}$ 3 steps, k=19 | 0.6755 | 0.6368 | 0.6752 | 0.8249 | 0.7895 | 0.8125 | 0.7819 | 0.7008 | 0.6801 |
| $S_{kNN}$ 5 steps, k=19 | 0.6757 | 0.6373 | 0.6779 | 0.8276 | 0.7940 | 0.8286 | 0.7902 | 0.7025 | 0.6807 |
| $S_{kNN}$ 10 steps, k=19 | 0.6766 | 0.6373 | 0.6810 | 0.8292 | 0.7986 | 0.8402 | 0.7774 | 0.7063 | 0.6820 |
| Average | 0.6625 | 0.6338 | 0.6691 | 0.8029 | 0.7657 | 0.7955 | 0.7624 | 0.6899 | 0.6751 |

the prior information and are able to exploit the overall topology of the network, confirming previous results [40,31].

Hereinafter we limited the integration experiments to kernelized score functions only, since they usually perform equally or better than the other compared methods, and their empirical time complexity is significantly lower than that of RW and RWR algorithms: for instance, while an entire cycle of cross-validation on the 708

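Table 3

Unweighted integration of the six binary gene networks (without semantic similarity-based nets): AUC results averaged across 708 MeSH categories.

| Method | UA | PUA | MAX |
|---|---|---|---|
| GBA | 0.8313 | 0.8291 | 0.6589 |
| RW 1 step | 0.8566 | 0.8563 | 0.8501 |
| RW 2 steps | 0.8186 | 0.8178 | 0.8154 |
| RW 3 steps | 0.7937 | 0.7925 | 0.7897 |
| RW 5 steps | 0.7773 | 0.7760 | 0.7746 |
| RW 10 steps | 0.7720 | 0.7704 | 0.7706 |
| RWR $\theta$=0.6 | 0.8533 | 0.8528 | 0.8520 |
| RWR $\theta$=0.9 | 0.8565 | 0.8531 | 0.8476 |
| $S_{AV}$ 1 step | 0.8538 | 0.8530 | 0.8286 |
| $S_{AV}$ 2 steps | 0.8562 | 0.8554 | 0.8353 |
| $S_{AV}$ 3 steps | 0.8580 | 0.8571 | 0.8405 |
| $S_{AV}$ 5 steps | 0.8596 | 0.8587 | 0.8470 |
| $S_{AV}$ 10 steps | 0.8548 | 0.8540 | 0.8485 |
| $S_{NN}$ 1 step | 0.6934 | 0.6921 | 0.6352 |
| $S_{NN}$ 2 steps | 0.6950 | 0.6936 | 0.6331 |
| $S_{NN}$ 3 steps | 0.6968 | 0.6954 | 0.6315 |
| $S_{NN}$ 5 steps | 0.7020 | 0.7004 | 0.6314 |
| $S_{NN}$ 10 steps | 0.7251 | 0.7230 | 0.6546 |
| $S_{kNN}$ 1 step, k=3 | 0.7280 | 0.7266 | 0.6593 |
| $S_{kNN}$ 2 steps, k=3 | 0.7304 | 0.7289 | 0.6581 |
| $S_{kNN}$ 3 steps, k=3 | 0.7332 | 0.7317 | 0.6580 |
| $S_{kNN}$ 5 steps, k=3 | 0.7405 | 0.7389 | 0.6627 |
| $S_{kNN}$ 10 steps, k=3 | 0.7636 | 0.7616 | 0.6987 |
| $S_{kNN}$ 1 step, k=19 | 0.8138 | 0.8124 | 0.7598 |
| $S_{kNN}$ 2 steps, k=19 | 0.8170 | 0.8155 | 0.7639 |
| $S_{kNN}$ 3 steps, k=19 | 0.8199 | 0.8183 | 0.7680 |
| $S_{kNN}$ 5 steps, k=19 | 0.8251 | 0.8233 | 0.7785 |
| $S_{kNN}$ 10 steps, k=19 | 0.8374 | 0.8356 | 0.8093 |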

MeSH classes with UA integration requires hours with RWR, the same task requires only a few minutes with kernelized score functions, using an Intel i7 2.80 GHz processor with 16 GB of RAM and a Linux system.

By adding the real-valued networks based on semantic similarity measures (Section 2.2), we observe a further significant enhancement of the overall performance, showing that the integration of different sources of evidence leads to better results (Table 4). For instance the performance of the UA approach with $S_{AV}$ using a five-step random walk kernel is boosted from 0.8596 to 0.8831 average AUC (the increment is significant at $\alpha = 10^{-30}$ significance level according to the Wilcoxon signed rank sum test). Note that the MIN integration fails on this task, since an "intersection" strategy in this context leads to a significant loss of information, and thus does not allow us to exploit the topological information underlying the entire network.

Fig. 3 provides a visual clue of the differences of average AUC across MeSH categories between unweighted integration methods

Table 4

Unweighted integration methods: AUC results averaged across 708 MeSH categories including all the available nine gene networks.

| Method | UA-all | PUA-all | MAX-all | MIN-all |
|---|---|---|---|---|
| $S_{AV}$ 1 step | 0.8765 | 0.8667 | 0.8286 | 0.6541 |
| $S_{AV}$ 2 steps | 0.8792 | 0.8701 | 0.8353 | 0.6694 |
| $S_{AV}$ 3 steps | 0.8811 | 0.8722 | 0.8405 | 0.6824 |
| $S_{AV}$ 5 steps | 0.8831 | 0.8744 | 0.8470 | 0.7023 |
| $S_{AV}$ 10 steps | 0.8761 | 0.8708 | 0.8485 | 0.7264 |
| $S_{NN}$ 1 step | 0.6950 | 0.7050 | 0.6352 | 0.6045 |
| $S_{NN}$ 2 steps | 0.6980 | 0.7080 | 0.6331 | 0.6087 |
| $S_{NN}$ 3 steps | 0.7014 | 0.7108 | 0.6315 | 0.6129 |
| $S_{NN}$ 5 steps | 0.7106 | 0.7185 | 0.6314 | 0.6212 |
| $S_{NN}$ 10 steps | 0.7437 | 0.7490 | 0.6546 | 0.6349 |
| $S_{kNN}$ 1 step, k=19 | 0.8322 | 0.8331 | 0.7598 | 0.6413 |
| $S_{kNN}$ 2 steps, k=19 | 0.8368 | 0.8372 | 0.7639 | 0.6520 |
| $S_{kNN}$ 3 steps, k=19 | 0.8413 | 0.8404 | 0.7680 | 0.6619 |
| $S_{kNN}$ 5 steps, k=19 | 0.8500 | 0.8465 | 0.7785 | 0.6789 |
| $S_{kNN}$ 10 steps, k=19 | 0.8665 | 0.8576 | 0.8093 | 0.7093 |


Fig. 3. Unweighted integration methods: differences of average AUC across MeSH diseases with respect to the best single gene network (finet). (a) UA (b) PUA (c) MAX (d) MIN.

and the best single gene network (finet). Fig. 3(d) confirms that MIN integration fails also in this task, for the same reasons explained above. On the contrary, UA and PUA integration provide significant enhancements with both $S_{AV}$ and $S_{kNN}$ (Fig. 3(a) and (b)). Note that unweighted integration with $S_{NN}$ results in a degradation of the performance (Fig. 3). We do not have a clear explanation for this fact, but we think that the instability of scores computed by using only one of the neighbours, combined with the impossibility of weighting or choosing the best sources of information, may add noise to the prediction process.

Summarizing, the results show that unweighted integration, and especially UA and PUA methods, significantly enhances gene prioritization results. All the considered gene prioritization methods, ranging from random walks to kernelized score functions (with the exception of $S_{NN}$), derive a benefit from unweighted integration.

Moreover, the integration of semantic similarity-based networks further improves the performance of gene prioritization. Note that with these networks, considered individually, gene prioritization methods do not attain high average AUC scores (at least with mfnet and ccnet, Table 2), but their integration significantly enhances gene prioritization results (Table 4), since they convey complementary information with respect to the other sources of evidence.

3.5. Gene prioritization with weighted network integration

We experimented also with WA and WAP network integration to explicitly take into account the "informativeness" of each gene network (Section 2.4.3). Table 5 shows that weighted integration significantly boosts the performance of kernelized score functions. In particular five-step $S_{AV}$ with weighted integration of all the nine available nets (WA-all, Table 5) reaches the highest average AUC score, but almost all the gene prioritization algorithms achieve their best results with WA and WAP integration.

This is more evident in Fig. 4, where we register a very high increment of the average AUC score with respect to the best single gene network. This is true for both $S_{AV}$ and $S_{kNN}$, while for $S_{NN}$ this behavior is limited to WAP methods only (Fig. 4(b) and (d)). Nevertheless, note that, on the contrary, $S_{NN}$ behaves badly with unweighted integration, independently of the combination method applied (Table 3).

To get more insight into the results obtained with unweighted and weighted integration methods, Fig. 5 compares the AUC scores for each class achieved by five-step $S_{AV}$ (one of the best gene prioritization methods) between unweighted and weighted integration

Table 5

Weighted integration methods: AUC results averaged across 708 MeSH categories. WA and WAP include only the first six functional networks, while WA-all and WAP-all include all the nine functional networks.

| Method | WA | WAP | WA-all | WAP-all |
|---|---|---|---|---|
| $S_{AV}$ 1 step | 0.8649 | 0.8680 | 0.8778 | 0.8768 |
| $S_{AV}$ 2 steps | 0.8733 | 0.8727 | 0.8828 | 0.8802 |
| $S_{AV}$ 3 steps | 0.8774 | 0.8763 | 0.8866 | 0.8830 |
| $S_{AV}$ 5 steps | 0.8817 | 0.8807 | 0.8904 | 0.8861 |
| $S_{AV}$ 10 steps | 0.8812 | 0.8823 | 0.8868 | 0.8850 |
| $S_{NN}$ 1 step | 0.7602 | 0.8080 | 0.7042 | 0.8165 |
| $S_{NN}$ 2 steps | 0.7692 | 0.8126 | 0.7155 | 0.8213 |
| $S_{NN}$ 3 steps | 0.7709 | 0.8159 | 0.7193 | 0.8240 |
| $S_{NN}$ 5 steps | 0.7753 | 0.8206 | 0.7303 | 0.8278 |
| $S_{NN}$ 10 steps | 0.7807 | 0.8241 | 0.7707 | 0.8328 |
| $S_{kNN}$ 1 step, k=19 | 0.8394 | 0.8570 | 0.8325 | 0.8650 |
| $S_{kNN}$ 2 steps, k=19 | 0.8476 | 0.8614 | 0.8427 | 0.8684 |
| $S_{kNN}$ 3 steps, k=19 | 0.8527 | 0.8651 | 0.8489 | 0.8716 |
| $S_{kNN}$ 5 steps, k=19 | 0.8614 | 0.8703 | 0.8611 | 0.8762 |
| $S_{kNN}$ 10 steps, k=19 | 0.8744 | 0.8768 | 0.8819 | 0.8784 |


Fig. 4. Weighted integration methods: differences of average AUC across MeSH categories with respect to the best single gene network (finet). Integration of six networks: (a) WA (b) WAP. Integration with nine networks including semantic similarity-based nets: (c) WA (d) WAP.

with respect to the best single network finet. A point in Fig. 5 represents the AUC score, relative to a MeSH disease, attained by the integration method and by the best single gene network. More precisely, the AUC value obtained by the integration method is represented on the ordinate, while on the abscissa we have the AUC value achieved with finet, i.e. the best single network. Points that lie above the bisector of the first quadrant represent MeSH diseases for which the integration method achieves better results than the single best gene network. In Fig. 5(a) most of the points lie above the bisector, showing that UA enhances results obtained with finet. By adding semantic similarity-based gene networks several points move above the bisector line (Fig. 5(b)), confirming that these networks add novel useful information for the gene prioritization task. Looking at Fig. 5(c) we observe that with WA integration, even without semantic similarity-based gene networks, most of the points lie above the bisector, and the results are better still when we integrate all the available networks (Fig. 5(d)).

Fig. 6 provides an overall picture of the distributions of AUC scores compared between different unweighted and weighted integration methods using five-step $S_{AV}$ as gene prioritization algorithm. White boxplots refer to weighted integration methods, light gray boxplots to unweighted integration methods without semantic similarity-based gene networks, and dark gray boxplots to unweighted methods integrating all the nine available gene networks. Weighted methods show the best results (especially when all the networks are integrated), but also UAall, that is UA integrating all the available nine nets, achieves quite similar results. All the considered methods behave better than the best single gene network (last boxplot in Fig. 6), except for MIN, which clearly fails on this task, as discussed above.

To obtain a more reliable comparison of the results obtained with different gene network integration methods, we applied to each pair of them the Wilcoxon signed rank sum test, to estimate whether a significant statistical difference exists using the best performing gene prioritization method ($S_{AV}$ five steps). Table 6 summarizes the main results: a "+" entry means that a significant statistical difference at 0.01 significance level is registered in favor of the method in the row with respect to the method in the column; a "−" entry means that the opposite holds, and a "=" entry stands for no significant difference between the methods.
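The pairwise comparison that produces the "+", "−" and "=" entries can be sketched as follows. This is an illustration only: it implements the two-sided Wilcoxon signed-rank test with the large-sample normal approximation and without tie or continuity corrections, which is an assumption made here for brevity; the function names are hypothetical, and in practice a statistical library routine would be preferable.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation, no tie correction).

    Returns (sum of positive ranks, sum of negative ranks, p-value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:  # tied absolute differences receive their average rank
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        avg_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, w_minus, p_value


def compare(row_auc, col_auc, alpha=0.01):
    """Return '+', '-' or '=' for the row method against the column method."""
    w_plus, w_minus, p = wilcoxon_signed_rank(row_auc, col_auc)
    if p >= alpha:
        return "="
    return "+" if w_plus > w_minus else "-"
```

Here `row_auc` and `col_auc` would be the per-disease AUC values (one entry per MeSH disease) of the two integration methods being compared; the test is paired because both methods are evaluated on the same set of diseases.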

We observe that weighted integration is always significantly better than or equal to all the other compared methods. In particular WA-all integration (that is, WA integrating all the available nets) is significantly better than all the other considered integration approaches. Note that also UA-all is always better than or equal to all the others (except WA-all), showing that also a simple unweighted integration, if a sufficiently large set of sources of evidence is provided, can achieve results comparable with the more computationally expensive weighted integration (recall that the weights of the integration are obtained by evaluating the AUC on each single gene network by internal cross-validation, see Section 2.4.3). Quite interestingly, WAP does not outperform WA: even if we construct a specific weighted network for each MeSH disease, this does not introduce a significant advantage (at least, on average). This fact could be explained by considering that the per-class integration (WAP) may introduce a certain overfitting to the data, while WA, by averaging the weights across classes and thus resulting in a single integrated network, could reduce the overfitting, acting as a sort of "regularization", confirming previous results obtained in the context of gene function prediction [55].

Considering that for a large number of diseases we have a relatively low number of annotated genes, we compared also the precision at different recall levels between different unweighted and weighted integration methods, using two-step $S_{AV}$ as gene


Fig. 5. Comparison of AUC results between network integration methods and the best single gene network (finet). Each point represents the AUC score obtained by $S_{AV}$ five steps with network integration methods (ordinate) and with the best single network finet (abscissa) on each of the 708 MeSH diseases. (a) UA with six networks; (b) UA with all the nine networks; (c) WA with six networks; (d) WA with all nine networks.

[Fig. 6: boxplots of AUC scores for WA, WAP, WAall, WAPall, UA, PUA, MAX, UAall, PUAall, MAXall, MINall and finet.]


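Table 6

Comparison between network integration methods: methods whose AUC performance is significantly better are marked with "+", significantly worse with "−" and with no significant difference with "=" (0.01 significance level, Wilcoxon signed rank sum test). The comparisons are in the sense rows vs. columns.

| | WAP-all | WA | WAP | UA-all | PUA-all | MAX-all | MIN-all | UA | PUA | MAX | finet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WA-all | + | + | + | + | + | + | + | + | + | + | + |
| WAP-all | | = | = | = | + | + | + | + | + | + | + |
| WA | | | = | = | + | + | + | + | + | + | + |
| WAP | | | | = | + | + | + | + | + | + | + |
| UA-all | | | | | + | + | + | + | + | + | + |
| PUA-all | | | | | | + | + | + | + | + | + |
| MAX-all | | | | | | | + | − | − | = | + |
| MIN-all | | | | | | | | − | − | − | − |
| UA | | | | | | | | | + | + | + |
| PUA | | | | | | | | | | + | + |
| MAX | | | | | | | | | | | + |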

prioritization algorithm (Fig. 7). With both the integration of the six basic networks (Fig. 7(a)) and the integration of the six basic networks plus the three semantic similarity-based networks (Fig. 7(b)) we achieve significantly better results with any of the considered integrated networks with respect to the best "single" network (finet), except for the MIN integration, which obtains the worst results. Also in this case weighted integration outperforms unweighted integration, but observe that when we integrate all the available networks UAall, i.e. the unweighted average integration, achieves better results than the weighted per-class integration (WAPall), confirming that WAP integration undergoes a certain overfitting to the data. Note that when semantic similarity-based networks are added, all the integration methods improve their precision/recall results (the scale of the ordinate, that is the precision, is equal in Fig. 7(a) and (b)). For instance WA, the best performing network integration method, improves its average precision at 20% recall from 0.26 to 0.30, with a relative increment of about 15% in precision. As a final observation, note that all the considered network integration methods (except MIN integration) significantly outperform the results obtained with the best single network, confirming that also simple unweighted integration algorithms are sufficient to boost the performance of gene prioritization methods.

3.6. Finding novel associations between genes and MeSH diseases

The common usage of gene ranking scores in gene-disease prioritization experiments consists in the selection of the top ranked unannotated genes and in their further characterization as possible "candidate" genes actually involved in the onset and progression of the considered disease.

To this end we provide for each of the 708 MeSH diseases the AUC obtained by five-fold cross-validation, the p-value achieved through a nonparametric randomized test (see below), and the 10 top ranked genes currently not annotated for the MeSH disease under study. A table summarizing this information is available at http://homes.di.unimi.it/re/suppmat/genesmeshnetwpred/supmatTBL1.html (accessed 30 November 2013).

Moreover, we also provide a preliminary analysis of the top ranked most reliable unannotated genes for the MeSH diseases predicted with high robustness and accuracy by the best network integration, i.e. WA integrating all the available nets using five-step $S_{AV}$ to prioritize genes.

To evaluate the robustness of the method we performed a nonparametric statistical test by randomly shuffling 1000 times the labels for each MeSH disease and counting how many times $m$ the AUC computed with randomly shuffled labels is larger than the AUC computed with the true labels. The resulting p-value is just the ratio $m/1000$. Interestingly enough, we achieve a p-value < 0.01 for 649 and a p-value < 0.05 for 676 of the 708 MeSH diseases. To choose MeSH diseases both robustly and accurately predicted we selected MeSH descriptors with an average AUC ≥ 0.975 and p-value < 0.01, resulting in a set of 24 diseases. For each of the selected diseases, we extracted the lowest score $c$ from the set of positive (annotated) genes. Then, we computed the empirical cumulative distribution of all the scores equal to or larger than $c$, considering both annotated and unannotated genes. As a final step, using the distribution computed at the previous step, we computed the k-percentiles of the three

Fig. 7. Comparison of the average precision at fixed levels of recall across the 708 MeSH diseases between network integration methods and the best single gene network [Precision vs. recall curves: (a) WA, WAP, UA, PUA, MAX, MIN and finet; (b) WAall, WAPall, UAall, PUAall, MAXall, MINall and finet.]


Table 7

List of 24 selected diseases and of the corresponding top ranked unannotated genes.

| Disease id. | Disease name | Top ranked unannotated genes |
|---|---|---|
| C535579 | Cardiofaciocutaneous syndrome | KSR2, PILRA, KSR1 |
| C536436 | Coffin-Siris syndrome | PYGO1, ARID2, SMARCC2 |
| C536664 | Peroxisome biogenesis disorders | PEX5, PEX7, LONP2 |
| C536783 | T-Lymphocytopenia | BIRC8, CASP10, NAIP |
| C536928 | Turcot syndrome | MLH3, PMS2L5, MSH3 |
| C537345 | Sitosterolemia | UGT1A5, UGT2B17, SLCO1B1 |
| C538169 | Acitretin embryopathy | CASP10, PEA15, SLCO3A1 |
| D000562 | Amebiasis | DCLRE1C, IL19, CYP2C8 |
| D001404 | Babesiosis | DCLRE1C, IL19, FCGR2C |
| D002062 | Bursitis | UGT2B4, UGT2B15, UGT1A4 |
| D006958 | Hyperostosis, Cortical, Congenital | NPPC, NPR1, ACE |
| D007888 | Leigh Disease | NDUFB10, NDUFB4, NDUFA12 |
| D008118 | Loiasis | FCGR2C, CYP3A43, CYP8B1 |
| D008375 | Maple Syrup Urine Disease | ACAD8, PDHX, PDHB |
| D009196 | Myeloproliferative Disorders | PTPN1, CISH, SLC25A40 |
| D009634 | Noonan Syndrome | KSR2, KSR1, MRAS |
| D010483 | Periapical Diseases | MMP13, IL12B, IL8 |
| D012214 | Rheumatic Heart Disease | CYP21A2, CYP8B1, CYP3A43 |
| D014353 | Trypanosomiasis, African | DCLRE1C, BCL2, STAT1 |
| D015823 | Acanthamoeba Keratitis | DCLRE1C, IL19, CYP2C8 |
| D018235 | Smooth Muscle Tumor | NFKB1, IL8, IL6 |
| D020299 | Intracranial Hemorrhage, Hypertensive | NPPC, NPPB, CRH |
| D056685 | Costello Syndrome | KSR2, PILRA, KSR1 |
| D056824 | Upper Extremity Deep Vein Thrombosis | FGGCX, PROZ, F11 |

top ranked unannotated genes within each selected MeSH term. Considering that we selected 24 MeSH diseases, this procedure led to a collection of 72 k-percentiles whose frequency is plotted in Fig. 8.

Fig. 8 shows that most of the top ranked unannotated genes are concentrated close to the 100-percentile, showing that these top ranked "false positive" genes are "strongly predicted" as possible candidate disease genes, since their scores are close to those of the top ranked annotated genes. Consider also that this is supported by the fact that we selected only diseases for which gene prioritization achieved a very high AUC and "robust" predictions (AUC ≥ 0.975 and p-value < 0.01). The gene symbols of the top three false positives, along with the disease identifiers and disease names for the selected 24 MeSH descriptors, are listed in Table 7.

Of course the proposed top ranked genes are only disease gene candidates, and these results need to be biologically interpreted and should undergo a rigorous biomedical analysis before being actually associated with the disease itself.
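The label-shuffling test and the percentile analysis can be sketched as below. The function names are hypothetical, and the reading of the "k-percentile" as the empirical cumulative distribution (over all scores not smaller than $c$) evaluated at the score of each top unannotated gene is an assumption made for this illustration.

```python
import random


def auc(scores, labels):
    """AUC from scores and 0/1 labels (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def shuffle_p_value(scores, labels, n_shuffles=1000, seed=0):
    """Fraction m / n_shuffles of label shuffles whose AUC exceeds the true AUC."""
    rng = random.Random(seed)
    true_auc = auc(scores, labels)
    shuffled = list(labels)
    m = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) > true_auc:
            m += 1
    return m / n_shuffles


def top_unannotated_percentiles(scores, labels, n_top=3):
    """Percentile of the n_top best unannotated genes among all scores >= c."""
    c = min(s for s, y in zip(scores, labels) if y == 1)  # lowest annotated score
    pool = [s for s in scores if s >= c]
    unannotated = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    # Empirical cumulative distribution evaluated at each score, as a percentage.
    return [100.0 * sum(1 for t in pool if t <= s) / len(pool) for s in unannotated[:n_top]]
```

A percentile close to 100 means that the unannotated gene scores as high as the best annotated genes of that disease, which is the situation described for most of the 72 values collected from the 24 selected diseases.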

Fig. 8. Frequency of the k-percentiles of the three top ranked unannotated genes. [Histogram: k-percentile of the top 3 false positive scores within each selected MeSH term (x axis) vs. frequency (y axis).]

4. Conclusions

We performed an extensive analysis of gene-disease associations not limited to genetic disorders, including more than 700 MeSH diseases.

By using network integration and gene prioritization methods, we reported for each disease the 10 unannotated top-ranked genes, available for further bio-medical analysis. Moreover, by analyzing the top-ranked predictions relative to the 24 best and robustly predicted MeSH diseases, we showed that our approach can detect reliable candidate disease genes.

It is well-known that the integration of multiple omics sources of evidence is of paramount importance in several application domains in computational biology [65–68]. In this work we performed a systematic comparison of unweighted integration and our proposed weighted combination methods to provide an evaluation of the impact of network integration on gene prioritization. We quantitatively showed that network integration is necessary to boost gene prioritization results, in accordance with previous results published in the literature [15,69,27,28,46,47].

In particular, we showed that the proposed weighted integration methods, by exploiting the different “informativeness” embedded in different gene interaction networks, significantly outperform unweighted integration. Moreover, our experimental results show that the performance strongly depends on the selection of the sources of evidence and on the characteristics of the gene networks. For instance, even a simple UA integration can significantly improve the performance of gene prioritization methods if a sufficient number of diverse and complementary gene interaction networks are combined. From this standpoint, a novel research line could be represented by an adaptation of test and select methods, originally proposed in the context of supervised ensembles [70], to appropriately choose the most predictive sources of evidence and gene networks for each MeSH disease through an adaptive learning process.
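The difference between unweighted and weighted integration can be illustrated with a minimal sketch. It assumes that every network is given as a square adjacency matrix over the same ordered set of genes, and that each network has a non-negative weight reflecting its “informativeness” (for instance, a performance measure estimated for that network alone). The function names, the toy matrices and the weights are hypothetical, and the weighting schemes actually used in this work may differ in how the weights are derived.

```python
def unweighted_average(networks):
    """Edge-wise mean of the adjacency matrices (UA-style integration)."""
    m = len(networks)
    n = len(networks[0])
    return [[sum(net[i][j] for net in networks) / m for j in range(n)]
            for i in range(n)]


def weighted_average(networks, weights):
    """Edge-wise weighted mean; weights are normalized to sum to 1."""
    if len(networks) != len(weights):
        raise ValueError("one weight per network is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    n = len(networks[0])
    return [[sum(w * net[i][j] for w, net in zip(weights, networks)) / total
             for j in range(n)]
            for i in range(n)]


# Two toy networks over the same two genes.
net_a = [[0.0, 1.0],
         [1.0, 0.0]]
net_b = [[0.0, 0.5],
         [0.5, 0.0]]

ua = unweighted_average([net_a, net_b])
# Hypothetical weights: net_a is assumed to be three times as informative.
wa = weighted_average([net_a, net_b], weights=[3.0, 1.0])
```

In this toy case the unweighted edge weight between the two genes is 0.75, while the weighted scheme moves it to 0.875, closer to the value in the network assumed to be more informative.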

Confirming previous results [30], semantic similarity-based networks, combined with other sources of evidence, boost the performance of gene prioritization methods. A possible improvement of the proposed approach could consist in combining networks based on semantic similarity measures that embed the ontology