Contents lists available at ScienceDirect
Artificial Intelligence in Medicine
journal homepage: www.elsevier.com/locate/aiim
An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods
Giorgio Valentini a,∗, Alberto Paccanaro b, Horacio Caniza b, Alfonso E. Romero b, Matteo Re a
a AnacletoLab – Dipartimento di Informatica, Università degli Studi di Milano, via Comelico 39/41, 20135 Milano, Italy
b Department of Computer Science and Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham TW20 0EX, UK
Article info
Article history:
Received 11 September 2013
Received in revised form 5 March 2014
Accepted 10 March 2014
Keywords:
Gene disease prioritization
Network integration
Heterogeneous data fusion
MeSH descriptors
Node label ranking
Abstract
Objective: In the context of “network medicine”, gene prioritization methods represent one of the main tools to discover candidate disease genes by exploiting the large amount of data covering different types of functional relationships between genes. Several works proposed to integrate multiple sources of data to improve disease gene prioritization, but to our knowledge no systematic studies focused on the quantitative evaluation of the impact of network integration on gene prioritization. In this paper, we aim at providing an extensive analysis of gene-disease associations not limited to genetic disorders, and a systematic comparison of different network integration methods for gene prioritization.
Materials and methods: We collected nine different functional networks representing different functional relationships between genes, and we combined them through both unweighted and weighted network integration methods. We then prioritized genes with respect to each of the considered 708 medical subject headings (MeSH) diseases by applying classical guilt-by-association, random walk and random walk with restart algorithms, and the recently proposed kernelized score functions.
Results: The results obtained with classical random walk algorithms and the best single network achieved an average area under the curve (AUC) across the 708 MeSH diseases of about 0.82, while kernelized score functions and network integration boosted the average AUC to about 0.89. Weighted integration, by exploiting the different “informativeness” embedded in different functional networks, outperforms unweighted integration at 0.01 significance level, according to the Wilcoxon signed rank sum test. For each MeSH disease we provide the top-ranked unannotated candidate genes, available for further biomedical investigation.
Conclusions: Network integration is necessary to boost the performances of gene prioritization methods. Moreover the methods based on kernelized score functions can further enhance disease gene ranking results, by adopting both local and global learning strategies, able to exploit the overall topology of the network.
© 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/3.0/).
1. Introduction
The rising awareness that a disease is rarely a consequence of an abnormality in a single gene, but is usually the result of complex interactions and perturbations involving large sets of genes and their relationships with several cellular components, has led to the development of “network medicine”, a network-based approach to human disease [1]. In this context, gene prioritization methods have progressed quickly with the aim of discovering candidate “disease” genes by exploiting the large amount of available “omics” data covering different types of relationships between genes [2].
∗ Corresponding author at: Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, Milano, Italy. Tel.: +39 02 50316225; fax: +39 02 50316373.
E-mail address: [email protected] (G. Valentini).
According to [3], automatic gene prioritization methods typically produce their outputs either by filtering the candidate genes into smaller subsets or by ranking the candidate genes.
Filtering methods are based on the definition of a set of criteria motivated by the available knowledge of the molecular basis of the disease under investigation. Their main objective is to reduce the set of potential disease genes by comparing all the candidate genes with a sort of gene template, which encodes the selection criteria in a set of rules [4,5]. Although they have proved effective [6,7], the hard filtering policy underlying them is a double-edged sword. Indeed, when a relevant gene fails to meet just one of the criteria encoded in the filter, it becomes a false negative, which prevents the detection of genes that are actually involved in the disease, but through mechanisms not previously reported in the literature.
http://dx.doi.org/10.1016/j.artmed.2014.03.003
The second class of gene prioritization methods (ranking-based) avoids the limitations of filtering methods simply by ranking candidates from most to least promising. As in the case of filtering methods, ranking-based methods can integrate multiple sources of evidence in the gene prioritization process. These methods can be further classified into three main categories [3]: text mining [8,9], similarity profiling and network analysis-based [10–13].
Although powerful in their ability to make very effective use of the available knowledge, text mining approaches show a strong bias toward the identification of straightforward candidates for which abundant knowledge is already available [14]. On the contrary, similarity profiling [15] and network analysis-based gene prioritization systems are not affected by this limitation. Indeed they can exploit both knowledge bases (increasing the specificity of predictions) and raw data (for novel predictions).
In particular, network-based methods are gaining increasing popularity in disease gene prioritization (see [16,17] for recent specific reviews). According to this approach, nodes represent genes and edges encode some notion of functional similarity between genes, e.g. direct molecular interactions, transcriptional co-expression/regulation, sequence or structure similarity or paralogy [18]; the prioritization list is then constructed by exploiting the topology and the edge weights of the network and a set of “core” genes known to be associated to the disease under study. In this category some methods used a random walk or a heat kernel [19], while others applied Web and social network methods on a protein–protein interaction (PPI) network [20], and other approaches exploited PPI and pathway information to prioritize candidate genes [21,15].
Most gene prioritization methods exploited different sources of information and gene networks [22,23], ranging from phenotypic similarities between diseases and functional similarity between genes [24], to GO ontology and InterPro domain annotations [25] and protein–protein interactions, gene expression and common membership to KEGG pathways [26], and also to several other sets of data sources [15,27,28] (see [22] for a more detailed presentation of the different combinations of sources of evidence exploited by recent disease gene prioritization methods).
Despite the large availability of works describing specific combinations of data sets to develop tools suitable for disease gene prioritization, “our understanding of how to perform useful predictions using multiple data sources or across biological networks is still rudimentary” [3], and in particular, to our knowledge, no systematic studies focused on the comparison of different network integration methods.
To help fill this gap, in this paper we propose, compare and analyze different network integration strategies to combine multiple gene networks constructed with different sources of single or heterogeneous data. In particular we apply simple unweighted integration methods, which combine gene networks solely on the basis of the structural characteristics of the nets, and we propose weighted integration methods that combine networks according to the “predictiveness strength” of each type of network, estimated through the assessment of the accuracy of the learning algorithm trained on each of the combined networks. We constructed and integrated nine different gene networks, including also semantic similarity-based gene networks, since it has been recently shown that they improve gene-disease prioritization [29,30].
Another contribution of this work consists in the application of the kernelized score functions to the gene-disease prioritization problem. This novel semi-supervised network method for node label ranking adopts both local and global learning strategies to learn from both the neighborhood of each node and at the same time from the overall topology of the network [31,32].
Another open issue is represented by the choice of the “seed genes” to characterize the diseases involved in the gene prioritization analysis [22]. Previous methods focused on specific diseases [33,34] or on genetic diseases [23,35] according e.g. to the Online Mendelian Inheritance in Man (OMIM) database [36]. In order to extend the analysis to a larger set of diseases, not limited to genetic disorders, in this work we used “seed genes” borrowed from the MeSH taxonomy of diseases [37], by exploiting gene-MeSH disease associations provided by the Comparative Toxicogenomics Database (CTD) [38].
Summarizing, our main contributions can be schematized as follows:
• We propose one of the widest gene-disease prioritization studies, involving gene-MeSH disease associations covering more than 700 diseases, not limited to genetic disorders.
• We propose novel weighted integration methods able to combine multiple networks according to the “predictiveness strength” of each source of data.
• A comparative analysis of different network-integration methods, and a quantitative evaluation of their impact on gene-disease prioritization.
• An extensive application of the kernelized score functions, a recently proposed semi-supervised network-based method that embeds local and global learning strategies, to the gene-disease prioritization problem.
This paper is structured as follows. In Section 2.1 we introduce MeSH and the pipeline we applied to annotate the “seed genes” used in our experiments. Section 2.2 describes the functional networks considered in our experiments. Then in Section 2.4 the unweighted and weighted integration methods and in Section 2.5 the gene prioritization methods used in our experiments are introduced. The overall experimental setting is described in Section 3.1, and the results relative to the application of the gene prioritization methods to the single functional networks are discussed in Section 3.3. These results are then quantitatively compared with those obtained through unweighted (Section 3.4) and weighted (Section 3.5) network integration methods, while in Section 3.6 the top-ranked unannotated genes and the AUC and p-value associated to each of the 708 MeSH diseases analyzed in this work are presented. The conclusions outline the main findings of this work and suggest novel research lines in the context of the gene prioritization and network integration problems.
2. Materials and methods
2.1. MeSH: medical subject headings
MeSH is a controlled vocabulary produced by the National Library of Medicine for indexing, cataloging, and searching biomedical and health-related information and documents (http://www.nlm.nih.gov/mesh, accessed 30 November 2013). The descriptors or subject headings of MeSH are arranged in a hierarchy. MeSH covers a broad range of topics and its current version consists of 16 top level categories. The MeSH thesaurus is used for indexing articles from the world’s leading biomedical journals for the MEDLINE/PubMED database. One of the MeSH top level terms (Diseases) is used to label the gene sets used in our experiments and to evaluate the impact of network integration on the inference of relationships between genes and diseases.
The associations between the genes and the MeSH disease terms have been downloaded from the CTD [38], a public resource that provides information about the interaction of environmental chemicals with gene products and their effects on human diseases. These relationships are annotated from the scientific literature by professional biocurators who manually curate a triad of core interactions including chemical-gene, chemical-disease and gene-disease relationships. The CTD integrates these core data to generate inferred chemical-gene-disease networks.
Fig. 1. Pipeline of the gene–MeSH disease annotation process.
To provide a “gold standard” of “seed genes” to infer novel gene-disease associations, we first downloaded the associations between the human genes considered in our experiments (Section 2.2) and all the MeSH disease terms available in CTD. We then filtered out all the diseases associated with fewer than five or more than 200 genes, in order both to ensure a minimum amount of a priori information for our prediction tasks and to avoid classes whose associated gene sets are too heterogeneous. This led to the definition of a set composed of 708 MeSH diseases (Fig. 1).
The full set of the “gold standard” seed genes–MeSH disease associations is available from http://homes.di.unimi.it/valentini/DATA/DiseaseGeneNetworks (accessed 30.11.13).
It is worth noting that the MeSH controlled vocabulary of diseases has just been proposed in the context of text-mining-based gene prioritization [39], but those results cannot be safely generalized to network-based methods, since text-mining approaches show a bias toward genes for which a large “a priori” knowledge is actually available in the literature [14].
2.2. Functional networks
We collected different sources of data to represent different functional relationships between genes. More precisely, we constructed gene networks using physical and genetic interactions, transcriptional co-expression/regulation and localization, protein domain and gene chemical interactions, co-occurrence of disease-gene pairs in scientific texts, homologues implicated in generating similar phenotypes in other organisms, common molecular pathways between gene products, and common GO annotations.
Table 1 summarizes the main characteristics of the nine gene functional networks used in our experiments. Each gene network includes a set S of 8449 genes (or a subset of them) selected according to the procedures described in [40]. We considered a set of genes for which sufficient functional data are available, and for which a relatively comparable coverage across gene networks can be assured. In this way, on the one hand a certain amount of functional information is ensured for each gene, and on the other hand the available information for each considered gene is comparable.
In the rest of this section we provide a brief description of each gene network. The full data sets are downloadable from: http://homes.di.unimi.it/valentini/DATA/DiseaseGeneNetworks (accessed 30.11.13).
2.2.1. Functional interaction network – finet
In [41] Wu and colleagues constructed a functional protein interaction network based on functional interactions predicted by a Naive Bayes classifier trained on pairwise relationships extracted from curated pathways and non-curated sources of information, including protein–protein interactions, gene co-expression, protein domain interaction, Gene Ontology (GO) annotations and text-mined protein interactions. From the original network we extracted the subnetwork including the subset S of genes used in our experiments.
2.2.2. Humannet – hnnet
Similar in spirit to the approach in [41], the functional network construction method presented in [27] by Lee and colleagues integrates diverse lines of evidence in order to produce a functional human gene network. It has been used in several tests to predict causal genes for human diseases and to increase the power of genome-wide association studies. Also in this case we extracted from HumanNet the subnetwork including the subset S of genes.

Table 1
Characteristics of the gene networks used in our experiments.

| Network | Description | Type | Nodes | Edges | Density |
|---|---|---|---|---|---|
| finet | Obtained from multiple sources of evidence | Binary | 8449 | 271466 | 0.0038 |
| hnnet | Obtained from multiple sources of evidence | Binary | 8449 | 502222 | 0.0070 |
| cmnet | Network projections from cancer modules | Binary | 8449 | 3414722 | 0.0478 |
| gcnet | Network projections from CTD | Binary | 7649 | 1421298 | 0.0242 |
| bgnet | Network projections from BioGRID | Binary | 8449 | 120169 | 0.0016 |
| dbnet | Direct relationships obtained from BioGRID | Binary | 8449 | 3023084 | 0.0423 |
| bpnet | Semantic similarity network from GO BP | Real valued | 6923 | 44506147 | 0.9286 |
| mfnet | Semantic similarity network from GO MF | Real valued | 6145 | 26611887 | 0.7047 |

Fig. 2. Simplified representation of bipartite network projections into homogeneous gene networks. (a) Binary projection to construct the cmnet network; (b) sum projection to construct gcnet. Circles represent genes, squares represent cancer modules (a) and chemicals (b).

2.2.3. Cancer module network – cmnet
By exploiting gene expression profiling, Segal and colleagues constructed a functional module map for cancer to investigate commonalities and variations between different types of tumor [42]. In their work the authors analyzed a collection of expression profiles with the aim of identifying sets of genes that act in concert to carry out specific functions in different cancer types, and then produced a module map constituted by a collection of the gene sets associated to specific cancer gene modules.
We used the relationships between the human genes and Segal’s cancer modules [42] to construct a bipartite network. This network has been projected onto the gene space, thus originating the cmnet network. The type of projection used in the construction of cmnet is a binary bipartite network projection, meaning that the weight of the edge linking two genes in the projected network is 1 if the two genes share at least one neighbour in the original bipartite network and 0 otherwise (Fig. 2a).
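The binary projection just described can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the membership matrix `B_example`, the function name `binary_projection` and the choice of removing self-loops are hypothetical and are not taken from the original pipeline.

```python
import numpy as np

def binary_projection(B):
    """Project a genes-by-modules 0/1 membership matrix onto the gene space.

    Two genes are linked (weight 1) if they share at least one module.
    """
    B = np.asarray(B, dtype=int)
    shared = B @ B.T                 # number of modules shared by each gene pair
    A = (shared > 0).astype(int)     # binary projection: any shared module -> 1
    np.fill_diagonal(A, 0)           # assumption of this sketch: no self-loops
    return A

# Toy bipartite network: 4 genes (rows), 2 cancer modules (columns).
B_example = np.array([
    [1, 0],   # gene 0 belongs to module 0
    [1, 1],   # gene 1 belongs to modules 0 and 1
    [0, 1],   # gene 2 belongs to module 1
    [0, 0],   # gene 3 belongs to no module
])
A_example = binary_projection(B_example)
```

In this toy case gene 1 is linked to both gene 0 and gene 2, while genes 0 and 2 stay unconnected because they share no module.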
2.2.4. Gene chemical network – gcnet
The CTD stores information mined from literature about the interactions between genes, chemicals and diseases in many species. Since one of the objectives of this work is the evaluation of the capabilities of heterogeneous network integration in the prediction of gene–disease relationships, we used the gene–chemical relationships available in the CTD to construct a gene interaction network (gcnet). To this end we downloaded from CTD the chemicals–genes interactions file (http://ctdbase.org/reports/CTDchemgeneixns.csv.gz, accessed 30.11.13) and we constructed a bipartite network. We then performed a SUM projection onto the gene space, by which the weight of an edge linking two genes equals the number of the common neighbors of the genes in the bipartite network. The resulting network has finally been binarized using a cutoff of five or more common chemical interactors to set a binary interaction between a pair of genes (Fig. 2b).
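A minimal sketch of the SUM projection followed by binarization is given below. The toy gene-by-chemical matrix and the function names are hypothetical; only the rule "five or more common chemical interactors" comes from the text.

```python
import numpy as np

def sum_projection(B):
    """Weight of edge (i, j) = number of chemicals shared by genes i and j."""
    B = np.asarray(B, dtype=int)
    S = B @ B.T
    np.fill_diagonal(S, 0)           # assumption of this sketch: no self-loops
    return S

def binarize(S, cutoff=5):
    """Keep an edge only if the genes share at least `cutoff` chemicals."""
    return (np.asarray(S) >= cutoff).astype(int)

# Toy bipartite network: 3 genes (rows), 6 chemicals (columns).
B_chem = np.array([
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
])
S_chem = sum_projection(B_chem)      # genes 0 and 1 share 5 chemicals
A_chem = binarize(S_chem, cutoff=5)  # only the edge (0, 1) survives
```

Compared with the binary projection, the SUM projection keeps the number of shared neighbours, so the cutoff can be used to discard gene pairs supported by only a few chemicals.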
2.2.5. BioGRID database network – dbnet
This is a protein–protein interaction network constructed using direct physical and genetic interactions obtained from BioGRID [43] (v. 3.2.96 – January 2013).
2.2.6. BioGRID projected network – bgnet
Instead of setting up a binary interaction network based on the direct interactions between the S genes, we constructed a bipartite network based on the content of BioGRID, using as top nodes the S genes and as bottom nodes all the human genes B available in BioGRID. More precisely, if BioGRID contains an interaction between a node $a \in S$ and $x \in B$, we added the $(a, x)$ edge to the bipartite network. Then, according to a binary projection onto the S space, an edge $(a, b)$, $a \in S$, $b \in S$, is added to the projected network if $a$ and $b$ share at least one common node $x \in B$ in their neighborhoods in the bipartite network. In this way we can capture indirect interactions between pairs of genes.
2.2.7. Semantic similarity-based networks: bpnet, mfnet and ccnet
The last three networks considered in this work have been constructed by computing the Resnik semantic similarities [44] between the terms of each division of the Gene Ontology: biological process, molecular function and cellular component. We obtained a pairwise gene similarity measure by choosing the maximum Resnik semantic similarity between all the terms for which the two genes are annotated. The resulting networks were named bpnet, mfnet and ccnet respectively. The semantic similarity measures have been computed using a MATLAB application implementing methods described in [45].
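The "maximum over annotated terms" rule can be illustrated as follows. The sketch assumes that term-to-term Resnik similarities have already been computed and stored in a matrix; the numbers in `term_sim`, the annotation lists and the rule for unannotated genes are invented for illustration and do not come from the original MATLAB application.

```python
import numpy as np

def max_term_similarity(term_sim, terms_a, terms_b):
    """Gene-gene similarity as the maximum similarity between their GO terms."""
    if len(terms_a) == 0 or len(terms_b) == 0:
        return 0.0  # assumption: unannotated genes get similarity 0
    return float(term_sim[np.ix_(terms_a, terms_b)].max())

# Hypothetical precomputed Resnik similarities between four GO terms (0..3).
term_sim = np.array([
    [3.0, 1.2, 0.0, 0.5],
    [1.2, 2.5, 0.7, 0.0],
    [0.0, 0.7, 4.1, 2.2],
    [0.5, 0.0, 2.2, 3.6],
])
gene_a_terms = [0, 1]   # GO terms annotated to gene a
gene_b_terms = [2, 3]   # GO terms annotated to gene b
sim_ab = max_term_similarity(term_sim, gene_a_terms, gene_b_terms)
```

Here the best-matching pair of terms is (1, 2), so the edge weight between the two genes is the similarity of that pair.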
2.3. Basic notation
Gene networks for disease prioritization can be represented through an undirected weighted graph $G = (V, E)$, where $V$ is the set of vertices corresponding to genes and $E$ the set of edges corresponding to some notion of functional relationship between pairs of genes/vertices. Vertices of the graph and genes can be denoted with natural numbers $1, 2, \ldots, n$, since each vertex of $G$ is univocally associated to a gene. The corresponding adjacency matrix $\mathbf{W}$ with weights $w_{ij}$ represents the “strength” of the relationship between vertices $i, j \in V$; $V_M \subset V$ denotes a subset of “positive” vertices belonging to a specific MeSH subject heading $M$ (e.g. a MeSH descriptor of a disease – Section 2.1).
We considered the integration of $n$ gene networks, $G^d = (V^d, E^d)$, $1 \le d \le n$, and we denote by $\bar{G}$ the integrated network $\bar{G} = (\bar{V}, \bar{E})$, with $\bar{V} = \bigcup_d V^d$ and $\bar{E} \subseteq \bigcup_d E^d$. The weights of the edges $(i, j) \in E^d$ are represented with $w^d_{ij}$. Finally a set of features $\mathbf{x}_i \in X$ can be associated to a gene $i$. For instance, $\mathbf{x}_i$ could represent the genetic or protein interactions, the expression profile or whatever data are available for a given gene/vertex $i$.
2.4. Network integration methods
We designed and applied different network integration methods to combine different sources of evidence of functional relationships between genes. Our aim consists in providing an analysis of the impact of network integration on gene prioritization, in order to understand whether the combination of multiple networks, constructed from different sources of information, can significantly enhance the performance of gene prioritization methods, and to provide a quantitative assessment of this hypothesized improvement. To this end we deliberately considered relatively simple methods, ranging from unweighted to weighted network integration algorithms, excluding more complex algorithms proposed in the literature, to allow us to perform an extensive analysis involving a large set of diseases, a large set of human genes and a significant subset of the integration methods applied to gene prioritization problems.
Unweighted methods are characterized by network combinations depending only on the structure of the network itself, while weighted ones depend on an estimate of the learning capabilities of network algorithms or on the assessment of the “informativeness” of the available data. The methods proposed in Section 2.4.2 (unweighted integration) and in Section 2.4.3 (weighted integration) share several general characteristics with previously proposed methods applied in gene prioritization problems or in other computational biology problems such as gene function prediction [46–49].
For instance, unweighted approaches such as the simple union of networks have been applied to the prioritization of genes in Alzheimer’s disease using a guilt-by-association inference rule [47], or to the integration of PPI data of model organisms mapped to human through homology [19], or in the context of the functional interpretation of genomic variants to the integration of gene interaction networks [50], or to find functional modules in networks integrated from multiple public databases [51]. Other unweighted approaches for gene prioritization average the scaled Gram matrices obtained from different sources of functional information using suitable kernels [46].
Weighted approaches differ in the way the weights associated to each network are estimated. For instance, weights can be obtained through an iterative algorithm shown to be equivalent to an expectation-maximization (EM) optimization algorithm [52], or weights are learnt by solving a quadratically constrained linear program in a novelty detection setting of the gene prioritization problem [46], or in the context of the gene function prediction problem weights can be interpreted from a probabilistic standpoint [49] or estimated using the PPV (positive prediction value) associated to the edges of the graph [48].
In the following sections, we describe the network pre-processing and the unweighted and weighted network integration methods that we tested in our experiments.
2.4.1. Network pre-processing
Before the combination phase each network underwent a pre-processing step to allow for networks with different numbers of nodes, to filter out some edges in overly dense graphs, and to make the weights comparable across different networks. In particular, to deal with genes missing in some networks, we filled the corresponding rows/columns of the symmetric adjacency matrix $\mathbf{W}$ with zeros. To reduce the complexity of the network and the noise introduced by too small edge weights, we eliminated edges below a given threshold. In this way we removed very weak similarities between genes, but at the same time we chose relatively low thresholds to avoid the generation of “singletons” with no connections to other nodes. In brief, we tuned the threshold for each network to guarantee that each vertex has at least one connection: for each node/gene we computed the maximum of the weights associated to its edges, and among the selected maxima we chose the minimum as a general threshold for the network. Finally, to make the weights comparable across different networks, avoiding the undesirable effect that a certain network could dominate the others because of the high values of its weights, we applied both Laplacian regularization [53] and a simple linear regularization to obtain weights $\hat{w}_{ij} \in [0, 1]$:
$$\hat{w}_{ij} = \frac{w_{ij} - \min_{x,y} w_{xy}}{\max_{x,y} w_{xy} - \min_{x,y} w_{xy}} \qquad (1)$$
where indices $x, y \in V$ refer to the vertices/genes of the underlying graph.
In our experiments we adopted the regularization shown in (1), since the results were comparable with Laplacian regularization (data not shown).
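The thresholding rule and the linear rescaling of Eq. (1) can be sketched as follows. Function names and the toy matrix are hypothetical, and the sketch assumes a dense symmetric matrix with a zero diagonal and at least one edge per gene, which may differ from the actual implementation.

```python
import numpy as np

def connectivity_threshold(W):
    """Minimum over genes of each gene's largest edge weight.

    Dropping only the edges below this value leaves every gene with at
    least one connection.
    """
    W = np.asarray(W, dtype=float).copy()
    np.fill_diagonal(W, 0.0)         # ignore self-loops when taking maxima
    return W.max(axis=1).min()

def filter_edges(W, threshold):
    """Set to zero all the weights below the threshold."""
    W = np.asarray(W, dtype=float)
    return np.where(W >= threshold, W, 0.0)

def minmax_normalize(W):
    """Linear rescaling of the weights to [0, 1], as in Eq. (1)."""
    W = np.asarray(W, dtype=float)
    lo, hi = W.min(), W.max()
    if hi == lo:                     # degenerate case: constant matrix
        return np.zeros_like(W)
    return (W - lo) / (hi - lo)

W_toy = np.array([
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.4],
    [0.1, 0.4, 0.0],
])
thr = connectivity_threshold(W_toy)          # per-gene maxima: 0.9, 0.9, 0.4
W_filtered = filter_edges(W_toy, thr)        # the weak 0.1 edge is removed
W_scaled = minmax_normalize(W_filtered)      # weights now lie in [0, 1]
```

The threshold is the weakest of the per-gene strongest edges, so it is the largest cutoff that cannot isolate any gene.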
2.4.2. Unweighted network integration
In the unweighted network integration the combination of different networks depends only on the structure and the characteristics of each network, and no learning is involved in the computation of the integrated network.
2.4.2.1. Unweighted average (UA). One of the most widely applied approaches is the UA method [46,32]. The weight of each edge of the combined network is computed by simply averaging across the available $n$ networks:
$$\bar{w}_{ij} = \frac{1}{n} \sum_{d=1}^{n} w^d_{ij} \qquad (2)$$
Note that in this integration approach weights $w^d_{ij} = 0$ also contribute to the average, independently of whether the measure of functional relationship between genes $i$ and $j$ underlying the evidence source is available or not.
2.4.2.2. Per-edge unweighted average (PUA). We propose a novel method, similar to UA, but that assures a high coverage of the genes included in the integrated functional network, without penalizing genes for which a specific source of data is unavailable. With respect to the UA method, PUA takes into account the fact that a given functional relationship between a pair of genes could be missing, averaging that edge only by the number of networks containing both genes.
More precisely, given a set of $n$ gene networks, the weight $\bar{w}_{ij}$ of the edge $(i, j) \in \bar{E}$ is computed as follows:
$$\bar{w}_{ij} = \frac{1}{|D(i,j)|} \sum_{d \in D(i,j)} w^d_{ij} \qquad (3)$$
where $D(i,j) = \{d \mid i \in V^d \wedge j \in V^d\}$.
2.4.2.3. Network maximum integration (MAX). The MAX integration selects the largest weight among all the available sources of data:
$$\bar{w}_{ij} = \max_{d} w^d_{ij} \qquad (4)$$
This approach performs the union of all the available sources of evidence [47,51,50], and when multiple edges $(i, j)$ for a given pair of genes $i$ and $j$ are available, selects the one with the largest weight.
2.4.2.4. Network minimum integration (MIN). Analogously, the MIN integration selects the minimum weight:
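$$\bar{w}_{ij} = \min_{d} w^d_{ij} \qquad (5)$$
In practice it realizes the intersection between multiple networks. It can be implemented in two different flavours: the “drastic” algorithm (5), for which a single $w^d_{ij} = 0$ is sufficient to set $\bar{w}_{ij} = 0$, and a less drastic variant in which the weights set to 0 are discarded, and $\bar{w}_{ij} = 0$ if and only if the weights for the edge $(i, j)$ in all the available networks are set to 0:
$$\bar{w}_{ij} = \begin{cases} 0 & \text{if } \forall d \;\; w^d_{ij} = 0 \\ \min_{d} \{ w^d_{ij} \mid w^d_{ij} \neq 0 \} & \text{otherwise} \end{cases} \qquad (6)$$
It is worth noting that this approach could be highly affected by noisy data. It could be reliable when a large amount of evidence is shared among different sources of data.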
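The four unweighted schemes of Eqs. (2)–(6) can be compared on a toy example. The sketch below assumes that every network has already been padded with zeros to a common gene set (Section 2.4.1) and that a boolean vector records which genes each network actually covers; all names and numbers are hypothetical.

```python
import numpy as np

def ua(Ws):
    """Unweighted average, Eq. (2): zeros of missing data count in the mean."""
    return np.mean(np.asarray(Ws, dtype=float), axis=0)

def pua(Ws, present):
    """Per-edge unweighted average, Eq. (3).

    present[d][i] is True if gene i belongs to network d, so that
    counts[i, j] = |D(i, j)|, the number of networks containing both genes.
    """
    Ws = np.asarray(Ws, dtype=float)
    P = np.asarray(present, dtype=float)
    counts = np.einsum("di,dj->ij", P, P)
    total = Ws.sum(axis=0)
    return np.divide(total, counts, out=np.zeros_like(total), where=counts > 0)

def max_integration(Ws):
    """Eq. (4): largest weight across networks."""
    return np.max(np.asarray(Ws, dtype=float), axis=0)

def min_drastic(Ws):
    """Eq. (5): a single zero weight is enough to remove the edge."""
    return np.min(np.asarray(Ws, dtype=float), axis=0)

def min_nonzero(Ws):
    """Eq. (6): zero weights are discarded before taking the minimum."""
    Ws = np.asarray(Ws, dtype=float)
    m = np.where(Ws != 0, Ws, np.inf).min(axis=0)
    return np.where(np.isinf(m), 0.0, m)

# Two toy networks over 3 genes; the second one does not contain gene 2.
W1 = np.array([[0.0, 0.8, 0.2], [0.8, 0.0, 0.0], [0.2, 0.0, 0.0]])
W2 = np.array([[0.0, 0.4, 0.0], [0.4, 0.0, 0.0], [0.0, 0.0, 0.0]])
networks = [W1, W2]
covered = [[True, True, True], [True, True, False]]

W_ua = ua(networks)                # edge (0, 2): 0.2 / 2 = 0.1
W_pua = pua(networks, covered)     # edge (0, 2): 0.2 / 1 = 0.2
W_max = max_integration(networks)
W_min = min_drastic(networks)
W_min_nz = min_nonzero(networks)
```

On the edge (0, 2), which only the first network can measure, UA halves the weight while PUA leaves it unchanged; the two MIN flavours differ in the same way, returning 0 and 0.2 respectively.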
2.4.3. Weighted network integration
The unweighted methods do not require learning any parameters from the data, while the weighted integration learns the “weight” associated to each network. The basic idea behind these approaches consists in associating a parameter to the “predictiveness strength” of each type of network. This can be realized by using a learning algorithm to associate the “predictiveness strength” of a network with the assessment of the accuracy of the learning algorithm trained on the network itself.
Different weighted approaches have been proposed in the literature [46,52,48,54]. In our experiments, considering that in gene prioritization the main objective consists in effectively ranking the genes with respect to a given disease, we computed the weights according to the AUC obtained for a given MeSH descriptor. More precisely, having $n$ networks and $c$ MeSH descriptors, we can compute the weight $\alpha_d(k)$ for the $d$th network and the $k$th MeSH disease in the following way:
$$\alpha_d(k) = \frac{M_d(k)}{\sum_{j=1}^{n} M_j(k)} \qquad (7)$$
where $M_d(k)$ represents the metric applied to measure the accuracy of the prediction (e.g. the AUC or the precision at a fixed recall) with respect to the $k$th MeSH descriptor and the $d$th network. The denominator in (7) simply assures that $\sum_{d=1}^{n} \alpha_d(k) = 1$. The $\alpha_d(k)$ can be computed for each MeSH descriptor $k$ by estimating the corresponding AUC by leave-one-out on the training data, that is to say, an “internal” cross validation is performed to optimize the weights, by subdividing each fold of an “external” cross validation applied to evaluate the method in the whole data set.
2.4.3.1. Weighted average per class (WAP). By using the $\alpha_d(k)$ computed according to (7), the WAP method integrates the networks by putting a weight proportional to the performance of a given learning algorithm on each network used in the integration:
$$\bar{w}_{ij}(k) = \sum_{d=1}^{n} \alpha_d(k)\, w^d_{ij} \qquad (8)$$
It is worth noting that in this way we construct a different weighted integrated network for each MeSH descriptor.
In order to emphasize the weight of the most informative networks and, at the same time, to reduce the weights of the least informative ones, a monotonic logarithmic transformation of the weights can be applied, instead of using the one proposed in (7):
$$\alpha_d(k) = \frac{\log(1 - M_d(k))}{\sum_{j=1}^{n} \log(1 - M_j(k))} \qquad (9)$$
We assume that the metric $M$ has values in $[0, 1]$ (consider, e.g. the AUC). Note that in a practical implementation, to avoid $\alpha_d(k) \to \infty$, we need to set an upper bound $b < 1$ for $M$. For instance, in our experiments we used the AUC and we set $b = 0.99$.
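A small numerical sketch of Eqs. (7)–(9) follows. The AUC values are invented, the function names are hypothetical, and the code assumes that the metric matrix has one row per network and one column per MeSH disease and that, for each disease, the metrics do not sum to zero.

```python
import numpy as np

def linear_weights(M):
    """Eq. (7): weights proportional to the metric of each network."""
    M = np.asarray(M, dtype=float)            # shape: (n networks, c diseases)
    return M / M.sum(axis=0, keepdims=True)

def log_weights(M, b=0.99):
    """Eq. (9): logarithmic transformation, with the metric capped at b < 1."""
    M = np.minimum(np.asarray(M, dtype=float), b)
    L = np.log(1.0 - M)
    return L / L.sum(axis=0, keepdims=True)

def wap(Ws, weights_k):
    """Eq. (8): weighted average of the networks for one MeSH disease."""
    return np.tensordot(weights_k, np.asarray(Ws, dtype=float), axes=1)

# Hypothetical AUCs of 2 networks (rows) on 2 diseases (columns).
auc = np.array([[0.9, 0.8],
                [0.6, 0.8]])
w_lin = linear_weights(auc)    # first disease: 0.6 and 0.4
w_log = log_weights(auc)       # first disease: about 0.72 and 0.28

Wa = np.array([[0.0, 1.0], [1.0, 0.0]])
Wb = np.array([[0.0, 0.5], [0.5, 0.0]])
W_k0 = wap([Wa, Wb], w_lin[:, 0])   # integrated network for the first disease
```

The logarithmic transformation gives the better network a larger share than the linear one (about 0.72 against 0.6 here), which is the emphasis effect described above.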
2.4.3.2. Weighted average (WA). The WAP method adapts the weights $\alpha_d(k)$ according to the performance of a learning algorithm on each specific class $k$ under study. On one hand, this could lead to a set of networks well fitted to the characteristics of each class $k$, but on the other hand this approach is likely to overfit the data. To this end we introduce a sort of “regularized” version to reduce possible overfitting problems in the learning process. More precisely we compute a regularized weight $\alpha_d$, by averaging across classes, in the spirit of the approach proposed in [55] in the context of gene function prediction problems. In this way we obtain a unique weight $\alpha_d$ for each network:
$$\alpha_d = \frac{1}{c} \sum_{k=1}^{c} \alpha_d(k) \qquad (10)$$
The WA method, using the weights estimated in (10), builds a unique integrated network, independently of the MeSH disease considered:
$$\bar{w}_{ij} = \sum_{d=1}^{n} w^d_{ij}\, \frac{\sum_{k=1}^{c} \alpha_d(k)}{c} = \sum_{d=1}^{n} \alpha_d\, w^d_{ij} \qquad (11)$$
Note that in this section we considered the integration of graphs represented through their corresponding adjacency matrices $\mathbf{W}$, but it is easy to see that the same method can be applied to kernel matrices $\mathbf{K}$ derived from $\mathbf{W}$, by simply substituting in each equation the $w_{ij}$ elements of the adjacency matrix with the $k_{ij}$ elements of the corresponding kernel matrix (see Section 2.5.1).
2.5. Gene prioritization methods
In this section we introduce the gene prioritization methods applied in our experiments. We focused on kernelized score functions, since it has been recently shown that they are among the most competitive methods in the related problem of cancer module gene ranking [40], and on random walk algorithms, since they have been successfully applied to prioritize genes with respect to genetic diseases [19]. As a baseline method we used a simple implementation of the guilt-by-association (GBA) principle [56].
2.5.1. Kernelized score functions
Kernel-based ranking methods have been recently proposed in the context of cancer module gene ranking [40], drug ranking [57] and gene function prediction problems [58,31]. Methods based on kernelized score functions are very fast (their time complexity is approximately linear in sparse graphs, once the kernel matrix is computed) [31], and their accuracy is at least comparable with state-of-the-art gene prioritization methods [40].
The score functions $S: V \to \mathbb{R}^{+}$ are based on properly chosen kernels, by which we can directly rank vertices according to the values of $S(i)$: the higher the score, the higher the likelihood that a gene belongs to a given MeSH disease.
Kernelized score functions rely on distance measures defined in a suitable Hilbert space $\mathcal{H}$. More precisely, let $X$ be a general nonempty set, $\phi: X \to \mathcal{H}$ a mapping to a given universal reproducing kernel Hilbert space $\mathcal{H}$, and $K: X \times X \to \mathbb{R}$ its associated kernel function, such that $\langle \phi(\cdot), \phi(\cdot) \rangle_{\mathcal{H}} = K(\cdot, \cdot)$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ represents the inner product in $\mathcal{H}$. By choosing a distance measure on a Hilbert space, we can exploit the classical “kernel trick” [59] and we can embed any valid kernel into the distance measure itself.
It is worth noting that we extend the notion of neighbour through the kernel $K$: by choosing an appropriate kernel, node $j$ can be in the neighbourhood of node $i$ even if there is no edge between them in the original graph $G$: i.e. $w_{ij} = 0$, but $K(\mathbf{x}_i, \mathbf{x}_j) > 0$. From this standpoint the Gram matrix $\mathbf{K}$ can be interpreted as a novel “weighted adjacency matrix” in the projected Hilbert space induced by the mapping $\phi: X \to \mathcal{H}$.
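If we choose the minimum distance $D_{NN}$ between $i$ and $V_M$ (the set of genes annotated for a given MeSH disease $M$), we can obtain the nearest-neighbours score $S_{NN}$:
$$D_{NN}(i, V_M) = \min_{j \in V_M} \frac{1}{2} \left\| \phi(\mathbf{x}_i) - \phi(\mathbf{x}_j) \right\|^2 \qquad (12)$$
By expanding the square in (12) we obtain:
$$D_{NN}(i, V_M) = \min_{j \in V_M} \left[ \frac{1}{2} \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_i) \rangle + \frac{1}{2} \langle \phi(\mathbf{x}_j), \phi(\mathbf{x}_j) \rangle - \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle \right] \qquad (13)$$
By substituting in (13) the inner product $\langle \phi(\cdot), \phi(\cdot) \rangle$ with a suitable kernel $K(\cdot, \cdot)$, we can obtain a similarity measure simply by changing the sign:
$$Sim_{NN}(i, V_M) = - \min_{j \in V_M} \left[ \frac{1}{2} K(\mathbf{x}_i, \mathbf{x}_i) - K(\mathbf{x}_i, \mathbf{x}_j) + \frac{1}{2} K(\mathbf{x}_j, \mathbf{x}_j) \right] \qquad (14)$$
If $K(\mathbf{x}_j, \mathbf{x}_j)$ are equal for all $j \in V$, we can simplify (14), thus achieving the nearest-neighbours score $S_{NN}$:
$$S_{NN}(i, V_M) = - \min_{j \in V_M} \left[ - K(\mathbf{x}_i, \mathbf{x}_j) \right] = \max_{j \in V_M} K(\mathbf{x}_i, \mathbf{x}_j) \qquad (15)$$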
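A natural extension of the $S_{NN}$ score can be obtained by introducing the k-nearest neighbours distance:
$$D_{kNN}(i, V_M) = \frac{1}{2} \sum_{j \in I_k(i)} \left\| \phi(\mathbf{x}_i) - \phi(\mathbf{x}_j) \right\|^2 \qquad (16)$$
where $I_k(i) = \{ j \in V_M \mid j \text{ is ranked among the first } k \text{ in } V_M \}$. By adopting a procedure similar to that used to derive the $S_{NN}$ score, we can obtain from (16) the k-nearest neighbours score $S_{kNN}$:
$$S_{kNN}(i, V_M) = \sum_{j \in I_k(i)} K(\mathbf{x}_i, \mathbf{x}_j) \qquad (17)$$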
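Using a distance $D_{AV}(i, V_M)$ of a vertex $i \in V$ with respect to a set of nodes $V_M$, defined simply as the average distance in the Hilbert space between $i$ and the set of nodes included in $V_M$:
$$D_{AV}(i, V_M) = \frac{1}{2} \left\| \phi(x_i) - \frac{1}{|V_M|} \sum_{j \in V_M} \phi(x_j) \right\|^2 \qquad (18)$$
we can derive from (18) the average score $S_{AV}$:
$$S_{AV}(i, V_M) = -\frac{1}{2} K(x_i, x_i) + \frac{1}{|V_M|} \sum_{j \in V_M} K(x_i, x_j) \qquad (19)$$
This score represents the average similarity of the gene $i$ with respect to the genes belonging to the set $V_M$. If all $K(x_i, x_i)$ are equal for each $i \in V$ (i.e. the “self-similarity” of genes does not matter), we can further simplify (19) by removing its first term.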
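The three score functions (15), (17) and (19) can be illustrated with a minimal sketch that operates on a precomputed kernel matrix. The function names and the toy matrix are hypothetical, and we assume that $I_k(i)$ denotes the k genes of $V_M$ with the largest kernel value with respect to gene $i$; this is one reading of the definition, not necessarily the implementation used in the experiments.

```python
def s_nn(K, i, positives):
    """Nearest-neighbour score (Eq. 15): largest kernel value to a positive gene."""
    return max(K[i][j] for j in positives)


def s_knn(K, i, positives, k):
    """k-nearest-neighbours score (Eq. 17).

    Assumption: I_k(i) is taken as the k positives most similar to gene i.
    """
    values = sorted((K[i][j] for j in positives), reverse=True)
    return sum(values[:k])


def s_av(K, i, positives, use_self_term=True):
    """Average score (Eq. 19); the self-similarity term can be dropped."""
    mean_similarity = sum(K[i][j] for j in positives) / len(positives)
    if use_self_term:
        return -0.5 * K[i][i] + mean_similarity
    return mean_similarity


# Toy symmetric kernel matrix over four genes (illustrative values only).
K = [
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.6],
    [0.3, 0.4, 0.6, 1.0],
]
positives = [0, 1]   # genes annotated to the disease (V_M)
candidates = [2, 3]  # unannotated genes to be ranked

# Rank the candidates by decreasing average score.
ranking = sorted(candidates,
                 key=lambda i: s_av(K, i, positives, use_self_term=False),
                 reverse=True)
```

In this toy example gene 3 is ranked before gene 2, since its average kernel similarity to the annotated genes is larger.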
Even if any valid kernel $K$ can be applied to compute the above proposed scores, in the context of network-based gene prioritization we used random walk kernels [53], since they can capture the similarity between genes, taking into account the topology of the overall functional interaction network.
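The Gram matrix $\mathbf{K}$ associated with the one-step random walk kernel can be derived from the symmetric adjacency matrix $\mathbf{W}$ of the undirected functional interaction graph $G$:
$$\mathbf{K} = (a - 1)\mathbf{I} + \mathbf{D}^{-\frac{1}{2}} \mathbf{W} \mathbf{D}^{-\frac{1}{2}} \qquad (20)$$
where $\mathbf{I}$ is the identity matrix, $\mathbf{D}$ is a diagonal matrix with elements $d_{ii} = \sum_j w_{ij}$, and $a$ is a value larger than 1.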
The q-step random walk kernels $\mathbf{K}_{q\text{-step}} = \mathbf{K}^q$ can easily be obtained by matrix multiplication from the one-step random walk kernel matrix (20), where $q$ represents the number of random walk steps in the underlying graph [53]. In this way, by setting $q = 2$ or $q = 3$ two vertices are considered similar if they are directly connected or if they are connected through a path including one or two vertices. Longer paths could also be considered by setting $q > 3$: in this way we can deeply explore the graph to find similarities between genes mediated through long paths in the graph.
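As an illustration of (20) and of the q-step kernel, the following sketch builds both matrices for a small graph with plain Python lists. The function names, the choice $a = 2$ and the three-node path graph are assumptions made only for this example.

```python
import math


def one_step_rw_kernel(W, a=2.0):
    """One-step random walk kernel K = (a - 1) I + D^(-1/2) W D^(-1/2) (Eq. 20)."""
    n = len(W)
    degree = [sum(row) for row in W]
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if W[i][j] != 0.0:
                # Symmetric degree normalization of the edge weight.
                K[i][j] = W[i][j] / math.sqrt(degree[i] * degree[j])
        K[i][i] += a - 1.0
    return K


def matmul(A, B):
    """Dense matrix product of two square matrices stored as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def q_step_rw_kernel(W, q, a=2.0):
    """q-step random walk kernel, i.e. the q-th power of the one-step kernel."""
    K1 = one_step_rw_kernel(W, a)
    Kq = K1
    for _ in range(q - 1):
        Kq = matmul(Kq, K1)
    return Kq


# Path graph 0 - 1 - 2: nodes 0 and 2 are not directly connected.
W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
K1 = one_step_rw_kernel(W)
K2 = q_step_rw_kernel(W, q=2)
```

Nodes 0 and 2 have zero similarity in the one-step kernel but a positive similarity in the two-step kernel, because they are connected through node 1.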
2.5.2. Random walks and random walks with restart
Kernelized score functions presented in the previous section can be interpreted as a generalization of the random walk algorithms, which have been successfully applied to gene prioritization problems [19,60]. Random walk (RW) algorithms [61] rank genes by exploring and exploiting the topology of the gene network: random walks across the network are performed starting from a subset $V_M \subset V$ of genes belonging to a specific MeSH descriptor $M$ by using a transition probability matrix $\mathbf{Q} = \mathbf{D}^{-1}\mathbf{W}$, where $\mathbf{W}$ is the adjacency matrix, and $\mathbf{D}$ is a diagonal matrix with diagonal elements $d_{ii} = \sum_j w_{ij}$. Starting from the initial set of probabilities $\mathbf{p}^o$ of the genes $1 \ldots n$ of belonging to $M$, where $p_i^o = 1/|V_M|$ if $i \in V_M$, otherwise $p_i^o = 0$, the RW update rule:
$$\mathbf{p}^{t+1} = \mathbf{Q}^T \mathbf{p}^t \qquad (21)$$
is repeated until convergence or for a fixed number of iterations.
We can observe that the random walker could progressively “forget” the a priori information available for the MeSH descriptor $M$, by iteratively walking across the overall network. To avoid this problem, we can stop the RW algorithm after a few iterations, as outlined above, or we can apply the random walk with restart (RWR) method: at each step the random walker can move to one of its neighbours or can restart from its initial condition with probability $\theta$:
$$\mathbf{p}^{t+1} = (1 - \theta)\mathbf{Q}^T \mathbf{p}^t + \theta \mathbf{p}^o \qquad (22)$$
With both RW and RWR methods at the steady state we can rank the vector $\mathbf{p}$ to prioritize genes according to their likelihood to belong to the MeSH disease under study.
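The update rules (21) and (22) can be sketched as follows. This is a minimal illustration that assumes every node has at least one edge (so that the rows of the transition matrix are well defined); the function names, the name `theta` for the restart probability, the convergence tolerance and the toy path graph are assumptions of the example.

```python
def transition_matrix(W):
    """Row-normalized transition matrix Q = D^-1 W (every node needs degree > 0)."""
    Q = []
    for row in W:
        degree = sum(row)
        Q.append([w / degree for w in row])
    return Q


def rw_step(Q, p):
    """One application of p <- Q^T p, i.e. (Q^T p)_j = sum_i Q[i][j] * p[i]."""
    n = len(p)
    return [sum(Q[i][j] * p[i] for i in range(n)) for j in range(n)]


def initial_probabilities(n, positives):
    """p^o: uniform mass 1/|V_M| on the annotated genes, zero elsewhere."""
    p0 = [0.0] * n
    for i in positives:
        p0[i] = 1.0 / len(positives)
    return p0


def random_walk(W, positives, steps):
    """Plain random walk (Eq. 21) run for a fixed number of steps."""
    Q = transition_matrix(W)
    p = initial_probabilities(len(W), positives)
    for _ in range(steps):
        p = rw_step(Q, p)
    return p


def random_walk_with_restart(W, positives, theta, tol=1e-10, max_iter=1000):
    """Random walk with restart (Eq. 22), iterated until the update is below tol."""
    Q = transition_matrix(W)
    p0 = initial_probabilities(len(W), positives)
    p = list(p0)
    for _ in range(max_iter):
        walked = rw_step(Q, p)
        new_p = [(1.0 - theta) * w + theta * r for w, r in zip(walked, p0)]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Path graph 0 - 1 - 2 - 3 with gene 0 as the only annotated gene.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
p_rw = random_walk(W, positives=[0], steps=1)
p_rwr = random_walk_with_restart(W, positives=[0], theta=0.6)
```

With restart, the stationary probabilities decrease with the distance from the annotated gene, so the prior information is retained instead of being progressively diluted.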
2.5.3. Guilt by association methods
As a baseline gene prioritization method we applied a simple implementation of the guilt-by-association (GBA) principle. According to this general biological principle, a biomolecular entity that interacts or shares some features with another biomolecular entity can also share some specific biological property (for instance, its membership to a given MeSH category). In computational biology this basic biological principle has been exploited to develop methods able to assign a given biological or molecular property on the basis of the labeling of neighborhoods in biomolecular networks [56,62]. In the context of gene prioritization problems, we can assess the likelihood that a given gene belongs to a given MeSH category $M$ on the basis of the $M$-labeled genes directly connected to the gene under study.
We implemented a simple version of the GBA approach, in which the score for each gene is computed by choosing the maximum of the weights $w_{ij} \in \mathbf{W}$ of the edges connecting the gene $i$ to positive labeled genes $j \in V_M$ in the neighborhood $N(i)$ of $i$:
$$S(i, M) = \max_{j \in N(i)} w_{ij} \qquad (23)$$
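The GBA score (23) amounts to a single pass over the positively labeled neighbours of a gene, as in the sketch below. Returning 0 for a gene without annotated neighbours is an assumption of this example, as are the function name and the toy weights.

```python
def gba_score(W, i, positives):
    """Guilt-by-association score (Eq. 23).

    Maximum weight among the edges that connect gene i to positively labeled
    neighbours. Assumption: the score is 0.0 when no such neighbour exists.
    """
    weights = [W[i][j] for j in positives if j != i and W[i][j] > 0.0]
    return max(weights, default=0.0)


# Toy weighted adjacency matrix; genes 0 and 1 are annotated to the disease.
W = [
    [0.0, 0.9, 0.2, 0.0],
    [0.9, 0.0, 0.5, 0.0],
    [0.2, 0.5, 0.0, 0.7],
    [0.0, 0.0, 0.7, 0.0],
]
positives = [0, 1]
scores = {i: gba_score(W, i, positives) for i in (2, 3)}
```

Gene 3 receives a zero score even though it is strongly connected to gene 2, because GBA looks only at direct neighbours; this is the limitation that the kernelized scores address.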
3. Results and discussion
3.1. Experimental set-up
One of the main goals of this work consists in performing an extensive analysis of gene-disease associations, considering a large set of diseases.
Moreover, we experimentally investigated the impact of network integration on gene prioritization, by performing a quantitative comparison of the accuracy achieved by the methods described in Section 2.5 using each of the single gene networks considered in Section 2.2 with that obtained through the network integration methods introduced in Section 2.4.
More precisely, we first assessed the “informativeness” of each single gene network by analyzing the performance of GBA, RW, RWR and kernelized score function methods. Then we performed a systematic analysis of both unweighted and weighted network integration methods, first by combining the six binary gene interaction networks and then by also exploiting the real-valued semantic similarity-based gene networks through the integration of all the nine available networks (Table 1).
Finally, we indicated some unannotated genes as reliable “disease gene” candidates for a selected set of MeSH diseases for which we obtained robust and accurate predictions.
3.2. Evaluation of the gene prioritization and network integration methods
The generalization performance of each gene prioritization and network integration method was assessed through a classical cross-validation procedure [63], setting the number of folds to five. More precisely, the nodes of the graph were randomly partitioned into five folds, and each fold was selected in turn as the test fold, while the remaining folds served as training folds. The labels of the test fold are removed, and the labels of the training folds are used to infer the scores to be assigned to the nodes of the test fold (in our setting we deal with gene prioritization, i.e. a ranking problem). Finally, having the scores predicted for each of the five folds (that is, for the entire set of the available genes), we can apply standard measures to evaluate the correctness of the obtained gene ranking with respect to each disease. In particular we applied the AUC to evaluate the ranking of the genes. Moreover, we also applied the precision at a given recall to take into account that for several MeSH diseases we have a relatively low number of known disease genes (positive examples).
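The evaluation protocol can be sketched as follows for a single disease. The sketch assumes a precomputed kernel matrix, uses the simplified average score (Eq. 19 without the self-similarity term) as the scoring function, and computes the AUC by direct pairwise comparison; the function names, the fold assignment and the toy data are illustrative assumptions rather than the code used in the experiments.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both positive and negative examples are required")
    correct = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def cross_validated_scores(K, labels, n_folds=5, seed=0):
    """Score every gene while its own fold's labels are hidden."""
    n = len(labels)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[f::n_folds] for f in range(n_folds)]
    scores = [0.0] * n
    for test_fold in folds:
        hidden = set(test_fold)
        # Only positives from the training folds may be used for scoring.
        train_pos = [j for j in range(n) if labels[j] == 1 and j not in hidden]
        for i in test_fold:
            if train_pos:
                scores[i] = sum(K[i][j] for j in train_pos) / len(train_pos)
    return scores


# Toy data: two clusters of five genes; the first cluster is annotated.
n_genes = 10
K = [[1.0 if (i < 5) == (j < 5) else 0.1 for j in range(n_genes)]
     for i in range(n_genes)]
labels = [1] * 5 + [0] * 5
cv_scores = cross_validated_scores(K, labels)
cv_auc = auc(cv_scores, labels)
```

The pairwise AUC computation is quadratic in the number of genes and is used here only for clarity; a rank-based formula would be preferable on real networks.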
After the assessment of the generalization performance of the gene prioritization and network integration methods, we reported for each of the considered 708 MeSH diseases the p-value obtained through a nonparametric statistical test based on the “shuffling” of the gene labels (Section 3.6). Then we reported the 10 top-ranked unannotated genes for each MeSH disease, and we also performed an analysis of the unannotated genes as reliable “disease gene” candidates on the basis of the distribution of the scores of the annotated genes for the MeSH diseases for which we obtained a very high estimated cross-validated AUC value.
We point out that the reported results are based, in line with the literature on gene prioritization, on retrospective benchmarks, and for this reason usually offer optimistic estimates of the generalization performance, since disease associations are likely to be directly or indirectly incorporated in the gene prioritization data sources [3]. As outlined in [64], this problem is difficult to address in an initial study and can be resolved only by long-term prospective benchmarks, wherein predictions are made on the current state of knowledge (that is, the currently available annotations) and validated in future studies, that is, once novel experimental evidence of disease associations becomes available.
3.3. Gene prioritization with single networks
We performed an assessment of the “informativeness” of each gene network through an extensive experimental evaluation of the average AUC results across 708 MeSH diseases, using different gene prioritization methods (Table 2). The first column of Table 2 shows the gene prioritization methods and their main associated learning parameters (see Section 2.5 for details). For each column the best average AUC results achieved by the gene prioritization methods are highlighted in bold. The $S_{AV}$ and $S_{kNN}$ kernelized score functions usually achieve the best results, but RW and RWR algorithms are sometimes comparable with kernelized score functions. The difference is statistically significant (Wilcoxon rank sum test, $\alpha = 0.01$) in favor of kernelized score functions for the data sets dbnet, finet, hnnet, bpnet and ccnet, while for the other four functional networks no statistically significant difference has been detected.
The last row of Table 2 shows the average results across methods for each gene network. We can observe that on average gene prioritization methods achieve the best results with finet and gcnet, but the AUC performance is relatively high also with hnnet and bpnet. The other networks appear to be less informative on average, but note that some learning is achieved with each of the considered networks, since the average AUC is always significantly larger than 0.5.
It is not surprising that finet, gcnet (and also hnnet) are the most “informative” networks, since they are constructed by integrating different sources of information (Section 2.2). We only observe that with gcnet the results refer only to a subset of the genes used in our experiments (Table 1). It is also worth noting the good results obtained with semantic similarity-based networks constructed from biological process GO annotations (bpnet), even if in this case too the results are computed with respect to a subset of the S genes, and hence the comparison must be considered with some caution. In summary, the results show that all the considered gene networks carry some information about gene prioritization with MeSH diseases. In particular, networks that were themselves constructed through the integration of different sources of evidence seem to be the most “informative” for this gene ranking task.
3.4. Gene prioritization with unweighted network integration
Our network integration experiments started with the combination of the six binary gene networks described in Section 2.2 (that is, all the available gene networks excluding real-valued semantic similarity-based nets), using the unweighted combination methods presented in Section 2.4.2. Table 3 reports the average AUC results across MeSH diseases with UA, PUA and MAX integration methods. Note that we did not perform “soft” MIN integration since it is easy to see that with binary networks this method is indistinguishable from MAX, while “drastic” MIN leads to highly disconnected networks.
Comparing Tables 2 and 3, we can observe that unweighted integration improves the performance. This is true especially with UA and PUA methods (the difference is almost always statistically significant at $\alpha = 0.01$ significance level), but in several cases also with MAX. The improvement depends also on the gene prioritization method used. For instance unweighted integration degrades performance with $S_{NN}$ (at least with respect to the most informative single gene networks), while with the other kernelized score functions and with GBA, RW and RWR algorithms unweighted integration often improves AUC results. While a larger number of steps improves the performance of kernelized score functions, with the classical RW algorithm we observe a degradation of the performance. These results show that the classical RW tends to “forget” the initial “a priori” knowledge, while kernelized score functions retain the prior information and are able to exploit the overall topology of the network, confirming previous results [40,31].

Table 2
Single gene networks: AUC results averaged across 708 MeSH diseases. The last row shows the average results across methods for each gene network.
Method               cmnet   bgnet   dbnet   finet   hnnet   gcnet   bpnet   mfnet   ccnet
GBA                  0.6620  0.6389  0.6683  0.7542  0.7323  0.7346  0.7134  0.6395  0.6250
RW 1 step            0.6922  0.6590  0.6037  0.7356  0.7269  0.8418  0.7646  0.6985  0.6845
RW 2 step            0.6829  0.6462  0.6761  0.8194  0.7802  0.8220  0.7635  0.7013  0.6812
RW 3 step            0.6768  0.6406  0.6531  0.8157  0.7531  0.8145  0.7611  0.6985  0.6745
RW 5 step            0.6718  0.6316  0.6426  0.7993  0.6973  0.8089  0.7610  0.6834  0.6711
RW 10 step           0.6694  0.6224  0.6222  0.7575  0.6249  0.8075  0.7411  0.6790  0.6684
RWR θ=0.6            0.6871  0.6515  0.6781  0.8271  0.7889  0.8401  0.7825  0.7112  0.6856
RWR θ=0.9            0.6878  0.6513  0.6750  0.8242  0.7870  0.8453  0.7789  0.7085  0.6825
S_AV 1 step          0.6894  0.6574  0.6717  0.7669  0.7596  0.8167  0.7889  0.7139  0.6916
S_AV 2 step          0.6842  0.6414  0.6831  0.8226  0.7872  0.8328  0.7888  0.7142  0.6914
S_AV 3 step          0.6845  0.6417  0.6752  0.8255  0.7897  0.8417  0.7879  0.7146  0.6913
S_AV 5 step          0.6850  0.6418  0.6778  0.8287  0.7943  0.8471  0.7839  0.7151  0.6907
S_AV 10 step         0.6849  0.6408  0.6804  0.8312  0.7983  0.8407  0.7640  0.7117  0.6882
S_NN 1 step          0.6296  0.6263  0.6667  0.7561  0.7374  0.7308  0.6971  0.6485  0.6565
S_NN 2 step          0.6235  0.6105  0.6764  0.8031  0.7624  0.7316  0.7032  0.6478  0.6567
S_NN 3 step          0.6228  0.6105  0.6683  0.8044  0.7638  0.7365  0.7103  0.6475  0.6574
S_NN 5 step          0.6213  0.6107  0.6708  0.8052  0.7674  0.7481  0.7280  0.6475  0.6593
S_NN 10 step         0.6197  0.6136  0.6744  0.8029  0.7729  0.7774  0.7703  0.6493  0.6659
S_kNN 1 step, k=3    0.6439  0.6336  0.6705  0.7635  0.7523  0.7370  0.7645  0.6812  0.6712
S_kNN 2 step, k=3    0.6377  0.6179  0.6817  0.8149  0.7788  0.7403  0.7705  0.6937  0.6725
S_kNN 3 step, k=3    0.6371  0.6183  0.6737  0.8168  0.7805  0.7482  0.7765  0.6999  0.6756
S_kNN 5 step, k=3    0.6362  0.6191  0.6763  0.8182  0.7845  0.7647  0.7815  0.7003  0.6788
S_kNN 10 step, k=3   0.6366  0.6225  0.6798  0.8172  0.7898  0.7993  0.7695  0.7021  0.6803
S_kNN 1 step, k=19   0.6811  0.6523  0.6717  0.7668  0.7596  0.7860  0.7702  0.6997  0.6798
S_kNN 2 step, k=19   0.6756  0.6364  0.6831  0.8222  0.7871  0.8004  0.7763  0.7001  0.6799
S_kNN 3 step, k=19   0.6755  0.6368  0.6752  0.8249  0.7895  0.8125  0.7819  0.7008  0.6801
S_kNN 5 step, k=19   0.6757  0.6373  0.6779  0.8276  0.7940  0.8286  0.7902  0.7025  0.6807
S_kNN 10 step, k=19  0.6766  0.6373  0.6810  0.8292  0.7986  0.8402  0.7774  0.7063  0.6820
Average              0.6625  0.6338  0.6691  0.8029  0.7657  0.7955  0.7624  0.6899  0.6751
Hereinafter we limited the integration experiments to kernelized score functions only, since they usually perform as well as or better than the other compared methods, and their empirical time complexity is significantly lower than that of RW and RWR algorithms: for instance, while an entire cycle of cross-validation on the 708 MeSH classes with UA integration requires hours with RWR, the same task requires only a few minutes with kernelized score functions, using an Intel i7 2.80 GHz processor with 16 GB of RAM and a Linux system.

Table 3
Unweighted integration of the six binary gene networks (without semantic similarity-based nets): AUC results averaged across 708 MeSH categories.
Method               UA      PUA     MAX
GBA                  0.8313  0.8291  0.6589
RW 1 step            0.8566  0.8563  0.8501
RW 2 step            0.8186  0.8178  0.8154
RW 3 step            0.7937  0.7925  0.7897
RW 5 step            0.7773  0.7760  0.7746
RW 10 step           0.7720  0.7704  0.7706
RWR θ=0.6            0.8533  0.8528  0.8520
RWR θ=0.9            0.8565  0.8531  0.8476
S_AV 1 step          0.8538  0.8530  0.8286
S_AV 2 step          0.8562  0.8554  0.8353
S_AV 3 step          0.8580  0.8571  0.8405
S_AV 5 step          0.8596  0.8587  0.8470
S_AV 10 step         0.8548  0.8540  0.8485
S_NN 1 step          0.6934  0.6921  0.6352
S_NN 2 step          0.6950  0.6936  0.6331
S_NN 3 step          0.6968  0.6954  0.6315
S_NN 5 step          0.7020  0.7004  0.6314
S_NN 10 step         0.7251  0.7230  0.6546
S_kNN 1 step, k=3    0.7280  0.7266  0.6593
S_kNN 2 step, k=3    0.7304  0.7289  0.6581
S_kNN 3 step, k=3    0.7332  0.7317  0.6580
S_kNN 5 step, k=3    0.7405  0.7389  0.6627
S_kNN 10 step, k=3   0.7636  0.7616  0.6987
S_kNN 1 step, k=19   0.8138  0.8124  0.7598
S_kNN 2 step, k=19   0.8170  0.8155  0.7639
S_kNN 3 step, k=19   0.8199  0.8183  0.7680
S_kNN 5 step, k=19   0.8251  0.8233  0.7785
S_kNN 10 step, k=19  0.8374  0.8356  0.8093
By adding the real-valued networks based on semantic similarity measures (Section 2.2), we observe a further significant enhancement of the overall performance, showing that the integration of different sources of evidence leads to better results (Table 4). For instance the performance of the UA approach with $S_{AV}$ using a five-step random walk kernel is boosted from 0.8596 to 0.8831 average AUC (the increment is significant at $\alpha = 10^{-30}$ significance level according to the Wilcoxon signed rank sum test). Note that the MIN integration fails on this task, since an “intersection” strategy in this context leads to a significant loss of information, which prevents the exploitation of the topological information underlying the entire network.
Fig. 3 provides a visual summary of the differences of average AUC across MeSH categories between unweighted integration methods and the best single gene network (finet). Fig. 3(d) confirms that MIN integration also fails in this task, for the same reasons explained above. On the contrary, UA and PUA integration provide significant enhancements with both $S_{AV}$ and $S_{kNN}$ (Fig. 3(a) and (b)). Note that unweighted integration with $S_{NN}$ results in a degradation of the performance (Fig. 3). We do not have a clear explanation of this fact, but we think that the instability of scores computed by using only one of the neighbours, combined with the impossibility of weighting or choosing the best sources of information, may add noise to the prediction process.

Table 4
Unweighted integration methods: AUC results averaged across 708 MeSH categories including all the available nine gene networks.
Method               UA-all  PUA-all MAX-all MIN-all
S_AV 1 step          0.8765  0.8667  0.8286  0.6541
S_AV 2 step          0.8792  0.8701  0.8353  0.6694
S_AV 3 step          0.8811  0.8722  0.8405  0.6824
S_AV 5 step          0.8831  0.8744  0.8470  0.7023
S_AV 10 step         0.8761  0.8708  0.8485  0.7264
S_NN 1 step          0.6950  0.7050  0.6352  0.6045
S_NN 2 step          0.6980  0.7080  0.6331  0.6087
S_NN 3 step          0.7014  0.7108  0.6315  0.6129
S_NN 5 step          0.7106  0.7185  0.6314  0.6212
S_NN 10 step         0.7437  0.7490  0.6546  0.6349
S_kNN 1 step, k=19   0.8322  0.8331  0.7598  0.6413
S_kNN 2 step, k=19   0.8368  0.8372  0.7639  0.6520
S_kNN 3 step, k=19   0.8413  0.8404  0.7680  0.6619
S_kNN 5 step, k=19   0.8500  0.8465  0.7785  0.6789
S_kNN 10 step, k=19  0.8665  0.8576  0.8093  0.7093

Fig. 3. Unweighted integration methods: differences of average AUC across MeSH diseases with respect to the best single gene network (finet). (a) UA (b) PUA (c) MAX (d) MIN.
[Bar plots; vertical axis: AUC difference; one bar for each of S_AV, S_NN and S_kNN with 1, 2, 3, 5 and 10 steps.]
In summary, the results show that unweighted integration, and especially the UA and PUA methods, significantly enhances gene prioritization results. All the considered gene prioritization methods, ranging from random walks to kernelized score functions (with the exception of $S_{NN}$), benefit from unweighted integration.
Moreover, the integration of semantic similarity-based networks further improves the performance of gene prioritization. Note that with these networks, considered individually, gene prioritization methods do not attain high average AUC scores (at least with mfnet and ccnet, Table 2), but their integration significantly enhances gene prioritization results (Table 4), since they convey complementary information with respect to the other sources of evidence.
3.5. Gene prioritization with weighted network integration
We also experimented with WA and WAP network integration to explicitly take into account the “informativeness” of each gene network (Section 2.4.3). Table 5 shows that weighted integration significantly boosts the performance of kernelized score functions. In particular five-step $S_{AV}$ with weighted integration of all the nine available networks (WA-all, Table 5) reaches the highest average AUC score, but almost all the gene prioritization algorithms achieve their best results with WA and WAP integration.
This is more evident in Fig. 4, where we register a very high increment of the average AUC score with respect to the best single gene network. This is true for both $S_{AV}$ and $S_{kNN}$, while for $S_{NN}$ this behavior is limited to WAP methods only (Fig. 4(b) and (d)). Note that, by contrast, $S_{NN}$ behaves poorly with unweighted integration, independently of the combination method applied (Table 3).
To get more insight into the results obtained with unweighted and weighted integration methods, Fig. 5 compares the AUC scores for each class achieved by five-step $S_{AV}$ (one of the best gene prioritization methods) between unweighted and weighted integration with respect to the best single network finet. A point in Fig. 5 represents the AUC score, relative to a MeSH disease, attained by the integration method and by the best single gene network. More precisely, the AUC value obtained by the integration method is represented on the ordinate, while the abscissa shows the AUC value achieved with finet, i.e. the best single network. Points that lie above the bisector of the first quadrant represent MeSH diseases for which the integration method achieves better results than the single best gene network. In Fig. 5(a) most of the points lie above the bisector, showing that UA enhances the results obtained with finet. By adding semantic similarity-based gene networks several points move above the bisector line (Fig. 5(b)), confirming that these networks add novel useful information for the gene prioritization task. Looking at Fig. 5(c) we observe that with WA integration, even without semantic similarity-based gene networks, most of the points lie above the bisector, and the results are even better when we integrate all the available networks (Fig. 5(d)).

Table 5
Weighted integration methods: AUC results averaged across 708 MeSH categories. WA and WAP include only the first six functional networks, while WA-all and WAP-all include all the nine functional networks.
Method               WA      WAP     WA-all  WAP-all
S_AV 1 step          0.8649  0.8680  0.8778  0.8768
S_AV 2 step          0.8733  0.8727  0.8828  0.8802
S_AV 3 step          0.8774  0.8763  0.8866  0.8830
S_AV 5 step          0.8817  0.8807  0.8904  0.8861
S_AV 10 step         0.8812  0.8823  0.8868  0.8850
S_NN 1 step          0.7602  0.8080  0.7042  0.8165
S_NN 2 step          0.7692  0.8126  0.7155  0.8213
S_NN 3 step          0.7709  0.8159  0.7193  0.8240
S_NN 5 step          0.7753  0.8206  0.7303  0.8278
S_NN 10 step         0.7807  0.8241  0.7707  0.8328
S_kNN 1 step, k=19   0.8394  0.8570  0.8325  0.8650
S_kNN 2 step, k=19   0.8476  0.8614  0.8427  0.8684
S_kNN 3 step, k=19   0.8527  0.8651  0.8489  0.8716
S_kNN 5 step, k=19   0.8614  0.8703  0.8611  0.8762
S_kNN 10 step, k=19  0.8744  0.8768  0.8819  0.8784

Fig. 4. Weighted integration methods: differences of average AUC across MeSH categories with respect to the best single gene network (finet). Integration of six networks: (a) WA (b) WAP. Integration with nine networks including semantic similarity-based nets: (c) WA (d) WAP.
[Bar plots; vertical axis: AUC difference; one bar for each of S_AV, S_NN and S_kNN with 1, 2, 3, 5 and 10 steps.]
Fig. 6 provides an overall picture of the distributions of AUC scores compared between different unweighted and weighted integration methods using five-step $S_{AV}$ as gene prioritization algorithm. White boxplots refer to weighted integration methods, light gray boxplots to unweighted integration methods without semantic similarity-based gene networks, and dark gray boxplots to unweighted methods integrating all the nine available gene networks. Weighted methods show the best results (especially when all the networks are integrated), but UAall, that is UA integrating all the nine available networks, also achieves quite similar results. All the considered methods behave better than the best single gene network (last boxplot in Fig. 6), except for MIN, which clearly fails on this task, as discussed above.
To obtain a more reliable comparison of the results obtained with different gene network integration methods, we applied the Wilcoxon signed rank sum test to each pair of them, to estimate whether a significant statistical difference exists using the best performing gene prioritization method ($S_{AV}$ five steps). Table 6 summarizes the main results: a “+” entry means that a significant statistical difference at 0.01 significance level is registered in favor of the method in the row with respect to the method in the column; a “−” entry means that the opposite holds, and a “=” entry stands for no significant difference between the methods.
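The pairwise comparison summarized in Table 6 can be illustrated with a small self-contained sketch of the two-sided Wilcoxon signed rank test. The sketch uses the normal approximation without tie or continuity correction and drops zero differences, which are simplifying assumptions; the function names and the toy AUC vectors are hypothetical, and a statistical library would normally be preferred for real analyses.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed rank test (normal approximation).

    Returns (w_plus, w_minus, p_value). Zero differences are discarded and
    tied absolute differences receive the average of their ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda idx: abs(diffs[idx]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while (end + 1 < n and
               abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return w_plus, w_minus, p_value


def table_entry(row_aucs, column_aucs, alpha=0.01):
    """Return '+', '-' or '=' following the row-versus-column convention."""
    w_plus, w_minus, p_value = wilcoxon_signed_rank(row_aucs, column_aucs)
    if p_value >= alpha:
        return "="
    return "+" if w_plus > w_minus else "-"


# Hypothetical per-disease AUC values for two integration methods.
method_a = [0.80 + 0.005 * i for i in range(20)]
method_b = [a - 0.01 - 0.001 * i for i, a in enumerate(method_a)]
entry_ab = table_entry(method_a, method_b)
entry_ba = table_entry(method_b, method_a)
```

Here method A is better on every disease, so the entry for A versus B is “+” and the entry for B versus A is “-”.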
We observe that weighted integration is always significantly better than or equal to all the other compared methods. In particular WA-all integration (that is, WA integrating all the available nets) is significantly better than all the other considered integration approaches. Note that UA-all is also always better than or equal to all the others (except WA-all), showing that even a simple unweighted integration, if a sufficiently large set of sources of evidence is provided, can achieve results comparable with the more computationally expensive weighted integration (recall that the weights of the integration are obtained by evaluating the AUC on each single gene network by internal cross-validation, see Section 2.4.3). Quite interestingly, WAP does not outperform WA: even if we construct a specific weighted network for each MeSH disease, this does not introduce a significant advantage (at least on average). This fact could be explained by considering that the per-class integration (WAP) may introduce a certain overfitting to the data, while WA, by averaging the weights across classes and thus resulting in a single integrated network, could reduce the overfitting, acting as a sort of “regularization”, confirming previous results obtained in the context of gene function prediction [55].
Considering that for a large number of diseases we have a relatively low number of annotated genes, we also compared the precision at different recall levels between different unweighted and weighted integration methods, using two-step $S_{AV}$ as gene prioritization algorithm (Fig. 7). With both the integration of the six basic networks (Fig. 7(a)) and the integration of the six basic networks plus the three semantic similarity-based networks (Fig. 7(b)) we achieve significantly better results with any of the considered integrated networks with respect to the best “single” network (finet), except for the MIN integration, which obtains the worst results. In this case too weighted integration outperforms unweighted integration, but observe that when we integrate all the available networks UAall, i.e. the unweighted average integration, achieves better results than the weighted per-class integration (WAPall), confirming that WAP integration undergoes a certain overfitting to the data. Note that when semantic similarity-based networks are added, all the integration methods improve their precision/recall results (the scale of the ordinate, that is the precision, is the same in Fig. 7(a) and (b)). For instance WA, the best performing network integration method, improves its average precision at 20% recall from 0.26 to 0.30, with a relative increment of about 15% in precision. As a final observation, note that all the considered network integration methods (except MIN integration) significantly outperform the results obtained with the best single network, confirming that even simple unweighted integration algorithms are sufficient to boost the performance of gene prioritization methods.

Fig. 5. Comparison of AUC results between network integration methods and the best single gene network (finet). Each point represents the AUC score obtained by $S_{AV}$ five steps with network integration methods (ordinate) and with the best single network finet (abscissa) on each of the 708 MeSH diseases. (a) UA with six networks; (b) UA with all the nine networks; (c) WA with six networks; (d) WA with all nine networks.
[Scatter plots; both axes range from 0.5 to 1.0.]

[Fig. 6: boxplots of AUC scores; horizontal axis, in order: WA, WAP, WAall, WAPall, UA, PUA, MAX, UAall, PUAall, MAXall, MINall, finet.]

Table 6
Comparison between network integration methods: methods whose AUC performance is significantly better are marked with “+”, significantly worse with “−” and with no significant difference with “=” (0.01 significance level, Wilcoxon signed rank sum test). The comparisons are in the sense rows vs. columns.
Method    WAP-all  WA       WAP      UA-all   PUA-all  MAX-all  MIN-all  UA       PUA      MAX      finet
WA-all    +        +        +        +        +        +        +        +        +        +        +
WAP-all            =        =        =        +        +        +        +        +        +        +
WA                          =        =        +        +        +        +        +        +        +
WAP                                  =        +        +        +        +        +        +        +
UA-all                                        +        +        +        +        +        +        +
PUA-all                                                +        +        +        +        +        +
MAX-all                                                         +        −        −        =        +
MIN-all                                                                  −        −        −        −
UA                                                                                +        +        +
PUA                                                                                        +        +
MAX                                                                                                 +
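Precision at a fixed recall level can be computed from a ranked list as in the following sketch. The definition used here (precision at the first rank position at which the requested fraction of positives has been retrieved) is an assumption, as are the function name and the toy data; the last lines simply reproduce the relative increment quoted above.

```python
import math


def precision_at_recall(scores, labels, recall_level):
    """Precision at the first rank cutoff whose recall reaches recall_level.

    Assumption: genes are ranked by decreasing score and ties keep input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_positives = sum(labels)
    if n_positives == 0:
        raise ValueError("at least one positive example is required")
    # Number of positives to retrieve; the small epsilon guards against
    # floating point noise in recall_level * n_positives.
    needed = max(1, math.ceil(recall_level * n_positives - 1e-12))
    found = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        found += label
        if found >= needed:
            return found / rank
    raise ValueError("recall_level must not exceed 1")


scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]
precision_half = precision_at_recall(scores, labels, 0.5)
precision_full = precision_at_recall(scores, labels, 1.0)

# Relative increment of the average precision at 20% recall reported for WA.
relative_gain = (0.30 - 0.26) / 0.26
```

The relative gain evaluates to roughly 0.154, which matches the increment of about 15% stated in the text.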
3.6. Finding novel associations between genes and MeSH diseases
The common usage of gene ranking scores in gene-disease prioritization experiments consists in the selection of the top ranked unannotated genes and in their further characterization as possible “candidate” genes actually involved in the onset and progression of the considered disease.
To this end we provide for each of the 708 MeSH diseases the AUC obtained by five-fold cross-validation, the p-value achieved through a nonparametric randomized test (see below), and the 10 top ranked genes currently not annotated for the MeSH disease under study. A table summarizing this information is available at http://homes.di.unimi.it/re/suppmat/genesmeshnetwpred/supmatTBL1.html (accessed 30 November 2013).
Moreover, we also provide a preliminary analysis of the top ranked most reliable unannotated genes for the MeSH diseases predicted with high robustness and accuracy by the best network integration, i.e. WA integrating all the available nets using five-step $S_{AV}$ to prioritize genes.
To evaluate the robustness of the method we performed a nonparametric statistical test by randomly shuffling 1000 times the labels for each MeSH disease and counting how many times $m$ the AUC computed with randomly shuffled labels is larger than the AUC computed with the true labels. The resulting p-value is just the ratio $m/1000$. Interestingly enough, we achieve a p-value < 0.01 for 649 and a p-value < 0.05 for 676 of the 708 MeSH diseases. To choose MeSH diseases both robustly and accurately predicted we selected MeSH descriptors with an average AUC ≥ 0.975 and p-value < 0.01, resulting in a set of 24 diseases. For each of the selected diseases, we extracted the lowest score $c$ from the set of positive (annotated) genes. Then, we computed the empirical cumulative distribution of all the scores equal to or larger than $c$, considering both annotated and unannotated genes. As a final step, using the distribution computed at the previous step, we computed the k-percentiles of the three top ranked unannotated genes within each selected MeSH term. Considering that we selected 24 MeSH diseases, this procedure leads to a collection of 72 k-percentiles whose frequency is plotted in Fig. 8.

Fig. 7. Comparison of the average precision at fixed levels of recall across the 708 MeSH diseases between network integration methods and the best single gene network
[Precision (ordinate) versus recall (abscissa) curves. (a) WA, WAP, UA, PUA, MAX, MIN, finet; (b) WAall, WAPall, UAall, PUAall, MAXall, MINall, finet.]

Table 7
List of 24 selected diseases and of the corresponding top ranked unannotated genes.
Disease id.  Disease name                              Top ranked unannotated genes
C535579      Cardiofaciocutaneous syndrome             KSR2, PILRA, KSR1
C536436      Coffin-Siris syndrome                     PYGO1, ARID2, SMARCC2
C536664      Peroxisome biogenesis disorders           PEX5, PEX7, LONP2
C536783      T-Lymphocytopenia                         BIRC8, CASP10, NAIP
C536928      Turcot syndrome                           MLH3, PMS2L5, MSH3
C537345      Sitosterolemia                            UGT1A5, UGT2B17, SLCO1B1
C538169      Acitretin embryopathy                     CASP10, PEA15, SLCO3A1
D000562      Amebiasis                                 DCLRE1C, IL19, CYP2C8
D001404      Babesiosis                                DCLRE1C, IL19, FCGR2C
D002062      Bursitis                                  UGT2B4, UGT2B15, UGT1A4
D006958      Hyperostosis, Cortical, Congenital        NPPC, NPR1, ACE
D007888      Leigh Disease                             NDUFB10, NDUFB4, NDUFA12
D008118      Loiasis                                   FCGR2C, CYP3A43, CYP8B1
D008375      Maple Syrup Urine Disease                 ACAD8, PDHX, PDHB
D009196      Myeloproliferative Disorders              PTPN1, CISH, SLC25A40
D009634      Noonan Syndrome                           KSR2, KSR1, MRAS
D010483      Periapical Diseases                       MMP13, IL12B, IL8
D012214      Rheumatic Heart Disease                   CYP21A2, CYP8B1, CYP3A43
D014353      Trypanosomiasis, African                  DCLRE1C, BCL2, STAT1
D015823      Acanthamoeba Keratitis                    DCLRE1C, IL19, CYP2C8
D018235      Smooth Muscle Tumor                       NFKB1, IL8, IL6
D020299      Intracranial Hemorrhage, Hypertensive     NPPC, NPPB, CRH
D056685      Costello Syndrome                         KSR2, PILRA, KSR1
D056824      Upper Extremity Deep Vein Thrombosis      FGGCX, PROZ, F11
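The label-shuffling test used to assess robustness can be sketched as follows. As a simplification, the sketch keeps the predicted scores fixed and shuffles only the labels before recomputing the AUC; whether the scores were recomputed for each shuffle in the experiments is not stated here, so this is an assumption, together with the function names and the toy data.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def shuffled_label_p_value(scores, labels, n_shuffles=1000, seed=0):
    """Empirical p-value m / n_shuffles.

    m counts the shuffles whose AUC is strictly larger than the AUC obtained
    with the true labels.
    """
    rng = random.Random(seed)
    true_auc = auc(scores, labels)
    shuffled = list(labels)
    m = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) > true_auc:
            m += 1
    return m / n_shuffles


# Toy ranking of 20 genes in which the 5 annotated genes obtain the top scores.
scores = [1.0 - 0.05 * i for i in range(20)]
labels = [1] * 5 + [0] * 15
p_value = shuffled_label_p_value(scores, labels)
```

Because the true labels give a perfect ranking in this toy example, no shuffle can produce a larger AUC and the empirical p-value is 0.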
Fig. 8 shows that most of the top ranked unannotated genes are concentrated close to the 100-percentile, showing that these top ranked “false positive” genes are “strongly predicted” as possible candidate disease genes, since their scores are close to those of the top ranked annotated genes. This is also supported by the fact that we selected only diseases for which gene prioritization achieved a very high AUC and “robust” predictions (AUC ≥ 0.975 and p-value < 0.01). The gene symbols of the top three false positives, along with the disease identifiers and disease names for the selected 24 MeSH descriptors, are listed in Table 7.
Of course the proposed top ranked genes are only disease gene candidates, and these results need to be biologically interpreted and should undergo a rigorous bio-medical analysis before the genes are actually associated with the disease itself.
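The percentile analysis of Section 3.6 can be sketched for a single disease as follows. The empirical cumulative distribution is evaluated here as the percentage of retained scores that are less than or equal to a given score, which is one common convention and an assumption of this example, as are the function name and the toy scores.

```python
def top_unannotated_percentiles(scores, labels, top=3):
    """Percentiles of the top ranked unannotated genes for one disease.

    c is the lowest score among annotated genes; the empirical distribution
    is built from all scores >= c (annotated and unannotated genes alike).
    """
    c = min(s for s, y in zip(scores, labels) if y == 1)
    pool = [s for s in scores if s >= c]
    unannotated = sorted((s for s, y in zip(scores, labels) if y == 0),
                         reverse=True)[:top]
    return [100.0 * sum(1 for v in pool if v <= s) / len(pool)
            for s in unannotated]


# Toy scores: five annotated genes followed by five unannotated genes.
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.92, 0.88, 0.70, 0.30, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
percentiles = top_unannotated_percentiles(scores, labels)
```

In this toy example the best unannotated gene falls at the 87.5th percentile of the retained scores, i.e. it scores higher than most annotated genes.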
Fig. 8. Frequency of the k-percentiles of the three top ranked unannotated genes.
[Histogram titled “k-percentiles of the top 3 FP scores within each selected MeSH term”; horizontal axis: k-percentile (0 to 100); vertical axis: frequency (0 to 60).]
4. Conclusions
We performed an extensive analysis of gene-disease associations not limited to genetic disorders, including more than 700 MeSH diseases.
By using network integration and gene prioritization methods, we reported for each disease the 10 top-ranked unannotated genes, available for further bio-medical analysis. Moreover, by analyzing the top-ranked predictions relative to the 24 best and most robustly predicted MeSH diseases, we showed that our approach can detect reliable candidate disease genes.
It is well known that the integration of multiple omics sources of evidence is of paramount importance in several application domains in computational biology [65–68]. In this work we performed a systematic comparison of unweighted integration and our proposed weighted combination methods to provide an evaluation of the impact of network integration on gene prioritization. We quantitatively showed that network integration is necessary to boost gene prioritization results, in agreement with previous results published in the literature [15,69,27,28,46,47].
In particular, we showed that the proposed weighted integration methods, by exploiting the different “informativeness” embedded in different gene interaction networks, significantly outperform unweighted integration. Moreover our experimental results show that the performance strongly depends on the selection of the sources of evidence and on the characteristics of the gene networks. For instance, even a simple UA integration can significantly improve the performance of gene prioritization methods if a sufficient number of diverse and complementary gene interaction networks are combined. From this standpoint, a novel research line could be represented by an adaptation of test and select methods, originally proposed in the context of supervised ensembles [70], to appropriately choose the most predictive sources of evidence and gene networks for each MeSH disease through an adaptive learning process.
Confirming previous results [30], semantic similarity-based networks, combined with other sources of evidence, boost the performance of gene prioritization methods. A possible improvement of the proposed approach could consist in combining networks based on semantic similarity measures that embed the ontology