• No results found

Toxicity prediction from toxicogenomic data based on class association rule mining

N/A
N/A
Protected

Academic year: 2021

Share "Toxicity prediction from toxicogenomic data based on class association rule mining"

Copied!
10
0
0

Loading.... (view fulltext now)

Full text

(1)

ContentslistsavailableatScienceDirect

Toxicology

Reports

j o u r n al ho me p ag e : w w w . e l s e v i e r . c o m / l o c a t e / t o x r e p

Toxicity

prediction

from

toxicogenomic

data

based

on

class

association

rule

mining

Keisuke

Nagata

a,∗

,

Takashi

Washio

b

,

Yoshinobu

Kawahara

b

,

Akira

Unami

a

aDrugSafetyResearchLaboratories,AstellasPharmaInc.,2-1-6Kashima,Yodogawa-ku,Osaka532-8514,Japan bTheInstituteofScientificandIndustrialResearch,OsakaUniversity,8-1Mihogaoka,Ibaraki,Osaka567-0047,Japan

a

r

t

i

c

l

e

i

n

f

o

Articlehistory:

Received25August2014

Receivedinrevisedform20October2014 Accepted20October2014

Availableonline7November2014 Keywords:

Microarray Toxicogenomics

Classassociationrulemining CBA

a

b

s

t

r

a

c

t

WhiletherecentadventofnewtechnologiesinbiologysuchasDNAmicroarrayand next-generationsequencerhasgivenresearchersalargevolumeofdatarepresenting genome-widebiologicalresponses,itisnotnecessarilyeasytoderiveknowledgethatisaccurate andunderstandableatthesametime.Inthisstudy,weappliedtheClassificationBased onAssociation(CBA)algorithm,oneoftheclassassociationruleminingtechniques,tothe TG-GATEsdatabase,wherebothtoxicogenomicandtoxicologicaldataofmorethan150 compoundsinratandhumanarestored.Wecomparedthegeneratedclassifiersbetween CBAandlineardiscriminantanalysis(LDA)andshowedthatCBAissuperiortoLDAinterms ofbothpredictiveperformances(accuracy:83%forCBAvs.75%forLDA,sensitivity:82%for CBAvs.72%forLDA,specificity:85%forCBAvs.75%forLDA)andinterpretability. ©2014TheAuthors.PublishedbyElsevierIrelandLtd.Thisisanopenaccessarticleunder

theCCBY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/3.0/).

1. Introduction

NewtechnologiessuchasDNA microarrayand next-generationsequencer have allowedresearchers tolearn biologicalphenomenaingenomeortranscriptomelevels. Especiallyintoxicology,thesenewtechnologieshaveled toanewsubdiscipline,termedtoxicogenomics. Toxicoge-nomics isconcernedwiththeidentificationofpotential human and environment toxicants, and their putative mechanisms of action, through the use of genomics resources[1].Forexample,byevaluatingand character-izingdifferentialgeneexpressions,inhumansoranimals, after exposure to drugs, it is possible to use complex expressionpatternstopredicttoxicologicaloutcomesand toidentify mechanismsinvolved withor related tothe toxic event [2]. Traditionally, to construct a such pre-dictiveclassifier,techniquesinmachinelearningsuchas

∗ Correspondingauthor.Tel.:+810662107028. E-mailaddress:[email protected](K.Nagata).

k-nearestneighbors,lineardiscriminantanalysis(LDA)and supportvectormachine(SVM)havebeenmostlyused[3]. However,buildingaclassifierthatisaccurateand under-standable at the same time is not necessarily an easy task.Forexample,whileSVMachieveshighclassification accuracy,resultingclassifiersarehardtointerpretas vari-ables are transformed nonlinearlyinto a feature space, and hence difficult to use in order to extract relevant biological knowledge from it [4]. Veryoften, predictive accuracy,understandability,andcomputationaldemands needtobetradedoffagainstoneanother,because algo-rithmsoftencompromiseonetogainperformanceinthe other[5].

In this study, weapplied theclassification based on association(CBA)algorithmtotoxicogenomicdatainan aimtobuildaclassifierthatisaccurateandunderstandable atthesametime.Wecompareditspredictiveperformances andinterpretabilityofgeneratedclassifierswiththoseof LDA,whichisconsideredtobeoneofthemoststandard classificationmethodsandhaveagoodbalancebetween accuracyandinterpretability.

http://dx.doi.org/10.1016/j.toxrep.2014.10.014

2214-7500/© 2014 The Authors. Published by Elsevier Ireland Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

(2)

CBAisoneoftheclassassociationrule(CAR)mining algorithms,whichintegrateassociationrulemining (find-ingalltherulesexistinginthedatabasethatsatisfysome constraints)andclassificationrulemining(discoveringa smallsetofrulesinthedatabasethatformsanaccurate classifier)byfocusingonminingaspecialsubsetof associ-ationrules,calledclassassociationrules(CARs)[6].One of the advantages of CAR mining algorithms over con-ventionalmethods(especiallySVM)isitsinterpretability, becauseclassifiersaregeneratedasasetofsimplerules withoutmuch sacrifice of accuracy [7].Another advan-tageisthatCARminingalgorithmscanbeappliednotonly tolinearlyseparablecases,butalsotolinearly insepara-blecases,whereLDAorotherlinearclassificationmethods arenotapplicable[8].SVMcanhandlelinearlyinseparable casesbymappingoriginaldataintoasuitablefeaturespace, butwithlossofinterpretability.Besides,especiallywhen appliedtogeneexpressiondata,CARminingalgorithms, whichpredictaclasslabelbasedonspecificsetsof differen-tiallyexpressedgenesthatareactuallyobservedintraining samples,areexpectedtogeneratemorebiologically rea-sonableclassifiers,becauseit isgenerallynotindividual genesbutsetsofgenesthatcollectivelydefinephenotypes suchasdrugresponses[9].WhileapplicationsofCBAand itsvariantsinbiologicalresearchhavebeenreportedin sev-eralreports[10–14],thereissofarnoreportswithdirect implicationfortoxicogenomics,whichisuniqueinthatthe numberof variables tobeanalyzed isusually farmuch greaterintoxicogenomics(morethan30,000genes)than inotherapplicationsandthisso-calledhighdimensionality makesitdifficulttoanalyzeitsdata.

To compare the predictive performances and inter-pretability of CBA and LDA, utilizing the TG-GATEs database,wherebothmicroarrayandtoxicologicaldataof morethan150compoundsinrats(invivoandin vitro) andhumans(invitro)arestored,webuiltbothCBAand LDAclassifiersthatpredictwhetherachemicalcompound inducesincreasesinliverweightafter14-dayrepetitive treatments in rats based on transcriptomic data of 3-dayrepetitivetreatments.Althoughmeasurableincreases in mRNA(indicative of enzymeinduction) are likely to precede, increase in liver weight is the most sensitive indicatorof hepatocellularhypertrophy and occurprior tomorphologicalchanges.Whileitshouldbealsonoted that hepatocellular hypertrophy without histological or clinicalpathologicalalterationsisconsideredtobean adap-tivenon-adversechange,certaindegreesofliverweight increaseappeared tobecorrelated withthesubsequent developmentofirreversibletoxicitysuchasfibrosis, necro-sis,vacuolization,fattydegeneration,andevenneoplasia [15] and early detection of hepatocellular hypertrophy basedonliverweightorgeneexpressionsisexpectedto beuseful,forexample,inselectingcompoundswithless riskofhepatotoxicityindrugdevelopment.

2. Materialandmethods

2.1. Datasource

TG-GATEs is a toxicogenomic database devel-oped by The Toxicogenomics Project (TGP), a joint

government-private sector project organized by the NationalInstituteofBiomedicalInnovation,National Insti-tuteofHealthSciencesand15pharmaceuticalcompanies in Japan, and The Toxicogenomics Informatics Project (TGP2), a follow-on project fromTGP organizedby the National Institute of Biomedical Innovation, National Institute of Health Sciences and 13 companies. Gene expression and toxicity data in vivo (rats) and in vitro (primaryculturedhepatocytesofratsandhumans)after treatmentsofmorethan150compoundsarestoredinthe TG-GATEsdatabase.TG-GATEsisnowreleasedforpublic asOpenTG-GATEs(http://toxico.nibio.go.jp).

FromtheTG-GATEsdatabase,weusedgeneexpression data(n=3pergroup)onedayafter3-dayrepetitivedoses (hereinafter4D)intheliverofratsandliverweightdata (relativeliverweightscalculatedfrombodyweights)(n=5 pergroup)onedayafter14-dayrepetitivedoses(15D)in ratsforthisstudy.Foreachcompound, onlythedataof thehighestdosegroupanditscontrolgroupwasused.Of 150compounds,weomittedonecompoundandanalyzed theremaining149compoundsbecausethatonecompound wasfoundtohavekilledanimalsbefore15Dinthestudy andthereforenodataisavailableforliverweightof15D.

2.2. CBA(classificationbasedonassociation)

2.2.1. Software

IncourtesyofDr.FransCoenen,weusedaCBAprogram availableontheLUCS-KDDwebsite,whichisimplemented accordingtotheoriginalalgorithmby[6],exceptthatCARs arefirstgeneratedusingtheApriori-TFPalgorithminstead oftheCBA-RGalgorithm.

2.2.2. Concept

ThebasicconceptofCBAisbrieflyexplainedherebased ontheexplanationsfrom[6]withexamplesinthisstudy. Fordetail,referto[6].LetDbethedataset,asetofrecords

d(d∈D).LetIbethesetofallnon-classitemsinD,andYbe thesetofclasslabelsinD.Inthisstudy,anon-classitemis apairofgeneIDanditsdiscretizedexpression(IncorDec) (Inc:increased,Dec:decreased)andaclasslabelisapair ofatargetparameter(RLW:relativeliverweight)andits discretizedvalue(IncorNI,orDecorND)(NI:notincreased, ND:notdecreased).ThesetofclasslabelsYinthisstudyis either{(RLW,Inc),(RLW,NI)}or{(RLW,Dec),(RLW,ND)}. Wesaythatarecordd∈DcontainsX⊆I,orsimplyX⊆d,if

dhasallthenon-classitemsofX.Similarly,arecordd∈D

containsy∈Y,orsimplyy⊆d,ifdhastheclasslabely.A ruleisanassociationoftheformX→y(e.g.(Gene01,Inc), (Gene02,Dec)→(RLW,Inc)).ForaruleX→y,Xiscalledan antecedentoftheruleandyiscalledaconsequenceofthe rule.AruleX→yholdsinDwithconfidencecifc%ofthe recordsinDthatcontainXarelabeledwithclassy.Arule

X→yhassupportsinDifs%oftherecordsinDcontainX

andarelabeledwithclassy.TheobjectivesofCBAare(1) togeneratethecompletesetofrulesthatsatisfythe user-specifiedminimumsupport(calledminsup)andminimum confidence(calledminconf)constraints,and(2)tobuilda classifierfromtheserules(classassociationrules,orCARs).

(3)

TheoriginalCBAalgorithmofLiuetal.consistsoftwo parts, a rule generator (called CBA-RG) and a classifier builder(calledCBA-CB),eachcorrespondingto(1)and(2). ThekeyoperationofCBA-RGistofindallrulesX→y

thathavesupportaboveminsup.Rulesthatsatisfyminsup

arecalledfrequent,whiletherestarecalledinfrequent.For alltherulesthathavethesameantecedent,therulewith thehighestconfidenceischosenasthepossible rule(PR) representingthissetofrules.Iftherearemorethanone ruleswiththesamehighestconfidence,oneruleis ran-domlyselected.Iftheconfidenceisgreaterthanminconf,

theruleis accurate.ThesetofCARsthus consistsofall

thePRsthatarebothfrequentandaccurate.TheCBA-RG

algorithmeffectivelysearchesforalltheCARsinadataset basedontheApriorialgorithm[16],assumingthe down-wardclosurepropertythatforanyX,Xisfrequentifand onlyifanysubsetxofXisfrequent.InsteadofCBA-RG,the Coenen’sCBAprogramisimplementedwiththe Apriori-TFPalgorithm[17,18],avariantoftheApriorialgorithms thatutilizesa tree-structureddatarepresentationsfora higherperformance.

Theoperationofthelatterpart,CBA-CB,isdescribedas followsin[6].“Giventworules,riandrj,rirj(alsocalled

riprecedesrjorrihasahigherprecedencethanrj)if 1.theconfidenceofriisgreaterthanthatofrj,or

2.theirconfidencesarethesame,butthesupportofriis greaterthanthatofrj,or

3.boththeconfidencesandsupportsofriandrj arethe same,butriisgeneratedearlierthanrj.

LetRbethesetofgeneratedrulesandDthetraining data”.CBA-CBis“tochooseasetofhighprecedencerules inRtocoverD”.Ageneratedclassifierisoftheform,<r1,

r2,...,rn,defaultclass>,whereri,∈Randra,rbifb>a.In

classifyingasamplewithaunknownclasslabel,thefirst rulethatsatisfiesthesamplewillclassifyit.Ifthereisno rulethatappliestothesample,ittakesonthedefaultclass,

defaultclass.Belowisasimpleexampleofclassifiers.

Example:

(Gene01, Inc), (Gene02, Dec)→(RLW, Inc)

(Gene01, Inc), (Gene03,Inc)→(RLW, Inc)

(NULL)→(RLW, NI)

Inthisexample.eachlinecorrespondstoaruleincluded intheclassifier.Therulewiththe(NULL)antecedentmeans thedefaultruleofthisclassifier.Whenasample,(Gene01, Inc), (Gene03, Inc) withan unknown class label (it is unknownwhetherRLWisIncorNI),isclassified,the classi-fieranswers(RLW,Inc),asthesecondrulefirstsatisfiesthe sample.Inanothercase,whereasample,(Gene01,Inc), (Gene02, Inc),isclassified,theclassifieranswers (RLW, NI),asnoneoftherulesexceptthedefaultrulesatisfies thesampleandthusthedefaultruleisapplied.

2.3. Dataanalysis

PriortotheCBAanalysis,wehavepreprocessedgene expressiondataintheliver(4D)andliverweightdata(15D) ofratsafterrepetitivedosesfor149compoundsfromthe TG-GATEsdatabase.First,geneexpressionswerecorrected andnormalizedbytheMAS5.0algorithm[19]toreduce inter-arrayvariances[20].Liverweightsweretransformed intorelativeliverweight,aratioofliverweightdividedby bodyweighttoavoidlargevariationsinbodyweight ske-wingorganweightinterpretation[15].Second,valueswere averagedoverindividualanimalsincludedineachgroup. Then,foreachcompound-treatedgroup,afoldchangewas calculatedas aratioof anaveragevalueofa treatment groupdividedby anaveragevalue ofits corresponding controlgroup,toreduceinter-studyvariances[21].Finally, wediscretizedgeneexpressionsandrelativeliverweights basedontheirfoldchanges (fc) andpvalues (p)of the student’st-test conductedbetweena compound-treated groupanditscorrespondingcontrolgroup,accordingto thecriteriashownbelow.

2.3.1. Geneexpressiondata

Iffc>2andp<0.05,assign“Inc”(increased). Iffc<0.5andp<0.05,assign“Dec”(decreased). Otherwise,assign“NC”(notchanged).

2.3.2. Liverweightdata

1.Whenaclassifierforincreasedliverweightwasbuilt:If fc>1 and p<0.05, assign “Inc” (increased).Otherwise, assign“NI”(notincreased).

2.Whenaclassifierfordecreasedliverweightwasbuilt:If fc<1andp<0.05,assign“Dec”(decreased).Otherwise, assign“ND”(notdecreased).

Discretization thresholds for gene expressions com-binedwithfoldchangesandstatisticaltest(e.g.student’s

t-test)haveoftenbeenappliedinmicroarraydataanalysis andisreportedtobebetterthanpvaluealone[22].In gen-eral,numericalparametersobtainedintoxicitystudiesare judgedtobeincreasedordecreased,basedessentiallyon statisticalcomparisonwithcontemporarycontrolsand,if available,additionallyonhistoricaldata[23].Inthisstudy, wediscretizedliverweightsbasedonlyonstatisticaltests, asnohistoricaldatawasavailable.

BeforeproceedingtoCBA,geneexpressionsdiscretized as “NC” in each group were discarded from the data, becausewewereinterestedonlyingeneswithincreased ordecreasedexpressions.Wethenanalyzedthedatawith CBA,withdiscretizedgeneexpressionsasnon-classitems anddiscretizedliverweightsasclasslabels.

2.4. Lineardiscriminantanalysis(LDA)

2.4.1. Software

WeusedtheldafunctionintheMASSlibraryofR.R’slda functionisimplementedbasedonRao’sLDA[24,25],also knownasFisher-RaoLDA,whichgeneralizedFisher’sLDA [26]tomultipleclasses.

(4)

2.4.2. Dataanalysis

PriortotheLDAanalysis,thedatawaspreprocessedas describedintheCBAsection,exceptthatgeneexpressions werenotdiscretized.BeforeproceedingtoLDA,the fea-tureselectionstepwasconductedtoreducethenumber ofgenes,becauseclassicalLDArequiresthetotal scatter matrixtobenonsingular,whilethematrixcanbesingular whenthesamplesize(149)doesnotexceedthenumber offeatures(genes)(morethan30,000)[27],andtendsto overfitand becomelessinterpretablein thepresenceof manyirrelevantand/orredundantfeatures[28].Basedon thepreviousreportsonmicroarraydataanalysis[29,30], weselectedonlythegenesthatwereup-regulated(fc>2 andp<0.05)or down-regulated(fc<0.5and p<0.05)in thegroupswithincreasedordecreasedliverweightwhen comparedtothenot-increasedornot-decreasedgroups, respectively.

2.5. Predictiveperformancecomparison

TocomparepredictiveperformancesofCBAandLDA, we conducted 10-fold cross validation [31] for each methodswiththetotalof149records(compounds),and evaluatedsensitivity, specificity, and accuracy averaged over 10 validations. These parameters are defined as follows[32].

Sensitivity Truepositive/(truepositive+falsenegative) Specificity Truenegative/(truenegative+falsepositive) Accuracy (Truepositive+truenegative)/total

10-foldcrossvalidation,ormoregenerallyk-foldcross validation,isoneofthestandardmethodsforevaluating predictiveperformancesofclassifiers.Thismethoddivide adatasetintoequally-sizedkpartitions(1,2,...,k).Inthe firststep,thefirstpartition(1)isreservedasatestsetand theotherpartitions(2,3,...k)areusedasatrainingsetto buildaclassifier.Onceaclassifierisbuilt,itisvalidatedfor itspredictiveperformanceswithatestset(thefirst parti-tioninthiscase).k-Foldcrossvalidationrepeatsthisstepsk

timeschangingapartitionservingasatestsetonebyone. Intheend,averagedpredictiveperformanceoverk vali-dationstepsisregardedasthepredictiveperformanceofa classificationalgorithm.

2.6. Student’st-test

For statistical comparison of mean gene expressions orliverweightsbetweenacompound-treatedgroupand itscorrespondingcontrolgroupforeachcompound, the unpairedtwotailedstudent’st-testwithoutequalvariance assumptionwasconducted.Specifically,thisstatisticaltest wasconductedinthediscretizationstepofCBAandthe featureselectionstepofLDA.Whengeneexpressionswere comparedbetweentwogroups,geneexpressionswere log-transformedwithbaseoftwopriortothestatisticaltest. Logtransformationsofgeneexpressiondataisknownto resultinmoreconsistentstatisticalinferencesandbeoften considereddesirable,duetoitslargecoefficientofvariation [33].

Itiswellknownthatthestandardp-valuemethodleads tothehighrateoffalsepositiveswhenappliedinrepeated testing.Thisisthecasewhenanalyzinggeneexpression data collectedvia microarrays, as this usually involves testing fromseveral thousands totens of thousands of hypothesessimultaneously.Whileanumberofadjustment procedures (e.g.controllingthefalse discoveryrate)are available,theyareoftentooconservativeformicroarray studiesinthattheycanleadtolowsensitivity[34],thus increasingtheriskofmissingtruepositives.Inthisstudy, noadjustmentswereapplied,takingitintoconsideration thateveniffalsepositivegeneswithnoorlittlerelevance forliverweightsweredetectedbystatisticaltests,the clas-sification methods woulddiscardmany of themfrom a generatedclassifier,hencemarginalizetheimpactofsuch false positiveswhile minimizingtheriskof overlooking trueimportantchanges.

2.7. Pathwayanalysis

Canonicalpathwayanalysisforthegenesincludedin theCBA-generatedclassifierwasconductedwithQIAGEN’s IngenuityPathwayAnalysis(IPA)softwaretounderstand whatpathway(andhencefunction)thesegenesaremainly involved.ThereasonwhyweusedIPA,notapublicly avail-abledatabase,isitshighqualityofinformation.IPAisbased on “expertly curated biological interactions and func-tionalannotationsfrommillionsofindividuallymodeled relationships betweenproteins, genes, complexes, cells, tissues, drugs, and diseases” and “reviewed for accu-racy byPhDscientists”(accordingtoQIAGEN’s website: http://www.ingenuity.com/products/ipa).

Canonical pathways are a set of pre-built pathways basedontheliterature.CanonicalpathwayanalysisofIPA answershowstatisticallysignificantlythepathwayswere affected,consideringhowmanymoleculesauser-specified set and a pathway share. In this study, we conducted canonicalpathwayanalysiswithallthegenesincludedin ourCBA-generatedclassifier.Incanonicalpathway analy-sis,specifiedgenesareconvertedtotheircorresponding moleculesandmatchedupagainstthemoleculesineach pathway.

2.8. Computer

Inthisstudy,weusedapersonalcomputerwithIntel Corei5-3320M2.6GHzCPUand4GBRAMfortheanalyses.

3. Results

3.1. Selectionofminimumsupportandconfidence

InCBA,ausermustspecifytwoparameters:minimum support (minsup) and minimum confidence (minconf). There is no universal criteria for these parameters. In this study, we assumed that lower minsup and higher confidence are basicallydesirable.Thatis tosay,a rule is considered useful, if the rule X→y satisfies a large fraction of recordsthat matches therule antecedent X, evenifthenumberofrecordsthatmatchesXissmall.This is because a drug-induced response (or more generally

(5)

Table1

ExplorationofvariousCBAsettings.

minsup(%) minconf(%) Averageaccuracy(%) Totaltime(s) (A)Whenminsupwasfixedat10%

10 50 77 0.61

10 80 76 0.59

10 90 79 0.58

10 100 77 0.58

(B)Whenminconfwasfixedat90%

20 90 0 0.42

15 90 9 0.42

10 90 79 0.58

8 90 83 22.37

7 90 Insufficientmemory

AccuracyofCBAclassifiersforincreasedrelativeliverweightwas evalu-atedin10-foldcrossvalidationsundervariouscombinationsofminsup andminconf.

biologicalresponse)isconsideredtobenotcaused bya

single mechanism. Rather, it is expected that there are

severaldifferentmechanisms,thusdifferentgene

expres-sionpatterns,finally leadingtothetargetdrug-induced

response, and thateach gene expressionpattern occurs

in arelativelylow frequencyamong thedatasetevenif

thedataset contains anenoughrecords withthetarget

drug-inducedresponse.Ifsettoostrict,however,thereis

ariskofmissingusefulruleswithfewexceptionsfortoo

highminconfandofselectingaccidentalruleswithonly

a few satisfying recordsfor toolow minsup.Moreover,

minsupisalsolimitedbycomputationalresources,asthe

lower the minsup is set, the higher the computational

demandis,intermsofbothtimeandmemory.

Toexploretheidealsettingsofminsupand minconf,

weevaluatedaccuracyofCBAclassifiersforincreasedliver

weightin10-foldcrossvalidationsundervarious

combi-nationsofminsupandminconf(Table1).First,wefixed

theminsupat10%andchangedtheminconffrom50%to 100%.Whiletheminconfat90%markedthehighest accu-racy(79%),therewerenoobviousdifferencesortendency inaccuracyamongthedifferentminconfs.Next,wefixed the minconf at90% and changed theminsup from 20% downward. Lowering the minsup remarkablyimproved accuracy,butprolongedcomputationaltimeatthesame time.Theaccuracyreachedat83%withminsupat8%.We triedwithminsupat7%,butfailedtofinishthe computa-tionduetomemoryinsufficiency.Similartendencieswere alsoconfirmedwhenassessingaccuracyofclassifiersfor decreasedliverweightunderdifferentminsupsand min-confs(datanotshown).

Basedontheseresults,weadoptedtheminsupat8% andminconfat90%forthefollowinganalyses.

3.2. Predictiveperformance

We compared predictive performance of classifiers between CBA and LDA with 10-fold cross validation (Table2).Whenincreasedliverweightwastargeted(that is,whenaclassifierforincreasedliverweightwasbuilt), CBAoutperformedLDAinallofthethreecriteria:accuracy (83%forCBA vs.75%forLDA),sensitivity(82%vs.72%), andspecificity(85%vs.75%).Whendecreasedliverweight wastargeted, CBAscored betteraccuracy(86% vs.73%)

and sensitivity (22% vs. 6%), while LDA marked better specificity(90%vs.95%).

WealsocomparedbetweenCBAandCBA-DR(CBA with-outdefaultrule),ourmodifiedversionoftheoriginalCBA (Table2).CBA-DRdoesnotpredictifa sampledoesnot matchanyruleexceptthedefaultruleinaclassifier,and,in turn,returna‘hold’.Whenincreasedliverweightwas tar-geted,CBA-DRmarkedloweraccuracy(83%forCBAvs.79% forCBA-DR)andspecificity(85%vs.29%)andhigher sen-sitivity(82%vs.100%).Whendecreasedliverweightwas targeted,CBA-DRmarkedlowersensitivity(22%forCBA vs.0%forCBA-DR)andhigheraccuracy(86%vs.95%)and specificity(90%vs.100%).

3.3. Interpretability

Wecomparedtheformofgeneratedclassifiersbetween CBAandLDA(Fig.1),whenalltherecordswereusedasa trainingsetforincreasedliverweight.CBAtellsusasetof rules,arrangedinorderofconfidence.Eachruleconsistsof anantecedent,whichisanitemsetintheformof(non-class attribute,itsdiscretizedvalue),andaconsequenceinthe formof(classattribute,itsclasslabel),shownafter“−>” here.

Ontheotherhand,LDAtellsusasinglediscriminative function(fd),whichisapolynomialofnon-classattribute valueswiththeircoefficients.Coefficientsina discrimina-tivefunctionofLDAreflectdiscriminativepowerofeach non-classattribute(gene,here),withhigherpositivevalues andlower negativevaluesmeaning largercontributions toeachcorrespondingclasslabelofaclassattribute(liver weight,here).

3.4. Biologicalrelevance

To look into how biologically reasonable the CBA-generated classifier is, we conducted the canonical pathwayanalysisforthesetofgenesselectedinthe clas-sifierwhenalltherecordswereusedasatrainingsetfor increasedliverweight(Table3)(forbrevity,onlytop10 pathwaysinorderof−logpareshown).BecauseLDAitself, incontrasttoCBA,doesnotexplicitlyselectasetofgenes inbuildingaclassifier,wedidnotcompareCBAwithLDA here.

Wecouldassumethatthemostsignificantpathways involvedwiththegenesinourclassifierweremainlydrug metabolism-relatedones,suchasXenobioticMetabolism Signaling,LPS/IL-1Mediated InhibitionofPXRFunction, PXR/RXRActivationetc.

Fig.2AisanexcerptaroundtheNRF2moleculefrom the illustration of the Xenobiotic Metabolism Signaling pathway,exportedfromIPA.NRF2isakeymodulatorof oxidativestressresponses.Inresponseofoxidativestress, NRF2isreleasedintothenucleusandup-regulates down-stream antioxidant enzymes, mainly drug metabolism enzymes.Actually,thegenesofdrugmetabolismenzymes suchasGST, NQO,and UGTdownstream of NRF2 were included in our classifier, suggesting the induction of drugmetabolismenzymestriggeredbyNRF-2-dependent oxidativestressresponses.

(6)

Table2

Comparisonofpredictiveperformances.

Method Targetdirection Averageover10-foldcrossvalidation

Total TP FN FP TP Hold Accuracy(%) Sensitivity(%) Specificity(%) CBA Inc 14.9 4.4 1.1 1.4 8 – 83 82 85 LDA Inc 14.9 2.7 1 2.8 8.4 – 75 72 75 CBA-DR Inc 14.9 4.4 0 1.4 0.8 8.3 79 100 29 CBA Dec 14.9 0.2 0.7 1.4 12.6 – 86 22 90 LDA Dec 14.9 0.2 3.3 0.7 10.7 – 73 6 95 CBA-DR Dec 14.9 0 0.7 0 12.6 1.6 95 0 100 PredictiveperformanceofclassifierswascomparedamongCBA,LDA,CBA-DRwith10-foldcrossvalidation.

Targetdirection:aclassifierwasbuiltforwhetherincreased(Inc)ordecreased(Dec)relativeliverweight.Total:averagenumberoftotalrecordsinatest setofeachtrialinacrossvalidation.TP:averagenumberoftruepositiverecordsinatestset.FN:averagenumberoffalsenegativerecordsinatestset.FP: averagenumberoffalsepositiverecordsinatestset.TN:averagenumberoftruenegativerecordsinatestset.Hold:averagenumberofrecordsinatest setthatdidnotmatchanyrulesexceptthedefaultrule(onlyforCBA-DR).

Notethataccuracy,sensitivityandspecificityfortheCBA-DRmethodwerecalculatedexcluding‘hold’samples.Totalsarenotintegershere,asthenumber ofrecordsintheoriginaldatasetwas149andthuscannotbedividedby10,thenumberoftrialsinthecrossvalidation.

CBA

(1368121_at, Inc), (1381852_at, Inc) -> (RLW, Inc), Support: 13.4%, Confidence: 100.0% (1387022_at, Inc), (1370067_at, Inc) -> (RLW, Inc), Support: 11.4%, Confidence: 100.0% (1368905_at, Inc), (1387783_a_at, Inc) -> (RLW, Inc), Support: 10.7%, Confidence: 100.0% (1371076_at, Inc), (1370828_at, Inc) -> (RLW, Inc), Support:9.4%, Confidence: 100.0% (1371089_at, Inc), (1368905_at, Inc), (1387599_a_at, Inc) -> (RLW, Inc), Support:8.7%, Confidence: 100.0% (1368905_at, Inc), (1375845_at, Inc) -> (RLW, Inc), Support:8.7%, Confidence: 100.0% (1368905_at, Inc), (1371076_at, Inc), (1371143_at, Inc) -> (RLW, Inc), Support: 8.1%, Confidence: 100.0% (1368905_at, Inc), (1390145_at, Dec), (1387006_at, Inc) -> (RLW, Inc), Support:8.1%, Confidence: 100.0% (1371076_at, Inc), (1387307_at, Dec) -> (RLW, Inc), Support:8.1%, Confidence: 100.0% (1370698_at, Inc), (1381852_at, Inc) -> (RLW, Inc), Support: 10.7%, Confidence: 94.1% (1387022_at, Inc), (1371076_at, Inc), (1384225_at, Dec) -> (RLW, Inc), Support: 10.1%, Confidence: 93.8% (1369440_at, Dec) -> (RLW, Inc), Support: 10.1%, Confidence: 93.8% (1377599_at, Inc) -> (RLW,NI), Support:9.4%, Confidence: 93.3% (1373814_at, Dec), (1389253_at, Inc) -> (RLW, Inc), Support:8.1%, Confidence: 92.3% (1371089_at, Inc), (1368905_at, Inc), (1371942_at, Inc) -> (RLW, Inc), Support:8.1%, Confidence: 92.3%

(NULL) -> (RLW,NI)

LDA

fd = 1.0982 fc(1380013_at) + 0.6116 fc(1387740_at) + 0.4895 fc(1389253_at) + 0.4870 fc(1368317_at) +0.4471 fc(1370870_at) + 0.4202 fc(1374187_at) + 0.2830 fc(1387766_a_at) + 0.2012 fc(1390358_at) +0.1999 fc(1371076_at) + 0.1195 fc(1369759_at) + 0.1109 fc(1384431_at) + 0.0638 fc(1387936_at) +0.0317 fc(1382137_at) + 0.0292 fc(1368905_at) + 0.0126 fc(1369698_at) + 0.0081 fc(1368718_at) +0.0063 fc(1369921_at) + 0.0041 fc(1370269_at) + 0.0039 fc(1387022_at) + 0.0024 fc(1374070_at) + 0.0023 fc(1387100_at) + 0.0002 fc(1398250_at) - 0.0003 fc(1387825_at) - 0.0034 fc(1385247_at) -0.0055 fc(1370491_a_at) -0.0124 fc(1371089_at) -0.0180 fc(1380669_at) -0.0723 fc(1370902_at) -0.1127 fc(1392413_at) -0.1159 fc(1388211_s_at) -0.1487 fc(1388210_at) -0.2057 fc(1387574_at) - 0.2170 fc(1380536_at) - 0.2384 fc(1375845_at) - 0.3318 fc(1389179_at) - 0.4109 fc(1378169_at) -0.4740 fc(1395403_at) -0.6657 fc(1391544_at) + 3.4389

If fd > 0, predicts RLW as Inc. Else, RLW as NI.

Antecedent: a set of non-class items in the form of (gene_id, Inc or Dec)

Consequence: a class label as predicon result in the form of (RLW, Inc or NI)

Support and Confidence of the rule

The first rule that sasfies the sample classifies it. If there is no rule sasfying the sample, the final default rule, (NULL), is applied.

In LDA, a classifier is represented by a discriminave funcon, fd, which is a polynomial of non-class aribute values (fold changes) with their coefficients

In CBA, a cl as si fi e r is a se t of rule s (e ach li ne ).

If the discriminave funcon is posive, the classifier predicts RLW as Inc. Otherwise, RLW as NI.

Fig.1. ComparisonoftheclassifierformbetweenCBAandLDA.TheformofgeneratedclassifierswerecomparedbetweenCBAandLDA,whenalltherecords wereusedasatrainingsetforincreasedrelativeliverweight.[CBA]Theclassifierconsistsofasetofrules,representedas“antecedent→consequence,support, confidence”,oneruleparline,inorderofconfidence.Anantecedentisasetofnon-classitems,eachitemrepresentedas(geneid,IncorDec).Aconsequence isaclasslabelthatisusedasaclassificationresultifthecorrespondingantecedentissatisfied,shownhereas(RLW,IncorNI).Thefinalrulewithan antecedent(NULL)isthedefaultrule,whichissatisfiedforanyrecordsandappliedifalltheprecedingrulesarenotmet.[LDA]Theclassifierisshownasa discriminativefunction,fd.fc(geneid)isafoldchangeofagenespecifiedwithgeneid.Iffdispositive,theclassifierpredictsRLWasInc.Otherwise,RLW asNI.geneid:RepresentedhereasanAffymetrixprobeID.RLW:relativeliverweight.Inc:increased.Dec:decreased.NI:notincreased.

(7)

Table3

CanonicalpathwayanalysisofCBAclassifier.

PathwayName −logp Molecules CorrespondingGenes Total Inc Dec

Xenobioticmetabolismsignaling 8.96 219 8 0 Gsta3,Aldh1a1,Ugt2b1,

Nqo1,RGD1559459,Cyp2b2,Ces2c, Sult2a2

LPS/IL-1mediatedinhibitionofRXRfunction 5.07 178 4 1 Abccg8,Gsta3

PXR/RXRactivation 3.95 58 3 0 Aldh1a1,Cyp2b2,Sult2a2 Arylhydrocarbonreceptorsignaling 2.94 127 3 0 Gsta3,Aldh1a1,Nqo1 NicotineDegradationIII 2.77 37 2 0 Ugt2b1,Cyp2b2 MelatoninDegradationI 2.75 38 2 0 Ugt2b1,Cyp2b2 Serotonindegradation 2.67 42 2 0 Aldh1a1,Ugt2b1 Superpathwayofmelatonindegradation 2.67 42 2 0 Ugt2b1,Cyp2b2 NRF2-mediatedoxidativestressresponse 2.66 159 3 0 Gsta3,Akr7a3,Nqo1 NicotineDegradationII 2.65 43 2 0 Ugt2b1,Cyp2b2 HistidineDegradationIII 2 6 0 1 Hal

ThecanonicalpathwayanalysiswasconductedwiththeIngenuityIPAsoftwareforthegenesincludedintheCBAclassifierwhenalltherecordswereused asatrainingsetforincreasedrelativeliverweight.Notethat,forbrevity,onlytop10pathwaysinorderof-logpareshownhere.

−logp:−logofp,wherepisavaluerepresentingstatisticalsignificanceintheanalysis.Asmallerpvalue(thusalarger−logpvalue)meansthatthepathway ismorestatisticallysignificantlyinvolved.Molecules:thetotal,increased(upregulated)numberanddecreased(downregulated)numberofmoleculesin eachpathwayareshown.Correspondinggenes:correspondingratgenesfortheincreasedordecreasedmoleculesincludedinthepathwayareshown.

Fig.2Bshowsoverlappingamongthecanonical path-ways detected as significant, which were divided into three clusters. The largest cluster consists of drug metabolism-related pathwaysas describedabove. Inter-estingly,twootherclusters,histidinedegradation-related andgluconeogenesis-related,werealsodetectedwithno

overlapbetweenthedrugmetabolism-relatedclusterand them.

WethensummarizedAffymetrixprobeIDs,gene sym-bolsand genenamesfor eachgeneinourclassifierand dividedthemintofourcategories,drugmetabolism, gluco-neogenesis,histidinedegradationandtheother(Table4),

Fig.2.CanonicalpathwayillustrationsofCBAclassifier.[A]AnexcerptaroundtheNRF2moleculefromtheillustrationoftheXenobioticMetabolism Signalingpathway,exportedfromIPA.[B]Overlappingamongthecanonicalpathwaysdetectedassignificant,whichweredividedintothreeclusters, exportedfromIPA.Eachnodecorrespondstoeachcanonicalpathwaydetectedassignificant.Eachlinkcorrespondstothenumberofmoleculesshared betweentwopathways.Colordepthofnodescorrespondstothe−logpvalue(thedeeperdepthis,thelargerthe−logpvaluesis).Linewidthoflinks correspondstothenumberofmoleculessharedbetweentwopathways(nolinemeansnosharedmoleculesbetweentwopathways).(Forinterpretation ofthereferencestocolorinthisfigurelegend,thereaderisreferredtothewebversionofthisarticle.)

(8)

Table4

DetailsandcategoryofthegenesinourCBAclassifier.

AffymetrixprobeID Genesymbol Changedirection Genenameordetail Drugmetabolism

1368121at Akr7a3 Inc Aldo-ketoreductasefamily7,memberA3(aflatoxinaldehydereductase) 1381852at RGD1559459 Inc SimilartoExpressedsequenceAI788959(Ugt2b34,Musmusculus) 1387022at Aldh1a1 Inc Aldehydedehydrogenase1family,memberA1

1368905 at Ces2C Inc Carboxylesterase2C

1371076 at Cyp2b2 Inc CytochromeP450,family2,subfamilyb,polypeptide2 1371089at Gsta3 Inc GlutathioneS-transferasealpha3

1387599aat Nqo1 Inc NAD(P)Hdehydrogenase,quinone1

1370698at Ugt2b1 Inc UDPglucuronosyltransferase2family,polypeptideB1

1387006at Sult2a2 Inc Sulfotransferasefamily2A,dehydroepiandrosterone(DHEA)-preferring,member2 1371942at Gstt3 Inc glutathioneS-transferase,theta3

Gluconeogenesis

1370067at Me1 Inc Malicenzyme1,NADP(+)-dependent,cytosolic Histidinedegradation

1387307 at Hal Dec Histidineammonia-lyase Other

1387783aat Acaa1b Inc Acetyl-CoenzymeAacyltransferase1B 1370828at Zdhhc2 Inc Zincfinger,DHHC-typecontaining2 1375845at Aig1 Inc Androgen-induced1

1371143at Serpina7 Inc Serpinpeptidaseinhibitor,cladeA(alpha-1antiproteinase,antitrypsin),member7 1390145 at Dmxl2 Dec Dmx-like2

1384225at (NA) Dec (NA)

1369440at Abcg8 Dec ATP-bindingcassette,subfamilyG(WHITE),member8 1377599at Lpin1 Inc Lipin1

1373814at R3hdm2 Dec R3Hdomaincontaining2 1389253at Vnn1 Inc Vanin1

AffymetrixprobeIDs,genesymbolsandgenenamesforeachgeneinourCBAclassifieraresummarized.Thegenesaredividedintofourcategories,drug metabolism,gluconeogenesis,histidinedegradationandtheother.

Changedirection:thedirectionofchange(IncorDec)intheclassifier.NA:notavailable.NofurtherinformationisavailableforthegenewithAffymetrix probeID,1384225 at.

basedonthecanonicalpathwayanalysis.Of22genes,10

genesweredrugmetabolism-related.

Ourclassifierwasshownagain,withgenesconverted

fromAffymetrixprobeIDs togene symbolsand colored

according to their category (Fig. 3). The mostly drug

metabolism-relatednatureofourclassifierwasconfirmed, asmostoftherulesintheclassifierincludeddrugoneor moremetabolism-relatedgenes(showninred).

4. Discussion

Whenincreasedliver weightwastargeted,CBA out-performed LDA in all of the three criteria: accuracy, sensitivity, and specificity. In contrast, when decreased liverweightwastargeted,bothCBAandLDAscoredlow sensitivities and highspecificities. Thesetendencies are attributabletothelowfrequencyofdecreasedliverweight

(Akr7a3,Inc), (RGD1559459, Inc) -> (RLW, Inc), Support: 13.4%, Confidence = 100.0% (Aldh1a1, Inc), (Me1, Inc) -> (RLW, Inc), Support: 11.4%, Confidence = 100.0% (Ces2c,Inc), (Acaa1b, Inc) -> (RLW, Inc), Support: 10.7%, Confidence = 100.0% (Cyp2b2,Inc), (Zdhhc2, Inc) -> (RLW, Inc), Support: 9.4%, Confidence = 100.0% (Gsta3,Inc), (Ces2c,Inc), (Nqo1,Inc) -> (RLW, Inc), Support: 8.7%, Confidence = 100.0% (Ces2c,Inc), (Aig1, Inc) -> (RLW, Inc), Support: 8.7%, Confidence = 100.0% (Ces2c,Inc), (Cyp2b2,Inc), (Serpina7, Inc) -> (RLW, Inc), Support: 8.1%, Confidence = 100.0% (Ces2c,Inc), (Dmxl2, Dec), (Sult2a2,Inc) -> (RLW, Inc), Support: 8.1%, Confidence = 100.0% (Cyp2b2,Inc), (Hal,Dec) -> (RLW, Inc), Support: 8.1%, Confidence = 100.0% (Ugt2b1,Inc), (RGD1559459,Inc) -> (RLW, Inc), Support: 10.7%, Confidence = 94.1% (Aldh1a1,Inc), (Cyp2b2,Inc), (1384225_at, Dec) -> (RLW, Inc), Support: 10.1%, Confidence = 93.8% (Abcg8, Dec) -> (RLW, Inc), Support: 10.1%, Confidence = 93.8% (Lpin1, Inc) -> (RLW,NI), Support: 9.4%, Confidence = 93.3% (R3hdm2, Dec), (Vnn1, Inc) -> (RLW, Inc), Support: 8.1%, Confidence = 92.3% (Gsta3,Inc), (Ces2c,Inc), (Gstt3, Inc) -> (RLW, Inc), Support: 8.1%, Confidence = 92.3%

(NULL) -> (RLW,NI)

Fig.3.OurCBAClassifierwithCategorizedGeneSymbols.TheCBAclassifier,thesameasoneinFig.1,isshownagain,withthegenesconvertedfrom AffymetrixprobeIDstogenesymbolsandcoloredaccordingtotheircategory.Red:drugmetabolism-related.Blue:gluconeogenesis-related.Green: histidinedegradation-related.Black:Other.(Forinterpretationofthereferencestocolorinthisfigurelegend,thereaderisreferredtothewebversionof thisarticle.)

(9)

inthedataset.Forsuchadataset,aclassifierreturninga negativeanswer(i.e.nofordecreasedliverweight)with a highfrequency, regardlessof predictivity,can scorea goodspecificitybutapoorsensitivity.Exceptforsuchan imbalanceddataset,CBAsucceededinbuildingabetter predictiveclassifierthanLDAinthisstudy.This superior-ityofCBAoverLDAisconsideredtoreflectthenon-linear natureofthedataset.Generally,adrug-inducedresponse (ormoregenerallybiologicalresponse)isconsideredtobe causednotbythesinglemechanism,butbyseveral dif-ferentmechanisms.Thus,thereareseveraldifferent,not necessarily linearlyseparable, gene expression patterns thatfinallyleadtothesameresponse(e.g.increasedliver weight).Inthislight,CBAislikelytobuildabetter clas-sifierforadatasetintoxicology,ormorebroadlybiology, thanLDA,asCBAcancaptureslinearlyinseparablepatterns residinginthedataset.

WealsocomparedbetweenCBAandCBA-DR,our mod-ified version of the original CBA. Whenincreased liver weightwastargeted,CBA-DRmarkedloweraccuracythan CBA.Interestinglyhowever,CBA-DRmarked100% sensi-tivity.Thiscanbesaidasfollows:ifCBAreturnsan“Inc” answerforliverweightandweknowthedefaultruleisnot appliedintheclassificationprocess,wecansaythatliver weightwouldbeincreasedwithhigherconfidencethanif wedon’tknowwhetherthedefaultruleisappliedornot.In addition,wecanalsoinferhowreliabletheclassificationis inCBAwhennon-defaultruleismet,basedonitssupport and confidence. Therefore, CBAoffersnot only a classi-ficationresult,butalsoadditionalinformation regarding reliabilityofclassification.Thiscanbeanotheradvantage ofCBAoverLDA,whichreturnsonlyaclassificationresult. Intermsofinterpretability,whilebothCBAandLDAgive usinformationregardingimportantgeneswhichcan dis-criminateincreasedliverweightswell,LDAdoesnottake theconceptofco-expressionintoaccount.For example, inoursetting,a rule(1368905at,Inc)occurred6times intheCBA-generatedclassifier.Thisrule,however,always occurredwithotherrules,reflectingthepatternactually observed inthetraining dataset.Therefore,even ifthe gene,1368905at,ishighlyincreasedinanunknown sam-ple,itdoesnotnecessarilymeanincreasedliverweight. Suchco-expressedpatternwasnottakenintoaccountby LDA.Besides,whilecoefficientvaluesareusefultoinfer importance of each gene in LDA,the final prediction is determinedbythetotalofallthetermsinapolynomial,not byasingleorsmallsetofgenes.Theclassificationprocessof CBAismuchsimplerandeasytounderstand,becauseeach ruleisassimpleasasingleorsmallsetofgenesandthe predictionisdeterminedoncearuleissatisfied,regardless oftheothergenes.ThischaracteristicofCBAmakesa gen-eratedclassifiereasytounderstand,evenforanon-expert user,becauseaCBA-generatedclassifiercanbeexpressed alsoinanaturallanguage(e.g.“IfgeneAisincreasedand geneBisdecreased,thentheclassifierpredictsliverweight tobeincrease”),notinamathematicalequationasiscase inLDA.

Canonical pathway analysis with IPA revealed that the genes included in our CBA-generated classifier for increased liver weight were mostly drug metabolism-related ones.Thisisreasonable asinductionsof hepatic

drug metabolizing enzymes are well known to induce hepatocellular hypertrophy [35], of which increases in liverweightisthemostsensitiveindicator[15].CBA suc-ceededinbuildingabiologicallyrelevantclassifierwithout anypriorknowledgesuchasliterature.Intriguingly, the classifierincludedgeneswithotherfunctionssuchas glu-coneogenesis and histidine degradation, which are not directlyrelatedtoincreasedliverweightor hepatocellu-larhypertrophy.Whileitisunclearwhetherthesegenes wereactuallycausalornot,CBAcanbeusedtolookfor geneswithanunknownfunctionbuthighcorrelationfor aspecifiedoutcomeaswellastobuildabiologically rea-sonableclassifiers.Inaddition,itwasalsoconsideredtobe anadvantagethatCBAautomaticallyselectsasmallsetof genestobuildaclassifier,whileLDAdoesnot.

5. Conclusions

We applied the CBA algorithm to the TG-GATEs database,wherebothtoxicogenomicandother toxicolog-icaldataofmorethan150compoundsinratandhuman arestored,tobuildapredictiveclassifierofincreasedor decreased liverweight for an unknowncompound. We comparedthegeneratedclassifiersbetweenCBAandLDA, andshowedthatCBAissuperiortoLDAintermsofboth predictiveperformancesandinterpretability.

Transparencydocument

TheTransparencydocumentassociatedwiththisarticle canbefoundintheonlineversion.

Acknowledgements

WewishtothankDr.FransCoenen(Universityof Liv-erpool) for kindly allowing us to use his software for ourresearch.WealsothankTakashiMatsudaandKotaro Tamura(AstellasPharmaInc.)fortheirusefuladvices.

AppendixA. Supplementarydata

Supplementary material related to this article can be found, in the online version, at doi:10.1016/ j.toxrep.2014.10.014.

References

[1]E.F.Nuwaysir,M.Bittner,J.Trent,J.C.Barrett,C.A.Afshari, Microar-raysandtoxicology:theadventoftoxicogenomics,Mol.Carcinog.24 (1999)153–159.

[2]L.Suter,L.E.Babiss,E.B.Wheeldon,Toxicogenomicsinpredictive toxicologyindrugdevelopment,Chem.Biol.11(2004)161–171. [3]J.H.Phan,C.F.Quo,M.D.Wang,Functionalgenomicsandproteomics

intheclinicalneurosciences:dataminingandbioinformatics,Prog. BrainRes.158(2006)83–108.

[4]G.Ratsch,S.Sonnenburg,C.Schafer,LearninginterpretableSVMsfor biologicalsequenceclassification,BMCBioinf.7(Suppl.1)(2006)S9. [5]C.Apte,S.J.Hong,R.Natarajan,E.P.D.Pednault,F.Tipu,S.M.Weiss, Dataintensiveanalyticsforpredictivemodeling,IBMJ.Res.Dev.47 (2003)17–23.

[6]B.Liu,W.Hsu,Y.Ma,Integratingclassificationandassociationrule mining,in:Proc.1998Int.Conf.KnowledgeDiscoveryandData Min-ing(KDD’98),1998,pp.80–86.

[7]F.P.Pach,A.Gyenesei,J.Abonyi,Compactfuzzyassociation rule-basedclassifier,ExpertSyst.Appl.34(2008)2406–2416.

(10)

[8]D.L.Sampson,T.J.Parker,Z.Upton,C.P.Hurst,Acomparisonof meth-odsforclassifyingclinicalsamplesbasedonproteomicsdata:acase studyforstatisticalandmachinelearningapproaches,PLoSOne6 (2011)e24973.

[9]A.R.Bateman,N.El-Hachem,A.H.Beck,H.J.Aerts,B.Haibe-Kains, Importanceofcollectioningenesetenrichmentanalysisofdrug responseincancercelllines,Sci.Rep.4(2014)4092.

[10]S.H.Chiu,C.C.Chen,G.F.Yuan,T.H.Lin,Associationalgorithmto minetherulesthatgovernenzymedefinitionandtoclassifyprotein sequences,BMCBioinf.7(2006)304.

[11]K.Kianmehr,R.Alhajj,CARSVM:aclassassociationrule-based classi-ficationframeworkanditsapplicationtogeneexpressiondata,Artif Intell.Med.44(2008)7–25.

[12]M.Tamura, P.D’Haeseleer,Microbialgenotype-phenotype map-pingby classassociationrule mining,Bioinformatics24(2008) 1523–1529.

[13]S.Dua,P.C.Kidambi,Proteinstructuralclassificationusing ortho-gonaltransformationandclass-associationrules,Int.J.DataMin. Bioinf.4(2010)175–190.

[14]R.Paul,T.Groza,J.Hunter,A.Zankl,Inferringcharacteristic pheno-typesviaclassassociationrulemininginthebonedysplasiadomain, J.Biomed.Inf.48(2014)73–83.

[15]A.P.Hall,C.R.Elcombe,J.R.Foster,T.Harada,W.Kaufmann,A. Knip-pel,K.Kuttler,D.E.Malarkey,R.R.Maronpot,A.Nishikawa,T.Nolte, A.Schulte,V.Strauss,M.J.York,Liverhypertrophy:areviewof adaptive(adverseandnon-adverse)changes—conclusionsfromthe 3rdInternationalESTPExpertWorkshop,Toxicol.Pathol.40(2012) 971–994.

[16]R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proc. 20th VLDB Conference (VLDB-94), 1994, pp. 487–499.

[17]F.Coenen,P.Leng,S.Ahmed,Datastructureforassociationrule mining:T-treesandP-trees,IEEETrans.Knowl.DataEng.16(2004) 774–778.

[18]F.Coenen,G.Goulbourne,P.Leng,Treestructuresformining associ-ationrules,DataMin.Knowl.Discovery8(2004)25–51.

[19]E.Hubbell,W.M.Liu,R.Mei,Robustestimatorsforexpression anal-ysis,Bioinformatics18(2002)1585–1592.

[20]S.Welle,A.I.Brooks,C.A.Thornton,Computationalmethodfor reduc-ingvariancewithAffymetrixmicroarrays,BMCBioinf.3(2002)23. [21]C.Cheng,K.Shen,C.Song,J.Luo,G.C.Tseng,Ratioadjustmentand

cal-ibrationschemeforgene-wisenormalizationtoenhancemicroarray inter-studyprediction,Bioinformatics25(2009)1655–1661. [22]D.J.McCarthy,G.K.Smyth,Testingsignificancerelativetoa

fold-changethresholdisaTREAT,Bioinformatics25(2009)765–771.

[23]M.F.Festing,D.G.Altman,Guidelinesforthedesignandstatistical analysisofexperimentsusinglaboratoryanimals,ILARJ.43(2002) 244–258.

[24]R.C.Rao,Theutilizationofmultiplemeasurementsinproblemsof biologicalclassification,J.R.Stat.Soc.,Ser.B10(1948)159–203. [25]W.N. Venables, B.D. Ripley, Modern Applied Statistics with S,

Springer,NewYork,USA,2002.

[26]R.A.Fisher,Theuseofmultiplemeasurementsintaxonomic prob-lems,Ann.Eugen.7(1936)179–188.

[27]J.Ye,T.Xiong,Q.Li,R.Janardan,J.Bi,V.Cherkassky,C.Kambhamettu, Efficientmodelselectionforregularizedlineardiscriminant anal-ysis,in:Proceedingsofthe15thACMInternationalConferenceon InformationandKnowledgeManagement,ACM,2006,pp.532–539. [28]Q.Gu,Z.Li,J.Han,Lineardiscriminantdimensionalityreduction,in: MachineLearningandKnowledgeDiscoveryinDatabases,Springer, Heidelberg,Germany,2011,pp.549–564.

[29]N.Kondoh,S.Ohkura,M.Arai,A.Hada,T.Ishikawa,Y.Yamazaki,M. Shindoh,M.Takahashi,Y.Kitagawa,O.Matsubara,M.Yamamoto, Geneexpressionsignaturesthatcandiscriminateoralleukoplakia subtypes and squamous cell carcinoma, Oral Oncol. 43 (2007) 455–462.

[30]W.Shi,A.Bugrim,Y.Nikolsky,T.Nikolskya,R.J.Brennan, Character-isticsofgenomicsignaturesderivedusingunivariatemethodsand mechanisticallyanchoredfunctionaldescriptorsforpredicting drug-andxenobiotic-inducednephrotoxicity,Toxicol.Mech.Methods18 (2008)267–276.

[31]C.Ambroise,G.J.McLachlan,Selectionbiasingeneextractiononthe basisofmicroarraygene-expressiondata,Proc.Natl.Acad.Sci.U.S.A. 99(2002)6562–6566.

[32]C.M. Florkowski,Sensitivity, specificity,receiver-operating char-acteristic(ROC)curvesandlikelihoodratios:communicatingthe performanceofdiagnostictests,Clin.Biochem.Rev.29(Suppl.1) (2008)S83–S87.

[33]A.D.Long,H.J.Mangalam,B.Y.Chan,L.Tolleri,G.W.Hatfield,P.Baldi, ImprovedstatisticalinferencefromDNAmicroarraydatausing anal-ysisofvarianceandaBayesianstatisticalframework.Analysisof globalgeneexpressioninEscherichiacoliK12,J.Biol.Chem.276 (2001)19937–19944.

[34]Y.Pawitan,S.Michiels,S.Koscielny,A.Gusnanto,A.Ploner,False discoveryrate,sensitivityandsamplesizeformicroarraystudies, Bioinformatics21(2005)3017–3024.

[35]D.Ennulat,D.Walker,F.Clemo,M.Magid-Slav,D.Ledieu,M. Gra-ham,S.Botts,L.Boone,Effectsofhepaticdrug-metabolizingenzyme inductiononclinicalpathologyparametersinanimalsandman, Tox-icol.Pathol.38(2010)810–828.

References

Related documents

South European welfare regimes had the largest health inequalities (with an exception of a smaller rate difference for limiting longstanding illness), while countries with

As inter-speaker variability among these the two groups was minimal, ranging from 0% to 2% of lack of concord in the 21-40 group and from 41% to 46% in the 71+ generation, we

Personal Direct General Lumbar Puncture Personal N/A Direct X General Myelogram Personal Direct General Thoracentesis Personal N/A Direct N/A General Paracentesis

In this PhD thesis new organic NIR materials (both π-conjugated polymers and small molecules) based on α,β-unsubstituted meso-positioning thienyl BODIPY have been

[87] demonstrated the use of time-resolved fluorescence measurements to study the enhanced FRET efficiency and increased fluorescent lifetime of immobi- lized quantum dots on a

This booklet addresses issues in relation to the Diaspora Engagement Affairs Directorate General, Investment Procedures in Ethiopia, Investment Incentives in Ethiopia, Customs