Toxicity prediction from toxicogenomic data based on class association rule mining

(1)

ContentslistsavailableatScienceDirect

Toxicology

Reports

j o u r n al ho me p ag e : w w w . e l s e v i e r . c o m / l o c a t e / t o x r e p

Toxicity

prediction

from

toxicogenomic

data

based

on

class

association

rule

mining

Keisuke

Nagata

a,∗

_,

_Takashi

_Washio

b

_,

_Yoshinobu

_Kawahara

b

_,

_Akira

_Unami

a

a_Drug_Safety_Research_{Laboratories,}_Astellas_Pharma_Inc.,_2-1-6_Kashima,_Yodogawa-ku,_Osaka_532-8514,_Japan b_The_Institute_of_Scientiﬁc_and_Industrial_Research,_Osaka_University,_8-1_Mihogaoka,_Ibaraki,_Osaka_567-0047,_Japan

a

r

t

i

c

l

e

i

n

f

o

Articlehistory:

Received25August2014

Receivedinrevisedform20October2014 Accepted20October2014

Availableonline7November2014 Keywords:

Microarray Toxicogenomics

Classassociationrulemining CBA

a

b

s

t

r

a

c

t

WhiletherecentadventofnewtechnologiesinbiologysuchasDNAmicroarrayand next-generationsequencerhasgivenresearchersalargevolumeofdatarepresenting genome-widebiologicalresponses,itisnotnecessarilyeasytoderiveknowledgethatisaccurate andunderstandableatthesametime.Inthisstudy,weappliedtheClassificationBased onAssociation(CBA)algorithm,oneoftheclassassociationruleminingtechniques,tothe TG-GATEsdatabase,wherebothtoxicogenomicandtoxicologicaldataofmorethan150 compoundsinratandhumanarestored.Wecomparedthegeneratedclassifiersbetween CBAandlineardiscriminantanalysis(LDA)andshowedthatCBAissuperiortoLDAinterms ofbothpredictiveperformances(accuracy:83%forCBAvs.75%forLDA,sensitivity:82%for CBAvs.72%forLDA,specificity:85%forCBAvs.75%forLDA)andinterpretability. ©2014TheAuthors.PublishedbyElsevierIrelandLtd.Thisisanopenaccessarticleunder

theCCBY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/3.0/).

1. Introduction

NewtechnologiessuchasDNA microarrayand next-generationsequencer have allowedresearchers tolearn biologicalphenomenaingenomeortranscriptomelevels. Especiallyintoxicology,thesenewtechnologieshaveled toanewsubdiscipline,termedtoxicogenomics. Toxicoge-nomics isconcernedwiththeidentiﬁcationofpotential human and environment toxicants, and their putative mechanisms of action, through the use of genomics resources[1].Forexample,byevaluatingand character-izingdifferentialgeneexpressions,inhumansoranimals, after exposure to drugs, it is possible to use complex expressionpatternstopredicttoxicologicaloutcomesand toidentify mechanismsinvolved withor related tothe toxic event [2]. Traditionally, to construct a such pre-dictiveclassiﬁer,techniquesinmachinelearningsuchas

∗ Correspondingauthor.Tel.:+810662107028. E-mailaddress:[email protected](K.Nagata).

k-nearestneighbors,lineardiscriminantanalysis(LDA)and supportvectormachine(SVM)havebeenmostlyused[3]. However,buildingaclassifierthatisaccurateand under-standable at the same time is not necessarily an easy task.Forexample,whileSVMachieveshighclassification accuracy,resultingclassifiersarehardtointerpretas vari-ables are transformed nonlinearlyinto a feature space, and hence difficult to use in order to extract relevant biological knowledge from it [4]. Veryoften, predictive accuracy,understandability,andcomputationaldemands needtobetradedoffagainstoneanother,because algo-rithmsoftencompromiseonetogainperformanceinthe other[5].

In this study, weapplied theclassification based on association(CBA)algorithmtotoxicogenomicdatainan aimtobuildaclassifierthatisaccurateandunderstandable atthesametime.Wecompareditspredictiveperformances andinterpretabilityofgeneratedclassifierswiththoseof LDA,whichisconsideredtobeoneofthemoststandard classificationmethodsandhaveagoodbalancebetween accuracyandinterpretability.

http://dx.doi.org/10.1016/j.toxrep.2014.10.014

(2)

CBAisoneoftheclassassociationrule(CAR)mining algorithms,whichintegrateassociationrulemining (find-ingalltherulesexistinginthedatabasethatsatisfysome constraints)andclassificationrulemining(discoveringa smallsetofrulesinthedatabasethatformsanaccurate classifier)byfocusingonminingaspecialsubsetof associ-ationrules,calledclassassociationrules(CARs)[6].One of the advantages of CAR mining algorithms over con-ventionalmethods(especiallySVM)isitsinterpretability, becauseclassifiersaregeneratedasasetofsimplerules withoutmuch sacrifice of accuracy [7].Another advan-tageisthatCARminingalgorithmscanbeappliednotonly tolinearlyseparablecases,butalsotolinearly insepara-blecases,whereLDAorotherlinearclassificationmethods arenotapplicable[8].SVMcanhandlelinearlyinseparable casesbymappingoriginaldataintoasuitablefeaturespace, butwithlossofinterpretability.Besides,especiallywhen appliedtogeneexpressiondata,CARminingalgorithms, whichpredictaclasslabelbasedonspecificsetsof differen-tiallyexpressedgenesthatareactuallyobservedintraining samples,areexpectedtogeneratemorebiologically rea-sonableclassifiers,becauseit isgenerallynotindividual genesbutsetsofgenesthatcollectivelydefinephenotypes suchasdrugresponses[9].WhileapplicationsofCBAand itsvariantsinbiologicalresearchhavebeenreportedin sev-eralreports[10–14],thereissofarnoreportswithdirect implicationfortoxicogenomics,whichisuniqueinthatthe numberof variables tobeanalyzed isusually farmuch greaterintoxicogenomics(morethan30,000genes)than inotherapplicationsandthisso-calledhighdimensionality makesitdifficulttoanalyzeitsdata.

To compare the predictive performances and inter-pretability of CBA and LDA, utilizing the TG-GATEs database,wherebothmicroarrayandtoxicologicaldataof morethan150compoundsinrats(invivoandin vitro) andhumans(invitro)arestored,webuiltbothCBAand LDAclassiﬁersthatpredictwhetherachemicalcompound inducesincreasesinliverweightafter14-dayrepetitive treatments in rats based on transcriptomic data of 3-dayrepetitivetreatments.Althoughmeasurableincreases in mRNA(indicative of enzymeinduction) are likely to precede, increase in liver weight is the most sensitive indicatorof hepatocellularhypertrophy and occurprior tomorphologicalchanges.Whileitshouldbealsonoted that hepatocellular hypertrophy without histological or clinicalpathologicalalterationsisconsideredtobean adap-tivenon-adversechange,certaindegreesofliverweight increaseappeared tobecorrelated withthesubsequent developmentofirreversibletoxicitysuchasﬁbrosis, necro-sis,vacuolization,fattydegeneration,andevenneoplasia [15] and early detection of hepatocellular hypertrophy basedonliverweightorgeneexpressionsisexpectedto beuseful,forexample,inselectingcompoundswithless riskofhepatotoxicityindrugdevelopment.

2. Materialandmethods

2.1. Datasource

TG-GATEs is a toxicogenomic database devel-oped by The Toxicogenomics Project (TGP), a joint

government-private sector project organized by the NationalInstituteofBiomedicalInnovation,National Insti-tuteofHealthSciencesand15pharmaceuticalcompanies in Japan, and The Toxicogenomics Informatics Project (TGP2), a follow-on project fromTGP organizedby the National Institute of Biomedical Innovation, National Institute of Health Sciences and 13 companies. Gene expression and toxicity data in vivo (rats) and in vitro (primaryculturedhepatocytesofratsandhumans)after treatmentsofmorethan150compoundsarestoredinthe TG-GATEsdatabase.TG-GATEsisnowreleasedforpublic asOpenTG-GATEs(http://toxico.nibio.go.jp).

FromtheTG-GATEsdatabase,weusedgeneexpression data(n=3pergroup)onedayafter3-dayrepetitivedoses (hereinafter4D)intheliverofratsandliverweightdata (relativeliverweightscalculatedfrombodyweights)(n=5 pergroup)onedayafter14-dayrepetitivedoses(15D)in ratsforthisstudy.Foreachcompound, onlythedataof thehighestdosegroupanditscontrolgroupwasused.Of 150compounds,weomittedonecompoundandanalyzed theremaining149compoundsbecausethatonecompound wasfoundtohavekilledanimalsbefore15Dinthestudy andthereforenodataisavailableforliverweightof15D.

2.2. CBA(classiﬁcationbasedonassociation)

2.2.1. Software

IncourtesyofDr.FransCoenen,weusedaCBAprogram availableontheLUCS-KDDwebsite,whichisimplemented accordingtotheoriginalalgorithmby[6],exceptthatCARs areﬁrstgeneratedusingtheApriori-TFPalgorithminstead oftheCBA-RGalgorithm.

2.2.2. Concept

ThebasicconceptofCBAisbrieﬂyexplainedherebased ontheexplanationsfrom[6]withexamplesinthisstudy. Fordetail,referto[6].LetDbethedataset,asetofrecords

d(d∈D).LetIbethesetofallnon-classitemsinD,andYbe thesetofclasslabelsinD.Inthisstudy,anon-classitemis apairofgeneIDanditsdiscretizedexpression(IncorDec) (Inc:increased,Dec:decreased)andaclasslabelisapair ofatargetparameter(RLW:relativeliverweight)andits discretizedvalue(IncorNI,orDecorND)(NI:notincreased, ND:notdecreased).ThesetofclasslabelsYinthisstudyis either{(RLW,Inc),(RLW,NI)}or{(RLW,Dec),(RLW,ND)}. Wesaythatarecordd∈DcontainsX⊆I,orsimplyX⊆d,if

dhasallthenon-classitemsofX.Similarly,arecordd∈D

containsy∈Y,orsimplyy⊆d,ifdhastheclasslabely.A ruleisanassociationoftheformX→y(e.g.(Gene01,Inc), (Gene02,Dec)→(RLW,Inc)).ForaruleX→y,Xiscalledan antecedentoftheruleandyiscalledaconsequenceofthe rule.AruleX→yholdsinDwithconﬁdencecifc%ofthe recordsinDthatcontainXarelabeledwithclassy.Arule

X→yhassupportsinDifs%oftherecordsinDcontainX

andarelabeledwithclassy.TheobjectivesofCBAare(1) togeneratethecompletesetofrulesthatsatisfythe user-specifiedminimumsupport(calledminsup)andminimum confidence(calledminconf)constraints,and(2)tobuilda classifierfromtheserules(classassociationrules,orCARs).

(3)

TheoriginalCBAalgorithmofLiuetal.consistsoftwo parts, a rule generator (called CBA-RG) and a classiﬁer builder(calledCBA-CB),eachcorrespondingto(1)and(2). ThekeyoperationofCBA-RGistoﬁndallrulesX→y

thathavesupportaboveminsup.Rulesthatsatisfyminsup

arecalledfrequent,whiletherestarecalledinfrequent.For alltherulesthathavethesameantecedent,therulewith thehighestconfidenceischosenasthepossible rule(PR) representingthissetofrules.Iftherearemorethanone ruleswiththesamehighestconfidence,oneruleis ran-domlyselected.Iftheconfidenceisgreaterthanminconf,

theruleis accurate.ThesetofCARsthus consistsofall

thePRsthatarebothfrequentandaccurate.TheCBA-RG

algorithmeffectivelysearchesforalltheCARsinadataset basedontheApriorialgorithm[16],assumingthe down-wardclosurepropertythatforanyX,Xisfrequentifand onlyifanysubsetxofXisfrequent.InsteadofCBA-RG,the Coenen’sCBAprogramisimplementedwiththe Apriori-TFPalgorithm[17,18],avariantoftheApriorialgorithms thatutilizesa tree-structureddatarepresentationsfora higherperformance.

Theoperationofthelatterpart,CBA-CB,isdescribedas followsin[6].“Giventworules,riandrj,rirj(alsocalled

riprecedesrjorrihasahigherprecedencethanrj)if 1.theconﬁdenceofriisgreaterthanthatofrj,or

2.theirconﬁdencesarethesame,butthesupportofriis greaterthanthatofrj,or

3.boththeconﬁdencesandsupportsofriandrj arethe same,butriisgeneratedearlierthanrj.

LetRbethesetofgeneratedrulesandDthetraining data”.CBA-CBis“tochooseasetofhighprecedencerules inRtocoverD”.Ageneratedclassiﬁerisoftheform,<r1,

r2,...,rn,defaultclass>,whereri,∈Randra,rbifb>a.In

classifyingasamplewithaunknownclasslabel,theﬁrst rulethatsatisﬁesthesamplewillclassifyit.Ifthereisno rulethatappliestothesample,ittakesonthedefaultclass,

defaultclass.Belowisasimpleexampleofclassiﬁers.

Example:

(Gene01, Inc), (Gene02, Dec)→(RLW, Inc)

(Gene01, Inc), (Gene03,Inc)→(RLW, Inc)

(NULL)→(RLW, NI)

Inthisexample.eachlinecorrespondstoaruleincluded intheclassifier.Therulewiththe(NULL)antecedentmeans thedefaultruleofthisclassifier.Whenasample,(Gene01, Inc), (Gene03, Inc) withan unknown class label (it is unknownwhetherRLWisIncorNI),isclassified,the classi-fieranswers(RLW,Inc),asthesecondrulefirstsatisfiesthe sample.Inanothercase,whereasample,(Gene01,Inc), (Gene02, Inc),isclassified,theclassifieranswers (RLW, NI),asnoneoftherulesexceptthedefaultrulesatisfies thesampleandthusthedefaultruleisapplied.

2.3. Dataanalysis

PriortotheCBAanalysis,wehavepreprocessedgene expressiondataintheliver(4D)andliverweightdata(15D) ofratsafterrepetitivedosesfor149compoundsfromthe TG-GATEsdatabase.First,geneexpressionswerecorrected andnormalizedbytheMAS5.0algorithm[19]toreduce inter-arrayvariances[20].Liverweightsweretransformed intorelativeliverweight,aratioofliverweightdividedby bodyweighttoavoidlargevariationsinbodyweight ske-wingorganweightinterpretation[15].Second,valueswere averagedoverindividualanimalsincludedineachgroup. Then,foreachcompound-treatedgroup,afoldchangewas calculatedas aratioof anaveragevalueofa treatment groupdividedby anaveragevalue ofits corresponding controlgroup,toreduceinter-studyvariances[21].Finally, wediscretizedgeneexpressionsandrelativeliverweights basedontheirfoldchanges (fc) andpvalues (p)of the student’st-test conductedbetweena compound-treated groupanditscorrespondingcontrolgroup,accordingto thecriteriashownbelow.

2.3.1. Geneexpressiondata

Iffc>2andp<0.05,assign“Inc”(increased). Iffc<0.5andp<0.05,assign“Dec”(decreased). Otherwise,assign“NC”(notchanged).

2.3.2. Liverweightdata

1.Whenaclassiﬁerforincreasedliverweightwasbuilt:If fc>1 and p<0.05, assign “Inc” (increased).Otherwise, assign“NI”(notincreased).

2.Whenaclassiﬁerfordecreasedliverweightwasbuilt:If fc<1andp<0.05,assign“Dec”(decreased).Otherwise, assign“ND”(notdecreased).

Discretization thresholds for gene expressions com-binedwithfoldchangesandstatisticaltest(e.g.student’s

t-test)haveoftenbeenappliedinmicroarraydataanalysis andisreportedtobebetterthanpvaluealone[22].In gen-eral,numericalparametersobtainedintoxicitystudiesare judgedtobeincreasedordecreased,basedessentiallyon statisticalcomparisonwithcontemporarycontrolsand,if available,additionallyonhistoricaldata[23].Inthisstudy, wediscretizedliverweightsbasedonlyonstatisticaltests, asnohistoricaldatawasavailable.

BeforeproceedingtoCBA,geneexpressionsdiscretized as “NC” in each group were discarded from the data, becausewewereinterestedonlyingeneswithincreased ordecreasedexpressions.Wethenanalyzedthedatawith CBA,withdiscretizedgeneexpressionsasnon-classitems anddiscretizedliverweightsasclasslabels.

2.4. Lineardiscriminantanalysis(LDA)

2.4.1. Software

WeusedtheldafunctionintheMASSlibraryofR.R’slda functionisimplementedbasedonRao’sLDA[24,25],also knownasFisher-RaoLDA,whichgeneralizedFisher’sLDA [26]tomultipleclasses.

(4)

2.4.2. Dataanalysis

PriortotheLDAanalysis,thedatawaspreprocessedas describedintheCBAsection,exceptthatgeneexpressions werenotdiscretized.BeforeproceedingtoLDA,the fea-tureselectionstepwasconductedtoreducethenumber ofgenes,becauseclassicalLDArequiresthetotal scatter matrixtobenonsingular,whilethematrixcanbesingular whenthesamplesize(149)doesnotexceedthenumber offeatures(genes)(morethan30,000)[27],andtendsto overﬁtand becomelessinterpretablein thepresenceof manyirrelevantand/orredundantfeatures[28].Basedon thepreviousreportsonmicroarraydataanalysis[29,30], weselectedonlythegenesthatwereup-regulated(fc>2 andp<0.05)or down-regulated(fc<0.5and p<0.05)in thegroupswithincreasedordecreasedliverweightwhen comparedtothenot-increasedornot-decreasedgroups, respectively.

2.5. Predictiveperformancecomparison

TocomparepredictiveperformancesofCBAandLDA, we conducted 10-fold cross validation [31] for each methodswiththetotalof149records(compounds),and evaluatedsensitivity, speciﬁcity, and accuracy averaged over 10 validations. These parameters are deﬁned as follows[32].

Sensitivity Truepositive/(truepositive+falsenegative) Speciﬁcity Truenegative/(truenegative+falsepositive) Accuracy (Truepositive+truenegative)/total

10-foldcrossvalidation,ormoregenerallyk-foldcross validation,isoneofthestandardmethodsforevaluating predictiveperformancesofclassifiers.Thismethoddivide adatasetintoequally-sizedkpartitions(1,2,...,k).Inthe firststep,thefirstpartition(1)isreservedasatestsetand theotherpartitions(2,3,...k)areusedasatrainingsetto buildaclassifier.Onceaclassifierisbuilt,itisvalidatedfor itspredictiveperformanceswithatestset(thefirst parti-tioninthiscase).k-Foldcrossvalidationrepeatsthisstepsk

timeschangingapartitionservingasatestsetonebyone. Intheend,averagedpredictiveperformanceoverk vali-dationstepsisregardedasthepredictiveperformanceofa classiﬁcationalgorithm.

2.6. Student’st-test

For statistical comparison of mean gene expressions orliverweightsbetweenacompound-treatedgroupand itscorrespondingcontrolgroupforeachcompound, the unpairedtwotailedstudent’st-testwithoutequalvariance assumptionwasconducted.Speciﬁcally,thisstatisticaltest wasconductedinthediscretizationstepofCBAandthe featureselectionstepofLDA.Whengeneexpressionswere comparedbetweentwogroups,geneexpressionswere log-transformedwithbaseoftwopriortothestatisticaltest. Logtransformationsofgeneexpressiondataisknownto resultinmoreconsistentstatisticalinferencesandbeoften considereddesirable,duetoitslargecoefﬁcientofvariation [33].

Itiswellknownthatthestandardp-valuemethodleads tothehighrateoffalsepositiveswhenappliedinrepeated testing.Thisisthecasewhenanalyzinggeneexpression data collectedvia microarrays, as this usually involves testing fromseveral thousands totens of thousands of hypothesessimultaneously.Whileanumberofadjustment procedures (e.g.controllingthefalse discoveryrate)are available,theyareoftentooconservativeformicroarray studiesinthattheycanleadtolowsensitivity[34],thus increasingtheriskofmissingtruepositives.Inthisstudy, noadjustmentswereapplied,takingitintoconsideration thateveniffalsepositivegeneswithnoorlittlerelevance forliverweightsweredetectedbystatisticaltests,the clas-siﬁcation methods woulddiscardmany of themfrom a generatedclassiﬁer,hencemarginalizetheimpactofsuch false positiveswhile minimizingtheriskof overlooking trueimportantchanges.

2.7. Pathwayanalysis

Canonicalpathwayanalysisforthegenesincludedin theCBA-generatedclassiﬁerwasconductedwithQIAGEN’s IngenuityPathwayAnalysis(IPA)softwaretounderstand whatpathway(andhencefunction)thesegenesaremainly involved.ThereasonwhyweusedIPA,notapublicly avail-abledatabase,isitshighqualityofinformation.IPAisbased on “expertly curated biological interactions and func-tionalannotationsfrommillionsofindividuallymodeled relationships betweenproteins, genes, complexes, cells, tissues, drugs, and diseases” and “reviewed for accu-racy byPhDscientists”(accordingtoQIAGEN’s website: http://www.ingenuity.com/products/ipa).

Canonical pathways are a set of pre-built pathways basedontheliterature.CanonicalpathwayanalysisofIPA answershowstatisticallysignificantlythepathwayswere affected,consideringhowmanymoleculesauser-specified set and a pathway share. In this study, we conducted canonicalpathwayanalysiswithallthegenesincludedin ourCBA-generatedclassifier.Incanonicalpathway analy-sis,specifiedgenesareconvertedtotheircorresponding moleculesandmatchedupagainstthemoleculesineach pathway.

2.8. Computer

Inthisstudy,weusedapersonalcomputerwithIntel Corei5-3320M2.6GHzCPUand4GBRAMfortheanalyses.

3. Results

3.1. Selectionofminimumsupportandconﬁdence

InCBA,ausermustspecifytwoparameters:minimum support (minsup) and minimum confidence (minconf). There is no universal criteria for these parameters. In this study, we assumed that lower minsup and higher confidence are basicallydesirable.Thatis tosay,a rule is considered useful, if the rule X→y satisfies a large fraction of recordsthat matches therule antecedent X, evenifthenumberofrecordsthatmatchesXissmall.This is because a drug-induced response (or more generally

(5)

Table1

ExplorationofvariousCBAsettings.

minsup(%) minconf(%) Averageaccuracy(%) Totaltime(s) (A)Whenminsupwasﬁxedat10%

10 50 77 0.61

10 80 76 0.59

10 90 79 0.58

10 100 77 0.58

(B)Whenminconfwasﬁxedat90%

20 90 0 0.42

15 90 9 0.42

10 90 79 0.58

8 90 83 22.37

7 90 Insufﬁcientmemory

AccuracyofCBAclassiﬁersforincreasedrelativeliverweightwas evalu-atedin10-foldcrossvalidationsundervariouscombinationsofminsup andminconf.

biologicalresponse)isconsideredtobenotcaused bya

single mechanism. Rather, it is expected that there are

severaldifferentmechanisms,thusdifferentgene

expres-sionpatterns,ﬁnally leadingtothetargetdrug-induced

response, and thateach gene expressionpattern occurs

in arelativelylow frequencyamong thedatasetevenif

thedataset contains anenoughrecords withthetarget

drug-inducedresponse.Ifsettoostrict,however,thereis

ariskofmissingusefulruleswithfewexceptionsfortoo

highminconfandofselectingaccidentalruleswithonly

a few satisfying recordsfor toolow minsup.Moreover,

minsupisalsolimitedbycomputationalresources,asthe

lower the minsup is set, the higher the computational

demandis,intermsofbothtimeandmemory.

Toexploretheidealsettingsofminsupand minconf,

weevaluatedaccuracyofCBAclassiﬁersforincreasedliver

weightin10-foldcrossvalidationsundervarious

combi-nationsofminsupandminconf(Table1).First,weﬁxed

theminsupat10%andchangedtheminconffrom50%to 100%.Whiletheminconfat90%markedthehighest accu-racy(79%),therewerenoobviousdifferencesortendency inaccuracyamongthedifferentminconfs.Next,wefixed the minconf at90% and changed theminsup from 20% downward. Lowering the minsup remarkablyimproved accuracy,butprolongedcomputationaltimeatthesame time.Theaccuracyreachedat83%withminsupat8%.We triedwithminsupat7%,butfailedtofinishthe computa-tionduetomemoryinsufficiency.Similartendencieswere alsoconfirmedwhenassessingaccuracyofclassifiersfor decreasedliverweightunderdifferentminsupsand min-confs(datanotshown).

Basedontheseresults,weadoptedtheminsupat8% andminconfat90%forthefollowinganalyses.

3.2. Predictiveperformance

We compared predictive performance of classifiers between CBA and LDA with 10-fold cross validation (Table2).Whenincreasedliverweightwastargeted(that is,whenaclassifierforincreasedliverweightwasbuilt), CBAoutperformedLDAinallofthethreecriteria:accuracy (83%forCBA vs.75%forLDA),sensitivity(82%vs.72%), andspecificity(85%vs.75%).Whendecreasedliverweight wastargeted, CBAscored betteraccuracy(86% vs.73%)

and sensitivity (22% vs. 6%), while LDA marked better speciﬁcity(90%vs.95%).

WealsocomparedbetweenCBAandCBA-DR(CBA with-outdefaultrule),ourmodifiedversionoftheoriginalCBA (Table2).CBA-DRdoesnotpredictifa sampledoesnot matchanyruleexceptthedefaultruleinaclassifier,and,in turn,returna‘hold’.Whenincreasedliverweightwas tar-geted,CBA-DRmarkedloweraccuracy(83%forCBAvs.79% forCBA-DR)andspecificity(85%vs.29%)andhigher sen-sitivity(82%vs.100%).Whendecreasedliverweightwas targeted,CBA-DRmarkedlowersensitivity(22%forCBA vs.0%forCBA-DR)andhigheraccuracy(86%vs.95%)and specificity(90%vs.100%).

3.3. Interpretability

Wecomparedtheformofgeneratedclassiﬁersbetween CBAandLDA(Fig.1),whenalltherecordswereusedasa trainingsetforincreasedliverweight.CBAtellsusasetof rules,arrangedinorderofconﬁdence.Eachruleconsistsof anantecedent,whichisanitemsetintheformof(non-class attribute,itsdiscretizedvalue),andaconsequenceinthe formof(classattribute,itsclasslabel),shownafter“−>” here.

Ontheotherhand,LDAtellsusasinglediscriminative function(fd),whichisapolynomialofnon-classattribute valueswiththeircoefficients.Coefficientsina discrimina-tivefunctionofLDAreflectdiscriminativepowerofeach non-classattribute(gene,here),withhigherpositivevalues andlower negativevaluesmeaning largercontributions toeachcorrespondingclasslabelofaclassattribute(liver weight,here).

3.4. Biologicalrelevance

To look into how biologically reasonable the CBA-generated classifier is, we conducted the canonical pathwayanalysisforthesetofgenesselectedinthe clas-sifierwhenalltherecordswereusedasatrainingsetfor increasedliverweight(Table3)(forbrevity,onlytop10 pathwaysinorderof−logpareshown).BecauseLDAitself, incontrasttoCBA,doesnotexplicitlyselectasetofgenes inbuildingaclassifier,wedidnotcompareCBAwithLDA here.

Wecouldassumethatthemostsigniﬁcantpathways involvedwiththegenesinourclassiﬁerweremainlydrug metabolism-relatedones,suchasXenobioticMetabolism Signaling,LPS/IL-1Mediated InhibitionofPXRFunction, PXR/RXRActivationetc.

Fig.2AisanexcerptaroundtheNRF2moleculefrom the illustration of the Xenobiotic Metabolism Signaling pathway,exportedfromIPA.NRF2isakeymodulatorof oxidativestressresponses.Inresponseofoxidativestress, NRF2isreleasedintothenucleusandup-regulates down-stream antioxidant enzymes, mainly drug metabolism enzymes.Actually,thegenesofdrugmetabolismenzymes suchasGST, NQO,and UGTdownstream of NRF2 were included in our classiﬁer, suggesting the induction of drugmetabolismenzymestriggeredbyNRF-2-dependent oxidativestressresponses.

(6)

Table2

Comparisonofpredictiveperformances.

Method Targetdirection Averageover10-foldcrossvalidation

Total TP FN FP TP Hold Accuracy(%) Sensitivity(%) Speciﬁcity(%) CBA Inc 14.9 4.4 1.1 1.4 8 – 83 82 85 LDA Inc 14.9 2.7 1 2.8 8.4 – 75 72 75 CBA-DR Inc 14.9 4.4 0 1.4 0.8 8.3 79 100 29 CBA Dec 14.9 0.2 0.7 1.4 12.6 – 86 22 90 LDA Dec 14.9 0.2 3.3 0.7 10.7 – 73 6 95 CBA-DR Dec 14.9 0 0.7 0 12.6 1.6 95 0 100 PredictiveperformanceofclassiﬁerswascomparedamongCBA,LDA,CBA-DRwith10-foldcrossvalidation.

Targetdirection:aclassiﬁerwasbuiltforwhetherincreased(Inc)ordecreased(Dec)relativeliverweight.Total:averagenumberoftotalrecordsinatest setofeachtrialinacrossvalidation.TP:averagenumberoftruepositiverecordsinatestset.FN:averagenumberoffalsenegativerecordsinatestset.FP: averagenumberoffalsepositiverecordsinatestset.TN:averagenumberoftruenegativerecordsinatestset.Hold:averagenumberofrecordsinatest setthatdidnotmatchanyrulesexceptthedefaultrule(onlyforCBA-DR).

Notethataccuracy,sensitivityandspeciﬁcityfortheCBA-DRmethodwerecalculatedexcluding‘hold’samples.Totalsarenotintegershere,asthenumber ofrecordsintheoriginaldatasetwas149andthuscannotbedividedby10,thenumberoftrialsinthecrossvalidation.

CBA

(1368121_at, Inc), (1381852_at, Inc) -> (RLW, Inc), Support: 13.4%, Confidence: 100.0% (1387022_at, Inc), (1370067_at, Inc) -> (RLW, Inc), Support: 11.4%, Confidence: 100.0% (1368905_at, Inc), (1387783_a_at, Inc) -> (RLW, Inc), Support: 10.7%, Confidence: 100.0% (1371076_at, Inc), (1370828_at, Inc) -> (RLW, Inc), Support:9.4%, Confidence: 100.0% (1371089_at, Inc), (1368905_at, Inc), (1387599_a_at, Inc) -> (RLW, Inc), Support:8.7%, Confidence: 100.0% (1368905_at, Inc), (1375845_at, Inc) -> (RLW, Inc), Support:8.7%, Confidence: 100.0% (1368905_at, Inc), (1371076_at, Inc), (1371143_at, Inc) -> (RLW, Inc), Support: 8.1%, Confidence: 100.0% (1368905_at, Inc), (1390145_at, Dec), (1387006_at, Inc) -> (RLW, Inc), Support:8.1%, Confidence: 100.0% (1371076_at, Inc), (1387307_at, Dec) -> (RLW, Inc), Support:8.1%, Confidence: 100.0% (1370698_at, Inc), (1381852_at, Inc) -> (RLW, Inc), Support: 10.7%, Confidence: 94.1% (1387022_at, Inc), (1371076_at, Inc), (1384225_at, Dec) -> (RLW, Inc), Support: 10.1%, Confidence: 93.8% (1369440_at, Dec) -> (RLW, Inc), Support: 10.1%, Confidence: 93.8% (1377599_at, Inc) -> (RLW,NI), Support:9.4%, Confidence: 93.3% (1373814_at, Dec), (1389253_at, Inc) -> (RLW, Inc), Support:8.1%, Confidence: 92.3% (1371089_at, Inc), (1368905_at, Inc), (1371942_at, Inc) -> (RLW, Inc), Support:8.1%, Confidence: 92.3%

(NULL) -> (RLW,NI)

LDA

fd = 1.0982 fc(1380013_at) + 0.6116 fc(1387740_at) + 0.4895 fc(1389253_at) + 0.4870 fc(1368317_at) +0.4471 fc(1370870_at) + 0.4202 fc(1374187_at) + 0.2830 fc(1387766_a_at) + 0.2012 fc(1390358_at) +0.1999 fc(1371076_at) + 0.1195 fc(1369759_at) + 0.1109 fc(1384431_at) + 0.0638 fc(1387936_at) +0.0317 fc(1382137_at) + 0.0292 fc(1368905_at) + 0.0126 fc(1369698_at) + 0.0081 fc(1368718_at) +0.0063 fc(1369921_at) + 0.0041 fc(1370269_at) + 0.0039 fc(1387022_at) + 0.0024 fc(1374070_at) + 0.0023 fc(1387100_at) + 0.0002 fc(1398250_at) - 0.0003 fc(1387825_at) - 0.0034 fc(1385247_at) -0.0055 fc(1370491_a_at) -0.0124 fc(1371089_at) -0.0180 fc(1380669_at) -0.0723 fc(1370902_at) -0.1127 fc(1392413_at) -0.1159 fc(1388211_s_at) -0.1487 fc(1388210_at) -0.2057 fc(1387574_at) - 0.2170 fc(1380536_at) - 0.2384 fc(1375845_at) - 0.3318 fc(1389179_at) - 0.4109 fc(1378169_at) -0.4740 fc(1395403_at) -0.6657 fc(1391544_at) + 3.4389

If fd > 0, predicts RLW as Inc. Else, RLW as NI.

Antecedent: a set of non-class items in the form of (gene_id, Inc or Dec)

Consequence: a class label as predicon result in the form of (RLW, Inc or NI)

Support and Conﬁdence of the rule

The first rule that sasfies the sample classifies it. If there is no rule sasfying the sample, the final default rule, (NULL), is applied.

In LDA, a classiﬁer is represented by a discriminave funcon, fd, which is a polynomial of non-class aribute values (fold changes) with their coeﬃcients

In CBA, a cl as si ﬁ e r is a se t of rule s (e ach li ne ).

If the discriminave funcon is posive, the classiﬁer predicts RLW as Inc. Otherwise, RLW as NI.

Fig.1. ComparisonoftheclassifierformbetweenCBAandLDA.TheformofgeneratedclassifierswerecomparedbetweenCBAandLDA,whenalltherecords wereusedasatrainingsetforincreasedrelativeliverweight.[CBA]Theclassifierconsistsofasetofrules,representedas“antecedent→consequence,support, confidence”,oneruleparline,inorderofconfidence.Anantecedentisasetofnon-classitems,eachitemrepresentedas(geneid,IncorDec).Aconsequence isaclasslabelthatisusedasaclassificationresultifthecorrespondingantecedentissatisfied,shownhereas(RLW,IncorNI).Thefinalrulewithan antecedent(NULL)isthedefaultrule,whichissatisfiedforanyrecordsandappliedifalltheprecedingrulesarenotmet.[LDA]Theclassifierisshownasa discriminativefunction,fd.fc(geneid)isafoldchangeofagenespecifiedwithgeneid.Iffdispositive,theclassifierpredictsRLWasInc.Otherwise,RLW asNI.geneid:RepresentedhereasanAffymetrixprobeID.RLW:relativeliverweight.Inc:increased.Dec:decreased.NI:notincreased.

(7)

Table3

CanonicalpathwayanalysisofCBAclassiﬁer.

PathwayName −logp Molecules CorrespondingGenes Total Inc Dec

Xenobioticmetabolismsignaling 8.96 219 8 0 Gsta3,Aldh1a1,Ugt2b1,

Nqo1,RGD1559459,Cyp2b2,Ces2c, Sult2a2

LPS/IL-1mediatedinhibitionofRXRfunction 5.07 178 4 1 Abccg8,Gsta3

PXR/RXRactivation 3.95 58 3 0 Aldh1a1,Cyp2b2,Sult2a2 Arylhydrocarbonreceptorsignaling 2.94 127 3 0 Gsta3,Aldh1a1,Nqo1 NicotineDegradationIII 2.77 37 2 0 Ugt2b1,Cyp2b2 MelatoninDegradationI 2.75 38 2 0 Ugt2b1,Cyp2b2 Serotonindegradation 2.67 42 2 0 Aldh1a1,Ugt2b1 Superpathwayofmelatonindegradation 2.67 42 2 0 Ugt2b1,Cyp2b2 NRF2-mediatedoxidativestressresponse 2.66 159 3 0 Gsta3,Akr7a3,Nqo1 NicotineDegradationII 2.65 43 2 0 Ugt2b1,Cyp2b2 HistidineDegradationIII 2 6 0 1 Hal

ThecanonicalpathwayanalysiswasconductedwiththeIngenuityIPAsoftwareforthegenesincludedintheCBAclassiﬁerwhenalltherecordswereused asatrainingsetforincreasedrelativeliverweight.Notethat,forbrevity,onlytop10pathwaysinorderof-logpareshownhere.

−logp:−logofp,wherepisavaluerepresentingstatisticalsigniﬁcanceintheanalysis.Asmallerpvalue(thusalarger−logpvalue)meansthatthepathway ismorestatisticallysigniﬁcantlyinvolved.Molecules:thetotal,increased(upregulated)numberanddecreased(downregulated)numberofmoleculesin eachpathwayareshown.Correspondinggenes:correspondingratgenesfortheincreasedordecreasedmoleculesincludedinthepathwayareshown.

Fig.2Bshowsoverlappingamongthecanonical path-ways detected as signiﬁcant, which were divided into three clusters. The largest cluster consists of drug metabolism-related pathwaysas describedabove. Inter-estingly,twootherclusters,histidinedegradation-related andgluconeogenesis-related,werealsodetectedwithno

overlapbetweenthedrugmetabolism-relatedclusterand them.

WethensummarizedAffymetrixprobeIDs,gene sym-bolsand genenamesfor eachgeneinourclassiﬁerand dividedthemintofourcategories,drugmetabolism, gluco-neogenesis,histidinedegradationandtheother(Table4),

Fig.2.CanonicalpathwayillustrationsofCBAclassifier.[A]AnexcerptaroundtheNRF2moleculefromtheillustrationoftheXenobioticMetabolism Signalingpathway,exportedfromIPA.[B]Overlappingamongthecanonicalpathwaysdetectedassignificant,whichweredividedintothreeclusters, exportedfromIPA.Eachnodecorrespondstoeachcanonicalpathwaydetectedassignificant.Eachlinkcorrespondstothenumberofmoleculesshared betweentwopathways.Colordepthofnodescorrespondstothe−logpvalue(thedeeperdepthis,thelargerthe−logpvaluesis).Linewidthoflinks correspondstothenumberofmoleculessharedbetweentwopathways(nolinemeansnosharedmoleculesbetweentwopathways).(Forinterpretation ofthereferencestocolorinthisfigurelegend,thereaderisreferredtothewebversionofthisarticle.)

(8)

Table4

DetailsandcategoryofthegenesinourCBAclassiﬁer.

AffymetrixprobeID Genesymbol Changedirection Genenameordetail Drugmetabolism

1368121at Akr7a3 Inc Aldo-ketoreductasefamily7,memberA3(aﬂatoxinaldehydereductase) 1381852at RGD1559459 Inc SimilartoExpressedsequenceAI788959(Ugt2b34,Musmusculus) 1387022at Aldh1a1 Inc Aldehydedehydrogenase1family,memberA1

1368905 at Ces2C Inc Carboxylesterase2C

1371076 at Cyp2b2 Inc CytochromeP450,family2,subfamilyb,polypeptide2 1371089at Gsta3 Inc GlutathioneS-transferasealpha3

1387599aat Nqo1 Inc NAD(P)Hdehydrogenase,quinone1

1370698at Ugt2b1 Inc UDPglucuronosyltransferase2family,polypeptideB1

1387006at Sult2a2 Inc Sulfotransferasefamily2A,dehydroepiandrosterone(DHEA)-preferring,member2 1371942at Gstt3 Inc glutathioneS-transferase,theta3

Gluconeogenesis

1370067at Me1 Inc Malicenzyme1,NADP(+)-dependent,cytosolic Histidinedegradation

1387307 at Hal Dec Histidineammonia-lyase Other

1387783aat Acaa1b Inc Acetyl-CoenzymeAacyltransferase1B 1370828at Zdhhc2 Inc Zincﬁnger,DHHC-typecontaining2 1375845at Aig1 Inc Androgen-induced1

1371143at Serpina7 Inc Serpinpeptidaseinhibitor,cladeA(alpha-1antiproteinase,antitrypsin),member7 1390145 at Dmxl2 Dec Dmx-like2

1384225at (NA) Dec (NA)

1369440at Abcg8 Dec ATP-bindingcassette,subfamilyG(WHITE),member8 1377599at Lpin1 Inc Lipin1

1373814at R3hdm2 Dec R3Hdomaincontaining2 1389253at Vnn1 Inc Vanin1

AffymetrixprobeIDs,genesymbolsandgenenamesforeachgeneinourCBAclassiﬁeraresummarized.Thegenesaredividedintofourcategories,drug metabolism,gluconeogenesis,histidinedegradationandtheother.

Changedirection:thedirectionofchange(IncorDec)intheclassiﬁer.NA:notavailable.NofurtherinformationisavailableforthegenewithAffymetrix probeID,1384225 at.

basedonthecanonicalpathwayanalysis.Of22genes,10

genesweredrugmetabolism-related.

Ourclassiﬁerwasshownagain,withgenesconverted

fromAffymetrixprobeIDs togene symbolsand colored

according to their category (Fig. 3). The mostly drug

metabolism-relatednatureofourclassifierwasconfirmed, asmostoftherulesintheclassifierincludeddrugoneor moremetabolism-relatedgenes(showninred).

4. Discussion

Whenincreasedliver weightwastargeted,CBA out-performed LDA in all of the three criteria: accuracy, sensitivity, and speciﬁcity. In contrast, when decreased liverweightwastargeted,bothCBAandLDAscoredlow sensitivities and highspeciﬁcities. Thesetendencies are attributabletothelowfrequencyofdecreasedliverweight

(Akr7a3,Inc), (RGD1559459, Inc) -> (RLW, Inc), Support: 13.4%, Confidence = 100.0% (Aldh1a1, Inc), (Me1, Inc) -> (RLW, Inc), Support: 11.4%, Confidence = 100.0% (Ces2c,Inc), (Acaa1b, Inc) -> (RLW, Inc), Support: 10.7%, Confidence = 100.0% (Cyp2b2,Inc), (Zdhhc2, Inc) -> (RLW, Inc), Support: 9.4%, Confidence = 100.0% (Gsta3,Inc), (Ces2c,Inc), (Nqo1,Inc) -> (RLW, Inc), Support: 8.7%, Confidence = 100.0% (Ces2c,Inc), (Aig1, Inc) -> (RLW, Inc), Support: 8.7%, Confidence = 100.0% (Ces2c,Inc), (Cyp2b2,Inc), (Serpina7, Inc) -> (RLW, Inc), Support: 8.1%, Confidence = 100.0% (Ces2c,Inc), (Dmxl2, Dec), (Sult2a2,Inc) -> (RLW, Inc), Support: 8.1%, Confidence = 100.0% (Cyp2b2,Inc), (Hal,Dec) -> (RLW, Inc), Support: 8.1%, Confidence = 100.0% (Ugt2b1,Inc), (RGD1559459,Inc) -> (RLW, Inc), Support: 10.7%, Confidence = 94.1% (Aldh1a1,Inc), (Cyp2b2,Inc), (1384225_at, Dec) -> (RLW, Inc), Support: 10.1%, Confidence = 93.8% (Abcg8, Dec) -> (RLW, Inc), Support: 10.1%, Confidence = 93.8% (Lpin1, Inc) -> (RLW,NI), Support: 9.4%, Confidence = 93.3% (R3hdm2, Dec), (Vnn1, Inc) -> (RLW, Inc), Support: 8.1%, Confidence = 92.3% (Gsta3,Inc), (Ces2c,Inc), (Gstt3, Inc) -> (RLW, Inc), Support: 8.1%, Confidence = 92.3%

(NULL) -> (RLW,NI)

Fig.3.OurCBAClassifierwithCategorizedGeneSymbols.TheCBAclassifier,thesameasoneinFig.1,isshownagain,withthegenesconvertedfrom AffymetrixprobeIDstogenesymbolsandcoloredaccordingtotheircategory.Red:drugmetabolism-related.Blue:gluconeogenesis-related.Green: histidinedegradation-related.Black:Other.(Forinterpretationofthereferencestocolorinthisfigurelegend,thereaderisreferredtothewebversionof thisarticle.)

(9)

inthedataset.Forsuchadataset,aclassifierreturninga negativeanswer(i.e.nofordecreasedliverweight)with a highfrequency, regardlessof predictivity,can scorea goodspecificitybutapoorsensitivity.Exceptforsuchan imbalanceddataset,CBAsucceededinbuildingabetter predictiveclassifierthanLDAinthisstudy.This superior-ityofCBAoverLDAisconsideredtoreflectthenon-linear natureofthedataset.Generally,adrug-inducedresponse (ormoregenerallybiologicalresponse)isconsideredtobe causednotbythesinglemechanism,butbyseveral dif-ferentmechanisms.Thus,thereareseveraldifferent,not necessarily linearlyseparable, gene expression patterns thatfinallyleadtothesameresponse(e.g.increasedliver weight).Inthislight,CBAislikelytobuildabetter clas-sifierforadatasetintoxicology,ormorebroadlybiology, thanLDA,asCBAcancaptureslinearlyinseparablepatterns residinginthedataset.

WealsocomparedbetweenCBAandCBA-DR,our mod-ified version of the original CBA. Whenincreased liver weightwastargeted,CBA-DRmarkedloweraccuracythan CBA.Interestinglyhowever,CBA-DRmarked100% sensi-tivity.Thiscanbesaidasfollows:ifCBAreturnsan“Inc” answerforliverweightandweknowthedefaultruleisnot appliedintheclassificationprocess,wecansaythatliver weightwouldbeincreasedwithhigherconfidencethanif wedon’tknowwhetherthedefaultruleisappliedornot.In addition,wecanalsoinferhowreliabletheclassificationis inCBAwhennon-defaultruleismet,basedonitssupport and confidence. Therefore, CBAoffersnot only a classi-ficationresult,butalsoadditionalinformation regarding reliabilityofclassification.Thiscanbeanotheradvantage ofCBAoverLDA,whichreturnsonlyaclassificationresult. Intermsofinterpretability,whilebothCBAandLDAgive usinformationregardingimportantgeneswhichcan dis-criminateincreasedliverweightswell,LDAdoesnottake theconceptofco-expressionintoaccount.For example, inoursetting,a rule(1368905at,Inc)occurred6times intheCBA-generatedclassifier.Thisrule,however,always occurredwithotherrules,reflectingthepatternactually observed inthetraining dataset.Therefore,even ifthe gene,1368905at,ishighlyincreasedinanunknown sam-ple,itdoesnotnecessarilymeanincreasedliverweight. Suchco-expressedpatternwasnottakenintoaccountby LDA.Besides,whilecoefficientvaluesareusefultoinfer importance of each gene in LDA,the final prediction is determinedbythetotalofallthetermsinapolynomial,not byasingleorsmallsetofgenes.Theclassificationprocessof CBAismuchsimplerandeasytounderstand,becauseeach ruleisassimpleasasingleorsmallsetofgenesandthe predictionisdeterminedoncearuleissatisfied,regardless oftheothergenes.ThischaracteristicofCBAmakesa gen-eratedclassifiereasytounderstand,evenforanon-expert user,becauseaCBA-generatedclassifiercanbeexpressed alsoinanaturallanguage(e.g.“IfgeneAisincreasedand geneBisdecreased,thentheclassifierpredictsliverweight tobeincrease”),notinamathematicalequationasiscase inLDA.

Canonical pathway analysis with IPA revealed that the genes included in our CBA-generated classiﬁer for increased liver weight were mostly drug metabolism-related ones.Thisisreasonable asinductionsof hepatic

drug metabolizing enzymes are well known to induce hepatocellular hypertrophy [35], of which increases in liverweightisthemostsensitiveindicator[15].CBA suc-ceededinbuildingabiologicallyrelevantclassifierwithout anypriorknowledgesuchasliterature.Intriguingly, the classifierincludedgeneswithotherfunctionssuchas glu-coneogenesis and histidine degradation, which are not directlyrelatedtoincreasedliverweightor hepatocellu-larhypertrophy.Whileitisunclearwhetherthesegenes wereactuallycausalornot,CBAcanbeusedtolookfor geneswithanunknownfunctionbuthighcorrelationfor aspecifiedoutcomeaswellastobuildabiologically rea-sonableclassifiers.Inaddition,itwasalsoconsideredtobe anadvantagethatCBAautomaticallyselectsasmallsetof genestobuildaclassifier,whileLDAdoesnot.

5. Conclusions

We applied the CBA algorithm to the TG-GATEs database,wherebothtoxicogenomicandother toxicolog-icaldataofmorethan150compoundsinratandhuman arestored,tobuildapredictiveclassiﬁerofincreasedor decreased liverweight for an unknowncompound. We comparedthegeneratedclassiﬁersbetweenCBAandLDA, andshowedthatCBAissuperiortoLDAintermsofboth predictiveperformancesandinterpretability.

Transparencydocument

TheTransparencydocumentassociatedwiththisarticle canbefoundintheonlineversion.

Acknowledgements

WewishtothankDr.FransCoenen(Universityof Liv-erpool) for kindly allowing us to use his software for ourresearch.WealsothankTakashiMatsudaandKotaro Tamura(AstellasPharmaInc.)fortheirusefuladvices.

AppendixA. Supplementarydata

Supplementary material related to this article can be found, in the online version, at doi:10.1016/ j.toxrep.2014.10.014.

References

[1]E.F.Nuwaysir,M.Bittner,J.Trent,J.C.Barrett,C.A.Afshari, Microar-raysandtoxicology:theadventoftoxicogenomics,Mol.Carcinog.24 (1999)153–159.

[2]L.Suter,L.E.Babiss,E.B.Wheeldon,Toxicogenomicsinpredictive toxicologyindrugdevelopment,Chem.Biol.11(2004)161–171. [3]J.H.Phan,C.F.Quo,M.D.Wang,Functionalgenomicsandproteomics

intheclinicalneurosciences:dataminingandbioinformatics,Prog. BrainRes.158(2006)83–108.

[4]G.Ratsch,S.Sonnenburg,C.Schafer,LearninginterpretableSVMsfor biologicalsequenceclassiﬁcation,BMCBioinf.7(Suppl.1)(2006)S9. [5]C.Apte,S.J.Hong,R.Natarajan,E.P.D.Pednault,F.Tipu,S.M.Weiss, Dataintensiveanalyticsforpredictivemodeling,IBMJ.Res.Dev.47 (2003)17–23.

[6]B.Liu,W.Hsu,Y.Ma,Integratingclassiﬁcationandassociationrule mining,in:Proc.1998Int.Conf.KnowledgeDiscoveryandData Min-ing(KDD’98),1998,pp.80–86.

[7]F.P.Pach,A.Gyenesei,J.Abonyi,Compactfuzzyassociation rule-basedclassiﬁer,ExpertSyst.Appl.34(2008)2406–2416.

(10)

[8]D.L.Sampson,T.J.Parker,Z.Upton,C.P.Hurst,Acomparisonof meth-odsforclassifyingclinicalsamplesbasedonproteomicsdata:acase studyforstatisticalandmachinelearningapproaches,PLoSOne6 (2011)e24973.

[9]A.R.Bateman,N.El-Hachem,A.H.Beck,H.J.Aerts,B.Haibe-Kains, Importanceofcollectioningenesetenrichmentanalysisofdrug responseincancercelllines,Sci.Rep.4(2014)4092.

[10]S.H.Chiu,C.C.Chen,G.F.Yuan,T.H.Lin,Associationalgorithmto minetherulesthatgovernenzymedeﬁnitionandtoclassifyprotein sequences,BMCBioinf.7(2006)304.

[11]K.Kianmehr,R.Alhajj,CARSVM:aclassassociationrule-based classi-ﬁcationframeworkanditsapplicationtogeneexpressiondata,Artif Intell.Med.44(2008)7–25.

[12]M.Tamura, P.D’Haeseleer,Microbialgenotype-phenotype map-pingby classassociationrule mining,Bioinformatics24(2008) 1523–1529.

[13]S.Dua,P.C.Kidambi,Proteinstructuralclassiﬁcationusing ortho-gonaltransformationandclass-associationrules,Int.J.DataMin. Bioinf.4(2010)175–190.

[14]R.Paul,T.Groza,J.Hunter,A.Zankl,Inferringcharacteristic pheno-typesviaclassassociationrulemininginthebonedysplasiadomain, J.Biomed.Inf.48(2014)73–83.

[15]A.P.Hall,C.R.Elcombe,J.R.Foster,T.Harada,W.Kaufmann,A. Knip-pel,K.Kuttler,D.E.Malarkey,R.R.Maronpot,A.Nishikawa,T.Nolte, A.Schulte,V.Strauss,M.J.York,Liverhypertrophy:areviewof adaptive(adverseandnon-adverse)changes—conclusionsfromthe 3rdInternationalESTPExpertWorkshop,Toxicol.Pathol.40(2012) 971–994.

[16]R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proc. 20th VLDB Conference (VLDB-94), 1994, pp. 487–499.

[17]F.Coenen,P.Leng,S.Ahmed,Datastructureforassociationrule mining:T-treesandP-trees,IEEETrans.Knowl.DataEng.16(2004) 774–778.

[18]F.Coenen,G.Goulbourne,P.Leng,Treestructuresformining associ-ationrules,DataMin.Knowl.Discovery8(2004)25–51.

[19]E.Hubbell,W.M.Liu,R.Mei,Robustestimatorsforexpression anal-ysis,Bioinformatics18(2002)1585–1592.

[20]S.Welle,A.I.Brooks,C.A.Thornton,Computationalmethodfor reduc-ingvariancewithAffymetrixmicroarrays,BMCBioinf.3(2002)23. [21]C.Cheng,K.Shen,C.Song,J.Luo,G.C.Tseng,Ratioadjustmentand

cal-ibrationschemeforgene-wisenormalizationtoenhancemicroarray inter-studyprediction,Bioinformatics25(2009)1655–1661. [22]D.J.McCarthy,G.K.Smyth,Testingsigniﬁcancerelativetoa

fold-changethresholdisaTREAT,Bioinformatics25(2009)765–771.

[23]M.F.Festing,D.G.Altman,Guidelinesforthedesignandstatistical analysisofexperimentsusinglaboratoryanimals,ILARJ.43(2002) 244–258.

[24]R.C.Rao,Theutilizationofmultiplemeasurementsinproblemsof biologicalclassiﬁcation,J.R.Stat.Soc.,Ser.B10(1948)159–203. [25]W.N. Venables, B.D. Ripley, Modern Applied Statistics with S,

Springer,NewYork,USA,2002.

[26]R.A.Fisher,Theuseofmultiplemeasurementsintaxonomic prob-lems,Ann.Eugen.7(1936)179–188.

[27]J.Ye,T.Xiong,Q.Li,R.Janardan,J.Bi,V.Cherkassky,C.Kambhamettu, Efﬁcientmodelselectionforregularizedlineardiscriminant anal-ysis,in:Proceedingsofthe15thACMInternationalConferenceon InformationandKnowledgeManagement,ACM,2006,pp.532–539. [28]Q.Gu,Z.Li,J.Han,Lineardiscriminantdimensionalityreduction,in: MachineLearningandKnowledgeDiscoveryinDatabases,Springer, Heidelberg,Germany,2011,pp.549–564.

[29]N.Kondoh,S.Ohkura,M.Arai,A.Hada,T.Ishikawa,Y.Yamazaki,M. Shindoh,M.Takahashi,Y.Kitagawa,O.Matsubara,M.Yamamoto, Geneexpressionsignaturesthatcandiscriminateoralleukoplakia subtypes and squamous cell carcinoma, Oral Oncol. 43 (2007) 455–462.

[30]W.Shi,A.Bugrim,Y.Nikolsky,T.Nikolskya,R.J.Brennan, Character-isticsofgenomicsignaturesderivedusingunivariatemethodsand mechanisticallyanchoredfunctionaldescriptorsforpredicting drug-andxenobiotic-inducednephrotoxicity,Toxicol.Mech.Methods18 (2008)267–276.

[31]C.Ambroise,G.J.McLachlan,Selectionbiasingeneextractiononthe basisofmicroarraygene-expressiondata,Proc.Natl.Acad.Sci.U.S.A. 99(2002)6562–6566.

[32]C.M. Florkowski,Sensitivity, speciﬁcity,receiver-operating char-acteristic(ROC)curvesandlikelihoodratios:communicatingthe performanceofdiagnostictests,Clin.Biochem.Rev.29(Suppl.1) (2008)S83–S87.

[33]A.D.Long,H.J.Mangalam,B.Y.Chan,L.Tolleri,G.W.Hatﬁeld,P.Baldi, ImprovedstatisticalinferencefromDNAmicroarraydatausing anal-ysisofvarianceandaBayesianstatisticalframework.Analysisof globalgeneexpressioninEscherichiacoliK12,J.Biol.Chem.276 (2001)19937–19944.

[34]Y.Pawitan,S.Michiels,S.Koscielny,A.Gusnanto,A.Ploner,False discoveryrate,sensitivityandsamplesizeformicroarraystudies, Bioinformatics21(2005)3017–3024.

[35]D.Ennulat,D.Walker,F.Clemo,M.Magid-Slav,D.Ledieu,M. Gra-ham,S.Botts,L.Boone,Effectsofhepaticdrug-metabolizingenzyme inductiononclinicalpathologyparametersinanimalsandman, Tox-icol.Pathol.38(2010)810–828.