Fully automatic classification of breast cancer microarray images

12 

Loading.... (view fulltext now)

Loading....

Loading....

Loading....

Loading....

Full text

(1)

Availableonlineatwww.sciencedirect.com

ScienceDirect

JournalofElectricalSystemsandInformationTechnology3(2016)348–359

Fully

automatic

classification

of

breast

cancer

microarray

images

Nastaran

Dehghan

Khalilabad

a

,

Hamid

Hassanpour

a,

,

Mohammad

Reza

Abbaszadegan

b aShahroodUniversityofTechnology,Shahrood,Iran

bMashhadUniversityofMedicalSciences,Mashhad,Iran

Received22October2015;receivedinrevisedform12June2016;accepted17June2016 Availableonline11August2016

Abstract

Amicroarrayimageisusedasanaccuratemethodfordiagnosisofcancerousdiseases.Theaimofthisresearchistoprovidean approachfordetectionofbreastcancertype.First,rawdataisextractedfrommicroarrayimages.Determiningtheexactlocationof eachgeneiscarriedoutusingimageprocessingtechniques.Then,bythesumofthepixelsassociatedwitheachgene,theamount of“genesexpression”isextractedasrawdata.Toidentifymoreeffectivegenes,informationgainmethodonthesetofrawdatais used.Finally,thetypeofcancercanberecognizedviaanalyzingtheobtaineddatausingadecisiontree.Theproposedapproach hasanaccuracyof95.23%indiagnosingthebreastcancertypes.

©2016ElectronicsResearchInstitute(ERI).ProductionandhostingbyElsevierB.V.ThisisanopenaccessarticleundertheCC BY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/4.0/).

Keywords:Microarray;Geneexpression;Informationgain;Breastcancer;Decisiontree

1. Introduction

AccordingtotheWorldHealthOrganization(WHO),breastcanceristhetopcanceramongwomeninbothdeveloped anddevelopingcountries.Earlydetectionandsurvivalareimportantissuestocontrolbreastcancer(Shulmanetal.,

2010).

OneofthemostimportantandaccuratemethodologiestodiagnosethediseaseisDNAanalysisofindividuals.DNA microarraysareanimportanttechnologytostudygeneexpression(Canoetal.,2009).Amicroarrayisaglassslideon whichsingle-strandedDNAmoleculesareattachedatfixedlocationscalledspots.Therecanbethousandsofspotson asinglemicroarraychip,eachspotonamicroarraychipcontainsmultipleidenticalstrandsofDNAwhichidentify onegene.

Themicroarrayimagescontainmultipleblocks(referredtosub-grid)andtheseblocksconsistofmanyspots,situated inrowsandcolumns.Eachspotisanareaintheimagewhichrepresentsthelevelofthehybridizationbetweenasingle probeandthesamples.

Correspondingauthor.

E-mailaddress:h.hassanpour@shahroodut.ac.ir(H.Hassanpour). PeerreviewundertheresponsibilityofElectronicsResearchInstitute(ERI).

http://dx.doi.org/10.1016/j.jesit.2016.06.001

2314-7172/©2016ElectronicsResearchInstitute(ERI).ProductionandhostingbyElsevierB.V.ThisisanopenaccessarticleundertheCC BY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/4.0/).

(2)

Fig.1.Usingmicroarrayimagesandimageprocessingtechniquesforbreastcancerclassification.

Hybridizationexperimentswithtwo samplesconsistof thefollowingsteps:firstly,the totalmRNA(Messenger RNA)fromcells intwodifferentconditions (forexample,healthyandcancerouscells)isdyedwithtwodifferent fluorescentlabels(commonlycy3 andcy5). Secondly,the labeledsamplesare washedoverthemicroarray. These labeledgeneproductsarehybridizedtotheircomplementarysequencesinthespots(Stekel,2003).Theintensityof eachspotsignifiesthedegreeofhybridizationofthesampletoaknowngene,therebyindicatingtheexpressionlevel oftheparticulargene.

Geneexpressionprofilingbymicroarraymethodhasbeencameoutforclassificationanddiagnosticpredictionof tumor.Imagesaretransformedintogeneexpressionmatricesinwhichrowscorrespondtogenes,andthecolumns representsamplesortrialconditions(GunavathiandPremalatha,2014).

Key steps inmicroarray analysis includeimage processingand dataanalysis (Fig. 1). Image processing isan importantstepinmicroarrayexperimentswhichincreasesaccuracyofdatamicroarrayanalysesincludingclustering. Withintheimageprocessingstep,thequantificationofgeneexpressionlevelsfrommicroarrayimagesareperformed infoursteps,includingrotation recognition,sub-griddetection,griddingandcalculatinggeneexpression.Sub-grid detectioninvolvespartitioningtheblocksintodistinctcellsintheimage,thenassigningcoordinatestoeachblocks andineachsub-gridvariousspotsareisolatedintheimage.Geneexpressiondataiscalculatedbyusingthelogratio betweenthetwointensitiesofeachdye.Consideringthat,erroratanearlystagecanleadtoincorrectresults,hence, griddingisthemostimportantstepforsubsequenttasks.

Withinthedataanalysissteps,thetwostepsforpredictingbreastcancerincludethefollowing:featureselection andtumorsclassificationmethods.Featureselectionorgeneselectionistheprocessbywhichgreatamountsofdata areanalyzedandsummarizedintousefulinformation.Thisinformationcanbeusedtocutcosts.Thegeneralfeature selectionandclassificationapplicationtoolsareusedforclassificationtasks.

Inthispaper,amostwidelyused geneselectionmethod,informationgain(IG),isadopted forextractinguseful knowledgefrommicroarraydata.IGmethodcanquicklyrejectalargenumberofnon-criticalnoiseandirrelevant genes,hence,refinesearchareaoftheoptimalsubsetofgenes(ChoandWon,2003;Wangetal.,2006).Alsoinour proposedsystem,decisiontreesaredevelopedtodiagnosetwosubclasstumor.

Thestructureofthispaperisorganizedasfollows:inSection2,thegriddingandfeatureselectionandclassification methodsarereviewed.Section3containsashortdescriptionofthebreastcancerdataset.Methodsforimageprocessing anddataanalysisarepresentedinSection4.Section5presentstheconclusionofthisresearch.

2. Relatedwork

Themaingoalofthissectionistoprovideabriefreviewondifferentbasicstepsoftheworksrelatedtogridding, geneselectionandtumorclassificationusingmicroarrayimageandmicroarraygeneexpressiondata.

ItisessentialtoefficientlyanalyzeDNAmicroarraydatabecausethenumberoffeaturesismuchlargerthanthe number of samples.Many imageprocessing,machinelearning anddata mining methodshave beenemployed to overcomethisdifficulty.

2.1. Gridding

Griddingisthemostimportanttaskintheinitialstage,ifdonecorrectly,substantiallyimprovestheefficiencyof subsequentstages.Duringtherecentyears,therehavebeenmethodsandsoftwarepackagesavailabledealingwithone

(3)

orafewproblems,themostimportantisthattheyallrequireseveralparameterstobesetbytheuser;forexamples softwaresuchasScanAlyze,GenePixPro6andImaGene(Bariamisetal.,2010;Fouadetal.,2014).

AnguloandSerra(2003)proposedanon-supervisedsetofalgorithmsforafastandaccuratespotdataextraction

fromDNAmicroarraysusingmorphologicaloperators.Inthismethod,therearedrawbacksthathavetoberesolved beforefullyautomaticgriddingcantakeplace.Asanexample,thismodelrequiresthatgridrowsandcolumnsare strictlyalignedwiththex-andy-axes.

AnapproachbeutilizedinKatzeretal.(2003)forautomaticgridsegmentationoftherawfluorescencemicroarray imagesbyMarkovrandomfield(MRF)techniques.Thismethodisnotsuitableformicroarraygriddingsinceacomplete searchisnotfeasiblefortheoptimalMRFconfiguration.

ZachariaandMaroulis(2008)utilizedasupportvectormachine(SVM) togridmicroarrayimages.Themethod

usesasetofsoft-marginSVMstoforecastthelinesofDNAmicroarraygridbymaximizingthemarginbetweenlines andspots.

Microarrayimagegriddingmethodbasedonimageprojectiontransformationandpowerspectralanalysisis dis-cussedinFengetal.(2015).Firstlyinthispaperistransformed2Dmicroarrayimageintoverticalandhorizontal1D projectionsequences,secondlyutilizedsignalprocessingmethodsoflowpassfiltering,zeromean,FFT,andpower spectralestimationbyperiodogrammethodtogetspotsarrayrowandcolumnspaninformation,andfinallyrealized themicroarrayimagegriddingaccordingtothelocalmaximaandspaninformationofspotsarray.

2.2. Geneselection

Themajorlimitationofthegeneexpressiondataisitshighdimensionwhichcontainsthousandsofgenesandavery fewsamples(GunavathiandPremalatha,2014).Anumberofresearchershaveturnedtodataminingtechnologiesand machinelearningapproachesforpredictingbreastcancer.

In1995(Wolbergetal.,1995)dataminingandmachinelearningapproacheswereembeddedintoacomputer-aided

systemfordiagnosingbreastcancer.Recent researches(WongandHsu,2008)haveshownthatasmallnumberof genesaresufficientforaccuratediagnosisofthemostdiseases,eventhoughthenumberofgenesvarygreatlybetween differentdiseases.Thusgeneselectionplaysamajorroleintheproposedsystem(Horngetal.,2009).

Someexistingmethodsproposedahybridsystemtoobtainasmallsetofhighlydiscriminativegenes;forexample theworkperformedbyMinXu(XuandSetiono,2003)havedesignedhybridmethodfromtheunivariatemaximum likelihoodmethod(LIK)andthemultivariaterecursivefeatureelimination(RFE)method.

Amostwidelyusedgeneselectionmethod,informationgain(ChoandWon,2003;Wangetal.,2006),isadopted forthepurposeofthiswork.Informationgainmeasuresthegoodnessofgeneusingthepresenceandabsencewithin thecorrespondingclass.

2.3. Classification

Classificationisthetaskoffindingthecommonpropertiesamongasetofobjectsinadatabaseandclassifyingthem intodifferentclasses(Chenetal.,1996).Intheliterature,statisticalapproacheslikeboosting,andself-organizingmap

(Golubetal.,1999),K-nearest-neighborclassification(Lietal.,2001),discriminationmethods(Dudoitetal.,2002),

decisiontree,multi-layerperceptron(Khanetal.,2001;Xuetal.,2002),leastsquareandlogisticregression(Fortand

Lambert-Lacroix,2005),NaiveBayesapproach(Fanetal.,2009)andactivelearningusingfuzzyK-nearest-neighbor

forcancerclassification(Halderetal.,2015)wereusedtogeneratetheclassifiermodelforgeneexpressiondata.Hybrid methodshavebeenrecentlytestedonmicroarraydata(LeeandLeu,2011)obtaininghighclassificationaccuracies. ForexampleMahmoudandBasma(2014)theproposedapproachisintegratedwithtwomachinelearningclassifiers; K-nearestneighbor(KNN)andsupportvectormachine(SVM);toclassifymicroarraygeneexpression.

Decisiontreeshavebeenconsideredastheclassifierinthispaper.Theperformanceoftheproposedapproachis evaluatedusinggeneexpressiondatasetviz.,Type2breastcancer(Kaoetal.,2009;Bergamaschietal.,2006).

3. Dataset

Thedatasetusedforevaluationoftheproposedmethodconsistsof71DNAmicroarrayimagesrelatedtobreast cancer,fromtheStanfordMicroarrayDatabase(Stanford,2015).TheimagesarestoredinTIFFfileswith16-bitgray

(4)

Table1

Accuracyoftheproposedmethodontherotatedimages. Angle Accuracyrotationadjustment

Image51769 Image51865 1◦ 99% 100% 0.75◦ 99% 98.67% 0.5◦ 99% 98% 0.25◦ 98% 98% −1◦ 100% 99% −0.75◦ 99.3% 98% −0.5◦ 99% 100% −0.25◦ 98.6% 98%

leveldepth.Fluorescentseparatedimageforeachsampleisdividedintotwochannels,ch1(cy5)andch2(cy3).The imagesinclude48blocksofabout870spots,atotalof41,760spotsintheimage.

Thereare22samplesrelatedtoprimarybreasttumorswhichcellsweregrownto70–80%confluence,thenharvested for totalRNA andgenomic (Kao et al., 2009) DNA and the remaining related to breasttumor specimens were derived from 49 patients withlocally advanced (T3/T4 and/or N2) breast cancer receiving either doxorubicin or fluorouracil-mitomycinbasedneoadjuvantchemotherapy(Bergamaschietal.,2006).

Thebreastcancerdatawasalreadydivided intotrainingandtestingsets.Thereare50training samplesand21 testingsamples.The21testingsetcontainsfiveclass-1andsixteenclass-2samples.Also,the50trainingsetcontains seventeenclass-1andthirtythreeclass-2samples.

4. Systemsandmethods

4.1. Rotationadjustment

Amicroarrayimageisoftenskewedduringthescanningprocess,mostlikelybecausethechipmicroarrayisnot correctlyplacedinsidethemicroarrayscanner.Rotationsoftheimagesareseenintwodifferentdirectionclockwise, withrespecttothex axis.Radontransformisappliedtofindanglesofrotationintheimage(RuedaandRezaeian, 2011).TheRadontransformg(ρ,τ),istheintegralofacontinuous2Dfunctionf(x,y)overacollectionofslantedlines

(Smith,1995):

g(ρ,τ)=  +∞

−∞ f(x,ρx+τ)dx (1)

TheRadontransformislinear–theRadontransformofaweightedsumoffunctionsisthesameweightedsum oftheindividuallyRadontransformedfunctions.Thisisanimportantpropertywhichcanbeusedtoapproximatethe transformbymeansofacomputerprogram.Also,theRadontransformofarotated,scaledortranslatedimagecanbe obtainedknowingtheRadontransformoftheoriginalimageandtheaffinetransformationparameters.Fig.2shows thedetectionofrotatedangles1.5◦and−1.5◦,andtheircorrections.

Althoughthedatasetonlyincludesmicroarrayimageswithrotationtoafewdegrees.Therefore,totesttheeffectof theRadontransform,thefirsttwoimagesareselectedwithskewangle0◦.Then,werotatedimageswith0.25,0.5,0.75 and1degreesinbothclockwiseandcounter-clockwisedirections.Table1showstheaccuracyoftheproposedmethod ontwoof therotated images.Thisshowstheeffectivenessandrobustnessoftheproposedmethodinsignificantly rotatedimages.ThealgorithmaccuracyiscalculatedasshowninEq.(2):

Accuracy= |θrealθest| θreal ×

100 (2)

4.2. Sub-griddetection

Image thresholdingisoneof the mostwidely-usedtechniquesthat hasmanyapplicationsinimageprocessing, includingsegmentation,classificationandobjectrecognition.Theproposedsub-gridmethodreferstotheprocessof

(5)

Fig.2.(a)Detectionoftherotatedangle=1.5◦.(b)Detectionoftherotatedangle=−1.5◦.(c)Correctingresult.

locatingeachsub-arraywithinamicroarrayimage(RuedaandRezaeian,2011).Theglobalparametersrequiredfor accuratelylocatingsub-arraysarewidthandheightofeachsub-arrayaswellasspacingbetweenthem.

Accordingtosub-grid,wecomputedtheroworcolumnmeansofpixelintensities,andaone-dimensionalfunction isobtained.Thegraphone-dimensionalisplotted.Itsdomainreflectsthepositionoftherows/columnsofpixels.Inthis work,thatfunctionisconsideredasahistograminwhicheachbinrepresentsonecolumn(orrow).Finally,thespace betweentheblocksiscalculatedbythelocalminimum.Thismethodcandealwithvarioustypesofnoisyimages.For example,imageswithblackregionsmeanthatsomeofthespotshavebeenlostduringthescanning.Fig.3showsthe stepsofgriddingamicroarrayimagecontaining12×4sub-grids,alongwiththecorrespondingroworcolumnsums. 4.3. Gridding

Griddingisanimportantstepwhichindicatescoordinatesfortheindividualspots.Theproposedgriddingmethod triesinitiallytoremovethelargeflarenoiseinamicroarrayimage(Fouadetal.,2013).

Microarrayimagegriddingiscomplex,asitcontainsnoiseandfeatureslowintensityandweakcontrast.Therefore, weproposedthepreprocessingstepbyapplyinghistogramequalizationfunctiontoproducethehighcontrastbetween theforeground(spots)andthebackground.Butthismethodmayincreasethecontrast ofbackgroundnoise,so we resortedtoWienerfilteringtoeliminateit(AcharyaandRay,2005).

Toremovethelargeflarenoiseatfirst,theerosionoperatorwithastructuringelementisappliedtoremoveforeground spots.Alessnoisyimageisobtainedbysubtractingresultedbackgroundfromtheoriginalimage.Thisismainlydue toremovaloflargeflarenoise.Afterthat,morphologicalopeningwithastructuringelementisappliedtoremovesmall spikesintheimage.

(6)
(7)

Fig.4.Spotdetectioninasub-gridwithremovingflarenoise. Table2

Accuracyoftheproposedmethodforgriddingimage.

Image Accuracy

Image51786 100% Image51771 98.83% Image54846 99.76%

Wesearchedaregulargridofspots,sowestartedbyconsideringthemeanintensityforeachcolumnoftheimage. Afterthat,wecomputedthemeanintensityforeach columnoftheimage(Labibetal.,2012).Toobtain themean horizontalintensityprofileMH(y)oftheimagef(x,y)(dimensionsxandy),theformulaisdefinedasfollows:

MH(y)= 1 x

x=X−1 x=0

f(x,y) (3)

Wecanusethespacingestimatetohelpdesigningafiltertoremovethebackgroundnoisefromtheintensityprofile. Hence,autocorrelationisused toenhancetheself-similarityof theprofile.Theresultpromotesestimation ofspot spacing.Thenwe extract the centroidsof thepeaks.Thesecorrespond tothe horizontalcenters ofthe spots. The midpointsbetweenadjacentpeaksprovidegridpointlocations.Theseparametersareusedtodeterminethelocations oftheverticalgridsanddrawthemontheimage.Wesimplytransposetheimageandrepeatallthestepsusedabove togetthehorizontalgridsontheimage.MicroarrayspotgriddingisshowninFig.4.

Theproposedgriddingmethodwasimplementedonnoisymicroarrayimagesofbreastcancer.Resultsobtainedin thisworkareshowninTable2.

Theaccuracyoftheappliedgriddingmethodonaspecifiedinputimagecanbecalculatedasfollows: Accuracy=  N(corretspot) N(totalspot)  ∗100 (4)

4.4. Measuringgeneexpression

Therearemajortypesof applicationof DNAmicroarraysinmedicine.Thefirstinvolvesfindingdifferencesin expressionlevelsbetweenpredefinedgroupsofsamples.Thisiscalleda“classcomparison”experiment.Oncethe microarrays have been hybridized, the resulting images are used to generate adataset. Thisdataset needs to be “preprocessed”priortotheanalysisandinterpretationoftheresults.Preprocessingisastepthatextractsorenhances

(8)

Fig.5.Displayofexpressionprofilingdataobtainedfromone cDNAarray(twochanneltechnology).(a)Spotrawintensity.(b)Spotslog transformationintwochannelsmicroarrayexperimentsusingCy3(green)andCy5(red).

meaningfuldatacharacteristicsandpreparesthedatasetfortheapplicationofdataanalysismethods.Atypicalexample ofpreprocessingistakenthelogarithmoftherawintensityvalues(Tarcaetal.,2006).

Thetotalamountof hybridizationfor aspotisproportionaltothetotalfluorescenceatthespotintwochannel microarrayexperimentsusingCy3(green)andCy5(red).Therefore,spotintensityisequaltosumofpixelintensities. Then,thedataisgenerallylog-transformed.Thelogtransformationimprovesthecharacteristicsofthedatadistribution andallowstheuseofclassicalparametricstatisticsforanalysisasshowninFig.5.Hence,reasonsforworkingwith log-transformedintensitiesandratiosinclude:(1)spreadingfeaturesmoreevenlyacrossintensityrange,(2)making variabilitymoreconstantacrossintensity range,and(3)resultinginclosetonormaldistributionofintensitiesand experimentalerrors.

Withtwo-channelarrays,theintensityvaluesofthetwocompetingsamplesareexpressedasratiosandthen log-transformed.Thelogratiobetweenthetwointensitiesofeachdyeisusedasthegeneexpressiondata(Labibetal.,

2012):

Geneexpression=log2 

int(cy5) int(cy3)



(5) whereint(cy5)andint(cy3)aretheintensitiesoftheredandgreencolors(channel1,2inimage).

Since at least hundreds of genes are put on the DNA microarray, it is so helpful that we caninvestigate the genome-wideinformationinshorttime.

5. Geneselection

Aspreviouslymentioned,thenumberoffeaturesismuchlargerthanthenumberofsamples.Mostgenesarenot relatedtotheperformanceoftheclassification.Infact,alargecollectionofgeneexpressionfeatureswillhavehigher computationalcostandslowerlearningprocess.Therearestudiessuggestingthatonlyafewgenesaresufficient.Thus, itiscrucialtorecognizewhetherasmallnumberofgenescansufficeforgoodclassification(LiandYang,2002).

InformationgaintechniqueusestheconceptofShannonentropy.GivenentropyEasameasureofimpurityinaset oftrainingsamples,itispossibletodefineameasureoftheeffectivenessofafeature/geneinclassifyingthetraining data.Thismeasureissimplytheexpectedreductioninentropycausedbypartitioningthedataaccordingtothisfeature, so-calledinformationgain(Wangetal.,2006).

AssumeagivensetofmicroarraygeneexpressiondataM,theinformationgainofageneiisdefinedas: IG(M, i)=E(M)− 

vV(i)

Mv

ME(Mv), (6)

whereV(i)isthesetofallpossiblevaluesoffeaturei,MvisthesubsetofMforwhichfeatureihasvaluev,E(M)is

theentropyoftheentireset,andE(Mv)istheentropyofthesubsetMv.TheentropyfunctionEisdefinedby

E= C  J=1 −|Cj| clog2 |Cj| c, (7)

(9)

Table3

GenesselectedthroughIGforBreastcancer.

Gen-no GeneID IGvalue Rank Genename

30216 ASCL1::PAH 0.753 1 achaete-scutecomplexhomolog1(Drosophila)::Phenylalanine hydroxylase

697 SETBP1::SETBP1 0.722 2 SETbindingprotein1::SETbindingprotein1 28275 PSCA::PSCA 0.707 3 prostatestemcellantigen::Prostatestemcellantigen

8737 PRKAR2A::PRKAR2A 0.707 4 Proteinkinase,cAMP-dependent,regulatory,typeII,alpha::protein kinase,cAMP-dependent,regulatory,typeII,alpha

1369 ELF4::ELF4 0.696 5 E74-likefactor4(etsdomaintranscriptionfactor)::E74-likefactor4 (etsdomaintranscriptionfactor)

8728 – 0.688 6 Transcribedlocus

21867 MBP::MBP 0.682 7 Myelinbasicprotein::myelinbasicprotein

20044 UROS::UROS 0.682 8 UroporphyrinogenIIIsynthase::uroporphyrinogenIIIsynthase 9679 – 0.682 9 FulllengthinsertcDNAcloneYF46C08

18945 RNF138::RNF138 0.679 10 Ringfingerprotein138::ringfingerprotein138

Theentropyissupposedtogivetheinformationrequiredinbits,andistraditionallyusedtodealwithboolean-valued features(hot/cold,true/false,etc.).Fortunately,thismethodcanbeextendedtohandlethedatawithcontinuousvalued features,forexample,microarraygeneexpressiondata.Thismethodshowsthatonlyafewgenesarehavingimportant informationaboutthediseaseandtheremaininghaveverylowamountofinformation.Regardingthestudysuggesting thatonlyfewgenesaresufficientforunderstandingtheirbiologicalrelationshipwiththetargetdiseases,hundredgenes withhigherIGvalueareselectedasinformativegenes.Table3givesthedetailofthegenesselectedusingIGforbreast cancerdataset(onlythemajortengenesweredisplayed).

5.1. Classification

Inthispaper,thedecisiontreealgorithmisusedtoclassifythedataset.Thedecisiontree(HanandKamber,2001) isoneoftheclassifyingmethods,whichclassifiesthelabeledtraineddataintoatreeorrules.

Oneofthemainadvantagesofdecisiontreesistheabilitytogenerateunderstandableknowledgestructures,i.e., hierarchicaltreesorsetsofrules,alowcomputationalcostwhenthemodelisbeingappliedtopredictorclassifynew cases,theabilitytohandlesymbolicandnumericinputvariables,provisionofaclearindicationofwhichattributes aremostimportantforpredictionorclassification(HeandHui,2009).

Inthispaper,j48treeisused.ThistreeisaslightlymodifiedversionofC4.5.ThisalgorithmisasuccessortoID3 developedbyQuinlan(1986).Thedefaultcriteria ofchoosingsplittingattributesinC4.5isinformationgainratio. Thiscriteriaisusedtoemploytheentropymeasureasanimpuritymeasure.IntreeC4.5theattributevaluesisdivided intotwo groupstohandlecontinuousattributesbased onthe selectedthresholdsuch thatall thevaluesabove the thresholdareconsideredasonechildandtheremainingasanotherchild.C4.5usesGainRatioasanattributeselection measurewhichremovesthebiasnessofinformationgainwhentherearemanyoutcomevaluesofafeature.Gainratio ofeachfeatureiscalculatedandthereaftertherootnodewillbetheattributewhosegainratioismaximum.Removing unnecessarybranchesusingpessimisticpruninginthedecisiontreeisdonetoimprovetheaccuracyofclassification. C4.5classificationtreecanbedescribedintotwoparts,learningusingtrainingdataandclassificationusingtestdata. Thestructureisintheformofstatementsincluding<if>...<then>....Oneadvantageofusinglogicalstatementsis thatitisconvenientforcodingintootherprogramsandthereforethecodecanbeeasilyreused.

Thethreeconceptsofinformationgain,entropyandgainratioaredescribedinthefollowing(Patidaretal.,2015;

AlSnousyetal.,2011).

Letcdenotesthe numberof classes,andp(S,j)the proportionof instancesinS that areassigned toj-thclass. Therefore,theentropyiscalculatedasfollows:

Entropy(S)=−

c



j=1

(10)

Fig.6.Thedecisiontreeforbreastcancerdiagnosis.

Accordingly,theinformationgainbyatrainingdatasetTisdefinedas: Gian(S,T)=Entropy−

vValue(Ts)

|Ts,v|

|Ts|

Entropy(Sv), (9)

whereValue(Ts)isthesetofvaluesofSinT,Ts isthesubsetsofTinducedbySandTs,visthesubsetofTinwhich

attributeShasavalueofv.

Therefore,theinformationgainratioofattributeSisdefinedas: GainRatio(S,T)= Gain(S,T)

SplitInfo(S,T) (10)

whereSplitInfo(S,T)iscalculatedas: SplitInfo(S,T)=−  (v=Value(Ts)) T s,v |Ts| ×logTs,v |Ts| (11) Inthedecisiontreemethod,agreatdealofstatisticalinformationissupplied,includingtruepositives(TP),false positives(FP),true negatives(TN),false negativestogether (FN).With thisamount,the classificationaccuracy is calculatedusingtheEq.(12):

Accuracy= TP+TN

TP+TN+FP+FN (12)

Fig.6showsthedecisiontreeforsubclassdiagnosisofbreastcancer.Thedecisiontreemethodwiththreegenes gives100%classificationaccuracyinthediagnosisphaseofbreastcancertype.

6. Conclusion

Inthispaper,wepresentedasystemforextractingdatafromimagemicroarrayandclassifyingrawdatarelated tobreastcancer,whichconsistsgenerallyofsixsteps.Thefirststepaimstorectifyanydetectedrotationswithinthe images.Next,imagethresholdingtechniquewasusedtodetectthecorrectnumberofindividualsub-gridscontaining 12×4sub-grids.Inthethirdstep,largeflarenoiseusingmorphologicaloperatorswereremovedinitially.Then,the meanroworcolumnofintensitiescomputedandautocorrelationwasimplementedtoenhancetheself-similarityof theprofile.Lastly,separatedlinesweredrawnforeachof thespots.Thefourthstepexplainshowgeneexpression valueswerecalculatedbyusinglogratiobetweenthetwointensitiesofeachdyeasthegeneexpressiondata.Within thefifthstep,informationgainmethodwasemployedtoreducethenumberofgeneswhensmallnumberofgeneswas sufficientforaccuratediagnosisofmostcancers.Asanextstep,decisiontreeapproachwasimplementedtoclassify breastcancermicroarraydata.

Overall,theadvantageoftheproposedsystemisthatgriddingperformancehasbeensuccessfulontheimageswhen followingconditionsaremet:lowintensity,poorqualityspots,noiseandartifacts,aswellasrotation.Theeffectsof noiseartifactsarediminishedbymorphologyoperator.Furthermore,thegeneralizationperformanceofthismethod

(11)

allowsdeterminationofthegridlinesinthepresenceofweaklyexpressedspots.Whileinmorethan98%ofthecases, thespotsresidecompletelyinsidetheirrespectivegridcells,merelyinafewimageswithnoticeablenoiseanddefects whichresultsaccuracylowerthan98%.

Intheproposedsystem,image’srawdatastoredinanExcelfile,andeachrowandcolumnrepresentedgenesand samples,respectively.Inthedataanalysisstep,theeffectivenessoftheproposedsystemhasbeendemonstratedusing breastcancerimagedatasets.Systemwiththreegenesand95.23%classificationaccuracywasevolved.

Theproposedmethodleadstoahighclassificationaccuracysystemforbreastcancerdataset.Thisapproachcan beusedasatooltoovercomemicroarraygeneexpressiondataclassificationlimitations.

Oneofthemostimportantadvantagesofthismethodistoextractdatafromimages,whilethepreviousmethods havebeenimplementedusingpublisheddatasettoclassifyvarioustypesofmicroarraydata.

References

Acharya,T.,Ray,A.K.,2005.ImageProcessing:PrinciplesandApplications.JohnWileyandSons(Chapter6).

AlSnousy,M.B.,El-Deeb,H.M.,Badran,K.,AlKhlil,I.A.,2011.Suiteofdecisiontree-basedclassificationalgorithmsoncancergeneexpression data.Egypt.Inform.J.12(2),73–82.

Angulo,J.,Serra,J.,2003.AutomaticanalysisofDNAmicroarrayimagesusingmathematicalmorphology.Bioinformatics19(5),553–562.

Bariamis,D.,Maroulis,D.,Iakovidis,D.K.,2010.UnsupervisedSVM-basedgriddingforDNAmicroarrayimages.Comput.Med.ImagingGraph. 34(6),418–425.

Bergamaschi,A.,Kim,Y.H.,Wang,P.,Sørlie,T.,Hernandez-Boussard,T.,Lonning,P.E.,Tibshirani,Børresen-Dale,A.L.,Pollack,J.R.,2006.

DistinctpatternsofDNAcopynumberalterationareassociatedwithdifferentclinicopathologicalfeaturesandgene-expressionsubtypesof breastcancerGenes.ChromosomesCancer45(11),1033–1040.

Cano,C.,Garcia,F.,Lopez,F.J.,Blanco,A.,2009.Intelligentsystemfortheanalysisofmicroarraydatausingprincipalcomponentsandestimation ofdistributionalgorithms.ExpertSyst.Appl.36(3),4654–4663.

Chen,M.S.,Han,J.,Yu,P.S.,1996.Datamining:anoverviewfromadatabaseperspective.IEEETrans.Knowl.DataEng.8(6),866–883.

Cho,S.B.,Won,H.H.,2003.MachinelearninginDNAmicroarrayanalysisforcancerclassification.In:ProceedingsoftheFirstAsia-Pacific bioinformaticsconferenceonBioinformatics,Adelaide,Australia,pp.189–198.

Dudoit,S.,Fridlyand,J.,Speed,T.P.,2002.Comparisonofdiscriminationmethodsfortheclassificationoftumorsusinggeneexpressiondata.J. Am.Stat.Assoc.97(457),77–87.

Fan,L.,Poh,K.L.,Zhou,P.,2009.AsequentialfeatureextractionapproachfornaiveBayesclassificationofmicroarraydata.ExpertSyst.Appl.36 (6),9919–9923.

Feng,Y.,Song,K.,Liu,J.,2015.Amicroarrayimagegriddingmethodbasedonprojectiontransformationandpowerspectralanalysis.In:2015 InternationalSymposiumonComputersandInformatics,Beijing,China,pp.44–51.

Fort,G.,Lambert-Lacroix,S.,2005.Classificationusingpartialleastsquareswithpenalizedlogisticregression.Bioinformatics21(7),1104–1111.

Fouad,I.A.,Mabrouk,M.S.,Sharawy,A.A.,2013.AnewmethodtogridnoisycDNAmicroarrayimagesutilizingdenoisingtechniques.Int.J. Comput.Appl.63(9),36–44.

Fouad,I.,Mabrouk,M.,Sharawy,A.,2014.AutomaticandaccuratesegmentationofgriddedcDNAmicroarrayimagesusingdifferentmethods. Adv.Comput.4(2),41–54.

Golub,T.R.,Slonim,D.K.,Tamayo,P.,Huard,C.,Gaasenbeek,M.,Mesirov,J.P.,Lander,E.S.,1999.Molecularclassificationofcancer:class discoveryandclasspredictionbygeneexpressionmonitoring.Science286(5439),531–537.

Gunavathi,C.,Premalatha,K.,2014.PerformanceanalysisofgeneticalgorithmwithkNNandSVMforfeatureselectionintumorclassification. Int.J.Comput.Electr.Autom.ControlInform.Eng.8(8),1490–1497.

Halder,A.,Dey,S.,Kumar,A.,2015.Activelearningusingfuzzyk-NNforcancerclassificationfrommicroarraygeneexpressiondata.In:Advances inCommunicationandComputing,Springer,India,pp.103–113.

Han,J.,Kamber,M.,2001.DataMining:ConceptsandTechniques.MorganKaufmannPress(Chapter9).

He,Y.,Hui,S.C.,2009.Exploringant-basedalgorithmsforgeneexpressiondataanalysis.Artif.Intell.Med.47(2),105–119.

Horng,J.T.,Wu,L.C.,Liu,B.J.,Kuo,J.L.,Kuo,W.H.,Zhang,J.J.,2009.Anexpertsystemtoclassifymicroarraygeneexpressiondatausinggene selectionbydecisiontree.ExpertSyst.Appl.36(5),9072–9081.

Kao,J.,Salari,K.,Bocanegra,M.,Choi,Y.L.,Girard,L.,Gandhi,J.,Kwei,K.L.,Hernandez-Boussard,T.,Wang,P.,Gazdar,A.F.,Minna,J.D., Pollack,J.R.,2009.Molecularprofilingofbreastcancercelllinesdefinesrelevanttumormodelsandprovidesaresourceforcancergene discovery.PLoSOne4(7).

Katzer,M.,Kummert,F.,Sagerer,G.,2003.AMarkovrandomfieldmodelofmicroarraygridding.In:18thACMSymposiumonAppliedComputing, pp.72–77.

Khan,J.,Wei,J.S.,Ringner,M.,Saal,L.H.,Ladanyi,M.,Westermann,F.,Meltzer,P.S.,2001.Classificationanddiagnosticpredictionofcancers usinggeneexpressionprofilingandartificialneuralnetworks.Nat.Med.7(6),673–679.

Labib,F.E.Z.,Fouad,I.,Mabrouk,M.,Sharawy,A.,2012.Anefficientfullyautomatedmethodforgriddingmicroarrayimages.Am.J.Biomed. Eng.2(3),115–119.

(12)

Li,W.,Yang,Y.,2002.Howmanygenesareneededforadiscriminantmicroarraydataanalysis.In:MethodsofMicroarrayDataAnalysis.Springer US,pp.137–149.

Li,L.,Weinberg,C.R.,Darden,T.A.,Pedersen,L.G.,2001.Geneselectionforsampleclassificationbasedongeneexpressiondata:studyof sensitivitytochoiceofparametersoftheGA/KNNmethod.Bioinformatics17(12),1131–1142.

Mahmoud,A.M.,Basma,A.M.,2014.Ahybridreductionapproachforenhancingcancerclassificationofmicroarraydata.Int.J.Adv.Res.Artif. Intell.3(10).

Patidar,P.,Dangra,J.,Rawar,M.K.,2015.DecisionTreeC4.5algorithmanditsenhancedapproachforEducationalDataMining.Int.J.Futuristic TrendsEng.Technol.2(2),14–24.

Quinlan,J.R.,1986.Introductionofdecisiontrees.J.Mach.Learn.1(1),81–106.

Rueda,L.,Rezaeian,I.,2011.AfullyautomaticgriddingmethodforcDNAmicroarrayimages.BMCBioinform.12(1),113.

Shulman,L.N.,Willett,W.,Sievers,A.,Knaul,F.M.,2010.Breastcancerindevelopingcountries:opportunitiesforimprovedsurvival.J.Oncol. 2010,1–6.

Smith,R.,1995.Asimpleandefficientskewdetectionalgorithmviatextrowaccumulation.In:ProceedingsoftheThirdInternationalConference onDocumentAnalysisandRecognition,IEEE,vol.2,Montreal,Canada,pp.1145–1148.

StanfordMicroarrayDatabase,http://smd.stanford.edu/(lastaccessedDecember2015). Stekel,D.,2003.MicroarrayBioinformatics.CambridgeUniversityPress.

Tarca,A.L.,Romero,R.,Draghici,S.,2006.Analysisofmicroarrayexperimentsofgeneexpressionprofiling.Am.J.Obstet.Gynecol.195(2), 373–388.

Wang,Z.,Palade,V.,Xu,Y.,2006.Neuro-fuzzyensembleapproachformicroarraycancergeneexpressiondataanalysis.In:InternationalSymposium onEvolvingFuzzySystems,IEEEComputationalIntelligenceSociety,Ambleside,UK,pp.241–246.

Wolberg,W.H.,Street,W.N.,Mangasarian,O.L.,1995.Imageanalysisandmachinelearningappliedtobreastcancerdiagnosisandprognosis.Anal. Quant.Cytol.Histol.17(2),77–87.

Wong,T.T.,Hsu,C.H.,2008.Two-stageclassificationmethodsformicroarraydata.ExpertSyst.Appl.34(1),375–383.

Xu,M.,Setiono,R.,2003.Geneselectionforcancerclassificationusingahybridofunivariateandmultivariatefeatureselectionmethods.Appl. GenomicsProteom.2,79–91.

Xu,Y.,Selaru,F.M.,Yin,J.,Zou,T.T.,Shustova,V.,Mori,Y.,Meltzer,S.J.,2002.Artificialneuralnetworksandgenefilteringdistinguishbetween globalgeneexpressionprofilesofBarrett’sesophagusandesophagealcancer.CancerRes.62(12),3493–3497.

Zacharia,E.,Maroulis,D.,2008.Anoriginalgeneticapproachtothefullyautomaticgriddingofmicroarrayimages.IEEETrans.Med.Imaging27 (6),805–813.

Figure

Updating...

Related subjects :