
This is a postprint version of the following published document:

Alvear Sandoval, R. F. and Figueiras Vidal, A. R. (2018). On building ensembles of stacked denoising auto-encoding classifiers and their further improvement. Information Fusion, 39, pp. 41-52.

DOI: https://doi.org/10.1016/j.inffus.2017.03.008

© 2017 Elsevier B.V. All rights reserved. This work is licensed under a Creative Commons license.


On building ensembles of stacked denoising auto-encoding classifiers and their further improvement

Ricardo F. Alvear-Sandoval, Aníbal R. Figueiras-Vidal

GAMMA-L+/DTSC, Universidad Carlos III de Madrid, Spain

Corresponding author: R.F. Alvear-Sandoval. E-mail address: [email protected]

Abstract

To aggregate diverse learners and to train deep architectures are the two principal avenues towards increasing the expressive capabilities of neural networks. Therefore, their combinations merit attention. In this contribution, we study how to apply some conventional diversity methods –bagging and label switching– to a general deep machine, the stacked denoising auto-encoding classifier, in order to solve a number of appropriately selected image recognition problems. The main conclusion of our work is that binarizing multi-class problems is the key to obtaining benefit from those diversity methods.

Additionally, we check that adding other kinds of performance improvement procedures, such as pre-emphasizing training samples and elastic distortion mechanisms, further increases the quality of the results. In particular, an appropriate combination of all the above methods leads us to reach a new absolute record in classifying MNIST handwritten digits.

These facts reveal that there are clear opportunities for designing more powerful classifiers by means of combining different improvement techniques.

Keywords: Augmentation; Classification; Deep; Diversity; Learning; Pre-emphasis

1. Introduction

The number of available training samples limits the expressive capability of traditional one-hidden-layer perceptrons, or shallow Multi-Layer Perceptrons (MLPs), in practical applications, in spite of their theoretically unbounded approximation capacities [1,2]. Consequently, a lot of attention is being paid to architecture and parameterization procedures that improve their performance when solving practical problems. The most relevant procedures increase the number of trainable weights following two main avenues: building ensembles of learning machines, or constructing Deep Neural Networks (DNNs).

Most of the advances in DNN design correspond to the last decade. In fact, prior to 2006 the only successfully used DNNs were the Convolutional Neural Network (CNN) classifiers [3], whose significantly simplified structure makes their training possible by means of conventional algorithms such as Back Propagation (BP). This architecture is appropriate for some kinds of applications, those in which the samples show translation-invariant characteristics, for example, image processing. But direct training of general-form DNNs remained without solution because of the appearance of vanishing or exploding derivatives [4,5]. In 2006, Hinton et al. [6] proposed an indirect design of deep classifiers that involved the stacking of Restricted Boltzmann Machines (RBMs) [7]. A top-layer task-oriented training and an overall refining complete these classifiers, which have come to be called Deep Belief Machines (DBMs). Contrastive divergence algorithms allow for the training of the DBMs without a huge computational effort [8].

Some years later, Vincent et al. [9] introduced a similar procedure consisting of an expansive, denoising auto-encoder layer-wise training plus the final top classification and refining. These are the Stacked Denoising Auto-Encoder (SDAE) classifiers. It is worth mentioning that both DBMs and SDAEs are representation machines [10], i.e., their hidden layers provide more and more sophisticated high-level feature representations of the input vectors. These representations can be useful for analysis purposes [11], and, even more, the representation process induces a disentangling of the sub-spaces in which the samples appear [12]. On the other hand, another sequentially trainable deep architecture, the Deep Stacking Networks (DSNs), was introduced in [13,14], following the idea of training shallow MLPs and adding their outputs to the input vector for training further units. Finally, a number of modifications that reduce the difficulties with the derivatives have been proposed for training DNNs directly, such as data-conscious initializations [15], Hessian-free search [16], mini-batch iterations [17], non-sigmoidal activations [18], and adding scale and location trainable parameters [19].

There are also proofs of universal approximation capabilities for DNNs [20,21], as well as of some interesting characteristics of them [22]. The analyses in [23,24] show that by adding layers to a network, it is possible to reduce the effort to establish input-output correspondences. In practice, DNNs have offered excellent performance results in many applications; therefore, one can conclude they are important in spite of the large number of parameters that need to be learned. There is no room here to give more details, so the interested reader is referred to the excellent tutorials [4,5,25] for more extensive reviews and bibliography, as well as to [26] for a bibliography of applications.

Ensembles are the second option to effectively increase the expressive capability of learning machines, including MLPs. They are built by means of training learners that consider the problem to be solved from different perspectives, i.e., under a principle of diversity, and aggregating their outputs to obtain an improved solution. We present a very concise overview of ensembles, emphasizing just the design methods that we will use in our experiments, in Section 2, for the sake of continuity in this Introduction.

Since both diversity and depth increase the expressive power of MLPs, but through very different mechanisms, it seems reasonable to expect that a combination of them would lead to an even better performance. However, there is only a moderate number of contributions along this research direction. We will briefly revise them in Section 3, but we anticipate that some difficulties appear when trying to apply the usual ensemble building methods to multi-class problems, and, consequently, most of the DNN ensembles are constructed by means of "ad hoc" procedures.

In this paper, we explore and discuss in detail how and why diversification can be applied to DNNs, as well as whether including other improvement techniques gives additional advantage. The objective is to evaluate if it is possible to get significant advantages by combining diversification and deep learning, as well as other techniques.

Of course, we have to select both DNN architectures and classification problems for our experiments and subsequent analysis. Although most of the previous studies with the databases we will use have considered CNNs, we have decided to work with a less specific architecture to exclude the possibility of obtaining conclusions only valid for this particular form of DNN and the kind of problems that are appropriate for it. So, we select SDAE classifiers, and in particular the SDAE-3 design that is introduced in [9]. However, at the same time and just to show the potential of combining diversity and depth, we will address some traditional image classification tasks –also included in [9]– that are more appropriate for CNN architectures. The selected problems for our experiments will be the well-known 10-class handwritten digit MNIST database [3]; its version with a smaller training set, MNIST-BASIC [9], in order to analyze the relevance of the weak or strong character of the SDAE-3 classifiers; and also the binary database RECTANGLES [9], with the objective of studying the origin of the difficulties for creating ensembles of multi-class DNNs. We emphasize that these selections are not arbitrary: There are many published results for MNIST, for example in [27,28], and clearly established records for representation DNNs, a 0.86% error rate [10], and for CNN ensembles [29], a 0.21% error rate. We anticipate from now that, with the help of a boosting-type training reinforcement or pre-emphasis, and a simple data augmentation besides the binarization and training diversification we will apply, we arrive at a new absolute performance record, a 0.19% error rate. We repeat that this record was not our objective; we looked for a better understanding of how to combine diversity and depth, and to avoid conclusions only valid for particular situations –using CNNs for image problems– we selected both the SDAEs and the databases. Thus, in our opinion, there is no reason to think that our conclusions are problem- or architecture-dependent.

The rest of the paper is structured as follows. In Section 2, we present brief overviews of machine ensembles, both general forms and those that come from binarizing multi-class problems. We dedicate Section 3 to listing and commenting on previously published works in designing DNN ensembles. Section 4 describes the additional techniques –pre-emphasis and data augmentation– we will use in the second part of our experiments. The experimental framework is detailed in Section 5: databases, deep learning units, diversification and binarization techniques, and pre-emphasis and data augmentation forms. The results of the experiments appear in Sections 6 and 7, following a sequential order, plus the corresponding discussions. Finally, the main conclusions of this work and some directions for further research close the paper.

2. Ensembles

To build an ensemble of diverse machines and aggregate their outputs is a way to increase expressive power. Although the first ideas on it were published half a century ago [30], they have been mainly developed along the last two decades. In the following, we briefly review some ensemble techniques, including those we will use to diversify SDAEs. We dedicate separate sub-sections to designs that introduce diversification by means of architecture or training differences, which we will call conventional ensembles, and to ensembles that come from transforming a multi-class problem into a number of binary classifications from which the resulting class can be obtained. Since a complete review of ensembles is beyond the scope of this paper, the reader is referred to the monographs [31–34], as well as to the tutorial article [35], which includes interesting perspectives on ensemble applications.

2.1. Conventional ensembles

Conventional diversification methods may be broadly classified into two categories. The first comprises those approaches that independently train a number of machines, usually with different training sets. These machines, or learners, can also have different structures. After that, the learners' outputs are aggregated –typically with simple, non-trainable procedures– to come up with the final classification. These ensembles are called committees.

Among committees, Random Forests (RFs) [36] are very popular because they offer a remarkable performance. They diversify a number of tree classifiers by means of probabilistic branching, which can be combined with sub-space projections. There are other committees that can be applied to general types of learners, requiring only that they are unstable: bagging [37] and label switching [38,39]. We will include both of them in our experiments because they are simple to implement and provide high expressive power, clearly improving the performance of a single machine. Yet we announce that the first experimental results will lead us to focus on the second.

Bagging ("bootstrap and aggregating") produces diversity by training the ensemble learners with bootstrap re-sampled versions of the original training set and, then, aggregating these learners' outputs, usually by averaging them or with a majority vote. Bootstrap is a random sampling mechanism which includes replacement to permit arbitrary sizes of the re-sampled population. Although its primitive form used bootstrapped sets of the same size as the true training set, exploring the size of these bootstrapped sets is important to find a good balance between computational effort, number of learners, and ensemble performance, because in some cases the reduction of the true samples that each learner sees can provoke losses. On the other hand, label switching changes the labels of a given portion of the training samples according to some stochastic mechanism. We will employ the simplest version, for which these changes appear purely at random. The switching rate must be explored when designing these committees.
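To make the two committee-building mechanisms concrete, the following is a minimal numpy sketch of how bagging and random label switching could be set up; train_learner is a hypothetical stand-in for fitting any unstable base classifier (an SDAE-3 in our experiments), and the aggregation shown is a simple majority vote. It illustrates the general procedure, not the exact code used in the experiments.

```python
import numpy as np

def bagging_ensemble(X, y, n_learners, resample_frac, train_learner, rng=None):
    """Train a committee on bootstrap re-sampled versions of the training set."""
    rng = np.random.default_rng(rng)
    learners = []
    n_boot = int(resample_frac * len(X))            # e.g. 0.6, 0.8, 1.0 or 1.2 times the set size
    for _ in range(n_learners):
        idx = rng.integers(0, len(X), size=n_boot)  # sampling with replacement
        learners.append(train_learner(X[idx], y[idx]))
    return learners

def switching_ensemble(X, y, n_learners, switch_rate, n_classes, train_learner, rng=None):
    """Train a committee on label-switched copies of the training set."""
    rng = np.random.default_rng(rng)
    learners = []
    for _ in range(n_learners):
        y_sw = y.copy()
        flip = rng.random(len(y)) < switch_rate     # purely random selection of samples to switch
        # replace each switched label by a different class chosen at random
        y_sw[flip] = (y[flip] + rng.integers(1, n_classes, size=int(flip.sum()))) % n_classes
        learners.append(train_learner(X, y_sw))
    return learners

def majority_vote(learners, X, n_classes):
    """Aggregate the committee by majority vote over the learners' hard decisions."""
    votes = np.stack([learner(X) for learner in learners])   # (n_learners, n_samples) integer labels
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)
```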


The second type of conventional ensembles, which we call consortia, are algorithms that train both the learners and the aggregation in a related manner. The best known method in this category is boosting, which has proven to provide excellent classification performance by combining the outputs of weak learners [40,41]. Its principle is to sequentially design and aggregate weak learners that, at each stage, pay more attention to the examples that have larger classification error according to the previously built partial ensemble. Many extensions and generalizations of the basic boosting algorithms have been proposed [34]. Among them, we mention [42,43] because they also consider the proximity of the training examples to the classification boundary, and this idea is at the origin of the formulas for the pre-emphasis techniques we will apply in this paper, which will be introduced in Section 4. But, since boosting methods require weak learners, we do not include them in our experiments to diversify SDAE classifiers.

Other consortia algorithms are those known as Negative Correlation Learning (NCL) ensembles [44,45] and the Mixtures of Experts (MoEs) [46]. Both of them could be applied to diversify DNNs, but their performance improvements are moderate for classification, and their many modifications to get more advantage [47,48] demand higher computational effort, which, in principle, is not ideal when looking for methods to diversify DNNs.

According to the above, we will adopt in our experiments bagging and label switching as conventional diversification mechanisms, although we will also apply pre-emphasis forms that are conceptually equivalent to generalized boosting weighting.

2.2. Ensembles of binary classifiers for multi-class problems

A different type of ensemble is that formed by decomposing a multi-class problem into a number of binary problems [33]. Although of different origin and constructed in a different manner than conventional ensembles, binarization techniques are true ensembles, since they consist of a collection of binary machines, each one having a different function or task that is related to the overall problem of multi-class classification, and the outputs of the binary machines are aggregated to produce a classification that is better than that of any one machine used alone.

There are two basic binarization techniques, One versus One (OvO) and One versus Rest (OvR) [49,50]. For a C-class problem, OvO is an ensemble of C(C−1)/2 binary classifiers that perform pairwise classifications of one class Cj versus another class Ck for each j ≠ k. The class that has the most votes is then determined to be the correct class. OvR is an ensemble of C binary classifiers where each classifier makes a decision between one class and the rest. The advantages of OvR are that the number of learners is smaller (obviously, C) –although it may be argued that this leads to less diversity– and that each classifier sees all of the training samples. However, its disadvantage is that the learner data sets become imbalanced, and this fact tends to produce a decrease in the performance of the individual classifiers and, consequently, of the whole ensemble. Although this can be alleviated by using carefully designed re-balancing mechanisms, we decide to use OvO to avoid the corresponding risks.
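As an illustration of the OvO scheme just described, the sketch below builds the C(C−1)/2 pairwise classifiers and aggregates them by vote counting; train_binary is a hypothetical routine that fits one binary learner on the samples of a class pair and returns a callable producing signed outputs. It is only a schematic of the mechanism, under those assumptions.

```python
import numpy as np
from itertools import combinations

def train_ovo(X, y, n_classes, train_binary):
    """One-versus-One: one binary learner per class pair (j, k), trained only on
    the samples of those two classes (targets +1 for class j, -1 for class k)."""
    pairwise = {}
    for j, k in combinations(range(n_classes), 2):    # C(C-1)/2 pairs
        mask = (y == j) | (y == k)
        targets = np.where(y[mask] == j, 1, -1)
        pairwise[(j, k)] = train_binary(X[mask], targets)
    return pairwise

def predict_ovo(pairwise, X, n_classes):
    """Each pairwise learner casts one vote per sample; the most voted class wins."""
    votes = np.zeros((len(X), n_classes), dtype=int)
    for (j, k), clf in pairwise.items():
        out = clf(X)                                   # signed outputs, the sign decides the pair
        votes[:, j] += (out > 0)
        votes[:, k] += (out <= 0)
    return votes.argmax(axis=1)
```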

Another approach to the binarization of a multi-class problem that is more effective than OvO and OvR is the Error Correcting Output Codes (ECOCs) [51]. ECOCs are typically better than OvO or OvR because a wrong decision requires a number of wrong binary classifications –see the details below– and not just a wrong majority (OvO) or a wrong highest output level (OvR). However, effective ECOC designs are very difficult for many-class problems. More details can be found in [33]. An ECOC binarization associates a binary codeword to each class, and the resulting binary problems are those represented by the columns, each dichotomy being formed by grouping the true classes according to their correspondence to 0 or 1 bits. The overall classification is performed by finding the class whose codeword is the closest to the one that is produced by the ensemble of classifiers. This implies that codewords with high Hamming distances must be selected, and, consequently, there is a clear compromise with the length of the codeword, i.e., the size of the ECOC ensemble. In our experiments, we will use the 15-bit, 10-class ECOC which was introduced in [51] just for MNIST, considering different characteristics of handwritten digits.
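The following minimal sketch illustrates ECOC decoding as described above: each column of a 0/1 code matrix defines one dichotomy, and the winning class is the codeword at minimum Hamming distance from the predicted bits. The code matrix and the trained dichotomizers are assumed to be given (for instance, the 15-bit, 10-class code of [51]); this is an illustrative sketch, not the exact implementation used here.

```python
import numpy as np

def predict_ecoc(code_matrix, dichotomizers, X):
    """ECOC decoding: each trained binary learner predicts one code bit, and the
    winning class is the codeword at minimum Hamming distance from the predicted
    bit string.

    code_matrix:   (n_classes, n_bits) array of 0/1 bits, assumed given.
    dichotomizers: one trained binary classifier per column, returning outputs in [0, 1].
    """
    n_bits = code_matrix.shape[1]
    bits = np.stack([(dichotomizers[b](X) > 0.5).astype(int) for b in range(n_bits)], axis=1)
    # Hamming distance from every predicted codeword to every class codeword
    dist = (bits[:, None, :] != code_matrix[None, :, :]).sum(axis=2)   # (n_samples, n_classes)
    return dist.argmin(axis=1)
```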

We must say that, when the number of classes is high, OvO schemes become impractical, and ECOC ensembles are difficult to design. Monograph [33] provides some details about how to do it. To consistently re-balance OvR dichotomies is a good alternative, and the same is useful for big ECOC ensembles. Since we do not deal here with such kinds of problems, we do not further discuss this subject.

3. A concise overview of previous approaches to diversify DNNs

The first CNN ensemble was presented in [52], with different image sizing as the diversification technique. Simple data augmentation mechanisms allowed a record performance for MNIST, an excellent 0.21% error rate, using again a CNN ensemble [29]. Direct averaging of classification outputs obtained from consecutive hidden layers and/or of the top output at different training epochs is proposed in [53], with moderate advantages. In GoogLeNet [54], bagging is applied together with different weight initializations to get a very deep and powerful CNN machine. In [55], trainable simple aggregation schemes are applied to CNN ensembles to further improve their performance. Optical flow obtained from consecutive video frames is used in [56] to construct CNN ensembles. Similarly, multiple triphone states are the source of diversity for speech recognition purposes in [57], using learners that include hidden Markov model units. The authors of [58] use spectral diversity [59] to train some DNNs whose outputs are aggregated, while partial training of learners with distorted data sets is the diversification applied in [60].

From our point of view, the studies carried out in [61] are specially relevant. Their authors found that there are difficulties in applying bagging to CNNs: In fact, average performances for several databases become worse than applying only random initializations. They also found that diversifying layers that are near to the final output is more effective, a fact that we also observed at the same time for SDAE-3 learners [62]. These were the starting points for the research we present in this paper.

We conclude this concise review mentioning two very recent contributions: simultaneously training aggregation weights and learners for several DNN architectures, including CNNs [63], and selectively combining CNNs that have been trained with powerful augmentation procedures [28]. Finally, let us clarify that we do not discuss here important methods such as drop-out [64] and drop-connect [65] because, although some researchers consider them as forms of diversification, we firmly support that they are probabilistic regularization methods (that subsequently produce some kind of elementary diversification), since their main effect is to reduce the possibility of overfitting by randomly limiting the active units or weights at each learning algorithm step.

An important conclusion can be extracted from the above review: To apply conventional diversification techniques to DNNs is not an easy task, and this seems to be the reason why most of the DNN ensembles are built using other diversification mechanisms, such as "ad hoc" image augmentation pre-processing, speech sequential features, or even partial training or manual selection. Since our experience indicates that the case is similar to that of shallow MLPs applied to multi-class problems, where binarization not only provides a direct advantage but also makes conventional diversification more effective, we oriented the first phase of our research to check if this is also true for DNNs, in particular, for SDAE classifiers, which we selected for the reasons we indicated in the Introduction. After obtaining good results, we decided to continue by adding another complementary procedure that also much improves the classification performance of several kinds of shallow machines (including shallow MLPs), pre-emphasis, under a general formulation we will introduce in the next section. And, finally, given the success that the application of data augmentation pre-processing has demonstrated in many experiments, we added a simple form of it –also described in Section 4– to the combination of binarization, conventional diversification, and pre-emphasis. The results have been excellent, including, as announced above, a new absolute record for the MNIST database, a 0.19% average error rate, in spite of using a non-convolutional basic learner, the SDAE-3 classifier. But we think that it is even more important to have checked that binarization helps to effectively apply conventional diversification mechanisms, as well as to have experimentally demonstrated that combining other techniques with them (namely pre-emphasis and data augmentation) further increases the ensemble performance.

4. Pre-emphasis and data augmentation

4.1. The concept of pre-emphasis and our selected formulations for it

Training sample selection techniques and their evolution, training sample weighting, also called pre-emphasis, are efficient and effective methods to improve the performance of classification machines. Conceptually, they are mechanisms that indicate to the machine the degree of attention which must be paid to each sample, by means of weighting the corresponding training cost component. If the weights are restricted to {0, 1}, we have a sample selection scheme, which was the form applied in the seminal work of Hart with nearest neighbor classifiers fifty years ago [66]. Many other forms have been subsequently proposed, among which [67–70] include interesting alternatives and discussions. Pre-emphasis methods have recently been used for selecting samples as kernel centroids [71] and also for allowing a direct training of Gaussian Process classifiers by defining an appropriate form of soft targets [72].

The weight that each sample receives has been related to its classification error or to a measure of its proximity to the classification boundary, usually according to the output of an auxiliary classifier. The effectiveness of both methods is problem-dependent [73]. Therefore, the pre-emphasis forms that are applied today include components of the two kinds. These combinations have demonstrated their effectiveness for designing boosting ensembles [42,43].

In this paper, we will adopt the general forms of weighting functions that we have previously applied just to pre-emphasize SDAE-3 classifiers with excellent results [74,75]. For binary problems, the weight for the training cost of the sample {x(n), t(n)}, t(n) ∈ {−1, 1}, is

p(x(n)) = α + (1 − α)[β(t(n) − o_a(n))² + (1 − β)(1 − (o_a(n))²)]   (1)

where o_a(n) is the output of the auxiliary classifier when x(n) is its input, and the parameters α, β, 0 ≤ α, β ≤ 1, serve to establish the appropriate proportion of no emphasis (the term α), error emphasis (the term (1 − α)β(t(n) − o_a(n))²), and proximity emphasis (the term (1 − α)(1 − β)(1 − (o_a(n))²)). In general, values for α, β can be found by means of cross-validation processes. Note that form (1) includes many particular cases: α = 0, full emphasis with the two components; β = 0, moderated (by α) emphasis according to the proximity to the boundary; β = 1, moderated emphasis according to the error; α = 0 and β = 0, full emphasis according to the proximity to the boundary; α = 0 and β = 1, full emphasis according to the error; and α = 1, no emphasis at all. Obviously, there are other possible functional forms for the error and the proximity-to-the-boundary terms. The different forms are more or less effective in a problem-dependent manner, but the performance differences are always moderate.

For the (softmax) multi-class machines we will use

p(x(n)) = α + (1 − α)[β(1 − o_ac(n))² + (1 − β)(1 − |o_ac(n) − o_ac̄(n)|)]   (2)

where o_ac(n) is the output of the auxiliary (softmax) machine corresponding to the correct class c for x(n), and o_ac̄(n) is the output of that machine whose value is the nearest to o_ac(n) among the rest of the classes.
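A direct transcription of weighting functions (1) and (2) into code is straightforward; the sketch below assumes the guide (auxiliary) outputs are already available as arrays and simply returns the per-sample weights that multiply the corresponding training cost components.

```python
import numpy as np

def preemphasis_binary(o_aux, t, alpha, beta):
    """Per-sample weights of Eq. (1), for targets t in {-1, +1} and guide outputs
    o_aux in [-1, 1] produced by the auxiliary classifier."""
    error_term = (t - o_aux) ** 2
    proximity_term = 1.0 - o_aux ** 2
    return alpha + (1.0 - alpha) * (beta * error_term + (1.0 - beta) * proximity_term)

def preemphasis_multiclass(o_correct, o_nearest, alpha, beta):
    """Per-sample weights of Eq. (2) for softmax guides: o_correct is the guide output
    for the true class c, and o_nearest is the competing output closest to it."""
    error_term = (1.0 - o_correct) ** 2
    proximity_term = 1.0 - np.abs(o_correct - o_nearest)
    return alpha + (1.0 - alpha) * (beta * error_term + (1.0 - beta) * proximity_term)
```

These weights simply scale each sample's contribution to the training cost; α and β are then explored by validation, as described in Section 6.2.1.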

We remark that the application of pre-emphasis demands an additional computational effort in designing classifiers, due to the need of validation to find appropriate values for α, β. But there is no increase in the operation –i.e., the classification of unseen samples– computational effort, since the classifier architecture remains the same. This was the reason which moved us to include pre-emphasis in a second step of experiments, after checking that the results offered by binarization and conventional diversity were satisfactory.

4.2. Data augmentation

Data augmentation methods pre-process training examples to create new samples having similar characteristics to those of the original ones. They allow increasing the number of training samples by adding the augmented versions with labels corresponding to the original samples from which they have been obtained. If carefully selected and designed, these methods are effective in order to increase the performance of classification machines.

Data augmentation has been used for two decades [76], and its forms have significantly evolved. We repeat that the MNIST classification absolute record was obtained by means of a CNN ensemble with data augmentation as the diversification source [29]. There are several data augmentation mechanisms that have shown effectiveness in improving the performance of image classifiers, such as random translations, random rotations, centering, and elastic deformations. For concise reviews, we recommend [28,77]. Given the advantage that data augmentation provides and that, as for pre-emphasis, there is not an increase of the operation computational effort, we also include a form of it, the most frequent version of elastic deformation [78], in the last step of our experiments, after combining pre-emphasis and diversity.

The basic aspects of the elastic deformation we will employ are as follows. First, there is a pixel translation, whose horizontal and vertical values are obtained by multiplying the elements of two matrices of the image size with small random values by a scale factor, which must be appropriately selected. Translations are limited to the image borders, and values that arrive at the same position are averaged. After that, a normalized Gaussian filtering is carried out with the objective of smoothing the results. The parameter σ of the filter must also be selected with care. Our selection of the scale factor and σ will be discussed in the section dedicated to the experiments.
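For illustration, the sketch below implements a common elastic-distortion pipeline in the spirit of [78], using scipy: uniform random displacement fields, Gaussian smoothing of width sigma, a scale factor, and bilinear resampling. The exact variant described above (smoothing after the translation and averaging colliding pixels) is not reproduced literally; this is a hedged approximation with the same ingredients.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_distortion(image, scale, sigma, rng=None):
    """Elastic distortion sketch: per-pixel uniform random displacements in [-0.1, 0.1],
    smoothed with a Gaussian filter of width sigma and multiplied by the scale factor,
    are used to resample the image bilinearly.

    image: 2-D array, e.g. a 28x28 digit; (scale, sigma): one of the explored pairs,
    e.g. (4, 20). The ordering of smoothing and scaling here is an assumption."""
    rng = np.random.default_rng(rng)
    h, w = image.shape
    dx = gaussian_filter(rng.uniform(-0.1, 0.1, size=(h, w)), sigma) * scale
    dy = gaussian_filter(rng.uniform(-0.1, 0.1, size=(h, w)), sigma) * scale
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([np.clip(ys + dy, 0, h - 1),    # keep displacements inside the image borders
                       np.clip(xs + dx, 0, w - 1)])
    return map_coordinates(image, coords, order=1)    # bilinear interpolation
```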

5. Experimental framework

We concentrate here the general aspects of our experiments, to avoid disorienting the readers with a disperse presentation.


5.1. Dataset

As previously announced, we will work with a much studied, relatively easy dataset of handwritten digits, MNIST [3], both because there are many published results for different machine classifiers and because we want to see if the techniques we propose in this paper produce enough improvements to get results competitive with those of CNN designs when a general-purpose basic deep network, the SDAE-3, is used. MNIST contains 50,000 / 10,000 / 10,000 training / validation / test samples, respectively, of dimension 784 (28 × 28), with 256-level quantized values normalized into [0, 1].

It seems important to analyze the effect of the strength of the learners when building an ensemble. Fortunately, there is a version of MNIST, MNIST-BASIC [9] (MNIST-B in the rest of this paper), whose only difference with MNIST is that its training / validation / test samples are 10,000 / 2000 / 50,000, respectively. Fewer training samples will imply weaker learners, and, since MNIST and MNIST-B have the same nature, this will permit an excellent perspective for the analysis we mention.

Finally, we also include a binary database because it is very relevant to appreciate the possible differences between multi-class and binary problems in our study. For reasons of similarity, we selected RECTANGLES (RECT in the following), a database with the outlines of white horizontal or vertical rectangles on a black background of just 28 × 28 pixels. The sizes of the training / validation / test sets are 10,000 / 2000 / 50,000, respectively.

5.2. The basic deep machine: the SDAE-3 classifier

Once again, we emphasize that we select a DNN without a convolutional structure because we want to check if the procedures we are introducing are effective enough to lead to high performance results using a general-purpose architecture, and CNNs can be considered "ad hoc" architectures when dealing with images –our datasets in this work– and other translational inputs, such as speech. Under that condition, there is no reason to expect that things will be different, in general, for other kinds of problems and DNNs.

We have selected the 3-layer Stacked Denoising Auto-Encoder (SDAE-3) classifier [9] both because it is a representation DNN and because it has been applied to the databases we are working with. Consequently, we can adopt the architecture and training parameters of [9] for a better appreciation of the effects of our proposed techniques.

Fig. 1 shows the structure of an SDAE-3 classifier. It consists of three expansive auto-encoding layers with sigmoidal activations, plus a final classification unit, CL, with a sigmoidal or a softmax output for binary or multi-class problems, respectively. The auto-encoding layers are sequentially trained using the input samples x(n) as targets for the noisy input vectors x(n) + r. The denoising makes the expansive architecture possible, which increases the expressive capacity, and also provides some degree of robustness. Each auto-encoding hidden layer is frozen before training the next. Finally, the top classifier is trained and the auto-encoding weights refined.
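The greedy layer-wise construction just described can be summarized with the following schematic code; train_dae and train_classifier are hypothetical helpers standing in for the per-layer denoising auto-encoder training and the top classifier training of [9], and the final joint refining step is omitted.

```python
import numpy as np

def train_sdae3(X, y, train_dae, train_classifier, noise_std, n_hidden=1000):
    """Greedy layer-wise construction of an SDAE-3 classifier (schematic).

    train_dae(X_noisy, X_clean, n_hidden) -> encoder   hypothetical helper: fits one
        expansive denoising auto-encoding layer and returns its encoding function.
    train_classifier(H, y) -> clf                      hypothetical helper: fits the
        final classification unit (CL) on the last hidden representation.
    The joint refining of all weights performed in [9] is omitted here."""
    rng = np.random.default_rng(0)
    encoders, H = [], X
    for _ in range(3):                          # three auto-encoding layers
        noisy = H + noise_std * rng.standard_normal(H.shape)
        enc = train_dae(noisy, H, n_hidden)     # the clean representation is the target
        encoders.append(enc)                    # the layer is frozen before training the next
        H = enc(H)

    clf = train_classifier(H, y)                # final classification unit (CL)

    def sdae3(X_new):
        for enc in encoders:
            X_new = enc(X_new)
        return clf(X_new)

    return sdae3
```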

The design parameters used in [9] were 1000 units for the auto-encoding layers, a final MLP classifier with one 1000-unit hidden layer, a sample-by-sample BP training with a 0.01 learning step for the first auto-encoding layer and 0.02 for the rest, the top MLP, and the refining, and 40 training epochs, which are enough for convergence. We checked these values with positive results. However, our results were better with an added noise variance of 10% of the variance of the samples, leading us to classification results (for 10 different initialization runs), in % average error rate ± standard deviation, of 1.58 ± 0.06, 3.42 ± 0.10, and 2.40 ± 0.10 for MNIST, MNIST-B, and RECT, respectively. These results are slightly worse than those of [9], but the standard deviations are lower.

Fig. 1. SDAE-3 classifier (based on stacked denoising auto-encoders). x(n): training sample; r: training noise. The weights of the layers are consecutively obtained by imposing the noise-free input samples x(n) as targets, and then they are frozen until inserting the final classifier, CL. o is the overall output.

5.3. Conventional diversification

We use bagging and switching. The determination of the sizes of the corresponding ensembles is done according to the best results for the validation set, using N = 25, 51, 101 for the number of ensemble learners. The same validation set serves to select the bootstrapping size among B = 60, 80, 100 and 120% of that of the original training set, and the switching rate among S = 10, 20, 30 and 40% of the training samples. All the committee's units are initialized with uniformly distributed random values.

5.4. Binarization techniques

According to the discussion of Section 2.2, we will apply OvO binarization and the 15-bit, 10-class ECOC which is proposed in [51] just for handwritten digits.

6. Experimental results and their discussions

6.1. First-level experiments: binarization and conventional diversity

We repeat that our preliminary experiments with the multi-class problems MNIST and MNIST-B using only bagging or switching of SDAE-3 classifiers did not provide any performance improvement, the same negative result that other studies obtained. Thus, we carried out additional experiments including both binarization and bagging or switching.

Two alternative approaches were considered. The first consists of diversifying the SDAE-3 auto-encoding parts, which we will indicate as DAE3, and, after it, for multi-class problems, constructing binarization ensembles with final classifiers for each DAE3 output. Fig. 2 shows the corresponding model. The final aggregation, FA, is a vote counting for each ensemble plus a majority vote if OvO is applied, or a vote counting plus a Hamming-distance class selection for ECOC binarization. Obviously, switching cannot be applied to build DAE3 ensembles. We will call this form of constructing ensembles and binarizing the G (Global) model.

Fig. 2. The G model for multiclass problems. DAE3_n are deep expansive denoising auto-encoders, BE_n the binarizing ensembles, and FA the final aggregation (see text for details). o is the overall output.

Fig. 3 represents the second alternative, which we call the T model because of its aspect: There is a unique DAE3, bagging or switching are applied at its output, and then OvO or ECOC binarizing ensembles of final classifiers are trained with the diverse training sets. This model is suggested by the experiences indicating that diversification at higher layers is more effective, such as those presented in [61]. The final aggregation is identical to that of the G model.
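To fix ideas, the following schematic shows how a T-model ensemble with switching could be assembled: a single trained DAE3 provides the shared representation, label switching generates the diverse training sets at its output, and each of them trains one binarizing ensemble (BE_n in Fig. 3) whose outputs are finally aggregated (FA). All helper callables are assumptions used only for illustration.

```python
import numpy as np

def build_t_model(X, y, dae3, n_learners, switch_rate, n_classes,
                  train_binarized, aggregate, rng=None):
    """Schematic T model: one trained DAE3 front end, label switching at its output,
    and one binarizing ensemble (OvO or ECOC) of final classifiers per switched set.

    dae3(X) -> H                      the single deep denoising auto-encoder (assumed trained)
    train_binarized(H, y) -> model    hypothetical helper: fits one BE_n of Fig. 3
    aggregate(models, H) -> labels    FA of Fig. 3: vote counting plus majority vote (OvO)
                                      or Hamming-distance class selection (ECOC)."""
    rng = np.random.default_rng(rng)
    H = dae3(X)                                          # shared high-level representation
    ensembles = []
    for _ in range(n_learners):
        y_sw = y.copy()
        flip = rng.random(len(y)) < switch_rate          # purely random label switching
        y_sw[flip] = (y[flip] + rng.integers(1, n_classes, size=int(flip.sum()))) % n_classes
        ensembles.append(train_binarized(H, y_sw))
    return lambda X_new: aggregate(ensembles, dae3(X_new))
```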

As we announced in Section 5.3, the values of the non-trainable parameters are selected according to the results for the validation sets among N = 25, 51, and 101 for the ensemble size, B = 60, 80, 100, and 120% for the bootstrap training set sizes, and S = 10, 20, 30, and 40% for the switching rates. No refining is carried out when training.

To reduce the huge computational load of the design of G models, we apply a frequently used simplification: We build M > N bagging units for each value of the bootstrap sample sizes, and then take N of them at random. Our experience with this trick indicates that the performance degradation (due to a loss in diversity) is very moderate.

Fig. 3. The TB model for multiclass problems (for bagging). DAE3 is a single expansive denoising deep auto-encoder, BE_n the binarizing ensembles, and FA the final aggregation (see text for details). o is the overall output.

Table 1
Part a: performance results (% average error rate ± standard deviation) for OvO binarization of multi-class problems. DAE3: single DAE with OvO binarization; GB: G model with bagging; TB: T model with bagging; TS: T model with switching. Part b: performance results for the T model with switching when replacing OvO by ECOC binarization. Validation-selected non-trainable parameter values are also indicated. Of course, RECT designs do not include binarization. Best results appear in boldface.

                              MNIST          MNIST-B        RECT
  SDAE-3 (no binarization)    1.58 ± 0.06    3.42 ± 0.10    2.40 ± 0.13
a DAE3, OvO                   1.40 ± 0.06    2.60 ± 0.08    -
  GB, OvO                     0.86 ± 0.01    1.76 ± 0.04    1.20 ± 0.04
    (N, B)                    (101, 120)     (101, 120)     (101, 120)
  TB, OvO                     0.77 ± 0.00    1.68 ± 0.04    1.19 ± 0.01
    (N, B)                    (101, 100)     (101, 120)     (101, 120)
  TS, OvO                     0.75 ± 0.00    1.67 ± 0.04    1.10 ± 0.02
    (N, S)                    (101, 40)      (101, 40)      (101, 40)
b TS, ECOC                    0.36 ± 0.02    0.75 ± 0.01    -
    (N, S)                    (101, 30)      (101, 30)

6.1.1. Results for OvO binarization and their discussion

Part a of Table 1 shows the performance results for the G and T architectures with N and B/S values selected by validation, and for a DAE3 plus an OvO binarization (without conventional diversification).

Before discussing these performance results, a few words about the selected values for the non-trainable parameters. In all the cases but one (model T with bagging), these selected values are the highest among those explored. But analysis of the performance for different pairs of values makes evident that there is a clear saturation effect appearing just for these extreme values (N = 101, B = 120%, S = 40%). This can be seen for the MNIST database in Fig. 4. Therefore, the best is just to adopt these values, because increasing them will only increase the computational load to classify unseen samples, but not the performance. It must also be remarked that the parallelism between the performance surfaces for the validation and the test sets is nearly perfect. This means that the validation process will provide good values for the non-trainable parameters.

Fig. 4. The average error rate for the validation and test sets versus N and S for the MNIST data set using the T model and OvO binarization.

From the performance results in Part a of Table 1 it can be concluded that bagging and switching diversification are effective if combined with binarization when dealing with multi-class problems (MNIST and MNIST-B), and that binarization by itself is not the main reason for these improvements, because binarizing the classifiers of a simple DAE3 only provides modest performance increases. Thus, it appears that conventional diversification with powerful classifiers, when dealing with multi-class problems, is not able to improve the corresponding classification boundaries due to the complexity of these boundaries.

It is also evident that the performance improvements are more important for model T than for model G approaches. This seems to indicate that the auto-encoding layers of the SDAE-3 classifiers carry out their disentangling function in an effective and efficient manner. Consequently, model T designs must be preferred, because their computational design and operation costs are lower. On the other hand, there are no significant differences between bagging and switching model T results.

To close this discussion, let us remark that the performance increases by an important amount: The average error rates for the switching model T cases are 47%, 49%, and 46% of those of the single SDAE-3 classifiers for MNIST, MNIST-B, and RECT, respectively, and the typical deviations become lower. Note that in this case there are no qualitative differences between the multi-class problems and RECT, nor between MNIST and MNIST-B ('strong' and 'weak' learners).

6.1.2. Results for the ECOC binarization and their discussion

According to the conclusions of the OvO binarization experiments, we will restrict our ECOC design to model T forms and just one of the conventional diversification techniques, switching. The explored non-trainable parameter values are N = 21, 51, 101, and 121, and S = 10, 20, 30, and 40%.

Part b of Table 1 shows the experimental results. The performance figures are much better than those with OvO binarization, which confirms the advantage of well-designed ECOC binarization procedures. An average error rate of 0.36% for MNIST is a very good result using non-convolutional deep machines.

Of course, all these improvements do not come for free: Computational costs are much more important than those of a single SDAE-3, both for training and for classifying unseen samples, or operation. Considering the ECOC T case for the MNIST problem, a direct count gives a total of around 1.5 × 10^9 weights and around 1.5 × 10^6 sigmoids, which are approximately 4 × 10^2 times the numbers for a single SDAE-3 classifier. This is a reasonable estimate of the operation computational load increase. And a detailed accounting of the operations per training step for the diversified and binarized final classifiers compared with those required for a single SDAE-3 gives a similar order of magnitude for the load increase: Even considering that these classifiers will become trained in fewer steps, this is also a large computational effort. But, as when dealing with shallow machines, this is the price to be paid to get advantage from binarization and conventional diversification. Obviously, to justify their application, the problem to be solved has to show a high overall misclassification cost.

6.2. Including pre-emphasis

As an introduction, we will summarize the results of applying only pre-emphasis to SDAE-3 classifiers [74,75] before adding this technique to binarization and conventional diversity. This will serve to appreciate if the combination is better than all of its components.

6.2.1. Pre-emphasizing SDAE-3 classifiers

We adopt the same SDAE-3 classifier architecture and parameters as above, and we apply the multi-class or binary pre-emphasis formulas (1), (2). The parameters α, β are explored in 0.1 steps along the interval [0, 1].

With respect to the auxiliary classifiers, or guides, our experience working with shallow classifiers is that pre-emphasis effects are better when using better guides. This can be expected, because better auxiliary classifiers provide better classification results and, consequently, more appropriate pre-emphasis weights. Additionally, pre-emphasis effects are also better if the guide architecture is similar to that of the pre-emphasized machine. This is also reasonable, because both classifiers are able to construct similar boundaries, and this fact implies that more benefit can be obtained from the pre-emphasis process. However, we considered it appropriate to check if these findings can be extended to deep classifiers. So, we applied two kinds of auxiliary machines: one-hidden-layer MLPs with 1000 units, and the single SDAE-3 classifier without pre-emphasis.

On the other hand, when pre-emphasizing SDAE-3 classifiers, it is unclear whether it will be better to apply the pre-emphasis to both the auto-encoding layers and the final classification, or only to the final classification step. We tried both alternatives, and here we will indicate them as "initial" and "final" designs.


Table 2
Test error rate (in %) ± standard deviation for the three databases using pre-emphasized SDAE-3 classifiers. The values for α, β obtained by validation are given in parentheses. A star (*) indicates suboptimal results (see the text for a discussion). Best results appear in boldface.

Pre-emphasis    Aux. classifier   MNIST                       MNIST-B                   RECT
None (MLP)      –                 2.66 ± 0.10                 4.44 ± 0.23               7.20 ± 0.15
None (SDAE-3)   –                 1.58 ± 0.06                 3.42 ± 0.10               2.40 ± 0.13
Initial         MLP               0.40 ± 0.04 * (0.3, 0.6)    0.82 ± 0.01 (0.3, 0.5)    0.92 ± 0.10 (0.4, 0.4)
Initial         SDAE-3            0.37 ± 0.01 (0.4, 0.5)      0.72 ± 0.01 (0.3, 0.5)    0.87 ± 0.04 (0.4, 0.3)
Final           MLP               0.57 ± 0.00 (0.4, 0.6)      0.91 ± 0.03 (0.3, 0.5)    1.26 ± 0.04 (0.6, 0.3)
Final           SDAE-3            0.67 ± 0.05 * (0.4, 0.4)    0.83 ± 0.02 (0.3, 0.5)    1.31 ± 0.02 * (0.4, 0.3)

Table 2 shows the results of the corresponding experiments. It can be seen that any of the four pre-emphasis mechanisms improves the performance of the conventional SDAE-3 classifiers, but also that the initial pre-emphasis with the SDAE-3 classifier guide is clearly the best option. This confirms our expectations, and also reveals that it is important to apply the pre-emphasis even to the auto-encoding process. This result can be easily explained: A better representation of the training samples that are more important to define the classification boundaries produces performance benefits.

As in the diversification process, there appears a clear parallelism between the performance surfaces vs. α, β for the validation and the test sets. However, these surfaces are not so smooth, and this produces some sub-optimal results –other pairs of α, β values would offer better test performance– that are indicated by asterisks. Differences are very minor, the most relevant case being the design with the SDAE-3 classifier guide and initial emphasis for MNIST: α = 0.3, β = 0.5 would give a 0.36% average error rate with a (practically) zero standard deviation. Clearly, a very minor change with respect to the validated result.

We can say that the results are very good. In particular, 0.37, 0.72, and 0.87% average error rates for MNIST, MNIST-B, and RECT, respectively, with a single SDAE-3 classifier are excellent: It must be considered that the operation computational load does not increase, and that the design requires only 11 × 11 = 121 times the effort of a conventional single SDAE-3 classifier, due to the validation of α, β. Note that the above error rates are 23%, 21%, and 36% of those of the single SDAE-3 classifiers without pre-emphasis for MNIST, MNIST-B, and RECT, respectively. The higher improvements for the multi-class problems are easy to understand: Helping to find better classification boundaries is more important when there are multiple boundaries.

Let us remark on the critical importance of using general and flexible enough weighting formulas such as (1) and (2): All the validation-selected pairs have values of α, β far from 0 and 1, which correspond to more limited emphasis forms. To show the degradation that limited weighting schemes produce, it suffices to list their performance for the MNIST dataset with an SDAE-3 classifier guide and initial emphasis (% average error rates):

α = 1 (no emphasis): 1.58 (that of the conventional SDAE-3 classifier)
α = 0, β = 1 (full error emphasis): 0.85
α = 0, β = 0 (full proximity emphasis): 0.71
α = 0, β = 0.6 (full combined emphasis): 0.58
α = 0.2, β = 1 (moderated error emphasis): 0.77
α = 0.2, β = 0 (moderated proximity emphasis): 0.58

Table 3
Performance results (% average error rate ± standard deviation) for the two ensembles that are trained with initial pre-emphasis. PrE T(ECOC) S corresponds to the best ensemble of Section 6.1, and PrE ECOC S corresponds to an ensemble whose first diversity is the ECOC binarization. The validated values of α, β are also shown for the first design. Best results appear in boldface.

                  MNIST           MNIST-B         RECT (no binarization)
PrE T(ECOC) S     0.30 ± 0.01     0.62 ± 0.01     0.76 ± 0.02
  (α, β)          (0.2, 0.4)      (0.2, 0.6)      (0.3, 0.4)
PrE ECOC S        0.26 ± 0.04     0.55 ± 0.04     –

All the above performances are clearly worse than the average error rate which α = 0.4, β = 0.5 offers, 0.37%.

6.2.2. Pre-emphasizing binarized and diversified SDAE-3 classifiers

For the sake of brevity, we will present here the results of applying initial pre-emphasis with the conventional SDAE-3 classifier guide only for the best ensembles of those presented in Section 6.1, TS with ECOC binarization (simply TS for RECT) –results for the other cases are worse– and, in the case of multi-class problems, for an alternative which permits applying a different pre-emphasis to each ECOC binary problem: First, the ECOC is applied, and then the DAE3 auto-encoders plus the final switching ensembles complete each branch, and the values of α, β are separately selected for those branches. The final aggregation is the same voting plus Hamming-distance-based selection. We will also keep the non-trainable parameters previously used, since there is no significant sensitivity with respect to them. We will indicate these pre-emphasized ensembles as PrE T(ECOC) S and PrE ECOC-S, respectively.

The experimental results appear in Table 3.

Comparing the first row of these results with those of the last row of Table 1 and the fourth row of Table 2 makes evident that combining pre-emphasis and diversity (including binarization for multi-class problems) produces significant improvements in performance. And the results that appear in the last row of Table 3 indicate that separate pre-emphasis is even more effective at increasing the classification performance. Obviously, separate pre-emphases require more design computational effort, because their α, β parameters must be independently selected by means of the corresponding validation processes. However, the operation computational load for both pre-emphasized ensembles is of the same order of magnitude, around 1.5 × 10^9 multiplications.

Once more, we insist: Such a huge computational effort is the price to be paid to get the exceptional performances of these ensembles of SDAE-3 classification machines in the image problems we consider in our experiments, which are, for the PrE ECOC-S model, 16% and 16% of the error rates of a conventional single SDAE-3 classifier for MNIST and MNIST-B, respectively.

Although the above results are excellent –note that we get an error rate for MNIST which is only 1.24 times the absolute record, which was reached with a CNN ensemble [29]– the question of what the limits of these kinds of approaches with general-purpose, computationally expensive DNNs are emerges. Direct attempts at improving the performance, such as a second validation round with a finer grid of values for α, β, are not the solution: A two-digit second-round search for the PrE ECOC S ensemble leads to "only" 0.24 ± 0.08 and 0.52 ± 0.06 average error rates ± standard deviation for MNIST and MNIST-B, respectively. Yet the answer is to check if other improvement mechanisms can be combined with those we have already applied, as the success of adding pre-emphasis to binarization and conventional diversification suggests.

Table 4
% average error rate ± standard deviation for the PrE ECOC S ensemble when training samples are augmented by using the Elastic Distortion (ED) process described in the main text.

                  MNIST          MNIST-B
ED PrE ECOC S     0.19 ± 0.01    0.50 ± 0.03

In the next section, we will explore the additional application of a data augmentation technique, elastic deformations.

7. Adding elastic deformations

Data augmentation has been a traditional complement of handwritten digit classification algorithms [76], leading to very good performances [28], in particular when applied for building ensembles [29]. Therefore, adding it to the above designs is an interesting possibility.

We presented an overview of the procedure we will apply here, an elastic distortion [78], in Section 4.2. We work with diverse distortions, selecting five combinations of parameters that produce visually acceptable results: {scale, σ} = {3, 30}, {4, 20}, {4, 30}, {5, 10}, and {5, 20}, where the scale parameter is applied to horizontal and vertical displacement matrices that are formed by [−0.1, 0.1] uniform random values. We explore the appropriate generation rate (of distorted samples with respect to original examples) among 100, 200, 300, and 400% of the number of training samples, finding that 300% is good for most of the designs. Since we are including distorted random samples, we also explore the added noise level around the 10% value which was previously applied, selecting 7% as the appropriate variance.

Keeping the rest of the non-trainable parameters at their values, except α, β, which are validated as before, we obtain the % error rates ± standard deviations of Table 4 when adding the above elastic distortions for training the PrE ECOC S designs. As before, 10 runs are considered.

It can be seen that there are significant further performance improvements, and this supports the intuition that combining the elastic distortion with the previously applied pre-emphasis plus binarization and conventional diversification is effective. We remark that the performance for MNIST is a new absolute record, superior to the performance of the CNN ensembles of [29], even though CNNs are machines more adapted to the handwritten digit recognition task. These facts permit us to conjecture that combining diversity with other improvement mechanisms having different natures will very likely provide performance advantages when working with DNNs.

To appreciate the effectiveness of our record design, Fig. 5 shows the wrongly classified digits for a typical run. It is easy to see that these are difficult samples even for a human expert.

Once again, a high increase of computational costs is the price to be paid in order to obtain these important performance improvements. In particular, the record design has an architecture that differs from the ECOC T considered in Section 6.1.2 only in including 15 SDAE-3, one for each ECOC binary problem. Since the multiplications and sigmoidal transformations in these SDAE-3 are two orders of magnitude fewer than those of the conventionally diversified final classifiers, we have again around 1.5 × 10^9 multiplications and 1.5 × 10^6 sigmoidal transformations for the operation computational cost. We remark that both pre-emphasis and elastic deformations increase the design effort, but they do not modify the size of the designed machine and, consequently, they do not change the operation computational load.

Fig. 5. Misclassified MNIST digits in a typical run of the ED PrE ECOC S design. k → l: true class → classification result.

8. Conclusions and further work

Applying a general-purpose representation basic DNN, the SDAE-3 classifier, to a selected number of datasets (MNIST, MNIST-BASIC, and RECTANGLES) that are relatively simple but that we chose to allow appreciating the effects of their different characteristics, we have found the following experimental facts:

• Building diverse ensembles (by means of bagging and switching, in particular) is efficient to increase classification performance, but multi-class problems require the simultaneous application of binarization techniques, which, by themselves, only produce modest advantages.

• Conventional diversification is more effective when applied at the last classification steps.

• Applying pre-emphasis sample weighting is also effective, in particular when the weighting formulas are general and flexible. These general and flexible forms produce very important performance improvements, without increasing the computational load to classify unseen samples.

• Combining pre-emphasis with binarization and conventional diversity further improves the performance results, in a very remarkable manner when a different pre-emphasis is applied to each binary problem in multi-class situations.

• Adding to the above combination an elastic distortion process to create appropriate additional training samples produces once more an increased performance, by no means trivial: This way has led us to a new absolute record in classifying MNIST digits.

• All the above techniques are useful when they are combined, if the appropriate forms are selected.

Of course, much more work is necessary to check what advantage these methods and their combinations provide when addressing other classification problems and/or using other DNN classifiers, and we are actively working in this direction. We advise that the CNN is a delicate architecture which poses serious difficulties to conventional methods of building ensembles but, even if this approach is not successful, there are many additional possibilities that can be explored to improve its performance –and that of other DNNs– and it is plausible that combining them will permit higher performance advantages, at least if the combined procedures have different characters, i.e., different conceptual reasons to produce performance improvements. And let us clarify that we are speaking not only of pre-emphasis or elastic distortion, but also of other data augmentation techniques, more elaborate methods of noisy learning (considering the problem to be addressed), and regularization mechanisms, including drop-out or drop-connect.

Note

At http://www.tsc.uc3m.es/~ralvear/Software.htm, the interested reader can access the software blocks we have developed for our experiments, as well as find links to other blocks we have used.

Acknowledgements

This work has been partly supported by research grants CASI-CAM-CM (S2013/ICE-2845, Madrid Community) and Macro-ADOBE (TEC2015-67719, MINECO-FEDER EU), as well as by the research network DAMA (TIN2015-70308-REDT, MINECO).

We gratefully acknowledge the help of Prof. Monson H. Hayes in discussing and preparing this work.

References

[1] G. Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signals Syst. 2 (1989) 303–314.

[2] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (1989) 359–366.

[3] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard,

L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neu- ral Comput. 1 (1989) 541–551.

[4] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (2009) 1–127.

[5] J. Schmidhuber, Deep Learning in Neural Networks: An Overview, Technical report, IDSIA-03-14, University of Lugano, 2014. arXiv: 1404.7828v4 [cs.NE].

[6] G.E. Hinton, S. Osindero, Y. Teh, A fast learning algorithm for deep belief net- works, Neural Comput. 18 (2006) 1527–1554.

[7] P. Smolensky, Information processing in dynamical systems: foundations of harmony theory, in: D.E. Rumelhart, J.L. McClelland, The PDP Research Group (Eds.), Parallel Distributed Processing: Exploration in the Microstructure of Cognition. Vol. 1: Foundations, Cambridge, MA: MIT Press, 1986, pp. 194–281.

[8] M.A.C.-P. Carreira-Perpiñán, G.E. Hinton, On contrastive divergence learning, in:Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, Barbados: Soc. Artificial Intelligence and Statistics, 2005, pp. 33–40.

[9] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.A. Manzagol, Stacked denoisingautoencoders: learning useful representations in a deep network with a localdenoising criterion, J. Mach. Learn. Res. 11 (2010) 3371–3408.

[10]Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach.Intell. 35 (2013) 1798–1828.

[11]M.M. Najafabadi, et al., Deep learning application and challenges in big data analytics, J. Big Data 2 (1) (2015) 9.

[12]P.P. Brahman, D. Wu, Y. She, Why deep learning works: a manifold disentangle- ment perspective, IEEE Trans. Neural Netw. Learn. Syst. 27 (2016) 1997–2008. [13] L. Deng, D. Yu, Deep convex net: a scalable architecture for speech pat-

tern classification, in: Proceedings of Interspeech 2011, Florence, Italy, 2011,pp. 2285–2288.

[14]B. Hutchinson, L. Deng, D. Yu, A deep architecture with bilinear modeling ofhidden representations: applications to phonetic recognition, in: Proceedingsof IEEE International Conference on Acoustics, Speech and Signal Processing,New York, NY: IEEE Press, 2012, pp. 4 805–4 808.

[15]X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforwardneural networks, in: Proceedings of the 13th International Conference on Ar- tificial Intelligence and Statistics, JMLR: W&CP 9, Chia Laguna Resort, Sardinia,Italy, 2010, pp. 249–256.

[16]J. Martens, Deep learning via Hessian-free optimization, in: Proceedings of the 23th International Conference on Machine Learning, Haifa (Israel), 2010, pp. 735–742.

[17]J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learningand stochastic optimization, J. Mach. Learn. Res. 12 (2011) 2121–2159.

[18]V. Nair, G.E. Hinton, Rectifier linear units improve resticted Boltzmann ma- chines, in: Proceedings of the 23th International Conference on MachineLearning, Haifa (Israel), 2010, pp. 807–814.

[19]S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by

reducing internal covariate shift, Proceedings of the 32th International Confer- ence on Machine Learning, JMLR: W&CP28. Lille (France), 2015.

[20]I. Sutskever, G.E. Hinton, Deep, narrow sigmoid belief networks are universalapproximators, Neural Comput. 20 (2008) 2629–2636.

[21]G. Montufar, N. Ay, Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines, Neural Comput. 23 (2011) 1306–1319.

[22]L. Szymanski, B. McCane, Deep networks are effective encoders of periodicity, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014) 1816–1827.

[23]J. Håstad, M. Goldmann, On the power of small-depth threshold circuits, Comput. Complexity 1 (1991) 113–129.

[24]O. Delalleau, Y. Bengio, Shallow vs. deep sum-product networks, in: J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 24, Cambridge, MA: MIT Press, 2011, pp. 666–674.

[25]L. Deng, D. Yu, Deep learning: methods and applications, Found. Trends Signal Process. 7 (2014) 197–387.

[26]Deep Learning University, An annotated deep learning bibliography, http://memkite.com/deep-learning-bibliography/.

[27]D. Mishkin, J. Matas, All you need is a good init (2015). arXiv: 1511.06422v7 [cs.LG].

[28]S. Tabik, D. Peralta, A. Herrera-Poyatos, F. Herrera, A snapshot of image pre-processing for convolutional neural networks: case study of MNIST, Int. J. Comput. Intell. Syst. 10 (2017) 555–568.

[29]D. Ciresan, U. Meier, J. Schmidhuber, Multi-column deep neural networks for image classification, in: Proceedings of the Conference on Computer Vision and Pattern Recognition, New York, NY: IEEE Press, 2012, pp. 3642–3649.

[30]C.K. Chow, Statistical independence and threshold functions, IEEE Trans. Electron. Comput. 14 (1965) 66–68.

[31]A.J. Sharkey, Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, London, UK: Springer, 1999.

[32]L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Hoboken, NJ: Wiley, 2004.

[33]L. Rokach, Pattern Classification Using Ensemble Methods, Singapore: World Scientific, 2010.

[34]R.E. Schapire, Y. Freund, Boosting: Foundations and Algorithms, Cambridge, MA: MIT Press, 2012.

[35]M. Woźniak, M. Graña, E. Corchado, A survey of multiple classifier systems as hybrid systems, Inf. Fusion 16 (2014) 3–17.

[36]L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.

[37]L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.

[38]L. Breiman, Randomizing outputs to increase prediction accuracy, Mach. Learn. 40 (2000) 229–242.

[39]G. Martínez-Muñoz, A. Sánchez-Martínez, D. Hernández-Lobato, A. Suárez, Class-switching neural network ensembles, Neurocomputing 71 (2008) 2521–2528.

[40]Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139.

[41]R.E. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated predictions, Mach. Learn. 37 (1999) 297–336.

[42]V. Gómez-Verdejo, M. Ortega-Moral, J. Arenas-García, A.R. Figueiras-Vidal, Boosting by weighting critical and erroneous samples, Neurocomputing 69 (2006) 679–685.

[43]V. Gómez-Verdejo, J. Arenas-García, A.R. Figueiras-Vidal, A dynamically adjusted mixed emphasis method for building boosting ensembles, IEEE Trans. Neural Netw. 19 (2008) 3–17.

[44]Y. Liu, X. Yao, Ensemble learning via negative correlation, Neural Netw. 12 (1999) 1399–1404.

[45]Y. Liu, X. Yao, Simultaneous training of negatively correlated neural networks, IEEE Trans. Syst. Man Cybern. Pt. B-Cybern. 29 (1999) 716–725.

[46]R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, Adaptive mixtures of local experts, Neural Comput. 3 (1991) 79–87.

[47]S.E. Yuksel, J.N. Wilson, P.D. Gader, Twenty years of mixture of experts, IEEE Trans. Neural Netw. Learn. Syst. 23 (2012) 1177–1193.

[48]A. Omari, A.R. Figueiras-Vidal, Feature combiners with gate-generated weights for classification, IEEE Trans. Neural Netw. Learn. Syst. 24 (2013) 158–163.

[49]T. Hastie, R. Tibshirani, Classification by pairwise coupling, Ann. Stat. 26 (1998) 451–471.

[50] K. Duan, S. Keerthi, W. Chu, S. Shevade, A. Poo, Multi-category classification by soft-max combination of binary classifiers, in: Proceedings of the 4th International Conference on Multiple Classifier Systems, Berlin, Heidelberg: Springer-Verlag, 2003, pp. 125–134.

[51]T.G. Dietterich, G. Bakiri, Solving multiclass learning problems via error-correcting output codes, J. Artif. Intell. Res. 2 (1995) 263–286.

[52]D.C. Ciresan, U. Meier, L.M. Gambardella, J. Schmidhuber, Convolutional neural network committees for handwritten character classification, in: Proceedings of the 11th International Conference on Document Analysis and Recognition, New York, NY: IEEE Press, 2011, pp. 1135–1139.

[53]J. Xie, B. Xu, Z. Chung, Horizontal and vertical ensemble with deep representation for classification (2013). arXiv: 1306.2759v1 [cs.LG].

[54]C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions (2014). arXiv: 1409.4842v1 [cs.CV].

[55]X. Frazão, L.A. Alexandre, Weighted convolutional neural network ensemble, in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Zug, Switzerland: Springer, 2014, pp. 674–681.

[56]O. Nina, C. Rubiano, M. Shah, Action recognition using ensemble of deep convolutional neural networks, CRCV Technical Report, University of Central Florida, 2014.

[57]T. Zhao, Y. Zhao, X. Chen, Building an ensemble of CD-DNN-HMM acoustic model using random forests of phonetic decision trees, in: Proceedings of the IEEE 9th International Symposium on Chinese Spoken Language Processing, Singapore, 2014, pp. 98–102.


[58]L. Shao, D. Wu, X. Li, Learning deep and wide: a spectral method for learning deep networks, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014) 2303–2308.

[59]T. Xia, D. Tao, T. Mei, Y. Zhang, Multiview spectral embedding, IEEE Trans. Syst. Man Cybern. Part B 40 (2010) 1438–1446.

[60]A.J. Simpson, Instant learning: parallel deep neural networks and convolutional bootstrapping (2015). arXiv: 1505.05972.

[61]S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, D. Batra, Why M heads are better than one: training a diverse ensemble of deep networks (2015). arXiv: 1511.06314v1 [cs.CV].

[62]R.F. Alvear-Sandoval, A.R. Figueiras-Vidal, Does diversity improve deep learning? in: Proceedings of the 23rd European Signal Processing Conference (EUSIPCO), Nice (France), 2015, pp. 2541–2545.

[63]S. Lee, S. Purushwalkam, M. Cogswell, V. Ranjan, D. Crandall, D. Batra, Stochastic multiple choice learning for training diverse deep ensembles, in: Proceedings of the 29th Conference on Advances in Neural Information Processing Systems, Barcelona (Spain), 2016.

[64]N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.

[65]L. Wan, M. Zeiler, S. Zhang, Y. LeCun, R. Fergus, Regularization of neural networks using dropconnect, in: Proceedings of the 30th International Conference on Machine Learning, JMLR: W&CP 28, Atlanta, GA, 2013, pp. 1058–1066.

[66]P. Hart, The condensed nearest neighbor rule, IEEE Trans. Inf. Theory 14 (1968) 515–516.

[67]P.W. Munro, Repeat until bored: a pattern selection strategy, in: Proceedings of the 4th Conference on Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann, 1991, pp. 1001–1008.

[68]C. Cachin, Pedagogical pattern selection strategies, Neural Netw. 7 (1994) 171–181.

[69]S.H. Choi, P. Rockett, The training of neural classifiers with condensed datasets, IEEE Trans. Syst. Man Cybern. Pt. B-Cybern. 32 (2002) 202–206.

[70]D. Gorse, A.J. Sheperd, J.G. Taylor, The new ERA in supervised learning, Neural Netw. 10 (1997) 343–352.

[71]A. Lyhyaoui, et al., Sample selection via clustering to construct support vector-like classifiers, IEEE Trans. Neural Netw. 10 (1999) 1474–1481.

[72]S. El-Jelali, A. Lyhyaoui, A.R. Figueiras-Vidal, Designing model based classifiers by emphasizing soft targets, Fundamenta Informaticae 96 (2009) 419–433.

[73]L. Franco, S.A. Cannas, Generalization and selection of examples in feedforward neural networks, Neural Comput. 12 (2000) 2405–2426.

[74]R.F. Alvear-Sandoval, A.R. Figueiras-Vidal, An experiment in pre-emphasizing diversified deep neural classifiers, in: Proceedings of the 24th European Symposium on Artificial Neural Networks, Bruges (Belgium), 2016, pp. 527–532.

[75]R.F. Alvear-Sandoval, M.H. Hayes, A.R. Figueiras-Vidal, Improving denoising stacked auto-encoding classifiers by pre-emphasizing training samples, submitted to Neurocomputing (2016).

[76]Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (1998) 2278–2324.

[77]M.A. Nielsen, Neural Networks and Deep Learning, Determination Press (o. l.), 2015.

[78]M. O’Neill, Standard reference data program NIST, 2006, http://www.codeproject.com/kb/library/NeuralNetRecognition.aspx.

