This is a postprint version of the following published document:
Alvear Sandoval, R. F. and Figueiras Vidal, A. R. (2018). On building ensembles of stacked denoising auto-encoding classifiers and their further improvement. Information Fusion, 39, pp. 41-52.
DOI: https://doi.org/10.1016/j.inffus.2017.03.008
© 2017 Elsevier B.V. All rights reserved.
This work is licensed under a Creative Commons license.
On building ensembles of stacked denoising auto-encoding classifiers and their further improvement
Ricardo F. Alvear-Sandoval∗, Aníbal R. Figueiras-Vidal
GAMMA-L+/DTSC, Universidad Carlos III de Madrid, Spain
Abstract
To aggregate diverse learners and to train deep architectures are the two principal avenues towards increasing the expressive capabilities of neural networks. Therefore, their combinations merit attention. In this contribution, we study how to apply some conventional diversity methods –bagging and label switching– to a general deep machine, the stacked denoising auto-encoding classifier, in order to solve a number of appropriately selected image recognition problems. The main conclusion of our work is that binarizing multi-class problems is the key to obtain benefit from those diversity methods.
Additionally, we check that adding other kinds of performance improvement procedures, such as pre-emphasizing training samples and elastic distortion mechanisms, further increases the quality of the results. In particular, an appropriate combination of all the above methods leads us to reach a new absolute record in classifying MNIST handwritten digits.
These facts reveal that there are clear opportunities for designing more powerful classifiers by means of combining different improvement techniques.
Keywords: Augmentation; Classification; Deep; Diversity; Learning; Pre-emphasis
1. Introduction
The number of available training samples limits the expressive capability of traditional one-hidden layer perceptrons, or shallow Multi-Layer Perceptrons (MLPs), for practical applications, in spite of their theoretically unbounded approximation capacities [1,2]. Consequently, a lot of attention is being paid to architecture and parameterization procedures that allow improving their performance when solving practical problems. The most relevant procedures increase the number of trainable weights following two main avenues: building ensembles of learning machines, or constructing Deep Neural Networks (DNNs).
Most of the advances in DNN design correspond to the last decade. In fact, prior to 2006 the only successfully used DNNs were the Convolutional Neural Network (CNN) classifiers [3], whose significantly simplified structure makes possible their training by means of conventional algorithms such as Back Propagation (BP). This architecture is appropriate for some kinds of applications, those in which the samples show translation-invariant characteristics, for example, image processing. But direct training of general form DNNs remained without solution because of the appearance of vanishing or exploding derivatives [4,5]. In 2006 Hinton et al. [6] proposed an indirect deep classifier design that involved the stacking of Restricted Boltzmann Machines (RBMs) [7]. A top layer task-oriented training and an overall refining complete these classifiers, which have come to be called Deep Belief Machines (DBMs). Contrastive divergence algorithms allow for the training of the DBMs without a huge computational effort [8].

∗ Corresponding author.
E-mail address: [email protected] (R.F. Alvear-Sandoval).
Some years later, Vincent et al. [9] introduced a similar procedure consisting of an expansive, denoising auto-encoder layer-wise training plus the final top classification and refining. These are the Stacked Denoising Auto-Encoder (SDAE) classifiers. It is worth mentioning that both DBMs and SDAEs are representation machines [10], i.e., their hidden layers provide more and more sophisticated high-level feature representations of the input vectors. These representations can be useful for analysis purposes [11], and, even more, the representation process induces a disentangling of the sub-spaces in which the samples appear [12]. On the other hand, another sequentially trainable deep architecture, the Deep Stacking Network (DSN), was introduced in [13,14], following the idea of training shallow MLPs and adding their outputs to the input vector for training further units. Finally, a number of modifications that reduce the difficulties with the derivatives have been proposed for directly training DNNs, such as data-conscious initializations [15], Hessian-free search [16], mini-batch iterations [17], non-sigmoidal activations [18], and adding scale and location trainable parameters [19].
There are also proofs of universal approximation capabilities for DNNs [20,21], as well as of some interesting characteristics of them [22]. The analyses in [23,24] show that by adding layers to a network, it is possible to reduce the effort to establish input-output correspondences. In practice, DNNs have offered excellent performance results in many applications; therefore, one can conclude they are important in spite of the large number of parameters that need to be learned. There is no room here to give more details, so the interested reader is referred to excellent tutorials [4,5,25] for more extensive reviews and bibliography, as well as to [26] for a bibliography of applications.
Ensembles are the second option to effectively increase the expressive capability of learning machines, including MLPs. They are built by means of training learners that consider the problem to be solved from different perspectives, i.e., under a principle of diversity, and aggregating their outputs to obtain an improved solution. For the sake of continuity in this Introduction, we present a very concise overview of ensembles in Section 2, emphasizing just the design methods that we will use in our experiments.
Since both diversity and depth increase the expressive power of MLPs but through very different mechanisms, it seems reasonable to expect that a combination of them would lead to an even better performance. However, there is only a moderate number of contributions along this research direction. We will briefly revise them in Section 3, but we anticipate that some difficulties appear when trying to apply the usual ensemble building methods to multi-class problems, and, consequently, most of the DNN ensembles are constructed by means of “ad hoc” procedures.
In this paper, we explore and discuss in detail how and why diversification can be applied to DNNs, as well as whether including other improvement techniques gives additional advantage. The objective is to evaluate if it is possible to get significant advantages by combining diversification and deep learning, as well as other techniques.
Of course, we have to select both DNN architectures and classification problems for our experiments and subsequent analysis. Although most of the previous studies with the databases we will use have considered CNNs, we have decided to work with a less specific architecture to exclude the possibility of obtaining conclusions only valid for this particular form of DNN and the kind of problems that are appropriate for it. So, we select SDAE classifiers, and in particular the SDAE-3 design that is introduced in [9]. However, at the same time and just to show the potential of combining diversity and depth, we will address some traditional image classification tasks –also included in [9]– that are more appropriate for CNN architectures. The selected problems for our experiments will be the well-known 10-class handwritten digit MNIST database [3]; its version with a smaller training set, MNIST-BASIC [9], in order to analyze the relevance of the weak or strong character of the SDAE-3 classifiers; and also the binary database RECTANGLES [9], with the objective of studying the origin of the difficulties for creating ensembles of multi-class DNNs. We emphasize that these selections are not arbitrary: There are many published results for MNIST, for example in [27,28], and clearly established records for representation DNNs, a 0.86% error rate [10], and for CNN ensembles [29], a 0.21% error rate. We anticipate from now that, with the help of a boosting-type training reinforcement or pre-emphasis, and a simple data augmentation besides the binarization and training diversification we will apply, we arrive at a new absolute performance record, a 0.19% error rate. We repeat that this record was not our objective: we looked for a better understanding of how to combine diversity and depth, and we selected both the SDAEs and the databases to avoid conclusions only valid for particular situations –using CNNs for image problems. Thus, in our opinion, there is no reason to think that our conclusions are problem- or architecture-dependent.
The rest of the paper is structured as follows. In Section 2, we present brief overviews of machine ensembles, both general forms and those that come from binarizing multi-class problems. We dedicate Section 3 to listing and commenting on previously published works in designing DNN ensembles. Section 4 describes the additional techniques –pre-emphasis and data augmentation– we will use in the second part of our experiments. The experimental framework is detailed in Section 5: databases, deep learning units, diversification and binarization techniques, and pre-emphasis and data augmentation forms. The results of the experiments appear in Section 6 following a sequential order, plus the corresponding discussions. Finally, the main conclusions of this work and some directions for further research close the paper.
2. Ensembles
To build an ensemble of diverse machines and aggregate their outputs is a way to increase expressive power. Although the first ideas on it were published half a century ago [30], they have been mainly developed along the last two decades. In the following, we briefly review some ensemble techniques, including those we will use to diversify SDAEs. We dedicate separate sub-sections to designs that introduce diversification by means of architecture or training differences, which we will call conventional ensembles, and to ensembles that come from transforming a multi-class problem into a number of binary classifications from which the resulting class can be obtained. Since a complete review of ensembles is beyond the scope of this paper, the reader is referred to monographs [31–34], as well as to the tutorial article [35], which includes interesting perspectives on ensemble applications.
2.1. Conventional ensembles
Conventional diversification methods may be broadly classified into two categories. The first are those approaches that independently train a number of machines, usually with different training sets. These machines, or learners, can also have different structures. After that, the learners' outputs are aggregated –typically with simple, non-trainable procedures– to come up with the final classification. These ensembles are called committees.
Among committees, Random Forests (RFs) [36] are very popular because they offer a remarkable performance. They diversify a number of tree classifiers by means of probabilistic branching, which can be combined with sub-space projections. There are other committees that can be applied to general types of learners, requiring only that they are unstable: bagging [37] and label switching [38,39]. We will include both of them in our experiments because they are simple to implement and provide high expressive power, clearly improving the performance of a single machine. Yet we announce that the first experimental results will lead us to focus on the second.
Bagging (“Bootstrap and aggregating”) produces diversity by training the ensemble learners with bootstrap re-sampled versions of the original training set and, then, aggregating these learners' outputs, usually by averaging them or with a majority vote. Bootstrap is a random sampling mechanism which includes replacement to permit arbitrary sizes of the re-sampled population. Although its primitive form used bootstrapped sets of the same size as the true training set, exploring the size of these bootstrapped sets is important to find a good balance between computational effort, number of learners, and ensemble performance, because in some cases the reduction of the true samples that each learner sees can provoke losses. On the other hand, label switching changes the labels of a given portion of the training samples according to some stochastic mechanism. We will employ the simplest version, for which these changes appear purely at random. The switching rate must be explored when designing these committees.
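To make the two mechanisms concrete, the following minimal sketch (NumPy; our illustration, not the authors' code) generates bagging and label-switching training sets; the base learners themselves and the final aggregation –averaging or a majority vote– are outside the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def bagging_sets(X, y, n_learners, resample_pct=100):
    """Bootstrap re-sampling: for each learner, draw with replacement
    a training set of size resample_pct% of the original one."""
    m = int(len(X) * resample_pct / 100)
    for _ in range(n_learners):
        idx = rng.integers(0, len(X), size=m)   # sampling with replacement
        yield X[idx], y[idx]

def switching_sets(X, y, n_learners, switch_rate=0.2, n_classes=10):
    """Label switching: relabel a random fraction of the samples,
    purely at random, for each learner."""
    for _ in range(n_learners):
        y_sw = y.copy()
        flip = rng.random(len(y)) < switch_rate
        # add a random non-zero offset modulo C, so the new label differs
        y_sw[flip] = (y[flip] + rng.integers(1, n_classes, flip.sum())) % n_classes
        yield X, y_sw
```

Each yielded pair would train one independent learner; the ensemble size N, the resampling percentage B, and the switching rate S are the non-trainable parameters explored by validation in Section 5.3.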
The second type of conventional ensembles, which we call consortia, are algorithms that train both the learners and the aggregation in a related manner. The best known method in this category is boosting, which has proven to provide excellent classification performances by combining the outputs of weak learners [40,41]. Its principle is to sequentially design and aggregate weak learners that, at each stage, pay more attention to the examples that have larger classification error according to the previously built partial ensemble. Many extensions and generalizations of the basic boosting algorithms have been proposed [34]. Among them, we mention [42,43] because they also consider the proximity of the training examples to the classification boundary, and this idea is at the origin of the formulas for the pre-emphasis techniques we will apply in this paper, which will be introduced in Section 4. But, since boosting methods require weak learners, we do not include them in our experiments to diversify SDAE classifiers.
Other consortia algorithms are those known as Negative Correlation Learning (NCL) ensembles [44,45] and the Mixtures of Experts (MoEs) [46]. Both of them could be applied to diversify DNNs, but their performance improvements are moderate for classification, and their many modifications to get more advantage [47,48] demand a higher computational effort, which, in principle, is not ideal when looking for methods to diversify DNNs.
According to the above, we will adopt in our experiments bagging and label switching as conventional diversification mechanisms, although we will also apply pre-emphasis forms that are conceptually equivalent to generalized boosting weighting.
2.2. Ensembles of binary classifiers for multi-class problems
A different type of ensemble is that formed by decomposing a multi-class problem into a number of binary problems [33]. Although of different origin and constructed in a different manner than conventional ensembles, binarization techniques are true ensembles since they consist of a collection of binary machines, each one having a different function or task that is related to the overall problem of multi-class classification, and the outputs of the binary machines are aggregated to produce a classification that is better than that of any one machine used alone.
There are two basic binarization techniques, One versus One (OvO) and One versus Rest (OvR) [49,50]. For a C-class problem, OvO is an ensemble of C(C − 1)/2 binary classifiers that perform pair-wise classifications of one class Cj versus another class Ck for each j ≠ k. The class that has the most votes is then determined to be the correct class. OvR is an ensemble of C binary classifiers where each classifier makes a decision between one class and the rest. The advantages of OvR are that the number of learners is smaller (obviously, C) –although it may be argued that this leads to less diversity– and that each classifier sees all of the training samples. However, its disadvantage is that the learner data sets become imbalanced, and this fact tends to produce a decrease in the performance of individual classifiers and, consequently, of the whole ensemble. Although this can be alleviated by using carefully designed re-balancing mechanisms, we decide to use OvO to avoid the corresponding risks.
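As an illustration of the OvO scheme just described, here is a minimal sketch under our own conventions (integer class labels 0, ..., C−1; `train_binary` is a hypothetical factory returning a decision function with outputs in {−1, +1}):

```python
import numpy as np
from itertools import combinations

def train_ovo(X, y, n_classes, train_binary):
    """Train the C(C-1)/2 pairwise classifiers of an OvO ensemble."""
    models = {}
    for cj, ck in combinations(range(n_classes), 2):
        mask = (y == cj) | (y == ck)
        t = np.where(y[mask] == cj, 1, -1)      # binary targets in {-1, +1}
        models[(cj, ck)] = train_binary(X[mask], t)
    return models

def predict_ovo(models, X, n_classes):
    """Each pairwise classifier votes; the class with most votes wins."""
    votes = np.zeros((len(X), n_classes))
    for (cj, ck), f in models.items():
        out = f(X)
        votes[out > 0, cj] += 1
        votes[out <= 0, ck] += 1
    return votes.argmax(axis=1)
```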
Another approach to the binarization of a multi-class problem that is more effective than OvO and OvR is the Error Correcting Output Codes (ECOCs) [51]. ECOCs are typically better than OvO or OvR because a wrong decision requires a number of wrong binary classifications –see the details below– and not just a wrong majority (OvO) or a wrong high output level (OvR). However, effective ECOC designs are very difficult for many-class problems. More details can be found in [33]. An ECOC binarization associates a binary codeword to each class, and the resulting binary problems are those represented by the columns, each dichotomy being formed by grouping the true classes according to their correspondence to 0 or 1 bits. The overall classification is performed by finding the class whose codeword is the closest to the one that is produced by the ensemble of classifiers. This implies that codewords with high Hamming distances must be selected, and, consequently, there is a clear compromise with the length of the codeword, i.e., the size of the ECOC ensemble. In our experiments, we will use the 15-bit, 10-class ECOC which was introduced in [51] just for MNIST, considering different characteristics of handwritten digits.
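The decoding step is a simple Hamming-distance search. In the sketch below (NumPy, our illustration), `codewords` stands for the 15-bit, 10-class code matrix of [51], which is not reproduced here, so the argument is a placeholder:

```python
import numpy as np

def ecoc_decode(bit_outputs, codewords):
    """bit_outputs: (n_samples, n_bits) hard binary decisions in {0, 1};
    codewords: (n_classes, n_bits) code matrix, one row per class.
    Returns, per sample, the class whose codeword is closest in
    Hamming distance to the ensemble output."""
    dist = (bit_outputs[:, None, :] != codewords[None, :, :]).sum(axis=2)
    return dist.argmin(axis=1)
```

Note that, as in standard coding theory, a wrong final decision requires a number of bit errors of at least half the minimum Hamming distance between codewords, which is the error-correcting margin mentioned above.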
We must say that, when the number of classes is high, OvO schemes become impractical, and ECOC ensembles are difficult to design. Monograph [33] provides some details about how to do it. To consistently re-balance OvR dichotomies is a good alternative, and the same is useful for big ECOC ensembles. Since we do not deal here with such kinds of problems, we do not further discuss this subject.
3. A concise overview of previous approaches to diversify DNNs
The first CNN ensemble was presented in [52], with different image sizing as the diversification technique. Simple data augmentation mechanisms allowed a record performance for MNIST, an excellent 0.21% error rate, using again a CNN ensemble [29]. Direct averaging of classification outputs obtained from consecutive hidden layers and/or of the top output at different training epochs is proposed in [53], with moderate advantages. In GoogLeNet [54], bagging is applied together with different weight initializations to get a very deep and powerful CNN machine. In [55], trainable simple aggregation schemes are applied to CNN ensembles to further improve their performance. Optical flow obtained from consecutive video frames is used in [56] to construct CNN ensembles. Similarly, multiple triphone states are the source of diversity for speech recognition purposes in [57], using learners that include hidden Markov model units. The authors of [58] use spectral diversity [59] to train some DNNs whose outputs are aggregated, while partial training of learners with distorted data sets is the diversification applied in [60].
From our point of view, the studies carried out in [61] are specially relevant. Their authors found that there are difficulties in applying bagging to CNNs: In fact, average performances for several databases become worse than applying only random initializations. They also found that diversifying layers that are near the final output is more effective, a fact that we also observed at the same time for SDAE-3 learners [62]. These were the starting points for the research we present in this paper.
We conclude this concise review mentioning two very recent contributions: simultaneously training aggregation weights and learners for several DNN architectures, including CNNs [63], and selectively combining CNNs that have been trained with powerful augmentation procedures [28]. Finally, let us clarify that we do not discuss here important methods such as drop-out [64] and drop-connect [65] because, although some researchers consider them as forms of diversification, we firmly support that they are probabilistic regularization methods (that subsequently produce some kind of elementary diversification), since their main effect is to reduce the possibility of overfitting by randomly limiting the active units or weights at each learning algorithm step.
An important conclusion can be extracted from the above review: To apply conventional diversification techniques to DNNs is not an easy task, and this seems to be the reason why most of the DNN ensembles are built using other diversification mechanisms, such as “ad hoc” image augmentation pre-processing, speech sequential features, or even partial training or manual selection. Since our experience indicates that the case is similar to that of shallow MLPs applied to multi-class problems, where binarization not only provides a direct advantage but also makes conventional diversification more effective, we oriented the first phase of our research to check if this is also true for DNNs, in particular, for SDAE classifiers, which we selected for the reasons we indicated in the Introduction. After obtaining good results, we decided to continue by adding another complementary procedure that also much improves the classification performance of several kinds of shallow machines (including shallow MLPs), pre-emphasis, under a general formulation we will introduce in the next section. And, finally, given the success that the application of data augmentation pre-processing has demonstrated in many experiments, we added a simple form of it –also described in Section 4– to the combination of binarization, conventional diversification, and pre-emphasis. The results have been excellent, including, as announced above, a new absolute record for the MNIST database, a 0.19% average error rate, in spite of using a non-convolutional basic learner, the SDAE-3 classifier. But we think that it is even more important to have checked that binarization helps to effectively apply conventional diversification mechanisms, as well as to have experimentally demonstrated that combining other techniques with them (namely pre-emphasis and data augmentation) further increases the ensemble performance.
4. Pre-emphasis and data augmentation
4.1. The concept of pre-emphasis and our selected formulations for it
Training sample selection techniques and their evolution, training sample weighting, also called pre-emphasis, are efficient and effective methods to improve the performance of classification machines. Conceptually, they are mechanisms that indicate to the machine the degree of attention which must be paid to each sample, by means of weighting the corresponding training cost component. If weights are restricted to {0, 1}, we have a sample selection scheme, which was the form applied in the seminal work of Hart with nearest neighbor classifiers fifty years ago [66]. Many other forms have been subsequently proposed, among which [67–70] include interesting alternatives and discussions. Pre-emphasis methods have recently been used for selecting samples as kernel centroids [71] and also for allowing a direct training of Gaussian Process classifiers by defining an appropriate form of soft targets [72].

The weight that each sample receives has been related to its classification error or to a measure of its proximity to the classification boundary, usually according to the output of an auxiliary classifier. The effectiveness of both methods is problem-dependent [73]. Therefore, the pre-emphasis forms that are applied today include components of the two kinds. These combinations have demonstrated their effectiveness for designing boosting ensembles [42,43].
In this paper, we will adopt the general forms of weighting functions that we have previously applied just to pre-emphasize SDAE-3 classifiers with excellent results [74,75]. For binary problems, the weight for the training cost of the sample {x^(n), t^(n)}, t^(n) ∈ {−1, 1}, is

p(x^(n)) = α + (1 − α)[β(t^(n) − o_a^(n))^2 + (1 − β)(1 − (o_a^(n))^2)]   (1)

where o_a^(n) is the output of the auxiliary classifier when x^(n) is its input, and the parameters α, β, 0 ≤ α, β ≤ 1, serve to establish the appropriate proportion of no emphasis (the term α), error emphasis (the term (1 − α)β(t^(n) − o_a^(n))^2), and proximity emphasis (the term (1 − α)(1 − β)(1 − (o_a^(n))^2)). In general, values for α, β can be found by means of cross-validation processes. Note that form (1) includes many particular cases: α = 0, full emphasis with the two components; β = 0, moderated (by α) emphasis according to the proximity to the boundary; β = 1, moderated emphasis according to the error; α = 0 and β = 0, full emphasis according to the proximity to the boundary; α = 0 and β = 1, full emphasis according to the error; and α = 1, no emphasis at all. Obviously, there are other possible functional forms for the error and the proximity to the boundary terms. The different forms are more or less effective in a problem-dependent manner, but the performance differences are always moderate.

For the (softmax) multi-class machines we will use

p(x^(n)) = α + (1 − α)[β(1 − o_ac^(n))^2 + (1 − β)(1 − |o_ac^(n) − o_ac′^(n)|)]   (2)

where o_ac^(n) is the output of the auxiliary (softmax) machine corresponding to the correct class c for x^(n), and o_ac′^(n) the output of that machine whose value is the nearest to o_ac^(n) among the rest of the classes.
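A direct transcription of (1) and (2) makes the roles of α and β explicit; this NumPy sketch is our illustration of the formulas, not the authors' code:

```python
import numpy as np

def preemphasis_binary(t, o_a, alpha, beta):
    """Eq. (1): t are the {-1, +1} targets and o_a the auxiliary
    classifier outputs; alpha sets the amount of non-emphasized weight
    and beta balances error vs. proximity emphasis (both in [0, 1])."""
    return alpha + (1 - alpha) * (beta * (t - o_a) ** 2
                                  + (1 - beta) * (1 - o_a ** 2))

def preemphasis_multiclass(o_ac, o_ac_prime, alpha, beta):
    """Eq. (2): o_ac is the auxiliary softmax output for the correct
    class, o_ac_prime the nearest output among the remaining classes."""
    return alpha + (1 - alpha) * (beta * (1 - o_ac) ** 2
                                  + (1 - beta) * (1 - np.abs(o_ac - o_ac_prime)))
```

These weights multiply the corresponding training cost components of the pre-emphasized machine.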
We remark that the application of pre-emphasis demands an additional computational effort in designing classifiers, due to the need of validation to find appropriate values for α, β. But there is no increase in the operation –i.e., the classification of unseen samples– computational effort, since the classifier architecture remains the same. This was the reason which moved us to include pre-emphasis in a second step of experiments, after checking that the results offered by binarization and conventional diversity were satisfactory.

4.2. Data augmentation
Data augmentation methods pre-process training examples to create new samples having similar characteristics to those of the original ones. They allow increasing the number of training samples by adding the augmented versions with labels corresponding to the original samples from which they have been obtained. If carefully selected and designed, these methods are effective in order to increase the performance of classification machines.
Data augmentation has been used along two decades [76], and its forms have significantly evolved. We repeat that the MNIST classification absolute record was obtained by means of a CNN ensemble with data augmentation as the diversification source [29]. There are several data augmentation mechanisms that have shown effectiveness in improving the performance of image classifiers, such as random translations, random rotations, centering, and elastic deformations. For concise reviews, we recommend [28,77]. Given the advantage that data augmentation provides and that, as for pre-emphasis, there is not an increase of the operation computational effort, we also include a form of it, the most frequent version of elastic deformation [78], in the last step of our experiments, after combining pre-emphasis and diversity.
The basic aspects of the elastic deformation we will employ are as follows. First, there is a pixel translation, whose horizontal and vertical values are obtained by multiplying the elements of two matrices of the image size with small random values by a scale factor, which must be appropriately selected. Translations are limited to the image borders, and values that arrive at the same position are averaged. After that, a normalized Gaussian filtering is carried out with the objective of smoothing the results. The parameter σ of the filter must also be selected with care. Our selections of the scale factor and σ will be discussed in the section dedicated to the experiments.

5. Experimental framework
We concentrate here the general aspects of our experiments to avoid disorienting the readers with a disperse presentation.
5.1. Datasets
As previously announced, we will work with a much studied, relatively easy dataset of handwritten digits, MNIST [3], both because there are many published results for different machine classifiers and because we want to see if the techniques we propose in this paper produce enough improvements to get competitive results with those of CNN designs when a general-purpose basic deep network, SDAE-3, is used. MNIST contains 50,000 / 10,000 / 10,000 training / validation / test samples, respectively, of dimension 784 (28 × 28) with 256-level quantized values normalized into [0, 1].
It seems important to analyze the effect of the strength of the learners when building an ensemble. Fortunately, there is a version of MNIST, MNIST-BASIC [9] (MNIST-B in the rest of this paper), whose only difference with MNIST is that its training / validation / test samples are 10,000 / 2000 / 50,000, respectively. Fewer training samples will imply weaker learners, and, since MNIST and MNIST-B have the same nature, this will permit an excellent perspective for the analysis we mention.
Finally, we also include a binary database because it is very relevant to appreciate the possible differences between multi-class and binary problems in our study. For reasons of similarity, we selected RECTANGLES (RECT in the following), a database with the outlines of white horizontal or vertical rectangles on a black background of just 28 × 28 pixels. The sizes of the training / validation / test sets are 10,000 / 2000 / 50,000, respectively.
5.2. The basic deep machine: the SDAE-3 classifier
Once again, we emphasize that we select a DNN without a convolutional structure because we want to check if the procedures we are introducing are effective enough to lead to high performance results using a general-purpose architecture, and CNNs can be considered “ad hoc” architectures when dealing with images –our datasets in this work– and other translational inputs, such as speech. Under that condition, there is no reason for expecting that things will be different, in general, for other kinds of problems and DNNs.
We have selected the 3-layer Stacked Denoising Auto-Encoder (SDAE-3) classifier [9] both because it is a representation DNN and because it has been applied to the databases we are working with. Consequently, we can adopt the architecture and training parameters of [9] for a better appreciation of the effects of our proposed techniques.
Fig. 1 shows the structure of an SDAE-3 classifier. It consists of three expansive auto-encoding layers with sigmoidal activations, plus a final classification unit, CL, with a sigmoidal or a softmax output for binary or multi-class problems, respectively. The auto-encoding layers are sequentially trained using the input samples x^(n) as targets for noisy input vectors x^(n) + r. The denoising makes possible the expansive architecture, which increases the expressive capacity, and also provides some degree of robustness. Each auto-encoding hidden layer is frozen before training the next. Finally, the top classifier is trained and the auto-encoding weights refined.
The design parameters used in [9] were 1000 units for the auto-encoding layers, a one-hidden-layer MLP final classifier with 1000 units, a sample-by-sample BP training with a 0.01 learning step for the first auto-encoding layer and 0.02 for the rest and the top MLP and refining, and 40 training epochs, which are enough for convergence. We checked these values with positive results. However, our results were better with an added noise variance of 10% the variance of the samples, leading us to the classification results (for 10 different initialization runs), % average error rates ± standard deviation, of 1.58 ± 0.06, 3.42 ± 0.10, and 2.40 ± 0.10 for MNIST, MNIST-B, and RECT, respectively. These results are slightly worse than those of [9], but the standard deviations are lower.

Fig. 1. SDAE-3 classifier (based on stacked denoising auto-encoders). x^(n): training sample; r: training noise. The weights of the layers are consecutively obtained by imposing the noise-free input samples x^(n) as targets, and then they are frozen until inserting the final classifier, CL. o is the overall output.
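To fix ideas, here is a minimal sketch of the layer-wise denoising pre-training just described (PyTorch assumed available; full-batch gradient descent, a single learning rate, and a fixed noise level are simplifications of the sample-by-sample BP, per-layer learning steps, and 10%-of-sample-variance noise quoted above):

```python
import torch
import torch.nn as nn

def pretrain_dae_layers(X, sizes=(784, 1000, 1000, 1000),
                        noise_std=0.1, epochs=40, lr=0.01):
    """Each expansive auto-encoding layer learns to reconstruct its
    noise-free input from a noisy version, and is then frozen."""
    layers, h = [], X
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
        dec = nn.Sequential(nn.Linear(d_out, d_in), nn.Sigmoid())
        opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(epochs):
            noisy = h + noise_std * torch.randn_like(h)   # x^(n) + r
            loss = nn.functional.mse_loss(dec(enc(noisy)), h)
            opt.zero_grad(); loss.backward(); opt.step()
        for p in enc.parameters():        # freeze before training the next layer
            p.requires_grad_(False)
        h = enc(h).detach()
        layers.append(enc)
    # the top classifier CL and the final refining are trained afterwards
    return nn.Sequential(*layers)
```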
5.3. Conventional diversification
We use bagging and switching. The determination of the sizes of the corresponding ensembles is done according to the best results for the validation set, using N = 25, 51, 101 for the number of ensemble learners. The same validation set serves to select the bootstrapping size among B = 60, 80, 100, and 120% that of the original train set, and the switching rate among S = 10, 20, 30, and 40% of the training samples. All committee units are initialized with uniformly distributed random values.
5.4. Binarization techniques
According to the discussion of Section 2.2, we will apply OvO binarization and the 15-bit, 10-class ECOC which is proposed in [51] just for handwritten digits.
6. Experimental results and their discussions
6.1. First level experiments: binarization and conventional diversity
We repeat that our preliminary experiments with the multi-class problems MNIST and MNIST-B using only bagging or switching of SDAE-3 classifiers did not provide any performance improvement, the same negative result that other studies obtained. Thus, we carried out additional experiments including both binarization and bagging or switching.
Two alternative approaches were considered. The first consists of diversifying the SDAE-3 auto-encoding parts, which we will indicate as DAE3, and, after that, for multi-class problems, constructing binarization ensembles with final classifiers for each DAE3 output. Fig. 2 shows the corresponding model. The final aggregation, FA, is a vote counting for each ensemble plus a majority vote if OvO is applied, or a vote counting plus a Hamming distance class selection for ECOC binarization. Obviously, switching cannot be applied to build DAE3 ensembles. We will call this form of constructing ensembles and binarizing the G (Global) model.

Fig. 2. The G model for multi-class problems. DAE3_n are deep expansive denoising auto-encoders, BE_n the binarizing ensembles, and FA the final aggregation (see text for details). o is the overall output.
Fig. 3 represents the second alternative, which we call the T model because of its aspect: There is a unique DAE3, bagging or switching are applied at its output, and, then, OvO or ECOC binarizing ensembles of final classifiers are trained with the diverse training sets. This model is suggested by the experiences indicating that diversification at higher layers is more effective, such as those presented in [61]. The final aggregation is identical to that of the G model.
As we announced in Section 5.3, the values of the non-trainable parameters are selected according to the results for the validation sets among N = 25, 51, and 101 for the ensemble size, B = 60, 80, 100, and 120% for the bootstrap training set sizes, and S = 10, 20, 30, and 40% for the switching rates. No refining is carried out when training.
To reduce the huge computational load for the design of G models, we apply a frequently used simplification: We build M > N bagging units for each value of the bootstrap sample sizes, and then take N at random. Our experience with this trick indicates that the performance degradation (due to a loss in diversity) is very moderate.
Fig. 3. The TB model for multi-class problems (for bagging). DAE3 is a single expansive denoising deep auto-encoder, BE_n the binarizing ensembles, and FA the final aggregation (see text for details). o is the overall output.
Table 1
Part a: performance results (% average error rate ± typical deviation) for OvO binarization of multi-class problems. DAE3: single DAE with OvO binarization; GB: G model with bagging; TB: T model with bagging; TS: T model with switching. Part b: performance results for the model T with switching when replacing OvO by ECOC binarization. Validation-selected non-trainable parameter values are also indicated. Of course, RECT designs do not include binarization. Best results appear in boldface.

                 MNIST          MNIST-B        RECT (no binarization)
   SDAE-3        1.58 ± 0.06    3.42 ± 0.10    2.40 ± 0.13
a  DAE3, OvO     1.40 ± 0.06    2.60 ± 0.08    −
   GB, OvO       0.86 ± 0.01    1.76 ± 0.04    1.20 ± 0.04
     (N, B)      (101, 120)     (101, 120)     (101, 120)
   TB, OvO       0.77 ± 0.00    1.68 ± 0.04    1.19 ± 0.01
     (N, B)      (101, 100)     (101, 120)     (101, 120)
   TS, OvO       0.75 ± 0.00    1.67 ± 0.04    1.10 ± 0.02
     (N, S)      (101, 40)      (101, 40)      (101, 40)
b  TS, ECOC      0.36 ± 0.02    0.75 ± 0.01    −
     (N, S)      (101, 30)      (101, 30)
6.1.1. Results for OvO binarization and their discussion
Part a of Table 1 shows the performance results for the G and T architectures with N and B/S values selected by validation, and for a DAE3 plus an OvO binarization (without conventional diversification).
Before discussing these performance results, a few words about the selected values for the non-trainable parameters. In all the cases but one (model T with bagging), these selected values are the highest among those explored. But an analysis of the performance for different pairs of values makes evident that there is a clear saturation effect just appearing for these extreme values (N = 101, B = 120%, S = 40%). This can be seen for the MNIST database in Fig. 4. Therefore, the best is just to adopt these values, because increasing them will only increase the computational load to classify unseen samples, but not the performance. It must also be remarked that the parallelism between the performance surfaces for the validation and the test sets is nearly perfect. This means that the validation process will provide good values for the non-trainable parameters.

Fig. 4. The average error rate for the validation and test sets versus N and S for the MNIST data set using the T model and OvO binarization.
From the performance results in Part a of Table 1 it can be concluded that bagging and switching diversification are effective if combined with binarization when dealing with multi-class problems (MNIST and MNIST-B), and that binarization by itself is not the main reason for these improvements, because binarizing the classifiers of a simple DAE3 only provides modest performance increases. Thus, it appears that conventional diversification with powerful classifiers when dealing with multi-class problems is not able to improve the corresponding classification boundaries due to the complexity of these boundaries.
It is also evident that performance improvements are more important for model T than for model G approaches. This seems to indicate that the auto-encoding layers of the SDAE-3 classifiers carry out their disentangling function in an effective and efficient manner. Consequently, model T designs must be preferred, because their computational design and operation costs are lower. On the other hand, there are no significant differences between bagging and switching model T results.
To close this discussion, let us remark that the performance increases by an important amount: The average error rates for the switching model T cases are 47%, 49%, and 46% of those of single SDAE-3 classifiers for MNIST, MNIST-B, and RECT, respectively, and the typical deviations become lower. Note that in this case there are no qualitative differences between the multi-class problems and RECT, nor between MNIST and MNIST-B ('strong' and 'weak' learners).
6.1.2. Results for the ECOC binarization and their discussion
According to the conclusions of the OvO binarization experiments, we will restrict our ECOC design to model T forms and just one of the conventional diversification techniques, switching. The explored non-trainable parameter values are N = 21, 51, 101, and 121, and S = 10, 20, 30, and 40%.
Part b of Table 1 shows the experimental results. Performance figures are much better than those with OvO binarization, which confirms the advantage of well-designed ECOC binarization procedures. An average error rate of 0.36% for MNIST is a very good result using non-convolutional deep machines.
Of course, all these improvements do not come for free: Computational costs are much more important than those of a single SDAE-3, both for training and for classifying unseen samples, or operation. Considering the ECOC T case for the MNIST problem, a direct count gives a total of around 1.5 × 10^9 weights and around 1.5 × 10^6 sigmoids, which are approximately 4 × 10^2 times the numbers for a single SDAE-3 classifier. This is a reasonable estimate for the operation computational load increase. And a detailed accounting of the operations per training step for the diversified and binarized final classifier compared with those required for a single SDAE-3 gives a similar order of magnitude for the load increase: Even considering that these classifiers will become trained in fewer steps, this is also a large computational effort. But, as when dealing with shallow machines, this is the price to be paid to get advantage from binarization and conventional diversification. Obviously, to justify their application, the problem to be solved has to show a high overall misclassification cost.
6.2. Including pre-emphasis
As an introduction, we will summarize the results of applying only pre-emphasis to SDAE-3 classifiers [74,75] before adding this technique to binarization and conventional diversity. This will serve to appreciate if the combination is better than all of its components.
6.2.1. Pre-emphasizing SDAE-3 classifiers
We adopt the same SDAE-3 classifier architecture and parameters as above, and we apply the multi-class or binary pre-emphasis formulas (1), (2). The parameters α, β are explored in 0.1 steps along the interval [0, 1].

With respect to the auxiliary classifiers, or guides, our experience working with shallow classifiers is that pre-emphasis effects are better when using better guides. This can be expected, because better auxiliary classifiers provide better classification results, and, consequently, more appropriate pre-emphasis weights. Additionally, pre-emphasis effects are also better if the guide architecture is similar to that of the pre-emphasized machine. This is also reasonable, because both classifiers are able to construct similar boundaries, and this fact implies that more benefit can be obtained from the pre-emphasis process. However, we considered it appropriate to check if these findings can be extended to deep classifiers. So, we applied two kinds of auxiliary machines: one-hidden-layer MLPs with 1000 units, and the single SDAE-3 classifier without pre-emphasis.

On the other hand, when pre-emphasizing SDAE-3 classifiers, it is unclear if it will be better to apply the pre-emphasis to both the auto-encoding layers and the final classification or only to the final classification step. We tried both alternatives, and here will indicate them as “initial” and “final” designs.
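The α, β exploration is a plain grid search driven by the validation set; a minimal sketch follows, with `train_and_score` a hypothetical routine that trains a pre-emphasized classifier for a given pair and returns its validation error:

```python
import numpy as np

def select_alpha_beta(train_and_score):
    """Explore alpha, beta in 0.1 steps along [0, 1] (11 x 11 = 121
    designs) and keep the pair with the lowest validation error."""
    grid = np.round(np.arange(0.0, 1.01, 0.1), 1)
    _, alpha, beta = min((train_and_score(a, b), a, b)
                         for a in grid for b in grid)
    return alpha, beta
```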
Table 2
Test error rate in percent ± the standard deviation for the three databases using pre-emphasized SDAE-3 classifiers. The values for α, β obtained by validation are given in parentheses. A star (∗) indicates suboptimal results (see the text for a discussion). Best results appear in boldface.

Pre-emphasis    Aux. classifier   MNIST           MNIST-B        RECT
None (MLP)      –                 2.66 ± 0.10     4.44 ± 0.23    7.20 ± 0.15
None (SDAE-3)   –                 1.58 ± 0.06     3.42 ± 0.10    2.40 ± 0.13
Initial         MLP               0.40 ± 0.04∗    0.82 ± 0.01    0.92 ± 0.10
                                  (0.3, 0.6)      (0.3, 0.5)     (0.4, 0.4)
Initial         SDAE-3            0.37 ± 0.01∗    0.72 ± 0.01    0.87 ± 0.04
                                  (0.4, 0.5)      (0.3, 0.5)     (0.4, 0.3)
Final           MLP               0.57 ± 0.00     0.91 ± 0.03    1.26 ± 0.04
                                  (0.4, 0.6)      (0.3, 0.5)     (0.6, 0.3)
Final           SDAE-3            0.67 ± 0.05∗    0.83 ± 0.02    1.31 ± 0.02∗
                                  (0.4, 0.4)      (0.3, 0.5)     (0.4, 0.3)
Table 2 shows the results of the corresponding experiments. It can be seen that any of the four pre-emphasis mechanisms improves the performance of the conventional SDAE-3 classifiers, but also that the initial pre-emphasis with the SDAE-3 classifier guide is clearly the best option. This confirms our expectations, and also reveals that it is important to apply the pre-emphasis even to the auto-encoding process. This result can be easily explained: A better representation of the training samples that are more important to define the classification boundaries produces performance benefits.
As in the diversification process, there appears a clear parallelism between the performance surfaces vs. α, β for the validation and the test sets. However, these surfaces are not so smooth, and this produces some sub-optimal results –other pairs of α, β values would offer better test performance,– that are indicated by asterisks. Differences are very minor, the most relevant case being the design with the SDAE-3 classifier guide and initial emphasis for MNIST: α = 0.3, β = 0.5 would give a 0.36% average error rate with a (practically) zero standard deviation. Clearly, a very minor change with respect to the validated result.

We can say that the results are very good. In particular, 0.37, 0.72, and 0.87% average error rates for MNIST, MNIST-B, and RECT, respectively, with a single SDAE-3 classifier are excellent: It must be considered that the operation computational load does not increase, and that the design requires only 11 × 11 = 121 times that of a conventional single SDAE-3 classifier, due to the validation of α, β. Note that the above error rates are 23%, 21%, and 36% of those of the single SDAE-3 classifiers without pre-emphasis for MNIST, MNIST-B, and RECT, respectively. The higher improvements for the multi-class problems are easy to understand: Helping to find better classification boundaries is more important when there are multiple boundaries.

Let us remark the critical importance of using general and flexible enough weighting formulas such as (1) and (2): All the validation-selected pairs have values of α, β far from 0 and 1, which correspond to more limited emphasis forms. To show the degradation that limited weighting schemes produce, it must suffice to list their performance for the MNIST dataset with an SDAE-3 classifier guide and initial emphasis (% average error rates):

• α = 1 (no emphasis): 1.58 (that of the conventional SDAE-3 classifier)
• α = 0, β = 1 (full error emphasis): 0.85
• α = 0, β = 0 (full proximity emphasis): 0.71
• α = 0, β = 0.6 (full combined emphasis): 0.58
• α = 0.2, β = 1 (moderated error emphasis): 0.77
• α = 0.2, β = 0 (moderated proximity emphasis): 0.58

All the above performances are clearly worse than the average error rate which α = 0.4, β = 0.5 offers, 0.37%.

Table 3
Performance results (% average error rate ± standard deviation) for the two ensembles that are trained with initial pre-emphasis. PrE T(ECOC) S corresponds to the best ensemble of Section 6.1, and PrE ECOC S corresponds to an ensemble whose first diversity is the ECOC binarization. The validated values of α, β are also shown for the first design. Best results appear in boldface.

                 MNIST          MNIST-B        RECT (no binarization)
PrE T(ECOC) S    0.30 ± 0.01    0.62 ± 0.01    0.76 ± 0.02
  (α, β)         (0.2, 0.4)     (0.2, 0.6)     (0.4, 0.3)
PrE ECOC S       0.26 ± 0.04    0.55 ± 0.04    −

6.2.2. Pre-emphasizing binarized and diversified SDAE-3 classifiers
For the sake of brevity, we will present here the results of applying initial pre-emphasis with the conventional SDAE-3 classifier guide only for the best ensembles of those presented in Section 6.1, TS with ECOC binarization (simply TS for RECT) –results for the other cases are worse– and, in the cases of multi-class problems, for an alternative which permits applying a different pre-emphasis for each ECOC binary problem: First, the ECOC is applied, and then the DAE3 auto-encoders plus the final switching ensembles complete each branch, and values of α, β are separately selected for those branches. The final aggregation is the same voting plus Hamming-distance-based selection. We will also keep the non-trainable parameters previously used, since there is not a significant sensitivity with respect to them. We will indicate these pre-emphasized ensembles as PrE T(ECOC) S and PrE ECOC S, respectively.

The experimental results appear in Table 3.
Comparing the first row of these results with those of the last row of Table 1 and the fourth row of Table 2 makes evident that combining pre-emphasis and diversity (including binarization for multi-class problems) produces significant improvements in performance. And the results that appear in the last row of Table 3 indicate that separate pre-emphasis is even more effective to increase the classification performance. Obviously, separate pre-emphases require more design computational effort, because their α, β parameters must be independently selected by means of the corresponding validation processes. However, the operation computational load for both pre-emphasized ensembles is of the same order of magnitude, around 1.5 × 10^9 multiplications.

Once more, we insist: Such a huge computational effort is the price to be paid to get the exceptional performances of these ensembles of SDAE-3 classification machines in the image problems we consider in our experiments, which are, for the PrE ECOC S model, 16% and 16% of the error rates of a conventional single SDAE-3 classifier for MNIST and MNIST-B, respectively.
Although the above results are excellent –note that we get an error rate for MNIST which is only 1.24 times the absolute record, which was reached with a CNN ensemble [29]– the question of what are the limits of these kinds of approaches with general-purpose, computationally expensive DNNs emerges. Direct attempts at improving the performance, such as a second validation round with a finer grid of values for α, β, are not the solution: A two-digit second round search for the PrE ECOC S ensemble leads to “only” 0.24 ± 0.08 and 0.52 ± 0.06 average error rates ± standard deviation for MNIST and MNIST-B, respectively. Yet the answer is to check if other improvement mechanisms can be combined with those we have already applied, as the success of adding pre-emphasis to binarization and conventional diversification suggests.

Table 4
% average error rate ± standard deviation for the PrE ECOC S ensemble when training samples are augmented by using the Elastic Distortion (ED) process described in the main text.

                 MNIST          MNIST-B
ED PrE ECOC S    0.19 ± 0.01    0.50 ± 0.03
In the next section, we will explore the additional application of a data augmentation technique, elastic deformations.
7. Adding elastic deformations
Data augmentation has been a traditional complement of handwritten digit classification algorithms [76], leading to very good performances [28], in particular when applied for building ensembles [29]. Therefore, adding it to the above designs is an interesting possibility.
We presented an overview of the procedure we will apply here, an elastic distortion [78], in Section 4.2. We work with diverse distortions, selecting five combinations of parameters that produce visually acceptable results: {scale, σ} = {3, 30}, {4, 20}, {4, 30}, {5, 10}, and {5, 20}, where the scale parameter is applied to horizontal and vertical displacement matrices that are formed by [−0.1, 0.1] uniform random values. We explore the appropriate generation rate (of distorted samples with respect to original examples) among 100, 200, 300, and 400% the number of training samples, finding that 300% is good for most of the designs. Since we are including distorted random samples, we also explore the added noise level around the 10% value which was previously applied, selecting 7% as the appropriate variance.
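The following sketch shows this elastic distortion step (NumPy plus SciPy); it follows the usual formulation of [78], in which the random displacement fields are Gaussian-smoothed before being applied, so it may differ in minor details from the authors' exact implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_distort(image, scale=4, sigma=20, rng=None):
    """image: (28, 28) array in [0, 1]; (scale, sigma) as listed above.
    Returns an elastically distorted copy of the image."""
    rng = rng or np.random.default_rng()
    h, w = image.shape
    # random displacement fields in [-0.1, 0.1], scaled and smoothed
    dx = gaussian_filter(scale * rng.uniform(-0.1, 0.1, (h, w)), sigma)
    dy = gaussian_filter(scale * rng.uniform(-0.1, 0.1, (h, w)), sigma)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # displaced coordinates, limited to the image borders
    coords = [np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)]
    return map_coordinates(image, coords, order=1)
```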
Keeping the rest of the non-trainable parameters at their values, except α, β, which are validated as before, we obtain the % error rates ± standard deviations of Table 4 when adding the above elastic distortions for training the PrE ECOC S designs. As before, 10 runs are considered.

It can be seen that there are significant further performance improvements, and this supports the intuition that combining the elastic distortion with the previously applied pre-emphasis plus binarization and conventional diversification is effective. We remark that the performance for MNIST is a new absolute record, superior to the performance of the CNN ensembles of [29], even though CNNs are machines more adapted to the handwritten digit recognition task. These facts permit to conjecture that combining diversity with other improvement mechanisms having different natures will very likely provide performance advantages when working with DNNs.
To appreciate the effectiveness of our record design, Fig. 5 shows the wrongly classified digits for a typical run. It is easy to see that these are difficult samples even for a human expert.
Once again, a high increase of computational costs is the price to be paid in order to obtain these important performance improvements. In particular, the record design has an architecture that differs from the ECOC T considered in Section 6.1.2 only in including 15 SDAE-3, one for each ECOC binary problem. Since the multiplications and sigmoidal transformations in these SDAE-3 are two orders of magnitude lower than those of the conventionally diversified final classifiers, we have again around 1.5 × 10^9 multiplications and 1.5 × 10^6 sigmoidal transformations for the operation computational cost. We remark that both pre-emphasis and elastic deformations increase the design effort, but they do not modify the size of the designed machine and, consequently, they do not change the operation computational load.
Fig. 5. Misclassified MNIST digits in a typical run of the ED PrE ECOC S design. k → l: true class → classification result.
8. Conclusions and further work
Applying a general-purpose representation basic DNN, the SDAE-3 classifier, to a selected number of datasets (MNIST, MNIST-BASIC, and RECTANGLES) that are relatively simple but that we chose to allow appreciating the effects of their different characteristics, we have found the following experimental facts:
∗ Building diverse ensembles (by means of bagging and switching, in particular) is efficient to increase classification performance, but multi-class problems require the simultaneous application of binarization techniques, which, by themselves, only produce modest advantages.
∗ Conventional diversification is more effective when applied at the last classification steps.
∗ Applying pre-emphasis sample weighting is also effective, in particular when the weighting formulas are general and flexible. These general and flexible forms produce very important performance improvements, without increasing the computational load to classify unseen samples.
∗ Combining pre-emphasis with binarization and conventional diversity further improves the performance results, in a very remarkable manner when a different pre-emphasis is applied to each binary problem in multi-class situations.
∗ Adding to the above combination an elastic distortion process to create appropriate additional training samples produces once more an increased performance, by no means trivial: This way has led us to a new absolute record in classifying MNIST digits.
∗ All the above techniques are useful when they are combined, if the appropriate forms are selected.
Of course, much more work is necessary to check what is the advantage that these methods and their combinations provide when addressing other classification problems and/or using other DNN classifiers, and we are actively working in this direction. We warn that the CNN is a delicate architecture which poses serious difficulties for conventional methods of building ensembles but, even if this approach is not successful, there are many additional possibilities that can be explored to improve its performance –and that of other DNNs,– and it is plausible that combining them will permit higher performance advantages, at least if the combined procedures have different characters, i.e., different conceptual reasons to produce performance improvements. And let us clarify that we are speaking not only of pre-emphasis or elastic distortion, but also of other data augmentation techniques, more elaborate methods of noisy learning (considering the problem to be addressed), and regularization mechanisms, including drop-out or drop-connect.
Note
At http://www.tsc.uc3m.es/~ralvear/Software.htm, the interested reader can access the software blocks we have developed for our experiments, as well as find links to other blocks we have used.
Acknowledgements
This work has been partly supported by research grants CASI-CAM-CM (S2013/ICE-2845, Madrid Community) and Macro-ADOBE (TEC2015-67719, MINECO-FEDER EU), as well as by the research network DAMA (TIN2015-70308-REDT, MINECO).
We gratefully acknowledge the help of Prof. Monson H. Hayes in discussing and preparing this work.
References
[1] G. Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signals Syst. 2 (1989) 303–314.
[2] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (1989) 359–366.
[3] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard,
L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neu- ral Comput. 1 (1989) 541–551.
[4] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (2009) 1–127.
[5] J. Schmidhuber, Deep Learning in Neural Networks: An Overview, Technical report, IDSIA-03-14, University of Lugano, 2014. arXiv: 1404.7828v4 [cs.NE].
[6] G.E. Hinton, S. Osindero, Y. Teh, A fast learning algorithm for deep belief net- works, Neural Comput. 18 (2006) 1527–1554.
[7] P. Smolensky, Information processing in dynamical systems: foundations of harmony theory, in: D.E. Rumelhart, J.L. McClelland, The PDP Research Group (Eds.), Parallel Distributed Processing: Exploration in the Microstructure of Cognition. Vol. 1: Foundations, Cambridge, MA: MIT Press, 1986, pp. 194–281.
[8] M.A.C.-P. Carreira-Perpiñán, G.E. Hinton, On contrastive divergence learning, in:Proceedings of the 10th International Workshop on Artiﬁcial Intelligence and Statistics, Barbados: Soc. Artiﬁcial Intelligence and Statistics, 2005, pp. 33–40.
[9] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.A. Manzagol, Stacked denoisingautoencoders: learning useful representations in a deep network with a localdenoising criterion, J. Mach. Learn. Res. 11 (2010) 3371–3408.
[10]Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach.Intell. 35 (2013) 1798–1828.
[11]M.M. Najafabadi, et al., Deep learning application and challenges in big data analytics, J. Big Data 2 (1) (2015) 9.
[12]P.P. Brahman, D. Wu, Y. She, Why deep learning works: a manifold disentangle- ment perspective, IEEE Trans. Neural Netw. Learn. Syst. 27 (2016) 1997–2008. [13] L. Deng, D. Yu, Deep convex net: a scalable architecture for speech pat-
tern classiﬁcation, in: Proceedings of Interspeech 2011, Florence, Italy, 2011,pp. 2285–2288.
[14]B. Hutchinson, L. Deng, D. Yu, A deep architecture with bilinear modeling ofhidden representations: applications to phonetic recognition, in: Proceedingsof IEEE International Conference on Acoustics, Speech and Signal Processing,New York, NY: IEEE Press, 2012, pp. 4 805–4 808.
[15]X. Glorot, Y. Bengio, Understanding the diﬃculty of training deep feedforwardneural networks, in: Proceedings of the 13th International Conference on Ar- tiﬁcial Intelligence and Statistics, JMLR: W&CP 9, Chia Laguna Resort, Sardinia,Italy, 2010, pp. 249–256.
[16]J. Martens, Deep learning via Hessian-free optimization, in: Proceedings of the 23th International Conference on Machine Learning, Haifa (Israel), 2010, pp. 735–742.
[17]J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learningand stochastic optimization, J. Mach. Learn. Res. 12 (2011) 2121–2159.
[18]V. Nair, G.E. Hinton, Rectiﬁer linear units improve resticted Boltzmann ma- chines, in: Proceedings of the 23th International Conference on MachineLearning, Haifa (Israel), 2010, pp. 807–814.
[19]S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by
reducing internal covariate shift, Proceedings of the 32th International Confer- ence on Machine Learning, JMLR: W&CP28. Lille (France), 2015.
[20]I. Sutskever, G.E. Hinton, Deep, narrow sigmoid belief networks are universalapproximators, Neural Comput. 20 (2008) 2629–2636.
[21]G. Montufar, N. Ay, Reﬁnements of universal approximation results for deepbelief networks and restricted Boltzmann machines, Neural Comput. 23
[22] L. Szymanski, B. McCane, Deep networks are effective encoders of periodicity, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014) 1816–1827.
[23] J. Håstad, M. Goldmann, On the power of small-depth threshold circuits, Comput. Complexity 1 (1991) 113–129.
[24] O. Delalleau, Y. Bengio, Shallow vs. deep sum-product networks, in: J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 24, Cambridge, MA: MIT Press, 2011, pp. 666–674.
[25] L. Deng, D. Yu, Deep learning: methods and applications, Found. Trends Signal Process. 7 (2014) 197–387.
[26] Deep Learning University, An annotated deep learning bibliography, http://memkite.com/deep-learning-bibliography/.
[27] D. Mishkin, J. Matas, All you need is a good init, Technical report, IDSIA-03-14, University of Lugano, 2015. arXiv: 1511.06422v7 [cs.LG].
[28] S. Tabik, D. Peralta, A. Herrera-Poyatos, F. Herrera, A snapshot of image pre-processing for convolutional neural networks: case study of MNIST, Int. J. Comput. Intell. Syst. 10 (2017) 555–568.
[29] D. Ciresan, U. Meier, J. Schmidhuber, Multi-column deep neural networks for image classification, in: Proceedings of the Conference on Computer Vision and Pattern Recognition, New York, NY: IEEE Press, 2012, pp. 3642–3649.
[30] C.K. Chow, Statistical independence and threshold functions, IEEE Trans. Electron. Comput. 14 (1965) 66–68.
[31] A.J. Sharkey, Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, London, UK: Springer, 1999.
[32] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Hoboken, NJ: Wiley, 2004.
[33] L. Rokach, Pattern Classification Using Ensemble Methods, Singapore: World Scientific, 2010.
[34] R.E. Schapire, Y. Freund, Boosting: Foundations and Algorithms, Cambridge, MA: MIT Press, 2012.
[35] M. Woźniak, M. Graña, E. Corchado, A survey of multiple classifier systems as hybrid systems, Inf. Fusion 16 (2014) 3–17.
[36] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[37] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[38] L. Breiman, Randomizing outputs to increase prediction accuracy, Mach. Learn. 40 (2000) 229–242.
[39] G. Martínez-Muñoz, A. Sánchez-Martínez, D. Hernández-Lobato, A. Suárez, Class-switching neural network ensembles, Neurocomputing 71 (2008) 2521–2528.
[40] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139.
[41] R.E. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated predictions, Mach. Learn. 37 (1999) 297–336.
[42] V. Gómez-Verdejo, M. Ortega-Moral, J. Arenas-García, A.R. Figueiras-Vidal, Boosting by weighting critical and erroneous samples, Neurocomputing 69 (2006) 679–685.
[43] V. Gómez-Verdejo, J. Arenas-García, A.R. Figueiras-Vidal, A dynamically adjusted mixed emphasis method for building boosting ensembles, IEEE Trans. Neural Netw. 19 (2008) 3–17.
[44] Y. Liu, X. Yao, Ensemble learning via negative correlation, Neural Netw. 12 (1999) 1399–1404.
[45] Y. Liu, X. Yao, Simultaneous training of negatively correlated neural networks, IEEE Trans. Syst. Man Cybern. Pt. B-Cybern. 29 (1999) 716–725.
[46] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, Adaptive mixtures of local experts, Neural Comput. 3 (1991) 79–87.
[47] S.E. Yuksel, J.N. Wilson, P.D. Gader, Twenty years of mixture of experts, IEEE Trans. Neural Netw. Learn. Syst. 23 (2012) 1177–1193.
[48] A. Omari, A.R. Figueiras-Vidal, Feature combiners with gate-generated weights for classification, IEEE Trans. Neural Netw. Learn. Syst. 24 (2013) 158–163.
[49] T. Hastie, R. Tibshirani, Classification by pairwise coupling, Ann. Stat. 26 (1998) 451–471.
[50] K. Duan, S. Keerthi, W. Chu, S. Shevade, A. Poo, Multi-category classification by soft-max combination of binary classifiers, in: Proceedings of the 4th International Conference on Multiple Classifier Systems, Berlin, Heidelberg: Springer-Verlag, 2003, pp. 125–134.
[51] T.G. Dietterich, G. Bakiri, Solving multiclass learning problems via error-correcting output codes, J. Artif. Intell. Res. 2 (1995) 263–286.
[52] D.C. Ciresan, U. Meier, L.M. Gambardella, J. Schmidhuber, Convolutional neural network committees for handwritten character classification, in: Proceedings of the 11th International Conference on Document Analysis and Recognition, New York, NY: IEEE Press, 2011, pp. 1135–1139.
[53] J. Xie, B. Xu, Z. Chuang, Horizontal and vertical ensemble with deep representation for classification (2013). arXiv: 1306.2759v1 [cs.LG].
[54] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions (2014). arXiv: 1409.4842v1 [cs.CV].
[55] X. Frazão, L.A. Alexandre, Weighted convolutional neural network ensemble, in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Zug, Switzerland: Springer, 2014, pp. 674–681.
[56] O. Nina, C. Rubiano, M. Shah, Action recognition using ensemble of deep convolutional neural networks, CRCV Tech. Report, Univ. of Central Florida, 2014.
[57] T. Zhao, Y. Zhao, X. Chen, Building an ensemble of CD-DNN-HMM acoustic model using random forests of phonetic decision trees, in: Proceedings of the IEEE 9th International Symposium on Chinese Spoken Language Processing, Singapore, 2014, pp. 98–102.
[58] L. Shao, D. Wu, X. Li, Learning deep and wide: a spectral method for learning deep networks, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014) 2303–2308.
[59] T. Xia, D. Tao, T. Mei, Y. Zhang, Multiview spectral embedding, IEEE Trans. Syst. Man Cybern. Part B 40 (2010) 1438–1446.
[60] A.J. Simpson, Instant learning: parallel deep neural networks and convolutional bootstrapping (2015). arXiv: 1505.05972.
[61] S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, D. Batra, Why M heads are better than one: training a diverse ensemble of deep networks (2015). arXiv: 1511.06314v1 [cs.CV].
[62] R.F. Alvear-Sandoval, A.R. Figueiras-Vidal, Does diversity improve deep learning? in: Proceedings of the 23rd European Signal Processing Conference (EUSIPCO), Nice (France), 2015, pp. 2541–2545.
[63] S. Lee, S. Purushwalkam, M. Cogswell, V. Ranjan, D. Crandall, D. Batra, Stochastic multiple choice learning for training diverse deep ensembles, in: Proceedings of the 29th Conference on Advances in Neural Information Processing Systems, Barcelona (Spain), 2016.
[64] N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.
[65] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, R. Fergus, Regularization of neural networks using dropconnect, in: Proceedings of the 30th International Conference on Machine Learning, JMLR: W&CP 28, Atlanta, GA, 2013, pp. 1058–1066.
[66] P. Hart, The condensed nearest neighbor rule, IEEE Trans. Inf. Theory 14 (1968) 515–516.
[67] P.W. Munro, Repeat until bored: a pattern selection strategy, in: Proceedings of the 4th Conference on Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann, 1991, pp. 1001–1008.
[68] C. Cachin, Pedagogical pattern selection strategies, Neural Netw. 7 (1994) 171–181.
[69] S.H. Choi, P. Rockett, The training of neural classifiers with condensed datasets, IEEE Trans. Syst. Man Cybern. Pt. B-Cybern. 32 (2002) 202–206.
[70] D. Gorse, A.J. Shepherd, J.G. Taylor, The new ERA in supervised learning, Neural Netw. 10 (1997) 343–352.
[71] A. Lyhyaoui, et al., Sample selection via clustering to construct support vector-like classifiers, IEEE Trans. Neural Netw. 10 (1999) 1474–1481.
[72] S. El-Jelali, A. Lyhyaoui, A.R. Figueiras-Vidal, Designing model based classifiers by emphasizing soft targets, Fundamenta Informaticae 96 (2009) 419–433.
[73] L. Franco, S.A. Cannas, Generalization and selection of examples in feedforward neural networks, Neural Comput. 12 (2000) 2405–2426.
[74] R.F. Alvear-Sandoval, A.R. Figueiras-Vidal, An experiment in pre-emphasizing diversified deep neural classifiers, in: Proceedings of the 24th European Symposium on Artificial Neural Networks, Bruges (Belgium), 2016, pp. 527–532.
[75] R.F. Alvear-Sandoval, M.H. Hayes, A.R. Figueiras-Vidal, Improving denoising stacked auto-encoding classifiers by pre-emphasizing training samples, Neurocomputing (2016), submitted.
[76] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (1998) 2278–2324.
[77] M.A. Nielsen, Neural Networks and Deep Learning, Determination Press (online), 2015.
[78] M. O'Neill, Standard reference data program NIST, 2006, http://www.codeproject.com/kb/library/NeuralNetRecognition.aspx.