A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data

(1)

Contents lists available at ScienceDirect

Neurocomputing

journal homepage: www.elsevier.com/locate/neucom

A

simple

plug-in

bagging

ensemble

based

on

threshold-moving

for

classifying

binary

and

multiclass

imbalanced

data

Guillem

Collell

a,d

_,

_Drazen

_Prelec

a,b,c

_,

_Kaustubh

_R.

_Patil

a,e,∗

a_MIT_Sloan_{Neuroeconomics}_Lab,_{Massachusetts}_Institute_of_Technology,_Cambridge,_MA_02139,_USA b_Department_of_Economics,_{Massachusetts}_Institute_of_Technology,_Cambridge,_MA_02139,_USA c_Brain_&_Cognitive_Sciences,_{Massachusetts}_Institute_of_Technology,_Cambridge,_MA_02139,_USA d_Computer_Science_Department,_KU_Leuven,_Heverlee_3001,_Belgium

e_Institute_of_Neuroscience_and_Medicine,_Brain_&_Behaviour_(INM-7),_Research_Centre_Jülich,_Jülich_52425,_Germany

a

r

t

i

c

l

e

i

n

f

o

Articlehistory: Received 10 October 2016 Revised 26 July 2017 Accepted 29 August 2017 Available online xxx Communicated By Dr. V. Palade Keywords: Imbalanced data Binary classiﬁcation Multiclass classiﬁcation Bagging ensembles Resampling Posterior calibration

a

b

s

t

r

a

c

t

Classimbalancepresentsamajorhurdleintheapplicationofclassiﬁcationmethods.Acommonlytaken approachistolearnensemblesofclassiﬁersusingrebalanceddata.Examplesincludebootstrapaveraging (bagging)combined witheither undersamplingoroversamplingoftheminority classexamples. How-ever,rebalancingmethodsentailasymmetricchangestotheexamplesofdifferentclasses,whichinturn canintroduce theirownbiases.Furthermore, thesemethods oftenrequirespecifying the performance measureofinterestapriori,i.e.,beforelearning.Analternativeistoemploythethresholdmoving tech-nique,whichapplies athresholdtothecontinuousoutputofamodel,offeringthepossibilitytoadapt toaperformancemeasureaposteriori,i.e.,aplug-inmethod.Surprisingly,littleattentionhasbeenpaid tothiscombinationofabaggingensembleandthreshold-moving.Inthispaper, westudythis combi-nationanddemonstrateitscompetitiveness.Contrarytotheotherresamplingmethods,wepreservethe

naturalclassdistributionofthedataresultinginwell-calibratedposteriorprobabilities.Additionally,we extendtheproposedmethodtohandlemulticlass data.Wevalidatedourmethodonbinaryand mul-ticlassbenchmarkdatasets byusingboth, decisiontrees and neuralnetworks as baseclassiﬁers. We performanalysesthatprovideinsightsintotheproposedmethod.

1. Introduction

Dealing with a class imbalance in classiﬁcation is an impor-tantproblemthatposesmajorchallenges[1].Imbalanceddatasets frequently appear in real-world problems, such as in fault and anomaly detection [2,3], fraudulent phone call detection [4] and medicaldecision-making[5],tonameafew.Standardlearning al-gorithms are often guided by global error rates and hence may ignore instances of the minority class, leading to models biased towards predictingthe majorityclass.Severalmethodshavebeen proposedtoalleviatethisproblem(see,e.g.,[6,7]forreviews). Of-ten,aﬁrstchoiceconsistsofpreprocessingthedatabyresampling tobalancetheclassdistribution[8,9].Thisisoftenachievedby ei-ther randomly oversampling(ROS) the minority class [9] or ran-domly undersampling(RUS)the majorityclass [10].More sophis-∗ _{Corresponding author at: MIT Sloan Neuroeconomics Lab, Massachusetts Insti-}

tute of Technology, Cambridge, MA 02139, USA.

E-mailaddresses:[email protected](G. Collell), [email protected](D. Prelec), [email protected](K.R. Patil).

ticated methods that generate synthetic minority class instances arealsoapopularchoice,e.g.,thesyntheticminorityoversampling technique (SMOTE [9]). We will collectively call these data pre-processing methods as rebalancing mechanisms as they, in gen-eral,aimtomake thetrainingdatamore balanced.Thiswill also avoidconfusionwithother resamplingmechanisms,e.g., the sim-ple bootstrap. Rebalancingis often combined withensembles as they show superior performance to a single classiﬁer [11]. Many such combinations have been shown to be effective for imbal-anceddata classiﬁcation [6,12,13]. However, there areseveral po-tentialdrawbacksofrebalancingmethods:(1)potentiallossof in-formativedatawhenundersampling,(2)changesintheproperties ofthedata, suchasasymmetric changes inthe densityof exam-plesofdifferentclasses,whichinturncancausethemodelsto in-duceunwanted biases,e.g.,miscalibratedposteriorprobability es-timates[14,15],(3)itisoftennotevidentwhichclassdistributions tousefora givendatasetandaperformance measureof interest [16](wrapper methods [17] can be employed to tune the model fora givenmeasure,butthey are computationally expensiveand oftencater towardsonlyasinglemeasure,e.g.,eitheraccuracyor http://dx.doi.org/10.1016/j.neucom.2017.08.035

(2)

F1-score),and(4)itisnontrivialtoextendthesamplingheuristics normallydeﬁnedforbinarydatatomulticlassdataastherecanbe multipleminority/majorityclasses[18].

Moving decision thresholds isanother technique to deal with class imbalance. The main difference between rebalancing and threshold-based methods is that the former relies on data pre-processingbefore learning happens, whereas the latter relies on manipulating the continuous output of a learned model, e.g., classweights orposterior probabilities.Among other proponents, Provost[19]advocatedforthreshold-moving asa methodtodeal withclassimbalance.Nevertheless,surprisingly,littleattentionhas beenpaidto thistechnique, oftentoan extentthatitisnot even consideredforcomparisonwhennewmethodsareproposed.

While this technique has been utilized in combination with some popularlearning methods includinga smallensemble [19– 21]. However, to our knowledge, the combination of threshold-movingwitha bagging ensemblehas notbeen thoroughly inves-tigated.As isevident, threshold-moving dependsonreliable con-tinuousestimationoftheoutput;therefore,baggingensemblesare a good candidateto combine withthreshold-moving as they are knowntoprovidegoodprobabilityestimates[22,23].Inthiswork, westudythreshold-movingcombinedwithbaggingensemblesand showthatitisacompetitivemethodwithseveraladvantages.

In particular, we seek a method that provides well-calibrated posterior probability estimates. An important advantage of such a method is that it can be utilized as a plug-in method where the thresholdcan be set a posteriori, i.e., at the test phase. This providesanopportunitytoachievegoodperformanceondifferent measuresusingthesamemodel[24].Thisisamajorimprovement overother methods,e.g., cost-sensitive methods andrebalancing, which require the performance measure of interest to be speci-ﬁed at the learning phase. Here, we propose Probability Thresh-oldbagging (PT-bagging)that, aswe will show,passes asa plug-inmethod.Themain motivationbehindPT-baggingisto leverage theadvantagesofbaggingwhileavoidingtheproblemsthat rebal-ancingmethodsinevitablyentail,asdescribedabove.Theproposed methodPT-bagging addresses those problemsand possesses sev-eraldesirableproperties:

(1)It isaplug-in methodthat maximizesaperformance mea-sureofinterestwithoutretraining,butratherbyjust apply-ing anappropriatethresholdaposteriori.Bycontrast, rebal-ancing methods are not ﬂexible and need computationally expensiveparametertuning,e.g.,toﬁndwhichclass propor-tionstouseforlearningviaawrapperapproach[17]. (2)It consistently performs close to the best possible

macro-accuracy and macro F1 performances without the need to empirically ﬁnd the optimal threshold (e.g., by cross-validation).Obtainingavalidationsetfortuningcanbe com-putationally costly, mightnot always be possible,or might beﬁnanciallyprohibitive(e.g.,duetodatacollectioncosts). (3)Itcanbeextendedtohandlethemulticlasssettingwhen

ap-propriate thresholds fora performance measure ofinterest areavailable,e.g.,macro-accuracy.

We provide a theoretical analysis on when optimal macro-accuracyperformanceisguaranteed.However,forothermeasures, such as the macro F1-score, it is not always possible to obtain a closed-formexpression forthe optimal thresholds [25]. Never-theless, we show that ournew, simple andsensible threshold is closetotheoptimalthreshold,andthatPT-baggingachieveshigher macroF1-score performance compared to other methods. In this respect,wemaketwoadditionalcontributions:(1)theproposalof athresholdformaximizingthemacroF1-score,and(2)a compar-ison andanalysis of the full potential of the methods, which we deﬁne as their maximum attainable performance if the optimal thresholdwereknown.

Therestofthispaperisorganizedasfollows:inSection 2we provide the relevant background, describe some popular resam-plingmethods, anddiscusstheir potential flaws. InSection 3,we describeourproposed method,PT-bagging,andprovidea theoret-icaljustificationofits performance.InSection 4,we describeour experimentalsetup. InSection5,wepresentacomprehensiveset ofempiricaltestsanddiscusstheresults.Finally,wecommenton the implications of our findings and propose future lines of re-search.

2. Background

We considerthe standard classiﬁcationsetting wherea learn-ingalgorithmlearnsfromthetrainingdatatuples

{

xi,yi

}

Ni=1,where xi∈Xare features that can be either continuous, ordinal or cate-goricalandyi ∈C=

{

1,..., m

}

arediscreteclasslabels.Thegoalof learningistoestimateapredictor fˆ:X_→Cthatapproximatesthe true underlying function f: X→C. The model learned, fˆ, is then used tomake predictionson unseentest data

{

xj

}

M_j=1.For binary data,wehaveyi∈ {0,1}andwithoutlossofgeneralitywedenote theminorityclass(i.e.,theclasswithlowerfrequencyinthe train-ing data) asthe class 1.We refer to the class-speciﬁc thresholds as

λ

i, i=1,..., m.Theirapplicationtotheclassiﬁeroutputis de-scribedbelow(Algorithm1,step2.4).Wemaketwoassumptions: (1)theprobabilitydistributionofthetestdataissimilartothatof thetrainingdata,and(2)theclassdistributionofthetrainingdata provides anaccurate estimate oftheirrespectiveunderlying prior probabilities.

2.1. Performancemeasuresforimbalanceddata

Thecommonlyusedmeasureofaccuracy(correctclassiﬁcation rate)isagoodmetricwhendatasetsarebalanced.However,itcan bemisleadingforimbalanceddata.Forexample,thenaïvestrategy ofclassifyingalltheexamplesintothemajorityclasswouldobtain 99%accuracyinadatasetcomposedof99%examplesofthisclass. Therefore,othermeasures arenecesarywhendealingwith imbal-anceddata.

Several performance measures have been proposed in imbal-ancedlearning,allofwhicharecomputablefromtheelementsof theconfusionmatrix(Table1).Someofthemostextensivelyused measuresare: TNR = TN TN +FP ; Recall

(

= TPR

)

= TP TP +FN ; Precision = TP TP +FP ; FPR = FP TN +FP Macro −accuracy = TPR +TNR 2 ; G −mean =

TPR × TNR ;

F1 −score = 2 × Precision × Recall

Precision +Recall =

2TP

2TP +FP +FN

ThemacroF1-scoreisawidelyusedmeasureandiscalculated byconsideringeachclassseparatelyasthepositiveclassandthen averaging their corresponding F1-scores. Inaddition, the receiver operatingcharacteristic(ROC)curveisoftenemployed[6].TheROC curveisgeneratedbyplottingtheTPR(y-axis)andtheFPR(x-axis) whilemovingthroughthewholespectrumofdecisionthresholds.

Table1

Confusion matrix in binary classiﬁcation.

Predicted positive Predicted negative Actual positive TP (true positive) FN (false negative) Actual negative FP (false positive) TN (true negative)

(3)

Thearea undertheROCcurve(AUROC)– generallycomputed nu-merically– isoftenameasureofinterestthatprovidesasummary oftheROCcurve asasinglenumber.However, ROCcurvessuffer from a serious limitation forevaluating performance under class imbalance. When data are highlyimbalanced, ROC curves fail to capturelarge changes in the numberoffalse positives (FP)since the denominatorofFPR islargely dominatedby TN.Forthis rea-son,precision-recall(PR)curvesarepreferredoverROCcurvesfor imbalanced data[7]andarethereforeour choicehere. PRcurves are computed by moving the decision thresholdand plotting re-call(x-axis)andprecision(y-axis).AnalogoustotheROCcurve,the area underthePR curve(AUCPR)istypically employedasa sum-marymeasure.

2.2. Learningfromimbalanceddata

Many solutions have been proposed to deal with imbalanced data. These solutions mainly fall into one of the following three majorstrategies:(1)cost-sensitivelearning, (2)rebalancing mech-anisms,and(3)threshold-moving.Thesestrategiesarebrieﬂy dis-cussedbelow(foradetailedaccountsee[6,7,19]).

(1) Cost-sensitivelearningplacesdifferentmisclassification costs onthedifferentclasses.Highermisclassificationcostsforthe minorityclasscanbeimposedbyalossfunction.Infact,by altering the training class distribution, rebalancing mecha-nismseffectivelyimposedifferentmisclassificationcostsand canbedeemedasequivalenttocost-sensitivelearning[26]. For this reason, we only consider rebalancing methods in thisarticle.

(2) Rebalancing mechanisms resample the data to make the trainingdatamore balanced.Such datapreprocessing solu-tions do not require modifying the learning algorithm and thereforecanbeemployedwithexistinglearningalgorithms. (3) In threshold-moving, a model is learned from the data set witheithertheoriginalormodiﬁedclassproportionsandits continuousoutputisconvertedintoaclasslabelbyapplying anappropriatethreshold.

The techniquesrelevant tothis work arediscussed inthe fol-lowingsections.

2.3. Baggingensemble

Ensemble methodsuseasetofclassiﬁerstoimproveupon in-dividualclassiﬁers’performance[27].Twopopularensemble tech-niques are boosting andbagging [28]. In thiswork, we focus on baggedensemblessince,ingeneral,boostedensemblesdonot per-formbetterwithimbalanceddata[6].

In bagging (Algorithm 1), a set of base classiﬁersare learned from different samples of the given training set. At the predic-tion time, the outputs of the baseclassiﬁers are aggregated. The mainprinciplebehindbagging’sperformance,giventhateachbase

classifierperformsabovechance,isthattheaveragingreducesthe varianceofindividualclassifierswithoutincreasingtheirbias. Bag-ginggenerallyperformswellwithunstablebaselearnersforwhom small changes in the training data lead to large changes in the learned model [28]. For example, decisiontrees (DT) andneural networks(NN) are unstable classifiers andthus suitable for bag-ging.

Clearly,different sampling mechanisms(Algorithm1,step 1.2) anddifferent thresholds (step 2.4) can be used which will yield differentmodelsanddifferentoutputs.Allthemethodstestedhere useavariationofAlgorithm1,andwewilldiscusstheirsampling mechanismsandthresholdsinthenextsection.

Furthermore, different aggregation methods can be used to combine the outputs of the base classiﬁers, e.g., hard-voting to make crispclass assignmentsor soft-voting forprobabilistic pre-dictions. It is known that soft-voting generally provides better performance than hard-voting [21,29,30].We, therefore,use soft-votinginthiswork(Algorithm1,step2.3).

2.4.Resamplingmechanisms

In this section, we brieﬂy describe the different resampling mechanismsthatcanbeusedinstep1.2inAlgorithm1.Oneofthe simplestwaysto resampleistosampleeach instance withequal probability with replacement,i.e., a non-parametric bootstrap, as inthe original bagging algorithm [28].This samplingmechanism is not currently popular with imbalanced data as it preserves the imbalanced class distribution, which is thought to be detri-mentaltolearning. However, we argueherethat thismechanism works well when appropriate thresholds are available. Our pro-posed method, PT-bagging (discussed below), uses this sampling mechanism.

Commonly used resampling mechanisms for imbalanced data (here called rebalancing methods) try to balance the class pro-portions. Perhaps the simplest and most popular undersampling mechanism used for ensemble learning is referred to as exactly balancing (EB). EB resampling preserves the minority class in-stances while randomly undersampling majority class instances suchthat theclassproportionsareexactlybalanced.Roughly bal-ancing (RB) is a powerful variation of EB that improves perfor-manceby increasing diversity of the classiﬁers[12]. RB, like EB, preservestheminority class instances butundersamplesthe ma-jorityclassinstancesasdeterminedbyanegativebinomial distri-bution,ineffectroughlybalancingtheclassproportions. Oversam-pling mechanisms, on the other hand, over-sample the minority classexamples.Previousstudiesindicatethat undersampling gen-erally performs better than oversampling [6,31,32]. Furthermore, undersampling is computationally more eﬃcient than oversam-plingsinceitdiscardsalargepartofthetrainingdata.

More sophisticated hybrid methods that combine oversam-pling and undersampling have been proposed. One of the most popular such methods is the synthetic minority oversampling Algorithm1

Pseudo-code for the bagging ensemble. 1. Learning:

1.1. Input: A training set S={(xi , yi )}N i =1;yi ∈C={1 ,...,m}, where m is the number of class labels and nthe number of base classiﬁers. 1.2. Generate n training data sets by sampling 1_S_.

1.3. Learn n base classiﬁers from each sample. 2. Prediction:

2.1. Input: an instance x,nbase classiﬁers, a probability threshold λk for each class k.

2.2. Each base classiﬁer igives a probabilistic estimate Pi (y =k| x)for each label k∈{ 1 ,...,m} given a test instance x. 2.3. Compute averages of probabilistic predictions for each class k∈{1 ,..., m}: P(y =k|x)= 1

n n

i Pi (y =k|x). 2.4. Rank each class k∈{1 ,...,m}according to: P(y =k|x)/λk .

2.5. Assign the label for which the score in 2.4 is the highest.

(4)

technique(SMOTE)[9],whichgeneratesnewminorityclass exam-ples by interpolation whileundersampling the majority class ex-amples.SMOTEoftenperformswellincombinationwithabagging ensemble. More recent methods such as Random Balance (RNB) havecombinedinsightsfromboth,SMOTEandRB[33].Speciﬁcally, RNB randomlyselects a class proportionandoversamples one of the classesaccordingly with interpolated examples using SMOTE whiletheotherclassisundersampled.RNBaimsatincreasingthe diversityinthebaseclassiﬁers,whichinturnoftenimprovesthe ensembleperformance.Itisimportanttonoticethatthethreshold of0.5isnormallyusedwiththeserebalancingmethods.

Rebalancing methods,however, presenta numberof potential shortcomings. An important side effect of rebalancing is that it canleadtomiscalibratedposteriorprobabilityestimates,asrecent studieshavefound[14,15].Another– usuallyunnoticed– potential problemofresamplingtechniquesisthatofthepriorshift,i.e., dif-feringtraining(balanced)andtest(naturalclassproportion) distri-butions[15].Thismightcreateadditionalproblemssincea model learnedonabalanceddataisthenevaluatedinadifferentsetting. Lastly,theoriginaldensityofexamplesisasymmetricallymodiﬁed forthe classes(e.g.,by undersampling themajorityclassor over-samplingtheminorityclass),whichmightleadtoundesiredbiases inthemodel.Bycontrast,thesimplebootstrap samplingdoesnot modifythe data distribution,which ledus tohypothesize that it willbelesspronetotheseproblems.

3. Probabilitythresholdbagging(PT-bagging)

Severalstudieshaveconsideredthreshold-movingasamethod to deal with class imbalance [20,34]. For example, Maloof [20] compared threshold-moving to RUS using a single classiﬁer andconcludedthat theyachieve similar performance interms of ROC.However,tothebestofourknowledge,previousstudieshave notconsideredthreshold-movingincombinationwithbagging.

The basic idea behind PT-bagging is to leverage bagging (Algorithm 1) to first obtain well calibrated posterior esti-mates and appropriately threshold them afterward according to the performance measure to be maximized. PT-bagging learns base classifiers from simple bootstrap replicates of the original data set, which preserves the class distribution. Then, following Algorithm 1, probabilistic predictions for each class k are aver-agedacross the baseclassifiers toobtain a final posterior proba-bilityestimate P

(

y₌k

|

x

)

(step 2.3).Togetaclasslabel, ﬁrst,the probability estimateis transformed intoa scoreby dividing itby itsrespective class threshold

λ

k (step 2.4). The classk for which thisscore P

(

y₌k

|

x

)

_/

λ

_k is the largest is then assigned. Accord-ingtothecategorizationproposedbyHernandez-Oralloetal.[35], PT-baggingemploys a score-driven threshold,asopposed to,e.g., a ﬁxed threshold of 0.5 used by rebalancing methods. Crucially, inPT-baggingthethresholdscan beadapted tomaximize a mea-sureofinterest.Tomaximize macro-accuracy,theoptimal thresh-oldforclassk isequaltothe classpriorin thetrainingdata,i.e.,

λ

k =P

(

y=k

)

(seeTheorem1).Forinstance,iftheaveraged poste-riorprobabilityforclass0isP

(

y= 0

|

x

)

= 0.7,andthepriorofthis classis0.8(i.e.,P

(

y=0

)

= 0.8),thenits scoreis0.7/0.8 =0.875, andconsequentlythescoreforclass1is0.3/0.2₌1.5.Thus,class 1willbe assignedeven though ithasa lower posterior.The cal-culation of this score is identical in the multiclass setting. We henceforthspecifythemethodwiththethresholdthat maximizes a particularmeasure using a subscript: PTMA-bagging for macro-accuracyandPTF1-baggingformacroF1-score.Weemploythe no-tationPT-baggingwithouta subscriptwhenwe refertomeasures that are independent ofthe threshold (e.g.,AUCPR and posterior probabilitycalibration).

3.1. Thresholdformaximizingmacro-accuracy

Thefollowingtheoreticalresultaimstoprovideinsightintothe mechanism behindthe performance of threshold-moving for the macro-accuracy measure. The reader should note that the main message of Theorem 1 is not ﬁnding the optimal thresholds (or misclassiﬁcationcosts)forthemacro-accuracymeasure,whichare knownto beequalto theinverseofthepriors, butrathera con-structiveproofofanalgorithmthatmaximizesmacro-accuracyfor binaryand multiclassdata. Theorem 1’sproof showsthat a nec-essarycondition fora method to maximize macro-accuracy is to have goodestimates of the posteriorprobabilities P

(

y=k

|

x

)

for eachclassk.Tosimplifynotationandimprovereadability,we con-siderthebinaryclassproblem.Thesameprooftriviallygeneralizes toamulticlasssetting.

Theorem 1. Proposition: Let P

(

y= j

)

be the prior of class j and P

(

y= j

|

x

)

bethetrue(unknown)posteriorprobabilityofclassjgiven x. If proportions of each class are unchanged from training to test, thenpredictingtheclassksuchthat

k= argmax

j

P

(

y= j

|

x

)

P

(

y=j

)

(1)

maximizesthemacro-accuracy.

Proof. LetC₌

{

1_,0

}

be the class labels (positive ₌ 1 and nega-tive = 0). Letd be a random variablecorresponding to the class predictedbyaclassiﬁer.Thus,P

(

d=k

|

x

)

isthepredicted probabil-ityofamodelforclasskgivenx.

We shall ﬁrst derive the population expression of macro-accuracy.Recallthatmacro-accuracyisdeﬁnedas

(

TPR + TNR

)

/2 where TPR = TP/P andTNR=TN/N. Recall that P = TP + FN and N = TN+ FP. Let us ﬁrst derive a continuous expression for TPR. Notice ﬁrst that TP/

(

P+N

)

=∫

R

P

(

y=1

|

x

)

P

(

d=1

|

x

)

p

(

x

)

dx andthat P/

(

P + N

)

= P

(

y= 1

)

.Thus, theratio ofthe ﬁrst expres-sionoverthesecondisequaltoTPR=TP/P.Thatis,

TPR = ∫RP

(

y=1

|

x

)

P

(

d= 1

|

x

)

p

(

x

)

dx P

(

y=1

)

The derivationof TNRisanalogous. Thus, dividing thesumof TPRandTNRby2yieldstheexpressionofmacro-accuracy:

1 2 ∫RP

(

y = 1

|

x

)

P

(

d= 1

|

x

)

p

(

x

)

dx P

(

y=1

)

+1 2 ∫RP

(

y= 0

|

x

)

P

(

d= 0

|

x

)

p

(

x

)

dx P

(

y= 0

)

Byenteringbothtermsintothesameintegral:

1 2 R

P

(

y= 1

|

x

)

P

(

y=1

)

P

(

d= 1

|

x

)

+ P

(

y=0

|

x

)

P

(

y=0

)

P

(

d=0

|

x

)

p

(

x

)

dx (2)

Therefore,maximizingtheintegralin(2)isequivalenttoasking fortheoptimalchoiceofP

(

d₌1

|

x

)

andP

(

d₌0

|

x

)

foragivenx– in other words, how to assign class labels1 or 0 ina wise way (perhapsprobabilistically), given x. Notice that the bracketinside theintegralin(2)isnothingbutaconvexcombination:

P

(

y= 1

|

x

)

P

(

y= 1

)

β

x+

P

(

y= 0

|

x

)

P

(

y=0

)

(

1 −

β

x

)

(3)

where we deﬁned

β

x:=P

(

d=1

|

x

)

. Thus, by monotonicity, the convex combination (3) is maximized at x if and only if we placeprobability 1to thelargest term.Thatis tosay, anoptimal methodassignsthepositive class1withprobability 1iftheterm

(5)

P

(

y= 1

|

x

)

/P

(

y= 1

)

isthelargestorassignsthenegativeclasswith probability1ifP

(

y=0

|

x

)

/P

(

y=0

)

isthelargest.Thatis,

β

x:=P

(

d= 1

|

x

)

=

1 , i f P_P(y₍_y=₌1₁|x₎)> P(y=0|x) P(y=0) 0 , Otherwise (4)

Thisisindeedthemethodproposedabove,inEq.(1). Critically, we notethat theoptimalmethod ofEq.(4)willnot havethetrueP

(

y=1

|

x

)

athandbutanestimation P

(

y=1

|

x

)

in-stead. Notice alsothat all the other quantitiesneeded tomake a decision in (4) are known constants, i.e., P

(

y=1

)

and P

(

y=0

)

(i.e., the thresholds). Therefore, the good performance of the method totally relies on having good posterior probability esti-matesP

(

y= 1

|

x

)

.

3.2. ThresholdformaximizingmacroF1-score

Unlikemacro-accuracy,thereisnoclosed-formexpressionfora thresholdthat maximizesthemacroF1-score[25].Ingeneral, in-creasingthethresholdincreasestheprecisionoftheminorityclass attheexpense ofdecreasedrecall.It is,however,knownthat 0.5 istheupperboundontheoptimalthresholdfortheF1-score[25]. In the absence of any additionalinformation and tuning, we set thethresholdfortheminorityclassto

(

P

(

y=1

)

+0.5

)

/2.The ra-tionalebehindthisthresholdisthatitissetmidwaybetweenthe threshold formaximizing the averagerecall (i.e., the trainingset classprior)andtheupperboundonthethresholdformaximizing theF1-score(i.e.,0.5).

Notethatmethodshavebeenproposedtoestimatetheoptimal thresholdformaximizingtheF1-scorebutthey requireadditional dataforfine-tuningandaresusceptibletotheWinner’scurse[25]. Using tuning methods is difficult for most of the datasets used here, as they are relatively small. Importantly, our aimhere was totestatuning-freethreshold.However,findingbetterthresholds, whenpossible,canconceivablyresultinfurtherimprovements.

The readershould notethat theproposed thresholdis deﬁned forthebinaryclasssetting.Itsextension tothemulticlasssetting willbeconsideredinfuturework.

4. Experimentalsetup

We used the R statistical environment (http://www.r-project. org/) withcorresponding packagesanddefaultparameters unless otherwisespeciﬁed.

Selecting a proper base classiﬁer learning algorithm (Algorithm 1, step 1.3) is crucial as our method relies on good posterior probability estimates at the test time. Previous work has shown that bagged probabilistic decision trees (DT) provide goodposteriorprobability estimates[22].Theseare,in fact,more reliablethanotherclassiﬁerssuchaslogisticregression[22].Here, we employ unpruned J48 decision trees, an implementation of the C4.5 trees available in the “RWeka” package [30,36]. Even though it is common to apply Laplace smoothing to the leaf probabilities ofthe individual decisiontrees, it can be detrimen-tal for imbalanced data [37]. Therefore, we did not use Laplace smoothing.

In order to evaluate the generality of our method, we also employed neural networks (NN) as base classiﬁers. Bagged neu-ral networks are known to offer well calibrated posterior proba-bility estimates [23], and thus we hypothesized that this would be a suitable choice for our method. We used a single hidden layer with logistic units and softmax output, implemented with the“nnet” package.Followingaruleofthumb,wesetthenumber ofhiddenunitsto2/3oftheinputdimensionplusthenumberof classes[38].

Westudiedtheeffectofthenumberofbaseclassiﬁersby vary-ing them in {5, 10, 15, 25, 50, 100}. We ran 5_×2-fold cross-validationforeachmethodoneachdataset.TheFriedmantestwas usedtotestifthereweredifferencesacrossthemethodsandifthe test passedat95% signiﬁcance,a posthoc Nemenyi test was per-formedtoidentifyanypairwisedifferences[39].ApairedWilcoxon ranksumstestwasusedtocomparetwomethodsdirectly.

4.1. Datasets

We used 36 imbalanced binary data sets (Table 2): 14 from the HDDT repository (http://www3.nd.edu/∼dial/hddt/), 19 fromtheKEELrepository(http://sci2s.ugr.es/keel/imbalanced.php), and three from the UCI repository (https://archive.ics.uci.edu/ml/ datasets.html). Note that the evaluationwithneural network en-sembles includes 26 data sets that contain only numerical at-tributes.Only thecomplete instances ofdata were used andany constantattributeswereremoved.Table2showsasummaryofthe datasets usedin ourexperiments. The attributesofthe datasets arenumerical,categoricalornumerical-categoricalmixed. Forthe multiclasssetting,weused15 datasetsfromtheKEELrepository (Table3).

Table2

Overview of the binary data sets obtained from UCI, HDDT ∗_{and KEEL}_{† repositories (names were}_{shortened for}

convenience).

Dataset #Inst #Attr #Num %Min Dataset #Inst #Attr #Num %Min

pima 768 8 8 34.5 br-y † 277 9 0 29.2 ion 351 34 34 35.9 cl0vs4 † 173 13 13 7.5 sonar 208 60 60 46.6 ecoli4 † 336 7 7 5.9 spectf ∗ ₂₆₇ ₄₄ ₄₄ _20.6 _hab_† ₃₀₆ ₃ ₃ _26.5 phon ∗ ₅₄₀₄ ₅ ₅ _29.3 _{led7_xvs1}_† ₄₄₃ ₇ ₇ _8.3 page ∗ ₅₄₇₃ ₁₀ ₁₀ _10.2 _pb-1-3vs4_† ₄₇₂ ₁₀ ₁₀ _5.9 ism ∗ _11,180 ₆ ₆ _2.3 _shut-0vs4_† ₁₈₂₉ ₉ ₉ _6.7 letter ∗ _20,0₀₀ ₁₆ ₁₆ _3.9 _vow0_† ₉₈₈ ₁₃ ₁₃ _9.1 satim ∗ ₆₄₃₀ ₃₆ ₃₆ _9.7 _yst-2vs4_† ₅₁₄ ₈ ₈ _9.9 compu ∗ _13,657 ₂₀ ₂₀ _3.8 _yst4_† ₁₄₈₄ ₈ ₈ _3.4 segm ∗ ₂₃₁₀ ₁₉ ₁₉ _14.3 _glass6_† ₂₁₄ ₉ ₉ _13.5 oil ∗ ₉₃₇ ₄₉ ₄₉ _4.4 _new-th1_† ₂₁₅ ₅ ₅ _16.3 estate ∗ ₅₃₂₂ ₁₂ ₁₂ ₁₂ _wisc_† ₆₈₃ ₉ ₉ _34.5 hypo ∗ ₂₀₀₀ ₂₄ ₆ _6.1 _car-gd_† ₁₇₂₈ ₆ ₀ ₄ boun ∗ ₃₅₀₅ ₁₇₅ ₀ _3.5 _ﬂare-F_† ₁₀₆₆ ₁₁ ₀ ₄ cred ∗ ₁₀₀₀ ₂₀ ₇ ₃₀ _kdd_† ₁₆₄₂ ₄₁ ₂₆ _3.2 hrt-v ∗ ₁₃₃ ₉ ₄ _23.3 _veh0_† ₈₄₆ ₁₈ ₁₈ _23.5 ab9-18 † 731 8 7 5.7 w-red-4 † 1599 11 11 3.3

(6)

Table3

Overview of the multiclass data sets, all from the KEEL repository.

Dataset #Inst #Attr #Num %Min #Class

ontraceptive 1473 9 6 22.6 3 dermatology 366 34 34 5.5 6 balance 625 4 4 7.8 3 penbased 1100 16 16 9.5 10 shuttle 2175 9 9 0.09 5 wine 178 13 13 27 3 yeast 1484 8 8 0.3 10 pageblocks 548 10 10 0.6 5 thyroid 720 21 21 2.4 3 ecoli 336 7 7 0.6 8 autos 159 25 15 1.9 6 glass 214 9 9 4.2 6 new-thyroid 215 5 5 13.9 3 hayes-roth 132 4 4 22.7 3 lymphography 148 18 3 1.3 4

4.2.Methodsusedforcomparison

For the binary class setting, we included EB-bagging as the baselinemethodalongwithRB-bagging,SMOTE-baggingand RNB-bagging as state-of-the-art methods (see Section 2.4 for details). TheparametersofSMOTEweresettothecommonlyusedsetting of500%oversamplingoftheminorityclasswith5nearest neigh-borsand 100% samplingof the majorityclass. To investigatethe benefit of using an ensemble in PT-bagging, we additionally in-cludedasingleclassifier (denotedas“single”)thatusedthe com-plete training data to learn, and employed the same threshold-movingmechanism asPT-baggingandidenticalparameters asits baseclassifiers.

For completeness, we compared Platt scaling [40] and PT-baggingonbinarydatasets.Plattscalingisawell-knownposthoc calibrationmethodthattransformscontinuousmodeloutputsinto (calibrated) probability estimates. It firstly fits a logistic regres-sion model to the outputs with the class labels as the depen-dentvariable.Toavoidover-fitting,weuseddatafrom3-fold cross-validationwithtransformedclass labelsderivedfromthetraining set,asintheoriginal paper[40].Thislogisticmodelwasthen ap-pliedto the test set outputs to obtain calibrated probabilities. A thresholdof0.5wasfinally appliedtothecalibratedprobabilities to obtain the test class labels. If the posteriors are indeed cali-bratedafterPlattscalingthenthissettingshouldyieldgood perfor-mance.Wedeemeditsinclusionasrelevantsince,likeourmethod, Plattscalingisaneasy-to-applyaposterioricorrectionofthemodel outputs.

4.3.Performanceevaluation

Weevaluatethemethodsonthreeperformancemeasures:area underthePR curve(AUCPR), macro-accuracy andmacroF1-score (see Section 2.1 fordetails). Note that AUCPR is computed using theposteriorprobability estimateswhile theother two measures requireclasslabelassignments.

The methods consideredherediffer intwo importantaspects: (1)theresamplingmechanismand(2)thethresholdused.Each re-samplingmechanismmayrequireitsownthresholdtoachieve op-timalperformance.However,oftenastandardthresholdisusedin practice(e.g.,0.5forEB-andRB-bagging).Toevaluatethe perfor-manceindependentlyofthethreshold,wedevisedanovelscheme calledthefullpotential.Thefullpotentialofamethodisdeﬁnedas thebestperformance achievedoverall possiblethresholds onthe testset.Amethod’s performancecloseto itsfull potentialmeans thatthethresholdused ismoreattunedto theoptimalthreshold forthat method.Weapproximatedthefull potentialby searching theoptimalthresholdoveragridfrom0to1instepsof0.01.

Finally,weusedthestratiﬁedBrierscore(meansquarederror) toevaluatethecalibrationoftheposteriorprobabilities.TheBrier scoreofclasskiscalculatedastheaveragesquareddifferences be-tweentheestimatedprobabilityoftheexamplesfromclassk(i.e. P

(

y = k

|

xi

)

)andaperfectlyconﬁdentprobabilisticprediction(i.e., 1): BSk₌ 1 Nk_y∗ i=k

1 −P

(

y=k

|

xi

)

2 (5)

whereNkisthenumberofexamplesofclasskandy∗i =krefers to those examples for which k is the true class label. Averaging the Brier score of all the classes gives the stratiﬁed Brier score. The stratiﬁedBrierscoreis moreappropriate whenthereis class imbalance since it gives equal importance to all the classes and thusallowsanymiscalibrationoftheminorityclassestobespotted [41].

5. Resultsanddiscussion

In thissection, we compare the performance of theproposed method with other state-of-the-artmethods and provide further empiricalinsights.Forbrevity,whenensemblesofneuralnetworks performed similarly to those of decision trees, the results from neural networks ensembles are omitted. Unless otherwise indi-cated,wereportresultsonensembleswith100baseclassiﬁers. 5.1. Binarydatasets

Weﬁrstinvestigatedtheeffectsoftheensemblesize(Fig.1).It isworth mentioningthat, unsurprisingly, thearea undertheROC curve showed a much more clutteredpicture, which we omit in theinterestofspace.

Overall, the following general observations can be made: (1) all methods show improvementwithincreasing ensemblesize in allperformancemeasures. (2)PTMA-,RB-andEB-baggingperform better on macro-accuracy while PTF1-, SMOTE- and RNB-bagging performbetteronthemacroF1-score.Thisshowsthatdifferent re-samplingmechanismsaresuitable fordifferentperformance mea-sures; (3)PT-bagging – with appropriate thresholds – performed well in each ofthe evaluated measures, while the restof meth-odsperformedpoorlyinatleastoneofthem,e.g.,RB-bagging per-formed poorly in macroF1-score andAUCPR, while SMOTE- and RNB-bagging performed poorly in macro-accuracy. Finally, (4) as expected, EB-and RB-bagging performed similarly (although RB-bagging generallyfared betterwithdecision trees)andRNB-and SMOTE-bagging did not differ signiﬁcantly. We discuss these re-sultsinmoredetailbelow.

5.1.1. Theareaundertheprecision-recallcurve(AUCPR)

As can be readily observed, PT-, SMOTE- and RNB-bagging showed,onaverage,overall betterperformance thanEB- and RB-baggingonAUCPRforallensemblesizes(Fig.1,left).Thisindicates that PT-, SMOTE- and RNB-bagging are generallyable to achieve comparatively higher average precision and recall (F-measures). The Friedman test revealed a signiﬁcant difference between the methods(P<3e−12,forbothDTandNNensembles).

TheposthocpairwiseNemenyitestswithDTensemblesshowed anearlysigniﬁcantdifferencebetweenPT-baggingandEB-bagging (P ≈0.062),andbetweenRNB-andEB-bagging(P ≈0.083).Note that lower AUCPR generally implies a lower potential for the F-measures,irrespectiveofthethresholdused.Thus,whenusingDT ensembles, methods based onundersampling alone (i.e., EB-and RB-bagging)arelikelytohavealowerF1-score(asshownbelow). The single classiﬁer clearly showed a lower AUCPR than the PT-baggingandrestofthemethods(P<1.5e−6).

(7)

Fig.1. Average test performance across datasets for different numbers of classiﬁers in AUCPR (left), macro-accuracy (middle) and macro F1 (right). The ﬁrst row shows results for DT ensembles and second row for NN ensembles. The interpolated lines are shown for convenience.

With NN ensembles, RB-bagging was significantly (or nearly significantly)worsethanPT-,SMOTE-andRNB-bagging(P_≈0.058, 0.065and0.021,respectively).All themethods performed signifi-cantlybetterthanthesingleclassifier(P<0.001).Noother signif-icantdifferenceswerefound.

Thechoiceofbaseclassifiers(eitherDTorNN)didnotmakea significantdifferenceinanyofthemethod’sperformance(P>0.17 forallWilcoxontestson26datasets).Thisaligns withourclaim that PT-baggingcanbe usedwithdifferentchoicesofbase classi-fiers.

5.1.2. Macro-accuracy

Fig.1(middle)showsthatresamplingmethodsoffersimilaror higher macro-accuracy compared to PTMA-bagging when a small number of classifiers were employed. This aligns with Maloof’s [20] findings which showed that undersampling and threshold-moving perform similarly in terms of ROC and macro-accuracy when a single classifier is employed. However, the performance ofPTMA-baggingimprovedsubstantiallywithlargerensembles(25 base classifiersor more), indicating that once thevariance is re-duced through bagging, theerror that is left comes mostly from thebias (sincebagging reducesvariancebutnotbias). Thisresult suggests that the biasof PTMA-bagging islower than that ofthe othermethods.

PTMA-bagging showed thehighest number ofwins aswell as fewer losses compared to other methods (Table 4, left), espe-cially with DT, where it obtained almost twice as manywins as losses against the second best performing method (RB-bagging). The Friedman test revealed a signiﬁcant difference between the methods(P<1e−11forboth,DTandNNensembles).

ForDTensembles,theposthoc pairwiseNemenyitestsshowed thatPTMA-baggingperformedsignificantlybetterthanEB-, SMOTE-and RNB-bagging (all P < 0.03) and the single classifier (P<1e−10).Noneoftheremainingdifferencesweresignificant.

NNensembles showed a similar trend of pairwise differences astheDTensembles,althoughinthiscasePTMA-baggingwasonly significantlybetterthanSMOTE-bagging(P ≈0.021)andthesingle classifier (P < 1.7e−9). It should be noted, however,that it was moredifficultto obtainstatisticalsignificancewithNNensembles asfewerdatasetswereused.

Itisalso worthnotingthat thetype ofthebase classifierhad nosignificant effectonamethod’sperformance (Wilcoxon test,P >0.8forallcomparisonsofamethodusingDTagainst thesame methodusingNN)exceptforthesingleclassifierwhichperformed betterwithDTthanwithNN(P≈0.01).

Symmetryofclassrecalls: Inordertogain insightintothe bias ofthe methods towardpredictingeitherclass we testedthe null hypothesisthat,foreachmethod,thedifferencebetweentheclass recalls(Fig.2),i.e.,the twocomponentsofthemacro-accuracy,is equalto zerousing a one samplet-test. WithDT ensembles, the nullhypothesiswasrejectedforallmethods(P<0.003)exceptfor PTMA-bagging(P≈0.18).WithNNensembles,asimilarsymmetry wasobservedforPTMA-bagging(P ≈0.42).Inthiscase,EB-bagging wasthe onlyothermethod thatdidnot showsigniﬁcantly asym-metricclassrecalls(P≈0.46).

These resultssuggest that PTMA-bagging is not biased toward either class. Interestingly, only EB-bagging (using DT) showed higherrecallfortheminorityclass(Fig.2,left),suggestinga possi-bleovercompensationfortheminorityclassduetothe undersam-pling.

Fullpotential:Wethentestedthefullpotentialofthemethods formacro-accuracy,i.e.,whichwedefineastheirmaximum macro-accuracythatwouldbeattainableiftheoptimalthresholdforeach testsetwereknown.TheFriedman testrevealedasignificant dif-ferencebetweenmethods(P<0.001bothDTandNNensembles). Post-hoc pairwise comparisons with DT ensembles revealed that SMOTE-bagging(P ≈ 0.04) andPTMA-bagging (P ≈0.1) have (respectively) a significantly and nearly-significantly higher full

(8)

Table4

Win/Tie/Loss tables. Each element expresses how many times the method in the row wins/ties/loses against the method in the column. The top tables show results with DT methods, and the bottom tables show results with NN methods.

Macro-accuracy Macro F1-score

EB RB SMOTE RNB Single EB RB SMOTE RNB Single

PT MA 27/0/9 23/1/12 24/1/11 23/1/12 31/2/3 PT F1 33/0/3 31/1/4 19/2/15 19/1/16 30/0/6

EB – 6/4/26 20/2/14 19/1/16 31/0/5 EB – 2/2/32 1/1/34 1/2/33 10/0/26

RB – – 24/1/11 23/4/9 33/0/3 RB – – 5/2/29 6/2/28 15/0/21

SMOTE – – – 15/4/17 30/1/5 SMOTE – – – 18/2/16 28/0/8

RNB – – – – 34/1/1 RNB – – – – 30/0/6

EB RB SMOTE RNB single EB RB SMOTE RNB single

PT MA 13/1/12 16/0/10 20/0/6 15/2/9 26/0/0 PT F1 20/2/4 20/0/6 14/1/11 11/4/11 24/0/2

EB – 15/2/9 19/1/6 14/1/11 26/0/0 EB – 4/0/22 4/1/21 3/1/22 12/0/14

RB – – 19/1/6 13/1/12 26/0/0 RB – – 6/2/18 5/2/19 13/0/13

SMOTE – – – 4/2/20 22/0/4 SMOTE – – – 13/2/11 23/1/2

RNB – – – – 26/0/0 RNB – – – – 22/0/4

Fig.2. Average recall across data sets for different numbers of classiﬁers, separated for the minority class (solid line) and the majority class (dashed line). The left plot shows results for DT ensembles and right plot for NN ensembles.

Table5

The average difference between the actual performance and full potential. The ﬁrst row shows results for DT methods and second row for NN methods.

Macro-accuracy Macro F1-score

PT MA EB RB SMOTE RNB single PT F1 EB RB SMOTE RNB single

DT 1.9% 2.5% 2.3% 4.7% 3.4% 0.9% 2.6% 9.5% 6.9% 2.9% 3.2% 1.1%

NN 1.7% 2.0% 2.1% 5.9% 3.3% 0.6% 2.9% 6.5% 5.5% 2.9% 3.6% 0.9%

potential than EB-bagging. Again, PT-bagging and the rest ofthe ensemble methods clearly outperformed the single classiﬁer in termsof potential (P _< 1e₋10). The rest of the pairwise differ-enceswerenotsigniﬁcant.

With NN ensembles, the trend changed for RB-bagging, now exhibiting generallylower potential formacro-accuracy than the othermethods,withsignificantly(ornearlysignificantly)lower po-tentialthanRNB- andSMOTE-bagging (P ≈ 0.031andP ≈ 0.081, respectively). Thisdrop in the full potential ofRB-bagging when usingNNcan alsobe appreciatedfromits lower averagedAUCPR inFig.1 ascompared to,forexample, EB-bagging.This seemsto suggestthatRB-baggingmightnotgeneralizewellacrossbase clas-sifierchoices, conceivably becauseit might beleveraging proper-tiesofdecisiontrees.Toourknowledge,nootherstudieshave an-alyzedthe performance ofRB-bagging withbaseclassifiers other thanDTandthusmoreanalysesareneeded.

Finally, the single classiﬁer showed lower potential than PT-bagging(P<8e−7)andtherestoftheensembles(P<0.002).

To shedlight on howwell-tuned thethresholds employed for each method were, we compared their actual and full potential macro-accuracy.The averageabsolutedifference betweenthe full potentialmacro-accuracy and theactual macro-accuracy was cal-culated across cross-validation folds and datasets (Table 5, left). PTMA-baggingperformedclosertoitsfull potentialthantheother

ensemble methods. This suggests that the prior-based threshold employed by PTMA-bagging is a close-to-optimal choice. The sin-gleclassiﬁerperformedclosesttoitsfullpotentialmacro-accuracy, however its full potential was markedly lower than the other methods– asdiscussedabove– yieldingthusalowerperformance (Table4).

5.1.3. MacroF1-scoreandplug-inpotency

Animportantadvantageofourmethodisthatalearned ensem-blecanbeusedtomakepredictionsthatoptimizeanymeasureof interestby applyingan appropriatethresholda posteriori.We ap-pliedtheproposedthresholdtomaximizethemacroF1-score(see Section3.2)totheoutputsofthesamePT-baggingensemblesused above.TheresultingmethodistermedPTF1-bagging(Fig.1,right). The Friedman test revealed a significant difference between the methods for DT (P < 0.0008; P < 0.1 for NN). Post-hoc tests showed that PTF1-bagging had a significantly higher macro F1-score than PTMA-bagging, EB-bagging, and RB-bagging (all P < 0.0008) for both DT and NNensembles. However, witheither choiceofthebaseclassifier,PTF1-baggingwasnotsignificantly dif-ferentfromRNB-orSMOTE-bagging.Asinthepreviousmeasures, thesingleclassifierperformedmarkedlyworsethanPTF1-, SMOTE-andRNB-bagging(P<0.002)witheitherbaseclassifier,whilenot

(9)

Fig.3. Reliability plots for DT ensembles with 100 classiﬁers; spectf (UCI, left), pb-1-3vs4 (KEEL, middle) and satim (HDDT, right). We used 10 bins to discretize the posterior probability for the minority class P(y=1 | x) ( x-axis) for all ﬁve runs and two folds. The corresponding observed frequencies of the minority class ( y-axis) were calculated for each bin (i.e., the “true”P(y=1 | x)). A method lining up with the diagonal is well calibrated while values below the diagonal are overestimating the probability of the minority class.

worse than EB-bagging with NN (P ≈ 0.99) and not worse than RB-baggingwithbothNNandDT(P>0.91).

Again, none of the methods showed signiﬁcant differences when comparing their performanceswith DT andNN ensembles (P>0.17forallWilcoxontests).

Full potential: We found significant differences between the methods in terms of the full potential of the macro F1-score (Friedmantest, P < 0.055 forbothDT andNN).WithDT ensem-bles, pairwise posthoc tests revealed only one trend level differ-ence between PTF1- and EB-bagging (P ≈ 0.1) in favor of PTF1 -bagging.WithNNensembles,asnotedabovewithmacro-accuracy, RB-bagging’spotential generallydecreased showingasignificantly lowerfullpotentialthanPTF1-bagging(P≈0.04).Asexpected,the single classifier hadsignificantly lower full potential than all the ensembleswitheithertypeofbaseclassifier(P<0.0022).Noother differencesweresignificant.

Finally,theaverageddifferencesbetweentheactualandthefull potentialmacroF1-scoreshowthatPTF1-bagging– whichusesour novel threshold – performed closerto its full potential than the restofensemblemethods(Table5,right).Also, SMOTE-and RNB-baggingperformedclosertooptimalthanEB-andRB-bagging. Ad-ditionally,thesingleclassiﬁerperformedclosesttoitsoptimal per-formance,yetagain,itsfullpotentialandactualperformancewere markedlylowerthantheensemblemethods(Table4).

Taken together,these resultsimply that the methodsthat ex-hibit a performance clearly lower to that of their full potential could dobetter ifproperthresholds could befound, whichis of-ten not possible without usingcomputationally expensive tuning procedures. Overall, the resultsabove support our claimthat PT-bagging passesas a plug-in methodwhere the threshold can be setaposterioriaccordingtotheperformancemeasureofinterest. 5.1.4. Posteriorprobabilitycalibration

Sofar, wehaveshownthatPT-baggingperforms competitively onthreedifferentmeasures.Inparticular,goodperformanceinthe macroF1-score andmacro-accuracy (Theorem 1) is onlypossible if posteriorprobabilities are well calibrated. Inthe following, we arguethatPT-baggingestimateswellcalibratedposterior probabil-ities.

AnempiricalstudybyNiculescu-MizilandCaruana[23]showed thatbagged DTandNNensemblesestimate wellcalibrated poste-rior probabilities, making additional calibration – e.g., with Platt scaling – unnecessary. However, probability calibration is a rel-atively understudied problem for imbalanced data. In this di-rection, a recent study proposes to correct the calibration for undersampling [14] and another study proposes the use of an

undersampling-basedvariationofPlattscalingtoobtaincalibrated probabilities[42].Thesestudiesusethe(stratified)Brierscore(see Eq.(5)) toquantify calibration.Wallace andDahabreh [41]found thatundersamplingcombinedwithbaggingleadstoalowerBrier scorefortheminority class,i.e.,abettercalibrationforthisclass, without sacrificing the overall Brier score. We found similar re-sultsinourexperimentswithboth DTandNNensembles. Specif-ically,therebalancingmethods showedasignificantlylower Brier scorefor theminority class than PT-bagging andthe single clas-sifier,while PT-bagging fared better withthe majority class than the rebalancing methods and the single classifier (Friedman test P<2e−16;allpairwiseposthocNemenyitestsP<0.018forboth DTandNNbasedmethods).

ThisseeminglynegativeresultforPT-bagging,i.e.,ahigherBrier score for the minority class than the majority class, can be at-tributed to a potential shortcoming of the stratiﬁed Brier score. TheBrierscoreforclasskdecreaseswithcrispposteriors,e.g.,the Brierscorefortheminority classwouldbecomezeroifall poste-riorprobabilities forthisclass were 1(see Eq.(5)).Thus, the in-formationinthenon-crispposteriorsisignoredbytheBrierscore andoverestimatedprobabilitieswill,wrongly,leadtoalowerBrier score.ThissuggeststhatthestratiﬁedBrierscoremightnotbe ap-propriateforqualifyingposteriorcalibrationoverimbalanceddata, asitis notnecessarily indicativeofother performance measures. Developing new measures to quantify calibration is beyond the scopeofthispaper.

Reliability plots provide an alternative visual way to evaluate calibrationquality [23], overcomingthe aforementioned deﬁcien-cies.ThreeexamplesofreliabilityplotsareshowninFig.3.Visual inspection revealed that PT-bagging probabilities were well cali-brated(i.e.,closetothediagonalline)forthemajorityofthedata sets(21outof36DTensembles,and15outof26NNensembles; see Supplementary material). In contrast, all the other methods tendedtosystematicallyoverestimatetheposteriorsforthe minor-ityclass (hencetheirlower Brierscores).Moreover,whenever PT-bagging estimated miscalibratedposteriors (i.e.,not aligned with thediagonal), theother methodsfailedtoo.Theseresultssuggest thatPT-baggingestimatesrelativelywell-calibratedposteriors. 5.1.5. ComparisonwithPlattscaling

Toinvestigatethe effectof directcalibration,we applied Platt scaling to the posterior probabilities of the ensembles with 100 classiﬁers used above, followed by a threshold of 0.5. With the DTensembles,PTMA-baggingoutperformedPlattscalingin macro-accuracy(PairedWilcoxontest,P<1e−8;Win/Tie/Loss = 34/0/2). Platt scaling performed relatively better in the macro F1-score.

(10)

Nevertheless,PTF1-, SMOTE- and RNB-bagging still outperformed Plattscaling(PairedWilcoxontest,P<0.08inallcomparisons).

Results with NN ensembles showed a similar picture for themacro-accuracy measure,wherePTMA-baggingclearly outper-formed Platt scaling (P < 1e−6; Win/Tie/Loss = 21/1/4). How-ever,inthiscase, thedifferencesbetweenPlatt scalingandPTF1-, SMOTE-andRNB-baggingwerenon-signiﬁcant(allP_>0.4).

Inconclusion,probabilitycalibrationusingPlattscalingdidnot provideanimprovementoverthemethodsinvestigatedinthis ar-ticle.Thiscorroboratespreviousresultsshowingthatposterior cal-ibrationforbagged DTandNNensemblesisoftenunnecessaryas theirprobabilityestimatesaregenerallywellcalibrated[23]. 5.2.Multiclassdatasets

An important advantage of our method is that it can be di-rectlyextended to themulticlass setting.Formulticlass data,the threshold-movingtechniquecanbeappliedbydividingthe poste-riorsbyappropriateprobabilities(seeTheorem1).Forinstance,to maximizemacro-accuracy,thethresholdsofPTMA-baggingareset equaltothepriorprobabilitiesoftherespectiveclass(seeSection 3.1andAlgorithm1).Weevaluatedthisapproachon15multiclass datasets (Table3).Forcomparison,weusedtheUnderBaggingto OverBaggingmethod (UnderOver)[43]. This method uses under-orover-sampledinstancesofdifferentclassesinproportiontothe majorityclasssizecontrolledbyaparameterawhichcorresponds tothesamplingrateofthelargestclass.Theparameterachanges theensemblefromUnderBagging(a₌0)toOverBagging(a₌100). Noticethat UnderOver (a = 0) is the multiclassextension of EB-bagginginwhichthemajorityclassesare undersampledtomatch theleastfrequent class.We varied theparametera in {0,10,25, 50,100}.

Weusedsimilarexperimentalsettingstothoseusedforthe bi-narydatasets, i.e.,simple bootstrap samplingforPT-bagging and 5 × 2-fold cross-validation. Following their good performance in binarydatasets,hereweemployedonlyDTensemblesofsize100. As there are many methods (PT-bagging and five competing methods)relativetothenumberofdatasets,theFriedman testis unlikely toreveal differences. Therefore, herewe only performed pairwise tests. Firstly, we selected the single competitor method withthehighestaverage macro-accuracy.Thismethod – withan averagemacro-accuracy of 0.756 – was UnderOver-bagging(with a= 50).This methodandPTMA-bagging (average macro-accuracy 0.789) showed a trend-level difference (Paired Wilcoxon test, P ≈0.0946).Furthermore,of the total15 multiclass data sets,PTMA -bagginghadeightoverallwins againstthefivecompetitors,while thenextbestmethodwasUnderOver-bagging(witha ₌10) with fourwins.ThisresultsuggeststhatPT-baggingcanbesuccessfully employedformulticlassdatasetswhenappropriatethresholdsare available, although further tests are needed to confirm stringent statisticalsignificance.

6. Conclusionsandfuturework

We proposed a simple plug-inmethod, PT-bagging,for imbal-ancedclassiﬁcation. Ourmethod relieson simple bootstrap sam-pling– which preservesthe naturalclassdistribution – to create abagging ensemblefollowedby threshold-movingto assignclass labels.OurresultsandanalysesshowedthatPT-bagging, unsurpris-ingly,outperformsthesingleclassiﬁerbaselineandperforms com-petitivelytofourstate-of-the-artensemblemethods.Furthermore, itdoessointhreeperformancemeasures:AUCPR,macro-accuracy andmacroF1-score.Weshowedthat theclasspriorsprovide the optimal thresholds for maximizing the macro-accuracy measure, andwe introduced a newintuitive threshold formaximizing the macroF1-scoreanddemonstrateditseffectiveness.

We showed that PT-bagging (combined with an appropriate thresholdwhenneeded)performsatleastaswellasthebest com-petitor method in each of the three performance measures and does not underperform in any of them. Critically, it does so by reusingthesameensemblemodels acrossperformance measures. Bycontrast,all other methods provedto be weakinat leastone ofthe threemeasures. Weobserved thatthe undersampling-only methods(EB-andRB-bagging)weremoresuitableformaximizing themacro-accuracy,whilemethodscombiningsynthetic oversam-plingwithundersampling(SMOTE-andRNB-bagging)faredbetter withthemacroF1-score.Thusavoiding thenecessityofchoosing anarbitrarythresholdoridentifyingameasure-speciﬁcthreshold.

Our analysis provided several additional insights. Specifically, we found: (i) PT-bagging is lessbiased toward either class than other methods, (ii) it performs close to its full potential, (iii) it performs well with different choices of base classifier, (iv) PT-bagging canbe directly extendedto multiclassdata when appro-priate thresholds are available; andfinally, (v) a potential short-comingoftheBrierscoreinquantifyingprobabilitycalibrationand. Takentogether,ourworkprovidesacompetitiveandsimple al-ternative to other rebalancing- and syntheticoversampling-based ensemble methods, whichare often the first choice todeal with class imbalance. We hope that our results and analyses will in-crease interest in the threshold-moving technique and provide a basisfordevelopingnewthreshold-basedmethodsforimbalanced classification.

An important but understudied question is whether to use the naturalclassdistribution forlearning[19].Weiss andProvost [16]studiedthisquestionempiricallyandconcludedthatgenerally a differentclassdistribution leadstobetter performance. Our re-sultsstandincontrasttotheirconclusion.Animportantdifference between their procedures and ours, which can at least partially explain, the different conclusions, is the use of a singledecision treeversusan ensemble.Asourresultsshow,asingleclassiﬁeras wellassimplebootstrapbaggedsmallensemblesperformspoorly, buttheperformanceimproveswiththeensemblesize.Considering this, weconclude thata largeenough bagging ensemble success-fullymodelsthedatawithitsnaturalclassdistribution.

Wecantakeseveralpossibledirectionsinthefuture.Onesuch possible direction would be to use intrinsic data properties to improve the sampling mechanism (see, e.g., [7]). Such methods can leverage properties of the data and help with diﬃcult situ-ations, e.g., small disjunctsof the minority class. Additionally,as our methodavoids computationally expensive retraining, we aim toinvestigatethe suitabilityofourmethod inenvironmentswith dynamiccosts.

Acknowledgments

We thank Dr. Harikrishna Narasimhan, Prof. Jesse Davis and Prof.HendrikBlockeelforfruitfuldiscussions.KRPwaspartly sup-portedby theWellcomeTrust -MITfellowship103811AI.We are thankfultoStevenSheridanandDr.AbhijitKulkarnifor proofread-ingthemanuscript.

Supplementarymaterials

Supplementary material associated with this article can be found,intheonlineversion,atdoi:10.1016/j.neucom.2017.08.035. References

[1] Q.Yang,X.Wu,10challengingproblemsindataminingresearch,Int.J.Inf. Technol.Decis.Mak.5(4)(2006)597–604.

[2] Z.Yang,W.Tang,Associationrulemining-baseddissolvedgasanalysisforfault diagnosisofpowertransformers,IEEETrans.Syst.ManCybern.C:Appl.Rev.39 (6)(2009)597–610.

(11)

[3] R. Perdisci,G. Gu, W.Lee, Usinganensemble ofone-class SVMclassiﬁers tohardenpayload-basedanomalydetectionsystems,in:ProceedingsofIEEE SixthInternationalConferenceonDataMining,2006.

[4] T.Fawcett,F.Provost,Adaptivefrauddetection,DataMin.Knowl.Discov.1(3) (1997)291–316.

[5] M. Mazurowski, P.Habas, J. Zurada, Training neural networkclassiﬁers for medicaldecisionmaking:theeffectsofimbalanceddatasetsonclassiﬁcation performance,NeuralNetw.21(2)(2008)427–436.

[6] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid- based approaches, IEEE Trans. Syst. Man, Cybern. C: Appl. Rev 42 (2012) 463– 484, doi: 10.1109/TSMCC.2011.2161285.

[7] H.He,E.Garcia,Learningfromimbalanceddata,IEEETrans.Knowl.DataEng. 21(9)(2009)1263–1284.

[8] G.E.A .P.A . Batista, R.C. Prati, M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newsl. 6 (2004) 20, doi: 10.1145/1007730.1007735.

[9] N.Chawla,K.Bowyer,SMOTE:syntheticminorityover-samplingtechnique,J. Artif.Intell.Res.16(2002)321–357.

[10] A.Estabrooks,T.Jo,N.Japkowicz,Amultipleresamplingmethodforlearning fromimbalanceddatasets,Comput.Intell.20(2004)18–36.

[11] T.Dietterich,Ensemblemethodsinmachinelearning,in:ProceedingMCS‘00 ProceedingsoftheFirstInternationalWorkshoponMultipleClassiﬁerSystems, 2000.

[12] S. Hido, H. Kashima, Y. Takahashi, Roughly balanced bagging for imbalanced data, Stat. Anal. Data Min. 2 (2009) 412–426, doi: 10.1002/sam.10061. [13] C.Seiffert,T.M.Khoshgoftaar,J.VanHulse,A.Napolitano,RUSBoost:ahybrid

approachtoalleviatingclassimbalance,IEEETrans.Syst.ManCybern.A:Syst. Hum.40(2010)185–197.

[14] A.D.Pozzolo,O.Caelen,R.Johnson,G.Bontempi,Calibratingprobabilitywith undersamplingforunbalancedclassiﬁcation,IEEESymp.Comput.Intell.Data Min(2015)159–166.

[15] A.D.Pozzolo,O.Caelen,G.Bontempi,Whenisundersamplingeffectivein un-balancedclassiﬁcationtasks?in:ProceedingsofConferenceonMachine Learn-ingandKnowledgeDiscoveryinDatabases,2015.

[16] G.Weiss,F.Provost,Learningwhentrainingdataarecostly:theeffectofclass distributionontreeinduction,J.Artif.Intell.Res.19(2003)315–354. [17] N. Chawla, L. Hall,A. Joshi, Wrapper-basedcomputation and evaluation of

sampling methodsforimbalanceddatasets,in:Proceedingsofthe1st Inter-nationalWorkshoponUtility-BasedDataMining,ACM,2005.

[18] S.Wang,X.Yao,Multiclassimbalanceproblems:Analysisandpotential solu-tions,IEEETrans.Syst.ManCybern.B:Cybern.42(4)(2012)1119–1130. [19] F.Provost,Machinelearningfromimbalanceddatasets101,in:Proceedingsof

AAAI’2000WorkshoponImbalancedDataSets,2000.

[20] M.Maloof,Learningwhendatasetsareimbalancedandwhencostsare un-equaland unknown,in:ProceedingsofWorkshoponLearning from Imbal-ancedDataSetsII,2003.

[21] Z.Zhou,X.Liu,Trainingcost-sensitiveneuralnetworkswithmethods address-ingtheclassimbalanceproblem,IEEETrans.Knowl.DataEng.18(1)(2006) 63–77.

[22] F.Provost,P.Domingos,Treeinduction forprobability-basedranking,Mach. Learn.52(3)(2003)199–215.

[23] A.Niculescu-Mizil,R.Caruana,Predictinggoodprobabilitieswithsupervised learning,in: Proceedingsofthe 22ndInternational Conference onMachine Learning,2005.

[24] H.Narasimhan,R.Vaish,S.Agarwal,Onthestatisticalconsistencyofplug-in classiﬁers for non-decomposable performance measures, in: Proceedingsof AdvancesinNeuralInformationProcessingSystems,2014.

[25] Z.Lipton, C.Elkan,B. Naryanaswamy, Optimalthresholding ofclassiﬁersto maximizeF1measure,in:ProceedingsofConferenceonMachineLearningand KnowledgeDiscoveryinDatabases,2014.

[26] C.Elkan,Thefoundationsofcost-sensitivelearning,in:Proceedingsof Interna-tionalJointConferenceonArtiﬁcialIntelligence,2001.

[27] L.Kuncheva,CombiningPatternClassiﬁers:MethodsandAlgorithms,John Wi-ley&Sons,2004.

[28] L. Breiman, Bagging predictors, Mach. Learn 24 (1996) 123–140, doi: 10.1007/ BF00058655.

[29] E. Bauer, R. Kohavi, An empiricalcomparison ofvotingclassiﬁcation algo-rithms:bagging,boosting,andvariants,Mach.Learn.36(1)(1999)105–139. [30] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning,

SpringerSeriesinStatisticsSpringerNewYorkInc.,NewYork,NY,USA,2009. [31] J.Błaszczy´nski,J.Stefanowski,Ł.Idkowiak,Extendingbaggingforimbalanced data,in:Proceedingsofthe8thInternationalConferenceonComputer Recog-nitionSystem,CORES,2013.

[32] C.Drummond,R.Holte,C4.5,classimbalance,andcostsensitivity:why un-der-samplingbeatsover-sampling,in:ProceedingsofWorkshoponLearning fromImbalancedDataSetsII,2003.

[33] J.Díez-Pastor,J.Rodríguez,C.García-Osorio,RandomBalance:Ensemblesof variablepriorsclassiﬁersforimbalanceddata,Knowle.-BasedSyst.85(2015) 96–111.

[34] V.Sheng,C.Ling,Thresholdingfor makingclassifierscost-sensitive,in: Pro-ceedingsofNationalConferenceonArtificialIntelligence,AAAI,2006. [35] J.Hernández-Orallo,P.Flach,C.Ferri,Aunifiedviewofperformancemetrics:

Translatingthresholdchoiceintoexpectedclassiﬁcationloss,J.Mach.Learn. Res.13(2012)2813–2869.

[36] S. Pang, J. Gong, C5.0 classiﬁcation algorithm and application on individual credit evaluation of banks, Syst. Eng. – Theory Pract. 29 (2009) 94–104, doi: 10.1016/S1874-8651(10)60092-0.

[37] B.Zadrozny,C.Elkan,Learningandmakingdecisionswhencostsand proba-bilitiesarebothunknown,in:ProceedingsofSeventhACMSIGKDD Interena-tionalConferenceonKnowledgeDiscoveryinDataMining,ACM,2001. [38] G.Panchal,A.Ganatra,Y.Kosta,Behaviouranalysisofmultilayerperceptrons

withmultiplehiddenneuronsandhiddenlayers,InternationalJournalof Com-puterTheoryandEngineering3(2)(2011)332–337.

[39] J.Demšar,Statisticalcomparisonsofclassiﬁersovermultipledatasets,J.Mach. Learn.Res.7(Jan.)(2006)1–30.

[40] J.Platt,Probabilisticoutputsforsupportvectormachinesandcomparisonsto regularizedlikelihoodmethods,Adv.LargeMarginClassif.10(3)(1999)61–74. [41] B.Wallace,I.Dahabreh,Classprobabilityestimatesareunreliablefor imbal-anceddata(andhowtoﬁxthem),in:ProceedingsofIEEE12thInternational ConferenceonDataMining,2012.

[42] B.Wallace,I.Dahabreh,Improvingclassprobabilityestimatesforimbalanced data,Knowl.Inf.Syst.41(1)(2014)33–52.

[43] S.Wang,X.Yao,Diversityanalysisonimbalanceddatasetsbyusingensemble models,in:ProceedingsofIEEESymposiumonComputerIntelligenceandData Mining,CIDM’09,IEEE,2009.

GuillemCollell is a Ph.D. student at the Department of Computer Science at KU Leuven. He received both a B.S. in Psychology and a B.S. in Mathematics from the Au- tonomous University of Barcelona and a Research Mas- ter in Neuroeconomics from the Maastricht University. He was a Visiting Scholar at the MIT Neuroeconomics lab where he did research on imbalanced classiﬁcation. His current research interests include deep learning models to integrate vision and language as well as more theoretical and fundamental problems in machine learning.

DrazenPrelec is the DigitalEquipmentCorp.Leadersfor GlobalOperationsProfessorofManagement and a Professor of Management Science and Economics at the MIT Sloan School of Management. Prelec also holds appointments in the Department of Economics and in the Department of Brain and Cognitive Sciences. He was a Junior Fellow in the Harvard Society of Fellows, and has received a number of distinguished research awards, including the John Simon Guggenheim Fellowship. Prelec holds an AB in applied mathematics from Harvard College and a Ph.D. in experimental psychology from Harvard University.

Kaustubh R. Patil is currently a Research Fellow at the Brain and Behavior institute at the Research Center Juelich. Patil is also aﬃliated with the MIT Sloan Neuroe- conomics Lab where he was a Wellcome Trust-MIT fellow. His work is primarily concerned with developing machine learning and data mining methods to understand biologi- cal systems, especially in microbiology and neuroscience. He holds a Bachelor’s degree in electronics engineering from the Shivaji University, a Master’s degree in artiﬁ- cial intelligence from the University of Porto and a Ph.D. in bioinformatics from the Max-Planck Institute for Com- puter Science.