Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
A
simple
plug-in
bagging
ensemble
based
on
threshold-moving
for
classifying
binary
and
multiclass
imbalanced
data
Guillem
Collell
a,d,
Drazen
Prelec
a,b,c,
Kaustubh
R.
Patil
a,e,∗aMITSloanNeuroeconomicsLab,MassachusettsInstituteofTechnology,Cambridge,MA02139,USA bDepartmentofEconomics,MassachusettsInstituteofTechnology,Cambridge,MA02139,USA cBrain&CognitiveSciences,MassachusettsInstituteofTechnology,Cambridge,MA02139,USA dComputerScienceDepartment,KULeuven,Heverlee3001,Belgium
eInstituteofNeuroscienceandMedicine,Brain&Behaviour(INM-7),ResearchCentreJülich,Jülich52425,Germany
a
r
t
i
c
l
e
i
n
f
o
Articlehistory: Received 10 October 2016 Revised 26 July 2017 Accepted 29 August 2017 Available online xxx Communicated By Dr. V. Palade Keywords: Imbalanced data Binary classification Multiclass classification Bagging ensembles Resampling Posterior calibrationa
b
s
t
r
a
c
t
Classimbalancepresentsamajorhurdleintheapplicationofclassificationmethods.Acommonlytaken approachistolearnensemblesofclassifiersusingrebalanceddata.Examplesincludebootstrapaveraging (bagging)combined witheither undersamplingoroversamplingoftheminority classexamples. How-ever,rebalancingmethodsentailasymmetricchangestotheexamplesofdifferentclasses,whichinturn canintroduce theirownbiases.Furthermore, thesemethods oftenrequirespecifying the performance measureofinterestapriori,i.e.,beforelearning.Analternativeistoemploythethresholdmoving tech-nique,whichapplies athresholdtothecontinuousoutputofamodel,offeringthepossibilitytoadapt toaperformancemeasureaposteriori,i.e.,aplug-inmethod.Surprisingly,littleattentionhasbeenpaid tothiscombinationofabaggingensembleandthreshold-moving.Inthispaper, westudythis combi-nationanddemonstrateitscompetitiveness.Contrarytotheotherresamplingmethods,wepreservethe
naturalclassdistributionofthedataresultinginwell-calibratedposteriorprobabilities.Additionally,we extendtheproposedmethodtohandlemulticlass data.Wevalidatedourmethodonbinaryand mul-ticlassbenchmarkdatasets byusingboth, decisiontrees and neuralnetworks as baseclassifiers. We performanalysesthatprovideinsightsintotheproposedmethod.
© 2017TheAuthor(s).PublishedbyElsevierB.V. ThisisanopenaccessarticleundertheCCBYlicense.(http://creativecommons.org/licenses/by/4.0/)
1. Introduction
Dealing with a class imbalance in classification is an impor-tantproblemthatposesmajorchallenges[1].Imbalanceddatasets frequently appear in real-world problems, such as in fault and anomaly detection [2,3], fraudulent phone call detection [4] and medicaldecision-making[5],tonameafew.Standardlearning al-gorithms are often guided by global error rates and hence may ignore instances of the minority class, leading to models biased towards predictingthe majorityclass.Severalmethodshavebeen proposedtoalleviatethisproblem(see,e.g.,[6,7]forreviews). Of-ten,afirstchoiceconsistsofpreprocessingthedatabyresampling tobalancetheclassdistribution[8,9].Thisisoftenachievedby ei-ther randomly oversampling(ROS) the minority class [9] or ran-domly undersampling(RUS)the majorityclass [10].More sophis-∗ Corresponding author at: MIT Sloan Neuroeconomics Lab, Massachusetts Insti-
tute of Technology, Cambridge, MA 02139, USA.
E-mailaddresses:[email protected](G. Collell), [email protected](D. Prelec), [email protected](K.R. Patil).
ticated methods that generate synthetic minority class instances arealsoapopularchoice,e.g.,thesyntheticminorityoversampling technique (SMOTE [9]). We will collectively call these data pre-processing methods as rebalancing mechanisms as they, in gen-eral,aimtomake thetrainingdatamore balanced.Thiswill also avoidconfusionwithother resamplingmechanisms,e.g., the sim-ple bootstrap. Rebalancingis often combined withensembles as they show superior performance to a single classifier [11]. Many such combinations have been shown to be effective for imbal-anceddata classification [6,12,13]. However, there areseveral po-tentialdrawbacksofrebalancingmethods:(1)potentiallossof in-formativedatawhenundersampling,(2)changesintheproperties ofthedata, suchasasymmetric changes inthe densityof exam-plesofdifferentclasses,whichinturncancausethemodelsto in-duceunwanted biases,e.g.,miscalibratedposteriorprobability es-timates[14,15],(3)itisoftennotevidentwhichclassdistributions tousefora givendatasetandaperformance measureof interest [16](wrapper methods [17] can be employed to tune the model fora givenmeasure,butthey are computationally expensiveand oftencater towardsonlyasinglemeasure,e.g.,eitheraccuracyor http://dx.doi.org/10.1016/j.neucom.2017.08.035
0925-2312/© 2017 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license. ( http://creativecommons.org/licenses/by/4.0/)
F1-score),and(4)itisnontrivialtoextendthesamplingheuristics normallydefinedforbinarydatatomulticlassdataastherecanbe multipleminority/majorityclasses[18].
Moving decision thresholds isanother technique to deal with class imbalance. The main difference between rebalancing and threshold-based methods is that the former relies on data pre-processingbefore learning happens, whereas the latter relies on manipulating the continuous output of a learned model, e.g., classweights orposterior probabilities.Among other proponents, Provost[19]advocatedforthreshold-moving asa methodtodeal withclassimbalance.Nevertheless,surprisingly,littleattentionhas beenpaidto thistechnique, oftentoan extentthatitisnot even consideredforcomparisonwhennewmethodsareproposed.
While this technique has been utilized in combination with some popularlearning methods includinga smallensemble [19– 21]. However, to our knowledge, the combination of threshold-movingwitha bagging ensemblehas notbeen thoroughly inves-tigated.As isevident, threshold-moving dependsonreliable con-tinuousestimationoftheoutput;therefore,baggingensemblesare a good candidateto combine withthreshold-moving as they are knowntoprovidegoodprobabilityestimates[22,23].Inthiswork, westudythreshold-movingcombinedwithbaggingensemblesand showthatitisacompetitivemethodwithseveraladvantages.
In particular, we seek a method that provides well-calibrated posterior probability estimates. An important advantage of such a method is that it can be utilized as a plug-in method where the thresholdcan be set a posteriori, i.e., at the test phase. This providesanopportunitytoachievegoodperformanceondifferent measuresusingthesamemodel[24].Thisisamajorimprovement overother methods,e.g., cost-sensitive methods andrebalancing, which require the performance measure of interest to be speci-fied at the learning phase. Here, we propose Probability Thresh-oldbagging (PT-bagging)that, aswe will show,passes asa plug-inmethod.Themain motivationbehindPT-baggingisto leverage theadvantagesofbaggingwhileavoidingtheproblemsthat rebal-ancingmethodsinevitablyentail,asdescribedabove.Theproposed methodPT-bagging addresses those problemsand possesses sev-eraldesirableproperties:
(1)It isaplug-in methodthat maximizesaperformance mea-sureofinterestwithoutretraining,butratherbyjust apply-ing anappropriatethresholdaposteriori.Bycontrast, rebal-ancing methods are not flexible and need computationally expensiveparametertuning,e.g.,tofindwhichclass propor-tionstouseforlearningviaawrapperapproach[17]. (2)It consistently performs close to the best possible
macro-accuracy and macro F1 performances without the need to empirically find the optimal threshold (e.g., by cross-validation).Obtainingavalidationsetfortuningcanbe com-putationally costly, mightnot always be possible,or might befinanciallyprohibitive(e.g.,duetodatacollectioncosts). (3)Itcanbeextendedtohandlethemulticlasssettingwhen
ap-propriate thresholds fora performance measure ofinterest areavailable,e.g.,macro-accuracy.
We provide a theoretical analysis on when optimal macro-accuracyperformanceisguaranteed.However,forothermeasures, such as the macro F1-score, it is not always possible to obtain a closed-formexpression forthe optimal thresholds [25]. Never-theless, we show that ournew, simple andsensible threshold is closetotheoptimalthreshold,andthatPT-baggingachieveshigher macroF1-score performance compared to other methods. In this respect,wemaketwoadditionalcontributions:(1)theproposalof athresholdformaximizingthemacroF1-score,and(2)a compar-ison andanalysis of the full potential of the methods, which we define as their maximum attainable performance if the optimal thresholdwereknown.
Therestofthispaperisorganizedasfollows:inSection 2we provide the relevant background, describe some popular resam-plingmethods, anddiscusstheir potential flaws. InSection 3,we describeourproposed method,PT-bagging,andprovidea theoret-icaljustificationofits performance.InSection 4,we describeour experimentalsetup. InSection5,wepresentacomprehensiveset ofempiricaltestsanddiscusstheresults.Finally,wecommenton the implications of our findings and propose future lines of re-search.
2. Background
We considerthe standard classificationsetting wherea learn-ingalgorithmlearnsfromthetrainingdatatuples
{
xi,yi}
Ni=1,where xi∈Xare features that can be either continuous, ordinal or cate-goricalandyi ∈C={
1,..., m}
arediscreteclasslabels.Thegoalof learningistoestimateapredictor fˆ:X→Cthatapproximatesthe true underlying function f: X→C. The model learned, fˆ, is then used tomake predictionson unseentest data{
xj}
Mj=1.For binary data,wehaveyi∈ {0,1}andwithoutlossofgeneralitywedenote theminorityclass(i.e.,theclasswithlowerfrequencyinthe train-ing data) asthe class 1.We refer to the class-specific thresholds asλ
i, i=1,..., m.Theirapplicationtotheclassifieroutputis de-scribedbelow(Algorithm1,step2.4).Wemaketwoassumptions: (1)theprobabilitydistributionofthetestdataissimilartothatof thetrainingdata,and(2)theclassdistributionofthetrainingdata provides anaccurate estimate oftheirrespectiveunderlying prior probabilities.2.1. Performancemeasuresforimbalanceddata
Thecommonlyusedmeasureofaccuracy(correctclassification rate)isagoodmetricwhendatasetsarebalanced.However,itcan bemisleadingforimbalanceddata.Forexample,thenaïvestrategy ofclassifyingalltheexamplesintothemajorityclasswouldobtain 99%accuracyinadatasetcomposedof99%examplesofthisclass. Therefore,othermeasures arenecesarywhendealingwith imbal-anceddata.
Several performance measures have been proposed in imbal-ancedlearning,allofwhicharecomputablefromtheelementsof theconfusionmatrix(Table1).Someofthemostextensivelyused measuresare: TNR = TN TN +FP ; Recall
(
= TPR)
= TP TP +FN ; Precision = TP TP +FP ; FPR = FP TN +FP Macro −accuracy = TPR +TNR 2 ; G −mean = TPR × TNR ;F1 −score = 2 × Precision × Recall
Precision +Recall =
2TP
2TP +FP +FN
ThemacroF1-scoreisawidelyusedmeasureandiscalculated byconsideringeachclassseparatelyasthepositiveclassandthen averaging their corresponding F1-scores. Inaddition, the receiver operatingcharacteristic(ROC)curveisoftenemployed[6].TheROC curveisgeneratedbyplottingtheTPR(y-axis)andtheFPR(x-axis) whilemovingthroughthewholespectrumofdecisionthresholds.
Table1
Confusion matrix in binary classification.
Predicted positive Predicted negative Actual positive TP (true positive) FN (false negative) Actual negative FP (false positive) TN (true negative)
Thearea undertheROCcurve(AUROC)– generallycomputed nu-merically– isoftenameasureofinterestthatprovidesasummary oftheROCcurve asasinglenumber.However, ROCcurvessuffer from a serious limitation forevaluating performance under class imbalance. When data are highlyimbalanced, ROC curves fail to capturelarge changes in the numberoffalse positives (FP)since the denominatorofFPR islargely dominatedby TN.Forthis rea-son,precision-recall(PR)curvesarepreferredoverROCcurvesfor imbalanced data[7]andarethereforeour choicehere. PRcurves are computed by moving the decision thresholdand plotting re-call(x-axis)andprecision(y-axis).AnalogoustotheROCcurve,the area underthePR curve(AUCPR)istypically employedasa sum-marymeasure.
2.2. Learningfromimbalanceddata
Many solutions have been proposed to deal with imbalanced data. These solutions mainly fall into one of the following three majorstrategies:(1)cost-sensitivelearning, (2)rebalancing mech-anisms,and(3)threshold-moving.Thesestrategiesarebriefly dis-cussedbelow(foradetailedaccountsee[6,7,19]).
(1) Cost-sensitivelearningplacesdifferentmisclassification costs onthedifferentclasses.Highermisclassificationcostsforthe minorityclasscanbeimposedbyalossfunction.Infact,by altering the training class distribution, rebalancing mecha-nismseffectivelyimposedifferentmisclassificationcostsand canbedeemedasequivalenttocost-sensitivelearning[26]. For this reason, we only consider rebalancing methods in thisarticle.
(2) Rebalancing mechanisms resample the data to make the trainingdatamore balanced.Such datapreprocessing solu-tions do not require modifying the learning algorithm and thereforecanbeemployedwithexistinglearningalgorithms. (3) In threshold-moving, a model is learned from the data set witheithertheoriginalormodifiedclassproportionsandits continuousoutputisconvertedintoaclasslabelbyapplying anappropriatethreshold.
The techniquesrelevant tothis work arediscussed inthe fol-lowingsections.
2.3. Baggingensemble
Ensemble methodsuseasetofclassifierstoimproveupon in-dividualclassifiers’performance[27].Twopopularensemble tech-niques are boosting andbagging [28]. In thiswork, we focus on baggedensemblessince,ingeneral,boostedensemblesdonot per-formbetterwithimbalanceddata[6].
In bagging (Algorithm 1), a set of base classifiersare learned from different samples of the given training set. At the predic-tion time, the outputs of the baseclassifiers are aggregated. The mainprinciplebehindbagging’sperformance,giventhateachbase
classifierperformsabovechance,isthattheaveragingreducesthe varianceofindividualclassifierswithoutincreasingtheirbias. Bag-ginggenerallyperformswellwithunstablebaselearnersforwhom small changes in the training data lead to large changes in the learned model [28]. For example, decisiontrees (DT) andneural networks(NN) are unstable classifiers andthus suitable for bag-ging.
Clearly,different sampling mechanisms(Algorithm1,step 1.2) anddifferent thresholds (step 2.4) can be used which will yield differentmodelsanddifferentoutputs.Allthemethodstestedhere useavariationofAlgorithm1,andwewilldiscusstheirsampling mechanismsandthresholdsinthenextsection.
Furthermore, different aggregation methods can be used to combine the outputs of the base classifiers, e.g., hard-voting to make crispclass assignmentsor soft-voting forprobabilistic pre-dictions. It is known that soft-voting generally provides better performance than hard-voting [21,29,30].We, therefore,use soft-votinginthiswork(Algorithm1,step2.3).
2.4.Resamplingmechanisms
In this section, we briefly describe the different resampling mechanismsthatcanbeusedinstep1.2inAlgorithm1.Oneofthe simplestwaysto resampleistosampleeach instance withequal probability with replacement,i.e., a non-parametric bootstrap, as inthe original bagging algorithm [28].This samplingmechanism is not currently popular with imbalanced data as it preserves the imbalanced class distribution, which is thought to be detri-mentaltolearning. However, we argueherethat thismechanism works well when appropriate thresholds are available. Our pro-posed method, PT-bagging (discussed below), uses this sampling mechanism.
Commonly used resampling mechanisms for imbalanced data (here called rebalancing methods) try to balance the class pro-portions. Perhaps the simplest and most popular undersampling mechanism used for ensemble learning is referred to as exactly balancing (EB). EB resampling preserves the minority class in-stances while randomly undersampling majority class instances suchthat theclassproportionsareexactlybalanced.Roughly bal-ancing (RB) is a powerful variation of EB that improves perfor-manceby increasing diversity of the classifiers[12]. RB, like EB, preservestheminority class instances butundersamplesthe ma-jorityclassinstancesasdeterminedbyanegativebinomial distri-bution,ineffectroughlybalancingtheclassproportions. Oversam-pling mechanisms, on the other hand, over-sample the minority classexamples.Previousstudiesindicatethat undersampling gen-erally performs better than oversampling [6,31,32]. Furthermore, undersampling is computationally more efficient than oversam-plingsinceitdiscardsalargepartofthetrainingdata.
More sophisticated hybrid methods that combine oversam-pling and undersampling have been proposed. One of the most popular such methods is the synthetic minority oversampling Algorithm1
Pseudo-code for the bagging ensemble. 1. Learning:
1.1. Input: A training set S={(xi , yi )}N i =1;yi ∈C={1 ,...,m}, where m is the number of class labels and nthe number of base classifiers. 1.2. Generate n training data sets by sampling 1S.
1.3. Learn n base classifiers from each sample. 2. Prediction:
2.1. Input: an instance x,nbase classifiers, a probability threshold λk for each class k.
2.2. Each base classifier igives a probabilistic estimate Pi (y =k| x)for each label k∈{ 1 ,...,m} given a test instance x. 2.3. Compute averages of probabilistic predictions for each class k∈{1 ,..., m}: P(y =k|x)= 1
n n
i Pi (y =k|x). 2.4. Rank each class k∈{1 ,...,m}according to: P(y =k|x)/λk .
2.5. Assign the label for which the score in 2.4 is the highest.
technique(SMOTE)[9],whichgeneratesnewminorityclass exam-ples by interpolation whileundersampling the majority class ex-amples.SMOTEoftenperformswellincombinationwithabagging ensemble. More recent methods such as Random Balance (RNB) havecombinedinsightsfromboth,SMOTEandRB[33].Specifically, RNB randomlyselects a class proportionandoversamples one of the classesaccordingly with interpolated examples using SMOTE whiletheotherclassisundersampled.RNBaimsatincreasingthe diversityinthebaseclassifiers,whichinturnoftenimprovesthe ensembleperformance.Itisimportanttonoticethatthethreshold of0.5isnormallyusedwiththeserebalancingmethods.
Rebalancing methods,however, presenta numberof potential shortcomings. An important side effect of rebalancing is that it canleadtomiscalibratedposteriorprobabilityestimates,asrecent studieshavefound[14,15].Another– usuallyunnoticed– potential problemofresamplingtechniquesisthatofthepriorshift,i.e., dif-feringtraining(balanced)andtest(naturalclassproportion) distri-butions[15].Thismightcreateadditionalproblemssincea model learnedonabalanceddataisthenevaluatedinadifferentsetting. Lastly,theoriginaldensityofexamplesisasymmetricallymodified forthe classes(e.g.,by undersampling themajorityclassor over-samplingtheminorityclass),whichmightleadtoundesiredbiases inthemodel.Bycontrast,thesimplebootstrap samplingdoesnot modifythe data distribution,which ledus tohypothesize that it willbelesspronetotheseproblems.
3. Probabilitythresholdbagging(PT-bagging)
Severalstudieshaveconsideredthreshold-movingasamethod to deal with class imbalance [20,34]. For example, Maloof [20] compared threshold-moving to RUS using a single classifier andconcludedthat theyachieve similar performance interms of ROC.However,tothebestofourknowledge,previousstudieshave notconsideredthreshold-movingincombinationwithbagging.
The basic idea behind PT-bagging is to leverage bagging (Algorithm 1) to first obtain well calibrated posterior esti-mates and appropriately threshold them afterward according to the performance measure to be maximized. PT-bagging learns base classifiers from simple bootstrap replicates of the original data set, which preserves the class distribution. Then, following Algorithm 1, probabilistic predictions for each class k are aver-agedacross the baseclassifiers toobtain a final posterior proba-bilityestimate P
(
y=k|
x)
(step 2.3).Togetaclasslabel, first,the probability estimateis transformed intoa scoreby dividing itby itsrespective class thresholdλ
k (step 2.4). The classk for which thisscore P(
y=k|
x)
/λ
k is the largest is then assigned. Accord-ingtothecategorizationproposedbyHernandez-Oralloetal.[35], PT-baggingemploys a score-driven threshold,asopposed to,e.g., a fixed threshold of 0.5 used by rebalancing methods. Crucially, inPT-baggingthethresholdscan beadapted tomaximize a mea-sureofinterest.Tomaximize macro-accuracy,theoptimal thresh-oldforclassk isequaltothe classpriorin thetrainingdata,i.e.,λ
k =P(
y=k)
(seeTheorem1).Forinstance,iftheaveraged poste-riorprobabilityforclass0isP(
y= 0|
x)
= 0.7,andthepriorofthis classis0.8(i.e.,P(
y=0)
= 0.8),thenits scoreis0.7/0.8 =0.875, andconsequentlythescoreforclass1is0.3/0.2=1.5.Thus,class 1willbe assignedeven though ithasa lower posterior.The cal-culation of this score is identical in the multiclass setting. We henceforthspecifythemethodwiththethresholdthat maximizes a particularmeasure using a subscript: PTMA-bagging for macro-accuracyandPTF1-baggingformacroF1-score.Weemploythe no-tationPT-baggingwithouta subscriptwhenwe refertomeasures that are independent ofthe threshold (e.g.,AUCPR and posterior probabilitycalibration).3.1. Thresholdformaximizingmacro-accuracy
Thefollowingtheoreticalresultaimstoprovideinsightintothe mechanism behindthe performance of threshold-moving for the macro-accuracy measure. The reader should note that the main message of Theorem 1 is not finding the optimal thresholds (or misclassificationcosts)forthemacro-accuracymeasure,whichare knownto beequalto theinverseofthepriors, butrathera con-structiveproofofanalgorithmthatmaximizesmacro-accuracyfor binaryand multiclassdata. Theorem 1’sproof showsthat a nec-essarycondition fora method to maximize macro-accuracy is to have goodestimates of the posteriorprobabilities P
(
y=k|
x)
for eachclassk.Tosimplifynotationandimprovereadability,we con-siderthebinaryclassproblem.Thesameprooftriviallygeneralizes toamulticlasssetting.Theorem 1. Proposition: Let P
(
y= j)
be the prior of class j and P(
y= j|
x)
bethetrue(unknown)posteriorprobabilityofclassjgiven x. If proportions of each class are unchanged from training to test, thenpredictingtheclassksuchthatk= argmax
j
P
(
y= j|
x)
P
(
y=j)
(1)maximizesthemacro-accuracy.
Proof. LetC=
{
1,0}
be the class labels (positive = 1 and nega-tive = 0). Letd be a random variablecorresponding to the class predictedbyaclassifier.Thus,P(
d=k|
x)
isthepredicted probabil-ityofamodelforclasskgivenx.We shall first derive the population expression of macro-accuracy.Recallthatmacro-accuracyisdefinedas
(
TPR + TNR)
/2 where TPR = TP/P andTNR=TN/N. Recall that P = TP + FN and N = TN+ FP. Let us first derive a continuous expression for TPR. Notice first that TP/(
P+N)
=∫R
P
(
y=1|
x)
P(
d=1|
x)
p(
x)
dx andthat P/(
P + N)
= P(
y= 1)
.Thus, theratio ofthe first expres-sionoverthesecondisequaltoTPR=TP/P.Thatis,TPR = ∫RP
(
y=1|
x)
P(
d= 1|
x)
p(
x)
dx P(
y=1)
The derivationof TNRisanalogous. Thus, dividing thesumof TPRandTNRby2yieldstheexpressionofmacro-accuracy:
1 2 ∫RP
(
y = 1|
x)
P(
d= 1|
x)
p(
x)
dx P(
y=1)
+1 2 ∫RP(
y= 0|
x)
P(
d= 0|
x)
p(
x)
dx P(
y= 0)
Byenteringbothtermsintothesameintegral:
1 2 R
P(
y= 1|
x)
P(
y=1)
P(
d= 1|
x)
+ P(
y=0|
x)
P(
y=0)
P(
d=0|
x)
p(
x)
dx (2)Therefore,maximizingtheintegralin(2)isequivalenttoasking fortheoptimalchoiceofP
(
d=1|
x)
andP(
d=0|
x)
foragivenx– in other words, how to assign class labels1 or 0 ina wise way (perhapsprobabilistically), given x. Notice that the bracketinside theintegralin(2)isnothingbutaconvexcombination:P
(
y= 1|
x)
P(
y= 1)
β
x+P
(
y= 0|
x)
P
(
y=0)
(
1 −β
x)
(3)where we defined
β
x:=P(
d=1|
x)
. Thus, by monotonicity, the convex combination (3) is maximized at x if and only if we placeprobability 1to thelargest term.Thatis tosay, anoptimal methodassignsthepositive class1withprobability 1ifthetermP
(
y= 1|
x)
/P(
y= 1)
isthelargestorassignsthenegativeclasswith probability1ifP(
y=0|
x)
/P(
y=0)
isthelargest.Thatis,β
x:=P(
d= 1|
x)
= 1 , i f PP(y(y==11|x))> P(y=0|x) P(y=0) 0 , Otherwise (4)Thisisindeedthemethodproposedabove,inEq.(1). Critically, we notethat theoptimalmethod ofEq.(4)willnot havethetrueP
(
y=1|
x)
athandbutanestimation P(
y=1|
x)
in-stead. Notice alsothat all the other quantitiesneeded tomake a decision in (4) are known constants, i.e., P(
y=1)
and P(
y=0)
(i.e., the thresholds). Therefore, the good performance of the method totally relies on having good posterior probability esti-matesP(
y= 1|
x)
.3.2. ThresholdformaximizingmacroF1-score
Unlikemacro-accuracy,thereisnoclosed-formexpressionfora thresholdthat maximizesthemacroF1-score[25].Ingeneral, in-creasingthethresholdincreasestheprecisionoftheminorityclass attheexpense ofdecreasedrecall.It is,however,knownthat 0.5 istheupperboundontheoptimalthresholdfortheF1-score[25]. In the absence of any additionalinformation and tuning, we set thethresholdfortheminorityclassto
(
P(
y=1)
+0.5)
/2.The ra-tionalebehindthisthresholdisthatitissetmidwaybetweenthe threshold formaximizing the averagerecall (i.e., the trainingset classprior)andtheupperboundonthethresholdformaximizing theF1-score(i.e.,0.5).Notethatmethodshavebeenproposedtoestimatetheoptimal thresholdformaximizingtheF1-scorebutthey requireadditional dataforfine-tuningandaresusceptibletotheWinner’scurse[25]. Using tuning methods is difficult for most of the datasets used here, as they are relatively small. Importantly, our aimhere was totestatuning-freethreshold.However,findingbetterthresholds, whenpossible,canconceivablyresultinfurtherimprovements.
The readershould notethat theproposed thresholdis defined forthebinaryclasssetting.Itsextension tothemulticlasssetting willbeconsideredinfuturework.
4. Experimentalsetup
We used the R statistical environment (http://www.r-project. org/) withcorresponding packagesanddefaultparameters unless otherwisespecified.
Selecting a proper base classifier learning algorithm (Algorithm 1, step 1.3) is crucial as our method relies on good posterior probability estimates at the test time. Previous work has shown that bagged probabilistic decision trees (DT) provide goodposteriorprobability estimates[22].Theseare,in fact,more reliablethanotherclassifierssuchaslogisticregression[22].Here, we employ unpruned J48 decision trees, an implementation of the C4.5 trees available in the “RWeka” package [30,36]. Even though it is common to apply Laplace smoothing to the leaf probabilities ofthe individual decisiontrees, it can be detrimen-tal for imbalanced data [37]. Therefore, we did not use Laplace smoothing.
In order to evaluate the generality of our method, we also employed neural networks (NN) as base classifiers. Bagged neu-ral networks are known to offer well calibrated posterior proba-bility estimates [23], and thus we hypothesized that this would be a suitable choice for our method. We used a single hidden layer with logistic units and softmax output, implemented with the“nnet” package.Followingaruleofthumb,wesetthenumber ofhiddenunitsto2/3oftheinputdimensionplusthenumberof classes[38].
Westudiedtheeffectofthenumberofbaseclassifiersby vary-ing them in {5, 10, 15, 25, 50, 100}. We ran 5×2-fold cross-validationforeachmethodoneachdataset.TheFriedmantestwas usedtotestifthereweredifferencesacrossthemethodsandifthe test passedat95% significance,a posthoc Nemenyi test was per-formedtoidentifyanypairwisedifferences[39].ApairedWilcoxon ranksumstestwasusedtocomparetwomethodsdirectly.
4.1. Datasets
We used 36 imbalanced binary data sets (Table 2): 14 from the HDDT repository (http://www3.nd.edu/∼dial/hddt/), 19 fromtheKEELrepository(http://sci2s.ugr.es/keel/imbalanced.php), and three from the UCI repository (https://archive.ics.uci.edu/ml/ datasets.html). Note that the evaluationwithneural network en-sembles includes 26 data sets that contain only numerical at-tributes.Only thecomplete instances ofdata were used andany constantattributeswereremoved.Table2showsasummaryofthe datasets usedin ourexperiments. The attributesofthe datasets arenumerical,categoricalornumerical-categoricalmixed. Forthe multiclasssetting,weused15 datasetsfromtheKEELrepository (Table3).
Table2
Overview of the binary data sets obtained from UCI, HDDT ∗and KEEL † repositories (names were shortened for
convenience).
Dataset #Inst #Attr #Num %Min Dataset #Inst #Attr #Num %Min
pima 768 8 8 34.5 br-y † 277 9 0 29.2 ion 351 34 34 35.9 cl0vs4 † 173 13 13 7.5 sonar 208 60 60 46.6 ecoli4 † 336 7 7 5.9 spectf ∗ 267 44 44 20.6 hab † 306 3 3 26.5 phon ∗ 5404 5 5 29.3 led7_xvs1 † 443 7 7 8.3 page ∗ 5473 10 10 10.2 pb-1-3vs4 † 472 10 10 5.9 ism ∗ 11,180 6 6 2.3 shut-0vs4 † 1829 9 9 6.7 letter ∗ 20,0 0 0 16 16 3.9 vow0 † 988 13 13 9.1 satim ∗ 6430 36 36 9.7 yst-2vs4 † 514 8 8 9.9 compu ∗ 13,657 20 20 3.8 yst4 † 1484 8 8 3.4 segm ∗ 2310 19 19 14.3 glass6 † 214 9 9 13.5 oil ∗ 937 49 49 4.4 new-th1 † 215 5 5 16.3 estate ∗ 5322 12 12 12 wisc † 683 9 9 34.5 hypo ∗ 20 0 0 24 6 6.1 car-gd † 1728 6 0 4 boun ∗ 3505 175 0 3.5 flare-F † 1066 11 0 4 cred ∗ 10 0 0 20 7 30 kdd † 1642 41 26 3.2 hrt-v ∗ 133 9 4 23.3 veh0 † 846 18 18 23.5 ab9-18 † 731 8 7 5.7 w-red-4 † 1599 11 11 3.3
Table3
Overview of the multiclass data sets, all from the KEEL repository.
Dataset #Inst #Attr #Num %Min #Class
ontraceptive 1473 9 6 22.6 3 dermatology 366 34 34 5.5 6 balance 625 4 4 7.8 3 penbased 1100 16 16 9.5 10 shuttle 2175 9 9 0.09 5 wine 178 13 13 27 3 yeast 1484 8 8 0.3 10 pageblocks 548 10 10 0.6 5 thyroid 720 21 21 2.4 3 ecoli 336 7 7 0.6 8 autos 159 25 15 1.9 6 glass 214 9 9 4.2 6 new-thyroid 215 5 5 13.9 3 hayes-roth 132 4 4 22.7 3 lymphography 148 18 3 1.3 4
4.2.Methodsusedforcomparison
For the binary class setting, we included EB-bagging as the baselinemethodalongwithRB-bagging,SMOTE-baggingand RNB-bagging as state-of-the-art methods (see Section 2.4 for details). TheparametersofSMOTEweresettothecommonlyusedsetting of500%oversamplingoftheminorityclasswith5nearest neigh-borsand 100% samplingof the majorityclass. To investigatethe benefit of using an ensemble in PT-bagging, we additionally in-cludedasingleclassifier (denotedas“single”)thatusedthe com-plete training data to learn, and employed the same threshold-movingmechanism asPT-baggingandidenticalparameters asits baseclassifiers.
For completeness, we compared Platt scaling [40] and PT-baggingonbinarydatasets.Plattscalingisawell-knownposthoc calibrationmethodthattransformscontinuousmodeloutputsinto (calibrated) probability estimates. It firstly fits a logistic regres-sion model to the outputs with the class labels as the depen-dentvariable.Toavoidover-fitting,weuseddatafrom3-fold cross-validationwithtransformedclass labelsderivedfromthetraining set,asintheoriginal paper[40].Thislogisticmodelwasthen ap-pliedto the test set outputs to obtain calibrated probabilities. A thresholdof0.5wasfinally appliedtothecalibratedprobabilities to obtain the test class labels. If the posteriors are indeed cali-bratedafterPlattscalingthenthissettingshouldyieldgood perfor-mance.Wedeemeditsinclusionasrelevantsince,likeourmethod, Plattscalingisaneasy-to-applyaposterioricorrectionofthemodel outputs.
4.3.Performanceevaluation
Weevaluatethemethodsonthreeperformancemeasures:area underthePR curve(AUCPR), macro-accuracy andmacroF1-score (see Section 2.1 fordetails). Note that AUCPR is computed using theposteriorprobability estimateswhile theother two measures requireclasslabelassignments.
The methods consideredherediffer intwo importantaspects: (1)theresamplingmechanismand(2)thethresholdused.Each re-samplingmechanismmayrequireitsownthresholdtoachieve op-timalperformance.However,oftenastandardthresholdisusedin practice(e.g.,0.5forEB-andRB-bagging).Toevaluatethe perfor-manceindependentlyofthethreshold,wedevisedanovelscheme calledthefullpotential.Thefullpotentialofamethodisdefinedas thebestperformance achievedoverall possiblethresholds onthe testset.Amethod’s performancecloseto itsfull potentialmeans thatthethresholdused ismoreattunedto theoptimalthreshold forthat method.Weapproximatedthefull potentialby searching theoptimalthresholdoveragridfrom0to1instepsof0.01.
Finally,weusedthestratifiedBrierscore(meansquarederror) toevaluatethecalibrationoftheposteriorprobabilities.TheBrier scoreofclasskiscalculatedastheaveragesquareddifferences be-tweentheestimatedprobabilityoftheexamplesfromclassk(i.e. P
(
y = k|
xi)
)andaperfectlyconfidentprobabilisticprediction(i.e., 1): BSk= 1 Nky∗ i=k 1 −P(
y=k|
xi)
2 (5)whereNkisthenumberofexamplesofclasskandy∗i =krefers to those examples for which k is the true class label. Averaging the Brier score of all the classes gives the stratified Brier score. The stratifiedBrierscoreis moreappropriate whenthereis class imbalance since it gives equal importance to all the classes and thusallowsanymiscalibrationoftheminorityclassestobespotted [41].
5. Resultsanddiscussion
In thissection, we compare the performance of theproposed method with other state-of-the-artmethods and provide further empiricalinsights.Forbrevity,whenensemblesofneuralnetworks performed similarly to those of decision trees, the results from neural networks ensembles are omitted. Unless otherwise indi-cated,wereportresultsonensembleswith100baseclassifiers. 5.1. Binarydatasets
Wefirstinvestigatedtheeffectsoftheensemblesize(Fig.1).It isworth mentioningthat, unsurprisingly, thearea undertheROC curve showed a much more clutteredpicture, which we omit in theinterestofspace.
Overall, the following general observations can be made: (1) all methods show improvementwithincreasing ensemblesize in allperformancemeasures. (2)PTMA-,RB-andEB-baggingperform better on macro-accuracy while PTF1-, SMOTE- and RNB-bagging performbetteronthemacroF1-score.Thisshowsthatdifferent re-samplingmechanismsaresuitable fordifferentperformance mea-sures; (3)PT-bagging – with appropriate thresholds – performed well in each ofthe evaluated measures, while the restof meth-odsperformedpoorlyinatleastoneofthem,e.g.,RB-bagging per-formed poorly in macroF1-score andAUCPR, while SMOTE- and RNB-bagging performed poorly in macro-accuracy. Finally, (4) as expected, EB-and RB-bagging performed similarly (although RB-bagging generallyfared betterwithdecision trees)andRNB-and SMOTE-bagging did not differ significantly. We discuss these re-sultsinmoredetailbelow.
5.1.1. Theareaundertheprecision-recallcurve(AUCPR)
As can be readily observed, PT-, SMOTE- and RNB-bagging showed,onaverage,overall betterperformance thanEB- and RB-baggingonAUCPRforallensemblesizes(Fig.1,left).Thisindicates that PT-, SMOTE- and RNB-bagging are generallyable to achieve comparatively higher average precision and recall (F-measures). The Friedman test revealed a significant difference between the methods(P<3e−12,forbothDTandNNensembles).
TheposthocpairwiseNemenyitestswithDTensemblesshowed anearlysignificantdifferencebetweenPT-baggingandEB-bagging (P ≈0.062),andbetweenRNB-andEB-bagging(P ≈0.083).Note that lower AUCPR generally implies a lower potential for the F-measures,irrespectiveofthethresholdused.Thus,whenusingDT ensembles, methods based onundersampling alone (i.e., EB-and RB-bagging)arelikelytohavealowerF1-score(asshownbelow). The single classifier clearly showed a lower AUCPR than the PT-baggingandrestofthemethods(P<1.5e−6).
Fig.1. Average test performance across datasets for different numbers of classifiers in AUCPR (left), macro-accuracy (middle) and macro F1 (right). The first row shows results for DT ensembles and second row for NN ensembles. The interpolated lines are shown for convenience.
With NN ensembles, RB-bagging was significantly (or nearly significantly)worsethanPT-,SMOTE-andRNB-bagging(P≈0.058, 0.065and0.021,respectively).All themethods performed signifi-cantlybetterthanthesingleclassifier(P<0.001).Noother signif-icantdifferenceswerefound.
Thechoiceofbaseclassifiers(eitherDTorNN)didnotmakea significantdifferenceinanyofthemethod’sperformance(P>0.17 forallWilcoxontestson26datasets).Thisaligns withourclaim that PT-baggingcanbe usedwithdifferentchoicesofbase classi-fiers.
5.1.2. Macro-accuracy
Fig.1(middle)showsthatresamplingmethodsoffersimilaror higher macro-accuracy compared to PTMA-bagging when a small number of classifiers were employed. This aligns with Maloof’s [20] findings which showed that undersampling and threshold-moving perform similarly in terms of ROC and macro-accuracy when a single classifier is employed. However, the performance ofPTMA-baggingimprovedsubstantiallywithlargerensembles(25 base classifiersor more), indicating that once thevariance is re-duced through bagging, theerror that is left comes mostly from thebias (sincebagging reducesvariancebutnotbias). Thisresult suggests that the biasof PTMA-bagging islower than that ofthe othermethods.
PTMA-bagging showed thehighest number ofwins aswell as fewer losses compared to other methods (Table 4, left), espe-cially with DT, where it obtained almost twice as manywins as losses against the second best performing method (RB-bagging). The Friedman test revealed a significant difference between the methods(P<1e−11forboth,DTandNNensembles).
ForDTensembles,theposthoc pairwiseNemenyitestsshowed thatPTMA-baggingperformedsignificantlybetterthanEB-, SMOTE-and RNB-bagging (all P < 0.03) and the single classifier (P<1e−10).Noneoftheremainingdifferencesweresignificant.
NNensembles showed a similar trend of pairwise differences astheDTensembles,althoughinthiscasePTMA-baggingwasonly significantlybetterthanSMOTE-bagging(P ≈0.021)andthesingle classifier (P < 1.7e−9). It should be noted, however,that it was moredifficultto obtainstatisticalsignificancewithNNensembles asfewerdatasetswereused.
Itisalso worthnotingthat thetype ofthebase classifierhad nosignificant effectonamethod’sperformance (Wilcoxon test,P >0.8forallcomparisonsofamethodusingDTagainst thesame methodusingNN)exceptforthesingleclassifierwhichperformed betterwithDTthanwithNN(P≈0.01).
Symmetryofclassrecalls: Inordertogain insightintothe bias ofthe methods towardpredictingeitherclass we testedthe null hypothesisthat,foreachmethod,thedifferencebetweentheclass recalls(Fig.2),i.e.,the twocomponentsofthemacro-accuracy,is equalto zerousing a one samplet-test. WithDT ensembles, the nullhypothesiswasrejectedforallmethods(P<0.003)exceptfor PTMA-bagging(P≈0.18).WithNNensembles,asimilarsymmetry wasobservedforPTMA-bagging(P ≈0.42).Inthiscase,EB-bagging wasthe onlyothermethod thatdidnot showsignificantly asym-metricclassrecalls(P≈0.46).
These resultssuggest that PTMA-bagging is not biased toward either class. Interestingly, only EB-bagging (using DT) showed higherrecallfortheminorityclass(Fig.2,left),suggestinga possi-bleovercompensationfortheminorityclassduetothe undersam-pling.
Fullpotential:Wethentestedthefullpotentialofthemethods formacro-accuracy,i.e.,whichwedefineastheirmaximum macro-accuracythatwouldbeattainableiftheoptimalthresholdforeach testsetwereknown.TheFriedman testrevealedasignificant dif-ferencebetweenmethods(P<0.001bothDTandNNensembles). Post-hoc pairwise comparisons with DT ensembles revealed that SMOTE-bagging(P ≈ 0.04) andPTMA-bagging (P ≈0.1) have (respectively) a significantly and nearly-significantly higher full
Table4
Win/Tie/Loss tables. Each element expresses how many times the method in the row wins/ties/loses against the method in the column. The top tables show results with DT methods, and the bottom tables show results with NN methods.
Macro-accuracy Macro F1-score
EB RB SMOTE RNB Single EB RB SMOTE RNB Single
PT MA 27/0/9 23/1/12 24/1/11 23/1/12 31/2/3 PT F1 33/0/3 31/1/4 19/2/15 19/1/16 30/0/6
EB – 6/4/26 20/2/14 19/1/16 31/0/5 EB – 2/2/32 1/1/34 1/2/33 10/0/26
RB – – 24/1/11 23/4/9 33/0/3 RB – – 5/2/29 6/2/28 15/0/21
SMOTE – – – 15/4/17 30/1/5 SMOTE – – – 18/2/16 28/0/8
RNB – – – – 34/1/1 RNB – – – – 30/0/6
EB RB SMOTE RNB single EB RB SMOTE RNB single
PT MA 13/1/12 16/0/10 20/0/6 15/2/9 26/0/0 PT F1 20/2/4 20/0/6 14/1/11 11/4/11 24/0/2
EB – 15/2/9 19/1/6 14/1/11 26/0/0 EB – 4/0/22 4/1/21 3/1/22 12/0/14
RB – – 19/1/6 13/1/12 26/0/0 RB – – 6/2/18 5/2/19 13/0/13
SMOTE – – – 4/2/20 22/0/4 SMOTE – – – 13/2/11 23/1/2
RNB – – – – 26/0/0 RNB – – – – 22/0/4
Fig.2. Average recall across data sets for different numbers of classifiers, separated for the minority class (solid line) and the majority class (dashed line). The left plot shows results for DT ensembles and right plot for NN ensembles.
Table5
The average difference between the actual performance and full potential. The first row shows results for DT methods and second row for NN methods.
Macro-accuracy Macro F1-score
PT MA EB RB SMOTE RNB single PT F1 EB RB SMOTE RNB single
DT 1.9% 2.5% 2.3% 4.7% 3.4% 0.9% 2.6% 9.5% 6.9% 2.9% 3.2% 1.1%
NN 1.7% 2.0% 2.1% 5.9% 3.3% 0.6% 2.9% 6.5% 5.5% 2.9% 3.6% 0.9%
potential than EB-bagging. Again, PT-bagging and the rest ofthe ensemble methods clearly outperformed the single classifier in termsof potential (P < 1e−10). The rest of the pairwise differ-enceswerenotsignificant.
With NN ensembles, the trend changed for RB-bagging, now exhibiting generallylower potential formacro-accuracy than the othermethods,withsignificantly(ornearlysignificantly)lower po-tentialthanRNB- andSMOTE-bagging (P ≈ 0.031andP ≈ 0.081, respectively). Thisdrop in the full potential ofRB-bagging when usingNNcan alsobe appreciatedfromits lower averagedAUCPR inFig.1 ascompared to,forexample, EB-bagging.This seemsto suggestthatRB-baggingmightnotgeneralizewellacrossbase clas-sifierchoices, conceivably becauseit might beleveraging proper-tiesofdecisiontrees.Toourknowledge,nootherstudieshave an-alyzedthe performance ofRB-bagging withbaseclassifiers other thanDTandthusmoreanalysesareneeded.
Finally, the single classifier showed lower potential than PT-bagging(P<8e−7)andtherestoftheensembles(P<0.002).
To shedlight on howwell-tuned thethresholds employed for each method were, we compared their actual and full potential macro-accuracy.The averageabsolutedifference betweenthe full potentialmacro-accuracy and theactual macro-accuracy was cal-culated across cross-validation folds and datasets (Table 5, left). PTMA-baggingperformedclosertoitsfull potentialthantheother
ensemble methods. This suggests that the prior-based threshold employed by PTMA-bagging is a close-to-optimal choice. The sin-gleclassifierperformedclosesttoitsfullpotentialmacro-accuracy, however its full potential was markedly lower than the other methods– asdiscussedabove– yieldingthusalowerperformance (Table4).
5.1.3. MacroF1-scoreandplug-inpotency
Animportantadvantageofourmethodisthatalearned ensem-blecanbeusedtomakepredictionsthatoptimizeanymeasureof interestby applyingan appropriatethresholda posteriori.We ap-pliedtheproposedthresholdtomaximizethemacroF1-score(see Section3.2)totheoutputsofthesamePT-baggingensemblesused above.TheresultingmethodistermedPTF1-bagging(Fig.1,right). The Friedman test revealed a significant difference between the methods for DT (P < 0.0008; P < 0.1 for NN). Post-hoc tests showed that PTF1-bagging had a significantly higher macro F1-score than PTMA-bagging, EB-bagging, and RB-bagging (all P < 0.0008) for both DT and NNensembles. However, witheither choiceofthebaseclassifier,PTF1-baggingwasnotsignificantly dif-ferentfromRNB-orSMOTE-bagging.Asinthepreviousmeasures, thesingleclassifierperformedmarkedlyworsethanPTF1-, SMOTE-andRNB-bagging(P<0.002)witheitherbaseclassifier,whilenot
Fig.3. Reliability plots for DT ensembles with 100 classifiers; spectf (UCI, left), pb-1-3vs4 (KEEL, middle) and satim (HDDT, right). We used 10 bins to discretize the posterior probability for the minority class P(y=1 | x) ( x-axis) for all five runs and two folds. The corresponding observed frequencies of the minority class ( y-axis) were calculated for each bin (i.e., the “true”P(y=1 | x)). A method lining up with the diagonal is well calibrated while values below the diagonal are overestimating the probability of the minority class.
worse than EB-bagging with NN (P ≈ 0.99) and not worse than RB-baggingwithbothNNandDT(P>0.91).
Again, none of the methods showed significant differences when comparing their performanceswith DT andNN ensembles (P>0.17forallWilcoxontests).
Full potential: We found significant differences between the methods in terms of the full potential of the macro F1-score (Friedmantest, P < 0.055 forbothDT andNN).WithDT ensem-bles, pairwise posthoc tests revealed only one trend level differ-ence between PTF1- and EB-bagging (P ≈ 0.1) in favor of PTF1 -bagging.WithNNensembles,asnotedabovewithmacro-accuracy, RB-bagging’spotential generallydecreased showingasignificantly lowerfullpotentialthanPTF1-bagging(P≈0.04).Asexpected,the single classifier hadsignificantly lower full potential than all the ensembleswitheithertypeofbaseclassifier(P<0.0022).Noother differencesweresignificant.
Finally,theaverageddifferencesbetweentheactualandthefull potentialmacroF1-scoreshowthatPTF1-bagging– whichusesour novel threshold – performed closerto its full potential than the restofensemblemethods(Table5,right).Also, SMOTE-and RNB-baggingperformedclosertooptimalthanEB-andRB-bagging. Ad-ditionally,thesingleclassifierperformedclosesttoitsoptimal per-formance,yetagain,itsfullpotentialandactualperformancewere markedlylowerthantheensemblemethods(Table4).
Taken together,these resultsimply that the methodsthat ex-hibit a performance clearly lower to that of their full potential could dobetter ifproperthresholds could befound, whichis of-ten not possible without usingcomputationally expensive tuning procedures. Overall, the resultsabove support our claimthat PT-bagging passesas a plug-in methodwhere the threshold can be setaposterioriaccordingtotheperformancemeasureofinterest. 5.1.4. Posteriorprobabilitycalibration
Sofar, wehaveshownthatPT-baggingperforms competitively onthreedifferentmeasures.Inparticular,goodperformanceinthe macroF1-score andmacro-accuracy (Theorem 1) is onlypossible if posteriorprobabilities are well calibrated. Inthe following, we arguethatPT-baggingestimateswellcalibratedposterior probabil-ities.
AnempiricalstudybyNiculescu-MizilandCaruana[23]showed thatbagged DTandNNensemblesestimate wellcalibrated poste-rior probabilities, making additional calibration – e.g., with Platt scaling – unnecessary. However, probability calibration is a rel-atively understudied problem for imbalanced data. In this di-rection, a recent study proposes to correct the calibration for undersampling [14] and another study proposes the use of an
undersampling-basedvariationofPlattscalingtoobtaincalibrated probabilities[42].Thesestudiesusethe(stratified)Brierscore(see Eq.(5)) toquantify calibration.Wallace andDahabreh [41]found thatundersamplingcombinedwithbaggingleadstoalowerBrier scorefortheminority class,i.e.,abettercalibrationforthisclass, without sacrificing the overall Brier score. We found similar re-sultsinourexperimentswithboth DTandNNensembles. Specif-ically,therebalancingmethods showedasignificantlylower Brier scorefor theminority class than PT-bagging andthe single clas-sifier,while PT-bagging fared better withthe majority class than the rebalancing methods and the single classifier (Friedman test P<2e−16;allpairwiseposthocNemenyitestsP<0.018forboth DTandNNbasedmethods).
ThisseeminglynegativeresultforPT-bagging,i.e.,ahigherBrier score for the minority class than the majority class, can be at-tributed to a potential shortcoming of the stratified Brier score. TheBrierscoreforclasskdecreaseswithcrispposteriors,e.g.,the Brierscorefortheminority classwouldbecomezeroifall poste-riorprobabilities forthisclass were 1(see Eq.(5)).Thus, the in-formationinthenon-crispposteriorsisignoredbytheBrierscore andoverestimatedprobabilitieswill,wrongly,leadtoalowerBrier score.ThissuggeststhatthestratifiedBrierscoremightnotbe ap-propriateforqualifyingposteriorcalibrationoverimbalanceddata, asitis notnecessarily indicativeofother performance measures. Developing new measures to quantify calibration is beyond the scopeofthispaper.
Reliability plots provide an alternative visual way to evaluate calibrationquality [23], overcomingthe aforementioned deficien-cies.ThreeexamplesofreliabilityplotsareshowninFig.3.Visual inspection revealed that PT-bagging probabilities were well cali-brated(i.e.,closetothediagonalline)forthemajorityofthedata sets(21outof36DTensembles,and15outof26NNensembles; see Supplementary material). In contrast, all the other methods tendedtosystematicallyoverestimatetheposteriorsforthe minor-ityclass (hencetheirlower Brierscores).Moreover,whenever PT-bagging estimated miscalibratedposteriors (i.e.,not aligned with thediagonal), theother methodsfailedtoo.Theseresultssuggest thatPT-baggingestimatesrelativelywell-calibratedposteriors. 5.1.5. ComparisonwithPlattscaling
Toinvestigatethe effectof directcalibration,we applied Platt scaling to the posterior probabilities of the ensembles with 100 classifiers used above, followed by a threshold of 0.5. With the DTensembles,PTMA-baggingoutperformedPlattscalingin macro-accuracy(PairedWilcoxontest,P<1e−8;Win/Tie/Loss = 34/0/2). Platt scaling performed relatively better in the macro F1-score.
Nevertheless,PTF1-, SMOTE- and RNB-bagging still outperformed Plattscaling(PairedWilcoxontest,P<0.08inallcomparisons).
Results with NN ensembles showed a similar picture for themacro-accuracy measure,wherePTMA-baggingclearly outper-formed Platt scaling (P < 1e−6; Win/Tie/Loss = 21/1/4). How-ever,inthiscase, thedifferencesbetweenPlatt scalingandPTF1-, SMOTE-andRNB-baggingwerenon-significant(allP>0.4).
Inconclusion,probabilitycalibrationusingPlattscalingdidnot provideanimprovementoverthemethodsinvestigatedinthis ar-ticle.Thiscorroboratespreviousresultsshowingthatposterior cal-ibrationforbagged DTandNNensemblesisoftenunnecessaryas theirprobabilityestimatesaregenerallywellcalibrated[23]. 5.2.Multiclassdatasets
An important advantage of our method is that it can be di-rectlyextended to themulticlass setting.Formulticlass data,the threshold-movingtechniquecanbeappliedbydividingthe poste-riorsbyappropriateprobabilities(seeTheorem1).Forinstance,to maximizemacro-accuracy,thethresholdsofPTMA-baggingareset equaltothepriorprobabilitiesoftherespectiveclass(seeSection 3.1andAlgorithm1).Weevaluatedthisapproachon15multiclass datasets (Table3).Forcomparison,weusedtheUnderBaggingto OverBaggingmethod (UnderOver)[43]. This method uses under-orover-sampledinstancesofdifferentclassesinproportiontothe majorityclasssizecontrolledbyaparameterawhichcorresponds tothesamplingrateofthelargestclass.Theparameterachanges theensemblefromUnderBagging(a=0)toOverBagging(a=100). Noticethat UnderOver (a = 0) is the multiclassextension of EB-bagginginwhichthemajorityclassesare undersampledtomatch theleastfrequent class.We varied theparametera in {0,10,25, 50,100}.
Weusedsimilarexperimentalsettingstothoseusedforthe bi-narydatasets, i.e.,simple bootstrap samplingforPT-bagging and 5 × 2-fold cross-validation. Following their good performance in binarydatasets,hereweemployedonlyDTensemblesofsize100. As there are many methods (PT-bagging and five competing methods)relativetothenumberofdatasets,theFriedman testis unlikely toreveal differences. Therefore, herewe only performed pairwise tests. Firstly, we selected the single competitor method withthehighestaverage macro-accuracy.Thismethod – withan averagemacro-accuracy of 0.756 – was UnderOver-bagging(with a= 50).This methodandPTMA-bagging (average macro-accuracy 0.789) showed a trend-level difference (Paired Wilcoxon test, P ≈0.0946).Furthermore,of the total15 multiclass data sets,PTMA -bagginghadeightoverallwins againstthefivecompetitors,while thenextbestmethodwasUnderOver-bagging(witha =10) with fourwins.ThisresultsuggeststhatPT-baggingcanbesuccessfully employedformulticlassdatasetswhenappropriatethresholdsare available, although further tests are needed to confirm stringent statisticalsignificance.
6. Conclusionsandfuturework
We proposed a simple plug-inmethod, PT-bagging,for imbal-ancedclassification. Ourmethod relieson simple bootstrap sam-pling– which preservesthe naturalclassdistribution – to create abagging ensemblefollowedby threshold-movingto assignclass labels.OurresultsandanalysesshowedthatPT-bagging, unsurpris-ingly,outperformsthesingleclassifierbaselineandperforms com-petitivelytofourstate-of-the-artensemblemethods.Furthermore, itdoessointhreeperformancemeasures:AUCPR,macro-accuracy andmacroF1-score.Weshowedthat theclasspriorsprovide the optimal thresholds for maximizing the macro-accuracy measure, andwe introduced a newintuitive threshold formaximizing the macroF1-scoreanddemonstrateditseffectiveness.
We showed that PT-bagging (combined with an appropriate thresholdwhenneeded)performsatleastaswellasthebest com-petitor method in each of the three performance measures and does not underperform in any of them. Critically, it does so by reusingthesameensemblemodels acrossperformance measures. Bycontrast,all other methods provedto be weakinat leastone ofthe threemeasures. Weobserved thatthe undersampling-only methods(EB-andRB-bagging)weremoresuitableformaximizing themacro-accuracy,whilemethodscombiningsynthetic oversam-plingwithundersampling(SMOTE-andRNB-bagging)faredbetter withthemacroF1-score.Thusavoiding thenecessityofchoosing anarbitrarythresholdoridentifyingameasure-specificthreshold.
Our analysis provided several additional insights. Specifically, we found: (i) PT-bagging is lessbiased toward either class than other methods, (ii) it performs close to its full potential, (iii) it performs well with different choices of base classifier, (iv) PT-bagging canbe directly extendedto multiclassdata when appro-priate thresholds are available; andfinally, (v) a potential short-comingoftheBrierscoreinquantifyingprobabilitycalibrationand. Takentogether,ourworkprovidesacompetitiveandsimple al-ternative to other rebalancing- and syntheticoversampling-based ensemble methods, whichare often the first choice todeal with class imbalance. We hope that our results and analyses will in-crease interest in the threshold-moving technique and provide a basisfordevelopingnewthreshold-basedmethodsforimbalanced classification.
An important but understudied question is whether to use the naturalclassdistribution forlearning[19].Weiss andProvost [16]studiedthisquestionempiricallyandconcludedthatgenerally a differentclassdistribution leadstobetter performance. Our re-sultsstandincontrasttotheirconclusion.Animportantdifference between their procedures and ours, which can at least partially explain, the different conclusions, is the use of a singledecision treeversusan ensemble.Asourresultsshow,asingleclassifieras wellassimplebootstrapbaggedsmallensemblesperformspoorly, buttheperformanceimproveswiththeensemblesize.Considering this, weconclude thata largeenough bagging ensemble success-fullymodelsthedatawithitsnaturalclassdistribution.
Wecantakeseveralpossibledirectionsinthefuture.Onesuch possible direction would be to use intrinsic data properties to improve the sampling mechanism (see, e.g., [7]). Such methods can leverage properties of the data and help with difficult situ-ations, e.g., small disjunctsof the minority class. Additionally,as our methodavoids computationally expensive retraining, we aim toinvestigatethe suitabilityofourmethod inenvironmentswith dynamiccosts.
Acknowledgments
We thank Dr. Harikrishna Narasimhan, Prof. Jesse Davis and Prof.HendrikBlockeelforfruitfuldiscussions.KRPwaspartly sup-portedby theWellcomeTrust -MITfellowship103811AI.We are thankfultoStevenSheridanandDr.AbhijitKulkarnifor proofread-ingthemanuscript.
Supplementarymaterials
Supplementary material associated with this article can be found,intheonlineversion,atdoi:10.1016/j.neucom.2017.08.035. References
[1] Q.Yang,X.Wu,10challengingproblemsindataminingresearch,Int.J.Inf. Technol.Decis.Mak.5(4)(2006)597–604.
[2] Z.Yang,W.Tang,Associationrulemining-baseddissolvedgasanalysisforfault diagnosisofpowertransformers,IEEETrans.Syst.ManCybern.C:Appl.Rev.39 (6)(2009)597–610.
[3] R. Perdisci,G. Gu, W.Lee, Usinganensemble ofone-class SVMclassifiers tohardenpayload-basedanomalydetectionsystems,in:ProceedingsofIEEE SixthInternationalConferenceonDataMining,2006.
[4] T.Fawcett,F.Provost,Adaptivefrauddetection,DataMin.Knowl.Discov.1(3) (1997)291–316.
[5] M. Mazurowski, P.Habas, J. Zurada, Training neural networkclassifiers for medicaldecisionmaking:theeffectsofimbalanceddatasetsonclassification performance,NeuralNetw.21(2)(2008)427–436.
[6] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid- based approaches, IEEE Trans. Syst. Man, Cybern. C: Appl. Rev 42 (2012) 463– 484, doi: 10.1109/TSMCC.2011.2161285.
[7] H.He,E.Garcia,Learningfromimbalanceddata,IEEETrans.Knowl.DataEng. 21(9)(2009)1263–1284.
[8] G.E.A .P.A . Batista, R.C. Prati, M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newsl. 6 (2004) 20, doi: 10.1145/1007730.1007735.
[9] N.Chawla,K.Bowyer,SMOTE:syntheticminorityover-samplingtechnique,J. Artif.Intell.Res.16(2002)321–357.
[10] A.Estabrooks,T.Jo,N.Japkowicz,Amultipleresamplingmethodforlearning fromimbalanceddatasets,Comput.Intell.20(2004)18–36.
[11] T.Dietterich,Ensemblemethodsinmachinelearning,in:ProceedingMCS‘00 ProceedingsoftheFirstInternationalWorkshoponMultipleClassifierSystems, 2000.
[12] S. Hido, H. Kashima, Y. Takahashi, Roughly balanced bagging for imbalanced data, Stat. Anal. Data Min. 2 (2009) 412–426, doi: 10.1002/sam.10061. [13] C.Seiffert,T.M.Khoshgoftaar,J.VanHulse,A.Napolitano,RUSBoost:ahybrid
approachtoalleviatingclassimbalance,IEEETrans.Syst.ManCybern.A:Syst. Hum.40(2010)185–197.
[14] A.D.Pozzolo,O.Caelen,R.Johnson,G.Bontempi,Calibratingprobabilitywith undersamplingforunbalancedclassification,IEEESymp.Comput.Intell.Data Min(2015)159–166.
[15] A.D.Pozzolo,O.Caelen,G.Bontempi,Whenisundersamplingeffectivein un-balancedclassificationtasks?in:ProceedingsofConferenceonMachine Learn-ingandKnowledgeDiscoveryinDatabases,2015.
[16] G.Weiss,F.Provost,Learningwhentrainingdataarecostly:theeffectofclass distributionontreeinduction,J.Artif.Intell.Res.19(2003)315–354. [17] N. Chawla, L. Hall,A. Joshi, Wrapper-basedcomputation and evaluation of
sampling methodsforimbalanceddatasets,in:Proceedingsofthe1st Inter-nationalWorkshoponUtility-BasedDataMining,ACM,2005.
[18] S.Wang,X.Yao,Multiclassimbalanceproblems:Analysisandpotential solu-tions,IEEETrans.Syst.ManCybern.B:Cybern.42(4)(2012)1119–1130. [19] F.Provost,Machinelearningfromimbalanceddatasets101,in:Proceedingsof
AAAI’2000WorkshoponImbalancedDataSets,2000.
[20] M.Maloof,Learningwhendatasetsareimbalancedandwhencostsare un-equaland unknown,in:ProceedingsofWorkshoponLearning from Imbal-ancedDataSetsII,2003.
[21] Z.Zhou,X.Liu,Trainingcost-sensitiveneuralnetworkswithmethods address-ingtheclassimbalanceproblem,IEEETrans.Knowl.DataEng.18(1)(2006) 63–77.
[22] F.Provost,P.Domingos,Treeinduction forprobability-basedranking,Mach. Learn.52(3)(2003)199–215.
[23] A.Niculescu-Mizil,R.Caruana,Predictinggoodprobabilitieswithsupervised learning,in: Proceedingsofthe 22ndInternational Conference onMachine Learning,2005.
[24] H.Narasimhan,R.Vaish,S.Agarwal,Onthestatisticalconsistencyofplug-in classifiers for non-decomposable performance measures, in: Proceedingsof AdvancesinNeuralInformationProcessingSystems,2014.
[25] Z.Lipton, C.Elkan,B. Naryanaswamy, Optimalthresholding ofclassifiersto maximizeF1measure,in:ProceedingsofConferenceonMachineLearningand KnowledgeDiscoveryinDatabases,2014.
[26] C.Elkan,Thefoundationsofcost-sensitivelearning,in:Proceedingsof Interna-tionalJointConferenceonArtificialIntelligence,2001.
[27] L.Kuncheva,CombiningPatternClassifiers:MethodsandAlgorithms,John Wi-ley&Sons,2004.
[28] L. Breiman, Bagging predictors, Mach. Learn 24 (1996) 123–140, doi: 10.1007/ BF00058655.
[29] E. Bauer, R. Kohavi, An empiricalcomparison ofvotingclassification algo-rithms:bagging,boosting,andvariants,Mach.Learn.36(1)(1999)105–139. [30] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning,
SpringerSeriesinStatisticsSpringerNewYorkInc.,NewYork,NY,USA,2009. [31] J.Błaszczy´nski,J.Stefanowski,Ł.Idkowiak,Extendingbaggingforimbalanced data,in:Proceedingsofthe8thInternationalConferenceonComputer Recog-nitionSystem,CORES,2013.
[32] C.Drummond,R.Holte,C4.5,classimbalance,andcostsensitivity:why un-der-samplingbeatsover-sampling,in:ProceedingsofWorkshoponLearning fromImbalancedDataSetsII,2003.
[33] J.Díez-Pastor,J.Rodríguez,C.García-Osorio,RandomBalance:Ensemblesof variablepriorsclassifiersforimbalanceddata,Knowle.-BasedSyst.85(2015) 96–111.
[34] V.Sheng,C.Ling,Thresholdingfor makingclassifierscost-sensitive,in: Pro-ceedingsofNationalConferenceonArtificialIntelligence,AAAI,2006. [35] J.Hernández-Orallo,P.Flach,C.Ferri,Aunifiedviewofperformancemetrics:
Translatingthresholdchoiceintoexpectedclassificationloss,J.Mach.Learn. Res.13(2012)2813–2869.
[36] S. Pang, J. Gong, C5.0 classification algorithm and application on individ- ual credit evaluation of banks, Syst. Eng. – Theory Pract. 29 (2009) 94–104, doi: 10.1016/S1874-8651(10)60092-0.
[37] B.Zadrozny,C.Elkan,Learningandmakingdecisionswhencostsand proba-bilitiesarebothunknown,in:ProceedingsofSeventhACMSIGKDD Interena-tionalConferenceonKnowledgeDiscoveryinDataMining,ACM,2001. [38] G.Panchal,A.Ganatra,Y.Kosta,Behaviouranalysisofmultilayerperceptrons
withmultiplehiddenneuronsandhiddenlayers,InternationalJournalof Com-puterTheoryandEngineering3(2)(2011)332–337.
[39] J.Demšar,Statisticalcomparisonsofclassifiersovermultipledatasets,J.Mach. Learn.Res.7(Jan.)(2006)1–30.
[40] J.Platt,Probabilisticoutputsforsupportvectormachinesandcomparisonsto regularizedlikelihoodmethods,Adv.LargeMarginClassif.10(3)(1999)61–74. [41] B.Wallace,I.Dahabreh,Classprobabilityestimatesareunreliablefor imbal-anceddata(andhowtofixthem),in:ProceedingsofIEEE12thInternational ConferenceonDataMining,2012.
[42] B.Wallace,I.Dahabreh,Improvingclassprobabilityestimatesforimbalanced data,Knowl.Inf.Syst.41(1)(2014)33–52.
[43] S.Wang,X.Yao,Diversityanalysisonimbalanceddatasetsbyusingensemble models,in:ProceedingsofIEEESymposiumonComputerIntelligenceandData Mining,CIDM’09,IEEE,2009.
GuillemCollell is a Ph.D. student at the Department of Computer Science at KU Leuven. He received both a B.S. in Psychology and a B.S. in Mathematics from the Au- tonomous University of Barcelona and a Research Mas- ter in Neuroeconomics from the Maastricht University. He was a Visiting Scholar at the MIT Neuroeconomics lab where he did research on imbalanced classification. His current research interests include deep learning models to integrate vision and language as well as more theoret- ical and fundamental problems in machine learning.
DrazenPrelec is the DigitalEquipmentCorp.Leadersfor GlobalOperationsProfessorofManagement and a Professor of Management Science and Economics at the MIT Sloan School of Management. Prelec also holds appointments in the Department of Economics and in the Department of Brain and Cognitive Sciences. He was a Junior Fellow in the Harvard Society of Fellows, and has received a num- ber of distinguished research awards, including the John Simon Guggenheim Fellowship. Prelec holds an AB in ap- plied mathematics from Harvard College and a Ph.D. in experimental psychology from Harvard University.
Kaustubh R. Patil is currently a Research Fellow at the Brain and Behavior institute at the Research Center Juelich. Patil is also affiliated with the MIT Sloan Neuroe- conomics Lab where he was a Wellcome Trust-MIT fellow. His work is primarily concerned with developing machine learning and data mining methods to understand biologi- cal systems, especially in microbiology and neuroscience. He holds a Bachelor’s degree in electronics engineering from the Shivaji University, a Master’s degree in artifi- cial intelligence from the University of Porto and a Ph.D. in bioinformatics from the Max-Planck Institute for Com- puter Science.