The AdaBoost.NC Algorithm - A New NCL Algorithm

4.3 A New NCL Algorithm – AdaBoost.NC

4.3.1 The AdaBoost.NC Algorithm

The key point of AdaBoost.NC is to emphasize the training examples that cause the ensemble to present low voting disagreement among the individual classifiers during learning, in addition to the examples misclassified by the current classifier. This is accomplished by applying a penalty term involving amb in the sequential training procedure of AdaBoost. The classification difference measured by amb for each training example is combined into its weight at each iteration. The weight-updating rule of AdaBoost is modified such that both classification errors and low diversity are penalized by raising weights. Table 4.1 presents the pseudo-code of AdaBoost.NC, which follows the AdaBoost.M1 procedure (Freund and Schapire, 1996).

In step 3 of the algorithm, a penalty term $p_i$ is calculated for each training example at the i-th iteration, in which the magnitude of $\mathrm{amb}_i$ assesses the “pure” disagreement degree within the current ensemble $\bar{f}_i$ composed of the existing $i$ classifiers. Uniform weights ($1/i$) are simply assigned to the individuals for the calculation, i.e.

$$\mathrm{amb}_i = \frac{1}{i} \sum_{t=1}^{i} \left( [\bar{f}_i = y] - [f_t = y] \right). \tag{4.12}$$
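As an illustration of Eq. (4.12), the following minimal sketch computes amb and the resulting penalty for a single training example. The function names `ambiguity` and `penalty` are hypothetical, and the ensemble decision is assumed to be a plain majority vote with ties broken in favour of the label seen first; the text only states that the individuals are uniformly weighted.

```python
from collections import Counter

def ambiguity(votes, y):
    """amb for one example; votes holds the labels predicted by the i classifiers so far."""
    # Ensemble decision under uniform weights: plain majority vote.
    # Tie-breaking (first label encountered wins) is an assumption of this sketch.
    ensemble_label = Counter(votes).most_common(1)[0][0]
    ensemble_correct = 1 if ensemble_label == y else 0
    # Average of [ensemble correct] - [classifier t correct] over all classifiers.
    return sum(ensemble_correct - (1 if v == y else 0) for v in votes) / len(votes)

def penalty(votes, y):
    """Penalty p = 1 - |amb| used in step 3 of Table 4.1."""
    return 1.0 - abs(ambiguity(votes, y))

# Unanimous votes give amb = 0, hence the largest penalty value p = 1.
print(penalty(["a", "a", "a"], "a"))
# One dissenting vote out of three lowers the penalty value.
print(penalty(["a", "a", "b"], "a"))
```

In this sketch amb is positive when the majority vote is correct and negative when it is wrong, so taking the absolute value keeps only the amount of disagreement.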

The penalty $p_i$ is introduced into the weight-updating step (step 5). Its main effect is that examples misclassified by the current classifier that receive more identical votes from the existing individual classifiers get a larger weight increase, while correctly classified examples that receive more differing votes get a larger weight decrease. Thus, both accuracy and diversity are considered during training.
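The effect described above can be checked on a small numeric example. This is a sketch of the step 5 update only; the function name `update_weights` and all input values (weights, penalties, classifier weight, λ) are made up for illustration.

```python
import math

def update_weights(D, p, correct, alpha, lam):
    """One step-5 update: D_next(x_j) = p_j^lam * D_j * exp(-alpha * [f(x_j) = y_j]) / Z."""
    unnormalised = [
        (p_j ** lam) * d_j * math.exp(-alpha * (1 if c_j else 0))
        for d_j, p_j, c_j in zip(D, p, correct)
    ]
    Z = sum(unnormalised)  # normalization factor
    return [w / Z for w in unnormalised]

# Four examples with equal starting weight (illustrative values only).
D = [0.25, 0.25, 0.25, 0.25]
p = [1.0, 0.5, 1.0, 0.5]              # p = 1: unanimous votes, p = 0.5: split votes
correct = [False, False, True, True]  # outcome of the current classifier
new_D = update_weights(D, p, correct, alpha=0.5, lam=2.0)
print(new_D)
```

Among the two misclassified examples, the one with unanimous votes ends up with the larger weight; among the two correctly classified examples, the one with split votes ends up with the smaller weight.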

The pre-defined parameter λ controls the strength of applying $p_i$. The choice of $\alpha_i$ in step 4 is decided by using the same inference method as in (Schapire and Singer, 1999;

Table 4.1: The AdaBoost.NC algorithm (Wang et al., 2010).

Given training data set $\{(x_1, y_1), \ldots, (x_j, y_j), \ldots, (x_N, y_N)\}$

with labels $y_j \in Y = \{\omega_1, \ldots, \omega_c\}$ and penalty strength $\lambda$,

initialize data weights $D_1(x_j) = 1/N$; penalty term $p_1(x_j) = 1$.

For each training epoch $i = 1, 2, \ldots, L$:

Step 1. Train weak classifier $f_i$ using distribution $D_i$.

Step 2. Get weak classifier $f_i: X \to Y$.

Step 3. Calculate the penalty value for every example $x_j$:

$$p_i(x_j) = 1 - |\mathrm{amb}_i(x_j)|.$$

Step 4. Calculate $f_i$'s weight $\alpha_i$ according to error and penalty:

$$\alpha_i = \frac{1}{2} \log \left( \frac{\sum_{j,\, y_j = f_i(x_j)} D_i(x_j)\, (p_i(x_j))^{\lambda}}{\sum_{j,\, y_j \neq f_i(x_j)} D_i(x_j)\, (p_i(x_j))^{\lambda}} \right).$$

Step 5. Update data weights $D_i$ and obtain new weights $D_{i+1}$ according to error and penalty:

$$D_{i+1}(x_j) = \frac{(p_i(x_j))^{\lambda}\, D_i(x_j)\, \exp\left(-\alpha_i [f_i(x_j) = y_j]\right)}{Z_i},$$

where $Z_i$ is a normalization factor.

Output the final ensemble: $\bar{f}(x) = \arg\max_{\hat{y} \in Y} \sum_{i=1}^{L} \alpha_i [f_i(x) = \hat{y}]$.

(Define [π] to be 1 if π holds and 0 otherwise.)

that the classifier having fewer misclassified examples, which receive larger classification ambiguity from the current ensemble members, will obtain a higher weight. This agrees with the common understanding of diversity in classification: we do not want ensemble members to make the same wrong decision. The details of how to choose $\alpha_i$ are given in Appendix B.
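The steps of Table 4.1 can be put together in a short, self-contained sketch. Several parts are assumptions made only so the example runs: the weak learner `train_stump` is a hypothetical weighted threshold stump on a single numeric feature with labels in {0, 1}, the small constant `eps` guards against a zero denominator in step 4, and ties in the uniform vote are broken in favour of the label seen first. None of these choices comes from the original algorithm description.

```python
import math
from collections import Counter

def train_stump(X, y, D):
    """Hypothetical weak learner: best weighted threshold on a 1-D feature, labels in {0, 1}."""
    best = None
    for thr in sorted(set(X)):
        for left in (0, 1):  # label predicted when x <= thr
            err = sum(d for x, t, d in zip(X, y, D)
                      if (left if x <= thr else 1 - left) != t)
            if best is None or err < best[0]:
                best = (err, thr, left)
    _, thr, left = best
    return lambda x: left if x <= thr else 1 - left

def adaboost_nc(X, y, L, lam, eps=1e-10):
    N = len(X)
    D = [1.0 / N] * N                        # D_1(x_j) = 1/N
    classifiers, alphas = [], []
    for _ in range(L):
        f = train_stump(X, y, D)             # steps 1 and 2
        classifiers.append(f)
        # Step 3: penalty from the uniformly weighted votes of all classifiers so far.
        p = []
        for x, t in zip(X, y):
            votes = [g(x) for g in classifiers]
            ens = Counter(votes).most_common(1)[0][0]   # tie-breaking is an assumption
            amb = sum((ens == t) - (v == t) for v in votes) / len(votes)
            p.append(1.0 - abs(amb))
        correct = [f(x) == t for x, t in zip(X, y)]
        # Step 4: classifier weight from penalized correct and incorrect mass.
        num = sum(d * pj ** lam for d, pj, c in zip(D, p, correct) if c)
        den = sum(d * pj ** lam for d, pj, c in zip(D, p, correct) if not c)
        alpha = 0.5 * math.log((num + eps) / (den + eps))  # eps is an assumption
        alphas.append(alpha)
        # Step 5: reweight and normalize.
        w = [pj ** lam * d * math.exp(-alpha * c) for d, pj, c in zip(D, p, correct)]
        Z = sum(w)
        D = [wj / Z for wj in w]

    def predict(x):
        # Final ensemble: label with the largest sum of alpha-weighted votes.
        score = Counter()
        for a, g in zip(alphas, classifiers):
            score[g(x)] += a
        return max(score, key=score.get)
    return predict

# Toy usage on a separable 1-D problem (illustrative data only).
X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [0, 0, 0, 0, 1, 1, 1, 1]
predict = adaboost_nc(X, y, L=5, lam=2.0)
print([predict(x) for x in X])
```

With a single classifier the vote is trivially unanimous, so the first epoch uses a penalty of 1 for every example, which matches the initialization in Table 4.1.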

Here are some additional points to explain the choice of the penalty term. Why do we use |amb| to encourage diversity instead of amb? As explained above, the sign of amb indicates whether an example is correctly classified by the ensemble. This accuracy information is already reflected in the original weight-updating rule of AdaBoost, so there is no point in considering it again in $p_i$. For the same reason, the individual classifiers are uniformly weighted in the calculation of amb instead of using the α from step 5. In addition, $f_i$'s weight $\alpha_i$ is not known until step 4, whereas $p_i$ is calculated in step 3.

AdaBoost.NC can be viewed as the first NCL algorithm developed specifically for classification problems. Its training strategy is much simpler than that of existing NCL algorithms. It can be used with any base learning method, whereas the others are restricted to neural networks. The error correlation information is introduced into the weights of the training examples, rather than into the error function of the learners, as in the CELS algorithm (Liu and Yao, 1999b), or into the training examples themselves, as in the NCCD algorithm (Chan and Kasabov, 2005a). In this way, diversity is considered at the ensemble level without modifying the base learner, which would make the algorithm learner-dependent, or the training examples, which can introduce undesirable noise. All these features help to create a flexible ensemble training framework with improved efficiency and accuracy.

Finally, it is worth explaining why AdaBoost.NC can outperform the conventional AdaBoost. Although AdaBoost attempts to force classifiers to make different errors by focusing on misclassified examples sequentially, overfitting has been reported empirically (Quinlan, 1996; Opitz and Maclin, 1999; Dietterich, 2000a), especially when the data being processed is noisy. In some cases, the weight vectors can become very skewed, which may lead to an undesirable bias towards some limited groups of data. It has also been found that AdaBoost can produce a diverse ensemble in the first few training epochs, but diversity drops as more classifiers are built (Shipp and Kuncheva, 2002b). It is therefore suggested to stop the training process early, so that performance improves while proper diversity is maintained.

From a theoretical point of view, Schapire et al. derived an upper bound on the generalization error of AdaBoost (Schapire et al., 1998). They proved that AdaBoost is aggressive at increasing the margins of the training examples, which contributes to the reduction of generalization error even after the training error reaches zero. However, this bound is rather weak (Schapire, 2002). Breiman found that a better margin distribution does not

then overfitting set in (Breiman, 1999). Murua presented an improved error bound for the

linear classifier combination by introducing “mutual weak dependence” (Murua, 2002).

An important advance is the demonstration that both low dependence between classifiers and large margins play an important role in achieving low error rates, and that there is a trade-off between them. This provides evidence that considering margins alone is not sufficient. It is worth looking for a training procedure that keeps the dependence between the classifiers low while maintaining large margins. AdaBoost.NC serves this purpose. It is expected to alleviate the overfitting problem of AdaBoost, because the penalty term boosts not only the most difficult misclassified examples, but also the easiest examples that have been correctly labelled. Easy examples are more likely to be chosen to help the classification of difficult examples in AdaBoost.NC than in AdaBoost, which reduces the chance of focusing on the same group of misclassified examples. Both theoretical and empirical studies have provided justification for AdaBoost.NC to achieve good performance.

In document Ensemble diversity for class imbalance learning (Page 117-120)