4.3 A New NCL Algorithm – AdaBoost.NC
4.3.1 The AdaBoost.NC Algorithm
The key point of AdaBoost.NC is to emphasize the training examples that cause the ensemble
to present low voting disagreement among the individual classifiers during learning,
in addition to the examples misclassified by the current classifier. This is accomplished by
applying a penalty term involving amb in the sequential training procedure of AdaBoost.
The classification difference measured by amb for each training example is incorporated into
its weight at each iteration. The weight-updating rule of AdaBoost is modified such that
both classification errors and low diversity are penalized by increased weights. Table 4.1
presents the pseudo-code of AdaBoost.NC, which follows the AdaBoost.M1 procedure (Freund
and Schapire, 1996).
In step 3 of the algorithm, a penalty term $p_i$ is calculated for each training example
at the $i$-th iteration, in which the magnitude of $amb_i$ assesses the “pure” disagreement
degree within the current ensemble $\bar{f}_i$ composed of the existing $i$ classifiers. Uniform weights
($1/i$) are simply assigned to the individuals for the calculation, i.e.
$$amb_i = \frac{1}{i} \sum_{t=1}^{i} \left( [\bar{f}_i = y] - [f_t = y] \right). \qquad (4.12)$$
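Equation (4.12) and the resulting penalty can be illustrated with a minimal Python sketch. It assumes that $\bar{f}_i$ is the uniformly weighted plurality vote of the $i$ classifiers built so far, with ties broken in favour of the class listed first. The tie-breaking rule and the function names `ambiguity` and `penalty` are illustrative assumptions, not part of the algorithm description.

```python
def ambiguity(votes, y, classes):
    """amb_i(x_j) from Eq. (4.12) for every example x_j.

    votes[t][j] is the label predicted by classifier t for example j,
    y[j] is the true label, and classes lists the possible labels.
    """
    i = len(votes)
    amb = []
    for j, yj in enumerate(y):
        col = [votes[t][j] for t in range(i)]
        # Uniformly weighted vote; ties go to the class listed first (assumption).
        ensemble = max(classes, key=col.count)
        frac_correct = sum(v == yj for v in col) / i  # (1/i) * sum_t [f_t = y]
        amb.append(float(ensemble == yj) - frac_correct)
    return amb


def penalty(votes, y, classes):
    """p_i(x_j) = 1 - |amb_i(x_j)|, as in step 3 of Table 4.1."""
    return [1.0 - abs(a) for a in ambiguity(votes, y, classes)]


# Three classifiers (rows) voting on three examples (columns).
votes = [[0, 1, 1],
         [0, 0, 1],
         [0, 1, 0]]
y = [0, 1, 0]
amb = ambiguity(votes, y, classes=[0, 1])
p = penalty(votes, y, classes=[0, 1])
```

In this toy case all members agree on the first example (amb 0, penalty 1). The second example is classified correctly by the ensemble with one dissenting vote (amb 1/3), and the third is misclassified by the ensemble with one correct vote (amb -1/3). The last two therefore receive the same penalty, 2/3.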
The penalty $p_i$ enters the weight-updating step (step 5). Its main effect is that, among
the examples misclassified by the current classifier, those on which the existing individual
classifiers agree more receive a larger weight increase; among the correctly classified
examples, those on which the existing classifiers disagree more receive a larger weight
decrease. Thus, both accuracy and diversity are considered during training.
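To make this effect concrete, the update in step 5 of Table 4.1 can be written, up to the normalization factor $Z_i$, as a case distinction. This is only a restatement of the rule under the 0/1 indicator convention used in the table; the numerical values in the note that follows are illustrative assumptions.

```latex
D_{i+1}(x_j) \;\propto\;
\begin{cases}
(p_i(x_j))^{\lambda}\, D_i(x_j), & f_i(x_j) \neq y_j,\\[4pt]
(p_i(x_j))^{\lambda}\, e^{-\alpha_i}\, D_i(x_j), & f_i(x_j) = y_j.
\end{cases}
```

For instance, with $\lambda = 2$, two misclassified examples with equal $D_i$ and penalties 1 and 0.5 receive factors 1 and 0.25, so after normalization the example on which the ensemble members agree carries four times the weight of the other. The same ratio holds between two correctly classified examples, where the one with more disagreement (penalty 0.5) shrinks more.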
The pre-defined parameter λ controls the strength of applying pi. The choice of αi
in step 4 is decided by using the same inferring method in (Schapire and Singer, 1999;
Table 4.1: The AdaBoost.NC algorithm (Wang et al., 2010).
Given training data set $\{(x_1, y_1), \ldots, (x_j, y_j), \ldots, (x_N, y_N)\}$
with labels $y_j \in Y = \{\omega_1, \ldots, \omega_c\}$ and penalty strength $\lambda$,
initialize data weights $D_1(x_j) = 1/N$; penalty term $p_1(x_j) = 1$.
For each training epoch $i = 1, 2, \ldots, L$:
  Step 1. Train weak classifier $f_i$ using distribution $D_i$.
  Step 2. Get weak classifier $f_i: X \to Y$.
  Step 3. Calculate the penalty value for every example $x_j$:
    $p_i(x_j) = 1 - |amb_i(x_j)|$.
  Step 4. Calculate $f_i$'s weight $\alpha_i$ according to error and penalty:
    $\alpha_i = \frac{1}{2} \log \left( \frac{\sum_{j,\, y_j = f_i(x_j)} D_i(x_j) (p_i(x_j))^{\lambda}}{\sum_{j,\, y_j \neq f_i(x_j)} D_i(x_j) (p_i(x_j))^{\lambda}} \right)$
  Step 5. Update data weights $D_i$ and obtain new weights $D_{i+1}$
    according to error and penalty:
    $D_{i+1}(x_j) = \frac{(p_i(x_j))^{\lambda} D_i(x_j) \exp(-\alpha_i [f_i(x_j) = y_j])}{Z_i}$,
    where $Z_i$ is a normalization factor.
Output the final ensemble: $\bar{f}(x) = \arg\max_{\hat{y} \in Y} \sum_{i=1}^{L} \alpha_i [f_i(x) = \hat{y}]$.
(Define $[\pi]$ to be 1 if $\pi$ holds and 0 otherwise.)
that the classifier having fewer misclassified examples, which receive larger classification
ambiguity from the current ensemble members, will obtain a higher weight. It agrees
with the common understanding about diversity in classification that we would not want
ensemble members to make the same wrong decision. The details of how to choose $\alpha_i$ are
given in Appendix B.
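The procedure in Table 4.1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation. The weak learner `train_stump` (a single-feature threshold stump), the helper names, the tie-breaking rule in the uniform vote, and the small constant `eps` that guards against a zero denominator in step 4 are all hypothetical choices introduced for the example. The weight update follows the table literally, including the 0/1 indicator.

```python
import math


def train_stump(X, y, D, classes):
    """Hypothetical weak learner: the threshold stump with lowest D-weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for lo in classes:
                for hi in classes:
                    err = sum(d for row, yj, d in zip(X, y, D)
                              if (lo if row[f] <= thr else hi) != yj)
                    if best is None or err < best[0]:
                        best = (err, f, thr, lo, hi)
    _, f, thr, lo, hi = best
    return lambda row: lo if row[f] <= thr else hi


def penalties(votes, y, classes):
    """Step 3: p_i(x_j) = 1 - |amb_i(x_j)| with uniform classifier weights."""
    i = len(votes)
    p = []
    for j, yj in enumerate(y):
        col = [votes[t][j] for t in range(i)]
        ensemble = max(classes, key=col.count)  # ties -> first class (assumption)
        amb = float(ensemble == yj) - sum(v == yj for v in col) / i
        p.append(1.0 - abs(amb))
    return p


def adaboost_nc(X, y, classes, lam=2.0, rounds=5, eps=1e-10):
    n = len(X)
    D = [1.0 / n] * n                      # D_1(x_j) = 1/N
    learners, alphas, votes = [], [], []
    for _ in range(rounds):
        f = train_stump(X, y, D, classes)  # steps 1-2
        pred = [f(row) for row in X]
        votes.append(pred)
        p = penalties(votes, y, classes)   # step 3
        right = sum(D[j] * p[j] ** lam for j in range(n) if pred[j] == y[j])
        wrong = sum(D[j] * p[j] ** lam for j in range(n) if pred[j] != y[j])
        # Step 4; eps (assumption) avoids division by zero for a perfect stump.
        alpha = 0.5 * math.log((right + eps) / (wrong + eps))
        # Step 5: penalized weight update, then normalization by Z_i.
        D = [p[j] ** lam * D[j] * math.exp(-alpha * (pred[j] == y[j]))
             for j in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]
        learners.append(f)
        alphas.append(alpha)
    return learners, alphas, D


def predict(learners, alphas, row, classes):
    """Final ensemble: arg max over labels of the alpha-weighted votes."""
    return max(classes,
               key=lambda c: sum(a for f, a in zip(learners, alphas) if f(row) == c))


# Toy one-dimensional data that a single stump can separate.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
learners, alphas, D = adaboost_nc(X, y, classes=[0, 1])
labels = [predict(learners, alphas, row, [0, 1]) for row in X]
```

The sketch keeps the history of votes so that the penalty in step 3 always uses all classifiers built so far, including the current one. Any weak learner that accepts example weights could replace the stump, which reflects the point that the penalty acts on the data weights rather than on the learner itself.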
Some additional points explain the choice of the penalty term. Why do
we use $|amb|$ to encourage diversity instead of $amb$? As explained above, the sign of $amb$ indicates whether an example is correctly classified by the ensemble.
This accuracy information is already reflected in the original weight-updating rule of
AdaBoost, so there is no point in considering it again in $p_i$. For the same reason, the
individual classifiers are uniformly weighted in the calculation of $amb$ instead of using $\alpha$
in step 5. In addition, $f_i$'s weight $\alpha_i$ is not known until step 4, whereas $p_i$ is
calculated in step 3.
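The claim about the sign of $amb$ can be checked directly from Eq. (4.12). Let $k$ denote the number of classifiers, among the $i$ existing ones, that label example $x$ correctly ($k$ is a symbol introduced here for illustration only). Then:

```latex
amb_i(x) = [\bar{f}_i(x) = y] - \frac{k}{i} =
\begin{cases}
1 - \dfrac{k}{i} \;\ge 0, & \bar{f}_i(x) = y,\\[6pt]
-\dfrac{k}{i} \;\le 0, & \bar{f}_i(x) \neq y,
\end{cases}
\qquad
p_i(x) = 1 - |amb_i(x)| =
\begin{cases}
\dfrac{k}{i}, & \bar{f}_i(x) = y,\\[6pt]
1 - \dfrac{k}{i}, & \bar{f}_i(x) \neq y.
\end{cases}
```

In both cases $p_i(x)$ is the fraction of individual classifiers whose correctness on $x$ matches that of the ensemble. It therefore carries agreement information only, while the sign of $amb_i$ carries the accuracy information that the AdaBoost update already uses.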
AdaBoost.NC can be viewed as the first NCL algorithm developed specifically for
classification problems. Its training strategy is much simpler than that of existing NCL
algorithms. It places no restriction on the base learning method, whereas the others are
restricted to neural networks. The error correlation information is introduced into the
weights of training examples, rather than into the error function of the learners, as in the
CELS algorithm (Liu and Yao, 1999b), or into the training examples themselves, as in the
NCCD algorithm (Chan and Kasabov, 2005a). In this way, diversity is considered at the
ensemble level without modifying the base learner, which would make the algorithm
learner-dependent, or the training examples, which can introduce undesirable noise. All these
features help to create a flexible ensemble training framework with improved efficiency and accuracy.
Finally, it is worth explaining why AdaBoost.NC can outperform conventional
AdaBoost. Although AdaBoost attempts to force classifiers to make different
errors by focusing on misclassified examples sequentially, overfitting has been
reported empirically (Quinlan, 1996; Opitz and Maclin, 1999; Dietterich, 2000a), especially
when the data are noisy. In some cases, the weight vectors can become very
skewed, which may lead to an undesirable bias towards a limited group of examples. It has
also been found that AdaBoost can produce a diverse ensemble in the first few training epochs,
but diversity drops as more classifiers are built (Shipp and Kuncheva, 2002b). It is
therefore suggested to stop training early, so that performance improves while proper
diversity is maintained.
From a theoretical point of view, Schapire et al. derived an upper bound on the
generalization error of AdaBoost (Schapire et al., 1998). They proved that AdaBoost is
aggressive at increasing margins of the training examples, which contributes to the reduc-
tion of generalization error even after the training error reaches zero. However, this bound
is rather weak (Schapire, 2002). Breiman found that a better margin distribution does not
then overfitting set in (Breiman, 1999). Murua presented an improved error bound for the
linear classifier combination by introducing “mutual weak dependence” (Murua, 2002).
Importantly, this work shows that both low dependence between classifiers
and large margins play an important role in achieving low error rates, and that there is a
trade-off between them. This is evidence that considering margins alone is not sufficient.
It is worth looking for a training procedure that keeps the dependence between the
classifiers low while maintaining large margins. AdaBoost.NC is designed for this purpose. It is
expected to alleviate the overfitting problem of AdaBoost, because the penalty term boosts not only
the most difficult misclassified examples, but also the easiest examples that have been
correctly labelled. Easy examples are more likely to be chosen to help the classification
of difficult examples in AdaBoost.NC than in AdaBoost, which reduces the chance of focusing
on the same group of misclassified examples. Both theoretical and empirical studies
thus provide justification for the good performance of AdaBoost.NC.