Methods for Improving the Accuracy - Efficient boosted ensemble based machine learning in the c

sets and consequently their patterns are overemphasized in the learning. If the test data does not contain large numbers of samples which contain sub-concept patterns, then the generalization is likely to deteriorate as witnessed by results here. Also, if the training dataset possesses a significant amount of noisy data, then the learning process risks placing undue learning emphasis on them.

Ultimately, the results demonstrate that the generalization ability of PSL-like classifiers is only as strong as the combined generalization strength of their weakest nodes, which are located in the latter sections of each layer. Consequently, modifications to the PSL-like approaches were required in order to mitigate the algorithm's susceptibility to the disproportionate influence that sub-concepts and outliers have on the learning process in these nodes.

In summary, the findings show that:

1. PSL-like learning results in class-imbalanced learning.

2. Sub-concept patterns and noisy samples are isolated and concentrated in the final nodes of each layer.

3. PSL-like algorithms tend to overemphasize unrepresentative patterns in the presence of significant in-class variability and noisy samples by selecting poor features.

5.4 Methods for Improving the Accuracy

The problem of learning on imbalanced datasets and its associated challenges have been topics of extensive research in recent years. A brief review is offered here due to its relevance. Typically, random sampling methods that create balanced distributions without generating synthetic samples have been used. Of these, random over-sampling proceeds by replicating selected samples of a target class, while under-sampling removes data belonging to larger classes in order to restore balance. Both strategies introduce their own sets of problems. Over-sampling may lead to over-fitting, while under-sampling may lead to a loss of information regarding the majority class [62].
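As a rough illustration of the two random sampling strategies, the sketch below balances a small two-class dataset using only the Python standard library. The function names `random_oversample` and `random_undersample` and the `(features, label)` tuple layout are assumptions made for this example; they are not taken from the cited works.

```python
import random
from collections import defaultdict


def _group_by_label(samples):
    """Group (features, label) pairs by their label."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append((features, label))
    return groups


def random_oversample(samples, rng):
    """Replicate randomly chosen samples until every class matches the largest."""
    groups = _group_by_label(samples)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Drawn with replacement: the duplicates are what may encourage over-fitting.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def random_undersample(samples, rng):
    """Discard randomly chosen samples until every class matches the smallest."""
    groups = _group_by_label(samples)
    target = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        # Drawn without replacement: the discarded samples are lost information.
        balanced.extend(rng.sample(group, target))
    return balanced


rng = random.Random(0)
data = [((i,), 1) for i in range(3)] + [((i,), -1) for i in range(10)]
over = random_oversample(data, rng)    # 10 samples per class
under = random_undersample(data, rng)  # 3 samples per class
```

With 3 positives and 10 negatives, over-sampling yields 20 samples and under-sampling yields 6, which makes the trade-off concrete: the former repeats minority samples, the latter throws away seven majority samples.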

Some more complex and successful strategies include the synthetic minority over-sampling technique (SMOTE) [11], which artificially generates additional samples of a minority class. Adaptive synthetic sampling methods like Borderline-SMOTE [60] and ADASYN [63] address the shortcomings of SMOTE, which tends to overgeneralize when generating new samples, leading to overlap between classes. Meanwhile, cluster-based sampling methods have been found to be powerful due to the flexibility they provide in targeting specific problems.
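The core idea behind SMOTE-style generation, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration and not the reference SMOTE implementation; the function name `smote_like` and its parameters are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k, rng):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are needed to interpolate")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


rng = random.Random(1)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5, k=2, rng=rng)
```

Because every synthetic point lies on a segment between two existing minority samples, the new samples can spill into regions occupied by other classes, which is the overgeneralization that the adaptive variants try to limit.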

Boosting has also been applied directly to the problem of imbalanced learning. AdaBoost.M2 has been combined with synthetic sampling methods like SMOTE to yield SMOTEBoost [20], while Guo and Viktor [57] showed how to combine AdaBoost.M1 with a data generation approach termed DataBoost-IM. In addition, cost-sensitive learning has been combined with boosting, which enables it to focus learning on minority classes through predefined cost matrices.

While the various data generation approaches have been shown to be effective, they are also complex and computationally expensive. What is more, determining the costs of different classes in cost-sensitive learning is especially challenging [62]. As such, these approaches were not considered suitable solutions to the problems of PSL. They are also not applicable because the nature of the in-class variability of the datasets changes from one node to the next: the more representative samples are systematically removed, so the minority class sub-concepts are isolated and eventually become the majority. The challenge facing the current version of PSL-like algorithms is that, although their goal is to learn the minority sub-concepts, a strategy is required that promotes this without giving an inordinate amount of emphasis to patterns from these samples.

The task is also complicated by the nature of the problem domain and the types of features being used. Haar-like features tend to produce up to 200,000 features per sample which is much larger than the number of training samples. As mentioned earlier, this poses a problem since it has been well established that high dimensional data tends to produce over-fitted classifiers in the presence of disproportionately smaller training sets [33, 34, 169]. With the reduction of training samples taking place per node, and the concentration of unrepresentative instances taking place in the trailing nodes, the challenge of high dimensionality of the feature sets becomes even greater.

In domains such as these, it is common to apply some form of dimensionality reduction in order to lessen the risk of producing over-fitted classifiers. This is generally achieved by using either a feature selection or a feature extraction approach [33]. Feature selection is the more common choice; it seeks to identify a subset of features from the total feature space with the goal of minimizing the redundancy and irrelevance of the final feature subset. This is often achieved by applying predefined criteria based upon class separability or classification performance [22]. Feature extraction, by contrast, utilizes all the available features and projects them into fewer dimensions, which has the effect of reducing the comprehensibility of the features themselves.

Dimensionality reduction is certainly a viable option for addressing the problems here. A simple criterion holds considerable potential: make available to the trailing nodes only those features that have been used by the initial nodes of each layer, together with a selection of previously top-ranking features. However, although such a strategy promises to improve generalizability, it would compromise the comparative fairness of the training runtimes against algorithms that do not employ feature selection. Since this is an important component of the study, the pursuit of this strategy is left for future work.

Numerous studies have also found AdaBoost itself to be susceptible to outliers. Variations of AdaBoost have been proposed with the goal of producing algorithms that are more robust to noise. Gentle AdaBoost [46] and Local boosting [196] are two proposed variants. However, in PSL-based training the boosting component should not, in theory, require modification, since the boosting rounds are few and the generated rules therefore do not become too specific. This means that even in the presence of outliers, provided there is a sufficiently large number of quality samples, a significant negative effect on the generalizability of the nodes need not be expected.

Two intuitive and straightforward strategies for improving the accuracy of PSL-like classifiers were pursued. The first involved pruning the under-performing nodes (ensemble thinning) from the cascaded classifier. The second consisted of a strategic re-sampling approach to balance out the unrepresentative patterns.


Ensemble Thinning

In Section 2.1.2, the concept of thinning an ensemble classifier in order to produce a subset which performs better than the original ensemble was reviewed. Ranking-based methods for thinning an ensemble were described as operating on an individual classifier level in order to rank each one against a validation set. Thereafter, the worst performing classifiers are deleted. Search-based thinning strategies tend to consider the collective accuracies of different combinations of subsets of classifiers, while cluster-based thinning methods group classifiers together through an additional phase prior to the thinning procedure.

The initial strategy of thinning was intuitive in that it only involved recognizing that the PSL itself performs a ranking-based procedure of classifiers within each layer and orders them in clusters of PSL nodes. The next step involved introducing a thinning parameter γ to define how many nodes required pruning from each layer. The node-thinning strategy was naive in that it did not explicitly evaluate the nodes for accuracy but instead removed γ nodes from each layer regardless of their generalizability or the total number of existing nodes within a layer.
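A minimal sketch of this naive thinning step is given below. It assumes that the γ nodes removed are the trailing ones in each layer, which is consistent with the surrounding discussion of where the weakest nodes sit but is not stated explicitly. The `min_nodes` guard, which keeps a layer from being emptied, is likewise an assumption of this example, as are the function name and the list-of-lists cascade representation.

```python
def thin_cascade(cascade, gamma, min_nodes=1):
    """Drop the last `gamma` nodes of every layer, keeping at least `min_nodes`."""
    thinned = []
    for layer in cascade:
        # No accuracy check is made: the cut depends only on the node's position.
        keep = max(min_nodes, len(layer) - gamma)
        thinned.append(layer[:keep])
    return thinned


# Nodes are represented by labels only; a real node would hold weak classifiers.
cascade = [["n1", "n2"], ["n1", "n2", "n3", "n4"], ["n1", "n2", "n3"]]
thinned_1 = thin_cascade(cascade, gamma=1)
thinned_2 = thin_cascade(cascade, gamma=2)
```

In this toy cascade, γ = 2 strips the two-node first layer and the three-node last layer down to a single node each, which shows how quickly a fixed γ becomes aggressive on short layers.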

The thinning strategy was evaluated on the classifiers from the previous section with two different values of γ, namely 1 and 2. Figure 5.10 shows the results of the thinning procedure on the 15000 positive sample dataset where γ = 1. The results indicate that the thinning strategy improved the accuracy of the original classifier in nearly every instance. In the majority of cases, the ROC curves of the pruned classifiers shifted towards the ideal top-left corner position of the graphs. In some instances the improvement in accuracy was up to five percentage points.

The results of thinning for γ = 2 did not improve the accuracy of the classifiers. The thinning strategy proved to be too naive when γ was greater than 1, since the earlier layers of a cascade tended to consist of only a few nodes, for which this method was too aggressive. Overall, the results indicated that there is scope for further refinement of the thinning strategy. More sophisticated strategies can be devised which prune layers more intelligently by taking into consideration the total number of nodes within a layer, or by beginning the pruning process from a selected layer onwards in a cascade. Additionally, the thinning method can be combined with strategies which prune the nodes based on their performance on validation sets. The advantage of the current approach is that it requires virtually no post-processing overhead.

Moreover, a byproduct of the way PSL-like methods cluster samples into nodes is that the clustering can serve as a procedure for identifying outlying or noisy samples. These samples can then be identified during the learning phase and removed from the training of subsequent layers. An additional parameter can easily be introduced to the cascade learning process to permanently clean the dataset of such samples. For instance, this parameter can represent the minimal proportion of unlearned samples within a base set. Once the proportion of unlearned samples falls below this predefined value, the samples can be identified as noisy or as representing irrelevant class sub-concepts and deleted from training.
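One way such a cleaning parameter might behave is sketched below. The function name, the use of integer sample identifiers and the exact comparison against the base set size are all assumptions of this example; the text proposes the parameter only in outline.

```python
def remove_noisy_samples(training_set, unlearned, base_set_size, min_proportion):
    """Delete the unlearned samples once they form too small a share of a base set.

    Returns the (possibly cleaned) training set and a flag telling whether
    the deletion was triggered.
    """
    proportion = len(unlearned) / base_set_size
    if proportion < min_proportion:
        unlearned_ids = set(unlearned)
        cleaned = [s for s in training_set if s not in unlearned_ids]
        return cleaned, True
    return list(training_set), False


training_set = list(range(100))  # hypothetical sample identifiers
few_left, triggered = remove_noisy_samples(
    training_set, unlearned=[3, 57], base_set_size=50, min_proportion=0.05
)
many_left, not_triggered = remove_noisy_samples(
    training_set, unlearned=[1, 2, 3, 4, 5], base_set_size=50, min_proportion=0.05
)
```

In the first call the two unlearned samples make up 4% of the base set, below the 5% threshold, so they are dropped; in the second call the five unlearned samples make up 10% and training continues on the full set.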

[Figure 5.10 panels: twelve ROC plots of Hit Rate (0.7 to 1) against False Positive Rate (0 to 0.0014), each titled "ROC Curves for Classifiers Trained on a 15000 Sample Set" for Φ = 5, 10, 15 and 20. For each value of Φ, three panels compare a classifier with its thinned (γ=1) counterpart: PSL, BPSL β=500 and BPSL β=2000.]

Figure 5.10: ROC curves for PSL and BPSL classifiers showing the results of pruning where the pruning parameter γ = 1.


Informed Re-sampling

While the first strategy involved a simple post-processing step, the second strategy consisted of modifying the training sets to which the trailing nodes are exposed, with the goal of balancing out the unrepresentative patterns of the target class with more general patterns.

This second strategy performed random re-sampling of positive samples, without replacement, from the set of positive samples which had already been correctly learned by preceding nodes. As such, the re-sampling is informed and is only carried out when the size of the base set falls below the specified size β. The base set is augmented by n samples, where n is the difference between the specified set size β and the current number of unlearned positive samples. The re-sampled positives are referred to as secondary positives, while the remaining misclassified samples are termed primary positives.

Once a node is trained, the secondary samples that are classified as false negatives remain within the training set for the next node. The secondary samples that have been predicted as true positives are replaced with new true positive samples as classified by the prior nodes. This strategy helps to steer the learning process away from creating overly specific rules based on the minority samples in the trailing nodes.
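The bookkeeping described in the last two paragraphs can be sketched as follows, with samples represented by integer identifiers. The function names, the `predictions` dictionary and the choice to draw replacements only from learned samples not already among the secondaries are assumptions made for illustration.

```python
import random


def build_base_set(primaries, learned_pool, beta, rng):
    """Top up the unlearned (primary) positives with learned (secondary) ones."""
    n_missing = max(0, beta - len(primaries))
    # Sampling without replacement; the pool may hold fewer than n_missing samples.
    secondaries = rng.sample(learned_pool, min(n_missing, len(learned_pool)))
    return list(primaries), secondaries


def refresh_secondaries(secondaries, predictions, learned_pool, rng):
    """Keep misclassified secondaries; swap correctly classified ones for new draws."""
    kept = [s for s in secondaries if predictions[s] == -1]
    candidates = [s for s in learned_pool if s not in secondaries]
    n_replace = min(len(secondaries) - len(kept), len(candidates))
    return kept + rng.sample(candidates, n_replace)


rng = random.Random(2)
primaries = [0, 1, 2]               # ids of positives still misclassified
learned_pool = list(range(10, 30))  # ids of positives already learned
primaries, secondaries = build_base_set(primaries, learned_pool, beta=8, rng=rng)
# Hypothetical node output: the first secondary is missed, the rest are detected.
predictions = {s: (-1 if i == 0 else 1) for i, s in enumerate(secondaries)}
refreshed = refresh_secondaries(secondaries, predictions, learned_pool, rng)
```

Here three primaries are topped up with five secondaries to reach β = 8. After the node is trained, the one secondary that was missed stays in the set and the four that were detected are exchanged for fresh learned samples.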

A further strategy was required in order to ensure that the learning focused on the primary samples rather than on re-learning patterns from the secondary positives. This was needed because the skewness between the primary and secondary positives becomes more acute for each successive node with the consequence that the primary samples are not learned, resulting in over-inflated layers with large numbers of nodes. Due to this, two additional mechanisms were used. The first increased the weights of the primary positives and the second relaxed the Φ size requirement for a fixed number of weak classifiers per node.

Given a 50-50 split of the total available weight between the positive and negative samples, the strategy for increasing the weights of the primary samples involved a function that allocated more weight to the primary samples as their proportion relative to the secondaries decreased. The modification of sample weights was designed to begin once the proportion of primaries to secondaries reached a predefined threshold t. With t = 0.5, the weight re-allocation function was defined as:

$$w_j = t + \frac{t - p_j}{2}, \tag{5.1}$$

where $w_j$ is the proportion of total positive weights allocated to the primary samples for node $j$, and $p_j$ is the proportion of primary samples that comprise the positive base set for node $j$ when $p_j < t$. Once $p_j$ falls below the predefined threshold, the total proportion of the re-allocated weights to the primary samples ranges from 0.5 to 0.75.
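A few values of Equation 5.1 can be checked with the short sketch below. The function name is hypothetical, and the behaviour at or above the threshold (leaving the primaries with their natural share under uniform weights) is an assumption, since the equation is only defined for proportions below t.

```python
def primary_weight_share(p_j, t=0.5):
    """Share of the positive weight given to primaries, w_j = t + (t - p_j) / 2."""
    if not 0.0 <= p_j <= 1.0:
        raise ValueError("p_j must be a proportion in [0, 1]")
    if p_j >= t:
        # Assumption: no re-allocation yet, so uniform weights give a share of p_j.
        return p_j
    return t + (t - p_j) / 2


shares = {p: primary_weight_share(p) for p in (0.5, 0.25, 0.1, 0.0)}
```

For t = 0.5 this gives a share of 0.5 at the threshold, 0.625 when a quarter of the base set is primary, 0.7 when a tenth is, and 0.75 in the limit where almost no primaries remain, matching the stated range of 0.5 to 0.75.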

The Φ node size parameter was also relaxed with respect to the increasing skewness between the primary and secondary positives. The number of weak classifiers per node $j$ was increased by a proportion $1 - p_j$ of Φ as soon as the number of primary positives in a given node fell below β. Therefore, as the primaries became scarcer, the node size Φ increased up to a maximum of twice its original size. This procedure was also necessary in order to limit the total number of nodes being generated by each layer. Algorithm 8 summarizes the entire procedure.

Algorithm 8: BPSL with Re-sampling (BPSL.r) procedure for a given layer i.

Given: P_{i,j} denotes the positive dataset of size β used to train node j in layer i, where P_{i,j} ⊂ P. Φ denotes the maximum number of weak classifiers per node.

    P_{i,1} = RandomSample(P, β)
    offset = 0
    for j = 1 to n do
        for k = 1 to Φ + offset_j do
            h_{i,j,k} = AdaBoost(P_{i,j}, N_i)
        P_{i,j+1} = (x_m, y_m), ∀x ∈ P_{i,j}, where h_{i,j}(x_m) = −1 ∧ y_m = 1
        sP = SizeOf(P_{i,j+1})
        for k = 1 to β − sP do
            P_{i,j+1} = RandomSample(P, 1), where (h_i(x_m) = −1) ∧ (y_m = 1)
            sP = sP + 1
        if P_{i,j+1} = ∅ then
            break
        if sP < β then
            for k = 1 to β − sP do
                P_{i,j+1} = RandomSample(P, 1), where (h_i(x_m) = 1) ∧ (y_m = 1)
            offset_j = Φ (1 − sP/β)
            w_j = t + ½ (t − sP/β)
            foreach (x_m, y_m) in P_{i,j+1} do
                if isPrimaryPositive(x_m, y_m) then
                    (x_m, y_m) = w_j / sP
                else
                    (x_m, y_m) = (1 − w_j) / (β − sP)
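The final steps of Algorithm 8, which set the node size offset and the per-sample weights, can be sketched as below. The function name `node_settings`, the rounding of the offset to a whole number of weak classifiers and the use of integer arithmetic for it are assumptions of this example; the weight share follows Equation 5.1 with p_j = sP/β.

```python
def node_settings(s_p, beta, phi, t=0.5):
    """Return (offset, primary weight, secondary weight) for one node.

    s_p is the number of primary positives in a base set of size beta.
    """
    if not 0 < s_p < beta:
        raise ValueError("expects 0 < s_p < beta, i.e. some secondaries were added")
    p_j = s_p / beta
    # Extra weak classifiers, Phi * (1 - sP / beta); rounding is an assumption.
    offset = round(phi * (beta - s_p) / beta)
    w_j = t + (t - p_j) / 2                      # Equation 5.1
    primary_weight = w_j / s_p                   # weight of each primary positive
    secondary_weight = (1 - w_j) / (beta - s_p)  # weight of each secondary positive
    return offset, primary_weight, secondary_weight


s_p, beta, phi = 400, 2000, 10
offset, w_primary, w_secondary = node_settings(s_p, beta, phi)
total = s_p * w_primary + (beta - s_p) * w_secondary
```

With 400 primaries in a base set of 2000 and Φ = 10, the node is allowed 8 additional weak classifiers, the primaries share 65% of the positive weight, and each primary carries roughly 7.4 times the weight of a secondary. The positive weights still sum to one, so the balance with the negative samples is left untouched.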

The experiments consisted of re-training all the BPSL classifiers described in Section 5.3.1 under the same conditions. BPSL.r refers to the new BPSL classifiers that have been trained with the re-sampling component. In order to establish the degree of meaningfulness of the trailing nodes, the BPSL.r component was also combined with the previous strategy of pruning where γ = 1.

The following sequences of graphs plot the ROC curves of PSL, BPSL, BPSL (with ensemble thinning), BPSL.r and BPSL.r (combined with thinning). The first sequence of ROC curves is shown in Figure 5.11, where the base set size parameter β = 2000 and Φ = 10. The first graph plots a BPSL and a BPSL.r classifier, each trained 10 times in order to provide statistical data on the variation that can be expected in the subsequent ROC curves for both types of classifiers. Once again, the standard deviations for the hit and false positive rates are the smallest for the most critical regions of the ROC curves.

[Figure panels: three ROC plots of Hit Rate (0.7 to 1) against False Positive Rate (0 to 0.005). Panel 1: "Standard Deviation of ROC Curves for BPSL and BPSL.r Classifiers Trained on a 5000 Sample Set and Φ=10", comparing BPSL β=2000 Φ=10 with BPSL.r β=2000 Φ=10. Panel 2: "ROC Curves for the 5000 Sample Training Set", comparing PSL Φ=10, BPSL β=2000 Φ=10, BPSL γ=1 β=2000 Φ=10, BPSL.r β=2000 Φ=10 and BPSL.r γ=1 β=2000 Φ=10. Panel 3: "ROC Curves for the 10000 Sample Training Set", comparing PSL Φ=10, BPSL β=2000 Φ=10, BPSL γ=1 β=2000 Φ=10 and BPSL.r β=2000 Φ=10.]

In document Efficient boosted ensemble based machine learning in the context of cascaded frameworks : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand (Page 88-98)