Generalisation Error - A Novel Ensemble Approach

6.2 A Novel Ensemble Approach

6.2.2 Generalisation Error

We now compare the generalisation ability of irsADE and rsADE. As expected, the ensemble techniques improve generalisation on some datasets but not on others, which is consistent with the view that no universal learning technique is able to discriminate every classification problem.

Figures 6.2 and 6.3 compare the generalisation error of irsADE and rsADE on the Mushroom and Segment datasets, respectively, as we increase the number of base classifiers. Figure 6.2 shows that the ensemble generalisation error decreases from 16% to 4% as classifiers are added to the ensemble, and that irsADE and rsADE show similar classification performance at every ensemble size. Figure 6.3 shows a similar trend for the Segment dataset: overall the generalisation error decreases by more than 15% and both ensemble techniques show identical performance. Figures B.5 and B.4 in Appendix B illustrate that increasing the number of base classifiers up to 50% does not change how the two ensemble methods behave relative to each other: in both figures the classification error reaches a plateau from ensembles of 20 onwards, although the mean classification error seems to improve slightly. It is interesting to compare the generalisation ability of the base classifiers with that of the ensemble as we increase the number of classifiers. Figures

Figure 6.3: Segment dataset: irsADE vs rsADE – Test Error (mean and 95% confidence interval). [Plot: Test Error against Ensemble Size for rsADE and irsADE.]

C.4 and C.5 in Appendix C show, on the same graph, the ensemble error for increasing numbers of base classifiers and the error of each single base classifier before it is combined in the irsADE ensemble. For the Mushroom and Segment datasets the ensembles succeed in reducing the generalisation error because they combine diverse base classifiers: the base classifiers show different levels of accuracy and must therefore be diverse. We conclude that when base classifiers are diverse, both irsADE and rsADE succeed in reducing the overall classification error, and that the sign of interaction information can be used as a model selection criterion for base classifiers, as it does not negatively affect the ensemble performance.
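The comparison between single-classifier error and ensemble error can be made concrete with a small sketch. It assumes that each base classifier outputs a class-probability vector per test pattern and that these are combined with the simple mean rule mentioned later in this section. The toy probabilities, the function names and the normal-approximation confidence interval are illustrative assumptions, not the experimental setup behind the figures.

```python
from statistics import mean, stdev

def mean_rule_predict(prob_rows):
    """Average the classifiers' probability vectors and pick the top class."""
    n_classes = len(prob_rows[0])
    avg = [mean([row[c] for row in prob_rows]) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])

def ensemble_error(probs, labels, size):
    """Test error of the first `size` classifiers combined by the mean rule.

    probs[k][i] is classifier k's probability vector for test pattern i.
    """
    wrong = 0
    for i, y in enumerate(labels):
        pred = mean_rule_predict([probs[k][i] for k in range(size)])
        wrong += pred != y
    return wrong / len(labels)

def mean_and_ci95(errors):
    """Mean error and 95% half-width over repeated runs.

    Uses a normal approximation (1.96 standard errors); this is an
    assumption, the source does not state how its intervals were computed.
    """
    half_width = 1.96 * stdev(errors) / len(errors) ** 0.5
    return mean(errors), half_width

# Hypothetical toy data: 4 test patterns, 2 classes, 3 base classifiers,
# each of which is wrong on a different pattern (i.e. they are diverse).
labels = [0, 1, 0, 1]
probs = [
    [[0.4, 0.6], [0.2, 0.8], [0.9, 0.1], [0.3, 0.7]],    # wrong on pattern 0
    [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3], [0.1, 0.9]],    # wrong on pattern 1
    [[0.9, 0.1], [0.1, 0.9], [0.45, 0.55], [0.2, 0.8]],  # wrong on pattern 2
]

errors_by_size = [ensemble_error(probs, labels, s) for s in (1, 2, 3)]
print(errors_by_size)  # [0.25, 0.0, 0.0]
```

In this toy case each classifier alone has an error of 0.25, but because the errors fall on different patterns the averaged probabilities correct them. This is the behaviour attributed above to diverse base classifiers.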

Figures 6.4, 6.5 and 6.6 show three cases where irsADE and rsADE do not succeed in reducing the generalisation error. Increasing the number of classifiers in the ensemble does not alter the ensemble performance, as shown in Figures B.2, B.3 and B.1. It is interesting to note that this occurs for Glass and Magic4, which are very different in sample size (the first has only 107 training patterns, whereas the second has 9509) but both have a small number of features (9 and 10, respectively). Congress, on the other hand, has a larger number of features (16) and a small number of training patterns (217). The reason why these ensemble methods do not succeed might be because the base

Figure 6.4: Glass dataset: irsADE vs rsADE – Test Error (mean and 95% confidence interval). [Plot: Test Error against Ensemble Size for rsADE and irsADE.]

Figure 6.5: Magic4 dataset: irsADE vs rsADE – Test Error (mean and 95% confidence interval). [Plot: Test Error against Ensemble Size for rsADE and irsADE.]

Figure 6.6: Congress dataset: irsADE vs rsADE – Test Error (mean and 95% confidence interval). [Plot: Test Error against Ensemble Size for rsADE and irsADE.]

classifiers are not diverse from each other, as shown in Figures C.2, C.3 and C.1: from these graphs it is easy to observe that the base classifiers show similar classification errors, which are comparable to the irsADE ensemble error as we increase the number of classifiers in the ensemble. This behaviour is less pronounced for Magic4. However, irsADE and rsADE show comparable generalisation error, which confirms that the sign of interaction information can be used as a proxy for classification accuracy.
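The text above infers a lack of diversity from the similarity of the base classifiers' error curves. A more direct check, sketched here, is the mean pairwise disagreement between base classifiers, a standard diversity measure that is not claimed to be the one used in this work. For simplicity the sketch combines hard label predictions by majority vote rather than by the mean rule, and all data and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def disagreement(preds_a, preds_b):
    """Fraction of test patterns on which two classifiers predict differently."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def mean_pairwise_disagreement(all_preds):
    """Average disagreement over all pairs of base classifiers."""
    pairs = list(combinations(all_preds, 2))
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)

def majority_vote_error(all_preds, labels):
    """Error of the majority vote over hard predictions (odd ensemble size)."""
    wrong = 0
    for i, y in enumerate(labels):
        votes = Counter(preds[i] for preds in all_preds)
        wrong += votes.most_common(1)[0][0] != y
    return wrong / len(labels)

labels = [0, 1, 0, 1, 0, 1]

# Three classifiers that all make the same mistake (pattern 0).
similar = [
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 1],
]

# Three classifiers with the same individual error (1/6) but each
# wrong on a different pattern.
diverse = [
    [1, 1, 0, 1, 0, 1],  # wrong on pattern 0
    [0, 0, 0, 1, 0, 1],  # wrong on pattern 1
    [0, 1, 1, 1, 0, 1],  # wrong on pattern 2
]

for name, preds in (("similar", similar), ("diverse", diverse)):
    print(name,
          round(mean_pairwise_disagreement(preds), 3),
          round(majority_vote_error(preds, labels), 3))
```

With zero disagreement the vote simply reproduces the single-classifier error, which matches the flat curves described for Glass, Magic4 and Congress. With the same individual errors spread over different patterns the vote removes them.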

Figures 6.7 and 6.8 compare the generalisation ability of irsADE and rsADE on Sickeuthyroid and Hypothyroid as we increase the number of base classifiers. For both datasets these ensemble approaches negatively affect the classification performance: the generalisation error grows with the number of classifiers.

Sickeuthyroid and Hypothyroid are, respectively, the 2-class and 4-class versions of the same classification problem. These datasets are particularly unbalanced. In Sickeuthyroid 93.9% of the data is of class 1 and only 6.1% is of class 2. Similarly, in Hypothyroid the data belongs to one of 4 classes according to these percentages: 5.1%, 92.3%, 2.5%, and 0.1%. Moreover, only 40% of the 3772 patterns are distinguishable from each other. It is worth observing that the ensemble

Figure 6.7: Sickeuthyroid dataset: irsADE vs rsADE – Test Error (mean and 95% confidence interval). [Plot: Test Error against Ensemble Size for rsADE and irsADE.]

Figure 6.8: Hypothyroid dataset: irsADE vs rsADE – Test Error (mean and 95% confidence interval). [Plot: Test Error against Ensemble Size for rsADE and irsADE.]

does not improve over the single classifier: the base classifiers make errors which are not statistically different from each other, so combining them does not improve on their individual accuracies, as shown in Figures C.6 and C.7 in Appendix C. Moreover, it is interesting to point out that on average about 92% of test patterns are classified as class 1, which is in line with the prior distribution of the data. By analysing the sensitivity and specificity of the base classifiers and of the ensemble as we increase the number of classifiers, we find that the average sensitivity of the base classifiers is about 97% and their average specificity is 9%. More accurate classifiers show higher levels of specificity as well as high levels of sensitivity. Regarding the behaviour of the irsADE ensemble, the first base classifier shows similar values of sensitivity and specificity, but the simple mean rule raises the sensitivity of the ensemble to 100% and reduces its specificity to 0%.
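The sketch below illustrates one way in which a mean rule could drive sensitivity to 100% and specificity to 0% on unbalanced data. It is a hypothetical mechanism on toy data, not an analysis of the actual irsADE base classifiers. It assumes that class 1 (the majority class) is treated as the positive class, which is consistent with the values reported above, and that each classifier outputs an estimate of P(class 1) thresholded at 0.5.

```python
def sensitivity_specificity(preds, labels, positive=1):
    """Sensitivity (recall on `positive`) and specificity (recall on the rest)."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    tn = sum(p != positive and y != positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    return tp / (tp + fn), tn / (tn + fp)

def threshold(p_class1):
    """Predict class 1 when P(class 1) >= 0.5, otherwise class 2."""
    return [1 if p >= 0.5 else 2 for p in p_class1]

def mean_rule(p_class1_per_clf):
    """Average P(class 1) over classifiers, then threshold."""
    n_clf = len(p_class1_per_clf)
    n = len(p_class1_per_clf[0])
    avg = [sum(clf[i] for clf in p_class1_per_clf) / n_clf for i in range(n)]
    return threshold(avg)

# Toy data: 8 majority (class 1) and 2 minority (class 2) patterns.
labels = [1] * 8 + [2] * 2

# Each classifier weakly detects a different minority pattern.
p_a = [0.9] * 8 + [0.4, 0.8]
p_b = [0.9] * 8 + [0.8, 0.4]

preds_a = threshold(p_a)
preds_b = threshold(p_b)
ensemble_preds = mean_rule([p_a, p_b])

print(sensitivity_specificity(preds_a, labels))         # (1.0, 0.5)
print(sensitivity_specificity(preds_b, labels))         # (1.0, 0.5)
print(sensitivity_specificity(ensemble_preds, labels))  # (1.0, 0.0)
```

Averaging pulls both weak minority detections back above the threshold, so the ensemble predicts class 1 everywhere. A classifier in that state has an error equal to the minority share of the test set, which would be about 6.1% for Sickeuthyroid if the test set follows the class proportions quoted above.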

In document A probabilistic perspective on ensemble diversity (Page 128-133)