5.2 Data Mining Concepts
5.2.3 Addressing Class Imbalance
The approach proposed in this chapter is founded on the premise that the data generated during fault injection analysis captures aspects of the relationships between system states and system failures. Based on the states sampled and behaviours observed during fault injection analysis, a data mining algorithm can then generate error detection predicates by learning these captured relationships. However, data sets derived from fault injection analysis are often imbalanced, in the sense that most of the logged states do not lead to a system failure, i.e., only a small proportion of runs lead to failure. This imbalance must be addressed for the data mining process to be effective in generating efficient error detection predicates.
A key assumption made by concept learning algorithms that are based on error minimisation is that the training data used is well balanced [68]. That is to say, such algorithms assume that the distribution of class labels in training data sets is approximately uniform. However, there are a number of domains, such as network intrusion detection, fraud detection and software reliability, where the number of positive instances is often smaller than the number of negative instances. In addition to this skew in distribution, it is often the case that the minority class is the more interesting class to predict. Indeed, with respect to
5. Generating Efficient Error Detection Mechanisms
the examples of generating efficient error detection mechanisms and detecting network intrusion, it is the minority classes, i.e., system failures and network intrusions, that are of most interest.
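The effect of such a skew on an error minimisation criterion can be illustrated with a minimal Python sketch. The counts used here (990 states that do not lead to failure and 10 that do) are hypothetical and chosen only for illustration; they are not taken from any fault injection data set discussed in this chapter.

```python
# Hypothetical labels, for illustration only: 1 = state leads to a failure.
labels = [0] * 990 + [1] * 10

# A trivial "predicate" that never flags a state as erroneous.
predictions = [0] * len(labels)

# Overall accuracy rewards the trivial predicate.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the minority (failure) class exposes the problem.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.2f}, recall on failures = {recall:.2f}")
# prints: accuracy = 0.99, recall on failures = 0.00
```

Under these assumed counts, a model that minimises error alone can appear highly accurate while detecting none of the failures of interest.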
Two approaches have been used to address the problem of class imbalance. The first is to act as if there is a higher cost associated with misclassifying instances of the minority class. Specifically, it is possible to define a cost matrix based on the class imbalance and then use the same error minimisation-based concept learning algorithms. However, this approach assumes that such a cost matrix can be incorporated into the learning process. This incorporation can, for example, be achieved by the altered priors technique proposed by Breiman et al. [21]. The second approach is to replace error minimisation metrics with cost minimisation metrics when searching the hypothesis space. However, Pazzani et al. showed that using misclassification costs as a greedy selection criterion in decision tree induction does not provide cost minimisation for the model generated [122]. Further, Ting et al. compared instance weighting to using minimum expected cost metrics for assigning labels to leaf nodes in a decision tree induced to minimise errors [157]. The results of these experiments suggested that instance weighting is more effective than a cost minimisation-based approach.
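As a rough illustration of minimum expected cost labelling at a leaf node, the following Python sketch picks the label whose expected misclassification cost is lowest. The function name, the leaf class counts and the cost values are assumptions made for this example, and the cost matrix is assumed to hold in entry (i, j) the cost of predicting class j for an instance of class i.

```python
def min_expected_cost_label(class_counts, cost):
    """Return (label, expected_costs) for a leaf node.

    class_counts[i] is the number of training instances of class i at the leaf.
    cost[i][j] is the assumed cost of predicting j when the true class is i.
    The chosen label j minimises sum_i P(i) * cost[i][j].
    """
    total = sum(class_counts)
    probs = [n / total for n in class_counts]
    n_classes = len(class_counts)
    expected = [
        sum(probs[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    best = min(range(n_classes), key=lambda j: expected[j])
    return best, expected


# Hypothetical leaf: 90 non-failure (class 0) and 10 failure (class 1) instances.
# Missing a failure is assumed to cost 20 times as much as a false alarm.
cost = [[0, 1],
        [20, 0]]
label, expected = min_expected_cost_label([90, 10], cost)
print(label, expected)
```

With these assumed values the leaf is labelled as class 1 (failure) even though class 0 is the majority at the leaf, because mislabelling the ten failure instances is the more expensive mistake.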
The assignment of distinct weights to training examples, in effect, changes the data distribution within the training data [40] [46] [122] [157]. The associated cost matrix must be converted to a cost vector, $V$, which can be difficult in the context of multi-class classification problems. Breiman et al. proposed using the sum of all misclassification costs for instances of the class, though alternatives, such as $V(i) = \arg\max_j C(i, j)$, have also been proposed [21]. Ting et al. assign the same weight to all instances of a particular class, $L_j$, based on $V(j)$ using the formula shown in Equation 5.8, where $N_j$ is the number of instances in the data labelled $L_j$ and $N = \sum_i N_i$ [157].
$$w(j) = V(j) \frac{N}{\sum_i V(i) N_i} \qquad (5.8)$$
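The weighting scheme can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used by Ting et al.: the function name, the cost matrix and the class counts are hypothetical, and the cost vector is taken as the row sum of the cost matrix, following the suggestion attributed to Breiman et al.

```python
def class_weights(cost, counts):
    """Compute w(j) = V(j) * N / sum_i V(i) * N_i for every class j.

    cost[i][j] is the assumed cost of misclassifying a class-i instance as j.
    counts[i] is N_i, the number of training instances labelled class i.
    V(i) is taken as the sum of row i of the cost matrix.
    """
    v = [sum(row) for row in cost]                      # cost vector V
    n_total = sum(counts)                               # N = sum_i N_i
    denom = sum(v_i * n_i for v_i, n_i in zip(v, counts))
    return [v_j * n_total / denom for v_j in v]


# Hypothetical two-class problem: class 0 = no failure, class 1 = failure.
cost = [[0, 1],
        [20, 0]]
counts = [990, 10]
weights = class_weights(cost, counts)
print(weights)
```

A useful property of this normalisation is that the weighted instance count, the sum over classes of w(j) multiplied by N_j, equals N, so the effective size of the training set is unchanged while the ratio between class weights matches the ratio between the entries of V.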
An alternative to implicitly changing the data distribution is to resample the original data set, either by oversampling the minority class or by undersampling the majority class, to make the class distribution more uniform [68] [89] [103]. A variety of resampling approaches have been investigated, the most common being sampling with replacement to oversample the minority class and sampling without replacement to undersample the majority class. Japkowicz also experimented with focussed sampling approaches that oversampled from the boundary regions and undersampled from regions far from the decision boundary, but the experiments suggested that these offer little value over random sampling approaches [68]. Chawla et al. proposed the generation of synthetic data for minority classes along the line segments joining an example to its k minority class nearest neighbours, rather than simply sampling with replacement [26]. Empirical tests showed their method, known as Synthetic Minority Oversampling Technique (SMOTE), to outperform simple sampling with replacement. Zadrozny et al. proposed the use of a cost-proportionate rejection sampling technique, while Kubat and Matwin suggested undersampling by removing redundant and borderline negative examples [89] [172]. A criticism of the oversampling and undersampling approaches is that it is not clear how much oversampling and undersampling should be carried out. Chawla et al. proposed the use of cross validation for setting the levels of undersampling of the majority class and oversampling of the minority class automatically, ultimately demonstrating that this process can improve model accuracy [27].
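The basic resampling strategies can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not a reproduction of any cited implementation: the function names, the two-dimensional Gaussian data, the value of k and the sample sizes are all hypothetical, and `smote_like` only follows the spirit of SMOTE by interpolating between a minority example and one of its k nearest minority neighbours.

```python
import random


def undersample(majority, n, rng):
    """Random undersampling: draw n majority examples without replacement."""
    return rng.sample(majority, n)


def oversample(minority, n, rng):
    """Random oversampling: draw n minority examples with replacement."""
    return [rng.choice(minority) for _ in range(n)]


def smote_like(minority, k, n_synthetic, rng):
    """Generate synthetic minority points in the spirit of SMOTE.

    Each synthetic point lies on the line segment between a randomly chosen
    minority example and one of its k nearest minority neighbours
    (nearest by squared Euclidean distance).
    """
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]
        neighbours = sorted(
            others,
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]

balanced_down = undersample(majority, len(minority), rng) + minority
balanced_up_random = majority + minority + oversample(minority, 190, rng)
synthetic = smote_like(minority, k=3, n_synthetic=190, rng=rng)
balanced_up_smote = majority + minority + synthetic
print(len(balanced_down), len(balanced_up_random), len(balanced_up_smote))
# prints: 20 400 400
```

All three calls here balance the classes exactly, which is only one possible choice. The amount of resampling is a free parameter, which is the criticism noted above and the motivation for selecting it by cross validation.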