Studies on the Effectiveness of Methods and Performance Metrics for Evaluating

2. REVIEW OF EXISTING LITERATURE

2.9 OPEN ISSUES AND EXISTING GAPS IN CLASS IMBALANCE RESEARCH

2.9.5 Studies on the Effectiveness of Methods and Performance Metrics for Evaluating

With regard to performance, the main questions in the assessment of imbalanced-class problems are:

1. What are the data characteristics that degrade the performance of classifiers in imbalanced tasks?

2. Are there approaches that, in general, provide the best improvements in performance?

3. Is the performance of the learning algorithms affected by different degrees of imbalance?

4. How do varying degrees of imbalance in the distribution of classes affect the performance of classifiers?

One of the first studies to address some of these questions is Japkowicz and Stephen (2002), in which the authors compare five different strategies: random undersampling, random oversampling, focused undersampling, focused oversampling (where the sampling concentrates on parts of the input space that are either far from or close to the decision boundary), and modification of the misclassification costs associated with the classes. Although the study is an essential contribution to class-imbalance research, a significant limitation is that performance was compared using the error rate of the classifiers, a measure that has been shown to be unsuitable for class-imbalance domains.

The main conclusions of Japkowicz and Stephen (2002) are as follows. First, when using DT, the 'harm' caused by the imbalance increases as the degree of data separability decreases. Second, increasing the training set size reduces the 'harm' caused by the imbalance. Third, the degree of imbalance is only a problem when disjuncts are present in the data. Fourth, undersampling generally underperforms oversampling. Lastly, modifying the costs associated with the misclassification of the different classes tends to outperform both random and focused oversampling.
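The two random sampling strategies mentioned above can be sketched as follows. This is a minimal illustration and not the procedure of Japkowicz and Stephen (2002): it assumes that both strategies resample until the classes are exactly balanced, and the function names and toy data are hypothetical. The focused variants and the cost-modification strategy are not shown.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Randomly discard majority examples until both classes have equal size."""
    rng = random.Random(seed)
    kept = rng.sample(majority, k=min(len(majority), len(minority)))
    return kept, list(minority)


def random_oversample(majority, minority, seed=0):
    """Randomly duplicate minority examples (with replacement) until balanced."""
    rng = random.Random(seed)
    extra = max(len(majority) - len(minority), 0)
    resampled = list(minority) + [rng.choice(minority) for _ in range(extra)]
    return list(majority), resampled


# Hypothetical toy data: 95 majority examples and 5 minority examples.
majority = list(range(95))
minority = list(range(95, 100))

under_maj, under_min = random_undersample(majority, minority)
over_maj, over_min = random_oversample(majority, minority)
```

Undersampling discards information from the majority class, whereas oversampling only repeats existing minority examples, which is one plausible reason the two strategies can behave differently in practice.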

A different experimental approach was used by Batista et al. (2012) and Prati et al. (2014), who used real datasets and, for each dataset, generated several training set distributions with the same number of examples and varying degrees of imbalance. The effect of changing the degree of imbalance in the class distribution of a dataset was assessed by measuring the loss in performance (using the AUC metric) of an imbalanced distribution in comparison to a perfectly balanced class distribution:

$$Loss = L = \frac{B - I}{B} \qquad (2)$$

where B represents the performance obtained on a perfectly balanced class distribution, and I represents the performance obtained on an imbalanced distribution.
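As a small worked example of Equation 2, the sketch below reads the loss as the relative drop L = (B - I) / B, which is an interpretation based on the definitions of B and I given above. The AUC values are hypothetical and the function name is illustrative.

```python
def relative_loss(balanced_perf, imbalanced_perf):
    """Relative performance loss L = (B - I) / B (assumed reading of Equation 2)."""
    if balanced_perf <= 0:
        raise ValueError("balanced performance B must be positive")
    return (balanced_perf - imbalanced_perf) / balanced_perf


# Hypothetical AUC values: 0.90 on the balanced and 0.81 on the imbalanced distribution.
loss = relative_loss(balanced_perf=0.90, imbalanced_perf=0.81)
```

Under this reading, a loss of about 0.10 means that roughly 10% of the balanced-distribution AUC is lost, and a negative value would mean that the imbalanced distribution performed better than the balanced one.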

Random oversampling, SMOTE, Borderline-SMOTE, and ADASYN were some of the strategies tested. One of the main findings of this research is that for highly imbalanced distributions (10/90, 5/95, 1/99) none of the strategies tested generally succeeded in improving performance. Moreover, MetaCost proved to be the least favorable strategy for improving performance. Lastly, two extensions of the SMOTE algorithm (ADASYN and Borderline-SMOTE) did not prove to be significantly better than the standard algorithm itself.

Another vital contribution on the effectiveness of class-imbalance methods can be found in López et al. (2013). In this study, the authors compare three different types of classifiers (SVM, DT, and K-NN) on sixty-six different datasets using the AUC metric. The study focused on SMOTE and on extensions of the SMOTE algorithm that fall into different categories. The first category involves extensions of the algorithm that include a pre-processing strategy (e.g., SMOTE+ENN, Borderline-SMOTE, safe-level-SMOTE, ADASYN…). The second category involves algorithm-level strategies that are either based on cost-sensitive learning or are ensemble-based (e.g., EasyEnsemble, RUSBoost, and SmoteBagging). Among the strategies that include a pre-processing method, SMOTE and SMOTE+ENN obtained the best results. Borderline-SMOTE and ADASYN also showed excellent performance on average. Among the ensemble strategies, SmoteBagging showed the best results, followed by RUSBoost and EasyEnsemble. One notable limitation of these studies is that they assume that a perfectly balanced distribution (between majority and minority class) is more favorable for performance, which has been shown not to be the case (Weiss and Provost 200); Khoshgoftaar et al., 2007). Another, albeit much more minor, limitation is that they rely on only one metric (AUC) to measure loss in performance.

In fact, there seems to be a lack of agreement on the best way to measure fraud detection performance. Many measures rely on costs that are either transaction-dependent (Elkan, 2001; Bahsen et al., 2013; Bahsen et al., 2015) or class-dependent (Bolton & Hand, 2002; Hand et al., 2008). Alternatively, some literature avoids cost-based measures by implicitly assuming that predictive accuracy is more important for measuring performance (Bhattacharya et al., 2011; Dal Pozzolo et al., 2014). Moreover, cost matrices often cannot be produced, either due to a lack of information or to confidentiality. For this reason, it would be important to formulate a more objective measure of performance.

The degree of class imbalance is measured by the Imbalance Ratio (IR):

$$IR = \frac{N^{-}}{N^{+}} \qquad (3)$$

where N+ denotes the number of positive class cases in the dataset and N- denotes the number of negative class cases in the dataset.
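The imbalance ratio can be computed directly from the class labels. The sketch below assumes the common convention in which the IR is the number of negative (majority) cases divided by the number of positive (minority) cases; this orientation is an assumption, and the function name and label encoding (1 for positive, 0 for negative) are illustrative.

```python
def imbalance_ratio(labels, positive=1):
    """IR = N- / N+ (assumed orientation: negative count over positive count)."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("IR is undefined when there are no positive cases")
    return n_neg / n_pos


# Hypothetical dataset: 10 positive cases and 990 negative cases.
labels = [1] * 10 + [0] * 990
ir = imbalance_ratio(labels)
```

For this toy dataset the ratio is 99, that is, 99 negative cases for every positive case. The value by itself says nothing about whether the imbalance actually degrades a classifier.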

GAP IN LITERATURE: An existing gap in the literature is that there is no metric for determining when the IR represents a severity that is deemed 'harmful.' The IR may signal that class distributions are imbalanced, but this imbalance may not be what is referred to as 'harmful' (Liu, Wu, and Zhou, 2009).

So, there is no metric for gauging a priori whether an imbalance is 'harmful'. However, there are ways to determine whether an imbalance is 'harmful' after the fact. For example, it can be checked using class-imbalance learning methods (such as EasyEnsemble). If the class-imbalance learning method has no effect on performance (or degrades it), then the imbalance is not considered to be harmful. Unfortunately, there is no way to know this without testing, which costs computational resources and time. As described in Liu, Wu, and Zhou (2009), some classification tasks have imbalanced class distributions, but the imbalance is not severe enough to warrant the use of methods that are specifically designed to deal with the class-imbalance problem. For tasks that do not suffer from the class-imbalance problem, boosting and bagging techniques on DT can often significantly improve performance; for tasks that do suffer from class imbalance, AdaBoost and Bagging will either have no effect on or deteriorate the performance of DT (Liu, Wu, and Zhou, 2009). This is one existing way the authors test for the presence of a 'harmful' imbalance. The empirical results of Liu, Wu, and Zhou (2009) suggest that for tasks in which ordinary learning methods are able to achieve a high AUC score (for example, above 0.95), class-imbalance learning methods are not helpful. However, when class-imbalance learning methods do improve performance, BalanceCascade and, in particular, EasyEnsemble achieve a higher AUC, F-measure, and G-mean than almost all other class-imbalance learning methods.
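The after-the-fact check described above amounts to comparing the performance of an ordinary learner with that of a class-imbalance learning method on the same task. The sketch below only encodes that comparison; it is not a procedure taken from Liu, Wu, and Zhou (2009). The function name, the `min_gain` tolerance, and the AUC values are hypothetical, and in practice both scores would come from a proper evaluation protocol such as cross-validation.

```python
def imbalance_looks_harmful(baseline_auc, cil_auc, min_gain=0.0):
    """Flag the imbalance as 'harmful' only if the class-imbalance learning
    method improves on the ordinary learner by more than `min_gain`."""
    return (cil_auc - baseline_auc) > min_gain


# Hypothetical task A: the ordinary learner already reaches a high AUC and the
# class-imbalance method brings no gain, so the imbalance is not flagged.
task_a = imbalance_looks_harmful(baseline_auc=0.970, cil_auc=0.968)

# Hypothetical task B: the class-imbalance method clearly helps, so it is flagged.
task_b = imbalance_looks_harmful(baseline_auc=0.70, cil_auc=0.78)
```

The cost noted in the text remains: both models have to be trained and evaluated before the function can be called at all.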

The appropriateness (or lack thereof) of performance metrics for assessing imbalanced classification problems is a widely studied area of the Class Imbalance Problem. Nevertheless, many issues in this area remain inconclusive. For example, the appropriateness of statistical tests and error estimation procedures is still largely unresolved due to a lack of research. These issues require much more research and present an essential challenge to the Class Imbalance Problem (Japkowicz, 2013).

It is well known that traditional performance metrics can lead to sub-optimal classification models in imbalanced domains (He and Garcia 2009; Weiss, 2004; Kubat and Matwin 1997). Traditional performance metrics produce misleading results because they are insensitive to skewed class distributions (Ranawana and Palade 2006; Daskalaki et al. 2006). Therefore, the use of appropriate evaluation metrics is a critical aspect of classification tasks in imbalanced domains. An appropriate metric should be used both to assess the performance of classifiers and to guide their learning process during the learning phase. For binary classification tasks with a negative and a positive class, the results obtained by a classifier can be summarized in a confusion matrix (see Table 1 below). For both the negative and the positive class, the confusion matrix provides:

1. True Positives (TP): The number of positive class instances that were correctly classified as positive;

2. True Negatives (TN): The number of negative class instances that were correctly classified as negative;

3. False Positives (FP): The number of negative class instances that were incorrectly classified as positive;

4. False Negatives (FN): The number of positive class instances that were incorrectly classified as negative.

TABLE 1: CONFUSION MATRIX FOR A BINARY CLASS PROBLEM

| N = sample size | Predicted Positive ($\hat{Y} = +$) | Predicted Negative ($\hat{Y} = -$) | Total |
|---|---|---|---|
| Actual Positive ($Y = +$) | $TP = \sum_{i=1}^{N} I(y_i = +)\, I(\hat{y}_i = +)$ | $FN = N_{positive} - TP$ | $N_{positive} = \sum_{i=1}^{N} I(y_i = +)$ |
| Actual Negative ($Y = -$) | $FP = N_{negative} - TN$ | $TN = \sum_{i=1}^{N} I(y_i = -)\, I(\hat{y}_i = -)$ | $N_{negative} = \sum_{i=1}^{N} I(y_i = -)$ |
| Total | $\sum_{i=1}^{N} I(\hat{y}_i = +)$ | $\sum_{i=1}^{N} I(\hat{y}_i = -)$ | $N$ |

Accuracy (see Equation 4) and its complement, the error rate, are the most frequently used metrics for assessing the performance of classifiers in classification domains that do not suffer from the class imbalance problem.

$$Accuracy = \frac{TP + TN}{TP + FN + TN + FP} \qquad (4)$$

However, accuracy is biased towards the majority class and is unsuitable for assessing imbalanced problems. For example, if only 1% of the instances in the dataset belong to the minority class, an accuracy of 99% can be achieved by simply predicting the majority class for every instance, without identifying a single minority class instance. Consequently, when the objective is to predict rare class instances, this measure is not very useful.

We can derive other metrics from the confusion matrix that are more suitable for imbalanced problems. For example:

5. Recall or Sensitivity, True Positive Rate (TPR): $TP_{rate} = \frac{TP}{TP + FN}$

6. Specificity, True Negative Rate (TNR): $TN_{rate} = \frac{TN}{TN + FP}$

7. False Positive Rate (FPR): $FP_{rate} = \frac{FP}{TN + FP}$

8. False Negative Rate (FNR): $FN_{rate} = \frac{FN}{TP + FN}$

9. Precision, Positive Predictive Value (PPV): $PP_{value} = \frac{TP}{TP + FP}$

10. Negative Predictive Value (NPV): $NP_{value} = \frac{TN}{TN + FN}$
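The confusion-matrix counts and the rates derived from them can be computed in a few lines. The sketch below is illustrative only: the function names, the label encoding (1 for the positive class, 0 for the negative class), and the convention of returning NaN for an undefined ratio are assumptions. The toy data reproduces the 1% minority scenario with a classifier that always predicts the majority class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a binary problem."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1   # positive instance, predicted positive
            else:
                fn += 1   # positive instance, predicted negative
        else:
            if predicted == positive:
                fp += 1   # negative instance, predicted positive
            else:
                tn += 1   # negative instance, predicted negative
    return tp, fn, fp, tn


def safe_div(numerator, denominator):
    """Return NaN instead of raising when a rate is undefined (assumed convention)."""
    return numerator / denominator if denominator else float("nan")


def rates(tp, fn, fp, tn):
    """Accuracy plus the six rates listed above."""
    return {
        "accuracy": safe_div(tp + tn, tp + fn + tn + fp),
        "tpr": safe_div(tp, tp + fn),   # recall / sensitivity
        "tnr": safe_div(tn, tn + fp),   # specificity
        "fpr": safe_div(fp, tn + fp),
        "fnr": safe_div(fn, tp + fn),
        "ppv": safe_div(tp, tp + fp),   # precision
        "npv": safe_div(tn, tn + fn),
    }


# Hypothetical data: 1% minority class, classifier always predicts the majority class.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
metrics = rates(tp, fn, fp, tn)
```

Here accuracy is 0.99 while the true positive rate is 0, which is the situation the text warns about. Precision is undefined because no instance is predicted positive.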

For evaluating classifiers in imbalanced domains, other classification metrics have also been introduced, such as the F-measure (Rijsbergen, 1979), the geometric mean (Kubat et al., 1998), and the Receiver Operating Characteristic (ROC) curve (Egan, 1975).

F1-Score: This metric is defined as the harmonic mean of precision and recall.

$$F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \qquad (5)$$

where Precision and Recall are defined as follows:

$$Precision = \frac{TP}{TP + FP} \quad \text{and} \quad Recall = \frac{TP}{TP + FN} \qquad (6)$$
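A short sketch of the F1 computation from confusion-matrix counts is given below. The counts are hypothetical, and returning 0.0 when precision or recall is undefined is an assumed convention rather than part of the definition.

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # assumed: 0.0 when undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # assumed: 0.0 when undefined
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 8 / 10 = 0.8 and recall = 8 / 16 = 0.5.
f1 = f1_score(tp=8, fp=2, fn=8)
```

Because the harmonic mean is pulled towards the smaller of the two values, the result (about 0.615) sits closer to the recall of 0.5 than the arithmetic mean of 0.65 would. True negatives do not enter the formula at all.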

The geometric mean (G-mean): This metric was developed specifically for imbalanced domains. It combines the accuracies obtained on each of the two classes, so that it is maximized only when both accuracies are high and well balanced. However, equal importance is attributed to both classes under this formulation. There is another formulation of the G-mean that attributes higher importance to the positive class. In this alternative formulation, specificity is replaced by precision.
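No equation for the G-mean appears in the text above, so the sketch below assumes the standard definition, the square root of sensitivity times specificity, together with the alternative described above in which specificity is replaced by precision. The function names, the counts, and the convention of returning 0.0 for an undefined rate are illustrative assumptions.

```python
import math


def g_mean(tp, fn, fp, tn):
    """Assumed standard form: sqrt(sensitivity * specificity)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def g_mean_precision(tp, fn, fp):
    """Alternative form: sqrt(sensitivity * precision), favouring the positive class."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return math.sqrt(sensitivity * precision)


# Hypothetical counts: 10 positive and 90 negative instances.
gm = g_mean(tp=8, fn=2, fp=10, tn=80)
gm_prec = g_mean_precision(tp=8, fn=2, fp=10)
```

With these counts the alternative form is noticeably lower, because the ten false positives barely affect specificity (80/90) but more than halve precision (8/18).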

The area under the receiver operating characteristic curve (AUROC), or AUC for short, is a metric that has become predominant for class-imbalance problems (Fawcett, 2003). For example, Dal Pozzolo (2014) suggests an AUC estimate based on the Mann-Whitney (Wilcoxon) statistic:

$$AUC = \frac{1 + TP_{rate} - FP_{rate}}{2} \qquad (8)$$
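The sketch below shows two ways of obtaining an AUC value. The first evaluates Equation 8, read here as (1 + TP_rate - FP_rate) / 2, which depends on a single operating point. The second is the textbook pairwise Mann-Whitney estimate computed from classifier scores; it is included for comparison and is not taken from the cited source. The function names and all numeric values are hypothetical.

```python
def auc_single_point(tp_rate, fp_rate):
    """AUC from one (TP_rate, FP_rate) pair, following the reading of Equation 8."""
    return (1 + tp_rate - fp_rate) / 2


def auc_mann_whitney(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive instance
    receives the higher score; ties count as half a win."""
    if not scores_pos or not scores_neg:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for s_pos in scores_pos:
        for s_neg in scores_neg:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


auc_point = auc_single_point(tp_rate=0.8, fp_rate=0.1)

# Hypothetical scores: 11 of the 12 positive-negative pairs are ranked correctly.
auc_rank = auc_mann_whitney([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
```

The pairwise estimate is quadratic in the number of instances, which is acceptable for a sketch but would normally be replaced by a rank-based computation on large datasets.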

The AUC-score characterizes the area under the curve of sensitivities of the classifier that is plotted against the corresponding false-positive rate at various different
