
2.2 Major issues in data mining

2.2.3 Imbalanced datasets

Another major issue in data mining is dealing with imbalanced datasets. These are datasets in which instances of the majority class significantly outnumber those of the minority class. This problem is called class imbalance, and it is present in many real-world datasets. It often occurs in customer-related data, churn prediction, medical diagnosis, text categorisation, and fraud detection, where the class of interest is the minority class. Class imbalance is a major challenge in data mining because, in many applications, the cost of misclassifying the minority class is high. Examples include direct marketing, where businesses are interested in identifying potential buyers, and charitable fundraising, where the aim is to identify potential donors.
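To make the notion of imbalance concrete, the following minimal sketch computes the class distribution and the ratio of majority to minority instances. The function name `imbalance_ratio` and the churn labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the class counts and the majority-to-minority ratio."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority


# Hypothetical churn labels: 950 customers stay, 50 churn.
labels = ["stay"] * 950 + ["churn"] * 50
counts, ratio = imbalance_ratio(labels)
print(counts["stay"], counts["churn"], ratio)
```

In this illustration the majority class is 19 times larger than the minority class, although the minority class (churners) is the class of interest.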

1 The dataset can be found in the UCI machine learning repository https://archive.ics.

The class imbalance issue has received attention in the literature (Ling and Li, 1998; Chawla et al., 2002; Chawla, 2005; Chawla, 2009; Batuwita and Palade, 2013). This is because data mining models tend to be influenced by the majority class. As a result, the minority class is often misclassified, leading to poor performance and low predictive accuracy on that class. In an even worse scenario, minority examples are treated as outliers of the majority class and are therefore ignored in the learning process. Moreover, most learning algorithms aim to maximise overall accuracy, which implicitly assumes that the class distribution is balanced and that the misclassification cost is the same for all classes. However, this is not always the case (Thai-Nghe, Gantner, and Schmidt-Thieme, 2010).
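The bias toward the majority class can be illustrated with a small worked example. The sketch below assumes a hypothetical fraud dataset with 990 legitimate and 10 fraudulent transactions and a trivial classifier that always predicts the majority class. The helper functions `accuracy` and `recall` are written for this illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` examples that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical data: 0 = legitimate (majority), 1 = fraudulent (minority).
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a classifier that always predicts the majority class

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
print(acc, minority_recall)
```

Under these assumptions the classifier reaches 99% accuracy while detecting none of the fraudulent transactions, which is why accuracy alone can be misleading on imbalanced data.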

As previously mentioned, the imbalance between the classes is a major challenge when learning from imbalanced datasets. In some cases, however, classification algorithms are able to learn from an imbalanced dataset and still achieve good accuracy. This indicates that the class imbalance ratio is not the only factor that affects classifier performance, and that more work is needed to analyse the other factors involved (Japkowicz and Shah, 2011). One factor is the sparsity of the minority class, that is, the distribution of the data within the minority class itself. Jo and Japkowicz (2004) found that the main issue affecting classifier performance on imbalanced datasets is the presence of small disjuncts within the dataset. They found that addressing the small disjuncts problem improved performance more than focusing on the imbalance ratio alone. Another important factor is class overlapping. Stefanowski (2013) studied the effect of overlapping, along with other factors such as the sparsity of the minority class. The experimental results indicate that both class decomposition and class overlapping cause difficulty when learning from imbalanced datasets. Napierala and Stefanowski (2016) suggested that analysing the local characteristics of the minority examples and identifying their types are important steps when learning from imbalanced datasets. They defined four types of minority class example: safe, borderline, rare, and outlier, of which the last three are classified as unsafe.
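The four types of minority example can be sketched by inspecting the neighbourhood of each minority instance. The text above does not state the labelling rule, so the thresholds below are an assumption: a 5-nearest-neighbour scheme in which 4 to 5 same-class neighbours indicate a safe example, 2 to 3 a borderline example, 1 a rare example, and 0 an outlier. The function name `label_minority_types` and the toy points are hypothetical.

```python
import math


def label_minority_types(points, labels, minority, k=5):
    """Label each minority example by the number of its k nearest
    neighbours that share its class (thresholds assume k=5)."""
    types = {}
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        # Euclidean distances to every other example, nearest first.
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        same = sum(labels[j] == minority for _, j in neighbours)
        if same >= 4:
            types[i] = "safe"
        elif same >= 2:
            types[i] = "borderline"
        elif same == 1:
            types[i] = "rare"
        else:
            types[i] = "outlier"
    return types


# Toy data: a compact minority cluster, one minority point placed inside
# the majority region, and a block of majority points.
minority_cluster = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0)]
isolated_minority = [(11, 10.5)]
majority_block = [(10, 10), (10, 11), (11, 10), (11, 11),
                  (10, 12), (12, 10), (11, 12), (12, 11)]

points = minority_cluster + isolated_minority + majority_block
labels = [1] * 7 + [0] * 8
types = label_minority_types(points, labels, minority=1)
print(types)
```

In this toy setting the clustered minority points are labelled safe, whereas the minority point surrounded by majority examples is labelled an outlier.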

Various solutions have been suggested to overcome the class imbalance problem (Chawla et al., 2002; Han, Wang, and Mao, 2005; Kubat and Matwin, 1997; Berson, Smith, and Thearling, 2000; Chawla, 2005). They follow three main approaches, applied at the data, algorithmic, or hybrid level. At the data level, the solutions apply various sampling techniques to balance the dataset. At the algorithmic level, the solutions modify existing learning algorithms to overcome the bias toward the majority class, adapting them so that they can learn from datasets with a skewed class distribution. Hybrid solutions combine the data-level and algorithmic-level approaches. Although awareness of the importance of imbalanced data and of the available solutions has increased, many of the key issues remain open and arise more frequently when dealing with big data. More work is needed to analyse the dataset factors that affect learning from imbalanced datasets and to investigate the possibility of combining various techniques to overcome the class imbalance problem.
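The data-level and algorithmic-level approaches can each be illustrated with a minimal sketch. Random oversampling is used here as one simple sampling technique, and inverse-frequency class weights as one simple way of adapting a learning algorithm. Both choices are assumptions made for illustration and are not singled out by the text above. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Data level: duplicate randomly chosen examples of each smaller
    class until every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def inverse_frequency_weights(labels):
    """Algorithmic level: weight each class by
    n_samples / (n_classes * class_count), so rarer classes weigh more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Toy data: 90 majority (0) and 10 minority (1) examples.
rows = [[i] for i in range(100)]
labels = [0] * 90 + [1] * 10

bal_rows, bal_labels = random_oversample(rows, labels)
weights = inverse_frequency_weights(labels)
print(Counter(bal_labels), weights)
```

A hybrid solution in the sense described above would combine the two, for example by partially resampling the data and then training a cost-sensitive learner with such weights.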

This research will specifically address the problem of class imbalance in classification. The remainder of this chapter is organised as follows. Section 2.3 explains the problem and reviews the solutions suggested in the literature. It also explores the possibility of applying variations of those solutions to imbalanced datasets and highlights different perspectives on the class imbalance problem, including how the choice of approach depends on the application domain. It also briefly presents combined solutions, which either use several sampling techniques together or join sampling techniques with algorithmic solutions to increase the chances of accurate classification. Section 2.4 explains feature selection as a solution to the class imbalance problem and describes the main feature selection methods. Section 2.5 defines the performance metrics used to evaluate classification models on imbalanced datasets.