Predictive Data Mining for Highly Imbalanced Classification

(1)

International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 2, Issue 12, December 2012)

139

Predictive Data Mining for Highly Imbalanced Classification

Madhuri Agrawal

1

, Gajendra Singh

2

, Ravindra Kumar Gupta

3

1_{M.Tech (Computer Science) Scholar Student, Sri Satya Sai Institute of Science & Technology, RGPV, BHOPAL,}

Abstract— The paper addresses some theoretical and practical aspects of data mining, focusing on predictive data mining, where two central types of prediction problems are

discussed: classification and regression. Further accent is

made on predictive data mining, where the time-stamped data greatly increase the dimensions and complexity of problem solving. The main goal is through processing of data (records from the past) to describe the underlying dynamics of the complex systems and predict its future.

Traditional classification algorithms can be limited in their performance on highly imbalanced datasets. A popular stream of work for countering the problem of class imbalance has been application of a sundry of sampling strategies. In this work, we focus on the problem of class imbalance. We incorporate different “rebalance” heuristics.

Keywords— Data Mining, Imbalanced data sets, Logistic

regression, Predictive data mining, Sampling.

I. INTRODUCTION

Solution of the problems concerning water resources and environment today depends on large number of data sources and knowledge corpuses. Many relevant sources of data, structured observations and scientific information related to water resources and environmental processes currently exist, varying in both size and scope. The large potentials in the existing data banks need to be explored in order to transform these data/observables into valuable engineering information and knowledge. The key to these potentials can be found in data mining as a new emergent filed in hydro informatics.

Mining highly unbalanced datasets, particularly in a cost sensitive environment, is among the leading challenges for knowledge discovery and data mining. The class imbalance problem arises when the class of interest is relatively rare as compared to the other class(es). Without loss of generality we will assume that the positive class (or class of interest) is the minority class, and the negative class is the majority class. Various applications demonstrate this

characteristic of high class imbalance, such as

bioinformatics, e-business, information security, to national security. For example, in the medical domain the disease may be rarer than normal cases; in business the defaults may be rarer than good customers, etc.

Sampling strategies, such as oversampling and undersampling, are extremely popular in tackling the problem of class imbalance.

That is, either the minority class is oversampled or majority class is under sampled or some combination of the two is deployed.

II. DATA MINING

Data mining can be defined as a process of discovering new, interesting knowledge, such as patterns, associations, rules, changes, anomalies and significant structures from large amounts of data stored in data banks and other information repositories. It is currently regarded as the key element of a much more elaborate process called Knowledge Discovery in Databases (KDD). In general, a knowledge discovery process consists of an iterative sequence of the following steps (see Figure 1):

1. Data selection, where data relevant to analysis task are retrieved from database;

2. Data cleaning, which handles noisy, erogenous, missing or irrelevant data;

3. Data integration (enrichment), where multiple

heterogeneous data may be integrated into one;

4. Data transformation (coding), where data are

transformed or consolidated into forms appropriate for different mining algorithms;

5. Data mining, which is an essential process where

intelligent methods are applied in order to extract hidden and valuable knowledge from data;

6. Knowledge representation, where visualization and

(2)

International Journal of Emerging Technology and Advanced Engineering

140 Data mining is based on the results achieved in database systems, statistics, machine learning, statistical learning theory, chaos theory, pattern recognition, neural networks, probabilistic graph theory, fuzzy logic and genetic algorithms. A large set of data analysis methods have been developed in statistics over many years of studies. Machine learning and statistical learning theory have contributed significantly to classification and induction problems. Neural networks have shown their effectiveness in classification, prediction, and clustering analysis tasks. One can say that there is no one specific technique that characterizes data mining. Any technique that helps to extract more out of the data sets in an autonomous and intelligent way may be classified as a data mining technique. Therefore data mining techniques form a quite heterogeneous group.

Data mining goals, operations and techniques

In general, data mining tasks can be classified into two categories:

Description: finding human-interpretable patterns,

associations or correlations describing the data.

Prediction: constructing one or more sets of data models

(rule set, decision tree, neural nets, support vectors), performing inference on the available set of data, and attempting to predict the behaviour of new data sets. The distinction between description and prediction is not very sharp. Predictive models can also be descriptive (to the degree that they are understandable), and descriptive models can be used for prediction. To achieve these goals, the categories of prediction as well as description are associated with the five basic operations, as presented in Figure 2.

While there are only a couple of basic data mining operations there is a wide variety of data mining techniques which make these operations possible. Data mining systems normally do not include each of these techniques, but they often combine two or more different techniques between which the user/engineer can choose – depending on the specific problem.

Therefore, potential users should survey the most common techniques, in order to decide which one will fit

their engineering needs best. Figure 3 presents some

common techniques assigned to the basic data mining operations, emphasizing the classification and regression problems.

The two central types of engineering prediction

problems are classification and regression.

Samples/observables of past experience with known attributes (features) are examined and generalized to future cases. Classification is closely coupled with clustering which is to identify clusters embedded in the multi-dimensional data space, where a cluster is a collection of data objects (groups of data) that are "similar" to one another. Similarity usually is expressed as different distance functions. Various approaches have been proposed in the literature for developing classifiers by means of clustering, which can be summarized as:

(i) Iterative clustering, (ii) Agglomerative hierarchical clustering and (iii) Divisive hierarchical clustering.

(3)

International Journal of Emerging Technology and Advanced Engineering

141 However, we wish to emphasize that unsupervised classification task should usually come together with the background knowledge provided from the domain experts.

The problem of regression is very similar to the problem of classification. It is usually described as a process of induction of the data model of the system (using some machine learning algorithm) that will be capable of predicting responses of the system that have yet to be observed. For regression the response of the system is usually a real value, while for classification is the class label(s). Time series prediction is a specialized type of regression (or occasionally classification) problem, where measurements/observables are taken over time for the same features.

From a predictive data-mining perspective, the time-stamped data greatly increase the dimensions of problem solving in a completely different direction. Instead of cases with one measured value for each feature, cases have the same feature measured at different time. To overcome this problem, raw time-dependent data are usually transformed for predictive data mining into lesser dimensional data space using transformations such as Vector Quantization and state-space methods (Tsonis, 1992) or simple averaging and re-sampling methods are applied.

The main goal of this work is to demonstrate the applicability of some predictive data mining techniques for classification and regression engineering problems.

III. METRICS FOR MEASURING PREDICTIVE MODEL

PERFORMANCE

The different types of errors and hits performed by a classifier can be summarized in a matrix. Table 1 illustrates a matrix for a two-class problem, with classes labeled positive and negative.

From such matrix it is possible to extract a number of metrics to measure the performance of learning systems, such as

Error rate (E) = (c+b) /(a+b+c+d)

Accuracy (Acc) = (a+d) /(a+b+c+d) = 1 - E.

The error rate (E) and the accuracy (Acc) are widely used metrics for measuring the performance of learning systems However, when the prior probabilities of the classes are very different, such metrics might be misleading.

For instance, it is straightforward to create a classifier having 99% accuracy (or 1% error rate) if the data set has a majority class with 99% of the total number of cases, by simply labeling every new case as belonging to the majority class.

Following is the proposed the metrics using Table 1. – False negative rate: FN = b / (a+b) is the percentage of positive cases misclassified as belonging to the negative class;

– False positive rate: FP = c / (c+d) is the percentage of negative cases misclassified as belonging to the positive class;

– True negative rate: TN = d / (c+d) = 1 - FP is the percentage of negative cases correctly classified as belonging to the negative class;

– True positive rate: TP = a / (a+b) = 1 - FN is the percentage of positive cases correctly classified as belonging to the positive class;

These four class performance measures have the advantage of being independent of class costs and prior probabilities. It is obvious that the main objective of a classifier is to minimize the false positive and negative rates or, similarly, to maximize the true negative and positive rates. Unfortunately, for most “real world” applications, there is a tradeoff between FN and FP and, similarly, between TN and TP.

IV. METRICS FOR IMBALANCED CLASSIFICATION

Many metrics have been used for effectiveness evaluation on imbalanced classification. All of them are based on the matrix as shown at Table I. With highly skewed data distribution, the overall accuracy metric at (1) is not sufficient any more. For example, a naive classifier that predicts all samples as negative has high accuracy. However, it is totally useless to detect rare positive samples. To deal with class imbalance, two kinds of metrics have been proposed.

(4)

International Journal of Emerging Technology and Advanced Engineering

142 On the other hand, sometimes we are interested in highly effective detection ability for only one class. For example, for credit card fraud detection problem, the target is detecting fraudulent transactions. For diagnosing a rare disease, what we are especially interested in is to find patients with this disease. For such problems, another pair of metrics, precision at (5) and recall at (6), is often adopted. Notice that recall is the same as sensitivity. F-Measure at (7) is used to integrate precision and recall into a single metric for convenience .

Previous Methods for Imbalanced Classification

Many methods have been proposed for imbalanced classification, and some good results have been reported. These methods can be categorized into three different categories: cost-sensitive learning, oversampling the minority class or undersampling the majority class. However, different measures have been used by different authors, which make comparisons difficult.

V. GSVM ALGORITHM

Granular computing represents information in the form of some aggregates (called information granules) such as subsets, subspaces, classes, or clusters of a universe. It then solves the targeted problem in each information granule . There are two principles in granular computing. The first principle is divide-and-conquer to split a huge problem into a sequence of granules (granule split); The second principle is data cleaning to define the suitable size for one granule to comprehend the problem at hand without getting buried in unnecessary details (granule shrink). As opposed to traditional data-oriented numeric computing, granular computing is knowledge-oriented. By embedding prior knowledge or prior assumptions into the granulation process for data modeling, better classification can be obtained. A granular computing-based learning framework called Granular Support Vector Machines.

GSVM combines the principles from statistical learning theory and granular computing theory in a systematic and formal way. GSVM extracts a sequence of information granules with granule split and/or granule shrink, and then builds SVMs on some of these granules if necessary. The main potential advantages of GSVM are:

• GSVM is more sensitive to the inherent data distribution. Hence, it may improve classification effectiveness.

• GSVM may speed up the classification process . As a result, it is more efficient and scalable on huge datasets.

SVM assumes that only SVs are informative to classification and other samples can be safely removed. However, for highly imbalanced classification, the majority class pushes the ideal decision boundary toward the minority class .As demonstrated in Fig. 1, negative SVs (the circled minus signs) that are close to the learned boundary may not be the most informative or even noisy. Some informative samples may hide behind them. To find these informative samples, we can conduct cost-sensitive learning or oversampling. However, these two “rebalance” strategies increase the number of SVs, and hence slow down the classification process.

To improve efficiency, it is natural to decrease the size of the training dataset. In this sense, undersampling is by nature more suitable to model an SVM for imbalanced classification than other approaches. However, elimination of some samples from the training dataset may have two effects:

• Information loss: due to the elimination of informative or useful samples, classification effectiveness is deteriorated; • Data cleaning: because of the elimination of irrelevant or

redundant or even noisy samples, classification

(5)

International Journal of Emerging Technology and Advanced Engineering

143 For a highly imbalanced dataset, there may be many

redundant or noisy negative samples. Random

undersampling is a common undersampling approach for rebalancing the dataset to achieve better data distribution. However, random undersampling suffers from information loss.

The well-known fact about SVM - only SVs are necessary and other samples can be safely removed without affecting classification. This fact motivates us to explore the possibility to utilize SVM for data cleaning/ undersampling. However, due to highly skewed data distribution, the SVM modeled on the original training dataset is prone to classify every sample to be negative. As a result, a single SVM cannot guarantee to extract all informative samples as SVs. An aggregation operation is then executed to selectively aggregate the samples in the negative information granules with all positive samples to complete the undersampling process. Finally, an SVM is modeled on the aggregated dataset for classification.

VI. CONCLUSION

In this work we discussed and demonstrated a useful approach of the predictive data mining techniques focusing on two types of engineering problem solving classification

and regression. When dealing with large amount of data

and when classes have to be discovered. Its simple nature and the probability theory background make this approach powerful data mining tool, especially when combined with the domain knowledge.

The Granular Support Vector Machines Algorithm implements a guided repetitive undersampling strategy to “rebalance” the dataset at hand. It is effective due to extraction of informative samples that are essential for classification and elimination of a large amount of redundant or even noisy samples.

REFERENCES

[1 ] N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intelligent Data Analysis, vol.6, no. 5, pp. 429– 449,2002.

[2 ] N. V. Chawla, N. Japkowicz, and A. Kotcz, “Editorial: special issue on learning from imbalanced data sets,” SIGKDD Explorations, vol. 6,no. 1, pp. 1–6, 2004.

[3 ] Yuchun Tang, , Yan-Qing Zhang, Nitesh V. Chawla, and Sven Krasser,. “ SVMs Modeling for Highly Imbalanced Classification”, Journal of LATEX Class files, vol. 1 , no.11, Nov 2002.

[4 ] Slavco Velickov and Dimitri Solomatine. 2000 predictive Data Mining : Practical examples. Proceedings of the second Joint Workshop on AI, Cottbus, Germany.

[5 ] Bharatheesh, T.L.. and Iyengar, S.S.. 2003 Predictive data mining for delinquency modeling, Louisiana, USA.

[6 ] Abbot, M.B. (1996). “The sociotechnical dimension of hydroinformatics.” Proceedings of the second International Conference on Hydroinformatics. Zurich, Switzerland.

[7 ] R. Akbani, S. Kwek, and N. Japkowicz, “Applying support vector machines to imbalanced datasets,” in Proc. of the 15th European Conference on Machine Learning (ECML 2004), pp. 39–50, 2004. [8 ] G. Wu and E. Y. Chang, “KBA: Kernel boundary alignment

considering imbalanced data distribution,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 786–795, 2005. [9 ] B. Raskutti and A. Kowalczyk, “Extreme re-balancing for SVMs: a