A NORMALIZED MEASURE FOR ESTIMATING CLASSIFICATION RULES FOR MULTI-CLASS IMBALANCED DATASETS

Sireesha Rodda

CSE Department, GITAM Institute of Technology, GITAM University Visakhapatnam, Andhra Pradesh, INDIA

Prof. Shashi Mogalla

CSSE Department, College of Engineering, Andhra University, Visakhapatnam, Andhra Pradesh, INDIA

Abstract

Most classification techniques proposed for multi-class datasets assume that the data are roughly balanced. However, in many real-world datasets the class distribution is imbalanced: some classes have far fewer training examples than others. Most classification techniques that address imbalanced data consider only binary-class datasets. In this paper, we present our research on learning from multi-class imbalanced datasets without breaking them down into a series of binary-class datasets. The paper also discusses the normalized strength score, a measure for estimating the quality of the classification rules obtained.

Keywords: Associative Classification, Imbalanced Data distribution, Skewed Data Classification.

I. INTRODUCTION

The area of classification has received a lot of attention in the data mining field. Classification aims at predicting the class label of an unseen instance given its description in terms of a set of attributes. A classification model is first developed based on the examples in the training set. Each training example is described in terms of a set of attributes and their corresponding values, along with the class to which it belongs. The proposed framework uses the associative classification approach to handle multi-class datasets. Association Mining (AM) [1] discovers patterns that occur frequently in the given training dataset based on a minimum support threshold. A classifier based on Association Mining makes use of the frequent patterns extracted from the training set. Associative Classifiers use the Class Association Rules (CARs) generated as an end product of Association Mining to predict the class label of an unseen instance. A CAR is of the form A → X, where A is a frequent pattern describing a set of predicates in terms of attribute-value pairs and X is a class label. Many studies [2,3,4,5,6] show that Associative Classifiers are more accurate than traditional classification approaches such as decision trees [7], genetic algorithms [8], and neural networks [9]. Classification Based on Association (CBA) [10] is a popular classifier based on Association Rule Mining.

the predictions from the binary classifiers. The measures used in such classifiers are not suitable for imbalanced datasets.

In this research work, a normalized strength score measure for multi-class imbalanced data is proposed to identify interesting patterns. The rest of this paper is organized as follows: Section 2 provides a review of previous work on imbalanced datasets. Section 3 discusses the terminology used in this paper. Section 4 introduces a new measure called Normalized Strength Score and discusses its suitability for multi-class imbalanced datasets. Section 5 presents and analyzes the results obtained while comparing them with the performance of other classifiers. Section 6 presents conclusions.

II. RELATED WORK

Many solutions to the class imbalance problem have already been proposed in the literature. Most of them cater to bi-class problems, in which one class has very few instances whereas the other class has a large number of instances. The reported solutions are developed both at the data level and at the algorithm level [18]. In data level approaches, the class distribution of the training data is modified by re-sampling the data. In algorithm level approaches, the existing classifiers are modified to work well with the minority instances. Cost-sensitive learning solutions have also been proposed; these use either data level or algorithm level approaches and assign high misclassification costs to errors on the minority class, so that minimizing the total cost reduces the errors on minority classes. Some meta-techniques that use the concept of boosting have also been proposed.

In data level approaches, the class distribution of the imbalanced data is re-balanced by re-sampling the data. These solutions include different forms of re-sampling such as random over-sampling, random under-sampling, directed over-sampling, directed under-sampling and combinations of these techniques. In random over-sampling, the minority class instances are replicated in order to achieve a balanced class distribution. In random under-sampling, some majority class instances are eliminated to obtain a balanced class distribution. However, both over-sampling and under-sampling have drawbacks. Over-sampling can cause over-fitting by increasing the importance of some instances. Under-sampling might throw away potentially useful data. Both random over-sampling and random under-sampling are non-heuristic methods.
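The two random re-sampling schemes described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `random_resample`, its `mode` and `seed` parameters, and the choice of balancing every class to the largest (or smallest) class size are assumptions made for the example.

```python
import random
from collections import defaultdict


def random_resample(examples, labels, mode="over", seed=0):
    """Balance classes by random over- or under-sampling (illustrative sketch).

    mode="over"  replicates examples until every class matches the largest class.
    mode="under" discards examples until every class matches the smallest class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)

    sizes = [len(xs) for xs in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            # Under-sampling (or class already at target): keep a random subset.
            chosen = rng.sample(xs, target)
        else:
            # Over-sampling: keep all originals, then add random replicas.
            chosen = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y
```

Note that the over-sampled minority class contains only copies of existing instances, which is the source of the over-fitting risk mentioned above, while the under-sampled majority class loses instances outright.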

Tomek links [19] is a heuristic under-sampling method that removes noisy and border-line majority class instances. Neighborhood Cleaning Rule (NCL) [20] uses Wilson's Edited Nearest Neighbor Rule (ENN) [21] to remove majority class instances. Condensed Nearest Neighbor Rule (CNN) [22] is used to find a consistent subset of examples. Both the Tomek links and the Condensed Nearest Neighbor Rule approaches can be used to remove noisy and border-line majority examples. SMOTE [17] is an over-sampling approach for handling skewed datasets. In this approach, the minority class is over-sampled by generating synthetic examples of the minority class and adding them to the dataset. The new minority class instances are created by interpolating between several minority class instances that lie close together. Using this approach, the problem of over-fitting is avoided. The SMOTE + Tomek links approach [23] combines the SMOTE and Tomek links techniques. Instead of removing only the majority class examples that form Tomek links, examples from both classes are removed. Initially, the dataset is over-sampled using SMOTE, and then the Tomek links are identified and removed, resulting in a balanced class distribution.
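The interpolation idea behind SMOTE can be illustrated with a simplified sketch. The code below is not the published SMOTE algorithm; it only shows how a synthetic point can be placed on the line segment between a minority instance and one of its nearest minority neighbours. The function name `smote_like`, the parameters `n_new`, `k` and `seed`, and the use of Euclidean distance on numeric tuples are assumptions made for the example, and at least two minority instances are required.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority points by interpolation (simplified sketch).

    Each synthetic point lies between a randomly chosen minority point and one
    of its k nearest minority neighbours (Euclidean distance).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic
```

Because every synthetic point is a convex combination of two existing minority points, the new instances are not exact replicas, which is how this family of methods differs from random over-sampling.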

At the algorithm level, the solutions adapt existing classifier learning algorithms to strengthen learning with regard to the minority class. These methods operate on the algorithms rather than on the datasets. Commonly, an appropriate inductive bias is selected. For decision trees, one approach is to adjust the probabilistic estimate at the tree leaf [24]; another approach is to develop new pruning techniques [25]. In the case of SVMs, adaptations such as different penalty constants for different classes [26], or adjusting the class boundary based on kernel alignment [27], have been reported. For imbalanced datasets, the class boundary learned by Support Vector Machines is apt to skew towards the minority class, thus increasing the misclassification rate of the minority class. A decision tree classifier [28] uses Hellinger distance [29] as the splitting criterion instead of the information gain measure, and Hellinger distance was shown to be skew-insensitive.

in the range [0,1]. It has been shown that support and confidence are not suitable metrics for finding interesting classification rules. Classification rules generated using the Strength Score interestingness measure receive a balanced score even when the class distribution of the dataset is skewed. In this work, strength score is used as an interestingness measure to find interesting patterns in an imbalanced dataset.

III. TERMINOLOGY

Let D be the set of transactions, where each transaction T is a set of items drawn from I. A transaction T is said to contain an itemset X if X ⊆ T.

a) Association Rules:

An association rule is of the form X → Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The rule X → Y has support s in the transaction database D if s% of the transactions in D contain X ∪ Y. The rule X → Y has confidence c in the transaction database if c% of the transactions in D that contain X also contain Y. Association Rule Mining is the problem of discovering all the association rules whose support and confidence are greater than the user-specified minimum support and minimum confidence, respectively.
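The definitions of support and confidence can be made concrete with a small sketch. The helper names `support` and `confidence` and the grocery-style transactions are hypothetical and chosen only for illustration; the values are returned as fractions rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Toy transaction database (illustrative data only).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support(transactions, {"bread", "milk"}))       # 0.5 (2 of 4 transactions)
print(confidence(transactions, {"bread"}, {"milk"}))  # about 0.667 (2 of 3)
```

With a minimum support of 0.4 and a minimum confidence of 0.6, for example, the rule {bread} → {milk} would be retained by this toy computation.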

Complement class support (CCS) refers to the support of a given itemset in the classes other than the class in which it was generated.

b) Class Association Rules:

To make association mining suitable for the classification task, the associative classification method focuses on a special type of association rules called Class Association Rules (CARs). CARs are association rules whose consequent is limited to class variables. Hence, only rules of the form A → Ci, where A ⊆ I and Ci is a class, are generated. Before applying Association Mining, the training dataset needs to be preprocessed into the form of a transaction database. Each item is represented as an attribute-value pair, whose presence or absence depends on the value for the training example represented as a transaction.
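The preprocessing step can be sketched as follows, under the assumption that each training example is available as a dictionary of attribute values. The function name `to_transactions`, the `class_attr` parameter and the weather-style records are hypothetical.

```python
def to_transactions(records, class_attr):
    """Convert records (dicts) into (items, class_label) pairs.

    Each item is an (attribute, value) pair; the class attribute is kept
    separately so that it can only appear as the consequent of a rule.
    """
    transactions = []
    for rec in records:
        items = frozenset((a, v) for a, v in rec.items() if a != class_attr)
        transactions.append((items, rec[class_attr]))
    return transactions


# Illustrative training examples (not taken from the paper).
records = [
    {"outlook": "sunny", "windy": "no", "decision": "play"},
    {"outlook": "rainy", "windy": "yes", "decision": "stay"},
]
training = to_transactions(records, class_attr="decision")
```

Numeric attributes would first need to be discretized into ranges before they can be treated as items in this way.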

In the next section, the suitability of strength score parameter with respect to imbalanced class dataset is discussed.

IV. SUITABILITY OF STRENGTH SCORE PARAMETER

Once Class Association Rules (CARs) are generated using the association mining approach, the quality of each classification rule needs to be estimated. In this section, the suitability of the proposed normalized strength score measure is discussed.

The concept of strength score used in this work has been adapted from Arunasalam and Chawla (Arunasalam & Chawla, 2006). The strength score of a rule is used as a measure for estimating the likelihood of the rule. The strength score of a rule Ai → Cj, SS(Ai → Cj), can be evaluated using Eq. (1) given below.

$$\mathrm{SS}(A_i \rightarrow C_j) = \frac{\mathrm{Conf}(A_i \rightarrow C_j) \ast \mathrm{ClSup}(A_i \rightarrow C_j)}{\mathrm{CCS}(A_i \rightarrow C_j)} \qquad \text{Eq. (1)}$$

where Conf(Ai → Cj) represents the confidence of the rule, ClSup(Ai → Cj) represents the class support of the rule and CCS(Ai → Cj) represents the complement class support of the rule.

Class support of a rule, ClSup(Ai → Cj), denotes the percentage of instances in class Cj that contain itemset Ai and is evaluated using Eq. (2).

$$\mathrm{ClSup}(A_i \rightarrow C_j) = \frac{\sigma(A_i \cup C_j)}{\sigma(C_j)} \qquad \text{Eq. (2)}$$

Confidence of a rule, Conf(Ai → Cj), denotes the percentage of instances containing itemset Ai that actually belong to class Cj, and is evaluated using Eq. (3).

$$\mathrm{Conf}(A_i \rightarrow C_j) = \frac{\sigma(A_i \cup C_j)}{\sigma(A_i)} \qquad \text{Eq. (3)}$$

The complement class support measure is used to measure the class support of an itemset in the classes other than Cj. This measure can be evaluated using Eq. (4).

$$\mathrm{CCS}(A_i \rightarrow C_j) = \frac{\sigma(A_i \cup \neg C_j)}{\sigma(\neg C_j)} \qquad \text{Eq. (4)}$$
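The four measures in Eq. (1) to Eq. (4) can be computed directly from counts. The sketch below is a minimal Python illustration, assuming the training data is a list of (items, label) pairs, that the itemset occurs at least once, and that both the class and its complement are non-empty. The function name `rule_measures` and the toy data are hypothetical. The strength score is returned as None when CCS is zero, since Eq. (1) is undefined in that case.

```python
def rule_measures(data, itemset, cls):
    """Return (conf, cl_sup, ccs, ss) for the rule itemset -> cls.

    `data` is a list of (items, label) pairs, with items given as frozensets.
    """
    itemset = frozenset(itemset)
    n_cls = sum(1 for _, y in data if y == cls)                    # sigma(Cj)
    n_other = len(data) - n_cls                                    # sigma(not Cj)
    n_item = sum(1 for x, _ in data if itemset <= x)               # sigma(Ai)
    n_both = sum(1 for x, y in data if itemset <= x and y == cls)  # sigma(Ai u Cj)
    n_item_other = n_item - n_both                                 # sigma(Ai u not Cj)

    conf = n_both / n_item            # Eq. (3)
    cl_sup = n_both / n_cls           # Eq. (2)
    ccs = n_item_other / n_other      # Eq. (4)
    ss = conf * cl_sup / ccs if ccs > 0 else None  # Eq. (1)
    return conf, cl_sup, ccs, ss


# Toy three-class dataset (illustrative only): item "p" occurs in
# 3 of 4 instances of class "a", 1 of 4 of class "b", and 0 of 2 of class "c".
data = (
    [(frozenset({"p"}), "a")] * 3 + [(frozenset({"q"}), "a")]
    + [(frozenset({"p"}), "b")] + [(frozenset({"q"}), "b")] * 3
    + [(frozenset({"q"}), "c")] * 2
)
conf, cl_sup, ccs, ss = rule_measures(data, {"p"}, "a")
# conf = 0.75, ClSup = 0.75, CCS = 1/6, SS = 3.375 (up to floating-point rounding)
```

The example also shows that the raw strength score is not bounded above by 1, and that a rule such as {q} → c would need special handling if its itemset never occurred outside class c.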

For instance, let the dataset contain two classes, viz., C1 and C2. Consider an itemset Ai associated with both the classes.

Case I: If Ai is not equally associated with both the classes

Strength score of the rule Ai → C1 is given by

$$\mathrm{SS}(A_i \rightarrow C_1) = \frac{\mathrm{Conf}(A_i \rightarrow C_1) \ast \mathrm{ClSup}(A_i \rightarrow C_1)}{\mathrm{CCS}(A_i \rightarrow C_1)} = \mathrm{Conf}(A_i \rightarrow C_1) \ast \left( \frac{\mathrm{ClSup}(A_i \rightarrow C_1)}{\mathrm{CCS}(A_i \rightarrow C_1)} \right) \qquad \text{Eq. (5)}$$

The strength score of a rule in a given class can thus be defined as the confidence of the rule in the same class multiplied by the factor ClSup(Ai → C1) / CCS(Ai → C1). If Ai occurs more frequently in class C1 than in class C2, then the confidence of Ai with respect to class C1 is increased by this factor. Otherwise, the confidence is decreased by the same factor.

Case II: If Ai is equally associated with both the classes

Assume that the dataset is balanced in nature, i.e., C1 and C2 are of equal size (|C1| = |C2|).

Then, |Ai ∪ C1| = |Ai ∪ C2|.

As the classes are of the same size, ClSup(Ai → C1) = ClSup(Ai → C2). Also, Conf(Ai → C1) = Conf(Ai → C2) = 0.5.

In a two-class dataset, CCS(Ai → C1) = ClSup(Ai → C2), so the factor ClSup / CCS in Eq. (5) equals 1.0 for both rules. Hence SS(Ai → C1) = SS(Ai → C2) = 0.5.

On the other hand, assume that the dataset is imbalanced in nature. Let C1 be the majority class label and C2 be the minority class label, i.e., |C1| >> |C2|.

If Ai is equally associated with both the classes, then |Ai ∪ C1| > |Ai ∪ C2|.

$$\mathrm{Conf}(A_i \rightarrow C_1) = \frac{|A_i \cup C_1|}{|A_i|} \qquad \text{Eq. (6)}$$

$$\mathrm{Conf}(A_i \rightarrow C_2) = \frac{|A_i \cup C_2|}{|A_i|} \qquad \text{Eq. (7)}$$

By observing Eq. (6) and Eq. (7), it can be deduced that Conf(Ai → C1) > Conf(Ai → C2).

From the above inequality it can be inferred that, for a dataset having an imbalanced class distribution, even though an itemset is equally associated with both the classes, the confidence measure is biased towards the majority class. The strength score measure, on the other hand, provides a balanced measure by multiplying the confidence of the rule Ai → C1 with the factor ClSup(Ai → C1) / CCS(Ai → C1).

Hence, the strength score measure provides an accurate estimate of the strength of a rule regardless of the class distribution in the training dataset. In the case of multi-class datasets, the complement class support of an itemset in a given class is the combined class support of the itemset in all the classes other than the given class. It is straightforward to show that strength score is a proper measure for rules in multi-class datasets as well. In this case, the strength score of an itemset Ai for a given class Cj represents confidence that is proportionately changed according to the support of itemset Ai with respect to Cj against ¬Cj, where ¬Cj represents any class other than Cj.
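Written out for a dataset with classes C1, ..., Cm, the combined class support described above can be expressed as follows. The index k and the class count m are notation introduced here for illustration, and the expression assumes that each instance belongs to exactly one class.

```latex
\mathrm{CCS}(A_i \rightarrow C_j)
  = \frac{\sigma(A_i \cup \neg C_j)}{\sigma(\neg C_j)}
  = \frac{\sum_{k \neq j} \sigma(A_i \cup C_k)}{\sum_{k \neq j} \sigma(C_k)}
```

For m = 2 this reduces to the class support of the itemset in the single remaining class.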

Even though strength score provides an accurate estimate of the quality of classification rules, strength score values are not normalized. Furthermore, the strength score measure encounters a divide-by-zero error whenever the CCS value becomes zero. To avoid these problems, the strength score measure is modified into the Normalized Strength Score measure shown in Eq. (8).

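$$\mathrm{SS}(A_i \rightarrow C_j) = \tanh\left( \frac{\mathrm{Conf}(A_i \rightarrow C_j) \ast \mathrm{ClSup}(A_i \rightarrow C_j)}{\mathrm{CCS}(A_i \rightarrow C_j)} \right) \qquad \text{Eq. (8)}$$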

To scale the strength score value of a rule, tanh transformation (Godfrey, 2009) has been applied. The function

approximately equal to x. In the proposed research work, the tanh function has been used because it applies an approximately linear transformation to small input values while bounding large ones. The resulting normalized strength score values always lie in the range [0, 1], even when the CCS value becomes zero. The densities of the resulting values are also preserved. This makes the tanh function well suited to the classification task.
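A minimal sketch of Eq. (8) is given below. The paper does not spell out how the case CCS = 0 is evaluated, so the code makes an explicit assumption: the score is taken as the limiting value tanh(∞) = 1 when the numerator is positive, and as 0 when the rule has no support at all. The function name `normalized_strength_score` is hypothetical.

```python
import math


def normalized_strength_score(conf, cl_sup, ccs):
    """tanh-normalized strength score following Eq. (8) (illustrative sketch).

    Assumption: when ccs == 0 the raw score is unbounded, so the limit
    tanh(inf) = 1 is returned; a rule with zero confidence or zero class
    support is given 0 instead of evaluating 0/0.
    """
    numerator = conf * cl_sup
    if ccs == 0:
        return 1.0 if numerator > 0 else 0.0
    return math.tanh(numerator / ccs)


print(normalized_strength_score(0.75, 0.75, 1 / 6))  # close to 1 (raw score 3.375)
print(normalized_strength_score(0.1, 0.1, 1.0))      # close to 0.01 (tanh(x) ~ x for small x)
print(normalized_strength_score(0.9, 0.5, 0.0))      # 1.0 under the stated assumption
```

Since confidence, class support and CCS are all non-negative, the argument of tanh is non-negative and the result stays between 0 and 1.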

Once classification rules are generated and their normalized strength score values are evaluated, the best classification rules can be selected to form the classification model. The classification model identifies the classification rules that cover the unseen instance. A voting scheme can then be applied to obtain the class label of the unseen instance.
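One possible way to realize this prediction step is sketched below. The text does not fix a particular voting scheme, so score-weighted voting over the covering rules, the optional `top_k` cut-off and the function name `classify` are all assumptions made for illustration.

```python
from collections import defaultdict


def classify(rules, instance_items, top_k=None):
    """Predict a class label by score-weighted voting (illustrative sketch).

    `rules` is a list of (itemset, class_label, score) triples, where score
    is the normalized strength score of the rule. Returns None when no rule
    covers the instance.
    """
    # A rule covers the instance if all of its items occur in the instance.
    covering = [r for r in rules if r[0] <= instance_items]
    covering.sort(key=lambda r: r[2], reverse=True)
    if top_k is not None:
        covering = covering[:top_k]

    votes = defaultdict(float)
    for _, label, score in covering:
        votes[label] += score
    return max(votes, key=votes.get) if votes else None


# Hypothetical rule set for demonstration.
rules = [
    (frozenset({"p"}), "a", 0.9),
    (frozenset({"q"}), "b", 0.6),
    (frozenset({"p", "q"}), "b", 0.5),
]
print(classify(rules, frozenset({"p", "q"})))  # b (votes: a = 0.9, b = 1.1)
print(classify(rules, frozenset({"p"})))       # a
```

A default class, for example the label of the highest-scoring rule overall, would be needed in practice for instances that no rule covers.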

V. CONCLUSIONS AND FUTURE WORK

This article presents a novel approach for classifying multi-class datasets. A new measure called "Normalized Strength Score" for Associative Classifiers has been proposed, and its suitability has been discussed. The proposed approach differs from the existing approaches in that a multi-class dataset is handled directly, without breaking down the training data into a collection of two-class datasets. The practical working of the approach still needs to be worked out and tested on benchmark datasets.

VI. REFERENCES

[1] R. Agrawal and R. Srikant (1994). Fast algorithms for mining association rules. In Proc. of VLDB'94, Santiago, Chile, Sept. 1994.

[2] Janssens D, Wets G, Brijs T and Vanhoof K (2003). "Integrating classification and association rules by proposing adaptations to CBA". In: Proc. of the 10th International Conference on Recent Advances in Retailing and Services Science, Portland, Oregon.

[3] Li W, Han J, Pei J (2001). "CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules". In: Proc. of the 1st IEEE International Conference on Data Mining, San Jose, California, pp. 369-376.

[4] B. Lent, A. Swami, and J. Widom. Clustering association rules. In ICDE'97, England, April 1997.

[5] G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by aggregating emerging patterns. In DS'99 (LNCS 1721), Japan, Dec. 1999.

[6] K. Wang, S. Zhou, and Y. He. Growing decision tree on support-less association rules. In KDD'00, Boston, MA, Aug. 2000.

[7] Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo.

[8] Sharpe PK, Glover RP (1999) Efficient GA based techniques for classification. Appl Intell 11:277–284.

[9] Kulkarni AD, Cavanaugh CD (2000) Fuzzy neural network models for classification. Appl Intell 12:207–215.

[10] Liu B, Hsu W, Ma Y (1998). "Integrating Classification and Association Rule Mining". In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, pp. 80-86.

[11] N. Japkowicz. Learning from imbalanced data sets: A comparison of various strategies, Learning from imbalanced data sets: The AAAI Workshop 10-15. Menlo Park, CA: AAAI Press. Technical Report WS-00-05, 2000.

[12] N. Chawla, K. Bowyer, L. Hall and W. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357, 2002.

[13] M.A. Maloof . Learning when data sets are Imbalanced and when costs are unequal and unknown, ICML-2003 Workshop on Learning from Imbalanced Data Sets II, 2003.

[14] M. Kubat, R. Holte and S. Matwin. Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning, 30, 195–215, 1998.

[15] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. Proceedings of the Fourteenth International Conference on Machine Learning San Francisco, CA, Morgan Kaufmann, 179- 186,1997

[16] M. Joshi, V. Kumar and R. Agarwal. Evaluating boosting algorithms to classify rare classes: comparison and improvements. Technical Report RC-22147, IBM Research Division, 2001.

[17] N. Chawla, A. Lazarevic, L. Hall and K. Bowyer. SMOTEBoost: improving prediction of the minority class in boosting. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat- Dubrovnik, Croatia , 107-119, 2003.

[18] Gu, Q., Cai, Z., Zhu, L., & Huang, B. Data Mining on Imbalanced Datasets. International Conference on Advanced Computer Theory and Engineering (icacte 2008), (pp. 1020-1024),2008.

[19] Tomek, I. Two Modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, SMC-6, 769-772, 1976.

[20] Laurikkala, J. Improving identification of difficult small classes by balancing class distribution. University of Tampere, 2001.

[21] Wilson, D. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 2 (3), 408-421.

[22] Hart, P. E. (1968). The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory, 14 (3), 515-516.

[23] Batista, G. E. A. P. A., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning data. SIGKDD Explorations, 6 (1), 20-29.

[24] Chawla, N., Japkowicz, N., & Kolcz, A. (2004). Editorial: Special Issue on learning from imbalanced datasets. SIGKDD Explorations, 6 (1), 1-6.

[25] Zadrozny, B., & Elkan, C. Learning and making decisions when costs and probabilities are both unknown. Seventh International Conference on Knowledge Discovery and Data Mining, (pp. 204-213).

[26] Lin, Y., Lee, Y., & Wahba, G. (2002). Support Vector Machines for classification in nonstandard situations. Machine Learning, 46 (3), 191-202.

[27] Wu, G., & Chang, E. Y. (2003). Class-Boundary Alignment for Imbalanced Dataset Learning. ICML'03 Workshop on Learning from Imbalanced Datasets.

[28] Cieslak, D. A., & Chawla, N. V. (2008). Learning Decision Trees for Unbalanced Data. European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), (pp. 241-256). Antwerp, Belgium.

[29] Kailath, T. (1967). The Divergence and Bhattacharyya Distance Measures in Signal Selection. IEEE Transactions on Communications , 15 (1), 52-60.

[30] Arunasalam, B., & Chawla, S. (2006). Parameter-free classification for imbalanced data scoring using complement class support.