A Novel Method for Data Mining and Classification Based on Ensemble Learning

Li Han¹

¹ First Author, Neijiang Normal University, Neijiang, Sichuan 641112, China, E-mail:

[email protected]

Abstract

Data mining has attracted great attention in the information industry, mainly because large volumes of broadly applicable data are now stored and urgently need to be transformed into useful information and knowledge. This paper addresses the classification problem in data mining and designs an ensemble KNN classifier based on distance learning. The classifier first filters out uncorrelated attributes in the data set using information gain, removing redundant attributes with a low degree of correlation. Then, through bagging, the ensemble not only randomly selects training samples by bootstrap sampling but also randomly filters the attributes of each sub-classifier, which increases the differences among sub-classifiers while preserving their accuracy. For distance learning, neighborhood components analysis is used, and the optimization is carried out with leave-one-out cross validation. The experimental data show that the new classifier achieves significantly better classification results than traditional KNN and than a single classifier based on distance learning.

Keywords:

Ensemble Learning, Data Mining, Bagging Method, k-Nearest Neighbor Algorithm

1. Introduction

Driven by technological development and practical needs, database technology is used to store and manage data, and machine learning technology is used to analyze it. As a result, a great deal of knowledge hidden in data has been discovered, and the analysis and organization of these data have formed a research field of wide interest: Knowledge Discovery in Databases (KDD) [1]. KDD generally refers to discovering patterns or relationships in source data. It covers the whole process, from the initial formulation of the business objective to the final analysis of results, whereas data mining describes only the sub-process in which mining algorithms are applied to the data. Recently, however, it has been recognized that much of the work in data mining can be accomplished by statistics, and that the best strategy is to combine statistics organically with data mining [2]. Data mining remains the most critical step in KDD.

Classification is an important branch of data mining. Many methods are applied to classification, such as decision trees, genetic algorithms, Bayesian networks, rough sets, KNN and association rules [3-8]. In the ensemble classifier designed in this paper, the average information gain is calculated during data preprocessing to filter out uncorrelated attributes, the basic bagging method is used to generate sub-classifiers and combine their results, and each sub-classifier is an improved KNN classifier.

As one of the most traditional classification algorithms in data mining, KNN is still widely applied in many areas because of its simplicity, efficiency and nonparametric nature [9]. Several effective improvements have been proposed to address its disadvantages. On the one hand, many algorithms for learning the KNN distance measure have been put forward to overcome the limitation of using a single fixed distance measure. Representative algorithms include MLCC (Metric Learning by Collapsing Class), LMNN (Distance Metric Learning for Large Margin Nearest Neighbor Classification) and NCA (Neighborhood Components Analysis). These three distance learning methods [10-12] apply different mathematical models, and the learned measure shrinks the distance between training samples of the same class while enlarging the distance between samples of different classes. On the other hand, KNN has been combined with other algorithms to improve classification, including combining SVM with KNN, combining a genetic algorithm with fuzzy KNN, and combining a Bayesian classifier with a KNN classifier. Building on these two lines of improvement, this paper combines the Bagging algorithm with a distance-learning KNN classifier and puts forward an ensemble KNN classifier based on distance learning.

2. Related Study

2.1. KNN Classification Algorithm

KNN, which is based on statistics, assigns a test sample to the category held by the majority of its k nearest neighbors in the feature space. The basic method is as follows: all examples are placed in an n-dimensional space. Usually, each example x is expressed as a feature vector {a1(x), a2(x), …, an(x)}, where ar(x) denotes the value of the r-th attribute. The similarity between two examples xi and xj is commonly calculated by the Euclidean distance:

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2} \qquad (1)$$
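As an illustration of the basic method, the following is a minimal Python sketch of KNN with the Euclidean distance of formula (1). The function names `euclidean_distance` and `knn_predict` and the toy data are assumptions made for this example, not part of the original method description.

```python
import numpy as np
from collections import Counter

def euclidean_distance(xi, xj):
    """Formula (1): square root of the summed squared attribute differences."""
    xi = np.asarray(xi, dtype=float)
    xj = np.asarray(xj, dtype=float)
    return float(np.sqrt(np.sum((xi - xj) ** 2)))

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training examples nearest to query."""
    distances = [euclidean_distance(x, query) for x in train_X]
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated classes in two dimensions.
train_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, [0.2, 0.1], k=3))
```

Every attribute contributes equally to the distance here, which is the behavior the weighted variants discussed next are meant to correct.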

KNN is a weak classifier because it is very sensitive to the curse of dimensionality. To avoid the errors introduced by the similarity measure, [13] adds a feature weight for each attribute, reflecting that different attributes have different influence on classification. The improved distance is then calculated as





$$d(x_k, c_i) = \sqrt{\sum_{j=1}^{m} w_j \left( x_{kj} - c_{ij} \right)^2} \qquad (2)$$
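The effect of the feature weights can be sketched in a few lines of Python. This example assumes that each weight wj multiplies the squared difference of attribute j, as in formula (2); the function name `weighted_distance` is hypothetical.

```python
import numpy as np

def weighted_distance(x, c, w):
    """Formula (2) under the stated assumption: one non-negative weight per attribute."""
    x, c, w = (np.asarray(v, dtype=float) for v in (x, c, w))
    return float(np.sqrt(np.sum(w * (x - c) ** 2)))

# With equal weights this is the ordinary Euclidean distance;
# setting a weight to zero removes that attribute from the comparison.
print(weighted_distance([0, 0], [3, 4], [1, 1]))
print(weighted_distance([0, 0], [3, 4], [1, 0]))
```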

A useful way to improve the KNN algorithm is to adjust the weights wj so that the influence of weakly relevant attributes is reduced as far as possible. The similarity between vectors ep and eq in the weighted similarity matrix can be expressed as:

$$\rho^{(w)}_{pq} = \frac{1}{1 + \beta\, d^{(w)}_{pq}} \qquad (3)$$

where d^{(w)}_{pq} is the weighted distance between ep and eq and β is a positive constant. The value ρ^{(w)}_{pq} obtained by optimizing the weights under a given criterion is used as the new similarity. This method helps improve the classification results: it increases the similarity within classes and reduces the similarity between classes.

2.2. Ensemble Learning Algorithms

Ensemble learning is a technique used to improve the accuracy of classification algorithms, and it developed gradually within the field of machine learning. Ensemble learning algorithms mainly include two types of methods: Bagging and Boosting. This paper applies the bagging method to KNN classification because distance learning is needed for each sub-classifier. If a sub-classifier has a large number of attributes and samples, the distance learning process is extremely time-consuming. In bagging, the sub-classifiers are formed independently of one another, which makes it possible to spread the total metric learning workload of the sub-classifiers across multiple threads or parallel machines. Bagging therefore eases part of the computational burden.

Bagging generates classifiers by random sampling with replacement, also called bootstrap sampling. In this method, the differences among ensemble members are obtained by bootstrap resampling, that is, through the randomness and independence of the training samples. Bagging is mainly used for unstable learning algorithms such as neural networks and decision trees. For example, bagging reduces the variance of the base classifiers by voting over their predictions, and thereby reduces the generalization error. For stable learning algorithms, such as the Naive Bayes method, bagging cannot decrease the generalization error.


The algorithm can be described as follows. Let the original training set be D = {(x1, y1), (x2, y2), ..., (xN, yN)}, where N is the number of training samples.

Training phase:

For t = 1, 2, ..., T do (T is the number of individual classifiers in the bagging ensemble)
(1) randomly draw m samples from the training set;
(2) obtain the model ht with the given learning algorithm;
(3) put the drawn samples back into the training set.
Return the collection (h1, h2, ..., hT).

Prediction phase: the ensemble output is the average of the individual outputs,

$$H(x_i) = \frac{1}{T} \sum_{t=1}^{T} h_t(x_i) \qquad (4)$$

or, for classification, the class that receives the most votes,

$$H(x_i) = \arg\max_{y \in Y} \sum_{t=1}^{T} I\left( h_t(x_i) = y \right) \qquad (5)$$

where I(·) equals 1 when its argument is true and 0 otherwise.
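The training and prediction phases described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the base learner `one_nn` (a 1-nearest-neighbor classifier), the function names and the default parameters are hypothetical choices for the example, and the vote follows formula (5).

```python
import numpy as np
from collections import Counter

def one_nn(train_X, train_y):
    """Hypothetical base learner: a 1-nearest-neighbour classifier."""
    def predict(x):
        d = np.sum((train_X - x) ** 2, axis=1)
        return train_y[int(np.argmin(d))]
    return predict

def bagging_fit(X, y, T=11, m=None, learner=one_nn, seed=0):
    """Train T models, each on m samples drawn with replacement."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    m = len(X) if m is None else m
    models = []
    for _ in range(T):
        idx = rng.integers(0, len(X), size=m)  # bootstrap: sampling with replacement
        models.append(learner(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """Majority vote over the individual predictions, as in formula (5)."""
    x = np.asarray(x, dtype=float)
    votes = Counter(h(x) for h in models)
    return votes.most_common(1)[0][0]
```

Because each model is trained independently inside the loop, the loop body is the unit of work that could be distributed across threads or machines.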

3. Ensemble KNN Classifier based on Distance Learning

3.1. Algorithm Process

Suppose a labeled training data set {(xi, yi)}, i = 1, …, n, contains n samples, each sample has d attributes, xi ∈ R^d, and the data set has c classes, yi ∈ {1, 2, …, c}. This paper introduces an ensemble KNN classification algorithm based on distance learning. First, attribute filtering eliminates the attributes whose correlation is below a threshold. Then the bagging algorithm generates the sub-classifiers, each obtained by random attribute selection on the basis of the original data set. Finally, distance learning is carried out for each sub-classifier to compute a distance measurement model specific to that sub-classifier, and the learned models are used to classify the test samples. The algorithm thus mainly consists of initial attribute filtering, the bagging ensemble algorithm and the distance-learning KNN algorithm. The process is shown in Figure 1.

Figure 1. Ensemble KNN classifier based on distance learning method

3.2. Attribute Filtration of Training Dataset

In the initial data set, many attributes are not correlated with the learning target, and they interfere with learning from the correlated attributes. Because input attributes are randomly selected to form the attribute subsets on which the sub-classifiers are trained, the influence of uncorrelated attributes is amplified, which reduces the accuracy of the classifiers. Uncorrelated attributes therefore need to be filtered out before the ensemble is built. This paper adopts a method based on information gain to filter attributes. Specifically, the information gain of every original attribute in the data set is calculated, and any attribute whose information gain is smaller than a threshold f is removed as uncorrelated. Here, one third of the average information gain of the attributes is taken as the threshold f.
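A minimal Python sketch of this filtering step is given below. The text does not say how numeric attributes are discretized before the information gain is computed, so equal-width binning with a hypothetical `bins` parameter is assumed here; the function names are likewise illustrative.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(column, labels, bins=4):
    """Information gain of one numeric attribute after equal-width binning (an assumption)."""
    edges = np.histogram_bin_edges(column, bins=bins)
    binned = np.digitize(column, edges[1:-1])
    gain = entropy(labels)
    for b in np.unique(binned):
        mask = binned == b
        gain -= mask.mean() * entropy(labels[mask])
    return gain

def filter_attributes(X, y, ratio=1.0 / 3.0, bins=4):
    """Keep the attributes whose gain is at least ratio times the average gain."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    gains = np.array([information_gain(X[:, j], y, bins) for j in range(X.shape[1])])
    threshold = ratio * gains.mean()  # threshold f: one third of the average gain
    return np.flatnonzero(gains >= threshold), gains

# Toy data (hypothetical): attribute 0 determines the class, attribute 1 is noise.
X = [[0, 0], [0, 1], [0, 0], [0, 1], [1, 0], [1, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
kept, gains = filter_attributes(X, y)
print(kept, gains)
```

Tying the threshold to the average gain makes the filter relative to each data set, so no absolute cut-off has to be tuned.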

[Figure 1 flowchart, in order: training dataset; attribute filtration, controlled by threshold f; sub-classifiers 1, 2, …, t generated by bootstrap; attributes of each sub-classifier randomly eliminated, with the retained proportion controlled by the attribute disturbance parameter; distance measurement A1, A2, …, At learned for each sub-classifier; classification results S1, S2, …, St; comprehensive classification result by majority voting.]

3.3. Bagging Methods

To build an effective ensemble, each sub-classifier needs to have high accuracy and the differences among the sub-classifiers should be large. KNN is relatively hard to ensemble with the Bagging algorithm. For example, given a data set with N samples, the number of times that sample i is selected in a bootstrap sample approximately follows a Poisson distribution with λ = 1, so the probability that sample i appears at least once is 1 − (1/e) ≈ 0.632. Assume that t sub-classifiers are generated for a two-class problem. The classification result of a test sample changes if and only if its nearest neighbor appears in fewer than t/2 of the bootstrap training sets. This probability clearly keeps decreasing as t increases. Therefore, sub-classifiers that are both accurate and diverse cannot be obtained by the bootstrap method alone, which only randomly selects training samples.
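The two probabilities in this argument can be checked numerically. The sketch below assumes that the bootstrap sets are independent, so that the number of sets containing a given sample is binomial; the function names are hypothetical.

```python
import math

def prob_in_bootstrap(n):
    """Probability that a given sample appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def prob_vote_changes(t, p=1.0 - 1.0 / math.e):
    """Probability that the sample is present in fewer than t/2 of t independent bootstrap sets."""
    return sum(
        math.comb(t, k) * p**k * (1.0 - p) ** (t - k)
        for k in range(t + 1)
        if k < t / 2
    )

print(prob_in_bootstrap(1000))
for t in (1, 11, 51):
    print(t, prob_vote_changes(t))
```

For odd t the second probability shrinks as t grows, which is why bootstrap sampling alone leaves the KNN sub-classifiers nearly identical.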

The bagging method used in this paper to generate the sub-classifiers borrows from FABSIR [14] the idea of adding disturbance to the input attributes. Sub-classifiers are generated as follows. On the one hand, samples are drawn from the original data set by the bootstrap method to form each sub-classifier; on the other hand, the proportion of attributes retained by each sub-classifier is controlled by a parameter, and part of the attributes are eliminated randomly. Assume that t sub-classifiers are generated. During the process, the attributes of each of these t sub-classifiers are randomly eliminated again, and a given parameter s controls how many attributes are eliminated. If d attributes remain after the original data set is filtered, each sub-classifier has d × s attributes. This method improves classification because it increases the differences among the generated sub-classifiers.
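The generation of sub-classifiers with both sample and attribute disturbance can be sketched as follows. The function name `make_subclassifier_views` is hypothetical, and rounding d × s to the nearest integer is an assumption, since the text does not say how a fractional attribute count is handled.

```python
import numpy as np

def make_subclassifier_views(n_samples, n_attributes, t, s, seed=0):
    """Return t (sample_idx, attribute_idx) pairs: bootstrap rows plus a
    random subset of round(d * s) attributes for each sub-classifier."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(n_attributes * s)))  # assumed rounding rule
    views = []
    for _ in range(t):
        sample_idx = rng.integers(0, n_samples, size=n_samples)  # with replacement
        attr_idx = np.sort(rng.choice(n_attributes, size=n_keep, replace=False))
        views.append((sample_idx, attr_idx))
    return views

views = make_subclassifier_views(n_samples=100, n_attributes=10, t=5, s=0.6)
for sample_idx, attr_idx in views:
    print(attr_idx)
```

Each pair selects the training matrix of one sub-classifier, for example `X[sample_idx][:, attr_idx]` for a NumPy array `X`; with s = 1 every attribute is kept and only the samples are perturbed.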

3.4. Distance Measurement Learning

We introduce a distance-learning KNN classifier based on neighborhood components analysis (NCA) to improve the classification accuracy of each sub-classifier; the following formulas present the distance learning algorithm. Suppose that a labeled sub-classifier contains n real input vectors x1, …, xn, xi ∈ R^d, with corresponding class labels c1, …, cn. The aim is to choose the distance measure under which KNN classification performs best. The distance learning method adopts the Mahalanobis distance model and seeks a positive semi-definite matrix Q = A^T A.

It is shown as formula (6):

$$d(x, y) = (x - y)^{T} Q (x - y) = (Ax - Ay)^{T} (Ax - Ay) \qquad (6)$$

The leave-one-out error rate is a discontinuous function of the matrix A, and a very small change in A can greatly affect the leave-one-out result of KNN. The algorithm therefore introduces a differentiable non-linear function: each point i selects another point j as its neighbor with probability Pij and inherits the class label of that neighbor. On the basis of the Mahalanobis distance, the softmax function is used to define Pij as:

$$P_{ij} = \frac{\exp\left( -\|Ax_i - Ax_j\|^2 \right)}{\sum_{k \neq i} \exp\left( -\|Ax_i - Ax_k\|^2 \right)} \qquad (7)$$

Under this random neighbor selection, the probability Pi that point i is classified correctly is

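$$P_i = \sum_{j \in C_i} P_{ij} \qquad (8)$$

where Ci = {j | cj = ci} is the set of points in the same class as point i. The best classification is therefore obtained when the expected number of correctly classified points is as high as possible. We set the target function as: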

$$f(A) = \sum_{i} \sum_{j \in C_i} P_{ij} = \sum_{i} P_i \qquad (9)$$

This is a function of the matrix A. NCA optimizes it by the conjugate gradient method or quadratic programming and obtains the optimal solution A, which yields the optimal learned distance. Differentiating the target function with respect to A and writing xij = xi − xj gives the gradient:

$$\frac{\partial f}{\partial A} = -2A \sum_{i} \sum_{j \in C_i} P_{ij} \left( x_{ij} x_{ij}^{T} - \sum_{k} P_{ik} x_{ik} x_{ik}^{T} \right) \qquad (10)$$

After simplification:

$$\frac{\partial f}{\partial A} = 2A \sum_{i} \left( P_i \sum_{k} P_{ik} x_{ik} x_{ik}^{T} - \sum_{j \in C_i} P_{ij} x_{ij} x_{ij}^{T} \right) \qquad (11)$$

This paper uses the conjugate gradient method described in [15] to maximize the target function in formula (9) and obtain the optimal matrix A. The resulting distance measure d(x, y) = (x − y)^T Q (x − y) = (Ax − Ay)^T (Ax − Ay) is then used as the distance measure of the KNN algorithm for classification.
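The target function and its gradient can be sketched with NumPy as follows. This is an illustration rather than the implementation used in the experiments: the function names are hypothetical, and plain gradient ascent with an assumed step size stands in for the conjugate gradient method of [15].

```python
import numpy as np

def nca_objective_and_gradient(A, X, labels):
    """Return f(A) from formula (9) and its gradient from formula (11)."""
    diffs = X[:, None, :] - X[None, :, :]            # x_ij = x_i - x_j, shape (n, n, d)
    proj = diffs @ A.T                                # A x_ij
    sq = np.sum(proj ** 2, axis=2)                    # ||A x_i - A x_j||^2
    np.fill_diagonal(sq, np.inf)                      # a point never selects itself
    sq_shift = sq - sq.min(axis=1, keepdims=True)     # shift for numerical stability
    expo = np.exp(-sq_shift)
    P = expo / expo.sum(axis=1, keepdims=True)        # formula (7)
    same = labels[:, None] == labels[None, :]         # membership in C_i
    p_i = np.sum(P * same, axis=1)                    # formula (8)
    f = float(p_i.sum())                              # formula (9)
    # Weights of the outer products x_ij x_ij^T in formula (11).
    W = p_i[:, None] * P - P * same
    M = np.einsum("ij,ijk,ijl->kl", W, diffs, diffs)
    grad = 2.0 * A @ M                                # formula (11)
    return f, grad

def nca_fit(X, labels, steps=100, lr=0.01):
    """Gradient ascent on f(A), a simple stand-in for conjugate gradient."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    A = np.eye(X.shape[1])
    for _ in range(steps):
        _, grad = nca_objective_and_gradient(A, X, labels)
        A = A + lr * grad
    return A
```

After fitting, applying KNN to the transformed samples `X @ A.T` with the Euclidean distance is equivalent to using the learned distance of formula (6).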

4. Experiment Result Analysis

Five commonly used UCI data sets are used to test the classification performance of the algorithm; all data in these sets are numeric. Since KNN cannot deal with missing data, missing data are discarded before the data sets are used. The data sets are summarized in Table 1.

Table 1. Five data sets used in experiment

Dataset    Samples  Features  Categories
Wine       178      13        3
Segment    200      19        7
Iris       150      4         3
Bal        625      4         3
Ion        351      35        2

Leave-one-out cross validation is used to calculate the classification accuracy on each data set. That is, for a data set with n samples, each sample in turn is taken as the test sample, and the remaining n−1 samples form the training set. The procedure is repeated n times so that every sample serves as the test sample once. Finally, the accuracy is the number of correctly classified test samples over these n rounds divided by n.
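The validation procedure can be sketched in Python as follows. The classifier is passed in as a function; `one_nn` is a hypothetical stand-in used only to make the example runnable, not the ensemble classifier evaluated in the experiments.

```python
import numpy as np

def leave_one_out_accuracy(X, y, classify):
    """Hold out each sample once, train on the other n-1 samples,
    and return the fraction of held-out samples classified correctly."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    correct = 0
    for i in range(n):
        keep = np.arange(n) != i
        if classify(X[keep], y[keep], X[i]) == y[i]:
            correct += 1
    return correct / n

def one_nn(train_X, train_y, query):
    """Hypothetical stand-in classifier: label of the single nearest training sample."""
    return train_y[int(np.argmin(np.sum((train_X - query) ** 2, axis=1)))]

print(leave_one_out_accuracy([[0], [1], [2], [10], [11], [12]], [0, 0, 0, 1, 1, 1], one_nn))
```

Any model that requires training is retrained n times under this scheme, which is the source of the computational cost discussed later in this section.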

First, different values of K are set to compare traditional KNN with the distance-learning KNN algorithm on the five data sets. The number of sub-classifiers in the ensemble is set to 50, the value of K is 1 and the attribute filtering coefficient is 0.33. Depending on the data set, the attribute extraction coefficient s is 0.6 for the Ion data set, 0.8 for the Segment data set and 1 for the Wine, Iris and Bal data sets. The leave-one-out cross validation results are shown in the following tables.

Table 2. Comparison results of Wine

Wine                      K=1     K=3     K=5     K=10
Traditional KNN           0.9494  0.9663  0.9494  0.9551
Single-Distance study     0.9888  0.9831  0.9831  1.0000
Single-Ensemble           0.9831  0.9888  0.9775  0.9775
Ensembled-Distance study  1.0000  1.0000  1.0000  1.0000

Table 3. Comparison results of Segment

Segment                   K=1     K=3     K=5     K=10
Traditional KNN           0.9690  0.9557  0.9514  0.9448
Single-Distance study     0.9748  0.9700  0.9690  0.9548
Single-Ensemble           0.9633  0.9513  0.9528  0.9681
Ensembled-Distance study  0.9787  0.9654  0.9631  0.9741

Table 4. Comparison results of Iris

Iris                      K=1     K=3     K=5     K=10

Table 5. Comparison results of Bal

Bal                       K=1     K=3     K=5     K=10

Table 6. Comparison results of Ion

Ion                       K=1     K=3     K=5     K=10

Next, each data set is evaluated with five different algorithms and the results are compared. The algorithms are the traditional KNN classifier, an improved KNN classifier with Mahalanobis distance, the multimodal perturbation ensemble algorithm FABSIR, the KNN classifier with NCA distance learning, and the ensemble learning KNN classifier introduced in this paper. With the same coefficient values as above, the classification results are shown in Table 7.

Table 7. Comparison of 5 methods

Dataset  KNN     Improved KNN  FABSIR  NCA     NEW
Wine     0.9494  0.9438        0.9785  0.9888  1.0000
Segment  0.9690  0.9697        0.9715  0.9748  0.9928
Iris     0.9133  0.9267        0.9254  0.9067  0.9867
Bal      0.7248  0.7344        0.8145  0.9200  0.9648
Ion      0.8476  0.8775        0.8753  0.8860  0.9401
Average  0.8802  0.8904        0.9022  0.9353  0.9769

The experimental data show that the new algorithm, which combines the ensemble algorithm with the NCA distance learning algorithm, clearly improves classification compared with the ensemble algorithm alone and the distance learning algorithm alone. For some data sets, such as Bal, distance learning improves classification most markedly. However, the improvement of the new algorithm comes at the cost of more computation, especially when leave-one-out cross validation is used: the number of distance learning runs needed for a data set equals the number of sub-classifiers times the number of samples. Moreover, on large data sets with many samples and attributes, NCA distance learning is slow. In addition, many data sets mix numerical and character data, and the distance model of this algorithm's KNN classifier cannot handle them. Thus, on the one hand, the calculation speed of the algorithm needs further improvement; on the other hand, the distance measure of the KNN classifier needs to be reconstructed for mixed-type data sets so that it can handle character information.

5. Conclusion

This paper implemented an ensemble KNN classifier based on distance learning. The classifier first filters out uncorrelated attributes in the data set using information gain, removing redundant attributes with a low degree of correlation. Then, through bagging, the ensemble not only randomly selects training samples by bootstrap sampling but also randomly filters the attributes of each sub-classifier, increasing the differences among sub-classifiers while preserving their accuracy. Every sub-classifier in the bagging ensemble is a KNN classifier whose distance measure is learned by neighborhood components analysis, and the learned distance measure is used to compute the classification results. Finally, majority voting combines the classification results into the final decision. The experimental results show that, because an ensemble is used and each sub-classifier performs distance learning, the proposed classifier clearly outperforms both the ensemble learning algorithm alone and the distance learning algorithm alone.

6. References

[1] Holmström, Hampus, "Estimation of single-tree characteristics using the kNN method and plotwise aerial photograph interpretations", Forest Ecology and Management, vol. 167, no. 13, pp. 303-314, 2002.

[2] Zhu Jianping, "Data Compression of Transactional Database Attribute Item in Data Mining", Statistics & Information Forum, vol. 26, no. 5, pp. 136-141, 2006.

[3] Reza Entezari-Maleki, Arash Rezaei, and Behrouz Minaei-Bidgoli, "Comparison of Classification Methods Based on the Type of Attributes and Sample Size", JCIT, vol. 4, no. 3, pp. 94-102, 2009.

[4] Tan Junshan, He Wei, Qing Yan, "Application of genetic algorithm in data mining", The Proceedings of the 1st International Workshop on Education Technology and Computer Science, ETCS, pp. 353-356, 2009.

[5] Li Yanmei, Zhang Zhuokui, "Data Mining Based on Bayesian Networks", Computer Simulation, vol. 18, no. 2, pp. 560-564, 2008.

[6] Yitian Xu, Haozhi Zhang, Laisheng Wang, "Rough Margin-Based Linear υ Support Vector Machine", JCIT, vol. 5, no. 8, pp. 226-232, 2010.

[7] Baek SeongJoon, Sung KoengMo, "Fast K-nearest-neighbour search algorithm for nonparametric classification", Electronics Letters, vol. 36, no. 21, pp. 1821-1822, 2000.

[8] Lu JianJiang, "Research on Algorithms of Mining Association Rules with Weighted Items", Journal of Computer Research and Development, vol. 9, no. 10, pp. 178-182, 2002.

[9] Hosein Alizadeh, Behrouz Minaei-Bidgoli and Saeed K. Amirgholipour, "A New Method for Improving the Performance of K Nearest Neighbor using Clustering Technique", JCIT, vol. 4, no. 2, pp. 84-92, 2009.

[10] Wang Jun, Woznica Adam, Kalousis Alexandros, "Learning neighborhoods for metric learning", Lecture Notes in Computer Science, vol. 7523, no. 1, pp. 223-236, 2012.

[11] Yoo, SungGoo, Chong, KilTo, "Obstacle avoidance system using a single camera and LMNN fuzzy controller", Journal of Institute of Control, Robotics and Systems, vol. 15, no. 2, pp. 192-197, 2009.

[12] Manit, Jirapong, Youngkong, Prakarnkiat, "Neighborhood components analysis in sEMG signal dimensionality reduction for gait phase pattern recognition", The proceeding of 6th International Conference on Broadband Communications and Biomedical Applications, Program, pp. 86-90, 2011.

[13] Shen Chuanhe, Wang Xiangrong, Yu Di, "Feature weighting of support vector machines based on derivative saliency analysis and its application to financial data mining", International Journal of Advancements in Computing Technology, vol. 4, no. 1, pp. 199-206, 2012.

[14] Zhou, ZhiHua, Yu Yang, "Ensembling local learners through multimodal perturbation", IEEE Transactions on Systems, Man, and Cybernetics, vol. 35, no. 4, pp. 725-735, 2005.

[15] Li huaxin, "Conjugate Gradient Applied to Image Reconstruction", CT Theory and Applications, vol. 16, no. 2, 2007.