Review of Pruning and Performance Enhancement Techniques over Classification Algorithms


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 3, March 2012)


Anshu Katare¹, Dr. Vijay Anant Athavale²

¹ Research Scholar, Department of Computer Science & Engineering, PAHER University Udaipur, Rajasthan
² Director, Dev Raj Group’s Technical Campus, Ferozepur, Punjab

¹ [email protected], ² [email protected]

Abstract: Machine learning algorithms are frequently used to make decisions in complex problems. They are applied to classification, clustering and data analysis, so many large and complex problems are now solved with machine learning. To make correct decisions, we need to improve the performance of these algorithms in terms of accuracy and learning error rate. We also need to reduce the size of the tree, in terms of search depth, to find an optimal solution to a problem. In this paper we provide a brief literature survey of performance enhancement algorithms and methods and give the problem formulation. We also present the solution domain and a basic search algorithm.

Keywords: Decision Tree, Data Mining, Enhancement, Problem Formulation, Solution Steps.

I. INTRODUCTION

Machine learning is a branch of artificial intelligence in which a machine is trained to work like human intelligence and to make decisions like the human brain. The machine itself does not perform the intelligent work; the data and algorithms make it intelligent. Algorithms are trained using previous events and their related facts as parameters, and they prepare data models for future use. These data models help a machine take better decisions according to the nature of the facts on which they were trained. Three main factors define the capability of any machine: first, the interface through which a user interacts with the machine; second, the database where the facts and data models are built and prepared; and last, the search engine that helps find the optimum solution for a specific problem.

The most frequently recommended methodology for performing the search operation is the tree data structure. In it, all the facts are related to each other by links, forming a decision tree. The leaf nodes hold the decisions, and the intermediate nodes represent the relations between different attributes, or the path by which the most optimal decision is found.

One such data structure is the decision tree, which is frequently used in different kinds of problem-solving techniques, such as classification problems, cluster analysis, data analysis and decision-making problems. Extracting useful information from data and data sources requires data mining and its different approaches. A large number of methods and algorithms are available to accurately classify a given training dataset. Some of them are complex and some are very easy to implement. According to their implementation complexity and classification methodology, we divide these data models into two main groups.

1. Opaque model
2. Transparent model


model for machine learning, we need to work in the direction of decision trees and rule-building techniques.

A decision tree is a data model that returns results in the most transparent way: the user can estimate the output values by simply traversing the tree with the parameters supplied for evaluation. Decision trees are mathematical models based on weighted graphs. They are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal. To demonstrate the working of any data model, we can regard it as a formula derived to fit similar kinds of data patterns. The use of decision trees brings both advantages and disadvantages as a data structure. In this paper we work with data mining and analyze the performance enhancement strategies that are frequently used. These methods promise to improve the performance of classification algorithms. We also identify some ways in which the accuracy of a classification algorithm can be improved.

II. BACKGROUND

In data mining, various methods, tools and algorithms have been developed and implemented to achieve high-performance problem solving, but most of these techniques are opaque models of decision mining. For transparent data pattern analysis and decision mining we use decision trees. Transparent data structures make it possible to find the solution of a complex problem using simple paper and pencil.

A decision tree can be used to represent decisions and decision making visually and unambiguously. In data mining, a decision tree describes data rather than decisions; the resulting classification tree can then be an input for decision making. Decision trees used in data mining are of two main types: classification tree analysis, where the predicted outcome is the class to which the data belongs, and regression tree analysis, where the predicted outcome can be considered a real number.

Training sets: A major component of any data model is a training data set of objects whose classes are already known; that is, the input parameters and their corresponding outputs are given. The training process is used to develop a classification rule or data model that can determine the class of any object from the values of its attributes. The immediate question is whether or not the attributes provide sufficient information to do this. In particular, if the training data set contains two objects that have identical values for each attribute and yet belong to different classes, it is clearly impossible to differentiate between these objects with reference only to the given attributes. In such a case the attributes are termed inadequate for the training set and hence for the induction task. [1]
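The adequacy condition described above can be checked mechanically. The following is a minimal sketch, not part of the original method: the function name `find_conflicts` and the toy weather-style data are hypothetical and chosen only for illustration. It reports every attribute vector that appears with more than one class label.

```python
from collections import defaultdict

def find_conflicts(examples):
    """Return attribute tuples that occur with more than one class label."""
    labels_seen = defaultdict(set)
    for attributes, label in examples:
        labels_seen[tuple(attributes)].add(label)
    return {attrs: labels for attrs, labels in labels_seen.items() if len(labels) > 1}

# Hypothetical toy data: (outlook, windy) -> play
training_set = [
    (("sunny", "no"), "yes"),
    (("rain", "yes"), "no"),
    (("sunny", "no"), "no"),  # same attributes as the first row, different class
]

conflicts = find_conflicts(training_set)
attributes_sufficient = not conflicts  # False here: the attributes are inadequate
```

If `conflicts` is non-empty, no classifier that looks only at these attributes can classify the training set perfectly.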

Ensemble Learning: According to [2], ensemble learning is a machine learning paradigm in which more than one learner is trained to solve the same problem. In contrast to ordinary machine learning approaches, which try to learn one hypothesis from the training data, ensemble methods try to construct a set of hypotheses and combine them for use.

Constructing Ensembles: An ensemble is created in two steps. First, a number of base learners are produced, either in parallel or sequentially, where in the sequential case the generation of a base learner influences the generation of subsequent learners. Then the base learners are combined; the most common combination schemes are majority voting for classification and weighted averaging for regression.


Suppose $X$ and $Y$ denote the instance space and the set of class labels respectively, assuming $Y = \{-1, +1\}$. A training data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$ is given, where $x_i \in X$ and $y_i \in Y$ ($i = 1, \dots, m$).

Boosting: Boosting is a family of algorithms with many variants, each suited to different data scenarios. Here we describe the most frequently used one, AdaBoost. It is a meta-algorithm and can be used in combination with many other learning algorithms to improve their performance. AdaBoost is adaptive in the sense that subsequent classifiers are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. The algorithm proceeds as follows. First, it assigns equal weights to all the training examples. Denote the distribution of the weights at the $t$-th learning round as $D_t$. From the training data set and $D_t$ the algorithm generates a base learner $h_t: X \to Y$ by calling the base learning algorithm. It then uses the training examples to test $h_t$, and the weights of the incorrectly classified examples are increased. Thus, an updated weight distribution $D_{t+1}$ is obtained. From the training data set and $D_{t+1}$, AdaBoost generates another base learner by calling the base learning algorithm again. This process is repeated $T$ times, each of which is called a round, and the final learner is derived by weighted majority voting of the $T$ base learners, where the weights of the learners are determined during the training process. In practice, the base learning algorithm may be one that can use weighted training examples directly; otherwise the weights can be exploited by sampling the training examples according to the weight distribution $D_t$.

AdaBoost

Input: Data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$;
Base learning algorithm $L$;
Number of learning rounds $T$.

Process:

1. Initialize the weight distribution: $D_1(i) = 1/m$.

For $t = 1, \dots, T$:

2. Train a base learner $h_t$ from $D$ using distribution $D_t$: $h_t = L(D, D_t)$;

3. Measure the error of $h_t$: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$;

4. Determine the weight of $h_t$: $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$;

5. Update the distribution:

$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \exp(-\alpha_t) & \text{if } h_t(x_i) = y_i \\ \exp(\alpha_t) & \text{if } h_t(x_i) \neq y_i \end{cases} = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$,

where $Z_t$ is a normalization factor which enables $D_{t+1}$ to be a distribution.

end.

Output: $H(x) = \mathrm{sign}(f(x)) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
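The AdaBoost listing can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the base learner `train_stump` (a threshold rule on a single numeric feature), the clamp on the error, and the toy data are all assumptions introduced here. Labels are in {-1, +1} as in the text.

```python
import math

def train_stump(xs, ys, weights):
    """Assumed base learner L(D, D_t): threshold rule with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    _, threshold, polarity = best
    return lambda x: polarity if x >= threshold else -polarity

def adaboost(xs, ys, rounds):
    m = len(xs)
    weights = [1.0 / m] * m                       # D_1(i) = 1/m
    ensemble = []                                 # pairs (alpha_t, h_t)
    for _ in range(rounds):
        h = train_stump(xs, ys, weights)          # h_t = L(D, D_t)
        eps = sum(w for w, x, y in zip(weights, xs, ys) if h(x) != y)
        eps = min(max(eps, 1e-10), 1 - 1e-10)     # assumption: clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)   # weight of h_t
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, x, y in zip(weights, xs, ys)]
        z = sum(weights)                          # normalization factor Z_t
        weights = [w / z for w in weights]        # D_{t+1}
        ensemble.append((alpha, h))

    def H(x):
        """Final learner: sign of the weighted vote."""
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

    return H, weights

# Toy data that no single stump separates, but three rounds do.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
H, final_weights = adaboost(xs, ys, rounds=3)
predictions = [H(x) for x in xs]
```

On this toy set no single threshold rule classifies all six points, while the weighted vote of three stumps does, which is the effect the weight update is designed to produce.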

Bagging, on the other hand, trains a number of base learners, each from a different bootstrap sample, by calling a base learning algorithm. Bootstrapping in general refers to a self-sustaining process that proceeds without external help. Here, a bootstrap sample is obtained by sampling the training data set with replacement, where the size of the sample is the same as that of the training data set. Thus a bootstrap sample may contain some training examples more than once and omit others. The Bagging algorithm is shown below.

Input: Data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$;
Base learning algorithm $L$;
Number of learning rounds $T$.

Process:

For $t = 1, \dots, T$:

1. Generate a bootstrap sample from $D$: $D_t = \mathrm{Bootstrap}(D)$;

2. Train a base learner $h_t$ from the bootstrap sample: $h_t = L(D_t)$

end.

Output: $H(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} 1(y = h_t(x))$, where the value of $1(a)$ is 1 if $a$ is true and 0 otherwise.
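The Bagging listing maps almost directly onto code. The sketch below is an assumption-laden illustration: the base learner `one_nn` (1-nearest-neighbour on one numeric feature), the fixed random seed and the toy data are hypothetical choices, not taken from the paper.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]

def one_nn(sample):
    """Assumed base learner L: 1-nearest-neighbour on a single numeric feature."""
    return lambda x: min(sample, key=lambda ex: abs(ex[0] - x))[1]

def bagging(data, base_learner, rounds, seed=0):
    rng = random.Random(seed)
    # h_t = L(Bootstrap(D)) for t = 1..T
    learners = [base_learner(bootstrap(data, rng)) for _ in range(rounds)]

    def H(x):
        """Majority vote: argmax over labels of the number of learners predicting it."""
        votes = Counter(h(x) for h in learners)
        return votes.most_common(1)[0][0]

    return H

# Hypothetical toy data: two well-separated classes on the real line.
data = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (5.0, "b"), (5.1, "b"), (5.2, "b")]
H = bagging(data, one_nn, rounds=25)
label_low = H(0.05)
label_high = H(5.15)
```

Because each learner sees a different resample, individual learners may disagree on borderline points; the vote smooths out that variance.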


In Stacking, a number of first-level learners are generated from the data set by applying different learning algorithms. The first-level learners are then combined by a second-level learner, which is called the meta-learner. Stacking is closely related to information fusion methods. The basic algorithm is given below.

Input: Data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$;
First-level learning algorithms $L_1, \dots, L_T$;
Second-level learning algorithm $L$.

Process:

For $t = 1, \dots, T$:

1. Train a first-level individual learner $h_t$ by applying the first-level learning algorithm $L_t$ to the original data set $D$: $h_t = L_t(D)$

end;

2. Generate a new data set: $D' = \emptyset$;

For $i = 1, \dots, m$:

For $t = 1, \dots, T$:

3. Use $h_t$ to classify the training example $x_i$: $z_{it} = h_t(x_i)$

end;

4. $D' = D' \cup \{((z_{i1}, z_{i2}, \dots, z_{iT}), y_i)\}$

end;

5. Train the second-level learner $h'$ by applying the second-level learning algorithm $L$ to the new data set $D'$: $h' = L(D')$
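The Stacking procedure can be sketched as follows. Everything concrete here is an assumption made for illustration: the two first-level algorithms (`one_nn`, `majority_class`), the table-lookup second-level algorithm (`lookup_table`) and the toy data are hypothetical, and the final step (training the second-level learner on the new data set) is completed from the description in the text.

```python
from collections import Counter, defaultdict

def one_nn(data):
    """Assumed first-level algorithm: 1-nearest-neighbour on one numeric feature."""
    return lambda x: min(data, key=lambda ex: abs(ex[0] - x))[1]

def majority_class(data):
    """Assumed first-level algorithm: always predict the most common label."""
    label = Counter(y for _, y in data).most_common(1)[0][0]
    return lambda x: label

def lookup_table(new_data):
    """Assumed second-level algorithm: majority label per vector of first-level outputs."""
    votes = defaultdict(Counter)
    for z, y in new_data:
        votes[z][y] += 1
    fallback = Counter(y for _, y in new_data).most_common(1)[0][0]
    return lambda z: votes[z].most_common(1)[0][0] if z in votes else fallback

def stacking(data, first_level_algos, second_level_algo):
    learners = [algo(data) for algo in first_level_algos]             # h_t = L_t(D)
    new_data = [(tuple(h(x) for h in learners), y) for x, y in data]  # (z_i1..z_iT, y_i)
    meta = second_level_algo(new_data)                                # second-level learner
    return lambda x: meta(tuple(h(x) for h in learners))

# Hypothetical toy data on the real line.
data = [(1.0, "a"), (2.0, "a"), (3.0, "b"), (4.0, "b"), (5.0, "b")]
H = stacking(data, [one_nn, majority_class], lookup_table)
pred_low = H(1.4)
pred_high = H(4.6)
```

In this sketch the first-level outputs are computed on the same data the learners were trained on, exactly as in the listing; on real data this tends to overfit, and held-out predictions are often used instead.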

Pruning: Overfitting is a significant practical difficulty for decision tree models and many other predictive models. Overfitting happens when the learning algorithm continues to develop hypotheses that reduce the training set error at the cost of an increased test set error. There are several approaches to avoiding overfitting when building decision trees.

• Pre-pruning, which stops growing the tree early, before it perfectly classifies the training set.

• Post-pruning, which allows the tree to perfectly classify the training set and then prunes it.
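As an illustration of post-pruning, the sketch below uses reduced-error pruning, which is one specific post-pruning technique and is not named in the text. The tree representation (nested dictionaries with hypothetical keys `feature`, `threshold`, `left`, `right`, `majority`) and the validation data are assumptions. Pre-pruning would instead be a stopping rule applied while the tree is grown, such as a depth limit.

```python
def predict(tree, x):
    """Follow the splits until a leaf (any non-dict value) is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if x[tree["feature"]] < tree["threshold"] else tree["right"]
    return tree

def errors(tree, data):
    """Number of examples in data that the tree misclassifies."""
    return sum(predict(tree, x) != y for x, y in data)

def post_prune(tree, validation):
    """Bottom-up: replace a subtree by its majority leaf if validation error does not rise."""
    if not isinstance(tree, dict):
        return tree
    left_data = [(x, y) for x, y in validation if x[tree["feature"]] < tree["threshold"]]
    right_data = [(x, y) for x, y in validation if x[tree["feature"]] >= tree["threshold"]]
    pruned = dict(tree,
                  left=post_prune(tree["left"], left_data),
                  right=post_prune(tree["right"], right_data))
    leaf = tree["majority"]
    if errors(leaf, validation) <= errors(pruned, validation):
        return leaf
    return pruned

# Hypothetical overfitted tree: the inner split at 8 only fits a noisy training point.
tree = {
    "feature": 0, "threshold": 5, "majority": "no",
    "left": "no",
    "right": {"feature": 0, "threshold": 8, "majority": "yes",
              "left": "yes", "right": "no"},
}
validation = [((2,), "no"), ((3,), "no"), ((6,), "yes"), ((9,), "yes")]
pruned_tree = post_prune(tree, validation)
```

Here the inner split is collapsed into a single leaf because it only adds validation error, while the root split is kept because removing it would misclassify half of the validation set.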

Tree height: The length of the path from the root to a leaf is the number of decisions (comparisons) made to reach the outcome represented by that leaf. The longest path from root to leaf (the height) gives the number of comparisons in the worst case; the average of the lengths of the paths from the root to all the leaves gives the average number of comparisons.
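Both quantities are easy to compute from the leaf depths. The following sketch assumes a hypothetical representation in which an internal node is a tuple of children and anything else is a leaf.

```python
def leaf_depths(tree, depth=0):
    """Return the depth of every leaf; internal nodes are tuples of children."""
    if not isinstance(tree, tuple):
        return [depth]
    return [d for child in tree for d in leaf_depths(child, depth + 1)]

# Hypothetical tree with five leaves.
tree = (("a", "b"), ("c", ("d", "e")))
depths = leaf_depths(tree)
height = max(depths)                          # worst-case number of comparisons
average_depth = sum(depths) / len(depths)     # average number of comparisons
```

For this tree the leaf depths are 2, 2, 2, 3 and 3, so the height is 3 and the average path length is 2.4.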

III. PROPOSED ALGORITHM

In the section above we introduced the basic methods and algorithms that help a machine learning algorithm perform well, describing the most frequently used ones. In this section we discuss the adaptability of decision trees and the basics of our proposed algorithm.

Three adaptivity properties of decision trees lead to faster rates of convergence for a broad range of pattern classification problems. These properties are:

Noise Adaptivity: Decision trees can automatically adapt to the (unknown) regularity of the excess risk function in the neighbourhood of the Bayes decision boundary. The regularity is quantified by a condition similar to Tsybakov’s noise condition. [5]

Manifold Focus: When the distribution of features happens to have support on a lower dimensional manifold, decision trees can automatically detect and adapt their structure to the manifold. Thus decision trees learn the “effective” data dimension.

Feature Rejection: If certain features are irrelevant (i.e., independent of the class labels), then decision trees can automatically ignore these features. Thus decision trees learn the “relevant” data dimension.

From this discussion we can say that a decision tree intended for high-performance use needs some basic properties, which are listed in the table below.

Property | Description
Data set | The basic ingredient of the algorithm or data model; the selected dataset should be in a clean format with few missing or ambiguous values
Classification | To train the algorithm, use a performance improvement strategy such as bagging, boosting or stacking
Tree height | Tree height should be small and thus fast
Tree pruning | Use pruning methods such as pre-pruning and post-pruning to reduce ambiguity
Adaptive rule | Rules should differ enough that no conflict or confusion occurs during decisions
Testing | The test set must belong to the base training dataset, and a machine test such as n-fold cross-validation should be used

Design principle: The design of the proposed system is based on set theory and the rain forest decision tree algorithm.

To understand the complete system design we start with the basic training set. To increase the consistency of the algorithm, suppose the complete training dataset is defined by three classes A, B and C, and that the data set contains n values distributed between these three classes.

If we place all the values in a space, we find that they are scattered across it and distributed between the three classes as in the diagram given above. Some of the values are defined uniquely by exactly one class; these uniquely identified values are represented by the circles A, B and C. Some of the values lie between two classes; these are represented by AB, AC and BC in the problem space. The last section of the data lies in the area ABC. The trees defined by these subsets are generated using a tree mounting algorithm.

To mount the data over a tree we use the basic concept of rain forest, where a set of trees is used to provide the solution of a given problem. Here we use data clusters to mount them, first class A, then B, and so on. Finally a set of different trees is generated, and these are combined on a weighted probability basis.

After mounting these trees into one, we prune the tree using a post-pruning method to remove additional branches and leaves and so arrive at unique decisions.

IV. MODEL TESTING

Machine learning algorithms provide important functionality to support solutions in many scientific applications, such as computational biology, computational linguistics, and others. Quality assurance of such applications poses a challenge because traditional software testing processes cannot always be applied to them.

Supervised ML applications by nature consist of two phases of data processing. The first phase, called the training phase, analyses the training data; the result of this analysis is a model that attempts to make generalizations about how the attributes relate to the label. In the second phase, called the testing phase, the model is applied to another, previously unseen data set (the testing data) where the labels are unknown. In a classification algorithm, the system attempts to predict the label of each individual example. [6]

Confusion Matrix: A confusion matrix is a visual performance assessment of a classification algorithm in the form of a table layout or matrix. Each column of the matrix represents predicted classifications and each row represents actual defined classifications. This representation is a useful way to help evaluate a classifier model. A well-behaved model should produce a balanced matrix and have consistent percent-correct numbers for accuracy, recall, precision and the F measure. For anyone building their own classification models, this is a helpful way to evaluate them. For example, a confusion matrix summarizing the results of a classification algorithm might look like the following.

                 Predicted
Actual     Pos    Neg    Neutral
Pos        15     10     100
Neg        10     15     10
Neutral    10     100    1000


and 100 were classified as Neutral. 110 Positive statements were missed and considered false negatives. Values in the diagonal are correctly classified and are underlined. All other classifications are incorrect.

Precision: Precision is the number of correct classifications penalized by the number of incorrect classifications: true positives / (true positives + false positives).

Recall: Recall is the number of correct classifications penalized by the number of missed items: true positives / (true positives + false negatives).

F Measure: The F1 measure is a derived effectiveness measurement. The resultant value is interpreted as a weighted average (the harmonic mean) of the precision and recall. The best value is 1 and the worst is 0.

F1 = 2 × (precision × recall) / (precision + recall)

Accuracy: The simplest and most intuitive assessment is Accuracy. It is the correct classifications divided by all classifications.
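The four measures can be computed directly from the example confusion matrix above (rows are actual classes, columns are predicted classes). This is a minimal sketch; the function names `class_metrics` and `accuracy` are hypothetical, and no guard is included for a class that is never predicted or never occurs.

```python
labels = ["Pos", "Neg", "Neutral"]
matrix = [
    [15, 10, 100],    # actual Pos
    [10, 15, 10],     # actual Neg
    [10, 100, 1000],  # actual Neutral
]

def class_metrics(matrix, k):
    """Precision, recall and F1 for class index k, treating k as the positive class."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                   # actual k, predicted as something else
    fp = sum(row[k] for row in matrix) - tp    # predicted k, actually something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

def accuracy(matrix):
    """Correct classifications (the diagonal) divided by all classifications."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

pos_precision, pos_recall, pos_f1 = class_metrics(matrix, labels.index("Pos"))
overall_accuracy = accuracy(matrix)
```

For the Positive class this gives a precision of 15/35, a recall of 15/125 and an F1 of 0.1875, while the overall accuracy is 1030/1270 (about 0.81). The gap shows how accuracy alone can hide poor performance on a minority class.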

Cross-Validation: Cross-validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model. In typical cross-validation, the training and validation sets must cross over in successive rounds such that each data point has a chance of being validated against. The basic form of cross-validation is k-fold cross-validation. Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of k-fold cross-validation. There are two possible goals in cross-validation:

• To estimate performance of the learned model from available data using one algorithm. In other words, to gauge the generalizability of an algorithm.

• To compare the performance of two or more different algorithms and find out the best algorithm for the available data, or alternatively to compare the performance of two or more variants of a parameterized model.

Cross-validation can be applied in three contexts: performance estimation, model selection, and tuning learning model parameters.
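A k-fold cross-validation loop for performance estimation can be sketched as below. The interleaved fold assignment without shuffling, the 1-nearest-neighbour learner `one_nn` and the toy data are assumptions made for illustration only.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]

def one_nn(train):
    """Assumed learner: 1-nearest-neighbour on a single numeric feature."""
    return lambda x: min(train, key=lambda ex: abs(ex[0] - x))[1]

def cross_validate(data, learner, k):
    """Average validation accuracy over k rounds; each point is validated exactly once."""
    scores = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train = [ex for i, ex in enumerate(data) if i not in held_out]
        validation = [data[i] for i in fold]
        h = learner(train)
        scores.append(sum(h(x) == y for x, y in validation) / len(validation))
    return sum(scores) / k

# Hypothetical toy data: two well-separated classes.
data = [(0.0, "a"), (0.2, "a"), (0.4, "a"), (0.6, "a"),
        (5.0, "b"), (5.2, "b"), (5.4, "b"), (5.6, "b")]
folds = k_fold_indices(len(data), 4)
mean_accuracy = cross_validate(data, one_nn, k=4)
```

To compare algorithms or tune a parameter, the same loop would be run once per candidate and the mean accuracies compared; in practice the data is usually shuffled or stratified before the folds are formed.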

V. CONCLUSION AND FUTURE WORK

In this paper we give a simple overview of our proposed method of decision tree mining and of the basic ways in which we increase the performance of the defined tree using various algorithms and properties. The concept of the proposed decision tree is implemented using Java and its rich classes and libraries. In future, the actual algorithm needs to be simulated to show how it works on different kinds of training data formats.

References

[2] Ludmila I. Kuncheva, Christopher J. Whitaker, “Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy”, Machine Learning, volume 51, issue 2, May 2003, pp. 181–207.

[3] Zhi-Hua Zhou, “Ensemble Learning”, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China.

[4] Anders Krogh, Jesper Vedelsby, “Neural network ensembles, cross validation, and active learning”, Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, MA, 1995, pp. 231–238.

[5] Clayton Scott, Robert Nowak, “On the Adaptive Properties of Decision Trees”, Electrical and Computer Engineering, University of Wisconsin.

[6] A. Nobel, “Analysis of a complexity based pruning scheme for classification trees”, IEEE Transactions on Information Theory, volume 48, issue 8, August 2002, pp. 2362–2368.

[7] G. Blanchard, G. Lugosi, N. Vayatis, “On the rate of convergence of regularized boosting classifiers”, Journal of Machine Learning Research, volume 4, 2003, pp. 861–894.

[8] Xiaoyuan Xie, Joshua W. K. Ho, Christian Murphy, Gail Kaiser, Baowen Xu, Tsong Yueh Chen, “Testing and Validating Machine Learning Classifiers by Metamorphic Testing”, preprint submitted to Elsevier, January 11, 2010.

[9] Payam Refaeilzadeh, Lei Tang, Huan Liu, “Cross-Validation”, Arizona State University.