Vol 5, No 1 (2013)


IJCSBI.ORG

ISSN: 1694-2108 | Vol. 5, No. 1. SEPTEMBER 2013

Parallel Ensemble Techniques for Data Mining Application

M. Govindarajan Assistant Professor

Department of Computer Science and Engineering, Annamalai University Annamalai Nagar – 608002, Tamil Nadu, India.

ABSTRACT

Data mining is a powerful technique for extracting hidden predictive information from large datasets. Classification is a very popular application of data mining. The goal of classification is to build a classifier with high accuracy and low cost. Since most data mining datasets are very large, using small subsets of the dataset can speed up data mining tasks. It is possible to generate a family of predictors using different subsets of the training dataset and combine those predictors to achieve higher accuracy at low cost. This is the basic idea of ensemble techniques. Variants of ensemble techniques include bagging, AdaBoost, and arcing. In this research work, parallel bagging and parallel AdaBoost with a support vector machine (SVM) as the base classifier are implemented on parallel hardware. The NSL-KDD dataset is used to examine the essential parameters of parallel ensemble techniques, such as sample size, number of iterations, number of processors, and threshold. Experiments are conducted to demonstrate that ensemble techniques can be effectively parallelized. The parallel ensemble techniques provide more accurate results than their sequential counterparts.

Keywords

Classification, Data mining, parallelism, ensemble techniques, bagging, AdaBoost.

1. INTRODUCTION

Data mining is a powerful technique for extracting hidden predictive information from large datasets. Using a combination of machine learning, statistical analysis, modelling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results.

Classification is a very popular application of data mining. Given a training dataset, the target of classification is to train a model to predict the class labels. A training dataset usually consists of a large number of examples, each of which contains values for a series of attributes and a known class label. After the model is trained, it is deployed to analyze new data and make predictions.

The model is evaluated on a test dataset with known class labels. The classification output of the model is compared with the known class labels to obtain the test set error, usually measured as the percentage of misclassified cases. Accuracy is an important factor in assessing the performance of any classification model. Another important factor is the cost of building the classifier. The goal of classification is to build a classifier with high accuracy and low cost.

Decision trees, neural networks, and SVM are widely used classification methods. SVM is a new machine-learning paradigm that solves learning problems by finding an optimal separating hyperplane.

SVM usually achieves higher generalization performance than traditional neural networks that implement the empirical risk minimization (ERM) principle in solving many machine learning problems [5]. Another key characteristic of SVM is that training an SVM is equivalent to solving a linearly constrained quadratic programming problem, so the solution of SVM is always unique and globally optimal. SVM can be applied to both classification and regression problems.

Since most data mining datasets are very large, using small subsets of the dataset can speed up data mining tasks. It is possible to generate a family of predictors using different subsets of the training dataset and combine those predictors to achieve higher accuracy. This is the basic idea of ensemble techniques.

By using ensemble techniques, higher accuracy with lower cost is achieved. There are many variants of ensemble techniques. Some of them, for example bagging [2], generate subsets independently and combine the predictors with uniform weight. Bagging can naturally be parallelized, since the generation of subsets and predictors is independent.

Other ensemble techniques, for example boosting [7], are more complicated. They generate a subset for training based on the performance of previously constructed predictors, and they use weighted voting to combine the family of predictors. Such a technique proceeds in sequential steps, so parallelizing it is not as straightforward as parallelizing bagging.


The organization of this research paper is as follows: Section 2 describes the base classifier. Section 3 presents algorithms for sequential bagging and AdaBoost. Section 4 presents algorithms for parallel bagging and AdaBoost. Section 5 explains the performance evaluation measures. Section 6 focuses on the experimental results and discussion. Finally, results are summarized and concluded in Section 7.

2. BASE CLASSIFIER

2.1 Support Vector Machine (SVM)

SVMs [4] [6] are powerful tools for data classification. Classification is achieved by a linear or nonlinear separating surface in the input space of the dataset. The separating surface depends only on a subset of the original data. This subset of data, which is all that is needed to generate the separating surface, constitutes the set of support vectors. In this study, a method is given for selecting as small a set of support vectors as possible which completely determines a separating plane classifier. In nonlinear classification problems, SVM tries to place a linear boundary between two different classes and adjusts it in such a way that the margin is maximized [12]. Moreover, in the case of linearly separable data, the method is to find the most suitable one among the hyperplanes that minimize the training error. After that, the boundary is adjusted such that the distance between the boundary and the nearest data points in each class is maximal.

In a binary classification problem, the data points are given as:

$$D = \{(x_1, y_1), \ldots, (x_l, y_l)\}, \quad x \in \mathbb{R}^n, \; y \in \{-1, 1\} \tag{2.1}$$

where

y = a binary value representing the two classes and,

x = the input vector.

As mentioned above, there are a number of hyperplanes that can separate these two sets of data, and the problem is to find the hyperplane with the largest margin. Suppose that all training data satisfy the following constraints:

$$w \cdot x_i + b \geq +1 \quad \text{for } y_i = +1 \tag{2.2}$$

$$w \cdot x_i + b \leq -1 \quad \text{for } y_i = -1 \tag{2.3}$$

where

w = the boundary

x = the input vector

b = the scalar threshold (bias).

The corresponding decision function is:

$$f(x) = \operatorname{sgn}((w \cdot x) + b) \tag{2.4}$$

Thus, the separating hyperplane must satisfy the following constraints:

$$y_i [(w \cdot x_i) + b] \geq 1, \quad i = 1, \ldots, l \tag{2.5}$$

where l = the number of training examples

The optimal hyperplane is the unique one that not only separates the data without error but also maximizes the margin, that is, the distance from the hyperplane to the closest vectors of both classes. The hyperplane that optimally separates the data into two classes can therefore be shown to be the one that minimizes the functional:

$$\Phi(w) = \frac{1}{2} \lVert w \rVert^2 \tag{2.6}$$

Therefore, the optimization problem can be reformulated as an equivalent unconstrained optimization problem by introducing the Lagrange multipliers ($\alpha_i \geq 0$) and a Lagrangian:

$$L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{l} \alpha_i \left( y_i ((w \cdot x_i) + b) - 1 \right) \tag{2.7}$$

The Lagrangian has to be minimized with respect to w and b by the given expressions:

$$w_0 = \sum_{i=1}^{l} \alpha_i y_i x_i \tag{2.8}$$

This expression for $w_0$ is then substituted into equation (2.7), which results in the dual form of the function, which has to be maximized subject to the constraints $\alpha_i \geq 0$.

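Maximize

$$W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \tag{2.9}$$

Subject to $\alpha_i \geq 0$, $i = 1, \ldots, l$ and $\sum_{i=1}^{l} \alpha_i y_i = 0$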

The hyperplane decision function can therefore be written as:

$$f(x) = \operatorname{sign}(w_0 \cdot x + b_0) = \operatorname{sign}\left( \sum_{i=1}^{l} \alpha_i^0 y_i (x_i \cdot x) + b_0 \right) \tag{2.10}$$

However, equation (2.10) applies to linearly separable data. For data that are not linearly separable, SVM learns the decision function by first mapping the data to some higher dimensional feature space and constructing a separating hyperplane in this space.
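The decision function (2.10) depends on the data only through inner products, which is what makes the mapping to a higher dimensional space practical: the inner product can be swapped for a kernel function. The sketch below illustrates this with hand-picked values. The support vectors, multipliers and bias are hypothetical, and the RBF kernel with gamma = 0.5 is one common choice assumed here, not the kernel reported for the experiments.

```python
import math


def linear_kernel(u, v):
    """Plain inner product, as used in equation (2.10)."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; an assumed example of a nonlinear kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def decide(x, support_vectors, labels, alphas, b0, kernel=linear_kernel):
    """Return sign(sum_i alpha_i * y_i * K(x_i, x) + b0) as +1 or -1."""
    total = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if total + b0 >= 0 else -1


# Hypothetical two-point example: w0 = 0.25*(1,1) + 0.25*(1,1) = (0.5, 0.5), b0 = 0.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
b0 = 0.0

print(decide((2.0, 0.5), support_vectors, labels, alphas, b0))
print(decide((-3.0, 1.0), support_vectors, labels, alphas, b0))
print(decide((2.0, 0.5), support_vectors, labels, alphas, b0, kernel=rbf_kernel))
```

With the linear kernel the multipliers above satisfy the constraints of (2.9); with the RBF kernel they are reused purely to show the call and are not the solution of a dual problem.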

3. SEQUENTIAL BAGGING AND ADABOOST

3.1 Bagging

Bagging, which stands for Bootstrap Aggregating, is a method for generating multiple versions of a predictor and using these to get an aggregated predictor.

The aggregation averages over the versions when predicting a numerical outcome and uses a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets.

Bagging is not as helpful with stable base model learning algorithms because they tend to return similar base models in spite of the differences among the bootstrap training sets. Because of this, the base models almost always vote the same way, so the ensemble returns the same prediction as almost all of its base models, leading to almost no improvement over the base models.

Bagging ({(x1, y1), (x2, y2), ...., (xN, yN)}, M)
    For each m = 1, 2, ...., M
        Tm = Sample_With_Replacement ({(x1, y1), (x2, y2), ...., (xN, yN)}, N)
        hm = Lb (Tm)
    Return $h_{fin}(x) = \arg\max_{y \in Y} \sum_{m=1}^{M} I(h_m(x) = y)$

Sample_With_Replacement (T, N)
    S = {}
    For i = 1, 2, ...., N
        r = random_integer (1, N)
        Add T[r] to S.
    Return S.

Figure 1. The Bagging Algorithm
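The bagging procedure can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the experiments: the base learner `train_stump` is a hypothetical stand-in for Lb (the paper uses SVM), and the toy data and the choice M = 11 are assumptions made for the example.

```python
import random
from collections import Counter


def sample_with_replacement(T, N, rng):
    """Draw N examples from T uniformly at random, with replacement."""
    return [T[rng.randrange(len(T))] for _ in range(N)]


def train_stump(T):
    """Hypothetical stand-in for the base learner Lb: a threshold rule on one feature."""
    best = None
    for threshold, _ in T:
        for low, high in ((-1, 1), (1, -1)):
            errors = sum(1 for x, y in T if (low if x < threshold else high) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x < threshold else high


def bagging(T, M, learner=train_stump, seed=0):
    """Train M base models on bootstrap replicates and combine them by plurality vote."""
    rng = random.Random(seed)
    models = [learner(sample_with_replacement(T, len(T), rng)) for _ in range(M)]

    def h_fin(x):
        votes = Counter(h(x) for h in models)
        return votes.most_common(1)[0][0]

    return h_fin


# Toy one-dimensional data (assumed): class -1 below 5, class +1 from 5 upwards.
T = [(float(x), -1 if x < 5 else 1) for x in range(10)]
h_fin = bagging(T, M=11)
print(h_fin(1.0), h_fin(9.5))
```

An odd M avoids ties in a two-class vote; any learner that maps a list of (x, y) pairs to a callable can be passed in place of `train_stump`.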


3.2 AdaBoost

Boosting is a general method for improving the accuracy of any given learning algorithm. Boosting's roots are in a theoretical framework for studying machine learning called the PAC learning model [10]. In 1999, a boosting algorithm called AdaBoost, which stands for adaptive boosting, was introduced in [8].

Algorithm: AdaBoost. A boosting algorithm that creates an ensemble of classifiers. Each one gives a weighted vote.

Input:
• D, a set of d class-labeled training tuples;
• k, the number of rounds (one classifier is generated per round);
• a classification learning scheme.

Output: A composite model.

Method:
(1) initialize the weight of each tuple in D to 1/d;
(2) for i = 1 to k do // for each round:
(3)     sample D with replacement according to the tuple weights to obtain Di;
(4)     use training set Di to derive a model, Mi;
(5)     compute error(Mi), the error rate of Mi
(6)     if error(Mi) > 0.5 then
(7)         reinitialize the weights to 1/d
(8)         go back to step 3 and try again;
(9)     endif
(10)    for each tuple in Di that was correctly classified do
(11)        multiply the weight of the tuple by error(Mi)/(1-error(Mi)); // update weights
(12)    endfor
(13)    normalize the weight of each tuple;
(14) endfor

To use the composite model to classify tuple, X:
(1) initialize weight of each class to 0;
(2) for i = 1 to k do // for each classifier:
(3)     $w_i = \log \frac{1 - error(M_i)}{error(M_i)}$; // weight of the classifier's vote
(4)     c = Mi(X); // get class prediction for X from Mi
(5)     add wi to weight for class c
(6) endfor
(7) return the class with the largest weight;

Figure 2. The AdaBoost Algorithm

AdaBoost also uses perturbation and combination, but it resamples and combines adaptively, so that the weights in the resampling are increased for those cases most often misclassified. Combining is done by weighted voting.

For AdaBoost, a set of weights is maintained over the training set. One way to implement AdaBoost in practice is to resample the dataset based on the weights of the instances. This weight for each instance is adjusted in each round according to whether the instance is correctly classified or not. After a family of classifiers is built based on the adaptively-resampled replicates, they are combined using weighted voting.
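The resampling form of AdaBoost described above can be sketched as follows. Several details are assumptions made for the sketch rather than taken from the paper: the base learner `train_stump` is a hypothetical stand-in for SVM, the error of each model is computed as the weighted error over the full training set, a zero error is clamped to a small positive value so that the vote weight stays finite, and `max_tries` bounds the number of retries.

```python
import math
import random
from collections import defaultdict


def train_stump(T):
    """Hypothetical stand-in base learner: a threshold rule on one feature."""
    best = None
    for threshold, _ in T:
        for low, high in ((-1, 1), (1, -1)):
            errors = sum(1 for x, y in T if (low if x < threshold else high) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x < threshold else high


def adaboost(D, k, learner=train_stump, seed=0, max_tries=100):
    """Return a list of (model, vote_weight) pairs built by adaptive resampling."""
    rng = random.Random(seed)
    d = len(D)
    weights = [1.0 / d] * d
    ensemble, tries = [], 0
    while len(ensemble) < k and tries < max_tries:
        tries += 1
        Di = rng.choices(D, weights=weights, k=d)  # weighted sampling with replacement
        model = learner(Di)
        wrong = [model(x) != y for x, y in D]
        error = sum(w for w, bad in zip(weights, wrong) if bad)
        if error > 0.5:
            weights = [1.0 / d] * d  # reinitialize and try again
            continue
        error = max(error, 1e-10)  # assumption: keep log((1 - e) / e) finite
        factor = error / (1.0 - error)
        weights = [w if bad else w * factor for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((model, math.log((1.0 - error) / error)))
    return ensemble


def classify(ensemble, x):
    """Weighted vote: each model adds its vote weight to the class it predicts."""
    score = defaultdict(float)
    for model, vote in ensemble:
        score[model(x)] += vote
    return max(score, key=score.get)


# Toy one-dimensional data (assumed): class -1 below 5, class +1 from 5 upwards.
D = [(float(x), -1 if x < 5 else 1) for x in range(10)]
ensemble = adaboost(D, k=5)
print(classify(ensemble, 1.0), classify(ensemble, 9.5))
```

Down-weighting the correctly classified tuples and renormalizing has the same effect as up-weighting the misclassified ones, which is why later samples concentrate on the hard cases.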

4. PARALLEL BAGGING AND ADABOOST

4.1 Bagging

Based on the sequential algorithm mentioned in section 3, a parallel bagging algorithm (shown in Figure 3) is designed based on the replicated approach discussed above.

The original training dataset is divided into P subsets (assuming there are P processors), and each subset is allocated to one processor. This subset will be the local training dataset on each processor. Each processor creates a bootstrap replicate (random sampling with replacement) subset, called a sample, from the local training dataset. Then each processor executes the same (or similar) data mining algorithm on its local sample and generates a predictor from it. This procedure repeats R times until R x P predictors are obtained. R is the user-defined number of iterations. The value of R depends on properties of the dataset, the size of the samples and the number of processors used concurrently. The R x P predictors are then deployed on new data. These predictors vote (classification) or average (regression) to get the final answer for new data. All the predictors are given the same weight for voting.
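The procedure above can be sketched with Python's standard `concurrent.futures` module. The sketch makes several assumptions: worker threads stand in for the P processors (the experiments in this paper use WEKA-Parallel on Flosolver instead), `train_stump` is a hypothetical stand-in for the SVM base learner, each sample has the same size as the local partition, and the dataset is split in a round-robin fashion.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(T):
    """Hypothetical stand-in base learner: a threshold rule on one feature."""
    best = None
    for threshold, _ in T:
        for low, high in ((-1, 1), (1, -1)):
            errors = sum(1 for x, y in T if (low if x < threshold else high) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x < threshold else high


def local_bagging(partition, R, seed):
    """Work done on one processor: R bootstrap samples, one predictor per sample."""
    rng = random.Random(seed)
    n = len(partition)
    predictors = []
    for _ in range(R):
        sample = [partition[rng.randrange(n)] for _ in range(n)]
        predictors.append(train_stump(sample))
    return predictors


def parallel_bagging(dataset, P, R, seed=0):
    """Split the data into P local training sets and build R x P predictors."""
    partitions = [dataset[p::P] for p in range(P)]
    with ThreadPoolExecutor(max_workers=P) as pool:
        futures = [pool.submit(local_bagging, part, R, seed + p)
                   for p, part in enumerate(partitions)]
        predictors = [h for f in futures for h in f.result()]

    def vote(x):
        # All predictors carry the same weight.
        return Counter(h(x) for h in predictors).most_common(1)[0][0]

    return predictors, vote


# Toy one-dimensional data (assumed): class -1 below 10, class +1 from 10 upwards.
dataset = [(float(x), -1 if x < 10 else 1) for x in range(20)]
predictors, vote = parallel_bagging(dataset, P=2, R=5)
print(len(predictors), vote(2.0), vote(19.5))
```

Because no information is exchanged between rounds, letting each worker run all R of its iterations back to back produces the same R x P predictors as interleaving the rounds across processors.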

divide dataset into P subsets and load one partition to one processor as the local training dataset
for r = 1, ...., R
    for all processors from 1 to P
        create subset of bootstrap replicate (sample) of the local training dataset.

Figure 3. The parallel bagging algorithm

4.2 AdaBoost

AdaBoost, a popular boosting algorithm and the base for arc-fs, has received a lot of attention in recent research. This research focuses on AdaBoost as the typical ensemble technique with adaptive resampling characteristics. Based on the sequential AdaBoost algorithm discussed in Section 3, a parallel AdaBoost algorithm is designed using the replicated approach.

Suppose there are P processors. The full training dataset is divided into P subsets and each subset is allocated to one processor as a local training set on that processor. A distribution of this local training dataset is maintained on each processor. This distribution represents the selection probability of each instance from the local training dataset. The probability of each example is initialized to 1/n, where n is the number of examples in the local training dataset. Then a subset is selected on each processor, based on the probability of each instance, to form a sample. The same data mining algorithm is applied to each of the local samples and a predictor is generated for each sample. Then a total exchange takes place among all the processors: each processor sends its local predictor to all other processors, so each processor now holds P predictors.

In the next round, on each processor, a new sample is formed based on the updated distribution and a predictor is generated from this new sample, followed by a total exchange of the predictors generated in this stage and an update of the distribution. The same procedure repeats R times until finally R x P predictors are obtained. R is the user-defined number of iterations. The training part of the parallel AdaBoost algorithm ends with R x P predictors. Then all these predictors are deployed to test new data. To estimate the accuracy, all the predictors are voted on the test set with the weight log(1/β) to get the final output of the combined predictor. Different sampling methods are used for AdaBoost and bagging. In bagging, random sampling with replacement is used to form samples, while in AdaBoost, adaptive resampling is used, as each sample is formed based on the distribution, which is adaptively updated based on the performance of previous predictors. This gives AdaBoost a further improvement compared to bagging [3].

divide dataset into P subsets and load one subset to each processor
initialize the distribution of each local partition to be equal: $D_{1,p}(i) = 1/n$
for r = 1, ...., R
    for processor p from 1 to P
        1. form sample from the local training dataset using distribution $D_{r,p}$
        2. fit a model to this sample and get the predictor
        3. totally exchange the predictors generated locally in this round
        4. test the accuracy of local training dataset by plurality voting the predictors (if more than threshold predictors vote for the correct class label, the example is set to be correctly classified)
        5. for the combined predictor $h_{r,p}$ on each processor, set
           $$\varepsilon_{r,p} = \sum_{i:\,(x_i, y_i)\ \text{is misclassified}} D_{r,p}(i)$$
           if $\varepsilon_{r,p} > 1/2$, then abort.
        6. set $\beta_{r,p} = \varepsilon_{r,p} / (1 - \varepsilon_{r,p})$
        7. update distribution D(r,p) of local training dataset
           $$D_{r+1,p}(i) = \frac{D_{r,p}(i)}{Z_{r,p}} \times \begin{cases} \beta_{r,p} & \text{if correctly classified} \\ 1 & \text{otherwise} \end{cases}$$
           where Z(r,p) is a normalization factor (chosen so that $D_{r+1,p}$ will be a distribution).

Figure 4. The parallel AdaBoost algorithm
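The per-processor bookkeeping of one round (the threshold vote, the error, β, and the distribution update) and the weighted vote with weight log(1/β) used at deployment can be sketched as below. The function names and the numeric example are illustrative assumptions. The case of zero error is not covered by the description above; the sketch leaves the distribution unchanged in that case, which is also an assumption.

```python
import math
from collections import defaultdict


def correct_by_vote(votes, true_label, threshold):
    """An example counts as correct if more than `threshold` predictors vote for its label."""
    return sum(1 for v in votes if v == true_label) > threshold


def update_distribution(dist, correct):
    """Return (new distribution, beta) for one processor in one round."""
    epsilon = sum(d for d, ok in zip(dist, correct) if not ok)
    if epsilon > 0.5:
        raise ValueError("abort: error of the combined predictor exceeds 1/2")
    if epsilon == 0.0:
        return list(dist), 0.0  # assumption: nothing misclassified, nothing to reweight
    beta = epsilon / (1.0 - epsilon)
    scaled = [d * beta if ok else d for d, ok in zip(dist, correct)]
    z = sum(scaled)  # normalization factor
    return [s / z for s in scaled], beta


def weighted_vote(predictions, betas):
    """Combine class predictions, each weighted by log(1/beta); betas must be positive."""
    score = defaultdict(float)
    for label, beta in zip(predictions, betas):
        score[label] += math.log(1.0 / beta)
    return max(score, key=score.get)


# Hypothetical local training set of four examples, the last one misclassified.
dist = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
new_dist, beta = update_distribution(dist, correct)
print(new_dist, beta)

# Hypothetical deployment: one accurate predictor against two weaker ones.
final = weighted_vote(["attack", "normal", "normal"], [0.1, 0.5, 0.5])
print(final)
```

In the numeric example the error is 0.25, so β is 1/3 and, after normalization, the single misclassified example carries half of the probability mass; this is how the next sample comes to focus on hard examples.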

Given: R x P classifiers $h_{r,p}: X \to Y$ (r = 1, ...., R; p = 1, ...., P)
Vote these classifiers with weight $\log(1/\beta_{r,p})$ and output the combined classifier:
$$h_{fin}(x) = \arg\max_{y \in Y} \sum_{(r,p):\, h_{r,p}(x) = y} \log \frac{1}{\beta_{r,p}}$$

Figure 5. Deploying the parallel AdaBoost algorithm

5. PERFORMANCE EVALUATION MEASURES

5.1 Classification Accuracy

The primary metric for evaluating classifier performance is classification accuracy: the percentage of test samples that are correctly classified. The accuracy of a classifier refers to the ability of a given classifier to correctly predict the label of new or previously unseen data (i.e. tuples without class label information). Similarly, the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or previously unseen data.

5.2 Run Time

The run time of a program is the time elapsed between the beginning and the end of its execution. It is measured on a sequential computer for the sequential algorithms and on a parallel computer for the parallel algorithms.

6. EXPERIMENTAL RESULTS AND DISCUSSION

6.1 Experimental Environment

WEKA-Parallel can support both shared-memory and distributed-memory computing environments. In these experiments, parallel bagging and parallel AdaBoost are built and run on a 2-processor Flosolver machine. The application is developed and tested in a distributed-memory computing environment with the support of WEKA-Parallel.

6.2 Dataset Description

The data used in classification is NSL-KDD, which is a new dataset for the evaluation of research in network intrusion detection systems. NSL-KDD consists of selected records of the complete KDD'99 dataset [9]. The NSL-KDD dataset solves the issues of the KDD'99 benchmark [11]. Each NSL-KDD connection record contains 41 features (e.g., protocol type, service, and flag) and is labelled as either normal or an attack, with one specific attack type. The dataset is summarized in Table 1.

Table 1. Data Set Summary

Data Set          #Training   #Test   #Attributes
NSL-KDD dataset   25192       11850   42

6.3Experiments and Analysis

Parallel bagging is applied based on SVM for the NSL-KDD dataset on Flosolver. The test set is randomly selected from the whole dataset. The same test set is used for all the NSL-KDD experiments for both bagging and AdaBoost.

6.3.1 Sequential bagging versus Parallel bagging

In the previous sections, the important factors for parallel bagging are examined. In this section, parallel and sequential bagging results are compared and some observations are presented.

Table 2. Classification Accuracy of Sequential and Parallel Bagging

Trial   Data size   Sequential Bagging   Parallel Bagging
1       80%         94.72%               94.97%
2       85%         95.27%               95.32%
3       90%         94.34%               95.61%
4       96%         95.14%               95.56%


Figure 6. Classification Accuracy of Sequential and Parallel Bagging

Table 3. Time (in seconds) for Sequential and Parallel Bagging

Trial   Data size   Sequential Bagging   Parallel Bagging
1       80%         26.35                19.43
2       85%         17.28                11.03
3       90%         6.20                 4.73
4       96%         1.15                 0.93
5       98%         0.45                 0.38

Parallel bagging experiments show that the test set accuracy increases continuously with sample size. The running time increases faster than linearly with the sample size. For small sample sizes, an effective way of improving the test set accuracy is to run more rounds. After training on the same number of instances, larger bag sizes give greater accuracy than smaller bag sizes but cost much more to compute. Using as many processors concurrently as possible is a good choice for parallel bagging. Parallel bagging shows test set accuracy very similar to that of sequential bagging. Parallel bagging is effective and efficient. The experiments show that the achievable accuracy is limited by the chosen sample size. Parallel bagging shows the same performance (same accuracy and same total resource consumed) no matter how the computation is arranged.
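The timings in Table 3 can be summarized as speedup (sequential time divided by parallel time) and parallel efficiency (speedup divided by the number of processors). These two measures are standard definitions brought in here for illustration and are not used elsewhere in this paper; the times are copied from Table 3 and P = 2 is the processor count stated in Section 6.1.

```python
# Times in seconds, copied from Table 3 (trials 1 to 5).
sequential = [26.35, 17.28, 6.20, 1.15, 0.45]
parallel = [19.43, 11.03, 4.73, 0.93, 0.38]
P = 2  # number of processors of the Flosolver machine


def speedup(t_seq, t_par):
    """How many times faster the parallel run is than the sequential run."""
    return t_seq / t_par


def efficiency(t_seq, t_par, processors):
    """Speedup per processor; 1.0 would mean ideal linear scaling."""
    return speedup(t_seq, t_par) / processors


speedups = [speedup(ts, tp) for ts, tp in zip(sequential, parallel)]
efficiencies = [efficiency(ts, tp, P) for ts, tp in zip(sequential, parallel)]

for trial, (s, e) in enumerate(zip(speedups, efficiencies), start=1):
    print(f"trial {trial}: speedup = {s:.2f}, efficiency = {e:.2f}")
```

The same two functions apply unchanged to the AdaBoost timings in Table 5.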

6.3.2 Sequential AdaBoost versus Parallel AdaBoost

After examining the important factors for parallel AdaBoost, the parallel and sequential AdaBoost results are compared and some observations are presented.

Table 4. Classification Accuracy of Sequential and Parallel AdaBoost

Trial   Data size   Sequential AdaBoost   Parallel AdaBoost
1       80%         95.61%                96.28%
2       85%         95.77%                96.22%
3       90%         96.45%                96.54%
4       96%         96.62%                98.73%


Figure 8. Classification Accuracy of Sequential and Parallel AdaBoost

Table 5. Time (in seconds) for Sequential and Parallel AdaBoost

Trial   Data size   Sequential AdaBoost   Parallel AdaBoost
1       80%         40.72                 34.09
2       85%         20.63                 16.84
3       90%         8.35                  7.50
4       96%         1.51                  0.96
5       98%         0.49                  0.26

Parallel AdaBoost is both effective and efficient. A large sample size has an advantage in achievable accuracy over a small sample size after seeing the same number of instances, but takes a longer time.

6.3.3 Parallel AdaBoost versus Parallel bagging

From previous work [1] [3] [7], boosting can achieve higher accuracy than bagging for datasets in many domains. Boosting is more powerful because, by adopting adaptive resampling, more weight is placed on the hard examples which lie on the boundary between classes. Parallel AdaBoost can go even further, as it exchanges the predictors at the end of each stage; thus each processor gets much more information than in the sequential algorithm.

Parallel AdaBoost experiments for the NSL-KDD dataset show that focusing on the hard instances at a reasonable speed, by setting the threshold to a moderate value, provides satisfactory results. The best value for the threshold depends on the properties of the dataset; however, it is independent of sample size. Increasing the sample size, the number of iterations, or the number of processors running concurrently provides higher accuracy at the cost of consuming more resources.

Parallel AdaBoost consistently provides higher accuracy than sequential AdaBoost in much shorter times after seeing the same number of examples. Parallel AdaBoost is more powerful than parallel bagging. It is also cheaper than parallel bagging for SVM. Parallel AdaBoost is demonstrated to be both effective and efficient.

7. CONCLUSION

Parallel algorithms are developed for bagging and AdaBoost based on the replicated approach. This approach has two significant advantages: first, the dataset is partitioned, so the access cost is spread across processors; second, the information that must be exchanged between phases is often much smaller than the data, so communication is cheap.

[...] applications. Several essential parameters are examined for parallel ensemble techniques, such as sample size, number of iterations, number of processors and threshold.

The experiments demonstrate that ensemble techniques, including bagging and AdaBoost, can be effectively parallelized. They show that ensemble techniques with an implied sequential dependency, such as AdaBoost, can be parallelized by running a sequential algorithm on each processor. Parallel algorithms benefit from the total exchange of predictors at the end of each round and achieve higher accuracy than sequential algorithms.

The following conclusions are drawn from this paper:

• Both bagging and AdaBoost can be effectively parallelized.

• For both parallel bagging and parallel AdaBoost, the achievable accuracy is limited by the selected sample size. If accuracy is a major concern, a large sample size is preferred.

• For both parallel bagging and parallel AdaBoost, increasing the sample size or the number of iterations can increase accuracy but costs more resources.

• Setting the threshold to a moderate value, to focus on hard examples with reasonable speed, offers the most satisfactory results for parallel AdaBoost. The best value for the threshold depends on the dataset; however, it is independent of sample size.

• Parallel AdaBoost consistently exceeds parallel bagging in accuracy, which shows the power of adaptive resampling. It is worthwhile to increase the attention paid to hard examples.

• Parallel AdaBoost is cheaper than parallel bagging, at least for SVM. This surprising result comes from the frequent repetitions in samples generated in later rounds, for which SVM needs less time.

8. ACKNOWLEDGEMENTS


REFERENCES

[1] E. Bauer and R. Kohavi, “An empirical comparison of voting classification algorithms: Bagging, boosting, and variants”, Machine Learning, 36:103-142, 1999.

[2] L. Breiman, “Bagging predictors”, Machine Learning, 24(2):123-140, 1996.

[3] L. Breiman, “Arcing classifiers”, The Annals of Statistics, 26(3):801-849, 1998.

[4] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition”, Data Mining and Knowledge Discovery, 2(2):121-167, 1998.

[5] L. J. Cao and E. H. T. Francis, “Support Vector Machine With Adaptive Parameters in Financial Time Series Forecasting”, The National University of Singapore, Singapore 119260, IEEE Transactions on Neural Networks, 14(6), 2003.

[6] V. Cherkassky and F. Mulier, “Learning from Data - Concepts, Theory and Methods”, John Wiley & Sons, New York, 1998.

[7] Y. Freund and R. Schapire, “Experiments with a new boosting algorithm”, In Proceedings of the 13th International Conference on Machine Learning, pages 148-156. Morgan Kaufmann, 1996.

[8] Y. Freund and R. Schapire, “A short introduction to boosting”, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.

[9] Ira Cohen, Qi Tian, Xiang Sean Zhou and Thomas S. Huang, “Feature Selection Using Principal Feature Analysis”, In Proceedings of the 15th International Conference on Multimedia, Augsburg, Germany, September, pp. 25-29, 2007.

[10] M. Kearns and L. G. Valiant, “Cryptographic limitations on learning boolean formulae and finite automata”, Journal of the Association for Computing Machinery, 41(1):67-95, 1994.

[11] KDD'99 dataset, http://kdd.ics.uci.edu/databases, Irvine, CA, USA, 2010.

[12] L. Vanajakshi and L. R. Rilett (2004), “A Comparison of the Performance of Artificial Neural Network and Support Vector Machines for the Prediction of Traffic Speed”,