New Ensemble Combination Scheme

(1)

New Ensemble Combination Scheme

Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE

Abstract— Recently many statistical learning techniques are successfully developed and used in several areas. However, these algorithms sometimes are not robust and does not show good performances. The ensemble method can solve these problems.

It is known that the ensemble learning sometimes improves the generalized performance of machine learning tasks as well as makes it robust. However, the combining weights of the ensemble model are usually pre-determined or determined with the concept that the ensemble model is a superposition of individual ones. Thus we proposed a new ensemble combination scheme which consider the ensemble model is a factor affects the individual predictors. Through experiments, the proposed method shows better performance than other existing methods in the regression problems and shows competitive performance in the classification problems.

I. INTRODUCTION

Several learning models such as clustering, classification, regression, etc., are successfully developed in these days and they are used in many areas [1], [2], [3], [4]. However, sometimes these algorithms show ill performances or are not robust. Ensemble can be a solution to overcome these problems. Ensemble combines multiple models together in specific ways. The ensemble learning is known to improve the generalized performance of a single predictor and it has won some success. For example, the famous one million dol- lars Netflix Prize competition, the team named The Ensemble won the first place. They were known to ensemble several algorithms. In other data mining competitions, most winning teams employed an ensemble of learning models. Also, the ensemble method has been widely used in several practical area [5], [6], [7]

Along with these practical successes, many different ensemble algorithms have been proposed and developed [3], [8], [9]. However, some of them use the pre-determined weights. Most of the rest find weights with the concept that the ensemble model is a superposition of individual predictors. Unlike existing ensemble models, we propose a new ensemble combination scheme which consider individual predictors as realizations of the ensemble model. In other words, in the view of factor analysis, the ensemble model becomes a factor which affects individual models.

The organization of this paper is as follows. In the next section we summarize related literature and algorithms. In Section 3 we describe the proposed method. Next we present the results of an empirical analysis and comparison with other ensemble methods in Section 4. Then we conclude in Section 5.

Namhyoung Kim is with the Department of Statistics, Seoul National University, Seoul, Korea (email: [email protected]).

Youngdoo Son and Jaewook Lee is with the Department of Industrial Engineering, Seoul National University, Seoul, Korea (email: hand02, jae- [email protected]).

II. LITERATURE REVIEW

Most of the contemporary ensemble methods develop ensemble model based on the following equation.

f =

∑m i=1

wiCi (1)

where f is a real function predicted, Ci is an individual base model and wi is weight. It is illustrated in Figure 1, that is, the predictions of the individual models Ci, i = 1, 2, ..., m, are combined using the weight wi. The weights can vary depending on combination scheme. According to combination scheme, ensembles are divided into two main classes, combine by learning and combine by consensus.

Boosting combines multiple base predictors by learning. It utilizes the information of the performance of previous predictors so this kind of ensemble is also known as supervised scheme. In bagging, the ensemble is formed by majority voting. Bagging doesn’t consider the performance of each predictor. It averages the predictions of individual models in unsupervised scheme.

A. Bagging

Bagging is the simplest algorithm to construct an ensemble[10]. It creates individual predictors by training randomly selected training set. Each training set is generated by randomly with replacemeth, n exampels. In bagging, the weights are uniform, i.e., the ensemble prediction is given by

f = 1 m

∑m i=1

Ci. (2)

In the case of classification, majority voting is used to predict the class of data.

B. Boosting

Boosting algorithms learn iteratively weak classifiers using a training set selected according to the previous classified results then combine them with different weights to create an ensemble[12]. The weights are determined by the weak learners’ performance.

Our method uses both supervised and unsupervised scheme. In the following section, we describe our proposed method.

III. PROPOSED METHOD

In this paper, we propose a new ensemble combination scheme. We assume that the prediction of each individual predictor Ci reflects the real function value f by λi and a specific error term δi. In equation form, we would write the following.

(2)

Fig. 1. Ensemble of Classifiers

Ci= λif + δi, i = 1, 2, ..., m (3)

We can rewrite the above equation in matrix form as following.

C = f Λ^′+ ∆ (4)

where C =







C₁⁽¹⁾ C₂⁽¹⁾ · · · C1⁽¹⁾

C₁⁽²⁾ C₂⁽²⁾ · · · C^m⁽²⁾ ...

C₁⁽ⁿ⁾ C₂⁽ⁿ⁾ · · · Cm⁽ⁿ⁾





, f =





 f⁽¹⁾ f⁽²⁾ ... f⁽ⁿ⁾





, Λ = [λ₁λ₂· · · λm]^′, and ∆ =







δ₁⁽¹⁾ δ⁽¹⁾₂ · · · δ1⁽¹⁾

δ₁⁽²⁾ δ⁽²⁾₂ · · · δm⁽²⁾

...

δ₁⁽ⁿ⁾ δ⁽ⁿ⁾₂ · · · δm⁽ⁿ⁾





. n is the number of cases of data set. In addition, we have the following three assumptions about the proposed method.

Assumption 1: 1) Variance of f is 1.

(n− 1)f^′f = 1 (5)

2) Specific factors δ are mutually uncorrelated with diag- onal covariance matrix:

Θ = 1

n− 1∆^′∆ = diag(θ²₁, θ²₂, . . . , θ_m²). (6) 3) Common factor f and specific factors δ are uncorre-

lated.

approximation of the correlation matrix S is S = 1/(n− 1)C^′C

= 1/(n− 1)(fΛ^′+ ∆)^′(f Λ^′+ ∆)

= 1/(n− 1)(Λf^′f Λ^′+ ∆^′f Λ^′+ ∆f^′∆ + ∆^′∆) By assumptions, the correlation matrix becomes simplified as following

S = ΛΛ^′+ Θ (7)

ML solution : S(Λ^′Λ + Θ)⁻¹Λ = Λ

f_(i) = Λ^′(ΛΛ^′ + Θ)⁻¹C_(i) or f_(i) = (1 + Λ^′Θ⁻¹Λ)⁻¹Λ^′Θ⁻¹C(i).

Let f_(i)^∗ = Λ^′(ΛΛ^′+ Θ)⁻¹C(i)+ ϵ(i)and C(i)= Λf_(i)^∗ Then Λf_(i)^∗ = C_(i)= ΛΛ^′(ΛΛ^′+ Θ)⁻¹C_(i)+ Λϵ_(i) Λ^′Λϵ(i)= Λ^′[Ip− ΛΛ^′(ΛΛ^′+ Θ)⁻¹]C(i)

∴ ϵ^∗_(i)= Λ^′[ ¹

Λ^′ΛIp− (ΛΛ^′+ Θ)⁻¹]C_(i)) Bayesian Latent Ensemble

f_(i)^∗ = _1+Λ^Λ^′′^ΘΘ⁻¹⁻¹ΛC_(i)+ ϵ^∗_(i) f^∗=

∑m

k=1λ_kθ⁻²_k C_k 1+∑_m

l=1λ²_lθ⁻²_l + ϵ f^∗=∑m

k=1

λ_kθ⁻²_k (1+∑m

l=1λ²_lθ⁻²_l )C_k+ ϵ

In supervised scheme, we assume the given output of training sets as real function value. Then, the process becomes much simpler.

IV. EXPERIMENTALRESULTS

A. Data Sets

To evaluate the performance of the proposed method, we use both regression and classification data sets from UCI Machine Learning Repository[11]. Table I shows the summary of the data sets used. The data shown in Table I is widely used real-world problems in the machine learning

(3)

Fig. 2. Proposed method

TABLE I

SUMMARY OF THE DATA SETS

Data set cases class Features Neural Network

cont disc inputs ouputs hiddens Epochs

breast-cancer-w 699 2 9 - 9 1 5 20

Credit-a 690 2 6 9 47 1 10 35

ionosphere 351 2 34 9 34 1 10 40

Statlog german 1000 2 7 13 45 1 10 20

Statlog aust 690 2 6 8 40 1 10 35

ailerons 13750 regression 40 - 40 1 10 20

Housing boston 506 regression 13 - 13 1 5 35

Pole telecom 5000 regression 48 - 48 1 10 20

abalone 4177 regression 7 1 9 1 5 20

community. The data sets vary in terms of both the number of cases and features.

B. Experiments settings

In this study, neural networks are used as a base model.

Neural networks consist of a series of input, hidden and output layers. It is needed to set the size of hidden layer.

Some parameters are also required for the neural networks.

We trained the neural networks with a backpropagation function, a learning rate of 0.15, a momentum constant of 0.9 and randomly selected weights between -0.5 and 0.5.

The size of hidden layer was determined according to the number of input and output units having five hidden units as a minimum. The number of epochs was based on the number of cases. We used a larger value of epochs as the size of data increases. The training parameters used in the experiments are shown in Table I.

For comparison, other popular ensemble algorithms, bag-

ging and boosting, and single model are applied to the same data sets. We averaged test error of each method over five standard 10-fold cross validation experiments. The data set is divided into 10 equal-sized sets, then each set is used as the test set in turns then the other nine sets are used for training.

In all ensemble methods we used an ensemble size of 25 for each fold.

C. Results

Table II and Table III show the classification error for classification problems and root mean squared error for regression problems, respectively. The columns indicate the different methods, a single model(Single), Bagging(Bagging), Arc- ing(Arc), Ada-boosting(Ada) and the proposed method(FA).

Each row shows the results for each data set. From the tables we can notice that the proposed method usually shows good performances for several data sets.

In classification problems(Table II), the proposed method

(4)

does not shows the worst performance and its results are of- ten the best or close to the best. In regression problems(Table III), the proposed method shows the best performances except for the housing boston data set. In that data set, bagging algorithm performs the best but the proposed method showed competitive result with the bagging. Therefore, we can say that our proposed method performs better than other existing ensemble methods in general.

V. CONCLUSION

In this study, we presented novel methods for ensemble combination scheme with a new perspective on a relation- ship between real function and each classifier. We assumed predicted value of each classifier is composed of the real function value and a specific error term. According to this assumption, we solved weights for ensemble combination in both supervised and unsupervised scheme. The experimental results show the proposed methods are superior to the existing methods. They also show self-correction skills, so an ensemble can adjust weights noticing corrupt classifiers in learning process.

The proposed ensemble method seems to be robust with some corrupting predictors unlike other ensemble schemes whose performances may decrease with corrupting predictors. In other words, the proposed method has self-correction reliability. Thus, for future work, the analogous proof and experimental results for this intuition are necessary. We also need to prove and show the statistical stability of the proposed model using statistical tests like Friedman test, t- test, or sign-test.

REFERENCES

[1] D. Lee and J. Lee, “Domain descried support vector classifier for multi- classification problems,” Pattern Recognition, vol. 40, pp. 41–51, 2007.

[2] Y. Son, D-j. Noh, and J. Lee, “Forecasting trends of high-frequency KOSPI200 index data using learning classifiers,” Expert Systems with Applications, vol. 39, no. 14, pp. 11607–11615, 2012.

[3] N. Kim, K-H. Jung, Y. S. Kim, and J. Lee, “Uniformly subsampled ensemble (USE) for churn management: Theory and implementation,”

Expert Systems with Applications, vol. 39, no. 15, pp. 11839–11845, 2012.

[4] H. Park and J. Lee, “Forecasting nonnegative option price distributions using Bayesian kernel methods,” Expert Systems with Applications, vol.

39, no. 18, pp. 13243–13252, 2012.

[5] T. Gneiting and A. E. Raftery, “Weather forecasting with ensemble methods,” Science, vol. 310, no. 5746, pp. 248–249, 2005.

[6] D. Morrison, R. Wang, and L. C. De Silva, “Ensemble methods for spoken emotion recognition in call-centres,” Speech communication, vol. 49, no. 2, pp. 98–112, 2007.

[7] L. K. Hansen, C. Liisberg, and P. Salamon, “Ensemble methods for handwritten digit recognition,” Neural Networks for Signal Processing [1992] II., Proceedings of the 1992 IEEE-SP Workshop, IEEE, 1992.

[8] T. G. Dietterich, “Ensemble methods in machine learning,” Multiple classifier systems, pp 1–15, Springer Berlin Heidelberg, 2000.

[9] H. Drucker, et al., “Boosting and other ensemble methods,” Neural Computation, vol. 6, no. 6, pp. 1289–1301, 1994.

[10] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[11] A. Frank and A. Asuncion, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science, 2010.

[12] Y. Freund and R. Schapire, “Experiments with a new boosting algo- rithm,” In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148–156, 1996.

(5)

TABLE II

CLASSIFICATION RESULTS(CLASSIFICATION ERROR)

Data set

Neural Network

* Single Bagging Boosting FA

Arc Ada Unsupervised Supervised

Breast-cancer-w 3.59% 3.15% 3.35% 2.91% 3.50% 3.56%

credit-a 14.17% 13.94% 15.04% 14.52% 14.03% 13.88%

ionosphere 12.91% 11.71% 10.17% 10.86% 8.46% 8.51%

statlog aust 15.48% 13.83% 15.22% 14.75% 13.57% 14.26%

statlog german 26.66% 24.68% 27.48% 24.96% 23.78% 23.96%

hepatitis 36.80% 34.13% 36.40% 36.00% 33.87% 34.00%

sonar 17.40% 16.60% 14.40% 15.70% 16.60% 16.70%

diabetes 24.16% 23.21% 24.82% 23.16% 24.13% 24.16%

TABLE III

REGRESSION RESULTS(RMSE)

Data set

Neural Network

Single Bagging Boosting FA

Unsupervised Supervised

Abalone 2.0989 2.0733 7.6227 2.0702 2.0711

housing boston 3.8296 3.2316 17.3399 3.2743 3.2705

ailerons 0.0005 0.0003 0.0021 0.0002 0.0002

pole telecom 13.6437 10.5535 31.5003 10.3371 10.0061