A Hybrid Forecasting Methodology using Feature Selection and Support Vector Regression




José Guajardo, Jaime Miranda, and Richard Weber, Department of Industrial Engineering, University of Chile

Abstract

Various techniques have been proposed to forecast a given time series. Models from the ARIMA family have been used successfully, as have regression approaches based on, e.g., linear and non-linear regression, neural networks, and Support Vector Machines. What makes the difference in many real-world applications, however, is not the technique but an appropriate forecasting methodology.

Here we present such a methodology for the regression-based forecasting approach. A hybrid system is presented that iteratively selects the most relevant features and constructs the best regression model given certain criteria. We present a particular technique for feature selection as well as for model construction. The methodology, however, is generic and allows alternative approaches to be employed within our framework.

Keywords: Support Vector Regression; Time Series Prediction; Feature Selection.

1 Introduction

Forecasting problems can basically be solved using two different approaches: time series models, such as the ARIMA family of models or exponential smoothing, on the one hand, and regression approaches, such as linear or non-linear regression, neural networks, regression trees, or Support Vector Regression, on the other. Hybrid models combining both approaches have been applied, e.g., in [1].

We present a hybrid forecasting methodology combining feature selection with regression models, where we propose Support Vector Regression for model construction. Our methodology, however, is independent of the particular regression model, i.e. any other regression approach can be used within the proposed framework.

Section 2 provides a literature review related to feature selection and support vector regression. In Section 3 we develop the proposed hybrid forecasting methodology. Section 4 presents the results obtained with our methodology as well as with alternative approaches. Section 5 concludes this work and points to future developments.

2 Review of the related literature

The aim of this section is to review the major approaches to feature selection and to describe the well-known support vector regression algorithm. Both are needed to understand the methodology we propose.

2.1 Feature Selection

Given the complexity of certain data mining applications, such as regression, it is desirable to select the most important features to construct the respective model. This problem motivates efforts in developing feature selection methods. Some of the benefits that can be obtained by performing feature selection are [13][21][29]: improved accuracy of the forecasts delivered by the model, shorter computation times for model construction, easier data visualization and model understanding, and a lower probability of overfitting.

Several approaches have been developed to carry out feature selection. The existing methods can be grouped into three major categories [13]: filters, wrappers, and embedded methods.

Filter methods perform feature selection as a preprocessing step, independently of the learning algorithm used for model construction. An example of such a mechanism is variable ranking, using e.g. correlation coefficients between each feature and the dependent variable. Another filter approach selects features based on a linear model (this corresponds to the filter preprocessing step) and then constructs a non-linear model using the selected features (see e.g. [2]). As can be seen, filter mechanisms do not depend on the regression algorithm applied, and consequently the selected features are not related to it. This is one of their main weaknesses, but they are commonly applied because of their simplicity. Also, filter methods in general do not take into account the relationship between variables.
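The variable-ranking filter mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy data are hypothetical, and ranking by the absolute Pearson correlation is one common choice among several.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def rank_features(features, target):
    """Return feature names sorted by decreasing |correlation| with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: three candidate features and a target series.
features = {
    "lag_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, -1.0, 0.5, 3.0, -2.0],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
}
target = [2.1, 3.9, 6.2, 8.1, 9.8]
ranking = rank_features(features, target)
```

Each feature is scored on its own, which is exactly why such a filter cannot capture interactions between variables.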


On the other hand, wrapper methods are characterized as a subset selection approach. The main idea of wrapper methods is to assess subsets of variables according to their usefulness to a given learning algorithm [13]. Wrapper methods became popular through works such as [14][15]. Here, the learning algorithm is treated as a black box, and the best subset of features is determined according to the performance of the particular algorithm applied to build a regression model (e.g. linear or non-linear regression, neural networks, support vector machines, among others). Wrapper methods need a criterion to compare the performance of different feature subsets (e.g., minimization of the training error), as well as a search strategy to guide the process. Forward selection and backward elimination are two of the commonly used search strategies. A forward selection strategy starts with an empty set of features and in each iteration adds features according to their joint predictive power. A backward elimination procedure, on the other hand, starts with all available features in the set of selected features. In each iteration, the feature whose removal produces the smallest decrease in predictive performance is eliminated. Both strategies need a stopping criterion in order to determine the "best" feature subset.
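As an illustration of the backward elimination strategy just described, the sketch below treats the learning algorithm as a black box behind a `score` callable. The function name, the tolerance-based stopping rule, and the toy scoring function are all assumptions made for this example; the text does not prescribe a particular stopping criterion.

```python
def backward_elimination(all_features, score, min_features=1, tolerance=0.0):
    """Greedy backward elimination.

    score(subset) -> float, higher is better (e.g. negative validation error).
    Stops when every possible removal lowers the score by more than `tolerance`
    or when only `min_features` remain. Both stopping rules are assumptions.
    """
    selected = list(all_features)
    current = score(selected)
    while len(selected) > min_features:
        # Score every subset obtained by dropping exactly one feature.
        candidates = [
            (score([f for f in selected if f != drop]), drop) for drop in selected
        ]
        best_score, drop = max(candidates, key=lambda c: c[0])
        if current - best_score > tolerance:
            break  # even the least harmful removal hurts too much
        selected.remove(drop)
        current = best_score
    return selected

# Hypothetical scoring function: features "a" and "b" are useful,
# and every feature in the subset costs a small complexity penalty.
def toy_score(subset):
    useful = {"a": 1.0, "b": 0.5}
    return sum(useful.get(f, 0.0) for f in subset) - 0.1 * len(subset)

kept = backward_elimination(["a", "b", "c", "d"], toy_score)
```

In a real wrapper, `score` would train the regression model on the given subset and evaluate it on held-out data, which is where the computational cost of wrappers comes from.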

Wrappers are often criticized because they seem to be a "brute force" method requiring exhaustive computation [13], but at the same time they have the advantage of taking into account the performance of the predictive model to be used, as well as evaluating how a given subset of features performs as a whole.

Finally, embedded methods incorporate feature selection as part of the training process, i.e., feature selection is done when building the predictive model. Such mechanisms usually involve changes in the objective function of the applied learning algorithm and therefore are commonly associated with a specific predictor. Examples of embedded methods are decision trees [3][20] and the mechanisms developed in [17][21][30].

The methodology proposed in this paper uses wrapper methods and applies forward selection to guide the search strategy.

2.2 Support Vector Regression

Here we describe the standard Support Vector Regression (SVR) algorithm, which uses the so-called ε-insensitive loss function proposed by Vapnik [26]. This function tolerates errors not greater than ε. The description is based on the terminology used in [23][19].
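For reference, the ε-insensitive loss is usually written as follows. The notation follows the tutorial literature such as [23], with $f$ denoting the regression function; it is given here as a reminder rather than as a formula taken from this paper.

$$L_\varepsilon(y, f(x)) = \max\{0,\ |y - f(x)| - \varepsilon\}$$

Errors smaller than ε therefore cost nothing, and larger errors are penalized linearly in the amount by which they exceed ε.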

Let $(\vec{x}_1, y_1), \ldots, (\vec{x}_\ell, y_\ell)$, where $\vec{x}_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$ $\forall i$, be the training data points available to build a regression model. The SVR algorithm applies a transformation function $\Phi$ to the original data points¹ from the initial Input Space to a generally higher-dimensional Feature Space ($F$). In this new space, we construct a linear model, which corresponds to a non-linear model in the original space:

$$\Phi : \mathbb{R}^n \longrightarrow F, \qquad \vec{w} \in F$$

$$f(x) = \langle \vec{w}, \Phi(x) \rangle + b \tag{1}$$

The goal when using the ε-insensitive loss function is to find a function that fits the current training data with a deviation less than or equal to $\varepsilon$, and at the same time is as flat as possible. This means that one seeks a small weight vector $\vec{w}$, e.g. by minimizing the norm $\|\vec{w}\|^2$ [23]. The following optimization problem is stated for this purpose:

$$\min_{\vec{w},\, b} \ \frac{1}{2}\|\vec{w}\|^2 \tag{2}$$

$$\text{s.t.} \quad y_i - \langle \vec{w}, \Phi(x_i) \rangle - b \le \varepsilon \tag{3}$$

$$\langle \vec{w}, \Phi(x_i) \rangle + b - y_i \le \varepsilon \tag{4}$$

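This problem could be infeasible. Therefore, slack variables $\xi_i, \xi_i^*$ are introduced to allow error levels greater than $\varepsilon$, arriving at the formulation proposed in [26]:

$$\min_{\vec{w},\, b} \ \frac{1}{2}\|\vec{w}\|^2 + C \cdot \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) \tag{5}$$

$$\text{s.t.} \quad y_i - \langle \vec{w}, \Phi(x_i) \rangle - b \le \varepsilon + \xi_i \tag{6}$$

$$\langle \vec{w}, \Phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i^* \tag{7}$$

$$\xi_i, \xi_i^* \ge 0, \qquad i = 1, \ldots, \ell \tag{8}$$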

This is known as the primal problem of the SVR algorithm. The objective function takes into account generalization ability and accuracy on the training set, and embodies the structural risk minimization principle [27]. Parameter $C$ measures the trade-off between generalization ability and accuracy on the training data, and parameter $\varepsilon$ defines the degree of tolerance to errors.

To solve the problem stated in equation (5), it is more convenient to represent it in its dual form. For this purpose, a Lagrange function is constructed, and after applying the saddle point conditions, the following dual problem is obtained:

$$\max \ -\frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle \Phi(x_i), \Phi(x_j) \rangle - \varepsilon \sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{\ell} y_i (\alpha_i - \alpha_i^*)$$

$$\text{s.t.} \quad \alpha_i, \alpha_i^* \in [0, C], \qquad \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0 \tag{9}$$

¹ When the identity function is used, i.e. $\Phi(x) = x$, no transformation is carried out and linear SVR models are obtained.

The solution of this quadratic optimization problem is a function of the dual variables $\alpha_i$ and $\alpha_i^*$. To return to the original problem, it can be shown that the following relationships hold [27]:

$$\vec{w} = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) \Phi(x_i) \tag{10}$$

$$f(x) = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) K(x_i, x) + b \tag{11}$$

Here, the expression $K(x_i, x)$ is equal to $\langle \Phi(x_i), \Phi(x) \rangle$ and is known as the kernel function [27]. The existence of such a function allows us to obtain a solution for the original regression problem without explicitly considering the transformation $\Phi(x)$ applied to the data.
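The kernel expansion of the regression function can be evaluated directly once the dual variables are known. The sketch below assumes a Gaussian RBF kernel parameterized as exp(-||u - v||² / (2σ²)), which is one common convention; the support vectors, coefficients, and offset are made-up values for illustration, not the output of an actual solver.

```python
import math

def rbf_kernel(u, v, sigma):
    """Gaussian RBF kernel K(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svr_predict(x, support_vectors, dual_coefs, b, sigma):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b.

    dual_coefs[i] holds the difference alpha_i - alpha_i^* for support vector i.
    """
    return b + sum(
        coef * rbf_kernel(sv, x, sigma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )

# Hypothetical dual solution, chosen by hand for illustration only.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.5, -0.5]  # sums to zero, as the dual equality constraint requires
prediction = svr_predict([0.0, 0.0], support_vectors, dual_coefs, b=2.0, sigma=1.0)
```

Note that the transformation Φ never appears in the code: only kernel evaluations between data points are needed.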

3 Development of the Proposed Forecasting Methodology

To build forecasting models, one has to deal with different issues such as feature selection, model construction, and model updating. In this section we present a framework to build SVR-based forecasting models that takes all these tasks into account. First, a general description of this framework is provided; then we present details about SVR model construction, feature selection, and model updating.

3.1 General Framework of the Proposed Methodology

The following figure provides a general view of our framework for SVR-based forecasting.

The first step consists of dividing the data into training, validation and test subsets. Training data will be used to construct the model, validation data is used for model and feature selection, and test data is a completely independent subset, which provides an estimation of the error level that the model would have in reality. As we perform model updating, these subsets will be dynamically redefined over time; see e.g. [12].
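A minimal sketch of such a three-way division is shown below. The text does not spell out how the subsets are ordered in time or how large they are, so the chronological ordering and the 70/15/15 fractions used here are assumptions for illustration only.

```python
def chronological_split(series, train_frac=0.7, valid_frac=0.15):
    """Split a time-ordered sequence into train / validation / test blocks.

    The fractions are illustrative assumptions; order is preserved so that
    the test block always lies after the data used for fitting and selection.
    """
    n = len(series)
    n_train = int(round(n * train_frac))
    n_valid = int(round(n * valid_frac))
    train = series[:n_train]
    valid = series[n_train:n_train + n_valid]
    test = series[n_train + n_valid:]
    return train, valid, test

weeks = list(range(1, 21))  # hypothetical 20 weekly observations
train, valid, test = chronological_split(weeks)
```

Keeping the test block strictly separate from model and feature selection is what makes its error an honest estimate of future performance.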

Figure 1. SVM-UP Forecasting Methodology

The proposed methodology can be summarized as follows: first, calculate base parameters ($\varepsilon$, $C$, and the kernel parameter) that work well under some general conditions. Next, using the base parameters $\varepsilon_0$, $C_0$, $K_0$, perform feature selection using a wrapper method with a forward selection strategy to obtain the best set of predictor variables $X$. Finally, using predictors $X$, return to the problem of model construction and perform Grid Search around the base parameters to get the optimal parameters $\varepsilon$, $C$, $K$. Thus, at the end, the predictive model is determined by parameters $\varepsilon$, $C$, kernel function $K$, and predictors $X$.

The final element in this framework is model updating.

3.2 Model Construction using Support Vector Machines within the Proposed Methodology

A critical issue in SVR is the selection of the model parameters $\varepsilon_0$ and $C_0$, as well as the kernel function used to map the data to a higher-dimensional space.

As discussed in section (3.1), our approach to deal with model construction calculates initial values for SVR param- eters, and then performs grid search around them. This structure is shown in figure (2):

To calculate initial values for parameters $\varepsilon_0$ and $C_0$, we use the empirical rules proposed by Cherkassky and Ma [6] (figure (2)). Experiments on various time series performed by the authors of this paper show that these rules are more effective than others, e.g. those proposed in [16].
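The empirical rules of [6] are not reproduced in this paper. As they are commonly stated, C is set from the mean and standard deviation of the training targets and ε from an estimate of the noise level and the sample size. The sketch below follows that common statement; the function name is hypothetical, the noise standard deviation is taken as a given input, and the details should be checked against [6] before use.

```python
import math
import statistics

def cherkassky_ma_base_parameters(y_train, noise_std):
    """Base values (C0, eps0) following the rules commonly attributed to [6].

    C0   = max(|mean(y) + 3*std(y)|, |mean(y) - 3*std(y)|)
    eps0 = 3 * noise_std * sqrt(ln(n) / n)
    noise_std must be estimated separately; here it is simply an input.
    """
    n = len(y_train)
    y_mean = statistics.mean(y_train)
    y_std = statistics.pstdev(y_train)  # population std; sample std is also plausible
    c0 = max(abs(y_mean + 3 * y_std), abs(y_mean - 3 * y_std))
    eps0 = 3 * noise_std * math.sqrt(math.log(n) / n)
    return c0, eps0

y_train = [10.0, 12.0, 14.0, 16.0, 18.0]  # hypothetical weekly sales
c0, eps0 = cherkassky_ma_base_parameters(y_train, noise_std=1.0)
```

Both values depend only on the training targets and the noise estimate, so they can be computed before any model is fitted.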

Regarding the kernel function, we apply the RBF kernel transformation to the original data, since this function has performed well under some general conditions [23] and is the most commonly used in practice [5][6][9][16][18][28]. To set an appropriate value for the parameter of this kernel function (σ), we carry out some exploratory experiments trying different settings, and choose the best one in terms of the results obtained in this phase.

Figure 2. SVM model selection

Once the initial values for the three parameters have been defined, we select the best set of features (see next subsection). Using the selected features, the well-known Grid Search mechanism [4][9][18][24] is performed around the base parameters $\varepsilon_0$, $C_0$, $K_0$. The best point of the grid defines the final parameters of the SVR model.
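A grid search around base parameters can be sketched as below. The multiplicative grid factors and the `validation_error` callable are assumptions for this example; in practice the callable would train an SVR with the given parameters and return its error on the validation set.

```python
import itertools

def grid_around(base, factors=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Multiplicative grid centred on a base value (the factors are assumptions)."""
    return [base * f for f in factors]

def grid_search(eps0, c0, sigma0, validation_error):
    """Return the (eps, C, sigma) grid point with the lowest validation error.

    validation_error(eps, C, sigma) -> float is a user-supplied callable that
    would train an SVR on the training set and evaluate it on the validation set.
    """
    grid = itertools.product(grid_around(eps0), grid_around(c0), grid_around(sigma0))
    return min(grid, key=lambda point: validation_error(*point))

# Stand-in error surface with its minimum at eps=0.2, C=20, sigma=0.5.
def toy_error(eps, c, sigma):
    return (eps - 0.2) ** 2 + (c - 20.0) ** 2 + (sigma - 0.5) ** 2

best_eps, best_c, best_sigma = grid_search(0.1, 10.0, 1.0, toy_error)
```

With five values per parameter the grid has 125 points, each requiring one model fit, which is why a good starting point keeps the search affordable.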

3.3 Feature Selection within the Proposed Methodology

As discussed in section (2.1), there are several feature selection approaches in the context of time series prediction. We propose a wrapper method inspired by ideas presented in [15], which selects features guided by a forward selection strategy. Figure (3) displays this strategy:

Figure 3. Feature selection heuristic

Given the set of initial predictors $(x_1, \ldots, x_n)$, we define a maximum number of predictors ($m$) to be included in the predictive model, taking into consideration the nature of the problem and the number of training data points. Also, the number of tuples² to be kept in each iteration ($k$) has to be defined. The greater the value of $k$, the greater the size of the search space induced by the strategy. Once both values are defined, the strategy starts by building models with each individual variable as the single predictor. In the second iteration, we combine the $k$ best individual predictors (1-tuples) with the remaining variables and keep the $k$ best 2-tuples of predictors for the next iteration, and so on until we have the $k$ best $m$-tuples of predictors. Finally, we choose the best tuple of variables among the $m \cdot k$ best $i$-tuples computed in the iterative process ($i = 1, \ldots, m$) to be the predictors $X$ that will be used in the final SVR model.
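The heuristic described above amounts to a forward selection that carries the k best tuples from one size to the next. The sketch below is one possible reading of it: the function name, the tie-breaking rule, and the toy error function are assumptions, and `score` stands in for training an SVR with base parameters and measuring its validation error.

```python
def k_best_forward_selection(features, score, m, k):
    """Forward selection that keeps the k best i-tuples at each size i = 1..m.

    score(subset) -> float, lower is better (e.g. validation error of a model
    trained on that subset). Returns the best tuple over all sizes.
    """
    beam = [()]  # start from the empty tuple
    kept = []    # the (up to) m * k tuples retained across iterations
    for _ in range(m):
        # Extend every retained tuple by one not-yet-used feature.
        candidates = {
            tuple(sorted(t + (f,))) for t in beam for f in features if f not in t
        }
        if not candidates:
            break
        # Ties are broken alphabetically so the result is deterministic.
        ranked = sorted(candidates, key=lambda t: (score(t), t))
        beam = ranked[:k]
        kept.extend(beam)
    return min(kept, key=lambda t: (score(t), t))

# Hypothetical error function: "x1" and "x3" together predict best,
# and each additional feature adds a small penalty.
def toy_error(subset):
    error = 1.0
    if "x1" in subset:
        error -= 0.4
    if "x3" in subset:
        error -= 0.3
    return error + 0.05 * len(subset)

best = k_best_forward_selection(["x1", "x2", "x3", "x4"], toy_error, m=3, k=2)
```

With k = 1 this reduces to plain greedy forward selection; larger k explores more combinations at a proportionally higher cost.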

3.4 Model Updating within the Proposed Methodology

The task of developing forecasting models for a time series problem can be affected by changes over time in the phenomena under study. A static model that works well in one period may therefore provide poor predictions in a later period. To deal with this issue, we designed a model updating scheme [11] whose basic idea is to define a two-part training set: the first part contains historical data, and the second part contains the most recent data. When a predefined number of new samples arrives, these observations are incorporated into the training set, so that patterns in new data are taken into account in model construction. Finally, the proportion of data belonging to the training and validation sets is kept stable over time by shifting data points from the historical part of the training set to the validation set whenever we perform model updating.
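One way to read this updating step is sketched below. The text does not say which historical points are shifted or where the validation set sits in time, so this example assumes the validation block lies between the historical and recent training parts and that the newest historical points are moved. The function name and week indices are hypothetical.

```python
def update_sets(train_hist, valid, train_recent, new_batch):
    """One model-updating step under the assumptions stated above.

    New observations join the recent part of the training set; enough points
    are then moved from the historical part to the validation set to restore
    the previous training share. Moving the newest historical points is an
    assumption, not something the text specifies.
    """
    n_train = len(train_hist) + len(train_recent)
    train_share = n_train / (n_train + len(valid))
    train_recent = train_recent + new_batch
    n_train = len(train_hist) + len(train_recent)
    total = n_train + len(valid)
    # Number of points to shift so the training share returns to its old value.
    n_shift = round(n_train - train_share * total)
    n_shift = max(0, min(n_shift, len(train_hist)))
    cut = len(train_hist) - n_shift
    valid = train_hist[cut:] + valid
    train_hist = train_hist[:cut]
    return train_hist, valid, train_recent

# Hypothetical week indices: 12 historical, 4 validation, 4 recent weeks,
# followed by a batch of 5 newly observed weeks.
hist, valid, recent = update_sets(
    list(range(1, 13)), list(range(13, 17)), list(range(17, 21)), [21, 22, 23, 24, 25]
)
```

In this example the training share is 16/20 before the update and 20/25 after it, so the proportion stays at 80%.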

4 Experiments and Results

We applied the proposed methodology to a real-world sales prediction problem, in which a company wants to predict the number of units that will be sold the following week for 5 different products.

Consequently, we analyzed 5 time series, each with weekly sales data for the period from January 2001 to September 2004. Figure (4) shows one of these 5 series, which presents a strong monthly seasonality: sales increase during each month and reach their peak in the last week of each month.

² In this context, an i-tuple means a combination of i different predictors used together in a predictive model.


Figure 4. Time series with weekly sales data

Next, the proposed methodology is applied and the results obtained are reported. Finally, we compare our results with the well-known ARMAX approach and with neural networks.

We applied the proposed methodology to the 5 time series using a set of 23 initial features (independent variables). As a result, we obtained for each series a different set of parameters describing the respective model. Following the wrapper strategy described above, we also determined a set of selected features for each series. This way we defined the overall relevance of each of the original features by simply counting how often it was selected by the applied methodology.

The following listing shows the features ordered according to their average relevance, from the most important to the least important, over all 5 series.

1. Normalized number of week within a month (taking values between 0 and 1)
2. Sales one week before
3. Sales two weeks before
4. Binary variable indicating if the month under consideration has 4 or 5 weeks
5. Sales eight weeks before
6. Categorical variable indicating if the week under consideration contains holidays of certain categories
7. Sales 13 weeks before
8. Sales 14 weeks before
9. Ordinal variable indicating the year under consideration (2001, ..., 2004)
10. Sales seven weeks before
11. Sales twelve weeks before
12. Number of the week under consideration (taking values between 1 and 195, which is the number of weeks within the analyzed period)
13. Number of month within a quarter (taking values 1, 2, 3)
14. Sales three weeks before
15. Sales six weeks before
16. Sales ten weeks before
17. Sales four weeks before
18. Sales five weeks before
19. Number of week within a year (taking values 1, ..., 52)
20. Number of month within a year (taking values 1, ..., 12)
21. Sales nine weeks before
22. Sales eleven weeks before is ranked last; before it comes: Number of quarter within a year (taking values 1, ..., 4)
23. Sales eleven weeks before
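Most of the features listed above are lagged sales values. A minimal sketch of how such lag features could be built from a weekly series is given below; the function name and the toy series are hypothetical, and the calendar variables from the list are left out for brevity.

```python
def build_lag_features(sales, lags):
    """Build (features, target) rows from a weekly sales series.

    For each week t with enough history, the row holds sales[t - lag] for every
    lag in `lags`; the target is sales[t]. Calendar variables are omitted here.
    """
    max_lag = max(lags)
    rows, targets = [], []
    for t in range(max_lag, len(sales)):
        rows.append([sales[t - lag] for lag in lags])
        targets.append(sales[t])
    return rows, targets

sales = [5, 7, 9, 20, 6, 8, 10, 22]  # hypothetical weekly units sold
rows, targets = build_lag_features(sales, lags=[1, 2, 4])
```

The first `max(lags)` weeks cannot be used as targets because their lagged values are not available, so longer lags shorten the usable training set.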

We applied the final model provided by the proposed methodology SVM-UP to the test set from April 2004 to September 2004.

Alternatively, we developed forecasting models for each series using the well-known ARMAX approach as well as a multilayer perceptron (MLP) within the proposed methodology, i.e. an MLP-type neural network as the regression technique instead of SVR. Since a wrapper strategy is employed, the set of selected features depends on the regression technique used. In the application described here, different feature sets were identified for MLP and SVR.

Applying the three models to the same evaluation set of all 5 time series, we obtained the results provided in the following table, which shows the average test set error levels using MAPE (mean absolute percentage error) as the error function.
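For concreteness, MAPE (read here as mean absolute percentage error) can be computed as in the sketch below. The function name and the sample values are illustrative only and are not taken from the experiments.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

actual = [100.0, 200.0, 400.0]    # hypothetical weekly sales
forecast = [110.0, 180.0, 400.0]  # hypothetical predictions
error = mape(actual, forecast)
```

Because each error is divided by the actual value, weeks with very low sales can dominate the measure, which is worth keeping in mind for strongly seasonal series.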

Compared with ARMAX, the proposed methodology using SVR (SVM-UP) gives slightly better results in 3 out of 5 cases and also on average over all 5 series.

Using a neural network within the proposed methodology (NN-UP) outperforms ARMAX in 2 out of 5 cases in Table 1 (P3 and P5) but predicts worse on average.


Test Set Error (%)

Product              ARMAX     NN-UP     SVM-UP
P1                   10.98     14.31     11.86
P2                   13.64     14.66     11.44
P3                    6.87      6.66      6.51
P4                   10.53     10.56     11.01
P5                   10.90      9.85      9.98
Average (5 Prod.)    10.58     11.21     10.16

Table 1. Average Test Set Error Levels.
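The averages in the last row of Table 1 can be re-derived from the per-product values. The snippet below only repeats that arithmetic with the numbers from the table; the variable names are hypothetical.

```python
# Per-product test set errors (%) for P1..P5, copied from Table 1.
table_1 = {
    "ARMAX": [10.98, 13.64, 6.87, 10.53, 10.90],
    "NN-UP": [14.31, 14.66, 6.66, 10.56, 9.85],
    "SVM-UP": [11.86, 11.44, 6.51, 11.01, 9.98],
}
# Column averages rounded to two decimals, as reported in the table.
averages = {name: round(sum(v) / len(v), 2) for name, v in table_1.items()}
```

The computed averages agree with the reported row, so the table is internally consistent.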

The advantage of the proposed methodology, however, is not limited to forecasting performance: it also provides the capability of selecting the most relevant features and updating the respective model.

5 Conclusions and Future Work

We presented the hybrid methodology SVM-UP for time series forecasting, which combines feature selection, performed by a wrapper method with a forward selection strategy, with regression model construction using Support Vector Machines. Model selection is based on calculating initial values for the SVR parameters and then performing grid search around them. The final component of our methodology is model updating: we define a training set formed by past and most recent information. When a predefined number of new observations arrives, this most recent information is incorporated into the training set, and the proportion of data belonging to the training and validation sets is kept stable over time by shifting data points from the historical part of the training set to the validation set.

We have applied the proposed methodology to a sales forecasting problem and compared its performance to a standard ARMAX approach and to neural networks. The comparison shows that our methodology performs slightly better on average and additionally provides a selection of the most important features. This last point improves the understanding of the phenomenon under study and could help to understand the variables involved in the final regression model and their influence on the dependent variable, e.g. the number of units sold.

Major advantages of the proposed methodology are expected when dealing with dynamic phenomena, where the performance of a forecasting model could be significantly improved by performing model updating. First results show that our methodology provides very good results for seasonal time series (see [11][12]).

Future work has to be done on predicting non-seasonal time series and on selecting the most appropriate parameters of the kernel function based on theoretical approaches.

Acknowledgment

This work has been supported partially by the Millennium Nucleus "Complex Engineering Systems" (www.sistemasdeingenieria.cl) and the Fondecyt project 1040926.

The first author would like to thank Penta Analytics S.A., a Chilean consulting company where he carried out part of this work during 2004.

References

[1] L. Aburto and R. Weber. Improved Supply Chain Management based on Hybrid Demand Forecasts. Applied Soft Computing, in press, 2006.

[2] J. Bi, K.P. Bennett, M. Embrechts, C. Breneman and M. Song. Dimensionality Reduction via Sparse Support Vector Machines. Journal of Machine Learning Research, 3:1229-1243 (special issue on feature/variable selection), 2003.

[3] L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone. Classification and Regression Trees. Wadsworth Brooks, Monterey, 1984.

[4] C. Chang and C. Lin. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[5] B.J. Chen, M.W. Chang and C.J. Lin. Load Forecasting using support vector machines: a study on EUNITE competition. Report for EUNITE Competition for Smart Adaptive System, 2001.

[6] V. Cherkassky and Y. Ma. Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks 17(1):113-126, 2004.

[7] W. Chun-Hsin, W. Chia-Chena, S. Da-Chun, C. Ming-Hua and H. Jan-Ming. Travel Time Prediction with Support Vector Regression. In Proc. of IEEE Intelligent Transportation Systems Conference, 2003.

[8] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.

[9] A. Demiriz, K. Bennett, C. Breneman and M. Embrechts. Support vector machine regression in chemometrics. Computing Science and Statistics, 2001.

[10] A. Ding, X. Zhao and L. Jiao. Traffic flow time series prediction based on statistic learning theory. In Proc. of IEEE 5th International Conference on Intelligent Transportation Systems, pp. 727-730, 2002.


[11] J. Guajardo and R. Weber. How to update forecasting models in time series prediction problems? In Proc. of International Symposium on Forecasting (ISF 2005), San Antonio, Texas, June 2005.

[12] J. Guajardo, J. Miranda and R. Weber. A model updating strategy for predicting time series with seasonal patterns. Submitted to the Journal of Forecasting, 2005.

[13] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.

[14] R. Kohavi and D. Sommerfield. Feature subset selection using the wrapper method: Overfitting and dynamic search space topology. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, First International Conference on Knowledge Discovery and Data Mining (KDD-95), 1995.

[15] R. Kohavi and G.H. John. Wrappers for feature subset selection. Artificial Intelligence 97(1-2):273-324, 1997.

[16] D. Mattera and S. Haykin. Support Vector Machines for Dynamic Reconstruction of a Chaotic System. In: B. Schölkopf, J. Burges, A. Smola, ed., Advances in Kernel Methods: Support Vector Machine, MIT Press, 1999.

[17] J. Miranda, R. Montoya and R. Weber. Linear Penalization Support Vector Machines for Feature Selection. Lecture Notes in Computer Science, in press, 2005.

[18] M. Momma and K.P. Bennett. A pattern search method for model selection of support vector regression. In Proc. of SIAM Conference on Data Mining, 2002.

[19] K. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen and V. Vapnik. Using Support Vector Machines for Time Series Prediction. In: B. Schölkopf, J. Burges, A. Smola, ed., Advances in Kernel Methods: Support Vector Machine, MIT Press, 1999.

[20] R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, 1993.

[21] A. Rakotomamonjy. Variable selection using SVM-based criteria. Journal of Machine Learning Research 2, 1357-1370, 2002.

[22] D.C. Sansom, T. Downs and T.K. Saha. Evaluation of support vector machine based forecasting tool in electricity price forecasting for Australian National Electricity Market Participants. In Proc. of Australasian Universities Power Engineering Conference, 2002.

[23] A.J. Smola and B. Schölkopf. A Tutorial on Support Vector Regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.

[24] C. Staelin. Parameter selection for support vector machines. Technical report, HP Labs, Israel, 2002.

[25] T. Trafalis and H. Ince. Support Vector Machine for Regression and Applications to Financial Forecasting. IJCNN (6): 348-353, 2000.

[26] V. Vapnik. The nature of statistical learning theory. Springer-Verlag, New York, 1995.

[27] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, NY, 1998.

[28] H. Yang, I. King and L. Chan. Non-fixed and asymmetrical margin approach to stock market prediction using support vector regression. In Proc. of ICONIP 2002, Singapore, 2002.

[29] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio and V. Vapnik. Feature selection for SVMs. In NIPS 13, 2000.

[30] J. Weston, A. Elisseeff, B. Schölkopf and M. Tipping. Use of the zero norm with linear models and kernel methods. JMLR, 3:1439-1461, 2003.
