Available at http://www.joics.com
Training Robust Support Vector Regression via
D. C. Program
Kuaini Wang, Ping Zhong∗,
Yaohong Zhao
College of Science, China Agricultural University, Beijing 100083, China
Abstract
Classical support vector machines are sensitive to noise and outliers. In this paper, we propose a truncated quadratic insensitive loss function and develop a robust support vector regression that suppresses the impact of noise and outliers while preserving sparseness. Since the truncated quadratic insensitive loss function is non-convex and non-differentiable, we approximate it by a smooth loss function constructed as the combination of two Huber loss functions. The resultant optimization problem can be formulated as a difference of convex functions program, which we solve with a Newton-type algorithm. Numerical experiments on benchmark datasets show that the proposed algorithm has promising performance.
Keywords: Support Vector Machine; Regression; Loss Function; Robustness; D. C. Program
1
Introduction
The support vector machine (SVM) is a useful tool for machine learning and has been applied successfully to pattern recognition, classification, function estimation, time series prediction, and other tasks [1, 2, 3]. In practice, sampling errors, modeling errors, and instrument errors may corrupt the training samples with noise and outliers, and the classical SVM generalizes poorly in their presence. There are several approaches to constructing robust SVMs. The most common one assigns weights to the errors caused by the training samples [4, 5, 6]. Another builds robust models on ramp loss functions [7, 8, 9]. A third formulates robust models as second order cone programs [10, 11, 12].
Loss functions play an essential role in supervised learning. One important and popular choice is the quadratic loss function, on which many SVMs are built, such as L2-SVM [1] and the least squares SVM (LS-SVM) [13]. In this paper, we introduce a non-convex and non-differentiable loss function based on the quadratic insensitive loss function and propose a robust support vector regression (SVR). We smooth the proposed loss function by the combination of two Huber loss functions and formulate the associated non-convex optimization problem as a difference of convex functions (d. c.) program. The d. c. algorithm (DCA) has been applied successfully to a wide variety of non-differentiable non-convex optimization problems, for which it quite often gave global solutions and proved to be more robust and efficient than related standard methods, especially in the large-scale setting [14, 15]. We employ the concave-convex procedure [16] and develop a Newton-type algorithm to solve the robust SVR, which explicitly incorporates noise and outlier suppression and sparseness in the training process. Experimental results on benchmark datasets confirm the effectiveness of the proposed algorithm.
⋆Project supported by the National Nature Science Foundation of China (No. 70601033) and Innovation Fund for Graduate Student of China Agricultural University (No. KYCX2010105).
∗Corresponding author.
Email address: [email protected] (Ping Zhong).
1548–7741/ Copyright © 2010 Binary Information Press December 2010
The rest of this paper is organized as follows. Section 2 presents SVR in the primal. In section 3, we propose the non-convex loss function and the robust model. In section 4, a Newton-type algorithm is developed for solving the robust SVR. Section 5 presents the experimental results on benchmark datasets. Finally, section 6 gives the conclusions.
2
Support Vector Regression in the Primal
In this section, we briefly describe L2-SVR in the primal. Considering a regression problem with training samples $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the input sample and $y_i$ is the corresponding target, we can obtain a predictor by solving the following optimization problem:

$$\min_{w,\, b,\, \xi,\, \tilde{\xi}} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i^2 + \tilde{\xi}_i^2) \tag{1}$$

$$\text{s.t.} \quad w^\top \phi(x_i) + b - y_i \le \varepsilon + \xi_i, \quad i = 1, \cdots, n \tag{2}$$

$$y_i - (w^\top \phi(x_i) + b) \le \varepsilon + \tilde{\xi}_i, \quad i = 1, \cdots, n \tag{3}$$

where $\phi(\cdot)$ is a nonlinear map from the input space to the feature space, and $C$ is the regularization factor which balances the tradeoff between the fitting errors and model complexity. Program (1)–(3) can be written as an unconstrained optimization problem in an associated reproducing kernel Hilbert space $\mathcal{H}$:

$$\min_{f} \ \frac{1}{2}\|f\|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} l(f(x_i) - y_i) \tag{4}$$

where $l(z) = (\max(0, |z| - \varepsilon))^2$ with $\varepsilon > 0$ is the quadratic insensitive loss function. For the sake of simplicity, we can drop the bias $b$ without loss of generalization performance of SVR [17]. According to [17], the optimal function for (4) can be expressed as a linear combination of the training samples in the feature space, $f(x) = \sum_{i=1}^{n} \beta_i k(x, x_i)$, where $k(\cdot, \cdot)$ is a kernel function. Then we have

$$\min_{\beta} \ L(\beta) = \frac{1}{2}\beta^\top K \beta + C \sum_{i=1}^{n} l(z_i) \tag{5}$$

where $K$ is the kernel matrix with $K_{ij} = k(x_i, x_j)$, $K_i$ is the $i$th row of $K$, and $z_i = K_i \beta - y_i$.
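As an illustration of objective (5), the following minimal Python sketch evaluates the quadratic insensitive loss and the primal objective for a given coefficient vector. The function names and the toy data are hypothetical and serve only to make the formula concrete; they are not part of the original experiments.

```python
import numpy as np

def quad_insensitive_loss(z, eps):
    """Quadratic insensitive loss l(z) = max(0, |z| - eps)^2, elementwise."""
    return np.maximum(0.0, np.abs(z) - eps) ** 2

def primal_objective(beta, K, y, C, eps):
    """Objective (5): 0.5 * beta^T K beta + C * sum_i l(K_i beta - y_i)."""
    z = K @ beta - y
    return 0.5 * beta @ K @ beta + C * quad_insensitive_loss(z, eps).sum()

# Hypothetical toy data: a linear kernel on three 1-D inputs.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])
K = X @ X.T
beta = np.array([0.0, 1.0, 0.0])   # f(x) = k(x, x_2) = x fits the targets exactly
value = primal_objective(beta, K, y, C=10.0, eps=0.1)
```

With this choice of coefficients all residuals vanish, so only the regularization term contributes to the objective.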
3
Robust Model
Noise and outliers in the training samples tend to cause large residuals. Hence, they exert a disproportionate influence on the optimal solution of (5), which may cause the regression function of SVR to deviate from its proper position and thus degrade the generalization performance of SVR. We introduce a non-convex loss function to limit their impact. By placing an upper bound on the quadratic insensitive loss, we get the following loss function:
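$$l_\theta(z) = \min\{\theta^2, (\max(0, |z| - \varepsilon))^2\} \tag{6}$$

where $\theta > 0$ is a constant. It is easily seen that $l_\theta(z)$ bounds the loss incurred by the large residuals that noise and outliers cause. However, $l_\theta(z)$ is neither convex nor differentiable, and the resultant optimization problem is difficult to solve. To overcome this difficulty, we first propose a smooth loss function as the approximation of $l_\theta(z)$. To do so, we construct two Huber loss functions $l_1^{hu}(z)$ and $l_2^{hu}(z)$:

$$l_1^{hu}(z) = \begin{cases} 0 & \text{if } |z| \le \varepsilon \\ (|z| - \varepsilon)^2 & \text{if } \varepsilon < |z| \le \varepsilon + \theta \\ \theta[2|z| - (2\varepsilon + \theta)] & \text{if } |z| > \varepsilon + \theta \end{cases} \tag{7}$$

$$l_2^{hu}(z) = \begin{cases} 0 & \text{if } |z| \le \varepsilon + \theta \\ -\theta(|z| - \varepsilon - \theta)^2 / h & \text{if } \varepsilon + \theta < |z| \le \varepsilon + \theta + h \\ -\theta[2|z| - (2\varepsilon + 2\theta + h)] & \text{if } |z| > \varepsilon + \theta + h \end{cases} \tag{8}$$

where $h > 0$ is the Huber parameter. Combining $l_1^{hu}(z)$ and $l_2^{hu}(z)$, we obtain

$$l_{\theta,h}^{hu}(z) = l_1^{hu}(z) + l_2^{hu}(z) = \begin{cases} 0 & \text{if } |z| \le \varepsilon \\ (|z| - \varepsilon)^2 & \text{if } \varepsilon < |z| \le \varepsilon + \theta \\ \theta[2|z| - (2\varepsilon + \theta)] - \theta(|z| - \varepsilon - \theta)^2 / h & \text{if } \varepsilon + \theta < |z| \le \varepsilon + \theta + h \\ \theta^2 + \theta h & \text{if } |z| > \varepsilon + \theta + h \end{cases} \tag{9}$$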
It is easy to verify that $l_{\theta,h}^{hu}(z)$ is continuous and differentiable. Its shape is shown in Fig. 1. When $h \to 0$, $l_{\theta,h}^{hu}(z)$ approaches $l_\theta(z)$ defined by (6). So $l_{\theta,h}^{hu}(z)$ is a smooth approximation of $l_\theta(z)$. Substituting (9) into (5), we propose the robust model as follows:

$$\min_{\beta} \ L_{\theta,h}(\beta) = \frac{1}{2}\beta^\top K \beta + C \sum_{i=1}^{n} l_{\theta,h}^{hu}(z_i) \tag{10}$$
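The claim that the smooth loss approaches the truncated loss as h shrinks can be checked numerically. The sketch below is a minimal illustration, assuming hypothetical function names and an arbitrary evaluation grid; it implements (6) and (9) and measures the largest gap between them for a few values of h.

```python
import numpy as np

def truncated_loss(z, eps, theta):
    """Truncated loss (6): min(theta^2, max(0, |z| - eps)^2)."""
    return np.minimum(theta ** 2, np.maximum(0.0, np.abs(z) - eps) ** 2)

def smooth_loss(z, eps, theta, h):
    """Smooth loss (9), evaluated piecewise on |z|."""
    a = np.abs(np.asarray(z, dtype=float))
    return np.select(
        [a <= eps, a <= eps + theta, a <= eps + theta + h],
        [0.0 * a,
         (a - eps) ** 2,
         theta * (2 * a - (2 * eps + theta)) - theta * (a - eps - theta) ** 2 / h],
        default=theta ** 2 + theta * h,   # plateau beyond eps + theta + h
    )

# Illustrative grid and parameter values (not taken from the experiments).
z = np.linspace(-3.0, 3.0, 601)
hs = (0.5, 0.05, 0.005)
gaps = [float(np.max(np.abs(smooth_loss(z, 0.1, 1.0, h) - truncated_loss(z, 0.1, 1.0))))
        for h in hs]
```

The two losses coincide for |z| up to ε + θ, and beyond that point they differ by at most θh, so the gap shrinks linearly in h.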
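Note that the objective function of (10) is non-convex. Denote $u(\beta) = \frac{1}{2}\beta^\top K \beta + C \sum_{i=1}^{n} l_1^{hu}(z_i)$ and $v(\beta) = -C \sum_{i=1}^{n} l_2^{hu}(z_i)$. Then optimization problem (10) can be expressed as

$$\min_{\beta} \ L_{\theta,h}(\beta) = u(\beta) - v(\beta) \tag{11}$$

Fig. 1: Smooth non-convex loss function $l_{\theta,h}^{hu}(z)$. From left to right, the breakpoints $\pm\varepsilon$, $\pm(\varepsilon+\theta)$, and $\pm(\varepsilon+\theta+h)$ delimit the regions ESV, SV2, SV1, NSV, SV1, SV2, ESV; the loss plateaus at $\theta^2 + \theta h$.

Program (11) is a d.c. program since $u$ and $v$ are convex functions.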
In the d.c. programming literature, the DCA [14, 15] was proposed for solving a general d.c. program of the form $\min\{u(x) - v(x) : x \in \mathbb{R}^n\}$ with $u$ and $v$ being proper lower semi-continuous convex functions, which form a larger class of functions than the class of differentiable functions. DCA iteratively solves two sequences of convex programs, called the primal and dual programs, in succession such that the solution of the primal is the initialization to the dual and vice versa. Since there are as many DCAs as there are d.c. decompositions, suitable choices of the d.c. decomposition of the objective function and of the initial point are important for computational efficiency. It can be shown that if $v$ is differentiable, then DCA reduces exactly to the concave-convex procedure (CCCP) [16]. The CCCP algorithm is an iterative procedure that solves a sequence of convex programs: $x^{t+1} \in \arg\min_x \{u(x) - x^\top \nabla v(x^t)\}$. The resulting algorithm is proved to converge globally, i.e., for any initialization, the sequence generated by CCCP converges to a stationary point of the d.c. program.
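To make the CCCP iteration concrete, consider a one-dimensional toy d.c. program that is not taken from the paper: $u(x) = x^4/4$ and $v(x) = x^2$, both convex. Each convex subproblem $\arg\min_x \{u(x) - x\, v'(x^t)\}$ then has the closed-form solution $x^{t+1} = \sqrt[3]{2x^t}$. The sketch below, with a hypothetical function name and starting point, runs this iteration and records the objective values.

```python
import numpy as np

def cccp_quartic(x0, iters=50):
    """CCCP for f(x) = u(x) - v(x) with u(x) = x^4 / 4 and v(x) = x^2.

    Each step solves argmin_x { u(x) - x * v'(x_t) }; its stationarity
    condition x^3 = 2 * x_t gives the closed-form update cbrt(2 * x_t).
    """
    xs = [float(x0)]
    for _ in range(iters):
        xs.append(float(np.cbrt(2.0 * xs[-1])))
    xs = np.array(xs)
    fs = xs ** 4 / 4 - xs ** 2      # objective value along the iterates
    return xs, fs

xs, fs = cccp_quartic(x0=0.1)
```

Starting from a positive point, the iterates approach the stationary point $x = \sqrt{2}$ of $f$, and the objective values do not increase from one iteration to the next, which is the behavior CCCP guarantees.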
In our program (11), since v is differentiable, we can solve it by CCCP. The optimal solution β∗ of (11) can be obtained by iteratively solving the following optimization problem:
$$\beta^{t+1} = \arg\min_{\beta} \{u(\beta) - \beta^\top \nabla v(\beta^t)\} \tag{12}$$
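where $\nabla v(\beta^t)$ is the derivative of $v(\beta)$ with respect to $\beta$ at the $t$th iteration:

$$\nabla v(\beta^t) = \frac{\partial v(\beta^t)}{\partial \beta} = -C \sum_{i=1}^{n} \frac{\partial l_2^{hu}(z_i^t)}{\partial z_i} \cdot \frac{\partial z_i}{\partial \beta} = -C \sum_{i=1}^{n} \eta_i^t K_i^\top \tag{13}$$

where

$$\eta_i^t = \begin{cases} 0 & \text{if } |z_i^t| \le \varepsilon + \theta \\ \frac{2\theta}{h}[(\varepsilon + \theta)s_i^t - z_i^t] & \text{if } \varepsilon + \theta < |z_i^t| \le \varepsilon + \theta + h \\ -2\theta s_i^t & \text{if } |z_i^t| > \varepsilon + \theta + h \end{cases} \tag{14}$$

with $s_i^t = \mathrm{sign}(z_i^t) = \begin{cases} 1 & \text{if } z_i^t \ge 0 \\ -1 & \text{if } z_i^t < 0 \end{cases}$. In each iteration, we only need to solve the following convex optimization problem:

$$\min_{\beta} \ L_{\theta,h}(\beta) = u(\beta) + C \sum_{i=1}^{n} \eta_i^t K_i \beta \tag{15}$$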
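The coefficients $\eta_i^t$ in (14) are simply the derivative of $l_2^{hu}$ evaluated at the current residuals. The following sketch, with hypothetical function names and illustrative parameter values, implements (8), (14), and (13) so that the derivative can be checked against finite differences.

```python
import numpy as np

def l2_huber(z, eps, theta, h):
    """Concave component l_2^{hu}(z) from (8)."""
    a = np.abs(np.asarray(z, dtype=float))
    out = np.zeros_like(a)
    mid = (a > eps + theta) & (a <= eps + theta + h)
    far = a > eps + theta + h
    out[mid] = -theta * (a[mid] - eps - theta) ** 2 / h
    out[far] = -theta * (2 * a[far] - (2 * eps + 2 * theta + h))
    return out

def eta(z, eps, theta, h):
    """eta_i = d l_2^{hu} / d z_i at the residuals z, as in (14)."""
    z = np.asarray(z, dtype=float)
    a = np.abs(z)
    s = np.where(z >= 0, 1.0, -1.0)        # sign convention used in the paper
    out = np.zeros_like(a)
    mid = (a > eps + theta) & (a <= eps + theta + h)
    far = a > eps + theta + h
    out[mid] = 2 * theta / h * ((eps + theta) * s[mid] - z[mid])
    out[far] = -2 * theta * s[far]
    return out

def grad_v(beta, K, y, C, eps, theta, h):
    """Gradient (13): -C * sum_i eta_i K_i^T = -C * K^T eta."""
    return -C * K.T @ eta(K @ beta - y, eps, theta, h)

# Illustrative check on a tiny symmetric matrix (values are arbitrary).
K = np.array([[2.0, 1.0], [1.0, 3.0]])
g = grad_v(np.array([1.0, -1.0]), K, np.zeros(2), C=1.0, eps=0.1, theta=1.0, h=0.5)
```

Here the residuals are $z = (1, -2)$: the first lies below $\varepsilon + \theta$ and contributes nothing, while the second lies beyond $\varepsilon + \theta + h$ and contributes $\eta = 2\theta$.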
4
Newton Algorithm for Robust SVR
Since (15) is a convex optimization problem, we can establish a Newton-type algorithm to solve it. First, we divide the training samples into four groups according to $|z_i^t| = |K_i \beta^t - y_i|$ at the $t$th iteration:
(1) The samples with $|z_i^t| \le \varepsilon$ are regarded as non-support vectors lying in the NSV region illustrated in Fig. 1, and the number of training samples in this region is denoted by $|NSV|$.
(2) The samples with $\varepsilon < |z_i^t| \le \varepsilon + \theta + h$ are regarded as support vectors. We further divide them into two subgroups, i.e., the samples with $\varepsilon < |z_i^t| \le \varepsilon + \theta$ lying in the SV1 region, and the samples with $\varepsilon + \theta < |z_i^t| \le \varepsilon + \theta + h$ lying in the SV2 region. We denote the number of samples in these two subgroups by $|SV_1|$ and $|SV_2|$, respectively.
(3) The samples with $|z_i^t| > \varepsilon + \theta + h$ are regarded as error support vectors, which lie in the ESV region shown in Fig. 1, and the number of samples in this region is denoted by $|ESV|$.
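The grouping above amounts to thresholding the absolute residuals. A minimal sketch, assuming a hypothetical helper name and example residuals, returns the index set of each region:

```python
import numpy as np

def partition_samples(z, eps, theta, h):
    """Split sample indices into the NSV, SV1, SV2 and ESV regions by |z_i|."""
    a = np.abs(np.asarray(z, dtype=float))
    return {
        "NSV": np.flatnonzero(a <= eps),
        "SV1": np.flatnonzero((a > eps) & (a <= eps + theta)),
        "SV2": np.flatnonzero((a > eps + theta) & (a <= eps + theta + h)),
        "ESV": np.flatnonzero(a > eps + theta + h),
    }

# Hypothetical residuals, chosen so that every region is populated.
z = np.array([0.05, -0.6, 1.3, -4.0, 0.0])
groups = partition_samples(z, eps=0.1, theta=1.0, h=0.5)
```

Keeping index arrays rather than physically reordering the data is sufficient for extracting the kernel sub-blocks needed later.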
For convenience of expression, we arrange the four regions of samples in the order of SV1, SV2, ESV and NSV. Let $I_1$ and $I_2$ be $n \times n$ diagonal matrices, where $I_1$ has the first $|SV_1|$ diagonal entries being 1 and the others 0, and $I_2$ has the first $|SV_1|$ diagonal entries being 0, followed by $|SV_2|$ entries being 1 and 0 for the rest. In order to develop a Newton-type algorithm for (15), we need to calculate the gradient and Hessian of the objective function of (15). The gradient is

$$\nabla L_{\theta,h}(\beta) = K\beta + 2CK\left[ I_1(K\beta - y - \varepsilon s) + \theta I_2 s - \frac{\theta I_2 (z^t - (\varepsilon + \theta)s^t)}{h} \right] \tag{16}$$

where $y = [y_1, \cdots, y_n]^\top$, $s = [\mathrm{sign}(z_1), \cdots, \mathrm{sign}(z_n)]^\top$, $z^t = [z_1^t, \cdots, z_n^t]^\top$, and $s^t = [s_1^t, \cdots, s_n^t]^\top$, and the Hessian is

$$G = K + 2CKI_1K \tag{17}$$
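Then the solution $\beta^{t+1}$ of (15) at the $t$th CCCP iteration can be updated by

$$\beta^{t+1} = \beta^t - G^{-1}\nabla L_{\theta,h}(\beta^t) = 2C(I_n + 2CI_1K)^{-1}\left[ I_1(y + \varepsilon s^t) - \theta I_2 s^t + \frac{\theta I_2 (z^t - (\varepsilon + \theta)s^t)}{h} \right] \tag{18}$$

where $I_n$ denotes the $n \times n$ identity matrix. In Eq. (18), we need to calculate the inverse of $I_n + 2CI_1K$. Notice that it is a sparse matrix:

$$I_n + 2CI_1K = \begin{bmatrix} I_{|SV_1|} + 2CK_{SV_1,SV_1} & 2CK_{SV_1,SV_2} & 2CK_{SV_1,ESV} & 2CK_{SV_1,NSV} \\ 0 & I_{|SV_2|} & 0 & 0 \\ 0 & 0 & I_{|ESV|} & 0 \\ 0 & 0 & 0 & I_{|NSV|} \end{bmatrix}$$

Its inverse can be derived as follows:

$$(I_n + 2CI_1K)^{-1} = \begin{bmatrix} A & -2CAK_{SV_1,SV_2} & -2CAK_{SV_1,ESV} & -2CAK_{SV_1,NSV} \\ 0 & I_{|SV_2|} & 0 & 0 \\ 0 & 0 & I_{|ESV|} & 0 \\ 0 & 0 & 0 & I_{|NSV|} \end{bmatrix} \tag{19}$$

where $A = (I_{|SV_1|} + 2CK_{SV_1,SV_1})^{-1}$. Substituting (19) into (18), we get the optimal solution at the $(t+1)$th iteration

$$\beta^{t+1} = 2C \begin{bmatrix} A\left\{ y_{SV_1} + \varepsilon s^t_{SV_1} + 2C\theta K_{SV_1,SV_2}\left[ s^t_{SV_2} + ((\varepsilon + \theta)s^t_{SV_2} - z^t_{SV_2})/h \right] \right\} \\ -\theta\left[ s^t_{SV_2} + ((\varepsilon + \theta)s^t_{SV_2} - z^t_{SV_2})/h \right] \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} \beta^{t+1}_{SV_1} \\ \beta^{t+1}_{SV_2} \\ 0 \\ 0 \end{bmatrix} \tag{20}$$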
Eq. (20) shows that the samples in the ESV region have no influence on the optimal solution because the corresponding elements in $\beta^{t+1}$ are fixed at 0. Since noise and outliers typically lie in the ESV region, the robust SVR is much less sensitive to them and thus gains better generalization performance. In addition, the robust SVR also keeps the sparseness since the elements of $\beta^{t+1}$ in the NSV region are fixed at 0.
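The block structure of (20) means that only a linear system of size $|SV_1|$ has to be solved per update. The sketch below is one possible NumPy rendering of that update, under the assumption that index arrays are used instead of reordering the data; the function name, the kernel matrix, and the residual vector are hypothetical and chosen only so that every region is populated.

```python
import numpy as np

def newton_update(K, y, z, C, eps, theta, h):
    """One update of beta following (20); z plays the role of the residuals z^t."""
    a = np.abs(z)
    s = np.where(z >= 0, 1.0, -1.0)
    sv1 = np.flatnonzero((a > eps) & (a <= eps + theta))
    sv2 = np.flatnonzero((a > eps + theta) & (a <= eps + theta + h))
    beta = np.zeros(len(y))                    # ESV and NSV entries stay 0
    # r = s_SV2 + ((eps + theta) * s_SV2 - z_SV2) / h
    r = s[sv2] + ((eps + theta) * s[sv2] - z[sv2]) / h
    beta[sv2] = -2.0 * C * theta * r
    rhs = y[sv1] + eps * s[sv1] + 2.0 * C * theta * K[np.ix_(sv1, sv2)] @ r
    # Solve (I + 2C K_SV1,SV1) x = rhs rather than forming the inverse A.
    M11 = np.eye(len(sv1)) + 2.0 * C * K[np.ix_(sv1, sv1)]
    beta[sv1] = 2.0 * C * np.linalg.solve(M11, rhs)
    return beta

# Illustrative inputs (not from the experiments).
C, eps, theta, h = 5.0, 0.1, 1.0, 0.5
X = 0.5 * np.arange(8.0)[:, None]
K = np.exp(-(X - X.T) ** 2)                    # Gaussian kernel with sigma = 1
y = np.sin(X[:, 0])
z = np.array([0.05, 0.5, -0.7, 1.2, -1.4, 3.0, -0.02, 0.9])
beta_new = newton_update(K, y, z, C, eps, theta, h)
```

Using a linear solve on the $|SV_1| \times |SV_1|$ block instead of an explicit inverse is a common numerical choice; it yields the same vector as the dense formula (18).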
Algorithm NRSVR (Newton-type algorithm for robust SVR)
Given the training samples $S = \{(x_i, y_i)\}_{i=1}^{n}$, the kernel matrix $K$, a small positive constant $\rho$, and the predefined constants $\varepsilon$, $\theta$, $h$.
1. Initialization: $\beta^0$ is solved using a classical SVM toolbox on a small subset of $S$. Let $t = 0$ and divide the training samples into four regions according to $|K_i\beta^0 - y_i|$;
2. Rearrange the regions in the order of SV1, SV2, ESV and NSV, and adjust $K$ and $y$ correspondingly. Calculate the gradient $\nabla L_{\theta,h}(\beta^t)$ and check whether $\|\nabla L_{\theta,h}(\beta^t)\| \le \rho$. If so, stop; else go to the next step;
3. Compute $\beta^{t+1}$ according to Eq. (20);
4. Split the training samples into four regions according to $|K_i\beta^{t+1} - y_i|$. Set $t = t + 1$ and go to step 2.
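The steps above can be sketched in a few lines of NumPy. This is only an illustrative rendering, not the Matlab implementation used in the experiments: the initialization by a ridge-type (LS-SVR without bias) solve replaces the SVM toolbox of step 1 and is an assumption, the dense solve of (18) is used in place of the block form (20), an iteration cap `max_iter` is added as a safeguard, and the synthetic data and all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    """k(x, z) = exp(-||x - z||^2 / sigma^2) for all pairs of rows."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def nrsvr(K, y, C, eps, theta, h, rho=1e-6, max_iter=50):
    """Sketch of Algorithm NRSVR using boolean masks instead of reordering."""
    n = len(y)
    # Assumption: ridge-type start instead of a classical SVM toolbox.
    beta = np.linalg.solve(K + np.eye(n) / (2.0 * C), y)
    for _ in range(max_iter):
        z = K @ beta - y
        a = np.abs(z)
        s = np.where(z >= 0, 1.0, -1.0)
        m1 = (a > eps) & (a <= eps + theta)                  # SV1 region
        m2 = (a > eps + theta) & (a <= eps + theta + h)      # SV2 region
        sv2_term = theta * m2 * (s - (z - (eps + theta) * s) / h)
        # Gradient (16) evaluated at beta^t, where s equals s^t.
        grad = K @ beta + 2.0 * C * K @ (m1 * (z - eps * s) + sv2_term)
        if np.linalg.norm(grad) <= rho:
            break
        # Update (18): beta = 2C (I + 2C I1 K)^{-1} [I1 (y + eps s) - sv2_term].
        M = np.eye(n) + 2.0 * C * (m1[:, None] * K)
        beta = 2.0 * C * np.linalg.solve(M, m1 * (y + eps * s) - sv2_term)
    return beta

# Synthetic example (hypothetical data): a sine curve with a few large outliers.
rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 40)[:, None]
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=40)
y[::8] += 5.0
K = gaussian_kernel(X, X, sigma=1.0)
beta = nrsvr(K, y, C=10.0, eps=0.05, theta=0.5, h=0.01)
```

Predictions on the training inputs are then obtained as `K @ beta`. For large $n$ the block form (20) would be preferable to the dense $n \times n$ solve used here.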
Notice that in the above procedure, we need not actually reorder $K$ and $y$ during the computation in step 2. We only need to record the indices of the samples in the different groups and, when required, extract the corresponding rows or columns from the original matrices or vectors. In practice, we choose the starting point $\beta^0$ such that not all $z_i^0 = K_i\beta^0 - y_i$ satisfy $|z_i^0| \le \varepsilon$ or $|z_i^0| > \varepsilon + \theta + h$, since the case that $|z_i^0| \le \varepsilon$ or $|z_i^0| > \varepsilon + \theta + h$ for all $i$ implies $\beta^t = 0$, $\forall\, t$.
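The objective function $L_{\theta,h}(\beta)$ of (10) decreases monotonically along the sequence $\{\beta^t\}$ generated by NRSVR. In fact, if $\beta^{t+1}$ is the optimal solution at the $t$th iteration for (15), then

$$u(\beta^{t+1}) + C \sum_{i=1}^{n} \eta_i^t K_i \beta^{t+1} \le u(\beta^t) + C \sum_{i=1}^{n} \eta_i^t K_i \beta^t \tag{21}$$

Since $v(\beta)$ is a convex function, we have

$$v(\beta^{t+1}) - v(\beta^t) \ge \nabla v(\beta^t)^\top (\beta^{t+1} - \beta^t) = C \sum_{i=1}^{n} \eta_i^t K_i \beta^t - C \sum_{i=1}^{n} \eta_i^t K_i \beta^{t+1} \tag{22}$$

From (21) and (22), we obtain $L_{\theta,h}(\beta^{t+1}) \le L_{\theta,h}(\beta^t)$. In addition, $L_{\theta,h}(\beta) \ge 0$, since the kernel matrix $K$ is positive semi-definite and the loss is non-negative. Hence, according to the analysis in [16], NRSVR converges.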
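For completeness, the step from (21) and (22) to the monotonicity claim can be written out explicitly. The chain below only spells out the argument above, using $L_{\theta,h} = u - v$ from (11), and introduces no additional assumptions.

```latex
\begin{aligned}
L_{\theta,h}(\beta^{t+1})
  &= u(\beta^{t+1}) - v(\beta^{t+1}) \\
  &\le u(\beta^{t+1}) + C\sum_{i=1}^{n}\eta_i^t K_i\beta^{t+1}
       - v(\beta^{t}) - C\sum_{i=1}^{n}\eta_i^t K_i\beta^{t}
       && \text{by (22)} \\
  &\le u(\beta^{t}) + C\sum_{i=1}^{n}\eta_i^t K_i\beta^{t}
       - v(\beta^{t}) - C\sum_{i=1}^{n}\eta_i^t K_i\beta^{t}
       && \text{by (21)} \\
  &= u(\beta^{t}) - v(\beta^{t}) = L_{\theta,h}(\beta^{t}).
\end{aligned}
```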
Next, we discuss the computational complexity of NRSVR. Since the iterations are the most time-consuming stage, we only consider the complexity of one iteration. In step 2, the complexity of computing $\nabla L_{\theta,h}(\beta)$ is $O(n(|SV_1| + |SV_2|))$. In step 3, the cost of updating $\beta^t$ is $\max\{O(|SV_1|^3), O(|SV_1|(|SV_1| + |SV_2|))\}$. Hence, the total computational complexity is $O(n(|SV_1| + |SV_2|) + |SV_1|^3)$, which is comparable with those of algorithms with convex loss functions [1, 13].
5
Numerical Experiments and Analysis
In order to verify the robustness of the proposed algorithm, we compared NRSVR with LS-SVR and L2-SVR on several benchmark datasets. The Gaussian kernel $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2/\sigma^2)$ was used in the experiments. There are five parameters: $C$, $\sigma$, $\varepsilon$, $\theta$, and $h$. LS-SVR needs the first two parameters, L2-SVR needs the first three, and the last two are introduced by NRSVR. We searched for the optimal parameters $(C, \sigma, \varepsilon, \theta, h)$ in the sets $\{2^{-10}, \cdots, 2^{10}\} \times \{2^{-10}, \cdots, 2^{10}\} \times \{10^{-3}, 2\times10^{-3}, 5\times10^{-3}, 10^{-2}, 2\times10^{-2}, \cdots, 9\times10^{-2}, 10^{-1}\} \times \{0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45\} \times \{0.001, 0.005, 0.01, 0.05, 0.1\}$ by five-fold cross validation. We adopted three popular criteria, root mean square error (RMSE), mean absolute error (MAE), and mean relative error (MRE), to evaluate the generalization performance of these three algorithms. All the experiments were carried out on an Intel Pentium IV 3.00GHz PC with 2GB of RAM using Matlab 7.0 under Microsoft Windows XP.
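The three criteria can be computed as in the sketch below. The paper does not state the formula used for MRE; the version here, the mean of $|y_i - \hat{y}_i| / |y_i|$, is a common definition and should be read as an assumption. The function names and the example values are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mre(y_true, y_pred):
    """Mean relative error; assumed definition: mean(|y - y_hat| / |y|)."""
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))

# Hypothetical targets and predictions.
y_true = np.array([2.0, 4.0, 5.0])
y_pred = np.array([1.0, 4.0, 7.0])
scores = (rmse(y_true, y_pred), mae(y_true, y_pred), mre(y_true, y_pred))
```

Under this definition MRE is undefined when a target equals zero, so it is only meaningful for datasets whose targets are bounded away from zero.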
We tested the three algorithms on a collection of seven benchmark datasets from UCI¹ and StatLib². Pyrim, Triazines, AutoMPG, and Boston housing are taken from UCI. Pollution, Bodyfat, and Concrete are taken from StatLib. In order to test the robustness of the three algorithms, 20% large noise was added to each dataset. For each dataset, some samples were randomly chosen for training, and the remaining samples were used for testing. The specific numbers are listed in the TrNum and TeNum columns of Table 1, respectively. We used the same training and test sets for the three algorithms on each dataset. The experimental results are summarized in Table 1. NRSVR achieves the lowest RMSE, MAE, and MRE among the three algorithms on all seven datasets.
¹Available from URL: http://archive.ics.uci.edu/ml/. ²Available from URL: http://lib.stat.cmu.edu/datasets/.

Table 1: Experimental results on benchmark datasets

| Dataset | Algorithm | RMSE | MAE | MRE | TrNum | TeNum |
|---|---|---|---|---|---|---|
| Pollution | LS-SVR | 44.2928 | 34.0039 | 0.0357 | 40 | 20 |
| Pollution | L2-SVR | 45.2703 | 35.0242 | 0.0368 | 40 | 20 |
| Pollution | NRSVR | 39.4165 | 30.2575 | 0.0320 | 40 | 20 |
| Pyrim | LS-SVR | 0.0772 | 0.0535 | 0.1191 | 50 | 24 |
| Pyrim | L2-SVR | 0.0805 | 0.0539 | 0.1199 | 50 | 24 |
| Pyrim | NRSVR | 0.0757 | 0.0508 | 0.1108 | 50 | 24 |
| Triazines | LS-SVR | 0.1301 | 0.0997 | 0.2285 | 150 | 36 |
| Triazines | L2-SVR | 0.1287 | 0.0993 | 0.2275 | 150 | 36 |
| Triazines | NRSVR | 0.1274 | 0.0981 | 0.2261 | 150 | 36 |
| Bodyfat | LS-SVR | 0.0094 | 0.0075 | 0.0071 | 200 | 52 |
| Bodyfat | L2-SVR | 0.0088 | 0.0071 | 0.0067 | 200 | 52 |
| Bodyfat | NRSVR | 0.0034 | 0.0023 | 0.0022 | 200 | 52 |
| AutoMPG | LS-SVR | 3.0728 | 2.2499 | 0.0992 | 300 | 92 |
| AutoMPG | L2-SVR | 2.9373 | 2.2137 | 0.0996 | 300 | 92 |
| AutoMPG | NRSVR | 2.7882 | 2.0508 | 0.0895 | 300 | 92 |
| Boston housing | LS-SVR | 4.2619 | 2.9092 | 0.1429 | 300 | 206 |
| Boston housing | L2-SVR | 4.3830 | 3.1184 | 0.1548 | 300 | 206 |
| Boston housing | NRSVR | 3.9520 | 2.6975 | 0.1358 | 300 | 206 |
| Concrete | LS-SVR | 8.3216 | 6.4423 | 0.2399 | 500 | 530 |
| Concrete | L2-SVR | 6.9837 | 5.2011 | 0.1843 | 500 | 530 |
| Concrete | NRSVR | 6.9114 | 5.1502 | 0.1836 | 500 | 530 |

Next, we discuss the influence of the parameters $\theta$ and $h$ introduced in NRSVR. $h$ is a Huber parameter used to smooth the non-convex loss function, and its value is usually small. In our experience, $h = 10^{-3}$ is appropriate. Parameter $\theta$ is introduced to bound the loss function from above. In general, it should be neither too large nor too small. If $\theta$ is too large, noise and outliers can easily be treated as support vectors, which not only reduces the prediction accuracy of NRSVR but also increases the testing cost because more support vectors appear in the optimal solution. If $\theta$ is too small, some normal samples are taken as outliers in the training phase and do not take part in determining the decision hyperplane. This results in poor generalization performance. Therefore, we need to find a suitable value that suppresses the impact of outliers while maintaining good generalization performance. We took the Pollution and Pyrim datasets as examples to illustrate the influence of these two parameters. When one parameter is analyzed, the remaining parameters are fixed. The effects of $\theta$ and $h$ on the RMSE values for the two datasets are shown in Figs. 2 and 3, respectively. The results validate the above analysis.
6
Conclusion
In this paper, we propose a non-convex and non-differentiable loss function and develop a robust support vector regression that suppresses the impact of noise and outliers while keeping sparseness. We construct a smooth loss function, the combination of two Huber loss functions, to approximate the non-convex loss function. The resultant optimization problem can be formulated as a d. c. program. We employ the concave-convex procedure and develop a Newton-type algorithm to solve it. Numerical experiments on the benchmark datasets show the effectiveness of the proposed algorithm.
In this paper, we only focus on constructing the robust model based on the truncated quadratic loss function. Further research is required to study the general form of non-convex loss functions and to establish a general robust model.
Fig. 2: Influence of θ (left graph) and h (right graph) on RMSE values for Pollution
Fig. 3: Influence of θ (left graph) and h (right graph) on RMSE values for Pyrim
References
[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000
[2] B. Schölkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002
[3] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995
[4] J. Suykens, J. De Brabanter, and L. Lukas, Weighted least squares support vector machines: robustness and sparse approximation, Neurocomputing 48 (2002) 85-105
[5] C. Lin and S. Wang, Fuzzy support vector machines, IEEE Transactions on Neural Networks 13 (2002) 464-471
[6] H. Huang and Y. Liu, Fuzzy support vector machines for pattern recognition and data mining, International Journal of Fuzzy Systems 4 (2002) 3-12
[7] R. Collobert, F. Sinz, J. Weston, and L. Bottou, Trading convexity for scalability, in: Proceedings of the 23rd International Conference on Machine Learning, ACM Press, 2006, pp. 201-208
[8] L. Xu, K. Crammer, and D. Schuurmans, Robust support vector machine training via convex outlier ablation, in: Proceedings of the 21st National Conference on Artificial Intelligence, 2006, pp. 536-546
[9] S. Yang and B. Hu, A stagewise least square loss function for classification, in: Proceedings of the 2008 SIAM International Conference on Data Mining, IEEE 2008, pp. 120-131
[10] T. B. Trafalis and R. C. Gilbert, Robust classification and regression using support vector machines, European Journal of Operational Research 173 (2006) 893-909
[11] P. Zhong and M. Fukushima, Second order cone programming formulations for robust multi-class classification, Neural Computation 19 (2007) 258-282
[12] P. Zhong and L. Wang, Support vector regression with input data uncertainty, International Journal of Innovative Computing, Information and Control 4 (2008) 2325-2332
[13] J. A. K. Suykens and J. Vandewalle, Benchmarking least squares support vector machine classifiers, Machine Learning 54 (2004) 5-32
[14] P. D. Tao and L. T. H. An, D. C. optimization algorithms for solving the trust region subproblem, SIAM Journal on Optimization 8 (1998) 476-505
[15] L. T. H. An and P. D. Tao, The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems, Annals of Operations Research 133 (2005) 23-46
[16] A. L. Yuille and A. Rangarajan, The concave-convex procedure, Neural Computation 15 (2003) 915-936
[17] O. Chapelle, Training a support vector machine in the primal, Neural Computation 19 (2007) 1155-1178