
2.3 Linear Support Vector Regression

Support Vector Regressions (SVRs) and Support Vector Machines (SVMs) are rooted in Statistical Learning Theory, pioneered by Vapnik (1995) and co-workers. Detailed treatments of SVR and SVM can be found, for example, in Burges (1998), Smola (1996) and Smola and Schölkopf (1998). The following is a self-contained basic introduction to SVRs.

SVRs have two main strengths: good generalizability (avoidance of overfitting) and robustness against outliers. Generalizability refers to the fact that SVRs are designed to provide the simplest solution for a given, fixed amount of (training) errors. A solution is referred to as simple if the coefficients of the predictor variables are penalized towards zero. Thus, an SVR addresses the problem of overfitting explicitly, just like other penalization methods such as Ridge Regression (RR; Tikhonov, 1963) and the Lasso (Tibshirani, 1996). The robustness property stems from considering absolute, instead of quadratic, values for the errors. As a consequence, the influence of outliers is less pronounced. More precisely, SVRs employ the so-called $\varepsilon$-insensitive error loss function, which is presented below. In a nutshell, (linear) SVR departs from classical regression in two respects. The first is the use of the $\varepsilon$-insensitive loss function instead of the quadratic one. The second is the penalization of the vector of coefficients of the predictor variables.

The classical multiple regression has a well-known loss function that is quadratic in the errors, $r_i^2 = (y_i - f(x_i))^2$. The loss function employed in SVR is the $\varepsilon$-insensitive loss function

$$g(r_i) = |y_i - f(x_i)|_\varepsilon \equiv \max\{0, |y_i - f(x_i)| - \varepsilon\} = \max\{0, |r_i| - \varepsilon\}$$

for a predetermined nonnegative $\varepsilon$, where $y_i$ is the true target value, $x_i$ is a vector of input variables and $f(x_i)$ is the estimated target value for observation $i$. Figure 2.1 shows the resulting function for the residual. Intuitively speaking, if the absolute residual is off-target by $\varepsilon$ or less, then there is no loss, that is, no penalty should be imposed, hence the name $\varepsilon$-insensitive. However, if the opposite is true, that is, $|y_i - f(x_i)| - \varepsilon > 0$, then a certain amount of loss should be associated with the estimate. This loss rises linearly with the absolute difference between $y_i$ and $f(x_i)$ above $\varepsilon$.
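The contrast between the two loss functions can be made concrete with a minimal sketch. The function names and the three example observations below are illustrative assumptions, not part of the original text; only the formulas $r_i^2$ and $\max\{0, |r_i| - \varepsilon\}$ come from the definitions above.

```python
def squared_loss(y, f_x):
    """Classical quadratic loss r^2 for one observation."""
    r = y - f_x
    return r * r


def eps_insensitive_loss(y, f_x, eps):
    """epsilon-insensitive loss g(r) = max(0, |r| - eps), with eps >= 0."""
    if eps < 0:
        raise ValueError("eps must be nonnegative")
    return max(0.0, abs(y - f_x) - eps)


# Hypothetical observations: one inside the tube, one on its edge, one outlier.
eps = 2.0
for y, f_x in [(5.0, 4.0), (5.0, 3.0), (15.0, 5.0)]:
    # Columns: residual, quadratic loss, epsilon-insensitive loss.
    print(y - f_x, squared_loss(y, f_x), eps_insensitive_loss(y, f_x, eps))
```

For the outlier with residual 10, the quadratic loss is 100 while the $\varepsilon$-insensitive loss is 8, which illustrates why outliers have less influence under the latter.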

[Figure 2.1: plot of $g(r_i)$ against $r_i$, with $-\varepsilon$ and $\varepsilon$ marked on the horizontal axis.]

Figure 2.1: The $\varepsilon$-insensitive loss function, which assigns no penalty to residuals $r_i \in [-\varepsilon, \varepsilon]$, that is, to targets $y_i \in [f(x_i) - \varepsilon, f(x_i) + \varepsilon]$, for point $i$. As $|r_i|$ gets larger than $\varepsilon$, a nonzero penalty $g(r_i)$ that rises linearly is assigned.

Because SVR is a nonparametric method, traditional parametric inferential statistical theory cannot be readily applied. Theoretical justifications for the SVR are instead based on statistical learning theory (Vapnik, 1995). There are two sets of model parameters in (linear) SVR: the coefficients, and two manually adjustable parameters, $C$ and $\varepsilon$, that explicitly control the interplay between model fit and model complexity. For each value of the manually adjustable parameters $C$ and $\varepsilon$ there is a corresponding set of optimal coefficients, which are obtained by solving a quadratic optimization problem. The $C$ and $\varepsilon$ parameters are usually tuned using a cross-validation procedure. In such a procedure, the data set is first partitioned into several mutually exclusive parts. Next, models are built on some parts of the data and the other parts are used to evaluate model performance for a particular choice of the fit-versus-complexity parameters $C$ and $\varepsilon$. This is analogous to the process of adjusting the bias-versus-variance parameter in Ridge Regression, for instance. We start the intuitive SVR exposition by assuming that $C$ has implicitly been set to unity and $\varepsilon$ has been set to 2. We later relax that assumption and give a more formal meaning to these parameters in terms of their role in the SVR optimization problem (2.7). In the nonlinear SVR case, other manually adjustable parameters may arise. A cross-validation grid search over a certain range of values for $C$, $\varepsilon$ and these parameters then has to be performed in order to tune all parameters.
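The cross-validation procedure described above might be sketched as follows. This is an outline under stated assumptions rather than a definitive implementation: `fit` and `predict` are hypothetical user-supplied callables (not a library API), the folds are formed by simple interleaving, and mean absolute error on the held-out part is an assumed performance measure, since the text does not prescribe one.

```python
from itertools import product


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k mutually exclusive, non-empty folds."""
    if not 1 < k <= n:
        raise ValueError("need 1 < k <= n")
    return [list(range(start, n, k)) for start in range(k)]


def grid_search_cv(x, y, fit, predict, c_grid, eps_grid, k=5):
    """Return the (C, eps) pair with the lowest mean held-out absolute error.

    fit(x_train, y_train, C, eps) -> model and predict(model, x_new) -> float
    are hypothetical callables supplied by the user.
    """
    folds = k_fold_indices(len(y), k)
    best_params, best_score = None, float("inf")
    for C, eps in product(c_grid, eps_grid):
        fold_errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(y)) if i not in held]
            # Build the model on the training parts only.
            model = fit([x[i] for i in train], [y[i] for i in train], C, eps)
            # Evaluate on the part that was left out.
            errs = [abs(y[i] - predict(model, x[i])) for i in held_out]
            fold_errors.append(sum(errs) / len(errs))
        score = sum(fold_errors) / len(fold_errors)
        if score < best_score:
            best_params, best_score = (C, eps), score
    return best_params, best_score
```

Any solver for the SVR optimization problem can be plugged in as `fit`; for nonlinear SVR, the extra parameters would simply add further dimensions to the grid.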

Let us first consider the case of simple linear regression estimation by SVR, using the usual linear relation $y = \beta_1 x_1 + b$, where $\beta_1$ and $b$ are parameters to be estimated. Figure 2.2 shows an example with three cases of possible linear functional relations. The SVR line is the solid line in Figure 2.2c, given by the equation $f(x_1) = \beta_1 x_1 + b$. The tube between the dotted lines in Figure 2.2 consists of points for which the inequality $|y - f(x_1)| - \varepsilon \le 0$ holds, where $\varepsilon$ has been fixed arbitrarily at 2. All data points that happen to be on or inside the tube are not associated with any loss. The rest of the points are penalized according to the $\varepsilon$-insensitive loss function. Hence, the solutions in Panels (b) and (c) both have zero loss in the $\varepsilon$-insensitive sense.
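The three situations just described can be reproduced on a small hypothetical data set. The points below are an assumption chosen for exact arithmetic (they lie on the line $y = x_1$) and are not the data of Figure 2.2; the three candidate lines mimic a horizontal line, an exact fit, and a flatter line that still keeps every point inside the tube for $\varepsilon = 2$.

```python
def total_eps_loss(xs, ys, beta1, b, eps):
    """Sum of max(0, |y - (beta1*x + b)| - eps) over all points."""
    return sum(max(0.0, abs(y - (beta1 * x + b)) - eps) for x, y in zip(xs, ys))


# Hypothetical data lying exactly on the line y = x (not the data of Figure 2.2).
xs = [0.0, 2.0, 4.0, 6.0, 8.0]
ys = [0.0, 2.0, 4.0, 6.0, 8.0]
eps = 2.0

candidates = {
    "horizontal": (0.0, 4.0),  # beta1 = 0, b = mean of y
    "exact fit": (1.0, 0.0),   # passes through every point
    "flatter": (0.5, 2.0),     # end points sit exactly on the tube boundary
}
losses = {
    name: total_eps_loss(xs, ys, beta1, b, eps)
    for name, (beta1, b) in candidates.items()
}
print(losses)
```

On these data the horizontal line leaves the two end points outside the tube, whereas the exact fit and the flatter line both incur zero $\varepsilon$-insensitive loss.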

The exact position of the SVR line of Figure 2.2c is determined as follows. The starting point is that the SVR line should be as horizontal (that is, as simple or flat) as possible. The extreme case of $\beta_1 = 0$ in Figure 2.2a will unavoidably yield several mistakes, as $\varepsilon$ is not big enough to give zero loss for all points. This case represents a simple but poorly fitting relationship. However, notice that the resulting region between the dotted lines, referred to as the $\varepsilon$-insensitive region, occupies the greatest possible area (for $\varepsilon = 2$). It is argued in the SVR literature that this particular area can be seen as a measure of the complexity of the regression function used. Accordingly, the horizontal regression line provides the least complex functional relationship between $x_1$ and $y$, which is equivalent to no relationship at all.

[Figure 2.2: three panels, (a), (b) and (c), each plotting $y$ against $x_1$ with $\varepsilon = 2$.]

Figure 2.2: Three possible solutions to a linear regression problem with data points that lie on a line. The vertical line segments in panel (a) indicate the loss per observation, which is equal to $|y - f(x_1)| - \varepsilon$, for $\varepsilon = 2$. In line with the $\varepsilon$-insensitive loss function, a point is not considered to induce an error if its deviation from the regression line is less than or equal to $\varepsilon$. The horizontal regression line in panel (a) is the simplest possible one, since it hypothesizes that there is no relation between $y$ and $x_1$, and it produces too much loss. Panel (b) gives the classical linear regression estimation, yielding zero loss. Panel (c) shows the linear SVR, which also yields zero loss but is flatter than the regression in Panel (b).

Consider the next step in Figure 2.2b. Here, the solid line fits the training data extremely well. This line is the actual regression function from classical regression analysis, where the loss, measured as the sum of squared errors of the estimates, is minimized. The distance between the dotted lines, however, has clearly diminished compared to Figures 2.2a and 2.2c. What the SVR line of Figure 2.2c aims for is a balance between the amount of flatness (or complexity) and training mistakes (or fit). This balance is the fundamental idea behind SVR analysis. Good generalization ability is achieved when the best trade-off is struck between the function's complexity (proxied by the distance between the dotted lines) and the function's accuracy on the training data. The idea that such a balance between complexity and amount of training errors should be sought has been formalized in Vapnik (1995).

For $p$ predictor variables and one dependent variable in a data set of $n$ observations, the mathematical formulation of the optimization problem of SVR can be derived intuitively as follows. The objective is to find a vector of $p$ coefficients $\beta$ and an intercept $b$ so that the linear function $f(x) = \beta' x + b$ has the best generalization ability for some fixed error insensitivity. From the complexity side, this linear surface should be as horizontal as possible, which can be achieved by minimizing the quadratic form $\beta'\beta$. From the amount-of-errors side, however, a perfectly horizontal surface (obtained for $\beta = 0$) will generally not be optimal, since a lot of errors will typically be made in such a case. According to the $\varepsilon$-insensitive loss function, the sum of these errors is defined to be equal to $\sum_{i=1}^{n} g(r_i) = \sum_{i=1}^{n} \max\{0, |y_i - f(x_i)| - \varepsilon\}$. One can strike a balance between amount of errors and complexity by minimizing their sum

$$L_p(\beta, b) := \frac{1}{2}\beta'\beta + C \sum_{i=1}^{n} \max\{0, |y_i - (\beta' x_i + b)| - \varepsilon\}, \qquad (2.7)$$

where $C$ is a user-defined constant that controls the relative importance of the two terms. This formulation is the familiar penalty-plus-loss minimization paradigm that arises in many domains (see, e.g., Hastie et al., 2001).
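To make the two terms of (2.7) tangible, the objective can be evaluated directly for a few candidate solutions. The sketch below is illustrative only: the function names, the one-predictor data on the line $y = x_1$, and the three candidate pairs $(\beta, b)$ are assumptions, with $C = 1$ and $\varepsilon = 2$ as in the intuitive exposition.

```python
def dot(u, v):
    """Inner product u'v of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def primal_objective(beta, b, X, y, C, eps):
    """L_p(beta, b) = 0.5*beta'beta + C * sum_i max(0, |y_i - (beta'x_i + b)| - eps)."""
    penalty = 0.5 * dot(beta, beta)
    loss = sum(
        max(0.0, abs(yi - (dot(beta, xi) + b)) - eps) for xi, yi in zip(X, y)
    )
    return penalty + C * loss


# Hypothetical one-predictor data on the line y = x, with C = 1 and eps = 2.
X = [[0.0], [2.0], [4.0], [6.0], [8.0]]
y = [0.0, 2.0, 4.0, 6.0, 8.0]
for beta, b in [([0.0], 4.0), ([1.0], 0.0), ([0.5], 2.0)]:
    # Horizontal line, exact fit, and a flatter zero-loss line, respectively.
    print(beta, b, primal_objective(beta, b, X, y, C=1.0, eps=2.0))
```

Among these three candidates, the horizontal line pays only through the loss term, the exact fit pays only through the penalty term, and the flatter zero-loss line attains the smallest objective value.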

The problem can equivalently be represented by introducing the so-called slack variables $\xi$ and $\xi^*$. Then, minimizing $L_p(\beta, b)$ can be represented as the constrained minimization problem

$$\text{minimize} \quad L_p(\beta, b, \xi, \xi^*) := \frac{1}{2}\beta'\beta + C \sum_{i=1}^{n} (\xi_i + \xi_i^*), \qquad (2.8)$$

subject to $y_i - (\beta' x_i + b) \le \varepsilon + \xi_i$, $\beta' x_i + b - y_i \le \varepsilon + \xi_i^*$, and $\xi_i, \xi_i^* \ge 0$ (Vapnik, 1995; Smola & Schölkopf, 1998).
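The equivalence between (2.7) and (2.8) can be checked numerically. For fixed $\beta$ and $b$, and assuming $C > 0$, minimizing (2.8) pushes each slack down to the smallest value its constraints allow, so that $\xi_i + \xi_i^*$ coincides with the $\varepsilon$-insensitive loss $g(r_i)$. The sketch below uses hypothetical targets and estimates; the function name is an assumption.

```python
def slacks(y_i, f_i, eps):
    """Smallest nonnegative slacks satisfying the constraints of (2.8)."""
    xi = max(0.0, y_i - f_i - eps)       # target lies above the tube
    xi_star = max(0.0, f_i - y_i - eps)  # target lies below the tube
    return xi, xi_star


eps = 2.0
# Hypothetical (target, estimate) pairs: above, inside, and below the tube.
for y_i, f_i in [(9.0, 4.0), (4.0, 4.0), (1.0, 4.0)]:
    xi, xi_star = slacks(y_i, f_i, eps)
    g = max(0.0, abs(y_i - f_i) - eps)  # epsilon-insensitive loss
    print(y_i, f_i, xi, xi_star, g)
```

At most one of the two slacks is nonzero for any observation, since a target cannot lie above and below the tube at the same time.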

If the estimate $\beta' x_i + b$ of the $i$th observation deviates from the target $y_i$ by more than $\varepsilon$, then a loss is incurred. This loss is equal to either $\xi_i$ or $\xi_i^*$, depending on which side of the regression surface observation $i$ lies. It turns out that (2.8) is a convex quadratic optimization problem with linear constraints, and thus a globally optimal solution can always be found. As already mentioned, the objective function in (2.8) consists of two terms. The first term, $\frac{1}{2}\beta'\beta$, captures the degree of complexity, which is proxied by the width of the $\varepsilon$-insensitive region between the surfaces $y = \beta' x + b + \varepsilon$ and $y = \beta' x + b - \varepsilon$. If $\beta = 0$, then complexity ($\frac{1}{2}\beta'\beta$) is minimal since the $\varepsilon$-insensitive region is biggest. The slack variables $\xi_i$ and $\xi_i^*$, $i = 1, 2, \ldots, n$, are constrained to be nonnegative. All points $i$ inside the $\varepsilon$-insensitive region have both $\xi_i = 0$ and $\xi_i^* = 0$. If a