Conjugate gradient solution - Direct L2 Support Vector Machine

The conjugate gradient method is an iterative method for approximately solving a system of linear equations when the problem is too large, and therefore too slow, for a direct method [28]. The general conjugate gradient (CG) algorithm (without constraints) can be modified to accommodate nonnegativity constraints, as explained by Hestenes [28]. Here we give a brief explanation of the CG method with nonnegativity constraints, adapted to the DL2 SVM notation. The DL2 SVM problem, as given in 4.1 subject to 4.2, is the problem of finding a minimum point $x^{(*)}$ of $f(x)$ such that each component of $x^{(*)}$ is nonnegative.

Let us first assume that $x^{(0)}$ is the initial guess for the solution vector $x^{(*)}$ and that $x^{(0)} = 0$. Starting with this initial guess, the algorithm searches for the solution with the help of a metric that tells whether the current guess is getting closer to the solution. The metric used here is the residual vector $r$, which becomes smaller as the algorithm gets closer to the unique solution vector $x^{(*)}$. Fig. 9 compares steepest (gradient) descent with CG descent. As one can see, the conjugate gradient method converges faster: it takes at most $n$ steps (assuming no round-off errors), where $n$ is the size of the matrix (here $n = 2$). However, each iteration of the gradient descent method is cheaper than an iteration of the conjugate gradient method.
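The comparison can be reproduced numerically with a minimal sketch. The $2 \times 2$ matrix below is an illustrative assumption (it is not the system plotted in Fig. 9), the right-hand side is the vector of ones, and the function names `steepest_descent` and `conjugate_gradient` are hypothetical. Both routines solve the unconstrained problem only.

```python
import numpy as np

def steepest_descent(A, b, tol=1e-10, max_iter=10_000):
    """Minimize 0.5*x@A@x - b@x by stepping along the residual r = b - A@x."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    for k in range(max_iter):
        if np.linalg.norm(r) < tol:
            return x, k
        a = (r @ r) / (r @ (A @ r))      # exact line search along r
        x = x + a * r
        r = b - A @ x
    return x, max_iter

def conjugate_gradient(A, b, tol=1e-10, max_iter=10_000):
    """Unconstrained CG; each new direction is A-conjugate to the previous one."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    p = r.copy()                         # first direction is the residual
    for k in range(max_iter):
        if np.linalg.norm(r) < tol:
            return x, k
        s = A @ p
        d = p @ s
        a = (p @ r) / d
        x = x + a * p
        r = r - a * s
        p = r - (s @ r) / d * p          # conjugate direction update
    return x, max_iter

# Illustrative symmetric positive definite system (assumed values).
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.ones(2)
x_sd, k_sd = steepest_descent(A, b)
x_cg, k_cg = conjugate_gradient(A, b)
print("steepest descent iterations:", k_sd)
print("conjugate gradient iterations:", k_cg)
```

Each steepest descent iteration here costs one extra matrix-vector product only because the residual is recomputed explicitly; with a recursive residual update it needs fewer vector operations per iteration than CG, which is the trade-off noted above.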

Before we present the algorithm, it is important to introduce the set $I$ and the residual vector $r$. Let $I$ be the set of all indexes $i \le m$ such that $x_i^{(k)} = 0$. Then, at a point $x^{(k)}$, the residual vector can be calculated as

$$r^{(k)} = -f'(x^{(k)}) = \mathbf{1} - A x^{(k)}. \tag{4.11}$$
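For reference, (4.11) can be checked in one line, assuming the objective has the quadratic form below with a symmetric matrix $A$. This form is inferred from (4.11) itself, since 4.1 is not restated in this section.

```latex
f(x) = \tfrac{1}{2}\, x^{T} A x - \mathbf{1}^{T} x
\quad\Longrightarrow\quad
f'(x) = A x - \mathbf{1},
\qquad
r^{(k)} = -f'(x^{(k)}) = \mathbf{1} - A x^{(k)} .
```

Under that assumption, a zero residual corresponds to the linear system $A x = \mathbf{1}$, so in the unconstrained case a vanishing residual marks the minimizer.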

As one can see, the residual vector $r$ is in fact the negative gradient of $f(x)$ at the point $x^{(k)}$. The algorithm can be explained through the following steps:

Step 1: Select an initial point $x^{(*)} = x^{(0)}$ that satisfies the nonnegativity constraints, e.g. $x^{(0)} = 0$. The residual vector $r^{(*)} = r^{(0)}$, used as the first search direction, is calculated as $r^{(0)} = \mathbf{1} - A x^{(0)}$.

Step 2: Let $I$ be the set of all indexes $i \le m$ such that

$$x_i^{(*)} = 0, \quad r_i^{(*)} \le 0 \quad \forall i \in I. \tag{4.12}$$

If the residual value $r_i^{(*)} = 0$ for all $i \notin I$ (or, in practice, $|r_i^{(*)}| < \tau$, where $\tau$ is some small stopping tolerance), then $x^{(*)}$ is the solution of the optimization problem and the algorithm terminates.

Step 3: Set the conjugate direction vector $p^{(0)} = \bar{r}^{(0)}$, where $\bar{r}^{(0)}$ is the vector having

$$\bar{r}_i^{(0)} = \begin{cases} 0 & : i \in I \\ r_i^{(*)} & : \text{otherwise} \end{cases} \tag{4.13}$$

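Step 4: Start with $k = 0$ and perform the standard CG step by computing:

$$s^{(k)} = A \cdot p^{(k)}, \tag{4.14}$$

$$a^{(k)} = \frac{p^{(k)T} \cdot r^{(k)}}{p^{(k)T} \cdot s^{(k)}}, \tag{4.15}$$

where the scalar value $a^{(k)}$ is used to adjust the solution and residual vectors as follows:

$$x^{(k+1)} = x^{(k)} + a^{(k)} \cdot p^{(k)}, \tag{4.16}$$

$$r^{(k+1)} = r^{(k)} - a^{(k)} \cdot s^{(k)}. \tag{4.17}$$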

Step 5: If $x^{(k+1)}$ lies outside of the feasible region, i.e., some components $x_i^{(k+1)}$ violate the nonnegativity constraint, then go to Step 6 of the algorithm.

Otherwise, if the residual value $r_i^{(k+1)} = 0$ for all $i \notin I$ (or, in practice, $|r_i^{(k+1)}| < \tau$), reset $x^{(*)}$ and $r^{(*)}$ as follows:

$$x^{(*)} = x^{(k+1)}, \tag{4.18}$$

$$r^{(*)} = r^{(k+1)} = \mathbf{1} - A x^{(k+1)}, \tag{4.19}$$

set $k = k + 1$ and go to Step 2.

Else, set the new residual $\bar{r}^{(k+1)}$ and update the conjugate direction vector $p^{(k+1)}$ as described below:

$$\bar{r}_i^{(k+1)} = \begin{cases} 0 & : i \in I \\ r_i^{(k+1)} & : \text{otherwise} \end{cases} \tag{4.20}$$

$$p^{(k+1)} = \bar{r}^{(k+1)} - \frac{s^{(k)T} \cdot \bar{r}^{(k+1)}}{p^{(k)T} \cdot s^{(k)}} \cdot p^{(k)}. \tag{4.21}$$

Set $k = k + 1$ and go to Step 4.

Step 6: Let $J$ be the set of indexes $j$ such that $x_j^{(k+1)} < 0$ and define a scalar value $\bar{a}^{(k)}$ to be the smallest of the ratios:

$$\bar{a}^{(k)} = \min\left(\frac{-x_j^{(k)}}{p_j^{(k)}}\right), \quad \forall j \in J. \tag{4.22}$$

Once we have the value of $\bar{a}^{(k)}$, we can reset the solution and residual vectors as follows:

$$x^{(*)} = x^{(k)} + \bar{a}^{(k)} \cdot p^{(k)}, \tag{4.23}$$

$$r^{(*)} = r^{(k)} - \bar{a}^{(k)} \cdot s^{(k)}. \tag{4.24}$$

Redefine the set $I$ to be the set of all indexes $i \le m$ such that $x_i^{(*)} = 0$. If the residual value $r_i^{(*)} = 0$ for all $i \notin I$, go to Step 2; else go to Step 3.
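Steps 1 to 6 can be sketched in NumPy as follows. This is an illustrative reading of the steps, not the thesis implementation: the function name `nonneg_cg`, the arguments `tol` and `max_cycles`, the flag `use_step2`, and the cap of $n$ CG steps per cycle are all assumptions. The right-hand side `b` is passed explicitly; for DL2 SVM it would be the vector of ones, as in (4.11). The matrix `A` is assumed symmetric positive definite, and exact zero tests are replaced by the tolerance `tol`.

```python
import numpy as np

def nonneg_cg(A, b, tol=1e-10, max_cycles=1000):
    """Sketch: minimize 0.5*x@A@x - b@x subject to x >= 0 (A assumed SPD)."""
    n = b.size
    x = np.zeros(n)                        # Step 1: feasible starting point
    r = b - A @ x                          # residual = negative gradient
    use_step2 = True                       # True: rebuild I with rule (4.12)
    for _ in range(max_cycles):
        if use_step2:
            # Step 2: bound components whose residual does not point inward
            I = (x == 0) & (r <= 0)
            if np.all(np.abs(r[~I]) < tol):
                return x
        # Step 3: first direction is the residual with components in I zeroed
        p = np.where(I, 0.0, r)
        use_step2 = True                   # default if the CG cycle runs out
        for _ in range(n):
            # Step 4: standard CG step, (4.14) to (4.17)
            s = A @ p
            d = p @ s
            a = (p @ r) / d
            x_new = x + a * p
            J = x_new < 0
            if J.any():
                # Step 6: shorten the step so that x lands on the boundary
                idx = np.flatnonzero(J)
                ratios = -x[idx] / p[idx]
                a_bar = ratios.min()
                x = np.maximum(x + a_bar * p, 0.0)
                x[idx[ratios == a_bar]] = 0.0   # remove round-off at the bound
                r = r - a_bar * s
                I = (x == 0)
                # free residuals vanish: go to Step 2, otherwise to Step 3
                use_step2 = bool(np.all(np.abs(r[~I]) < tol))
                break
            # Step 5: the full step is feasible
            r_new = r - a * s
            if np.all(np.abs(r_new[~I]) < tol):
                x = x_new
                r = b - A @ x              # explicit residual, as in (4.19)
                break                      # back to Step 2
            r_bar = np.where(I, 0.0, r_new)          # (4.20)
            p = r_bar - (s @ r_bar) / d * p          # (4.21)
            x, r = x_new, r_new
    return x

# Illustrative case (assumed values): the unconstrained minimizer is (3, -1),
# so the second component ends up held at zero.
A = np.array([[1.0, 2.0], [2.0, 5.0]])
print(nonneg_cg(A, np.ones(2)))
```

A returned point can be checked against the optimality conditions used in Step 2: every component is nonnegative, the residual is (numerically) zero where $x_i > 0$, and the residual is nonpositive where $x_i = 0$.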

The algorithm presented above finds the solution vector in a finite number of steps. It starts with a point $x^{(0)}$ that lies in the feasible region (nonnegative). If $x^{(0)}$ is not a minimum point of $f$ on the set $S$ of points $x$ whose components are nonnegative, the algorithm locates a pseudo-minimum point $x^{(1)}$ such that $f(x^{(1)}) < f(x^{(0)})$. If $x^{(1)} \ne x^{(*)}$, we locate a new pseudo-minimum point $x^{(2)}$ that minimizes the function on the feasible area, such that $f(x^{(2)}) < f(x^{(1)})$. If $x^{(2)} \ne x^{(*)}$, the process is repeated until the minimum point has been found. Since there are only a finite number of pseudo-minimum points of $f$ on $S$, this procedure eventually terminates at the minimum point $x^{(*)}$.

The conjugate gradient method, in the absence of round-off errors, produces the exact solution after a finite number of iterations, which is no larger than the size of the matrix. However, the method is sensitive to even small errors, so in practice the exact solution is never really obtained. The error accumulated in calculating the direction can be detrimental to convergence. To overcome this drawback, Fletcher and Reeves [33] suggested reverting to the direction of steepest descent after every $n$ or $(n + 1)$ iterations. To complicate things, accumulated round-off error in the recursive formulation of the residual (equation 4.17) may yield a false zero residual, or a value that falls within the predefined stopping criterion. This problem can be resolved by restarting with the explicitly computed residual of equation 4.11.
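Both safeguards can be sketched for the unconstrained case. In this minimal example the function name `cg_with_refresh` and the parameter `refresh_every` are hypothetical. The residual is recomputed explicitly, as in equation 4.11, and the direction is reset to steepest descent every `refresh_every` iterations and whenever the recursive residual appears to meet the stopping criterion.

```python
import numpy as np

def cg_with_refresh(A, b, tol=1e-10, refresh_every=None, max_iter=1000):
    """Unconstrained CG with periodic restarts and an explicit residual check."""
    n = b.size
    if refresh_every is None:
        refresh_every = n                 # restart after every n iterations
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    for k in range(1, max_iter + 1):
        s = A @ p
        d = p @ s
        a = (p @ r) / d
        x = x + a * p
        r = r - a * s                     # recursive residual, equation 4.17
        looks_converged = np.linalg.norm(r) < tol
        if looks_converged or k % refresh_every == 0:
            r = b - A @ x                 # explicit residual, equation 4.11
            if np.linalg.norm(r) < tol:
                return x, k               # accepted only after the explicit check
            p = r.copy()                  # restart along steepest descent
        else:
            p = r - (s @ r) / d * p       # usual conjugate direction update
    return x, max_iter
```

The explicit residual costs one additional matrix-vector product, so it is computed only at restarts and at apparent convergence rather than in every iteration.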

To summarize, the CG method monotonically improves the approximation vector until the exact solution is reached within some tolerance. The speed of improvement depends on the condition number of the matrix $A$: the larger the condition number, the slower the convergence.
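The dependence on the condition number can be illustrated with a small experiment. The diagonal test matrices, the relative tolerance, and the function name `cg_iterations` are assumptions chosen for illustration; they are not taken from the DL2 SVM problem.

```python
import numpy as np

def cg_iterations(A, b, rel_tol=1e-8, max_iter=5000):
    """Return the number of unconstrained CG iterations needed to reach rel_tol."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    p = r.copy()
    stop = rel_tol * np.linalg.norm(b)
    for k in range(max_iter):
        if np.linalg.norm(r) < stop:
            return k
        s = A @ p
        d = p @ s
        a = (p @ r) / d
        x = x + a * p
        r = r - a * s
        p = r - (s @ r) / d * p
    return max_iter

n = 200
b = np.ones(n)
for kappa in (1e1, 1e2, 1e4):
    # Diagonal SPD matrix with eigenvalues spread evenly over [1, kappa],
    # so its condition number is exactly kappa.
    A = np.diag(np.linspace(1.0, kappa, n))
    print(f"condition number {kappa:g}: {cg_iterations(A, b)} iterations")
```

The matrix size is the same in all three runs, so any difference in the iteration counts comes from the eigenvalue spread alone.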
