3.4 Choice of A Classifier Support Vector Machines (SVM)
3.4.4 SVM Optimization Algorithms
Since training an SVM is a quadratic optimization problem, many quadratic programming (QP) solvers can be applied to it. However, most of the early approaches were ad hoc and achieved optimization by one of the following:
– Taking advantage of the sparsity in the quadratic part of the objective function (Iterative searching/chunking methods),
– Performing successive applications of a very simple direction search (Direction search methods),
– Calculating kernel coefficients on the fly (Decomposition methods).
All these techniques address the computational difficulty faced by the predictor discussed in Section 3.4.3. We do not discuss each of these algorithms in detail, only those used in this thesis: sequential minimal optimization (SMO) and stochastic gradient descent (SGD), both of which we use to optimize the SVM objective function. Each is described below.
3.4.4.1 Sequential Minimal Optimization
Sequential Minimal Optimization (SMO) is a simple algorithm that can quickly solve the SVM QP problem without any extra matrix storage and without numerical QP optimization steps (Platt, 1998). It works by decomposing the main QP problem into the smallest possible QP sub-problems, called working sets. Each working set involves two Lagrange multipliers $\alpha_1$ and $\alpha_2$, which are optimized jointly in closed form, relying on Osuna's theorem (Osuna et al., 1997), while the other $\alpha_i$'s are kept fixed. The algorithm can be summarized in two parts: (1) a set of heuristics for efficiently choosing the pairs of Lagrange multipliers to work on, and (2) the analytical solution to a QP problem of size two.
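SMO maximizes the following objective function in dual form:
$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$
subject to $0 \le \alpha_i \le C$ for all $i$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.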
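As an illustration of the dual problem above, the following minimal sketch evaluates $L_D$ and checks the two constraints for a given vector of multipliers. The function names (`linear_kernel`, `dual_objective`, `is_feasible`) and the tolerance are hypothetical choices made for this example, and a linear kernel is assumed in place of a general $K$.

```python
def linear_kernel(u, v):
    """K(u, v) = u . v (assumed here; any valid kernel could be passed instead)."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alphas, X, y, kernel=linear_kernel):
    """L_D = sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    n = len(alphas)
    quadratic = 0.0
    for i in range(n):
        for j in range(n):
            quadratic += alphas[i] * alphas[j] * y[i] * y[j] * kernel(X[i], X[j])
    return sum(alphas) - 0.5 * quadratic


def is_feasible(alphas, y, C, tol=1e-9):
    """Check the box constraint 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    in_box = all(-tol <= a <= C + tol for a in alphas)
    balanced = abs(sum(a * yi for a, yi in zip(alphas, y))) <= tol
    return in_box and balanced


# Two points on opposite sides of the origin, one per class.
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
alphas = [0.5, 0.5]
print(dual_objective(alphas, X, y), is_feasible(alphas, y, C=1.0))
```

The double loop costs $O(N^2)$ kernel evaluations, which is exactly the cost that decomposition methods such as SMO avoid paying in full at every step.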
Thus, for any two multipliers $\alpha_1$ and $\alpha_2$, the constraints reduce to:
$$0 \le \alpha_1, \alpha_2 \le C,$$
$$y_1 \alpha_1 + y_2 \alpha_2 = k,$$
where $k$ is the negated sum of the remaining terms in the equality constraint $\sum_{i=1}^{N} \alpha_i y_i = 0$, that is, $k = -\sum_{i=3}^{N} \alpha_i y_i$, which is fixed in each iteration. There is a one-to-one relationship between each Lagrange multiplier and each training example. Once the Lagrange multipliers are determined, the normal vector $w$ and the threshold $b$ can be derived from them:
$$w = \sum_{i=1}^{N} y_i \alpha_i \phi(x_i), \qquad b = w^T \phi(x_k) - y_k \quad \text{for some training example } x_k \text{ with } 0 < \alpha_k < C. \tag{3.8}$$
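The analytic solution of the size-two sub-problem can be sketched as follows. This is a simplified illustration of the pair update commonly given for SMO (Platt, 1998), not the implementation used in this thesis: the function name `smo_pair_step` is hypothetical, the errors `E1` and `E2` are assumed to be the current SVM outputs minus the labels, and pairs with non-positive curvature are simply skipped rather than handled specially.

```python
def smo_pair_step(a1, a2, y1, y2, E1, E2, K11, K12, K22, C):
    """One analytic update of the multiplier pair (a1, a2).

    E1, E2 are assumed to be (current SVM output - label) for the two examples;
    K11, K12, K22 are the kernel values among them. Returns the new pair.
    """
    # The equality constraint y1*a1 + y2*a2 = k confines the pair to a line;
    # the box 0 <= a <= C cuts that line to the segment [low, high] for a2.
    if y1 != y2:
        low = max(0.0, a2 - a1)
        high = min(C, C + a2 - a1)
    else:
        low = max(0.0, a1 + a2 - C)
        high = min(C, a1 + a2)

    # Second derivative of the objective along the constraint line.
    curvature = K11 + K22 - 2.0 * K12
    if low >= high or curvature <= 0.0:
        return a1, a2  # simplification: leave degenerate pairs unchanged

    # Unconstrained optimum for a2, then clip it to the feasible segment.
    a2_new = a2 + y2 * (E1 - E2) / curvature
    a2_new = min(high, max(low, a2_new))

    # Move a1 so that y1*a1 + y2*a2 keeps the same value k.
    a1_new = a1 + y1 * y2 * (a2 - a2_new)
    return a1_new, a2_new


a1, a2 = smo_pair_step(0.2, 0.5, 1, -1, 0.3, -0.4, 1.0, 0.2, 1.0, C=1.0)
print(a1, a2)
```

Because the step is closed-form, no numerical QP solver is needed inside the loop; the remaining work in SMO is the heuristic choice of which pair to update next.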
Because $w$ can be computed via Equation 3.8 from the training data before use, the amount of computation required to evaluate a linear SVM is constant with respect to the number of support vectors (the examples with non-zero multipliers). The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Overall, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while the standard chunking SVM algorithm scales somewhere between linear and cubic in the training set size. SMO's computation time is dominated by kernel evaluation; hence SMO is fastest for linear SVMs and sparse data sets (Platt, 1998). This algorithm is also implemented in LIBSVM, the popular machine learning toolbox used in this research.
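To make the point about linear SVM evaluation concrete, the sketch below recovers $w$ and $b$ from a set of multipliers and then classifies a new point with a single dot product. It assumes a linear kernel, so that $\phi(x) = x$, and it assumes the sign convention implied by Equation 3.8, under which the decision value is $w^T x - b$. The names `recover_linear_svm` and `predict` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def recover_linear_svm(alphas, X, y, C, tol=1e-8):
    """Compute w = sum_i y_i alpha_i x_i and b = w . x_k - y_k (linear kernel assumed)."""
    w = [0.0] * len(X[0])
    for a, x, yi in zip(alphas, X, y):
        if a > tol:  # only support vectors contribute to w
            for j, xj in enumerate(x):
                w[j] += yi * a * xj
    # Pick a multiplier strictly inside the box to compute the threshold.
    free = [i for i, a in enumerate(alphas) if tol < a < C - tol]
    if not free:
        raise ValueError("no multiplier strictly between 0 and C")
    k = free[0]
    b = dot(w, X[k]) - y[k]
    return w, b


def predict(w, b, x):
    """Classify x by the sign of w . x - b; cost does not depend on the number of support vectors."""
    return 1 if dot(w, x) - b >= 0 else -1


X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
w, b = recover_linear_svm([0.5, 0.5], X, y, C=1.0)
print(w, b, predict(w, b, [2.0, 1.0]))
```

Once `w` is stored, the training examples and multipliers are no longer needed at prediction time, which is why evaluation cost is independent of the number of support vectors in the linear case.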
3.4.4.2 Stochastic Gradient Descent Learning
Stochastic gradient descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) support vector machines and logistic regression. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention only recently in the context of large-scale learning and sparse machine learning problems often encountered in text classification and natural language processing.
Given a set of training examples $(x_1, y_1), \ldots, (x_N, y_N)$ where $x_i \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$, our goal is to learn a linear scoring function $f(x) = w^T x + b$ with model parameters $w \in \mathbb{R}^m$ and intercept $b \in \mathbb{R}$. To make predictions, we simply look at the sign of $f(x)$. A common way to find the model parameters is to minimize the regularized training error given by:
$$E(w, b) = \sum_{i=1}^{N} L(y_i, f(x_i)) + \alpha R(w), \tag{3.9}$$
where $L$ is a loss function that measures model (mis)fit and $R$ is a regularization term that penalizes model complexity; $\alpha > 0$ is a positive hyper-parameter. Different choices for $L$ yield different classifiers, such as:
– Hinge: (soft-margin) Support Vector Machines.
– Log: Logistic Regression.
– Least-Squares: Ridge Regression.
– Epsilon-Insensitive: (soft-margin) Support Vector Regression.
Popular choices for the regularization term R include:
$$\text{L2 norm: } R(w) := \frac{1}{2} \sum_{j=1}^{m} w_j^2,$$
$$\text{L1 norm: } R(w) := \sum_{j=1}^{m} |w_j|,$$
$$\text{Elastic net: } R(w) := \frac{\rho}{2} \sum_{j=1}^{m} w_j^2 + (1 - \rho) \sum_{j=1}^{m} |w_j|,$$
the last being a convex combination of L2 and L1, where $\rho$ is given by 1 − l1_ratio and l1_ratio controls the relative weight of the L1 and L2 penalties. The algorithm iterates over the training examples and, for each example, updates the model parameters according to the update rule:
$$w \leftarrow w - \eta \left( \alpha \frac{\partial R(w)}{\partial w} + \frac{\partial L(w^T x_i + b, y_i)}{\partial w} \right), \tag{3.10}$$
where η is the learning rate which controls the step-size in the parameter space. The intercept b is updated similarly but without regularization. The learning rate η can be either constant or gradually decaying. For classification, the default learning rate schedule is given by:
$$\eta^{(t)} = \frac{1}{\alpha (t_0 + t)}, \tag{3.11}$$
where $t$ is the time step (there are $N_{samples} \times N_{iter}$ time steps in total) and $t_0$ is determined by a heuristic proposed by Leon Bottou (Bottou et al., 2008) such that the expected initial updates are comparable with the expected size of the weights (assuming that the norm of the training samples is approximately 1).
The major advantage of SGD is its computational efficiency, which is essentially linear in the number of training examples. If $X$ is a matrix of size $(N, p)$, training has a cost of $O(k N \bar{p})$, where $k$ is the number of iterations (epochs) and $\bar{p}$ is the average number of non-zero attributes per sample. Recent theoretical results, however, show that the runtime needed to reach a desired optimization accuracy does not increase as the training set size increases. For multi-class classification with SGD, Bottou's implementation (Bottou et al., 2008) uses a one-versus-all strategy.
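The update rule (3.10) and the learning rate schedule (3.11) can be sketched for the hinge loss with an elastic net penalty as follows. This is a minimal illustration under stated assumptions, not Bottou's implementation: the function names are hypothetical, `t0` is a fixed constant rather than the value given by Bottou's heuristic, dense Python lists stand in for sparse vectors, and the subgradient of $|w_j|$ at zero is taken to be zero.

```python
import random


def learning_rate(t, alpha, t0):
    """Schedule (3.11): eta(t) = 1 / (alpha * (t0 + t))."""
    return 1.0 / (alpha * (t0 + t))


def sgd_step(w, b, x, y, eta, alpha, rho):
    """One application of update rule (3.10) with hinge loss L = max(0, 1 - y f(x))."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    active = y * score < 1.0  # the hinge loss has a non-zero subgradient only here
    new_w = []
    for wj, xj in zip(w, x):
        sign = (wj > 0) - (wj < 0)
        grad_R = rho * wj + (1.0 - rho) * sign  # elastic net; rho = 1 gives pure L2
        grad_L = -y * xj if active else 0.0
        new_w.append(wj - eta * (alpha * grad_R + grad_L))
    # The intercept is updated without regularization.
    new_b = b + eta * y if active else b
    return new_w, new_b


def train(X, y, alpha=0.01, rho=1.0, t0=1000.0, n_iter=50, seed=0):
    """Run n_iter epochs over shuffled examples; t0 is an assumed constant."""
    rng = random.Random(seed)
    w, b, t = [0.0] * len(X[0]), 0.0, 0
    order = list(range(len(X)))
    for _ in range(n_iter):
        rng.shuffle(order)
        for i in order:
            w, b = sgd_step(w, b, X[i], y[i], learning_rate(t, alpha, t0), alpha, rho)
            t += 1
    return w, b


def predict(w, b, x):
    """Predict with the sign of f(x) = w . x + b."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


X = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]
w, b = train(X, y)
print([predict(w, b, x) for x in X])
```

Each step touches a single example, so one epoch costs time proportional to the number of non-zero attributes in the data, in line with the $O(k N \bar{p})$ cost stated above. Swapping the hinge subgradient for the gradient of another loss would give the other classifiers listed earlier.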