Previous Work - Balancing flexibility and robustness in machine learning: semi-parametric metho

We review two methods for constructing sparse linear classifiers that exploit a network of feature dependencies. Both are used as benchmarks for comparison: the network-based support vector machine (NBSVM) (Zhu et al., 2009) and the graph lasso (GL) (Jacob et al., 2009). Both methods build linear models whose coefficients are determined by minimizing a penalized loss function. The penalty term enforces sparsity in the vector of model parameters and also takes into account the network of feature dependencies: features that are linked by edges in the network tend to be either both excluded from or both included in the learned model.

5.2.1 Network-based Support Vector Machine

Zhu et al. (2009) present an extension of the standard support vector machine (SVM) (Vapnik, 1995) that takes into account a network encoding feature dependencies. In this model, a sparsity-enforcing penalty is added to the hinge loss of the standard SVM so that features that are linked in the network tend to be either both excluded from or both included in the model. Given a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$ with features $x_i \in \mathbb{R}^{d+1}$ and corresponding class labels $y_i \in \{-1, 1\}$, the network-based SVM (NBSVM) searches for the parameter vector $w = (w_0, \ldots, w_d)^{\mathsf T}$ that minimizes

\[
\sum_{i=1}^{n} \left[ 1 - y_i w^{\mathsf T} x_i \right]_+ + \lambda \sum_{\{i,j\} \in E} \max\left\{ |w_i|, |w_j| \right\}, \tag{5.1}
\]

where $[a]_+ = \max\{a, 0\}$, $E$ is the set of edges in the network of feature dependencies and $\lambda$ is a positive regularization parameter. The zeroth component of each $x_i$ is assumed to be constant and equal to 1, so that $w_0$ is the bias coefficient of the model. Typically, $w_0$ is not regularized in the NBSVM. For this reason, the zeroth feature is not linked to any other feature in the network of feature dependencies. The absolute values in the penalty term favor sparsity in the classification model. Additionally, if a specific feature is excluded from the model, then the penalty term in (5.1) favors the exclusion of features that share an edge with it. This is a consequence of the singular nature of the $\max\{|\cdot|, |\cdot|\}$ function at the origin (Zou and Yuan, 2008). The minimization of (5.1) is a linear programming (LP) problem (Zhu et al., 2009) and can therefore be solved efficiently with standard LP solvers.
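One standard way to see why (5.1) is a linear program is to introduce auxiliary variables. The sketch below is an illustration under that assumption and is not necessarily the exact formulation used by Zhu et al. (2009); the slack variables $\xi_i$ (one per training example) and $t_{ij}$ (one per edge) are names introduced here and do not appear in the original text.

\[
\begin{aligned}
\min_{w,\,\xi,\,t} \quad & \sum_{i=1}^{n} \xi_i + \lambda \sum_{\{i,j\} \in E} t_{ij} \\
\text{subject to} \quad & \xi_i \ge 1 - y_i w^{\mathsf T} x_i, \quad \xi_i \ge 0, && i = 1, \ldots, n, \\
& -t_{ij} \le w_i \le t_{ij}, \quad -t_{ij} \le w_j \le t_{ij}, && \{i,j\} \in E.
\end{aligned}
\]

The objective and all constraints are linear in $(w, \xi, t)$. Because $\lambda > 0$ and each $\xi_i$ and $t_{ij}$ is minimized, the constraints are tight at the optimum: $\xi_i = [1 - y_i w^{\mathsf T} x_i]_+$ and $t_{ij} = \max\{|w_i|, |w_j|\}$, so the optimal value coincides with the minimum of (5.1).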

5.2.2 Graph Lasso

The graph lasso (GL) was introduced by Jacob et al. (2009) as a regularization method that yields a sparse linear model in which the selected features tend to be connected to each other in a graph. Before describing the penalty term used by GL, it is useful to introduce some notation. Let $w = (w_0, \ldots, w_d)^{\mathsf T}$ be the parameter vector of a linear model and let $G = (V, E)$ be a network whose vertices $V = \{0, \ldots, d\}$ correspond to features. $E$ is the set of edges that connect features. The elements of $E$ are sets $\{i, j\}$ such that $i, j \in V$. Let $D$ be the set of vertices (features) that are not linked to any other vertices in $G$. For any vector $v = (v_0, \ldots, v_d)^{\mathsf T}$, the quantity $\|v\|$ represents the Euclidean norm of $v$. Let $\operatorname{supp}(v) \subset V$ denote the support of $v$; namely, the set of features $i \in V$ such that $v_i \neq 0$. Given $v$ and an edge $e \in E$, $v^{e}$ is the 2-dimensional vector $(v_i, v_j)^{\mathsf T}$, where $i$ and $j$ are the two features linked by $e$, and $i \le j$. Similarly, $v^{D}$ is the $|D|$-dimensional vector given by the components of $v$ that belong to $D$.

To construct the penalty function in GL, we consider a decomposition of the vector w as a sum of |E| + 1 vectors:

\[
w = u + \sum_{i=1}^{|E|} v_i, \tag{5.2}
\]

where $u$ is a vector whose only non-zero components correspond to the disconnected part of the graph ($\operatorname{supp}(u) \subset D$), and $v_i$ is a vector whose only non-zero components are the elements corresponding to the vertices linked by $e_i$, the $i$-th edge in $E$ ($\operatorname{supp}(v_i) \subset e_i$). This decomposition is not unique. Let $\mathcal{V}_w$ be the set of $(|E| + 1)$-tuples $(v_1, \ldots, v_{|E|}, u)$ that correspond to all possible decompositions of $w$ of this type. The GL regularization term is

\[
\Omega^{E}_{\text{graph}}(w) = \min_{(v_1, \ldots, v_{|E|}, u) \in \mathcal{V}_w} \sum_{i=1}^{|E|} \left\| v_i^{e_i} \right\|, \tag{5.3}
\]

which is written in terms of the decomposition of $w$ that minimizes the sum of the Euclidean norms of the vectors that correspond to edges. The Euclidean norm in (5.3) enforces sparsity at the edge level in $w$. Specifically, if one of the components of $v_i^{e_i}$ is zero, the value of the Euclidean norm is the absolute value of the other component. This is akin to a lasso penalty, which favors driving this second component to zero as well (Yuan and Lin, 2006). This form of regularization favors weight vectors $w$ whose support is the union of $D$, the disconnected part of the graph, and a subset of the edges in $E$. In contrast with the sparsity patterns generated by other network-based methods, which tend to select connected components in the network, the edges included in the model by GL are not necessarily connected to each other.
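The non-uniqueness of the decomposition in (5.2) and the role of the minimum in (5.3) can be illustrated on a toy graph. The following is a minimal sketch, not part of the original method description: the function names `decomposition_cost` and `recompose`, the four-vertex graph and the numerical values are all hypothetical choices made for illustration.

```python
import math

def decomposition_cost(edge_vectors):
    """Sum of Euclidean norms of the per-edge pairs v_i^{e_i} of one decomposition."""
    return sum(math.hypot(a, b) for a, b in edge_vectors)

def recompose(u, edges, edge_vectors):
    """Rebuild w = u + sum_i v_i, where v_i is zero outside edge e_i = (j, k)."""
    w = list(u)
    for (j, k), (a, b) in zip(edges, edge_vectors):
        w[j] += a
        w[k] += b
    return w

# Hypothetical toy graph: vertices 0..3, edges {1,2} and {2,3}, so D = {0}.
edges = [(1, 2), (2, 3)]
u = [0.5, 0.0, 0.0, 0.0]  # supported on D only (the bias)

# Two decompositions of the same w = (0.5, 1, 1, 0). Vertex 2 belongs to both
# edges, so its weight can be split between v_1 and v_2 in different ways.
dec_a = [(1.0, 1.0), (0.0, 0.0)]  # all of w_2 assigned to the first edge
dec_b = [(1.0, 0.5), (0.5, 0.0)]  # w_2 split evenly between both edges

cost_a = decomposition_cost(dec_a)  # sqrt(2), about 1.414
cost_b = decomposition_cost(dec_b)  # sqrt(1.25) + 0.5, about 1.618

# Coarse scan over the share t of w_2 given to the first edge; (5.3) takes
# the minimum of the cost over all valid decompositions.
omega_scan = min(
    decomposition_cost([(1.0, t), (1.0 - t, 0.0)])
    for t in (s / 100 for s in range(0, 101))
)
```

In this toy case every valid decomposition is determined by the single share `t`, and the scan attains its smallest value at `t = 1`, where the second edge is not used at all. This matches the edge-level sparsity described above: the penalty prefers to explain $w$ with as few active edges as possible.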

This penalty function is combined with the negative log-likelihood of a logistic regression model to obtain a network-based sparse classifier. Given a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^{d+1}$ is the feature vector and $y_i \in \{-1, 1\}$ is the class label of the $i$-th example, the GL method searches for the $w$ that minimizes $\sum_{i=1}^{n} \ell(y_i, x_i, w) + \lambda\, \Omega^{E}_{\text{graph}}(w)$, where $\lambda > 0$ is a regularization parameter,

\[
\ell(y_i, x_i, w) = \frac{y_i - 1}{2} \log\left( 1 - \sigma(w^{\mathsf T} x_i) \right) - \frac{y_i + 1}{2} \log \sigma(w^{\mathsf T} x_i) \tag{5.4}
\]

and $\sigma(\cdot)$ is the logistic function. The zeroth component of each $x_i$ is constant and equal to 1, so that $w_0$ is the bias coefficient of the model. To avoid regularizing $w_0$, the zeroth component of each $x_i$ is not connected to any other feature in the network $G$. The optimization problem can be readily solved by duplicating the features in the dataset that are involved in edges of $G$. Specifically, the original feature vector of the $i$-th instance, $x_i$, is replaced by the enlarged vector $\tilde{x}_i$ obtained by concatenating copies of the features, one copy per edge: $\tilde{x}_i = (x_i^{e_1}, \ldots, x_i^{e_{|E|}}, x_i^{D})^{\mathsf T}$.
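The feature duplication step can be sketched in a few lines. This is an illustrative sketch only: the helper names `isolated_vertices` and `expand_features`, the five-feature example and the use of plain Python lists are assumptions made here, not details taken from the original text.

```python
def isolated_vertices(n_vertices, edges):
    """Return D: the vertices that do not appear in any edge, in increasing order."""
    linked = {v for edge in edges for v in edge}
    return [v for v in range(n_vertices) if v not in linked]

def expand_features(x, edges, isolated):
    """Build the enlarged vector (x^{e_1}, ..., x^{e_|E|}, x^D).

    Each edge contributes its own copy of its two features (smaller index
    first), followed by the features in D, which are not duplicated.
    """
    x_tilde = []
    for j, k in edges:
        j, k = min(j, k), max(j, k)
        x_tilde.extend([x[j], x[k]])
    x_tilde.extend(x[v] for v in isolated)
    return x_tilde

# Hypothetical example: 5 features (index 0 is the constant bias feature),
# edges {1,2} and {2,3}; features 0 and 4 are not linked to anything.
edges = [(1, 2), (2, 3)]
D = isolated_vertices(5, edges)
x = [1.0, 0.2, -0.7, 1.5, 3.0]
x_tilde = expand_features(x, edges, D)
```

Feature 2 is shared by both edges and therefore appears twice in `x_tilde`, which has length $2|E| + |D|$. After this expansion the groups of coordinates no longer overlap, which is what allows a standard group-wise penalty to be applied.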

Using these expanded feature vectors, the optimization problem becomes

\[
\min_{\tilde{w}} \sum_{i=1}^{n} \ell(y_i, \tilde{x}_i, \tilde{w}) \quad \text{subject to} \quad \sum_{i=1}^{|E|} \left\| (\tilde{w}_{2i-1}, \tilde{w}_{2i}) \right\| \le M, \tag{5.5}
\]

where $\tilde{w} = (\tilde{w}_1, \ldots, \tilde{w}_{2|E|+|D|})^{\mathsf T}$ and $M$ is a positive regularization parameter that is in a one-to-one relation with $\lambda$. Once a solution to this expanded problem has been found, a minimizer of the original problem can be computed by noting that, at both optima, $u^{D} = (\tilde{w}_{2|E|+1}, \ldots, \tilde{w}_{2|E|+|D|})^{\mathsf T}$, $v_i^{e_i} = (\tilde{w}_{2i-1}, \tilde{w}_{2i})^{\mathsf T}$, and $w$ is equal to the sum of all the $v_i$ and $u$. The constrained optimization in (5.5) is a group-lasso regularization problem (Kim et al., 2006; Yuan and Lin, 2006), which can be solved efficiently using the method described by Roth and Fischer (2008).
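The mapping from a solution of the expanded problem back to the original weight vector can be sketched as follows. This is a minimal illustration, assuming the coordinate ordering described above (one pair per edge, then the coordinates in $D$); the function name `collapse_weights` and all numerical values are hypothetical, and the code uses 0-based indices where the text uses 1-based ones.

```python
def collapse_weights(w_tilde, n_vertices, edges, isolated):
    """Map expanded weights back to w = u + sum_i v_i.

    w_tilde[2*i], w_tilde[2*i + 1] hold v_i^{e_i} for the i-th edge (0-based),
    and the last len(isolated) entries hold u^D.
    """
    w = [0.0] * n_vertices
    for i, (j, k) in enumerate(edges):
        j, k = min(j, k), max(j, k)
        w[j] += w_tilde[2 * i]
        w[k] += w_tilde[2 * i + 1]
    offset = 2 * len(edges)
    for pos, v in enumerate(isolated):
        w[v] += w_tilde[offset + pos]
    return w

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

# Hypothetical example: 5 features, edges {1,2} and {2,3}, D = {0, 4}.
edges = [(1, 2), (2, 3)]
D = [0, 4]
w_tilde = [0.5, 0.25, 0.75, -1.0, 2.0, 0.0]  # (v_1^{e_1}, v_2^{e_2}, u^D)
w = collapse_weights(w_tilde, 5, edges, D)

# The linear score is unchanged by the expansion: w_tilde . x_tilde == w . x.
x = [1.0, 0.2, -0.7, 1.5, 3.0]
x_tilde = [0.2, -0.7, -0.7, 1.5, 1.0, 3.0]  # x expanded in the same ordering
score_expanded = dot(w_tilde, x_tilde)
score_original = dot(w, x)
```

The two scores agree for any `w_tilde`, not only at the optimum, because the weight of a shared feature in `w` is the sum of the weights of its copies. This is why the loss term takes the same value in the original and in the expanded problem.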

In document Balancing flexibility and robustness in machine learning: semi-parametric methods and sparse linear models (Page 93-96)