Sparsity and Shrinkage methods - Probabilistic multiple kernel learning

In the previous sections we reviewed approaches for regression and classification where the resulting model utilises the entire training set of past observations and attributes (denoted by the design matrix X, the feature expansion φ(X) or the kernel matrix K) for predicting novel responses. In many cases this is unfeasible or undesirable because of memory and computing restrictions, so this section introduces sparse approaches that utilise only a subset of observations and/or attributes.

The main reasons for aiming at sparse solutions are:

• Scalability - Methods that utilise the whole training set become computationally unfeasible for large data collections (either in number of attributes D or number of samples N). Kernel-based methods, which are governed by an $O(N^3)$ complexity, require sparse solutions to scale up to large application scenarios.

• Interpretation - Identifying the significant samples or attributes for the prediction task at hand can be crucial in some application areas such as bioinformatics, medical informatics and all cases where information and intuition about the problem’s characteristics are more important than just a prediction output. The context in which a sample or attribute is judged as significant for the prediction task can be statistical, e.g. marginal likelihood (Tipping 1999, Damoulas et al. 2008) or predictive likelihood (Lawrence et al. 2003, Girolami and Rogers 2006), information theoretic, e.g. information gain (MacKay 1992a), or geometric, e.g. decision boundary construction (Vapnik and Chervonenkis 1964, Vapnik 1995).

• Prediction Accuracy - Sparse models can improve the prediction accuracy on a problem because they accept some additional bias (how well the model describes the specific training set of the phenomenon) in exchange for reduced variance (how much the resulting model will vary when trained on a different training set of the same phenomenon). This is achieved by obtaining a sparse solution that is based on a subset of informative observations or attributes and is hence less likely to fit the noise.

In the following subsections a brief review of the main sparsity and shrinkage methods is offered together with the corresponding advantages and limitations.

2.7.1 Ridge Regression and the Lasso

In statistical learning theory sparsity is achieved via appropriate regularisation, and different linear regression approaches have been developed according to the specific penalising term used. Ridge Regression (Hastie et al. 2001) adds a quadratic penalising term to the ordinary least squares (OLS) objective and hence effectively minimises:

$$\sum_{i=1}^{N}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2 + \lambda\sum_{d=1}^{D} w_d^2 \tag{2.74}$$

while the Lasso (Tibshirani 1996) employs a different nonlinear penalty term (the $L_1$ norm) and minimises:

$$\sum_{i=1}^{N}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2 + \lambda\sum_{d=1}^{D} |w_d| \tag{2.75}$$

The subtle differences in the penalising terms have a significant effect on the resulting estimates and the sparsity obtained. The Lasso has better interpretation properties, as it shrinks some regression coefficients exactly to zero and translates the others towards zero by a constant (Tibshirani 1996). Ridge regression, in contrast, only scales all of the coefficients by a constant factor through its quadratic penalty term, so none of them becomes exactly zero. Both descriptions hold exactly when the design matrix is orthonormal.
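The contrast between the two penalties can be illustrated for an orthonormal design (X^T X = I), a simplifying assumption introduced here for the example. Under that assumption, minimising (2.74) rescales each OLS coefficient by 1/(1 + λ), whereas minimising (2.75) soft-thresholds it at λ/2. The function names and the numbers below are illustrative only.

```python
import numpy as np

def ridge_orthonormal(w_ols, lam):
    # Ridge solution for an orthonormal design: uniform rescaling.
    return w_ols / (1.0 + lam)

def lasso_orthonormal(w_ols, lam):
    # Lasso solution for an orthonormal design: soft-thresholding at lam / 2
    # (the factor 1/2 arises because the squared error carries no 1/2).
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam / 2.0, 0.0)

# Hypothetical OLS coefficients, chosen only for illustration.
w_ols = np.array([3.0, -1.5, 0.4, -0.2])
lam = 1.0

w_ridge = ridge_orthonormal(w_ols, lam)   # every coefficient halved, none zero
w_lasso = lasso_orthonormal(w_ols, lam)   # small coefficients set exactly to zero
print(w_ridge)
print(w_lasso)
```

With these values the two smallest coefficients fall below the threshold λ/2 and are removed by the Lasso, while ridge regression keeps all four.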

Both approaches apply the same amount of shrinkage to every regression coefficient, since there is a single global factor λ, and hence (coefficient) selection results can be inconsistent. To address this, recent work by Zou (2006) proposed the adaptive Lasso, an extension that utilises an individual shrinkage level $\lambda_d$ for each coefficient:

$$\sum_{i=1}^{N}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2 + \sum_{d=1}^{D} \lambda_d |w_d| \tag{2.76}$$

The shrinkage levels are generally estimated through cross-validation (multiple partitions of the dataset into training and test sets) or an analytical unbiased estimate of risk (Tibshirani 1996, Berger 1985). For further theoretical analysis, convergence guarantees and direct connections to the standard penalised least squares estimator see Zou (2006), Wang and Leng (2007) and references therein.
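A minimal sketch of the adaptive Lasso objective (2.76) is given below, assuming the weights are set from an initial OLS fit as λ_d = λ / |w_d^OLS| (one common choice following Zou (2006), with λ = 1 fixed here rather than cross-validated). The solver is plain coordinate descent; the function names, the synthetic data and the number of sweeps are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    # Scalar soft-thresholding operator.
    return np.sign(z) * max(abs(z) - t, 0.0)

def weighted_lasso_cd(X, y, lam, w_init, n_sweeps=200):
    """Coordinate descent for sum_i (y_i - w^T x_i)^2 + sum_d lam[d] * |w_d|."""
    w = w_init.copy()
    col_sq = np.sum(X ** 2, axis=0)              # x_d^T x_d for each attribute
    for _ in range(n_sweeps):
        for d in range(X.shape[1]):
            # Residual with the contribution of attribute d removed.
            r_d = y - X @ w + X[:, d] * w[d]
            w[d] = soft_threshold(X[:, d] @ r_d, lam[d] / 2.0) / col_sq[d]
    return w

# Synthetic data (illustrative): only attributes 0 and 2 are relevant.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_true = np.array([2.0, 0.0, -3.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
lam_d = 1.0 / np.abs(w_ols)      # heavier penalty where the OLS estimate is small
w_adaptive = weighted_lasso_cd(X, y, lam_d, w_init=w_ols)
print(w_adaptive)
```

Because the penalty on a coefficient grows as its OLS estimate shrinks, irrelevant attributes receive a large individual λ_d, while relevant ones are only lightly shrunk.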

2.7.2 Sparsity in Kernel Methods

The previous section reviewed standard sparse methods on linear models with a linear estimating function. The same analysis directly applies to basis function or other feature expansions for obtaining nonlinear responses. However, the induced shrinkage from the penalty terms is with respect to the regression coefficients and acts on the features of each input sample and not on the size of the training set. Hence any sparsity is on the dimensionality of the regressors and identifies significant and non-significant attributes based on the MSE loss function.

In the kernel setting, similar penalising constraints on the regression coefficients result in sample-wise sparsity, i.e. a kernel-based Lasso (Roth 2004) that will identify significant and non-significant training samples instead of attributes. The general kernel-based Lasso function to be minimised is:

$$\sum_{i=1}^{N}\left(y_i - \mathbf{w}^T\mathbf{k}_i\right)^2 + \lambda\sum_{i=1}^{N} |w_i| \tag{2.77}$$

where now the regression coefficients $\mathbf{w} \in \mathbb{R}^N$ operate on the kernel matrix and the shrinkage effectively prunes out training samples.
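The kernel-based objective (2.77) can be minimised with any Lasso solver applied to the kernel matrix. The sketch below uses iterative soft-thresholding (proximal gradient descent), which is a generic choice made here and not necessarily the algorithm of Roth (2004). The RBF kernel, its length scale, the toy data and the value of λ are all assumptions for illustration.

```python
import numpy as np

def rbf_kernel(x, length_scale=1.0):
    # Squared-exponential kernel matrix for 1-D inputs.
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def objective(K, y, w, lam):
    # Kernel Lasso objective: squared error plus L1 penalty on w.
    return np.sum((y - K @ w) ** 2) + lam * np.sum(np.abs(w))

def kernel_lasso_ista(K, y, lam, n_iter=2000):
    # Step size 1/L, where L = 2 * sigma_max(K)^2 bounds the gradient's
    # Lipschitz constant for the squared-error term.
    step = 1.0 / (2.0 * np.linalg.norm(K, 2) ** 2)
    w = np.zeros(K.shape[0])
    for _ in range(n_iter):
        grad = -2.0 * K.T @ (y - K @ w)
        z = w - step * grad
        # Proximal step for the L1 penalty: soft-thresholding.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w

rng = np.random.default_rng(1)
x = np.linspace(-5.0, 5.0, 40)
y = np.sinc(x) + 0.05 * rng.standard_normal(x.size)
K = rbf_kernel(x)
lam = 0.1

w = kernel_lasso_ista(K, y, lam)
retained = np.flatnonzero(w)     # indices of training samples with non-zero weight
print(len(retained), "of", x.size, "samples retained")
```

Each coefficient is tied to one training sample, so a coefficient that is thresholded to zero removes that sample from the predictor. Once λ reaches 2·max_i |k_i^T y|, every coefficient is zero.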

One other prominent sparse kernel method is the Support Vector Machine (SVM) (Vapnik 1995), which is a geometric method that maximises the smallest distance between the decision boundary and the closest samples (the margin). This results in a penalising term on the regression coefficients, $\frac{1}{2}\|\mathbf{w}\|^2$, which is the squared $L_2$ norm. The resulting sparse solutions from SVMs retain only training samples that are close to the decision boundary, due to the initial assumptions of the model. These samples are termed support vectors, as they are responsible for defining or “supporting” the boundary.

SVMs have the drawback of not producing probabilistic outputs, as they are “decision” machines (Bishop 2006), and the resulting sparsity levels are moderate when compared to alternative sparse kernel methods such as the Relevance Vector Machine that is briefly described in the next section.

2.7.3 Sparsity in Bayesian Inference

In the Bayesian framework, the sparsity-inducing role of regularisation is performed by the prior distributions placed on the model’s variables. For example, the Lasso estimate coincides with the maximum a posteriori (MAP) estimate obtained by placing a Laplace prior on the regression coefficients. Hence in this framework no ad-hoc penalising term needs to be introduced; instead we can formally place appropriate prior distributions that induce sparsity via the principle of Automatic Relevance Determination (ARD) (MacKay 2004).
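The correspondence between the Lasso and the Laplace prior can be made explicit with a short derivation. It assumes a Gaussian noise model with variance σ² and independent zero-mean Laplace priors with a common scale b; the symbol b is introduced here for illustration.

```latex
\begin{align*}
p(\mathbf{y}\mid\mathbf{w},\sigma^2) &= \prod_{i=1}^{N}
  \mathcal{N}\!\left(y_i \mid \mathbf{w}^T\mathbf{x}_i,\ \sigma^2\right),
\qquad
p(\mathbf{w}\mid b) = \prod_{d=1}^{D} \frac{1}{2b}
  \exp\!\left(-\frac{|w_d|}{b}\right) \\[4pt]
-\log p(\mathbf{w}\mid\mathbf{y},\sigma^2,b) &=
  \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i-\mathbf{w}^T\mathbf{x}_i\right)^2
  + \frac{1}{b}\sum_{d=1}^{D}|w_d| + \text{const} \\[4pt]
\hat{\mathbf{w}}_{\mathrm{MAP}} &= \arg\min_{\mathbf{w}}
  \sum_{i=1}^{N}\left(y_i-\mathbf{w}^T\mathbf{x}_i\right)^2
  + \lambda\sum_{d=1}^{D}|w_d|,
\qquad \lambda = \frac{2\sigma^2}{b}
\end{align*}
```

The last line is obtained by multiplying the negative log posterior by 2σ², which does not change the minimiser. A narrower prior (smaller b) or noisier data (larger σ²) therefore corresponds to a stronger Lasso penalty.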

ARD describes the Bayesian process by which sparsity inducing prior distributions on the parameters, such as the Laplace or the Student-t prior, effectively determine the “relevance” of a feature (or sample in kernel-based methods) based on the evidence from the data. The two dominant ARD approaches within the Bayesian paradigm and the Machine Learning community are the Relevance Vector Machines (RVMs) (Tipping 2001) and the Informative Vector Machines (Lawrence et al. 2003).

RVMs employ a hierarchical prior formulation with a zero-mean Gaussian distribution on the parameters and a Gamma distribution on the scales of the Gaussian. Marginalising the scales results in an implicit Student-t distribution on the regression coefficients which, similarly to the Laplace, concentrates probability mass at the mean (zero) and in the tails of the distribution. This forces coefficients that are not supported by the evidence to shrink to zero, while allowing significant ones to remain non-zero.
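The marginalisation can be written out for a single coefficient. The derivation assumes that the scale α_d acts as a precision (inverse variance), as in Tipping (2001), and that the Gamma hyper-prior has shape a and rate b; both symbols are introduced here for illustration.

```latex
\begin{align*}
p(w_d \mid \alpha_d) &= \mathcal{N}\!\left(w_d \mid 0,\ \alpha_d^{-1}\right),
\qquad
p(\alpha_d) = \frac{b^{a}}{\Gamma(a)}\,\alpha_d^{\,a-1} e^{-b\alpha_d} \\[4pt]
p(w_d) &= \int_0^\infty p(w_d \mid \alpha_d)\, p(\alpha_d)\, d\alpha_d
 = \frac{b^{a}}{\Gamma(a)\sqrt{2\pi}}
   \int_0^\infty \alpha_d^{\,a+\frac{1}{2}-1}
   e^{-\alpha_d\left(b + w_d^2/2\right)} d\alpha_d \\[4pt]
 &= \frac{b^{a}\,\Gamma\!\left(a+\tfrac{1}{2}\right)}{\Gamma(a)\sqrt{2\pi}}
   \left(b + \frac{w_d^2}{2}\right)^{-\left(a+\frac{1}{2}\right)}
\end{align*}
```

The result is a Student-t density in w_d. Its polynomial tails decay far more slowly than those of a Gaussian, which is what allows a few coefficients to remain large while most are drawn towards zero.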

The main driving force behind the RVM formalism is the maximisation of the marginal likelihood with respect to the hyper-parameters (regression):

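$$p(\mathbf{y}\mid\boldsymbol{\alpha}, \sigma) = \int p(\mathbf{y}\mid\mathbf{w}, \sigma)\, p(\mathbf{w}\mid\boldsymbol{\alpha})\, d\mathbf{w} \tag{2.78}$$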

where, as before, α are the scales and $\sigma^2$ is the noise term in regression. This maximisation is known as type-II maximum likelihood (type-II ML) and it leads to efficient and incremental ways (Tipping and Faul 2003, Faul and Tipping 2002) to prune out and include features or samples based on their contribution to the marginal likelihood. The resulting solutions are typically very sparse, but the scalability to multiclass classification is problematic due to the $MC \times MC$ Hessian matrix required for the Laplace approximation.
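For the regression case the integral in (2.78) is Gaussian and has a closed form, log p(y|α, σ²) = -½ [N log 2π + log|C| + yᵀC⁻¹y] with C = σ²I + Φ A⁻¹ Φᵀ and A = diag(α), as given by Tipping (2001). The sketch below evaluates this quantity, assuming that α holds precisions and that Φ is an N × M design (or kernel) matrix. The function name and the toy data are illustrative.

```python
import numpy as np

def rvm_log_marginal_likelihood(Phi, y, alpha, sigma2):
    """Log marginal likelihood of the regression model in closed form.

    Phi    : (N, M) design or kernel matrix
    y      : (N,) targets
    alpha  : (M,) prior precisions of the coefficients (assumed convention)
    sigma2 : noise variance
    """
    N = y.shape[0]
    # C = sigma2 * I + Phi diag(1/alpha) Phi^T; broadcasting scales the columns.
    C = sigma2 * np.eye(N) + (Phi / alpha) @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)

# Toy data: the response depends on the first basis function only.
rng = np.random.default_rng(2)
Phi = rng.standard_normal((30, 2))
y = 1.5 * Phi[:, 0] + 0.1 * rng.standard_normal(30)

# Compare keeping both basis functions with effectively pruning the second
# one (a very large precision pins its coefficient at zero).
print(rvm_log_marginal_likelihood(Phi, y, np.array([1.0, 1.0]), 0.01))
print(rvm_log_marginal_likelihood(Phi, y, np.array([1.0, 1e8]), 0.01))
```

Type-II ML searches over α (and σ²) for the setting that maximises this value. Basis functions whose precision diverges during that search are the ones pruned from the model.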

A further sparse Bayesian approach, the Informative Vector Machine (IVM), was suggested by Lawrence et al. (2003) within the context of Gaussian Processes (Rasmussen and Williams 2006), which directly model the estimating function ŷ by placing an appropriate (data dependent) zero-mean Gaussian distribution on the possible functions. The sparse approximation follows an information theoretic criterion based on the entropy contribution of each sample, and it is competitive with SVMs in training times.

The criterion proposed is the differential entropy score, which in effect favours samples that reduce the variance of the predictive distribution. Sparsity levels are comparable to SVMs, with the additional benefit of probabilistic outputs. However, similarly to SVMs, they are binary classification methods, and multiclass problems require multiple dichotomies of the solution space through one-versus-one or other ad-hoc procedures.
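The greedy entropy-based selection can be sketched in a simplified setting. The IVM itself handles classification likelihoods through approximate inference, whereas the code below assumes Gaussian observation noise so that every update is available in closed form. Under that assumption, including sample i reduces the posterior entropy by ½ log(1 + ς_i / σ²), where ς_i is the current posterior variance at that sample. The function name, kernel and parameters are illustrative.

```python
import numpy as np

def greedy_entropy_selection(K, sigma2, n_active):
    """Greedily pick samples by entropy reduction (Gaussian-noise simplification)."""
    N = K.shape[0]
    Sigma = K.copy()                       # posterior covariance; starts at the prior
    is_active = np.zeros(N, dtype=bool)
    active, scores = [], []
    for _ in range(n_active):
        var = np.maximum(np.diag(Sigma), 0.0)
        # Entropy reduction obtained by including each candidate sample.
        delta_h = 0.5 * np.log1p(var / sigma2)
        delta_h[is_active] = -np.inf       # never reselect a sample
        i = int(np.argmax(delta_h))
        # Rank-one update of the posterior covariance after observing sample i.
        Sigma = Sigma - np.outer(Sigma[:, i], Sigma[i, :]) / (var[i] + sigma2)
        is_active[i] = True
        active.append(i)
        scores.append(float(delta_h[i]))
    return active, scores, Sigma

# Toy 1-D inputs with a squared-exponential kernel (illustrative choice).
x = np.linspace(-3.0, 3.0, 25)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
active, scores, Sigma = greedy_entropy_selection(K, sigma2=0.1, n_active=5)
print(active)
```

In this simplified setting the score depends only on the current posterior variance, so each step selects the sample about which the model is most uncertain, and the remaining variances can only decrease.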
