
3.1 Sparse regression

3.1.1 Spike and slab model

A Bayesian variable selection approach is the so-called spike and slab model of Mitchell and Beauchamp [104], which has the posterior $p(\theta|D) \propto p(\theta)\,p(D|\theta)$. The prior $p(\theta)$ uses the so-called $\ell_0$-pseudo-norm, which regulates the number of selected features. The relevance of the features is indicated by a bit vector $\theta = (\theta_1, \ldots, \theta_N)$ with $\theta_j = 1$ if feature $j$ is selected or relevant, and $\theta_j = 0$ if it is irrelevant. The $\ell_0$-norm is formulated as $\|\theta\|_0 = \sum_{j=1}^{N} \theta_j$ and penalizes the prior density of the bit vector through a Bernoulli distribution:

$$p(\theta) = \prod_{j=1}^{N} \mathrm{Ber}(\theta_j|\pi_0) = \pi_0^{\|\theta\|_0}\,(1-\pi_0)^{N-\|\theta\|_0} \tag{3.3}$$

where $\pi_0$ is the probability $p(\theta_j = 1)$ that a feature is selected into the model. Hence low values of $\pi_0$ penalize the number of features in $\theta$, and high values promote a large number of selected features. The feature vector $\theta$ affects the prior probability of the weight vector $w = (w_1, \ldots, w_N)$ by setting a weight $w_j$ to zero if the corresponding feature $j$ is defined as irrelevant with $\theta_j = 0$. Whenever $\theta_j = 1$, the weight $w_j$ can be expected to be non-zero. In this case a reasonable prior is a normal distribution with a mean of zero and a variance $\sigma_w^2$ that controls how strongly the weight can fluctuate around the mean, scaled by an additional noise variance $\sigma^2$:

$$p(w_j|\sigma^2, \theta_j) = (1-\theta_j)\,\delta_0(w_j) + \theta_j\,\mathcal{N}(w_j|0, \sigma^2\sigma_w^2) \tag{3.4}$$

The first term $\delta_0(\cdot)$ is a point probability mass that causes a spike at zero, and the second term is referred to as the slab in the case when $\sigma_w^2 \to \infty$ and $\mathcal{N}(w_j)$ approaches a uniform distribution. The prior for the selected feature set $\theta$ and the weights prior are combined in the full posterior:

$$p(\theta|D) \propto p(\theta)\,p(D|\theta) = p(\theta|\pi_0)\,p(y|X,\theta) = p(\theta|\pi_0) \int\!\!\int p(y|X,w,\theta)\,p(w|\theta,\sigma^2)\,p(\sigma^2)\,dw\,d\sigma^2 \tag{3.5}$$

A disadvantage of using the $\ell_0$-pseudo-norm is that the values $\|\theta\|_0$ are discrete, which makes the objective function non-smooth and the optimization problem non-convex. Hence replacing the discrete prior with a continuous one leads to a convex approximation of the non-convex optimization problem.
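The prior in Equations 3.3 and 3.4 can be illustrated with a minimal NumPy sketch. The function names and the example values for $\pi_0$, $\sigma^2$ and $\sigma_w^2$ are assumptions chosen for illustration; they do not come from the text.

```python
import numpy as np

def log_bernoulli_prior(theta, pi0):
    """Log of Eq. 3.3: ||theta||_0 * log(pi0) + (N - ||theta||_0) * log(1 - pi0)."""
    theta = np.asarray(theta)
    k = int(theta.sum())   # ||theta||_0, the number of selected features
    n = theta.size
    return k * np.log(pi0) + (n - k) * np.log(1.0 - pi0)

def sample_spike_and_slab(n_features, pi0, sigma2, sigma2_w, rng):
    """Draw (theta, w) from the prior in Eqs. 3.3 and 3.4."""
    theta = rng.binomial(1, pi0, size=n_features)   # theta_j ~ Ber(pi0)
    slab = rng.normal(0.0, np.sqrt(sigma2 * sigma2_w), size=n_features)
    w = theta * slab                                # w_j is exactly zero when theta_j = 0
    return theta, w

# Illustrative values only (assumptions, not taken from the text).
rng = np.random.default_rng(0)
theta, w = sample_spike_and_slab(n_features=10, pi0=0.2, sigma2=1.0, sigma2_w=4.0, rng=rng)

# With a small pi0, a sparse bit vector has a higher prior density than a dense one.
lp_sparse = log_bernoulli_prior([1, 0, 0, 0, 0], pi0=0.1)
lp_dense = log_bernoulli_prior([1, 1, 1, 1, 0], pi0=0.1)
print(theta, w, lp_sparse, lp_dense)
```

The sketch only samples from the prior. Evaluating the posterior in Equation 3.5 would additionally require the marginal likelihood for each of the candidate bit vectors, which is the expensive part.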


3.1.2 $\ell_1$ and $\ell_2$ regularization

The posterior $p(\theta|D)$ in Equation 3.5 ranges over $2^N$ possible models, which are computationally expensive to explore because $\theta$ is a discrete parameter vector. The spike and slab prior on $\theta$ can be replaced with a prior on the continuous weight variables $w$ that encourages $w_j = 0$ through a distribution that concentrates much of its probability density at zero. A Laplace distribution with a spike at its zero mean ($\mu = 0$) and heavy tails, parametrized by a regularization term $\lambda$, can be formulated as

$$p(w|\lambda) = \prod_{j=1}^{N} \mathrm{Lap}(w_j|0, 1/\lambda) \propto \prod_{j=1}^{N} e^{-\lambda|w_j|} \tag{3.6}$$
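The claim that the Laplace prior places much density at zero while keeping heavy tails can be checked numerically. The sketch below compares $\mathrm{Lap}(0, 1/\lambda)$ with a Gaussian of the same variance. The comparison itself, the value of $\lambda$ and the evaluation points are assumptions made for illustration.

```python
import math

def laplace_pdf(w, lam):
    """Density of Lap(w | 0, 1/lam), i.e. (lam / 2) * exp(-lam * |w|)."""
    return 0.5 * lam * math.exp(-lam * abs(w))

def gaussian_pdf(w, variance):
    """Density of a zero-mean normal distribution with the given variance."""
    return math.exp(-w * w / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

lam = 2.0                  # illustrative value (assumption)
variance = 2.0 / lam**2    # the variance of Lap(0, 1/lam) is 2 * (1/lam)^2

# Ratio of Laplace to Gaussian density at the mode and far out in the tail.
ratio_at_zero = laplace_pdf(0.0, lam) / gaussian_pdf(0.0, variance)
ratio_in_tail = laplace_pdf(4.0, lam) / gaussian_pdf(4.0, variance)
print(ratio_at_zero, ratio_in_tail)
```

Both ratios exceed one: at matched variance the Laplace density is higher at zero and also higher in the tails than the Gaussian, with less mass in between.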

The negative logarithm of this prior yields, up to an additive constant, $\sum_{j=1}^{N} \lambda|w_j| = \lambda\|w\|_1$, where $\|w\|_1$ is the $\ell_1$-norm of $w$ and $\lambda$ is the scaling parameter that controls the strength of regularization. This prior can be used for MAP estimation: minimizing the negative log-likelihood alone is equivalent to MAP estimation under a uniform prior, and the Laplace prior adds a penalty term to that objective. An estimate for the weight parameter $\hat{w}$ can thus be obtained by minimizing the negative logarithm of the posterior in Equation 3.2:

$$\hat{w}_{MAP} = \operatorname*{argmax}_{w} \big\{ p(D|w)\,p(w) \big\} = \operatorname*{argmin}_{w} \big\{ -\log p(D|w) - \log p(w) \big\} \tag{3.7}$$

The first term in Equation 3.7 becomes, up to an additive constant, $\frac{1}{2\sigma^2} \sum_{n=1}^{M} \big(y_n - w_0 - \sum_{j=1}^{N} w_j x_{nj}\big)^2$ in a linear regression scenario with Gaussian likelihood, and the second term is the negative logarithm of the previously described Laplace prior. Multiplying the objective by $2\sigma^2$ removes the factor $1/(2\sigma^2)$ from the first term, and one recovers the residual sum of squares (RSS) that quantifies the loss of the linear model.

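$$\hat{w}_{MAP} = \operatorname*{argmin}_{w} \Big\{ \sum_{n=1}^{M} \Big(y_n - w_0 - \sum_{j=1}^{N} w_j x_{nj}\Big)^2 + \lambda' \sum_{j=1}^{N} |w_j| \Big\} = \operatorname*{argmin}_{w} \Big\{ \|y - X^T w\|_2^2 + \lambda' \|w\|_1 \Big\} \tag{3.8}$$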

where the penalty factor is $\lambda' = 2\lambda\sigma^2$. This equation is also known as the Lasso described in Section 2.4 and represents, in essence, the Lagrangian form of a constrained optimization problem in which the RSS is a quadratic objective function subject to the constraint that the penalty term $\|w\|_1$ stays within the bound $B$:

$$\hat{w} = \operatorname*{argmin}_{w} \|y - X^T w\|_2^2 \quad \text{s.t.} \quad \|w\|_1 \le B \tag{3.9}$$

$B$ is inversely related to the penalty $\lambda$ and is an upper bound on the $\ell_1$-norm of the weights: a small value of $B$ corresponds to a large value of $\lambda$, hence the penalization of the weights $w$ is stronger than with a relaxed constraint $B$.
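The inverse relation between the penalty and the $\ell_1$-norm of the solution can be observed numerically. The following sketch minimizes the objective of Equation 3.8 with proximal gradient descent (ISTA), which is one possible solver and not necessarily the one used in this work. The design matrix `A` plays the role of $X^T$ (samples in rows), the intercept $w_0$ is omitted, and the synthetic data and penalty values are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink towards zero, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iter=2000):
    """Minimize ||y - A w||_2^2 + lam * ||w||_1 by proximal gradient descent (ISTA)."""
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ w - y)               # gradient of the RSS term
        w = soft_threshold(w - step * grad, lam * step)
    return w

# Synthetic data (assumption): only the first two of eight features are relevant.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 8))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = A @ w_true + 0.1 * rng.normal(size=100)

# A larger penalty corresponds to a smaller l1-norm of the solution, i.e. a smaller B.
l1_norms = [np.abs(lasso_ista(A, y, lam)).sum() for lam in (1.0, 50.0, 500.0)]

# For lam >= 2 * max|A^T y| the solution is exactly zero, which corresponds to B -> 0.
lam_max = 2.0 * np.max(np.abs(A.T @ y))
w_zero = lasso_ista(A, y, 1.01 * lam_max)
print(l1_norms, w_zero)
```

The value of $\|\hat{w}\|_1$ reached for a given penalty is the bound $B$ for which the constrained form in Equation 3.9 has the same solution.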

Figure 3.1: Geometric interpretation of the $\ell_1$- and $\ell_2$-norm, plotted over the weight axes $w_1$ and $w_2$. The left plot illustrates the $\ell_1$-norm with weight estimates $\hat{w}$ touching the boundary of the diamond-shaped constrained area. This is a solution to the optimization problem and will encourage weights to take on values of zero because of the particular shape of the constraint. The right plot shows the $\ell_2$-norm constrained area, which has the form of a circle. In this case weights are shrunk but not driven exactly to zero because of the round shape. Based on Figure 3.12 from Hastie et al. [70].

The interpretation of $B$ for the $\ell_1$-norm is illustrated geometrically for a 2-dimensional weight vector in the left plot of Figure 3.1. The grey diamond-shaped area is defined by the $\ell_1$-norm, and its size is determined by $B$. The area thus acts as the boundary that intersects the elliptical contours of the objective function around the estimate $\hat{w}$. Relaxing $B$ causes the shape to grow until it touches the contours of the objective function. For small $B$, and hence a small constraining area, this is likely to occur on one of the axes, i.e. values of $w_j = 0$ will be encouraged because of the specific geometric shape of the diamond. In Figure 3.1 this is the case for $w_1 = 0$, whereas $w_2 \neq 0$. When $B \to 0$, the area contracts to the origin and all weights approach zero, $\hat{w} \to 0$. The right plot in Figure 3.1 illustrates the case of ridge regression, which has an $\ell_2$-norm constraint:

$$\hat{w} = \operatorname*{argmin}_{w} \|y - X^T w\|_2^2 \quad \text{s.t.} \quad \|w\|_2^2 \le B \tag{3.10}$$

or in the Lagrangian form:

$$\hat{w} = \operatorname*{argmin}_{w} \|y - X^T w\|_2^2 + \lambda \|w\|_2^2$$
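For the Lagrangian form of ridge regression the minimizer is available in closed form, $\hat{w} = (A^T A + \lambda I)^{-1} A^T y$, where `A` stands for $X^T$ (samples in rows). The sketch below uses this standard result on synthetic data. The data, the penalty values and the omission of the intercept are assumptions made for illustration.

```python
import numpy as np

def ridge(A, y, lam):
    """Closed-form minimizer of ||y - A w||_2^2 + lam * ||w||_2^2."""
    n_features = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n_features), A.T @ y)

# Synthetic data (assumption): only the first two of eight features are relevant.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 8))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = A @ w_true + 0.1 * rng.normal(size=100)

w_weak = ridge(A, y, lam=1.0)       # weak penalty, close to least squares
w_strong = ridge(A, y, lam=500.0)   # strong penalty, heavily shrunk weights

print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
print(np.count_nonzero(w_strong))   # number of weights that are not exactly zero
```

A stronger penalty shrinks the norm of the weight vector, but the weights of the irrelevant features stay small and non-zero rather than exactly zero, in line with the round constraint region in the right plot of Figure 3.1.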

In document Machine learning in systems biology at different scales : from molecular biology to ecology (Page 55-59)