3.1 Sparse regression
3.1.1 Spike and slab model
A Bayesian variable selection approach is the so-called spike and slab model of Mitchell and Beauchamp [104], which has the posterior $p(\theta|D) \propto p(\theta)\,p(D|\theta)$. The prior $p(\theta)$ involves the so-called $\ell_0$-pseudo-norm, which regulates the number of selected features. The relevance of the features is indicated by a bit vector $\theta = (\theta_1, \ldots, \theta_N)$ with $\theta_j = 1$ if feature $j$ is selected or relevant, and $\theta_j = 0$ if it is irrelevant. The $\ell_0$-norm is formulated as $\|\theta\|_0 = \sum_{j=1}^{N} \theta_j$ and penalizes the prior density of the bit vector through a Bernoulli distribution:
\[
p(\theta) = \prod_{j=1}^{N} \mathrm{Ber}(\theta_j \mid \pi_0) = \pi_0^{\|\theta\|_0} (1 - \pi_0)^{N - \|\theta\|_0} \tag{3.3}
\]
where $\pi_0$ is the probability $p(\theta_j = 1)$ that a feature is selected into the model. Hence low values of $\pi_0$ penalize the number of features in $\theta$, and high values promote a large number of selected features. The feature vector $\theta$ affects the prior probability of the weight vector $w = (w_1, \ldots, w_N)$ by setting a weight $w_j$ to zero if the corresponding feature $j$ is defined as irrelevant with $\theta_j = 0$. Whenever $\theta_j = 1$, the weight $w_j$ can be expected to be non-zero. In this case a reasonable prior is a normal distribution with a mean of zero and a variance $\sigma_w^2$ that controls how strongly the weight can fluctuate around the mean, scaled by an additional noise variance $\sigma^2$:
\[
p(w_j \mid \sigma^2, \theta_j) = (1 - \theta_j)\,\delta_0(w_j) + \theta_j\,\mathcal{N}(w_j \mid 0, \sigma^2 \sigma_w^2) \tag{3.4}
\]
The first term $\delta_0(\cdot)$ is a point probability mass that causes a spike at zero. The second term is referred to as the slab because, as $\sigma_w^2 \to \infty$, $\mathcal{N}(w_j \mid 0, \sigma^2 \sigma_w^2)$ approaches a uniform distribution. The prior for the selected feature set $\theta$ and the prior on the weights are combined in the full posterior:
\[
p(\theta|D) \propto p(\theta)\,p(D|\theta) = p(\theta|\pi_0)\,p(y|X,\theta) = p(\theta|\pi_0) \iint p(y|X,w,\theta)\,p(w|\theta,\sigma^2)\,p(\sigma^2)\,dw\,d\sigma^2 \tag{3.5}
\]
A disadvantage of using the $\ell_0$-pseudo-norm is that the values $\|\theta\|_0$ are discrete, which makes the objective function non-smooth and the optimization problem non-convex. Replacing the discrete prior with a suitable continuous prior therefore leads to a convex approximation of the non-convex optimization problem.
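The two priors in Equations 3.3 and 3.4 can be made concrete with a small numerical sketch. The function names `log_bernoulli_prior` and `sample_spike_and_slab`, and the values chosen for $\pi_0$, $\sigma^2$ and $\sigma_w^2$, are illustrative assumptions and are not taken from the text.

```python
import numpy as np


def log_bernoulli_prior(theta, pi0):
    """Log of Eq. 3.3: ||theta||_0 * log(pi0) + (N - ||theta||_0) * log(1 - pi0)."""
    theta = np.asarray(theta)
    k = int(theta.sum())  # ||theta||_0, the number of selected features
    n = theta.size
    return k * np.log(pi0) + (n - k) * np.log1p(-pi0)


def sample_spike_and_slab(theta, sigma2, sigma2_w, rng):
    """Draw w from Eq. 3.4: exactly zero where theta_j = 0, Gaussian slab where theta_j = 1."""
    theta = np.asarray(theta)
    slab = rng.normal(0.0, np.sqrt(sigma2 * sigma2_w), size=theta.size)
    return np.where(theta == 1, slab, 0.0)


rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
theta = np.array([1, 0, 0, 1, 0])
w = sample_spike_and_slab(theta, sigma2=1.0, sigma2_w=4.0, rng=rng)

# With a small pi0, a sparse bit vector receives a higher prior density than a dense one.
sparse_logp = log_bernoulli_prior([1, 0, 0, 0, 0], pi0=0.1)
dense_logp = log_bernoulli_prior([1, 1, 1, 1, 0], pi0=0.1)
print(w, sparse_logp, dense_logp)
```

The prior density depends on $\theta$ only through the count $\|\theta\|_0$, which is why exploring the posterior remains a combinatorial search over bit vectors.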
44 Chapter 3
3.1.2 $\ell_1$ and $\ell_2$ regularization
The posterior $p(\theta|D)$ in Equation 3.5 ranges over $2^N$ possible models, which are computationally expensive to explore because $\theta$ is a discrete parameter vector. The spike and slab prior on $\theta$ can be replaced with a prior on the continuous weight variables $w$ that encourages $w_j = 0$ through a distribution that concentrates much of its probability density at zero. A Laplace distribution with a spike at its mean of zero ($\mu = 0$) and heavy tails, parametrized by a regularization term $\lambda$, can be formulated as
\[
p(w|\lambda) = \prod_{j=1}^{N} \mathrm{Lap}(w_j \mid 0, 1/\lambda) \propto \prod_{j=1}^{N} e^{-\lambda |w_j|} \tag{3.6}
\]
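The link between the Laplace prior in Equation 3.6 and an $\ell_1$ penalty can be checked numerically. The sketch below assumes the standard Laplace density $\frac{\lambda}{2} e^{-\lambda |w_j|}$ with scale $1/\lambda$. The function name `neg_log_laplace_prior` and the example vectors are hypothetical.

```python
import numpy as np


def neg_log_laplace_prior(w, lam):
    """Negative log of Eq. 3.6, with the normalizing constant lam / 2 included."""
    w = np.asarray(w, dtype=float)
    density = 0.5 * lam * np.exp(-lam * np.abs(w))  # Lap(w_j | 0, 1 / lam)
    return -np.sum(np.log(density))


lam = 1.5
w_a = np.array([0.0, 2.0, -1.0])
w_b = np.array([0.5, 0.0, 0.0])

# The normalizing constant cancels in differences, leaving lam times the
# difference of the l1-norms.
gap = neg_log_laplace_prior(w_a, lam) - neg_log_laplace_prior(w_b, lam)
l1_gap = lam * (np.abs(w_a).sum() - np.abs(w_b).sum())
print(gap, l1_gap)
```

Because the constant $-N \log(\lambda/2)$ does not depend on $w$, it has no influence on the location of the minimum.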
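The negative logarithm of this prior yields, up to an additive constant, $\sum_{j=1}^{N} \lambda |w_j| = \lambda \|w\|_1$, where $\|w\|_1$ is the $\ell_1$-norm of $w$ and $\lambda$ is the scaling parameter that controls the strength of regularization. This prior can be used for MAP estimation: under a uniform prior the MAP estimate coincides with the minimizer of the negative log-likelihood, whereas the Laplace prior contributes an additional penalty term. An estimate $\hat{w}$ for the weight parameter can thus be obtained by minimizing the negative logarithm of the posterior in Equation 3.2:
\[
\hat{w}_{MAP} = \operatorname*{argmax}_{w} \big\{ p(D|w)\,p(w) \big\} = \operatorname*{argmin}_{w} \big\{ -\log p(D|w) - \log p(w) \big\} \tag{3.7}
\]
In a linear regression scenario with Gaussian likelihood, the first term in Equation 3.7 becomes, up to an additive constant, $\frac{1}{2\sigma^2} \sum_{n=1}^{M} \big( y_n - w_0 - \sum_{j=1}^{N} w_j x_{nj} \big)^2$, and the second term is the negative logarithm of the previously described Laplace prior. Multiplying the objective by $2\sigma^2$ eliminates the factor $1/(2\sigma^2)$ from the first term, and one recovers the residual sum of squares (RSS) that quantifies the loss of the linear model.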
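\[
\hat{w}_{MAP} = \operatorname*{argmin}_{w} \Big\{ \sum_{n=1}^{M} \Big( y_n - w_0 - \sum_{j=1}^{N} w_j x_{nj} \Big)^2 + \lambda' \sum_{j=1}^{N} |w_j| \Big\} = \operatorname*{argmin}_{w} \big\{ \|y - X^T w\|_2^2 + \lambda' \|w\|_1 \big\} \tag{3.8}
\]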
where the penalty factor is $\lambda' = 2\lambda\sigma^2$. This formulation is also known as the Lasso described in Section 2.4. In essence it is the Lagrangian form of a constrained optimization problem in which the RSS is a quadratic objective function subject to a constraint on the penalty term $\|w\|_1$ with bound $B$:
\[
\hat{w} = \operatorname*{argmin}_{w} \|y - X^T w\|_2^2 \quad \text{s.t.} \quad \|w\|_1 \le B \tag{3.9}
\]
$B$ is an upper bound on the $\ell_1$-norm and is inversely related to the penalty $\lambda$: a small value of $B$ corresponds to a large value of $\lambda$, so the weights $w$ are penalized more strongly than under a relaxed (larger) bound $B$.
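The inverse relation between the penalty and the bound $B$ can be illustrated by solving the Lagrangian form in Equation 3.8 for several penalty values and recording the resulting $\|\hat{w}\|_1$. The sketch below uses proximal gradient descent (ISTA), a solver chosen here for brevity and not prescribed by the text. It assumes the intercept $w_0$ is omitted and that the array `X` stores one sample per row, so that `X @ w` plays the role of $X^T w$. The synthetic data and all names are illustrative.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrinks towards zero and clips at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize ||y - X w||_2^2 + lam * ||w||_1 by proximal gradient descent."""
    # Step size 1 / L, where L = 2 * sigma_max(X)^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * (X.T @ (X @ w - y))  # gradient of the residual sum of squares
        w = soft_threshold(w - step * grad, step * lam)
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.0])  # synthetic sparse weights
y = X @ w_true + 0.1 * rng.normal(size=50)

# For penalties at or above lam_max the all-zero vector is optimal.
lam_max = 2.0 * np.max(np.abs(X.T @ y))
lams = [0.01 * lam_max, 0.5 * lam_max, lam_max]
l1_norms = [np.abs(lasso_ista(X, y, lam)).sum() for lam in lams]
print(l1_norms)
```

Each value in `l1_norms` can be read as the bound $B$ that the corresponding penalty implicitly enforces: the larger the penalty, the smaller the $\ell_1$-norm of the solution.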
Figure 3.1: Geometric interpretation of the $\ell_1$- and $\ell_2$-norm in the ($w_1$, $w_2$) plane. The left plot illustrates the $\ell_1$-norm, with the weight estimate $\hat{w}$ touching the boundary of the diamond-shaped constraint region. This point is a solution to the optimization problem, and the particular shape of the constraint encourages weights to take on values of exactly zero. The right plot shows the $\ell_2$-norm constraint region, which is circular. In this case the round shape does not push weights to exactly zero. Based on Figure 3.12 from Hastie et al. [70].
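The geometry of Figure 3.1 can be reproduced numerically in one special case. If the contours of the objective are circles (an assumption made only for this sketch, corresponding to an orthonormal design), the constrained solution is the Euclidean projection of the unconstrained estimate onto the constraint region. The starting point `w_unconstrained`, the bound `B` and the function names are hypothetical. The $\ell_1$ projection follows the standard sort-and-threshold procedure.

```python
import numpy as np


def project_l1_ball(v, B):
    """Euclidean projection of v onto the diamond {w : ||w||_1 <= B}."""
    if np.abs(v).sum() <= B:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]  # magnitudes in descending order
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - B)[0][-1]  # last index that stays non-zero
    tau = (css[rho] - B) / (rho + 1)  # shrinkage applied to every coordinate
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def project_l2_ball(v, B):
    """Euclidean projection of v onto the disc {w : ||w||_2^2 <= B}."""
    radius = np.sqrt(B)
    norm = np.linalg.norm(v)
    return v.copy() if norm <= radius else v * (radius / norm)


w_unconstrained = np.array([0.3, 2.0])  # hypothetical unconstrained estimate (w1, w2)
B = 1.0

w_l1 = project_l1_ball(w_unconstrained, B)  # lands on a corner of the diamond
w_l2 = project_l2_ball(w_unconstrained, B)  # is only rescaled towards the origin
print(w_l1, w_l2)
```

In this example the $\ell_1$ projection sets $w_1$ to exactly zero while keeping $w_2$ non-zero, whereas the $\ell_2$ projection shrinks both coordinates and leaves both non-zero.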
The interpretation of $B$ for the $\ell_1$-norm is illustrated geometrically for a 2-dimensional weight vector in the left plot of Figure 3.1. The grey diamond-shaped area is defined by the $\ell_1$-norm, and its size is determined by $B$. The area thus acts as the boundary that intersects the elliptical contours of the objective function. Relaxing $B$ causes the shape to grow until it touches these contours. For small $B$, and hence a small constraint region, this is likely to occur along one of the axes, i.e. values of $w_j = 0$ are encouraged because of the specific geometric shape of the diamond. In Figure 3.1 this is the case for $w_1 = 0$, whereas $w_2 \neq 0$. When $B \to 0$, the area collapses to the origin and all weights approach zero, $\hat{w} \to 0$. The right plot in Figure 3.1 illustrates the case of ridge regression, which has an $\ell_2$-norm constraint:
\[
\hat{w} = \operatorname*{argmin}_{w} \|y - X^T w\|_2^2 \quad \text{s.t.} \quad \|w\|_2^2 \le B \tag{3.10}
\]
or in the Lagrangian form:
\[
\hat{w} = \operatorname*{argmin}_{w} \|y - X^T w\|_2^2 + \lambda \|w\|_2^2
\]
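Unlike the Lasso, the Lagrangian form of ridge regression has a closed-form minimizer, obtained by setting the gradient of the objective to zero. The sketch below assumes that the array `X` stores one sample per row (so `X @ w` plays the role of $X^T w$) and that no intercept is fitted. The synthetic data and the name `ridge_closed_form` are illustrative.

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    """Minimize ||y - X w||_2^2 + lam * ||w||_2^2 via (X^T X + lam I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])  # synthetic weights, two of them zero
y = X @ w_true + 0.1 * rng.normal(size=40)

w_weak = ridge_closed_form(X, y, lam=0.1)
w_strong = ridge_closed_form(X, y, lam=100.0)

# A stronger penalty shrinks the weight vector, but the entries do not become exactly zero.
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong), w_strong)
```

A larger penalty reduces $\|\hat{w}\|_2$, matching a smaller bound $B$ in Equation 3.10, yet the weights of the irrelevant features are shrunk without being set to zero.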