In many regression problems of practical interest the number of training instances available for induction (n) is small and, simultaneously, the dimensionality of the data (d) is very large. Areas in which these types of problems arise include image analysis (Seeger et al., 2010), genetic microarray studies (Dudoit and Fridlyand, 2003), document processing (Sandler et al., 2008) and fMRI data modeling (van Gerven et al., 2009). To address these regression tasks, one usually assumes a simple multivariate linear model. However, when d > n, the calibration problem is under-determined because an infinite number of combinations of the values of the model coefficients can describe the data equally well. In many of these learning tasks only a subset of the measured features is expected to be relevant for prediction. Therefore, the calibration problem can be regularized by assuming that the vector of coefficients is sparse (Johnstone and Titterington, 2009). Different strategies can be used to obtain sparse solutions, in which most of the coefficients of the model are exactly zero. For instance, one can include in the objective function a penalty term proportional to the $\ell_1$ norm of the vector of coefficients (Tibshirani, 1996). In a Bayesian approach, sparsity can be favored by using sparsity-enforcing priors for the model coefficients. These priors are characterized by probability densities that are peaked at zero and simultaneously have a large probability mass in a wide range of non-zero values. This structure favors a bi-separation in the coefficients of the linear regression model: the posterior distribution of most coefficients is strongly peaked around zero. By contrast, a small subset of coefficients is assigned a large posterior probability of being significantly different from zero (Seeger et al., 2010). The fraction of coefficients whose posterior distribution is peaked at zero is the degree of sparsity of the model. Ishwaran and Rao (2005) call the aforementioned bi-separation effect selective shrinkage. Ideally, the posterior mean of truly zero coefficients should be shrunk towards zero, and the posterior mean of non-zero coefficients should be barely affected by the prior. Different sparsifying priors have been proposed in the machine learning and statistics literature. Some examples are the Laplace (Seeger, 2008), the degenerate Student’s t (Tipping, 2001) and the spike and slab (George and McCulloch, 1997) priors. Graphs of the corresponding probability distributions are displayed in Figure 4.1.
Figure 4.1: Graphs of spike and slab (left), Laplace (middle) and degenerate Student’s t (right) priors. Spike and slab priors consist of a mixture of a Gaussian density (the slab) and a point probability mass placed at zero (the spike), which is displayed by an arrow pointing upwards. The degenerate Student’s t is an improper prior that is obtained as the limit of a Student’s t distribution, in which the number of degrees of freedom approaches zero. This function diverges at the origin and cannot be normalized.
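The selective shrinkage effect described above can be illustrated in a one-dimensional toy problem. The following is only a sketch under stated assumptions, not the model used later in the chapter: a single observation y = w + noise with Gaussian noise of known variance, and a spike and slab prior on w. The function name `spike_slab_posterior` and the values chosen for `p0`, `slab_var` and `noise_var` are hypothetical and serve only as an illustration.

```python
import math


def normal_pdf(x, var):
    """Density of a zero-mean Gaussian with variance var, evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def spike_slab_posterior(y, p0, slab_var, noise_var):
    """Posterior for the scalar toy model y = w + noise.

    Assumed prior: w ~ p0 * N(0, slab_var) + (1 - p0) * (point mass at zero).
    Returns the posterior probability that w is non-zero and the posterior mean.
    """
    # Marginal likelihood of y under each mixture component, weighted by the prior.
    ev_slab = p0 * normal_pdf(y, slab_var + noise_var)
    ev_spike = (1.0 - p0) * normal_pdf(y, noise_var)
    p_nonzero = ev_slab / (ev_slab + ev_spike)
    # Given the slab, the posterior mean is the usual Gaussian shrinkage of y;
    # given the spike, it is exactly zero.
    mean = p_nonzero * slab_var / (slab_var + noise_var) * y
    return p_nonzero, mean


# Hypothetical settings: 10% of coefficients expected to be non-zero.
p_small, m_small = spike_slab_posterior(0.5, p0=0.1, slab_var=25.0, noise_var=1.0)
p_large, m_large = spike_slab_posterior(5.0, p0=0.1, slab_var=25.0, noise_var=1.0)
print(p_small, m_small)
print(p_large, m_large)
```

In this toy setting a small observation is attributed almost entirely to the spike, so its posterior mean is pulled close to zero, whereas a large observation is attributed to the slab and is shrunk only slightly.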
Spike and slab priors have some advantages over Laplace and degenerate Student’s t priors. In particular, spike and slab priors are often more effective in enforcing sparsity because they make it possible to selectively reduce the magnitude of only a subset of the model coefficients. Both the Laplace prior and the Student’s t prior have a single characteristic scale. Consequently, they tend to reduce the magnitude of every coefficient in the model, including those coefficients that should actually be different from zero. The spike and slab distribution is a mixture model with two characteristic scales. This makes it possible to discriminate between coefficients that are better modeled by the slab, which are not shrunk to zero, and coefficients modeled by the spike, which have a large posterior probability of being exactly zero. An additional advantage is that the desired
Chapter 4. Linear Regression Models with Spike and Slab Priors
degree of sparsity in the posterior distribution is directly related to the weight assigned to the spike. Moreover, spike and slab priors are formulated in terms of a set of latent binary variables that specify whether each coefficient is assigned to the spike or to the slab. The expected value of these latent variables under the posterior distribution gives the probability that the corresponding model coefficients are exactly zero.
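The latent-variable formulation mentioned above can be sketched as a sampling procedure for the prior. This is a minimal illustration that uses only the Python standard library. The coding of the indicators (1 for the slab, 0 for the spike) is an assumption made here, and the function name `sample_spike_and_slab` and the parameter values are hypothetical.

```python
import random


def sample_spike_and_slab(d, p0, slab_std, seed=0):
    """Draw d coefficients from a spike and slab prior.

    Assumed coding: z[i] = 1 assigns coefficient i to the slab (a zero-mean
    Gaussian with standard deviation slab_std); z[i] = 0 assigns it to the
    spike, so the coefficient is exactly zero.
    """
    rng = random.Random(seed)
    # Latent binary indicators: each coefficient is in the slab with probability p0.
    z = [1 if rng.random() < p0 else 0 for _ in range(d)]
    # Coefficients: Gaussian draw for the slab, exact zero for the spike.
    w = [rng.gauss(0.0, slab_std) if zi == 1 else 0.0 for zi in z]
    return w, z


w, z = sample_spike_and_slab(d=1000, p0=0.05, slab_std=1.0)
fraction_zero = sum(1 for wi in w if wi == 0.0) / len(w)
print(fraction_zero)
```

Under this coding the fraction of exact zeros in a draw is governed directly by the weight of the spike, 1 - p0. With the opposite coding of the indicators, the roles of 0 and 1 are simply exchanged.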
A disadvantage of using spike and slab priors is that Bayesian inference becomes a difficult and computationally demanding problem. Since the posterior distribution cannot be expressed in closed form, it needs to be estimated numerically. However, the computational cost of exact numerical evaluation is excessively large for most problems of practical interest. Therefore, inference in linear models with spike and slab priors is often implemented using Markov chain Monte Carlo (MCMC) methods; in particular, with Gibbs sampling (George and McCulloch, 1997). However, MCMC methods require simulating very long Markov chains to obtain an accurate approximation of the posterior. The computational cost of Gibbs sampling is $O(p_0^2 d^3 k)$, where $p_0$ is the expected fraction of non-zero coefficients, $d$ is the dimension of the data and $k$ is the number of samples drawn from the posterior (see Appendix C.1). Typically, accurate inference requires $k \gg d$. This high computational cost makes Gibbs sampling infeasible when d is very large. In this chapter, expectation propagation (EP) (Minka, 2001) is proposed as an efficient alternative to Gibbs sampling. Although EP is an approximate method, it has been shown to perform well in a linear classification model with spike and slab priors for microarray data (Hernández-Lobato et al., 2010a). The performance of the linear regression model with spike and slab priors and EP for approximate inference is evaluated in regression problems from different domains of application. The problems analyzed include the reverse engineering of transcription control networks (Gardner and Faith, 2005), the reconstruction of sparse signals (Ji et al., 2008) and the prediction of user sentiment (Blitzer et al., 2007). In these problems, EP outperforms Gibbs sampling or obtains comparable results at a much smaller computational cost. Additionally, spike and slab priors are more effective than Laplace or Student’s t priors. The improved performance of the linear regression model with a spike and slab prior is explained by the superior selective shrinkage capacity of this type of prior distribution.
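To give a rough sense of the scaling of the Gibbs sampling cost quoted above, the following worked arithmetic evaluates the product of the three factors for one hypothetical setting. The values d = 10^4 and p_0 = 0.01 are assumptions chosen only for illustration, k = d is used as a conservative stand-in for the number of samples, and the constants hidden by the O notation are ignored.

```latex
\[
p_0^2\, d^3\, k
  = (10^{-2})^2 \times (10^{4})^3 \times 10^{4}
  = 10^{-4} \times 10^{12} \times 10^{4}
  = 10^{12}.
\]
% If k grows in proportion to d, the product scales as d^4:
% doubling d multiplies it by 2^4 = 16.
```

Even at this level of sparsity, the number of operations grows very quickly with d, which is the motivation for seeking a cheaper approximate method.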
This chapter is organized as follows: Section 4.2 introduces the linear regression model with a spike and slab prior (LRMSSP). Section 4.3 describes the EP algorithm and its application to the LRMSSP. This section includes a description of the posterior approximation generated by EP (Subsection 4.3.1), the EP update operations (Subsection 4.3.2) and the approximation of the evidence given by EP (Subsection 4.3.3). Section 4.4 presents an exhaustive evaluation of EP in different problems of practical interest: the reverse engineering of transcription networks (Subsection 4.4.1), the reconstruction of sparse signals (Subsection 4.4.2) and the prediction of user sentiment (Subsection 4.4.2.1). Finally, the results and conclusions of this investigation are summarized in Section 4.5.