
The definition of sparsity used here assumes that many of the coefficients s are exactly zero. However, the prior probabilities used in the previous chapter had most of their probability mass close to zero but not at zero. In this and the next chapters we use different prior formulations that have a high probability mass at zero. Monte Carlo approximations can then be used to approximate the learning rule 3.1 introduced in subsection 3.2.1.

5.1.1 Prior Formulation

In this chapter we impose the following mixture prior in order to enforce sparsity of the coefficients s:

$$p(\mathbf{s}|\mathbf{u}) = \prod_n p(s_n|u_n) = \prod_n \left( u_n \sqrt{\frac{\lambda_G}{2\pi}}\, e^{-\frac{\lambda_G}{2} s_n^2} + (1 - u_n)\,\delta_0(s_n) \right), \tag{5.1}$$

where $u_n$ is a binary indicator variable with discrete distribution

$$p(u_n) = \frac{1}{1 + e^{-\lambda_u/2}}\, e^{-\frac{\lambda_u}{2} u_n} \tag{5.2}$$

and $\delta_0(s_n)$ is the Dirac mass at zero. This prior is a mixture of a Gaussian distribution and the Dirac mass, therefore forcing many of the coefficients to be exactly zero, with the hyper-prior regulating the sparsity of the distribution.
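To make the role of the two components concrete, the following is a minimal sketch of drawing samples from the prior in equations (5.1) and (5.2). The function name `sample_prior` and the parameter values are illustrative assumptions and not part of the original text.

```python
import math
import random

def sample_prior(n_coeffs, lambda_g, lambda_u, rng):
    """Draw (s, u) from the mixture prior of equations (5.1) and (5.2)."""
    # From (5.2): P(u_n = 1) = e^{-lambda_u/2} / (1 + e^{-lambda_u/2}).
    p_active = math.exp(-lambda_u / 2.0) / (1.0 + math.exp(-lambda_u / 2.0))
    # lambda_g is the precision of the Gaussian component in (5.1).
    sigma = 1.0 / math.sqrt(lambda_g)
    u = [1 if rng.random() < p_active else 0 for _ in range(n_coeffs)]
    # Active coefficients are Gaussian; inactive ones sit exactly at zero.
    s = [rng.gauss(0.0, sigma) if u_n == 1 else 0.0 for u_n in u]
    return s, u

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
s, u = sample_prior(1000, lambda_g=1.0, lambda_u=4.0, rng=rng)
frac_active = sum(u) / len(u)
```

Larger values of `lambda_u` lower the probability that an indicator is switched on, so more coefficients are exactly zero.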

5.1.2 Dealing with Parameters

The parameters defining this model are $\theta = \{A, \lambda_G, \lambda_u, \lambda_\epsilon\}$. These parameters can be dealt with by

CHAPTER 5. IMPORTANCE SAMPLING APPROXIMATION 81

maximum likelihood, MAP estimation or marginalisation. Marginalisation is the proper Bayesian approach to deal with nuisance parameters; the MAP estimate would be the best possible estimate under zero-one error loss, whilst the posterior mean is the best estimate of a parameter of interest under a squared error loss. The main problem in this thesis is the approximation of integrals for marginalisation over nuisance parameters. Which parameters to estimate and which parameters to integrate out depends on the specific application and model. In this thesis we are primarily interested in marginalising over the coefficients s in the above model in order to calculate estimates of parameters of interest. The extension of the proposed methods to marginalisation over other parameters is possible by a straightforward extension of the ideas presented here and is not discussed further.

5.1.3 ML Learning of Model Parameters

Instead of adopting a fully Bayesian approach to the estimation of the parameters θ, i.e. instead of specifying prior distributions and calculating their joint posterior distribution or the maximum thereof, we again use a stochastic gradient descent algorithm to find the maximum likelihood estimate. In this model, the coefficients s and u are assumed to be nuisance parameters and are therefore integrated out of the data likelihood. The maximum likelihood estimate is then

$$\hat{\theta} = \arg\max_{\theta} \prod_i \int p(\mathbf{x}_i, \mathbf{s}_i, \mathbf{u}_i|\theta)\, d\{\mathbf{s}_i, \mathbf{u}_i\}.$$

We use the subscript i to denote the ith observation vector and the associated coefficients, and I to denote the number of observations. This maximisation can again be solved using stochastic gradient optimisation by approximating the gradient w.r.t. all data with the gradient w.r.t. a single data vector $\mathbf{x}_i$. As discussed in chapter 2, we can write the gradient as:

$$\frac{\partial}{\partial\theta} \log p(\mathbf{x}|\theta) = \int p(\mathbf{s},\mathbf{u}|\mathbf{x},\theta)\, \frac{\partial}{\partial\theta} \log p(\mathbf{x},\mathbf{s},\mathbf{u}|\theta)\, d\mathbf{s}\, d\mathbf{u}, \tag{5.3}$$

where from now on we drop the index i. Again, this expectation cannot be evaluated analytically in general and different approximations have been proposed in the literature [68, 80, 103], all of which require the calculation of the MAP estimate of $p(\mathbf{s},\mathbf{u}|\mathbf{x},\theta)$. However, for many prior distributions the posterior over the coefficients is multi-modal and such estimates then only reflect a section of the distribution and might fail to account for most of the probability mass. Furthermore, such estimates are generally biased, so that convergence to the true maximum of the likelihood is not guaranteed.

During stochastic gradient learning of the parameters the algorithm randomly iterates through the available data, updating the parameters by a small amount in each iteration. This method therefore averages the gradient over several steps. This suggests the use of a less accurate approximation of the gradient in equation (5.3), which itself is already a rather poor approximation of the true gradient with respect to all available data. The stochastic gradient algorithm is then still able to converge to a maximum, given that the unbiasedness of the approximation is ensured and that the learning rate is decreased to zero [114].
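The argument that a noisy but unbiased gradient suffices can be illustrated with a toy problem. The sketch below is an assumption-laden stand-in, not the model of this chapter: it estimates the mean of a unit-variance Gaussian by stochastic gradient ascent on the log-likelihood, deliberately corrupts each single-sample gradient with zero-mean noise, and uses a learning rate of 1/t as one example of a schedule that decreases to zero. The name `sgd_mean` and all numerical values are hypothetical.

```python
import random

def sgd_mean(data, num_steps, noise_std, rng):
    """Stochastic gradient ascent on the log-likelihood of N(theta, 1)."""
    theta = 0.0
    for t in range(1, num_steps + 1):
        x_i = rng.choice(data)          # pick a single observation at random
        exact_grad = x_i - theta        # d/dtheta log N(x_i; theta, 1)
        # Zero-mean perturbation: the gradient is less accurate but unbiased.
        noisy_grad = exact_grad + rng.gauss(0.0, noise_std)
        theta += noisy_grad / t         # learning rate 1/t decreases to zero
    return theta

rng = random.Random(2)  # fixed seed, chosen arbitrarily for reproducibility
data = [rng.gauss(3.0, 1.0) for _ in range(200)]
theta_hat = sgd_mean(data, num_steps=50000, noise_std=2.0, rng=rng)
data_mean = sum(data) / len(data)       # maximum likelihood estimate
```

In this toy setting the iterate `theta_hat` ends up close to `data_mean` despite the added noise, because the errors of the individual gradient estimates average out over the iterations.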

Here we discuss a Monte Carlo approximation of the above integral using importance sampling [115]. This technique does not rely on MAP estimation and can therefore be implemented efficiently as shown below.

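Importance sampling approximates an integral by a sum of weighted samples,

$$\int p(\mathbf{s},\mathbf{u}|\mathbf{x},\theta)\, \frac{\partial}{\partial\theta} \log p(\mathbf{x},\mathbf{s},\mathbf{u}|\theta)\, d\mathbf{s}\, d\mathbf{u} \approx \sum_{j=1}^{J} w_j\, \frac{\partial}{\partial\theta} \log p(\mathbf{x},\hat{\mathbf{s}}_j,\hat{\mathbf{u}}_j|\theta),$$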

where $\hat{\mathbf{s}}_j$ and $\hat{\mathbf{u}}_j$ are samples drawn from a proposal distribution $q(\mathbf{s},\mathbf{u})$ with the same support as $p(\mathbf{s},\mathbf{u}|\mathbf{x},\theta)$. Here we use the subscript j to label the individual samples drawn. We further use J to denote the total number of samples. The weights are calculated as:

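$$w_j = \frac{1}{J}\, \frac{p(\hat{\mathbf{s}}_j,\hat{\mathbf{u}}_j|\mathbf{x},\theta)}{q(\hat{\mathbf{s}}_j,\hat{\mathbf{u}}_j)} = \frac{1}{J}\, \frac{p(\hat{\mathbf{s}}_j|\hat{\mathbf{u}}_j,\mathbf{x},\theta)\, p(\hat{\mathbf{u}}_j|\mathbf{x},\theta)}{q(\hat{\mathbf{s}}_j,\hat{\mathbf{u}}_j)}. \tag{5.4}$$

The use of the weights calculated with this formula gives an unbiased gradient estimate for the problem at hand. It can also be shown that the above Monte Carlo approximation converges for J → ∞.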
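The weighting scheme of equation (5.4) can be sketched on a one-dimensional toy problem. Everything in this example is an illustrative assumption: the target N(1, 1) stands in for the posterior, the proposal is N(0, 4), and the function f(s) = s² stands in for the gradient term inside the expectation. Under these assumptions the exact expectation is 2, which gives a reference value for the estimate.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def importance_estimate(f, num_samples, rng):
    """Estimate E_p[f] with weights w_j = p(s_j) / (J q(s_j)), as in (5.4)."""
    total = 0.0
    weight_sum = 0.0
    for _ in range(num_samples):
        s_hat = rng.gauss(0.0, 2.0)                # sample from proposal q
        p_val = normal_pdf(s_hat, 1.0, 1.0)        # toy target density p
        q_val = normal_pdf(s_hat, 0.0, 2.0)        # proposal density q
        w = p_val / q_val / num_samples            # importance weight
        total += w * f(s_hat)
        weight_sum += w
    return total, weight_sum

rng = random.Random(1)  # fixed seed, chosen arbitrarily for reproducibility
estimate, weight_sum = importance_estimate(lambda s: s * s, 20000, rng)
```

The sum of the weights is itself a Monte Carlo estimate of 1, so checking that `weight_sum` is close to 1 is a cheap diagnostic of whether the proposal covers the target adequately.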
