
2.4 Adaptive Sparse Approximations

2.4.2 ML Estimation of the Model Parameters

The problem of learning the matrix A can be formulated from a probabilistic point of view as the problem of finding the maximum likelihood estimate of A under the marginal likelihood [80]:

$$p(\{x\}|A) = \int p(\{x\}|A,s)\,p(s)\,ds,$$

where we use {x} to denote the set of all available data vectors x.

Unfortunately, for the sparseness-inducing priors discussed above, this integral cannot be solved analytically and approximations are required. If we assume the observations x to be independent, we can use the factorisation

$$p(\{x\}|A) = \prod_{x} p(x|A),$$

where the product is over all observations. Instead of maximising this joint distribution directly, it is possible to use stochastic gradient descent optimisation. This procedure has the advantage that not all data need to be taken into account in each step, which reduces the memory demands of the algorithm. It also makes it possible to update the model parameters 'on-line' as new data become available. Moreover, for the maximisation studied here, the gradient, whether with respect to all data or with respect to a single observation, is not available analytically. The approximations introduced below can only offer noisy estimates and therefore lead naturally to stochastic gradients.

In the stochastic gradient descent procedure used here, the matrix A is updated iteratively, using a single data point in each iteration to calculate an approximation of the gradient. If the gradient estimate with respect to a single data point is unbiased, then this method converges to a local maximum of the likelihood [72].
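The iteration described above can be sketched in Python with NumPy. This is a minimal illustration only: the function name `stochastic_gradient_update`, the callback `grad_estimate`, the fixed step size `mu`, and the random visiting order are all assumptions made for the example and are not taken from the text.

```python
import numpy as np

def stochastic_gradient_update(A, data, grad_estimate, mu=0.01, seed=0):
    """Run one pass over the data, updating A from one observation at a time.

    grad_estimate(x, A) is a hypothetical callback that returns an estimate
    of d log p(x|A) / dA as an array with the same shape as A.
    mu is an assumed constant step size.
    """
    rng = np.random.default_rng(seed)
    A = np.array(A, dtype=float)              # work on a copy of the input
    for i in rng.permutation(len(data)):      # visit observations in random order
        # Step in the direction of the (estimated) gradient of the
        # log-likelihood, since the likelihood is being maximised.
        A = A + mu * grad_estimate(data[i], A)
    return A
```

Any of the gradient approximations discussed in the following section could be passed in as `grad_estimate`; only one observation needs to be held in memory per step.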

In order to derive a stochastic gradient learning rule and to gain a better understanding of the problem, we rewrite the required gradient following [78], using the notation:

$$Z = p(x|A) = \int p(x|A,s)\,p(s)\,ds$$

and the abbreviation:

$$E(s) = \log\big(p(x|A,s)\,p(s)\big). \tag{2.3}$$

The gradient of the log-likelihood L = log p({x}|A) is:

$$\frac{\partial \log p(\{x\}|A)}{\partial A},$$

where the derivative is with respect to the individual elements of the matrix A. The learning algorithm is derived as a stochastic gradient algorithm, for which in each iteration the gradient has to be evaluated for a single observation vector x and not for the set of all available observations {x}. This gradient can be written as:

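$$\frac{\partial \log Z}{\partial A} = \frac{1}{p(x|A)}\frac{\partial}{\partial A}p(x|A) = \int \frac{1}{Z}\,e^{E(s)}\,\frac{\partial}{\partial A}E(s)\,ds = \int p(s|A,x)\,\frac{\partial}{\partial A}E(s)\,ds = \left\langle \frac{\partial}{\partial A}E(s) \right\rangle_{p(s|A,x)} \tag{2.4}$$

where $\langle\cdot\rangle$ denotes expectation.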

So the gradient can be written as an expectation, with respect to p(s|A,x), of the derivative of equation (2.3). Taking the derivative of equation (2.3) and assuming $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2 I)$, the gradient can be written as:

$$\frac{\partial \log Z}{\partial A} = \sigma_\epsilon^{-2}\left\langle (x - As)s^T \right\rangle_{p(s|A,x)}, \tag{2.5}$$

where the derivative is again with respect to the individual elements of the matrix A.
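The factor (x − As)s^T in (2.5) comes from differentiating the Gaussian log-likelihood term elementwise. A short worked step, assuming the linear model x = As + ε (the model implied by the residual x − As; the index notation A_{ij} is introduced here for illustration) and noting that the prior p(s) does not depend on A:

```latex
\begin{align*}
\log p(x|A,s) &= -\frac{1}{2\sigma_\epsilon^{2}} \|x - As\|^{2} + \text{const}, \\
\frac{\partial}{\partial A_{ij}} \log p(x|A,s)
  &= \frac{1}{\sigma_\epsilon^{2}} \Big(x_i - \sum_k A_{ik} s_k\Big) s_j
   = \sigma_\epsilon^{-2} \big[(x - As)s^{T}\big]_{ij}.
\end{align*}
```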

2.4.3 Approximations to ML Learning

As the expectation w.r.t. p(s|A,x) cannot be evaluated analytically, different strategies have been proposed. In [72], several conditions on the estimate of the gradient w.r.t. a single data point are given that ensure convergence to a local maximum. One important condition is the

CHAPTER 2. SPARSE CODING ₄₃

(asymptotic) unbiasedness of the gradient estimate. The first two methods discussed below do not take this bias into account. The Gibbs sampling method in chapter 6, however, does offer such an unbiased estimate (at least asymptotically). The importance sampling method developed in chapter 5 also addresses this problem and is also asymptotically unbiased; for finite samples, however, the bias can be significant.

Delta Approximation

The simplest approximation of the integral in equation (2.5) is to approximate the posterior p(s|x,A) with a delta function at its maximum, as suggested in [103]. In [57] this approximation was shown to lead to the joint maximum likelihood estimation of s and A for the complete likelihood function in a missing data problem, in which the missing data is s. In this case the gradient estimate becomes:

$$\frac{\partial \log Z}{\partial A} \approx \sigma_\epsilon^{-2}(x - A\hat{s})\hat{s}^T,$$

where we use ŝ to denote the MAP estimate of s under p(s|x,A). This method requires the estimation of ŝ, which can be done using the methods discussed in the previous section.
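A minimal sketch of the delta approximation in Python with NumPy follows. It assumes a Laplacian prior p(s) ∝ exp(−λ‖s‖₁) and uses iterative soft-thresholding (ISTA) as a stand-in MAP solver; neither the prior, the solver, nor the names `map_estimate_ista`, `delta_gradient`, `lam` and `n_iter` come from the text.

```python
import numpy as np

def map_estimate_ista(x, A, sigma2, lam, n_iter=200):
    """MAP estimate of s under an assumed Laplacian prior exp(-lam * ||s||_1).

    Minimises ||x - A s||^2 / (2 * sigma2) + lam * ||s||_1 by iterative
    soft-thresholding; any other MAP solver could be substituted.
    """
    L = np.linalg.norm(A, 2) ** 2 / sigma2    # Lipschitz constant of the smooth term
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = s + A.T @ (x - A @ s) / (sigma2 * L)                 # gradient step
        s = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)    # soft threshold
    return s

def delta_gradient(x, A, sigma2, lam):
    """Delta approximation of the gradient: sigma^-2 (x - A s_hat) s_hat^T."""
    s_hat = map_estimate_ista(x, A, sigma2, lam)
    return np.outer(x - A @ s_hat, s_hat) / sigma2
```

The returned array has the same shape as A, so it can be used directly as the per-observation gradient estimate in a stochastic update of A.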

Gaussian Approximation

Lewicki [80] proposed a Gaussian approximation of the posterior around the MAP estimate of s, which leads to the approximation:

$$\frac{\partial \log Z}{\partial A} \approx \sigma_\epsilon^{-2}\left((x - A\hat{s})\hat{s}^T - AH^{-1}\right),$$

where H is the Hessian of the log-posterior evaluated at the current MAP estimate of p(s|x,A). Further approximations can be made [80], leading to:

$$\frac{\partial \log Z}{\partial A} \approx -\mu A\left(-\frac{\partial}{\partial s}\log p(s)\,\hat{s}^T + I\right).$$

This method also requires the evaluation of ŝ, which can again be done using the methods introduced in the previous section.
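The origin of the inverse-Hessian term can be seen by evaluating the expectation in (2.5) under a Gaussian. A sketch, assuming the posterior is replaced by a Gaussian q(s) with mean ŝ and covariance Σ (the symbols q and Σ are introduced here for illustration; Σ plays the role of the inverse-Hessian term, up to the sign convention chosen for H):

```latex
\begin{align*}
\big\langle (x - As)s^{T} \big\rangle_{q}
  &= x\,\langle s \rangle_{q}^{T} - A\,\langle s s^{T} \rangle_{q} \\
  &= x\hat{s}^{T} - A\big(\hat{s}\hat{s}^{T} + \Sigma\big) \\
  &= (x - A\hat{s})\hat{s}^{T} - A\Sigma .
\end{align*}
```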

Monte Carlo Approximation

Using sampling methods to draw samples from p(s|x,A) not only allows us to estimate the mean or maximum of the posterior, as discussed in section 2.3, but also allows us to use Monte Carlo approximations of the expectation in equation (2.5). This method was proposed in [126, 105]. This approximation is used extensively in chapter 5, where we develop an importance sampling method, and in chapter 6, where we study a Markov chain sampler. More details on previous methods based on Monte Carlo approximations are given in these chapters.
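Given samples from the posterior, the expectation in (2.5) is replaced by a sample average. The following Python sketch shows only this averaging step; the function name `monte_carlo_gradient` and the array layout are assumptions for illustration, and the sampler that produces `samples` is not shown.

```python
import numpy as np

def monte_carlo_gradient(x, A, samples, sigma2):
    """Monte Carlo estimate of sigma^-2 <(x - A s) s^T> from posterior samples.

    samples: array of shape (n_samples, n_sources), assumed to be drawn
    from p(s|x,A) by some sampler that is outside the scope of this sketch.
    """
    residuals = x[None, :] - samples @ A.T    # row k holds x - A s_k
    # residuals.T @ samples sums the outer products (x - A s_k) s_k^T over k.
    return residuals.T @ samples / (len(samples) * sigma2)
```

With equally weighted samples this is a plain average; an importance sampling scheme would replace the uniform weights 1/n by normalised importance weights.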

Other Approximations

For completeness we mention two other solutions suggested in the literature. One of these approaches is to approximate equation (2.5) with the help of variational methods (see for example [44, 59, 92, 116]). The other approach was proposed by Engan in [30]. This batch method (Method of Optimal Directions) is similar to the solution of the standard Wiener filter [63], because once the vector s or its correlations with x are known, or assumed to be known, the model reduces to the standard linear model with Gaussian noise.

2.5 Applications of Sparse Coding

There are two main areas in which sparse coding ideas have been used: feature extraction and Blind Source Separation (BSS). BSS based on sparse signal representations uses the realisation that most signals can be transformed with an orthogonal transform into a representation in which expected features occur sparsely. For example, the time domain representation of a spoken word is not sparse; however, its frequency domain representation has only a small number of significant coefficients.
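The contrast between the two representations can be illustrated numerically. The sketch below uses a synthetic sum of two sinusoids as a stand-in for an audio frame and counts coefficients above a relative threshold; the signal, the threshold of 10% of the largest magnitude, and the helper `count_significant` are all assumptions chosen for the example.

```python
import numpy as np

def count_significant(v, rel_threshold=0.1):
    """Count entries whose magnitude exceeds rel_threshold times the largest one."""
    mag = np.abs(v)
    return int(np.sum(mag > rel_threshold * mag.max()))

n = 1024
t = np.arange(n) / n
# Synthetic stand-in for an audio frame: two sinusoids at integer frequencies.
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# Most time-domain samples are significant, whereas only the coefficients
# at the two sinusoid frequencies are significant after the DFT.
time_count = count_significant(signal)
freq_count = count_significant(np.fft.rfft(signal, norm="ortho"))
```

Real speech is not exactly sparse in frequency, so the counts for recorded audio would be less extreme than for this idealised signal.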

Feature extraction based on sparse coding ideas uses the assumption that most features do not occur most of the time in any one observation. This is the assumption used in this thesis. A general overview of previous applications based on this approach to feature extraction can be found in [16]. Possible applications include audio, image, and biomedical data analysis. Previous contributions to these areas as well as applications to BSS are given below.

In document Bayesian modelling of music: algorithmic advances and experimental studies of shift invariant sparse coding (Page 42-46)