
The proposed algorithm can roughly be broken into three parts:

1. Selecting a subset of the features for each observation vector x.
2. Finding the maximum of p(s|A,x) within this subset.
3. Updating the features in matrix A.

These steps are further explained below. The complete algorithm is shown in table 4.1.

In the algorithm described here, the length of the observation vector x can be chosen arbitrarily, provided it is at least as long as the features $a_k$. In the experiment reported later, this vector was chosen to be twice the feature length, so that the matrix A contained $3K - 1$ shifted versions of each feature (where $K$ is the feature length). The choice of the length of x must be a compromise between computational complexity and the problems caused by truncated features and the associated end-effects. The features can be initialised with Gaussian noise but can also be pre-set to known functions such as Fourier bases. This can speed up convergence, but might also influence the outcome.
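The count of $3K - 1$ shifts can be checked with a small sketch. It assumes that every shift overlapping the observation window by at least one sample is kept, with truncation at both ends; the function name `shifted_feature_matrix` and the toy feature are illustrative only and do not come from the original implementation.

```python
import numpy as np

def shifted_feature_matrix(feature, obs_len):
    """Stack every shift of `feature` that overlaps a window of length obs_len."""
    K = len(feature)
    n_shifts = obs_len + K - 1           # includes truncated shifts at both ends
    # Pad the window by K - 1 samples on each side so every shift fits, then crop.
    padded = np.zeros((obs_len + 2 * (K - 1), n_shifts))
    for l in range(n_shifts):
        padded[l:l + K, l] = feature
    return padded[K - 1:K - 1 + obs_len, :]

K = 4
a = np.arange(1.0, K + 1)                # toy feature [1, 2, 3, 4]
A_k = shifted_feature_matrix(a, obs_len=2 * K)
print(A_k.shape)                         # (8, 11), i.e. 3K - 1 = 11 shifts
```

The first and last columns contain only a single sample of the feature, which is the kind of truncation that causes the end-effects mentioned above.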

CHAPTER 4. ANALYTIC APPROXIMATION 77

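Table 4.1: Shift-invariant learning algorithm via EM.

Input (user defined): signal $\{x_i\}$, the size of the subset $W$, the percentage of maximal overlap $\mu$, and the number and length of the features $a_k$.
Output: $A$.

1. Initialise $\{a_k\}_{k\in K}$ = random, where $K$ is the set of all features.
2. Randomly select a data vector $x$.
3. Calculate the inner product between $x$ and all shifted features:
   $\{d_{k,l}\}_{k\in K,\,l\in L} = \langle x, \{a_{k,l}\}_{k\in K,\,l\in L}\rangle$,
   where $K$ is the set of all features and $L$ is the set of all shifts.
4. For $\gamma = 1$, $\gamma < W$:
   $[\tilde{K}(\gamma), \tilde{L}(\gamma)] = \arg\max_{k,l} d_{k,l}$;
   set $\{d_{k,l}\}_{k=\tilde{K}(\gamma),\,l\in\hat{L}} = 0$,
   where $\hat{L}$ is the set of shifts close to the selected position.
5. For $r = 1$, $r < R$:
   $\{s^{[r+1]}\}_{k\in\tilde{K},\,l\in\tilde{L}} = \mathrm{EM}\big(\{s^{[r]}\}_{k\in\tilde{K},\,l\in\tilde{L}}, \{a_{k,l}\}_{k\in\tilde{K},\,l\in\tilde{L}}\big)$.
6. Calculate the gradient $\{\Delta a_k\}_{k\in K}$:
   $\{\Delta a_k\}_{k\in\tilde{K}} = \big\{\sum_{l\in L} (x - a_{k,l} s_{k,l})\, s_{k,l}\big\}_{k\in K,\,l\in L}$,
   where $L$ is the set of all shifts for which features are not truncated.
7. Update $\{a_{k,l}\}_{k\in K,\,l\in L}$:
   $\{a_k\}^{[r+1]}_{k\in\tilde{K}} = \{a_k\}^{[r]}_{k\in\tilde{K}} + \mu \{\Delta a_k\}_{k\in\tilde{K}}$.
8. Normalise $\{a_k\}_{k\in\tilde{K}}$:
   $\{a_k\}^{[r+1]}_{k\in\tilde{K}} := \{a_k\}^{[r+1]}_{k\in\tilde{K}} / \|\{a_k\}^{[r+1]}_{k\in\tilde{K}}\|_2$.
9. $\mu^{[r+1]} = \mu^{[r]} \nu$; $\nu < 1$.
10. Repeat from step 2 until convergence.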
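The subset selection in step 4 of Table 4.1 can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the original implementation: the parameter `min_gap` is a hypothetical stand-in for the set of shifts "close to the selected position", and suppressed entries are set to minus infinity rather than to 0 so that they cannot be selected again when the remaining inner products are negative.

```python
import numpy as np

def select_subset(d, W, min_gap):
    """Greedily pick W (feature, shift) pairs with the largest inner products.

    d       : array of shape (n_features, n_shifts), d[k, l] = <x, a_{k,l}>
    W       : number of pairs to select
    min_gap : hypothetical parameter; shifts closer than this to a selected
              shift of the same feature are excluded from later selection
    """
    d = np.array(d, dtype=float)         # work on a copy
    selected = []
    for _ in range(W):
        k, l = np.unravel_index(np.argmax(d), d.shape)
        selected.append((int(k), int(l)))
        lo, hi = max(0, l - min_gap + 1), min(d.shape[1], l + min_gap)
        d[k, lo:hi] = -np.inf            # the table sets these entries to 0
    return selected

d = [[1, 5, 4, 0, 2],
     [3, 0, 0, 6, 0]]
print(select_subset(d, W=3, min_gap=2))  # [(1, 3), (0, 1), (1, 0)]
```

In the toy call, the entry 4 at position (0, 2) is never chosen although it is the third-largest value, because it lies next to the already selected shift (0, 1) of the same feature.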

The sparse coding model studied here has some indeterminacies. For example, the coefficients and the energy of the features can be rescaled against each other without invalidating the model. To avoid unbounded growth of the features, they have to be re-normalised after each update. Here the $L_2$ norm of each feature is arbitrarily normalised to 1. The model also has an ordering ambiguity, but as there is no natural order to the features, the order found is not relevant for the implementation.
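The scaling indeterminacy and the re-normalisation can be illustrated with a toy dictionary. The sketch below assumes a plain (not shift-invariant) matrix with one feature per column; all values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))          # toy dictionary, one feature per column
s = rng.standard_normal(3)               # toy coefficients

# Scaling feature k by c_k and its coefficient by 1 / c_k leaves A @ s unchanged.
c = np.array([2.0, 0.5, 10.0])
assert np.allclose(A @ s, (A * c) @ (s / c))

# Re-normalise each feature to unit L2 norm, as done after every update.
A_unit = A / np.linalg.norm(A, axis=0, keepdims=True)
print(np.linalg.norm(A_unit, axis=0))    # each entry is 1 up to rounding
```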

The algorithm involves repeated calculation of $As$ as well as $A^T x$. In the shift-invariant model these products are convolutions and can therefore be evaluated efficiently in the Fourier domain. However, because s is highly sparse, a direct multiplication might be faster in some circumstances. Due to the shift-invariant structure, A does not have to be stored entirely; it is sufficient to store the individual features.
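The two ways of evaluating the product $As$ for a single feature can be compared in a short sketch. It assumes the full, untruncated linear convolution (cropping to the observation window is omitted); the function names `synth_fft` and `synth_sparse` are illustrative.

```python
import numpy as np

def synth_fft(a, s):
    """Full linear convolution of coefficients s with feature a via the FFT."""
    n = len(a) + len(s) - 1              # zero-pad to avoid circular wrap-around
    return np.fft.irfft(np.fft.rfft(s, n) * np.fft.rfft(a, n), n)

def synth_sparse(a, s):
    """Same result, but only the non-zero coefficients are visited."""
    out = np.zeros(len(a) + len(s) - 1)
    for l in np.flatnonzero(s):
        out[l:l + len(a)] += s[l] * a
    return out

rng = np.random.default_rng(1)
a = rng.standard_normal(16)              # one feature
s = np.zeros(200)
s[[5, 90, 150]] = [1.5, -0.7, 2.0]       # highly sparse coefficient vector

assert np.allclose(synth_fft(a, s), synth_sparse(a, s))
assert np.allclose(synth_fft(a, s), np.convolve(s, a))
```

Both functions need only the feature `a` itself, never the full matrix of shifted copies. The direct version costs one scaled copy of the feature per non-zero coefficient, which is why it can beat the FFT when very few coefficients are active.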

Conclusions

Approximations to the learning rule of the features $a_k$ can be based on integral approximations around the MAP estimate of p(s|x,A). An easy approximation based on a delta function can be used. This delta rule has an interpretation as a joint maximisation of the complete data likelihood in a missing data problem, which can justify its application.

For large problems, the derived learning rules cannot be used directly. Instead, we developed a subset selection step which can reduce the problem size. The reduced problem can then be solved using the learning rules derived. Experimental results and different applications of the developed algorithm are presented in chapters 7 and 8.

However, before studying the performance of the proposed method we develop other approximations to the learning rule. In the next two chapters we use Monte Carlo approximations. Chapter 5 deals with importance sampling Monte Carlo approximations of the learning rule, which can be much faster than the method developed here, so that the subset selection step is not required. In chapter 6 we study Gibbs sampling to draw samples from the posterior of s. This method allows for an easy incorporation of additional constraints by the specification of more complex prior distributions.

Chapter 5

Importance Sampling Approximation¹

The learning rule developed in chapter 3 used stochastic gradient descent steps for each data vector. As we use individual data vectors and not the entire set of available observations, the gradient is only on average the gradient of the complete data likelihood. This is, however, sufficient for the stochastic gradient descent procedure [72] to find the maximum likelihood estimate. If we have a large amount of data, we have to take many small gradient steps in this procedure. The learning rule developed in the previous chapter attempted to find a good approximation of the gradient of the likelihood of a single data vector. But as we have just stressed, this gradient is only a rough approximation of the gradient of interest, and it seems wasteful to spend too much computation on an accurate approximation of it.
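The statement that single-vector gradients equal the full gradient only on average can be illustrated on a toy problem. The quadratic loss below is a hypothetical stand-in chosen for its known minimiser (the sample mean); it is not the likelihood used in this work, and the step-size schedule is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 5))       # toy data set, one observation per row
theta = rng.standard_normal(5)           # toy parameter vector

def grad_single(x, theta):
    """Gradient of the toy per-sample loss 0.5 * ||x - theta||^2 w.r.t. theta."""
    return theta - x

# Averaged over all data vectors, the single-sample gradients give the full gradient.
full_grad = np.mean([grad_single(x, theta) for x in X], axis=0)
assert np.allclose(full_grad, theta - X.mean(axis=0))

# Stochastic gradient descent with a decaying step size: one sample per step.
step = 0.5
for t in range(5000):
    x = X[rng.integers(len(X))]
    theta -= step / (1 + 0.01 * t) * grad_single(x, theta)
print(np.abs(theta - X.mean(axis=0)).max())   # distance to the minimiser
```

Each individual step points in a noisy direction, yet the many small steps together bring `theta` close to the minimiser, so an expensive, accurate gradient for every single data vector is not required.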

Instead of finding a good approximation of the gradient we concentrate in this chapter on a fast method able to find a ‘rough’ approximation to this gradient. The parameters of interest in such an approximation are the variance and the bias. If the approximation is biased, then the stochastic gradient descent procedure only finds a biased estimate.

¹ This chapter is based on work published in [12].

In this chapter we introduce an importance sampling approximation of the gradient. In order to develop such a method we first define a prior distribution that is a mixture of a Gaussian and a delta function. The delta function, which is centred at zero, forces coefficients to zero, whilst the Gaussian distribution models the non-zero coefficients.
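Drawing coefficients from such a mixture prior can be sketched as follows. The names `p_nonzero` and `sigma` are hypothetical labels for the mixture weight and the standard deviation of the Gaussian component, and a zero-mean Gaussian is assumed here; the text above does not fix these details.

```python
import numpy as np

def sample_prior(n, p_nonzero, sigma, rng):
    """Draw n coefficients from a Gaussian/delta mixture prior.

    p_nonzero and sigma are hypothetical names for the mixture weight of the
    Gaussian component and its standard deviation (zero mean is assumed).
    """
    active = rng.random(n) < p_nonzero   # which coefficients are non-zero
    return np.where(active, rng.normal(0.0, sigma, n), 0.0)

rng = np.random.default_rng(3)
s = sample_prior(10_000, p_nonzero=0.05, sigma=2.0, rng=rng)
print(np.mean(s != 0))                   # fraction of non-zero coefficients, near 0.05
```

Most samples are exactly zero, which is what makes this prior sparse in the strict sense rather than merely heavy-tailed.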

In this model we have additional parameters, all of which can be estimated using maximum likelihood estimates. This is also done using stochastic gradient descent, similar to the estimation of the dictionary. We therefore introduce the learning rules for these parameters in section 5.2. In the same section the particularities of the importance sampling method are introduced and the algorithm derived.