In this section we introduce an extension of the general sparse coding model for time-series and similar data in which features can occur at arbitrary locations. First, we define a generative model that includes all shifted features. We then derive learning rules for this extended model by finding approximations to the gradient of the marginal likelihood w.r.t. the feature parameters.
In time-series, features can often occur at arbitrary time locations. The application of the sparse coding model described in the previous chapter to such time-series has previously been achieved by arbitrarily cutting the time-series into blocks x. However, these blocks seldom align with the features in the time-series such that the shifts of the different instances of a feature relative to the block positions vary arbitrarily. We therefore modify the model to account for these arbitrary shifts of features relative to a selected block position. This can be achieved by enforcing structure on the matrix A. From now on we use the notation A to refer to this structured matrix. To state the model used in [113, 79, 14, 102, 126, 147] we introduce the following notation:
The index k labels a particular feature whilst the index l denotes the corresponding shift relative to the beginning of the data-block analysed.
CHAPTER 3. SHIFT-INVARIANT SPARSE CODING 55
We denote the length of the features $\mathbf{a}_k$ as $L$. From now on we let $l$ be zero to denote no shift$^{2}$, i.e. $\mathbf{a}_{k0} = [a_{k1}, a_{k2}, \dots, a_{kL}, 0, \dots, 0]^T$, whilst, for example, $\mathbf{a}_{k,-4} = [a_{k5}, a_{k6}, \dots, a_{kL}, 0, \dots, 0]^T$ and so forth. Note that for all $p - l \notin [0, L]$ the elements of $\mathbf{a}_{kl}$ are set to zero and that for $l < 0$ and $l > M - L$ the features $\mathbf{a}_k$ have to be truncated. We use $a_{kp}$ to denote the $p$th component of a feature, which should not be confused with the notation $\mathbf{a}_{kl}$ that refers to a shifted feature. With this notation we can write
\[
\mathbf{A} = [\mathbf{a}_{1,-L}, \mathbf{a}_{1,-L+1}, \dots, \mathbf{a}_{k,M-1}, \mathbf{a}_{k+1,-L}, \dots, \mathbf{a}_{K,M-1}].
\]
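$\mathbf{A}$ is shown graphically for $M = 4$, $N = 12$, $L = 3$, $K = 2$ below:
\[
\left[\begin{array}{*{12}{c}}
\star_3 & \star_2 & \star_1 & 0 & 0 & 0 & \circ_3 & \circ_2 & \circ_1 & 0 & 0 & 0 \\
0 & \star_3 & \star_2 & \star_1 & 0 & 0 & 0 & \circ_3 & \circ_2 & \circ_1 & 0 & 0 \\
0 & 0 & \star_3 & \star_2 & \star_1 & 0 & 0 & 0 & \circ_3 & \circ_2 & \circ_1 & 0 \\
0 & 0 & 0 & \star_3 & \star_2 & \star_1 & 0 & 0 & 0 & \circ_3 & \circ_2 & \circ_1
\end{array}\right]
\]
Here the two features are shown as stars $\star$ and circles $\circ$ respectively, with the subscripts labelling the sample.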
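The structure of the matrix illustrated above can be reproduced numerically. The following is a minimal sketch only: the function names `shifted_feature` and `build_dictionary` are hypothetical, and we assume that the shifts run over $l = -(L-1), \dots, M-1$, which is the range that yields the $4 \times 12$ example (every column then contains at least one feature sample).

```python
import numpy as np

def shifted_feature(a_k, shift, M):
    """Return the length-M column a_kl: feature a_k placed at offset `shift`,
    truncated wherever it falls outside the block (hypothetical helper)."""
    col = np.zeros(M)
    for p in range(len(a_k)):      # p: 0-based sample index within the feature
        m = p + shift              # 0-based position within the block
        if 0 <= m < M:
            col[m] = a_k[p]
    return col

def build_dictionary(features, M):
    """Concatenate all shifted versions of all features into one matrix.

    features: array of shape (K, L); returns an array of shape (M, K * (M + L - 1)).
    Assumption: shifts run from -(L - 1) to M - 1.
    """
    K, L = features.shape
    shifts = range(-(L - 1), M)
    cols = [shifted_feature(features[k], l, M) for k in range(K) for l in shifts]
    return np.stack(cols, axis=1)

# Two features of length L = 3 observed in a block of length M = 4.
features = np.array([[1.0, 2.0, 3.0],    # plays the role of the "star" feature
                     [4.0, 5.0, 6.0]])   # plays the role of the "circle" feature
A = build_dictionary(features, M=4)
print(A.shape)
print(A)
```

Each feature contributes a block of $M + L - 1$ columns, so that the number of columns is $N = K(M + L - 1)$.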
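If we use $s_{kl}$ as the coefficient multiplying feature $\mathbf{a}_{kl}$, then the data model can be written as
\[
\mathbf{x} = \sum_{k \in \mathcal{K},\, l \in \mathcal{L}} \mathbf{a}_{kl} s_{kl} + \boldsymbol{\epsilon} = \mathbf{A}\mathbf{s} + \boldsymbol{\epsilon}.
\]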
The observation block x is modelled with features and all their possible shifts. Note that this model is a mixture of convolutions and that the matrix A is the concatenation of convolution matrices. The coefficient vector s is now a concatenation of the signals being convolved.
This is shown more clearly by writing the model in the familiar form of a discrete system:
\[
x[t] = \sum_{k \in \mathcal{K}} \sum_{l \in \mathcal{L}} a_k[L + 1 - l]\, s_k[t + 1 - l] + \epsilon[t],
\]
which shows the equivalence of the model to a mixture of linear shift-invariant filters with added noise.
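The equivalence between the matrix form and a mixture of convolutions can be checked numerically. The sketch below is illustrative only: `build_dictionary` is a hypothetical helper, the shifts are assumed to run over $l = -(L-1), \dots, M-1$, and the noise term is omitted. Under these assumptions the block $\mathbf{x}$ is the fully overlapping ("valid") part of the convolution of each coefficient sequence with its feature.

```python
import numpy as np

def build_dictionary(features, M):
    """Matrix of all shifted features (hypothetical helper).
    Assumption: shifts l = -(L - 1), ..., M - 1, ordered feature by feature."""
    K, L = features.shape
    n_shifts = M + L - 1
    A = np.zeros((M, K * n_shifts))
    for k in range(K):
        for j, l in enumerate(range(-(L - 1), M)):
            for p in range(L):
                if 0 <= p + l < M:          # truncate at the block boundaries
                    A[p + l, k * n_shifts + j] = features[k, p]
    return A

rng = np.random.default_rng(0)
K, L, M = 2, 3, 4
features = rng.standard_normal((K, L))
coeffs = rng.standard_normal((K, M + L - 1))    # row k holds s_kl for all shifts l

# Matrix form: noiseless x = A s, with s the concatenation of the rows of coeffs.
x_matrix = build_dictionary(features, M) @ coeffs.reshape(-1)

# Convolution form: sum over features of the coefficient sequence filtered by a_k.
x_conv = sum(np.convolve(coeffs[k], features[k], mode="valid") for k in range(K))

print(np.allclose(x_matrix, x_conv))
```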
The above model can be used to describe data blocks of arbitrary length and it would be possible to model the complete observation sequence.
$^{2}$ Note that the observation $\mathbf{x} \in \mathbb{R}^M$ with $M \geq L$ so that the features $\mathbf{a}_k$ have to be
However, for many time-series of interest, it is infeasible to deal with the complete observation at once. Nevertheless, it is possible to randomly select blocks of data from the time-series of interest and to use stochastic gradient descent to learn the model parameters, using a method similar to the one introduced in the previous chapter. In this case, the length $M$ of the observation vector $\mathbf{x}$ can be chosen freely, provided that it is at least $L$. In the experiment reported later, $M$ was chosen to be twice the feature length.
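The random selection of data blocks can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments: the function name `random_blocks`, the uniform choice of the start position and the seeding are our assumptions, whilst the block length $M = 2L$ follows the choice stated above.

```python
import numpy as np

def random_blocks(series, L, n_blocks, seed=None):
    """Yield n_blocks randomly positioned blocks of length M = 2 * L
    (hypothetical helper; start positions are drawn uniformly)."""
    M = 2 * L
    if len(series) < M:
        raise ValueError("series is shorter than one block")
    rng = np.random.default_rng(seed)
    for _ in range(n_blocks):
        start = rng.integers(0, len(series) - M + 1)   # upper bound is exclusive
        yield series[start:start + M]

# Example: draw three blocks from a synthetic series, for features of length 5.
series = np.sin(0.1 * np.arange(1000))
for x in random_blocks(series, L=5, n_blocks=3, seed=0):
    print(x.shape)
```

Each block drawn in this way would then serve as one observation $\mathbf{x}$ in a stochastic gradient step.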
3.2.1 Learning Rule
The model introduced above requires a revision of the learning rules introduced in the previous chapter. The elements of the features $\mathbf{a}_k$ are now repeated along the diagonals of the matrix $\mathbf{A}$. The values of $\mathbf{A}$ cannot be updated individually without taking this repetition into account, which is achieved by calculating the gradient of $\log p(\mathbf{x}|\mathbf{A},\mathbf{s})$ (which is required in equation (2.4)) w.r.t. the $p$th component of the feature $\mathbf{a}_k$.
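Using $\epsilon_m = x_m - \sum_{k \in \mathcal{K},\, l \in \mathcal{L}} a_{k,m-l}\, s_{kl}$, where $a_{k,m-l}$ is the $m$th component of the shifted feature $\mathbf{a}_{kl}$, we can write the log likelihood as:
\[
\log p(\mathbf{x}|\mathbf{A},\mathbf{s}) \propto -\frac{0.5}{\sigma_\epsilon^2} \sum_m \epsilon_m^2.
\]
We can now calculate the derivative of this w.r.t. $a_{kp}$ and write:
\[
\frac{\partial \log p(\mathbf{x}|\mathbf{A},\mathbf{s})}{\partial a_{kp}} = -\frac{1}{\sigma_\epsilon^2} \sum_m \epsilon_m \frac{\partial \epsilon_m}{\partial a_{kp}}.
\]
The derivative on the right only leaves those $a_{k,m-l}$ for which $m - l = p$, i.e. $l = m - p$, so that $\partial \epsilon_m / \partial a_{kp} = -s_{k,m-p}$.
The gradient then becomes:
\[
\Delta a_{kp} = \frac{1}{\sigma_\epsilon^2} \left\langle \sum_m \epsilon_m s_{k,m-p} \right\rangle_{p(\mathbf{s}|\mathbf{A},\mathbf{x})}. \tag{3.1}
\]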
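The term inside the expectation of equation (3.1) can be sketched for a single, given coefficient vector. This is an illustration under stated assumptions only: the posterior expectation is replaced by one fixed $\mathbf{s}$, the shifts are assumed to run over $l = -(L-1), \dots, M-1$, and the names `build_dictionary`, `log_likelihood` and `feature_gradient` are hypothetical.

```python
import numpy as np

def build_dictionary(features, M):
    """Matrix of all shifted features (hypothetical helper).
    Assumption: shifts l = -(L - 1), ..., M - 1, ordered feature by feature."""
    K, L = features.shape
    n_shifts = M + L - 1
    A = np.zeros((M, K * n_shifts))
    for k in range(K):
        for j, l in enumerate(range(-(L - 1), M)):
            for p in range(L):
                if 0 <= p + l < M:
                    A[p + l, k * n_shifts + j] = features[k, p]
    return A

def log_likelihood(x, features, coeffs, sigma2):
    """Log likelihood up to an additive constant: -0.5 / sigma2 * sum_m eps_m^2."""
    eps = x - build_dictionary(features, len(x)) @ coeffs.reshape(-1)
    return -0.5 / sigma2 * np.sum(eps ** 2)

def feature_gradient(x, features, coeffs, sigma2):
    """Gradient w.r.t. each a_kp for one fixed s: (1 / sigma2) * sum_m eps_m * s_{k,m-p}."""
    K, L = features.shape
    M = len(x)
    eps = x - build_dictionary(features, M) @ coeffs.reshape(-1)
    grad = np.zeros_like(features)
    for k in range(K):
        for p in range(L):
            for m in range(M):
                # Shift l = m - p is stored at column index l + (L - 1) of coeffs[k].
                grad[k, p] += eps[m] * coeffs[k, m - p + L - 1]
    return grad / sigma2

rng = np.random.default_rng(1)
K, L, M = 2, 3, 6
features = rng.standard_normal((K, L))
coeffs = rng.standard_normal((K, M + L - 1))
x = rng.standard_normal(M)
print(feature_gradient(x, features, coeffs, sigma2=0.5))
```

Because every $a_{kp}$ appears in several columns of $\mathbf{A}$, the sum over $m$ collects the contributions of all these repetitions. A finite-difference comparison against `log_likelihood` is a simple way to check such an implementation.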
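If $\mathbf{x}$ and $\mathbf{a}_k$ are both in $\mathbb{R}^L$, we can write expression (3.1) as a convolution and derive a gradient update rule for the set of features $\{\mathbf{a}_k\}$ as:
\[
\Delta\{\mathbf{a}_k\} \propto \sigma_\epsilon^{-2} \left\langle \boldsymbol{\epsilon} \star \{\mathbf{s}_k\} \right\rangle_{p(\mathbf{s}|\mathbf{A},\mathbf{x})}.
\]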
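This gradient then leads to an update of the features of the form:
\[
\{\mathbf{a}_k\}^{r+1} = \frac{\{\mathbf{a}_k\}^{r} + \nu \Delta\{\mathbf{a}_k\}}{\left\| \{\mathbf{a}_k\}^{r} + \nu \Delta\{\mathbf{a}_k\} \right\|_2}.
\]
Due to the scale ambiguity in the model, in each update the features $\mathbf{a}_k$ are normalised to unit $L_2$ norm.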
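The normalised update can be sketched as follows. This is a minimal illustration in which the function name `update_features` is hypothetical, `delta` stands for any approximation to the gradient $\Delta\{\mathbf{a}_k\}$, and we assume that each feature is normalised separately.

```python
import numpy as np

def update_features(features, delta, nu):
    """One gradient step of size nu followed by normalisation of every feature
    (row) to unit L2 norm. Assumes that no updated feature is identically zero."""
    stepped = features + nu * delta
    norms = np.linalg.norm(stepped, axis=1, keepdims=True)
    return stepped / norms

features = np.array([[3.0, 4.0, 0.0],
                     [1.0, 0.0, 0.0]])
delta = np.array([[0.0, 0.0, 1.0],
                  [0.0, 2.0, 0.0]])
updated = update_features(features, delta, nu=0.1)
print(np.linalg.norm(updated, axis=1))
```

The normalisation removes the freedom to scale a feature up whilst scaling its coefficients down, which would otherwise leave the product $\mathbf{A}\mathbf{s}$ unchanged.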
This learning rule again requires the evaluation of an expectation w.r.t. $p(\mathbf{s}|\mathbf{x},\mathbf{A})$, which cannot be evaluated analytically, so that approximations are required that are similar to the methods discussed in the previous chapter. We introduce and study several possible approximations to the learning rule for the shift-invariant sparse coding model in the next part of this thesis. Chapter 4 studies analytic approximations to the required integration, whilst chapter 5 develops an importance sampling method. Chapter 6 develops and studies a Markov chain Monte Carlo approximation.