Many engineering problems of interest suffer from high dimensionality. In the problems studied here, the length of the expected features can often be of the order of a few thousand and the number of features often in the hundreds. In the shift-invariant model this leads to a matrix A of substantial size, which means that the calculation of the maximum of the posterior p(s|A,x) becomes prohibitively costly. This forbids a direct implementation of the above algorithms. Therefore, we propose the use of a subset selection step that offers a fast way to select a small subset of features depending on their correlation with the observation. After this selection, the optimisation routines mentioned in the previous section can be used by ignoring features not contained within the subset. With this approach, results can be obtained, even for problems of very high dimension.
Most of the coefficients s are zero with high probability and therefore most columns of A do not contribute to any one observation. In order to speed up the optimisation required to find the maximum of p(s|A,x), we propose to exclude a large set of the columns of A from the optimisation. Information about which features to keep and which to exclude has to be taken from a particular observation x. An additional requirement is that this selection process can be performed efficiently.
For extremely sparse approximations (as used in this thesis) we assume that each feature contributing to the observation has a high correlation with this observation, i.e. it is assumed that such signals are similar to their component features. The selection of a subset of features is therefore based on the correlation between the observation x and all columns of A. Due to the structure in the matrix A, this correlation can be evaluated efficiently using fast convolution. Based on this correlation it is possible to only select those features for which this correlation is high. However, an additional constraint has to be imposed. As smooth features shifted only slightly are similar to themselves, the same feature would be selected several times at adjacent locations. This can be avoided by constraining the selected subset to only include shifted versions of the same features if these are shifted by more than a certain distance, i.e. by selecting $a_{kl}$ and $a_{k\tilde{l}}$ only if $|l - \tilde{l}| / L > Q$ for some $Q < 1$.
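The fast evaluation of the correlations mentioned above can be sketched as follows. This is a minimal illustration, not the implementation used in this thesis: it assumes that all features have the same length and are stored as rows of an array, and the function name `shift_correlations` is hypothetical. Correlation with every shift of a feature is computed as a convolution with the time-reversed feature in the frequency domain.

```python
import numpy as np

def shift_correlations(x, features):
    """Inner product of x with every shift of every feature, via FFT.

    x:        1-D observation of length N.
    features: array of shape (K, L), one feature per row (assumes L <= N).
    Returns an array of shape (K, N - L + 1); entry [k, l] is the inner
    product of feature k, placed at offset l, with x.
    """
    x = np.asarray(x, dtype=float)
    features = np.asarray(features, dtype=float)
    n = x.size
    length = features.shape[1]
    size = n + length - 1  # zero-pad so that circular wrap-around is avoided
    fx = np.fft.rfft(x, size)
    # Correlation equals convolution with the time-reversed feature.
    ff = np.fft.rfft(features[:, ::-1], size, axis=1)
    full = np.fft.irfft(fx[None, :] * ff, size, axis=1)
    # Keep only the shifts at which the feature lies entirely inside x.
    return full[:, length - 1:n]
```

The result agrees with the direct evaluation of each inner product, but the cost per feature grows as N log N rather than as N times L.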
The iterative selection procedure then selects the feature and shift with the highest correlation
$$\{k_i, l_i\} = \arg\max_{\{k,l\} \in \mathcal{K}_i \times \mathcal{L}_i} \langle a_{kl}, x \rangle,$$
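where the product space of indices $\mathcal{K}_i \times \mathcal{L}_i$ is defined iteratively by removing subsets from the set of all features and shifts
$$\mathcal{K}_i \times \mathcal{L}_i = \mathcal{K} \times \mathcal{L} \setminus \bigcup_{\tilde{i} < i} k_{\tilde{i}} \times [l_{\tilde{i}} - QL;\; l_{\tilde{i}} + QL].$$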
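The iterative selection with its exclusion interval can be sketched as a greedy loop over a precomputed correlation table. The code below is an illustration under assumptions: the function name `select_subset`, the stopping rule (a fixed number of picks) and the rounding of QL down to an integer number of samples are choices made here, not details taken from the text.

```python
import numpy as np

def select_subset(corr, feature_length, q, n_select):
    """Greedily pick (feature, shift) pairs with the highest correlation.

    corr:           array of shape (K, S), corr[k, l] = <a_kl, x>.
    feature_length: the feature length L.
    q:              the fraction Q; a feature is not picked again at any
                    shift within Q*L of a shift already picked for it.
    n_select:       maximum number of pairs to return (assumed stopping rule).
    """
    work = np.array(corr, dtype=float)  # copy, so the input stays untouched
    radius = int(np.floor(q * feature_length))
    selected = []
    for _ in range(n_select):
        k, l = np.unravel_index(np.argmax(work), work.shape)
        if not np.isfinite(work[k, l]):
            break  # every remaining candidate has been excluded
        selected.append((int(k), int(l)))
        # Remove the interval [l - QL, l + QL] for this feature only.
        lo = max(0, l - radius)
        hi = min(work.shape[1], l + radius + 1)
        work[k, lo:hi] = -np.inf
    return selected
```

Note that the correlation table is computed once and only masked afterwards; no correlation is re-evaluated between picks.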
To better understand the assumptions made in this subset selection procedure we present the method from a statistical point of view. The posterior for s can be factored, using the index n instead of the indices k and l to denote the feature and the associated shift:
$$P(s \mid A, x) = P(s_{n_1} \mid A, x)\, P(s_{n_2} \mid s_{n_1}, A, x) \cdots$$
As a first approximation we only work with the MAP estimates for each distribution (i.e. we approximate the distributions with delta functions)
and further truncate the right-hand side to a few terms, which are assumed to be non-zero. For the terms with non-zero $s_n$ we assume a uniform prior for $P(s_n)$, and $P(x \mid A, s_n)$ is assumed to be Gaussian. These approximations lead to the posterior:
$$P(s_n \mid A, x) \sim \mathcal{N}(a_n s_n, \sigma^2 I)$$
The problem now is to determine which coefficients to select to be non-zero. This is done iteratively. The first non-zero coefficient $s_{n_1}$ is chosen by calculating:
$$n_1 = \arg\max_n P(s_n \mid A, x),$$
which is the index n that maximises $x^T a_n$.
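The step from the posterior to the correlation can be made explicit. The following is a sketch under assumptions that go beyond what is stated here: the columns $a_n$ are taken to have unit norm, and each $s_n$ is evaluated at its MAP value $\hat{s}_n$.

```latex
\begin{align*}
-\log P(s_n \mid A, x) &= \frac{1}{2\sigma^2}\,\lVert x - a_n s_n \rVert^2 + \text{const}, \\
\hat{s}_n &= \frac{x^T a_n}{a_n^T a_n} = x^T a_n \qquad (\lVert a_n \rVert = 1), \\
\lVert x - a_n \hat{s}_n \rVert^2 &= \lVert x \rVert^2 - (x^T a_n)^2 .
\end{align*}
```

Since $\lVert x \rVert^2$ does not depend on n, the posterior is largest for the index with the largest $(x^T a_n)^2$. If the coefficients are additionally taken to be non-negative, this is the index with the largest $x^T a_n$.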
In order to approximate the other terms in the factorisation, an expression for $P(s_n \mid s_{1:\hat{n}-1}, A, x)$ has to be found, where we use the subscript notation $1:\hat{n}$ to denote all variables with subscripts between 1 and $\hat{n}$. Here the notation $\hat{n}$ is used to distinguish the ordered indices $\hat{n}$ from the unordered indices n. Bayes' rule gives:
$$P(s_{\hat{n}} \mid A, x, s_{1:\hat{n}-1}) \propto P(x \mid A, s_{1:\hat{n}})\, P(s_{\hat{n}} \mid s_{1:\hat{n}-1}).$$
The constraint on feature shifts can be interpreted in probabilistic terms as the use of the prior $P(s_{\hat{n}} \mid s_{1:\hat{n}-1}) = P(s_{\hat{n}})\, U_{1:\hat{n}-1}$, where $P(s_{\hat{n}})$ is again a uniform distribution and $U_{1:\hat{n}-1}$ is a function which is zero for shifts around $l_{1:\hat{n}-1}$ but otherwise has a value normalising the distribution.³ The main computational advantage of the subset selection procedure results from selecting features close to the observation, which has a statistical interpretation as an approximation of the probability $P(x \mid A, s_{1:\hat{n}})$ by $P(x \mid A, s_{\hat{n}})$. Note that for a Matching Pursuit algorithm, where each feature is selected to model the residual and not the original observation, the distribution $P(x \mid A, s_{1:\hat{n}})$ is Gaussian with a mean of $\sum a_{1:\hat{n}} s_{1:\hat{n}}$, whilst here a mean of $a_{\hat{n}} s_{\hat{n}}$ is used.
Selecting the index $\hat{n}$ can therefore be done in a similar fashion as above. The correlations of all features at those shifts which do not violate the constraint are again required. These correlations have already been calculated and do not have to be re-evaluated. In a Matching Pursuit algorithm the correlation would have to be recalculated at each step, as it is determined from the residual.
³ This interpretation of the constraint in terms of conditional priors is somewhat contrived; however, it can be used to develop alternative methods in which other priors are specified such that close features are selected with a small but non-zero probability.
The difference between Matching Pursuit and the method proposed here is that Matching Pursuit selects in each step a set of features that are as orthogonal as possible, whilst the proposed method selects a set of similar features. This choice seems to be more appropriate for harmonic musical mixtures as studied in this thesis, but may not be appropriate for other signals. The results presented in the later chapters of this thesis show the performance of this subset selection method. Similar experiments with an Orthogonal Matching Pursuit algorithm for subset selection did not produce satisfactory results.
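The contrast between the two selection rules can be illustrated on a toy dictionary. The sketch below is not the experimental code of this thesis: it uses an explicit matrix with unit-norm columns instead of the shift-invariant structure, omits the shift constraint, and both function names are hypothetical.

```python
import numpy as np

def matching_pursuit_picks(A, x, n_select):
    """Plain Matching Pursuit: correlations are recomputed from the residual."""
    residual = np.array(x, dtype=float)
    picks = []
    for _ in range(n_select):
        corr = A.T @ residual  # has to be re-evaluated at every step
        n = int(np.argmax(np.abs(corr)))
        picks.append(n)
        residual -= corr[n] * A[:, n]  # assumes unit-norm columns
    return picks

def fixed_correlation_picks(A, x, n_select):
    """Rank features once by their correlation with the observation itself."""
    corr = A.T @ np.asarray(x, dtype=float)  # evaluated a single time
    order = np.argsort(-corr)
    return [int(n) for n in order[:n_select]]

# Columns 0 and 1 are similar to each other; column 2 is orthogonal to both.
A = np.array([[1.0, 0.8, 0.0],
              [0.0, 0.6, 0.0],
              [0.0, 0.0, 1.0]])
x = np.array([1.0, 0.1, 0.5])
```

For this example, ranking by the fixed correlations keeps the two similar columns, whereas Matching Pursuit removes the contribution of the first pick and then moves on to the orthogonal column.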