
3.2 Flexible Patch Models in the malaria application

3.2.4 Choosing the number of useful modes t

In image processing and computer vision it has become customary, following Cootes and Taylor [32], to use the t modes which explain a fraction f of the total variance, i.e. such that:

$$\sum_{k=1}^{t} \lambda(k) \;\geq\; f \sum_{k=1}^{d} \lambda(k) = f\,\mathrm{tr}(S) \tag{3.36}$$

with f chosen as appropriate for the application, but often in the range 0.7 to 0.95. That f depends on the application may be regarded either as an advantage or as a disadvantage, but it is clear that once a value for f has been chosen the procedure is straightforward to apply. It is undoubtedly a convenient ‘rule of thumb’, but opting immediately to use it ignores the fact that a whole chapter of Jolliffe’s book (chapter 6) is devoted to this question and that much work has been done by statistical researchers to try to provide an answer. Those interested are referred to Jolliffe’s chapter for details, but brief consideration of several of the approaches described there is worthwhile.
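As an illustration of the variance-fraction rule, the following minimal sketch returns the smallest t whose leading eigenvalues account for at least a fraction f of their sum. The function name `modes_for_variance_fraction` and the example eigenvalues are hypothetical and not taken from this work; the eigenvalues are assumed to be those of the covariance matrix S.

```python
from itertools import accumulate


def modes_for_variance_fraction(eigenvalues, f):
    """Return the smallest t such that the t largest eigenvalues
    sum to at least f times the total variance (their overall sum)."""
    if not 0.0 < f <= 1.0:
        raise ValueError("f must lie in (0, 1]")
    lam = sorted(eigenvalues, reverse=True)  # lambda(1) >= lambda(2) >= ...
    total = sum(lam)
    if total <= 0.0:
        raise ValueError("total variance must be positive")
    for t, cumulative in enumerate(accumulate(lam), start=1):
        if cumulative >= f * total:
            return t
    return len(lam)  # only reached through rounding error when f is close to 1


# Hypothetical eigenvalue spectrum, for illustration only.
example = [5.0, 3.0, 1.0, 0.5, 0.5]
t_70 = modes_for_variance_fraction(example, 0.70)
t_85 = modes_for_variance_fraction(example, 0.85)
```

Because the cumulative sum is non-decreasing in t, a larger f can never select fewer modes, which is why the choice of f is the only judgement the rule requires.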

The underlying aim in choosing t is to use only modes that describe interesting, significant variations in the data and to ignore modes which describe uninteresting variations such as noise or artefacts peculiar to the particular training sample used. This is consistent with the remarks made in previous sections about the modes with high k. However, in a pattern recognition application choosing t is less straightforward than it might otherwise seem. Discriminating characteristics may be quite subtle and easily lost if too many of the high modes with k > t are rejected.

In chapter 6 of [92] Jolliffe presents several ‘rules of thumb’, describes attempts to develop statistical tests that may be used to calculate t, and briefly discusses partial correlation, bootstrap and cross-validation methods. Many of the statistical methods are criticised as being based on over-simplistic (often Gaussian) assumptions, as often being more appropriate to factor analysis than to PCA, and as being unreliable in that they can fail in certain circumstances (i.e. may lead to the rejection of significant, important modes) and frequently produce estimates of t which are unrealistically large or unrealistically small. Bootstrap and cross-validation methods tend to be based on low-rank approximations to the data matrix X or to the covariance matrix S (recall the spectral decomposition of equation 3.18). With the possible exception of some jackknife methods, these are too complicated and computationally intensive to be useful, as well as still being potentially unreliable and in some cases more applicable to factor analysis than to PCA. In practical applications it seems we are left with the ‘rules of thumb’.

The first rule is to choose the number of modes t that explain a fraction f of the total variance, as in 3.36 above. If an independent estimate of the noise level σ² is available, this could be improved by adopting the probabilistic (pPCA) point of view and choosing t such that the average of the unexplained variation over the rejected modes with k = t + 1, …, n (when n ≤ d) is consistent with the expected noise level of a patch, σ_P², i.e. such that:

$$\frac{1}{n-t}\sum_{k=t+1}^{n} \lambda(k) \;\approx\; \sigma_P^2 \tag{3.37}$$

In imaging applications, it would be ideal to estimate the noise level of a patch, σ_P², from images taken at different times but otherwise of exactly the same scene. Unfortunately such images are rarely available, even in applications where successive images in a video sequence may be taken, so other estimates have to be used as a fallback. For example, in the malaria application one could try to estimate σ_P² from the background plasma region, but even if done manually this is not ideal, as the background itself must be modelled satisfactorily and imaging noise may have multiplicative characteristics rather than being purely additive.
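The noise-based refinement of the first rule can be sketched as follows. This is only one possible reading: ‘consistent with the expected noise level’ is interpreted here, as an assumption, to mean that the mean of the rejected eigenvalues does not exceed the noise estimate. The function name `modes_for_noise_level` and the numbers used are hypothetical.

```python
def modes_for_noise_level(eigenvalues, noise_variance):
    """Return the smallest t for which the mean of the rejected
    eigenvalues lambda(t+1), ..., lambda(n) is at most noise_variance.

    Assumption: 'consistent with the noise level' is read as '<='.
    """
    lam = sorted(eigenvalues, reverse=True)
    n = len(lam)
    tail_sum = sum(lam)  # sum of the eigenvalues rejected when t = 0
    for t in range(n):
        # tail_sum holds lambda(t+1) + ... + lambda(n); n - t modes are rejected
        if tail_sum / (n - t) <= noise_variance:
            return t
        tail_sum -= lam[t]  # retain mode t+1 and try again
    return n  # no rejected set is compatible with the noise estimate


# Hypothetical spectrum and noise estimate, for illustration only.
example = [5.0, 3.0, 1.0, 0.5, 0.5]
t_noise = modes_for_noise_level(example, noise_variance=0.5)
```

The result is only as good as the noise estimate supplied, which is exactly the practical difficulty described above.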

The second rule discussed by Jolliffe focuses on the magnitudes of the variances of the PCs, i.e. on the magnitudes of the eigenvalues λ(k) = σ(k)². A simple approach, following Kaiser’s rule for PCA of correlation matrices, would be to choose the cut-off λc for the PCs to be retained according to the average of the eigenvalues, λ̄. For a full rank covariance matrix S this average would be:

$$\bar{\lambda} = \frac{1}{d}\sum_{k=1}^{d} \lambda(k) = \frac{1}{d}\,\mathrm{tr}(S) \tag{3.38}$$

but if S is rank-deficient with rank r < d (and, in particular, in image processing and computer vision, where r ≤ n << d) it could be argued that it would be preferable to take:

$$\bar{\lambda} = \frac{1}{r}\sum_{k=1}^{r} \lambda(k) = \frac{1}{r}\,\mathrm{tr}(S) \tag{3.39}$$

Jolliffe ([91], [92]) suggests that a lower cut-off at λc = 0.7λ̄ may be more appropriate in order to avoid 3.38 or 3.39 selecting too few PCs which, given the rapid decrease of the λ(k) with k shown in figure 3.3, seems likely in our application.
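The average-eigenvalue cut-off, with or without the factor 0.7, can be sketched as below. The function name `modes_above_mean_cutoff`, the `use_rank` switch between dividing by d and by r, and the relative tolerance used to estimate the numerical rank are all assumptions made for illustration.

```python
def modes_above_mean_cutoff(eigenvalues, scale=1.0, use_rank=False, rel_tol=1e-12):
    """Count the eigenvalues exceeding scale * (mean eigenvalue).

    use_rank=False averages over all d eigenvalues; use_rank=True averages
    over the r eigenvalues that are not negligible. scale=0.7 gives the
    lower cut-off attributed to Jolliffe in the text.
    """
    lam = sorted(eigenvalues, reverse=True)
    if not lam or lam[0] <= 0.0:
        raise ValueError("need at least one positive eigenvalue")
    # Numerical rank r: eigenvalues not negligible relative to the largest
    # (rel_tol is an assumed, illustrative threshold).
    r = sum(1 for v in lam if v > rel_tol * lam[0])
    divisor = r if use_rank else len(lam)
    cutoff = scale * sum(lam) / divisor
    return sum(1 for v in lam if v > cutoff)


# Hypothetical rank-deficient spectrum: d = 8, r = 5.
example = [5.0, 3.0, 1.5, 0.3, 0.2, 0.0, 0.0, 0.0]
t_full = modes_above_mean_cutoff(example)                             # mean over d
t_rank = modes_above_mean_cutoff(example, use_rank=True)              # mean over r
t_rank_07 = modes_above_mean_cutoff(example, scale=0.7, use_rank=True)
```

Averaging over r rather than d raises the cut-off when S is rank-deficient, so fewer modes survive; the factor 0.7 partly offsets this.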

Finally, we note that Jolliffe mentions the “broken-stick” model, in which the arbitrary cut-off λc = 0.7λ̄ above is replaced by (for a full rank matrix S):

$$\lambda_c(t) = \bar{\lambda}\sum_{k=t}^{d} \frac{1}{k} \tag{3.40}$$

with λ̄ as in 3.38. It represents a kind of ‘parallel analysis’ (see below) in which the t-th eigenvalue λ(t) is compared with the expected magnitude of the t-th largest segment that would be obtained were tr(S) broken at random into d segments. If S is of rank r < d it would seem reasonable to replace d by r in equation 3.40 and to use equation 3.39 instead of equation 3.38.
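A minimal sketch of the broken-stick comparison follows. It uses the standard result that the expected length of the t-th largest piece of a unit stick broken at random into p pieces is (1/p) times the sum of 1/k for k = t, …, p, scaled here by the total variance. The function name `broken_stick_modes`, the optional `rank` argument, and the convention of stopping at the first eigenvalue that fails the comparison are assumptions.

```python
def broken_stick_modes(eigenvalues, rank=None):
    """Return the number of leading eigenvalues that exceed their
    broken-stick expectation; stops at the first one that does not.

    rank: number of segments p (defaults to the number of eigenvalues, d;
          pass r for a rank-deficient covariance matrix).
    """
    lam = sorted(eigenvalues, reverse=True)
    p = len(lam) if rank is None else rank
    if not 1 <= p <= len(lam):
        raise ValueError("rank must lie between 1 and the number of eigenvalues")
    mean_lambda = sum(lam[:p]) / p
    # thresholds[t-1] = mean_lambda * sum_{k=t}^{p} 1/k, accumulated from the tail
    thresholds = [0.0] * p
    tail = 0.0
    for k in range(p, 0, -1):
        tail += 1.0 / k
        thresholds[k - 1] = mean_lambda * tail
    t = 0
    for value, threshold in zip(lam, thresholds):
        if value <= threshold:
            break
        t += 1
    return t


# Hypothetical spectrum, for illustration only.
t_stick = broken_stick_modes([5.0, 3.0, 1.0, 0.5, 0.5])
```

A flat spectrum retains no modes under this rule, since even the largest eigenvalue falls short of the expected longest segment.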

The third rule is concerned with the rate of decrease of the λ(k) as k increases and requires judgement of when λ(k) − λ(k + 1) stops being large. This depends both on the relative magnitudes of λ(k) − λ(k + 1) and its predecessor λ(k − 1) − λ(k) and on their absolute values. It is obviously motivated by the fact that, as we have seen, the λ(k) are expected to become almost constant at high k if they represent noise, but this does not specify how to decide when λ(k) − λ(k + 1) is no longer large. Regarding a plot of λ(k) against k as a ‘scree-graph’ and looking for the ‘knee’ on the curve where it stops being steep can be highly subjective, although for graphs like that shown in figure 3.3 it is possible that most researchers might choose a similar cut-off t.

One way of making this less subjective, and even of potentially automating it, is to look for the point on the curve where the magnitude of its slope is equal to that of the chord, (λ(1) − λ(r))/(r − 1), or, more simply and more conservatively since more modes would be retained, to λ(1)/d. Should these yield values of t explaining similar fractions of the total variance, one might be confident that a reasonable cut-off had been specified. If there are several points where the slope is equal to that of the chord, the conservative choice of the largest resulting t can be made. Another way would be to carry out a ‘parallel analysis’ of the eigenvalues of a (suitably defined) random matrix and to compare the two sets of eigenvalues.
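One way the chord-slope criterion might be automated for a discrete spectrum is sketched below. The discretisation is an assumption: the slope at k is taken to be the drop λ(k) − λ(k + 1), and t is the largest k whose drop is still at least the reference slope. The function name `knee_by_reference_slope` and the example values are hypothetical.

```python
def knee_by_reference_slope(eigenvalues, reference_slope=None):
    """Return the largest k for which lambda(k) - lambda(k+1) is at
    least reference_slope (0 if no drop is that large).

    reference_slope defaults to the chord slope
    (lambda(1) - lambda(r)) / (r - 1) over the eigenvalues supplied.
    """
    lam = sorted(eigenvalues, reverse=True)
    if len(lam) < 2:
        raise ValueError("need at least two eigenvalues")
    if reference_slope is None:
        reference_slope = (lam[0] - lam[-1]) / (len(lam) - 1)
    if reference_slope <= 0.0:
        raise ValueError("a flat spectrum has no knee")
    t = 0
    for k in range(1, len(lam)):
        drop = lam[k - 1] - lam[k]  # lambda(k) - lambda(k+1) in 1-based terms
        if drop >= reference_slope:
            t = k  # keep the largest such k (the conservative choice)
    return t


# Hypothetical non-zero eigenvalues (r = 5) of a d = 20 dimensional problem.
nonzero = [5.0, 3.0, 1.0, 0.5, 0.5]
t_chord = knee_by_reference_slope(nonzero)                                 # chord slope
t_simple = knee_by_reference_slope(nonzero, reference_slope=nonzero[0] / 20)  # lambda(1)/d
```

In this example the λ(1)/d reference is the smaller of the two slopes and so retains one more mode, in line with the remark that it is the more conservative choice.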

Other methods described by Jolliffe include testing the hypothesis that successive eigenvalues are equal, commencing with the last two that are non-zero (Bartlett’s test) or, in reverse, commencing with the two largest eigenvalues (Jackson’s test); cross-validation (and bootstrap) methods; and a partial correlation test. Cross-validation, bootstrap and jackknife approaches essentially look at how well the retained modes can approximate data that was not in the training set. This is a reasonable criterion, but unfortunately the goodness of fit has itself to be specified, so we are returned to the kind of issues that confronted the first rule and, except for some jackknife methods (see Jolliffe [92], section 6.1.5), are confronted with vastly more computational work. Similarly, the level of significance has to be specified in methods that utilise hypothesis testing. The only other method Jolliffe discusses (section 6.1.6) which avoids this kind of difficulty is Velicer’s partial correlation criterion, in which one looks for a minimum of the average V of a particular squared partial correlation measure.

Unfortunately, Jolliffe describes Velicer’s method as most suited to factor analysis of correlation matrices rather than to PCA. It would seem therefore that only the first three rules described above are worthy of much consideration, with the first having the obvious merit of being the simplest. It is also the most widely used in computer vision and image processing applications and, though it would seem interesting to explore the other two, is the only rule used in this work.