Pattern Recognition Approaches in Biomedical and Clinical Magnetic Resonance Spectroscopy
3.3 The pattern recognition process
3.3.1 Pattern representation and dimensionality reduction
As pointed out above, the presentation of data is a major factor in determining the performance of a pattern recognition system, whether for classification or prediction. We seek inputs to the system that are a true and accurate representation of the process under investigation. A major part of the pattern representation process is to reduce the dimensionality of the measurement vector¹. Lower-dimensional patterns are desirable because they yield more compact, less complex pattern recognition systems which, by the principle of Ockham’s razor², are generally more accurate. According to Bayes’ decision theory there is no reason to limit the number of features or variables of the input pattern to the classifier. In practice, however, only a finite set of samples is available for the design of the classifier, and so the performance of the classifier deteriorates as the dimension of its input increases (the curse of dimensionality) (Duda and Hart, 1973). In MRS, the measurement vector has a large number of components (thousands in the case of high resolution spectroscopy). Inevitably, some of these components will be irrelevant or redundant to the particular classification task under investigation and their presence will have a detrimental effect on the classifier performance. Therefore, it is preferable that the classification or prediction system is designed on the basis of only a few significant features that characterise the measurement vector. This is achieved by (1) selecting a subset of variables directly from the components of the measurement vector assessed to be representative of the process, while discarding redundant and less relevant measurements (variable or feature selection), (2) combining (transforming) the original measurements to form a new set with fewer features (feature extraction), or (3) using both approaches³. In any case it is desirable to preserve the information content of the original patterns. The result of dimensionality reduction is the feature vector. Several studies have been dedicated to defining what ‘good’ features are, searching for them and evaluating their classification/prediction capabilities (Kittler, 1975; Ben-Bassat, 1980; Ben-Bassat, 1982; Ray, 1985; Choakjarernwanit, 1992).
¹ In MRS one may regard the FID signals as the measurement vectors and the subsequent filtering and Fourier transformation as steps in the pattern representation process. For simplicity we shall treat the full spectral pattern in the frequency domain as the measurement vector.
² After the logician and Franciscan monk William of Ockham (1285-1349). The simplest explanation of some phenomenon is more likely to be accurate than more complicated explanations. See http://www.weburbia.com/physics/occam.html.
Chapter 3: Pattern recognition in MRS
In MRS, several approaches to reducing the large dimensionality of the spectral patterns have been adopted, many of which make assumptions about the importance of certain regions or peaks in the spectra. The most common approach is peak picking (metabolite selection), using either the peak area or the peak height (Holmes et al, 1992; Hanaoka et al, 1993; Ghauri et al, 1993; Anthony et al, 1994; Hagberg et al, 1995; Kari et al, 1995; Preul et al, 1996; Tate et al, 1996).
Another approach is to reduce spectral resolution by integrating all the peaks in a segmented spectrum or by representing a segment with its maximum height (Farrant et al, 1992; Howells et al, 1992a, b; Branston et al, 1993; Holmes et al, 1994). Tate et al (1996), on the other hand, identified a region of interest in the spectra, transformed it into wavelet coefficients (section 4.7) and selected the coefficients most highly correlated with the corresponding class. Other studies developed their own ways of describing the spectral data in a more compact form, such as the seven-level scoring mechanism developed by Gartland et al (1990; 1991) to encode the changes in concentration of a number of metabolites selected from the spectra at three time points. The same mechanism was used by Anthony et al (1994) on their selected peaks. Spraul et al (1994) segmented the spectrum and counted the number of peaks in each segment above a certain threshold. They also compared the effects of the segment width and line broadening on the classification. Many of the aforementioned approaches were followed by principal components analysis (PCA) (see section 4.2) for further reduction in dimensionality; for example, Somorjai et al (1995a) down-sampled the two selected regions (0.64-2.59 ppm and 2.59-3.41 ppm), then transformed and reduced each of them separately using PCA, while Tate et al (1996) used PCA following the wavelet transformation of the data. Other approaches to dimensionality reduction include those of Thomsen and Meyer (1989), Ala-Korpela et al (1995) and Usenius et al (1996), who used all the spectral components of the measurement vector (the original spectrum) from selected regions directly, while Lisboa and Mehridehnavi (1996) used the entire spectra and designed classifiers to select the significant discriminating variables automatically.
³ The distinction between feature selection and extraction seems to be blurred in the pattern recognition literature and the terms are sometimes used synonymously. This is probably because the transformation that extracts the new features is usually followed by a selection of fewer of those features, as in the case when PCA is used (see section 3.4.1). Our definitions of feature extraction and feature selection follow Devijver and Kittler (1982).
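The segment-based reductions described above (integrating each segment, or keeping its maximum height) can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the procedure of any of the cited studies: the function name `reduce_by_segments`, the fixed segment width counted in data points, and the toy spectrum are all hypothetical, and the sum of intensities is used as a crude stand-in for the segment integral.

```python
def reduce_by_segments(spectrum, segment_width, mode="integral"):
    """Reduce a 1-D spectrum to one feature per fixed-width segment.

    spectrum      : list of intensities (the measurement vector)
    segment_width : number of consecutive points per segment (assumed fixed)
    mode          : "integral" sums the intensities in each segment,
                    "max" keeps the maximum height of each segment
    """
    if segment_width <= 0:
        raise ValueError("segment_width must be a positive integer")
    if mode not in ("integral", "max"):
        raise ValueError("mode must be 'integral' or 'max'")

    features = []
    for start in range(0, len(spectrum), segment_width):
        # The last segment may be shorter than segment_width.
        segment = spectrum[start:start + segment_width]
        features.append(sum(segment) if mode == "integral" else max(segment))
    return features


# Toy spectrum with nine points, reduced to three features.
toy_spectrum = [0, 1, 3, 1, 0, 2, 5, 2, 0]
print(reduce_by_segments(toy_spectrum, 3, "integral"))  # [4, 3, 7]
print(reduce_by_segments(toy_spectrum, 3, "max"))       # [3, 2, 5]
```

The segment width trades resolution against dimensionality: wider segments give fewer features but merge neighbouring peaks.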
Data reduction and feature extraction in biomolecular MRS are treated using different approaches from the above. Multidimensional MRS is used in protein structure prediction in two different ways: through-bond spectra to identify the residues, and through-space spectra to identify neighbouring residues and construct longer sequences from the previous two steps. Each stage requires handling the data differently. Prior knowledge plays an important role in extracting (or constructing) good descriptors from the spectra, and encoding the spectral information to reflect structural information requires more understanding of the process. Peaks are usually selected from more than one spectrum (e.g. ¹H and ¹³C) acquired in one, two and higher dimensions, such that simple peak picking becomes a separate classification task (Corne et al, 1993). Changes in spectra are not only in peak amplitudes: the whole spin system can change significantly from one sample to another depending on its components, and finding descriptors to characterise structure-chemical shift and function-chemical shift relationships is not straightforward. As a result, the classification or prediction problem is formulated differently from that in biomedical MRS. We shall therefore restrict the scope of this section to data representation and reduction in the biomedical MRS literature only and refer the interested reader to the more specialised literature of biomolecular MRS (e.g. see Zimmerman and Montelione, 1995).
3.3.2 Estimating the probability distribution of MRS data
As pointed out above, Bayes’ decision theory relies on the knowledge of the probability distributions of the data set, or their estimates. It is thus natural to consider the probability distributions of MRS data before proceeding to discuss the common classification approaches used. To the best of our knowledge there have been no previous attempts to model the probability distributions of MRS data prior to selecting the classification techniques.
There are several ways to estimate the probability distribution, either by assuming its form (parametric) or by using the data directly (non-parametric) (see Duda and Hart, 1973; Bishop, 1995). One of the simplest techniques is Parzen windows (Parzen, 1962), also known as kernel functions. This is a non-parametric estimation technique which, if used carefully, yields reasonably accurate results. It defines a window function, the kernel, centred at each given data point. It then interpolates over the range of the given data such that each data point contributes to the estimate at a given location according to its distance from that location. Let $\mathbf{x}_k$ be the kth D-dimensional data point ($k = 1, \ldots, N$); then, assuming a Gaussian kernel function⁴, the estimate of the joint probability density function is given by (Bishop, 1995)

$$p(\mathbf{x}) = \frac{1}{N} \sum_{k=1}^{N} \frac{1}{(2\pi h^{2})^{D/2}} \exp\left\{ -\frac{\|\mathbf{x} - \mathbf{x}_k\|^{2}}{2h^{2}} \right\} \qquad (3.1)$$

where $\|\cdot\|$ indicates the Euclidean norm. The width of the kernel, h, acts as a smoothing factor. An appropriate value for h is important if a faithful estimate of the true density is to be obtained. A large value of h results in an over-smoothed function, while too small a value brings out the differences between the individual data points. Another factor in the choice of h is the range of values of x. Examples of distribution estimates from three different proton MRS data sets are shown in Fig. 3.3.
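The effect of the kernel width h can be checked numerically with a small sketch of the Gaussian-kernel Parzen estimate. It assumes the standard normalisation (2πh²)^(D/2) for an isotropic Gaussian kernel; the function name `parzen_gaussian_density` and the two-point toy data set are hypothetical and are not taken from the data sets discussed in this section.

```python
import math


def parzen_gaussian_density(x, data, h):
    """Gaussian-kernel Parzen estimate of the density at point x.

    x    : list of D coordinates where the density is evaluated
    data : list of N data points, each a list of D coordinates
    h    : kernel width (smoothing factor), must be positive
    Assumes an isotropic Gaussian kernel normalised by (2*pi*h^2)^(D/2).
    """
    if h <= 0:
        raise ValueError("h must be positive")
    if not data:
        raise ValueError("data must contain at least one point")

    dim = len(x)
    norm = (2.0 * math.pi * h * h) ** (dim / 2.0)
    total = 0.0
    for point in data:
        # Squared Euclidean distance between x and this data point.
        sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, point))
        total += math.exp(-sq_dist / (2.0 * h * h))
    return total / (len(data) * norm)


# Two 1-D samples: a small h keeps two separate modes,
# a large h merges them into a single smooth bump.
samples = [[0.0], [1.0]]
for h in (0.1, 1.0):
    at_sample = parzen_gaussian_density([0.0], samples, h)
    at_midpoint = parzen_gaussian_density([0.5], samples, h)
    print(h, at_sample > at_midpoint)
```

With h = 0.1 the estimate is higher at a sample than midway between the samples (two modes); with h = 1.0 the midpoint becomes the maximum, which is the over-smoothing behaviour described in the text.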
[Figure 3.3: six density-estimate panels plotted against x; panel titles in order of appearance: h = 0.02, 0.005, 0.01, 0.01, 0.001, 0.0075.]
Figure 3.3 Modelling the joint probability density function of high resolution proton MRS of perchloric acid extracts from three different experiments at different values of h: A-A’) human high grade glioma cell cultures, 2000 variables, N=19. Even when using a larger value for the smoothing parameter h, the data are too skewed to be considered Gaussian. B-B’) Human meningioma from Florian et al (1995), 2000 variables, N=27. The data are bimodal at h=0.005 and multimodal at h=0.001. A smaller value of h would bring out the differences between individual samples rather than the true distribution structure. C-C’) Animal tumours from Howells et al (1992a), 180 variables, N=71 (original data had 84 spectra). Even at the higher value of the smoothing factor h, the data cannot be considered Gaussian. Increasing h further will eventually force a Gaussian distribution.
Even with the Gaussian kernel, the estimates seem to suggest that either (i) these data sets are not Gaussian or unimodal, or (ii) there are not enough data points to bring out the normality of the distributions. Both observations are important when deciding which technique to adopt for designing the appropriate classifier. Choosing a technique that assumes normality, explicitly or implicitly, may give inaccurate results. Using the Gaussian kernel it is possible to increase the value of the factor h until a smooth unimodal Gaussian distribution is imposed. However, this risks losing information about the true nature of the data. In addition, once the distribution is estimated, it should be the properties of the estimate, rather than the individual data points, that are used in subsequent analysis or classifier design.
⁴ A Gaussian kernel favours or, for large values of h, imposes a Gaussian distribution. Parzen’s original function was uniform.
3.3.3 Discrimination and Classification
Using the probability density function, or its estimates, has the advantage of simplifying the design and implementation of the classification system and providing analytical terms for the expected performance, which helps in the process of validating and testing the system against the available data. Once a choice is made of the most informative features from the measurement vector (section 3.3.1) and whether to use an estimate of the probability density function or the available data directly, the appropriate analysis algorithm is designed. According to Johnson and Wichern (1992), discriminant analysis attempts to maximise the separation between the observed data sets, while classification derives rules that allocate new observations to pre-defined categories. This distinction, however, is somewhat artificial, since a separating function is often used to assign new measurements, while a decision rule for classification also separates the categories. According to Raudys and Jain (1991) there are over 200 classifier algorithms to choose from. The choice of algorithm will depend on the complexity of the problem, the number of features and the objectives of the classifier. In section 4 we will discuss the details of the most common analysis algorithms used in MR spectroscopy.
3.3.4 Error estimation and confidence measures
Once the design parameters of the classifier are estimated from the training data, the next step is to predict the future classification performance (block e in Fig. 3.2). The average probability of error makes an ideal performance index (Devijver and Kittler, 1982). Yet, even for a Bayesian classifier, calculating the probability of error is very difficult. In practice, the performance is assessed by testing the classifier experimentally, counting the number of errors it makes and using the result as an indicator of future accuracy. Unless the training data set is very large, the samples used for training (designing the classifier) and those used for testing must be statistically independent (or at least different) (Devijver and Kittler, 1982). This is usually achieved by partitioning the available data into a training set and a test set. However, when the data are limited, one wishes to use as many samples for the design as possible. The leave-one-out error estimate removes one sample from the data set, uses the remaining samples for the design and tests the resulting classifier on the left-out sample. The process is then repeated until every sample has been tested independently. The general form of leave-one-out is cross-validation, where the roles of design and test data are repeatedly interchanged.
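The leave-one-out procedure can be sketched in a few lines. The sketch below is illustrative only: the nearest-class-mean classifier is a hypothetical stand-in chosen for brevity (it is not a classifier proposed in this chapter), and the names `fit_nearest_mean` and `leave_one_out_error` as well as the toy data are assumptions.

```python
def fit_nearest_mean(train_x, train_y):
    """Hypothetical stand-in classifier: assign a pattern to the class
    whose training mean is closest in squared Euclidean distance."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + xi for s, xi in zip(sums[y], x)]
        counts[y] += 1
    means = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        return min(
            means,
            key=lambda y: sum((xi - mi) ** 2 for xi, mi in zip(x, means[y])),
        )

    return predict


def leave_one_out_error(xs, ys, fit):
    """Fraction of samples misclassified when each sample in turn is
    left out of the design set and used as the only test sample."""
    errors = 0
    for i in range(len(xs)):
        design_x = xs[:i] + xs[i + 1:]   # all samples except the i-th
        design_y = ys[:i] + ys[i + 1:]
        predict = fit(design_x, design_y)
        if predict(xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)


# Toy 1-D data: the sample at 0.9 labelled "a" lies close to class "b".
xs = [[0.0], [0.1], [0.9], [1.0], [1.1]]
ys = ["a", "a", "a", "b", "b"]
print(leave_one_out_error(xs, ys, fit_nearest_mean))  # 0.2
```

Each sample is tested by a classifier that never saw it during design, so the error count is not biased by reuse of the design data, at the cost of redesigning the classifier once per sample.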
Even if a classifier consistently gave the correct answers on all the sets in a cross-validation test, error counting does not provide any information on how confident the classifier is in making its decisions. In addition, the routine use of the classifier will be to assign new data with unknown categories, and hence there is no way to check whether the classifier is making the correct decisions. Error bars and confidence measures are estimates of the quality of these decisions regardless of what they are. They also provide information on the similarity between the new data and the training data. Unknown data close to the decision boundary or different from the training samples will be classified with low confidence. In Bayesian classifiers, the outputs are usually expressed in terms of the posterior probability distribution. The standard deviation of the predictive distribution can be interpreted as an error bar on the mean of this distribution: the more confident the system is about the prediction, the narrower the posterior distribution (Bishop, 1995). For non-Bayesian or non-parametric classifiers, it is possible to define a confidence measure if the outputs can be expressed approximately as probabilities. Since the unknown pattern is assigned to the class corresponding to the maximum output, a confidence distance can be defined as the difference between the maximum predicted output and its nearest contender. The larger this difference, the more confident the system is in deciding for the winning class. In order to compare different classifiers based on this scheme, the outputs from all classifiers need to be scaled in the same way so that the confidence distance remains consistent across all designs. This confidence scheme, however, has the disadvantage of not being applicable to regression problems, where a more explicit estimate of the posterior probability distribution is required and Bayesian tools will need to be deployed (Bishop, 1995).
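The confidence distance described above reduces to a simple computation once the classifier outputs are on a common scale. The sketch below is a minimal illustration: the names `scale_outputs` and `confidence_distance` and the class labels are hypothetical, and dividing by the sum of the outputs is only one possible way (assumed here, and valid for non-negative outputs only) of scaling different classifiers consistently.

```python
def scale_outputs(outputs):
    """Scale non-negative class outputs so that they sum to one.
    This normalisation is an assumption made for illustration."""
    total = sum(outputs.values())
    if total <= 0:
        raise ValueError("outputs must be non-negative with a positive sum")
    return {label: value / total for label, value in outputs.items()}


def confidence_distance(outputs):
    """Return the winning class and the difference between the largest
    output and its nearest contender."""
    if len(outputs) < 2:
        raise ValueError("at least two classes are required")
    ranked = sorted(outputs.items(), key=lambda item: item[1], reverse=True)
    (winner, best), (_, runner_up) = ranked[0], ranked[1]
    return winner, best - runner_up


# Hypothetical raw outputs of a three-class classifier.
raw = {"class_a": 2.0, "class_b": 1.0, "class_c": 1.0}
print(confidence_distance(scale_outputs(raw)))  # ('class_a', 0.25)
```

A pattern near a decision boundary produces two nearly equal top outputs and hence a confidence distance close to zero, whatever class is finally chosen.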