
5.4 Processing of MSI data

5.4.3 Peak selection

Peak selection has been shown to be integral to obtaining useful multivariate models of MSI data [212], and here it is confirmed that peak selection is useful: it decreases the number of non-informative variables, and it reduces the data size and calculation times. Peak selection is not trivial because spectral information and quality vary between pixels, so no simple definition for rejecting noisy variables is directly available. A number of approaches were evaluated; here, two methods are applied consecutively, and both fulfil two important criteria. Firstly, a thorough selection of peaks is achieved, in which only a small fraction of the original number of variables is retained. Secondly, the approaches are pragmatic and relatively intuitive to understand and calculate.

Approach A for peak selection: correlation with matrix peaks

Preliminary analysis of the data set demonstrated that some peaks are more prominent in the region surrounding the tissue sample than on the sample itself, and these therefore probably arise from the matrix solution applied in MALDI. These peaks, although possibly informative, do not directly convey information on endogenous metabolites. Hence, as a first approach, it was decided to remove these peaks. One such peak, at m/z = 172.0, originates from the α-CHCA matrix solution ([M - H₂O + H]+, m/z = 172.04, C₁₀H₆NO₂). The correlation of each m/z variable with this selected matrix peak at m/z = 172.0 across all pixels was evaluated, because peaks with a negative correlation with the matrix peaks are anticipated to be more prominent on the tissue than in the region outside the tissue, and to correspond to interesting variables. Conversely, peaks correlating positively with a matrix peak are more likely to be related to the matrix and other experimental settings than to endogenous metabolic variation. To correct for any chance correlations, 10 prominent matrix peaks were used, selected by calculating the covariance of each m/z value with the peak at m/z = 172.0. The 10 peaks displaying the highest covariance with the selected matrix peak are indicated in figure 5.2 A; covariance was used rather than correlation in order to select high-intensity peaks only and to avoid selecting isotopes instead of different matrix molecules.

[Figure 5.2, panels A, B and C: mean intensity (0 to 6000) plotted against m/z (100 to 1000). Peaks labelled in the plot: m/z 116, 144, 146, 172, 190, 234, 379, 401, 650 and 656.]

Figure 5.2: (A) The 10 peaks that were selected to represent the main matrix peaks (those with maximum covariance with m/z = 172.0) are shown in red. The y-axis is the mean intensity, calculated across all 20535 pixels for the 4751 variables. (B) The peaks whose summed correlation with the 10 selected matrix peaks is negative are shown in orange (1224 variables, approach A). (C) Peaks from the selection in B that fulfilled the second criterion, based on the variance explained in a PCA model, are shown in green (564 variables, approach B).
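The covariance-based choice of matrix peaks described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes NumPy, a pixels-by-variables intensity matrix `X`, and a hypothetical function name `top_covarying_peaks`; the toy data are invented for the example. The reference peak is counted among the selected peaks, since its covariance with itself is its variance.

```python
import numpy as np

def top_covarying_peaks(X, ref_idx, k=10):
    """Indices of the k variables with the highest covariance with column ref_idx.

    X is a (pixels x variables) intensity matrix. The reference column can be
    part of the result, because its covariance with itself is its variance.
    """
    Xc = X - X.mean(axis=0)                          # mean-centre every m/z variable
    cov = Xc.T @ Xc[:, ref_idx] / (X.shape[0] - 1)   # covariance with the reference peak
    return np.argsort(cov)[::-1][:k]                 # highest covariance first

# Toy data (hypothetical): 4 pixels, 4 variables; column 0 plays the role of m/z = 172.0.
X = np.array([[10.0, 20.0, 1.0, 5.0],
              [ 0.0,  0.0, 2.0, 5.0],
              [10.0, 22.0, 1.0, 6.0],
              [ 0.0,  1.0, 2.0, 4.0]])
selected = top_covarying_peaks(X, ref_idx=0, k=2)
```

Because covariance scales with intensity, the high-intensity column 1 ranks above the reference column itself in this toy example, which mirrors the reason given above for preferring covariance over correlation at this step.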

The correlations of each variable in the data set (4751 in total) with each of the 10 matrix peaks were calculated and summed; only those peaks with a negative sum of the 10 correlations were retained, shown in orange in figures 5.2 and 5.3. It is clear from examples in the different correlation regions, shown in figure 5.3, that positive correlations indeed correspond to variables with a higher signal intensity outside the tissue region than within the tissue, which are therefore unlikely to be biologically relevant (e.g. m/z = 650.2). Peaks with a low correlation, e.g. m/z = 207.6, mostly represent noisy variables, and peaks with a negative overall correlation, e.g. m/z = 761.6, display a clear structure and distribution in the sample.

Figure 5.3: The correlations of each variable with the 10 selected matrix peaks (see figure 5.2 A) were summed, and the summed correlations were sorted. Note that the x-axis is arranged according to decreasing correlation with the matrix peaks, which is plotted on the y-axis, and does not relate to the individual m/z values. Only peaks with a negative summed correlation are retained (coloured orange). Images of 3 selected variables demonstrate that positive correlations correspond to higher intensity outside the sample (m/z = 650.2), low correlations often correspond to non-informative peaks (m/z = 207.6), and large negative correlations show clear relevance to the biological tissue (m/z = 761.6).
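The summed-correlation criterion of approach A can be illustrated with a short sketch. It assumes NumPy and a pixels-by-variables matrix `X`; the function name `negative_summed_correlation`, the handling of constant columns and the toy data are assumptions made for this example only, not details taken from the original analysis.

```python
import numpy as np

def negative_summed_correlation(X, matrix_idx):
    """Boolean mask: True where the summed Pearson correlation of a variable
    with the matrix-peak columns in matrix_idx is negative (variable retained)."""
    Xc = X - X.mean(axis=0)                  # mean-centre every variable
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = np.inf               # assumption: constant columns get correlation 0
    Z = Xc / norms                           # centred, unit-norm columns
    corr = Z.T @ Z[:, matrix_idx]            # (variables x matrix peaks) correlations
    return corr.sum(axis=1) < 0

# Toy data (hypothetical): columns 0 and 1 act as matrix peaks,
# column 2 is high where the matrix peaks are low, column 3 follows the matrix.
X = np.array([[10.0, 20.0, 1.0, 5.0],
              [ 0.0,  0.0, 2.0, 5.0],
              [10.0, 22.0, 1.0, 6.0],
              [ 0.0,  1.0, 2.0, 4.0]])
retained = negative_summed_correlation(X, matrix_idx=[0, 1])
```

In this toy case only column 2 is retained; the matrix peaks themselves are rejected because their self-correlation of 1 makes their sum positive.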

Approach B for peak selection: variance explained in PCA on the image

It is clear that the peak selection made with approach A, which was based on correlation with the matrix, could be improved, since noisy variables were still included (e.g. see m/z = 207.6 in figure 5.3). Although many multivariate approaches are able to cope with noisy variables, model strength decreases when a large number of non-informative variables is present. For univariate methods, the effects of noise can be even more problematic. Therefore, it was proposed to identify the variables in the selection of approach A that lack any relation to anatomy and are likely to be noise. This was achieved using a PCA-based decomposition of each m/z image: the intensity values of the pixels for the selected m/z are arranged in a matrix, where rows correspond to y-positions and columns to x-positions in the sample. The variance explained by the first principal component of the PCA model (with only mean-centring) was used as an indicator of the image-related intensity distribution of the variable. If the intensity differences are randomly distributed, the variance explained in the PCA model will be low, e.g. for m/z = 989.6, see figure 5.4. On the other hand, if there is any structure in the image, more variance is modelled with PCA, as is shown for m/z = 873.6 in figure 5.4.
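The per-image criterion can be sketched as below. This is an illustrative reading of the description, assuming NumPy and computing the PCA through a singular value decomposition of the column-centred image; the function name `pc1_explained_variance` and the toy images are hypothetical. Columns containing only zeros are dropped before the decomposition, as stated later in this section.

```python
import numpy as np

def pc1_explained_variance(image):
    """Fraction of variance captured by the first principal component of one
    m/z image (rows: y-positions, columns: x-positions), mean-centred only."""
    img = np.asarray(image, dtype=float)
    img = img[:, np.any(img != 0, axis=0)]         # drop all-zero columns
    if img.size == 0:
        return 0.0                                 # empty image: nothing to model
    centred = img - img.mean(axis=0)               # column mean-centring, no scaling
    s = np.linalg.svd(centred, compute_uv=False)   # singular values
    total = np.sum(s ** 2)
    return float(s[0] ** 2 / total) if total > 0 else 0.0

# Toy images (hypothetical): a smooth intensity gradient versus a pattern
# whose variance is spread evenly over three components.
structured = np.outer(np.arange(4.0), [1.0, 2.0, 3.0])
unstructured = np.eye(4)

frac_structured = pc1_explained_variance(structured)
frac_unstructured = pc1_explained_variance(unstructured)
```

The gradient image is of rank one after centring, so its first component explains all of the variance, whereas the second image yields one third.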

To select a pragmatic and user-friendly cut-off, an h-index was used as an appropriate heuristic, calculated as the sum of all explained variances divided by the number of original variables (1224), i.e. the mean explained variance. Variables whose explained variance in PC 1 was higher than this h-index of 24.3% were retained and are coloured green in figure 5.4. The expansions in figure 5.2 show that the variables selected with approach A but not retained in approach B mostly have a lower mean intensity. However, because the criterion depends on structure in the image rather than on intensity, this PCA-based approach avoids both removing low-intensity, informative ions and retaining artefactual high-intensity, noisy ions. A similar approach would be to investigate structure in the image using an entropy-based criterion.
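As a small worked illustration of this cut-off, the sketch below computes the mean explained variance and keeps the variables above it. It assumes NumPy; the function name `select_above_mean` and the four percentages are invented for the example and are not values from the data set.

```python
import numpy as np

def select_above_mean(explained):
    """Return (cut-off, mask): the cut-off is the sum of the PC 1 explained
    variances divided by their number; mask marks variables above it."""
    explained = np.asarray(explained, dtype=float)
    cutoff = explained.sum() / explained.size
    return cutoff, explained > cutoff

# Hypothetical PC 1 explained variances (in %) for four variables.
cutoff, keep = select_above_mean([10.0, 20.0, 40.0, 50.0])
```

Here the cut-off is 30% and the last two variables are kept. A variable exactly at the cut-off would be rejected, following the "higher than" wording above.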

Figure 5.4: The variance explained for the first principal component of a PCA-based decomposition of the image for each variable (mean-centred) is plotted. Note that the x-axis corresponds to variables that were sorted with decreasing variance explained, and does not correspond to the m/z values. Variables with lower levels of explained variance contain less biological and anatomical relevance (compare m/z = 873.6 and m/z = 989.6). An h-index of 24.3% was used as a cut-off for variable selection: only variables with a higher percentage of variance explained were selected and coloured green.

The m/z images and values of the deleted variables verified that few informative peaks were removed with approach B: where any signal was found at all, the related main isotopes were still selected, so information regarding the parent ion was not lost. Thus, a potential loss of information is not problematic, since there is considerable redundancy in the data. Even if 10% of the potentially informative peaks were removed, it is unlikely that this would result in the disappearance of characteristic molecular fingerprints. Moreover, there are clear computational and interpretative advantages to smaller and cleaner data.

It should be noted that a very similar variable selection resulted if the images were rotated through 90° (swapping the columns and rows for the PCA, i.e. a matrix transpose). If, for any future sample, anatomical structures are expected to be directional, e.g. more horizontally or vertically oriented, this information should be used when deciding whether to transpose the data matrix before PCA-based variable selection. Columns containing only zeros were removed prior to the PCA calculations. Mean-centring of the data prior to PCA decomposition is necessary as it removes the overall mean, which would otherwise be the main contributor to the variance in the first principal component. Unit variance scaling was not performed, as this would give equal weight to all columns, which could negatively affect the selection of small anatomical features and emphasise noise in the data. A high level of explained variance would also be obtained for m/z images that are high in the surrounding region and low in the sample region. Therefore, the variable selection with approach A has to precede peak selection with approach B.
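The last point can be made concrete with a toy example. The sketch below, which assumes NumPy and uses an invented 6 x 8 image and a hypothetical helper `pc1_fraction`, shows that an image with signal only in a frame around the sample is highly structured in the PCA sense, both as given and after transposition, so approach B alone would not reject it.

```python
import numpy as np

def pc1_fraction(img):
    """Fraction of variance explained by PC 1 after column mean-centring."""
    centred = img - img.mean(axis=0)
    s = np.linalg.svd(centred, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

# Hypothetical matrix-like image: intensity 1 in a frame around the
# "sample" region and 0 inside it.
frame = np.ones((6, 8))
frame[1:-1, 1:-1] = 0.0

frac = pc1_fraction(frame)        # image as given
frac_t = pc1_fraction(frame.T)    # image rotated through 90 degrees (transposed)
```

Both fractions are effectively 1, far above any cut-off of the kind used above, which is why the matrix-related variables need to be removed by approach A first.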

In document Development and Application of Chemometric Methods for Modelling Metabolic Spectral Profiles (Page 112-116)