Non-Negative Matrix Factorisation - Compressed Factorisation

3.2 Compressed Factorisation

3.2.3 Non-Negative Matrix Factorisation

Non-negative matrix factorisation (NNMF) is used to decompose the data into a set of purely positive, additive coefficients and scores. This can be very useful for interpreting a system in which each spectrum can be considered a mixture of a definite number of pure sources that cannot contain negative components[154]. This approach has been proposed for mass spectrometry imaging because the detected spectra have the physical property of being positive[108].

Recall from Chapter 1 that in NNMF the data, X, is decomposed into the form (Equation 1.9)

X ≈ EG (3.1)

where E and G contain the spectral coefficients and pixelwise scores respectively, both of which are constrained to be positive. As spectral compression using basis approximation results in negative values in the compressed data, NNMF cannot be applied directly. It would be preferable to operate on the compressed data in a similar manner to PCA, to reduce the computational burden and take advantage of the dimensionality reduction that has already been achieved. Semi-non-negative matrix factorisation (s-NNMF) has been proposed as a solution for systems where the coefficients contain negative values but are still present as a mixture; it removes the positivity constraint on E whilst maintaining it for G. An algorithm for efficiently solving this problem has been presented[55] and is available online as a MATLAB toolbox, http://cs.uwindsor.ca/~li11112c/nmf.html (v1.3).

Applying s-NNMF to the compressed data A yields coefficients Ē in the BASC basis domain:

A ≈ ĒG (3.2)

Pre-multiplying by Q and substituting Equation 3.1 allows the coefficients to be recovered in the original m/z domain:

QA ≈ QĒG (3.3)

EG ≈ X ≈ QA ≈ QĒG (3.4)

EG ≈ QĒG (3.5)

E ≈ QĒ (3.6)
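The recovery step in Equations 3.2 to 3.6 can be illustrated with a small numerical sketch. The matrices below are hypothetical toy values chosen only so that the shapes agree (4 m/z channels, 2 basis vectors, 2 factors, 3 pixels); they are not taken from the dataset, and the factorisation is made exact so that the two sides can be compared directly.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Hypothetical basis Q (channels x basis vectors).
Q = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [1.0, -1.0]]

# Basis-domain coefficients (basis vectors x factors); negative values allowed.
E_bar = [[0.5, -1.0],
         [2.0, 0.25]]

# Non-negative pixelwise scores (factors x pixels).
G = [[1.0, 0.0, 0.5],
     [0.0, 2.0, 0.5]]

A = matmul(E_bar, G)   # compressed data, Equation 3.2 (exact in this toy case)
E = matmul(Q, E_bar)   # coefficients in the m/z domain, Equation 3.6

lhs = matmul(Q, A)     # QA, the decompressed data
rhs = matmul(E, G)     # EG, the m/z-domain factorisation

# Largest element-wise difference between QA and EG.
max_diff = max(abs(l - r) for lr, rr in zip(lhs, rhs) for l, r in zip(lr, rr))
```

Only the small matrix Ē has to be multiplied by Q; the scores G are shared between the compressed and m/z domains and need no transformation.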

Choosing the matrix rank

In contrast to PCA, NNMF is a factor analysis technique that requires a model of the data, including the number of factors present, to be provided before the algorithm can be applied. Unfortunately, this is rarely available in advance, especially for exploratory datasets. To estimate the data rank for the NNMF problem, the eigenvalues generated by BASC-PCA were examined, looking for the point at which the gradient of the eigenvalue curve (the scree plot) approaches zero. A value of 9 was selected from the scree plot and confirmed by running NNMF with 10 factors, which produced a rank-deficient output. The number of random samplings used in basis approximation is not similarly sensitive, as an overestimate does not penalise the quality of the basis, whereas a poor estimate of the number of components can substantially degrade the quality of factor analysis methods[63].
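The rank in this work was read from the scree plot by eye. As a rough illustration of the same idea, the sketch below returns the number of eigenvalues that precede the flat part of the curve. The function name `scree_rank` and the threshold `tol` are assumptions introduced here, not part of the original procedure, and any automated estimate would still need to be checked, for example by the over-estimation test described above.

```python
def scree_rank(eigenvalues, tol=0.01):
    """Estimate the rank as the number of eigenvalues before the scree
    curve flattens.

    tol is a hypothetical threshold: the curve is treated as flat once the
    drop between successive eigenvalues falls below tol times the largest
    eigenvalue.
    """
    ev = sorted(eigenvalues, reverse=True)
    for i in range(len(ev) - 1):
        if ev[i] - ev[i + 1] < tol * ev[0]:
            # ev[i] already lies on the flat part, so keep the i before it
            # (always at least one component).
            return max(i, 1)
    # No flat region found: every component is retained.
    return len(ev)


# Made-up eigenvalues with a clear elbow after the third component.
example_rank = scree_rank([10.0, 6.0, 3.0, 1.0, 0.95, 0.93])
```

The result depends strongly on `tol`, which is one reason a visual check of the scree plot remains useful.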

Compressed s-NNMF was evaluated through a comparison with NNMF performed directly on the data (using algorithms from the nnmf toolbox version 1.3[131]). It was not possible to apply the NNMF algorithm to the whole fixed rat brain dataset, owing to memory constraints similar to those that prevented the direct application of PCA. To provide a comparison set, spectral rescaling at ∆m/z = 0.2 was again performed. The compressed s-NNMF algorithm can be used on a full dataset without rescaling, but for this evaluation it was also applied to the spectrally rebinned data. Computing NNMF on the reduced dataset (4751 channels) took over half an hour, whilst performing compressed s-NNMF took less than three minutes, including compression and decompression.

Figure 3.3: Comparison of the abundance maps produced by NNMF applied directly to the raw data and following BASC compression. In both cases nine NNMF abundance maps were produced on data re-binned at ∆m/z = 0.2 and then again following BASC compression. The colour on each map is scaled linearly between zero and the maximum value.

Matching Scores for Comparison

A disadvantage of NNMF is that, whilst there is a unique best solution, the iterative solving algorithms used to search for it are very sensitive to the initial conditions and consequently are prone to getting stuck in local minima[55]. This means that the user has no way of knowing whether a given result is a global solution, so repeated initialisations are required to expand the search space, keeping the best fit according to the smallest error metric ||A − ĒG||. Because these replicates are needed, the time saved by reducing the dimensionality of the problem is gained for every replicate run. For all NNMF and compressed s-NNMF experiments in this chapter, seven replicates of the algorithm were applied.
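The replicate strategy can be sketched as a loop that keeps the factorisation with the smallest residual. This is a minimal illustration, not the toolbox code: `factorise` is a hypothetical callable standing in for any s-NNMF solver, the function names are introduced here, and the Frobenius norm is assumed for the error metric, which the text does not specify.

```python
import math
import random


def frobenius_error(A, E_bar, G):
    """Return ||A - E_bar G|| (Frobenius norm, an assumption) for matrices
    stored as lists of rows."""
    total = 0.0
    for a_row, e_row in zip(A, E_bar):
        for j, a in enumerate(a_row):
            approx = sum(e * g_row[j] for e, g_row in zip(e_row, G))
            total += (a - approx) ** 2
    return math.sqrt(total)


def best_of_replicates(factorise, A, rank, n_replicates=7, seed=0):
    """Run a solver from several random starts and keep the best fit.

    factorise(A, rank, rng) is a hypothetical solver interface that must
    return a pair (E_bar, G). Returns (error, E_bar, G) for the replicate
    with the smallest error.
    """
    rng = random.Random(seed)  # one generator, so replicates differ
    best = None
    for _ in range(n_replicates):
        E_bar, G = factorise(A, rank, rng)
        err = frobenius_error(A, E_bar, G)
        if best is None or err < best[0]:
            best = (err, E_bar, G)
    return best
```

The default of seven replicates mirrors the number used in this chapter. Each replicate works on the compressed matrix A, so the cost of one run, and therefore of all runs, scales with the number of basis vectors rather than the number of m/z channels.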

The initialisations are randomised, so there is no defined order in which the maps from NNMF are produced. In order to compare results from different replicates, and from compressed against full datasets, the most similar abundance maps were matched using the following procedure.

Algorithm 3.3: Match factor loadings across runs

Data: arrays of loadings, G1 and G2

Result: vector of matching indices, k

1 Calculate P, the pairwise correlation coefficient between each pair of columns in G1 and G2;

2 Find the highest value in P and extract its row and column indices i, j;

3 Set k_i = j;

4 Match abundance map j to i, clear row i and column j of P (set the values to -1), and repeat from step 2 until every map has been matched.
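A minimal sketch of this greedy matching is given below. It assumes both runs produced the same number of maps and that each map is supplied as a flat list of pixel scores; the function names are introduced here for illustration. One deliberate deviation from the listing: cleared entries are set to negative infinity rather than -1, so that a genuine correlation of exactly -1 cannot tie with a cleared cell.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant map is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def match_factors(G1, G2):
    """Greedily pair each map in G1 with its most similar map in G2.

    G1, G2: lists of abundance maps, one flat list of scores per map.
    Returns k, where k[i] is the index in G2 matched to map i of G1.
    """
    n = len(G1)
    # Step 1: pairwise correlation matrix P.
    P = [[pearson(a, b) for b in G2] for a in G1]
    k = [None] * n
    for _ in range(n):
        # Step 2: locate the largest remaining correlation.
        i, j = max(((r, c) for r in range(n) for c in range(n)),
                   key=lambda rc: P[rc[0]][rc[1]])
        # Step 3: record the match.
        k[i] = j
        # Step 4: clear row i and column j so neither is matched again.
        for c in range(n):
            P[i][c] = float("-inf")
        for r in range(n):
            P[r][j] = float("-inf")
    return k
```

Because the matching is greedy, each map is paired exactly once, but the result is not guaranteed to maximise the total correlation over all pairings.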

The scores from direct-NNMF and compressed s-NNMF with a compression ratio of 0.005 are shown side-by-side in Figure 3.3. Visual inspection of Figure 3.3 reveals that three factors corresponded to on-tissue distributions and the other four showed tissue-edge features and variation in the surrounding matrix. The positive-valued score maps can be interpreted as fractional abundances of the coefficients at each pixel.

A systematic range of compression ratios was trialled, from 0.003 to 0.05 (10-200 basis vectors). Each set of factors was matched to the most similar factor from the NNMF scores shown in Figure 3.3 using Algorithm 3.3. The correlation between the distributions was then calculated and is plotted as a function of the number of samplings in Figure 3.4. There is a trend towards improving results as the number of projections increases (that is, as the data are compressed less heavily), which would be expected as the compression quality improves, but there is a high level of variance at all points. One challenge in interpreting these results is that it is not known whether the factorisation of the full data is actually the best possible result that can be obtained. One observation that can be made is that the variance in the compressed s-NNMF results is more substantial than the compression errors seen for this dataset in Chapter 2. This suggests that variation arising from the initialisation conditions of the factorisation tends to dominate, so it is essential to be able to repeat the factorisation many times to obtain an ‘optimum’ result.

Figure 3.4: The correlation between NNMF abundance maps produced directly from the data and following compression. Typically a correlation of >0.9 is achieved regardless of the level of compression.


Figure 3.5: Simultaneous visualisation of all NNMF abundance maps. The segments of the circles represent the total fraction of the pixels under the circles belonging to each factor. By changing the circle radii, either dataset-wide visualisations can be produced or detail within regions can be seen.

Viewing the data

It is desirable to have a single overview of the factors, as it is difficult to visually assess separate abundance maps, especially as they may be on different colour scales for individual clarity. Recent visualisation work[67] required the embedding of the data onto three dimensions so that these could be directly transformed to red-green-blue channel intensities, which severely restricts the number of factors that can be introduced. The compressed s-NNMF approach (and factorisation methods in general) produces positive fractional abundances. This allows the production of single-view images, such as Figure 3.5, where the image region is overlaid with a grid of ‘pie charts’. Each small circle uses a set of coloured wedges to show the fraction of that pixel attributed to each factor. Using a coarse grid over the whole image area provides an initial overview that clearly delineates the major features of the data. Increasing the density of the map, by zooming in or using a finer grid, allows fine structure to be seen whilst preserving the knowledge from all of the multivariate components. In conclusion, this visualisation presents the quantitative elements of the multivariate analysis and provides the user with an overview of the structure of the data within a single image.
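The wedge sizes for such a grid of pie charts can be computed by summing each factor's scores over the pixels covered by a chart and normalising. The sketch below is an assumption-laden illustration rather than the code used to produce Figure 3.5: it approximates the area under each circle by a square grid cell, and the function name `pie_fractions` and the parameter `cell` (cell side length in pixels) are introduced here.

```python
def pie_fractions(maps, height, width, cell):
    """Aggregate factor abundances into wedge fractions for a pie-chart grid.

    maps:   one 2-D list (height x width) of non-negative scores per factor.
    cell:   side length of a grid cell in pixels (hypothetical parameter);
            a square cell stands in for the area under each circle.
    Returns a dict mapping (cell_row, cell_col) to a list of fractions, one
    per factor, summing to 1 (or all zeros where the cell has no signal).
    """
    fractions = {}
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            rows = range(top, min(top + cell, height))
            cols = range(left, min(left + cell, width))
            # Total score of each factor inside this cell.
            totals = [sum(m[r][c] for r in rows for c in cols) for m in maps]
            s = sum(totals)
            key = (top // cell, left // cell)
            if s > 0:
                fractions[key] = [t / s for t in totals]
            else:
                fractions[key] = [0.0] * len(maps)
    return fractions
```

A large `cell` gives the coarse, dataset-wide overview, while a small `cell` (down to a single pixel) corresponds to the finer grid used to inspect detail within a region.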

In document Information processing for mass spectrometry imaging (Page 90-95)