2.8 Spatial Random Projections
2.8.2 Basis Approximation for Spectral Compression
The motivation for forming an orthonormal basis for the data is that the projection can be inverted simply by transposing the basis, so the reduced data A can be decompressed to form X′ = QA. This is again demonstrated on the fixed rat brain dataset. The size of the dataset prevented it from being loaded into memory as a whole, so Algorithm 2.4 was modified to allow sequential data access:
Algorithm 2.4: Memory efficient implementation of basis approximation
Data: Spectral image, X; integer k
Result: Approximate basis for X, Q
1 Consider a dataset X (m × n) containing n pixels in each of m spectral channels;
2 Generate random vectors v(i), i = 1, ..., k, of length n (with k ≳ rank(X)) by drawing values from a normal distribution with mean zero and standard deviation one, N(0, 1);
3 Form the random projection matrix Ω (n × k) = [v(1) | ... | v(k)];
4 Initialise S (m × k) as a matrix of zeros;
for j = 1 to n do
    5 Load jth spectrum from disc: x = X_j;
    6 Randomly project the spectrum using the jth row of Ω: S_temp = x Ω_j (an m × k outer product);
    7 Update sampling matrix: S_j = S_{j−1} + S_temp;
end
8 Create orthonormal matrix Q (m × k) from S by factorising S = QR, with R an upper triangular matrix;
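The steps of Algorithm 2.4 can be sketched in Python with NumPy as follows. This is a minimal illustration rather than the implementation used for the results reported here: the function name `streaming_basis`, the `load_spectrum` callback (standing in for reading one spectrum from disc) and the toy low-rank data are all assumptions introduced for the example. Each spectrum contributes one rank-one update, so that after the loop S = XΩ without X ever being held in memory as a whole.

```python
import numpy as np

def streaming_basis(load_spectrum, m, n, k, seed=0):
    """Approximate an orthonormal basis Q (m x k) for X (m x n).

    load_spectrum(j) is a hypothetical callback returning the j-th
    pixel spectrum (length m); it stands in for sequential disc access.
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n, k))   # steps 2-3: N(0, 1) projection matrix
    S = np.zeros((m, k))                  # step 4: sampling matrix
    for j in range(n):
        x = load_spectrum(j)              # step 5: one spectrum at a time
        S += np.outer(x, omega[j])        # steps 6-7: accumulate S = X @ omega
    Q, _ = np.linalg.qr(S)                # step 8: reduced QR, Q is m x k
    return Q

# Toy data (assumption): an exactly rank-5 matrix with m = 50 channels, n = 200 pixels.
rng = np.random.default_rng(1)
m, n, r = 50, 200, 5
X = rng.random((m, r)) @ rng.random((r, n))

Q = streaming_basis(lambda j: X[:, j], m, n, k=10)
A = Q.T @ X          # reduced data, k x n
X_rec = Q @ A        # decompressed data, m x n
print(np.abs(X - X_rec).max())
```

In this sketch only Ω (n × k) and S (m × k) are held in memory during the loop. Forming A = Q^T X is done in one step here for brevity; on data too large for memory it could be accumulated column by column in the same sequential fashion.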
Algorithm 2.4 was applied to the rat brain data to generate a randomised basis matrix. The resulting basis matrix Q was used to generate a reduced data matrix through the projection A = Q^T X, which requires one
compression ratio of 0.01, is illustrated in Figure 2.9 and was found to adequately represent the data, giving a PCC of greater than 0.99 and an SNR of 45.
Figure 2.9: Selective decompression of specific spectra and ion images from a compressed MALDI image. A. Channel map formed from raw data at m/z = 782.55 ± 0.05 showing the distribution of a common lipid, PC(34:1) [32]. B. Channel map showing the distribution of PC(34:1) following a single compress-decompress cycle with k = 100. C. Overlay of raw (solid black) and decompressed (red) spectra from a single pixel, showing no substantial deviation following decompression. Enlargements of specific peaks show that peak shape and intensity are maintained regardless of the initial peak m/z or intensity, and the two spectra remain all but indistinguishable.
Qualitatively, this figure shows that the image of the selected ion channel is visually indistinguishable from the raw data following a compression-decompression cycle, and that decompression of individual spectra shows only small differences at the level of the noise. The compression procedure reduced the dataset from the single matrix X with m × n = 129796 × 20535 = 2665360860 elements to a pair of matrices Q and A with k(m + n) = 100 × (129796 + 20535) = 15033100 elements, giving a compression ratio R_c = 0.0056. The raw data is ≈ 20 GB in size; this is reduced to ≈ 115 MB with this method.
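The element counts and compression ratio quoted above can be checked with a few lines of Python. The storage estimate additionally assumes 8-byte floating-point values; this is not stated in the text and is used here only because it is roughly consistent with the quoted sizes.

```python
m, n, k = 129796, 20535, 100          # channels, pixels, basis size (from the text)

raw_elements = m * n                  # elements in X
compressed_elements = k * (m + n)     # elements in Q (m x k) plus A (k x n)
ratio = compressed_elements / raw_elements

# ASSUMPTION: 8 bytes per element (double precision); not stated in the text.
bytes_per_element = 8
raw_gib = raw_elements * bytes_per_element / 2**30
compressed_mib = compressed_elements * bytes_per_element / 2**20

print(raw_elements)          # 2665360860
print(compressed_elements)   # 15033100
print(round(ratio, 4))       # 0.0056
print(raw_gib, compressed_mib)
```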
One feature of this compression scheme is that selective decompression can be performed, so decompressing the whole image is avoided when recovering individual spectra or intensity maps (see Figure 2.9 for a visual schematic). Rows in the basis matrix Q correspond to individual spectral channels, whilst columns in the spatial (abundance) matrix A correspond to individual pixels. The decompression of a particular single-pixel spectrum, i.e. the jth column of the original data, X_j, can be achieved by selecting the jth column from the abundance matrix and multiplying it by the basis: X_j = Q A_j. This is advantageous for minimising the memory that needs to be allocated for any operation. The decompression of an intensity map from a particular spectral channel, i.e. the ith row of the original data, X_i, can be achieved by multiplying the appropriate row of the basis by the whole abundance matrix: X_i = Q_i A. For multi-channel images, the corresponding basis rows are summed before multiplication. An image is presented by reshaping the resulting list of pixel intensities back to the image dimensions.
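The two selective decompression operations described above can be sketched as follows. The function names, the row-major pixel ordering used when reshaping, and the toy data are assumptions made for illustration only. The toy basis is square (k = m) purely so that the reconstruction is exact and the result can be checked against the raw data; in practice k is much smaller than m.

```python
import numpy as np

def decompress_spectrum(Q, A, j):
    """Spectrum of pixel j: X_j = Q A_j (a vector of length m)."""
    return Q @ A[:, j]

def decompress_channel_image(Q, A, channels, image_shape):
    """Intensity map for one or more channels: sum the basis rows, then multiply by A."""
    q = Q[np.atleast_1d(channels)].sum(axis=0)   # summed basis rows, length k
    # ASSUMPTION: pixels are stored in row-major order.
    return (q @ A).reshape(image_shape)

# Toy data (assumption): m = 6 channels, a 3 x 4 pixel image (n = 12).
rng = np.random.default_rng(0)
m, image_shape = 6, (3, 4)
n = image_shape[0] * image_shape[1]
X = rng.random((m, n))
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))   # square orthonormal basis
A = Q.T @ X

spectrum = decompress_spectrum(Q, A, j=3)
image = decompress_channel_image(Q, A, [1, 2], image_shape)
```

Neither function forms the full product QA: recovering a spectrum touches one column of A, and recovering an image touches only the selected rows of Q.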
To quantify the quality of compression, two metrics were calculated: the SNR of the spectra and the PCC between the raw and decompressed data. This was done for a range of values of k in order to investigate the trade-off between data size and compression quality. The results are shown in Figure 2.10. The SNR and PCC are both positively correlated with the compression ratio, indicating that taking a larger value of k does increase the data quality, but the PCC increases rapidly at first and flattens out very quickly. The SNR continues to increase, but reaches acceptable values at low compression ratios. In optical hyperspectral imaging, SNR values above 30 for lossy compression are typically considered good, and 50 and above excellent [61, 71, 208]. An SNR of 30 is achieved for a compression ratio of < 0.002, corresponding to k > 35 on this dataset. For the example in Figure 2.10 with k = 100, the SNR is ≈ 43 and the PCC is > 0.99. This suggests that the information lost from the data is at the level of the noise. This curve provides one practical method to estimate a suitable value of k. An alternative starting point is to use the result of the JL lemma (Equation 2.2) and take k = 8 log(n)/ε^2 (where n is the smallest dimension of the data and 0 < ε < 1). However, this approach is known to substantially overestimate the number of projections [17], e.g. for ε = 0.05, k > 30000.
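The two routes to choosing k discussed above can be sketched in Python. The definitions here are assumptions, since the exact formulas are not given in this section: the SNR is taken as the ratio of signal power to reconstruction-error power in decibels, as is common for lossy hyperspectral compression, and the PCC is computed over all elements of the data at once. The function names and the synthetic low-rank-plus-noise data are likewise illustrative only.

```python
import math
import numpy as np

def pcc(raw, rec):
    """Pearson correlation between raw and decompressed data (all elements)."""
    return np.corrcoef(raw.ravel(), rec.ravel())[0, 1]

def snr_db(raw, rec):
    """ASSUMED definition: 10 log10(signal power / error power)."""
    error = raw - rec
    return 10.0 * math.log10(np.sum(raw ** 2) / np.sum(error ** 2))

def jl_k(n, eps):
    """Number of projections suggested by k = 8 log(n) / eps^2."""
    return math.ceil(8.0 * math.log(n) / eps ** 2)

# Synthetic data (assumption): rank-5 signal plus a small amount of noise.
rng = np.random.default_rng(0)
m, n, r = 60, 300, 5
X = rng.random((m, r)) @ rng.random((r, n)) + 0.01 * rng.standard_normal((m, n))

results = {}
for k in (2, 5, 20):
    Q, _ = np.linalg.qr(X @ rng.standard_normal((n, k)))   # randomised basis
    X_rec = Q @ (Q.T @ X)                                  # compress, then decompress
    results[k] = (pcc(X, X_rec), snr_db(X, X_rec))
    print(k, results[k])

# JL estimate for the smallest data dimension quoted in the text.
print(jl_k(20535, 0.05))
```

Sweeping k in this way gives an empirical quality curve of the kind shown in Figure 2.10, whereas `jl_k` depends only on n and ε and, for ε = 0.05, returns a value above 30000.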
Comparisons to other mass spectrometry data reduction schemes are difficult, as they tend to be based on peak-picking procedures which can be tuned to pick an appropriate number of peaks to compress the data to the size dictated by the computer's memory. Measures of compression quality for the peak-picking methods are not known, but it is commonly accepted that the process of re-binning and peak picking does cause information loss, and most efforts have focussed on not discarding 'informative' peaks [66]. Extracting 50-200 peaks has been suggested as appropriate for further analysis such as segmentation [2], which corresponds to a sample-to-feature ratio of around 10 for image sizes typically collected from MALDI time-of-flight experiments. For high-resolution instruments this may require discarding the majority of detectable peaks. The main advantage of the randomised methods is that the dimensionality of the data can be reduced to a similar level as that obtained by peak-picking methods, but no part of the data is removed completely. Further, the dimensionality reduction obtained via these means is reversible. The discrete wavelet transform, a more comparable method, has been employed in an attempt to preserve the spectral integrity of the data [215]. This was shown to reduce the dimensionality from 6490 to 819, but details of total data size and metrics of compression quality were not presented. On this limited basis, randomised basis approximation appears able to achieve a superior compression ratio to wavelet-based methods whilst maintaining the quality of the data.

Figure 2.10: Quantitative metrics for evaluating the quality of compression of a MALDI MSI dataset. Pearson product-moment correlation coefficient (solid blue, left axis) and signal-to-noise ratio (dashed red, right axis). As the PCC tends to one, high-quality signal recovery is achieved.
Image Magnitude Recovery
Like the direct random projection, the basis approximation projection preserves the l2 norm but not the l1 norm (see Figure 2.11). However, the ability to back-project to the original data means that the l1 norm is not entirely lost through the process. It can be recovered, without requiring the full dataset to be back-projected, by taking the sum of each basis vector and back-projecting:
\ell_1 \approx \left( \sum_{i=1}^{m} Q_i \right) \cdot A \qquad (2.11)
where Q_i is the ith row of the basis matrix Q.
Taking a similar approach with direct random projection (see Figure 2.12) does not yield such results. The ability to recover the l1 norm is particularly useful, as the TIC is commonly used for image normalisation.
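The recovery in Equation 2.11 can be illustrated with a short NumPy sketch. The function name `tic_from_compressed` and the toy data are assumptions. The example relies on the intensities being non-negative, so that the per-pixel sum of intensities equals the l1 norm, and uses an exactly low-rank matrix so that the back-projection is exact; on real data the agreement would only be approximate.

```python
import numpy as np

def tic_from_compressed(Q, A):
    """Per-pixel summed intensity, (sum_i Q_i) . A, without forming Q @ A."""
    return Q.sum(axis=0) @ A      # (k,) @ (k, n) -> (n,)

# Toy data (assumption): non-negative, exactly rank-4 matrix; m = 40, n = 100.
rng = np.random.default_rng(0)
m, n, r, k = 40, 100, 4, 8
X = rng.random((m, r)) @ rng.random((r, n))
Q, _ = np.linalg.qr(X @ rng.standard_normal((n, k)))   # randomised basis
A = Q.T @ X

tic_raw = X.sum(axis=0)                  # l1 norm per pixel (X is non-negative)
tic_rec = tic_from_compressed(Q, A)      # recovered via the summed basis vectors
l2_raw = np.linalg.norm(X, axis=0)
l2_proj = np.linalg.norm(A, axis=0)      # l2 norm computed on the projections
l1_proj = np.abs(A).sum(axis=0)          # l1 norm computed on the projections

print(np.abs(tic_rec - tic_raw).max())
print(np.abs(l2_proj - l2_raw).max())
print(np.abs(l1_proj - tic_raw).max())
```

In this toy case the l2 norm can be read directly from the columns of A, while the l1 norm of those columns does not match that of the data and has to be recovered through the summed basis vectors instead.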
Figure 2.11: Recovery of data norms from data compressed with BASC. It was not possible to recover the l1 norm (as calculated directly on the data) from the BASC projections, but the l2 norm can be calculated directly from them.