4.4 Pipelines Compared on Real Data
Name                          Memory Efficient   Disk Passes   Memory Passes   Time (relative to ‘none’)†
None                          Y                  2             0               1
Re-Binning 0.2                Y                  2             0               0.14
Standard Deviation Spectrum   Y                  2             3               0.20
Multiple Summary Spectra      Y                  2             3               0.22
Frequent Peaks                Y                  1.15          3               0.15
Spatial Correlation           N                  1*            3               0.09
Table 4.1: Pass efficiency analysis of the pipelines plus dimensionality reduction, as implemented within this work. Numbers shown are for processing starting from raw data. † Timings are approximate and no code optimisation was attempted. * A first stage of feature selection (re-binning at 0.2 m/z) was specified for this method so that the data were small enough to fit into memory; if this cannot be achieved, this method becomes substantially less pass-efficient.
This section will examine the application of each of these pipelines to a real-world MALDI image. It will compare the output from the feature extraction in terms of the number and nature of features retained and the final ‘molecular histology’ results obtained by segmentation. There is no particular criterion for the number of measurements retained, but for effective calculation of distances a rule of thumb is that a sample-to-measurement ratio of 10 is not exceeded [195]; comments from other authors suggest that 100-200 measurements should be retained [2]. More important is that the measurements kept are discriminatory between tissue regions so that histological differences are revealed. Using a real Matrix Assisted Laser Desorption Ionisation (MALDI) dataset comes with drawbacks other than the data size, most notably the lack of a ground truth against which to compare any results, which leads to a narrative conclusion. Fortunately, as BASC can be applied to the data without prior reduction, a comparison can now be made against a baseline of no processing, giving an absolute analysis of the effect of each pipeline. Several types of efficiency can be considered, but for day-to-day use of mass spectrometry the total time is the largest concern [2]. As a major bottleneck for imaging data can be disk load time, the pass efficiency (the number of times a dataset must be read) can have a substantial impact on the total time.
The rat brain dataset introduced in Chapter 2 is re-used here (for a complete description and schematic see Figure 2.1).
4.4.1 Efficiency
All of the pipelines except the spatial correlation method were implemented in a memory-efficient manner so that only a single spectrum was required in memory at a time. Most of the literature methods have a stated aim or requirement of making the data small enough to fit into memory [66, 148]. For unsupervised trend extraction using PCA, multiple copies of the data are required in memory simultaneously (up to 4, depending on the algorithm used [174]), meaning that simply making the data smaller than the available memory may not be sufficient to enable further processing. Another requirement for further processing is that the measurements from each pixel are consistent. In a usual mass spectral storage scheme there may be differences in the exact m/z bins measured and, as the data can be sparse, often only a reduced set of m/z-intensity pairs is recorded. All of the pipelines produced an internally consistent set of measurements in which each pixel has a value for each measurement.
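As an illustration of how sparse m/z-intensity pairs can be placed on a common axis while holding only one spectrum in memory, the following is a minimal Python sketch. It is not the MATLAB implementation used in this work: the function names, the summing of intensities within a bin, and the toy m/z range are all assumptions made for the example; only the 0.2 m/z bin width is taken from the text.

```python
import numpy as np

def rebin_spectrum(mz, intensity, mz_min, mz_max, bin_width=0.2):
    """Sum the intensities of one sparse spectrum into fixed-width m/z bins.

    Summing within a bin is an assumption of this sketch; other rules
    (maximum, mean) would give an equally consistent set of measurements.
    """
    # Small tolerance guards against floating point noise in the division.
    n_bins = int(np.ceil((mz_max - mz_min) / bin_width - 1e-9))
    edges = mz_min + bin_width * np.arange(n_bins + 1)
    binned, _ = np.histogram(mz, bins=edges, weights=intensity)
    return binned

def rebin_stream(spectra, mz_min, mz_max, bin_width=0.2):
    """Yield re-binned spectra one at a time, so only one is held in memory."""
    for mz, intensity in spectra:
        yield rebin_spectrum(np.asarray(mz, dtype=float),
                             np.asarray(intensity, dtype=float),
                             mz_min, mz_max, bin_width)

# Two toy pixels that record different m/z values (hypothetical data).
toy_spectra = [
    ([100.05, 100.15, 100.90], [1.0, 2.0, 4.0]),
    ([100.45], [3.0]),
]
rows = list(rebin_stream(toy_spectra, mz_min=100.0, mz_max=101.0))
```

Each element of `rows` has the same length regardless of how many pairs the original pixel stored, which is the consistency property described above.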
4.4.2 Timings
Time is important in the processing of MSI, and in general a processing time of less than the image acquisition time is desirable [2]. Disk access is the most significant time factor for all of the feature selection methods tested, so the most pertinent measure is the number of data passes required. All of the algorithms were implemented to be as pass-efficient as possible: at any point where the data could be stored in and accessed from memory, they were. The data storage used (imzML, processed pairs) only permits easy access to individual spectra; some data storage frameworks allow rapid access to whole ion images [203], but these are not commonly supported. As the code is implemented in MATLAB and is not optimised, it is not appropriate to make substantial commentary on the time taken, but some relative times are provided for guidance.
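The distinction between disk passes and memory passes can be made concrete with a small Python sketch. Everything here is hypothetical: `PassCountingSource` stands in for a spectrum-by-spectrum file reader, and the selection rule (keep the `k` channels with the largest summed intensity) is chosen only for brevity and is not one of the pipelines compared in this chapter.

```python
import numpy as np

class PassCountingSource:
    """Wrap a re-iterable collection of spectra and count full read passes."""

    def __init__(self, spectra):
        self._spectra = spectra
        self.disk_passes = 0

    def __iter__(self):
        # Each new iteration models one complete read of the dataset.
        self.disk_passes += 1
        for spectrum in self._spectra:
            yield spectrum

def two_pass_reduce(source, k):
    """Select k channels in a first pass, then extract them in a second pass."""
    total = None
    for spectrum in source:  # disk pass 1: accumulate a summary spectrum
        s = np.asarray(spectrum, dtype=float)
        total = s.copy() if total is None else total + s
    keep = np.sort(np.argsort(total)[-k:])  # indices of the k largest channels
    # Disk pass 2: build the reduced data matrix, one spectrum at a time.
    reduced = np.vstack([np.asarray(s, dtype=float)[keep] for s in source])
    return keep, reduced

toy = [[0.0, 5.0, 1.0, 2.0],
       [0.0, 4.0, 0.0, 3.0],
       [1.0, 6.0, 0.0, 2.0]]
source = PassCountingSource(toy)
keep, reduced = two_pass_reduce(source, k=2)
```

Once `reduced` exists, any further processing iterates over it in memory and does not increase `source.disk_passes`, which is the sense in which memory passes are cheaper than disk passes.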
It is clear from Table 4.2 that the disk load time is not the only consideration for timing. Comparing the time for no processing with that for re-binning, it is perhaps surprising that doing extra processing can reduce the total dimensionality reduction time. With this non-optimised code, however, the multiplication required for the BASC sampling step creates a large matrix in memory for every spectrum (of size m × k); re-binning reduces the number of spectral channels from 129796 to 9500, which is a 93% reduction in sampling matrix size. During the evaluation it was discovered that it was possible to store the re-binned data in memory. Applying a BASC implementation to data in memory substantially reduced the computation time (to seconds, once the data are in memory), as a second data read could be avoided.
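The quoted reduction can be checked with a few lines of Python. The channel counts are taken from the text; the value of `k` is a hypothetical placeholder, and it cancels out of the ratio because both matrices share the same second dimension.

```python
m_full = 129796     # spectral channels before re-binning (from the text)
m_rebinned = 9500   # spectral channels after re-binning (from the text)
k = 100             # hypothetical second dimension; it cancels in the ratio

def sampling_matrix_elements(m, k):
    """Number of elements in an m x k sampling matrix."""
    return m * k

reduction = 1 - (sampling_matrix_elements(m_rebinned, k)
                 / sampling_matrix_elements(m_full, k))
print(f"Reduction in sampling matrix size: {reduction:.1%}")
```

The result is approximately 92.7%, which rounds to the 93% stated above.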
Memory-efficient coding is not necessarily the fastest. For example, in the re-binning pipeline each spectrum is loaded and re-binned independently, then the sampling matrix is formed spectrum-by-spectrum. Performing the re-binning as a separate stage and storing the result in memory allows faster matrix operations to be used for the basis construction and compression. This produced a final time of 0.06 (relative to Pipeline:none), including building the datacube. There is clearly a trade-off between increased disk access and time, but the disadvantage of algorithms that require data to be accessible in memory is that they will