CHAPTER 2 DEVELOPMENT OF THE TOOLBOX
2.1 Standard Multivariate Pattern Analysis
2.1.2 Timecourse data preparation
2.1.2.3 Caching
The classification analyses implemented in this toolbox are typically applied multiple times to the same data, varying in the set of selected voxels used. Rather than re-loading and
processing the time-course data each time, which would impose a significant overhead, the time-course data are cached.
The caching is handled by the CachedPreprocessedData class, which wraps the standard PreprocessedData classes, as shown in Figure 2.9. The
CachedPreprocessedData class maintains a cache of the timecourse data for both training and testing sets, along with a set of voxels to which the cached data relates. All requests to instances of the CachedPreprocessedData class for time-course data are
first checked to see if the requested data is already in the cache; the only data loaded is that which relates to voxels not already present in the cache, as shown in the following algorithm:
Listing 2.8: Pseudo code algorithm for accessing timecourse data via a CachedPreprocessedData object.
If (the cache has no data for one or more requested voxels),
    Use the PreprocessedData object to load/preprocess missing data.
    Add the data to the cache.
End If
By caching at this stage, after the sample estimation has been performed, the need to reapply the sample estimation is also avoided and the quantity of data to be cached is reduced. A further benefit arises from the fact that the toolbox primarily relies on leave-one-run-out cross-validation. Since each run, and thus each cross-validation fold, is sample estimated independently, caching at this stage reduces the generation of training and testing sets to the concatenation of the already prepared training runs. Since the decision of which voxels to cache data for depends on the feature selection method used, the CachedPreprocessedData class focuses solely on storing and sourcing the data, relying on the FeatureSelection classes to specify the required voxels.
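The lookup in Listing 2.8 can be illustrated with a minimal Python sketch. This is not the toolbox's implementation: the class name CachedData, the fake_loader function and the load_calls counter are hypothetical names introduced here, and the loader simply fabricates placeholder values in place of real loading and sample estimation.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical loader type: maps a list of voxel ids to their time courses.
Loader = Callable[[List[int]], Dict[int, List[float]]]


class CachedData:
    """Illustrative stand-in for a caching wrapper (not the toolbox's class)."""

    def __init__(self, loader: Loader) -> None:
        self._loader = loader                      # wrapped, uncached data source
        self._cache: Dict[int, List[float]] = {}   # voxel id -> cached data
        self.load_calls = 0                        # trips made to the wrapped source

    def get(self, voxels: Iterable[int]) -> Dict[int, List[float]]:
        requested = list(voxels)
        # Only voxels absent from the cache need to be loaded.
        missing = [v for v in requested if v not in self._cache]
        if missing:
            self._cache.update(self._loader(missing))
            self.load_calls += 1
        return {v: self._cache[v] for v in requested}


def fake_loader(voxels: List[int]) -> Dict[int, List[float]]:
    # Placeholder for loading and sample-estimating real time-course data.
    return {v: [float(v), float(v) * 2.0] for v in voxels}


cache = CachedData(fake_loader)
first = cache.get([1, 2])    # both voxels are loaded
second = cache.get([2, 3])   # only voxel 3 is loaded
third = cache.get([1, 3])    # served entirely from the cache
```

In this sketch the third request never reaches the wrapped source, which is the behaviour the caching step is intended to provide when the same voxels are selected repeatedly.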
2.1.2.4 Cross-validation
In the typical fMRI study there is a limited amount of data available due to the cost and difficulty of collection. To avoid the need to dedicate a relatively large proportion of the data to testing, it is common to employ cross-validation. By holding out a relatively small portion of the data, but performing multiple analyses while rotating which portion is held out, it is possible to generate a good measure of the performance of the classification algorithm while retaining a large training set for building the models.
When dealing with fMRI data the most commonly used method is leave-one-run-out cross-validation. The leave-one-run-out cross-validation method is derived from k-fold cross-validation, in which the data set is divided into k discrete folds. Each of these k folds is used as the test set in turn, with the remaining folds being combined to form the training set. The leave-one-run-out method makes use of each run of the fMRI data as a single fold, since the runs are already separated into independent sets of samples. This method is further helped by the fact that the design of the experiment can be tailored to ensure certain properties which are beneficial to the cross-validation. As discussed by Kohavi (1995), a k-fold cross-validation will benefit from having approximately 10 folds, particularly if the folds are stratified. By designing experiments such that the runs number in the region of 8-10, and contain an equal number of presentations of the stimuli from each class, these properties can be provided.
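The fold structure of leave-one-run-out cross-validation can be sketched as follows. This is a minimal Python illustration, assuming runs are identified by consecutive integers starting at zero; the function name leave_one_run_out is hypothetical and is not taken from the toolbox.

```python
from typing import Iterator, List, Tuple


def leave_one_run_out(n_runs: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (training_runs, test_run) pairs, one fold per run."""
    for test_run in range(n_runs):
        # Every run other than the held-out one contributes to training.
        train_runs = [r for r in range(n_runs) if r != test_run]
        yield train_runs, test_run


# Example with four runs (an assumed number, chosen only to keep it short).
folds = list(leave_one_run_out(4))
for train_runs, test_run in folds:
    print(f"test run {test_run}, training runs {train_runs}")
```

Each run serves as the test set exactly once, so the number of folds equals the number of runs.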
Naturally there are some situations in which providing 8 runs can be problematic, such as cases where one or more runs have been excluded due to excessive noise or head motion during the scan. Ensuring an equal number of stimuli from each class may also not be possible, for example if the categories being classified are the participant's responses to stimuli. In situations such as these it may be desirable to use other cross-validation methods which can themselves ensure greater numbers of balanced folds, such as leave-p-out cross-validation; however, using such methods is by no means as simple as using leave-one-run-out cross-validation. The overlapping nature of the haemodynamic response to stimuli means that adjacent trials are not independent; while this can be accounted for when said trials belong to the same fold, it can violate the requirement of independence of training and testing sets if adjacent trials are placed in separate folds. Further discussion on the application of alternative methods of cross-validation is provided in Chapter 3.
The cross-validation step is implemented in the toolbox as part of the DataSource class, and is performed following the sample estimation. The independent nature of the time-course runs allows pre-processing up to the sample estimation stage to be performed separately on each run. By dealing with the runs independently up to this stage, the data can be cached rather than needing to perform sample estimation separately for each cross-validation step. After this, the cross-validation process merely comprises the concatenation of the training folds.
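The concatenation step can be sketched in Python under some simplifying assumptions: each run's samples have already been estimated once and are held in a dictionary keyed by run number, and a sample is a (feature vector, class label) pair. The names build_fold, Sample and runs are hypothetical, and the values are made up purely for illustration.

```python
from typing import Dict, List, Tuple

Sample = Tuple[List[float], str]  # (feature vector, class label)


def build_fold(
    runs: Dict[int, List[Sample]], test_run: int
) -> Tuple[List[Sample], List[Sample]]:
    """Form training and test sets from per-run samples prepared once."""
    train: List[Sample] = []
    for run_id in sorted(runs):
        if run_id != test_run:
            train.extend(runs[run_id])  # concatenation only, no re-estimation
    return train, list(runs[test_run])


# Made-up per-run samples standing in for cached, sample-estimated data.
runs: Dict[int, List[Sample]] = {
    0: [([0.1], "A"), ([0.2], "B")],
    1: [([0.3], "A"), ([0.4], "B")],
    2: [([0.5], "A"), ([0.6], "B")],
}

train_set, test_set = build_fold(runs, test_run=1)
```

Because the per-run samples are reused unchanged across folds, building each fold costs only a list concatenation in this sketch.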