Chapter 3 : General Methodology
3.3 fMRI data Analysis
3.3.4 Classification
Classification, or decoding, is a mathematical procedure that uses the responses of multiple voxels to predict a stimulus category. In the context of fMRI, features may represent a group of voxels, examples may represent trials in the experimental run, and classes may represent the type of stimulus. A classifier estimates a decision function that takes the feature values (independent variables) of an example and predicts the class (dependent variable) of that example. To obtain and assess the decision function, the fMRI data must be partitioned into a training set and a testing set. The classifier is trained on the training set; the training phase maps the features to the class label by assigning a weight to each feature. If more than two classes are present in the experimental design, the analysis can be transformed into a combination of multiple two-class problems (i.e., each class versus all the others), and a voting scheme is then used to predict the winning class (Haxby et al., 2011; Misaki et al., 2010; Reddy et al., 2009). The classifier is then evaluated using the testing set to determine its performance in discriminating new responses.
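The "each class versus all the others" scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the voting rule used here (the class whose binary classifier returns the largest decision value wins) is one common choice and is not necessarily the scheme used in the cited studies, and the category names, weights and the helper `one_vs_rest_predict` are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length feature vectors."""
    return sum(wj * xj for wj, xj in zip(w, x))


def one_vs_rest_predict(x, classifiers):
    """Pick the class whose 'class versus rest' classifier scores highest.

    classifiers maps a class label to a (weights, bias) pair that is assumed
    to have been learned beforehand on the training set.
    """
    scores = {label: dot(w, x) + b for label, (w, b) in classifiers.items()}
    winner = max(scores, key=scores.get)
    return winner, scores


# Hypothetical weights for three stimulus categories over two voxels.
classifiers = {
    "faces": ([1.0, -1.0], 0.0),
    "houses": ([-1.0, 1.0], 0.0),
    "objects": ([0.5, 0.5], -1.0),
}

# One test example (two voxel values) that the classifiers have not seen.
label, scores = one_vs_rest_predict([2.0, 0.5], classifiers)
print(label, scores)
```

Each binary classifier only has to separate its own class from the rest; the multi-class decision comes entirely from comparing their outputs.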
Prior to training classifiers, fMRI data need to be transformed into examples. There are different ways of producing these examples, such as using the raw data at a single TR, averaging raw data across multiple TRs of one task block, or using GLM estimated parameters (β-weights or t-values) for a given experimental condition (Pereira et al., 2008).
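As an illustration of the second option, the sketch below averages the raw signal over the TRs of each task block to obtain one example per block. The toy time series, the block definitions and the helper `block_average` are hypothetical and serve only to show the shape of the resulting data (one feature vector and one label per block).

```python
def block_average(timeseries, block_trs):
    """Average each voxel over the TRs of one block, giving one example."""
    n_voxels = len(timeseries[0])
    return [
        sum(timeseries[t][v] for t in block_trs) / len(block_trs)
        for v in range(n_voxels)
    ]


# Toy data: 6 TRs (rows) by 3 voxels (columns).
bold = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [0.0, 4.0, 8.0],
    [2.0, 4.0, 6.0],
    [4.0, 4.0, 4.0],
]

# Each block lists the TR indices it covers and its condition label.
blocks = [([0, 1, 2], "A"), ([3, 4, 5], "B")]

examples = [block_average(bold, trs) for trs, _ in blocks]
labels = [condition for _, condition in blocks]
print(examples, labels)
```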
Classifiers can use different algorithms to find the optimal decision boundary, such as Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), Gaussian Naïve Bayes (GNB) and Artificial Neural Networks (ANN). Misaki et al. (2010) compared the classification performance of six different classifiers (pattern-correlation classifier, k-nearest neighbours, LDA, GNB, and linear and nonlinear SVM), and found that linear SVM classifiers performed best, both in dealing with large high-dimensional datasets and in their flexibility in decoding diverse sources of brain data.
In the simplest linear form of SVMs for two classes, the goal is to estimate a decision boundary (a hyperplane) that separates, with maximum margin, a set of positive examples from a set of negative examples (figure 3.9). Each example is an input vector xi (i = 1, ..., N) having M features (i.e., xi ∈ ℝ^M), and is associated with one of two classes, yi = −1 or +1. For example, in fMRI research, the data vectors xi contain BOLD values at discrete time points (or averages of time points) during the experiment, and the features could be a set of voxels extracted at each time point; y = −1 indicates condition A, and y = +1 indicates condition B.
Figure 3.9. Two-dimensional illustration of the decision boundary of the linear SVM. (A) The hard margin on linearly separable examples, where no training errors are permitted. (B) The soft margin, where two training errors are introduced so that the data are no longer linearly separable. Dotted examples are the support vectors.
If we assume that data are linearly separable, meaning that we can draw a line on a graph of the feature x(1) versus the feature x(2) separating the two classes when M = 2 and a hyperplane on graphs of x(1), x(2), ... , x(M) when M > 2, the SVM produces the discriminant function f with the largest possible margin:
$$ f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \tag{3.4} $$
Here, w is the normal weight vector of the separating hyperplane, b is referred to as the bias, which translates the hyperplane away from the origin of the feature space, and (·) is the inner product:
$$ \mathbf{w} \cdot \mathbf{x} = \sum_{j=1}^{M} w^{(j)} x^{(j)} \tag{3.5} $$
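Equations 3.4 and 3.5 can be illustrated with a minimal Python sketch. The weight vector, bias and example values below are hypothetical, and assigning a decision value of exactly zero to the class +1 is an arbitrary convention adopted here, not something specified in the text.

```python
def inner_product(w, x):
    """Equation 3.5: sum over the M features of w(j) * x(j)."""
    return sum(wj * xj for wj, xj in zip(w, x))


def decision_function(w, x, b):
    """Equation 3.4: f(x) = w . x + b."""
    return inner_product(w, x) + b


def predict(w, x, b):
    """Map the sign of f(x) to a class label (+1 or -1).

    A value of exactly 0 is assigned to +1 here (an assumption).
    """
    return 1 if decision_function(w, x, b) >= 0 else -1


# Hypothetical weights for M = 2 voxels.
w = [2.0, -1.0]
b = -0.5

f_a = decision_function(w, [1.0, 0.5], b)  # 2.0 - 0.5 - 0.5
f_b = decision_function(w, [0.0, 1.0], b)  # 0.0 - 1.0 - 0.5
print(f_a, predict(w, [1.0, 0.5], b))
print(f_b, predict(w, [0.0, 1.0], b))
```

With the labelling used above, the first example would be assigned to condition B (y = +1) and the second to condition A (y = −1).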
The SVM attempts to find the optimal hyperplane w · x + b = 0 that maximises the margin 2/||w||; that is, it finds w and b by solving the following primal optimisation problem:
$$ \min_{\mathbf{w},\,b} \; \frac{1}{2}\|\mathbf{w}\|^{2} \quad \text{subject to } y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \geq 1, \;\; \forall i \in \{1, \ldots, N\} \tag{3.6} $$
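The link between the margin 2/||w|| and the objective in equation 3.6 can be made explicit with a short derivation. The sketch below is the standard geometric argument rather than a derivation taken from the cited sources; the symbols x₊ and x₋ are introduced here only for illustration and denote examples lying exactly on the two margin boundaries.

```latex
\begin{align*}
\text{distance from } \mathbf{x} \text{ to the hyperplane}
  &= \frac{|\mathbf{w} \cdot \mathbf{x} + b|}{\|\mathbf{w}\|} \\
\mathbf{w} \cdot \mathbf{x}_{+} + b = +1, \quad
\mathbf{w} \cdot \mathbf{x}_{-} + b = -1
  &\;\Rightarrow\; \text{each lies at distance } \frac{1}{\|\mathbf{w}\|} \\
\text{margin}
  &= \frac{1}{\|\mathbf{w}\|} + \frac{1}{\|\mathbf{w}\|}
   = \frac{2}{\|\mathbf{w}\|} \\
\arg\max_{\mathbf{w},\,b} \frac{2}{\|\mathbf{w}\|}
  &= \arg\min_{\mathbf{w},\,b} \frac{1}{2}\|\mathbf{w}\|^{2}
\end{align*}
```

The last line holds under the same constraints on both sides: making 2/||w|| as large as possible is the same as making ||w|| as small as possible, and squaring and halving ||w|| does not change where the minimum lies while giving a smooth quadratic objective.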
For linearly separable data (figure 3.9, A), the SVM produces a discriminant function with the largest possible margin, and since the decision line separates the two classes without error, it is referred to as the hard margin SVM. It can be shown that finding the maximal margin corresponds to solving an optimisation problem, which involves minimising the term ½||w||² under the constraint that all exemplars are classified correctly. The term ||w|| refers to the norm or length of a vector, which is obtained as follows:
$$ \|\mathbf{w}\| = \sqrt{\mathbf{w} \cdot \mathbf{w}} \tag{3.7} $$
However, in practice, data are often not linearly separable. The generalised SVM, which is able to handle non-separable data by allowing errors, is referred to as the soft margin SVM (figure 3.9, B). It involves solving the following problem:
$$ \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^{2} + C \sum_{i=1}^{N} \xi_i \quad \text{subject to } y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i \geq 0, \;\; \xi_i \geq 0, \;\; \forall i \in \{1, \ldots, N\} \tag{3.8} $$
The new term on the right side sums the (potential) errors produced for exemplars for a given weight vector. The ξi values are called slack variables; a slack variable is 0 in the case
of no error. If a pattern falls within the margin on the correct side, or on the other side of the decision boundary (misclassification), the slack variable for this pattern is positive. The cost or penalty constant C > 0 is very important, since it sets the relative importance of maximising the margin and, thus, generalisation performance (small C value) and minimising the number of classification errors (large C value). The latter case forces the slack variables ξi to be smaller, approximating the behaviour of the hard margin (Mahmoudi et al., 2012).
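The role of the slack variables and of C in equation 3.8 can be illustrated numerically. The sketch below does not train an SVM; it assumes a fixed, hypothetical w and b and a toy set of four examples, and it uses the fact that, for fixed w and b, the smallest slack satisfying the constraints is ξi = max(0, 1 − yi (xi · w + b)).

```python
def decision_function(w, x, b):
    """f(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def slack(w, b, x, y):
    """Smallest slack satisfying y * f(x) - 1 + slack >= 0 and slack >= 0."""
    return max(0.0, 1.0 - y * decision_function(w, x, b))


def soft_margin_objective(w, b, X, Y, C):
    """Equation 3.8 objective: 0.5 * ||w||^2 + C * sum of slack variables."""
    norm_sq = sum(wj * wj for wj in w)
    return 0.5 * norm_sq + C * sum(slack(w, b, x, y) for x, y in zip(X, Y))


# Hypothetical hyperplane and toy examples (two features each).
w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [0.5, 1.0], [-0.5, 0.0], [-2.0, 1.0]]
Y = [1, 1, 1, -1]

# Outside the margin, inside the margin, misclassified, outside the margin.
slacks = [slack(w, b, x, y) for x, y in zip(X, Y)]
print(slacks)

# The same violations cost far more when C is large.
small_c = soft_margin_objective(w, b, X, Y, C=1.0)
large_c = soft_margin_objective(w, b, X, Y, C=10.0)
print(small_c, large_c)
```

In this toy case the second example lies inside the margin but on the correct side (slack between 0 and 1), whereas the third is misclassified (slack greater than 1). Increasing C multiplies the penalty for the same violations, which is why an optimiser would favour solutions with smaller slack at large C.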