and the second Sobel kernel ($S_y$) estimates the vertical derivatives and is defined as
\[ S_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix}. \tag{3.42} \]
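The central difference method used to calculate the second order derivatives is given by
\[ \frac{\partial u}{\partial x} = \frac{u(x + \Delta x) - u(x - \Delta x)}{2\Delta x} = \frac{p(x + \Delta x) - p(x - \Delta x)}{2\Delta x} \tag{3.43} \]
\[ \frac{\partial v}{\partial y} = \frac{v(y + \Delta y) - v(y - \Delta y)}{2\Delta y} = \frac{q(y + \Delta y) - q(y - \Delta y)}{2\Delta y} \tag{3.44} \]
where $(\Delta x, \Delta y)$ is a change of 1 pixel. All equations have been replicated from [6, 93, 107].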
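As a minimal sketch of the central difference with a step of 1 pixel, the following Python snippet evaluates the horizontal derivative along a single image row. The function name `central_difference` and the sample values in `p_row` are hypothetical and chosen only for illustration; border pixels are skipped because they lack one of the two neighbours.

```python
def central_difference(values, index, step=1):
    """Approximate the derivative at `index` from the two neighbouring samples."""
    return (values[index + step] - values[index - step]) / (2 * step)


# Hypothetical horizontal flow component p sampled along one image row.
p_row = [0.0, 0.5, 2.0, 4.5, 8.0]

# Interior pixels only: the first and last pixel have no neighbour on one side.
du_dx = [central_difference(p_row, i) for i in range(1, len(p_row) - 1)]
print(du_dx)  # [1.0, 2.0, 3.0]
```

The same function can be applied along columns with the vertical component q to approximate the derivative in y.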
Finally, after the calculation of the strain patterns, a visual representation, called a strain map, can be created by normalising the strain magnitude values to the range 0-255. The benefit of this visualisation is that it highlights the estimated motion on the face, as shown in Fig. 3.11.
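The normalisation step can be sketched as follows. The text only states that the magnitudes are normalised to 0-255, so the min-max scaling used here is an assumption, and the function name `normalise_to_uint8` and the sample magnitudes are hypothetical.

```python
def normalise_to_uint8(magnitudes):
    """Min-max scale a 2D list of strain magnitudes to integers in 0-255.

    Min-max scaling is an assumption; the source does not name the scheme.
    """
    flat = [value for row in magnitudes for value in row]
    lowest, highest = min(flat), max(flat)
    if highest == lowest:
        # A constant image carries no contrast, so map everything to 0.
        return [[0 for _ in row] for row in magnitudes]
    scale = 255.0 / (highest - lowest)
    return [[round((value - lowest) * scale) for value in row] for row in magnitudes]


# Hypothetical 2x2 patch of strain magnitudes.
strain = [[0.0, 1.0], [3.0, 5.0]]
strain_map = normalise_to_uint8(strain)
print(strain_map)  # [[0, 51], [153, 255]]
```

The resulting integers can then be passed through any colour map so that larger strain values appear closer to red.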
3.4 Methods of Classification
Machine learning is concerned with learning structure from data, and many methods exist in this field. In this section, two classification methods, Support Vector Machines (SVM) and Random Forests (RF), are described.
Figure 3.11: A strain map calculated from the normalised optical strain magnitudes. The estimated motion on areas of the face is shown in the colours closer to red.
3.4.1 Support Vector Machines
First proposed by Cortes and Vapnik [12], an SVM attempts to find a linear decision surface (hyperplane) that separates the classes and has the largest distance between the support vectors (the elements of each class that lie closest to the other class). If a linear surface does not exist, then an SVM is able to use kernel functions to map the data into a higher-dimensional space where a decision surface can be found. The SVM was originally based on the Structural Risk Minimisation principle, which was used for machine learning from a finite dataset.
As shown in Fig. 3.12, data points are split using an optimal separating hyperplane. The dashed lines on either side of the hyperplane delimit the margin $m$. Each training vector $x_i$ belongs to a class $y_i$, with the training set defined as $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$; here $\mathbb{R}^d$ is the $d$-dimensional real space and $\{-1, +1\}$ are the two classes. For a given hyperplane, $x_+$ and $x_-$ are the closest points to the hyperplane among the positive and negative examples respectively. The norm of a vector $w$ is denoted by $\|w\|$; it is the length of $w$ and is given by $\sqrt{w^T w}$. A unit vector $\hat{w}$ in the direction of $w$ is given by $w/\|w\|$ and has $\|\hat{w}\| = 1$.
Figure 3.12: Visualisation of an SVM hyperplane. The green and red circles represent the positive and negative classes respectively, with the support vectors contributing to hyperplane separation, leading to the determination of the maximum margin.
From a geometric consideration, the margin of a hyperplane $f$ with respect to a dataset $D$ can be defined as
\[ m_D(f) = \frac{1}{2}\hat{w}^T(x_+ - x_-) \tag{3.45} \]
where $\hat{w} = w/\|w\|$ is the unit vector in the direction of $w$, and $x_+$ and $x_-$ are assumed to be equidistant from the decision boundary, so that
\[ f(x_+) = w^T x_+ + b = a \tag{3.46} \]
\[ f(x_-) = w^T x_- + b = -a \tag{3.47} \]
for some constant $a > 0$. To make this geometric margin meaningful, the value of the decision function at the points closest to the hyperplane is fixed by setting $a = 1$. Subtracting Eq. 3.47 from Eq. 3.46 and then dividing by $\|w\|$ gives $\hat{w}^T(x_+ - x_-) = 2/\|w\|$, so the margin becomes
\[ m_D(f) = \frac{1}{2}\hat{w}^T(x_+ - x_-) = \frac{1}{\|w\|} \tag{3.48} \]
The classifier is first formulated to handle linearly separable data. It can then be modified to handle less easily separable (or non-separable) data. The maximum margin classifier is the discriminant function that maximises the geometric margin $1/\|w\|$, which is equivalent to minimising $\|w\|^2$. This leads to the following constrained optimisation problem:
\[ \min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, n \tag{3.49} \]
where the constraints ensure that the maximum margin classifier classifies each example correctly, assuming the data is linearly separable. However, it is often the case that data is not linearly separable. A larger margin can be obtained by allowing some points to be misclassified. The optimisation problem now becomes
\[ \min_{w,b} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{3.50} \]
where $\xi_i \ge 0$ are slack variables that allow an example to lie inside the margin (a margin error, $0 \le \xi_i \le 1$) or to be misclassified ($\xi_i > 1$). The constant $C > 0$ sets the relative importance of maximising the margin and minimising the amount of error. This formulation for non-separable data is called a soft margin SVM.
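To illustrate Eq. 3.50, the sketch below evaluates the soft margin objective for a fixed hyperplane. For given $w$ and $b$, the smallest slack satisfying the constraints is $\xi_i = \max(0, 1 - y_i(w^T x_i + b))$. The function name `soft_margin_objective` and the toy data are hypothetical; no optimisation is performed here.

```python
def soft_margin_objective(w, b, points, labels, C):
    """Return the value of Eq. 3.50 and the slack of every training example."""
    slacks = []
    for x_i, y_i in zip(points, labels):
        decision = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        # Smallest slack that satisfies y_i * f(x_i) >= 1 - slack, slack >= 0.
        slacks.append(max(0.0, 1.0 - y_i * decision))
    norm_squared = sum(w_j * w_j for w_j in w)
    return 0.5 * norm_squared + C * sum(slacks), slacks


# Hypothetical 2D toy set: one margin error and one misclassified point.
points = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 1.0]]
labels = [1, 1, -1, -1]

objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, points, labels, C=1.0)
print(slacks)     # [0.0, 0.5, 0.0, 1.5]
print(objective)  # 2.5
```

The second point has a slack between 0 and 1 (inside the margin but on the correct side), while the fourth has a slack above 1 (misclassified). Increasing C penalises both more heavily relative to the margin term.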
Lagrange multipliers are a mathematical method for solving constrained optimisation problems of differentiable functions. With an SVM, the saddle point of the Lagrange function can be found using
\[ L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n}\alpha_i\{y_i[(w^T x_i) + b] - 1\} \tag{3.51} \]
where $\alpha_i$ are the Lagrange multipliers. The Lagrangian function has to be minimised with respect to $w$ and $b$, and maximised with respect to $\alpha_i \ge 0$. The optimisation can be transformed into its dual problem as
\[ \max_{\alpha} D(\alpha) = \max_{\alpha}\min_{w,b} L(w, b, \alpha) \tag{3.52} \]
and the optimal separating hyperplane is represented by the dual solution
\[ w = \sum_{i=1}^{n}\alpha_i y_i x_i \tag{3.53} \]
The value of $b$ can be estimated by substituting $w$ into the original equation $w^T x + b = 0$. For testing, the classification is given by
\[ f(x) = \operatorname{sign}(w^T x + b) \tag{3.54} \]
for any new data point $x$. If the training data input into the SVM is non-separable, then the error variables, $\xi$, can be used.
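The dual solution and the decision rule can be sketched on a toy problem with two support vectors. The multipliers are supplied by hand rather than solved for, and the bias is recovered from one support vector via $y_s(w^T x_s + b) = 1$, which is a common choice assumed here rather than taken from the text. All function names are hypothetical.

```python
def weight_vector(alphas, labels, points):
    """Eq. 3.53: w is the sum of alpha_i * y_i * x_i over the training points."""
    w = [0.0] * len(points[0])
    for alpha, y_i, x_i in zip(alphas, labels, points):
        for j, x_ij in enumerate(x_i):
            w[j] += alpha * y_i * x_ij
    return w


def bias_from_support_vector(w, x_s, y_s):
    """Assumed approach: solve y_s * (w.x_s + b) = 1 for b, with y_s in {-1, +1}."""
    return y_s - sum(w_j * x_j for w_j, x_j in zip(w, x_s))


def classify(w, b, x):
    """Eq. 3.54: the sign of the decision function."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical support vectors on either side of the line x1 = 0.
points = [[1.0, 0.0], [-1.0, 0.0]]
labels = [1, -1]
alphas = [0.5, 0.5]  # chosen by hand; they satisfy sum(alpha_i * y_i) = 0

w = weight_vector(alphas, labels, points)
b = bias_from_support_vector(w, points[0], labels[0])
print(w, b)                         # [1.0, 0.0] 0.0
print(classify(w, b, [3.0, 2.0]))   # 1
print(classify(w, b, [-0.2, 5.0]))  # -1
```

Here $\|w\| = 1$, so the margin $1/\|w\|$ equals 1, which matches the distance of each support vector from the hyperplane.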
For this thesis, the kernel used for the SVM was the radial basis function (RBF). During empirical experiments, the optimal settings were found to be a cost parameter of 1, gamma of 1/(number of features) and ε of 0.001. LibSVM within Weka was used as the SVM classifier.
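For reference, the RBF kernel in LibSVM has the form $K(x, z) = \exp(-\gamma\|x - z\|^2)$. The snippet below is a minimal sketch of that formula with gamma set to 1/(number of features) as stated above; the function name `rbf_kernel` and the sample vectors are hypothetical, and this is not the LibSVM implementation itself.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Hypothetical feature vectors with four features each.
x = [1.0, 2.0, 3.0, 4.0]
z = [1.0, 2.0, 3.0, 6.0]
gamma = 1.0 / len(x)  # 1 / number of features

same = rbf_kernel(x, x, gamma)     # identical vectors give the maximum value of 1
similarity = rbf_kernel(x, z, gamma)  # squared distance 4, so exp(-1)
print(same, similarity)
```

Smaller values of gamma make the kernel decay more slowly with distance, giving a smoother decision surface.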
3.4.2 Random Forests
Random Forests (RF), developed by Breiman [26], are a relatively new machine learning approach based on the idea that if one classification tree is good, then many trees (a forest) should be better. RF can run efficiently on large datasets, with the storage of the dataset itself being the only major memory requirement. They are also resistant to overfitting (i.e. the model describing random error or noise instead of the underlying relationship), so the performance of the algorithm does not decrease as the number of trees increases.
Firstly, each tree is trained on around two thirds of the training data provided to the algorithm, with each case picked at random from the original data. This selection process is named bootstrap aggregation: a number of decision trees (denoted as ntree) are generated, each provided with randomly selected samples of the training input, and all decision trees are then combined into a decision forest. Then, at each node, a number of predictor variables, m, are randomly selected from all of the predictor variables, and the best split among them is used to split the node. By default, m is set to the square root of the total number of predictors for classification and stays constant during the growing of the forest. By changing m (the parameter denoted by mtry), the RF can be tuned for different data. The number of trees used in a forest, ntree, can also be adjusted to fine-tune the method.
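The two sources of randomness described above can be sketched as follows. Drawing the bootstrap sample with replacement is assumed here, following the usual RF procedure, which leaves roughly two thirds of the cases in the bag. The function names, the dataset size of 300 and the 16 predictors are hypothetical.

```python
import math
import random


def bootstrap_sample(n_cases, rng):
    """Draw n_cases indices with replacement; the unseen cases are out of bag."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_predictors(n_predictors, rng, mtry=None):
    """Pick mtry distinct predictors for one node (default: sqrt of the total)."""
    if mtry is None:
        mtry = max(1, int(math.sqrt(n_predictors)))
    return sorted(rng.sample(range(n_predictors), mtry))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

in_bag, out_of_bag = bootstrap_sample(300, rng)
unique_fraction = len(set(in_bag)) / 300
print(f"{unique_fraction:.2f} of the cases are in the bag")

features = candidate_predictors(16, rng)
print(len(features), "of 16 predictors considered at this node")
```

In a full forest, `bootstrap_sample` would be called once per tree (ntree times) and `candidate_predictors` once per node, with mtry held constant while the forest is grown.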
The remaining one third of the training set is used to calculate the misclassification rate, named the out-of-bag (OOB) error rate. The combined error from all trees is used to determine the overall OOB error rate for classification. The error rate of a forest depends on two main factors. First, the correlation between any two trees in the forest: increasing the correlation increases the error rate. Second, the strength of the individual trees: trees with low error rates are classed as strong classifiers, so increasing the strength of individual trees decreases the overall forest error rate. Reducing mtry reduces both the correlation and the strength, and vice versa.
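One way to compute the OOB error rate is sketched below: each case is scored only by the trees that did not see it during training, and the majority of those votes is compared with the true label. The function name `oob_error` and the toy votes are hypothetical, and cases that were never out of bag are skipped, which is an assumption of this sketch.

```python
from collections import Counter


def oob_error(true_labels, oob_votes):
    """Fraction of cases whose out-of-bag majority vote differs from the label.

    oob_votes[i] lists the votes of the trees for which case i was out of bag.
    """
    scored = 0
    errors = 0
    for label, votes in zip(true_labels, oob_votes):
        if not votes:
            continue  # case was in the bag for every tree, so it cannot be scored
        predicted = Counter(votes).most_common(1)[0][0]
        scored += 1
        errors += int(predicted != label)
    if scored == 0:
        raise ValueError("no case has any out-of-bag votes")
    return errors / scored


# Hypothetical labels and out-of-bag votes for four cases.
true_labels = [1, 0, 1, 0]
oob_votes = [[1, 1, 0], [0, 0], [0, 0, 1], []]

error_rate = oob_error(true_labels, oob_votes)
print(error_rate)  # one of the three scored cases is wrong
```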
Finally, each tree provides a classification choice, and it can be said that the tree has voted for that particular class. Whichever class has the most votes from all the trees is chosen as the predicted class. For example, in a binary classification problem, each vote is either yes or no, and the RF score is the percentage of yes votes, which is taken as the predicted probability.
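The voting step can be sketched in a few lines. The function name `forest_vote` and the five example votes are hypothetical; an odd number of trees is used so that no tie can occur.

```python
from collections import Counter


def forest_vote(tree_votes):
    """Return the class with the most votes and the per-class vote counts."""
    counts = Counter(tree_votes)
    winner = counts.most_common(1)[0][0]
    return winner, counts


# Hypothetical votes from a forest of five trees on one test case.
votes = ["yes", "no", "yes", "yes", "no"]

predicted, counts = forest_vote(votes)
score = counts["yes"] / len(votes)  # share of yes votes, used as the RF score
print(predicted, score)  # yes 0.6
```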