• No results found

Chapter 2: Review of Literature

2.6 Characteristics of Gait Data

2.6.1 Development of Automatic Recognition Tools using Multivariate Statistical

Automatic gait recognition tools can be developed using multivariate statistical analyses and machine learning algorithms in order to discriminate and classify data. These tools are developed using different stages of training, predictive and evaluation (Lever et al., 2016a; c). (1) During the training stage a model is developed, i.e. the machine is supplied with information to learn. (2) During the prediction stage, the model developed in (1) is used to identify classes. A threshold is established to define the probability of an observation belonging to one class or the other, i.e. the classification performance of different models form the machine are assessed and/or improved. For example, as seen in Figure 2.14 the features of two classes/categories (solid and open black circles) are separated into two classes using the predicted probability. In (a) the membership of classes is perfectly separable when using a 0.5 threshold, in (b) however, the membership of

Chapter 2: Review of Literature

37 features into classes is ambiguous as shown by the 0.5 thresholds. Thus, the threshold needs to be tuned to control false positives (0.75) and false negatives (0.25) (Lever et al., 2016c). (3) During the evaluation stage data that was not used for the training or the classification is used to assess how a classifier performs (Lever et al., 2016a).

Figure 2.14 Threshold example of a classification of a data set. Figure adopted from Lever et al. (2016c).

Prior to the development of a machine learning algorithm, data needs to be pre-processed, which is done by reducing large volumes of data through dimensionality reduction, followed by feature selection, which can be followed by a cross-validation method and finally the classification procedure. Different methods can be implemented at any given stage of the development (Figure 2.15).

Figure 2.15 Schematic illustrating the steps in the development of an automatic recognition tool. Abbreviations are Principal Component Analysis (PCA); Kernel based-PCA (kPCA); Genetic Algorithm (GA); Cross-Validation (CV); Leave-one-out (LOO); Distribution optimally balanced stratified CV (DOB–SCV); Clustering Analysis (CA); Support Vector Machine (SVM); Naïve Bayes (NB); Logistic regression (LR); K-Nearest Neighbours (KNN); Decision Tree (DT); Discriminant Analysis (DA); Artificial Neural Networks (ANN); Negative likelihood ratio (NLR); and area under the curve (AUC). Figure adapted from Figueiredo et al. (2018).

Chapter 2: Review of Literature

38

2.6.1.1 Feature selection methods

When presented with large volumes of data, as is the case during gait analysis, a parsimonious representation of the data is sought. During the development of automatic pattern recognition tools, feature selection methods are used, which also improve classification outcomes (Alaqtash

et al., 2011a; Lee et al., 2009; Muniz et al., 2010a). These methods can be divided into three

groups: filter methods, wrapper methods, and embedded methods (Saeys et al., 2007). Filter methods process data without considering the classification algorithm that follows (Saeys et al., 2007), wrapper methods use heuristic criterion to evaluate a subset of data according to the classification method that follows (Lu et al., 2014; Saeys et al., 2007) and embedded methods interact with the classification method. Different feature selection methods have been investigated in gait analysis, such as Principal Component Analysis (PCA) and kernel based-PCA (kPCA) which are filter methods, genetic algorithm (GA) which is a wrapper method, and hill-climbing which is an embedded method (Lu et al., 2014; Martins et al., 2014).

(1) Principal Component Analysis (PCA) is the orthogonal transformation of variables, i.e. dependent variables are transformed to become independent variables. It aims to establish the optimal linear transformation representing the data in the least squared sense (Yang et al., 2012). Data is presented in a new coordinate system, capturing the maximum variance within the data set (Badesa et al., 2014; Dillmann et al., 2014; Wu et al., 2007; Yang et al., 2012). The different axes of the coordinate system are referred to as principal components (PCs), where the first few of PCs hold the most variance of the original data set. To reduce large data volumes, the dimensionality of the data is reduced by preserving the first two PCs and removing the remaining PCs (Lee et al., 2009; Yang et al., 2012). Principal Component Analysis was first applied to gait data in order to derive a method to represent signals instead of using the signals themselves (Wootten et al., 1990), to reduce large volumes of data (Olney

et al., 1998) and to assess entire temporal waveforms of gait data retaining potentially

valuable information (Deluzio et al., 1997).

(2) Kernel based-PCA (kPCA) is used to map non-linear data onto a higher dimensional feature space using a kernel function such as linear, polynomial and radial basis function (RBF) (Wu

et al., 2007). Studies comparing different kernel functions established that a polynomial

kernel achieves better performance relative to linear and RBF kernels (Liang & Lee, 2013; Wu et al., 2007).

Chapter 2: Review of Literature

39 (3) Genetic algorithms (GA) are inspired by Charles Darwin’s theory of natural evolution and is considered a time efficient optimisation technique (Ardestani et al., 2014; Martins et al., 2014). The algorithm starts with a random set of individuals called a population, where each individual of the population is characterised by a set of parameters known as genes. Each one of these genes is jointed into a string creating a chromosome (solution). A fitness function determines an individuals’ probability of reproduction and only candidates with the potential to pass to the next generation as defined by the classification method are preserved, i.e. optimising the cost function (Ardestani et al., 2014; Sarbaz et al., 2012).

(4) Hill-climbing algorithms iteratively search for features that positively contribute to the classification procedure (Begg et al., 2005). Each feature is used for initial classification and based on the performance of the classifier and the features are ranked from the highest to the lowest contributor (Chan et al., 2013).

Studies demonstrated classification results improved following feature selection methods. Using PCA, Eskofier et al. (2011) demonstrated that the classification outcome improved from 58% to 95.8%. Using kPCA, Wu et al. (2007) showed that the classification outcome improved from 85% to 91% after the original dataset of 36 features was reduced to 17 features, creating a compact data set of uncorrelated features which still represented the original data set without compromise. However, other studies showed that the classification outcome did not necessarily improve following feature selection (Badesa et al., 2014; Lai et al., 2008). Nevertheless, the reduced data set minimised performance time, and thus also computational cost (Lai et al., 2008).

Principal Component Analysis is the most common data reduction method applied in gait analysis (Badesa et al., 2014; Deluzio & Astephen, 2007; Eskofier et al., 2011; Figueiredo et al., 2018; Lee et al., 2009). However, Wu et al. (2007) demonstrated that kPCA extracts PCs that are more relevant to non-linear gait data in the presence of noise, relative to PCA, but it is mathematically more complicated. An issue with both PCA and kPCA is the number of PC scores retained during the analysis since the selection of PCs is fundamental to achieve the best possible classification outcome (Badesa et al., 2014; Deluzio & Astephen, 2007; Eskofier et al., 2011). Too many PC scores would result in overfitting of the data, and too little would result in underfitting of the data. Hill-climbing and GA aim to find a local optimum and a global minimum, respectively (Ardestani

et al., 2014; Martins et al., 2014; Su & Wu, 2000). Genetic algorithms quantitatively and

qualitatively identify the most relevant features, but the selection processes depend on other factors associated with high computational cost (Pratihar et al., 2002). Compared to other methods, PCA and kPCA are less complex.

Chapter 2: Review of Literature

40

2.6.1.2 Classification methods

Machine learning algorithms have the ability to identify patterns automatically, and model complex, non-linear and high dimensional data (Alaqtash et al., 2011b; Lai et al., 2008; Zheng et

al., 2009). In gait analysis, different machine learning algorithms have been used such as

Discriminant analyses (DA) which are either in the form of Linear Discriminant Analysis (LDA) or Quadratic Discriminant Analysis (QDA), Artificial Neural Networks (ANNs), Support Vector Machine (SVM), Naïve Bayes (NB), Logistic regression (LR), Clustering analysis (CA), Fuzzy logic clustering or K-Nearest Neighbours (KNN).

(1) Discriminant Analysis (DA) is a supervised machine learning algorithm that finds a linear or quadratic combination of input features in order to separate data into two or more classes. Discriminant analysis is thus referred to as Linear Discriminant Analysis (LDA) which is also known as Discriminant Function Analysis (DFA) or Quadratic Discriminant Analysis (QDA). Each feature has its own weighting factor which indicates its importance to the discrimination procedure between the classes. The intra-class and inter-class distances between the features are determined to define which class a feature belongs to (Badesa et al., 2014; Harper, 2005; Lee et al., 2009).

(2) Artificial Neural Networks (ANNs) are machine learning algorithms based on the biological neural system (Ardestani et al., 2014). They are made up of multilayer feed-forward neural networks, which are composed of layers of interconnected sets of nodes which loosely model the neurons in a biological brain. Connections between units move forward through hidden layers of nodes to form the input to the output layer (Alaqtash et al., 2011a; Chau, 2001b). (3) Support Vector Machine (SVM) is a supervised machine learning algorithm using a kernel

method to classify non-linear gait data and map it to a higher dimensional feature space (Begg & Kamruzzaman, 2005; Begg et al., 2005). Classification is performed by finding the optimal hyperplanes that separate between classes.

(4) Naïve Bayes (NB) based on Bayes’ theorem, assumes that all features are independent of each other (Badesa et al., 2014; Chan et al., 2013). It creates a probabilistic model defining the class that an input belongs to.

(5) Logistic regression (LR) transforms data into logic variables (binary variables) to maximise classification outcomes (Harper, 2005; Muniz et al., 2010b).

Chapter 2: Review of Literature

41 (6) Clustering Analysis (CA) uses homogeneous groups or “clusters” to classify data. Two methods of hierarchical and non-hierarchical clustering are used to minimise variability within a class and maximise variables between classes (Kaczmarczyk et al., 2009).

(7) Fuzzy Logic Clustering offers an insight into the non-linear relationship between variables (Chau, 2001a). It does not consider sharp boundaries so input data can simultaneously contribute to multiple classes (Alaqtash et al., 2011b; Chau, 2001a).

(8) K-Nearest Neighbours (KNN) is a non-hierarchical clustering method. It defines the characteristics of the input data depending on similarity to their neighbours (Alaqtash et al., 2011a; Badesa et al., 2014).

All machine learning algorithms have benefits and limitations, hence an algorithm is chosen depending on the application. Artificial Neural Networks handle non-linear data and can learn and adapt to new data, but large volumes of variables are required for an accurate generalisation of the algorithm. It can suffer from overfitting which compromises its generalisation (Begg & Kamruzzaman, 2005; Begg et al., 2005; Chau, 2001b; Lai et al., 2008). Support Vector Machine, however, considers a global optimum and overfitting of the training process can be avoided (Saeys

et al., 2007; Wu et al., 2007). It can be applied to small data, and new data can be added to the

classification procedure without compromising the outcome (Begg & Kamruzzaman, 2005; Begg

et al., 2005; Khandoker et al., 2007). Other classification methods, such as CA, are sensitive to

correlated variables. This means that prior to its application, correlated variables need to be identified and removed (Kinsella & Moran, 2008). Furthermore, a priori rules and clusters need to be defined by the user, which may introduce bias (Chau, 2001a; Dobson et al., 2007). Nevertheless, CA based on fuzzy logic clustering offers insights into non-linear relationships between variables.

Different machine learning algorithms will result in different classification outcomes due to the unique approach of each algorithm. No algorithm suits all applications (Harper, 2005) especially considering the complexity of gait data. Studies have therefore investigated different approaches to evaluate which suits their data. Comparing DA, tree-based algorithms, ANN, and LR classification methods, Harper (2005) demonstrated that no ideal approach exits, instead the performance of the classification depends on the features. Other studies suggest that the combination of different classification methods can improve the classification outcome (Pogorelc

et al., 2012). Classification performance depends on multiple factors such as the relevance of

features, type of feature, size of the dataset, and/or number of participants (Badesa et al., 2014; Begg & Kamruzzaman, 2005).

Chapter 2: Review of Literature

42

2.6.1.3 Classification evaluation

The performance of machine learning algorithms can be evaluated using different methods. The evaluation should be performed on a test set which has not been used for the training and whose classification is not known. One method involves using the confusion matrix to define accuracy, sensitivity and specificity (Lever et al., 2016a). In a two-class problem, there are four possible outcomes of classification: true positive (TP), false negative (FN), true negative (TN), and false positive (FP), where accuracy (Equation 2.6) evaluates the classifier defined as the percentage of true predictions using a model of these four categories. However, high accuracy does not necessarily imply a good classifier. Thus, sensitivity (Equation 2.7) and specificity (Equation 2.8) measure the proportion of actual positives and negatives of the classifier, respectively. True positives and false positives can be captured by precision, also known as a positive predictive value, which is the proportion of predicted positives (Lever et al., 2016a).

𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 (%) = 𝑇𝑁 + 𝑇𝑃 𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁× 100% (2.6) 𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦 (%) = 𝑇𝑃 𝑇𝑃 + 𝐹𝑁× 100% (2.7) 𝑆𝑝𝑒𝑐𝑖𝑓𝑖𝑐𝑖𝑡𝑦 (%) = 𝑇𝑁 𝑇𝑁 + 𝐹𝑃× 100% (2.8)

There are several methods to aggregate the confusion matrix. The most popular is 𝐹𝛽 score, which controls the balance of recall and precision using 𝛽, which can be calculated as follows:

𝐹𝛽 =

(1 + 𝛽2)(𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑅𝑒𝑐𝑎𝑙𝑙)

𝛽2× 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙 (2.9)

Where 𝐹𝛽 = 𝐹𝛽 score, 𝛽 = shape parameter

As 𝛽 decreases, precision is given greater weight. Commonly, the 𝐹1score is calculated with 𝛽 = 1, balancing recall and precision with the equation reduced to 2𝑇𝑃/(2𝑇𝑃 + 𝐹𝑃 + 𝐹𝑁). The 𝐹𝛽 does not capture the whole confusion matrix since it does not give an indication of TNs. One method to capture all data in a coefficient matrix is Matthew Correlation Coefficient (MCC), which ranges from -1 to 1, where ‘-1’ indicates classification is always wrong, ‘0’ indicates classification is no better than random and ‘+1’ indicates classification is always correct.

Chapter 2: Review of Literature

43 It should be noted that no single matrix can distinguish all strengths and weakness of a classifier. Instead of evaluating a classifier using a positive or negative, a level of certainty can be used, which can be visually interpreted using a receiver operating characteristic (ROC) curve. It is also possible to determine the ratio between false and true negatives using the negative likelihood ratio (NLR) and the area under the curve (AUC) (Begg et al., 2005; Chan et al., 2013).

2.6.1.4 Cross-validation

After a machine learning algorithm is developed, different approaches are implemented to improve the classification performance. These include cross-validation and feature normalisation methodologies. Cross-validation (CV) methods are used to evaluate the generalisability of the classification outcome as new data is added. These methods can also minimise the likelihood of overfitting (Figueiredo et al., 2018). The conventional CV method starts by dividing the data set into training and testing (predictive) data sets, based on 𝑘-fold. During the process, the cross- validation process is repeated 𝑘 times until every trial is used as a testing sample at least once whilst all other trials make up the training sample. Finally, the average 𝑘 results are calculated, determining the performance of the classifier (Alaqtash et al., 2011a; b).

During the leave-one-out (LOO) cross-validation method data in each fold belongs to a particular participant instead of randomly assigned trials (Alaqtash et al., 2011a; Badesa et al., 2014). Therefore, the LOO method uses 𝑘-fold depending on the number of trials. This method is unsuitable if trials are unbalanced since that may introduce different data distributions (López et

al., 2014). For an unbalanced number of distribution trials, optimally balanced stratified cross-

validation (DOB-SCV) should be used (López et al., 2014).

Other methods to improve a classifier’s accuracy and robustness is by implementing normalisation procedures. Time normalisation is an example of such method, during which each feature is expressed as a function of a gait cycle instead of time (Alaqtash et al., 2011b; Eskofier

et al., 2011; Zhang et al., 2014). Similarly, kinematic data can be standardised to a person’s body

weight instead of a gait cycle (Laroche et al., 2014). Using a 𝑧-score to standardise data (Begg and Kamruzzaman, 2005; Hanson et al., 2009; Wu and Wang, 2008) ensures that all features have a mean of zero and a variance of one (Yang et al., 2012; Zhang et al., 2014):

𝑥 − 𝜇 𝜎

(2.10)

Chapter 2: Review of Literature

44

2.6.2 Multivariate Statistical Analysis and Machine Learning Algorithms in Gait