Random Forests (RFs) is a prediction and classification technique consisting of an ensemble of tree-structured classifiers (classification and regression trees) (79).
Decision trees are classifiers that test a single feature at each non-terminal node: a sample is sent to the left or the right branch of that node according to whether its value for that feature is below or above the threshold specified there. Each split partitions the data into two mutually exclusive groups and is chosen to make each group as homogeneous as possible; the splitting is then repeated within each group in order to improve this homogeneity further, subject to the constraint of keeping the tree relatively small and easy to interpret (80). The process completes once the tree can no longer split the samples into different groups. The tree can then be ‛pruned’ back to the desired size.
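The choice of threshold at a single node can be illustrated with a minimal sketch. It assumes the Gini impurity as the homogeneity measure (the text above does not name a specific criterion), and the function names and toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Impurity of the two groups, weighted by group size.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: one feature, two classes.
feature = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
classes = ["A", "A", "A", "B", "B", "B"]
threshold, impurity = best_threshold(feature, classes)
```

Growing a full tree would repeat this search over every feature and then recurse into each of the two resulting groups.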
In 1984, Breiman, Friedman, Olshen and Stone published the book ‛Classification and Regression Trees’ and developed the CART software (81). This book presented the applicability of tree-based methods and their successful performance in solving a range of problems that one of the researchers had faced while working at the UC San Diego Medical School.
RFs use an ensemble of classification trees; each tree is grown on a ‛bootstrap’ sample [a sub-sample of the ‛training set’ drawn with replacement from the original dataset (82)], with a random selection of features considered at each branch (83). Class prediction is based on the majority vote of the aggregate of trees. To build each tree, a bootstrap sample is drawn from the original whole sample set, containing on average 63.2% of the distinct samples, and a random group of variables is then chosen as candidates for splitting at each node. The remaining 36.8% sub-set (the out-of-bag, OOB, samples) is then employed to estimate the classification error (the OOB error) (84). Since the OOB observations are not used for fitting the trees, this is a cross-validated accuracy estimate, and hence it represents an unbiased estimation of the generalization error (79). The importance of a variable is evaluated by measuring the increase in the OOB error when the values of that variable are randomly permuted (79). The OOB error estimate ranges from 0 (the model predicts the class of every OOB sample correctly) to 1.00 (no OOB sample is correctly classified).
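The 63.2%/36.8% proportions follow from sampling with replacement, and can be checked with a short simulation. This is a sketch only: the sample size, number of repetitions and seed are arbitrary assumptions, and no trees are fitted.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 1000
oob_fractions = []
for _ in range(200):
    in_bag, oob = bootstrap_split(n, rng)
    oob_fractions.append(len(oob) / n)

mean_oob = sum(oob_fractions) / len(oob_fractions)
# Probability that a given sample is never drawn: (1 - 1/n)^n,
# which tends to exp(-1), i.e. about 0.368, as n grows.
expected_oob = (1 - 1 / n) ** n
```

In a Random Forest, each tree would be fitted on its own in-bag indices and evaluated on the corresponding out-of-bag indices.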
1.4.2. Support Vector Machines (SVMs)
Support Vector Machines were originally developed in 1982 by Vapnik (85), and the most influential development came in 1992 with the seminal paper by Boser, Guyon and Vapnik, ‛A training algorithm for optimal margin classifiers’, in which this technique was formally proposed (86).
SVMs are classification algorithms that operate by finding the decision surface that most clearly separates samples from different classes, i.e. the surface with the largest distance (margin) to the borderline samples (the support vectors). Separation is achieved by identifying these support vectors for the respective classes and placing the separating hyperplane (a sub-space of K-1 dimensions constructed in a K-dimensional space; the term is generally used where K > 3) in between them. The support vectors are therefore the points with the most influence over the parameters selected to ‛draw’ this hyperplane, in such a manner that moving a support vector moves the hyperplane, whereas the remaining samples have no influence on the search for it. Accordingly, the algorithm generates the weights considering only the support vectors to define their values, and hence the boundary. If such a decision surface does not exist in the original space, the dataset has to be mapped onto a higher-dimensional space in which it does exist. Performing this mapping implicitly through a kernel function is known as the kernel trick.
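The idea behind mapping to a higher-dimensional space can be illustrated with a small numerical sketch. It assumes a homogeneous polynomial kernel of degree 2 on two-dimensional points, chosen purely for illustration; the feature map `phi` and the sample points are hypothetical.

```python
import math

def phi(x):
    """Explicit degree-2 feature map taking a 2-D point to 3-D (one possible choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, computed in the original space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # inner product after mapping both points to 3-D
implicit = poly2_kernel(x, z)   # the same quantity without ever building phi
```

Both quantities agree up to floating-point rounding, which is why the optimization can work with inner products in the mapped space without computing the mapping itself.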
Given a training set $x_i \in \mathbb{R}^n$, $i = 1, \ldots, N$, in which each sample $x_i$ belongs to one of two categories $y_i$, indicated by either -1 or 1, SVMs find a hyperplane with the parameters $(w, b)$ by solving the following convex optimization problem (86):

$$\min_{w,\,b,\,\epsilon} \; \frac{1}{2} w^{t} w + c \sum_{i=1}^{N} \epsilon_i \qquad (18)$$

subject to $y_i (w^{t} x_i + b) \geq 1 - \epsilon_i$, where $\epsilon_i \geq 0$, for $i = 1, \ldots, N$.
In Equation 18, c is a regularization parameter that sets the ‛trade-off’ between the ability of the system to model the training set accurately and its predictive performance (87). If c is small, the margin is wide and the constraints are easily ignored; if c is large, the margin is narrow and the constraints are hard to ignore, so that c controls the margin width. $\epsilon_i$ is the ‛slack variable’, which measures the extent to which sample i violates the margin. The function of this term is to reduce the overfitting problem and enhance the performance of SVMs by allowing a fraction of the training samples to lie within the margin, or even to be misclassified, rather than introducing more complexity into the model (i.e., a ‛soft’ margin classifier).
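The role of the slack variables and of c can be made concrete by evaluating the objective of Equation 18 for a fixed hyperplane. The following is a sketch under stated assumptions: the hyperplane (w, b) and the toy data are hypothetical, and nothing is optimized; for a given (w, b), the smallest feasible slack for each sample is max(0, 1 - y(w·x + b)).

```python
def slack(w, b, x, y):
    """Smallest eps >= 0 satisfying y * (w.x + b) >= 1 - eps for one sample."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

def soft_margin_objective(w, b, X, Y, c):
    """Value of (1/2) w.w + c * sum(eps_i) for a fixed hyperplane (w, b)."""
    regulariser = 0.5 * sum(wi * wi for wi in w)
    total_slack = sum(slack(w, b, x, y) for x, y in zip(X, Y))
    return regulariser + c * total_slack

# Hypothetical 2-D toy data with labels in {-1, +1}.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.2, 0.0)]
Y = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0  # hypothetical hyperplane: the vertical axis

# Slack is 0 outside the margin, between 0 and 1 inside it, above 1 if misclassified.
slacks = [slack(w, b, x, y) for x, y in zip(X, Y)]
low_c = soft_margin_objective(w, b, X, Y, c=1.0)
high_c = soft_margin_objective(w, b, X, Y, c=10.0)
```

With these data the second sample lies inside the margin and the fourth is misclassified; raising c makes the same violations considerably more costly, which is what pushes the optimizer towards a narrower margin.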
1.4.3. Genetic Algorithms (GAs)
GAs are randomized search and optimization algorithms inspired by Darwin’s theory of evolution, and were introduced by Holland in 1975 (88). GAs function as a simulation of an evolutionary process in which a population of solutions evolves over a sequence of generations (89) (Figure 1.9). Each chromosome represents a potential solution, and the fitness value associated with each chromosome indicates how close it is to the best possible solution (90).
Encoding and Initial Population: A chromosome is a string of ‘bits’ whose length is determined by the total number of variables. The presence or absence of a variable in the chromosome is given by the value 1 or 0 at the corresponding i-th position; in the initial population this assignment is generated randomly. Consequently, each chromosome represents a different sub-set of features.
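The binary encoding described above can be sketched as follows. The variable names, population size and seed are hypothetical and serve only to illustrate how a chromosome maps to a feature sub-set.

```python
import random

def random_chromosome(n_variables, rng):
    """One chromosome: bit i is 1 if variable i is included, 0 otherwise."""
    return [rng.randint(0, 1) for _ in range(n_variables)]

def selected_variables(chromosome, names):
    """Names of the variables switched on in a chromosome."""
    return [name for bit, name in zip(chromosome, names) if bit == 1]

rng = random.Random(42)  # arbitrary seed for reproducibility
names = ["x1", "x2", "x3", "x4", "x5", "x6"]  # hypothetical variable names

# Initial population: each chromosome is generated at random.
population = [random_chromosome(len(names), rng) for _ in range(4)]
subsets = [selected_variables(c, names) for c in population]
```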
Selection, Crossover and Mutation: The selection process follows the principle of ‛Survival of the fittest’, i.e. improved solutions are selected to pass through to the next generation, whereas ‛bad’ solutions are discarded. The ‛good & bad’ criteria are given by the fitness function.
Crossover causes a structured yet randomized exchange of genetic information amongst solutions/chromosomes, with the aim that effective solutions are crossed with other high-performance ones in order to generate improved models. This process is applied with a probability called the crossover rate, i.e. the probability that a chromosome (solution) experiences crossover during its reproduction process. A crossover rate of 1.00 indicates that all the chromosomes experience crossover, so no unchanged solutions go through to the next generation. A value of 0.75, for instance, indicates that a chromosome may go through to the next generation unchanged with a probability of 0.25.
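The effect of the crossover rate can be illustrated with a minimal sketch. It assumes single-point crossover, which is only one of several possible operators and is not specified in the text; the parent chromosomes, seed and number of trials are hypothetical.

```python
import random

def crossover(parent_a, parent_b, rate, rng):
    """Single-point crossover applied with probability `rate`; otherwise copy the parents."""
    if rng.random() >= rate or len(parent_a) < 2:
        return parent_a[:], parent_b[:]
    point = rng.randrange(1, len(parent_a))  # cut strictly inside the string
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

rng = random.Random(1)  # arbitrary seed for reproducibility
a, b = [1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0]

# With a rate of 0.75, roughly a quarter of the offspring should be unchanged copies.
trials = 10000
unchanged = sum(crossover(a, b, 0.75, rng)[0] == a for _ in range(trials))
unchanged_fraction = unchanged / trials
```

The two parents are deliberately complementary here, so that any crossover necessarily produces a child that differs from its parent.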
Mutation works by modifying the value of each gene with a certain probability. This operator restores lost or unexplored genetic material to the population in order to prevent the premature convergence of the GA to sub-optimal solutions (89).
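For the binary encoding used here, mutation amounts to flipping bits. A minimal sketch, assuming independent bit-flip mutation with a hypothetical mutation rate:

```python
import random

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

rng = random.Random(7)  # arbitrary seed for reproducibility
original = [1, 0, 0, 1, 0, 1]
mutated = mutate(original, 0.1, rng)  # 0.1 is a hypothetical mutation rate
```

A bit that has become 0 in every chromosome of the population can never be recovered by crossover alone; mutation is the operator that can switch it back on.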
Fitness function: The fitness function is a measure of the ‘goodness’ of a solution, and can be used to rank chromosomes/solutions against one another. It can also be employed to assess feature sub-set selection, in order to simplify the model by using fewer features while achieving at least the same level of success in assigning samples to their correct class. In a classification problem context, the fitness is equivalent to an evaluation of the predictive ability of the model using a sub-set of genes (variables).
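Ranking by fitness can be sketched with a toy function. This is an illustrative stand-in only: in a real classification problem the fitness would be the predictive ability of a model fitted on the selected variables, whereas here the set of ‛informative’ variables and the size penalty are hypothetical assumptions.

```python
def fitness(chromosome, informative, penalty=0.1):
    """Toy fitness: reward informative variables and penalise sub-set size."""
    hits = sum(1 for i, bit in enumerate(chromosome) if bit and i in informative)
    return hits - penalty * sum(chromosome)

informative = {0, 3}  # hypothetical: only variables 0 and 3 carry class information
population = [
    [1, 1, 1, 1, 1, 1],  # all variables included
    [1, 0, 0, 1, 0, 0],  # only the informative variables
    [0, 1, 1, 0, 0, 0],  # only uninformative variables
]

# Rank chromosomes from fittest to least fit.
ranked = sorted(population, key=lambda c: fitness(c, informative), reverse=True)
```

The chromosome containing only the two informative variables ranks first, ahead of the one that includes every variable, which reflects the preference for equally successful models with fewer features.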
Figure 1.9. Schematic representation of a GA procedure for a classification problem: variables (x1…xn) are coded in chromosomes as 1 or 0 if they are or are not included in the model, respectively. The whole process continues for N generations, or until the target fitness value is attained.
1.5. NIEMANN PICK TYPE C1 DISEASE