Three machine learning algorithms were used in this study, viz. SVM, ANN, and random forest. Below is a brief overview of their characteristics and of how they are used to make predictions in the tasks to which they are applied:
2.15.1 Support Vector Machines (SVM)
SVM is a kernel-based ML model mainly used for solving classification and regression problems. It was first introduced in 1986 by Vladimir Vapnik. SVMs are a set of related supervised learning methods used in pattern recognition, regression, classification, estimation, and operator inversion for difficult tasks (Witten, I.H., et al., 2005). The method has been applied successfully to time series prediction tasks (Voyant, C., et al., 2017). For classification problems, the primary objective of SVM is to find the hyperplane that best separates the classes in the data. A hyperplane generalizes a line in 2-D and a plane in 3-D to higher dimensions. When there are several hyperplanes to choose from, SVM selects the one whose distance to the nearest data points is the largest (Maiga, A., 2011). The maximum-margin linear hyperplane can be constructed as soon as the support vectors, the instances closest to it, have been identified, as shown in Figure 2.7:
Figure 2.7: SVM showing the optimal separating hyperplane
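The maximum-margin idea can be illustrated with a minimal sketch. The data points and the three candidate hyperplanes below are hypothetical values chosen for illustration, and the helper names are assumptions. A real SVM finds the optimal hyperplane by solving an optimization problem rather than by comparing a fixed list of candidates.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

def margin(w, b, points):
    """Distance from the hyperplane to its nearest data point."""
    return min(distance_to_hyperplane(w, b, x) for x in points)

# Hypothetical 2-D data: the first two points belong to one class,
# the last two to the other.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]

# Hypothetical candidate hyperplanes (w, b); each one separates the classes.
candidates = {
    "A": ((1.0, 1.0), -5.5),
    "B": ((1.0, 0.0), -3.0),
    "C": ((0.0, 1.0), -2.5),
}

# Keep the candidate whose nearest data point is farthest away.
best = max(candidates, key=lambda name: margin(*candidates[name], points))
```

Among these candidates, the one with the largest margin is preferred because it leaves the most room between the decision boundary and the nearest instances of either class.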
By introducing a kernel, SVM can become non-linear and non-parametric as the case demands. By Mercer's theorem, valid kernels are symmetric, positive semi-definite functions. The prediction made by an SVM for an input test case x is expressed by:
$$\hat{y} = \sum_{i=1}^{n} \alpha_i \, k_{rbf}(x_i, x) + b \tag{9}$$
In the above equation, the radial basis function (RBF) kernel is defined by:
$$k_{rbf}(x_i, x_j) = \exp\left[ -\frac{(x_i - x_j)^2}{2\sigma^2} \right] \tag{10}$$
Despite its successes, SVM is known to have some limitations as a classifier, owing to its high algorithmic complexity and the extensive memory required by its quadratic programming in large-scale tasks (Suykens, et al., 2003). Burgess in 1992 enumerated other limitations of SVMs, including:
• Model selection is sometimes computationally expensive for contemporary applications, which require new objects to be loaded continually into an already large database.
• SVMs are limited in speed and size, both in training and in testing.
• The choice of the kernel is a limitation of the support vector approach.
• It is difficult to find separating hyperplanes for discrete data.
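Equations (9) and (10) can be sketched in a few lines of code. The support vectors, coefficients, bias, and kernel width below are hypothetical values chosen for illustration. In practice they are obtained by training. For vector inputs, the squared difference in equation (10) is taken here to be the squared Euclidean distance, which is an assumption about the intended notation.

```python
import math

def rbf_kernel(x_i, x_j, sigma):
    """Equation (10): exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_predict(x, support_vectors, alphas, b, sigma):
    """Equation (9): weighted sum of kernel values plus the bias term b."""
    return sum(
        alpha * rbf_kernel(sv, x, sigma)
        for alpha, sv in zip(alphas, support_vectors)
    ) + b

# Hypothetical trained parameters (not taken from the study).
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
alphas = [0.5, -0.5]
b = 0.1

y_hat = svm_predict((0.0, 0.0), support_vectors, alphas, b, sigma=1.0)
```

The kernel equals 1 when the two inputs coincide and decays towards 0 as they move apart, so each support vector mainly influences predictions for test cases near it.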
2.15.2 Artificial Neural Network (ANN)
ANNs are a class of models inspired by the structure of biological neural networks. Like kernel methods, they are well suited to problems involving pattern matching. In the review conducted by Mellit, A., et al. (2009), 79% of the AI techniques applied to weather forecasting data were based on ANN. The artificial neuron model corresponds to a neuron of the brain: the inputs represent the dendrites, and an activation function determines whether the neuron fires, which happens when a threshold is reached. The weights (w) correspond to the synapses linking the neurons to each other in the brain, while the output corresponds to the axon.
An increase in the number of connections in an ANN improves its ability to imitate biological networks, which in turn improves its learning of the patterns embedded in the data (Moncada, A., et al., 2018). As in a perceptron, the differences between the actual and predicted values are reduced by gradually adjusting weights that start out random. One of the best strategies for optimizing these weights is backpropagation. However, while the ANN uses a non-linear activation function in calculating its output, the perceptron uses the step function. This is what distinguishes the ANN from a perceptron, and the great power of the ANN lies in this non-linearity.
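The contrast between the two kinds of activation can be sketched as follows. The sigmoid is used here only as one common example of a non-linear activation function. It is an assumption, since the text does not name a specific function, and the input values are hypothetical.

```python
import math

def step(v, threshold=0.0):
    """Perceptron activation: fires (1) once the threshold is reached, else 0."""
    return 1.0 if v >= threshold else 0.0

def sigmoid(v):
    """A smooth non-linear activation; its output changes gradually with v."""
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical weighted sums arriving at a node.
weighted_sums = [-2.0, -0.1, 0.0, 0.1, 2.0]

step_outputs = [step(v) for v in weighted_sums]
sigmoid_outputs = [sigmoid(v) for v in weighted_sums]
```

The step function jumps from 0 to 1 at the threshold, whereas the sigmoid varies smoothly, which is what allows gradient-based weight updates such as backpropagation to be applied.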
ANNs are multi-layer, fully connected neural nets that consist of an input layer, one or more hidden layers, and an output layer. Each node in one layer is connected to the nodes in the next layer.
In the ANN model, a node receives the weighted sum of its inputs and sends it through a non-linear activation function, f. The output of a node in one layer becomes an input to the nodes in the next layer. The final output is obtained by repeating this procedure for all the nodes. Training an ANN means learning the weights associated with all the edges.
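The layer-by-layer procedure described above can be sketched as follows. This is a minimal illustration in which the layer sizes, the weight values, and the choice of the sigmoid as activation function are all assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def layer_forward(inputs, weights, f):
    """Output of one layer; weights[j][i] links input i to node j."""
    return [f(sum(w * x for w, x in zip(node_w, inputs))) for node_w in weights]

def network_forward(inputs, layers, f):
    """Feed the outputs of each layer into the next, from left to right."""
    activations = inputs
    for weights in layers:
        activations = layer_forward(activations, weights, f)
    return activations

# Hypothetical weights: 2 inputs -> 2 hidden nodes -> 1 output node.
hidden = [[0.5, -0.5], [1.0, 1.0]]
output = [[1.0, -1.0]]

prediction = network_forward([1.0, 1.0], [hidden, output], sigmoid)
```

Each call to `layer_forward` applies the weighted sum and the activation function to every node of one layer, and `network_forward` repeats this until the output layer is reached.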
Equation 11 summarizes the output (z) of a given node. The weighted (w) sum of its inputs (x) is passed through a non-linear activation function (f). n is the number of inputs for the node.
$$z = f(x \cdot w) = f\left( \sum_{i=1}^{n} x_i w_i \right), \qquad x \in \mathbb{R}^{1 \times n},\; w \in \mathbb{R}^{n \times 1},\; z \in \mathbb{R}^{1 \times 1} \tag{11}$$
The bias (b) is an extra input to every node; the input itself always has a value of 1, and b is the weight attached to it. The bias makes it possible to shift the output of the activation function to the left or to the right. With a bias, the model can still train when all the input features are 0. When a bias is included in the above equation, the output of the node becomes equation (12):
$$z = f(b + x \cdot w) = f\left( b + \sum_{i=1}^{n} x_i w_i \right), \qquad x \in \mathbb{R}^{1 \times n},\; w \in \mathbb{R}^{n \times 1},\; b \in \mathbb{R}^{1 \times 1},\; z \in \mathbb{R}^{1 \times 1} \tag{12}$$
Equations (11) and (12) illustrate how the output of the forward pass of a node is calculated. The forward pass is also used to make predictions after training is completed. The ANN model is trained to learn the weights with the following procedure:
• The weights of all the nodes are initialized randomly.
• Going from left to right (forward pass), the current weights are used to calculate the output of each node. The value of the last node is the final output.
• The final output of the forward pass is compared with the actual target in the training data. A loss function is used to measure the error.
• Backpropagation propagates the error to each node from right to left (backward pass), starting from the last layer, to calculate each weight's contribution to the error. Gradient descent (GD) then adjusts the weights accordingly.
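The four training steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the configuration used in this study: the network has one hidden layer of two nodes, the activation is the sigmoid, the loss is the squared error, the weights are updated after every training example, and the toy dataset (the logical OR function) is hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_h, b_h, w_o, b_o):
    """Forward pass: every node computes f(b + sum(x_i * w_i)), as in equation (12)."""
    h = [sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
         for ws, b in zip(w_h, b_h)]
    y = sigmoid(b_o + sum(w * hi for w, hi in zip(w_o, h)))
    return h, y

def train(data, n_hidden=2, lr=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Step 1: initialize all weights and biases randomly.
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_o = rng.uniform(-1, 1)
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in data:
            # Step 2: forward pass with the current weights.
            h, y = forward(x, w_h, b_h, w_o, b_o)
            # Step 3: measure the error with the loss function.
            total += 0.5 * (y - target) ** 2
            # Step 4a: backward pass, starting from the output node.
            delta_o = (y - target) * y * (1.0 - y)
            delta_h = [delta_o * w_o[j] * h[j] * (1.0 - h[j])
                       for j in range(n_hidden)]
            # Step 4b: gradient descent update of every weight and bias.
            for j in range(n_hidden):
                w_o[j] -= lr * delta_o * h[j]
                b_h[j] -= lr * delta_h[j]
                for i in range(n_in):
                    w_h[j][i] -= lr * delta_h[j] * x[i]
            b_o -= lr * delta_o
        losses.append(total / len(data))

    def predict(x):
        return forward(x, w_h, b_h, w_o, b_o)[1]

    return predict, losses

# Hypothetical toy dataset: the logical OR of two binary inputs.
or_data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
           ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
predict, losses = train(or_data)
```

The list `losses` records the mean error per pass over the data, so plotting or printing it shows whether the weight updates are reducing the error over time.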
2.15.3 Random Forest (RF)
Random forests, also known as random decision forests, are supervised learning algorithms that use ensemble methods to solve regression and classification problems. They operate by constructing a multitude of decision trees at training time and outputting either the class with the highest frequency of occurrence (the mode) among the individual trees' predictions, for classification, or the mean of the individual trees' predictions, for regression. Random decision forests are primarily designed to address the over-fitting associated with decision trees (DT).
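The two aggregation rules can be sketched as follows. The per-tree predictions are hypothetical values, and the function names are assumptions introduced for illustration.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_predictions):
    """Return the most frequent class (the mode) among the trees' votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Return the mean of the trees' numeric predictions."""
    return mean(tree_predictions)

# Hypothetical outputs of five individual trees for one input.
votes = ["sunny", "cloudy", "sunny", "sunny", "cloudy"]
estimates = [21.0, 23.0, 22.0, 24.0, 20.0]

forest_class = aggregate_classification(votes)   # "sunny" (3 of 5 votes)
forest_value = aggregate_regression(estimates)   # 22.0
```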
Tin Kam Ho created the first algorithm for random forests using the random subspace method, which is a way to apply the stochastic discrimination approach to classification. Random forest employs the bagging technique and builds its trees in parallel, ensuring that there is no interaction between the trees while they are built. With some useful modifications, it combines the results of multiple predictions by aggregating many decision trees. Each of these trees draws a random sample from the main dataset when generating its splits. This adds a further element of randomness that reduces overfitting.
RF restricts the number of features that may be considered for a split at each node to some percentage of the total, which is set before training. This hyperparameter ensures that the RF model does not depend too heavily on any individual feature.
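The two sources of randomness described above, a random sample of rows for each tree and a restricted set of features at each split, can be sketched as follows. The dataset size, the number of trees, and the 50% feature fraction are hypothetical values, and the function names are assumptions. For brevity, the sketch draws one feature subset per tree, whereas an actual forest draws a new subset at every node.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Bagging: draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, fraction, rng):
    """Pick the features a split is allowed to consider."""
    k = max(1, int(n_features * fraction))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(42)
n_rows, n_features, n_trees = 10, 8, 3

plans = []
for _ in range(n_trees):
    rows = bootstrap_sample(n_rows, rng)                    # sample for this tree
    features = random_feature_subset(n_features, 0.5, rng)  # candidates for one split
    plans.append((rows, features))
```

Because the rows are drawn with replacement, each tree typically sees some rows more than once and misses others, so no two trees are trained on exactly the same data.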
Advantages of Random Forests:
• Its accuracy is regarded as very high in comparison to other learning algorithms.
• It is efficient in handling large databases.
• It accepts many input variables without deleting any of them.
• It offers estimates of which variables are important in the classification or regression.
• As the building of the forest advances, it produces an unbiased internal estimate of the generalization error.
• It can effectively estimate missing data and still maintains accuracy when a large percentage of the data is missing.
Disadvantages of Random Forests:
1. They have been found to overfit for some datasets.
2. For data that include categorical variables with different numbers of levels, they are observed to be biased in favor of attributes with more levels.