Chapter 1: Introduction
2.4.5. Data mining, machine learning and pattern matching
2.4.5.6. Examples of supervised learning methods
As discussed in section 2.4.5.5, for supervised learning the training data are provided with a set of their corresponding targets or labels (Witten et al., 2011), with the aim of producing a model based on the relationship between the training data and corresponding targets (Kantardzic, 2011). There are a number of different supervised learning methods that have been utilised, and this section will provide some examples of these methods.
2.4.5.6.1. Decision trees
Decision trees are a method of supervised learning used within data mining (Witten et al., 2011). Decision trees can also be expressed as classification rules or association rules (Witten et al., 2011).
Decision trees are constructed using a logical method (Kantardzic, 2011), based on seeking the best attribute split that can be used to separate each of the classes provided in the training data (Witten et al., 2011). This process then continues on all subsequent splits until no further data splits are required (Witten et al., 2011). The structure of a decision tree consists of nodes, branches and leaves: each node is used to test a particular attribute (Witten et al., 2011), each branch attached to that node represents one of the possible outcomes from the node, for example yes or no (Kantardzic, 2011), and the leaves represent the classification. An example of the structure of a decision tree is shown in figure 2.1, where each circle represents a node, each line represents a branch and each square represents a leaf.
Figure 2.1: Example of a simple decision tree (Adapted from Murphy, (2012) and Witten et al., (2011))
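The split-seeking construction described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the text does not name a split criterion, so information gain (entropy reduction) is assumed here, and the function names and the toy "outlook"/"windy" data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on one attribute (assumed criterion)."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute whose split separates the classes best (highest gain)."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


def build_tree(rows, labels, attributes):
    """Split recursively until a subset is pure or no attributes remain."""
    if len(set(labels)) == 1 or not attributes:
        # Leaf: the classification is the majority class of the subset.
        return Counter(labels).most_common(1)[0][0]
    attribute = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attribute]
    node = {"attribute": attribute, "branches": {}}
    for value in set(row[attribute] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[attribute] == value]
        sub_rows, sub_labels = zip(*subset)
        node["branches"][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return node


def classify(tree, row):
    """Follow branches from the root node until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


# Hypothetical toy training data with yes/no targets.
ROWS = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
LABELS = ["yes", "yes", "no", "no"]
TREE = build_tree(ROWS, LABELS, ["outlook", "windy"])
```

In this toy data the "outlook" attribute separates the two classes completely, so it is chosen for the root node and each of its branches ends directly in a leaf.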
Decision trees work well on simple problems, and have the advantages of fast computation time (Jain, Duin, & Mao, 2000) and of being easily interpretable (Kotsiantis, Zaharakis, & Pintelas, 2006). For more complex problems, decision trees lose their interpretability due to their large size (Kantardzic, 2011) and suffer from overfitting (Kotsiantis, 2013), a characteristic which is described in section 2.4.5.8.
2.4.5.6.2. Artificial neural networks
Artificial neural networks are inspired by biology and the study of how the brain computes and performs tasks (Kantardzic, 2011; Mitchell, 1997). The overall structure of an artificial neural network is modelled on the structures of neurons inside the brain (Kantardzic, 2011), with an artificial neural network containing a number of interconnected artificial neurons. An artificial neuron, as shown in figure 2.2, is constructed of three parts: the inputs with weights, an adder and an activation function (Haykin, 1999; Kantardzic, 2011). The inputs to the neuron are multiplied by their corresponding weights and passed to the adder, which sums the weighted inputs. The weighted sum is then passed to the activation function and, if its value is higher than the threshold of the activation function, the neuron provides an output (Theodoridis, Pikrakis, Koutroumbas, & Cavouras, 2010).
Figure 2.2: An example of an artificial neuron (Adapted from Haykin, (1999))
Artificial neural networks can be used for both supervised and unsupervised learning (Kantardzic, 2011). For a supervised learning task, a feed-forward neural network is used, and this type of network is usually used for classification tasks (Gurney, 1997). A classification task involves placing input data into a class, based on the training and target data for each class (Marsland, 2009).
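The artificial neuron described in this section (weighted inputs, an adder and a thresholded activation function) can be sketched as follows. This is a minimal illustration that assumes a simple step activation; the function name and the example weights and threshold, chosen so that the neuron behaves like a logical AND, are hypothetical.

```python
def neuron_output(inputs, weights, threshold):
    """Weighted sum (the adder) followed by a step activation function."""
    # The adder: each input is multiplied by its weight and the results are summed.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Assumed step activation: the neuron outputs 1 only above the threshold.
    return 1 if weighted_sum > threshold else 0


# Hypothetical weights and threshold that make the neuron act as a logical AND.
AND_WEIGHTS = [1.0, 1.0]
AND_THRESHOLD = 1.5

AND_OUTPUTS = [
    neuron_output([a, b], AND_WEIGHTS, AND_THRESHOLD)
    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]
]
# AND_OUTPUTS is [0, 0, 0, 1]
```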
To form a feed-forward neural network, a number of artificial neurons (as shown in figure 2.2) are joined, as shown in figure 2.3. Each circle in figure 2.3 is a node that contains the adder and activation function of an artificial neuron, and each connection represents one of the modifiable weights. The feed-forward neural network in figure 2.3 has a three-layer structure, which consists of an input layer, a hidden layer and an output layer. The precise structure of a neural network is determined by the designer and can, for example, contain a number of hidden layers (Gurney, 1997).
Figure 2.3: An example of a three-layer artificial neural network (Adapted from Haykin, (1999))
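A forward pass through a three-layer network of this kind can be sketched as below. The sketch makes assumptions that are not stated in the text: a sigmoid activation function is used, each node has a bias term, and all function and parameter names are hypothetical.

```python
from math import exp


def sigmoid(z):
    """Assumed smooth activation function with outputs between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def layer_forward(inputs, weights, biases):
    """Each node sums its weighted inputs (plus a bias) and applies the activation."""
    return [
        sigmoid(sum(x * w for x, w in zip(inputs, node_weights)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


def feed_forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """Pass the input layer values through the hidden layer to the output layer."""
    hidden = layer_forward(inputs, hidden_weights, hidden_biases)
    return layer_forward(hidden, output_weights, output_biases)
```

Here `hidden_weights` holds one list of weights per hidden node and `output_weights` one list per output node, so adding a further hidden layer would only require one more call to `layer_forward`.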
For supervised learning of a feed-forward network, a back propagation algorithm is used (Duda, Hart, & Stork, 2001; Gurney, 1997). This process involves presenting an untrained neural network with the training data to provide an output. This output is then compared with the target data provided with the training data to produce an error (Duda et al., 2001), which then determines how the weights of the network are changed (Gurney, 1997). The process is repeated until the output of the model converges to match the provided target data closely (Duda et al., 2001; Gurney, 1997).
An advantage of using artificial neural networks is that they are good at finding non-linear solutions (Jain et al., 2000; Sathya & Abraham, 2013), as their structure allows the representation of non-linear decision boundaries (Witten et al., 2011). Similarly to decision trees, as discussed in section 2.4.5.6.1, artificial neural networks also suffer from overfitting (see section 2.4.5.8).
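The training loop described above (present the data, compare the output with the target, change the weights according to the error, repeat) can be sketched for a small network with one hidden layer. This is an illustrative sketch only: the sigmoid activation, squared error, random initial weights, learning rate, number of epochs and the logical-OR training data are all assumptions made for the example, and the function names are hypothetical.

```python
import random
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def forward(x, w_hidden, w_output):
    """Return the hidden activations and the network output for one input."""
    # Each weight list ends with a bias weight, paired with a constant input of 1.
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_output, hidden + [1.0])))
    return hidden, output


def mean_squared_error(data, w_hidden, w_output):
    return sum((t - forward(x, w_hidden, w_output)[1]) ** 2 for x, t in data) / len(data)


def train(data, n_hidden=2, epochs=2000, rate=0.5, seed=0):
    """Back propagation with per-example weight updates (assumed settings)."""
    rng = random.Random(seed)
    n_inputs = len(data[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
    w_output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    initial_error = mean_squared_error(data, w_hidden, w_output)
    for _ in range(epochs):
        for x, target in data:
            hidden, output = forward(x, w_hidden, w_output)
            # Error at the output node, scaled by the sigmoid derivative.
            delta_out = (target - output) * output * (1.0 - output)
            # Propagate the error back to each hidden node.
            delta_hidden = [
                delta_out * w_output[j] * hidden[j] * (1.0 - hidden[j])
                for j in range(n_hidden)
            ]
            # Change the weights in the direction that reduces the error.
            for j, v in enumerate(hidden + [1.0]):
                w_output[j] += rate * delta_out * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] += rate * delta_hidden[j] * v
    return w_hidden, w_output, initial_error


# Hypothetical training data: inputs paired with logical-OR targets.
OR_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
W_HIDDEN, W_OUTPUT, INITIAL_ERROR = train(OR_DATA)
FINAL_ERROR = mean_squared_error(OR_DATA, W_HIDDEN, W_OUTPUT)
```

Comparing `FINAL_ERROR` with `INITIAL_ERROR` shows how far the repeated weight changes have moved the network output towards the target data.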
2.4.5.6.3. Support vector machines
Support vector machines aim to find the hyperplane, which is a subspace (Cristianini & Shawe-Taylor, 2000), that provides optimal separation between two classes (Rogers & Girolami, 2012). This is achieved by finding the hyperplane which maximizes the margin, which is the distance between the hyperplane and the closest points on each side (Duda et al., 2001; Witten et al., 2011). For non-linear cases, the training of a support vector machine involves the transformation of the training data into a higher dimensional space (for example, by using kernel functions (Rogers & Girolami, 2012)) where the data can be separated using a hyperplane (Duda et al., 2001). Support vector machines are used for binary classification (Rogers & Girolami, 2012), though they have been extended for multiple class classification tasks (Murphy, 2012).
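The two ideas in this paragraph, the margin of a hyperplane and the transformation into a higher dimensional space, can be illustrated with a small sketch. It does not train a support vector machine; it only evaluates the margin of candidate hyperplanes. The explicit mapping x -> (x, x^2) is a hypothetical stand-in for what a kernel function achieves implicitly, and the data, function names and hyperplanes are assumptions made for the example.

```python
from math import sqrt


def signed_distance(point, weights, bias):
    """Signed distance from a point to the hyperplane w.x + b = 0."""
    norm = sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, point)) + bias) / norm


def margin(points, labels, weights, bias):
    """Distance from the hyperplane to the closest point (negative if any point is misclassified)."""
    return min(y * signed_distance(p, weights, bias) for p, y in zip(points, labels))


def feature_map(x):
    """Hypothetical explicit mapping of a 1-D value into a 2-D space."""
    return (x, x * x)


# Hypothetical 1-D data: no single threshold on x separates the two classes.
X = [-2.0, -1.0, 1.0, 2.0]
LABELS = [+1, -1, -1, +1]

# After the mapping, a horizontal line in the (x, x^2) plane separates them.
MAPPED = [feature_map(x) for x in X]

# Two candidate separating hyperplanes: x^2 = 2.5 and x^2 = 2.0.
MARGIN_A = margin(MAPPED, LABELS, (0.0, 1.0), -2.5)  # 1.5
MARGIN_B = margin(MAPPED, LABELS, (0.0, 1.0), -2.0)  # 1.0
```

Both hyperplanes separate the mapped data, but the first has the larger margin, which is the property a support vector machine seeks to maximize.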
Support vector machines have the advantages that they work well on small training datasets (Jain et al., 2000) and are less prone to overfitting than artificial neural networks (Jain et al., 2000). However, with support vector machines, the risk of overfitting is increased with the addition of kernels (Cristianini & Shawe-Taylor, 2000). A disadvantage of support vector machines is that, compared with the other examples above, this approach has a very slow training time (Jain et al., 2000).