2.5 Introduction to Machine Learning
2.5.2 Machine Learning Models
A machine learning model is used to learn from, and make predictions about, a particular input dataset. An important aspect of this learning process, which can affect how the model is used, is whether the model can be considered interpretable. An interpretable model is one where the underlying decision process can be explained and understood by human observers, with models failing to meet this criterion being labelled as black box¹ [77].
Some of the major families of approaches relevant to this thesis are reviewed below.
¹ So called because only the inputs and outputs can be observed, with the internal components mapping between the two remaining opaque.
Traditional Supervised Models
Before the recent increase in the popularity of neural-based models, other forms of supervised machine learning models were prevalent. Whilst these approaches differ in the algorithms used, they almost all share one common trait: they require an n-dimensional vector as input. This means that data which is not naturally represented in this format, including graphs, images and text, must be converted into a vector. Typically this vector represents descriptive features extracted from the data by domain experts, in a process known as feature extraction or feature engineering [122]. Once the input data has been converted into vector form it can, along with its associated set of labels, be used as input to a variety of models. Three of the most frequently used are detailed below:
• Logistic Regression: Logistic regression is a supervised classification model that is often used as a strong baseline [163]. Considering the binary case², logistic regression is a linear model with one parameter per element of the input feature vector. The dot product of the input vector and the parameters is passed through the logistic sigmoid function, which ensures that the output lies in the range 0 to 1 and can therefore be interpreted as a predicted probability [80]. The parameters of the model are then tuned using gradient descent [200], such that the model becomes more likely to produce the desired result (a process introduced in greater depth in the following section).
• Support Vector Machines: A more complicated family of algorithms for supervised classification is that of Support Vector Machines (SVMs) [58], which, unlike logistic regression, directly map data points to predicted labels. Again considering the binary case, SVMs attempt to fit a decision boundary, in the form of a hyperplane, between the data points in a high-dimensional space. This decision boundary is optimised such that it separates the data points belonging to the two classes [80], typically by maximising the margin between the boundary and the nearest data points. Class predictions for new data can then be made by determining which side of the decision boundary each point falls on. The mapping from the initial input vector to the new high-dimensional space, in which an accurate decision boundary can be found, can be costly or even computationally intractable [55]. To overcome this issue, SVMs exploit what is known as the kernel trick [206], which enables inner products, and hence distances, in the high-dimensional space to be computed without the need to actually perform the mapping [55].
• Random Forests: More recently, Random Forests have become one of the most widely used models for supervised learning. Random Forests are essentially ensembles, or collections, of individual decision tree models, combined to perform a classification task [100]. As well as demonstrating excellent predictive performance, they are often favoured because their output can be viewed as a series of decision rules, making for a more interpretable model [55]. Each decision tree can be conceptualised as a tree-like structure, where the split at each node is a test on a certain attribute or feature of the input data, for example, whether a feature is below or above a certain value. Decision trees are trained using a two-step process: induction, where new rules are created and applied to the data, and pruning, where unnecessary structure is removed from the tree to help the model generalise better to unseen data [40].
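As an illustration of the first of these models, the binary logistic regression procedure described above can be sketched in a few lines of NumPy. This is a minimal sketch rather than a reference implementation: the function names, the learning rate, the number of steps, the bias term and the toy data are all assumptions introduced here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_steps=1000):
    """Fit binary logistic regression by batch gradient descent.

    X: (n_samples, n_features) feature vectors; y: (n_samples,) labels in {0, 1}.
    lr and n_steps are illustrative defaults, not tuned values.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # one parameter per element of the feature vector
    b = 0.0                   # bias term (an assumption, not mentioned in the text)
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)               # predicted probability of class 1
        grad_w = X.T @ (p - y) / n_samples   # gradient of the mean cross-entropy loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w                     # gradient descent update
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    """Threshold the predicted probability at 0.5 to obtain a class label."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Toy, linearly separable data (hypothetical values for illustration only).
X = np.array([[0.0, 0.2], [0.3, 0.1], [0.2, 0.4],
              [1.0, 0.9], [0.8, 1.1], [1.2, 0.7]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = train_logistic_regression(X, y)
preds = predict(X, w, b)
```

In practice a library implementation would normally be preferred, as it would add regularisation and a more robust optimiser.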
Neural-based Models and Deep Learning
Artificial Neural Networks (ANNs) are a family of models within Machine Learning inspired by, but importantly not completely replicating, the functionality of a brain [142]. Whilst the origins of ANNs date back to at least the 1960s, and perhaps earlier [195], they have recently experienced a dramatic increase in capability and thus popularity [142]. ANNs model problems via the use of connected layers of artificial neurons. Each ANN has an input layer of such neurons to which the data is passed, at least one hidden layer to transform the data in some way, and an output layer where predictions are produced. In the traditional ANN concept, each neuron takes as input a weighted sum of the outputs of the neurons which are connected to it, with each layer containing a parameter matrix to enable this. Once the weighted sum has been computed, it is transformed using a pre-specified non-linear activation function. Commonly used examples of such functions include Sigmoid, Softmax and the Rectified Linear Unit (ReLU) [80]. Without the use of non-linear activation functions, a model would be limited to learning only linear (affine) transformations of the input data [80]. This would severely limit the learning capability of the model and make the use of multiple stacked layers redundant, as combining multiple linear layers would still result in a linear operation overall [55].
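The redundancy of stacked linear layers can be checked numerically. The sketch below, which assumes NumPy and uses arbitrary layer sizes and random weights chosen purely for illustration, builds two dense layers without an activation function and shows that they are equivalent to a single affine layer. The same comparison is then repeated with a ReLU placed between the layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Rectified Linear Unit: element-wise max(0, z)."""
    return np.maximum(0.0, z)

def dense(x, W, b, activation=None):
    """One layer: weighted sum of the inputs plus a bias, then an optional activation."""
    z = x @ W + b
    return z if activation is None else activation(z)

# Hypothetical sizes: a batch of 5 inputs with 4 features, 3 hidden units, 2 outputs.
x = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)

# Two stacked layers with no activation function ...
stacked_linear = dense(dense(x, W1, b1), W2, b2)

# ... collapse into one affine layer with merged parameters.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
single_linear = dense(x, W_merged, b_merged)
collapses = np.allclose(stacked_linear, single_linear)

# With a ReLU between the layers the merged affine layer is, in general,
# no longer equivalent to the stacked network.
stacked_relu = dense(dense(x, W1, b1, activation=relu), W2, b2)
differs = not np.allclose(stacked_relu, single_linear)
```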
ANNs are modified to become better at a certain task using an iterative process, commonly referred to as training. This training process is performed as follows: input data is passed into the network and transformed via the hidden layers, and a prediction is produced at the output layer. Typically for ANNs, the correctness of this prediction is assessed via the use of a loss function.
A variety of functions can be utilised for this task, with the choice being specific to the type of learning being performed. For example, supervised problems use loss functions which exploit the availability of labels, such as the cross-entropy function, which is closely related to the Kullback–Leibler (KL) divergence and measures the discrepancy between the true and predicted output distributions [108]. Once a loss value for the model has been computed, the parameters, or weights, are updated such that the probability of producing the desired outcome would increase if the same data were passed in a second time. The gradients needed for this update are computed by a process known as back-propagation [201]. The back-propagation algorithm exploits the fact that all components of a neural network are differentiable and computes the gradient of the loss with respect to the model parameters, exploiting the chain rule for computational efficiency [55]. A separate family of algorithms, called optimisers, then takes this gradient and uses it to update the parameters directly. One of the most frequently used optimisers is Stochastic Gradient Descent (SGD), which uses randomly chosen sub-samples of the larger dataset to enable more efficient training [200].
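The training loop described above (forward pass, loss, back-propagation, optimiser update) can be sketched for a very small network. The following is a minimal illustration in NumPy, not a definitive implementation. The toy task, the network size (one hidden layer of 8 tanh units), the learning rate, the batch size and the number of steps are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: the label is 1 when both inputs share the same sign.
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)

# Parameters of a network with one hidden layer of 8 tanh units.
params = {
    "W1": rng.normal(scale=0.5, size=(2, 8)), "b1": np.zeros(8),
    "W2": rng.normal(scale=0.5, size=(8, 1)), "b2": np.zeros(1),
}

def forward(params, X):
    """Forward pass: hidden activations h and predicted probabilities p."""
    h = np.tanh(X @ params["W1"] + params["b1"])
    p = 1.0 / (1.0 + np.exp(-(h @ params["W2"] + params["b2"])))
    return h, p

def cross_entropy(p, y, eps=1e-12):
    """Mean binary cross-entropy between labels y and predictions p."""
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def backward(params, X, y):
    """Back-propagation: apply the chain rule from the loss back to each parameter."""
    h, p = forward(params, X)
    n = X.shape[0]
    dz2 = (p - y) / n              # gradient w.r.t. the output pre-activation
    dh = dz2 @ params["W2"].T      # propagate back to the hidden activations
    dz1 = dh * (1.0 - h ** 2)      # derivative of tanh(z) is 1 - tanh(z)^2
    return {
        "W1": X.T @ dz1, "b1": dz1.sum(axis=0),
        "W2": h.T @ dz2, "b2": dz2.sum(axis=0),
    }

def sgd_step(params, grads, lr):
    """Optimiser: move each parameter a small step against its gradient."""
    for name in params:
        params[name] -= lr * grads[name]

initial_loss = cross_entropy(forward(params, X)[1], y)
for step in range(2000):
    batch = rng.choice(len(X), size=32, replace=False)  # random sub-sample
    grads = backward(params, X[batch], y[batch])
    sgd_step(params, grads, lr=0.5)
final_loss = cross_entropy(forward(params, X)[1], y)
```

Keeping `backward` and `sgd_step` separate mirrors the distinction drawn in the text: back-propagation only computes gradients, whereas the optimiser decides how those gradients change the parameters.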
Deep Learning is a term generally used to refer to ANNs which have multiple stacked hidden layers, so-called Deep Feed Forward or Dense networks. In practice, though, the term encompasses an emerging field, including new model architectures, training procedures that allow for the use of massive datasets, and even a philosophical shift in how data is represented as input to the models [80]. Traditionally, Machine Learning has been performed upon features extracted from the data, which can be a cumbersome task performed by domain experts [186]. This manual process, known as feature selection [90] in the literature, has clear disadvantages, as certain features may only be useful for a certain task. Such features could even negatively affect model performance if utilised in a task for which they are not well suited. Arguably, many of the recent exciting advances seen in the field of Deep Learning have been driven by the removal of this feature selection process [87], instead allowing models to learn the best data representations themselves [80]. This is often known as end-to-end learning, as the model itself learns the feature representation best suited to a certain task. An example of a deep model which exploits this setup is the Convolutional Neural Network (CNN) family of models, which has demonstrated state-of-the-art performance in image classification, among other tasks [142]. CNNs take raw images as input and exploit spatial locality patterns by sliding learnable filters over the images, both to improve predictions and to reduce the total number of parameters needed to perform a certain task [143]. However, such models have faced criticism for being black boxes and thus not possessing an interpretable decision process [77].
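The sliding-filter operation at the heart of a CNN can be sketched as follows. This is a minimal, unoptimised illustration in NumPy, assuming a single-channel image, no padding and a stride of one. The function name, the toy image and the fixed edge-detecting filter are hypothetical; in a real CNN the filter values would be learned during training.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), summing element-wise products.

    As in most deep learning libraries, this is technically a cross-correlation.
    """
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]   # local region under the filter
            out[r, c] = np.sum(patch * kernel)  # same weights reused at every position
    return out

# Hypothetical 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A fixed vertical-edge filter stands in for a filter that a CNN would learn.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d_valid(image, kernel)  # responds only where the edge lies

# Parameter comparison: the shared filter versus a dense layer that maps
# every input pixel to every output value.
n_filter_params = kernel.size
n_dense_params = image.size * feature_map.size
```

Because the same nine weights are reused at every position, the filter needs far fewer parameters than a dense layer producing an output of the same size, which is the parameter reduction referred to above.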
Unsupervised Models
As discussed in Section 2.5.1, unsupervised models are ones which do not require the use of labels to guide the learning process. One important unsupervised task, explored in detail in this thesis (see Chapter 4), is that of representation learning, more commonly known as embedding [186]. In the context of the machine learning literature, embedding models are used to map between a discrete entity, with no natural numerical representation, and a meaningful value for it in some vector space [165]. This can be formalised as performing the following function:
$f : O \rightarrow \mathbb{R}^d$, where $f$ learns to map a set of entities $O$ to a vector of size $d$, importantly without requiring the use of labelled examples. Examples of entities which can be mapped this way include words [164], retail products [225] and graphs (see Chapter 4).
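One common way to realise such a function in code is as a lookup into a matrix with one row per entity. The sketch below assumes this lookup-table formulation and uses NumPy. The entity names and the embedding size are hypothetical, and the matrix is filled with random values where a trained model would supply learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)

entities = ["apple", "banana", "cherry"]  # hypothetical entity set O
d = 4                                     # embedding size
index = {entity: i for i, entity in enumerate(entities)}

# One row of size d per entity; random here, learned in practice.
embedding_matrix = rng.normal(size=(len(entities), d))

def f(entity):
    """Map a discrete entity to its vector in R^d via a table lookup."""
    return embedding_matrix[index[entity]]

vec = f("banana")
```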
To perform this mapping function, a variety of unsupervised models can be used, with some traditional approaches using matrix factorisation to learn the representation [149]. Increasingly, however, neural networks are being utilised in place of such approaches, with one popular approach being the skip-gram model from Word2Vec [165]. Skip-gram is designed to transform words, taken from a sentence, into vector representations in which, crucially, some of the semantic and linguistic meaning of the word is preserved in the new embedding space. The skip-gram model learns an embedding for a word by using the surrounding words within a sentence as targets for a single-hidden-layer neural network to predict. Due to the nature of this technique, words which frequently co-occur in sentences will have positions which are close within the embedding space. However, it has been argued that such techniques should really be labelled as self-supervised learning, as they employ models and objective functions more commonly found in supervised learning, but generate the labels automatically from within the dataset [80]. The skip-gram model has subsequently been adapted to work on graph data [186].
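The way in which skip-gram generates its own labels can be illustrated by constructing the (centre word, context word) training pairs for a single sentence. The sketch below covers only this pair-generation step, not the neural network itself. The function name, the window size and the example sentence are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs: each word paired with its neighbours within `window`."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

# Hypothetical example sentence, split on whitespace.
sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
```

Each pair acts as an (input, target) example for the network, which is why no manually provided labels are required.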