2. Machine learning algorithms and descriptions

2.2. Support Vector Machines

2.2.1. Finding optimal linear decision boundaries

The image in figure 2.1 shows an example of two different classes of data that display linear separability. The blue and orange circles represent data points from two different classes, each taking a value on the x and y axes that determines its position on the plot. In machine learning terminology, these inputs are referred to as the features of the model, and this term will be used throughout this thesis. These hypothetical classes could represent anything but, given the topics presented in this thesis, can be imagined as case (orange) and control (blue) status in schizophrenia, with the axes representing risk scores for two collections of variants or mutations. This example is very simple, but helps to show the conceptual idea behind the algorithm of an SVM.

Figure 2.1.: A visual representation of two different classes that display linear separability.

The aim is to find a hyperplane (which in two dimensions can easily be represented as a straight line) that separates these two sets of data points in the optimal way. It is important to stress what is meant by the term “optimal” in this situation, as there are infinitely many lines that could be drawn to separate these points. Figure 2.2 shows some examples that successfully separate the classes, but do not do so in an optimal manner. The two black lines in particular are clipping the edge cases of the classes in two different ways, but the red line seems to represent a slightly better dividing line by appearing somewhere between the others.

Figure 2.2.: An example showing three “sub-optimal” ways to divide the data points into two regions.

What is meant by the “optimal” solution is the line that leaves the largest possible margin between itself and the nearest points of each class. This is shown in figure 2.3. In this figure, the centre line shows the optimum fit, and the dotted lines represent the margins, which are now at their maximum possible width. The areas on either side have been coloured to represent the class of the data points belonging to each respective side. This is a visual representation of the main aim of building categorical models, which is to create models that allow the categorisation of new, previously unseen, data points. Any new data point in this example model would be plotted onto the graph and assigned the class corresponding to the colour of the region in which it falls.

Figure 2.3.: In this figure, the optimum separating line is shown together with the dotted margin lines, the colours representing the regions assigned to each class, and circles around the data points on the margin - the support vectors.

Another feature in this figure is that some of the points have a circle around them. These are the points that lie on the margin, and it is these points that are referred to as the support vectors, hence the name of the whole algorithm. Due to the nature of the mathematics involved in an SVM, it is in fact only the information from these points that is used to find the optimum hyperplane. In short, it is only the “ambiguous” data points that play a role in building the model, which can lead to greater efficiency during the optimisation procedure. Of course, if there is no clear divide between classes, as is often the case with psychiatric genetic datasets, then there are many ambiguous data points, and hence a correspondingly large number of support vectors, which results in a longer computational running time.
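The idea of the margin and of the points that determine it can be sketched in a few lines of Python. The data points and the two candidate lines below are hypothetical and chosen only for illustration; the function names are likewise not taken from any library. The sketch measures the distance from each point to a candidate line, takes the smallest of these as the margin, and reports which points attain it.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from a 2D point to the line w . x + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])

def margin_and_nearest(points, labels, w, b):
    """Smallest label-corrected distance to the line, and the points attaining it."""
    distances = [y * signed_distance(p, w, b) for p, y in zip(points, labels)]
    margin = min(distances)
    nearest = [p for p, d in zip(points, distances) if math.isclose(d, margin)]
    return margin, nearest

# Hypothetical toy data: -1 plays the role of control (blue), +1 of case (orange).
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
labels = [-1, -1, -1, 1, 1, 1]

# Two candidate lines, each given as (w, b); both separate the classes.
clipping = ((1.0, 1.0), -3.5)  # x + y = 3.5, passes close to the control points
centred = ((1.0, 1.0), -5.5)   # x + y = 5.5, sits midway between the classes

clipping_margin, clipping_nearest = margin_and_nearest(points, labels, *clipping)
centred_margin, centred_nearest = margin_and_nearest(points, labels, *centred)

print(f"clipping line: margin {clipping_margin:.3f}, nearest {clipping_nearest}")
print(f"centred line:  margin {centred_margin:.3f}, nearest {centred_nearest}")
```

Both lines classify every point correctly, but the centred line keeps a much larger distance to its nearest points. For this toy data, the points nearest to the centred line come from both classes, which is the role played by the circled support vectors in figure 2.3.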

One of the outputs of a linear SVM is the set of coefficients assigned to the different features. These coefficients describe a vector that runs from the origin of the figure (where both feature values are 0) and is orthogonal to the hyperplane. This can be seen in figure 2.4, represented as a red arrow. In this situation, the origin of the figure is in the top left (and therefore, the value assigned to the y axis feature in this circumstance would be negative). In order to calculate on which side of the line a new data point falls, its feature values are projected orthogonally onto this coefficient vector to see how far along it lands. This can also be seen in figure 2.4 as the green dotted arrow.

Figure 2.4.: The red arrow in this figure represents the coefficient vector, which runs from the origin of the graph, in the top left, and is orthogonal to the separating line. The dotted green arrow represents a projection of one of the data points onto the coefficient vector, and shows that this point does not pass the hyperplane boundary.

In this example, the projection clearly falls within the blue area. This projection is calculated by taking what is known as the dot product between the data point vector and the coefficient vector. If the data point vector is written as u and the coefficient vector as w, then the dot product is written as u · w.¹ The letter “w” is used for this vector as the coefficients are also referred to as the “weights” of the model. When the SVM is linear in nature, the values in this weight vector can provide some very useful interpretability to the model, in the sense that they indicate the relative importance of the different features and how much of a role they play in the categorisation. This is put into extensive use in chapter 5, where importance metrics are assigned to gene sets.

¹ Of note, in the field of linear algebra, this can often be written as ⟨u, w⟩, which will be used in later chapters when the issue of different kernels is raised.
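The projection step can be illustrated with a short Python sketch. The weight values, the data points and the function names below are hypothetical. The text above describes only the dot product; the sketch also includes an intercept term b, as in the usual formulation of a linear classifier, so that the boundary does not have to pass through the origin.

```python
def dot(u, w):
    """Dot product of two equal-length vectors."""
    return sum(ui * wi for ui, wi in zip(u, w))

def classify(u, w, b):
    """Return the predicted class and the raw score u . w + b for a data point u."""
    score = dot(u, w) + b
    return ("orange" if score > 0 else "blue"), score

# Hypothetical weights: positive for the x feature, negative for the y feature.
w = (0.8, -0.6)
b = -1.0  # assumed intercept, not discussed in the text

for u in [(1.0, 2.0), (5.0, 1.0)]:
    label, score = classify(u, w, b)
    print(u, label, round(score, 2))
```

The sign of the score decides the class, and its size reflects how far the point lies from the boundary. In this sketch the x feature carries the larger weight in absolute terms, so a unit change in x moves the score more than a unit change in y.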

Finding the hyper-parameters of the models

Another aspect that plays a vital role when building any machine learning model is the choice of hyper-parameters. A conceptual overview is provided here with diagrams.

The situation presented in figure 2.5 is very similar to that already presented but with one minor change: one of the blue points has now moved position. The data is still linearly separable, but the situation is less clear, as it looks as though this point should be orange based on its proximity to the other orange data points. This raises the question of whether the point’s colour should be taken as correct, or whether it is better to assume that the point is a mistake. This matter is handled by setting the value of the hyper-parameter C, the cost parameter, which in essence states how heavily the model is penalised for misclassifying a training point.

Figure 2.5.: In this example, there is an unclear data point, shown circled, which is blue but looks as though it would be better classified as orange based on its proximity to the others.

In the first example, this value has been set to 100, which penalises training errors so heavily that the model classifies all of the training data correctly. The effect of this can be seen in figure 2.6, where classifying every data point correctly has come at the cost of dramatically reducing the width of the margins. This is a clear demonstration of what is termed overfitting of the model to the training set: while the model is technically correct, the line does not appear to be in the optimum position, and new data points could therefore be misclassified.

Figure 2.6.: The effect of setting the C parameter to 100: no incorrect classifications are made, but the width between the margins is considerably decreased.

Due to this problem of overfitting, it is often preferable to build a model that is permitted to make mistakes on the data used to train it, in exchange for better performance on any new data points presented for categorisation. In this example, this can be achieved by setting the parameter C to 1 instead of 100. The effects of this can be seen in figure 2.7, where the result is a model that can better generalise to new data.

Figure 2.7.: When the C parameter is set to 1, the unclear data point in the training set is misclassified, but the wide margin is regained, which allows the model to better generalise to new data points.
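The trade-off controlled by C can be made concrete with a small Python sketch. It assumes the standard soft-margin objective, which is not written out in the text above: half the squared length of the weight vector plus C times the summed hinge loss over the training points. The data and the two candidate boundaries are hypothetical and hand-picked; they are not the solutions an SVM solver would return, and serve only to show which kind of boundary is favoured as C changes.

```python
def hinge(label, score):
    """Hinge loss: zero when the point is on the correct side, outside the margin."""
    return max(0.0, 1.0 - label * score)

def objective(points, labels, w, b, C):
    """Assumed soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    regulariser = 0.5 * sum(wi * wi for wi in w)
    loss = sum(
        hinge(y, sum(wi * xi for wi, xi in zip(w, p)) + b)
        for p, y in zip(points, labels)
    )
    return regulariser + C * loss

# Hypothetical data: -1 is blue, +1 is orange. The point at x = 3.5 is the
# unclear blue point that sits next to the orange cluster.
points = [(1.0, 1.0), (2.0, 2.0), (1.0, 3.0), (3.5, 2.0),
          (4.0, 1.0), (5.0, 2.0), (4.0, 3.0)]
labels = [-1, -1, -1, -1, 1, 1, 1]

narrow = ((4.0, 0.0), -15.0)  # boundary at x = 3.75, margin width 0.5, no errors
wide = ((1.0, 0.0), -3.0)     # boundary at x = 3.0, margin width 2.0, one error

for C in (100, 1):
    cost_narrow = objective(points, labels, *narrow, C)
    cost_wide = objective(points, labels, *wide, C)
    preferred = "narrow" if cost_narrow < cost_wide else "wide"
    print(f"C={C}: narrow={cost_narrow}, wide={cost_wide}, preferred={preferred}")
```

With C set to 100, the single violation makes the wide boundary far more costly than the narrow one, so the narrow boundary is favoured. With C set to 1, the same violation is cheap relative to the gain in margin width, and the wide boundary is favoured even though it misclassifies the unclear point.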

This example also highlights another important ability and characteristic of an SVM in that it is permitted to make mistakes; the algorithm is described as a soft margin classifier in the sense that the dividing boundary does not have to make totally correct predictions on the training data, a concept first described by Cortes and Vapnik (1995). This is crucial as, unlike in the simple example provided here, most real datasets that are presented to machine learning algorithms are not linearly separable, and that is certainly the case with psychiatric genetic datasets.

When linearity fails to make suitable classifications

While an SVM is capable of building these soft margins to allow it to make mistakes with the training data in order to improve generalisability later, there are frequent occasions when a linear classifier is simply not suitable for use in the model. Another simple example showing this can be seen in figure 2.8. This example is a variant of the eXclusive-OR (XOR) problem, whereby the only way for a data point to be classified as positive is if it has a high value on one OR the other feature; not having either, or having both, will result in belonging to the negative class.

Figure 2.8.: An example showing four different clusters of points belonging to two different classes. There is no way to fit a linear decision boundary to discriminate between the two classes in this situation.

This is shown in figure 2.8 and can be related to a very simple genetic example: if the x and y axes represent scores on two different sets of mutations, a high score in either would result in developing a disorder, but the presence of both could result in the epistatic effect of each of them cancelling out the contribution of the other. There is no way that these different sets of classes can be separated by a single straight line. An attempt to do so can be seen in figure 2.9, where all of the data points in the lower left hand side have been misclassified.

Figure 2.9.: An example showing an attempt to fit a linear decision boundary to a situation that resembles the XOR problem. It is simply not possible and many wrong classifications will be made.
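The failure of a straight line on this kind of pattern can be checked with a brute-force Python sketch. It reduces each of the four clusters to a single hypothetical point with features coded as 0 (low) or 1 (high), and tries every line on a coarse grid of weights and intercepts. The grid and the function name are assumptions made for illustration.

```python
from itertools import product

# One representative point per cluster: positive only when exactly one feature is high.
xor_points = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [-1, 1, 1, -1]

def n_correct(w1, w2, b):
    """Number of the four points that the line w1*x + w2*y + b = 0 classifies correctly."""
    correct = 0
    for (x, y), label in zip(xor_points, xor_labels):
        predicted = 1 if w1 * x + w2 * y + b > 0 else -1
        correct += predicted == label
    return correct

# Try every combination of weights and intercept from -3 to 3 in steps of 0.25.
grid = [i / 4 for i in range(-12, 13)]
best = max(n_correct(w1, w2, b) for w1, w2, b in product(grid, repeat=3))

print(f"best number of correct points out of 4: {best}")
```

No line in the grid classifies all four points correctly. This is not a limitation of the grid: the four inequalities that a perfect linear classifier would need to satisfy contradict one another, so at least one cluster is always placed on the wrong side.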

Finding a suitable model to accurately predict patterns like this involves applying one of the most powerful techniques available for use with an SVM: mapping the original input space of the data points into a higher dimensional space using the kernel trick.
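As a rough illustration of why a higher dimensional space helps, the Python sketch below adds a third feature to the four XOR points, namely the product of the two original features. This is an explicit mapping chosen by hand for illustration, not the kernel trick itself, which achieves a comparable effect without computing the new features directly. The points, the extra feature, the weights and the intercept are all hypothetical.

```python
# One representative point per cluster, with features coded as 0 (low) or 1 (high).
xor_points = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [-1, 1, 1, -1]

def lift(point):
    """Map (x, y) to (x, y, x * y): an assumed, hand-picked higher dimensional mapping."""
    x, y = point
    return (x, y, x * y)

def predict(features, w, b):
    """Linear decision rule in the lifted space."""
    score = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if score > 0 else -1

# Hypothetical plane in the three-dimensional space: x + y - 2xy - 0.5 = 0.
w = (1.0, 1.0, -2.0)
b = -0.5

predictions = [predict(lift(p), w, b) for p in xor_points]
print(predictions)
```

In the lifted space a single flat boundary separates the two classes, because the product feature is non-zero only for the point where both original features are high, and its negative weight pulls that point back to the negative side.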