2. Machine learning algorithms and descriptions
2.2. Support Vector Machines
2.2.2. Using “The Kernel Trick” to identify non-linear patterns
This technique involves looking at the relationships between the data points (more accurately, the support vectors), and these relationships need not be linear. The example given in figure 2.10 makes use of Euclidean distances between the points. In this section, only a conceptual overview is provided; an example of some of the mathematics involved is given in chapter 3.
Figure 2.10.: Here, the RBF kernel has been used to find clusters of similarly placed data points, measured on their Euclidean distances from each other. This technique has allowed for perfect classification of the data.
This method, using the Radial-Basis Function (RBF) kernel, is clearly very effective at dealing with the XOR problem. When using this kernel, the data points are implicitly mapped into a space of higher dimension than the two provided by each data point (the x and y axis values). Mapping into a higher dimensional space allows data that were not linearly separable to become linearly separable, a concept described in Cover’s Theorem. The full details of the proof of this theorem are not given here, but can be found in the original publication (Cover, 1965).
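The effect described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the code used to produce figure 2.10: the data are synthetic XOR-style points generated here, the value of C is the library default, and only γ = 3 is taken from the text.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic XOR-style data (an assumption, not the thesis dataset):
# points in diagonally opposite quadrants share a class.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
# Drop points very close to the axes so the classes do not touch.
X = X[np.min(np.abs(X), axis=1) > 0.1]
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Same data, two kernels: a straight-line boundary versus the RBF kernel.
linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=3).fit(X, y)

linear_acc = linear_svm.score(X, y)
rbf_acc = rbf_svm.score(X, y)
print(f"linear kernel training accuracy: {linear_acc:.2f}")
print(f"RBF kernel training accuracy:    {rbf_acc:.2f}")
```

The linear kernel cannot separate the diagonal quadrants, whereas the RBF kernel should classify the training points far more accurately.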
Instead, a graphical representation is provided that helps to clarify this concept. The example in figure 2.11 shows a warping of the canvas used to plot the points in the XOR problem into three dimensional space. If the blue and orange points are now plotted onto the warped canvas, the orange points are located towards the top of the higher peaks; this means that a linear separating plane can be drawn that allows for perfect classification. This idea works in much higher dimensions, but these are not possible to visualise, so this simple example is given instead.
Figure 2.11.: An example showing how the mapping of the XOR problem into 3-dimensional space allows for a linear plane to be drawn that successfully separates the classes.
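The idea of lifting the XOR points into three dimensions can be sketched with an explicit, hand-picked mapping. The third coordinate z = x · y used below is an illustrative assumption chosen for simplicity; it is not the implicit mapping performed by the RBF kernel, nor the surface drawn in figure 2.11.

```python
import numpy as np
from sklearn.svm import SVC

# The four canonical XOR points, centred on the origin.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # class 1 when the coordinates have opposite signs

# Hand-picked third coordinate (illustrative assumption): z = x * y.
z = (X[:, 0] * X[:, 1]).reshape(-1, 1)
X_3d = np.hstack([X, z])

# A linear SVM in the original 2-D space versus the lifted 3-D space.
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)
acc_3d = SVC(kernel="linear").fit(X_3d, y).score(X_3d, y)
print(f"linear SVM accuracy in 2-D: {acc_2d:.2f}")
print(f"linear SVM accuracy in 3-D: {acc_3d:.2f}")
```

In the lifted space the two classes sit on opposite sides of the plane z = 0, so a linear separating plane exists even though no separating line exists in the original two dimensions.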
This example can also be used to explain an important, if unfortunate, effect of using kernel methods in an SVM. As was mentioned earlier, the values in the coefficient (or weight) vector provide useful information for interpretation: the value assigned to each feature can be used to assess its importance, as this value determines how each data point is projected and whether this results in it crossing the decision plane. When using a kernel, however, it is not the position in the original input space that determines the classification, but the position in the higher dimensional space. In the example provided above, it is the height of the peak where the data point lands that determines which class it belongs to. This is a classic example of how increasing the complexity of a model decreases its interpretability. In other words, some models focus on delivering predictive power while others focus on explaining what could be driving the performance and results, and care must be taken when deciding which to use (Shmueli, 2010). This is one of the reasons why the SVM was chosen for use in this thesis, as it can provide an element of both: if interpretability is required, then the focus can be placed on the linear model, but if an increase in predictive power is required, especially when looking for the non-linear effects of interactions, then the kernel methods can be employed.
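The loss of interpretability can be seen directly in scikit-learn. The sketch below uses toy data generated here (an assumption) in which only the first of three features determines the class. A linear SVM exposes one weight per feature through its coef_ attribute, while a model fitted with the RBF kernel does not expose this attribute.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (an assumption): three features on the same scale,
# of which only the first determines the class.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

# One weight per input feature; the largest magnitude marks the feature
# with the most influence on which side of the plane a point falls.
weights = linear_svm.coef_.ravel()
most_important = int(np.argmax(np.abs(weights)))
print("linear weights:", np.round(weights, 2))
print("most influential feature index:", most_important)

# With a non-linear kernel there is no weight vector in the input space.
rbf_has_weights = hasattr(rbf_svm, "coef_")
print("RBF model exposes coef_:", rbf_has_weights)
```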
The use of kernels brings the additional complication of extra hyperparameters that also require tuning to find their optimum values. For the RBF kernel, there is one additional parameter: γ (the Greek letter pronounced “gamma”), which sets the width of the Gaussian kernel, with larger values giving a narrower kernel. The value used in the example in figure 2.10 was 3, but an additional example is shown here using a different value to highlight the effect of changing this. In figure 2.12, γ is set to 50, and it can be clearly seen that this tightens the kernel around the orange points, making the region far more specific to the areas close to these data points. This is another example of overfitting, as it is unlikely that new data will be classified correctly in this example.
Figure 2.12.: An example of the RBF kernel being used on the XOR-related problem, but this time using a γ value of 50. The effect is that the kernel width has been tightened to become more specific to the data points, which could very well result in overfitting to the training set and misclassification of unseen data.
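The tightening effect of a larger γ can be illustrated numerically. The sketch below assumes the usual form of the RBF kernel, K(a, b) = exp(−γ‖a − b‖²), and evaluates it for the two γ values used in the figures; the helper function and the chosen distances are hypothetical and serve only as an illustration.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """Return exp(-gamma * ||a - b||^2) for two points a and b."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

centre = [0.0, 0.0]
for distance in (0.1, 0.25, 0.5):
    neighbour = [distance, 0.0]
    k3 = rbf_kernel_value(centre, neighbour, gamma=3)
    k50 = rbf_kernel_value(centre, neighbour, gamma=50)
    # The similarity falls off far more quickly with distance for gamma = 50.
    print(f"distance {distance}: gamma=3 -> {k3:.3f}, gamma=50 -> {k50:.6f}")
```

With γ = 50, a point only half a unit away contributes almost nothing to the similarity, which is why the decision region collapses tightly around the individual training points.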
All of the machine learning modelling performed in this thesis was carried out using the scikit-learn machine learning library (Pedregosa et al., 2011), written for use with the Python programming language. When performing kernel-based modelling using SVMs, the default value for γ is set to the reciprocal of the number of input features, 1/N (the gamma='auto' setting), which attempts to avoid the problem of overfitting. The examples of the RBF kernel shown in this chapter used unnaturally high values of the γ parameter for the benefit of the visualisations, but as can be read about in chapter 3, when applied to the real-life datasets, it is these lower values of γ that perform much better.
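The library's automatic γ settings can be checked with a short sketch. It relies on scikit-learn's documented behaviour for SVC, where gamma='auto' corresponds to 1 / n_features and gamma='scale' (the default in more recent releases) corresponds to 1 / (n_features × X.var()). The toy data are an assumption, and the comparison is made through the public decision_function rather than any internal attribute.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (an assumption): 100 data points with 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
n_features = X.shape[1]

# gamma='auto' versus the explicit value 1 / n_features.
auto_svm = SVC(kernel="rbf", gamma="auto").fit(X, y)
manual_auto = SVC(kernel="rbf", gamma=1.0 / n_features).fit(X, y)

# gamma='scale' versus the explicit value 1 / (n_features * X.var()).
scale_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
manual_scale = SVC(kernel="rbf", gamma=1.0 / (n_features * X.var())).fit(X, y)

same_auto = np.allclose(auto_svm.decision_function(X),
                        manual_auto.decision_function(X))
same_scale = np.allclose(scale_svm.decision_function(X),
                         manual_scale.decision_function(X))
print("gamma='auto' matches 1 / n_features:", same_auto)
print("gamma='scale' matches 1 / (n_features * X.var()):", same_scale)
```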
A comment on the scaling of the features
Another important point to note about the SVM algorithm is that it is sensitive to the scale, or range, of the values for each of the input features. This is because the algorithm looks for patterns in how the data points are positioned in a common dimensional space shared by all of the features. To clarify this point, an extreme example is given here: if the feature represented on the x axis were in a very small range of 0.001 to 0.01, but the feature on the y axis were in an extremely large range from 10,000 to 100,000,000, then any plotting of these points onto a common set of axes would show almost no variability along the x direction compared to the y direction, and it would be almost impossible for the SVM to find any optimum boundary in this situation.
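The extreme example can be made concrete by measuring how much each feature contributes to the squared Euclidean distance between two points. The two points below are hypothetical values picked from the ranges quoted above.

```python
import numpy as np

# Two hypothetical data points: column 0 lies in [0.001, 0.01],
# column 1 lies in [10,000, 100,000,000].
a = np.array([0.001, 10_000.0])
b = np.array([0.010, 20_000.0])

# Per-feature contribution to the squared Euclidean distance.
squared_diffs = (a - b) ** 2
share_x = squared_diffs[0] / squared_diffs.sum()
print(f"share of squared distance due to the x feature: {share_x:.2e}")
```

Even though the two points sit at opposite ends of the x range, the x feature accounts for a negligible share of the distance between them, so any distance-based kernel effectively ignores it.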
It is therefore necessary to carry out some pre-processing of the data before the model is built and fitted. One common step to remove the problem of different feature ranges is called feature scaling, and is often performed by transforming each feature so that the distribution of its values in the dataset has a mean of 0 and a variance of 1.
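A minimal sketch of this scaling step is given below, assuming scikit-learn's StandardScaler and a pipeline that applies the scaling before the SVM is fitted. The synthetic data reuse the feature ranges from the extreme example, and the class rule is an assumption made for illustration; this is not necessarily the exact pre-processing used elsewhere in this thesis.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (an assumption) with very different feature ranges.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0.001, 0.01, size=200),  # small-range feature
    rng.uniform(1e4, 1e8, size=200),     # large-range feature
])
# The class depends only on the small-range feature.
y = (X[:, 0] > 0.0055).astype(int)

# Standardise each feature to a mean of 0 and a variance of 1.
X_scaled = StandardScaler().fit_transform(X)
print("means:    ", np.round(X_scaled.mean(axis=0), 6))
print("variances:", np.round(X_scaled.var(axis=0), 6))

# Placing the scaler in a pipeline ensures the same transformation,
# learned on the training data, is applied to any data passed to the SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)
scaled_acc = model.score(X, y)
print(f"training accuracy with scaling: {scaled_acc:.2f}")
```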