2.5 Machine learning models
2.5.5 K-Nearest Neighbors (KNN)
The k-nearest neighbors (KNN) algorithm is also a non-parametric method, which can be used for classification and regression problems. Like decision tree or random forest, the
KNN algorithm does not make any assumptions about the underlying data. This is a useful feature because the real-world data does not follow any theoretical assumptions, for ex- ample, the data cannot be usually linearly separated nor it is uniformly distributed. The algorithm needs only the following information: the number of neighbors, the distance metric to use, and the training data set. (Batrinca, Hesse, Treleaven 2017; Webb et al. 2011, 152).
The KNN algorithm divides new observations into different groups based on their dis- tance from the other observations. The distance can be calculated by using different func- tions, but the most common function used is the Euclidean distance function. The Euclid- ean distance can be defined by
π· = ββ
π(π₯
πβ π¦
π)
2π=1 (13)
The Euclidean distance is a straight line between the two different observations in the feature space. If the observations have similar values, the Euclidean distance between the observations is a small value and vice versa (Shalev-Shwartz et al. 2014, 258; Batrinca et al. 2017). The KNN model attempts to estimate the conditional distribution of the de- pendent variable Y, and then classifies each observation to specific class with highest estimated probability. The model starts the classification by identifying K points (K is a positive integer, indicating the number of neighbors) in the training data set which are the closest to a test observation x0, represented by π©0. Then, the conditional probability for
class j is estimated which is basically a fraction of the points in π©0 where their values of
the dependent variable equals j. Finally, the test observation x0 is assigned to the class where it has the largest probability. The KNN method can be defined mathematically in following way (James et al. 2013, 39).
Pr(π = π|π = π₯
0) =
1πΎ
β
πβπ©0πΌ(π¦
π= π)
(14)The KNN algorithm is often referred as βlazy learningβ because it postpones its cal- culations until the part of classification. The algorithm does not use the training data ob- servations to make any generalizations. Overall, the method is quite simple and one of the simplest machine learning algorithms, but despite of that, it can produce classifiers that have quite high accuracy. The accuracy of the KNN can also vary significantly if there are noisy or irrelevant features included in the model. Also, the chosen number of neighbors, K, affect highly on the classification performance of the model because it controls the bias-variance trade-off (Imandoust & Bolandraftar 2013; MartΓnez, FrΓas, PΓ©rez & Rivera 2017).
The most basic version of KNN model is 1-nearest neighbor model where K = 1. This model has often low bias, but high variance. Meaning that the model is able to cap- ture important relationships between features and the dependent variable, but when changes occur in the given data set the results vary a lot. The decision boundary of the model is non-linear and therefore overly flexible. In this case, the model is overfitting the data and producing results that will vary a lot between different data sets. That is why this model will not be reliable when predicting the future data points. As the value of K is increased, the models will become less flexible because its decision boundary will become closer to linear. This model will give results that are more stable as its vari- ance is lower, but its bias is higher. This means that there will be some important rela- tionships unrevealed (Batrinca et al. 2017; James et al 2013, 39β40; Imandoust et al. 2013).
The number of neighbors, K, is usually chosen empirically where different numbers of K values are tested, and the results are compared to each other. The K value, which gives the highest accuracy is chosen to define the classifier. There are also some ad- vanced methods for choosing the right K value. For instance, Hassanat, Abbadi and Al- tarweneh (2014) have proposed a method in which a weak KNN classifier is used each time with different K value. They started with the K value of 1 and continued to value of the squared root of the size of the training set. These results from the weak classifiers are then combined using the weighted sum rule. Their results indicated that this method is competitive with other traditional KNN classifiers and can even outperform them in some cases.
In this study, a number of different neighbors are tested to examine which one of them provided the best accuracy. Figure 7 presents two different KNN models with dif- ferent decision boundaries. The first model has K = 1 and the other one has K = 25. The decision boundary will resemble more linear as the number of neighbors in the model is increased.
Figure 7 KNN models with different values of K (James et al. 2013, 41)
KNN method has many advantages and therefore it is also a popular choice for imple- menting machine learning. The method is simple, effective, and robust which can produce good results even when noisy training data is used. It is more effective when the training data is large. However, the method has also some disadvantages such as sensitivity and poor run-time. The KNN is sensitive to irrelevant and redundant features because the features do not provide any information which could be useful for classification. It has also long run-time when the training data is large because each distance between training samples have to be calculated (Imandoust & Bolandraftar 2013).