K-Nearest Neighbour Algorithm (KNN)

Chapter 4 Model Descriptions

4.2 Construction Trees

4.2.5 K-Nearest Neighbour Algorithm (KNN)

The K-nearest neighbour (KNN) algorithm is a machine learning method that has been used in many domains, such as statistical pattern recognition and data mining [155]. It classifies an instance according to the closest training samples in the attribute space. To demonstrate a KNN analysis, Figure 4.3 shows the procedure of classifying a new value (the query point) among known samples: the instances are marked with green and yellow signs and the query point with a black circle [156]. The aim is to classify the query point according to a chosen number of its nearest neighbours, that is, to decide whether it belongs to the green or the yellow class. The main advantage of applying this technique in this research is its ability to classify a new object directly from the training samples. Moreover, KNN can be used when there is no prior knowledge about the distribution of the data [157].

KNN is easy to understand, yet it often performs well in both training and testing [158]. It is a non-parametric technique, used in pattern recognition and statistical estimation, that applies to both regression and classification. The purpose of using this classifier is to predict new instances from the split datasets. The algorithm has two main steps. Firstly, it finds the K training instances nearest to the unseen instance. Secondly, it classifies that instance by taking the majority vote of those neighbours. If K = 1, the instance is simply assigned to the class of its nearest neighbour [159].


Figure 4-3: K-nearest neighbour algorithm (KNN) example

The test sample in Figure 4.3 is the black circle, which is to be classified as either a green circle or a yellow triangle. If K = 5, it is assigned to the yellow class, because the circle drawn with a green line contains 3 yellow triangles and 2 green circles. If K = 7, it is assigned to the green class. Algorithm 4.3 illustrates the learning approach of the K-nearest neighbour algorithm [160].
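The effect of K on the vote can be illustrated with a minimal Python sketch. The function name `majority_vote` and the ordering of the neighbour labels are assumptions made for illustration; only the counts (3 yellow and 2 green for K = 5, green winning for K = 7) come from the description of Figure 4.3.

```python
from collections import Counter

def majority_vote(labels_by_distance, k):
    """Return the most common label among the k nearest neighbours.

    labels_by_distance: class labels of the training samples, already
    sorted from nearest to farthest from the query point.
    """
    votes = Counter(labels_by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical ordering chosen to match the counts described for Figure 4.3:
# the 5 nearest neighbours hold 3 yellow and 2 green, the next 2 are green.
neighbours = ["yellow", "green", "yellow", "green", "yellow", "green", "green"]

print(majority_vote(neighbours, 5))  # yellow (3 yellow vs 2 green)
print(majority_vote(neighbours, 7))  # green (4 green vs 3 yellow)
print(majority_vote(neighbours, 1))  # yellow (class of the single nearest neighbour)
```

Using an odd K with two classes, as here, avoids tied votes.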

KNN works as follows. Firstly, the parameter K, the number of nearest neighbours (NN), is chosen. Then the distance between the query instance and each training instance is measured, and the K training instances with the smallest distances are selected as the nearest neighbours. Typically, a larger K value is considered more precise because it reduces the effect of noise in the datasets. The best K value in this case should be between 3 and 10, which gives better outcomes than K = 1.
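One common way to compare candidate K values is leave-one-out validation, sketched below in Python. This is an illustration only, not the procedure used in this research: the function names and the one-dimensional toy data, which contain two deliberately mislabelled points, are assumptions.

```python
from collections import Counter

def knn_label(train, query, k):
    """Majority label among the k training points nearest to query (1-D data)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: classify each point using all the others."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_label(rest, x, k) == label
    return hits / len(data)

# Hypothetical data: class 0 on the left, class 1 on the right,
# with one noisy label inside each group (2.5 and 8.5).
data = [(1.0, 0), (1.9, 0), (2.5, 1), (3.1, 0), (4.0, 0),
        (7.0, 1), (7.9, 1), (8.5, 0), (9.1, 1), (10.0, 1)]

for k in (1, 3):
    print(k, loo_accuracy(data, k))
```

On this toy data K = 1 scores 0.4 and K = 3 scores 0.8, because a single noisy neighbour decides the vote when K = 1 but is outvoted when K = 3.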


Algorithm 4.3: K-nearest neighbour algorithm (learning approach)

Input:
    $S = \{(x_i, t_i) \mid x_i \in \mathbb{R}^m,\ t_i \in \mathbb{N},\ i \in \{1, 2, \dots, n\}\}$: the set of n training instances and their class labels;
    $Z = \{z_i \mid z_i \in \mathbb{R}^m,\ i \in \{1, 2, \dots, l\}\}$: the set of l test instances;
    $K$: the number of nearest neighbours;
    $\Delta$: a distance measure;
    $\mathcal{C}$: a classification approach.

Initialization:
    $Y \leftarrow \emptyset$

Computation:
    For $z_i \in Z$ do
        (a) $N \leftarrow$ the $K$ nearest neighbours of $z_i$ from $S$ according to $\Delta$;
        (b) $f \leftarrow$ the discriminant procedure of $\mathcal{C}$ trained on $N$;
        (c) $y \leftarrow$ the class label predicted by applying $f$ to $z_i$;
        (d) $Y \leftarrow Y \cup \{y\}$.

Output:
    $Y = \{y_i \in \mathbb{N},\ i \in \{1, 2, \dots, l\}\}$: the set of predicted class labels for the test samples in $Z$.
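Algorithm 4.3 can be sketched in Python as follows. Two choices here are assumptions and are not fixed by the algorithm: the distance measure ∆ is taken to be the Euclidean distance, and the classification approach 𝒞 is taken to be a plain majority vote over the neighbours' labels. The function names and toy data are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Distance measure (Delta in Algorithm 4.3); Euclidean is assumed here."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(S, Z, K, distance=euclidean):
    """Predict a label for every test instance in Z.

    S: list of (x_i, t_i) training pairs; Z: list of test vectors.
    The classification approach C is assumed to be a majority vote.
    """
    Y = []  # initialisation: no predictions yet
    for z in Z:
        # (a) the K training pairs nearest to z according to the distance
        N = sorted(S, key=lambda pair: distance(pair[0], z))[:K]
        # (b)-(c) majority vote over the neighbours' labels
        y = Counter(t for _, t in N).most_common(1)[0][0]
        # (d) store the prediction
        Y.append(y)
    return Y

# Toy data (hypothetical): two well-separated classes, 0 and 1.
S = [((1.0, 1.0), 0), ((1.5, 2.0), 0), ((2.0, 1.5), 0),
     ((6.0, 6.0), 1), ((6.5, 7.0), 1), ((7.0, 6.5), 1)]
Z = [(1.2, 1.4), (6.4, 6.6)]

print(knn_classify(S, Z, K=3))  # [0, 1]
```

The predictions are kept in a list rather than a set so that each test instance retains its own label, even when several instances receive the same class.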

Two distance metrics are commonly used in KNN: the Euclidean and the Minkowski distances. The accuracy of KNN can be improved further by learning the metric with specialised methods, for instance neighbourhood components analysis or large margin nearest neighbour [161]. One of the main disadvantages of KNN is the computational cost of searching for the nearest neighbours of each sample.

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + \dots + (x_n - y_n)^2} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (4.8)$$

Here, d is the Euclidean distance, and $x_i$ and $y_i$ are the elements of x and y, as shown in Equation (4.8). For categorical variables, the Hamming distance is typically used. When the datasets contain a mixture of categorical and numerical variables, this raises the issue of standardising the numerical variables to the range (0, 1) [162]. For a categorical variable, the distance is zero when x and y are the same and one when they differ. For example, for (x, y) = (male, male) the distance is zero, and for (x, y) = (male, female) the distance is one. Equation (4.9) gives the Hamming distance [163].

$$D_H = \sum_{i=1}^{K} |x_i - y_i| \qquad (4.9)$$

$$x = y \Rightarrow D = 0, \qquad x \neq y \Rightarrow D = 1$$
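The distances in Equations (4.8) and (4.9) can be written as short Python functions. This is a minimal sketch: the function names are assumptions, and the Minkowski distance is included in its usual form with an order parameter p, where p = 2 reduces to the Euclidean distance.

```python
import math

def euclidean(x, y):
    """Equation (4.8): square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def minkowski(x, y, p=2):
    """Minkowski distance of order p; p = 2 gives the Euclidean distance."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def hamming(x, y):
    """Equation (4.9) for categorical vectors: each mismatching attribute adds 1."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

print(euclidean((0, 0), (3, 4)))        # 5.0
print(minkowski((0, 0), (3, 4), p=1))   # 7.0
print(hamming(("male",), ("male",)))    # 0
print(hamming(("male",), ("female",)))  # 1
```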

KNN has been applied to the diagnosis of Sickle Cell Retinopathy (SCR). Minhaj et al. [164] proposed an automatic method for classifying SCR from attributes extracted from optical coherence tomography angiography (OCTA) images. They used 35 images from sickle cell patients (23 females and 12 males) and 14 from control subjects (3 females and 11 males). The average age was 40 years, ranging from the 20s to the 60s for the patients and from the 20s to the 70s for the control subjects. The OCTA images were analysed per eye, so the datasets involved 35 SCD eyes and 14 control eyes. The feature vectors comprised vascular tortuosity, blood vessel density, foveal avascular zone (FAZ) area, vessel perimeter index, diameter, contour irregularity of the FAZ, and parafoveal avascular density. Three algorithms (support vector machine, discriminant analysis, and KNN) were used to classify the datasets. For the control subjects, 50% of the images were used for training and 50% for testing. For the mild versus severe classification among SCR patients, 95% of the data were used to train the classifier and 5% to test it. The algorithms were compared using standard performance evaluation measures. Among the three classifiers, KNN provided acceptable results in terms of performance and accuracy.

Sharma et al. [165] proposed a technique that extracts several features (radial signature, aspect ratio, metric value, and its variance) and then trains a KNN model on them to classify the selected images. The classifier comprises four classes: the first is sickle cells, the second is dacrocytes (teardrop cells), the third is ovalocytes, and the fourth is normal erythrocytes. KNN was trained on one hundred patient images to predict three different kinds of abnormal cells: sickle cells, dacrocytes, and elliptocytes related to thalassemia. The method achieved an acceptable outcome, with an accuracy of 80% and a sensitivity of 87%.

KNN does not use the training set to build a generalised model. Because of this lack of generalisation, the technique has to keep all the training data. This means that no explicit training phase is needed. Moreover, the vast majority of the training samples are required during the testing phase. The approach is therefore considered a lazy algorithm, which makes each decision from the entire training dataset. Finally, KNN can perform poorly in classification when the Euclidean distance is used, because the features do not contribute equally to the distance.
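The unequal contribution of features under the Euclidean distance can be seen in a small Python sketch. The data (age in years, income in currency units) and the function names are hypothetical; min-max scaling to [0, 1] is used here as one common remedy and is an assumption, not a step described above.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def fit_min_max(rows):
    """Column-wise minimum and maximum of the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def scale(row, lows, highs):
    """Map each value to [0, 1] using the training minimum and maximum."""
    return tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                 for v, lo, hi in zip(row, lows, highs))

def nearest_index(points, query):
    """Index of the point closest to query under the Euclidean distance."""
    return min(range(len(points)), key=lambda i: euclidean(points[i], query))

# Hypothetical data: (age in years, income in currency units).
train = [(25, 50000), (60, 50400)]
query = (26, 50300)

# Raw features: income differences dwarf age differences.
raw_nearest = nearest_index(train, query)

# Scaled features: both columns lie in [0, 1] and weigh equally.
lows, highs = fit_min_max(train)
scaled_train = [scale(row, lows, highs) for row in train]
scaled_nearest = nearest_index(scaled_train, scale(query, lows, highs))

print(raw_nearest, scaled_nearest)  # 1 0
```

Before scaling, the query is matched to the second training point because of a small income gap, even though its age is almost identical to the first point. After scaling, the first point is the nearest neighbour.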

In document Machine Learning Approaches and Web-Based System to the Application of Disease Modifying Therapy for Sickle Cell (Page 68-72)