A Review of Data Classification Using K-Nearest Neighbour Algorithm


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 6, June 2013)


Aman Kataria¹, M. D. Singh²

¹P.G. Scholar, Thapar University, Patiala, India

²Assistant Professor, EIE Department, Thapar University, Patiala, India

Abstract—Whether the task arises in neural networks or in a biometric application such as handwriting classification or iris detection, arguably the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbor classifier, in which classification is achieved by identifying the nearest neighbors to a query example and using those neighbors to determine the class of the query. K-NN classification classifies instances based on their similarity to instances in the training data. This paper presents the outputs obtained with various distance measures in the algorithm, which may help to indicate how the classifier responds in a desired application. It also discusses computational issues in identifying nearest neighbors and mechanisms for reducing the dimension of the data.

Keywords— K-NN, Biometrics, Classifier, Distance

I. INTRODUCTION

The idea behind Nearest Neighbor classification is quite simple: examples are classified based on the class of their nearest neighbors. For example, if it walks like a duck, quacks like a duck, and looks like a duck, then it is probably a duck. The k-nearest neighbor classifier is a conventional nonparametric classifier that provides good performance for well-chosen values of k. In the k-nearest neighbor rule, a test sample is assigned the class most frequently represented among the k nearest training samples. If two or more such classes exist, the test sample is assigned the class with the minimum average distance to it. It can be shown that the k-nearest neighbor rule becomes the Bayes optimal decision rule as k goes to infinity [1]. However, this nearly optimal behavior is assured only in the limit as the number of training samples also goes to infinity.
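The voting rule described above, including the tie-break by minimum average distance, can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `knn_predict`, the use of the Euclidean distance, and the toy data are assumptions made for the example, not part of the original study.

```python
import math
from collections import defaultdict


def knn_predict(query, training, labels, k=3):
    """Assign `query` the class most represented among its k nearest
    training samples; ties go to the class with the smallest mean distance."""
    # Sort (distance, label) pairs so the k closest come first.
    nearest = sorted(
        (math.dist(query, x), y) for x, y in zip(training, labels)
    )[:k]

    # Collect the neighbour distances per class.
    per_class = defaultdict(list)
    for d, y in nearest:
        per_class[y].append(d)

    # Classes with the highest vote count (more than one means a tie).
    top = max(len(ds) for ds in per_class.values())
    tied = [y for y, ds in per_class.items() if len(ds) == top]

    # Tie-break: minimum average distance to the query.
    return min(tied, key=lambda y: sum(per_class[y]) / len(per_class[y]))


# Hypothetical toy data: two classes on a plane.
training = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = ["a", "a", "b", "b"]
print(knn_predict((0.2, 0.2), training, labels, k=3))  # a
```

With k = 3 the query has two neighbours of class "a" and one of class "b", so the majority decides; the average-distance rule only comes into play when the vote counts are equal.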

II. ALGORITHM OF K-NN CLASSIFIER

A. Basic

In 1968, Cover and Hart proposed the K-Nearest Neighbor algorithm, which was refined over time. The nearest neighbors are usually found using the Euclidean distance; other measures are also available, but the Euclidean distance offers a good combination of simplicity, efficiency and effectiveness [2].

An example is classified by the majority label among its K nearest neighbors [3]. The method is very easy to apply: if the majority of the k examples nearest to an example “x” in the feature space share the same label “y”, then “x” is assigned to “y”. In theory, the K-NN method rests mainly on a limit theorem. In the decision process, however, only a small number of nearest neighbors is considered, so the method can alleviate the problem of imbalanced example sets. Because K-NN relies on a limited number of nearest neighbors rather than on an explicit decision boundary, it is particularly suitable for example sets whose class boundaries intersect and whose examples overlap.

The Euclidean distance can be calculated as follows [4]. Given two vectors xi and xj, where xi = (xi1, xi2, xi3, ..., xin) and xj = (xj1, xj2, xj3, ..., xjn), the distance [5] between xi and xj is

$$D(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2} \qquad (1)$$


B. Use in Data mining

Data mining is the extraction of hidden information from large databases. Classification is a data mining task that predicts the value of a categorical variable by building a model based on one or more numerical and/or categorical variables. The classification mining function is used to gain a deeper understanding of the database structure. Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms and machine learning methods. Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction [8]. Data mining applications can use a variety of parameters to examine the data, including association, sequence or path analysis, classification, clustering, and forecasting. Classification techniques are capable of processing a wide variety of data and are growing in popularity. They include decision tree induction, Bayesian networks, rule-based classifiers, lazy classifiers [9], fuzzy set approaches and rough set approaches.

III. MATERIAL AND METHODOLOGY

A. Material

Outputs obtained with different methodologies have been compared.

a)Sample

Matrix whose rows will be classified into groups. Sample must have the same number of columns as Training.

b)Training

Matrix used to group the rows in the matrix Sample. Training must have the same number of columns as Sample. Each row of Training belongs to the group whose value is the corresponding entry of Group.

c)Group

Vector whose distinct values define the grouping of the rows in Training.

d)K

The number of nearest neighbors used in the classification. Default is 1.

e)Distance

1) Euclidean

2) Cityblock (taxicab metric)

3) Cosine

4) Correlation

f) Rule

1) Nearest

2) Random

3) Consensus

I) Distance

a)Euclidean distance-

The Euclidean distance between points p and q is the length of the line segment connecting them ($\overline{pq}$). In Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance from p to q, or from q to p, is given by [10]:

$$d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \qquad (2)$$

The position of a point in a Euclidean n-space is a Euclidean vector. So, p and q are Euclidean vectors[11], starting from the origin of the space, and their tips indicate two points. The Euclidean norm, or Euclidean length, or magnitude of a vector measures the length of the vector:

$$\|p\| = \sqrt{p_1^2 + p_2^2 + p_3^2 + \cdots + p_n^2} = \sqrt{p \cdot p} \qquad (3)$$

Where the last equation involves the dot product. A vector can be described as a directed line segment from the origin of the Euclidean space (vector tail), to a point in that space (vector tip).[12] If we consider that its length is actually the distance from its tail to its tip, it becomes clear that the Euclidean norm of a vector is just a special case of Euclidean distance: the Euclidean distance between its tail and its tip. The distance between points p and q may have a direction, so it may be represented by another vector, given by[13]

$$q - p = (q_1 - p_1, q_2 - p_2, \ldots, q_n - p_n) \qquad (4)$$

In a three-dimensional space (n=3), this is an arrow from p to q, which can be also regarded as the position of q relative to p. It may be also called a displacement vector if p and q represent two positions of the same point at two successive instants of time. The Euclidean distance between p and q is just the Euclidean length of this distance (or displacement) vector: [14]

$$\|q - p\| = \sqrt{(q - p) \cdot (q - p)} \qquad (5)$$

which is equivalent to equation 1, and also to:

$$\|q - p\| = \sqrt{\|p\|^2 + \|q\|^2 - 2\, p \cdot q} \qquad (6)$$
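As a quick numerical check of the Euclidean distance discussed in this subsection, the following Python sketch computes the length of the displacement vector q - p and compares it with the algebraically equivalent expanded form sqrt(|p|^2 + |q|^2 - 2 p·q). The helper names (`dot`, `norm`, `euclidean`) and the sample points are assumptions chosen for illustration only.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Euclidean length of a vector: sqrt(v . v)."""
    return math.sqrt(dot(v, v))


def euclidean(p, q):
    """Euclidean distance as the length of the displacement vector q - p."""
    displacement = [b - a for a, b in zip(p, q)]
    return norm(displacement)


# Hypothetical points in three-dimensional space.
p = (1.0, 2.0, 3.0)
q = (4.0, 6.0, 3.0)

d_direct = euclidean(p, q)
d_expanded = math.sqrt(norm(p) ** 2 + norm(q) ** 2 - 2 * dot(p, q))

print(d_direct)                            # 5.0
print(math.isclose(d_direct, d_expanded))  # True
```

The displacement here is (3, 4, 0), so the distance is 5; the expanded form agrees up to floating-point rounding, which is why the comparison uses `math.isclose` rather than `==`.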

b)Cityblock (Taxicab metric)

The taxicab distance, d1, between two vectors p, q in an n-dimensional real vector space with a fixed Cartesian coordinate system, is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes. More formally,

$$d_1(p, q) = \|p - q\|_1 = \sum_{i=1}^{n} |p_i - q_i| \qquad (7)$$

where p = (p1, p2, p3, ..., pn) and q = (q1, q2, q3, ..., qn) are vectors [15]. For example, in the plane, the taxicab distance between (p1, p2) and (q1, q2) is |p1 - q1| + |p2 - q2| [16].
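The taxicab definition translates almost directly into code. The sketch below is a minimal illustration; the function name `taxicab` and the example points are assumptions, not taken from the paper.

```python
def taxicab(p, q):
    """Cityblock (taxicab) distance: sum of absolute coordinate differences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(p, q))


# In the plane: |1 - 4| + |2 - 6| = 3 + 4
print(taxicab((1.0, 2.0), (4.0, 6.0)))  # 7.0
```

For the same pair of points the Euclidean distance is 5, which illustrates that the taxicab distance is never smaller than the Euclidean one.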

c)Cosine distance

The cosine of two vectors can be derived by using the Euclidean dot product formula:

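$$a \cdot b = \|a\| \, \|b\| \cos\theta \qquad (8)$$

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude [17] as

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (9)$$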

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity. For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.
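A short Python sketch of the cosine similarity follows. The function names are illustrative assumptions, and turning the similarity into a distance as 1 - cos(θ) is a common convention assumed here rather than something stated in the text. Note that the formula is undefined when either vector has zero length, so the sketch raises an error in that case.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero-length vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))   # 0.0 (perpendicular)
print(cosine_similarity((1.0, 0.0), (-1.0, 0.0)))  # -1.0 (opposite)
```

Because only the angle matters, vectors such as (1, 1) and (2, 2) have a similarity of 1 (up to rounding) even though their lengths differ.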

d)Correlation

The distance correlation of two random variables is obtained by dividing their distance covariance[18] by the product of their distance standard deviations. The distance correlation is

$$\operatorname{dCor}(X, Y) = \frac{\operatorname{dCov}(X, Y)}{\sqrt{\operatorname{dVar}(X) \, \operatorname{dVar}(Y)}} \qquad (10)$$
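The distance correlation above is defined for random variables; for data, a sample version is commonly computed from double-centered pairwise distance matrices. The sketch below assumes that sample estimator and one-dimensional observations; the function names are hypothetical and the code is meant only to illustrate the ratio, not to reproduce the computations behind the reported results.

```python
import math


def _double_centered(values):
    """Pairwise |a - b| matrix with row, column and grand means removed."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(r) / n for r in d]
    grand_mean = sum(row_mean) / n
    # d is symmetric, so column means equal row means.
    return [
        [d[j][k] - row_mean[j] - row_mean[k] + grand_mean for k in range(n)]
        for j in range(n)
    ]


def _mean_product(a, b):
    """Average of the element-wise product of two n x n matrices."""
    n = len(a)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample dCor = dCov / sqrt(dVar(x) * dVar(y))."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of observations")
    a, b = _double_centered(x), _double_centered(y)
    dcov_sq = _mean_product(a, b)   # squared distance covariance
    dvar_x_sq = _mean_product(a, a)  # squared distance variance of x
    dvar_y_sq = _mean_product(b, b)  # squared distance variance of y
    denom = math.sqrt(dvar_x_sq * dvar_y_sq)
    if denom == 0.0:
        return 0.0  # one sample is constant
    return math.sqrt(max(dcov_sq, 0.0) / denom)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0 * v + 1.0 for v in x]  # exact linear relation
print(math.isclose(distance_correlation(x, y), 1.0))  # True
```

An exact linear relation gives a distance correlation of 1, while a constant sample gives 0 by the guard on the denominator.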

II) Rule

a)Nearest

Majority rule with nearest point tie-break (by default)

b)Random

Majority rule with random point tie-break

c)Consensus
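The parameters and rules listed in this section can be combined into a small classifier sketch. Everything named here (`knn_classify`, its argument names, the toy data) is an assumption for illustration. In particular, the text does not define the Consensus rule, so the sketch assumes it means that all k neighbors must agree, returning `None` otherwise. The "nearest" tie-break is interpreted as taking the class of the closest neighbor among the tied classes, and only the Euclidean distance is used.

```python
import math
import random
from collections import Counter


def knn_classify(sample, training, group, k=1, rule="nearest", rng=None):
    """Classify each row of `sample` from the rows of `training`,
    whose class labels are given by `group`."""
    if rule not in ("nearest", "random", "consensus"):
        raise ValueError(f"unknown rule: {rule!r}")
    rng = rng or random.Random(0)  # seeded so the random rule is repeatable

    results = []
    for row in sample:
        # Indices of training rows, nearest first (Euclidean distance).
        order = sorted(range(len(training)),
                       key=lambda i: math.dist(row, training[i]))
        neighbours = [group[i] for i in order[:k]]
        counts = Counter(neighbours)

        if rule == "consensus":
            # Assumed meaning: every neighbour must carry the same label.
            results.append(neighbours[0] if len(counts) == 1 else None)
            continue

        top = max(counts.values())
        tied = [g for g in counts if counts[g] == top]
        if len(tied) == 1:
            results.append(tied[0])
        elif rule == "nearest":
            # neighbours is ordered by distance: first tied label is nearest.
            results.append(next(g for g in neighbours if g in tied))
        else:  # rule == "random"
            results.append(rng.choice(tied))
    return results


# Hypothetical one-dimensional toy data.
training = [(0.0,), (1.0,), (3.0,), (10.0,)]
group = [1, 2, 2, 3]
print(knn_classify([(0.4,)], training, group, k=2, rule="nearest"))    # [1]
print(knn_classify([(0.4,)], training, group, k=2, rule="consensus"))  # [None]
print(knn_classify([(0.4,)], training, group, k=3, rule="nearest"))    # [2]
```

With k = 2 the two neighbors disagree, so the nearest rule returns the label of the closer point and the assumed consensus rule returns no label; with k = 3 a clear majority exists and the tie-break is not needed.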

B. Material

To obtain the output, training data and sample data are chosen; different rules and different distance metrics then give different classified outputs. The data chosen for the classification are:

Sample = [0.559 0.510; 0.101 0.282; 0.987 0.988]

Training = [0 0; 0.559 0.559; 1 1]

Group = [1; 2; 3]
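To make the data concrete, the following Python sketch assigns each Sample row to the group of its single nearest Training row (K = 1) under the Euclidean and Cityblock distances. It is an independent illustration rather than a reproduction of the procedure behind the figures, and the helper names are assumptions. The Cosine distance is left out because the first Training row (0, 0) has zero length, for which the cosine formula is undefined.

```python
import math

sample = [(0.559, 0.510), (0.101, 0.282), (0.987, 0.988)]
training = [(0.0, 0.0), (0.559, 0.559), (1.0, 1.0)]
group = [1, 2, 3]


def euclidean(p, q):
    return math.dist(p, q)


def cityblock(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest_group(row, distance):
    """Group of the training row closest to `row` under `distance`."""
    best = min(range(len(training)), key=lambda i: distance(row, training[i]))
    return group[best]


euclidean_labels = [nearest_group(row, euclidean) for row in sample]
cityblock_labels = [nearest_group(row, cityblock) for row in sample]

print(euclidean_labels)  # [2, 1, 3]
print(cityblock_labels)  # [2, 1, 3]
```

For this small data set both distances give the same assignment, since each Sample row lies much closer to one Training row than to the other two.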

C. Results

Case 1

Fig.1 Distance used Euclidean and Rule Nearest

In this case the Euclidean distance is used with the Nearest rule, and the figure is referenced in Table I. The classification result was medium.

Case 2


In this case the Euclidean distance is used with the Random rule, and the figure is referenced in Table I. The classification result was good.

Case 3

Fig.3 Distance used Euclidean and Rule Consensus

In this case the Euclidean distance is used with the Consensus rule, and the figure is referenced in Table I. The classification result was excellent.

Case 4

Fig.4. Distance used Cityblock and Rule Nearest

In this case the Cityblock distance is used with the Nearest rule, and the figure is referenced in Table I. The classification result was excellent.

Case 5

Fig.5 Distance used Cityblock and Rule Random

In this case the Cityblock distance is used with the Random rule, and the figure is referenced in Table I. The classification result was medium.

Case 6

Fig.6 Distance used Cityblock and Rule Consensus


Case 7

Fig.7 Distance used Cosine and Rule Nearest

In this case the Cosine distance is used with the Nearest rule, and the figure is referenced in Table I. The classification result was medium.

Case 8

Fig.8 Distance used Cosine and Rule Random

In this case the Cosine distance is used with the Random rule, and the figure is referenced in Table I. The classification result was poor.

Case 9

Fig.9 Distance used Cosine and Rule Consensus

In this case the Cosine distance is used with the Consensus rule, and the figure is referenced in Table I. The classification result was medium.

Case 10

Fig.10 Distance used Correlation and Rule Nearest


Case 11

Fig.11 Distance used Correlation and Rule Random

In this case the Correlation distance is used with the Random rule, and the figure is referenced in Table I. The classification result was poor.

Case 12

Fig.12 Distance used Correlation and Rule Consensus

In this case the Correlation distance is used with the Consensus rule, and the figure is referenced in Table I. The classification result was medium.

The Hamming distance has not been used in this paper because it requires binary data, which the sample does not contain.


D. Inference

TABLE I

RESULTS AND EFFICIENCY OF CLASSIFIERS

| Sr. no. | Case | Result | Efficiency | Percent of efficiency |
|---|---|---|---|---|
| 1 | 1 | Classified successfully | Medium | 99% |
| 2 | 2 | Classified successfully | Good | 99.8% |
| 3 | 3 | Classified successfully | Excellent | 100% |
| 4 | 4 | Classified successfully | Excellent | 100% |
| 5 | 5 | Classified successfully | Medium | 99% |
| 6 | 6 | Classified successfully | Good | 99.8% |
| 7 | 7 | Classified successfully | Medium | 99% |
| 8 | 8 | Classified successfully | Poor | 98.5% |
| 9 | 9 | Classified successfully | Medium | 99% |
| 10 | 10 | Classified successfully | Poor | 98.5% |
| 11 | 11 | Classified successfully | Poor | 98.5% |
| 12 | 12 | Classified successfully | Medium | 99% |

IV. CONCLUSION


REFERENCES

[1] R. O. Duda and P. E. Hart, “Pattern Classification and Scene Analysis”, New York: John Wiley & Sons, 1973.

[2] B. V. Dasarathy, “Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques”, IEEE Computer Society Press, 1990.

[3] D. Wettschereck and T. G. Dietterich, “An Experimental Comparison of the Nearest-Neighbor and Nearest-Hyperrectangle Algorithms”, Machine Learning, 9: 5-28, 1995.

[4] J. C. Platt, “Fast Training of Support Vector Machines Using Sequential Minimal Optimization”, in Advances in Kernel Methods: Support Vector Machines (edited by B. Scholkopf, C. Burges and A. Smola), Cambridge, MA: MIT Press, pp. 185-208, 1998.

[5] Y. Yang and X. Liu, “A Re-Examination of Text Categorization Methods”, Proc. SIGIR '99, pp. 42-49, 1999.

[6] T. Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features”, Proc. European Conf. Machine Learning, pp. 137-142, 1998.

[7] Man Lan, Chew Lim Tan, Jian Su, and Yue Lu, “Supervised and Traditional Term Weighting Methods for Automatic Text Categorization”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 4, April 2009.

[8] Thair Nu Phyu, “Survey of Classification Techniques in Data Mining”, Proceedings of the International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009, March 18-20, 2009.

[9] William Perrizo, Qin Ding and Anne Denton, “Lazy Classifiers Using P-trees”, Department of Computer Science, Penn State Harrisburg, Middletown, PA 17057.

[10] A. Y. Alfakih, “Graph rigidity via Euclidean distance matrices”, Linear Algebra Appl., 310, pp. 149-165, 2000.

[11] M. Bakonyi and C. Johnson, “The Euclidean distance matrix completion problem”, SIAM Journal on Matrix Analysis and Applications, 16, pp. 646-654, 1995.

[12] Elena Deza and Michel Marie Deza, “Encyclopedia of Distances”, p. 94, Springer, 2009.

[13] W. Glunt, T. L. Hayden, S. Hong, and J. Wells, “An alternating projection algorithm for computing the nearest Euclidean distance matrix”, SIAM Journal on Matrix Analysis and Applications, 11, pp. 589-600, 1990.

[14] R. W. Farebrother, “Three theorems with applications to Euclidean distance matrices”, Linear Alg. Appl., 95, pp. 11-16, 1987.

[15] Z. Akca and R. Kaya, “On the Taxicab Trigonometry”, Jour. of Inst. of Math & Comp. Sci. (Math. Ser.), 10, No. 3, pp. 151-159, 1997.

[16] K. Thompson and T. Dray, “Taxicab Angles and Trigonometry”, Pi Mu Epsilon J., 11, pp. 87-97, 2000.

[17] Bei-Ji Zou, “Shape-Based Trademark Retrieval Using Cosine Distance Method”, Eighth International Conference on Intelligent Systems Design and Applications (ISDA '08), 26-28 Nov. 2008.