An Efficient Branch and Bound Search Algorithm for Computing K Nearest Neighbors in a Multidimensional Vector Space
Wim D’haes (1, 2), Dirk Van Dyck (1), Xavier Rodet (2)
(1) Visionlab, University of Antwerp (UA), Groenenborgerlaan 171, 2020 Antwerp, Belgium
(2) IRCAM, Centre Georges Pompidou, 1 place Igor-Stravinsky, 75004 Paris, France
Abstract
An efficient branch and bound search algorithm is proposed for the computation of the $K$ nearest neighbors in a multidimensional vector space. In a preprocessing step, the sample of feature vectors is decomposed hierarchically using hyperplanes determined by principal component analysis (PCA). During the search for the nearest neighbors, the tree that represents this decomposition is traversed in depth-first order, avoiding nodes that cannot contain nearest neighbors. The behavior of the algorithm is studied on artificial data.
Key words: pattern recognition, $K$ nearest neighbors, non-parametric estimation, principal component analysis
1 Introduction and State of the Art
Searching the $K$ nearest neighbors in a multidimensional vector space is a very common procedure in the field of pattern recognition, where it is used for non-parametric density estimation and classification [2]. When the number of samples is large, however, the computational cost of the nearest neighbor search can prohibit its practical use. Various techniques that reduce the number of distance computations have been proposed. For data that can be represented in a vector space, branch and bound search algorithms have been proposed [3, 4, 8]. For nearest neighbors in a metric space, several approximating and eliminating search algorithms (AESA) are available [9, 6, 10, 7, 5].
In the present work, a new branch and bound algorithm is proposed that searches the nearest vectors in a vector space where the dissimilarity between two vectors is expressed by the Euclidean distance. The main contribution consists of a very efficient hierarchical decomposition that uses hyperplanes determined by principal component analysis. It can be shown that this decomposition maximizes the probability that nearest neighbors are grouped together.
2 Decomposition of the Sample
2.1 Definition of the decomposition
In order to obtain an efficient decomposition, each subset has to contain an equal number of vectors, and the number of vectors that have nearest neighbors in both subsets has to be minimized. This is realized in the following manner. First, a multivariate Gaussian is fit to the set. Then, the vectors are divided according to the hyperplane containing the mean vector and perpendicular to the eigenvector of the covariance matrix with the largest eigenvalue. The proof that this decomposition method is optimal is based on the fact that when the distribution is sampled with infinite accuracy (the continuous case), only vectors that lie exactly on the hyperplane have nearest vectors on both sides of the plane. Therefore, we determine the hyperplane for which the integral of the Gaussian function over that plane is minimal.
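The multivariate Gaussian, fit to a $D$-dimensional data set $S = \{\bar{x}_1, \ldots, \bar{x}_N\}$ with $\bar{x}_i \in \mathbb{R}^D$, is given by

$$\exp\left(-\frac{1}{2}(\bar{x}-\bar{\mu})^T \Sigma^{-1} (\bar{x}-\bar{\mu})\right) \tag{1}$$

where $\bar{\mu}$ is the mean vector and $\Sigma$ the covariance matrix over $S$. $T$ denotes the transpose operator. By calculating the eigenvectors $\bar{u}_i$ and eigenvalues $\lambda_i$ of $\Sigma$ (implying $\Sigma\bar{u}_i = \lambda_i\bar{u}_i$), each vector $\bar{x} \in S$ can be expressed as a linear combination of the eigenvectors using $\alpha_k = (\bar{x}-\bar{\mu})^T\bar{u}_k$, resulting in

$$\bar{x} = \bar{\mu} + \sum_{k=1}^{D} \alpha_k \bar{u}_k \tag{2}$$

Inserting (2) in (1), the Gaussian distribution can be simplified to

$$\exp\left\{-\frac{1}{2}\sum_{k=1}^{D} \alpha_k^2 \lambda_k^{-1}\right\} \tag{3}$$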
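Figure 1. Optimal decomposition after Gaussian fit. [Figure omitted: left panel, contour of a two-dimensional Gaussian with mean $\bar{\mu}$ and eigenvectors $\bar{u}_1$, $\bar{u}_2$; right panel, the splitting hyperplane $\bar{v}_p^T(\bar{x}-\bar{\mu}_p) = 0$.]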
The integration of the distribution over all axes determined by the eigenvectors yields

$$\prod_{k=1}^{D} \int \exp\left(-\frac{\alpha_k^2}{2\lambda_k}\right) d\alpha_k = \prod_{k=1}^{D} \sqrt{2\pi\lambda_k} \tag{4}$$

From this equation it follows that the hyperplane which is perpendicular to the eigenvector $\bar{u}_i$ with the largest eigenvalue $\lambda_i$ (the principal component) yields the optimal split, since the integration over all other components results in the smallest value for the integral. Figure 1 shows the contour of a two-dimensional Gaussian with its corresponding eigenvalues and eigenvectors. On the right, the decomposition parameters of the optimal split are depicted.
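The step from equation (4) to the choice of the principal component can be made explicit. The short derivation below is a sketch that assumes the integral over the hyperplane through the mean and perpendicular to $\bar{u}_i$ is obtained by fixing $\alpha_i = 0$ in (3) and integrating over the remaining $D-1$ coordinates; the symbol $I_i$ is introduced here only for illustration.

$$I_i = \prod_{k \neq i} \int \exp\left(-\frac{\alpha_k^2}{2\lambda_k}\right) d\alpha_k = \prod_{k \neq i} \sqrt{2\pi\lambda_k} = \frac{1}{\sqrt{2\pi\lambda_i}} \prod_{k=1}^{D} \sqrt{2\pi\lambda_k}$$

The product over all $k$ is the same for every choice of $i$, so $I_i$ is smallest when $\lambda_i$ is largest. For example, with $D = 2$, $\lambda_1 = 4$ and $\lambda_2 = 1$, one obtains $I_1 = \sqrt{2\pi} \approx 2.51$ and $I_2 = \sqrt{8\pi} \approx 5.01$, so the split perpendicular to $\bar{u}_1$ is preferred.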
As visualized in figure 2, the decomposition process is applied iteratively, resulting in a hierarchical decomposition represented by a binary tree where each node represents a subset of the total data set $S$. Enumerating the nodes starting from zero, the root node $S_0$ denotes the entire data sample. The child nodes of a node $p$ are defined by induction using

$$\begin{aligned} S_{2p+1} &= \{\bar{x}_i \in S_p : \bar{v}_p^T(\bar{x}_i-\bar{\mu}_p) < 0\} \\ S_{2p+2} &= \{\bar{x}_i \in S_p : \bar{v}_p^T(\bar{x}_i-\bar{\mu}_p) \geq 0\} \end{aligned} \tag{5}$$

with

$$\begin{aligned} \bar{\mu}_p &= \bar{\mu} \\ i_{max} &= \arg\max_i \{\lambda_i\} \\ \bar{v}_p &= \bar{u}_{i_{max}} \end{aligned} \tag{6}$$

where $\lambda_i$ and $\bar{u}_i$ are the eigenvalues and eigenvectors, respectively, of the covariance matrix $\Sigma$ over $S_p$. This definition implies that $S_p = S_{2p+1} \cup S_{2p+2}$ and $S_{2p+1} \cap S_{2p+2} = \emptyset$.
Obviously, the distribution of the vectors might not be Gaussian at all, in which case the tree will not be balanced. In order to balance the tree, the value of $\bar{\mu}_p$ is chosen to be the median in the direction of $\bar{v}_p$, instead of the mean $\bar{\mu}$. The balancing is realized by defining a scalar $\beta$

$$\beta = \mathrm{median}\{\bar{v}_p^T(\bar{x}_i-\bar{\mu})\}, \quad \bar{x}_i \in S_p \tag{7}$$

and calculating the value of $\bar{\mu}_p$ using

$$\bar{\mu}_p = \beta\bar{v}_p + \bar{\mu} \tag{8}$$

Figure 2. Example of a hierarchical decomposition in two dimensions. [Figure omitted: the hyperplanes $\bar{v}_0^T(\bar{x}-\bar{\mu}_0) = 0$, $\bar{v}_1^T(\bar{x}-\bar{\mu}_1) = 0$ and $\bar{v}_2^T(\bar{x}-\bar{\mu}_2) = 0$ divide the plane into the regions of nodes 3 to 6; the corresponding binary tree has node 0 at level 0, nodes 1 and 2 at level 1, and nodes 3 to 6 at level 2.]
2.2 The Decomposition Algorithm
The decomposition algorithm of the data set up to a level $L \leq \lfloor \log_2 N \rfloor$ is described below. The vectors are organized so that all vectors belonging to the same node are grouped together. For each node $p$, the index of its first vector $b_p$ and of its last vector $e_p$ are stored so that $S_p = \{\bar{x}_i \in S_0 : b_p \leq i \leq e_p\}$. The decomposition parameters that are determined for each node $p$ are
• $b_p$, the index of the first vector of $S_p$
• $e_p$, the index of the last vector of $S_p$
• $\bar{v}_p$, the eigenvector of $S_p$ with the largest eigenvalue
• $\bar{\mu}_p$, the median vector of $S_p$ in the direction of $\bar{v}_p$
The complete decomposition algorithm is listed below:
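$S_0 = \{\bar{x}_1, \ldots, \bar{x}_N\}$
$b_0 = 1$
$e_0 = N$
$p = 0$
while $p < 2^L - 1$
    // Calculation of decomposition parameters
    $N_p = e_p - b_p + 1$
    $\bar{\mu} = \frac{1}{N_p} \sum_{k=b_p}^{e_p} \bar{x}_k$
    $\Sigma = \frac{1}{N_p} \sum_{k=b_p}^{e_p} (\bar{x}_k-\bar{\mu})(\bar{x}_k-\bar{\mu})^T$
    $\bar{u}_i, \lambda_i = \mathrm{eigenvectors}(\Sigma)$
    $i_{max} = \arg\max_i \{\lambda_i\}$
    $\bar{v}_p = \bar{u}_{i_{max}}$
    $\beta = \mathrm{median}\{\bar{v}_p^T(\bar{x}_k-\bar{\mu}),\ b_p \leq k \leq e_p\}$
    $\bar{\mu}_p = \beta\bar{v}_p + \bar{\mu}$
    // Organization of the order
    $i = b_p$
    $j = e_p$
    while $j > i$
        while $\bar{v}_p^T(\bar{x}_i-\bar{\mu}_p) \leq 0$
            $i = i + 1$
        end
        while $\bar{v}_p^T(\bar{x}_j-\bar{\mu}_p) > 0$
            $j = j - 1$
        end
        if $j > i$
            SWAP($\bar{x}_i$, $\bar{x}_j$)
        end
    end
    $b_{2p+1} = b_p$
    $e_{2p+1} = j$
    $b_{2p+2} = i$
    $e_{2p+2} = e_p$
    $p = p + 1$
end
Algorithm 1: Hierarchical decomposition.

Figure 3. Examples of the decomposition. [Figure omitted: four panels, with both axes ranging from −4 to 4.]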
In figure 3 some results of the decomposition are shown. All vectors are visualized by drawing a line from each vector to the mean vector of the node it belongs to. The data set consists of $2^{10}$ two-dimensional vectors drawn from a Gaussian distribution. The decomposition is realized up to the sixth level, resulting in 64 nodes each containing 16 vectors.
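Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `dot`, `offset`, `principal_axis` and `decompose` are hypothetical, indices are 0-based rather than 1-based, power iteration stands in for the eigenvector computation, and two bounds checks were added to the inner loops to guard against degenerate nodes.

```python
from statistics import median

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def offset(x, v, mu):
    """Signed distance v^T (x - mu) of x to the hyperplane through mu."""
    return dot(v, [a - m for a, m in zip(x, mu)])

def principal_axis(points, iters=200):
    """Return the mean and a unit vector along the principal component.

    Assumption: power iteration stands in for eigenvectors(Sigma).
    """
    n, dim = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    v = [dim ** -0.5] * dim
    for _ in range(iters):
        w = [0.0] * dim
        for p in points:                 # accumulate Sigma v (up to 1/n)
            s = offset(p, v, mean)
            w = [wk + s * (pk - mk) for wk, pk, mk in zip(w, p, mean)]
        norm = dot(w, w) ** 0.5
        if norm == 0.0:                  # degenerate node: keep v
            break
        v = [wk / norm for wk in w]
    return mean, v

def decompose(x, levels):
    """Reorder the list x in place; return b, e, v, mu indexed by node p.

    Indices are 0-based, so b[0] = 0 and e[0] = N - 1.
    Assumes levels <= floor(log2(N)) and distinct projections.
    """
    b, e, v, mu = {0: 0}, {0: len(x) - 1}, {}, {}
    for p in range(2 ** levels - 1):
        # Calculation of decomposition parameters
        node = x[b[p]:e[p] + 1]
        mean, v[p] = principal_axis(node)
        beta = median(offset(xk, v[p], mean) for xk in node)
        mu[p] = [m + beta * vk for m, vk in zip(mean, v[p])]
        # Organization of the order
        i, j = b[p], e[p]
        while j > i:
            while i <= e[p] and offset(x[i], v[p], mu[p]) <= 0:
                i += 1
            while j >= b[p] and offset(x[j], v[p], mu[p]) > 0:
                j -= 1
            if j > i:
                x[i], x[j] = x[j], x[i]
        b[2 * p + 1], e[2 * p + 1] = b[p], j
        b[2 * p + 2], e[2 * p + 2] = i, e[p]
    return b, e, v, mu
```

Calling `decompose(vectors, 3)` on a list of 64 vectors reorders the list so that each of the eight leaf nodes 7 to 14 occupies a contiguous index range `b[p]..e[p]`. Storing only two indices per node, as in the paper, avoids copying vectors into per-node containers.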
3 The Search Algorithm
3.1 Elimination rule
The branch and bound algorithm searches the nearest neighbors for a given vector $\bar{x}$ and consists of a depth-first traversal of the tree that represents the hierarchical decomposition of the data set. When a node is evaluated, it is determined whether it can contain nearest neighbors. If this is not the case, the node can be omitted from the search procedure. The rule that is used to determine whether a node can contain nearest neighbors is called the elimination rule. The elimination rule is adapted to the definition of the decomposition and relies on a distance measure between a vector $\bar{x}$ and a node with index $p$. This distance is compared to the distance to the current $K$th nearest neighbor in order to determine whether that node can be discarded from the search procedure. A vector-to-node distance $d(\bar{x}, p)$ is proposed that is defined to be $0$ when $p$ is the root node. According to the decomposition defined in equation (5), the distances from $\bar{x}$ to the child nodes of $p$ are determined from $d(\bar{x}, p)$ using

$$\begin{array}{l} \text{if } \bar{v}_p^T(\bar{x}-\bar{\mu}_p) < 0 \\ \quad d(\bar{x}, 2p+1) = d(\bar{x}, p) \\ \quad d(\bar{x}, 2p+2) = \max\left(d(\bar{x}, p),\ |\bar{v}_p^T(\bar{x}-\bar{\mu}_p)|\right) \\ \text{else} \\ \quad d(\bar{x}, 2p+1) = \max\left(d(\bar{x}, p),\ |\bar{v}_p^T(\bar{x}-\bar{\mu}_p)|\right) \\ \quad d(\bar{x}, 2p+2) = d(\bar{x}, p) \end{array} \tag{9}$$

For the child node on whose side of the hyperplane $\bar{x}$ lies, the same distance is taken as for the parent node. For the other child node, all vectors belonging to it have a distance to $\bar{x}$ that is at least the perpendicular distance $|\bar{v}_p^T(\bar{x}-\bar{\mu}_p)|$ to the hyperplane. This distance might therefore be a valid definition for the vector-to-node distance. However, hyperplanes on previous levels might provide larger, thus more efficient, distances. Therefore, the maximum of $d(\bar{x}, p)$ and $|\bar{v}_p^T(\bar{x}-\bar{\mu}_p)|$ is taken.
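The update in equation (9) can be illustrated with a small Python helper. This is a sketch only: the function name `child_distances` and its arguments are hypothetical, and the signed offset $\bar{v}_p^T(\bar{x}-\bar{\mu}_p)$ is assumed to have been computed beforehand.

```python
def child_distances(d_parent, signed_offset, p):
    """Vector-to-node distances of the two children of node p (equation 9).

    d_parent      -- d(x, p), the distance already assigned to the parent
    signed_offset -- v_p^T (x - mu_p), signed distance of x to the hyperplane
    Returns a dict mapping each child index to its distance.
    """
    far = max(d_parent, abs(signed_offset))
    if signed_offset < 0:
        # x lies on the side of child 2p+1, which inherits the parent distance
        return {2 * p + 1: d_parent, 2 * p + 2: far}
    # x lies on the side of child 2p+2
    return {2 * p + 1: far, 2 * p + 2: d_parent}

# At the root (distance 0) with offset -0.3, the far child 2 gets 0.3.
print(child_distances(0.0, -0.3, 0))   # {1: 0.0, 2: 0.3}
# Below node 2 an offset of 0.1 is smaller than the inherited 0.3,
# so the larger bound from the previous level is kept for both children.
print(child_distances(0.3, 0.1, 2))    # {5: 0.3, 6: 0.3}
```

The second call shows why the maximum is taken: a hyperplane close to $\bar{x}$ on a deeper level does not weaken a bound already established higher in the tree.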
During the search procedure, the currently found nearest neighbors and their distances to the vector $\bar{x}$ are stored in the variables $\bar{y}_k$ and $d_k$ respectively, with $k = 1, \ldots, K$. The values of $d_k$ are initially set to $\infty$. The index of the vector $\bar{y}$ that is the current $K$th nearest neighbor is denoted $k_{max}$. When a vector is found that is closer than $\bar{y}_{k_{max}}$, it replaces $\bar{y}_{k_{max}}$. Then, it is determined which of the $\bar{y}_k$ is the new $K$th nearest neighbor, which results in a new value for $k_{max}$. By the definition of the vector-to-node distance, the following elimination rule can be applied. A node $p$ can be discarded from the search procedure if

$$d(\bar{x}, p)^2 > (\bar{y}_{k_{max}}-\bar{x})^T(\bar{y}_{k_{max}}-\bar{x}) \tag{10}$$

since all the vectors belonging to $p$ are further from $\bar{x}$ than $\bar{y}_{k_{max}}$.
3.2 The Search Procedure
The tree traversal can be implemented efficiently using a stack $s$ which is addressed by an index $t$. On this stack, the node indices $p$ of the nodes that still need to be evaluated and their vector-to-node distances $d_{xp}$ are stored. Initially, the root node and the distance to this node (both zero) are pushed on the stack. When a node is evaluated, it is popped off the stack and the distance $d_{xp}$ from the vector $\bar{x}$ to the node $p$ is compared with $d_{k_{max}}$. When the node is further than $d_{k_{max}}$, it cannot contain nearest neighbors, and the following node on the stack is evaluated. If not, two cases can be distinguished. If the node is a leaf node ($p \geq 2^L - 1$), the vectors in $S_p = \{\bar{x}_{b_p}, \ldots, \bar{x}_{e_p}\}$ are searched, which means that their distance to $\bar{x}$ is calculated and compared with $d_{k_{max}}$. If the node is branched, the child nodes and their distances are pushed on the stack. Note that the node closest to $\bar{x}$ is pushed on the stack last so that it will be evaluated first. The algorithm terminates when the stack is empty ($t < 0$), which indicates that the entire tree was traversed and the $K$ nearest neighbors were found. In Algorithm 2, both $d_{xp}$ and $d_k$ hold squared distances, so no square roots are needed.
The complete search algorithm is listed below. The $K$ nearest neighbors of a vector $\bar{x}$ are determined from a data set $S_0$ that is decomposed up to a level $L$.
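$d_1, \ldots, d_K = \infty$
$S_0 = \{\bar{x}_1, \ldots, \bar{x}_N\}$
$k_{max} = 1$
$s_0 = 0$ // push the root node
$s_1 = 0$ // push the distance
$t = 1$
while $t \geq 0$
    $d_{xp} = s_t$ // pop distance
    $p = s_{t-1}$ // pop node
    $t = t - 2$
    if $d_{xp} < d_{k_{max}}$ // elimination rule
        if $p \geq 2^L - 1$ // $p$ is a leaf node
            $i = b_p$
            while $i \leq e_p$
                $d_{xx} = (\bar{x}-\bar{x}_i)^T(\bar{x}-\bar{x}_i)$
                if $d_{xx} < d_{k_{max}}$
                    $\bar{y}_{k_{max}} = \bar{x}_i$
                    $d_{k_{max}} = d_{xx}$
                    $k_{max} = \arg\max_k \{d_k\}$
                end
                $i = i + 1$
            end
        else // $p$ is a branched node
            if $\bar{v}_p^T(\bar{x}-\bar{\mu}_p) < 0$
                $s_{t+1} = 2p + 2$
                $s_{t+2} = \max(d_{xp}, (\bar{v}_p^T(\bar{x}-\bar{\mu}_p))^2)$
                $s_{t+3} = 2p + 1$
                $s_{t+4} = d_{xp}$
                $t = t + 4$
            else
                $s_{t+1} = 2p + 1$
                $s_{t+2} = \max(d_{xp}, (\bar{v}_p^T(\bar{x}-\bar{\mu}_p))^2)$
                $s_{t+3} = 2p + 2$
                $s_{t+4} = d_{xp}$
                $t = t + 4$
            end
        end
    end
end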
Algorithm 2: Search Algorithm.
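The search procedure can be sketched in Python and checked against an exhaustive search. The code below is an illustration under stated assumptions rather than the authors' implementation: the names `build` and `knn_search` are hypothetical, the tree is built with per-node lists instead of the in-place index ranges of Algorithm 1, power iteration approximates the principal eigenvector, a Python list of `(node, distance)` pairs replaces the flat stack $s$ with index $t$, and the $K$ candidates are kept in a sorted list instead of tracking $k_{max}$.

```python
from statistics import median

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def principal_axis(points, iters=100):
    """Mean and unit principal direction (power iteration, an assumption)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [sub(p, mean) for p in points]
    v = [dim ** -0.5] * dim
    for _ in range(iters):
        w = [0.0] * dim
        for c in centered:                  # w = Sigma v, up to the factor 1/n
            s = dot(c, v)
            w = [wk + s * ck for wk, ck in zip(w, c)]
        norm = dot(w, w) ** 0.5
        if norm == 0.0:
            break
        v = [wk / norm for wk in w]
    return mean, v

def build(points, levels):
    """Return (planes, leaves): planes[p] = (v_p, mu_p), leaves[p] = vectors."""
    sets, planes = {0: list(points)}, {}
    for p in range(2 ** levels - 1):
        mean, v = principal_axis(sets[p])
        beta = median(dot(v, sub(x, mean)) for x in sets[p])
        mu = [m + beta * vk for m, vk in zip(mean, v)]
        planes[p] = (v, mu)
        sets[2 * p + 1] = [x for x in sets[p] if dot(v, sub(x, mu)) < 0]
        sets[2 * p + 2] = [x for x in sets[p] if dot(v, sub(x, mu)) >= 0]
    first_leaf = 2 ** levels - 1
    return planes, {p: s for p, s in sets.items() if p >= first_leaf}

def knn_search(q, planes, leaves, k):
    """Depth-first branch and bound search; all distances are squared."""
    best = []                    # up to k pairs (squared distance, vector)
    stack = [(0, 0.0)]           # (node index, squared vector-to-node distance)
    while stack:
        p, dxp = stack.pop()
        bound = best[-1][0] if len(best) == k else float("inf")
        if dxp >= bound:         # elimination rule
            continue
        if p in leaves:          # leaf node: compare with every vector
            for x in leaves[p]:
                dxx = dot(sub(q, x), sub(q, x))
                if len(best) < k or dxx < best[-1][0]:
                    best = sorted(best + [(dxx, x)])[:k]
        else:                    # branched node: push far child, then near child
            v, mu = planes[p]
            off = dot(v, sub(q, mu))
            near, far = (2 * p + 1, 2 * p + 2) if off < 0 else (2 * p + 2, 2 * p + 1)
            stack.append((far, max(dxp, off * off)))
            stack.append((near, dxp))
    return best
```

A typical call is `planes, leaves = build(data, 4)` followed by `knn_search(query, planes, leaves, 5)`, which returns the five nearest vectors with their squared distances in increasing order. Because the vector-to-node distance is a lower bound on the distance to every vector in the node, the result should coincide with that of an exhaustive search.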
4 Results
The behavior of the algorithm was studied by means of experiments on artificial data. Prototype sets were produced from a $D$-dimensional normal probability distribution with mean 0 and unit covariance matrix. Each result was obtained from the average of $2^{10}$ experiments. The performance of the algorithm was studied with respect to different values of
• $D$, the dimensionality of the vector space
• $N$, the number of vectors in the data set
• $K$, the number of nearest neighbors that are searched
• $L$, the level of decomposition that is used for the search

In table 1 the average number of vector-to-vector distance computations is given. From this table, one can observe that the average number of distance computations is in general very small. The nearest neighbor in a 2-dimensional space was obtained after about 4.7 distance computations. Other works report results of 46 [3] and 165 [4] average distance computations. An interesting overview is given in [9]. The number of distance computations tends to be independent of the number of prototypes (for a dimensionality of $D \leq 4$). This property was also observed for the AESA algorithm and its derivatives.
In addition to these distance computations, the search algorithm also spends time calculating the vector-to-node distances and traversing the tree. If the cost of a distance computation is very high relative to this extra effort, the latter can be neglected. However, when observing the average calculation time of the algorithm for increasing levels of decomposition, the total calculation time decreases quickly at the lower levels, reaches a minimum and increases towards the highest level. Every extra level of decomposition reduces the number of distance computations but increases the traversal cost. If this extra cost exceeds the reduction in distance calculation time, an additional level of decomposition will increase the total search time. This implies that there is an optimal level $L_{opt}$ for which the total computation time is minimal. Using this optimal level of decomposition, the average computation time was determined as a function of the number of vectors $N$. Results for $K$ equal to 1, 2, 4 and 8 are shown for different dimensionalities $D$ in figures 4 to 6. More detailed results will be published in [1].
Figure 4. Average computation time for $D = 2$. [Plot omitted: average computation time (s) versus $\log_2 N$, for $\log_2 N$ from 10 to 15.]
Figure 5. Average computation time for $D = 4$. [Plot omitted: average computation time (s) versus $\log_2 N$, for $\log_2 N$ from 10 to 15.]
Figure 6. Average computation time for $D = 8$. [Plot omitted: average computation time (s) versus $\log_2 N$, for $\log_2 N$ from 10 to 15.]
Table 1. Average number of distance computations, for dimensionality D, number of neighbors K, and number of vectors in the data set N.

            Number of vectors in data set N
D   K     2^10    2^11    2^12    2^13    2^14    2^15
2   1      4.5     4.7     4.7     4.9     4.7     4.7
2   2      7.7     7.9     8.0     7.9     8.0     7.9
2   4     13.5    13.8    13.6    13.6    13.5    13.5
2   8     23.7    23.8    23.8    23.9    23.8    24.0
4   1       34      35      37      39      39      41
4   2       51      55      58      62      62      63
4   4       79      87      93      98     101     102
4   8      121     136     149     157     163     168
8   1      371     544     739     968    1197    1515
8   2      495     736    1025    1380    1789    2210
8   4      608     954    1388    1915    2534    3203
8   8      726    1185    1785    2567    3443    4483
5 Conclusions and Further Work
Since the number of nodes in the tree is $2^{L+1}-1$, with $L$ bounded by $\lfloor \log_2 N \rfloor$, the space complexity of the algorithm grows linearly with $N$. For a low dimensionality ($D \leq 4$), the average number of distance computations was observed to be nearly independent of the sample size. The lowest average computation time was obtained for an optimal level $L_{opt}$, which realizes the tradeoff between the distance computation cost and the tree traversal cost. Using this optimal level of decomposition, it was shown that the average calculation time grows sublinearly with the data set size $N$. For a low dimensionality ($D \leq 4$) it was very close to logarithmic.
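The linear space bound stated at the start of this section can be verified in one line. The inequality below is a worked step added for illustration; it uses only the node count $2^{L+1}-1$ and the bound $L \leq \lfloor \log_2 N \rfloor$ from the text.

$$2^{L+1} - 1 \;\leq\; 2^{\lfloor \log_2 N \rfloor + 1} - 1 \;\leq\; 2N - 1$$

For example, with $N = 2^{10}$ and $L = 10$ the tree has $2^{11} - 1 = 2047$ nodes. Each node $p$ stores the two indices $b_p$ and $e_p$, and each branched node additionally stores the two $D$-dimensional vectors $\bar{v}_p$ and $\bar{\mu}_p$, so the storage for the tree is proportional to $N$ for a fixed dimensionality.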
A weakness of the algorithm in the form presented here is that its performance decreases drastically with the dimensionality $D$ of the vector space. However, it decreases only with the intrinsic dimensionality of the data set. For a four-dimensional sample where the third and fourth dimensions were linear combinations of the first two dimensions (intrinsic dimensionality of 2), the same number of distance computations was observed as for a two-dimensional sample. The results that are presented for the Gaussian distribution with unit covariance matrix therefore provide a worst-case estimate of the efficiency.
Future work will investigate if the performance can be improved by combining the decomposition with other well known elimination rules [3, 4].
6 Acknowledgements
This work was financially supported by the Flemish Institute for the Promotion of Scientific and Technological Research in the Industry (IWT), Brussels. The author thanks Diemo Schwarz, Steve de Backer, Jan Sijbers and Paul Scheunders of the Visionlab at the University of Antwerp and the Analysis/Synthesis team at IRCAM.
References
[1] Wim D’haes, Dirk Van Dyck, and Xavier Rodet. The principal component split: A new branch and bound search algorithm for computing k nearest neighbors. Pattern Recognition Letters (submitted).
[2] Keinosuke Fukunaga. Statistical Pattern Recognition. Academic Press, 1990.
[3] Keinosuke Fukunaga and Patrenahalli M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 24:750–753, July 1975.
[4] Behrooz Kamgar-Parsi. An improved branch and bound algorithm for computing k-nearest neighbors. Pattern Recognition Letters, 3:7–12, January 1985.
[5] María Luisa Micó, José Oncina, and Raphael C. Carrasco. A fast branch & bound nearest neighbor classifier in metric spaces. Pattern Recognition Letters, 17:731–739, June 1996.
[6] María Luisa Micó, José Oncina, and Enrique Vidal. An algorithm for finding nearest neighbors in constant average time with a linear space complexity. Proc. of the 11th ICPR, Vol. II:557–560, 1992.
[7] María Luisa Micó, José Oncina, and Enrique Vidal. A new version of the nearest-neighbour approximation and elimination search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognition Letters, 15:9–18, January 1994.
[8] H. Niemann and R. Goppert. An efficient branch and bound nearest neighbour classifier. Pattern Recognition Letters, 7:67–72, February 1988.
[9] E. Vidal. An algorithm for finding nearest neighbors in (approximately) constant average time complexity. Pattern Recognition Letters, 4:145–157, July 1986.
[10] E. Vidal. New formulation and improvements of the nearest-neighbour approximation and elimination search algorithm (AESA). Pattern Recognition Letters, 15:1–7, January 1994.