CHAPTER 3: C-SPACE APPROXIMATION USING INSTANCE-BASED LEARNING
3.3 Overview
3.3.3 LSH-based Approximate k-NN Query
A key challenge for our learning framework is its computational efficiency. As we generate hypotheses directly from training instances, the complexity of k-NN computation grows with the size of the historical data. If we use exact k-NN computation as the underlying learning method, the cost of each query is a linear function of the size of the dataset, especially for high-dimensional spaces. To improve the efficiency of our instance-based learning algorithm, we use approximate k-NN algorithms.
Given a dataset $D = \{x_1, x_2, \ldots, x_N\}$ of $N$ points in $\mathbb{R}^d$, we consider two types of retrieval queries. One retrieves the points from $D$ closest to a given query point: this is the well-known k-NN query, which we call the point-point k-NN query. The second query tries to find the points from $D$ that are closest to a given line in $\mathbb{R}^d$ whose direction is $v$ and which passes through a point $a$, where $v, a \in \mathbb{R}^d$. We call this second query the line-point k-NN query. The two types of k-NN queries are illustrated in Figure 3.2.
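As a point of reference, the two queries can be written as exact, brute-force searches. The following is only a minimal sketch, assuming the Euclidean metric and an infinite line through $a$ with direction $v$; the function names (`point_distance`, `line_distance`, `knn_point`, `knn_line`) are hypothetical and not part of the method described here.

```python
import heapq
import math

def point_distance(x, q):
    """Euclidean distance between a data point x and a query point q."""
    return math.sqrt(sum((xi - qi) ** 2 for xi, qi in zip(x, q)))

def line_distance(x, a, v):
    """Euclidean distance from x to the line {a + t * v : t real}."""
    vv = sum(vi * vi for vi in v)
    diff = [xi - ai for xi, ai in zip(x, a)]
    # Parameter of the orthogonal projection of x onto the line.
    t = sum(di * vi for di, vi in zip(diff, v)) / vv
    return math.sqrt(sum((di - t * vi) ** 2 for di, vi in zip(diff, v)))

def knn_point(D, q, k):
    """Exact point-point k-NN: scans all N points of D."""
    return heapq.nsmallest(k, D, key=lambda x: point_distance(x, q))

def knn_line(D, a, v, k):
    """Exact line-point k-NN: scans all N points of D."""
    return heapq.nsmallest(k, D, key=lambda x: line_distance(x, a, v))
```

Both functions visit every point of $D$, which is the linear cost per query that the approximate methods below aim to avoid.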
In order to develop an efficient instance-based learning framework, we use locality-sensitive hashing (LSH) as an approximate method for k-NN queries. LSH is mainly designed for point-point queries (Andoni and Indyk, 2008). However, it can be extended to line queries (Andoni et al., 2009) and hyperplane queries (Jain et al., 2010). Basri et al. (2011) further extend it to perform point/subspace queries.
LSH requires randomized hash functions which guarantee that the probability of two points being mapped into the same hash bucket decreases as the distance between them increases. The distance metric is defined based on the specific task or application. Since two similar points are likely to fall into the same or nearby hash buckets, we only need to perform a local search within the bucket that contains the given query.
Definition 3.1. (Andoni and Indyk, 2008) Let $h_H$ denote a random choice of hash function from the function family $H$, and let $B(x, r)$ be a radius-$r$ ball centered at $x$. $H$ is called $(r, r(1+\epsilon), p_1, p_2)$-sensitive for $\mathrm{dist}(\cdot,\cdot)$ when for any $x, x' \in D$,
• if $x' \in B(x, r)$, then $P[h_H(x) = h_H(x')] \ge p_1$,
• if $x' \notin B(x, r(1+\epsilon))$, then $P[h_H(x) = h_H(x')] \le p_2$.
For a family of functions to be useful, we require $p_1 > p_2$.
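The two probabilities in the definition can be estimated empirically for a concrete family. The sketch below is illustrative only: it assumes the floor-of-random-projection hash commonly used for the $l_2$ metric, with Gaussian projection vector, uniform offset, and bucket width `W`, and it assumes that the collision probability of this family depends only on the distance between the two points. The names `pstable_hash` and `collision_rate`, and the values of `W`, `r`, and `eps`, are hypothetical choices.

```python
import math
import random

def pstable_hash(a, b, W, x):
    """Floor of a shifted random projection: floor((a . x + b) / W)."""
    return math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / W)

def collision_rate(distance, dim=2, W=4.0, trials=5000, seed=0):
    """Fraction of random hash draws under which two points separated by
    `distance` fall into the same bucket (a Monte Carlo estimate)."""
    rng = random.Random(seed)
    x = [0.0] * dim
    y = [distance] + [0.0] * (dim - 1)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # a ~ N(0, I)
        b = rng.uniform(0.0, W)                        # b ~ U[0, W]
        hits += pstable_hash(a, b, W, x) == pstable_hash(a, b, W, y)
    return hits / trials

r, eps = 1.0, 1.0                          # assumed radius and approximation factor
p1_est = collision_rate(r)                 # estimate at distance r
p2_est = collision_rate(r * (1.0 + eps))   # estimate at distance r(1 + eps)
```

Under these assumptions the estimate at distance $r$ should exceed the estimate at distance $r(1+\epsilon)$, which is the requirement $p_1 > p_2$ stated above.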
A higher-dimensional hash function $g$ can be constructed by concatenating several hash functions randomly selected from the function family $H$: $g(x) = [h_H^1(x), h_H^2(x), \ldots, h_H^M(x)]$, where $M$ is the dimension of $g$. Given the hash function $g$, the hashing collision probability for two close points is at least $p_1^M$, while for dissimilar points it is at most $p_2^M$. Each item in the dataset $D$ is mapped to a series of $L$ hash tables indexed using independently constructed functions $g_1, \ldots, g_L$, where each $g_i$ is a dimension-$M$ function. Next, given a point query $p$, an exhaustive search is carried out only on the items in the union of the $L$ buckets. These candidates constitute the $(r, \epsilon)$-nearest neighbor for $p$, meaning that if $p$ has a neighbor within radius $r$, then with high probability some item within radius $r(1+\epsilon)$ would be found. When $\mathrm{dist}(\cdot,\cdot)$ corresponds to the $l_2$ metric, the following is true about the point-point k-NN query:
Theorem 3.1. (Point-point k-NN query) (Datar et al., 2004) Let $H$ be a family of $(r, r(1+\epsilon), p_1, p_2)$-sensitive hash functions, with $p_1 > p_2$. Given a dataset of size $N$, we set $M = \log_{1/p_2} N$ and $L = N^{\rho}$, where $\rho = \frac{\log p_1}{\log p_2}$. Using $L$ hash tables over dimension $M$ and given a point query $p$, the LSH algorithm solves the $(r, \epsilon)$-neighbor problem with probability at least $\frac{1}{2} - \frac{1}{e}$. In other words, if there exists a point $x$ such that $x \in B(p, r(1+\epsilon))$, then the algorithm will return the point with probability $\ge \frac{1}{2} - \frac{1}{e}$. The retrieval time is bounded by $O(N^{\rho})$.
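The parameter choices in the theorem are simple to evaluate numerically. The following sketch computes $\rho$, $M$, and $L$ from $N$, $p_1$, and $p_2$; the function name `lsh_parameters` is hypothetical, and rounding $M$ and $L$ up to integers is an assumption, since the theorem states them as real-valued expressions.

```python
import math

def lsh_parameters(N, p1, p2):
    """Return (M, L, rho) with M = log_{1/p2} N, L = N**rho,
    rho = log(p1) / log(p2). M and L are rounded up (an assumption)."""
    if not (0.0 < p2 < p1 < 1.0):
        raise ValueError("expected 0 < p2 < p1 < 1")
    rho = math.log(p1) / math.log(p2)
    M = math.ceil(math.log(N) / math.log(1.0 / p2))  # change of base
    L = math.ceil(N ** rho)
    return M, L, rho

M, L, rho = lsh_parameters(N=10**6, p1=0.8, p2=0.5)
```

For these illustrative values ($N = 10^6$, $p_1 = 0.8$, $p_2 = 0.5$), $\rho$ is roughly 0.32, so about 86 tables of 20-dimensional hash codes are needed rather than a scan over one million items.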
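If we choose $H$ to be the Hamming hash or the $p$-stable hash
$$\{h_u : h_u(x) = \mathrm{sgn}(u^T x)\} \quad \text{or} \quad \left\{h_{a,b} : h_{a,b}(x) = \left\lfloor \frac{a^T x + b}{W} \right\rfloor\right\}, \tag{3.1}$$
where $u, a \sim \mathcal{N}(0, I)$, $b \sim U[0, W]$ and $W$ is a fixed constant, we have $\rho \le \frac{1}{1+\epsilon}$ and the algorithm has sub-linear complexity, i.e., the results can be retrieved in time $O(N^{\frac{1}{1+\epsilon}})$.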
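The construction described above (concatenating $M$ hash functions into each $g_i$, keeping $L$ independent tables, and searching only the buckets hit by the query) can be sketched as follows. This is a minimal illustration using the second family of Equation 3.1, not the implementation used in this work; the class name `L2LSHIndex`, its methods, and the parameter values in the usage lines are hypothetical.

```python
import math
import random
from collections import defaultdict

class L2LSHIndex:
    """L hash tables, each keyed by an M-dimensional code
    g_i(x) = [floor((a_j . x + b_j) / W) for j = 1..M]."""

    def __init__(self, dim, M, L, W, seed=0):
        rng = random.Random(seed)
        self.W = W
        # Hash parameters are drawn once, before any data is inserted.
        self.params = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, W))
             for _ in range(M)]
            for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _g(self, i, x):
        """Concatenated hash code of x for table i."""
        return tuple(
            math.floor((sum(aj * xj for aj, xj in zip(a, x)) + b) / self.W)
            for a, b in self.params[i]
        )

    def add(self, x):
        """Insert x into all L tables (done once per stored item)."""
        idx = len(self.points)
        self.points.append(tuple(x))
        for i, table in enumerate(self.tables):
            table[self._g(i, x)].append(idx)

    def query(self, q, k):
        """Exhaustive search restricted to the union of the L buckets of q."""
        candidates = set()
        for i, table in enumerate(self.tables):
            candidates.update(table.get(self._g(i, q), ()))
        ranked = sorted(candidates, key=lambda j: math.dist(self.points[j], q))
        return [self.points[j] for j in ranked[:k]]

index = L2LSHIndex(dim=3, M=4, L=5, W=2.0, seed=1)
for pt in [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (10.0, 10.0, 10.0)]:
    index.add(pt)
neighbors = index.query((10.0, 10.0, 10.0), k=1)
```

Because the result is drawn only from colliding buckets, `query` may return fewer than `k` items, or miss a true neighbor, which is the approximation that the probability bound in Theorem 3.1 quantifies.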
We build on these prior results for point-point k-NN queries, and we present a new LSH-based algorithm for line-point k-NN queries. The LSH parameters (e.g., $u$, $W$, $a$, and $b$ in Equation 3.1) are chosen randomly a priori. When the collision result for a new configuration query is computed, we calculate the hash code for that query and add its collision information to the hash tables. This operation is performed once for each item stored in the database.
Later, in Section 3.4, we discuss the challenges in designing appropriate hash functions for line-point k-NN queries, and we derive LSH bounds for line-point k-NN. We also address the challenges in extending our formulation to non-Euclidean metrics (e.g., in handling articulated models) and in reducing the dimension of the embedded space.
Figure 3.2: Two types of k-NN queries used in our method: (a) point-point k-NN and (b) line-point k-NN. $Q$ is the query item, and the results of different queries are shown as blue points in each figure. We present novel LSH-based algorithms for fast computation of these queries. (c) The line-point k-NN query is used to compute prior instances that can influence the collision status of a local path which connects $x_1$ and $x_2$ in C-space. The query line $l$ is the line segment between $x_1$ and $x_2$. The white points are prior collision-free samples in the dataset, and the black points are prior in-collision samples.