
In this section, we extend our exact algorithm from Section 2.2.1, which is based on convex layers, to 3D.

The size of the convex hull in 3D is linear in the number, n, of input points. Hence the convex layers occupy only O(n) space. The sketch of the algorithm in Section 2.2 still holds, but we need to address two issues that can impact the query time: 1) how to find the extreme points efficiently? and 2) what is the degree of each vertex (i.e., its number of neighbors) on the hull?

1. To find an extreme point on a convex layer, we can use planar point location [27], as follows.

Given any 3D convex hull, we first create the unit normal of each facet. Then we translate all the normals so that their starting points coincide with the origin; thus their ending points all lie on the unit sphere. For each vertex $p_i$ on the hull, we list all its associated facets, $f_{i1}, f_{i2}, \ldots, f_{it}$, in clockwise/counter-clockwise order. Next we connect the normals of $f_{i1}$ and $f_{i2}$, $f_{i2}$ and $f_{i3}$, $\ldots$, $f_{i,t-1}$ and $f_{it}$, and $f_{it}$ and $f_{i1}$, respectively, via arcs on their corresponding great circles. These arcs form a closed cycle on the surface of the sphere, and we associate the interior of the cycle with vertex $p_i$. For convenience, we name that interior $c_i$. Figure 2.7a shows an example where we assume the convex hull is a tetrahedron with vertices $p_1, p_2, p_3, p_4$; $f_i$ and $n_i$ denote the $i$-th facet and its unit normal, respectively. Figure 2.7b shows the positions of these normals after translation. Moreover, $c_1 = n_1 n_2 n_3 n_1$ is the cell associated with $p_1$, cell $c_2 = n_1 n_2 n_4 n_1$ is associated with $p_2$, etc.

Now given any weighting vector $w$, vertex $p_i$ is the extreme point with respect to $w$ if and only if the ending point of $w$ lies in the cell $c_i$. Note that every arc on the sphere is part of a great circle, and thus it becomes a line segment when gnomonic projection [25] is applied. Therefore, after gnomonic projection, the result is a planar graph in which each cell $c_i$ corresponds uniquely to a face, $\hat{c}_i$, of that graph, and the weighting vector $w$ becomes a point $\hat{w}$. Clearly, $w$ lies in $c_i$ if and only if $\hat{w}$ lies inside face $\hat{c}_i$. Hence finding an extreme point has been reduced to the planar point location problem. By using a persistent search tree, planar point location can be done in $O(\log n)$ time and $O(n)$ space [52]. Therefore, finding an extreme point in 3D can be done in $O(\log n)$ time without increasing the asymptotic space complexity.

Note that, unlike the 2D case, each 3D weighting vector is uniquely determined by two parameters, and the fractional cascading technique discussed in Section 2.2.2 is no longer supported here.

2. Unfortunately, the degree of each vertex is no longer a constant in 3D; it can be any number from 3 to $n - 1$. Hence, after some point $p$ is deleted from the heap, we have to check all its neighbors, which is costly in the worst case. We can soften the worst case a little by dividing the $m \le n$ points on the hull into $\sqrt{m}$ groups of roughly $\sqrt{m}$ points each. We then build the convex hull for each group. It is clear that the total space remains $O(m)$, but the maximum degree in each sub-hull is less than $\sqrt{m}$. To find the extreme point among all the original $m$ points w.r.t. some preference vector $w$, we find the extreme point of each sub-hull using point location and maintain these $O(\sqrt{m})$ points (by score) in a max-heap, which takes $O(\sqrt{m} \log \sqrt{m}) = O(\sqrt{m} \log m)$ time. Retrieving the next largest point involves checking the neighbors of the current maximum in its sub-hull, inserting them into the heap, and finally deleting the maximum point. Note that a point is inserted and deleted at most once; hence all the heap operations invoked in each group take $O(\sqrt{m} \log m)$ time.

Assume the 3D onion structure of all the $n$ points consists of $t$ layers, and the $i$-th layer contains $n_i$ points, i.e., $n_1 + n_2 + \cdots + n_t = n$. We apply the strategy above, i.e., for the hull in layer $i$, we partition the $n_i$ points into $\sqrt{n_i}$ groups and maintain a max-heap on them. We also build one extra max-heap to keep track of the largest point in the first $k$ layers, as our previous algorithm does. To sum up, at most $k$ extreme points will be accessed, which takes at most $O(\sqrt{n_1} \log n_1 + \cdots + \sqrt{n_k} \log n_k) = O(k \sqrt{n} \log n)$ time. Reporting the top-$k$ objects in sorted order also takes $O(k \sqrt{n} \log n)$ time. Therefore, the worst-case running time is bounded by $O(k \sqrt{n} \log n)$.
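The grouping strategy for a single layer can be sketched as follows. This is a simplified illustration, not the algorithm as analyzed above: the names `split_into_groups` and `top_points` are hypothetical, and each group is ranked by a plain sort, which stands in for the point-location query and the sub-hull neighbor walk. The sketch therefore shows only the two-level structure (one candidate per group in a global max-heap) and does not achieve the stated time bound.

```python
import heapq
import math


def dot(w, p):
    """Score of point p under preference vector w."""
    return sum(wi * pi for wi, pi in zip(w, p))


def split_into_groups(points):
    """Partition m points into about sqrt(m) groups of about sqrt(m) points."""
    size = max(1, math.isqrt(len(points)))
    return [points[i:i + size] for i in range(0, len(points), size)]


def top_points(groups, w, k):
    """Return the k highest-scoring points over all groups, best first."""
    # Assumption: a sort replaces the sub-hull traversal of the text.
    ranked = [sorted(g, key=lambda p: dot(w, p), reverse=True) for g in groups]
    # Global max-heap (scores negated), seeded with each group's extreme point.
    heap = [(-dot(w, r[0]), gi, 0) for gi, r in enumerate(ranked) if r]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, gi, j = heapq.heappop(heap)
        result.append(ranked[gi][j])
        # Offer the next candidate from the group that just lost its maximum.
        if j + 1 < len(ranked[gi]):
            heapq.heappush(heap, (-dot(w, ranked[gi][j + 1]), gi, j + 1))
    return result


layer = [(1, 2, 3), (4, 0, 1), (2, 2, 2), (0, 5, 1), (3, 3, 0),
         (1, 1, 6), (5, 1, 1), (0, 0, 4), (2, 4, 1)]
best = top_points(split_into_groups(layer), (1, 2, 3), 4)
```

With nine points the sketch forms three groups of three, so the global heap never holds more than three candidates at a time.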

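Figure 2.7: Illustrating how to reduce finding extreme points in 3D to planar point location. (a) An example of a 3D convex hull: a tetrahedron with vertices $p_1, \ldots, p_4$, where $n_i$ is the unit normal of facet $f_i$. (b) The unit normals $n_1, \ldots, n_4$, translated to the origin, partition the unit sphere into several cells.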
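The key property used in the reduction, that great-circle arcs become straight segments under gnomonic projection, can be checked numerically. The following is a minimal sketch under stated assumptions: the projection plane is taken to be $z = 1$, all vectors are assumed to have positive $z$-component, and the helper names (`gnomonic`, `great_circle_point`, `collinear`) are illustrative rather than taken from the thesis.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def gnomonic(v):
    """Project v from the origin onto the plane z = 1 (assumes z > 0)."""
    x, y, z = v
    if z <= 0:
        raise ValueError("this sketch only handles the hemisphere z > 0")
    return (x / z, y / z)


def great_circle_point(a, b, t):
    """Point on the great-circle arc from a to b (requires a != -b)."""
    # A normalized blend stays in the plane through the origin, a and b.
    return normalize(tuple((1 - t) * ai + t * bi for ai, bi in zip(a, b)))


def collinear(p, q, r, eps=1e-9):
    """True if the planar points p, q, r lie on one line (up to eps)."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return abs(cross) < eps


# Two unit normals and a point on the arc between them.
n1 = normalize((1.0, 0.0, 1.0))
n2 = normalize((0.0, 1.0, 2.0))
on_arc = great_circle_point(n1, n2, 0.3)

# The three projected points lie on one straight line in the plane.
arc_is_segment = collinear(gnomonic(n1), gnomonic(on_arc), gnomonic(n2))
```

Because the cell boundaries become segments, the projected cells are polygons, which is what allows a standard planar point location structure to be used.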

Remark. The basic idea of our algorithm extends to higher dimensions as well, but it is not very attractive for two reasons: 1) the degree of a vertex of the convex hull can be as large as $n - 1$, and, more importantly, 2) the size of the convex hull grows dramatically, to $O(n^{\lfloor d/2 \rfloor})$ in $d$ dimensions. In other words, storing the entire convex hull structure is too costly when the dimension is high. Therefore, in the next chapter we develop an efficient approximation algorithm for the preference top-$k$ query in higher dimensions that avoids storing the entire hull.

Chapter 3

Approximate preference top-k query

In this chapter, we present our sampling-based approximation scheme for the preference top-k problem in higher dimensions.

3.1 Problem formulation

Recall the preference top-$k$ problem we defined in Section 2.1. We are given a set of $n$ $d$-dimensional data points $D = \{a_1, a_2, \ldots, a_n\} \subset (\mathbb{R}^+)^d$ and an integer $k \le n$. For any given $d$-dimensional query vector (i.e., preference) $q = (q_1, q_2, \ldots, q_d)$ satisfying each $q_i \ge 0$ and $q \ne 0$, the goal of the preference top-$k$ problem is to find the $k$ points from $D$ that have the largest inner products with $q$, and report them in order. In other words, the reported points $a_{\pi_1}, a_{\pi_2}, \ldots, a_{\pi_k}$ have to satisfy

$$q \cdot a_{\pi_1} \ge q \cdot a_{\pi_2} \ge \cdots \ge q \cdot a_{\pi_k} \ge q \cdot a_i$$

for any $a_i \in D - \{a_{\pi_1}, a_{\pi_2}, \ldots, a_{\pi_k}\}$.
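As a reference point for the definition above, the exact answer can always be computed by a linear scan over $D$. The sketch below is a brute-force baseline, not an algorithm from this thesis; the function names `inner` and `exact_top_k` are chosen here for illustration.

```python
import heapq


def inner(q, a):
    """Inner product q . a of two d-dimensional vectors."""
    return sum(qi * ai for qi, ai in zip(q, a))


def exact_top_k(D, q, k):
    """Return the k points of D with the largest q . a, best first."""
    # The problem statement requires q >= 0 componentwise and q != 0.
    if any(qi < 0 for qi in q) or not any(q):
        raise ValueError("q must be non-negative and non-zero")
    if k > len(D):
        raise ValueError("k must not exceed the number of points")
    # heapq.nlargest returns its results in descending order of the key.
    return heapq.nlargest(k, D, key=lambda a: inner(q, a))


D = [(1, 2), (3, 1), (2, 2), (0, 5)]
answer = exact_top_k(D, q=(2, 1), k=2)
```

This scan touches every point for every query, which is the cost the index structures and the approximation scheme aim to avoid.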

An effective approximation approach for dealing with preference top-k queries is sampling. The high-level idea of a sampling-based approximation is to sample a subset of the original dataset D (called the sampling set or sampling subset) that represents the top-k features of the entire set well but has a much smaller size. When a query preference vector q is given, the algorithm focuses only on the points in the sampling set, i.e., it identifies the top-k points of the sampling set under q and uses them as an approximation to the true top-k result. In this way, each query can be answered much more efficiently. Clearly, obtaining a small, high-quality sampling set is the crucial part of sampling-based approximation. An ideal sampling set should be both representative and small, so that both the quality of the top-k answer and the query efficiency can be guaranteed.
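The query side of this idea can be sketched in a few lines. The sampler below draws a uniform random subset purely as a placeholder; it is an assumption made for illustration and is not the sampling scheme developed in this chapter. The names `uniform_sample` and `approx_top_k` are likewise hypothetical.

```python
import heapq
import random


def inner(q, a):
    """Inner product q . a of two d-dimensional vectors."""
    return sum(qi * ai for qi, ai in zip(q, a))


def top_k(points, q, k):
    """The k points with the largest q . a, best first (linear scan)."""
    return heapq.nlargest(k, points, key=lambda a: inner(q, a))


def uniform_sample(D, size, seed=0):
    """Placeholder sampler: uniform, without replacement (an assumption)."""
    return random.Random(seed).sample(D, size)


def approx_top_k(S, q, k):
    """Answer the query on the sampling set S only."""
    return top_k(S, q, k)


# Illustrative data: 50 two-dimensional points with positive coordinates.
D = [(i % 7 + 1, (3 * i) % 11 + 1) for i in range(50)]
S = uniform_sample(D, size=10)
q = (1, 2)
approx = approx_top_k(S, q, k=3)
exact = top_k(D, q, k=3)
```

Since S is a subset of D, the best score found in S can never exceed the best score in D; how close the two are depends entirely on how S is chosen, which is the design question the text raises.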