• No results found

In this section we propose a preliminary algorithm in 2D, based on convex layers [20], which uses O(n) space and has query time O(k log n). Then we improve the query time to O(log n + k log k) by propagating information about extreme points on each layer. Finally we use the special selection technique given in [37] to get an optimal query time of O(log n + k).

The convex layers of P are defined as follows: The first layer is the convex hull of P . The second layer is the convex hull of the set obtained by deleting from P the points in the first layer. And so on until there are no more points left.

2.2.1 Preliminary algorithm

Consider a unit weight vector w in the 2D plane. If we sweep a line l perpendicular to w over P , from +∞ to −∞, the first k points that we encounter correspond to the top k objects in decreasing order of score. Based on this, the top-1 point must lie on the convex hull of P , and that point, called extreme point, is just at the position where l is the tangent of the hull (see Figure 2.2a). To move one step further, the candidates for the point with second largest score are p’s two neighbors on the convex hull and one point inside the convex hull (see Figure 2.2b). Formally, let p be the point with rank i, based on score, and let Cand be the set containing all the possible candidate points with rank i + 1. We maintain the following invariant.

1. If p is the extreme point in the current layer, Cand← Cand ∪ {q, p0, p00}, where q is the extreme point in the next inner layer, and p0, p00 are p’s two neighbors on its

layer (if they exist).

2. Else, Cand ← Cand ∪ {p0}, where p0 is p’s left or right neighbor on its layer, as appropriate (if it exists).

Based on this invariant, we use a binary heap to maintain all the possible candidates, and report the top-k objects one by one in non-increasing ordering of score. In case 1, we delete one element from the heap, and insert at most three elements; in case 2, we delete one element and insert one more. Hence, the size of the heap is O(k), and O(k) heap operations done (insertion and deletion) take O(k log k) time.

Since w lies in the first quadrant, the extreme point on each convex layer must lie between the topmost and rightmost vertex on the layer (vertices p1, p2, p3 for the first

layer in Figure 2.2c).

If we shoot rays which are perpendicular to each segment of this part, we partition the range of angles of all the vectors (0◦− 90◦) into several parts. For a given weighting

vector, the interval in which it lies yields the corresponding extreme point. For instance, in Figure 2.2c, if the weighting vector lies in [0, θ1], (θ1, θ2], or (θ2, θ3 = 90◦], the

corresponding extreme point will be p1, p2, or p3, respectively. If the angle of weighting

vector is equal to some θi, we can arbitrarily choose either pi or pi+1 to break the tie.

Hence finding the extreme point on a layer requires only one binary search, if the angles are stored in sorted order in an array, and takes O(log n) time. We only need to find at most k such points, hence the total query time is O(k log n + k log k) = O(k log n). The total space used is O(n).

2.2.2 Applying fractional cascading

We note that finding the extreme points in different layers requires a sequence of binary searches, whose total cost is O(k log n). This cost can be reduced to O(log n + k log k), without affecting the O(n) space bound, by using the fractional cascading technique [21]. This technique stores appropriate pointers (called bridges) between consecutive arrays of angles. With this approach, we need to perform one O(log n) time binary search to find the first extreme point; subsequent extreme points are found in O(1) time each by following the stored pointers. Hence, the total time improves to O(k + log n).

18 O x y w p l

(a) Linel is normal to unit vec- torw, and l is the tangent of the convex hull. Pointp on the hull ranks as the top-1 point in this situation. O x y w p l q p p′′

(b) The candidates for the points with the second largest score can bep0

,p00

, and the ex- treme pointq in the inner convex hull. O x y p2 l p1 p3 θ1 θ2 θ3= 90◦

(c) The extreme point can only lie in the red region. The green lines, which are perpendicular to the red segments, divide the range[0◦, 90] into three inter-

vals that can be searched for the extreme point as discussed in the text.

Figure 2.2: Illustrating the search for the top-k points for weight vector w. Here are details. Initially we have m angle arrays, denoted by angle[1][.], angle[2][.], · · · , angle[m][.], as well as information about the corresponding nodes (extreme points) as arrays node[1][.], node[2][.], · · · , node[m][.], where m is the number of convex layers. (Layers are numbered from 1 (outermost) to m (innermost).) We will construct the fractional cascading data structure in two steps. Figure 2.3a shows an example of three convex layers, and we will use it to illustrate the construction.

Step 1: Extending arrays. First, we take every other element from angle[m][.], i.e., angle[m][1], angle[m][3], angle[m][5], · · · , and merge them into angle[m − 1][.] in linear time. While merging, we can calculate the node information for each added entry very easily. Then we take every other element from the extended angle[m− 1][.], and merge them into angle[m− 2][.]. We repeat the above process m − 1 times from bottom to top to extend the array information of the first m− 1 layers (see Figure 2.3b). Note that it is unnecessary to have identical numbers in an array, so if there exist two identical numbers after an extension, we will just keep only one of them (and the corresponding node).

Step 2: Building pointers. We start at layer 1 and proceed to layer m, as follows. For each element in current layer, we set a pointer to the smallest element (with odd

index) in the next layer which is larger than or equal to it (if such element does not exist, then we take the largest element in the next layer instead). Formally, for the j-th element in i-th layer (1≤ i < m), i.e., angle[i][j], we point it to angle[i + 1][j0], where (1) j0 is odd, or j0 is the last element of angle[i + 1][.]; (2) angle[i + 1][j0]≥ angle[i][j];

and (3) j0 = 1 or angle[i + 1][j0− 2] < angle[i][j] (see Figure 2.3c).

We can analyze the total space used via the accounting method of amortized analysis [24]. We assign each element of each original angle[.][.] array 1 credit. Throughout, we maintain the invariant that each element in the array where elements are currently being propagated has 1 credit available. This is true for angle[m][.] by the above assignment. Let angle[i][.] be the current array (1 < i ≤ m) and assume that the invariant holds. Each propagated element from angle[i][.] pays for itself using its stored credit and carries with it to angle[i− 1][.] the credit from the unpropagated neighbor to its right. Thus, the invariant holds for angle[i− 1][.]. Hence the total number of elements propagated is upper-bounded by the number of credits assigned initially, which is P

si = O(n) where

siis the original size of angle[i][.]. Thus the total space isPsi (for the original elements)

+Ps

i (for the propagated elements), which is O(n). (Note that there can be at most

one element in angle[i][.] that does not have a neighbor on the right to borrow a credit from. There are O(n) such elements in total, so this does not affect the space bound.)

Moreover, instead of performing binary search on each layer, we perform binary search only on the first layer to find the first extreme point, and then follow the pointers to the other ones. Specifically, assuming the j-th element of layer i represents the extreme point under some weighting vector w, and it points to the j0th element of layer i + 1, then we claim that the extreme point of layer i + 1 must be either the j0th element

or the (j0− 1)th element. For instance, assume the angle of the given vector w is 38◦.

The binary search on the first layer will tell us that the element p3, corresponding to the

angle interval (30, 45], is the extreme point. Then we follow its pointer to the element q3 on the second layer, corresponding to (30, 45]. The angle on its left is 30◦ < 38◦, so

q3 is the extreme point of the second layer. This in turn points to the element r5 on the

third layer, corresponding to (40, 45]. However the angle to its left is 40◦ > 38◦, so r4

is the extreme point on the third layer, corresponding to (30, 40].

In summary, finding all the extreme points now costs only O(log n + k), thus the time for the top-k query improves to O(log n + k log k).

20 angle node angle node angle node 10 30 45 60 70 90 p1 p2 p3 p4 p5 p6 5 25 75 90 q1 q2 q3 q4 15 20 30 40 45 80 90 r1 r2 r3 r4 r5 r6 r7 layer 1 layer 2 layer 3

(a) Original angle and node arrays. For example, in layer 2, the numbers5, 25, 75 and 90 repre- sent intervals[0, 5], (5, 25], (25, 75], (75, 90] re- spectively, andq1, q2, q3, q4are their correspond-

ing extreme points.

angle node angle node angle node layer 1 layer 2 layer 3 10 30 45 60 70 90 p1 p2 p3 p4 p5 p6 5 25 75 90 q1 q2 q3 q4 15 20 30 40 45 80 90 r1 r2 r3 r4 r5 r6 r7 15 15 q2 30 30 q3 45 45 q3 5 p1 25 p2 45 q3

(b) Extending the arrays. The elements with red border are selected to be merged above, and the shaded cells are the final positions of these selected elements from the array one level below.

angle node angle node angle node layer 1 layer 2 layer 3 10 30 45 60 70 90 p1 p2 p3 p4 p5 p6 5 25 75 90 q1 q2 q3 q4 15 20 30 40 45 80 90 r1 r2 r3 r4 r5 r6 r7 15 15 q2 30 30 q3 45 45 q3 5 p1 25 p2 45 q3

(c) Each element (except the ones on the last level) points to the proper selected element one level be- low.

2.2.3 An optimal algorithm

Note that the k objects reported are actually in non-increasing order of score, and that is why the k log k factor appears in the bound. If we want the results to be sorted, then the previous algorithm is suitable; otherwise, we can do better. We now propose an optimal algorithm which will output the top-k objects in arbitrary order within a query of O(log n + k). Our approach relies on the following result from [37].

Lemma 2.1. ([37]) Given m sorted arrays of reals, it is possible to find the ck-th smallest/largest real in O(m) time, where c≥ 1 is a constant.

In [37], the operation associated with Lemma 2.1 is called a “CUT”. Assume that the points on each layer are given in (say) clockwise order in an array. Then for a given weighting vector w, the corresponding extreme points on the layer partition the points of the layer into two sub-arrays that are sorted by score w.r.t. w. (See, for instance, the points colored red and green in Figure 2.4.) From our earlier discussion, we know that the top-k points w.r.t. w must belong to the first k or fewer layers. Using the approach in Section 2.2.2, we find the extreme points in the first k layers in O(k + log n) time and use these to create m = 2k sorted arrays of points from these layers. We then apply the CUT operation (Lemma 2.1) to identify the ck-th largest point (by score) in O(k) time. Next we scan each of the 2k sorted arrays by non-increasing score and identify a set, S, of points whose score is greater than or equal to the ck-th largest score; this takes O(k) time. Note that since c ≥ 1, S is a superset of the set of top-k points desired, and the size of S is just ck = O(k). We then run a standard selection algorithm [24] on S to find the point with the k-th largest score in O(k) time. Finally, we use this point to extract from S the desired top-k points in additional O(k) time. Thus, the overall time to answer the top-k query is O(log n + k), which improves upon the result in Section 2.2.2. (However, note that the points are now no longer reported in non-increasing order of score.)