Accommodating k-d Trees in the SDDS Model
Panayiotis Bozanis
Dept. Computer & Communication Engineering, University of Thessaly Glavani 37, Volos 382 21, Greece
e-mail: pbozanis@inf.uth.gr
Abstract
We propose a new scheme for accommodating pseudo k-d trees, random versions of the well known multidimensional search structures, in the SDDS model. Our scheme exhibits logarithmic update time, either constant or logarithmic search time for single key queries and output sensitive query time for range search query.
Keywords: Distributed Data Structures, Scalability, Network Computing, Binary Search Trees, Multi- dimensional Indexes.
1 Introduction
The last years, due to rapid evolution of network technology, the technological framework known as network computing emerged; that is, many powerful and low-priced workstations, interconnected over fast networks, create a pool of terabytes of RAM and even more of disc space [13]. This distributed storage, composed of RAM and disk devices, is provided for data file accommodation and maintenance. Every site in such a network is either a server managing data or a client requesting access to data. Additionally, a site can be either client or server or both. Every server provides a storage space of b data elements, termed bucket, to accommodate a part of the file under maintenance. Sites communicate by sending and receiving point-to-point messages. The underlying network is assumed error-free; in that way we can devote our concern to efficiency aspects only, i.e., on the number of the messages exchanged between the sites of the networks, irrespectively from the length of a message or the network topology.
This environment imposes the following constraints to the design and implementation of distributed al- gorithms and data structures, known as the Scalable Distributed Data Structure (SDDS) model : (a) they should expand to new servers gracefully, and only when servers already used are efficiently loaded; and (b) their access and maintenance operations never require atomic updates to multiple clients, while there is no centralized “access” site.
Litwin et al. introduced the SDDS model with the LH* data structure in the seminal paper [13]. Following
this work, there have been various kinds of SDDS proposals: RP* [14, 15], DRT [11], lazy k-d-Tree [17, 18],
DEH [4, 20], BDST [5, 6, 7], DSL [2], ADST [8, 9], LH* G [16] and LDT [3]. On the other hand, Kr¨oll and
Widmayer [12] showed that if all the hypotheses used to efficiently manage search structures in the single
processor case are carried over to a distributed environment, then a tight lower bound of Ω( √
n) holds for the height of balanced search trees.
In this paper we propose a new scheme for accommodating pseudo k-d trees, which are dynamic random versions of the well-known and thoroughly examined multi-dimensional data structure, in the SDDS model context, based on carefully distributed auxiliary ‘on-top’ tree structure information across servers. In that way we achieve performance one expects from logarithmic tree height, without dependence on data distribution nor node rebalancing operations. Additionally, one can decide whether he affords auxiliary logarithmic or linear to the number of servers space per server according to performance criteria. In section 2 we briefly discuss k-d trees and give the intuition of this work. Section 3 introduces the scheme and discusses its performance, while section 4 concludes our work.
2 Pseudo k-d Trees
In [1] Bentley proposed k-d trees as an improvement over quad-trees [10], the first data structure efficiently solving multi-dimensional search problems that he with Finkel had introduced. These structures are defined as binary trees, built recursively over a static set S of d-dimensional points, in the following way: A point p ∈ S is chosen to be associated with the root ρ of tree. The first coordinate p x
1of p divides S into two disjoint subsets S 1 , S 2 ; the first subset includes all points p 0 ∈ S such that p 0 x
1≤ p x
1while the second one consists of every point p 00 ∈ S with p 00 x
1> p x
1. S 1 is related with the left child of ρ and S 2 with the right one.
Then both subsets are split with respect to the second coordinate x 2 and so on. After splitting with respect to the d-th coordinate, we continue with the first coordinate. This procedure is repeated, using coordinates in a round-robin fashion, until each subset consists of at most one point. The resulting tree corresponds to a space partition which is used to decompose range queries into disjoint components that are served independently, searching the tree in a top-down fashion.
Since their introduction, several versions of the above structures were proposed. In this work we will adopt dynamic pseudo k-d trees [19], restricting the discussion, due to space constraints, to insertions only;
this is quite common in the SDDS literature. Insertions are performed as follows: After searching for the appropriate leaf l, corresponding to the finest (smallest) space subdivision that contains the point to be inserted, we make l internal node with two leaf-children. Using the appropriate coordinate for the splitting, one leaf accommodates the new point and the other the ‘occupant’ of l.
In order to use the scheme in the SDDS context we will expand the leaves into buckets so that they can accommodate, instead of just one point, at most b. This has the following consequence for the insertion operation: if the involved leaf l stores less than b points, then it accommodates the new point without any further alterations; else, we split the bucket, distributing evenly the ‘old’ points and the new one between l and a newly allocated bucket. In Figure 1(a) a range query on an instance of a 2-dimensional k-d tree is presented, while Figure 1(b) depicts an instance of an insertion causing leaf (bucket) splitting. Please observe that, by definition, the resulting auxiliary ‘routing’ binary tree is a random one.
This bucketed version of pseudo k-d trees was adjusted for the SDDS model by Nardelli et al in [17, 18],
treating both the cases of point-to-point and multicast messages availability. The first case corresponds to
A
A B
C
1
3 4
C
2 1
B
4 5
D