Accommodating k-d Trees in the SDDS Model

Panayiotis Bozanis

Dept. of Computer & Communication Engineering, University of Thessaly, Glavani 37, Volos 382 21, Greece

e-mail: pbozanis@inf.uth.gr

Abstract

We propose a new scheme for accommodating pseudo k-d trees, random versions of the well-known multidimensional search structures, in the SDDS model. Our scheme exhibits logarithmic update time, either constant or logarithmic search time for single-key queries, and output-sensitive query time for range search queries.

Keywords: Distributed Data Structures, Scalability, Network Computing, Binary Search Trees, Multi-dimensional Indexes.

1 Introduction

In recent years, due to the rapid evolution of network technology, the technological framework known as network computing has emerged; that is, many powerful and low-priced workstations, interconnected over fast networks, create a pool of terabytes of RAM and even more of disk space [13]. This distributed storage, composed of RAM and disk devices, is provided for data file accommodation and maintenance. Every site in such a network is a server managing data, a client requesting access to data, or both. Every server provides a storage space of b data elements, termed a bucket, to accommodate a part of the file under maintenance. Sites communicate by sending and receiving point-to-point messages. The underlying network is assumed error-free; in that way we can devote our attention to efficiency aspects only, i.e., to the number of messages exchanged between the sites of the network, irrespective of the length of a message or the network topology.

This environment imposes the following constraints on the design and implementation of distributed algorithms and data structures, known as the Scalable Distributed Data Structure (SDDS) model: (a) they should expand to new servers gracefully, and only when the servers already in use are efficiently loaded; and (b) their access and maintenance operations never require atomic updates to multiple clients, while there is no centralized “access” site.

Litwin et al. introduced the SDDS model with the LH* data structure in the seminal paper [13]. Following this work, there have been various kinds of SDDS proposals: RP* [14, 15], DRT [11], lazy k-d-Tree [17, 18], DEH [4, 20], BDST [5, 6, 7], DSL [2], ADST [8, 9], $LH^*_G$ [16] and LDT [3]. On the other hand, Kröll and Widmayer [12] showed that if all the hypotheses used to efficiently manage search structures in the single-processor case are carried over to a distributed environment, then a tight lower bound of $\Omega(\sqrt{n})$ holds for the height of balanced search trees.

In this paper we propose a new scheme for accommodating pseudo k-d trees, which are dynamic random versions of the well-known and thoroughly examined multi-dimensional data structure, in the SDDS model context, based on auxiliary ‘on-top’ tree structure information that is carefully distributed across servers. In that way we achieve the performance one expects from logarithmic tree height, without depending on the data distribution or on node rebalancing operations. Additionally, one can decide, according to performance criteria, whether to afford auxiliary space per server that is logarithmic or linear in the number of servers. In Section 2 we briefly discuss k-d trees and give the intuition of this work. Section 3 introduces the scheme and discusses its performance, while Section 4 concludes our work.

2 Pseudo k-d Trees

In [1] Bentley proposed k-d trees as an improvement over quad-trees [10], the first data structure to solve multi-dimensional search problems efficiently, which he had introduced together with Finkel. These structures are defined as binary trees, built recursively over a static set S of d-dimensional points, in the following way: A point p ∈ S is chosen to be associated with the root ρ of the tree. The first coordinate $p_{x_1}$ of p divides S into two disjoint subsets $S_1$, $S_2$; the first subset includes all points $p' \in S$ such that $p'_{x_1} \le p_{x_1}$, while the second one consists of every point $p'' \in S$ with $p''_{x_1} > p_{x_1}$. $S_1$ is associated with the left child of ρ and $S_2$ with the right one.

Then both subsets are split with respect to the second coordinate $x_2$ and so on. After splitting with respect to the d-th coordinate, we continue with the first coordinate. This procedure is repeated, using coordinates in a round-robin fashion, until each subset consists of at most one point. The resulting tree corresponds to a space partition which is used to decompose range queries into disjoint components that are served independently, searching the tree in a top-down fashion.
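The recursive construction and the top-down range search described above can be illustrated with a minimal Python sketch. It is not taken from [1]: the names KDNode, build and range_query are hypothetical, and picking the median point as the splitting point is an assumption made here for illustration (the text only requires that some point of the subset is chosen).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class KDNode:
    point: Point                      # point associated with this node
    axis: int                         # splitting coordinate (round-robin)
    left: Optional["KDNode"] = None   # points with coordinate <= point[axis]
    right: Optional["KDNode"] = None  # points with coordinate >  point[axis]


def build(points: Sequence[Point], depth: int = 0) -> Optional[KDNode]:
    """Build a static k-d tree; the median choice is an assumption."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda q: q[axis])
    mid = len(pts) // 2
    # Advance past equal keys so that every point to the right is strictly larger.
    while mid + 1 < len(pts) and pts[mid + 1][axis] == pts[mid][axis]:
        mid += 1
    return KDNode(pts[mid], axis,
                  build(pts[:mid], depth + 1),
                  build(pts[mid + 1:], depth + 1))


def range_query(node: Optional[KDNode], lo: Point, hi: Point) -> List[Point]:
    """Report all points p with lo[i] <= p[i] <= hi[i] on every axis i."""
    if node is None:
        return []
    out: List[Point] = []
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    a = node.axis
    if lo[a] <= node.point[a]:        # query overlaps the left half-space
        out += range_query(node.left, lo, hi)
    if hi[a] > node.point[a]:         # query overlaps the right half-space
        out += range_query(node.right, lo, hi)
    return out
```

A subtree is visited only if its half-space intersects the query rectangle, which is what allows the query to be decomposed into independently served components.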

Since their introduction, several versions of the above structures have been proposed. In this work we adopt dynamic pseudo k-d trees [19], restricting the discussion, due to space constraints, to insertions only; this is quite common in the SDDS literature. Insertions are performed as follows: After searching for the appropriate leaf l, corresponding to the finest (smallest) space subdivision that contains the point to be inserted, we make l an internal node with two leaf-children. Using the appropriate coordinate for the splitting, one leaf accommodates the new point and the other the ‘occupant’ of l.

In order to use the scheme in the SDDS context we expand the leaves into buckets, so that each can accommodate at most b points instead of just one. This has the following consequence for the insertion operation: if the involved leaf l stores fewer than b points, then it accommodates the new point without any further alterations; otherwise, we split the bucket, distributing the ‘old’ points and the new one evenly between l and a newly allocated bucket. In Figure 1(a) a range query on an instance of a 2-dimensional k-d tree is presented, while Figure 1(b) depicts an instance of an insertion causing a leaf (bucket) split. Please observe that, by definition, the resulting auxiliary ‘routing’ binary tree is a random one.
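The bucketed insertion can be sketched in Python as follows. This is only an illustration under stated assumptions: Node, insert and leaves are hypothetical names, the split value is taken as the median coordinate of the overflowing bucket, and the overflowing leaf is turned into an internal node with two fresh leaves, which is logically the same as keeping l and allocating one new bucket. The fallback for points sharing the same split coordinate is likewise an assumption, not part of the scheme.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    depth: int = 0
    points: List[Point] = field(default_factory=list)  # bucket content (leaves only)
    axis: Optional[int] = None     # splitting axis (internal nodes only)
    split: Optional[float] = None  # splitting value (internal nodes only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.axis is None


def insert(root: Node, p: Point, b: int) -> None:
    """Insert p into a bucketed pseudo k-d tree with bucket capacity b."""
    node = root
    while not node.is_leaf():      # descend to the leaf whose extent contains p
        node = node.left if p[node.axis] <= node.split else node.right
    node.points.append(p)
    if len(node.points) <= b:      # the bucket had room: nothing else to do
        return
    axis = node.depth % len(p)     # round-robin choice of the splitting axis
    pts = sorted(node.points, key=lambda q: q[axis])
    split = pts[(len(pts) - 1) // 2][axis]
    left_pts = [q for q in pts if q[axis] <= split]
    right_pts = [q for q in pts if q[axis] > split]
    if not right_pts:
        # Assumed fallback: all points share this coordinate, so the bucket
        # stays oversized until a later insertion makes a split possible.
        return
    node.axis, node.split, node.points = axis, split, []
    node.left = Node(node.depth + 1, left_pts)
    node.right = Node(node.depth + 1, right_pts)


def leaves(node: Node) -> List[Node]:
    """Collect the leaves (buckets) of the tree from left to right."""
    if node.is_leaf():
        return [node]
    return leaves(node.left) + leaves(node.right)
```

No rebalancing is performed, so the shape of the routing tree depends entirely on the insertion order, in line with the observation that the tree is a random one.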

This bucketed version of pseudo k-d trees was adjusted for the SDDS model by Nardelli et al. in [17, 18], treating both the case where only point-to-point messages are available and the case where multicast messages are available. The first case corresponds to maintaining auxiliary index information at clients as well as at servers, while the second one only at clients. In this work we will consider the first case. Due to their resemblance to DRT [11], the distributed k-d trees of [17, 18] present the same worst-case performance, i.e., updates and queries can exhibit complexity linear in the number of servers. We tackle this using a node clustering technique introduced in [3].

Figure 1: An instance of pseudo k-d tree: (a) range query, (b) insertion of a new point and bucket split

3 Distribution Scheme for k-d Trees

Distribution. Due to lack of space we assume familiarity with the standard approach in the SDDS model. A portion of the global structure is stored at each server. A client maintains its own view of the structure, initially coarse and partial, which improves as it issues more and more requests.

Servers. As previously explained, leaf nodes represent buckets capable of holding up to b data items lying in the corresponding multi-dimensional extent of the leaves. Internal nodes and leaves are associated with the server managing them. A node or a leaf is associated with only one server. In contrast, a server can hold several nodes, but only one leaf. Each node stores its extent, in the form of a multi-dimensional rectangle, together with parent and child pointers.

Initially, the structure maintains only one bucket $b_1$ belonging to server $s_1$. Whenever $b_1$ overflows, it splits in two; the new bucket $b_2$ and the new father node can be assigned to a new server $s_2$, as [18] does. As long as insertions keep coming, we have bucket splits and new internal nodes. The evolving ‘on-top’ search tree T is a random one, which means that its worst-case height is linear in the number of leaves (i.e., servers). Logarithmic height can be enforced by periodic reorganization (rebalancing) [19]. However, if one adapts this technique to a distributed environment, the bounds achieved are only amortized. As a matter of fact, there would be time periods during which the underlying network would experience a number of reorganization messages linear in the number of servers. An alternative way to manipulate a random tree T is to carefully restrict which servers receive auxiliary routing information. This can be achieved by either introducing super-nodes with fan-out size exponential in the super-level [8, 9] or building a logarithmic-height auxiliary search tree T' on top of it [3]. We will follow the second approach since in the first treatment a super-node, and so the server managing it, can have, in the worst case, between $\sqrt{n}$ and n children.

Figure 2: (a) An instance of distributed k-d tree, (b) after insertion of a new point and bucket split. Both panels show the tree T and the auxiliary tree T'.

Before we discuss the scheme we need the following definitions: (i) let r be a rectangle split w.r.t. the $x_i$-axis into two (disjoint) rectangles $r_1$, $r_2$; i.e., $r = r_1 \cup r_2$. We say that $x_i$ is the definition axis of $r_1$, $r_2$; and (ii) let $r_1$, $r_2$ be two multi-dimensional rectangles of the same definition axis $x_i$. We consider that $r_1 \preceq r_2$ if their projections onto the $x_i$-axis satisfy the relation $\pi_{x_i}(r_1) \le_{x_i} \pi_{x_i}(r_2)$. That is, either $\pi_{x_i}(r_1)$ is contained in $\pi_{x_i}(r_2)$ or the upper bound $\pi_{x_i}(r_1)^u$ of the first projected interval is less than or equal to the lower bound $\pi_{x_i}(r_2)^l$ of the second one. Please observe that $\preceq$ defines a partial order over the universe of multi-dimensional rectangles.
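As a small illustration, the ordering relation between rectangles defined above (written $\preceq$ here) can be checked with a few lines of Python. The representation of a rectangle as one (lower, upper) interval per axis and the names project and precedes are assumptions of this sketch; it also reads the second bound in the definition as the lower bound of the projection of the second rectangle.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]   # (lower bound, upper bound) on one axis
Rect = Sequence[Interval]        # one interval per axis


def project(r: Rect, axis: int) -> Interval:
    """Projection of rectangle r onto the given axis."""
    return r[axis]


def precedes(r1: Rect, r2: Rect, axis: int) -> bool:
    """True if r1 precedes r2 w.r.t. their common definition axis.

    Either the projection of r1 is contained in that of r2, or the upper
    bound of the first projection does not exceed the lower bound of the
    second one.
    """
    l1, u1 = project(r1, axis)
    l2, u2 = project(r2, axis)
    contained = l2 <= l1 and u1 <= u2
    before = u1 <= l2
    return contained or before
```

For two rectangles obtained by splitting a parent along the x-axis, for example [(0, 2), (0, 10)] and [(2, 5), (0, 10)], the first precedes the second but not vice versa.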

So, on the reverse topological order defined by $\preceq$ over the nodes of the original tree T (cf. Figure 2), we can merge adjacent nodes whose extents respect the $\preceq$ relation, level by level, until we come up with just one node. Whenever we want to insert a new leaf f' due to the overflow of an old one f, we first add it, and then we follow the upward path from f in T', merging nodes so that the per-level reverse topological node sorting is maintained. Here, we must note that this scheme constitutes a multi-dimensional extension of [3]. All newly generated nodes are stored in the new server s' allocated to leaf f'. And so the auxiliary per-server information of these ‘super’ nodes has logarithmic space complexity. Therefore,

Lemma 1. The insert operation costs O(log n) messages in the worst case. □

Clients. As the SDDS model dictates, each client maintains local data reflecting its own view of the distributed data structure. Specifically, it keeps node associations and their extents as they were the last time it accessed the data structure. Based on its local view, it issues the desired operation to the most appropriate server. If the local information is valid, the client will receive the answer immediately. Otherwise, the scheme forwards the request to the pertinent server and informs the client about the part of the data structure visited during the routing. The information is piggybacked on the answer, adopting approaches that are standard in the SDDS context (for example, cf. [17, 18]).

Search Operation. The client searches its local data for the most appropriate server(s), i.e., the server(s) owning the bucket whose multi-dimensional extent intersects the query rectangle or, in the case of a simple single-element search, contains the data element. When a server receives a search request that it can serve, it answers the client appropriately. Otherwise, it checks the portion of T' it owns in order to forward the request, along with local information, down to a ‘descendant’ or up to an ‘ancestor’ server w.r.t. T'. When the client receives the answer and the traversed portion of T' in the form of piggybacked correction information, it can refine the private index it maintains, after using, for example, the logical volumes of [18] for the termination check. So, following the previous discussion, it can be proved that

Lemma 2. The single-element search operation costs O(log n) messages, while a range query costs O(log n + k) messages, where k is the number of buckets holding query results. □
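The interplay between a client's possibly outdated view and the piggybacked corrections can be illustrated with a deliberately simplified Python sketch. Everything in it is hypothetical: the Server and Client classes, the half-open extents, and in particular the forwarding step, which is collapsed into a single hop standing in for the walk through T' (so the sketch does not model the O(log n) routing itself, only the correction of the client's view).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Rect = Tuple[Tuple[float, float], ...]   # one half-open [low, high) interval per axis


def contains(r: Rect, p: Point) -> bool:
    return all(lo <= c < hi for (lo, hi), c in zip(r, p))


@dataclass
class Server:
    sid: int
    extent: Rect                          # extent of the single bucket it holds
    bucket: List[Point] = field(default_factory=list)


class Client:
    def __init__(self, image: Dict[int, Rect]) -> None:
        self.image = dict(image)          # local view: server id -> extent

    def search(self, servers: Dict[int, Server], p: Point) -> Tuple[bool, int]:
        """Return (found, number of messages) for a single-element search."""
        # Choose the server whose extent contains p in the local view.
        target = next((sid for sid, r in self.image.items() if contains(r, p)),
                      next(iter(self.image)))
        messages = 1                      # request: client -> chosen server
        server = servers[target]
        corrections = {server.sid: server.extent}
        if not contains(server.extent, p):
            # Assumed stand-in for routing through T': a single forwarding hop.
            server = next(s for s in servers.values() if contains(s.extent, p))
            corrections[server.sid] = server.extent
            messages += 1
        messages += 1                     # answer carrying the corrections
        self.image.update(corrections)    # refine the private index
        return p in server.bucket, messages
```

With an initially coarse view the first request is misdirected and forwarded; after the corrections are applied, a repeated request for the same point reaches the right server directly.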

As a matter of fact, if one additionally spends O(n log n) replication information or, equivalently, O(n) extra auxiliary information per allocated server, then the bounds of the above lemmas can be reduced to O(1) and O(k) respectively, by employing extent replication the way [3, 8, 9] do. Omitting the details due to lack of space, we then have:

Lemma 3. The single-element search operation costs O(1) messages, while the range search operation costs O(k) messages, where k is the number of buckets holding query results. □

4 Conclusions

In this paper we proposed a scheme for distributed k-d trees, alternative to that of [17, 18], employing simple operations and exhibiting design-level trade-offs. Our future plans include the extension of the approach to the case of quad-trees [10] which, unlike k-d trees, are defined as recursively built $2^d$-ary trees that are, in the general case, highly unbalanced.

References

[1] J.L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Comm. ACM, 18:509–517, 1975.

[2] P. Bozanis, Y. Manolopoulos. DSL: Accommodating Skip Lists in the SDDS Model. In Proceedings of 3rd Workshop on Distributed Data and Structures (WDAS’00), 2000.

[3] P. Bozanis, Y. Manolopoulos. LDT: A Logarithmic Distributed Search Tree. In Proceedings of 4th Workshop on Distributed Data and Structures (WDAS’02), 2002.

[4] R. Devine. Design and Implementation of DDH: Distributed Dynamic Hashing. In Proceedings of the 4th Intl. Conference on Foundations of Data Organization and Algorithms (FODO’93), 1993.

[5] A. di Pasquale, E. Nardelli. Balanced and Distributed Search Trees. In Proceedings of the DIMACS Workshop on Distributed Data and Structures (WDAS’99), 1999.

[6] A. di Pasquale, E. Nardelli. Distributed Searching of k-dimensional Data With almost Constant Costs. In Symposium on Advances in Databases and Information Systems (ADBIS’00), 2000.

[7] A. di Pasquale, E. Nardelli. An Amortized Lower Bound for Distributed Searching of k-dimensional Data. In Proceedings of 3rd Workshop on Distributed Data and Structures (WDAS’00), 2000.

[8] A. di Pasquale, E. Nardelli. A Very Efficient Order Preserving Scalable Distributed Data Structure. In Database and Expert Systems Applications, 12th International Conference (DEXA’01), 2001.

[9] A. di Pasquale, E. Nardelli. ADST: An Order Preserving Scalable Distributed Data Structure with Constant Access Costs. In 28th Conference on Current Trends in Theory and Practice of Informatics (SOFSEM’01), 2001.

[10] R.A. Finkel, J.L. Bentley. Quad-Trees: a Data Structure for Retrieval on Composite Keys. Acta Inform., 4:1–9, July 1974.

[11] B. Kröll, P. Widmayer. Distributing a Search Tree Among a Growing Number of Processors. In Proceedings of the 1994 ACM SIGMOD Intl. Conference on Management of Data (SIGMOD’94), 1994.

[12] B. Kröll, P. Widmayer. Balanced Distributed Search Trees do not Exist. In Proceedings of the 4th Intl. Workshop on Algorithms and Data Structures (WADS’95), 1995.

[13] W. Litwin, M.-A. Neimat, D. A. Schneider. LH*-Linear Hashing for Distributed Files. ACM Transactions on Database Systems (ACM TODS), 21(4):480–525, December 1996.

[14] W. Litwin, M.-A. Neimat, D. A. Schneider. RP*-A Family of Order-Preserving Scalable Distributed Data Structures. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94), 1994.

[15] W. Litwin, D. A. Schneider. $RP^*_{RS}$ - A High Availability Scalable Distributed Data Structure using Reed Solomon Codes. In Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD’00), 2000.

[16] W. Litwin, T. Risch. $LH^*_G$ - A High-Availability Scalable Distributed Data Structure By Record Grouping. Transactions on Knowledge and Data Engineering (TKDE), 14(4):923–927, 2002.

[17] E. Nardelli. Distributed k-d Trees. In Proceedings of the XVI Intl. Conference of the Chilean Computer Science Society (SCCC’96), 1996.

[18] E. Nardelli, F. Barillari, M. Pepe. Distributed Searching of Multidimensional Data: a Performance Evaluation Study. Journal of Parallel and Distributed Computing (JPDC), 49(1):111–134, March 1998.

[19] M.H. Overmars. The Design of Dynamic Data Structures. Springer-Verlag, Berlin, Heidelberg, 1987.

[20] R. Vingralek, Y. Breitbart, G. Weikum. Distributed File Organization with Scalable Cost/Performance. In Proceedings of the 1994 ACM SIGMOD Intl. Conference on Management of Data (SIGMOD’94), 1994.
