SPATIAL JOIN

Biplob Kumar Debnath

Department of Electrical and Computer Engineering, University of Minnesota

SYNONYMS

Intersect Join, Distance Join

DEFINITION

The spatial join operation combines two or more datasets with respect to a spatial predicate. The predicate can be a combination of directional, distance, and topological spatial relations. In a nonspatial join, the joining attributes must be of the same type, but in a spatial join they can be of different types. Usually each spatial attribute is represented by its minimum bounding rectangle (MBR).

A typical example of a spatial join query is “Find all pairs of rivers and cities that intersect”. For example, in Figure 1, the result of the join between the set of rivers {R1, R2} and the set of cities {C1, C2, C3, C4, C5} is {(R1, C1), (R2, C5)}.

Figure 1: Example of a spatial join

HISTORICAL BACKGROUND

The first known technique to solve the spatial join operation was a grid based technique developed by Orenstein in 1986. The technique uses a grid to divide multidimensional spaces into smaller blocks known as pixels. A z-ordering is applied to order the pixels.

Each object is approximated by the pixels that intersect its MBR. Because the pixels are ordered by z-ordering, each object is represented by a set of z-values, which are one-dimensional. Any one-dimensional index (e.g., a B+-tree) can then be used to sort them, and the spatial join is performed as a sort-merge join. The performance of this technique depends solely on the granularity of the grid. Finer grids give more accurate results, but they consume more memory. To remedy this problem, multidimensional indices (e.g., the R-tree), which can handle spatial data directly, were devised. Various newer spatial join algorithms (e.g., R-tree join, sort and match, spatial hash join, slot index spatial join) are based on the multidimensional index approach.
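The z-ordering idea can be made concrete with a small sketch. It assumes a uniform two-dimensional grid with non-negative coordinates; the function names `z_value` and `z_values_for_mbr`, the `cell_size` parameter, and the 16-bit limit are illustrative choices, not part of Orenstein's original formulation.

```python
def z_value(col: int, row: int, bits: int = 16) -> int:
    """Interleave the bits of (col, row) to get a pixel's position on the z-curve."""
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)      # column bits go to even positions
        z |= ((row >> i) & 1) << (2 * i + 1)  # row bits go to odd positions
    return z


def z_values_for_mbr(mbr, cell_size: float, bits: int = 16):
    """Return the sorted z-values of all grid pixels overlapped by an MBR.

    mbr is (xmin, ymin, xmax, ymax); coordinates are assumed non-negative.
    """
    xmin, ymin, xmax, ymax = mbr
    c0, c1 = int(xmin // cell_size), int(xmax // cell_size)
    r0, r1 = int(ymin // cell_size), int(ymax // cell_size)
    return sorted(
        z_value(c, r, bits)
        for c in range(c0, c1 + 1)
        for r in range(r0, r1 + 1)
    )
```

Two objects are join candidates when their z-value sets share at least one value, which is what a sort-merge over the one-dimensional values detects. A smaller `cell_size` yields fewer false candidates but more z-values per object, which is the accuracy versus memory trade-off described above.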

KEY CONCEPTS

A spatial join involves two steps. In the first step, known as the filter step, the tuples whose MBRs overlap the MBR of the query region are determined. This step is not computationally expensive, since at most four comparisons are required to determine whether two rectangles intersect. The second step is called the refinement step. The tuples that pass the filter step are fed to the refinement step, where the exact spatial representations are used and the spatial predicate is checked on them. The refinement step is computationally expensive, but it processes fewer tuples because of the initial filter step. Figure 2 illustrates an example of the filter-refine strategy.

Figure 2: The filter-refine strategy (diagram labels: data objects A, B, C; MBR; query region; FILTER; REFINE)
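A minimal sketch of the filter-refine strategy follows. It assumes, purely for illustration, that the data objects are circles given as (cx, cy, radius) and that the query region is an axis-aligned rectangle; the helper names and the sample coordinates are hypothetical and are not taken from Figure 2.

```python
import math
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Circle = Tuple[float, float, float]       # (cx, cy, radius), an assumed geometry type


def mbrs_intersect(a: Rect, b: Rect) -> bool:
    """Filter test: four comparisons decide whether two rectangles overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def circle_mbr(c: Circle) -> Rect:
    cx, cy, r = c
    return (cx - r, cy - r, cx + r, cy + r)


def circle_hits_rect(c: Circle, rect: Rect) -> bool:
    """Refinement test on the exact geometry: distance from the circle's
    center to the nearest point of the rectangle must not exceed the radius."""
    cx, cy, r = c
    nx = min(max(cx, rect[0]), rect[2])
    ny = min(max(cy, rect[1]), rect[3])
    return math.hypot(cx - nx, cy - ny) <= r


def filter_refine(objects: Dict[str, Circle], query: Rect) -> Tuple[List[str], List[str]]:
    # Filter step: cheap MBR test, may admit false hits.
    candidates = [k for k, c in objects.items() if mbrs_intersect(circle_mbr(c), query)]
    # Refinement step: exact test, applied only to the candidates.
    results = [k for k in candidates if circle_hits_rect(objects[k], query)]
    return candidates, results


# Made-up sample data.
objects = {"A": (5.0, 5.0, 1.0), "B": (12.0, 12.0, 2.5), "C": (20.0, 20.0, 1.0)}
candidates, results = filter_refine(objects, (0.0, 0.0, 10.0, 10.0))
```

In this sample, object B passes the filter because a corner of its MBR overlaps the query region, but the refinement step rejects it; object C is discarded by the filter alone.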

Spatial join algorithms can be classified into three categories: nested loop, tree matching, and partition based. For the discussion below, we assume that we want to perform a spatial join on relations R1 and R2. The focus is only on the intersection join. The same techniques can be extended to other join variants (e.g., distance join).

Nested Loop

In this algorithm, all the tuples of R2 are scanned for each tuple of R1. Any pair of tuples of R1 and R2 which satisfies the spatial join predicate is added to the result. The basic algorithm follows:

for all tuples r1 ∈ R1
    for all tuples r2 ∈ R2
        if pair (r1, r2) satisfies the spatial join predicate
            add <r1, r2> to result

Here, R1 is the outer relation and R2 is the inner relation. If an index is available on one of the relations, we can make that relation the inner one; in this case, we need not scan the entire inner relation. The nested loop algorithm works well when the relations are small and no indices are present.
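The nested loop algorithm can be sketched in a few lines. The sketch below assumes each relation is a dictionary from an object id to an MBR and uses MBR intersection as the join predicate; the river and city coordinates are invented for illustration and are not read off Figure 1.

```python
from typing import Callable, Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbrs_intersect(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def nested_loop_join(
    r1: Dict[str, Rect],
    r2: Dict[str, Rect],
    predicate: Callable[[Rect, Rect], bool],
) -> List[Tuple[str, str]]:
    """Scan every tuple of the inner relation r2 for each tuple of the outer relation r1."""
    result = []
    for id1, geom1 in r1.items():        # outer relation
        for id2, geom2 in r2.items():    # inner relation
            if predicate(geom1, geom2):
                result.append((id1, id2))
    return result


# Hypothetical MBRs.
rivers = {"R1": (0, 0, 10, 1), "R2": (0, 5, 10, 6)}
cities = {
    "C1": (2, 0.5, 3, 2),
    "C2": (4, 2, 5, 3),
    "C3": (6, 3, 7, 4),
    "C4": (8, 7, 9, 8),
    "C5": (1, 5.5, 2, 6.5),
}
pairs = nested_loop_join(rivers, cities, mbrs_intersect)
```

The cost is proportional to |R1| × |R2| predicate evaluations, which is why the approach is only attractive for small inputs.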

Tree Matching

The tree matching algorithm can be applied when indices are available on both relations. For this discussion, we assume that an R-tree index is available. In an R-tree, every node entry has the form <ref, rect>, where ref is a pointer to a child node or to a spatial object, and rect is the MBR of that child node or spatial object. The pages which contain leaf nodes are called data pages, and the pages which contain non-leaf nodes are called directory pages. Since each directory entry contains the MBR of all entries in its child node, if the MBRs of two directory entries Er1 and Er2 are disjoint, then there can be no match between the entries in their subtrees. If they are not disjoint, there may be a match between the entries, so we have to traverse deeper into the trees to find the matching tuples.

The basic algorithm follows:

Spatial_Join (R1, R2)   // R1 and R2 are R-tree nodes
    for all Er1 ∈ R1
        for all Er2 ∈ R2
            if (Not_Disjoint(Er1.rect, Er2.rect))
                if (R1 and R2 are leaf pages)
                    if pair (Er1, Er2) satisfies the spatial join predicate
                        add <Er1, Er2> to result
                else if (R1 is a leaf page)
                    Read_Page (Er2.ptr)
                    Spatial_Join (Er1.ptr, Er2.ptr)
                else if (R2 is a leaf page)
                    Read_Page (Er1.ptr)
                    Spatial_Join (Er1.ptr, Er2.ptr)
                else
                    Read_Page (Er1.ptr)
                    Read_Page (Er2.ptr)
                    Spatial_Join (Er1.ptr, Er2.ptr)

Figure 3 illustrates the tree matching spatial join algorithm. At the beginning of the join operation, Spatial_Join() receives the root nodes of the two R-trees as parameters. A1 intersects only with B1 but not with B2. Similarly, A2 intersects only with B2 but not with B1. Thus the qualifying entry pairs at the root level are (A1, B1) and (A2, B2). Next, Spatial_Join() is called recursively for the qualifying pairs (A1, B1) and (A2, B2). This process continues until the leaf level is reached. Finally, the pairs (a1, b1) and (a3, b3) are returned as the output.

Figure 3: Two datasets indexed by R-trees (tree A: directory entries A1, A2 over leaf entries a1 to a4; tree B: directory entries B1, B2 over leaf entries b1 to b4)
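The synchronized traversal can be sketched as follows. This is a simplified illustration, not the algorithm as implemented in any particular system: it assumes in-memory nodes (so page reads are omitted), assumes both trees have the same height, and reports leaf pairs whose MBRs intersect (the filter step only). The `Node` class and all coordinates are hypothetical; they were chosen so that the outcome mirrors the walk-through above.

```python
from dataclasses import dataclass
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class Node:
    is_leaf: bool
    entries: list  # (rect, child Node) in directory pages, (rect, object id) in data pages


def not_disjoint(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def mbr_of(node: Node) -> Rect:
    rects = [rect for rect, _ in node.entries]
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def spatial_join(n1: Node, n2: Node, result: list) -> list:
    for rect1, ref1 in n1.entries:
        for rect2, ref2 in n2.entries:
            if not not_disjoint(rect1, rect2):
                continue  # disjoint MBRs: nothing below can match
            if n1.is_leaf and n2.is_leaf:
                result.append((ref1, ref2))
            elif not n1.is_leaf and not n2.is_leaf:
                spatial_join(ref1, ref2, result)  # descend both trees together
            else:
                raise ValueError("this sketch assumes trees of equal height")
    return result


# Hypothetical geometry for the entries named in the text.
A1 = Node(True, [((0, 0, 2, 2), "a1"), ((3.5, 0, 4, 0.5), "a2")])
A2 = Node(True, [((10, 10, 12, 12), "a3"), ((13, 13, 14, 14), "a4")])
B1 = Node(True, [((1, 1, 3, 3), "b1"), ((0, 4, 1, 5), "b2")])
B2 = Node(True, [((11, 11, 13, 12.5), "b3"), ((15, 10, 16, 11), "b4")])
root_a = Node(False, [(mbr_of(A1), A1), (mbr_of(A2), A2)])
root_b = Node(False, [(mbr_of(B1), B1), (mbr_of(B2), B2)])
pairs = spatial_join(root_a, root_b, [])
```

With these coordinates only the directory pairs (A1, B1) and (A2, B2) qualify at the root level, so the subtrees under (A1, B2) and (A2, B1) are never visited.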

When an index exists for only one relation, an index on the other relation is built on the fly and the tree matching technique is then applied. Commercial database systems, for example Oracle and Informix, implement a variant of the R-tree matching algorithm for the spatial join operation.

Partition-Based Spatial Merge Join

In this approach, the relations are divided into p partitions if both of them cannot be contained in main memory. After that, partition i of R1, where 1 ≤ i ≤ p, is compared with the corresponding partition i of R2. We briefly go through the filter step of this algorithm:

1. For each tuple in R1 and R2, form new relations R1’ and R2’, where each tuple consists of the unique object id of the original tuple and the MBR of its joining attribute.

2. If both R1’ and R2’ fit in main memory, we can process the join directly using a plane-sweep algorithm.

3. If R1’ and R2’ do not both fit in main memory, we partition both relations into p parts (R1’_1, ..., R1’_p and R2’_1, ..., R2’_p) such that every partition pair (R1’_i, R2’_i) fits in main memory. In addition, we make sure that, for each R1’_i, any overlapping tuples in R2’ reside in partition R2’_i. Now we can apply a plane-sweep algorithm to each partition pair.
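The in-memory plane-sweep used in steps 2 and 3 can be sketched as below. This is one common formulation (sort both inputs by the left edge of the MBR and sweep along the x-axis), written under the assumption that each relation is a list of (object id, MBR) tuples; the function name and sample data are illustrative only.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Entry = Tuple[str, Rect]                  # (object id, MBR)


def y_overlap(a: Rect, b: Rect) -> bool:
    return a[1] <= b[3] and b[1] <= a[3]


def plane_sweep_join(r1: List[Entry], r2: List[Entry]) -> List[Tuple[str, str]]:
    """Report all (id1, id2) pairs whose MBRs intersect, each exactly once."""
    a = sorted(r1, key=lambda e: e[1][0])  # sort by xmin
    b = sorted(r2, key=lambda e: e[1][0])
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][1][0] <= b[j][1][0]:
            # The sweep line stops at a[i]; scan entries of b that start
            # before a[i] ends along x, then test the y-extent.
            ida, ra = a[i]
            k = j
            while k < len(b) and b[k][1][0] <= ra[2]:
                if y_overlap(ra, b[k][1]):
                    out.append((ida, b[k][0]))
                k += 1
            i += 1
        else:
            # Symmetric case: the sweep line stops at b[j].
            idb, rb = b[j]
            k = i
            while k < len(a) and a[k][1][0] <= rb[2]:
                if y_overlap(a[k][1], rb):
                    out.append((a[k][0], idb))
                k += 1
            j += 1
    return out


# Made-up sample data.
r1 = [("a", (0, 0, 2, 2)), ("b", (5, 5, 6, 6))]
r2 = [("x", (1, 1, 3, 3)), ("y", (5.5, 0, 7, 1)), ("z", (4, 4, 5.5, 5.5))]
pairs = plane_sweep_join(r1, r2)
```

Because both lists are sorted on xmin, each inner scan stops as soon as an MBR starts to the right of the current one, so most non-overlapping pairs are never compared.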

This strategy works very well when no indices are present on either relation and the relations are too big to fit in main memory. One of its variants is the spatial hash join.

The spatial hash join strategy is inspired by the hash join algorithm of relational databases. It works very well when no indices are present on either relation. First, relation R1 is partitioned into K buckets. The value of K is determined by system parameters. Sampling is used to determine the initial extents of the buckets. Because sampling is used, the partitions are not of equal size. Each object is inserted into the bucket that is smallest at the time of insertion. Relation R2 is hashed into buckets using the same extents as R1, but a different strategy is followed to insert the objects of R2: an object of R2 is inserted into every bucket that it intersects. As a result, there can be multiple copies of an object of R2. An object of R2 that does not intersect any bucket is not inserted at all. Finally, the corresponding partitions are compared to find the qualifying objects.
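The bucket handling described above can be sketched as follows. The sketch is deliberately simplified and makes several assumptions that are not in the text: buckets start empty instead of being seeded by sampling, "smallest" is interpreted as "holding the fewest objects", a bucket's extent is grown to cover every R1 object placed in it, and each bucket pair is joined with a plain nested loop. All names are hypothetical.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Entry = Tuple[str, Rect]                  # (object id, MBR)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


class Bucket:
    def __init__(self) -> None:
        self.extent: Optional[Rect] = None  # assumption: no sampling-based seed
        self.r1: List[Entry] = []
        self.r2: List[Entry] = []


def spatial_hash_join(r1: List[Entry], r2: List[Entry], k: int) -> List[Tuple[str, str]]:
    buckets = [Bucket() for _ in range(k)]
    # Partition R1: each object goes into exactly one bucket.
    for oid, mbr in r1:
        b = min(buckets, key=lambda bk: len(bk.r1))
        b.r1.append((oid, mbr))
        b.extent = mbr if b.extent is None else union(b.extent, mbr)
    # Partition R2: an object is copied into every bucket whose extent it
    # intersects, and dropped if it intersects none.
    for oid, mbr in r2:
        for b in buckets:
            if b.extent is not None and intersects(b.extent, mbr):
                b.r2.append((oid, mbr))
    # Join corresponding partitions.
    out = []
    for b in buckets:
        out.extend((i1, i2) for i1, m1 in b.r1 for i2, m2 in b.r2 if intersects(m1, m2))
    return out


# Made-up sample data.
r1 = [("p", (0, 0, 1, 1)), ("q", (10, 10, 11, 11)), ("r", (0.5, 0.5, 2, 2))]
r2 = [("u", (0.8, 0.8, 1.2, 1.2)), ("v", (20, 20, 21, 21)), ("w", (10.5, 10.5, 12, 12))]
pairs = spatial_hash_join(r1, r2, k=2)
```

No duplicate result pairs arise because each R1 object lives in exactly one bucket, even though an R2 object may be replicated. A practical implementation would choose extents with spatial locality in mind, which this sketch does not attempt.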

KEY APPLICATIONS

Spatial joins and their variants are used to analyze data in applications such as data mining and clustering. For example, given two sets of spatial objects R and S and a distance function f(), the ε-distance join query returns the pairs of spatial objects {(r, s): r ∈ R, s ∈ S, f(r, s) ≤ ε}.

The closest pair query returns the set of closest pairs {(r, s): r ∈ R, s ∈ S, f(r, s) ≤ f(r’, s’) ∀ r’ ∈ R, s’ ∈ S}. The all k-nearest neighbor query returns the k nearest neighbors from S for each object in R. The iceberg distance join, given an integer t and a real number ε, retrieves all pairs of objects {(r, s): r ∈ R, s ∈ S, f(r, s) ≤ ε and r appears at least t times}. An example of this is: “Find all objects from R which are at most 5 km away from at least 5 objects of S”.

Figure 4: Example of some variants of spatial join (hotels h1, h2; restaurants r1 to r4; distance threshold ε)

Figure 4 illustrates different spatial join variants. It consists of a set of hotels {h1, h2} and a set of restaurants {r1, r2, r3, r4}. The ε-distance join query returns the pairs {(h1, r1), (h1, r2), (h2, r3), (h2, r4)}. The closest pair query returns the pair (h1, r1). The all 1-nearest neighbor query returns {(h1, r1), (h2, r4)}. For t = 3, the iceberg distance join returns h2, as it is the only hotel that has 3 restaurants within a distance of ε.
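The join variants defined in this section can be written down directly as brute-force queries over point sets. The sketch below is meant only to make the definitions concrete; it uses Euclidean distance for f(), and the hotel and restaurant coordinates are invented, so the results are not those of Figure 4.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def eps_distance_join(R: Dict[str, Point], S: Dict[str, Point], eps: float) -> List[Tuple[str, str]]:
    """All pairs (r, s) with f(r, s) <= eps."""
    return [(r, s) for r in R for s in S if math.dist(R[r], S[s]) <= eps]


def closest_pair(R: Dict[str, Point], S: Dict[str, Point]) -> Tuple[str, str]:
    """A pair (r, s) with minimum distance over R x S."""
    return min(((r, s) for r in R for s in S), key=lambda p: math.dist(R[p[0]], S[p[1]]))


def all_knn(R: Dict[str, Point], S: Dict[str, Point], k: int) -> Dict[str, List[str]]:
    """For each r in R, its k nearest neighbors from S."""
    return {r: sorted(S, key=lambda s: math.dist(R[r], S[s]))[:k] for r in R}


def iceberg_distance_join(R, S, eps: float, t: int) -> List[Tuple[str, str]]:
    """Pairs of the eps-distance join whose r occurs in at least t pairs."""
    pairs = eps_distance_join(R, S, eps)
    counts = Counter(r for r, _ in pairs)
    return [(r, s) for r, s in pairs if counts[r] >= t]


# Hypothetical coordinates.
hotels = {"h1": (0.0, 0.0), "h2": (10.0, 0.0)}
restaurants = {
    "r1": (1.0, 0.0), "r2": (0.0, 2.0),
    "r3": (9.0, 1.0), "r4": (10.0, -1.5), "r5": (12.0, 0.0),
}
near = eps_distance_join(hotels, restaurants, eps=2.5)
best = closest_pair(hotels, restaurants)
nn1 = all_knn(hotels, restaurants, k=1)
iceberg = iceberg_distance_join(hotels, restaurants, eps=2.5, t=3)
```

Each function performs a full scan of R × S. The index-based and partition-based techniques described earlier aim to avoid exactly this quadratic cost.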

Another extension of the spatial join is the multi-way join, which involves an arbitrary number of spatial inputs. This type of operation is very useful for Geographic Information Systems (GIS) and Very Large Scale Integration (VLSI) circuit design. Examples of multi-way spatial join queries are “Find all cities which intersect with the Mississippi river and are adjacent to a golf course” and “Find all sub-circuits which formulate a cache configuration”.

FUTURE DIRECTIONS

For processing spatial join queries, we usually follow the filter and refine steps in order. In some cases, however, variants of this ordering, for example an interleaving of the two steps, may be more beneficial. Future work can explore where such variants are beneficial and what information needs to be collected to decide this.

Although intersection join algorithms (e.g., R-tree joins) can be directly extended to other join types (e.g., distance joins), the resulting performance is often poor. Various optimization techniques can be applied to remedy this. Extending existing intersection join algorithms with various optimization criteria to other domains is another interesting area for future research.

CROSS REFERENCES

1. Intersection join

2. Distance join

3. Similarity join

4. Spatial access method

5. R-Tree

6. Iceberg Distance join

RECOMMENDED READING

1. Shekhar S. and Chawla S. (2003). Spatial Databases: A Tour, First Edition, Prentice Hall.

2. Patel J. M. and DeWitt D. J. (1996). Partition Based Spatial-Merge Join. Proceedings of the ACM SIGMOD Conference, pages 259-270.

3. Brinkhoff T., Kriegel H., and Seeger B. (1993). Efficient Processing of Spatial Joins Using R-trees. Proceedings of the ACM SIGMOD Conference, pages 237-246.

4. Brinkhoff T., Kriegel H., and Seeger B. (1996). Parallel Processing of Spatial Joins Using R-trees. Proceedings of the ICDE Conference, pages 258-265.

5. Manolopoulos Y., Papadopoulos A., and Vassilakopoulos M. (2005). Spatial Databases: Technologies, Techniques and Trends, IDEA Group Publishing.

6. Böhm C. and Krebs F. (2002). High Performance Data Mining Using the Nearest Neighbor Join. Proceedings of the IEEE International Conference on Data Mining, pages 43-55.

7. Shou Y., Mamoulis N., Cao H., Papadias D., and Cheung D. W. (2003). Evaluation of Iceberg Distance Joins. Proceedings of the Eighth International Symposium on Spatial and Temporal Databases, pages 270-288.

8. Corral A., Manolopoulos Y., Theodoridis Y., and Vassilakopoulos M. (2000). Closest Pair Queries in Spatial Databases. Proceedings of the ACM SIGMOD Conference, pages 189-200.

9. Guttman A. (1984). R-trees: A Dynamic Index Structure for Spatial Searching. Proceedings of the ACM SIGMOD Conference, pages 47-57.

10. Koudas N. and Sevcik K. (2000). High Dimensional Similarity Join. Proceedings of the ACM SIGMOD Conference, pages 324-335.

11. Mamoulis N. and Papadias D. (2001). Multi-way Spatial Joins. ACM Transactions on Database Systems (TODS), 26(4), pages 424-475.

12. An N., Yang Z.-Y., and Sivasubramaniam A. (2001). Selectivity Estimation for Spatial Joins. Proceedings of the IEEE ICDE Conference, pages 368-375.

13. Faloutsos C., Seeger B., Traina A., and Traina C. (2000). Spatial Join Selectivity Using Power Laws. Proceedings of the ACM SIGMOD Conference, pages 177-188.

14. Mamoulis N. and Papadias D. (2003). Slot Index Spatial Join. IEEE Transactions on Knowledge and Data Engineering (TKDE), 15(1), pages 211-231.

15. Orenstein J. (1986). Spatial Query Processing in an Object-Oriented Database System. Proceedings of the ACM SIGMOD Conference, pages 326-336.