Hypergraph Representation of Spatial Semantics
4.2 Computing Hypergraph Weights for Point Data
Although the method based on the Inclusion-Exclusion theorem used in the last section to compute the weights of the hypergraph edges is straightforward, an implementation based on it is not very efficient. We will refer to this method as the Inclusion-Exclusion method. The reasons are as follows. Suppose the number of data points in a data set is n. First of all, we need to compute $A_i$, $A_{i,j}$, …, $A_{1,2 \dots n}$ for all possible 0<i<n, 0<i<j<n, etc. This can be done by computing the intersections among all $R_i$'s to get $A_{i,j}$, computing the intersections among all $R_{i,j}$ to get $A_{i,j,k}$, etc.
This process might repeat for up to n rounds, and the number of regions to be intersected in each round increases monotonically. For each round, the computational complexity can be reduced from the $O(N^2)$ of an intuitive method, which exhaustively examines the intersection between every two regions, to $O(N \log N)$ by using the well-known Line Sweeping algorithm (Cormen, 2001), where N is the number of regions to be intersected in that round. Thus the total number of intersections performed is in the order of

$$\sum_{i=1}^{n} N_i \log N_i$$

where $N_i$ is the number of regions to be intersected in round i. $N_1$ is the number of regions to be intersected in the initial data set, i.e., $N_1 = n$. Since $N_1 < N_2 < \dots < N_n$, each term is at least $N_1 \log N_1 = n \log n$, so over n rounds this number is at least in the order of $n^2 \log n$. Second, maintaining the relationships among $A_i$, $A_{i,j}$, …, $A_{1,2 \dots n}$ in order to compute $\tilde{A}_i$, $\tilde{A}_{i,j}$, …, $\tilde{A}_{1,2 \dots n}$ is either very time consuming or very space consuming. Furthermore, the implementation of the Line Sweeping algorithm is not trivial.
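[Figure: six regions R1 through R6, with axis coordinates labeled X0 through X11 and Y0 through Y11; the smallest rectangles inside R3 are numbered 1 through 6]
Fig. 4-3. Computing the Smallest Intersection Regions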
We next describe a simple, intuitive method with a complexity of $O(n^3)$. The idea behind the method is to first compute all the possible smallest intersection regions (e.g., regions 1-6 in Fig. 4-3) and then assemble them into their corresponding result sets (Cixiang Zhan, Environmental Research Institute - ESRI, 2001, Personal Communication). We call this the Intersect-Assemble method. The process is shown in Fig. 4-4. It first sorts the coordinates of all points along the x and y directions respectively. For each pair of neighboring coordinates along the x direction, $x_i$ and $x_{i+1}$, and each pair along the y direction, $y_j$ and $y_{j+1}$, a smallest rectangle can be constructed using these coordinates. For each such rectangle, the algorithm examines which original regions contain it, and the labels of these original regions form a result set. If multiple smallest rectangles share the same result set, their areas are summed and the sum is set as the area of the result set.
Assume there are six regions (R1 through R6) to be intersected as shown in Fig. 4-3. Since the six regions contribute 12 coordinates along each direction, there are (12-1)*(12-1) smallest rectangles. We focus on the three regions on the top-left since they intersect with one another but do not intersect with the other three regions. There are six smallest rectangles in region R3, numbered 1 through 6. Rectangle 1 is contained in both regions R1 and R3, thus its label is $L_{13}$. Similarly, rectangle 4 is labeled $L_{123}$ and rectangle 5 is labeled $L_{23}$. Rectangles 2, 3 and 6 are all labeled $L_3$ and their areas are summed up. Thus we have
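$$\tilde{A}_{3} = \mathrm{area}(2) + \mathrm{area}(3) + \mathrm{area}(6), \quad \tilde{A}_{23} = \mathrm{area}(5), \quad \tilde{A}_{13} = \mathrm{area}(1), \quad \tilde{A}_{123} = \mathrm{area}(4)$$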
Input: An array of n regions Rect
Output: A hash table H, each entry of which stores the label (hash key) and the area value
Set H to empty
Extract the x and y coordinates of the corner points of the n regions into two arrays X and Y, each of size 2*n
Sort these two arrays in ascending order
For i from 0 to 2*n-2
  For j from 0 to 2*n-2
    Build a rectangle (tempRect) with the following four coordinates (X[i], Y[j], X[i+1], Y[j+1])
    Set the label set (L) of tempRect to empty
    For k from 0 to n-1
      If tempRect is within Rect[k] then
        Add k to L
      End if
    End for k
    If L is not empty
      If L is already in H
        H(L) = H(L) + area(tempRect)
      Else
        H(L) = area(tempRect)
      End if
    End if
  End for j
End for i
End
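The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the rectangle encoding `(xmin, ymin, xmax, ymax)`, the function name `intersect_assemble`, and the use of `frozenset` labels as hash keys are assumptions made for the example. Duplicate coordinates are also dropped before building the cells, since they would only produce zero-area rectangles.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumed encoding: an axis-aligned region is (xmin, ymin, xmax, ymax).
Rect = Tuple[float, float, float, float]


def intersect_assemble(rects: List[Rect]) -> Dict[FrozenSet[int], float]:
    """Map each label (set of region indices) to the total area it covers."""
    # Sorted distinct coordinates along each direction.
    xs = sorted({x for r in rects for x in (r[0], r[2])})
    ys = sorted({y for r in rects for y in (r[1], r[3])})
    areas: Dict[FrozenSet[int], float] = {}
    for i in range(len(xs) - 1):
        for j in range(len(ys) - 1):
            x0, x1, y0, y1 = xs[i], xs[i + 1], ys[j], ys[j + 1]
            # Four comparisons decide whether the cell lies within a region.
            label = frozenset(
                k for k, (rx0, ry0, rx1, ry1) in enumerate(rects)
                if rx0 <= x0 and x1 <= rx1 and ry0 <= y0 and y1 <= ry1
            )
            if label:
                # Accumulate the cell area under its label.
                areas[label] = areas.get(label, 0.0) + (x1 - x0) * (y1 - y0)
    return areas


# Two overlapping 2x2 squares: 3 units belong to each square alone, 1 to both.
weights = intersect_assemble([(0, 0, 2, 2), (1, 1, 3, 3)])
```

For the two squares in the usage line, the areas sum to 7, which matches the union area 4 + 4 - 1 given by inclusion-exclusion.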
It takes four comparisons to determine whether tempRect is within an original rectangle. Assume it takes $Q_0$ time on average to perform a lookup in a hash table. Since in the worst case every lookup is followed by an update, which is also assumed to take $Q_0$ time, the cost of processing a single smallest rectangle is at most $4n + Q_0 + Q_0 = 4n + 2Q_0$. Since there are $(2n-1)(2n-1)$ such smallest rectangles (i.e., the i and j loops), the cost of the algorithm is $(2n-1)(2n-1)(4n + 2Q_0)$. With a reasonable hash function, looking up a data item in a hash table usually takes sub-linear time, i.e., $O(Q_0) < O(n)$, so the above algorithm has approximately $O(n \cdot n \cdot n) = O(n^3)$ complexity. Although the theoretical complexity of the Intersect-Assemble method is higher than that of the Inclusion-Exclusion method, it is still competitive when n is small because of its simple implementation. However, the computation cost is prohibitive when n is large, and we need a more efficient method.
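To make the cubic growth concrete, the cost expression from the analysis can be evaluated directly. The snippet below is only an arithmetic illustration; the function name and the choice of Q0 = 1 are assumptions for the example, not values from the text.

```python
def intersect_assemble_cost(n: int, q0: float = 1.0) -> float:
    """Worst-case operation count (2n-1)*(2n-1)*(4n + 2*Q0) from the analysis."""
    return (2 * n - 1) ** 2 * (4 * n + 2 * q0)


# Q0 = 1 is an assumed placeholder for the average hash-table lookup cost.
for n in (10, 100, 1000):
    print(n, intersect_assemble_cost(n))
```

Under these assumptions, a tenfold increase in n raises the count roughly a thousandfold, as expected of a cubic cost.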
We observe that the number of intersection rectangles associated with each intersection line, i.e., a unique coordinate along either the x or y direction as shown in Fig. 4-3, is very likely to be much smaller than n. A region often intersects with only a limited number of other regions since the size of a region is limited. In terms of the hypergraph representation, this also means that the number of nodes in a hyperedge is bounded by a constant. In fact, this is one of the assumptions in our complexity analysis of one of the proposed optimization methods, as detailed in Section 6.6 of Chapter 6. As an example, in Fig. 4-3, the number of regions to be intersected is six while the maximum number of regions that intersect with one another is only three.
Based on this observation, we propose a new method for computing the weights of the hypergraph for a point data set. Since the method exploits the R-Tree spatial index (Guttman, 1984), we call it the R-Tree based method.
For each of the extended regions of the points in a point data set, the method first retrieves all the extended regions that intersect with it. It then applies the Intersect-Assemble method to these regions. For each entry in the resultant hash table, the method first checks whether the label of the entry contains the label of the extended region under consideration. If so, the method further checks whether the entry already exists in the output hash table. If the entry exists, its area value is added to the existing value in the output hash table; otherwise the entry is added to the output hash table. The process is shown in Fig. 4-5.
Input: An array of n regions Rect
Output: A hash table H, each entry of which stores the label (hash key) and the area value
1. Construct an R-Tree for Rect
2. For each leaf node ni in the R-Tree (i.e., an extended region)
   2.1 Retrieve all the extended regions that intersect ni from the R-Tree and store them in array Rect’
   2.2 Apply the Intersect-Assemble method to Rect’ (cf. Fig. 4-4) and store the result in a hash table H’
   2.3 For each entry (L, A) in H’, where L is the hash key and A is the value of the entry
         Test whether the entry can be found in H and set the corresponding flag array element F(L)
       End for
   2.4 For each entry (L, A) in H’
         If L contains the label of ni and F(L) then
           If L is already in H then
             H(L) = H(L) + A
           Else
             H(L) = A
           End if
         End if
       End for
   End for
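The overall flow can be sketched in Python as below. Two simplifications are assumptions of this sketch and not part of the original method. First, a linear scan (`intersects`) stands in for the R-Tree query, so the example shows only the local assemble-and-merge logic, not the index. Second, the flag F(L) is read as "L is not yet in H": a label that contains the current region is fully covered by the local run, so it is written once and skipped when it is met again from another region. The names `local_weights` and `intersects` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Rect = Tuple[float, float, float, float]  # assumed (xmin, ymin, xmax, ymax)
Areas = Dict[FrozenSet[int], float]


def intersect_assemble(rects: List[Rect]) -> Areas:
    """Intersect-Assemble: label -> summed area of the smallest rectangles."""
    xs = sorted({x for r in rects for x in (r[0], r[2])})
    ys = sorted({y for r in rects for y in (r[1], r[3])})
    areas: Areas = {}
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            label = frozenset(
                k for k, r in enumerate(rects)
                if r[0] <= x0 and x1 <= r[2] and r[1] <= y0 and y1 <= r[3]
            )
            if label:
                areas[label] = areas.get(label, 0.0) + (x1 - x0) * (y1 - y0)
    return areas


def intersects(a: Rect, b: Rect) -> bool:
    """True when two rectangles overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def local_weights(rects: List[Rect]) -> Areas:
    out: Areas = {}
    for i, region in enumerate(rects):
        # Stand-in for step 2.1: a linear scan, NOT an R-Tree query.
        ids = [k for k, other in enumerate(rects) if intersects(region, other)]
        # Step 2.2: run Intersect-Assemble on the neighbors only.
        local = intersect_assemble([rects[k] for k in ids])
        for label, area in local.items():
            # Map local indices back to indices in the full data set.
            global_label = frozenset(ids[k] for k in label)
            # Keep labels containing region i; skip labels already recorded
            # (assumed reading of the flag F(L)).
            if i in global_label and global_label not in out:
                out[global_label] = area
    return out


rects = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 11, 11)]
weights = local_weights(rects)
```

On this small input the result agrees with running Intersect-Assemble on all regions at once, while each local run only sees the regions that overlap the one under consideration.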
The efficiency of the R-Tree based method is achieved by performing the expensive $O(n^3)$ Intersect-Assemble method only on a subset of the point data set at a time. Although the extended region of a point might be involved multiple times when calling the Intersect-Assemble method, the overall computational complexity can be reduced, as analyzed in the following. Although a strict complexity analysis of the R-Tree and its variant, the R*-tree (Beckmann, 1990), is not available, they have been shown experimentally to be low-cost spatial indexing methods whose cost is super-linear but sub-quadratic. We assume that the R-Tree construction complexity is $O(n \log n)$ and that the search complexity in an R-Tree is $O(\log n)$ (the lower bounds of sorting and of searching in tree data structures, respectively), where n is the number of points in a point data set. We also assume that the number of extended regions intersecting with the extended region of a node is bounded by a constant, as discussed above, so that the cost of the Intersect-Assemble method on them is independent of n. We use $C_{IA}$ to denote such a cost. Thus the total cost of the R-Tree based method in the best scenario is in the order of $n \log n + n(\log n + C_{IA})$, i.e., $O(n \log n + n C_{IA})$. Theoretically, when n is large, the $n \log n$ term will dominate and reduce the complexity to $O(n \log n)$. However, for practical values of n (e.g., 100-10000), the constant $C_{IA}$ is likely to be much larger than $\log n$. Thus we expect the running time of this algorithm to be approximately linear in n, with a large hidden constant factor.
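The relative size of the two terms can be checked with a short calculation. The numbers below are assumptions chosen for illustration only: the logarithm is taken to base 2, and $C_{IA}$ is estimated from the Intersect-Assemble cost expression with three mutually intersecting regions (as in Fig. 4-3) and Q0 = 1.

```python
import math


def rtree_method_cost(n: int, c_ia: float) -> float:
    """Best-case cost n*log(n) + n*(log(n) + C_IA) from the analysis."""
    return n * math.log2(n) + n * (math.log2(n) + c_ia)


def c_ia_share(n: int, c_ia: float) -> float:
    """Fraction of the total cost contributed by the n*C_IA term."""
    return n * c_ia / rtree_method_cost(n, c_ia)


# Hypothetical constant: (2k-1)*(2k-1)*(4k + 2*Q0) with k = 3 and Q0 = 1.
C_IA = (2 * 3 - 1) ** 2 * (4 * 3 + 2 * 1)

for n in (100, 1000, 10000):
    print(n, round(c_ia_share(n, C_IA), 3))
```

With these assumed values the n*C_IA term accounts for more than nine tenths of the total over the whole range, which is consistent with the near-linear behavior expected for practical n.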