Hypergraph Representation of Spatial Semantics
4.6 Discussions On Related Work
Some existing work represents access patterns as data access graphs and then applies graph-theoretical approaches to generate broadcast sequences. Compared with representing complex query result sets directly as hypergraphs, these representations are essentially approximations, similar to the one we proposed for spatial data in Section 4.5 of this chapter.
(Si, 1999) presented a Semantic Ordering Model (SOM) for relational/object-oriented database broadcast that uses an entity type (field/attribute) as the basic broadcast unit (data item) and represents the access patterns of a broadcast database as a directed graph, as shown in Fig. 4-8. Each node $v_i$ is associated with a cost $s_i$ of accessing an entity type, which reflects the total size of all entities belonging to that entity type. Each node $v_i$ is further associated with a probability $p_i$ of being accessed as the first entity in a query; $p_i$ can be estimated as $n_i/n$. Each edge $e_{ij}$ is associated with a weight $\alpha_{ij}$ indicating the likelihood that $v_j$ will be accessed by a query, given that $v_i$ has been accessed by the same query.
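[Figure 4-8: directed graph with five nodes V1 to V5, each labeled with its cost and probability ($s_i$, $p_i$), and directed edges weighted $\alpha_{1,2}$, $\alpha_{2,1}$, $\alpha_{2,3}$, $\alpha_{3,2}$, $\alpha_{2,4}$, $\alpha_{4,2}$, $\alpha_{1,5}$ and $\alpha_{5,4}$.]
Fig. 4-8. The SOM Model and its Graph Representation (Si, 1999)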
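The structure described above can be sketched as a small data type. This is only an illustrative sketch, not the implementation of (Si, 1999): the class name `SomGraph`, the helper `estimate_first_probs`, and all numeric values are hypothetical. The text does not define $n_i$ and $n$; the sketch assumes $n_i$ is the number of queries that access $v_i$ first and $n$ is the total number of queries. The edge list follows the edge labels of Fig. 4-8.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SomGraph:
    """Directed access graph: node -> (s_i, p_i), edge (i, j) -> alpha_ij."""
    size: Dict[str, float] = field(default_factory=dict)        # s_i
    first_prob: Dict[str, float] = field(default_factory=dict)  # p_i
    alpha: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def add_node(self, name: str, size: float, first_prob: float) -> None:
        self.size[name] = size
        self.first_prob[name] = first_prob

    def add_edge(self, src: str, dst: str, weight: float) -> None:
        if src not in self.size or dst not in self.size:
            raise KeyError("both endpoints must be added as nodes first")
        self.alpha[(src, dst)] = weight

    def successors(self, name: str) -> List[Tuple[str, float]]:
        """Nodes one edge away from `name`, highest alpha first."""
        out = [(dst, w) for (src, dst), w in self.alpha.items() if src == name]
        return sorted(out, key=lambda pair: pair[1], reverse=True)


def estimate_first_probs(first_counts: Dict[str, int]) -> Dict[str, float]:
    """p_i = n_i / n. ASSUMPTION: n_i counts the queries that access v_i
    first, and n is the total number of queries."""
    total = sum(first_counts.values())
    return {name: count / total for name, count in first_counts.items()}


# Hypothetical counts, sizes, and edge weights, for illustration only.
counts = {"V1": 4, "V2": 3, "V3": 1, "V4": 1, "V5": 1}
probs = estimate_first_probs(counts)

graph = SomGraph()
for name in counts:
    graph.add_node(name, size=10.0, first_prob=probs[name])

for src, dst, w in [("V1", "V2", 0.6), ("V2", "V1", 0.2), ("V2", "V3", 0.5),
                    ("V3", "V2", 0.4), ("V2", "V4", 0.3), ("V4", "V2", 0.7),
                    ("V1", "V5", 0.4), ("V5", "V4", 0.9)]:
    graph.add_edge(src, dst, w)

print(graph.successors("V2"))  # [('V3', 0.5), ('V4', 0.3), ('V1', 0.2)]
```

Storing the edge weights in a dictionary keyed by ordered pairs keeps $\alpha_{ij}$ and $\alpha_{ji}$ distinct, which matters because the model is direction-sensitive.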
The SOM model and the directed graph representation are suitable for scenarios where the precedence relationship between $v_i$ and $v_j$ can be easily determined, such as referential integrity constraints in relational databases and parent-child relationships in object-oriented databases. They are not applicable to scenarios where a set of entity types is involved in processing a query but there is no precedence order among the entity types in the set. It is also worth noting that the SOM model and the graph representation are designed for using an entity type (attribute) as the minimum broadcast unit, i.e., a vertical partition of the database. Because of the bandwidth limitation, usually only hot data items and frequently accessed attributes are chosen for broadcast. If almost all attributes are required by clients, which is very likely in practice, it will take almost a whole broadcast cycle to retrieve even a single data item in the vertical partitioning broadcast scheme. Also, since location is the most selective attribute in spatial queries (where most attributes are often needed) and usually only a small portion of all the data items is in a spatial query result set, we believe tuple (record) selection rather than entity type selection is more practical in geographical data broadcast. Thus the SOM model and its graph representation are not suitable for geographical data broadcast.
The graph representation in (Lee, 2002) is also based on a directed graph. For each query pattern, the authors classified the related attributes into three groups: the select attributes (SA), the join attributes (JA) and the project attributes (PA). They assumed the order of the three groups to be SA→JA→PA; the attributes inside each group, however, are unordered. An initial graph can be built as proposed in (Si, 1999). The unordered pairs in an attribute group of a query pattern are checked against the remaining query patterns to determine their precedence relationship using the SA→JA→PA order. For the attributes that still do not have a precedence relationship with any other attribute, all the attributes in SA have directed edges to the attributes in JA and, similarly, all the attributes in JA have directed edges to all the attributes in PA. During the process, if there are two directed edges between node u and node v with access frequencies $f_{uv}$ and $f_{vu}$, the two directed edges are replaced by one directed edge with access frequency $f_{uv} - f_{vu}$. Although (Lee, 2002) provided several
additional methods in determining the precedence relationship between two attributes according to SQL query patterns, it has the same problems as (Si, 1999). In the
be impossible to determine the precedence relationship between the attributes in PA. Although it is beneficial to put attributes that are often queried together near each other, unfortunately, it is impossible to do so based on the graph representation of query patterns proposed in (Lee, 2002a).
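The edge-merging step described for (Lee, 2002) can be illustrated with a short sketch. The function name `merge_opposite_edges` is hypothetical, and two details are assumptions not stated in the text: the surviving edge keeps the direction of the larger frequency, and a pair with equal frequencies is dropped entirely because the difference is zero.

```python
from typing import Dict, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def merge_opposite_edges(freq: Dict[Edge, float]) -> Dict[Edge, float]:
    """Replace each pair of opposite edges (u, v), (v, u) by a single edge
    weighted f_uv - f_vu.

    ASSUMPTIONS (not from the source): the merged edge points in the
    direction of the larger frequency, and ties are removed.
    """
    merged: Dict[Edge, float] = {}
    for (u, v), f_uv in freq.items():
        f_vu = freq.get((v, u))
        if f_vu is None:
            # No opposite edge: keep the edge unchanged.
            merged[(u, v)] = f_uv
        elif f_uv > f_vu:
            # This direction dominates: keep it with the difference.
            merged[(u, v)] = f_uv - f_vu
        # Otherwise the opposite edge dominates (handled when the loop
        # reaches it) or the two are equal (both dropped).
    return merged


# Hypothetical access frequencies.
edges = {("a", "b"): 5, ("b", "a"): 2, ("b", "c"): 3}
print(merge_opposite_edges(edges))  # {('a', 'b'): 3, ('b', 'c'): 3}
```

After merging, at most one directed edge remains between any two attributes, so the graph encodes a single net precedence per pair.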
The method presented in (Lee, 2003) also represents query patterns as a graph. The authors assumed that a data item (a tuple/record or an object) is the basic broadcast unit and that the data items in a query result set are unordered. Thus their problem is essentially the same as ours. They also construct a graph before sequencing. For each query and for any two data items in the query result set, they put an undirected edge between the two data items, weighted by the access frequency of the query. The final graph is generated by combining identical edges and using the sum of their weights as the weight of the combined edge. The resulting graph in (Lee, 2003) is a combination of m complete graphs, where m is the number of queries, and is very likely to be dense, which makes it hard to handle.
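The construction described above can be sketched in a few lines. This is a minimal illustration of the pairwise co-access graph as summarized in this section, not the code of (Lee, 2003); the function name `build_co_access_graph` and the sample queries are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def build_co_access_graph(
    queries: Iterable[Tuple[Iterable[Hashable], float]]
) -> Dict[Edge, float]:
    """Build an undirected weighted graph from (result_set, frequency) pairs.

    Every pair of items in a result set gets an edge weighted by the query
    frequency; identical edges from different queries are summed.
    """
    weights: Dict[Edge, float] = defaultdict(float)
    for items, frequency in queries:
        # Sorting gives each undirected edge one canonical key (u, v), u < v.
        for u, v in combinations(sorted(set(items)), 2):
            weights[(u, v)] += frequency
    return dict(weights)


# Hypothetical query result sets with their access frequencies.
queries = [({1, 2, 3}, 0.5), ({2, 3}, 0.25), ({1, 4}, 0.25)]
print(build_co_access_graph(queries))
# {(1, 2): 0.5, (1, 3): 0.5, (2, 3): 0.75, (1, 4): 0.25}
```

A query with $m_i$ items contributes $m_i(m_i-1)/2$ edges, one complete graph per query, which is why the combined graph tends to be dense.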
For spatial range queries on point data, we can prove that the graph generated by the method of (Lee, 2003) is exactly the same as the approximation graph generated by the method proposed for point data in Section 4.5 of this chapter. To do so, it is sufficient to prove that the weight of the edge between two arbitrary nodes in the graph is the same in the two methods. The weight of the edge between any two nodes (without loss of generality, we assume they are node 1 and node 2) is $A_{1,2}$ in our method. The possible query result sets that contain data items 1 and 2 are {1,2}, {1,2,3}, {1,2,4}, …, {1,2,n}, {1,2,3,4}, …, {1,2,…,n}. Their weights, according to the spatial semantics presented in Section 3.2 of Chapter 3, are $\tilde{A}_{1,2}, \tilde{A}_{1,2,3}, \tilde{A}_{1,2,4}, \ldots, \tilde{A}_{1,2,n}, \tilde{A}_{1,2,3,4}, \ldots, \tilde{A}_{1,2,\ldots,n}$. The weight of edge (1,2) based on the method proposed in (Lee, 2003) is the sum of these weights. By the Inclusion-Exclusion principle, this sum is $A_{1,2}$, which is the same as our result. The method of (Lee,
2003), although applicable for handling generic complex queries, suffers from the exponential number of possible queries with respect to the number of data items when applied to spatial range queries. Furthermore, even if the number of queries is bounded by a constant M, their graph construction method has the complexity of
$O\left(\sum_{i=1}^{M} m_i^2\right)$, where $m_i$ is the number of data items in the $i$-th query. Our method is much simpler because it exploits spatial semantics. The worst-case complexity of our method is $O(n \log n)$ using the Line Sweeping algorithm, where n is the number of nodes in the graph, or equivalently the number of points in the data set. Although for all i, $m_i$ is less than n,