Spatio-temporal Graph Patterns Features - query features extraction process


7.4 query features extraction process


[Figure 7.2 shows an algebra tree. Labels recovered from the figure: distinct; project (?sensor ?loc); sequence; spatial function(57.467 -7.367 23.0 'miles'); bgp with the triples "?sensor sosa:isHostedBy ?loc" and "?loc geo:lat ?sLat".]

Figure 7.2: Algebra Feature Extractor

To address this problem, we take into account the query graph pattern similarity, which will be described in the next section.

7.4.2 Spatio-temporal Graph Patterns Features

With respect to query planning, judging two queries to be similar based only on their algebraic expressions carries a significant risk of incorrect predictions. When computing query similarities from algebraic features, we consider only the frequency of the query operators, e.g., the number of triples in the basic graph patterns appearing in the queries, and fail to represent the query graph structure. Recalling that a SPARQL query can also be considered as an RDF graph, queries that are structurally similar may share the same execution plan and exhibit similar query performance.

To represent the spatio-temporal graph patterns, we propose building graph pattern features.

Specifically, we transform the similarity problem of two query patterns into the similarity problem of two graphs. To compute the structural similarity between two query patterns, we first construct two graphs from the two query patterns, then compute the graph edit distance between these two graphs. In the following, we first briefly introduce the notion of graph edit distance and then present the graph pattern features extraction process.

7.4.2.1 Graph Edit Distance

We paraphrase the graph and the graph edit distance definitions from [161] as follows.

Definition 7.1 (Graph) A graph g is a tuple g = (V, E, µ, ν) where

• V is the finite set of nodes

• E ⊆ V × V is the set of edges

• µ : V → L is the node labeling function

• ν : E → L is the edge labeling function
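As an illustration of Definition 7.1, the sketch below stores a labeled graph as two dictionaries, one per labeling function. The class name `LabeledGraph`, its methods, and the use of strings as labels are assumptions made for this example only; they are not taken from [161].

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """g = (V, E, mu, nu): the dictionary keys play the role of V and E."""
    nodes: dict = field(default_factory=dict)  # mu: node id -> node label
    edges: dict = field(default_factory=dict)  # nu: (source, target) -> edge label

    def add_node(self, node, label):
        self.nodes[node] = label

    def add_edge(self, source, target, label):
        # E is a subset of V x V, so both endpoints must already be nodes.
        if source not in self.nodes or target not in self.nodes:
            raise ValueError("edge endpoints must be nodes of the graph")
        self.edges[(source, target)] = label


# A one-edge graph built from the triple pattern "?sensor sosa:isHostedBy ?loc".
g = LabeledGraph()
g.add_node("s", "?sensor")
g.add_node("l", "?loc")
g.add_edge("s", "l", "sosa:isHostedBy")
```

Keeping the labels in dictionaries keyed by nodes and edges makes both labeling functions simple lookups, which is convenient when edit costs depend on labels.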

Figure 7.3: A possible edit path between graphs g1 and g2 [161]

The graph edit distance between two graphs is the minimum amount of distortion that is needed to transform one graph into another. The amount of distortion is considered as the cost of a sequence of graph edit operations. A standard set of graph edit operations is given by deletions, insertions, and substitutions of both nodes and edges. Figure 7.3 demonstrates a possible edit path to transform graph gx into graph gy. In this example, the list of edit operations consists of three edge deletions, one node deletion, one node insertion, two edge insertions, and finally two node substitutions.

There are many different edit paths that transform gx into gy. Let Υ(gx, gy) be the set of all such edit paths. To find the most suitable edit path in Υ(gx, gy), a cost function is introduced. The cost function quantifies how strongly an edit operation modifies the graph. For two similar graphs, there should be an inexpensive edit path consisting of low-cost operations, whereas for dissimilar graphs, any edit path has a high cost. As a result, the edit distance of two graphs is defined by the cost of a minimum-cost edit path between the two graphs.

Definition 7.2 (Graph Edit Distance) Let gx = (Vx, Ex, µx, νx) be the source graph and gy = (Vy, Ey, µy, νy) the target graph. The graph edit distance between gx and gy is defined by:

\[
d(g_x, g_y) = \min_{(e_1, \ldots, e_k) \in \Upsilon(g_x, g_y)} \sum_{i=1}^{k} c(e_i) \tag{7.1}
\]

where c(e_i) denotes the cost of the edit operation e_i.
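To make Equation 7.1 concrete, the following minimal sketch computes the distance for very small graphs by brute force. It assumes a unit cost for every edit operation and considers only edit paths induced by a node mapping (each source node is either mapped to a target node or deleted). The function names and the dictionary-based graph encoding are hypothetical. This is not the approximate method of [161], and its running time grows factorially with the number of nodes.

```python
from itertools import permutations


def induced_cost(nodes1, edges1, nodes2, edges2, mapping):
    """Sum of c(e_i) over the edit path induced by a node mapping.

    `mapping` sends each node of g1 to a node of g2, or to None (deletion).
    Every edit operation costs 1 (an assumed unit cost function c).
    """
    cost = 0
    used = set()
    for u, v in mapping.items():
        if v is None:
            cost += 1                        # node deletion
        else:
            used.add(v)
            cost += nodes1[u] != nodes2[v]   # node substitution if labels differ
    cost += len(set(nodes2) - used)          # node insertions
    matched = set()
    for (a, b), label in edges1.items():
        image = (mapping[a], mapping[b])
        if image in edges2:
            matched.add(image)
            cost += label != edges2[image]   # edge substitution if labels differ
        else:
            cost += 1                        # edge deletion
    cost += len(set(edges2) - matched)       # edge insertions
    return cost


def graph_edit_distance(nodes1, edges1, nodes2, edges2):
    """Minimum induced cost over all injective node mappings (brute force)."""
    ids1 = list(nodes1)
    # One None per source node, so that any subset of nodes may be deleted.
    targets = list(nodes2) + [None] * len(ids1)
    best = None
    for choice in permutations(targets, len(ids1)):
        mapping = dict(zip(ids1, choice))
        cost = induced_cost(nodes1, edges1, nodes2, edges2, mapping)
        if best is None or cost < best:
            best = cost
    return best


# g1: ?sensor -isHostedBy-> ?loc ; g2 adds the pattern ?loc -geo:lat-> ?sLat.
n1 = {1: "?sensor", 2: "?loc"}
e1 = {(1, 2): "sosa:isHostedBy"}
n2 = {"a": "?sensor", "b": "?loc", "c": "?sLat"}
e2 = {("a", "b"): "sosa:isHostedBy", ("b", "c"): "geo:lat"}
distance = graph_edit_distance(n1, e1, n2, e2)  # one node and one edge insertion
```

In this example the cheapest edit path inserts one node and one edge, so the distance is 2 under the assumed unit costs.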

7.4.2.2 Graph Patterns Features Extraction Process

Figure 7.4 illustrates the graph pattern features extraction process. The process can be split into two steps, which are briefly described as follows.

[Figure 7.4 includes the following sample spatio-temporal query:]

    SELECT distinct ?sensor ?loc {
      ?loc geo:within(-47.75 1.67 140.0 'miles')
      ?sensor sosa:isHostedBy ?weatherStation.
      ?sensor sosa:observes iot:AirTemperature.
      ?obs sosa:madebySensor ?sensor;
           sosa:resultTime ?time;
           sosa:hasSimpleResult ?value.
      ?value temporal:avg(01/01/2018 31/03/2018)
    }

Figure 7.4: Graph pattern features extraction process

The first step is to cluster the structurally similar graph patterns in the training data into Kmed clusters. We apply the same clustering approach proposed in [81], which combines the K-medoids clustering algorithm [98] with Riesen's graph distance [161], to cluster the training queries. K-medoids is used because it chooses data points as cluster centers and allows the use of an arbitrary distance function. However, in addition to the standard SPARQL graph patterns, we also consider the spatio-temporal graph patterns when computing the graph similarity. We built our own graph pattern extractor to extract not only the semantic graph pattern but also the spatial and temporal patterns. The spatio-temporal graph pattern is thus treated as a standard RDF graph in which the edges are spatio-temporal functions and the nodes are the input variables of those functions. We call such a graph a spatio-temporal graph. Figure 7.5 illustrates the graph representation of our sample spatio-temporal query. To ensure that the spatio-temporal triples carry more weight than standard triples, we heuristically adjust the weighting function in the graph edit distance method by increasing the cost of edit operations on the spatio-temporal graph. The final output of this clustering step is a set of Kmed clusters, each with a center graph pattern that is representative of the query patterns in that cluster.
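The clustering step can be sketched as follows. This is a simple alternating K-medoids variant that works on a precomputed distance matrix, so any distance function, including a graph edit distance, can be plugged in. It is not necessarily the exact algorithm of [98]. The function names and the naive initialisation are assumptions, and plain numbers stand in for query graphs to keep the example short.

```python
def assign(dist, medoids):
    """Attach every item to the position of its closest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Alternating K-medoids on a precomputed, symmetric distance matrix.

    Returns (medoids, labels): the indices of the cluster centers and, for
    each item, the position in `medoids` of its cluster.
    """
    medoids = list(range(k))  # naive deterministic initialisation (assumption)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old center if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            # The new center is the member closest to all other members.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Stand-in data: in the setting above, dist[i][j] would hold the graph edit
# distance between the graph patterns of training queries i and j.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids(dist, k=2)
```

Because the centers are always actual data points, each cluster center is itself a query graph pattern that can later be compared against new queries.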

Given the set of Kmed clusters from Step 1, we then compute the graph edit distance between the graph of each query q_i in the training set and the center graph pattern of each cluster. The structural similarity between these graphs is computed following the method described in [81]:

\[
\mathrm{sim}(q_i, C(k)) = \frac{1}{1 + d(q_i, C(k))} \tag{7.2}
\]

where d(q_i, C(k)) is the graph edit distance between the query graph q_i and the center graph C(k) of cluster k. Because the distance is non-negative, this formulation yields a score in the range (0, 1]: a score of 1 corresponds to a zero edit distance (most similar), and the score approaches 0 as the graphs become less similar. As a result, we obtain a Kmed-dimensional feature vector for q_i, where Kmed is the number of clusters.
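A minimal sketch of this second step is given below, assuming the graph edit distances from a query to the cluster centers have already been computed. The function names and the example distances are hypothetical.

```python
def similarity(distance):
    """Equation 7.2: map a graph edit distance d >= 0 to a score in (0, 1]."""
    return 1.0 / (1.0 + distance)


def feature_vector(distances_to_centers):
    """One similarity score per cluster center, in cluster order."""
    return [similarity(d) for d in distances_to_centers]


# Assumed distances from one query graph to the centers of three clusters.
features = feature_vector([0, 1, 3])
```

With these assumed distances the query is identical to the first center (score 1.0) and progressively less similar to the other two (scores 0.5 and 0.25).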


Figure 7.5: Mapping spatio-temporal query patterns to graph
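The mapping from query patterns to a graph, together with the heavier weighting of spatio-temporal edges, might be sketched as follows. The encoding is an assumption: each pattern is treated as a (subject, predicate or function, object or arguments) tuple, spatio-temporal edges are recognised by the `geo:` and `temporal:` prefixes seen in the sample query, and the weight of 2.0 is an arbitrary illustrative value. None of this is taken from the original implementation.

```python
# Assumed prefixes, taken from the functions that appear in the sample query.
SPATIO_TEMPORAL_PREFIXES = ("geo:", "temporal:")


def build_pattern_graph(patterns):
    """Map (subject, predicate, object) patterns to a labeled graph.

    Subjects and objects become nodes; each predicate or spatio-temporal
    function name becomes the label of the edge between them.
    """
    nodes, edges = {}, {}
    for subject, predicate, obj in patterns:
        nodes[subject] = subject
        nodes[obj] = obj
        edges[(subject, obj)] = predicate
    return nodes, edges


def edge_edit_cost(label, weight=2.0):
    """Unit cost for standard edges, `weight` for spatio-temporal edges."""
    return weight if label.startswith(SPATIO_TEMPORAL_PREFIXES) else 1.0


patterns = [
    ("?sensor", "sosa:isHostedBy", "?weatherStation"),
    ("?sensor", "sosa:observes", "iot:AirTemperature"),
    ("?loc", "geo:within", "(-47.75 1.67 140.0 'miles')"),
]
nodes, edges = build_pattern_graph(patterns)
costs = {label: edge_edit_cost(label) for label in edges.values()}
```

Under these assumptions, inserting, deleting, or substituting the `geo:within` edge costs twice as much as the same operation on a standard triple, so differences in the spatio-temporal part of two queries dominate their edit distance.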

In document A scalable spatio-temporal query processing engine for linked sensor data (Page 144-147)