Sensor observation
6.1 EAGLE Architecture
The architecture of EAGLE is illustrated in Figure 6.1. The engine accepts sensor data in RDF format as input and returns its output in the SPARQL Query Results format (footnote 1). The general processing works as follows. When Linked Sensor Data is fed to the system, it is first analyzed by the Data Analyzer component. The Data Analyzer is responsible for analyzing and partitioning the input data based on the RDF patterns that imply the spatial and temporal context. The sub-graphs output by the Data Analyzer are converted by the Data Transformer into formats compatible with the underlying databases. The Index Router module then receives the transformed data and forwards it to the corresponding sub-database components in the Data Manager. In the Data Manager, we choose Apache Jena TDB (footnote 2), OpenTSDB (footnote 3) [177] and ElasticSearch (footnote 4) as the underlying stores for the partitioned sub-graphs.
1 https://www.w3.org/TR/rdf-sparql-XMLres/
2 https://jena.apache.org/
3 http://opentsdb.net/
4 https://www.elastic.co/
To execute spatio-temporal queries, a Query Engine module is introduced. The Query Engine consists of several sub-components that are responsible for parsing the query, generating the query execution plan, rewriting the query into sub-queries and delegating the execution of these sub-queries to the underlying databases. The Data Manager executes the sub-queries and returns the query results. After that, the Data Transformer transforms the query results into the format that the Query Delegator requires. Details of EAGLE's components are described in the following subsections.
6.1.1 Data Analyzer
As mentioned above, the Data Analyzer evaluates the input sensor data in RDF format and partitions it into sub-graphs based on their characteristics (spatial, temporal or text). Data characteristics are specified via a set of defined RDF triple patterns. In EAGLE, these RDF triple patterns are categorized into three types: spatial patterns, temporal patterns, and text patterns. The spatial patterns are used to extract the spatial data that need to be indexed. Similarly, temporal patterns extract the sensor observation value along with its timestamp, and text patterns extract the string literals. An example of the partitioning process is illustrated in Figure 6.2. In this example, we define (?s wgs84:lat ?lat. ?s wgs84:long ?long) and (?s rdfs:label ?label) as the triple patterns used for extracting spatial and text data, respectively. For instance, assume that the system receives a set of input triples as follows:
:dublinAirport a geo:Feature;
wgs84:lat "53.1324"^^xsd:float;
wgs84:long "18.2323"^^xsd:float;
rdfs:label "Dublin Airport".
As demonstrated in Figure 6.2, the two triples (:dublinAirport wgs84:lat "53.1324"^^xsd:float; wgs84:long "18.2323"^^xsd:float) match the defined spatial patterns (?s wgs84:lat ?lat; wgs84:long ?long) and are therefore extracted as a spatial sub-graph. Similarly, the text sub-graph (:dublinAirport rdfs:label "Dublin Airport") is extracted. These sub-graphs are then transformed into compatible formats to be used by the indexing process in the Data Manager. The data transformation process is presented in the following section.
Figure 6.2: Transform spatial and text sub-graphs to ElasticSearch documents
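The partitioning step can be sketched in a few lines of Python. This is a minimal illustration, not EAGLE's implementation: it assumes triples arrive as plain (subject, predicate, object) tuples and simplifies pattern matching to a per-predicate lookup, whereas the patterns above require both wgs84:lat and wgs84:long on the same subject. The names SPATIAL_PREDICATES, TEXT_PREDICATES and partition are hypothetical.

```python
from collections import defaultdict

# Predicates that mark each pattern type. These sets mirror the example
# patterns in the text; they are assumptions made for this sketch.
SPATIAL_PREDICATES = {"wgs84:lat", "wgs84:long"}
TEXT_PREDICATES = {"rdfs:label"}


def partition(triples):
    """Split (s, p, o) triples into spatial, text and generic sub-graphs."""
    subgraphs = defaultdict(list)
    for s, p, o in triples:
        if p in SPATIAL_PREDICATES:
            subgraphs["spatial"].append((s, p, o))
        elif p in TEXT_PREDICATES:
            subgraphs["text"].append((s, p, o))
        else:
            # Anything that matches no pattern stays in the generic graph.
            subgraphs["generic"].append((s, p, o))
    return dict(subgraphs)


triples = [
    (":dublinAirport", "rdf:type", "geo:Feature"),
    (":dublinAirport", "wgs84:lat", 53.1324),
    (":dublinAirport", "wgs84:long", 18.2323),
    (":dublinAirport", "rdfs:label", "Dublin Airport"),
]
parts = partition(triples)
```

Here parts["spatial"] holds the two coordinate triples, parts["text"] holds the label triple, and the rdf:type triple falls through to parts["generic"].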
6.1.2 Data Transformer
The Data Transformer is responsible for converting the input sub-graphs received from the Data Analyzer into index entities. The index entities are the data records (or documents) constructed in a compatible data structure so that they can be indexed and stored in the Data Manager.
Returning to the example in Figure 6.2, the Data Transformer transforms the spatial sub-graph and the text sub-graph into ElasticSearch documents. In addition to transforming the sub-graphs into index entities, the Data Transformer also has to transform the query outputs generated by the Data Manager into the format that the Query Delegator requires.
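As a rough illustration of this transformation, the following Python sketch merges a spatial sub-graph and a text sub-graph for one subject into a single document shaped like an ElasticSearch geo document. The function name to_geo_document and the field names "uri" and "label" are assumptions made for this example; the actual mapping used by EAGLE is not specified here.

```python
def to_geo_document(subject, spatial_triples, text_triples):
    """Build one ElasticSearch-style document from two sub-graphs."""
    # Collect the spatial attribute values that belong to this subject.
    values = {p: o for s, p, o in spatial_triples if s == subject}
    doc = {
        "uri": subject,  # hypothetical field name
        "location": {
            "type": "point",
            # Longitude comes first, then latitude.
            "coordinates": [
                float(values["wgs84:long"]),
                float(values["wgs84:lat"]),
            ],
        },
    }
    for s, p, o in text_triples:
        if s == subject and p == "rdfs:label":
            doc["label"] = o  # hypothetical field name
    return doc


doc = to_geo_document(
    ":dublinAirport",
    [(":dublinAirport", "wgs84:lat", "53.1324"),
     (":dublinAirport", "wgs84:long", "18.2323")],
    [(":dublinAirport", "rdfs:label", "Dublin Airport")],
)
```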
6.1.3 Index Router
The Index Router receives the index entities generated by the Data Transformer and forwards them to the corresponding database in the Data Manager. For example, the spatial and text index entities are routed to ElasticSearch for indexing, and those that have temporal values are transferred to the OpenTSDB cluster. Index entities that do not match any spatial or temporal pattern are stored in the normal triple store. Because the access methods vary across databases, the Index Router has to support multiple access protocols such as REST APIs, JDBC, MQTT, etc.
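The routing rule described above amounts to a lookup from entity kind to target store with a fallback. The sketch below illustrates that idea only; InMemoryStore stands in for real database clients, and the class names, the route method and the kind labels ("spatial", "text", "temporal") are all hypothetical.

```python
class InMemoryStore:
    """Stand-in for a real client (REST, JDBC, ...); it only records writes."""

    def __init__(self, name):
        self.name = name
        self.items = []

    def write(self, entity):
        self.items.append(entity)


class IndexRouter:
    """Forward each index entity to the store responsible for its kind."""

    def __init__(self, elasticsearch, opentsdb, triple_store):
        self._routes = {
            "spatial": elasticsearch,
            "text": elasticsearch,
            "temporal": opentsdb,
        }
        # Entities that match no spatial, text or temporal pattern go here.
        self._default = triple_store

    def route(self, kind, entity):
        store = self._routes.get(kind, self._default)
        store.write(entity)
        return store.name


es = InMemoryStore("elasticsearch")
tsdb = InMemoryStore("opentsdb")
tdb = InMemoryStore("jena-tdb")
router = IndexRouter(es, tsdb, tdb)
targets = [
    router.route("spatial", {"location": {"type": "point"}}),
    router.route("text", {"label": "Dublin Airport"}),
    router.route("temporal", {"metric": "wind.speed", "value": 7.4}),
    router.route("generic", (":dublinAirport", "rdf:type", "geo:Feature")),
]
```

Keeping the protocol details inside each store object means the router itself does not need to know whether a store is reached over REST, JDBC or another protocol.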
6.1.4 Data Manager
Rather than rebuilding the spatio-temporal indices and functions into one specific system, our Data Manager module adopts a loosely coupled hybrid architecture that consists of different databases for managing different partitioned sub-graphs. More precisely, we use ElasticSearch to index the spatial objects and text values that occur in sensor metadata. Similarly, we use a time-series database, namely OpenTSDB, for storing temporal observation values. The reasons for choosing ElasticSearch and OpenTSDB are as follows:
(1) ElasticSearch and OpenTSDB both provide flexible data structures, which enable us to store sub-graphs that share similar characteristics but have different graph shapes. For example, stationA and stationB are both spatial objects but they have different spatial attributes (e.g., point vs. polygon, name vs. label, etc.). Moreover, such structures also allow us to dynamically add a flexible number of attributes to a table without using list, set, or bag attributes or redefining the data schema.
(2) ElasticSearch supports spatial and full-text search queries, while OpenTSDB provides a set of efficient temporal analytical functions on time-series data. All of these features are key requirements for managing sensor data.
(3) Finally, these databases offer clustering features, so that we are able to address the "big data" issue, which is problematic for traditional solutions when dealing with sensor data.
The non-spatio-temporal information that does not need to be indexed in the above databases is stored in the native triple store. We currently use Apache Jena TDB to store such generic data. A small dataset can easily be loaded into the RAM of a standalone workstation to boost performance.
6.1.4.1 Spatial-driven Indexing
To enable querying of spatial data, we transform the sub-graph that contains spatial objects into a geo document and store it in ElasticSearch. Figure 6.2 demonstrates a process that transforms a semantic spatial sub-graph into an ElasticSearch geo document. Note that, along with spatial attributes, ElasticSearch also allows the user to add additional attributes such as date-time, text description, etc. This feature allows us to develop more complex filters that combine spatial filters and full-text search in a single query.
The ElasticSearch geo document structure is shown in Listing 6.1. In this data structure, location is an ElasticSearch spatial entity that describes geospatial information. It has two properties: type and coordinates. The type can be point, line, polygon or envelope, while coordinates can be one or more arrays of longitude/latitude pairs. Details of the spatial index implementation will be discussed in Section 6.3.1.
{
  <field_1>: <value_1>,
  ...
  <field_n>: <value_n>,
  "location": {
    "type": <geo shape type>,
    "coordinates": <points>
  }
}
Listing 6.1: ElasticSearch geo document structure
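To make the combination of a spatial filter and full-text search concrete, the sketch below builds an ElasticSearch query body that matches a text field and restricts results to a bounding box on the location field. It only constructs the request as a Python dictionary and does not contact a cluster. The function name spatial_text_query and the "label" field are assumptions for this example; the query shape (bool, match, geo_shape with an envelope) follows the ElasticSearch query DSL, where an envelope is given as its upper-left and lower-right corners in longitude/latitude order.

```python
def spatial_text_query(text, min_lon, min_lat, max_lon, max_lat):
    """Build a query body combining full-text match and a bounding box."""
    return {
        "query": {
            "bool": {
                # Full-text condition on a hypothetical "label" field.
                "must": {"match": {"label": text}},
                # Spatial condition on the "location" geo shape field.
                "filter": {
                    "geo_shape": {
                        "location": {
                            "shape": {
                                "type": "envelope",
                                # Upper-left corner, then lower-right corner.
                                "coordinates": [
                                    [min_lon, max_lat],
                                    [max_lon, min_lat],
                                ],
                            },
                            "relation": "within",
                        }
                    }
                },
            }
        }
    }


query = spatial_text_query("airport", 18.0, 53.0, 18.5, 53.5)
```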
6.1.4.2 Temporal-driven Indexing
A large amount of sensor observation data is fed as time series of numeric values such as temperature, humidity and wind speed. For these time-series data, we choose OpenTSDB (Open Time Series Database) as the underlying scalable temporal database. OpenTSDB is built on top of HBase [59] so that it can ingest millions of time-series data points per second. As shown in Figure 6.1, input triples that comprise numeric values and timestamps are analyzed and extracted based on the predefined temporal patterns. Based on this extracted data, an OpenTSDB record is constructed and then stored in OpenTSDB tables.
In addition to the numeric values and timestamps, additional information can be added to each OpenTSDB data record. Such information can also be used to filter the temporal data. The additional information is selected according to how regularly it is used for filtering data in SPARQL queries. For example, a user might want to filter data by type of sensor, type of reading, etc. The data organization and schema design in OpenTSDB will be discussed in Section 6.2.
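As an illustration of such a record, the following Python sketch builds a data point in the JSON shape accepted by OpenTSDB's HTTP put endpoint (metric, timestamp, value, tags), with the filterable attributes carried as tags. The metric name and the tag keys sensor and station are hypothetical; the schema that EAGLE actually uses is the one described in Section 6.2.

```python
from datetime import datetime, timezone


def to_data_point(metric, observed_at, value, **tags):
    """Build an OpenTSDB-style data point from one sensor observation."""
    if not tags:
        # OpenTSDB expects every data point to carry at least one tag.
        raise ValueError("at least one tag is required")
    return {
        "metric": metric,
        # Seconds since the Unix epoch.
        "timestamp": int(observed_at.timestamp()),
        "value": value,
        "tags": {key: str(val) for key, val in tags.items()},
    }


point = to_data_point(
    "wind.speed",  # hypothetical metric name
    datetime(2017, 1, 10, 12, 0, tzinfo=timezone.utc),
    7.4,
    sensor="sensor42",        # hypothetical tag
    station="dublinAirport",  # hypothetical tag
)
```

Because tags are indexed alongside the metric, a query can later restrict the time series by sensor or station without scanning unrelated observations.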
6.1.5 Query Engine
As shown in the EAGLE architecture in Figure 6.1, the query processing of EAGLE is performed by the Query Engine, which consists of a Query Parser, a Query Optimizer, a Query Rewriter and a Query Delegator. It is important to mention that our query engine is developed on top of Apache Jena ARQ [96]. Therefore, the Query Parser is identical to the one in Jena. The Query Optimizer, Query Rewriter and Query Delegator have been implemented by modifying the corresponding components of Jena. For the Query Optimizer, in addition to Apache Jena's optimization techniques, we also propose a learning optimization approach that is able to efficiently predict a query execution plan for a previously unseen spatio-temporal query. Details of our approach will be presented in Chapter 7.
The query engine works as follows. First, for a given query, the Query Parser translates it and generates an abstract syntax tree. Note that we have modified the Query Parser so that it can adapt to our spatio-temporal query language. Next, the syntax tree is mapped to the SPARQL algebra expression, resulting in a query tree. In the query tree, there are two types of nodes, namely non-leaf nodes and leaf nodes. The non-leaf nodes are algebraic operators such as joins, and the leaf nodes are the variables present in the triple patterns of the given query. Following the SPARQL Syntax Expressions (footnote 5), Listing 6.2 presents a textual representation of the query tree corresponding to the spatio-temporal query in Example 5 of Section 6.4.
(propfunc temporal:avg
  (?value ?obs ?sensor ?time)
  ("10/01/2017"^^xsd:dateTime "10/02/2017"^^xsd:dateTime)
  (join
    (graph <urn:x-arq:DefaultGraphNode>
      (propfunc geo:sfWithin
        ?stationGeo geo:sfWithin (40.417287 -82.907123 40 miles)
        (table unit)))
    (quadpattern
      (quad <urn:x-arq:DefaultGraphNode> ?sensor sosa:isHostedBy ?weatherStation)
      (quad <urn:x-arq:DefaultGraphNode> ?sensor sosa:observes got:WindSpeedProperty)
      (quad <urn:x-arq:DefaultGraphNode> ?obs sosa:madeBySensor ?sensor))))
Listing 6.2: Textual representation of spatio-temporal query tree in Example 5
Note that the query tree generated by the Query Parser is just a plain translation of the initial query into SPARQL algebra; at this stage, no optimization has been applied. The query tree is then processed by the Query Optimizer, which is responsible for determining the most efficient execution plan with regard to query execution time and resource consumption. Once a suitable execution plan has been chosen, it is passed to the Query Rewriter for any further processing needed. Basically, the Query Rewriter rewrites the query operators into the query language of the underlying databases. In the next step, the Query Delegator delegates the rewritten sub-queries to the corresponding databases in the Data Manager. For example, a sub-query that contains a spatial operator or full-text search is evaluated by ElasticSearch, while a temporal operator is executed by OpenTSDB. Non-spatio-temporal queries are processed by Jena. After the sub-queries have been executed, the query results are transformed into the format that the Query Delegator requires. The Query Delegator then performs any post-processing actions needed. The final step formats the results to be returned to the user.
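The delegation rule in this pipeline can be sketched as a simple grouping of operators by target store. This is an illustration only, not the Jena-based implementation: it assumes operators are identified by prefixed names as in Listing 6.2, and the "text" prefix, target_store and delegate are hypothetical names introduced for the example.

```python
def target_store(operator):
    """Pick the backing store for an operator based on its prefix."""
    prefix = operator.split(":", 1)[0]
    if prefix in ("geo", "text"):  # "text" is an assumed prefix
        return "ElasticSearch"
    if prefix == "temporal":
        return "OpenTSDB"
    # Everything else is treated as a generic pattern handled by Jena.
    return "Jena"


def delegate(operators):
    """Group operators into one sub-query bucket per target store."""
    plan = {}
    for op in operators:
        plan.setdefault(target_store(op), []).append(op)
    return plan


plan = delegate([
    "temporal:avg",
    "geo:sfWithin",
    "sosa:isHostedBy",
    "sosa:observes",
])
```

For the operators of Listing 6.2, the temporal aggregate lands in the OpenTSDB bucket, the spatial function in the ElasticSearch bucket, and the remaining patterns in the Jena bucket.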