Performance Comparison Between Using Prediction Models and Cost-based Optimization for Query Planning

7.5.2 Experiment Results

7.5.2.5 Performance Comparison Between Using Prediction Models and Cost-based Optimization for Query Planning

We repeat the concurrent processing capability test described in Section 6.5.2.3 of Chapter 6. In this experiment, however, we use our proposed learning approach for query planning. Figure 7.9 reports the performance comparison between the two experiments in the case of 1000 query clients. The concurrent processing capability of our system is enhanced for all test queries, and the improvement is significant under this large number of query clients. This is evidenced by the considerable decrease in the query execution time for Q4, Q5, Q7 and Q10 in comparison with the original experiment. We attribute this to the difference between the time needed to generate an execution plan and the time needed to predict one. In the experiment in Section 6.5.2.3, the execution plan search space explodes for highly complex spatio-temporal queries, so the time required to search for and generate a suitable execution plan increases accordingly. When the system is under a heavy workload, this can be even worse. In our approach, by contrast, and leaving aside the training time, the time needed to predict an execution plan for a given query is shorter and is only slightly affected by the query complexity. Moreover, because the prediction process runs on a dedicated server, the influence of the concurrent query workload on the system is minimized.
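To give a rough sense of why plan search time grows with query complexity while prediction time stays nearly flat, the following sketch counts the left-deep join orderings of a query with n joined patterns (n!) and contrasts this with the number of distance computations in a brute-force nearest-neighbour lookup over a fixed set of stored queries. The restriction to left-deep orderings, the brute-force lookup and the training set size are illustrative assumptions; they are not details of the EAGLE optimizer or of the prediction models evaluated above.

```python
import math


def left_deep_join_orders(num_patterns: int) -> int:
    """Number of left-deep join orderings for num_patterns joined patterns."""
    return math.factorial(num_patterns)


def knn_lookup_cost(num_training_queries: int) -> int:
    """Distance computations for one brute-force nearest-neighbour lookup.

    The cost depends on the number of stored queries, not on how many
    patterns the incoming query contains.
    """
    return num_training_queries


TRAINING_SET_SIZE = 10_000  # hypothetical number of stored training queries

for n in (4, 8, 12):
    print(n, left_deep_join_orders(n), knn_lookup_cost(TRAINING_SET_SIZE))
```

A real cost-based optimizer prunes this space heavily, so the factorial is an upper bound on one class of plans rather than a measured planning cost.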

[Figure 7.9 (bar chart): series "Prediction models" and "Cost-based optimizer" for queries Q1 to Q11; value axis from 0 to 80000.]

Figure 7.9: Performance comparison of concurrent queries (1000 clients) between prediction models vs cost-based optimization

7.6 summary

In this chapter, in order to address the query planning issues discussed in Chapter 6, we propose an approach that accurately predicts both the query plan and the query execution time using similarity identification techniques on spatio-temporal SPARQL queries. Our goal was to build a machine learning model that recognizes the similarity of queries using only the information that is available before the queries are executed. Our extensive experimental results demonstrate the efficiency of our learning approach for query planning on spatio-temporal sensor data. The most significant improvement in our approach is achieved by including spatio-temporal information in the query similarity computation. In addition to the algebraic expression and the spatio-temporal graph patterns, we propose a new feature vector, namely query cardinality, to enhance the prediction accuracy for queries that share similar algebraic expressions and graph patterns but have different query performance. The experimental results indicate that our approach can accurately predict over 95% of input queries. Furthermore, in comparison with the original cost-based query optimizer, the query performance of our learning approach is significantly improved for a large number of query clients. With these results, we believe our prediction models can complement the existing EAGLE query optimizer so that it can adapt to changes in the underlying environment and deal with high volumes of concurrent queries.
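The idea of predicting a plan and an execution time from similar, previously executed queries can be sketched as a nearest-neighbour lookup over feature vectors. This is a minimal illustration only: the k-nearest-neighbour method, the Euclidean distance, the three-element feature layout (algebra feature, spatio-temporal feature, normalized cardinality) and all names and values below are assumptions made for the example, not the models or features used in this thesis.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrainingQuery:
    # Hypothetical layout: [algebra feature, spatio-temporal feature, cardinality]
    features: List[float]
    plan_id: str          # label of the execution plan that was used
    exec_time_ms: float   # observed execution time


def euclidean(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(features: List[float], training: List[TrainingQuery],
            k: int = 3) -> Tuple[str, float]:
    """Return (majority plan, mean execution time) of the k nearest queries."""
    nearest = sorted(training, key=lambda q: euclidean(features, q.features))[:k]
    votes = {}
    for q in nearest:
        votes[q.plan_id] = votes.get(q.plan_id, 0) + 1
    plan = max(votes, key=votes.get)
    mean_time = sum(q.exec_time_ms for q in nearest) / len(nearest)
    return plan, mean_time


# The first four queries share algebra and spatio-temporal features and
# differ only in cardinality, which is what separates planA from planB.
training = [
    TrainingQuery([2.0, 1.0, 0.1], "planA", 100.0),
    TrainingQuery([2.0, 1.0, 0.2], "planA", 120.0),
    TrainingQuery([2.0, 1.0, 0.9], "planB", 900.0),
    TrainingQuery([2.0, 1.0, 0.8], "planB", 1100.0),
    TrainingQuery([5.0, 0.0, 0.5], "planC", 300.0),
]

plan, time_ms = predict([2.0, 1.0, 0.85], training, k=3)
print(plan, round(time_ms, 1))
```

In this toy data set, dropping the cardinality element would make the first four stored queries indistinguishable, which mirrors the motivation given above for adding cardinality as a feature.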

8 CONCLUSIONS AND FUTURE WORK

The need for efficient querying over massive amounts of sensor data lies at the heart of most sensor data analytics platforms. This thesis presents our recent efforts to leverage linked data and NoSQL technologies to manage sensor data effectively. Our approach not only provides complex spatio-temporal query functions to users but also demonstrates the ability to handle billions of sensor data records. In this chapter, we summarize the original contributions of this thesis and suggest directions for future work arising from the developments undertaken.

8.1 contributions

Publishing Linked Meteorological Data on the LOD Cloud

To address the lack of spatio-temporal linked sensor datasets in the LOD cloud, in Chapter 4 we present the publishing process of our linked meteorological dataset. The dataset is built upon the ISH data, which is originally published by NOAA. Our linked meteorological dataset consists of 26 billion triples that are transformed from 3.7 billion ISH observation records. Moreover, the dataset is enriched with additional semantic sensor descriptions by interlinking it with other data sources. The dataset plays a vital role and has been used in all experimental evaluations presented in this thesis. In Chapter 4, we also present our proposed linked sensor data model. The model is built on a combination of several existing ontologies, such as SOSA/SSN, GeoSPARQL and OWL-Time, and aims to semantically describe the spatial and temporal aspects of sensor data.

Systemic Evaluation of RDF Stores for Linked Sensor Data

Publishing our linked meteorological dataset on the LOD cloud raises the challenge of selecting an appropriate RDF store to host the dataset. Unlike traditional RDF stores, the selected one should support spatio-temporal queries so that the spatio-temporal characteristics of the dataset can be preserved. Moreover, this store should also be capable of dealing with the "big data" challenge of processing sensor data. Since no such assessment of RDF stores for Linked Sensor Data existed, we were motivated to conduct a first performance study of RDF stores for storing and querying sensor data, presented in Chapter 5. This work aims to analyze the weaknesses and strengths of current triple store implementations applied in the Linked Sensor Data context. In this study, we have summarized a list of fundamental requirements that RDF stores must meet to be used for managing and querying sensor data. We have also described the abstract architecture design of current RDF database technologies that support spatio-temporal queries. Additionally, we have provided a comparative analysis of the data loading and query performance of a representative set of RDF stores. In this evaluation, particular attention has been given to the performance of geo-spatial search, temporal filter and full-text search over our linked meteorological dataset. Since these assessment aspects have not been fully considered and addressed by existing evaluations, our study gives valuable insights into the advantages and limitations of current RDF store implementations when applied to Linked Sensor Data.

A Scalable Spatio-temporal Query Processing Engine for Linked Sensor Data

Following the assessments in Chapter 5, several drawbacks of the current RDF stores for Linked Sensor Data have been analyzed. One of the major limitations is the poor performance of these stores when dealing with a massive amount of sensor data. Another is the lack of spatio-temporal analytical query support. These challenges motivate us to propose EAGLE, a scalable spatio-temporal query processing engine for Linked Sensor Data. EAGLE adopts a loosely coupled hybrid architecture that consists of different clustered databases for managing different aspects of sensor data. This flexible architecture not only helps the engine to deal with the overhead of "big data" processing but also allows us to make use of the existing spatio-temporal query functions provided by the underlying databases. Another contribution is our spatio-temporal SPARQL language extensions, which aim to support querying spatial and temporal RDF data. The query language extensions adopt the GeoSPARQL syntax and also provide a set of temporal analytical property functions.

Efficient Spatio-temporal Data Partitioning Strategy

Along with the query processing engine and the spatio-temporal query language, we also propose an OpenTSDB-based storage model to address the performance issues of temporal data loading and spatio-temporal analytical queries. In this approach, we propose a rowkey design scheme that takes the geohash prefix of the sensor location and the observation result time as the first two encoded elements of the rowkey. These are followed by additional descriptions of the observation data, such as the sensor URI and the observed property, that can be used for filtering data. By using this rowkey scheme, the spatio-temporal locality of the data is preserved. In particular, the data of all sensors located within the same area are ordered and stored in the same partitions on disk. This essential feature allows the query engine to accelerate range scans for queries that request the data of a specific area over a time interval. In addition to the geohash-based rowkey design, we also propose a spatio-temporal data partitioning strategy that allows the OpenTSDB data tables to be pre-split at the time of table creation. Pre-splitting the data tables helps the data to be evenly distributed across all servers so that region hot-spotting issues can be avoided. This spatio-temporal partitioning strategy also helps the engine to minimize the time cost of locating the required data partition and of the data scan operation.
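The rowkey idea described above can be sketched as follows. This is a minimal illustration under stated assumptions: the byte layout (ASCII geohash prefix, then a 4-byte big-endian hour-aligned timestamp, then a sensor identifier), the prefix length, the helper names and the choice to pre-split on the first geohash character are all hypothetical and are not the actual EAGLE or OpenTSDB encoding. The geohash encoder itself follows the standard public algorithm.

```python
import struct

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Standard geohash: interleave longitude/latitude bisection bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count, use_lon = [], 0, 0, True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def make_rowkey(lat: float, lon: float, epoch_seconds: int,
                sensor_id: str, prefix_len: int = 4) -> bytes:
    """Hypothetical layout: geohash prefix | hour-aligned time | sensor id."""
    prefix = geohash_encode(lat, lon, prefix_len).encode("ascii")
    base_time = epoch_seconds - (epoch_seconds % 3600)
    return prefix + struct.pack(">I", base_time) + sensor_id.encode("utf-8")


# Hypothetical pre-split: one region per leading geohash character,
# i.e. 31 split keys separating 32 regions.
split_points = [c.encode("ascii") for c in _BASE32[1:]]

k1 = make_rowkey(57.64911, 10.40744, 3600, "sensor-1")
k2 = make_rowkey(57.64911, 10.40744, 7200, "sensor-1")
print(k1[:4], k1 < k2, len(split_points))
```

Because the keys sort bytewise, rows that share a geohash prefix are adjacent and ordered by time within that prefix, so a query for one area over a time interval maps to a contiguous range scan.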

Efficient and Adaptive Query Planning

To enhance query optimization, we propose a learning approach that can accurately predict the query plan for a given spatio-temporal query based on existing execution plans. Compared to the state of the art, a major contribution of our approach is the use of machine learning to improve similarity detection by taking into account not only semantic aspects but also spatio-temporal correlations. In addition to query planning, we also aim to predict the query execution time so that the query engine has a better understanding of the query behavior prior to query execution. Accurately predicting the query execution time enables us to optimize our costly cloud infrastructure by rescheduling or preventing long-running queries that might cause too much resource contention or that will not be completed by a particular deadline. Our experiments confirm that the learning approach can improve the performance of the query engine by orders of magnitude. Whilst our approach still has its limitations, it is a step towards providing an alternative solution to the query optimization challenge for the spatio-temporal management of sensor data.
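As a rough illustration of how a predicted execution time could drive rescheduling or rejection, the sketch below sorts incoming queries into three groups. The class, the function, the threshold and the three-way policy are hypothetical and are not part of EAGLE; the sketch only assumes that a predicted execution time and a deadline are available for each query before it runs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QueryRequest:
    query_id: str
    predicted_ms: float  # execution time predicted before the query runs
    deadline_ms: float   # time budget allowed for this query


def triage(requests: List[QueryRequest],
           long_running_ms: float = 60_000.0
           ) -> Tuple[List[str], List[str], List[str]]:
    """Split requests into (run now, deferred, rejected) query ids."""
    run_now, deferred, rejected = [], [], []
    for r in requests:
        if r.predicted_ms > r.deadline_ms:
            rejected.append(r.query_id)   # predicted to miss its deadline
        elif r.predicted_ms > long_running_ms:
            deferred.append(r.query_id)   # long-running: reschedule later
        else:
            run_now.append(r.query_id)
    return run_now, deferred, rejected


requests = [
    QueryRequest("q1", predicted_ms=500.0, deadline_ms=5_000.0),
    QueryRequest("q2", predicted_ms=90_000.0, deadline_ms=600_000.0),
    QueryRequest("q3", predicted_ms=8_000.0, deadline_ms=2_000.0),
]
print(triage(requests))
```

The usefulness of such a policy depends directly on how accurate the predicted execution times are.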
