4.4 Storage System Architecture
4.5.4 Query Diversity
In this experiment we evaluate how throughput varies as the number of cold queries submitted increases. We run the experiment with 1%, 5%, 10%, and 20% of the queries in cold regions, to demonstrate the effect of query diversity on the system’s throughput, and compare the results with the benchmark system. We fix the Active-Time window to 120s in this experiment. Figure 4.16(a) shows the average system throughput (queries per minute) as we vary the number of cold queries in Hot-Mode. In the benchmark system the throughput remains constant because all data is stored in memory at all times; thus any query, either hot or cold, finds the requested data partitions readily available in memory. With our approach we observed a slight decrease in throughput as the percentage of query coldspots increased: a larger number of cold queries causes more data to be loaded into main memory, which increases disk I/O. However, the decrease in query throughput was small compared to the increase in the number of query coldspots; furthermore, the memory savings of our approach justify these results, as we previously demonstrated. Figure 4.16(b) shows the same experiment with the system starting in Cold-Mode; here the overall throughput decreased because more data is loaded from disk. The effects of query diversity, however, are similar to those in the previous experiment, and as such are justified by the savings in memory consumption.
In conclusion, in both Hot and Cold modes our system achieved good throughput, close to the benchmark, with little decrease in throughput as the number of cold queries submitted increases. The savings in memory consumption of our approach, however, justify these results.
Figure 4.16: System throughput (queries per minute) varying the number of cold queries. Both panels plot throughput (query/min, axis range 0 to 400) against the percentage of cold queries (1%, 5%, 10%, 20%) for OceanST and OurWork.
(a) Throughput in Hot-Mode. OceanST: 370, 372, 368, 375; OurWork: 362, 355, 344, 311.
(b) Throughput in Cold-Mode. OceanST: 370, 372, 368, 375; OurWork: 332, 307, 282, 241.
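The size of the throughput change can be read directly off the bar labels of Figure 4.16. The following is a minimal sketch of that arithmetic; the assignment of the eight labels in each panel to the two series (first four to OceanST, last four to OurWork) is an assumption based on the legend order and on the statement that the benchmark throughput stays constant, and the function names are illustrative only.

```python
# Bar labels read from Figure 4.16 (queries per minute).
# ASSUMPTION: in each panel the first four labels belong to OceanST and
# the last four to OurWork, following the legend order.
COLD_QUERY_PCT = [1, 5, 10, 20]
THROUGHPUT = {
    "hot": {"OceanST": [370, 372, 368, 375], "OurWork": [362, 355, 344, 311]},
    "cold": {"OceanST": [370, 372, 368, 375], "OurWork": [332, 307, 282, 241]},
}


def relative_to_benchmark(mode):
    """Throughput of OurWork as a fraction of OceanST at each cold-query level."""
    ours = THROUGHPUT[mode]["OurWork"]
    base = THROUGHPUT[mode]["OceanST"]
    return [o / b for o, b in zip(ours, base)]


def total_drop(mode, system="OurWork"):
    """Fractional throughput drop between the 1% and 20% cold-query settings."""
    values = THROUGHPUT[mode][system]
    return (values[0] - values[-1]) / values[0]


if __name__ == "__main__":
    for mode in ("hot", "cold"):
        ratios = ", ".join(f"{r:.1%}" for r in relative_to_benchmark(mode))
        print(f"{mode}: vs benchmark [{ratios}], drop 1%->20%: {total_drop(mode):.1%}")
```

Reporting the ratio to the benchmark at each level, rather than a single average, keeps the Hot-Mode and Cold-Mode trends separately visible.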
4.5.5 Summary
In this chapter we proposed a trajectory storage architecture on top of the Spark framework, with efficient resource utilization and concurrency control for multi-user environments. We exploit the in-memory nature and distributed parallel properties of Spark for scalable and low-latency trajectory data storage and processing. A key feature of our architecture is the ability to identify query hotspots and exchange data between main memory and disk based on the query workload, while still leveraging the scalability, fault-tolerance, efficiency, and concurrency control features of Spark. We developed a system on top of our proposed architecture, in which administrators are able to set up the cluster configuration, as well as the parameters for data partitioning, physical planning, storage control, and the number of concurrent tasks supported by the cluster. Users are able to submit spatial-temporal queries for parallel concurrent processing. We developed a prototype of our system and demonstrated its ability to process multiple concurrent requests over large-scale data while maintaining steady performance and economical memory consumption under different query workloads and configurations. Our experiments demonstrated that our system architecture achieves high throughput compared to the state-of-the-art, while using up to 3.5x less memory. We believe our system will support scientists and professionals working with large-scale trajectory-based applications.
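The idea of keeping recently queried partitions in memory and moving the rest to disk can be illustrated with a small sketch. This is not the storage controller of the system; it is a minimal, single-process illustration that assumes a partition counts as hot if it was queried within the Active-Time window (120s in the experiment above). The class name, method names, and eviction rule are hypothetical.

```python
class ActiveTimeController:
    """Tracks which partitions stay in memory under an active-time window.

    ASSUMPTION: a partition is "hot" if it was queried within the last
    window_s seconds; the real policy may differ.
    """

    def __init__(self, window_s=120.0):
        self.window_s = window_s
        self.last_access = {}  # partition id -> time of most recent query
        self.in_memory = set()  # ids of partitions currently held in memory

    def on_query(self, partition_id, now):
        """Record a query; return True if the partition had to be loaded from disk."""
        loaded = partition_id not in self.in_memory
        self.in_memory.add(partition_id)
        self.last_access[partition_id] = now
        return loaded

    def evict_cold(self, now):
        """Drop partitions not queried within the window; return the evicted ids."""
        cold = {p for p in self.in_memory
                if now - self.last_access[p] > self.window_s}
        self.in_memory -= cold
        return cold


if __name__ == "__main__":
    ctrl = ActiveTimeController(window_s=120.0)
    ctrl.on_query("p1", now=0.0)
    ctrl.on_query("p2", now=100.0)
    ctrl.on_query("p1", now=110.0)
    print(sorted(ctrl.evict_cold(now=230.0)))
```

Under this rule a cold query pays a disk load on first access, which is the source of the extra disk I/O discussed in the query diversity experiment.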
Chapter 5
Top-k Most Similar Trajectories using Spark
5.1 Introduction
Top-k most similar trajectories search (k-NN) is frequently used for classification and in recommendation systems over spatial-temporal trajectory databases. The problem is useful, for instance, for automatic classification, origin-destination analysis, and identifying objects that move in the same pattern. However, k-NN trajectories search is a complex operation, and a multi-user application should be able to process multiple k-NN trajectories searches concurrently over large-scale data in an efficient manner. The k-NN trajectories problem has received plenty of attention; however, state-of-the-art works fall short in one of two ways. Some consider neither parallel processing of k-NN trajectories search nor concurrent queries in distributed environments. Others parallelize k-NN search for simpler spatial objects (e.g. 2D points) using MapReduce, but ignore the temporal dimension of more complex data, such as spatial-temporal trajectories. In this work we propose a parallel approach to the k-NN trajectories problem in a distributed and multi-user environment using the Spark framework. We propose a space/time data partitioning based on Voronoi diagrams and time pages, named Voronoi Pages, in order to provide both spatial-temporal data organization and process decentralization. In addition, we propose a spatial-temporal index for our partitions to efficiently prune the search space and to improve query latency and system throughput. We implemented our solution on top of Spark’s RDD data structure, which provides a thread-safe environment for concurrent MapReduce tasks in main-memory databases. We perform extensive experiments to demonstrate the performance and scalability of our approach.
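To make the combination of Voronoi cells and time pages concrete, the sketch below assigns each trajectory sample to a (cell, page) key. It is an illustration under stated assumptions, not the partitioning algorithm of this chapter: samples are assumed to be (x, y, t) tuples, the Voronoi cell is taken to be the index of the nearest pivot under Euclidean distance, and a time page is a fixed-length time slot. The function names, the pivot list, and the page length are hypothetical.

```python
import math


def voronoi_page_key(point, pivots, page_seconds):
    """Map an (x, y, t) sample to a (cell index, time page index) pair."""
    x, y, t = point
    # Voronoi cell = index of the nearest pivot (Euclidean distance).
    cell = min(range(len(pivots)),
               key=lambda i: math.hypot(x - pivots[i][0], y - pivots[i][1]))
    # Time page = index of the fixed-length time slot that contains t.
    page = int(t // page_seconds)
    return cell, page


def partition_trajectory(samples, pivots, page_seconds):
    """Group the samples of one trajectory by their (cell, page) key."""
    pages = {}
    for sample in samples:
        key = voronoi_page_key(sample, pivots, page_seconds)
        pages.setdefault(key, []).append(sample)
    return pages


if __name__ == "__main__":
    pivots = [(0.0, 0.0), (10.0, 0.0)]  # hypothetical pivot locations
    samples = [(1.0, 1.0, 30), (4.0, 0.0, 500), (9.0, 1.0, 700)]
    for key, group in sorted(partition_trajectory(samples, pivots, 600).items()):
        print(key, group)
```

With keys of this shape, a query restricted to a region and a time interval only needs to touch the partitions whose cell and page overlap it, which is the pruning opportunity a spatial-temporal index over the partitions can exploit.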
Motivation and Applications: Given a query trajectory T, a constant k, a time interval [t0,t1], and a trajectory dataset S, the top-k nearest neighbor trajectories problem (k-NN, also known in the literature as k-most-similar trajectories) is to find in S the k trajectories closest (or most similar) to T that are active during [t0,t1]. k-NN trajectories search is one of the most traditional query operations in trajectory databases, and has received plenty of attention, e.g. [31], [56], [130], [153], [158]. Applications include, for example, identifying the top-k vehicle trajectories on a frequent path in order to calculate their average fuel consumption during a certain period of time (e.g. peak hours with heavier traffic), so as to optimize gas station placement or logistics. Other applications include identifying seasonal patterns in natural phenomena, such as hurricanes and tornadoes; determining migration patterns of certain groups of animals throughout the year; and sports research to help coaches identify movement patterns of top players. However, processing k-NN trajectories search in a multi-user environment is challenging: the application may be serving hundreds of requests over the network, and k-NN search in general demands extensive use of computational resources. Furthermore, k-NN search for trajectories is a complex operation. Unlike simpler spatial objects, trajectories are essentially non-uniform sequential data of variable length, with both spatial and temporal attributes; one may also need to consider data uncertainty [191]. Overall, trajectories are considered similar if they follow a certain motion pattern, or move in a similar way (i.e. keep spatially close to each other) for the majority of their time extent.
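The problem statement can be made concrete with a brute-force baseline. The sketch below is not the approach proposed in this chapter: it scans every trajectory sequentially, and it assumes a simple stand-in distance (the symmetric mean closest-sample distance), because this section does not fix a distance function. It also assumes that a trajectory is a list of (x, y, t) samples and that "active during [t0,t1]" means having at least one sample in the interval. All function names are hypothetical.

```python
import heapq
import math


def clip(trajectory, t0, t1):
    """Keep only the (x, y, t) samples whose timestamp lies in [t0, t1]."""
    return [p for p in trajectory if t0 <= p[2] <= t1]


def one_way(a, b):
    """Mean distance from each sample of a to its closest sample in b."""
    return sum(min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in b)
               for p in a) / len(a)


def distance(a, b):
    """Symmetric stand-in distance between two non-empty clipped trajectories.

    ASSUMPTION: chosen for illustration only; it ignores sample timestamps.
    """
    return (one_way(a, b) + one_way(b, a)) / 2.0


def knn_trajectories(query, dataset, k, t0, t1):
    """Return ids of the k trajectories in dataset closest to query in [t0, t1]."""
    q = clip(query, t0, t1)
    if not q:
        return []
    scored = []
    for traj_id, traj in dataset.items():
        c = clip(traj, t0, t1)
        if c:  # skip trajectories that are not active during the interval
            scored.append((distance(q, c), traj_id))
    return [traj_id for _, traj_id in heapq.nsmallest(k, scored)]


if __name__ == "__main__":
    query = [(0, 0, 0), (1, 0, 1), (2, 0, 2)]
    dataset = {
        "a": [(0, 1, 0), (1, 1, 1), (2, 1, 2)],
        "b": [(0, 5, 0), (1, 5, 1), (2, 5, 2)],
        "c": [(0, 0.5, 10), (1, 0.5, 11)],  # not active during [0, 2]
    }
    print(knn_trajectories(query, dataset, k=2, t0=0, t1=2))
```

Every query in this baseline compares T against all of S, which is the cost that motivates partitioning and pruning when many users issue k-NN searches concurrently.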
The Case for Spark: The massive amount of GPS data available, as well as the increasing number of users of trajectory data applications, demands more robust, reliable and scalable solutions, since real-world location-based services should be able to serve multiple requests over large-scale datasets. A typical solution is therefore to consider distributed parallel computation with frameworks such as MapReduce (MR) [39], which provides an abstraction for parallel computation and efficient resource allocation among concurrent threads. MR has become very popular with the increasing interest in moving data into cloud-based systems, multi-core servers, and commodity clusters. Spark [181], on the other hand, provides an MR solution with the goal of speeding up data processing by storing data in main memory [14], [105], [182], [185]; furthermore, Spark is particularly suitable for iterative algorithms, such as k-NN search.