Avalanches of geospatial data streaming from various, often heterogeneous, channels pose looming threats to businesses and present them with formidable challenges and hazards, alongside the significant patterns hidden deep inside stockpiles of geo-referenced data. Neither streaming data nor batch snapshots exist in a void: they complement each other, and analyzing either of them alone does not reveal the whole picture needed for better decision making. We posit that “one-size-fits-all” does not hold true in distributed spatial stream processing and management environments. Often, deep historical insights need to be combined with data-in-motion to improve the quality of the analytics. Current systems do not natively offer QoS awareness as a transparent underlying layer for processing streams of geo-referenced data. More often than not, users need technical knowledge to tune the system at the QoS level. A QoS-aware system for processing fast-arriving spatial data streams is therefore needed, one that transparently incorporates QoS awareness within its layers so that its constituent parts operate synergistically toward a prespecified set of QoS goals. Consequently, users at the presentation layer do not need to reason about the underlying QoS logistics and can instead rely on them seamlessly in their applications.
To achieve those goals and close the gaps in the literature, we have designed in this thesis a QoS-aware DSMS for huge loads of geo-referenced streaming data (we term our system SpatialDSMS). The system is built with a modular architecture that streamlines the orchestration between the constituent sub-systems, so that development and deployment efforts are not repeated for every workload separately. Instead, the systems we have designed and incorporated in SpatialDSMS work collaboratively and synergistically to achieve this modularity. Figuratively, traditional independent big geospatial management systems are operating in uncharted territory, and SpatialDSMS is the compass. It has been designed to achieve the desired QoS goals intrinsically. We have specifically designed, implemented and incorporated in SpatialDSMS the following sub-systems:
7.1.1 SpatialBPE
SpatialBPE is the component responsible for processing arriving workloads in batch mode. Snapshots of the streaming data are first spilled to disk. Thereafter, on demand, SpatialBPE can be asked to analyze portions of this data-at-rest to obtain historical insights that assist in decision making. The QoS of this component depends on its ability to serve results quickly (i.e., the low-latency QoS goal). It is also desirable to co-locate geographically nearby spatial objects so as to minimize network shuffling, which allows QoS-aware sharing of network resources and thus serves the high-resource-utilization QoS goal. These QoS-aware services are transparently injected on top of the codebases of best-in-class representatives (i.e., Spark in this case). In this way, SpatialBPE contributes to the modular architectural design goal envisaged for SpatialDSMS.
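The co-location idea above can be illustrated with a minimal sketch. It assumes that spatial objects are points keyed by a standard geohash and that objects sharing a geohash prefix are routed to the same partition; the function names, the prefix length and the precision are hypothetical choices for illustration and are not taken from the SpatialBPE codebase.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=6):
    """Encode a point as a standard geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, nbits = 0, 0
    refine_lon = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        if refine_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        refine_lon = not refine_lon
        nbits += 1
        if nbits == 5:  # every 5 bits form one base-32 character
            chars.append(_BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)


def partition_by_prefix(points, prefix_len=4, precision=6):
    """Group (lat, lon) points by geohash prefix (hypothetical partitioning key).

    Points that share a prefix fall in the same coarse cell, so a
    neighbourhood query touches few partitions and shuffles less data.
    """
    partitions = defaultdict(list)
    for lat, lon in points:
        key = geohash_encode(lat, lon, precision)[:prefix_len]
        partitions[key].append((lat, lon))
    return dict(partitions)
```

In a Spark setting the prefix would typically serve as the partitioning key of a keyed dataset; the sketch only shows how the key is derived and why nearby points end up together.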
7.1.2 SpatialNoSQL
SpatialNoSQL constitutes a scalable, QoS-aware backend storage framework for snapshots of geo-referenced streaming data. It consolidates heterogeneous resources into a unified, compatible format. Snapshots coming from streams are transformed into that format and distributed across multiple shards according to QoS-aware rules, in a way that helps achieve the QoS goals prespecified by the user. SpatialNoSQL incorporates a custom sharding scheme (i.e., GSS) that is attuned to the shape of the data (i.e., its being spatial). The scheme helps strike a plausible balance between two sharding goals (i.e., SDL preservation and load balancing). SpatialNoSQL also hosts two spatial query optimizers that exploit our custom sharding scheme in achieving the QoS goals. Being designed to complement the other components of SpatialDSMS, it has a modular architecture that enables it to work synergistically with them on mixed-workload problems. Consider, for example, a fast approximate stream-static join. Since the static table is stored in SpatialNoSQL with polygons represented as covering geohashes, it is easy to combine with a geohashed streaming data load: we simply overlay the map of streaming points (from a micro-batch) on the map of covering polygons, and the join reduces to a simpler form known as MBR-join, which acts as a quick sieve with statistically rigorous error bounds. The fact that both geospatial objects (i.e., the stream and the static master table) share the same representation (i.e., geohash) is what enables this kind of mixed workload with QoS guarantees. This has also encouraged us to design SpatialSPE, which is discussed in the next subsection.
7.1.3 SpatialSPE
For huge spikes that need to be processed quickly, where a small loss of accuracy can be traded for large performance gains (i.e., low latency, high throughput and high resource utilization), we have designed SpatialSPE as the first in its class that is able to perform incremental spatial analytics through a declarative SQL-like API. This relieves geo-statisticians from having to reason about the intricacies and complexities of the underlying systems, so that they can focus instead on the statistical analytics. SpatialSPE is based on robust statistical modelling and is implemented on top of an emerging micro-batch SPE (i.e., Spark Structured Streaming). It hosts a spatial-aware sampling method, SAOS, which is attuned to the characteristics of the data and thereby benefits the QoS goals. SpatialSPE achieves statistically plausible results and outperforms its counterparts by orders of magnitude. It contributes to the modular architectural design goal of SpatialDSMS in the sense that it integrates seamlessly with the other components, so that they all work synergistically and collaboratively toward the envisaged set of QoS goals.
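To make the idea of spatial-aware sampling concrete, the following is a minimal sketch of generic stratified sampling in which each spatial cell (for instance a geohash) is treated as a stratum. It is an illustration under stated assumptions and not the SAOS method itself: the function names, the per-cell sampling fraction and the choice of the mean as the estimated statistic are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, fraction, seed=0):
    """Sample the same fraction from every spatial cell of a micro-batch.

    `records` is an iterable of (cell_id, value) pairs. Each non-empty cell
    keeps at least one value, so sparse regions are never dropped entirely.
    Returns {cell_id: (cell_population_size, sampled_values)}.
    """
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for cell_id, value in records:
        by_cell[cell_id].append(value)

    sample = {}
    for cell_id, values in by_cell.items():
        k = max(1, round(fraction * len(values)))
        sample[cell_id] = (len(values), rng.sample(values, k))
    return sample


def estimate_mean(sample):
    """Stratified estimate of the overall mean.

    Each cell's sample mean is weighted by the cell's population size.
    """
    total = sum(size for size, _ in sample.values())
    weighted = sum(size * (sum(vals) / len(vals)) for size, vals in sample.values())
    return weighted / total
```

Because every cell contributes in proportion to its population, dense and sparse regions are both represented, which is the general motivation for sampling along spatial strata rather than uniformly over the whole stream.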
7.1.4 SpatialSSJP
The most insightful analyses often happen during spikes in streaming data arrival rates, which at times necessitates mixing the fast loads with disk-resident descriptions in a costly operation commonly known as a stream-static join. We have designed SpatialSSJP so that it complements the other components of SpatialDSMS in a modular way. SpatialSSJP incorporates QoS-aware services transparently within the layers of the codebase of a best-in-breed SPE (i.e., SpSS), so as to relieve users at the presentation layer from having to reason about the complex underlying logistics. These services include an adaptive controller comprising two sub-controllers: a latency-aware controller based on PID control from control theory, and a model-based accuracy-aware controller based on geo-statistical modelling. SpatialSSJP is modular by design and conveniently complements the other components of SpatialDSMS. Most importantly, it reuses our SAOS sampling method from the SpatialSPE framework.
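A latency-aware PID controller of the kind mentioned above can be sketched as follows. The sketch assumes that the controlled quantity is the sampling fraction applied to each micro-batch and that the measured signal is the batch processing latency; the class name, the gains, the bounds and this choice of actuator are hypothetical and are not taken from the SpatialSSJP implementation.

```python
class PIDLatencyController:
    """Textbook PID loop mapping latency error to a sampling fraction."""

    def __init__(self, target_latency, kp, ki, kd,
                 initial_fraction=1.0, min_fraction=0.05, max_fraction=1.0):
        self.target_latency = target_latency
        self.kp, self.ki, self.kd = kp, ki, kd
        self.bias = initial_fraction  # output when the error is zero
        self.min_fraction = min_fraction
        self.max_fraction = max_fraction
        self.fraction = initial_fraction
        self._integral = 0.0
        self._prev_error = None

    def update(self, observed_latency, dt=1.0):
        """Return the sampling fraction to use for the next micro-batch."""
        # Positive error means latency headroom, so more data can be kept.
        error = self.target_latency - observed_latency
        self._integral += error * dt
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error

        output = (self.bias
                  + self.kp * error
                  + self.ki * self._integral
                  + self.kd * derivative)
        # Clamp to the valid range (no anti-windup in this simple sketch).
        self.fraction = min(self.max_fraction, max(self.min_fraction, output))
        return self.fraction
```

In use, the controller would be updated once per micro-batch with the latency just observed, and the returned fraction would be handed to the sampler for the next batch. An accuracy-aware controller would constrain the same fraction from below; that part is not sketched here.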
7.1.5 Putting it All Together: SpatialDSMS
Dynamic applications in smart cities and Industry 4.0 require mixing several workloads to gain deeper insights. The constituent parts of SpatialDSMS provide tools for collaboratively and synergistically achieving the QoS goals imposed by those workloads. QoS awareness is transparently incorporated within the various layers of SpatialDSMS, thus relieving users from having to reason about the underlying logistics for handling such awareness.
The map in Figure 7.1 coherently delineates the contributions we have made in this thesis, together with all the tactics and methods we have designed for achieving the list of envisaged QoS goals. This map complies with the methodology we have designed, as described in Section 3.2.1.