To overcome the fundamental limitations of existing approaches to data stream management
systems, our dissertation research addresses two important problems: 1) enabling
incremental maintenance of queries of the transitive closure type over distributed data streams;
2) enabling incremental query re-optimization, both in a declarative fashion and in a
procedural fashion. The first work addresses the scale and performance issues in supporting
transitive closure queries over data streams. It makes practical the support for linear
recursion in (distributed) data stream management systems. The second work focuses on the
performance issues of frequent query re-optimizations in stream processing.
Our general thesis is that efficient incremental processing and re-optimization of
update streams can be achieved by various incremental view maintenance techniques if we cast
the problems as incremental view maintenance problems over data streams. In particular, we make the following contributions:
• We formulate the problem of computing recursive tasks over dynamic data streams
as a classical incremental view maintenance problem, which allows many generic
incremental view maintenance techniques to be applied to the challenges. We develop
a novel, compact absorption provenance as an annotation attached to each data item,
which enables us to directly detect when view tuples are no longer derivable and should be removed, where the views are defined over a stream of insertions or deletions. We also develop several heuristics to ensure that the absorption provenance annotation, maintained in a Binary Decision Diagram (BDD), remains compact
under different topologies. We implement all of the above solutions in our Aspen
prototype system, and experimentally validate the performance under several different scenarios. Results show orders-of-magnitude savings compared to prior approaches in various settings.
• We explore whether full-fledged cost-based incremental techniques for query
re-optimization can be developed, where an optimizer would only re-explore query plans whose costs were affected by an updated cardinality or cost value; and whether such incremental techniques could be used to facilitate more efficient adaptivity in dynamic scenarios. We propose a rule-based and declarative approach to query re-optimization. We develop a formulation of query re-optimization as an incremental view maintenance problem, for which we develop novel algorithms such as incremental aggregate selection, incremental reference counting and incremental pruning.
We implement our solutions in our prototype system, Aspen, and conduct comprehensive
performance studies against alternative approaches over a diverse set of workloads. Results show order-of-magnitude performance gains versus non-incremental approaches to query re-optimization.
• We show that our approaches to incremental re-optimization from the declarative
perspective can be easily incorporated into traditional procedural query optimizers
(e.g., bottom-up optimizers with dynamic programming and top-down optimizers with branch-and-bound), without changing their architectures. We study how to design and implement full-fledged cost-based procedural incremental query re-optimization frameworks and present both analytical and empirical results.
• As a case study of this dissertation, we have also developed a prototype system,
Aspen (in approximately 80k lines of code), which serves as an end-to-end distributed
adaptive query processing system to address the challenges of our applications at hand. This system not only implements all of the solutions in this dissertation, but also serves as the backend of a real campus building application (SmartCIS) at
Penn [62,63].
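To make the first contribution concrete, the following is a minimal sketch of the idea behind absorption provenance, not the Aspen implementation. It assumes that each view tuple carries a Boolean expression over base-tuple identifiers, stored here as an absorbed disjunctive normal form (a set of conjunctions) rather than in a BDD. The function names (absorb, p_or, p_and, delete_base, derivable) and the link identifiers are hypothetical and chosen only for illustration.

```python
from itertools import product

# Assumption: provenance is kept in absorbed DNF, i.e. a set of conjunctions,
# each conjunction being a frozenset of base-tuple identifiers.
# The dissertation stores the annotation in a BDD; a plain set is used here
# only to keep the sketch self-contained.


def absorb(terms):
    """Apply the absorption law a OR (a AND b) = a.

    Any conjunction that is a strict superset of another is redundant.
    """
    terms = set(terms)
    return {t for t in terms if not any(o < t for o in terms)}


def p_or(p, q):
    """Provenance of a tuple derivable in either of two ways."""
    return absorb(p | q)


def p_and(p, q):
    """Provenance of a tuple produced by joining two tuples."""
    return absorb({a | b for a, b in product(p, q)})


def delete_base(p, base_id):
    """Set a base tuple to false: every conjunction that uses it disappears."""
    return {t for t in p if base_id not in t}


def derivable(p):
    """A view tuple stays in the view while at least one conjunction remains."""
    return len(p) > 0


# Hypothetical base tuples: e1 = link(a,b), e2 = link(b,c), e3 = link(a,c).
e1, e2, e3 = ({frozenset([name])} for name in ("e1", "e2", "e3"))

# reach(a,c) is derivable directly via e3, or via the path e1 then e2.
reach_ac = p_or(e3, p_and(e1, e2))

# Deleting e3 leaves the alternative derivation; deleting e1 as well removes it.
after_e3 = delete_base(reach_ac, "e3")
after_e1 = delete_base(after_e3, "e1")
print(derivable(after_e3), derivable(after_e1))
```

In this sketch a deletion never requires re-deriving the view: the annotation alone shows whether a tuple is still supported, which is the property the contribution above relies on.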
CHAPTER 8. CONCLUSIONS AND FUTURE DIRECTIONS
which include but are not limited to: the extensions of our declarative query optimization framework to multi-core and distributed architectures, cost modeling of plan switching for adaptive stream processing, the problem of determining when to adapt in adaptive stream processing scenarios, and adaptive stream processing over federated data centers. We discuss each of them below.
Rethinking query optimization over multi-core and distributed architectures. A traditional query optimizer runs on a single machine; however, as more and more data sources are correlated and aggregated, and as multi-query optimization becomes essential, how to re-architect a query optimizer over parallel architectures such as multi-core or distributed environments is particularly important and relevant today. One of the natural extensions of our work on declarative query optimization is to exploit parallelism of computation in query optimization: as we express a query optimizer as a set of rules, we can rely on the distributed execution engine to parallelize the computation. There is a large body of research on parallelizing Datalog queries, and since a query optimizer can be cast as a specific Datalog program, there are numerous opportunities in optimizing the parallelization of this program. The main challenge of parallelizing query optimization lies in the various dependencies of computations within a query optimizer. Recent work
on distributed query optimization [49] adopts a different approach from our declarative work:
it proposes heuristics to allocate the entire search space of candidate plans to different machines. Our declarative query optimizer paradigm will allow us to cast the state
of intermediate computations as data, and exploit data-driven approaches to partition the
work of computation.
Cost modeling of plan switching for adaptive stream processing. One of the biggest
challenges in adaptive query processing is the treatment of plan switching, which was not
a challenge in conventional engines. Unfortunately, few existing works have considered the cost
of switching plans, e.g., the work to be shared across plans, during query re-optimization. Indeed, in many scenarios, the cost of switching is so dominant that the benefit of migrating to a more cost-efficient plan might not offset it. On the other hand, in an online
scenario, one can only predict the future based on past observations, hence one needs
overlooking the cost of plan switching, a query re-optimization solution would become incomplete as it may pick the wrong plans to adapt to. One future direction of this dissertation is to develop a cost model for plan switching and incorporate it into the
entire query re-optimization framework. One can then decide whether to adapt to a plan which
might have better performance for the rest of the data streams but is expensive to migrate to.
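As a rough illustration of the trade-off described above, the sketch below compares the expected savings of a candidate plan over the remainder of the stream against a one-time migration cost. It is not a cost model from this dissertation, which leaves the model as future work. The function should_switch, its parameters, and the linear per-tuple cost assumption are all hypothetical.

```python
def should_switch(current_cost_per_tuple: float,
                  candidate_cost_per_tuple: float,
                  expected_remaining_tuples: float,
                  migration_cost: float) -> bool:
    """Return True if switching plans is expected to pay off.

    Assumptions (hypothetical, for illustration only):
    - each plan has a constant per-tuple processing cost;
    - the number of remaining tuples is an estimate from past observations;
    - migration_cost is a one-time cost in the same units as the plan costs.
    """
    expected_savings = (
        (current_cost_per_tuple - candidate_cost_per_tuple)
        * expected_remaining_tuples
    )
    return expected_savings > migration_cost


# The candidate saves 1.0 cost unit per tuple; migration costs 500 units.
# It pays off for a long remaining stream, but not for a short one.
long_stream = should_switch(2.0, 1.0, 1000, 500)
short_stream = should_switch(2.0, 1.0, 100, 500)
print(long_stream, short_stream)
```

A realistic model would also need to account for work shared across plans and for the uncertainty in the estimate of the remaining stream, both of which this sketch ignores.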
Determining when to adapt in adaptive stream processing. One of the most important decisions to make in adaptive query processing systems for data streams is determining
when to adapt. The granularity of adaptation will play a huge role in determining the
overall performance of query answering over data streams. Open questions lie in deciding the globally optimal points of adaptation in the lifetime of data streams, as well as online mechanisms to determine optimal points of adaptation when only part of the data streams has been seen and no predictions of the future are available.
Adaptive stream processing across federations of data centers. Today, a computing platform may span multiple geographically distributed data centers. This brings previously unseen challenges such as huge costs of data transfer, privacy and security issues, partitioning of plans over the federation, load-balancing and re-partitioning, adaptivity and so forth. As the Map-Reduce paradigm becomes increasingly popular, we could enhance the capabilities of cloud processing by leveraging the ideas from this dissertation, such as incremental processing of data streams, and extend them in nontrivial ways to address the issues of adaptivity, re-partitioning and federated query optimization.