Conclusions and Future Directions - Incremental Processing and Optimization of Update Streams

To overcome the fundamental limitations of existing approaches to data stream manage-

ment systems, our dissertation research addresses two important problems: 1) enabling

incremental maintenance oftransitive closuretype of queries over distributed data streams;

2) enablingincremental query re-optimization, both in a declarative fashion and in a pro-

cedural fashion. The first work addresses thescale and performance issues in supporting

transitive closure queries over data streams. It makes practical the support for linear re-

cursion in (distributed) data stream management systems. The second work focuses on theperformanceissues of frequent query re-optimizations in stream processing.

Our general thesis is that efficient incremental processing and re-optimization of up-

date streams can be achieved by variousincremental view maintenancetechniques if we cast

the problems as incremental view maintenance problems over data streams. In particular, we make the following contributions:

• We formulate the problem of computing recursive tasks over dynamic data streams

as a classical incremental view maintenance problem, which facilitates many generic

incremental view maintenance techniques to address the challenges. We develop

a novel, compactabsorption provenance as an annotation attached to each data item,

which enables us to directly detect when view tuples are no longer derivable and should be removed, where the views are defined over a stream of insertions or dele- tions. We also develop several heuristics to ensure that the absorption provenance annotation, maintained in a Binary Decision Diagram (BDD), remains compact un-

153

der different topologies. We implement all of the above solutions in our Aspenpro-

totype system, and experimentally validate the performance under several different scenarios. Results show that we have orders of magnitude savings compared to prior approaches in various settings.

• We explore whether full-fledged cost-based incremental techniques for query re-

optimization can be developed, where an optimizer would only re-explore query plans whose costs were affected by an updated cardinality or cost value; and whether such incremental techniques could be used to facilitate more efficient adaptivity in dynamic scenarios. We propose a rule-based and declarative approach to query re- optimization. We develop a formulation of query re-optimization as an incremental view maintenance problem, for which we develop novel algorithms like incremental aggregate selection, incremental reference counting and incremental pruning.

We implement our solutions in our prototype system, Aspen, with comprehensive

studies of performance against alternative approaches over a diverse set of work- loads. Results show that we have an order-of-magnitude performance gains versus non-incremental approaches to query re-optimization.

• We show that our approaches to incremental re-optimization from the declarative

perspective can be easily incorporated into traditional procedural-based query opti-

mizers (e.g., bottom-up optimizers with dynamic programming and top-down optimizers with branch-and-bounding), without changing their architectures. We study how to design and implement full-fledged cost-based procedural incremental query re-optimization frameworks and present both analytical and empirical results.

• As a case study of this dissertation, we have also developed a prototype system As-

pen(in approximately80k lines of code), which serves as an end-to-end distributed

adaptive query processing system to address the challenges of our applications at hand. This system not only implements all of the solutions in this dissertation, but also serves as the backend of a real campus building application (SmartCIS) at

Penn [62,63].

154 CHAPTER8. CONCLUSIONS AND FUTURE DIRECTIONS

which include but are limited to: the extensions of our declarative query optimization framework to multi-core and distributed architectures, cost modeling of plan switching for adaptive stream processing, the problem of determining when to adapt in adaptive stream processing scenarios and adaptive stream processing over federated data centers. We illustrate each of them below.

Rethinking query optimization over multi-core and distributed architectures. A traditional query optimizer is performed on a single machine; however, when more and more data sources are correlated and aggregated, and as multi-query optimization becomes essential, how to re-architect a query optimizer over parallel architectures such as multi- core or distributed environment is particularly important and relevant today. One of the natural extensions of our work on declarative query optimization is to exploit parallelism of computation in query optimization: as we express a query optimizer in a set of rules, we can rely on the distributed execution engine to parallelize the computation. There are lots of research on parallelizing Datalog queries and since a query optimizer can be cast as a specific Datalog program, there are numerous opportunities in optimizing the parallelization of this program. The main challenge of parallelizing query optimization lies in the various dependencies of computations within a query optimizer. Recent work

on distributed query optimizer [49] adopts a different approach to our declarative work,

which proposes heuristics to allocate the entire search space of candidate plans to different machines. Our declarative paradigm of query optimizer will allow us to cast the state

of intermediate computations asdata, and exploit data-driven approaches to partition the

work of computation.

Cost modeling of plan switching for adaptive stream processing. One of the biggest

challenges in adaptive query processing is the treatment ofplan switching, which was not

a challenge in conventional engines. Unfortunately, few existing work considered thecost

of switching plans, e.g., the work to be shared across plans, during query re-optimization. Indeed, in many scenarios, the expensive cost of switching is so dominant that migrating to a more cost-efficient plan might not offset the benefit. On the other hand, in an online

scenario, one can only predict the future based on past observations, hence one needs

155

overlooking the cost of plan switching, a query re-optimization solution would become incomplete as it may pick wrong plans to adapt. One line of future directions of this dissertation is to develop a cost model for plan switching and incorporate this into the

entire query re-optimization framework. One can decidewhetherto adapt to a plan which

might have better performance for the rest of data streams but is expensive to migrate to.

Determining when to adapt in adaptive stream processing. One of the most important decisions to make in adaptive query processing systems for data streams is determining

when to adapt. The granularity of adaptation will play a huge role in determining the

overall performance of query answering over data streams. Open questions lie in deciding the globally optimal points of adaptation in the life time of data streams, as well as online mechanisms to determine optimal points of adaptations when only part of the data streams have been seen and no predications of the future is available.

Adaptive stream processing across federations of data centers. Today in reality a computing platform may sit across multiple geographically distributed data centers. This brings unseen challenges such as huge costs of data transfer, privacy and security issues, partitioning of plans over the federation, load-balancing and re-partitioning, adaptivity and so forth. As the Map-Reduce paradigm becomes increasingly popular these days, we could enhance the capabilities of cloud processing by leveraging the ideas from this dissertation such as incremental processing of data streams, and extend them in nontrivial ways to address the issues of adaptivity, re-partitioning and federated query optimization, etc.

In document Incremental Processing and Optimization of Update Streams (Page 167-171)