Cost Model - Cost-Based Optimization - Efficient Incremental Data Analysis

3.6 Cost-Based Optimization

3.6.4 Cost Model

We now discuss the cost model for making materialization decisions. The model relies on standard cardinality estimation techniques [61, 146] for a first-pass approximation of execution costs, together with estimates of the relative frequencies of update events and query result requests. Like the viewlet transform itself, the cost model is recursive: the cost of a materialization decision depends on the cost of maintaining each materialized view under it. Consequently, for a materialization decision $\langle Q', \vec{M}_Q \rangle$, we distinguish between the execution cost of $Q'$ and the maintenance cost of each materialized view $M_Q$.

Update Rates

The trigger program generated by a viewlet transform is event-based. When evaluating the cost of a viewlet transform, we must consider the relative rates at which these events occur.

Concretely, we consider: (1) the rate $\rho_{\pm R}$ at which each relation update event $\pm R$ occurs, and (2) the rate $\rho_Q$ at which the materialized query results are requested. Note that we are only concerned with the relative rates at which these events occur. Any unit of rate may be chosen for these values, as long as it is used consistently.

For some application domains, these two rates may be interrelated. For example, if the original query encodes a constraint being enforced over the underlying data, then the constraint will need to be validated on every update, and $\rho_Q = \sum_{\pm R} \rho_{\pm R}$. Conversely, if the view is being maintained to reduce lookup latencies, then $\rho_Q$ is defined by the application. In the absence of user-provided statistics, we assume a uniform distribution of update rates (i.e., $\rho_{\pm R} = 1$).
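The two rate conventions above can be sketched in a few lines. This is only an illustration: the function names and the `(sign, relation)` encoding of update events are assumptions made for this example, not part of the system described here.

```python
def default_update_rates(relations):
    """Without user-provided statistics, give every update event +R / -R the rate 1."""
    return {(sign, rel): 1.0 for rel in relations for sign in ("+", "-")}


def constraint_query_rate(update_rates):
    """Constraint checking: the result is requested on every update,
    so rho_Q is the sum of all update rates rho_{+-R}."""
    return sum(update_rates.values())


# Two base relations, hence four update events (+R, -R, +T, -T).
rates = default_update_rates(["R", "T"])
rho_q = constraint_query_rate(rates)
print(rho_q)  # 4.0
```

When the view is maintained to reduce lookup latency instead, `rho_q` would simply be supplied by the application rather than derived from the update rates.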

Evaluation Cost

Through benchmarking of our experimental workload, we have found that the dominant cost of query evaluation comes from materializing results, both intermediate and final. Consequently, we express the evaluation cost in units of number of tuples materialized.

Although our implementation materializes intermediate results only for the Sum[ ] operator, the cost estimation strategy we will describe can be easily generalized to more complex query evaluation strategies. The cost of materializing a Sum[ ] is proportional to the cardinality of the query being aggregated. Similarly, the cost of materializing the result of an update query is proportional to the cardinality of its output.

We use standard estimation techniques [61, 146] to estimate the cardinality |Q| of query Q. The full execution cost of the query Q is its cardinality, plus the cost of materializing each intermediate result.

$$\mathrm{cost}_{\mathrm{exec}}(Q) = |Q| + \sum_{\mathrm{Sum}[\,](Q_i) \in Q} |Q_i|$$

Example 3.6.1 Consider the query computing a single aggregate over a join between $R$ and $T$.

$$Q := S(A) * \mathrm{Sum}[A](R(A,B) * T(B,C))$$

The cost of materializing the intermediate result for this aggregate is $|R(A,B) * T(B,C)|$. The cost of materializing the final result is $|Q|$, and so: $\mathrm{cost}_{\mathrm{exec}}(Q) = |Q| + |R(A,B) * T(B,C)|$. □
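As a rough sketch of how this execution cost could be computed, the snippet below walks a small expression tree and adds the cardinality of every subquery aggregated by a Sum[ ] to the cardinality of the whole query. The node classes, the `card` field, and all cardinality numbers are assumptions made for illustration; in practice the estimates would come from the cardinality estimation techniques cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:
    name: str
    card: int  # assumed cardinality estimate |R|


@dataclass(frozen=True)
class Join:
    children: tuple
    card: int  # assumed cardinality estimate of the join result


@dataclass(frozen=True)
class Sum:
    child: object
    card: int  # assumed cardinality of the aggregate's output


def sum_inputs(expr):
    """Yield the aggregated subquery Q_i of every Sum[](Q_i) occurring in expr."""
    if isinstance(expr, Sum):
        yield expr.child
        yield from sum_inputs(expr.child)
    elif isinstance(expr, Join):
        for child in expr.children:
            yield from sum_inputs(child)


def cost_exec(q):
    """cost_exec(Q) = |Q| + sum of |Q_i| over all Sum[](Q_i) in Q."""
    return q.card + sum(qi.card for qi in sum_inputs(q))


# Q := S(A) * Sum[A](R(A,B) * T(B,C)), with made-up cardinalities.
r_join_t = Join((Rel("R", 1000), Rel("T", 500)), card=2000)
q = Join((Rel("S", 100), Sum(r_join_t, card=100)), card=100)
print(cost_exec(q))  # 2100 = |Q| + |R * T|
```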

We target application domains where the size of the base relations is relatively stable over the course of execution (e.g., as in our financial workload) or application domains where delta query evaluation cost is independent of the size of the base relations (as in many TPC-H queries). In either case, we can obtain a steady-state relation size (which is appropriate for the first case, and ignored by the latter) for cardinality estimation.

Maintenance Cost

We are now ready to discuss the cost of maintaining materialized views. Unlike query execution, maintenance is an ongoing process. Consequently, maintenance costs are expressed in terms of costs and the rates at which costs are incurred.

Maintaining a view defined by query $Q$ requires repeatedly evaluating the view's delta queries when the corresponding update events occur, and the delta query $\Delta_{\pm R} Q$ will be evaluated at a rate of $\rho_{\pm R}$. Naïvely, the total maintenance cost of this view could be expressed in terms of the rate-weighted cost of each delta query:

$$\sum_{\pm R} \rho_{\pm R} \cdot \mathrm{cost}_{\mathrm{exec}}(\Delta_{\pm R} Q)$$

However, this expression does not account for the cost of maintaining views created to support


For a query $Q$, we consider each trigger $\pm R$ generated by the viewlet transform of $Q$. The total execution cost for the trigger is the sum of the execution costs of its statements $M_{\pm R,i} \mathrel{+}= Q_{\pm R,i}$. The total maintenance cost of $Q$ is the rate-weighted cost of each trigger:

$$\sum_{\pm R} \left( \rho_{\pm R} \cdot \sum_i \mathrm{cost}_{\mathrm{exec}}(Q_{\pm R,i}) \right)$$

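Finally, we consider the rate at which query $Q$ is requested. If the value of $Q$ is requested at rate $\rho_Q$, then the full cost of maintaining $Q$ using materialization decision $\langle Q', \vec{M} \rangle$ is

$$\left( \rho_Q \cdot \mathrm{cost}_{\mathrm{exec}}(Q) \right) + \sum_{\pm R} \left( \rho_{\pm R} \cdot \sum_i \mathrm{cost}_{\mathrm{exec}}(Q_{\pm R,i}) \right)$$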
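The full cost expression is straightforward to evaluate once the per-statement execution costs are known. The sketch below is a minimal illustration under stated assumptions: the function names, the string keys naming the triggers, and every number are hypothetical, and the per-statement costs are taken as given rather than estimated.

```python
def maintenance_cost(update_rates, statement_costs):
    """Rate-weighted trigger cost: sum over +-R of rho_{+-R} * sum_i cost_exec(Q_{+-R,i})."""
    return sum(
        rate * sum(statement_costs[event])
        for event, rate in update_rates.items()
    )


def full_cost(rho_q, cost_exec_q, update_rates, statement_costs):
    """Query-request cost plus rate-weighted maintenance cost."""
    return rho_q * cost_exec_q + maintenance_cost(update_rates, statement_costs)


# Hypothetical rates and per-statement execution costs for three triggers.
update_rates = {"+R": 1.0, "-R": 1.0, "+T": 2.0}
statement_costs = {"+R": [10, 5], "-R": [10, 5], "+T": [20]}

maintenance = maintenance_cost(update_rates, statement_costs)
total = full_cost(0.5, 100, update_rates, statement_costs)
print(maintenance)  # 70.0 = 1*15 + 1*15 + 2*20
print(total)        # 120.0 = 0.5*100 + 70
```

Comparing `full_cost` across candidate materialization decisions for the same query is one way such a model could be used to pick the cheapest candidate.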

3.7 Summary

The naïve viewlet transform described in Chapter 2 is too aggressive in some cases, such as inequality joins and certain nesting patterns, for which parts of queries are better re-evaluated than incrementally maintained. We develop heuristic rules for trading off between materialization and lazy evaluation for the best performance. We present the domain extraction procedure for maintaining queries with equality-correlated nested aggregates, which no other commercial database currently supports. We also discuss the trade-offs between re-evaluation and incremental computation, and we identify the cases when batching can significantly reduce maintenance costs compared to single-tuple execution. Finally, we present a cost-based model for making materialization decisions.

4 The DBToaster System

The DBToaster system implements Higher-Order Incremental View Maintenance.

DBToaster is a compiler for database queries. It transforms long-lived (continuous) queries into high-performance stream processing engines that keep query results (views) fresh at very high update rates. The compiler relies on Higher-Order IVM and code generation to produce efficient update triggers for maintaining materialized views in main memory.

DBToaster targets applications that require real-time, low-latency processing over changing datasets. It combines the advantages of data stream processing engines and relational database systems. Like stream processors, it supports continuous queries and keeps their results always fresh but without restrictive window semantics; like relational databases, it provides rich support for analytical queries using the SQL language but offers much lower view refresh latencies. DBToaster enables developers to build real-time engines directly from their declarative specifications, which can significantly improve their productivity and eliminate the need for building ad hoc solutions for performance reasons.

4.1 System Overview and Application Usage

The DBToaster compiler consists of two parts. The frontend converts SQL queries into AGCA expressions and applies Higher-Order IVM to produce an intermediate trigger representation describing how to maintain the query results. The backend optimizes these triggers and transforms them into source code. Currently, DBToaster can generate imperative (C++) and functional (Scala) code for local execution and data-parallel code (Spark) for distributed execution. The compiler also comes with a runtime library that provides data loading and instrumentation of generated query engines.

The DBToaster compiler produces query processors that are aggressively specialized to a specific query workload, rather than ad hoc queries. Created engines are stand-alone and require no other system (e.g., a database system) to run. End-users interact with a DBToaster-generated query processor in one of three ways:

1. Standalone binaries, where users may run the binary on a file or specify a listening socket through which data can be sent to the engine. The engine can output a stream of view query results to a file or a network connection.

2. Shared libraries, where application developers may link against our library and directly access snapshots of the in-memory data structures representing our views while they are concurrently maintained. We are exploring asynchronous notification methods to support push-based application logic, including callback functions and futures registered with our views.

3. Source code, where application developers may adapt and extend our query processing engine as desired, for example to use custom data structures to implement views.

DBToaster produces extensible query engines capable of custom stream pre-processing and workload generation, as well as on-demand querying of views. Our object-oriented design enables users to inherit our engine in their applications, where they may override a pre-processing method invoked on each arriving event. This allows users to perform basic data extraction, transformation, cleaning, and logging functionality prior to delta processing. DBToaster can generate query engines for running in both local and distributed settings.
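The inheritance pattern described above can be illustrated with a small stand-in engine. This is a sketch only: the generated engines are C++ or Scala, and the class names, the `preprocess` hook, and the `(sign, key, value)` event layout below are hypothetical and do not reflect the actual generated API.

```python
class SumByKeyEngine:
    """Hypothetical stand-in for a generated engine maintaining SUM(value) per key."""

    def __init__(self):
        self.view = {}

    def preprocess(self, event):
        """Hook invoked on each arriving event; return None to drop the event."""
        return event

    def on_event(self, event):
        event = self.preprocess(event)
        if event is None:
            return
        sign, key, value = event
        delta = value if sign == "+" else -value  # inserts add, deletes subtract
        self.view[key] = self.view.get(key, 0) + delta

    def snapshot(self):
        """On-demand query: return a copy of the current view contents."""
        return dict(self.view)


class CleaningEngine(SumByKeyEngine):
    """User subclass that logs, cleans, and filters events before delta processing."""

    def __init__(self):
        super().__init__()
        self.log = []

    def preprocess(self, event):
        self.log.append(event)                   # logging
        sign, key, value = event
        if value is None:                        # cleaning: drop incomplete events
            return None
        return (sign, key.strip().upper(), value)  # transformation: normalize keys


engine = CleaningEngine()
events = [("+", " ibm ", 10), ("+", "IBM", 5), ("+", "msft", None), ("-", "ibm", 3)]
for ev in events:
    engine.on_event(ev)
print(engine.snapshot())  # {'IBM': 12}
```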

Generated local code implements a single-core, single-threaded query executor. Our implementation strategy has focused on novel view maintenance techniques rather than the full range of state-of-the-art query execution mechanisms. Thus, our results represent a lower limit on performance and scalability, both of which one could substantially improve with multithreaded and vectorized execution that utilizes more flexible view data structures as inspired by database cracking [83]. Chapter 5 presents DBToaster's distributed main-memory engine that exploits aggregate CPU and memory resources available in large-scale clusters to achieve low-latency incremental view maintenance.
