2.3 Data Access Performance of Different Database Management Systems
3.1.2 Query Cost Estimation
Estimating the costs of a query is necessary for query as well as layout optimization. It is generally desirable to estimate the costs in a metric that has a total order (e.g., a simple integer value). This allows to compare two values in the metric and determine which one is “better”. To calculate this value, the model may use any number of intermediate metrics. As explained in Section 2.1, e.g., disk based DBMS performance is often measured in the number of induced seeks and the number of read bytes. The costs are derived from that. Simple models like the one used in [43, pgs. 441ff.] calculate the costs from one metric, e.g., the page I/O-operations.
To find the optimal query plan, a query optimizer needs to compare different plans for a single query. Consequently, the costs have to be derived from the operator tree. A layout generator, since it is only interested in the relative costs of queries to each other, may use a cost model that estimates the costs directly from the query. We will discuss both approaches in the following.
Estimating Costs directly from the Query
Data Morphing [17] is an approach to the layouting problem that estimates the costs directly from the query. It relies on an input that specifies the percentage of the values that are accessed of every attribute. From that, the number of induced cache misses per tuple is estimated using a simple formula. This is done under a number of assumptions:
the values of all attributes are accessed in a uniform and random fashion,
a read cache line is not removed from the processor cache before all the values in that cache line have been processed and
all operations are execute on an empty cache
While these assumptions may hold for simple queries that can be evaluated in a single scan of attributes, it fails for complex queries that involve intermediate results or repetitive accesses to values (joins, group- bys, late materializing operators, ...).
3.1. QUERY COST ESTIMATION
Estimating Costs from the Operator Tree
To take intermediate results and repetitive accesses to values into account, a more accurate model of the evaluation of the query by the DBMS is needed. Since the relational operator tree is such a model, it can be used for a more accurate cost estimation. The estimation of query costs from the operator tree is similar to the evaluation of the tree: The costs of each executed operator are estimated (bottom up) and the overall costs of the query derived from that (usually by simply summing the costs of the operators [43]).
Estimating Operator Input and Output Size To estimate the costs of an operator it is necessary to know the number of tuples it has to process. Most operator-based cost models assume “a perfect oracle” [2] to estimate the number of input and output tuples.
This estimation is not trivial because it is influenced not only by the number of tuples in each relation, but also by the selectivity of the predicates that are used in the query. Substantial research exists on the estimation of predicate selectivity [50, 51, 17]. It is largely based on histograms, that represent the distribution of the values of an attribute. In this thesis we will assume that the values are distributed randomly and equally. This eliminates the need for histograms and reduces the needed statistical information to the number of unique values (cardinality) of each attribute. Incorporating more sophisticated selectivity estimation should be straight forward.
Disk-Based Cost Models As shown in Section 2.1, disk based data access costs do not in principle differ from main memory data access costs. It is, therefore, reasonable to investigate into disk based cost models as well. Simple models [43, pages 439ff] are solely based on the number of disk operations, i.e., the number of accessed blocks. They do not differentiate random and sequential access. Some (very simple) models [52, 53] only consider the number of accessed items without considering that two accesses might happen on the same block. Some models [27] only consider random misses (seeks), since they are much more costly than sequential misses (see Section 2.1). All of them are, too simple to be applied to our case. An appropriate model for disk-based data access would be based on latency and bandwidth [2]. Models for main memory access cost are best based on random and sequential misses [2].
Main-Memory Cost Models Listgarten and Neimat [54] differentiate Main-Memory Cost models into three categories: application-based, engine based and hardware-based.
Application based cost models estimate costs based on the limiting factor of each executed operator. Rules how to find the limiting factor are usually defined manually. E.g., join-performance may be limited by the memory access speed if the relations are large in relation to the available cache or limited by the processing speed if the joined relations fit into the cache. This makes application based cost models very unattractive because they are specific to the hardware and the implementation of the DBMS. Such a model is, e.g., used in [55].
Engine-based cost models are based around executed operations. Operations in this context are not hardware operations, like requesting an address or adding two values, but operations of the execution engine, like the comparison of two tuples or the output of a tuple. Such a model is introduced by Listgarten and Neimat in the same work [54]. Engine-based models are more generic than application- based models but still do not take parameters of the hardware into account.
The last category are hardware-based cost models. Execution costs are measured in the number and type of hardware operations. Since database performance is mainly determined by data access costs it is reasonable to only take data access operations into account. Such models are widely used [56, 2, 17] because they provide very generic models and good prediction performance. The generic cost model is the most advanced of which, since it allows the estimation of the different kinds of costs. It was therefore used as the model for out data layouting algorithms. One may note that hardware-based cost models can be considered engine based models as well. The difference is merely the degree of abstraction.
CHAPTER 3. DATA LAYOUTING BASED ON ESTIMATED QUERY COSTS