2.6 Runtime Modeling: Background Concepts
2.6.3 Building a Cost Model
There are multiple ways of building a cost model: explicitly through analytical modeling, implicitly through machine learning when training data is available, or as a hybrid of the two. Analytical models are built by experts based on domain knowledge about the query processing model, and use explicit formulas over key input features to compute the output feature (runtime). Instead of using explicit formulas, learning-based models build implicit models through training (i.e., model fitting). In the following we describe the model fitting mechanisms that we later use for prediction.
Depending on how much information is available regarding the functional dependency between the key input features and the output feature (i.e., runtime), we can categorize model fitting algorithms into: i) algorithms with a fixed functional form, where the canonical form of the modeled cost function is known a priori and the only unknowns are the coefficients of the function, which are learned from the training data, and ii) algorithms with an unknown functional form, where the function corresponding to the modeled output is unknown. In the latter case, algorithms that quantify similarity among input features based on a distance metric (e.g., nearest neighbors, kernel methods, support vector machines) or that segment the input feature space (e.g., decision trees) are used instead.
Any given fitting algorithm has an objective function (also known as the loss function) that drives the process of optimizing the model using training samples. One of the most common objectives for regression models, which predict continuous values, is to minimize the mean squared error between the actual and predicted values on the samples from the training set. We further detail the objective function when describing each model fitting mechanism in particular.
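For reference, the mean squared error objective mentioned above can be written as follows. The symbols are assumptions introduced here for illustration: n is the number of training samples, y_i is the actual runtime of sample i, and \hat{y}_i is the runtime predicted by the model for that sample.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```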
Model Fitting for Fixed Functional Forms
Figure 2.3 – CART decision tree with four input features F1–F4, four conditionals, and five possible predicted values C1–C5.
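In multi-variate linear regression, given k input features X_1, ..., X_k and an output feature Y (i.e., processing phase runtime), the model has the functional form:
$f(X_1, \ldots, X_k) = c_1 X_1 + c_2 X_2 + \ldots + c_k X_k + r$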
where $c_i$ are the coefficients and r is the residual value (the constant term). The model fitting algorithm for multi-variate regression seeks the coefficients and the residual value that minimize the mean squared error between the estimated runtime and the actual runtime for the queries in the training set. The coefficients of the model can thus be interpreted as the "cost values" corresponding to each input feature.
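As an illustration of fitting such a fixed functional form, the following is a minimal sketch that solves the least-squares problem through the normal equations in plain Python. It is one possible way to obtain the coefficients, not necessarily the procedure used in this work; the function names and the toy training set (two input features with made-up runtimes) are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], so row operations update both sides at once.
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for j in range(col, n + 1):
                M[row][j] -= factor * M[col][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_linear_cost_model(samples, runtimes):
    """Return (coefficients, r) minimizing the squared error on the samples."""
    # Append a constant 1 to each sample so r is fitted like any coefficient.
    rows = [list(s) + [1.0] for s in samples]
    k = len(rows[0])
    # Normal equations: (A^T A) w = A^T y.
    ata = [[sum(row[i] * row[j] for row in rows) for j in range(k)]
           for i in range(k)]
    aty = [sum(row[i] * y for row, y in zip(rows, runtimes)) for i in range(k)]
    w = solve_linear_system(ata, aty)
    return w[:-1], w[-1]


def predict_runtime(coefficients, r, features):
    """Evaluate f(X1, ..., Xk) = c1*X1 + ... + ck*Xk + r."""
    return sum(c * x for c, x in zip(coefficients, features)) + r


# Hypothetical training set generated from runtime = 2*X1 + 5*X2 + 7.
X_train = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y_train = [2 * a + 5 * b + 7 for a, b in X_train]
coefficients, r = fit_linear_cost_model(X_train, y_train)
```

Because the toy runtimes follow the linear form exactly, the fitted coefficients recover the per-feature "cost values" used to generate them. On real measurements the fit would only approximate the observed runtimes.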
Model Fitting for Unknown Functional Forms
Decision trees are a good modeling approach when the underlying dependency among the input features and the output feature is not known in advance or when the dependency does not follow a fixed functional form. Decision trees are thus general and applicable to a large class of prediction problems. A large number of fitting algorithms based on decision trees exist [35].
Classification and Regression Trees (CART): CART models are well known
among decision tree algorithms due to their generality, practicality, and expressivity. CART models grow a decision tree by classifying the samples from the training set into multiple zones based on a recursive binary tree growing procedure. Initially, all the training samples are located in one single node. In the next step, the split (i.e., the input feature and the
Figure 2.4 – MART decision trees with two boosting iterations (i.e., two trees) and four input features F1–F4. The predicted value is a summation over the predicted values of each tree.
threshold value) that best separates the training samples into two subsets is searched for. A good separation is achieved when there is a small discrepancy between the output feature value assigned to a tree node (i.e., the average output feature value of all samples within that node) and the actual output feature values of the samples within that node. More concretely, the split that reduces the average squared error the most is chosen. The process continues recursively on each resulting subset until no split yields a significant error reduction or until the minimum number of samples within a leaf node has been reached (no more splits are allowed). Figure 2.3 shows a CART tree with four input features, four conditionals (the intermediate tree nodes), and five possible predicted values (the leaf nodes).
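The recursive growing procedure described above can be sketched as follows. This is a simplified illustration rather than a full CART implementation (it omits pruning, for example). The function names, the dictionary-based tree layout, and the `min_leaf` and `min_gain` stopping parameters are assumptions made for this sketch.

```python
def squared_error(values):
    """Sum of squared deviations from the node's mean prediction."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(samples, targets, min_leaf=1):
    """Return (feature_index, threshold, error_after_split), or None."""
    best = None
    for f in range(len(samples[0])):
        values = sorted(set(s[f] for s in samples))
        # Candidate thresholds lie halfway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for s, t in zip(samples, targets) if s[f] <= threshold]
            right = [t for s, t in zip(samples, targets) if s[f] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            # Minimizing the summed squared error of both children is
            # equivalent to minimizing the average squared error of the node.
            error = squared_error(left) + squared_error(right)
            if best is None or error < best[2]:
                best = (f, threshold, error)
    return best


def grow_tree(samples, targets, min_leaf=1, min_gain=1e-9):
    """Recursively split until no split gives a significant error reduction."""
    split = best_split(samples, targets, min_leaf)
    if split is None or squared_error(targets) - split[2] <= min_gain:
        # Leaf node: predict the average output value of its samples.
        return {"value": sum(targets) / len(targets)}
    f, threshold, _ = split
    left = [(s, t) for s, t in zip(samples, targets) if s[f] <= threshold]
    right = [(s, t) for s, t in zip(samples, targets) if s[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": grow_tree([s for s, _ in left], [t for _, t in left],
                          min_leaf, min_gain),
        "right": grow_tree([s for s, _ in right], [t for _, t in right],
                           min_leaf, min_gain),
    }


def predict(tree, features):
    """Follow the conditionals from the root down to a leaf."""
    while "value" not in tree:
        go_left = features[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["value"]


# Hypothetical training set: two input features, one step in the runtime.
samples = [[1, 10], [2, 10], [3, 10], [10, 10], [11, 10], [12, 10]]
targets = [5, 5, 5, 20, 20, 20]
tree = grow_tree(samples, targets)
```

In this toy data set the second feature is constant, so it offers no candidate threshold and the only useful conditional is on the first feature.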
Multiple Additive Regression Trees (MART): In contrast with CART, MART iteratively builds a sequence of regression trees instead of a single regression tree per model. The advantage is that each subsequent tree in the sequence is built to compensate for the residual errors that the trees built so far leave on the training data; hence, prediction errors can be further reduced. Figure 2.4 shows a MART model with two trees. MART models were shown to have very good properties in the context of runtime and resource prediction [50].
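The idea of fitting each new tree to the residual errors of the preceding ones can be sketched as below. This is a deliberately reduced illustration under squared-error loss, using one-split trees (stumps) as the regression trees. The function names, the initial prediction (the mean runtime), and the `learning_rate` parameter are assumptions of this sketch and are not taken from the text.

```python
def fit_stump(samples, residuals):
    """Fit a one-split regression tree (a stump) to the given residuals."""
    best = None
    for f in range(len(samples[0])):
        values = sorted(set(s[f] for s in samples))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r for s, r in zip(samples, residuals) if s[f] <= threshold]
            right = [r for s, r in zip(samples, residuals) if s[f] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, f, threshold, left_mean, right_mean)
    if best is None:
        # No feature has two distinct values: fall back to a constant stump.
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, features):
    f, threshold, left_value, right_value = stump
    return left_value if features[f] <= threshold else right_value


def fit_boosted_trees(samples, targets, iterations=2, learning_rate=1.0):
    """Build a sequence of stumps, each fitted to the current residuals."""
    base = sum(targets) / len(targets)  # initial prediction: mean runtime
    predictions = [base] * len(targets)
    stumps = []
    for _ in range(iterations):
        residuals = [t - p for t, p in zip(targets, predictions)]
        stump = fit_stump(samples, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, s)
                       for p, s in zip(predictions, samples)]
    return base, stumps, learning_rate


def boosted_predict(model, features):
    """Sum the initial prediction and the contribution of every tree."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, features)
                                      for s in stumps)


# Hypothetical training set with a single input feature.
samples = [[1], [2], [3], [4]]
targets = [1.0, 2.0, 6.0, 7.0]
model = fit_boosted_trees(samples, targets, iterations=2)
```

With two iterations the model contains two trees, and the prediction is the sum of their outputs plus the initial mean. On this toy data each additional iteration lowers the training error.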
Hybrid Decision Trees: Hybrid decision trees combine the ability of decision trees to segment the input feature space into multiple zones with the generality of fixed functional forms within a leaf node. That is, instead of predicting the average output feature value of all samples within a leaf node, a fixed functional form is fitted to those samples. Thus, different output feature values can be predicted from the same leaf node. This combination makes hybrid decision trees powerful and more advantageous to use than both simple tree models (e.g., CART) and fixed functional form models (e.g., multi-variate linear regression). M5 Tree [65] is a model fitting algorithm that implements such a hybrid decision tree model.
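To make the hybrid idea concrete, the sketch below fits a tree with a single split and a separate linear model in each of the two leaves. It is not the M5 algorithm; it only illustrates how a leaf can hold a fitted function instead of an average. For brevity it assumes a single input feature, and the function names and the `min_leaf` parameter are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All x values are equal: fall back to predicting the mean.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def line_error(xs, ys):
    """Sum of squared errors of the best-fitting line."""
    slope, intercept = fit_line(xs, ys)
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def fit_hybrid_stump(xs, ys, min_leaf=2):
    """One split on x, with a separate linear model fitted in each leaf."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal feature values
        left, right = pairs[:i], pairs[i:]
        error = line_error(*zip(*left)) + line_error(*zip(*right))
        if best is None or error < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (error, threshold,
                    fit_line(*zip(*left)), fit_line(*zip(*right)))
    if best is None:
        # Too few samples to split: a single linear model for all inputs.
        line = fit_line(xs, ys)
        return None, line, line
    return best[1:]


def hybrid_predict(model, x):
    """Pick the leaf by the threshold, then evaluate that leaf's line."""
    threshold, left_line, right_line = model
    if threshold is None or x <= threshold:
        slope, intercept = left_line
    else:
        slope, intercept = right_line
    return slope * x + intercept


# Hypothetical piecewise-linear runtimes: y = 2x up to x = 4, then 10x - 30.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 20, 30, 40, 50]
model = fit_hybrid_stump(xs, ys)
```

Unlike a plain regression tree, two inputs that fall into the same leaf (for example x = 1 and x = 3) receive different predictions, because the leaf evaluates its fitted line rather than returning a single average.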