The prediction techniques described in this thesis advance the state of the art in three ways: i) by providing prediction mechanisms for a class of iterative analytics that had not been empirically addressed before and that is widely used in today's analytical workflows; ii) by providing hybrid prediction models for different categories of data analytics and analyzing the trade-offs at varying levels of model granularity; iii) by providing mechanisms that reduce the training cost while maintaining a competitive level of accuracy for the models.
PREDIcT improves on the accuracy of analytical upper bounds for estimating the number of iterations of PageRank, reducing the relative error from [104, 168]% to [0, 11]%. Overall, the runtime estimates have an error of [10, 30]% for all scale-free graphs analyzed. Our training methodology for reporting analytics reduces the time for running training queries from days to hours, while the 95-percentile average relative error is less than 25% on our testing benchmarks. We also show the utility of predictions in an end-to-end use case in Section 5.6.4. There we show that the predictions TITAN provides can be used to successfully answer resource allocation questions, i.e., to identify the resource allocation(s) that can satisfy a target performance goal.
6.2.1 Generality of Techniques to Similar Problems
Resource Allocation for Iterative Processing: In this thesis we answer resource allocation questions in the context of reporting analytics. The prediction techniques we develop for iterative processing (i.e., the sampling technique, the transform function, and the hybrid cost modeling approach) can be used as building blocks to answer similar resource allocation questions for iterative analytics. We note that, for the BSP execution model, the number of iterations is independent of the resource allocation configuration. Hence, the sample-based approach we develop to estimate the number of iterations can be re-used as is. However, the worker on the critical path changes with the number of workers available and the partitioning strategy used. Estimating per-task key features from the observations of the sample run will therefore require extensions that explicitly model the critical path of the actual run. Of particular importance is to build mechanisms that model the critical path of each worker of the actual run as a function of the data statistics collected during the sample run, a set of basic statistics about the input dataset, and the partitioning strategy.
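To make the dependence on the resource allocation concrete, the following is a minimal sketch and not the model developed in this thesis. It assumes that the cost of a BSP superstep is determined by the most loaded worker, that a worker's load is proportional to the total out-degree of the vertices assigned to it, and that vertices are assigned by a simple modulo partitioner. The names `worker_loads`, `critical_path_runtime`, and `cost_per_edge` are hypothetical.

```python
from typing import Callable, Dict, List


def worker_loads(out_degree: Dict[int, int], num_workers: int,
                 partition: Callable[[int, int], int]) -> List[int]:
    """Sum the out-degrees of the vertices assigned to each worker."""
    loads = [0] * num_workers
    for vertex, degree in out_degree.items():
        loads[partition(vertex, num_workers)] += degree
    return loads


def critical_path_runtime(out_degree: Dict[int, int], num_workers: int,
                          iterations: int, cost_per_edge: float,
                          partition: Callable[[int, int], int] = lambda v, k: v % k) -> float:
    """Estimate runtime as iterations x load of the slowest worker (assumed model)."""
    # The iteration count is passed in unchanged: it does not depend on
    # the number of workers, only the per-superstep critical path does.
    slowest = max(worker_loads(out_degree, num_workers, partition))
    return iterations * slowest * cost_per_edge


# Hypothetical toy graph: vertex id -> out-degree.
toy_graph = {0: 4, 1: 1, 2: 1, 3: 2}
runtime_2_workers = critical_path_runtime(toy_graph, 2, iterations=10, cost_per_edge=0.5)
runtime_4_workers = critical_path_runtime(toy_graph, 4, iterations=10, cost_per_edge=0.5)
```

In this toy setting the iteration count stays fixed at 10 for both configurations, while the load of the critical-path worker changes with the number of workers and the partitioner. That load is the quantity an extension would have to model.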
Incorporating Predictions into Online Estimation Techniques and Vice-versa: The prediction techniques we propose in this thesis can be used in conjunction with online estimation techniques, such as those proposed in the context of query progress estimators [16, 53, 57] and dynamic query re-optimization [69, 24], to improve the accuracy of estimates at runtime as more information becomes available. Conventionally, runtime predictors target resource allocation, where runtime estimates are required before the query starts execution, whereas online estimation techniques target dynamic re-optimization and query progress monitoring. For the near future, we envision hybrid estimation approaches that combine the advantages of predictors (i.e., runtime estimates before query execution) with those of online estimation techniques (i.e., accuracy refinement by exploiting runtime data).
Concretely, incorporating predictions into online runtime estimators and progress estimators is beneficial because it enables them to exploit the initial estimates for inactive query pipelines, for which runtime data is not yet available. In addition, certain processing characteristics cannot be estimated at runtime. For instance, the number of iterations of an iterative analytics task cannot be estimated at runtime before the task completes its execution. The prediction techniques we develop in this context, such as the sampling technique and the transform function, can be used to provide initial runtime estimates. Incorporating runtime data into runtime predictors can, in turn, help with problematic prediction models that have high errors (i.e., model over-fitting or model under-fitting), and with MapReduce jobs for which the input data statistics collected during a prior reference run are insufficient or unavailable. In such cases, deliberately bringing in more training data at runtime can improve the prediction models. In addition, input data statistics collected for the currently executing MapReduce job (e.g., task selectivities) can be used to refine the existing data statistics corresponding to that job. Of particular importance is the adaptive mappers approach proposed in the context of adaptive MapReduce [77]. While adaptive mappers were mainly used to balance the workload of a MapReduce job among concurrent tasks, the adaptive sampling component could also be used to build approximate histograms at runtime. Thus, the tasks' selectivity information could be refined at runtime and used as input to the prediction models.
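One possible shape of such a hybrid estimator is sketched below. It is an illustration under stated assumptions, not a technique evaluated in this thesis. Each pipeline carries an a-priori prediction; inactive pipelines fall back to that prediction, finished pipelines use their observed runtime, and active pipelines blend the prediction with a linear extrapolation of the observed progress. The `Pipeline` class, the linear extrapolation, and the progress-weighted blend are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Pipeline:
    predicted: float       # a-priori runtime prediction (seconds)
    elapsed: float = 0.0   # runtime observed so far (seconds)
    progress: float = 0.0  # fraction of work completed, in [0, 1]


def estimate_pipeline(p: Pipeline) -> float:
    """Return the estimated total runtime of one pipeline."""
    if p.progress <= 0.0:
        return p.predicted            # inactive: only the prediction exists
    if p.progress >= 1.0:
        return p.elapsed              # finished: the observation is exact
    online = p.elapsed / p.progress   # linear extrapolation (assumption)
    # Trust the online estimate more as the pipeline progresses.
    return (1.0 - p.progress) * p.predicted + p.progress * online


def estimate_query(pipelines: Iterable[Pipeline]) -> float:
    """Sum per-pipeline estimates, assuming pipelines run one after another."""
    return sum(estimate_pipeline(p) for p in pipelines)


# Hypothetical query: one finished, one active, and one inactive pipeline.
query = [Pipeline(predicted=50.0, elapsed=60.0, progress=1.0),
         Pipeline(predicted=100.0, elapsed=30.0, progress=0.25),
         Pipeline(predicted=40.0)]
total_estimate = estimate_query(query)
```

For the active pipeline the extrapolation gives 30 / 0.25 = 120 seconds, and the blend gives 0.75 × 100 + 0.25 × 120 = 105 seconds, so the query estimate is 60 + 105 + 40 = 205 seconds.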
Estimating Other Performance Features: While the focus of this thesis is to estimate the runtime of a class of analytical workloads for a pre-specified execution setting, similar models can be built to predict other performance metrics, such as memory utilization and CPU utilization averaged over the duration of the workload execution. Such metrics can subsequently be used to identify the optimal allocation of resources (in terms of memory buffer sizes and task slots), one that not only executes the workload within the given deadline but also achieves a high level of utilization of the allocated resources. While the full set of input features for such estimation problems will include additional features specific to the performance metric being modeled, the modeling approach at operator-phase granularity and the training methodology can be re-used.
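The selection step described above can be sketched as follows, assuming that predicted runtime and predicted average utilization are already available for each candidate allocation. The `Candidate` fields and the `pick_allocation` function are hypothetical, and the numbers are made up for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    task_slots: int
    buffer_mb: int
    predicted_runtime: float      # seconds, from a runtime model
    predicted_utilization: float  # average utilization in [0, 1]


def pick_allocation(candidates: Sequence[Candidate],
                    deadline: float) -> Optional[Candidate]:
    """Among allocations predicted to meet the deadline, pick the best utilized."""
    feasible = [c for c in candidates if c.predicted_runtime <= deadline]
    if not feasible:
        return None  # no candidate is predicted to meet the deadline
    return max(feasible, key=lambda c: c.predicted_utilization)


# Made-up candidates for illustration only.
candidates = [
    Candidate(task_slots=4, buffer_mb=256, predicted_runtime=900.0, predicted_utilization=0.85),
    Candidate(task_slots=8, buffer_mb=512, predicted_runtime=520.0, predicted_utilization=0.70),
    Candidate(task_slots=16, buffer_mb=1024, predicted_runtime=310.0, predicted_utilization=0.45),
]
chosen = pick_allocation(candidates, deadline=600.0)
```

With a deadline of 600 seconds, the 4-slot candidate is excluded because its predicted runtime is too long, and the 8-slot candidate is preferred over the 16-slot one because its predicted utilization is higher.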