In this chapter we presented PREDIcT, an experimental methodology for predicting the run- time of a class of iterative algorithms operating on graph structures. PREDIcT builds on the insight that the algorithm execution on a small sample can be transformed to capture the pro- cessing characteristics of the complete input dataset. Given an iterative algorithm, PREDIcT proposes a set of transformations: i.e., a sample technique and a transform function, that in combination can maintain key input feature invariants among the sample run and the actual run.
PREDIcT introduces an extensible framework for building customized cost models for iterative algorithms executing on top of Apache Giraph, a BSP implementation. Our experimental analysis of a set of diverse algorithms: i.e., ranking, semi-clustering, and graph processing shows promising results both for estimating key input features and time estimates. For a sample ratio of 10%, the relative error for predicting key input features ranges in between 5%-35%, while the corresponding error for predicting runtime ranges in between 10%-30% for all scale-free graphs analyzed.
processing
4.1 Introduction
In this chapter we consider analytical workloads produced by data pre-processing tasks: i.e., tasks that Extract, Transform, and Load (ETL) the input for further analysis and more complex processing. In contrast to ad-hoc query workloads, data pre-processing tasks are comprised of
fixed data flows that are run repetitively over newly arriving data sets or on different portions
of an existing data set. For such workloads, mechanisms to predict the runtime performance for incrementally updated input data sets are required.
In contrast with analytical queries consisting of traditional database operators, ETL processing tasks executing on MapReduce often include user defined map and reduce functions written in imperative languages such as Java. The input data is read from in-situ files whose structure may be opaque to the system. One of the main differences, is that MapReduce does not always “own" the data or the query’s operators. In this context, modeling the query runtime using
state-of-the-art analytical modeling is still an open problem.
In this chapter we develop hybrid prediction models customized per query segment type. We specialize models per query segment type in order to reduce the corresponding domain knowledge required about the operators’ semantics and implementation when collecting the features. Concretely, our models use a small number of key input features (i.e., tuple size, input cardinality) and exploit historical information about prior query executions (i.e., per tuple processing cost). To compute a runtime estimate, our approach combines a set of machine learning models with a global analytical model. Machine learning models are used as building blocks to capture the processing cost and the output cardinalities of each query segment. An analytical model is then used to compute the query runtime from its segments’ estimates. A query can be modeled using one segment (coarse-grain) or multiple segments (fine-grain). We consider several options for segmentation since different granularities may be useful for different scenarios. For example, coarse grain segments are good candidates for dedicated infrastructures where performance interference and runtime variability is low. In contrast,
fine granularity segments are good candidates for shared infrastructures where the dynamics
of the system (e.g., slowdown/speed-up) must be captured. For example, such segmentation is used for query progress estimators [53, 57]. Since all of these scenarios are of interest for our workloads, we propose a generic prediction mechanism which can be applied at different segment granularities according to the particular use case at hand.
Figure 4.1 – Input / output cardinality cor- relations for Workload-A
Figure 4.2 – Input cardinality / processing speed correlations for Workload-A
In this chapter, we evaluate our proposed prediction technique in the context of applications that were written using Jaql. We investigated correlations between input / output query features including data characteristics and per segment processing costs for several real workloads such as social media analytics, data pre-process for machine learning algorithms, and general analytics (the set of workloads is described in Section 5.6). As a result of the analysis, we found strong correlations between per segment input / output cardinalities, and between input cardinalities / segment processing speeds. Figure 4.1 and Figure 4.2 show the observed correlations for a a typical task (pre-process for mining step of social media analytics). We note that, if the observed correlations can be mapped to a function, it is possible to model them either using simple linear regression (i.e., for linear functions) or more specialized regression models such as transform regression [61], which can handle non-linearities in the data (i.e., for more complex functions).
The proposed technique is applicable to MapReduce jobs in general and other high-level languages so long as sufficient information is available in log files to identify traces from similar MapReduce jobs. We note that identifying job types by only comparing job binaries is not robust because additional configuration parameters may be used to decide the actual code fragments executed by the job. For our case, Jaql’s use of transparent functions and their parameters facilitated this task.
In this chapter we make the following contributions:
• We develop hybrid prediction models that estimate the runtime of the same set of ETL queries executing on different input datasets.
• We analyze the sources of errors when predicting the query runtime and discuss the error propagation pipeline for the models.
• We evaluate and show the feasibility of the prediction models for different levels of segment granularities on real analytical workloads. In our experiments, we obtain less than 25% runtime prediction errors for 90% of predictions.