2.4.1 Self-tuning and Optimization
Herodotou et al. propose Starfish [38], a self-tuning system for Extract Transform Load (ETL) workloads that uses performance models for workload tuning, that is, for finding the best configuration settings for a given workload and cluster deployment. Starfish was designed to help data analytics practitioners get the best job performance without requiring them to know the tuning knobs of the underlying MapReduce infrastructure. The key building block for finding the best configuration settings is the job profile, which models the processing characteristics of an input job. The processing characteristics are grouped into processing cost factors and data statistics. Given a MapReduce job, a job profile is obtained by executing the job on a sample of the input dataset. The approach then uses the job’s processing characteristics from the job profile, a set of analytical models, and simulation to predict the job’s runtime for a range of input configuration settings. Starfish’s What-If engine is called for each candidate configuration, and the configuration with the best predicted runtime is returned to the user.
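The search loop described above can be illustrated with a minimal sketch. The runtime model, the profile fields (e.g., map_task_secs, compress_ratio), and the cluster description are hypothetical stand-ins chosen for illustration; they are not Starfish's actual analytical models or profile format.

```python
from itertools import product


def predict_runtime(profile, cluster, config):
    """Toy what-if model (an assumption, not Starfish's real models)."""
    # Number of task waves = ceil(tasks / slots).
    map_waves = -(-profile["num_map_tasks"] // cluster["map_slots"])
    reduce_waves = -(-config["num_reducers"] // cluster["reduce_slots"])

    # Assume compression adds 10% CPU time to each map task ...
    map_task_secs = profile["map_task_secs"] * (1.1 if config["compress"] else 1.0)
    # ... but shrinks the intermediate data that reducers must process.
    ratio = profile["compress_ratio"] if config["compress"] else 1.0
    shuffle_bytes = profile["map_output_bytes"] * ratio

    reduce_task_secs = (
        shuffle_bytes / config["num_reducers"] / profile["reduce_bytes_per_sec"]
        + profile["task_startup_secs"]
    )
    return map_waves * map_task_secs + reduce_waves * reduce_task_secs


def best_configuration(profile, cluster, search_space):
    """Evaluate the model on every candidate and keep the fastest one."""
    names = sorted(search_space)
    candidates = (
        dict(zip(names, values))
        for values in product(*(search_space[n] for n in names))
    )
    return min(candidates, key=lambda c: predict_runtime(profile, cluster, c))


# Hypothetical job profile, as if measured on a sample of the input.
profile = {
    "num_map_tasks": 40,
    "map_task_secs": 10.0,
    "map_output_bytes": 8e9,
    "compress_ratio": 0.25,
    "reduce_bytes_per_sec": 50e6,
    "task_startup_secs": 2.0,
}
cluster = {"map_slots": 8, "reduce_slots": 4}
search_space = {"num_reducers": [1, 4, 8, 16], "compress": [False, True]}

best = best_configuration(profile, cluster, search_space)
best_secs = predict_runtime(profile, cluster, best)
```

An exhaustive enumeration is used here only because the toy search space is tiny; a realistic configuration space would require a smarter search strategy.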
Elastisizer [39] extends Starfish’s approach to the cluster sizing problem, that is, finding the cluster size and the type of machine instances (in terms of resource characteristics) that best meet the requirements of the workload. Hence, in addition to configuration settings, the search space includes cluster resources in terms of instance types on Amazon EC2. Elastisizer uses controlled black-box models for estimating the processing cost factors on the target deployment. A set of synthetic workloads is generated and executed on each instance type to produce the training data. The models are fitted with M5 trees [65]. An M5 tree is a decision tree that, instead of predicting the average of all training examples falling within a leaf node, applies a second-level modeling phase to the observations falling within each leaf node. In particular, linear regression models are fitted to the observations within each leaf.
We make several observations. Both Starfish and Elastisizer average processing cost factors among the job’s task profiles to produce a reference profile that is later used in prediction. As we show later, averaging processing cost factors among tasks with very different input data properties is one source of modeling error that may cause significant inaccuracies when estimating the runtime with Starfish’s analytical models. In addition, the task profiles are collected at MapReduce phase granularity and thus do not track more specific information about the processing steps executed within the map and reduce functions (e.g., scan, join, and project operators).
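The idea behind the M5 model trees used by Elastisizer, a linear model per leaf rather than a constant, can be sketched with a single-split tree over one feature. This is a deliberately simplified illustration: the split is chosen by the summed squared error of the per-leaf lines, whereas the actual M5 algorithm uses its own split criterion and adds pruning and smoothing. All names and data points below are hypothetical.

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b over (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        return 0.0, mean_y  # degenerate leaf: fall back to the mean
    a = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return a, mean_y - a * mean_x


def squared_error(points, line):
    a, b = line
    return sum((y - (a * x + b)) ** 2 for x, y in points)


def fit_model_stump(points, min_leaf=2):
    """One-split model tree; needs at least 2 * min_leaf distinct x values."""
    points = sorted(points)
    best = None
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical x values
        left, right = points[:i], points[i:]
        left_line, right_line = fit_line(left), fit_line(right)
        err = squared_error(left, left_line) + squared_error(right, right_line)
        if best is None or err < best[0]:
            threshold = (points[i - 1][0] + points[i][0]) / 2
            best = (err, threshold, left_line, right_line)
    _, threshold, left_line, right_line = best
    return threshold, left_line, right_line


def predict(model, x):
    threshold, left_line, right_line = model
    a, b = left_line if x <= threshold else right_line
    return a * x + b


# Hypothetical observations: (input size, measured cost factor) with two regimes.
observations = [(1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0),
                (5, 12.5), (6, 13.0), (7, 13.5), (8, 14.0)]
model = fit_model_stump(observations)
```

A plain regression tree would predict one constant per leaf (6.0 for any x below the split in this data), while the per-leaf line follows the trend inside each regime.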
Wu et al. [80] propose analytical cost models for HiveQL operators with the underlying goal of query optimization. Their work aims at reducing the size of intermediate results by adaptively grouping join operators that can be processed in a single MapReduce job. Unlike conventional optimization, the proposed approach is tailored to the characteristics of the MapReduce processing model, such as the materialization of intermediate results and data shuffling. As in the PostgreSQL cost model and the query cost calibration approaches proposed for PostgreSQL [81, 68], processing cost factors are assumed constant for a given hardware infrastructure.
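To make the notion of constant processing cost factors concrete, the sketch below computes a PostgreSQL-style sequential scan cost. The three constants are PostgreSQL's documented default planner cost values; the function itself is a simplified illustration and is not the cost model of Wu et al.

```python
# PostgreSQL's default planner cost constants (arbitrary cost units).
SEQ_PAGE_COST = 1.0         # cost of reading one page sequentially
CPU_TUPLE_COST = 0.01       # cost of processing one tuple
CPU_OPERATOR_COST = 0.0025  # cost of evaluating one operator on one tuple


def seq_scan_cost(pages, tuples, filter_operators=0):
    """Estimated cost of a sequential scan with optional filter operators.

    The cost factors are fixed constants: only the data statistics
    (pages, tuples) change from one query to the next.
    """
    io_cost = pages * SEQ_PAGE_COST
    cpu_cost = tuples * (CPU_TUPLE_COST + filter_operators * CPU_OPERATOR_COST)
    return io_cost + cpu_cost


# A table with 358 pages and 10000 tuples, scanned without and with one filter.
plain_scan = seq_scan_cost(358, 10000)
filtered_scan = seq_scan_cost(358, 10000, filter_operators=1)
```

Because the constants do not depend on the data being processed, such a model cannot capture cost factors that vary with input data properties.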
2.4.2 Nearest Neighbors-based Prediction
Ganapathi et al. extend their method, originally proposed for database queries [29], to HiveQL queries executing on top of MapReduce [28]. The input features combine MapReduce-specific configuration settings with query features taken from the query plan. The models are built at coarse granularity (job/query granularity). The set of input features considered includes configuration parameters and input data characteristics, namely the number and location of map/reduce slots, input bytes, bytes read from local disk, and bytes read from the distributed file system (i.e., HDFS in Hadoop). A large number of training queries (on the order of thousands) is used for fitting the models.
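The basic nearest-neighbor idea can be sketched as follows: the runtime of a new job is predicted as the mean runtime of the k most similar training jobs in a normalized feature space. This is a plain k-nearest-neighbor illustration and not the cited method, which may apply further feature transformations before the neighbor search. The feature layout (map slots, input GB, HDFS read GB) and all values are hypothetical.

```python
import math


def make_scaler(feature_rows):
    """Build a min-max scaler so that no single feature dominates the distance."""
    lows = [min(col) for col in zip(*feature_rows)]
    highs = [max(col) for col in zip(*feature_rows)]

    def scale(row):
        return [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]

    return scale


def knn_predict_runtime(training, query_features, k=3):
    """Mean runtime of the k training jobs closest to the query."""
    scale = make_scaler([features for features, _ in training])
    query = scale(query_features)
    by_distance = sorted(
        training, key=lambda item: math.dist(scale(item[0]), query)
    )
    nearest = by_distance[:k]
    return sum(runtime for _, runtime in nearest) / len(nearest)


# Hypothetical training set: ((map_slots, input_gb, hdfs_read_gb), runtime_secs)
training = [
    ((8, 10, 10), 120.0),
    ((8, 12, 12), 140.0),
    ((8, 50, 50), 560.0),
    ((16, 10, 10), 70.0),
    ((16, 50, 50), 300.0),
    ((16, 48, 48), 290.0),
]
predicted = knn_predict_runtime(training, (16, 49, 49), k=2)
```

The accuracy of such a predictor depends on how densely the training jobs cover the feature space, which is consistent with the large training sets reported above.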
2.4.3 Resource Allocation and Scheduling
FLEX [79], ARIA [76], and Rayon [21] optimize resource allocation for large-scale analytical workloads in accordance with deadlines and user-contracted Service Level Agreements (SLAs). To find an optimal allocation of resources, all of these schedulers require as input query runtime estimates for each potential resource allocation configuration. Thus, mechanisms for estimating the runtime associated with each resource allocation configuration are of paramount importance.
ARIA is an SLA-aware scheduler based on the Earliest Deadline First scheduling policy. Given a MapReduce job and a deadline, ARIA estimates the resources required to execute the job while satisfying the deadline. The prediction component of ARIA uses a job profile to summarize the performance characteristics of the job. The evaluated workloads include ETL tasks such as word count, sort, classification, and term frequency-inverse document frequency (TF-IDF).
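The deadline-driven estimation described above can be illustrated with a minimal sketch. It assumes a job profile that stores the task count and the average and maximum task duration per stage, and it uses the standard makespan bounds for greedy task assignment on a fixed number of slots. The profile layout, the use of a single slot count for both stages, and all numbers are simplifying assumptions made for illustration, not ARIA's exact model.

```python
def stage_time_bounds(num_tasks, avg_secs, max_secs, slots):
    """Lower and upper makespan bounds for greedy assignment on `slots` slots."""
    lower = num_tasks * avg_secs / slots
    upper = (num_tasks - 1) * avg_secs / slots + max_secs
    return lower, upper


def min_slots_for_deadline(profile, deadline_secs, max_slots):
    """Smallest slot count whose pessimistic estimate still meets the deadline."""
    for slots in range(1, max_slots + 1):
        worst_case = sum(
            stage_time_bounds(s["tasks"], s["avg"], s["max"], slots)[1]
            for s in profile
        )
        if worst_case <= deadline_secs:
            return slots
    return None  # deadline cannot be met within the available slots


def edf_order(jobs):
    """Earliest Deadline First: serve the job with the nearest deadline first."""
    return sorted(jobs, key=lambda job: job["deadline"])


# Hypothetical job profile: one map stage and one reduce stage.
profile = [
    {"tasks": 100, "avg": 10.0, "max": 20.0},  # map stage
    {"tasks": 20, "avg": 30.0, "max": 45.0},   # reduce stage
]
slots_needed = min_slots_for_deadline(profile, deadline_secs=300, max_slots=64)
queue = edf_order([{"name": "a", "deadline": 300}, {"name": "b", "deadline": 120}])
```

Sizing against the upper bound is a conservative choice; an estimate between the two bounds would request fewer slots at a higher risk of missing the deadline.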
The Rayon system [21] introduces the concept of resource reservations into YARN [75], aiming to provide predictable resource allocations for mixed analytical workloads. Rayon is built on top of the capacity scheduler, configured with one dedicated queue that accepts reservation requests for production jobs (which have strict deadline constraints) and one default queue that accepts requests for best-effort jobs (which have no time constraints but are sensitive to latency). Given this workload mix, Rayon aims to improve the resource utilization of the cluster as much as possible while satisfying deadline constraints for production jobs and minimizing latency for best-effort jobs.