Enhancements to MapReduce Model - HIGH PERFORMANCE INTEGRATION OF DATA PARALLEL FILE SYSTEMS AN


9.2.5 Enhancements to MapReduce Model

MapReduce is mainly designed for homogeneous single-cluster environments. It is difficult, if not impossible, for common users to obtain access to large resource pools. To obtain enough resources to run complex applications, the user sometimes needs to combine multiple clusters administered in different domains. One convention in grid clusters is that internal compute nodes are not publicly accessible from outside, for security reasons. Instead, the user must first log in to a head node and then access internal nodes from there. This constraint makes it impossible to deploy a single MapReduce system across multiple grid clusters, because the master node needs to communicate with every slave node directly.

To solve this issue, we proposed Hierarchical MapReduce (HMR) in section 7.1, which is both an extended MapReduce programming model and a runtime. First, the original MapReduce model is extended to Map-LocalReduce-GlobalReduce. The output of the local reduce tasks is the input of the global reduce task, which produces the final results. Second, we developed a runtime that supports cross-domain execution of HMR jobs on top of multiple traditional grid clusters. An existing MapReduce job can run without modification provided that its reduce operation is associative. HMR therefore greatly expands the environments in which MapReduce jobs can run without placing any burden on developers. The architecture of the HMR runtime is shown in Fig. 7.1. The global controller manages and coordinates the local MapReduce clusters. Given a submitted job, the global controller decides how to distribute the workload among the local clusters. We proposed a compute-capacity-aware scheduling algorithm that is optimized for compute-intensive applications. Our experiment with the biological application AutoDock shows that HMR can speed up job execution significantly and that our scheduling algorithm balances the workload well.
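The Map-LocalReduce-GlobalReduce flow can be illustrated with a minimal single-process sketch. It is not the HMR runtime: the word-count job, the cluster names, and the proportional split in `split_by_capacity` are assumptions made for illustration, and the actual scheduling algorithm from section 7.1 is not reproduced here. The sketch only shows why an associative reduce lets the hierarchical result match a flat MapReduce run.

```python
from collections import Counter
from functools import reduce


def split_by_capacity(items, capacities):
    """Split items across clusters in proportion to capacity (assumed policy)."""
    total = sum(capacities.values())
    shares, start, cumulative = {}, 0, 0
    for name, capacity in capacities.items():
        cumulative += capacity
        # Cumulative boundaries guarantee that every item is assigned exactly once.
        end = round(len(items) * cumulative / total)
        shares[name] = items[start:end]
        start = end
    return shares


def map_fn(line):
    """Map: count the words in one input line."""
    return Counter(line.split())


def reduce_fn(left, right):
    """Reduce: merge two partial counts. Counter addition is associative."""
    return left + right


def run_hmr(lines, capacities):
    """Map and LocalReduce inside each cluster, then one GlobalReduce."""
    shares = split_by_capacity(lines, capacities)
    local_results = [
        reduce(reduce_fn, map(map_fn, share), Counter())
        for share in shares.values()
    ]
    return reduce(reduce_fn, local_results, Counter())


def run_flat(lines):
    """Ordinary single-cluster MapReduce, used as the reference result."""
    return reduce(reduce_fn, map(map_fn, lines), Counter())


if __name__ == "__main__":
    lines = ["a b a", "b c", "a", "c c b", "a b"]
    # Hypothetical clusters with relative compute capacities 3 : 1 : 1.
    capacities = {"cluster_a": 3, "cluster_b": 1, "cluster_c": 1}
    print(run_hmr(lines, capacities))
    print(run_hmr(lines, capacities) == run_flat(lines))
```

Because the reduce operation is associative, grouping the partial results by cluster does not change the final counts, which is the property that allows an existing job to run unmodified.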

MapReduce and iterative MapReduce are suitable for regular and iterative jobs, respectively. Some applications comprise both regular and iterative algorithms. To run them efficiently, users need to manage the dependencies manually and submit each constituent job to the system that matches its type. To automate this process, we proposed a workflow management system, Hybrid MapReduce (HyMR), in section 7.2, which currently supports both Hadoop and Twister. A simple workflow description language was developed in which users can describe jobs and their dependency relationships. HyMR supports data sharing through a distributed file system, fault tolerance, and dependency management. We ran our bioinformatics visualization pipeline, which consists of PSA, MDS, and MI-MDS, on HyMR and report the results elsewhere.
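The dependency management described above can be sketched as a topological ordering of jobs, each dispatched to the runtime that matches its type. This is a hypothetical illustration, not the HyMR workflow description language: the job names, the `WORKFLOW` dictionary layout, and the `plan` function are assumptions, and real submission to Hadoop or Twister is replaced by returning the planned order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumed mapping from job type to runtime: regular jobs go to Hadoop,
# iterative jobs go to Twister.
RUNTIME_FOR_TYPE = {"regular": "Hadoop", "iterative": "Twister"}

# Hypothetical workflow: job name -> (job type, names of prerequisite jobs).
WORKFLOW = {
    "preprocess": ("regular", set()),
    "cluster": ("iterative", {"preprocess"}),
    "summarize": ("regular", {"cluster"}),
}


def plan(workflow):
    """Return (job, runtime) pairs in an order that respects dependencies."""
    graph = {name: deps for name, (_, deps) in workflow.items()}
    # static_order() yields every job after all of its prerequisites and
    # raises graphlib.CycleError if the dependencies are circular.
    order = TopologicalSorter(graph).static_order()
    return [(name, RUNTIME_FOR_TYPE[workflow[name][0]]) for name in order]


if __name__ == "__main__":
    for job, runtime in plan(WORKFLOW):
        print(f"submit {job} to {runtime}")
```

A real system would additionally wait for each job to finish, pass intermediate data through the shared file system, and retry failed jobs; those parts are omitted here.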

In short, HMR and HyMR are built on top of MapReduce and add features that facilitate cross-domain and hybrid execution of parallel jobs.

9.3 Contributions

In section 1.5, we briefly introduced the contributions of this dissertation. We summarize how we achieve them below.

• A detailed performance analysis of widely used runtimes and storage systems is presented to reveal both the speedup and the overhead they bring. Surprisingly, the performance of some well-known storage systems degrades significantly compared with native local I/O subsystems. Through thorough evaluation, we have made insightful observations regarding the performance of Hadoop and contemporary storage systems. Our findings show that substantial tuning and optimization of those systems are needed to fully exploit the power of the underlying hardware.

• For MapReduce, a mathematical model is formally built with reasonable data placement assumptions. Under this formulation, we deduce the relationship between influential system factors and data locality, so that users can predict the expected data locality.

Our mathematical deduction of the relationship between system factors and data locality clarifies how data locality behaves in MapReduce. It enables users to estimate how their system configurations will affect data locality without deploying real systems, and to analyze cost-effectiveness more conveniently.
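The kind of question the model answers can also be explored numerically. The following Monte Carlo sketch is not the mathematical model of this dissertation; it is a simplified simulation under assumptions stated in the comments (one block per task, uniformly random replica placement, one idle slot per idle node, and a scheduler that prefers any local task). The function name `estimate_locality` and all parameters are hypothetical.

```python
import random


def estimate_locality(num_nodes, num_tasks, replication, idle_slots,
                      trials=200, seed=0):
    """Estimate the fraction of scheduled tasks that read local data."""
    rng = random.Random(seed)
    local = total = 0
    for _ in range(trials):
        # Assumption: each task reads one block whose replicas sit on
        # `replication` distinct nodes chosen uniformly at random.
        placement = [
            set(rng.sample(range(num_nodes), replication))
            for _ in range(num_tasks)
        ]
        pending = list(range(num_tasks))
        # Assumption: `idle_slots` distinct nodes each offer one idle slot.
        for node in rng.sample(range(num_nodes), idle_slots):
            if not pending:
                break
            local_tasks = [t for t in pending if node in placement[t]]
            if local_tasks:
                task = local_tasks[0]
                local += 1
            else:
                task = pending[0]  # no local data: run a non-local task
            pending.remove(task)
            total += 1
    return local / total if total else 0.0


if __name__ == "__main__":
    for replication in (1, 2, 3):
        rate = estimate_locality(num_nodes=50, num_tasks=20,
                                 replication=replication, idle_slots=10)
        print(f"replication={replication}: estimated locality {rate:.2f}")
```

Varying the number of nodes, tasks, replicas, and idle slots in such a simulation gives a rough feel for how these factors interact, which is what the closed-form deduction provides without running experiments.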

• The sub-optimality of the default Hadoop scheduler is revealed, and an optimal algorithm based on LSAP is proposed.

We convert the MapReduce scheduling problem to a mathematical problem by constructing a matrix that captures data locality information. An LSAP-based solution is proposed to find the task assignment that achieves optimal data locality. Our algorithm is evaluated against the native Hadoop scheduling algorithm.
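A toy instance shows how a locality matrix turns scheduling into an assignment problem, and why a slot-by-slot greedy choice can be sub-optimal. This is a sketch under assumptions: the 0/1 cost values, the node names, and the simplified `greedy_assignment` (a stand-in, not the actual Hadoop scheduler) are illustrative, and the exhaustive search in `optimal_assignment` is only practical for tiny inputs. A polynomial-time assignment solver such as the Hungarian algorithm would take its place in practice.

```python
from itertools import permutations


def locality_cost(task_blocks, slots):
    """cost[i][j] is 0 if task i has its data on the node of slot j, else 1."""
    return [[0 if slot in nodes else 1 for slot in slots]
            for nodes in task_blocks]


def total_cost(cost, assignment):
    """Sum the cost of giving task i the slot assignment[i]."""
    return sum(cost[task][slot] for task, slot in enumerate(assignment))


def optimal_assignment(cost):
    """Exhaustively search one-to-one assignments (square matrix only)."""
    n = len(cost)
    best = min(permutations(range(n)), key=lambda p: total_cost(cost, p))
    return list(best), total_cost(cost, best)


def greedy_assignment(cost):
    """Slot by slot, take the first unassigned local task, else any task."""
    n = len(cost)
    unassigned = list(range(n))
    assignment = [None] * n
    for slot in range(n):
        local = [t for t in unassigned if cost[t][slot] == 0]
        task = local[0] if local else unassigned[0]
        assignment[task] = slot
        unassigned.remove(task)
    return assignment, total_cost(cost, assignment)


if __name__ == "__main__":
    # Task 0 has replicas on n0 and n1; task 1 only on n0 (hypothetical).
    task_blocks = [{"n0", "n1"}, {"n0"}]
    slots = ["n0", "n1"]
    cost = locality_cost(task_blocks, slots)
    print("greedy :", greedy_assignment(cost))
    print("optimal:", optimal_assignment(cost))
```

In this instance the greedy pass gives slot n0 to task 0 and forces task 1 to run non-locally, whereas the optimal assignment runs both tasks locally.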

• Based on the existing Bag-of-Tasks (BoT) model, a new task model, Bag-of-Divisible-Tasks (BoDT), is proposed. Upon BoDT, new mechanisms are proposed that improve load balancing by adjusting task granularity dynamically and adaptively. Given the BoDT model, we demonstrate that the Shortest Job First (SJF) strategy achieves optimal average job turnaround time under the assumption that work is arbitrarily divisible.

To tackle the load imbalance incurred by the fixed task granularity in MapReduce, we propose task splitting, which splits long-running “straggler” tasks into smaller tasks that yield better load balancing and higher resource utilization. Prior knowledge, if available, is integrated to achieve better performance. Our finding that SJF yields optimal job turnaround time without impacting makespan can help system administrators tune their systems accordingly if the optimization goal is job turnaround time.
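The SJF claim can be checked on a small worked example. The sketch below assumes that all jobs are submitted at time zero, that work is arbitrarily divisible so each job can use the whole cluster, and that jobs run one after another; the job sizes and the capacity value are hypothetical.

```python
from itertools import permutations


def turnaround_times(work, capacity):
    """Finish time of each job when jobs run back to back on the full cluster."""
    finish, elapsed = [], 0.0
    for amount in work:
        # Divisible work: the job is spread evenly over all resources.
        elapsed += amount / capacity
        finish.append(elapsed)
    return finish


def average(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    work = [8, 2, 4]   # hypothetical job sizes in units of work
    capacity = 2       # units of work the cluster completes per time unit

    submitted_order = turnaround_times(work, capacity)
    sjf_order = turnaround_times(sorted(work), capacity)

    print("submitted order:", average(submitted_order), "makespan", submitted_order[-1])
    print("SJF order      :", average(sjf_order), "makespan", sjf_order[-1])

    # No ordering of these jobs beats SJF on average turnaround time.
    best = min(average(turnaround_times(p, capacity)) for p in permutations(work))
    print("best possible  :", best)
```

Every ordering finishes the last job at the same time, since the total work is fixed, so only the average turnaround time depends on the order.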

• We propose Resource Stealing to maximize resource utilization, and Benefit Aware Speculative Execution (BASE) to eliminate the launch of non-beneficial speculative tasks and thus improve the efficiency of resource usage.

Because most data centers are underutilized, our proposed resource stealing allows running tasks to opportunistically use more resources than they were originally allocated, without incurring performance degradation. Different allocation policies for idle resources have been investigated. BASE predicts the execution time of speculative tasks and launches them only when they are expected to shorten job execution. Together, these two mechanisms greatly improve the efficiency of resource usage.
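The decision rule behind benefit-aware speculation can be sketched as a comparison of two estimates. How BASE actually predicts execution time is not described in this section, so the linear progress-rate estimate in `remaining_time` and the function names are assumptions made for illustration only.

```python
def remaining_time(progress, elapsed):
    """Estimate the time left, assuming the task keeps its average rate so far."""
    rate = progress / elapsed          # fraction of the task completed per second
    return (1.0 - progress) / rate


def should_speculate(progress, elapsed, estimated_speculative_time):
    """Launch a speculative copy only if it is expected to finish first."""
    if progress <= 0 or elapsed <= 0:
        return False                   # no basis for an estimate yet
    return estimated_speculative_time < remaining_time(progress, elapsed)


if __name__ == "__main__":
    # A slow task: 20% done after 100 s, so roughly 400 s remain.
    print(should_speculate(0.2, 100.0, estimated_speculative_time=300.0))
    # A nearly finished task: 90% done after 100 s, so roughly 11 s remain.
    print(should_speculate(0.9, 100.0, estimated_speculative_time=300.0))
```

Under these assumptions the first call favors speculation and the second does not, because a fresh copy would finish long after the original task.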

• Hierarchical MapReduce (HMR) is proposed, which circumvents the administrative boundaries of separate grid clusters. To use both MapReduce runtimes and iterative MapReduce runtimes in a single pipeline of jobs, Hybrid MapReduce (HyMR) is proposed, which combines the best of both worlds.

In HMR, we extended MapReduce to Map-LocalReduce-GlobalReduce and built a runtime to facilitate the cross-domain execution of MapReduce jobs. In addition, we built a hybrid workflow management system that supports the Hadoop and Twister runtimes in grid clusters. HMR and HyMR enable more sophisticated use of MapReduce in complicated environments.

In document HIGH PERFORMANCE INTEGRATION OF DATA PARALLEL FILE SYSTEMS AND COMPUTING: OPTIMIZING MAPREDUCE (Page 174-177)