When we deduced closed-form formulas to depict the relationship between system factors and the good- ness of data locality, we assumed replication factor and the number of task slots per slave node are 1. That assumption limits the applicability of our results. We need to theoretically deduce formulas for more general system configurations. It will definitely help MapReduce users to understand how their system factors and data locality interact with each other. System administrators can configure their systems accordingly based on performance requirement and hardware capacity (e.g. storage space).
Our proposed task splitting has been shown to be beneficial for compute-intensive jobs. As we mentioned, task splitting impacts data locality, which was not considered in our research. To figure out how the dynamic adjustment of task granularity impacts data locality is interesting, because it will reveal whether task splitting benefits other types of applications.
In task splitting, firstly tasks of fixed granularity are formed, and then adjusted based on real-time system state. Another approach to tackle the load imbalancing issue is to figure out the optimal granularity and form the “correct” tasks in the first place. In MapReduce, one potential solution is to distribute workload in a “streaming” manner. Initially, an appropriate number of tasks are generated whose granularities are tuned to avoid severe load imbalancing. Unlike traditional MapReduce, those initial tasks together may only process a portion of all input data. During execution, the remaining work “flows” to the resources whenever they become available. In this way, the work done by each task is not fixed any more, but depends on resource availability, the progresses of other tasks, etc. As a result, it is adaptive in that workload is distributed in
reacting to the real-time information of the whole system. However, this approach may result in higher management overhead. The granularity of a task should be sufficiently large so that the overhead of task scheduling, starting and destroying is insignificant compared to the actual task execution. To automatically determine the optimal amount of workload assigned to each task is non-trivial and requires further study.
We predict the execution time of tasks according to the progress rate with the assumption that each task progresses at a constant rate. A more robust estimation is desired if the progress rate is not uniform. There are certain techniques that can potentially be applied. Firstly, weights can be associated with progress rates based on when the processing was carried out; and the progress rate of recent processing has a higher weight than that of “old” processing. The idea is that recent information is more valuable than old information because it reflects the latest state of the node (e.g. health status, workload). Secondly, some sophisticated moving average techniques can be used to smooth out intermittent fluctuation. Thirdly, the progress rates of all running tasks on a slave node can be collected and analyzed in a collaborative manner. For example, if all tasks on a node slow down suddenly, we can conjecture with high confidence that the cause is hardware failure or degradation. To combine the techniques mentioned above will enable us to estimate task execution time more accurately.
Resource stealing in Hadoop does not provide much benefit for IO-intensive applications because of the contention of underlying input reader and output writer shared by multiple threads within individual tasks. How to mitigate the synchronization overhead proves to be another interesting research area. It requires modification to the Hadoop core IO code which has been designed for single-thread iterator accessing only. One possible approach is to use lock-free data structures and algorithms. Or multiple iterators can be exposed to MapReduce so that different portions of an individual block can be accessed in parallel. Task tracker daemon needs to maintain which key/value pairs have been processed to avoid duplicate processing.
As we discussed, the hard partition of hardware resources into task slots is problematic because: i) the optimal setting is hard to find; ii) low resource utilization has been observed when there are not sufficient tasks. Our proposed resource stealing is one solution. Another solution to be yet explored is that we can thoroughly abandon the concept of task slots. Instead, given a slave node, we decide whether a new task should be assigned based on resource usage (e.g. CPU, disk, network bandwidth utilization) and the workload of the task (e.g. compute-, or IO- intensive). Task workload can be obtained from historical data or user- provided hints, while resource usage can be collected by calling system metric APIs. This approach will
alleviate the burden on developers and have the potential to improve efficiency.
Currently, HMR is specifically optimized for compute-intensive applications. For IO-intensive applica- tions, data staging becomes more dominant within overall job execution. How to make HMR suitable for other types of applications as well needs to be studied. If data are pre-uploaded before job submission, data affinity needs to play a more important role in task scheduling to minimize the penalty of data staging. Our current implementation of HyMR is a prototype which lacks the support of monitoring, debugging, and diagnosis tools. They are demanded to facilitate the everyday use of common users.