Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications

(1)

Locality and Network-Aware Reduce Task Scheduling for

Data-Intensive Applications

Engin Arslan

University at Buffalo (SUNY)

Mrigank Shekhar

Intel Corporation

Tevfik Kosar

University at Buffalo (SUNY)

ABSTRACT

MapReduce is one of the leading programming frameworks to implement data-intensive applications by splitting the map and reduce tasks to distributed servers. Although there has been substantial amount of work on map task scheduling and optimization in the literature, the work on reduce task scheduling is very limited. Effective scheduling of the re-duce tasks to the resources becomes especially important for the performance of data-intensive applications where large amounts of data are moved between the map and reduce tasks. In this paper, we propose a new algorithm (LoNARS) for reduce task scheduling, which takes both data locality and network traffic into consideration. Data locality aware-ness aims to schedule the reduce tasks closer to the map tasks to decrease the delay in data access as well as the amount of traffic pushed to the network. Network traffic awareness intends to distribute the traffic over the whole network and minimize the hotspots to reduce the effect of network congestion in data transfers. We have integrated LoNARS into Hadoop-1.2.1. Using our LoNARS algorithm, we achieved up to 15% gain in data shuffling time and up to 3-4% improvement in total job completion time compared to the other reduce task scheduling algorithms. Moreover, we reduced the amount of traffic on network switches by 15% which helps to save energy consumption considerably.

1. INTRODUCTION

The increasing data requirements of commercial and sci-entific applications has lead to new programming paradigms and complex scheduling algorithms for efficient processing of these data sets. MapReduce [1] is one of the programming paradigms proposed to overcome challenges of big data pro-cessing by effectively dividing big jobs into small tasks and distributing them over a set of compute nodes. In such a setting, the compute and data storage nodes can be differ-ent, which brings up the problem of co-locating data and computation for efficient end-to-end computing.

To overcome the data-computation scheduling problem,

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

locality-aware data placement is considered a desired fea-ture, and different solutions were proposed to optimize it by arranging data location in a way that compute tasks are always assigned to nodes which store the relevant data [12, 10]. On the other hand, MapReduce and its open source implementation Hadoop [3], offload the data placement task to HDFS which does not take data locality optimization into consideration when making data placement decisions. Thus, low data localization leads to high amounts of data to be transferred to and from the compute nodes.

A typical MapReduce application is composed of three phases: map, shuffle and reduce. Since data placement tasks are managed mostly by the underlying filesystem, such as HDFS in Hadoop, there are two main operations that heav-ily use the network to transfer data between computation units. The first one is the data transfer during the map phase where the task and the relevant data are located at different parts of the datacenter. This would require copy-ing the data from where it is stored to where the task is scheduled. Matei et al. [13] proposed a delay scheduling al-gorithm to improve data locality for map tasks, and their results showed that using simple delay scheduling, map task data locality can be increased to maximum values. The second main operation that causes large network usage is the shuffle phase where the outputs of the map tasks are transferred to the reduce task locations in order to combine all results. Unlike the optimization of map task schedul-ing, shuffle optimization has more than two components to consider thus making the problem more complex.

Hammoud et al. proposed the CoGRS algorithm [4] to in-crease shuffle locality by scheduling reduce tasks to nearby map tasks. CoGRS calculates the optimal task tracker for each reduce task by finding the center of gravity of associated map tasks. During the shuffle phase, reduce tasks transfer data from the map tasks and each reduce task could trans-fer a diftrans-ferent amount of data from each map task. Hence, reduce tasks should be placed nearby to locations of map tasks that they will transfer output data from.

In this paper, we propose a locality and network-aware reduce task scheduling algorithm (LoNARS) in order to op-timize the shuffle phase of data-intensive MapReduce appli-cations. We combine data locality awareness with network traffic awareness in order to decrease the shuffle phase dura-tion, as well as to lower the network traffic caused by these data transfers. Our solution has two distinctive features. First, it takes network bandwidth capacity and congestion into account when comparing two potential paths for reduce input data movement. This is crucial for end-to-end

(2)

perfor-mance of MapReduce applications since the heterogeneous nature of the network connectivity between any two nodes could result in a severe performance penalty. Also, it is pos-sible to observe high load or congestion on a specific part of the cluster or on one specific port of a switch. In this case, it would be crucial to consider the impact of network con-gestion on the end-to-end performance and make scheduling decisions accordingly. As the second distinctive feature of LoNARS, instead of finding one optimal task tracker for a reduce task as used in [4]—which could lead to sub-optimal scheduling, in case the optimal task tracker is not available— we classify all possible reduce task candidates using cost function and then choose from optimal category which will possibly have more than one choice.

The rest of the paper is organized as follows. Section 2 introduces the background on locality based scheduling methods in Hadoop. Our locality and network-aware reduce task scheduling algorithm (LoNARS) is presented in Section 3. In Section 4, we analyze and evaluate the performance gain by our solution. And, Section 5 concludes the paper with a summary of our contributions.

2. BACKGROUND

There has been extensive research in the area of local-ity awareness optimization in MapReduce. Some of them focused on data placement optimization [10, 12], some on map task scheduling [14, 15, 11], and some on reduce task optimizations [6, 10, 4].

Park et al. [11] proposed a locality-aware solution to the map task scheduling in the context of VM resource provi-sioning. When multiple VMs are running on the same phys-ical machines, all VMs obtain a certain portion of physphys-ical resources. This work aimed to dynamically allocate physical CPUs to VMs in such a way that underutilized CPUs can be moved to busy VMs so that better resource utilization can be accomplished. More importantly, in this approach tasks are always assigned to the node with data. When a task is scheduled to the VM where data resides, if all the cores are busy, the task is kept in a queue to wait for an available re-source. If other VMs on the same machine are not busy and have underutilized CPUs, then the CPU resources are dy-namically moved to the VMs with tasks in the queue so that the tasks are always assigned in a locality-aware manner and CPU resources are utilized better.

Palanisamy et al. [10] proposed Purlieus which couples data placement and task scheduling to achieve higher local-ity. They claim that without considering the data place-ment scheme, it is hard to achieve high data locality since random data placement might cause some nodes to be more congested. They also claim that job characteristics, such as how long it takes and how much data is processed in map and reduce phases, can be obtained beforehand. An efficient data placement scheme needs to take these characteristics into consideration so that the data of long jobs are placed on nodes with least possible load.

Hammoud et al. [6] proposed LARTS which focuses on the reduce job data locality problem. Native Hadoop imple-mentations have an option to enable/disable early shuffling (H ESON and H ESOF), which determines whether or not to schedule reduce tasks while map tasks are still running. This improves the overall turnaround time but leads to in-efficient reduce job assignment, since without knowledge of which nodes generated more reduce input, it is hard to

de-cide where to launch the reduce jobs. LARTS tries to employ a solution which enables early shuffling along with efficient reduce job assignment. It is done by enabling early shuffling after a certain number of map tasks are finished. This lets LARTS efficiently choose the nodes to launch the reduce tasks on, by reducing network traffic and data improving locality. Later they proposed CoGRS [4] which calculates the optimal task tracker for each reduce task, similar to LARTS, but, instead of rejecting a task tracker if no reduce task prefers it, it schedules one of the reduce tasks based on the closeness of the task tracker to the reduce task’s op-timal task tracker. CoGRS performs well under light net-work congestion and when servers have free slots most of the time. However, these assumptions might not hold true all the time, especially in production clusters where too many jobs are running simultaneously and network congestion is inevitable. Our proposed algorithm is able to categorize task trackers based on the cost function and will successfully find multiple scheduling choices, which will perform very close to CoGRS in terms of finding optimal task tracker, and out-perform CoGRS in lowering network traffic.

3. SYSTEM DESIGN

In this section, we explain the methodology of our pro-posed reduce task scheduling algorithm. In order to esti-mate the best task tracker for a reduce task, we define a cost function which finds the cost of assigning each reduce task r to each task tracker (T T ). We define δT Tr as the cost

of scheduling reduce task r on task tracker T T ,

δT Tr = n X m=0 Dm BW (T Tm, r) × HT Tm,r (1)

where Dmis the estimated shuffle output size of map task m for a reduce task r, T Tmis the task tracker that executes map task m, and BW (T Tm, r) is an estimation of network bandwidth that can be obtained for data transfer between the task tracker T Tmand the task tracker chosen to reduce task r. HT Tm,ris the number of hops between task tracker T T and T Tm. The cost function measures candidacy by two factors: (i) how much time it would take to complete the shuffling phase for this reduce task, and (ii) how close T T is to map tasks.

In order to find the transfer time of a shuffle, we estimate the data size to be shuffled from the map task to each reduce task. Although the exact shuffle data size can be learned once the map task is completed, it is inefficient to wait for all map tasks to finish execution [6]. Thus we extrapolate it by proportioning the size of the currently available output to the map task’s progress level. To estimate the network bandwidth share for a shuffle operation, we set up an SNMP client which regularly obtains utilization information from the switches.

Our reduce task scheduling algorithm tries to choose the optimal task tracker for each pending reduce task. Thus, once a job is ready to schedule the reduce task, we wait one heartbeat time to receive the request from all task trackers which are available to accept a reduce task. During this time period, we evaluate the cost function of each task tracker with respect to each pending reduce task. As a result, we are able to quantify each task tracker’s candidacy for each reduce task based on cost function values. Then, for each reduce task, we can partition task trackers into groups using

(3)

cost functions via K-Means clustering. This helps us find more than one optimal candidates with slight cost variations. Although the task tracker with the lowest cost function is the best candidate for a reduce task, it is possible multiple reduce tasks prefer the same task tracker. So, by grouping task trackers, we can have more than one task tracker as an optimal candidate. We used three levels of optimality for task trackers with free reduce slots. Task trackers with map tasks generally fall into the first group unless links that are on the path of these task trackers are more congested than the other candidate task tracker. The task trackers which are located close to task trackers with map tasks and have available network bandwidth fall into the second group. The third group contains the rest of the task trackers with available reduce slots.

Algorithm 1 LoNARS Task Scheduling Algorithm 1: function ScheduleReduceTask(TaskTracker TT)

2: for Reduce Task r in job.pendingReduceTasks do

3: δ_{T Tr} = calculateCost(TT,r)

4: r.TTCostList.put(TT,δ_{T Tr})

5: end for

6: if job.timer == −1 then . All job timers are set to -1 at the beginning

7: job.timer = current t

8: return false . Skip this job

9: end if

10: r.costPartitions = partition r.TTCostList list to 3 groups via KMeans

11: sort r.costPartitions in increasing order

12: if currrent t − job.timer > HEART BEAT F REQ and r.T T CostList.get(T T ) ≤ r.costP artitions[0] then

13: scheduleTask R to TT . Optimal Scheduling

14: return true

15: else if currrent t − job.timer > 2xHEART BEAT F REQ and R.T T CostList.get(T T ) ≤ R.costP artitions[1] then

16: scheduleTask R to TT . Sub-optimal Scheduling

17: return true

18: else if currrent t − job.timer > 3xHEART BEAT F REQ then

19: scheduleTask R to TT . Non-optimal Scheduling

20: return true

21: end if return false

22:end function

When a task tracker T T asks for a reduce task from the job tracker, a job in the job list is chosen, and if the job is ready to schedule its reduce task (by default, 5% of map tasks have to complete before a job becomes ready to sched-ule reduce tasks), we call the Schedsched-uleReduceT ask function. This function evaluates the cost of choosing this task tracker for each pending reduce task of the job. Each pending re-duce task keeps a list of costs of task trackers and updates the cost of a task tracker every time the task tracker sends heartbeat. After the cost of each pending reduce task of the job is calculated for requesting task tracker T T , we check if the timer for this job has been initiated yet. If not, we set it to current time and reject the task tracker for this job. If the timer is already set and the difference of current t and timer is between one and two heartbeat times, then we ac-cept only the task tracker from the first optimal cost group. If none of the task trackers from the first group are avail-able, we start accepting requests from the first and second cost group after current t−timer becomes greater than two heartbeat times. In our experiments, for small and medium size jobs, LoNARS is able to assign the task to a task tracker from the first and second group most of the time. For jobs with too many reduce tasks, task trackers from the third optimal level are also chosen.

In terms of algorithm complexity, we used Lloyd’s cluster-ing to generate three groups from 1-dimensional data. With fixed number of iteration, it leads to O(cN) complexity . The complexity of the cost function evaluation is similar to CoGRS except LoNARS measure cost for all available TTs as opposed to considering only feeding nodes. This dif-ference does not impose significant overhead for large jobs since one can expect number of feeding nodes of a reduce task will be very close to total number of TTs as large jobs have too many map tasks. Thus, CoGRS and LoNARS will have similar complexity for large jobs.

4. EVALUATION

We used Apache Hadoop 1.2.1 [2] to implement our cus-tom reduce task scheduling method. We also tuned the heartbeat interval to 0.3 seconds. Although this might lead to high load on the job tracker when thousands of servers are in place, we leave finding the minimal acceptable heartbeat interval based on cluster size to future works. We bench-marked LoNARS at micro and macro scales. Micro-scale benchmarking is conducted on the 12-server cluster topol-ogy shown and macro-scale benchmarking on a 100-server cluster simulation using a MapReduce simulator.

4.1 Micro-benchmarking

We set up a 12-server cluster with the topology shown in Figure 1. Four 1G SNMP-enabled switches (three ToR switches and one Aggregate switch) were used to build a two-layer network topology. We used 12 Ivy Bridge servers with the following specifications: 10-core processors with 2.9 GHz CPU, 16GB RAM, and a single partition 2TB hard disk. CentOS 6.2 is used for operating systems without vir-tualization. Each server runs one task tracker and each task tracker is configured to have eight map and reduce slots. One of the servers is configured to run the job tracker along with its task tracker. The same server is also configured as an SNMP client since the job tracker uses traffic informa-tion from switches to calculate the cost funcinforma-tion. The Fair Scheduler [13] is used in all experiments and simulations in order to eliminate network congestion caused by missed map task locality.

ToR Aggregate

1G

Rack 1 Rack 2 Rack 3

Figure 1: Cluster topology used in micro experiment We benchmarked with three types of jobs to observe the effect of LoNARS on data transfer time and total job time. The jobs are listed in Table 1 along with some character-istics. WordCount is an application to count the number of occurrences of each word in a given text. TeraSort sorts

(4)

Job Type Dataset Size Map Tasks Reduce Tasks Map Reduce Map Duration (sec) Duration (sec) Output Size

WordCount 1 182 MB 3 1 17.6 2.8 7.8 MB WordCount 2 940 MB 15 1 18.1 2.8 7.8 MB WordCount 3 3.7 GB 76 1 20.3 3.2 7.8 MB TeraSort 1 1 GB 16 5 3.5 4.5 63 MB TeraSort 2 4.8 GB 59 15 8.3 6.8 67 MB TeraSort 3 20 GB 298 30 9.2 15.7 68 MB Recommender 20 M 1 1 21.3 5.7 10.8 MB

Table 1: Jobs used in micro-benchmarking a given set of input data generated by TeraGen.

Recom-mender is a famous machine learning application used by many real life applications[8]. It is important to know how much time is spent on different phases of job execution time and the amount of data shuffled during the shuffle phase in order to evaluate gain/loss imposed by a reduce scheduling algorithm. Map and reduce execution times are calculated by taking the average of all map tasks for the given job. Re-duce task time does not cover shuffling time; it only covers sorting time and reduce function execution time. We wanted to see the effect of resource contention between tasks which are running on same task tracker by running the same job with multiple sizes.

10 20 30 40 50 60 70 80 90

WordCount1WordCount2WordCount3TeraSort1TeraSort2TeraSort3Recommender

Time (sec) Job Size Fifo Fifo-Sim LoNARS LoNARS-Sim

Figure 2: Total job completion time of jobs used in micro-benchmarking

Based on the results of micro-benchmarking on the 12-server cluster, we implemented a map reduce simulator which we later used to run our algorithm in larger scale networks. In order to make sure the simulator can generate close-to-real results, we ran the same benchmarking in Table 1 using the simulator (we used the same network topology in the simulator and took resource contention into consideration) and compared the results with the 12-server cluster’s results. Figure 2 and 3 show the results of benchmarking for FIFO and LoNARS algorithms when run on both cluster and sim-ulator. Shuffle times are calculated by dividing the total time a job spent on data transfer by the number of shuf-fle operations. We compared LoNARS with FIFO which is Hadoop’s default reduce scheduling method that schedules reduce tasks with on a first come first serve basis.

Although 12-server cluster size is not an optimal com-parison environment for reduce task scheduling algorithms, we still observed that LoNARS outperforms the FIFO al-gorithm in shuffle time for almost all job types. However,

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

WordCount1WordCount2WordCount3TeraSort1TeraSort2TeraSort3Recommender

Time (sec) Job Size Fifo Fifo-Sim LoNARS LoNARS-Sim

Figure 3: Transfer time per shuffle of jobs used in micro-benchmarking

overall job completion time might be affected adversely be-cause the LoNARS algorithm requires at least one heartbeat period waiting time before it chooses an optimal server. If the shuffle data size is very small and the network is not very congested, then the gain from the shuffling phase turns to be less than one heartbeat time, so overall job time in-creases slightly. On the other hand, in production clusters it is rarely possible to have no network congestion, so LoNARS will not perform worse than FIFO in most cases. Moreover, LoNARS also decreases the amount of traffic on the network by considering data locality. The benefit of reduced traffic size is explained in Section 4.2.3

The simulator is capable of generating total job comple-tion time results within 5% error rate for all jobs. For data shuffling times, the error rate is less than 5% for TeraSort. However, the error rate can reach up to 20% for WordCount and Recommender applications. This is because we use flow-based network simulation in the simulator, where it fails to achieve fine-grained results for short-lived flows. On the other hand, correlation of simulator and cluster results is 99.8% for total job times and 98.7% for shuffle times, which is enough to see that even though the error rate is larger for shuffle times, the simulator is able to match the trends of the cluster.

4.2 Macro-benchmarking

We used the simulator to compare the algorithms in larger scale networks. The network topology used in macro bench-marking is shown in Figure 4. There are three network lay-ers: ToR switches where each ToR switch is connected to 10 servers, Aggregate switches where each Aggregate switch is connected to five ToR switches, and a Core switch which connects two aggregate switches. In order to simulate

(5)

over-subscription in datacenters, link capacities are set to 1G, 6G, and 10G from servers to ToR switches, ToR switches to Aggregate switches, and Aggregate switches to Core switch respectively.

ToR

Aggregate

Rack 1 Rack 5 Rack 6 Rack 10

…

Core

Figure 4: Network topology used in macro-benchmarking

We compared LoNARS with FIFO, Rack-Aware, and CoGRS (Center of Gravity)[5]. The Rack-Aware method distributes reduce tasks across racks in order to achieve network load balancing between racks. It starts scheduling jobs from any task tracker in any rack, and waits for the task tracker from different racks so that reduce tasks can be distributed evenly among racks. Finally, CoGRS determines the optimal task tracker for each reduce task and tries to schedule the reduce task on its optimal task tracker. Since it is possible that multiple reduce tasks prefers the same task tracker, one of them will be scheduled on its optimal task tracker. Oth-ers will be scheduled to the task trackOth-ers which are located nearby the optimal task tracker in terms of network distance. However, we realized that instead of choosing a task tracker based on network distance when the optimal task tracker is not available, categorizing task trackers and choosing a task tracker which falls into same category with the optimal task tracker leads to higher data locality as well as less network traffic. Secondly, CoGRS does not take network congestion into consideration which might causes hotspots in the net-work. To minimize hotspots, LoNARS takes congestion into account in the cost function.

In the simulator we assumed that we can measure the number of flows on switches which is possible with Net-Flow [9]. This helps us to estimate network bandwidth for a given path more precisely since we can estimate how much bandwidth a new flow can obtain even if the link is fully uti-lized. SNMP, on the other hand, provides only the amount of traffic passing on switch ports and gives no indication about how many flows are sharing same port. We query NetFlow for traffic information every time we calculate the cost function of a task tracker.

We used a pool of four different jobs for benchmarking as shown in Table 2. In addition to the WordCount, Tera-Sort, and Recommender applications, we used the Permuta-tion applicaPermuta-tion in the simulator in order to see the effect of LoNARS on applications that generates heavy shuffle output in the map phase. Since we propose an algorithm to optimize the shuffling phase of MapReduce jobs, shuffle-heavy jobs will be a good benchmark to be able to see effect LoNARS. The job characteristics for the Permutation application are

Job Name Dataset Map Reduce # of Size Tasks Tasks Jobs WordCount 1 1.8GB 30 2 30 WordCount 2 3.6GB 60 10 10 WordCount 3 12.8GB 200 30 5 TeraSort 1 300MB 5 2 30 TeraSort 2 1.8GB 30 5 10 TeraSort 3 12.8GB 200 30 5 Recommender 1 60MB 1 1 10 Recommender 2 300MB 5 5 5 Recommender 3 640MB 10 10 2 Permutation 1 300MB 5 2 30 Permuation 2 640MB 10 3 5 Permutation 3 1.2GB 20 6 2

Table 2: Jobs used in macro-benchmarking

0 2 4 6 8 10 12 14 Local Rack Percentage (%) Locality Level Fifo RackAware CoGRS LoNARS

Figure 6: Shuffle locality ratios when map outputs shuffle uniformly

as follows: the average map task time is 90 seconds, the average reduce task time (excluding shuffle phase) is 22 sec-onds, and the approximate shuffle data size of a map task is 700MB per map task. We had total of 144 jobs and they are scheduled using exponential distribution with a mean of 14 seconds as in Facebook’s cluster [13].

4.2.1 Uniform Shuffle outputs

Firstly, we considered a case where map task output shuf-fles to reduce tasks uniformly. For example, if a map task generates 30MB of output to be shuffled to reduce tasks and there are six reduce tasks, then each reduce task will receive exactly 5MB.

Figures 5 (a), (b), (c), and (d) show total job time com-parison of each job type listed in Table 2 for four different reduce scheduling algorithms. Figures 5 (e), (f), (g), and (h) show average time for one shuffle operation of a given job. If there are 10 concurrent transfers running and it takes 10 seconds to complete all of them, then per shuffle time will be one second.

The WordCount and Recommender applications shuffle a small amount of data during the shuffle phase. Thus, re-gardless of improvements in reduce task scheduling, it does not considerably impact overall job time. The Terasort ap-plication shuffles more data, however still its average per shuffle time is less than 0.5 seconds. Also, as the number of map tasks increases, the impact of the reduce task schedul-ing algorithm decreases since map tasks will be distributed over the whole cluster almost evenly. Thus, we see slight improvement in TeraSort 1. The Permutation application, on the other hand, shuffles a significant amount of data and

(6)

25 26 27 28 29 30 31 1 2 3 Time (sec) Job Size

(a) WordCount Job Time

Fifo RackAware CoGRS LoNARS 83.5 84 84.5 85 85.5 86 86.5 87 87.5 88 88.5 1 2 3 Time (sec) Job Size

(b) KMeans Job Time

Fifo RackAware CoGRS LoNARS 12 14 16 18 20 22 24 26 28 1 2 3 Time (sec) Job Size

(c) TeraSort Job Time

(d) Permutation Job Time

Fifo RackAware CoGRS LoNARS 0 0.01 0.02 0.03 0.04 0.05 1 2 3 Time (sec) Job Size

(e) WordCount Shuffle Time

Fifo RackAware CoGRS LoNARS 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 1 2 3 Time (sec) Job Size

(f) KMeans Shuffle Time

Fifo RackAware CoGRS LoNARS 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 1 2 3 Time (sec) Job Size

(g) TeraSort Shuffle Time

Fifo RackAware CoGRS LoNARS 1 1.5 2 2.5 3 3.5 4 1 2 3 Time (sec) Job Size

(h) Permutation Shuffle Time

Fifo RackAware CoGRS LoNARS

Figure 5: Benchmarking results with uniform shuffle reduce task scheduling makes a difference more than it does

for other jobs. LoNARS performs 2-3% improvement in to-tal job time in shuffle heavy jobs such as the TeraSort shown in Figure 5 (d). In Figure 5 (h), it can be seen that per shuf-fle time for LoNARS performs with up to a 15% improve-ment compared to FIFO and a 7% improveimprove-ment compared to CoGRS. The basic reason LoNARS performs better than CoGRS is CoGRS’s inability to address choosing an opti-mal alternative task tracker when multiple reduce tasks pre-fer the same task tracker. With the help of categorization, the LoNARS algorithm successfully finds an alternative task tracker whose cost is close to the optimal task tracker’s cost. Figure 6 shows the ratio of local and rack-level shuffle operations among all shuffles. If a map and reduce tasks are on the same server, then the only operation is reading data from disk into memory for the reduce task—we call this case a ”local-level” shuffle. If map and reduce tasks are on the same rack, data transfer stays within rack—we call this a ”rack-level” shuffle. The rest of the shuffles are called ”off-rack” shuffles, and constitute around 90% of all shuffles in most cases. LoNARS approximately doubles the level of local-level shuffling compared to FIFO and RackAware, and improves local-level shuffling by 20% compared to CoGRS. LoNARS also increases the rack level shuffle ratio by 22% compared to FIFO, 19% compared to RackAware, and 17% compared to CoGRS. Increases in the level of local- and rack-level shuffling reduce the amount of traffic pushed to the network significantly.

4.2.2 Non-uniform Shuffle outputs

CoGRS performs best when map outputs are divided among reduce tasks non-uniformly. That is, the partitioning of map task output to each reduce task is different, and some reduce tasks receives more data on one map task. This leads reduce tasks to choose different task trackers as optimal task track-ers. FIFO and RackAware perform the same regardless of the distribution pattern of map outputs. We again used the benchmarking jobs listed in Table 2, and the results are given in Figure 7. CoGRS performs better than the FIFO and RackAware algorithms in this case, as expected. Thus, CoGRS increases locality and shortens shuffling time. LoNARS algorithm outperforms CoGRS slightly in this case. Although CoGRS can find different task trackers for most reduce tasks, it fails to find good alternative task trackers when many jobs are active and optimal task trackers do not have free reduce slots. Furthermore, it is important to be aware of network traffic in order to avoid experiencing network congestion and to utilize network capacity effec-tively. LoNARS successfully finds alternative task trackers and yields more network capacity which improves shuffle time. However, total job time is not significantly improved, since the reduce phase starts only after all shuffles are com-pleted and even one slow shuffle can cause a long total job time.

In Figure 8, we can see CoGRS achieves a better local-level shuffle ratio than it does in the uniform shuffling case. LoNARS again outperforms FIFO, RackAware, and CoGRS in local- and rack-level shuffle ratios. Its local-level shuffle

(7)

25 26 27 28 29 30 31 32 33 1 2 3 Time (sec) Job Size

(a) WordCount Job Time

(b) KMeans Job Time

(c) TeraSort Job Time

Fifo RackAware CoGRS LoNARS 134 136 138 140 142 144 146 148 1 2 3 Time (sec) Job Size

(d) Permutation Job Time

Fifo RackAware CoGRS LoNARS 0 0.01 0.02 0.03 0.04 0.05 0.06 1 2 3 Time (sec) Job Size

(e) WordCount Shuffle Time

Fifo RackAware CoGRS LoNARS 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 1 2 3 Time (sec) Job Size

(f) KMeans Shuffle Time

Fifo RackAware CoGRS LoNARS 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 1 2 3 Time (sec) Job Size

(g) TeraSort Shuffle Time

Fifo RackAware CoGRS LoNARS 1 1.5 2 2.5 3 3.5 4 1 2 3 Time (sec) Job Size

(h) Permutation Job Shuffle Time

Fifo RackAware CoGRS LoNARS

Figure 7: Benchmarking with non-uniform shuffle ratio is again double that of FIFO and RackAware and 10%

more than that of CoGRS. In terms of rack-level shuffle ra-tio, LoNARS outperforms FIFO by 20%, RackAware by 18% , and CoGRS by 13%. 0 2 4 6 8 10 12 14 Local Rack Percentage (%) Locality Level Fifo RackAware CoGRS LoNARS

Figure 8: Shuffle locality ratios when map outputs have non-uniform pattern

4.2.3 Network Traffic Comparison

Besides improvements in shuffle and total job times, LoNARS reduces network traffic considerably. Reduction in network traffic means less traffic on switches and reduced power con-sumption. Mahadevan, et al. [7] analyzed power consump-tion of network switches. 1G network switches consume

around 120-180 watts and 10G switches consume around 300 watts. Utilization of the switch contributes 5-15% of total switch power consumption. According to the author, although the ratio seems low, as green technology becomes popular, vendors will focus more on more energy efficient products. Hence, lowering network traffic has a significant impact on lowering datacenter networking costs.

Figure 9 (a) shows a traffic size comparison of the LoNARS and FIFO algorithms on the 12-server cluster shown in Fig-ure 1. We used the jobs listed in Table 1. The number of jobs of a given size is inversely proportional to job size. Jobs arrive with exponential distribution with a mean of 14 seconds. Traffic information is gathered via SNMP polling. Although traffic size values also include HDFS block copy-ing, since both methods have similar outputs, the differ-ence comes from using different reduce task scheduling algo-rithms. LoNARS reduced network traffic processed by ToR and Aggregate switches by around 15%.

In Figure 9 (b) and (c) we compare the traffic size of four algorithms for benchmarking, explained in Section 4.2. RackAware reduces network traffic slightly more than FIFO, and CoGRS outperforms both FIFO and RackAware by 8-10% in all network levels.

For the uniform map shuffling case, compared to the FIFO, RackAware, and CoGRS algorithms, LoNARS reduces net-work traffic processed by the core switch by 27%, 25%, and 20% respectively. Since higher capacity switches consume more power per megabit [7], increasing locality and reducing

(8)

0 10 20 30 40 50 60 70 80 90 100 Aggregate ToR Traffic Size (GB) Layer

(a) 12-Server Mix Jobs

Fifo LoNARS 100 200 300 400 500 600 700

Core Aggregate ToR

Traffic Size (GB) Layer (b) Uniform Shuffle Fifo RackAware CoGRS LoNARS 100 200 300 400 500 600 700

Core Aggregate ToR

Traffic Size (GB) Layer (c) Non-uniform Shuffle Fifo RackAware CoGRS LoNARS

Figure 9: Size of traffic at different layers of the network network traffic on upper layers is more important than lower

layers. For the aggregate layer, compared to FIFO, Rack-Aware, and CoGRS algorithms, LoNARS cuts the amount of traffic pushed to aggregate switches by 20.9%, 19.3%, and 13% respectively. Finally, LoNARS lowers traffic size pro-cessed by ToR switches by 13%, 11% and 6% respectively compared to FIFO, RackAware, and CoGRS.

In the non-uniform map shuffling case, CoGRS increases its locality ratio, which reduces network traffic. LoNARS again is able to reduce network traffic size in almost the same ratio compared to FIFO and RackAware for all layers. However, compared to CoGRS, LoNARS reduces traffic size through core, aggregate, and ToR switches by 8%, 5% and 2% respectively.

5. CONCLUSIONS AND FUTURE WORK

In this paper, we propose the LoNARS algorithm for re-duce task scheduling in MapRere-duce. We conducted micro-benchmarking to see the performance of LoNARS using a 12-server cluster and showed that LoNARS outperforms the native Hadoop reduce task scheduling algorithm. We also simulated macro-benchmarking using a 100-server cluster and compared LoNARS to FIFO, RackAware, and CoGRS algorithms. The results showed that LoNARS always re-duces network traffic more than these three algorithms by up to 25% which makes significant impact in power con-sumption of network switches. LoNARS also reduces shuffle time in almost all cases and the reduction ratio depends on how much traffic a job shuffle in the shuffle phase. Regard-ing total job completion time, LoNARS makes improvement if shuffle transfer time is more than one heartbeat time yet does not always perform worse in other cases.

As future work, we plan to explore how to tune heartbeat time based on cluster size and processing power of the job tracker to minimize overhead of LoNARS. We also plan to test LoNARS on different datacenter topologies such as Fat-Tree to see its performance in such network topologies.

6. ACKNOWLEDGMENTS

This project is in part sponsored by NSF under award number CNS-1131889 (CAREER).

7. REFERENCES

[1] Dean, J., and Ghemawat, S. Mapreduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6 (Berkeley, CA, USA, 2004), OSDI’04, USENIX Association, pp. 10–10.

[2] Documentation, H. .

https://hadoop.apache.org/docs/r1.2.1/. [3] Hadoop. http://hadoop.apache.org/.

[4] Hammoud, M., Rehman, M. S., and Sakr, M. F. Center-of-gravity reduce task scheduling to lower mapreduce network traffic. In IEEE CLOUD (2012), pp. 49–58.

[5] Hammoud, M., Rehman, M. S., and Sakr, M. F. Center-of-gravity reduce task scheduling to lower mapreduce network traffic. In Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing (Washington, DC, USA, 2012), CLOUD ’12, IEEE Computer Society, pp. 49–58.

[6] Hammoud, M., and Sakr, M. F. Locality-aware reduce task scheduling for mapreduce. In Proceedings of the 2011 IEEE Third International Conference on Cloud Computing Technology and Science (Washington, DC, USA, 2011), CLOUDCOM ’11, IEEE Computer Society, pp. 570–576. [7] Mahadevan, P., Sharma, P., Banerjee, S., and

Ranganathan, P. A power benchmarking framework for network devices. In Proceedings of the 8th International IFIP-TC 6 Networking Conference (Berlin, Heidelberg, 2009), NETWORKING ’09, Springer-Verlag, pp. 795–808. [8] Mahout, A. http://mahout.apache.org/.

[9] NetFlow. http://en.wikipedia.org/wiki/NetFlow. [10] Palanisamy, B., Singh, A., Liu, L., and Jain, B.

Purlieus: locality-aware resource allocation for mapreduce in a cloud. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (New York, NY, USA, 2011), SC ’11, ACM, pp. 58:1–58:11.

[11] Park, J., Lee, D., Kim, B., Huh, J., and Maeng, S. Locality-aware dynamic vm reconfiguration on mapreduce clouds. In Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing (New York, NY, USA, 2012), HPDC ’12, ACM, pp. 27–36. [12] Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors,

J., Manzanares, A., and Qin, X. Improving mapreduce performance through data placement in heterogeneous hadoop clusters. In Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on (2010), pp. 1–9. [13] Zaharia, M., Borthakur, D., Sen Sarma, J.,

Elmeleegy, K., Shenker, S., and Stoica, I. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems (New York, NY, USA, 2010), EuroSys ’10, ACM, pp. 265–278. [14] Zaharia, M., Borthakur, D., Sen Sarma, J.,

Elmeleegy, K., Shenker, S., and Stoica, I. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European conference on Computer systems (New York, NY, USA, 2010), EuroSys ’10, ACM, pp. 265–278.

[15] Zhang, X., Feng, Y., Feng, S., Fan, J., and Ming, Z. An effective data locality aware task scheduling method for mapreduce framework in heterogeneous environments. In Proceedings of the 2011 International Conference on Cloud and Service Computing (Washington, DC, USA, 2011), CSC ’11, IEEE Computer Society, pp. 235–242.

Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications