INFO5011
Cloud Computing Semester 2, 2011
Lecture 11, Cloud Scheduling
The presentation is based on:
Quincy: Fair Scheduling for Distributed Computing Clusters. Michael
Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg, SOSP'09
Improving MapReduce Performance in Heterogeneous Environments.
Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica. OSDI'08
Some content and diagrams are from the papers and the authors’ presentations
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act.
Outline
›
Motivation
- Default FIFO scheduling in Hadoop and its problems
›
Quincy scheduling for distributed computing clusters
- Proposed by Microsoft Research
- Built on a Dryad cluster
- Homogeneous assumption
- Focuses on determining which node runs which job
›
Scheduling in heterogeneous environments
- Proposed by the UC Berkeley RAD Lab
- Heterogeneous assumption
- Focuses on optimizing speculative execution
Motivation
INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 3
›
Big clusters are used for jobs of varying sizes and durations
›
Data from a production search cluster at Microsoft
Hadoop scheduling
Diagram from the CACM version of the original MapReduce paper
› The master knows beforehand the number of mappers and reducers for each job.
› It also knows the number of available task slots on each worker node.
› It is responsible for assigning tasks to nodes with vacant slots.
› The default and simplest scheduler uses a FIFO queue.
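The FIFO behaviour described above can be sketched as follows. This is a minimal illustration, not Hadoop's actual scheduler code: the function name fifo_assign, the job names, and the slot counts are all assumptions made for the example.

```python
from collections import deque


def fifo_assign(jobs, free_slots):
    """Assign vacant task slots to jobs strictly in submission order.

    jobs: list of (job_id, pending_task_count), oldest job first.
    Returns the job ids that received a slot, one entry per slot.
    """
    queue = deque(jobs)
    assigned = []
    while queue and free_slots > 0:
        job_id, pending = queue[0]
        take = min(pending, free_slots)      # the head job takes all it can
        assigned.extend([job_id] * take)
        free_slots -= take
        if take == pending:
            queue.popleft()                  # head job fully placed
        else:
            queue[0] = (job_id, pending - take)
    return assigned


# Hypothetical cluster with 8 vacant slots: a 10-task job arrives first.
slots = fifo_assign([("big_job", 10), ("small_job", 2)], free_slots=8)
print(slots.count("big_job"), slots.count("small_job"))  # prints: 8 0
```

In this sketch the small job receives no slot until the job ahead of it has placed all of its tasks, which is the behaviour the following slides demonstrate on a real cluster.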
The problem with the simple FIFO scheduler
maximum number of tasks
Hadoop running a single job
hadoop jar wordcount.jar info5011wordcounter.WordCount assn_input/n07.txt countout
One job: input = 7 blocks
Hadoop map task list for
job_201108241404_1213
Hadoop running 2 jobs concurrently
hadoop jar wordcount.jar info5011wordcounter.WordCount assn_input/n00p2.txt countoutn00
hadoop jar wordcount.jar info5011wordcounter.WordCount user/zhouy/assn_input/n07.txt countoutn07
Similar job as the first one except with more reducers
Job summaries
first job finishes
Hadoop FIFO queue status
The second, smaller job waits in queue
Hadoop map task list for
job_201108241404_1214
INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 11
gpu1 gpu1 dm0 gpu0 gpu0 dm0 dm1 dm2 dm2
mapper 0 and mapper 9
Outline
›
Motivation
- Default FIFO scheduling in Hadoop and its problems
›
Quincy scheduling for distributed computing clusters
- Proposed by Microsoft Research
- Built on a Dryad cluster
- Homogeneous assumption
- Focuses on determining which node runs which job
›
Scheduling in heterogeneous environments
- Proposed by the UC Berkeley RAD Lab
- Heterogeneous assumption
- Focuses on optimizing speculative execution
Fair scheduling
›
Job X takes t seconds when it runs exclusively on a cluster.
›
X should take no more than Jt seconds when the cluster has J concurrent jobs.
›
Formally, for N computers and J jobs, each job should get at least N/J computers.
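As a small worked example of the two bounds above, the following sketch computes the fair share N/J and the worst-case runtime Jt. The function names and the input numbers are illustrative assumptions.

```python
def fair_share(num_computers, num_jobs):
    """Minimum number of computers each job should get: N / J."""
    return num_computers / num_jobs


def worst_case_runtime(exclusive_seconds, num_jobs):
    """Upper bound J * t on runtime when J jobs share the cluster."""
    return num_jobs * exclusive_seconds


# Hypothetical numbers: 240 computers shared by 30 concurrent jobs,
# and a job that needs 60 seconds when it has the cluster to itself.
print(fair_share(240, 30))         # prints: 8.0
print(worst_case_runtime(60, 30))  # prints: 1800
```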
Data locality constraints
›
Other scheduling approaches:
- HPC jobs fetch data from a SAN, so there is no need to co-locate data and computation.
›
Data-intensive workloads (MapReduce/Hadoop/Dryad)
- Have storage attached to computers.
- Scheduling tasks near their data improves performance.
Fairness vs. data locality
›
The requirements of fairness and locality often conflict
- A strategy that achieves optimal data locality will typically delay a job until its ideal resources are available
- Fairness benefits from allocating the best available resources to a job as soon as possible after they are requested
›
An important feature of data-intensive workloads (MapReduce/Hadoop/Dryad)
- While running, tasks are independent of each other, so killing one task will not impact another
- In contrast, MPI jobs are made of sets of stateful processes tightly coupled by communicating with each other across the network.
Cluster architecture
Queue-based scheduling
›
Baseline algorithm (Greedy)
- Equivalent to FIFO scheduling in Hadoop
›
Simple greedy fairness (GF)
- Based on Hadoop’s Fair Scheduler
- Job j gets a baseline allocation A*_j = min(⌊M/K⌋, N_j), where M is the number of computers, K is the number of concurrent jobs, and N_j is the total number of running and waiting tasks for job j.
- If Σ_j A*_j < M, the remaining slots are divided equally among jobs that have additional ready workers, so that the final allocation A_j satisfies Σ_j A_j = min(M, Σ_j N_j).
- The scheduler blocks job j whenever it is running A_j tasks or more. It only assigns tasks of an unblocked job to a newly available computer.
›
Fairness with preemption (GFP)
- When a job j is running more than A_j tasks, the scheduler kills its over-quota tasks, starting with the most recently scheduled.
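The allocation and preemption rules above can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: spare slots are handed out one at a time in sorted job-id order (the slides only say "divided equally"), and the function names and data layout are hypothetical.

```python
def greedy_fair_allocation(num_computers, tasks_per_job):
    """Return the final allocation A_j for each job.

    num_computers: M. tasks_per_job: dict job_id -> N_j.
    """
    if not tasks_per_job:
        return {}
    base = num_computers // len(tasks_per_job)            # floor(M / K)
    alloc = {j: min(base, n) for j, n in tasks_per_job.items()}
    spare = num_computers - sum(alloc.values())
    while spare > 0:
        # Jobs that still have tasks beyond their current allocation.
        hungry = [j for j in sorted(alloc) if alloc[j] < tasks_per_job[j]]
        if not hungry:
            break
        for j in hungry:                                  # assumed round-robin
            if spare == 0:
                break
            alloc[j] += 1
            spare -= 1
    return alloc


def tasks_to_preempt(running, quota):
    """GFP rule: pick over-quota tasks, most recently scheduled first.

    running: list of (task_id, start_time) for one job.
    """
    excess = len(running) - quota
    if excess <= 0:
        return []
    newest_first = sorted(running, key=lambda t: t[1], reverse=True)
    return [task_id for task_id, _ in newest_first[:excess]]


print(greedy_fair_allocation(10, {"a": 2, "b": 20, "c": 20}))
# prints: {'a': 2, 'b': 4, 'c': 4}
```

In the example, job a needs only 2 of its baseline 3 slots, so the 2 spare slots go to b and c, and the allocations sum to M = 10.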
Simple Greedy Fairness
Sticky slots problem
›
Consider a steady state in which each job occupies exactly its allocated quota of computers.
- Whenever a task from job j completes on computer m, another task from j is assigned to m
- m sticks to j indefinitely, whether or not j has any waiting tasks with good data locality on m.
›
Simple solution
- Do not unblock job j as soon as a task finishes
- Wait until the number of j’s running tasks falls below A_j - M_H, where M_H is a hysteresis margin, or until Δ_H seconds have passed.
- In many cases, this delay is sufficient to allow another job’s worker with better locality to “steal” computer m.
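The hysteresis rule above reduces to a single predicate. The sketch below is illustrative only; the function and parameter names are assumptions, with margin standing for M_H and max_wait_seconds for Δ_H.

```python
def should_unblock(running_tasks, allocation, margin,
                   seconds_blocked, max_wait_seconds):
    """Decide whether a blocked job may receive new computers again.

    The job stays blocked until it has dropped below A_j - M_H running
    tasks, or until it has been blocked for at least Delta_H seconds.
    """
    below_margin = running_tasks < allocation - margin
    waited_long_enough = seconds_blocked >= max_wait_seconds
    return below_margin or waited_long_enough


# Hypothetical job with A_j = 10, M_H = 2, Delta_H = 5 seconds.
print(should_unblock(9, 10, 2, 1.0, 5.0))  # prints: False
print(should_unblock(7, 10, 2, 0.0, 5.0))  # prints: True
print(should_unblock(9, 10, 2, 5.0, 5.0))  # prints: True
```

Without the margin, the first call would unblock the job immediately after one task finished, and the freed computer would go straight back to the same job.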
Sticky slots illustrated (i)
INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 21
Diagrams and example in slides 21-27 are from the authors’ original presentation
Sticky slots illustrated (ii)
Sticky slots illustrated (iii)
Sticky slots illustrated (iv)
Sticky slots illustrated (v)
Sticky slots illustrated (vi)
Sticky slots illustrated (vii)
Quincy: flow-based scheduler
›
Main idea
- Matching = scheduling
- Each task is either scheduled on a computer c or unscheduled
- A cost can be assigned to any matching
- Fairness constrains the number of tasks that are scheduled
- The goal is to minimize the matching cost while obeying fairness constraints
- This is a min-cost network flow problem
- Instead of making local (greedy) decisions, solve it globally.
Graph construction (i)
›
Start with a directed graph representation of the cluster
architecture.
Figure labels: cluster aggregator, rack aggregator, individual computers
Graph construction (ii)
Job 1 with 6 tasks and a root task. Each receives one unit of flow as its supply.
Unscheduled node for job 1
• Each task has an edge to Uj.
• There is a single edge from Uj to the sink.
• High cost on edges from tasks to Uj.
Graph construction (iii)
Add edges from tasks (T) to computers (C) if computer C has some data for task T.
The cost is a function of the amount of data that would be transferred across rack and core switches
Graph construction (iv)
Add edges from tasks (T) to rack (R), if R has some data for task T.
The cost on the edge is set to the worst case cost that would result if the task were run on the least favorable computer in R
Graph construction (v)
Add edges from all tasks (T) to cluster (X)
The cost on the edge is set to the worst case cost for running the task on any computer in the cluster
Graph construction (vi)
Zero-cost edge from the root task to its computer, to avoid preempting the root task.
Constrains how many tasks can run on each computer
A Feasible Matching
Unscheduled job
Final graph
Fairness constraint: setting it to 4 means at least 2 tasks from job 1 need to go through computers
Fairness constraint: setting it to 2 means at least 2 tasks from job 2 need to go through computers
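To make the graph construction concrete, the sketch below builds a much smaller graph in the same spirit and solves it as a min-cost flow. It is a simplified illustration under several assumptions: there are no rack aggregators or root tasks, a super source S feeds one unit to each task, all edge costs are made-up numbers, and the solver is a generic successive-shortest-path routine with Bellman-Ford, not the solver used in the paper.

```python
from collections import defaultdict


class MinCostFlow:
    """Successive-shortest-path min-cost flow for small graphs."""

    def __init__(self):
        self.adj = defaultdict(list)  # node -> indices into self.edges
        self.edges = []               # [head, residual capacity, cost]

    def add_edge(self, u, v, cap, cost):
        """Add edge u->v and its residual twin; return the forward index."""
        self.adj[u].append(len(self.edges))
        self.edges.append([v, cap, cost])
        self.adj[v].append(len(self.edges))
        self.edges.append([u, 0, -cost])
        return len(self.edges) - 2

    def flow_on(self, idx):
        """Flow on a forward edge equals the capacity of its twin."""
        return self.edges[idx ^ 1][1]

    def solve(self, source, sink):
        total_flow = total_cost = 0
        while True:
            dist, prev = {source: 0}, {}
            changed = True
            while changed:  # Bellman-Ford relaxation over residual edges
                changed = False
                for u in list(dist):
                    for idx in self.adj[u]:
                        v, cap, cost = self.edges[idx]
                        if cap > 0 and dist[u] + cost < dist.get(v, float("inf")):
                            dist[v], prev[v] = dist[u] + cost, idx
                            changed = True
            if sink not in dist:
                return total_flow, total_cost
            push, v = float("inf"), sink
            while v != source:  # bottleneck capacity along the path
                push = min(push, self.edges[prev[v]][1])
                v = self.edges[prev[v] ^ 1][0]
            v = sink
            while v != source:  # augment along the path
                idx = prev[v]
                self.edges[idx][1] -= push
                self.edges[idx ^ 1][1] += push
                v = self.edges[idx ^ 1][0]
            total_flow += push
            total_cost += push * dist[sink]


g = MinCostFlow()
tasks = ["t1", "t2", "t3"]            # one job with three tasks
for t in tasks:
    g.add_edge("S", t, 1, 0)          # one unit of supply per task
to_unscheduled = {t: g.add_edge(t, "U1", 1, 20) for t in tasks}  # high cost
for t in tasks:
    g.add_edge(t, "X", 1, 10)         # worst-case cost via cluster aggregator
g.add_edge("t1", "c1", 1, 1)          # preferred computers (data is local)
g.add_edge("t2", "c1", 1, 2)
g.add_edge("t3", "c2", 1, 1)
for c in ("c1", "c2"):
    g.add_edge("X", c, 1, 0)
    g.add_edge(c, "T", 1, 0)          # one task slot per computer
g.add_edge("U1", "T", 2, 0)           # fairness: at most 2 tasks unscheduled

flow, cost = g.solve("S", "T")
unscheduled = [t for t in tasks if g.flow_on(to_unscheduled[t]) > 0]
print(flow, cost, unscheduled)        # prints: 3 22 ['t2']
```

With two single-slot computers, the cheapest global solution runs t1 on c1 and t3 on c2 and leaves t2 unscheduled. The capacity on the edge from U1 to the sink plays the role of the fairness constraint shown in the final graph.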
Some experiment results
›
Workload:
›
Typical Dryad jobs (Sort, Join, PageRank, WordCount,
Prime).
›
In total, 30 jobs with a mix of CPU, disk, and network
intensive tasks
›
Prime used as a worst-case job that hogs the cluster if started
first.
›
240 computers in cluster. 8 racks, 29-31 computers per rack.
›
More than one metric used for evaluation.
Results (i)
Results (ii)
Results (iii)
Results (iv)
Discussion point
›
Solver overhead
- The observed average overhead in this 240-machine cluster is 7.64 ms, with a maximum of 57.59 ms
- The simulated average overhead for 2500 computers running 100 concurrent jobs is a little over a second per solution
- This seems acceptable, but the min-cost flow is recomputed from scratch each time a change occurs
›
Applicable in other scheduling environments?
- The easy mapping of the scheduling problem to a min-cost flow is due to:
- Tasks being relatively independent of each other, with no correlation constraints
- A one-dimensional capacity setting
Outline
›
Motivation
- Default FIFO scheduling in Hadoop and its problems
›
Quincy scheduling for distributed computing clusters
- Proposed by Microsoft Research
- Built on a Dryad cluster
- Homogeneous assumption
- Focuses on determining which node runs which job
›
Scheduling in heterogeneous environments
- Proposed by the UC Berkeley RAD Lab
- Heterogeneous assumption
- Focuses on optimizing speculative execution
Hadoop’s straggler handling mechanism
›
Speculative execution
- A node that is available but performing poorly is called a straggler
- MapReduce has a built-in mechanism to run a speculative copy of the straggler’s task on another machine to finish the computation faster.
›
This paper tries to improve the performance of speculative execution by
- Defining a new scheduling metric.
- Choosing the right machines to run speculative tasks.
- Capping the number of speculative executions.
Progress score
› Hadoop monitors task progress using a progress score to select speculative tasks
- A map task’s progress score is the fraction of input data read
- A reduce task’s execution is divided into three phases (copy, sort, reduce), each of which accounts for 1/3 of the score. In each phase, the score is the fraction of data processed
› When a task’s progress score is less than the average for its category minus 0.2 and the task has run for at least one minute, it is marked as a straggler
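The progress score and the straggler rule can be written down directly. The sketch below is a minimal illustration and not Hadoop code; the function names and the example values are assumptions.

```python
def map_progress(bytes_read, total_bytes):
    """Map task: fraction of input data read."""
    return bytes_read / total_bytes


def reduce_progress(phases_completed, fraction_of_current_phase):
    """Reduce task: copy, sort and reduce each contribute 1/3."""
    return (phases_completed + fraction_of_current_phase) / 3


def is_straggler(score, category_average, seconds_running):
    """Threshold rule: 0.2 below the category average after one minute."""
    return score < category_average - 0.2 and seconds_running >= 60


# A reduce task that finished copying and is halfway through sorting.
print(reduce_progress(1, 0.5))       # prints: 0.5
print(is_straggler(0.3, 0.6, 120))   # prints: True
print(is_straggler(0.3, 0.6, 30))    # prints: False
```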
Hadoop’s assumptions
›
Nodes can perform work at exactly the same rate
›
Tasks progress at a constant rate throughout time
›
There is no cost to launching a speculative task on an idle node
›
The three phases of execution take approximately the same time
›
Tasks with a low progress score are stragglers
›
Tasks in the same category (map or reduce) require roughly the same amount of work
Breaking down the assumptions
›
The first two assumptions are about homogeneity. However:
- In a non-virtualized data center, there may be multiple generations of hardware
- In a virtualized data center, multiple VMs are co-located on the same physical host.
Diagrams from Ang Li, Xiaowei Yang, Srikanth Kandula, and Ming Zhang, CloudCmp: comparing public cloud providers. In Proceedings of the 10th annual conference on Internet measurement (IMC '10)
Heterogeneity is observed by other researchers
Jorg Schad, Jens Dittrich, and Jorge-Arnulfo Quiane-Ruiz. Runtime measurements in the cloud: observing, analyzing, and
reducing variance. In Proceedings of the 36th International Conference on Very Large Data Bases(VLDB’10), September
Other assumptions
›
Assumption 3, that speculative tasks cost nothing, breaks down when resources are shared
- The network is a bottleneck, and speculative tasks may compete for disk I/O
›
Assumption 4, that a task’s progress score is approximately equal to its percent completion, does not hold, especially for reduce tasks
- The copy phase usually accounts for more than 1/3 of the task execution time
›
Assumption 5, that progress score is a good proxy for progress rate because tasks begin at roughly the same time, can also be wrong
- The number of mappers depends on the number of blocks, which might be much larger than the number of available slots. The mappers tend to run in waves (see slide 11).
LATE scheduler
›
Longest Approximate Time to End
›
Design principle
- Always speculatively execute the task that “we” think will finish farthest into the future
›
Different methods can be used to estimate time left
›
The paper proposes a simple heuristic based on progress rate
- ProgressRate = ProgressScore / T, where T is the amount of time the task has been running
- Estimated time to completion = (1 - ProgressScore) / ProgressRate
- It assumes that tasks make progress at a roughly constant rate (there are exceptions to this assumption)
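The heuristic above can be sketched in a few lines. The function name and the handling of a task with zero progress (treated as infinitely far from finishing) are assumptions made for this example.

```python
def estimated_time_left(progress_score, elapsed_seconds):
    """LATE heuristic: (1 - ProgressScore) / ProgressRate."""
    progress_rate = progress_score / elapsed_seconds
    if progress_rate == 0:
        return float("inf")   # assumption: no progress yet, no estimate
    return (1 - progress_score) / progress_rate


# A task that is 25% done after 30 seconds needs about 90 more seconds.
print(round(estimated_time_left(0.25, 30)))  # prints: 90
```

A task that is 90% done after the same 30 seconds would be estimated at only a few seconds left, so LATE would rank it far below the first task as a candidate for speculation.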
LATE parameters
- SlowNodeThreshold: used to select fast nodes on which to launch speculative tasks. Set to the 25th percentile of node progress
- SpeculativeCap: used to control the number of speculative tasks that can be running at once. Set to 10% of available task slots
- SlowTaskThreshold: used to select tasks for a speculative copy. Set to the 25th percentile of task progress
- LATE currently does not consider data locality
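One way the three parameters could combine when a node asks for work is sketched below. This is an illustration under assumptions rather than the actual Hadoop patch: the RunningTask class, the simple percentile helper, and all example values are hypothetical, and SlowTaskThreshold is applied to task progress rates.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    name: str
    progress_score: float        # in [0, 1]
    elapsed_seconds: float
    has_speculative_copy: bool = False

    @property
    def progress_rate(self):
        return self.progress_score / self.elapsed_seconds

    @property
    def time_left(self):
        rate = self.progress_rate
        return float("inf") if rate == 0 else (1 - self.progress_score) / rate


def percentile(values, pct):
    """Simple nearest-rank style percentile (an assumption of this sketch)."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(len(ordered) * pct / 100))]


def pick_speculative_task(node_progress, all_node_progress, tasks,
                          speculative_running, total_slots,
                          speculative_cap=0.10, slow_node_pct=25,
                          slow_task_pct=25):
    """Return the task to speculate on this node, or None."""
    if not tasks:
        return None
    if speculative_running >= speculative_cap * total_slots:
        return None                                   # SpeculativeCap reached
    if node_progress < percentile(all_node_progress, slow_node_pct):
        return None                                   # requesting node is slow
    slow_rate = percentile([t.progress_rate for t in tasks], slow_task_pct)
    candidates = [t for t in tasks
                  if not t.has_speculative_copy and t.progress_rate < slow_rate]
    # Among slow tasks, pick the one expected to finish farthest in the future.
    return max(candidates, key=lambda t: t.time_left, default=None)


tasks = [RunningTask("A", 0.9, 100), RunningTask("B", 0.2, 100),
         RunningTask("C", 0.5, 100), RunningTask("D", 0.8, 100)]
choice = pick_speculative_task(6, [1, 5, 6, 8], tasks,
                               speculative_running=0, total_slots=20)
print(choice.name)  # prints: B
```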
Estimating finishing time
• The current ProgressRate computation assumes constant progress, which might not be true
• If a task’s execution slows down in a later phase, the ProgressRate might suggest the wrong task to speculate
• If it speeds up, the final prediction is not affected
Mapper tasks progress at a constant rate most of the time
Evaluation
›
Environment
- Amazon EC2 (200-250 nodes)
- Small Local Testbed (9 nodes)
›
Measuring Heterogeneity on EC2
Scheduling experiments
›
Heterogeneity setup
- Assigning a varying number of VMs to each physical node
- Create stragglers by running CPU- and I/O-intensive processes on the same VM
EC2 Sort with Heterogeneity
Each host sorted 128MB with a total of 30GB data
Each job has 486 map tasks and 437 reduce tasks
EC2 Sort with Stragglers