INFO5011. Cloud Computing Semester 2, 2011 Lecture 11, Cloud Scheduling

(1)


The presentation is based on:

Quincy: Fair Scheduling for Distributed Computing Clusters. Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. SOSP’09

Improving MapReduce Performance in Heterogeneous Environments. Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. OSDI 2008

Some content and diagrams are from the papers and the authors’ presentations

COMMONWEALTH OF AUSTRALIA

Copyright Regulations 1969

WARNING

This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act.

(2)

Outline

› Motivation

- Default FIFO scheduling in Hadoop and its problems

› Quincy scheduling for distributed computing clusters

- Proposed by Microsoft Research

- Built on a Dryad cluster

- Homogeneous assumption

- Focuses on determining which node runs which job

› Scheduling in heterogeneous environments

- Proposed by the U.C. Berkeley RAD Lab

- Heterogeneous assumption

- Focuses on optimizing speculative execution

(3)

Motivation

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 3

› Big clusters are used for jobs of varying sizes and durations

› Data from a production search cluster used at Microsoft

(4)

The Hadoop scheduling

Diagram from the CACM version of the original MapReduce paper

› The master knows beforehand the number of mappers and the number of reducers for each job.

› It also knows the number of available task slots on each worker node

› It is responsible for assigning tasks to nodes with vacant slots

› The default and simplest scheduler uses a FIFO queue
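The FIFO behaviour described above can be sketched as follows. This is a minimal illustration, not Hadoop's actual code: the function name `fifo_assign`, the job names, and the slot layout are all assumptions made for the example.

```python
from collections import deque


def fifo_assign(jobs, free_slots):
    """Assign pending tasks to vacant slots in strict job-arrival order.

    jobs: list of (job_id, pending_task_count), oldest job first.
    free_slots: list of node names, one entry per vacant task slot.
    Returns a list of (node, job_id) assignments.
    """
    queue = deque([job_id, pending] for job_id, pending in jobs)
    assignments = []
    for node in free_slots:
        # Skip jobs at the head of the queue that have nothing left to run.
        while queue and queue[0][1] == 0:
            queue.popleft()
        if not queue:
            break
        head = queue[0]
        head[1] -= 1
        assignments.append((node, head[0]))
    return assignments


# Hypothetical cluster: five nodes with two task slots each.
jobs = [("big_job", 9), ("small_job", 2)]
slots = ["dm0", "dm1", "dm2", "gpu0", "gpu1"] * 2
plan = fifo_assign(jobs, slots)
```

With these inputs the first nine slots all go to `big_job`, and `small_job` only receives the last one. This is the queueing delay that the following slides illustrate.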

(5)

The problem with the simple FIFO scheduler

maximum number of tasks

(6)

Hadoop running a single job

hadoop jar wordcount.jar info5011wordcounter.WordCount assn_input/n07.txt countout

= 7 blocks One job:

(7)

Hadoop map task list for

job_201108241404_1213

(8)

Hadoop running 2 jobs concurrently

hadoop jar wordcount.jar info5011wordcounter.WordCount assn_input/n00p2.txt countoutn00

hadoop jar wordcount.jar info5011wordcounter.WordCount user/zhouy/assn_input/n07.txt countoutn07

Similar job as the first one except with more reducers

(9)

Job summaries

first job finishes

(10)

Hadoop FIFO queue status

The second, smaller job waits in queue

(11)

Hadoop map task list for

job_201108241404_1214

gpu1 gpu1 dm0 gpu0 gpu0 dm0 dm1 dm2 dm2

(12)

mapper 0 and mapper 9

(13)

Outline

› Motivation

› Quincy scheduling for distributed computing clusters

› Scheduling in heterogeneous environments

(14)

Fair scheduling

› Job X takes t seconds when it runs exclusively on a cluster.

› X should take no more than Jt seconds when the cluster has J concurrent jobs.

› Formally, for N computers and J jobs, each job should get at least N/J computers.

(15)

Data locality constraints

› Other scheduling approaches:

- HPC jobs fetch data from a SAN, so there is no need to co-locate data and computation.

› Data-intensive workloads (MapReduce/Hadoop/Dryad)

- Have storage attached to computers.

- Scheduling tasks near data improves performance.


(16)

Fairness vs. data locality

› The requirements of fairness and locality often conflict

- A strategy that achieves optimal data locality will typically delay a job until its ideal resources are available

- Fairness benefits from allocating the best available resources to a job as soon as possible after they are requested

› An important feature of data-intensive workloads (MapReduce/Hadoop/Dryad)

- While running, tasks are independent of each other, so killing one task will not impact another

- In contrast, MPI jobs are made of sets of stateful processes tightly coupled by communicating with each other across the network.

(17)

Cluster architecture

(18)

Queue based scheduling

› Baseline algorithm (Greedy)

- Equivalent to FIFO scheduling in Hadoop

› Simple greedy fairness (GF)

- Based on Hadoop’s Fair Scheduler

- Job j gets a baseline allocation A*_j = min(⌊M/K⌋, N_j), where M is the number of computers, K is the number of concurrent jobs, and N_j is the total number of running and waiting tasks for job j.

- If Σ_j A*_j < M, the remaining slots are divided equally among jobs that have additional ready workers, so that the final allocation A_j satisfies Σ_j A_j = min(M, Σ_j N_j).

- The scheduler blocks job j whenever it is running A_j tasks or more. It only assigns tasks of an unblocked job to a newly available computer.

› Fairness with preemption (GFP)

- When a job j is running more than A_j tasks, the scheduler kills its over-quota tasks, starting with the most recently scheduled.
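The allocation and preemption rules above can be sketched in a few lines. This is an illustrative reading of the slide, not code from the Quincy paper: the function names are made up, and handing out spare slots one at a time in job order is an assumed way of "dividing equally".

```python
def greedy_fair_allocation(num_computers, tasks_per_job):
    """Return {job: A_j} given M computers and {job: N_j}."""
    k = len(tasks_per_job)
    if k == 0:
        return {}
    base = num_computers // k  # floor(M / K)
    # Baseline allocation A*_j = min(floor(M/K), N_j).
    alloc = {job: min(base, n) for job, n in tasks_per_job.items()}
    spare = num_computers - sum(alloc.values())
    # Assumption: spare slots are handed out round-robin, one at a time,
    # to jobs that still have ready tasks beyond their current allocation.
    while spare > 0:
        hungry = [job for job, n in tasks_per_job.items() if alloc[job] < n]
        if not hungry:
            break
        for job in hungry:
            if spare == 0:
                break
            alloc[job] += 1
            spare -= 1
    return alloc


def preemption_victims(running_tasks, quota):
    """GFP: pick over-quota tasks to kill, most recently scheduled first.

    running_tasks: list of (task_id, start_time) for one job.
    """
    excess = len(running_tasks) - quota
    if excess <= 0:
        return []
    newest_first = sorted(running_tasks, key=lambda t: t[1], reverse=True)
    return [task_id for task_id, _ in newest_first[:excess]]


alloc = greedy_fair_allocation(10, {"a": 2, "b": 20, "c": 5})
victims = preemption_victims([("t1", 0), ("t2", 5), ("t3", 9)], quota=1)
```

In this example ⌊10/3⌋ = 3, so job `a` keeps its 2 tasks and the two spare slots go to `b` and `c`, giving 2 + 4 + 4 = 10 = min(M, Σ_j N_j).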

(19)

Simple Greedy Fairness

(20)

Sticky slots problem

› Consider a steady state in which each job occupies exactly its allocated quota of computers.

- Whenever a task from job j completes on computer m, another task from j is assigned to m again.

- m sticks to j indefinitely, whether or not j has any waiting tasks that have good data locality when run on m.

› Simple solution

- Do not unblock job j as soon as one of its tasks finishes.

- Wait until the number of j’s running tasks falls below A_j − M_H, where M_H is a hysteresis margin, or until Δ_H seconds have passed.

- In many cases, this delay is sufficient to allow another job’s worker with better locality to “steal” computer m.
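The hysteresis rule can be written as a single predicate. This is a sketch only: the function and parameter names are hypothetical, with `margin` standing for M_H and `max_wait` for Δ_H.

```python
def may_unblock(running, quota, margin, blocked_since, now, max_wait):
    """Decide whether a blocked job may be unblocked.

    running: number of tasks the job is currently running
    quota: the job's allocation A_j
    margin: hysteresis margin M_H
    blocked_since, now: timestamps in seconds
    max_wait: the timeout Delta_H in seconds
    """
    below_margin = running < quota - margin
    waited_long_enough = (now - blocked_since) >= max_wait
    return below_margin or waited_long_enough


# One task just finished (9 of 10 running): the job stays blocked for now,
# which gives other jobs a chance to take the freed computer.
still_blocked = not may_unblock(9, 10, margin=2, blocked_since=0, now=1, max_wait=5)
```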

(21)

Sticky slots illustrated (i)

Diagrams and examples in slides 21-27 are from the authors’ original presentation

(22)

Sticky slots illustrated (ii)

(23)

Sticky slots illustrated (iii)

(24)

Sticky slots illustrated (iv)

(25)

Sticky slots illustrated (v)

(26)

Sticky slots illustrated (vi)

(27)

Sticky slots illustrated (vii)

(28)

Quincy: Flow-Based Scheduler

› Main idea

- Matching = Scheduling

- Each task is either scheduled on a computer c or unscheduled

- A cost can be assigned to any matching

- Fairness constrains the number of tasks that are scheduled

- The goal is to minimize the matching cost while obeying the fairness constraints

- This is a min-cost network flow problem

- Instead of making local (greedy) decisions, solve it globally.

(29)

Graph construction (i)

› Start with a directed graph representation of the cluster architecture.

Figure labels: cluster aggregator, rack aggregator, individual computers

(30)

Graph construction (ii)

Job 1 with 6 tasks and a root task. Each receives one unit of flow as its supply.

Unscheduled node for job 1

• Each task has an edge to U_j.

• There is a single edge from U_j to the sink.

• High cost on edges from tasks to U_j.

(31)

Graph construction (iii)

Add edges from tasks (T) to computers (C), if computer C has some data for task T.

The cost is a function of the amount of data that would be transferred across the rack and core switches

(32)

Graph construction (iv)

Add edges from tasks (T) to rack (R), if R has some data for task T.

The cost on the edge is set to the worst case cost that would result if the task were run on the least favorable computer in R

(33)

Graph construction (v)

Add edges from all tasks (T) to cluster (X)

The cost on the edge is set to the worst case cost for running the task on any computer in the cluster

(34)

Graph construction (vi)

Zero-cost edge from the root task to its computer, to avoid preempting the root task.

Constrains how many tasks can run on each computer

(35)

A Feasible Matching

Unscheduled job

(36)

Final graph

Fairness constraint: setting it to 4 means at least 2 tasks from job 1 need to go through a computer

Fairness constraint: setting it to 2 means at least 2 tasks from job 2 need to go through a computer
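The graph construction above can be tried out on a toy instance. The sketch below is not Quincy's solver. It is a small successive-shortest-path min-cost flow written for illustration, and everything in the instance is assumed: one rack, two single-slot computers, three tasks, edge costs of 1, 10 and 50, and a super source that hands each task its one unit of supply. The capacity on each U_j → sink edge plays the role of the fairness constraint.

```python
class MinCostFlow:
    """Successive shortest paths with Bellman-Ford; adequate for tiny graphs."""

    def __init__(self):
        # node -> list of [to, residual_capacity, cost, reverse_index, is_forward]
        self.graph = {}

    def add_edge(self, u, v, capacity, cost):
        self.graph.setdefault(u, [])
        self.graph.setdefault(v, [])
        self.graph[u].append([v, capacity, cost, len(self.graph[v]), True])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1, False])

    def flow(self, u, v):
        """Units of flow currently sent along forward edges u -> v."""
        return sum(self.graph[v][rev][1]
                   for to, _cap, _cost, rev, fwd in self.graph[u]
                   if fwd and to == v)

    def solve(self, source, sink):
        """Push as much flow as possible at minimum total cost."""
        total_flow = total_cost = 0
        while True:
            dist = {node: float("inf") for node in self.graph}
            dist[source] = 0
            prev = {}
            for _ in range(len(self.graph) - 1):
                changed = False
                for u, edges in self.graph.items():
                    if dist[u] == float("inf"):
                        continue
                    for i, (v, cap, cost, _rev, _fwd) in enumerate(edges):
                        if cap > 0 and dist[u] + cost < dist[v]:
                            dist[v] = dist[u] + cost
                            prev[v] = (u, i)
                            changed = True
                if not changed:
                    break
            if sink not in prev:
                return total_flow, total_cost
            node = sink
            while node != source:  # augment one unit along the cheapest path
                u, i = prev[node]
                edge = self.graph[u][i]
                edge[1] -= 1
                self.graph[node][edge[3]][1] += 1
                node = u
            total_flow += 1
            total_cost += dist[sink]


mcf = MinCostFlow()
jobs = {"A": ["a1", "a2"], "B": ["b1"]}
preferred = {"a1": "c1", "a2": "c1", "b1": "c2"}  # computer holding the task's data
max_unscheduled = {"A": 1, "B": 0}                # fairness: capacity of U_j -> sink

for job, tasks in jobs.items():
    for task in tasks:
        mcf.add_edge("source", task, 1, 0)         # one unit of supply per task
        mcf.add_edge(task, preferred[task], 1, 1)  # cheap: data is local
        mcf.add_edge(task, "X", 1, 10)             # worst case anywhere in the cluster
        mcf.add_edge(task, "U_" + job, 1, 50)      # high cost: stay unscheduled
    mcf.add_edge("U_" + job, "sink", max_unscheduled[job], 0)
mcf.add_edge("X", "rack1", 2, 0)
for computer in ("c1", "c2"):
    mcf.add_edge("rack1", computer, 1, 0)
    mcf.add_edge(computer, "sink", 1, 0)           # one task slot per computer

flow, cost = mcf.solve("source", "sink")
```

With three tasks and two slots, one task must stay unscheduled. Because U_B → sink has capacity 0, that task has to come from job A, so the cheapest solution runs `b1` on `c2`, one of job A's tasks on `c1`, and routes the other through U_A.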

(37)

Some experiment results

› Workload:

› Typical Dryad jobs (Sort, Join, PageRank, WordCount, Prime).

› In total, 30 jobs with a mix of CPU-, disk-, and network-intensive tasks

› Prime is used as a worst-case job that hogs the cluster if started first.

› 240 computers in the cluster: 8 racks, 29-31 computers per rack.

› More than one metric is used for evaluation.

(38)

Results (i)

(39)

Results (ii)

(40)

Results (iii)

(41)

Results (iv)

(42)

Discussion point

› Solver overhead

- The observed average overhead in this 240-machine cluster is 7.64ms, with a maximum cost of 57.59ms

- The simulated average overhead for 2500 computers running 100 concurrent jobs is a little over a second per solution

- This seems acceptable, but the min-cost flow is recomputed from scratch each time a change occurs

› Applicable in other scheduling environments?

- The easy mapping of the scheduling problem to a min-cost flow is due to

  - Tasks being relatively independent of each other, with no correlation constraints

  - A one-dimensional capacity setting

(43)

Outline

› Motivation

› Quincy scheduling for distributed computing clusters

› Scheduling in heterogeneous environments

(44)

Hadoop’s straggler handling mechanism

› Speculative execution

- A node that is available but performing poorly is called a straggler

- MapReduce has a built-in mechanism that runs a speculative copy of the straggler’s task on another machine to finish the computation faster.

› This paper tries to improve the performance of speculative execution by

- Defining a new scheduling metric.

- Choosing the right machines to run speculative tasks.

- Capping the amount of speculative execution.

(45)

Progress score

› Hadoop monitors task progress using a progress score, and uses it to select tasks for speculation

- A map task’s progress score is the fraction of input data read

- A reduce task’s execution is divided into three phases, each of which accounts for 1/3 of the score. Within each phase, the score is the fraction of data processed

› When a task’s progress score is less than the average for its category minus 0.2 and the task has run for at least one minute, it is marked as a straggler

Reduce task phases: copy phase, sort phase, reduce phase
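The scoring and threshold rules can be sketched as below. The function names and the task dictionary layout are assumptions for illustration and are not Hadoop's API. The 0.2 threshold and the one-minute minimum come from the slide.

```python
def reduce_progress_score(phase, fraction_done):
    """Progress score of a reduce task: each phase contributes 1/3."""
    phases = ("copy", "sort", "reduce")
    return (phases.index(phase) + fraction_done) / 3


def find_stragglers(tasks, now, threshold=0.2, min_runtime=60):
    """Return tasks whose score is below (category average - threshold).

    tasks: {task_id: (progress_score, start_time)} for one category
           (all maps or all reduces); times are in seconds.
    """
    if not tasks:
        return []
    average = sum(score for score, _ in tasks.values()) / len(tasks)
    return [task_id for task_id, (score, start) in tasks.items()
            if score < average - threshold and now - start >= min_runtime]


halfway = reduce_progress_score("sort", 0.5)  # half of the second phase done
maps = {"m0": (0.9, 0), "m1": (0.8, 0), "m2": (0.3, 0), "m3": (0.2, 100)}
stragglers = find_stragglers(maps, now=120)
```

Here the average score is 0.55, so the cut-off is 0.35. `m2` falls below it and has run long enough. `m3` is also below it but started only 20 seconds ago, so it is not marked.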

(46)

Hadoop’s assumptions

› Nodes can perform work at roughly the same rate

› Tasks progress at a constant rate throughout time

› There is no cost to launching a speculative task on an idle node

› The three phases of a reduce task’s execution take approximately the same time

› Tasks with a low progress score are stragglers

› Tasks in the same category (map or reduce) require roughly the same amount of work

(47)

Breaking down the assumptions

› The first 2 assumptions are about homogeneity. However,

- In a non-virtualized data center, there may be multiple generations of hardware

- In a virtualized data center, multiple VMs are co-located on the same physical host.

Diagrams from Ang Li, Xiaowei Yang, Srikanth Kandula, and Ming Zhang, CloudCmp: comparing public cloud providers. In Proceedings of the 10th annual conference on Internet measurement (IMC '10)

(48)

Heterogeneity is also observed by other researchers

Jorg Schad, Jens Dittrich, and Jorge-Arnulfo Quiane-Ruiz. Runtime measurements in the cloud: observing, analyzing, and reducing variance. In Proceedings of the 36th International Conference on Very Large Data Bases (VLDB’10), September

(49)

Other assumptions

› Assumption 3, that speculative tasks cost nothing, breaks down when resources are shared

- The network is a bottleneck, and speculative tasks may compete for disk I/O

› Assumption 4, that a task’s progress score is approximately equal to its percent completion, does not hold, especially for reduce tasks

- The copy phase usually accounts for more than 1/3 of the task execution time

› Assumption 5, that progress score is a good proxy for progress rate because tasks begin at roughly the same time, can also be wrong

- The number of mappers depends on the number of blocks, which might be much larger than the number of available slots, so the mappers tend to run in waves (see slide 11).

(50)

LATE scheduler

› Longest Approximate Time to End

› Design principle

- Always speculatively execute the task that “we” think will finish farthest into the future

› Different methods can be used to estimate time left

› The paper proposes a simple heuristic based on progress rate

- ProgressRate = ProgressScore / T, where T is the amount of time the task has been running

- Estimated time to completion = (1 − ProgressScore) / ProgressRate

- This assumes that tasks make progress at a roughly constant rate (there are exceptions to this assumption)
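As a worked example of the heuristic, the sketch below computes the estimate for two hypothetical tasks. The function name and the numbers are made up for illustration.

```python
def estimated_time_left(progress_score, elapsed_seconds):
    """LATE heuristic: (1 - ProgressScore) / ProgressRate."""
    if progress_score == 0:
        return float("inf")  # no progress yet, so no finite estimate
    progress_rate = progress_score / elapsed_seconds
    return (1 - progress_score) / progress_rate


# task_id -> (progress score, seconds the task has been running)
tasks = {"old_slow": (0.3, 300), "new_fast": (0.2, 20)}
time_left = {tid: estimated_time_left(s, t) for tid, (s, t) in tasks.items()}
late_choice = max(time_left, key=time_left.get)
lowest_score = min(tasks, key=lambda tid: tasks[tid][0])
```

`new_fast` has the lower progress score only because it started recently. Its estimated time left is about 80 seconds, against about 700 seconds for `old_slow`, so ranking by time left picks `old_slow` while ranking by raw progress score would pick `new_fast`.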

(51)

LATE parameters

- SlowNodeThreshold: used to select fast nodes on which to launch speculative tasks

  - 25th percentile of node progress

- SpeculativeCap: used to control the number of speculative tasks that can be running at once

  - 10% of available task slots

- SlowTaskThreshold: used to select tasks for a speculative copy

  - 25th percentile of task progress

- Currently it does not consider data locality
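One way the three parameters could fit together when a node asks for work is sketched below. This is an assumed reading, not the paper's implementation: the function names, the data layout, the nearest-rank percentile, and treating a value equal to the percentile as "slow" are all choices made for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def pick_speculative_task(node, node_progress, tasks, running_speculative,
                          total_slots, speculative_cap=0.10,
                          slow_node_pct=25, slow_task_pct=25):
    """Return the task to speculate on the requesting node, or None.

    node_progress: {node: total progress made by that node}
    tasks: {task_id: (progress_score, elapsed_seconds, already_speculated)}
           for running tasks; elapsed_seconds must be positive.
    """
    # SpeculativeCap: do nothing if enough speculative copies are running.
    if running_speculative >= speculative_cap * total_slots:
        return None
    # SlowNodeThreshold: never launch a speculative copy on a slow node.
    if node_progress[node] <= percentile(node_progress.values(), slow_node_pct):
        return None
    rates = {tid: score / elapsed for tid, (score, elapsed, _) in tasks.items()}
    # SlowTaskThreshold: only tasks with a low progress rate are candidates.
    slow_rate = percentile(rates.values(), slow_task_pct)
    candidates = [tid for tid, (_, _, speculated) in tasks.items()
                  if not speculated and rates[tid] <= slow_rate]

    def time_left(tid):
        rate = rates[tid]
        return float("inf") if rate == 0 else (1 - tasks[tid][0]) / rate

    # Among the slow tasks, take the one expected to finish last.
    return max(candidates, key=time_left) if candidates else None


nodes = {"n1": 0.2, "n2": 0.8, "n3": 0.9, "n4": 1.0}
running = {"t1": (0.3, 300, False), "t2": (0.2, 20, False),
           "t3": (0.6, 100, False), "t4": (0.9, 120, False)}
choice = pick_speculative_task("n3", nodes, running,
                               running_speculative=0, total_slots=20)
```

With these made-up numbers, `n3` is a fast node, the cap of 2 speculative tasks (10% of 20 slots) has not been reached, and `t1` is the only task in the slowest quarter by progress rate, so it is the one chosen.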

(52)

Estimating finishing time

• The current ProgressRate computation assumes constant progress, which might not be true

• If a task’s execution slows down in a later phase, the ProgressRate might suggest the wrong task to speculate on

• If it speeds up, the final prediction is not affected

Mapper tasks progress at a constant rate most of the time

(53)

Evaluation

› Environment

- Amazon EC2 (200-250 nodes)

- Small local testbed (9 nodes)

› Measuring heterogeneity on EC2

(54)

Scheduling experiments

› Heterogeneity setup

- Assign a varying number of VMs to each physical node

- Create stragglers by running CPU- and I/O-intensive processes on the same VM

(55)

EC2 Sort with Heterogeneity

Each host sorted 128MB, for a total of 30GB of data

Each job has 486 map tasks and 437 reduce tasks

(56)

EC2 Sort with Stragglers

Each node sorted 256MB, for a total of 25GB of data

Stragglers were created with 4 CPU-intensive processes (800KB array sort) and 4 disk-intensive processes (dd tasks)

(57)

Sensitivity Analysis

(58)