• No results found

INFO5011. Cloud Computing Semester 2, 2011 Lecture 11, Cloud Scheduling

N/A
N/A
Protected

Academic year: 2021

Share "INFO5011. Cloud Computing Semester 2, 2011 Lecture 11, Cloud Scheduling"

Copied!
58
0
0

Loading.... (view fulltext now)

Full text

(1)

INFO5011

Cloud Computing Semester 2, 2011

Lecture 11, Cloud Scheduling

The presentation is based on:

Quincy: Fair Scheduling for Distributed Computing Clusters. Michael

Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg, SOSP'09

Improving MapReduce Performance in Heterogeneous Environment.

Matei, Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica. OSDI’2008

Some content/diagrams are from the paper and the author’s presentation

COMMONWEALTH OF Copyright Regulations 1969

WARNING

This material has been reproduced and communicated to you by or on behalf of the university of Sydney pursuant to Part

VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or

communication of this material by you may be the subject of copyright protection under the Act.

(2)

Outline

Motivation

- Default FIFO scheduling in Hadoop and its problems

Quincy scheduling for distributed computing cluster

- Proposed by Microsoft Research

- Build on Dryad cluster

- Homogeneous assumption

- Focusing on determining which node to run which job

Scheduling in Heterogeneous environment

- Proposed by U.C. Berkley RAD lab

- Heterogeneous assumption

- Focusing on optimizing speculative execution

(3)

Motivation

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 3

Big clusters used for jobs of varying sizes, durations

Data from production search cluster used in Microsoft

(4)

The Hadoop scheduling

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 4

Diagram from the CACM version of the original MapReduce paper

The master knows before hand the number of

mappers and number of reducers for each job.

› It also knows the number of available task slots on each worker node

› It is responsible for

assigning tasks to node with vacant slot

› The default and the

simplest scheduling using FIFO queue

(5)

The problem of simple FIFO scheduler

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 5

maximum number of tasks

(6)

Hadoop running a single job

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 6

hadoop jar wordcount.jar info5011wordcounter.WordCount assn_input/n07.txt countout

= 7 blocks One job:

(7)

Hadoop map task list for

job_201108241404_1213

(8)

Hadoop running 2 jobs concurrently

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 8

hadoop jar wordcount.jar info5011wordcounter.WordCount assn_input/n00p2.txt countoutn00 hadoop jar wordcount.jar info5011wordcounter.WordCount user/zhouy/assn_input/n07.txt countoutn07

Similar job as the first one except with more reducers

(9)

Job smmaries

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 9

first job finishes

(10)

Hadoop FIFO queue status

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 10

The second, smaller job waits in queue

(11)

Hadoop map task list for

job_201108241404_1214

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 11

gpu1 gpu1 dm0 gpu0 gpu0 dm0 dm1 dm2 dm2

(12)

mapper 0 and mapper 9

(13)

Outline

Motivation

- Default FIFO scheduling in Hadoop and its problems

Quincy scheduling for distributed computing cluster

- Proposed by Microsoft Research

- Build on Dryad cluster

- Homogeneous assumption

- Focusing on determining which node to run which job

Scheduling in Heterogeneous environment

- Proposed by U.C. Berkley RAD lab

- Heterogeneous assumption

- Focusing on optimizing speculative execution

(14)

Fair scheduling

Job

X

takes

t

seconds when it runs exclusively on a cluster .

X

should take no more than

Jt

seconds when cluster has

J

concurrent jobs.

Formally, for

N

computers and

J

jobs, each job should get

at-least

N/J

computers.

(15)

Data Locality constrains

Other scheduling approaches:

- HPC jobs fetch data from a SAN, no need for co-location of data and computation.

Data intensive workloads (MapReduce/Hadoop/Dryad)

- have storage attached to computers.

- Scheduling tasks near data improves performance.

15 INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou)

(16)

Fairness vs. data locality

The requirements of fairness and locality often conflict

-

A strategy that achieves optimal data locality will typically delay a job

until its ideal resources are available

-

Fairness benefits from allocating the best available resources to a job as

soon as possible after they are requested

An important feature of

Data intensive workloads

(MapReduce/Hadoop/Dryad)

- While running, tasks are independent of each other so killing one task will not impact another

- In contrast, MPI jobs are made of sets of stateful processes tightly coupled by communicating with each other across the network.

(17)

Cluster architecture

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 17

(18)

Queue based scheduling

Baseline algorithm (Greedy)

- Equivalent to FIFO scheduling in Hadoop

Simple greedy fairness (GF)

- Based on Hadoop’s Fair Scheduler

- Job j gets a baseline allocation A*j =min(|M/K|, Nj) where M: number of

computer; K: number of concurrent jobs; Nj total number of running and waiting tasks for job j.

- If Σj A*

j<M the remaining slots are divided equally among jobs that have

additional ready workers so that final allocation Aj has Σj Aj=min(M, Σj Nj).

- The scheduler blocks job j whenever it is running Aj tasks or more. It only assigns tasks of an unblocked job to a newly available computer.

Fairness with preemption (GFP

- When a job is running more than tasks, the scheduler will kill its over-quota tasks, starting with the most recently scheduled tasks first

(19)

Simple Greedy Fairness

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 19

(20)

Sticky slots problem

Under a steady state in which each job is occupying exactly its allocated

quota of computers.

- Whenever a task from job j completes on computer, another task from j will be assigned to m again

- m sticks to j indefinitely whether or not j has any waiting tasks that have good data locality when run on m.

Simple solution

- Do not unblock job j once a task finishes

- wait till j ’s running tasks falls below Aj -MH, where MH is a hysteresis margin or ΔH seconds have passed.

- In many cases, this delay is sufficient to allow another job’s worker, with better

locality to “steal” computer m.

(21)

Sticky slots illustrated (i)

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 21

Diagrams and example in slides 21- 27 are from authors’ original presentation

(22)

Sticky slots illustrated (ii)

(23)

Sticky slots illustrated (iii)

(24)

Sticky slots illustrated (iv)

(25)

Sticky slots illustrated (v)

(26)

Sticky slots illustrated (vi)

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 26

(27)

Sticky slots illustrated (vii)

(28)

Quincy-- Flow Based Scheduler

Main idea

-

Matching = Scheduling

-

Each task is either scheduled on a computer c or un scheduled

-

Can assign a cost to any matching

-

Fairness constrains number of tasks that are scheduled

-

The goal is to minimize matching cost while obeying fairness constraints

-

Min-cost network flow problem

-

Instead of making local decisions [greedy], solve it globally.

(29)

Graph construction (i)

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 29

Start with a directed graph representation of the cluster

architecture.

Cluster aggregator

Individual computers Rack aggregator

(30)

Graph construction (ii)

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 30

job 1 with 6 tasks and a root task Each receive one unit of flow as its supply

Unscheduled node for job 1

•Each task has an edge to Uj.

•There is a single edge from Uj to the sink. •High cost on edges from tasks to Uj.

(31)

Graph construction (iii)

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 31

Add edges from tasks (T) to

computers (C), if computer C has some data for task T.

The cost is a function of the amount of data that would be transferred across rack and core switch

(32)

Graph construction (iv)

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 32

Add edges from tasks (T) to rack (R), if R has some data for task T.

The cost on the edge is set to the worst case cost that would result if the task were run on the least favorable computer in R

(33)

Graph construction (v)

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 33

Add edges from all tasks (T) to cluster (X)

The cost on the edge is set to the worst case cost for running the task on any computer in the cluster

(34)

Graph construction (vi)

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 34

0 cost edge from root task to

computer to avoid preempting root task.

Constrains how many tasks can run on each computer

(35)

A Feasible Matching

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 35

Unscheduled job

(36)

Final graph

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 36

Fairness constrains, setting it to 4 means at lest 2 tasks from job 1 needs to go through computer Fairness constrains, setting it to 2 means at lest 2 tasks from job 2 needs to go through computer

(37)

Some experiment results

Workload:

Typical Dryad jobs (Sort, Join, PageRank, WordCount,

Prime).

In total, 30 jobs with a mix of CPU, disk, and network

intensive tasks

Prime used as a worst-case job that hogs the cluster if started

first.

240 computers in cluster. 8 racks, 29-31 computers per rack.

More than one metric used for evaluation.

(38)

Results (i)

(39)

Results (ii)

(40)

Results (iii)

(41)

Results (iv)

(42)

Discussion point

Solver overhead

- The observed average overhead in this 240 machine cluster is 7.64ms with a maximum cost of 57.59ms

- Simulated average overhead in 2500 computers running 100 concurrent jobs is a little over a second per solution

- Seems acceptable, but min-cost flow is recomputed from scratch each time a change occurs

Applicable in other scheduling environment?

- The easy mapping of the scheduling problem to a min-cost flow is due to

- Tasks are relatively independent with each other, there is no correlation constraints

- One dimensional capacity setting

(43)

Outline

Motivation

- Default FIFO scheduling in Hadoop and its problems

Quincy scheduling for distributed computing cluster

- Proposed by Microsoft Research

- Build on Dryad cluster

- Homogeneous assumption

- Focusing on determining which node to run which job

Scheduling in Heterogeneous environment

- Proposed by U.C. Berkley RAD lab

- Heterogeneous assumption

- Focusing on optimizing speculative execution

(44)

Hadoop’s straggler handling mechanism

Speculative execution

- If a node is available but is performing poorly, this is called a straggler

- MapReduce has a build-in mechanism to run a speculative copy of its task on another machine to finish the computation faster.

This paper tries to Improve the performance of speculative executions

by

-

Define a new scheduling metric.

-

Choosing the right machines to run speculative tasks.

-

Capping the amount of speculative executions.

(45)

Progress score

Hadoop monitors task progress using a progress score to select speculative tasks

- Map task’s progress score is the fraction of input data read

- Reduce task’s execution is divided into three phases, each of which account for 1/3 of the

score. In each phases, the score is the fraction of data process

› When a task’s progress score is less than the average for its category minus 0.2 and the task has run for at least one minute, it is marked as a straggler

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 45

Copy phase sort phase reduce phase

For a reduce task, the execution is divided into three phases, each of which accounts for 1/3 of the score progress score is the fraction of input data read

(46)

Hadoop’s assumption

Nodes can perform work at exactly the same rate

Tasks progress at a constant rate throughout time

There is no cost to launching a speculative task on an idle node

The three phases of execution take approximately same time

Tasks with a low progress score are stragglers

Maps and Reduces require roughly the same amount of work

(47)

Breaking down the assumptions

The first 2 assumptions talk about homogeneity. However

-

In a non-virtualized data center, there may be multiple generations

of hardware

-

In a virtualized data center, multiple VMs are co-located on the

same physical host.

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 47

Diagrams from Ang Li, Xiaowei Yang, Srikanth Kandula, and Ming Zhang, CloudCmp: comparing public cloud providers. In Proceedings of the 10th annual conference on Internet measurement (IMC '10)

(48)

Heterogeneity are observed by other researchers

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 48

Jorg Schad, Jens Dittrich, and Jorge-Arnulfo Quiane-Ruiz. Runtime measurements in the cloud: observing, analyzing, and

reducing variance. In Proceedings of the 36th International Conference on Very Large Data Bases(VLDB’10), September

(49)

Other assumptions

Assumption 3 that speculating tasks coast nothing, breaks down when

resources are shared

- Network is a bottleneck and speculative tasks may compete for disk I/O

Assumption 4 that a task’s progress score is approximately equal to its

percent completion, does not hold especially for reduce tasks

- The copy phase usually counts for more than 1/3 of the task execution time

Assumption 5, that progress score is a good proxy for progress rate

because tasks being at roughly the same time, can also be wrong

- Number of mappers depends on number of blocks which might be much large the available slots. The mappers tend to run in waves (see slide 11).

(50)

LATE scheduler

Longest Approximate Time to End

Design principle

- Always speculatively execute the task that “we” think will finish farthest into the

future

Different methods can be used to estimate time left

Propose a simple heuristic based on progress rate

- ProgressRate = ProgressScore/The amount of time the task has been running

- Estimated time to completion = (1-ProgressScore)/ProgressRate

- It assumes that tasks make progress at a roughly constant rate (there are exceptions to this assumption)

(51)

LATE parameters

- SlowNodeThreshold: used to select fast node to launch speculative tasks

- 25th percentile of node progress

- SpeculativeCap: used to control the number of speculative tasks that can be

running at once

- 10% of available task slots

- SlowTaskThreshold: used to select task for speculative copy

- 25th percentile of task progress

- Currently it does not consider data locality

(52)

Estimating finishing time

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 52

• the current ProgressRate computation assumes constant progress, which might not be true

•If a task’s execution slows

down in later phase, the

ProgressRate might

suggest the wrong task to speculative

•If it speed up, it won’t affect the final prediction

Mapper tasks progress in constant rate most of the time

(53)

Evaluation

Environment

- Amazon EC2 (200-250 nodes)

- Small Local Testbed (9 nodes)

Measuring Heterogeneity on EC2

(54)

Scheduling experiments

Heterogeneity setup

- Assigning a varying number of VMs to each physical node

- Create stragglers by running CPU and I/O intensive processes on same VM

(55)

EC2 Sort with Heterogeneity

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 55

Each host

sorted 128MB

with a total of

30GB data

Each job has

486 map tasks

and 437 reduce

tasks

(56)

EC2 Sort with Stragglers

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 56

Each node sorted

256MB with a total

of 25GB of data

Stragglers created

with 4 CPU (800KB

array sort) and 4

disk (dd tasks)

(57)

Sensitivity Analysis

(58)

Conclusion

Advantages

-

Considers heterogeneity that appears in real life systems.

-

LATE speculatively executes the tasks that hurt the

response time the most on fast nodes.

-

LATE caps speculative tasks to avoid overloading resources

Limitations

-

Does not consider data locality

-

Finishing time estimation may predict wrong when tasks

slows down

References

Related documents

New Cooperative Medical Scheme (NCMS) aims to protect households from catastrophic health expenditure (CHE) and impoverishment in rural China.. This article assesses the effect of

Numerous studies have shown that apigenin (API) could treat various fibrotic diseases by regulating various signaling pathways, whereas no study has discussed whether API can

Numerous pharmaceutical and biotechnology companies world-wide have scaled-up reliably and predictably to high area 12 inch and 16 inch diameter cartridge systems using small

[14] In the document for 2020, PTD recommends that in people with diabetes, who despite the implementation of lifestyle modifications (weight reduction, increased physical activity

Large orders can move price in the adverse direction, and a general way of reducing trading loss is splitting large orders into smaller child orders and spanning them over a given

The estimation results reported in the table point out that the disaggregation of the policy effect by quarter yields positive and statistically significant city of being hired

In the figures the predictions for IHS2, using the IHS2 model, are in fact not the numbers predicted by a model, but the actual poverty level calculated directly from the survey..

The Elite 2 and Inter 1 gymnasts try to accelerate the giant swing during the preparation phase by starting the Tkatchev before the Handstand position (Table 1, Figure 1