Delay Scheduling. A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling

(1)

Matei Zaharia, Dhruba Borthakur*_{,
Joydeep
Sen
Sarma}*_,

Khaled Elmeleegy+_{,
Scott
Shenker,
Ion
Stoica}

UC Berkeley, *_{Facebook
Inc,}+_{Yahoo!
Research}

Delay Scheduling

A Simple Technique for Achieving Locality

and Fairness in Cluster Scheduling

(2)

Motivation

•  MapReduce / Hadoop originally designed for high

throughput batch processing

•  Today’s workload is far more diverse:

–  Many users want to share a cluster

•  Engineering, marketing, business intelligence, etc

–  Vast majority of jobs are short

•  Ad-‐hoc queries, sampling, periodic reports

–  Response time is critical

•  Interactive queries, deadline-‐driven reports

 How can we efTiciently share MapReduce clusters

between users?

(3)

0% 20% 40% 60% 80% 100% 1 10 100 1000 10000 100000 CD F

Job Running Time (s)

Example: Hadoop at Facebook

•  600-‐node, 2 PB data warehouse, growing at 15 TB/day

•  Applications: data mining, spam detection, ads

•  200 users (half non-‐engineers)

•  7500 MapReduce jobs / day

(4)

Approaches to Sharing

•  Hadoop default scheduler (FIFO)

– Problem: short jobs get stuck behind long ones

•  Separate clusters

– Problem 1: poor utilization

– Problem 2: costly data replication

•  Full replication across clusters nearly infeasible at Facebook/Yahoo! scale

(5)

Our Work

•  Hadoop Fair Scheduler

–  Fine-‐grained sharing at level of map & reduce tasks –  Predictable response times and user isolation

•  Main challenge:

data locality

–  For efTiciency, must run tasks near their input data

–  Strictly following any job queuing policy hurts locality: job picked by policy may not have data on free nodes

•  Solution:

delay scheduling

(6)

The Problem

Job 2

Master Job 1

Scheduling order

Slave Slave Slave Slave Slave Slave

4 2 1 2 1 3 2 3 9 5 3 3 6 2 8 5 8 1 9 1 7 4 6 7

Task 2 Task 5 Task 3 Task 1 Task 7 Task 4

File 1:

(7)

The Problem

Job 1

Master Job 2

Scheduling order

4 2 1 2 2 3 9 5 3 3 6 2 8 5 8 1 9 1 7 4 6 7

Task 2 Task 5 Task 3 Task 1

File 1:

File 2:

Task 1 Task 7 Task 2 Task 4 Task 3

Problem: Fair decision hurts locality

Especially bad for jobs with small input Tiles

(8)

0% 20% 40% 60% 80% 100% 10 100 1000 10000 100000 P er ce n t L oc al M ap T ask s

Job Size (Number of Input Blocks)

Data Locality in Production at Facebook

Node Locality Rack Locality

Locality vs. Job Size at Facebook

(9)

Special Instance: Sticky Slots

•  Under fair sharing, locality can be poor even when all jobs have large input Tiles

•  Problem: jobs get “stuck” in the same set of task slots

–  When one a task in job j Tinishes, the slot it was running in is given back to j, because j is below its share

–  Bad because data Tiles are spread out across all nodes

Job _ShareFair Running _Tasks

Job 1 2 2

Job 2 2 2

Job _ShareFair Running _Tasks

Job 1 2 1

Job 2 2 2

Slave Slave Slave Slave Master

(10)

Special Instance: Sticky Slots

0% 20% 40% 60% 80% 100%

5 Jobs 10 Jobs 20 Jobs 50 Jobs

% L oc al M ap T ask s

Data Locality vs. Number of Concurrent Jobs

•  Under fair sharing, locality can be poor even when all jobs have large input Tiles

•  Problem: jobs get “stuck” in the same set of task slots

–  When one a task in job j Tinishes, the slot it was running in is given back to j, because j is below its share

(11)

Solution: Delay Scheduling

•  Relax queuing policy to make jobs wait for a

limited time if they cannot launch local tasks

•  Result: Very short wait time (1-‐5s) is

(12)

Delay Scheduling Example

Job 1

Master Job 2

Scheduling order

4 2 1 2 1 3 2 3 9 5 3 3 6 2 8 5 8 1 9 1 7 4 6 7 Task 2 Task 3 File 1: File 2:

Task 8 Task 7 Task 2 Task 4 Task 6

Idea: Wait a short time to get data-‐local

scheduling opportunities

Task 5 Task 1 Task 1 Task 3

(13)

Delay Scheduling Details

•  Scan jobs in order given by queuing policy,

picking Tirst that is permitted to launch a task

•  Jobs must wait before being permitted to

launch non-‐local tasks

–  If wait < T

₁

, only allow node-‐local tasks

–  If T

₁

< wait < T

₂

, also allow rack-‐local

–  If wait > T

₂

, also allow off-‐rack

(14)

Analysis

•  When is it worth it to wait, and for how long?

  Waiting worth it if E(wait time) < cost to run non-locally

•  Simple model: Data-‐local scheduling opportunities arrive following Poisson process

•  Expected response time gain from delay scheduling:

E(gain) = (1 – e-‐w/t₎₍_d_–_t₎

  Optimal wait time is inAinity

Delay from running non-‐locally

Expected time to get local scheduling opportunity

(15)

Analysis Continued

E(gain) = (1 – e-‐w/t₎₍_d_–_t₎

•  Typical values of

t

and

d

:

t = (avg task length) / (Tile replication × tasks per node) d ≥ (avg task length)

Delay from running non-‐locally

Expected time to get local scheduling opportunity

Wait amount

(16)

More Analysis

•  What if some nodes are occupied by long tasks?

•  Suppose we have:

–  Fraction φ of active tasks are long

–  R replicas of each block –  L task slots per node

•  For a given data block,

P(all replicas blocked by long tasks) =

_φ

RL

(17)

Evaluation

•  Macrobenchmark

–  IO-‐heavy workload

–  CPU-‐heavy workload

–  Mixed workload

•  Microbenchmarks

–  Sticky slots

–  Small jobs

–  Hierarchical fair scheduling

•  Sensitivity analysis

•  Scheduler overhead

(18)

Macrobenchmark

•  100-‐node EC2 cluster, 4 cores/node

•  Job submission schedule based on job sizes and

inter-‐arrival times at Facebook

–  100 jobs grouped into 9 “bins” of sizes

•  Three workloads:

–  IO-‐heavy, CPU-‐heavy, mixed

•  Three schedulers:

–  FIFO

–  Fair sharing

(19)

0 0.2 0.4 0.6 0.8 1 0 100 200 300 400 500 CDF Time (s) FIFO Fair Fair + DS 0 0.2 0.4 0.6 0.8 1 0 50 100 150 200 CDF Time (s) FIFO Fair Fair + DS 0 0.2 0.4 0.6 0.8 1 0 50 100 150 200 CDF Time (s) FIFO Fair Fair + DS

Results for IO-Heavy Workload

Small Jobs (1-‐10 input blocks) Medium Jobs (50-‐800 input blocks) Large Jobs (4800 input blocks)

Job Response Times Up to 5x

(20)

Results for IO-Heavy Workload

0% 20% 40% 60% 80% 100% 1 2 3 4 5 6 7 8 9 P er ce n t L oc al M ap s Bin

FIFO Fair Fair + Delay Sched.

0% 20% 40% 60% 80% 1 2 3 4 5 6 7 8 9 Speedup fr om D el ay S ch ed u li n g Bin

(21)

Sticky Slots Microbenchmark

•  5-‐50 jobs on EC2

•  100-‐node cluster

•  4 cores / node

•  5s delay scheduling

0% 20% 40% 60% 80% 100%

P er ce n t L oc al M ap s

Without Delay Scheduling With Delay Scheduling

0 10 20 30 40 50

B en ch m ar k R u n n in g T im e ( m in u te s)

Without Delay Scheduling With Delay Scheduling

(22)

Conclusions

•  Delay scheduling works well under two conditions:

–  SufTicient fraction of tasks are short relative to jobs

–  There are many locations where a task can run efTiciently

•  Blocks replicated across nodes, multiple tasks/node

•  Generalizes beyond MapReduce and fair sharing

–  Applies to any queuing policy that gives ordered job list

•  Used in Hadoop Fair Scheduler to implement hierarchical policy

(23)

Thanks!

•  This work is open source, packaged with

Hadoop as the Hadoop Fair Scheduler

– Old version without delay scheduling in use at

Facebook since fall 2008

– Delay scheduling will appear in Hadoop 0.21

(24)

Related Work

•  Quincy (SOSP ‘09)

–  Locality aware fair scheduler for Dryad

–  Expresses scheduling problem as min-‐cost Tlow –  Kills tasks to reassign nodes when jobs change –  One task slot per machine

•  HPC scheduling

–  Inelastic jobs (cannot scale up/down over time) –  Data locality generally not considered

•  Grid computing

–  Focused on common interfaces for accessing compute resources in multiple administrative domains

(25)

Delay Scheduling Details

when there is a free task slot on node n:

sort jobs according to queuing policy for j in jobs:

if j has node-‐local task t on n:

j.level := 0; j.wait := 0; return t

else if j has rack-‐local task t on n and (j.level ≥ 1 or j.wait ≥ T₁): j.level := 1; j.wait := 0; return t

else if j.level = 2 or (j.level = 1 and j.wait ≥ T₂) or (j.level = 0 and j.wait ≥ T₁ + T₂):

j.level := 2; j.wait := 0; return t else:

(26)

Sensitivity Analysis

5% 68% 98% 100% 11% 80% 99% 100% 0% 20% 40% 60% 80% 100%

No Delay 1s Delay 5s Delay 10s Delay

P er ce n t L oc al M ap s

(27)

Hadoop Terminology

•  Job = entire MapReduce computation; consists of

map & reduce tasks

•  Task = unit of work that is part of a job

•  Slot = location on a slave where a task can run

Output File Map Map Map Reduce Reduce Input F ile Block Block Block

(28)

Analysis

•  Required waiting time is a fraction of the

average task length as long as there are

many locations where a task can run

– Ex: with 3 replicas / block and 8 tasks / node,

wait 1/24

th

_{of
average
task
length}

•  Non-‐locality decreases exponentially with

waiting time

(29)

  Fraction of mean task length Decreases with # slots/node

Analysis

•  Suppose we have:

–  M machines

–  L slots/machine

–  T second average task length

–  Job with data on fraction f of nodes

•  Let D = max # of scheduling opportunities skipped

•  P(non-‐local) = (1-‐f)

D

•  waiting time = D * T/(ML)

(30)

Small Jobs Experiment

•  10-‐20 min benchmark

consisting of several

hundred IO-‐heavy jobs

•  Job sizes: 3, 10 and 100

map tasks

•  Environment: 100-‐node

cluster on 4 racks; 5 map

slots per node

0

0.2 0.4 0.6 0.8 1

3 Maps 10 Maps 100 Maps

Normalized Running T

ime

Job Size (Number of Maps)

No Delay Scheduling With Delay Scheduling Job Size Default Scheduler

Node Loc. Rack Loc.

3 maps 2% 50%

10 maps 37% 98%

100 maps 84% 99%

With Delay Sched. Node Loc. Rack Loc.

75% 94%

99% 99%

Delay Scheduling. A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling

Delay Scheduling

A Simple Technique for Achieving Locality

and Fairness in Cluster Scheduling

Motivation

• MapReduce / Hadoop originally designed for high

throughput batch processing

• Today’s workload is far more diverse:

 How can we efTiciently share MapReduce clusters

between users?

Example: Hadoop at Facebook

• 600-­‐node, 2 PB data warehouse, growing at 15 TB/day

• Applications: data mining, spam detection, ads

• 200 users (half non-­‐engineers)

• 7500 MapReduce jobs / day

Approaches to Sharing

• Hadoop default scheduler (FIFO)

– Problem: short jobs get stuck behind long ones

• Separate clusters

– Problem 1: poor utilization

– Problem 2: costly data replication

Our Work

• Hadoop Fair Scheduler

• Main challenge:

data locality

• Solution:

delay scheduling

The Problem

The Problem

Problem: Fair decision hurts locality

Especially bad for jobs with small input Tiles

Locality vs. Job Size at Facebook

Special Instance: Sticky Slots

Special Instance: Sticky Slots

Solution: Delay Scheduling

• Relax queuing policy to make jobs wait for a

limited time if they cannot launch local tasks

• Result: Very short wait time (1-­‐5s) is

Delay Scheduling Example

Idea: Wait a short time to get data-­‐local

scheduling opportunities

Delay Scheduling Details

• Scan jobs in order given by queuing policy,

picking Tirst that is permitted to launch a task

• Jobs must wait before being permitted to

launch non-­‐local tasks

– If wait < T

, only allow node-­‐local tasks

– If T

< wait < T

, also allow rack-­‐local

– If wait > T

, also allow off-­‐rack

Analysis

Analysis Continued

• Typical values of

t

and

d

:

More Analysis

• What if some nodes are occupied by long tasks?

• Suppose we have:

• For a given data block,

P(all replicas blocked by long tasks) =

φ

Evaluation

• Macrobenchmark

– IO-­‐heavy workload

– CPU-­‐heavy workload

– Mixed workload

• Microbenchmarks

– Sticky slots

– Small jobs

– Hierarchical fair scheduling

• Sensitivity analysis

• Scheduler overhead

Macrobenchmark

• 100-­‐node EC2 cluster, 4 cores/node

• Job submission schedule based on job sizes and

•  MapReduce / Hadoop originally designed for high

•  Today’s workload is far more diverse:

 How can we efTiciently share MapReduce clusters

•  600-‐node, 2 PB data warehouse, growing at 15 TB/day

•  Applications: data mining, spam detection, ads

•  200 users (half non-‐engineers)

•  7500 MapReduce jobs / day

•  Hadoop default scheduler (FIFO)

– Problem: short jobs get stuck behind long ones

•  Separate clusters

– Problem 1: poor utilization

– Problem 2: costly data replication

•  Hadoop Fair Scheduler

•  Main challenge:

•  Solution:

•  Relax queuing policy to make jobs wait for a

•  Result: Very short wait time (1-‐5s) is

Idea: Wait a short time to get data-‐local

•  Scan jobs in order given by queuing policy,

•  Jobs must wait before being permitted to

launch non-‐local tasks

–  If wait < T

, only allow node-‐local tasks

–  If T

, also allow rack-‐local

–  If wait > T

, also allow off-‐rack

•  Typical values of

•  What if some nodes are occupied by long tasks?

•  Suppose we have:

•  For a given data block,

_φ

•  Macrobenchmark

–  IO-‐heavy workload

–  CPU-‐heavy workload

–  Mixed workload

•  Microbenchmarks

–  Sticky slots

–  Small jobs

–  Hierarchical fair scheduling

•  Sensitivity analysis

•  Scheduler overhead

•  100-‐node EC2 cluster, 4 cores/node

•  Job submission schedule based on job sizes and

inter-‐arrival times at Facebook

•  Three workloads:

•  Three schedulers:

Results for IO-Heavy Workload

Results for IO-Heavy Workload

•  5-‐50 jobs on EC2

•  100-‐node cluster

•  4 cores / node

•  5s delay scheduling

•  Delay scheduling works well under two conditions:

•  Generalizes beyond MapReduce and fair sharing

•  This work is open source, packaged with

– Old version without delay scheduling in use at

– Delay scheduling will appear in Hadoop 0.21

•  Quincy (SOSP ‘09)

•  HPC scheduling

•  Grid computing

•  Job = entire MapReduce computation; consists of

•  Task = unit of work that is part of a job

•  Slot = location on a slave where a task can run

•  Required waiting time is a fraction of the

– Ex: with 3 replicas / block and 8 tasks / node,

_{of
average
task
length}

•  Non-‐locality decreases exponentially with

•  Suppose we have:

•  Let D = max # of scheduling opportunities skipped

•  P(non-‐local) = (1-‐f)

•  waiting time = D * T/(ML)

•  10-‐20 min benchmark

hundred IO-‐heavy jobs

•  Job sizes: 3, 10 and 100

•  Environment: 100-‐node