• No results found

Delay Scheduling. A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling

N/A
N/A
Protected

Academic year: 2021

Share "Delay Scheduling. A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling"

Copied!
30
0
0

Loading.... (view fulltext now)

Full text

(1)

Matei  Zaharia,  Dhruba  Borthakur*,  Joydeep  Sen  Sarma*,  

Khaled  Elmeleegy+,  Scott  Shenker,  Ion  Stoica  

UC  Berkeley,  *Facebook  Inc,  +Yahoo!  Research  

Delay  Scheduling  

A  Simple  Technique  for  Achieving  Locality  

and  Fairness  in  Cluster  Scheduling  

(2)

Motivation  

•  MapReduce  /  Hadoop  originally  designed  for  high  

throughput  batch  processing  

•  Today’s  workload  is  far  more  diverse:  

–  Many  users  want  to  share  a  cluster  

•  Engineering,  marketing,  business  intelligence,  etc  

–  Vast  majority  of  jobs  are  short  

•  Ad-­‐hoc  queries,  sampling,  periodic  reports  

–  Response  time  is  critical  

•  Interactive  queries,  deadline-­‐driven  reports  

 How  can  we  efTiciently  share  MapReduce  clusters  

between  users?  

(3)

0%   20%   40%   60%   80%   100%   1   10   100   1000   10000   100000   CD F  

Job  Running  Time  (s)  

Example:  Hadoop  at  Facebook  

•  600-­‐node,  2  PB  data  warehouse,  growing  at  15  TB/day  

•  Applications:  data  mining,  spam  detection,  ads  

•  200  users  (half  non-­‐engineers)  

•  7500  MapReduce  jobs  /  day  

(4)

Approaches  to  Sharing  

•  Hadoop  default  scheduler  (FIFO)  

– Problem:  short  jobs  get  stuck  behind  long  ones  

•  Separate  clusters  

– Problem  1:  poor  utilization  

– Problem  2:  costly  data  replication  

•  Full  replication  across  clusters  nearly  infeasible  at   Facebook/Yahoo!  scale  

(5)

Our  Work  

•  Hadoop  Fair  Scheduler  

–  Fine-­‐grained  sharing  at  level  of  map  &  reduce  tasks   –  Predictable  response  times  and  user  isolation  

•  Main  challenge:

 data  locality  

–  For  efTiciency,  must  run  tasks  near  their  input  data  

–  Strictly  following  any  job  queuing  policy  hurts  locality:   job  picked  by  policy  may  not  have  data  on  free  nodes  

•  Solution:

 

delay  scheduling  

(6)

The  Problem  

Job  2  

Master   Job  1  

Scheduling  order  

Slave   Slave   Slave   Slave   Slave   Slave  

4   2   1   2   1   3   2   3   9   5   3   3   6   2   8   5   8   1   9   1   7   4   6   7  

Task  2   Task  5   Task  3   Task  1   Task  7   Task  4  

File  1:  

(7)

The  Problem  

Job  1  

Master   Job  2  

Scheduling  order  

Slave   Slave   Slave   Slave   Slave   Slave  

4   2   1   2   2   3   9   5   3   3   6   2   8   5   8   1   9   1   7   4   6   7  

Task  2   Task  5   Task  3   Task  1  

File  1:  

File  2:  

Task  1   Task  7  Task  2   Task  4  Task  3  

Problem:  Fair  decision  hurts  locality  

Especially  bad  for  jobs  with  small  input  Tiles  

(8)

0%   20%   40%   60%   80%   100%   10   100   1000   10000   100000   P er ce n t  L oc al  M ap  T ask s  

Job  Size  (Number  of  Input  Blocks)  

Data  Locality  in  Production  at  Facebook  

Node  Locality   Rack  Locality  

Locality  vs.  Job  Size  at  Facebook  

(9)

Special  Instance:  Sticky  Slots  

•  Under  fair  sharing,  locality  can  be  poor  even  when  all   jobs  have  large  input  Tiles  

•  Problem:  jobs  get  “stuck”  in  the  same  set  of  task  slots  

–  When  one  a  task  in  job  j  Tinishes,  the  slot  it  was  running  in  is   given  back  to  j,  because  j  is  below  its  share  

–  Bad  because  data  Tiles  are  spread  out  across  all  nodes  

Job   Share  Fair   Running  Tasks  

Job  1   2    2  

Job  2   2   2  

Job   Share  Fair   Running  Tasks  

Job  1   2   1  

Job  2   2   2  

Slave   Slave   Slave   Slave   Master  

(10)

Special  Instance:  Sticky  Slots  

0%   20%   40%   60%   80%   100%  

5  Jobs   10  Jobs   20  Jobs   50  Jobs  

%  L oc al  M ap  T ask s  

Data  Locality  vs.  Number  of  Concurrent  Jobs  

•  Under  fair  sharing,  locality  can  be  poor  even  when  all   jobs  have  large  input  Tiles  

•  Problem:  jobs  get  “stuck”  in  the  same  set  of  task  slots  

–  When  one  a  task  in  job  j  Tinishes,  the  slot  it  was  running  in  is   given  back  to  j,  because  j  is  below  its  share  

(11)

Solution:  Delay  Scheduling  

•  Relax  queuing  policy  to  make  jobs  wait  for  a  

limited  time  if  they  cannot  launch  local  tasks  

•  Result:  Very  short  wait  time  (1-­‐5s)  is  

(12)

Delay  Scheduling  Example  

Job  1  

Master   Job  2  

Scheduling  order  

Slave   Slave   Slave   Slave   Slave   Slave  

4   2   1   2   1   3   2   3   9   5   3   3   6   2   8   5   8   1   9   1   7   4   6   7   Task  2   Task  3   File  1:   File  2:  

Task  8   Task  7  Task  2   Task  4  Task  6  

Idea:  Wait  a  short  time  to  get  data-­‐local  

scheduling  opportunities  

Task  5  Task  1   Task  1  Task  3  

(13)

Delay  Scheduling  Details  

•  Scan  jobs  in  order  given  by  queuing  policy,  

picking  Tirst  that  is  permitted  to  launch  a  task  

•  Jobs  must  wait  before  being  permitted  to  

launch  non-­‐local  tasks  

–  If  wait  <  T

1

,  only  allow  node-­‐local  tasks  

–  If  T

1

 <  wait  <  T

2

,  also  allow  rack-­‐local  

–  If  wait  >  T

2

,  also  allow  off-­‐rack  

(14)

Analysis  

•  When  is  it  worth  it  to  wait,  and  for  how  long?  

  Waiting  worth  it  if  E(wait  time)  <  cost  to  run  non-­locally  

•  Simple  model:  Data-­‐local  scheduling  opportunities  arrive   following  Poisson  process  

•  Expected  response  time  gain  from  delay  scheduling:  

                     E(gain)  =  (1  –  e-­‐w/t)(d  –  t)    

  Optimal  wait  time  is  inAinity  

Delay  from  running  non-­‐locally  

Expected  time  to  get  local   scheduling  opportunity  

(15)

Analysis  Continued  

                     E(gain)  =  (1  –  e-­‐w/t)(d  –  t)    

•  Typical  values  of  

t

 and  

d

:  

           t  =  (avg  task  length)  /  (Tile  replication  ×  tasks  per  node)            d  ≥  (avg  task  length)  

Delay  from  running  non-­‐locally  

Expected  time  to  get  local   scheduling  opportunity  

Wait  amount  

(16)

More  Analysis  

•  What  if  some  nodes  are  occupied  by  long  tasks?  

•  Suppose  we  have:  

–   Fraction  φ  of  active  tasks  are  long  

–   R  replicas  of  each  block   –   L  task  slots  per  node  

•  For  a  given  data  block,  

               P(all  replicas  blocked  by  long  tasks)  =  

φ

RL  

(17)

Evaluation  

•  Macrobenchmark  

–  IO-­‐heavy  workload  

–  CPU-­‐heavy  workload  

–  Mixed  workload  

•  Microbenchmarks  

–  Sticky  slots  

–  Small  jobs  

–  Hierarchical  fair  scheduling

 

•  Sensitivity  analysis  

•  Scheduler  overhead  

(18)

Macrobenchmark  

•  100-­‐node  EC2  cluster,  4  cores/node  

•  Job  submission  schedule  based  on  job  sizes  and  

inter-­‐arrival  times  at  Facebook  

–  100  jobs  grouped  into  9  “bins”  of  sizes  

•  Three  workloads:  

–  IO-­‐heavy,  CPU-­‐heavy,  mixed  

•  Three  schedulers:  

–  FIFO  

–  Fair  sharing  

(19)

0 0.2 0.4 0.6 0.8 1 0 100 200 300 400 500 CDF Time (s) FIFO Fair Fair + DS 0 0.2 0.4 0.6 0.8 1 0 50 100 150 200 CDF Time (s) FIFO Fair Fair + DS 0 0.2 0.4 0.6 0.8 1 0 50 100 150 200 CDF Time (s) FIFO Fair Fair + DS

Results  for  IO-­Heavy  Workload  

Small  Jobs   (1-­‐10  input  blocks)   Medium  Jobs   (50-­‐800  input  blocks)   Large  Jobs   (4800  input  blocks)  

Job  Response  Times   Up to 5x

(20)

Results  for  IO-­Heavy  Workload  

0%   20%   40%   60%   80%   100%   1   2   3   4   5   6   7   8   9   P er ce n t  L oc al  M ap s   Bin  

FIFO   Fair   Fair  +  Delay  Sched.  

0%   20%   40%   60%   80%   1   2   3   4   5   6   7   8   9   Speedup  fr om   D el ay  S ch ed u li n g   Bin  

(21)

Sticky  Slots  Microbenchmark  

•  5-­‐50  jobs  on  EC2  

•  100-­‐node  cluster  

•  4  cores  /  node  

•  5s  delay  scheduling  

0%   20%   40%   60%   80%   100%  

5  Jobs   10  Jobs   20  Jobs   50  Jobs  

P er ce n t  L oc al  M ap s  

Without  Delay  Scheduling   With  Delay  Scheduling  

0   10   20   30   40   50  

5  Jobs   10  Jobs   20  Jobs   50  Jobs  

B en ch m ar k  R u n n in g   T im e    ( m in u te s)  

Without  Delay  Scheduling   With  Delay  Scheduling  

(22)

Conclusions  

•  Delay  scheduling  works  well  under  two  conditions:  

–  SufTicient  fraction  of  tasks  are  short  relative  to  jobs  

–  There  are  many  locations  where  a  task  can  run  efTiciently  

•  Blocks  replicated  across  nodes,  multiple  tasks/node  

•  Generalizes  beyond  MapReduce  and  fair  sharing  

–  Applies  to  any  queuing  policy  that  gives  ordered  job  list  

•  Used  in  Hadoop  Fair  Scheduler  to  implement  hierarchical  policy  

(23)

Thanks!  

•  This  work  is  open  source,  packaged  with  

Hadoop  as  the  Hadoop  Fair  Scheduler  

– Old  version  without  delay  scheduling  in  use  at  

Facebook  since  fall  2008  

– Delay  scheduling  will  appear  in  Hadoop  0.21  

(24)

Related  Work  

•  Quincy  (SOSP  ‘09)  

–  Locality  aware  fair  scheduler  for  Dryad  

–  Expresses  scheduling  problem  as  min-­‐cost  Tlow   –  Kills  tasks  to  reassign  nodes  when  jobs  change   –  One  task  slot  per  machine  

•  HPC  scheduling  

–  Inelastic  jobs  (cannot  scale  up/down  over  time)   –  Data  locality  generally  not  considered  

•  Grid  computing  

–  Focused  on  common  interfaces  for  accessing  compute   resources  in  multiple  administrative  domains  

(25)

Delay  Scheduling  Details  

when  there  is  a  free  task  slot  on  node  n:  

 sort  jobs  according  to  queuing  policy    for  j  in  jobs:  

   if  j  has  node-­‐local  task  t  on  n:  

     j.level  :=  0;  j.wait  :=  0;  return  t  

   else  if  j  has  rack-­‐local  task  t  on  n  and  (j.level  ≥  1  or  j.wait  ≥  T1):        j.level  :=  1;  j.wait  :=  0;  return  t  

   else  if  j.level  =  2  or  (j.level  =  1  and  j.wait  ≥  T2)                  or  (j.level  =  0  and  j.wait  ≥  T1  +  T2):  

     j.level  :=  2;  j.wait  :=  0;  return  t      else:  

(26)

Sensitivity  Analysis  

5%   68%   98%   100%   11%   80%   99%   100%   0%   20%   40%   60%   80%   100%  

No  Delay   1s  Delay   5s  Delay   10s  Delay  

P er ce n t  L oc al  M ap s  

(27)

Hadoop  Terminology  

•  Job  =  entire  MapReduce  computation;  consists  of  

map  &  reduce  tasks  

•  Task  =  unit  of  work  that  is  part  of  a  job  

•  Slot  =  location  on  a  slave  where  a  task  can  run  

Output   File   Map   Map   Map   Reduce   Reduce   Input  F ile   Block   Block   Block  

(28)

Analysis  

•  Required  waiting  time  is  a  fraction  of  the  

average  task  length  as  long  as  there  are  

many  locations  where  a  task  can  run  

– Ex:  with  3  replicas  /  block  and  8  tasks  /  node,  

wait  1/24

th

 of  average  task  length  

•  Non-­‐locality  decreases  exponentially  with  

waiting  time  

(29)

   Fraction  of  mean  task  length              Decreases  with  #  slots/node      

Analysis  

•  Suppose  we  have:  

–  M  machines  

–  L  slots/machine  

–  T  second  average  task  length  

–  Job  with  data  on  fraction  f  of  nodes  

•  Let  D  =  max  #  of  scheduling  opportunities  skipped  

•  P(non-­‐local)  =  (1-­‐f)

D  

•  waiting  time  =  D  *  T/(ML)  

(30)

Small  Jobs  Experiment  

•  10-­‐20  min  benchmark  

consisting  of  several  

hundred  IO-­‐heavy  jobs  

•  Job  sizes:  3,  10  and  100  

map  tasks  

•  Environment:  100-­‐node  

cluster  on  4  racks;  5  map  

slots  per  node  

0

0.2 0.4 0.6 0.8 1

3 Maps 10 Maps 100 Maps

Normalized Running T

ime

Job Size (Number of Maps)

No Delay Scheduling With Delay Scheduling Job  Size   Default  Scheduler  

Node  Loc.   Rack  Loc.  

3  maps   2%   50%  

10  maps   37%   98%  

100  maps   84%   99%  

With  Delay  Sched.   Node  Loc.   Rack  Loc.  

75%   94%  

99%   99%  

References

Related documents

This article discusses three of the newer aspects of the vitamin D system which can differ among individuals who do not have overt dis- ease and which should influence

These systems use data such as search query logs, ratings from similar users, social connections and much more to predict unknown relevance, as we shall see.. Recommender systems

 As a result of the ongoing investigation and in order to further maximize recovery on behalf of the Empire Plan from Kings Pharmacy fraudulent activity, UHC instructed Medco

Marlow is here beginning to draw the antievolutionary implications internal to his evolutionary account of mimetic frenzy: namely, that with respect to strong, affective

This study used secondary data from 2006 to 2013 of 33 provinces in Indonesia; they are a number of people working (employment), Economic Growth Rate, Foreign Direct

Abdelhak, &#34;Hybrid Approaches For Automatic Vowelization Of Arabic Texts,&#34; International Journal on Natural Language Computing (IJNLC),

As illustrated by the literature review and the findings of my research, the main practical implication is the support to the creation of clear strategies for small and

Replacing Dietary Corn with Bakery By-products Supplemented with Enzyme and Evaluating Performance of Laying