Managing large cluster resources

(1)

ID2210

Gautier Berthou (SICS)

(2)

Big Data Processing with No Data Locality

[Figure: Job(“/crawler/bot/jd.io/1”) is submitted to the Workflow Manager, which runs it on compute grid nodes.]

This doesn’t scale.

Bandwidth is the bottleneck

(3)

MapReduce – Data Locality

[Figure: Job(“/crawler/bot/jd.io/1”) is submitted to the Job Tracker. Six data nodes (DN) each store three of the blocks numbered 1 to 6, and Task Trackers run the job on the data nodes.]
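The idea behind data locality can be sketched in a few lines: a task is placed, whenever possible, on a node that already stores the block it reads. This is only a minimal illustration, not the Job Tracker's actual algorithm; the function name `assign_tasks`, the node names and the slot counts are all hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that stores the block.

    block_locations: dict block_id -> list of node names holding a replica
    free_slots: dict node name -> number of free task slots
    Returns dict block_id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignment = {}
    for block, replicas in sorted(block_locations.items()):
        # First choice: a node that already holds the block (no network copy).
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node, is_local = max(local, key=lambda n: slots[n]), True
        else:
            # Fallback: any node with a free slot; the block must be shipped.
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError("no free slot left in the cluster")
            node, is_local = max(candidates, key=lambda n: slots[n]), False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# Hypothetical cluster: three data nodes, one free slot each.
blocks = {1: ["dn1", "dn2"], 2: ["dn1", "dn3"], 3: ["dn2", "dn3"]}
plan = assign_tasks(blocks, {"dn1": 1, "dn2": 1, "dn3": 1})
print(plan)  # {1: ('dn1', True), 2: ('dn3', True), 3: ('dn2', True)}
```

In this example every task runs where its block lives, so no block crosses the network.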

(4)

First step

Hadoop 1.x

HDFS

(distributed storage)

MapReduce

(resource mgmt, job scheduler,

data processing)

Single Processing Framework

Batch Apps

(5)

MapReduce – Data Locality

[Figure repeated: Job(“/crawler/bot/jd.io/1”) is submitted to the Job Tracker, and Task Trackers run the job on the data nodes (DN) that store its blocks.]

(6)

The Job Tracker's Job

•Distribute the map and reduce tasks across the nodes of the cluster

•Ensure that cluster resources are allocated fairly

•Track the progress of these tasks

•Authenticate job tenants and make sure that each job is isolated from the others

•Etc.

(7)

Limitations of MapReduce [Zaharia’11]

[Figure: Input -> Map -> Reduce -> Output]

• MapReduce is based on an acyclic data flow from stable storage to stable storage.

- Slow: data is written to HDFS at every stage in the pipeline

• Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:

- Iterative algorithms (machine learning, graphs)

- Interactive data mining tools (R, Excel, Python)

(8)

Limitations

•Only one programming model.

•The MapReduce framework does not use the cluster to its full capacity.

•The job tracker is a bottleneck.

•The job tracker is a single point of failure.

(9)

Goals for a new Scheduler

•Be able to run different frameworks

•Scale

•Provide advanced scheduling policies

•Run efficiently with different kinds of workloads

(10)

Second step

Hadoop 2.x

MapReduce

(data processing)

HDFS

(distributed storage)

YARN

(resource mgmt, job scheduler)

Others

(Spark, MPI, Giraph, etc.)

Multiple Processing Frameworks

Batch, Interactive, Streaming …

(11)

Examples of scheduling policies

•Capacity scheduler:

- Applications have different priority levels.

•Fair scheduler:

- Applications have different priority levels.

- Used resources can be preempted.

•Reservation-based scheduler ⁽¹⁾:

- Applications can indicate how long they will run and when they have to be finished.

(1) Reservation-based Scheduling: If you’re late don’t blame us!, C. Curino & al., Microsoft tech-report

(12)

Scenario

•3 kinds of jobs:

- Emergency jobs: need to be run as soon as possible.

- Production jobs: have a deadline and a known running time, and are very demanding about the nodes they can be scheduled on.

- Best effort jobs: interactive jobs that have lower priority, but for which users expect low latency.
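To make the production-job case concrete, here is a minimal sketch of a reservation check: a job declares its resource demand, its running time and its deadline, and the scheduler looks for a window that fits. The function `reserve`, the discrete time slots and the single resource dimension are assumptions made for illustration. Starting the job as late as possible is also just one possible choice; it keeps the early slots free for emergency and best effort jobs.

```python
def reserve(free, demand, duration, deadline):
    """Try to reserve `demand` units for `duration` consecutive slots
    that all end by `deadline`.

    free: list where free[t] is the unreserved capacity in time slot t
          (updated in place when the reservation succeeds)
    Returns the chosen start slot, or None if the deadline cannot be met.
    """
    deadline = min(deadline, len(free))
    # Scan start times from latest to earliest so early slots stay available.
    for start in range(deadline - duration, -1, -1):
        window = range(start, start + duration)
        if all(free[t] >= demand for t in window):
            for t in window:
                free[t] -= demand
            return start
    return None


# Hypothetical cluster: 10 units of capacity over 6 time slots.
capacity = [10] * 6
first = reserve(capacity, demand=8, duration=2, deadline=6)
second = reserve(capacity, demand=8, duration=2, deadline=6)
third = reserve(capacity, demand=8, duration=3, deadline=3)
print(first, second, third)  # 4 2 None
print(capacity)              # [10, 10, 2, 2, 2, 2]
```

The third job is rejected up front because no window before its deadline has enough free capacity, which is the point of asking applications for their running time and deadline.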

(13)

Capacity scheduler

[Figure: timeline from Now to the production work deadline, showing where the capacity scheduler places the best effort, emergency and production jobs.]

(14)

Fair scheduler

[Figure: timeline from Now to the production work deadline, showing where the fair scheduler places the best effort, emergency and production jobs.]

(15)

Reservation-based scheduler

[Figure: timeline from Now to the production work deadline, showing where the reservation-based scheduler places the best effort, emergency and production jobs.]

(16)

Scheduler Architectures

Omega: flexible, scalable schedulers for large compute clusters, Malte Schwarzkopf & al., EuroSys’13

(17)

The monolithic Scheduler

•YARN:

- Apache Hadoop YARN: Yet Another Resource Negotiator, V. K. Vavilapalli & al., SoCC’13.

•Borg:

- Large-scale cluster management at Google with Borg, A. Verma & al., EuroSys’15.

(18)

Architecture 1/3

(19)

Architecture 2/3

[Figure: data nodes and a single Resource Manager.]

(20)

Architecture 3/3

[Figure: data nodes, a master Resource Manager and standby Resource Managers, with ZooKeeper.]

(21)

Pros and Cons

•Pros:

- Fine-grained knowledge of the cluster state -> optimal use of the cluster resources.

- Easy to implement new scheduling policies.

•Cons:

- Bottleneck.

- The failure of the master scheduler has a big impact on cluster usage.

(22)

Two-level Scheduler

• Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, B. Hindman & al., NSDI’11

(23)

Architecture 1/2

(24)

Architecture 2/2

[Figure: data nodes, the Mesos Master, and the MapReduce, Spark and Flink schedulers, each working on a partial state of the cluster.]

(25)

Pros and Cons

•Pros:

- Scale out by adding schedulers.

- Concurrent scheduling of tasks.

•Cons:

- Suboptimal use of the cluster, especially when there are long-running tasks.
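The two-level idea can be illustrated with a toy resource-offer loop: a master offers free resources to one framework scheduler at a time, and each framework decides on its own what to launch. This is a simplified sketch in the spirit of Mesos, not its real API; the class `FrameworkScheduler`, the function `offer_round` and the CPU-only resource model are hypothetical.

```python
class FrameworkScheduler:
    """Second level: a framework that only sees the offers it receives."""

    def __init__(self, name, task_cpus, pending):
        self.name = name
        self.task_cpus = task_cpus  # CPUs needed by one task
        self.pending = pending      # tasks still waiting to run

    def on_offer(self, node, cpus):
        """Return how many tasks to launch on this offer (0 means decline)."""
        launched = min(self.pending, cpus // self.task_cpus)
        self.pending -= launched
        return launched


def offer_round(free_cpus, frameworks):
    """First level: offer each node's free CPUs to the frameworks in turn."""
    placements = []
    for node in sorted(free_cpus):
        for fw in frameworks:
            cpus = free_cpus[node]
            if cpus == 0:
                break  # nothing left to offer on this node
            launched = fw.on_offer(node, cpus)
            if launched:
                free_cpus[node] -= launched * fw.task_cpus
                placements.append((fw.name, node, launched))
    return placements


# Hypothetical cluster: two nodes with 4 free CPUs each.
spark = FrameworkScheduler("spark", task_cpus=2, pending=3)
mapreduce = FrameworkScheduler("mapreduce", task_cpus=1, pending=3)
result = offer_round({"n1": 4, "n2": 4}, [spark, mapreduce])
print(result)  # [('spark', 'n1', 2), ('spark', 'n2', 1), ('mapreduce', 'n2', 2)]
```

Each framework decides with only the offer in front of it. Here one MapReduce task stays pending although a different packing of the same 8 CPUs might have served the frameworks differently, which hints at why placements can be suboptimal.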

(26)

Shared State Scheduler

•Omega: flexible, scalable schedulers for large compute clusters, M. Schwarzkopf & al., EuroSys’13

(27)

Architecture 1/2

(28)

Architecture 2/2

[Figure: data nodes, the State Manager holding the global state, and the MapReduce, Spark and Flink schedulers.]

(29)

Architecture 2/2

[Figure: same architecture, with a conflict marked on the global state.]

(30)

Pros and Cons

•Pros

- Scalable.

- Good use of the cluster resources.

•Cons

- Unpredictable interaction between the different schedulers’ policies.
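One way to picture a conflict on the shared state is optimistic concurrency: every scheduler plans against its own snapshot and its commit is rejected if the node changed in the meantime. The sketch below assumes a per-node version counter; the class `SharedState` and its methods are hypothetical and much simpler than what Omega does.

```python
class SharedState:
    """Cluster state with a per-node version used to detect conflicts."""

    def __init__(self, free_cpus):
        self.free = dict(free_cpus)
        self.version = {node: 0 for node in free_cpus}

    def snapshot(self):
        """Give a scheduler its own copy of the full cluster state."""
        return dict(self.free), dict(self.version)

    def commit(self, node, cpus, seen_version):
        """Apply a claim only if the node is unchanged since the snapshot."""
        if self.version[node] != seen_version or self.free[node] < cpus:
            return False  # conflict: the scheduler must refresh and retry
        self.free[node] -= cpus
        self.version[node] += 1
        return True


state = SharedState({"n1": 4})

# Two schedulers take a snapshot of the same state and both pick node n1.
_, seen_by_a = state.snapshot()
_, seen_by_b = state.snapshot()
a_ok = state.commit("n1", 3, seen_by_a["n1"])
b_ok = state.commit("n1", 3, seen_by_b["n1"])  # stale snapshot

# Scheduler B refreshes its view and retries with what is left.
free_now, seen_by_b = state.snapshot()
b_retry = state.commit("n1", free_now["n1"], seen_by_b["n1"])
print(a_ok, b_ok, b_retry, state.free)  # True False True {'n1': 0}
```

Nothing in this scheme coordinates the schedulers' policies: whichever scheduler commits first wins the resources, so the overall behaviour depends on how the different policies happen to interleave.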

(31)

Comparison

(32)

Performance comparison 1/

[Figure panels: Simple monolithic, Advanced monolithic, Two-level, Shared state]

(33)

Performance Comparison 2/

What the previous evaluation does not show about two-level scheduling:

(34)

Performance Comparison 3/

How to write a good paper.

How to write an accepted paper.

Trying to handle more batch jobs in Omega by running several batch schedulers in parallel.

(35)

Sum up

•Two-level and shared state schedulers scale better.

•Shared state schedulers use the cluster resources more efficiently than two-level schedulers.

•Monolithic schedulers are a potential bottleneck.

•But as monolithic schedulers are easier to design, allow finer allocation of resources and more advanced scheduling policies, they are the ones used in practice.

(36)

Making YARN more scalable

•HOPS YARN: a one and a half level scheduler

(37)

Hadoop YARN HA Implementation

[Figure: data nodes, a master Resource Manager and standby Resource Managers, with ZooKeeper.]

(38)

Hops YARN HA Implementation

[Figure: data nodes, a master Resource Manager and standby Resource Managers, with NDB in place of ZooKeeper.]

(39)

MySQL Cluster (NDB) – Shared Nothing DB

• Distributed, In-memory

• 2-Phase Commit

- Replicate DB, not the Log!

• Real-time

- Low TransactionInactive timeouts

• Commodity Hardware

• Scales out

- Millions of transactions/sec

- TB-sized datasets (48 nodes)

• Split-Brain solved with Arbitrator Pattern

• SQL and Native Blocking/Non-Blocking APIs

[Figure: SQL API and NDB API. 30+ million update transactions/second on a 30-node cluster]

(40)

Standby is boring

[Figure: data nodes, a master Resource Manager and two standby Resource Managers.]

(41)

Difficulties 1/2

(42)

Difficulties 2/2

•Pulling from the database whenever the state is needed is inefficient.

•Having an independent thread that regularly pulls from the database is difficult to tune and causes lock problems.

(43)

Solution

•Luckily NDB has an event API.
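The difference between pulling and being notified can be sketched with a toy store that pushes every write to its subscribers, so a standby keeps an up-to-date copy of the state without polling. This is not the NDB event API; `EventedStore`, `StandbyCache` and the key names are hypothetical stand-ins used only to show the pattern.

```python
class EventedStore:
    """Toy key-value store that pushes every write to its subscribers."""

    def __init__(self):
        self.rows = {}
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def write(self, key, value):
        self.rows[key] = value
        for callback in self.subscribers:
            callback(key, value)  # change is streamed, nobody has to poll


class StandbyCache:
    """Standby component that mirrors the store from the event stream."""

    def __init__(self, store):
        self.state = {}
        self.events_seen = 0
        store.subscribe(self.on_event)

    def on_event(self, key, value):
        self.state[key] = value
        self.events_seen += 1


store = EventedStore()
standby = StandbyCache(store)
store.write("node-1", "RUNNING")
store.write("node-2", "RUNNING")
store.write("node-1", "LOST")
print(standby.state)        # {'node-1': 'LOST', 'node-2': 'RUNNING'}
print(standby.events_seen)  # 3
```

The standby does exactly one unit of work per change and never has to choose a polling interval, which sidesteps both difficulties listed above.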

(44)

With streaming

[Figure: data nodes, a master Resource Manager and two standby Resource Managers.]

(45)

Conclusion

• There are three architectures for large cluster resource scheduling:

- Monolithic

- Two-level

- Shared state

• Each of these architectures has pros and cons.

• The monolithic architecture is the one presently used because it is easier to use and develop.

• At KTH and SICS we are exploring the possibilities for a new architecture ensuring more scalability while keeping the advantages of the monolithic architecture.

(46)

References

• Reservation-based Scheduling: If you’re late don’t blame us!, C. Curino & al., Microsoft tech-report

• Omega: flexible, scalable schedulers for large compute clusters, Malte Schwarzkopf & al., EuroSys’13

• Apache Hadoop YARN: Yet Another Resource Negotiator, V. K. Vavilapalli & al., SoCC’13.

• Large-scale cluster management at Google with Borg, A. Verma & al., EuroSys’15.

• Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, B. Hindman & al., NSDI’11