(1)

MANAGING RESOURCES IN A BIG DATA CLUSTER

Gautier Berthou (SICS)

(2)
(3)

Where Does the Data Come From?

On-line services: PBs per day

Scientific instruments: PBs per minute

Whole-genome sequencing: 250 GB per person

Internet-of-Things: will be lots!

(4)
(5)

Figure: HDFS.

(6)

What do we do with this data?

Batch

Large quantities of data (terabytes).

Stored on a large number of machines (100 to 5,000).

The user wants to run computations on this data as efficiently as possible.

Streaming

Large quantities of data.

The data arrives as a stream generated by different sources.

The user wants to run computations on the data on the fly, with the guarantee that the stream and the computation will run without any interruption.

(7)

Example of batch processing: MapReduce

Figure: the client submits a job, e.g. Job("/crawler/bot/jd.io/1"), to the Job Tracker, which splits it into tasks and hands them to the Task Trackers running on the cluster nodes.
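For concreteness, here is the canonical Hadoop MapReduce word-count program (a generic example, not the crawler job from the figure; the input and output paths are hypothetical command-line arguments). The map tasks run by the Task Trackers emit (word, 1) pairs and the reduce tasks sum them.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Hypothetical input/output paths passed on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```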

(8)

Scenario

Three kinds of jobs:

Emergency jobs: need to be run as soon as possible.

Production jobs: have a deadline and a known running time, and are very demanding about the nodes they can be scheduled on.

Best-effort jobs: interactive jobs that have lower priority, but on

(9)

Capacity scheduler

Figure: cluster timeline from "Now" to the production job's deadline, showing best-effort, production, and emergency jobs.

(10)

Fair scheduler

Figure: the same timeline ("Now" to the production job's deadline), showing best-effort and emergency jobs under the fair scheduler.

(11)

Reservation-based scheduler

Figure: the same timeline ("Now" to the production job's deadline); the production job's resources are reserved ahead of its deadline, with best-effort and emergency jobs filling the remaining capacity.
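A minimal sketch of the admission check such a scheduler can perform, assuming production jobs declare their concurrent container need, running time, and deadline. The class and method names are hypothetical (this is not YARN's actual ReservationSystem API); it only illustrates the idea of reserving container-hours before the deadline.

```java
import java.time.Duration;
import java.time.Instant;

public class ReservationSketch {

  static final class ProductionJob {
    final int containers;        // containers needed concurrently
    final Duration runningTime;  // known running time
    final Instant deadline;

    ProductionJob(int containers, Duration runningTime, Instant deadline) {
      this.containers = containers;
      this.runningTime = runningTime;
      this.deadline = deadline;
    }
  }

  /** Container-hours still unreserved between now and the deadline. */
  static double freeContainerHours(int clusterContainers, double reservedContainerHours,
                                   Instant now, Instant deadline) {
    double windowHours = Duration.between(now, deadline).toMinutes() / 60.0;
    return clusterContainers * windowHours - reservedContainerHours;
  }

  /** Admit the job only if its demand fits in the remaining window. */
  static boolean admit(ProductionJob job, int clusterContainers,
                       double reservedContainerHours, Instant now) {
    if (now.plus(job.runningTime).isAfter(job.deadline)) {
      return false; // cannot finish before the deadline even if started now
    }
    double demand = job.containers * (job.runningTime.toMinutes() / 60.0);
    return demand <= freeContainerHours(clusterContainers, reservedContainerHours,
                                        now, job.deadline);
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    // 50 containers for 2 hours, due in 6 hours, on a 200-container cluster
    // that has already promised 400 container-hours to other reservations.
    ProductionJob job =
        new ProductionJob(50, Duration.ofHours(2), now.plus(Duration.ofHours(6)));
    System.out.println(admit(job, 200, 400.0, now)); // true: 100 <= 200*6 - 400 = 800
  }
}
```

Best-effort jobs then run in whatever capacity is left outside the reservations, which is what the figure shows.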

(12)

Scheduler Architectures

(13)

The monolithic Scheduler

Yarn: Apache Hadoop YARN: Yet Another Resource Negotiator, V. K. Vavilapalli et al., SoCC'13.

Borg: Large-scale cluster management at Google with Borg, A. Verma et al., EuroSys'15.

(14)
(15)

Architecture 2/3

Figure: the cluster's data nodes.

(16)

Architecture 3/3

Figure: the data nodes report to a master resource manager; standby resource managers stand ready to take over, with ZooKeeper coordinating which one is the master.
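As a hedged illustration of how the standby resource managers can agree on a single master via ZooKeeper, here is the standard ephemeral-sequential leader-election recipe (this is not the actual YARN failover code; the /rm-election path is hypothetical and assumed to already exist).

```java
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RmLeaderElection {

  private final ZooKeeper zk;
  private final String myNode;

  public RmLeaderElection(String connectString) throws Exception {
    zk = new ZooKeeper(connectString, 5000, event -> { /* session events ignored here */ });
    // Ephemeral sequential znode: removed automatically if this RM crashes or
    // loses its ZooKeeper session; the sequence number orders the candidates.
    myNode = zk.create("/rm-election/rm-", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
  }

  /** True if our znode has the lowest sequence number among the candidates. */
  public boolean isMaster() throws Exception {
    List<String> candidates = zk.getChildren("/rm-election", false);
    Collections.sort(candidates);
    return myNode.endsWith(candidates.get(0));
  }
}
```

When the master dies, its ephemeral znode disappears and the standby with the next-lowest sequence number becomes master.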

(17)

Pros and Cons

Pros:

Fine-grained knowledge of the cluster state -> optimal use of the cluster resources.

Easy to implement new scheduling policies.

Cons:

Bottleneck.

The failure of the master scheduler has a big impact on the cluster.

(18)

Two-level Scheduler

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, B. Hindman et al., NSDI'11.

(19)
(20)

Architecture 2/2

Figure: the Mesos master manages the data nodes and offers their resources to the MapReduce, Spark, and Flink schedulers; each framework scheduler sees only a partial state of the cluster.
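A conceptual sketch of the offer-based, two-level model (simplified and not the real Mesos API; all names are illustrative): the master owns the free resources and offers them to the framework schedulers, each of which only sees the offers it receives and decides which to use and which to decline.

```java
import java.util.List;

public class OfferModelSketch {

  record Offer(String node, int cpus, int memGb) {}
  record Task(String name, int cpus, int memGb) {}

  /** A framework-side scheduler (e.g. the MapReduce or Spark scheduler). */
  interface FrameworkScheduler {
    /** Return the tasks to launch on this offer; an empty list declines it. */
    List<Task> resourceOffers(Offer offer);
  }

  /** Master side: hand each free node's resources to one framework at a time. */
  static void offerRound(List<Offer> freeResources, List<FrameworkScheduler> frameworks) {
    int i = 0;
    for (Offer offer : freeResources) {
      FrameworkScheduler fw = frameworks.get(i++ % frameworks.size());
      List<Task> tasks = fw.resourceOffers(offer);
      if (tasks.isEmpty()) {
        System.out.println(offer.node() + ": declined, master may re-offer it elsewhere");
      } else {
        System.out.println(offer.node() + ": launching " + tasks);
      }
    }
  }

  public static void main(String[] args) {
    FrameworkScheduler greedy = offer ->
        offer.cpus() >= 2 ? List.of(new Task("map-task", 2, 4)) : List.of();
    FrameworkScheduler picky = offer -> List.of(); // declines everything this round
    offerRound(List.of(new Offer("node-1", 8, 32), new Offer("node-2", 1, 4)),
               List.of(greedy, picky));
  }
}
```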

(21)

Pros and Cons

Pros:

Scales out by adding schedulers.

Concurrent scheduling of tasks.

Cons:

Suboptimal use of the cluster, especially when there exist long

(22)

Shared State Scheduler

Omega: flexible, scalable schedulers for large compute clusters, M. Schwarzkopf et al., EuroSys'13.

(23)
(24)

Architecture 2/2

Figure: the data nodes, a state manager holding the global cluster state, and the MapReduce, Spark, and Flink schedulers, each working against the global state.

(25)

Architecture 2/2

Figure: the state manager and the MapReduce, Spark, and Flink schedulers.
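A small sketch of the shared-state idea behind Omega (simplified, not Omega's implementation): each scheduler reads a snapshot of the global state, makes its placement decision, and commits it optimistically; if another scheduler committed a conflicting change in the meantime, the commit fails and the decision is retried against fresh state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class SharedStateSketch {

  /** Immutable snapshot of the cluster: free slots per node, plus a version number. */
  record ClusterState(long version, Map<String, Integer> freeSlots) {}

  private final AtomicReference<ClusterState> globalState;

  public SharedStateSketch(Map<String, Integer> freeSlots) {
    globalState = new AtomicReference<>(new ClusterState(0, Map.copyOf(freeSlots)));
  }

  /** One scheduler trying to claim `slots` slots on `node`. */
  public boolean tryPlace(String node, int slots) {
    while (true) {
      ClusterState snapshot = globalState.get();          // read the shared state
      int free = snapshot.freeSlots().getOrDefault(node, 0);
      if (free < slots) {
        return false;                                     // not enough room on that node
      }
      Map<String, Integer> updated = new HashMap<>(snapshot.freeSlots());
      updated.put(node, free - slots);
      ClusterState next = new ClusterState(snapshot.version() + 1, Map.copyOf(updated));
      // Optimistic commit: succeeds only if nobody changed the state since we read it
      // (compareAndSet relies on the identity of the snapshot we read; the version
      // number just mirrors the idea of a versioned cell state).
      if (globalState.compareAndSet(snapshot, next)) {
        return true;
      }
      // Conflict with another scheduler's commit: loop and retry with fresh state.
    }
  }
}
```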

(26)

Pros and Cons

Pros:

Scalable.

Good use of the cluster resources.

Cons:

Unpredictable interaction between the different schedulers'

(27)

Sum up

Two-level and shared-state schedulers scale better.

Shared-state schedulers make better use of the cluster resources than two-level schedulers.

Monolithic schedulers are a potential bottleneck.

In practice the monolithic bottleneck is rarely, if ever, reached. And, as the monolithic scheduler is easier to implement and allows more advanced scheduling policies, it is the scheduler architecture used in Hadoop and by Google.

(28)

Hadoop

Figure: the Hadoop 2.x stack, top to bottom:

MapReduce (data processing) and other processing frameworks (Spark, MPI, Giraph, etc.)

YARN (resource mgmt, job scheduler)

HDFS (distributed storage)

(29)

Making Yarn more scalable

HOPS YARN: a one and a half level scheduler

www.hops.io

(30)

Hadoop Yarn HA Implementation

Figure: the data nodes, a master resource manager, and two standby resource managers, with ZooKeeper coordinating the failover.

(31)

Hops Yarn HA Implementation 3/3

Figure: the master and standby resource managers all connect to NDB, which stores the scheduler state.

(32)

MySQL Cluster (NDB) – Shared-Nothing DB

Distributed, in-memory: 2-phase commit; replicate the DB, not the log!

Real-time: low TransactionInactive timeouts; commodity hardware.

Scales out: millions of transactions/sec; TB-sized datasets (48 nodes); split-brain solved with the Arbitrator pattern.

SQL and native, blocking/non-blocking APIs: the SQL API and the NDB API.

(33)

Standby is boring

Figure: the master resource manager and the idle standby resource managers.

(34)
(35)

Difficulties 2/2

Pulling from the database whenever the state is needed is inefficient.

Having an independent thread that regularly polls the database is difficult to tune and causes lock problems.

(36)

Solution

(37)

With streaming

Figure: the master resource manager streams its state changes to the standby resource managers.
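A hypothetical sketch of the streaming approach (illustrative only, not the HopsYARN code; all names are made up): committed state changes are pushed to the standby as events, and the standby applies them to an in-memory copy instead of polling the database, so it is nearly up to date when it has to take over.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class StateStreamSketch {

  /** One committed change of the resource-manager state, e.g. a container allocation. */
  record StateEvent(String containerId, String node, boolean allocated) {}

  private final BlockingQueue<StateEvent> stream = new LinkedBlockingQueue<>();
  private final Map<String, String> containerToNode = new ConcurrentHashMap<>();

  /** Called on the master/database side whenever a change is committed. */
  public void publish(StateEvent event) {
    stream.add(event);
  }

  /** Runs on the standby: apply events as they arrive instead of polling. */
  public void applyLoop() throws InterruptedException {
    while (true) {
      StateEvent event = stream.take();      // blocks until the next change arrives
      if (event.allocated()) {
        containerToNode.put(event.containerId(), event.node());
      } else {
        containerToNode.remove(event.containerId());
      }
    }
  }
}
```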

(38)

Conclusion

There exist three architectures for large-cluster resource scheduling:

Monolithic

Two-level

Shared state

Each of these architectures has its pros and cons.

The monolithic architecture is the one presently used because it is easier to use and develop.

At KTH and SICS we are exploring the possibilities for a new architecture that provides more scalability while keeping the advantages of the monolithic architecture.

(39)

Project proposition: Quota-based scheduling

What we have seen so far allows cluster resources to be used optimally.

Scheduling policies implemented in Yarn, such as the Capacity scheduler or the Fair scheduler, allow the scheduler to give priority to some users over others.

But: none of this allows us to follow, manage, and limit a user's consumption of resources over time.

Google Borg provides this feature; we need to add it to
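A hypothetical sketch of what such quota accounting could look like (all names are illustrative, not an existing API): users are charged container-hours as their containers run, and new allocations are checked against the remaining quota.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class QuotaAccountingSketch {

  /** Remaining container-hours per user. */
  private final Map<String, Double> remainingContainerHours = new ConcurrentHashMap<>();

  public QuotaAccountingSketch(Map<String, Double> initialQuotas) {
    remainingContainerHours.putAll(initialQuotas);
  }

  /** Charge a user for containers that ran for some duration. */
  public void charge(String user, int containers, Duration runtime) {
    double hours = containers * runtime.toMinutes() / 60.0;
    remainingContainerHours.merge(user, -hours, Double::sum);
  }

  /** Would this request keep the user within quota? */
  public boolean mayAllocate(String user, int containers, Duration expectedRuntime) {
    double cost = containers * expectedRuntime.toMinutes() / 60.0;
    return remainingContainerHours.getOrDefault(user, 0.0) >= cost;
  }

  public static void main(String[] args) {
    QuotaAccountingSketch quotas =
        new QuotaAccountingSketch(Map.of("alice", 1000.0, "bob", 10.0));
    quotas.charge("bob", 4, Duration.ofHours(2));                            // bob used 8 of 10 hours
    System.out.println(quotas.mayAllocate("bob", 4, Duration.ofHours(1)));   // false
    System.out.println(quotas.mayAllocate("alice", 4, Duration.ofHours(1))); // true
  }
}
```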

(40)

References

Reservation-based Scheduling: If You're Late Don't Blame Us!, C. Curino et al., Microsoft tech report.

Omega: flexible, scalable schedulers for large compute clusters, M. Schwarzkopf et al., EuroSys'13.

Apache Hadoop YARN: Yet Another Resource Negotiator, V. K. Vavilapalli et al., SoCC'13.

Large-scale cluster management at Google with Borg, A. Verma et al., EuroSys'15.

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, B. Hindman et al., NSDI'11.
