• No results found

Fuxi: a Fault Tolerant Resource Management and Job Scheduling System at Internet Scale

N/A
N/A
Protected

Academic year: 2021

Share "Fuxi: a Fault Tolerant Resource Management and Job Scheduling System at Internet Scale"

Copied!
21
0
0

Loading.... (view fulltext now)

Full text

(1)

Fuxi: a Fault Tolerant Resource Management

and Job Scheduling System at Internet Scale

Zhuo Zhang

*

, Chao Li

*

, Yangyu Tao

*

, Renyu Yang

†*

, Hong Tang

*

, Jie Xu

†§

(2)

Agenda

Motivation

System Introduction

Contribution

Incremental resource management protocol

User transparent failure recovery

Faulty node detection and multilevel blacklist mechanism

Evaluation

(3)

Motivation

Big data era

2.5 exabytes of data are generated everyday

The speed of data generation doubles every 40 months

Billions of transactions on Taobao everyday, must be processed in 6 hours

Scalability challenges

10

3 ~

10

4

of machines

10

3 ~

10

4

of concurrent jobs

10

5 ~

10

6

of concurrent workers

10

4 ~

10

5

OPS for resource request/assign

Other factors: Multi-dimensional resource, quota, priority, locality, etc.

Fault tolerance challenges

Failures become normal at large scale

Partial failures worse than machine down

Master failures

(4)

Apsara: A Brief History

09/10/2009

Aliyun.com Inc established

08/27/2010

Apsara became the platform of four applications:

Search, Mail, Image Storage, AliFinance 07/28/2011 Aliyun.com went online, releasing 1st cloud service: ECS 08/15/2013 5000-node Aspara cluster went into production 10/24/2013 3rd Aliyun dev conference held in Hangzhou. ~5000 developers attended the conference

5K

(5)

Apsara Cloud Platform

Resource Management

Application Scheduling

Linux Clusters

Security

Management

Remote

Procedure

Call

Distributed Coordination

Deployme

nt

Monit

o

ring

Distributed File System

ACE

ECS/SLB

OSS

OTS

RDS

ODPS

Map, Mail, Search

Cloud Market

Other Cloud

Services

Resource scheduling and allocation Quota control

Preemption

Process life cycle control Process isolation

DAG Job

Long running service

(6)

FuxiMaster

Cluster

Node Cluster Node Cluster Node Cluster Node

. . .

App Master APP Worker App Master APP Worker

. . .

Client

. . .

. . .

. . .

. . .

FuxiAgent FuxiAgent Job scheduling Resource request and response Node management and status collection Job submission APP Worker APP Worker APP Worker FuxiAgent FuxiAgent

Fuxi Workflow

Process Execution Plan

(7)

Support Different Application Models

AppMaster lib

Resource request/return

Process execution

Heartbeat

Build-in models

DAG job (similar to Dryad and Tez)

Long running services

Mail service

Search service

Can support more models

spark, storm, etc.

InputFile

OutputFile

T1

T2

T3

(8)

Contributions

Scalability

Incremental resource management protocol

Fault tolerance

User transparent failure recovery

(9)

Incremental Resource Management Protocol

FuxiMaster

FuxiAgent1

FuxiAgent2

FuxiAgent3

AppMaster1

ScheduleUnit: {1cpu, 2gb}

AppMaster2

ScheduleUnit: {2cpu, 5gb} ScheduleUnit: {1cpu, 2gb} Apply: {M1*2, C*10} Max: 10 Assign: M1: +3, M2: +3, M3: +2

A1: +3 A1: +3 A1: +2

Return: M3: -1 Assign: M3: +2 A2: -1 A1: +2 Return: M1: -3 M2: -1 M3: -4 A1: -3 A1: -1 A1: -4 Return: M2: -2 A1: -2

(10)

Resource Scheduling

Resource

Allocation

New

Submission

Resource

Release

Resource

Preemption

Resource

Change

Quota

Change

Available Resource Pool Waiting Resource Requests Queue Fulfilled Resource Request Queue

Event driven real-time resource allocation

Locality tree based scheduling

Multi-layer schedule order

Cluster Rack1 Rack2 M1 M2 M3 M4 App1: P1,14 APP2: P1,9 APP3: P2,5 APP4: P3, 16 App1: P1,9 App2: P1,5 App3: P2,1 App4:P3,8 App1: P1,4 App2: P1,3 App3: P2,1 App1: P1,4 App3: P2,6 App1: P1,3 App3: P2,1 App4:P3,1 App1: P1,4 App2: P1,4 App3: P2,3 App4:P3,8 App3: P2,1 App4:P3,3

(11)

User Transparent Failure Recovery

Full stack failover support

FuxiMaster

JobMaster

FuxiAgent

Scalability consideration

Hard state

Soft state

(12)

FuxiMaster Failover

Hard state

Application configuration

Soft state

Machine list

Resource request

Resource schedule results

FuxiMaster

App Master1

FuxiAgent FuxiAgent FuxiAgent

App Master2 Checkpoint AppConfig App1:[…] App2:[…] ResourceAssign App1: 5 APP2: 3 ResourceAssign App1: 3 APP2: 2 ResourceAssign App1: 2 APP2: 3 ScheduleUnit:{1cpu,2GB Mem} Request: M1*2,M2*3, R2*3,C*8 MaxCount: 15 ScheduleUnit:{2cpu,3GB Mem} Request: M3*2,C*8 MaxCount: 10

Soft state

Soft state

Hard

state

(13)

Faulty Node Detection and Multilevel Blacklist

Cluster level: rule based

Hardware & OS: Disk statistics, machine load, network I/O, etc.

Apsara platform: OPS, state, heartbeat, etc.

Scoring system

Plugin based

Job level

(14)

Faulty Node Detection and Multilevel Blacklist

Cluster level

Job level: history based

One instance failed on one machine

Multiple instances failed on one same machine

Running slow on one machine

(15)

Fuxi in Production

In production since 2009

Manage hundreds of thousands of nodes

5000 nodes in one cluster

Job type at a glance

avg

max

total

Instance Number

228/task

99,937/task

42,266,899

Worker Number

89.92/task

4,636/task

16,295,167

(16)

Evaluation: 5000 nodes with 1000 concurrent jobs

FuxiMaster scheduling time

(17)

Evaluation: GraySort

Provenance

Configuration

GraySort Indi Result

Fuxi

(2013)

5000 nodes (2 2.20GHz 6cores Xeon E5-2430, 96 GB memory,12x2TB disks)

100TB in 2538 seconds (2.364TB/min)

Yahoo! Inc.

(2012)

2100 nodes (2 2.3Ghz hexcore Xeon E5-2630, 64 GB memory,12x3TB disks)

102.5 TB in 4,328 seconds (1.42TB/min)

UCSD

(2011)

52 nodes (2 Quadcore processors,24 GBmemory, 16x500GB disks) Cisco Nexus 5096 switch

100 TB in 6,395 seconds (0.938TB/min)

UCSD&VUT

(2010)

47 nodes (2 Quadcore processors,24 GB memory, 16x500GB disks) Cisco Nexus 5020 switch

100 TB in 10,318 seconds (0.582TB/min)

KIT

(2009)

195 nodes x (2 Quadcore processors,16 GB

memory,4x250GB disks)288-port InniBand 4xDDR switch

100 TB in 10,628 seconds (0.564TB/min)

(18)

Evaluation: Performance of Fault Handling

5% failure: 15.7% performance degradation

10% failure: 19.6% performance degradation

0 job fail

Injected Fault Type

Node Number

Node Number

Node Down

2

2

Partial Worker Failure

2

4

Slow Machine

11

23

(19)

Conclusion and Future Work

Conclusion

Scalability

Incremental communication/schedule instead of full run

Fault tolerance

Full stack failover support

Do checkpoint only for hard state

Detect faulty node by rules and history

Future work

Improve real resource utilization

More resource isolation

(20)
(21)

Alibaba Night

Recently Alibaba’s Chinese online commerce sites drove over US $5.7 billion dollars worth of sales in a single day. Learn what it takes to run these sites and how the Alibaba Group has impacted the Chinese economy. This event is specifically geared towards researchers, engineers and IT professionals who are interested in learning about the growth of ecommerce and technology.

Join us for this FREE event as we share learning’s and surge ahead in 2014!

Date: September 4th, 2014

Location: Alibaba Xixi Campus

Agenda:

- Tour of Xixi Campus

- Presentations from Alibaba - Q&A Session

Notes:

- Please register at the Alibaba booth in the Exhibition area.

- Due to limited space, we may not be able to accommodate everyone. Sorry! - This event will be primarily conducted in Mandarin.

References

Related documents

In addition to the instances described above, the PHA will make a determination of rent reasonableness at any time after the initial occupancy period if: (1) the PHA

On page 38, under (q), the RFP and resulting contract requires the contractor to submit a field staffing plan with “names of all personnel and subcontractors (if applicable); position

The above may be said to be linked to the code of Brown and Levinson (De Kadt, 1994) and according to which both the status and age difference between two people may contribute to

Also, an informed trade by a producer of one final good amounts to a noise trade from the perspective of a producer of another final good, so (a) as the price mechanism becomes

A Case of Co-Secreting TSH and Growth Hormone Pituitary Adenoma Presenting with a Thyroid Nodule Notes/Citation Information.. Published in Endocrinology, Diabetes & Metabolism

Section C : This will consist of essay type questions with answer to each question upto 5 pages in length.. Four questions will be set by the examiner and the candidate will

Selv om det ikke blir et alternativ ved selve folkeavstemningen, er økt selvstyre fremdeles et mulig utfall av prosessen – enten fordi den britiske regjeringen kan komme til å

• the new employer takes over the contracts of employment of all employees who were employed in the ‘organised grouping of resources or employees’ immediately after the transfer,