Hadoop Optimizations for BigData Analytics

(1)

Weikuan Yu

(2)

Outline

• Background

• Network Levitated Merge

• JVM-Bypass Shuffling

(3)

Emerging Demand for BigData Analytics

• Big demand from many organizations in various domains

– Scalable computing power without worrying about system maintenance.

– Ubiquitously accessible computing and storage resources.

– Low cost, highly reliable, trusted computing infrastructure.

(4)

MapReduce

• A simple data processing model to process big data

• Designed for commodity off-the-shelf hardware components

• Strong merits for big data analytics

– Scalability: increase throughput by increasing the number of nodes

– Fault tolerance: quick and low-cost recovery from task failures

• Hadoop, an open-source implementation of MapReduce

– Widely deployed by many big data companies: AOL, Baidu, eBay,
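To make the processing model concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps applied to word counting. It is only an illustration: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_word_count`) are hypothetical and do not correspond to the Hadoop API, and everything runs in a single process rather than across nodes.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every input record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key and order the keys, as the shuffle/sort step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups, reduce_fn):
    """Call reduce_fn once per key with all values collected for that key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    # Sum the partial counts for one word.
    return sum(counts)


def run_word_count(lines):
    """Run the three steps end to end on a list of text lines."""
    pairs = map_phase(lines, word_count_map)
    groups = shuffle_phase(pairs)
    return reduce_phase(groups, word_count_reduce)
```

In a real deployment the map and reduce functions run as separate tasks on different nodes, and the shuffle step moves the intermediate pairs across the network.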

(5)

High-Level Overview of Hadoop

[Figure: Hadoop architecture. Applications submit jobs to the JobTracker; Task Trackers/Runners execute MapTasks and ReduceTasks; intermediate data is shuffled from MapTasks to ReduceTasks; HDFS provides storage. The data movement steps are numbered in the figure.]

• HDFS and the MapReduce framework

• Data processing with MapTasks and ReduceTasks

• Three main steps of data movement

• Intermediate data shuffling in MapReduce is time-consuming

(7)

Motivation for Network-Levitated Merge

[Figure: timeline of the map, shuffle, merge, and reduce phases in Hadoop. Labels: Start, First MOF, Serialization.]

(8)

Repetitive Merges and Disk Access

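[Figure: repetitive merging of spilled segments on disk. Labeled steps: 1: merge, 2: insert, 3: merge, 4: to merge soon, with more segments still arriving.]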

• Hadoop data spilling controlled through parameters

–  To limit the number of outstanding files
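The cost of repetitive merging can be illustrated with a small simulation. This is a deliberately simplified model, not Hadoop's actual spill policy: the function name `simulate_spill_merges` and the single `max_files` threshold are assumptions made for illustration. Every arriving segment is spilled to disk, and whenever the number of on-disk files exceeds the threshold, all of them are merged into one file, which rewrites every byte they contain.

```python
def simulate_spill_merges(segment_sizes, max_files):
    """Return (bytes_written, merge_count) for a simplified spill-and-merge policy.

    segment_sizes: sizes of the segments in arrival order.
    max_files: hypothetical limit on the number of outstanding on-disk files.
    """
    files = []          # sizes of the files currently on disk
    bytes_written = 0   # total bytes written to disk, including rewrites
    merge_count = 0

    for size in segment_sizes:
        files.append(size)
        bytes_written += size            # initial spill of the new segment
        if len(files) > max_files:
            merged = sum(files)
            bytes_written += merged      # the merge rewrites all existing data
            files = [merged]
            merge_count += 1

    return bytes_written, merge_count
```

With six unit-sized segments and `max_files=2`, this model writes 14 units to disk for 6 units of payload, because earlier data is rewritten by each merge. With a limit high enough to avoid merging, it writes only 6.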

(9)

Hadoop Acceleration (Hadoop-A)

• Pipelined shuffle, merge and reduce

• Network-levitated data merge

[Figure: Hadoop-A architecture. Java side (Hadoop): JobTracker, TaskTrackers, MapTask, ReduceTask. C++ side (acceleration): MOFSupplier with Data Engine and RDMA Server; NetMerger with Fetch Manager, Merge Manager, Merging Thread, Merged Data, and RDMA Client. The two sides communicate over RDMA interconnects.]

(10)

Pipelined Data Shuffle, Merge and Reduce

[Figure: timeline of the map, shuffle, merge, and reduce phases with pipelining. Labels: start, First MOF, header, PQ setup.]
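The idea of pipelining merge and reduce can be sketched with Python generators: the reduce function consumes merged records as they are produced, instead of waiting for a fully merged file. This is an illustrative sketch only. The function name `pipelined_reduce` is hypothetical, the segments are in-memory lists rather than remote MOFs, and the standard-library `heapq.merge` stands in for the merge stage.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def pipelined_reduce(sorted_segments, reduce_fn):
    """Lazily merge sorted (key, value) segments and reduce each key group.

    sorted_segments: iterables of (key, value) pairs, each already sorted by key.
    reduce_fn: callable that takes an iterable of values and returns one result.
    Yields (key, reduced_value) as soon as all values of a key have been merged.
    """
    # heapq.merge is lazy: it pulls records from the segments only on demand.
    merged = heapq.merge(*sorted_segments, key=itemgetter(0))

    # groupby consumes the merged stream one key group at a time, so reduce
    # work for early keys starts before later records have been read.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, reduce_fn(value for _, value in group)
```

For example, `list(pipelined_reduce([[("a", 1), ("c", 3)], [("a", 2), ("b", 5)]], sum))` yields one summed value per key, in key order.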

(11)

Network-Levitated Merge Algorithm

[Figure: network-levitated merge of three sorted segments S1, S2, and S3 holding pairs <k1,v1>, <k2,v2>, <k3,v3>, ... into the merged data, with the merge point marked. Panels: (a) Fetching Header, (b) Priority Queue Setup.]
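The two panels named above, fetching headers and setting up a priority queue, can be sketched as follows. This is a minimal single-process model under stated assumptions, not the Hadoop-A implementation: `RemoteSegment` and `levitated_merge` are hypothetical names, a "network fetch" is simulated by slicing a list, and the first fetched chunk plays the role of the header. The point it illustrates is that records stay on the remote side until the merge needs them, so nothing is spilled to the local disk.

```python
import heapq
from collections import deque


class RemoteSegment:
    """Hypothetical stand-in for one sorted map-output segment on a remote node."""

    def __init__(self, name, records, chunk_size=2):
        self.name = name
        self._records = list(records)   # sorted (key, value) pairs, kept "remote"
        self._offset = 0
        self._chunk_size = chunk_size
        self.fetches = 0                # number of simulated network fetches

    def fetch(self):
        """Simulate one network fetch of up to chunk_size records."""
        chunk = self._records[self._offset:self._offset + self._chunk_size]
        self._offset += len(chunk)
        self.fetches += 1
        return chunk


def levitated_merge(segments):
    """Yield (key, value) pairs from all segments in globally sorted key order."""
    buffers = {}
    heap = []

    # (a) Fetch only the head of every segment.
    for index, segment in enumerate(segments):
        chunk = deque(segment.fetch())
        if chunk:
            buffers[index] = chunk
            # (b) Set up the priority queue, keyed by each segment's first key.
            heapq.heappush(heap, (chunk[0][0], index))

    # Merge: pop the smallest key, then refill that segment's buffer on demand.
    while heap:
        _, index = heapq.heappop(heap)
        key, value = buffers[index].popleft()
        yield key, value
        if not buffers[index]:
            buffers[index].extend(segments[index].fetch())
        if buffers[index]:
            heapq.heappush(heap, (buffers[index][0][0], index))
```

Only one small buffer per segment is resident at any time, which is what lets the merged stream be handed to the reduce function without intermediate disk writes.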

(12)

Job Progression with Network-levitated Merge

• Hadoop-A reduces the TeraSort execution time by more than 47%

• Both MapTasks and ReduceTasks are improved

[Figure: job progress (%) versus time (sec), 0 to 2000 sec, for Hadoop-A, Hadoop on IPoIB, and Hadoop on GigE. Panels: (a) Map Progress of TeraSort, (b) Reduce Progress of TeraSort.]

(13)

Breakdown of ReduceTask Execution Time (sec)

• Significantly reduced the execution time of ReduceTasks

• Most came from reduced shuffle/merge time

–  An improvement of 2.5 times

• Also improved the time to reduce data

–  An improvement of 15%

Category        PQ-Setup    Shuffle/Merge      Reduce or Merge/Reduce

Hadoop-GigE     -           1150.01 (65.0%)    599.97 (35.0%)

Hadoop-IPoIB    -           1148.26 (65.9%)    597.09 (34.1%)

(15)

WBDB, Oct 2012, S-15

JVM-Dependent Intermediate Data Shuffling

[Figure: the stock Hadoop shuffle path. Components shown: HDFS, JobTracker, and TaskTrackers; MapTasks producing MOFs (MOF1, MOF2) that are served by an HttpServlet; on the ReduceTask side, MOFCopiers, staging, sort/merge, and reduce.]

• Heavily relies on Java

(16)

JVM-Bypass Shuffling (JBS)

[Figure: JBS architecture beneath the data analytics applications. Java path: MapTask, TaskTracker, and ReduceTask, using HTTP Servlet and HTTP GET over sockets and TCP/IP. JVM-bypass path in C: MOFSupplier and NetMerger, communicating over RDMA verbs or TCP/IP.]

•  JBS removes JVM from the critical path of intermediate data shuffling

(17)

Benefits of JBS: 1/10 Gigabit Ethernet

• JBS is effective for intermediate data of different sizes

– With the TeraSort benchmark, the size of the intermediate data equals the size of the input data

• JBS reduces the execution time by 20.9% on average on 1GigE and by 19.3% on average on 10GigE

[Figure: TeraSort job execution time (sec), 0 to 2000, versus input data size (16, 32, 64, 128, 256 GB). One panel compares Hadoop on 1GigE with JBS on 1GigE; the other compares Hadoop on 10GigE with JBS on 10GigE.]

(18)

Benefits of JBS: InfiniBand Cluster

• JBS on IPoIB outperforms Hadoop on IPoIB and Hadoop on SDP by 14.1% and 14.8%, respectively.

• Hadoop performs similarly when using IPoIB or SDP.

[Figure: job execution time, 0 to 2000, versus input data size (16, 32, 64, 128, 256) for Hadoop on IPoIB, Hadoop on SDP, and JBS on IPoIB.]

(20)

Hadoop Fair Scheduler

•  Scheduler assigns tasks to the TaskTrackers

•  Tasks occupy slots until completion or failure

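[Figure: slot occupancy over time under the Hadoop Fair Scheduler. Tasks of jobs J-1, J-2, and J-3 occupy map slots Slot-M1 to Slot-M5 and reduce slots Slot-R1 to Slot-R3, with shuffle and reduce phases marked; job arrivals 1, 2, and 3 are marked on the time axis.]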

(21)

Fair Completion Scheduler

• Prioritize ReduceTasks based on the shortest remaining map phase

• When remaining map phases are equal, prioritize ReduceTasks of jobs with the least remaining reduce data

• Track the slowdown of preempted ReduceTasks
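The first two rules above amount to a two-level sort key, which can be sketched as below. This is an illustration under assumptions, not the scheduler's actual code: the `PendingReduce` record, its field names, and the `prioritize` function are hypothetical, and the slowdown tracking of preempted ReduceTasks from the third rule is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingReduce:
    """Hypothetical summary of one job's ReduceTasks awaiting a reduce slot."""
    job_id: str
    remaining_map_work: float      # estimate of the job's remaining map phase
    remaining_reduce_bytes: int    # reduce data the job still has to process


def prioritize(pending):
    """Order candidates: shortest remaining map phase first,
    ties broken by the least remaining reduce data."""
    return sorted(
        pending,
        key=lambda task: (task.remaining_map_work, task.remaining_reduce_bytes),
    )
```

For example, given J-1 with a long remaining map phase and J-2 and J-3 with equal, shorter ones, the job among J-2 and J-3 with less remaining reduce data is placed first and J-1 last.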

(22)

Average ReduceTask Waiting Time

• ReduceTasks in small jobs are sped up significantly

[Figure: average ReduceTask waiting time (sec), on a log scale from 1 to 10000, for job groups 1 to 10, comparing FCS and HFS.]

(23)

Conclusions

• Examined the design and architecture of the Hadoop MapReduce framework and revealed critical issues faced by the existing implementation

• Designed and implemented Hadoop-A as an extensible acceleration framework that addresses all these issues

• Provided JVM-Bypass Shuffling to avoid JVM overhead, as a portable library that can run on both TCP/IP and RDMA protocols

•  Designed and Implemented Fast Completion Scheduler for

(24)