Hadoop Optimizations for BigData Analytics
Weikuan Yu
Outline
• Background
• Network Levitated Merge
• JVM-Bypass Shuffling
Emerging Demand for BigData Analytics
• Big demand from many organizations in various domains
– Scalable computing power without worrying about system maintenance.
– Ubiquitously accessible computing and storage resources.
– Low cost, highly reliable, trusted computing infrastructure.
MapReduce
• A simple data processing model to process big data
• Designed for commodity off-the-shelf hardware components
• Strong merits for big data analytics
– Scalability: increase throughput by increasing the number of nodes
– Fault tolerance: quick and low-cost recovery from task failures
• Hadoop, an open-source implementation of MapReduce
– Widely deployed by many big data companies: AOL, Baidu, EBay,
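The processing model can be illustrated with a minimal in-memory sketch. This is not Hadoop code: the names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, and word count is chosen only because it is a familiar example.

```python
from collections import defaultdict


def map_fn(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def reduce_fn(key, values):
    """Reduce step: combine all values that share one key."""
    return key, sum(values)


def run_mapreduce(lines):
    # Map: each input record is processed independently of the others.
    intermediate = [pair for line in lines for pair in map_fn(line)]
    # Shuffle: group the intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: one call per key, visited in sorted key order.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    print(run_mapreduce(["big data", "big cluster"]))
```

In a real deployment the map and reduce calls run as separate tasks on different nodes, and the grouping step becomes the network shuffle of intermediate data.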
High-Level Overview of Hadoop
[Figure: Hadoop architecture; labels: Applications, Job Submission, JobTracker, Task Tracker/Runner, MapTasks, ReduceTasks, Intermediate Data, Shuffle, HDFS]
• HDFS and the MapReduce framework
• Data processing with MapTasks and ReduceTasks
• Three main steps of data movement
• Intermediate data shuffling in MapReduce is time-consuming
Motivation for Network-Levitated Merge
[Figure: timeline of the map, shuffle, merge and reduce phases; markers: Start, First MOF, Serialization]
[Figure: Repetitive Merges and Disk Access; steps: 1: merge, 2: insert, 3: merge, 4: to merge soon]
• Hadoop data spilling controlled through parameters
– To limit the number of outstanding files
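To see why limiting the number of outstanding files can lead to repeated disk access, consider the following toy model. It is a simplification under stated assumptions, not Hadoop's actual spill logic: every fetched segment is written to disk, and whenever the number of files on disk reaches a limit (`max_files`, a hypothetical stand-in for Hadoop's spill parameters) all of them are merged into one new file, so their bytes are read and written again.

```python
def simulate_spill_merges(segment_sizes, max_files):
    """Return (total_bytes_written, number_of_merges) for the toy model."""
    if max_files < 2:
        raise ValueError("max_files must be at least 2")
    on_disk = []        # sizes of the files currently on disk
    bytes_written = 0
    merges = 0
    for size in segment_sizes:
        on_disk.append(size)        # spill the newly fetched segment
        bytes_written += size
        if len(on_disk) >= max_files:
            merged = sum(on_disk)   # a merge rewrites every byte it covers
            bytes_written += merged
            merges += 1
            on_disk = [merged]
    return bytes_written, merges


if __name__ == "__main__":
    print(simulate_spill_merges([1, 1, 1, 1], max_files=2))
```

With four unit-sized segments and `max_files=2`, the model writes 13 units to disk for 4 units of data, because the earliest segments are rewritten in every merge.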
Hadoop Acceleration (Hadoop-A)
• Pipelined shuffle, merge and reduce
• Network-levitated data merge
[Figure: Hadoop-A architecture; Java side (Hadoop): JobTracker, TaskTracker, MapTask, ReduceTask; C++ side (Acceleration): MOFSupplier (Data Engine, RDMA Server) and NetMerger (Fetch Manager, Merging Thread, Merge Manager, Merged Data, RDMA Client), connected over RDMA Interconnects]
Pipelined Data Shuffle, Merge and Reduce
[Figure: timeline of the map, shuffle, merge and reduce phases; markers: start, First MOF, header, PQ setup]
Network-Levitated Merge Algorithm
[Figure: segments S1, S2 and S3 of sorted <key, value> pairs, the merge point, and the merged data; panels: (a) Fetching Header, (b) Priority Queue Setup]
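The two panel titles suggest a k-way merge driven by a priority queue that is seeded with the first record (the "header") of each sorted segment. The sketch below shows that general technique in memory. It is an assumption about the overall shape, not Hadoop-A's C++ implementation: the remote fetch is imitated by a plain iterator, and the names `levitated_merge` and `segments` are hypothetical.

```python
import heapq


def levitated_merge(segments):
    """Merge already-sorted segments of (key, value) pairs into one sorted stream.

    Each segment stands in for one remote map output; next() imitates
    fetching the following record on demand.
    """
    iterators = [iter(segment) for segment in segments]
    heap = []
    # (a) Fetch the header (first record) of every segment.
    # (b) Set up the priority queue from those headers.
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            key, value = first
            heapq.heappush(heap, (key, idx, value))
    # Merge: pop the smallest key, then refill from the same segment.
    while heap:
        key, idx, value = heapq.heappop(heap)
        yield key, value
        following = next(iterators[idx], None)
        if following is not None:
            heapq.heappush(heap, (following[0], idx, following[1]))


if __name__ == "__main__":
    s1 = [("k1", "v1"), ("k4", "v4")]
    s2 = [("k2", "v2"), ("k3", "v3")]
    print(list(levitated_merge([s1, s2])))
```

Because the queue holds at most one record per segment, memory use in this sketch grows with the number of segments rather than with their total size.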
Job Progression with Network-levitated Merge
[Figure: TeraSort job progress, Progress (%) vs. Time (sec); panels: a) Map Progress of TeraSort, b) Reduce Progress of TeraSort; series: Hadoop-A, Hadoop on IPoIB, Hadoop on GigE]
• a) Hadoop-A improves the execution time by more than 47%
• b) Both MapTasks and ReduceTasks are improved
Breakdown of ReduceTask Execution Time (sec)
• Significantly reduced the execution time of ReduceTasks
• Most of the gain came from reduced shuffle/merge time
– An improvement of 2.5 times
• Also improved the time to reduce data
– An improvement of 15%
Category       PQ-Setup   Shuffle/Merge      Reduce or Merge/Reduce
Hadoop-GigE    -          1150.01 (65.0%)    599.97 (35.0%)
Hadoop-IPoIB   -          1148.26 (65.9%)    597.09 (34.1%)
WBDB, Oct 2012, S-15
JVM-Dependent Intermediate Data Shuffling
[Figure: shuffle path in stock Hadoop; components: HDFS, JobTracker, TaskTracker, MapTask (Map, MOF1, MOF2), HttpServlet, ReduceTask (MOFCopiers, Staging, Sort/Merge, Reduce)]
• Heavily relies on Java
JVM-Bypass Shuffling (JBS)
[Figure: JBS architecture; Java layer: Data Analytics Applications, MapTask, TaskTracker, ReduceTask, HTTP Servlet, HTTP GET, Sockets, TCP/IP; C layer (JVM-Bypass): MOFSupplier, NetMerger, RDMA Verbs, TCP/IP]
• JBS removes the JVM from the critical path of intermediate data shuffling
Benefits of JBS: 1/10 Gigabit Ethernets
• JBS is effective for intermediate data of different sizes
– With the TeraSort benchmark, the size of the intermediate data equals the size of the input data
• JBS reduces the execution time by 20.9% on average on 1GigE and by 19.3% on average on 10GigE
[Figure: TeraSort Job Execution Time (sec) vs. Input Data Size (GB), for 16, 32, 64, 128 and 256 GB; two panels: Hadoop on 1GigE vs. JBS on 1GigE, and Hadoop on 10GigE vs. JBS on 10GigE]
Benefits of JBS: InfiniBand Cluster
• JBS on IPoIB outperforms Hadoop on IPoIB and Hadoop on SDP by 14.1% and 14.8%, respectively.
• Hadoop performs similarly when using IPoIB or SDP.
[Figure: TeraSort Job Execution Time (sec) vs. Input Data Size (GB), for 16, 32, 64, 128 and 256 GB; series: Hadoop on IPoIB, Hadoop on SDP, JBS on IPoIB]
Hadoop Fair Scheduler
• Scheduler assigns tasks to the TaskTrackers
• Tasks occupy slots until completion or failure
[Figure: occupancy over time of map slots (Slot-M1 to Slot-M5) and reduce slots (Slot-R1 to Slot-R3) by jobs J-1, J-2 and J-3, with job arrivals and the shuffle and reduce phases marked]
Fair Completion Scheduler
• Prioritize ReduceTasks based on the shortest remaining map phase
• When remaining map phases are equal, prioritize ReduceTasks of jobs with the least remaining reduce data
• Track the slowdown of preempted ReduceTasks
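The first two rules amount to a two-level sort key. A minimal sketch, assuming each job exposes an estimate of its remaining map phase and of its remaining reduce data; the `Job` fields and the function `prioritize` are hypothetical, and slowdown tracking for preempted ReduceTasks is omitted.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    remaining_map: float          # remaining map phase (hypothetical unit: seconds)
    remaining_reduce_data: float  # remaining reduce data (hypothetical unit: MB)


def prioritize(jobs):
    """Order jobs so that ReduceTasks of the first job are scheduled first."""
    # Primary key: shortest remaining map phase.
    # Tie-breaker: least remaining reduce data.
    return sorted(jobs, key=lambda j: (j.remaining_map, j.remaining_reduce_data))


if __name__ == "__main__":
    queue = [Job("J-1", 30, 500), Job("J-2", 10, 900), Job("J-3", 10, 200)]
    print([job.name for job in prioritize(queue)])
```

Here J-2 and J-3 tie on the remaining map phase, so J-3 goes first because it has less reduce data left.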
Average ReduceTask Waiting Time
• ReduceTasks in small jobs are significantly sped up
[Figure: Average ReduceTask Waiting Time (sec), log scale from 1 to 10000, for Groups 1 to 10; series: FCS, HFS; data labels as extracted: 19,5 12,4 22 21 32.2 27.3 1.22 4.51 0.52 0.8]
Conclusions
• Examined the design and architecture of the Hadoop MapReduce framework and revealed critical issues faced by the existing implementation
• Designed and implemented Hadoop-A as an extensible acceleration framework that addresses these issues
• Provided JVM-Bypass Shuffling to avoid JVM overhead, packaged as a portable library that runs over both TCP/IP and RDMA protocols
• Designed and implemented Fast Completion Scheduler for