Big Data: Hadoop and Memcached
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at HPC Advisory Council Stanford Conference and Exascale Workshop (Feb 2014)
Introduction to Big Data Applications and Analytics
• Big Data has become the one of the most important elements of business analytics • Provides groundbreaking opportunities
for enterprise information management and decision making
• The amount of data is exploding;
companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore’s Law
2
HPC Advisory Council Stanford Conference, Feb '14
• Commonly accepted 3V’s of Big Data
• Volume, Velocity, Variety
Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf
• Scientific Data Management, Analysis, and Visualization • Applications examples – Climate modeling – Combustion – Fusion – Astrophysics – Bioinformatics
• Data Intensive Tasks
– Runs large-scale simulations on supercomputers – Dump data on parallel storage systems
– Collect experimental / observational data
– Move experimental / observational data to analysis sites – Visual analytics – help understand data visually
HPC Advisory Council Stanford Conference, Feb '14 3
Not Only in Internet Services - Big Data in Scientific
Domains
• Hadoop: http://hadoop.apache.org
– The most popular framework for Big Data Analytics
– HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.
• Spark: http://spark-project.org
– Provides primitives for in-memory cluster computing; Jobs can load data into memory and query it repeatedly
• Shark: http://shark.cs.berkeley.edu
– A fully Apache Hive-compatible data warehousing system
• Storm: http://storm-project.net
– A distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc.
• S4: http://incubator.apache.org/s4
– A distributed system for processing continuous unbounded streams of data
• Web 2.0: RDBMS + Memcached (http://memcached.org)
– Memcached: A high-performance, distributed memory object caching systems
HPC Advisory Council Stanford Conference, Feb '14 4
Typical Solutions or Architectures for Big Data
Applications and Analytics
• F
ocuses on large data and data analysis
• Hadoop (e.g. HDFS, MapReduce, HBase, RPC) environment
is gaining a lot of momentum
•
http://wiki.apache.org/hadoop/PoweredBy
HPC Advisory Council Stanford Conference, Feb '14 5
• Scale: 42,000 nodes, 365PB HDFS, 10M daily slot hours
• http://developer.yahoo.com/blogs/hadoop/apache-hbase-yahoo-multi-tenancy-helm-again-171710422.html • A new Jim Gray’ Sort Record: 1.42 TB/min over 2,100
nodes
• Hadoop 0.23.7, an early branch of the Hadoop 2.X
• http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-092704088.html
HPC Advisory Council Stanford Conference, Feb '14 6
Hadoop at Yahoo!
• Overview of Hadoop and Memcached
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies
– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached
• Open Challenges
• Conclusion and Q&A
Presentation Outline
7
Big Data Processing with Hadoop
• The open-source implementation
of MapReduce programming
model for Big Data Analytics
• Major components
– HDFS
– MapReduce
• Underlying
Hadoop Distributed
File System (HDFS)
used by both
MapReduce and HBase
• Model scales but high amount of
communication during
intermediate phases can be
further optimized
8 HDFS MapReduce Hadoop Framework User Applications HBase Hadoop Common (RPC)Hadoop MapReduce Architecture & Operations
Disk Operations
• Map and Reduce Tasks carry out the total job execution
• Communication in shuffle phase uses HTTP over Java Sockets
9
Hadoop Distributed File System (HDFS)
• Primary storage of Hadoop;
highly reliable and fault-tolerant
• Adopted by many reputed
organizations
– eg: Facebook, Yahoo!
• NameNode: stores the file system
namespace
• DataNode: stores data blocks
• Developed in Java for
platform-independence and portability
• Uses sockets for communication!
(HDFS Architecture)10
RPC RPC RPC RPC
HPC Advisory Council Stanford Conference, Feb '14
HPC Advisory Council Stanford Conference, Feb '14 11
System-Level Interaction Between Clients and Data
Nodes in HDFS
High Performance Networks (HDD/SSD) (HDD/SSD) (HDD/SSD) ... ... (HDFS Data Nodes) (HDFS Clients) ... ...HPC Advisory Council Stanford Conference, Feb '14 12
HBase
• Apache Hadoop Database
(
http://hbase.apache.org/
)
• Semi-structured database,
which is highly scalable
• Integral part of many
datacenter applications
– eg:- Facebook Social Inbox
• Developed in Java for
platform-independence and portability
• Uses sockets for
communication!
ZooKeeper HBase Client HMaster HRegion Servers HDFS (HBase Architecture)HPC Advisory Council Stanford Conference, Feb '14 13
System-Level Interaction Among
HBase Clients, Region Servers, and Data Nodes
(HBase Clients) (HRegion Servers) (Data Nodes)
(HDD/SSD) (HDD/SSD) (HDD/SSD) ... ... ... ... ... ... High Performance Networks High Performance Networks
HPC Advisory Council Stanford Conference, Feb '14 14
Memcached Architecture
• Integral part of Web 2.0 architecture • Distributed Caching Layer
– Allows to aggregate spare memory from multiple nodes – General purpose
• Typically used to cache database queries, results of API calls
• Scalable model, but typical usage very network intensive
!"#$%"$#&
Web Frontend Servers
(Memcached Clients) (Memcached Servers)
(Database Servers) High Performance Networks High Performance Networks Main memory CPUs SSD HDD High Performance Networks ... ... ... Main memory CPUs SSD HDD Main memory CPUs SSD HDD Main memory CPUs SSD HDD Main memory CPUs SSD HDD
• Overview of Hadoop and Memcached
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies
– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached
• Open Challenges
• Conclusion and Q&A
Presentation Outline
15
HPC Advisory Council Stanford Conference, Feb '14 16
Trends for Commodity Computing Clusters in the Top 500
List (http://www.top500.org)
0 10 20 30 40 50 60 70 80 90 100 0 50 100 150 200 250 300 350 400 450 500 P e rce n tag e of Clus te rs Nu mb e r of Clus te rs Timeline Percentage of Clusters Number of ClustersHPC and Advanced Interconnects
• High-Performance Computing (HPC) has adopted advanced
interconnects and protocols
– InfiniBand
– 10 Gigabit Ethernet/iWARP
– RDMA over Converged Enhanced Ethernet (RoCE)
• Very Good Performance
– Low latency (few micro seconds)
– High Bandwidth (100 Gb/s with dual FDR InfiniBand) – Low CPU overhead (5-10%)
• OpenFabrics software stack with IB, iWARP and RoCE
interfaces are driving HPC systems
• Many such systems in Top500 list
17 HPC Advisory Council Stanford Conference, Feb '14
18
Trends of Networking Technologies in TOP500 Systems
Percentage share of InfiniBand is steadily increasing Interconnect Family – Systems Share
• Message Passing Interface (MPI)
– Most commonly used programming model for HPC applications
• Parallel File Systems
• Storage Systems
• Almost 13 years of Research and Development since
InfiniBand was introduced in October 2000
• Other Programming Models are emerging to take
advantage of High-Performance Networks
– UPC
– OpenSHMEM
– Hybrid MPI+PGAS (OpenSHMEM and UPC)
19
Use of High-Performance Networks for Scientific
Computing
HPC Advisory Council Stanford Conference, Feb '14 20
One-way Latency: MPI over IB with MVAPICH2
0.00 1.00 2.00 3.00 4.00 5.00
6.00 Small Message Latency
Message Size (bytes)
Lat enc y (us ) 1.66 1.56 1.64 1.82 0.99 1.09 1.12
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
0.00 50.00 100.00 150.00 200.00 250.00 Qlogic-DDR Qlogic-QDR ConnectX-DDR ConnectX2-PCIe2-QDR ConnectX3-PCIe3-FDR Sandy-ConnectIB-DualFDR Ivy-ConnectIB-DualFDR
Large Message Latency
Message Size (bytes)
Lat
enc
y
(us
HPC Advisory Council Stanford Conference, Feb '14 21
Bandwidth: MPI over IB with MVAPICH2
0 2000 4000 6000 8000 10000 12000 14000 Unidirectional Bandwidth B an dwid th ( MBy tes /se c)
Message Size (bytes)
3280 3385 1917 1706 6343 12485 12810 0 5000 10000 15000 20000 25000 30000 Qlogic-DDR Qlogic-QDR ConnectX-DDR ConnectX2-PCIe2-QDR ConnectX3-PCIe3-FDR Sandy-ConnectIB-DualFDR Ivy-ConnectIB-DualFDR Bidirectional Bandwidth B an dwid th ( MBy tes /sec )
Message Size (bytes)
3341 3704 4407 11643 6521 21025 24727
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
• Overview of Hadoop and Memcached
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies
– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached
• Open Challenges
• Conclusion and Q&A
Presentation Outline
22
Can High-Performance Interconnects Benefit Hadoop?
• Most of the current Hadoop environments use Ethernet
Infrastructure with Sockets
• Concerns for performance and scalability
• Usage of High-Performance Networks is beginning to draw
interest from many companies
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar
to the designs adopted for MPI)?
• Can HPC Clusters with High-Performance networks be used
for Big Data applications using Hadoop?
23 HPC Advisory Council Stanford Conference, Feb '14
Designing Communication and I/O Libraries for Big
Data Systems: Challenges
HPC Advisory Council Stanford Conference, Feb '14 24
Big Data Middleware
(HDFS, MapReduce, HBase and Memcached)
Networking Technologies (InfiniBand, 1/10/40GigE
and Intelligent NICs)
Storage Technologies (HDD and SSD) Programming Models
(Sockets) Applications
Commodity Computing System Architectures
(Multi- and Many-core architectures and accelerators)
Other Protocols? Communication and I/O Library
Point-to-Point Communication QoS Threaded Models and Synchronization Fault-Tolerance I/O and File Systems
Virtualization Benchmarks
HPC Advisory Council Stanford Conference, Feb '14 25
Common Protocols using Open Fabrics
Application/ Middleware
Verbs Sockets
Application /Middleware Interface
SDP RDMA SDP InfiniBand Adapter InfiniBand Switch RDMA IB Verbs InfiniBand Adapter InfiniBand Switch User space RDMA RoCE RoCE Adapter User space Ethernet Switch TCP/IP Ethernet Driver Kernel Space Protocol InfiniBand Adapter InfiniBand Switch IPoIB Ethernet Adapter Ethernet Switch Adapter Switch 1/10/40 GigE iWARP Ethernet Switch iWARP iWARP Adapter User space IPoIB TCP/IP Ethernet Adapter Ethernet Switch 10/40 GigE-TOE Hardware Offload RSockets InfiniBand Adapter InfiniBand Switch User space RSockets
Can Hadoop be Accelerated with High-Performance Networks
and Protocols?
26 Enhanced Designs Application Accelerated Sockets 10 GigE or InfiniBand Verbs / Hardware Offload Current Design Application Sockets 1/10 GigE Network• Sockets not designed for high-performance
– Stream semantics often mismatch for upper layers (Hadoop and HBase) – Zero-copy not available for non-blocking sockets
Our Approach Application
OSU Design
10 GigE or InfiniBand
Verbs Interface
• Overview of Hadoop and Memcached
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies
– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached• Open Challenges
• Conclusion and Q&A
Presentation Outline
27
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components
– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.8
– Based on Apache Hadoop 1.2.1
– Compliant with Apache Hadoop 1.2.1 APIs and applications – Tested with
• Mellanox InfiniBand adapters (DDR, QDR and FDR) • RoCE
• Various multi-core platforms
• Different file systems with disks and SSDs – http://hadoop-rdma.cse.ohio-state.edu
RDMA for Apache Hadoop Project
28
Design Overview of HDFS with RDMA
• JNI Layer bridges Java based HDFS with communication library written in native code
• Only the communication part of HDFS Write is modified; No change in HDFS architecture
29
HDFS
Verbs
RDMA Capable Networks
(IB, 10GE/ iWARP, RoCE ..)
Applications
1/10 GigE, IPoIB Network
Java Socket
Interface Java Native Interface (JNI)
Write Others
OSU Design
Enables high performance RDMA communication, while supporting traditional socket interface
• Design Features
– RDMA-based HDFS write – RDMA-based HDFS replication – InfiniBand/RoCE supportCommunication Time in HDFS
• Cluster with HDD DataNodes
– 30% improvement in communication time over IPoIB (QDR)
– 56% improvement in communication time over 10GigE • Similar improvements are obtained for SSD DataNodes
Reduced by 30%
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012
30 0 5 10 15 20 25 2GB 4GB 6GB 8GB 10GB Co m m u n ic ati on T im e (s) File Size (GB)
10GigE IPoIB (QDR) OSU-IB (QDR)
Evaluations using TestDFSIO
• Cluster with 8 HDD DataNodes, single disk per node
– 24% improvement over IPoIB (QDR) for 20GB file size • Cluster with 4 SSD DataNodes, single SSD per node
– 61% improvement over IPoIB (QDR) for 20GB file size
Cluster with 8 HDD Nodes, TestDFSIO with 8 maps Cluster with 4 SSD Nodes, TestDFSIO with 4 maps Increased by 24% Increased by 61% 31 0 200 400 600 800 1000 1200 5 10 15 20 To tal Th ro u gh p u t (M B p s) File Size (GB) IPoIB (QDR) OSU-IB (QDR) 0 500 1000 1500 2000 2500 5 10 15 20 To tal Th ro u gh p u t (M B p s) File Size (GB)
10GigE IPoIB (QDR) OSU-IB (QDR)
Evaluations using TestDFSIO
• Cluster with 4 DataNodes, 1 HDD per node
– 24% improvement over IPoIB (QDR) for 20GB file size
• Cluster with 4 DataNodes, 2 HDD per node
– 31% improvement over IPoIB (QDR) for 20GB file size
• 2 HDD vs 1 HDD
– 76% improvement for OSU-IB (QDR)
– 66% improvement for IPoIB (QDR)
Increased by 31% 32 0 500 1000 1500 2000 2500 5 10 15 20 Tot al T h rou gh p u t (M Bps ) File Size (GB) 10GigE-1HDD 10GigE-2HDD IPoIB -1HDD (QDR) IPoIB -2HDD (QDR) OSU-IB -1HDD (QDR) OSU-IB -2HDD (QDR)
33
Evaluations using Enhanced DFSIO of Intel HiBench on
SDSC-Gordon
• Cluster with 32 DataNodes, single SSD per node
– 47% improvement in throughput over IPoIB (QDR) for 128GB file size
– 16% improvement in latency over IPoIB (QDR) for 128GB file size
HPC Advisory Council Stanford Conference, Feb '14
Increased by 47% Reduced by 16% 0 500 1000 1500 2000 2500 32 64 128 A gg re ga te d T h rou gh p u t (M Bps ) File Size (GB) IPoIB (QDR) OSU-IB (QDR) 0 20 40 60 80 100 120 140 160 32 64 128 Ex e cu ti on T im e (s) File Size (GB) IPoIB (QDR) OSU-IB (QDR)
HPC Advisory Council Stanford Conference, Feb '14 34
Evaluations using Enhanced DFSIO of Intel HiBench on
TACC-Stampede
• Cluster with 64 DataNodes, single HDD per node
– 64% improvement in throughput over IPoIB (FDR) for 256GB file size
– 37% improvement in latency over IPoIB (FDR) for 256GB file size 0 200 400 600 800 1000 1200 1400 64 128 256 A gg re ga te d T h rou gh p u t (M Bps ) File Size (GB) IPoIB (FDR) OSU-IB (FDR) Increased by 64% Reduced by 37% 0 100 200 300 400 500 600 64 128 256 Ex e cu ti on T im e (s) File Size (GB) IPoIB (FDR) OSU-IB (FDR)
• Overview of Hadoop and Memcached
• Trends in High Performance Networking and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies
– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached• Open Challenges
• Conclusion and Q&A
Presentation Outline
35
Design Overview of MapReduce with RDMA
• Enables high
performance RDMA
communication, while
supporting traditional
socket interface
• JNI Layer bridges Java
based MapReduce
with communication
library written in
native code
• InfiniBand/RoCE
support
MapReduce VerbsRDMA Capable Networks
(IB, 10GE/ iWARP, RoCE ..)
OSU Design
Applications
1/10 GigE, IPoIB Network
Java Socket
Interface Java Native Interface (JNI)
Job Tracker Task Tracker Map Reduce
Design features for RDMA
Prefetch/Caching of MOF
In-Memory Merge
Overlap of Merge & Reduce
36
Evaluations using Sort
8 HDD DataNodes
• With 8 HDD DataNodes for 40GB sort
– 41% improvement over IPoIB (QDR)
– 39% improvement over Hadoop-A-IB
(QDR)
37
M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramon, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013.
HPC Advisory Council Stanford Conference, Feb '14
8 SSD DataNodes
• With 8 SSD DataNodes for 40GB sort
– 52% improvement over IPoIB (QDR)
– 44% improvement over Hadoop-A-IB
(QDR) 0 50 100 150 200 250 300 350 400 450 500 25 30 35 40 Job Ex e cu ti on T ime (se c) Data Size (GB) 10GigE IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR) 0 50 100 150 200 250 300 350 400 450 500 25 30 35 40 Job Ex e cu ti on T im e (se c) Data Size (GB) 10GigE IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR)
Evaluations using TeraSort
• 100 GB TeraSort with 8 DataNodes with 2 HDD per node
– 51%
improvement over IPoIB (QDR)
– 41%
improvement over Hadoop-A-IB (QDR)
38
HPC Advisory Council Stanford Conference, Feb '14
0 200 400 600 800 1000 1200 1400 1600 1800 60 80 100 Job Ex ecu tion Time ( se c) Data Size (GB) 10GigE IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR)
• For 240GB Sort in 64 nodes
– 40% improvement over IPoIB (QDR) with HDD used for HDFS
39
Performance Evaluation on Larger Clusters
HPC Advisory Council Stanford Conference, Feb '14
0 200 400 600 800 1000 1200
Data Size: 60 GB Data Size: 120 GB Data Size: 240 GB Cluster Size: 16 Cluster Size: 32 Cluster Size: 64
Job Ex e cu ti on T im e (se c) IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR)
Sort in OSU Cluster
0 100 200 300 400 500 600 700
Data Size: 80 GB Data Size: 160 GBData Size: 320 GB Cluster Size: 16 Cluster Size: 32 Cluster Size: 64
Job Ex e cu ti on T im e (se c)
IPoIB (FDR) Hadoop-A-IB (FDR) OSU-IB (FDR)
TeraSort in TACC Stampede
• For 320GB TeraSort in 64 nodes
– 38% improvement over IPoIB (FDR) with HDD used for HDFS
• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size
• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size
Evaluations using PUMA Workload
40
HPC Advisory Council Stanford Conference, Feb '14
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
AdjList (30GB) SelfJoin (80GB) SeqCount (30GB) WordCount (30GB) InvertIndex (30GB)
No rma liz e d Ex e cu ti on T ime Benchmarks
• Overview of Hadoop and Memcached
• Trends in High Performance Networking and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies
– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached• Open Challenges
• Conclusion and Q&A
Presentation Outline
41
HPC Advisory Council Stanford Conference, Feb '14 42
Motivation – Detailed Analysis on HBase Put/Get
• HBase 1KB Put
– Communication Time – 8.9 us
– A factor of 6X improvement over 10GE for communication time • HBase 1KB Get
– Communication Time – 8.9 us
– A factor of 6X improvement over 10GE for communication time
M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?,
ISPASS’12 0 20 40 60 80 100 120 140 160 180 200
IPoIB (DDR) 10GigE OSU-IB (DDR)
Ti m e ( u s) HBase Put 1KB 0 20 40 60 80 100 120 140 160
IPoIB (DDR) 10GigE OSU-IB (DDR)
Ti m e ( u s) Hbase Get 1KB Communication Communication Preparation Server Processing Server Serialization Client Processing Client Serialization
HPC Advisory Council Stanford Conference, Feb '14 43
HBase Single-Server-Multi-Client Results
• HBase Get latency
– 4 clients: 104.5 us; 16 clients: 296.1 us
• HBase Get throughput
– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
• 27% improvement in throughput for 16 clients over 10GE
0 50 100 150 200 250 300 350 400 450 1 2 4 8 16 Ti m e ( u s) No. of Clients 0 10000 20000 30000 40000 50000 60000 1 2 4 8 16 Op s/se c No. of Clients IPoIB OSU-IB 10GigE Latency Throughput
J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12
OSU-IB (DDR) IPoIB (DDR)
HPC Advisory Council Stanford Conference, Feb '14 44
HBase – YCSB Read-Write Workload
• HBase Get latency (Yahoo! Cloud Service Benchmark) – 64 clients: 2.0 ms; 128 Clients: 3.5 ms
– 42% improvement over IPoIB for 128 clients • HBase Get latency
– 64 clients: 1.9 ms; 128 Clients: 3.5 ms
– 40% improvement over IPoIB for 128 clients 0 1000 2000 3000 4000 5000 6000 7000 8 16 32 64 96 128 Ti m e ( u s) No. of Clients 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 8 16 32 64 96 128 Ti m e ( u s) No. of Clients IPoIB OSU-IB 10GigE
Read Latency Write Latency
OSU-IB (QDR) IPoIB (QDR)
• Overview of Hadoop and Memcached
• Trends in High Performance Networking and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies
– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached• Open Challenges
• Conclusion and Q&A
Presentation Outline
45
Design Overview of Hadoop RPC with RDMA
Hadoop RPC
Verbs
RDMA Capable Networks
(IB, 10GE/ iWARP, RoCE ..)
Applications
1/10 GigE, IPoIB Network
Java Socket
Interface Java Native Interface (JNI)
Our Design Default
OSU Design
Enables high performance RDMA communication, while supporting traditional socket interface
46
• Design Features
– JVM-bypassed buffer management – On-demand connection setup – RDMA or send/recv based adaptive communication – InfiniBand/RoCE supportHadoop RPC over IB - Gain in Latency and Throughput
• Hadoop RPC over IB PingPong Latency
– 1 byte: 39 us; 4 KB: 52 us
– 42%-49% and 46%-50% improvements compared with the performance on 10 GigE
and IPoIB (32Gbps), respectively
• Hadoop RPC over IB Throughput
– 512 bytes & 48 clients: 135.22 Kops/sec
– 82% and 64% improvements compared with the peak performance on 10 GigE and
IPoIB (32Gbps), respectively 0 20 40 60 80 100 120 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 Late n cy ( u s)
Payload Size (Byte)
10GigE IPoIB(QDR) OSU-IB(QDR) 0 20 40 60 80 100 120 140 160 8 16 24 32 40 48 56 64 Th ro u gh p u t (K o p s/Sec ) Number of Clients 47
• Overview of Hadoop and Memcached
• Trends in High Performance Networking and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies
– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached• Open Challenges
• Conclusion and Q&A
Presentation Outline
48
• For 5GB,
– 53.4% improvement over IPoIB, 126.9% improvement over 10GigE with HDD used for HDFS
– 57.8% improvement over IPoIB, 138.1% improvement over 10GigE with SSD used for HDFS
• For 20GB,
– 10.7% improvement over IPoIB, 46.8% improvement over 10GigE with HDD used for HDFS
– 73.3% improvement over IPoIB, 141.3% improvement over 10GigE with SSD used for HDFS
49
Performance Benefits - TestDFSIO
Cluster with 4 HDD Nodes, TestDFSIO with total 4 maps Cluster with 4 SSD Nodes, TestDFSIO with total 4 maps
0 100 200 300 400 500 600 700 800 900 1000 5 10 15 20 Tot al T h rou gh p u t (M Bps ) Size (GB)
10GigE IPoIB (QDR) OSU-IB (QDR)
0 100 200 300 400 500 600 700 800 900 5 10 15 20 Tot al T h rou gh p u t (M Bps ) Size (GB)
10GigE IPoIB (QDR) OSU-IB (QDR)
• For 80GB Sort in a cluster size of 8,
– 40.7% improvement over IPoIB, 32.2% improvement over 10GigE with HDD used for HDFS
– 43.6% improvement over IPoIB, 42.9% improvement over 10GigE with SSD used for HDFS
50
Performance Benefits - Sort
0 200 400 600 800 1000 1200 1400 1600 1800 48 64 80 Job Ex e cu ti on T im e (se c) Sort Size (GB)
10GigE IPoIB (QDR) OSU-IB (QDR)
Cluster with 8 HDD Nodes, Sort with <8 maps, 8 reduces>/host Cluster with 8 SSD Nodes, Sort with <8 maps, 8 reduces>/host
0 200 400 600 800 1000 1200 1400 1600 48 64 80 Job Ex e cu ti on T im e (se c) Sort Size (GB)
10GigE IPoIB (QDR) OSU-IB (QDR)
• For 100GB TeraSort in a cluster size of 8,
– 45% improvement over IPoIB, 39% improvement over 10GigE with HDD used for HDFS
– 46% improvement over IPoIB, 40% improvement over 10GigE with SSD used for HDFS
51
Performance Benefits - TeraSort
Cluster with 8 HDD Nodes, TeraSort with <8 maps, 8reduces>/host Cluster with 8 SSD Nodes, TeraSort with <8 maps, 8reduces>/host
0 500 1000 1500 2000 2500 80 90 100 Job Ex ecu tion Time ( se c) Size (GB)
10GigE IPoIB (QDR) OSU-IB (QDR)
0 500 1000 1500 2000 2500 80 90 100 Job Ex ecu tion Time ( se c) Size (GB)
10GigE IPoIB (QDR) OSU-IB (QDR)
0 50 100 150 200 250 300 350 400 50 GB 100 GB 200 GB 300 GB Cluster: 16 Cluster: 32 Cluster: 48 Cluster: 64
Job Ex ecu tion Time ( se c) IPoIB (QDR) OSU-IB (QDR) 0 20 40 60 80 100 120 140 160 50 GB 100 GB 200 GB 300 GB Cluster: 16 Cluster: 32 Cluster: 48 Cluster: 64
Job Ex ecu tion Time ( se c) IPoIB (QDR) OSU-IB (QDR)
Performance Benefits – RandomWriter & Sort in SDSC-Gordon
• RandomWriter in a cluster of single SSD per node
– 16% improvement over IPoIB (QDR) for
50GB in a cluster of 16
– 20% improvement over IPoIB (QDR) for
300GB in a cluster of 64
Reduced by 20% Reduced by 36%
52
HPC Advisory Council Stanford Conference, Feb '14
• Sort in a cluster of single SSD per node
– 20% improvement over IPoIB (QDR)
for 50GB in a cluster of 16
– 36% improvement over IPoIB (QDR)
• Overview of Hadoop and Memcached
• Trends in High Performance Networking and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies
– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached• Open Challenges
• Conclusion and Q&A
Presentation Outline
53
Memcached-RDMA Design
• Server and client perform a negotiation protocol
– Master thread assigns clients to appropriate worker thread
• Once a client is assigned a verbs worker thread, it can
communicate directly and is “bound” to that thread
• All other Memcached data structures are shared among RDMA
and Sockets worker threads
Sockets Client RDMA Client Master Thread Sockets Worker Thread Verbs Worker Thread Sockets Worker Thread Verbs Worker Thread Shared Data Memory Slabs Items … 1 1 2 2 54
55
• Memcached Get latency
– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us – 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
• Almost factor of four improvement over 10GE (TOE) for 2K bytes on the DDR cluster
Intel Clovertown Cluster (IB: DDR) Intel Westmere Cluster (IB: QDR)
0 10 20 30 40 50 60 70 80 90 100 1 2 4 8 16 32 64 128 256 512 1K 2K Ti m e ( u s) Message Size 0 10 20 30 40 50 60 70 80 90 100 1 2 4 8 16 32 64 128 256 512 1K 2K Ti m e ( u s) Message Size IPoIB (QDR) 10GigE (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR)
Memcached Get Latency (Small Message)
HPC Advisory Council Stanford Conference, Feb '14
T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% Ti m e% (u s) % Message%Size% T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% Ti m e% (u s) % Message%Size% IPoIB (DDR) 10GigE (DDR) OSU-RC-IB (DDR) OSU-UD-IB (DDR)
HPC Advisory Council Stanford Conference, Feb '14 56
Memcached Get TPS (4byte)
• Memcached Get transactions per second for 4 bytes – On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients
• Significant improvement with native IB QDR compared to SDP and IPoIB 0 200 400 600 800 1000 1200 1400 1600 1 2 4 8 16 32 64 128 256 512 800 1K Th o u san d s o f Tr an sact io n s p e r sec o n d (TPS ) No. of Clients 0 200 400 600 800 1000 1200 1400 1600 4 8 Th o u san d s o f Tr an sact io n s p e r sec o n d (TPS ) No. of Clients T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% Ti m e% (u s) % Message%Size% IPoIB (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR)
1 10 100 1000 1 2 4 8 16 32 64 128 256 512 1K 2K 4K Time (us ) Message Size OSU-IB (FDR) IPoIB (FDR) 0 100 200 300 400 500 600 700 16 32 64 128 256 512 1024 2048 4080 Thou sa nds of Tr an sa ct ion s per Sec ond ( TPS ) No. of Clients 57
• Memcached Get latency
– 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us – 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us • Memcached Throughput (4bytes)
– 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s – Nearly 2X improvement in throughput
Memcached GET Latency Memcached Throughput
Memcached Performance (FDR Interconnect)
Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR) Latency Reduced
by nearly 20X
2X
HPC Advisory Council Stanford Conference, Feb '14 58
Application Level Evaluation – Olio Benchmark
• Olio Benchmark
– RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients
• 4X times better than IPoIB for 8 clients
• Hybrid design achieves comparable performance to that of pure RC design 0 500 1000 1500 2000 2500 64 128 256 512 1024 Ti m e ( m s) No. of Clients 0 10 20 30 40 50 60 1 4 8 Ti m e ( m s) No. of Clients IPoIB (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR) OSU-Hybrid-IB (QDR) Ti m e %( m s) % No.%of%Clients%
HPC Advisory Council Stanford Conference, Feb '14 59
Application Level Evaluation –
Real Application Workloads
• Real Application Workload
– RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients
• 12X times better than IPoIB for 8 clients
• Hybrid design achieves comparable performance to that of pure RC design 0 10 20 30 40 50 60 70 80 90 1 4 8 Ti m e ( m s) No. of Clients 0 50 100 150 200 250 300 350 64 128 256 512 1024 Ti m e ( m s) No. of Clients
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11
J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid’12
IPoIB (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR) OSU-Hybrid-IB (QDR) Ti m e %( m s) % No.%of%Clients%
• Overview of Hadoop and Memcached
• Trends in High Performance Networking and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies
– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached
• Open Challenges
• Conclusion and Q&A
Presentation Outline
60
Big Data Middleware
(HDFS, MapReduce, HBase and Memcached)
Networking Technologies (InfiniBand, 1/10/40GigE
and Intelligent NICs)
Storage Technologies (HDD and SSD) Programming Models
(Sockets) Applications
Commodity Computing System Architectures
(Multi- and Many-core architectures and accelerators)
Other Protocols? Communication and I/O Library
Point-to-Point Communication QoS Threaded Models and Synchronization Fault-Tolerance I/O and File Systems
Virtualization Benchmarks
RDMA Protocol
Designing Communication and I/O Libraries for Big
Data Systems: Solved a Few Initial Challenges
HPC Advisory Council Stanford Conference, Feb '14
Upper level Changes?
• Multi-threading and Synchronization
– Multi-threaded model exploration
– Fine-grained synchronization and lock-free design – Unified helper threads for different components
– Multi-endpoint design to support multi-threading communications
• QoS and Virtualization
– Network virtualization and locality-aware communication for Big Data middleware
– Hardware-level virtualization support for End-to-End QoS – I/O scheduling and storage virtualization
– Live migration
HPC Advisory Council Stanford Conference, Feb '14 62
• Support of Accelerators
– Efficient designs for Big Data middleware to take advantage of NVIDA GPGPUs and Intel MICs
– Offload computation-intensive workload to accelerators
– Explore maximum overlapping between communication and offloaded computation
• Fault Tolerance Enhancements
– Novel data replication schemes for high-efficiency fault tolerance – Exploration of light-weight fault tolerance mechanisms for Big Data
processing
• Support of Parallel File Systems
– Optimize Big Data middleware over parallel file systems (e.g. Lustre) on modern HPC clusters
HPC Advisory Council Stanford Conference, Feb '14 63
• For 500GB Sort in 64 nodes
– 44% improvement over IPoIB (FDR)
64
Performance Improvement Potential of MapReduce over
Lustre on HPC Clusters – An Initial Study
HPC Advisory Council Stanford Conference, Feb '14
Sort in TACC Stampede Sort in SDSC Gordon
• For 100GB Sort in 16 nodes
– 33% improvement over IPoIB (QDR) 0 200 400 600 800 1000 1200 300 400 500 Job Ex e cu ti on T im e (se c) Data Size (GB) IPoIB (FDR) OSU-IB (FDR) 0 50 100 150 200 250 60 80 100 Job Ex e cu ti on T im e (se c) Data Size (GB) IPoIB (QDR) OSU-IB (QDR)
• The current benchmarks provide some performance
behavior
• However, do not provide any information to the
designer/developer on:
– What is happening at the lower-layer? – Where the benefits are coming from?
– Which design is leading to benefits or bottlenecks?
– Which component in the design needs to be changed and what will be its impact?
– Can performance gain/loss at the lower-layer be correlated to the performance gain/loss observed at the upper layer?
HPC Advisory Council Stanford Conference, Feb '14 65
Big Data Middleware
(HDFS, MapReduce, HBase and Memcached)
Networking Technologies (InfiniBand, 1/10/40GigE
and Intelligent NICs)
Storage Technologies (HDD and SSD) Programming Models
(Sockets) Applications
Commodity Computing System Architectures
(Multi- and Many-core architectures and accelerators)
Other Protocols? Communication and I/O Library
Point-to-Point Communication QoS Threaded Models and Synchronization Fault-Tolerance I/O and File Systems
Virtualization Benchmarks
RDMA Protocols
Challenges in Benchmarking of RDMA-based Designs
66
Current Benchmarks
No Benchmarks
Correlation?
OSU MPI Micro-Benchmarks (OMB) Suite
• A comprehensive suite of benchmarks to
– Compare performance of different MPI libraries on various networks and systems
– Validate low-level functionalities
– Provide insights to the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
– MPI-2 one-sided
– Collectives
– GPU-aware data movement
– OpenSHMEM (point-to-point and collectives)
– UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with MVAPICH2 stack
67
Big Data Middleware
(HDFS, MapReduce, HBase and Memcached)
Networking Technologies (InfiniBand, 1/10/40GigE
and Intelligent NICs)
Storage Technologies (HDD and SSD) Programming Models
(Sockets) Applications
Commodity Computing System Architectures
(Multi- and Many-core architectures and accelerators)
Other Protocols? Communication and I/O Library
Point-to-Point Communication QoS Threaded Models and Synchronization Fault-Tolerance I/O and File Systems
Virtualization Benchmarks
RDMA Protocols
Iterative Process – Requires Deeper Investigation and
Design for Benchmarking Next Generation Big Data
Systems and Applications
HPC Advisory Council Stanford Conference, Feb '14 68
Applications-Level Benchmarks
Micro-Benchmarks
• Upcoming RDMA for Apache Hadoop Releases will support
– HDFS Parallel Replication – Advanced MapReduce – HBase – Hadoop 2.0.x• Memcached-RDMA
• Exploration of other components (Threading models, QoS,
Virtualization, Accelerators, etc.)
• Advanced designs with upper-level changes and optimizations
• Comprehensive set of Benchmarks
• More details at
http://hadoop-rdma.cse.ohio-state.edu
HPC Advisory Council Stanford Conference, Feb '14
Future Plans of OSU Big Data Project
• Presented an overview of Big Data, Hadoop and Memcached
• Provided an overview of Networking Technologies
• Discussed challenges in accelerating Hadoop
• Presented initial designs to take advantage of InfiniBand/RDMA
for HDFS, MapReduce, RPC, HBase and Memcached
• Results are promising
• Many other open issues need to be solved
• Will enable Big Data community to take advantage of modern
HPC technologies to carry out their analytics in a fast and scalable
manner
HPC Advisory Council Stanford Conference, Feb '14
Concluding Remarks
HPC Advisory Council Stanford Conference, Feb '14
Personnel Acknowledgments
Current Students – S. Chakraborthy (Ph.D.) – N. Islam (Ph.D.) – J. Jose (Ph.D.) – M. Li (Ph.D.) – S. Potluri (Ph.D.) – R. Rajachandrasekhar (Ph.D.) Past Students – P. Balaji (Ph.D.) – D. Buntinas (Ph.D.) – S. Bhagvat (M.S.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) 71Past Research Scientist
– S. Sur Current Post-Docs – X. Lu – K. Hamidouche Current Programmers – M. Arnold – J. Perkins Past Post-Docs – H. Wang – X. Besseron – H.-W. Jin – M. Luo – W. Huang (Ph.D.) – W. Jiang (M.S.) – S. Kini (M.S.) – M. Koop (Ph.D.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – M. Rahman (Ph.D.) – D. Shankar (Ph.D.) – R. Shir (Ph.D.) – A. Venkatesh (Ph.D.) – J. Zhang (Ph.D.) – E. Mancini – S. Marcarelli – J. Vienne
Current Senior Research Associate
– H. Subramoni
Past Programmers
HPC Advisory Council Stanford Conference, Feb '14