Big Data: Hadoop and Memcached

(1)

Big Data: Hadoop and Memcached

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Stanford Conference and Exascale Workshop (Feb 2014)

(2)

Introduction to Big Data Applications and Analytics

• Big Data has become the one of the most important elements of business analytics • Provides groundbreaking opportunities

for enterprise information management and decision making

• The amount of data is exploding;

companies are capturing and digitizing more information than ever

• The rate of information growth appears to be exceeding Moore’s Law

2

HPC Advisory Council Stanford Conference, Feb '14

• Commonly accepted 3V’s of Big Data

• Volume, Velocity, Variety

Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf

(3)

• Scientific Data Management, Analysis, and Visualization • Applications examples – Climate modeling – Combustion – Fusion – Astrophysics – Bioinformatics

• Data Intensive Tasks

– Runs large-scale simulations on supercomputers – Dump data on parallel storage systems

– Collect experimental / observational data

– Move experimental / observational data to analysis sites – Visual analytics – help understand data visually

HPC Advisory Council Stanford Conference, Feb '14 3

Not Only in Internet Services - Big Data in Scientific

Domains

(4)

• Hadoop: http://hadoop.apache.org

– The most popular framework for Big Data Analytics

– HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.

• Spark: http://spark-project.org

– Provides primitives for in-memory cluster computing; Jobs can load data into memory and query it repeatedly

• Shark: http://shark.cs.berkeley.edu

– A fully Apache Hive-compatible data warehousing system

• Storm: http://storm-project.net

– A distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc.

• S4: http://incubator.apache.org/s4

– A distributed system for processing continuous unbounded streams of data

• Web 2.0: RDBMS + Memcached (http://memcached.org)

– Memcached: A high-performance, distributed memory object caching systems

Typical Solutions or Architectures for Big Data

Applications and Analytics

(5)

• F

ocuses on large data and data analysis

• Hadoop (e.g. HDFS, MapReduce, HBase, RPC) environment

is gaining a lot of momentum

• http://wiki.apache.org/hadoop/PoweredBy

(6)

• Scale: 42,000 nodes, 365PB HDFS, 10M daily slot hours

• http://developer.yahoo.com/blogs/hadoop/apache-hbase-yahoo-multi-tenancy-helm-again-171710422.html • A new Jim Gray’ Sort Record: 1.42 TB/min over 2,100

nodes

• Hadoop 0.23.7, an early branch of the Hadoop 2.X

• http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-092704088.html

Hadoop at Yahoo!

(7)

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

7

(8)

Big Data Processing with Hadoop

• The open-source implementation

of MapReduce programming

model for Big Data Analytics

• Major components

– HDFS

– MapReduce

• Underlying

Hadoop Distributed

File System (HDFS)

used by both

MapReduce and HBase

• Model scales but high amount of

communication during

intermediate phases can be

further optimized

8 HDFS MapReduce Hadoop Framework User Applications HBase Hadoop Common (RPC)

(9)

Hadoop MapReduce Architecture & Operations

Disk Operations

• Map and Reduce Tasks carry out the total job execution

• Communication in shuffle phase uses HTTP over Java Sockets

9

(10)

Hadoop Distributed File System (HDFS)

• Primary storage of Hadoop;

highly reliable and fault-tolerant

• Adopted by many reputed

organizations

– eg: Facebook, Yahoo!

• NameNode: stores the file system

namespace

• DataNode: stores data blocks

• Developed in Java for

platform-independence and portability

• Uses sockets for communication!

(HDFS Architecture)

10

RPC RPC RPC RPC

(11)

System-Level Interaction Between Clients and Data

Nodes in HDFS

High Performance Networks (HDD/SSD) (HDD/SSD) (HDD/SSD) ... ... (HDFS Data Nodes) (HDFS Clients) ... ...

(12)

HBase

• Apache Hadoop Database

(

http://hbase.apache.org/

)

• Semi-structured database,

which is highly scalable

• Integral part of many

datacenter applications

– eg:- Facebook Social Inbox

• Developed in Java for

platform-independence and portability

• Uses sockets for

communication!

ZooKeeper HBase Client HMaster HRegion Servers HDFS (HBase Architecture)

(13)

System-Level Interaction Among

HBase Clients, Region Servers, and Data Nodes

(HBase Clients) (HRegion Servers) (Data Nodes)

(HDD/SSD) (HDD/SSD) (HDD/SSD) ... ... ... ... ... ... High Performance Networks High Performance Networks

(14)

Memcached Architecture

• Integral part of Web 2.0 architecture • Distributed Caching Layer

– Allows to aggregate spare memory from multiple nodes – General purpose

• Typically used to cache database queries, results of API calls

• Scalable model, but typical usage very network intensive

!"#$%"$#&

Web Frontend Servers

(Memcached Clients) (Memcached Servers)

(Database Servers) High Performance Networks High Performance Networks Main memory CPUs SSD HDD High Performance Networks ... ... ... Main memory CPUs SSD HDD Main memory CPUs SSD HDD Main memory CPUs SSD HDD Main memory CPUs SSD HDD

(15)

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

• Open Challenges

• Conclusion and Q&A

Presentation Outline

15

(16)

Trends for Commodity Computing Clusters in the Top 500

List (http://www.top500.org)

0 10 20 30 40 50 60 70 80 90 100 0 50 100 150 200 250 300 350 400 450 500 P e rce n tag e of Clus te rs Nu mb e r of Clus te rs Timeline Percentage of Clusters Number of Clusters

(17)

HPC and Advanced Interconnects

• High-Performance Computing (HPC) has adopted advanced

interconnects and protocols

– InfiniBand

– 10 Gigabit Ethernet/iWARP

– RDMA over Converged Enhanced Ethernet (RoCE)

• Very Good Performance

– Low latency (few micro seconds)

– High Bandwidth (100 Gb/s with dual FDR InfiniBand) – Low CPU overhead (5-10%)

• OpenFabrics software stack with IB, iWARP and RoCE

interfaces are driving HPC systems

• Many such systems in Top500 list

17 HPC Advisory Council Stanford Conference, Feb '14

(18)

18

Trends of Networking Technologies in TOP500 Systems

Percentage share of InfiniBand is steadily increasing Interconnect Family – Systems Share

(19)

• Message Passing Interface (MPI)

– Most commonly used programming model for HPC applications

• Parallel File Systems

• Storage Systems

• Almost 13 years of Research and Development since

InfiniBand was introduced in October 2000

• Other Programming Models are emerging to take

advantage of High-Performance Networks

– UPC

– OpenSHMEM

– Hybrid MPI+PGAS (OpenSHMEM and UPC)

19

Use of High-Performance Networks for Scientific

Computing

(20)

One-way Latency: MPI over IB with MVAPICH2

0.00 1.00 2.00 3.00 4.00 5.00

6.00 Small Message Latency

Message Size (bytes)

Lat enc y (us ) 1.66 1.56 1.64 1.82 0.99 1.09 1.12

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch

0.00 50.00 100.00 150.00 200.00 250.00 Qlogic-DDR Qlogic-QDR ConnectX-DDR ConnectX2-PCIe2-QDR ConnectX3-PCIe3-FDR Sandy-ConnectIB-DualFDR Ivy-ConnectIB-DualFDR

Large Message Latency

Lat

enc

y

(us

(21)

Bandwidth: MPI over IB with MVAPICH2

0 2000 4000 6000 8000 10000 12000 14000 Unidirectional Bandwidth B an dwid th ( MBy tes /se c)

3280 3385 1917 1706 6343 12485 12810 0 5000 10000 15000 20000 25000 30000 Qlogic-DDR Qlogic-QDR ConnectX-DDR ConnectX2-PCIe2-QDR ConnectX3-PCIe3-FDR Sandy-ConnectIB-DualFDR Ivy-ConnectIB-DualFDR Bidirectional Bandwidth B an dwid th ( MBy tes /sec )

3341 3704 4407 11643 6521 21025 24727

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

(22)

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

• Open Challenges

• Conclusion and Q&A

Presentation Outline

22

(23)

Can High-Performance Interconnects Benefit Hadoop?

• Most of the current Hadoop environments use Ethernet

Infrastructure with Sockets

• Concerns for performance and scalability

• Usage of High-Performance Networks is beginning to draw

interest from many companies

• What are the challenges?

• Where do the bottlenecks lie?

• Can these bottlenecks be alleviated with new designs (similar

to the designs adopted for MPI)?

• Can HPC Clusters with High-Performance networks be used

for Big Data applications using Hadoop?

23 HPC Advisory Council Stanford Conference, Feb '14

(24)

Designing Communication and I/O Libraries for Big

Data Systems: Challenges

Big Data Middleware

(HDFS, MapReduce, HBase and Memcached)

Networking Technologies (InfiniBand, 1/10/40GigE

and Intelligent NICs)

Storage Technologies (HDD and SSD) Programming Models

(Sockets) Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols? Communication and I/O Library

Point-to-Point Communication QoS Threaded Models and Synchronization Fault-Tolerance I/O and File Systems

Virtualization Benchmarks

(25)

Common Protocols using Open Fabrics

Application/ Middleware

Verbs Sockets

Application /Middleware Interface

SDP RDMA SDP InfiniBand Adapter InfiniBand Switch RDMA IB Verbs InfiniBand Adapter InfiniBand Switch User space RDMA RoCE RoCE Adapter User space Ethernet Switch TCP/IP Ethernet Driver Kernel Space Protocol InfiniBand Adapter InfiniBand Switch IPoIB Ethernet Adapter Ethernet Switch Adapter Switch 1/10/40 GigE iWARP Ethernet Switch iWARP iWARP Adapter User space IPoIB TCP/IP Ethernet Adapter Ethernet Switch 10/40 GigE-TOE Hardware Offload RSockets InfiniBand Adapter InfiniBand Switch User space RSockets

(26)

Can Hadoop be Accelerated with High-Performance Networks

and Protocols?

26 Enhanced Designs Application Accelerated Sockets 10 GigE or InfiniBand Verbs / Hardware Offload Current Design Application Sockets 1/10 GigE Network

• Sockets not designed for high-performance

– Stream semantics often mismatch for upper layers (Hadoop and HBase) – Zero-copy not available for non-blocking sockets

Our Approach Application

OSU Design

10 GigE or InfiniBand

Verbs Interface

(27)

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

• Open Challenges

• Conclusion and Q&A

Presentation Outline

27

(28)

• High-Performance Design of Hadoop over RDMA-enabled Interconnects

– High performance design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components

– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)

• Current release: 0.9.8

– Based on Apache Hadoop 1.2.1

– Compliant with Apache Hadoop 1.2.1 APIs and applications – Tested with

• Mellanox InfiniBand adapters (DDR, QDR and FDR) • RoCE

• Various multi-core platforms

• Different file systems with disks and SSDs – http://hadoop-rdma.cse.ohio-state.edu

RDMA for Apache Hadoop Project

28

(29)

Design Overview of HDFS with RDMA

• JNI Layer bridges Java based HDFS with communication library written in native code

• Only the communication part of HDFS Write is modified; No change in HDFS architecture

29

HDFS

Verbs

RDMA Capable Networks

(IB, 10GE/ iWARP, RoCE ..)

Applications

1/10 GigE, IPoIB Network

Java Socket

Interface Java Native Interface (JNI)

Write Others

OSU Design

Enables high performance RDMA communication, while supporting traditional socket interface

• Design Features

– RDMA-based HDFS write – RDMA-based HDFS replication – InfiniBand/RoCE support

(30)

Communication Time in HDFS

• Cluster with HDD DataNodes

– 30% improvement in communication time over IPoIB (QDR)

– 56% improvement in communication time over 10GigE • Similar improvements are obtained for SSD DataNodes

Reduced by 30%

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012

30 0 5 10 15 20 25 2GB 4GB 6GB 8GB 10GB Co m m u n ic ati on T im e (s) File Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

(31)

Evaluations using TestDFSIO

• Cluster with 8 HDD DataNodes, single disk per node

– 24% improvement over IPoIB (QDR) for 20GB file size • Cluster with 4 SSD DataNodes, single SSD per node

– 61% improvement over IPoIB (QDR) for 20GB file size

Cluster with 8 HDD Nodes, TestDFSIO with 8 maps Cluster with 4 SSD Nodes, TestDFSIO with 4 maps Increased by 24% Increased by 61% 31 0 200 400 600 800 1000 1200 5 10 15 20 To tal Th ro u gh p u t (M B p s) File Size (GB) IPoIB (QDR) OSU-IB (QDR) 0 500 1000 1500 2000 2500 5 10 15 20 To tal Th ro u gh p u t (M B p s) File Size (GB)

(32)

Evaluations using TestDFSIO

• Cluster with 4 DataNodes, 1 HDD per node

• Cluster with 4 DataNodes, 2 HDD per node

• 2 HDD vs 1 HDD

– 76% improvement for OSU-IB (QDR)

– 66% improvement for IPoIB (QDR)

Increased by 31% 32 0 500 1000 1500 2000 2500 5 10 15 20 Tot al T h rou gh p u t (M Bps ) File Size (GB) 10GigE-1HDD 10GigE-2HDD IPoIB -1HDD (QDR) IPoIB -2HDD (QDR) OSU-IB -1HDD (QDR) OSU-IB -2HDD (QDR)

(33)

33

Evaluations using Enhanced DFSIO of Intel HiBench on

SDSC-Gordon

• Cluster with 32 DataNodes, single SSD per node

– 47% improvement in throughput over IPoIB (QDR) for 128GB file size

– 16% improvement in latency over IPoIB (QDR) for 128GB file size

Increased by 47% _{Reduced by 16%} 0 500 1000 1500 2000 2500 32 64 128 A gg re ga te d T h rou gh p u t (M Bps ) File Size (GB) IPoIB (QDR) OSU-IB (QDR) 0 20 40 60 80 100 120 140 160 32 64 128 Ex e cu ti on T im e (s) File Size (GB) IPoIB (QDR) OSU-IB (QDR)

(34)

Evaluations using Enhanced DFSIO of Intel HiBench on

TACC-Stampede

• Cluster with 64 DataNodes, single HDD per node

– 64% improvement in throughput over IPoIB (FDR) for 256GB file size

– 37% improvement in latency over IPoIB (FDR) for 256GB file size 0 200 400 600 800 1000 1200 1400 64 128 256 A gg re ga te d T h rou gh p u t (M Bps ) File Size (GB) IPoIB (FDR) OSU-IB (FDR) Increased by 64% Reduced by 37% 0 100 200 300 400 500 600 64 128 256 Ex e cu ti on T im e (s) File Size (GB) IPoIB (FDR) OSU-IB (FDR)

(35)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

• Open Challenges

• Conclusion and Q&A

Presentation Outline

35

(36)

Design Overview of MapReduce with RDMA

• Enables high

performance RDMA

communication, while

supporting traditional

socket interface

• JNI Layer bridges Java

based MapReduce

with communication

library written in

native code

• InfiniBand/RoCE

support

MapReduce Verbs

OSU Design

Applications

Java Socket

Job Tracker Task Tracker Map Reduce

Design features for RDMA

 Prefetch/Caching of MOF

 In-Memory Merge

 Overlap of Merge & Reduce

36

(37)

Evaluations using Sort

8 HDD DataNodes

• With 8 HDD DataNodes for 40GB sort

– 41% improvement over IPoIB (QDR)

– 39% improvement over Hadoop-A-IB

(QDR)

37

M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramon, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013.

8 SSD DataNodes

• With 8 SSD DataNodes for 40GB sort

– 44% improvement over Hadoop-A-IB

(QDR) 0 50 100 150 200 250 300 350 400 450 500 25 30 35 40 Job Ex e cu ti on T ime (se c) Data Size (GB) 10GigE IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR) 0 50 100 150 200 250 300 350 400 450 500 25 30 35 40 Job Ex e cu ti on T im e (se c) Data Size (GB) 10GigE IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR)

(38)

Evaluations using TeraSort

• 100 GB TeraSort with 8 DataNodes with 2 HDD per node

– 51%

improvement over IPoIB (QDR)

– 41%

improvement over Hadoop-A-IB (QDR)

38

0 200 400 600 800 1000 1200 1400 1600 1800 60 80 100 Job Ex ecu tion Time ( se c) Data Size (GB) 10GigE IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR)

(39)

• For 240GB Sort in 64 nodes

– 40% improvement over IPoIB (QDR) with HDD used for HDFS

39

Performance Evaluation on Larger Clusters

0 200 400 600 800 1000 1200

Data Size: 60 GB Data Size: 120 GB Data Size: 240 GB Cluster Size: 16 Cluster Size: 32 Cluster Size: 64

Job Ex e cu ti on T im e (se c) IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR)

Sort in OSU Cluster

0 100 200 300 400 500 600 700

Data Size: 80 GB Data Size: 160 GBData Size: 320 GB Cluster Size: 16 Cluster Size: 32 Cluster Size: 64

Job Ex e cu ti on T im e (se c)

IPoIB (FDR) Hadoop-A-IB (FDR) OSU-IB (FDR)

TeraSort in TACC Stampede

• For 320GB TeraSort in 64 nodes

– 38% improvement over IPoIB (FDR) with HDD used for HDFS

(40)

• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size

• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size

Evaluations using PUMA Workload

40

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

AdjList (30GB) SelfJoin (80GB) SeqCount (30GB) WordCount (30GB) InvertIndex (30GB)

No rma liz e d Ex e cu ti on T ime Benchmarks

(41)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

• Open Challenges

• Conclusion and Q&A

Presentation Outline

41

(42)

Motivation – Detailed Analysis on HBase Put/Get

• HBase 1KB Put

– Communication Time – 8.9 us

– A factor of 6X improvement over 10GE for communication time • HBase 1KB Get

– Communication Time – 8.9 us

– A factor of 6X improvement over 10GE for communication time

M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?,

ISPASS’12 0 20 40 60 80 100 120 140 160 180 200

IPoIB (DDR) 10GigE OSU-IB (DDR)

Ti m e ( u s) HBase Put 1KB 0 20 40 60 80 100 120 140 160

IPoIB (DDR) 10GigE OSU-IB (DDR)

Ti m e ( u s) Hbase Get 1KB Communication Communication Preparation Server Processing Server Serialization Client Processing Client Serialization

(43)

HBase Single-Server-Multi-Client Results

• HBase Get latency

– 4 clients: 104.5 us; 16 clients: 296.1 us

• HBase Get throughput

– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec

• 27% improvement in throughput for 16 clients over 10GE

0 50 100 150 200 250 300 350 400 450 1 2 4 8 16 Ti m e ( u s) No. of Clients 0 10000 20000 30000 40000 50000 60000 1 2 4 8 16 Op s/se c No. of Clients IPoIB OSU-IB 10GigE Latency _Throughput

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12

OSU-IB (DDR) IPoIB (DDR)

(44)

HBase – YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Service Benchmark) – 64 clients: 2.0 ms; 128 Clients: 3.5 ms

– 42% improvement over IPoIB for 128 clients • HBase Get latency

– 64 clients: 1.9 ms; 128 Clients: 3.5 ms

– 40% improvement over IPoIB for 128 clients 0 1000 2000 3000 4000 5000 6000 7000 8 16 32 64 96 128 Ti m e ( u s) No. of Clients 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 8 16 32 64 96 128 Ti m e ( u s) No. of Clients IPoIB OSU-IB 10GigE

Read Latency _{Write Latency}

OSU-IB (QDR) IPoIB (QDR)

(45)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

• Open Challenges

• Conclusion and Q&A

Presentation Outline

45

(46)

Design Overview of Hadoop RPC with RDMA

Hadoop RPC

Verbs

Applications

Java Socket

Our Design Default

OSU Design

Enables high performance RDMA communication, while supporting traditional socket interface

46

• Design Features

– JVM-bypassed buffer management – On-demand connection setup – RDMA or send/recv based adaptive communication – InfiniBand/RoCE support

(47)

Hadoop RPC over IB - Gain in Latency and Throughput

• Hadoop RPC over IB PingPong Latency

– 1 byte: 39 us; 4 KB: 52 us

– 42%-49% and 46%-50% improvements compared with the performance on 10 GigE

and IPoIB (32Gbps), respectively

• Hadoop RPC over IB Throughput

– 512 bytes & 48 clients: 135.22 Kops/sec

– 82% and 64% improvements compared with the peak performance on 10 GigE and

IPoIB (32Gbps), respectively 0 20 40 60 80 100 120 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 Late n cy ( u s)

Payload Size (Byte)

10GigE IPoIB(QDR) OSU-IB(QDR) 0 20 40 60 80 100 120 140 160 8 16 24 32 40 48 56 64 Th ro u gh p u t (K o p s/Sec ) Number of Clients 47

(48)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

• Open Challenges

• Conclusion and Q&A

Presentation Outline

48

(49)

• For 5GB,

– 53.4% improvement over IPoIB, 126.9% improvement over 10GigE with HDD used for HDFS

– 57.8% improvement over IPoIB, 138.1% improvement over 10GigE with SSD used for HDFS

• For 20GB,

49

Performance Benefits - TestDFSIO

Cluster with 4 HDD Nodes, TestDFSIO with total 4 maps Cluster with 4 SSD Nodes, TestDFSIO with total 4 maps

0 100 200 300 400 500 600 700 800 900 1000 5 10 15 20 Tot al T h rou gh p u t (M Bps ) Size (GB)

0 100 200 300 400 500 600 700 800 900 5 10 15 20 Tot al T h rou gh p u t (M Bps ) Size (GB)

(50)

• For 80GB Sort in a cluster size of 8,

50

Performance Benefits - Sort

0 200 400 600 800 1000 1200 1400 1600 1800 48 64 80 Job Ex e cu ti on T im e (se c) Sort Size (GB)

Cluster with 8 HDD Nodes, Sort with <8 maps, 8 reduces>/host Cluster with 8 SSD Nodes, Sort with <8 maps, 8 reduces>/host

0 200 400 600 800 1000 1200 1400 1600 48 64 80 Job Ex e cu ti on T im e (se c) Sort Size (GB)

(51)

• For 100GB TeraSort in a cluster size of 8,

– 45% improvement over IPoIB, 39% improvement over 10GigE with HDD used for HDFS

– 46% improvement over IPoIB, 40% improvement over 10GigE with SSD used for HDFS

51

Performance Benefits - TeraSort

Cluster with 8 HDD Nodes, TeraSort with <8 maps, 8reduces>/host Cluster with 8 SSD Nodes, TeraSort with <8 maps, 8reduces>/host

0 500 1000 1500 2000 2500 80 90 100 Job Ex ecu tion Time ( se c) Size (GB)

(52)

0 50 100 150 200 250 300 350 400 50 GB 100 GB 200 GB 300 GB Cluster: 16 Cluster: 32 Cluster: 48 Cluster: 64

Job Ex ecu tion Time ( se c) IPoIB (QDR) OSU-IB (QDR) 0 20 40 60 80 100 120 140 160 50 GB 100 GB 200 GB 300 GB Cluster: 16 Cluster: 32 Cluster: 48 Cluster: 64

Job Ex ecu tion Time ( se c) IPoIB (QDR) OSU-IB (QDR)

Performance Benefits – RandomWriter & Sort in SDSC-Gordon

• RandomWriter in a cluster of single SSD per node

– 16% improvement over IPoIB (QDR) for

50GB in a cluster of 16

– 20% improvement over IPoIB (QDR) for

300GB in a cluster of 64

Reduced by 20% Reduced by 36%

52

• Sort in a cluster of single SSD per node

for 50GB in a cluster of 16

(53)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

• Open Challenges

• Conclusion and Q&A

Presentation Outline

53

(54)

Memcached-RDMA Design

• Server and client perform a negotiation protocol

– Master thread assigns clients to appropriate worker thread

• Once a client is assigned a verbs worker thread, it can

communicate directly and is “bound” to that thread

• All other Memcached data structures are shared among RDMA

and Sockets worker threads

Sockets Client RDMA Client Master Thread Sockets Worker Thread Verbs Worker Thread Sockets Worker Thread Verbs Worker Thread Shared Data Memory Slabs Items … 1 1 2 2 54

(55)

55

• Memcached Get latency

– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us – 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us

• Almost factor of four improvement over 10GE (TOE) for 2K bytes on the DDR cluster

Intel Clovertown Cluster (IB: DDR) Intel Westmere Cluster (IB: QDR)

0 10 20 30 40 50 60 70 80 90 100 1 2 4 8 16 32 64 128 256 512 1K 2K Ti m e ( u s) Message Size 0 10 20 30 40 50 60 70 80 90 100 1 2 4 8 16 32 64 128 256 512 1K 2K Ti m e ( u s) Message Size IPoIB (QDR) 10GigE (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR)

Memcached Get Latency (Small Message)

T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% Ti m e% (u s) % Message%Size% T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% Ti m e% (u s) % Message%Size% IPoIB (DDR) 10GigE (DDR) OSU-RC-IB (DDR) OSU-UD-IB (DDR)

(56)

Memcached Get TPS (4byte)

• Memcached Get transactions per second for 4 bytes – On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients

• Significant improvement with native IB QDR compared to SDP and IPoIB 0 200 400 600 800 1000 1200 1400 1600 1 2 4 8 16 32 64 128 256 512 800 1K Th o u san d s o f Tr an sact io n s p e r sec o n d (TPS ) No. of Clients 0 200 400 600 800 1000 1200 1400 1600 4 8 Th o u san d s o f Tr an sact io n s p e r sec o n d (TPS ) No. of Clients T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% Ti m e% (u s) % Message%Size% IPoIB (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR)

(57)

1 10 100 1000 1 2 4 8 ₁₆ ₃₂ ₆₄ 128 256 512 1K 2K 4K Time (us ) Message Size OSU-IB (FDR) IPoIB (FDR) 0 100 200 300 400 500 600 700 16 32 64 128 256 512 1024 2048 4080 Thou sa nds of Tr an sa ct ion s per Sec ond ( TPS ) No. of Clients 57

• Memcached Get latency

– 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us – 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us • Memcached Throughput (4bytes)

– 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s – Nearly 2X improvement in throughput

Memcached GET Latency Memcached Throughput

Memcached Performance (FDR Interconnect)

Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR) Latency Reduced

by nearly 20X

2X

(58)

Application Level Evaluation – Olio Benchmark

• Olio Benchmark

– RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients

• 4X times better than IPoIB for 8 clients

• Hybrid design achieves comparable performance to that of pure RC design 0 500 1000 1500 2000 2500 64 128 256 512 1024 Ti m e ( m s) No. of Clients 0 10 20 30 40 50 60 1 4 8 Ti m e ( m s) No. of Clients IPoIB (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR) OSU-Hybrid-IB (QDR) Ti m e %( m s) % No.%of%Clients%

(59)

Application Level Evaluation –

Real Application Workloads

• Real Application Workload

– RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients

• 12X times better than IPoIB for 8 clients

• Hybrid design achieves comparable performance to that of pure RC design 0 10 20 30 40 50 60 70 80 90 1 4 8 Ti m e ( m s) No. of Clients 0 50 100 150 200 250 300 350 64 128 256 512 1024 Ti m e ( m s) No. of Clients

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11

J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid’12

IPoIB (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR) OSU-Hybrid-IB (QDR) Ti m e %( m s) % No.%of%Clients%

(60)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

• Open Challenges

• Conclusion and Q&A

Presentation Outline

60

(61)

RDMA Protocol

Designing Communication and I/O Libraries for Big

Data Systems: Solved a Few Initial Challenges

Upper level Changes?

(62)

• Multi-threading and Synchronization

– Multi-threaded model exploration

– Fine-grained synchronization and lock-free design – Unified helper threads for different components

– Multi-endpoint design to support multi-threading communications

• QoS and Virtualization

– Network virtualization and locality-aware communication for Big Data middleware

– Hardware-level virtualization support for End-to-End QoS – I/O scheduling and storage virtualization

– Live migration

(63)

• Support of Accelerators

– Efficient designs for Big Data middleware to take advantage of NVIDA GPGPUs and Intel MICs

– Offload computation-intensive workload to accelerators

– Explore maximum overlapping between communication and offloaded computation

• Fault Tolerance Enhancements

– Novel data replication schemes for high-efficiency fault tolerance – Exploration of light-weight fault tolerance mechanisms for Big Data

processing

• Support of Parallel File Systems

– Optimize Big Data middleware over parallel file systems (e.g. Lustre) on modern HPC clusters

(64)

• For 500GB Sort in 64 nodes

– 44% improvement over IPoIB (FDR)

64

Performance Improvement Potential of MapReduce over

Lustre on HPC Clusters – An Initial Study

Sort in TACC Stampede Sort in SDSC Gordon

• For 100GB Sort in 16 nodes

– 33% improvement over IPoIB (QDR) 0 200 400 600 800 1000 1200 300 400 500 Job Ex e cu ti on T im e (se c) Data Size (GB) IPoIB (FDR) OSU-IB (FDR) 0 50 100 150 200 250 60 80 100 Job Ex e cu ti on T im e (se c) Data Size (GB) IPoIB (QDR) OSU-IB (QDR)

(65)

• The current benchmarks provide some performance

behavior

• However, do not provide any information to the

designer/developer on:

– What is happening at the lower-layer? – Where the benefits are coming from?

– Which design is leading to benefits or bottlenecks?

– Which component in the design needs to be changed and what will be its impact?

– Can performance gain/loss at the lower-layer be correlated to the performance gain/loss observed at the upper layer?

(66)

RDMA Protocols

Challenges in Benchmarking of RDMA-based Designs

66

Current Benchmarks

No Benchmarks

Correlation?

(67)

OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to

– Compare performance of different MPI libraries on various networks and systems

– Validate low-level functionalities

– Provide insights to the underlying MPI-level designs

• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth

• Extended later to

– MPI-2 one-sided

– Collectives

– GPU-aware data movement

– OpenSHMEM (point-to-point and collectives)

– UPC

• Has become an industry standard

• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems

• Available from http://mvapich.cse.ohio-state.edu/benchmarks

• Available in an integrated manner with MVAPICH2 stack

67

(68)

RDMA Protocols

Iterative Process – Requires Deeper Investigation and

Design for Benchmarking Next Generation Big Data

Systems and Applications

Applications-Level Benchmarks

Micro-Benchmarks

(69)

• Upcoming RDMA for Apache Hadoop Releases will support

– HDFS Parallel Replication – Advanced MapReduce – HBase – Hadoop 2.0.x

• Memcached-RDMA

• Exploration of other components (Threading models, QoS,

Virtualization, Accelerators, etc.)

• Advanced designs with upper-level changes and optimizations

• Comprehensive set of Benchmarks

• More details at

http://hadoop-rdma.cse.ohio-state.edu

Future Plans of OSU Big Data Project

(70)

• Presented an overview of Big Data, Hadoop and Memcached

• Provided an overview of Networking Technologies

• Discussed challenges in accelerating Hadoop

• Presented initial designs to take advantage of InfiniBand/RDMA

for HDFS, MapReduce, RPC, HBase and Memcached

• Results are promising

• Many other open issues need to be solved

• Will enable Big Data community to take advantage of modern

HPC technologies to carry out their analytics in a fast and scalable

manner

Concluding Remarks

(71)

Personnel Acknowledgments

Current Students – S. Chakraborthy (Ph.D.) – N. Islam (Ph.D.) – J. Jose (Ph.D.) – M. Li (Ph.D.) – S. Potluri (Ph.D.) – R. Rajachandrasekhar (Ph.D.) Past Students – P. Balaji (Ph.D.) – D. Buntinas (Ph.D.) – S. Bhagvat (M.S.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) 71

Past Research Scientist

– S. Sur Current Post-Docs – X. Lu – K. Hamidouche Current Programmers – M. Arnold – J. Perkins Past Post-Docs – H. Wang – X. Besseron – H.-W. Jin – M. Luo – W. Huang (Ph.D.) – W. Jiang (M.S.) – S. Kini (M.S.) – M. Koop (Ph.D.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – M. Rahman (Ph.D.) – D. Shankar (Ph.D.) – R. Shir (Ph.D.) – A. Venkatesh (Ph.D.) – J. Zhang (Ph.D.) – E. Mancini – S. Marcarelli – J. Vienne

Current Senior Research Associate

– H. Subramoni

Past Programmers

(72)

Pointers

http://nowlab.cse.ohio-state.edu [email protected] http://www.cse.ohio-state.edu/~panda 72 http://mvapich.cse.ohio-state.edu http://hadoop-rdma.cse.ohio-state.edu