• No results found

Big Data: Hadoop and Memcached

N/A
N/A
Protected

Academic year: 2021

Share "Big Data: Hadoop and Memcached"

Copied!
72
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data: Hadoop and Memcached

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Stanford Conference and Exascale Workshop (Feb 2014)

(2)

Introduction to Big Data Applications and Analytics

• Big Data has become the one of the most important elements of business analytics • Provides groundbreaking opportunities

for enterprise information management and decision making

• The amount of data is exploding;

companies are capturing and digitizing more information than ever

• The rate of information growth appears to be exceeding Moore’s Law

2

HPC Advisory Council Stanford Conference, Feb '14

• Commonly accepted 3V’s of Big Data

• Volume, Velocity, Variety

Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf

(3)

• Scientific Data Management, Analysis, and Visualization • Applications examples – Climate modeling – Combustion – Fusion – Astrophysics – Bioinformatics

• Data Intensive Tasks

– Runs large-scale simulations on supercomputers – Dump data on parallel storage systems

– Collect experimental / observational data

– Move experimental / observational data to analysis sites – Visual analytics – help understand data visually

HPC Advisory Council Stanford Conference, Feb '14 3

Not Only in Internet Services - Big Data in Scientific

Domains

(4)

• Hadoop: http://hadoop.apache.org

– The most popular framework for Big Data Analytics

– HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.

• Spark: http://spark-project.org

– Provides primitives for in-memory cluster computing; Jobs can load data into memory and query it repeatedly

• Shark: http://shark.cs.berkeley.edu

– A fully Apache Hive-compatible data warehousing system

• Storm: http://storm-project.net

– A distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc.

• S4: http://incubator.apache.org/s4

– A distributed system for processing continuous unbounded streams of data

• Web 2.0: RDBMS + Memcached (http://memcached.org)

– Memcached: A high-performance, distributed memory object caching systems

HPC Advisory Council Stanford Conference, Feb '14 4

Typical Solutions or Architectures for Big Data

Applications and Analytics

(5)

• F

ocuses on large data and data analysis

• Hadoop (e.g. HDFS, MapReduce, HBase, RPC) environment

is gaining a lot of momentum

http://wiki.apache.org/hadoop/PoweredBy

HPC Advisory Council Stanford Conference, Feb '14 5

(6)

• Scale: 42,000 nodes, 365PB HDFS, 10M daily slot hours

• http://developer.yahoo.com/blogs/hadoop/apache-hbase-yahoo-multi-tenancy-helm-again-171710422.html • A new Jim Gray’ Sort Record: 1.42 TB/min over 2,100

nodes

• Hadoop 0.23.7, an early branch of the Hadoop 2.X

• http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-092704088.html

HPC Advisory Council Stanford Conference, Feb '14 6

Hadoop at Yahoo!

(7)

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

7

(8)

Big Data Processing with Hadoop

• The open-source implementation

of MapReduce programming

model for Big Data Analytics

• Major components

– HDFS

– MapReduce

• Underlying

Hadoop Distributed

File System (HDFS)

used by both

MapReduce and HBase

• Model scales but high amount of

communication during

intermediate phases can be

further optimized

8 HDFS MapReduce Hadoop Framework User Applications HBase Hadoop Common (RPC)

(9)

Hadoop MapReduce Architecture & Operations

Disk Operations

• Map and Reduce Tasks carry out the total job execution

• Communication in shuffle phase uses HTTP over Java Sockets

9

(10)

Hadoop Distributed File System (HDFS)

• Primary storage of Hadoop;

highly reliable and fault-tolerant

• Adopted by many reputed

organizations

– eg: Facebook, Yahoo!

• NameNode: stores the file system

namespace

• DataNode: stores data blocks

• Developed in Java for

platform-independence and portability

• Uses sockets for communication!

(HDFS Architecture)

10

RPC RPC RPC RPC

HPC Advisory Council Stanford Conference, Feb '14

(11)

HPC Advisory Council Stanford Conference, Feb '14 11

System-Level Interaction Between Clients and Data

Nodes in HDFS

High Performance Networks (HDD/SSD) (HDD/SSD) (HDD/SSD) ... ... (HDFS Data Nodes) (HDFS Clients) ... ...

(12)

HPC Advisory Council Stanford Conference, Feb '14 12

HBase

• Apache Hadoop Database

(

http://hbase.apache.org/

)

• Semi-structured database,

which is highly scalable

• Integral part of many

datacenter applications

– eg:- Facebook Social Inbox

• Developed in Java for

platform-independence and portability

• Uses sockets for

communication!

ZooKeeper HBase Client HMaster HRegion Servers HDFS (HBase Architecture)

(13)

HPC Advisory Council Stanford Conference, Feb '14 13

System-Level Interaction Among

HBase Clients, Region Servers, and Data Nodes

(HBase Clients) (HRegion Servers) (Data Nodes)

(HDD/SSD) (HDD/SSD) (HDD/SSD) ... ... ... ... ... ... High Performance Networks High Performance Networks

(14)

HPC Advisory Council Stanford Conference, Feb '14 14

Memcached Architecture

• Integral part of Web 2.0 architecture • Distributed Caching Layer

– Allows to aggregate spare memory from multiple nodes – General purpose

• Typically used to cache database queries, results of API calls

• Scalable model, but typical usage very network intensive

!"#$%"$#&

Web Frontend Servers

(Memcached Clients) (Memcached Servers)

(Database Servers) High Performance Networks High Performance Networks Main memory CPUs SSD HDD High Performance Networks ... ... ... Main memory CPUs SSD HDD Main memory CPUs SSD HDD Main memory CPUs SSD HDD Main memory CPUs SSD HDD

(15)

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

15

(16)

HPC Advisory Council Stanford Conference, Feb '14 16

Trends for Commodity Computing Clusters in the Top 500

List (http://www.top500.org)

0 10 20 30 40 50 60 70 80 90 100 0 50 100 150 200 250 300 350 400 450 500 P e rce n tag e of Clus te rs Nu mb e r of Clus te rs Timeline Percentage of Clusters Number of Clusters

(17)

HPC and Advanced Interconnects

• High-Performance Computing (HPC) has adopted advanced

interconnects and protocols

– InfiniBand

– 10 Gigabit Ethernet/iWARP

– RDMA over Converged Enhanced Ethernet (RoCE)

• Very Good Performance

– Low latency (few micro seconds)

– High Bandwidth (100 Gb/s with dual FDR InfiniBand) – Low CPU overhead (5-10%)

• OpenFabrics software stack with IB, iWARP and RoCE

interfaces are driving HPC systems

• Many such systems in Top500 list

17 HPC Advisory Council Stanford Conference, Feb '14

(18)

18

Trends of Networking Technologies in TOP500 Systems

Percentage share of InfiniBand is steadily increasing Interconnect Family – Systems Share

(19)

• Message Passing Interface (MPI)

– Most commonly used programming model for HPC applications

• Parallel File Systems

• Storage Systems

• Almost 13 years of Research and Development since

InfiniBand was introduced in October 2000

• Other Programming Models are emerging to take

advantage of High-Performance Networks

– UPC

– OpenSHMEM

– Hybrid MPI+PGAS (OpenSHMEM and UPC)

19

Use of High-Performance Networks for Scientific

Computing

(20)

HPC Advisory Council Stanford Conference, Feb '14 20

One-way Latency: MPI over IB with MVAPICH2

0.00 1.00 2.00 3.00 4.00 5.00

6.00 Small Message Latency

Message Size (bytes)

Lat enc y (us ) 1.66 1.56 1.64 1.82 0.99 1.09 1.12

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch

0.00 50.00 100.00 150.00 200.00 250.00 Qlogic-DDR Qlogic-QDR ConnectX-DDR ConnectX2-PCIe2-QDR ConnectX3-PCIe3-FDR Sandy-ConnectIB-DualFDR Ivy-ConnectIB-DualFDR

Large Message Latency

Message Size (bytes)

Lat

enc

y

(us

(21)

HPC Advisory Council Stanford Conference, Feb '14 21

Bandwidth: MPI over IB with MVAPICH2

0 2000 4000 6000 8000 10000 12000 14000 Unidirectional Bandwidth B an dwid th ( MBy tes /se c)

Message Size (bytes)

3280 3385 1917 1706 6343 12485 12810 0 5000 10000 15000 20000 25000 30000 Qlogic-DDR Qlogic-QDR ConnectX-DDR ConnectX2-PCIe2-QDR ConnectX3-PCIe3-FDR Sandy-ConnectIB-DualFDR Ivy-ConnectIB-DualFDR Bidirectional Bandwidth B an dwid th ( MBy tes /sec )

Message Size (bytes)

3341 3704 4407 11643 6521 21025 24727

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

(22)

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

22

(23)

Can High-Performance Interconnects Benefit Hadoop?

• Most of the current Hadoop environments use Ethernet

Infrastructure with Sockets

• Concerns for performance and scalability

• Usage of High-Performance Networks is beginning to draw

interest from many companies

• What are the challenges?

• Where do the bottlenecks lie?

• Can these bottlenecks be alleviated with new designs (similar

to the designs adopted for MPI)?

• Can HPC Clusters with High-Performance networks be used

for Big Data applications using Hadoop?

23 HPC Advisory Council Stanford Conference, Feb '14

(24)

Designing Communication and I/O Libraries for Big

Data Systems: Challenges

HPC Advisory Council Stanford Conference, Feb '14 24

Big Data Middleware

(HDFS, MapReduce, HBase and Memcached)

Networking Technologies (InfiniBand, 1/10/40GigE

and Intelligent NICs)

Storage Technologies (HDD and SSD) Programming Models

(Sockets) Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols? Communication and I/O Library

Point-to-Point Communication QoS Threaded Models and Synchronization Fault-Tolerance I/O and File Systems

Virtualization Benchmarks

(25)

HPC Advisory Council Stanford Conference, Feb '14 25

Common Protocols using Open Fabrics

Application/ Middleware

Verbs Sockets

Application /Middleware Interface

SDP RDMA SDP InfiniBand Adapter InfiniBand Switch RDMA IB Verbs InfiniBand Adapter InfiniBand Switch User space RDMA RoCE RoCE Adapter User space Ethernet Switch TCP/IP Ethernet Driver Kernel Space Protocol InfiniBand Adapter InfiniBand Switch IPoIB Ethernet Adapter Ethernet Switch Adapter Switch 1/10/40 GigE iWARP Ethernet Switch iWARP iWARP Adapter User space IPoIB TCP/IP Ethernet Adapter Ethernet Switch 10/40 GigE-TOE Hardware Offload RSockets InfiniBand Adapter InfiniBand Switch User space RSockets

(26)

Can Hadoop be Accelerated with High-Performance Networks

and Protocols?

26 Enhanced Designs Application Accelerated Sockets 10 GigE or InfiniBand Verbs / Hardware Offload Current Design Application Sockets 1/10 GigE Network

• Sockets not designed for high-performance

– Stream semantics often mismatch for upper layers (Hadoop and HBase) – Zero-copy not available for non-blocking sockets

Our Approach Application

OSU Design

10 GigE or InfiniBand

Verbs Interface

(27)

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

27

(28)

• High-Performance Design of Hadoop over RDMA-enabled Interconnects

– High performance design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components

– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)

• Current release: 0.9.8

– Based on Apache Hadoop 1.2.1

– Compliant with Apache Hadoop 1.2.1 APIs and applications – Tested with

• Mellanox InfiniBand adapters (DDR, QDR and FDR) • RoCE

• Various multi-core platforms

• Different file systems with disks and SSDs – http://hadoop-rdma.cse.ohio-state.edu

RDMA for Apache Hadoop Project

28

(29)

Design Overview of HDFS with RDMA

• JNI Layer bridges Java based HDFS with communication library written in native code

Only the communication part of HDFS Write is modified; No change in HDFS architecture

29

HDFS

Verbs

RDMA Capable Networks

(IB, 10GE/ iWARP, RoCE ..)

Applications

1/10 GigE, IPoIB Network

Java Socket

Interface Java Native Interface (JNI)

Write Others

OSU Design

Enables high performance RDMA communication, while supporting traditional socket interface

• Design Features

– RDMA-based HDFS write – RDMA-based HDFS replication – InfiniBand/RoCE support

(30)

Communication Time in HDFS

• Cluster with HDD DataNodes

– 30% improvement in communication time over IPoIB (QDR)

– 56% improvement in communication time over 10GigE • Similar improvements are obtained for SSD DataNodes

Reduced by 30%

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012

30 0 5 10 15 20 25 2GB 4GB 6GB 8GB 10GB Co m m u n ic ati on T im e (s) File Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

(31)

Evaluations using TestDFSIO

• Cluster with 8 HDD DataNodes, single disk per node

– 24% improvement over IPoIB (QDR) for 20GB file size • Cluster with 4 SSD DataNodes, single SSD per node

– 61% improvement over IPoIB (QDR) for 20GB file size

Cluster with 8 HDD Nodes, TestDFSIO with 8 maps Cluster with 4 SSD Nodes, TestDFSIO with 4 maps Increased by 24% Increased by 61% 31 0 200 400 600 800 1000 1200 5 10 15 20 To tal Th ro u gh p u t (M B p s) File Size (GB) IPoIB (QDR) OSU-IB (QDR) 0 500 1000 1500 2000 2500 5 10 15 20 To tal Th ro u gh p u t (M B p s) File Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

(32)

Evaluations using TestDFSIO

Cluster with 4 DataNodes, 1 HDD per node

– 24% improvement over IPoIB (QDR) for 20GB file size

• Cluster with 4 DataNodes, 2 HDD per node

– 31% improvement over IPoIB (QDR) for 20GB file size

• 2 HDD vs 1 HDD

– 76% improvement for OSU-IB (QDR)

– 66% improvement for IPoIB (QDR)

Increased by 31% 32 0 500 1000 1500 2000 2500 5 10 15 20 Tot al T h rou gh p u t (M Bps ) File Size (GB) 10GigE-1HDD 10GigE-2HDD IPoIB -1HDD (QDR) IPoIB -2HDD (QDR) OSU-IB -1HDD (QDR) OSU-IB -2HDD (QDR)

(33)

33

Evaluations using Enhanced DFSIO of Intel HiBench on

SDSC-Gordon

• Cluster with 32 DataNodes, single SSD per node

– 47% improvement in throughput over IPoIB (QDR) for 128GB file size

– 16% improvement in latency over IPoIB (QDR) for 128GB file size

HPC Advisory Council Stanford Conference, Feb '14

Increased by 47% Reduced by 16% 0 500 1000 1500 2000 2500 32 64 128 A gg re ga te d T h rou gh p u t (M Bps ) File Size (GB) IPoIB (QDR) OSU-IB (QDR) 0 20 40 60 80 100 120 140 160 32 64 128 Ex e cu ti on T im e (s) File Size (GB) IPoIB (QDR) OSU-IB (QDR)

(34)

HPC Advisory Council Stanford Conference, Feb '14 34

Evaluations using Enhanced DFSIO of Intel HiBench on

TACC-Stampede

• Cluster with 64 DataNodes, single HDD per node

– 64% improvement in throughput over IPoIB (FDR) for 256GB file size

– 37% improvement in latency over IPoIB (FDR) for 256GB file size 0 200 400 600 800 1000 1200 1400 64 128 256 A gg re ga te d T h rou gh p u t (M Bps ) File Size (GB) IPoIB (FDR) OSU-IB (FDR) Increased by 64% Reduced by 37% 0 100 200 300 400 500 600 64 128 256 Ex e cu ti on T im e (s) File Size (GB) IPoIB (FDR) OSU-IB (FDR)

(35)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

35

(36)

Design Overview of MapReduce with RDMA

• Enables high

performance RDMA

communication, while

supporting traditional

socket interface

• JNI Layer bridges Java

based MapReduce

with communication

library written in

native code

• InfiniBand/RoCE

support

MapReduce Verbs

RDMA Capable Networks

(IB, 10GE/ iWARP, RoCE ..)

OSU Design

Applications

1/10 GigE, IPoIB Network

Java Socket

Interface Java Native Interface (JNI)

Job Tracker Task Tracker Map Reduce

Design features for RDMA

Prefetch/Caching of MOF

In-Memory Merge

Overlap of Merge & Reduce

36

(37)

Evaluations using Sort

8 HDD DataNodes

• With 8 HDD DataNodes for 40GB sort

– 41% improvement over IPoIB (QDR)

– 39% improvement over Hadoop-A-IB

(QDR)

37

M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramon, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013.

HPC Advisory Council Stanford Conference, Feb '14

8 SSD DataNodes

• With 8 SSD DataNodes for 40GB sort

– 52% improvement over IPoIB (QDR)

– 44% improvement over Hadoop-A-IB

(QDR) 0 50 100 150 200 250 300 350 400 450 500 25 30 35 40 Job Ex e cu ti on T ime (se c) Data Size (GB) 10GigE IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR) 0 50 100 150 200 250 300 350 400 450 500 25 30 35 40 Job Ex e cu ti on T im e (se c) Data Size (GB) 10GigE IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR)

(38)

Evaluations using TeraSort

• 100 GB TeraSort with 8 DataNodes with 2 HDD per node

– 51%

improvement over IPoIB (QDR)

– 41%

improvement over Hadoop-A-IB (QDR)

38

HPC Advisory Council Stanford Conference, Feb '14

0 200 400 600 800 1000 1200 1400 1600 1800 60 80 100 Job Ex ecu tion Time ( se c) Data Size (GB) 10GigE IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR)

(39)

• For 240GB Sort in 64 nodes

– 40% improvement over IPoIB (QDR) with HDD used for HDFS

39

Performance Evaluation on Larger Clusters

HPC Advisory Council Stanford Conference, Feb '14

0 200 400 600 800 1000 1200

Data Size: 60 GB Data Size: 120 GB Data Size: 240 GB Cluster Size: 16 Cluster Size: 32 Cluster Size: 64

Job Ex e cu ti on T im e (se c) IPoIB (QDR) Hadoop-A-IB (QDR) OSU-IB (QDR)

Sort in OSU Cluster

0 100 200 300 400 500 600 700

Data Size: 80 GB Data Size: 160 GBData Size: 320 GB Cluster Size: 16 Cluster Size: 32 Cluster Size: 64

Job Ex e cu ti on T im e (se c)

IPoIB (FDR) Hadoop-A-IB (FDR) OSU-IB (FDR)

TeraSort in TACC Stampede

• For 320GB TeraSort in 64 nodes

– 38% improvement over IPoIB (FDR) with HDD used for HDFS

(40)

• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size

• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size

Evaluations using PUMA Workload

40

HPC Advisory Council Stanford Conference, Feb '14

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

AdjList (30GB) SelfJoin (80GB) SeqCount (30GB) WordCount (30GB) InvertIndex (30GB)

No rma liz e d Ex e cu ti on T ime Benchmarks

(41)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

41

(42)

HPC Advisory Council Stanford Conference, Feb '14 42

Motivation – Detailed Analysis on HBase Put/Get

• HBase 1KB Put

– Communication Time – 8.9 us

– A factor of 6X improvement over 10GE for communication time • HBase 1KB Get

– Communication Time – 8.9 us

– A factor of 6X improvement over 10GE for communication time

M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?,

ISPASS’12 0 20 40 60 80 100 120 140 160 180 200

IPoIB (DDR) 10GigE OSU-IB (DDR)

Ti m e ( u s) HBase Put 1KB 0 20 40 60 80 100 120 140 160

IPoIB (DDR) 10GigE OSU-IB (DDR)

Ti m e ( u s) Hbase Get 1KB Communication Communication Preparation Server Processing Server Serialization Client Processing Client Serialization

(43)

HPC Advisory Council Stanford Conference, Feb '14 43

HBase Single-Server-Multi-Client Results

• HBase Get latency

– 4 clients: 104.5 us; 16 clients: 296.1 us

• HBase Get throughput

– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec

• 27% improvement in throughput for 16 clients over 10GE

0 50 100 150 200 250 300 350 400 450 1 2 4 8 16 Ti m e ( u s) No. of Clients 0 10000 20000 30000 40000 50000 60000 1 2 4 8 16 Op s/se c No. of Clients IPoIB OSU-IB 10GigE Latency Throughput

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12

OSU-IB (DDR) IPoIB (DDR)

(44)

HPC Advisory Council Stanford Conference, Feb '14 44

HBase – YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Service Benchmark) – 64 clients: 2.0 ms; 128 Clients: 3.5 ms

– 42% improvement over IPoIB for 128 clients • HBase Get latency

– 64 clients: 1.9 ms; 128 Clients: 3.5 ms

– 40% improvement over IPoIB for 128 clients 0 1000 2000 3000 4000 5000 6000 7000 8 16 32 64 96 128 Ti m e ( u s) No. of Clients 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 8 16 32 64 96 128 Ti m e ( u s) No. of Clients IPoIB OSU-IB 10GigE

Read Latency Write Latency

OSU-IB (QDR) IPoIB (QDR)

(45)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

45

(46)

Design Overview of Hadoop RPC with RDMA

Hadoop RPC

Verbs

RDMA Capable Networks

(IB, 10GE/ iWARP, RoCE ..)

Applications

1/10 GigE, IPoIB Network

Java Socket

Interface Java Native Interface (JNI)

Our Design Default

OSU Design

Enables high performance RDMA communication, while supporting traditional socket interface

46

• Design Features

– JVM-bypassed buffer management – On-demand connection setup – RDMA or send/recv based adaptive communication – InfiniBand/RoCE support

(47)

Hadoop RPC over IB - Gain in Latency and Throughput

• Hadoop RPC over IB PingPong Latency

– 1 byte: 39 us; 4 KB: 52 us

– 42%-49% and 46%-50% improvements compared with the performance on 10 GigE

and IPoIB (32Gbps), respectively

• Hadoop RPC over IB Throughput

– 512 bytes & 48 clients: 135.22 Kops/sec

– 82% and 64% improvements compared with the peak performance on 10 GigE and

IPoIB (32Gbps), respectively 0 20 40 60 80 100 120 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 Late n cy ( u s)

Payload Size (Byte)

10GigE IPoIB(QDR) OSU-IB(QDR) 0 20 40 60 80 100 120 140 160 8 16 24 32 40 48 56 64 Th ro u gh p u t (K o p s/Sec ) Number of Clients 47

(48)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

48

(49)

• For 5GB,

– 53.4% improvement over IPoIB, 126.9% improvement over 10GigE with HDD used for HDFS

– 57.8% improvement over IPoIB, 138.1% improvement over 10GigE with SSD used for HDFS

• For 20GB,

– 10.7% improvement over IPoIB, 46.8% improvement over 10GigE with HDD used for HDFS

– 73.3% improvement over IPoIB, 141.3% improvement over 10GigE with SSD used for HDFS

49

Performance Benefits - TestDFSIO

Cluster with 4 HDD Nodes, TestDFSIO with total 4 maps Cluster with 4 SSD Nodes, TestDFSIO with total 4 maps

0 100 200 300 400 500 600 700 800 900 1000 5 10 15 20 Tot al T h rou gh p u t (M Bps ) Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

0 100 200 300 400 500 600 700 800 900 5 10 15 20 Tot al T h rou gh p u t (M Bps ) Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

(50)

• For 80GB Sort in a cluster size of 8,

– 40.7% improvement over IPoIB, 32.2% improvement over 10GigE with HDD used for HDFS

– 43.6% improvement over IPoIB, 42.9% improvement over 10GigE with SSD used for HDFS

50

Performance Benefits - Sort

0 200 400 600 800 1000 1200 1400 1600 1800 48 64 80 Job Ex e cu ti on T im e (se c) Sort Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

Cluster with 8 HDD Nodes, Sort with <8 maps, 8 reduces>/host Cluster with 8 SSD Nodes, Sort with <8 maps, 8 reduces>/host

0 200 400 600 800 1000 1200 1400 1600 48 64 80 Job Ex e cu ti on T im e (se c) Sort Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

(51)

• For 100GB TeraSort in a cluster size of 8,

– 45% improvement over IPoIB, 39% improvement over 10GigE with HDD used for HDFS

– 46% improvement over IPoIB, 40% improvement over 10GigE with SSD used for HDFS

51

Performance Benefits - TeraSort

Cluster with 8 HDD Nodes, TeraSort with <8 maps, 8reduces>/host Cluster with 8 SSD Nodes, TeraSort with <8 maps, 8reduces>/host

0 500 1000 1500 2000 2500 80 90 100 Job Ex ecu tion Time ( se c) Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

0 500 1000 1500 2000 2500 80 90 100 Job Ex ecu tion Time ( se c) Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

(52)

0 50 100 150 200 250 300 350 400 50 GB 100 GB 200 GB 300 GB Cluster: 16 Cluster: 32 Cluster: 48 Cluster: 64

Job Ex ecu tion Time ( se c) IPoIB (QDR) OSU-IB (QDR) 0 20 40 60 80 100 120 140 160 50 GB 100 GB 200 GB 300 GB Cluster: 16 Cluster: 32 Cluster: 48 Cluster: 64

Job Ex ecu tion Time ( se c) IPoIB (QDR) OSU-IB (QDR)

Performance Benefits – RandomWriter & Sort in SDSC-Gordon

• RandomWriter in a cluster of single SSD per node

– 16% improvement over IPoIB (QDR) for

50GB in a cluster of 16

– 20% improvement over IPoIB (QDR) for

300GB in a cluster of 64

Reduced by 20% Reduced by 36%

52

HPC Advisory Council Stanford Conference, Feb '14

• Sort in a cluster of single SSD per node

– 20% improvement over IPoIB (QDR)

for 50GB in a cluster of 16

– 36% improvement over IPoIB (QDR)

(53)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

53

(54)

Memcached-RDMA Design

• Server and client perform a negotiation protocol

– Master thread assigns clients to appropriate worker thread

• Once a client is assigned a verbs worker thread, it can

communicate directly and is “bound” to that thread

• All other Memcached data structures are shared among RDMA

and Sockets worker threads

Sockets Client RDMA Client Master Thread Sockets Worker Thread Verbs Worker Thread Sockets Worker Thread Verbs Worker Thread Shared Data Memory Slabs Items 1 1 2 2 54

(55)

55

• Memcached Get latency

– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us – 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us

• Almost factor of four improvement over 10GE (TOE) for 2K bytes on the DDR cluster

Intel Clovertown Cluster (IB: DDR) Intel Westmere Cluster (IB: QDR)

0 10 20 30 40 50 60 70 80 90 100 1 2 4 8 16 32 64 128 256 512 1K 2K Ti m e ( u s) Message Size 0 10 20 30 40 50 60 70 80 90 100 1 2 4 8 16 32 64 128 256 512 1K 2K Ti m e ( u s) Message Size IPoIB (QDR) 10GigE (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR)

Memcached Get Latency (Small Message)

HPC Advisory Council Stanford Conference, Feb '14

T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% Ti m e% (u s) % Message%Size% T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% Ti m e% (u s) % Message%Size% IPoIB (DDR) 10GigE (DDR) OSU-RC-IB (DDR) OSU-UD-IB (DDR)

(56)

HPC Advisory Council Stanford Conference, Feb '14 56

Memcached Get TPS (4byte)

• Memcached Get transactions per second for 4 bytes – On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients

• Significant improvement with native IB QDR compared to SDP and IPoIB 0 200 400 600 800 1000 1200 1400 1600 1 2 4 8 16 32 64 128 256 512 800 1K Th o u san d s o f Tr an sact io n s p e r sec o n d (TPS ) No. of Clients 0 200 400 600 800 1000 1200 1400 1600 4 8 Th o u san d s o f Tr an sact io n s p e r sec o n d (TPS ) No. of Clients T im e %( u s) % Message%Size% T im e %( u s) % Message%Size% Ti m e% (u s) % Message%Size% IPoIB (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR)

(57)

1 10 100 1000 1 2 4 8 16 32 64 128 256 512 1K 2K 4K Time (us ) Message Size OSU-IB (FDR) IPoIB (FDR) 0 100 200 300 400 500 600 700 16 32 64 128 256 512 1024 2048 4080 Thou sa nds of Tr an sa ct ion s per Sec ond ( TPS ) No. of Clients 57

• Memcached Get latency

– 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us – 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us • Memcached Throughput (4bytes)

– 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s – Nearly 2X improvement in throughput

Memcached GET Latency Memcached Throughput

Memcached Performance (FDR Interconnect)

Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR) Latency Reduced

by nearly 20X

2X

(58)

HPC Advisory Council Stanford Conference, Feb '14 58

Application Level Evaluation – Olio Benchmark

• Olio Benchmark

– RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients

• 4X times better than IPoIB for 8 clients

• Hybrid design achieves comparable performance to that of pure RC design 0 500 1000 1500 2000 2500 64 128 256 512 1024 Ti m e ( m s) No. of Clients 0 10 20 30 40 50 60 1 4 8 Ti m e ( m s) No. of Clients IPoIB (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR) OSU-Hybrid-IB (QDR) Ti m e %( m s) % No.%of%Clients%

(59)

HPC Advisory Council Stanford Conference, Feb '14 59

Application Level Evaluation –

Real Application Workloads

• Real Application Workload

– RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients

• 12X times better than IPoIB for 8 clients

• Hybrid design achieves comparable performance to that of pure RC design 0 10 20 30 40 50 60 70 80 90 1 4 8 Ti m e ( m s) No. of Clients 0 50 100 150 200 250 300 350 64 128 256 512 1024 Ti m e ( m s) No. of Clients

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11

J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid’12

IPoIB (QDR) OSU-RC-IB (QDR) OSU-UD-IB (QDR) OSU-Hybrid-IB (QDR) Ti m e %( m s) % No.%of%Clients%

(60)

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • RPC • Combination of components – Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

60

(61)

Big Data Middleware

(HDFS, MapReduce, HBase and Memcached)

Networking Technologies (InfiniBand, 1/10/40GigE

and Intelligent NICs)

Storage Technologies (HDD and SSD) Programming Models

(Sockets) Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols? Communication and I/O Library

Point-to-Point Communication QoS Threaded Models and Synchronization Fault-Tolerance I/O and File Systems

Virtualization Benchmarks

RDMA Protocol

Designing Communication and I/O Libraries for Big

Data Systems: Solved a Few Initial Challenges

HPC Advisory Council Stanford Conference, Feb '14

Upper level Changes?

(62)

• Multi-threading and Synchronization

– Multi-threaded model exploration

– Fine-grained synchronization and lock-free design – Unified helper threads for different components

– Multi-endpoint design to support multi-threading communications

• QoS and Virtualization

– Network virtualization and locality-aware communication for Big Data middleware

– Hardware-level virtualization support for End-to-End QoS – I/O scheduling and storage virtualization

– Live migration

HPC Advisory Council Stanford Conference, Feb '14 62

(63)

• Support of Accelerators

– Efficient designs for Big Data middleware to take advantage of NVIDA GPGPUs and Intel MICs

– Offload computation-intensive workload to accelerators

– Explore maximum overlapping between communication and offloaded computation

• Fault Tolerance Enhancements

– Novel data replication schemes for high-efficiency fault tolerance – Exploration of light-weight fault tolerance mechanisms for Big Data

processing

• Support of Parallel File Systems

– Optimize Big Data middleware over parallel file systems (e.g. Lustre) on modern HPC clusters

HPC Advisory Council Stanford Conference, Feb '14 63

(64)

• For 500GB Sort in 64 nodes

– 44% improvement over IPoIB (FDR)

64

Performance Improvement Potential of MapReduce over

Lustre on HPC Clusters – An Initial Study

HPC Advisory Council Stanford Conference, Feb '14

Sort in TACC Stampede Sort in SDSC Gordon

• For 100GB Sort in 16 nodes

– 33% improvement over IPoIB (QDR) 0 200 400 600 800 1000 1200 300 400 500 Job Ex e cu ti on T im e (se c) Data Size (GB) IPoIB (FDR) OSU-IB (FDR) 0 50 100 150 200 250 60 80 100 Job Ex e cu ti on T im e (se c) Data Size (GB) IPoIB (QDR) OSU-IB (QDR)

(65)

• The current benchmarks provide some performance

behavior

• However, do not provide any information to the

designer/developer on:

– What is happening at the lower-layer? – Where the benefits are coming from?

– Which design is leading to benefits or bottlenecks?

– Which component in the design needs to be changed and what will be its impact?

– Can performance gain/loss at the lower-layer be correlated to the performance gain/loss observed at the upper layer?

HPC Advisory Council Stanford Conference, Feb '14 65

(66)

Big Data Middleware

(HDFS, MapReduce, HBase and Memcached)

Networking Technologies (InfiniBand, 1/10/40GigE

and Intelligent NICs)

Storage Technologies (HDD and SSD) Programming Models

(Sockets) Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols? Communication and I/O Library

Point-to-Point Communication QoS Threaded Models and Synchronization Fault-Tolerance I/O and File Systems

Virtualization Benchmarks

RDMA Protocols

Challenges in Benchmarking of RDMA-based Designs

66

Current Benchmarks

No Benchmarks

Correlation?

(67)

OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to

– Compare performance of different MPI libraries on various networks and systems

– Validate low-level functionalities

– Provide insights to the underlying MPI-level designs

• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth

• Extended later to

– MPI-2 one-sided

– Collectives

– GPU-aware data movement

– OpenSHMEM (point-to-point and collectives)

– UPC

• Has become an industry standard

• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems

• Available from http://mvapich.cse.ohio-state.edu/benchmarks

• Available in an integrated manner with MVAPICH2 stack

67

(68)

Big Data Middleware

(HDFS, MapReduce, HBase and Memcached)

Networking Technologies (InfiniBand, 1/10/40GigE

and Intelligent NICs)

Storage Technologies (HDD and SSD) Programming Models

(Sockets) Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols? Communication and I/O Library

Point-to-Point Communication QoS Threaded Models and Synchronization Fault-Tolerance I/O and File Systems

Virtualization Benchmarks

RDMA Protocols

Iterative Process – Requires Deeper Investigation and

Design for Benchmarking Next Generation Big Data

Systems and Applications

HPC Advisory Council Stanford Conference, Feb '14 68

Applications-Level Benchmarks

Micro-Benchmarks

(69)

• Upcoming RDMA for Apache Hadoop Releases will support

– HDFS Parallel Replication – Advanced MapReduce – HBase – Hadoop 2.0.x

• Memcached-RDMA

• Exploration of other components (Threading models, QoS,

Virtualization, Accelerators, etc.)

• Advanced designs with upper-level changes and optimizations

• Comprehensive set of Benchmarks

• More details at

http://hadoop-rdma.cse.ohio-state.edu

HPC Advisory Council Stanford Conference, Feb '14

Future Plans of OSU Big Data Project

(70)

• Presented an overview of Big Data, Hadoop and Memcached

• Provided an overview of Networking Technologies

• Discussed challenges in accelerating Hadoop

• Presented initial designs to take advantage of InfiniBand/RDMA

for HDFS, MapReduce, RPC, HBase and Memcached

• Results are promising

• Many other open issues need to be solved

• Will enable Big Data community to take advantage of modern

HPC technologies to carry out their analytics in a fast and scalable

manner

HPC Advisory Council Stanford Conference, Feb '14

Concluding Remarks

(71)

HPC Advisory Council Stanford Conference, Feb '14

Personnel Acknowledgments

Current Students – S. Chakraborthy (Ph.D.) – N. Islam (Ph.D.) – J. Jose (Ph.D.) – M. Li (Ph.D.) – S. Potluri (Ph.D.) – R. Rajachandrasekhar (Ph.D.) Past Students – P. Balaji (Ph.D.) – D. Buntinas (Ph.D.) – S. Bhagvat (M.S.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) 71

Past Research Scientist

– S. Sur Current Post-Docs – X. Lu – K. Hamidouche Current Programmers – M. Arnold – J. Perkins Past Post-Docs – H. Wang – X. Besseron – H.-W. Jin – M. Luo – W. Huang (Ph.D.) – W. Jiang (M.S.) – S. Kini (M.S.) – M. Koop (Ph.D.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – M. Rahman (Ph.D.) – D. Shankar (Ph.D.) – R. Shir (Ph.D.) – A. Venkatesh (Ph.D.) – J. Zhang (Ph.D.) – E. Mancini – S. Marcarelli – J. Vienne

Current Senior Research Associate

– H. Subramoni

Past Programmers

(72)

HPC Advisory Council Stanford Conference, Feb '14

Pointers

http://nowlab.cse.ohio-state.edu [email protected] http://www.cse.ohio-state.edu/~panda 72 http://mvapich.cse.ohio-state.edu http://hadoop-rdma.cse.ohio-state.edu

References

Related documents

Thus, the increased investment in information processing and computerized equipment, which have shorter lives than traditional industrial equipment, and the increase in the share

TM4 is a Canadian company of more than 140 people, of whom 14 are directly em- ployed as software engineers, the meeting the criteria of being a VSE. The company designs and

where Underpricing t /SEO Discount t is measured from the offer price to the first-day closing price; BHAR t is the buy-and-hold abnormal return based on size, book-to-market

Table 6 shows the activities in counts per minute of photoheterotrophic protein fractions eluted from a calf thymus (CT) DNA- cellulose column when incubated

The aim of this study was to examine the extent to which LINE-1 methylation in maternal peripheral blood during pregnancy, and in umbilical cord blood, is associated with length

El tamaño del hospedador, las zonas litorales someras con vegetación acuática y el impacto antrópico fueron los factores que determinaron las variaciones en las infracomunidades y

patients who sustained FTI and underwent FTR, to establish the range of movement (ROM), grip strength and hand function at six months post FTR as well as to

Artinya bahwa pada wilayah yang dipengaruhi ENSO, akurasi model ramalan produksi padi (dengan variabel prediktor SST Nino 3.4, DMI, rasio LT/LB) akan tinggi, dan sebaliknya