• No results found

RDMA over Ethernet - A Preliminary Study

N/A
N/A
Protected

Academic year: 2021

Share "RDMA over Ethernet - A Preliminary Study"

Copied!
24
0
0

Loading.... (view fulltext now)

Full text

(1)

RDMA over Ethernet - A Preliminary

Study

Hari Subramoni, Miao Luo, Ping Lai and

Dhabaleswar. K. Panda

Computer Science & Engineering Department

The Ohio State University

(2)

Outline

• Introduction

• Problem Statement

• Approach

• Performance Evaluation and Results

• Conclusions and Future Work

(3)

Introduction

• Ethernet and InfiniBand accounts for majority of

interconnects in high performance distributed computing

• End users want InfiniBand like latencies with existing

Ethernet infrastructure

• Can be achieved if networks converge

• Existing options have overhead or tradeoffs in terms of

performance

• No solution exists that efficiently combines the

ubiquitous nature of Ethernet and the high performance

offered by InfiniBand

• RDMA over Ethernet (RDMAoE) seems to provide a

good option as of date

(4)

RDMAoE

• Allows running the IB transport protocol using Ethernet

frames

• RDMAoE packets are standard Ethernet frames with an

IEEE assigned Ethertype, a GRH, unmodified IB

transport headers and payload

• InfiniBand HCA takes care of translating InfiniBand

addresses to Ethernet addresses and back

• Encodes IP addresses into its GIDs and resolves MAC

addresses using the host IP stack

• Use GID’s for establishing connections instead of LID’s

• No SM/SA, Ethernet management practices are used

(5)

InfiniBand Architecture & Adapters

• An industry standard for low latency, high bandwidth, System Area

Networks

• Multiple features

– Two communication types

• Channel Semantics

• Memory Semantics (RDMA mechanism)

– Multiple virtual lanes

– Quality of Service (QoS) support

• Double Data Rate (DDR) with 20 Gbps bandwidth has been there

• Quad Data Rate (QDR) with 40 Gbps bandwidth is available

recently

• Multiple generations of InfiniBand adapters are available now

• The latest ConnectX DDR adapters provide support for both IB as

well as RDMAoE modes

(6)

Modes of Communication using

ConnectX DDR Adapter

Applications

Open Fabrics TCP / IP

ConnectX ConnectX ConnectX ConnectX

IPoIB Verbs (with RDMA) Verbs (with RDMA) Ethernet Switch IB Switch Ethernet Switch IB Switch Application Protocol Adapter Switch Network Protocol

(7)

Outline

• Introduction

• Problem Statement

• Approach

• Performance Evaluation and Results

• Conclusions and Future Work

(8)

Problem Statement

• How do the different communication

protocols stack up against each other as

far

– Raw sockets / verbs level performance

– Performance for MPI applications

– Performance for Data center applications

• Does RDMAoE bring us a step closer to

the goal of network convergence

(9)

Outline

• Introduction

• Problem Statement

• Approach

• Performance Evaluation and Results

• Conclusions and Future Work

(10)

Approach

• Protocol level benchmarks to evaluate

very basic performance

• MPI level benchmarks to evaluate basic

MPI performance at both point to point and

collective levels

• Application level benchmarks to evaluate

performance of real world applications

• Evaluation using common data center

applications

(11)

Outline

• Introduction

• Problem Statement

• Approach

• Performance Evaluation and Results

• Conclusions and Future Work

(12)

Experimental Testbed

• Compute Platform

– Intel Nehalem

• Intel Xeon E5530 Dual quad-core processors operating at 2.40 GHz • 12GB RAM, 8MB cache

• PCIe 2.0 interface

• Host Channel Adapter

– Dual port ConnectX DDR adapter

• Configured in either RDMAoE mode or IB mode

• Network Switches

– 24 port Mellanox IB DDR switch

– 24 port Fulcrum Focalpoint 10GigE switch

• OFED version

– OFED-1.4.1 for IB and IPoIB

– Pre-release version of OFED-1.5 for RDMAoE and TCP / IP

(13)

MVAPICH / MVAPICH2 Software

• High Performance MPI Library for IB and 10GE

– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)

– Used by more than 960 organizations in 51 countries – More than 32,000 downloads from OSU site directly – Empowering many TOP500 clusters

• 8th ranked 62,976-core cluster (Ranger) at TACC

– Available with software stacks of many IB, 10GE and server vendors including Open Fabrics Enterprise Distribution (OFED)

– Also supports uDAPL device to work with any network supporting uDAPL

– http://mvapich.cse.ohio-state.edu/

(14)

List of Benchmarks

• OSU Microbenchmarks (OMB)

– Version 3.1.1

http://mvapich.cse.ohio-state.edu/benchmarks/

• Intel Collective Microbenchmarks (IMB)

– Version 3.2

http://software.intel.com/en-us/articles/intel-mpi-benchmarks/

• NAS Parallel Benchmarks (NPB)

– Version 3.3

(15)

Verbs Level Evaluation

Inter-Node Latency

0 5 10 15 20 25 30 35 40 La te nc y ( us )

Message Size (Bytes)

Native IB RDMAoE TCP/IP IPoIB

HPIDC '09 25.5 3.03 1.66 29.9 0 2000 4000 6000 8000 10000 12000 L a te n c y ( u s )

Message Size (Bytes)

Native IB RDMAoE TCP/IP IPoIB

• For small messages

• Native IB verbs offers best latency of 1.66 us

(16)

MPI Level Evaluation

Inter-Node Latency

0 10 20 30 40 50 60 70 La te nc y ( us )

Message Size (Bytes)

Native IB RDMAoE TCP/IP IPoIB

45.4 3.6 1.8 47.9 0 5000 10000 15000 20000 25000 30000 35000 L a te n c y ( u s )

Message Size (Bytes)

Native IB RDMAoE TCP/IP IPoIB

• For small messages

• Native IB verbs offers best latency of 1.8 us

(17)

Inter-Node Multipair Latency

0 10 20 30 40 50 60 70 80 L a te n c y ( u s )

Message Size (Bytes)

Native IB RDMAoE TCP/IP IPoIB

HPIDC '09 39.4 3.51 1.66 64.7 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 La te nc y ( us )

Message Size (Bytes)

Native IB RDMAoE TCP/IP IPoIB

• 4 pairs of processes communicating simultaneously • For small messages

• Native IB verbs offers best latency of 1.66 us

(18)

0 100 200 300 400 500 600 700 800 900 1000 La te nc y ( us )

Message Size (Bytes)

Native IB RDMAoE TCP/IP IPoIB

Collective Performance

Allgather Latency (32-cores)

0 500000 1000000 1500000 2000000 2500000 L a te n c y ( u s )

Message Size (Bytes)

Native IB RDMAoE TCP/IP IPoIB

280.7

22.71 12.97

330.6

• For small messages

• Native IB verbs offers best latency of 12.97 us

(19)

0 50 100 150 200 250 300 350 400 450 500 La te nc y ( us )

Message Size (Bytes)

Native IB RDMAoE TCP/IP IPoIB

Collective Performance

Allreduce Latency (32-cores)

HPIDC '09 0 100000 200000 300000 400000 500000 600000 L a te n c y ( u s )

Message Size (Bytes)

Native IB RDMAoE TCP/IP IPoIB

285.4

12.78 10.05

329.1

• For small messages

• Native IB verbs offers best latency of 10.05 us

(20)

Performance of NAS Benchmarks

0 1 2 3 4 5 6 7 8 LU CG EP FT MG No rm a li ze d T im e NAS Benchmarks

Native-IB RDMAoE TCP/IP IP o IB

• 32 process, Class C • Numbers normalized to

Native-IB

• Performance of Native

IB and RDMAoE are very close with Native IB giving the best

(21)

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 Fil e T ra ns fe r T im e ( s ) Message Size (MB)

Native IB RDMAoE TCP/IP IPoIB

Evaluation of Data Center

Applications

HPIDC '09

• We evaluate FTP, a common

data center application

• We use our own version of FTP

over native IB verbs

(FTP-ADTS [2]) to evaluate

RDMAoE and Native IB

• GridFTP [1] is used to evaluate

performance of TCP/IP and IPoIB

• RDMAoE shows performance

comparable to Native IB

[1] http://www.globus.org/grid_software/data/gridftp.php

[2] FTP Mechanisms for High Performance Data-Transfer over InfiniBand. Ping Lai, Hari Subramoni, Sundeep Narravula, Amith Mamidala, D K. Panda. ICPP ‘09.

(22)

Outline

• Introduction

• Problem Statement

• Approach

• Performance Evaluation and Results

(23)

Conclusions & Future Work

• Perform comprehensive evaluation of all possible modes of

communication (Native IB, RDMAoE, TCP/IP, IPoIB) using

– Verbs – MPI

– Application and,

– Data center level experiments

• Native IB gives the best performance followed by RDMAoE

• RDMAoE provides a high performance solution to the problem

of network convergence

• As part of future work, we plan to

– Perform large scale evaluations including studies into the effect of network contention on the performance of these protocols

– Study these protocols in a comprehensive manner for file systems

(24)

Thank you !

{subramon, laipi, luom, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory

References

Related documents

The aim of this study was to examine the extent to which LINE-1 methylation in maternal peripheral blood during pregnancy, and in umbilical cord blood, is associated with length

The Directive allows for some restrictions if they can be justified within the categories of the exceptions provided un- der chapter VI (art. 27-33), which provides for restrictions

Other alternative on resources that can be used when sharing materials in MOOC is by using work that has been offered to public via unrestricted license such as the

The purpose of this study was to identify reasons why Pakistani women chose to teach in computer-related technology areas and how they thought they were

The goal of this experiment is to determine whether a GP can successfully predict whether a student will succeed in a data structures class using the results from a

Concentrations of some biogenic and anthropogenic VOCs were measured throughout the atmospheric boundary layer and used to estimate the landscape scale fluxes

Lajunen and Lipman (2016) evaluate life cycle costs, energy consumption, and emissions of diesel, natural gas, hybrid electric, fuel cell hybrid and battery electric city

Reported values are displayed as found in the reference papers and demonstrate a difficulty of inter- comparison due to the multiple statistical ways of express- ing the results