
10GbE vs Infiniband 4x Performance tests



Deep Computing & Network Transformation Center

ATS-PSSC

Benchmark Report

Last update: 24-Jul-07

Authors Role Comments

Erwan Auffret IBM IT Specialist – Network Transformation Center

François Corradino IBM IT Specialist – Deep Computing Benchmark center

Ludovic Enault IBM IT Specialist – Deep Computing Benchmark center

Charles Ferland BladeNetwork Director of Sales, EMEA



Table of Contents

1. Introduction
2. Objectives
3. Benchmark Infrastructure
3.1 Hardware List
4. TCP/IP NetPerf testing
4.1 NetPerf 2.4.3
4.2 Results
5. HPC testing
5.1 Intel MPI Benchmark
5.1.1 Results
5.2 High Performance Computing Challenge
5.2.1 HPL
5.2.2 PTRANS
5.2.3 Latency comparison
5.2.4 Network bandwidth comparison
5.3 VASP (Vienna Ab-initio Simulation Package)
5.3.1 TEST 1
5.3.2 TEST 2
6. Conclusions


1. Introduction

Following the announcement of the new BladeNetwork 10GbE switch for IBM BladeCenter H, it was decided to test 10GbE adapters in High Performance Computing (HPC) and Next Generation Networks (NGN) environments.

The adapters compared were a NetXen Ethernet adapter, a Topspin IB adapter and a low-latency Ethernet adapter for blades manufactured by Myricom.

A set of standard HPC and TCP benchmarks was run on the adapters. A real HPC application was tested as well.


2. Objectives

The objective of the tests is to check the behavior of several network adapters with the new high-performance Nortel switch for BladeCenter H and HT.

A first set of tests was performed to compare Infiniband 4x and 10Gb Ethernet.

A second, standalone test was performed to obtain baseline TCP/IP performance metrics.


3. Benchmark Infrastructure

3.1 Hardware List

2* IBM BladeCenter® HS21 XM (7995-E2X)

• 2* INTEL Xeon E5345 (2.33GHz) QC (8MB L2 cache)

• 16GB (8*2GB) 667MHz FBD RAM

• 1* SFF SAS HDD

• Integrated Dual gigabit Broadcom ® 5708S Ethernet controller

Figure 1: IBM BladeCenter HS21 XM internal layout (labeled parts: HDD, DIMMs, CPUs, daughter card)

Please check the following web site for information on System x and Blades updates. http://www-03.ibm.com/servers/eserver/xseries/

Several adapters were then added and tested on the PCI-express connector for IO daughter cards. Different form factors are available; the ones used are listed below.

TopSpin Infiniband Expansion Card for IBM BladeCenter (PN: xxxxxx)


NetXen 10Gb Ethernet Expansion Card for IBM BladeCenter (PN: 39Y9271)

Figure 3: CFF-H (Combination Form Factor for Horizontal switches) daughter card for IBM BladeCenter H

Myricom Low Latency 10Gb Ethernet Expansion Card for IBM BladeCenter (NO IBM PN)

Daughter card (HSDC) for IBM BladeCenter H (10G-PCIE-8AI-C+MX1)

http://www.myri.com/Myri-10G/product_list.html

Figure 4: CFF-H (Combination Form Factor for Horizontal switches) daughter card for IBM BladeCenter H

The IBM BladeCenter IO adapters need a switch to connect to. The Ethernet adapters (NetXen and Myricom) connect to a BladeNetwork 10GbE switch module, while the IB adapters connect to a Cisco Systems 4x Infiniband switch module.

Nortel 10Gb Ethernet Switch Module for IBM BladeCenter (PN: 39Y9267)


BIOS, firmware & OS configuration

Part name: BIOS/OS/Firmware version
HS21 XM BIOS: 1.02
HS21 XM OS (1): RedHat Enterprise Linux 5 AS for x86_64, kernel 2.6.18-8.el5
HS21 XM OS (2): RedHat Enterprise Linux 4 u4 AS for x86_64, kernel 2.6.9-42
Topspin IB adapter driver/firmware: (no version listed)
Myricom driver/firmware: Myri-10G 1.3.0 (for Netperf tests); MXoE 1.2.1 rc17 (for HPC tests)


4. TCP/IP NetPerf testing

4.1 NetPerf 2.4.3

NetPerf is a benchmark that can be used to measure various aspects of networking performance. Its primary focus is on bulk data transfer and request/response performance using either TCP or UDP. Bulk data transfer performance is also referred to as “stream” or “unidirectional stream” performance. Basically, these tests measure how fast one system can send data to another and/or how fast that other system can receive it. The TCP_STREAM test is the default test and is the one whose stream performance is closest to an IPTV workload.

NGN applications, and IPTV solutions in particular, are typical streaming applications. However, a NetPerf test does not simulate an IPTV workload. This type of application requires the data transfer to be very regular (with absolutely no packet loss), at a constant bit rate (which depends on the quality of the streamed content) and with varied content read access (different files). Moreover, other functionalities such as time shifting (fast forward/rewind, pause) must be managed by the IPTV application and can generate network workloads that are not comparable with NetPerf results.

A series of tests was performed on the two 10Gb Ethernet adapters (NetXen and Myricom) between the two blades. The tests on the NetXen adapters were performed with the default MTU of 1500 and then with an MTU of 8000 (on both servers). On the Myricom adapter, two different drivers were used: one tuned for HPC applications, which require better response time; the other tuned to deliver better bandwidth, which is the main need for IPTV applications. All NetPerf tests were performed on RedHat Enterprise Linux 5.
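The report does not list the exact command lines used, so the following is only a minimal Python sketch of how such a run could be assembled. It uses netperf's global options `-H` (remote host), `-t` (test name), `-l` (test length in seconds) and `-c`/`-C` (local and remote CPU utilization), plus the iproute2 `ip link set ... mtu` command. The interface name `eth2`, the address `192.168.10.2` and the 60-second duration are hypothetical placeholders, not values from the tests.

```python
def netperf_stream_cmd(host, seconds=60):
    """Build a netperf TCP_STREAM command line as an argument list."""
    return [
        "netperf",
        "-H", host,           # remote system running netserver
        "-t", "TCP_STREAM",   # bulk unidirectional transfer test
        "-l", str(seconds),   # test length in seconds (assumed value)
        "-c", "-C",           # report local and remote CPU utilization
    ]


def set_mtu_cmd(interface, mtu):
    """Build the iproute2 command that sets the MTU of an interface."""
    return ["ip", "link", "set", "dev", interface, "mtu", str(mtu)]


if __name__ == "__main__":
    # Hypothetical interface name and peer address, for illustration only.
    print(" ".join(set_mtu_cmd("eth2", 8000)))
    print(" ".join(netperf_stream_cmd("192.168.10.2")))
```

The lists can be passed to `subprocess.run` on a machine where netperf is installed; the MTU command would have to be applied on both servers, as described above.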


4.2 Results

The TCP_STREAM tests report bandwidth as well as CPU usage. The highest bandwidth was reached with the “NGN” driver on the Myricom adapter: almost 70% of the theoretical bandwidth. For comparison, the same test was performed on the integrated Broadcom Gigabit Ethernet adapter and reached 94% of the 1Gb bandwidth.
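To put the two percentages side by side, the short Python sketch below converts them into throughput and into a number of "equivalent" Gigabit Ethernet links. It uses only the two figures quoted in the text and treats "almost 70%" as exactly 70%, which is an assumption and therefore an upper bound.

```python
def achieved_mbps(line_rate_mbps, fraction):
    """Throughput obtained when a link delivers a fraction of its line rate."""
    return line_rate_mbps * fraction


# Myricom adapter with the "NGN" driver: almost 70% of 10 Gb/s (70% assumed).
ten_gbe = achieved_mbps(10_000, 0.70)
# Integrated Broadcom Gigabit Ethernet adapter: 94% of 1 Gb/s.
one_gbe = achieved_mbps(1_000, 0.94)

# How many measured 1GbE links one measured 10GbE link is worth.
equivalent_links = ten_gbe / one_gbe

if __name__ == "__main__":
    print(f"10GbE: {ten_gbe:.0f} Mbps, 1GbE: {one_gbe:.0f} Mbps")
    print(f"equivalent 1GbE links: {equivalent_links:.1f}")
```

Under these assumptions one 10GbE link is worth roughly seven measured 1GbE links rather than ten.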

[Chart: NetPerf tests (TCP_STREAM), bandwidth. Throughput in Mbps (axis 0 to 7000) for NetXen with MTU = 1500 and MTU = 8000, and for Myricom with the “HPC” driver (better latency) and the “NGN” driver (better bandwidth).]

Table 1: NetPerf TCP_STREAM Bandwidth results

CPU usage is very important for server tuning and for application providers. Stressing the network can generate a lot of CPU utilization, and those cycles can no longer be allocated to application processing. TOE (TCP Offload Engine) can be used on adapters that support it: some basic, typical network processing is then handled by the adapter itself instead of the CPUs, which speeds up some applications. We decided not to use TOE since it is not yet supported on the Myricom adapter; with it, CPU utilization would be lower. The average utilization measured is around 15%. With TOE enabled, we can expect 5% to 10% utilization.

[Chart: NetPerf tests (TCP_STREAM), CPU usage. Local (send) and remote (receive) CPU utilization in % (axis 0 to 25) for NetXen with MTU = 1500 and MTU = 8000, and for Myricom with the “HPC” driver (better latency) and the “NGN” driver (better bandwidth).]


5. HPC testing

For the following section, all tests were performed with the Myricom low latency 10G network adapter and the Infiniband 4x SDR network card.

5.1 Intel MPI Benchmark

This test serves as a reference, since it measures the performance of the network itself. The idea of IMB is to provide a concise set of elementary MPI benchmark kernels. The Intel (formerly Pallas) MPI Benchmark Suite was used to study communications. Point-to-point communications were studied with the PingPong and SendRecv benchmarks.

• PingPong

PingPong is the classical pattern used for measuring startup and throughput of a single message sent between two processors. The figure below shows the PingPong pattern.

Figure 6: PingPong pattern

The latency reported by the PingPong test is the time to send a message of size 0, i.e. the “time” value in the figure above.

The network bandwidth, in Mbytes/sec, is the rate obtained when 2x bytes are sent in ∆t (µsec), where x is the message size in bytes.

• SendRecv

This test is based on MPI_Sendrecv: the processes form a periodic communication chain. Each process sends to its right neighbor and receives from its left neighbor in the chain. The SendRecv pattern is shown below.

Figure 7: SendRecv pattern

The throughput is 2x divided by ∆t (µsec). Since only 2 processes are used here, the test reports the bi-directional bandwidth of the system.
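The two definitions above can be written down as a small Python sketch. It assumes the usual IMB convention that the PingPong "time" is half of the measured round trip ∆t, and it takes one megabyte as 10^6 bytes; both points are assumptions here, since the pattern figures and the exact unit convention are not reproduced in this report. The function names are illustrative only.

```python
def pingpong_latency_usec(dt_usec):
    """One-way time for a PingPong round trip that took dt_usec (assumed dt/2)."""
    return dt_usec / 2.0


def throughput_mb_per_s(x_bytes, dt_usec, bytes_per_mb=1_000_000):
    """Rate at which 2x bytes are moved in dt_usec, expressed in MB/s.

    bytes_per_mb is an assumption: pass 2**20 for binary megabytes.
    """
    bytes_per_usec = 2 * x_bytes / dt_usec
    return bytes_per_usec * 1_000_000 / bytes_per_mb


if __name__ == "__main__":
    # Hypothetical timings, not measurements from this report.
    print(pingpong_latency_usec(8.0))             # one-way latency in usec
    print(throughput_mb_per_s(1_000_000, 2000.0)) # MB/s for a 1 MB message
```

With only two processes, the same formula covers both PingPong (x bytes each way during one round trip) and SendRecv (x bytes sent and x bytes received during ∆t).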


5.1.1 Results

5.1.1.1 Latency results

[Chart: Intel MPI Benchmark latency. PingPong latency and SendRecv latency in usec for 10G and IB 4x, with the difference in %: 13 % (PingPong) and 11 % (SendRecv). Lower is better.]

Table 3: Intel MPI benchmark latency

There is a 13 % difference in PingPong latency between the 10G network adapter and the Infiniband 4x, which is very interesting for HPC purposes. Latency is an important factor since it represents the time needed to open a communication between two processes. A 13 % difference in latency may translate into a large difference in overall performance. The difference is about the same for the SendRecv latency. The following sections look at the impact of latency on overall performance.

5.1.1.2 Network bandwidth performance

[Chart: PingPong benchmark. Bandwidth in MB/s (axis 0 to 1000) versus message size in bytes (1 to 4E+06, in powers of two) for IB 4x and 10G. Higher is better.]

Table 4: PingPong bandwidth with respect to the message size

The 10G network adapter clearly offers better bandwidth (around 10 % higher), with a peak at 950 MB/s whereas the IB 4x peaks at 854.29 MB/sec. The 10G advantage is largest for message sizes between 4 KB and 32 KB, where the difference is about 28 %.
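As a quick check of the "around 10 %" figure, the Python sketch below computes the relative gain from the two peak values quoted above. Taking the IB 4x value as the reference is an assumption; the report does not say which value the percentage is relative to.

```python
def relative_gain_percent(value, reference):
    """Gain of value over reference, as a percentage of reference."""
    return (value - reference) / reference * 100.0


peak_10g = 950.0   # MB/s, peak PingPong bandwidth of the 10G adapter
peak_ib = 854.29   # MB/s, peak PingPong bandwidth of the IB 4x adapter

gain_vs_ib = relative_gain_percent(peak_10g, peak_ib)

if __name__ == "__main__":
    print(f"10G over IB 4x: {gain_vs_ib:.1f} %")
```

Relative to the IB 4x peak the gain is a little over 11 %; relative to the 10G peak it is about 10 %, so either reading is consistent with the text.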


5.2 High Performance Computing Challenge

The HPC Challenge benchmark consists of seven tests, not all of which are relevant to our work. Among the seven tests we consider only:

HPL - the Linpack TPP benchmark, which measures the floating point rate of execution for solving a linear system of equations.

PTRANS (parallel matrix transpose) - exercises the communications where pairs of processors communicate with each other simultaneously. It is a useful test of the total communications capacity of the network.

Communication bandwidth and latency - a set of tests to measure latency and bandwidth of a number of simultaneous communication patterns.

Latency/Bandwidth measures latency (time required to send an 8-byte message from one node to another) and bandwidth (message size divided by the time it takes to transmit a 2,000,000 byte message) of network communication using basic MPI routines. The measurement is done during non-simultaneous communication (ping-pong benchmark) and simultaneous communication (random and natural ring pattern), and therefore it covers two extreme levels of contention (no contention, and contention caused by each process communicating with a randomly chosen neighbour in parallel) that might occur in a real application.

For measuring latency and bandwidth of parallel communication, all processes are arranged in a ring topology and each process sends and receives a message from its left and its right neighbour in parallel. Two types of rings are reported: a naturally ordered ring (i.e., ordered by the process ranks in MPI_COMM_WORLD), and the geometric mean of ten different randomly chosen process orderings in the ring. The communication is implemented:

(a) with MPI standard non-blocking receive and send, and
(b) with two calls to MPI_Sendrecv for both directions in the ring.

With this type of parallel communication, the bandwidth per process is defined as the total amount of message data divided by the number of processes and by the maximal time needed in all processes.
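The per-process bandwidth definition can be illustrated with a few lines of Python. This is only a sketch of the formula as stated: the process count, the timings and the way the total message volume is counted (two 2,000,000-byte messages per process, one per neighbour) are hypothetical values chosen for the example, not HPCC internals or measurements from this report.

```python
def ring_bandwidth_per_process(total_bytes, num_procs, times_sec):
    """Total message data divided by the process count and the slowest time."""
    return total_bytes / (num_procs * max(times_sec))


if __name__ == "__main__":
    num_procs = 4                     # hypothetical ring of 4 processes
    message_bytes = 2_000_000         # message size used for bandwidth
    # Assumption: each process sends one message to each of its 2 neighbours.
    total_bytes = num_procs * 2 * message_bytes
    # Hypothetical per-process elapsed times in seconds.
    times = [0.0040, 0.0038, 0.0039, 0.0040]
    bw = ring_bandwidth_per_process(total_bytes, num_procs, times)
    print(f"{bw / 1e9:.2f} GB/s per process")
```

Because the slowest process sets the denominator, a single delayed rank lowers the reported bandwidth for the whole ring, which is why this metric is sensitive to contention.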


5.2.1 HPL

As mentioned above, the HPL benchmark reveals the sustainable peak performance that a system can achieve. The algorithm is a parallel LU decomposition that relies mainly on a BLAS 3 routine (DGEMM) and on a block-cyclic decomposition, with data exchanged between the processors.

From experience, the network has some impact on performance: with a low-performance network the sustainable peak will not be high (less than 60 % of the theoretical peak), because efficiency is lost while data is exchanged between the processors.

For consistency, two matrix sizes are considered: N=32000 and N=58000. The sustainable peak is shown below.

[Chart: HPL performance in GFlops/sec for Network adapter 10G and IB 4x at two matrix sizes. Plotted values: 109.31 and 109.139 at N=58000; 99.0957 and 101.168 at N=32000. Higher is better.]

Table 5: HPL performance with 2 matrix sizes

The results are very close: since both networks are fast, the performance achieved for HPL does not differ much. For the matrix size 58000, the percentage of peak performance is 73 %, which is a rather good number. Of course the influence of the network increases with the number of nodes used, and two nodes are not really enough to show it. The good news is that both networks deliver good performance on the HPL benchmark.
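The 73 % figure can be reproduced from the hardware list with a short Python sketch. The node, socket, core and clock values come from the hardware section of this report; the rate of 4 double-precision floating point operations per cycle per core for the Xeon E5345 is an assumption that the report does not state, and 109.31 GFlops/sec is one of the N=58000 values plotted in the HPL chart.

```python
NODES = 2               # two HS21 XM blades
SOCKETS_PER_NODE = 2    # 2 x Xeon E5345 per blade
CORES_PER_SOCKET = 4    # quad-core processors
CLOCK_GHZ = 2.33
FLOPS_PER_CYCLE = 4     # assumption: DP flops per cycle per core


def theoretical_peak_gflops():
    """Peak rate if every core retired FLOPS_PER_CYCLE flops each cycle."""
    cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
    return cores * CLOCK_GHZ * FLOPS_PER_CYCLE


def efficiency_percent(measured_gflops):
    """Measured HPL rate as a percentage of the theoretical peak."""
    return measured_gflops / theoretical_peak_gflops() * 100.0


if __name__ == "__main__":
    print(f"peak: {theoretical_peak_gflops():.2f} GFlops/sec")
    print(f"efficiency at N=58000: {efficiency_percent(109.31):.1f} %")
```

Under these assumptions the two-blade peak is about 149 GFlops/sec, and both N=58000 results land at roughly 73 % of it.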


5.2.2 PTRANS

As mentioned above, the PTRANS benchmark is useful to test the total communication capacity of the network. It performs a parallel matrix transpose. The graph below shows the capacity of the network for 2 matrix sizes.

[Chart: PTRANS performance in GB/sec for Network adapter 10G and IB 4x at two matrix sizes. Plotted values: 1.17491 and 1.20881 at N=58000; 1.5821 and 1.26888 at N=32000. Higher is better.]

Table 6: PTRANS performance

The PTRANS results for a matrix size of 58000 are very close (3 % difference), whereas for the smaller matrix size (N=32000) the difference is much bigger (20 %). Note that PTRANS performance depends on the matrix size, and that the matrix used for this benchmark is the same as the HPL one. So on the one hand the matrix should be as big as possible for HPL, and on the other hand it should not be too big, otherwise PTRANS performance will decrease.

5.2.3 Latency comparison

[Chart: Latency comparison in usec for Network adapter 10G and IB 4x, with the difference in %: MaxPingPongLatency 9.13%, RandomlyOrderedRingLatency 24.69%, NaturallyOrderedRingLatency 31.51%. Lower is better.]

Table 7: PingPong, randomly ordered and natural ordered latencies

The max PingPong latency is around 4 usec for both networks, meaning that both networks are good; nevertheless the 10G network adapter is 9 % better in terms of latency, in line with what the IMB benchmark showed (see above).

For the randomly ordered ring latency the difference is bigger, at 24.7 %, and it reaches 31.5 % for the naturally ordered ring latency. This means that when simultaneous communications occur, the 10G network adapter performs better. This is interesting because simultaneous communications are very common in “real” HPC applications.

5.2.4 Network bandwidth comparison

[Chart: Network bandwidth comparison in GBytes/sec for Network adapter 10G and IB 4x, with the difference in %: MinPingPongBandwidth 24.7, NaturallyOrderedRingBandwidth 43.9, RandomlyOrderedRingBandwidth 2.2. Higher is better.]

Table 8: PingPong, naturally ordered and randomly ordered ring bandwidths

The difference for the min PingPong bandwidth is around 25%. For the naturally ordered ring bandwidth case the performance differs by 44%, whereas the performance is about the same for the randomly ordered ring bandwidth.

Now that the two network adapters have been compared on “simple” kernel benchmarks, it is interesting to see how the differences affect the performance of a “real” application.


5.3 VASP (Vienna Ab-initio Simulation Package)

VASP is a package for performing ab-initio quantum-mechanical molecular dynamics (MD) using pseudopotentials and a plane wave basis set.

This application was chosen because it represents a large HPC segment, namely Life Science. Unlike the two previous tests, it is a “real application” and not a kernel benchmark. Communication is an important factor in the performance of VASP, which makes it a good candidate for our work.

Two input test cases have been selected. The first is the following:

5.3.1 TEST 1

SYSTEM = Co rods 1x1x1
Startparameter for this run:
  PREC   = High      medium, high low
  ISTART = 0         job : 0-new 1-cont 2-samecut
  ISPIN  = 2         spin polarized calculation?
Electronic Relaxation 1
  ENCUT  = 400.0 eV
  NELM   = 120; NELMIN= 2; NELMDL= -5     # of ELM steps
  EDIFF  = 0.1E-03   stopping-criterion for ELM
  ISMEAR = 1; SIGMA = 0.1
Ionic relaxation
  EDIFFG = -0.02     stopping-criterion for IOM
  NSW    = 45        number of steps for IOM
  IBRION = 2         ionic relax: 0-MD 1-quasi-New 2-CG
  ISIF   = 2         stress and relaxation
  NBANDS = 104
  MAGMOM = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Electronic relaxation 2 (details)

5.3.2 TEST 2

SYSTEM = SWCNT
ISTART = 0
ISMEAR = -5 ! Small K-Points Only
NELM = 15
EDIFF = 0
#Parallelisation switches - NPAR = no proc
LPLANE = .TRUE.
NPAR = 4
NSIM = 8
LCHARG = .FALSE.
LWAVE = .FALSE.

The following plot shows the performance on both systems:

[Chart: VASP elapsed time in sec (axis 0 to 6000) for test 1 and test 2 on Network adapter 10G and IB 4x, with the difference in %: 17 % (test 1) and 33 % (test 2). Lower is better.]

Table 9: Elapsed Time for VASP execution

Performance is clearly better with the 10G network adapter: a 33 % difference is observed for test case 2, and 17 % for test case 1. The difference in gain may be due to the test cases themselves: test case 2 stresses the network more than test case 1.


6. Conclusions

Under certain conditions (RedHat 5, right drivers…), the NetXen 10GbE adapter can be a good alternative to ten 1GbE adapters for the sake of simplicity. In terms of performance, it is not an attractive solution yet: a single 10GbE adapter performs like five or six 1GbE adapters, not like ten. The Myricom adapter seems to perform much better but is still less efficient (in terms of bandwidth) than ten 1GbE adapters. Blade servers would definitely benefit from that solution.

From the HPC point of view, the study has been really interesting since it shows that the Myricom low latency 10G network adapter gives better performance than an IB 4x card.

On the IMB kernel benchmark, a difference of about 10 % in network latency and network bandwidth was observed, whereas with HPCC the latency difference is larger (around 25 %).

The most interesting result is the one obtained on VASP. VASP is a real application and is representative of a large number of life science codes in terms of communication requirements. For the two test cases, the 10G network adapter outperforms the IB adapter by 17 % and 33 % respectively. Of course further testing has to be done, but this is really promising. The next step will be to test the 10G adapter with other real applications from other HPC sectors.


7. Contacts:

IBM Products and Solutions Support Center (Montpellier)

Erwan Auffret

IBM Sales & Distribution

IT Specialist - Network Transformation Center
Phone: +33 4 6734 6077

E-mail: [email protected]

François-Romain Corradino

IBM Sales & Distribution
IT Specialist - Deep Computing
Phone: +33 4 6734 4836

E-mail: [email protected]

Ludovic Enault

IBM Sales & Distribution
IT Specialist - Deep Computing
Phone: +33 4 6734 4706

E-mail: [email protected]

http://www-03.ibm.com/servers/eserver/xseries/ http://www.myri.com/Myri-10G/product_list.html
