• No results found

HPC Hardware Overview

N/A
N/A
Protected

Academic year: 2021

Share "HPC Hardware Overview"

Copied!
59
0
0

Loading.... (view fulltext now)

Full text

(1)

HPC Hardware Overview

John Lockman III

February 7, 2012

Texas Advanced Computing Center

The University of Texas at Austin

(2)

Outline

Some general comments

Lonestar

System

– Dell blade-based system

– InfiniBand ( QDR)

– Intel Processors

Ranger

System

– Sun blade-based system

– InfiniBand (SDR)

– AMD Processors

Longhorn

System

– Dell blade-based system

– InfiniBand (QDR)

(3)

About this Talk

We will focus on TACC systems, but much of the

information applies to HPC systems in general

As an applications programmer you may not care

about hardware details, but…

We need to think about it because of performance

issues

I will try to give you pointers to the most relevant

architecture characteristics

(4)

High Performance Computing

In our context, it refers to hardware and software tools

dedicated to computationally intensive tasks

Distinction between HPC center (throughput focused)

and Data center (data focused) is becoming fuzzy

High bandwidth, low latency

Memory

(5)
(6)

Lonestar: Introduction

Lonestar Cluster

Configuration & Diagram

Server Blades

Dell PowerEdge M610 Blade (Intel Hexa-Core) Server

Nodes

Microprocessor Architecture Features

Instruction Pipeline

Speeds and Feeds

Block Diagram

Node Interconnect

Hierarchy

InfiniBand Switch and Adapters

(7)

Lonestar Cluster Overview

(8)

Lonestar Cluster Overview

Hardware Components Characteristics

Peak Performance 302 TFLOPS

Nodes 2 Hexa-Core Xeon 5680 1888 nodes / 22656 cores Memory 1333 MHz DDR3 DIMMS 24 GB/node, 45 TB total Shared Disk Lustre parallel file system 1 PB

Local Disk SATA 146GB/node, 276 TB total

(9)

Blade : Rack : System

1 node : 2 x 6 cores = 12 cores

1 chassis : 16 nodes = 192 cores

1 rack

: 3 x 16 nodes = 576 cores

39⅓ racks : 22656 cores

16 blades

(10)

Lonestar login nodes

Dell PowerEdge M610

Intel dual socket Xeon hexa-core 3.33GHz

24 GB DDR3 1333 MHz DIMMS

Intel QPI 5520 Chipset

Dell 1200 PowerVault

15 TB HOME disk

(11)

Lonestar compute nodes

16 Blades / 10U chassis

Dell PowerEdge M619

Dual socket Intel hexa-core Xeon

3.33 GHz(1.25x)

13.3 GFLOPS/core(1.25x)

64 KB L1 cache (independent)

12 MB L2 cache (unified)

24 GB DDR3 1333 MHz DIMMS

Intel QPI 5520 Chipset

2x QPI 6.4 GT/s

(12)

Motherboard

IOH

I/O Hub PCIe

ICH I/O Controller Hexa-core Xeon Hexa-core Xeon DDR3 1333 12 CPU Cores QPI DMI DDR3 1333

(13)

Intel Xeon 5600(Westmere)

32 KB L1 cache/core

256 KB L2 cache/core

Shared 12 MB L3 cache

(14)

Cluster Interconnect

(15)

Lonestar Parallel File Systems: Lustre &NFS

$HOME $WORK $SCRATCH 97 TB 1GB/ user 226 TB 200GB/ user 806 TB 1 2 3 Uses IP over IB InfiniBand

Ethernet

(16)
(17)

Ranger: Introduction

• Unique instrument for

computational scientific research

• Housed at TACC’s new machine room

• Over 2 ½ years of initial planning and deployment efforts

• Funded by the National Science Foundation as part of a unique program to reinvigorate High Performance Computing in the United States (Office of

(18)

How Much Did it Cost and Who’s Involved?

• TACC selected for very first NSF ‘Track2’ HPC system

– $30M system acquisition

– Sun Microsystems (now Oracle) was the vendor

– Very Large InfiniBand Installation

• ~4100 endpoint hosts

• >1350 MT47396 switches

• TACC, ICES, Cornell Theory Center, Arizona State HPCI are teamed to operate/support the system four 4 years ($29M)

(19)

Ranger: Performance

Ranger debuted at #4 on

the Top 500 list (ranked #25

as of November 2011)

(20)

Ranger Cluster Overview

Hardware Components Characteristics

Peak Performance 579 TFLOPS

Nodes 4 Quad-Core Opteron 3,936 nodes / 62,976 cores Memory 667MHz DDR2 DIMMS

1GHz HyperTransport

32 GB/node, 123 TB total

Shared Disk Lustre parallel file system 1.7 PB Local Disk NONE (flash-based) 8 Gb

Interconnect Infiniband (generation 2) 1 GB/sec P-2-P 2.3 μs latency

(21)

Ranger Hardware Summary

Compute power - 579 Teraflops

– 3,936 Sun four-socket blades

– 15,744 AMD “Barcelona” processors

• Quad-core, four flops/cycle (dual pipelines)

Memory - 123 Terabytes

– 2 GB/core, 32 GB/node

– ~20 GB/sec memory B/W per node (667 MHz DDR2)

Disk subsystem - 1.7 Petabytes

– 72 Sun x4500 “Thumper” I/O servers, 24TB each

– 40 GB/sec total aggregate I/O bandwidth

– 1 PB raw capacity in largest filesystem

Interconnect - 10 Gbps /1.6 – 2.9 sec latency

– Sun InfiniBand-based switches (2), up to 3456 4x ports each

– Full non-blocking 7-stage Clos fabric

(22)

Ranger Hardware Summary (cont.)

25 Management servers - Sun 4-socket x4600s

– 4 Login servers, quad-core processors

– 1 Rocks master, contains software stack for nodes

– 2 SGE servers, primary batch server and backup

– 2 Sun Connection Management servers, monitors hardware

– 2 InfiniBand Subnet Managers, primary and backup

– 6 Lustre Meta-Data Servers, enabled with failover

– 4 Archive data-movers, move data to tape library

– 4 GridFTP servers, external multi-stream transfer

Ethernet Networking - 10Gbps Connectivity

– Two external 10GigE networks: TeraGrid, NLR

– 10GigE fabric for login, data-mover and GridFTP nodes, integrated into existing TACC network infrastructure

(23)

Infiniband Cabling in Ranger

Sun switch design with reduced cable count,

manageable, but still a challenge to cable

1312 InfiniBand

12x to 12x cables

78 InfiniBand 12x to three 4x splitter cables

Cable lengths range from 7-16m, average 11m

(24)

Space, Power and Cooling

System Power:

3.0 MW total

System:

2.4 MW

– ~90 racks, in 6 row arrangement

– ~100 in-row cooling units

– ~4000 ft2 total footprint

Cooling:

~0.6 MW

– In-row units fed by three 400-ton chillers

– Enclosed hot-aisles

– Supplemental 280-tons of cooling from CRAC units

Observations:

– Space less an issue than power

– Cooling > 25kW per rack difficult

(25)
(26)
(27)
(28)

Ranger Features

AMD Processors:

– HPC Features  4 FLOPS/CP

– 4 Sockets on a board

– 4 Cores per socket

– HyperTransport (Direct Connect between sockets)

– 2.3 GHz core

– Any idea what the peak floating-point performance of a node is?

• 2.3 GHz * 4 Flops/CP * 16 cores = 147.2 GFlops Peak Performance

– Any idea how much an application can sustain?

• Can sustain over 80% of peak with DGEMM (matrix-matrix multiply)

NUMA Node Architecture (16 cores per node,

think hybrid

)

2-tier InfiniBand (NEM – “Magnum”) Switch System

(29)

GigE InfiniBand

Ranger Architecture

Magnum InfiniBand Switches

I/O Nodes WORK File System

internet Login Nodes 1 72 1 82 X4600 X4600 3,456 IB ports, each 12x Line splits into 3 4x lines. Bisection BW = 110Tbps. “C48” Blades Compute Nodes 4 sockets X 4 cores 24 TB each Thumper X4500 Metadata Server

X4600 1 per File Sys.

(30)

Ranger Infiniband Topology

…78… “Magnum” Switch 12x InfiniBand 3 cables combined NEM NEM NEM NEM NEM NEM NEM NEM NEM NEM NEM NEM NEM NEM NEM NEM

(31)

MPI Tests: P2P Bandwidth

0 200 400 600 800 1000 1200 1 10 100 1000 10000 100000 1000000 1000000 0 1E+08

Message Size (Bytes)

B a n d w id th ( M B /s e c )

Ranger - OFED 1.2 - MVAPICH 0.9.9 Lonestar - OFED 1.1 MVAPICH 0.9.8

Effictive Bandwith is improved at smaller message size

Point-to-Point MPI

Measured Performance

Shelf Latencies: ~1.6 µs

Rack Latencies: ~2.0 µs

(32)

Able to sustain ~73% bisection bandwidth efficiency with all 3936 nodes communicating simultaneously (82 racks)

0 200 400 600 800 1000 1200 1400 1600 1800 2000 0 20 40 60 80 100

# of Ranger Compute Racks

B is e c ti o n B W (G B /s e c ) . Ideal Measured 0.0% 20.0% 40.0% 60.0% 80.0% 100.0% 120.0% 1 2 4 8 16 32 64 82

# of Ranger Compute Racks

F u ll B is e c ti o n B W Effi c ie n c y .

(33)

Sun Motherboard for AMD Barcelona Chips

Compute Blade

4 Sockets

4 Cores

8 Memory

Slots/Socket

HyperTransport

1 GHz

(34)

Sun Motherboard for AMD Barcelona Chips

8.3 GB/ s 8.3 GB/ s 8.3 GB /s 8.3 GB/ s Two PCIe x8 – 32Gbps One PCIe x4 – 16Gbps

Pa

ssive

Mi

dplane

A maximum neighbor NUMA Configuration for 3-port HyperTransport.

HyperTransport Bidirectional is 6.4GB/s, Unidirectional is 3.2GB/s. Dual Channel, 533MHz Registered, ECC Memory

(35)

Intel/AMD Dual- to Quad-core Evolution

MCH MCH

AMD Barcelona/Phenom Quad-Core AMD Opteron Dual-Core

Intel Clovertown Intel Woodcrest

(36)

Caches in Quad Core CPUs

Memory Controller Core 1 L1 Core 2 L1 L2 Core 4 L1 Core 3 L1 L2 Memory Controller L2 Core 1 L1 L2 Core 2 L1 Core 3 L1 L2 L2 Core 4 L1 L3 Intel Quad-Core

L2 caches are not independent

AMD Quad-Core

(37)

HT link bandwidth is 24 GB/s (8 GB/s per link)

Memory controller bandwidth up to 10.7 GB/s (667 MHz)

Cache sizes in AMD Barcelona

L2 512 KB Core 1 L1 64 KB Core 2 L1 64 KB Core 3 L1 64 KB Core 4 L1 64 KB L3 2 MB L2 512 KB L2 512 KB L2 512 KB System Request Interface

Crossbar Switch

(38)

Other Important Features

AMD Quad-core (K10, code name Barcelona)

Instruction fetch bandwidth now 32 bytes/cycle

2MB L3 cache on-die; 4 x 512KB L2 caches; 64KB

L1 Instruction & Data caches.

SSE units are now 128-bit wide -> single-cycle

throughput; improved ALU and FPU throughput

Larger branch prediction tables, higher accuracies

Dedicated stack engine to pull stack-related ESP

(39)
(40)

Speeds and Feeds

On Die Registers L1 Data L2 External Memory 3 CP ~15 CP ~300 CP Latency Load Speed Store Speed 4 W/CP 2 W/CP 2 W/CP 1 W/CP 0.5 W/CP 0.5 W/CP

Cache line size (L1/L2) is 8W 4 FLOPS / CP 64 KB 512 KB L3 2 MB ~25 CP 2x @ 533MHz DDR2 DIMMS W : FP Word (64 bit) CP : Clock Period

Cache States: MOESI (Modified, Owner, Exclusive, Shared, Invalid) MOESI is beneficial when latency/bandwidth between cpus is significantly better than main memory

(41)

Ranger Disk Subsystem - Lustre

Disk system (OSS) is based on Sun x4500 “Thumper”

– Each server has 48 SATA II 500 GB drives (24TB total) - running internal software RAID

– Dual Socket/Dual-Core Opterons @ 2.6 GHz

– 72 Servers Total: 1.7 PB raw storage (that’s 288 cores just to drive the file systems)

Metadata Servers (MDS) based on SunFire x4600s

MDS is Fibre-channel connected to

9TB Flexline Storage

Target Performance

(42)

Ranger Parallel File Systems: Lustre

$HOME $WORK $SCRATCH 96TB 6GB/ user 193TB 200GB/ user 773TB 36 OSTs 72 OSTs 300 OSTs 6 Thumpers 12 Thumpers 50 Thumpers …81 12 blades/Chassis blade 0 1 2 3 Uses IP over IB InfiniBand Switch

(43)

I/O with Lustre over Native InfiniBand

• Max. total aggregate performance of 38 or 46 GB/sec depending on stripecount (Design

Target = 32GB/sec)

• External users have reported performance of ~35 GB/sec with a 4K application run

$SCRATCH File System Throughput

0.00 10.00 20.00 30.00 40.00 50.00 60.00 1 10 100 1000 10000 # of Writing Clients W ri te S p e e d ( G B /s e c ) . Stripecount=1 Stripecount=4

$SCRATCH Single Application Performance

0.00 10.00 20.00 30.00 40.00 50.00 60.00 1 10 100 1000 10000 # of Writing Clients W ri te Sp e e d (G B /s e c ) . Stripecount=1 Stripecount=4

(44)
(45)

Longhorn: Introduction

First NSF eXtreme Digital Visualization grant

(XD Vis)

Designed for scientific visualization and data

analysis

Very large memory per computational core

Two NVIDIA Graphics cards per node

Rendering performance 154 billion

(46)

Longhorn Cluster Overview

Hardware Components Characteristics

Peak Performance (CPUs) 512 Intel Xeon E5400 20.7 TFLOPS Peak Performance (GPUs, SP) 128 NVIDIA

Quadroplex 2200 S4 500 TFLOPS

System Memory DDR3-DIMMS

48 GB/node (240 nodes) 144 GB/node (16 nodes) 13.5 TB total

Graphics Memory 512 FX5800 x 4GB 2 TB

Disk Lustre parallel file

system 210 TB

(47)

Longhorn “fat” nodes

Dell R710

Intel dual socket quad-core Xeon E5400 @ 2.53GHz

144 GB DDR3 ( 18 GB/core )

Intel 5520 chipset

NVIDIA Quadroplex 2200 S4

4 NVIDIA Quadro FX5800

240 CUDA cores

4 GB Memory

102 GB/s Memory bandwidth

(48)

Longhorn standard nodes

Dell R610

Dual socket Intel quad-core Xeon E5400 @ 2.53 GHz

48 GB DDR3 ( 6 GB/core )

Intel 5520 chipset

NVIDIA Quadroplex 2200 S4

4 NVIDIA Quadro FX5800

240 CUDA cores

4 GB Memory

102 GB/s Memory bandwidth

(49)

Motherboard (R610/R710)

Intel 5520 Quad Core Intel Xeon 5500 DDR 3 DDR 3 DDR 3 DDR 3 DDR 3 DDR 3 DDR 3 DDR 3 DDR 3 Quad Core Intel Xeon 5500 DD R 3 DD R 3 DD R 3 DDR 3 DDR 3 DDR 3 DDR 3 DDR 3 DDR 3

42 lanes PCI Express or 36 lines PCI Express 2.0 12 DIMM slots (R610)

(50)

Cache Sizes in Intel Nehalem

L2 256 KB Core 1 L1 32 KB Core 2 L1 32 KB Core 3 L1 32 KB Core 4 L1 32 KB L3 8 MB L2 256 KB L2 256 KB L2 256 KB Memory Controller QPI 0 QPI 1 Shared L3 Cache Core 1 Core 2 Core 3 Core 4 Memory Controller QPI 0 QPI 1

• Total QPI bandwidth up to 25.6 GB/s (@ 3.2 GHz) • 2 20-lane QPI links

(51)
(52)

Storage

Ranch

– Long term tape storage

(53)

Ranch

Archival System

•Sun StorageTek Silo

•10,000 - T1000B tapes

•6,000 - T1000C tapes

•~40 PB total capacity

•Used for long-term

storage

(54)

Corral

• 6PB DataDirect Networks online disk storage

• 8 Dell 1950 servers

• 8 Dell 2950 servers

• 12 Dell R710 servers

• High Performance Parallel File System

– Multiple databases

– iRODS data management

– Replication to tape archive

• Multiple levels of access control

• Web and other data access available globally

(55)

Coming Soon (2013)

Stampede

10 PFLOP peak performance

272 TB total memory

14 PB of disk storage

Intel® Xeon® Processor E5 Family (Sandy Bridge)

Also includes a special pre-release shipment of the

Intel® Many Integrated Core (MIC) co-processor

(56)
(57)

Lonestar Related References

User Guide

services.tacc.utexas.edu/index.php/lonestar-user-guide

Developers

www.tomshardware.com www.intel.com/p/en_US/products/server/processor/xeon5000/technical-documents

(58)

Ranger Related References

User

Guide

services.tacc.utexas.edu/index.php/ranger-user-guide

Forums

forums.amd.com/devforum

www.pgroup.com/userforum/index.php

Developers

developer.amd.com/home.jsp

developer.amd.com/rec_reading.jsp

(59)

Longhorn Related References

User

Guide

services.tacc.utexas.edu/index.php/longhorn-user-guide

General Information

www.intel.com/technology/architecture-silicon/next-gen/

www.nvidia.com/object/cuda_what_is.html

Developers

developer.intel.com/design/pentium4/manuals/index2.htm

developer.nvidia.com/forums/index.php

References

Related documents