HPC Hardware Overview
John Lockman III
February 7, 2012
Texas Advanced Computing Center
The University of Texas at Austin
Outline
•
Some general comments
•
Lonestar System
– Dell blade-based system
– InfiniBand (QDR)
– Intel Processors
•
Ranger System
– Sun blade-based system
– InfiniBand (SDR)
– AMD Processors
•
Longhorn System
– Dell blade-based system
– InfiniBand (QDR)
About this Talk
•
We will focus on TACC systems, but much of the
information applies to HPC systems in general
•
As an applications programmer you may not care
about hardware details, but…
–
We need to think about it because of performance
issues
–
I will try to give you pointers to the most relevant
architecture characteristics
High Performance Computing
•
In our context, it refers to hardware and software tools
dedicated to computationally intensive tasks
•
Distinction between HPC center (throughput focused)
and Data center (data focused) is becoming fuzzy
•
High bandwidth, low latency
–
Memory
Lonestar: Introduction
•
Lonestar Cluster
–
Configuration & Diagram
–
Server Blades
•
Dell PowerEdge M610 Blade (Intel Hexa-Core) Server
Nodes
•
Microprocessor Architecture Features
–
Instruction Pipeline
–
Speeds and Feeds
–
Block Diagram
•
Node Interconnect
–
Hierarchy
–
InfiniBand Switch and Adapters
Lonestar Cluster Overview
Hardware | Components | Characteristics
Peak Performance | | 302 TFLOPS
Nodes | 2 Hexa-Core Xeon 5680 | 1888 nodes / 22656 cores
Memory | 1333 MHz DDR3 DIMMS | 24 GB/node, 45 TB total
Shared Disk | Lustre parallel file system | 1 PB
Local Disk | SATA | 146 GB/node, 276 TB total
Blade : Rack : System
•
1 node : 2 x 6 cores = 12 cores
•
1 chassis : 16 nodes = 192 cores
•
1 rack : 3 x 16 nodes = 576 cores
•
39⅓ racks : 22656 cores
16 blades
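The packaging arithmetic on this slide can be checked with a short sketch. It uses only the counts given on the slides (2 x 6 cores per node, 16 nodes per chassis, 3 chassis per rack, 1888 nodes); the variable names are illustrative.

```python
# Lonestar packaging hierarchy: node -> chassis -> rack -> system.
CORES_PER_NODE = 2 * 6      # two hexa-core sockets per blade
NODES_PER_CHASSIS = 16      # 16 blades per chassis
CHASSIS_PER_RACK = 3
TOTAL_NODES = 1888          # from the cluster overview table

cores_per_chassis = CORES_PER_NODE * NODES_PER_CHASSIS
cores_per_rack = cores_per_chassis * CHASSIS_PER_RACK
total_cores = CORES_PER_NODE * TOTAL_NODES
racks = total_cores / cores_per_rack   # fractional: the last rack is partly filled

print(cores_per_chassis, cores_per_rack, total_cores, round(racks, 2))
```

The fractional result matches the 39⅓ racks quoted above.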
Lonestar login nodes
•
Dell PowerEdge M610
–
Intel dual socket Xeon hexa-core 3.33GHz
–
24 GB DDR3 1333 MHz DIMMS
–
Intel QPI 5520 Chipset
•
Dell 1200 PowerVault
–
15 TB HOME disk
Lonestar compute nodes
•
16 Blades / 10U chassis
•
Dell PowerEdge M610
–
Dual socket Intel hexa-core Xeon
•
3.33 GHz(1.25x)
•
13.3 GFLOPS/core(1.25x)
•
64 KB L1 cache (independent)
•
12 MB L3 cache (shared)
–
24 GB DDR3 1333 MHz DIMMS
–
Intel QPI 5520 Chipset
–
2x QPI 6.4 GT/s
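As a rough check, the per-core and system peak figures follow from the clock rate. The sketch below assumes 4 floating-point operations per clock period per core; that value is not stated on this slide but is implied by 13.3 GFLOPS/core at 3.33 GHz.

```python
# Peak floating-point rate for Lonestar, from the slide figures.
CLOCK_GHZ = 3.33
FLOPS_PER_CYCLE = 4     # assumption: implied by 13.3 GFLOPS/core at 3.33 GHz
TOTAL_CORES = 22656     # from the cluster overview table

gflops_per_core = CLOCK_GHZ * FLOPS_PER_CYCLE
peak_tflops = gflops_per_core * TOTAL_CORES / 1000

print(f"{gflops_per_core:.2f} GFLOPS/core, {peak_tflops:.1f} TFLOPS peak")
```

The result rounds to the 302 TFLOPS listed in the overview.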
Motherboard
[Block diagram labels: two Hexa-core Xeon sockets (12 CPU Cores), each with DDR3 1333 memory; QPI; IOH (I/O Hub) with PCIe; DMI; ICH (I/O Controller)]
Intel Xeon 5600(Westmere)
•
32 KB L1 cache/core
•
256 KB L2 cache/core
•
Shared 12 MB L3 cache
Cluster Interconnect
Lonestar Parallel File Systems: Lustre & NFS
File system | Capacity | Quota
$HOME | 97 TB | 1 GB/user
$WORK | 226 TB | 200 GB/user
$SCRATCH | 806 TB |
[Diagram labels: nodes 1, 2, 3; InfiniBand; Ethernet; Uses IP over IB]
Ranger: Introduction
• Unique instrument for
computational scientific research
• Housed in TACC’s new machine room
• Over 2 ½ years of initial planning and deployment efforts
• Funded by the National Science Foundation as part of a unique program to reinvigorate High Performance Computing in the United States (Office of
How Much Did it Cost and Who’s Involved?
• TACC selected for very first NSF ‘Track2’ HPC system
– $30M system acquisition
– Sun Microsystems (now Oracle) was the vendor
– Very Large InfiniBand Installation
• ~4100 endpoint hosts
• >1350 MT47396 switches
• TACC, ICES, Cornell Theory Center, Arizona State HPCI are teamed to operate/support the system for 4 years ($29M)
Ranger: Performance
•
Ranger debuted at #4 on
the Top 500 list (ranked #25
as of November 2011)
Ranger Cluster Overview
Hardware | Components | Characteristics
Peak Performance | | 579 TFLOPS
Nodes | 4 Quad-Core Opteron | 3,936 nodes / 62,976 cores
Memory | 667 MHz DDR2 DIMMS, 1 GHz HyperTransport | 32 GB/node, 123 TB total
Shared Disk | Lustre parallel file system | 1.7 PB
Local Disk | NONE (flash-based) | 8 Gb
Interconnect | InfiniBand (generation 2) | 1 GB/sec P-2-P, 2.3 μs latency
Ranger Hardware Summary
• Compute power - 579 Teraflops
– 3,936 Sun four-socket blades
– 15,744 AMD “Barcelona” processors
• Quad-core, four flops/cycle (dual pipelines)
• Memory - 123 Terabytes
– 2 GB/core, 32 GB/node
– ~20 GB/sec memory B/W per node (667 MHz DDR2)
• Disk subsystem - 1.7 Petabytes
– 72 Sun x4500 “Thumper” I/O servers, 24TB each
– 40 GB/sec total aggregate I/O bandwidth
– 1 PB raw capacity in largest filesystem
• Interconnect - 10 Gbps / 1.6 – 2.9 μs latency
– Sun InfiniBand-based switches (2), up to 3456 4x ports each
– Full non-blocking 7-stage Clos fabric
Ranger Hardware Summary (cont.)
•
25 Management servers - Sun 4-socket x4600s
– 4 Login servers, quad-core processors
– 1 Rocks master, contains software stack for nodes
– 2 SGE servers, primary batch server and backup
– 2 Sun Connection Management servers, monitors hardware
– 2 InfiniBand Subnet Managers, primary and backup
– 6 Lustre Meta-Data Servers, enabled with failover
– 4 Archive data-movers, move data to tape library
– 4 GridFTP servers, external multi-stream transfer
•
Ethernet Networking - 10Gbps Connectivity
– Two external 10GigE networks: TeraGrid, NLR
– 10GigE fabric for login, data-mover and GridFTP nodes, integrated into existing TACC network infrastructure
InfiniBand Cabling in Ranger
•
Sun switch design with reduced cable count,
manageable, but still a challenge to cable
–
1312 InfiniBand 12x to 12x cables
–
78 InfiniBand 12x to three 4x splitter cables
–
Cable lengths range from 7-16m, average 11m
Space, Power and Cooling
•
System Power: 3.0 MW total
•
System: 2.4 MW
– ~90 racks, in 6 row arrangement
– ~100 in-row cooling units
– ~4000 ft² total footprint
•
Cooling: ~0.6 MW
– In-row units fed by three 400-ton chillers
– Enclosed hot-aisles
– Supplemental 280-tons of cooling from CRAC units
•
Observations:
– Space less an issue than power
– Cooling > 25kW per rack difficult
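The observation about per-rack cooling can be related to the figures above with a small sketch. It treats the approximate values on the slide (2.4 MW system power, ~0.6 MW cooling, ~90 racks) as exact; the derived quantities are estimates, not measurements.

```python
# Power budget for Ranger, using the approximate slide figures as inputs.
SYSTEM_MW = 2.4      # power drawn by the system itself
COOLING_MW = 0.6     # slide gives "~0.6 MW"
RACKS = 90           # slide gives "~90 racks"

total_mw = SYSTEM_MW + COOLING_MW
kw_per_rack = SYSTEM_MW * 1000 / RACKS
overhead_ratio = total_mw / SYSTEM_MW   # total power per unit of system power

print(f"total {total_mw:.1f} MW, {kw_per_rack:.1f} kW/rack, ratio {overhead_ratio:.2f}")
```

The average comes out just above the 25 kW per rack that the slide calls difficult to cool.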
Ranger Features
•
AMD Processors:
– HPC Features 4 FLOPS/CP
– 4 Sockets on a board
– 4 Cores per socket
– HyperTransport (Direct Connect between sockets)
– 2.3 GHz core
– Any idea what the peak floating-point performance of a node is?
• 2.3 GHz * 4 Flops/CP * 16 cores = 147.2 GFlops Peak Performance
– Any idea how much an application can sustain?
• Can sustain over 80% of peak with DGEMM (matrix-matrix multiply)
•
NUMA Node Architecture (16 cores per node, think hybrid)
•
2-tier InfiniBand (NEM – “Magnum”) Switch System
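The node peak worked out above extends directly to the whole system. This sketch uses only slide values (2.3 GHz, 4 flops per clock period, 4 sockets x 4 cores, 3,936 nodes, 80% of peak for DGEMM); the variable names are illustrative.

```python
# Ranger peak floating-point rate per node and for the full system.
CLOCK_GHZ = 2.3
FLOPS_PER_CYCLE = 4
SOCKETS_PER_NODE = 4
CORES_PER_SOCKET = 4
NODES = 3936
DGEMM_FRACTION = 0.80    # "over 80% of peak", taken here as exactly 80%

cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
node_gflops = CLOCK_GHZ * FLOPS_PER_CYCLE * cores_per_node
system_tflops = node_gflops * NODES / 1000
dgemm_gflops = DGEMM_FRACTION * node_gflops   # lower bound on sustained DGEMM rate

print(f"{node_gflops:.1f} GFLOPS/node, {system_tflops:.1f} TFLOPS system, "
      f"{dgemm_gflops:.1f} GFLOPS/node sustained")
```

The system figure agrees with the 579 TFLOPS quoted in the hardware summary.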
Ranger Architecture
[Diagram: the internet connects to the Login Nodes (X4600). “C48” Blades Compute Nodes (4 sockets x 4 cores, 1 … 82) and the I/O Nodes for the WORK File System (Thumper X4500, 1 … 72, 24 TB each; Metadata Server X4600, 1 per File Sys.) attach to the Magnum InfiniBand Switches: 3,456 IB ports, each 12x line splits into 3 4x lines; Bisection BW = 110 Tbps. Legend: GigE, InfiniBand.]
Ranger InfiniBand Topology
[Diagram: NEMs (…78…) connect to the “Magnum” Switch over 12x InfiniBand, 3 cables combined]
MPI Tests: P2P Bandwidth
[Plot: Bandwidth (MB/sec) versus Message Size (Bytes). Series: Ranger - OFED 1.2 - MVAPICH 0.9.9; Lonestar - OFED 1.1 - MVAPICH 0.9.8]
Effective bandwidth is improved at smaller message sizes
Point-to-Point MPI
Measured Performance
• Shelf Latencies: ~1.6 µs
• Rack Latencies: ~2.0 µs
• Able to sustain ~73% bisection bandwidth efficiency with all 3936 nodes communicating simultaneously (82 racks)
[Plot: Bisection BW (GB/sec), Ideal versus Measured, against # of Ranger Compute Racks]
[Plot: Full Bisection BW Efficiency (%) against # of Ranger Compute Racks (1 to 82)]
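Why effective point-to-point bandwidth depends on message size can be illustrated with a simple cost model. This is a sketch, not the benchmark behind the plots: it assumes transfer time is a fixed latency plus size divided by peak bandwidth, and takes the 2.3 μs latency and 1 GB/sec P-2-P figures from the Ranger overview table. The function name is hypothetical.

```python
# Latency-plus-bandwidth model of a point-to-point message (an assumed model).
LATENCY_S = 2.3e-6     # 2.3 microseconds, from the Ranger overview table
PEAK_BW = 1.0e9        # 1 GB/sec point-to-point, from the same table

def effective_bandwidth(nbytes):
    """Bytes per second achieved for one message of nbytes under the model."""
    return nbytes / (LATENCY_S + nbytes / PEAK_BW)

# Message size at which half of peak bandwidth is reached under this model.
half_peak_bytes = LATENCY_S * PEAK_BW

for size in (8, 1024, 65536, 1048576):
    print(f"{size:>8} bytes: {effective_bandwidth(size) / 1e6:8.1f} MB/sec")
```

Under this model small messages are dominated by latency, so lowering latency raises effective bandwidth most at small message sizes.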
Sun Motherboard for AMD Barcelona Chips
Compute Blade
– 4 Sockets
– 4 Cores
– 8 Memory Slots/Socket
– HyperTransport 1 GHz
Sun Motherboard for AMD Barcelona Chips (cont.)
[Block diagram labels: 8.3 GB/s (shown four times); Two PCIe x8 – 32 Gbps; One PCIe x4 – 16 Gbps; Passive Midplane]
A maximum neighbor NUMA Configuration for 3-port HyperTransport.
HyperTransport Bidirectional is 6.4 GB/s, Unidirectional is 3.2 GB/s.
Dual Channel, 533 MHz Registered, ECC Memory
Intel/AMD Dual- to Quad-core Evolution
[Figure labels: AMD Opteron Dual-Core, AMD Barcelona/Phenom Quad-Core; Intel Woodcrest, Intel Clovertown, each with an MCH]
Caches in Quad Core CPUs
[Figure: Intel Quad-Core. Core 1 and Core 2 (each with its own L1) share one L2; Core 3 and Core 4 share another L2; Memory Controller. L2 caches are not independent.]
[Figure: AMD Quad-Core. Cores 1 to 4 each have their own L1 and L2 and share one L3; Memory Controller.]
• HT link bandwidth is 24 GB/s (8 GB/s per link)
• Memory controller bandwidth up to 10.7 GB/s (667 MHz)
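The two bandwidth figures above can be reproduced from first principles. The sketch assumes a 64-bit (8-byte) memory channel and two channels per controller; the channel width is a standard DDR2 property rather than a slide value, and the three HyperTransport links come from the "3-port HyperTransport" note on the motherboard slide.

```python
# Per-socket memory and HyperTransport bandwidth on Barcelona.
TRANSFERS_PER_SEC = 667e6   # 667 MHz DDR2, one transfer per listed MHz
BYTES_PER_TRANSFER = 8      # assumption: 64-bit memory channel
CHANNELS = 2                # dual-channel memory controller

HT_LINKS = 3                # 3-port HyperTransport
HT_GBS_PER_LINK = 8         # from the slide

mem_bw_gbs = TRANSFERS_PER_SEC * BYTES_PER_TRANSFER * CHANNELS / 1e9
ht_total_gbs = HT_LINKS * HT_GBS_PER_LINK

print(f"memory: {mem_bw_gbs:.1f} GB/s per socket, HT: {ht_total_gbs} GB/s")
```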
Cache sizes in AMD Barcelona
– L1: 64 KB per core (Cores 1 to 4)
– L2: 512 KB per core
– L3: 2 MB
– System Request Interface
– Crossbar Switch
Other Important Features
•
AMD Quad-core (K10, code name Barcelona)
•
Instruction fetch bandwidth now 32 bytes/cycle
•
2MB L3 cache on-die; 4 x 512KB L2 caches; 64KB
L1 Instruction & Data caches.
•
SSE units are now 128-bit wide -> single-cycle
throughput; improved ALU and FPU throughput
•
Larger branch prediction tables, higher accuracies
•
Dedicated stack engine to pull stack-related ESP
Speeds and Feeds
[Figure: memory hierarchy. On Die: Registers, L1 Data (64 KB), L2 (512 KB), L3 (2 MB). External: Memory (2x @ 533 MHz DDR2 DIMMS).]
Latency: L1 3 CP; L2 ~15 CP; L3 ~25 CP; Memory ~300 CP
Load Speed / Store Speed values shown: 4 W/CP, 2 W/CP, 2 W/CP, 1 W/CP, 0.5 W/CP, 0.5 W/CP
Cache line size (L1/L2) is 8 W
4 FLOPS / CP
W : FP Word (64 bit)
CP : Clock Period
Cache States: MOESI (Modified, Owned, Exclusive, Shared, Invalid)
MOESI is beneficial when latency/bandwidth between CPUs is significantly better than main memory
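To make the clock-period figures concrete, they can be converted to nanoseconds at Ranger's 2.3 GHz core clock. The sketch reads the figure labels as L1 = 3 CP, L2 = ~15 CP, L3 = ~25 CP and memory = ~300 CP; that mapping is an assumption about the original diagram, and the approximate values are treated as exact.

```python
# Convert latencies in clock periods (CP) to nanoseconds at 2.3 GHz.
CLOCK_GHZ = 2.3
LATENCY_CP = {"L1": 3, "L2": 15, "L3": 25, "memory": 300}  # assumed mapping

WORD_BYTES = 8           # W: 64-bit floating-point word
CACHE_LINE_WORDS = 8     # cache line size (L1/L2) is 8 W

def cycles_to_ns(cycles):
    """One clock period lasts 1 / CLOCK_GHZ nanoseconds."""
    return cycles / CLOCK_GHZ

cache_line_bytes = CACHE_LINE_WORDS * WORD_BYTES

for level, cycles in LATENCY_CP.items():
    print(f"{level:>6}: {cycles:3d} CP = {cycles_to_ns(cycles):6.1f} ns")
print(f"cache line: {cache_line_bytes} bytes")
```

A miss to main memory costs roughly one hundred times an L1 hit, which is why the cache hierarchy matters for application performance.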
Ranger Disk Subsystem - Lustre
•
Disk system (OSS) is based on Sun x4500 “Thumper”
– Each server has 48 SATA II 500 GB drives (24TB total) - running internal software RAID
– Dual Socket/Dual-Core Opterons @ 2.6 GHz
– 72 Servers Total: 1.7 PB raw storage (that’s 288 cores just to drive the file systems)
•
Metadata Servers (MDS) based on SunFire x4600s
•
MDS is Fibre-channel connected to
9TB Flexline Storage
•
Target Performance
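The storage totals quoted for the Thumper servers follow from the per-server configuration. The sketch uses only slide values (48 drives of 500 GB, 72 servers, dual-socket dual-core) and decimal units; the variable names are illustrative.

```python
# Raw capacity and core count of the Ranger Lustre object storage servers.
DRIVES_PER_SERVER = 48
DRIVE_GB = 500
SERVERS = 72
SOCKETS_PER_SERVER = 2
CORES_PER_SOCKET = 2

server_tb = DRIVES_PER_SERVER * DRIVE_GB / 1000   # decimal terabytes
total_pb = SERVERS * server_tb / 1000             # decimal petabytes
oss_cores = SERVERS * SOCKETS_PER_SERVER * CORES_PER_SOCKET

print(f"{server_tb:.0f} TB/server, {total_pb:.2f} PB raw, {oss_cores} cores")
```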
Ranger Parallel File Systems: Lustre
File system | Capacity | Quota | OSTs | Thumpers
$HOME | 96 TB | 6 GB/user | 36 | 6
$WORK | 193 TB | 200 GB/user | 72 | 12
$SCRATCH | 773 TB | | 300 | 50
[Diagram labels: …81; 12 blades/Chassis; blade 0 1 2 3; InfiniBand Switch; Uses IP over IB]
I/O with Lustre over Native InfiniBand
• Max. total aggregate performance of 38 or 46 GB/sec depending on stripecount (Design Target = 32 GB/sec)
• External users have reported performance of ~35 GB/sec with a 4K application run
$SCRATCH File System Throughput
[Plot: Write Speed (GB/sec) versus # of Writing Clients, for Stripecount=1 and Stripecount=4]
$SCRATCH Single Application Performance
[Plot: Write Speed (GB/sec) versus # of Writing Clients, for Stripecount=1 and Stripecount=4]
Longhorn: Introduction
•
First NSF eXtreme Digital Visualization grant
(XD Vis)
•
Designed for scientific visualization and data
analysis
–
Very large memory per computational core
–
Two NVIDIA Graphics cards per node
–
Rendering performance 154 billion
Longhorn Cluster Overview
Hardware | Components | Characteristics
Peak Performance (CPUs) | 512 Intel Xeon E5540 | 20.7 TFLOPS
Peak Performance (GPUs, SP) | 128 NVIDIA Quadroplex 2200 S4 | 500 TFLOPS
System Memory | DDR3 DIMMS | 48 GB/node (240 nodes), 144 GB/node (16 nodes), 13.5 TB total
Graphics Memory | 512 FX5800 x 4 GB | 2 TB
Disk | Lustre parallel file system | 210 TB
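The CPU peak and memory totals in this table can be cross-checked with a short sketch. It assumes quad-core CPUs at 2.53 GHz (stated on the node slides) and 4 flops per clock period per core, which is not stated for Longhorn and is an assumption here. Memory totals use 1 TB = 1024 GB, which is what reproduces the 13.5 TB figure.

```python
# Longhorn CPU peak, system memory and graphics memory totals.
CPUS = 512
CORES_PER_CPU = 4
CLOCK_GHZ = 2.53
FLOPS_PER_CYCLE = 4          # assumption: not stated on the Longhorn slides

STANDARD_NODES, STANDARD_GB = 240, 48
FAT_NODES, FAT_GB = 16, 144
GPUS, GPU_GB = 512, 4

cpu_tflops = CPUS * CORES_PER_CPU * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000
system_mem_tb = (STANDARD_NODES * STANDARD_GB + FAT_NODES * FAT_GB) / 1024
graphics_mem_tb = GPUS * GPU_GB / 1024

print(f"{cpu_tflops:.1f} TFLOPS, {system_mem_tb} TB system, {graphics_mem_tb} TB graphics")
```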
Longhorn “fat” nodes
•
Dell R710
–
Intel dual socket quad-core Xeon E5540 @ 2.53 GHz
–
144 GB DDR3 (18 GB/core)
–
Intel 5520 chipset
•
NVIDIA Quadroplex 2200 S4
–
4 NVIDIA Quadro FX5800
•
240 CUDA cores
•
4 GB Memory
•
102 GB/s Memory bandwidth
Longhorn standard nodes
•
Dell R610
–
Dual socket Intel quad-core Xeon E5540 @ 2.53 GHz
–
48 GB DDR3 (6 GB/core)
–
Intel 5520 chipset
•
NVIDIA Quadroplex 2200 S4
–
4 NVIDIA Quadro FX5800
•
240 CUDA cores
•
4 GB Memory
•
102 GB/s Memory bandwidth
Motherboard (R610/R710)
[Block diagram: two Quad Core Intel Xeon 5500 sockets, each with nine DDR3 DIMM positions shown, attached to the Intel 5520; 42 lanes PCI Express or 36 lanes PCI Express 2.0; 12 DIMM slots (R610)]
Cache Sizes in Intel Nehalem
– L1: 32 KB per core (Cores 1 to 4)
– L2: 256 KB per core
– L3: 8 MB, shared
– Memory Controller, QPI 0, QPI 1
• Total QPI bandwidth up to 25.6 GB/s (@ 3.2 GHz)
• 2 20-lane QPI links
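The 25.6 GB/s QPI figure can be derived from the 3.2 GHz link clock. The sketch assumes two transfers per clock and 2 bytes of data per transfer in each direction; neither value appears on the slide, so treat both as assumptions about how the total is counted.

```python
# Bandwidth of one QPI link, counted over both directions.
QPI_CLOCK_GHZ = 3.2
TRANSFERS_PER_CLOCK = 2     # assumption: data moves on both clock edges
BYTES_PER_TRANSFER = 2      # assumption: 16 data bits per direction
DIRECTIONS = 2              # each link carries traffic both ways at once

gt_per_sec = QPI_CLOCK_GHZ * TRANSFERS_PER_CLOCK
per_direction_gbs = gt_per_sec * BYTES_PER_TRANSFER
total_gbs = per_direction_gbs * DIRECTIONS

print(f"{gt_per_sec:.1f} GT/s, {per_direction_gbs:.1f} GB/s each way, {total_gbs:.1f} GB/s total")
```

The transfer rate agrees with the 6.4 GT/s QPI links listed for the Lonestar compute nodes.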
Storage
•
Ranch
– Long term tape storage
Ranch Archival System
• Sun StorageTek Silo
• 10,000 T10000B tapes
• 6,000 T10000C tapes
• ~40 PB total capacity
• Used for long-term storage
Corral
• 6PB DataDirect Networks online disk storage
• 8 Dell 1950 servers
• 8 Dell 2950 servers
• 12 Dell R710 servers
• High Performance Parallel File System
– Multiple databases
– iRODS data management
– Replication to tape archive
• Multiple levels of access control
• Web and other data access available globally