dca 8/2003
Application of General Purpose
HPC Systems in HPEC
David Alexander
Director of Engineering
SGI, Inc.
dca 8/2003
SGI Scalable ccNUMA Architecture
Basic Node Structure and Interconnect
Physical Memory
Physical Memory
(2-16GB)
C A C H ECPU
Interface
Chip
CPU
C A C H ENUMAlink Interconnect
(>6GB/sec)
Physical Memory
(2-16GB)
C A C H ECPU
Interface
Chip
CPU
C A C H EMemory Interface
(>8GB/sec)
Memory Interface
(>8GB/sec)
Processor Interface
(>6GB/sec)
Processor Interface
(>6GB/sec)
SGI Scalable ccNUMA Architecture
Scaling to Large Node Counts
Memory CPU Interface Chip CPU Memory CPU Interface Chip CPU Memory CPU Interface Chip CPU Memory CPU Interface Chip CPU
System Interconnect Fabric
•
Scaling to 100’s of processors
• RT Memory Latency < 600ns worst case (64p config)
• Bi-section bandwidth >25GB/sec (64p config)
SGI Scalable ccNUMA Architecture
SGI
®
NUMAflex™ Modular Design
System
Interconnect
CPU &
Memory
Storage Expansion
System I/O
High BW I/O
Expansion
Standard I/O
Expansion
Graphics Expansion
Scalability: No central bus or switch; just
modules and NUMAlink™ cables
Scalability:
No central bus or switch; just
modules and NUMAlink™ cables
Investment protection: Add new
technologies as they evolve
Investment protection:
Add new
technologies as they evolve
Flexibility: Tailored configurations for
different dimensions of scalability
Flexibility:
Tailored configurations for
different dimensions of scalability
Performance: High-bandwidth
interconnect with very low latency
Performance:
High-bandwidth
interconnect with very low latency
Traditional Clusters
SGI
®
Altix
™
3000
node
+
OS
...
Fast NUMAflex™ interconnect
Global shared memory
node
+
OS
...
Commodity interconnect
mem
mem
mem
mem
mem
mem
node
+
OS
node
+
OS
node
+
OS
node
+
OS
node
+
OS
node
+
OS
node
+
OS
node
+
OS
What is shared memory?
• All nodes operate on one large shared memory space, instead of each node having its own small
memory space
Shared memory is high-performance
• All nodes can access one large memory space efficiently, so complex communication and data
passing between nodes aren’t needed
• Big data sets fit entirely in memory; less disk I/O is needed
Shared memory is cost-effective and easy to deploy
• The SGI Altix 3000 family supports all major parallel programming models
• It requires less memory per node, because large problems can be solved in big shared memory
• Simpler programming means lower tuning and maintenance costs
SGI Scalable ccNUMA Architecture
dca 8/2003
SGI
®
Altix™ 3000
Overview
First Linux
®
node with 64 CPUs in
single-OS image
First clusters with global shared
memory across multiple nodes
First Linux solution with HPC
system-and data-management tools
World-record performance for
floating-point calculations, memory
performance, I/O bandwidth, and real
technical applications
Industry-standard Linux
®
Intel
®
Itanium
®
2 processor family
SGI
®
NUMAflex™
(Third-generation modular supercomputing architecture)
SGI
®
NUMAlink™
(Built-in high-bandwidth, low-latency interconnect)
SGI
®
Altix™ 3000
Fusion of Powerhouse Technologies
SGI
®
supercomputing technology, Intel’s most advanced
processors, and open-source computing
SGI ProPack™ and supercomputing enhancements
Global shared memory
SGI
®
Altix
™
3000 Family
Extreme Power, Extreme Potential
Single-node entry offering
4–12 1.30 GHz Itanium
®2
processors, 3MB L3 cache
Up to 96GB memory
Scalable performance offering
4–64 Itanium 2 processors in a single node
1.30 GHz/3MB L3 cache
1.50 GHz/6MB L3 cache
Shared memory across nodes
Scalable to 2,048 processors, 16TB memory
Nodes up to 64P, 4TB memory
Model 3300 Servers
dca 8/2003
Multi-Paradigm Architecture
Overview
C
C
C
C
C
C
C
C
C
C
C
V
R
R
R
R
IO
IO
IO
IO
MSE
MSE
MSE
MSE
• NUMAlink system interconnect
• General-purpose compute nodes
• Peer-attached general purpose I/O
• Mission-specific accel. and/or I/O (MSE)
• Integrated graphics/visualization
R
C
IO
MSE
V
Multi-Paradigm Architecture
Mission Specific Element (MSE)
MSE
NL4 Ports
(>5GB/s x 2)
FPGA,
DSP, or
ASIC
TIO ASIC
System
Interconnect
Fabric
>5GB/s
Memory
I/O
Implementation variants:
•
Customer supplied/loaded FPGA algorithms and/or ASIC/DSP
• Subroutine or standard library acceleration
• Specific use “appliance”
dca 8/2003
Embedded High Performance
Computing (EHPC)
EHPC Today
Different Development and Deployment Environments
EHPC Vision
Unified Development and Deployment
Enhanced software development productivity
Superior performance and HW utilization
Demonstration to deployment in days
Benefits from mainstream HPC advances
Development Environment
HPC methods, tools, hardware
Architecture provides freedom to develop new approaches
Platform Port
Months of work Millions of dollars Significant schedule risk
Significant architectural/performance risk
Deployment
Proprietary hardware and software
“Islands of memory” narrows algorithm options
Embedded High Performance Computing
Unified Development/Deployment Environment
SGI
EHPC
System
Scalable ccNUMA
System Architecture
Embedded Systems
Packaging
Seamless HPC Application
Development Model
Mission-Specific
Processing
Embedded High Performance Computing
EHPC Platform
• Unified software environment for development,
prototype and deployment
• Full multi-paradigm computing support
• Field-upgradeable mission specific processing
acceleration
• Highly optimized performance per watt and
performance per slot
• Design to fit into established environmentals, form
factors, and interfaces
• 6U Eurocard form factor
• Passive, slot configurable backplane
• PMC module connectivity to standard interfaces
• Able to address oceanic, ground and airborne
environmentals
SGI
EHPC
System
Scalable ccNUMA System Architecture Embedded SystemsPackaging
Seamless HPC Application
Development Model
Mission-Specific Processing
Embedded High Performance Computing
Family of 6U Form Factor Modules
6U Form Factor NUMAlink4 Ports Diagnostic & L1 Ports System Control PMC Module Site NUMAlink Controller PMC Module Site PIC 6U Form Factor NUMAlink4 Ports Diagnostic & L1 Ports uP Cache NUMAlink Controller Memory
DDR SDRAMDDR SDRAMMemory Memory DDR SDRAMDDR SDRAMMemory
Memory DDR SDRAMDDR SDRAMMemory
Memory DDR SDRAMDDR SDRAMMemory
uP Cache
System Control
Scalable General Purpose
6U Form Factor NUMAlink4 Port Diagnostic & L1 Ports TIO System Control Memory FPGA 6.4GB/s TIO Memory FPGA 6.4GB/s NUMAlink4 Port NL4 Port External I/O (Optional) 6U Form Factor NUMAlink4 Port Diagnostic & L1 Ports TIO System Control Memory TIO Memory NUMAlink4 Port NL4 Port External I/O (Optional) FPGA FPGA ASIC ASIC
Embedded High Performance Computing
Family of 6U Form Factor Modules
www.sgi.com
©2002 Silicon Graphics, Inc. All rights reserved. Silicon Graphics, SGI, IRIX, Challenge, Origin, Onyx, and the SGI logo are registered trademarks and OpenVault, FailSafe, Altix, XFS, and CXFS are trademarks of Silicon Graphics, Inc., in the U.S. and/or other countries worldwide. MIPS is a registered trademark of MIPS Technologies, Inc., used under license by Silicon Graphics, Inc. UNIX is a registered trademark of The Open Group in the U.S. and other countries. Intel and Itanium are registered trademarks of Intel Corporation. Linux is a registered trademark of Linus Torvalds. Mac is a registered trademark of Apple Computer, Inc. Windows is a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries.Red Hat is a registered trademark of Red Hat, Inc. All other trademarks are the property of their respective owners. (1/03)