High Performance Computing
Course Notes 2007-2008
HPC Fundamentals
Introduction
What is High Performance Computing (HPC)?
Difficult to define - it’s a moving target.
• Late 1980s: a supercomputer performed about 100 MFLOPS
• Today: a 2 GHz desktop/laptop performs a few GFLOPS
• Today: a supercomputer performs tens of TFLOPS (Top500)
• High performance: O(1000) times more powerful than the latest desktops
Most supercomputers are obsolete in terms of performance before the end of their physical life.
Applications of HPC
HPC is driven by the demand from computation-intensive applications in various areas:
• Medicine, biology, neuroscience (e.g. simulation of brains)
• Finance (e.g. modelling the world economy)
• Military and defence (e.g. modelling the explosion of nuclear weapons)
• Engineering (e.g. simulation of a car crash or a new airplane design)
An Example of Demands in Computing Capability
Project: Blue Brain
Aim: construct a simulated brain
Building blocks of a brain are neocortical columns
A column consists of about 60,000 neurons
The human brain contains millions of such columns
First stage: simulate a single column (each processor acting as one or two neurons)
Then: simulate a small network of columns
Ultimate goal: simulate the whole human brain
IBM contributes a Blue Gene supercomputer
Related Technologies
HPC covers a wide range of technologies:
Computer architecture
• CPU, memory, VLSI
Compilers
• Identify inefficient implementations
• Make use of the characteristics of the computer architecture
• Choose a suitable compiler for a certain architecture
Algorithms (for parallel and distributed systems)
• How to program on parallel and distributed systems
Middleware
• From Grid computing technology
• Application->middleware->operating system
• Resource discovery and sharing
History of High Performance Computing
1960s: Scalar processor
Process one data item at a time
1970s: Vector processor
Can process an array of data items at one go
Architecture
Overhead
Difference between vector processor and scalar processor
Late 1980s: Massively Parallel Processing (MPP)
Up to thousands of processors, each with its own memory and OS
Break down a problem
Difference between MPP and vector processor
Late 1990s: Cluster
Not a new term in itself, but one attracting renewed interest
Connecting stand-alone computers with a high-speed network
Difference between cluster and MPP
Late 1990s: Grid
Tackles collaboration among geographically distributed organisations
Draws an analogy with the power grid
Difference between Grid and cluster
Parallel computing vs. distributed computing
Parallel Computing
Breaking the problem to be computed into parts that can run simultaneously on different processors
Example: an MPI program to perform matrix multiplication
Solve tightly coupled problems
Distributed Computing
Parts of the work are computed in different places (note: this does not necessarily imply simultaneous processing)
An example: the client/server (C/S) model
Solve loosely coupled problems (not much communication)
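The parallel-computing idea above (breaking a problem into parts that run simultaneously) can be sketched for matrix multiplication. This is not an MPI program: as an assumption made for brevity, worker threads from Python's standard library stand in for MPI processes, and the names `multiply_rows` and `parallel_matmul` are hypothetical. The sketch shows only the decomposition: A is split into row blocks, each block is multiplied by B independently, and the partial results are concatenated.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a_rows, b):
    """Multiply one block of rows of A by the whole of B."""
    cols = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]


def parallel_matmul(a, b, workers=2):
    """Split A into contiguous row blocks and hand each block to a worker.

    Assumption: threads stand in for the processes of a real MPI program.
    In CPython they illustrate the decomposition, not a real speedup.
    """
    block = max(1, -(-len(a) // workers))  # ceiling division
    blocks = [a[i:i + block] for i in range(0, len(a), block)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps block order, so the parts concatenate back into C.
        parts = pool.map(multiply_rows, blocks, [b] * len(blocks))
        return [row for part in parts for row in part]


if __name__ == "__main__":
    print(parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each worker needs all of B but only its own rows of A. In a distributed-memory setting B would have to be sent to every process, which is why the problem counts as tightly coupled.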
Architecture Types
SMP (Symmetric Multi-Processing)
Multiple CPUs, single memory, shared I/O
All resources in an SMP machine are equally available to each CPU
Does not scale well to a large number of processors (typically fewer than 8). Scalability is a measure of how close the improvement in system performance is to linear in the number of processing elements.
NUMA (Non-Uniform Memory Access)
Multiple CPUs
Each CPU has fast access to its local area of the memory, but slower access to other areas
Scales well to a large number of processors
Complicated memory access pattern and system bus
MPP (Massively Parallel Processing)
Cluster
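The notion of scalability mentioned for SMP can be made concrete with the commonly used speedup and efficiency measures. These definitions are standard but are not given in the notes themselves, and the timings in the usage example are made-up values for illustration only.

```python
def speedup(t_serial, t_parallel):
    """Speedup: how many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_pe):
    """Efficiency: speedup per processing element (1.0 means perfectly linear)."""
    return speedup(t_serial, t_parallel) / n_pe


if __name__ == "__main__":
    # Hypothetical timings in seconds, not measurements.
    print(efficiency(100.0, 25.0, 4))  # linear scaling on 4 PEs
    print(efficiency(100.0, 20.0, 8))  # scaling falls off on 8 PEs
```

A system scales well when efficiency stays close to 1.0 as the number of processing elements grows.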
Illustration for Architecture Types
Shared memory (uniform memory access - SMP)
Processors share access to a common memory space.
• Implemented over a shared memory bus or communication network.
Support for critical sections is required.
Local cache is critical:
• Without it, bus contention (or network traffic) reduces the system’s efficiency.
• For this reason, pure shared memory systems do not scale naturally. Cache introduces problems of coherency (ensuring that stale cache lines are invalidated when other processors alter shared memory).
[Figure: PE 0 ... PE n connected through an interconnect to a single shared memory]
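The need for critical sections on shared memory can be illustrated with a minimal sketch in which several threads update one shared counter. Threads stand in for processing elements here, and the function name `count_with_lock` and its parameters are hypothetical.

```python
import threading


def count_with_lock(n_threads=4, increments=10000):
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:  # critical section: one thread at a time
                counter += 1  # read-modify-write on shared memory

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(count_with_lock())
```

Without the lock, two threads could read the same old value and one update would be lost. The lock serialises the update, which is also why heavy contention on shared data limits scalability.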
Illustration for Architecture Types
Shared memory (non-uniform memory access: NUMA)
A PE may be fetching from local or remote memory, hence non-uniform access times.
• NUMA
• cc-NUMA (cache-coherent Non-Uniform Memory Access)
Groups of processors are connected together by a fast interconnect (SMP).
These groups are then connected together by a high-speed interconnect.
Global address space.
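[Figure: m groups of n PEs. Group 1 holds PE 1 ... PE n with Shared Memory 1; group m holds PE (m-1)n+1 ... PE m.n with Shared Memory m; the groups are joined by an interconnect]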
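The effect of non-uniform access times can be estimated with a simple weighted average. This is a simplified model, not something stated in the notes: it assumes only two latencies (local and remote) and ignores caches and contention. The latencies in the usage example are hypothetical.

```python
def mean_access_time(local_fraction, t_local, t_remote):
    """Average memory access time when a fraction of accesses is local.

    Simplifying assumption: only two latencies exist, caches are ignored.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * t_local + (1.0 - local_fraction) * t_remote


if __name__ == "__main__":
    # Hypothetical latencies in nanoseconds: 100 local, 300 remote.
    for fraction in (1.0, 0.5, 0.0):
        print(fraction, mean_access_time(fraction, 100.0, 300.0))
```

The model shows why data placement matters on NUMA machines: the more accesses a PE can satisfy from its local memory, the closer the average gets to the local latency.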
Illustration for Architecture Types
Distributed memory (MPP, cluster)
Each processor has its own local memory.
When processors need to exchange (or share) data, they must do so through explicit communication:
• Message passing (e.g. MPI)
Typically larger latencies between PEs (especially if they communicate over network interconnections).
Scalability is good if the problems can be sufficiently contained within PEs.
[Figure: PE 0 ... PE n, each with its own local memory M 0 ... M n, joined by an interconnect]
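The message-passing style can be sketched without an MPI installation. As an assumption for illustration, threads play the role of PEs and a `queue.Queue` plays the role of the interconnect; the names `pe` and `distributed_sum` are hypothetical. Each PE touches only its own chunk of data and shares its result by an explicit send to PE 0, which performs explicit receives.

```python
import queue
import threading


def pe(rank, local_data, inbox_of_pe0, n_pe, result):
    """One processing element: compute locally, then communicate explicitly."""
    partial = sum(local_data)  # uses this PE's local memory only
    if rank == 0:
        total = partial
        for _ in range(n_pe - 1):
            total += inbox_of_pe0.get()  # explicit receive, blocks until sent
        result.append(total)
    else:
        inbox_of_pe0.put(partial)  # explicit send to PE 0


def distributed_sum(chunks):
    """Sum data that is split into one chunk per PE (needs at least one chunk)."""
    inbox_of_pe0 = queue.Queue()  # stands in for the interconnect
    result = []
    threads = [
        threading.Thread(target=pe,
                         args=(rank, chunk, inbox_of_pe0, len(chunks), result))
        for rank, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(distributed_sum([[1, 2, 3], [4, 5], [6]]))
```

Only one small message per PE crosses the interconnect, so most of the work is contained within the PEs. This is the situation in which distributed memory scales well.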