High Performance Computing
Trey Breckenridge
Computing Systems Manager, Engineering Research Center, Mississippi State University
What is High Performance Computing?
• HPC is ill defined and context dependent.
• In the late 1980s, the US Government defined
supercomputers as processors capable of more than
100 MFLOPS. This definition is clearly obsolete, as modern desktop PCs are capable of ~5 GFLOPS.
• Another approach is to describe HPC as the fastest
computers at any point in time; however, that definition depends more on budget than on technology.
• For the intent of this presentation, we will define HPC as:
Computing resources which provide at least an order of magnitude more computing power than is normally
What does the definition really mean?
• That definition sounds like HPC is hardware only. Isn’t the software important too?
• HPC covers the full range of supercomputing activities, including existing supercomputer systems, special-purpose and experimental systems, and the new generation of large-scale parallel architectures.
• HPC exists on a broad range of computer systems, from departmental clusters of desktop workstations to large parallel processing systems.
Why High Performance Computing?
• To achieve the maximum number of computations in a minimum amount of time – SPEED!
• To solve problems that couldn’t otherwise be solved without large computer systems.
• Traditionally, HPC has been used in scientific and engineering fields for work with massively complex simulations.
• Computations are typically “floating point” intensive.
Areas of HPC Use
• Traditional:
– Computational Fluid Dynamics (CFD)
– Climate, Weather, and Ocean Modeling and Simulation (CWO)
– Nuclear Modeling and Simulation
– Geophysical/Petroleum Modeling
• Emerging:
– Computer Graphics/Scientific Visualization
– Financial Modeling
– Database Applications
– Bioinformatics
Parallel Computing
• A collection of processing elements that can
communicate and cooperate to solve large
problems more quickly than a single processing
element.
• Simultaneous use of multiple processors to
execute different parts of a program.
• Goal: To reduce wall-clock time of run
• No single processor is ever again likely to match the
performance of existing parallel HPC systems
Types of Parallelism
• Overt
– Parallelism is visible to the programmer
– May be difficult to program (correctly)
– Large improvements in performance
• Covert
– Parallelism is not visible to the programmer
– Compiler responsible for parallelism
– Easy to do
Speed Up
• Speed Up is one quantitative measure of the
benefit of parallelism
• Speed Up is defined as S / T(N) where,
– S = best serial time
– T(N) = time required for N processors
• Since S/N is the best possible parallel time,
speedup typically should not exceed N
• S is sometimes difficult to measure causing many
people to substitute T(1) for S
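The definition above can be sketched in a few lines of Python. The function name and all timings are hypothetical, chosen only to illustrate the formula; they are not measurements from any real system. The example also shows why substituting T(1) for S tends to flatter the result when the parallel code run on one processor is slower than the best serial code.

```python
def speedup(best_serial_time, parallel_time):
    """Return S / T(N): best serial time divided by the time on N processors."""
    if best_serial_time <= 0 or parallel_time <= 0:
        raise ValueError("times must be positive")
    return best_serial_time / parallel_time


# Hypothetical timings in seconds (assumed for illustration only).
S = 120.0                                 # best serial time
timings = {1: 125.0, 4: 36.0, 16: 12.0}   # T(N) for N = 1, 4, 16

for n, t_n in timings.items():
    true_speedup = speedup(S, t_n)                 # uses S
    inflated_speedup = speedup(timings[1], t_n)    # uses T(1) in place of S
    print(n, round(true_speedup, 2), round(inflated_speedup, 2))
```

With these assumed numbers, T(1) is larger than S, so the speed up computed from T(1) is always the larger of the two.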
Efficiency
• Speed up does not measure how efficiently the
processors are being used
– Is it worth using 100 processors to get a speed up of 2?
• Efficiency is defined as the ratio of the speed up
and the number of processors required to achieve
it
– The best efficiency is 1
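As a small worked example of this ratio, the sketch below answers the question posed on this slide. The function name is an assumption made for illustration.

```python
def efficiency(speedup, num_processors):
    """Return speed up divided by the number of processors used to achieve it."""
    if num_processors < 1:
        raise ValueError("need at least one processor")
    return speedup / num_processors


# The slide's question: 100 processors for a speed up of 2.
print(efficiency(2.0, 100))     # 0.02, i.e. 2% of the ideal
# Ideal case: speed up equal to the processor count.
print(efficiency(100.0, 100))   # 1.0
```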
Processors
• Vector
– Large rows of data are operated on simultaneously
• Scalar
– Data is operated on in a sequential fashion
– Instruction sets
  • Complex Instruction Set Computer (CISC)
  • Reduced Instruction Set Computer (RISC)
  • Post-RISC or CISC/RISC
    – UltraSPARC
    – IBM Power4
    – IA64
Scalar vs. Vector Arithmetic
      DO 10 i = 1, n
         a(i) = b(i) + c(i)
   10 CONTINUE

Scalar:                  Vector:
a(1) = b(1) + c(1)       a = b + c
a(2) = b(2) + c(2)
…
a(n) = b(n) + c(n)
Where is Scalar better?
• If the vector length is small
• If the loop contains IF statements
• If partial vectorization involves large overhead
• If recursion is used
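The first point, short vector lengths, can be illustrated with a simple cost model. This is only a sketch under stated assumptions: a vector operation pays a fixed startup cost and then a small per-element cost, while a scalar loop pays a larger per-element cost and no startup. The cycle counts are hypothetical and do not describe any particular machine.

```python
def scalar_cycles(n, per_element=4):
    """Assumed cost of a scalar loop over n elements."""
    return n * per_element


def vector_cycles(n, startup=30, per_element=1):
    """Assumed cost of one vector operation: fixed startup plus streaming."""
    return startup + n * per_element


def break_even_length(startup=30, scalar_per_element=4, vector_per_element=1):
    """Vector length at which both approaches cost the same in this model."""
    return startup / (scalar_per_element - vector_per_element)


for n in (5, 10, 100):
    print(n, scalar_cycles(n), vector_cycles(n))
print("break-even length:", break_even_length())
```

In this model the scalar loop wins below the break-even length and the vector operation wins above it.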
Architectural Classifications
Flynn’s Taxonomy
• Published by Flynn in 1972
– Outdated, but still widely used
• Categorizes machines by instruction streams and data streams
– A stream of instructions (the algorithm) tells the computer what to do.
– A stream of data (the input) is affected by these instructions.
• Four Categories
– SISD – Single Instruction, Single Data
– MISD – Multiple Instruction, Single Data
– SIMD – Single Instruction, Multiple Data
– MIMD – Multiple Instruction, Multiple Data
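The four categories follow mechanically from the two stream counts. A minimal sketch, with a hypothetical function name and illustrative stream counts:

```python
def flynn_category(instruction_streams, data_streams):
    """Classify a machine by its number of instruction and data streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    instruction_part = "S" if instruction_streams == 1 else "M"
    data_part = "S" if data_streams == 1 else "M"
    return f"{instruction_part}I{data_part}D"


print(flynn_category(1, 1))      # SISD
print(flynn_category(1, 1024))   # SIMD
print(flynn_category(4, 1))      # MISD
print(flynn_category(4, 4))      # MIMD
```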
SISD
Single Instruction, Single Data
• Conventional single processor computers
• Each arithmetic instruction initiates an operation
on a data item taken from a single stream of data
elements.
• Historical supercomputers and most contemporary
microprocessors are SISD
SIMD
Single Instruction, Multiple Data
• Many, simple processing elements – 1000s
• Each processor has its own local memory
• Each processor runs the same program
• Each processor processes different data streams
• All processors work in lock-step (synchronously)
• Very efficient for array/matrix operations
• Most older vector/array computers are SIMD
• Example machines:
– Cray Y-MP
MISD
Multiple Instruction, Single Data
• Very few machines fit this category
• None have been commercially successful or have
had any impact on computational science
MIMD
Multiple Instruction, Multiple Data
• Most diverse of the four classifications
• Multiple processors
• Each processor either has own, or accesses shared,
memory
• Each processor can run the same or different
programs
• Each processor processes different data streams
• Processors can work synchronously or asynchronously
• Processors can be either tightly or loosely coupled
• Examples include:
– Processors and memory units specifically designed to be components of a parallel architecture (e.g., Intel Paragon)
– Large scale parallel machines built from “off the shelf” workstations (e.g., Beowulf Cluster)
– Small scale multiprocessors made by connecting multiple vector processors together (e.g., Cray T90)
– Wide variety of other designs as well
SPMD Computing
• Not a Flynn category, per se, but instead a combination of categories.
• SPMD stands for single program, multiple data
– The same program is run on the processors of an MIMD machine.
– Occasionally the processors may synchronize.
– Because an entire program is executed on separate data, it is
possible that different branches are taken, leading to asynchronous parallelism
• SPMD came about from a desire to do SIMD-like calculations on MIMD machines
– SPMD is not a hardware paradigm, but instead, the software equivalent of SIMD
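A minimal sketch of the SPMD idea: every worker runs the same function on its own slice of data, and a data-dependent branch lets workers take different paths. Python threads stand in for the processors of an MIMD machine here, which is an assumption made purely for illustration. The names `program` and `run_spmd` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def program(rank, local_data):
    """The single program every worker runs; only rank and data differ."""
    total = sum(local_data)
    if total % 2 == 0:
        # Workers whose data sums to an even number take this branch.
        return rank, "even-branch", total
    # The others take this one, so execution is no longer in lock-step.
    return rank, "odd-branch", max(local_data)


def run_spmd(chunks):
    """Start one worker per chunk, then wait for all of them (a synchronization point)."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        futures = [pool.submit(program, rank, chunk)
                   for rank, chunk in enumerate(chunks)]
        return [future.result() for future in futures]


results = run_spmd([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
print(results)
```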
Memory Classifications
Organization
• Shared Memory (SM-MIMD)
– Bus based
– Interconnection network
• Distributed Memory (DM-MIMD)
– Local
– Message passing
• Virtual shared memory (VSM-MIMD)
– Physically distributed, but appears as one image
Access
• Uniform Memory Access (UMA)
Memory Organization
Shared Memory
• One common memory block between all processors
Bus Based
• Since the bus has limited bandwidth, the number of processors that can be used is limited to a few tens.
• Examples include typical multiprocessor PCs, SGI
Memory Organization
Switch based
• Utilizes (complex) inter-connected network to connect processors to shared memory modules
• May use multi-stage networks - NUMA
• Increases bandwidth to memory over bus based systems
• Every processor still has access to global memory
Memory Organization
Distributed Memory
• Message Passing. Memory physically distributed through the machine. Each processor has private memory.
• Contents of private memory can only be accessed by that processor. If required by another processor, then it must be sent explicitly.
• In general, machines can be scaled to thousands of processors.
• Requires special programming techniques.
Memory Organization
Virtual Shared Memory
• Objective is to have the scalability of distributed memory with the programmability of shared memory
• Global address space mapped onto physically distributed memory
• Data moves between processors “on demand” or as it is accessed
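One way to picture the mapping of a global address space onto distributed memory is a block distribution, where each node owns a contiguous range of addresses. The sketch below assumes that layout; real virtual shared memory systems may distribute and cache data quite differently. The class and function names are hypothetical.

```python
def locate(global_index, num_nodes, words_per_node):
    """Map a global address to (owning node, local offset), block distribution."""
    if not 0 <= global_index < num_nodes * words_per_node:
        raise IndexError("global address out of range")
    return divmod(global_index, words_per_node)


class VirtualSharedMemory:
    """One global address space backed by per-node private lists."""

    def __init__(self, num_nodes, words_per_node):
        self.num_nodes = num_nodes
        self.words_per_node = words_per_node
        self.node_memory = [[0] * words_per_node for _ in range(num_nodes)]
        self.remote_accesses = 0  # accesses that would cross the interconnect

    def _resolve(self, requesting_node, global_index):
        node, offset = locate(global_index, self.num_nodes, self.words_per_node)
        if node != requesting_node:
            self.remote_accesses += 1  # data is fetched "on demand"
        return node, offset

    def read(self, requesting_node, global_index):
        node, offset = self._resolve(requesting_node, global_index)
        return self.node_memory[node][offset]

    def write(self, requesting_node, global_index, value):
        node, offset = self._resolve(requesting_node, global_index)
        self.node_memory[node][offset] = value


vsm = VirtualSharedMemory(num_nodes=4, words_per_node=256)
vsm.write(requesting_node=0, global_index=300, value=42)  # stored on node 1
print(locate(300, 4, 256), vsm.read(1, 300), vsm.remote_accesses)
```

The programmer only ever uses global addresses; the remote-access counter makes visible the traffic that the abstraction hides.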
Compute Clusters
• Connecting multiple standalone machines via a network interconnect, utilizing software to access the combined systems as one computer
• The standalone machines could be inexpensive single processor workstations or multi-million dollar
multiprocessor servers
• Individual machines can be connected via numerous networking technologies using a variety of topologies.
– 100BaseT Ethernet – inexpensive, low performance, high latency
– Myrinet (2 Gb/s) – expensive, high performance, low latency
– Proprietary high speed network
• Nearly 20% of the fastest 500 supercomputers in the world are clusters.
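The effect of latency and bandwidth on an interconnect can be estimated with a simple model: transfer time is a fixed latency plus message size divided by bandwidth. This is a rough sketch only. The bandwidths are 100 Mb/s for 100BaseT and the 2 Gb/s quoted above for Myrinet; the latency figures are hypothetical placeholders, not measured values.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bits_per_s):
    """Estimated seconds to deliver one message: latency plus serialization time."""
    return latency_s + (message_bytes * 8) / bandwidth_bits_per_s


# Latencies below are assumed for illustration only.
ETHERNET = {"latency_s": 100e-6, "bandwidth_bits_per_s": 100e6}
MYRINET = {"latency_s": 10e-6, "bandwidth_bits_per_s": 2e9}

for size in (64, 1_000_000):
    print(size,
          transfer_time(size, **ETHERNET),
          transfer_time(size, **MYRINET))
```

For small messages the latency term dominates, and for large messages the bandwidth term does, which is why both figures matter when choosing an interconnect.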
Beowulf Clusters
• First developed in 1994 at NASA Goddard
• Goal is to build a supercomputer utilizing a large number of inexpensive, commodity off-the-shelf (COTS) parts.
• Increasingly used for HPC applications due to the high cost of MPPs and the wide availability of networked workstations.
• Not a panacea for HPC. Many applications require shared memory or vector solutions.
• Existing Beowulf clusters range from 2 to 4000 processors and are likely to reach 10,000 processors in the near future.
Metacomputing
• Metacomputing is a dynamic environment that has some informal pool of nodes that can join or leave the
environment whenever they desire.
– SETI@HOME
• Why do we need metacomputing?
– Our computational needs are infinite
– Our financial resources are finite
• Someday we will utilize computing cycles just like we utilize electricity from the power company.
– Enables us to “buy” cycles on an as needed basis.
• Commonly referred to as “The Grid” or “Computational Grids”
Job Execution
• Most HPC systems do not allow interactive access.
• Batch-style jobs are submitted to the system via a queuing mechanism.
• Schedulers determine the order in which jobs should be run. Factors include
– User priority
– Resource availability
• The goal of the Scheduler is to maximize system utilization.
• Scheduler optimization is an important component and is a field of study of its own.
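A toy sketch of how the two factors listed above might interact: jobs are considered in priority order, and a job starts only if enough processors are free. Production schedulers use far more elaborate policies. The job names, priorities, and processor counts are hypothetical, and the convention that a lower number means higher priority is an assumption of this sketch.

```python
import heapq


def schedule(jobs, free_processors):
    """Greedy pass over a job queue.

    jobs: list of (name, priority, processors_needed); lower priority runs first.
    Returns (started, waiting) as lists of job names.
    """
    # Submission order breaks ties between jobs of equal priority.
    queue = [(priority, order, name, needed)
             for order, (name, priority, needed) in enumerate(jobs)]
    heapq.heapify(queue)

    started, waiting = [], []
    while queue:
        _, _, name, needed = heapq.heappop(queue)
        if needed <= free_processors:
            free_processors -= needed   # resources are available: start the job
            started.append(name)
        else:
            waiting.append(name)        # stays queued until processors free up
    return started, waiting


jobs = [("cfd", 2, 64), ("weather", 1, 96), ("viz", 3, 16)]
print(schedule(jobs, free_processors=128))
```

Here the small low-priority job starts ahead of a larger job that does not fit, which raises utilization at the expense of strict priority order.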
Programming Languages
• It has been said, “I don’t know what language they will be using to program high performance computers 10 years from now, but I do know it will be called FORTRAN.”
• C and C++ are making strides in the HPC community due to their ability to create complex data structures and better I/O routines.
• FORTRAN 90 incorporated many of the features of C (e.g., pointers).
• High Performance Fortran (HPF) is FORTRAN 90 with directive-based extensions allowing for shared and distributed memory machines – clusters, traditional supercomputers, and massively parallel processors
• Today, many programmers prefer to do their data structures, communications, etc. in C, while doing the computations in FORTRAN.
Compilers
• Compilers are an often overlooked area of HPC,
but are of critical importance.
• Application run times are directly related to the
ability of the compiler to produce highly
optimized code.
– Poor compiler optimization could result in run times increasing by an order of magnitude.
• Optimization Levels
– None, Basic, Interprocedural analysis, Runtime profile analysis, Floating-point, Data flow analysis, Advanced
Distributed Memory Parallel Programming
• Message passing is a programming paradigm
where one effectively writes multiple programs for
parallel execution.
• The problem must be decomposed, typically by
domain or function
• Each process knows only about its own local data.
If data is required from a different process, it must
send a message to that process asking for the data
• Access to remote data is much slower than to local
data, so a major objective is to minimize remote
communications.
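The request-and-reply pattern described above can be sketched with two workers that share nothing except message queues. This is an illustration only: the queues stand in for a network, Python threads stand in for separate processes, and nothing here uses a real message passing library. All names are hypothetical.

```python
import queue
import threading


def owner(inbox, local_data):
    """Owns local_data and answers requests until it receives None."""
    while True:
        message = inbox.get()
        if message is None:
            break
        key, reply_to = message
        reply_to.put(local_data[key])      # send the requested value back


def requester(owner_inbox, my_inbox, keys, results):
    """Has no access to the owner's data; must ask for each value explicitly."""
    for key in keys:
        owner_inbox.put((key, my_inbox))   # explicit request message
        results.append(my_inbox.get())     # block until the reply arrives
    owner_inbox.put(None)                  # tell the owner to shut down


owner_inbox, requester_inbox = queue.Queue(), queue.Queue()
results = []
workers = [
    threading.Thread(target=owner, args=(owner_inbox, {"x": 1.5, "y": 2.5})),
    threading.Thread(target=requester,
                     args=(owner_inbox, requester_inbox, ["x", "y"], results)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(results)
```

Every remote value costs a full round trip, which is why minimizing remote communication is a major design goal.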
Message Passing Environments
• PVM – Parallel Virtual Machine
– Portable and operable across heterogeneous computers
– Performance sacrificed for flexibility
– Well defined protocol allows for interoperability between different implementations
• MPI – Message Passing Interface
– Today’s standard for message passing
– Widely adopted by most vendors
– Portable and operable across heterogeneous computers
– Good performance with reasonable efficiency
Shared Memory Parallel Programming
• Every processor has direct access to the memory
of every other processor in the system
• Not widely used at the programmer level, but widely
used at the system level (even on single-processor
systems via multithreading)
• Allows low-latency, high-bandwidth
communications
• Portability is poor
• Easy to program (compared to message passing)
• Directive controlled parallelism
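By contrast with message passing, threads in a shared memory program read the same data directly and only need to coordinate when they update a shared result. The sketch below uses Python threads and a lock to show that structure. It is an illustration of the programming model, not of directive-based tools or of real parallel performance. The function name is hypothetical.

```python
import threading


def parallel_sum(values, num_threads=4):
    """Sum a shared list using several threads that all see the same memory."""
    total = [0]                  # shared accumulator
    lock = threading.Lock()      # protects updates to the accumulator

    def worker(start, stop):
        partial = sum(values[start:stop])   # read shared data directly, no messages
        with lock:
            total[0] += partial             # one thread at a time updates the total

    chunk = (len(values) + num_threads - 1) // num_threads
    threads = [
        threading.Thread(target=worker, args=(i * chunk, (i + 1) * chunk))
        for i in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total[0]


print(parallel_sum(list(range(1001))))
```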
Shared Memory Environments
• POSIX Threads (Pthreads)
• SHMEM
• OpenMP
– Quickly becoming the standard API for shared memory programming
– Emphasis on performance and scalability
– Allows for fine-grain or coarse-grain parallelism
– Some implementations are interoperable with MPI and PVM
Benchmarking
• Benchmarking is an important aspect of HPC and is used for purchase decisions, system configuration, and
application tuning.
• Rule 1: All vendors lie about their benchmarks!!
• Purchase decisions should not be based on published
benchmark results. If at all possible, run your code on the exact machine you are considering for purchase.
• LINPACK
– Mother of all benchmarks
– Not originally designed to be a benchmark, but instead a set of high performance library routines for linear algebra.
– Reports average megaflop rates by dividing the total number of floating-point operations by time
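The rate calculation itself is a single division. The sketch below uses the operation count conventionally credited for solving a dense n×n linear system, 2n³/3 + 2n², which is the figure normally used with LINPACK. The matrix size and run time are hypothetical, and the function names are assumptions.

```python
def linpack_flop_count(n):
    """Floating-point operations conventionally credited for an n x n solve."""
    return (2.0 * n ** 3) / 3.0 + 2.0 * n ** 2


def mflops(flop_count, seconds):
    """Average rate: total floating-point operations divided by time, in millions."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return flop_count / seconds / 1.0e6


# Hypothetical run: a 1000 x 1000 system solved in 2.0 seconds.
n, elapsed = 1000, 2.0
print(round(mflops(linpack_flop_count(n), elapsed), 1), "MFLOPS")
```

Because this is an average over the whole run, it says nothing about how a different application will behave on the same machine.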
Summary
• HPC is parallel computing.
• HPC involves a broad spectrum of components,
and is only as fast as the weakest component,
whether that be processor, memory, network
interconnect, compiler, or software.
• HPC exists on a broad range of computer systems,
from departmental clusters of desktop workstations to large parallel processing systems.