
High Performance Computing

Trey Breckenridge

Computing Systems Manager, Engineering Research Center, Mississippi State University

(2)

What is High Performance Computing?

• HPC is ill defined and context dependent.

• In the late 1980s, the US Government defined supercomputers as processors capable of more than 100 MFlops. This definition is clearly obsolete, as modern desktop PCs are capable of ~5 GFlops.

• Another approach is to describe HPC as the fastest computers at any point in time; however, that is more of a budget-dependent definition.

• For the purposes of this presentation, we will define HPC as:

Computing resources which provide at least an order of magnitude more computing power than is normally

(3)

What does the definition really mean?

• That definition sounds like HPC is hardware only. Isn’t the software important too?

• HPC encompasses the full range of supercomputing activities, including existing supercomputer systems, special-purpose and experimental systems, and the new generation of large-scale parallel architectures.

• HPC exists on a broad range of computer systems, from departmental clusters of desktop workstations to large parallel processing systems.

(4)

Why High Performance Computing?

• To achieve the maximum number of computations in a minimum amount of time – SPEED!

• To solve problems that could not be solved without large computer systems.

• Traditionally, HPC has been used in scientific and engineering fields for work with massively complex simulations.

• Computations are typically “floating point” intensive.

(5)

Areas of HPC Use

• Traditional:

– Computational Fluid Dynamics (CFD)

– Climate, Weather, and Ocean Modeling and Simulation (CWO)

– Nuclear Modeling and Simulation

– Geophysical/Petroleum Modeling

• Emerging:

– Computer Graphics/Scientific Visualization

– Financial Modeling

– Database Applications

– Bioinformatics

(6)

Parallel Computing

• A collection of processing elements that can communicate and cooperate to solve large problems more quickly than a single processing element.

• Simultaneous use of multiple processors to execute different parts of a program.

• Goal: To reduce the wall-clock time of a run

• No single processor is ever again likely to match the performance of existing parallel HPC systems:

(7)

Type of Parallelism

• Overt

– Parallelism is visible to the programmer

– May be difficult to program (correctly)

– Large improvements in performance

• Covert

– Parallelism is not visible to the programmer

– Compiler responsible for parallelism

– Easy to do

(8)

Speed Up

• Speed Up is one quantitative measure of the benefit of parallelism

• Speed Up is defined as S / T(N), where

– S = best serial time

– T(N) = time required for N processors

• Since S/N is the best possible parallel time, speedup typically should not exceed N

• S is sometimes difficult to measure, causing many people to substitute T(1) for S
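
As a worked illustration of the definition above, here is a minimal Python sketch that computes speed up from two measured times. The function name and the timing figures are hypothetical, chosen only for illustration.

```python
def speedup(best_serial_time, parallel_time):
    """Speed up = S / T(N), with both times in the same units."""
    if best_serial_time <= 0 or parallel_time <= 0:
        raise ValueError("times must be positive")
    return best_serial_time / parallel_time


# Hypothetical measurements: 120 s serially, 20 s on N = 8 processors.
S = 120.0
T_8 = 20.0
print(speedup(S, T_8))  # 6.0, below the ideal bound of N = 8
```

If T(1) is substituted for S, the same function applies; the result is then a speed up relative to the parallel code run on one processor, which may differ from the speed up over the best serial code.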

(9)
(10)

Efficiency

• Speed up does not measure how efficiently the processors are being used

– Is it worth using 100 processors to get a speed up of 2?

• Efficiency is defined as the ratio of the speed up to the number of processors required to achieve it

– The best efficiency is 1
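
The question posed above can be answered numerically. The following is a small Python sketch, with a hypothetical function name and made-up timings, that computes efficiency as speed up divided by the processor count.

```python
def efficiency(best_serial_time, parallel_time, num_processors):
    """Efficiency = speed up / N = (S / T(N)) / N."""
    if num_processors < 1:
        raise ValueError("need at least one processor")
    if best_serial_time <= 0 or parallel_time <= 0:
        raise ValueError("times must be positive")
    return (best_serial_time / parallel_time) / num_processors


# A speed up of 2 (200 s down to 100 s) obtained with 100 processors.
print(efficiency(200.0, 100.0, 100))  # 0.02
```

An efficiency of 0.02 means that, on average, each processor contributes only 2% of what it would under ideal scaling.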

(11)
(12)

Processors

• Vector

– Large rows of data are operated on simultaneously

• Scalar

– Data is operated on in a sequential fashion

– Instruction sets

  • Complex Instruction Set Computer (CISC)

  • Reduced Instruction Set Computer (RISC)

  • Post-RISC or CISC/RISC

    – UltraSPARC

    – IBM Power4

    – IA64

(13)

Scalar vs. Vector Arithmetic

      DO 10 i = 1, n
         a(i) = b(i) + c(i)
   10 CONTINUE

Scalar:

      a(1) = b(1) + c(1)
      a(2) = b(2) + c(2)
      …
      a(n) = b(n) + c(n)

Vector:

      a = b + c
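
The contrast above can be modelled roughly in Python by counting how many "instructions" each style issues. This is only a sketch of the idea: the function names are hypothetical, and the vector length of 64 is an assumption made for illustration, not a figure from the slides.

```python
def scalar_add(b, c):
    """Scalar model: one add instruction per element."""
    a = []
    instructions = 0
    for i in range(len(b)):
        a.append(b[i] + c[i])
        instructions += 1
    return a, instructions


def vector_add(b, c, vector_length=64):
    """Vector model: one instruction adds up to vector_length elements."""
    a = []
    instructions = 0
    for start in range(0, len(b), vector_length):
        stop = start + vector_length
        # One modelled "vector instruction" handles the whole strip.
        a.extend(x + y for x, y in zip(b[start:stop], c[start:stop]))
        instructions += 1
    return a, instructions


b = list(range(1000))
c = [2 * x for x in b]
a_scalar, n_scalar = scalar_add(b, c)
a_vector, n_vector = vector_add(b, c)
print(a_scalar == a_vector)  # True
print(n_scalar, n_vector)    # 1000 16
```

Both versions produce the same array; the difference is in how many operations are issued to get there.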

(14)

Where is Scalar better?

• If the vector length is small

• If the loop contains IF statements

• If partial vectorization involves large overhead

• If recursion is used
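
One way to read the last point is as a recurrence, where each element depends on the one computed just before it. That reading is an assumption, not something the slide spells out. The Python sketch below, with hypothetical function names, shows why such a loop cannot simply be evaluated on whole arrays at once.

```python
def recurrence_scalar(a, b):
    """a[i] = a[i-1] + b[i], evaluated in order, as a scalar loop does."""
    result = list(a)
    for i in range(1, len(result)):
        result[i] = result[i - 1] + b[i]
    return result


def recurrence_naive_vector(a, b):
    """The same statement applied to whole arrays at once.

    Every element reads the OLD value of a[i-1], as a naive vector
    operation would, so later elements miss the updated values.
    """
    result = list(a)
    for i in range(1, len(a)):
        result[i] = a[i - 1] + b[i]
    return result


a = [1, 0, 0, 0]
b = [0, 1, 1, 1]
print(recurrence_scalar(a, b))        # [1, 2, 3, 4]
print(recurrence_naive_vector(a, b))  # [1, 2, 1, 1]
```

The two results differ from the third element onward, which is the dependence that keeps this loop scalar.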

(15)

Architectural Classifications

Flynn’s Taxonomy

• Published by Flynn in 1972

– Outdated, but still widely used

• Categorizes machines by instruction streams and data streams

– A stream of instructions (the algorithm) tells the computer what to do.

– A stream of data (the input) is affected by these instructions.

• Four Categories

– SISD – Single Instruction, Single Data

– MISD – Multiple Instruction, Single Data

– SIMD – Single Instruction, Multiple Data

– MIMD – Multiple Instruction, Multiple Data
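
The four categories follow mechanically from the two stream counts. A minimal Python sketch of that mapping, with a hypothetical function name, might look like this:

```python
def flynn_category(instruction_streams, data_streams):
    """Return the Flynn category name for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


print(flynn_category(1, 1))     # SISD
print(flynn_category(1, 1024))  # SIMD
print(flynn_category(16, 1))    # MISD
print(flynn_category(16, 16))   # MIMD
```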

(16)

SISD

Single Instruction, Single Data

• Conventional single processor computers

• Each arithmetic instruction initiates an operation on a data item taken from a single stream of data elements.

• Historical supercomputers and most contemporary microprocessors are SISD

(17)

SIMD

Single Instruction, Multiple Data

• Many, simple processing elements – 1000s

• Each processor has its own local memory

• Each processor runs the same program

• Each processor processes different data streams

• All processors work in lock-step (synchronously)

• Very efficient for array/matrix operations

• Most older vector/array computers are SIMD

• Example machines:

– Cray YMP

(18)

MISD

Multiple Instruction, Single Data

• Very few machines fit this category

• None have been commercially successful or have had any impact on computational science

(19)

MIMD

Multiple Instruction, Multiple Data

• Most diverse of the four classifications

• Multiple processors

• Each processor either has its own memory or accesses shared memory

• Each processor can run the same or different programs

• Each processor processes different data streams

• Processors can work synchronously or asynchronously

(20)

MIMD (cont.)

• Processors can be either tightly or loosely coupled

• Examples include:

– Processors and memory units specifically designed to be components of a parallel architecture (e.g., Intel Paragon)

– Large scale parallel machines built from “off the shelf” workstations (e.g., Beowulf Cluster)

– Small scale multiprocessors made by connecting multiple vector processors together (e.g., Cray T90)

– Wide variety of other designs as well

(21)

SPMD Computing

• Not a Flynn category, per se, but instead a combination of categories.

• SPMD stands for single program, multiple data

– The same program is run on the processors of an MIMD machine.

– Occasionally the processors may synchronize.

– Because an entire program is executed on separate data, it is possible that different branches are taken, leading to asynchronous parallelism

• SPMD came about from a desire to do SIMD-like calculations on MIMD machines

– SPMD is not a hardware paradigm, but instead the software equivalent of SIMD
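
The SPMD pattern can be sketched in Python with threads standing in for the processors of an MIMD machine. This is an illustration under assumptions, not how a production code would be written: the names `spmd_program` and `run_spmd`, the data, and the choice of branch condition are all hypothetical.

```python
import threading


def spmd_program(rank, size, data, partial, barrier, result):
    """The single program that every worker runs on its own data."""
    chunk = data[rank::size]            # this worker's share of the data
    # Data-dependent branch: different workers may take different paths.
    if sum(chunk) % 2 == 0:
        partial[rank] = sum(chunk)
    else:
        partial[rank] = sum(x * x for x in chunk)
    barrier.wait()                      # occasional synchronization point
    if rank == 0:                       # one worker combines the results
        result.append(sum(partial))


def run_spmd(data, size):
    partial = [0] * size
    result = []
    barrier = threading.Barrier(size)
    workers = [
        threading.Thread(target=spmd_program,
                         args=(r, size, data, partial, barrier, result))
        for r in range(size)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return partial, result[0]


print(run_spmd(list(range(9)), size=3))  # ([45, 12, 93], 150)
```

Workers 0 and 2 take the second branch while worker 1 takes the first, which is the asynchronous behaviour described above; the barrier is the one point where they all wait for each other.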

(22)

Memory Classifications

Organization

• Shared Memory (SM-MIMD)

– Bus based

– Interconnection network

• Distributed Memory (DM-MIMD)

– Local

– Message passing

• Virtual shared memory (VSM-MIMD)

– Physically distributed, but appears as one image

Access

• Uniform Memory Access (UMA)

(23)

Memory Organization

Shared Memory

• One common memory block between all processors

Bus Based

• Since the bus has limited bandwidth, the number of processors that can be used is limited to a few tens

• Examples include typical multiprocessor PCs, SGI

(24)

Memory Organization

Switch based

• Utilizes a (complex) interconnection network to connect processors to shared memory modules

• May use multi-stage networks - NUMA

• Increases bandwidth to memory over bus based systems

• Every processor still has access to global memory

(25)

Memory Organization

Distributed Memory

• Message Passing. Memory physically distributed through the machine. Each processor has private memory.

• Contents of private memory can only be accessed by that processor. If required by another processor, then it must be sent explicitly.

• In general, machines can be scaled to thousands of processors.

• Requires special programming techniques.

(26)

Memory Organization

Virtual Shared Memory

• Objective is to have the scalability of distributed memory with the programmability of shared memory

• Global address space mapped onto physically distributed memory

• Data moves between processors “on demand” or as it is accessed
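
The mapping of a global address space onto physically distributed memory can be sketched as follows. This Python example assumes a simple block layout (each node owns one contiguous block); the class, the function name, and the sizes are hypothetical, and real virtual shared memory systems are considerably more involved.

```python
def locate(global_index, num_nodes, block_size):
    """Map a global address to (node, local offset) under a block layout."""
    if not 0 <= global_index < num_nodes * block_size:
        raise IndexError("address outside the global address space")
    return divmod(global_index, block_size)


class VirtualSharedMemory:
    """One global address space backed by per-node private lists."""

    def __init__(self, num_nodes, block_size):
        self.num_nodes = num_nodes
        self.block_size = block_size
        self.node_memory = [[0] * block_size for _ in range(num_nodes)]

    def write(self, global_index, value):
        node, offset = locate(global_index, self.num_nodes, self.block_size)
        self.node_memory[node][offset] = value

    def read(self, global_index):
        node, offset = locate(global_index, self.num_nodes, self.block_size)
        return self.node_memory[node][offset]


vsm = VirtualSharedMemory(num_nodes=4, block_size=256)
vsm.write(700, 42)
print(locate(700, 4, 256))  # (2, 188)
print(vsm.read(700))        # 42
```

The caller uses a single index, as with shared memory, while the value actually lives in the private memory of node 2.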

(27)

Compute Clusters

• Connecting multiple standalone machines via a network interconnect, utilizing software to access the combined systems as one computer

• The standalone machines could be inexpensive single processor workstations or multi-million dollar multiprocessor servers

• Individual machines can be connected via numerous networking technologies using a variety of topologies.

– 100BaseT Ethernet – inexpensive, low performance, high latency

– Myrinet (2 Gb/s) – expensive, high performance, low latency

– Proprietary high speed network

• Nearly 20% of the fastest 500 supercomputers in the world are clusters.

(28)

Beowulf Clusters

• First developed in 1994 at NASA Goddard

• Goal is to build a supercomputer utilizing a large number of inexpensive, commodity off-the-shelf (COTS) parts.

• Increasingly used for HPC applications due to the high cost of MPPs and the wide availability of networked workstations.

• Not a panacea for HPC. Many applications require shared memory or vector solutions.

• Existing Beowulf clusters range from 2 to 4000 processors, and are likely to reach 10000 processors in the near future.

(29)

Metacomputing

• Metacomputing is a dynamic environment that has some informal pool of nodes that can join or leave the environment whenever they desire.

– SETI@HOME

• Why do we need metacomputing?

– Our computational needs are infinite

– Our financial needs are finite

• Someday we will utilize computing cycles just like we utilize electricity from the power company.

– Enables us to “buy” cycles on an as needed basis.

• Commonly referred to as “The Grid” or “Computational Grids”

(30)

Job Execution

• Most HPC systems do not allow interactive access.

• Batch-style jobs are submitted to the system via a queuing mechanism.

• Schedulers determine the order in which jobs should be run. Factors include

– User priority

– Resource availability

• The goal of the Scheduler is to maximize system utilization.

• Scheduler optimization is an important component and is a field of study of its own.
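
The two factors listed above, user priority and resource availability, can be combined in a very small scheduling pass. The Python sketch below is a simplified illustration only: the function name, the job tuples, and the greedy policy are assumptions, and real batch schedulers weigh many more factors.

```python
import heapq


def schedule(jobs, free_processors):
    """Pick the jobs to start now.

    jobs: list of (name, priority, processors_needed); a lower priority
    number means a more important job.
    Returns (names of started jobs, processors left idle).
    """
    # The index i keeps submission order as the tie-breaker.
    queue = [(priority, i, name, need)
             for i, (name, priority, need) in enumerate(jobs)]
    heapq.heapify(queue)
    started = []
    while queue:
        _priority, _i, name, need = heapq.heappop(queue)
        if need <= free_processors:      # resource availability check
            started.append(name)
            free_processors -= need
    return started, free_processors


jobs = [("cfd", 2, 64), ("weather", 1, 96), ("viz", 3, 16), ("bio", 2, 32)]
print(schedule(jobs, 128))  # (['weather', 'bio'], 0)
```

Here the highest-priority job starts first, the 64-processor job no longer fits, and a smaller job of the same priority fills the remaining processors, so nothing sits idle.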

(31)
(32)

Programming Languages

• It has been said, “I don’t know what language they will be using to program high performance computers 10 years from now, but we do know it will be called FORTRAN.”

• C and C++ are making strides in the HPC community due to their ability to create complex data structures and better I/O routines.

• FORTRAN 90 incorporated many of the features of C (e.g., pointers).

• High Performance Fortran (HPF) is FORTRAN 90 with directive-based extensions allowing for shared and distributed memory machines – clusters, traditional supercomputers, and massively parallel processors

• Today, many programmers prefer to do their data structure, communications, etc. in C, while doing the computations in FORTRAN.

(33)

Compilers

• Compilers are an often overlooked area of HPC, but are of critical importance.

• Application run times are directly related to the ability of the compiler to produce highly optimized code.

– Poor compiler optimization could result in run times increasing by an order of magnitude.

• Optimization Levels

– None, Basic, Interprocedural analysis, Runtime profile analysis, Floating-point, Data flow analysis, Advanced

(34)

Distributed Memory Parallel Programming

• Message passing is a programming paradigm where one effectively writes multiple programs for parallel execution.

• The problem must be decomposed, typically by domain or function

• Each process knows only about its own local data. If data is required from a different process, it must send a message to that process asking for the data

• Access to remote data is much slower than to local data, so a major objective is to minimize remote communications.
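
Domain decomposition with explicit messages can be sketched in Python using threads as stand-ins for processes and queues as their mailboxes. This is not the API of any message passing library; the function names, the three-point sum, and the zero boundary values are assumptions made for the example.

```python
import queue
import threading


def worker(rank, size, local, inbox, result):
    """Sum each element with its two neighbours; edges arrive by message."""
    left, right = rank - 1, rank + 1
    # Send boundary values to the neighbours that need them.
    if left >= 0:
        inbox[left].put(("from_right", local[0]))
    if right < size:
        inbox[right].put(("from_left", local[-1]))
    # Receive halo values; outside the global domain, use 0.0.
    halo = {"from_left": 0.0, "from_right": 0.0}
    expected = (left >= 0) + (right < size)
    for _ in range(expected):
        tag, value = inbox[rank].get()
        halo[tag] = value
    padded = [halo["from_left"]] + local + [halo["from_right"]]
    result[rank] = [padded[i - 1] + padded[i] + padded[i + 1]
                    for i in range(1, len(padded) - 1)]


def run(domain, size):
    n = len(domain) // size  # assumes len(domain) is divisible by size
    inbox = [queue.Queue() for _ in range(size)]
    result = [None] * size
    threads = [
        threading.Thread(target=worker,
                         args=(r, size, domain[r * n:(r + 1) * n],
                               inbox, result))
        for r in range(size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [x for part in result for x in part]


print(run([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], size=3))
# [3.0, 6.0, 9.0, 12.0, 15.0, 11.0]
```

Each worker touches only its own slice and exchanges just one value per neighbour, which reflects the goal of keeping remote communication to a minimum.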

(35)

Message Passing Environments

• PVM – Parallel Virtual Machine

– Portable and operable across heterogeneous computers

– Performance sacrificed for flexibility

– Well defined protocol allows for interoperability between different implementations

• MPI – Message Passing Interface

– Today’s standard for message passing

– Widely adopted by most vendors

– Portable and operable across heterogeneous computers

– Good performance with reasonable efficiency

(36)

Shared Memory Parallel Programming

• Every processor has direct access to the memory of every other processor in the system

• Not widely used at the programmer level, but widely used at the system level (even on single-processor systems via multithreading)

• Allows low-latency, high-bandwidth communications

• Portability is poor

• Easy to program (compared to message passing)

• Directive controlled parallelism
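
The shared memory model can be illustrated with Python threads, which all see the same variables directly. This sketch shows the programming model only: the function name is hypothetical, and in standard CPython builds threads do not execute Python bytecode in parallel, so no speed up should be expected from it.

```python
import threading


def shared_sum(values, num_threads=4):
    """Sum values with threads that all update one shared total."""
    total = [0]                 # shared memory: visible to every thread
    lock = threading.Lock()     # protects updates to the shared total

    def work(rank):
        # Each thread reads its share of the shared input directly;
        # no messages are needed to obtain the data.
        partial = sum(values[rank::num_threads])
        with lock:              # without the lock, updates could be lost
            total[0] += partial

    threads = [threading.Thread(target=work, args=(r,))
               for r in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


print(shared_sum(list(range(101))))  # 5050
```

Compared with message passing, no data is sent explicitly; the cost is that concurrent updates to shared data have to be coordinated.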

(37)

Shared Memory Environments

• POSIX Threads (Pthreads)

• SHMEM

• OpenMP

– Quickly becoming the standard API for shared memory programming

– Emphasis on performance and scalability

– Allows for fine-grain or coarse-grain parallelism

– Some implementations are interoperable with MPI and PVM

(38)

Benchmarking

• Benchmarking is an important aspect of HPC and is used for purchase decisions, system configuration, and application tuning.

• Rule 1: All vendors lie about their benchmarks!!

• Purchase decisions should not be based on published benchmark results. If at all possible, run your code on the exact machine you are considering for purchase.

• LINPACK

– Mother of all benchmarks

– Not originally designed to be a benchmark, but instead a set of high performance library routines for linear algebra.

– Reports average megaflop rates by dividing the total number of floating-point operations by time
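
The rate calculation described in the last point is a single division. The Python sketch below assumes the operation count conventionally credited to LINPACK for solving an n x n linear system, 2n^3/3 + 2n^2; that count is not given in the slides, and the function names and timing are hypothetical.

```python
def linpack_flops(n):
    """Operation count conventionally credited for an n x n solve."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2


def mflops(n, seconds):
    """Average rate: total floating-point operations / time, in Mflop/s."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return linpack_flops(n) / seconds / 1.0e6


# Hypothetical run: a 1000 x 1000 system solved in 0.5 s.
print(round(mflops(1000, 0.5), 1))  # 1337.3
```

Because the figure is an average over the whole run, it says nothing about how a different code, with a different mix of operations, will behave on the same machine.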

(39)
(40)

Summary

• HPC is parallel computing.

• HPC involves a broad spectrum of components, and is only as fast as the weakest component, whether that be processor, memory, network interconnect, compiler, or software.

• HPC exists on a broad range of computer systems, from departmental clusters of desktop workstations to large parallel processing systems.

(41)

Additional Information

• Dowd, Kevin and Severance, Charles. High Performance Computing, Second Edition. O’Reilly & Associates, Inc., 1998.

• Dongarra, Jack. High Performance Computing: Technology, Methods and Applications. Elsevier, 1995.

• Buyya, Rajkumar. High Performance Cluster Computing, Volume 1. Prentice Hall PTR, 1999.

• Foster, Ian and Kesselman, Carl. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann
