High Performance Computing

Course Notes 2007-2008

HPC Fundamentals

Introduction

What is High Performance Computing (HPC)?

Difficult to define - it’s a moving target.

• Late 1980s: a supercomputer performed about 100 MFLOPS (100 million FLOPS)
• Today: a 2 GHz desktop/laptop performs a few GFLOPS
• Today: a supercomputer performs tens of TFLOPS (Top500)
• High performance: O(1000) times more powerful than the latest desktops

Most supercomputers are obsolete in terms of performance before the end of their physical life.

Applications of HPC

HPC is driven by the demands of computation-intensive applications from various areas:

• Medicine, biology, neuroscience (e.g. simulation of brains)
• Finance (e.g. modelling the world economy)
• Military and defence (e.g. modelling the explosion of nuclear weapons)
• Engineering (e.g. simulation of a car crash or a new airplane design)

An Example of Demands in Computing Capability

Project: Blue Brain

Aim: construct a simulated brain

The building blocks of a brain are neocortical columns

A column consists of about 60,000 neurons

The human brain contains millions of such columns

First stage: simulate a single column (each processor acting as one or two neurons)

Then: simulate a small network of columns

Ultimate goal: simulate the whole human brain

IBM contributes a Blue Gene supercomputer
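A back-of-the-envelope estimate, using only the figures quoted above, suggests how many processors the first stage implies. This is a rough sketch rather than a description of the actual Blue Brain configuration; the function name is illustrative.

```python
import math

NEURONS_PER_COLUMN = 60_000  # figure quoted in the notes


def processors_needed(neurons: int, neurons_per_processor: int) -> int:
    """Processors required when each one simulates a fixed number of neurons."""
    return math.ceil(neurons / neurons_per_processor)


if __name__ == "__main__":
    # The notes say each processor acts as one or two neurons.
    for per_processor in (1, 2):
        count = processors_needed(NEURONS_PER_COLUMN, per_processor)
        print(f"{per_processor} neuron(s) per processor: {count} processors")
```

Under these assumptions a single column already needs tens of thousands of processors, which is why the later stages are so demanding.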

Related Technologies

HPC covers a wide range of technologies:

Computer architecture

• CPU, memory, VLSI

Compilers

• Identify inefficient implementations
• Make use of the characteristics of the computer architecture
• Choose a suitable compiler for a given architecture

Algorithms (for parallel and distributed systems)

• How to program on parallel and distributed systems

Middleware

• From Grid computing technology
• Application -> middleware -> operating system
• Resource discovery and sharing

History of High Performance Computing

1960s: Scalar processor

• Processes one data item at a time

1970s: Vector processor

• Can process an array of data items in one go
• Architecture
• Overhead
• Difference between vector processor and scalar processor

Late 1980s: Massively Parallel Processing (MPP)

• Up to thousands of processors, each with its own memory and OS
• Breaks a problem down into smaller parts
• Difference between MPP and vector processor

Late 1990s: Cluster

• Not a new term in itself, but one that attracted renewed interest
• Connects stand-alone computers with a high-speed network
• Difference between cluster and MPP

Late 1990s: Grid

• Tackles collaboration among geographically distributed organisations
• Draws an analogy with the power grid
• Difference between Grid and cluster

Parallel computing vs. distributed computing

Parallel Computing

• Break the problem to be computed into parts that can run simultaneously on different processors
• Example: an MPI program that performs matrix multiplication
• Solves tightly coupled problems
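The idea of breaking matrix multiplication into parts can be sketched as a row-block decomposition. This is a minimal illustration, not an MPI program: a thread pool stands in for the processors, and in CPython the threads do not execute the arithmetic truly simultaneously, so the sketch shows the decomposition rather than a speed-up. The function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a_rows, b):
    """Multiply a block of rows of A by the whole of B."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in a_rows]


def parallel_matmul(a, b, workers=2):
    """Split A into row blocks, compute each block's product, reassemble."""
    block = max(1, -(-len(a) // workers))  # ceiling division
    blocks = [a[i:i + block] for i in range(0, len(a), block)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker needs its own row block plus all of B.
        parts = pool.map(multiply_rows, blocks, [b] * len(blocks))
        return [row for part in parts for row in part]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(parallel_matmul(a, b))
```

Every worker needs the whole of B, so a real distributed-memory version has to communicate a lot of data relative to the work done. That is one reason such problems are described as tightly coupled.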

Distributed Computing

• Parts of the work are computed in different places (note: this does not necessarily imply simultaneous processing)
• Example: the client/server (C/S) model
• Solves loosely coupled problems (not much communication)
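The client/server model can be sketched with a single request and reply over a local TCP socket. This is a minimal illustration under stated assumptions: both ends run on one machine via the loopback address, the "service" (squaring an integer) is made up for the example, and all function names are hypothetical.

```python
import socket
import threading


def serve_once(server_sock):
    """Server side: accept one client, square the integer it sends, reply."""
    conn, _ = server_sock.accept()
    with conn:
        number = int(conn.recv(64).decode())
        conn.sendall(str(number * number).encode())


def request_square(port, number):
    """Client side: send an integer and wait for the server's answer."""
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(str(number).encode())
        return int(client.recv(64).decode())


def demo(number=12):
    """Start a one-shot server in a thread and issue one client request."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server_sock.listen(1)
        port = server_sock.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(server_sock,))
        worker.start()
        result = request_square(port, number)
        worker.join()
    return result


if __name__ == "__main__":
    print(demo(12))
```

The client simply waits while the server works, which illustrates the note above: the work is done in a different place, but not necessarily at the same time.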

Architecture Types

SMP (Symmetric Multi-Processing)

• Multiple CPUs, single memory, shared I/O
• All resources in an SMP machine are equally available to each CPU
• Does not scale well to a large number of processors (typically fewer than 8)
• Scalability measures how close the improvement in system performance is to linear in the number of processing elements

NUMA (Non-Uniform Memory Access)

• Multiple CPUs
• Each CPU has fast access to its local area of memory, but slower access to other areas
• Scales well to a large number of processors
• Complicated memory access pattern and system bus

MPP (Massively Parallel Processing)

Cluster
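The notion of scalability used in the SMP entry above is commonly quantified with speedup and parallel efficiency. The sketch below uses the standard definitions (speedup = serial time / parallel time, efficiency = speedup / number of processing elements); the timings in the example are hypothetical, not measurements.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_pe: int) -> float:
    """Fraction of ideal (linear) speedup achieved on n_pe processing elements."""
    return speedup(t_serial, t_parallel) / n_pe


if __name__ == "__main__":
    # Hypothetical timings: 100 s on one PE, 25 s on eight PEs.
    print(speedup(100.0, 25.0))        # 4x faster
    print(efficiency(100.0, 25.0, 8))  # half of the ideal linear speedup
```

A system scales well when efficiency stays close to 1 as processing elements are added, and poorly when it falls away.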

Illustration for Architecture Types

Shared memory (uniform memory access - SMP)

Processors share access to a common memory space.

• Implemented over a shared memory bus or communication network.

Support for critical sections is required.

Local cache is critical:

• Without it, bus contention (or network traffic) reduces the system’s efficiency.
• For this reason, pure shared memory systems do not scale naturally. Caches in turn introduce problems of coherency (ensuring that stale cache lines are invalidated when other processors alter shared memory).

[Figure: PE 0 ... PE n connected through an interconnect to a single shared memory]
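The need for critical sections on a shared memory machine can be sketched with threads that update one shared counter. This is a minimal illustration in which Python threads stand in for processors and a lock guards the critical section; the function name and parameters are illustrative.

```python
import threading


def count_with_lock(n_threads: int = 4, increments: int = 10_000) -> int:
    """Increment one shared counter from several threads, safely."""
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(increments):
            with lock:  # critical section: one thread at a time
                counter += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter


if __name__ == "__main__":
    print(count_with_lock())
```

The update `counter += 1` is a read-modify-write sequence. Without the lock, two threads could read the same old value and one increment would be lost.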

Illustration for Architecture Types

Shared memory (non-uniform memory access: NUMA)

A PE may be fetching from local or remote memory - hence non-uniform access times.

• NUMA
• cc-NUMA (cache-coherent Non-Uniform Memory Access)

Groups of processors are connected together by a fast interconnect (SMP).

These groups are then connected together by a high-speed interconnect.

Global address space.

[Figure: m groups, each with n PEs (PE 1 ... PE n up to PE (m-1)n+1 ... PE m.n) and its own shared memory (1 ... m), joined by an interconnect]
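The effect of non-uniform access times can be illustrated with a simple weighted-average model. This is a sketch under stated assumptions: it ignores caches and contention, and the latency figures in the example are hypothetical.

```python
def mean_access_time(local_fraction: float, t_local: float, t_remote: float) -> float:
    """Average memory access time when a given fraction of accesses is local."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * t_local + (1.0 - local_fraction) * t_remote


if __name__ == "__main__":
    # Hypothetical latencies: 100 ns local, 300 ns remote.
    for fraction in (1.0, 0.5, 0.0):
        print(fraction, mean_access_time(fraction, 100.0, 300.0))
```

The model shows why data placement matters on NUMA machines: the more accesses a PE can satisfy from its local memory, the closer the average gets to the local latency.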

Illustration for Architecture Types

Distributed memory (MPP, cluster)

Each processor has its own local memory.

When processors need to exchange (or share) data, they must do so through explicit communication:

• Message passing (e.g. MPI)

Typically larger latencies between PEs (especially if they communicate over network interconnections).

Scalability is good if the problems can be sufficiently contained within PEs.

[Figure: PE 0 ... PE n, each with its own local memory M 0 ... M n, connected by an interconnect]
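Explicit communication between PEs can be sketched as follows. This is not MPI: threads stand in for PEs and a queue stands in for the interconnect, purely to show that each PE works only on its own data and that results move only as messages. All names are illustrative.

```python
import queue
import threading


def pe_worker(local_data, outbox):
    """A PE sums the data in its own memory and sends the result as a message."""
    outbox.put(sum(local_data))


def distributed_sum(chunks):
    """Give each PE one chunk, then gather one message per PE at the root."""
    inbox = queue.Queue()
    pes = [threading.Thread(target=pe_worker, args=(chunk, inbox))
           for chunk in chunks]
    for pe in pes:
        pe.start()
    total = sum(inbox.get() for _ in pes)  # explicit receive, one per PE
    for pe in pes:
        pe.join()
    return total


if __name__ == "__main__":
    print(distributed_sum([[1, 2, 3], [4, 5], [6]]))
```

Each PE sends a single small message after doing all its work locally, which is the "sufficiently contained within PEs" case where distributed memory scales well.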
