Modellierung, Simulation, Optimierung Computer Architektur

(1)


Prof. Michael Resch

Dr. Martin Bernreuther, Dr. Natalia Currle-Linde, Dr. Martin Hecht, Uwe Küster, Dr. Oliver Mangold, Melanie Mochmann, Christoph Niethammer, Ralf Schneider HLRS, IHR

28 June 2012

(2)

Outline

A simple computer

Computer and programs

Modern computers in detail

principal parallel architectures

examples of large machines

parameters describing hardware

Types of Parallelism

performance modeling


(3)

A simple computer

(4)

von Neumann Machine (!)

I 4 units

Arithmetic Logical Unit

Control Unit

Memory for instructions and data

Input and Output Unit

I the instructions are executed in order by incrementing an instruction pointer

I branches to make jumps in the instruction sequence by changing the instruction pointer

I no parallelism

I memory is linearly addressed

I typical for computer architecture

I http://de.wikipedia.org/wiki/John_von_Neumann

I http://de.wikipedia.org/wiki/Von-Neumann-Architektur

(5)

design of a simple computer (!)

I processor does all operations

I memory is a fast but volatile storage

I bus connecting processor, memory and IO

I Input and output devices:

disk storing data permanently

keyboard

display

network interface for communication

[Figure: processor and memory connected via a bus and an interface to the devices: disk, network, display, keyboard, mouse]

(6)

Computer and programs

(7)

executing a program on a computer (!)

I all the work a computer can do is specified by executable programs

I a program consists of a long sequence of instructions combined with data

I the processor or core loads an instruction from memory

I the instruction is analysed and executed by the hardware

I the instruction may contain

data,

memory addresses,

and the specification of an operation, i.e. what to do with the data

I the instruction pointer

points to the current instruction to be loaded and executed

after issuing the instruction it is incremented and points to the next instruction;

in this way the machine operates on a sequence of instructions

it can be changed by special branch and jump instructions to point to a different location in the instruction stream;

in this way loops and alternatives are formulated

I the same instruction sequence may be executed on different data

I typical instruction classes are

load, store

integer add, multiply
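The role of the instruction pointer can be illustrated with a very small interpreter. This is only a sketch: the five opcodes, the register count and the class name TinyMachine are hypothetical and chosen for illustration; they do not correspond to any real instruction set.

```java
/** Minimal sketch of a fetch/execute loop; the instruction set is hypothetical. */
final class TinyMachine {
    // Hypothetical opcodes for this sketch only.
    static final int LOADI = 0;  // LOADI r, value : reg[r] = value
    static final int ADD   = 1;  // ADD r, s       : reg[r] = reg[r] + reg[s]
    static final int ADDI  = 2;  // ADDI r, value  : reg[r] = reg[r] + value
    static final int JNZ   = 3;  // JNZ r, target  : if reg[r] != 0, jump to target
    static final int HALT  = 4;  // HALT           : stop the machine

    /** Executes the program and returns the register contents after HALT. */
    static int[] run(int[][] program) {
        int[] reg = new int[4];
        int ip = 0;                        // instruction pointer
        while (true) {
            int[] instr = program[ip];     // fetch the current instruction
            ip++;                          // by default continue with the next one
            switch (instr[0]) {            // decode and execute
                case LOADI: reg[instr[1]] = instr[2]; break;
                case ADD:   reg[instr[1]] += reg[instr[2]]; break;
                case ADDI:  reg[instr[1]] += instr[2]; break;
                case JNZ:   if (reg[instr[1]] != 0) ip = instr[2]; break;
                case HALT:  return reg;
                default:    throw new IllegalStateException("unknown opcode " + instr[0]);
            }
        }
    }

    /** Program that sums n + (n-1) + ... + 1 into register 0 (requires n >= 1). */
    static int[][] sumProgram(int n) {
        return new int[][] {
            {LOADI, 0, 0},   // 0: sum = 0
            {LOADI, 1, n},   // 1: counter = n
            {ADD,   0, 1},   // 2: sum = sum + counter
            {ADDI,  1, -1},  // 3: counter = counter - 1
            {JNZ,   1, 2},   // 4: branch back to 2 while counter != 0
            {HALT,  0, 0}    // 5: done
        };
    }

    public static void main(String[] args) {
        System.out.println(run(sumProgram(4))[0]);
    }
}
```

The loop is formed only by the JNZ instruction overwriting the instruction pointer; all other instructions rely on the default increment.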

(8)

purpose of the operating system (!)

I the operating system is a collection of executable programs for

the administration of all devices such as disk, keyboard, display, mouse, hardware interfaces, ...

scheduling and starting executable programs, assigning resources to them, finishing them

handling data, files and the file system

connecting to the outer world, e.g. other computing nodes, the internet

hiding the complexity of the machine from the user

I examples: Microsoft Windows, Unix/Linux, NEC SuperUX, Android, ...

(9)

from the source code to the start of the executable (!)

I the source code of a program is written in a high level programming language because

machine code is too complicated to read, understand and handle

machine code is architecture dependent

I the source code is translated to the executable program by the following steps

compilation of the source files by the compiler generates objects in machine code; references between objects are still unresolved

the linker binds the objects into the executable program, which is written to disk

to run the executable, the loader loads it from disk into memory

and starts it by setting the instruction pointer to the first instruction

I high level languages are for example

C, C++

Fortran

C#, Visual Basic, Delphi

Java; works differently: the source code is translated to an intermediate byte code which is independent of the machine; the intermediate code is executed by a run time system

(10)

Pseudo Assembler

I Assembler is close to the machine code but more readable; it is used only for special purposes, e.g. drivers

I here we are using some kind of pseudo assembler

Fortran:

    do i=1,imax
      a(i)=b(i)+c(i)
    enddo

Java:

    for (i = 1; i <= imax; i++)
      a[i] = b[i] + c[i];

Pseudo assembler:

    i=0
    start:
    i=i+1
    if i > imax goto end
    load b(i) to register 0
    load c(i) to register 1
    add register 0 to register 1 and write to register 2
    store register 2 to memory at location a(i)
    goto start
    end:

(11)

in X86 Assembler

    Block 7:
    movsdq  (%rsi,%rcx,8), %xmm0       ; c(i  ) -> xmm0[63..0]
    movsdq  0x10(%rsi,%rcx,8), %xmm1   ; c(i+2) -> xmm1[63..0]
    ...
    movhpdq 0x8(%rsi,%rcx,8), %xmm0    ; c(i+1) -> xmm0[127..64]
    ...
    addpdx  (%rdx,%rcx,8), %xmm0       ; b(i  ),b(i+1) + xmm0 -> xmm0
    addpdx  0x10(%rdx,%rcx,8), %xmm1   ; b(i+2),b(i+3) + xmm1 -> xmm1
    addpdx  0x20(%rdx,%rcx,8), %xmm2   ; b(i+4),b(i+5) + xmm2 -> xmm2
    addpdx  0x30(%rdx,%rcx,8), %xmm3   ; b(i+6),b(i+7) + xmm3 -> xmm3
    movapsx %xmm0, (%rdi,%rcx,8)       ; xmm0 -> a(i  ),a(i+1)
    movapsx %xmm1, 0x10(%rdi,%rcx,8)   ; xmm1 -> a(i+2),a(i+3)
    movapsx %xmm2, 0x20(%rdi,%rcx,8)   ; xmm2 -> a(i+4),a(i+5)

(12)

some instructions of the X86 Assembler

    Block 7:                           ; begin of a block
    movsdq  (%rsi,%rcx,8), %xmm0       ; load from address rsi+8*rcx into register xmm0[63:0]
    movhpdq 0x8(%rsi,%rcx,8), %xmm0    ; load from address 0x8+rsi+8*rcx into xmm0[127:64]
    addpdx  (%rdx,%rcx,8), %xmm0       ; add from rdx+8*rcx to register xmm0, writing to xmm0
    addpdx  0x10(%rdx,%rcx,8), %xmm1   ; add from 0x10+rdx+8*rcx to register xmm1
    movapsx %xmm0, (%rdi,%rcx,8)       ; store xmm0 to address rdi+8*rcx
    movapsx %xmm1, 0x10(%rdi,%rcx,8)   ; store xmm1 to address 0x10+rdi+8*rcx
    add     $0x8, %rcx                 ; add 0x8 to rcx with target rcx
    cmp     %rax, %rcx                 ; compare rax and rcx
    jb      0x303f <Block 7>           ; jump to 0x303f if rax > rcx

(13)

Test program in Fortran; part 1

    module test_module
    contains
      subroutine func(a,b,c,imax)
        integer :: imax, i
        real(kind=8), dimension(imax) :: a,b,c
        do i=1,imax
          a(i)=b(i)+c(i)
        enddo
      end subroutine func
    end module test_module

(14)

Test program in Fortran; part 2

    program test_prog
      use test_module
      implicit none
      integer :: i, j=0, k=200                            ! j is initialized to 0, as in the Java version
      integer, parameter :: imax=100000
      real(kind=8), allocatable, dimension(:) :: a,b,c    ! are dynamic arrays
      allocate(a(imax))                                   ! reserving space for array a
      allocate(b(imax))
      allocate(c(imax))
      do i=1,imax                                         ! Definition of array elements
        b(i)=i+j
        c(i)=i*5
      enddo
      call func(a,b,c,imax)                               ! call the procedure
      write(*,*)a(1)
    end program test_prog

(15)

Test program in Java; part 1

    package test.module;

    public class TestModule
    {
        public static void func(double a[], double b[], double c[], int imax)
        {
            for (int i = 0; i < imax; i++)
                a[i] = b[i] + c[i];
        }
    }

(16)

Test program in Java; part 2

    package test.prog;

    import test.module.TestModule;

    public class TestProg
    {
        public static void main(String[] args)
        {
            int i, j = 0, k = 200;
            final int imax = 100000;
            double a[], b[], c[];
            a = new double[imax];               // reserving space for array a
            b = new double[imax];
            c = new double[imax];
            for (i = 0; i < imax; i++)
            {
                b[i] = i + j;
                c[i] = i * 5;
            }
            TestModule.func(a, b, c, imax);     // call the procedure
            System.out.println(a[0]);
        }
    }

(17)

Modern computers in detail

(18)

Core with Caches and Memory (!)

I data are stored in a connected memory hierarchy

I L1 cache ( Level 1 cache)

I L2 cache

I L3 cache ( not all architectures )

I memory

I the cache

I is an intermediate memory with the same addressing scheme replicating the data in memory

I has a smaller access latency and provides higher bandwidth

I a load/store first looks in the lowest level cache; if the data is not there, the next level will be tested

I all cached data are organized in cache lines (of 64 B for x86 and x86-64)

I load/store moves cachelines, not single data

I old data will be overwritten, but saved to higher level caches or to memory if changed

I writing to a cacheline assumes that the cacheline is present in cache

I lower level caches are smaller than higher level caches

I nearby caches have higher bandwidth and smaller access latency

I load/store instructions can reach all memory locations

[Figure: core (C) with L1, L2 and L3 caches and memory (M)]

(19)

Processor with Cores and Memory (!)

I here: Intel Sandy Bridge (2012)

I multi core processor with 8 cores

I L3 cache is shared but partitioned

I communication ring with high bandwidth

I 4 rings for data, requests, acknowledgments, snooping

I memory controller in ring

I QPI (Quick Path Interconnect) interface in ring

[Figure: Sandy Bridge processor: cores (C), each with L1, L2 and an L3 partition, together with the QPI interface, the memory controller (MC) and memory (M), connected by 4 rings: data, acknowledgment, request, snoop]

snooping enables all cores to "hear" which shared cachelines are changed by other cores (cache coherence protocol)

(20)

Intel Sandy Bridge architectural data 1

I 8 cores

I AVX vector unit per core with 4 mult-plus operations/CP

I core with L1, L2 cache

I shared L3-cache

I 4 memory channels with 1600 MHz DDR3

I 2 QPI links with up to 8 GT/s

I 51.2 GB/s nominal bandwidth

I >2.0 GHz frequency

I daughter card on PCIe 3.0 with up to 16 GB/s

PCIe (Peripheral Component Interconnect Express) is an interface standard for connecting external devices such as graphics cards, network interface controllers, sound cards, ...

(21)

Intel Sandy Bridge architectural data 2

(22)

AMD Interlagos Processor

I 16 cores in 2 dies with 4 core pairs

I AVX vector unit per core pair(!) with 4 mult-plus operations/CP

I core with L1 cache

I 2 MB L2-cache per core pair

I 8 MB shared L3-cache per die, in total 16 MB

I 2 memory channels with 1600 MHz DDR3 per die

I one Hyper-Transport link between dies

I one Hyper-Transport link per die going outwards

I 51.2 GB/s nominal bandwidth for 4 channels

I >2.0 GHz frequency

picture from AMD

(23)

Intel Knights Ferry

I 30 cores

I 4 vector units per core

I Pentium II based

I 100 GB/s nominal bandwidth

I 1.05 GHz frequency

I daughter card on PCIe 2.0

I prototype; will be followed by Knights Corner

(24)

Nvidia Graphics Card

no longer current

I 448 - 512 cores

I one 8-byte FP unit per core

I 14 - 16 multiprocessors

I 144 - 177 GB/s nominal bandwidth

I 1.15 - 1.3 GHz frequency

I 515 - 665 GFLOP/s DP peak performance

I daughter card on PCIe 2.0

Oliver Mangold

(25)

principal parallel architectures

(26)

Shared memory system (!)

I all processors/cores are using the same memory

I the address space is identical for all cores

I memory load/store instructions can reach all locations

I each thread may share its part of the memory with others

I shared memory may consist of different banks on different channels

I memory can be local to a processor but accessible by others

Intel: Quick Path Interconnect (QPI)

AMD: Hyper-Transport (HT)

I bus/interconnect is a bottleneck

I memory bandwidth is insufficient

[Figure: two processors (P), each with four memory banks (M), connected by QPI / HT and to IO]

(27)

Distributed memory system (!)

I nodes are using different memories

I the address space is different for all cores

I memory load/store instructions cannot reach other locations

I memory bus/interconnect is independent

I memory bandwidth scales with the number of nodes

I network access via a Network Interface Controller (NIC)

I network links connected via Routers (R)

[Figure: four nodes, each consisting of memory (M), processor (P) and Network Interface Controller (NIC), connected through routers (R)]

(28)

Hierarchical Parallel Computer System (!)

I the different levels of a parallel system: cores, processors, nodes, supernodes, system

[Figure: hierarchical machine]

(29)

Hierarchical Parallel Computer System (!)

vector floating point units (FPU) ( 2 - 8 per core )

cores (C) ( 4 - 64 per CPU, 500 per GPU)

processors (CPU) ( 2 - 8 per node )

nodes (N) ( 1-8 per supernode )

supernodes, blades, cages, ... (2 - 8 per rack )

rack (R) (10 - 1000 per system)

distributed system

I by multiplying small numbers we get a very large number

(30)

addressing of the total memory (!)

I shared memory

Uniform Memory Access (UMA)

Non-Uniform Memory Access (NUMA)

I distributed memory

locally addressable

nodes may have shared memory

I Partitioned Global Address Space (PGAS)

I globally addressable, every processor can address the memory in all other processors or nodes

Remote Direct Memory Access (RDMA) done by the Network Interface Controller (NIC)

enables simpler handling of global data structures

(31)

examples of large machines

(32)

HLRS machine Cray XE6

I Cray XE6

I 3500 nodes with two AMD Interlagos processors each

I Gemini interconnect

I in operation since November 2011

I 16 C x 2 CPU x 96 N x 38 R = 116736 Cores (HLRS 2011)

I 32 GB x 96 N x 38 R = 116 TB memory

I 1.8 PB disk

Cray XE6 at HLRS

(33)

K-computer at Riken

I Fujitsu

I 10-13 MW

I finished in 2012

I 8 C x 4 N x 24 SN x 864 R = 82944 CPUs = 663552 Cores

I 10.6 PetaFlop/s peak performance

(34)

Blue Waters System

I >235 Cray XE6 cabinets

I >30 Cray XK6 cabinets

I >25000 compute nodes

I >1 TB/s usable storage bandwidth

I >1.5 PB aggregate system memory

I 4 GB Memory per core

I >9000 Gemini network cables

I 3D torus interconnect topology

I >17000 disks

I >190000 memory DIMMS

I >25 PB usable storage

I >11.5 PF peak performance

I >49000 AMD processors

I >380000 cores

I >3000 GPUs

I 100 GB/s bandwidth to near-line storage

Blue Waters machine under construction: http://timelapse.ncsa.illinois.edu/pcf/inside2/index.php

http://www.ncsa.illinois.edu/BlueWaters/system.html

(35)

parameters describing hardware

(36)

Moore’s Law 1 (!)

(37)

Moore’s Law 2 (!)

I Moore’s Law from 1965 states that the number of components on an integrated circuit doubles every 18 (24) months

I this is the reason that computers are inexpensive despite their complexity

I for some time it was understood as doubling the frequency every 18 months

I http://de.wikipedia.org/wiki/Mooresches_Gesetz

(38)

stagnating frequency (!)

I the power consumption depends heavily on the frequency of the processor

power ∼ frequency³

I the number of cores can be increased, if the frequency is reduced

I List of Intel processors: http://www.intel.com/pressroom/kits/quickreffam.htm

I data from Gerhard Wellein / Erlangen
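A short worked example of this trade-off, taking the cubic relation at face value and assuming (as an idealization that is not stated on the slide) that throughput scales linearly with both the frequency and the number of cores:

```latex
\[
P(f) \propto f^{3}
\quad\Longrightarrow\quad
\frac{P_{\text{2 cores}}(f/2)}{P_{\text{1 core}}(f)} = 2 \cdot \left(\tfrac{1}{2}\right)^{3} = \tfrac{1}{4},
\qquad
\frac{\text{throughput}_{\text{2 cores}}(f/2)}{\text{throughput}_{\text{1 core}}(f)} = 2 \cdot \tfrac{1}{2} = 1
\]
```

Under these assumptions two cores at half the frequency deliver the same nominal throughput for a quarter of the power, provided the program can actually use both cores.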

(39)

Floating Point Performance Parameters (!)

I clock rate of the core/processor

frequency at which instructions are scheduled (inverse of the clock period); typical 1 – 4 GHz

I number of floating point operations per clock tick

instruction parallelism in the core/processor [FLOPs/FP-unit]; typical 2 – 8

I peak performance of processor

maximum number of floating point operations per second [MFLOP/s, GFLOP/s, TFLOP/s]; typical 4 – 100 GFLOP/s

I total peak performance of machine

number of nodes defines the size of the cluster; typical 1 – 25000

I total performance = frequency x #FP-units x #cores x #processors x #nodes
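The product formula can be evaluated directly. The following is a minimal sketch; the class and parameter names are hypothetical, and the sample values (2.0 GHz, 8 floating point operations per clock tick and core, 8 cores, 1000 dual-processor nodes) are assumptions for illustration, not data for a specific machine.

```java
/** Minimal sketch of the peak performance product; all sample values are assumptions. */
final class PeakPerformance {

    /**
     * Multiplies the hardware parameters into a nominal peak in FLOP/s.
     * fpOpsPerCycle plays the role of "#FP-units" in the formula: the number of
     * floating point operations one core can complete per clock tick.
     */
    static double peakFlops(double frequencyHz, int fpOpsPerCycle,
                            int coresPerProcessor, int processorsPerNode, int nodes) {
        return frequencyHz * fpOpsPerCycle * coresPerProcessor * processorsPerNode * nodes;
    }

    public static void main(String[] args) {
        // Assumed single processor: 2.0 GHz, 8 FLOPs per clock tick, 8 cores.
        double oneProcessor = peakFlops(2.0e9, 8, 8, 1, 1);
        System.out.printf("one processor: %.1f GFLOP/s%n", oneProcessor / 1e9);

        // Assumed cluster: 1000 nodes with two such processors each.
        double cluster = peakFlops(2.0e9, 8, 8, 2, 1000);
        System.out.printf("cluster: %.1f TFLOP/s%n", cluster / 1e12);
    }
}
```

The result is a nominal upper bound only; it says nothing about whether memory and interconnect can deliver data fast enough to reach it.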

(40)

Communication Parameters: Bandwidth (!)

I bandwidth: amount of data transferred per time through a transportation path

I typical parameter for all transportation systems

I for computers

L1 cache bandwidth (4 x 8 B/CP)

memory bandwidth

o (1, 2, 3, 4) channels x 8 B x bus frequency (at 1066, 1333, 1600 MHz) → up to 51.2 GB/s

interconnect bandwidth

o InfiniBand (SDR 1 GB/s, DDR 2 GB/s, QDR 4 GB/s, EDR 12.5 GB/s)

o Cray Gemini (5 GB/s x 2 directions)

o Ethernet (1 Gbit/s, 10 Gbit/s, 40 Gbit/s)

IO bandwidth

o SATA disk (60 MB/s)

o Solid State Device SSD (120 – 700 MB/s)

o RAID Storage Subsystem (1.5 – 10 GB/s)

o parallel Lustre Filesystem (> 100 GB/s)

(41)

Communication Parameters: Latency (!)

I Latency: time to get the first data after sending the request

I connected to the length of the pipeline

I all system parts show a typical latency

L1 cache (2 - 4 CP)

L2 cache ( 10 -20 CP)

L3 cache (25 - 70 CP)

memory (100 - 300 CP)

interconnect (1 - 50 µs)

disk (1 - 5 ms)

I the number of transfers in flight is limited by the number of open requests

I the achievable bandwidth is therefore limited:

bandwidth ≤ number of open requests × transferred buffer size / latency

I the latency is the startup time of the transfer pipeline
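The bound can be turned around to estimate how many requests must be in flight to saturate a transfer path. A worked example, assuming a memory latency of 70 ns, a nominal bandwidth of 51.2 GB/s and 64 B cache lines as the transferred buffer (sample values for illustration, not measurements):

```latex
\[
\text{open requests} \;\ge\; \frac{\text{bandwidth} \times \text{latency}}{\text{buffer size}}
= \frac{51.2\,\mathrm{GB/s} \times 70\,\mathrm{ns}}{64\,\mathrm{B}}
= \frac{3584\,\mathrm{B}}{64\,\mathrm{B}}
= 56
\]
```

With fewer than 56 cache line requests outstanding, the latency rather than the nominal bandwidth would limit the transfer rate in this model.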

(42)

Effective Memory Bandwidth (!)

I The nominal memory bandwidth of a modern PC system can be calculated by

\[ BW_{nom} = \text{path width} \times \text{number of channels} \times \text{bus frequency} \]

path width = 8 B, number of channels = 1, 2, 3, 4, bus frequency = 1066, 1333, 1600, 1866, 2133 MHz

I The effective memory bandwidth is smaller than the nominal bandwidth

\[ BW_{eff} = \frac{\text{size of data transferred}}{\text{latency} + \text{transfer time}} \]

The transfer time is related to the nominal bandwidth by

\[ \text{transfer time} = \frac{\text{size of data transferred}}{BW_{nom}} \]

I The effective bandwidth is influenced by the memory or cache latency

\[ BW_{eff} = \frac{\text{size of data transferred}}{\text{latency} + \dfrac{\text{size of data transferred}}{BW_{nom}}} \]
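The effective bandwidth model is easy to evaluate for different transfer sizes. The following is a minimal sketch; the class and method names are hypothetical, and the sample values (70 ns latency, 4 channels of 8 B at 1600 MHz) are taken as assumptions for illustration.

```java
/** Minimal sketch of the effective bandwidth model; names are hypothetical. */
final class EffectiveBandwidth {

    /**
     * Effective bandwidth in bytes per second for one transfer.
     * sizeBytes: amount of data, latencySeconds: startup latency,
     * nominalBytesPerSecond: nominal bandwidth of the path.
     */
    static double effective(double sizeBytes, double latencySeconds,
                            double nominalBytesPerSecond) {
        double transferTime = sizeBytes / nominalBytesPerSecond;
        return sizeBytes / (latencySeconds + transferTime);
    }

    public static void main(String[] args) {
        double nominal = 8 * 4 * 1600e6;   // assumed: 8 B x 4 channels x 1600 MHz
        double latency = 70e-9;            // assumed memory latency of 70 ns

        // Effective bandwidth for 1, 10, 100 and 1000 cache lines of 64 B each.
        for (int lines = 1; lines <= 1000; lines *= 10) {
            double bw = effective(lines * 64.0, latency, nominal);
            System.out.printf("%4d cache lines: %.3f GB/s%n", lines, bw / 1e9);
        }
    }
}
```

The larger the amount of data moved per request, the smaller the share of the latency and the closer the result comes to the nominal bandwidth.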

(43)

Example 1

I Intel's Sandy Bridge processor has 4 memory channels running at a bus frequency of 1600 MHz. The channels are 8 B wide.

\[ BW_{nom} = 8\,\mathrm{B} \times 4 \times 1600\,\mathrm{MHz} = 51.2\,\mathrm{GB/s} \]

The memory latency is 70 ns.

I The resulting effective bandwidth for a single cache line of 64 B is

\[ BW_{eff} = \frac{64\,\mathrm{B}}{70\,\mathrm{ns} + \dfrac{64\,\mathrm{B}}{51.2\,\mathrm{GB/s}}} = \frac{64\,\mathrm{B}}{70\,\mathrm{ns} + 1.25\,\mathrm{ns}} = 0.898\,\mathrm{GB/s} \quad (!) \]

I We are far from the nominal bandwidth!

I The estimate is not completely correct:

the cache line uses a single channel; the cache line is transported with the bus clock

(44)

Example 2

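I The resulting effective bandwidth for 100 cache lines of 64 B is

\[ BW_{eff} = \frac{100 \times 64\,\mathrm{B}}{70\,\mathrm{ns} + \dfrac{100 \times 64\,\mathrm{B}}{51.2\,\mathrm{GB/s}}} = \frac{100 \times 64\,\mathrm{B}}{70\,\mathrm{ns} + 125\,\mathrm{ns}} = 32.82\,\mathrm{GB/s} \quad (!) \]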

I How to overcome this problem?

(45)

Prefetching (!)

I The data access has to be started before consumption

I Prefetching initiates the data access independently of the use of the data

I difficult to handle

I may be counterproductive for data already in caches

(46)

Types of Parallelism(!)

I single instruction single data (SISD)

simple or traditional processor

I single instruction multiple data (SIMD)

vector model

within one multiprocessor of a graphics card

I multiple instructions multiple data (MIMD)

multicore processors as shared memory processors

different multiprocessors of graphics cards

I multiple programs multiple data (MPMD)

multicore processors for different processes

distributed memory processors

I multiple instructions single data (MISD) ??

I SISD, SIMD, MIMD, MISD: Flynn's notation

I pipeline parallelism

(47)

pipeline parallelism (!)

I phase shifted parallelism

I very general principle in production, transport, flow of goods

I assembly line in production systems

I important model in bureaucracy (first come, first served)

I different parts are in operation at different stages at the same time

I dominant in modern computer architectures

I in computer:

accessing caches or memory within a loop

accessing data from a remote processor

input and output system

I description model of the pipeline

the pipeline is filled after a number of startup steps. This number is the length of the pipeline and at the same time the degree of its parallelism. It can also be understood as latency.

(48)

performance modeling

(49)

performance modeling

(50)

performance behaviour of the pipeline (!)

I The pipeline mechanism is characterized by the

fixed startup or latency costs or time for filling the pipeline

time per unit ∆T

operations per unit op

performance perf

I The resulting performance is

\[ perf(n) = \frac{n \cdot op}{startup + n \cdot \Delta T} \]

I peak performance is the limit for increasing n

\[ perf_{\infty} = \lim_{n \to \infty} perf(n) = \frac{op}{\Delta T} \]

I getting half of the peak performance for \( n_{1/2} = \frac{startup}{\Delta T} \)

\[ perf(n_{1/2}) = \frac{\frac{startup}{\Delta T} \cdot op}{startup + \frac{startup}{\Delta T} \cdot \Delta T} = \frac{1}{2} \, \frac{op}{\Delta T} = \frac{1}{2} \, perf_{\infty} \]
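The pipeline model can be checked numerically. The following is a minimal sketch with hypothetical class and method names; the sample values for op, startup and ∆T are arbitrary assumptions in consistent but unspecified units.

```java
/** Minimal sketch of the pipeline performance model; sample values are assumptions. */
final class PipelineModel {

    /** Performance for n units: n * op / (startup + n * deltaT). */
    static double perf(double n, double op, double startup, double deltaT) {
        return n * op / (startup + n * deltaT);
    }

    /** Peak performance, the limit of perf(n) for n going to infinity. */
    static double peak(double op, double deltaT) {
        return op / deltaT;
    }

    /** Number of units at which half of the peak performance is reached. */
    static double nHalf(double startup, double deltaT) {
        return startup / deltaT;
    }

    public static void main(String[] args) {
        double op = 2.0, startup = 100.0, deltaT = 1.0;   // assumed sample values
        double nh = nHalf(startup, deltaT);
        System.out.println("peak performance:  " + peak(op, deltaT));
        System.out.println("n_1/2:             " + nh);
        System.out.println("perf at n_1/2:     " + perf(nh, op, startup, deltaT));
        System.out.println("perf at 100*n_1/2: " + perf(100 * nh, op, startup, deltaT));
    }
}
```

A long startup relative to ∆T means that many units have to be processed before the pipeline pays off, which is the same effect as the latency in the effective bandwidth examples.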

(51)

Amdahl's Law: limits of parallel speedup (!)

I Assume a parallelizable program running a fixed size test case having a

parallel part taking a total time tpar

serial (non parallelizable) part taking a total time tser

total time is ttot=tpar+tser

I We will run the program with p parallel processes

I The speedup is

\[ speedup(p) = \frac{t_{tot}}{t_{par}/p + t_{ser}} \le \frac{t_{tot}}{t_{ser}} \]

I and the efficiency

\[ efficiency(p) = \frac{speedup(p)}{p} \]
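The speedup bound can be tabulated for a concrete case. The following is a minimal sketch with hypothetical names; the split into 90 time units of parallel work and 10 units of serial work is an assumed example.

```java
/** Minimal sketch of Amdahl's law for a fixed size problem; values are assumptions. */
final class AmdahlScaling {

    /** speedup(p) = ttot / (tpar / p + tser) with ttot = tpar + tser. */
    static double speedup(int p, double tPar, double tSer) {
        return (tPar + tSer) / (tPar / p + tSer);
    }

    /** efficiency(p) = speedup(p) / p. */
    static double efficiency(int p, double tPar, double tSer) {
        return speedup(p, tPar, tSer) / p;
    }

    /** Upper bound ttot / tser that no number of processes can exceed. */
    static double limit(double tPar, double tSer) {
        return (tPar + tSer) / tSer;
    }

    public static void main(String[] args) {
        double tPar = 90.0, tSer = 10.0;   // assumed: 10 % serial part
        for (int p = 1; p <= 1000; p *= 10) {
            System.out.printf("p=%4d speedup=%.3f efficiency=%.3f%n",
                    p, speedup(p, tPar, tSer), efficiency(p, tPar, tSer));
        }
        System.out.printf("limit=%.1f%n", limit(tPar, tSer));
    }
}
```

Even with an unlimited number of processes the speedup in this example stays below ttot/tser, while the efficiency keeps falling.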

(52)

Amdahl's Law: limits of parallel speedup (!)

I These assumptions are correct for any type of parallel work!

I The plots show the speedup as a function of the serial fraction tser/(tpar+tser) in percent.

I The behaviour is not encouraging!

I Why can large scale parallel computing be successful nevertheless?

I the speedup behaviour of a parallel program for a fixed size test case is called strong scaling

(53)

Amdahl's Law with communication

I even worse: assume that communication times increase with the number of processes

parallel part taking a total time tpar

assume communication time com ∗ p is proportional to the number of processes; other dependencies to be discussed

serial part taking a total time tser

total time is ttot=tpar+tser

I We will run the program with p parallel processes

I The speedup is

\[ speedup(p) = \frac{t_{tot}}{t_{par}/p + com \cdot p + t_{ser}} \le \frac{t_{tot}}{com \cdot p + t_{ser}} \]
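Because the communication term grows with p while the parallel term shrinks, this model has an optimal number of processes. A short derivation, treating p as a continuous variable (a simplifying assumption) and writing T(p) for the denominator:

```latex
\[
T(p) = \frac{t_{par}}{p} + com \cdot p + t_{ser},
\qquad
\frac{dT}{dp} = -\frac{t_{par}}{p^{2}} + com = 0
\;\Longrightarrow\;
p_{opt} = \sqrt{\frac{t_{par}}{com}}
\]
\[
speedup(p_{opt}) = \frac{t_{tot}}{2\sqrt{t_{par} \cdot com} + t_{ser}}
\]
```

Beyond p_opt, adding processes increases the run time in this model, so the speedup curve has a maximum instead of merely saturating.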

(54)

Gustafson: increase the work 1 (!)

I Amdahl assumes a fixed amount of work.

I But what can we do in a fixed amount of time?

parallel part taking a total time \( t_{par} = t_{proc} \cdot p \), proportional to the number of processors

serial part taking a total time tser

I The speedup is

\[ speedup(p) = \frac{t_{proc} \cdot p + t_{ser}}{t_{proc} \cdot p / p + t_{ser}} = \frac{t_{proc}}{t_{proc} + t_{ser}} \, p + \frac{t_{ser}}{t_{proc} + t_{ser}} \]

I and the efficiency

\[ efficiency(p) = \frac{t_{proc}}{t_{proc} + t_{ser}} + \frac{t_{ser}}{t_{proc} + t_{ser}} \cdot \frac{1}{p} \]

I The efficiency decreases to the lower limit \( t_{proc} / (t_{proc} + t_{ser}) \).
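The scaled speedup can be evaluated in the same way as the fixed size case. The following is a minimal sketch with hypothetical names; 90 time units of parallel work per process and 10 units of serial work are assumed sample values, and communication is neglected.

```java
/** Minimal sketch of Gustafson's scaled speedup; values are assumptions. */
final class GustafsonScaling {

    /** speedup(p) = (tproc * p + tser) / (tproc + tser). */
    static double speedup(int p, double tProc, double tSer) {
        return (tProc * p + tSer) / (tProc + tSer);
    }

    /** efficiency(p) = speedup(p) / p. */
    static double efficiency(int p, double tProc, double tSer) {
        return speedup(p, tProc, tSer) / p;
    }

    /** Lower limit of the efficiency for p going to infinity. */
    static double efficiencyLimit(double tProc, double tSer) {
        return tProc / (tProc + tSer);
    }

    public static void main(String[] args) {
        double tProc = 90.0, tSer = 10.0;   // assumed sample values
        for (int p = 1; p <= 1000; p *= 10) {
            System.out.printf("p=%4d speedup=%.3f efficiency=%.4f%n",
                    p, speedup(p, tProc, tSer), efficiency(p, tProc, tSer));
        }
        System.out.printf("efficiency limit=%.2f%n", efficiencyLimit(tProc, tSer));
    }
}
```

In contrast to the fixed size case, the speedup grows linearly with p here and the efficiency stays above a positive limit.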

(55)

Gustafson: increase the work 2 (!)

I There are a lot of simulation problems of this type!

I But we are neglecting any necessary communication.

I the speedup behaviour of a parallel program for a test case whose size increases with the number of active processors/cores is called weak scaling

(56)

Thank you for your attention!

Kuester[at]hlrs.de