Modeling, Simulation, Optimization: Computer Architecture
Prof. Michael Resch
Dr. Martin Bernreuther, Dr. Natalia Currle-Linde, Dr. Martin Hecht, Uwe Küster, Dr. Oliver Mangold, Melanie Mochmann, Christoph Niethammer, Ralf Schneider HLRS, IHR
28 June 2012
Outline
A simple computer
Computer and programs
Modern computers in detail
principal parallel architectures
examples of large machines
parameters describing hardware
Types of Parallelism
performance modeling
2/55 :: Modellierung, Simulation, Optimierung , Computer Architektur :: 28. Juni 2012 ::
A simple computer
von Neumann Machine (!)
I 4 units
Arithmetic Logical Unit
Control Unit
Memory for instructions and data
Input and Output Unit
I the instructions are executed in order by incrementing an instruction pointer
I branches make jumps in the instruction sequence by changing the instruction pointer
I no parallelism
I memory is linearly addressed
I typical for computer architecture
I http://de.wikipedia.org/wiki/John_von_Neumann
I http://de.wikipedia.org/wiki/Von-Neumann-Architektur
design of a simple computer (!)
I processor does all operations
I memory is a fast but volatile storage
I bus connecting processor, memory and IO
I Input and output devices:
disk storing data permanently
keyboard
display
network interface for communication
[Figure: the processor is connected to memory and, through an interface, to the devices: disk, network, display, keyboard, mouse]
Computer and programs
executing a program on a computer (!)
I all the work a computer can do is specified by executable programs
I a program consists of a long sequence of instructions combined with data
I the processor or core loads an instruction from memory
I the instruction is analysed and executed by the hardware
I the instruction may contain
data,
memory addresses,
and the specification of an operation to perform on the data
I the instruction pointer
points to the current instruction to be loaded and executed
after issuing the instruction it is incremented and points to the next instruction;
in this way the machine operates on a sequence of instructions
it can be changed by special branch and jump instructions to point to a different location in the instruction stream;
in this way loops and alternatives are formulated
I the same instruction sequence may be executed on different data
I typical instruction classes are
load, store
integer add, multiply
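The fetch/execute cycle described above can be sketched as a small interpreter. This is only an illustration: the class name TinyMachine, the five opcodes and the three-field instruction layout are assumptions made for this sketch, not a real instruction set. For readability the program is held in its own array, whereas a von Neumann machine keeps instructions and data in the same memory.

```java
// Sketch of a fetch/execute loop; opcodes and instruction layout are
// illustrative assumptions, not a real instruction set.
final class TinyMachine {
    static final int LOAD = 0;   // LOAD r, addr  : reg[r] = memory[addr]
    static final int STORE = 1;  // STORE r, addr : memory[addr] = reg[r]
    static final int ADD = 2;    // ADD r, s      : reg[r] = reg[r] + reg[s]
    static final int JNZ = 3;    // JNZ r, target : if reg[r] != 0, ip = target
    static final int HALT = 4;   // stop execution

    // Executes the program on the given memory and returns the memory.
    static int[] run(int[][] program, int[] memory) {
        int[] reg = new int[4];
        int ip = 0;                       // instruction pointer
        while (true) {
            int[] ins = program[ip];      // fetch the current instruction
            ip++;                         // point to the next instruction
            switch (ins[0]) {
                case LOAD:  reg[ins[1]] = memory[ins[2]]; break;
                case STORE: memory[ins[2]] = reg[ins[1]]; break;
                case ADD:   reg[ins[1]] += reg[ins[2]]; break;
                case JNZ:   if (reg[ins[1]] != 0) ip = ins[2]; break; // branch
                case HALT:  return memory;
                default: throw new IllegalArgumentException("unknown opcode " + ins[0]);
            }
        }
    }

    // memory[2] = memory[0] * memory[1] by repeated addition;
    // requires memory[0] >= 1, memory[2] == 0 and memory[3] == -1.
    static int[][] multiplyProgram() {
        return new int[][] {
            {LOAD, 0, 0},   // 0: r0 = loop counter
            {LOAD, 1, 1},   // 1: r1 = addend
            {LOAD, 2, 2},   // 2: r2 = accumulator
            {LOAD, 3, 3},   // 3: r3 = -1
            {ADD, 2, 1},    // 4: accumulator += addend
            {ADD, 0, 3},    // 5: counter -= 1
            {JNZ, 0, 4},    // 6: jump back to 4 while counter != 0
            {STORE, 2, 2},  // 7: write the result to memory
            {HALT, 0, 0}    // 8: stop
        };
    }

    public static void main(String[] args) {
        int[] memory = {3, 5, 0, -1};
        run(multiplyProgram(), memory);
        System.out.println(memory[2]);
    }
}
```

The same instruction sequence works on different data: changing the first two memory cells changes the product without touching the program.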
purpose of the operating system (!)
I the operating system is a collection of executable programs for
the administration of all devices such as disk, keyboard, display, mouse, hardware interfaces, ...
scheduling and starting executable programs, assigning resources to them, finishing them
handling data, files and the file system
connecting to the outer world, e.g. other computing nodes, the internet
hiding the complexity of the machine from the user
I examples: Microsoft Windows, Unix/Linux, NEC SuperUX, Android, ...
from the source code to the start of the executable (!)
I the source code of a program is written in a high level programming language because
machine code is too complicated to read, understand and handle
machine code is architecture dependent
I the source code is translated to the executable program by the following steps
compilation of the source files by the compiler generates objects in machine code; the relations between the objects are still unresolved
the linker binds the objects into the executable program, which is written to disk
to run the executable, the loader loads it from disk to memory
and starts it by setting the instruction pointer to the first instruction
I high level languages are for example
C, C++
Fortran
C#, Visual Basic, Delphi
Java; works differently: the source code is translated to an intermediate byte code which is independent of the machine; the intermediate code is executed by a run time system
Pseudo Assembler
I Assembler is close to the machine code but more readable; it is used only for special purposes, e.g. drivers
I here we are using some kind of pseudo assembler
Fortran:
  do i=1,imax
    a(i)=b(i)+c(i)
  enddo
Java:
  for (i = 0; i < imax; i++)
    a[i] = b[i] + c[i];
Pseudo assembler:
  i=0
  start:
  i=i+1
  if i > imax goto end
  load b(i) to register 0
  load c(i) to register 1
  add register 0 to register 1 and write to register 2
  store register 2 to memory at location a(i)
  goto start
  end:
in X86 Assembler
Block 7:
movsdq  (%rsi,%rcx,8), %xmm0        ; c(i  ) -> xmm0[63..0]
movsdq  0x10(%rsi,%rcx,8), %xmm1    ; c(i+2) -> xmm1[63..0]
movsdq  0x20(%rsi,%rcx,8), %xmm2    ; c(i+4) -> xmm2[63..0]
movsdq  0x30(%rsi,%rcx,8), %xmm3    ; c(i+6) -> xmm3[63..0]
movhpdq 0x8(%rsi,%rcx,8), %xmm0     ; c(i+1) -> xmm0[127..64]
movhpdq 0x18(%rsi,%rcx,8), %xmm1    ; c(i+3) -> xmm1[127..64]
movhpdq 0x28(%rsi,%rcx,8), %xmm2    ; c(i+5) -> xmm2[127..64]
movhpdq 0x38(%rsi,%rcx,8), %xmm3    ; c(i+7) -> xmm3[127..64]
addpdx  (%rdx,%rcx,8), %xmm0        ; b(i  ),b(i+1) + xmm0 -> xmm0
addpdx  0x10(%rdx,%rcx,8), %xmm1    ; b(i+2),b(i+3) + xmm1 -> xmm1
addpdx  0x20(%rdx,%rcx,8), %xmm2    ; b(i+4),b(i+5) + xmm2 -> xmm2
addpdx  0x30(%rdx,%rcx,8), %xmm3    ; b(i+6),b(i+7) + xmm3 -> xmm3
movapsx %xmm0, (%rdi,%rcx,8)        ; xmm0 -> a(i  ),a(i+1)
movapsx %xmm1, 0x10(%rdi,%rcx,8)    ; xmm1 -> a(i+2),a(i+3)
movapsx %xmm2, 0x20(%rdi,%rcx,8)    ; xmm2 -> a(i+4),a(i+5)
some instructions of the X86 Assembler
Block 7:                              begin of a block
movsdq  (%rsi,%rcx,8), %xmm0          load from address rsi+8*rcx into register xmm0[63:0]
movhpdq 0x8(%rsi,%rcx,8), %xmm0       load from address 0x8+rsi+8*rcx into xmm0[127:64]
addpdx  (%rdx,%rcx,8), %xmm0          add from rdx+8*rcx to register xmm0, writing to xmm0
addpdx  0x10(%rdx,%rcx,8), %xmm1      add from 0x10+rdx+8*rcx to register xmm1
movapsx %xmm0, (%rdi,%rcx,8)          store xmm0 to address rdi+8*rcx
movapsx %xmm1, 0x10(%rdi,%rcx,8)      store xmm1 to address 0x10+rdi+8*rcx
add     $0x8, %rcx                    add 0x8 to rcx with target rcx
cmp     %rax, %rcx                    compare rax and rcx
jb      0x303f <Block 7>              jump to 0x303f if rax > rcx
Test program in Fortran; part 1
module test_module
contains
  subroutine func(a,b,c,imax)
    integer :: imax, i
    real(kind=8), dimension(imax) :: a,b,c
    do i=1,imax
      a(i)=b(i)+c(i)
    enddo
  end subroutine func
end module test_module
Test program in Fortran; part 2
program test_prog
  use test_module
  implicit none
  integer :: i, j=0, k=200
  integer,parameter :: imax=100000
  real(kind=8),allocatable, dimension(:) :: a,b,c   ! are dynamic arrays
  allocate(a(imax))                                 ! reserving space for array a
  allocate(b(imax))
  allocate(c(imax))
  do i=1,imax                                       ! Definition of array elements
    b(i)=i+j
    c(i)=i*5
  enddo
  call func(a,b,c,imax)                             ! call the procedure
  write(*,*)a(1)
end program test_prog
Test program in Java; part 1
package test.module;

public class TestModule
{
    public static void func(double a[], double b[], double c[], int imax)
    {
        for (int i = 0; i < imax; i++)
            a[i] = b[i] + c[i];
    }
}
Test program in Java; part 2
package test.prog;

import test.module.TestModule;

public class TestProg
{
    public static void main(String[] args)
    {
        int i, j = 0, k = 200;
        final int imax = 100000;
        double a[], b[], c[];
        a = new double[imax];             // reserving space for array a
        b = new double[imax];
        c = new double[imax];
        for (i = 0; i < imax; i++)
        {
            b[i] = i + j;
            c[i] = i * 5;
        }
        TestModule.func(a, b, c, imax);   // call the procedure
        System.out.println(a[0]);
    }
}
Modern computers in detail
Core with Caches and Memory (!)
I data are stored in a connected memory hierarchy
  I L1 cache (Level 1 cache)
  I L2 cache
  I L3 cache (not all architectures)
  I memory
I the cache
  I is an intermediate memory with the same addressing scheme, replicating the data in memory
  I has a smaller access latency and provides higher bandwidth
I a load/store first looks in the (low level) cache; if the data is not there, the next level will be tested
I all cached data are organized in cache lines (of 64 B for x86, x86-64)
I load/store moves cachelines, not single data
I old data will be overwritten, but saved to higher level caches or to memory if changed
I writing to a cacheline assumes that the cacheline is present in cache
I lower level caches are smaller than higher level caches
I nearby caches have higher bandwidth and smaller access latency
I load/store instructions can reach all memory locations
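Because load/store moves whole cache lines, the access pattern decides how many lines have to be transferred. The following sketch counts the cache lines touched when reading 8-byte values with a given stride. The class and method names are assumptions made for this illustration, the 64 B line size is taken from the slide, and the start address is assumed to be given in bytes.

```java
// Sketch: which cache lines does a strided access to 8-byte values touch?
// Assumes 64 B cache lines (x86) and a non-negative stride.
final class CacheLines {
    static final int LINE_BYTES = 64;
    static final int ELEMENT_BYTES = 8;   // one double

    // Index of the cache line that contains the given byte address.
    static long lineIndex(long byteAddress) {
        return byteAddress / LINE_BYTES;
    }

    // Number of distinct cache lines touched by `count` elements
    // starting at `startAddress`, `strideElements` apart.
    static long linesTouched(long startAddress, long count, long strideElements) {
        long lines = 0;
        long lastLine = -1;
        for (long i = 0; i < count; i++) {
            long line = lineIndex(startAddress + i * strideElements * ELEMENT_BYTES);
            if (line != lastLine) {       // addresses only grow, so a new index is a new line
                lines++;
                lastLine = line;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(linesTouched(0, 1024, 1));  // contiguous access
        System.out.println(linesTouched(0, 1024, 8));  // one element per cache line
    }
}
```

With stride 1, eight consecutive doubles share one cache line; with stride 8 every element needs its own line, so eight times as much data is moved for the same number of useful values.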
[Figure: core (C) connected to memory (M) through the L1, L2 and L3 caches]
Processor with Cores and Memory (!)
I here: Intel Sandy Bridge (2012)
I multi core processor with 8 cores
I L3 cache is shared but partitioned
I communication ring with high bandwidth
I 4 rings for data, requests, acknowledgments, snooping
I memory controller in ring
I QPI (Quick Path Interconnect) interface in ring
[Figure: eight cores (C), each with L1 and L2 cache and a part of the L3 cache, together with the memory controller (MC) and the QPI interface on 4 rings: data, acknowledge, request, snoop]
snooping enables all cores to "hear" which shared cachelines are changed by other cores (cache coherence protocol)
Intel Sandy Bridge architectural data 1
I 8 cores
I AVX vector unit per core with 4 mult-plus operations/CP
I core with L1, L2 cache
I shared L3-cache
I 4 memory channels with 1600 MHz DDR3
I 2 QPI links with up to 8 GT/s
I 51.2 GB/s nominal bandwidth
I >2.0 GHz frequency
I daughter card on PCIe 3.0 up to 16 GB/s (what is PCI?)
PCIe (Peripheral Component Interconnect Express) is an interface standard for connecting external devices such as graphics cards, network interface controllers, sound cards, ...
Intel Sandy Bridge architectural data 2
AMD Interlagos Processor
I 16 cores in 2 dies, each die with 4 core pairs
I AVX vector unit per core pair(!) with 4 mult-plus operations/CP
I core with L1 cache
I 2 MB L2-cache per core pair
I 8 MB shared L3-cache per die, in total 16 MB
I 2 memory channels with 1600 MHz DDR3 per die
I one Hyper-Transport link between dies
I one Hyper-Transport link per die going outwards
I 51.2 GB/s nominal bandwidth for 4 channels
I >2.0 GHz frequency
picture from AMD
Intel Knights Ferry
I 30 cores
I 4 vector units per core
I based on the original Pentium (P54C) core
I 100 GB/s nominal bandwidth
I 1.05 GHz frequency
I daughter card on PCIe 2.0
I prototype; will be followed by Knights Corner
Nvidia Graphics Card
(no longer current)
I 448 - 512 cores
I one 8-byte FP unit per core
I 14 - 16 multiprocessors
I 144 - 177 GB/s nominal bandwidth
I 1.15 - 1.3 GHz frequency
I 515 - 665 GFLOP/s DP peak performance
I daughter card on PCIe 2.0
Oliver Mangold
principal parallel architectures
Shared memory system (!)
I all processors/cores are using the same memory
I the address space is identical for all cores
I memory load/store instructions can reach all locations
I each thread may share its part of the memory with others
I shared memory may consist of different banks on different channels
I memory can be local to a processor but accessible by others
Intel: Quick Path Interconnect (QPI)
AMD: Hyper-Transport (HT)
I bus/interconnect is a bottleneck
I memory bandwidth is insufficient
[Figure: two processors (P), each with four local memory banks (M), coupled by QPI / HT; IO attached]
Distributed memory system (!)
I nodes are using different memories
I each node has its own address space
I memory load/store instructions cannot reach the memory of other nodes
I memory bus/interconnect is independent
I memory bandwidth scales with the number of nodes
I network access via a Network Interface Controller (NIC)
I network links connected via Routers (R)
[Figure: four nodes, each consisting of memory (M), processor (P), network interface controller (NIC) and router (R)]
Hierarchical Parallel Computer System (!)
I the different levels of a parallel system: cores, processors, nodes, supernodes, system
[Figure: hierarchical machine]
Hierarchical Parallel Computer System (!)
vector floating point units (FPU) ( 2 - 8 per core )
cores (C) ( 4 - 64 per CPU, 500 per GPU)
processors (CPU) ( 2 - 8 per node )
nodes (N) ( 1-8 per supernode )
supernodes, blades, cages, ... (2 - 8 per rack )
rack (R) (10 - 1000 per system)
distributed system
I by multiplying small numbers we get a very large number
addressing of the total memory (!)
I shared memory
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
I distributed memory
locally addressable
nodes may have shared memory
I Partitioned Global Address Space (PGAS)
I globally addressable, every processor can address the memory in all other processors or nodes
Remote Direct Memory Access (RDMA) done by the Network Interface Controller (NIC)
enables simpler handling of global data structures
examples of large machines
HLRS machine Cray XE6
I Cray XE6
I 3500 nodes, each with two AMD Interlagos processors
I Gemini interconnect
I in operation since November 2011
I 16 C x 2 CPU x 96 N x 38 R = 116736 Cores (HLRS 2011)
I 32 GB x 96 N x 38 R = 116 TB memory
I 1.8 PB disk
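The core and memory totals above are plain products over the levels of the hierarchy. A small sketch, with a class and helper name chosen only for this illustration, reproduces the arithmetic from the factors given on the slide.

```java
// Sketch: system size as the product of the per-level counts.
final class SystemSize {
    // Multiplies all factors; long avoids overflow for large systems.
    static long product(long... factors) {
        long result = 1;
        for (long f : factors) {
            result *= f;
        }
        return result;
    }

    public static void main(String[] args) {
        // 16 cores x 2 CPUs x 96 nodes x 38 racks
        System.out.println(product(16, 2, 96, 38) + " cores");
        // 32 GB x 96 nodes x 38 racks
        System.out.println(product(32, 96, 38) + " GB memory");
    }
}
```

Both products give 116736, i.e. 116736 cores and 116736 GB, which is about 116 TB of memory.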
Cray XE6 at HLRS
K-computer at Riken
I Fujitsu
I 10-13 MW
I finished in 2012
I 8 C x 4 N x 24 SN x 864 R = 82944 CPUs = 663552 Cores
I 10.6 PetaFlop/s peak performance
Blue Waters System
I >235 Cray XE6 cabinets
I >30 Cray XK6 cabinets
I >25000 compute nodes
I >1 TB/s usable storage bandwidth
I >1.5 PB aggregate system memory
I 4 GB Memory per core
I >9000 Gemini network cables
I 3D torus interconnect topology
I >17000 disks
I >190000 memory DIMMS
I >25 PB usable storage
I >11.5 PF peak performance
I >49000 AMD processors
I >380000 cores
I >3000 GPUs
I 100 GB/s bandwidth to near-line storage
Blue Waters machine under construction
http://timelapse.ncsa.illinois.edu/pcf/inside2/index.php
http://www.ncsa.illinois.edu/BlueWaters/system.html
parameters describing hardware
Moore’s Law 1 (!)
Moore’s Law 2 (!)
I Moore's Law from 1965 states that the number of components on an integrated circuit doubles every 18 (24) months
I this is the reason that computers are inexpensive despite their complexity
I for some time it was understood as doubling the frequency every 18 months
I http://de.wikipedia.org/wiki/Mooresches_Gesetz
stagnating frequency (!)
I the power consumption depends heavily on the frequency of the processor
power ∼ frequency^3
I the number of cores can be increased, if the frequency is reduced
I List of Intel processors: http://www.intel.com/pressroom/kits/quickreffam.htm
I data from Gerhard Wellein / Erlangen
Floating Point Performance Parameters (!)
I clock rate of the core/processor
frequency at which instructions are scheduled (the clock period is its inverse); typical 1 – 4 GHz
I number of floating point operations per clock tick
instruction parallelism in the core/processor [FLOPs/FP-unit]; typical 2 – 8
I peak performance of processor
maximum number of floating point operations per second [MFLOP/s, GFLOP/s, TFLOP/s]; typical 4 – 100 GFLOP/s
I total peak performance of machine
the number of nodes defines the size of the cluster; typical 1 – 25000
I total performance = frequency x # FP-units per core x # cores per processor x # processors per node x # nodes
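The product formula can be evaluated directly. The sketch below is an illustration with assumed names. The sample values (2.0 GHz, 8 cores, 8 floating point operations per clock period, counting the 4 mult-plus operations of the AVX unit as 8 operations) are one possible reading of the Sandy Bridge data given earlier, not measured numbers.

```java
// Sketch: theoretical peak performance as a product of hardware parameters.
final class PeakPerformance {
    // Returns GFLOP/s; the frequency is given in GHz.
    static double peakGflops(double frequencyGHz, int fpOpsPerCycle,
                             int coresPerProcessor, int processorsPerNode, int nodes) {
        return frequencyGHz * fpOpsPerCycle * coresPerProcessor
                * processorsPerNode * nodes;
    }

    public static void main(String[] args) {
        // Assumed example: one 8-core processor at 2.0 GHz, 8 FP operations per cycle
        System.out.println(peakGflops(2.0, 8, 8, 1, 1) + " GFLOP/s");
    }
}
```

Under these assumptions a single processor reaches 128 GFLOP/s peak; real applications stay below this value because of memory and communication limits.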
Communication Parameters: Bandwidth (!)
I bandwidth: amount of data transferred per time through a transportation path
I typical parameter for all transportation systems
I for computers
L1 cache bandwidth (4x8 B/CP)
L2 cache bandwidth (4x8 B/CP)
L3 cache bandwidth (4x8 B/CP)
memory bandwidth
o (1, 2, 3, 4) channels x 8 B x bus frequency (at 1066, 1333, 1600 MHz) → up to 51.2 GB/s
interconnect bandwidth
o InfiniBand (SDR 1 GB/s, DDR 2 GB/s, QDR 4 GB/s, EDR 12.5 GB/s)
o Cray Gemini (5 GB/s x 2 directions)
o Ethernet (1 Gbit/s, 10 Gbit/s, 40 Gbit/s)
IO bandwidth
o SATA disk (60 MB/s)
o Solid State Device SSD (120 – 700 MB/s)
o RAID storage subsystem (1.5 – 10 GB/s)
o parallel Lustre file system (> 100 GB/s)
Communication Parameters: Latency (!)
I Latency: time to get the first data after sending the request
I connected to the length of the pipeline
I all system parts show a typical latency
L1 cache (2 - 4 CP)
L2 cache (10 - 20 CP)
L3 cache (25 - 70 CP)
memory (100 - 300 CP)
interconnect (1 - 50 µs)
disk (1 - 5 ms)
I the number of open transfers is limited by the number of open requests
I the achievable bandwidth is limited by the number of open requests:
$$ \text{bandwidth} \le \frac{\text{number of open requests} \times \text{transferred buffer size}}{\text{latency}} $$
I the latency is the startup time of the transfer pipeline
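The bound above is easy to evaluate. In the sketch below the class and method names, and the choice of 10 open requests, are assumptions made for illustration; the 64 B buffer and the 70 ns latency correspond to a cache line and a memory latency of the kind used elsewhere in these slides.

```java
// Sketch: upper bound on bandwidth from the number of open requests.
final class LatencyBound {
    // bytes per nanosecond equals GB/s (1 GB/s = 1 B/ns)
    static double maxBandwidthGBs(int openRequests, double bufferBytes, double latencyNs) {
        return openRequests * bufferBytes / latencyNs;
    }

    public static void main(String[] args) {
        System.out.println(maxBandwidthGBs(1, 64, 70));   // a single open request
        System.out.println(maxBandwidthGBs(10, 64, 70));  // ten open requests (assumed)
    }
}
```

The bound grows linearly with the number of requests in flight, which is why keeping many transfers open at once hides latency.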
Effective Memory Bandwidth (!)
I The nominal memory bandwidth of a modern PC system can be calculated by
$$ BW_{nom} = \text{path width} \times \text{number of channels} \times \text{bus frequency} $$
path width = 8 B, number of channels = 1, 2, 3, 4, bus frequency = 1066, 1333, 1600, 1866, 2133 MHz
I The effective memory bandwidth is smaller than the nominal bandwidth
$$ BW_{eff} = \frac{\text{size of data transferred}}{\text{latency} + \text{transfer time}} $$
The transfer time is related to the nominal bandwidth by
$$ \text{transfer time} = \frac{\text{size of data transferred}}{BW_{nom}} $$
I The effective bandwidth is influenced by the memory or cache latency
$$ BW_{eff} = \frac{\text{size of data transferred}}{\text{latency} + \dfrac{\text{size of data transferred}}{BW_{nom}}} $$
Example 1
I Intel's Sandy Bridge processor has 4 channels running at a bus frequency of 1600 MHz. The channels are 8 B wide.
$$ BW_{nom} = 8\,\text{B} \times 4 \times 1600\,\text{MHz} = 51.2\,\text{GB/s} $$
The memory latency is 70 ns.
I The resulting effective bandwidth for a single cache line of 64 B is
$$ BW_{eff} = \frac{64\,\text{B}}{70\,\text{ns} + \dfrac{64\,\text{B}}{51.2\,\text{GB/s}}} = \frac{64\,\text{B}}{70\,\text{ns} + 1.25\,\text{ns}} = 0.898\,\text{GB/s} \;(!) $$
I We are far from the nominal bandwidth!
I The estimations are not completely correct:
the cacheline uses a single channel; the cacheline is transported with bus clock
Example 2
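I The resulting effective bandwidth for 100 cache lines of 64 B is
$$ BW_{eff} = \frac{100 \times 64\,\text{B}}{70\,\text{ns} + \dfrac{100 \times 64\,\text{B}}{51.2\,\text{GB/s}}} = \frac{100 \times 64\,\text{B}}{70\,\text{ns} + 125\,\text{ns}} = 32.82\,\text{GB/s} \;(!) $$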
I How to overcome this problem?
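The two examples can be reproduced with a few lines of code. The sketch uses the values from the slides (51.2 GB/s nominal bandwidth, 70 ns latency, 64 B cache lines); the class and method names are assumptions for this illustration, and the model ignores the refinements mentioned in Example 1.

```java
// Sketch: effective bandwidth of one transfer with a fixed latency.
final class EffectiveBandwidth {
    // bytes / (GB/s) yields nanoseconds, because 1 GB/s equals 1 B/ns.
    static double effectiveGBs(double bytes, double latencyNs, double nominalGBs) {
        double transferNs = bytes / nominalGBs;
        return bytes / (latencyNs + transferNs);
    }

    public static void main(String[] args) {
        double nominal = 8 * 4 * 1.6;   // 8 B x 4 channels x 1.6 GHz = 51.2 GB/s
        System.out.println(effectiveGBs(64, 70, nominal));        // one cache line
        System.out.println(effectiveGBs(100 * 64, 70, nominal));  // 100 cache lines
    }
}
```

The latency is paid once per transfer, so larger transfers amortize it and come closer to the nominal bandwidth.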
Prefetching (!)
I The data access has to be started before the data are consumed
I Prefetching initiates the data access independently of the use of the data
I difficult to handle
I may be counterproductive for data already in caches
Types of Parallelism (!)
I single instruction single data (SISD)
simple or traditional processor
I single instruction multiple data (SIMD)
vector model
within one multiprocessor of a graphics card
I multiple instructions multiple data (MIMD)
multicore processors as shared memory processors
different multiprocessors of graphics cards
I multiple programs multiple data (MPMD)
multicore processors for different processes
distributed memory processors
I multiple instructions single data (MISD) ??
I SISD, SIMD, MIMD, MISD: Flynn's notation
I pipeline parallelism
pipeline parallelism (!)
I phase shifted parallelism
I very general principle in production, transport, flow of goods
I assembly line in production systems
I important model in bureaucracy (first come, first served)
I different parts are in operation at different stages at the same time
I dominant in modern computer architectures
I in computer:
accessing caches or memory within a loop
accessing data from a remote processor
input and output system
I description model of the pipeline
the pipeline is filled after startup steps. This number is the length of the pipeline and at the same time the degree of its parallelism. It can also be understood as latency.
performance modeling
performance behaviour of the pipeline (!)
I The pipeline mechanism is characterized by the
fixed startup or latency costs or time for filling the pipeline
time per unit ∆T
operations per unit op
performance perf
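I The resulting performance is
$$ \mathrm{perf}(n) = \frac{n \cdot \mathrm{op}}{\mathrm{startup} + n \cdot \Delta T} $$
I peak performance is the limit for increasing n
$$ \mathrm{perf}_\infty = \lim_{n \to \infty} \mathrm{perf}(n) = \frac{\mathrm{op}}{\Delta T} $$
I getting half of the peak performance for $n_{1/2} = \mathrm{startup} / \Delta T$
$$ \mathrm{perf}(n_{1/2}) = \frac{\frac{\mathrm{startup}}{\Delta T} \cdot \mathrm{op}}{\mathrm{startup} + \frac{\mathrm{startup}}{\Delta T} \cdot \Delta T} = \frac{1}{2} \frac{\mathrm{op}}{\Delta T} = \frac{1}{2} \mathrm{perf}_\infty $$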
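The pipeline model can be checked numerically. In this sketch the class and method names are assumptions, and the sample parameters (startup of 100 time units, 1 time unit per step, 2 operations per step) are chosen only to make the numbers easy to follow.

```java
// Sketch of the pipeline performance model perf(n) = n*op / (startup + n*dT).
final class PipelineModel {
    static double perf(double n, double op, double startup, double deltaT) {
        return n * op / (startup + n * deltaT);
    }

    // Limit of perf(n) for n going to infinity.
    static double peak(double op, double deltaT) {
        return op / deltaT;
    }

    // Number of units at which half of the peak performance is reached.
    static double nHalf(double startup, double deltaT) {
        return startup / deltaT;
    }

    public static void main(String[] args) {
        double op = 2, startup = 100, deltaT = 1;   // assumed sample values
        double n = nHalf(startup, deltaT);
        System.out.println("n_1/2 = " + n);
        System.out.println("perf(n_1/2) = " + perf(n, op, startup, deltaT));
        System.out.println("peak = " + peak(op, deltaT));
    }
}
```

For these values n_1/2 is 100 and the performance there is exactly half of the peak, so a long startup means that many units are needed before the pipeline pays off.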
Amdahl's Law: limits of parallel speedup (!)
I Assume a parallelizable program running a fixed size test case having a
parallel part taking a total time $t_{par}$
serial (non parallelizable) part taking a total time $t_{ser}$
total time is $t_{tot} = t_{par} + t_{ser}$
I We will run the program with p parallel processes
I The speedup is
$$ \mathrm{speedup}(p) = \frac{t_{tot}}{t_{par}/p + t_{ser}} \le \frac{t_{tot}}{t_{ser}} $$
I and the efficiency
$$ \mathrm{efficiency}(p) = \frac{\mathrm{speedup}(p)}{p} $$
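A short sketch evaluates speedup and efficiency for a sample case. The names and the sample times (99 time units parallel, 1 time unit serial) are assumptions made for illustration.

```java
// Sketch of Amdahl's law for a fixed size test case.
final class Amdahl {
    static double speedup(double tPar, double tSer, int p) {
        return (tPar + tSer) / (tPar / p + tSer);
    }

    static double efficiency(double tPar, double tSer, int p) {
        return speedup(tPar, tSer, p) / p;
    }

    public static void main(String[] args) {
        double tPar = 99, tSer = 1;   // assumed: 1 % serial part
        for (int p : new int[] {1, 10, 100, 1000, 10000}) {
            System.out.println(p + " processes: speedup " + speedup(tPar, tSer, p)
                    + ", efficiency " + efficiency(tPar, tSer, p));
        }
    }
}
```

With a serial part of 1 % the speedup can never exceed 100, however many processes are used, and the efficiency keeps falling as p grows.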
Amdahl's Law: limits of parallel speedup (!)
I These assumptions hold for any type of parallel work!
I the pictures show the speedup as a function of $t_{ser}/(t_{par}+t_{ser})$ in percent.
I The behaviour is not encouraging!
I Why can large scale parallel computing be successful?
I the speedup behaviour of a parallel program for a fixed size test case is named strong scaling
Amdahl's Law with communication
I even worse: assume that communication times increase with the number of processes
parallel part taking a total time $t_{par}$
assume the communication time $com \cdot p$ is proportional to the number of processes; other dependencies to be discussed
serial part taking a total time $t_{ser}$
total time is $t_{tot} = t_{par} + t_{ser}$
I We will run the program with p parallel processes
I The speedup is
$$ \mathrm{speedup}(p) = \frac{t_{tot}}{t_{par}/p + com \cdot p + t_{ser}} \le \frac{t_{tot}}{com \cdot p + t_{ser}} $$
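With a communication term that grows with p, the speedup has a maximum and then decreases. Setting the derivative of the denominator to zero gives the optimum near p = sqrt(t_par / com). The sketch below searches for the best process count; the names and the sample values (t_par = 100, t_ser = 1, com = 0.01) are assumptions made for illustration.

```java
// Sketch: Amdahl's law with a communication time com * p.
final class AmdahlWithCommunication {
    static double speedup(double tPar, double tSer, double com, int p) {
        return (tPar + tSer) / (tPar / p + com * p + tSer);
    }

    // Process count in 1..maxP with the highest speedup (simple linear search).
    static int bestProcessCount(double tPar, double tSer, double com, int maxP) {
        int best = 1;
        for (int p = 2; p <= maxP; p++) {
            if (speedup(tPar, tSer, com, p) > speedup(tPar, tSer, com, best)) {
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double tPar = 100, tSer = 1, com = 0.01;   // assumed sample values
        int best = bestProcessCount(tPar, tSer, com, 10000);
        System.out.println("best p = " + best
                + ", speedup = " + speedup(tPar, tSer, com, best));
        System.out.println("speedup at p = 10000: " + speedup(tPar, tSer, com, 10000));
    }
}
```

For these values the optimum lies at p = 100, in agreement with sqrt(100 / 0.01); adding more processes beyond that point makes the program slower.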
Gustafson: increase the work 1 (!)
I Amdahl assumes a fixed amount of work.
I But what can we do in a fixed amount of time?
parallel part taking a total time $t_{par} = t_{proc} \cdot p$ proportional to the number of processors
serial part taking a total time $t_{ser}$
I The speedup is
$$ \mathrm{speedup}(p) = \frac{t_{proc}\,p + t_{ser}}{t_{proc}\,p/p + t_{ser}} = \frac{t_{proc}}{t_{proc}+t_{ser}}\,p + \frac{t_{ser}}{t_{proc}+t_{ser}} $$
I and the efficiency
$$ \mathrm{efficiency}(p) = \frac{t_{proc}}{t_{proc}+t_{ser}} + \frac{t_{ser}}{t_{proc}+t_{ser}} \cdot \frac{1}{p} $$
I The efficiency decreases to the lower limit $t_{proc}/(t_{proc}+t_{ser})$.
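The scaled speedup can be evaluated in the same way as Amdahl's law. The names and the sample times (99 time units of parallel work per process, 1 time unit serial) are assumptions made for illustration; communication is neglected here.

```java
// Sketch of Gustafson's scaled speedup: the work grows with p.
final class Gustafson {
    static double speedup(double tProc, double tSer, int p) {
        return (tProc * p + tSer) / (tProc + tSer);
    }

    static double efficiency(double tProc, double tSer, int p) {
        return speedup(tProc, tSer, p) / p;
    }

    public static void main(String[] args) {
        double tProc = 99, tSer = 1;   // assumed sample values
        for (int p : new int[] {1, 10, 100, 1000}) {
            System.out.println(p + " processes: speedup " + speedup(tProc, tSer, p)
                    + ", efficiency " + efficiency(tProc, tSer, p));
        }
    }
}
```

Here the speedup grows linearly with p and the efficiency stays above t_proc / (t_proc + t_ser) = 0.99, in contrast to the fixed size case.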
Gustafson: increase the work 2 (!)
I There are a lot of simulation problems of this type!
I But we are neglecting any necessary communication.
I the speedup behaviour of a parallel program for a test case which grows with the number of active processors/cores is named weak scaling
Thank you for your attention!
Kuester[at]hlrs.de