Review. Effect of Memory Latency. Effect of Cache Memory. Effect of Cache Line or Block Size. Parallel computer architectures


Review

• Introduction to parallel & distributed computing – What is parallel computing?

– What is distributed computing?

• PC architecture

– CPU, chipset (north & south bridges), interconnections

– Key performance components: CPU, memory, interconnections

– Effect of memory latency

Effect of Memory Latency

• Effect of memory latency on performance

– A single 1GHz processor with 2 multiply-add units that is capable of 4 floating-point calculations per cycle (1 ns) → 4GFLOPS

– Assume a memory latency of 100 ns for fetching/storing 1 operand: the processor needs to wait 100 cycles before it can perform operations

– Consider a program computing the dot product of two vectors (one addition and one multiplication per element). Two data fetches (200 ns) and two floating-point operations (2 ns) are needed for each element of the vector. The peak speed is therefore ~10MFLOPS → 1/400 or 0.25% efficiency
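The arithmetic on this slide can be written as a small back-of-envelope model. This is only a sketch: the function and parameter names are illustrative assumptions, and like the slide it ignores the ~2 ns of arithmetic next to the 200 ns spent waiting for memory.

```cpp
#include <cassert>

// Sustained rate in MFLOPS when every operand fetch pays the full memory
// latency and computation is not overlapped with memory access.
double sustained_mflops(double flops_per_element,
                        double fetches_per_element,
                        double latency_ns) {
    assert(fetches_per_element > 0.0 && latency_ns > 0.0);
    // Time per element is dominated by the fetches.
    double ns_per_element = fetches_per_element * latency_ns;
    // Operations per ns is GFLOPS; multiply by 1000 to get MFLOPS.
    return flops_per_element / ns_per_element * 1000.0;
}

// Fraction of the theoretical peak that is actually achieved.
double efficiency(double sustained, double peak_mflops) {
    assert(peak_mflops > 0.0);
    return sustained / peak_mflops;
}
```

With the slide's numbers, sustained_mflops(2, 2, 100) gives about 10 MFLOPS, and efficiency(10, 4000) gives 0.0025, i.e. 1/400.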


Effect of Cache Memory

• Continue previous example, and introduce a 32KB cache memory between processor & memory

– 1GHz processor, 4GFLOPS theoretical peak, 100ns memory latency

– Assume 1ns cache latency (full-speed cache)

– Multiply two matrices A & B of dimensions 32x32 (8KB or 1K words for each matrix)

– Fetching two matrices into cache: 200000ns = 200 µs.

– Multiplying 2 NxN matrices ~ 2N³ operations = 2¹⁶ operations, needs 16K cycles or 16 µs

– Total computation time = 216 µs

– FLOPS = 2¹⁶ operations / 216 µs ~ 303MFLOPS → 30 times improvement

– But still < 10% CPU efficiency
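The same estimate can be reproduced with a few lines of code. This is a sketch under the slide's assumptions (one word per memory request, full-speed cache, no overlap of fetch and compute); the function name and parameters are illustrative. Without the slide's rounding of 2K words to 2000 and 16K cycles to 16000, the result is about 296 MFLOPS rather than 303.

```cpp
#include <cassert>

// Estimated MFLOPS for multiplying two n x n matrices that are first
// loaded, word by word, from main memory into a full-speed cache.
double cached_matmul_mflops(int n, double mem_latency_ns, double flops_per_ns) {
    assert(n > 0 && mem_latency_ns > 0.0 && flops_per_ns > 0.0);
    double words      = 2.0 * n * n;             // matrices A and B
    double fetch_ns   = words * mem_latency_ns;  // one word per request
    double flops      = 2.0 * n * n * n;         // one multiply + one add per step
    double compute_ns = flops / flops_per_ns;    // runs at peak once data is cached
    // Operations per ns is GFLOPS; multiply by 1000 to get MFLOPS.
    return flops / (fetch_ns + compute_ns) * 1000.0;
}
```

For n = 32, 100 ns latency, and 4 operations per ns, the fetch phase (204800 ns) still dominates the compute phase (16384 ns), which is why the efficiency stays below 10%.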


Effect of Cache Line or Block Size

• Block size/cache line refers to the size of memory returned from a single memory request

• The same machine, with a block size of 4 words. Each memory request returns 4 words of data instead of one → 4 times more operations can be performed in 100 ns: 40MFLOPS

• However, this technique only helps for contiguous data layouts (e.g. arrays)

• Programs need to be written such that the data access pattern is sequential to make use of the cache line.

– Fortran: column-wise storage for 2-D arrays

– C: row-wise storage for 2-D arrays
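The difference between the two storage orders comes down to how a 2-D index maps to a linear offset. The helper names below are hypothetical and only meant as a sketch of that mapping, plus a traversal that walks a row-major array in address order.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (i, j) in a rows x cols array stored row by row (C/C++).
std::size_t row_major(std::size_t i, std::size_t j, std::size_t cols) {
    assert(j < cols);
    return i * cols + j;
}

// Offset of the same element when stored column by column (Fortran).
std::size_t col_major(std::size_t i, std::size_t j, std::size_t rows) {
    assert(i < rows);
    return j * rows + i;
}

// Sum a row-major matrix with the column index innermost, so consecutive
// iterations touch consecutive addresses and use every word of a cache line.
double sum_row_major(const double* a, std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            s += a[row_major(i, j, cols)];
    return s;
}
```

In a row-major array, moving from (i, j) to (i, j+1) advances the offset by 1, while moving to (i+1, j) advances it by cols. For column-major storage the roles are swapped.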

Parallel computer architectures

classical modern

Classical #1:

Shared-Memory Parallel Computer

[Figure: five CPUs, each with its own cache memory, attached to a system bus]

Shared-Memory Parallel Computer

• Typical Machine

– Multi-socket and/or Multi-core machines

• Shared Memory

– Programming through threading

– Multiple processors share a pool of memory

– Problems: cache coherence

– UMA vs. NUMA architecture

• Pros:

– Easier to program (probably)

• Cons:

– Performance may suffer if the memory is located on distant processors/machines

– Limited scalability

UMA vs. NUMA


Read: http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access http://en.wikipedia.org/wiki/Uniform_Memory_Access

[Figure: Uniform Memory Access vs. Non-Uniform Memory Access; labels: CPU, cache memory, main memory, system bus]

Classical #2:

Distributed-Memory Parallel Computer

[Figure: five nodes, each with a CPU, cache memory, and local memory, connected by a network]

Distributed-Memory Parallel Computer

• Typical machine

– Beowulf Clusters, Clusters

• Distributed Memory

– Programming through processes

– Explicit message passing, typically over MPI

– Networking

• Pros:

– Tighter control on message passing

• Cons:

– Harder to program

• Modern supercomputers are hybrids!


Beowulf Cluster

• Is a form of parallel computer

• Thomas Sterling, Donald Becker, … (1994)

• Emphasize the use of COTS

– COTS: Commodity Off-The-Shelf components

– Intel, AMD processors

– Gigabit Ethernet (1000Mbps)

– Linux

• A dedicated facility for parallel processing

– Non-dedicated: NOW (Network Of Workstations)

• Performance/price ratio is significantly higher than traditional supercomputers!

Modern Parallel Computer Architecture

Hybrid

Heterogeneous


Modern #1:

Heterogeneous Architecture

• Heterogeneous computing systems refer to systems that use a variety of different types of computational units:

– general-purpose processor (e.g. CPU)

– special-purpose processor (e.g. digital signal processor (DSP))

– graphics processing unit (GPU)

– co-processor (e.g. 8087, 80287)

– custom acceleration logic (application-specific integrated circuit (ASIC) or field-programmable gate array (FPGA))

• In general, a heterogeneous computing platform consists of processors with different instruction set architectures (ISAs).


http://en.wikipedia.org/wiki/Heterogeneous_computing

Modern #2:

Hybrid Architecture

• Using multi-core / multi-socket / heterogeneous nodes in clusters

• Example #1: Earth Simulator (the fastest supercomputer in the world from 2002 to 2004)

http://www.es.jamstec.go.jp/esc/eng/index.html

Modern Example #2:

#1 supercomputer – Tianhe-2

• 16,000 nodes interconnected

• Each node

– 2 × Intel Xeon E5-2692 12-core 2.2GHz CPUs

– 3 × Intel Xeon Phi 31S1P (57 cores, 1.1GHz)

– 64 + 24 GB memory

http://www.olcf.ornl.gov/titan/

#1 (Tianhe-2, China): 3,120,000 cores, 33,862.7 TFlop/s (Rmax), 54,902.4 TFlop/s (Rpeak), 1,024,000 GB RAM, 17,808 kW, 16,000 nodes. Each node has 2 Intel CPUs & 3 Intel Xeon Phi.

From HW1 Reading

• What is von Neumann bottleneck?

• What is SISD? SIMD?

• Which unit takes the biggest area in modern CPU chips?

• Section 1.2 CPU micro-architecture

– How does pipelining work? (Wind-up, Stall, Pipeline bubbles, loop carried dependencies)

– Superscalar

– SIMD

– Out-of-order execution

– CISC-Micro-Ops-RISC

From HW1 Reading

• Section 1.3 cache

– What’s cache hit? Cache miss? Unified cache?

– Locality of reference: temporal locality, spatial locality

– Streaming pattern?

– Cache line

– Write miss → cache allocate (read, then write) with write-back strategy

– Fully associative cache vs. direct-mapped cache vs. n-way set-associative cache

– Prefetch

• Multicore, Multithreading processors (SMT)

Lecture 2

Source Code Optimization


Topics

• Declarations

– Data type considerations

– Functions

– Memory alignments

• Array access and loop optimization

• Other considerations

– Integer division

– Branching logic

– Float literals

– Avoid pointer usage

Declarations

Data Type Considerations

• Integer data type

– unsigned:

    • Division and remainders

    • Loop counters: for(size_t i=0;i<1000;i++) { … }

    • Array indexing

– signed:

    • Integer to float conversion

– Use 32- or 64-bit integers on 32/64-bit platforms. Use shorter integers only when memory usage is of concern

Declarations

Functions

• If a function is referenced locally (file scope), then declare it as static

static int fun() {
    …
    …
}

• Function prototypes

– Use the “const” type qualifier

static int fun(const int a, const double *b) {
    …
    …
}

– By the way, what is a constant pointer?

Constant pointer

#include <iostream>

using namespace std;

int main() {
    const int *a;
    int p[] = {1, 2, 3};
    int q[] = {4, 5, 6};
    int i;

    a = p;
    for(i=0;i<3;i++) {
        cout << a[i];
        a[i] = 0;
    }

    a = q;
    for(i=0;i<3;i++) {
        cout << a[i];
        a[i] = 0;
    }
    return 0;
}

Which line above will cause a compilation error?
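As a hint for the question, the sketch below contrasts the two declarations that are easy to confuse: a pointer to const data (as in the slide's `const int *a`) and a const pointer to modifiable data. The function names are hypothetical, and the lines that would be rejected by the compiler are left as comments so the example still builds.

```cpp
#include <cassert>

// Sums p[0..2] and q[0..2] through a pointer to const that is re-pointed
// from one array to the other.
int sum_via_pointer_to_const(const int* p, const int* q) {
    assert(p != nullptr && q != nullptr);
    const int* a = p;          // pointer to const int: the pointee is read-only
    int total = 0;
    for (int i = 0; i < 3; ++i) total += a[i];
    a = q;                     // allowed: the pointer itself is not const
    for (int i = 0; i < 3; ++i) total += a[i];
    // a[0] = 0;               // would not compile: writes through a pointer to const
    return total;
}

// Zeroes p[0..2] through a const pointer to non-const int.
void zero_via_const_pointer(int* p) {
    assert(p != nullptr);
    int* const b = p;          // const pointer: the address it holds is fixed
    for (int i = 0; i < 3; ++i) b[i] = 0;   // allowed: the pointee is writable
    // b = nullptr;            // would not compile: b itself is const
}
```

Reading the declaration from right to left helps: `const int *a` is "a pointer to an int that is const", while `int *const b` is "a const pointer to an int".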

Declarations

Memory Alignment

• Pointer Alignment

– Base pointers should align to a 16-byte boundary (to make it easier for compilers to optimize your code with SIMD instructions)

• Aligning Arrays/Matrices

– The lowest dimension (rows in C/C++) should be a multiple of 16, and each row should start at a 16-byte boundary (to use SIMD)

– This technique is commonly known as padding.

– This may also cause a problem known as thrashing on direct-mapped caches

• Sorting and Padding C and C++ Structures

– Sort members from the largest to the smallest

– Add padding to make sure that each member in an AOS (array of structs) can align naturally.

• Reading
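The effect of member ordering on structure size, and one way to request 16-byte alignment, can be sketched as follows. The struct names are made up for illustration. Exact sizes are implementation-defined; on a typical x86-64 compiler the unsorted struct occupies 24 bytes and the sorted one 16 bytes, but that is an assumption about the platform, not a guarantee.

```cpp
#include <cstddef>

// Members in an arbitrary order: the compiler typically inserts padding
// after each char so that the double and the int stay naturally aligned.
struct Unsorted {
    char   flag;
    double value;
    char   tag;
    int    count;
};

// The same members sorted from largest to smallest, which leaves only
// trailing padding.
struct Sorted {
    double value;
    int    count;
    char   flag;
    char   tag;
};

// A small vector type whose storage starts on a 16-byte boundary
// (alignas is standard since C++11).
struct alignas(16) Vec4 {
    float x[4];
};

// Compile-time checks that hold on any conforming implementation.
static_assert(alignof(Vec4) == 16, "Vec4 must be 16-byte aligned");
static_assert(sizeof(Vec4) % 16 == 0, "array elements of Vec4 stay aligned");
```

Printing sizeof(Unsorted) and sizeof(Sorted) on the target machine is a quick way to see how much padding a given ordering costs.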


Array Access and Loop Optimization

• Loop jamming / Loop fusion

• Loop fission / Loop distribution

• Move invariants out of loops

– Loop un-switching

• Loop peeling

• Loop interchange

• Loop unrolling

• Loop unrolling and sum reduction


Loop Jamming / Loop Fusion

Separated:

for(i=0;i<N;i++) {
    b[i] += 1;
}
for(i=0;i<N;i++) {
    y += b[i];
}

Fused:

for(i=0;i<N;i++) {
    b[i] += 1;
    y += b[i];
}

i7-950, icpc -O2, N=10000000
Separated: 0.0106071 secs.
Fused: 0.00695872 secs.

Loop Fission / Loop Distribution

Fused:

for(i=0;i<N;i++) {
    x += b[i];
    y += c[i];
}

Separated:

for(i=0;i<N;i++) x += b[i];
for(i=0;i<N;i++) y += c[i];

i7-950, icpc -O2, N=1000000
Fused: 0.000468991 secs.
Separated: 0.000335743 secs.

Move invariants out of loops

If() - Loop unswitching

Inside:

for(i=0;i<N;i++) {
    for(j=0;j<N;j++) {
        if(a[i] > 100.0) b[i] = a[i] - 3.7;
        x = x + a[j] + b[i];
    }
}

Outside:

for(i=0;i<N;i++) {
    if(a[i] > 100.0) b[i] = a[i] - 3.7;
    for(j=0;j<N;j++) {
        x = x + a[j] + b[i];
    }
}

i7-950, icpc -O2, N=10000
Inside: 0.203336 secs.
Outside: 0.0930612 secs.

Move invariants out of loops

If() - Loop unswitching

Before:

for(i=0;i<N;i++) {
    x[i] = x[i] + y[i];
    if(w) z[i]=0.0;
}

After:

if(w) {
    for(i=0;i<N;i++) {
        x[i] = x[i] + y[i];
        z[i]=0.0;
    }
} else {
    for(i=0;i<N;i++) {
        x[i] = x[i] + y[i];
    }
}

i7-950, icpc -O2, N=1000000
Before: 0.00160437 secs.
After : 0.00156413 secs.

AMD Software Optimization Guide: Loop Hoisting

Move invariants out of loops

Math operations

Inside:

for(i=0;i<N;i++) {
    a[i] = 0.0;
    for(j=0;j<N;j++) {
        a[i] += b[j] * d[j] / c[i];
    }
}

Outside:

for(i=0;i<N;i++) {
    a[i] = 0.0;
    for(j=0;j<N;j++) {
        a[i] += b[j] * d[j];
    }
    a[i] /= c[i];
}

i7-950, icpc -O2, N=10000
Inside: 0.0617907 secs.
Outside: 0.0157715 secs.

Loop Peeling

j = N-1;

for(i=0;i<N;i++) {

b[i] = (a[i] + a[j]) * 0.5;

j = i;

}

b[0] = (a[0] + a[N-1]) * 0.5;

for(i=1;i<N;i++) {

b[i] = (a[i] + a[i-1]) * 0.5;

}
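To check that peeling the first iteration preserves the result, both forms can be wrapped in functions and compared. This is a minimal sketch; the function names and the use of std::vector are illustrative choices, not part of the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original form: j trails i by one and wraps around on the first iteration,
// which forces the loop body to carry the extra variable j.
std::vector<double> smooth_wraparound(const std::vector<double>& a) {
    const std::size_t n = a.size();
    assert(n > 0);
    std::vector<double> b(n);
    std::size_t j = n - 1;
    for (std::size_t i = 0; i < n; ++i) {
        b[i] = (a[i] + a[j]) * 0.5;
        j = i;
    }
    return b;
}

// Peeled form: the special first iteration is handled before the loop,
// leaving a uniform body that only uses a[i] and a[i-1].
std::vector<double> smooth_peeled(const std::vector<double>& a) {
    const std::size_t n = a.size();
    assert(n > 0);
    std::vector<double> b(n);
    b[0] = (a[0] + a[n - 1]) * 0.5;
    for (std::size_t i = 1; i < n; ++i) {
        b[i] = (a[i] + a[i - 1]) * 0.5;
    }
    return b;
}
```

Both functions perform the same additions in the same order, so their outputs are identical element by element.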


Loop Interchange

Stride minimization

Stride 1:

for(i=0;i<N;i++) {
    for(j=0;j<N;j++) {
        c[i][j] += a[i][j] + b[i][j];
    }
}

Stride N:

for(j=0;j<N;j++) {
    for(i=0;i<N;i++) {
        c[i][j] += a[i][j] + b[i][j];
    }
}

i7-950, icpc -O1, N=1000
Stride 1: 0.00162812 secs.
Stride N: 0.0117221 secs.

Loop Interchange

Exercise

for(i=0;i<NUM;i++)
    for(j=0;j<NUM;j++)
        for(k=0;k<NUM;k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
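One possible answer to the exercise, offered as a sketch rather than the intended solution: swapping the j and k loops makes the innermost loop walk along rows of both b and c. The Matrix alias and function names are assumptions made for this example; a vector of vectors keeps each row contiguous, which is all the argument needs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Order from the exercise (i, j, k): the inner loop reads b[k][j] with k
// varying, so it jumps from row to row of b.
void matmul_ijk(const Matrix& a, const Matrix& b, Matrix& c) {
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

// Interchanged order (i, k, j): the inner loop reads b[k][j] and updates
// c[i][j] with j varying, so both accesses have stride 1.
void matmul_ikj(const Matrix& a, const Matrix& b, Matrix& c) {
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
```

Each c[i][j] still accumulates its terms in increasing k, so the two orderings produce the same values.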


Loop Unrolling

Before:

for(i=0;i<N;i++)
    for(j=0;j<N;j++)
        for(k=0;k<4;k++)
            a[i][j] += b[k][i] * c[k][j];

After:

for(i=0;i<N;i++) {
    for(j=0;j<N;j++) {
        a[i][j] += b[0][i] * c[0][j];
        a[i][j] += b[1][i] * c[1][j];
        a[i][j] += b[2][i] * c[2][j];
        a[i][j] += b[3][i] * c[3][j];
    }
}

i7-950, icpc -O2 -unroll0, N=1000
Before: 0.00368654 secs.
After : 0.00116636 secs.

Loop Unrolling and Sum Reduction

Before:

for(i=0;i<N;i++)
    for(j=0;j<N;j++)
        for(k=0;k<4;k++)
            a += b[k][i] * c[k][j];

After:

for(i=0;i<N;i++) {
    for(j=0;j<N;j++) {
        a += b[0][i] * c[0][j];
        a += b[1][i] * c[1][j];
        a += b[2][i] * c[2][j];
        a += b[3][i] * c[3][j];
    }
}

AMD Software Optimization Guide: Explicit Parallelism in Code

i7-950, icpc -O2 -unroll0, N=1000
Before: 0.00186869 secs.
After : 0.000803386 secs.

Loop Unrolling and Sum Reduction

Further:

for(i=0;i<N;i++) {
    for(j=0;j<N;j++) {
        a1 += b[0][i] * c[0][j];
        a2 += b[1][i] * c[1][j];
        a3 += b[2][i] * c[2][j];
        a4 += b[3][i] * c[3][j];
    }
}
a = a1 + a2 + a3 + a4;

i7-950, icpc -O2 -unroll0, N=1000
Before: 0.00188231 secs.
After : 0.000798778 secs.
Further: 0.00107725 secs.

i7-950, icpc -O1 -unroll0, N=1000
Before: 0.00399741 secs.
After : 0.00364868 secs.

AMD Phenom 1065t, icpc -O2 -unroll0, N=1000
Before: 0.00238779 secs.
After : 0.000945025 secs.
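The idea behind the separate accumulators a1 to a4 is to break the chain of dependent additions. A minimal sketch on a simpler kernel (a dot product) is shown below; the function names are hypothetical. As the timings above suggest, whether this pays off depends on the compiler and CPU, and regrouping floating-point additions can change the rounding of the result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Straightforward reduction: every addition depends on the previous one.
double dot_single(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// Unrolled by four with independent partial sums that are combined once at
// the end. The size does not have to be a multiple of four.
double dot_four(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i) s0 += x[i] * y[i];   // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

For inputs whose products and sums are exactly representable (small integers, for instance) both functions return the same value; in general they may differ in the last bits.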

Other Considerations

• Avoid integer division

• Branching logic

• Float literals

• Avoid pointer usage

Other Considerations

Avoid integer division

• Replacing Integer Division with Multiplication

Before:

for(size_t i=0;i<N;++i) {
    a[i] /= 3;
}

After:

for(size_t i=0;i<N;++i) {
    a[i] *= 1.0 / 3.0;
}

i7-950, icpc -O2, N=1000000
Before: 0.00125388 secs.
After : 0.000616221 secs.

Other Considerations

branching logics

• Arrange Boolean Operands for Quick Expression Evaluation

Before:

for(size_t i=0;i<N;++i) {
    if(i<4 || i%4 == 0) {
        a[i] = 0;
    } else {
        a[i] = 100;
    }
}

After:

a[0] = a[1] = a[2] = a[3] = 0;
for(size_t i=4;i<N;++i) {
    if(i%4 != 0) {
        a[i] = 100;
    } else {
        a[i] = 0;
    }
}

i7-950, icpc -O2, N=1000000
Before: 0.00133116 secs.
After : 0.000616221 secs.

Other Considerations

float literal

Left (3.14):

float data[N];
for(size_t i=0;i<N;++i) {
    data[i] *= 3.14;
}

Right (3.14f):

float data[N];
for(size_t i=0;i<N;++i) {
    data[i] *= 3.14f;
}

Intel i7-950, icpc -O2, N=1000000
Left: 0.000616121 secs.
Right: 0.000290689 secs.
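The reason the suffix matters is the type of the literal: 3.14 is a double, so each float element is converted to double, multiplied, and converted back, while 3.14f keeps the whole expression in single precision. The sketch below checks the expression types at compile time; the function name scale is made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// float * double literal: the float operand is promoted, the product is a double.
static_assert(std::is_same<decltype(1.0f * 3.14), double>::value,
              "an unsuffixed literal makes the expression double");

// float * float literal: the product stays a float.
static_assert(std::is_same<decltype(1.0f * 3.14f), float>::value,
              "the f suffix keeps the expression in single precision");

// Scales n floats in place using a single-precision constant.
void scale(float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= 3.14f;
    }
}
```

The two versions can also give slightly different numerical results, because 3.14 rounded to float is not the same value as 3.14 rounded to double.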

Other Considerations

Avoid pointer usage

• Use array notation instead of pointer notation when working with arrays to avoid aliasing issues.

• Avoid frequently dereferenced pointer arguments in functions.

– They increase memory traffic.

Before (pointer notation):

void Plus(size_t n, int *ptrD, int *ptrE, int *ptrF) {
    int *ptrF1 = ptrF;
    ptrF++;
    ptrD++;
    ptrE++;
    for(size_t i=1;i<n;++i) {
        *ptrF++ = *ptrD++ + *ptrE++ + *ptrF1++;
    }
}

After (array notation):

for(size_t i=1;i<n; ++i) {
    ptrF[i] = ptrD[i] + ptrE[i] + ptrF[i-1];
}

Intel i7-950, icpc -O2, N=1000000
Before: 0.0013956000 secs.
After: 0.0010977000 secs.

Write Efficient Code

• Contiguous memory access helps to improve code efficiency. (why?)

• Loop unrolling & SIMD instructions (e.g. MMX, SSE, 3DNOW, …) can be used to help vector operations.

• Modern compilers (gcc, Intel compiler) can do loop unrolling automatically for you.

– icpc -O2 -unroll -xHost yourcode.cpp

– g++ -O2 -funroll-loops -march=native yourcode.cpp

• Compiler flags (such as those in the commands above) can make a huge difference to 1) computational efficiency and 2) results (sometimes).

Matrix operations

• For operations involving matrices, each element may be referenced more than once, which provides a better chance of optimizing the memory access pattern (reusing data that is already in cache memory)

• Even if elements (either vector or matrix) are referenced only once, how the memory is accessed still dictates the performance.

– We must pay attention to the 2-D data layout in memory.

• As modern computers have blocking memory access (lecture one, cache line: one memory request returns a block of memory), we need to work on adjacent entries first.

Summary

Technique                 Before (secs.)  After (secs.)  Improvement
Loop Interchange          0.011722        0.001628       619.98%
Loop Common Expression    0.061791        0.015772       291.79%
Unrolling                 0.003687        0.001166       216.07%
Loop Unswitching          0.093061        0.03108        199.43%
float literal             0.000616        0.000291       111.95%
Integer division          0.001254        0.000616       103.48%
branching logics          0.001331        0.000824       61.49%
Loop Peeling              9.38E-05        6.28E-05       49.38%
Loop Jamming              0.020384        0.014192       43.63%
Loop Fission              0.000469        0.000336       39.69%
Avoid pointer usage       0.001396        0.001098       27.14%
Unroll + Sum Reduction    0.003997        0.003649       9.56%
Loop Unswitching-2        0.001604        0.00154        4.21%

I did not have examples for some of the techniques introduced. But that does not mean they are not effective!

Readings

• AMD (2014), “Software Optimization Guide for AMD Family 15h Processors”, Chapter 3 – C and C++ source-level optimizations.

• Ahn, “Data Alignment”, On-line:

http://www.songho.ca/misc/alignment/dataalign.html

• SGI, “Optimizing Cache Utilization”, On-line:

http://techpubs.sgi.com/library/dynaweb_docs/0640/SGI_Developer/books/OrOn2_PfTune/sgi_html/ch06.html

• Next week’s lecture is on the following topics:

– Cache tiling / blocking – make sure you understand how cache works!

– Dynamic memory allocation for multi-dimensional arrays

– Command line arguments in C/C++

– Preprocessor directives

– Linking with external libraries and functions

– Writing Makefile