Review
• Introduction to parallel & distributed computing
– What is parallel computing?
– What is distributed computing?
• PC architecture
– CPU, chipset (north & south bridges), interconnections
– Key performance components: CPU, memory, interconnections
– Effect of memory latency
Effect of Memory Latency
• Effect of memory latency on performance
– A single 1 GHz processor with 2 multiply-add units is capable of 4 floating-point operations per cycle (1 ns), i.e. a peak of 4 GFLOPS
– Assume a memory latency of 100 ns for fetching or storing 1 operand: the processor must wait 100 cycles before it can operate on that operand
– Consider a program computing the dot product of two vectors (one addition and one multiplication per element). Each element needs two data fetches (200 ns) and two floating-point operations (2 ns). The achievable speed is therefore about 2 operations per 200 ns, or ~10 MFLOPS: 1/400 of peak, or 0.25% efficiency
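The arithmetic on this slide can be reproduced with a short sketch. The function and parameter names are illustrative and not part of the lecture; the model assumes, as the slide does, that memory accesses and computation never overlap.

```cpp
#include <cassert>

// Estimated sustained MFLOPS when every operand comes from main memory.
// Assumption: fetches and arithmetic are strictly serialized (no overlap).
double latencyBoundMflops(double fetchesPerElement,
                          double flopsPerElement,
                          double memoryLatencyNs,
                          double computeNsPerElement) {
    assert(fetchesPerElement > 0.0 && memoryLatencyNs > 0.0);
    const double nsPerElement =
        fetchesPerElement * memoryLatencyNs + computeNsPerElement;
    // Operations per nanosecond are GFLOPS; scale by 1000 to get MFLOPS.
    return flopsPerElement / nsPerElement * 1000.0;
}

// Fraction of the peak rate that is actually achieved (both in MFLOPS).
double efficiency(double achievedMflops, double peakMflops) {
    assert(peakMflops > 0.0);
    return achievedMflops / peakMflops;
}
```

With the slide's numbers, `latencyBoundMflops(2, 2, 100, 2)` gives a little under 10 MFLOPS, and `efficiency(..., 4000)` gives roughly 0.25%.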
Effect of Cache Memory
• Continuing the previous example, introduce a 32 KB cache between the processor and memory
– 1 GHz processor, 4 GFLOPS theoretical peak, 100 ns memory latency
– Assume 1 ns cache latency (full-speed cache)
– Multiply two matrices A and B of dimensions 32x32 (8 KB or 1K words for each matrix)
– Fetching the two matrices into cache: 2K words x 100 ns ≈ 200,000 ns = 200 µs
– Multiplying two NxN matrices takes ~2N^3 operations = 2^16 operations, which needs 16K cycles or 16 µs
– Total computation time ≈ 216 µs
– FLOPS = 2^16 operations / 216 µs ≈ 303 MFLOPS: a 30-fold improvement
– But still < 10% CPU efficiency
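The cache estimate can be sketched the same way. The names below are hypothetical; the model assumes each input word is loaded from memory exactly once and that the multiplication then runs at peak speed out of cache.

```cpp
#include <cassert>

// Estimated MFLOPS for multiplying two n x n matrices that are first
// loaded into cache word by word, then multiplied at peak rate.
// Assumption: load time and compute time simply add up.
double cachedMatmulMflops(int n, double memoryLatencyNs, double peakFlopsPerNs) {
    assert(n > 0 && memoryLatencyNs > 0.0 && peakFlopsPerNs > 0.0);
    const double words = 2.0 * n * n;                 // both input matrices
    const double loadNs = words * memoryLatencyNs;    // one fetch per word
    const double flops = 2.0 * n * n * n;             // ~2N^3 operations
    const double computeNs = flops / peakFlopsPerNs;  // time at peak speed
    return flops / (loadNs + computeNs) * 1000.0;     // flops/ns -> MFLOPS
}
```

For `cachedMatmulMflops(32, 100, 4)` the unrounded result is close to 296 MFLOPS; the slide's ~303 MFLOPS comes from rounding the times to 200 µs and 16 µs first.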
Effect of Cache Line or Block Size
• Block size (cache line size) refers to the amount of memory returned by a single memory request
• The same machine, with a block size of 4 words: each memory request returns 4 words of data instead of one, so 4 times more operations can be performed per 100 ns: 40 MFLOPS
• However, this technique only helps for contiguous data layouts (e.g. arrays)
• Programs need to be written so that the data access pattern is sequential to make use of the cache line
– Fortran: column-wise storage for 2-D arrays
– C: row-wise storage for 2-D arrays
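As an illustration of sequential versus strided access in C/C++ (row-wise storage), here is a minimal sketch. The function names are illustrative; the matrix is assumed to be stored row by row in one contiguous buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row order: the inner loop walks consecutive addresses (stride 1),
// so every word of each fetched cache line is used.
double sumRowOrder(const std::vector<double>& m, std::size_t n) {
    assert(m.size() == n * n);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            sum += m[i * n + j];
    return sum;
}

// Column order: the inner loop jumps n elements at a time (stride n),
// so for large n each access may touch a different cache line.
double sumColumnOrder(const std::vector<double>& m, std::size_t n) {
    assert(m.size() == n * n);
    double sum = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            sum += m[i * n + j];
    return sum;
}
```

Both functions compute the same sum; only the order in which memory is touched differs. In Fortran the roles would be reversed, because columns are contiguous there.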
Parallel computer architectures
Classical
Modern
Classical #1:
Shared-Memory Parallel Computer
[Figure: five CPUs, each with its own cache memory, attached to a system bus]
Shared-Memory Parallel Computer
• Typical Machine
– Multi-socket and/or Multi-core machines
• Shared Memory
– Programming through threading
– Multiple processors share a pool of memory
– Problems: cache coherence
– UMA vs. NUMA architecture
• Pros:
– Easier to program (probably)
• Cons:
– Performance may suffer if the memory is located on distant processors/machines
– Limited scalability
UMA vs. NUMA
Read:
http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access
http://en.wikipedia.org/wiki/Uniform_Memory_Access
[Figure: Uniform Memory Access vs. Non-Uniform Memory Access. CPUs, each with its own cache memory, connected to main memory through a system bus]
Classical #2:
Distributed-Memory Parallel Computer
[Figure: five CPUs, each with its own cache memory and local memory, connected through a network]
Distributed-Memory Parallel Computer
• Typical machine
– Beowulf Clusters, Clusters
• Distributed Memory
– Programming through processes
– Explicit message passing, typically over MPI
– Networking
• Pros:
– Tighter control on message passing
• Cons:
– Harder to program
• Modern supercomputers are hybrids!
Beowulf Cluster
• Is a form of parallel computer
• Thomas Sterling, Donald Becker, … (1994)
• Emphasizes the use of COTS
– COTS: Commodity Off-The-Shelf components
– Intel, AMD processors
– Gigabit Ethernet (1000 Mbps)
– Linux
• A dedicated facility for parallel processing
– Non-dedicated: NOW (Network Of Workstations)
• Performance/price ratio is significantly higher than that of traditional supercomputers!
Modern Parallel Computer Architecture
Hybrid
Heterogeneous
Modern #1:
Heterogeneous Architecture
• Heterogeneous computing systems refer to systems that use a variety of different types of computational units:
– general-purpose processor (e.g. CPU)
– special-purpose processor (e.g. digital signal processor (DSP))
– graphics processing unit (GPU)
– co-processor (e.g. 8087, 80287)
– custom acceleration logic (application-specific integrated circuit (ASIC) or field-programmable gate array (FPGA)).
• In general, a heterogeneous computing platform consists of processors with different instruction set architectures (ISAs).
http://en.wikipedia.org/wiki/Heterogeneous_computing
Modern #2:
Hybrid Architecture
• Using multi-core / multi-socket / heterogeneous nodes in clusters
• Example #1: Earth Simulator (the fastest supercomputer in the world from 2002 to 2004)
http://www.es.jamstec.go.jp/esc/eng/index.html
Modern Example #2:
#1 supercomputer – Tianhe-2
• 16,000 nodes interconnected
• Each node
– 2x Intel Xeon E5-2692 12-core 2.2 GHz CPUs
– 3x Intel Xeon Phi 31S1P (57 cores, 1.1 GHz)
– 64 + 24 GB memory
http://www.olcf.ornl.gov/titan/
#1 (Tianhe-2, China): 3,120,000 cores, 33,862.7 TFlop/s (Rmax), 54,902.4 TFlop/s (Rpeak), 1,024,000 GB RAM, 17,808 kW, 16,000 nodes. Each node has 2 Intel CPUs and 3 Intel Xeon Phi coprocessors.
From HW1 Reading
• What is von Neumann bottleneck?
• What is SISD? SIMD?
• Which unit takes the biggest area in modern CPU chips?
• Section 1.2 CPU micro-architecture
– How does pipelining work? (Wind-up, Stall, Pipeline bubbles, loop carried dependencies)
– Superscalar
– SIMD
– Out-of-order execution
– CISC-Micro-Ops-RISC
From HW1 Reading
• Section 1.3 cache
– What is a cache hit? A cache miss? A unified cache?
– Locality of reference: temporal locality, spatial locality
– Streaming pattern?
– Cache line
– Write miss cache allocate (read – write) with write-back strategy
– Fully associative cache vs. direct-mapped cache with n-way set associative
– Prefetch
• Multicore, Multithreading processors (SMT)
Lecture 2
Source Code Optimization
Topics
• Declarations
– Data type considerations
– Functions
– Memory alignment
• Array access and loop optimization
• Other considerations
– Integer division
– Branching logic
– Float literals
– Avoid pointer usage
Declarations
Data Type Considerations
• Integer data type
– unsigned:
  • Division and remainders
  • Loop counters: for(size_t i=0;i<1000;i++) { … }
  • Array indexing
– signed:
  • Integer to float conversion
– Use 32- or 64-bit integers on 32/64-bit platforms. Use shorter integers only when memory usage is a concern
Declarations
Functions
• If a function is only referenced locally (file scope), declare it as static
static int fun() {
    …
    …
}
• Function prototypes
– Use the “const” type qualifier
static int fun(const int a, const double *b) {
    …
    …
}
– By the way, what is a constant pointer?
Constant pointer
#include <iostream>
using namespace std;
int main() {
    const int *a;
    int p[] = {1, 2, 3};
    int q[] = {4, 5, 6};
    int i;
    a = p;
    for(i=0;i<3;i++) {
        cout << a[i];
        a[i] = 0;
    }
    a = q;
    for(i=0;i<3;i++) {
        cout << a[i];
        a[i] = 0;
    }
    return 0;
}
Which line above will cause a compilation error?
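To help with the question, the sketch below contrasts a pointer to const (`const int *a`, as in the program above) with a const pointer (`int *const a`). The function names are illustrative only; the lines that would be rejected by the compiler are left as comments so that the block still compiles.

```cpp
#include <cassert>

// Pointer to const: the pointer may be re-aimed, but the data it points
// to cannot be modified through it.
int sumThroughPointerToConst(const int *p, const int *q, int n) {
    assert(n >= 0);
    int sum = 0;
    const int *a = p;
    for (int i = 0; i < n; ++i) sum += a[i];  // reading is allowed
    a = q;                                    // re-aiming is allowed
    for (int i = 0; i < n; ++i) sum += a[i];
    // a[0] = 0;   // would not compile: a[0] is read-only through a
    return sum;
}

// Const pointer: the data may be modified, but the pointer itself
// cannot be re-aimed.
void zeroThroughConstPointer(int *const a, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) a[i] = 0;     // writing is allowed
    // a = nullptr; // would not compile: a itself is const
}
```

Reading the declaration from right to left helps: `const int *a` is "a pointer to an int that is const", while `int *const a` is "a const pointer to an int".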
Declarations
Memory Alignment
• Pointer Alignment
– Base pointers should align to a 16-byte boundary (so that compilers can easily optimize your code with SIMD instructions)
• Aligning Arrays/Matrices
– The lowest dimension (rows in C/C++) should be a multiple of 16, and each row should start at a 16-byte boundary (to use SIMD)
– This technique is commonly known as padding
– This may also cause a problem known as thrashing on direct-mapped caches
• Sorting and Padding C and C++ Structures
– Sort members from the largest to the smallest
– Add padding to make sure that each member in an AOS (array of structs) can align naturally
• Reading
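The alignment and member-sorting advice above can be sketched as follows. The struct and variable names are hypothetical, and exact sizes are platform dependent; on a typical x86-64 compiler the unsorted struct takes 24 bytes and the sorted one 16, but that is an expectation, not a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Members in arbitrary order: the compiler inserts padding before
// `value` and `count` so that each member is naturally aligned.
struct Unsorted {
    char flag;
    double value;
    char tag;
    int count;
};

// Same members sorted from largest to smallest: less padding is needed.
struct Sorted {
    double value;
    int count;
    char flag;
    char tag;
};

// A row of 16 floats whose first element starts on a 16-byte boundary.
alignas(16) float alignedRow[16];

// True when pointer p is a multiple of `boundary` bytes.
bool isAligned(const void *p, std::size_t boundary) {
    assert(boundary != 0);
    return reinterpret_cast<std::uintptr_t>(p) % boundary == 0;
}
```

Comparing `sizeof(Sorted)` with `sizeof(Unsorted)` on your own machine shows how much padding the reordering saves, and `isAligned(alignedRow, 16)` confirms the base address.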
Array Access and Loop Optimization
• Loop jamming / Loop fusion
• Loop fission / Loop distribution
• Move invariants out of loops
– Loop unswitching
• Loop peeling
• Loop interchange
• Loop unrolling
• Loop unrolling and sum reduction
Loop Jamming / Loop Fusion
// Fused
for(i=0;i<N;i++) {
    b[i] += 1;
    y += b[i];
}
// Separated
for(i=0;i<N;i++) {
    b[i] += 1;
}
for(i=0;i<N;i++) {
    y += b[i];
}
i7-950, icpc -O2, N=10000000
Separated: 0.0106071 secs.
Fused: 0.00695872 secs.
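The slides do not say how the timings were measured. One possible way, assuming `std::chrono::steady_clock` and a hypothetical helper named `bestSeconds`, is sketched below together with the fused and separated loops wrapped as functions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: run fn() `repeats` times and return the fastest
// wall-clock time in seconds.
template <typename Fn>
double bestSeconds(Fn fn, int repeats) {
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        if (best < 0.0 || elapsed.count() < best) best = elapsed.count();
    }
    return best;
}

// Fused version: one pass over b.
double fused(std::vector<double>& b) {
    double y = 0.0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        b[i] += 1;
        y += b[i];
    }
    return y;
}

// Separated version: two passes over b.
double separated(std::vector<double>& b) {
    double y = 0.0;
    for (std::size_t i = 0; i < b.size(); ++i) b[i] += 1;
    for (std::size_t i = 0; i < b.size(); ++i) y += b[i];
    return y;
}
```

A call such as `bestSeconds([&] { sink += fused(b); }, 5)` times the fused loop; accumulating the result into a variable that is used afterwards keeps the compiler from removing the loop. Absolute numbers will differ from the slide's, since they depend on the CPU, compiler, and flags.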
Loop Fission / Loop Distribution
// Separated
for(i=0;i<N;i++) x += b[i];
for(i=0;i<N;i++) y += c[i];
// Fused
for(i=0;i<N;i++) {
    x += b[i];
    y += c[i];
}
i7-950, icpc -O2, N=1000000
Fused: 0.000468991 secs.
Separated: 0.000335743 secs.
Move invariants out of loops
If() - Loop unswitching
// Inside: the test on a[i] is repeated for every j
for(i=0;i<N;i++) {
    for(j=0;j<N;j++) {
        if(a[i] > 100.0) b[i] = a[i] - 3.7;
        x = x + a[j] + b[i];
    }
}
// Outside: the test on a[i] is done once per i
for(i=0;i<N;i++) {
    if(a[i] > 100.0) b[i] = a[i] - 3.7;
    for(j=0;j<N;j++) {
        x = x + a[j] + b[i];
    }
}
i7-950, icpc -O2, N=10000
Inside: 0.203336 secs.
Outside: 0.0930612 secs.
Move invariants out of loops
If() - Loop unswitching
// Before
for(i=0;i<N;i++) {
    x[i] = x[i] + y[i];
    if(w) z[i]=0.0;
}
// After
if(w) {
    for(i=0;i<N;i++) {
        x[i] = x[i] + y[i];
        z[i]=0.0;
    }
} else {
    for(i=0;i<N;i++) {
        x[i] = x[i] + y[i];
    }
}
i7-950, icpc -O2, N=1000000
Before: 0.00160437 secs.
After: 0.00156413 secs.
AMD Software Optimization Guide: Loop Hoisting
Move invariants out of loops
Math operations
// Inside: the division by c[i] is repeated for every j
for(i=0;i<N;i++) {
    a[i] = 0.0;
    for(j=0;j<N;j++) {
        a[i] += b[j] * d[j] / c[i];
    }
}
// Outside: the division by c[i] is done once per i
for(i=0;i<N;i++) {
    a[i] = 0.0;
    for(j=0;j<N;j++) {
        a[i] += b[j] * d[j];
    }
    a[i] /= c[i];
}
i7-950, icpc -O2, N=10000
Inside: 0.0617907 secs.
Outside: 0.0157715 secs.
Loop Peeling
j = N-1;
for(i=0;i<N;i++) {
b[i] = (a[i] + a[j]) * 0.5;
j = i;
}
b[0] = (a[0] + a[N-1]) * 0.5;
for(i=1;i<N;i++) {
b[i] = (a[i] + a[i-1]) * 0.5;
}
Loop Interchange
Stride minimization
// Stride 1: the last index j varies fastest
for(i=0;i<N;i++) {
    for(j=0;j<N;j++) {
        c[i][j] += a[i][j] + b[i][j];
    }
}
// Stride N: the first index i varies fastest
for(j=0;j<N;j++) {
    for(i=0;i<N;i++) {
        c[i][j] += a[i][j] + b[i][j];
    }
}
i7-950, icpc -O1, N=1000
Stride 1: 0.00162812 secs.
Stride N: 0.0117221 secs.
Loop Interchange
Exercise
for(i=0;i<NUM;i++)
    for(j=0;j<NUM;j++)
        for(k=0;k<NUM;k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
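One possible answer to the exercise, offered as a sketch rather than the official solution: swapping the j and k loops makes the innermost loop run along rows of both b and c. The `Matrix` alias and function names are assumptions for illustration; with `std::vector<std::vector<double>>` each row is contiguous, which is what the stride argument relies on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Original i-j-k order: b[k][j] moves down a column in the inner loop.
void multiplyIJK(const Matrix& a, const Matrix& b, Matrix& c) {
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

// Interchanged i-k-j order: the inner loop walks b[k][j] and c[i][j]
// with stride 1, and a[i][k] is invariant in the inner loop.
void multiplyIKJ(const Matrix& a, const Matrix& b, Matrix& c) {
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i][k];
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] = c[i][j] + aik * b[k][j];
        }
}
```

Both versions accumulate the same products into each c[i][j] in the same k order, so the results match; only the memory access pattern changes.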
Loop Unrolling
// Before
for(i=0;i<N;i++)
    for(j=0;j<N;j++)
        for(k=0;k<4;k++)
            a[i][j] += b[k][i] * c[k][j];
// After: inner loop unrolled by hand
for(i=0;i<N;i++) {
    for(j=0;j<N;j++) {
        a[i][j] += b[0][i] * c[0][j];
        a[i][j] += b[1][i] * c[1][j];
        a[i][j] += b[2][i] * c[2][j];
        a[i][j] += b[3][i] * c[3][j];
    }
}
i7-950, icpc -O2 -unroll0, N=1000
Before: 0.00368654 secs.
After: 0.00116636 secs.
Loop Unrolling and Sum Reduction
// Before
for(i=0;i<N;i++)
    for(j=0;j<N;j++)
        for(k=0;k<4;k++) a += b[k][i] * c[k][j];
// After: inner loop unrolled, single accumulator
for(i=0;i<N;i++) {
    for(j=0;j<N;j++) {
        a += b[0][i] * c[0][j];
        a += b[1][i] * c[1][j];
        a += b[2][i] * c[2][j];
        a += b[3][i] * c[3][j];
    }
}
AMD Software Optimization Guide: Explicit Parallelism in Code
i7-950, icpc -O2 -unroll0, N=1000
Before: 0.00186869 secs.
After: 0.000803386 secs.
Loop Unrolling and Sum Reduction
// Further: four independent partial sums, combined at the end
for(i=0;i<N;i++) {
    for(j=0;j<N;j++) {
        a1 += b[0][i] * c[0][j];
        a2 += b[1][i] * c[1][j];
        a3 += b[2][i] * c[2][j];
        a4 += b[3][i] * c[3][j];
    }
}
a = a1 + a2 + a3 + a4;
i7-950, icpc -O2 -unroll0, N=1000
Before: 0.00188231 secs.
After: 0.000798778 secs.
Further: 0.00107725 secs.
i7-950, icpc -O1 -unroll0, N=1000
Before: 0.00399741 secs.
After: 0.00364868 secs.
Further: 0.00190666 secs.
AMD Phenom 1065t, icpc -O2 -unroll0, N=1000
Before: 0.00238779 secs.
After: 0.000945025 secs.
Further: 0.000891968 secs.
Other Considerations
Avoid integer division
Branching logic
Float literals
Avoid pointer usage
Other Considerations
Avoid integer division
• Replacing Integer Division with Multiplication
// Before
for(size_t i=0;i<N;++i) {
    a[i] /= 3;
}
// After
for(size_t i=0;i<N;++i) {
    a[i] *= 1.0 / 3.0;
}
i7-950, icpc -O2, N=1000000
Before: 0.00125388 secs.
After: 0.000616221 secs.
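The slide does not state the element type of `a`. The sketch below assumes a floating-point array, where the division can be replaced by one reciprocal computed outside the loop; the function names are illustrative. For a true integer array, multiplying by `1.0 / 3.0` goes through a floating-point conversion and may not always give the same result as integer division, so that case should be checked separately.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Straightforward version: one division per element.
void divideEach(std::vector<double>& a, double divisor) {
    assert(divisor != 0.0);
    for (std::size_t i = 0; i < a.size(); ++i) a[i] /= divisor;
}

// Reciprocal version: one division in total, then one multiplication
// per element. Results can differ from divideEach in the last bit.
void scaleByReciprocal(std::vector<double>& a, double divisor) {
    assert(divisor != 0.0);
    const double reciprocal = 1.0 / divisor;
    for (std::size_t i = 0; i < a.size(); ++i) a[i] *= reciprocal;
}
```

Because the two versions are not bit-for-bit identical, compilers usually apply this transformation on their own only when relaxed floating-point options are enabled.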
Other Considerations
Branching logic
• Arrange Boolean Operands for Quick Expression Evaluation
// Before
for(size_t i=0;i<N;++i) {
    if(i<4 || i%4 == 0) {
        a[i] = 0;
    } else {
        a[i] = 100;
    }
}
// After
a[0] = a[1] = a[2] = a[3] = 0;
for(size_t i=4;i<N;++i) {
    if(i%4 != 0) {
        a[i] = 100;
    } else {
        a[i] = 0;
    }
}
i7-950, icpc -O2, N=1000000
Before: 0.00133116 secs.
After: 0.000616221 secs.
Other Considerations
float literal
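// double literal: data[i] is converted to double, multiplied, then converted back to float
float data[N];
for(size_t i=0;i<N;++i) {
    data[i] *= 3.14;
}
// float literal: the multiplication stays in single precision
float data[N];
for(size_t i=0;i<N;++i) {
    data[i] *= 3.14f;
}
Intel i7-950, icpc -O2, N=1000000
3.14 (double literal): 0.000616121 secs.
3.14f (float literal): 0.000290689 secs.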
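Why the suffix matters can be checked at compile time. In the sketch below (the function name is illustrative) the two `static_assert` lines confirm that a float multiplied by `3.14` is evaluated as a double, while a float multiplied by `3.14f` stays a float.

```cpp
#include <cassert>
#include <type_traits>

// An unsuffixed literal is a double, so float * 3.14 is computed in double.
static_assert(std::is_same<decltype(1.0f * 3.14), double>::value,
              "float * double literal yields double");

// With the f suffix the whole expression stays in single precision.
static_assert(std::is_same<decltype(1.0f * 3.14f), float>::value,
              "float * float literal yields float");

// Scale n floats in place using a float literal.
void scaleInPlace(float *data, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) data[i] *= 3.14f;
}
```

The double version needs a conversion in each direction per element and works on wider values, which is consistent with the slower timing reported above.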
Other Considerations
Avoid pointer usage
• Use array notation instead of pointer notation when working with arrays to avoid aliasing issues.
• Avoid frequently dereferenced pointer arguments in functions
– They increase memory traffic
// Before: pointer notation
void Plus(size_t n, int *ptrD, int *ptrE, int *ptrF) {
    int *ptrF1 = ptrF;
    ptrF++;
    ptrD++;
    ptrE++;
    for(size_t i=1;i<n;++i) {
        *ptrF++ = *ptrD++ + *ptrE++ + *ptrF1++;
    }
}
// After: the same loop in array notation
for(size_t i=1;i<n; ++i) {
    ptrF[i] = ptrD[i] + ptrE[i] + ptrF[i-1];
}
Intel i7-950, icpc -O2, N=1000000
Before: 0.0013956000 secs.
After: 0.0010977000 secs.
Write Efficient Code
• Contiguous memory access helps to improve code efficiency. (Why?)
• Loop unrolling & SIMD instructions (e.g. MMX, SSE, 3DNow!, …) can be used to speed up vector operations.
• Modern compilers (gcc, Intel compiler) can do loop unrolling automatically for you.
– icpc -O2 -unroll -xHost yourcode.cpp
– g++ -O2 -funroll-loops -march=native yourcode.cpp
• Compiler flags (the options in the commands above) can make a huge difference to 1) computational efficiency and 2) results (sometimes).
Matrix operations
• For operations involving matrices, each element may be referenced more than once, which provides a better chance to optimize the memory access pattern (reuse the data that is already in cache memory)
• Even if elements (of either a vector or a matrix) are referenced only once, how the memory is accessed still dictates the performance.
– We must pay attention to the 2-D data layout in memory.
• As modern computers have blocked memory access (lecture one, cache line: one memory request returns a block of memory), we need to work on adjacent entries first.
Summary
Technique                 Before (s)   After (s)   Improvement
Loop Interchange          0.011722     0.001628    619.98%
Loop Common Expression    0.061791     0.015772    291.79%
Unrolling                 0.003687     0.001166    216.07%
Loop Unswitching          0.093061     0.03108     199.43%
float literal             0.000616     0.000291    111.95%
Integer division          0.001254     0.000616    103.48%
branching logics          0.001331     0.000824     61.49%
Loop Peeling              9.38E-05     6.28E-05     49.38%
Loop Jamming              0.020384     0.014192     43.63%
Loop Fission              0.000469     0.000336     39.69%
Avoid pointer usage       0.001396     0.001098     27.14%
Unroll + Sum Reduction    0.003997     0.003649      9.56%
Loop Unswitching-2        0.001604     0.00154       4.21%
I did not have examples for some of the techniques introduced.
But that does not mean they are not effective!
Readings
• AMD (2014), “Software Optimization Guide for AMD Family 15h Processors”, Chapter 3 – C and C++ source-level optimizations.
• Ahn, “Data Alignment”, On-line: http://www.songho.ca/misc/alignment/dataalign.html
• SGI, “Optimizing Cache Utilization”, On-line: http://techpubs.sgi.com/library/dynaweb_docs/0640/SGI_Developer/books/OrOn2_PfTune/sgi_html/ch06.html
• Next week’s lecture is on the following topics:
– Cache tiling / blocking: make sure you understand how cache works!
– Dynamic memory allocation for multi-dimensional arrays
– Command line arguments in C/C++
– Preprocessor directives
– Linking with external libraries and functions
– Writing a Makefile