High Performance Computing for Science and Engineering I
Dr. Sergio Martin
Computational Science & Engineering Laboratory
(Partially based on original material by Fabian Wermelinger)
Cache, Parallelism and Concurrency
OUTLINE
• Cache Hierarchy and Optimization
• Concurrency and Parallelism
• Processes and Threads
• Threading Libraries
• Race Conditions and Synchronization Mechanisms
Cache Hierarchy:
A brief History
The Intel 8086 Processor
1978 - Intel Releases 8086, the first 16-bit processor of the x86 architecture.
29k Transistors - 5 MHz
Picture : https://commons.wikimedia.org/wiki/File:Intel_8086_block_scheme.svg
Address Calculator Unit
Segment Selectors (16-bit)
General-Purpose Registers (16-bit)
1xArithmetic-Logic Unit (ALU)
Further Read : Computer Organization and Design: The Hardware/Software Interface
16-bit = 64 KB
Max Memory Address within one segment (segmentation extends the 8086 address space to 1 MB)
(32-bit) Extended Registers
1985 - Intel Releases 80386, its first 32-bit Processor.
Picture Source: CPU collection Konstantin Lanzet
275k Transistors - 20 MHz
Extended Registers
Applications Operate on Larger Data Sets
Additional Pressure on RAM
32-bit = 4 GB
Max Memory Address
CPU/RAM Latency
275k Transistors - 20 MHz
GP Register Capacity: 16 Bytes, Register Memory Latency: 2-5 ns
Asynchronous DRAM Capacity: 1-64 MB, Latency: ~120 ns
Register vs. DRAM Latency Ratio: 24x
Register vs. DRAM Capacity Ratio: ~10^6
Memory Performance Gap
Picture Source: Computer Architecture: A Quantitative Approach Book by David A Patterson and John L. Hennessy
Problem - Growing disparity between register and RAM latencies.
Cache Memory
1988 - Intel Releases 80386SX, the first commercial CPU with a Data-Cache Memory
275k Transistors - 20 MHz
GP Register Capacity: 16 Bytes, Register Memory Latency: 2-5 ns
External Cache (SRAM) Capacity: ~128 KB, Latency: 10-25 ns
Asynchronous DRAM Capacity: 1-64 MB, Latency: ~120 ns
Register vs. Cache: Latency Ratio: 4x, Capacity Ratio: ~10^4
Cache vs. DRAM: Latency Ratio: 6x, Capacity Ratio: ~10^2
Cache memories do not improve the RAM->CPU latency.
Instead, they speed up the reuse of data based on temporal and spatial locality.
RAM is itself a cache for HDD!
Cache Hierarchy
Modern Processors Employ Multiple Cache Levels
RAM Latency
Picture Source: Computer Architecture: A Quantitative Approach Book by David A Patterson and John L. Hennessy
Solution - Cache memories bridge the gap between CPU and RAM performance
L1 Cache L2 Cache
L3 Cache
Memory Hierarchy
Source & Further Read : https://computationstructures.org/lectures/caches/caches.html
What does this mean?
How does a Cache work
    float a[1024]; // 4 KB
    ...
    float sum = 0.0f;
    for (int i = 0; i < 1024; ++i)
        sum += a[i];
Cache Line: contains a fixed number of contiguous (data or code) words mapped to an address of main memory.
3 C's:
1. Compulsory (cannot be avoided)
2. Capacity (cache is full)
3. Conflict (misaligned data, mapping)
Data layout in memory
Assume a cache line holds 4 elements and the cache is cold:
i = 0: Compulsory miss, load cache line (DRAM): 0 1 2 3
i = 1: Cache hit: 0 1 2 3
i = 2: Cache hit: 0 1 2 3
i = 3: Cache hit: 0 1 2 3
i = 4: Compulsory miss, load cache line (DRAM): 4 5 6 7
and so on…
Assume the cache can hold 512 data elements. If we access the first element again at the end of the loop:
i = 0: Capacity miss, load cache line (DRAM): 0 1 2 3
Cache Miss: when a memory address is requested but not present in cache, it must be fetched from RAM. (Costly)
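The hit/miss sequence of the summation loop can be reproduced with a small counting sketch. It assumes a cold cache that never evicts, so only compulsory misses occur. The function name `sequential_scan_misses` and the parameter `elems_per_line` are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Counts cache misses for a sequential scan over n elements.
// Assumption: cold cache, no evictions (compulsory misses only).
// elems_per_line is a hypothetical parameter: elements per cache line.
std::size_t sequential_scan_misses(std::size_t n, std::size_t elems_per_line) {
    assert(elems_per_line > 0);
    std::size_t misses = 0;
    std::size_t loaded_line = static_cast<std::size_t>(-1); // no line loaded yet
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t line = i / elems_per_line; // cache line holding element i
        if (line != loaded_line) { // first touch of this line: compulsory miss
            ++misses;
            loaded_line = line;
        }                          // otherwise: cache hit (spatial locality)
    }
    return misses;
}
```

With 1024 elements and 4 elements per cache line, `sequential_scan_misses(1024, 4)` counts one miss per cache line, i.e. 256 misses and 768 hits.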
Cache Locality
Image source: Optimizing for instruction caches, part 1, Amir Kleen, Livadariu Mircea, Itay Peled, Erez Steinberg, Moshe Anschel, Freescale
Cache structures in modern processors benefit from both Temporal and Spatial locality.
Figure labels: High Cache Line Reuse, Frequent Cache Misses
Cache Line Associativity
Where are cache lines placed in the cache?
Assume the cache can hold 4 cache lines (16 addresses).
Data layout in memory: cache lines 0 to 7 start at addresses 0, 4, 8, 12, 16, 20, 24, 28 (the next line would start at address 32).
Fully associative:
Cache lines can be placed anywhere.
One possibility: the four cache slots hold cache lines 4, 2, 3, 7.
Direct mapped:
Each cache line maps to exactly one cache location.
The four cache slots accept lines 0/4, 1/5, 2/6 and 3/7, respectively.
n-way set associative:
Cache lines can map to a set of n possible locations.
2-way set associative: two slots accept lines 0/2/4/6, the other two accept lines 1/3/5/7.
Conflict miss:
Bad data alignment leads to cache thrashing!
[Figure: array0, array1, array2]
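The placement rules above can be written as index computations. This is a minimal sketch assuming the common modulo mapping (line index modulo number of slots or sets); real hardware maps by address bits, and the function names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Direct mapped: each memory line has exactly one possible cache slot.
// Assumption: modulo mapping on the cache line index.
std::size_t direct_mapped_slot(std::size_t line, std::size_t num_slots) {
    assert(num_slots > 0);
    return line % num_slots;
}

// n-way set associative: the slots are grouped into sets of `ways` slots;
// a memory line may be placed in any slot of the set returned here.
// ways == num_slots gives a single set, i.e. a fully associative cache.
std::size_t set_index(std::size_t line, std::size_t num_slots, std::size_t ways) {
    assert(ways > 0 && num_slots >= ways && num_slots % ways == 0);
    return line % (num_slots / ways);
}
```

For a cache with 4 slots, lines 0 and 4 compete for the same direct-mapped slot, while in the 2-way case lines 0, 2, 4 and 6 share one set of two slots. Two arrays whose lines fall into the same slot or set keep evicting each other, which is the thrashing scenario named above.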
Programming for Cache Performance
Which of the two codes performs better?
Assume the cache can hold 4 cache lines (16 floats, fully associative).
Data layout in memory: cache lines 0 to 7 start at addresses 0, 4, 8, 12, 16, 20, 24, 28 (the next line would start at address 32).
Code A:
    float A[8][8];
    // initialize data

    float sum = 0.0f;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            sum += A[j][i]; // column-wise traversal
• Bad accessing pattern!
• C/C++ stores data in row-major order
• 64 cache misses
Code B:
    float A[8][8];
    // initialize data

    float sum = 0.0f;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            sum += A[i][j]; // row-wise traversal
• Good accessing pattern!
• C/C++ stores data in row-major order
• 16 cache misses
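The two miss counts can be checked with a small cache simulation. The sketch below assumes the setup stated on the slide (8 x 8 floats, 4 floats per cache line, 4 fully associative lines) and additionally assumes LRU replacement, which the slide does not specify. `LruCache` and `count_misses` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// Fully associative cache of cache line indices.
// Assumption: least-recently-used (LRU) replacement.
class LruCache {
public:
    explicit LruCache(std::size_t num_lines) : capacity_(num_lines) {
        assert(num_lines > 0);
    }

    // Returns true on a hit. On a miss the line is loaded,
    // evicting the least recently used line if the cache is full.
    bool access(std::size_t line) {
        auto it = std::find(lines_.begin(), lines_.end(), line);
        const bool hit = (it != lines_.end());
        if (hit) lines_.erase(it);
        else if (lines_.size() == capacity_) lines_.pop_front();
        lines_.push_back(line); // most recently used line sits at the back
        return hit;
    }

private:
    std::size_t capacity_;
    std::deque<std::size_t> lines_;
};

// Counts misses when summing an 8 x 8 row-major matrix,
// traversed row-wise (A[i][j]) or column-wise (A[j][i]).
std::size_t count_misses(bool row_wise) {
    constexpr std::size_t N = 8, elems_per_line = 4, cache_lines = 4;
    LruCache cache(cache_lines);
    std::size_t misses = 0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            const std::size_t idx = row_wise ? i * N + j : j * N + i;
            if (!cache.access(idx / elems_per_line)) ++misses;
        }
    return misses;
}
```

Under these assumptions `count_misses(true)` gives 16 and `count_misses(false)` gives 64: the column-wise traversal touches 8 different cache lines per inner loop, so every line is evicted before it is reused.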
Cache Usage: Matrix Matrix Multiplication
4 Cache Lines of 2 Elements Each
A and B are stored in row-major order and distinguished by color. Let's ignore accesses to C for simplicity.
[Figure: System Memory holding A and B; A x B = C; Cache Lines 1 to 4]
Compulsory Misses: 6 Capacity Misses: 0
Compulsory Misses: 6 Capacity Misses: 4
10 Misses / 2 Elements
~ 5 Misses / Element
Cache Optimization: Blocked Multiplication
Idea: Let's solve the multiplication in blocks, storing partial solutions to Ci,j
[Figure: System Memory holding A and B; A x B = C; Cache Lines 1 to 4]
Cumulative misses as the block computation proceeds:
Compulsory Misses: 4 Capacity Misses: 0
Compulsory Misses: 6 Capacity Misses: 0
Compulsory Misses: 10 Capacity Misses: 0
Compulsory Misses: 12 Capacity Misses: 0
12 Misses / 4 Elements (in 16 quarters)
~ 3 Misses / Element
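The blocking idea can be sketched in code as follows. This is a minimal illustration, not a tuned kernel: it assumes square row-major matrices stored in `std::vector<double>`, and the function name `blocked_matmul` and the block size parameter `bs` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C += A * B for n x n row-major matrices in bs x bs blocks.
// Each block product adds a partial sum to the matching block of C.
void blocked_matmul(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, std::size_t n, std::size_t bs) {
    assert(bs > 0);
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += bs)
        for (std::size_t kk = 0; kk < n; kk += bs)
            for (std::size_t jj = 0; jj < n; jj += bs)
                // Work on one block of A and one block of B while they are
                // likely to stay in cache.
                for (std::size_t i = ii; i < std::min(ii + bs, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + bs, n); ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + bs, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

C must be zero-initialized by the caller if a plain product is wanted. In practice the block size is chosen so that the blocks of A, B and C being worked on fit in cache together; the best value depends on the machine and is usually found by measurement.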
Concurrency And Parallelism
Terminology: Concurrency and Parallelism
Concurrency:
The existence of two or more streams of instructions whose execution order cannot be determined a priori.
Parallelism:
The existence of two or more streams of instructions executing simultaneously.
To think about:
There can be concurrency without parallelism, but there cannot be parallelism without concurrency. Why?
Support for Parallelism in Hardware
• Multiple physical cores (Thread-Level Parallelism)
• Pipelining (Instruction-Level Parallelism)
• Vectorization (Data-Level Parallelism)
Terminology: Processes and Threads
Processes:
• Are OS structures (memory+code+threads)
• Operate on their own private virtual address space (64-bit addresses)
• Managed by the OS scheduler
• Contain one or more threads
Threads (Kernel-Level):
• Represent a CPU execution state (register values, program counter)
• Execute a stream of instructions (code) within a running process
• All threads associated with the same process share the same virtual address space
• The programmer can control the creation/deletion of additional threads.
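The shared address space can be illustrated with C++ standard library threads (one possible threading library, chosen here as an assumption). In this sketch each thread writes to its own element of a vector owned by the creating thread; the function name `fill_from_threads` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Creates num_threads threads; thread t stores t*t into slot t of a vector
// that lives in the address space shared by all threads of the process.
std::vector<int> fill_from_threads(std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<int> shared(num_threads, 0);
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; ++t)
        // Each thread writes a distinct element, so no synchronization is
        // needed beyond the join below.
        threads.emplace_back([&shared, t] { shared[t] = static_cast<int>(t * t); });
    for (auto& th : threads) th.join(); // wait until every thread has finished
    return shared;
}
```

No data is copied between threads here: all of them see the same `shared` vector. If several threads wrote to the same element, the result would depend on the execution order unless the accesses were synchronized.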
Memory Mapping and Inter-Process Communication
• No process can access other processes' memory space
(basic safety in multi-user environment, e.g. Linux)
Process Memory Layout (from high to low virtual address):
• Command line arguments and environment variables
• Stack: LIFO stack, allocated by the loader at startup. It advances when a function is called, so deep recursion can exhaust it. Grows toward the heap.
• Heap: dynamic memory that is user-managed (e.g. malloc, new). Grows toward the stack.
• Uninitialized data and initialized data: global variables, static variables. Allocated by the compiler.
• Text (code): executable machine code (instructions), read-only
• Inter-process communication is implemented by the OS with several mechanisms: signals, files, pipes, sockets, shared memory.
• Distributed: when processes communicate through the network, e.g., with Message Passing (MPI).
Traditional vs Multithreading Execution
Processes and Threads
Traditional Process: single execution flow (sequential execution). One set of Registers and one Stack, plus Heap, Data and Code.
Multithreaded Process: multiple execution flows (concurrent execution). Each thread has its own Registers and Stack; Heap, Data and Code are shared.
Remember: Threads in the same process share the same virtual address space
Communication by shared memory
Master/Worker Threading Model
• The master thread creates new tasks and stores them into a queue
• Workers constantly check the queue for new tasks to perform
• Workers are created from the start, and do not finish while there is work left to do
• The master does not do work (what could go wrong if it did?)
• Synchronization is only necessary to access the work queue.
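The model above can be sketched with C++ standard library threads. This is one possible realization, not a reference implementation: the "work" is squaring an integer, workers block on a condition variable instead of polling the queue, and the name `run_master_worker` is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// The master enqueues the tasks 1..num_tasks; each worker squares the task
// value and adds it to a shared total. Returns the sum of the squares.
long run_master_worker(std::size_t num_workers, int num_tasks) {
    assert(num_workers > 0 && num_tasks >= 0);
    std::queue<int> tasks;      // work queue, protected by queue_mutex
    std::mutex queue_mutex;
    std::condition_variable cv;
    bool done = false;          // set by the master when no more tasks follow
    std::atomic<long> total{0};

    auto worker = [&] {
        for (;;) {
            int task;
            {
                std::unique_lock<std::mutex> lock(queue_mutex);
                cv.wait(lock, [&] { return done || !tasks.empty(); });
                if (tasks.empty()) return; // master is done and queue is drained
                task = tasks.front();
                tasks.pop();
            } // lock released: the work itself runs without synchronization
            total += static_cast<long>(task) * task;
        }
    };

    // Workers are created from the start.
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) workers.emplace_back(worker);

    // The master only creates tasks.
    for (int i = 1; i <= num_tasks; ++i) {
        { std::lock_guard<std::mutex> lock(queue_mutex); tasks.push(i); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(queue_mutex); done = true; }
    cv.notify_all();
    for (auto& w : workers) w.join();
    return total.load();
}
```

The mutex is held only while the queue is touched, which matches the point that synchronization is needed only for queue access. The atomic `total` is an extra piece of shared state introduced for this example so that the result can be returned.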
Fork/Join Threading Model
https://computing.llnl.gov/tutorials/openMP/
Figure labels (along the time axis): Fork threads, Team of threads, Join threads: Synchronization point
• The master thread is the main thread that enters the main function in your program
• It controls the creation of child threads with a Fork/Join operation
• Child threads within a team (parallel region) may create sub-teams of threads (nested parallelism)
• Threads are joined together at synchronization points (barriers, fences). Only the master continues
• The longer it is possible for a team of threads to execute in the same parallel region, the higher the parallel fraction of your code. (Synchronization is expensive)
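A fork/join region can be sketched with C++ standard library threads as follows. This is an illustration under assumptions, not how a specific runtime such as OpenMP implements it: the calling thread plays the master and only forks and joins, the work is a chunked array sum, and `fork_join_sum` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums data with a team of num_threads threads.
long fork_join_sum(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<long> partial(num_threads, 0); // one result slot per thread
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    // Fork: create the team; thread t sums its own contiguous chunk.
    std::vector<std::thread> team;
    for (std::size_t t = 0; t < num_threads; ++t)
        team.emplace_back([&data, &partial, chunk, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i) partial[t] += data[i];
        });

    // Join: synchronization point; only the master continues afterwards.
    for (auto& th : team) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

The threads synchronize exactly once, at the join, and all the work between fork and join runs without coordination. Creating a new team for every small piece of work would add thread creation and join costs each time, which is why long parallel regions are preferred.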
Terminology: Local vs. Distributed Parallelism
Local (a.k.a. Simultaneous Multithreading, Shared Memory) Parallelism:
When parallelism is achieved by running two or more collaborating threads simultaneously in a multi-core system, communicating through shared memory.
Distributed Parallelism:
When parallelism is achieved by running two or more collaborating single-thread processes on more than one computer, communicating across the network (distributed memory).
Hybrid Parallelism:
A combination of the two above: multiple processes communicating across the network, each running more than one thread, with the threads of a process communicating through shared memory.
Full Picture: Local + Distributed Parallelism
https://computing.llnl.gov/tutorials/openMP/
[Figure: Process 0, Process 1 and Process 2, each a multithreaded process whose threads communicate through Shared Memory; the processes communicate with each other through Message Passing (Distributed Memory)]