High Performance Computing for Science and Engineering I

(1)

Dr. Sergio Martin

Computational Science & Engineering Laboratory

(Partially based on original material by Fabian Wermelinger)

Cache, Parallelism and Concurrency

(2)

OUTLINE

• Cache Hierarchy and Optimization

• Concurrency and Parallelism

• Processes and Threads

• Threading Libraries

• Race Conditions and Synchronization Mechanisms

(3)

Cache Hierarchy: A Brief History

(4)

The Intel 8086 Processor

1978 - Intel Releases 8086, the first 16-bit processor of the x86 architecture.

29k Transistors - 5 MHz

Picture : https://commons.wikimedia.org/wiki/File:Intel_8086_block_scheme.svg

Address Calculator Unit

Segment Selectors (16-bit)

General-Purpose Registers (16-bit)

1xArithmetic-Logic Unit (ALU)

Further Read : Computer Organization and Design: The Hardware/Software Interface

Max Memory Address: 16-bit = 64 KB (per segment; segmented addressing extends the physical address space to 1 MB)

(5)

(32-bit) Extended Registers

1985 - Intel Releases 80386, its first 32-bit Processor.

Picture Source: CPU collection Konstantin Lanzet

Extended Registers

Applications Operate on Larger Data Sets

Additional Pressure on RAM

Max Memory Address: 32-bit = 4 GB

(6)

CPU/RAM Latency

CPU: 275k Transistors - 20 MHz

GP Registers: Capacity: 16 Bytes, Latency: 2-5 ns

Asynchronous DRAM: Capacity: 1-64 Mbytes, Latency: ~120 ns

Latency Ratio: 24x

Capacity Ratio: ~10^6

(7)

Memory Performance Gap

Picture Source: Computer Architecture: A Quantitative Approach Book by David A Patterson and John L. Hennessy

Problem - Growing disparity between register and RAM latencies.

Performance Gap

(8)

Cache Memory

1988 - Intel Releases 80386SX, the first commercial CPU with a Data-Cache Memory

GP Registers: Capacity: 16 Bytes, Latency: 2-5 ns

Registers to cache: Latency Ratio: 4x, Capacity Ratio: ~10^4

External Cache (SRAM): Capacity: ~128 kbytes, Latency: 10-25 ns

Cache to DRAM: Latency Ratio: 6x, Capacity Ratio: ~10^2

Asynchronous DRAM: Capacity: 1-64 Mbytes, Latency: ~120 ns

Cache Memories do not improve the RAM->CPU Latency.

Instead, they speed up the reuse of data based on temporal and spatial locality.

RAM is itself a cache for HDD!

(9)

Cache Hierarchy

Modern Processors Employ Multiple Cache Levels

(10)

RAM Latency

Picture Source: Computer Architecture: A Quantitative Approach Book by David A Patterson and John L. Hennessy

Solution - Cache memories bridge the gap between CPU and RAM performance

Figure annotations: L1 Cache, L2 Cache, L3 Cache

(11)

Memory Hierarchy

Source & Further Read : https://computationstructures.org/lectures/caches/caches.html

What does this mean?

(12)

How Does a Cache Work?

    float a[1024]; // 4 KB
    ...
    float sum = 0.0;
    for (int i = 0; i < 1024; ++i)
        sum += a[i];

Cache Line: contains a fixed number of contiguous (data or code) words mapped to an address of main memory.

Cache Miss: when a memory address is requested but not present in cache, it must be fetched from RAM. (Costly)

3 C's:

1. Compulsory (cannot avoid it)
2. Capacity (cache is full)
3. Conflict (misaligned data, mapping)

Data layout in memory

Assume cache line holds 4 elements, cold cache:

i = 0: Compulsory miss, load cache line (DRAM): 0 1 2 3
i = 1: Cache hit: 0 1 2 3
i = 2: Cache hit: 0 1 2 3
i = 3: Cache hit: 0 1 2 3
i = 4: Compulsory miss, load cache line (DRAM): 4 5 6 7
and so on…

Assume the cache can hold 512 data elements. If we access the first element again at the end of the loop:

i = 0: Capacity miss, load cache line (DRAM): 0 1 2 3
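The miss counts on this slide can be checked with a small software model. The sketch below is only an illustration: it assumes a fully associative cache with LRU replacement (the slide does not name a replacement policy), and the names `count_misses` and `slide_trace` are hypothetical helpers, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <vector>

// Hypothetical helper: counts the misses of a fully associative cache with
// LRU replacement (an assumption) for a sequence of element indices.
std::size_t count_misses(const std::vector<std::size_t>& accesses,
                         std::size_t elems_per_line,
                         std::size_t num_lines)
{
    assert(elems_per_line > 0 && num_lines > 0);
    std::list<std::size_t> lru;  // front = most recently used cache line
    std::unordered_map<std::size_t, std::list<std::size_t>::iterator> where;
    std::size_t misses = 0;

    for (std::size_t idx : accesses) {
        const std::size_t line = idx / elems_per_line;  // line holding idx
        auto it = where.find(line);
        if (it != where.end()) {
            lru.erase(it->second);            // hit: re-inserted at front below
        } else {
            ++misses;                         // miss: line must be loaded
            if (lru.size() == num_lines) {    // cache full: evict LRU line
                where.erase(lru.back());
                lru.pop_back();
            }
        }
        lru.push_front(line);
        where[line] = lru.begin();
    }
    return misses;
}

// Access trace of the slide: read a[0..1023] in order, then a[0] once more.
std::vector<std::size_t> slide_trace()
{
    std::vector<std::size_t> trace;
    for (std::size_t i = 0; i < 1024; ++i) trace.push_back(i);
    trace.push_back(0);
    return trace;
}
```

With 4 elements per line and room for 512 elements (128 lines), `count_misses(slide_trace(), 4, 128)` gives 256 compulsory misses for the loop plus one capacity miss for the final access to `a[0]`. With 256 lines the whole array fits and the final access becomes a hit.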

(13)

Cache Locality

Image source: Optimizing for instruction caches, part 1, Amir Kleen, Livadariu Mircea, Itay Peled, Erez Steinberg, Moshe Anschel, Freescale

Cache structures in modern processors benefit from both Temporal and Spatial locality.

Figure panels: High Cache Line Reuse; Frequent Cache Misses

(14)

Cache Line Associativity

Where are cache lines placed in the cache?

Assume cache can hold 4 cache lines (16 addresses):

Data layout in memory

Address:     0  4  8  12  16  20  24  28  32
Cache line:    0  1  2   3   4   5   6   7

Fully associative:

Cache lines can be placed anywhere.

Cache slot:       0  1  2  3
One possibility:  4  2  3  7

Direct mapped:

Each cache line maps to exactly one cache location.

Cache slot:   0    1    2    3
Cache lines:  0/4  1/5  2/6  3/7

n-way set associative:

Cache lines can map to a set of n possible locations.

2-way set associative:

Cache slot:   0        1        2        3
Cache lines:  0/2/4/6  0/2/4/6  1/3/5/7  1/3/5/7

Conflict miss:

Figure labels: array0, array1, array2

Bad data alignment leads to cache thrashing!
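The placement tables above are consistent with simple modulo indexing, which the following sketch makes explicit. The function names are hypothetical, and modulo placement is an assumption that matches the slide's numbers rather than a statement about any particular processor.

```cpp
#include <cassert>
#include <cstddef>

// Direct mapped: memory line k can only live in slot k mod num_slots.
std::size_t direct_mapped_slot(std::size_t line, std::size_t num_slots)
{
    assert(num_slots > 0);
    return line % num_slots;
}

// n-way set associative: the slots are grouped into num_slots / ways sets.
// Line k may occupy any of the `ways` slots of set k mod (num_slots / ways).
std::size_t set_index(std::size_t line, std::size_t num_slots, std::size_t ways)
{
    assert(ways > 0 && num_slots >= ways && num_slots % ways == 0);
    return line % (num_slots / ways);
}
```

For the 4-slot cache of the slide, lines 1 and 5 both map to direct-mapped slot 1, and in the 2-way case lines 0, 2, 4, 6 share set 0 while lines 1, 3, 5, 7 share set 1. Setting `ways` equal to `num_slots` leaves a single set, which is the fully associative case.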

(15)

Programming for Cache Performance

Which of the two codes performs better?

Assume cache can hold 4 cache lines (16 floats, fully associative):

Data layout in memory

Address:     0  4  8  12  16  20  24  28  32
Cache line:    0  1  2   3   4   5   6   7

Code A:

    float A[8][8];
    // initialize data

    float sum = 0.0;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            sum += A[j][i];

• Bad accessing pattern!

• C/C++ stores data in row-major order

• 64 cache misses

Code B:

    float A[8][8];
    // initialize data

    float sum = 0.0;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            sum += A[i][j];

• Good accessing pattern!

• C/C++ stores data in row-major order

• 16 cache misses
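The two miss counts can be reproduced with a toy cache model. This is a sketch under stated assumptions: 4 cache lines of 4 floats each, fully associative, with LRU replacement (the slide does not specify the policy). `TinyCache`, `misses_code_a` and `misses_code_b` are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a fully associative cache with LRU replacement (assumed).
struct TinyCache {
    std::size_t elems_per_line;
    std::size_t num_lines;
    std::vector<std::size_t> lines;  // back = most recently used line
    std::size_t misses;

    TinyCache(std::size_t elems, std::size_t n)
        : elems_per_line(elems), num_lines(n), misses(0)
    {
        assert(elems > 0 && n > 0);
    }

    // Record one access to the element at the given row-major index.
    void touch(std::size_t element_index)
    {
        const std::size_t line = element_index / elems_per_line;
        auto it = std::find(lines.begin(), lines.end(), line);
        if (it != lines.end()) {
            lines.erase(it);                       // hit: refresh LRU order
        } else {
            ++misses;                              // miss: load the line
            if (lines.size() == num_lines)
                lines.erase(lines.begin());        // evict least recently used
        }
        lines.push_back(line);
    }
};

// Code A: A[j][i], i.e. column-wise traversal of a row-major 8x8 array.
std::size_t misses_code_a()
{
    TinyCache cache(4, 4);
    for (std::size_t i = 0; i < 8; ++i)
        for (std::size_t j = 0; j < 8; ++j)
            cache.touch(j * 8 + i);
    return cache.misses;
}

// Code B: A[i][j], i.e. row-wise traversal in memory order.
std::size_t misses_code_b()
{
    TinyCache cache(4, 4);
    for (std::size_t i = 0; i < 8; ++i)
        for (std::size_t j = 0; j < 8; ++j)
            cache.touch(i * 8 + j);
    return cache.misses;
}
```

In this model Code A touches 8 different cache lines per inner loop while only 4 fit, so every one of its 64 accesses misses. Code B uses all 4 floats of a line before moving on, giving 64 / 4 = 16 misses.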

(16)

Cache Usage: Matrix Matrix Multiplication

Cache: 4 Cache Lines of 2 Elements Each (Cache Line 1, Cache Line 2, Cache Line 3, Cache Line 4)

A and B are stored row-major in system memory and are distinguished by color. Let's ignore C accesses for simplicity.

Figure: A x B = C

Compulsory Misses: 6

Capacity Misses: 0

(17)

Cache Usage: Matrix Matrix Multiplication

10 Misses / 2 Elements

~ 5 Misses / Element

(18)

Cache Optimization: Blocked Multiplication

Idea: Let's solve the multiplication in blocks, storing partial solutions to C_{i,j}

Figure: A x B = C (System Memory)

(21)

Cache Optimization: Blocked Multiplication

12 Misses / 4 Elements (in 16 quarters)

~ 3 Misses / Element
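The blocking idea can be written down as a loop nest. The following is a minimal sketch, not the lecture's reference implementation: it assumes square n x n matrices stored row-major in a `std::vector<double>`, and the names `Matrix`, `matmul_naive`, `matmul_blocked` and the block size parameter `bs` are chosen here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // n x n values, row-major (assumed layout)

// Reference version: plain triple loop, C = A * B.
void matmul_naive(const Matrix& A, const Matrix& B, Matrix& C, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

// Blocked version: works on bs x bs tiles and accumulates partial results
// into C, so each loaded tile of A and B is reused while it is in cache.
void matmul_blocked(const Matrix& A, const Matrix& B, Matrix& C,
                    std::size_t n, std::size_t bs)
{
    assert(bs > 0 && C.size() == n * n);
    std::fill(C.begin(), C.end(), 0.0);
    for (std::size_t ii = 0; ii < n; ii += bs)
        for (std::size_t jj = 0; jj < n; jj += bs)
            for (std::size_t kk = 0; kk < n; kk += bs)
                // Multiply tile (ii, kk) of A with tile (kk, jj) of B.
                for (std::size_t i = ii; i < std::min(ii + bs, n); ++i)
                    for (std::size_t j = jj; j < std::min(jj + bs, n); ++j) {
                        double acc = C[i * n + j];  // partial solution so far
                        for (std::size_t k = kk; k < std::min(kk + bs, n); ++k)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = acc;
                    }
}
```

Both functions compute the same product. A suitable `bs` depends on the cache size of the target machine: three tiles of `bs * bs` elements should fit in the cache level being optimized for.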

(22)

Concurrency And Parallelism

(23)

Terminology: Concurrency and Parallelism

Concurrency:

The existence of two or more streams of instructions, whose execution order cannot be determined a priori.

Parallelism:

The existence of two or more streams of instructions executing simultaneously.

To think about:

There can be concurrency without parallelism, but there cannot be parallelism without concurrency. Why?

(24)

Support for Parallelism in Hardware

• Multiple physical cores (Thread-Level Parallelism)

• Pipelining (Instruction-Level Parallelism)

• Vectorization (Data-Level Parallelism)

(25)

Terminology: Processes and Threads

Processes:

• Are OS structures (memory+code+threads)

• Operate on their own private virtual address space (64-bit addresses)

• Managed by the OS scheduler

• Contain one or more threads

Threads (Kernel-Level):

• Represent a CPU execution state (register values, program counter)

• Execute a stream of instructions (code) within a running process

• All threads associated with the same process share the same virtual address space

• The programmer can control the creation/deletion of additional threads.

(26)

Memory Mapping and Inter-Process Communication

• No process can access other processes' memory space (basic safety in multi-user environment, e.g. Linux)

• Inter-process communication is implemented by the OS with several mechanisms: signals, files, pipes, sockets, shared memory.

• Distributed: when processes communicate through the network, e.g., with Message Passing (MPI)

Process Memory Layout (listed from high to low virtual address; the figure marks the growth directions of the stack and the heap):

Command line arguments and environment variables

Stack: LIFO stack, allocated by loader at startup, advances when calling a function. Susceptible to recursion.

Heap: Dynamic memory that is user-managed (e.g. malloc, new)

Uninitialized data, Initialized data: Global variables, static variables. Allocated by compiler.

Text (code): Executable machine code (instructions), read-only

(27)

Traditional vs Multithreading Execution

Figure: a Traditional Process (single execution flow, sequential execution) next to a Multithreaded Process (multiple execution flows, concurrent execution). Panel labels: Heap, Data, Code, Registers, Stack.

Remember: Threads in the same process share the same virtual address space

Communication by shared memory
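Communication through shared memory can be illustrated with C++ standard threads. This is a minimal sketch under simple assumptions: every thread writes to its own slot of a vector owned by the caller, so no further synchronization than the final join is needed. The function name `fill_from_threads` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each thread computes a value on its own stack and publishes it through
// memory that all threads of the process can address.
std::vector<int> fill_from_threads(int num_threads)
{
    assert(num_threads >= 0);
    // Heap storage owned by the calling thread, visible to every thread.
    std::vector<int> shared(static_cast<std::size_t>(num_threads), 0);
    std::vector<std::thread> threads;

    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&shared, t] {
            int local = t * t;  // lives on this thread's private stack
            shared[static_cast<std::size_t>(t)] = local;  // distinct slot per thread
        });
    }
    // After join() returns, the writes of that thread are visible here.
    for (auto& th : threads) th.join();
    return shared;
}
```

Calling `fill_from_threads(4)` returns the vector `{0, 1, 4, 9}`. If several threads wrote to the same slot, the result would depend on timing, which is the race condition problem listed in the outline.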

(28)

Master/Worker Threading Model

• The master thread creates new tasks and stores them into a queue

• Workers constantly check the queue for new tasks to perform

• Workers are created from the start, and do not finish while there is work left to do

• The master does not do work (what could go wrong if it did?)

• Synchronization is only necessary to access the work queue.
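The model above can be sketched with a mutex-protected queue. This is an illustration under stated assumptions, not a prescribed implementation: tasks are plain integers, the "work" is squaring them, and workers sleep on a condition variable instead of polling the queue. The function name `run_master_worker` is hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// The master enqueues the task ids 1..num_tasks. Each worker repeatedly
// takes a task, squares it and adds the result to a shared total.
long run_master_worker(int num_workers, int num_tasks)
{
    assert(num_workers > 0);
    std::queue<int> tasks;       // work queue, protected by m
    std::mutex m;
    std::condition_variable cv;
    bool done = false;           // set by the master when no more tasks follow
    long total = 0;              // protected by m

    auto worker = [&] {
        for (;;) {
            int task;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return done || !tasks.empty(); });
                if (tasks.empty()) return;   // done and nothing left to do
                task = tasks.front();
                tasks.pop();
            }
            const long result = static_cast<long>(task) * task;  // work, no lock held
            std::lock_guard<std::mutex> lock(m);
            total += result;
        }
    };

    // Workers are created from the start.
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) workers.emplace_back(worker);

    // The master only creates tasks.
    for (int t = 1; t <= num_tasks; ++t) {
        {
            std::lock_guard<std::mutex> lock(m);
            tasks.push(t);
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_all();
    for (auto& th : workers) th.join();
    return total;
}
```

For example, `run_master_worker(3, 10)` returns 385, the sum of the squares from 1 to 10, regardless of which worker processed which task. The lock is held only while touching the queue or the total, which keeps the synchronized part short.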

(29)

Fork/Join Threading Model

https://computing.llnl.gov/tutorials/openMP/

Figure labels (along the Time axis): Fork threads; Team of threads; Join threads: Synchronization point

• The master thread is the main thread that enters the main function in your program

• It controls the creation of child threads with a Fork/Join operation

• Child threads within a team (parallel region) may create sub-teams of threads (nested parallelism)

• Threads are joined together at synchronization points (barriers, fences). Only the master continues

• The longer it is possible for a team of threads to execute in the same parallel region, the higher the parallel fraction of your code. (Synchronization is expensive)
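A fork/join region can be sketched with C++ standard threads. The example below is an assumption-laden illustration (it uses `std::thread` rather than OpenMP, and the name `parallel_sum` is hypothetical): the calling thread forks a team, each member sums one chunk of the data, and the join acts as the synchronization point after which only the caller continues.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` with a team of `num_threads` threads, one chunk per thread.
double parallel_sum(const std::vector<double>& data, std::size_t num_threads)
{
    assert(num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);  // one result slot per thread
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    // Fork: create the team of threads.
    std::vector<std::thread> team;
    for (std::size_t t = 0; t < num_threads; ++t) {
        team.emplace_back([&, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            partial[t] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(begin),
                data.begin() + static_cast<std::ptrdiff_t>(end), 0.0);
        });
    }

    // Join: synchronization point, afterwards only the caller continues.
    for (auto& th : team) th.join();

    // Sequential part: combine the partial sums.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Summing the values 1 to 100 with 4 threads gives 5050. The threads are forked and joined once, so the cost of synchronization is paid a single time for the whole region.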

(30)

Terminology: Local vs. Distributed Parallelism

Local (a.k.a. Simultaneous Multithreading, Shared Memory) Parallelism:

When parallelism is achieved by running two or more collaborating threads simultaneously in a multi-core system, communicating through shared memory.

Distributed Parallelism:

When parallelism is achieved by running two or more collaborating single-thread processes on more than one computer, communicating across the network (distributed memory).

Hybrid Parallelism:

A combination of the two above: multiple processes communicating across the network, each running more than one thread, with the threads communicating through shared memory.

(31)

Full Picture: Local + Distributed Parallelism

https://computing.llnl.gov/tutorials/openMP/

Figure: Process 0, Process 1 and Process 2, each drawn as a multithreaded process (Heap, Data, Code; Registers, Stack) with its own Shared Memory. Together the processes form a Distributed Memory system and communicate by Message Passing.