High Performance Computing for Science and Engineering I

Academic year: 2022


High Performance Computing for Science and Engineering I

Dr. Sergio Martin

Computational Science & Engineering Laboratory

(Partially based on original material by Fabian Wermelinger)

Cache, Parallelism and Concurrency

(2)

OUTLINE

Cache Hierarchy and Optimization

Concurrency and Parallelism

Processes and Threads

Threading Libraries

Race Conditions and Synchronization Mechanisms

(3)

Cache Hierarchy: A Brief History

(4)

The Intel 8086 Processor

1978 - Intel Releases 8086, the first 16-bit processor of the x86 architecture.

29k Transistors - 5 MHz

Picture : https://commons.wikimedia.org/wiki/File:Intel_8086_block_scheme.svg

Address Calculator Unit

Segment Selectors (16-bit)

General-Purpose Registers (16-bit)

1x Arithmetic-Logic Unit (ALU)

Further Reading: Computer Organization and Design: The Hardware/Software Interface

Max Memory Address: 16-bit = 64 KB (per segment; the segment selectors extend the 8086's physical address space to 1 MB)

(5)

(32-bit) Extended Registers

1985 - Intel Releases 80386, its first 32-bit Processor.

Picture Source: CPU collection Konstantin Lanzet

275k Transistors - 20 MHz

Extended Registers

Applications Operate on Larger Data Sets

Additional Pressure on RAM

Max Memory Address: 32-bit = 4 GB

(6)

CPU/RAM Latency

Picture Source: CPU collection Konstantin Lanzet

275k Transistors - 20 MHz

GP Registers: Capacity 16 Bytes, Latency 2-5 ns

Asynchronous DRAM: Capacity 1-64 MB, Latency ~120 ns

Registers vs. DRAM: Latency Ratio 24x, Capacity Ratio ~10^6

(7)

Memory Performance Gap

Picture Source: Computer Architecture: A Quantitative Approach Book by David A Patterson and John L. Hennessy

Problem - Growing disparity between register and RAM latencies.

(Figure label: Performance Gap)

(8)

Cache Memory

1988 - Intel Releases 80386SX, the first commercial CPU with a Data-Cache Memory

Picture Source: CPU collection Konstantin Lanzet

275k Transistors - 20 MHz

GP Registers: Capacity 16 Bytes, Latency 2-5 ns

External Cache (SRAM): Capacity ~128 KB, Latency 10-25 ns

Asynchronous DRAM: Capacity 1-64 MB, Latency ~120 ns

Registers vs. Cache: Latency Ratio 4x, Capacity Ratio ~10^4

Cache vs. DRAM: Latency Ratio 6x, Capacity Ratio ~10^2

Cache memories do not improve the RAM->CPU latency.

Instead, they speed up the reuse of data based on temporal and spatial locality.

RAM is itself a cache for HDD!

(9)

Cache Hierarchy

Modern Processors Employ Multiple Cache Levels

(10)

RAM Latency

Picture Source: Computer Architecture: A Quantitative Approach Book by David A Patterson and John L. Hennessy

Solution - Cache memories bridge the gap between CPU and RAM performance

(Figure labels: L1 Cache, L2 Cache, L3 Cache)

(11)

Memory Hierarchy

Source & Further Read : https://computationstructures.org/lectures/caches/caches.html

What does this mean?

(12)

How does a Cache work?

    float a[1024]; // 4 KB
    ...
    float sum = 0.0f;
    for (int i = 0; i < 1024; ++i)
        sum += a[i];

Cache Line: contains a fixed number of contiguous (data or code) words mapped to an address of main memory.

Cache Miss: when a memory address is requested but is not present in the cache, it must be fetched from RAM. (Costly)

3 C's (types of cache miss):

1. Compulsory (cannot be avoided)
2. Capacity (cache is full)
3. Conflict (misaligned data, mapping)

Data layout in memory

Assume a cache line holds 4 elements and the cache is cold:

i = 0: Compulsory miss, load cache line (DRAM): 0 1 2 3
i = 1: Cache hit: 0 1 2 3
i = 2: Cache hit: 0 1 2 3
i = 3: Cache hit: 0 1 2 3
i = 4: Compulsory miss, load cache line (DRAM): 4 5 6 7
and so on…

Assume the cache can hold 512 data elements. If we access the first element again at the end of the loop:

i = 0: Capacity miss, load cache line (DRAM): 0 1 2 3

(13)

Cache Locality

Image source: Optimizing for instruction caches, part 1, Amir Kleen, Livadariu Mircea, Itay Peled, Erez Steinberg, Moshe Anschel, Freescale

Cache structures in modern processors benefit from both Temporal and Spatial locality.

(Figure labels: High Cache Line Reuse; Frequent Cache Misses)

(14)

Cache Line Associativity

Where are cache lines placed in the cache?

Assume the cache can hold 4 cache lines (16 addresses).

(Figure: data layout in memory; cache lines 0 to 7 start at addresses 0, 4, 8, 12, 16, 20, 24, 28.)

Fully associative:

Cache lines can be placed anywhere. One possibility: the four cache locations hold lines 4, 2, 3, 7.

Direct mapped:

Each cache line maps to exactly one cache location, and several lines share the same location.

Location 0: lines 0/4. Location 1: lines 1/5. Location 2: lines 2/6. Location 3: lines 3/7.

n-way set associative:

Cache lines can map to a set of n possible locations.

2-way set associative: two locations can hold lines 0/2/4/6, the other two can hold lines 1/3/5/7.

Conflict miss:

Bad data alignment leads to cache thrashing! (Figure labels: array0, array1, array2)

(15)

Programming for Cache Performance

Which of the two codes performs better?

Assume the cache can hold 4 cache lines (16 floats, fully associative).

(Figure: data layout in memory; cache lines 0 to 7 start at addresses 0, 4, 8, 12, 16, 20, 24, 28.)

Code A:

    float A[8][8];
    // initialize data

    float sum = 0.0f;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            sum += A[j][i];

Bad access pattern! C/C++ stores data in row-major order, so the inner loop jumps to a different row on every iteration: 64 cache misses.

Code B:

    float A[8][8];
    // initialize data

    float sum = 0.0f;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            sum += A[i][j];

Good access pattern! C/C++ stores data in row-major order, so the inner loop walks through contiguous memory: 16 cache misses.

(16)

Cache Usage: Matrix Matrix Multiplication

Setup: 4 cache lines of 2 elements each. A and B are stored row-major in system memory and distinguished by color. Let's ignore accesses to C for simplicity.

(Figure: A x B = C, with the contents of Cache Lines 1 to 4, shown in two steps.)

Step 1: Compulsory Misses: 6, Capacity Misses: 0

Step 2: Compulsory Misses: 6, Capacity Misses: 4

10 Misses / 2 Elements ~ 5 Misses / Element

(18)

Cache Optimization: Blocked Multiplication

Idea: Let's solve the multiplication in blocks, storing partial solutions to C_{i,j}

(Figure: A x B = C in system memory, with the contents of Cache Lines 1 to 4, shown in four steps.)

Step 1: Compulsory Misses: 4, Capacity Misses: 0

Step 2: Compulsory Misses: 6, Capacity Misses: 0

Step 3: Compulsory Misses: 10, Capacity Misses: 0

Step 4: Compulsory Misses: 12, Capacity Misses: 0

12 Misses / 4 Elements (in 16 quarters) ~ 3 Misses / Element

(22)

Concurrency And Parallelism

(23)

Terminology: Concurrency and Parallelism

Concurrency:

The existence of two or more streams of instructions whose execution order cannot be determined a priori.

Parallelism:

The existence of two or more streams of instructions executing simultaneously.

To think about:

There can be concurrency without parallelism, but there cannot be parallelism without concurrency. Why?

(24)

Support for Parallelism in Hardware

Multiple physical cores (Thread-Level Parallelism)

Pipelining (Instruction-Level Parallelism)

Vectorization (Data-Level Parallelism)

(25)

Terminology: Processes and Threads

Processes:

Are OS structures (memory+code+threads)

Operate on their own private virtual address space (64-bit addresses)

Managed by the OS scheduler

Contain one or more threads

Threads (Kernel-Level):

Represent a CPU execution state (register values, program counter)

Execute a stream of instructions (code) within a running process

All threads associated with the same process share the same virtual address space

The programmer can control the creation and deletion of additional threads.

(26)

Memory Mapping and Inter-Process Communication

No process can access another process's memory space (basic safety in a multi-user environment, e.g. Linux)

Process Memory Layout (from high virtual address to low virtual address):

Command line arguments and environment variables

Stack: LIFO stack, allocated by the loader at startup, advances when calling a function. Susceptible to recursion.

Heap: dynamic memory that is user-managed (e.g. malloc, new)

Uninitialized data, Initialized data: global variables, static variables. Allocated by the compiler.

Text (code): executable machine code (instructions), read-only

(Figure: arrows mark the growth directions of the stack and the heap.)

Inter-process communication is implemented by the OS with several mechanisms: signals, files, pipes, sockets, shared memory.

Distributed: when processes communicate through the network, e.g., with Message Passing (MPI)

(27)

Traditional vs Multithreading Execution

(Figure: Processes and Threads. A traditional process has a single execution flow with one set of registers and one stack (sequential execution). A multithreaded process has multiple execution flows, each with its own registers and stack (concurrent execution). In both cases there is one heap, data, and code region per process.)

Remember: Threads in the same process share the same virtual address space

Communication by shared memory

(28)

Master/Worker Threading Model

The master thread creates new tasks and stores them into a queue

Workers constantly check the queue for new tasks to perform

Workers are created from the start, and do not finish while there is work left to do

The master does not do work (what could go wrong if it did?)

Synchronization only necessary to access work queues.

(29)

Fork/Join Threading Model

Figure source: https://computing.llnl.gov/tutorials/openMP/

(Figure: along the time axis, the master thread forks a team of threads; joining the threads is a synchronization point.)

The master thread is the main thread that enters the main function in your program

It controls the creation of child threads with a Fork/Join operation

Child threads within a team (parallel region) may create sub-teams of threads (nested parallelism)

Threads are joined together at synchronization points (barriers, fences). Only the master continues

The longer it is possible for a team of threads to execute in the same parallel region, the higher the parallel fraction of your code. (Synchronization is expensive)

(30)

Terminology: Local vs. Distributed Parallelism

Local (a.k.a. Simultaneous Multithreading, Shared Memory) Parallelism:

When parallelism is achieved by running two or more collaborating threads simultaneously in a multi-core system, communicating through shared memory.

Distributed Parallelism:

When parallelism is achieved by running two or more collaborating single-thread processes on more than one computer and communicating across the network (distributed memory).

Hybrid Parallelism:

A combination of the two above: multiple processes communicating across the network, each running more than one thread which communicate through shared memory.

(31)

Full Picture: Local + Distributed Parallelism

Figure source: https://computing.llnl.gov/tutorials/openMP/

(Figure: three multithreaded processes, Process 0, Process 1, and Process 2. Within each process, the threads communicate through shared memory. The processes communicate with each other by message passing across distributed memory.)
