
Academic year: 2021


Zebo Peng, IDA, LiTH. TDTS 08 – Lecture 11

Lecture 11: Multi-Core and GPU

• Multi-core computers
• Multithreading
• GPUs
• General Purpose GPUs

Multi-Core System

• Integration of multiple processor cores on a single chip.
   - To provide a cheap parallel computer solution.
   - To increase the computation power of a PC platform.
• A multi-core processor is a special kind of multiprocessor:
   - All processors are on the same chip -> also called a chip multiprocessor (CMP).
   - MIMD: different cores execute different code (threads) operating on different data.
   - A shared-memory multiprocessor: all cores share the same memory via some cache system.


Why Multi-Core?

• Limitations of single-core architectures:
   - High power consumption due to high clock rates (ca. 2-3% power increase per 1% performance increase).
   - Heat generation (cooling is expensive).
   - Limited parallelism (instruction-level parallelism only).
   - Design time and complexity increased by the complex methods needed to increase ILP.
• Many new applications are multithreaded and thus well suited to multi-core.
   - E.g., multimedia applications.
• General trend in computer architecture -> shift towards more parallelism.
• Much faster cache coherency circuits when all cores are on a single chip.
• Smaller physical size than an SMP (symmetric multiprocessor).
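As a rough illustration of the power figure above, the sketch below applies the slide's rule of thumb (2-3% more power per 1% more performance) linearly. The 20% performance target and the function name are assumptions made for the example, not values from the lecture.

```python
def extra_power_pct(perf_gain_pct, power_per_perf):
    """Extra power (%) for a given performance gain (%), assuming the
    slide's rule of thumb holds linearly over the whole range."""
    return perf_gain_pct * power_per_perf

# Hypothetical target: make a single core 20% faster.
for ratio in (2.0, 3.0):
    print(f"{ratio:.0f}% per 1%: +{extra_power_pct(20, ratio):.0f}% power")
```

Under this assumption, a 20% faster single core would need roughly 40-60% more power, which is the kind of trade-off that motivates adding cores instead of raising the clock rate.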

A Simplified Multi-Core Model

[Figure: four on-chip cores, each with its own L1 instruction cache and L1 data cache, connected by a shared bus to main memory.]


Alternative Models

Intel Core i7

• Four (or six) identical x86 processors.
• Each with its own split L1 cache and unified L2 cache.
• Shared L3 cache to speed up inter-processor communication.
• High-speed link between processor chips.
• Up to 4 GHz clock frequency.


Superscalar vs. Multi-Core

• A 6-way superscalar processor vs. a multi-core processor with 4 identical 2-way superscalar processors (total degree of parallelism: 8).
• Superscalar hardware complexity increases quadratically with issue width.
• Multi-core processors can run at a higher clock rate.
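A back-of-the-envelope comparison of the two designs, assuming that hardware cost grows with the square of the issue width (cost = width^2 in arbitrary units). The unit constant and the function name are assumptions made for this sketch.

```python
def issue_cost(width):
    """Relative hardware cost of a superscalar core, assuming cost ~ width**2."""
    return width ** 2

wide = {"slots": 6, "cost": issue_cost(6)}            # one 6-way core
multi = {"slots": 4 * 2, "cost": 4 * issue_cost(2)}   # four 2-way cores

print(wide)    # {'slots': 6, 'cost': 36}
print(multi)   # {'slots': 8, 'cost': 16}
```

Under this model the four small cores offer more issue slots (8 vs. 6) for less than half the cost, provided there are enough threads to keep all four busy.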

Single Core vs. Multi Core

[Figure: normalized performance vs. the initial Intel Pentium 4 processor, time axis 2000, 2004, 2008+. The single-core curve is annotated 3X and the multi-core curve 10X. Source: Intel.]


Intel Polaris

[Figure: die photo, 21.72 mm x 12.64 mm, showing the I/O areas, PLL, TAP, and a single tile of 1.5 mm x 2.0 mm.]

Technology:    65 nm, 1 poly, 8 metal (Cu)
Transistors:   100 million (full chip), 1.2 million (tile)
Die area:      275 mm² (full chip), 3 mm² (tile)
C4 bumps:      8390

• 80 cores with TFLOPS (10^12 FLOPS) performance on a single chip (the first chip to do so).
• Mesh network-on-a-chip.
• Frequency target of 5 GHz.
• Workload-aware power management:
   - Instructions to make any core sleep or wake as applications demand.
   - Chip voltage and frequency control.
• Peak of 1.01 TFLOPS at 62 W.
   - Peak power efficiency of 19.4 GFLOPS/W (almost 10X better than supercomputers).
• Short design time (regularity).
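The Polaris figures above can be cross-checked with a little arithmetic. The sketch below only uses numbers quoted on the slide; the variable names are illustrative.

```python
PEAK_FLOPS = 1.01e12   # peak performance, from the slide
POWER_W = 62           # power at that peak, from the slide
CORES = 80             # number of cores, from the slide

gflops_per_watt = PEAK_FLOPS / POWER_W / 1e9
gflops_per_core = PEAK_FLOPS / CORES / 1e9

print(f"{gflops_per_watt:.2f} GFLOPS/W at the 1.01 TFLOPS operating point")
print(f"{gflops_per_core:.3f} GFLOPS per core")
```

At 1.01 TFLOPS and 62 W the chip delivers about 16.3 GFLOPS/W, which is below the quoted 19.4 GFLOPS/W peak. That suggests the best efficiency is reached at a different voltage and frequency setting; the slide does not say which.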

Innovations on Intel’s Polaris

• Rapid design: the tiled-design approach allows designers to use smaller cores that can easily be repeated across the chip.
• Network-on-a-chip: the cores are connected in a 2D mesh network that implements message passing.
   - This scheme is much more scalable than a shared bus.
• Fine-grain power management:
   - The individual computing engines and data routers in each core can be activated or put to sleep based on the performance required by the applications.



Thread-Level Parallelism (TLP)

• This is parallelism on a coarser level than instruction-level parallelism (ILP).
• The instruction stream is divided into smaller streams (threads) to be executed in parallel.
• A thread has its own instructions and data.
   - It is part of a parallel program or an independent program.
   - Each thread has all the state (instructions, data, PC, etc.) needed to execute.
• Multi-core architectures can exploit TLP efficiently.
   - Use multiple instruction streams to improve the throughput of computers that run several programs.
• TLP is more cost-effective to exploit than ILP.


Multithreading Approaches

• Interleaved (fine-grained)
   - The processor deals with several thread contexts at a time.
   - Switches thread at each clock cycle (hardware support needed).
   - If a thread is blocked, it is skipped.
   - Hides the latency of both short and long pipeline stalls.
• Blocked (coarse-grained)
   - A thread executes until an event causes a delay (e.g., a cache miss).
   - Relieves the need for very fast thread switching.
   - No slowdown for ready-to-go threads.
• Simultaneous (SMT)
   - Instructions are issued simultaneously from multiple threads to the execution units of a superscalar processor.
• Chip multiprocessing
   - Each processor handles separate threads.

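Scalar Processor Case

[Figure: pipeline timing diagrams for the scalar processor case, with a cache miss marked as the stalling event.]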


Scalar Processor Approaches

• Single-threaded scalar
   - Simple pipeline and no multithreading.
• Interleaved multithreaded scalar
   - Easiest multithreading to implement.
   - Switches threads at each clock cycle.
   - Pipeline stages are kept close to fully occupied.
   - Hardware needs to switch thread context between cycles.
• Blocked multithreaded scalar
   - A thread executes until a latency event occurs that would stop the pipeline.
   - The processor then switches to another thread.
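The three scalar approaches can be compared with a toy cycle counter. This is a minimal sketch under strong assumptions: one instruction issues per cycle, a thread switch costs nothing, and an instruction marked 'M' blocks its thread for a fixed `miss_penalty`. The workload encoding and all names are invented for illustration.

```python
def simulate(threads, policy, miss_penalty=2):
    """Cycles needed by a scalar (one issue per cycle) pipeline.

    threads: list of strings; '.' is an ordinary instruction, 'M' is an
             instruction after which the thread is blocked for
             miss_penalty cycles (e.g., a cache miss).
    policy:  'single', 'interleaved' or 'blocked'.
    """
    n = len(threads)
    pc = [0] * n          # index of each thread's next instruction
    ready_at = [0] * n    # first cycle in which each thread may issue again
    current = 0           # thread that issued most recently
    cycle = 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        if policy == "single":
            # Run threads back to back: only the first unfinished one may issue.
            order = [next(t for t in range(n) if pc[t] < len(threads[t]))]
        elif policy == "interleaved":
            # Round-robin: start with the thread after the last issuer.
            order = [(current + 1 + k) % n for k in range(n)]
        else:
            # Blocked: stay on the current thread until it cannot issue.
            order = [(current + k) % n for k in range(n)]
        for t in order:
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                if threads[t][pc[t]] == "M":
                    ready_at[t] = cycle + 1 + miss_penalty
                pc[t] += 1
                current = t
                break
        cycle += 1        # a cycle in which nothing issued is a bubble
    return cycle

workload = ["M.", "M."]   # two threads, each: a miss, then one instruction
for policy in ("single", "interleaved", "blocked"):
    print(policy, simulate(workload, policy))
```

For this tiny workload the single-threaded pipeline needs 8 cycles, while both multithreaded variants need 5, because one thread's miss latency is overlapped with the other thread's instructions.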

Superscalar Case


Superscalar Approaches

• Classical superscalar
   - No multithreading.
• Interleaved multithreaded superscalar
   - Each cycle, as many instructions as possible are issued from a single thread.
   - Delays due to thread switches are eliminated.
   - The number of instructions issued in a cycle is limited by dependencies.
• Blocked multithreaded superscalar
   - Instructions from only one thread are executed at a time.
   - Blocked multithreading is used.

SMT and Chip Multiprocessing

• Simultaneous multithreading (SMT):
   - Issues multiple instructions at a time.
   - One thread may fill all horizontal slots.
   - Instructions from two or more threads may be issued to fill all horizontal slots.
   - With enough threads, it can issue the maximum number of instructions in each cycle.
      • Best efficiency!


SMT and Chip Multiprocessing

• Chip multiprocessing:
   - Multiple processors.
   - Each can itself be a superscalar processor that issues multiple instructions per cycle (e.g., 2-issue).
   - Each processor is assigned a thread (or a process).
      • Control is simplified!
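To make the issue-slot argument concrete, the sketch below counts filled slots over four cycles for a single-threaded 4-wide superscalar, a 4-wide SMT core, and a chip multiprocessor with two 2-wide cores. The per-cycle numbers of ready instructions are invented, and the model ignores instructions carried over from one cycle to the next.

```python
WIDTH = 4  # issue slots per cycle on the wide core

# ready[c][t]: independent instructions thread t could issue in cycle c.
# These numbers are made up for illustration.
ready = [
    [2, 1, 3, 0],
    [1, 2, 0, 2],
    [0, 3, 1, 1],
    [3, 0, 2, 2],
]

def superscalar(ready):
    """Single-threaded 4-wide core: only thread 0 can fill slots."""
    return sum(min(WIDTH, row[0]) for row in ready)

def smt(ready):
    """4-wide SMT core: all threads compete for the same 4 slots."""
    return sum(min(WIDTH, sum(row)) for row in ready)

def cmp_2x2(ready):
    """Two 2-wide cores: thread 0 on one core, thread 1 on the other."""
    return sum(min(2, row[0]) + min(2, row[1]) for row in ready)

total = WIDTH * len(ready)
print("superscalar", superscalar(ready), "of", total)
print("SMT        ", smt(ready), "of", total)
print("CMP (2x2)  ", cmp_2x2(ready), "of", total)
```

With these inputs the single thread fills 6 of 16 slots, the chip multiprocessor 10, and SMT all 16, which mirrors the "best efficiency" claim for SMT and the simpler but less flexible slot assignment of chip multiprocessing.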

Multithreading Paradigms

[Figure: issue-slot diagrams (functional units FU1 to FU4 across, execution time down; each slot marked as Thread 1 to Thread 5 or unused) for five paradigms: conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), chip multiprocessor (CMP), and simultaneous multithreading (SMT).]


Programming for Multi-Core

• There must be many threads or processes:
   - Multiple applications running on the same machine.
      • Multi-tasking is becoming very common.
      • OS software tends to run many threads as part of its normal operation.
   - An application may also have multiple threads.
      • In most cases, it must be specifically written or compiled for this.
• The OS scheduler should map threads to different cores:
   - To balance the workload; or
   - To avoid hot spots due to heat generation.
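One simple way to picture workload balancing is a greedy mapping that places the heaviest thread first, always on the least-loaded core. This is an illustrative sketch only, not how any particular OS scheduler works; the load values and the function name are hypothetical.

```python
import heapq

def assign_threads(loads, n_cores):
    """Greedy mapping: heaviest thread first, always onto the least-loaded core.

    loads:   estimated load of each thread (index = thread id).
    returns: dict mapping core id -> list of thread ids.
    """
    heap = [(0, core) for core in range(n_cores)]   # (total load, core id)
    heapq.heapify(heap)
    mapping = {core: [] for core in range(n_cores)}
    for tid, load in sorted(enumerate(loads), key=lambda p: -p[1]):
        total, core = heapq.heappop(heap)           # least-loaded core so far
        mapping[core].append(tid)
        heapq.heappush(heap, (total + load, core))
    return mapping

loads = [5, 3, 3, 2, 2, 1]        # hypothetical per-thread loads
print(assign_threads(loads, 2))   # {0: [0, 3, 5], 1: [1, 2, 4]}
```

Both cores end up with a total load of 8 in this example. A heat-aware scheduler would use the same idea with temperature instead of load as the balancing key.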



Graphics Processing Unit: Why?

• Graphics applications are:
   - Rendering of 2D or 3D images with complex optical effects;
   - Highly computation-intensive;
   - Massively parallel;
   - Data-stream based.
• General-purpose processors (CPUs) are:
   - Designed to handle huge data volumes;
   - Serial, one operation at a time;
   - Control-flow based;
   - Highly flexible, but not very well adapted to graphics.
• GPUs are the solution!

Graphics Processing Example

[Figure: GPU pipeline. Input from CPU -> host interface -> vertex processing -> triangle setup -> pixel processing -> memory interface, with four 64-bit channels to memory. The parts drawn in yellow are programmable.]


Graphics Processing Feature

1) Process pixels in parallel

• Large degree of data parallelism:
   - 2.1M pixels per frame (HDTV) => lots of work.
   - All pixels are independent => no synchronization needed.
   - Lots of spatial locality => regular memory access.
• Great speedups can be achieved.
   - Limited only by the amount of hardware.

Graphics Processing Feature

2) Focus on throughput, not latency

• Each pixel can take a long time to process, as long as we process many of them at the same time.
• Great scalability:
   - Lots of simple parallel processors.
   - Low clock speed.

[Figure: latency-optimized (fast, serial) vs. throughput-optimized (slow, parallel).]
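The throughput argument can be checked with simple arithmetic. The sketch below assumes a 1920 x 1080 frame (which gives the slide's 2.1M pixels) and two hypothetical machines: one fast serial unit and 1024 slow parallel units. The per-pixel latencies and unit counts are invented for illustration.

```python
PIXELS = 1920 * 1080   # 2,073,600 pixels, i.e. about 2.1M per frame

def frame_time_ns(n_units, latency_ns):
    """Time for one frame when n_units pixels are processed at once and
    each pixel takes latency_ns on its unit."""
    batches = -(-PIXELS // n_units)   # ceiling division
    return batches * latency_ns

fast_serial = frame_time_ns(n_units=1, latency_ns=1)          # hypothetical
slow_parallel = frame_time_ns(n_units=1024, latency_ns=100)   # hypothetical

print(fast_serial, "ns per frame (fast, serial)")
print(slow_parallel, "ns per frame (slow, parallel)")
```

Even though each pixel takes 100 times longer on the parallel machine, the frame finishes about 10 times sooner (202,500 ns vs. 2,073,600 ns), because 1024 pixels are in flight at once.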


GPU Examples

[Figure: die layouts of the Nvidia G80 and the AMD 5870, annotated with: lots of small parallel SIMD processors; limited interconnect; limited memory; lots of memory controllers; very small caches; hardware thread schedulers; fixed-function logic.]


CPU vs. GPU Philosophy

[Figure: block diagrams. CPU: four cores, each with an L2 cache, a branch predictor (BP), and an L1 cache. GPU: eight groups of eight simple cores, each group with a local memory (LM) and an instruction cache (I$).]

• 4 massive CPU cores: big caches, branch predictors, out-of-order, multiple-issue, speculative execution, ...
   - 2 IPC per core, 8 IPC total @ 3 GHz.
• 8*8 simple GPU cores: no big caches, in-order, single-issue, ...
   - 1 IPC per core, 64 IPC total @ 1.5 GHz.

Source: David Black-Schaffer

CPU vs. GPU Chips

                     Intel Nehalem 4-core      AMD Radeon 5870
Power                130 W                     188 W
Die area             263 mm²                   334 mm²
Single precision     106 GFLOPS                2720 GFLOPS
Caches               Big (8 MB)                Small (<1 MB)
Execution            Out-of-order execution    Hardware thread scheduling
Efficiency           0.8 GFLOPS/W              14.5 GFLOPS/W
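The efficiency figures on this slide and the IPC totals on the previous one follow directly from the quoted numbers. The sketch below recomputes them; the per-area figure is an extra derived quantity that the slides do not state, and the dictionary layout is just for illustration.

```python
chips = {
    "Intel Nehalem 4-core": {"gflops": 106, "watts": 130, "mm2": 263},
    "AMD Radeon 5870": {"gflops": 2720, "watts": 188, "mm2": 334},
}

def per_watt(chip):
    """Single-precision GFLOPS per watt."""
    return chip["gflops"] / chip["watts"]

def per_mm2(chip):
    """Single-precision GFLOPS per mm^2 of die area (not on the slide)."""
    return chip["gflops"] / chip["mm2"]

for name, chip in chips.items():
    print(f"{name}: {per_watt(chip):.2f} GFLOPS/W, {per_mm2(chip):.2f} GFLOPS/mm^2")

# Peak instruction throughput from the "CPU vs. GPU Philosophy" slide,
# in billions of instructions per second: cores * IPC * clock (GHz).
cpu_ginstr = 4 * 2 * 3.0
gpu_ginstr = 64 * 1 * 1.5
print(cpu_ginstr, gpu_ginstr)
```

The GPU's advantage is about 4x in peak instruction throughput (96 vs. 24 billion instructions per second) but roughly 18x in GFLOPS per watt, which reflects how much of the CPU's power budget goes to control logic and caches rather than arithmetic.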


CPU vs. GPU Architecture

• CPUs are latency-optimized.
   - Each thread runs as fast as possible, but there are only a few threads.
• GPUs are throughput-optimized.
   - Each thread may take a long time, but there are thousands of threads.
• CPUs have a few massive cores.
• GPUs have hundreds of simple cores.
• CPUs excel at irregular, control-intensive work.
   - Lots of hardware for control, few ALUs.
• GPUs excel at regular, computation-intensive work.
   - Lots of ALUs for computation, little hardware for control.
• The best is to combine both architectures in the same machine!



GPU – Specialized Processor

• Very efficient for:
   - Fast parallel floating-point processing.
   - SIMD (Single Instruction, Multiple Data) operations.
   - High computation per memory access.
• Not as efficient for:
   - Double-precision computations.
   - Logical operations on integer data.
   - Random-access, memory-intensive operations.
   - Branching-intensive operations.

GPU Instruction Flow

• GPU control units fetch 1 instruction per cycle and share it with 8 processor cores.
   - SIMD (AMD GPU = 4-way; Intel’s Larrabee = 16-way).
• What if they don’t all want the same instruction?
   - Divergent execution, due to branches, for example.


Divergent Execution

[Figure: instruction fetch trace for cycles 0 to 8 across eight threads, t0 to t7, that share one fetched instruction per cycle. The thread instructions are 1, 2, if (...) do 3 else do 4, 5, 6. All threads execute 1, 2, and the if together; after the branch, t7 takes a different path from the other threads and stalls while their instructions are fetched.]

Divergent execution can dramatically hurt performance!

Source: David Black-Schaffer
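The cost of divergence can be estimated with a small model of one SIMD group that fetches a single instruction per cycle: if the threads disagree at a branch, both sides are fetched one after the other and the threads on the inactive side sit idle. The instruction counts and function names below are assumptions for illustration, not values from the slide.

```python
def simd_cycles(takes_if, then_len, else_len, common_len):
    """Fetch cycles for one SIMD group.

    takes_if:   one bool per thread, True if it takes the 'if' side.
    then_len, else_len: instructions on each side of the branch.
    common_len: instructions every thread executes (including the branch).
    """
    cycles = common_len
    if any(takes_if):
        cycles += then_len      # at least one thread needs the 'if' side
    if not all(takes_if):
        cycles += else_len      # at least one thread needs the 'else' side
    return cycles

def utilization(takes_if, then_len, else_len, common_len):
    """Fraction of (thread, cycle) slots that do useful work."""
    lanes = len(takes_if)
    n_if = sum(takes_if)
    useful = lanes * common_len + n_if * then_len + (lanes - n_if) * else_len
    return useful / (lanes * simd_cycles(takes_if, then_len, else_len, common_len))

uniform = [True] * 8               # all eight threads agree
divergent = [True] * 7 + [False]   # one thread takes the else side

print(simd_cycles(uniform, 4, 4, 3), utilization(uniform, 4, 4, 3))
print(simd_cycles(divergent, 4, 4, 3), utilization(divergent, 4, 4, 3))
```

A single divergent thread raises the cycle count from 7 to 11 in this example and drops utilization from 100% to about 64%, even though only one of eight threads behaves differently.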

Reduce Branch Divergence

• Software-based optimizations can be used.
• Iteration delaying:
   - Targets a divergent branch enclosed by a loop.
   - Executes loop iterations that take the same branch direction.
   - Delays those that take the other direction until later iterations.
   - Executes the delayed operations in future iterations, where more threads take the same direction.
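The idea of iteration delaying can be sketched with a simplified model. Each thread has a list of branch directions, one per loop iteration. In plain lockstep, every iteration in which the threads disagree costs two passes (one per direction). With delaying, each pass picks one direction and only the threads whose next iteration takes that direction advance. Choosing the direction by majority vote is an assumption of this sketch, as are all names and the example data.

```python
def lockstep_passes(dirs):
    """dirs[t][i] is the branch direction of thread t in iteration i.
    Every iteration costs one pass per distinct direction taken."""
    n_iter = len(dirs[0])
    return sum(len({d[i] for d in dirs}) for i in range(n_iter))

def delayed_passes(dirs):
    """Each pass executes one direction; threads whose next iteration
    takes the other direction wait (their iteration is delayed)."""
    nxt = [0] * len(dirs)              # next iteration index per thread
    passes = 0
    while any(nxt[t] < len(dirs[t]) for t in range(len(dirs))):
        pending = [dirs[t][nxt[t]] for t in range(len(dirs))
                   if nxt[t] < len(dirs[t])]
        majority = 2 * sum(pending) >= len(pending)   # True wins ties
        for t in range(len(dirs)):
            if nxt[t] < len(dirs[t]) and dirs[t][nxt[t]] == majority:
                nxt[t] += 1            # this thread runs its iteration now
        passes += 1
    return passes

T, F = True, False
dirs = [            # hypothetical directions: 4 threads, 4 iterations each
    [T, F, T, F],
    [F, T, F, T],
    [T, F, T, F],
    [F, T, F, T],
]
print(lockstep_passes(dirs), delayed_passes(dirs))
```

For this input, lockstep execution needs 8 passes while the delayed version needs 5: after the first pass the threads are offset by one iteration, so they agree on the direction in the following passes.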


GPGPU

• Make the GPU more general by having more programmable components.
• Implement GPU and (multi-core) CPU in the same machine.
• Provide hardware support on the GPU for non-graphics applications.
• Include multiple parallelism schemes in the same architecture.

NVIDIA Tesla Architecture


GPU Architecture Features

• Massively parallel, 1000s of processors (today).
• Cost efficiency:
   - Graphics drives demand up and supply up, leading to low prices.
• Power efficiency:
   - Fixed-function hardware = area and power efficient.
   - No speculation: more processing, less leaky cache.
• SIMD computation.
• Computing power = Frequency * Transistors, i.e., (Moore’s law)^2.
   - GPUs: 1.7X (pixels) to 2.3X (vertices) annual growth.
   - CPUs: 1.4X annual growth.
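Annual growth factors compound quickly. The sketch below applies the slide's rates over a five-year horizon; the horizon is an arbitrary choice for illustration.

```python
def growth(rate_per_year, years):
    """Total improvement factor after compounding an annual rate."""
    return rate_per_year ** years

YEARS = 5   # hypothetical horizon

for label, rate in (("CPU", 1.4), ("GPU, pixels", 1.7), ("GPU, vertices", 2.3)):
    print(f"{label}: {growth(rate, YEARS):.1f}X after {YEARS} years")
```

After five years the CPU rate gives about 5.4X, while the GPU rates give roughly 14X to 64X, so a modest difference in annual rate becomes an order-of-magnitude gap.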

Performance Improvement

[Figure: GFLOPS (0 to 1400) vs. date (9/2002 to 7/2009) for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) and Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere).]


CUDA Programming Language

• CUDA = Compute Unified Device Architecture.
• Minimal extensions to familiar C/C++.
• Heterogeneous serial/parallel programming model:
   - Serial code executes on the CPU.
   - Parallel kernels execute across a set of parallel threads on the GPU.
• It enables general-purpose GPU computing.
   - It also maps well to multi-core CPUs.
• It enables heterogeneous systems (i.e., CPU + GPU).
   - The CPU is good at serial computation, the GPU at parallel computation.

General CUDA Steps

• Copy data from CPU to GPU.
• Perform SIMD computations (kernel) on the GPU.
• Copy data back from GPU to CPU.
• Execution on the host doesn’t wait for the kernel to finish on the GPU.
• General rules:
   - Minimize data transfer between CPU and GPU.
   - Maximize the number of threads on the GPU.
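Why data transfer matters can be seen from a simple cost model of the three steps: copy in, compute, copy out. The sketch below is plain Python arithmetic, not CUDA code. The peak rates are the single-precision figures from the "CPU vs. GPU Chips" slide; the 8 GB/s link bandwidth, the data size, and the function names are assumptions for illustration, and the model ignores any overlap of transfer and computation.

```python
def cpu_time_s(flops, cpu_gflops=106):
    """Time to do the work on the CPU at its peak rate."""
    return flops / (cpu_gflops * 1e9)

def gpu_time_s(n_bytes, flops, gpu_gflops=2720, link_gb_s=8):
    """Copy in, run the kernel, copy out (same size both ways, no overlap)."""
    transfer = 2 * n_bytes / (link_gb_s * 1e9)
    compute = flops / (gpu_gflops * 1e9)
    return transfer + compute

N_BYTES = 100e6   # 100 MB of data, hypothetical
for flops in (1e8, 1e12):
    print(f"{flops:.0e} FLOP: CPU {cpu_time_s(flops):.4f} s, "
          f"GPU {gpu_time_s(N_BYTES, flops):.4f} s")
```

With little computation per byte the transfer dominates and the CPU wins; with a lot of computation per byte the GPU is far faster. This matches the earlier point that GPUs are efficient when there is high computation per memory access.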


Summary

• All computers are now parallel computers!
• Multi-core processors represent an important new trend in computer architecture.
   - Decreased power consumption and heat generation.
   - Minimized wire lengths and interconnect latencies.
• They enable true thread-level parallelism with great energy efficiency and scalability.
• Graphics requires a lot of computation and a huge amount of bandwidth -> GPUs.
• GPUs are coming to general-purpose computing, since they deliver huge performance with low power.
• Massively parallel computing is becoming a commodity technology.
