Thread Level Parallelism (TLP)


Calcolatori Elettronici 2

http://www.dii.unisi.it/~giorgi/didattica/calel2


Estimated Industry Trends

Moore's Law allows for the rapid increase in transistors per core. TLP optimised cores will start out much simpler, and may grow complex more slowly.

The trend is for chips and CPU cores to get smaller, though TLP optimised ones will start much smaller.

Growth rates in maximum power for "fat" CPUs have levelled off a bit. For "thin" cores, the number of CPU cores per chip will probably increase rather than the power consumption per core.

"Fat" cores need lots of cache to reduce memory latency. TLP optimised designs are less latency sensitive, so less cache is needed.

Better process technology helps both types to increase, though the simpler, slower clocked "thin" cores will be slower on more traditional benchmarks.

"Fat" cores will benefit from TLP techniques and general improvements, but not as much as "thin" cores.

Roberto Giorgi, Universita’ di Siena, C208L15, Slide 3

Current 4-way SMP

• An illustration of a 4-way system today. The only TLP


Toward NIAGARA chips

• An illustration of a system with a heavily optimised TLP design

Niagara: A Torrent of Threads


First Niagara Chips:

November 2005 UltraSPARC T1

• Niagara systems deliver 14 times the performance of an UltraSPARC IIIi system

• Systems with the single-chip Niagara 2: 35 times

• Systems with Victoria Falls: 65 times


Global Embedded Systems Revenue (by Region)

[Figure: bar chart "Global Embedded Systems Revenue" by region (Americas, Europe, Japan, Asia-Pacific); revenue in $ billions for 2004 and 2009, with AAGR % (AAGR: average annual growth rate)]

Source: “Future of Embedded Systems Technology”, BCC Co, Inc., 2005

Global Embedded Systems Revenue (by Application)

[Figure: bar chart "World Embedded Systems Revenue" by application (Telecomm, Consumer, Automotive, Medical/Office, Industrial/Milit.); revenue in $ billions for 2004 and 2009, with AAGR %]

Source: “Future of Embedded Systems Technology”, BCC Co, Inc., 2005


Global Embedded HW Revenue

[Figure: bar chart "Global Embedded Hardware Revenue by Category" (MPU, MCU, DSP, Memory, ASIC/PLD, Analog); revenue in $ billions for 2004 and 2009, with AAGR %. MPU: microprocessors; MCU: microcontrollers]

Source: “Future of Embedded Systems Technology”, BCC Co, Inc., 2005

Projected Technology Progress

[Figure: "Transistor Density MPU (including SRAM)", in Mtransistors/cm2, plotted by year from 2006 to 2012]

Source: “Process Integration, Devices and Structures”, ITRS, 2005


Embedded Platforms Roadmap

[Figure: stacked bar chart "Use of embedded processors in FPGAs", 2005 vs 2006, showing the share of designs with a hard FPGA processor, a soft FPGA processor, or no FPGA processor; data labels in the chart: 14, 18, 19, 24, 68, 57 (percent)]

Source: “Survey of System Design Trends”, Celoxica Inc., August 2005

Hardwired Logic (ASIC-like) is being replaced by embedded processor devices

Embedded Processors: Innovation driven by Technology + Architecture Advances

Multi-processing: Higher throughput with less speed


Case Study – ITRS Mobile Handheld Roadmap

Year of Production                2006   2009   2012   2015
Process Technology (nm)             90     65     45     32
Supply Voltage (V)                   1    0.8    0.6    0.5
Clock Frequency (MHz)              450    600    900   1200
Processing Performance (GOPS)        2     14     77    461
Average Power (W)                  0.1    0.1    0.1    0.1
Standby Power (mW)                   2      2      2      2
Applications                      Real Time Video Codec, TV Telephone

Source: “System Drivers”, ITRS, 2003

Performance and energy efficiency (GOPS/W) increase by 200x
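The "200x" figure can be checked directly from the roadmap rows. The short sketch below only recomputes numbers already in the table (the variable and function names are illustrative): with average power fixed at 0.1 W, energy efficiency in GOPS/W grows by the same factor as raw performance, roughly 230x from 2006 to 2015, which the slide presumably rounds to 200x.

```python
# Values copied from the ITRS Mobile Handheld Roadmap table.
years = [2006, 2009, 2012, 2015]
gops = [2, 14, 77, 461]               # Processing Performance (GOPS)
avg_power_w = [0.1, 0.1, 0.1, 0.1]    # Average Power (W)


def efficiency(gops_values, power_values):
    """Return energy efficiency in GOPS/W for each roadmap year."""
    return [g / p for g, p in zip(gops_values, power_values)]


gops_per_watt = efficiency(gops, avg_power_w)

# Ratio between the last and the first roadmap year.
improvement = gops_per_watt[-1] / gops_per_watt[0]

for year, eff in zip(years, gops_per_watt):
    print(f"{year}: {eff:.0f} GOPS/W")
print(f"2006 -> 2015 improvement: {improvement:.1f}x")
```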

ITRS – Low-Power SoC

Source: “System Drivers”, ITRS, 2005

• Many Processing Elements

• Reusability, Multi-Standard requirements drive for programmable (processor-based) solutions (PEs)


ITRS – Low-Power SoC – Processing/Performance Trends

Source: “System Drivers”, ITRS, 2005

> 100 Processing Elements in 2011 !

Future Embedded System Design Trends

• Mobile Handset Market driving commercial factor

• New applications, wireless transmission standards require high performance embedded computing @ low power

• ITRS foresees 3x magnitude improvement in performance and energy efficiency over the next 10 years

(Heterogeneous) Multi-Processor system-on-Chip Platforms

Compiler Technologies for high-performance, low-power embedded computing will be needed

Compiler and System-Design Tools for heterogeneous, massively parallel processing systems and networks


Network of Excellence


HiPEAC

High-Performance Embedded Architectures and Compilers

IST – 004408


What We Have Now

• ACACES
– Extranet (Program, practical info, ...)
– Participant management

• HiPEAC Conference
– Extranet (Committees, Call for papers, practical info, ...)
– Paper submission (Commence)

SARC: Scalable ARChitectures

• WEB Site: http://www.sarc-ip.org


Paradigm shift

• Tiled architecture, built from fixed size nodes
• The architecture scales up by adding nodes
  • NOT by growing the node size
• The node becomes the processor
• The processors become the functional units

Programming model features

• Programming model will have tagged procedure calls
• Define local and global (shared) variables
  - Defines address range(s) to copy to local store
  - Automatic programming of DMA transfers
  - Defines address range(s) to watch for interference
• Set procedure properties
  - Has side effects (modifies global state)
  - Reads global space
  - Writes global space
  - Requires atomicity
    - Regarding local variables
    - Regarding global variables
• Processor functionality requirements
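One way to picture tagged procedure calls is as metadata attached to each procedure, which a runtime can inspect before offloading it. The sketch below is purely illustrative and is not the SARC programming model: the `tagged` decorator, the `Tag` fields and the `can_run_concurrently` rule are all hypothetical names and policies chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """Hypothetical properties attached to a procedure."""
    ranges: tuple = ()           # (start, end) address ranges to copy to local store and watch
    reads_global: bool = False   # reads global space
    writes_global: bool = False  # writes global space (has side effects)
    atomic: bool = False         # requires atomicity


def tagged(**props):
    """Attach a Tag to a procedure without changing its behaviour."""
    def wrap(fn):
        fn.tag = Tag(**props)
        return fn
    return wrap


def overlaps(a, b):
    """True if the half-open ranges a and b intersect."""
    return a[0] < b[1] and b[0] < a[1]


def can_run_concurrently(p, q):
    """Assumed policy: interference needs an overlapping range and at least one writer."""
    shared = any(overlaps(a, b) for a in p.tag.ranges for b in q.tag.ranges)
    writer = p.tag.writes_global or q.tag.writes_global
    return not (shared and writer)


@tagged(ranges=((0x1000, 0x2000),), reads_global=True)
def scale(block):
    return [2 * x for x in block]


@tagged(ranges=((0x1800, 0x2800),), reads_global=True, writes_global=True, atomic=True)
def accumulate(block):
    return sum(block)


@tagged(ranges=((0x4000, 0x5000),), reads_global=True, writes_global=True)
def fill(block):
    return [0 for _ in block]
```

With these tags, a runtime could schedule `scale` and `fill` on different accelerators at once, while `scale` and `accumulate` touch an overlapping range with a writer involved and would need ordering.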


Intra-node memory hierarchy

• Architecture must be easy to program for:
  • Shared memory
• Accelerators may have:
  • Local memory
    - Private, non-coherent
  • DMA controller
    - Bridge between global memory and Local memory
• Accelerators must have:
  • Global memory access
    - Directly, or through cache hierarchy
  • Single load/store instruction
  • Address range differentiates Local memory from Global memory

[Figure: node diagram with accelerators (ACC), each with Local memory and DMA, plus accelerators with cache(s), connected through the Local interconnect to the Outer shared cache]
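The "single load/store instruction" requirement, with the address range selecting local or global memory, can be sketched as a simple address decoder. This is a minimal model under stated assumptions: the window base `LOCAL_BASE`, the size `LOCAL_SIZE`, the `Accelerator` class and the dictionary used as global memory are all invented for illustration.

```python
LOCAL_BASE = 0xF000_0000   # assumed start of the local-memory address window
LOCAL_SIZE = 0x1000        # assumed 4 KiB local store


class Accelerator:
    def __init__(self, global_memory):
        self.local = bytearray(LOCAL_SIZE)   # private, non-coherent local memory
        self.global_memory = global_memory   # shared memory, modelled as {address: byte}

    def _is_local(self, addr):
        return LOCAL_BASE <= addr < LOCAL_BASE + LOCAL_SIZE

    def load(self, addr):
        """One load operation; the address range picks the target memory."""
        if self._is_local(addr):
            return self.local[addr - LOCAL_BASE]
        return self.global_memory.get(addr, 0)

    def store(self, addr, value):
        """One store operation; the address range picks the target memory."""
        if self._is_local(addr):
            self.local[addr - LOCAL_BASE] = value
        else:
            self.global_memory[addr] = value

    def dma_get(self, global_addr, local_addr, n):
        """DMA bridge: copy n bytes from global memory into local memory."""
        for i in range(n):
            self.store(local_addr + i, self.global_memory.get(global_addr + i, 0))
```

Because the local store is non-coherent in this model, a store to a local address after `dma_get` does not change the global copy; software would have to write the data back explicitly.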

Intra-node memory hierarchy (II)

• All caches inside a node must be coherent
• All outer caches (from each node) should also be coherent
• Caches work as shared distributed memory
• If threads do not share memory
  - There’s no coherence traffic, nor overhead
  - There’s no memory waste
• If threads share memory
  - Turning off coherence results in wrong execution
• What is the benefit of turning off coherence? The hardware must be there anyway …
• Turn it off for power savings?
• Lower memory access latency in non-shared mode?


Examples for intra-node memory

[Figure: example node configurations combining accelerators (ACC) with Local memory, DMA and cache(s), connected through the Local interconnect to the Outer shared cache]

Determine the node size

• If node size is fixed, we must determine its size
• Split available area among
  • Shared cache
  • Local interconnect
  • General purpose processor
  • Accelerators
• Fixed or flexible distribution?
  • Fixed GPP, cache, interconnect
  • Reconfigurable accelerator area
• How many accelerators can a thread actually exploit?
  • Streaming computation

[Figure: node floorplan with Outer cache memory, Local interconnect and GPP]


Node examples

• Sea of simple cores
  • Niagara
  • Cell
• Few complex cores
  • Power5
• Single vector/media/bio accelerator
• Multiple accelerators

[Figure: node floorplans with Outer cache memory, Local interconnect and GPP]

GPP – Accelerator interface

• For the processor to become the functional unit, task offloading must have minimum overhead
• Accelerator as ISA extension
  • Shares PC, Fetch & Decode with a general purpose CPU
  • Issue logic sends instructions to CPU or Accelerator Units
  • Implements an extension of the base ISA
• Accelerator as a new CPU
  • Has a separate PC, Fetch, Decode engine
  • May implement a completely different ISA
    - VLIW, SIMD, Stack, 16-bit

[Figure: two node diagrams. In the first, GPP and ACC share one Fetch & Dispatch (F & D) unit; in the second, CPU and ACC each have their own Fetch & Dispatch unit. Both attach to the Local interconnect and the Outer cache memory]
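The "accelerator as ISA extension" option can be illustrated with a toy interpreter: a single program counter and a shared fetch/decode step, with issue logic that routes each decoded instruction either to the CPU or to an accelerator unit. The opcode names (`add`, `mul`, `vadd`, `vmul`) and the tuple instruction format are assumptions made up for this sketch.

```python
# Hypothetical opcodes that belong to the accelerator's ISA extension.
ACC_OPCODES = {"vadd", "vmul"}


def cpu_execute(op, a, b):
    """Scalar operations handled by the general purpose CPU."""
    if op == "add":
        return a + b
    if op == "mul":
        return a * b
    raise ValueError(f"unknown CPU opcode: {op}")


def acc_execute(op, a, b):
    """Element-wise vector operations handled by the accelerator unit."""
    if op == "vadd":
        return [x + y for x, y in zip(a, b)]
    if op == "vmul":
        return [x * y for x, y in zip(a, b)]
    raise ValueError(f"unknown accelerator opcode: {op}")


def run(program):
    """Single instruction stream: one PC, shared fetch and decode."""
    pc = 0
    trace = []
    while pc < len(program):
        op, a, b = program[pc]            # shared fetch and decode
        if op in ACC_OPCODES:             # issue logic picks the execution unit
            trace.append(("ACC", acc_execute(op, a, b)))
        else:
            trace.append(("CPU", cpu_execute(op, a, b)))
        pc += 1
    return trace
```

In the "accelerator as a new CPU" option, by contrast, the accelerator would run its own loop with a separate program counter and possibly a different instruction format.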


Memory Hierarchy

[Figure: memory hierarchy diagram; visible labels: DRAM, I/O, L3]

• Set of coherent (processor-shared?) L1 caches inside the nodes

• C x Node

• Set of coherent node-shared L2 caches inside the chip (one from each node)

• 1 x Node, N x Chip

• Chip-shared L3 cache

• 1 x Chip

• Off-chip DRAM (or other memory technology)
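The multiplicities in the list above (C L1 caches per node, one L2 per node, N nodes per chip, one L3 per chip) translate into a simple count of caches per chip. The function name and the example values C = 4 and N = 8 are assumptions for illustration only.

```python
def cache_counts(l1_per_node, nodes_per_chip):
    """Number of caches at each level of one chip.

    l1_per_node    -- C, the number of L1 caches inside a node
    nodes_per_chip -- N, the number of nodes (and node-shared L2 caches) on the chip
    """
    return {
        "L1": l1_per_node * nodes_per_chip,  # C x Node, over N nodes
        "L2": nodes_per_chip,                # 1 x Node, N x Chip
        "L3": 1,                             # 1 x Chip
    }


# Example with assumed values C = 4 and N = 8.
print(cache_counts(4, 8))
```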


Motivation

• Hard to further scale uniprocessors

• Brought back focus to multiprocessors

• Different applications profit from different techniques/types of parallelism
  • ILP, TLP, DLP
• Motivates a customizable system with
  • complex cores
  • simple cores


Motivation (2)

•Parallelism type exhibited by application and suitable architecture:

[Figure: diagram relating parallelism types (ILP, TLP, DLP) to architectures; labels: SSC, CMP + vector, FCC, SMT, vector, SARC?, complex cores, simple cores, accelerators]


ISA considerations

• Complex cores and simple cores have the same ISA (allows moving threads from one to another [for real-time performance, power, …], simpler programming and compilation)
• ISA-agnostic
  • approaches applicable to basically any ISA (ARM, PowerPC, …)
• Accelerator ISAs are extensions of the GPP ISA
  • single instruction stream (co-processor instructions) or
  • multiple instruction streams

How to realize customization?

•At design-time:
  • The right mix of simple cores, complex cores, accelerators is determined at design-time
  • Pro: Highest performance for specific application domains
  • Con: after fabrication, only for specific application domains

•At run-time:
  • There will be many processing cores on a chip; for temperature reasons some will have to be powered down anyhow
  • Pro: Achieves good performance and low power on many applications


Levels of Abstraction

• Levels of abstraction:
  1) Architecture
  2) Microarchitecture
  3) Implementation
  4) Realization

• SARC WP1 focuses mainly on levels 1 and 2

SARC node architecture


Architectures of Domain Specific Accelerators

• SARC specifically targets (but is not limited to) application domains
  • scientific computing (supercomputing)
  • bioinformatics
  • multimedia
  • internet and transaction processing
• These domains contain code pieces responsible for a large fraction of execution time
• Performance and power-efficiency can be improved significantly by employing domain-specific accelerators

Scientific Computing Vector Accelerator Architecture

• For applications dominated by loops with vector operands

• What are the innovations:

• Matrix by Matrix operations (at least 2D)

• Dimensionality not encoded in the instructions (novel register file to support this)

• Sparse and Dense matrices considered identically

• Auto-indexing and –sectioning addressing mechanisms (link to WP2)

• (possible) on-chip distributed vector facility

• ISA, data formats, register file organization and memory addressing scheme under investigation


Scientific Computing Vector Accelerator Architecture (cont)

• ISA (check the document)
  • Operand types: Vectors, Matrices (Sparse and Dense), Bit vectors and Scalars (in “sparse” mode ½ of the available registers are used as index vectors)
  • Data formats: 64 bit FP; 8, 16, 32 and 64 bit INT and BOOL
  • Auto indexing for rectangular patterns (dense):

Scientific Computing Vector Accelerator Architecture (cont)

• Register file: The SARC vector register file is a parameterizable register file, which can be logically reorganized by the programmer to support multiple register dimensions and sizes simultaneously.

• Scalar reg. file shared with GPP

1) Vector registers can overlap (think about it)
2) Scalar registers can be used for conditional branches on the GPP side
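To see what "vector registers can overlap" might mean in practice, the sketch below models a register file as one flat storage array onto which the programmer maps logical registers of different shapes. It is only a toy model: the `VectorRegisterFile` class, its `define`, `write` and `read` methods and the register names are hypothetical, not the SARC ISA.

```python
class VectorRegisterFile:
    """Flat storage that is logically carved into registers of any shape."""

    def __init__(self, n_elements):
        self._storage = [0] * n_elements
        self._regs = {}   # name -> (offset, rows, cols)

    def define(self, name, offset, rows, cols):
        """Map a logical rows x cols register onto storage starting at offset."""
        if offset < 0 or offset + rows * cols > len(self._storage):
            raise ValueError("register does not fit in the register file")
        self._regs[name] = (offset, rows, cols)

    def write(self, name, values):
        """Store a flat list of rows * cols elements into the register."""
        offset, rows, cols = self._regs[name]
        if len(values) != rows * cols:
            raise ValueError("wrong number of elements for this register")
        self._storage[offset:offset + rows * cols] = list(values)

    def read(self, name):
        """Return the register contents as a list of rows."""
        offset, rows, cols = self._regs[name]
        flat = self._storage[offset:offset + rows * cols]
        return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


rf = VectorRegisterFile(16)
rf.define("V0", offset=0, rows=1, cols=8)   # 1D vector of 8 elements
rf.define("M0", offset=4, rows=2, cols=2)   # 2x2 matrix overlapping V0[4:8]
rf.write("V0", range(8))
```

Since `M0` is mapped onto the last four elements of `V0`, writing `V0` changes what `M0` reads back, and vice versa; overlapping definitions give two differently shaped views of the same data without any copy.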


Bioinformatics Accelerator

•Will have a scalar and vector-SIMD part

•(Multiple) sequence alignment algorithms require:
  • support for efficient unaligned memory accesses
  • strided memory accesses
  • vector reduction operations, etc.

•In structure prediction, Monte Carlo or molecular dynamics simulations are common
  • can profit from earlier ASIC/FPGA work

•Docking
  • profits from architectural features incorporated for structure prediction
  • but also from matrix rotations, transposes, …

Multimedia accelerator

• Vector-SIMD architecture

• Architecture agnostic to physical vector length

• Avoid packing/unpacking, reorganization overhead

• unpacking while loading

• packing while storing

• flexible access to register file

• Use more dimensions


Micro-architectural considerations

• Simple/complex GPP mixture

• Scalable cache coherence

• Support for (existing) sequential, single-threaded applications
  • Thread-level speculation
  • Kilo-instruction processors

I/O and Communication Subsystem

• Overheads of system call, context switch, interrupt, network protocol no longer justified
• With fewer threads than processing cores
  • no reason for switching execution context
  • OS must not run on same processor as user applications
  • requires extra-low communication latency


Interconnection Network

• LANs/SANs are so fast that switching and routing have to be provided in hardware
  • but reliability and congestion control are left to end-nodes
  • needs to be addressed
• Power considerations also
• Applies to multi-chip interconnection networks, but NoCs have to solve similar problems
  • in a much more constrained environment

TRANSACTIONAL MEMORY

• The most difficult task when developing multithreaded applications is making sure that the program works (e.g. deadlocks may occur when combining correct code fragments)

• Transactional memory is a concurrency control mechanism for controlling access to shared memory

• A transaction is a piece of code that executes a series of reads and writes to shared memory, which logically occur at a single instant in time, and are typically implemented in a lock-free way

• Transactional memory is optimistic: every thread completes its modifications to shared memory without regard for what other threads might be doing, recording every read and write that it makes in a log, which is validated in the commit stage

• Implementing part of the system memory as transactional memory could be the solution for storing shared data in parallel applications while simplifying programming
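The optimistic scheme described above (log every read and write, validate at commit) can be sketched in a few lines of software. This is a minimal illustration, not a hardware design or a production STM: `TVar`, `Transaction`, `Conflict` and `atomically` are names invented here, and for brevity a single lock protects the short read and commit steps, so the sketch is not lock-free.

```python
import threading

_commit_lock = threading.Lock()   # protects individual reads and the commit step only


class Conflict(Exception):
    """Raised when validation finds that a logged read is out of date."""


class TVar:
    """A shared memory word with a version number (hypothetical helper)."""
    def __init__(self, value):
        self.value = value
        self.version = 0


class Transaction:
    def __init__(self):
        self.read_log = {}    # TVar -> version observed at first read
        self.write_log = {}   # TVar -> buffered new value

    def read(self, tvar):
        if tvar in self.write_log:          # read our own pending write
            return self.write_log[tvar]
        with _commit_lock:                  # consistent (version, value) pair
            version, value = tvar.version, tvar.value
        self.read_log.setdefault(tvar, version)
        return value

    def write(self, tvar, value):
        self.write_log[tvar] = value        # buffered until commit

    def commit(self):
        with _commit_lock:
            # Validate: every value we read must still be current.
            for tvar, seen in self.read_log.items():
                if tvar.version != seen:
                    raise Conflict()
            # Publish the buffered writes as one logical step.
            for tvar, value in self.write_log.items():
                tvar.value = value
                tvar.version += 1


def atomically(body):
    """Run body(tx) as a transaction, retrying on conflict."""
    while True:
        tx = Transaction()
        try:
            result = body(tx)
            tx.commit()
            return result
        except Conflict:
            continue
```

A caller would write `atomically(lambda tx: tx.write(counter, tx.read(counter) + 1))` instead of taking a lock around the update; if another thread commits a change to `counter` in between, validation fails and the body simply runs again.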


Food for thought…

• PROBLEM: THINKING IN PARALLEL IS HARD !

• Perhaps: THINKING is hard ! (YALE PATT - Sep.2007)