Thread Level Parallelism (TLP)

Academic year: 2021


Calcolatori Elettronici 2

http://www.dii.unisi.it/~giorgi/didattica/calel2

Estimated Industry Trends

Moore's Law allows for the rapid increase in transistors per core. TLP optimised cores will start out much simpler, and may grow complex more slowly.

The trend is for chips and CPU cores to get smaller, though TLP optimised ones will start much smaller.

Growth rates in maximum power for "fat" CPUs have levelled off a bit. For "thin" cores, the number of CPU cores per chip will probably increase rather than the power consumption per core.

"Fat" cores need lots of cache to reduce memory latency. TLP optimised designs are less latency sensitive, so less cache is needed.

Better process technology helps both types to improve, though the simpler, slower clocked "thin" cores will be slower on more traditional benchmarks.

"Fat" cores will benefit from TLP techniques and general improvements, but not as much as "thin" cores.

Roberto Giorgi, Università di Siena, C208L15, Slide 3

Current 4-way SMP

An illustration of a 4-way system today. The only TLP

Toward NIAGARA chips

An illustration of a system with a heavily optimised TLP design

Niagara: A Torrent of Threads

First Niagara Chips: November 2005 UltraSPARC T1

Niagara systems deliver 14 times the performance of an UltraSPARC IIIi system

Systems with the single-chip Niagara 2: 35 times

Systems with Victoria Falls: 65 times

Global Embedded Systems Revenue (by Region)

[Figure: bar chart "Global Embedded Systems Revenue" by region (Americas, Europe, Japan, Asia-Pacific). Axes: $ billions and AAGR %. Series: 2004, 2009, AAGR%.]

AAGR: average annual growth rate

Source: "Future of Embedded Systems Technology", BCC Co, Inc., 2005

Global Embedded Systems Revenue (by Application)

[Figure: bar chart "World Embedded Systems Revenue" by application (Telecomm, Consumer, Automotive, Medical/Office, Industrial/Milit.). Axes: $ billions and AAGR %. Series: 2004, 2009, AAGR%.]

Source: "Future of Embedded Systems Technology", BCC Co, Inc., 2005

Global Embedded HW Revenue

[Figure: bar chart "Global Embedded Hardware Revenue by Category" (MPU, MCU, DSP, Memory, ASIC/PLD, Analog). Axes: $ billions and AAGR %. Series: 2004, 2009, AAGR%.]

MPU: microprocessors; MCU: microcontrollers

Source: "Future of Embedded Systems Technology", BCC Co, Inc., 2005

Projected Technology Progress

[Figure: "Transistor Density MPU (including SRAM)", in Mtransistors/cm2, by year from 2006 to 2012.]

Source: "Process Integration, Devices and Structures", ITRS, 2005

Embedded Platforms Roadmap

[Figure: stacked bar chart "Use of embedded processors in FPGAs", 2005 vs 2006, scale 0% to 100%. Categories: hard FPGA processor, soft FPGA processor, no FPGA processor. Data labels recovered from the chart: 14, 18, 19, 24, 68, 57.]

Source: "Survey of System Design Trends", Celoxica Inc., August 2005

→ Hardwired Logic (ASIC-like) is being replaced by embedded processor devices

Embedded Processors: Innovation driven by Technology + Architecture Advances

Multi-processing: Higher throughput with less speed

Case Study – ITRS Mobile Handheld Roadmap

Year of Production               2006   2009   2012   2015
Process Technology (nm)            90     65     45     32
Supply Voltage (V)                  1    0.8    0.6    0.5
Clock Frequency (MHz)             450    600    900   1200
Processing Performance (GOPS)       2     14     77    461
Average Power (W)                 0.1    0.1    0.1    0.1
Standby Power (mW)                  2      2      2      2

Applications: Real Time Video Codec, TV Telephone

Source: "System Drivers", ITRS, 2003

→ Performance and energy efficiency (GOPS/W) increase by 200x
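The "200x" figure can be checked against the roadmap rows for processing performance and average power. The snippet below is only a worked calculation on those two rows; the variable names are illustrative and not part of the ITRS material.

```python
# Rows copied from the ITRS Mobile Handheld Roadmap table in this slide.
years = [2006, 2009, 2012, 2015]
performance_gops = [2, 14, 77, 461]
average_power_w = [0.1, 0.1, 0.1, 0.1]

def energy_efficiency(gops, watts):
    """Energy efficiency in GOPS/W for each roadmap year."""
    return [g / w for g, w in zip(gops, watts)]

efficiency = energy_efficiency(performance_gops, average_power_w)

# Improvement between the first and the last roadmap year.
gain = efficiency[-1] / efficiency[0]

for year, eff in zip(years, efficiency):
    print(f"{year}: {eff:.0f} GOPS/W")
print(f"2006 -> 2015 gain: about {gain:.0f}x")
```

Because average power is held constant at 0.1 W, the efficiency gain equals the performance gain, 461 / 2, i.e. roughly 230x, which is in line with the "200x" quoted on the slide.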

ITRS – Low-Power SoC

Source: “System Drivers”, ITRS, 2005

• Many Processing Elements

• Reusability, Multi-Standard requirements drive for programmable (processor-based) solutions (PEs)

ITRS – Low-Power SoC – Processing/Performance Trends

Source: "System Drivers", ITRS, 2005

> 100 Processing Elements in 2011 !

Future Embedded System Design Trends

The mobile handset market is the driving commercial factor

New applications and wireless transmission standards require high-performance embedded computing at low power

ITRS foresees 3x magnitude improvement in performance and energy efficiency over the next 10 years

→ (Heterogeneous) Multi-Processor System-on-Chip Platforms

→ Compiler Technologies for high-performance, low-power embedded computing will be needed

→ Compiler and System-Design Tools for heterogeneous, massively parallel processing systems and networks

Network of Excellence

HiPEAC

High-Performance Embedded Architectures and Compilers

IST – 004408

What We Have Now

• ACACES
– Extranet (Program, practical info, ...)
– Participant management

• HiPEAC Conference
– Extranet (Committees, Call for papers, practical info, ...)
– Paper submission (Commence)

SARC: Scalable ARChitectures

WEB Site: http://www.sarc-ip.org

Paradigm shift

Tiled architecture, built from fixed size nodes

• The architecture scales up by adding nodes
• NOT by growing the node size
• The node becomes the processor

The processors become the functional units

Programming model features

Programming model will have tagged procedure calls

• Define local and global (shared) variables
- Defines address range(s) to copy to local store
- Automatic programming of DMA transfers
- Defines address range(s) to watch for interference

• Set procedure properties
- Has secondary effects (modifies global state)
- Reads global space
- Writes global space
- Requires atomicity (regarding local variables, regarding global variables)

• Processor functionality requirements
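One way to picture tagged procedure calls is a decorator that attaches the listed properties to a procedure, so that a runtime could derive DMA transfers and interference checks from them. The sketch below is purely illustrative: `ProcTags`, `tagged`, `dma_transfers` and `interferes` are hypothetical names, and the slides do not define any concrete syntax for the SARC programming model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProcTags:
    """Hypothetical tag record; field names are assumptions, not SARC syntax."""
    copy_in: tuple = ()         # (start, end) address ranges to copy to local store
    watch: tuple = ()           # (start, end) address ranges to watch for interference
    reads_global: bool = False
    writes_global: bool = False
    atomic: bool = False

def tagged(**tags):
    """Attach a ProcTags record to a procedure."""
    def wrap(fn):
        fn.tags = ProcTags(**tags)
        return fn
    return wrap

def dma_transfers(fn):
    """(start, length) DMA transfers a runtime could program from the tags."""
    return [(start, end - start) for start, end in fn.tags.copy_in]

def interferes(fn, written_range):
    """True if a write to `written_range` overlaps a range `fn` asked to watch."""
    lo, hi = written_range
    return any(start < hi and lo < end for start, end in fn.tags.watch)

@tagged(copy_in=((0x1000, 0x1400),), watch=((0x1000, 0x1400),), reads_global=True)
def scale_block(block, factor):
    """Example procedure: reads a block of global data, has no side effects."""
    return [x * factor for x in block]

print(dma_transfers(scale_block))                 # one 1 KiB transfer
print(interferes(scale_block, (0x13F0, 0x1410)))  # overlapping write
print(interferes(scale_block, (0x2000, 0x2010)))  # unrelated write
```

The point of the sketch is that the programmer states intent once, in the tags, and the DMA programming and interference watching are derived automatically rather than written by hand.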

Intra-node memory hierarchy

Architecture must be easy to program for:
• Shared memory

Accelerators may have:
• Local memory
- Private, non-coherent
• DMA controller
- Bridge between global memory and Local memory

Accelerators must have:
• Global memory access
- Directly, or through cache hierarchy

Single load/store instruction
• Address range differentiates Local memory from Global memory

[Figure: two accelerators (ACC), each with local memory, a DMA controller and cache(s), connected through a local interconnect to an outer shared cache]
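The idea of a single load/store instruction whose target is selected by address range can be sketched as a small software model. Everything concrete here is an assumption made for illustration: the base address `LOCAL_BASE`, the size `LOCAL_SIZE`, and the class and method names do not come from the slides.

```python
LOCAL_BASE = 0x1000_0000   # hypothetical start of the local-memory window
LOCAL_SIZE = 0x4000        # hypothetical local-memory size (16 KiB)

class AcceleratorMemory:
    """Toy model: one load/store path, routed by address range."""

    def __init__(self):
        self.local = bytearray(LOCAL_SIZE)   # private, non-coherent local memory
        self.global_mem = {}                 # sparse model of global memory

    def is_local(self, addr):
        return LOCAL_BASE <= addr < LOCAL_BASE + LOCAL_SIZE

    def load(self, addr):
        """Single load: the address alone decides which memory answers."""
        if self.is_local(addr):
            return self.local[addr - LOCAL_BASE]
        return self.global_mem.get(addr, 0)

    def store(self, addr, value):
        """Single store: the address alone decides which memory is written."""
        if self.is_local(addr):
            self.local[addr - LOCAL_BASE] = value & 0xFF
        else:
            self.global_mem[addr] = value & 0xFF

    def dma_get(self, global_addr, local_addr, nbytes):
        """DMA controller as a bridge: copy from global memory into local memory."""
        if not (self.is_local(local_addr) and self.is_local(local_addr + nbytes - 1)):
            raise ValueError("DMA destination must lie inside local memory")
        for i in range(nbytes):
            self.local[local_addr - LOCAL_BASE + i] = self.global_mem.get(global_addr + i, 0)

mem = AcceleratorMemory()
mem.store(0x2000, 7)                  # outside the window: goes to global memory
mem.dma_get(0x2000, LOCAL_BASE, 1)    # bridge it into local memory
print(mem.load(LOCAL_BASE))           # same load call, now served locally
```

With this arrangement the accelerator code needs no separate "local load" and "global load" instructions; only the address it uses changes.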

Intra-node memory hierarchy (II)

All caches inside a node must be coherent

All outer caches (from each node) should also be coherent

Caches work as shared distributed memory

• If threads do not share memory
- There's no coherence traffic, nor overhead
- There's no memory waste

• If threads share memory
- Turning off coherence results in wrong execution

What is the benefit of turning off coherence? The hardware must be there anyway …
• Turn it off for power savings?
• Lower memory access latency in non-shared mode?
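Why turning off coherence breaks threads that share memory can be shown with a deliberately simplified model: two private write-through caches, with or without invalidation on writes. This is a generic textbook-style sketch, not the SARC protocol; all class and function names are made up for the example.

```python
class PrivateCache:
    """Toy private cache: keeps a copy of every address it has read."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from shared memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]             # hit: may be stale without coherence

    def invalidate(self, addr):
        self.lines.pop(addr, None)

class Node:
    """Node with one private cache per core over a shared memory."""

    def __init__(self, n_cores, coherent):
        self.memory = {}
        self.coherent = coherent
        self.caches = [PrivateCache(self.memory) for _ in range(n_cores)]

    def read(self, core, addr):
        return self.caches[core].read(addr)

    def write(self, core, addr, value):
        self.memory[addr] = value                 # write-through to shared memory
        self.caches[core].lines[addr] = value
        if self.coherent:                         # coherence: invalidate other copies
            for i, cache in enumerate(self.caches):
                if i != core:
                    cache.invalidate(addr)

def shared_read_after_write(coherent):
    """Core 1 caches a value, core 0 overwrites it, core 1 reads again."""
    node = Node(n_cores=2, coherent=coherent)
    node.memory[0x40] = 1
    node.read(1, 0x40)
    node.write(0, 0x40, 2)
    return node.read(1, 0x40)

print(shared_read_after_write(coherent=True))    # core 1 sees the new value
print(shared_read_after_write(coherent=False))   # core 1 still sees the old value
```

When the two cores never touch the same address, the invalidation loop never has anything to remove, which matches the slide's point that non-sharing threads generate no coherence traffic.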

Examples for intra-node memory

[Figure: example nodes combining accelerators (ACC) with local memory, DMA controllers and/or cache(s), connected through a local interconnect to an outer shared cache]

Determine the node size

If node size is fixed, we must determine its size

Split available area among
• Shared cache
• Local interconnect
• General purpose processor
• Accelerators

Fixed or flexible distribution?
• Fixed GPP, cache, interconnect
• Reconfigurable accelerator area
• How many accelerators can a thread actually exploit?
• Streaming computation

[Figure: node floorplans showing outer cache memory, local interconnect and GPP]

Node examples

Sea of simple cores
• Niagara
• Cell

Few complex cores
• Power5

Single vector/media/bio accelerator

Multiple accelerators

[Figure: node floorplans showing outer cache memory, local interconnect and GPP]

GPP – Accelerator interface

For the processor to become the functional unit, task offloading must have minimum overhead

Accelerator as ISA extension
• Shares PC, Fetch & Decode with a general purpose CPU
• Issue logic sends instructions to CPU or Accelerator Units
• Implements an extension of the base ISA

[Figure: outer cache memory and local interconnect; GPP and ACC share one Fetch & Dispatch (F & D) unit]

Accelerator as a new CPU
• Has a separate PC, Fetch, Decode engine
• May implement a completely different ISA
- VLIW, SIMD, Stack, 16-bit

[Figure: local interconnect; the CPU and each ACC have their own Fetch & Dispatch unit]

Memory Hierarchy

[Figure: memory hierarchy diagram with DRAM, I/O, L3, cache and control blocks]

Set of coherent (processor-shared?) L1 caches inside the nodes
• C x Node

Set of coherent node-shared L2 caches inside the chip (one from each node)
• 1 x Node, N x Chip

Chip-shared L3 cache
• 1 x Chip

Off-chip DRAM (or other memory technology)

Motivation

Hard to further scale uniprocessors

Brought back focus to multiprocessors

Different applications profit from different techniques/types of parallelism
• ILP, TLP, DLP

Motivates a customizable system with
• complex cores
• simple cores

Motivation (2)

Parallelism type exhibited by application and suitable architecture:

[Figure: diagram relating parallelism types (ILP, TLP, DLP) to architectures. Labels in the diagram: SSC, FCC, SMT, CMP+, vector, complex cores, simple cores, accelerators, SARC?]

ISA considerations

Complex cores and simple cores have the same ISA (this allows threads to move from one to another [for real-time performance, power, …] and simplifies programming and compilation)

ISA-agnostic
• approaches applicable to basically any ISA (ARM, PowerPC, …)

Accelerator ISAs are extensions of the GPP ISA
• single instruction stream (co-processor instructions) or
• multiple instruction streams

How to realize customization?

At design-time:

• The right mix of simple cores, complex cores, accelerators is determined at design-time
• Pro: Highest performance for specific application domains
• Con: after fabrication, only for specific application domains

At run-time:
• There will be many processing cores on a chip; for temperature reasons some will have to be powered down anyhow
• Pro: Achieves good performance and low power on many applications

Levels of Abstraction

Levels of abstraction:
1. Architecture
2. Microarchitecture
3. Implementation
4. Realization

SARC WP1 focuses mainly on levels 1 and 2

SARC node architecture

Architectures of Domain Specific Accelerators

SARC specifically targets (but is not limited to) the application domains
• scientific computing (supercomputing)
• bioinformatics
• multimedia
• internet and transaction processing

These domains contain code pieces responsible for a large fraction of execution time

Performance and power-efficiency can be improved significantly by employing domain-specific accelerators

Scientific Computing Vector Accelerator Architecture

For applications dominated by loops with vector operands

What are the innovations:
• Matrix by Matrix operations (at least 2D)
• Dimensionality not encoded in the instructions (novel register file to support this)
• Sparse and Dense matrices considered identically
• Auto-indexing and -sectioning addressing mechanisms (link to WP2)
• (possible) on-chip distributed vector facility

ISA, data formats, register file organization and memory addressing scheme under investigation

Scientific Computing Vector Accelerator Architecture (cont)

ISA (check the document)

Operand types: Vectors, Matrices (Sparse and Dense), Bit vectors and Scalars (in "sparse" mode, ½ of the available registers are used as index vectors)

Data formats: 64 bit FP; 8, 16, 32 and 64 bit INT and BOOL

Auto indexing for rectangular patterns (dense):
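The figure for this slide was not recovered, so the following is only a generic illustration of what addressing a rectangular pattern in a dense, row-major matrix involves. It is not the SARC auto-indexing mechanism itself; the function name and parameters are assumptions made for the example.

```python
def rect_addresses(base, row_pitch, elem_size, first_row, first_col, rows, cols):
    """Byte addresses of a rows x cols block inside a dense row-major matrix.

    base       -- address of element (0, 0)
    row_pitch  -- bytes between the starts of two consecutive matrix rows
    elem_size  -- bytes per element
    """
    return [
        base + (first_row + r) * row_pitch + (first_col + c) * elem_size
        for r in range(rows)
        for c in range(cols)
    ]

# A matrix with 8 columns of 64-bit (8-byte) elements has a row pitch of 64 bytes.
# Addresses of the 2 x 3 block whose top-left element is (1, 2):
block = rect_addresses(base=0, row_pitch=64, elem_size=8,
                       first_row=1, first_col=2, rows=2, cols=3)
print(block)   # [80, 88, 96, 144, 152, 160]
```

An auto-indexing unit would generate such an address sequence in hardware from a handful of parameters, so the loop nest above never has to run as scalar instructions.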

Scientific Computing Vector Accelerator Architecture (cont)

Register file: The SARC vector register file is a parameterizable register file, which can be logically reorganized by the programmer to support multiple register dimensions and sizes simultaneously.

[Figure: register file organization; the scalar register file is shared with the GPP]

1) Vector registers can overlap (think about it)
2) Scalar registers can be used for conditional branches on the GPP side

Bioinformatics Accelerator

Will have a scalar and vector-SIMD part

(Multiple) sequence alignment algorithms require:
• support for efficient unaligned memory accesses
• strided memory accesses
• vector reduction operations, etc.
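A small example helps show where these access patterns come from. In dynamic-programming alignment (Smith-Waterman style, used here only as an assumed example), the cells on one anti-diagonal of the score matrix do not depend on each other, and in a row-major matrix they sit a constant stride apart. The helper names below are illustrative and model the operations in plain Python rather than as accelerator instructions.

```python
def unaligned_load(mem, offset, width):
    """Load `width` consecutive elements starting at any offset."""
    return mem[offset:offset + width]

def strided_load(mem, start, stride, count):
    """Load `count` elements that are `stride` positions apart."""
    return [mem[start + i * stride] for i in range(count)]

def anti_diagonal(scores, n_cols, row, col, count):
    """Cells (row, col), (row+1, col-1), ... of a row-major score matrix.

    Moving one row down and one column left advances by n_cols - 1 elements.
    """
    return strided_load(scores, row * n_cols + col, n_cols - 1, count)

def reduce_max(vec):
    """Vector reduction: collapse a vector into its maximum element."""
    best = vec[0]
    for value in vec[1:]:
        best = max(best, value)
    return best

# 4 x 4 score matrix stored row by row; values chosen so that value == index.
scores = list(range(16))
diag = anti_diagonal(scores, n_cols=4, row=0, col=3, count=4)
print(diag)                          # [3, 6, 9, 12]
print(reduce_max(diag))              # 12
print(unaligned_load(scores, 5, 4))  # [5, 6, 7, 8]
```

Each of the three helpers corresponds to one requirement in the list: a load that starts at an arbitrary offset, a load with a fixed stride, and a reduction over a vector.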

In structure prediction, Monte Carlo or molecular dynamics simulations are common
• can profit from earlier ASIC/FPGA work

Docking
• profits from architectural features incorporated for structure prediction
• but also from matrix rotations, transposes, …

Multimedia accelerator

Vector-SIMD architecture

Architecture agnostic to physical vector length

Avoid packing/unpacking, reorganization overhead

• unpacking while loading
• packing while storing

• flexible access to register file
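"Unpacking while loading" and "packing while storing" can be illustrated with 8-bit pixels that must be widened before arithmetic and saturated back to 8 bits afterwards. The sketch below models the idea in plain Python; the function names are hypothetical, and a real vector-SIMD unit would fold the conversions into its load and store instructions.

```python
def load_unpack(mem, offset, count):
    """Load `count` unsigned 8-bit pixels and widen them in the same step."""
    return [int(b) for b in mem[offset:offset + count]]

def pack_store(mem, offset, values):
    """Saturate each value to 0..255 and store it as one byte."""
    for i, value in enumerate(values):
        mem[offset + i] = min(255, max(0, value))

def brighten(mem, offset, count, delta):
    """Add `delta` to each pixel without separate unpack/pack passes."""
    pixels = load_unpack(mem, offset, count)           # unpack on load
    pack_store(mem, offset, [p + delta for p in pixels])  # pack on store

frame = bytearray([10, 200, 250])
brighten(frame, 0, 3, 100)
print(list(frame))   # [110, 255, 255]
```

Without this fusion, a conventional SIMD code path would need explicit unpack instructions before the add and explicit pack instructions after it, which is the reorganization overhead the slide wants to avoid.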

Use more dimensions

Micro-architectural considerations

Simple/complex GPP mixture

Scalable cache coherence

Support for (existing) sequential, single-threaded applications
• Thread-level speculation
• Kilo-instruction processors

I/O and Communication Subsystem

Overheads of system call, context switch, interrupt, and network protocol are no longer justified

With fewer threads than processing cores
• no reason for switching execution context
• OS must not run on same processor as user applications
• requires extra-low communication latency

Interconnection Network

LANs/SANs are so fast that switching and routing have to be provided in hardware
• but reliability and congestion control, left to the end-nodes, still need to be addressed

Power considerations also apply

This applies to multi-chip interconnection networks, but NoCs have to solve similar problems
• in a much more constrained environment

TRANSACTIONAL MEMORY

The most difficult task when developing multithreaded applications is making sure that the program works (e.g. deadlocks may occur when combining correct code fragments)

Transactional memory is a concurrency control mechanism for controlling access to shared memory

A transaction is a piece of code that executes a series of reads and writes to shared memory, which logically occur at a single instant in time, and are typically implemented in a lock-free way

Transactional memory is optimistic: every thread completes its modifications to shared memory without regard for what other threads might be doing, recording every read and write that it makes in a log, which are validated in the commit stage

Implementing part of the system memory as transactional memory could be the solution for storing shared data in parallel applications while simplifying programming
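The optimistic scheme described above (log every read and write, validate at commit, retry on conflict) can be sketched as a tiny software transactional memory. This is a minimal teaching sketch under stated assumptions: `TVar`, `Transaction` and `atomically` are hypothetical names, and the commit stage is serialised with one short global lock for simplicity, so it is not the lock-free implementation the slide mentions.

```python
import threading

class TVar:
    """Shared memory cell with a version counter (hypothetical name)."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()   # assumption: guards only the commit stage

class Transaction:
    def __init__(self):
        self.read_log = {}    # TVar -> version observed at first read
        self.write_log = {}   # TVar -> tentative new value

    def read(self, tvar):
        if tvar in self.write_log:             # read our own pending write
            return self.write_log[tvar]
        self.read_log.setdefault(tvar, tvar.version)
        return tvar.value

    def write(self, tvar, value):
        self.write_log[tvar] = value           # buffered until commit

    def commit(self):
        with _commit_lock:
            # Validation: every cell we read must still have the version we saw.
            for tvar, seen in self.read_log.items():
                if tvar.version != seen:
                    return False
            for tvar, value in self.write_log.items():
                tvar.value = value
                tvar.version += 1
            return True

def atomically(body):
    """Run body(tx) optimistically; retry until its commit validates."""
    while True:
        tx = Transaction()
        result = body(tx)
        if tx.commit():
            return result

def run_counter_demo(n_threads, n_increments):
    """Several threads increment one shared counter transactionally."""
    counter = TVar(0)

    def increment(tx):
        tx.write(counter, tx.read(counter) + 1)

    def worker():
        for _ in range(n_increments):
            atomically(increment)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

print(run_counter_demo(4, 500))   # 2000: no increment is lost
```

The body of a transaction never takes a lock itself; a conflicting update by another thread is detected only at commit, where the stale transaction is discarded and re-executed.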

Reflection…

PROBLEM: THINKING IN PARALLEL IS HARD !

• Perhaps: THINKING is hard ! (YALE PATT - Sep.2007)
