Thread Level Parallelism
(TLP)
Calcolatori Elettronici 2
http://www.dii.unisi.it/~giorgi/didattica/calel2
Estimated Industry Trends
Moore's Law allows for the rapid increase in transistors per core. TLP optimised cores will
start out much simpler, and may grow complex more slowly.
The trend is for chips and CPU cores to get smaller, though TLP
optimised ones will start much smaller.
Growth rates in maximum power for "fat" CPUs have levelled off a bit. For "thin" cores, the number of CPU cores per chip will probably increase rather than the power consumption per core.
Roberto Giorgi, Universita’ di Siena, C208L15, Slide 3
"Fat" cores need lots of cache to reduce memory latency. TLP optimised designs are less latency sensitive, so less cache is needed.
Better process technology helps both types to increase, though the simpler, slower clocked "thin"
cores will be slower on more traditional benchmarks.
"Fat" cores will benefit from TLP techniques and general improvements, but
not as much as "thin" cores.
Current 4-way SMP
• An illustration of a 4-way system today. The only TLP
Toward NIAGARA chips
• An illustration of a system with a heavily optimised TLP
design
Niagara: A Torrent of Threads
First Niagara Chips:
November 2005 UltraSPARC T1
• Niagara systems deliver 14 times the performance of an UltraSPARC IIIi system
• Systems with the single-chip Niagara 2: 35 times
• Systems with Victoria Falls: 65 times
Global Embedded Systems Revenue (by Region)
[Figure: bar chart of global embedded systems revenue ($ billions, 2004 and 2009) and AAGR (%) by region: Americas, Europe, Japan, Asia-Pacific]
AAGR: average annual growth rate
Source: “Future of Embedded Systems Technology”, BCC Co, Inc., 2005
Global Embedded Systems Revenue (by Application)
[Figure: bar chart of world embedded systems revenue ($ billions, 2004 and 2009) and AAGR (%) by application: Telecomm, Consumer, Automotive, Medical/Office, Industrial/Milit.]
Source: “Future of Embedded Systems Technology”, BCC Co, Inc., 2005
Global Embedded HW Revenue
[Figure: bar chart of global embedded hardware revenue ($ billions, 2004 and 2009) and AAGR (%) by category: MPU, MCU, DSP, Memory, ASIC/PLD, Analog]
MPU: microprocessors; MCU: microcontrollers
Source: “Future of Embedded Systems Technology”, BCC Co, Inc., 2005
Projected Technology Progress
Source: “Process Integration, Devices and Structures ”, ITRS, 2005
[Figure: MPU transistor density (including SRAM), in Mtransistors/cm2, by year, 2006 to 2012]
Embedded Platforms Roadmap
[Figure: use of embedded processors in FPGAs, 2005 vs 2006, as stacked percentages: hard FPGA processor, soft FPGA processor, no FPGA processor; the share with no FPGA processor falls from 68% to 57%]
Source: “Survey of System Design Trends”, Celoxica Inc., August 2005
Hardwired Logic (ASIC-like) is being replaced by embedded processor devices
Embedded Processors: Innovation driven by Technology + Architecture Advances
Multi-processing: Higher throughput
With less speed
Case Study – ITRS Mobile Handheld Roadmap
Year of Production               2006   2009   2012   2015
Process Technology (nm)            90     65     45     32
Supply Voltage (V)                  1    0.8    0.6    0.5
Clock Frequency (MHz)             450    600    900   1200
Processing Performance (GOPS)       2     14     77    461
Average Power (W)                 0.1    0.1    0.1    0.1
Standby Power (mW)                  2      2      2      2
Applications: Real Time Video Codec, TV Telephone
Performance and energy efficiency (GOPS/W) increase by 200x
Source: “System Drivers”, ITRS, 2003
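The efficiency claim can be checked directly from the roadmap figures. The short sketch below divides processing performance by average power for each year of the table; the variable names are illustrative and the only inputs are the table values themselves.

```python
# Values copied from the ITRS mobile handheld roadmap table.
years = [2006, 2009, 2012, 2015]
gops = [2, 14, 77, 461]               # processing performance (GOPS)
avg_power_w = [0.1, 0.1, 0.1, 0.1]    # average power (W)


def efficiency(gops_value, power_w):
    """Energy efficiency in GOPS per watt."""
    return gops_value / power_w


eff = [efficiency(g, p) for g, p in zip(gops, avg_power_w)]
improvement = eff[-1] / eff[0]        # 2015 relative to 2006

for year, e in zip(years, eff):
    print(f"{year}: {e:.0f} GOPS/W")
print(f"2006 -> 2015 improvement: {improvement:.1f}x")
```

Because average power stays flat at 0.1 W, the efficiency gain equals the performance gain, which is slightly above the "200x" quoted on the slide.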
ITRS – Low-Power SoC
Source: “System Drivers”, ITRS, 2005
• Many Processing Elements
• Reusability and multi-standard requirements drive the move to programmable (processor-based) solutions (PEs)
ITRS – Low-Power SoC – Processing/Performance Trends
Source: “System Drivers”, ITRS, 2005
More than 100 Processing Elements in 2011!
Future Embedded System Design Trends
• Mobile handset market is the driving commercial factor
• New applications and wireless transmission standards require high-performance embedded computing at low power
• ITRS foresees 3x magnitude improvement in performance and energy efficiency over the next 10 years
• (Heterogeneous) Multi-Processor System-on-Chip Platforms
Compiler Technologies for high-performance, low-power embedded computing will be needed
Compiler and System-Design Tools for heterogeneous, massively parallel processing systems and networks
Network of Excellence
HiPEAC
High-Performance Embedded Architectures
and Compilers
IST – 004408
What We Have Now
• ACACES
– Extranet (Program, practical info, ...)
– Participant management
• HiPEAC Conference
– Extranet (Committees, Call for papers, practical info, ...)
– Paper submission (Commence)
SARC: Scalable ARChitectures
• WEB Site: http://www.sarc-ip.org
Paradigm shift
• Tiled architecture, built from fixed size nodes
• The architecture scales up by adding nodes
• NOT by growing the node size
• The node becomes the processor
• The processors become the functional units
Programming model features
• Programming model will have tagged procedure calls
• Define local and global (shared) variables
- Defines address range(s) to copy to local store
- Automatic programming of DMA transfers
- Defines address range(s) to watch for interference
• Set procedure properties
- Has secondary effects (modifies global state)
- Reads global space
- Writes global space
- Requires atomicity
- Regarding local variables
- Regarding global variables
• Processor functionality requirements
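As a rough illustration of what a tagged procedure call could look like, the sketch below attaches the properties listed above to an ordinary function and checks a write against the address ranges a task watches. This is not the SARC programming model itself: `TaskTag`, `tagged`, `interferes`, and the address values are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskTag:
    """Hypothetical tag carried by a procedure call."""
    copy_in: list = field(default_factory=list)  # (start, end) ranges copied to local store
    watch: list = field(default_factory=list)    # (start, end) ranges watched for interference
    reads_global: bool = False
    writes_global: bool = False
    atomic: bool = False


def tagged(**props):
    """Decorator that attaches a TaskTag to a procedure."""
    def wrap(fn):
        fn.tag = TaskTag(**props)
        return fn
    return wrap


def overlaps(a, b):
    """True if the half-open ranges a = [start, end) and b intersect."""
    return a[0] < b[1] and b[0] < a[1]


def interferes(write_range, task):
    """Would a write to write_range touch a range the task watches?"""
    return any(overlaps(write_range, w) for w in task.tag.watch)


@tagged(copy_in=[(0x1000, 0x1400)], watch=[(0x1000, 0x1400)], reads_global=True)
def filter_block(data):
    return [2 * x for x in data]


print(interferes((0x1200, 0x1210), filter_block))  # write inside the watched range
print(interferes((0x2000, 0x2010), filter_block))  # write outside the watched range
```

A runtime could use the `copy_in` ranges to program DMA transfers before the call and the `watch` ranges to detect conflicting writes while the task runs.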
Intra-node memory hierarchy
• Architecture must be easy to program for:
• Shared memory
• Accelerators may have:
• Local memory
- Private, non-coherent
• DMA controller
- Bridge between global memory and Local memory
• Accelerators must have:
• Global memory access
- Directly, or through cache hierarchy
• Single load/store instruction
• Address range differentiates Local memory from Global memory
[Figure: accelerators (ACC) with local memory and DMA, and accelerators with cache(s), attached to the local interconnect and the outer shared cache]
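The idea that a single load/store instruction reaches both memories, with the address range selecting between them, can be sketched as a small model. The window `LOCAL_BASE`/`LOCAL_SIZE`, the `Accelerator` class, and the dictionary standing in for global memory are assumptions made for the example, not part of the slides.

```python
LOCAL_BASE, LOCAL_SIZE = 0x0000, 0x100   # hypothetical local-memory address window


class Accelerator:
    def __init__(self, global_memory):
        self.local = [0] * LOCAL_SIZE        # private, non-coherent local memory
        self.global_memory = global_memory   # dict address -> value, stands in for caches/DRAM

    def is_local(self, addr):
        return LOCAL_BASE <= addr < LOCAL_BASE + LOCAL_SIZE

    def load(self, addr):
        """Single load operation: the address range selects the memory."""
        if self.is_local(addr):
            return self.local[addr - LOCAL_BASE]
        return self.global_memory.get(addr, 0)

    def store(self, addr, value):
        """Single store operation with the same address decoding."""
        if self.is_local(addr):
            self.local[addr - LOCAL_BASE] = value
        else:
            self.global_memory[addr] = value

    def dma_get(self, global_addr, local_addr, count):
        """DMA controller: copy count words from global to local memory."""
        for i in range(count):
            self.local[local_addr - LOCAL_BASE + i] = self.global_memory.get(global_addr + i, 0)


global_memory = {0x8000 + i: i * i for i in range(4)}
acc = Accelerator(global_memory)
acc.dma_get(0x8000, 0x10, 4)
print(acc.load(0x12), acc.load(0x8003))   # one local access, one global access
acc.store(0x12, 99)                        # changes only the private local copy
print(global_memory[0x8002])
```

The last two lines show the non-coherent behaviour: after the DMA copy, a store to the local copy is not reflected in global memory until something copies it back.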
Intra-node memory hierarchy (II)
• All caches inside a node must be coherent
• All outer caches (from each node) should also be coherent
• Caches work as shared distributed memory
• If threads do not share memory
- There’s no coherence traffic, nor overhead
- There’s no memory waste
• If threads share memory
- Turning off coherence results in wrong execution
• What is the benefit of turning off coherence? The hardware must be there anyway …
• Turn it off for power savings?
• Lower memory access latency in non-shared mode?
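Why turning off coherence breaks execution when threads share memory can be seen in a toy model. The sketch below uses two private write-back caches with no invalidation or update messages; `PrivateCache` is a deliberately unsafe illustration written for this example, not a model of any real protocol.

```python
class PrivateCache:
    """Private write-back cache with no coherence protocol (deliberately unsafe)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from shared memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value            # stays in this cache; no other cache is told


memory = {0: 0}
cache_a, cache_b = PrivateCache(memory), PrivateCache(memory)

cache_b.read(0)          # thread B caches the old value
cache_a.write(0, 42)     # thread A updates the shared variable
stale = cache_b.read(0)  # thread B still sees its own stale copy
print(stale, cache_a.read(0))
```

If the two threads never touched the same address, neither cache would ever hold a stale line, which is the case where a coherence protocol would generate no traffic anyway.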
Examples for intra-node memory
[Figure: example intra-node configurations combining accelerators (ACC) with local memory and DMA, accelerators with local memory only, and accelerators with cache(s), all attached to the local interconnect and the outer shared cache]
Determine the node size
• If node size is fixed, we must determine its size
• Split available area among
• Shared cache
• Local interconnect
• General purpose processor
• Accelerators
• Fixed or flexible distribution?
• Fixed GPP, cache, interconnect
• Reconfigurable accelerator area
• How many accelerators can a thread actually exploit?
• Streaming computation
[Figure: node layout with outer cache memory, local interconnect, and GPP]
Node examples
• Sea of simple cores
• Niagara
• Cell
• Few complex cores
• Power5
• Single vector/media/bio accelerator
• Multiple accelerators
[Figure: node layouts with outer cache memory, local interconnect, and GPP]
GPP – Accelerator interface
• For the processor to become the functional unit, task offloading must have minimum overhead
• Accelerator as ISA extension
• Shares PC, Fetch & Decode with a general purpose CPU
• Issue logic sends instructions to CPU or Accelerator Units
• Implements an extension of the base ISA
• Accelerator as a new CPU
• Has a separate PC, Fetch, Decode engine
• May implement a completely different ISA
- VLIW, SIMD, Stack, 16-bit
[Figure: two node layouts on the local interconnect and outer cache memory: a GPP and ACC sharing one Fetch & Dispatch unit, and a CPU and ACCs each with their own Fetch & Dispatch unit]
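For the "accelerator as ISA extension" option, the issue logic can be pictured as a router over one shared instruction stream. The sketch below is a minimal model under that assumption; the opcode names in `ACC_OPCODES` and the textual instruction format are hypothetical.

```python
ACC_OPCODES = {"vadd", "vmul"}   # hypothetical opcodes added by the accelerator extension


def issue(program):
    """Route each instruction of one shared stream to the CPU or the accelerator units."""
    cpu_queue, acc_queue = [], []
    for instr in program:
        opcode = instr.split()[0]
        if opcode in ACC_OPCODES:
            acc_queue.append(instr)
        else:
            cpu_queue.append(instr)
    return cpu_queue, acc_queue


program = [
    "add r1, r2, r3",
    "vadd v0, v1, v2",
    "ld r4, 0(r1)",
    "vmul v3, v0, v0",
]
cpu_queue, acc_queue = issue(program)
print(cpu_queue)
print(acc_queue)
```

In the "accelerator as a new CPU" option there would be no such router: each engine would fetch and decode its own stream with its own PC.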
Memory Hierarchy
• Set of coherent (processor-shared?) L1 caches inside the nodes
• C x Node
• Set of coherent node-shared L2 caches inside the chip (one from each node)
• 1 x Node, N x Chip
• Chip-shared L3 cache
• 1 x Chip
• Off-chip DRAM (or other memory technology)
[Figure: memory hierarchy with DRAM, I/O, L3, caches, and control]
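The multiplicities on this slide (C L1 caches per node, one L2 per node, N nodes per chip, one L3 per chip) translate into simple per-chip totals. The helper below is only an arithmetic sketch; the function name and the example values C = 4 and N = 8 are assumptions chosen for illustration.

```python
def cache_counts(l1_per_node, nodes_per_chip):
    """Number of caches of each level on one chip, for C L1 caches per node and N nodes."""
    return {
        "L1": l1_per_node * nodes_per_chip,  # C x Node, over N nodes
        "L2": nodes_per_chip,                # 1 x Node, N x Chip
        "L3": 1,                             # 1 x Chip
    }


counts = cache_counts(l1_per_node=4, nodes_per_chip=8)
print(counts)
```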
Motivation
• Hard to further scale uniprocessors
• Brought back focus to multiprocessors
• Different applications profit from different
techniques/types of parallelism
• ILP, TLP, DLP
• Motivates a customizable system with
• complex cores
• simple cores
Motivation (2)
•Parallelism type exhibited by application and suitable
architecture:
[Figure: map of parallelism types (ILP, TLP, DLP) to suitable architectures (labels: FCC, SSC, SMT, CMP+, vector), with "SARC?" combining complex cores, simple cores, and accelerators]
ISA considerations
• Complex cores and simple cores have the same ISA (allows threads to move from one to another [for real-time performance, power, …], and simplifies programming and compilation)
• ISA-agnostic
• approaches applicable to basically any ISA (ARM, PowerPC, …)
• Accelerator ISAs are extensions of the GPP ISA
• single instruction stream (co-processor instructions) or
• multiple instruction streams
How to realize customization?
• At design-time:
• The right mix of simple cores, complex cores, and accelerators is determined at design-time
• Pro: Highest performance for specific application domains
• Con: After fabrication, suited only to those specific application domains
• At run-time:
• There will be many processing cores on a chip; for temperature reasons some will have to be powered down anyhow
• Pro: Good performance and low power can be achieved on many applications
Levels of Abstraction
• Levels of abstraction:
1) Architecture
2) Microarchitecture
3) Implementation
4) Realization
• SARC WP1 focuses mainly on levels 1 and 2
SARC node architecture
Architectures of Domain Specific Accelerators
• SARC specifically targets (but is not limited to) these application domains
• scientific computing (supercomputing)
• bioinformatics
• multimedia
• internet and transaction processing
• These domains contain code pieces responsible for a large fraction of execution time
• Performance and power-efficiency can be improved significantly by employing domain-specific accelerators
Scientific Computing Vector Accelerator
Architecture
• For applications dominated by loops with vector operands
• What are the innovations:
• Matrix by Matrix operations (at least 2D)
• Dimensionality not encoded in the instructions (novel register file to support this)
• Sparse and Dense matrices considered identically
• Auto-indexing and auto-sectioning addressing mechanisms (link to WP2)
• (possible) on-chip distributed vector facility
• ISA, data formats, register file organization and memory addressing scheme under investigation
Scientific Computing Vector Accelerator
Architecture (cont)
• ISA (check the document)
• Operand types: Vectors, Matrices (Sparse and Dense), Bit
vectors and Scalars. (in “sparse” mode ½ of the available registers used as index vectors)
• Data formats: 64 bit FP; 8, 16, 32 and 64 bit INT and
BOOL
• Auto indexing for rectangular patterns (dense)
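One plausible reading of auto indexing for rectangular (dense) patterns is that the hardware generates the addresses of a sub-block of a row-major matrix from a few parameters, so the program does not compute them element by element. The sketch below illustrates that reading only; the function and its parameters are assumptions, since the slide's figure did not survive extraction.

```python
def rectangular_addresses(base, row_stride, rows, cols, elem_size=8):
    """Byte addresses of a rows x cols block inside a row-major matrix.

    base       -- byte address of the block's first element
    row_stride -- distance between consecutive matrix rows, in elements
    elem_size  -- element size in bytes (8 for 64-bit FP)
    """
    return [
        base + (r * row_stride + c) * elem_size
        for r in range(rows)
        for c in range(cols)
    ]


# A 2 x 3 block of 64-bit elements taken from a matrix that is 8 elements wide.
addresses = rectangular_addresses(base=0, row_stride=8, rows=2, cols=3)
print(addresses)
```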
Scientific Computing Vector Accelerator
Architecture (cont)
• Register file: The SARC vector register file is a parameterizable
register file, which can be logically reorganized by the programmer to support multiple register dimensions and sizes simultaneously.
• Scalar reg. file
shared with GPP
1) Vector registers can overlap (think about it)
2) Scalar registers can be used for conditional branches on the GPP side
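To make the overlap remark concrete, a logically reorganizable register file can be modelled as one flat storage array plus programmer-defined register descriptors. The class below is a simplified sketch under that assumption; `VectorRegisterFile`, `define`, and the register names are hypothetical and do not describe the actual SARC design.

```python
class VectorRegisterFile:
    """Flat storage with programmer-defined (base, rows, cols) register shapes."""

    def __init__(self, size):
        self.storage = [0] * size
        self.regs = {}   # name -> (base, rows, cols)

    def define(self, name, base, rows, cols):
        if base < 0 or base + rows * cols > len(self.storage):
            raise ValueError("register does not fit in the storage")
        self.regs[name] = (base, rows, cols)

    def _index(self, name, r, c):
        base, rows, cols = self.regs[name]
        if not (0 <= r < rows and 0 <= c < cols):
            raise IndexError("element outside the register shape")
        return base + r * cols + c

    def write(self, name, r, c, value):
        self.storage[self._index(name, r, c)] = value

    def read(self, name, r, c):
        return self.storage[self._index(name, r, c)]


rf = VectorRegisterFile(16)
rf.define("m0", base=0, rows=2, cols=4)   # 2 x 4 matrix register, storage cells 0..7
rf.define("v1", base=4, rows=1, cols=8)   # 1 x 8 vector register, storage cells 4..11
rf.write("m0", 1, 2, 7)                   # lands in storage cell 6
print(rf.read("v1", 0, 2))                # the same cell, seen through the overlapping register
```

Because `m0` and `v1` share cells 4 to 7, a value written through one register is visible through the other, which is what allows different dimensions and sizes to coexist.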
Bioinformatics Accelerator
• Will have a scalar and a vector-SIMD part
• (Multiple) sequence alignment algorithms require:
• support for efficient unaligned memory accesses
• strided memory accesses
• vector reduction operations, etc.
• In structure prediction, Monte Carlo or molecular dynamics simulations are common
• can profit from earlier ASIC/FPGA work
• Docking
• profits from architectural features incorporated for structure prediction
• but also from matrix rotations, transposes, …
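Two of the memory and arithmetic features named for sequence alignment, strided accesses from an unaligned start and vector reductions, can be sketched in scalar Python as follows. The helper names and the flat list standing in for memory are assumptions for the example; a vector-SIMD unit would perform each helper as one operation.

```python
def strided_load(memory, start, stride, count):
    """Gather count elements beginning at start, stepping stride elements each time."""
    return [memory[start + i * stride] for i in range(count)]


def reduce_max(vector):
    """Vector reduction: collapse a vector to its maximum element."""
    result = vector[0]
    for value in vector[1:]:
        if value > result:
            result = value
    return result


# A 5 x 4 score matrix stored row-major in a flat memory.
memory = list(range(20))
column = strided_load(memory, start=1, stride=4, count=5)  # column 1; start is not 4-aligned
print(column, reduce_max(column))
```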
Multimedia accelerator
• Vector-SIMD architecture
• Architecture agnostic to physical vector length
• Avoid packing/unpacking, reorganization overhead
• unpacking while loading
• packing while storing
• flexible access to register file
• Use more dimensions
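The point of unpacking while loading and packing while storing is that narrow data (for example 8-bit pixels) is widened as it enters the register file and saturated back to 8 bits as it leaves, so no separate pack/unpack instructions are needed. The sketch below imitates this with a `bytearray`; the function names are hypothetical.

```python
def load_unpack_u8(buffer, offset, count):
    """Load count unsigned 8-bit values, widening each to a full-width integer."""
    return list(buffer[offset:offset + count])


def store_pack_u8(buffer, offset, values):
    """Saturate each value to 0..255 and store it back as a single byte."""
    for i, value in enumerate(values):
        buffer[offset + i] = max(0, min(255, value))


pixels = bytearray([250, 10, 128, 255])
wide = load_unpack_u8(pixels, 0, 4)
brighter = [v + 10 for v in wide]     # 260, 20, 138, 265: two values no longer fit in 8 bits
store_pack_u8(pixels, 0, brighter)
print(list(pixels))
```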
Micro-architectural considerations
• Simple/complex GPP mixture
• Scalable cache coherence
• Support for (existing) sequential, single-threaded
applications
• Thread-level speculation
• Kilo-instruction processors
I/O and Communication Subsystem
• Overheads of system call, context switch, interrupt,
network protocol no longer justified
• With fewer threads than processing cores
• no reason for switching execution context
• OS must not run on same processor as user applications
• requires extra-low communication latency
Interconnection Network
• LANs/SANs are so fast that switching and routing have to
be provided in hardware
• but reliability and congestion control are left to end-nodes
• needs to be addressed
• Power considerations also
• Applies to multi-chip interconnection networks, but NoCs have to solve similar problems
• in a much more constrained environment
TRANSACTIONAL MEMORY
• The most difficult task when developing multithreaded applications is making sure that the program works (e.g. deadlocks may occur when combining correct code fragments)
• Transactional memory is a concurrency control mechanism for controlling access to shared memory
• A transaction is a piece of code that executes a series of reads and writes to shared memory, which logically occur at a single instant in time, and are typically implemented in a lock-free way
• Transactional memory is optimistic: every thread completes its modifications to shared memory without regard for what other threads might be doing, recording every read and write that it makes in a log; the logged accesses are validated in the commit stage
• Implementing part of the system memory as transactional memory could be the solution for storing shared data in parallel applications while simplifying programming
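The optimistic scheme described above (log reads and writes, validate at commit, retry on conflict) can be sketched in software. This is a minimal illustration and not a production design: `TVar`, `Transaction`, and `atomically` are names invented for the example, and the commit step is serialised with one global lock for simplicity, so the sketch is not lock-free.

```python
import threading


class TVar:
    """A shared variable with a version counter bumped on every committed write."""

    def __init__(self, value):
        self.value = value
        self.version = 0


_commit_lock = threading.Lock()   # serialises only the commit step (simplification)


class Transaction:
    def __init__(self):
        self.read_log = {}    # TVar -> version seen at first read
        self.write_log = {}   # TVar -> buffered new value

    def read(self, tvar):
        if tvar in self.write_log:
            return self.write_log[tvar]
        self.read_log.setdefault(tvar, tvar.version)   # record the version before the value
        return tvar.value

    def write(self, tvar, value):
        self.write_log[tvar] = value   # buffered; invisible to other threads until commit

    def commit(self):
        with _commit_lock:
            # Validate: every variable read must still be at the version we saw.
            if any(t.version != v for t, v in self.read_log.items()):
                return False
            for t, value in self.write_log.items():
                t.value = value
                t.version += 1
            return True


def atomically(fn):
    """Run fn(tx) repeatedly until its transaction commits."""
    while True:
        tx = Transaction()
        result = fn(tx)
        if tx.commit():
            return result


a, b = TVar(1000), TVar(0)


def transfer(tx):
    tx.write(a, tx.read(a) - 1)
    tx.write(b, tx.read(b) + 1)


def worker():
    for _ in range(100):
        atomically(transfer)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(a.value, b.value)
```

The `transfer` function contains no locks and no ordering rules; conflicts between threads are detected at commit time and handled by re-running the transaction.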
Reflection…
• PROBLEM: THINKING IN PARALLEL IS HARD !
• Perhaps: THINKING is hard ! (YALE PATT - Sep.2007)