TDT4260 – computer architecture http://research.idi.ntnu.no/multicore
TDT 4260 – lecture 11 – spring semester 2013
Lasse Natvig, The CARD group, Dept. of Computer & Information Science, NTNU
Lecture overview
• Interconnection network continued
– Routing
– Switch microarchitecture
• Dataflow computing
– Principles
– MDM in detail
• “… think differently”! (innovation)
• Research method
• Administrativia
– Reading list is now in its final version
– Next week: Mini project presentations, Room 454 in IT-Building
– Last lecture 30/4, exam Saturday 25/5 at 0900
ARM guest lectures Thursday morning
• Part of the course TDT4258 - Energieffektive datamaskinsystemer (Energy-Efficient Computer Systems)
– Thursday 18 April, at 08:15 in auditorium F6:
1) Low power HW design. Guest lecturer: Nir Leshem (Hardware Engineering Manager, ARM) (in English)
2) Driver development for Linux, driver architecture and debugging. Guest lecturers: Ørjan Eide (Senior Engineer, ARM) and Mikael Valen-Sendstad (Staff Software Architect, ARM) (in Norwegian)
F.5: Routing, Arbitration, Switching
• Routing
– Which of the possible paths are allowable for packets?
– Set of operations needed to compute a valid path
• Arbitration
– When are paths available for packets?
– Resolves packets requesting the same resources at the same time
– For every arbitration, there is a winner and possibly many losers
• Losers are buffered (lossless) or dropped on overflow (lossy)
• Switching
– How are paths allocated to packets?
Routing
• Shared Media
▫ Broadcast to everyone
• Switched Media needs real routing. Options:
▫ Source-based routing: message specifies path to the destination (changes of direction)
▫ Virtual Circuit: circuit established from source to destination, message picks the circuit to follow
▫ Destination-based routing: message specifies destination, switch must pick the path
– Deterministic: always follow same path
– Adaptive: pick different paths to avoid congestion, failures
– Randomized routing: pick between several good paths to balance network load
Store & Forward vs Cut-Through Routing
• Cut-through (on blocking)
▫ Virtual cut-through (spools rest of packet into buffer)
[Figure: timing diagrams comparing Store & Forward Routing with Cut-Through Routing; horizontal axis: Time]
Routing mechanism
• Need to select output port for each input packet
▫ And fast…
• Simple arithmetic in regular topologies
▫ Example: x, y routing in a grid with bi-directional links (first x then y), where (x, y) is the remaining offset to the destination
  west (-x): x < 0
  east (+x): x > 0
  south (-y): x = 0, y < 0
  north (+y): x = 0, y > 0
• Unidirectional links sufficient for torus (+x, +y)
• Dimension-order routing (DOR)
▫ Reduce the relative address one dimension at a time, in a fixed order, to avoid deadlock
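The x, y rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `xy_route` and `xy_path` and the port name "local" (packet has arrived) are assumptions, not part of the lecture material.

```python
def xy_route(dx, dy):
    """Pick the output port from the remaining offset (dx, dy) to the destination.

    The x offset is reduced to zero first, then the y offset (dimension order).
    """
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "local"  # hypothetical name for "deliver to this node"


def xy_path(src, dst):
    """List the ports taken hop by hop from src to dst in a 2D grid."""
    step = {"west": (-1, 0), "east": (1, 0), "south": (0, -1), "north": (0, 1)}
    x, y = src
    hops = []
    while True:
        port = xy_route(dst[0] - x, dst[1] - y)
        if port == "local":
            return hops
        hops.append(port)
        x += step[port][0]
        y += step[port][1]
```

For example, `xy_path((0, 0), (2, -1))` takes both east hops before the south hop, so a packet never turns from y back to x.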
Deadlock
• How can it arise?
▫ necessary conditions: shared resources, incrementally allocated, non-preemptible
• How do you handle it?
▫ constrain how channel resources are allocated (deadlock avoidance)
▫ Add a mechanism that detects likely deadlocks and fixes them
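One way to picture the detection option is to record which packet waits for a channel held by which other packet, and then look for a cycle. The sketch below is only an illustration under that assumption: the `waits_for` mapping and the function `find_cycle` are hypothetical and not a mechanism described in the lecture.

```python
def find_cycle(waits_for):
    """Return a list of packets forming a circular wait, or None.

    waits_for maps each packet to the packets that hold a channel it needs.
    A cycle means every packet in it waits for the next one: a deadlock.
    """
    state = {}  # packet -> "active" (on the current search path) or "done"

    def visit(node, path):
        state[node] = "active"
        path.append(node)
        for nxt in waits_for.get(node, ()):
            if state.get(nxt) == "active":
                return path[path.index(nxt):]  # found a circular wait
            if nxt not in state:
                cycle = visit(nxt, path)
                if cycle:
                    return cycle
        path.pop()
        state[node] = "done"
        return None

    for node in waits_for:
        if node not in state:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None
```

With four packets that each wait for the next, e.g. `{"red": ["green"], "green": ["blue"], "blue": ["black"], "black": ["red"]}`, the function returns all four packets. A recovery scheme would then remove or reroute one of them.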
Deadlock – example 1
Red: S1, d1
Green: S2, d2
Blue: S3, d3
Black: S4, d4
Deadlock example 2
• Deadlock can occur even with DOR if the links are unidirectional
– Can be solved by having two (virtual) channels
[Figure: 4 × 4 grid of nodes, TRC (0,0) to TRC (3,3)]
Arbitration (1/2)
• Several simultaneous requests to shared resource
• Ideal: Maximize usage of network resources
• Problem: Starvation
– Fairness needed
• Figure: Two phase arbitration.
– Request, Grant
– Poor usage
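A common way to get fairness is a round-robin arbiter, where the port that was granted last gets the lowest priority next time. The slide does not name a scheme, so this is a swapped-in illustration. The class `RoundRobinArbiter` and its method names are assumptions.

```python
class RoundRobinArbiter:
    """Grant one of n requesters per arbitration, rotating the priority."""

    def __init__(self, n):
        self.n = n
        self.last = n - 1  # start so that port 0 has the highest priority

    def grant(self, requests):
        """requests is a list of n booleans; return the winning index or None."""
        for offset in range(1, self.n + 1):
            port = (self.last + offset) % self.n
            if requests[port]:
                self.last = port  # the winner becomes lowest priority next time
                return port
        return None  # nobody requested
```

With ports 0, 2 and 3 requesting continuously, the grants rotate 0, 2, 3, 0, so no requester starves.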
Arbitration (2/2)
• Three phases
• Multiple requests
• Better usage
• But: Increased latency
Switching
• Allocating paths for packets
• Two techniques:
– Circuit switching (connection oriented)
• Communication channel
• Allocated before first packet
• Packet headers don’t need routing info
• Wastes bandwidth
– Packet switching (connectionless)
• Each packet handled independently
• Can’t guarantee response time
• Two types (next slide)
Store & Forward vs. Cut-Through Routing
• Cut-through (on blocking)
▫ Virtual cut-through (spools rest of packet into buffer)
▫ Wormhole (buffers only a few flits, leaves tail along route; only one flit in the figure)
[Figure: timing diagrams for Store & Forward Routing and Cut-Through Routing from Source to Dest, horizontal axis: Time; panels labelled Packet switching and Circuit switching]
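The difference between the two schemes can be made concrete with a simplified, contention-free latency model. This is an assumed textbook-style model, not taken from the slide: with store & forward the whole packet is received at every hop before it moves on, while with cut-through only the routing delay is paid per hop. The function and parameter names are hypothetical.

```python
def store_and_forward_latency(hops, packet_size, bandwidth, routing_delay):
    """Every hop waits for the complete packet, then makes its routing decision."""
    return hops * (packet_size / bandwidth + routing_delay)


def cut_through_latency(hops, packet_size, bandwidth, routing_delay):
    """The header is forwarded at once; the packet transfer time is paid only once."""
    return packet_size / bandwidth + hops * routing_delay
```

As an example, for 4 hops, a 100-unit packet, bandwidth 10 units per cycle and 2 cycles routing delay, the model gives 48 cycles for store & forward and 18 cycles for cut-through. The gap grows with both hop count and packet size.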
Pipelined switch
IDI Open, a challenge for you?
DATAFLOW COMPUTING AND MDM
Dataflow computing and computers
• Dataflow computing
– suitable for highly parallel solutions
– requires different HW and SW
• Dataflow computers
– Principles
– History
– Static vs. dynamic
– Typical architecture
• pipelined ring with circulating packets
• Manchester Dataflow Machine (MDM)
Dataflow programs
• Represent computation as a graph
• Node = operation = instruction
• Computation flows through the graph
• Inherently parallel, data driven, no program counter, asynchronous
• Logical processor at each node, activated by availability of operands, executed when a physical processor is available
[Figure: Dataflow graph with inputs b, c, e and the constant 1, intermediate values a and d, and result f]
a = (b + 1) × (b - c)
d = c × e
f = a × d
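The data-driven firing rule can be sketched with the graph from this slide. This is a minimal illustration, not how a real dataflow machine is built: the node names `t1` and `t2` for (b + 1) and (b - c), the name `one` for the constant, and the function `run` are assumptions.

```python
import operator

# Each node: (operation, names of its two operands).
GRAPH = {
    "t1": (operator.add, ("b", "one")),  # b + 1
    "t2": (operator.sub, ("b", "c")),    # b - c
    "a": (operator.mul, ("t1", "t2")),   # a = (b + 1) x (b - c)
    "d": (operator.mul, ("c", "e")),     # d = c x e
    "f": (operator.mul, ("a", "d")),     # f = a x d
}


def run(graph, inputs):
    """Fire every node whose operands are available; no program counter is used.

    Returns the value of f and the list of 'waves' of nodes that fired together.
    """
    values = dict(inputs)
    values["one"] = 1
    pending = dict(graph)
    waves = []
    while pending:
        ready = [name for name, (_, srcs) in pending.items()
                 if all(src in values for src in srcs)]
        if not ready:
            raise ValueError("some operands never arrive")
        # All nodes in 'ready' are independent and could execute in parallel.
        for name in ready:
            op, srcs = pending.pop(name)
            values[name] = op(*(values[src] for src in srcs))
        waves.append(sorted(ready))
    return values["f"], waves
```

With b = 4, c = 2, e = 3 the result is f = 60. The nodes t1, t2 and d fire in the first wave, then a, then f, which shows the implicit parallelism in the graph.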
Example — data flow
Control flow and data flow
• (Traditional) control-flow
– Explicit control flow (manipulation of program counter (PC))
– Data are communicated between instructions via shared memory locations
– Data is referenced via memory-address
– One single control thread
– Many parallel control threads:
• Explicit parallelism
• Data flow computers
– Data driven computation, that is, the selection of instructions for execution is controlled by the availability of operands
• Implicit parallelism
– Programs represented as directed graphs
– Results are sent directly as data-packets between instructions
– Has normally/originally no shared memory that more than one
Data flow computers, history
• Relatively old topic
• Many research projects
– Fundamentally different, hence interesting
• Link to functional languages gave renewed interest
• Some prototypes built, none with outstanding performance
• Status 1998
– Few research projects
– Data flow principle used many places
• In processors (Reservation stations, Tomasulo, TDT 4255)
• Chaining of DSP PEs for high performance
• ...
Dataflow computers related to other architectures (as of 1986)
Source: Arthur H. Veen, Dataflow machine architecture, ACM Computing Surveys, Volume 18, Issue 4, December 1986
Dataflow machines, architecture and implementation (as of 1986)
Static and dynamic data flow systems
Static systems
• Do not allow concurrent reactivation of code
– A given part of a data flow graph can only exist in one instance at a time
– At most one data packet exists on a line at a time
– Data packets are communicated directly from instruction to instruction
– Control packets are used as acknowledge signals from receiver to sender, so the sender knows when it can produce a new result
Dynamic systems
• Allow concurrent activation, e.g. the same code can be executed at the same time in different contexts
• What opportunities does this give for program execution?
– Loop unrolling, “unfolding” (iteration number)
– Simultaneous procedure calls
– Recursive procedures
Dynamic dataflow systems: implementation
• How can it be realized?
1) Tag operands with a context identifier
• tagged token
2) Copying of code
[Figure: two tokens, context-1: value = 10 and context-2: value = 33, arriving at the same + instruction]
• Needs the ability to have more than one value «on its way» between two instructions at the same time
– In this case there is not enough storage space in the receiving instruction to store several operands
– Needs a unit/component where one operand can wait for its «fellow operand»
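The waiting place for operands can be sketched as a small associative store keyed by destination instruction and tag. This is only an illustration of the tagged-token idea under simplifying assumptions (binary instructions, operands arriving on distinct "left" and "right" ports). The class `MatchingStore` and its method are hypothetical names.

```python
class MatchingStore:
    """Holds the first operand of a pair until its partner with the same tag arrives."""

    def __init__(self):
        self.waiting = {}  # (destination, tag) -> (port, value) of the first operand

    def receive(self, destination, tag, port, value):
        """Return (destination, tag, left, right) on a match, otherwise None."""
        key = (destination, tag)
        if key in self.waiting:
            other_port, other_value = self.waiting.pop(key)
            operands = {port: value, other_port: other_value}
            return destination, tag, operands["left"], operands["right"]
        self.waiting[key] = (port, value)  # no partner yet: wait here
        return None
```

Tokens for context-1 (value 10) and context-2 (value 33) can wait at the same + instruction at the same time, because the tag keeps the two activations apart.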
Manchester Dataflow Machine (MDM)
• Source: The Manchester Prototype Dataflow Computer, Gurd, Kirkham, Watson, CACM, Jan. 1985, pp. 34-52
• Data flow machine based on dynamic tagging of (small) data-packets (tokens)
• Approx. 1-2 MIPS in 1985
MDM: Data flow programs
• Three levels
– SISAL (Fig. 2)
– Assembler (Fig. 3)
» variables from SISAL
» operators from data flow instruction-set
– Machine code (Fig. 1)
MDM – machine code
• Graphical presentation
• Some special instructions: CGR, DUP, BRR, ADL
SISAL (Fig. 2)
Template Assembly Language (TASS), fig. 3
Execution sequence, Fig. 4, cont’d
Manchester Dataflow Machine
• Tagged data packets = tokens
• Tag:
- Iteration level (for loops)
- Activation name (for simultaneous procedure calls and recursion)
- Index (when same code operates on different parts of a data structure)
• Implementation: Fig. 7 and 8
[Figure: MDM block diagram with Switch (Input, Output), Token Queue, Matching Unit, Instruction Store and Processing Unit (P0...P19)]
MDM Matching Unit (1/2)
MDM Matching Unit (2/2)
Instruction Store & Processing Unit
MDM: System evaluation
• Test method: load the program, load the input tokens into the input queue, start the clock and release the tokens from the queue. Stop the clock when the first token is received at the host
• Execution time = f(#processors, program, input)
• Research goals; build knowledge about:
– Hardware utilization and bottlenecks
– Parallelism in software
– Data flow MIPS vs. "normal" MIPS
• Reduce number of "variables"
– Different artificial situations to avoid testing too much at the same time
» Micro benchmarks, e.g. program with Pby = 1.00
• Test classes
1) small programs -> do not use the overflow unit
Test programs (Table II)
Speedup
MDM: Problems
• Low efficiency when handling data structures
– special-HW
• Arbitration to/from functional units
– Easily becomes a bottleneck
• Experiments:
– processors starve when a large fraction of the tokens do not give a "match"
» larger buffer on the output port of the matching unit
• Needs better compilers
MDM: Retrospective (1992)
• Manchester Data-Flow: A Progress Report, Gurd & Snelling, ICS’92.
• MDM history: started in 1976, stopped in 1989
• Included:
– Structure Store Units
– “Throttle Unit”
• Large programs can generate so much parallel activity that the system drowns in tokens that cannot be processed for a long time (a kind of “thrashing”, cf. the OS concept)
• A unit in the ring that assigns unique “activation names” (a part of the “tag” field)
• Using information about the load in different parts of the system, the throttle unit can slow down the assignment of activation names so that the total load is reduced
Lecture plan
• Administrativia
– Reading list is now in its final version
– Note that all slides are part of the curriculum
• Next week: Mini project presentations, Room 454 in IT-Building
• Last lecture 30/4, exam Saturday 25/5 at 0900
– Wrap up
– Short presentation of projects/master theses offered by EECS/CARD/Lasse
– Short presentation of mini-course TDT1:
• TDT1 Energy Efficient Multicore Computing
– Repetition: send e-mail to Lasse before 24/4 to ask for special topics