TDT 4260 lecture 11 spring semester Interconnection network continued


TDT4260 – computer architecture http://research.idi.ntnu.no/multicore

TDT 4260 – lecture 11 – spring semester 2013

Lasse Natvig, The CARD group, Dept. of Computer & Information Science, NTNU



Lecture overview

• Interconnection network continued

– Routing

– Switch microarchitecture

• Dataflow computing

– Principles

– MDM in detail

• “… think differently”! → innovation

• Research method

• Administrativia

– Reading list is now in its final version

– Next week: Mini project presentations, Room 454 in IT-Building

– Last lecture 30/4, exam Saturday 25/5 at 0900


ARM guest lectures Thursday morning

• Part of the course TDT4258 Energieffektive datamaskinsystemer (Energy-Efficient Computer Systems)

– Thursday 18 April, at 08:15 in auditorium F6:

1) Low power HW design. Guest lecturer: Nir Leshem (Hardware Engineering Manager, ARM) (in English)

2) Driver development for Linux, driver architecture and debugging. Guest lecturers: Ørjan Eide (Senior Engineer, ARM) and Mikael Valen-Sendstad (Staff Software Architect, ARM) (in Norwegian)


F.5: Routing, Arbitration, Switching

• Routing

– Which of the possible paths are allowable for packets?

– Set of operations needed to compute a valid path

• Arbitration

– When are paths available for packets?

– Resolves conflicts between packets requesting the same resources at the same time

– For every arbitration, there is a winner and possibly many losers

• Losers are buffered (lossless) or dropped on overflow (lossy)

• Switching

– How are paths allocated to packets?


Routing

• Shared Media

▫ Broadcast to everyone

• Switched Media needs real routing. Options:

▫ Source-based routing: message specifies path to the destination (changes of direction)

▫ Virtual Circuit: circuit established from source to destination, message picks the circuit to follow

▫ Destination-based routing: message specifies destination, switch must pick the path

- Deterministic: always follow same path

- Adaptive: pick different paths to avoid congestion, failures

- Randomized routing: pick between several good paths to balance network load


Store & Forward vs Cut-Through Routing

• Cut-through (on blocking)

▫ Virtual cut-through (spools rest of packet into buffer)

[Figure: time-space diagrams of a packet with flits 3 2 1 0 under Store & Forward Routing and Cut-Through Routing; axis: Time]


Routing mechanism

• Need to select output port for each input packet

▫ And fast…

• Simple arithmetic in regular topologies

▫ Example: x, y routing in a grid with bi-directional links (first x then y)

 west (-x) x < 0  east (+x) x > 0

 south (-y) x = 0, y < 0  north (+y) x = 0, y > 0

• Unidirectional links sufficient for torus (+x, +y)

• Dimension-order routing (DOR)

▫ Reduce relative address of each dimension in order to avoid deadlock
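The x-then-y rule above can be sketched in a few lines of Python. This is only an illustration, assuming nodes are addressed as (x, y) tuples on a 2D mesh with +y meaning north; the function names `xy_route` and `xy_path` and the port name "local" are made up for this example.

```python
# Offsets applied to the current position when leaving through each port.
STEP = {"west": (-1, 0), "east": (1, 0), "south": (0, -1), "north": (0, 1)}


def xy_route(cur, dst):
    """Pick the output port for a packet at `cur` heading to `dst`.

    The x offset is reduced to zero first, then the y offset
    (dimension-order routing on a mesh with bi-directional links).
    """
    dx = dst[0] - cur[0]
    dy = dst[1] - cur[1]
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "local"  # both offsets are zero: deliver to the attached node


def xy_path(src, dst):
    """List every node visited from `src` to `dst` under x-then-y routing."""
    path = [src]
    cur = src
    while (port := xy_route(cur, dst)) != "local":
        step_x, step_y = STEP[port]
        cur = (cur[0] + step_x, cur[1] + step_y)
        path.append(cur)
    return path
```

For example, `xy_path((0, 0), (2, 1))` visits (0, 0), (1, 0), (2, 0) and then (2, 1): all x hops come before the first y hop.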


Deadlock

• How can it arise?

▫ necessary conditions:

- shared resources

- incrementally allocated

- non-preemptible

• How do you handle it?

▫ constrain how channel resources are allocated (deadlock avoidance)

▫ Add a mechanism that detects likely deadlocks and fixes them
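As a rough illustration of the detection option, a deadlock shows up as a cycle in a "waits-for" relation between packets. The sketch below assumes a simplified model in which each blocked packet waits for exactly one other packet that holds the channel it needs; the function name `find_deadlock` and the packet names are hypothetical.

```python
def find_deadlock(waits_for):
    """Return the packets forming a waits-for cycle, or None if there is none.

    `waits_for` maps each packet to the packet that holds the resource it
    needs next, or to None when the packet is not blocked.
    Simplifying assumption: a packet waits for at most one other packet.
    """
    for start in waits_for:
        seen = []
        cur = start
        # Follow the chain of blocked packets until it ends or repeats.
        while cur is not None and cur not in seen:
            seen.append(cur)
            cur = waits_for.get(cur)
        if cur is not None:
            # `cur` was visited before, so the chain closed on itself.
            return seen[seen.index(cur):]
    return None
```

With `{"red": "green", "green": "blue", "blue": "black", "black": "red"}` the function returns all four packets, while a chain that ends in an unblocked packet yields `None`.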


Deadlock – example 1

[Figure: four packets on their routes. Red: s1 → d1, Green: s2 → d2, Blue: s3 → d3, Black: s4 → d4]


Deadlock example 2

• Deadlock can occur even with DOR if the links are uni-directional

– Can be solved by having two (virtual) channels

[Figure: 4 x 4 array of nodes labelled TRC (0,0) to TRC (3,3)]
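One way to see why two virtual channels help is to look at the channel dependency graph of a single uni-directional ring (one dimension of a torus). The sketch below is an illustration under stated assumptions, not the scheme from the figure: it uses the common "dateline" idea where a packet starts on virtual channel 0 and moves to virtual channel 1 after crossing the wrap-around link. The function names are invented for this example.

```python
from itertools import product


def channel_dependencies(n, use_virtual_channels):
    """Channel dependency graph of a uni-directional ring with n nodes.

    A channel is (node, vc): the link node -> node+1 on virtual channel vc.
    An edge (a, b) means some packet holds channel a while requesting b.
    """
    deps = set()
    for src, dst in product(range(n), repeat=2):
        if src == dst:
            continue
        node, vc, prev = src, 0, None
        while node != dst:
            channel = (node, vc)
            if prev is not None:
                deps.add((prev, channel))
            prev = channel
            if use_virtual_channels and node == n - 1:
                vc = 1  # wrap-around link crossed: continue on channel 1
            node = (node + 1) % n
    return deps


def has_cycle(deps):
    """True if the directed graph given as a set of edges contains a cycle."""
    graph = {}
    for a, b in deps:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    indegree = {v: 0 for v in graph}
    for a in graph:
        for b in graph[a]:
            indegree[b] += 1
    # Repeatedly remove vertices without incoming edges (topological sort).
    ready = [v for v, d in indegree.items() if d == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for w in graph[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return removed != len(graph)  # leftover vertices lie on a cycle
```

For a 4-node ring, `has_cycle(channel_dependencies(4, False))` is `True` (the channels depend on each other all the way around the ring), while `has_cycle(channel_dependencies(4, True))` is `False`, because the dependency chain now ends on virtual channel 1 instead of closing on itself.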


Arbitration (1/2)

• Several simultaneous requests to shared resource

• Ideal: Maximize usage of network resources

• Problem: Starvation

– Fairness needed

• Figure: Two phase arbitration.

– Request, Grant

– Poor usage
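The fairness requirement can be illustrated with a round-robin arbiter, one common policy that avoids starvation. This is a minimal sketch and not the arbiter from the figure; the class name `RoundRobinArbiter` and its interface are assumptions made for the example.

```python
class RoundRobinArbiter:
    """Grants one of several requesting ports per round, rotating priority."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.next_priority = 0  # port that is checked first in the next round

    def grant(self, requests):
        """Return the winning port from the set `requests`, or None if empty."""
        for offset in range(self.num_ports):
            port = (self.next_priority + offset) % self.num_ports
            if port in requests:
                # The winner gets the lowest priority in the next round,
                # so no requesting port can be starved indefinitely.
                self.next_priority = (port + 1) % self.num_ports
                return port
        return None
```

If ports 0 and 2 keep requesting, successive calls to `grant({0, 2})` alternate between 0 and 2 instead of always favouring the lower-numbered port.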


Arbitration (2/2)

• Three phases

• Multiple requests

• Better usage

• But: Increased latency


Switching

• Allocating paths for packets

• Two techniques:

– Circuit switching (connection oriented)

• Communication channel

• Allocated before first packet

• Packet headers don’t need routing info

• Wastes bandwidth

– Packet switching (connectionless)

• Each packet handled independently

• Can’t guarantee response time

• Two types – next slide


Store & Forward vs. Cut-Through Routing

• Cut-through (on blocking)

▫ Virtual cut-through (spools rest of packet into buffer)

▫ Wormhole (buffers only a few flits, leaves the tail along the route; only one flit in the figure)

[Figure: time-space diagrams of a packet with flits 3 2 1 0 travelling from Source to Dest under Store & Forward Routing and Cut-Through Routing; axis: Time]
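The difference between the two schemes can be made concrete with the usual first-order latency model for an unloaded network. The sketch below assumes a fixed per-switch routing delay and ignores contention; the function and parameter names are chosen for this example only.

```python
def store_and_forward_latency(hops, packet_bits, bandwidth, routing_delay):
    """Each switch receives the whole packet before forwarding it."""
    return hops * (packet_bits / bandwidth + routing_delay)


def cut_through_latency(hops, packet_bits, bandwidth, routing_delay):
    """The header is forwarded at once; the packet is pipelined over the hops."""
    return packet_bits / bandwidth + hops * routing_delay
```

With 3 hops, a 128-bit packet, 16 bits per cycle and a routing delay of 2 cycles, store-and-forward takes 3 × (8 + 2) = 30 cycles, while cut-through takes 8 + 3 × 2 = 14 cycles. The gap grows with both packet length and hop count.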

Packet switching

Circuit switching


Pipelined switch


IDI Open, a challenge for you?


DATAFLOW COMPUTING AND MDM


Dataflow computing and computers

• Dataflow computing

– suitable for highly parallel solutions

– requires different HW and SW

• Dataflow computers

– Principles

– History

– Static vs. dynamic

– Typical architecture

• pipelined ring with circulating packets

• Manchester Dataflow Machine (MDM)


Dataflow programs

• Represent computation as a graph

• Node = operation = instruction

• Computation flows through the graph

• Inherently parallel, data driven, no program counter, asynchronous

• Logical processor at each node, activated by availability of operands, executed when a physical processor is available

[Figure: Dataflow graph with inputs b, c, e and the constant 1]

a = (b + 1) × (b − c)
d = c × e
f = a × d
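The data-driven firing rule can be tried out on the graph above with a tiny interpreter. This is a sketch only: the node names `t1` and `t2` for the two intermediate results, the input name `one` for the constant 1, and the function `run_dataflow` are invented for the example.

```python
import operator

# Each node: name -> (operation, names of its input operands).
GRAPH = {
    "t1": (operator.add, ["b", "one"]),  # b + 1
    "t2": (operator.sub, ["b", "c"]),    # b - c
    "a": (operator.mul, ["t1", "t2"]),   # a = (b + 1) x (b - c)
    "d": (operator.mul, ["c", "e"]),     # d = c x e
    "f": (operator.mul, ["a", "d"]),     # f = a x d
}


def run_dataflow(graph, inputs):
    """Fire nodes as soon as all their operands are available.

    Returns the computed values and the list of firing steps; the nodes
    within one step are independent and could execute in parallel.
    """
    values = dict(inputs)
    pending = set(graph)
    firing_order = []
    while pending:
        ready = sorted(
            n for n in pending if all(i in values for i in graph[n][1])
        )
        if not ready:
            raise ValueError("graph cannot make progress: missing operands")
        for name in ready:
            op, operand_names = graph[name]
            values[name] = op(*(values[i] for i in operand_names))
        pending -= set(ready)
        firing_order.append(ready)
    return values, firing_order
```

Running it with b = 4, c = 2, e = 3 and one = 1 gives f = 60 in three steps: first `d`, `t1` and `t2` fire together, then `a`, then `f`. No program counter decides the order; only operand availability does.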


Example — data flow


Control flow and data flow

• (Traditional) control-flow

– Explicit control flow (manipulation of program counter (PC))

– Data are communicated between instructions via shared memory locations

– Data is referenced via memory-address

– One single control thread

– Many parallel control threads:

• Explicit parallelism

• Data flow computers

– Data driven computation, that is, the selection of instructions for execution is controlled by the availability of operands

• Implicit parallelism

– Programs represented as directed graphs

– Results are sent directly as data-packets between instructions

– Has normally/originally no shared memory that more than one instruction can refer to


Data flow computers, history

• Relatively old topic

• Many research projects

– Fundamentally different → interesting

• Link to functional languages gave renewed interest

• Some prototypes built, none with outstanding performance

• Status 1998

– Few research projects

– Data flow principle used many places

• In processors (Reservation stations, Tomasulo, TDT 4255)

• Chaining of DSP PEs for high performance

• ...


Dataflow computers related to other architectures (anno 1986)

Source: Arthur H. Veen, “Dataflow machine architecture”, ACM Computing Surveys, Volume 18, Issue 4, December 1986


Dataflow machines, architecture and implementation (anno 1986)


Static and dynamic data flow systems

Static systems do not allow concurrent reactivation of code

– A given part of a data flow graph can exist in only one instance at a time

– At most one data packet exists on a line (arc) at a time

– Data packets are communicated directly from instruction to instruction

– Control packets are used as acknowledge signals from receiver to sender, so that the sender knows when it can produce a new result

Dynamic systems

• Allow concurrent activation, i.e. the same code can be executed at the same time in different contexts

• What opportunities does this give for program execution?

– Loop unrolling, “unfolding” (iteration number)

– Simultaneous procedure calls

– Recursive procedures


Dynamic dataflow systems - implementation

How can it be realized?

1) Tag operands with a context identifier

• tagged token

2) Copying of code

[Figure: a “+” node receiving operands from two contexts; context-1: value = 10, context-2: value = 33]

• Needs the ability to have more than one value «on its way» between two instructions at the same time

– In this case there is not enough storage space in the receiving instruction to store several operands

– Needs a unit/component where one operand can wait for its «fellow operand»
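The waiting place for operands can be sketched as a small table keyed by destination instruction and context tag. This is an illustrative model only, assuming two-input instructions and ignoring which operand is the left or right one; the class name `MatchingUnit` and its `receive` method are made up for the example.

```python
class MatchingUnit:
    """Pairs operands that target the same instruction in the same context."""

    def __init__(self):
        # (instruction, tag) -> the operand that arrived first and is waiting
        self.waiting = {}

    def receive(self, instruction, tag, value):
        """Accept one tagged token.

        Returns (instruction, tag, first_value, second_value) when the
        partner operand is already waiting, otherwise stores the token
        and returns None.
        """
        key = (instruction, tag)
        if key in self.waiting:
            partner = self.waiting.pop(key)
            return (instruction, tag, partner, value)
        self.waiting[key] = value
        return None
```

Using the values from the slide: after `receive("add", "context-1", 10)` and `receive("add", "context-2", 33)` both tokens wait, since their tags differ. A later `receive("add", "context-1", 5)` matches only the first one and returns `("add", "context-1", 10, 5)`, while the value 33 keeps waiting for its own partner.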


Manchester Dataflow Machine (MDM)

• Source: Gurd, Kirkham, Watson, “The Manchester Prototype Dataflow Computer”, CACM, January 1985, pp. 34-52

• Data flow machine based on dynamic tagging of (small) data packets (tokens)

• Approx. 1-2 MIPS in 1985


MDM: Data flow programs

• Three levels

– SISAL (Fig. 2)

– Assembler (Fig. 3)

» variables from SISAL

» operators from data flow instruction-set

– Machine code (Fig. 1)


MDM – machine code

• Graphical presentation

• Some special instructions: CGR, DUP, BRR, ADL


SISAL (fig.2)


Template Assembly Language (TASS), fig. 3


Execution sequence, fig. 4, cont’d


Manchester Dataflow Machine

• Tagged data packets = tokens

• Tag:

- Iteration level (for loops)

- Activation name (for simultaneous procedure calls and recursion)

- Index (when same code operates on different parts of a data structure)

• Implementation

Fig. 7 and 8

[Figure: MDM ring structure with Switch, Token Queue, Matching Unit, Instruction Store and Processing Unit (P0...P19), plus Input and Output]
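The three tag fields listed above can be modelled as a small immutable record. The sketch below is an assumption-laden illustration: field types, the `destination` field and the helper `next_iteration` are invented here and do not reflect the actual MDM token layout.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Tag:
    iteration_level: int  # distinguishes loop iterations
    activation_name: int  # distinguishes procedure activations
    index: int            # distinguishes elements of a data structure


@dataclass(frozen=True)
class Token:
    value: float
    tag: Tag
    destination: str  # hypothetical: name of the target instruction


def next_iteration(token, new_value):
    """Token for the following loop iteration.

    Activation name and index are kept; only the iteration level grows,
    so tokens from different iterations can never match each other.
    """
    new_tag = replace(token.tag, iteration_level=token.tag.iteration_level + 1)
    return Token(new_value, new_tag, token.destination)
```

Because the records are frozen, a `(destination, tag)` pair can be used directly as a dictionary key when looking for a token's partner operand.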


MDM Matching Unit (1/2)


MDM Matching Unit (2/2)


Instruction Store & Processing Unit


MDM: System evaluation

• Test method: load the program, load the input token into the input queue, start the clock and release the token from the queue. Stop the clock when the first token is received at the host

• Execution time = f(#processors, program, input)

• Research goals; build knowledge about:

– Hardware utilization and bottlenecks

– Parallelism in software

– Data flow MIPS vs. “normal” MIPS

• Reduce number of "variables"

– Different artificial situations to avoid testing too much at the same time

» Micro benchmarks, e.g. program with Pby = 1.00

• Test classes

1) small programs -> do not use the overflow unit


Test programs (Table II)


Speedup


MDM: Problems

• Low efficiency when handling data structures

• Special HW

• Arbitration to/from functional units

• Easily becomes a bottleneck

• Experiments:

– Processors starve when a large fraction of the tokens do not give a "match"

» larger buffer on the output port of the matching unit

– Needs better compilers


MDM: Retrospective (1992)

• Gurd & Snelling, “Manchester Data-Flow: A Progress Report”, ICS’92.

• MDM history: started in 1976, stopped in 1989

• Included:

– Structure Store Units

– “Throttle Unit”

• Large programs can generate so much parallel activity that the system drowns in tokens that cannot be processed for a long time (a kind of “thrashing”, an OS concept)

• A unit in the ring that assigns unique “activation names” (a part of the “tag” field)

• Using information about the load on different parts of the system, the throttle unit can slow down the assignment of activation names so that the total load is reduced
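The throttling idea can be sketched as a unit that hands out activation names only while the measured load is below a limit. This is a loose illustration, not the mechanism from the progress report: the single threshold on tokens in flight, the class name `ThrottleUnit` and its methods are all assumptions.

```python
class ThrottleUnit:
    """Assigns unique activation names, deferring requests under high load."""

    def __init__(self, max_tokens_in_flight):
        self.max_tokens_in_flight = max_tokens_in_flight  # assumed load limit
        self.next_name = 0
        self.deferred = []  # activation requests waiting for the load to drop

    def request_activation(self, request, tokens_in_flight):
        """Return a new activation name, or None if the request is deferred."""
        if tokens_in_flight >= self.max_tokens_in_flight:
            self.deferred.append(request)
            return None
        return self._assign()

    def release_deferred(self, tokens_in_flight):
        """Grant the oldest deferred request once the load allows it.

        Returns (request, activation_name), or None if nothing was granted.
        """
        if self.deferred and tokens_in_flight < self.max_tokens_in_flight:
            return self.deferred.pop(0), self._assign()
        return None

    def _assign(self):
        name = self.next_name
        self.next_name += 1
        return name
```

Delaying a new activation delays all the parallel work it would spawn, which is how a unit of this kind can limit the number of tokens circulating in the ring.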


Lecture plan

• Administrativia

– Reading list is now in its final version

– Note that all slides are part of the curriculum

• Next week: Mini project presentations, Room 454 in IT-Building

• Last lecture 30/4, exam Saturday 25/5 at 0900

– Wrap up

– Short presentation of projects/master theses offered by EECS/CARD/Lasse

– Short presentation of mini-course TDT1:

• TDT1 Energy Efficient Multicore Computing

– Review session: send e-mail to Lasse before 24/4 to ask for special topics