
TDT 4260 lecture 11 spring semester Interconnection network continued


Academic year: 2021


TDT4260 – computer architecture http://research.idi.ntnu.no/multicore

TDT 4260 – lecture 11 – spring semester 2013

Lasse Natvig,

The CARD group Dept. of computer & information science NTNU

http://research.idi.ntnu.no/multicore


Lecture overview

• Interconnection network continued

– Routing

– Switch microarchitecture

• Dataflow computing

– Principles

– MDM in detail

• “… think differently”! (innovation)

• Research method

• Administrativia

– Reading list is now in its final version

– Next week: Mini project presentations, Room 454 in IT-Building

– Last lecture 30/4, exam Saturday 25/5 at 0900


ARM guest lectures Thursday morning

Part of the course TDT4258 - Energieffektive datamaskinsystemer (Energy-Efficient Computer Systems)

– Thursday 18 April, at 08:15 in auditorium F6:

1) Low power HW design. Guest lecturer: Nir Leshem (Hardware Engineering Manager, ARM) (in English)

2) Driver development for Linux, driver architecture and debugging. Guest lecturers: Ørjan Eide (Senior Engineer, ARM) and Mikael Valen-Sendstad (Staff Software Architect, ARM) (in Norwegian)


F.5: Routing, Arbitration, Switching

Routing

Which of the possible paths are allowable for packets?

– Set of operations needed to compute a valid path

Arbitration

When are paths available for packets?

– Resolves packets requesting the same resources at the same time

– For every arbitration, there is a winner and possibly many losers

• Losers are buffered (lossless) or dropped on overflow (lossy)

Switching

How are paths allocated to packets?


Routing

Shared Media

▫ Broadcast to everyone

Switched Media needs real routing. Options:

Source-based routing: message specifies path to the destination (changes of direction)

Virtual Circuit: circuit established from source to destination, message picks the circuit to follow

Destination-based routing: message specifies destination, switch must pick the path

 Deterministic: always follow same path

 Adaptive: pick different paths to avoid congestion, failures

 Randomized routing: pick between several good paths to balance network load


Store & Forward vs Cut-Through Routing

Cut-through (on blocking)

▫ Virtual cut-through (spools rest of packet into buffer)

[Figure: packet flits 0 to 3 crossing the network over time, Store & Forward Routing versus Cut-Through Routing]


Routing mechanism

Need to select output port for each input packet

▫ And fast…

Simple arithmetic in regular topologies

▫ Example: x, y routing in a grid with bi-directional links (first x then y)

 west (-x): Δx < 0
 east (+x): Δx > 0
 south (-y): Δx = 0, Δy < 0
 north (+y): Δx = 0, Δy > 0

Unidirectional links sufficient for torus (+x, +y)

Dimension-order routing (DOR)

▫ Reduce relative address of each dimension in order to avoid deadlock
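
The x-then-y rule above can be sketched in a few lines of Python. This is only an illustration for a 2D mesh with bi-directional links; the function names, the coordinate convention (east is +x, north is +y) and the "local" port name are assumptions made for the example, not taken from the slides.

```python
# Assumed coordinate convention: east = +x, north = +y.
STEP = {"west": (-1, 0), "east": (1, 0), "south": (0, -1), "north": (0, 1)}


def xy_route(current, dest):
    """Pick the output port for one hop: reduce x first, then y."""
    dx = dest[0] - current[0]
    dy = dest[1] - current[1]
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:          # only reached when dx == 0
        return "south"
    if dy > 0:
        return "north"
    return "local"      # hypothetical port name: deliver to the attached node


def path(src, dest):
    """Follow xy_route hop by hop and return the list of ports taken."""
    hops = []
    node = src
    while True:
        port = xy_route(node, dest)
        if port == "local":
            return hops
        hops.append(port)
        node = (node[0] + STEP[port][0], node[1] + STEP[port][1])
```

Because every packet finishes all x hops before taking any y hop, a packet never turns from the y dimension back into the x dimension, which is the property dimension-order routing relies on in a mesh.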


Deadlock

How can it arise?

▫ necessary conditions:

 shared resources
 incrementally allocated
 non-preemptible

How do you handle it?

▫ constrain how channel resources are allocated (deadlock avoidance)

▫ Add a mechanism that detects likely deadlocks and fixes them


Deadlock – example 1

Red: S1 to d1
Green: S2 to d2
Blue: S3 to d3
Black: S4 to d4


Deadlock example 2

• Deadlock can occur even with DOR if uni-directional links

– Can be solved by having two (virtual) channels

[Figure: 4 × 4 arrangement of TRC nodes, TRC (0,0) to TRC (3,3), with two links marked X]


Arbitration (1/2)

• Several simultaneous requests to shared resource

• Ideal: Maximize usage of network resources

• Problem: Starvation

– Fairness needed

• Figure: Two phase arbitration.

– Request, Grant

– Poor usage
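
One common way to get the fairness mentioned above is a round-robin arbiter. The slides do not name a specific policy, so the sketch below is an assumption chosen for illustration; the class and method names are hypothetical.

```python
class RoundRobinArbiter:
    """Grants one requester per call; the last winner gets lowest priority next time."""

    def __init__(self, n_inputs):
        self.n = n_inputs
        self.last = n_inputs - 1  # so that input 0 has highest priority at the start

    def grant(self, requests):
        """requests: set of input indices asking for the shared resource.

        Returns the winning index, or None if nobody is requesting.
        """
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if candidate in requests:
                self.last = candidate
                return candidate
        return None
```

With inputs 0 and 2 requesting continuously, the grants alternate between them, so neither input starves.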


Arbitration (2/2)

• Three phases

• Multiple requests

• Better usage

• But: Increased latency


Switching

• Allocating paths for packets

• Two techniques:

– Circuit switching (connection oriented)

• Communication channel
• Allocated before first packet
• Packet headers don’t need routing info
• Wastes bandwidth

– Packet switching (connection less)

• Each packet handled independently
• Can’t guarantee response time
• Two types – next slide


Store & Forward vs. Cut-Through Routing

Cut-through (on blocking)

▫ Virtual cut-through (spools rest of packet into buffer)

▫ Wormhole (buffers only a few flits, leaves tail along route; only one flit in the figure)

[Figure: packet flits 0 to 3 moving from Source to Dest over Time, Store & Forward Routing versus Cut-Through Routing; labels: Packet switching, Circuit switching]
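
The difference between the two schemes can be made concrete with an idealized latency model. The sketch below assumes no contention and zero routing and wire delay; the function and parameter names, and the numbers in the usage note, are hypothetical and not taken from the slides.

```python
def store_and_forward_latency(links, packet_bits, bandwidth):
    """Each link must receive the whole packet before forwarding it."""
    return links * packet_bits / bandwidth


def cut_through_latency(links, packet_bits, header_bits, bandwidth):
    """Only the header is delayed at each intermediate switch;
    the rest of the packet streams behind it in a pipeline."""
    return (links - 1) * header_bits / bandwidth + packet_bits / bandwidth
```

For example, with 3 links, a 1000-bit packet (including a 100-bit header) and 100 bits per time unit, this model gives 30 time units for store and forward and 12 for cut-through. With a single link the two are equal.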


Pipelined switch


IDI Open, a challenge for you?


DATAFLOW COMPUTING AND MDM


Dataflow computing and computers

• Dataflow computing

– suitable for highly parallel solutions

– requires different HW and SW

• Dataflow computers

– Principles

– History

– Static vs. dynamic

– Typical architecture

• pipelined ring with circulating packets

Manchester Dataflow Machine (MDM)


Dataflow programs

• Represent computation as a graph
• Node = operation = instruction
• Computation flows through
• Inherently parallel, data driven, no program counter, asynchronous
• Logical processor at each node, activated by availability of operands, executed when a physical processor is available

Dataflow graph (figure) for:

f = a × d
a = (b + 1) × (b - c)
d = c × e
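
The data-driven firing rule can be illustrated with a small interpreter for the example graph f = a × d, a = (b + 1) × (b - c), d = c × e. This is a minimal sketch: the intermediate node names t1 and t2, the constant token "one", and the `run` function are assumptions introduced for the example.

```python
import operator

# node name -> (operation, names of the operand tokens it waits for)
GRAPH = {
    "t1": (operator.add, ("b", "one")),   # b + 1
    "t2": (operator.sub, ("b", "c")),     # b - c
    "a": (operator.mul, ("t1", "t2")),    # a = (b + 1) x (b - c)
    "d": (operator.mul, ("c", "e")),      # d = c x e
    "f": (operator.mul, ("a", "d")),      # f = a x d
}


def run(graph, inputs):
    """Fire every node whose operands are all available, step by step."""
    tokens = dict(inputs)
    pending = dict(graph)
    firing_order = []
    while pending:
        ready = [n for n, (_, srcs) in pending.items()
                 if all(s in tokens for s in srcs)]
        if not ready:
            raise ValueError("graph cannot make progress: missing operands")
        # Nodes in `ready` are independent and could execute in parallel.
        results = {n: pending[n][0](*(tokens[s] for s in pending[n][1]))
                   for n in ready}
        tokens.update(results)
        for n in ready:
            del pending[n]
        firing_order.append(sorted(ready))
    return tokens, firing_order
```

Calling `run(GRAPH, {"b": 4, "c": 2, "e": 3, "one": 1})` fires t1, t2 and d in the first step, a in the second and f in the third. No program counter is involved; the order follows only from operand availability.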


Example — data flow


Control flow and data flow

• (Traditional) control-flow

– Explicit control flow (manipulation of program counter (PC))

– Data are communicated between instructions via shared memory locations

– Data is referenced via memory address

– One single control thread

– Many parallel control threads: explicit parallelism

• Data flow computers

– Data-driven computation, that is, the selection of instructions for execution is controlled by the availability of operands: implicit parallelism

– Programs represented as directed graphs

– Results are sent directly as data packets between instructions

– Has normally/originally no shared memory that more than one


Data flow computers, history

• Relatively old topic

• Many research projects

– Fundamentally different, interesting

• Link to functional languages gave renewed interest

• Some prototypes built, none with outstanding performance

• Status 1998

– Few research projects

– Data flow principle used many places

• In processors (Reservation stations, Tomasulo, TDT 4255)
• Chaining of DSP PE’s for high performance

• ...


Dataflow computers related to other architectures (anno 1986)

Source: Dataflow machine architecture, Arthur H. Veen, ACM Computing Surveys, Volume 18, Issue 4, December 1986


Dataflow machines, architecture and implementation (anno 1986)


Static and dynamic data flow systems

Static systems do not allow concurrent reactivation of code

– A given part of a data flow graph can exist in only one instance at a time

– At most one data packet exists on one line

– Data packets are communicated directly from instruction to instruction

– Control packets are used as acknowledge signals from receiver to sender, so that the sender knows when it can produce a new result

Dynamic systems

• Allow concurrent activation, e.g. the same code can be executed at the same time in different contexts

• What opportunities does this give for program execution?

– Loop unrolling, “unfolding” (iteration number)

– Simultaneous procedure calls

– Recursive procedures


Dynamic dataflow systems - implementation

How can it be realized?

1) Tag operands with a context identifier (tagged token)
2) Copying of code

[Figure: a + instruction receiving operands from two contexts; context-1: value = 10, context-2: value = 33]

• Needs the ability to have more than one value «on its way» between two instructions at the same time

– In this case there is not enough storage space in the receiver instruction to store several operands

– Needs a unit/component where one operand can wait for its «fellow operand»
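
The waiting place for a «fellow operand» can be sketched as a table keyed by destination instruction and tag. The class below is a simplified illustration for two-operand instructions only; `MatchingStore`, `arrive` and the port names "left" and "right" are hypothetical names, and this is not a description of any real machine's hardware.

```python
class MatchingStore:
    """Holds the first operand of a pair until its partner with the same tag arrives."""

    def __init__(self):
        # (instruction, tag) -> (port, value) of the operand that arrived first
        self.waiting = {}

    def arrive(self, instruction, tag, port, value):
        """Return a ready-to-execute packet on a match, otherwise None."""
        key = (instruction, tag)
        if key not in self.waiting:
            self.waiting[key] = (port, value)   # no partner yet: wait
            return None
        other_port, other_value = self.waiting.pop(key)
        operands = {port: value, other_port: other_value}
        return (instruction, tag, operands["left"], operands["right"])
```

Using the values from the slide, a token with value 10 in context-1 and a token with value 33 in context-2 can both wait at the same + instruction without being confused, because the tag is part of the key.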


Manchester Dataflow Machine (MDM)

Source: The Manchester Prototype Dataflow Computer, Gurd, Kirkham, Watson, CACM, Jan. 1985, pp. 34-52

Data flow machine based on dynamic tagging of (small) data packets (tokens)

Approx 1-2 MIPS in 1985


MDM: Data flow programs

Three levels

SISAL (Fig. 2)

Assembler (Fig. 3)

» variables from SISAL

» operators from data flow instruction-set

Machine code (Fig. 1)


MDM – machine code

Graphical presentation

Some special instructions: CGR, DUP, BRR, ADL


SISAL (fig.2)


Template Assembly Language (TASS), fig. 3


Execution sequence, fig. 4, cont’d


Manchester Dataflow Machine

Tagged data packets = tokens

Tag:

- Iteration level (for loops)

- Activation name (for simultaneous procedure calls and recursion)

- Index (when same code operates on different parts of a data structure)

Implementation

Fig. 7 and 8

[Figure: MDM ring with the blocks Switch, Token Queue, Matching Unit, Instruction Store and Processing Unit (P0...P19), plus Input and Output]


MDM Matching Unit (1/2)

MDM Matching Unit (2/2)

Instruction Store & Processing Unit

MDM: System evaluation

• Test method: load program, load input token into the input queue, start the clock and release the token from the queue. Stop the clock when the first token is received at the host

• Execution time = f(#processors, program, input)

• Research goals; build knowledge about:

Hardware utilization and bottlenecks
Parallelism in software
Data flow MIPS vs. "normal MIPS"

• Reduce number of "variables"

Different artificial situations to avoid testing too much at the same time
» Micro benchmarks, e.g. program with Pby = 1.00

• Test classes

1) small programs -> do not use overflow unit


Test programs (Table II)


Speedup


MDM: Problems

Low efficiency when handling data structures

• special-HW

Arbitration to/from functional units

• Easily becomes a bottleneck

Experiments:

processors starve when a large fraction of the tokens do not give a "match"

» larger buffer on the output port of the matching unit

Needs better compilers


MDM: Retrospective (1992)

Manchester Data-Flow: A Progress Report, Gurd & Snelling, ICS’92.

MDM history: started in 1976, stopped in 1989

Included:

Structure Store Units

“Throttle Unit”

• Large programs can generate so much parallel activity that the system drowns in tokens that cannot be processed for a long time (a kind of “thrashing”, cf. the OS concept)

• A unit in the ring that assigns unique “activation names” (a part of the “tag” field)

• Using information about the load on different parts of the system, the throttle unit can slow down the assignment of these names so that the total load is reduced


Lecture plan

• Administrativia

– Reading list is now in its final version

– Note that all slides are part of the curriculum

• Next week: Mini project presentations, Room 454 in IT-Building

• Last lecture 30/4, exam Saturday 25/5 at 0900

– Wrap up

– Short presentation of projects/master theses offered by EECS/CARD/Lasse

– Short presentation of mini-course TDT1:

TDT1 Energy Efficient Multicore Computing

– Repetition --- send e-mail to Lasse before 24/4 to ask for special topics
