
Interconnection Networks


Academic year: 2021


Z. Jerry Shi

Assistant Professor of Computer Science and Engineering University of Connecticut

* Slides adapted from Blumrich&Gschwind/ELE475’03, Peh/ELE475’*

Three questions about interconnection networks

• What is an interconnection network?

– A programmable system that transports data between terminals

• Where do you find interconnection networks?

– Used in almost all digital systems that are large enough to have two components to connect

– The most common applications are in computer systems and communication switches

• Connection between processors and memories, I/O devices and I/O controllers

– Simple bus systems are used in many systems, but high processor performance demands fast interconnection networks

• Why are interconnection networks important?


Architecture of Interconnection Networks

• How to connect the nodes up (processors, memories, router line cards, SoC modules) – TOPOLOGY

• Which path should a message take? – ROUTING AND DEADLOCKS

• How is the message actually forwarded from source to destination? – FLOW CONTROL

• How to build the routers – ROUTER MICROARCHITECTURE

• How to build the links – LINK ARCHITECTURE

• How do nodes talk to the network? – NETWORK INTERFACE

Metrics in Interconnection Networks

• Performance

– Latency

• How fast data can be transported through the network

– Throughput

• How many pieces of data (messages) can be transported in each time unit

• Power

• Area

• Cost

• Fault-Tolerance

• Quality-of-service


Topology

• An interconnection network consists of a set of shared router nodes and channels

• Topology refers to the arrangement of these nodes and channels

– Analogous to a roadmap

• Channels (roads), packets (cars), router nodes (intersections)

Topological Properties

• Routing Distance: number of links on a route

– Average Distance

• Diameter: maximum routing distance

• Bisection Bandwidth: the bandwidth crossing a minimal cut that divides the network in half

– A network is partitioned by a set of links if their removal disconnects the graph
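The distance properties above can be measured directly on a small graph. The following is a minimal sketch, assuming unit-length bidirectional links; the helper names (linear_array, ring, hop_counts, diameter_and_average) are illustrative and not from the slides.

```python
from collections import deque

def linear_array(n):
    """Adjacency lists for nodes 0..n-1 connected in a line."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def ring(n):
    """Adjacency lists for nodes 0..n-1 connected in a cycle."""
    return {i: sorted({(i - 1) % n, (i + 1) % n}) for i in range(n)}

def hop_counts(adj, src):
    """Breadth-first search: number of links from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_average(adj):
    """Return (diameter, average distance over ordered pairs of distinct nodes)."""
    total, longest, pairs = 0, 0, 0
    for src in adj:
        for dst, d in hop_counts(adj, src).items():
            if dst != src:
                total += d
                longest = max(longest, d)
                pairs += 1
    return longest, total / pairs
```

For example, diameter_and_average(linear_array(8)) and diameter_and_average(ring(8)) can be compared to see how the wrap-around link shortens both the diameter and the average distance.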


Linear Arrays and Rings

• Linear Array

– Diameter?

– Average Distance?

– Bisection bandwidth?

– Route A -> B given by relative address R = B - A

• Ring?

• Examples: Fiber Distributed Data Interface (FDDI), Scalable Coherent Interface (SCI), FiberChannel Arbitrated Loop

[Figure: a linear array with nodes 0, 1, 2, 3, ..., N-2, N-1, and a 16-node example with nodes 0 to 15]

Multidimensional Meshes, Tori, and Hypercubes

• d-dimensional k-ary torus (or k-ary d-cube): $N = k^d$

– Each dimension has k nodes; a node is located by a vector of d radix-k coordinates

– A k-ary d-cube can be constructed with k k-ary (d-1)-cubes

– The radix in each dimension may be different

• For example, 2,3,4-ary 3-cube

• d-dimensional k-ary mesh: similar to torus

– Cut the channels between the first and last node in every dimension

• Hypercube: binary d-cube


Hypercubes

• Also called binary n-cubes

– Number of nodes $N = 2^n$

• Distance: $O(\log N)$ hops

• Good bisection bandwidth

• Complexity

– Out degree is $n = \log N$
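A short sketch of why the out degree is n and the distance is at most log N: assuming nodes are labelled with n-bit integers (a labelling chosen here for illustration), each neighbor differs in exactly one bit, and the hop count between two nodes is the number of differing bits.

```python
def hypercube_neighbors(node, n):
    """Neighbors of node in a binary n-cube: flip each of the n address bits."""
    return [node ^ (1 << bit) for bit in range(n)]

def hypercube_distance(a, b):
    """Minimum hop count: the number of bit positions where a and b differ."""
    return bin(a ^ b).count("1")
```

For a 3-cube, hypercube_neighbors(0b000, 3) lists the three nodes one hop away, and no pair of nodes is more than 3 hops apart.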

[Figure: hypercubes of dimension 0-D, 1-D, 2-D, 3-D, 4-D, and 5-D]

Real World 2D mesh


Properties

• Routing

– Relative distance: $R = (b_{d-1} - a_{d-1}, \ldots, b_0 - a_0)$

– Traverse $r_i = b_i - a_i$ hops in each dimension

– Dimension-order routing

• Degree?

• Diameter?

• Average Distance

– $dk/4$ for cube

• Bisection bandwidth?

– $k^{d-1}$ bidirectional links

• Physical layout?

– 2D in O(N) space

– Higher dimension?
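The routing rule above can be sketched for a k-ary d-cube as follows. This is an illustration under stated assumptions: nodes are numbered so that coordinate i is the i-th radix-k digit, dimension 0 is corrected first, and on each ring the shorter way around is taken (ties go to the + direction). None of these choices is fixed by the slides.

```python
def to_coords(node, k, d):
    """Radix-k digits of node, least significant dimension first."""
    return [(node // k**i) % k for i in range(d)]

def dimension_order_route(src, dst, k, d):
    """Hops as (dimension, direction) pairs, finishing one dimension before the next.

    direction is +1 or -1 along the ring of that dimension.
    """
    a, b = to_coords(src, k, d), to_coords(dst, k, d)
    hops = []
    for dim in range(d):
        r = (b[dim] - a[dim]) % k          # hops needed going in the + direction
        if r <= k - r:
            hops += [(dim, +1)] * r
        else:
            hops += [(dim, -1)] * (k - r)  # wrap-around is shorter
    return hops
```

In a 4-ary 2-cube, routing from node 0 to node 11 (coordinates [3, 2]) takes one wrap-around hop in dimension 0 and then two hops in dimension 1.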

Embeddings in two dimensions

• Embed multiple logical dimensions in one physical dimension using long wires


Topology Summary

• All have some “bad permutations”

– Many popular permutations are very bad for meshes (transpose)

– Randomness in wiring or routing makes it hard to find a bad one!

| Topology | Degree | Diameter | Ave Dist | Bisection | D (D ave) @ P=1024 |
|---|---|---|---|---|---|
| 1D Array | 2 | $N-1$ | $N/3$ | 1 | huge |
| 1D Ring | 2 | $N/2$ | $N/4$ | 2 | |
| 2D Mesh | 4 | $2(N^{1/2} - 1)$ | $\frac{2}{3} N^{1/2}$ | $N^{1/2}$ | 63 (21) |
| 2D Torus | 4 | $N^{1/2}$ | $\frac{1}{2} N^{1/2}$ | $2N^{1/2}$ | 32 (16) |
| k-ary n-cube | $2n$ | $nk/2$ | $nk/4$ | $nk/4$ | 15 (7.5) @ n=3 |
| Hypercube | $n = \log N$ | $n$ | $n/2$ | $N/2$ | 10 (5) |

Trees

• Diameter and average distance are logarithmic

– k-ary tree, height $d = \log_k N$

– Address specified as a d-vector of radix-k coordinates describing the path down from the root

• Fixed degree

• Route up to common ancestor and down

– R = B xor A

– Let i be the position of the most significant 1 in R; route up i+1 levels

– Route down in the direction given by the low i+1 bits of B
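The up/down rule above can be written out for a complete binary tree (k = 2). This is a sketch that assumes the leaves are numbered 0..N-1 and that a 0 bit means "left" and a 1 bit means "right" on the way down; the function name tree_route is illustrative.

```python
def tree_route(a, b):
    """Route from leaf a to leaf b of a complete binary tree.

    Returns (levels_up, down_turns), where down_turns lists the branch taken
    at each level on the way down (0 = left, 1 = right by assumption).
    """
    r = a ^ b
    if r == 0:
        return 0, []                      # source and destination coincide
    i = r.bit_length() - 1                # position of the most significant 1 in R
    up = i + 1                            # climb to the common ancestor
    down = [(b >> bit) & 1 for bit in range(i, -1, -1)]  # low i+1 bits of B
    return up, down
```

For instance, leaves 2 and 3 share a parent, so the route climbs one level and descends one; leaves 0 and 5 of an 8-leaf tree meet only at the root.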


Fat-Trees

• Fatter links (really more of them) as you go up, so bisection BW scales with N

Butterflies

• Tree with lots of roots!

• N log N (actually N/2 x log N)

• Exactly one route from any source to any dest

• R = A xor B; at level i use the ‘straight’ edge if $r_i = 0$, otherwise the cross edge
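The butterfly rule above amounts to reading the bits of R one level at a time. The sketch below assumes level i examines bit i of R (least significant bit at level 0); the slides do not state the bit order, so treat that mapping as an assumption.

```python
def butterfly_route(a, b, levels):
    """Edge type taken at each level when routing from input a to output b."""
    r = a ^ b
    return ["cross" if (r >> i) & 1 else "straight" for i in range(levels)]
```

Because the edge at every level is fully determined by A and B, there is exactly one route per source and destination pair.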


Benes network and Fat Tree

• Back-to-back butterfly can route all permutations

– Off line

• What if you just pick a random mid point?

[Figure: INPUT, butterfly network, inverse butterfly network, OUTPUT]

Relationship Butterflies to Hypercubes

• Wiring is isomorphic

• Except that the butterfly always takes log n steps


How Many Dimensions?

• n = 2 or n = 3

– Short wires, easy to build

– Many hops, low bisection bandwidth

– Requires traffic locality

• n >= 4

– Harder to build, more wires, longer average length

– Fewer hops, better bisection bandwidth

– Can handle non-local traffic

• k-ary d-cubes provide a consistent framework for comparison

– $N = k^d$

– Scale dimension (d) or nodes per dimension (k)

Real Machines

• Wide links, smaller routing delay

• Tremendous variation


Routing

Messages, Packets, Flits, Phits

• A flit (flow control digit) is the basic unit of bandwidth and storage allocation

• A phit (physical transfer digit) is the unit of information that is transferred across a channel in a single clock cycle

Typical Packet Format

• A packet consists of different types of flits

– Head, body, or tail

• The head flit carries the packet’s routing information

– A packet has a format of HB*T*

[Figure: packet layout with header (routing and control), data payload, and trailer (error code); each field is a sequence of digital symbols transmitted over a channel]

Routing

• Routing algorithm determines

– which of the possible paths are used as routes

– how the route is determined

– R: N × N → C, which at each switch maps the destination node to the next channel on the route

• Issues:

– Routing mechanism

• arithmetic

• source-based port select

• table driven

• general computation

– Properties of the routes

– Deadlock free


Taxonomy of Routing Algorithms

• Deterministic

– Route determined by (source, dest), not intermediate state (i.e. traffic)

• Given two nodes x and y, the path $R_{x,y}$ is always the same

• Oblivious

– Choose a route without considering any information about the network’s current state

• Example, a random algorithm

• Adaptive

– Route influenced by traffic along the way

• Minimal

– Only selects shortest paths

Example: routing on a ring

• Greedy

– Always send the packet in the shortest direction

• Uniform random

– Randomly pick a direction, with equal probability for picking either direction

• Weighted random

– Randomly pick a direction, but weight the short direction with probability 1 - d/n, where d is the length of the shortest path and n is the number of nodes

• Adaptive

– Send the packet in the direction for which the local channel has the lowest load
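The four ring policies above differ only in how the direction is chosen. A minimal sketch follows, assuming an n-node ring with nodes 0..n-1, directions encoded as +1 (clockwise) and -1 (counter-clockwise), and ties broken toward +1. All function names and the load arguments are illustrative.

```python
import random

CW, CCW = +1, -1

def short_direction(src, dst, n):
    """Return (direction, hops) for the shorter way around the ring."""
    cw = (dst - src) % n
    return (CW, cw) if cw <= n - cw else (CCW, n - cw)

def greedy(src, dst, n):
    """Always take the shorter direction."""
    return short_direction(src, dst, n)[0]

def uniform_random(src, dst, n, rng=random):
    """Pick either direction with equal probability."""
    return rng.choice([CW, CCW])

def weighted_random(src, dst, n, rng=random):
    """Pick the short direction with probability 1 - d/n, else the long one."""
    short, d = short_direction(src, dst, n)
    return short if rng.random() < 1 - d / n else -short

def adaptive(load_cw, load_ccw):
    """Pick the direction whose local channel currently has the lower load."""
    return CW if load_cw <= load_ccw else CCW
```

Greedy is deterministic and minimal; the two random policies are oblivious because they ignore network state; only adaptive looks at channel load.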


Routing relation

• R: N × N → ρ(P)

– The output of the relation is an entire path

– There may be multiple paths

• R: N × N → ρ(C)

– Routing is incremental

– The output only indicates the channels that the packet may take at the current node

• R: C × N → ρ(C)

– Similar to the second method

– Uses the current channel instead of the current node

Adaptive Routing

• R: C × N × Σ → C

• Essential for fault tolerance

– At least multipath

• Can improve utilization of the network

• Simple deterministic algorithms easily run into bad permutations

• Fully/partially adaptive, minimal/non-minimal

• Can introduce complexity or anomalies

• Little adaptation goes a long way!


Routing Mechanism

• Need to select output port for each input packet

– in a few cycles

• Simple arithmetic in regular topologies

– Example: ∆x, ∆y routing in a grid

• west (-x): ∆x < 0

• east (+x): ∆x > 0

• south (-y): ∆x = 0, ∆y < 0

• north (+y): ∆x = 0, ∆y > 0

• processor: ∆x = 0, ∆y = 0

• Reduce relative address of each dimension in order

– Dimension-order routing in k-ary d-cubes

• Calculate preferred directions, then adjust one dimension each time

• Used in Cray T3D, which connects up to 2048 DEC Alpha processing elements
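The ∆x, ∆y port-selection table above translates almost directly into code. The sketch below assumes a 2D grid without wrap-around and (x, y) integer coordinates; select_port, route_xy, and the STEP table are illustrative names.

```python
# Coordinate change produced by leaving through each port.
STEP = {"west": (-1, 0), "east": (1, 0), "south": (0, -1), "north": (0, 1)}

def select_port(dx, dy):
    """Output port for a packet whose remaining offset is (dx, dy)."""
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "processor"      # dx == 0 and dy == 0: deliver locally

def route_xy(src, dst):
    """Ports taken hop by hop from src to dst, ending with 'processor'."""
    x, y = src
    ports = []
    while True:
        port = select_port(dst[0] - x, dst[1] - y)
        ports.append(port)
        if port == "processor":
            return ports
        sx, sy = STEP[port]
        x, y = x + sx, y + sy
```

Because x is always corrected before y, every packet follows the same dimension order, which is the property the deadlock-avoidance argument relies on.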

Routing Mechanism (cont)

• Source-based

– Mainly used in deterministic and oblivious routing

– All routing decisions are made in the source and message header carries series of port selects

• Used and stripped en route

– Fast, simple, and scalable

– CS-2, Myrinet, MIT Arctic

• Node-table

– More appropriate for adaptive routing

– Decide the output channel based on incoming channel and destination

– Can redirect traffic if one output link is congested or fails

– ATM, HPPI

[Figure: message header carrying port selects P0, P1, P2, P3]

Deadlock

• How can it arise?

– Necessary conditions:

• Shared resource (buffers or channels)

• Incrementally allocated

• Non-preemptible

– Think of a channel as a shared resource that is acquired incrementally

• Source buffer then destination buffer

• Channels along a route

• How do you avoid it?

– Deadlock avoidance: guarantee no deadlock

• Constrain how channel resources are allocated.

• Example: dimension order

– Deadlock recovery: deadlock is detected and corrected

• How do you prove that a routing algorithm is deadlock free?

Deadlock Freedom

• Resources are logically associated with channels

• Messages introduce dependences between resources as they move forward

• Need to articulate the possible dependences that can arise between channels

• Show that there are no cycles in the Channel Dependence Graph

– Find a numbering of channel resources such that every legal route follows a monotonic sequence

=> No traffic pattern can lead to deadlock

– The network itself need not be acyclic; only the channel dependence graph must be

• All deadlock avoidance techniques use some form of resource ordering
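One way to check the acyclicity condition above on a small example is to build the channel dependence graph from the set of legal routes and search it for cycles. This is a sketch, assuming each route is given as the ordered list of channels it uses; the function names are illustrative.

```python
def channel_dependences(routes):
    """Build the channel dependence graph: channel -> set of channels requested next."""
    deps = {}
    for route in routes:
        for channel in route:
            deps.setdefault(channel, set())
        for held, wanted in zip(route, route[1:]):
            deps[held].add(wanted)      # a packet holding `held` may wait for `wanted`
    return deps

def has_cycle(deps):
    """Depth-first search for a cycle in the dependence graph."""
    WHITE, GREY, BLACK = 0, 1, 2        # unvisited, on the current path, finished
    color = {c: WHITE for c in deps}

    def visit(c):
        color[c] = GREY
        for nxt in deps[c]:
            if color[nxt] == GREY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in deps)
```

For a unidirectional 4-node ring with channels c0..c3, two-hop routes create the dependences c0→c1→c2→c3→c0, so has_cycle reports a potential deadlock; routes that always use an x channel before a y channel produce no cycle.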


Deadlock Recovery

• Detection

– Determining exactly whether the network is deadlocked is difficult

– Most practical detection mechanisms are conservative

• May have false positives

– Timeout counters

• Reset when making progress

• Recovery

– Regressive: packets or connections that are deadlocked are removed

– Progressive: keep the packets or connections in escape buffer

• Potentially has better performance

• Routing using the escape buffer is designed to be deadlock-free

Flow Control

• Flow control determines how a network’s resources are allocated

– Resources: channel bandwidth, buffer capacity, etc.

– Good flow control: achieves a high fraction of ideal bandwidth and delivers packets with low, predictable latency

• Can also be viewed as a problem of contention resolution

• Problem is there because we are sharing resources

– Processor:

• Resources in a processor: ALUs, registers

• How to run as many operations, optimizing use of ALUs and registers

– Network

• Resources in a network: Buffers, links


Contention

• Two packets trying to use the same link at the same time

– Limited buffering

– Drop?

Flow control protocols

• Bufferless

– Dropping

– Misrouting

– Circuit switching

• Header traverses the network and reserves resources

• Data are then sent through the reserved path

• Buffered

– Store-and-forward

– Virtual cut-through

– Wormhole


Simplest Flow Control: Dropping

• If two things arrive and I don’t have resources, drop one of them

• Flow control protocol on the Internet

• Not used in interconnection networks – why?


Next Simplest Flow Control: Misrouting

• If only one message can enter the network at each node, and one message can exit the network at each node, the network can never be congested. Right?

• Philosophy behind misrouting: intentionally route away from congestion

• No need for buffering

Circuit Switching

• Bufferless

• Probe that sets up path through network

– If the request flit is blocked, it is held in place (not dropped)

• Reserve all links

• Data are then sent through links

• Simple router

– Similar to the dropping case

– Need only one register to buffer the header

• When is this good?

• When is it not?


Time-space Diagram: Circuit Switching

Store-and-Forward

• Buffered flow control: flits can be stored in routing nodes

– Flits arriving on cycle i do not have to leave on cycle i + 1

• Make intermediate stops and wait till the whole packet has arrived before you move on

• Two resources must be allocated to the packet

– A packet-sized buffer at the other side of the channel

– Exclusive use of the channel

• Other packets can use intermediate links

• Pros and cons?


Time-space Diagram: Store-and-Forward

• With store-and-forward, packets do not have to be divided into flits

Virtual Cut-through

• Why wait till the entire message has arrived at each intermediate stop?

• The head of the message can dash off first

– Of course, the two resources must be allocated

• When the head gets blocked, the whole message gets blocked at the intermediate node
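The benefit of letting the head dash off can be seen in the usual zero-load latency estimates for store-and-forward and virtual cut-through. The sketch below uses symbols that are not in the slides and are assumed here: hops (number of hops), router_delay (per-hop routing delay in cycles), length (packet length in bits), and bandwidth (channel bandwidth in bits per cycle). Contention is ignored.

```python
def store_and_forward_latency(hops, router_delay, length, bandwidth):
    """Whole packet is received at every hop before being forwarded."""
    return hops * (router_delay + length / bandwidth)

def cut_through_latency(hops, router_delay, length, bandwidth):
    """Head is forwarded immediately; serialization is paid only once."""
    return hops * router_delay + length / bandwidth
```

With 4 hops, a 1-cycle router delay, a 64-bit packet, and 8 bits per cycle, store-and-forward costs 36 cycles while cut-through costs 12, since the serialization time is no longer multiplied by the hop count.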


Time-space Diagram: Virtual Cut-through

Wormhole

• Similar to virtual cut-through, but channel and buffers are allocated to flits rather than packets

• When the head flit arrives, it must acquire three resources before being forwarded to the next node

– A virtual channel for the packet

• State bits indicating the output channel, state of virtual channel (Idle, waiting for resources, or active), and other information

– One flit buffer

– One flit of channel bandwidth

• Body flits do not need to acquire virtual channels

– But they still need to allocate a flit buffer and channel bandwidth

• The tail flit releases the virtual channel

• Channel is owned by a packet, but buffers are allocated on a flit-by-flit basis

– When a flit cannot acquire a buffer, the channel goes idle


Time-space Diagram: Wormhole

Virtual Channel

• Associates several virtual channels with a single physical channel

– When a packet blocks, instead of holding on to physical links so others cannot use them, hold on to virtual links

• The head flit needs three resources to advance

– A virtual channel, a downstream flit buffer, and channel bandwidth

• Subsequent body flits use the same virtual channel

– But they still need to allocate a flit buffer and channel bandwidth

• However, these flits are not guaranteed access to the channel bandwidth

• Lanes on the highway


Time-space diagram: virtual-channel

• Arbitration may not be fair

– It can be winner-take-all

Link-level flow control

• Given that you can’t drop packets, how to manage the buffers? When can you send stuff forward, when not?

• Three techniques

– Credit-based: upstream router keeps a count of the number of free flit buffers in each virtual channel downstream

– On/off: a single bit indicates whether the upstream node can send or not

– Ack/nack: upstream node optimistically sends flits when they are available, and downstream node sends back ack or nack

• Flit-Reservation
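Credit-based flow control reduces to one counter per downstream virtual channel. The following is a minimal sketch of the upstream side, assuming one credit corresponds to one free flit buffer; the class and method names are illustrative.

```python
class CreditCounter:
    """Upstream view of one downstream virtual channel's free flit buffers."""

    def __init__(self, buffers):
        self.credits = buffers          # downstream buffers start out empty

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        """Consume one credit when a flit is forwarded downstream."""
        if self.credits == 0:
            raise RuntimeError("no free downstream buffer; flit must wait")
        self.credits -= 1

    def receive_credit(self):
        """Downstream freed a buffer and returned a credit."""
        self.credits += 1
```

With two downstream buffers, the upstream router can send two flits back to back and must then stall until a credit returns, so flits are never dropped for lack of buffer space.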

[Figure: link-level flow control on short links (Data, Req, Ready/Ack, and F/E signals between source and destination) and on long links (several flits on the wire)]

Buffer turnaround time

[Figure: time-space diagrams of buffer turnaround time, showing buffer use, hold and release, wire delay, pipeline delay, and credit delay]

A flit leaves the downstream node.

Credit is sent to the current node.

Credit is processed and a flit is sent to the downstream node.

Flit-reservation flow control

• Hides the overhead by separating the control and data networks

– Control flits race ahead to reserve network resources

– Can also streamline the delivery of credits

• Allows zero buffer turnaround time

– Not always possible to reserve resources

• The control head flit is similar to a typical head flit, but with an additional field that shows the time offset to the first data flit

– Routing node knows when the data flit will arrive, and starts to prepare buffer now

Router (switch) microarchitecture: What’s in a router?

• It’s a system as well

– Logic – State machines, Arbiters, Allocators

• Control the movement through router

• Idle, Routing, Waiting for resources, Active

– Memory – Buffers

• Store flits before forwarding them

• SRAMs, registers, processor memory

– Communication – Switches

• Transfer flits from input to output ports


Typical Router Design

[Figure: typical router with input ports (receiver, input buffer), cross-bar, output ports (output buffer, transmitter), and control (routing, scheduling)]

Router Components

• Output ports

– Transmitter (typically drives clock and data)

• Input ports

– Synchronizer aligns the data signal with the local clock domain

– Essentially a FIFO buffer

• Crossbar

– Connects each input to any output

– Degree limited by area or pinout

• Buffering

• Control logic

– Complexity depends on routing logic and scheduling algorithm

– Determine output port for each incoming packet


Buffer Organizations

• Input buffers

– Buffering at each input port, stores flits till they get to leave through switch to next hop

• Central buffers

– A central memory shared among every port

– Functions as switch as well

• Output buffers

– Flits flow right through to output port

– Highest throughput, no head-of-line blocking

Input Buffered Router

• Independent routing logic per input

– FSM

• Scheduler logic arbitrates each output

– Priority, FIFO, or random

• Head-of-line blocking problem

– If the flit at the head of an input buffer is blocked, the flits behind it cannot advance, even if their own outputs are free

[Figure: input-buffered router with input ports, routing logic R0 to R3, cross-bar, scheduling, and output ports]


Output Buffered Router

• Commit to output - limited adaptivity

• Switch has to handle input line speeds

[Figure: output-buffered router with input ports, routing logic R0 to R3, control, and buffered output ports]

Virtual-channel Router


• Packet – head, body, tail flits

• Head

– Routing → output port

– Request and arbitrate for next VC

– Request and arbitrate for switch path

– Request and arbitrate for buffer

– Traverse switch

• Body

– Request and arbitrate for switch path

– Request and arbitrate for buffer

– Traverse switch

• Tail

– Request and arbitrate for switch path

– Request and arbitrate for buffer

– Traverse switch

– Release switch path

State machines

• Control the state of the router

• Each input channel

– G: Global state: is it idle? routing? waiting for VC? buffer?

– R: Output port

• Filled by routing

– O: Output VC

• Filled by VC allocation

– P: Head and tail queue pointers

– C: Credits

• Each output channel

– G: Global state: Idle? Active? Waiting for credits?

– I: Input VC that is sending flits to this output port

– C: Credit count


Pipelining of a typical virtual channel router

| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Head flit | RC | VA | SA | ST | | | |
| Body flit 1 | | | | SA | ST | | |
| Body flit 2 | | | | | SA | ST | |
| Tail flit | | | | | | SA | ST |

• Cycle 0: The head flit arrives. G will change to R on the next cycle

• Cycle 1: RC (routing computation). R and G (=V) will be updated on the next cycle

• Cycle 2: VA (virtual channel allocation). On the next cycle, O and G (=A) will be updated. The state of the output channel will be updated

• Cycle 3: SA (switch allocation)

• Cycle 4: ST (switch traversal)
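The stage timing described above can be generated for a packet of any length. This sketch assumes no stalls (every allocation succeeds on the first attempt) and one flit entering switch allocation per cycle behind the head; the function name and the flit labels are illustrative.

```python
def pipeline_schedule(num_body_flits):
    """Map each flit to the cycle in which it occupies each pipeline stage."""
    # Only the head flit needs routing computation and VC allocation.
    schedule = {"head": {"RC": 1, "VA": 2, "SA": 3, "ST": 4}}
    names = [f"body {i + 1}" for i in range(num_body_flits)] + ["tail"]
    for offset, name in enumerate(names, start=1):
        # Later flits inherit the route and VC, so they start at switch allocation.
        schedule[name] = {"SA": 3 + offset, "ST": 4 + offset}
    return schedule
```

For a packet with two body flits, the tail flit traverses the switch in cycle 7, so the packet occupies the router for 4 + (number of flits - 1) cycles when nothing stalls.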

Output arbiters

• N requesters (inputs) trying to get a single resource under contention (output)

– N:1 arbiter for each output

• Several types of arbiters

– Fixed priority arbiter

– Variable priority arbiter

• Oblivious arbiter

• Round robin arbiter


Fixed Priority Arbiter

Variable Priority Arbiter

• A one-hot priority signal p selects the highest priority

– Only one of the

Variable Priority Arbiters

• Oblivious

– Not dependent on previous grants or requests

– Rotating priorities

– Random priorities

Variable Priority Arbiters

• Round robin

– Request that was last served should have lowest priority

– Serve all other requests first before returning to this requestor

– If a grant is issued this cycle, the request next to the one receiving the grant will have the highest priority on the next cycle
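The round-robin rule above can be modelled in software with a single pointer to the requester that currently has the highest priority. This is a behavioral sketch, not a gate-level design; the class and attribute names are illustrative.

```python
class RoundRobinArbiter:
    """N:1 arbiter in which the most recently served requester gets lowest priority."""

    def __init__(self, n):
        self.n = n
        self.next_priority = 0          # requester with the highest priority this cycle

    def arbitrate(self, requests):
        """requests[i] is truthy if requester i wants the resource. Returns the winner or None."""
        for offset in range(self.n):
            i = (self.next_priority + offset) % self.n
            if requests[i]:
                # The requester after the winner has highest priority next cycle.
                self.next_priority = (i + 1) % self.n
                return i
        return None                     # no requests: priority is unchanged
```

If requesters 0 and 2 keep requesting, the grants alternate 0, 2, 0, 2, which is the fairness property a fixed priority arbiter lacks.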


Allocators

• NxM allocator: N requestors fighting for M resources

• Results:

– A grant can be asserted only if the corresponding request is asserted

– At most one grant for each input may be asserted

– At most one grant for each resource may be asserted

Allocators In Routers

• VC Allocator

– Input VCs request a range of output VCs

– E.g. a packet on VC0 arrives at the East input port. It’s destined for the West output port, and would like to get any of the VCs of that output port.

• Switch Allocator

– Input VCs of an input port request different output ports (e.g. one’s going North, another’s going West)


Simplest Allocators: Separable

• Approximate with two stages of arbitration

– One on inputs, one on outputs. They can be in either order.
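The two-stage idea can be sketched as an input-first separable allocator. For simplicity this illustration uses fixed-priority arbitration (lowest index wins) in both stages, which is an assumption; real designs typically use rotating or round-robin arbiters.

```python
def separable_allocate(requests):
    """Input-first separable allocation.

    requests[i][j] is truthy if input i requests output j.
    Returns a grant matrix of booleans with the same shape.
    """
    n_in = len(requests)
    n_out = len(requests[0]) if requests else 0
    # Stage 1: each input picks one of the outputs it requested.
    choice = [next((j for j in range(n_out) if requests[i][j]), None)
              for i in range(n_in)]
    # Stage 2: each output picks one of the inputs that chose it.
    grants = [[False] * n_out for _ in range(n_in)]
    for j in range(n_out):
        winner = next((i for i in range(n_in) if choice[i] == j), None)
        if winner is not None:
            grants[winner][j] = True
    return grants
```

With requests [[1, 1, 0], [1, 0, 0], [0, 0, 1]], inputs 0 and 1 both choose output 0 in the first stage, so input 1 gets nothing even though input 0 could have used output 1. This is why separable allocation only approximates a maximum matching.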

Separable Allocator


Switches

• The fabric that directs flits from an input port to an output port

• Design issue: number of input and output ports, and speedups

– Speedup: the ratio of the total input bandwidth to the network’s ideal capacity (the best throughput)

• Tradeoff between cost (delay, area, power) and performance (throughput)

• Tradeoff between leaving it up to allocation or simplifying the job for allocators

Crossbar switches

Input speedup = 1

Input speedup = 2


Effect of input speedup

• With a random allocator

• Throughput is the fraction of capacity

Several flit buffer organizations

• Central

– Simple logical view

• There are actually two switches: MUX in and deMUX out

– Problems: bandwidth and latency

• Separate memory per input port


Virtual Channel (VC) Buffer Organization

• One buffer per VC

– Allows switches to access multiple VCs associated with one PC, but leads to poor memory utilization

• Approximations:

– A small number of output ports on a single buffer

– Divide VCs among buffers

– Memory interleaving!


Alpha 21364 router

• Torus

• Virtual cut-through (316 packet buffers)

• Adaptive routing: prefer to continue in the same dimension

• Deadlock avoidance

– Coherence: Requests may fill up buffers, stalling acks (Solution: Virtual channel class, order)

– Network: Escape virtual channel


Router microarchitecture

Network Interface

• How a processor sends data to the network

• Shared memory cache-coherent multiprocessors

– Interfaces caches with networks

• Message-passing multiprocessors

– Interfaces processor pipeline with networks

• Dedicated register (or two registers)

• Register map

• Memory map

• Virtual memory map

• I/O interrupt + DMA


Cache-coherent SMP processor-network interface

• Highly optimized interface: from load/store to messages in a few cycles

• Request is placed in memory request register

– Tag: how to handle the reply, e.g., store the data in R24

– Type: cacheable or not; read or write

• Cache hit: place in reply register right away

• Cache miss: enter miss status holding register (MSHR)

– Use this to merge reads/writes as well

– Number of MSHRs == number of pending memory references (4 to 32)

Cache-coherent SMP memory-network interface

• Messages from the network initialize transaction status holding register (TSHR)

– Messages may be queued

• TSHR tracks the status of pending memory operations

Example:

For a non-cacheable read, the TSHR status changes:

Read pending (waiting for bank)

→ Bank activated (waiting for data)

→ Read complete (preparing message)


Message-passing multiprocessors: Dedicated register

• Send

– Move a value to the network out register

– Special MOV instruction for the last word to terminate the packet

• Read

– Block on the register until packet arrives, or test register and retry later

• Pros: fast

• Cons:

– Long messages: processor becoming DMA engine!

– Security: hold the register forever

Register map

• Send a message atomically from a subset of the processor’s general purpose registers

• Cons:

– Long messages have to be segmented

– Pressure on general purpose registers

– Processors are still DMA engines


I/O interface

• Most common interface today, in PCs and clusters of workstations (e.g. Infiniband, Myrinet, PCI)

• Software-level messaging:

– Interrupt triggers handler

– Handler sets up DMA

– DMA engine constructs packets from memory and sends out to network

• Physical-memory-mapped or virtual-memory-mapped

Case Study: Princeton SHRIMP

Where: I/O bus


Virtual memory mapping

Map_network(My_virtual_addr_range,Your_virtual_addr_range)

Each virtual page -> local physical page -> remote physical page -> remote virtual address

Store to these virtual addresses => network


Case Study: M-Machine Multicomputer

• Experimental multicomputer built at MIT and Stanford

– 2-D torus
