Interconnection Networks


Z. Jerry Shi

Assistant Professor of Computer Science and Engineering University of Connecticut

* Slides adapted from Blumrich&Gschwind/ELE475’03, Peh/ELE475’*

Three questions about interconnection networks

• What is an interconnection network?

– A programmable system that transports data between terminals

• Where do you find interconnection networks?

– Used in almost all digital systems that are large enough to have two components to connect

– The most common applications are in computer systems and communication switches

• Connection between processors and memories, I/O devices and I/O controllers

– Simple bus systems are used in many systems, but high processor performance demands fast interconnection networks

• Why are interconnection networks important?


Architecture of Interconnection Networks

• How to connect the nodes (processors, memories, router line cards, SoC modules) – TOPOLOGY

• Which path should a message take? – ROUTING AND DEADLOCKS

• How is the message actually forwarded from source to destination? – FLOW CONTROL

• How to build the routers – ROUTER MICROARCHITECTURE

• How to build the links – LINK ARCHITECTURE

• How do nodes talk to the network? – NETWORK INTERFACE

Metrics in Interconnection Networks

• Performance

– Latency

• How fast data can be transported through the network

– Throughput

• How many pieces of data (messages) can be transported in each time unit

• Power

• Area

• Cost

• Fault-Tolerance

• Quality-of-service


Topology

• An interconnection network consists of a set of shared router nodes and channels

• Topology refers to the arrangement of these nodes and channels

– Analogous to a roadmap

• Channels (roads), packets (cars), router nodes (intersections)

Topological Properties

• Routing Distance: number of links on route

– Average Distance

• Diameter: maximum routing distance

• Bisection Bandwidth: the bandwidth crossing a minimal cut that divides the network in half

– A network is partitioned by a set of links if their removal disconnects the graph


Linear Arrays and Rings

• Linear Array

– Diameter?

– Average Distance?

– Bisection bandwidth?

– Route A -> B given by relative address R = B-A

• Ring?

• Examples: Fiber Distributed Data Interface (FDDI), Scalable Coherent Interface (SCI), Fibre Channel Arbitrated Loop
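The questions above can be checked by brute force on small networks. The following is a minimal sketch, assuming a bidirectional ring and counting distance in hops; the function names are illustrative and not part of the original slides.

```python
from itertools import combinations

def array_distance(a, b, n):
    """Hops between nodes a and b on an n-node linear array (|R| with R = B - A)."""
    return abs(b - a)

def ring_distance(a, b, n):
    """Hops between a and b on an n-node bidirectional ring (shorter direction)."""
    r = abs(b - a)
    return min(r, n - r)

def diameter_and_average(distance, n):
    """Brute-force diameter and average distance over all distinct node pairs."""
    dists = [distance(a, b, n) for a, b in combinations(range(n), 2)]
    return max(dists), sum(dists) / len(dists)

if __name__ == "__main__":
    print("array:", diameter_and_average(array_distance, 16))
    print("ring: ", diameter_and_average(ring_distance, 16))
```

For the bisection question, cutting a linear array in half removes one link, while cutting a ring in half removes two.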

[Figure: linear array of N nodes labeled 0 through N-1, and a 16-node example labeled 0 through 15]

Multidimensional Meshes, Tori, and Hypercubes

• d-dimensional k-ary torus (or k-ary d-cube): $N = k^d$

– Each dimension has k nodes, which can be located with a vector

– A k-ary d-cube can be constructed with k k-ary (d–1)-cubes

– The radix in each dimension may be different

• For example, 2,3,4-ary 3-cube

• d-dimensional k-ary mesh: similar to torus

– Cut the channels between the first and last node in every dimension

• Hypercube: binary d-cube


Hypercubes

• Also called binary n-cubes

– Number of nodes $N = 2^n$

• Distance: O(log N) hops

• Good bisection bandwidth

• Complexity

– Out degree is $n = \log_2 N$
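As a small illustration of these properties, the sketch below assumes nodes are labeled with n-bit integers so that neighbors differ in exactly one bit. The labeling and function names are assumptions made for the example.

```python
def neighbors(node, n):
    """The n neighbors of a node in a binary n-cube: flip one address bit each."""
    return [node ^ (1 << i) for i in range(n)]

def distance(a, b):
    """Hop count between two nodes: the number of differing address bits."""
    return bin(a ^ b).count("1")

if __name__ == "__main__":
    print(neighbors(0b000, 3))       # out degree is n
    print(distance(0b000, 0b111))    # at most n = log2(N) hops
```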

[Figure: hypercubes of dimension 0-D, 1-D, 2-D, 3-D, 4-D, and 5-D]

Real World 2D mesh


Properties

• Routing

– Relative distance: $R = (b_{d-1} - a_{d-1}, \ldots, b_0 - a_0)$

– Traverse $r_i = b_i - a_i$ hops in each dimension

– Dimension-order routing

• Degree?

• Diameter?

• Average distance

– $dk/4$ for cube

• Bisection bandwidth?

– $k^{d-1}$ bidirectional links

• Physical layout?

– 2D in O(N) space

– Higher dimension?
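The routing bullets above can be sketched as follows. This is a minimal example, assuming a mesh (no wraparound links) and coordinates stored as tuples with dimension 0 first; both choices are assumptions for illustration.

```python
def relative_address(a, b):
    """Per-dimension offsets r_i = b_i - a_i from source a to destination b."""
    return tuple(bi - ai for ai, bi in zip(a, b))

def dimension_order_route(a, b):
    """Nodes visited from a to b, fully correcting dimension 0, then 1, and so on."""
    path = [tuple(a)]
    cur = list(a)
    for dim in range(len(a)):
        step = 1 if b[dim] > cur[dim] else -1
        while cur[dim] != b[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path

if __name__ == "__main__":
    print(relative_address((0, 0), (2, 1)))
    print(dimension_order_route((0, 0), (2, 1)))
```

The hop count of the resulting path equals the sum of the absolute offsets, so the route is minimal in a mesh.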

Embeddings in two dimensions

• Embed multiple logical dimensions in one physical dimension using long wires


Topology Summary

• All have some “bad permutations”

– Many popular permutations are very bad for meshes (transpose)

– Randomness in wiring or routing makes it hard to find a bad one!

Topology      Degree          Diameter            Ave Dist          Bisection     D (D ave) @ P=1024
1D Array      2               $N-1$               $N/3$             1             huge
1D Ring       2               $N/2$               $N/4$             2
2D Mesh       4               $2(N^{1/2} - 1)$    $2/3\,N^{1/2}$    $N^{1/2}$     63 (21)
2D Torus      4               $N^{1/2}$           $1/2\,N^{1/2}$    $2N^{1/2}$    32 (16)
k-ary n-cube  $2n$            $nk/2$              $nk/4$            $nk/4$        15 (7.5) @ n=3
Hypercube     $n = \log N$    $n$                 $n/2$             $N/2$         10 (5)

Trees

• Diameter and average distance logarithmic

– k-ary tree, height $d = \log_k N$

– Address specified as a d-vector of radix-k coordinates describing the path down from the root

• Fixed degree

• Route up to common ancestor and down

– R = B xor A

– Let i be the position of the most significant 1 in R; route up i+1 levels

– Go down in the direction given by the low i+1 bits of B
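The up/down rule above can be sketched for the binary case (k = 2). This is an illustrative sketch, assuming leaves are numbered left to right with integer addresses; the function name is hypothetical.

```python
def tree_route(a, b):
    """Route between leaves a and b of a binary tree.

    Returns (levels_up, down_bits): how many levels to climb, then the
    child to take at each level on the way down (0 = left, 1 = right).
    """
    r = a ^ b                      # R = B xor A
    if r == 0:
        return 0, []               # same leaf, nothing to do
    i = r.bit_length() - 1         # position of the most significant 1 in R
    up = i + 1                     # climb to the lowest common ancestor
    down = [(b >> j) & 1 for j in range(i, -1, -1)]  # low i+1 bits of B
    return up, down

if __name__ == "__main__":
    print(tree_route(0b010, 0b011))
    print(tree_route(0b000, 0b101))
```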


Fat-Trees

• Fatter links (really more of them) as you go up, so bisection BW scales with N

Butterflies

• Tree with lots of roots!

• N log N (actually N/2 x log N)

• Exactly one route from any source to any dest

• R = A xor B; at level i use the ‘straight’ edge if $r_i = 0$, otherwise the cross edge
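The edge-selection rule can be sketched as below. Which address bit is consumed at which level depends on how the butterfly is drawn; this sketch assumes level i uses bit i of R, which is an assumption for illustration.

```python
def butterfly_route(a, b, n_levels):
    """Edge taken at each level of a butterfly when routing from a to b."""
    r = a ^ b  # R = A xor B
    return ["cross" if (r >> i) & 1 else "straight" for i in range(n_levels)]

if __name__ == "__main__":
    print(butterfly_route(0b000, 0b101, 3))
```

Because the choice at every level is fixed by R, there is exactly one route for each source and destination pair.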


Benes network and Fat Tree

• Back-to-back butterfly can route all permutations

– Off line

• What if you just pick a random mid point?

[Figure: butterfly network followed by inverse butterfly network, from INPUT to OUTPUT]

Relationship of Butterflies to Hypercubes

• Wiring is isomorphic

• Except that Butterfly always takes log n steps


How Many Dimensions?

• n = 2 or n = 3

– Short wires, easy to build

– Many hops, low bisection bandwidth

– Requires traffic locality

• n >= 4

– Harder to build, more wires, longer average length

– Fewer hops, better bisection bandwidth

– Can handle non-local traffic

• k-ary d-cubes provide a consistent framework for comparison

– $N = k^d$

– Scale dimension (d) or nodes per dimension (k)

Real Machines

• Wide links, smaller routing delay

• Tremendous variation


Routing

Messages, Packets, Flits, Phits

• A flit (flow control digit) is the basic unit of bandwidth and storage allocation

• A phit (physical transfer digit) is the unit of information that is transferred across a channel in a single clock cycle


Typical Packet Format

• A packet consists of different types of flits

– Head, body, or tail

• The head flit carries the packet’s routing information

– A packet has a format of HB*T*

[Figure: packet format with fields Header (Routing and Control), Data Payload, Error Code, and Trailer; a packet is a sequence of digital symbols transmitted over a channel]

Routing

• Routing algorithm determines

– which of the possible paths are used as routes

– how the route is determined

– R: N × N → C, which at each switch maps the destination node to the next channel on the route

• Issues:

– Routing mechanism

• arithmetic

• source-based port select

• table driven

• general computation

– Properties of the routes

– Deadlock free


Taxonomy of Routing Algorithms

• Deterministic

– Route determined by (source, dest), not intermediate state (i.e. traffic)

• Given two nodes x and y, the path $R_{x,y}$ is always the same

• Oblivious

– Choose a route without considering any information about the network’s current state

• Example, a random algorithm

• Adaptive

– Route influenced by traffic along the way

• Minimal

– Only selects shortest paths

Example: routing on a ring

• Greedy

– Always send the packet in the shortest direction

• Uniform random

– Randomly pick a direction, with equal probability for picking either direction

• Weighted random

– Randomly pick a direction, but weight the short direction with 1 – d/n, where d is the shortest path

• Adaptive

– Send the packet in the direction for which local channel has the lowest load
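The four policies can be compared side by side in a short sketch. It assumes an n-node bidirectional ring, uses +1 for clockwise and -1 for counterclockwise, and breaks ties toward clockwise; the load arguments of the adaptive policy are hypothetical inputs supplied by the caller.

```python
import random

def _clockwise_hops(src, dst, n):
    return (dst - src) % n

def greedy(src, dst, n):
    """Always take the shorter direction (ties go clockwise)."""
    cw = _clockwise_hops(src, dst, n)
    return +1 if cw <= n - cw else -1

def uniform_random(src, dst, n, rng=random):
    """Pick either direction with equal probability."""
    return rng.choice([+1, -1])

def weighted_random(src, dst, n, rng=random):
    """Pick the short direction with probability 1 - d/n, else the long one."""
    cw = _clockwise_hops(src, dst, n)
    d = min(cw, n - cw)            # length of the shortest path
    short = greedy(src, dst, n)
    return short if rng.random() < 1 - d / n else -short

def adaptive(src, dst, n, load_cw, load_ccw):
    """Pick the direction whose local channel currently has the lower load."""
    return +1 if load_cw <= load_ccw else -1

if __name__ == "__main__":
    print(greedy(0, 3, 8), greedy(0, 6, 8))
    print(adaptive(0, 3, 8, load_cw=5, load_ccw=2))
```

Only the greedy policy is minimal for every packet; the other three may send a packet the long way around.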


Routing relation

• R: N × N → ρ(P)

– The output of the relation is an entire path

– There may be multiple paths

• R: N × N → ρ(C)

– Routing is incremental

– The output only indicates the channels that the packet can take at the current node

• R: C × N → ρ(C)

– Similar to the second method

– Uses the current channel instead of the current node

Adaptive Routing

• R: C × N × Σ → C

• Essential for fault tolerance

– At least multipath

• Can improve utilization of the network

• Simple deterministic algorithms easily run into bad permutations

• Fully/partially adaptive, minimal/non-minimal

• Can introduce complexity or anomalies

• A little adaptation goes a long way!


Routing Mechanism

• Need to select output port for each input packet

– in a few cycles

• Simple arithmetic in regular topologies

– Example: ∆x, ∆y routing in a grid

• west (-x): ∆x < 0

• east (+x): ∆x > 0

• south (-y): ∆x = 0, ∆y < 0

• north (+y): ∆x = 0, ∆y > 0

• processor: ∆x = 0, ∆y = 0

• Reduce relative address of each dimension in order

– Dimension-order routing in k-ary d-cubes

• Calculate preferred directions, then adjust one dimension each time

• Used in Cray T3D, which connects up to 2048 DEC Alpha processing elements
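The ∆x, ∆y port-selection table above translates almost directly into code. The sketch below assumes nodes are addressed as (x, y) tuples; the function name is illustrative.

```python
def select_port(cur, dst):
    """Output port for a packet at node cur headed to node dst in a 2D grid."""
    dx = dst[0] - cur[0]
    dy = dst[1] - cur[1]
    if dx < 0:
        return "west"       # -x
    if dx > 0:
        return "east"       # +x
    if dy < 0:
        return "south"      # -y, only once dx == 0
    if dy > 0:
        return "north"      # +y, only once dx == 0
    return "processor"      # dx == 0 and dy == 0: deliver locally

if __name__ == "__main__":
    print(select_port((1, 1), (3, 0)))
    print(select_port((3, 1), (3, 0)))
    print(select_port((3, 0), (3, 0)))
```

Because x is always corrected before y, this is dimension-order routing: a packet never turns from the y dimension back into the x dimension.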

Routing Mechanism (cont)

• Source-based

– Mainly used in deterministic and oblivious routing

– All routing decisions are made in the source and message header carries series of port selects

• Used and stripped en route

– Fast, simple, and scalable

– CS-2, Myrinet, MIT Arctic

• Node-table

– More appropriate for adaptive routing

– Decide the output channel based on incoming channel and destination

– Can redirect traffic if one output link is congested or fails

– ATM, HPPI

[Figure: message header carrying the series of port selects P₀ P₁ P₂ P₃]


Deadlock

• How can it arise?

– Necessary conditions:

• Shared resource (buffers or channels)

• Incrementally allocated

• Non-preemptible

– Think of a channel as a shared resource that is acquired incrementally

• Source buffer then destination buffer

• Channels along a route

• How do you avoid it?

– Deadlock avoidance: guarantee no deadlock

• Constrain how channel resources are allocated.

• Example: dimension order

– Deadlock recovery: deadlock is detected and corrected

• How do you prove that a routing algorithm is deadlock free?

Deadlock Freedom

• Resources are logically associated with channels

• Messages introduce dependences between resources as they move forward

• Need to articulate the possible dependences that can arise between channels

• Show that there are no cycles in Channel Dependence Graph

– Find a numbering of channel resources such that every legal route follows a monotonic sequence

=> No traffic pattern can lead to deadlock

– The network itself need not be acyclic, only the channel dependence graph

• All deadlock avoidance techniques use some form of resource ordering
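The cycle check on the channel dependence graph can be sketched as follows. The example assumes each route is given as the list of channels it uses in order, and it compares a unidirectional ring with a unidirectional linear array; the helper names and the integer channel numbering are assumptions for illustration.

```python
def channel_dependencies(routes):
    """Build the dependence graph: a packet holding one channel requests the next."""
    deps = {}
    for route in routes:
        for held, wanted in zip(route, route[1:]):
            deps.setdefault(held, set()).add(wanted)
            deps.setdefault(wanted, set())
    return deps

def has_cycle(deps):
    """Depth-first search for a cycle in the dependence graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def visit(c):
        color[c] = GREY
        for nxt in deps[c]:
            if color[nxt] == GREY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in deps)

def ring_routes(n):
    """All routes on a unidirectional ring; channel i runs from node i to (i+1) % n."""
    return [[(s + h) % n for h in range(hops)]
            for s in range(n) for hops in range(1, n)]

def array_routes(n):
    """All left-to-right routes on a linear array; channel i runs from node i to i+1."""
    return [list(range(s, t)) for s in range(n) for t in range(s + 1, n)]

if __name__ == "__main__":
    print("ring has cycle: ", has_cycle(channel_dependencies(ring_routes(4))))
    print("array has cycle:", has_cycle(channel_dependencies(array_routes(4))))
```

In the array case the channel indices themselves form the monotonic numbering mentioned above: every route uses channels in increasing order, so no cycle can form.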


Deadlock Recovery

• Detection

– Determining exactly whether the network is deadlocked is difficult

– Most practical detection mechanisms are conservative

• May have false positives

– Timeout counters

• Reset when making progress

• Recovery

– Regressive: packets or connections that are deadlocked are removed

– Progressive: keep the packets or connections in escape buffer

• Potentially has better performance

• Routing using the escape buffer is designed to be deadlock-free

Flow Control

• Flow control determines how a network’s resources are allocated

– Resources: channel bandwidth, buffer capacity, etc.

– Good flow control: achieves a high fraction of ideal bandwidth and delivers packets with low, predictable latency

• Can also be viewed as a problem of contention resolution

• Problem is there because we are sharing resources

– Processor:

• Resources in a processor: ALUs, registers

• How to run as many operations as possible while optimizing use of ALUs and registers

– Network

• Resources in a network: Buffers, links


Contention

• Two packets trying to use the same link at the same time

– Limited buffering

– Drop?

Flow control protocols

• Bufferless

– Dropping

– Misrouting

– Circuit switching

• Header traverses the network and reserves resources

• Data are then sent through the reserved path

• Buffered

– Store-and-forward

– Virtual cut-through

– Wormhole


Simplest Flow Control: Dropping

• If two things arrive and I don’t have resources, drop one of them

• Flow control protocol on the Internet

• Not used in interconnection networks – why?


Next Simplest Flow Control: Misrouting

• If only one message can enter the network at each node, and one message can exit the network at each node, the network can never be congested. Right?

• Philosophy behind misrouting: intentionally route away from congestion

• No need for buffering

Circuit Switching

• Bufferless

• Probe that sets up path through network

– If the request flit is blocked, it is held in place (not dropped)

• Reserve all links

• Data are then sent through links

• Simple router

– Similar to the dropping case

– Need only one register to buffer the header

• When is this good?

• When is it not?


Time-space Diagram: Circuit Switching

Store-and-Forward

• Buffered flow control: flits can be stored in routing nodes

– Flits arriving on cycle i do not have to leave on cycle i + 1

• Make intermediate stops and wait till the whole packet has arrived before you move on

• Two resources must be allocated to the packet

– A packet-sized buffer at the other side of the channel

– Exclusive use of the channel

• Other packets can use intermediate links

• Pros and cons?


Time-space Diagram: Store-and-Forward

• With store-and-forward, packets do not have to be divided into flits

Virtual Cut-through

• Why wait till the entire message has arrived at each intermediate stop?

• The head of the message can dash off first

– Of course, the two resources must be allocated

• When the head gets blocked, the whole message gets blocked at the intermediate node
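The benefit of letting the head dash off first shows up in a simple zero-load latency comparison. The formulas below are the commonly used textbook model rather than something stated on these slides, and the parameter names (hops, per-hop router delay, packet length, channel bandwidth) are assumptions for the sketch.

```python
def store_and_forward_latency(hops, router_delay, packet_len, bandwidth):
    """Every hop waits for the whole packet: H * (t_r + L / b)."""
    return hops * (router_delay + packet_len / bandwidth)

def cut_through_latency(hops, router_delay, packet_len, bandwidth):
    """Only the head pays the per-hop delay; the packet is serialized once: H * t_r + L / b."""
    return hops * router_delay + packet_len / bandwidth

if __name__ == "__main__":
    args = dict(hops=4, router_delay=1, packet_len=5, bandwidth=1)
    print(store_and_forward_latency(**args))
    print(cut_through_latency(**args))
```

The model ignores contention; when the head is blocked, virtual cut-through buffers the whole packet at the intermediate node and behaves like store-and-forward at that hop.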


Time-space Diagram: Virtual Cut-through

Wormhole

• Similar to virtual cut-through, but channel and buffers are allocated to flits rather than packets

• When the head flit arrives, it must acquire three resources before being forwarded to the next node

– A virtual channel for the packet

• State bits indicating the output channel, state of virtual channel (Idle, waiting for resources, or active), and other information

– One flit buffer

– One flit of channel bandwidth

• Body flits do not need to acquire virtual channels

– But they still need to allocate a flit buffer and channel bandwidth

• The tail flit releases the virtual channel

• Channel is owned by a packet, but buffers are allocated on a flit-by-flit basis

– When a flit cannot acquire a buffer, the channel goes idle


Time-space Diagram: Wormhole

Virtual Channel

• Associates several virtual channels with a single physical channel

– When a packet blocks, instead of holding on to physical links so others cannot use them, hold on to virtual links

• The head flit needs three resources to advance

– A virtual channel, a downstream flit buffer, and channel bandwidth

• Subsequent body flits use the same virtual channel

– But they still need to allocate a flit buffer and channel bandwidth

• However, these flits are not guaranteed access to the channel bandwidth

• Lanes on the highway


Time-space diagram: virtual-channel

• Arbitration may not be fair

– It can be winner-take-all

Link-level flow control

• Given that you can’t drop packets, how to manage the buffers? When can you send stuff forward, when not?

• Three techniques

– Credit-based: upstream router keeps a count of the number of free flit buffers in each virtual channel downstream

– On/off: a single bit indicates whether the upstream node can send or not

– Ack/nack: upstream node optimistically sends flits when they are available and downstream node sends back ack or nack

• Flit-Reservation
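The credit-based technique can be sketched as a counter kept by the upstream router. This is a minimal model of one downstream virtual-channel buffer; the class and method names are hypothetical.

```python
class CreditLink:
    """Upstream view of one downstream virtual-channel buffer."""

    def __init__(self, buffer_slots):
        # Start with one credit per free flit buffer downstream.
        self.credits = buffer_slots

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        """Consume a credit when a flit is sent downstream."""
        if not self.can_send():
            raise RuntimeError("no credits: downstream buffer may be full")
        self.credits -= 1

    def receive_credit(self):
        """Downstream freed a buffer (a flit left it) and returned a credit."""
        self.credits += 1

if __name__ == "__main__":
    link = CreditLink(buffer_slots=2)
    link.send_flit()
    link.send_flit()
    print(link.can_send())   # upstream must stall here
    link.receive_credit()
    print(link.can_send())
```

Since a flit is sent only when a credit is available, the downstream buffer can never overflow and no flit has to be dropped.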


Link-level flow control

[Figure: link-level signaling between source and destination (Data, Req, Ready/Ack, F/E) for short links and for long links, where several flits are on the wire]

[Figure: buffer turnaround time, built from credit delay, wire delay, and pipeline delay, with buffer hold, use, and release]

A flit leaves the downstream node.

Credit is sent to the current node.

Credit is processed and a flit is sent to


Flit-reservation flow control

• Hides the overhead by separating the control and data networks

– Control flits race ahead to reserve network resources

– Can also streamline the delivery of credits

• Allows zero buffer turnaround time

– Not always possible to reserve resources

• The control head flit is similar to a typical head flit, but with an additional field that shows the time offset to the first data flit

– Routing node knows when the data flit will arrive, and starts to prepare buffer now

Router (switch) microarchitecture: What’s in a router?

• It’s a system as well

– Logic – State machines, Arbiters, Allocators

• Control the movement through router

• Idle, Routing, Waiting for resources, Active

– Memory – Buffers

• Store flits before forwarding them

• SRAMs, registers, processor memory

– Communication – Switches

• Transfer flits from input to output ports


Typical Router Design

[Figure: typical router design with input ports (receiver, input buffer), cross-bar, output ports (output buffer, transmitter), and control (routing, scheduling)]

Router Components

• Output ports

– Transmitter (typically drives clock and data)

• Input ports

– Synchronizer aligns data signal with local clock domain

– Essentially a FIFO buffer

• Crossbar

– Connects each input to any output

– Degree limited by area or pinout

• Buffering

• Control logic

– Complexity depends on routing logic and scheduling algorithm

– Determines output port for each incoming packet


Buffer Organizations

• Input buffers

– Buffering at each input port, stores flits till they get to leave through switch to next hop

• Central buffers

– A central memory shared among every port

– Functions as switch as well

• Output buffers

– Flits flow right through to output port

– Highest throughput, no head-of-line blocking

Input Buffered Router

• Independent routing logic per input

– FSM

• Scheduler logic arbitrates each output

– Priority, FIFO, or random

• Head-of-line blocking problem

– If the flit at the head of the queue is blocked, the flits behind it are held in the buffer as well

[Figure: input-buffered router with input ports, routing logic R0 to R3, scheduling, cross-bar, and output ports]


Output Buffered Router

• Commit to output - limited adaptivity

• Switch has to handle input line speeds

[Figure: output-buffered router with input ports, routing logic R0 to R3, control, and buffered output ports]

Virtual-channel Router


Virtual-channel Router

• Packet – head, body, tail flits

• Head

– Routing → output port

– Request and arbitrate for next VC

– Request and arbitrate for switch path

– Request and arbitrate for buffer

– Traverse switch

• Body

– Request and arbitrate for switch path

– Request and arbitrate for buffer

– Traverse switch

• Tail

– Request and arbitrate for switch path

– Request and arbitrate for buffer

– Traverse switch

– Release switch path

State machines

• Control the state of the router

• Each input channel

– G: Global state: is it idle? routing? waiting for VC? buffer?

– R: Output port

• Filled by routing

– O: Output VC

• Filled by VC allocation

– P: Head and tail queue pointers

– C: Credits

• Each output channel

– G: Global state: Idle? Active? Waiting for credits?

– I: Input VC that is sending flits to this output port

– C: Credit count


Pipelining of a typical virtual channel router

Cycle         1    2    3    4    5    6    7
Head flit     RC   VA   SA   ST
Body flit 1                  SA   ST
Body flit 2                       SA   ST
Tail flit                              SA   ST

• Cycle 0: Head flit arrives. G will change to R on the next cycle

• Cycle 1: RC (Routing computation). R and G (=V) will be updated on the next cycle

• Cycle 2: VA (Virtual channel allocation). On the next cycle, O and G (=A) will be updated. The state of output channel will be updated

• Cycle 3: SA: Switch allocation

• Cycle 4: ST: Switch traversal
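The stall-free timing described above can be generated programmatically. This sketch assumes one flit enters switch allocation per cycle behind the head and that nothing stalls; the function name and the dictionary layout are illustrative.

```python
def pipeline_schedule(num_body_flits):
    """Map each flit to {cycle: stage} for a 4-stage router pipeline with no stalls."""
    # The head flit goes through all four stages.
    schedule = {"head": {1: "RC", 2: "VA", 3: "SA", 4: "ST"}}
    # Body and tail flits reuse the head's route and VC, so they need only SA and ST.
    names = [f"body {i + 1}" for i in range(num_body_flits)] + ["tail"]
    for offset, name in enumerate(names, start=1):
        schedule[name] = {3 + offset: "SA", 4 + offset: "ST"}
    return schedule

if __name__ == "__main__":
    for flit, stages in pipeline_schedule(2).items():
        print(f"{flit:8s}", stages)
```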

Output arbiters

• N requesters (inputs) trying to get a single resource under contention (output)

– N:1 arbiter for each output

• Several types of arbiters

– Fixed priority arbiter

– Variable priority arbiter

• Oblivious arbiter

• Round robin arbiter


Fixed Priority Arbiter

Variable Priority Arbiter

• A one-hot priority signal p selects the highest priority

– Only one of the


Variable Priority Arbiters

• Oblivious

– Not dependent on previous grants or requests

– Rotating priorities

– Random priorities

Variable Priority Arbiters

• Round robin

– Request that was last served should have lowest priority

– Serve all other requests first before returning to this requestor

– If a grant is issued this cycle, the request next to the one receiving the grant will have the highest priority on the next cycle
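A round-robin arbiter following these rules can be sketched in software. The class below models behavior only, not the hardware implementation, and its names are hypothetical.

```python
class RoundRobinArbiter:
    """N:1 arbiter in which the requester just served gets the lowest priority."""

    def __init__(self, n):
        self.n = n
        self.next_priority = 0  # index holding the highest priority this cycle

    def arbitrate(self, requests):
        """requests: list of n booleans. Returns the granted index, or None."""
        for offset in range(self.n):
            i = (self.next_priority + offset) % self.n
            if requests[i]:
                # The request next to the granted one is highest priority next cycle.
                self.next_priority = (i + 1) % self.n
                return i
        return None  # no grant issued, priority is unchanged

if __name__ == "__main__":
    arb = RoundRobinArbiter(4)
    reqs = [True, True, False, True]
    print([arb.arbitrate(reqs) for _ in range(4)])
```

With the same three requesters asserting every cycle, each is served once before any is served a second time.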


Allocators

• NxM allocator: N requestors fighting for M resources

• Results:

– A grant can be asserted only if the corresponding request is asserted

– At most one grant for each input may be asserted

– At most one grant for each resource may be asserted

Allocators In Routers

• VC Allocator

– Input VCs request a range of output VCs

– E.g. a packet of VC0 arrives at East input port. It’s destined for west output port, and would like to get any of the VCs of that output port.

• Switch Allocator

– Input VCs of an input port request different output ports (e.g. One’s going North, another’s going West)


Simplest Allocators: Separable

• Approximate with two stages of arbitration

– One on inputs, one on outputs. They can be in either order.
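A separable allocator with the input stage first can be sketched as below. It uses fixed-priority (lowest index wins) arbitration in both stages purely to keep the example short; that choice and the function name are assumptions.

```python
def separable_allocate(requests):
    """Input-first separable allocation.

    requests[i][j] is True if input i requests resource j.
    Returns a dict {input: resource} of grants.
    """
    n_in = len(requests)
    n_out = len(requests[0]) if requests else 0

    # Stage 1: each input arbitrates among its own requests and keeps one.
    chosen = {}
    for i in range(n_in):
        for j in range(n_out):
            if requests[i][j]:
                chosen[i] = j
                break

    # Stage 2: each resource arbitrates among the inputs that chose it.
    grants = {}
    taken = set()
    for i in sorted(chosen):
        j = chosen[i]
        if j not in taken:
            taken.add(j)
            grants[i] = j
    return grants

if __name__ == "__main__":
    # Input 0 wants resource 0 or 1; input 1 wants only resource 0.
    print(separable_allocate([[True, True], [True, False]]))
```

In this example only one grant is issued even though both inputs could have been served (input 0 on resource 1, input 1 on resource 0), which is why the scheme is described as an approximation.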

Separable Allocator


Switches

• The fabric that directs flits from an input port to an output port

• Design issue: number of input and output ports, and speedups

– Speedup: the ratio of the total input bandwidth to the network’s ideal capacity (the best throughput)

• Tradeoff between cost (delay, area, power) and performance (throughput)

• Tradeoff between leaving it up to allocation or simplifying the job for allocators

Crossbar switches

Input speedup = 1

Input speedup = 2


Effect of input speedup

• With a random allocator

• Throughput is the fraction of capacity

Several flit buffer organizations

• Central

– Simple logical view

• There are actually two switches: MUX in and deMUX out

– Problems: bandwidth and latency

• Separate memory per input port


Virtual Channel (VC) Buffer Organization

• One buffer per VC

Allows switches to access multiple VCs associated with one physical channel (PC), but leads to poor memory utilization.

Approximations:

– A small number of output ports on a single buffer

– Divide VCs among buffers

– Memory interleaving!


Alpha 21364 router

• Torus

• Virtual cut-through (316 packet buffers)

• Adaptive routing: prefer to continue in the same dimension

• Deadlock avoidance

– Coherence: Requests may fill up buffers, stalling acks (Solution: Virtual channel class, order)

– Network: Escape virtual channel


Router microarchitecture

Network Interface

• How a processor sends data to the network

• Shared memory cache-coherent multiprocessors

– Interfaces caches with networks

• Message-passing multiprocessors

– Interfaces processor pipeline with networks

• Dedicated register (or two registers)

• Register map

• Memory map

• Virtual memory map

• I/O interrupt + DMA


Cache-coherent SMP processor-network interface

• Highly optimized interface: from load/store to messages in a few cycles

• Request is placed in memory request register

– Tag: how to handle the reply, e.g., store the data in R24

– Type: cacheable or not; read or write

• Cache hit: place in reply register right away

• Cache miss: enter miss status holding register (MSHR)

– Use this to merge reads/writes as well

– Number of MSHRs == number of pending memory references (4 to 32)

Cache-coherent SMP memory-network interface

• Messages from the network initialize transaction status holding register (TSHR)

– Messages may be queued

• TSHR tracks the status of pending memory operations

Example:

For a non-cacheable read, the TSHR status changes:

Read pending (waiting for bank)

→ Bank activated (waiting for data)

→ Read complete (preparing message)


Message-passing multiprocessors: Dedicated register

• Send

– Move a value to the network out register

– Special MOV instruction for the last word to terminate the packet

• Read

– Block on the register until packet arrives, or test register and retry later

• Pros: fast

• Cons:

– Long messages: processor becoming DMA engine!

– Security: hold the register forever

Register map

• Send a message atomically from a subset of the processor’s general purpose registers

• Cons:

– Long messages have to be segmented

– Pressure on general purpose registers

– Processors are still DMA engines


I/O interface

• Most common interface today, in PCs, Clusters of workstations (e.g. Infiniband, Myrinet, PCI)

• Software-level messaging:

– Interrupt triggers handler

– Handler sets up DMA

– DMA engine constructs packets from memory and sends out to network

• Physical-memory-mapped or virtual-memory-mapped

Case Study: Princeton SHRIMP

Where: I/O bus


Virtual memory mapping

Map_network(My_virtual_addr_range,Your_virtual_addr_range)

Each virtual page -> local physical page -> remote physical page -> remote virtual address

Store to these virtual addresses => network


Case Study: M-Machine Multicomputer

• Experimental multicomputer built at MIT and Stanford

– 2-D torus