Interconnection Networks
Z. Jerry Shi
Assistant Professor of Computer Science and Engineering University of Connecticut
* Slides adapted from Blumrich&Gschwind/ELE475’03, Peh/ELE475’*
Three questions about interconnection networks
• What is an interconnection network?
– A programmable system that transports data between terminals
• Where do you find interconnection networks?
– Used in almost all digital systems that are large enough to have two components to connect
– The most common applications are in computer systems and communication switches
• Connection between processors and memories, I/O devices and I/O controllers
– Simple bus systems are used in many systems, but high processor performance demands fast interconnection networks
• Why are interconnection networks important?
Architecture of Interconnection Networks
• How to connect the nodes up (processors, memories, router line cards, SoC modules) – TOPOLOGY
• Which path should a message take? – ROUTING AND DEADLOCKS
• How is the message actually forwarded from source to destination? – FLOW CONTROL
• How to build the routers – ROUTER MICROARCHITECTURE
• How to build the links – LINK ARCHITECTURE
• How do nodes talk to the network? – NETWORK INTERFACE
Metrics in Interconnection Networks
• Performance
– Latency
• How fast data can be transported through the network
– Throughput
• How many pieces of data (messages) can be transported in each time unit
• Power
• Area
• Cost
• Fault-Tolerance
• Quality-of-service
Topology
• An interconnection network consists of a set of shared router nodes and channels
• Topology refers to the arrangement of these nodes and channels
– Analogous to a road map
• Channels (roads), packets (cars), router nodes (intersections)
Topological Properties
• Routing distance: number of links on a route
– Average distance
• Diameter: maximum routing distance
• Bisection bandwidth: the bandwidth crossing a minimal cut that divides the network in half
– A network is partitioned by a set of links if their removal disconnects the graph
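The definitions above can be checked numerically on small networks. The following is a minimal sketch, assuming unit-length bidirectional links; the helper names (`hop_counts`, `diameter_and_average`, `linear_array`, `ring`) are illustrative and not from the slides.

```python
from collections import deque

def hop_counts(adj, src):
    """Breadth-first search: number of links from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_average(adj):
    """Return (diameter, average routing distance over ordered pairs of distinct nodes)."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in hop_counts(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return diameter, total / pairs

def linear_array(n):
    """Adjacency lists of an n-node linear array (hypothetical helper)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def ring(n):
    """Adjacency lists of an n-node ring (hypothetical helper)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
```

For example, `diameter_and_average(linear_array(8))` gives a diameter of N-1 = 7, while the 8-node ring has a diameter of N/2 = 4.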
Linear Arrays and Rings
• Linear array
– Diameter?
– Average distance?
– Bisection bandwidth?
– Route A -> B given by relative address R = B - A
• Ring?
• Examples: Fiber Distributed Data Interface (FDDI), Scalable Coherent Interface (SCI), Fibre Channel Arbitrated Loop
[Figure: linear array with nodes 0, 1, 2, 3, ..., N-2, N-1, and a 16-node example with nodes 0 to 15]
Multidimensional Meshes, Tori, and Hypercubes
• d-dimensional k-ary torus (or k-ary d-cube): N = k^d
– Each dimension has k nodes; a node is located by a vector of d radix-k coordinates
– A k-ary d-cube can be constructed with k k-ary (d-1)-cubes
– The radix in each dimension may be different
• For example, 2,3,4-ary 3-cube
• d-dimensional k-ary mesh: similar to torus
– Cut the channels between the first and last node in every dimension
• Hypercube: binary d-cube
Hypercubes
• Also called binary n-cubes
– Number of nodes N = 2^n
• Distance: O(log N) hops
• Good bisection bandwidth
• Complexity
– Out degree is n = log N
[Figure: hypercubes of dimension 0-D, 1-D, 2-D, 3-D, 4-D, and 5-D]
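A small sketch of why distance and degree are both logarithmic in a binary n-cube, assuming nodes are labeled with n-bit integers and two nodes are linked when their labels differ in exactly one bit. The function names are illustrative.

```python
def hypercube_neighbors(node, n):
    """The n neighbors of a node: flip each of the n address bits in turn."""
    return [node ^ (1 << i) for i in range(n)]

def hypercube_distance(a, b):
    """Hop count between a and b: the number of differing address bits."""
    return bin(a ^ b).count("1")
```

Since at most n bits can differ, the diameter is n = log N, which matches the out degree.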
Real World 2D mesh
Properties
• Routing
– Relative distance: R = (b_{d-1} - a_{d-1}, ..., b_0 - a_0)
– Traverse r_i = b_i - a_i hops in each dimension
– Dimension-order routing
• Degree?
• Diameter?
• Average distance
– dk/4 for cube
• Bisection bandwidth?
– k^{d-1} bidirectional links
• Physical layout?
– 2D in O(N) space
– Higher dimension?
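The routing rule above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: coordinates are tuples with index 0 as dimension 0, the network is a torus (so each offset is taken the shorter way around), and dimensions are corrected in increasing order. The names `relative_address` and `dimension_order_route` are hypothetical.

```python
def relative_address(a, b, k):
    """Signed per-dimension offsets r_i from a to b on a k-ary torus."""
    rel = []
    for ai, bi in zip(a, b):
        r = (bi - ai) % k
        if r > k // 2:          # going the other way around is shorter
            r -= k
        rel.append(r)
    return rel

def dimension_order_route(a, b, k):
    """List of nodes visited, fully correcting one dimension before the next."""
    cur = list(a)
    path = [tuple(cur)]
    for dim, r in enumerate(relative_address(a, b, k)):
        step = 1 if r > 0 else -1
        for _ in range(abs(r)):
            cur[dim] = (cur[dim] + step) % k
            path.append(tuple(cur))
    return path
```

For a mesh, the wrap-around adjustment would be dropped and r_i = b_i - a_i used directly.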
Embeddings in two dimensions
• Embed multiple logical dimensions in one physical dimension using long wires
Topology Summary
• All have some “bad permutations”
– Many popular permutations are very bad for meshes (transpose)
– Randomness in wiring or routing makes it hard to find a bad one!

Topology      Degree     Diameter         Ave Dist      Bisection   D (D ave) @ P=1024
1D Array      2          N-1              N/3           1           huge
1D Ring       2          N/2              N/4           2
2D Mesh       4          2(N^{1/2} - 1)   2/3 N^{1/2}   N^{1/2}     63 (21)
2D Torus      4          N^{1/2}          1/2 N^{1/2}   2N^{1/2}    32 (16)
k-ary n-cube  2n         nk/2             nk/4          nk/4        15 (7.5) @ n=3
Hypercube     n = log N  n                n/2           N/2         10 (5)
Trees
• Diameter and average distance are logarithmic
– k-ary tree, height d = log_k N
– Address specified as a d-vector of radix-k coordinates describing the path down from the root
• Fixed degree
• Route up to common ancestor and down
– R = B xor A
– Let i be the position of the most significant 1 in R; route up i+1 levels
– Then route down in the direction given by the low i+1 bits of B
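The up-then-down rule can be sketched for the binary case. This assumes k = 2, processors at the leaves, and leaf addresses given as integers; `tree_route` is an illustrative name.

```python
def tree_route(a, b):
    """Return (levels_up, down_bits) for routing from leaf a to leaf b
    in a binary tree, following R = B xor A."""
    r = a ^ b
    if r == 0:
        return 0, []                      # source and destination coincide
    i = r.bit_length() - 1                # position of the most significant 1 in R
    levels_up = i + 1                     # climb to the common ancestor
    # Descend using the low i+1 bits of B, most significant first.
    down_bits = [(b >> bit) & 1 for bit in range(i, -1, -1)]
    return levels_up, down_bits
```

For leaves 0101 and 0110, R = 0011, so the route climbs two levels and then descends along bits 1, 0.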
Fat-Trees
• Fatter links (really more of them) as you go up, so bisection BW scales with N
Butterflies
• Tree with lots of roots!
• N log N (actually N/2 x log N)
• Exactly one route from any source to any dest
• R = A xor B; at level i use the ‘straight’ edge if r_i = 0, otherwise the cross edge
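A minimal sketch of the edge selection rule, assuming level i corrects address bit i (the bit-to-level mapping is a convention chosen here, not stated in the slides). `butterfly_route` and `follow_route` are illustrative names.

```python
def butterfly_route(a, b, levels):
    """Edge taken at each level: 'straight' if r_i = 0, otherwise 'cross'."""
    r = a ^ b
    return ["cross" if (r >> i) & 1 else "straight" for i in range(levels)]

def follow_route(a, route):
    """Apply a route to source a; a cross edge at level i flips bit i."""
    cur = a
    for i, edge in enumerate(route):
        if edge == "cross":
            cur ^= 1 << i
    return cur
```

Because every bit of R is fixed by exactly one level, the route is unique, which is the "exactly one route" property.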
Benes network and Fat Tree
• Back-to-back butterfly can route all permutations
– Off line
• What if you just pick a random mid point?
[Figure: butterfly network followed by inverse butterfly network, from INPUT to OUTPUT]
Relationship Butterflies to Hypercubes
• Wiring is isomorphic
• Except that Butterfly always takes log n steps
How Many Dimensions?
• n = 2 or n = 3
– Short wires, easy to build
– Many hops, low bisection bandwidth
– Requires traffic locality
• n >= 4
– Harder to build, more wires, longer average length
– Fewer hops, better bisection bandwidth
– Can handle non-local traffic
• k-ary d-cubes provide a consistent framework for comparison
– N = k^d
– Scale dimension (d) or nodes per dimension (k)
Real Machines
• Wide links, smaller routing delay
• Tremendous variation
Routing
Messages, Packets, Flits, Phits
• A flit (flow control digit) is the basic unit of bandwidth and storage allocation
• A phit (physical transfer digit) is the unit of information that is transferred across a channel in a single clock cycle
Typical Packet Format
• A packet consists of different types of flits
– Head, body, or tail
• The head flit carries the packet’s routing information
– A packet has a format of HB*T*
[Figure: packet format with Routing and Control Header, Data Payload, Error Code, and Trailer; a packet is a sequence of digital symbols transmitted over a channel]
Routing
• Routing algorithm determines
– which of the possible paths are used as routes
– how the route is determined
– R: N x N → C, which at each switch maps the destination node to the next channel on the route
• Issues:
– Routing mechanism
• arithmetic
• source-based port select
• table driven
• general computation
– Properties of the routes
– Deadlock free
Taxonomy of Routing Algorithms
• Deterministic
– Route determined by (source, dest), not intermediate state (i.e. traffic)
• Given two nodes x and y, the path R_{x,y} is always the same
• Oblivious
– Choose a route without considering any information about the network’s current state
• Example, a random algorithm
• Adaptive
– Route influenced by traffic along the way
• Minimal
– Only selects shortest paths
Example: routing on a ring
• Greedy
– Always send the packet in the shortest direction
• Uniform random
– Randomly pick a direction, with equal probability for picking either direction
• Weighted random
– Randomly pick a direction, but weight the short direction with 1 - d/n, where d is the length of the shortest path
• Adaptive
– Send the packet in the direction for which local channel has the lowest load
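The four ring policies can be sketched side by side. This is an illustration under assumptions: nodes are numbered 0 to n-1, +1 means clockwise and -1 counterclockwise, ties go clockwise, and the load values passed to `adaptive` are whatever local measure a router keeps. All function names are hypothetical.

```python
import random

def shortest_direction(src, dst, n):
    """Return (direction, hops) for the shorter way around an n-node ring."""
    cw = (dst - src) % n
    return (+1, cw) if cw <= n - cw else (-1, n - cw)

def greedy(src, dst, n):
    """Always take the shortest direction."""
    return shortest_direction(src, dst, n)[0]

def uniform_random(src, dst, n, rng=random):
    """Either direction with equal probability."""
    return rng.choice([+1, -1])

def weighted_random(src, dst, n, rng=random):
    """Short direction with probability 1 - d/n, long direction otherwise."""
    short_dir, d = shortest_direction(src, dst, n)
    return short_dir if rng.random() < 1 - d / n else -short_dir

def adaptive(load_cw, load_ccw):
    """Direction whose local channel currently has the lowest load."""
    return +1 if load_cw <= load_ccw else -1
```

Greedy is deterministic and minimal, the two random policies are oblivious, and only `adaptive` looks at network state.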
Routing relation
• R: N × N → ρ(P)
– The output of the relation is an entire path
– There may be multiple paths
• R: N × N → ρ(C)
– Routing is incremental
– The output only indicates the channels that the packet may take at the current node
• R: C × N → ρ(C)
– Similar to the second method
– Uses the current channel instead of the current node
Adaptive Routing
• R: C × N × Σ → C
• Essential for fault tolerance
– At least multipath
• Can improve utilization of the network
• Simple deterministic algorithms easily run into bad permutations
• Fully/partially adaptive, minimal/non-minimal
• Can introduce complexity or anomalies
• Little adaptation goes a long way!
Routing Mechanism
• Need to select output port for each input packet
– in a few cycles
• Simple arithmetic in regular topologies
– Example: ∆x, ∆y routing in a grid
• west (-x): ∆x < 0
• east (+x): ∆x > 0
• south (-y): ∆x = 0, ∆y < 0
• north (+y): ∆x = 0, ∆y > 0
• processor: ∆x = 0, ∆y = 0
• Reduce relative address of each dimension in order
– Dimension-order routing in k-ary d-cubes
• Calculate preferred directions, then adjust one dimension at a time
• Used in Cray T3D, which connects up to 2048 DEC Alpha processing elements
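The ∆x, ∆y port selection above translates directly into a few comparisons. A minimal sketch, assuming nodes are (x, y) tuples on a grid without wrap-around; `select_port` is an illustrative name.

```python
def select_port(cur, dest):
    """Output port at node cur for a packet headed to dest (x first, then y)."""
    dx = dest[0] - cur[0]
    dy = dest[1] - cur[1]
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "processor"   # dx == 0 and dy == 0: deliver locally
```

The y offset is only examined once ∆x = 0, which is what makes this dimension-order routing.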
Routing Mechanism (cont)
• Source-based
– Mainly used in deterministic and oblivious routing
– All routing decisions are made at the source, and the message header carries a series of port selects
• Used and stripped en route
– Fast, simple, and scalable
– CS-2, Myrinet, MIT Arctic
• Node-table
– More appropriate for adaptive routing
– Decide the output channel based on incoming channel and destination
– Can redirect traffic if one output link is congested or fails
– ATM, HIPPI
Deadlock
• How can it arise?
– Necessary conditions:
• Shared resource (buffers or channels)
• Incrementally allocated
• Non-preemptible
– Think of a channel as a shared resource that is acquired incrementally
• Source buffer then destination buffer
• Channels along a route
• How do you avoid it?
– Deadlock avoidance: guarantee no deadlock
• Constrain how channel resources are allocated.
• Example: dimension order
– Deadlock recovery: deadlock is detected and corrected
• How do you prove that a routing algorithm is deadlock free?
Deadlock Freedom
• Resources are logically associated with channels
• Messages introduce dependences between resources as they move forward
• Need to articulate the possible dependences that can arise between channels
• Show that there are no cycles in the Channel Dependence Graph
– Find a numbering of channel resources such that every legal route follows a monotonic sequence
=> No traffic pattern can lead to deadlock
– The network itself need not be acyclic, only the channel dependence graph
• All deadlock avoidance techniques use some form of resource ordering
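The cycle check on the channel dependence graph can be sketched as below. This is an illustration, assuming each legal route is given as the ordered list of channel ids it uses, so that holding one channel while requesting the next creates a dependence edge. The function names are hypothetical.

```python
def channel_dependences(routes):
    """Build the channel dependence graph: held channel -> channels requested next."""
    deps = {}
    for route in routes:
        for c in route:
            deps.setdefault(c, set())
        for held, wanted in zip(route, route[1:]):
            deps[held].add(wanted)
    return deps

def has_cycle(deps):
    """Depth-first search for a cycle in the dependence graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def visit(c):
        color[c] = GREY
        for nxt in deps[c]:
            if color[nxt] == GREY:
                return True                       # back edge: cycle found
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in deps)

# Unidirectional 4-node ring: channel i leaves node i; routes span up to 3 hops.
ring_routes = [[(s + j) % 4 for j in range(h)] for s in range(4) for h in (1, 2, 3)]
# Rightward channels of a 4-node linear array.
array_routes = [[0], [1], [2], [0, 1], [1, 2], [0, 1, 2]]
```

Here `has_cycle(channel_dependences(ring_routes))` is true (the ring can deadlock), while the linear array's graph is acyclic because every route follows increasing channel numbers.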
Deadlock Recovery
• Detection
– Determining exactly whether the network is deadlocked is difficult
– Most practical detection mechanisms are conservative
• May have false positives
– Timeout counters
• Reset when making progress
• Recovery
– Regressive: packets or connections that are deadlocked are removed
– Progressive: keep the packets or connections in escape buffer
• Potentially has better performance
• Routing using the escape buffer is designed to be deadlock-free
Flow Control
• Flow control determines how a network’s resources are allocated
– Resources: channel bandwidth, buffer capacity, etc.
– Good flow control: achieves a high fraction of ideal bandwidth and delivers packets with low, predictable latency
• Can also be viewed as a problem of contention resolution
• Problem is there because we are sharing resources
– Processor:
• Resources in a processor: ALUs, registers
• How to run as many operations as possible, optimizing use of ALUs and registers
– Network
• Resources in a network: Buffers, links
Contention
• Two packets trying to use the same link at the same time
– Limited buffering
– Drop?
Flow control protocols
• Bufferless
– Dropping
– Misrouting
– Circuit switching
• Header traverses the network and reserves resources
• Data are then sent through the reserved path
• Buffered
– Store-and-forward
– Virtual cut-through
– Wormhole
Simplest Flow Control: Dropping
• If two things arrive and I don’t have resources, drop one of them
• Flow control protocol on the Internet
• Not used in interconnection networks – why?
Next Simplest Flow Control: Misrouting
• If only one message can enter the network at each node, and one message can exit the network at each node, the network can never be congested. Right?
• Philosophy behind misrouting: intentionally route away from congestion
• No need for buffering
Circuit Switching
• Bufferless
• Probe that sets up path through network
– If the request flit is blocked, it is held in place (not dropped)
• Reserve all links
• Data are then sent through links
• Simple router
– Similar to the dropping case
– Need only one register to buffer the header
• When is this good?
• When is it not?
Time-space Diagram: Circuit Switching
Store-and-Forward
• Buffered flow control: flits can be stored in routing nodes
– Flits arriving on cycle i do not have to leave on cycle i + 1
• Make intermediate stops and wait till the whole packet has arrived before you move on
• Two resources must be allocated to the packet
– A packet-sized buffer at the other side of the channel
– Exclusive use of the channel
• Other packets can use intermediate links
• Pros and cons?
Time-space Diagram: Store-and-Forward
• With store-and-forward, packets do not have to be divided into flits
Virtual Cut-through
• Why wait till the entire message has arrived at each intermediate stop?
• The head of the message can dash off first
– Of course, the two resources must be allocated
• When the head gets blocked, the whole message gets blocked at the intermediate node
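The benefit of letting the head dash off can be quantified with the usual zero-load latency model. The formulas below are a standard textbook approximation, not taken from these slides, and assume no contention, a fixed per-hop router delay, and a packet of `packet_len` bits on channels of `bandwidth` bits per cycle. The function names are illustrative.

```python
def store_and_forward_latency(hops, packet_len, bandwidth, router_delay):
    """Each hop waits for the whole packet: H * (t_r + L / b) cycles."""
    return hops * (router_delay + packet_len / bandwidth)

def cut_through_latency(hops, packet_len, bandwidth, router_delay):
    """Only the head pays the per-hop delay: H * t_r + L / b cycles."""
    return hops * router_delay + packet_len / bandwidth
```

With 4 hops, a 64-bit packet, 16 bits per cycle, and a 2-cycle router, store-and-forward takes 24 cycles while cut-through takes 12.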
Time-space Diagram: Virtual Cut-through
Wormhole
• Similar to virtual cut-through, but channel and buffers are allocated to flits rather than packets
• When the head flit arrives, it must acquire three resources before being forwarded to the next node
– A virtual channel for the packet
• State bits indicating the output channel, state of virtual channel (Idle, waiting for resources, or active), and other information
– One flit buffer
– One flit of channel bandwidth
• Body flits do not need to acquire virtual channels
– But they still need to allocate a flit buffer and channel bandwidth
• The tail flit releases the virtual channel
• Channel is owned by a packet, but buffers are allocated on a flit-by-flit basis
– When a flit cannot acquire a buffer, the channel goes idle
Time-space Diagram: Wormhole
Virtual Channel
• Associates several virtual channels with a single physical channel
– When a packet blocks, instead of holding on to physical links so others cannot use them, hold on to virtual links
• The head flit needs three resources to advance
– A virtual channel, a downstream flit buffer, and channel bandwidth
• Subsequent body flits use the same virtual channel
– But they still need to allocate a flit buffer and channel bandwidth
• However, these flits are not guaranteed access to the channel bandwidth
• Lanes on the highway
Time-space diagram: virtual-channel
• Arbitration may not be fair
– It can be winner-take-all
Link-level flow control
• Given that you can’t drop packets, how to manage the buffers? When can you send stuff forward, when not?
• Three techniques
– Credit-based: upstream router keeps a count of the number of free flit buffers in each virtual channel downstream
– On/off: a single bit indicates whether the upstream node can send or not
– Ack/nack: upstream node optimistically sends flits when they are available and downstream node sends back ack or nack
• Flit-Reservation
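The credit-based technique can be sketched as a counter kept by the upstream router for one downstream virtual channel. This is a minimal illustration; the class and method names are assumptions, and credit return delay is not modeled.

```python
class CreditCounter:
    """Upstream view of the free flit buffers in one downstream virtual channel."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots     # all downstream buffers start free

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        """Sending a flit consumes one downstream buffer."""
        if self.credits == 0:
            raise RuntimeError("no free downstream buffer")
        self.credits -= 1

    def receive_credit(self):
        """Downstream freed a buffer and returned a credit."""
        self.credits += 1
```

Because a flit is only sent when a credit is available, the downstream buffer can never overflow and no flit needs to be dropped.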
Link-level flow control
[Figure: short links versus long links with several flits on the wire; source and destination connected by Data, Req, and Ready/Ack signals, with F/E flags]
Buffer turnaround time
[Figure: timeline of credit delay, wire delay, pipeline delay, and buffer use between hold and release]
Buffer turnaround time
A flit leaves the downstream node.
A credit is sent to the current node.
The credit is processed and a flit is sent to the downstream node.
Flit-reservation flow control
• Hides the overhead by separating the control and data networks
– Control flits race ahead to reserve network resources
– Can also streamline the delivery of credits
• Allows zero buffer turnaround time
– Not always possible to reserve resources
• The control head flit is similar to a typical head flit, but with an additional field that gives the time offset to the first data flit
– Routing node knows when the data flit will arrive, and starts to prepare buffer now
Router (switch) microarchitecture: What’s in a router?
• It’s a system as well
– Logic – State machines, Arbiters, Allocators
• Control the movement through router
• Idle, Routing, Waiting for resources, Active
– Memory – Buffers
• Store flits before forwarding them
• SRAMs, registers, processor memory
– Communication – Switches
• Transfer flits from input to output ports
Typical Router Design
[Figure: input ports (receiver, input buffer), cross-bar, output ports (output buffer, transmitter), and control (routing, scheduling)]
Router Components
• Output ports
– Transmitter (typically drives clock and data)
• Input ports
– Synchronizer aligns the data signal with the local clock domain
– Essentially a FIFO buffer
• Crossbar
– Connects each input to any output
– Degree limited by area or pinout
• Buffering
• Control logic
– Complexity depends on routing logic and scheduling algorithm
– Determine output port for each incoming packet
Buffer Organizations
• Input buffers
– Buffering at each input port, stores flits till they get to leave through switch to next hop
• Central buffers
– A central memory shared among every port
– Functions as switch as well
• Output buffers
– Flits flow right through to output port
– Highest throughput, no head-of-line blocking
Input Buffered Router
• Independent routing logic per input
– FSM
• Scheduler logic arbitrates each output
– Priority, FIFO, or random
• Head-of-line blocking problem
– If an earlier flit loses arbitration, the later flits behind it are held in the buffer
[Figure: input-buffered router with input ports, routing logic R0 to R3, cross-bar, scheduling, and output ports]
Output Buffered Router
• Commit to output - limited adaptivity
• Switch has to handle input line speeds
[Figure: output-buffered router with input ports, control, and output ports R0 to R3]
Virtual-channel Router
• Packet – head, body, tail flits
• Head
– Routing → output port
– Request and arbitrate for next VC
– Request and arbitrate for switch path
– Request and arbitrate for buffer
– Traverse switch
• Body
– Request and arbitrate for switch path
– Request and arbitrate for buffer
– Traverse switch
• Tail
– Request and arbitrate for switch path
– Request and arbitrate for buffer
– Traverse switch
– Release switch path
State machines
• Control the state of the router
• Each input channel
– G: Global State: is it idle? routing? waiting for VC? buffer?
– R: Output port
• Filled by routing
– O: Output VC
• Filled by VC allocation
– P: Head and tail queue pointers
– C: Credits
• Each output channel
– G: Global state: Idle? Active? Waiting for credits?
– I: Input VC that is sending flits to this output port
– C: Credit count
Pipelining of a typical virtual channel router
Cycle         1   2   3   4   5   6   7
Head flit     RC  VA  SA  ST
Body flit 1               SA  ST
Body flit 2                   SA  ST
Tail flit                         SA  ST

• Cycle 0: Head flit arrives. G will change to R on the next cycle
• Cycle 1: RC (routing computation). R and G (=V) will be updated on the next cycle
• Cycle 2: VA (virtual channel allocation). On the next cycle, O and G (=A) will be updated. The state of the output channel will be updated
• Cycle 3: SA (switch allocation)
• Cycle 4: ST (switch traversal)
Output arbiters
• N requesters (inputs) trying to get a single resource under contention (output)
– N:1 arbiter for each output
• Several types of arbiters
– Fixed priority arbiter
– Variable priority arbiter
• Oblivious arbiter
• Round robin arbiter
Fixed Priority Arbiter
Variable Priority Arbiter
• A one-hot priority signal p selects the highest priority
– Only one of the bits of p is set
Variable Priority Arbiters
• Oblivious
– Not dependent on previous grants or requests
– Rotating priorities
– Random priorities
Variable Priority Arbiters
• Round robin
– Request that was last served should have lowest priority
– Serve all other requests first before returning to this requestor
– If a grant is issued this cycle, the request next to the one receiving the grant will have the highest priority on the next cycle
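The round-robin rule can be sketched in software as follows. This is a behavioral illustration, not a gate-level design; the class name and the `highest` field (the index holding top priority this cycle) are assumptions.

```python
class RoundRobinArbiter:
    """N:1 arbiter where the last requester served gets the lowest priority."""

    def __init__(self, n):
        self.n = n
        self.highest = 0                 # index with highest priority this cycle

    def arbitrate(self, requests):
        """Return the granted index, or None if nobody is requesting."""
        for offset in range(self.n):
            i = (self.highest + offset) % self.n
            if requests[i]:
                # The request next to the winner has top priority next cycle.
                self.highest = (i + 1) % self.n
                return i
        return None                      # no grant: priority is unchanged
```

With inputs 0 and 2 requesting continuously, the grants alternate between them rather than always favoring input 0.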
Allocators
• NxM allocator: N requestors fighting for M resources
• Results:
– A grant can be asserted only if the corresponding request is asserted
– At most one grant for each input may be asserted
– At most one grant for each resource may be asserted
Allocators In Routers
• VC Allocator
– Input VCs request a range of output VCs
– E.g. a packet of VC0 arrives at East input port. It’s destined for west output port, and would like to get any of the VCs of that output port.
• Switch Allocator
– Input VCs of an input port request different output ports (e.g. One’s going North, another’s going West)
Simplest Allocators: Separable
• Approximate with two stages of arbitration
– One on inputs, one on outputs. They can be in either order.
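A minimal sketch of the input-first variant of the two-stage approximation, assuming fixed-priority (lowest index wins) arbiters in both stages for simplicity. The request matrix layout and the name `separable_allocate` are illustrative.

```python
def separable_allocate(requests):
    """requests[i][j] is truthy if input i requests resource j.
    Returns {input: resource} with at most one grant per input and per resource."""
    # Stage 1 (input arbitration): each input picks one of its requests.
    choice = []
    for row in requests:
        picked = next((j for j, r in enumerate(row) if r), None)
        choice.append(picked)
    # Stage 2 (output arbitration): each resource grants one of the inputs
    # that picked it.
    grants = {}
    taken = set()
    for i, j in enumerate(choice):
        if j is not None and j not in taken:
            grants[i] = j
            taken.add(j)
    return grants
```

For `[[1, 1, 0], [1, 0, 0], [0, 0, 1]]`, inputs 0 and 1 both pick resource 0, so input 1 loses while resource 1 stays idle. That lost matching is why the result is only an approximation.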
Separable Allocator
Switches
• The fabric that directs flits from an input port to an output port
• Design issue: number of input and output ports, and speedups
– Speedup: the ratio of the total input bandwidth to the network’s ideal capacity (the best throughput)
• Tradeoff between cost (delay, area, power) and performance (throughput)
• Tradeoff between leaving it up to allocation or simplifying the job for allocators
Crossbar switches
Input speedup = 1
Input speedup = 2
Effect of input speedup
• With a random allocator
• Throughput is the fraction of capacity
Several flit buffer organizations
• Central
– Simple logical view
• There are actually two switches: MUX in and deMUX out
– Problems: bandwidth and latency
• Separate memory per input port
Virtual Channel (VC) Buffer Organization
• One buffer per VC
– Allows switches to access multiple VCs associated with one physical channel (PC), but leads to poor memory utilization
• Approximations:
– A small number of output ports on a single buffer
– Divide VCs among buffers
– Memory interleaving!
Alpha 21364 router
• Torus
• Virtual cut-through (316 packet buffers)
• Adaptive routing: prefer to continue in the same dimension
• Deadlock avoidance
– Coherence: Requests may fill up buffers, stalling acks (Solution: Virtual channel class, order)
– Network: Escape virtual channel
Router microarchitecture
Network Interface
• How a processor sends data to the network
• Shared memory cache-coherent multiprocessors
– Interfaces caches with networks
• Message-passing multiprocessors
– Interfaces processor pipeline with networks
• Dedicated register (or two registers)
• Register map
• Memory map
• Virtual memory map
• I/O interrupt + DMA
Cache-coherent SMP processor-network interface
• Highly optimized interface: from load/store to messages in a few cycles
• Request is placed in memory request register
– Tag: how to handle the reply, e.g., store the data in R24
– Type: cacheable or not; read or write
• Cache hit: place in reply register right away
• Cache miss: enter miss status holding register (MSHR)
– Use this to merge reads/writes as well
– Number of MSHRs == number of pending memory references (4 to 32)
Cache-coherent SMP memory-network interface
• Messages from the network initialize transaction status holding register (TSHR)
– Messages may be queued
• TSHR tracks the status of pending memory operations
Example:
For a non-cacheable read, the TSHR status changes:
Read pending (waiting for bank)
→ Bank activated (waiting for data)
→ Read complete (preparing message)
Message-passing multiprocessors: Dedicated register
• Send
– Move a value to the network out register
– Special MOV instruction for the last word to terminate the packet
• Read
– Block on the register until packet arrives, or test register and retry later
• Pros: fast
• Cons:
– Long messages: processor becoming DMA engine!
– Security: hold the register forever
Register map
• Send a message atomically from a subset of the processor’s general purpose registers
• Cons:
– Long messages have to be segmented
– Pressure on general purpose registers
– Processors are still DMA engines
I/O interface
• Most common interface today, in PCs and clusters of workstations (e.g. Infiniband, Myrinet, PCI)
• Software-level messaging:
– Interrupt triggers handler
– Handler sets up DMA
– DMA engine constructs packets from memory and sends out to network
• Physical-memory-mapped or virtual-memory-mapped
Case Study: Princeton SHRIMP
Where: I/O bus
Virtual memory mapping
Map_network(My_virtual_addr_range,Your_virtual_addr_range)
Each virtual page -> local physical page -> remote physical page -> remote virtual address
Store to these virtual addresses => network
Case Study: M-Machine Multicomputer
• Experimental multicomputer built at MIT and Stanford
– 2-D torus