Why the Network Matters

(1)

(2)

So Far …

•  Overview of Multicore Systems … Why Memory Matters

… Memory Architectures …

•  Emerging Chip Multiprocessors (CMP)

–  Increasing number of cores on a chip

–  Cache coherency → shopping list

–  Memory performance → network performance

•  Today: The Network

–  Moving data around to support “shopping list” model

•  How to connect processors to memory and its impact on performance and applications

(3)

Data Mobility

•  It’s actually all about the data, data, data.

•  No matter how fast the functional units of a system are, the performance bottleneck has always been (and will continue to be) moving data around.

•  Challenge

–  How to efficiently feed the functional units?

(4)

Granularity

•  Increasing our focus in granularity

–  Functional unit pipelines

–  Single and multicore cache hierarchies

–  Coherence to manage nondeterminism between tightly coupled

cores

•  And now … interconnection networks …

–  Cannot practically sustain bus snooping protocols in hardware

–  Use to interconnect “small” multiprocessors to form arbitrarily

(5)

Interconnection Networks

•  The network that connects processing elements together.

•  Broad applicability

–  Infrastructure in shared and distributed memory systems that tie

processors to memories and to each other.

•  Examples: (1) Distributed memory system with potentially large message sizes → SGI Altix. (2) Massively parallel collections of small processors that communicate in small amounts but frequently → GPGPU :-)


(6)

Interconnection Networks for Multicore?

•  On-chip

–  How cores are linked together.

•  Off-chip

–  How CMPs are connected to motherboard buses.

•  Recall the bi-directional circular “EIB” interconnect on

the Cell.

(7)

Design Factors

•  Economic factors for the actual hardware

•  Performance

–  Peak

–  Sustained / Actual / Practical

–  Other

(8)

Design Dimensions

•  Topology

–  The physical interconnection structure of the network

•  Routing algorithms

–  The method for choosing which route that messages take

through the network graph from source to destination

•  Switching Strategy

–  How the data in a message traverse the route

•  Flow Control

–  Determination of when a message (or portions thereof) moves

(9)

Terminology

•  Channel: A link between two nodes on the network, including buffers to hold data

•  Bandwidth: b = wf, where w is the channel width and f = 1/T is the signaling rate for a cycle time of T

•  Degree: Connectivity of a node (# channels to/from a node)

•  Route: A path through the network graph

•  Diameter: Length of the maximum shortest path between any two nodes

•  Routing Distance: Number of links traversed en route between two nodes

•  Average Distance: Average routing distance over all pairs of nodes
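
The definitions above can be checked on a small example. The sketch below is illustrative only: the adjacency-list representation, the function names, and the 4-node chain are assumptions, and the average is taken over ordered pairs of distinct nodes.

```python
from collections import deque

def hop_counts(adj, src):
    """Breadth-first search: routing distance (in links) from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_average(adj):
    """Diameter and average distance over all ordered pairs of distinct nodes."""
    all_d = [d for s in adj for t, d in hop_counts(adj, s).items() if t != s]
    return max(all_d), sum(all_d) / len(all_d)

# Hypothetical example: 4 nodes in a chain, 0 - 1 - 2 - 3.
chain4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
degree = {node: len(neighbors) for node, neighbors in chain4.items()}
diam, avg = diameter_and_average(chain4)
print(degree, diam, avg)
```

For this chain the end nodes have degree 1, the interior nodes have degree 2, the diameter is 3, and the average distance is 5/3.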

(10)

Bandwidth

•  Raw bandwidth is b = wf, where w = width and f =

frequency

•  Effective bandwidth is impacted by overhead n_E for

encapsulating a packet of size n

•  If a switch delays routing decisions by d, the bandwidth

(11)

Bisection Bandwidth

•  Multiple nodes on an interconnect send messages at the

same time. How to measure?

•  Bisection Bandwidth: Sum of the bandwidths of the minimum set of channels that, if removed, partition the network into two equal unconnected sets of nodes.

–  Value? If all nodes communicate in a uniform pattern, half the

messages will be expected to cross the bisection in each direction.
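
For very small networks the definition can be applied by brute force. The sketch below is an illustration under assumptions not in the slides: every link has the same bandwidth (`link_bw`), the node count is even, and all equal splits are enumerated, which is only practical for a handful of nodes.

```python
from itertools import combinations

def bisection_width(nodes, links):
    """Minimum number of links crossing any split of nodes into two equal halves."""
    nodes = list(nodes)
    half = len(nodes) // 2
    best = None
    for group in combinations(nodes, half):
        side = set(group)
        crossing = sum(1 for u, v in links if (u in side) != (v in side))
        if best is None or crossing < best:
            best = crossing
    return best

def chain(n):
    """Links of an n-node chain."""
    return [(i, i + 1) for i in range(n - 1)]

def ring(n):
    """Links of an n-node ring (chain plus a wraparound link)."""
    return [(i, (i + 1) % n) for i in range(n)]

link_bw = 1.0  # hypothetical bandwidth of one link, arbitrary units
chain_bisection = bisection_width(range(8), chain(8)) * link_bw
ring_bisection = bisection_width(range(8), ring(8)) * link_bw
print(chain_bisection, ring_bisection)
```

Cutting an 8-node chain in half removes one link, while cutting the ring removes two, so the wraparound link doubles the bisection bandwidth.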

(12)

Routing

•  A request between two processors must be routed in

some way, preferably in an optimal manner that minimizes hops.

•  Desirable properties

–  Simple → Low complexity, low overhead, ease of correctness (deadlock free)

(13)

Routing Strategies

•  Store-and-forward: A method typically used in LAN or WAN networks. Data is sent in packets that are received in their entirety

(14)

Routing Strategies

•  Cut-through routing: A method that reduces latency for packets to traverse a path. Think of it as network pipelining.

(15)

Store-and-Forward vs. Cut-Through Routing

•  Store-and-forward makes the routing decision only once all physical units (phits) of a packet have been received.

•  Cut-through routing makes the routing decision immediately upon receiving the first phit of the packet, and all subsequent phits “cut-through” this route.

•  Train analogy

–  What train scenario looks like store-and-forward?

(16)

Train Analogy

•  Train at a station (as a connection to the next station)

–  Store-and-forward routing: The entire train must stop before

moving on.

•  Train encountering a railroad switch

–  Cut-through routing: The first car makes the “decision” as to

(17)

Routing Strategies: Analysis

•  What’s the big deal?

•  Latency

–  Let h be the routing distance, b the bandwidth, n the size of the

message, and d the delay at each switch.

–  How to make a store-and-forward look more like a cut-through,

(18)
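
A rough way to compare the two strategies is the commonly used unloaded-latency model sketched below. The formulas are a standard textbook simplification and an assumption here, not taken from the slides: there is no contention, every link has bandwidth b, and every switch adds the same delay d. The function names and sample numbers are hypothetical.

```python
def store_and_forward_latency(n, b, h, d):
    """Each of the h hops receives the whole message (n / b) and then pays the switch delay d."""
    return h * (n / b + d)

def cut_through_latency(n, b, h, d):
    """The header pays the switch delay d at each of the h hops; the body streams behind it once."""
    return n / b + h * d

# Hypothetical numbers: 1000-unit message, 100 units per time step, 5 hops, delay 2 per switch.
n, b, h, d = 1000, 100, 5, 2
sf = store_and_forward_latency(n, b, h, d)
ct = cut_through_latency(n, b, h, d)
print(sf, ct)
```

With these numbers store-and-forward takes 60 time steps and cut-through takes 20; under this model the gap widens as the routing distance h or the message size n grows.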

(19)

The Crossbar

•  Provides the internal switching structure for the switch.

•  Non-blocking crossbar

+ Guarantees a path between each distinct input and output simultaneously in any permutation

– Costs go up quadratically. Cost of full NxN crossbar, N = # inputs = # outputs?

•  Anatomy of a fully-connected NxN crossbar?

(20)

The Crossbar

•  Provides the internal switching structure for the switch.

•  Blocking crossbar

–  Pros and cons complement the above.

–  “Degenerate” crossbar is a bus.

•  Cost of a bus-based NxN “crossbar”?

–  Multistage interconnection network (MIN)?

•  What does it look like?

•  Cost of a MIN NxN “crossbar”?
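
To get a feel for the cost questions above, the sketch below counts hardware under two assumptions that are not stated in the slides: a full crossbar needs one crosspoint per (input, output) pair, and the MIN is a butterfly- or omega-style network of log2(N) stages with N/2 two-by-two switches per stage. The function names are hypothetical.

```python
def crossbar_crosspoints(n):
    """Crosspoints in a full n x n crossbar: one per (input, output) pair."""
    return n * n

def min_stage_count(n):
    """Stages in a MIN built from 2x2 switches; n must be a power of two."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    return n.bit_length() - 1

def min_switch_count(n):
    """Total 2x2 switches: n / 2 per stage, log2(n) stages."""
    return (n // 2) * min_stage_count(n)

for n in (8, 64, 1024):
    print(n, crossbar_crosspoints(n), min_switch_count(n))
```

Under these assumptions the crossbar grows as N^2 while the MIN grows as (N/2) log2 N, which is the usual argument for multistage networks at larger port counts, at the price of possible blocking.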

(21)

Topology

•  Oftentimes infeasible to connect every processing

element to each other.

–  Example

•  Macroscale, e.g., cluster supercomputers

–  PE count: O(1,000) to O(10,000)

–  Functionally possible but very, very expensive.

»  As much as half the price of a supercomputer

•  Microscale, e.g., emerging chip multiprocessors like Cell, GPGPU

–  Larger interconnect → larger real estate required for “non-compute” entities

•  Solution: Be smart about how to connect PEs together.

(22)

Simple Topology

•  One-Dimensional Topologies

–  Chain

•  Order all N processors in a line, numbered 1 ... N, and connect processor P with processors P-1 and P+1 (where they exist).

•  Sending a message from P1 to P4 must traverse 3 links.

•  Best case? Average case? Worst case?

–  Torus (or Ring)

•  Instead of letting ends dangle, connect first

to last to form a ring.
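
The best, average, and worst cases can be tabulated directly for small N. The sketch below is a minimal illustration; the function names, the 1-based numbering, and the choice of N = 8 are assumptions, and the average is taken over all ordered pairs of distinct processors.

```python
def chain_hops(i, j):
    """Links traversed between processors i and j on a chain."""
    return abs(i - j)

def ring_hops(i, j, n):
    """Links traversed on an n-node ring, going whichever way around is shorter."""
    direct = abs(i - j)
    return min(direct, n - direct)

def summarize(hops, n):
    """Best, average, and worst hop count over all pairs of distinct processors 1..n."""
    counts = [hops(i, j) for i in range(1, n + 1) for j in range(1, n + 1) if i != j]
    return min(counts), sum(counts) / len(counts), max(counts)

n = 8
chain_stats = summarize(chain_hops, n)
ring_stats = summarize(lambda i, j: ring_hops(i, j, n), n)
print(chain_stats, ring_stats)
```

For N = 8 the chain has a worst case of 7 hops and an average of 3, while closing the ring cuts the worst case to 4 and the average to 16/7.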

(23)

(24)

The Effect of Adding Dimensions

•  Increase to two dimensions, i.e., 1-D chain → 2-D grid

–  Each side (or dimension) will have how many processors?

–  What about a k-dimensional “grid”?

•  For 2-D, connect each processor to its neighbors.

–  Up to 4 connections per processor.
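
The questions above can be explored with a small sketch. It assumes the N processors fill the grid evenly (N is a perfect k-th power) and that the grid is a mesh without wraparound links; the function names are hypothetical.

```python
def side_length(n, k):
    """Processors per side when n processors fill a k-dimensional grid evenly."""
    side = round(n ** (1.0 / k))
    if side ** k != n:
        raise ValueError("n must be a perfect k-th power")
    return side

def mesh_neighbors(coord, side):
    """Neighbors of coord in a k-D mesh (no wraparound): at most 2 per dimension."""
    result = []
    for dim, x in enumerate(coord):
        for step in (-1, 1):
            if 0 <= x + step < side:
                result.append(coord[:dim] + (x + step,) + coord[dim + 1:])
    return result

side_2d = side_length(16, 2)                       # 16 processors in a 2-D grid
corner = mesh_neighbors((0, 0), side_2d)           # corner processor
interior = mesh_neighbors((1, 1), side_2d)         # interior processor
interior_3d = mesh_neighbors((1, 1, 1), side_length(64, 3))
print(side_2d, len(corner), len(interior), len(interior_3d))
```

A 16-processor 2-D grid has 4 processors per side; an interior processor has 4 neighbors and a corner has 2. In a 3-D mesh an interior processor has 6, i.e., up to 2k connections in k dimensions.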

(25)

(26)

Higher-Dimensional Meshes and Tori

•  Keep playing this “trick” of embedding processors into grids of increasing dimensionality.

•  Key Observation

–  Each time the dimension is increased, the # of point-to-point

connections for each processor increases.

•  Generalization

–  The # of point-to-point connections per node within a

(27)

Hypercubes

•  A d-dimensional hypercube has 2^d corners, each of which is an endpoint for d edges.

•  Such interconnection networks were the rage of the 1980s and 1990s

•  Pros and Cons?

(28)

Trees

•  Another topology for attacking the hop count problem …

•  Hop distance is logarithmic. Yay!

•  Bisection bandwidth is O(1) due to single critical node at

(29)

Butterflies

“Extend the ‘tree’ with butterflies …”

•  Takes same logarithmic-depth approach but with multiple

roots.

•  Can be built out of basic 2x2 switches.

(30)

Butterflies

“Extend the ‘tree’ with butterflies …”

•  Pro

–  Natural correspondence to algorithmic structures, e.g., Fast

Fourier Transform (FFT) and sorting networks.

•  Con

–  Cost of short diameter (logarithmic) and bisection (N/2) is $$$$.

(31)

Butterflies



Fat Trees

•  Butterflies are related to another topology encountered in

practice – fat trees – particularly in large cluster supercomputers.

(32)

* : d = dimension

(33)

Topologies and Routing

•  Topologies with regular structure have simple routing

algorithms.

•  Example: Hypercube (2-D and 3-D)

–  Simple labeling of nodes with the binary encoding of the number

(34)

Connectivity and Routing: Hypercube

•  Connectivity: A matter of edges between nodes that

differ by exactly one bit.

•  Routing: A to B must traverse the dimensions that have

bits on in XOR(A, B).
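
Both rules translate almost directly into bit operations. The sketch below is one possible illustration: node labels are integers whose binary encoding gives the hypercube coordinates, and correcting the differing bits from the lowest dimension to the highest is an assumed ordering (any order of the differing dimensions gives a path of the same length). The function names are hypothetical.

```python
def hypercube_neighbors(node, d):
    """Nodes adjacent to node in a d-dimensional hypercube: flip exactly one bit."""
    return [node ^ (1 << bit) for bit in range(d)]

def hypercube_route(a, b, d):
    """Path from a to b, correcting the bits set in a XOR b from lowest to highest dimension."""
    path = [a]
    diff = a ^ b
    current = a
    for bit in range(d):
        if diff & (1 << bit):
            current ^= 1 << bit
            path.append(current)
    return path

neighbors_of_000 = hypercube_neighbors(0b000, 3)
route = hypercube_route(0b000, 0b101, 3)
print(neighbors_of_000, [format(x, "03b") for x in route])
```

Routing from 000 to 101 crosses dimensions 0 and 2, so the path is 000, 001, 101, and the hop count equals the number of 1 bits in XOR(A, B).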

(35)


(36)

Routing Algorithms

•  Key Insight

–  Build algorithms that take advantage of intrinsic properties of

topology.

•  Other Considerations

–  Minimize hop counts

–  Minimize data transmissions

•  What happens when link to root (in a tree) goes down due to heat?

•  Consider a torus-based network where each processor

holds a set of numbers.

•  Goal: Compute the sum of all numbers and store the

(37)

Global Sum on a Torus

1.  Each processor computes sum of local data.

2.  Each processor sends its sum to its left neighbor. Sum of neighbor is added to local sum. This new partial sum is passed to the left.

3.  After sqrt(P) steps, the partial sum along one dimension

(i.e., row) returns to each processor.

4.  Repeat 1-3 but along the other dimension (i.e., column).

•  Total time for data set of size N split over P processors?
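
The steps above can be simulated sequentially to check that every processor ends up with the global sum. The sketch below is one interpretation, not necessarily the exact scheme on the slide: to avoid double counting, each processor forwards the value it most recently received while accumulating its own running total, which takes sqrt(P) - 1 shifts per dimension. The function names and data layout are hypothetical.

```python
def ring_allreduce(values):
    """Sum around a ring: each node forwards the value it last received to its left neighbor."""
    n = len(values)
    total = list(values)      # running sum held by each node
    in_flight = list(values)  # value each node sends next
    for _ in range(n - 1):
        # Node i receives from its right neighbor (i + 1) % n.
        in_flight = [in_flight[(i + 1) % n] for i in range(n)]
        total = [t + v for t, v in zip(total, in_flight)]
    return total

def torus_global_sum(local_data):
    """local_data[r][c] is the list of numbers held by processor (r, c) on a square torus."""
    side = len(local_data)
    # Step 1: each processor sums its local data.
    sums = [[sum(cell) for cell in row] for row in local_data]
    # Steps 2-3: ring reduction along each row.
    sums = [ring_allreduce(row) for row in sums]
    # Step 4: the same reduction along each column.
    cols = [ring_allreduce([sums[r][c] for r in range(side)]) for c in range(side)]
    return [[cols[c][r] for c in range(side)] for r in range(side)]

data = [[[1, 2], [3]],
        [[4], [5, 6]]]
result = torus_global_sum(data)
print(result)
```

On this 2 x 2 example every processor finishes holding 21, the sum of all six numbers.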

(38)

Considerations

•  Faster than sequential?

•  Local sums obviously faster.

–  Concurrently compute the partial sums of N/P elements faster

than any one processor can compute the sum of all N elements.

•  Problem?

–  Interconnect overhead to execute the 2 * sqrt(P) transmissions

may be quite high relative to the computing capability of each processor.
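
A back-of-the-envelope cost model makes the trade-off concrete. The sketch below is an assumption-laden illustration: `t_add` (time for one addition) and `t_msg` (time for one small message) are hypothetical parameters, each of the 2 * sqrt(P) transmissions is charged one message plus one addition, and P is assumed to be a perfect square.

```python
import math

def sequential_time(n, t_add):
    """One processor adds all n numbers."""
    return n * t_add

def torus_sum_time(n, p, t_add, t_msg):
    """Local sums of n / p elements, then 2 * sqrt(p) transmissions, each followed by an add."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square")
    local = (n / p) * t_add
    communication = 2 * side * (t_msg + t_add)
    return local + communication

# Hypothetical machine: a message costs 1000 times as much as an addition.
big = torus_sum_time(1_000_000, 16, 1.0, 1000.0)
small = torus_sum_time(16, 16, 1.0, 1000.0)
print(sequential_time(1_000_000, 1.0), big)
print(sequential_time(16, 1.0), small)
```

With these made-up costs, a million numbers on 16 processors takes about 70,508 time units instead of 1,000,000, but summing only 16 numbers takes 8,009 units in parallel versus 16 sequentially: the interconnect overhead swamps the computation.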

(39)

Performance: Machine Balance

•  Last example refers to the need to “balance a machine or

algorithm” …

•  Quantity that we are tuning? “Surface” (communication)

to “volume” (computation) ratio.

•  Performance factors to consider … performance profiling

–  Time to compute a local sum over a local data set.

–  Time to send a single small message over the interconnect.

•  Performance profiling will come into play when using the

CPU vs. CPU+GPGPU, e.g., adding a grid of 16 numbers on a quad-core CPU vs. CPU+GPGPU.

(40)

Architectural Aspects

•  Currently, caches are still a key performance enhancement for multiprocessor systems (just like single CPU systems).

•  Caches require some additional logic to make them continue to function and provide determinism in the main memory of a compute node.

•  Coherence protocols and any other form of data transport between cores require an interconnection network.

–  At scale, all-to-all bus-like structures are infeasible.

–  Solution: Novel topologies that sacrifice peak performance (avg latency, bandwidth, contention characteristics, etc.) for the economic (and physical) factors underlying their design and manufacturing.

(41)

Multicore Considerations

•  Interconnection networks are constrained more in the

multicore context than in the large-scale SMP world. Why?

•  But AMD Barcelona quad-core processor utilizes 11 Cu

layers.

–  Relative to the # of transistors in the two planar dimensions of the processor, the CPU remains for all intents and purposes … flat.

–  Cramming a sophisticated interconnection network that is not planar into a limited number of layers is quite hard. (Caveat: Proximity interconnect.) Thus, there is a limitation on the type of interconnect on-chip.

(42)

Multicore at Scale

•  Life becoming more interesting as core counts continue

to increase.

–  Intel Terascale Chip: 80 cores

–  Tilera: “Reconfigurable” 64 cores based on a 2-D mesh

topology.

–  AMD/ATi HD 4870: 800 cores

–  NVIDIA GeForce GTX 280: 240 cores. Why not 256? :-)

•  In the not-too-distant future, interconnect topology will be back “in vogue” for parallel computing …

(44)

Parallel Software: Correctness & Performance


(45)

Correctness

•  Hardest aspect of parallel algorithm design and parallel

programming? Writing programs that are “correct” …

–  What good is a program that generates wrong answers faster? 

•  What do we mean by correctness?

–  Traditionally, proving that a given algorithm produced the output

that is desired.

–  Example: Prim’s algorithm produces a minimum spanning tree.

–  Correctness means that the tree produced by Prim’s algorithm is

(46)

Underlying Assumption

•  Traditional algorithms take the following for granted:

–  The machine is deterministic.

–  Only one flow of control is active at any given time.

•  Nondeterminism only comes into play in a purely theoretical sense when talking about automata theory, NFAs vs. DFAs, and P vs. NP.

•  This is not the sort of nondeterminism that we are talking about here. What are we talking about?

•  When two uncoordinated flows of control interact with each other, there is no guarantee, without explicit guidance, that the relative effects and interactions of the multiple threads of control will happen in a predictable order.
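
A small simulation shows why the order matters. The sketch below does not use real threads; it enumerates every possible interleaving of two hypothetical flows of control, each of which reads a shared counter and then writes back the value plus one. The step names and helper functions are illustrative assumptions.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute a schedule of (thread, op) steps on a shared counter starting at 0."""
    shared = 0
    private = {}
    for thread, op in schedule:
        if op == "read":
            private[thread] = shared      # copy the shared value into a private register
        else:                             # "write"
            shared = private[thread] + 1  # write back the incremented private copy
    return shared

thread_a = [("A", "read"), ("A", "write")]
thread_b = [("B", "read"), ("B", "write")]
outcomes = {run(s) for s in interleavings(thread_a, thread_b)}
print(outcomes)
```

Of the six possible schedules, only the two in which one flow finishes before the other starts produce 2; the rest lose an update and produce 1. Explicit coordination (for example, a lock around the read-modify-write) restricts execution to the schedules that give the intended answer.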

(47)

Performance

•  The “holy grail” of parallel computing …

•  A parallel program should run at least as fast as the

sequential equivalent for a fixed input size.

–  One may use parallelism to increase the volume that can be

computed, in which case, comparisons of time are not as important. (Weak scaling)

(48)

Performance and Correctness

(or Correctness and Performance?)

•  Performance and correctness are often intimately

coupled.

–  Without protections in place, a program can run very quickly but

suffer from severe correctness problems.

–  Very conservative decisions can be made to ensure correctness

but at the cost of significant performance degradation. Example of this?

•  Other performance factors (unrelated to logic flow in

place) to maintain determinism and correctness.

–  Example: Granularity of computation and communication can be