Copyright © 2009 by W. Feng. Based on material from M. Sottile.
So Far …
• Overview of Multicore Systems … Why Memory Matters
… Memory Architectures …
• Emerging Chip Multiprocessors (CMP)
– Increasing number of cores on a chip
– Cache coherency shopping list
– Memory performance → network performance
• Today: The Network
– Moving data around to support “shopping list” model
• How to connect processors to memory and its impact on
performance and applications
Data Mobility
• It’s actually all about the data, data, data.
• No matter how fast the functional units of a system, the
performance bottleneck has always been (and will continue to be) moving data around.
• Challenge
– How to efficiently feed the functional units?
Granularity
• Increasing our focus in granularity
– Functional unit pipelines
– Single and multicore cache hierarchies
– Coherence to manage nondeterminism between tightly coupled
cores
• And now … interconnection networks …
– Cannot practically sustain bus snooping protocols in hardware
– Use to interconnect “small” multiprocessors to form arbitrarily
Interconnection Networks
• The network that connects processing elements together.
• Broad applicability
– Infrastructure in shared and distributed memory systems that tie
processors to memories and to each other.
• Examples: (1) Distributed memory system with potentially large
message sizes → SGI Altix. (2) Massively parallel collections of small processors that communicate in small amounts but frequently → GPGPU :-)
Interconnection Networks for Multicore?
• On-chip
– How cores are linked together.
• Off-chip
– How CMPs are connected to motherboard buses.
• Recall the bi-directional circular “EIB” interconnect on
the Cell.
Design Factors
• Economic factors for the actual hardware
• Performance
– Peak
– Sustained / Actual / Practical
– Other
Design Dimensions
• Topology
– The physical interconnection structure of the network
• Routing algorithms
– The method for choosing which route messages take
through the network graph from source to destination
• Switching Strategy
– How the data in a message traverse the route
• Flow Control
– Determination of when a message (or portions thereof) moves
Terminology
• Channel: A link between two nodes on the network, including buffers to hold data
• Bandwidth: b = wf, where w is the channel width and f = 1/T is the signaling rate for a cycle time of T
• Degree: Connectivity of a node (# channels to/from a node)
• Route: A path through the network graph
• Diameter: Length of the maximum shortest path between any two nodes
• Routing Distance: Number of links traversed en route between two nodes
• Average Distance: Average routing distance over all pairs of nodes
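The terms above can be computed directly for a small network. The following is a minimal sketch, assuming an 8-node ring as the example network; the helper names (ring, hop_counts, degree, diameter, average_distance) are illustrative and not from the lecture.

```python
from collections import deque

def ring(n):
    """Adjacency list of an n-node ring (hypothetical example network)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def hop_counts(adj, src):
    """Breadth-first search: routing distance (links) from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree(adj, node):
    """Number of channels attached to a node."""
    return len(adj[node])

def diameter(adj):
    """Longest of the shortest paths over all pairs of nodes."""
    return max(max(hop_counts(adj, s).values()) for s in adj)

def average_distance(adj):
    """Mean routing distance over all ordered pairs of distinct nodes."""
    n = len(adj)
    total = sum(sum(hop_counts(adj, s).values()) for s in adj)
    return total / (n * (n - 1))

net = ring(8)
print(degree(net, 0), diameter(net), average_distance(net))
```

Swapping in a different adjacency list (chain, mesh, hypercube) lets the same three functions compare topologies.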
Bandwidth
• Raw bandwidth is b = wf, where w = width and f =
frequency
• Effective bandwidth is impacted by the overhead n_E for
encapsulating a packet of size n
• If a switch delays routing decisions by d, the bandwidth
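A minimal sketch of raw versus effective bandwidth, assuming one common model in which a packet carries n units of payload plus n_E units of overhead, so only the fraction n / (n + n_E) of b is useful. The function names and sample numbers are illustrative, and the switch delay d is deliberately not modeled here.

```python
def raw_bandwidth(w, f):
    """Raw channel bandwidth b = w * f (width times signaling rate)."""
    return w * f

def effective_bandwidth(w, f, n, n_e):
    """Payload bandwidth when each n-unit packet carries n_e units of overhead.

    Assumed model: the channel is busy for n + n_e units per packet,
    of which only n are payload.
    """
    return raw_bandwidth(w, f) * n / (n + n_e)

# Hypothetical channel: 16 bits wide at 1 GHz, 1024-bit payload, 64-bit overhead.
print(raw_bandwidth(16, 1e9))
print(effective_bandwidth(16, 1e9, 1024, 64))
```

Under this model, small packets pay proportionally more: when n equals n_E, half of the raw bandwidth goes to overhead.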
Bisection Bandwidth
• Multiple nodes on an interconnect send messages at the
same time. How to measure?
• Bisection Bandwidth: Sum of the bandwidths of the minimum set of channels that, if removed, partition the network into two equal unconnected sets of nodes.
– Value? If all nodes communicate in a uniform pattern, half the
messages will be expected to cross the bisection in each direction.
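For small networks the bisection can be found by brute force. The sketch below assumes an even number of nodes labeled 0 … N-1 and counts cut links; multiplying the count by the per-channel bandwidth b gives the bisection bandwidth. The function name and the three example networks are illustrative.

```python
from itertools import combinations

def bisection_width(num_nodes, edges):
    """Fewest links cut by any split of the nodes into two equal halves."""
    if num_nodes % 2:
        raise ValueError("this sketch assumes an even number of nodes")
    best = None
    # Node 0 is pinned to one side so each split is examined only once.
    for rest in combinations(range(1, num_nodes), num_nodes // 2 - 1):
        side = {0, *rest}
        cut = sum(1 for u, v in edges if (u in side) != (v in side))
        if best is None or cut < best:
            best = cut
    return best

# Three 8-node example networks.
chain = [(i, i + 1) for i in range(7)]
ring = chain + [(7, 0)]
cube = [(u, u ^ (1 << k)) for u in range(8) for k in range(3)
        if u < (u ^ (1 << k))]

print(bisection_width(8, chain), bisection_width(8, ring), bisection_width(8, cube))
```

The search is exponential in the number of nodes, so it is only useful for checking intuition on toy networks.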
Routing
• A request between two processors must be routed in
some way, preferably in an optimal manner that minimizes hops.
• Desirable properties
– Simple → low complexity, low overhead, ease of correctness
(deadlock free)
Routing Strategies
• Store-and-forward: A method typically used in LAN or WAN networks. Data is sent in packets that are received in their entirety
Routing Strategies
• Cut-through routing: A method that reduces latency for packets to traverse a path. Think of it as network pipelining.
Store-and-Forward vs. Cut-Through Routing
• Store-and-forward makes the routing decision only when
all phits of a packet have been received.
• Cut-through routing makes the routing decision
immediately upon receiving the first physical unit (phit) of the packet, and all subsequent phits “cut through” along this route.
• Train analogy
– What train scenario looks like store-and-forward?
Train Analogy
• Train at a station (as a connection to the next station)
– Store-and-forward routing: The entire train must stop before
moving on.
• Train encountering a railroad switch
– Cut-through routing: The first car makes the “decision” as to
Routing Strategies: Analysis
• What’s the big deal?
• Latency
– Let h be the routing distance, b the bandwidth, n the size of the
message, and d the delay at each switch.
– How to make a store-and-forward look more like a cut-through,
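A sketch of the latency comparison using h, b, n, and d as defined above. It assumes the usual first-order model: store-and-forward pays the full transfer time n/b plus the switch delay d at every hop, while cut-through pays the transfer time once and only d per hop. Contention and header overhead are ignored, and the function names are illustrative.

```python
def store_and_forward_latency(h, n, b, d):
    """Each of the h hops receives the whole message (n / b) and then decides (d)."""
    return h * (n / b + d)

def cut_through_latency(h, n, b, d):
    """The message is pipelined: transfer time is paid once, switch delay per hop."""
    return n / b + h * d

# Hypothetical numbers: 4 hops, 1024-byte message, 1024 bytes per microsecond,
# 0.5 microsecond delay per switch.
print(store_and_forward_latency(4, 1024, 1024, 0.5))
print(cut_through_latency(4, 1024, 1024, 0.5))
```

In this model the two strategies coincide for a single hop, and the gap grows with both the routing distance and the message size.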
The Crossbar
• Provides the internal switching structure for the switch.
• Non-blocking crossbar
+ Guarantees a path between each distinct input and output simultaneously in any permutation
– Costs go up quadratically. Cost of full NxN crossbar, N = # inputs = # outputs?
• Anatomy of a fully-connected NxN crossbar?
• Blocking crossbar
– Pros and cons complement the above.
– “Degenerate” crossbar is a bus.
• Cost of a bus-based NxN “crossbar”?
– Multistage interconnection network (MIN)?
• What does it look like?
• Cost of a MIN NxN “crossbar”?
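One way to put numbers on the cost questions above, as a sketch under stated assumptions: a full crossbar needs one crosspoint per input/output pair, a bus needs one tap per port, and the MIN is taken to be a butterfly/omega-style network of 2x2 switches with log2(N) stages of N/2 switches each (N a power of two). The function names are illustrative.

```python
def crossbar_crosspoints(n):
    """Full N x N crossbar: one crosspoint for every input/output pair."""
    return n * n

def bus_taps(n):
    """Bus as a degenerate crossbar: one shared medium with one tap per port."""
    return n

def min_switches(n):
    """Assumed MIN: log2(N) stages, each with N/2 two-by-two switches."""
    stages = n.bit_length() - 1
    if n < 2 or (1 << stages) != n:
        raise ValueError("this sketch assumes N is a power of two")
    return stages * (n // 2)

for n in (8, 64, 1024):
    print(n, crossbar_crosspoints(n), bus_taps(n), min_switches(n))
```

The trade-off is the usual one: the bus is cheapest but serializes all traffic, the crossbar is non-blocking but quadratic, and the MIN sits in between at O(N log N) switches.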
Topology
• It is often infeasible to connect every processing
element to every other.
– Example
• Macroscale, e.g., cluster supercomputers
– PE count: O(1,000) to O(10,000)
– Functionally possible but very, very expensive.
» As much as half the price of a supercomputer
• Microscale, e.g., emerging chip multiprocessors like Cell, GPGPU
– Larger interconnect → larger real estate required for
“non-compute” entities
• Solution: Be smart about how to connect PEs together.
Simple Topology
• One-Dimensional Topologies
– Chain
• Order all N processors in a line, numbered 1 ...
N, and connect processor P with processors P-1 and P+1 (where they exist)
• Sending a message from P1 to P4 must
traverse 3 links.
• Best case? Average case? Worst case?
– Torus (or Ring)
• Instead of letting ends dangle, connect first
to last to form a ring.
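The best/average/worst question can be explored numerically. A minimal sketch, assuming processors numbered 1 … N as above and shortest-path routing; the function names are illustrative.

```python
def chain_distance(i, j):
    """Links traversed between processors i and j on a chain."""
    return abs(i - j)

def ring_distance(i, j, n):
    """Links traversed on an n-processor ring, going the shorter way around."""
    d = abs(i - j)
    return min(d, n - d)

def worst_and_average(distance, n):
    """Worst-case and mean hop count over all ordered pairs of distinct processors."""
    hops = [distance(i, j)
            for i in range(1, n + 1)
            for j in range(1, n + 1) if i != j]
    return max(hops), sum(hops) / len(hops)

n = 8
print(chain_distance(1, 4))
print(worst_and_average(chain_distance, n))
print(worst_and_average(lambda i, j: ring_distance(i, j, n), n))
```

Closing the chain into a ring roughly halves the worst case (N-1 links versus about N/2) at the cost of a single extra link; the best case is one link in both.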
The Effect of Adding Dimensions
• Increase to two dimensions, i.e., 1-D chain → 2-D grid
– Each side (or dimension) will have how many processors?
– What about a k-dimensional “grid”?
• For 2-D, connect each processor to its neighbors.
– Up to 4 connections per processor.
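A small sketch of the 2-D case, assuming P processors arranged in a square grid so that each side holds P^(1/2) processors (P^(1/k) in k dimensions). The function names are illustrative; the torus variant simply wraps the edges around.

```python
def side_length(p, k):
    """Processors per side when p processors form a k-dimensional grid."""
    side = round(p ** (1.0 / k))
    if side ** k != p:
        raise ValueError("p is not a perfect k-th power")
    return side

def mesh_neighbors(row, col, side):
    """Neighbors in a 2-D mesh: up to 4, fewer on edges and corners."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < side and 0 <= c < side]

def torus_neighbors(row, col, side):
    """Neighbors in a 2-D torus: always 4, because the edges wrap around."""
    return [((row - 1) % side, col), ((row + 1) % side, col),
            (row, (col - 1) % side), (row, (col + 1) % side)]

side = side_length(16, 2)
print(side, mesh_neighbors(0, 0, side), torus_neighbors(0, 0, side))
```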
Higher-Dimensional Meshes and Tori
• Keep playing this “trick” of embedding processors into
grids of increasing dimensionality.
• Key Observation
– Each time the dimension is increased, the # of point-to-point
connections for each processor increases.
• Generalization
– The # of point-to-point connections per node within a
Hypercubes
• A d-dimensional hypercube has 2^d corners, each of which
is an endpoint for d edges.
• Such interconnection networks were the rage of the
1980s and 1990s.
• Pros and Cons?
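The corner and edge counts can be checked with a short sketch. It assumes the standard labeling in which nodes are the integers 0 … 2^d - 1 and two nodes are neighbors when their labels differ in exactly one bit; the function names are illustrative.

```python
def hypercube_neighbors(node, d):
    """The d neighbors of a node: flip each of its d label bits in turn."""
    return [node ^ (1 << k) for k in range(d)]

def hypercube_edges(d):
    """Every edge of the d-dimensional hypercube, listed once."""
    return [(u, v)
            for u in range(2 ** d)
            for v in hypercube_neighbors(u, d) if u < v]

for d in (2, 3, 4):
    print(d, 2 ** d, len(hypercube_edges(d)))
```

The edge count grows as d * 2^(d-1), which hints at one of the cons: the degree of every node, and hence the per-node wiring, grows with the size of the machine.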
Trees
• Another topology for attacking the hop count problem …
• Hop distance is logarithmic. Yay!
• Bisection bandwidth is O(1) due to single critical node at
Butterflies
“Extend the ‘tree’ with butterflies …”
• Takes same logarithmic-depth approach but with multiple
roots.
• Can be built out of basic 2x2 switches.
• Pro
– Natural correspondence to algorithmic structures, e.g., Fast
Fourier Transform (FFT) and sorting networks.
• Con
– Cost of short diameter (logarithmic) and bisection (N/2) is $$$$.
Fat Trees
• Butterflies are related to another topology encountered in
practice – fat trees – particularly in large cluster supercomputers.
Topologies and Routing
• Topologies with regular structure have simple routing
algorithms.
• Example: Hypercube (2-D and 3-D)
– Simple labeling of nodes with the binary encoding of the number
Connectivity and Routing: Hypercube
• Connectivity: A matter of edges between nodes that
differ by exactly one bit.
• Routing: A message from A to B must traverse each dimension
whose bit is set in XOR(A, B).
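The XOR rule translates almost directly into code. The sketch below corrects the differing bits from lowest to highest, which is one possible order (often called dimension-order or e-cube routing); any order of the same dimensions yields a path of the same length. The function name is illustrative.

```python
def hypercube_route(src, dst, d):
    """Nodes visited from src to dst in a d-dimensional hypercube."""
    path = [src]
    current = src
    differing = src ^ dst          # a set bit marks a dimension to traverse
    for k in range(d):
        if differing & (1 << k):
            current ^= 1 << k      # move along dimension k
            path.append(current)
    return path

route = hypercube_route(0b000, 0b101, 3)
print([format(node, "03b") for node in route])
```

The hop count equals the number of set bits in XOR(A, B), so it never exceeds d.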
Routing Algorithms
• Key Insight
– Build algorithms that take advantage of intrinsic properties of
topology.
• Other Considerations
– Minimize hop counts
– Minimize data transmissions
• What happens when the link to the root (in a tree) goes down due to heat?
• Consider a torus-based network where each processor
holds a set of numbers.
• Goal: Compute the sum of all numbers and store the
Global Sum on a Torus
1. Each processor computes sum of local data.
2. Each processor sends its sum to its left neighbor. The sum
received from the neighbor is added to the local sum. This new partial sum is passed to the left.
3. After sqrt(P) steps, the partial sum along one dimension
(i.e., row) returns to each processor.
4. Repeat 2-3 but along the other dimension (i.e., column), starting from the row sums.
• Total time for data set of size N split over P processors?
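The steps can be simulated sequentially to check that every processor ends up with the global sum. This is a sketch under assumptions, not the lecture's implementation: the torus is a sqrt(P) x sqrt(P) grid of Python lists, all processors exchange in lockstep, and each processor forwards the value it just received (rather than its running total) so that no contribution is counted twice. That takes sqrt(P) - 1 exchanges per dimension, the same order as the sqrt(P) steps counted above.

```python
def torus_global_sum(local_data):
    """local_data[r][c] is the list of numbers held by processor (r, c)."""
    side = len(local_data)
    # Step 1: every processor sums its own data.
    total = [[sum(values) for values in row] for row in local_data]

    for along_row in (True, False):       # rows first, then columns
        outgoing = [row[:] for row in total]
        for _ in range(side - 1):
            incoming = [[0] * side for _ in range(side)]
            for r in range(side):
                for c in range(side):
                    # (r, c) receives what its right (or lower) neighbor sent.
                    if along_row:
                        incoming[r][c] = outgoing[r][(c + 1) % side]
                    else:
                        incoming[r][c] = outgoing[(r + 1) % side][c]
            for r in range(side):
                for c in range(side):
                    total[r][c] += incoming[r][c]
            outgoing = incoming           # forward the received value next
    return total

grid = [[[1, 2], [3]],
        [[4], [5, 6]]]
print(torus_global_sum(grid))
```

After the row phase every processor holds its row total; the column phase then combines the row totals, so each processor sends 2 * (sqrt(P) - 1) messages in all.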
Considerations
• Faster than sequential?
• Local sums obviously faster.
– Concurrently compute the partial sums of N/P elements faster
than any one processor can compute the sum of all N elements.
• Problem?
– Interconnect overhead to execute the 2 * sqrt(P) transmissions
may be quite high relative to the computing capability of each processor.
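A back-of-the-envelope cost model makes the trade-off concrete. It is only a sketch: t_add (time per addition) and t_msg (time per message) are hypothetical parameters, the additions performed during the exchange steps are ignored, and the 2 * sqrt(P) message count is taken from the text above.

```python
import math

def sequential_time(n, t_add):
    """One processor adds all n numbers."""
    return n * t_add

def parallel_time(n, p, t_add, t_msg):
    """Local sums over n/p numbers, then 2 * sqrt(p) message transmissions."""
    return (n / p) * t_add + 2 * math.sqrt(p) * t_msg

# Hypothetical costs: 1 ns per addition, 1000 ns per message, P = 16.
for n in (16, 10 ** 9):
    seq = sequential_time(n, 1)
    par = parallel_time(n, 16, 1, 1000)
    print(n, seq, par, seq / par)
```

With these numbers, summing 16 values in parallel is hundreds of times slower than doing it sequentially, while summing a billion values approaches the ideal 16x speedup.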
Performance: Machine Balance
• Last example refers to the need to “balance a machine or
algorithm” …
• Quantity that we are tuning? “Surface” (communication)
to “volume” (computation) ratio.
• Performance factors to consider … performance profiling
– Time to compute a local sum over a local data set.
– Time to send a single small message over the interconnect.
• Performance profiling will come into play when using the
CPU vs. CPU+GPGPU, e.g., adding a grid of 16 numbers on a quad-core CPU vs. CPU+GPGPU.
Architectural Aspects
• Currently, caches are still a key performance enhancement for
multiprocessor systems (just like single CPU systems).
• Caches require some additional logic to make them
continue to function and provide determinism in the main memory of a compute node.
• Coherence protocols and any other form of data transport
between cores require an interconnection network.
– At scale, all-to-all bus-like structures are infeasible.
– Solution: Novel topologies that sacrifice peak performance (avg
latency, bandwidth, contention characteristics, etc.) for economical (and physical) factors underlying their design and manufacturing.
Multicore Considerations
• Interconnection networks are constrained more in the
multicore context than in the large-scale SMP world. Why?
• But AMD Barcelona quad-core processor utilizes 11 Cu
layers.
– Relative to the # of transistors in the two planar dimensions of the
processor, the CPU remains for all intents and purposes … flat.
– Cramming a sophisticated interconnection network that is not
planar into a limited number of layers is quite hard. (Caveat:
Proximity interconnect.) Thus, there is a limitation on the type of interconnect on-chip.
Multicore at Scale
• Life becoming more interesting as core counts continue
to increase.
– Intel Terascale Chip: 80 cores
– Tilera: “Reconfigurable” 64 cores based on a 2-D mesh
topology.
– AMD/ATi HD 4870: 800 cores
– NVIDIA GeForce GTX 280: 240 cores. Why not 256? :-)
• In the not-too-distant future, interconnect topology will be
back “in vogue” for parallel computing …
Parallel Software: Correctness & Performance
Correctness
• Hardest aspect of parallel algorithm design and parallel
programming? Writing programs that are “correct” …
– What good is a program that generates wrong answers faster?
• What do we mean by correctness?
– Traditionally, proving that a given algorithm produces the
desired output.
– Example: Prim’s algorithm produces a minimum spanning tree.
– Correctness means that the tree produced by Prim’s algorithm is
Underlying Assumption
• Traditional algorithms take the following for granted:
– The machine is deterministic.
– Only one flow of control is active at any given time.
• Nondeterminism only comes into play in a purely theoretical sense
when talking about automata theory, NFAs vs. DFAs, and P vs. NP.
• This is not the sort of nondeterminism that we are talking about here.
What are we talking about?
• When two uncoordinated flows of control interact
with each other, there is no guarantee, without explicit
guidance, that the relative effects and interactions of the multiple threads of control will happen in a predictable order.
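A deterministic way to see the problem is to enumerate every possible interleaving by hand instead of relying on real threads. In this sketch (an illustration, not from the lecture), two flows of control each read a shared counter and then write back the value read plus one; the names run and all_schedules are hypothetical.

```python
from itertools import combinations

def run(schedule):
    """Replay one interleaving; schedule lists which thread takes each step."""
    shared = 0
    local = {0: None, 1: None}      # each thread's private copy of the counter
    step = {0: 0, 1: 0}             # how many steps each thread has taken
    for t in schedule:
        if step[t] == 0:
            local[t] = shared       # first step: read the shared counter
        else:
            shared = local[t] + 1   # second step: write back the increment
        step[t] += 1
    return shared

def all_schedules():
    """Every way to interleave two threads that take two steps each."""
    for slots in combinations(range(4), 2):
        yield tuple(0 if i in slots else 1 for i in range(4))

outcomes = {schedule: run(schedule) for schedule in all_schedules()}
for schedule, result in sorted(outcomes.items()):
    print(schedule, result)
```

Only the two schedules in which one thread finishes before the other starts produce the expected value 2; the other four lose an update and leave the counter at 1.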
Performance
• The “holy grail” of parallel computing …
• A parallel program should run at least as fast as the
sequential equivalent for a fixed input size.
– One may use parallelism to increase the volume that can be
computed, in which case, comparisons of time are not as important. (Weak scaling)
Performance and Correctness
(or Correctness and Performance?)
• Performance and correctness are often intimately
coupled.
– Without protections in place, a program can run very quickly but
suffer from severe correctness problems.
– Very conservative decisions can be made to ensure correctness
but at the cost of significant performance degradation. Example of this?
• Other performance factors (unrelated to logic flow in
place) to maintain determinism and correctness.
– Example: Granularity of computation and communication can be