CS 598: Provably Efficient Algorithms for Numerical and
Combinatorial Problems
Part 4: Communication Cost in Algorithms
Edgar Solomonik
Algorithmic cache management
Consider a computer with unlimited memory and a cache of size H
I
we can design algorithms by manually managing cache transfers
I minimize amount of data moved from memory to cache (bandwidth cost)
I minimize number of synchronous memory-to-cache transfers (latency cost)
I
generally, efficient algorithms in this model try to select blocks of
computation that minimize the surface-to-volume ratio
I i.e., do as much computation with the cache-resident data as possible
Cache-efficient matrix multiplication
Consider multiplication of n × n matrices C = A · B
For i ∈ [1, n/s], j ∈ [1, n/t], k ∈ [1, n/v], define blocks C[i, j], A[i, k], B[k, j] with
dimensions s × t, s × v, and v × t, respectively
for (i = 1 to n/s)
  for (j = 1 to n/t)
    initialize C[i,j] = 0 in cache
    for (k = 1 to n/v)
      load A[i,k] into cache
      load B[k,j] into cache
      C[i,j] = C[i,j] + A[i,k]*B[k,j]
    end
    write C[i,j] to memory
  end
end
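As a rough illustration of the blocked loop nest, the following Python sketch simulates the cache transfers and counts them. The function name `blocked_matmul` and the counters `words_moved` and `transfers` are illustrative assumptions, not part of the lecture; the sketch also assumes that s, t, and v divide n.

```python
def blocked_matmul(A, B, s, t, v):
    """Multiply n x n lists-of-lists A and B using s x t blocks of C.

    Returns (C, words_moved, transfers), where words_moved counts every
    word loaded into or written back from the simulated cache.
    """
    n = len(A)
    assert n % s == 0 and n % t == 0 and n % v == 0
    C = [[0] * n for _ in range(n)]
    words_moved = 0
    transfers = 0
    for i0 in range(0, n, s):
        for j0 in range(0, n, t):
            # The C[i,j] block is initialized in cache (no load needed).
            block = [[0] * t for _ in range(s)]
            for k0 in range(0, n, v):
                # Load A[i,k] (s x v) and B[k,j] (v x t) into cache.
                words_moved += s * v + v * t
                transfers += 1
                for i in range(s):
                    for k in range(v):
                        a = A[i0 + i][k0 + k]
                        for j in range(t):
                            block[i][j] += a * B[k0 + k][j0 + j]
            # Write the finished C[i,j] block back to memory once.
            words_moved += s * t
            for i in range(s):
                C[i0 + i][j0:j0 + t] = block[i]
    return C, words_moved, transfers
```

For n = 8, s = t = 4, v = 2, the counters report 320 words and 16 transfers, which agrees with n^2 + n^3/s + n^3/t and (n/s)(n/t)(n/v).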
Memory-bandwidth analysis of matrix multiplication
I
Let's consider the bandwidth and latency cost when each block multiplication has
dimensions s, t, v
I there are a total of (n/s)(n/t)(n/v) inner loop iterations
I the memory latency cost of the algorithm is the number of inner loop iterations, $O(n^3/(stv))$
I since each block of C stays resident throughout the innermost loop, we write each element of C to memory only once
I we read an s × v block of A and a v × t block of B in each innermost loop iteration
I therefore, the bandwidth cost is $Q = n^2 + (n/s + n/t)n^2 = n^2 + n^3/s + n^3/t$
I
Given the constraint $st + sv + vt \le H$, we can derive the optimal block sizes
I if we pick $s = t = v = \sqrt{H/3}$, we satisfy the constraint and obtain $Q \approx 2n^3/\sqrt{H/3}$, with $O(n^3/H^{3/2})$ memory latency cost
I if we pick $s = t = \sqrt{H - 2\sqrt{H}}$ and $v = 1$, we obtain $Q \approx 2n^3/\sqrt{H}$, with $O(n^3/H)$ memory latency cost
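To compare the two block-size choices numerically, one can evaluate the cost formulas directly. This is a minimal sketch: all helper names are hypothetical, and block sizes are rounded down to integers so that the constraint st + sv + vt ≤ H still holds.

```python
import math

def bandwidth_cost(n, s, t, v):
    # Q = n^2 + n^3/s + n^3/t (v does not appear in the bandwidth term).
    return n**2 + n**3 / s + n**3 / t

def latency_cost(n, s, t, v):
    # One transfer per inner-loop iteration.
    return n**3 / (s * t * v)

def fits_in_cache(s, t, v, H):
    # The three resident blocks must fit in a cache of H words.
    return s * t + s * v + v * t <= H

def cubic_blocks(H):
    # s = t = v = floor(sqrt(H/3))
    b = math.isqrt(H // 3)
    return b, b, b

def outer_product_blocks(H):
    # s = t close to sqrt(H - 2 sqrt(H)), v = 1 (integer rounding assumed)
    b = math.isqrt(H - 2 * math.isqrt(H))
    return b, b, 1
```

With H = 300, the cubic choice gives 10 × 10 × 10 blocks and the outer-product choice gives s = t = 16, v = 1; the latter moves fewer words but issues more transfers.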
Ideal cache model
I
A more accurate model is to consider a cache line size L in addition to the
cache size H
I each memory-to-cache transfer has size L
I new unified metric: cache misses (number of cache lines transferred)
I the bandwidth cost is the number of cache misses multiplied by L
I the (old) latency cost (number of transfers) is disregarded
I assume a ‘tall’ cache, $L \le \sqrt{H}$ (more conveniently, $H = \Omega(L^2)$)
I
We can now consider different caching protocols
I an ideal cache model corresponds to the assumption that the protocol always makes the best decision
I this ideal cache model is in a sense equivalent to a manually orchestrated cache protocol
Matrix transposition in the ideal cache model
I
Matrix multiplication bandwidth cost with a tall cache is not affected by L
I if we read square blocks into cache they have dimension Θ(L)
I if we compute outer products, we just need to transpose B initially
I
n × n matrix transposition becomes non-trivial
I when L = 1 (original model), there is no notion of how a matrix is laid out in memory
I for general L, we should read $\sqrt{H} \times \sqrt{H}$ blocks into cache, transpose them,
then write them to memory to get linear bandwidth cost $O(n^2)$
Cache obliviousness
I
Introduced by Frigo, Leiserson, Prokop, and Ramachandran
I basic idea: algorithms should not be parameterized by architectural parameters
I good ideas in computer science are most often good abstractions
I designing an algorithm obliviously of cache size makes it portable and efficient for all levels of a cache hierarchy
I
cache oblivious algorithms are stated without explicit control of data
movement
I their communication cost is derived by assuming an ideal cache model
Cache oblivious matrix transposition
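Given m × n matrix A, compute $B = A^T$
I
if $m \le n$ subdivide $A = \begin{bmatrix} A_1 & A_2 \end{bmatrix}$ and $B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}$ and compute recursively,
$B_1 = A_1^T$, $B_2 = A_2^T$
I
if $m > n$ subdivide $A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}$ and $B = \begin{bmatrix} B_1 & B_2 \end{bmatrix}$ and compute recursively,
$B_1 = A_1^T$, $B_2 = A_2^T$
This obtains linear bandwidth cost: $T(mn) = 2T(mn/2)$, $T(1) = O(1)$, so $T(mn) = O(mn)$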
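The recursive subdivision for transposition can be sketched in Python as follows. The function name `co_transpose` and its index-range arguments are assumptions made for illustration; the sketch recurses down to single elements, whereas a practical implementation would stop at a small base-case block.

```python
def co_transpose(A, B, r0=0, c0=0, m=None, n=None):
    """Write the transpose of the m x n submatrix of A at (r0, c0) into B.

    A is a list of rows; B must be preallocated with the transposed shape.
    """
    if m is None:
        m, n = len(A), len(A[0])
    if m == 1 and n == 1:
        B[c0][r0] = A[r0][c0]
    elif m <= n:
        # Split the columns of A (equivalently, the rows of B) in half.
        h = n // 2
        co_transpose(A, B, r0, c0, m, h)
        co_transpose(A, B, r0, c0 + h, m, n - h)
    else:
        # Split the rows of A (equivalently, the columns of B) in half.
        h = m // 2
        co_transpose(A, B, r0, c0, h, n)
        co_transpose(A, B, r0 + h, c0, m - h, n)
```

No cache size or line size appears anywhere in the code, which is the point of the cache-oblivious formulation.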
Cache oblivious matrix multiplication
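Given m × k matrix A and k × n matrix B, compute m × n matrix C = AB
I
if $k \ge m$ and $k \ge n$ subdivide $A = \begin{bmatrix} A_1 & A_2 \end{bmatrix}$ and $B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}$ and compute
recursively, $\bar{C} = A_1 B_1$, $\hat{C} = A_2 B_2$, then $C = \bar{C} + \hat{C}$
I
if $n > k$ and $n \ge m$ subdivide $C = \begin{bmatrix} C_1 & C_2 \end{bmatrix}$ and $B = \begin{bmatrix} B_1 & B_2 \end{bmatrix}$ and compute
recursively, $C_1 = A B_1$, $C_2 = A B_2$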
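A hedged Python sketch of the recursive multiplication follows. The function name `co_matmul` is hypothetical, the submatrices are copied by slicing purely for readability (a real implementation would pass index ranges), and the final case, which splits the rows of A and C when m is the largest dimension, is an assumption added by symmetry since it is not shown above.

```python
def co_matmul(A, B):
    """Recursive product of an m x k and a k x n matrix (lists of rows)."""
    m, k, n = len(A), len(B), len(B[0])
    if m == 1 and k == 1 and n == 1:
        return [[A[0][0] * B[0][0]]]
    if k >= m and k >= n:
        # Split the shared dimension: C = A1*B1 + A2*B2.
        h = k // 2
        C_bar = co_matmul([row[:h] for row in A], B[:h])
        C_hat = co_matmul([row[h:] for row in A], B[h:])
        return [[x + y for x, y in zip(r1, r2)]
                for r1, r2 in zip(C_bar, C_hat)]
    if n > k and n >= m:
        # Split the columns of B and C: C1 = A*B1, C2 = A*B2.
        h = n // 2
        C1 = co_matmul(A, [row[:h] for row in B])
        C2 = co_matmul(A, [row[h:] for row in B])
        return [r1 + r2 for r1, r2 in zip(C1, C2)]
    # Remaining case (assumed by symmetry): split the rows of A and C.
    h = m // 2
    return co_matmul(A[:h], B) + co_matmul(A[h:], B)
```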
Cache oblivious fast Fourier transform (FFT)
I
The Fourier transform computes $y = D^{(n)} x$, where $d^{(n)}_{ij} = \omega_n^{ij}$ and $\omega_n$ is the
$n$th complex root of unity
I $D^{(n)}$ is complex and symmetric (not Hermitian)
I $D^{(n)}$ is unitary modulo scaling (ignored here)
I
A cache-oblivious algorithm for the FFT can be derived by folding y and x
into matrices Y and X of dimensions $\sqrt{n} \times \sqrt{n}$
I Using the fact that $\omega_n^m = \omega_{n/m}$, and assuming $m = \sqrt{n}$ is an integer, observe
$$\omega_n^{ij} = \omega_n^{(i_1 + m i_2)(j_1 + m j_2)} = \omega_n^{i_1 j_1}\, \omega_n^{m i_1 j_2}\, \omega_n^{m i_2 j_1}\, \omega_n^{n i_2 j_2} = \omega_n^{i_1 j_1}\, \omega_m^{i_1 j_2}\, \omega_m^{i_2 j_1}$$
I Consequently, $d^{(n)}_{i_1 + m i_2,\, j_1 + m j_2} = d^{(n)}_{i_1 j_1} d^{(m)}_{i_1 j_2} d^{(m)}_{i_2 j_1}$, so we can compute Y via
$$y_{i_1 i_2} = \sum_{j_1, j_2} \left[ d^{(n)}_{i_1 j_1} \left( d^{(m)}_{i_1 j_2} x_{j_1 j_2} \right) \right] d^{(m)}_{i_2 j_1}$$
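The index folding can be checked with a small pure-Python sketch. The names `naive_dft` and `folded_fft` are hypothetical, the sign convention $\omega_n = e^{-2\pi i/n}$ is an assumption (the slides do not fix it), and the code falls back to a direct transform whenever n is not a perfect square.

```python
import cmath
import math

def naive_dft(x):
    """Direct O(n^2) evaluation of y_i = sum_j omega_n^(i*j) x_j."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n) for j in range(n))
            for i in range(n)]

def folded_fft(x):
    """DFT via the sqrt(n) x sqrt(n) folding i = i1 + m*i2, j = j1 + m*j2."""
    n = len(x)
    m = math.isqrt(n)
    if n <= 2 or m * m != n:
        return naive_dft(x)  # fallback when n is not a perfect square
    # Step 1: for each j1, transform over j2, giving Z[j1][i1].
    Z = [folded_fft([x[j1 + m * j2] for j2 in range(m)]) for j1 in range(m)]
    # Step 2: entrywise scaling by d^(n)_{i1 j1} = omega_n^(i1*j1).
    for j1 in range(m):
        for i1 in range(m):
            Z[j1][i1] *= cmath.exp(-2j * cmath.pi * i1 * j1 / n)
    # Step 3: for each i1, transform over j1, giving y[i1 + m*i2].
    y = [0j] * n
    for i1 in range(m):
        col = folded_fft([Z[j1][i1] for j1 in range(m)])
        for i2 in range(m):
            y[i1 + m * i2] = col[i2]
    return y
```

Each level performs $2\sqrt{n}$ recursive transforms of size $\sqrt{n}$ plus O(n) entrywise multiplications.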
Cache oblivious fast Fourier transform (FFT), contd
I
Let's now analyze the cost of the cache-oblivious algorithm based on
$Y = \left(\left((D^{(m)} X) \odot F\right) D^{(m)}\right)^T$, where $\odot$ denotes the Hadamard (entrywise) product
I There are $2\sqrt{n}$ recursive calls and $O(n)$ work to apply the Hadamard product at
each level, so the work is
$W(n) = 2\sqrt{n}\, W(\sqrt{n}) + O(n) = O(n \log n)$
I The Hadamard product can be applied with depth 1, but half the recursive calls are dependent on the other half, so the depth is
$D(n) = 2D(\sqrt{n}) + O(1) = O(\log n)$
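I After $\log_2(\log n / \log H) = \log_2 \log_H n$ levels of recursion, the subproblem size is below $H$ and the computation can be
done in cache. $O(n)$ memory traffic is required otherwise, so
$Q(n) = 2\sqrt{n}\, Q(\sqrt{n}) + O(n) = O(n \log_H n)$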
A simple model for point-to-point messages
The time to send or receive a message of s words is α + s · β
I
α – latency/synchronization cost per message
I
β – bandwidth cost per word
I
each processor can send and/or receive one message at a time
Consider the cost of a broadcast of s words
I using a binary tree of height $h_r = 2(\log_2(p + 1) - 1) \approx 2\log_2(p)$: $T = h_r \cdot (\alpha + s \cdot \beta)$
I using a binomial tree of height
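The point-to-point model and the binary-tree broadcast cost can be evaluated with a few lines of Python. This is only a sketch of the formulas $\alpha + s \cdot \beta$ and $T = h_r \cdot (\alpha + s \cdot \beta)$; the function names are hypothetical.

```python
import math

def message_time(s, alpha, beta):
    """Time to send or receive one message of s words."""
    return alpha + s * beta

def binary_tree_broadcast_time(p, s, alpha, beta):
    """Broadcast of s words over a binary tree with p processors.

    Uses h_r = 2 * (log2(p + 1) - 1): each parent sends to its two
    children one after the other.
    """
    h_r = 2 * (math.log2(p + 1) - 1)
    return h_r * message_time(s, alpha, beta)
```

For p = 7, the tree has two levels below the root, so $h_r = 4$ messages lie on the critical path.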
Bulk Synchronous Parallel (BSP) Model
I
Bulk Synchronous Parallel (BSP) model (Valiant 1990)
I execution is subdivided into supersteps, each associated with local work and a single round of communication
I within each superstep each processor can send and receive up to h messages (called an h-relation)
I in original model, messages were restricted to one-word length
I for modern architectures makes sense to allow each processor to send arbitrary-sized messages to at most h other processors
I
The cost of a BSP algorithm is a sum over supersteps of the maximum costs
incurred in that superstep
I given S supersteps, where processor i sends and receives a total of $W_{ij}$ words
in superstep j and performs $F_{ij}$ local work, we have
$$T = \sum_{j=1}^{S} \left( \alpha + \beta \max_i W_{ij} + \gamma \max_i F_{ij} \right)$$
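The BSP cost formula translates directly into code. In this minimal sketch, the function name `bsp_cost` and the layout of `W` and `F` as per-processor lists indexed by superstep are assumptions made for illustration.

```python
def bsp_cost(W, F, alpha, beta, gamma):
    """Sum over supersteps of alpha + beta*max_i W[i][j] + gamma*max_i F[i][j].

    W[i][j]: words sent and received by processor i in superstep j.
    F[i][j]: local work performed by processor i in superstep j.
    """
    num_supersteps = len(W[0])
    total = 0.0
    for j in range(num_supersteps):
        max_words = max(row[j] for row in W)
        max_work = max(row[j] for row in F)
        total += alpha + beta * max_words + gamma * max_work
    return total
```

Because each term takes a maximum over processors, load imbalance in either communication or computation shows up directly in the total.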
Collective communication in BSP
I
When h = p, most collective communication routines involving s words of
data per processor can be done with BSP cost O(α + s · β)
I Scatter: root sends each s/p-sized message to its target (root incurs s · β send bandwidth)
I Reduce-Scatter: each processor sends s/p summands to every other processor (every processor incurs s · β send and receive bandwidth)
I Gather: send each message of size s/p to root (root incurs s · β receive bandwidth)
I Allgather: each processor sends s/p words portion to every other processor (every processor incurs s · β send and receive bandwidth)
I Broadcast done by Scatter then Allgather
I Reduce done by Reduce-Scatter then Gather
I Allreduce done by Reduce-Scatter then Allgather
I All-to-all can be done by sending messages directly in one round
I For $h < p$, $O(\log_h p)$ supersteps required, but bandwidth cost same for all except
Matrix-vector Product
I
Let's design a cache-efficient algorithm for a matrix-vector product
I Each processor owns $n^2/p$ matrix data
I Given O(H) matrix data in cache, can perform O(H) work, input/output O(H) vector entries
I Row-wise partitioning avoids write conflicts, achieves $Q = O(n^2/p)$ cache traffic
I
Let's design a BSP algorithm for a matrix-vector product
I Each processor starts with matrix and vector data, which it may not need to communicate
I Need to pick initial distribution, assume initial data is not replicated
I First consider row-wise (1D) distribution, communicate O(n) input vector data per processor (all-gather)
I With 2D blocked or cyclic distribution, require O(n/√p) input vector data per
processor (broadcast or all-gather in processor columns) and contribute
Sparse matrix-vector Product
I
1D distribution is effective for BSP algorithm for SpMV
I Each processor assigned n/p input/output vector entries and n/p matrix rows
$$\begin{bmatrix} y_1 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} A_1 \\ \vdots \\ A_p \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_p \end{bmatrix}$$
I Let $A_i = A_i^{\mathrm{local}} + A_i^{\mathrm{remote}}$ where $A_i^{\mathrm{local}}$ is on the block-diagonal
I Can perform local SpMV $A_i^{\mathrm{local}} x$ without communication, since only $x_i$ is needed
I ith processor needs to receive entries for each nonzero column in $A_i^{\mathrm{remote}}$
I ith processor needs to send entries for each nonzero column in $A_i^{\mathrm{remote}}$,
assuming A is symmetric (otherwise consider nonzero rows of block-column)
I Algorithm can be more efficient than 2D if the number of such nonzero columns is less than $n/\sqrt{p}$
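To see how much vector data a 1D distribution must communicate, one can count the nonzero columns of each remote block. The sketch below assumes a contiguous block-row distribution with p dividing n and a matrix given as (row, column) index pairs; the function name is hypothetical.

```python
def remote_columns_per_processor(nonzeros, n, p):
    """Columns of x each processor must receive under a 1D row distribution.

    nonzeros: iterable of (row, col) index pairs of an n x n sparse matrix.
    Rows and vector entries are split into p contiguous blocks of size n/p.
    Returns a list whose i-th entry is the set of nonzero columns of the
    off-block-diagonal (remote) part of processor i's rows.
    """
    assert n % p == 0
    block = n // p
    needed = [set() for _ in range(p)]
    for row, col in nonzeros:
        owner = row // block
        if col // block != owner:  # entry lies off the block diagonal
            needed[owner].add(col)
    return needed
```

For a tridiagonal matrix, every processor needs at most two remote entries, far fewer than the $n/\sqrt{p}$ entries of a 2D distribution.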
Massively Parallel Computing (MPC) Model
I
Massively Parallel Computing (MPC) model
I Given input size N, allow each processor to have $S = N^{\alpha}$ memory for $0 < \alpha < 1$
I Let the number of processors be N/S so that the input fits into global memory
or (N/S) polylog B
I Each round corresponds to local computation followed by arbitrary global communication
I Aim to minimize number of rounds, can simulate PRAM and BSP
I
Aim to achieve O(log N ) or O(log log N ) rounds with minimal memory per
processor
I strongly superlinear if $S \ge N^{1+\epsilon}$
I nearly linear if $S = O(N\,\mathrm{polylog}\,N)$
Graph Algorithms in MPC
I
Graph algorithms in the MPC model for graphs with n vertices
I With S = Θ(n) memory can assign a vertex and its incident edges to a processor
I Can perform SpMV with O(1) rounds, many graph algorithms in constant or
O(polylog(log n)) rounds
Communication lower bounds
I
Given an algorithm (e.g. radix-2 FFT, bitonic sort) or family of algorithms
(e.g. radix-k FFT, comparison based sorting algorithms), how much
communication is necessary?
I How much data or cache lines must be moved between memory and cache?
I How much data must processes communicate in BSP, assuming work or input is load balanced?
I
Communication lower bounds ascertain optimality of communication
schedules
I For numerical problems, full space of potential algorithms is often too difficult to consider for communication lower bounds
I Communication cost lower bounds consequently focus on a particular set of algorithms
Classical results in communication lower bounds
I
Floyd 1972: for large cache lines L = Θ(H) matrix transposition has cost
$O(n^2 \log(n) \cdot \beta)$
I
Hong and Kung 1981, pebbling lower bound
I model communication as placing pebbles on a dependency graph of an algorithm
I lower bounds for matrix-matrix multiplication, FFT, stencil computation, odd-even sort
I
Aggarwal and Vitter 1988, lower bounds with any L, H
I communication lower bounds for general permutation networks
Lower bounds by partitioning memory operations
Pebbling bounds employ the following general argument
I
consider the sequence of loads and stores (memory-cache) transfers
computed by a program
I
the length of the sequence is the bandwidth cost Q
I
partition the sequence into parts of size H
I
upper-bound the amount of useful work that can be done between the
beginning and end of this sequence
I
H bounds the number of inputs we read from memory and outputs we write
to cache
I
with partitioning, all we need is a bound $f_{\mathrm{alg}}(H)$ on how much useful
computation can be done with 3H inputs + outputs
Lower bounds by partitioning computation
We can also take the dual view
I
we are given an algorithm that must perform F operations
I
we need to prove that, given 3H inputs and outputs, at most $f_{\mathrm{alg}}(H)$ of the
computation can be done
I to prove this we generally need some assumptions to guarantee that outputs cannot be discarded
I it's typical to assume that the F operations are not recomputed (outputs are not regenerated)
I we can also represent some algorithms with dependency graphs (DAGs) with F vertices
I consider any execution schedule (ordering) of the F operations
I for each subsequence of size $f_{\mathrm{alg}}(H)$, we can show that H loads or stores are required
Bounding work in matrix multiplication
Consider the $F = n^3$ products computed in square matrix multiplication
I
additions are tricky, we don’t want to impose specific summation trees
I
consider any G of the products $C(i, j) \leftarrow A(i, k) \cdot B(k, j)$
I
the d = 3 Loomis-Whitney theorem tells us that the numbers of unique (i, k),
(k, j), and (i, j) indices in G, denoted $g_A$, $g_B$, and $g_C$, satisfy
$\sqrt{g_A \cdot g_B \cdot g_C} \ge G$
I
in other words, the inputs needed to compute the G products include $g_A$ values
of A, $g_B$ values of B, and they contribute to $g_C$ different entries of C
I
we can safely restrict the space of algorithms to those that do not sum
products which contribute to different entries of C
I
bound the size of G provided the number of inputs and outputs is at most H
$$f_{\mathrm{MM}}(H) = \max_{g_A + g_B + g_C \le 3H} \sqrt{g_A \cdot g_B \cdot g_C} = H^{3/2}$$
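The d = 3 Loomis-Whitney inequality used here can be checked on small index sets. In this sketch a set G of products is represented as (i, j, k) triples; the function names are hypothetical, and the inequality is tested in the squared form $g_A \cdot g_B \cdot g_C \ge |G|^2$ to stay in integer arithmetic.

```python
def projection_sizes(G):
    """Numbers of distinct A, B, and C entries touched by the products in G.

    G is a collection of (i, j, k) triples, each standing for the product
    C(i, j) <- A(i, k) * B(k, j).
    """
    g_A = len({(i, k) for i, j, k in G})
    g_B = len({(k, j) for i, j, k in G})
    g_C = len({(i, j) for i, j, k in G})
    return g_A, g_B, g_C

def satisfies_loomis_whitney(G):
    """Check sqrt(g_A * g_B * g_C) >= |G| without floating point."""
    g_A, g_B, g_C = projection_sizes(G)
    return g_A * g_B * g_C >= len(G) ** 2
```

A full b × b × b cube of products attains equality, which is why cubic blocks maximize the work done per word of input and output.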
Cache complexity lower bound for MM
Given $f_{\mathrm{MM}}(H) = H^{3/2}$, we are essentially done
I
we obtain the sequential memory bandwidth lower bound
$$Q_{\text{seq-MM}}(n, H) \ge \frac{n^3 H}{f_{\mathrm{MM}}(H)} = \frac{n^3}{\sqrt{H}}$$
I
in the parallel case, one of P processors needs to perform at least $n^3/P$ of the products,
so
$$Q_{\text{par-MM}}(n, H, P) \ge \frac{n^3}{P\sqrt{H}}$$
Interprocessor communication lower bound for MM
We can also use $f_{\mathrm{MM}}$ to get lower bounds on interprocessor communication
I
given that each processor has M memory, $f_{\mathrm{MM}}(M)$ tells us how much
computation can be done with 3M inputs/outputs
I
we can assume no processor has more than $2n^2/P$ inputs at the start of
execution and $n^2/P$ outputs at the end, so
$$W_{\text{par-MM}}(n, H, M, P) \ge \frac{n^3}{P} \cdot \frac{M}{f_{\mathrm{MM}}(M)} - \frac{3n^2}{P} = \frac{n^3}{P\sqrt{M}} - \frac{3n^2}{P}$$
I
for $c \in [1, P^{1/3}]$ we get
$$W_{\text{par-MM}}(n, H, cn^2/P, P) = \Omega\left(\frac{n^2}{\sqrt{cP}}\right)$$
I
restricting the amount of work done per processor to $n^3/P$, gets us
$$W_{\text{par-MM}}(n, H, P) = \Omega\left(\frac{n^2}{P^{2/3}}\right)$$
Latency/synchronization lower bounds
We can use $f_{\mathrm{MM}}$ to get lower bounds on the number of messages
I
Given a cache of size H, $\Omega(n^3/f_{\mathrm{MM}}(H))$ blocks must be transferred between
memory and cache
I
Given $M = 3cn^2/p$ memory, $\Omega((n^3/p)/f_{\mathrm{MM}}(M)) = \Omega(\sqrt{p/c^3})$ messages must
be sent or received by some processor
I
Given $M = 3cn^2/p$ memory, $\Omega((n^3/p)/f_{\mathrm{MM}}(M)) = \Omega(\sqrt{p/c^3})$ BSP supersteps are required
Paths in Radix-2 FFT dependency graph
Any two edge-disjoint paths in the FFT DAG intersect at no more than one vertex
Work bound for FFT
We prove that the work bound for the radix-2 FFT is $f_{\mathrm{FFT}}(s) = s \log_2 s$
I
in particular that with s inputs, at most $s \log_2 s$ work can be done
I
we can do this by induction on s
I
the base case, s = 1 holds trivially
Work bound for FFT, contd
I
consider the last level in the FFT graph in which a vertex is computed
I
if k vertices in the level were computed, we must know k/2 values in each of
the left sub-FFTs
I
moreover, each sub-FFT must have at least k/2 inputs
I
conversely, if one of the sub-FFTs had t of the s inputs, we can at most
Communication lower bound for the FFT
By induction the expression
$$f_{\mathrm{FFT}}(s) = \max_t \left( f_{\mathrm{FFT}}(s - t) + f_{\mathrm{FFT}}(t) + 2\min(s - t, t) \right)$$
implies
$$f_{\mathrm{FFT}}(s) = \max_t \left( (s - t)\log_2(s - t) + t\log_2(t) + 2\min(s - t, t) \right)$$
which is maximized by picking t = s/2
$$f_{\mathrm{FFT}}(s) = 2 f_{\mathrm{FFT}}(s/2) + s = s\log_2(s)$$
Given $f_{\mathrm{FFT}}(s) = s\log_2(s)$, the cache complexity is
$$Q_{\text{seq-FFT}}(n, H) \ge \frac{n\log_2(n)\, H}{f_{\mathrm{FFT}}(2H)} = \frac{n\log(n)}{2\log(2H)} = \Omega(n\log_H(n))$$
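The recurrence for $f_{\mathrm{FFT}}$ can be evaluated exactly by dynamic programming and compared against the closed form $s \log_2 s$. This is a numerical sanity check under the assumption $f_{\mathrm{FFT}}(1) = 0$ with integer t between 1 and s − 1; the function names are hypothetical.

```python
import math

def fft_work_bound(max_s):
    """Tabulate f(s) = max_t [ f(s - t) + f(t) + 2*min(s - t, t) ], f(1) = 0."""
    f = [0] * (max_s + 1)
    for s in range(2, max_s + 1):
        f[s] = max(f[s - t] + f[t] + 2 * min(s - t, t) for t in range(1, s))
    return f

def closed_form(s):
    """The claimed bound s * log2(s)."""
    return s * math.log2(s)
```

For powers of two the table matches $s \log_2 s$ exactly, and for other s it stays below the closed form, consistent with the maximum being attained at t = s/2.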
Lower bounds via graph partitioning
I
Given a DAG representation of an algorithm, graph partitioning properties
can provide communication lower bounds
I Consider 2-processor load-balanced parallelization to get balanced two-way partitioning of graph
I Vertices with outgoing edges to vertices in the other part must be communicated
I These vertices define a separator between the two parts, since their removal disconnects the graph
I Lower bound on vertex separator size yields lower bound on communication
I
Consideration of expansion of subgraphs can yield better bounds
I Two-processor view can yield communication volume lower bound by considering data movement between first half and second half of processors
I Can get better lower bounds like before by obtaining function f (s) on how much useful computation can happen with 3s data
I If for any subset of vertices $S \subset V$ with $|S| \le k|V|$, a separator of size
$\Omega(r(k))$ is needed to disconnect S from $V \setminus S$, then $f(s) = O(r^{-1}(s))$