CS 598: Provably Efficient Algorithms for Numerical and Combinatorial Problems

(1)

CS 598: Provably Efficient Algorithms for Numerical and

Combinatorial Problems

Part 4: Communication Cost in Algorithms

Edgar Solomonik

(2)

Algorithmic cache management

Consider a computer with unlimited memory and a cache of size H

I

we can design algorithms by manually managing cache transfers

I minimize amount of data moved from memory to cache (bandwidth cost)

I minimize number of synchronous memory-to-cache transfers (latency cost)

I

generally, efficient algorithms in this model try to select blocks of

computation that minimize the surface-to-volume ratio

I i.e., do as much computation with the cache-resident data as possible

(3)

Cache-efficient matrix multiplication

Consider multiplication of n × n matrices C = A · B

For i ∈ [1, n/s], j ∈ [1, n/t], k ∈ [1, n/v], define blocks C[i, j], A[i, k], B[k, j] with

dimensions s × t, s × v, and v × t, respectively

for ( i = 1 to n / s )

for ( j = 1 to n / t )

i n i t i a l i z e C [i , j ] = 0 in cache

for ( k = 1 to n / v ) load A [i , k ] into cache load B [k , j ] into cache

C [i , j ] = C [i , j ] + A [i , k ]*B [k , j ] end

write C [i , j ] to m e m o r y end

(4)

Memory-bandwidth analysis of matrix multiplication

I

Lets consider bandwidth and latency cost if each matrix multiplication has

dimensions s, t, v

I there are a total of (n/s)(n/t)(n/v) inner loop iterations

I the memory latency cost of the algorithm is the number of inner loop iterations,

O(n3_/(stv))

I since each block of C stays resident in the innermost loop, we write each element of C to memory only once

I we read each block s × v block of A and v × t block of B in each innermost loop

I therefore, the bandwidth cost is Q = n2_{+ (n/s + n/t)n}2_{= n}2_{+ n}3_{/s + n}3_/t

I

Given the constraint, st + sv + vt ≤ H, we can derive the optimal block sizes

I if we pick s = t = v =pH/3, we satisfy the constraint and obtain Q ≈ 2n3_{/pH/3, with n}3_/H3/2_{memory latency cost}

I if we pick s = t =pH − 2√H and v = 1, we obtain Q ≈ 2n3_/√_{H with n}3_/H

(5)

Ideal cache model

I

A more accurate model is to consider a cache line size L in addition to the

cache size H

I each memory-to-cache transfer has size L

I new unified metric: cache misses (number of cache lines transferred)

I the bandwidth cost is the number of cache misses multiplied by L

I the (old) latency cost (number of transfers) is disregarded

I assume ‘tall’ cache, L ≤√H (more convenient, H = Ω(L2₎₎

I

We can now consider different caching protocols

I an ideal cache model corresponds to the assumption that the protocol always makes the best decision

I this ideal cache model is in a sense equivalent to a manually orchestrated cache protocol

(6)

Matrix transposition in the ideal cache model

I

Matrix multiplication bandwidth cost with a tall cache is not affected by L

I if we read square blocks into cache they have dimension Θ(L)

I if we compute outer products, just need to transpose B initially

I

n × n matrix transposition becomes non-trivial

I when L = 1 (original model), there is no notion of how a matrix is laid out in memory

I for general L, we should read√H ×√H blocks into cache, transpose them,

then write them to memory to get linear bandwidth cost O(n2₎

(7)

Cache obliviousness

I

Introduced by Frigo, Leiserson, Prokop, Ramachadran

I basic idea: algorithms should not be parameterized by architectural parameters

I good ideas in computer science are most often good abstractions

I designing an algorithm obliviously of cache size makes it portable and efficient for all levels of a cache hierarchy

I

cache oblivious algorithms are stated without explicit control of data

movement

I their communication cost is derived by assuming an ideal cache model

(8)

Cache oblivious matrix transposition

Given m × n matrix A, compute B = A

T

I

if m ≤ n subdivide A = [A

1

A

2

] and B =

B

1

B

2

and compute recursively,

B

1

= A

T1

, B

2

= A

T2

I

if m > n subdivide A =

A

1

A

2

and B = [B

1

B

2

] and compute recursively,

B

1

= A

T1

, B

2

= A

T2

obtains linear bandwidth cost T (mn) = 2T (mn/2), T (1) = O(1), so

(9)

Cache oblivious matrix multiplication

Given m × k matrix A and k × n matrix B, compute m × n matrix C = AB

I

if k ≥ m and k ≥ m subdivide A =

A

1

A

2

and B =

B

1

B

2

and compute

recursively, ¯

C = A

1

B

1

, ˆ

C = A

2

B

2

, then C = ¯

C + ˆ

C

I

if n > k and n ≥ m subdivide C =

C

1

C

2

and B = B

1

B

2

and compute

(10)

Cache oblivious fast Fourier transform (FFT)

I

The Fourier transform computes y = D

(n)

_{x, where d}

(n)

ij

= ω

ij

n

and ω

n

is the

nth complex root of identity

I D(n)is complex and symmetric (not Hermitian)

I D(n)_{is unitary modulo scaling (ignored here)}

I

A cache-oblivious algorithm for the FFT can be derived by folding y and x

into matrices Y and X of dimensions

√

n ×

√

n

I Using the fact that ωm

n = ωn/m, and assuming m = √ n is an integer, observe ω_nij = ω(i1+mi2)(j1+mj2) n = ω i1j1 n ω mi1j2 n ω mi2j1 n ω nj1j2 n = ωi1j1 n ω i1j2 m ω i2j1 m I Consequently, d(n)_i 1+mi2,j1+mj2 = d (n) i1j1d (m) i1j2d (m)

i2j1, so we can compute Y via

yi1i2 = X j1,j2 [d(n)_i 1j1(d (m) i1j2xj1j2)]d (m) i2j1

(11)

Cache oblivious fast Fourier transform (FFT)

I

Lets now analyze the cost of the cache oblivious algorithm based on

Y = (((D

(m)

X) F )D

(m)

)

T

I There are 2√n recursive calls and O(n) work to apply the Hadamard product at

each level, so the work is

W (n) = 2√nW (√n) + O(n) = O(n log n)

I The Hadamard product can be applied with depth 1, but half the recursive calls are dependent on the other half, so the depth is

D(n) = 2D(√n) + O(1) = O(log n)

I After log n/ log H = logHn recursive calls, n < H and the computation can be

done in cache. O(n) memory traffic is required otherwise, so

(12)

A simple model for point-to-point messages

The time to send or receive a message of s words is α + s · β

I

α – latency/synchronization cost per message

I

β – bandwidth cost per word

I

each processor can send and/or receive one message at a time

Consider the cost of a broadcast of s words

I using a binary tree of height

hr= 2(log2(p + 1) − 1) ≈ 2 log2(p) T = hr· (α + s · β) I using a binomial tree of height

(13)

Bulk Synchronous Parallel (BSP) Model

I

Bulk Synchronous Parallel (BSP) model (Valiant 1990)

I execution is subdivided into supersteps, each associated with local work and a single round of communication

I within each superstep each processor can send and receive up to h messages (called an h-relation)

I in original model, messages were restricted to one-word length

I for modern architectures makes sense to allow each processor to send arbitrary-sized messages to at most h other processors

I

The cost of a BSP algorithm is a sum over supersteps of the maximum costs

incurred in that superstep

I given S supersteps, where processor i sends and receives Wija total of words

in superstep j and performs Fijlocal work, we have T = S X j=1 α + β max i Wij+ γ maxi Fij

(14)

Collective communication in BSP

I

When h = p, most collective communication routines involving s words of

data per processor can be done with BSP cost O(α + s · β)

I Scatter: root sends each s/p-sized message to its target (root incurs s · β send bandwidth)

I Reduce-Scatter: each processor sends s/p summands to every other processor (every processor incurs s · β send and receive bandwidth)

I Gather: send each message of size s/p to root (root incurs s · β receive bandwidth)

I Allgather: each processor sends s/p words portion to every other processor (every processor incurs s · β send and receive bandwidth)

I Broadcast done by Scatter then Allgather

I Reduce done by Reduce-Scatter then Gather

I Allreduce done by Reduce-Scatter then Allgather

I All-to-all can be done by sending messages directly in one round

I For h < p, O(log_hp) supersteps required, but bandwidth cost same for all except

(15)

(16)

Matrix-vector Product

I

Lets design a cache-efficient algorithm for a matrix-vector product

I Each processor owns n2_{/p matrix data}

I Given O(H) matrix data in cache, can perform O(H) work, input/output O(H) vector entries

I Row-wise partitioning avoids write conflicts, achieves Q = O(n2_{/p) cache traffic}

I

Lets design a BSP algorithm for a matrix-vector product

I Each processor starts with matrix and vector data, which it may not need to communicate

I Need to pick initial distribution, assume initial data is not replicated

I First consider row-wise (1D) distribution, communicate O(n) input vector data per processor (all-gather)

I With 2D blocked or cyclic distribution, require O(n/√p) input vector data per

processor (broadcast or all-gather in processor columns) and contribute

(17)

Sparse matrix-vector Product

I

1D distribution is effective for BSP algorithm for SpMV

I Each processor assigned n/p input/output vector entries and n/p matrix rows

   y1 .. . yp   =    A1 .. . Ap       x1 .. . xp   

I Let Ai= Alocali + Aremotei where Alocali is on the block-diagonal I Can perform local SpMV Alocal

i x without communication, since only xiis needed I ith processor needs to receive entries for each nonzero column in Aremote_i I ith processor needs to send entries for each nonzero column in Aremote

i

assuming A is symmetric (otherwise consider nonzero rows of block-column)

I Algorithm can be more efficient than 2D if the number of such nonzero columns is less than n/√p

(18)

Massively Parallel Computing (MPC) Model

I

Massively Parallel Computing (MPC) model

I Given input size N , allow each processor to have S = Nα_{memory for 0 < α < 1} I Let the number of processors be N/S so that the input fits into global memory

or (N/S) polylog B

I Each round corresponds to local computation followed by arbitrary global communication

I Aim to minimize number of rounds, can simulate PRAM and BSP

I

Aim to achieve O(log N ) or O(log log N ) rounds with minimal memory per

processor

I strongly superlinear if S ≥ N1+ I nearly linear if S = O(N polylog N )

(19)

Graph Algorithms in MPC

I

Graph algorithms in the MPC model for graphs with n vertices

I With S = Θ(n) memory can assign a vertex and its incident edges to a

processor

I Can perform SpMV with O(1) rounds, many graph algorithms in constant or

O(polylog(log n)) rounds

(20)

Communication lower bounds

I

Given an algorithm (e.g. radix-2 FFT, bitonic sort) or family of algorithms

(e.g. radix-k FFT, comparison based sorting algorithms), how much

communication is necessary?

I How much data or cache lines must be moved between memory and cache?

I How much data must processes communicate in BSP, assuming work or input is load balanced?

I

Communication lower bounds ascertain optimality of communication

schedules

I For numerical problems, full space of potential algorithms is often too difficult to consider for communication lower bounds

I Communication cost lower bounds consequently focus on a particular set of algorithms

(21)

Classical results in communication lower bounds

I

Floyd 1972: for large cache lines L = Θ(H) matrix transposition has cost

O(n

2

log(n) · β)

I

Hong and Kung 1981, pebbling lower bound

I model communication as placing pebbles on a dependency graph of an algorithm

I lower bounds for matrix-matrix multiplication, FFT, stencil computation, odd-even sort

I

Aggarwal and Vitter 1988, lower bounds with any L, H

I communication lower bounds for general permutation networks

(22)

Lower bounds by partitioning memory operations

Pebbling bounds employ the following general argument

I

consider the sequence of loads and stores (memory-cache) transfers

computed by a program

I

the length of the sequence is the bandwidth cost Q

I

partition the sequence into parts of size H

I

upper-bound the amount of useful work that can be done between the

beginning and end of this sequence

I

H bounds the number of inputs we read from memory and outputs we write

to cache

I

with partitioning, all we need is a bound f

alg

(H) on how much useful

computation can be done with 3H inputs + outputs

(23)

Lower bounds by partitioning computation

We can also take the dual view

I

we are given an algorithm that must perform F operations

I

we need to prove that the given 3H inputs and outputs at most f

alg

(H) of the

computation can be done

I to prove this we generally need some assumptions to guarantee that outputs cannot be discarded

I its typical to assume that the F operations are not recomputed (outputs are not regenerated)

I we can also represent some algorithms with dependency graphs (DAGs) with F vertices

I consider any execution schedule (ordering) of the F operations

I for each subsequence of size falg(H), we can show that H loads or stores are required

(24)

Bounding work in matrix multiplication

Consider the F = n

3

_{products computed in square matrix multiplication}

I

additions are tricky, we don’t want to impose specific summation trees

I

consider any G of the products C(i, j) ← A(i, k) · B(k, j)

I

the d = 3 Loomis-Whitney theorem tells us that the number of unique (i, k),

(k, j), and (i, j) indices in G: g

A

, g

B

, and g

C

, satisfy

√

g

A

· g

B

· g

C

≥ G

I

in other words, the inputs needed to compute the G entries include g

A

values

of A, g

B

values of B, and they contribute to g

C

different entries of C

I

we can safely restrict the space of algorithms to those that do not sum

products which contribute to different entries of C

I

bound the size of G provided the number of inputs and outputs is at most H

f

MM

(H) =

max

|gA+gB+gC|≤3H

√

(25)

Cache complexity lower bound for MM

Given f

MM

(H) = H

3/2

, we are essentially done

I

we obtain the sequential memory bandwidth lower bound

Q

seq-MM

(n, H) ≥ n

3

H/f

MM

(H) =

n

3

√

H

I

in the parallel case, one of P processors needs to perform n

3

of the products,

so

Qpar-MM(n, H, P ) ≥

n

3

(26)

Interprocessor communication lower bound for MM

We can also use f

MM

to get lower bounds on interprocessor communication

I

given that each processor has M memory, f

MM

(M ) tells us how much

computation can be done with 3M inputs/outputs

I

we can assume no processor has more than 2n

2

/P inputs at the start of

execution and n

2

_{/P outputs at the end, so}

Wpar-MM(n, H, M, P ) ≥ n

3

M/fMM(M ) − 3n

2

/P =

n

3

P

√

M

− 3n

2

_/P

I

for c ∈ [1, P

1/3

] we get

Wpar-MM(n, H, cn

2

/P, P ) = Ω

n

2

√

cP

I

restricting the amount of work done per processor to n

3

_{/P , gets us}

Wpar-MM(n, H, P ) = Ω

n

2

P

2/3

(27)

Latency/synchronization lower bounds

From f

MM

to get lower bounds on the number of messages

I

Given a cache of size H, Ω(n

3

_{/fMM(H)) blocks must be transfered between}

memory and cache

I

Given M = 3cn

2

/p memory, Ω((n

3

/p)/fMM(M )) = Ω(pp/c

3

_{) messages must}

be sent or received by some processor

I

Given M = 3cn

2

/p memory, Ω((n

3

/p)/fMM(M )) = Ω(pp/c

3

_{) BSP supersteps}

(28)

(29)

Paths in Radix-2 FFT dependency graph

Any two edge-disjoint paths in the FFT DAG intersect at no more than one vertex

(30)

Work bound for FFT

We prove that the work bound for the radix-2 FFT is f

FFT

(s) = s log

2

s

I

in particular that with s inputs, at most s log

₂

s work can be done

I

we can do this by induction on s

I

the base case, s = 1 holds trivially

(31)

Work bound for FFT, contd

I

consider the last level in the FFT graph in which a vertex is computed

I

if k vertices in the level were computed, we must know k/2 values in each of

the left sub-FFTs

I

moreover, each sub-FFT must have at least k/2 inputs

I

conversely, if one of the sub-FFTs had t of the s inputs, we can at most

(32)

Communication lower bound for the FFT

By induction the expression f

FFT

(s) = max

t

(f

FFT

(s − t) + f

FFT

(t) + 2 min(s − t, t))

implies

f

FFT

(s) = max

t

((s − t) log

2

(s − t) + t log(t) + 2 min(s − t, t))

which is maximized by picking t = s/2

fFFT(s) = 2fFFT(s/2) + s = s log

₂

(s)

Given f

FFT(s) = s log₂

(s), the cache complexity is

Qseq-FFT(n, H) ≥ n log

₂

(n)H/fFFT(2H) = n

log(n)

2 log(2H)

= Ω(n log

H

(n))

(33)

Lower bounds via graph partitioning

I

Given a DAG representation of an algorithm, graph partitioning properties

can provide communication lower bounds

I Consider 2-processor load-balanced parallelization to get balanced two-way partitioning of graph

I Vertices with outgoing edges to vertices in the other part must be communicated

I These vertices define a separator between the two parts, since their removal disconnects the graph

I Lower bound on vertex separator size yields lower bound on communication

I

Consideration of expansion of subgraphs can yield better bounds

I Two-processor view can yield communication volume lower bound by

considering data movement between first half and second half of processors

I Can get better lower bounds like before by obtaining function f (s) on how much useful computation can happen with 3s data

I If for any subset of vertices S ⊂ V with |S| ≤ k |V |, a separator of size

CS 598: Provably Efficient Algorithms for Numerical and Combinatorial Problems