• No results found

CS 598: Provably Efficient Algorithms for Numerical and Combinatorial Problems

N/A
N/A
Protected

Academic year: 2021

Share "CS 598: Provably Efficient Algorithms for Numerical and Combinatorial Problems"

Copied!
38
0
0

Loading.... (view fulltext now)

Full text

(1)

CS 598: Provably Efficient Algorithms for Numerical and

Combinatorial Problems

Part 4: Communication Cost in Algorithms

Edgar Solomonik

(2)

Algorithmic cache management

Consider a computer with unlimited memory and a cache of size H

I

we can design algorithms by manually managing cache transfers

I minimize amount of data moved from memory to cache (bandwidth cost)

I minimize number of synchronous memory-to-cache transfers (latency cost)

I

generally, efficient algorithms in this model try to select blocks of

computation that minimize the surface-to-volume ratio

I i.e., do as much computation with the cache-resident data as possible

(3)

Cache-efficient matrix multiplication

Consider multiplication of n × n matrices C = A · B

For i ∈ [1, n/s], j ∈ [1, n/t], k ∈ [1, n/v], define blocks C[i, j], A[i, k], B[k, j] with

dimensions s × t, s × v, and v × t, respectively

for ( i = 1 to n / s )

for ( j = 1 to n / t )

i n i t i a l i z e C [i , j ] = 0 in cache

for ( k = 1 to n / v ) load A [i , k ] into cache load B [k , j ] into cache

C [i , j ] = C [i , j ] + A [i , k ]*B [k , j ] end

write C [i , j ] to m e m o r y end

(4)

Memory-bandwidth analysis of matrix multiplication

I

Lets consider bandwidth and latency cost if each matrix multiplication has

dimensions s, t, v

I there are a total of (n/s)(n/t)(n/v) inner loop iterations

I the memory latency cost of the algorithm is the number of inner loop iterations,

O(n3/(stv))

I since each block of C stays resident in the innermost loop, we write each element of C to memory only once

I we read each block s × v block of A and v × t block of B in each innermost loop

I therefore, the bandwidth cost is Q = n2+ (n/s + n/t)n2= n2+ n3/s + n3/t

I

Given the constraint, st + sv + vt ≤ H, we can derive the optimal block sizes

I if we pick s = t = v =pH/3, we satisfy the constraint and obtain Q ≈ 2n3/pH/3, with n3/H3/2memory latency cost

I if we pick s = t =pH − 2√H and v = 1, we obtain Q ≈ 2n3/H with n3/H

(5)

Ideal cache model

I

A more accurate model is to consider a cache line size L in addition to the

cache size H

I each memory-to-cache transfer has size L

I new unified metric: cache misses (number of cache lines transferred)

I the bandwidth cost is the number of cache misses multiplied by L

I the (old) latency cost (number of transfers) is disregarded

I assume ‘tall’ cache, L ≤H (more convenient, H = Ω(L2))

I

We can now consider different caching protocols

I an ideal cache model corresponds to the assumption that the protocol always makes the best decision

I this ideal cache model is in a sense equivalent to a manually orchestrated cache protocol

(6)

Matrix transposition in the ideal cache model

I

Matrix multiplication bandwidth cost with a tall cache is not affected by L

I if we read square blocks into cache they have dimension Θ(L)

I if we compute outer products, just need to transpose B initially

I

n × n matrix transposition becomes non-trivial

I when L = 1 (original model), there is no notion of how a matrix is laid out in memory

I for general L, we should read√H ×√H blocks into cache, transpose them,

then write them to memory to get linear bandwidth cost O(n2)

(7)

Cache obliviousness

I

Introduced by Frigo, Leiserson, Prokop, Ramachadran

I basic idea: algorithms should not be parameterized by architectural parameters

I good ideas in computer science are most often good abstractions

I designing an algorithm obliviously of cache size makes it portable and efficient for all levels of a cache hierarchy

I

cache oblivious algorithms are stated without explicit control of data

movement

I their communication cost is derived by assuming an ideal cache model

(8)

Cache oblivious matrix transposition

Given m × n matrix A, compute B = A

T

I

if m ≤ n subdivide A = [A

1

A

2

] and B =

B

1

B

2



and compute recursively,

B

1

= A

T1

, B

2

= A

T2

I

if m > n subdivide A =

A

1

A

2



and B = [B

1

B

2

] and compute recursively,

B

1

= A

T1

, B

2

= A

T2

obtains linear bandwidth cost T (mn) = 2T (mn/2), T (1) = O(1), so

(9)

Cache oblivious matrix multiplication

Given m × k matrix A and k × n matrix B, compute m × n matrix C = AB

I

if k ≥ m and k ≥ m subdivide A =

A

1

A

2

 and B =

B

1

B

2



and compute

recursively, ¯

C = A

1

B

1

, ˆ

C = A

2

B

2

, then C = ¯

C + ˆ

C

I

if n > k and n ≥ m subdivide C =

C

1

C

2

 and B = B

1

B

2

 and compute

(10)

Cache oblivious fast Fourier transform (FFT)

I

The Fourier transform computes y = D

(n)

x, where d

(n)

ij

= ω

ij

n

and ω

n

is the

nth complex root of identity

I D(n)is complex and symmetric (not Hermitian)

I D(n)is unitary modulo scaling (ignored here)

I

A cache-oblivious algorithm for the FFT can be derived by folding y and x

into matrices Y and X of dimensions

n ×

n

I Using the fact that ωm

n = ωn/m, and assuming m =n is an integer, observe ωnij = ω(i1+mi2)(j1+mj2) n = ω i1j1 n ω mi1j2 n ω mi2j1 n ω nj1j2 n = ωi1j1 n ω i1j2 m ω i2j1 m I Consequently, d(n)i 1+mi2,j1+mj2 = d (n) i1j1d (m) i1j2d (m)

i2j1, so we can compute Y via

yi1i2 = X j1,j2 [d(n)i 1j1(d (m) i1j2xj1j2)]d (m) i2j1

(11)

Cache oblivious fast Fourier transform (FFT)

I

Lets now analyze the cost of the cache oblivious algorithm based on

Y = (((D

(m)

X) F )D

(m)

)

T

I There are 2n recursive calls and O(n) work to apply the Hadamard product at

each level, so the work is

W (n) = 2√nW (√n) + O(n) = O(n log n)

I The Hadamard product can be applied with depth 1, but half the recursive calls are dependent on the other half, so the depth is

D(n) = 2D(√n) + O(1) = O(log n)

I After log n/ log H = logHn recursive calls, n < H and the computation can be

done in cache. O(n) memory traffic is required otherwise, so

(12)

A simple model for point-to-point messages

The time to send or receive a message of s words is α + s · β

I

α – latency/synchronization cost per message

I

β – bandwidth cost per word

I

each processor can send and/or receive one message at a time

Consider the cost of a broadcast of s words

I using a binary tree of height

hr= 2(log2(p + 1) − 1) ≈ 2 log2(p) T = hr· (α + s · β) I using a binomial tree of height

(13)

Bulk Synchronous Parallel (BSP) Model

I

Bulk Synchronous Parallel (BSP) model (Valiant 1990)

I execution is subdivided into supersteps, each associated with local work and a single round of communication

I within each superstep each processor can send and receive up to h messages (called an h-relation)

I in original model, messages were restricted to one-word length

I for modern architectures makes sense to allow each processor to send arbitrary-sized messages to at most h other processors

I

The cost of a BSP algorithm is a sum over supersteps of the maximum costs

incurred in that superstep

I given S supersteps, where processor i sends and receives Wija total of words

in superstep j and performs Fijlocal work, we have T = S X j=1 α + β max i Wij+ γ maxi Fij

(14)

Collective communication in BSP

I

When h = p, most collective communication routines involving s words of

data per processor can be done with BSP cost O(α + s · β)

I Scatter: root sends each s/p-sized message to its target (root incurs s · β send bandwidth)

I Reduce-Scatter: each processor sends s/p summands to every other processor (every processor incurs s · β send and receive bandwidth)

I Gather: send each message of size s/p to root (root incurs s · β receive bandwidth)

I Allgather: each processor sends s/p words portion to every other processor (every processor incurs s · β send and receive bandwidth)

I Broadcast done by Scatter then Allgather

I Reduce done by Reduce-Scatter then Gather

I Allreduce done by Reduce-Scatter then Allgather

I All-to-all can be done by sending messages directly in one round

I For h < p, O(loghp) supersteps required, but bandwidth cost same for all except

(15)
(16)

Matrix-vector Product

I

Lets design a cache-efficient algorithm for a matrix-vector product

I Each processor owns n2/p matrix data

I Given O(H) matrix data in cache, can perform O(H) work, input/output O(H) vector entries

I Row-wise partitioning avoids write conflicts, achieves Q = O(n2/p) cache traffic

I

Lets design a BSP algorithm for a matrix-vector product

I Each processor starts with matrix and vector data, which it may not need to communicate

I Need to pick initial distribution, assume initial data is not replicated

I First consider row-wise (1D) distribution, communicate O(n) input vector data per processor (all-gather)

I With 2D blocked or cyclic distribution, require O(n/p) input vector data per

processor (broadcast or all-gather in processor columns) and contribute

(17)

Sparse matrix-vector Product

I

1D distribution is effective for BSP algorithm for SpMV

I Each processor assigned n/p input/output vector entries and n/p matrix rows

   y1 .. . yp   =    A1 .. . Ap       x1 .. . xp   

I Let Ai= Alocali + Aremotei where Alocali is on the block-diagonal I Can perform local SpMV Alocal

i x without communication, since only xiis needed I ith processor needs to receive entries for each nonzero column in Aremotei I ith processor needs to send entries for each nonzero column in Aremote

i

assuming A is symmetric (otherwise consider nonzero rows of block-column)

I Algorithm can be more efficient than 2D if the number of such nonzero columns is less than n/√p

(18)

Massively Parallel Computing (MPC) Model

I

Massively Parallel Computing (MPC) model

I Given input size N , allow each processor to have S = Nαmemory for 0 < α < 1 I Let the number of processors be N/S so that the input fits into global memory

or (N/S) polylog B

I Each round corresponds to local computation followed by arbitrary global communication

I Aim to minimize number of rounds, can simulate PRAM and BSP

I

Aim to achieve O(log N ) or O(log log N ) rounds with minimal memory per

processor

I strongly superlinear if S ≥ N1+ I nearly linear if S = O(N polylog N )

(19)

Graph Algorithms in MPC

I

Graph algorithms in the MPC model for graphs with n vertices

I With S = Θ(n) memory can assign a vertex and its incident edges to a

processor

I Can perform SpMV with O(1) rounds, many graph algorithms in constant or

O(polylog(log n)) rounds

(20)

Communication lower bounds

I

Given an algorithm (e.g. radix-2 FFT, bitonic sort) or family of algorithms

(e.g. radix-k FFT, comparison based sorting algorithms), how much

communication is necessary?

I How much data or cache lines must be moved between memory and cache?

I How much data must processes communicate in BSP, assuming work or input is load balanced?

I

Communication lower bounds ascertain optimality of communication

schedules

I For numerical problems, full space of potential algorithms is often too difficult to consider for communication lower bounds

I Communication cost lower bounds consequently focus on a particular set of algorithms

(21)

Classical results in communication lower bounds

I

Floyd 1972: for large cache lines L = Θ(H) matrix transposition has cost

O(n

2

log(n) · β)

I

Hong and Kung 1981, pebbling lower bound

I model communication as placing pebbles on a dependency graph of an algorithm

I lower bounds for matrix-matrix multiplication, FFT, stencil computation, odd-even sort

I

Aggarwal and Vitter 1988, lower bounds with any L, H

I communication lower bounds for general permutation networks

(22)

Lower bounds by partitioning memory operations

Pebbling bounds employ the following general argument

I

consider the sequence of loads and stores (memory-cache) transfers

computed by a program

I

the length of the sequence is the bandwidth cost Q

I

partition the sequence into parts of size H

I

upper-bound the amount of useful work that can be done between the

beginning and end of this sequence

I

H bounds the number of inputs we read from memory and outputs we write

to cache

I

with partitioning, all we need is a bound f

alg

(H) on how much useful

computation can be done with 3H inputs + outputs

(23)

Lower bounds by partitioning computation

We can also take the dual view

I

we are given an algorithm that must perform F operations

I

we need to prove that the given 3H inputs and outputs at most f

alg

(H) of the

computation can be done

I to prove this we generally need some assumptions to guarantee that outputs cannot be discarded

I its typical to assume that the F operations are not recomputed (outputs are not regenerated)

I we can also represent some algorithms with dependency graphs (DAGs) with F vertices

I consider any execution schedule (ordering) of the F operations

I for each subsequence of size falg(H), we can show that H loads or stores are required

(24)

Bounding work in matrix multiplication

Consider the F = n

3

products computed in square matrix multiplication

I

additions are tricky, we don’t want to impose specific summation trees

I

consider any G of the products C(i, j) ← A(i, k) · B(k, j)

I

the d = 3 Loomis-Whitney theorem tells us that the number of unique (i, k),

(k, j), and (i, j) indices in G: g

A

, g

B

, and g

C

, satisfy

g

A

· g

B

· g

C

≥ G

I

in other words, the inputs needed to compute the G entries include g

A

values

of A, g

B

values of B, and they contribute to g

C

different entries of C

I

we can safely restrict the space of algorithms to those that do not sum

products which contribute to different entries of C

I

bound the size of G provided the number of inputs and outputs is at most H

f

MM

(H) =

max

|gA+gB+gC|≤3H

(25)

Cache complexity lower bound for MM

Given f

MM

(H) = H

3/2

, we are essentially done

I

we obtain the sequential memory bandwidth lower bound

Q

seq-MM

(n, H) ≥ n

3

H/f

MM

(H) =

n

3

H

I

in the parallel case, one of P processors needs to perform n

3

of the products,

so

Qpar-MM(n, H, P ) ≥

n

3

(26)

Interprocessor communication lower bound for MM

We can also use f

MM

to get lower bounds on interprocessor communication

I

given that each processor has M memory, f

MM

(M ) tells us how much

computation can be done with 3M inputs/outputs

I

we can assume no processor has more than 2n

2

/P inputs at the start of

execution and n

2

/P outputs at the end, so

Wpar-MM(n, H, M, P ) ≥ n

3

M/fMM(M ) − 3n

2

/P =

n

3

P

M

− 3n

2

/P

I

for c ∈ [1, P

1/3

] we get

Wpar-MM(n, H, cn

2

/P, P ) = Ω



n

2

cP



I

restricting the amount of work done per processor to n

3

/P , gets us

Wpar-MM(n, H, P ) = Ω



n

2

P

2/3

(27)

Latency/synchronization lower bounds

From f

MM

to get lower bounds on the number of messages

I

Given a cache of size H, Ω(n

3

/fMM(H)) blocks must be transfered between

memory and cache

I

Given M = 3cn

2

/p memory, Ω((n

3

/p)/fMM(M )) = Ω(pp/c

3

) messages must

be sent or received by some processor

I

Given M = 3cn

2

/p memory, Ω((n

3

/p)/fMM(M )) = Ω(pp/c

3

) BSP supersteps

(28)
(29)

Paths in Radix-2 FFT dependency graph

Any two edge-disjoint paths in the FFT DAG intersect at no more than one vertex

(30)

Work bound for FFT

We prove that the work bound for the radix-2 FFT is f

FFT

(s) = s log

2

s

I

in particular that with s inputs, at most s log

2

s work can be done

I

we can do this by induction on s

I

the base case, s = 1 holds trivially

(31)

Work bound for FFT, contd

I

consider the last level in the FFT graph in which a vertex is computed

I

if k vertices in the level were computed, we must know k/2 values in each of

the left sub-FFTs

I

moreover, each sub-FFT must have at least k/2 inputs

I

conversely, if one of the sub-FFTs had t of the s inputs, we can at most

(32)

Communication lower bound for the FFT

By induction the expression f

FFT

(s) = max

t

(f

FFT

(s − t) + f

FFT

(t) + 2 min(s − t, t))

implies

f

FFT

(s) = max

t

((s − t) log

2

(s − t) + t log(t) + 2 min(s − t, t))

which is maximized by picking t = s/2

fFFT(s) = 2fFFT(s/2) + s = s log

2

(s)

Given f

FFT(s) = s log2

(s), the cache complexity is

Qseq-FFT(n, H) ≥ n log

2

(n)H/fFFT(2H) = n

log(n)

2 log(2H)

= Ω(n log

H

(n))

(33)

Lower bounds via graph partitioning

I

Given a DAG representation of an algorithm, graph partitioning properties

can provide communication lower bounds

I Consider 2-processor load-balanced parallelization to get balanced two-way partitioning of graph

I Vertices with outgoing edges to vertices in the other part must be communicated

I These vertices define a separator between the two parts, since their removal disconnects the graph

I Lower bound on vertex separator size yields lower bound on communication

I

Consideration of expansion of subgraphs can yield better bounds

I Two-processor view can yield communication volume lower bound by

considering data movement between first half and second half of processors

I Can get better lower bounds like before by obtaining function f (s) on how much useful computation can happen with 3s data

I If for any subset of vertices S ⊂ V with |S| ≤ k  |V |, a separator of size

Ω(r(k)) is needed to disconnect S from V \ S, then f (s) = O(r−1(s))

(34)

Dependency interval expansion

Consider an algorithm that computes a set of operations V with a partial

ordering, we denote a dependency interval between a, b ∈ V as

[a, b] = {a, b} ∪ {c : a < c < b, c ∈ V }

If there exists

{v

1

, . . . , v

n

} ∈ V with v

i

< v

i+1

and

[v

i+1

, v

i+k

]

= Θ(k

d

) for all

k ∈ N, then

F · S

d−1

= Ω(n

d

)

where F is the computation cost and S is the synchronization cost

(35)

Dependency interval expansion

Consider an algorithm that computes a set of operations V with a partial

ordering, we denote a dependency interval between a, b ∈ V as

[a, b] = {a, b} ∪ {c : a < c < b, c ∈ V }

If there exists

{v

1

, . . . , v

n

} ∈ V with v

i

< v

i+1

and

[v

i+1

, v

i+k

]

= Θ(k

d

) for all

k ∈ N, then

F · S

d−1

= Ω(n

d

)

where F is the computation cost and S is the synchronization cost

(36)

Dependency interval expansion

Consider an algorithm that computes a set of operations V with a partial

ordering, we denote a dependency interval between a, b ∈ V as

[a, b] = {a, b} ∪ {c : a < c < b, c ∈ V }

If there exists

{v

1

, . . . , v

n

} ∈ V with v

i

< v

i+1

and

[v

i+1

, v

i+k

]

= Θ(k

d

) for all

k ∈ N, then

F · S

d−1

= Ω(n

d

)

where F is the computation cost and S is the synchronization cost

(37)

Example: diamond DAG

For the n × n diamond DAG (d = 2),

(38)

Tradeoffs involving synchronization

For triangular solve with an n × n matrix

F

TRSV

· S

TRSV

= Ω n

2



For Cholesky of an n × n matrix

F

CHOL

· S

CHOL2

= Ω n

3



W

CHOL

· S

CHOL

= Ω n

2



For computing s applications of a (2m + 1)

d

-point stencil

F

St

· S

Std

= Ω



m

2d

· s

d+1



W

St

· S

Std−1

= Ω



References

Related documents

Using data from the Current Population Survey and the Health and Retirement Study, this report estimates consumption and labor supply responses of individuals in their 50s and 60s

Result shows that the relationship between IV and DV is positive and significant which means consumer’s attitude towards social media adv ertisement, considering

Event Review Pro is our comprehensive case management tool for the most demanding data managers and medical directors, with even more detailed data entry screens to record

Finally, the newly created second-order constructs ( Processing ), included the Monitoring and Capturing, Analysis, and Exploitation constructs, represents the

The objective has been to ensure that Bonneville's near and long-term power and transmission costs are as low as possible consistent with sound business practices, thereby enabling

Roughly since the mid 1990s Customer Relationship Management systems (CRM systems) have been avail- able as configurable standard business software for the collection, analysis

Mikko Poutanen This document has been printed out through electrical mail.. A copy with original signature is available in

Each incorporates a particular model input and conveys its constraints: (i) the age distribution incorporates the cache size, and constrains modeled capacity; (ii) the hit