Sanders: Parallel Algorithms December 10, Parallel Algorithms. Peter Sanders Institute of Theoretical Informatics

(1)

Parallel Algorithms

Peter Sanders

(2)

Why Parallel Processing

increase speed:

p

computers jointly working on a problem solve it up to

p

times faster. But, too many cooks spoil the broth. careful

coordination of processors

save energy: Two processors with half the clock frequency need less power than one processor at full speed. (power

≈

voltage

·

clock frequency)

expand available memory by using aggregate memory of many processors

less communication: when data is distributed it can also be (pre)processed in a distributed way.

(3)

Subject of the Lecture

Basic methods for parallel problem solving

Parallelization of basic sequential techniques: sorting, data structures, graph algorithms, . . .

Basic communication patterns

load balancing

Emphasis on provable performance guarantees

(4)

Overview

Models, simple examples

Matrix multiplikation

Broadcasting

Sorting

General data exchange

Load balanciong I, II, III

List ranking (conversion list

→

array)

Hashing, priority queues

(5)

Literature

Script (in German) +

(6)

More Literature

[Kumar, Grama, Gupta und Karypis],

Introduction to Parallel Computing. Design and Analysis of Algorithms, Benjamin/Cummings, 1994. mostly practical, programming oriented [Leighton], Introduction to Parallel Algorithms and Architectures,

Morgan Kaufmann, 1992.

Theoretical algorithms on concrete interconnection networks

[JáJá], An Introduction to Parallel Algorithms, Addison Wesley, 1992. PRAM

[Sanders, Worsch],

(7)

Parallel Computing at ITI Sanders

Massively parallel sorting, Michael Axtmann

Massively parallel graph algorithms, Sebastian Lamm

Fault tolerance, Lukas Hübner

Shared-memory data structures, Tobias Maier

(Hyper)graph partitioning,

Tobias Heuer & Daniel Seemaier

Comm. eff. alg., Lorenz Hübschle-Schneider

SAT-solving and planning, Markus Iser and Dominik Schreiber

(8)

Role in the CS curriculum

Elective or “Mastervorzug”

Bachelor!

Vertiefungsfach

–

Algorithmentechnik

–

Parallelverarbeitung

(9)

Related Courses

Parallel programming: Tichy, Karl, Streit

Modelle der Parallelverarbeitung: viel theoretischer,

Komplexitätstheorie,. . . Worsch

Algorithmen in Zellularautomaten: spezieller, radikaler, theoretischer Worsch

Rechnerarchitektur: Karl GPUs: Dachsbacher

(10)

RAM/von Neumann Model

ALU

O(1) registers

1 word = O(log n) bits

large memory

freely programmable

Analysis:

count machine instructions — load, store, arithmetics, branch,. . .

simple

(11)

Algorithmenanalyse:

Count cycles:

T

(

I

)

, for given problem instance

I

.

Worst case depending on problem size:

T

(

n

)

=

max

_|_I_|₌_n

T

(

I

)

Average case:

T

_avg

(

n

)

=

∑

|I|=n

T

(

I

)

|{

I

:

_|

I

_|

=

n

_}|

Example: Quicksort has average case complexity

O

(

n

log

n

)

Randomized Algorithms:

T

(

n

)

(worst case) is a random variable. We are interested, e.g. in its expectation (later more).

Don’t mix up with average case.

Example: Quicksort with random pivot selection has expected worst case cost

E

[

T

(

n

)] =

O

(

n

log

n

)

(12)

Algorithm Analysis: More Conventions

O

₍

_·

₎

_{gets rid of cumbersome constants}

Secondary goal: memory

Execution time can depend on several parameters:

Example: An efficient variant of Dijkstra’s Algorithm for shortest paths needs time

O

(

m

+

n

log

n

)

where

n

is the number of nodes and

m

is the number of edges. (It always has to be clear what

(13)

A Simple Parallele Model:

PRAM

Idea: change RAM as little as possible.

p

PEs (

P

rocessing

E

lements); numbered

1 ..

p

(or

0 ..

p

−

1

). Every PE knows

p

.

One machine instruction per clock cycle and PE synchronous

Shared global memory

PE 1 PE 2 PE

...

p

(14)

Access Conflicts?

EREW: Exclusive Read Exclusive Write. Concurrent access is

forbidden

CREW: Concurrent Read Exclusive Write. Concurrent read OK. Example: One PE writes, others read = „Broadcast“

CRCW: Concurrent Read Concurrent Write. avoid chaos:

common: All writers have to agree Example: OR in constant time (AND?)

arbitrary: someone succeeds (

6 =

random!) priority: writer with smallest ID succeeds

(15)

Example: Global Or

Input in

x

[

1 ..

p

]

Initilize memory location Result

=

0

Parallel on PE

i

=

1 ..

p

if x[i] then Result := 1

Global And

Initilize memory location Result

=

1

(16)

Example: Maximum on Common CRCW PRAM

[JáJá Algorithm 2.8]

Input:

A

[

1 ..

n

]

//

distinct elements

Output:

M

[

1 ..

n

]

//

M

[

i

] =

1 iff

A

[

i

] =

max

_j

A

[

j

]

forall

(

i

,

j

)

_{∈ {}

1 ..

n

_}

2

dopar

B

[

i

,

j

]

:=

A

[

i

]

_≥

A

[

j

]

forall

i

_{∈ {}

1 ..

n

_}

dopar

M

[

i

]

:=

n

^

j=1

B

[

i

,

j

]

//

p

arallel subroutine

O

₍

₁

₎

_time

Θ

n

2

PEs (!)

(17)

i

A

B 1

2

3

4 5 <- j

M

1

3 _*

0

1

0

1

2

5

1 _*

1

0

1

3

2

0

0 _*

0

1

4

8

1

1 _*

1

5

1

0

0 _*

1 A

3

5

2

8

1 ---i

A

B 1

2

3

4 5 <- j

M

1

3 _*

0

1

0

1

0

2

5

1 _*

1

0

1

0

3

2

0

0 _*

0

1

0

4

8

1

1 _*

1 1->maxValue=8

5

1

0

0 _*

0

(18)

Describing Parallel Algorithms

Pascal-like pseudocode

Explicitly parallel loops [JáJá S. 72]

Single Program Multiple Data principle. The PE-index is used to

(19)

Analysis of Parallel Algorithms

In principle – just another parameter:

p

. Find execution time

T

(

I

,

p

)

.

Problem: interpretation.

Work:

W

=

pT

(

p

)

is a cost measure. (e.g. Max:

W

=

Θ

n

2

) Span:

T

_∞

=

inf

_p

T

(

p

)

measures Parallelizability.

(absolute) Speedup:

S

=

T

_seq

/

T

(

p

)

.

Use the best known sequential algorithm.

Relative Speedup

S

_rel

=

T

(

1 )

/

T

(

p

)

is usually different! (e.g. Maximum:

S

=

Θ

(

n

)

,

S

_rel

=

Θ

n

2

)

Efficiency:

E

=

S

/

p

. Goal:

E

≈

1

or, at least

E

=

Θ

(

1 )

. „Superlinear speedup“:

E

>

1

. (possible?). Example maximum:

E

=

Θ

(

1 /

n

)

.

(20)

PRAM vs. Real Parallel Computers

Distributed Memory

PE 1 PE 2 PE

memorylocal memorylocal memorylocal

p

...

(21)

(Symmetric) Shared Memory

cache network

0 processors

1 ...

p−1

(22)

Problems

Asynchronous design, analysis, implementation, and

debugging are much more difficult than for PRAM

Contention for memory module / cache line Example:

The

Θ

(

1 )

PRAM Algorithm for global OR becomes

Θ

(

p

)

.

local/cache-memory is (much) faster than global memory

The network gets more complex with growing

p

while latencies grow

Contention in the network

Maximum local memory consumption more important than total memory consumption

(23)

Realistic Shared-Memory Models

asynchronous

aCRQW: asynchronous concurrent read queued write. When

x

PEs contend for the same memory cell, this costs time

O

(

x

)

.

consistent write operations using atomic operations

memory hierarchies

(24)

Atomic Instuctions: Compare-And-Swap

General and widely available:

Function

CAS

(

a

,

expected

,

desired

)

:

{

0 ,

1 }

BeginTransaction

if

_∗

a

=

expected

then

∗

a

:=

desired;

return

1 //

success

else

expected

:=

∗

a

;

return

0 //

_failure

(25)

Further Operations for Consistent Memory

Access:

Fetch-and-add

Hardware transactions

Function

fetchAndAdd

(

a

,

∆

)

expected

:=

∗

a

repeat

desired

:=

expected

+

∆

until

CAS

(

a

,

expected

,

desired

)

(26)

Parallel External Memory

M

PEM

...

(27)

Models with Interconnection Networks

memory RAMs network

0 processors

1 ...

...

p−1

PEs are RAMs

asynchronous processing

Interaction by message exchange

(28)

Real Maschines Today

Internet SSD disks tape main memory L3 L2 cache core compute node L1 SIMD network

more compute nodes threads

superscalar

processor

(29)

Handling Complex Hierarchies

Proposition: we get quite far with flat Models, in particular for shared

memory.

Design distributed, implement hierarchy adapted

(30)

Explicit „Store-and-Forward“

We know the interconnection graph

(

V

=

_{

1 , . . . ,

p

_}

,

E

_⊆

V

_×

V

)

. Variants:

–

V

=

_{

1 , . . . ,

p

_{} ∪}

R

with additional

„dumb“ router nodes (perhaps with buffer memory).

–

Buses

→

Hyperedges

In each time step each edge can transport up to

k

′ data packets of constant length (usually

k

′

=

1

)

On a

k

-port-machine, each node can simultaneously send or receive

k

Packets gleichzeitig senden oder empfangen.

k

=

1

is called single-ported.

(31)

Discussion

+

simple formulation

−

low level

⇒

„messy algorithms“

−

Hardware router allow fast communication whenever an unloaded communication path is found.

(32)

Typical Interconnection Networks

3D−mesh

hypercube

fat tree

root

mesh

torus

(33)

Fully Connected Point-to-Point

E

=

V

_×

V

, single ported

T

_comm

(

m

) =

α

+

mβ

. (

m

=

message length in machine words)

+

Realistic handling of message lengths

+

Many interconnection networks approximate fully connected networks

⇒

sensible abstraction

+

No overloaded edges

→

OK for hardware router

+

„artificially“ increasing

α

,

β

→

OK for „weak“ networks

+

Asynchronous model

(34)

Fully Connected: Variants

What can PE

I

do in time

T

_comm

(

m

) =

α

+

mβ

? message length

m

.

half duplex:

1 ×

send or

1 ×

receive (also called simplex) Telephone:

1 ×

send to PE

j

and

1 ×

receive from PE

j

(full)duplex:

1 ×

send and

1 ×

receive. Arbitrary comunication partners Effect on running time:

(35)

BSP

B

ulk

S

ynchronous

P

arallel

[McColl LNCS Band 1000, S. 46]

Machine

described by three parameters:

p

,

L

and

g

.

L

: Startup overhead for one collective message exchange – involving all PEs

g

: gap

≈

_{communication bandwidth}computation speed

Superstep:

Work locally. Then collective global synchronized data exchange with arbitrary messages.

w

: max. local work (clock cycles)

h

: max. number of machine words sent or received by a PE (

h

-relation)

(36)

BSP versus Point-to-Point

By naive direct data delivery:

Let

H

=

max #messages of a PEs. Then

T

≥

α

(

H

+

log

p

) +

h

β

.

Worst case

H

=

h

. Thus

L

≥

α

log

p

and

g

≥

α

?

By all-to-all and direct data delivery:

Then

T

≥

α

p

+

h

β

.

Thus

L

≥

α

p

and

g

≈

β

?

By all-to-all and

indirect

data delivery:

Then

T

=

Ω

(

log

p

(

α

+

h

β

))

.

(37)

BSP

Truly efficient parallel algorithms:

c

-optimal multisearch for an extension of the BSP model,

Armin Bäumker and Friedhelm Meyer auf der Heide, ESA 1995. Redefinition of

h

to # blocks of size

B

, e.g.

B

=

Θ

(

α

/

β

)

.

Let

M

_i be the set of messages sent or received by PE

i

. Let

h

=

max

_i

_∑

_m_∈_M_i

⌈|

m

|

/

B

⌉

.

Let

g

the gap between sending packets of size

B

. Then, once more

w

+

L

+

gh

(38)

BSP

versus Point-to-Point

With naive direct data delivery:

L

_≈

α

log

p

(39)

BSP

We augment BSP by collective operations

broadcast (all-)reduce prefix-sum

with message lengths

h

.

Stay tuned for algorithms which justify this.

BSP∗-algorithms are up to a factor

Θ

(

log

p

)

slower than BSP+-algorithms.

(40)

MapReduce

D = [ c_∈C ρ(c) C = _{(k,X) : k _∈ K_∧ X = _{x : (k,x) _∈ B_{} ∧}X ₆= /0_} A _⊆ I B = [ a_∈A µ(a) _⊆ K _×V ... 1: map 2: shuffle 3: reduce

(41)

MapReduce Example WordCount

(birthday, {1,1,1}) (to, {1,1}) (you, {1,1}) (dear, {1}) (Timmy, {1}) (happy, {1,1,1})

(to, 2) (you, 2) (birthday, 3)

(dear, 1) (Timmy, 1) (happy, 3) happy birthday to you

(birthday,1) (happy,1) (birthday,1) (to,1) (you,1) (happy,1)(birthday,1)

(dear,1) (Timmy,1) (happy,1) (birthday,1) (to,1) (you,1) happy birthday to you

happy birthday dear Timmy

D = [ c_∈C ρ(c) C = _{(k,X) : k _∈ K_∧ A _⊆ I B = [ a_∈A µ(a) _⊆ K _×V X = _{x : (k,x) _∈ B_{} ∧}X ₆= /0_} ... 1: map 2: shuffle 3: reduce

(42)

MapReduce Discussion

D = [ c_∈C ρ(c) C = _{(k,X) : k _∈ K_∧ X = _{x : (k,x) _∈ B_{} ∧}X ₆= /0_} A _⊆ I B = [ a_∈A µ(a) _⊆ K_×V ... 1: map 2: shuffle 3: reduce

+

Abstracts away difficult issues

* parallelization * load balancing * fault tolerance * memory hierarchies * . . .

−

Large overheads

−

Limited functionality

(43)

MapReduce – MRC Model

[Karloff Suri Vassilvitskii 2010]

D = [ c_∈C ρ(c) C = _{(k,X) : k _∈ K_∧ A _⊆ I B = [ a_∈A µ(a) _⊆ K _×V X = _{x : (k,x) _∈ B_∧ X ₆= /0_} ... 1: map 2: shuffle 3: reduce A problem is in MRC iff for input of size

n

:

solvable in

O

(

polylog

(

n

))

MapReduce steps

µ

and

ρ

evaluate in time

O

(

poly

(

n

))

µ

and

ρ

use space

O

n

1−ε

(“substantially sublinear”)

overall space for

B

O

n

2−ε

(“substantially subquadratic”) Roughly: count steps,

(44)

MapReduce – MRC Model

Criticism

A problem is in MRC iff for input of size

n

:

solvable in

O

(

polylog

(

n

))

MapReduce steps

µ

and

ρ

evaluate in time

O

(

poly

(

n

))

n

2?

n

42? “big” data?

µ

and

ρ

use space

O

n

1−ε

(“substantially sublinear”)

overall space for

B

O

n

2−ε

(“substantially subquadratic”) “big” data? Roughly: count steps, speedup? efficiency? very loose constraints on everything else

(45)

MapReduce – MRC+ Model

[Sanders IEEE BigData 2020]

w

Total work for

µ

,

ρ

ˆ

w

Maximal work for

µ

,

ρ

m

Total data volume for

A

∪

B

∪

C

∪

D

ˆ

m

Maximal object size in

A

∪

B

∪

C

∪

D

,

Theorem: Lower and upper bound for parallel execution:

Θ

w

p

+

w

ˆ

+

log

p

work,

Θ

m

p

+

m

ˆ

+

log

p

bottleneck communication volume.

D = [ c_∈C ρ(c) C = _{(k,X) : k _∈ K_∧ A _⊆ I B = [ a_∈A µ(a) _⊆ K _×V X = _{x : (k,x) _∈ B_∧ X ₆= /0_} ... 1: map 2: shuffle 3: reduce

(46)

Implementing MapReduce on BSP

2 Supersteps:

1. Map local data.

Send

(

k

,

v

)

∈

B

to PE

h

(

k

)

(hashing) 2. Receive elements of

B

.

Build elements of

C

. Run reducer.

(47)

MapReduce on BSP – Example

3 2

1

PE: 4

work data dependence

map superstep 1 superstep 2 shuffle reduce 27 15 2

21

(48)

MapReduce on BSP – Analysis

Time

2 L

+

O

w

p

+

g

m

p

if

(49)

Graph- und Circuit Representation of Algorithms

a a+b a+b+c a+b+c+d

a b c d

Many computations can be represented as a

directed acyclic graph

Input nodes have indegree

0

and a fixed output

Ouput nodes have outdegree

0

and indegree

1

The indegree is bounded by a small constant.

Inner nodes compute a function, that can be computed in constant time.

(50)

Circuits

Variant: When a constant number of bits rather than machine words are processed, we get circuits.

The depth

d

(

S

)

of the computation-DAG is the number of inner nodes on the longest path from an input to an output

∼

time

One circuit for each input size (specified algorithmically)

⇒

circuit family

(51)

Example: Associative Operations (

=

Reduction)

Satz 1.

Let

_⊕

denote an associative operator that can be

computed in constant time. Then,

M

i<n

x

_i

:

= (

···

((

x

₀

_⊕

x

₁

)

_⊕

x

₂

)

_{⊕ ··· ⊕}

x

_n₋₁

)

can be computed in time

O

₍

_log

_n

₎

_{on a PRAM and in time}

O

₍

_α

_log

_n

₎

_{on a linear array with hardware router}

(52)

Proof Outline for

n

=

2 (wlog?)

Induction hypothesis:

∃

circuit of depth

k

for

L

_i_<₂k

x

_i

k

=

0

: trivial

k

+

1

:

M

i<2k+1

x

_i

=

depth k

z }| {

_M

i<2k

x

_i

_⊕

depth k (IH)

z

_M

}|

{

i<2k

x

_i₊₂k

|

{z

}

Tiefe k+1

k

k+1

2

1

0

(53)

PRAM Code

PE index

i

∈ {

0 , . . . ,

n

₋

1 }

active := 1

for

0 _≤

k

<

_⌈

log

n

_⌉

do

if

active

then

if

bit

k

of

i

then

active := 0

else if

i

+

2

k

<

n

then

x

_i :=

x

_i

⊕

x

_i₊₂k

//

result is in

x

₀

Careful: much more complicated on a real asynchronous shared-memory machine.

Speedup? Efficiency?

log

x

here always

log

₂

x

1 2 3 4 5 6 7 8 9 a b c d e f 0

(54)

Analysis

n

PEs Time

O

(

log

n

)

Speedup

O

(

n

/

log

n

)

Efficiency

O

(

1 /

log

n

)

1 2 3 4 5 6 7 8 9 a b c d e f 0 x

(55)

Less is More (Brent’s Principle)

p

PEs

Each PE adds

n

/

p

elements sequentially. Then parallel sum

for

p

subsums

Time

T

_seq

(

n

/

p

) +

Θ

(

log

p

)

Efficiency

T

_seq

(

n

)

p

(

T

_seq

(

n

/

p

) +

Θ

(

log

p

))

=

1

1 +

Θ

(

p

log

(

p

))

/

n

=

1 −

Θ

p

log

p

n

if

n

≫

p

log

p

p n/p

(56)

D

istributed

M

emory

M

achine

PE index

i

∈ {

0 , . . . ,

n

₋

1 }

//

Input

x

_i

located on PE

i

active := 1

s

:=

x

_i

for

0 _≤

k

<

_⌈

log

n

_⌉

do

if

active

then

if

bit

k

of

i

then

sync-send

s

to PE

i

−

2

k active := 0

else if

i

+

2

k

<

n

then

receive

s

′ from PE

i

+

2

k

s

:=

s

⊕

s

′

//

result is in

s

on PE 0

1 2 3 4 5 6 7 8 9 a b c d e f 0 x

(57)

Analysis

fully connected:

Θ

((

α

+

β

)

log

p

)

linear array:

Θ

(

p

)

: step

k

needs time

2

k.

linear array with router:

Θ

((

α

+

β

)

log

p

)

, since edge congestion is one in every step

BSP

Θ

((

l

+

g

)

log

p

) =

Ω

log

2

p

(58)

Discussion Reductions Operation

Binary tree yields logarithmic running time

Useful for most models

Brent’s principle: inefficient algorithms get efficient by using less PEs

(59)

Matrix Multiplikation

Given: Matrices

A

∈

R

n×n,

B

∈

R

n×n with

A

= ((

a

_{i j}

))

und

B

= ((

b

_{i j}

))

R

: semiring

C

= ((

c

_{i j}

)) =

A

_·

B

well-known:

c

_{i j}

=

n

∑

k=1

a

_ik

_·

b

_{k j}

work:

Θ

n

3

arithmetical operations

(60)

A first PRAM Algorithm

n

3 PEs

for

i

:= 1

to

n

dopar

for

j

:= 1

to

n

dopar

c

_{i j}

:=

n

∑

k=1

a

_ik

_·

b

_{k j}

//

n

PE parallel sum

One PE für each product

c

_{ik j}

:=

a

_ik

b

_{k j}

Time

O

(

log

n

)

(61)

Distributed Implementation I

p

_≤

n

2 PEs

for

i

:= 1

to

n

dopar

for

j

:= 1

to

n

dopar

c

_{i j}

:=

n

∑

k=1

a

_ik

_·

b

_{k j}

Assign

n

2

/

p

of the

c

_{i j} to each PE

−

limited scalability

−

high communication volume. Time

Ω

β

n

2

√

_p

n/ p

n

(62)

Distributed Implementation II-1

[Dekel Nassimi Sahni 81, KGGK Section 5.4.4]

Assume

p

=

N

3,

n

is a multiple of

N

View

A

,

B

,

C

as

N

×

N

matrices,

each element is a

n

/

N

×

n

/

N

matrix

for

i

:= 1

to

N

dopar

for

j

:= 1

to

N

dopar

c

_{i j}

:=

N

∑

k=1

a

_ik

b

_{k j}

One PE for each product

c

_{ik j}

:=

a

_ik

b

_{k j}

n

n/N

(63)

Distributed Implementation II-2

store

a

_ik in PE

(

i

,

k

,

1 )

store

b

_{k j} in PE

(

1 ,

k

,

j

)

PE

(

i

,

k

,

1 )

broadcasts

a

_ik to PEs

(

i

,

k

,

j

)

for

j

∈ {

1 ..

N

}

PE

(

1 ,

k

,

j

)

broadcasts

b

_{k j} to PEs

(

i

,

k

,

j

)

for

i

∈ {

1 ..

N

}

compute

c

_{ik j}

:=

a

_ik

b

_{k j} on PE

(

i

,

k

,

j

)

//

local!

PEs

(

i

,

k

,

j

)

for

k

∈ {

1 ..

N

}

compute

c

_{i j}

:=

N

∑

k=1

c

_{ik j} to PE

(

i

,

1 ,

j

)

k

i

j

B

A

C

(64)

Analysis, Fully Connected, etc.

store

a

_ik in PE

(

i

,

k

,

1 )

//

free (or cheap)

store

b

_{k j} in PE

(

1 ,

k

,

j

)

//

free (or cheap)

PE

(

i

,

k

,

1 )

broadcasts

a

_ik to PEs

(

i

,

k

,

j

)

for

j

∈ {

1 ..

N

_}

PE

(

1 ,

k

,

j

)

broadcasts

b

_{k j} to PEs

(

i

,

k

,

j

)

for

i

∈ {

1 ..

N

}

compute

c

_{ik j}

:=

a

_ik

b

_{k j} on PE

(

i

,

k

,

j

)

//

T

_seq

(

n

/

N

) =

O

(

n

/

N

)

3

PEs

(

i

,

k

,

j

)

for

k

∈ {

1 ..

N

}

compute

c

_{i j}

:=

N

∑

k=1

c

_{ik j} to PE

(

i

,

1 ,

j

)

Communication:

2T

_broadcast

(

Obj. size

z }| {

(

n

/

N

)

2

,

PEs

z}|{

N

)+

T

_reduce

((

n

/

N

)

2

,

N

)

_≈

3 T

_broadcast

((

n

/

N

)

2

,

N

)

N=p1/3

···

O

n

3

p

+

β

n

2

p

2/3

+

α

log

p

(65)

Discussion Matrix Multiplikation

PRAM Alg. is a good starting point

DNS algorithm saves communication but needs factor

Θ

√

3

p

more space than other algorithms

good for small matrices (for big ones communication is irrelevant)

Pattern for dense linear algebra:

Locale Ops on submtrices

+

Broadcast

+

Reduce

(66)

Broadcast and Reduction

Broadcast: One for all

One PE (e.g. 0) sends message of length

n

to all PEs

p−1

0

1

2 n

...

Reduction: One for all

One PE (e.g. 0) receives sum of

p

messages of length

n

(vector addition

6 =

local addition)

(67)

Broadcast

Reduction

turn around direction of communication

add corresponding parts of arriving and own

messages All the following

broadcast algorithms yield

reduction algorithms

for commutative and associative operations.

Most of them (except Johnsson/Ho and some network embeddings) also work for noncommutative operations.

p−1

0

1

2 n

(68)

Modelling Assumptions

fully connected

fullduplex – parallel send and receive

Variants: halfduplex, i.e., send or receive, BSP, embedding into concrete networks

(69)

Naive Broadcast

[KGGK Section 3.2.1]

Procedure

naiveBroadcast

(

m

[

1 ..

n

])

PE 0:

for

i

:= 1

to

p

−

1 do

send

m

to PE

i

PE

i

>

0

: receive

m

Time:

(

p

−

1 )(

n

β

+

α

)

nightmare for implementing a scalable algorithme

p−1

0

1

2 n

...

p−1

(70)

Binomial Tree Broadcast

Procedure

binomialTreeBroadcast

(

m

[

1 ..

n

])

PE index

i

∈ {

0 , . . . ,

p

−

1 }

//

Message

m

located on PE 0

if

i

>

0 then

receive

m

for

k

:= min

{⌈

log

n

_⌉

,

trailingZeroes

(

i

)

} −

1 downto

0 do

send

m

to PE

i

+

2

k

//

noop if receiver

≥

p

1 2 3 4 5 6 7 8 9 a b c d e f 0

(71)

Analysis

Time:

⌈

log

p

_⌉

(

n

β

+

α

)

Optimal for

n

=

1

Embeddable into linear array

n

_·

f

(

p

)

n

+

log

p

?

1 2 3 4 5 6 7 8 9 a b c d e f 0

(72)

Linear Pipeline

Procedure

linearPipelineBroadcast

(

m

[

1 ..

n

]

,

k

)

PE index

i

∈ {

0 , . . . ,

p

−

1 }

//

Message

m

located on PE 0

//

assume

k

divides

n

define piece

j

as

m

[(

j

−

1 )

n_k

+

1 ..

j

n_k

]

for

j

:= 1

to

k

+

1 do

receive piece

j

from PE

i

−

1 //

noop if

i

=

0 or

j

=

k

+

1

and, concurrently,

send piece

j

−

1

to PE

i

+

1 //

noop if

i

=

p

−

1 or

j

=

1

5 4 3 2 1 7 6

(73)

Analysis

Time n_k

β

+

α

per step (

6 =

Iteration)

p

₋

1

steps until first packet arrives

Then 1 step per further packet

T

(

n

,

p

,

k

)

:

_n

k

β

+

α

(

p

+

k

₋

2 ))

optimales

k

:

r

n

(

p

₋

2 )

β

α

T

∗

(

n

,

p

)

:

≈

n

β

+

p

α

+

2 p

np

αβ

5 4 3 2 1 7 6

(74)

0 1 2 3 4 5 0.01 0.1 1 10 100 1000 10000 T/(nT byte +ceil(log p)T start ) nT_byte/T_start bino16 pipe16

(75)

0 2 4 6 8 10 0.01 0.1 1 10 100 1000 10000 T/(nT byte +ceil(log p)T start ) nT_byte/T_start bino1024 pipe1024

(76)

Discussion

Linear pipelining is optimal for fixed

p

und

n

→

∞

But for large

p

extremely large messages needed

(77)

Procedure

binaryTreePipelinedBroadcast

(

m

[

1 ..

n

]

,

k

)

//

Message

m

located on

root

, assume

k

divides

n

define piece

j

as

m

[(

j

−

1 )

n_k

+

1 ..

j

n_k

]

for

j

:= 1

to

k

do

if

parent exists

then

receive piece

j

if

left child

ℓ

exists

then

send piece

j

to

ℓ

if

right child

r

exists

then

send piece

j

to

r

right

recv left recv right left recv right left recv right left right 11 12 13 8 9 10

recv left recv right left left recv right left recv right 6

(78)

Example

right recv

left recv right left recv right left recv right left right

11 12 13

8 9 10

recv left recv right left left recv right left recv right

6

(79)

Analysis

time n_k

β

+

α

per step (

6 =

iteration)

2 j

steps until first packet reaches level

j

how many levels?

d

:=

⌊

log

p

⌋

then 3 steps for each further packet Overall:

T

(

n

,

p

,

k

)

:=

(

2 d

+

3 (

k

−

1 ))

_n

k

β

+

α

optimal

k

:

r

n

(

2 d

₋

3 )

β

3 α

(80)

Analysis

Time n_k

β

+

α

per step (

6 =

iteration)

d

:=

⌊

log

p

_⌋

levels

Overall:

T

(

n

,

p

,

k

)

:=

(

2 d

+

3 (

k

−

1 ))

_n

k

β

+

α

optimal

k

:

r

n

(

2 d

₋

3 )

β

3 α

substituted:

T

∗

(

n

,

p

) =

2 d

α

+

3 n

β

+

O

p

nd

αβ

(81)

Fibonacci-Trees

0

1

2

3

4

1

2

4

7

12

(82)

Analysis

Time n_k

β

+

α

per step (

6 =

iteration)

j

steps until first packet reaches level

j

How many PEs

p

_j with level

0 ..

j

?

p

₀

=

1

,

p

₁

=

2

,

p

_j

=

p

_j₋₂

+

p

_j₋₁

+

1

ask Maple,

rsolve(p(0)=1,p(1)=2,p(i)=p(i-2)+p(i-1)+1,p(i));

p

_j

_≈

3 √

5 +

5

5 (

√

5 −

1 )

Φ

j

≈

1 .

89 Φ

j with

Φ

=

1+ √ 5 2 (golden ratio)

d

_≈

log

_Φ

p

levels overall:

T

∗

(

n

,

p

) =

d

α

+

3 n

β

+

O

p

nd

αβ

(83)

Procedure

fullDuplexBinaryTreePipelinedBroadcast

(

m

[

1 ..

n

]

,

k

)

//

Message

m

located on

root

, assume

k

divides

n

define piece

j

as

m

[(

j

−

1 )

n_k

+

1 ..

j

n_k

]

for

j

:= 1

to

k

+

1 do

receive piece

j

from parent

//

noop for root or

j

=

k

+

1

and, concurrently, send piece

j

−

1

to right child

//

noop if no such child or

j

=

1

send piece

j

to left child

//

noop if no such child or

j

=

k

+

1

(84)

Analysis

Time n_k

β

+

α

per step

j

steps until first packet level

j

reaches

d

_≈

log

_Φ

p

levels

Then 2 steps for each further packet

(85)

0 1 2 3 4 5 0.01 0.1 1 10 100 1000 10000 T/(nT byte +ceil(log p)T start ) nT_byte/T_start bino16 pipe16 btree16

(86)

0 2 4 6 8 10 0.01 0.1 1 10 100 1000 10000 100000 1e+06 T/(nT byte +ceil(log p)T start ) nT_byte/T_start bino1024 pipe1024 btree1024

(87)

Discussion

Fibonacci trees are a good compromise for all

n

,

p

. General

p

:

(88)

(89)

(90)

Disadvantages of Tree-based Broadcasts

Leaves only receive their data. Otherwise they are unproductive

Inner nodes send more than they receive

(91)

23-Broadcast: Two T(h)rees for the Price of One

Binary-Tree-Broadcasts using two trees A und B at once

Inner nodes of A are

leaves of B and vice versa

Per double step:

One packet received as a leaf

+

one packet received and forwarde as inner node i.e., 2 packets sent and received

0 0 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 1 1 1 0 13 12 11 10 9 8 7 6 5 4 3 2 1 0 14 0 1

(92)

Root Process

for

j

:= 1

to

k

step

2 do

send piece

j

+

0

along edge labelled 0

send piece

j

+

1

along edge labelled 1

0

1

0

1

0

1

0

1

0

1

0

13

12

11

10

9

8

7

6

5

4

3

2

1

0

14

0

1

(93)

Other Processes,

Wait for first piece to arrive

if

it comes from the upper tree over an edge labelled

b

then

∆

:= 2

_·

distance of the node from the bottom in the upper tree

for

j

:= 1

to

k

+

∆

step

2 do

along

b

-edges: receive piece

j

and send piece

j

−

2

along

1 −

b

-edges: receive piece

j

+

1 −

∆

and send piece

j

0

1

0

1

0

1

0

1

0

1

0

12

10

8

6

4

2

0

14

0

1

3

5

7

9

11

13

(94)

Arbitrary Number of Processors

0 1 12 11 10 9 8 7 6 5 4 3 2 1 0 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 1 1 1 1 0 1 13 12 11 10 9 8 7 6 5 4 3 2 1 0

(95)

Arbitrary Number of Processors

0

12

11

10

9

8

7

6

5

4

3

2

1

0

1

0

1

0

1

0

1

0

1

0

1

0

1

0

1

0

1

0

1

0

1

11

10

9

8

7

6

5

4

3

2

1

0

0 ₀

₁

1

0

1

0

(96)

Arbitrary Number of Processors

1

0

1

0

1

0

1

0

1

0

1

11

10

9

8

7

6

5

4

3

2

1

0

1

0

1

0

1

0

1

0

1

0

1

0

10

9

8

7

6

5

4

3

2

1

0

1

0

1

1 ₁

₀

(97)

Arbitrary Number of Processors

1

0

1

0

1

0

1

0

1

0

10

9

8

7

6

5

4

3

2

1

0

1

0

1

0

1

0

9

8

7

6

5

4

3

2

1

0

1

0

1 ₀

1

0

1

0

1

0

1

0 ₀

₁

₀

₁

1

0

(98)

Arbitrary Number of Processors

7

6

5

4

3

2

1

0

6

5

4

3

2

1

0

9

8

7

6

5

4

3

2

1

0

1

2

3

4

5

6

7

8

(99)

Constructing the Trees

Case

p

=

2

h

−

1

: Upper tree

+

lower tree

+

root

Upper tree: complete binary tree of height

h

−

1

,

−

richt leaf

Lower tree: complete binary tree of height

h

−

1

,

−

left leaf

Lower tree

≈

Upper tree shifted by one.

Inner nodes upper tree

=

leaves of lower tree. Inner nodes lower tree

=

leaves of upper tree.

0 0 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 1 1 1 0 13 12 11 10 9 8 7 6 5 4 3 2 1 0 14 0 1

(100)

Building Smaller Trees (without root)

invariant

: last node has outdegree 1 in tree

x

invariant

: last node has outdegree 0 in tree

x

¯

p

₋

1

:

Remove last node:

right node in

x

now has degree 0 right node in

x

¯

now has degree 1

0 1 12 11 10 9 8 7 6 5 4 3 2 1 0 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 1 1 1 1 0 1 13 12 11 10 9 8 7 6 5 4 3 2 1 0

(101)

Coloring Edges

Consider bipartite graph:

B

= (

s

₀

, . . . ,

s

_p₋₁

_∪

r

₀

, . . . ,

r

_p₋₂

,

E

)

.

s

_i: Sender role of PE

i

.

r

_i: Receiver role of PE

i

.

2 ×

degree 1. all other degree

2

.

⇒

B

is a path plus even circles.

color edges alternatingly with 0 and 1.

14 13 13 12 11 10 9 8 7 6 5 4 3 2 1 0

s

r

12 11 10 9 8 7 6 5 4 3 2 1 0 0 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 1 1 1 0 12 10 8 6 4 2 0 14 0 1 1 3 5 7 9 11 13 0

(102)

Open Question:

Parallel Coloring ?

In Time Polylog

(

p

)

using list ranking.

(unfortunately impractical for small inputs)

Fast explicit calculation color

(

i

,

p

)

without communication ? Mirror layout:

(103)

Jochen Speck’s Solution

//

Compute color of edge entering node

i

in the upper tree.

//

h

is a lower bound on the height of node

i

.

Function

inEdgeColor

(

p

,

i

,

h

)

if

i

is the root of

T

₁

then return

1

while

i

bitand

2

h

=

0 do

h

++

//

compute height

i

′

:=







i

₋

2

h if

2

h+1 bitand

i

=

1 ∨

i

+

2

h

>

p

i

+

2

h else

//

compute parent of

i

(104)

Analysis

Time n_k

β

+

α

pro step

2 j

steps until all PEs in level

j

are reached

d

=

_⌈

log

(

p

+

1 )

_⌉

levels

Then 2 steps for 2 further packets

T

(

n

,

p

,

k

)

_≈

n

k

β

+

α

(

2 d

+

k

₋

1 ))

, with

d

≈

log

p

optimal

k

:

r

n

(

2 d

₋

1 )

β

α

T

∗

(

n

,

p

)

:

≈

n

β

+

α

· 2 log

p

+

p

2 n

log

p

αβ

(105)

0 1 2 3 4 5 0.01 0.1 1 10 100 1000 10000 T/(nT byte +ceil(log p)T start ) nT_byte/T_start bino16 pipe16 2tree16

(106)

0 2 4 6 8 10 0.01 0.1 1 10 100 1000 10000 100000 1e+06 T/(nT byte +ceil(log p)T start ) nT_byte/T_start bino1024 pipe1024 2tree1024

(107)

Implementation in the Simplex-Model

2 steps duplex 4 steps simplex.

1 PE duplex 1 simplex couple

=

sender + receiver.

0

1

0

1

0

2

0

2

1

3

(108)

23-Reduktion

Numbering is an inorder-numbering for both trees !

root

<root

>root

n

otherwise:

13 12 11 8 7 6 5 4 3 2 1 0 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 14 9 10

(109)

Another optimal Algorithm

[Johnsson Ho 85: Optimal Broadcasting and Personalized

Communication in Hypercube, IEEE Transactions on Computers, vol. 38, no.9, pp. 1249-1268.]

Model: full-duplex limited to a single edge per PE (telephone model) Adaptation half-duplex: everything

×

2

(110)

Hypercube

H

_d

p

=

2

d PEs

nodes

V

=

{

0 ,

1 }

d, i.e., write node number in binary

edges in dimension

i

:

E

_i

=

(

u

,

v

)

:

u

⊕

v

=

2

i

E

=

E

₀

_{∪ ··· ∪}

E

_d₋₁

0

1

2

3

4

(111)

ESBT-Broadcasting

In step

i

communication along dimension

i

mod

d

Decompose

H

_d into

d

Edge-disjoint Spanning Binomial Trees

0

d cyclically distributes packet to roots of the ESBTs

ESBT-roots perform binomial tree broadcasting (except missing smallest subtree

0

d)

step 0 mod 3 step 1 mod 3 step 2 mod 3

100 101 110 111 101 011 001 010 100 111 011 110 010 100 001 111 110 101 001 010 110 101 011 100 000 111 010 000 001 011

(112)

Analysis, Telephone model

k

packets,

k

divides

n

k

steps until last packet left PE 0

d

steps until it has reached the last leaf

Overall

d

+

k

steps

T

(

n

,

p

,

k

) =

n

k

β

+

α

(

k

+

d

)

optimal

k

:

r

nd

β

α

T

∗

(

n

,

p

)

:

=

n

β

+

d

α

+

p

nd

αβ

(113)

Discussion

n

binomial

tree

p

linear

pipeline

small

large

binary tree

p=2^d

EBST

Y

N

23−Broadcast

spcecial alg.

depending on

network?

(114)

Reality Check

Libraries (z.B. MPI) often do not have a pipelined implementation of collective operations your own broadcast may be significantly faster than a library routine

choosing

k

is more complicated: only piece-wise linear cost function for point-to-point communication, rounding

Hypercube gets slow when communication latencies have large

variance

perhaps modify Fibonacci-tree etc. in case of asynchronous communication (Sender finishes before receiver). Data should reach all leaves about at the same time.

(115)

Broadcast for Library Authors

One

Implementation? 23-Broadcast

Few, simple Variants?

{

binomial tree

,

23-Broadcast

}

or

{

binomial tree

,

23-Broadcast

,

linear pipeline

}

(116)

Beyond Broadcast

Pipelining is important technique for handling large data sets

hyper-cube algorithms are often elegant and efficient. (and often simpler than ESBT)