Communication Cost in Big Data Processing

(1)

Communication Cost in Big

Data Processing

Dan Suciu

University of Washington

Joint work with Paul Beame and Paris Koutris

and the Database Group at the UW

(2)

Queries on Big Data

Big Data processing on distributed clusters

• General programs: MapReduce, Spark

• SQL: PigLatin,, BigQuery, Shark, Myria…

Traditional Data Processing

• Main cost = disk I/O

Big Data Processing

• Main cost = communication

(3)

This Talk

• How much communication is needed to solve

a “problem” on

p

servers?

• “Problem”:

–

 

In this talk = Full Conjunctive Query CQ (

≈

SQL)

–

 

Future work = extend to linear algebra, ML, etc.

• Some techniques discussed in this talk are

implemented in Myria (Bill Howe’s talk)

(4)

Models of Communication

• Combined communication/computation cost:

–

 

PRAM

à

MRC [Karloff’2010]

–

 

BSP [Valiant]

à

LogP Model [Culler]

• Explicit communication cost

–

 

[Hong&Kung’81] red/blue pebble games, for I/O

communication complexity

à

[Ballard’2012] extension to parallel algorithms

–

 

MapReduce model [DasSarma’2013]

MPC [Beame’2013,Beame’2014]

– this talk

(5)

Outline

• The MPC Model

• Communication Cost for Triangles

• Communication Cost for CQ’s

• Discussion

(6)

Massively Parallel Communication

Model (MPC)

Extends BSP [Valiant]

Server 1 . . . . Server p

Input (size=M)

Input data

= size

M

bits

Number of servers

=

p

O(M/p) _O(_M_/_p₎

(7)

Massively Parallel Communication

Model (MPC)

Extends BSP [Valiant] Server 1 . . . . Server p Server 1 . . . . Server p Input (size=M) Round 1

Input data

= size

M

bits

One round

=

Compute & communicate

Number of servers

=

p

O(M/p) _O(_M_/_p₎

(8)

Massively Parallel Communication

Model (MPC)

Extends BSP [Valiant] Server 1 . . . . Server p Server 1 . . . . Server p Server 1 . . . . Server p Input (size=M) . . . . Round 1 Round 2 Round 3 . . . .

Input data

= size

M

bits

Algorithm

= Several rounds

One round

=

Number of servers

=

p

O(M/p) _O(_M_/_p₎

(9)

Massively Parallel Communication

Model (MPC)

Input data

= size

M

bits

Max communication load per server

=

L

Algorithm

= Several rounds

One round

=

Number of servers

=

p

L L L L L L O(M/p) _O(_M_/_p₎ 9

(10)

Massively Parallel Communication

Model (MPC)

Input data

= size

M

bits

Max communication load per server

=

L

Ideal:

L

=

M

/

p

This talk:

L

=

M

/

p * p

ε

_{, where}

_ε

_{in [0,1] is called}

_{space exponent}

Degenerate:

L

=

M

(send everything to one server)

Algorithm

= Several rounds

One round

=

Number of servers

=

p

L L L L L L O(M/p) _O(_M_/_p₎ 10

(11)

Example:

Q(x,y,z) = R(x,y)

⋈

_S(y,z)

Server 1 . . . . Server p M/p _M_/_p

Input:

• R

,

S

uniformly

partitioned on

p

servers

size(R)+size(S)=M 11

(12)

Example:

Q(x,y,z) = R(x,y)

⋈

_S(y,z)

Server 1 . . . . Server p Server 1 . . . . Server p Round 1 O(M/p) O(M/p) M/p _M_/_p

Round 1: each server

• Sends record

R(x,y)

to server

h(y) mod p

• Sends record

S(y,z)

to server

h(y) mod p

Input:

• R

,

S

uniformly

partitioned on

p

servers

size(R)+size(S)=M

(13)

Example:

Q(x,y,z) = R(x,y)

⋈

_S(y,z)

R(x,y) ⋈ S(y,z) R(x,y) ⋈ S(y,z)

Output:

• Each server computes and outputs

the local join

R(x,y)

⋈

_S(y,z)

Round 1: each server

• Sends record

R(x,y)

to server

h(y) mod p

• Sends record

S(y,z)

to server

h(y) mod p

Input:

• R

,

S

uniformly

partitioned on

p

servers

size(R)+size(S)=M

(14)

Example:

Q(x,y,z) = R(x,y)

⋈

_S(y,z)

R(x,y) ⋈ S(y,z) R(x,y) ⋈ S(y,z)

Output:

• Each server computes and outputs

the local join

R(x,y)

⋈

_S(y,z)

Round 1: each server

• Sends record

R(x,y)

to server

h(y) mod p

• Sends record

S(y,z)

to server

h(y) mod p

Input:

• R

,

S

uniformly

partitioned on

p

servers

Assuming no Skew:

∀

y,

degree

_R

(y), degree

_S

(y)

≤

M

/

p

Rounds

=

1 Load

L

= O(

M

/

p

) =

O(

M

/

p * p

0

₎

size(R)+size(S)=M

Space exponent = 0

(15)

Example:

Q(x,y,z) = R(x,y)

⋈

_S(y,z)

R(x,y) ⋈ S(y,z) R(x,y) ⋈ S(y,z)

Output:

• Each server computes and outputs

the local join

R(x,y)

⋈

_S(y,z)

Round 1: each server

• Sends record

R(x,y)

to server

h(y) mod p

• Sends record

S(y,z)

to server

h(y) mod p

Input:

• R

,

S

uniformly

partitioned on

p

servers

Assuming no Skew:

∀

y,

degree

_R

(y), degree

_S

(y)

≤

M

/

p

Rounds

=

1 Load

L

= O(

M

/

p

) =

O(

M

/

p * p

0

₎

size(R)+size(S)=M Space exponent = 0

SQL is

embarrassingly

parallel!

(16)

Questions

Fix a query

Q

and a number of servers

p

• What is the communication load

L

needed

to compute

Q

in

one round

? This talk

based on [Beame’2013, Beame’2014]

• Fix a maximum load

L

(space exponent

ε

)

How

many rounds

are needed for

Q

?

Preliminary results in [Beame’2013]

(17)

Outline

• The MPC Model

• Communication Cost for Triangles

• Communication Cost for CQ’s

• Discussion

(18)

The Triangle Query

• Input

: three tables

R(X, Y), S(Y, Z), T(Z, X)

size(

R

)+size(

S

)+size(

T

)=

M

• Output

: compute

Q(x,y,z) = R(x,y)

⋈

S(y,z)

⋈

T(z,x)

Z X Fred Alice Jack Jim Fred Jim Carol Alice … Y Z Fred Alice Jack Jim Fred Jim Carol Alice … X Y Fred Alice Jack Jim Fred Jim Carol Alice … R S T 18

(19)

Triangles in Two Rounds

• Round 1: compute

temp(X,Y,Z) = R(X,Y)

⋈

S(Y,Z)

• Round 2: compute

Q(X,Y,Z) = temp(X,Y,Z)

⋈

T(Z,X)

Load

L

depends on the size of

temp

. Can

be much larger than

M

/

p

(20)

Triangles in One Round

• Factorize

p

=

p

⅓

×

p

⅓

×

p

⅓

• Each server identified by (

i

,

j

,

k

)

i j

k

(i,j,k)

P⅓

Place the servers in a cube: _Server_(i,j,k)

[Ganguli’92, Afrati&Ullman’10, Suri’11, Beame’13]

20

(21)

Triangles in One Round

k (i,j,k) P⅓ Z X Fred Alice Jack Jim Fred Jim Carol Alice … Jack Jim Y Z Fred Alice Jack Jim Fred Jim Carol Alice Jim Jack Jim Jack X Y Fred Alice Jack Jim Fred Jim Carol Alice … R S T i = h(Fred) j = h(Jim) Fred Jim Fred Jim Fred Jim Fred Jim Jim Jack Jim Jack Fred Jim Jim Jack Jim Jack Fred Jim Fred Jim

Jack _JackJim _Jim

Round 1:

Send R(x,y) to all servers (h(x),h(y),*) Send S(y,z) to all servers (*, h(y), h(z)) Send T(z,x) to all servers (h(x), *, h(z))

Output:

compute locally R(x,y) ⋈ S(y,z) ⋈ T(z,x)

21

(22)

Upper Bound is

L

_algo

= O(

M

/

p

⅔

₎

22

Theorem

Assume data has no skew: all degree

≤

O(

M

/

p

⅓

_).

Then the algorithm computes

Q

in one round,

with communication load

L

_algo

= O(

M

/

p

⅔

₎

Compare to two rounds:

• No more large intermediate result

• Skew-free up to degrees = O(

M

/

p

⅓

) (from O(

M

/

p

))

• BUT: space exponent increased to

ε

=

⅓

(from

ε

= 1

)

Q(X,Y,Z) = R(X,Y)

⋈

S(Y,Z)

⋈

T(Z,X)

M

= size(

R

)+size(

S

)+size(

T

) (in bits)

(23)

Lower Bound is

L

_lower

=

M

/

p

⅔

# without this assumption need to add a multiplicative factor 3

* Beame, Koutris, Suciu, Communication Steps for Parallel Query Processing, PODS 2013

M

= size(

R

)+size(

S

)+size(

T

) (in bits)

23

(24)

Lower Bound is

L

_lower

=

M

/

p

⅔

M

= size(

R

)+size(

S

)+size(

T

) (in bits)

24

Theorem

* Any one-round deterministic algorithm

A

for

Q

requires

L

_lower

bits of communication per server

Q(X,Y,Z) = R(X,Y)

⋈

S(Y,Z)

⋈

T(Z,X)

Assume that the three input relations R, S, T are stored on disjoint servers#

(25)

Lower Bound is

L

_lower

=

M

/

p

⅔

M

= size(

R

)+size(

S

)+size(

T

) (in bits)

25

Theorem

* Any one-round deterministic algorithm

A

for

Q

requires

L

_lower

bits of communication per server

We prove a stronger claim, about any algorithm communicating <

L

_lower

bits

Let

L

be the algorithm’s max communication load per server

We prove:

if

L

<

L

_lower

then, over random permutations

R, S, T,

the algorithm returns at most a fraction (

L

/

L

_lower

)

3/2

_{of the answers to}

_Q

Q(X,Y,Z) = R(X,Y)

⋈

S(Y,Z)

⋈

T(Z,X)

Assume that the three input relations R, S, T are stored on disjoint servers#

(26)

Lower Bound is

L

_lower

=

M

/

p

⅔

Discussion:

• An algorithm with space exponent

ε

<

⅓

, has load

L = M/p

1-ε

_{and reports only (}

_L

_/

_L

lower

)

3/2

=

1/p

(1-3ε)/2

fraction of all answers. Fewer, as

p

increases!

• Lower bound holds only for random inputs: cannot

hold for fixed input

R, S, T

since the algorithm may

use a short bit encoding to signal this input to all

servers

• By Yao’s lemma, the result also holds for

randomized algorithms, some fixed input

26

(27)

[Balard’2012] consider Strassen’s matrix multiplication

algorithm

and prove

the following theorem for the

CDAG

model. If each server has

M

_

c

_·

_p2n/2lg7

memory, then the communication load per server:

IO

(n, p, M

) =

⌦

✓

n

p

M

◆

lg7

M

p

!

Thus, if

M

=

_p₂n_/2_lg₇

then

IO

(

n, p

) =

⌦

⇣

_p_2/n2_lg₇

⌘

Lower Bound is

L

_lower

=

M

/

p

⅔

• Stronger than MPC: no restriction on rounds

• Weaker than MPC

–

 

Applies only to an algorithm, not to a problem

–

 

CDAG model of computation v.s. bit-model

27

Q(X,Y,Z) = R(X,Y)

⋈

S(Y,Z)

⋈

T(Z,X)

M here same as n2_here

(28)

Lower Bound is

L

_lower

=

M

/

p

⅔

Q(X,Y,Z) = R(X,Y)

⋈

S(Y,Z)

⋈

T(Z,X)

Proof

28

(29)

Lower Bound is

L

_lower

=

M

/

p

⅔

Q(X,Y,Z) = R(X,Y)

⋈

S(Y,Z)

⋈

T(Z,X)

Proof

E [#answers to Q] = Σ_i,j,k Pr [(i,j,k) ∈ R⋈S⋈T] = n3_{* (1/}_n₎3₌₁

29

Input: size = M = 3 log n!

•  R, S, T random permutations on [n]: Pr [(i,j) ∈R] = 1/n, same for S, T

(30)

Lower Bound is

L

_lower

=

M

/

p

⅔

Q(X,Y,Z) = R(X,Y)

⋈

S(Y,Z)

⋈

T(Z,X)

Proof

30

One server v ∈[p]: receives L(v) = L_R(v) + L_S(v) + L_T(v) bits •  Denote a_ij = Pr [(i,j)∈R and v knows it]

Then (1) a_ij≤ 1/n (obvious) (2) Σ_i,j a_ij ≤ L_R(v) / log n (formal proof: entropy) •  Denote similarly b_jk,c_ki for S and T

(31)

Lower Bound is

L

_lower

=

M

/

p

⅔

Q(X,Y,Z) = R(X,Y)

⋈

S(Y,Z)

⋈

T(Z,X)

Proof

31

•  E [#answers to Q reported by server v] = Σ_i,j,k a_ij b_jk c_ki≤ [(Σ_i,j a_ij2_{) * (}_Σ

j,k bjk2) * (Σk,i cki2)]½ [Friedgut]

≤ (1/n) 3/2_{* [}_L

R(v) * LS(v) * LT(v) / log3 n ]½ by (1), (2)

≤ (1/n) 3/2_{* [}_L(v)_{/ 3* log}_n_]3/2 _{geometric/arithmetic}

= [ L(v) / M ]3/2 _{expression for}_M_above

(32)

Lower Bound is

L

_lower

=

M

/

p

⅔

Q(X,Y,Z) = R(X,Y)

⋈

S(Y,Z)

⋈

T(Z,X)

Proof

E [#answers to Q reported by all servers]

≤ Σ_v [ L(v) / M ]3/2 ₌_p_* _[_L_/_M_]3/2 _{= [}_L_/_L

lower ]3/2

32

•  E [#answers to Q reported by server v] = Σ_i,j,k a_ij b_jk c_ki≤ [(Σ_i,j a_ij2_{) * (}_Σ

j,k bjk2) * (Σk,i cki2)]½ [Friedgut]

≤ (1/n) 3/2_{* [}_L

R(v) * LS(v) * LT(v) / log3 n ]½ by (1), (2)

≤ (1/n) 3/2_{* [}_L(v)_{/ 3* log}_n_]3/2 _{geometric/arithmetic}

= [ L(v) / M ]3/2 _{expression for}_M_above

(33)

Outline

• The MPC Model

• Communication Cost for Triangles

• Communication Cost for CQ’s

• Discussion

(34)

General Formula

Consider an arbitrary conjunctive query (CQ):

We will:

• Give a lower bound

L

_lower

formula

• Give an algorithm with load

L

_algo

• Prove that

L

_lower

=

L

_algo

34

Q

(

x

₁

, . . . , x

_k

) =

S

₁

(

x

₁

)

o

n

. . .

o

n

S

_`

(

x

_`

)

(35)

Cartesian Product

(36)

Cartesian Product

S₁(x) à

S

2_(y)

à

Q(x,y) = S₁(x) × S₂(y) _size(_S₁₎₌_M₁_{, size(}_S₂_{) =}_M₂

L

_algo

= 2

✓

M

₁

_·

M

₂

p

◆

1 2

Algorithm

: factorize

p

=

p

₁

×

p

₂

• Send

S

₁

(x)

to all servers (

h(x)

,*)

Send

S

₂

(y)

to all servers (*,

h(y)

)

• Load =

M

₁

/

p

₁

+

M

₂

/

p

₂

;

(37)

Cartesian Product

37 S₁(x) à S 2_(y) à

Q(x,y) = S₁(x) × S₂(y) _size(_S₁₎₌_M₁_{, size(}_S₂_{) =}_M₂

L

_algo

= 2

✓

M

₁

_·

M

₂

p

◆

1 2

L

_lower

= 2

✓

M

₁

_·

M

₂

p

◆

1 2

Lower bound:

• Let

L(v) = L

₁

(

v

) +

L

₂

(

v

) = load at server

v

• The

p

servers must report all answers to

Q

:

M

₁

M

₂

≤

Σ

_v

L

₁

(

v

) *

L

₂

(

v

)

≤

½

Σ

_v

(

L

(

v

))

2

≤

½

p

max

_v

(

L

(

v

))

2

Algorithm

: factorize

p

=

p

₁

×

p

₂

• Send

S

₁

(x)

to all servers (

h(x)

,*)

Send

S

₂

(y)

to all servers (*,

h(y)

)

• Load =

M

₁

/

p

₁

+

M

₂

/

p

₂

;

(38)

A Simple Observation

Q(x1, . . . , xk) = S1(x1) on . . . on S`(x`) size(S1) = M1,size(S2) = M2, . . . ,size(S`) = M`

Proof similar to cartesian product on previous slide

Definition

An

edge packing

for

Q

is a set of atoms with no common variables:

U

=

_{

S

_j₁

(x

_j₁

)

,

S

_j₂

(x

_j₂

)

, . . . ,

S

_j_|_U_|

(x

_j_|_U_|

)

_}

₈

i, k

:

x

_j_i

_\

x

_j_k

=

_;

Claim

Any one-round algorithm for

Q

must also compute the cartesian product:

Q

0

(

x

_j₁

, . . . ,

x

_j_|_U_|

) =

S

_j₁

(

x

_j₁

)

o

n

S

_j₂

(

x

_j₂

)

o

n

. . .

o

n

S

_j_|_U_|

(

x

_j_|_U_|

)

Proof

Because the algorithm doesn’t know the other

S

_j

’s.

L

_lower

=

_|

U

_|

_·

✓

_M

j₁

· M

j₂

· · ·

M

j_|_U_|

p

◆

1 |U|

Assume that the three input relations

S

₁

, S

₂

, …

are stored on disjoint servers

#

(39)

The Lower Bound for CQ

39

Theorem

* Any one-round deterministic algorithm

A

for

Q

requires O(

L

_lower

) bits of communication per server:

L

_lower

=

✓

M

₁

u

1

· M

2 u

2

· · ·

M

`

u

`

p

◆

1 u₁+u₂+_···+u_`

*Beame, Koutris, Suciu, Skew in Parallel Data Processing, PODS 2014

Note that we ignore constant factors (like |U| on the previous slide)

Definition

A

fractional edge packing

are real numbers

u

1

, u

2

, . . . , u

`

s.t.:

8 j

₂

[

`

] :

u

_j

0

8 i

₂

[

k

] :

X

j₂[`]:xi2vars(Sj(xj))

(40)

Triangle Query Revisited

40

Q(X,Y,Z) = R(X,Y)

⋈

S(Y,Z)

⋈

T(Z,X) size(R) = M1, size(S) = M2, size(T) = M3

Packing

u

₁

,

u

₂

,

u

₃

L

= (

M

₁u1

_*

_M

2u2

*

M

3u3

/

p

)

1/(u1+u2+u3)

½

,

½

,

½

(

M

₁

M

₂

M

₃

)

⅓

_/

_p

⅔

1 ,

0 ,

0 M

₁

/

p

0 ,

1 ,

0 M

₂

/

p

0 ,

1 M

₃

/

p

L

_lower

= the largest of these four values

When

M

₁

=

M

₂

=

M

₃

=

M

, then the largest value is the first:

M

/

p

⅔

x y

z

½

½ ½

(41)

An Algorithm for CQ

[Afrati&Ullman’2010]

Factorize

p

=

p

₁

×

p

₂

×

…

×

p

_k

;

a server is (

v

₁

, …,

v

_k

)

∈

[

p

₁

]

×

…

×

[

p

_k

]

• Send

S

₁

(x

₁₁

, x

₁₂

, …)

to servers with coordinates

h(x

₁₁

)

,

h(x

₁₂

)

, …

• Send

S

₂

(x

₂₁

, x

₂₂

, …)

to servers with coordinates

h(x

₂₁

)

,

h(x

₂₂

)

, …

• …

• Compute the query locally at each server

• Load is O(

L

_algo

), where:

[A&U] use sum instead of max. With max we can compute the optimal p₁, p₂, …, p_k

L

_algo

= max

j

₂

[

`

]

M

_j

Q

(42)

L

_lower

=

L

_algo

L

_lower

=

✓ Q

j

M

juj

p

◆

_P 1 j uj

Apply log in base p. Denote µ_j = log_p M_j, e_i = log_p p_i

L

_algo

=

max

_j

Q

M

j

(43)

L

_lower

=

L

_algo

L

_lower

=

✓ Q

j

M

juj

p

◆

_P 1 j uj

minimize

X

i

e

i



1

8 j

:

+

X

j:xi2vars(Sj)

e

i

µ

j

L

_algo

=

max

_j

Q

M

j

i

:

x

i

2 vars

(

S

j

)

p

i

(44)

L

_lower

=

L

_algo

L

_lower

=

✓ Q

j

M

juj

p

◆

_P 1 j uj

minimize

X

i

e

i



1

8 j

:

+

X

j:xi2vars(Sj)

e

i

µ

j

maximize

₍

X

j

f

j

µ

j

f

0

)

X

j

f

j



1

8 i

:

X

j:xi2vars(Sj)

f

j



f

0 Primal/Dual u_j = f_j / f₀ u₀ = 1 / f₀ at optimality: u₀ = Σ_j u_j

L

_algo

=

max

_j

Q

M

j

i

:

x

i

2 vars

(

S

j

)

p

i

(45)

Outline

• The MPC Model

• Communication Cost for Triangles

• Communication Cost for CQ’s

• Discussion

(46)