Locality-Sensitive Operators for Parallel Main-Memory Database Clusters

(1)

Locality-Sensitive Operators for

Parallel Main-Memory Database Clusters

Wolf R¨

odiger, Tobias M¨

uhlbauer, Philipp Unterbrunner*,

Angelika Reiser, Alfons Kemper, Thomas Neumann

(2)

Scale Out

▸

HyPer:

High-performance

in-memory transaction and

query processing system

▸

Scale out

to process

extremely large inputs

▸

Aiming at

clusters

with large

main memory capacity

lineitem

orders

join

group

sort

SWITCH

node 2

node 3

node 6

node 5

node 4

node 1

TPC-H Query 12

Main-Memory Cluster

(3)

Running Example (1)

▸

Focus on

analytical

query

processing in this talk

▸

TPC-H query 12 used as

running example

▸

Runtime dominated by

join

orders

&

lineitem

orders

join

group

sort

SWITCH

node 2

node 3

node 6

node 5

node 4

node 1

(4)

Running Example (2)

▸

Relations are

equally

distributed across nodes

▸

We make

no

other

assumptions on data

distribution

▸

Network communication

required as tuples join with

tuples on remote nodes

node 3

node 2

node 1

lineitem

key shipmode 1 MAIL 1 MAIL 1 MAIL 2 SHIP 2 MAIL 6 SHIP 6 SHIP 6 SHIP 6 MAIL 10 SHIP 11 MAIL 11 MAIL 13 MAIL 13 MAIL 13 MAIL 13 SHIP 17 MAIL 18 MAIL 18 MAIL 19 SHIP 20 SHIP

orders

key priority 1 1-URGENT 2 2-HIGH 3 1-URGENT 4 5-LOW 5 3-MEDIUM 6 1-URGENT 7 2-HIGH 8 1-URGENT 9 1-URGENT 10 2-HIGH 11 3-MEDIUM 12 5-LOW 13 1-URGENT 14 3-MEDIUM 15 1-URGENT 16 3-MEDIUM 17 2-HIGH 18 3-MEDIUM 19 5-LOW 20 1-URGENT 21 2-HIGH 4 / 31

(5)

Scale Out: Network is the Bottleneck

▸

Single node:

Performance is

bound algorithmically

▸

Cluster:

Network is bottleneck

for query processing

▸

We propose a novel join

algorithm called

Neo-Join

▸

Goal:

Increase local processing to

close the performance gap

jo in p e rf o rma n ce [ M tu p le s/ s] 0 50 100 150 200 distributed local

(6)

Neo-Join: Network-optimized Join

1.

Open Shop Scheduling

Efficient network communication

2.

Optimal Partition Assignment

Increase local processing

3.

Selective Broadcast

Handle value skew

(7)

(8)

Standard Network Model

▸

Star topology

Nodes are connected to a

central switch

▸

Fully switched

All links can be used

simultaneously

▸

Fully duplex

Nodes can both send and

receive at full speed

SWITCH

node 2

node 3

node 6

node 5

node 4

node 1

8 / 31

(9)

Bandwidth Sharing

node 3

node 2

node 1

node 2

node 3

node 1

SWITCH

bottleneck

node 3

node 2

node 1

node 2

node 3

node 1

SWITCH

bottleneck

▸

Simultaneous use of a single link creates a

bottleneck

▸

Reduces bandwidth

by at least a factor of

2

(10)

Na¨ıve Schedule

node 1

5

1

4

1

4

node 2

node 3

time

node 1

4

5

1

4

1

4

1

node 2

node 3

maximum straggler

bandwidth sharing

time

▸

Node 2 and 3 send to node 1

at the same time

▸

Bandwidth sharing increases

network duration

significantly

(11)

Open Shop Scheduling (1)

▸

Avoiding bandwidth sharing

translates directly

to open

shop scheduling

▸

Network Transfer:

Receivers receive from at

most

one

sender, senders

send to at most

one

receiver

▸

Open Shop:

Processors perform

one

task

at a time, only

one

task of a

job is processed at a time

Open Shop

Network Transfer

task

data transfer

processor

receiver

job

sender

(12)

Open Shop Scheduling (2)

▸

Bipartite graph

of senders

and receivers

▸

Edge weights

represent

transfer size

▸

Scheduler repeatedly finds

perfect matchings

▸

Each matching specifies one

communication phase

▸

Transfers in a phase will

never

share bandwidth

sender

receiver

5

4

node 2

node 3

node 2

node 1

12 / 31

(13)

Optimal Schedule

node 1

5

1

4

1

4

node 2

node 3

time

node 1

4

5

1

4

1

4

1

node 2

node 3

maximum straggler

bandwidth sharing

time

▸

Open shop schedule achieves minimal

network duration

▸

Schedule duration determined by

maximum straggler

(14)

Optimal Partition

Assignment

Minimize network duration for distributed joins

(15)

Distributed Join

▸

Tuples may join with tuples

on

remote nodes

▸

Repartition and redistribute

both relations

for local join

▸

Tuples will join only with the

corresponding partition

▸

Using hash, range, radix, or

other

partitioning

scheme

▸

In any case:

Decide how to

assign

partitions to nodes

⨝ ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ fragmented redistributed ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ O2 L2 O1 L1 O3 L3 O L

(16)

Running Example: Hash Partitioning

node 3

node 2

node 1

lineitem key shipmode 1 MAIL 1 MAIL 1 MAIL 2 SHIP 2 MAIL 6 SHIP 6 SHIP 6 SHIP 6 MAIL 10 SHIP 11 MAIL 11 MAIL 13 MAIL 13 MAIL 13 MAIL 13 SHIP 17 MAIL 18 MAIL 18 MAIL 19 SHIP 20 SHIP orders key priority 1 1-URGENT 2 2-HIGH 3 1-URGENT 4 5-LOW 5 3-MEDIUM 6 1-URGENT 7 2-HIGH 8 1-URGENT 9 1-URGENT 10 2-HIGH 11 3-MEDIUM 12 5-LOW 13 1-URGENT 14 3-MEDIUM 15 1-URGENT 16 3-MEDIUM 17 2-HIGH 18 3-MEDIUM 19 5-LOW 20 1-URGENT 21 2-HIGH ₅ ₄ ₄ 4 4 5 5 5 6

x+2 mod 3

P2 P3 P1 P2 P3 P1 P2 P3 P1 16 / 31

(17)

Assign Partitions to Nodes (1)

Option 1:

Minimize network traffic

▸

Assign partition to node that

owns its

largest part

▸

Only the

small fragments

of a

partition sent over the network

▸

Schedule with minimal network

traffic may have

high duration

5 6 5

2 3 3

hash partitioning (x mod 3)

4 4 5 5 4 4 P2 P3 P1 5 6 5 4 5 4 4 5 4

hash partitioning (x mod 3)

5 5 5 5 4 4 P2 P3 P1

traffic: 26 time: 26 traffic: 28 time: 10

13

4 5 1 4 4 1 4 4 1

open shop schedule open shop schedule n1 n3 n2 n1 n3 n2 n1 n2 n3 n1 n2 n3 5 6 5 2 3 3 2 3 3

4 4 5 5 4 4 P2 P3 P1 5 6 5 4 5 4 4 5 4

5 5 5 5 4 4 P2 P3 P1

traffic: 26 time: 26 traffic: 28 time: 10

13

4 5 1

4 4 1

open shop schedule open shop schedule

n1 n3 n2 n1 n3 n2 n1 n2 n3 n1 n2 n3

(18)

Assign Partitions to Nodes (2)

Option 2:

Minimize response time:

▸

Query response time

is time

from request to result

▸

Query response time dominated

by

network duration

▸

To minimize network duration,

minimize

maximum straggler

5 6 5

2 3 3

hash partitioning (x mod 3)

4 4 5 5 4 4 P2 P3 P1 5 6 5 4 5 4 4 5 4

5 5 5 5 4 4 P2 P3 P1

traffic: 26 time: 26 traffic: 28 time: 10

13

4 5 1

4 4 1

n1 n3 n2 n1 n3 n2 n1 n2 n3 n1 n2 n3 5 6 5 2 3 3 2 3 3

4 4 5 5 4 4 P2 P3 P1 5 6 5 4 5 4 4 5 4

5 5 5 5 4 4 P2 P3 P1

traffic: 26 time: 26 traffic: 28 time: 10

13

4 5 1

4 4 1

n1 n3 n2 n1 n3 n2 n1 n2 n3 n1 n2 n3 18 / 31

(19)

Minimize Maximum Straggler

▸

Formalized as mixed-integer

linear program

▸

Objective function minimizes

maximum straggler

▸

Shown to be

NP-hard

(see paper for proof sketch)

▸

In practice

fast enough

using CPLEX or Gurobi

(

<

0.5 % overhead for 32

nodes, 200 M tuples each)

▸

Partition assignment can

optimize

any partitioning

# tuples

Node 0 Node 1 Node 2

P0P1P2P3P4P5P6P7 P0P1P2P3P4P5P6P7 P0P1P2P3P4P5P6P7

P0 P1 P2 P3 P4 P5 P6 P7

network phase duration

➡ ➡ ➡ 2 11 2 3 3 9 2 0 7 1 7 7 1 4 1 4 1 3 2 2 10 3 10 1 0 2 4 6 8 10 12 Node 0 Node 1 Node 2 S S S 21 13 9 1 13 15 14 0 22 21 28 10 10 15 31 29 R R R R 10 10 1 13 8 1 22 29 26 3 11 7 16 3 15 2 S S S 7 19 17 18 24 19 27 12 30 24 19 16 14 26 18 24 R R R R 24 8 24 8 19 4 26 18 27 5 23 2 27 20 22 17 S S S 26 3 23 16 9 6 15 24 6 19 7 20 23 5 4 20 R R R R 20 11 23 6 23 4 18 5 14 22 6 15 23 4 3 6

(a) Create histograms according to the last three bits of the join key

# tuples

P0P1P2P3P4P5P6P7 P0P1P2P3P4P5P6P7P0P1P2P3P4P5P6P7

P0 P1 P2 P3 P4 P5 P6 P7

➡ ➡ ➡ 2 11 2 3 3 9 2 0 7 1 7 7 1 4 1 4 1 3 2 2 10 3 10 1 0 2 4 6 8 10 12 Node 0 Node 1 Node 2 S S S 21 13 9 1 13 15 14 0 22 21 28 10 10 15 31 29 R R R R 10 10 1 13 8 1 22 29 26 3 11 7 16 3 15 2 S S S 7 19 17 18 24 19 27 12 30 24 19 16 14 26 18 24 R R R R 24 8 24 8 19 4 26 18 27 5 23 2 27 20 22 17 S S S 26 3 23 16 9 6 15 24 6 19 7 20 23 5 4 20 R R R R 20 11 23 6 23 4 18 5 14 22 6 15 23 4 3 6

(b) Compute an optimal partition as-signment based on the histograms

# tuples

P0P1P2P3P4P5P6P7P0P1P2P3P4P5P6P7P0P1P2P3P4P5P6P7

P0 P1 P2 P3 P4 P5 P6 P7

➡ ➡ ➡ 2 11 2 3 3 9 2 0 7 1 7 7 1 4 1 4 1 3 2 2 10 3 10 1 0 2 4 6 8 10 12 Node 0 Node 1 Node 2 S S S 21 13 9 1 13 15 14 0 22 21 28 10 10 15 31 29 R R R R 10 10 1 13 8 1 22 29 26 3 11 7 16 3 15 2 S S S 7 19 17 18 24 19 27 12 30 24 19 16 14 26 18 24 R R R R 24 8 24 8 19 4 26 18 27 5 23 2 27 20 22 17 S S S 26 3 23 16 9 6 15 24 6 19 7 20 23 5 4 20 R R R R 20 11 23 6 23 4 18 5 14 22 6 15 23 4 3 6

(c) Resulting send/receive costs deter-mine the network phase duration Fig. 4. Example for the optimal partition assignment which aims at a minimal

network phase duration with three nodes and eight (23_{) radix partitions}

maximum stragglers with a cost of 12 as depicted in Fig. 4(c). For a perfect hash partitioning one would expect that every node has to send1�n-th of its tuples to every other node (≈ 21) and also receive 1�n-th of the tuples from every other node (also≈21). In this simplified example, radix partitioning reduced the duration of the network phase by almost a factor of two compared to hash partitioning.

B. Optimal Partition Assignment (Phase 2)

The previous section described how to repartition the input relations so that tuples with the same join key fall into the same partition. In general, the new partitions arefragmentedacross the nodes. Therefore, all fragments of one specific partition have to be transferred to the same node for joining. This section describes how to determine an assignment of partitions to nodes that minimizes the network phase duration.

We define the receive cost of a node as the number of tuples it receives from other nodes for the partitions that were assigned to it. Similarly, itssend costis defined as the number of tuples it has to send to other nodes. Section III-C4 shows that the minimum network phase duration is determined by the node with the maximum send/receive cost. The assignment is therefore optimized to minimize this maximum cost.

A na¨ıve approach would assign a partition to the node that

owns its largest fragment. While the assignment of partition 7 to node 0 reduces the send cost of node 0 by 4 tuples, it also increases its receive cost to a total of 13 tuples. As a result, the network phase duration increases from 12 to 13 (cf. Fig. 4(c)).

1) Mixed Integer Linear Programming: We phrase the par-tition assignment problem as a mixed integer linear program (MILP). As a result, one can use an integer programming solver to solve it. The linear program computes a configuration of the decision variablesxij∈ {0,1}. These decision variables

define the assignment of theppartitions to thennodes:xij=1

determines that partitionjis assigned to nodei, whilexij=0

specifies that partitionj is not assigned to nodei. Each partition has to be assigned to exactly one node:

n−1 � i=0

xij=1 for0≤j<p (1)

The linear program should minimize the duration of the network phase, which is equal to the maximum send or receive cost over all nodes. We denote the send cost of nodeiassi

and its receive cost asri. The objective function is therefore: min max

0≤i<n{si, ri} (2)

Using the decision variablesxijand the size of partitionj

at nodei—denoted withhij—we can express the amount of

data each node has to send (si) and receive (ri):

si= p−1 � j=0 hij⋅ (1−xij) for0≤i<n (3) ri= p−1 � j=0 � �xij n−1 � k=0,i≠k hkj� � for0≤i<n (4)

Equation 3 computes the send cost of nodei as the size of all local fragments of partitions that are not assigned to it. Likewise, equation 4 adds the size of remote fragments of partitions that were assigned to nodeito the receive cost.

MILPs require a linear objective, which minimizing a max-imum is not. Fortunately, we can rephrase the objective and instead minimize a new variablew. Additional constraints take care thatwassumes the maximum over the send/receive costs:

(OPT-ASSIGN) minimizew, subject to w≥ p−1 � j=0 hij(1−xij) 0≤i<n w≥ p−1 � j=0 � �xij n−1 � k=0,i≠k hkj� � 0≤i<n 1=n− 1 � i=0 xij 0≤j<p

One can obtain an optimal solution for a specific partition assignment problem (OPT-ASSIGN) by passing the mixed integer linear program to an optimizer such as Microsoft Gurobi3 _{or IBM CPLEX}4_{. These solvers can be linked as a}

(20)

Running Example: Locality

node 3

P2 P3 P1

node 2

P2 P3 P1

node 1

P2 P3 P1 lineitem key shipmode 1 MAIL 1 MAIL 1 MAIL 2 SHIP 2 MAIL 6 SHIP 6 SHIP 6 SHIP 6 MAIL 10 SHIP 11 MAIL 11 MAIL 13 MAIL 13 MAIL 13 MAIL 13 SHIP 17 MAIL 18 MAIL 18 MAIL 19 SHIP 20 SHIP orders key priority 1 1-URGENT 2 2-HIGH 3 1-URGENT 4 5-LOW 5 3-MEDIUM 6 1-URGENT 7 2-HIGH 8 1-URGENT 9 1-URGENT 10 2-HIGH 11 3-MEDIUM 12 5-LOW 13 1-URGENT 14 3-MEDIUM 15 1-URGENT 16 3-MEDIUM 17 2-HIGH 18 3-MEDIUM 19 5-LOW 20 1-URGENT 21 2-HIGH

radix

15 11 11 2 1 1 1 20 / 31

(21)

Locality

▸

Running example exhibits

time-of-creation

clustering

▸

Radix repartitioning

on most

significant bits retains locality

▸

Partition assignment can

exploit locality

▸

Significantly reduces

query

response time

15 1 0 1 11 1 0 2 11 P2 P3 P1 radix partitioning (MSB) traffic: 5 time: 3 1 1 1 2

open shop schedule n1 n3 n2 n1 n2 n3 1 0 1 2 0 1 15 1 0 1 11 1 0 2 11 P2 P3 P1 radix partitioning (MSB) traffic: 5 time: 3 1 1 1 2

open shop schedule

n1 n3 n2 n1 n2 n3 1 0 1 2 0 1

(22)

Selective Broadcast

Handle value skew

(23)

Running Example: Skew

node 1

node 3

node 2

O1 lineitem key shipmode 1 MAIL 2 MAIL 2 MAIL 2 MAIL 2 SHIP 2 SHIP 2 SHIP 6 MAIL 6 MAIL 7 SHIP 8 SHIP 8 SHIP 8 MAIL 8 MAIL 8 MAIL 8 MAIL 8 MAIL 8 MAIL 8 MAIL 8 MAIL 8 MAIL 8 MAIL orders key priority 1 1-URGENT 2 2-HIGH 3 1-URGENT 4 1-URGENT 5 1-URGENT 6 2-HIGH 7 3-MEDIUM 8 3-MEDIUM 9 2-HIGH 10 3-MEDIUM

x+2 mod 3

L1 O2L2 O3L3 2 1 1 1 1 6 1 1 1 1 6 O1L1 O2L2 O3L3 1 1 1 1 7 1

(24)

Skew

▸

Value skew

can lead to some

very large partitions

▸

Assignment of these partitions

increases

network duration

▸

One may try to balance skewed

partitions by partitioning the

input into

more partitions

▸

High skew

is still a problem

3 7 2

2 7 2

2 8 1

P1 P3

hash partitioning (mod 3)

traffic: 21 time: 14

7 2

2 7

1 2

open shop schedule n1 n3 n2 n1 n2 n3 P2 3 7 2 2 7 2 2 8 1 P1 P3

hash partitioning (mod 3)

traffic: 21 time: 14

7 2

2 7

1 2

open shop schedule n1 n3 n2 n1 n2 n3 P2 24 / 31

(25)

Broadcast

▸

Alternative

data

redistribution scheme

▸

Replicate

the smaller

relation between all nodes

▸

Larger relation

remains

fragmented

across nodes

broadcast O local join

O L O O L O

⨝

(26)

Selective Broadcast

▸

Decide

per partition

whether

to assign or broadcast

▸

Broadcast

partitions with

large relation size difference

▸

Assign

the other partitions

taking locality into account

▸

Role reversal possible:

Broadcast different partitions

by different relations

2 1 1 6 1 1

1 1 1 6 1 1

1 1 1 5 2 2

O2

hash partitioning (mod 3)

traffic: 14 time: 6 3 1

3 3

1 3

open shop schedule

n1 n3 n2 n1 n2 n3 L2 O1 L1 O3 L3 2 1 1 6 1 1 1 1 1 6 1 1 1 1 1 5 2 2 O2

hash partitioning (mod 3)

traffic: 14 time: 6

3 1

3 3

1 3

open shop schedule

n1 n3 n2 n1 n2 n3 L2 O1 L1 O3 L3 26 / 31

(27)

(28)

Locality

▸

Vary

locality

from

0 %

(uniform distribution) to

100 %

(range partitioning)

▸

Neo-Join improves

join

performance

from 29 M to

156 M tuples/s (

>

500 %)

▸

3 nodes (Core i7, 4 cores,

3.4 GHz, 32 GB RAM),

600 M tuples (64 bit key,

64 bit payload)

jo in p e rf o rma n ce [ tu p le /s] 0 M 40 M 80 M 120 M 160 M locality 0 % 25 % 50 % 75 % 100 %

Neo-Join Hadoop Hive

DBMS-X MySQL Cluster

(29)

Skew

▸

Zipfian distribution

models

realistic data skew

▸

Using

more partitions

alleviates the problem

▸

Selective broadcast actually

improves

performance for

skewed inputs

▸

4 nodes, 400 M tuples

Zipf factor s partitions 0.00 0.25 0.50 0.75 1.00 16 27 s 24 s 23 s 29 s 44 s 512 23 s 23 s 23 s 23 s 33 s 16 (SB) 24 s 24 s 23 s 20 s 10 s

(30)

TPC-H Results (scale factor 100)

▸

Results for three selected

TPC-H

queries

▸

Broadcast

outperforms

hash

for large relation size

differences

▸

Neo-Join always performs

better due to

selective

broadcast

and

locality

▸

4 nodes, scale factor 100

e xe cu tio n t ime [ s] 0 1 2 3 4 5 6 7 TPC-H queries Q12 Q14 Q19

hash

broadcast

Neo-Join

(31)

Summary

Motivation:

▸

Network is the

bottleneck

for distributed query processing

▸

Increase

local processing

to close the performance gap

Contributions:

▸

avoids bandwidth sharing

▸

Optimal Partition Assignment

minimizes query response

time and can exploit locality in the data distribution

▸

Selective Broadcast

combines repartitioning and broadcast