(1)

Locality-Sensitive Operators for
Parallel Main-Memory Database Clusters

Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner*,
Angelika Reiser, Alfons Kemper, Thomas Neumann

(2)

Scale Out

HyPer: high-performance in-memory transaction and query processing system

Scale out to process extremely large inputs

Aiming at clusters with large main memory capacity

[Figure: plan for TPC-H Query 12 (join, group, sort over lineitem and orders) on a main-memory cluster of six nodes connected by a switch]

(3)

Running Example (1)

Focus on analytical query processing in this talk

TPC-H Query 12 used as running example

Runtime dominated by the join of orders and lineitem

[Figure: the same query plan on the six-node cluster]
(4)

Running Example (2)

Relations are equally distributed across the nodes

We make no other assumptions on the data distribution

Network communication is required as tuples join with tuples on remote nodes

[Figure: lineitem (key, shipmode) and orders (key, priority) fragments spread over nodes 1-3]
(5)

Scale Out: Network is the Bottleneck

Single node: performance is bound algorithmically

Cluster: network is the bottleneck for query processing

We propose a novel join algorithm called Neo-Join

Goal: increase local processing to close the performance gap

[Chart: join performance in M tuples/s, local vs. distributed]

(6)

Neo-Join: Network-optimized Join

1. Open Shop Scheduling: efficient network communication

2. Optimal Partition Assignment: increase local processing

3. Selective Broadcast: handle value skew

(7)

Open Shop Scheduling

(8)

Standard Network Model

Star topology: nodes are connected to a central switch

Fully switched: all links can be used simultaneously

Full duplex: nodes can both send and receive at full speed

[Figure: six nodes connected to a central switch]
(9)

Bandwidth Sharing

Simultaneous use of a single link creates a bottleneck

Reduces bandwidth by at least a factor of 2

[Figure: two senders transmitting to the same receiver share its link and create a bottleneck at the switch]

(10)

Naïve Schedule

Node 2 and node 3 send to node 1 at the same time

Bandwidth sharing increases the network duration significantly

[Figure: naïve transfer schedule over time; shared links stretch the schedule well beyond the maximum straggler]

(11)

Open Shop Scheduling (1)

Avoiding bandwidth sharing translates directly to open shop scheduling

Network transfer: receivers receive from at most one sender, senders send to at most one receiver

Open shop: processors perform one task at a time, only one task of a job is processed at a time

Open Shop    ↔  Network Transfer
task         ↔  data transfer
processor    ↔  receiver
job          ↔  sender

(12)

Open Shop Scheduling (2)

Bipartite graph of senders and receivers

Edge weights represent transfer size

Scheduler repeatedly finds perfect matchings

Each matching specifies one communication phase

Transfers in a phase will never share bandwidth
(see the sketch below)

[Figure: bipartite graph of sender nodes 1-3 and receiver nodes 1-3 with edge weights 5, 5, 4]
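To make the phase construction concrete, here is a minimal Python sketch (ours, not the talk's implementation; the paper's scheduler picks matchings that provably minimize the total schedule length, while this greedy variant only illustrates the mechanics):

from itertools import permutations

def phase_schedule(T):
    """Greedy sketch of matching-based scheduling. T[s][r] is the amount
    node s still has to send to node r (T is modified in place). Each
    phase is a perfect matching sender -> receiver, so transfers within
    a phase never share a link. Brute force is fine for small n."""
    n, phases = len(T), []
    while any(T[s][r] for s in range(n) for r in range(n)):
        # pick the matching (permutation) that moves the most data
        match = max(permutations(range(n)),
                    key=lambda m: sum(T[s][m[s]] for s in range(n)))
        active = [(s, match[s]) for s in range(n) if T[s][match[s]] > 0]
        length = min(T[s][r] for s, r in active)  # phase ends when its
        for s, r in active:                       # smallest transfer is done
            T[s][r] -= length
        phases.append((length, active))
    return phases  # network duration = sum of phase lengths

Each returned phase uses every sender and every receiver at most once, which is exactly the open shop constraint.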
(13)

Optimal Schedule

Open shop schedule achieves minimal network duration

Schedule duration is determined by the maximum straggler

[Figure: the same transfers rearranged into conflict-free phases; the schedule now ends with the maximum straggler instead of being stretched by bandwidth sharing]
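The straggler bound itself is easy to compute; a small helper (our naming, pairing with the sketch above):

def straggler_bound(T):
    """No schedule on a fully duplex switched network can finish before
    the maximum straggler: the largest total amount any single node
    must send, or any single node must receive."""
    n = len(T)
    sends = [sum(T[i][j] for j in range(n) if j != i) for i in range(n)]
    recvs = [sum(T[i][j] for i in range(n) if i != j) for j in range(n)]
    return max(sends + recvs)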

(14)

Optimal Partition Assignment

Minimize network duration for distributed joins

(15)

Distributed Join

Tuples may join with tuples on remote nodes

Repartition and redistribute both relations for a local join

Tuples will join only with the corresponding partition

Using hash, range, radix, or any other partitioning scheme

In any case: decide how to assign partitions to nodes

[Figure: fragmented relations O and L are repartitioned and redistributed so that each node joins matching partitions locally]

(16)

Running Example: Hash Partitioning

[Figure: lineitem and orders tuples on nodes 1-3 hash-partitioned into P1-P3 with the hash function (x + 2) mod 3; most tuples must be sent to a remote node]
(17)

Assign Partitions to Nodes (1)

Option 1: minimize network traffic

Assign each partition to the node that owns its largest part

Only the small fragments of a partition are sent over the network

A schedule with minimal network traffic may still have a high duration

[Figure: two assignments of the hash-partitioned example with their open shop schedules; one has traffic 26 but duration 26, the other traffic 28 but duration only 10]

(18)

Assign Partitions to Nodes (2)

Option 2: minimize response time

Query response time is the time from request to result

Query response time is dominated by the network duration

To minimize the network duration, minimize the maximum straggler
(see the helper after this slide)

[Figure: the same two assignments; minimizing the maximum straggler yields duration 10 instead of 26]
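The trade-off between the two options can be checked with a small helper (a sketch with our names; h and the assignment follow the paper's notation):

def assignment_costs(h, assign):
    """h[i][j]: tuples of partition j stored on node i. assign[j]: node
    that receives partition j. Returns the total network traffic and the
    maximum per-node send/receive cost, which bounds the duration."""
    n, p = len(h), len(h[0])
    send = [sum(h[i][j] for j in range(p) if assign[j] != i)
            for i in range(n)]
    recv = [sum(h[k][j] for j in range(p) if assign[j] == i
                for k in range(n) if k != i)
            for i in range(n)]
    return sum(send), max(send + recv)

Option 1 minimizes the first component, option 2 the second.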

(19)

Minimize Maximum Straggler

Formalized as a mixed-integer linear program

Objective function minimizes the maximum straggler

Shown to be NP-hard (see the paper for a proof sketch)

In practice fast enough using CPLEX or Gurobi (< 0.5 % overhead for 32 nodes with 200 M tuples each)

Partition assignment can optimize any partitioning

[Fig. 4. Example of the optimal partition assignment, which aims at a minimal network phase duration, with three nodes and eight (2^3) radix partitions: (a) create histograms according to the last three bits of the join key; (b) compute an optimal partition assignment based on the histograms; (c) the resulting send/receive costs determine the network phase duration]

[...] maximum stragglers with a cost of 12 as depicted in Fig. 4(c). For a perfect hash partitioning one would expect that every node has to send 1/n-th of its tuples to every other node (≈ 21) and also receive 1/n-th of the tuples from every other node (also ≈ 21). In this simplified example, radix partitioning reduced the duration of the network phase by almost a factor of two compared to hash partitioning.
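Step (a) of Fig. 4 is cheap to implement; a sketch (keys_per_node is a hypothetical list holding one list of join keys per node):

def radix_histograms(keys_per_node, bits=3):
    """Per-node histograms over 2**bits radix partitions; a key's
    partition id is its last `bits` bits, as in Fig. 4(a)."""
    p = 1 << bits
    hist = [[0] * p for _ in keys_per_node]
    for i, keys in enumerate(keys_per_node):
        for key in keys:
            hist[i][key & (p - 1)] += 1
    return hist  # hist[i][j] = h_ij, the input to OPT-ASSIGN below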

B. Optimal Partition Assignment (Phase 2)

The previous section described how to repartition the input relations so that tuples with the same join key fall into the same partition. In general, the new partitions are fragmented across the nodes. Therefore, all fragments of one specific partition have to be transferred to the same node for joining. This section describes how to determine an assignment of partitions to nodes that minimizes the network phase duration.

We define the receive cost of a node as the number of tuples it receives from other nodes for the partitions that were assigned to it. Similarly, its send cost is defined as the number of tuples it has to send to other nodes. Section III-C4 shows that the minimum network phase duration is determined by the node with the maximum send/receive cost. The assignment is therefore optimized to minimize this maximum cost.

A naïve approach would assign a partition to the node that owns its largest fragment. While the assignment of partition 7 to node 0 reduces the send cost of node 0 by 4 tuples, it also increases its receive cost to a total of 13 tuples. As a result, the network phase duration increases from 12 to 13 (cf. Fig. 4(c)).

1) Mixed Integer Linear Programming: We phrase the partition assignment problem as a mixed integer linear program (MILP). As a result, one can use an integer programming solver to solve it. The linear program computes a configuration of the decision variables $x_{ij} \in \{0, 1\}$. These decision variables define the assignment of the $p$ partitions to the $n$ nodes: $x_{ij} = 1$ determines that partition $j$ is assigned to node $i$, while $x_{ij} = 0$ specifies that partition $j$ is not assigned to node $i$. Each partition has to be assigned to exactly one node:

$$\sum_{i=0}^{n-1} x_{ij} = 1 \quad \text{for } 0 \le j < p \tag{1}$$

The linear program should minimize the duration of the network phase, which is equal to the maximum send or receive cost over all nodes. We denote the send cost of node $i$ as $s_i$ and its receive cost as $r_i$. The objective function is therefore:

$$\min \max_{0 \le i < n} \{ s_i, r_i \} \tag{2}$$

Using the decision variables $x_{ij}$ and the size of partition $j$ at node $i$ (denoted $h_{ij}$), we can express the amount of data each node has to send ($s_i$) and receive ($r_i$):

$$s_i = \sum_{j=0}^{p-1} h_{ij} \cdot (1 - x_{ij}) \quad \text{for } 0 \le i < n \tag{3}$$

$$r_i = \sum_{j=0}^{p-1} \Bigl( x_{ij} \sum_{\substack{k=0 \\ k \ne i}}^{n-1} h_{kj} \Bigr) \quad \text{for } 0 \le i < n \tag{4}$$

Equation 3 computes the send cost of node $i$ as the size of all local fragments of partitions that are not assigned to it. Likewise, Equation 4 adds the size of remote fragments of partitions that were assigned to node $i$ to its receive cost.

MILPs require a linear objective, which minimizing a maximum is not. Fortunately, we can rephrase the objective and instead minimize a new variable $w$. Additional constraints take care that $w$ assumes the maximum over the send/receive costs:

$$\text{(OPT-ASSIGN)} \quad \text{minimize } w, \text{ subject to}$$

$$w \ge \sum_{j=0}^{p-1} h_{ij} (1 - x_{ij}) \qquad 0 \le i < n$$

$$w \ge \sum_{j=0}^{p-1} \Bigl( x_{ij} \sum_{\substack{k=0 \\ k \ne i}}^{n-1} h_{kj} \Bigr) \qquad 0 \le i < n$$

$$1 = \sum_{i=0}^{n-1} x_{ij} \qquad 0 \le j < p$$

One can obtain an optimal solution for a specific partition assignment problem (OPT-ASSIGN) by passing the mixed integer linear program to an optimizer such as Gurobi or IBM CPLEX. These solvers can be linked as a library.
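For illustration, OPT-ASSIGN translates almost verbatim into the open-source PuLP modeler (our choice of tooling, not the paper's, which links Gurobi/CPLEX directly):

import pulp

def opt_assign(h):
    """Solve OPT-ASSIGN. h[i][j] = size of the fragment of partition j
    stored on node i. Returns assign[j] = node that receives j."""
    n, p = len(h), len(h[0])
    prob = pulp.LpProblem("opt_assign", pulp.LpMinimize)
    x = [[pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
          for j in range(p)] for i in range(n)]
    w = pulp.LpVariable("w", lowBound=0)
    prob += w                          # minimize w (Eq. 2, rephrased)
    for j in range(p):                 # Eq. 1: exactly one node per partition
        prob += pulp.lpSum(x[i][j] for i in range(n)) == 1
    for i in range(n):
        # w >= send cost s_i (Eq. 3)
        prob += w >= pulp.lpSum(h[i][j] * (1 - x[i][j]) for j in range(p))
        # w >= receive cost r_i (Eq. 4); the inner sum over k is a constant
        prob += w >= pulp.lpSum(
            sum(h[k][j] for k in range(n) if k != i) * x[i][j]
            for j in range(p))
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [next(i for i in range(n) if x[i][j].value() > 0.5)
            for j in range(p)]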

(20)

Running Example: Locality

[Figure: lineitem and orders on nodes 1-3 radix-partitioned into P1-P3; almost all tuples already reside on the node their partition is assigned to]
(21)

Locality

Running example exhibits time-of-creation clustering

Radix repartitioning on the most significant bits retains locality

Partition assignment can exploit locality

Significantly reduces query response time

[Figure: MSB radix partitioning of the example with its open shop schedule; traffic drops to 5 and the network duration to 3]
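Why the most significant bits matter can be seen in two lines (key_bits and part_bits are illustrative parameters, not values from the talk):

def msb_radix_partition(key, key_bits=32, part_bits=2):
    """Keys created around the same time share their high bits, so MSB
    radix partitioning keeps them in the same partition and thus on the
    same node; a hash such as (x + 2) mod 3 would scatter them."""
    return key >> (key_bits - part_bits)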

(22)

Selective Broadcast

Handle value skew

(23)

Running Example: Skew

[Figure: skewed lineitem data (one join key occurs very frequently) and orders on nodes 1-3, hash-partitioned with (x + 2) mod 3; one partition becomes much larger than the others]
(24)

Skew

Value skew can lead to some very large partitions

Assignment of these partitions increases the network duration

One may try to balance skewed partitions by partitioning the input into more partitions

High skew is still a problem

[Figure: hash partitioning (mod 3) of the skewed example with its open shop schedule; traffic 21, network duration 14]

(25)

Broadcast

Alternative data redistribution scheme

Replicate the smaller relation to all nodes

The larger relation remains fragmented across the nodes

[Figure: orders (O) is broadcast to every node and joined locally against the fragmented lineitem (L)]

(26)

Selective Broadcast

Decide per partition whether to assign or broadcast

Broadcast partitions with a large relation size difference

Assign the other partitions, taking locality into account

Role reversal possible: different partitions can be broadcast from different relations
(see the sketch below)

[Figure: selective broadcast on the skewed example with its open shop schedule; traffic 14, network duration 6]
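A per-partition decision rule can be sketched as follows (a simplification with a made-up threshold; the paper decides based on the effect on the network phase duration):

def selective_broadcast_plan(size_R, size_S, factor=4.0):
    """size_R[j], size_S[j]: global tuple counts of partition j in the
    two input relations. Broadcast the smaller side of a partition when
    the size difference is large; otherwise assign the whole partition
    to one node (where locality can be exploited). Role reversal: the
    broadcast side may differ from partition to partition."""
    plan = []
    for r, s in zip(size_R, size_S):
        if min(r, s) * factor < max(r, s):
            plan.append("broadcast " + ("R" if r < s else "S"))
        else:
            plan.append("assign")
    return plan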

(27)
(28)

Locality

Vary locality from 0 % (uniform distribution) to 100 % (range partitioning)

Neo-Join improves join performance from 29 M to 156 M tuples/s (> 500 %)

Setup: 3 nodes (Core i7, 4 cores, 3.4 GHz, 32 GB RAM), 600 M tuples (64 bit key, 64 bit payload)

[Chart: join performance in tuples/s over locality 0-100 % for Neo-Join, Hadoop Hive, DBMS-X, and MySQL Cluster]

(29)

Skew

Zipfian distribution models realistic data skew

Using more partitions alleviates the problem

Selective broadcast actually improves performance for skewed inputs

Setup: 4 nodes, 400 M tuples

partitions | s = 0.00 | s = 0.25 | s = 0.50 | s = 0.75 | s = 1.00
16         | 27 s     | 24 s     | 23 s     | 29 s     | 44 s
512        | 23 s     | 23 s     | 23 s     | 23 s     | 33 s
16 (SB)    | 24 s     | 24 s     | 23 s     | 20 s     | 10 s

(s: Zipf factor; SB: selective broadcast)
(30)

TPC-H Results (scale factor 100)

Results for three selected TPC-H queries

Broadcast outperforms hash for large relation size differences

Neo-Join always performs better due to selective broadcast and locality

Setup: 4 nodes, scale factor 100

[Chart: execution time in seconds for TPC-H queries Q12, Q14, and Q19 using hash, broadcast, and Neo-Join]

(31)

Summary

Motivation:
Network is the bottleneck for distributed query processing
Increase local processing to close the performance gap

Contributions:
Open Shop Scheduling avoids bandwidth sharing
Optimal Partition Assignment minimizes query response time and can exploit locality in the data distribution
Selective Broadcast combines repartitioning and broadcast
