• No results found

Rethinking SIMD Vectorization for In-Memory Databases

N/A
N/A
Protected

Academic year: 2021

Share "Rethinking SIMD Vectorization for In-Memory Databases"

Copied!
53
0
0

Loading.... (view fulltext now)

Full text

(1)

SIGMOD 2015, Melbourne, Victoria, Australia

Rethinking SIMD Vectorization

for In-Memory Databases

(2)

Mainstream multi-core CPUs

Use complex cores (e.g. Intel Haswell)

Massively superscalarAggressively out-of-order

Pack few cores per chip

High power & area per core

2 – 18 cores

Core

Core

Core

Core

Core

Core

Core

Core

L3 cache

(3)

Latest Hardware Designs

Mainstream multi-core CPUs

Use complex cores (e.g. Intel Haswell)

Massively superscalarAggressively out-of-order

Pack few cores per chip

High power & area per core

Many Integrated Cores (MIC)

Use simple cores (e.g. Intel P54C)

In-order & non-superscalar (for SIMD)

Augment SIMD to bridge the gap

Increase SIMD register size

More advanced SIMD instructions

Pack many cores per chip

Low power & area per core

(4)

SIMD & Databases

Automatic vectorization

Works for simple loops only

Rare in database operators

a

+

b

=

c

(5)

SIMD & Databases

Automatic vectorization

Works for simple loops only

Rare in database operators

Manual vectorization

Linear access operators

❖ Predicate evaluation ❖ Compression

Ad-hoc vectorization

❖ Sorting (e.g. merge-sort, comb-sort, …) ❖ Merging (e.g. 2-way, bitonic, …)

Generic vectorization

Multi-way trees

Bucketized hash tables

(6)

Full vectorization

From O(f(n)) scalar to O(f(n)/W) vector operations

Random accesses excluded

Principles for good (efficient) vectorization

Reuse fundamental operations across multiple vectorizations

Favor vertical vectorization by processing different input data per laneMaximize lane utilization by executing different things per lane subset

(7)

Full vectorization

From O(f(n)) scalar to O(f(n)/W) vector operations

Random accesses excluded

Principles for good (efficient) vectorization

Reuse fundamental operations across multiple vectorizations

Favor vertical vectorization by processing different input data per laneMaximize lane utilization by executing different things per lane subset

Vectorize basic operators

Selection scans

Hash tables

Partitioning

(8)

Contributions & Outline

Full vectorization

From O(f(n)) scalar to O(f(n)/W) vector operations

Random accesses excluded

Principles for good (efficient) vectorization

Reuse fundamental operations across multiple vectorizations

Favor vertical vectorization by processing different input data per laneMaximize lane utilization by executing different things per lane subset

Vectorize basic operators & build advanced operators (in memory)

Selection scans

Hash tables

Partitioning

(9)

Full vectorization

From O(f(n)) scalar to O(f(n)/W) vector operations

Random accesses excluded

Principles for good (efficient) vectorization

Reuse fundamental operations across multiple vectorizations

Favor vertical vectorization by processing different input data per laneMaximize lane utilization by executing different things per lane subset

Vectorize basic operators & build advanced operators (in memory)

Selection scans

Hash tables

Partitioning

Show impact of good vectorization

On software (database system) & hardware design

Contributions & Outline

(10)
(11)
(12)

Selection Scans

Scalar

Branching

Branch to store qualifiers

Affected by branch misses

Branchless

Eliminate branching

(13)

Selection Scans

1) evaluate predicate(s)

k1 k2 k3 k4

v0 v2 v3

k0 k2 k3

output keys

input payloads

input keys

2) selectively

store qualifiers

Scalar

Branching

Branch to store qualifiers

Affected by branch misses

Branchless

Eliminate branching

Use conditional flags

Vectorized

Simple

❖ Evaluate predicates (in SIMD)

Selectively store qualifiers

❖ “Early” materialized

Advanced

❖ “Late” materialized (in the paper …)

k1 k2 k3 k4 k5 k6 k7

v1 v2 v3 v4 v5 v6 v7

C C C C

output payloads

(14)

Selection Scans

T

hr

oug

hp

ut

(b

illio

n t

up

les

/ s

ec

ond

)

0

10

20

30

40

50

Selectivity (%)

0

1

2

5

10 20 50 100

Scalar (branching)

Scalar (branchless)

0

1

2

3

4

5

6

Selectivity (%)

0

1

2

5

10 20 50 100

Scalar (branching)

Scalar (branchless)

Xeon E3-1275v3 (Haswell)

branchless is slower on Xeon Phi !

branchless is faster on Haswell !

(15)

Selection Scans

0

1

2

3

4

5

6

Selectivity (%)

0

1

2

5

10 20 50 100

Scalar (branching)

Scalar (branchless)

Vectorized (early)

Vectorized (late)

T

hr

oug

hp

ut

(b

illio

n t

up

les

/ s

ec

ond

)

0

10

20

30

40

50

Selectivity (%)

0

1

2

5

10 20 50 100

Scalar (branching)

Scalar (branchless)

Vectorized (early)

Vectorized (late)

important case

(16)

Hash Tables

Scalar hash tables

Branching or branchless

1 input key at a time1 table key at a time

(17)

Hash Tables

Scalar hash tables

Branching or branchless

1 input key at a time1 table key at a time

Vectorized hash tables

Horizontal vectorization

Proposed on previous work1 input key at a time

W table keys per input key

Load bucket with W keys

(18)

Hash Tables

Scalar hash tables

Branching or branchless

1 input key at a time1 table key at a time

Vectorized hash tables

Horizontal vectorization

Proposed on previous work1 input key at a time

W table keys per input key

Load bucket with W keys

However …

W are too many comparisons

❖ No advantage of larger SIMD

(19)

Hash Tables

Vectorized hash tables

Vertical vectorization

W input keys at a time1 table keys per input key

Gather buckets

Probing (linear probing)

Store matching lanes

(20)

Hash Tables

k1

k5

k6

k4

h1+1

h6

h6

input

keys

hash

indexes

selectively load

input keys

h5

0

1

1

0

linear

probing

offsets

linear probing

hash table

h4+1

k2

k3

k7

k5

matching lanes replaced

non-matching lanes kept

Vectorized hash tables

Vertical vectorization

W input keys at a time1 table keys per input key

Gather buckets

Probing (linear probing)

Store matching lanes

Replace finished lanes

(21)

Hash Tables

k1

k2

k3

k4

h1

h2

h3

h4

k0

null

null

gather buckets

linear probing

hash table

Vectorized hash tables

Vertical vectorization

W input keys at a time1 table keys per input key

Gather buckets

Probing (linear probing)

Store matching lanes

Replace finished lanes

Keep unfinished lanes

Building (linear probing)

Keep non-empty lanes

input

keys

hash

indexes

(22)

Hash Tables

non-conflicting lanes

conflicting lanes

non-empty bucket lanes

h1

h2

h3

h4

1

4

1

4

1

2

3

4

!=

4

1

2

3

4

3

1) selective


scatter

2) selective

gather

3) compare

linear probing

hash table

Vectorized hash tables

Vertical vectorization

W input keys at a time1 table keys per input key

Gather buckets

Probing (linear probing)

Store matching lanes

Replace finished lanes

Keep unfinished lanes

Building (linear probing)

Keep non-empty lanes

(23)

Hash Tables

scatter tuples

k1

k2

k3

k4

h1

h2

h3

h4

k0

linear probing

hash table

k1 v1

k4 v4

v0

input

keys

hash

indexes

non-conflicting lanes scattered

conflicting lanes

non-empty bucket lanes

Vectorized hash tables

Vertical vectorization

W input keys at a time1 table keys per input key

Gather buckets

Probing (linear probing)

Store matching lanes

Replace finished lanes

Keep unfinished lanes

Building (linear probing)

Keep non-empty lanes

Detect scatter conflicts

(24)

Hash Tables

selectively load

input keys

(2nd iteration)

Vectorized hash tables

Vertical vectorization

W input keys at a time1 table keys per input key

Gather buckets

Probing (linear probing)

Store matching lanes

Replace finished lanes

Keep unfinished lanes

Building (linear probing)

Keep non-empty lanes

Detect scatter conflicts

Scatter empty lanes & replace

Keep conflicting lanes

k5

k2

k3

k6

h5

h2

h3

h6

input

keys

hash

indexes

k0

linear probing

hash table

k1 v1

k4 v4

v0

non-conflicting lanes replaced

conflicting lanes kept

(25)

Hash Table (Linear Probing)

Pr

ob

ing

t

hr

oug

hp

ut

(b

illio

n t

up

les

/ s

ec

ond

)

0

2

4

6

8

10

12

4 K

B

16 K

B

64 K

B

256 K

B

1 M

B

4 M

B

16 M

B

64 M

B

Scalar

0

0.5

1

1.5

2

4 K

B

16 K

B

64 K

B

256 K

B

1 M

B

4 M

B

16 M

B

64 M

B

Scalar

Hash table size

Xeon E3-1275v3 (Haswell)

Hash table size

(26)

Hash Table (Linear Probing)

Pr

ob

ing

t

hr

oug

hp

ut

(b

illio

n t

up

les

/ s

ec

ond

)

0

2

4

6

8

10

12

4 K

B

16 K

B

64 K

B

256 K

B

1 M

B

4 M

B

16 M

B

64 M

B

Scalar

Vector (horizontal)

0

0.5

1

1.5

2

4 K

B

16 K

B

64 K

B

256 K

B

1 M

B

4 M

B

16 M

B

64 M

B

Scalar

Vector (horizontal)

Hash table size

Xeon E3-1275v3 (Haswell)

Hash table size

Xeon Phi 7120P

(27)

Pr

ob

ing

t

hr

oug

hp

ut

(b

illio

n t

up

les

/ s

ec

ond

)

0

2

4

6

8

10

12

4 K

B

16 K

B

64 K

B

256 K

B

1 M

B

4 M

B

16 M

B

64 M

B

Scalar

Vector (horizontal)

Vector (vertical)

Hash Table (Linear Probing)

Hash table size

Xeon E3-1275v3 (Haswell)

Hash table size

Xeon Phi 7120P

much faster in cache !

(28)

Partitioning

Types

Radix

❖ 2 shifts (in SIMD)

Hash

❖ 2 multiplications (in SIMD)

Range

❖ Range function index

(29)

Partitioning

Types

Radix

❖ 2 shifts (in SIMD)

Hash

❖ 2 multiplications (in SIMD)

Range

❖ Range function index

❖ Binary search (in the paper …)

Histogram

Data parallel update

❖ Gather & scatter counts

Conflicts miss counts

k1 k2 k3 k4

h1 h2 h3 h4

histogram with P counts

+1

+1

+1

input keys

partition ids

+1

+1

+2

(30)

Partitioning

Types

Radix

❖ 2 shifts (in SIMD)

Hash

❖ 2 multiplications (in SIMD)

Range

❖ Range function index

❖ Binary search (in the paper …)

Histogram

Data parallel update

❖ Gather & scatter counts

Conflicts miss counts

Replicate histogram W times

❖ More solutions in the paper …

k1 k2 k3 k4

h1 h2 h3 h4

input keys

replicated histogram with P * W counts

partition ids

+1

+1

+1

(31)

His

to

g

ram

t

hr

oug

hp

ut

(b

illio

n t

up

les

/ s

ec

ond

)

0

5

10

15

20

25

Fanout (log)

3 4 5 6 7 8 9 10 11 12 13

Scalar radix

Scalar hash

Radix & Hash Histogram

in scalar code:

hash < radix

(32)

His

to

g

ram

t

hr

oug

hp

ut

(b

illio

n t

up

les

/ s

ec

ond

)

0

5

10

15

20

25

Fanout (log)

3 4 5 6 7 8 9 10 11 12 13

Scalar radix

Scalar hash

Vector radix/hash (replicate)

Radix & Hash Histogram

Histogram generation on Xeon Phi

(33)

His

to

g

ram

t

hr

oug

hp

ut

(b

illio

n t

up

les

/ s

ec

ond

)

0

5

10

15

20

25

Fanout (log)

3 4 5 6 7 8 9 10 11 12 13

Scalar radix

Scalar hash

Vector radix/hash (replicate)

Vector radix/hash (replicate & compress)

Radix & Hash Histogram

in L2

in L1

(34)

Partitioning

Shuffling

Update the offsets

❖ Gather & scatter counts

Conflicts miss counts

Tuple transfer

Scatter directly to output

Conflicts overwrite tuples

h1 h2 h3 h4

l1

l3

partition ids

offset array

l2

keys

k1 k2 k3 k4

1) gather offsets

l1 l2 l3 l4

k1

k3

v1

v3

output keys

k4

v4

output payloads

conflicting lanes

non-conflicting lanes

2) scatter tuples

(35)

Partitioning

Shuffling

Update the offsets

❖ Gather & scatter counts

Conflicts miss counts

Tuple transfer

❖ Scatter directly to output

Conflicts overwrite tuples

(36)

Partitioning

Shuffling

Update the offsets

❖ Gather & scatter counts

Conflicts miss counts

Tuple transfer

❖ Scatter directly to output

Conflicts overwrite tuples

Serialize conflicts

❖ More in the paper …

h1 h2 h3 h4

l1

l3

partition ids

offset array

l2

keys

k1 k2 k3 k4

1) gather offsets

3) scatter tuples

l1 l2 l3 l4

k1

k3

v1

v3

output keys

partition offsets

output payloads

conflicting lanes

non-conflicting lanes

0

0

0

1

+

k2 k4

v4

v2

(37)

Partitioning

Shuffling

Update the offsets

❖ Gather & scatter counts

Conflicts miss counts

Tuple transfer

❖ Scatter directly to output

Conflicts overwrite tuples

Serialize conflicts

❖ More in the paper …

Buffering

When input >> cacheHandle conflicts

Scatter tuples to buffersBuffers are cache-residentStream full buffers to output

h1 h2 h3 h4

l1

l3

partition ids

offset array

l2

keys

k1 k2 k3 k4

1) gather offsets

2) serialize conflicts

3) scatter to buffer

4) flush full buffers

l1 l2 l3 l4

k1

k3

v1

v3

buffered keys

partition offsets

buffered payloads

conflicting lanes

non-conflicting lanes

0

0

0

1

+

k2 k4

v4

v2

(38)

Partitioning

1-tuple

scatters

W-tuple

stream

stores

input

buffers

output

vertical logic

horizontal logic

Shuffling

Update the offsets

❖ Gather & scatter counts

Conflicts miss counts

Tuple transfer

❖ Scatter directly to output

Conflicts overwrite tuples

Serialize conflicts

❖ More in the paper …

Buffering

When input >> cacheHandle conflicts

(39)

0

1

2

3

4

5

6

Fanout (log)

3 4 5 6 7 8 9 10 11 12 13

Scalar unbuffered

Scalar buffered

Buffered & Unbuffered Partitioning

Large-scale data shuffling on Xeon Phi

(40)

0

1

2

3

4

5

6

Fanout (log)

3 4 5 6 7 8 9 10 11 12 13

Scalar unbuffered

Scalar buffered

Vector unbuffered

Vector buffered

Buffered & Unbuffered Partitioning

Large-scale data shuffling on Xeon Phi

vectorization

orthogonal to


(41)

Sorting & Hash Join Algorithms

Least-significant-bit (LSB) radix-sort

Stable radix partitioning passes

(42)

Least-significant-bit (LSB) radix-sort

Stable radix partitioning passes

Fully vectorized

Hash join

No partitioning

Build 1 shared hash table using atomics

Partially vectorized

(43)

Least-significant-bit (LSB) radix-sort

Stable radix partitioning passes

Fully vectorized

Hash join

No partitioning

Build 1 shared hash table using atomics

Partially vectorized

Min partitioning

Partition building table

❖ Build 1 hash table per thread

Fully vectorized

(44)

Least-significant-bit (LSB) radix-sort

Stable radix partitioning passes

Fully vectorized

Hash join

No partitioning

Build 1 shared hash table using atomics

Partially vectorized

Min partitioning

Partition building table

❖ Build 1 hash table per thread

Fully vectorized

Max partitioning

Partition both tables repeatedly

Build & probe cache-resident hash tables

Fully vectorized

(45)

Hash Joins

Jo

in t

im

e (s

ec

ond

s)

0

0.5

1

1.5

2

Scalar Vector

Scalar Vector

Scalar Vector

Partition

Build

Probe

Build & Probe

fastest

So

rt

t

im

e (s

ec

ond

s)

0

0.5

1

1.5

2

Scalar Vector

min partition join

max partition join

sort 400 million tuples &

join 200 & 200 million tuples

32-bit keys & payloads

on Xeon Phi

(46)

Hash Joins

Jo

in t

im

e (s

ec

ond

s)

0

0.5

1

1.5

2

Scalar Vector

Scalar Vector

Scalar Vector

Partition

Build

Probe

Build & Probe

not improved

improved

3.4X !

So

rt

t

im

e (s

ec

ond

s)

0

0.5

1

1.5

2

Scalar Vector

min partition join

max partition join

sort 400 million tuples &

join 200 & 200 million tuples

32-bit keys & payloads

on Xeon Phi

no partition join

LSB radixsort

improved

(47)

Power Efficiency

T

im

e (s

ec

ond

s)

0

0.15

0.3

0.45

0.6

Sort

Join

Histogram

Shuffle

Partition

(48)

Power Efficiency

T

im

e (s

ec

ond

s)

0

0.15

0.3

0.45

0.6

Sort

Join

Sort

Join

42% less

power

Histogram

Shuffle

Partition

Build & probe

NUMA split

1 co-processor


(Xeon Phi 7120P)

61 P54C cores

520 Watts TDP

1.238 GHz

80 GB/s *

* comparable speed

even if equalized


(in the paper …)

(49)

Conclusions

Full vectorization

O(f(n)) scalar —> O(f(n)/W) vector operations

Good vectorization principles improve performance

Define & reuse fundamental operations

(50)

Conclusions

Full vectorization

O(f(n)) scalar —> O(f(n)/W) vector operations

Good vectorization principles improve performance

Define & reuse fundamental operations

e.g. vertical vectorization, maximize lane utilization, …

Impact on software design

Vectorization favors cache-conscious algorithms

e.g. partitioned hash join >> non-partitioned hash join if vectorized

Vectorization is orthogonal to other optimizations

(51)

Conclusions

Full vectorization

O(f(n)) scalar —> O(f(n)/W) vector operations

Good vectorization principles improve performance

Define & reuse fundamental operations

e.g. vertical vectorization, maximize lane utilization, …

Impact on software design

Vectorization favors cache-conscious algorithms

e.g. partitioned hash join >> non-partitioned hash join if vectorized

Vectorization is orthogonal to other optimizations

e.g. both unbuffered & buffered partitioning get vectorization speedup

Impact on hardware design

Simple cores almost as fast as complex cores (for OLAP)

61 simple P54C cores ~ 32 complex Sandy Bridge cores

(52)
(53)

L

SB

rad

ix

ort

t

im

e (s

ec

ond

s)

0

0.2

0.4

0.6

0.8

1

1.2

8-b

it

16-b

it

32-b

it

64-b

it

2x

64-b

it

3x

64-b

it

4x

64-b

it

200 million 32-bit keys &


X payloads on Xeon Phi

Part

it

io

ned

has

h j

oin t

im

e (s

)

0

0.1

0.2

0.3

0.4

0.5

4:1x

64-b

it

3:1x

64-b

it

2:1x

64-b

it

1:1x

64-b

it

1:2x

64-b

it

1:3x

64-b

it

1:4x

64-b

it

10 & 100 million 32-bit keys &


X payloads on Xeon Phi

References

Related documents

• The increase in the revenue financial support to the capital programme will be supplemented by additional capital grants from the Homes and Communities Agency (£4m), together

Dallas Baptist University is a comprehensive Christ-centered university dedicated to producing servant leaders through the integration of faith and learning.. With an enrollment

The outcomes measured in these studies included substance use, drug use, alcohol use, PTSD symptom severity, trauma related to sexual abuse, treatment retention, treatment

The DAU’s role is to advocate for the rights, protection and equality of opportunity for PWDs living in Trinidad and Tobago by monitoring and coordinat- ing the implementation of the

SIMATIC 1516F GLOBAL E-STOP DOOR CONTROL LOCAL E-STOP ET200SP I/O CONTACTORS WITH F/B ACK / RESET.. Siemens Industrial Academy, FY15 Safety Hands-on Exercise - Wiring.

♦ The employer contribution rate for the Police Officers system is required to increase by 4.56 percentage points based on the funding ratio falling below the 90 percent threshold

Software &amp; Accessories and Printers, Displays &amp; Electronics offers do not apply to Recommended Solutions systems in the online system configuration pages

Our methodology is evaluated using BCP for propositionalization and the propositional rule learner RIPPER on four Alzheimer datasets [9] and we have used three metrics to present