Rethinking SIMD Vectorization for In-Memory Databases

(1)

SIGMOD 2015, Melbourne, Victoria, Australia

Rethinking SIMD Vectorization

for In-Memory Databases

(2)

❖

Mainstream multi-core CPUs

❖

Use complex cores (e.g. Intel Haswell)

❖ Massively superscalar ❖ Aggressively out-of-order

❖

Pack few cores per chip

❖ High power & area per core

2 – 18 cores

Core

L3 cache

(3)

Latest Hardware Designs

❖

Mainstream multi-core CPUs

❖

Use complex cores (e.g. Intel Haswell)

❖ Massively superscalar ❖ Aggressively out-of-order

❖

Pack few cores per chip

❖ High power & area per core

❖

Many Integrated Cores (MIC)

❖

Use simple cores (e.g. Intel P54C)

❖ In-order & non-superscalar (for SIMD)

❖

Augment SIMD to bridge the gap

❖ Increase SIMD register size

❖ More advanced SIMD instructions

❖

Pack many cores per chip

❖ Low power & area per core

(4)

SIMD & Databases

❖

Automatic vectorization

❖

Works for simple loops only

❖ Rare in database operators

a

+

b

=

c

(5)

SIMD & Databases

❖

Automatic vectorization

❖

Works for simple loops only

❖ Rare in database operators

❖

Manual vectorization

❖

Linear access operators

❖ Predicate evaluation ❖ Compression

❖

Ad-hoc vectorization

❖ Sorting (e.g. merge-sort, comb-sort, …) ❖ Merging (e.g. 2-way, bitonic, …)

❖

Generic vectorization

❖ Multi-way trees

❖ Bucketized hash tables

(6)

❖

Full vectorization

❖

From O(f(n)) scalar to O(f(n)/W) vector operations

❖ Random accesses excluded

❖

Principles for good (efficient) vectorization

❖ Reuse fundamental operations across multiple vectorizations

❖ Favor vertical vectorization by processing different input data per lane ❖ Maximize lane utilization by executing different things per lane subset

(7)

❖

Full vectorization

❖

From O(f(n)) scalar to O(f(n)/W) vector operations

❖

Principles for good (efficient) vectorization

❖

Vectorize basic operators

❖

Selection scans

❖

Hash tables

❖

Partitioning

(8)

Contributions & Outline

❖

Full vectorization

❖

From O(f(n)) scalar to O(f(n)/W) vector operations

❖

Principles for good (efficient) vectorization

❖

Vectorize basic operators & build advanced operators (in memory)

❖

Selection scans

❖

Hash tables

❖

Partitioning

(9)

❖

Full vectorization

❖

From O(f(n)) scalar to O(f(n)/W) vector operations

❖

Principles for good (efficient) vectorization

❖

Vectorize basic operators & build advanced operators (in memory)

❖

Selection scans

❖

Hash tables

❖

Partitioning

❖

Show impact of good vectorization

❖

On software (database system) & hardware design

Contributions & Outline

(10)

(11)

(12)

Selection Scans

❖

Scalar

❖

Branching

❖ Branch to store qualifiers

❖ Affected by branch misses

❖

Branchless

❖ Eliminate branching

(13)

Selection Scans

1) evaluate predicate(s)

k1 k2 k3 k4

v0 v2 v3

k0 k2 k3

output keys

input payloads

input keys

2) selectively

store qualifiers

❖

Scalar

❖

Branching

❖ Branch to store qualifiers

❖ Affected by branch misses

❖

Branchless

❖ Eliminate branching

❖ Use conditional flags

❖

Vectorized

❖

Simple

❖ Evaluate predicates (in SIMD)

❖ Selectively store qualifiers

❖ “Early” materialized

❖

Advanced

❖ “Late” materialized (in the paper …)

k1 k2 k3 k4 k5 k6 k7

v1 v2 v3 v4 v5 v6 v7

C C C C

output payloads

(14)

Selection Scans

T

hr

oug

hp

ut

 

(b

illio

n t

up

les

/ s

ec

ond

)

0

10

20

30

40

50 Selectivity (%)

0

1

2

5 10 20 50 100

Scalar (branching)

Scalar (branchless)

0

1

2

3

4

5

6 Selectivity (%)

0

1

2

5 10 20 50 100

Scalar (branching)

Scalar (branchless)

Xeon E3-1275v3 (Haswell)

branchless is slower on Xeon Phi !

branchless is faster on Haswell !

(15)

Selection Scans

0

1

2

3

4

5

6 Selectivity (%)

0

1

2

5 10 20 50 100

Scalar (branching)

Scalar (branchless)

Vectorized (early)

Vectorized (late)

T

hr

oug

hp

ut

 

(b

illio

n t

up

les

/ s

ec

ond

)

0

10

20

30

40

50 Selectivity (%)

0

1

2

5 10 20 50 100

Scalar (branching)

Scalar (branchless)

Vectorized (early)

Vectorized (late)

important case

(16)

Hash Tables

❖

Scalar hash tables

❖

Branching or branchless

❖ 1 input key at a time ❖ 1 table key at a time

(17)

Hash Tables

❖

Scalar hash tables

❖

Branching or branchless

❖

Vectorized hash tables

❖

Horizontal vectorization

❖ Proposed on previous work ❖ 1 input key at a time

❖ W table keys per input key

❖ Load bucket with W keys

(18)

Hash Tables

❖

Scalar hash tables

❖

Branching or branchless

❖

Vectorized hash tables

❖

Horizontal vectorization

❖ Proposed on previous work ❖ 1 input key at a time

❖ W table keys per input key

❖ Load bucket with W keys

❖

However …

❖ W are too many comparisons

❖ No advantage of larger SIMD

(19)

Hash Tables

❖

Vectorized hash tables

❖

Vertical vectorization

❖ W input keys at a time ❖ 1 table keys per input key

❖ Gather buckets

❖

Probing (linear probing)

❖ Store matching lanes

(20)

Hash Tables

k1

k5

k6

k4

h1+1

h6

input

keys

hash

indexes

selectively load

input keys

h5

0

1

0 linear

probing

offsets

linear probing

hash table

h4+1

k2

k3

k7

k5

matching lanes replaced

non-matching lanes kept

❖

Vectorized hash tables

❖

Vertical vectorization

❖

Probing (linear probing)

❖ Replace finished lanes

(21)

Hash Tables

k1

k2

k3

k4

h1

h2

h3

h4

k0

null

gather buckets

linear probing

hash table

❖

Vectorized hash tables

❖

Vertical vectorization

❖

Probing (linear probing)

❖ Keep unfinished lanes

❖

Building (linear probing)

❖ Keep non-empty lanes

input

keys

hash

indexes

(22)

Hash Tables

non-conflicting lanes

conflicting lanes

non-empty bucket lanes

h1

h2

h3

h4

1

4

1

4

1

2

3

4 !=

4

1

2

3

4

3 1) selective 

scatter

2) selective

gather

3) compare

linear probing

hash table

❖

Vectorized hash tables

❖

Vertical vectorization

❖

Probing (linear probing)

❖

Building (linear probing)

(23)

Hash Tables

scatter tuples

k1

k2

k3

k4

h1

h2

h3

h4

k0

linear probing

hash table

k1 v1

k4 v4

v0

input

keys

hash

indexes

non-conflicting lanes scattered

conflicting lanes

non-empty bucket lanes

❖

Vectorized hash tables

❖

Vertical vectorization

❖

Probing (linear probing)

❖

Building (linear probing)

❖ Detect scatter conflicts

(24)

Hash Tables

selectively load

input keys

(2nd iteration)

❖

Vectorized hash tables

❖

Vertical vectorization

❖

Probing (linear probing)

❖

Building (linear probing)

❖ Detect scatter conflicts

❖ Scatter empty lanes & replace

❖ Keep conflicting lanes

k5

k2

k3

k6

h5

h2

h3

h6

input

keys

hash

indexes

k0

linear probing

hash table

k1 v1

k4 v4

v0

non-conflicting lanes replaced

conflicting lanes kept

(25)

Hash Table (Linear Probing)

Pr

ob

ing

t

hr

oug

hp

ut

(b

illio

n t

up

les

/ s

ec

ond

)

0

2

4

6

8

10

12 4 K

B

16 K

B

64 K

B

256 K

B

1 M

B

4 M

B

16 M

B

64 M

B

Scalar

0

0.5

1

1.5

2 4 K

B

16 K

B

64 K

B

256 K

B

1 M

B

4 M

B

16 M

B

64 M

B

Scalar

Hash table size

Xeon E3-1275v3 (Haswell)

Hash table size

(26)

Hash Table (Linear Probing)

Pr

ob

ing

t

hr

oug

hp

ut

(b

illio

n t

up

les

/ s

ec

ond

)

0

2

4

6

8

10

12 4 K

B

16 K

B

64 K

B

256 K

B

1 M

B

4 M

B

16 M

B

64 M

B

Scalar

Vector (horizontal)

0

0.5

1

1.5

2 4 K

B

16 K

B

64 K

B

256 K

B

1 M

B

4 M

B

16 M

B

64 M

B

Scalar

Vector (horizontal)

Hash table size

Xeon E3-1275v3 (Haswell)

Hash table size

Xeon Phi 7120P

(27)

Pr

ob

ing

t

hr

oug

hp

ut

(b

illio

n t

up

les

/ s

ec

ond

)

0

2

4

6

8

10

12 4 K

B

16 K

B

64 K

B

256 K

B

1 M

B

4 M

B

16 M

B

64 M

B

Scalar

Vector (horizontal)

Vector (vertical)

Hash Table (Linear Probing)

Hash table size

Xeon E3-1275v3 (Haswell)

Hash table size

Xeon Phi 7120P

much faster in cache !

(28)

Partitioning

❖

Types

❖

Radix

❖ 2 shifts (in SIMD)

❖

Hash

❖ 2 multiplications (in SIMD)

❖

Range

❖ Range function index

(29)

Partitioning

❖

Types

❖

Radix

❖

Hash

❖

Range

❖ Binary search (in the paper …)

❖

Histogram

❖

Data parallel update

❖ Gather & scatter counts

❖ Conflicts miss counts

k1 k2 k3 k4

h1 h2 h3 h4

histogram with P counts

+1

input keys

partition ids

+1

+2

(30)

Partitioning

❖

Types

❖

Radix

❖

Hash

❖

Range

❖ Binary search (in the paper …)

❖

Histogram

❖

Data parallel update

❖ Replicate histogram W times

❖ More solutions in the paper …

k1 k2 k3 k4

h1 h2 h3 h4

input keys

**replicated histogram with P * W counts**

partition ids

+1

(31)

His

to

g

ram

t

hr

oug

hp

ut

 

(b

illio

n t

up

les

/ s

ec

ond

)

0

5

10

15

20

25 Fanout (log)

3 4 5 6 7 8 9 10 11 12 13

Scalar radix

Scalar hash

Radix & Hash Histogram

in scalar code:

hash < radix

(32)

His

to

g

ram

t

hr

oug

hp

ut

 

(b

illio

n t

up

les

/ s

ec

ond

)

0

5

10

15

20

25 Fanout (log)

3 4 5 6 7 8 9 10 11 12 13

Scalar radix

Scalar hash

Vector radix/hash (replicate)

Radix & Hash Histogram

Histogram generation on Xeon Phi

(33)

His

to

g

ram

t

hr

oug

hp

ut

 

(b

illio

n t

up

les

/ s

ec

ond

)

0

5

10

15

20

25 Fanout (log)

3 4 5 6 7 8 9 10 11 12 13

Scalar radix

Scalar hash

Vector radix/hash (replicate)

Vector radix/hash (replicate & compress)

Radix & Hash Histogram

in L2

in L1

(34)

Partitioning

❖

Shuffling

❖

Update the offsets

❖

Tuple transfer

❖ Scatter directly to output

❖ Conflicts overwrite tuples

h1 h2 h3 h4

l1

l3

partition ids

offset array

l2

keys

k1 k2 k3 k4

1) gather offsets

l1 l2 l3 l4

k1

k3

v1

v3

output keys

k4

v4

output payloads

conflicting lanes

non-conflicting lanes

2) scatter tuples

(35)

Partitioning

❖

Shuffling

❖

Update the offsets

❖

Tuple transfer

❖ Scatter directly to output

(36)

Partitioning

❖

Shuffling

❖

Update the offsets

❖

Tuple transfer

❖ Serialize conflicts

❖ More in the paper …

h1 h2 h3 h4

l1

l3

partition ids

offset array

l2

keys

k1 k2 k3 k4

1) gather offsets

3) scatter tuples

l1 l2 l3 l4

k1

k3

v1

v3

output keys

partition offsets

output payloads

conflicting lanes

non-conflicting lanes

0

1 +

k2 k4

v4

v2

(37)

Partitioning

❖

Shuffling

❖

Update the offsets

❖

Tuple transfer

❖

Buffering

❖ When input >> cache ❖ Handle conflicts

❖ Scatter tuples to buffers ❖ Buffers are cache-resident ❖ Stream full buffers to output

h1 h2 h3 h4

l1

l3

partition ids

offset array

l2

keys

k1 k2 k3 k4

1) gather offsets

2) serialize conflicts

3) scatter to buffer

4) flush full buffers

l1 l2 l3 l4

k1

k3

v1

v3

buffered keys

partition offsets

buffered payloads

conflicting lanes

non-conflicting lanes

0

1 +

k2 k4

v4

v2

(38)

Partitioning

1-tuple

scatters

W-tuple

stream

stores

input

buffers

output

vertical logic

horizontal logic

❖

Shuffling

❖

Update the offsets

❖

Tuple transfer

❖

Buffering

❖ When input >> cache ❖ Handle conflicts

(39)

0

1

2

3

4

5

6 Fanout (log)

3 4 5 6 7 8 9 10 11 12 13

Scalar unbuffered

Scalar buffered

Buffered & Unbuffered Partitioning

Large-scale data shuffling on Xeon Phi

(40)

0

1

2

3

4

5

6 Fanout (log)

3 4 5 6 7 8 9 10 11 12 13

Scalar unbuffered

Scalar buffered

Vector unbuffered

Vector buffered

Buffered & Unbuffered Partitioning

Large-scale data shuffling on Xeon Phi

vectorization

orthogonal to 

(41)

Sorting & Hash Join Algorithms

❖

Least-significant-bit (LSB) radix-sort

❖

Stable radix partitioning passes

(42)

❖

Least-significant-bit (LSB) radix-sort

❖

Stable radix partitioning passes

❖ Fully vectorized

❖

Hash join

❖

No partitioning

❖ Build 1 shared hash table using atomics

❖ Partially vectorized

(43)

❖

Least-significant-bit (LSB) radix-sort

❖

Stable radix partitioning passes

❖

Hash join

❖

No partitioning

❖

Min partitioning

❖ Partition building table

❖ Build 1 hash table per thread

(44)

❖

Least-significant-bit (LSB) radix-sort

❖

Stable radix partitioning passes

❖

Hash join

❖

No partitioning

❖

Min partitioning

❖ Partition building table

❖ Build 1 hash table per thread

❖

Max partitioning

❖ Partition both tables repeatedly

❖ Build & probe cache-resident hash tables

(45)

Hash Joins

Jo

in t

im

e (s

ec

ond

s)

0

0.5

1

1.5

2 Scalar Vector

Scalar Vector

Partition

Build

Probe

Build & Probe

fastest

So

rt

t

im

e (s

ec

ond

s)

0

0.5

1

1.5

2 Scalar Vector

min partition join

max partition join

sort 400 million tuples &

join 200 & 200 million tuples

32-bit keys & payloads

on Xeon Phi

(46)

Hash Joins

Jo

in t

im

e (s

ec

ond

s)

0

0.5

1

1.5

2 Scalar Vector

Scalar Vector

Partition

Build

Probe

Build & Probe

not improved

improved

3.4X !

So

rt

t

im

e (s

ec

ond

s)

0

0.5

1

1.5

2 Scalar Vector

min partition join

max partition join

sort 400 million tuples &

join 200 & 200 million tuples

32-bit keys & payloads

on Xeon Phi

no partition join

LSB radixsort

improved

(47)

Power Efficiency

T

im

e (s

ec

ond

s)

0

0.15

0.3

0.45

0.6 Sort

Join

Histogram

Shuffle

Partition

(48)

Power Efficiency

T

im

e (s

ec

ond

s)

0

0.15

0.3

0.45

0.6 Sort

Join

Sort

Join

42% less

power

Histogram

Shuffle

Partition

Build & probe

NUMA split

1 co-processor 

(Xeon Phi 7120P)

61 P54C cores

520 Watts TDP

1.238 GHz

80 GB/s *

* comparable speed

even if equalized 

(in the paper …)

(49)

Conclusions

❖

Full vectorization

❖

O(f(n)) scalar —> O(f(n)/W) vector operations

❖

Good vectorization principles improve performance

❖ Define & reuse fundamental operations

(50)

Conclusions

❖

Full vectorization

❖

O(f(n)) scalar —> O(f(n)/W) vector operations

❖

Good vectorization principles improve performance

❖ e.g. vertical vectorization, maximize lane utilization, …

❖

Impact on software design

❖

Vectorization favors cache-conscious algorithms

❖ e.g. partitioned hash join >> non-partitioned hash join if vectorized

❖

Vectorization is orthogonal to other optimizations

(51)

Conclusions

❖

Full vectorization

❖

O(f(n)) scalar —> O(f(n)/W) vector operations

❖

Good vectorization principles improve performance

❖ e.g. vertical vectorization, maximize lane utilization, …

❖

Impact on software design

❖

Vectorization favors cache-conscious algorithms

❖ e.g. partitioned hash join >> non-partitioned hash join if vectorized

❖

Vectorization is orthogonal to other optimizations

❖ e.g. both unbuffered & buffered partitioning get vectorization speedup

❖

Impact on hardware design

❖

Simple cores almost as fast as complex cores (for OLAP)

❖ 61 simple P54C cores ~ 32 complex Sandy Bridge cores

(52)

(53)