SIGMOD 2015, Melbourne, Victoria, Australia
Rethinking SIMD Vectorization
for In-Memory Databases
❖
Mainstream multi-core CPUs
❖
Use complex cores (e.g. Intel Haswell)
❖ Massively superscalar ❖ Aggressively out-of-order
❖
Pack few cores per chip
❖ High power & area per core
2 – 18 cores
Core
Core
Core
Core
Core
Core
Core
Core
L3 cache
Latest Hardware Designs
❖Mainstream multi-core CPUs
❖
Use complex cores (e.g. Intel Haswell)
❖ Massively superscalar ❖ Aggressively out-of-order
❖
Pack few cores per chip
❖ High power & area per core
❖
Many Integrated Cores (MIC)
❖
Use simple cores (e.g. Intel P54C)
❖ In-order & non-superscalar (for SIMD)
❖
Augment SIMD to bridge the gap
❖ Increase SIMD register size
❖ More advanced SIMD instructions
❖
Pack many cores per chip
❖ Low power & area per core
SIMD & Databases
❖Automatic vectorization
❖
Works for simple loops only
❖ Rare in database operators
a
+
b
=
c
SIMD & Databases
❖Automatic vectorization
❖
Works for simple loops only
❖ Rare in database operators
❖
Manual vectorization
❖
Linear access operators
❖ Predicate evaluation ❖ Compression
❖
Ad-hoc vectorization
❖ Sorting (e.g. merge-sort, comb-sort, …) ❖ Merging (e.g. 2-way, bitonic, …)
❖
Generic vectorization
❖ Multi-way trees
❖ Bucketized hash tables
❖
Full vectorization
❖
From O(f(n)) scalar to O(f(n)/W) vector operations
❖ Random accesses excluded
❖
Principles for good (efficient) vectorization
❖ Reuse fundamental operations across multiple vectorizations
❖ Favor vertical vectorization by processing different input data per lane ❖ Maximize lane utilization by executing different things per lane subset
❖
Full vectorization
❖
From O(f(n)) scalar to O(f(n)/W) vector operations
❖ Random accesses excluded
❖
Principles for good (efficient) vectorization
❖ Reuse fundamental operations across multiple vectorizations
❖ Favor vertical vectorization by processing different input data per lane ❖ Maximize lane utilization by executing different things per lane subset
❖
Vectorize basic operators
❖
Selection scans
❖Hash tables
❖Partitioning
Contributions & Outline
❖Full vectorization
❖
From O(f(n)) scalar to O(f(n)/W) vector operations
❖ Random accesses excluded
❖
Principles for good (efficient) vectorization
❖ Reuse fundamental operations across multiple vectorizations
❖ Favor vertical vectorization by processing different input data per lane ❖ Maximize lane utilization by executing different things per lane subset
❖
Vectorize basic operators & build advanced operators (in memory)
❖
Selection scans
❖Hash tables
❖Partitioning
❖
Full vectorization
❖
From O(f(n)) scalar to O(f(n)/W) vector operations
❖ Random accesses excluded
❖
Principles for good (efficient) vectorization
❖ Reuse fundamental operations across multiple vectorizations
❖ Favor vertical vectorization by processing different input data per lane ❖ Maximize lane utilization by executing different things per lane subset
❖
Vectorize basic operators & build advanced operators (in memory)
❖
Selection scans
❖Hash tables
❖Partitioning
❖
Show impact of good vectorization
❖
On software (database system) & hardware design
Contributions & Outline
Selection Scans
❖Scalar
❖
Branching
❖ Branch to store qualifiers
❖ Affected by branch misses
❖
Branchless
❖ Eliminate branching
Selection Scans
1) evaluate predicate(s)
k1 k2 k3 k4
v0 v2 v3
k0 k2 k3
output keys
input payloads
input keys
2) selectively
store qualifiers
❖Scalar
❖Branching
❖ Branch to store qualifiers
❖ Affected by branch misses
❖
Branchless
❖ Eliminate branching
❖ Use conditional flags
❖
Vectorized
❖
Simple
❖ Evaluate predicates (in SIMD)
❖ Selectively store qualifiers
❖ “Early” materialized
❖
Advanced
❖ “Late” materialized (in the paper …)
k1 k2 k3 k4 k5 k6 k7
v1 v2 v3 v4 v5 v6 v7
C C C C
output payloads
Selection Scans
T
hr
oug
hp
ut
(b
illio
n t
up
les
/ s
ec
ond
)
0
10
20
30
40
50
Selectivity (%)
0
1
2
5
10 20 50 100
Scalar (branching)
Scalar (branchless)
0
1
2
3
4
5
6
Selectivity (%)
0
1
2
5
10 20 50 100
Scalar (branching)
Scalar (branchless)
Xeon E3-1275v3 (Haswell)
branchless is slower on Xeon Phi !
branchless is faster on Haswell !
Selection Scans
0
1
2
3
4
5
6
Selectivity (%)
0
1
2
5
10 20 50 100
Scalar (branching)
Scalar (branchless)
Vectorized (early)
Vectorized (late)
T
hr
oug
hp
ut
(b
illio
n t
up
les
/ s
ec
ond
)
0
10
20
30
40
50
Selectivity (%)
0
1
2
5
10 20 50 100
Scalar (branching)
Scalar (branchless)
Vectorized (early)
Vectorized (late)
important case
Hash Tables
❖Scalar hash tables
❖
Branching or branchless
❖ 1 input key at a time ❖ 1 table key at a time
Hash Tables
❖Scalar hash tables
❖
Branching or branchless
❖ 1 input key at a time ❖ 1 table key at a time
❖
Vectorized hash tables
❖
Horizontal vectorization
❖ Proposed on previous work ❖ 1 input key at a time
❖ W table keys per input key
❖ Load bucket with W keys
Hash Tables
❖Scalar hash tables
❖
Branching or branchless
❖ 1 input key at a time ❖ 1 table key at a time
❖
Vectorized hash tables
❖
Horizontal vectorization
❖ Proposed on previous work ❖ 1 input key at a time
❖ W table keys per input key
❖ Load bucket with W keys
❖
However …
❖ W are too many comparisons
❖ No advantage of larger SIMD
Hash Tables
❖Vectorized hash tables
❖
Vertical vectorization
❖ W input keys at a time ❖ 1 table keys per input key
❖ Gather buckets
❖
Probing (linear probing)
❖ Store matching lanes
Hash Tables
k1
k5
k6
k4
h1+1
h6
h6
input
keys
hash
indexes
selectively load
input keys
h5
0
1
1
0
linear
probing
offsets
linear probing
hash table
h4+1
k2
k3
k7
k5
matching lanes replaced
non-matching lanes kept
❖
Vectorized hash tables
❖
Vertical vectorization
❖ W input keys at a time ❖ 1 table keys per input key
❖ Gather buckets
❖
Probing (linear probing)
❖ Store matching lanes
❖ Replace finished lanes
Hash Tables
k1
k2
k3
k4
h1
h2
h3
h4
k0
null
null
gather buckets
linear probing
hash table
❖
Vectorized hash tables
❖
Vertical vectorization
❖ W input keys at a time ❖ 1 table keys per input key
❖ Gather buckets
❖
Probing (linear probing)
❖ Store matching lanes
❖ Replace finished lanes
❖ Keep unfinished lanes
❖
Building (linear probing)
❖ Keep non-empty lanes
input
keys
hash
indexes
Hash Tables
non-conflicting lanes
conflicting lanes
non-empty bucket lanes
h1
h2
h3
h4
1
4
1
4
1
2
3
4
!=
4
1
2
3
4
3
1) selective
scatter
2) selective
gather
3) compare
linear probing
hash table
❖
Vectorized hash tables
❖
Vertical vectorization
❖ W input keys at a time ❖ 1 table keys per input key
❖ Gather buckets
❖
Probing (linear probing)
❖ Store matching lanes
❖ Replace finished lanes
❖ Keep unfinished lanes
❖
Building (linear probing)
❖ Keep non-empty lanes
Hash Tables
scatter tuples
k1
k2
k3
k4
h1
h2
h3
h4
k0
linear probing
hash table
k1 v1
k4 v4
v0
input
keys
hash
indexes
non-conflicting lanes scattered
conflicting lanes
non-empty bucket lanes
❖
Vectorized hash tables
❖
Vertical vectorization
❖ W input keys at a time ❖ 1 table keys per input key
❖ Gather buckets
❖
Probing (linear probing)
❖ Store matching lanes
❖ Replace finished lanes
❖ Keep unfinished lanes
❖
Building (linear probing)
❖ Keep non-empty lanes
❖ Detect scatter conflicts
Hash Tables
selectively load
input keys
(2nd iteration)
❖
Vectorized hash tables
❖
Vertical vectorization
❖ W input keys at a time ❖ 1 table keys per input key
❖ Gather buckets
❖
Probing (linear probing)
❖ Store matching lanes
❖ Replace finished lanes
❖ Keep unfinished lanes
❖
Building (linear probing)
❖ Keep non-empty lanes
❖ Detect scatter conflicts
❖ Scatter empty lanes & replace
❖ Keep conflicting lanes
k5
k2
k3
k6
h5
h2
h3
h6
input
keys
hash
indexes
k0
linear probing
hash table
k1 v1
k4 v4
v0
non-conflicting lanes replaced
conflicting lanes kept
Hash Table (Linear Probing)
Pr
ob
ing
t
hr
oug
hp
ut
(b
illio
n t
up
les
/ s
ec
ond
)
0
2
4
6
8
10
12
4 K
B
16 K
B
64 K
B
256 K
B
1 M
B
4 M
B
16 M
B
64 M
B
Scalar
0
0.5
1
1.5
2
4 K
B
16 K
B
64 K
B
256 K
B
1 M
B
4 M
B
16 M
B
64 M
B
Scalar
Hash table size
Xeon E3-1275v3 (Haswell)
Hash table size
Hash Table (Linear Probing)
Pr
ob
ing
t
hr
oug
hp
ut
(b
illio
n t
up
les
/ s
ec
ond
)
0
2
4
6
8
10
12
4 K
B
16 K
B
64 K
B
256 K
B
1 M
B
4 M
B
16 M
B
64 M
B
Scalar
Vector (horizontal)
0
0.5
1
1.5
2
4 K
B
16 K
B
64 K
B
256 K
B
1 M
B
4 M
B
16 M
B
64 M
B
Scalar
Vector (horizontal)
Hash table size
Xeon E3-1275v3 (Haswell)
Hash table size
Xeon Phi 7120P
Pr
ob
ing
t
hr
oug
hp
ut
(b
illio
n t
up
les
/ s
ec
ond
)
0
2
4
6
8
10
12
4 K
B
16 K
B
64 K
B
256 K
B
1 M
B
4 M
B
16 M
B
64 M
B
Scalar
Vector (horizontal)
Vector (vertical)
Hash Table (Linear Probing)
Hash table size
Xeon E3-1275v3 (Haswell)
Hash table size
Xeon Phi 7120P
much faster in cache !
Partitioning
❖Types
❖
Radix
❖ 2 shifts (in SIMD)
❖
Hash
❖ 2 multiplications (in SIMD)
❖
Range
❖ Range function index
Partitioning
❖Types
❖
Radix
❖ 2 shifts (in SIMD)
❖
Hash
❖ 2 multiplications (in SIMD)
❖
Range
❖ Range function index
❖ Binary search (in the paper …)
❖
Histogram
❖
Data parallel update
❖ Gather & scatter counts
❖ Conflicts miss counts
k1 k2 k3 k4
h1 h2 h3 h4
histogram with P counts
+1
+1
+1
input keys
partition ids
+1
+1
+2
Partitioning
❖Types
❖
Radix
❖ 2 shifts (in SIMD)
❖
Hash
❖ 2 multiplications (in SIMD)
❖
Range
❖ Range function index
❖ Binary search (in the paper …)
❖
Histogram
❖
Data parallel update
❖ Gather & scatter counts
❖ Conflicts miss counts
❖ Replicate histogram W times
❖ More solutions in the paper …
k1 k2 k3 k4
h1 h2 h3 h4
input keys
replicated histogram with P * W counts
partition ids
+1
+1
+1
His
to
g
ram
t
hr
oug
hp
ut
(b
illio
n t
up
les
/ s
ec
ond
)
0
5
10
15
20
25
Fanout (log)
3 4 5 6 7 8 9 10 11 12 13
Scalar radix
Scalar hash
Radix & Hash Histogram
in scalar code:
hash < radix
His
to
g
ram
t
hr
oug
hp
ut
(b
illio
n t
up
les
/ s
ec
ond
)
0
5
10
15
20
25
Fanout (log)
3 4 5 6 7 8 9 10 11 12 13
Scalar radix
Scalar hash
Vector radix/hash (replicate)
Radix & Hash Histogram
Histogram generation on Xeon Phi
His
to
g
ram
t
hr
oug
hp
ut
(b
illio
n t
up
les
/ s
ec
ond
)
0
5
10
15
20
25
Fanout (log)
3 4 5 6 7 8 9 10 11 12 13
Scalar radix
Scalar hash
Vector radix/hash (replicate)
Vector radix/hash (replicate & compress)
Radix & Hash Histogram
in L2
in L1
Partitioning
❖Shuffling
❖
Update the offsets
❖ Gather & scatter counts
❖ Conflicts miss counts
❖
Tuple transfer
❖ Scatter directly to output
❖ Conflicts overwrite tuples
h1 h2 h3 h4
l1
l3
partition ids
offset array
l2
keys
k1 k2 k3 k4
1) gather offsets
l1 l2 l3 l4
k1
k3
v1
v3
output keys
k4
v4
output payloads
conflicting lanes
non-conflicting lanes
2) scatter tuples
Partitioning
❖Shuffling
❖
Update the offsets
❖ Gather & scatter counts
❖ Conflicts miss counts
❖
Tuple transfer
❖ Scatter directly to output
❖ Conflicts overwrite tuples
Partitioning
❖Shuffling
❖
Update the offsets
❖ Gather & scatter counts
❖ Conflicts miss counts
❖
Tuple transfer
❖ Scatter directly to output
❖ Conflicts overwrite tuples
❖ Serialize conflicts
❖ More in the paper …
h1 h2 h3 h4
l1
l3
partition ids
offset array
l2
keys
k1 k2 k3 k4
1) gather offsets
3) scatter tuples
l1 l2 l3 l4
k1
k3
v1
v3
output keys
partition offsets
output payloads
conflicting lanes
non-conflicting lanes
0
0
0
1
+
k2 k4
v4
v2
Partitioning
❖Shuffling
❖
Update the offsets
❖ Gather & scatter counts
❖ Conflicts miss counts
❖
Tuple transfer
❖ Scatter directly to output
❖ Conflicts overwrite tuples
❖ Serialize conflicts
❖ More in the paper …
❖
Buffering
❖ When input >> cache ❖ Handle conflicts
❖ Scatter tuples to buffers ❖ Buffers are cache-resident ❖ Stream full buffers to output
h1 h2 h3 h4
l1
l3
partition ids
offset array
l2
keys
k1 k2 k3 k4
1) gather offsets
2) serialize conflicts
3) scatter to buffer
4) flush full buffers
l1 l2 l3 l4
k1
k3
v1
v3
buffered keys
partition offsets
buffered payloads
conflicting lanes
non-conflicting lanes
0
0
0
1
+
k2 k4
v4
v2
Partitioning
1-tuple
scatters
W-tuple
stream
stores
input
buffers
output
vertical logic
horizontal logic
❖Shuffling
❖
Update the offsets
❖ Gather & scatter counts
❖ Conflicts miss counts
❖
Tuple transfer
❖ Scatter directly to output
❖ Conflicts overwrite tuples
❖ Serialize conflicts
❖ More in the paper …
❖
Buffering
❖ When input >> cache ❖ Handle conflicts
0
1
2
3
4
5
6
Fanout (log)
3 4 5 6 7 8 9 10 11 12 13
Scalar unbuffered
Scalar buffered
Buffered & Unbuffered Partitioning
Large-scale data shuffling on Xeon Phi
0
1
2
3
4
5
6
Fanout (log)
3 4 5 6 7 8 9 10 11 12 13
Scalar unbuffered
Scalar buffered
Vector unbuffered
Vector buffered
Buffered & Unbuffered Partitioning
Large-scale data shuffling on Xeon Phi
vectorization
orthogonal to
Sorting & Hash Join Algorithms
❖Least-significant-bit (LSB) radix-sort
❖
Stable radix partitioning passes
❖
Least-significant-bit (LSB) radix-sort
❖
Stable radix partitioning passes
❖ Fully vectorized
❖
Hash join
❖
No partitioning
❖ Build 1 shared hash table using atomics
❖ Partially vectorized
❖
Least-significant-bit (LSB) radix-sort
❖
Stable radix partitioning passes
❖ Fully vectorized
❖
Hash join
❖
No partitioning
❖ Build 1 shared hash table using atomics
❖ Partially vectorized
❖
Min partitioning
❖ Partition building table
❖ Build 1 hash table per thread
❖ Fully vectorized
❖
Least-significant-bit (LSB) radix-sort
❖
Stable radix partitioning passes
❖ Fully vectorized
❖
Hash join
❖
No partitioning
❖ Build 1 shared hash table using atomics
❖ Partially vectorized
❖
Min partitioning
❖ Partition building table
❖ Build 1 hash table per thread
❖ Fully vectorized
❖
Max partitioning
❖ Partition both tables repeatedly
❖ Build & probe cache-resident hash tables
❖ Fully vectorized
Hash Joins
Jo
in t
im
e (s
ec
ond
s)
0
0.5
1
1.5
2
Scalar Vector
Scalar Vector
Scalar Vector
Partition
Build
Probe
Build & Probe
fastest
So
rt
t
im
e (s
ec
ond
s)
0
0.5
1
1.5
2
Scalar Vector
min partition join
max partition join
sort 400 million tuples &
join 200 & 200 million tuples
32-bit keys & payloads
on Xeon Phi
Hash Joins
Jo
in t
im
e (s
ec
ond
s)
0
0.5
1
1.5
2
Scalar Vector
Scalar Vector
Scalar Vector
Partition
Build
Probe
Build & Probe
not improved
improved
3.4X !
So
rt
t
im
e (s
ec
ond
s)
0
0.5
1
1.5
2
Scalar Vector
min partition join
max partition join
sort 400 million tuples &
join 200 & 200 million tuples
32-bit keys & payloads
on Xeon Phi
no partition join
LSB radixsort
improved
Power Efficiency
T
im
e (s
ec
ond
s)
0
0.15
0.3
0.45
0.6
Sort
Join
Histogram
Shuffle
Partition
Power Efficiency
T
im
e (s
ec
ond
s)
0
0.15
0.3
0.45
0.6
Sort
Join
Sort
Join
42% less
power
Histogram
Shuffle
Partition
Build & probe
NUMA split
1 co-processor
(Xeon Phi 7120P)
61 P54C cores
520 Watts TDP
1.238 GHz
80 GB/s *
* comparable speed
even if equalized
(in the paper …)
Conclusions
❖
Full vectorization
❖
O(f(n)) scalar —> O(f(n)/W) vector operations
❖
Good vectorization principles improve performance
❖ Define & reuse fundamental operations
Conclusions
❖
Full vectorization
❖
O(f(n)) scalar —> O(f(n)/W) vector operations
❖
Good vectorization principles improve performance
❖ Define & reuse fundamental operations
❖ e.g. vertical vectorization, maximize lane utilization, …
❖
Impact on software design
❖
Vectorization favors cache-conscious algorithms
❖ e.g. partitioned hash join >> non-partitioned hash join if vectorized
❖
Vectorization is orthogonal to other optimizations
Conclusions
❖
Full vectorization
❖
O(f(n)) scalar —> O(f(n)/W) vector operations
❖
Good vectorization principles improve performance
❖ Define & reuse fundamental operations
❖ e.g. vertical vectorization, maximize lane utilization, …
❖
Impact on software design
❖
Vectorization favors cache-conscious algorithms
❖ e.g. partitioned hash join >> non-partitioned hash join if vectorized
❖
Vectorization is orthogonal to other optimizations
❖ e.g. both unbuffered & buffered partitioning get vectorization speedup
❖
Impact on hardware design
❖
Simple cores almost as fast as complex cores (for OLAP)
❖ 61 simple P54C cores ~ 32 complex Sandy Bridge cores