Part II: Software Infrastructure in Data Centers: Key-Value Data Management Systems

(1)

CSE 6350 File and Storage System Infrastructure in Data centers Supporting Internet-wide Services

Part II: Software Infrastructure in Data Centers:

Key-Value Data Management Systems

(2)

2

Key-Value Store

Clients

PUT(key, value) value = GET(key)

DELETE(key)

Key-Value Store Cluster

• Dynamo at Amazon

• BigTable (LevelDB) at Google

• Redis at GitHub, Digg, and Blizzard Interactive

• Memcached at Facebook, Zynga and Twitter

• Voldemort at Linkedin

(3)

NoSQL DB and KV Store

❑

A NoSQL or Not Only SQL database stores and organizes

data differently from the tabular relations used in relational

databases.

❑

Why NoSQL?

✓

Simplicity of design, horizontal scaling, and finer control

over availability.

❑

It can be classified as column, document, key-value, and

graph store based on their data models.

❑

The key-value model is one of the simplest non-trivial data

models, and richer data models are often implemented on

top of it.

✓ Applications store their data in a schema-less way.

❑

Array, linked list, binary search trees, B+ tree, or hash

table …?

(4)

4

The First Example: Memcached

(5)

Memcached

◼

Memcached is a high-performance, distributed in-

memory object caching system, generic in nature

◼

It is a key-based cache daemon that stores data (or

objects) wherever dedicated or spare RAM is available

for very quick access

◼

It is a distributed hash table. It doesn’t provide

redundancy, failover or authentication. If needed the

client has to handle that.

(6)

6

Memcached

◼

To reduce the load on the database by caching

data BEFORE it hits the database

◼

Can be used for more than just holding database

results (objects) and improve the entire

application’s response time

◼

Feel the need for speed

◼

Memcache is in RAM - much faster then hitting the

disk or the database

(7)

A Typical Use Case of Memcached

Web

Servers

Database Memcached Servers

(8)

8

A Typical Use Case of Memcached

Web

Servers

Database Memcached Servers

SELECT `name` FROM `users`

WHERE `id` = 1;

(9)

Query Result

A Typical Use Case of Memcached

Web

Servers

Database Memcached Servers

(10)

10

Query Result

A Typical Use Case of Memcached

Web

Servers

Database Memcached Servers

Key = MD5(“SELECT `name` FROM

`users` WHERE `id` = 1;”) Value = Query Result

Hash (Key) → Server ID PUT(key, Value)

(11)

Query Result

A Typical Use Case of Memcached

Web

Servers

Database Memcached Servers

`users` WHERE `id` = 1;”) Value = Query Result

PUT(key, Value)Hash (Key) → Server ID

(12)

12

A Typical Use Case of Memcached

Web

Servers

Database Memcached Servers

SELECT `name` FROM `users`

WHERE `id` = 1;

`users` WHERE `id` = 1;”) GET(key)

(13)

A Typical Use Case of Memcached

Web

Servers

Database Memcached Servers

`users` WHERE `id` = 1;”) GET(key) Hash (Key) → Server ID

(14)

14

Memcached

Distributed memory caching system as a cluser

14 Hash Function

Memcached Servers Key

(15)

The Data Structure

❑ Slab-based memory allocation: memory is divided into 1 MB slabs

consisting of fixed-length chunks. E.g, slab class 1 is 72 bytes and each slab of this class has 14563 chunks; slab class 43 is 1 MB and each slab has only one chunk. Least-Recently-Used (LRU) algorithm to select the items for replacement in each slab-class queue. (why slab?)

❑ Lock has to be used for integrity of the structure, which is very expensive (why?)

(16)

16

A Well-known One: Google’s BigTable and LevelDB

Chang, et al., Bigtable: A Distributed Storage System for Structured Data in OSDI’06.

❑

A multi-layered LSM-tree structure

❑

Progressively sort data for small memory demand

❑

Small number of levels for effective Bloom Filter

use.

(17)

Scale Problem

• Lots of data

• Millions of machines

• Different project/applications

• Hundreds of millions of users

Storage for (semi-)structured data

No commercial system big enough

• Couldn’t afford if there was one

Low-level storage optimization helps performance significantly

Much harder to do when running on top of a database layer

Google’s BigTable and LevelDB to Scale Data Store

(18)

18

Bigtable

Fault-tolerant, persistent

Scalable

•

Thousands of servers

•

Terabytes of in-memory data

•

Petabyte of disk-based data

•

Millions of reads/writes per second, efficient

scans

Self-managing

•

Servers can be added/removed dynamically

•

Servers adjust to load imbalance

(19)

Data model: a big map

• <Row, Column, Timestamp> triple for key

•

Each value is an uninterpreted array of bytes

• Arbitrary “columns” on a row-by-row basis

•

Column family:qualifier. a small number of families and large number of columns

• Lookup, insert, delete API

Each read or write of data under a single row key is atomic

(20)

20

SSTable

Immutable, sorted file of key-value pairs

Chunks of data plus an index

• Index is of block ranges, not values

• Index loaded into memory when SSTable is opened

• Lookup is a single disk seek

Alternatively, client can load SSTable into memory

Index

4KB block

SSTable

(21)

Tablet

Contains some range of rows of the table

Unit of distribution & load balance

Built out of multiple SSTables

Index 4KB

block

4KB block

SSTable

Index 4KB

block

4KB block

SSTable Tablet Start:aardvark End:apple

(22)

22

Table

Multiple tablets make up the table

SSTable SSTable SSTable SSTable Tablet

aardvark apple

Tablet

apple_two_E boat

(23)

Editing/Reading a Table

Mutations are committed to a commit log (in GFS)

Then applied to an in-memory version (memtable)

Reads applied to merged view of SSTables & memtable

Reads & writes continue during tablet split or merge

SSTable (sorted) Tablet

apple_two_E boat Insert

Insert Delete Insert Delete

Insert

Memtable (sorted)

(24)

24

Bigtable Tablet

LevelDB is similar to a single Bigtable tablet

(25)

Compactions

Minor compaction – convert a full memtable into an

SSTable, and start a new memtable

•

Reduce memory usage

•

Reduce log traffic on restart

Major compaction

•

Combining a number of SSTables into possibly smaller

number of SSTables.

•

No deletion records, only live data

(26)

26 26

Log-Structured Merge-Tree (LSM-tree)

•

Optimized for fast random updates, inserts and deletes

with moderate read performance.

•

Convert the random writes to sequential writes

–

Accumulate recent updates in memory

–

Flush the changes to disks sequentially in batches

–

Merge on-disk components periodically

•

At the expense of read performance

Management of SSTable -- LSM-tree

(27)

Google’s LevelDB: Progressively Increasing Level Size

MemTable Write Immutable

MemTable Memory

Disk

Dump

……

… Level 0

Level 1 10MB

Level 2 100 MB

Compaction Log

Manifest

Current

SSTable

MemTable

K1 V1 K2 V2 K3 V3

Immutable MemTable

SSTable

SSTable SSTable

(28)

28

LevelDB Write Flow

(29)

LevelDB Read Flow

Bloom Filter

(30)

30

LevelDB Compact (L0/L1)

0 3 5 9

(31)

LevelDB Compact (L0/L1 Move)

(32)

32

Background on Bloom Filter

❑ Data structure proposed by Burton Bloom

❑ Randomized data structure

– Strings are stored using multiple hash functions

– It can be queried to check the presence of a string

❑ Membership queries result in rare false positives but

never false negatives

❑ Originally used for UNIX spell check

❑ Modern applications include :

– Content Networks

– Summary Caches

– route trace-back

– Network measurements

– Intrusion Detection

(33)

Programming a Bloom Filter

(34)

34

Querying a Bloom Filter

(35)

Querying a Bloom Filter (False Positive)

(36)

36

Optimal Parameters of a Bloom Filter

Bloom filter computes k hash functions on input

Key Point : false positive rate decreases exponentially

with linear increase in number of bits per string (item)

(37)

Three Metrics to Minimize

Memory overhead

Read amplification

Write amplification

• Ideally 0 (no memory overhead)

• Limits query throughput

• Ideally 1 (no wasted flash reads)

• Limits write throughput

= Metadata (e.g., index, BF, etc) size per entry

= Disk reads per query

= Disk writes per entry

(38)

38

Design I: FAWN for Fast Read and Fast Write

FAWN (Fast Array of Wimpy Nodes)

−

A Key-Value Storage System

–

I/O intensive, not computation-intensive

–

Massive, concurrent, random, small-sized data access

−

A new low-power cluster architecture

–

Nodes equipped with embedded CPUs + FLASH

–

A tenth of the power compared to conventional architecture

David G. Andersen, et al, FAWN: a Fast Array of Wimpy Nodes, on SOSP’09

(39)

FAWN System

❑Hash Index to map 160-bit keys to a value stored in the

data log;

❑It stores only a fragment of the actual key in memory to

find a location in the log => reduce memory use.

❑How to store, lookup, update, and delete?

❑How to do garbage collection?

❑How to reconstruct after a crash, and how to speed up the

reconstruction?

❑Why is a Delete entry necessary? (hint: fault tolerance)

(40)

40

FAWN DataStore

Chained hash

entries in each bucket

(41)

DRAM Must be Used Efficiently

DRAM used for indexing (locating) items on the disk 1 TB of data to store on the disk

4 bytes of DRAM for key-value pair (previous state-of-the-art)

Key-value pair size (bytes) Index size

(GB)

32 B: Data deduplication

=> 125 GB!

168 B: Tweet

=> 24 GB 1 KB: Small image

=> 4 GB

(42)

42

❑Memory use is not proportional to the KV item count.

❑Place the hash table buckets, including the links, to the

flash.

❑Only the first pointers to buckets (hash table directory) are

in memory.

Design II: SkimpyStash for Small Memory Demand

❑Debnath et al., SkimpyStash: RAM Space Skimpy Key-

Value Store on Flash-based Storage in SIGMOD’11

(43)

❑Use in-RAM write buffer to enable batched writes (timeout

threshold to bound response time and concurrent write and

flush)

❑Basic operations:

➢ lookup: HT directory in memory → bucket on the flash

➢ insert: buffer in memory → batched write to the flash

as a log and linked into HT directory.

➢ delete: write a NULL entry

Design II: SkimpyStash for Small Memory Demand

(cont’d)

(44)

44

SkimpyStash Architecture

(45)

❑Garbage collection in the log:

➢ Start from the log tail (other than the currently written end).

➢ Do page by page,

➢ Cannot update predecessor’s pointer in the bucket.

➢ Compact and relocate whole bucket → leave orphans for garbage collection.

❑The cost of write/delete/lookup

In the worst case how many flash reads are needed for one lookup?

❑ The consequence of unbalanced buckets

Exceptionally long buckets → unacceptably long lookup time!

❑Solution

: two-choice-based hashing: each key would be hashed to two candidate HT directory buckets, using two hash functions h1 and h2, and inserted into the one that has currently fewer elements

❑How to know in which bucket to search for a lookup?

Use of Bloom filter for each bucket. The filter is dimensioned one byte per key and assume average number of items in each bucket.

(46)

46

The consequence of

spreading a chain of

entries in a bucket

across pages.

Use compaction to

ameliorate the issue.

Compaction in SkimpyStash

(47)

❑ Long/Unpredictable/unbounded lookup time.

Weakness of SkimpyStach

(48)

48

Design III: BufferHash using Equally-sized Levels

Anand et al., Cheap and Large CAMs for High Performance Data-Intensive Networked Systems in NSDI’10

❑Move entire hash tables to the disk/flash

❑The store consists of multiple levels and each is

organized as a hash table.

(49)

The approach: Buffering insertions

Control the impact of random writes

Maintain small hash table (buffer) in memory

As in-memory buffer gets full, write it to flash

• We call in-flash buffer, incarnation of buffer

Incarnation: In-flash

Buffer: In-memory

DRAM Flash SSD

(50)

50

Two-level Memory Hierarchy

DRAM

Flash

Buffer

Incarnation table

Incarnation

1

2

3

4 Net hash table is: buffer + all incarnations

Oldest

incarnation Latest

incarnation

(51)

Lookups are impacted due to buffers

DRAM

Flash

Buffer

Incarnation table

Lookup key

In-flash

look ups

Multiple in-flash lookups. Can we limit to only one?

➔ Use Bloom Filters

4 3 2 1

(52)

52

Bloom filters for optimizing lookups

DRAM

Flash

Buffer

Incarnation table

Lookup key

Bloom filters

In-memory

look ups

False positive!

Configure carefully!

4 3 2 1

2 GB Bloom filters for 32 GB Flash for false positive rate < 0.01!

(53)

Update: naïve approach

DRAM

Flash

Buffer

Incarnation table

Bloom filters

Update key

Expensive

random writes

Discard this naïve approach

4 3 2 1

(54)

54

Lazy updates

DRAM

Flash

Buffer

Incarnation table

Bloom filters

Update key

Insert key

4 3 2 1

Lookups check latest incarnations first

Key, new value

Key, old value

(55)

Issues with using one buffer

Single buffer in DRAM

• For all operations and eviction policies

High worst case insert

latency

• Few seconds for 1 GB buffer

• New lookups stall

DRAM

Flash

Buffer

Incarnation table

Bloom filters

4 3 2 1

(56)

56

BufferHash: Putting it all together

Multiple BufferHash stores (sharding) to limit each store’s size.

Multiple incarnations per buffer in flash

One in-memory Bloom filter per incarnation

DRAM

Flash

Buffer 1 Buffer K

. .

Net hash table = all buffers + all incarnations

(57)

Weaknesses of BufferHash

Excessively large number of (incarnations) levels makes BF less effective.

Searching in individual incarnations is not efficient.

(58)

58

Design IV: SILT with Levels of

Dramatically-different Sizes

Read

Amplification

Memory overhead (bytes/entry) FAWN-DS BufferHash

SkimpyStash

SILT

Hyeontaek Lim et al, SILT: A Memory-Efficient, High-Performance Key-Value Store, in SOSP’11.

(59)

Seesaw Game?

Memory efficiency High performance

FAWN-DS BufferHash SkimpyStash

How can we improve?

(60)

60

SILT Sorted Index (Memory efficient)

SILT Log Index (Write friendly)

Solution Preview: (1) Three Stores with

(2) New Index Data Structures

Memory Flash SILT Filter

Inserts only go to Log Data are moved in background

Queries look up stores in sequence (from new to old)

(61)

LogStore: No Control over Data Layout

6.5+ bytes/entry 1

Memory overhead Write amplification

Inserted entries are appended

On-flash log

Memory Flash Still need pointers:

size ≥ log N bits/entry SILT Log Index (6.5+ B/entry)

(Older) (Newer)

Naive Hashtable (48+ B/entry)

(62)

62

LogStore: Using Cuckoo Hash to Embed Buckets into HT Directory

How to find the alternative slot for displacement by storing hash index in the tag?

(63)

HashStore: Remove in-memory HT (or the index)

HashStore saves memory over LogStore by eliminating the offsets and

reordering the on- flash (key,value) items from insertion order to hash order.

(64)

64

SortedStore: Space-Optimized Layout

On-flash sorted array

Memory Flash SILT Sorted Index (0.4 B/entry)

Need to perform bulk- insert to amortize cost To merge HashStore entries into the SortedStore, SILT must

generate a new SortedStore

(65)

0.7 bytes/entry Memory overhead

Read

amplification Write amplification

SILT’s Design (Recap)

On-flash sorted array SILT Sorted Index

On-flash log

SILT Log Index

On-flash hashtables SILT Filter

Merge Conversion

(66)

66

Any Issue with SILT?

❑

SILT provides both memory-efficient and high-

performance key-value store

➢

Multi-store approach

➢

Entropy-coded tries

➢

Partial-key cuckoo hashing

❑

The weakness: write amplification is way too

high!

(67)

Current KV storage systems have one or more of the issues:

(1)

very high data write amplifications;

(2)

large index set; and

(3)

dramatic degradation of read performance with overspill

index out of the memory.

LSM-trie:

(1) substantially reduces metadata for locating items,

(2) reduces write amplification by an order of magnitude,

Design V-A: the LSM-trie Store w/ Much Reduced WA

Wu et al, LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data, in USENIX ATC’15

(68)

68

LSM-trie: A New Level Growth Pattern

(69)

LSM-trie: Minimize Write Amplification

❑

To enable linear growth pattern, the SSTables in one

column of sub-levels of a level must have the same key

range.

❑

KV items are hashed into and organized in the store.

❑

160b-Hash key is generated with SHA-1 for uniform

distribution.

(70)

70

The Trie Structure

(71)

Compaction in LSM-trie

(72)

72

How about out-of-core the Bloom Filters?

❑ To scale the store to very large size in terms of both capacity

and KV-item count (e.g, a 10 TB store containing 100 billion

of 100-byte KV items). A big challenge on designing such a

large-scale store is the management of its metadata that often

have to be out of core (the DRAM).

❑ LSM-trie hashes keys into buckets within each SSTable

(Htable).

(73)

Load on the buckets in an Htable may not be balanced

(74)

74

Load on the buckets in an Htable not balanced.

(75)

Balance the load using item migration

❑ How to determine if an item has been migrated?

(76)

76

The load is balanced!

(77)

A Summary of the Use of 160b Hashkey

(78)

78

BF is clustered for at most one access at each level

(79)

Prototyped LSM-trie

❑

32MB HTables and an amplification factor (AF) of 8.

❑

The store has five levels. In the first four levels, LSM-trie uses

both linear and exponential growth pattern.

❑

All the Bloom filters for the first 32 sub-levels are of 4.5 GB,

assuming a 64B average item size and 16 bit Bloom filter per

key. Adding metadata about item migration within individual

HTables (up to 0:5 GB), LSM-trie needs up to only 5GB

memory to hold all necessary metadata

❑

At the fifth level, which is the last level, LSM-trie uses only

linear growth pattern. As one sub-level at this level has a

capacity of 128 G, it needs 8 such sub-levels for the store to

reach 1 TB, and 80 such sub-levels to reach 10 TB.

❑

LSM-trie uses 16-bit-per-item Bloom filters, the false positive

rate is only about 5% even for a 112-sub-level 10 TB KV store.

(80)

80

Write Throughput

(81)

Write Amplification

(82)

82

Read Throughput

(83)

Summary

LSM-trie is designed to manage a large set of small data.

It reduces the write-amplification by an order of magnitude.

It delivers high throughput even with out-of-core metadata.

The LSM-trie source code can be downloaded at:

https://github.com/wuxb45/lsm-trie-release

(84)

84

Address root cause of write amplification:

To maintain sorted non-overlapping SStables in each

level, the store has to rewrite data to the same level

multiple times

Design V-B: PebblesDB Supporting Range Search

Raju et al, PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees, in SOSP’17

(85)

How about removing the non-overlapping requirement?

Partially Sorted Levels

Read can be too slow!

Concrete boundaries (guards) to group overlapping SSTables

(86)

86

How about removing the non-overlapping requirement?

Writes Data Only Once per Level

Concrete boundaries (guards) to group overlapping SSTables

(87)

Compactions in PebblesDB

(88)

88

Compactions in PebblesDB

(89)

Selecting Guards

Guards are chosen dynamically according to key distribution.

(90)

90

Discussions

➢

Can PebblesDB keep each bucket (between two contiguous

guards) of similar size after change of insert key distribution?

➢

Can we easily select and apply a new set of guards after a key

distribution change?

(91)

➢

A new index data structure for sorted keys with an

asymptotically low cost (O(log L)) (L is the key

length).

➢

It leverages the advantages of three existing data

structures.

○ B+-tree

○ Prefix Tree (Trie)

○ Hash Table

Design VI: The Wormhole in-Memory KV Store Supporting

Range Search

Wu et al.,, "Wormhole: A Fast Ordered Index for In-memory Data

(92)

92

Index for In-memory Large-scale Data Stores

▪ The index can be very big.

– The memory becomes increasing large.

– The indexed data (key-value items) can be small.

– Many billions of keys have to be indexed.

▪ The index operations can be very expensive.

– The working set can be large.

– Cannot always assume very strong locality.

▪ The index needs to be sorted.

– Hash table with O(1) operations often isn’t a choice.

– But we will leverage it to accelerate search in a sorted index.

“The TPC-H queries spend up to 94% (35% on average) and TPC-DS queries

spend up to 77% (45% on average) of their execution time on indexing.”

Kocberber, et al. at IEEE/ACM MICRO’13

92

(93)

B+-tree can be too Expensive

▪ O(log N) search cost (N is the number of keys)

– ~30 key-comparisons for one billion keys.

▪ A key comparison may induce multiple cache misses.

▪ An index search may end up with 50-100 memory accesses.

(94)

94

Prefix Tree Can be Time- and Space-inefficient

▪ O(L) search cost (L is token count in a search key).

– 100+ tokens in a key (e.g, URL)

▪ High space cost due to small node size.

94

(95)

The Dilemma and our Solution

Data

Structure Pros Cons

B+ Tree * Space efficient (large leaf nodes)

* Support of range operations High lookup cost with a large N

Prefix Tree Lookup cost not correlated with N

* High lookup cost even with a moderate L

* Space inefficiency

Hash Table O(1) lookup cost No support of range operations

We will orchestrate B+-tree, prefix tree, and hash table in one index structure that:

– has O(log L) search cost;

– is memory-efficient;

– supports range search.

(96)

96

A Closer Look at B+-tree

The list of leaf nodes:

●All keys in sorted order

●Fast range operations

●Constant lookup cost within each node

The tree of internal nodes:

●O(log N) lookup cost

●Operations for maintaining tree balance

96

(97)

Replace MetaTree with a Hash Table?

However, this doesn’t work:

– No support of insert;

– No full support of range search.

(98)

98

Replace MetaTree with MetaTrie (a Prefix Tree)

▪ Keys-at-its-left < anchor ≤ keys-at-its-right - Example anchors: “Au”, “Jam”, “Jos”

▪ All prefixes of anchors are in MetaTrie.

98

(99)

Search for “James” …….

(100)

100

Search for “Denice” …….

D

100

(101)

Search for “Denice” …….

D

(102)

102

Search for “Julian” …….

u

102

(103)

But the Search Cost is Still O(L)

▪ Two phases in a search:

– Find the longest prefix match (LPM) on the trie [1st phase].

– Reach the right-most leaf of the missing node’s left sibling, or the left- most leaf of the missing node’s right sibling [2nd phase].

u

“Julian”

Binary search on the prefix length (O(L) → O(log L))

Record pointers of left-most and right-most nodes at each trie node (O(L) → O(1))

(104)

104

An Example: Binary Search on the Key with a Hash Table

▪ The anchor is “

Alexander

”

▪ All prefixes of the anchor are inserted to the hash table.

– “A”, “Al”, “Ale”, ... , “Alexander”

▪ Binary search of LPM for key “

Alexandria

”

– Len = 5 (⎡(0+9)/2⎤): “Alexa” exists.

104

(105)

Alexander

”

Alexandria

”

– Len = 5 (⎡(0+9)/2⎤): “Alexa” exists.

– Len = 7 (⎡(5+9)/2⎤): “Alexand” exists

(106)

106

Alexander

”

Alexandria

”

– Len = 5 (⎡(0+9)/2⎤): “Alexa” exists.

– Len = 7 (⎡(5+9)/2⎤): “Alexand” exists

– Len = 8 (⎡(7+9)/2⎤): “Alexandr” does not exist.

– so LPM is “Alexand” of length 7.

106

(107)

Wormhole:

Replace MetaTree with MetaTrieHT (a Prefix on a Hash Table)

(108)

108

Search for “Jacob” …….

▪ Search “Ja”

▪ But “Jac” is not in the table.

▪ Go to Jam’s left-most leaf node, which is Node 3

▪ Go to the node’s left sibling, which is Node 2

▪ ₁₀₈Search for “Jacob” in Node 2.

(109)

A Recap of Wormhole

▪ Its lookup cost is O(log L)

– Asymptotically (much) better than O(log N) and O(L)

▪ Its space cost is comparable to B+ tree

– One anchor per leaf node

▪ Efficient support of range Search

▪ Conduct split/Merge of leaf nodes (for insertion/deletion)

– Same as B+-tree (But do not need to maintain tree balance) – Efficient concurrency control (see the paper)

(110)

110

Performance Evaluation

▪ The workload

– Amazon reviews metadata (143M keys)

– URLs from (“http://zwischenruf.at/?p=1012”) (192M keys)

– Randomly generated keys of different sizes (8B, 16B, ... 1024B)

▪ The server

– 16-core Intel Xeon E5-2697A, 40MB shared LLC, 256GB DRAM@2400MHz

▪ Indexes in comparison

– Skiplist – B+-tree – Masstree

– Cuckoo Hash Table

110

(111)

Lookup Performance

● At least 1.6x higher

● Up to 5x higher

(112)

112

Compare to Cuckoo Hash Table (w/ Point Search)

ONLY 2x ~ 3x slower

112

(113)

A Summary

A new index data structure for sorted keys with an asymptotically low cost of O(log L).

A well optimized implementation delivers throughput multiple times higher.

A promising data structure for fine-grain-indexed big data store.

113