Data Structures for Big Data: Bloom Filter. Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014.

(1)

Data Structures for Big Data:

Bloom Filter

Vinicius Vielmo Cogo

(2)

is relative

is not defined by a specific number of TB, PB, EB is when it becomes big for you

(3)

Data Structures for Big Data

Traditional DSs are subject to the same problems

e.g., lists, trees

(e.g., YARN, NoSQL)

or

(e.g., index, metadata)

(4)

Outline

Bloom Filter Use Cases

Implementations Other Filters

(5)

Bloom Filter

Membership testing

(6)

Bloom Filter

City

Coimbra Leiria

(7)

Bloom Filter

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Index i bf[i] http://billmill.org/bloomfilter-tutorial/ 7 / 30

(8)

Bloom Filter

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur

(9)

Bloom Filter

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur i=4 i=7

(10)

Bloom Filter

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur i=4 i=7

(11)

Bloom Filter

Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0

(12)

Bloom Filter

Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur i=2 i=9 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0

(13)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur i=2 i=9

(14)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur

(15)

Bloom Filter

City Braga Guarda Coimbra Lisboa

(16)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Braga Guarda Coimbra Lisboa Hash Function Fnv Murmur i=10 i=14

(17)

Result:

false

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Braga Guarda Coimbra Lisboa Hash Function Fnv Murmur i=2 i=12

(18)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Braga Guarda Coimbra Lisboa Hash Function Fnv Murmur i=4 i=7

(19)

Result:

true (but it is a false positive)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Braga Guarda Coimbra Lisboa Hash Function Fnv Murmur i=7 i=9

(20)

Bloom Filter

DS proposed by Burton Howard Bloom in 1970 Design principles

Space-efficient

Smaller than the original dataset

Time-efficient

Low latency R/W

O(k), which is much smaller than O(n)

High throughput Probabilistic

E.g., myCollection.mightContain(myObject)

(21)

= Optimal number of hash functions Hash Function Fnv Murmur Important variables

Bloom Filter

= Expected collection size

= Bitmap size

= False positive rate (e.g., 0.0001% or 1 in 1M)

City Coimbra Leiria

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

(22)

Important variables

(23)

Users define two of them (normally n and any other) The other two are calculated with those equations Interesting relations:

Bigger collection ( ) Larger bitmap ( )

Bigger collection ( ) More false positives ( )

Larger bitmap ( Less false positives ( )

Larger bitmap ( ) Less hash functions ( )

Less hash functions ( )

Bloom Filter

(24)

Bloom filter size vs. False positive rate

(25)

Use Cases

Reducing unnecessary disk reads

Client BloomFilter Dataset

RAM Hard Disk

1 2 3 F T T F T 1? 2? 3? necessary read(2) unnecessary read(3) No 2 No F

(26)

Use Cases

Google BigTable, Apache Cassandra and HBase

Reducing disk lookups

Google Chrome

Lookup a list of known malicious URLs

Bitcoin

Get only the transactions relevant to your wallet

(27)

Implementations

-libraries https://code.google.com/p/guava-libraries/ Orestes-Bloomfilter https://github.com/Baqend/Orestes-Bloomfilter java-bloomfilter https://github.com/magnuss/java-bloomfilter java-longfastbloomfilter https://code.google.com/p/java-longfastbloomfilter/

(28)

Other Filters

Counting Bloom filters

Allow deletions (use a 4-bit counter instead of 1 bit)

Buffered Bloom filters

Sub-filters in SSD with buffered R/W exploring bit locality

Quotient and Cascade filters

(29)

Other DSs (and techniques) for Big Data

Locality-sensitive hashing (LSH)

Hashing similar elements into the same bucket with high probability

HyperLogLog for computing cardinality

Counting the number of distinct elements in a collection

Log Structured Merge (LSM) trees

(30)

Thank you!

Vinicius Vielmo Cogo