• No results found

Data Structures for Big Data: Bloom Filter. Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014.

N/A
N/A
Protected

Academic year: 2021

Share "Data Structures for Big Data: Bloom Filter. Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014."

Copied!
30
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Structures for Big Data:

Bloom Filter

Vinicius Vielmo Cogo

(2)

is relative

is not defined by a specific number of TB, PB, EB is when it becomes big for you

(3)

Data Structures for Big Data

Traditional DSs are subject to the same problems

e.g., lists, trees

(e.g., YARN, NoSQL)

or

(e.g., index, metadata)

(4)

Outline

Bloom Filter Use Cases

Implementations Other Filters

(5)

Bloom Filter

Membership testing

(6)

Bloom Filter

City

Coimbra Leiria

(7)

Bloom Filter

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Index i bf[i] http://billmill.org/bloomfilter-tutorial/ 7 / 30

(8)

Bloom Filter

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur

(9)

Bloom Filter

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur i=4 i=7

(10)

Bloom Filter

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur i=4 i=7

(11)

Bloom Filter

Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0

(12)

Bloom Filter

Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur i=2 i=9 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0

(13)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur i=2 i=9

(14)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Coimbra Leiria Hash Function Fnv Murmur

(15)

Bloom Filter

City Braga Guarda Coimbra Lisboa

(16)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Braga Guarda Coimbra Lisboa Hash Function Fnv Murmur i=10 i=14

(17)

Result:

false

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Braga Guarda Coimbra Lisboa Hash Function Fnv Murmur i=2 i=12

(18)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Braga Guarda Coimbra Lisboa Hash Function Fnv Murmur i=4 i=7

(19)

Result:

true (but it is a false positive)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0

Bloom Filter

Index i bf[i] City Braga Guarda Coimbra Lisboa Hash Function Fnv Murmur i=7 i=9

(20)

Bloom Filter

DS proposed by Burton Howard Bloom in 1970 Design principles

Space-efficient

Smaller than the original dataset

Time-efficient

Low latency R/W

O(k), which is much smaller than O(n)

High throughput Probabilistic

E.g., myCollection.mightContain(myObject)

(21)

= Optimal number of hash functions Hash Function Fnv Murmur Important variables

Bloom Filter

= Expected collection size

= Bitmap size

= False positive rate (e.g., 0.0001% or 1 in 1M)

City Coimbra Leiria

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

(22)

Important variables

(23)

Users define two of them (normally n and any other) The other two are calculated with those equations Interesting relations:

Bigger collection ( ) Larger bitmap ( )

Bigger collection ( ) More false positives ( )

Larger bitmap ( Less false positives ( )

Larger bitmap ( ) Less hash functions ( )

Less hash functions ( )

Bloom Filter

(24)

Bloom filter size vs. False positive rate

(25)

Use Cases

Reducing unnecessary disk reads

Client BloomFilter Dataset

RAM Hard Disk

1 2 3 F T T F T 1? 2? 3? necessary read(2) unnecessary read(3) No 2 No F

(26)

Use Cases

Google BigTable, Apache Cassandra and HBase

Reducing disk lookups

Google Chrome

Lookup a list of known malicious URLs

Bitcoin

Get only the transactions relevant to your wallet

(27)

Implementations

-libraries https://code.google.com/p/guava-libraries/ Orestes-Bloomfilter https://github.com/Baqend/Orestes-Bloomfilter java-bloomfilter https://github.com/magnuss/java-bloomfilter java-longfastbloomfilter https://code.google.com/p/java-longfastbloomfilter/

(28)

Other Filters

Counting Bloom filters

Allow deletions (use a 4-bit counter instead of 1 bit)

Buffered Bloom filters

Sub-filters in SSD with buffered R/W exploring bit locality

Quotient and Cascade filters

(29)

Other DSs (and techniques) for Big Data

Locality-sensitive hashing (LSH)

Hashing similar elements into the same bucket with high probability

HyperLogLog for computing cardinality

Counting the number of distinct elements in a collection

Log Structured Merge (LSM) trees

(30)

Thank you!

Vinicius Vielmo Cogo

References

Related documents

Enterprise Vault Server Host name: ev.company.internal Internet https://owa.company.com/exchange https://owa.company.com/enterprisevault Exchange 2003 Server https://owa.company

The victim on this occasion was a woman who, with her husband and children, had lived for years in the compound of the house adjoining our home in Naini Tai This woman, in

ing from the first third; PS 2, spine starting from the middle part; PS 3, spine starting from the last third of the penis; AS 1, short, straight spine; AS 2, short, bowed spine; AS

There exist a variety of libraries providing basic linear algebra data structures and operations (called kernels subsequently), and implementations of iterative solvers.. We will list

To graph the PPF you can choose a production plan for Good 1 and find the amount of Good 2 that you can produce using the quantity of labor not used in the production of Good

• Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions • Fast Search in Hamming Space with Multi-Index Hashing. • b-Bit

ADT A01 Admit/Visit notification Functional Description_BI_HL7v2.5_external.pdf ADT A02 Transfer a patient Functional Description_BI_HL7v2.5_external.pdf ADT A03 Discharge/End

A letter dated: Mount Lebanon, January 24, 1880; and signed: Ordeal, was written in response to an article “Experiences with the Spirit Enemies of Spiritualism.” The author