Probabilistic Deduplication for Cluster-Based Storage Systems

(1)

Probabilistic Deduplication for Cluster-Based Storage Systems

Davide Frey, Anne-Marie Kermarrec, Konstantinos Kloudas

(2)

Motivation

• Volume of data stored increases exponentially.

• Provided services are highly dependent on data.

(3)

Motivation

• Traditional solutions combine data in tarballs and store them on tape.
  – Pros: cost efficient.
  – Cons: low throughput.

                    Tape        Disk
Acquisition cost    $407,000    $1,620,000
Operational cost    $205,000    $573,000
Total cost          $612,000    $2,193,000

* Source: www.backupworks.com

(4)

Deduplication

Store data only once and replace duplicates with references.

(10)

Challenges

• Single-node deduplication systems.
  – Compact indexing structures.
  – Efficient duplicate detection.

• Cluster-based solutions.
  – Single-machine tradeoffs.
  – Deduplication vs. load balancing.

We focus on Cluster-based Deduplication.

(11)

Example: Deduplication vs. Load Balancing

[Figure: Clients, the Coordinator, and Storage Nodes A, B, C, D.]

A client wants to store a file.

(12)

Example: Deduplication vs. Load Balancing

[Figure: Clients, the Coordinator, and Storage Nodes A, B, C, D.]

The client sends the file to the Coordinator.

(13)

Example: Deduplication vs. Load Balancing

[Figure: Clients, the Coordinator, and Storage Nodes with their overlap: A 10%, B 30%, C 60%, D 0%.]

The Coordinator computes the overlap between the contents of the file and those of each Storage Node.

(14)

Example: Deduplication vs. Load Balancing

[Figure: Clients, the Coordinator, and Storage Nodes with their overlap: A 10%, B 30%, C 60%, D 0%.]

To maximize DEDUPLICATION, the new file should go to node C.

(15)

Example: Deduplication vs. Load Balancing

[Figure: Clients, the Coordinator, and Storage Nodes with their overlap: A 10%, B 30%, C 60%, D 0%.]

To achieve LOAD BALANCING, the new file should go to node D.
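The conflict in this example can be sketched in a few lines of Python. The overlap percentages come from the slide; the per-node loads are hypothetical values chosen only so that D is the least loaded node, as the slide implies.

```python
# Overlap between the new file and each Storage Node (from the slide).
overlap = {"A": 0.10, "B": 0.30, "C": 0.60, "D": 0.00}

# Hypothetical stored load per node in GB (an assumption, not from the slides).
load_gb = {"A": 40, "B": 55, "C": 90, "D": 10}

# Pure deduplication: pick the node sharing the most content with the file.
best_for_dedup = max(overlap, key=overlap.get)

# Pure load balancing: pick the node storing the least data.
best_for_balance = min(load_gb, key=load_gb.get)

print(best_for_dedup, best_for_balance)  # C D
```

The two criteria pick different nodes, which is the tradeoff the rest of the talk addresses.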

(16)

Goal: Scalable Cluster Deduplication.

Good Data Deduplication.
• Maximize:
• Ideally, deduplication of a single-node system.

Load Balancing.
• Minimize:
• Ideally, equal to 1.

Good Throughput.
• Minimize CPU/Memory usage at Coordinator.

Scalability.

(19)

PRODUCK architecture

Client
• Split the file into chunks of data.
• Store and retrieve data.

Storage Nodes
• Store the chunks.
• Provide directory services.

Coordinator
• Assign chunks to nodes.
• Keep the system load balanced.

(20)

Client: chunking

• Chunks:
  – use content-based chunking techniques.
  – basic deduplication unit.

• Super-chunks:
  – group of consecutive chunks.
  – basic routing and storage unit.
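A minimal sketch of the two units, assuming a gear-style rolling hash for content-based chunking and a fixed group size for super-chunks. The function names, the mask, the size limits, and the group size are illustrative assumptions; the slides do not specify the chunking algorithm or its parameters.

```python
import random

# Fixed pseudo-random table for the rolling hash (illustrative assumption).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes, mask: int = 0x3FF, min_size: int = 64, max_size: int = 4096):
    """Split data at content-defined boundaries.

    A boundary is declared where the rolling hash has its masked bits all
    zero, so boundaries depend on content rather than on byte offsets.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def super_chunks(chunks, group_size: int = 4):
    """Group consecutive chunks; each group is one routing and storage unit."""
    return [chunks[i:i + group_size] for i in range(0, len(chunks), group_size)]


data = bytes(random.Random(1).getrandbits(8) for _ in range(20000))
pieces = chunk(data)
groups = super_chunks(pieces)
print(len(pieces), len(groups))
```

Because boundaries follow content, inserting bytes near the start of a file shifts only the neighbouring chunks, which is what makes chunks a useful deduplication unit.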

(21)

Client: chunking

[Figure: Organize the chunks in super-chunks.]

(25)

Coordinator: goals

• Estimate the overlap between a super-chunk and the chunks of a given node.
  – Maximize deduplication.

• Equally distribute storage load among nodes.
  – Guarantee a load balanced system.

(26)

Coordinator: our contributions

• Novel chunk overlap estimation.
  – Based on probabilistic counting (PCSA) [Flajolet et al. 1985, Michel et al. 2006].
  – Never used before in storage systems.

• Novel load balancing mechanism.
  – Operating at chunk-level granularity.
  – Improving co-localization of duplicate chunks.

(27)

Coordinator: Overlap Estimation

• Main observation:
  – Do not need the exact matches.
  – Need only an estimation of the size of the overlap.

• PCSA permits:
  – Compact set descriptors.
  – Accurate intersection estimation.
  – Computationally efficient.

(28)

[Figure: Original Set of Chunks: Chunk₁, Chunk₂, Chunk₃, Chunk₄, Chunk₅.]

(29)

Coordinator: Overlap Estimation

[Figure: Original Set of Chunks, Chunk₅; each chunk is passed through hash().]

hash() values (bit positions 7 to 0):

7 6 5 4 3 2 1 0
1 0 1 1 1 0 0 1
1 0 1 1 1 0 1 1
1 0 1 1 1 0 0 0
1 0 1 1 1 0 1 0
1 0 1 1 1 0 0 0

(30)

Coordinator: Overlap Estimation

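[Figure: Original Set of Chunks, Chunk₅; each chunk is passed through hash().]

hash() values (bit positions 7 to 0):

7 6 5 4 3 2 1 0
1 0 1 1 1 0 0 1
1 0 1 1 1 0 1 1
1 0 1 1 1 0 0 0
1 0 1 1 1 0 1 0
1 0 1 1 1 0 0 0

p(y) = min { k ≥ 0 : bit(y, k) = 1 }, the position of the least significant 1-bit of y.

BITMAP
index: 0 1 2 3 4 5 6 7
bit:   1 1 0 1 0 0 0 0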

(31)

Coordinator: Overlap Estimation

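[Figure: Original Set of Chunks, Chunk₅; each chunk is passed through hash().]

p(y) = min { k ≥ 0 : bit(y, k) = 1 }, the position of the least significant 1-bit of y.

BITMAP
index: 0 1 2 3 4 5 6 7
bit:   1 1 0 1 0 0 0 0

INTUITION

P(bitmap[0] = 1) = 1/2
P(bitmap[1] = 1) = 1/4
P(bitmap[2] = 1) = 1/8
…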
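The bitmap construction can be reproduced with the five 8-bit hash values shown on the slides. This is a sketch of the single-bitmap step only; the helper names `rho` and `build_bitmap` are illustrative, and treating an all-zero hash as setting no bit is an assumption.

```python
def rho(y: int) -> int:
    """Position of the least significant 1-bit of y (requires y > 0)."""
    return (y & -y).bit_length() - 1


def build_bitmap(hashes, width: int = 8):
    """Set bitmap[rho(y)] for every hash value y."""
    bitmap = [0] * width
    for y in hashes:
        if y:  # an all-zero hash sets no bit in this sketch
            bitmap[rho(y)] = 1
    return bitmap


# The five hash values from the slide, most significant bit first.
hashes = [0b10111001, 0b10111011, 0b10111000, 0b10111010, 0b10111000]
print(build_bitmap(hashes))  # [1, 1, 0, 1, 0, 0, 0, 0]
```

The result matches the BITMAP on the slide: two hashes end in 1 (position 0), one ends in 10 (position 1), and two end in 1000 (position 3).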

(32)

• Intersection Cardinality Estimation?

• Union Cardinality Estimation?

index:          0 1 2 3 4 5 6 7
BITMAP(A):      1 1 0 1 1 0 0 0
BITMAP(B):      0 1 0 0 1 1 0 0
BITMAP(A ∨ B):  1 1 0 1 1 1 0 0   (bitwise OR)

(33)

PCSA set cardinality estimation.

Set intersection estimation.
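The formulas on this slide did not survive extraction, so the following is only a sketch using the standard Flajolet-Martin single-bitmap estimate (2^R / φ, where R is the index of the lowest 0 bit and φ ≈ 0.77351) and inclusion-exclusion for the intersection. PCSA proper averages R over many bitmaps to reduce variance; that step is omitted here, and whether the talk uses exactly this form is an assumption.

```python
PHI = 0.77351  # Flajolet-Martin correction constant


def union(bm_a, bm_b):
    """Bitmap of A ∪ B is the bitwise OR of the two bitmaps."""
    return [a | b for a, b in zip(bm_a, bm_b)]


def estimate(bitmap) -> float:
    """Single-bitmap cardinality estimate: 2**R / PHI."""
    r = bitmap.index(0) if 0 in bitmap else len(bitmap)
    return 2 ** r / PHI


def intersection_estimate(bm_a, bm_b) -> float:
    """|A ∩ B| = |A| + |B| - |A ∪ B|, applied to the estimates."""
    return estimate(bm_a) + estimate(bm_b) - estimate(union(bm_a, bm_b))


# Bitmaps from the previous slide.
A = [1, 1, 0, 1, 1, 0, 0, 0]
B = [0, 1, 0, 0, 1, 1, 0, 0]
print(union(A, B))  # [1, 1, 0, 1, 1, 1, 0, 0]
print(intersection_estimate(A, B))
```

A single 8-bit bitmap gives a very coarse estimate; the point of the sketch is that the Coordinator only needs bitmaps, not chunk lists, to approximate an overlap.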

(36)

Load Balancing

• Existing solution: choose Storage Nodes that do not exceed the average load by more than a percentage threshold.

Problems

• Too aggressive, especially when little data is stored in the system.
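The following sketch illustrates why a percentage threshold is aggressive on a nearly empty system. The function name, the 5% default, and the load values are assumptions made for illustration.

```python
def eligible_nodes(loads, threshold: float = 0.05):
    """Nodes whose load does not exceed the average by more than `threshold`."""
    avg = sum(loads.values()) / len(loads)
    return [node for node, load in loads.items() if load <= avg * (1 + threshold)]


# Nearly empty system: one stored unit already disqualifies node C.
print(eligible_nodes({"A": 0, "B": 0, "C": 1, "D": 0}))  # ['A', 'B', 'D']

# Well-filled system: the same one-unit difference is tolerated.
print(eligible_nodes({"A": 100, "B": 100, "C": 101, "D": 100}))  # ['A', 'B', 'C', 'D']
```

With little data stored, the node that holds the only duplicates is excluded almost immediately, so deduplication opportunities are lost early on.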

(37)

• Bucket-based storage quota management.
  – Measure storage space in fixed-size buckets.
  – Coordinator grants buckets to nodes one by one.
  – No node can exceed the least loaded by more than a maximum allowed bucket difference.
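A minimal sketch of the bucket rule, assuming the Coordinator tracks one bucket counter per node. The class name, the `pick_node` helper, and the `max_diff` value are hypothetical, and the sketch requests a bucket on every call, whereas a real node would presumably ask only when its current bucket is full.

```python
class BucketQuota:
    """Coordinator-side bookkeeping of buckets granted to each node."""

    def __init__(self, nodes, max_diff: int = 2):
        self.buckets = {node: 0 for node in nodes}
        self.max_diff = max_diff

    def request_bucket(self, node) -> bool:
        """Grant a bucket unless the node would lead the least loaded node
        by more than max_diff buckets."""
        least = min(self.buckets.values())
        if self.buckets[node] + 1 - least > self.max_diff:
            return False
        self.buckets[node] += 1
        return True


def pick_node(nodes_by_overlap, quota):
    """Try nodes in decreasing overlap order; fall back when a request is denied."""
    for node in nodes_by_overlap:
        if quota.request_bucket(node):
            return node
    return None


quota = BucketQuota(["A", "B", "C"], max_diff=2)
print(quota.request_bucket("C"))          # True
print(quota.request_bucket("C"))          # True
print(quota.request_bucket("C"))          # False
print(pick_node(["C", "B", "A"], quota))  # B
```

Unlike a percentage threshold, the allowed gap is an absolute number of buckets, so it behaves the same whether the system is nearly empty or nearly full.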

(38)

• Bucket-based storage quota management.

[Figure: animated example of bucket allocation. A request "Can I get a new Bucket?" is first answered "Yes, you can." A later request is answered "NO you cannot!", and the search continues with the second biggest overlap.]

(48)

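Contribution Summary

• Novel chunk overlap estimation.
  – Based on probabilistic counting (PCSA) [Flajolet et al. 1985, Michel et al. 2006].
  – Never used before in storage systems.

• Novel load balancing mechanism.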

(49)

Evaluation: Datasets

• 2 real-world workloads:

• 2 competitors [Dong et al. 2011]:
  – MinHash
  – BloomFilter

[Figure 3: Message exchanges for storing a file.]

… the actual data to the selected StorageNode in a SuperChunk message. When the transmission of the superchunk is over, the StorageNode that received it sends a SuperChunkAck message to the node responsible for the file. The SuperChunkAck contains the checksum of the data in the superchunk and the corresponding identifier. This enables the responsible node to record which StorageNode stored which superchunk, and allows it to check the integrity of the data received by the StorageNode. If something went wrong, the responsible node requests the Client to resend the data to the StorageNode.

The above process is repeated for all the superchunks in the file. When the responsible node has received acknowledgements for all the superchunks, it sends an EOF message to the Coordinator, which then forwards it to the Client.

3.2 Recovering a File

The process of reading a file is also initiated by a user request to the client. As depicted in Figure 4, the client reacts to this request by contacting the Coordinator with a GetFileReq message, specifying the identifier of the file to retrieve. The Coordinator looks up the responsible node for the file and forwards the GetFileReq message to it. The responsible node replies to the client by providing the identifier, checksum, and StorageNode for each superchunk in the file in a GetFileResp message. The client downloads each superchunk from the corresponding StorageNode and uses the checksums received from the responsible node to verify their integrity.
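The client side of this exchange can be sketched as follows. The dictionary layout of the GetFileResp message, the in-memory `storage` stand-in for the StorageNodes, and the use of SHA-256 as the checksum are all assumptions; the text does not specify the message encoding or the checksum function.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum of a superchunk (SHA-256 here; an assumption)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical in-memory stand-in for the StorageNodes.
storage = {
    "node-1": {"sc-0": b"hello "},
    "node-2": {"sc-1": b"world"},
}

# Hypothetical GetFileResp: identifier, checksum, and StorageNode per superchunk.
get_file_resp = [
    {"id": "sc-0", "checksum": checksum(b"hello "), "node": "node-1"},
    {"id": "sc-1", "checksum": checksum(b"world"), "node": "node-2"},
]


def recover_file(resp, storage) -> bytes:
    """Download each superchunk and verify it against the expected checksum."""
    parts = []
    for entry in resp:
        data = storage[entry["node"]][entry["id"]]
        if checksum(data) != entry["checksum"]:
            raise ValueError(f"integrity check failed for {entry['id']}")
        parts.append(data)
    return b"".join(parts)


print(recover_file(get_file_resp, storage))  # b'hello world'
```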

4. EXPERIMENTAL METHODOLOGY

In this section we present our experimental set-up: the datasets we used, the competitors we compared Produck against, and the metrics we evaluated. Since we focus on evaluating deduplication efficiency and load balancing with respect to storage, we evaluate Produck through simulations to avoid any networking side-effects. Yet, we built our simulator so that the transition to a real implementation only consists in changing its communication primitives from in-memory transactions to network messages.

[Figure 4: Message exchanges for retrieving a file.]

4.1 Datasets

We evaluate Produck and its competitors using two real-world workloads. The first is publicly available for download from [1] and consists of 16 full snapshots of the English version of Wikipedia. The oldest dates back to March 2011, while the newest was taken in May 2012. This dataset contains full versions of all articles in Wikipedia and accounts for more than 520GB of data. The second dataset contains images of the environments available for deployment on the servers of the Grid5000 experimental platform [2] and accounts for 142GB. These workloads contain both text and binary data, thus covering many common cases. Table 1 presents more details on these datasets. In particular, the Deduplication Factor is the ratio of the original size of each dataset divided by its size after being deduplicated based on our chunking mechanism.

Dataset      Size (GB)    Deduplication Factor    Data Format
Wikipedia    522          1.96                    HTML
Images       142          4.27                    OS images

Table 1: Dataset Description
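As a worked example of the Deduplication Factor definition, the deduplicated size of each dataset follows from dividing its original size by the factor in Table 1. The helper name is illustrative, and the resulting sizes are derived here rather than reported in the source.

```python
def deduplicated_size(original_gb: float, dedup_factor: float) -> float:
    """Size after deduplication = original size / Deduplication Factor."""
    return original_gb / dedup_factor


# (size in GB, Deduplication Factor) from Table 1.
datasets = {"Wikipedia": (522, 1.96), "Images": (142, 4.27)}

for name, (size_gb, factor) in datasets.items():
    print(name, round(deduplicated_size(size_gb, factor), 1))
# Wikipedia 266.3
# Images 33.3
```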

4.2 Competitors

Before evaluating the specific aspects of Produck, we compare it against two state-of-the-art cluster-based deduplication storage systems presented in [6]: BloomFilter, a stateful strategy, and MinHash, a stateless one. The latter is used in a commercial product [6], thus representing the current state of commercial cluster-based deduplication systems. As in Produck, both these strategies split files into chunks and superchunks using content-based chunking with the hash values of chunks serving as their signatures. The difference between them lies in the way each strategy selects the node that should store each superchunk.

(50)

Evaluation: Competitors

• MinHash:
  – Use the minimum hash from a super-chunk as its fingerprint.
  – Assign super-chunks to bins using the mod(# bins) operator.
  – Initially assign bins to nodes randomly and re-assign bins to nodes when unbalanced.
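The stateless routing described above can be sketched as follows. The SHA-1 prefix used as the chunk hash, the number of bins, and the helper names are assumptions; re-assigning bins when the load becomes unbalanced is not shown.

```python
import hashlib
import random


def chunk_hash(chunk: bytes) -> int:
    """64-bit integer signature of a chunk (SHA-1 prefix; an assumption)."""
    return int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")


def minhash_bin(super_chunk, num_bins: int) -> int:
    """Fingerprint = minimum chunk hash; bin = fingerprint mod number of bins."""
    return min(chunk_hash(c) for c in super_chunk) % num_bins


def initial_bin_assignment(num_bins: int, nodes, seed: int = 0):
    """Initially assign every bin to a randomly chosen node."""
    rng = random.Random(seed)
    return {b: rng.choice(nodes) for b in range(num_bins)}


NUM_BINS = 16
NODES = ["A", "B", "C", "D"]
bin_to_node = initial_bin_assignment(NUM_BINS, NODES)

sc1 = [b"chunk-1", b"chunk-2", b"chunk-3"]
sc2 = [b"chunk-3", b"chunk-1", b"chunk-2"]  # same chunks, different order

# Both super-chunks share the same minimum hash, so they reach the same node.
print(bin_to_node[minhash_bin(sc1, NUM_BINS)], bin_to_node[minhash_bin(sc2, NUM_BINS)])
```

No per-node state is consulted when routing, which is why the strategy is called stateless: super-chunks sharing their minimum-hash chunk always land in the same bin.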

(51)

Evaluation: Competitors

• BloomFilter:
  – The Coordinator keeps a Bloom filter for each one of the Storage Nodes.
  – If a node deviates more than 5% from the average load, it is considered overloaded.
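A compact sketch of this stateful strategy, assuming the Coordinator routes each super-chunk to the non-overloaded node whose Bloom filter matches the most chunks. The filter size, the number of hash functions, the routing rule, and reading "deviates" as "exceeds the average" are assumptions; the slide states only the per-node filters and the 5% limit.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter over byte strings."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def choose_node(super_chunk, filters, loads, tolerance: float = 0.05):
    """Pick the non-overloaded node whose filter matches the most chunks."""
    avg = sum(loads.values()) / len(loads)
    candidates = [n for n in filters if loads[n] <= avg * (1 + tolerance)]
    candidates = candidates or list(filters)  # fall back if all are overloaded
    return max(candidates, key=lambda n: sum(c in filters[n] for c in super_chunk))


filters = {"A": BloomFilter(), "B": BloomFilter()}
for c in (b"chunk-1", b"chunk-2"):
    filters["B"].add(c)

sc = [b"chunk-1", b"chunk-2", b"chunk-9"]
print(choose_node(sc, filters, {"A": 50, "B": 50}))   # B
print(choose_node(sc, filters, {"A": 10, "B": 100}))  # A
```

The second call shows the load-balancing override: B holds the duplicates but exceeds the average load by more than 5%, so the super-chunk goes to A.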

(52)

Evaluation: Metrics

Deduplication:

Load balancing:

Overall:



ED and TD are normalized to the performance of a single-node system to ease comparison.

Throughput:

(53)

Evaluation: Effective Deduplication

[Figure: two panels, Wikipedia and Images.]

            Wikipedia    Images
32 nodes    7%           16%
64 nodes    16%          21%

(55)

Evaluation: Throughput

[Figure: two panels, Wikipedia and Images.]

            Wikipedia    Images
32 nodes    11X          13X
64 nodes    16X          21X

Memory: 64KB for Produck

9.6 bits/chunk or 168GB for 140TB/node

(56)

Evaluation: Load Balancing

[Figure: Load Balancing.]

(57)