Probabilistic Deduplication for Cluster-Based Storage Systems

(1)

Probabilistic Deduplication for Cluster-Based Storage Systems

Davide Frey, Anne-Marie Kermarrec, Konstantinos Kloudas

(2)

Motivation

Volume of data stored increases exponentially.

Provided services are highly dependent on data.

(3)

Motivation

Traditional solutions combine data in tarballs and store them on tape.

Pros: cost efficient.

Cons: low throughput.

                   Tape        Disk
Acquisition cost   $407,000    $1,620,000
Operational cost   $205,000    $573,000
Total cost         $612,000    $2,193,000

* Source: www.backupworks.com

(4)

Deduplication

Store data only once and replace duplicates with references.

(10)

Challenges

Single-node deduplication systems.
- Compact indexing structures.
- Efficient duplicate detection.

Cluster-based solutions.
- Single-machine tradeoffs.
- Deduplication vs Load balancing.

We focus on Cluster-based Deduplication.

(11)

Example: Deduplication Vs Load Balancing

[Diagram: Clients, the Coordinator, and Storage Nodes A, B, C, D]

A client wants to store a file.

(12)

Example: Deduplication Vs Load Balancing

The client sends the file to the Coordinator.

(13)

Example: Deduplication Vs Load Balancing

Overlap: A 10%, B 30%, C 60%, D 0%

The Coordinator computes the overlap between the contents of the file and those of each Storage Node.

(14)

Example: Deduplication Vs Load Balancing

Overlap: A 10%, B 30%, C 60%, D 0%

To maximize DEDUPLICATION, the new file should go to node C.

(15)

Example: Deduplication Vs Load Balancing

Overlap: A 10%, B 30%, C 60%, D 0%

To achieve LOAD BALANCING, the new file should go to node D.

(18)

Goal: Scalable Cluster Deduplication.

Load Balancing.
- Minimize: [formula]
- Ideally, equal to 1.

Good Data Deduplication.
- Maximize: [formula]
- Ideally, deduplication of a single-node system.

Scalability.
- Good Throughput.
- Minimize CPU/Memory usage at Coordinator.

(19)

PRODUCK architecture

Client
- Split the file into chunks of data.
- Store and retrieve data.

Storage Nodes
- Store the chunks.
- Provide directory services.

Coordinator
- Assign chunks to nodes.
- Keep the system load balanced.

(20)

Client: chunking

Chunks:
- use content-based chunking techniques.
- basic deduplication unit.

Super-chunks:
- group of consecutive chunks.
- basic routing and storage unit.
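The two units above can be sketched as follows. This is only an illustration under stated assumptions, not the chunker used by PRODUCK: the polynomial rolling hash, the constants `WINDOW`, `MASK`, `MIN_CHUNK`, `MAX_CHUNK`, and the fixed number of chunks per super-chunk are all hypothetical choices made for the example.

```python
# Illustrative content-based chunking (all constants are assumptions).
WINDOW = 16            # bytes covered by the rolling hash
MASK = (1 << 6) - 1    # a boundary is declared when the low 6 hash bits are 0
MIN_CHUNK = 32         # never cut before this many bytes (must be >= WINDOW)
MAX_CHUNK = 256        # always cut at this many bytes
BASE = 257
MOD = (1 << 31) - 1


def chunk(data: bytes) -> list:
    """Split data at positions chosen by a rolling hash of the last WINDOW bytes."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: drop the oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter than MIN_CHUNK
    return chunks


def super_chunks(chunks: list, per_group: int = 4) -> list:
    """Group consecutive chunks; the super-chunk is the routing and storage unit."""
    return [chunks[i:i + per_group] for i in range(0, len(chunks), per_group)]
```

Because a boundary depends only on the last `WINDOW` bytes, identical content tends to produce identical chunks even when it appears at a different offset in another file, which is what makes the chunk a useful deduplication unit.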

(22)

Client: chunking

Organize the chunks in super-chunks.

(25)

Coordinator: goals

Estimate the overlap between a super-chunk and the chunks of a given node.
- Maximize deduplication.

Equally distribute storage load among nodes.
- Guarantee a load balanced system.

(26)

Coordinator: our contributions

Novel chunk overlap estimation.
- Based on probabilistic counting (PCSA) [Flajolet et al. 1985, Michel et al. 2006].
- Never used before in storage systems.

Novel load balancing mechanism.
- Operating at chunk-level granularity.
- Improving co-localization of duplicate chunks.

(27)

Coordinator: Overlap Estimation

Main observation:
- Do not need the exact matches.
- Need only an estimation of the size of the overlap.

PCSA permits:
- Compact set descriptors.
- Accurate intersection estimation.
- Computationally efficient.

(30)

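Coordinator: Overlap Estimation

Original Set of Chunks: Chunk1, Chunk2, Chunk3, Chunk4, Chunk5

hash() of each chunk, bit positions 7 (left) to 0 (right):

7 6 5 4 3 2 1 0
1 0 1 1 1 0 0 1
1 0 1 1 1 0 1 1
1 0 1 1 1 0 0 0
1 0 1 1 1 0 1 0
1 0 1 1 1 0 0 0

p(y) = min(bit(y, k)), i.e. the position k of the lowest 1 bit of the hash y.

BITMAP (a 1 at every position p(y) that occurs):

0 1 2 3 4 5 6 7
1 1 0 1 0 0 0 0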

(31)

Coordinator: Overlap Estimation

Original Set of Chunks: Chunk1, Chunk2, Chunk3, Chunk4, Chunk5

hash(), p(y) = min(bit(y, k))

BITMAP:

0 1 2 3 4 5 6 7
1 1 0 1 0 0 0 0

INTUITION

Each hashed chunk sets bitmap[0] with probability 1/2, bitmap[1] with probability 1/4, bitmap[2] with probability 1/8, …

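The bitmap construction on these slides can be reproduced with a few lines of code. The sketch below reads p(y) as the position of the least-significant 1 bit of the hash, which is the interpretation consistent with the bitmap shown; the function names are hypothetical, and the five 8-bit values are the ones printed on the slide (which chunk produced which value is not stated).

```python
def rho(y: int) -> int:
    """Position of the least-significant 1 bit of y (assumes y > 0)."""
    return (y & -y).bit_length() - 1


def build_bitmap(hashes, width=8):
    """Set bitmap[rho(y)] = 1 for every hash value y."""
    bitmap = [0] * width
    for y in hashes:
        bitmap[rho(y)] = 1
    return bitmap


# The five hash values shown on the slide, written with bit 7 on the left.
slide_hashes = [0b10111001, 0b10111011, 0b10111000, 0b10111010, 0b10111000]
bitmap = build_bitmap(slide_hashes)  # [1, 1, 0, 1, 0, 0, 0, 0]
```

Half of all uniformly distributed hashes end in a 1 bit, a quarter end in `10`, an eighth in `100`, and so on, so the position of the first 0 in the bitmap grows roughly with the logarithm of the number of distinct chunks.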
(32)

Coordinator: Overlap Estimation

Intersection Cardinality Estimation?

Union Cardinality Estimation?

position:       0 1 2 3 4 5 6 7
BITMAP(A):      1 1 0 1 1 0 0 0
BITMAP(B):      0 1 0 0 1 1 0 0
BITMAP(A V B):  1 1 0 1 1 1 0 0   (BitwiseOR)

(33)

Coordinator: Overlap Estimation

PCSA set cardinality estimation.

Set intersection estimation.
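The formulas on this slide did not survive extraction, so the following is a sketch of the standard PCSA estimator from Flajolet and Martin rather than a transcription of the slide: with m bitmaps and R_j the position of the lowest 0 bit in bitmap j, the cardinality is estimated as (m / 0.77351) * 2^(mean of R_j). The intersection is derived here by inclusion-exclusion, |A ∩ B| ≈ |A| + |B| - |A ∪ B|, with the union sketch obtained by bitwise OR; that this is exactly how PRODUCK combines the estimates is an assumption. The hash function, the number of bitmaps, and all function names are illustrative.

```python
import hashlib

PHI = 0.77351        # correction constant from Flajolet and Martin
NUM_BITMAPS = 256    # assumption: more bitmaps give a lower estimation error
BITMAP_BITS = 32


def _hash64(item: bytes) -> int:
    return int.from_bytes(hashlib.sha1(item).digest()[:8], "big")


def _rho(w: int) -> int:
    """Position of the lowest 1 bit of w, capped at the bitmap width."""
    if w == 0:
        return BITMAP_BITS - 1
    return min((w & -w).bit_length() - 1, BITMAP_BITS - 1)


def make_sketch(items) -> list:
    """One integer per bitmap; the hash selects the bitmap and the bit to set."""
    bitmaps = [0] * NUM_BITMAPS
    for item in items:
        h = _hash64(item)
        bitmaps[h % NUM_BITMAPS] |= 1 << _rho(h // NUM_BITMAPS)
    return bitmaps


def _lowest_zero(bitmap: int) -> int:
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r


def estimate(bitmaps) -> float:
    mean_r = sum(_lowest_zero(b) for b in bitmaps) / len(bitmaps)
    return len(bitmaps) / PHI * 2 ** mean_r


def union(a, b) -> list:
    """The sketch of A ∪ B is the bitwise OR of the two sketches."""
    return [x | y for x, y in zip(a, b)]


def estimate_intersection(a, b) -> float:
    """Inclusion-exclusion on the three estimates, clamped at zero."""
    return max(0.0, estimate(a) + estimate(b) - estimate(union(a, b)))
```

In this reading, a Coordinator would keep one such sketch per Storage Node and compare it with the sketch of an incoming super-chunk, so the cost per comparison depends on the sketch size and not on the number of chunks stored.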

(36)

Load Balancing

Existing solution: choose Storage Nodes that do not exceed the average load by a percentage threshold.

Problems
- Too aggressive, especially when little data is stored in the system.

(37)

Bucket-based storage quota management.

- Measure storage space in fixed-size buckets.
- Coordinator grants buckets to nodes one by one.
- No node can exceed the least loaded by more than a maximum allowed bucket difference.

(38)

Bucket-based storage quota management.

[Animation: Buckets being filled and requested]

"Can I get a new Bucket?"
"Yes, you can."

"Can I get a new Bucket?"
"NO you cannot!"

Searching for the second biggest overlap.
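The bucket dialogue above can be sketched in code. This is a minimal illustration, not the PRODUCK implementation: the class and method names are hypothetical, every node is assumed to start with one granted bucket, and each super-chunk is assumed to fit in a single bucket.

```python
class BucketCoordinator:
    """Grants fixed-size buckets while bounding the gap to the least-loaded node."""

    def __init__(self, nodes, bucket_size, max_bucket_diff):
        self.bucket_size = bucket_size
        self.max_bucket_diff = max_bucket_diff
        self.granted = {n: 1 for n in nodes}  # assumption: one bucket each at start
        self.used = {n: 0 for n in nodes}

    def _can_grant(self, node):
        # "Can I get a new Bucket?" Only if the node would not exceed the
        # least-loaded node by more than the maximum allowed bucket difference.
        least = min(self.granted.values())
        return self.granted[node] + 1 - least <= self.max_bucket_diff

    def _try_store(self, node, size):
        if self.used[node] + size > self.granted[node] * self.bucket_size:
            if not self._can_grant(node):
                return False              # "NO you cannot!"
            self.granted[node] += 1       # "Yes, you can."
        self.used[node] += size
        return True

    def place(self, size, overlaps):
        """Try nodes by decreasing overlap; fall back to the next biggest overlap."""
        for node in sorted(overlaps, key=overlaps.get, reverse=True):
            if self._try_store(node, size):
                return node
        return None


coordinator = BucketCoordinator(["A", "B", "C", "D"], bucket_size=100, max_bucket_diff=1)
overlaps = {"A": 0.10, "B": 0.30, "C": 0.60, "D": 0.0}
placements = [coordinator.place(60, overlaps) for _ in range(4)]
# placements == ["C", "C", "C", "B"]: C is refused a third bucket, so B is used.
```

Because the limit is expressed in buckets rather than as a percentage of the average load, it stays meaningful when the system holds very little data, which is the weakness noted for the threshold-based solution.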

(48)

Contribution Summary

Novel chunk overlap estimation.
- Based on probabilistic counting (PCSA) [Flajolet et al. 1985, Michel et al. 2006].
- Never used before in storage systems.

Novel load balancing mechanism.
- Operating at chunk-level granularity.
- Improving co-localization of duplicate chunks.

(49)

Evaluation: Datasets

2 real-world workloads: Wikipedia and Images.

2 competitors [Dong et al. 2011]:
- MinHash
- BloomFilter

Figure 3: Message exchanges for storing a file.

… the actual data to the selected StorageNode in a SuperChunk message. When the transmission of the superchunk is over, the StorageNode that received it sends a SuperChunkAck message to the node responsible for the file. The SuperChunkAck contains the checksum of the data in the superchunk and the corresponding identifier. This enables the responsible node to record which StorageNode stored which superchunk, and allows it to check the integrity of the data received by the StorageNode. If something went wrong, the responsible node requests the Client to resend the data to the StorageNode.

The above process is repeated for all the superchunks in the file. When the responsible node has received acknowledgements for all the superchunks, it sends an EOF message to the Coordinator, which then forwards it to the Client.

3.2 Recovering a File

The process of reading a file is also initiated by a user request to the client. As depicted in Figure 4, the client reacts to this request by contacting the Coordinator with a GetFileReq message, specifying the identifier of the file to retrieve. The Coordinator looks up the responsible node for the file and forwards the GetFileReq message to it. The responsible node replies to the client by providing the identifier, checksum, and StorageNode for each superchunk in the file in a GetFileResp message. The client downloads each superchunk from the corresponding StorageNode and uses the checksums received from the responsible node to verify their integrity.
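The client side of this retrieval can be sketched as follows. The excerpt does not name the checksum algorithm or the message layout, so SHA-1, the `SuperChunkEntry` record, and the `download` callback are all assumptions made for illustration; the sketch also assumes that a file is the concatenation of its superchunks in the order listed in the GetFileResp.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SuperChunkEntry:
    """One GetFileResp entry (hypothetical layout)."""
    identifier: str
    checksum: str
    storage_node: str


def checksum(data: bytes) -> str:
    # Assumption: SHA-1; the excerpt does not specify the checksum function.
    return hashlib.sha1(data).hexdigest()


def retrieve_file(entries, download) -> bytes:
    """Download every superchunk and verify it against the recorded checksum."""
    parts = []
    for entry in entries:
        data = download(entry.storage_node, entry.identifier)
        if checksum(data) != entry.checksum:
            raise ValueError(f"integrity check failed for {entry.identifier}")
        parts.append(data)
    return b"".join(parts)
```

Here `download(storage_node, identifier)` stands in for whatever transport fetches a superchunk from a StorageNode; in a simulator it could be a dictionary lookup.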

4. EXPERIMENTAL METHODOLOGY

In this section we present our experimental set-up: the datasets we used, the competitors we compared Produck against, and the metrics we evaluated. Since we focus on evaluating deduplication efficiency and load balancing with respect to storage, we evaluate Produck through simulations to avoid any networking side-effects. Yet, we built our simulator so that the transition to a real implementation only consists in changing its communication primitives from in-memory transactions to network messages.

Figure 4: Message exchanges for retrieving a file.

4.1 Datasets

We evaluate Produck and its competitors using two real-world workloads. The first is publicly available for download from [1] and consists of 16 full snapshots of the English version of Wikipedia. The oldest dates back to March 2011, while the newest was taken in May 2012. This dataset contains full versions of all articles in Wikipedia and accounts for more than 520GB of data. The second dataset contains images of the environments available for deployment on the servers of the Grid5000 experimental platform [2] and accounts for 142GB. These workloads contain both text and binary data, thus covering many common cases. Table 1 presents more details on these datasets. In particular, the Deduplication Factor is the ratio of the original size of each dataset divided by its size after being deduplicated based on our chunking mechanism.

Dataset     Size (GB)   Deduplication Factor   Data Format
Wikipedia   522         1.96                   HTML
Images      142         4.27                   OS images

Table 1: Dataset Description
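As a quick reading aid for Table 1, the definition of the Deduplication Factor can be inverted to obtain the approximate size of each dataset after deduplication. The figures below are derived from the rounded table values, not reported in the source, and the variable names are illustrative.

```python
# Size (GB) and Deduplication Factor as listed in Table 1.
datasets = {"Wikipedia": (522, 1.96), "Images": (142, 4.27)}

# Deduplication Factor = original size / deduplicated size, so:
deduplicated_gb = {
    name: round(size / factor, 1) for name, (size, factor) in datasets.items()
}
# deduplicated_gb == {"Wikipedia": 266.3, "Images": 33.3}
```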

4.2 Competitors

Before evaluating the specific aspects of Produck, we compare it against two state-of-the-art cluster-based deduplication storage systems presented in [6]: BloomFilter, a stateful strategy, and MinHash, a stateless one. The latter is used in a commercial product [6], thus representing the current state of commercial cluster-based deduplication systems. As in Produck, both these strategies split files into chunks and superchunks using content-based chunking with the hash values of chunks serving as their signatures. The difference between them lies in the way each strategy selects the node that should store each superchunk.

(50)

Evaluation: Competitors

MinHash:
- Use the minimum hash from a super-chunk as its fingerprint.
- Assign super-chunks to bins using the mod(# bins) operator.
- Initially assign bins to nodes randomly and re-assign bins to nodes when unbalanced.
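The MinHash routing steps listed above can be sketched as follows. Chunk hashes are modeled as plain integers, the function names are hypothetical, and the re-assignment of bins when the system becomes unbalanced is left out of this sketch.

```python
import random


def minhash_fingerprint(chunk_hashes) -> int:
    """The smallest chunk hash serves as the super-chunk's fingerprint."""
    return min(chunk_hashes)


def assign_bin(chunk_hashes, num_bins: int) -> int:
    return minhash_fingerprint(chunk_hashes) % num_bins


def initial_bin_map(num_bins: int, nodes, seed: int = 0) -> dict:
    """Initial random assignment of bins to nodes (rebalancing not shown)."""
    rng = random.Random(seed)
    return {b: rng.choice(nodes) for b in range(num_bins)}


def route(chunk_hashes, bin_map: dict):
    return bin_map[assign_bin(chunk_hashes, len(bin_map))]
```

The strategy is stateless in the sense that routing needs no knowledge of what each node already stores: two super-chunks that share their minimum hash, which is likely when they share most chunks, always land in the same bin.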

(51)

Evaluation: Competitors

BloomFilter:
- The Coordinator keeps a Bloom filter for each one of the Storage Nodes.
- If a node deviates more than 5% from the average load, it is considered overloaded.
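A minimal sketch of this stateful strategy is shown below. The slide only states that there is one Bloom filter per Storage Node and a 5% overload threshold; choosing the non-overloaded node whose filter matches the most chunks is an assumption about how the two pieces fit together, and the filter size, hash construction, and names are illustrative.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter backed by a Python integer used as a bit array."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: bytes) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))


def route(superchunk, filters: dict, loads: dict, threshold: float = 0.05):
    """Pick the non-overloaded node whose filter matches the most chunks."""
    average = sum(loads.values()) / len(loads)
    candidates = [n for n in filters if loads[n] <= average * (1 + threshold)]
    return max(candidates, key=lambda n: sum(c in filters[n] for c in superchunk))
```

The least-loaded node is never above the average, so the candidate list is never empty. Bloom filters can report false positives, so the match count is an upper bound on the true overlap.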

(52)

Evaluation: Metrics

Deduplication:

Load balancing:

Overall:

ED and TD are normalized to the performance of a single-node system to ease comparison.

Throughput:

(53)

Evaluation: Effective Deduplication

[Plots: Wikipedia, Images]

32 nodes: Wikipedia 7%, Images 16%
64 nodes: Wikipedia 16%, Images 21%

(55)

Evaluation: Throughput

[Plots: Wikipedia, Images]

Memory: 64KB for Produck
9.6 bits/chunk or 168GB for 140TB/node

32 nodes: Wikipedia 11X, Images 13X
64 nodes: Wikipedia 16X, Images 21X

(56)

Evaluation: Load Balancing

[Plot: Load Balancing]

(57)

References
