Probabilistic Deduplication for Cluster-Based Storage Systems
Davide Frey, Anne-Marie Kermarrec, Konstantinos Kloudas
Motivation
• Volume of data stored increases exponentially.
• Provided services are highly dependent on data.
• Traditional solutions combine data in tarballs and store them on tape.
  – Pros: cost efficient.
  – Cons: low throughput.
                    Tape        Disk
Acquisition cost    $407,000    $1,620,000
Operational cost    $205,000    $573,000
Total cost          $612,000    $2,193,000
* Source: www.backupworks.com
Deduplication
Store data only once and replace duplicates with references.
Challenges
• Single-node deduplication systems.
  – Compact indexing structures.
  – Efficient duplicate detection.
• Cluster-based solutions.
  – Single-machine tradeoffs.
  – Deduplication vs Load balancing.
We focus on Cluster-based Deduplication
Example: Deduplication vs. Load Balancing
[Figure: Clients, Coordinator, and Storage Nodes A, B, C, D]
• A client wants to store a file.
• The client sends the file to the Coordinator.
• The Coordinator computes the overlap between the contents of the file and those of each Storage Node: A 10%, B 30%, C 60%, D 0%.
• To maximize DEDUPLICATION, the new file should go to node C.
• To achieve LOAD BALANCING, the new file should go to node D.
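The conflict in this example can be shown in a few lines of Python. The overlap values come from the slide; the per-node load figures are invented for illustration (the slides only imply that D is the least loaded node), and the function names are hypothetical.

```python
# Overlap estimates taken from the example above.
overlap = {"A": 0.10, "B": 0.30, "C": 0.60, "D": 0.00}

# Hypothetical stored volume per node in GB (assumption, not from the slides).
load_gb = {"A": 40, "B": 35, "C": 50, "D": 5}

def best_for_deduplication(overlap_by_node):
    """Pick the node that already holds the largest share of the file's content."""
    return max(overlap_by_node, key=overlap_by_node.get)

def best_for_load_balancing(load_by_node):
    """Pick the node that currently stores the least data."""
    return min(load_by_node, key=load_by_node.get)

dedup_choice = best_for_deduplication(overlap)
balance_choice = best_for_load_balancing(load_gb)
print(dedup_choice, balance_choice)
```

The two criteria pick different nodes, which is the tradeoff the Coordinator has to arbitrate.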
Goal: Scalable Cluster Deduplication.
Load Balancing.
• Minimize:
• Ideally, equal to 1.
Good Data Deduplication.
• Maximize:
• Ideally, deduplication of a single-node system.
Good Throughput.
• Minimize CPU/Memory usage at Coordinator.
Scalability.
PRODUCK architecture
Client
• Split the file into chunks of data.
• Store and retrieve data.
Storage Nodes
• Store the chunks.
• Provide directory services.
Coordinator
• Assign chunks to nodes.
• Keep the system load balanced.
Client: chunking
• Chunks:
  – use content-based chunking techniques.
  – basic deduplication unit.
• Super-chunks:
  – group of consecutive chunks.
  – basic routing and storage unit.
• Organize the chunks in super-chunks.
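A minimal sketch of the client-side steps above, assuming a Gear-style rolling hash for content-based chunking. The slides do not say which chunking algorithm PRODUCK uses, so the hash, the size parameters, and the function names here are all illustrative assumptions.

```python
import hashlib

# One pseudo-random 32-bit value per byte value (illustrative Gear table).
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
        for b in range(256)]

def split_into_chunks(data: bytes, avg_bits: int = 6,
                      min_size: int = 16, max_size: int = 512):
    """Cut where the rolling hash of recent bytes has its top avg_bits bits zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # 32-bit Gear hash: bytes older than 32 positions shift out of h.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and h >> (32 - avg_bits) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks

def group_into_super_chunks(chunks, per_super_chunk: int = 4):
    """Group consecutive chunks; the super-chunk is the routing and storage unit."""
    return [chunks[i:i + per_super_chunk]
            for i in range(0, len(chunks), per_super_chunk)]

data = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
chunks = split_into_chunks(data)
super_chunks = group_into_super_chunks(chunks)
print(len(chunks), len(super_chunks))
```

Because boundaries depend on content rather than on byte offsets, an insertion near the start of a file tends to change only nearby chunks instead of shifting every later boundary.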
Coordinator: goals
• Estimate the overlap between a super-chunk and the chunks of a given node.
  – Maximize deduplication.
• Equally distribute storage load among nodes.
  – Guarantee a load balanced system.
Coordinator: our contributions
• Novel chunk overlap estimation.
  – Based on probabilistic counting with stochastic averaging (PCSA) [Flajolet et al. 1985, Michel et al. 2006].
  – Never used before in storage systems.
• Novel load balancing mechanism.
  – Operating at chunk-level granularity.
  – Improving co-localization of duplicate chunks.
Coordinator: Overlap Estimation
• Main observation:
  – Do not need the exact matches.
  – Need only an estimation of the size of the overlap.
• PCSA permits:
  – Compact set descriptors.
  – Accurate intersection estimation.
  – Computationally efficient.
Coordinator: Overlap Estimation
Original Set of Chunks: Chunk1, Chunk2, Chunk3, Chunk4, Chunk5
hash() of the five chunks (bit positions 7 6 5 4 3 2 1 0):
  1 0 1 1 1 0 0 1
  1 0 1 1 1 0 1 1
  1 0 1 1 1 0 0 0
  1 0 1 1 1 0 1 0
  1 0 1 1 1 0 0 0
ρ(y) = min { k : bit(y, k) = 1 }, the position of the least-significant 1 bit of hash y
BITMAP (positions 0 1 2 3 4 5 6 7): 1 1 0 1 0 0 0 0
INTUITION (for a single hashed chunk)
P(bitmap[0] = 1) = 1/2, P(bitmap[1] = 1) = 1/4, P(bitmap[2] = 1) = 1/8, …
Coordinator: Overlap Estimation
• Intersection Cardinality Estimation?
• Union Cardinality Estimation?
BITMAP(A):      1 1 0 1 1 0 0 0
BITMAP(B):      0 1 0 0 1 1 0 0
BITMAP(A ∨ B):  1 1 0 1 1 1 0 0   (bitwise OR; positions 0 1 2 3 4 5 6 7)
Coordinator: Overlap Estimation
• PCSA set cardinality estimation.
• Set intersection estimation.
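The formulas for these two estimations did not survive in the slide text. As a hedged sketch, the code below uses the standard Flajolet-Martin PCSA estimator, n ≈ (m / 0.77351) · 2^(mean R), where R is the index of the lowest 0 bit of each of the m bitmaps. It obtains the union by bitwise OR and the intersection by inclusion-exclusion, |A ∩ B| ≈ |A| + |B| − |A ∪ B|. The number of bitmaps, the use of SHA-1, and all function names are assumptions for illustration, not PRODUCK's actual parameters.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis

def rho(y: int, width: int = 32) -> int:
    """Index of the least-significant 1 bit of y, capped at width - 1."""
    if y == 0:
        return width - 1
    return min((y & -y).bit_length() - 1, width - 1)

def lowest_zero_bit(bitmap: int) -> int:
    """Index of the least-significant 0 bit of a bitmap."""
    return ((bitmap + 1) & ~bitmap).bit_length() - 1

def pcsa_sketch(items, m: int = 1024):
    """Build m bitmaps (stored as ints); each item sets one bit in one bitmap."""
    bitmaps = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")
        bitmaps[h % m] |= 1 << rho(h // m)
    return bitmaps

def estimate(bitmaps) -> float:
    """PCSA cardinality estimate from the mean lowest-zero-bit index."""
    m = len(bitmaps)
    mean_r = sum(lowest_zero_bit(b) for b in bitmaps) / m
    return m / PHI * 2 ** mean_r

def union(a, b):
    """Sketch of the union of two sets: bitwise OR of their bitmaps."""
    return [x | y for x, y in zip(a, b)]

def estimate_overlap(a, b) -> float:
    """Intersection size by inclusion-exclusion, clamped at zero."""
    return max(0.0, estimate(a) + estimate(b) - estimate(union(a, b)))

# Two sets of 50,000 chunk identifiers sharing 25,000 elements.
a = pcsa_sketch(b"chunk-%d" % i for i in range(50_000))
b = pcsa_sketch(b"chunk-%d" % i for i in range(25_000, 75_000))
print(round(estimate(a)), round(estimate_overlap(a, b)))  # approximate values
```

Only the bitmaps need to be kept per node, so the descriptor size is fixed regardless of how many chunks the node stores. The estimates are approximate, with a relative error that shrinks as the number of bitmaps grows.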
Load Balancing
• Existing solution: choose Storage Nodes that do not exceed the average load by a percentage threshold.
Problems
• Too aggressive, especially when little data is stored in the system.
• Bucket-based storage quota management.
  – Measure storage space in fixed-size buckets.
  – Coordinator grants buckets to nodes one by one.
  – No node can exceed the least loaded by more than a maximum allowed bucket difference.
• Bucket-based storage quota management: example.
  – A Storage Node asks the Coordinator: "Can I get a new Bucket?"
  – Coordinator: "Yes, you can."
  – Later, a Storage Node asks again: "Can I get a new Bucket?"
  – Coordinator: "NO you cannot!"
  – The search then moves on to the second biggest overlap.
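The bucket rule can be sketched as follows. This is an illustrative model, not PRODUCK's implementation: the class and function names, the bucket size, and the simplification that every super-chunk fills exactly one bucket are all assumptions.

```python
class Coordinator:
    """Grants fixed-size buckets while bounding the gap to the least loaded node."""

    def __init__(self, nodes, max_bucket_diff: int = 1):
        self.granted = {node: 0 for node in nodes}
        self.max_bucket_diff = max_bucket_diff

    def request_bucket(self, node) -> bool:
        least = min(self.granted.values())
        # Refuse if the grant would put the node too far ahead of the least loaded.
        if self.granted[node] + 1 - least > self.max_bucket_diff:
            return False
        self.granted[node] += 1
        return True

def place(coordinator, overlaps, free_space, size, bucket_size):
    """Try nodes by decreasing overlap; fall back when a bucket request is refused."""
    for node in sorted(overlaps, key=overlaps.get, reverse=True):
        if free_space[node] >= size:
            free_space[node] -= size
            return node
        if coordinator.request_bucket(node):
            free_space[node] += bucket_size - size
            return node
    return None

nodes = ["A", "B", "C", "D"]
overlaps = {"A": 0.10, "B": 0.30, "C": 0.60, "D": 0.00}
coordinator = Coordinator(nodes, max_bucket_diff=1)
free_space = {node: 0 for node in nodes}
placements = [place(coordinator, overlaps, free_space, size=10, bucket_size=10)
              for _ in range(5)]
print(placements)
```

With a maximum difference of one bucket, node C receives the first super-chunk and is then refused until every other node holds a bucket too, at which point it becomes eligible again.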
Contribution Summary
• Novel chunk overlap estimation.
  – Based on probabilistic counting with stochastic averaging (PCSA) [Flajolet et al. 1985, Michel et al. 2006].
  – Never used before in storage systems.
• Novel load balancing mechanism.
  – Operating at chunk-level granularity.
  – Improving co-localization of duplicate chunks.
Evaluation: Datasets
• 2 real-world workloads: Wikipedia and Images.
• 2 competitors [Dong et al. 2011]:
  – MinHash
  – BloomFilter
Figure 3: Message exchanges for storing a file.
the actual data to the selected StorageNode in a SuperChunk message. When the transmission of the superchunk is over, the StorageNode that received it sends a SuperChunkAck message to the node responsible for the file. The SuperChunkAck contains the checksum of the data in the superchunk and the corresponding identifier. This enables the responsible node to record which StorageNode stored which superchunk, and allows it to check the integrity of the data received by the StorageNode. If something went wrong, the responsible node requests the Client to resend the data to the StorageNode.
The above process is repeated for all the superchunks in the file. When the responsible node has received acknowledgements for all the superchunks, it sends an EOF message to the Coordinator, which then forwards it to the Client.
3.2 Recovering a File
The process of reading a file is also initiated by a user request to the client. As depicted in Figure 4, the client reacts to this request by contacting the Coordinator with a GetFileReq message, specifying the identifier of the file to retrieve. The Coordinator looks up the responsible node for the file and forwards the GetFileReq message to it. The responsible node replies to the client by providing the identifier, checksum, and StorageNode for each superchunk in the file in a GetFileResp message. The client downloads each superchunk from the corresponding StorageNode and uses the checksums received from the responsible node to verify their integrity.
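The client's last step, downloading each superchunk and verifying it, can be sketched in memory as below. The text does not name the checksum function or any programming interface, so SHA-256, the StorageNode class, and get_file are hypothetical stand-ins. The list of entries plays the role of the GetFileResp payload.

```python
import hashlib

def checksum(data: bytes) -> str:
    # Assumption: the source does not say which checksum is used.
    return hashlib.sha256(data).hexdigest()

class StorageNode:
    """In-memory stand-in for a StorageNode holding superchunks by identifier."""

    def __init__(self):
        self.super_chunks = {}

    def store(self, sc_id, data: bytes) -> str:
        self.super_chunks[sc_id] = data
        return checksum(data)  # what a SuperChunkAck would carry

    def fetch(self, sc_id) -> bytes:
        return self.super_chunks[sc_id]

def get_file(entries, storage_nodes) -> bytes:
    """entries: (identifier, checksum, node name) per superchunk, in file order."""
    parts = []
    for sc_id, expected, node_name in entries:
        data = storage_nodes[node_name].fetch(sc_id)
        if checksum(data) != expected:
            raise ValueError(f"integrity check failed for superchunk {sc_id}")
        parts.append(data)
    return b"".join(parts)

nodes = {"n1": StorageNode(), "n2": StorageNode()}
entries = [("sc0", nodes["n1"].store("sc0", b"hello "), "n1"),
           ("sc1", nodes["n2"].store("sc1", b"world"), "n2")]
print(get_file(entries, nodes))
```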
4. EXPERIMENTAL METHODOLOGY
In this section we present our experimental set-up: the datasets we used, the competitors we compared Produck against, and the metrics we evaluated. Since we focus on evaluating deduplication efficiency and load balancing with respect to storage, we evaluate Produck through simulations to avoid any networking side-effects. Yet, we built our simulator so that the transition to a real implementation only consists in changing its communication primitives from in-memory transactions to network messages.
Figure 4: Message exchanges for retrieving a file.
4.1 Datasets
We evaluate Produck and its competitors using two real-world workloads. The first is publicly available for download from [1] and consists of 16 full snapshots of the English version of Wikipedia. The oldest dates back to March 2011, while the newest was taken in May 2012. This dataset contains full versions of all articles in Wikipedia and accounts for more than 520GB of data. The second dataset contains images of the environments available for deployment on the servers of the Grid5000 experimental platform [2] and accounts for 142GB. These workloads contain both text and binary data, thus covering many common cases. Table 1 presents more details on these datasets. In particular, the Deduplication Factor is the original size of each dataset divided by its size after being deduplicated based on our chunking mechanism.

Dataset     Size (GB)   Deduplication Factor   Data Format
Wikipedia   522         1.96                   HTML
Images      142         4.27                   OS images
Table 1: Dataset Description
4.2 Competitors
Before evaluating the specific aspects of Produck, we compare it against two state-of-the-art cluster-based deduplication storage systems presented in [6]: BloomFilter, a stateful strategy, and MinHash, a stateless one. The latter is used in a commercial product [6], thus representing the current state of commercial cluster-based deduplication systems. As in Produck, both these strategies split files into chunks and superchunks using content-based chunking with the hash values of chunks serving as their signatures. The difference between them lies in the way each strategy selects the node that should store each superchunk.
Evaluation: Competitors
• MinHash:
  – Use the minimum hash from a super-chunk as its fingerprint.
  – Assign super-chunks to bins using the mod(# bins) operator.
  – Initially assign bins to nodes randomly and re-assign bins to nodes when unbalanced.
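A minimal sketch of the MinHash routing steps listed above. SHA-1 as the chunk hash, the seeded random assignment, and the function names are assumptions. The re-assignment of bins when the system becomes unbalanced is left out.

```python
import hashlib
import random

def chunk_hash(chunk: bytes) -> int:
    """Hash of a chunk, used as its signature (SHA-1 is an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")

def fingerprint(super_chunk) -> int:
    """The minimum chunk hash serves as the super-chunk's fingerprint."""
    return min(chunk_hash(c) for c in super_chunk)

def bin_of(super_chunk, num_bins: int) -> int:
    """Map the fingerprint to a bin with the mod(# bins) operator."""
    return fingerprint(super_chunk) % num_bins

def initial_bin_assignment(num_bins: int, nodes, seed: int = 0):
    """Assign every bin to a randomly chosen node."""
    rng = random.Random(seed)
    return {b: rng.choice(nodes) for b in range(num_bins)}

def route(super_chunk, bin_to_node):
    """Stateless routing: the destination depends only on the super-chunk content."""
    return bin_to_node[bin_of(super_chunk, len(bin_to_node))]

bin_to_node = initial_bin_assignment(16, ["A", "B", "C", "D"])
print(route([b"chunk-1", b"chunk-2", b"chunk-3"], bin_to_node))
```

Two super-chunks that contain the same minimum-hash chunk land in the same bin, so similar super-chunks tend to meet on the same node without the Coordinator keeping any per-node state.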
Evaluation: Competitors
• BloomFilter:
  – The Coordinator keeps a Bloom filter for each one of the Storage Nodes.
  – If a node deviates more than 5% from the average load, it is considered overloaded.
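A hedged sketch of the BloomFilter strategy. The slide states only that the Coordinator keeps one filter per node and applies a 5% overload rule. Choosing the non-overloaded node with the most filter hits is an assumption about how the two are combined, and the filter size, the number of hash functions, and the names are illustrative.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, item: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: bytes) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))

def overloaded(load, node, threshold: float = 0.05) -> bool:
    """A node is overloaded when it exceeds the average load by more than 5%."""
    average = sum(load.values()) / len(load)
    return load[node] > average * (1 + threshold)

def choose_node(filters, load, super_chunk):
    """Among non-overloaded nodes, pick the one whose filter matches most chunks."""
    candidates = [n for n in filters if not overloaded(load, n)]
    return max(candidates, key=lambda n: sum(c in filters[n] for c in super_chunk))

filters = {"A": BloomFilter(), "B": BloomFilter()}
filters["B"].add(b"x")
filters["B"].add(b"y")
print(choose_node(filters, {"A": 100, "B": 100}, [b"x", b"y", b"z"]))
```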
Evaluation: Metrics
Deduplication:
Load balancing:
Overall:
ED and TD are normalized to the performance of a single-node system to ease comparison.
Throughput:
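The metric formulas are missing from the slide text. The sketch below assumes the definitions commonly used for cluster deduplication: total deduplication (TD) as original size over stored size, data skew as the most loaded node's usage over the average usage, and effective deduplication (ED) as TD divided by skew. These are assumptions rather than formulas recovered from the slides, the function names are hypothetical, and throughput is not covered.

```python
def total_deduplication(original_size, stored_per_node):
    """TD: original data size divided by the size actually stored (assumed definition)."""
    return original_size / sum(stored_per_node)

def data_skew(stored_per_node):
    """Skew: most loaded node over the average load; 1.0 means perfect balance."""
    average = sum(stored_per_node) / len(stored_per_node)
    return max(stored_per_node) / average

def effective_deduplication(original_size, stored_per_node):
    """ED: TD penalized by load imbalance (assumed definition)."""
    return total_deduplication(original_size, stored_per_node) / data_skew(stored_per_node)

def normalized(value, single_node_td):
    """Express ED or TD relative to the deduplication of a single-node system."""
    return value / single_node_td

balanced = [50, 50, 50, 50]     # hypothetical GB stored per node
unbalanced = [100, 50, 25, 25]  # same total, concentrated on one node
print(effective_deduplication(400, balanced),
      effective_deduplication(400, unbalanced))
```

Under these definitions both placements store the same amount of data, yet the unbalanced one scores lower on ED, which is why ED works as an overall metric.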
Evaluation: Effective Deduplication
[Plots: Wikipedia, Images]
32 nodes: Wikipedia 7%, Images 16%
64 nodes: Wikipedia 16%, Images 21%
Evaluation: Throughput
[Plots: Wikipedia, Images]
32 nodes: Wikipedia 11X, Images 13X
64 nodes: Wikipedia 16X, Images 21X
Memory: 64KB for Produck
9.6 bits/chunk or 168GB for 140TB/node
Evaluation: Load Balancing