Environment
Mathias Defiebre
IBM Lab Services
Agenda
Data Deduplication Overview
Data Deduplication Theory
Data Deduplication Approaches in Practice
Data Deduplication Considerations and Value Proposition
TS7650 ProtecTIER Deduplication Gateway
TS7650 ProtecTIER Deduplication Appliance Series
A Look into the Future
Data Deduplication Overview
With Data Deduplication repeated instances of identical data are
identified and stored only once
–
Identical data is referenced to a single instance
–
Saves storage capacity and network bandwidth
Data Deduplication is a feature of a storage device or an application
–
VTL, NAS-Box, backup application
Data Deduplication requires an I/O protocol
–
FCP, iSCSI, CIFS, NFS, API, Tape Library Emulation
Data Deduplication does not always make sense
–
Not all data can be deduplicated well
–
May interfere or work together with other technologies like compression, encryption
or with data security requirements
Data Deduplication is transparent
Data Deduplication Process (simplified)
[Figure: data object/stream A B C D A E F F D; the identical chunks A, D, and F are stored once, leaving A B C D E F]
Data object or stream is subject to deduplication
(1) Data object is split into chunks (fixed or variable size)
Data Chunking
(2) For each chunk an identity characteristic is determined
Identity Determination
(3a) Identical chunks are referenced (pointer, reference)
(3b) Non-identical chunks (single instances) are stored once
Methods for Data Chunking
1.
File based
One chunk is one file, most appropriate for file systems
2.
Block based
Data object is chunked into blocks of fixed or variable size
Used by block storage devices
3.
Format aware (Content aware)
Understands explicit data formats and chunks data objects according to the format
Example: Breaking a PowerPoint presentation into separate slides
4.
Format agnostic (Content agnostic)
Chunking is based on an algorithm that looks for logical breaks or similar elements
within a data object/stream
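The difference between fixed-size block chunking and format-agnostic chunking can be sketched as follows. This is a minimal illustration, not any product's algorithm: the function names, the window hash (SHA-256 over the trailing bytes), and the `mask`, `min_size`, and `max_size` parameters are all assumptions chosen for readability.

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 8) -> list[bytes]:
    """Split data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, window: int = 4, mask: int = 0x07,
                           min_size: int = 4, max_size: int = 32) -> list[bytes]:
    """Cut a chunk wherever a hash of the trailing `window` bytes hits a pattern.

    Assumes min_size >= window so the window always lies inside the chunk.
    """
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        # Hypothetical boundary test: low bits of the window hash are zero
        digest = hashlib.sha256(data[i - window + 1:i + 1]).digest()
        if (digest[0] & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Demo: insert one byte at the front and count chunks that stay identical
data = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
shifted = b"X" + data
for chunker in (fixed_chunks, content_defined_chunks):
    shared = set(chunker(data)) & set(chunker(shifted))
    print(chunker.__name__, "shared chunks:", len(shared))
```

With fixed blocks, a single inserted byte shifts every later block boundary, so previously stored blocks no longer match. Boundaries derived from the content itself move with the data, which is why variable-size chunking tends to find more duplicates after insertions.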
Methods for Determining Duplicates
1.
Hashing
Calculate a hash (MD5, SHA-256) for each data chunk
Compare hash with hash of existing data
–
Identical hash means most likely identical data
Hash Collision: Identical hash but non-identical data
–
Must be detected through a secondary comparison (additional metadata,
second hash method, additional binary comparison)
2.
Binary Comparison
Compare all bits of similar chunks
3.
Delta Differencing
Computes a “delta” between two “similar” chunks of data where one chunk is the
baseline and the second is the delta
Since each delta is unique there is no possibility of collision
To reconstruct the original chunk the delta(s) have to be re-applied to the baseline
chunk
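Delta differencing can be sketched with Python's standard `difflib` module. This is an illustrative encoding only: the `make_delta` and `apply_delta` names and the ("copy", ...) / ("data", ...) operation format are assumptions, not a format used by any product mentioned here.

```python
from difflib import SequenceMatcher

def make_delta(baseline: bytes, target: bytes) -> list[tuple]:
    """Describe `target` as copies from `baseline` plus literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, baseline, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reuse baseline[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("data", target[j1:j2]))   # bytes not found in baseline
        # "delete": baseline bytes are simply not copied
    return ops

def apply_delta(baseline: bytes, delta: list[tuple]) -> bytes:
    """Reconstruct the original chunk by re-applying the delta to the baseline."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += baseline[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

baseline = b"The quick brown fox jumps over the lazy dog"
target = b"The quick red fox jumps over the very lazy dog"
delta = make_delta(baseline, target)
print(delta)
print(apply_delta(baseline, delta) == target)
```

Only the "data" operations carry new bytes, so the delta is small when the two chunks are similar. Reconstruction always needs the baseline, which is why the baseline chunk must remain available for as long as any delta refers to it.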
Data Deduplication Architectures
Client
Server
Storage Device
Client-side
+
Reduces load on Server
+
Reduces bandwidth on
LAN
–
Adds load to Client
–
No cross-correlation
among multiple clients
Server-side
+
Allows cross-correlation
among multiple Clients
–
Adds load to Server
[Figure labels: Client to Server over LAN; Server to Storage Device over LAN or SAN]
Storage-side
+
Transparent to Clients and
Servers
+
Reduces load on Server and
Clients
Data Deduplication Processing Time
In-line: Data is deduplicated before it is actually stored
+
Requires less storage capacity
–
Potential decrease of I/O performance
Post-processing: Data is first stored and deduplicated later in the
background
+
Better Performance expected
–
Requires more storage capacity to temporarily store the data
–
Data is written, read and written again – thus more I/O intensive
–
Deduplication window must be coordinated with backup window
Combination of In-Line and Post-processing
Practical Approaches Overview
Practical approaches combine
–
Chunking Method
–
Method for Determining/Checking Identity
Common Practical Approaches
[Figure: matrix of chunking methods (Format Agnostic, Format Aware, Fixed/Variable Block Size) against identity checks (Binary Diff, Delta Diff, Hashing), positioning three approaches: Content Aware, HyperFactor, and Hash based]
Hash Based Approach
1.
Slice data into chunks (fixed or variable)
2.
Generate Hash per chunk
3.
Compare hashes with hash table
4.
For identical hashes store reference, otherwise store chunk and
update hash table
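The four steps above can be sketched as a small in-memory store. This is a minimal illustration under stated assumptions: the `HashDedupStore` class, fixed-size chunking, SHA-256 as the hash, and a Python dict as the hash table are choices made for the example, and the byte-for-byte check stands in for the secondary comparison that guards against hash collisions.

```python
import hashlib

class HashDedupStore:
    def __init__(self, chunk_size: int = 8):
        self.chunk_size = chunk_size
        self.hash_table: dict[bytes, int] = {}  # hash value -> storage location
        self.storage: list[bytes] = []          # unique chunks only

    def write(self, data: bytes) -> list[int]:
        """Store an object and return its list of chunk references."""
        refs = []
        for i in range(0, len(data), self.chunk_size):      # 1. slice into chunks
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()          # 2. hash per chunk
            loc = self.hash_table.get(digest)                # 3. compare with table
            if loc is not None and self.storage[loc] != chunk:
                raise RuntimeError("hash collision detected")
            if loc is None:                                  # 4. store new chunk
                loc = len(self.storage)
                self.storage.append(chunk)
                self.hash_table[digest] = loc
            refs.append(loc)                                 # 4. or just reference it
        return refs

    def read(self, refs: list[int]) -> bytes:
        """Rebuild an object from its references."""
        return b"".join(self.storage[r] for r in refs)

store = HashDedupStore(chunk_size=1)
refs = store.write(b"ABCDAEFFD")   # the example stream from the process slide
print(refs)                        # [0, 1, 2, 3, 0, 4, 5, 5, 3]
print(len(store.storage))          # 6 unique chunks stored for 9 written
```

The hash table has one entry per unique chunk, so it grows with the repository. That growth is the starting point for the assessment of this approach.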
[Figure: an object split into chunks A to E; a hash value (h) is computed per chunk and maps to the chunk's storage location, and the object is kept as a list of references]
Assessment for Hash Based Approach
Hash-Collisions must be handled
–
More overhead, especially for in-line deduplication
Requires a hash table to store hashes for all chunks
–
Hash table will grow with data volume
Hash Table must be quickly searchable and accessible
–
Growing hash table may become a performance bottleneck (once it no longer fits into RAM)
–
Scalability issues
Hash table must be protected
–
One copy might not be sufficient
Example:
Chunk size of 8KB, each hash is 20 bytes long …
With a 1 TB repository:
A 1 TByte repository has ~134,000,000 chunks of 8 KB each
A pointer scheme is needed to reference data inside the 1 TByte
The hash table requires ~2.5 GB of memory (no issue)
With a 100 TB repository:
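The arithmetic of this example can be reproduced, and scaled to the 100 TB case, with a few lines. The sketch assumes binary units (1 TByte = 2^40 bytes, which matches the ~134,000,000 figure) and counts only the 20-byte hashes, not the pointers; the `hash_table_size` helper is a name made up for this illustration.

```python
TIB = 2 ** 40
GIB = 2 ** 30

def hash_table_size(repo_bytes: int, chunk_bytes: int = 8 * 1024,
                    hash_bytes: int = 20) -> tuple[int, int]:
    """Return (number of chunks, bytes needed to hold one hash per chunk)."""
    chunks = repo_bytes // chunk_bytes
    return chunks, chunks * hash_bytes

for tib in (1, 100):
    chunks, table = hash_table_size(tib * TIB)
    print(f"{tib} TiB: {chunks:,} chunks, hash table ~{table / GIB:.1f} GiB")
# 1 TiB: 134,217,728 chunks, hash table ~2.5 GiB
# 100 TiB: 13,421,772,800 chunks, hash table ~250.0 GiB
```

The table scales linearly with the repository, so under these assumptions a 100 TB repository needs roughly 250 GB for the hashes alone, which is where the concern about keeping the table in RAM comes from.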
HyperFactor Approach
HyperFactor has two indexes
–
HyperFactor Index
–
Restore Index
HyperFactor Index used for backup
–
Used to filter out similar elements from the incoming data stream
–
Fixed size of 4 GB, memory resident, synced to disk (repository) periodically
–
Can be restored from repository if lost
–
References up to 1 PB of physical data elements stored in the repository
Restore Index used for restore
–
Includes references to physical data elements
–
Dynamic index, growing
HyperFactor Approach
1.
Look through data stream for
similarity
and filter similar elements
–
Using HyperFactor Index (fixed size 4 GB)
2.
Read elements that are most
similar
from storage
–
Using Restore Index
3.
Binary compare element in stream with element(s) read from
storage
4.
Identical data is referenced by a new additional entry in the Restore
Index - unique data is stored in the repository
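The deck does not describe HyperFactor's internals beyond the four steps above, so the following is a generic similarity-then-compare sketch, not IBM's algorithm. Everything in it is an assumption made for illustration: the min-hash style `signature`, the `SimilarityStore` class, the per-element index layout, and the use of `difflib` for the binary comparison.

```python
import hashlib
from difflib import SequenceMatcher

def signature(element: bytes, k: int = 16, width: int = 8) -> frozenset:
    """Similarity fingerprint: the k smallest hashes over all width-byte windows."""
    hashes = {
        hashlib.sha256(element[i:i + width]).digest()[:8]
        for i in range(max(1, len(element) - width + 1))
    }
    return frozenset(sorted(hashes)[:k])

class SimilarityStore:
    def __init__(self):
        self.similarity_index = []  # one small signature per element (in memory)
        self.recipes = []           # per element: references plus unique data

    def read(self, n: int) -> bytes:
        """Rebuild element n by following its references."""
        out = bytearray()
        for part in self.recipes[n]:
            if part[0] == "ref":
                _, base, i1, i2 = part
                out += self.read(base)[i1:i2]
            else:
                out += part[1]
        return bytes(out)

    def write(self, element: bytes) -> int:
        sig = signature(element)
        # Step 1: use only the small index to find the most similar element
        best = max(range(len(self.similarity_index)),
                   key=lambda n: len(sig & self.similarity_index[n]), default=None)
        if best is not None and sig & self.similarity_index[best]:
            base = self.read(best)                           # Step 2: read it
            recipe = []
            matcher = SequenceMatcher(None, base, element, autojunk=False)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes():  # Step 3: compare
                if tag == "equal":
                    recipe.append(("ref", best, i1, i2))     # Step 4: reference
                elif j2 > j1:
                    recipe.append(("new", element[j1:j2]))   # Step 4: store unique
        else:
            recipe = [("new", element)]
        self.similarity_index.append(sig)
        self.recipes.append(recipe)
        return len(self.recipes) - 1

    def unique_bytes(self, n: int) -> int:
        """Bytes physically stored for element n."""
        return sum(len(p[1]) for p in self.recipes[n] if p[0] == "new")

first = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(7))
second = first[:100] + bytes(b ^ 0xFF for b in first[100:105]) + first[105:]
store = SimilarityStore()
a, b = store.write(first), store.write(second)
print(store.unique_bytes(a), store.unique_bytes(b))
```

The point of the sketch is the division of labor: the in-memory index only has to say which stored element is worth reading, while the byte comparison decides what is actually identical, so no per-chunk hash table is needed.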
Assessment for HyperFactor
No Hash Table required
–
No scalability issues
–
4 GB Index references up to 1 PB of physical data elements
No dependency on data format or application
–
Very flexible, no ongoing development effort due to format changes
HyperFactor index always fits into memory
–
Enables enterprise-class high-performance in-line deduplication
Eliminates the phenomenon of missed factoring opportunities
Data Deduplication Considerations and Value
Proposition
Not all Data Dedupes Well
High Dedupe Ratio expected for ...
–
Structured Data
–
Database Files
–
E-mails
Low Dedupe Ratio expected for ...
–
Unstructured Data
–
Images
–
Videos
–
Voice Data
–
Seismic Data
–
Large collections of small files
Technologies influencing Data Deduplication
Compression
–
Archives
–
*.zip (Phil Katz zip: pkzip, pkunzip)
–
*.gz (GNU zip: gzip, gzip -d)
Compaction
–
Lotus Notes Database
Multiplexing
–
Multiple backup streams to a single tape drive
–
Veritas Backup Exec
–
Computer Associates ARCserve
–
Oracle RMAN multiplexing of backup sets
Encryption
Example: Data Deduplication and Encryption
[Figure: three data sources each hold the same "Important text". One is sent with no encryption, the other two are encrypted with key 1 and key 2 and arrive as "txpt tnatroemI" and "te tarpIxtntom". After deduplication, the data store holds three different objects: "Important text", "txpt tnatroemI", and "te tarpIxtntom".]
Data encryption prior to de-duplication processing can subvert data reduction
Compression possible
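The effect shown in this example can be reproduced in a few lines. The cipher below is a toy XOR keystream built from SHA-256, used only because it needs no third-party library; it is an assumption of this sketch, is not secure, and is not what any product uses. The `unique_objects` helper is likewise hypothetical and simply counts distinct hashes.

```python
import hashlib

def toy_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """Toy XOR stream cipher for illustration only; NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ s for p, s in zip(plaintext, stream))

def unique_objects(objects: list[bytes]) -> int:
    """Number of objects a hash-based deduplicator would have to store."""
    return len({hashlib.sha256(o).digest() for o in objects})

text = b"Important text"
unencrypted = [text, text, text]
mixed = [text, toy_encrypt(text, b"key 1"), toy_encrypt(text, b"key 2")]

print(unique_objects(unencrypted))  # 1
print(unique_objects(mixed))        # 3
```

Identical plaintext encrypted under different keys produces unrelated byte sequences, so the deduplicator sees three unique objects. Encrypting after deduplication avoids this.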
Dedupe Value Proposition & potential Drawbacks
Data Deduplication Value Proposition
–
Disk storage savings
–
Network Bandwidth savings
–
Energy savings (Green IT)
–
Better utilization of existing floor and rack space
–
Increased scalability
Data Deduplication Potential Drawbacks
–
Loss of one single data chunk may cause loss of multiple files
–
Repository or Index required to store meta data
–
must be protected
–
requires additional storage capacity
–
may slow down performance
ProtecTIER Architecture Overview
Linux server-based application running on a System x server
Emulates a tape library unit, including drives, cartridges, and robotics
Uses Fibre Channel (FC) attached disk storage system as the backup medium
Has a built-in deduplication engine (HyperFactor)
[Figure: Backup Servers connect through an FC switch to the ProtecTIER Server, which presents itself as a virtual tape library ("It's a Tape Library and Drives") and stores data on FC-attached disk arrays. Inside the ProtecTIER application, HyperFactor checks the new data stream against its memory-resident index (4 GB, may contain predefined elements), then reads similar elements of existing data from storage and compares them.]
Dedupe Ratio depends on ...
Data Change Rate
–
the percentage of data in the incoming backup data stream that is new to
ProtecTIER and not already stored physically in the repository
Backup Policies
–
Number of full backups
–
Number of incremental backups
–
backup frequency
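How change rate and backup policy drive the ratio can be illustrated with a deliberately simple model. It is an assumption of this sketch, not a ProtecTIER planning formula: only full backups are retained, the first one is stored completely, every later one adds only its changed fraction, and compression is ignored. The `dedupe_ratio` function name is hypothetical.

```python
def dedupe_ratio(full_size_tb: float, num_fulls: int, change_rate: float) -> float:
    """Nominal (represented) data divided by physically stored data."""
    nominal = full_size_tb * num_fulls
    physical = full_size_tb + (num_fulls - 1) * full_size_tb * change_rate
    return nominal / physical

for rate in (0.02, 0.10, 0.50):
    print(f"change rate {rate:.0%}: ratio {dedupe_ratio(10, 10, rate):.1f}:1")
```

In this model the ratio equals n / (1 + (n - 1) * c) for n retained full backups and change rate c, so it rises with the number of retained fulls and falls quickly as the change rate grows.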
ProtecTIER Native Replication – Key new feature R2.3
[Figure: at the primary site, backup servers write to a ProtecTIER Gateway (represented capacity vs. physical capacity); the gateway replicates to a ProtecTIER Gateway and backup server at the secondary site (represented capacity vs. physical capacity)]
Significant bandwidth reduction
ProtecTIER IP replication
TS7650 Appliance Series
[Figure: rack layouts of four appliance configurations, each in a base frame with X3850 M2 server (3 x 6core, 24GB RAM) and DS4700/EXP810 disk drawers; the figure also carries two "500MB/sec" labels whose configuration is not recoverable]
–
Standalone – 4700, 32 spindle 450GB (2 drawer): 7TB, 100MB/sec
–
Standalone – 4700, 64 spindle 450GB (4 drawer): 18TB, 250MB/sec
–
Standalone – 4700, 128 spindle 450GB (8 drawer): 36TB, 450MB/sec
–
Clustered 4700, 128 spindle 450GB (8 drawer): 36TB, 450MB/sec, two X3850 M2 servers, two Ethernet switches (1U), WTI switch
Appliances can be upgraded one step forward ...
A Look into the Future
Some observations from the VTL and Dedupe Market
–
Vendors converge to a common point
–
Scalable appliances with multiple I/O interfaces (FCP, iSCSI, CIFS, NFS, Library
Emulation)
–
Replication is increasingly becoming a commodity
–
Replication benefits from deduped data
–
Intelligent storage devices will be tightly integrated with 3rd party backup
applications
Links I
TS7650G ProtecTIER Deduplication Gateway
http://www-03.ibm.com/systems/storage/tape/ts7650g/index.html
TS7650 ProtecTIER Deduplication Appliance
http://www-03.ibm.com/systems/storage/tape/ts7650a/index.html
Whitepaper: IBM Data Deduplication Strategy and Operations
http://www.ibm.com/developerworks/wikis/display/tivolistoragemanager/IBM+Tivoli+Storage+Manager+V6.1+Data+Deduplication+Strategy+and+Operations
Redbook: The IBM System Storage TS7650G and TS7650
ProtecTIER Servers
Links II
TS7650G ProtecTIER Implementation Workshops
IBMer:
https://w3-01.sso.ibm.com/learning/lms/Saba/Web/Main/goto/learningActivity?courseNum=SS92E1DE&deepLinkRedirect=false
Business Partner:
http://www-304.ibm.com/jct03001c/services/learning/ites.wss/de/de?pageType=course_description&includeNotScheduled=y&courseCode=SS92E1DE
IBM Dynamic Infrastructure Leadership Center for Information Infrastructure
Business, Channel & Skill Enablement & Training
DI Education & Briefings
Demos & Showcases
IT Transformation Roadmaps & Workshops
BP Certification
IBM European Storage Competence Center & Systems Lab Europe
IBM Executive Briefing Center & TMCC
Business, Channel & Skill Enablement & Training
Customer and Group Briefings
Product & SW Demos
Integrated Solution Demos
IBM STG Europe Storage Software Development
Software Development
Storage & Tape
Linux
Mainframe
File Systems
Storage Competence at the Mainz Location
IBM Germany's fourth
largest location offers
you a broad portfolio of
IBM System Storage
Services
Business, Channel & Skill Enablement & Training
End-to-end client support
Workshops
Solution Design
Lab Services
Customer Relationship Management
Our Services
Client Briefings & Education
Systems Lab Services & Training
Customized Workshops
System Storage Demos
Advanced Technical Support
Solution Design
Proof of Concepts
Benchmarks
Product Field Engineering
Our Expertise
Skilled technical storage experts covering the whole IBM System Storage Portfolio
Information Infrastructure: Compliance, Availability, Retention, Security
HW / SW & Performance
IBM System Storage Solutions Center of Excellence
We offer technical
support from the
planning phase through
well after installation
Our Systems Lab Europe
1500 sqm lab space
Thank You
Merci, Grazie, Gracias, Obrigado, Danke, Tak
Disclaimer I
Copyright©2009 by International Business Machines Corporation.
No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation.
The performance data contained herein were obtained in a controlled, isolated environment. Results obtained in other operating environments may vary significantly. While IBM has reviewed each item for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. These values do not constitute a guarantee of performance. The use of this information or the implementation of any of the techniques discussed herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into their operating environment. Customers attempting to adapt these techniques to their own environments do so at their own risk.
Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. This information could include technical inaccuracies or typographical errors. IBM may make improvements and/or changes in the product(s) and/or program(s) at any time without notice. Any statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Any reference to an IBM Program Product in this document is not intended to state or imply that only that program product may be used. Any functionally equivalent program, that does not infringe IBM's intellectual property rights, may be used instead. It is the user's responsibility to evaluate and verify the operation of any non-IBM product, program or service.
Disclaimer II
THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NONINFRINGEMENT. IBM shall have no responsibility to update this information. IBM products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided. IBM is not responsible for the performance or interoperability of any non-IBM products discussed herein.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents or copyrights. Inquiries regarding patent or copyright licenses should be made, in writing, to:
IBM Director of Licensing IBM Corporation
North Castle Drive
Armonk, NY 10504-1785 U.S.A.
Trademarks
The following terms are trademarks or registered trademarks of the IBM Corporation in either the United States, other countries or both.
– IBM, TotalStorage, zSeries, pSeries, xSeries, S/390, ES/9000, AS/400, RS/6000
– z/OS, z/VM, VM/ESA, OS/390, AIX, DFSMS/MVS, OS/2, OS/400, ESCON, Tivoli
– iSeries, ES/3090, VSE/ESA, TPF, DFSMSdfp, DFSMSdss, DFSMShsm, DFSMSrmm, FICON
– ProtecTIER, XIV
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Other company, product, and service names mentioned may be trademarks or registered trademarks of their respective companies.