Data Deduplication in a Virtual Tape Library Environment

(1)

Mathias Defiebre

IBM Lab Services

(2)

Agenda

Data Deduplication Overview

Data Deduplication Theory

Data Deduplication Approaches in Practice

Data Deduplication Considerations and Value Proposition

TS7650 ProtecTIER Deduplication Gateway

TS7650 ProtecTIER Deduplication Appliance Series

A Look into the Future

(3)

(4)

Data Deduplication Overview

With Data Deduplication repeated instances of identical data are

identified and stored only once

–

Identical data is referenced to a single instance

–

Saves storage capacity and network bandwidth

Data Deduplication is a feature of a storage device or an application

–

VTL, NAS-Box, backup application

Data Deduplication requires an I/O protocol

–

FCP, iSCSI, CIFS, NFS, API, Tape Library Emulation

Data Deduplication does not always make sense

–

Not all data can be deduplicated well

–

May interfere with, or work together with, other technologies such as compression and encryption, or with data security requirements

Data Deduplication is transparent

(5)

(6)

Data Deduplication Process (simplified)

F E D C B A

A B C D A E F F D

Data Object / Stream

Identical Chunks

Data object or stream is subject to deduplication

(1) Data object is split into chunks (fixed or variable size)

Data Chunking

(2) For each chunk an identity characteristic is determined

Identity Determination

(3a) Identical chunks are referenced (pointer, reference)

(3b) Non-identical chunks (single instances) are stored once
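The simplified process above can be sketched in a few lines of Python. This is a minimal illustration, not any product's implementation: the chunks are the single letters from the slide's example stream, the chunk itself serves as its identity characteristic, and the names `deduplicate` and `rebuild` are made up for this sketch.

```python
def deduplicate(chunks):
    """Store each distinct chunk once; represent the stream as references."""
    store = []        # unique chunks, in order of first appearance
    index = {}        # chunk -> position in store (the identity lookup)
    references = []   # one store position per incoming chunk
    for chunk in chunks:
        if chunk not in index:          # (3b) new chunk: store it once
            index[chunk] = len(store)
            store.append(chunk)
        references.append(index[chunk])  # (3a) every chunk becomes a reference
    return store, references


def rebuild(store, references):
    """Reconstruct the original stream from the store and the references."""
    return [store[i] for i in references]


stream = list("ABCDAEFFD")          # the example stream from the slide
store, refs = deduplicate(stream)
print(store)                        # ['A', 'B', 'C', 'D', 'E', 'F']
print(refs)                         # [0, 1, 2, 3, 0, 4, 5, 5, 3]
print(len(stream) / len(store))     # 1.5 (nine chunks in, six stored)
```

Rebuilding the stream from `store` and `refs` returns the original nine chunks, which is what makes the reduction lossless.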

(7)

Methods for Data Chunking

1. File based

One chunk is one file, most appropriate for file systems

2. Block based

Data object is chunked into blocks of fixed or variable size

Used by block storage devices

3. Format aware (Content aware)

Understands explicit data formats and chunks data objects according to the format

Example: Breaking a PowerPoint presentation into separate slides

4. Format agnostic (Content agnostic)

Chunking is based on an algorithm that looks for logical breaks or similar elements

within a data object/stream
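As an illustration of format-agnostic, variable-size chunking, the sketch below cuts a byte stream wherever a rolling sum over the last few bytes matches a bit pattern, so boundaries follow the content rather than fixed offsets. It is a toy under stated assumptions: the rolling sum, the window, mask, and size limits are arbitrary choices for this example, and production systems typically use stronger rolling fingerprints.

```python
import random


def content_defined_chunks(data, window=16, mask=0x3F, min_size=32, max_size=1024):
    """Split bytes into variable-size chunks at content-dependent boundaries."""
    chunks = []
    start = 0        # index where the current chunk begins
    rolling = 0      # sum of the last `window` bytes of the current chunk
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]   # drop the byte leaving the window
        size = i - start + 1
        # Cut when the window content matches the pattern, or at the size cap.
        if (size >= min_size and (rolling & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])       # trailing partial chunk
    return chunks


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20000))
shifted = b"X" + original                 # one byte inserted at the front

a = content_defined_chunks(original)
b = content_defined_chunks(shifted)
print(len(a), len(b), len(set(a) & set(b)))   # chunk counts and shared chunks
```

With fixed-size blocks, the single inserted byte would shift every later block boundary; with content-defined boundaries the two chunk lists can fall back into step after the insertion point, which is why the number of shared chunks is worth inspecting.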

(8)

Methods for Determining Duplicates

1. Hashing

Calculate a hash (e.g. MD5, SHA-256) for each data chunk

Compare hash with hash of existing data

–

Identical hash means most likely identical data

Hash Collision: Identical hash but non-identical data

–

Must be caught through a secondary comparison (additional metadata, second hash method, additional binary comparison)
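A hedged sketch of hashing with a secondary binary comparison: a chunk is treated as a duplicate only if its hash matches and its bytes match. The `ChunkStore` class and the deliberately weak `weak_hash` function are hypothetical, introduced only to force a collision that SHA-256 would not produce in practice.

```python
import hashlib


class ChunkStore:
    """Hash index with a byte-for-byte check before a chunk is referenced."""

    def __init__(self, hash_fn=None):
        self._hash = hash_fn or (lambda data: hashlib.sha256(data).digest())
        self._buckets = {}   # digest -> list of chunk ids sharing that digest
        self._chunks = []    # chunk id -> stored bytes

    def put(self, chunk):
        digest = self._hash(chunk)
        for chunk_id in self._buckets.get(digest, []):
            if self._chunks[chunk_id] == chunk:   # secondary binary comparison
                return chunk_id                   # true duplicate: reference it
        # New data, or a hash collision with different content: store it.
        self._chunks.append(chunk)
        chunk_id = len(self._chunks) - 1
        self._buckets.setdefault(digest, []).append(chunk_id)
        return chunk_id

    def get(self, chunk_id):
        return self._chunks[chunk_id]


def weak_hash(data):
    """Deliberately poor hash (byte sum mod 256) so that collisions occur."""
    return bytes([sum(data) % 256])


store = ChunkStore(hash_fn=weak_hash)
first = store.put(b"ab")
collide = store.put(b"ba")    # same weak hash, different bytes
again = store.put(b"ab")      # genuine duplicate
print(first, collide, again)  # 0 1 0
```

Without the comparison inside `put`, the second call would have returned a reference to `b"ab"` and the content `b"ba"` would have been lost.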

2. Binary Comparison

Compare all bits of similar chunks

3. Delta Differencing

Computes a “delta” between two “similar” chunks of data, where one chunk is the baseline and the second is stored only as its difference (delta) from that baseline

Since each delta is unique there is no possibility of collision

To reconstruct the original chunk the delta(s) have to be re-applied to the baseline chunk
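Delta differencing can be illustrated with Python's standard `difflib`. This is a sketch only: the operation format (`"copy"` ranges from the baseline plus literal `"data"` runs) and the names `make_delta` and `apply_delta` are assumptions for this example, not the encoding of any product.

```python
from difflib import SequenceMatcher


def make_delta(baseline, target):
    """Describe `target` as copies from `baseline` plus literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, baseline, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reuse baseline[i1:i2]
        elif j2 > j1:                             # 'replace' or 'insert'
            ops.append(("data", target[j1:j2]))   # carry only the new bytes
        # 'delete' needs no entry: those baseline bytes are simply not copied
    return ops


def apply_delta(baseline, ops):
    """Re-apply the delta to the baseline to reconstruct the target chunk."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(baseline[op[1]:op[2]])
        else:
            parts.append(op[1])
    return b"".join(parts)


baseline = b"The quick brown fox jumps over the lazy dog"
target = b"The quick red fox jumps over the lazy dogs"
delta = make_delta(baseline, target)
new_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print(delta)
print(new_bytes, "literal bytes stored instead of", len(target))
```

Only the literal runs need to be stored for the second chunk; everything else is a reference into the baseline, and `apply_delta(baseline, delta)` returns the target unchanged.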

(9)

Data Deduplication Architectures

[Figure: Client, Server and Storage Device in a chain; Client and Server are connected by LAN, Server and Storage Device by LAN or SAN]

Client-side

+

Reduces load on Server

+

Reduces bandwidth on LAN

–

Adds load to Client

–

No cross-correlation among multiple clients

Server-side

+

Allows cross-correlation among multiple Clients

–

Adds load to Server

Storage-side

+

Transparent to Clients and Servers

+

Reduces load on Server and Clients

(10)

Data Deduplication Processing Time

In-line: Data is deduplicated before it is actually stored

+

Requires less storage capacity

–

Potential decrease of I/O performance

Post-processing: Data is first stored and deduplicated later in the

background

+

Better Performance expected

–

Requires more storage capacity to temporarily store the data

–

Data is written, read and written again – thus more I/O intensive

–

Deduplication window must be coordinated with backup window

Combination of In-Line and Post-processing

(11)

(12)

Practical Approaches Overview

Practical approaches combine

–

Chunking Method

–

Method for Determining/Checking Identity

Common Practical Approaches

[Figure: matrix of chunking methods (Format Agnostic, Format Aware, Fixed/Variable Block Size) against identity checks (Binary Diff, Delta Diff, Hashing), positioning three approaches: Hash based, Content Aware, and HyperFactor]

(13)

Hash Based Approach

1. Slice data into chunks (fixed or variable)

2. Generate Hash per chunk

3. Compare hashes with hash table

4. For identical hashes store reference, otherwise store chunk and

update hash table

[Figure: object references pointing into a hash table; each hash value (A_h, B_h, C_h, D_h, E_h) maps to the storage location of its chunk (A, B, C, D, E)]

(14)

Assessment for Hash Based Approach

Hash-Collisions must be handled

–

More overhead, especially for in-line deduplication

Requires a hash table to store hashes for all chunks

–

Hash table will grow with data volume

Hash Table must be quickly searchable and accessible

–

Growing hash table may become a performance bottleneck (doesn’t fit into RAM)

–

Scalability issues

Hash table must be protected

–

One copy might not be sufficient

Example:

Chunk size of 8KB, each hash is 20 bytes long …

With a 1 TB repository:

1 TB repository has ~134,000,000 chunks of 8 KB each

Needs a pointer scheme to reference inside 1 TB

Hash table requires ~2.5 GB of memory – no issue

With a 100 TB repository:
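The figures for the 100 TB case are not shown here. Under the same stated assumptions (8 KB chunks, 20-byte hashes, binary units, and counting only the hashes themselves, not pointers or table overhead), the arithmetic can be reproduced and scaled as a rough sketch; `hash_table_size` is a name made up for this example.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def hash_table_size(repository_bytes, chunk_bytes=8 * 1024, hash_bytes=20):
    """Return (number of chunks, bytes of raw hash values) for a repository."""
    chunks = repository_bytes // chunk_bytes
    return chunks, chunks * hash_bytes


for tib in (1, 100):
    chunks, table = hash_table_size(tib * TIB)
    print(f"{tib} TiB: {chunks:,} chunks, {table / GIB:.1f} GiB of hashes")
# 1 TiB: 134,217,728 chunks, 2.5 GiB of hashes
# 100 TiB: 13,421,772,800 chunks, 250.0 GiB of hashes
```

The 1 TB line matches the ~134,000,000 chunks and ~2.5 GB quoted above; the 100 TB line is a straight extrapolation and shows why a table of this kind may no longer fit into RAM.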

(15)

HyperFactor Approach

HyperFactor has two indexes

–

HyperFactor Index

–

Restore Index

HyperFactor Index used for backup

–

Used to filter out similar elements from the incoming data stream

–

Fixed size of 4 GB, memory resident, synced to disk (repository) periodically

–

Can be restored from repository if lost

–

References up to 1 PB of physical data elements stored in the repository

Restore Index used for restore

–

Includes references to physical data elements

–

Dynamic index, growing

(16)

HyperFactor Approach

1. Look through data stream for similarity and filter similar elements

–

Using HyperFactor Index (fixed size 4 GB)

2. Read elements that are most similar from storage

–

Using Restore Index

3. Binary compare element in stream with element(s) read from storage

4. Identical data is referenced by a new additional entry in the Restore Index; unique data is stored in the repository

New Data Stream

(17)

Assessment for HyperFactor

No Hash Table required

–

No scalability issues

–

4 GB Index references up to 1 PB of physical data elements

No dependency on data format or application

–

Very flexible, no ongoing development effort due to format changes

HyperFactor index always fits into memory

–

Enables enterprise-class high-performance in-line deduplication

Eliminates the phenomenon of missed factoring opportunities

(18)

Data Deduplication Considerations and Value

Proposition

(19)

Not All Data Dedupes Well

High Dedupe Ratio expected for ...

–

Structured Data

–

Database Files

–

E-mails

Low Dedupe Ratio expected for ...

–

Unstructured Data

–

Images

–

Videos

–

Voice Data

–

Seismic Data

–

Large collections of small files

(20)

Technologies influencing Data Deduplication

Compression

–

*.zip (Phil Katz zip: pkzip, pkunzip)

–

*.gz (GNU zip: gzip, gzip -d)

Compaction

–

Lotus Notes Database

Multiplexing

–

Multiple backup streams to a single tape drive

–

Veritas Backup Exec

–

Computer Associates ARCserve

–

Oracle RMAN multiplexing of backup sets

Encryption

(21)

Example: Data Deduplication and Encryption

[Figure: three data sources each hold the same "Important text". One is sent without encryption, the other two are encrypted with different keys (key 1, key 2) and arrive as different ciphertexts ("txpt tnatroemI", "te tarpIxtntom"). All three are sent through Data Deduplication to the Data Store. The step captions are truncated in the source: "1. Three data ...", "2. After encryption, ...", "3. Deduplication ...", "4. Text files are ..."]

Data encryption prior to de-duplication processing can subvert data reduction

Compression possible
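The effect can be reproduced with a toy example. The cipher below is an XOR against a SHA-256-derived keystream, chosen only so the sketch is self-contained; it is an assumption of this example, not a secure cipher and not what any backup product uses. The names `toy_encrypt` and `unique_chunks` are hypothetical.

```python
import hashlib


def toy_encrypt(data, key):
    """XOR data with a keystream derived from the key. Illustration only."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def unique_chunks(chunks):
    """Number of chunks a hash-based deduplicator would have to store."""
    return len({hashlib.sha256(chunk).digest() for chunk in chunks})


text = b"Important text"
plain = [text, text, text]                          # three unencrypted copies
mixed = [text,                                      # no encryption
         toy_encrypt(text, b"key 1"),               # encrypted with key 1
         toy_encrypt(text, b"key 2")]               # encrypted with key 2

print(unique_chunks(plain))   # 1
print(unique_chunks(mixed))   # 3
```

Three identical plaintext copies reduce to one stored chunk, while the same text encrypted under different keys looks like three unrelated chunks to the deduplication engine.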

(22)

Dedupe Value Proposition & potential Drawbacks

Data Deduplication Value Proposition

–

Disk storage savings

–

Network Bandwidth savings

–

Energy savings (Green IT)

–

Better utilization of existing floor and rack space

–

Increased scalability

Data Deduplication Potential Drawbacks

–

Loss of one single data chunk may cause loss of multiple files

–

Repository or index required to store metadata; it must be protected, requires additional storage capacity, and may slow down performance

(23)

(24)

ProtecTIER Architecture Overview

Linux server-based application running on a System x server

Emulates a tape library unit, including drives, cartridges, and robotics

Uses Fibre Channel (FC) attached disk storage system as the backup medium

Has a built-in deduplication engine (HyperFactor)

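[Figure: a Backup Server is attached via FC to the ProtecTIER Server, which runs the ProtecTIER Application and presents a Virtual Tape Library (to the backup server, “It’s a Tape Library and Drives”); the ProtecTIER Server is attached to a Disk Storage System]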

(25)

Data Storage

[Figure: Backup Servers connect through an FC Switch to the ProtecTIER Server, which is attached to Disk Arrays. HyperFactor checks the New Data Stream against Existing Data using its Memory Resident Index (4 GB, may contain predefined elements)]

• Read similar elements from storage and compare

(26)

Dedupe Ratio depends on ...

Data Change Rate

–

The percentage of data in the incoming backup data stream that is new to ProtecTIER and not already stored physically in the repository

Backup Policies

–

Number of full backups

–

Number of incremental backups

–

backup frequency
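How change rate and backup policy drive the ratio can be illustrated with a deliberately simple model. Everything in it is an assumption of this sketch rather than a ProtecTIER formula: the first full backup is stored completely, each later full adds only the changed fraction, each incremental consists entirely of new data of that same fraction, and compression is ignored.

```python
def dedupe_ratio(full_size_tb, num_fulls, change_rate, num_incrementals=0):
    """Nominal data written by the backup server divided by physical data stored."""
    incremental_tb = full_size_tb * change_rate
    nominal = full_size_tb * num_fulls + incremental_tb * num_incrementals
    physical = (full_size_tb                                 # first full, stored whole
                + incremental_tb * (num_fulls - 1)           # new data in later fulls
                + incremental_tb * num_incrementals)         # incrementals are all new
    return nominal / physical


print(round(dedupe_ratio(1.0, 10, 0.10), 2))   # 5.26
print(round(dedupe_ratio(1.0, 10, 0.02), 2))   # 8.47
print(dedupe_ratio(1.0, 4, 0.05, num_incrementals=24) < dedupe_ratio(1.0, 4, 0.05))   # True
```

In this model a lower change rate and more retained full backups raise the ratio, while a policy dominated by incrementals lowers it, because incrementals carry little redundant data to begin with.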

(27)

ProtecTIER Native Replication – Key new feature R2.3

[Figure: a Primary Site and a Secondary Site, each with Backup Server(s) and a ProtecTIER Gateway showing represented capacity and physical capacity]

Significant bandwidth reduction

ProtecTIER IP replication

(28)

(29)

TS7650 Appliance Series

Standalone, 4700, 32 spindles 450GB (2 drawers): 7TB, 100MB/sec; X3850 M2, 3 x 6core, 24GB RAM

Standalone, 4700, 64 spindles 450GB (4 drawers): 18TB, 250MB/sec; X3850 M2, 3 x 6core, 24GB RAM

Standalone, 4700, 128 spindles 450GB (8 drawers): 36TB, 450MB/sec; X3850 M2, 3 x 6core, 24GB RAM

Clustered, 4700, 128 spindles 450GB (8 drawers): 36TB, 450MB/sec; 2 x X3850 M2, 3 x 6core, 24GB RAM

[Rack diagrams: F05 base frame (power: FC 1903) with DS4700 and EXP810 drawers and optional TSSC; the clustered configuration adds two Ethernet switches (1U) and a WTI switch. Two further "500MB/sec" labels appear in the diagram without a recoverable position]

Appliances can be upgraded one step forward ...

(30)

(31)

A Look into the Future

Some observations from the VTL and Dedupe Market

–

Vendors converge to a common point

–

Scalable appliances with multiple I/O interfaces (FCP, iSCSI, CIFS, NFS, Library Emulation)

–

Replication is becoming more and more of a commodity

–

Replication benefits from deduped data

–

Intelligent storage devices will be tightly integrated with 3rd party backup applications

(32)

(33)

Links I

TS7650G ProtecTIER Deduplication Gateway

http://www-03.ibm.com/systems/storage/tape/ts7650g/index.html

TS7650 ProtecTIER Deduplication Appliance

http://www-03.ibm.com/systems/storage/tape/ts7650a/index.html

Whitepaper: IBM Data Deduplication Strategy and Operations

http://www.ibm.com/developerworks/wikis/display/tivolistoragemanager/IBM+Tivoli+Storage+Manager+V6.1+Data+Deduplication+Strategy+and+Operations

Redbook: The IBM System Storage TS7650G and TS7650 ProtecTIER Servers

(34)

Links II

TS7650G ProtecTIER Implementation Workshops

IBMer:

https://w3-01.sso.ibm.com/learning/lms/Saba/Web/Main/goto/learningActivity?courseNum=SS92E1DE&deepLinkRedirect=false

Business Partner:

http://www-304.ibm.com/jct03001c/services/learning/ites.wss/de/de?pageType=course_description&includeNotScheduled=y&courseCode=SS92E1DE

(35)

IBM Dynamic Infrastructure Leadership Center for Information Infrastructure

Business, Channel & Skill Enablement & Training

DI Education & Briefings

Demos & Showcases

IT Transformation Road-maps & Workshops

BP Certification

IBM European Storage Competence Center & Systems Lab Europe

IBM Executive Briefing Center & TMCC

Customer and Group Briefings

Product & SW Demos

Integrated Solution Demos

IBM STG Europe Storage Software Development

Software Development

Storage & Tape

Linux

Mainframe

File Systems

Storage Competence at the Mainz Location

IBM Germany's fourth largest location offers you a broad portfolio of IBM System Storage Services

End-to-end client support

Workshops

Solution Design

Lab Services

Customer Relationship Management

(36)

Our Services

Client Briefings & Education

Systems Lab Services & Training

Customized Workshops

System Storage Demos

Advanced Technical Support

Solution Design

Proof of Concepts

Benchmarks

Product Field Engineering

Our Expertise

Skilled technical storage experts covering the whole IBM System Storage Portfolio

Information Infrastructure: Compliance, Availability, Retention, Security

HW / SW & Performance

IBM System Storage Solutions Center of Excellence

We offer technical support from the planning phase through well after installation

Our Systems Lab Europe

1500 sqm lab space

(37)

Thank You

Merci (French), Danke (German), Grazie (Italian), Gracias (Spanish), Obrigado (Brazilian Portuguese), Tak (Danish)

(38)

Disclaimer I

No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation.

The performance data contained herein were obtained in a controlled, isolated environment. Results obtained in other operating environments may vary significantly. While IBM has reviewed each item for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. These values do not constitute a guarantee of performance. The use of this information or the implementation of any of the techniques discussed herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into their operating environment. Customers attempting to adapt these techniques to their own environments do so at their own risk.

Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. This information could include technical inaccuracies or typographical errors. IBM may make improvements and/or changes in the product(s) and/or program(s) at any time without notice. Any statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Any reference to an IBM Program Product in this document is not intended to state or imply that only that program product may be used. Any functionally equivalent program that does not infringe IBM's intellectual property rights may be used instead. It is the user's responsibility to evaluate and verify the operation of any non-IBM product, program or service.

(39)

Disclaimer II

THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NONINFRINGEMENT. IBM shall have no responsibility to update this information. IBM products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided. IBM is not responsible for the performance or interoperability of any non-IBM products discussed herein.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents or copyrights. Inquiries regarding patent or copyright licenses should be made, in writing, to:

IBM Director of Licensing IBM Corporation

North Castle Drive

Armonk, NY 10504-1785 U.S.A.

(40)

Trademarks

The following terms are trademarks or registered trademarks of the IBM Corporation in either the United States, other countries or both.

– IBM, TotalStorage, zSeries, pSeries, xSeries, S/390, ES/9000, AS/400, RS/6000

– z/OS, z/VM, VM/ESA, OS/390, AIX, DFSMS/MVS, OS/2, OS/400, ESCON, Tivoli

– iSeries, ES/3090, VSE/ESA, TPF, DFSMSdfp, DFSMSdss, DFSMShsm, DFSMSrmm, FICON

– ProtecTIER, XIV

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Other company, product, and service names mentioned may be trademarks or registered trademarks of their respective companies.