
Data Deduplication in a Virtual Tape Library Environment

(1)


Mathias Defiebre

IBM Lab Services

(2)

Agenda



Data Deduplication Overview



Data Deduplication Theory



Data Deduplication Approaches in Practice



Data Deduplication Considerations and Value Proposition



TS7650 ProtecTIER Deduplication Gateway



TS7650 ProtecTIER Deduplication Appliance Series



A Look into the Future

(3)
(4)

Data Deduplication Overview



With Data Deduplication, repeated instances of identical data are identified and stored only once

Identical data is replaced by a reference to the single stored instance

Saves storage capacity and network bandwidth



Data Deduplication is a feature of a storage device or an application

VTL, NAS-Box, backup application



Data Deduplication requires an I/O protocol

FCP, iSCSI, CIFS, NFS, API, Tape Library Emulation



Data Deduplication does not always make sense

Not all data can be deduplicated well

May interfere with, or work together with, other technologies such as compression and encryption, or with data security requirements



Data Deduplication is transparent

(5)
(6)

Data Deduplication Process (simplified)

[Figure: a data object / stream with the chunks A B C D A E F F D, in which A, D and F repeat (identical chunks), is reduced to the unique chunks A B C D E F]

A data object or stream is subject to deduplication

(1) Data Chunking: the data object is split into chunks (fixed or variable size)

(2) Identity Determination: for each chunk an identity characteristic is determined

(3a) Identical chunks are referenced (pointer, reference)

(3b) Non-identical chunks (single instances) are stored once
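The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular product: the function names `deduplicate` and `restore` are made up for this sketch, chunks are fixed-size, and the "identity characteristic" is simply the chunk's own bytes used as a dictionary key.

```python
from typing import Dict, List, Tuple


def deduplicate(stream: bytes, chunk_size: int = 1) -> Tuple[List[bytes], List[int]]:
    """Split a stream into fixed-size chunks and store each distinct chunk once."""
    store: List[bytes] = []        # single instances, in order of first appearance
    index: Dict[bytes, int] = {}   # chunk content -> position in the store
    references: List[int] = []     # one reference per chunk of the original stream
    for offset in range(0, len(stream), chunk_size):
        chunk = stream[offset:offset + chunk_size]   # (1) data chunking
        if chunk not in index:                       # (2) identity determination
            index[chunk] = len(store)                # (3b) store the new chunk once
            store.append(chunk)
        references.append(index[chunk])              # (3a) reference the stored chunk
    return store, references


def restore(store: List[bytes], references: List[int]) -> bytes:
    """Rebuild the original stream by following the references."""
    return b"".join(store[i] for i in references)


store, refs = deduplicate(b"ABCDAEFFD")
print(store)  # [b'A', b'B', b'C', b'D', b'E', b'F']
print(refs)   # [0, 1, 2, 3, 0, 4, 5, 5, 3]
```

The nine-chunk stream from the slide is stored as six chunks plus nine small references, and `restore(store, refs)` returns the original bytes.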







(7)

Methods for Data Chunking

1.

File based



One chunk is one file, most appropriate for file systems

2.

Block based



Data object is chunked into blocks of fixed or variable size



Used by block storage devices

3.

Format aware (Content aware)



Understands explicit data formats and chunks data objects according to the format



Example: Breaking a PowerPoint presentation into separate slides

4.

Format agnostic (Content agnostic)



Chunking is based on an algorithm that looks for logical breaks or similar elements

within a data object/stream
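To illustrate the format agnostic idea, the sketch below cuts a stream wherever a rolling value computed over the last few bytes hits a chosen pattern, so chunk boundaries follow the content rather than fixed offsets. Everything here is an assumption made for illustration: the function name, the 16-byte window, the divisor of 64, and the plain byte sum used as the rolling value (production systems typically use stronger rolling hashes and enforce minimum and maximum chunk sizes).

```python
from typing import List


def chunk_content_defined(data: bytes, window: int = 16, divisor: int = 64) -> List[bytes]:
    """Cut after byte i when the sum of the last `window` bytes is divisible by `divisor`."""
    chunks: List[bytes] = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]      # drop the byte that left the window
        if i >= window - 1 and rolling % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])          # trailing bytes after the last boundary
    return chunks


# Deterministic sample data; the formula has no meaning beyond producing varied bytes.
data = bytes((i * 37 + (i // 7) * 11) % 251 for i in range(2000))
original = chunk_content_defined(data)
shifted = chunk_content_defined(b"X" + data)   # one byte inserted at the front

# Every chunk after the first one reappears unchanged in the shifted stream.
print(set(original[1:]) <= set(shifted))
```

Because a boundary depends only on the bytes inside the window, inserting a byte at the front changes at most the first chunk. With fixed-size blocks the same insertion would shift every block and defeat duplicate detection.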

(8)

Methods for Determining Duplicates

1.

Hashing



Calculate a hash (e.g. MD5, SHA-256) for each data chunk

Compare the hash with the hashes of existing data

An identical hash means most likely identical data

Hash collision: identical hash but non-identical data

Must be detected through a secondary comparison (additional metadata, second hash method, additional binary comparison)

2.

Binary Comparison



Compare all bits of similar chunks

3.

Delta Differencing



Computes a “delta” between two “similar” chunks of data, where one chunk is the baseline and the second is stored as a delta against it



Since each delta is unique there is no possibility of collision



To reconstruct the original chunk the delta(s) have to be re-applied to the baseline

chunk
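Delta differencing can be sketched with Python's standard `difflib` module. This is an illustration under assumptions, not a description of any product: the function names and the delta encoding ("copy" ranges from the baseline plus literal "data" runs) are invented for the example.

```python
from difflib import SequenceMatcher
from typing import List, Tuple, Union

# A delta is a list of operations: copy a byte range from the baseline,
# or emit literal bytes that the baseline does not contain.
Op = Union[Tuple[str, int, int], Tuple[str, bytes]]


def make_delta(baseline: bytes, target: bytes) -> List[Op]:
    """Describe `target` in terms of `baseline`."""
    delta: List[Op] = []
    matcher = SequenceMatcher(None, baseline, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("data", target[j1:j2]))
        # "delete": baseline bytes that are simply not copied
    return delta


def apply_delta(baseline: bytes, delta: List[Op]) -> bytes:
    """Reconstruct the target by re-applying the delta to the baseline."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += baseline[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


baseline = b"The quick brown fox jumps over the lazy dog"
target = b"The quick red fox jumps over the lazy dogs"
delta = make_delta(baseline, target)
print(apply_delta(baseline, delta) == target)  # True
```

Only the literal "data" runs need new storage; the rest of the target is expressed as references into the baseline, which is why the baseline must stay available for restore.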


(9)

Data Deduplication Architectures

[Figure: Client, Server and Storage Device, linked by connections labelled "LAN" and "LAN or SAN"]

Client-side

+ Reduces load on Server

+ Reduces bandwidth on LAN

- Adds load to Client

- No cross-correlation among multiple clients

Server-side

+ Allows cross-correlation among multiple Clients

- Adds load to Server

Storage-side

+ Transparent to Clients and Servers

+ Reduces load on Server and Clients

(10)

Data Deduplication Processing Time



In-line: Data is deduplicated before it is actually stored

+ Requires less storage capacity

- Potential decrease of I/O performance

Post-processing: Data is first stored and deduplicated later in the background

+ Better performance expected

- Requires more storage capacity to temporarily store the data

- Data is written, read and written again, and is thus more I/O intensive

- Deduplication window must be coordinated with backup window

Combination of in-line and post-processing
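The difference in I/O and staging capacity between the two modes can be made concrete with a little arithmetic. The function below is a rough model under stated assumptions: the background pass reads every staged byte once and rewrites only the unique data, and index and metadata traffic is ignored. The function name, the dictionary keys and the sample figures are illustrative only.

```python
def io_volume(nominal_tb: float, unique_fraction: float) -> dict:
    """Compare disk traffic for in-line and post-processing deduplication."""
    unique_tb = nominal_tb * unique_fraction
    return {
        "in_line": {
            "written_tb": unique_tb,                # only unique data reaches disk
            "read_tb": 0.0,
            "staging_tb": 0.0,
        },
        "post_processing": {
            "written_tb": nominal_tb + unique_tb,   # staged first, then rewritten
            "read_tb": nominal_tb,                  # background pass reads the staged data
            "staging_tb": nominal_tb,               # temporary capacity for the raw backup
        },
    }


result = io_volume(nominal_tb=10.0, unique_fraction=0.1)
for mode, figures in result.items():
    print(mode, figures)
```

For a 10 TB backup of which 10% is new, this model writes about 1 TB in-line but writes about 11 TB and reads 10 TB with post-processing, which is the "written, read and written again" effect noted on the slide.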

(11)
(12)

Practical Approaches Overview



Practical approaches combine

Chunking Method

Method for Determining/Checking Identity



Common Practical Approaches

[Figure: approaches positioned by chunking method (Format Agnostic, Format Aware, Fixed/Variable Block Size) and identity check (Binary Diff, Delta Diff, Hashing); the approaches shown are Content Aware, HyperFactor and Hash based]

(13)

Hash Based Approach

1.

Slice data into chunks (fixed or variable)

2.

Generate Hash per chunk

3.

Compare hashes with hash table

4.

For identical hashes store reference, otherwise store chunk and

update hash table
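The four steps can be sketched as a small in-memory store. This is a minimal illustration under assumptions: the class name `HashDedupStore`, the use of SHA-256, and the list-based "storage locations" are choices made for the example. Each hash table entry holds a bucket of locations, and a byte comparison inside the bucket guards against hash collisions.

```python
import hashlib
from typing import Callable, Dict, List


def sha256_digest(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()


class HashDedupStore:
    def __init__(self, chunk_size: int = 8192,
                 hash_fn: Callable[[bytes], bytes] = sha256_digest) -> None:
        self.chunk_size = chunk_size
        self.hash_fn = hash_fn
        self.chunks: List[bytes] = []                  # storage locations
        self.hash_table: Dict[bytes, List[int]] = {}   # hash value -> locations

    def write(self, data: bytes) -> List[int]:
        """Store an object and return its list of chunk references."""
        refs: List[int] = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]            # 1. slice into chunks
            digest = self.hash_fn(chunk)                       # 2. hash per chunk
            bucket = self.hash_table.setdefault(digest, [])    # 3. look up the hash table
            for location in bucket:
                if self.chunks[location] == chunk:             # secondary binary comparison
                    refs.append(location)                      # 4. identical: store reference
                    break
            else:
                self.chunks.append(chunk)                      # 4. otherwise store the chunk
                bucket.append(len(self.chunks) - 1)            #    and update the hash table
                refs.append(len(self.chunks) - 1)
        return refs

    def read(self, refs: List[int]) -> bytes:
        return b"".join(self.chunks[r] for r in refs)


store = HashDedupStore(chunk_size=4)
refs = store.write(b"AAAABBBBAAAACCCC")
print(refs)               # [0, 1, 0, 2]
print(len(store.chunks))  # 3
```

Passing a deliberately weak `hash_fn` (for example one that returns a constant) forces every chunk into the same bucket; the store still returns correct data because the byte comparison decides identity, not the hash alone.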

[Figure: an object split into chunks A to E, each with its hash h; a hash table maps each hash value to a storage location, and the object is kept as a list of references]

(14)

Assessment for Hash Based Approach



Hash-Collisions must be handled

More overhead, especially for in-line deduplication



Requires a hash table to store hashes for all chunks

Hash table will grow with data volume



Hash Table must be quickly searchable and accessible

Growing hash table may become a performance bottleneck (doesn’t fit into RAM)

Scalability issues



Hash table must be protected

One copy might not be sufficient

Example:

Chunk size of 8 KB, each hash is 20 bytes long …

With a 1 TB repository:

A 1 TByte repository has ~134,000,000 chunks of 8 KB each

A pointer scheme is needed to reference locations inside the 1 TByte

Hash table requires ~2.5 GB of memory, which is no issue

With a 100 TB repository:
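The sizing arithmetic can be reproduced with a short script. It is a back-of-the-envelope sketch using only the slide's stated assumptions (8 KB chunks, 20-byte hashes); binary units are assumed because they reproduce the slide's 1 TB figures, and the extra space for pointers is left out. The function name is illustrative.

```python
def hash_table_size(repo_bytes: int, chunk_bytes: int = 8 * 1024, hash_bytes: int = 20):
    """Return (number of chunks, bytes needed for the hashes alone)."""
    chunks = repo_bytes // chunk_bytes
    return chunks, chunks * hash_bytes


TIB = 1024 ** 4
GIB = 1024 ** 3

for label, size in (("1 TB", TIB), ("100 TB", 100 * TIB)):
    chunks, table = hash_table_size(size)
    print(f"{label}: {chunks:,} chunks, hashes need ~{table / GIB:.1f} GiB")

# 1 TB: 134,217,728 chunks, hashes need ~2.5 GiB
# 100 TB: 13,421,772,800 chunks, hashes need ~250.0 GiB
```

Under this model the table grows linearly with the repository, so at 100 TB the hashes alone are two orders of magnitude larger than at 1 TB.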

  

(15)

HyperFactor Approach



HyperFactor has two indexes

HyperFactor Index

Restore Index



HyperFactor Index used for backup

Used to filter out similar elements from the incoming data stream

Fixed size of 4 GB, memory resident, synced to disk (repository) periodically

Can be restored from repository if lost

References up to 1 PB of physical data elements stored in the repository



Restore Index used for restore

Includes references to physical data elements

Dynamic index, growing

(16)

HyperFactor Approach

1. Look through the data stream for similarity and filter similar elements

Using the HyperFactor Index (fixed size 4 GB)

2. Read the elements that are most similar from storage

Using the Restore Index

3. Binary compare the element in the stream with the element(s) read from storage

4. Identical data is referenced by a new additional entry in the Restore Index; unique data is stored in the repository
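The general shape of a "similarity lookup, then binary compare" flow can be shown with a toy example. This does not reproduce HyperFactor, whose internals the slides do not describe: the signature scheme (the few smallest hashes of overlapping byte windows), the class and method names, and the unbounded dictionary index are all assumptions made for illustration, and this toy references only fully identical elements.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def signature(element: bytes, window: int = 8, k: int = 4) -> Set[int]:
    """Toy similarity signature: the k smallest hashes over all byte windows."""
    hashes = {
        int.from_bytes(hashlib.sha256(element[i:i + window]).digest()[:8], "big")
        for i in range(max(1, len(element) - window + 1))
    }
    return set(sorted(hashes)[:k])


class SimilarityStore:
    def __init__(self) -> None:
        self.repository: List[bytes] = []            # physical data elements
        self.similarity_index: Dict[int, int] = {}   # signature value -> element id
        self.references: List[int] = []              # one entry per ingested element

    def ingest(self, element: bytes) -> Tuple[str, int]:
        sig = signature(element)
        # Step 1: use the small index to find similar stored elements.
        votes: Dict[int, int] = {}
        for value in sig:
            if value in self.similarity_index:
                candidate = self.similarity_index[value]
                votes[candidate] = votes.get(candidate, 0) + 1
        # Steps 2 and 3: read the most similar candidates and compare bytes.
        for candidate in sorted(votes, key=votes.get, reverse=True):
            if self.repository[candidate] == element:
                self.references.append(candidate)    # Step 4: reference only
                return "reference", candidate
        # Step 4: unique data is stored in the repository.
        self.repository.append(element)
        new_id = len(self.repository) - 1
        for value in sig:
            self.similarity_index.setdefault(value, new_id)
        self.references.append(new_id)
        return "stored", new_id


store = SimilarityStore()
print(store.ingest(b"alpha" * 100))   # ('stored', 0)
print(store.ingest(b"bravo" * 100))   # ('stored', 1)
print(store.ingest(b"alpha" * 100))   # ('reference', 0)
```

The point of the design is that the index holds only a handful of values per element rather than one hash per chunk, while the final decision is always made by comparing actual bytes.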

New Data Stream

(17)

Assessment for HyperFactor



No Hash Table required

No scalability issues

4 GB Index references up to 1 PB of physical data elements



No dependency on data format or application

Very flexible, no ongoing development effort due to format changes



HyperFactor index always fits into memory

Enables enterprise-class high-performance in-line deduplication



Eliminates the phenomenon of missed factoring opportunities

(18)

Data Deduplication Considerations and Value Proposition

(19)

Not all Data Dedupes Well



High Dedupe Ratio expected for ...

Structured Data

Database Files

E-mails



Low Dedupe Ratio expected for ...

Unstructured Data

Images

Videos

Voice Data

Seismic Data

Large collections of small files

(20)

Technologies influencing Data Deduplication



Compression

Archives

*.zip (Phil Katz zip: pkzip, pkunzip)

*.gz (GNU zip: gzip, gzip -d)



Compaction

Lotus Notes Database



Multiplexing

Multiple backup streams to a single tape drive

Veritas Backup Exec

Computer Associates ARCserve

Oracle RMAN multiplexing of backup sets



Encryption

(21)

Example: Data Deduplication and Encryption

[Figure: data sources 1, 2 and 3 each hold the same "Important text". One copy is left unencrypted ("No encryption"); the other two are encrypted with "Encryption key 1" and "Encryption key 2" and become "txpt tnatroemI" and "te tarpIxtntom". Data Deduplication then writes to the Data Store, which ends up holding all three versions. Step captions, truncated in the source: "1. Three data", "2. After encryption,", "3. Deduplication", "4. Text files are". Annotation: "Compression possible"]

Data encryption prior to de-duplication processing can subvert data reduction
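The effect shown in the figure can be reproduced with a toy experiment. The cipher below is an assumption made purely for illustration (an XOR with a keystream derived from SHA-256) and must not be used for real encryption; the function names are also made up. Duplicates are counted by hashing each whole copy.

```python
import hashlib


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR the data with a SHA-256-derived keystream. Illustration only, NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def unique_copies(blobs) -> int:
    """Number of distinct copies a hash-based deduplicator would have to store."""
    return len({hashlib.sha256(blob).digest() for blob in blobs})


text = b"Important text"

# Three sources, no encryption: one stored instance is enough.
print(unique_copies([text, text, text]))  # 1

# One plain copy plus two copies encrypted with different keys:
# the three byte strings differ, so nothing can be deduplicated.
mixed = [text, toy_encrypt(text, b"key 1"), toy_encrypt(text, b"key 2")]
print(unique_copies(mixed))  # 3
```

The same reasoning suggests doing deduplication first and encrypting afterwards where security requirements allow it.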

(22)

Dedupe Value Proposition & potential Drawbacks



Data Deduplication Value Proposition

Disk storage savings

Network Bandwidth savings

Energy savings (Green IT)

Better utilization of existing floor and rack space

Increased scalability



Data Deduplication Potential Drawbacks

Loss of one single data chunk may cause loss of multiple files

Repository or Index required to store meta data

must be protected

requires additional storage capacity

may slow down performance

(23)
(24)

ProtecTIER Architecture Overview



Linux server-based application running on a System x server



Emulates a tape library unit, including drives, cartridges, and robotics



Uses Fibre Channel (FC) attached disk storage system as the backup medium



Has a built-in deduplication engine (HyperFactor)

[Figure: Backup Server, ProtecTIER Server running the ProtecTIER Application, and Disk Storage System, with FC connections; to the backup server the Virtual Tape Library looks like real hardware: “It’s a Tape Library and Drives”]

(25)

[Figure: Backup Servers connect through an FC Switch to the ProtecTIER Server and its Disk Arrays (data storage). HyperFactor checks the New Data Stream against the Memory Resident Index (4 GB, may contain predefined elements) describing the Existing Data, then reads similar elements from storage and compares them]

(26)

Dedupe Ratio depends on ...



Data Change Rate

The percentage of data in the incoming backup data stream that is new to ProtecTIER and not already stored physically in the repository

Backup Policies

Number of full backups

Number of incremental backups

Backup frequency
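How change rate and the number of retained backups drive the ratio can be seen in a simplified model. The assumptions are mine, not the slide's: only full backups of equal size are considered, the first one is stored completely, and each later one adds just its changed fraction. The function name is illustrative.

```python
def dedupe_ratio(full_backups: int, change_rate: float) -> float:
    """Nominal (backed-up) data divided by physical (stored) data."""
    if full_backups < 1:
        raise ValueError("at least one backup is required")
    nominal = float(full_backups)                      # every full counts once
    physical = 1.0 + (full_backups - 1) * change_rate  # first full plus the changes
    return nominal / physical


print(f"{dedupe_ratio(10, 0.10):.1f}:1")  # 5.3:1
print(f"{dedupe_ratio(30, 0.02):.1f}:1")  # 19.0:1
```

In this model the ratio rises with more retained fulls and falls with a higher change rate, which matches the two factors listed on the slide.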

(27)

ProtecTIER Native Replication – Key new feature R2.3

[Figure: at the Primary Site, Backup Servers write to a ProtecTIER Gateway (represented capacity vs. physical capacity); ProtecTIER IP replication carries the data to a ProtecTIER Gateway at the Secondary Site, with significant bandwidth reduction]

(28)
(29)

TS7650 Appliance Series

[Figure: rack layouts of the four appliance configurations, each in an F05 base frame with X3850 M2 server(s), a DS4700 disk controller, EXP810 expansion drawers and optional TSSC; the clustered configuration adds a second server, two Ethernet switches (1U) and a WTI switch. Two "500MB/sec" labels also appear; their position in the figure is not recoverable]

Configuration   Server                               Disk                                  Capacity   Throughput
Standalone      X3850 M2, 3 x 6core, 24GB RAM        4700, 32 spindle 450GB (2 drawer)     7TB        100MB/sec
Standalone      X3850 M2, 3 x 6core, 24GB RAM        4700, 64 spindle 450GB (4 drawer)     18TB       250MB/sec
Standalone      X3850 M2, 3 x 6core, 24GB RAM        4700, 128 spindle 450GB (8 drawer)    36TB       450MB/sec
Clustered       2 x X3850 M2, 3 x 6core, 24GB RAM    4700, 128 spindle 450GB (8 drawer)    36TB       450MB/sec

Appliances can be upgraded one step forward ...

(30)
(31)

A Look into the Future



Some observations from the VTL and Dedupe Market

Vendors converge to a common point

Scalable appliances with multiple I/O interfaces (FCP, iSCSI, CIFS, NFS, Library Emulation)

Replication is increasingly becoming a commodity

Replication benefits from deduped data

Intelligent storage devices will be tightly integrated with 3rd party backup applications

(32)
(33)

Links I



TS7650G ProtecTIER Deduplication Gateway

http://www-03.ibm.com/systems/storage/tape/ts7650g/index.html



TS7650 ProtecTIER Deduplication Appliance

http://www-03.ibm.com/systems/storage/tape/ts7650a/index.html



Whitepaper: IBM Data Deduplication Strategy and Operations

http://www.ibm.com/developerworks/wikis/display/tivolistoragemanager/IBM+Tivoli+Storage+Manager+V6.1+Data+Deduplication+Strategy+and+Operations

Redbook: The IBM System Storage TS7650G and TS7650 ProtecTIER Servers

(34)

Links II



TS7650G ProtecTIER Implementation Workshops

IBMer:

https://w3-01.sso.ibm.com/learning/lms/Saba/Web/Main/goto/learningActivity?courseNum=SS92E1DE&deepLinkRedirect=false

Business Partner:

http://www-304.ibm.com/jct03001c/services/learning/ites.wss/de/de?pageType=course_description&includeNotScheduled=y&courseCode=SS92E1DE

(35)

IBM Dynamic Infra-structure Leadership Center for Information Infrastructure

Business, Channel & Skill Enablement & Training

 DI Education & Briefings

 Demos & Showcases

 IT Transformation Road-maps & Workshops

 BP Certification

IBM European Storage Competence Center & Systems Lab Europe

IBM Executive Briefing Center & TMCC

 Business, Channel & Skill Enablement & Training

 Customer and Group Briefings

 Product & SW Demos

 Integrated Solution Demos

IBM STG Europe Storage Software Development

Software Development

 Storage & Tape

 Linux

 Mainframe

 File Systems

Storage Competence at the Mainz Location

IBM Germany‘s fourth

largest location offers

you a broad portfolio of

IBM System Storage

Services


Business, Channel & Skill Enablement & Training

End-to-end client support

Workshops

Solution Design

Lab Services

Customer Relationship Management

(36)

Our Services

Client Briefings & Education

Systems Lab Services & Training

Customized Workshops

System Storage Demos

Advanced Technical Support

Solution Design

Proof of Concepts

Benchmarks

Product Field Engineering

Our Expertise

Skilled technical storage experts covering the whole IBM System Storage Portfolio

Information Infrastructure: Compliance, Availability, Retention, Security

HW / SW & Performance

IBM System Storage Solutions Center of Excellence

We offer technical

support from the

planning phase through

well after installation


Our Systems Lab Europe

1500 sqm lab space

(37)

Thank You (English)

Merci (French)

Danke (German)

Grazie (Italian)

Gracias (Spanish)

Obrigado (Brazilian Portuguese)

Tak (Danish)

(38)

Disclaimer I

 Copyright©2009 by International Business Machines Corporation.

 No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation.

 The performance data contained herein were obtained in a controlled, isolated environment. Results obtained in other operating environments may vary significantly. While IBM has reviewed each item for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. These values do not constitute a guarantee of performance. The use of this information or the implementation of any of the techniques discussed herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into their operating environment. Customers attempting to adapt these techniques to their own environments do so at their own risk.

 Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. This information could include technical inaccuracies or

typographical errors. IBM may make improvements and/or changes in the product(s) and/or program(s) at any time without notice. Any statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

 References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Any reference to an IBM Program Product in this document is not intended to state or imply that only that program product may be used. Any functionally equivalent program, that does not

infringe IBM's intellectual property rights, may be used instead. It is the user's responsibility to evaluate and verify the operation of any non-IBM product, program or service.

(39)

Disclaimer II

 THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NONINFRINGEMENT.  IBM shall have no responsibility to update this information. IBM products are warranted according to

the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided. IBM is not responsible for the performance or interoperability of any non-IBM products discussed herein.

 Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

 The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents or copyrights. Inquiries regarding patent or copyright licenses should be made, in writing, to:

IBM Director of Licensing IBM Corporation

North Castle Drive

Armonk, NY 10504-1785 U.S.A.

(40)

Trademarks

 The following terms are trademarks or registered trademarks of the IBM Corporation in either the United States, other countries or both.

– IBM, TotalStorage, zSeries, pSeries, xSeries, S/390, ES/9000, AS/400, RS/6000 – z/OS, z/VM, VM/ESA, OS/390, AIX, DFSMS/MVS, OS/2, OS/400, ESCON, Tivoli

– iSeries, ES/3090, VSE/ESA, TPF, DFSMSdfp, DFSMSdss, DFSMShsm, DFSMSrmm, FICON, – ProtecTIER, XIV

 Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Other company, product, and service names mentioned may be trademarks or registered trademarks of their respective companies.
