E-Guide. Sponsored By:


An in-depth look at data deduplication methods

This E-Guide will discuss the various approaches to data deduplication. You’ll learn the pros and cons of each, and will benefit from independent expert insight that will help you get the most out of the approach you take with data deduplication technology.


The pros and cons of file-level vs. block-level data deduplication technology

Lauren Whitehouse

Data deduplication has dramatically improved the value proposition of disk-based data protection as well as WAN-based remote- and branch-office backup consolidation and disaster recovery (DR) strategies. It identifies duplicate data, removes redundancies and reduces the overall capacity needed for data that is transferred and stored.

Some deduplication approaches operate at the file level, while others go deeper to examine data at a sub-file, or block, level. Determining uniqueness at either the file or block level will offer benefits, though results will vary. The differences lie in the amount of reduction each produces and the time each approach takes to determine what's unique.

File-level deduplication

Also commonly referred to as single-instance storage (SIS), file-level data deduplication compares a file to be backed up or archived with those already stored by checking its attributes against an index. If the file is unique, it is stored and the index is updated; if not, only a pointer to the existing file is stored. The result is that only one instance of the file is saved and subsequent copies are replaced with a "stub" that points to the original file.
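The store-or-point decision described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `FileStore` class and its method names are hypothetical, and a SHA-256 hash of the file contents stands in for the file "attributes" that a real product would check against its index.

```python
import hashlib


class FileStore:
    """Toy single-instance store: one saved copy per distinct file content."""

    def __init__(self):
        self.index = {}    # content fingerprint -> the single stored instance
        self.catalog = {}  # file name -> fingerprint (the "stub" or pointer)

    def put(self, name: str, data: bytes) -> bool:
        """Record a file; return True only if new content had to be stored."""
        fingerprint = hashlib.sha256(data).hexdigest()
        is_new = fingerprint not in self.index
        if is_new:
            self.index[fingerprint] = data   # unique file: store it, update index
        self.catalog[name] = fingerprint     # duplicate or not, keep a pointer
        return is_new

    def get(self, name: str) -> bytes:
        """Follow the pointer back to the stored instance."""
        return self.index[self.catalog[name]]


store = FileStore()
first = store.put("q1/deck.ppt", b"slides v1")
second = store.put("q2/deck-copy.ppt", b"slides v1")    # identical content
third = store.put("q2/deck-edited.ppt", b"slides v2")   # any change: whole file stored again
print(first, second, third, len(store.index))
```

Three files are recorded but only two instances are stored, because the second file resolves to a pointer.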

Block-level deduplication

Block-level data deduplication operates on the sub-file level. As its name implies, the file is typically broken down into segments -- chunks or blocks -- that are examined for redundancy vs. previously stored information.

The most popular approach for determining duplicates is to assign an identifier to a chunk of data, using a hash algorithm, for example, that generates a unique ID or "fingerprint" for that block. The unique ID is then compared with a central index. If the ID exists, then the data segment has been processed and stored before. Therefore, only a pointer to the previously stored data needs to be saved. If the ID is new, then the block is unique. The unique ID is added to the index and the unique chunk is stored.
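As a rough sketch of that fingerprint-and-index loop, the Python below splits data into fixed-size blocks, hashes each one with SHA-1, and stores a block only when its fingerprint is new. The function names, the in-memory dictionary used as the index, and the tiny 8-byte block size are assumptions chosen for readability, not details from any product.

```python
import hashlib

BLOCK_SIZE = 8  # bytes; unrealistically small so the example is easy to follow


def dedupe_blocks(data: bytes, index: dict) -> list:
    """Store unseen blocks in `index`; return the list of fingerprints (pointers)."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fingerprint = hashlib.sha1(block).hexdigest()
        if fingerprint not in index:      # new ID: the block is unique, store it
            index[fingerprint] = block
        recipe.append(fingerprint)        # seen or not, save only a pointer
    return recipe


def restore(recipe: list, index: dict) -> bytes:
    """Reassemble the original data by following the pointers in order."""
    return b"".join(index[fingerprint] for fingerprint in recipe)


index = {}
data = b"AAAAAAAA" + b"BBBBBBBB" + b"AAAAAAAA" + b"CCCC"
recipe = dedupe_blocks(data, index)
print(len(recipe), len(index))  # prints: 4 3
```

Four blocks are referenced but only three are stored, since the repeated block is replaced by a pointer to its first occurrence.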

The size of the chunk to be examined varies from vendor to vendor. Some have fixed block sizes, while others use variable block sizes (and to make it even more confusing, a few allow end users to vary the size of the fixed block). Fixed blocks could be 8 KB or maybe 64 KB -- the difference is that the smaller the chunk, the more likely it is to be identified as redundant. This, in turn, means even greater reductions as even less data is stored. The main drawback of fixed blocks is that when a file is modified and the deduplication product applies the same fixed block boundaries as in the last inspection, it might not detect redundant segments: data inserted or removed in the file shifts everything downstream of the change, offsetting the rest of the comparisons.
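The shifting problem is easy to reproduce. The sketch below is illustrative only: the helper name, the 64-byte block size and the sample data are all assumptions. It fingerprints fixed blocks of a 1,024-byte buffer, then compares the result with two modified copies, one with a byte appended and one with a byte inserted at the front.

```python
import hashlib


def fixed_fingerprints(data: bytes, block_size: int = 64) -> set:
    """Fingerprint every fixed-size block, cutting at multiples of block_size."""
    return {hashlib.sha1(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}


# 1,024 bytes of non-repeating sample data (256 four-byte counters).
original = b"".join(i.to_bytes(4, "big") for i in range(256))

appended = original + b"X"   # change at the end: earlier blocks keep their boundaries
inserted = b"X" + original   # change at the front: every later byte shifts by one

base = fixed_fingerprints(original)
shared_after_append = len(base & fixed_fingerprints(appended))
shared_after_insert = len(base & fixed_fingerprints(inserted))
print(shared_after_append, shared_after_insert)  # prints: 16 0
```

Appending leaves all 16 original blocks recognizable, while a single inserted byte at the front means none of them is detected as a duplicate.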

Variable-sized blocks help increase the odds that a common segment will be detected even after a file is modified.


This approach finds natural patterns or break points that might occur in a file and then segments the data accordingly. Even if blocks shift when a file is changed, this approach is more likely to find repeated segments. The tradeoff? A variable-length approach may require a vendor to track and compare more than just one unique ID for a segment, which could affect index size and computational time.
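One way to picture "natural break points" is content-defined chunking: a break is declared wherever a checksum of the last few bytes hits a chosen value, so boundaries follow the content rather than fixed offsets. The Python below is a simplified sketch under stated assumptions: the `WINDOW` and `DIVISOR` values are arbitrary, CRC-32 is recomputed at each position for clarity, and no minimum or maximum chunk size is enforced. Production implementations typically use a rolling hash and bound the chunk size.

```python
import random
import zlib

WINDOW = 16    # trailing bytes inspected at each position (assumed value)
DIVISOR = 64   # a break point occurs at roughly 1 in 64 positions (assumed value)


def variable_chunks(data: bytes) -> list:
    """Cut data wherever the checksum of the trailing window hits a chosen value."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        window = data[max(0, end - WINDOW):end]
        if zlib.crc32(window) % DIVISOR == 0:   # content-defined break point
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])             # trailing remainder
    return chunks


rng = random.Random(0)                          # fixed seed: repeatable sample data
original = bytes(rng.getrandbits(8) for _ in range(4096))
modified = b"X" + original                      # one byte inserted at the front

original_chunks = variable_chunks(original)
modified_chunks = variable_chunks(modified)
shared = set(original_chunks) & set(modified_chunks)
print(len(original_chunks), len(shared))
```

Because each break point depends only on the preceding `WINDOW` bytes, every chunk of the original that begins after the first `WINDOW` bytes reappears unchanged in the modified copy. Only the chunks touching the inserted byte are new.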

The differences between file- and block-level deduplication go beyond just how they operate. There are advantages and disadvantages to each approach.

File-level approaches can be less efficient than block-based deduplication:

• A change within the file causes the whole file to be saved again. A file, such as a PowerPoint presentation, can have something as simple as the title page changed to reflect a new presenter or date -- this will cause the entire file to be saved a second time. Block-based deduplication would only save the changed blocks between one version of the file and the next. Reduction ratios for file-level deduplication may only be in the 5:1 or less range, whereas block-based deduplication has been shown to reduce capacity in the 20:1 to 50:1 range for stored data.
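To put those ratios in perspective, a reduction ratio of N:1 means the stored data occupies 1/N of its original size. The short calculation below converts the ratios quoted above into the fraction of capacity saved; the function name is only illustrative.

```python
def capacity_saved(ratio: float) -> float:
    """Fraction of capacity saved for a reduction ratio of `ratio`:1."""
    return 1 - 1 / ratio


for ratio in (5, 20, 50):
    print(f"{ratio}:1 -> {capacity_saved(ratio):.0%} less capacity")
# prints:
# 5:1 -> 80% less capacity
# 20:1 -> 95% less capacity
# 50:1 -> 98% less capacity
```

The gains flatten as the ratio grows: moving from 5:1 to 20:1 saves a further 15% of the original capacity, while moving from 20:1 to 50:1 saves only 3% more.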

File-level approaches can be more efficient than block-based data deduplication:

• Indexes for file-level deduplication are significantly smaller, so determining duplicates takes less computational time. Backup performance is, therefore, less affected by the deduplication process.

• File-level processes require less processing power due to the smaller index and reduced number of comparisons. Therefore, the impact on the systems performing the inspection is less.

• The impact on recovery time is low. Block-based deduplication will require "reassembly" of the chunks based on the master index that maps the unique segments and pointers to unique segments. Since file-based approaches store unique files and pointers to existing unique files, there is less to reassemble.



Data deduplication methods: Block-level versus byte-level dedupe

Lauren Whitehouse

Data deduplication identifies duplicate data, removing redundancies and reducing the overall capacity of data transferred and stored. In my last article, I reviewed the differences between file-level and block-level data deduplication. In this article, I'll assess byte-level versus block-level deduplication. Byte-level deduplication provides a more granular inspection of data than block-level approaches, ensuring more accuracy, but it often requires more knowledge of the backup stream to do its job.

Block-level approaches

Block-level data deduplication segments data streams into blocks, inspecting the blocks to determine if each has been encountered before (typically by generating a digital signature or unique identifier via a hash algorithm for each block). If the block is unique, it is written to disk and its unique identifier is stored in an index; otherwise, only a pointer to the original, unique block is stored. By replacing repeated blocks with much smaller pointers rather than storing the block again, disk storage space is saved.

The criticisms of block-based approaches are that 1) the use of a hash algorithm to calculate the unique ID brings the risk of generating a false positive; and 2) storing unique IDs in an index can slow the inspection process as the index grows larger and requires disk I/O (unless the index size is kept in check and data comparison occurs in memory).

Hash collisions could produce a false positive when a hash-based algorithm is used to determine duplicates. Hash algorithms, such as MD5 and SHA-1, generate a number for the chunk of data being examined that is intended to be unique. If two different chunks produce the same number, the second is wrongly treated as a duplicate and its data is lost. While hash collisions and the resulting data corruption are possible, the chances that a hash collision will occur are slim.
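How slim? A rough upper bound follows from counting pairs of blocks: with n unique blocks and a b-bit hash, the chance that any two blocks share a fingerprint is at most n(n-1)/2 divided by 2^b. The sketch below evaluates that bound. It assumes hash outputs behave like uniformly random values, so it covers accidental collisions only, not deliberately crafted ones. The block count is an assumed example, not a figure from this article.

```python
from fractions import Fraction


def collision_bound(num_blocks: int, hash_bits: int) -> Fraction:
    """Upper bound on P(any two blocks collide): n(n-1)/2 pairs over 2^bits values."""
    pairs = num_blocks * (num_blocks - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# Assumed example: 2**40 unique blocks, i.e. 8 PiB of data in 8 KiB blocks.
blocks = 2 ** 40
md5_bound = collision_bound(blocks, 128)    # MD5 produces a 128-bit value
sha1_bound = collision_bound(blocks, 160)   # SHA-1 produces a 160-bit value
print(f"MD5:   below {float(md5_bound):.1e}")
print(f"SHA-1: below {float(sha1_bound):.1e}")
```

Under these assumptions the bound works out to less than 2^-49 for a 128-bit hash and less than 2^-81 for a 160-bit hash.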

Byte-level data deduplication

Analyzing data streams at the byte level is another approach to deduplication. By performing a byte-by-byte comparison of new data streams versus previously stored ones, a higher level of accuracy can be delivered. Deduplication products that use this method share one assumption: the incoming backup data stream has likely been seen before, so it is reviewed to see if it matches similar data received in the past.
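A byte-level comparison against a previously stored stream can be sketched with Python's standard `difflib` module. This is only a stand-in for illustration: commercial products use their own comparison methods, and the function names and sample streams below are hypothetical. The idea is to keep references to byte ranges that already exist in the earlier stream and store only the bytes that are new.

```python
from difflib import SequenceMatcher


def byte_delta(previous: bytes, incoming: bytes) -> list:
    """Describe `incoming` as copies from `previous` plus any new bytes."""
    delta = []
    matcher = SequenceMatcher(None, previous, incoming, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))             # bytes already stored
        elif j2 > j1:                                  # "replace" or "insert"
            delta.append(("data", incoming[j1:j2]))    # new bytes to store
        # "delete" needs no entry: those bytes are absent from `incoming`
    return delta


def apply_delta(previous: bytes, delta: list) -> bytes:
    """Rebuild the incoming stream from the earlier stream and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += previous[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


previous = b"Quarterly results, presented by Alice, 2008-01-15"
incoming = b"Quarterly results, presented by Bob, 2008-04-15"
delta = byte_delta(previous, incoming)
new_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print(new_bytes, len(incoming))
```

Only the bytes that differ from the earlier stream are stored, and `apply_delta` rebuilds the incoming stream exactly.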

Products leveraging a byte-level approach are typically "content aware," which means the vendor has done some reverse engineering of the backup application's data stream to understand how to retrieve information such as the file name, file type, date/time stamp, etc. This method reduces the amount of computation required to determine unique versus duplicate data. The caveat? This approach typically occurs post-process -- performed on backup data once the backup has completed. Backup jobs, therefore, complete at full disk performance, but require a reserve of disk cache to perform the deduplication process. It's also likely that the deduplication process is limited to a backup stream from a single backup set and not applied "globally" across backup sets.


Once the deduplication process is complete, the solution reclaims disk space by deleting the duplicate data. Before space reclamation is performed, an integrity check can be performed to ensure that the deduplicated data matches the original data objects. The last full backup can also be maintained so recovery is not dependent on reconstituting deduplicated data, enabling rapid recovery.
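The verify-then-reclaim order can be expressed as a small guard. In this hypothetical sketch, `staging` stands for the disk cache holding completed backups, and `rebuild` is any function that reconstitutes a backup from its deduplicated form. Both names are assumptions for illustration.

```python
def reclaim_after_verify(staging: dict, name: str, rebuild) -> bool:
    """Delete the staged backup only if rebuild(name) reproduces it byte for byte."""
    if rebuild(name) != staging[name]:
        return False          # integrity check failed: keep the original data
    del staging[name]         # safe to reclaim the space
    return True


staging = {"monday-full": b"payload-1", "tuesday-full": b"payload-2"}
# Stand-in for the deduplicated store; the second entry is deliberately corrupt.
deduplicated = {"monday-full": b"payload-1", "tuesday-full": b"payload-X"}

ok = reclaim_after_verify(staging, "monday-full", deduplicated.__getitem__)
bad = reclaim_after_verify(staging, "tuesday-full", deduplicated.__getitem__)
print(ok, bad, sorted(staging))  # prints: True False ['tuesday-full']
```

The backup that fails verification stays in the cache, so recovery does not depend on a deduplicated copy that cannot be reconstituted.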

Which approach is best?

Both block- and byte-level methods deliver the benefit of optimizing storage capacity. When, where, and how the processes work should be reviewed for your backup environment and its specific requirements before selecting one approach over another. Your vetting process should also include references from organizations with similar characteristics and requirements.


Resources from FalconStor Software


Book Chapter: SAN For Dummies, Chapter 13: Using Data De-duplication to Lighten the Load

White Paper: Demystifying Data De-Duplication: Choosing the Best Solution

Webcast: Enhancing Disk-to-Disk Backup with Data Deduplication
