Data Deduplication: Reducing Data to the Minimum

By: Walter Graf, Principal Consultant for Data Protection Solutions, Fujitsu Siemens Computers

There’s a new technology in the storage world that everyone is talking about: data deduplication. Many suppliers are currently offering a wide range of deduplication solutions designed to help IT operators drastically reduce the volume of data they are backing up.

In this white paper, we examine the environments and conditions in which data deduplication can already be used today, to what extent archives can actually be reduced, and how this works in practice. The paper is intended to help storage managers identify both the potential and the problems of implementing dedupe technology, and to support the decision on whether or not to implement data deduplication.

One of the major challenges facing IT managers today is how to deal with massive volumes of data. Over the last 15 years, we have come to accept that each innovation in storage media is quickly overtaken by the growing need for storage capacity. Up to now, supply and demand have more or less remained in balance.

However, according to a white paper published by IDC at the end of 2007 (The Expanding Universe: A Forecast of Worldwide Information Growth Through 2010), the quantity of data generated is in danger of exceeding available capacity for the first time in the history of the IT industry. The initial impact is minimal, since it is not yet necessary to store all newly generated data long term, but the trend has significant longer-term implications.

Nevertheless, it stands to reason that, in the future, companies may no longer be in a position to store all the data they want and need for the long term.

Even if we choose to disregard this potential scenario, constantly growing data volumes do pose a problem. IT managers, for example, must allocate a significant proportion of their budgets to data storage and management alone, instead of investing in strategically important innovations. After all, a data backup concept covers not only one-time storage (online system backups) but also as complete a backup as possible of all data generated or updated in the last months or years. No wonder, then, that the search is on for new technologies that can meet these requirements.

What exactly is data deduplication?

The basic idea is quite simple: a given set of data is analyzed and any duplicated information is saved only once, rather than stored as multiple copies. In combination with software that can reconstruct the original data sets from deduped backups, this creates archives that are much smaller, yet still contain the complete information. A good analogy is to think of how a dedupe system would back up a house built from Lego bricks. Instead of making a copy of each individual brick, it would identify the types of brick, add only one brick of each type to the archive, and store a construction plan of the building alongside them. Today, software manufacturers are already using this strategy at the file level in e-mail environments, where identical data is stored only once. The data deduplication approach takes this concept one step further by identifying identical sequences within the files. While an incremental backup performs a full save of every file that has been altered, however slightly, a deduplication system saves only the data that has changed within the file.
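The Lego analogy can be made concrete with a minimal sketch. It assumes fixed-size blocks and SHA-256 fingerprints, neither of which is prescribed by this paper; actual products may use much larger or variable-size blocks. The names `dedupe`, `restore`, `store` and `BLOCK_SIZE` are illustrative only.

```python
import hashlib

BLOCK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


def dedupe(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks, keep each unique block once,
    and return the "construction plan": the ordered list of fingerprints."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # only the first copy is stored
        recipe.append(fingerprint)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data from the plan and the stored blocks."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


store = {}
original = b"AAAABBBBAAAACCCCAAAA"   # five blocks, three distinct "brick types"
recipe = dedupe(original, store)
print(len(recipe), "blocks referenced,", len(store), "blocks stored")
```

The archive holds the three distinct blocks plus the plan, and `restore(recipe, store)` returns the original bytes unchanged.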

Four factors that determine the deduplication reduction factor

1. Redundant instances in the data to be stored. For data generated by conventional office applications, for example, where many of the file characteristics are very similar, the potential for reduction is much higher than in a typical database. A number of similar systems (for example PCs or server farms with virtual machines) also offer high potential.

This is one of the major reasons why the backup solutions that employ deduplication today focus on the “ROBO” (Remote Office/Branch Office) market.

Compared to the original data set, very good reduction values (approximately 30 to 40 percent) can be reached for office applications.

2. The data reduction factor provided by conventional compression. A further compression process can reduce already-reduced data by another factor of two to three.

Since this method is also available in the conventional backup environment, it should not really be counted when comparing the two approaches. In other words, if a supplier talks about a reduction factor of 20, this translates into an effective saving of a factor of 10 compared to a conventional backup strategy (assuming a compression factor of 2).

3. The backup strategy. An organization’s current backup strategy has a major influence on the attainable reduction. The frequency of full backups and the retention time in particular play a dominant role. Assuming the simplest case, in which files are never changed once backed up and the volume of primary data does not shrink appreciably, it would take 20 retained full backups to attain a reduction factor of 20. At a rate of one full backup a week, it will take more than four months to reach that factor.

4. The change rate of the data to be stored. If data is altered significantly over the course of time, then the quantity of data that must be stored also increases. In contrast to conventional incremental backups, which store entire files even when only a part of the file has been changed, data deduplication offers the option of saving only the blocks of data in the file that have changed. For example, when approximately five percent of files are altered daily, perhaps only two percent of the data at the block level is actually affected. Good deduplication algorithms exploit this potential. Poor algorithms, on the other hand, are hardly better than an incremental backup. The change rate also determines the maximum deduplication factor that can be reached: at a daily change rate of five percent, it is not possible to exceed a factor of 20, and it will usually take many months to approach that limit.
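The interplay of factors 3 and 4 can be illustrated with a simplified model. It assumes, hypothetically, that every full backup has the same logical size, that the first backup is stored in full, and that each later backup adds only the fraction of blocks that changed since the previous one. Compression (factor 2) and redundancy within a single backup (factor 1) are deliberately ignored, and the function name is illustrative.

```python
def reduction_factor(full_backups: int, change_rate: float) -> float:
    """Ratio of logical to physically stored data after `full_backups` backups.

    change_rate is the fraction of blocks that differ between two
    consecutive backups (a modelling assumption, e.g. 0.05 for five percent).
    """
    if full_backups < 1:
        raise ValueError("at least one full backup is required")
    logical = full_backups                            # each backup counts as 1 unit
    physical = 1 + (full_backups - 1) * change_rate   # first copy + changed blocks
    return logical / physical


# Simplest case from the text: nothing ever changes, 20 retained full backups.
print(reduction_factor(20, 0.0))  # 20.0

# With a change rate of five percent the factor approaches 1 / 0.05 = 20
# but never reaches it, however many backups are retained.
for backups in (20, 100, 1000):
    print(backups, round(reduction_factor(backups, 0.05), 2))
```

In this model the ceiling is set purely by the change rate between backups, which is why long retention alone cannot push the factor beyond that limit.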

Deduplication technology can make backup applications and Virtual Tape Libraries (VTL) considerably more efficient. Both scenarios have the advantage that they optimize the entire backup archive and not just the original data on the hard disk.

Solving the problem at source with deduplication in backup software

In addition to reducing the volume of backups, integrating deduplication technology into backup software offers other benefits. Many backup clients today can only be reached via slow or frequently overloaded network connections. Data reduction at the source makes better use of these connections, making it possible to integrate a previously decentralized backup solution (for example, in branch offices) into a centralized backup concept and to operate it at a lower overall cost.

For the IT operator, the way in which backups are performed also changes: from a logical point of view, only full backups are made, although in fact only the blocks of data that are not yet in the backup archive are actually saved. This really pays off when data needs to be restored, because it is no longer necessary to piece together a backup state from a series of full and incremental backups.
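How deduplication at the source saves bandwidth can be sketched as follows. This is not the protocol of any particular backup product: the `BackupServer` class, the `backup` function, the fixed block size and the SHA-256 fingerprints are all assumptions made for illustration. The client asks the archive which blocks it lacks and transfers only those, while every backup remains a logically full backup that restores in one step.

```python
import hashlib


class BackupServer:
    """Hypothetical central archive: unique blocks plus one recipe per backup."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> block data
        self.recipes = {}   # backup name -> ordered list of fingerprints

    def unknown(self, fingerprints):
        """Return the fingerprints the archive does not hold yet."""
        return {fp for fp in fingerprints if fp not in self.blocks}

    def store(self, name, recipe, new_blocks):
        self.blocks.update(new_blocks)
        self.recipes[name] = recipe

    def restore(self, name):
        """Rebuild a complete backup directly from its recipe."""
        return b"".join(self.blocks[fp] for fp in self.recipes[name])


def backup(server, name, data, block_size=4):
    """Back up `data`; return the number of payload bytes actually transferred."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    recipe = [hashlib.sha256(block).hexdigest() for block in blocks]
    wanted = server.unknown(recipe)
    new_blocks = {fp: block for fp, block in zip(recipe, blocks) if fp in wanted}
    server.store(name, recipe, new_blocks)
    return sum(len(block) for block in new_blocks.values())


server = BackupServer()
sent_monday = backup(server, "monday", b"AAAABBBBCCCCDDDD")
sent_tuesday = backup(server, "tuesday", b"AAAABBBBXXXXDDDD")  # one block changed
print(sent_monday, sent_tuesday)
```

Only the changed block crosses the network on the second day, yet `server.restore("tuesday")` returns the complete data set without reference to any chain of incrementals.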

The pragmatic alternative: Deduplication in a Virtual Tape Library

Despite the gains that deduplication can offer, many IT operators shy away from the effort involved in converting their backup application and all the associated processes. In addition, no backup software with deduplication technology is currently available that can guarantee high-performance operation across all the typical backup scenarios once the dedupe function is enabled.

This is where a Virtual Tape Library (VTL)-based solution serves as a pragmatic alternative, since the backup application and processes remain unchanged. The deduplication technology is encapsulated in the VTL as a black box, which makes it invisible to IT operations and quick and easy to adopt. Reducing overall data volumes is an important goal, but there are other advantages, too: archives can be replicated more easily over longer distances, and storage systems are better equipped to deal with disaster recovery.

Data deduplication: The hype and the reality

As a technology, deduplication will likely be most commonly deployed in VTLs in the short term. A number of manufacturers already offer interesting backup applications with deduplication technology. These concentrate on integrating so-called ROBO (Remote Office/Branch Office) environments into a centralized backup concept in order to realize the advantages of a consolidated backup landscape. However, the technology still has some obstacles to overcome before its high potential can be exploited in sophisticated data-center operation.

For example, the issue of fragmentation: a deduplicated archive stores each file broken down into its basic components. In a tape-based archive, such a file would very probably have to be read from different sections of a tape, or even from several different backup tapes, which is one reason why secondary (tape) backup is not currently suited to deduplication technology. To restore the reduced data at a reasonable speed, it must be stored on hard disk. This raises a more fundamental question: whether a backup archive of unreduced data stored on inexpensive tape is more economical than a reduced backup archive on more expensive hard disk.

The answer depends on the reduction factor that can be expected from dedupe. In reality, the saving is about the same as the cost advantage of tape over disk, while deduplication adds a new layer of complexity to backup strategies. Without intimate knowledge of the data and how it changes over time, it is not possible to estimate an exact reduction factor. If the most important variables are known, however, it is possible to get a good impression of the data reduction factor (see textbox: four factors that determine the reduction factor).

Many manufacturers of deduplication solutions are currently trying to outdo one another by claiming higher reduction factors. These numbers, however, are often not comparable and it’s always a good idea to read the “fine print”. For a fair comparison of the technologies, actual archive size for a solution with and without deduplication should be taken into account.

It’s safe to say that reduction factors of more than 100 are not based on this type of comparison. The dedupe hype also often overlooks the fact that tried-and-tested file compression algorithms already achieve a data reduction of up to a factor of three. As a result, in a direct comparison, a reduction factor of 20 can quickly drop to less than 10, and even this value is often not reached without accepting a correspondingly longer backup window.
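The arithmetic behind this comparison is simple enough to write down. The sketch assumes that a supplier's quoted factor already includes conventional compression, so the gain over a compressed conventional backup is the quotient of the two factors. The function name is illustrative.

```python
def effective_factor(claimed_factor: float, compression_factor: float) -> float:
    """Reduction relative to a conventional backup that already uses compression."""
    if compression_factor <= 0:
        raise ValueError("compression factor must be positive")
    return claimed_factor / compression_factor


print(effective_factor(20, 2))  # 10.0
print(effective_factor(20, 3))  # below 10, as noted in the text
```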

Another factor that should be taken into account for the practical application of deduplication technology is the speed at which data is deduplicated. Dedupe is of little use to an IT operator whose backup strategy depends on transfer speeds in the GByte/s range if data can only be deduplicated in the 100 MByte/s range.
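The effect on the backup window can be estimated with a short calculation. The 10 TB nightly volume is a hypothetical figure chosen for illustration, and decimal units are assumed throughout.

```python
def backup_window_hours(data_tb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to process data_tb terabytes (1 TB = 1,000,000 MB)."""
    seconds = data_tb * 1_000_000 / throughput_mb_per_s
    return seconds / 3600


nightly_volume_tb = 10                                  # hypothetical workload
fast = backup_window_hours(nightly_volume_tb, 1000)     # 1 GByte/s
slow = backup_window_hours(nightly_volume_tb, 100)      # 100 MByte/s
print(f"{fast:.1f} h at 1 GByte/s, {slow:.1f} h at 100 MByte/s")
```

A backup that fits comfortably into a night at the higher rate takes more than a day at the lower one.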

To achieve acceptable throughput with deduplication at all, the references to the blocks already archived must be held in the deduplication server’s main memory. If this reference information no longer fits into main memory, throughput will drop dramatically. Consequently, this also has a direct effect on the maximum addressable archive size on disk. With the main memory capacities available today, many algorithms can only support an archive size of a few TBs.
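A back-of-the-envelope calculation shows why memory limits the archive size. The figures used here (16 GiB of memory for the index, 8 KiB blocks, 32 bytes per index entry) are assumptions chosen for illustration and do not describe any specific product. The model also assumes one in-memory entry per stored block.

```python
def max_archive_bytes(index_memory_bytes: int,
                      block_size_bytes: int,
                      entry_size_bytes: int) -> int:
    """Largest deduplicated archive whose block index still fits in memory,
    assuming one index entry per stored block."""
    entries = index_memory_bytes // entry_size_bytes
    return entries * block_size_bytes


GIB = 2**30
TIB = 2**40

limit = max_archive_bytes(16 * GIB, 8 * 1024, 32)  # all three values are assumed
print(limit / TIB)  # 4.0
```

Under these assumptions the index supports an archive of about 4 TiB. Larger blocks raise the limit but tend to find fewer duplicates.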

The IT operator therefore has two important criteria to consider when determining whether or not a deduplication solution will deliver any advantages: maximum attainable throughput and the maximum supported archive size.

The matter of data integrity is also a particularly sensitive aspect in this evaluation process. After all, the loss or corruption of a data block has a much greater impact here than in a normal archive environment. First, a lost or corrupted block is likely to belong to more than one file. Second, in contrast to traditional backup solutions, deduplication does not allow the user to fall back on an older copy of the same file. Therefore, if very high security standards apply to the stored information, a data-reduced backup archive could cause problems.
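The wider impact of a single damaged block can be seen in a small sketch. The file names and block fingerprints (`h1` to `h5`) are invented for illustration; each file is represented by the list of blocks it references.

```python
def affected_files(recipes: dict, damaged_block: str) -> list:
    """Names of all files whose recipe references the damaged block."""
    return sorted(name for name, blocks in recipes.items()
                  if damaged_block in blocks)


# Two versions of a report share most of their blocks.
recipes = {
    "report_v1.doc": ["h1", "h2", "h3"],
    "report_v2.doc": ["h1", "h2", "h4"],
    "budget.xls": ["h5"],
}
print(affected_files(recipes, "h2"))
```

Because both versions of the report reference block `h2`, the older version offers no fallback when that block is lost.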

In spite of all these potential drawbacks, data deduplication still harbors enormous potential, whether implemented in a ROBO backup application or in a VTL. In many cases, it is possible to achieve significant size reductions for the data archives, in combination with simplified administration processes. Data islands that were previously only weakly connected to a network can be integrated via a central backup concept, with benefits such as improved quality of data management, greater process efficiency and lower operating costs. Leaner data volumes also make data replication easier over longer distances and IT operations more disaster-tolerant.

Finding a data deduplication solution

An IT operator must first find out whether deduplication is a viable option in a backup application or in a VTL (see corresponding table). In addition, the absolute data reduction factor should be determined as accurately as possible and used as the basis to calculate the resulting profitability. In the same way, it is important to make sure that the IT solution with data deduplication on offer is able to satisfy daily operation requirements in terms of performance and supported capacity. Another question to consider is whether or not the mechanisms provided to protect data integrity are adequate for the data that will be stored.

Last but not least, it is a good idea to perform a comparison of the direct, objectively measurable cost drivers between two solution alternatives – for example tape with compression versus disk with deduplication.
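Such a comparison can start from a simple media-cost model. The prices below are placeholders in arbitrary currency units, not market data, and the model covers media cost only. Drives, libraries, power, floor space and administration would have to be added for a complete picture.

```python
def cost_per_logical_tb(media_cost_per_tb: float, reduction_factor: float) -> float:
    """Media cost of holding 1 TB of original backup data after reduction."""
    return media_cost_per_tb / reduction_factor


def break_even_dedupe_factor(disk_cost_per_tb: float,
                             tape_cost_per_tb: float,
                             tape_compression: float) -> float:
    """Deduplication factor at which disk media cost equals compressed tape."""
    return disk_cost_per_tb / tape_cost_per_tb * tape_compression


DISK, TAPE = 100.0, 20.0   # placeholder prices per raw TB (assumed)

tape_cost = cost_per_logical_tb(TAPE, 2)            # tape with compression factor 2
break_even = break_even_dedupe_factor(DISK, TAPE, 2)
print(tape_cost, break_even)
```

With these placeholder prices, disk with deduplication only undercuts compressed tape on media cost once the total reduction factor exceeds the break-even value.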

In many cases, however, data managers will find it difficult to answer all the questions accurately. This is where solutions that allow a “soft” entry into the world of deduplication come into play. When in doubt, a VTL or a backup application that does not force the IT operator to abandon tape backup forever is the more sensible alternative. With CentricStor, the Virtual Tape Appliance, Fujitsu Siemens Computers is following the approach of integrating real tape fully and automatically into the backup and archiving process. Once deduplication functionality is added, which will be available soon, CentricStor will continue to allow for the intelligent coexistence of disk and tape, so that users can benefit from the best of both worlds.

About Fujitsu Siemens Computers

Fujitsu Siemens Computers is the leading European IT provider with a strategic focus on next-generation Mobility and Dynamic Data Center products, services and solutions. With a portfolio of exceptional depth, our offering extends from handhelds through desktops to enterprise-class IT infrastructure solutions. Fujitsu Siemens Computers has a presence in all key markets across Europe, the Middle East and Africa. Leveraging the strengths, innovation and global reach of our joint shareholders, Fujitsu Limited and Siemens AG, we make sure we meet the needs of customers: large corporations, small and medium enterprises and private users. The company is a member of the United Nations Global Compact initiative.

For more information on Fujitsu Siemens Computers, please visit: http://www.fujitsu-siemens.co.uk