HTBackup's deduplication technology is touted as one of the best ways to manage today's explosive data growth. If you're new to the technology, these key facts will help you get up to speed.
Deduplication is used for a variety of purposes
Deduplication is used in any number of different products. Compression utilities such as WinZip perform a form of deduplication, and so do many WAN optimization solutions. HTBackup, HTBase's backup and recovery product, also offers deduplication.
Higher ratios produce diminishing returns
The effectiveness of data deduplication is measured as a ratio. Although higher ratios do convey a higher degree of deduplication, they can be misleading. It is impossible to deduplicate a file in a way that shrinks it by 100%, so higher ratios have diminishing returns. In general, an N:1 ratio leaves 1/N of the original data, which means the savings are 1 - 1/N. To show what this means, consider what happens when you deduplicate 1 TB (1,024 GB) of data. A 20:1 ratio reduces the data from 1 TB to 51.2 GB, a saving of 95%. A 25:1 ratio reduces it to 40.96 GB, a saving of 96%. Going from 20:1 to 25:1 therefore yields only one extra percentage point of savings, about 10 GB more than 20:1 already achieves.
Many deduplication algorithms work by hashing chunks of data and then comparing the hashes to find duplicates. This hashing process is CPU intensive. That usually isn't a big deal if deduplication is offloaded to an appliance or occurs on a backup target, but when source deduplication takes place on a production server, the process can sometimes affect the server's performance.
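As a minimal sketch of the hash-and-compare approach and the ratio arithmetic discussed in this section, the Python below stores each distinct block once and reports the resulting ratio. It is an illustration under stated assumptions, not HTBackup's implementation: the function names, the 4 KB block size, and the choice of SHA-256 are all hypothetical.

```python
import hashlib

def dedupe_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each distinct block."""
    store = {}    # digest -> block bytes (unique blocks only)
    recipe = []   # ordered digests needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()  # the CPU-intensive step
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe

def restore(store, recipe) -> bytes:
    """Rebuild the original data from the unique blocks and the recipe."""
    return b"".join(store[digest] for digest in recipe)

def savings_percent(ratio: float) -> float:
    """Space saved by an N:1 ratio, as a percentage: (1 - 1/N) * 100."""
    return (1 - 1 / ratio) * 100

# Assumed sample data: 20 blocks, of which only 2 are distinct.
data = (b"A" * 4096) * 19 + (b"B" * 4096)
store, recipe = dedupe_blocks(data)
stored_bytes = sum(len(block) for block in store.values())
ratio = len(data) / stored_bytes
print(f"{ratio:.0f}:1 ratio, {savings_percent(ratio):.0f}% saved")  # 10:1 ratio, 90% saved
```

Calling `savings_percent(20)` and `savings_percent(25)` gives roughly 95 and 96, the one-point difference described in the text.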
A more practical use of drives
One of the benefits of performing deduplication across virtual machines on a host server is that it reduces the amount of physical disk space the virtual machines consume. For some organizations, this might make solid-state storage more practical for virtualization hosts. Solid-state drives typically have a much smaller capacity than traditional hard drives, but they deliver better performance because there are no moving parts.
Technological Classification
The practical benefits of this technology depend on several factors:
1 Point of Application - Source vs. Target
2 Time of Application - Inline vs. Post-Process
3 Granularity - File vs. Sub-File level
4 Algorithm - Fixed size blocks vs. Variable length data segments
A simple relation between these factors can be explained using the diagram below:
Target based deduplication acts on the target data storage media. In this case the client is unmodified and is not aware of any deduplication. The deduplication engine can be embedded in the hardware array, which can then be used as a NAS/SAN device with deduplication capabilities. Alternatively, it can be offered as an independent software or hardware appliance that acts as an intermediary between the backup server and the storage arrays. In both cases it improves only storage utilization.
Source based deduplication, by contrast, acts on the data at the source before it's moved. A deduplication-aware backup agent installed on the client backs up only unique data. The result is better use of both bandwidth and storage, at the cost of additional computational load on the backup client.
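The exchange between a deduplication-aware agent and its backup target can be sketched as follows. This is a hedged illustration only: `BackupTarget`, `source_backup`, the in-memory dictionary, and the 4 KB block size are hypothetical and do not describe HTBackup's actual protocol.

```python
import hashlib

class BackupTarget:
    """Hypothetical backup target that stores blocks keyed by digest."""
    def __init__(self):
        self.chunks = {}

    def missing(self, digests):
        """Return the digests the target does not hold yet."""
        return [d for d in digests if d not in self.chunks]

    def upload(self, blocks):
        for block in blocks:
            self.chunks[hashlib.sha256(block).hexdigest()] = block

def source_backup(data: bytes, target: BackupTarget, block_size: int = 4096) -> int:
    """Back up data and return the number of payload bytes actually sent."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Hashing happens on the client: this is the extra computational load.
    by_digest = {hashlib.sha256(b).hexdigest(): b for b in blocks}
    needed = target.missing(list(by_digest))
    to_send = [by_digest[d] for d in needed]
    target.upload(to_send)  # only unique, previously unseen blocks cross the wire
    return sum(len(b) for b in to_send)

target = BackupTarget()
monday = b"".join(bytes([i]) * 4096 for i in range(8))  # 8 distinct blocks
tuesday = monday + b"\xff" * 4096                       # one new block appended
sent_first = source_backup(monday, target)
sent_second = source_backup(tuesday, target)
print(sent_first, sent_second)  # 32768 4096
```

The second backup sends a single block even though the dataset has grown to nine, which is where the bandwidth saving comes from.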
The fixed-length block approach, as the name suggests, divides files into blocks of a fixed size and uses a simple checksum (MD5, SHA, etc.) to find duplicates. Although this makes it possible to find repeated blocks, its effectiveness is very limited. The reason is that the primary opportunity for data reduction lies in finding duplicate blocks in two transmitted datasets that are made up mostly, but not completely, of the same data segments.
For example, similar data may be present at different offsets in two datasets. In other words, the block boundaries of similar data may differ. This is very common when some bytes are inserted into a file: when the changed file is processed again and divided into fixed-length blocks, every block from the insertion point onward appears to have changed.
Therefore, two datasets with a small amount of difference are likely to have very few identical fixed length blocks.
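The boundary problem is easy to reproduce. The sketch below, with hypothetical helper names and a deliberately tiny 8-byte block size, compares how many fixed-length blocks survive an in-place overwrite versus a one-byte insertion.

```python
import hashlib

def fixed_block_digests(data: bytes, block_size: int = 8):
    """Digest of every fixed-length block, in order."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def shared_blocks(old: bytes, new: bytes, block_size: int = 8) -> int:
    """Count blocks of `new` that already exist among the blocks of `old`."""
    known = set(fixed_block_digests(old, block_size))
    return sum(1 for d in fixed_block_digests(new, block_size) if d in known)

original = bytes(range(64))                             # 8 blocks of 8 bytes
overwritten = original[:10] + b"\xff" + original[11:]   # one byte replaced
inserted = original[:10] + b"\xff" + original[10:]      # one byte inserted

print(shared_blocks(original, overwritten))  # 7: only the touched block differs
print(shared_blocks(original, inserted))     # 1: only the block before the insertion still matches
```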
Variable-Length Data Segment technology divides the data stream into variable length data segments using a methodology that can find the same block boundaries in different locations and contexts. This allows the boundaries to "float" within the data stream so that changes in one part of the dataset have little or no impact on the boundaries in other locations of the dataset.
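One common way to obtain such floating boundaries is content-defined chunking: a rolling hash is computed over the last few bytes, and a segment ends wherever the hash matches a chosen pattern. The sketch below uses a simple gear-style rolling hash as an assumed stand-in. The source does not say which method HTBackup uses, and `GEAR`, `MASK`, and `variable_chunks` are hypothetical names.

```python
import hashlib
import random

# Assumed per-byte table of 32-bit values, derived from SHA-256 so it is deterministic.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
        for i in range(256)]
MASK = 0xFF000000  # cut where the top 8 hash bits are zero (about 256 bytes on average)

def variable_chunks(data: bytes):
    """Split data at positions chosen by the content, not by a fixed offset."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # After 32 shifts a byte no longer influences h, so the hash
        # depends only on the most recent 32 bytes of content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(20000))
edited = original[:5000] + b"!" + original[5000:]  # one byte inserted mid-stream

before = {hashlib.sha256(c).digest() for c in variable_chunks(original)}
after = variable_chunks(edited)
shared = sum(1 for c in after if hashlib.sha256(c).digest() in before)
print(f"{shared} of {len(after)} segments unchanged after the insertion")
```

Because each boundary depends only on nearby bytes, the segments before the insertion and those a short distance after it come out identical in both versions. Production chunkers usually also enforce minimum and maximum segment sizes, which this sketch omits.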
Every organization generates data at its own rate. The extent of the savings depends on, but is not directly proportional to, the number of applications or end users generating data. Overall, deduplication savings depend on the following parameters:
1 No. of applications or end users generating data
2 Total data
3 Daily change in data
4 Type of data (emails, documents, media etc.)
5 Backup policy (weekly full with daily incrementals, or daily full)
6 Retention period (90 days, 1 year etc.)
7 Deduplication technology in place
The actual benefits of deduplication are realized once the same dataset is processed multiple times over a span of time for weekly/daily backups. This is especially true for variable length data segment technology, which has a much better capability for dealing with arbitrary byte insertions.
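A rough back-of-the-envelope model shows why the ratio grows as the same dataset is backed up repeatedly. It is a simplification under loud assumptions and not a sizing tool: it assumes a daily-full policy, that every changed byte is unique, and that nothing else (data type, compression, metadata overhead) matters. The function name and the sample figures are hypothetical.

```python
def retained_backup_ratio(total_gb: float, daily_change: float, retention_days: int) -> float:
    """Estimated deduplication ratio for daily full backups kept for retention_days.

    daily_change is the fraction of the dataset that changes each day (0.01 = 1%).
    """
    logical = total_gb * retention_days  # what the backups add up to without deduplication
    # With deduplication: one full copy plus each later day's changed data.
    stored = total_gb + total_gb * daily_change * (retention_days - 1)
    return logical / stored

for days in (1, 7, 90):
    ratio = retained_backup_ratio(total_gb=1000, daily_change=0.01, retention_days=days)
    print(f"{days:>2} days retained: about {ratio:.1f}:1")
```

Under these assumptions a single backup gains nothing (1.0:1), while a longer retention period pushes the ratio up, because each additional full backup adds mostly data that is already stored.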
Contacts
Bruno Andrade, Product Director, HTBase Canada, e-mail: [email protected]
Samuel Ayres, Sales Director, HTBase Latin America, e-mail: [email protected]