The do’s and don’ts
of data deduplication
Data Deduplication continues to gain momentum as one of the
most popular backup trends hitting the storage market today.
This E-Guide will highlight the benefits, as well as key areas to
focus on when considering or implementing Data Deduplication.
Learn more about the do’s and don’ts of data deduplication as
well as how to use data deduplication in a disk-based backup
system.
Sponsored By:
Table of Contents
Dedupe dos and don'ts: Data deduplication technology best practices
Where and how to use data deduplication technology in disk-based backup
About EMC and your local EMC Solution Provider
Dedupe dos and don'ts: Data deduplication technology best
practices
By: W. Curtis Preston
Data deduplication technology can significantly reduce the amount of disk needed to store your data backups, but things that you do in your data center can actually be working against deduplication. And there are other things that don't necessarily hurt your deduplication system, but aren't a good idea anyway.
This article will explain those things you should (or shouldn't) be doing if you are performing data deduplication.
Target dedupe
This first section applies only to target deduplication systems. These include data dedupe appliances and software-based target dedupe in your backup software (e.g., CommVault Simpana and Symantec Corp. NetBackup PureDisk Media Server Option).
Don't perform more full backups just to get a better data deduplication ratio. Some customers have been told by their target dedupe sales engineers to perform nightly full backups to increase their dedupe ratios. Please don't do this. Perform more frequent full backups if it makes your recoveries better, or makes your DBAs sleep better. (DBAs have always had a trust issue with incremental backups.) Don't do it just to get a better deduplication ratio.
Do consider increasing your backup retention period on disk. Once you have your first set of backups on disk, adding more backups to that same deduped system will take up less space than sending them to tape. So if you're already storing 30 days on disk and 60 days on tape, consider storing all 90 days on disk. You'll be surprised at how little disk additional backups take up, assuming you're getting a good dedupe ratio.
Don't use multiplexing if you're backing up to a virtual tape library (VTL). Some people carry over this practice from physical tape to virtual tape, and this can have devastating effects on your dedupe ratio. Even systems that can de-multiplex the data (FalconStor, Sepaton) recommend that you turn it off, as they are just wasting computing cycles de-multiplexing the data -- cycles that could otherwise be used to dedupe your data faster. Instead of multiplexing 40 backups to 10 virtual tape drives, create 40 virtual tape drives and turn off multiplexing.
Source and target deduplication
Don't obsess over your deduplication ratio, period. You should closely examine this number when you are comparing multiple products. See who gets the better dedupe ratio when you send the same data to each system. But once your system is installed, try not to overanalyze this number, especially when it's first installed. Dedupe ratios are always low at first and grow over time. You should periodically check to see if there has been a significant, negative change in your dedupe ratio, as it may indicate that something is wrong.
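The periodic check described above can be sketched in a few lines of Python. This is only an illustration: the ratio is taken here as logical data protected divided by physical data stored, the backup history is made up, and the 20% tolerance is an assumption rather than a vendor recommendation.

```python
def dedupe_ratio(logical_tb: float, stored_tb: float) -> float:
    """Ratio of data protected to disk actually consumed (e.g. 5.0 means 5:1)."""
    return logical_tb / stored_tb


def significant_drop(ratios: list, tolerance: float = 0.20) -> bool:
    """True if the latest ratio is more than `tolerance` below the earlier peak.

    The 20% default is an arbitrary, hypothetical threshold.
    """
    if len(ratios) < 2:
        return False
    peak = max(ratios[:-1])
    return ratios[-1] < peak * (1 - tolerance)


# Hypothetical cumulative totals after each week: (logical TB, stored TB).
history = [(10, 5), (20, 6), (30, 7), (40, 8)]
ratios = [dedupe_ratio(logical, stored) for logical, stored in history]

print([round(r, 2) for r in ratios])        # low at first, growing over time
print(significant_drop(ratios))             # normal growth, nothing to flag
print(significant_drop(ratios + [3.0]))     # a sudden fall worth investigating
```

A check like this looks at the trend rather than the absolute number, which is the point of the advice: a low ratio in the first week is expected, while a sharp fall later is the signal to look into.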
Don't encrypt data before the dedupe system sees it.
For example, do not back up a Windows Encrypted Filesystem to a dedupe system and expect to see anything other than 1:1 as your dedupe ratio. Dedupe systems look for patterns and encryption systems get rid of patterns -- ergo, no dedupe.
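A small Python sketch shows why encrypted data gives a 1:1 ratio. It assumes fixed 4 KB chunks fingerprinted with SHA-256, and it uses a toy XOR keystream (`toy_encrypt`, a hypothetical helper) purely as a stand-in for a real cipher. It is not secure and is not how any particular product encrypts.

```python
import hashlib

CHUNK = 4096  # fixed chunk size in bytes (an assumption for this sketch)


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Illustration only, NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + counter).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


def unique_chunks(data: bytes) -> int:
    """Number of distinct chunk fingerprints a dedupe system would have to store."""
    return len({hashlib.sha256(data[i:i + CHUNK]).digest()
                for i in range(0, len(data), CHUNK)})


block = bytes(range(256)) * 16          # one 4 KB block
plaintext = block * 16                  # the same block repeated 16 times
ciphertext = toy_encrypt(plaintext, b"example-key")

print(unique_chunks(plaintext))    # 1: every chunk is a duplicate of the first
print(unique_chunks(ciphertext))   # 16: the repetition is gone, nothing to dedupe
```

The plaintext is sixteen copies of one block, so only one chunk needs storing. After encryption every chunk looks different, even though the underlying data has not changed.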
Don't compress data before the dedupe system sees it -- for two reasons.
The first is that all dedupe systems compress after they dedupe, so you are accomplishing nothing. The second reason is that compression can "scramble" the data, creating difficulties for the dedupe system when looking for patterns. (Note: CommVault's dedupe system allows you to encrypt and compress your backups once they're fingerprinted, and that should not impact your dedupe ratio.)
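The "scrambling" effect can be seen with Python's standard `zlib` module. The sketch below assumes fixed 4 KB chunks and SHA-256 fingerprints, and the log data is synthetic. It compares how many chunks of a slightly edited file are already known to the dedupe index, first for the raw data and then for the same data compressed before chunking.

```python
import hashlib
import zlib

CHUNK = 4096  # fixed chunk size in bytes (an assumption for this sketch)


def fingerprints(data: bytes) -> list:
    """SHA-256 fingerprint of each fixed-size chunk."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]


def shared_chunks(old: bytes, new: bytes) -> int:
    """Count chunks of `new` whose fingerprint already exists for `old`."""
    known = set(fingerprints(old))
    return sum(1 for fp in fingerprints(new) if fp in known)


# Two versions of a synthetic log; only record 5 differs, edited in place.
lines = [f"record {i:08d} status=ok\n" for i in range(10_000)]
old_raw = "".join(lines).encode()
lines[5] = "record 00000005 status=no\n"
new_raw = "".join(lines).encode()

raw_total = len(fingerprints(new_raw))
raw_shared = shared_chunks(old_raw, new_raw)

old_zip, new_zip = zlib.compress(old_raw), zlib.compress(new_raw)
zip_total = len(fingerprints(new_zip))
zip_shared = shared_chunks(old_zip, new_zip)

print(f"raw:        {raw_shared} of {raw_total} chunks already stored")
print(f"compressed: {zip_shared} of {zip_total} chunks already stored")
```

In the raw data, the one-record edit touches a single chunk and every other chunk is a duplicate. How much of the compressed stream still matches depends on the compressor and the data, but an edit near the start can change the encoding of what follows, so far fewer chunks line up.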
Do learn what data doesn't dedupe very well and consider not deduping it.
With most dedupe systems, data that is created by a human (e.g., Office documents, database entries) dedupes well. Data that is automatically created by a computer doesn't dedupe well. Photos, video, audio, imaging and seismic data are all examples of data that doesn't dedupe very well. Consider storing it on non-deduped storage. (Some dedupe systems can turn off dedupe on certain sets of data.)
Do read the best practices documentation for your particular dedupe systems and follow their suggestions.
The suggestions in this article should apply to most (if not all) dedupe systems, but your particular product may have some idiosyncrasies you should be aware of when using it.
Do test multiple dedupe systems before buying one of them.
There are some really good products out there, but there are also products with some real limitations. Only by comparing multiple products do some of these rear their ugly heads.
Do test copying data from your dedupe system to tape if you plan to do that.
This is one of the areas that separate the men from the boys, so to speak.
Don't believe a vendor that tells you that none of the dedupe products can stream your tape drive at its maximum speed.
The list may be short, but it does exist. Some of the products, unfortunately, can only stream drives that are five to six years old.
Test everything. Believe nothing. Read the documentation and follow its advice, and all should be right in dedupe land.
EMC2, EMC, and where information lives are registered trademarks or trademarks of EMC Corporation in the United States and other countries.
© Copyright 2010 EMC Corporation. All rights reserved.
Where and how to use data deduplication technology in
disk-based backup
By: Lauren Whitehouse
Data deduplication promises to reduce the transfer and storage of redundant data, which optimizes network bandwidth and storage capacity. Storing data more efficiently on disk lets you retain data for longer periods or "recapture" data to protect more applications with disk-based backup, increasing the likelihood that data can be recovered rapidly. Transferring less data over the network also improves performance. Reducing the data transferred over a WAN connection may allow organizations to consolidate backup from remote locations or extend disaster recovery to data that wasn't previously protected. The bottom line is that data dedupe can save organizations time and money by enabling more data recovery from disk and reducing the footprint and power and cooling requirements of secondary storage. It can also enhance data protection.
Read the fine print when selecting a data dedupe product
The first point of confusion lies in the many ways storage capacity can be optimized. Data deduplication is often a catch-all category for technologies that optimize capacity. Archiving, single-instance storage, incremental "forever" backup, delta differencing and compression are just a few technologies or methods employed in the data protection process to eliminate redundancy and reduce the amount of data transferred and stored. Unfortunately, firms have to wade through a lot of marketing hype to understand what's being offered by vendors who toss around these terms. In data protection processes, dedupe is a feature available in backup applications and disk storage systems to reduce disk and bandwidth requirements. Data dedupe technology examines data to identify and eliminate redundancy. For example, data dedupe may create a unique fingerprint for a data object with a hash algorithm and check that fingerprint against a master index. Unique data is written to storage; for data that has been seen before, only a pointer to the previously written data is stored.
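The fingerprint-and-index idea can be sketched as a toy Python class. This is a minimal illustration, not any vendor's design: `DedupeStore` is a hypothetical name, SHA-256 stands in for whatever hash algorithm a product uses, and a dictionary plays the role of both the master index and the disk.

```python
import hashlib


class DedupeStore:
    """Toy store that keeps one copy of each unique chunk (hypothetical design)."""

    def __init__(self):
        self.index = {}          # fingerprint -> chunk bytes ("master index" + storage)
        self.logical_bytes = 0   # bytes the backup application sent
        self.stored_bytes = 0    # bytes actually written

    def write(self, chunk: bytes) -> str:
        """Store the chunk if it is new; always return a pointer (its fingerprint)."""
        fingerprint = hashlib.sha256(chunk).hexdigest()
        self.logical_bytes += len(chunk)
        if fingerprint not in self.index:
            self.index[fingerprint] = chunk
            self.stored_bytes += len(chunk)
        return fingerprint

    def read(self, pointers: list) -> bytes:
        """Reassemble the original data from its list of pointers."""
        return b"".join(self.index[p] for p in pointers)


store = DedupeStore()
backup_1 = [b"alpha" * 100, b"beta" * 100, b"alpha" * 100]
backup_2 = [b"alpha" * 100, b"gamma" * 100]
pointers_1 = [store.write(chunk) for chunk in backup_1]
pointers_2 = [store.write(chunk) for chunk in backup_2]

print(store.logical_bytes, store.stored_bytes)   # 2400 1400
```

Five chunks were sent but only three were written. The two backups are kept as lists of pointers, and a restore follows those pointers back to the stored chunks.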
Granularity and dedupe
Another issue is the level of granularity the dedupe solution offers. Dedupe can be performed at the file, block and byte levels. There are tradeoffs for each method, including computational time, accuracy, level of duplication detected, index size and, potentially, the scalability of the solution.
File-level dedupe (or single-instance storage) removes duplicated data at the file level by checking file attributes and eliminating redundant copies of files stored on backup media. This method delivers less capacity reduction than other methods, but it's simple and fast.
Deduplicating at the sub-file level (block level) carves the data into chunks. In general, the block or chunk is "fingerprinted" and its unique identifier is then compared to the index. With smaller block sizes, there are more chunks and, therefore, more index comparisons and a higher potential to locate and eliminate redundancy (and produce higher reduction ratios). One tradeoff is I/O stress, which can be greater with more comparisons; in addition, the size of the index will be larger with smaller chunks, which could result in decreased backup performance. Performance can also be impacted because the chunks have to be reassembled to recover the data.
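The chunk-size tradeoff can be made concrete with a short Python sketch. It assumes fixed-size chunks and SHA-256 fingerprints, and the data is synthetic: a 64 KB "file" is backed up, 16 bytes are overwritten in place, and the file is backed up again at two different chunk sizes.

```python
import hashlib


def chunk_fingerprints(data: bytes, size: int) -> list:
    """(fingerprint, length) for each fixed-size chunk of `data`."""
    return [(hashlib.sha256(data[i:i + size]).digest(), len(data[i:i + size]))
            for i in range(0, len(data), size)]


def second_backup_cost(old: bytes, new: bytes, size: int) -> dict:
    """Index size and new bytes written when `new` is backed up after `old`."""
    index = {fp for fp, _ in chunk_fingerprints(old, size)}
    new_bytes = 0
    for fp, length in chunk_fingerprints(new, size):
        if fp not in index:
            index.add(fp)
            new_bytes += length
    return {"index_entries": len(index), "new_bytes": new_bytes}


# 64 KB of non-repeating data, then a 16-byte in-place change at offset 1000.
old = b"".join(i.to_bytes(4, "big") for i in range(16_384))
new = old[:1000] + b"\xff" * 16 + old[1016:]

large = second_backup_cost(old, new, 4096)
small = second_backup_cost(old, new, 256)
print(large)   # {'index_entries': 17, 'new_bytes': 4096}
print(small)   # {'index_entries': 257, 'new_bytes': 256}
```

With 256-byte chunks the second backup writes 16 times less new data, but the index holds roughly 15 times as many entries and every one of them is a lookup. That is the I/O and index-size tradeoff described above.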
Byte-level reduction is a byte-by-byte comparison of new files and previously stored files. While this method is the only one that guarantees full redundancy elimination, the performance penalty could be high. Some vendors have taken other approaches. A few concentrate on understanding the format of the backup stream and evaluating duplication with this "content-awareness."
Where and when to dedupe
The work of data dedupe can be performed at one or more places between the data source and the target storage destination. Dedupe occurring at the app or file-server level (before the backup data is transmitted across the network) is referred to as client-side deduplication (a must-have if bandwidth reduction is important). Alternatively, dedupe of the backup stream can happen at the backup server, which can be referred to as proxy deduplication, or on the target device, which is called target-based deduplication.
Deduplication can be timed to occur before data is written to the disk target (inline processing) or after data is written to the disk target (post-processing).
Post-process deduplication will write the backup image to a disk cache before starting to dedupe. This lets the backup complete at full disk performance. Post-process dedupe requires disk cache capacity sized for the backup data that's not deduplicated plus the additional capacity to store deduped data. The size of the cache depends on whether the dedupe process waits for the entire backup job to complete before starting deduplication or if it starts to deduplicate data as it's written and, more importantly, when the deduplication process releases storage space.
Inline approaches inspect and dedupe data on the way to the disk target. Inline dedupe could negatively impact backup performance when the app uses a fingerprint database that grows over time. Performance degradation depends on several factors, including the method of fingerprinting, granularity of dedupe, where the inline processing occurs, network performance, how the dedupe technology workload is distributed and more.
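The post-process capacity requirement can be roughed out with some simple arithmetic. The Python sketch below is a back-of-the-envelope model only: the function name, the sample figures and the `in_flight_fraction` parameter (how much of the backup sits in the cache at once when dedupe runs alongside the backup) are all assumptions for illustration, and real products release space in their own ways.

```python
def post_process_capacity_tb(backup_tb: float,
                             dedupe_ratio: float,
                             existing_deduped_tb: float,
                             wait_for_job: bool = True,
                             in_flight_fraction: float = 0.25) -> float:
    """Rough disk needed: landing cache + existing deduped data + new deduped data."""
    if wait_for_job:
        # The whole backup lands on disk before any of it is deduplicated.
        cache_tb = backup_tb
    else:
        # Dedupe trails the backup, so only part of it is un-deduplicated at once.
        cache_tb = backup_tb * in_flight_fraction
    new_deduped_tb = backup_tb / dedupe_ratio
    return cache_tb + existing_deduped_tb + new_deduped_tb


# Hypothetical job: 10 TB backup, 10:1 ratio, 5 TB of deduped data already stored.
print(post_process_capacity_tb(10, 10, 5, wait_for_job=True))    # 16.0
print(post_process_capacity_tb(10, 10, 5, wait_for_job=False))   # 8.5
```

Under these assumptions, waiting for the whole job nearly doubles the disk needed compared with deduping as the data arrives, which is why it pays to ask a vendor when the process starts and when it frees cache space.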
Hardware versus software deduplication
Many of today's most popular hardware-based approaches may solve the immediate problem of reducing data in disk-to-disk backup environments, but they mask the issues that will arise as the environment expands and evolves.
The issue is software versus hardware. On the hardware side, purpose-built appliances offer faster deployments, integrating with existing backup software and providing a plug-and-play experience. The compromise? There are limitations when it comes to flexibility and scalability. Additional appliances may need to be added as demand for capacity increases, and the resulting appliance "sprawl" not only adds management complexity and overhead, but may limit deduplication to each individual appliance.
With software approaches, disk capacity may be more flexible. Disk storage is virtualized, appearing as a large pool that scales seamlessly. In a software scenario, the impact on management overhead is less and the effect on deduplication may be greater since deduplication occurs across a larger data set than most individual appliance architectures.
Software-based client-side and proxy dedupe optimize performance by distributing dedupe processing across a large number of clients or media servers. Target dedupe requires powerful, purpose-built storage appliances as the entire backup load needs to be processed on the target. Because software implementations offer better workload distribution, inline dedupe performance may be improved over hardware-based equivalents.
Choosing a software or hardware approach may depend on your current backup software implementation. If the backup software in place doesn't have a dedupe feature or option, switching to one that does may pose challenges.
About EMC and your local EMC Solution Provider
About EMC
EMC Corporation is the world leader in products, services and solutions for information storage and management that help organizations extract the maximum value from their information, at the lowest total cost, across every point in the information lifecycle. We are the information storage standard for every major computing platform and, through our solutions, serve as caretaker for more than two-thirds of the world's most essential information. We help enterprises of all sizes manage their growing volumes of information--from creation to disposal--according to its changing value to the business through information lifecycle management (ILM) strategies. EMC information infrastructure solutions are at the heart of this mission, helping organizations manage, use, protect, and share their information assets more efficiently and cost-effectively. Our world-class solutions integrate networked storage technologies, storage systems, software, and services.
Find an EMC Partner Near You!
About ExtraTeam:
“ExtraTeam is a thirteen-year veteran of the technology industry in providing guidance in communications, networking and collaboration services and solutions across multiple vertical markets. We are committed to making a positive impact within our clients’ organizations by:
• delivering comprehensive and efficient IT services for mission critical applications and networks.
• fostering innovative solutions that reduce risk and empowering IT to better support and align with their organization’s priority business goals and objectives.
• offering the very best in real world technologies, from planning, implementation to support.”
http://www.extrateam.com/
Geographic Coverage:
About TROI IT:
Too much information is as bad in business as it is in conversation. Your company needs to move, store, and protect vital data without gorging on server space, cluttering libraries, and slowing down systems. TROI IT Solutions understands your need to securely archive your files and directories for full recovery, easy access, and speedy transfer. Our experts construct, implement, and support EMC’s leading-edge data streamlining and backup technologies to keep your business operating at its fastest pace with the least drain on resources.
http://www.mytroi.com/index.php
Geographic Coverage:
•Oregon •Washington •Idaho •Montana
About En Pointe:
As a nationwide technology solutions provider that supplies I.T. products and services to medium and large enterprises, educational institutions, government agencies and non-profits, En Pointe Technologies simplifies the dynamics of the information technology infrastructure by reducing complexities, removing obstacles, minimizing risks, and dramatically lowering the costs associated with procurement, product integration and supplier management. Our procurement tools drive business efficiency and can also help you seamlessly meet your supplier-diversity purchasing goals.
http://www.enpointe.com/
Geographic Coverage:
•Texas •Colorado •Virginia •West Virginia
•Pennsylvania •New Jersey
About Logicalis:
With over 1,200 people worldwide, Logicalis delivers smart solutions based on specific needs, not the latest IT trend. Logicalis provides options, direction and support to more than 5,000 corporate and public sector customers. The company attributes its success to the everyday positive experiences with its customers and strategic partners.
www.us.logicalis.com
Geographic Coverage:
•Massachusetts •New York City
•New Jersey •Connecticut
About Universal Data, Inc.:
Universal Data, Inc. is a systems integrator specializing in collaboration technologies. Our goal is to enable businesses to stretch limited personnel resources further and maximize efficiency with today's flatter organizational structures. Through technologies such as Voice and Video over IP, Web-based Document Management, Integrated E-Mail and Fax systems, Thin Clients, Cellular and Wireless networking, and advanced Internet Firewalls with VPNs, UDI makes technology a strategic part of our customers' organizations.
www.udi.com
Geographic Coverage:
•Louisiana •Mississippi
About NET Source:
NET Source specializes in architecting, developing and deploying complete data solutions that require varied degrees of integration across heterogeneous environments. We have the experience and the technical expertise needed to design and implement a solution to store, manage, share, protect and retrieve your valued information. Customers rely on the systems we build to secure their data without failure. NET Source offers multiple Data Deduplication solutions for small businesses to the largest enterprises. If you are looking to enhance your backup and recovery environment, please contact us for a complete overview of our Data Deduplication and Virtual Tape Library offerings.
http://www.netsource.cc/index.html
Geographic Coverage:
•Colorado
About Fortis Data Systems:
Fortis Data Systems is a value-added reseller which places an emphasis on providing an end-to-end solution by leveraging technology and services which are defined based on client specific need and are delivered in a cost effective manner in support of Enterprise-based organizations worldwide. Fortis Data enables clients to effectively scale the infrastructure, ensuring a reasonable total cost of ownership, business continuity and long term return on investment. By better understanding the client’s unique business requirements and industry-specific needs we are able to implement unified solutions that combine best-of-breed technologies and second to none implementation services that span supplier lines.
www.fortisdata.com
Geographic Coverage: