Modernizing Data Storage Archive Infrastructure
Avoid massive cost, obsolescence and litigation risk
by modernizing archive storage infrastructure
Multiple risk factors spanning business, legal and regulatory dimensions make archive issues daunting for many organizations to address. Solving these challenges while ensuring long-term data access requires modernizing data archives to meet key business requirements: long-term data preservation, search and retrieval accessibility, and security, all of which will evolve over a span measured in decades and beyond. To modernize their archives, organizations should take two steps: first, address the cost of exponentially increasing data volumes by adopting data deduplication technology to reduce storage consumption by 95% or more; and second, ensure that the software systems responsible for the long-term preservation and management of electronic data meet basic prerequisites:
• Archive: federation of archived content; metadata to support legal and compliance requirements; full-text indexing with efficient search capabilities; and separation of content storage from applications to facilitate physical and logical migration at the data layer.
• Storage: massive scalability, greater operational predictability, resiliency, and little to no downtime.
Reappraisal should also address rapidly increasing storage management costs and the disruption caused by frequent technology refreshes. Traditional approaches to managing increasing volumes of data have become inadequate in the face of rising complexity and risk.
Table of Contents
Challenges of Unstructured Data
Factors Driving Reappraisal of Archival Storage Systems
Storing Static Data on Primary Storage Arrays is Expensive
Solving Archive Challenges
Scalability
Lower TCO
Ease of upgrades
Global data deduplication
Conclusion
About NEC
Of necessity, requirements for both day-to-day storage and data archive need to evolve to meet these challenges:
• Organizations continue to create and lose more data: This problem is exacerbated by rapid data growth; one government agency expects tenfold data growth between 2006 and 2010, with ongoing annual growth of file data estimated at 40 percent. (1) Data loss occurs over time as storage media, software and hardware become obsolete.
• Organizations retain data for longer time periods: For many organizations, compliance with data retention policies requires records to be kept for decades, sometimes even indefinitely. At the same time, legal requirements are changing how organizations need to access and manage data. Despite advances in records and retention management technology, the majority of organizations have yet to address this issue and continue to keep historical data for extended periods without full consideration of best practices.
Retention periods are also being lengthened by the need to extract greater value from data. Product lifecycles can stretch for extremely long periods, leading to the so-called “long tail” phenomenon, which lets firms mine data over longer periods to find new value. The Storage Networking Industry Association (SNIA) has analyzed organizational practices and identified five key categories driving longer retention periods: business, legal, security, compliance, and other risk factors, such as losing organizational memory. The data supports the conclusion that losing historical data is a top concern for organizations generally, while compliance is the top concern for records and information managers (RIMs) and legal risk tops the concerns of IT, security, and legal personnel. (2)
Challenges of Unstructured Data
The majority of stored data is static and unstructured. Collectively termed “Electronically Stored Information” (ESI), this includes email and email attachments; documents; file system content; instant messages, wiki, web and social networking content; databases; ERP and other host systems output; voice and video files; and images. Data types, usage, users, and other components of unstructured data sets should be expected to change regularly. The result of this unpredictability is that managing and migrating ESI can become a never-ending process, and one that is much more challenging without comprehensive software for long-term data preservation. Based on conversations TheInfoPro has had with storage professionals:
“An estimated 80% of large organizations lack standard data migration operating processes and tool sets: they use whatever local knowledge or on-board tool shipped with the storage array or filer, which can be resource-intensive and error-prone.
“Consequently, data migration projects are disruptive, cause application and server downtime, and can even sometimes cause irrevocable data loss. Growth rates of 50% and greater will cause the cost of migration to soar even further in the years ahead.” (Rob Stevenson, TheInfoPro) (3)
Despite these challenges, enterprise data centers are performing migrations at ever-increasing rates, either by choice, to take advantage of the significant improvements available in newer storage systems, or by force, to accommodate data growth, whether consolidating on new systems or re-purposing legacy storage systems to extract additional data value. Solving migration challenges is complicated by evolving circumstances governing archival storage.
According to TheInfoPro, large enterprises annually spend an estimated $300K on average for data migration software. “However, even more impressive than those spending patterns, is the labor and other costs involved with migrations such as downtime, increased staffing costs, additional hardware needs, and business disruptions.”
“These conditions can easily result in costs that exceed what organizations spend on data migration software tools. When you take into account that these are annual spending trends, in an era where data must be stored for decades, cumulative costs can eventually exceed tens of millions of dollars.” (4)
Factors Driving Reappraisal of Archival Storage Systems
Because archive data lives so long, its cost implications can be quite large. In a 2003 case, several oil companies faced litigation because their MTBE additive had seeped into groundwater. Gasoline production data was spread among hundreds of terabytes of stored data reaching as far back as 30 years. With data in both active and inactive databases, and other data available only from old backup tapes, the forecasted cost of litigation preparation, including eDiscovery, review and production of required documents, was hundreds of millions, if not billions, of dollars. The case was settled in May 2008 for $423M. (5)
In the face of these costs and risks, organizations have implemented high-cost storage infrastructures in an attempt to enable quick response to business, legal and regulatory demands. Existing regulations, such as SEC Rule 17a-4, NASD 3010/3100, the Sarbanes-Oxley Act of 2002 and the Financial Services and Markets Act 2000, not only provide guidelines for how to store, retrieve and recover data, but often require certain data to be “readily accessible” for specified durations, whether two years, three years, five years, 99 years or indefinitely.
While ‘readily accessible’ is not well defined, it is often interpreted as available within a 24-hour period. This requirement alone necessitates that data be stored on a medium that supports search, access and extraction of potentially large volumes of data within a short time frame; tape cannot support this requirement. The common legal process of filtering and culling information to respond to a discovery order with the appropriate documents is likewise impractical when data are stored on tape.
Comparison of Tape vs. Disk-based Archive Storage Architecture
Tape:
• “Not Readily Accessible”
• $500K eDiscovery Cost
• Frequent Media Failures
• Not Scalable
• Slower Performance
Disk:
• “Readily Accessible”
• $50K eDiscovery Cost
• Infrequent Media Failures
• Scalable
• Faster Performance
Legal and regulatory developments have clear consequences for the cost of data storage: storage expenses related to legal discovery now represent 50 to 70 percent of the cost of litigation. (6) The costs of completely and comprehensively managing electronically stored information have risen dramatically, averaging $1,000 to $2,000 per gigabyte to identify, collect and process ESI. (7)
As eDiscovery grows in importance, requirements for storing archive data have begun to change. Historically, tape has been used for archive data because it was viewed as inexpensive to purchase, did not have to be kept online, and consumed no power or cooling. However, tape is prone to media failures that can render data unrecoverable. Even without media failure, recovery from tape is time-consuming and costly because tape cannot be efficiently searched online. Companies that use third parties to retrieve relevant documents from tape for eDiscovery find their costs can exceed $2,000 per tape. (8) Because relying on tape for archive purposes leads to high eDiscovery costs, organizations are shifting to disk-based archives, which can be significantly more cost-effective. The Taneja Group estimates that Fortune 1000 companies assume a minimum of $500,000 per lawsuit in discovery costs with tape-based archives, whereas eDiscovery against a disk-based archive for the same lawsuit could easily cost one-tenth as much. (9)
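The gap between the two approaches can be illustrated with a back-of-envelope calculation. The sketch below uses the figures cited above (the $2,000-per-tape retrieval cost and the Taneja Group's $500K tape baseline); the lawsuit count, tape count, and one-tenth disk factor are illustrative assumptions, not measured values:

```python
# Back-of-envelope eDiscovery cost comparison. Dollar figures per tape
# and per lawsuit are from the sources cited in the text; the number of
# lawsuits and tapes below are hypothetical.
COST_PER_TAPE = 2000          # third-party retrieval cost per tape
TAPE_BASELINE = 500_000       # assumed discovery cost per lawsuit, tape archive
DISK_FACTOR = 0.10            # disk archive assumed roughly one-tenth the cost

def discovery_cost(lawsuits: int, disk_based: bool) -> float:
    """Total annual discovery cost for a given number of lawsuits."""
    per_suit = TAPE_BASELINE * (DISK_FACTOR if disk_based else 1.0)
    return lawsuits * per_suit

# A firm facing four discovery events per year:
tape_cost = discovery_cost(4, disk_based=False)
disk_cost = discovery_cost(4, disk_based=True)
print(f"tape archive: ${tape_cost:,.0f}/yr")   # $2,000,000/yr
print(f"disk archive: ${disk_cost:,.0f}/yr")   # $200,000/yr

# Restoring just 250 tapes for a single matter already costs:
print(f"tape restoration alone: ${250 * COST_PER_TAPE:,.0f}")
```

Even under these rough assumptions, the tape-based approach is an order of magnitude more expensive per discovery event, before counting restoration labor.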
Reappraisal of archive data storage should reflect the changing business environment, the evolution of laws and regulations, the different types of data, and the shortcomings of existing tape-based systems. First-generation alternatives to tape did not effectively overcome the known issues: they scaled poorly, performed badly, and burdened organizations with additional operational costs.
One of the reasons cited for these failures is companies did not deploy comprehensive records management software with scalable storage designed for archive. Addressing the archive issue from a storage-only perspective also fails to effectively address the core requirements for long-term data retention and access. (10)
Storing Static Data on Primary Storage Arrays is Expensive
Historically, disk-based storage, designed to enable dynamic modification and fast retrieval, has been used for “live” active data, with tape limited to backup and long-term preservation of static data. However, organizations are increasingly recognizing that even within a short timeframe (30 to 60 days), data updates become rare or even prohibited. The Storage Networking Industry Association now estimates that 60 to 80 percent of all stored data is static. (11)
Emerging from these facts is the understanding that two classes of disk-based storage are required: primary storage, associated with data creation and dynamic updating; and secondary storage, used to describe systems and devices intended for long-term access, moderate levels of retrieval performance, and much greater levels of economy.
When reappraising archives, firms need to evaluate a mix of storage devices and systems that properly accounts for the differing value and nature of data. While primary storage offers the fastest possible retrieval performance and is used ‘just in case’ of need, storing all data on primary storage is costly: an estimated five to eight times more expensive than using secondary storage systems. (12) Given this large cost differential, it is worth reconsidering assumptions about storage infrastructure, especially since data retrieval from secondary disk-based storage can easily meet the timelines imposed by eDiscovery events or business opportunities. Disk-based secondary storage is a sound economic and risk-mitigation alternative.
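The economics of tiering follow directly from that differential. The sketch below uses the five-to-eight-times cost ratio cited above and SNIA's 60 to 80 percent static-data estimate; the per-terabyte price, the 6x ratio chosen within the cited range, and the 500 TB data volume are hypothetical inputs for illustration only:

```python
# Illustrative tiering economics. The 5x-8x primary/secondary cost ratio
# and the 60-80% static-data share come from the sources cited in the
# text; the dollar figure and data volume are hypothetical.
PRIMARY_COST_PER_TB = 5000.0   # hypothetical fully loaded $/TB on primary disk
RATIO = 6                      # primary assumed 6x secondary (within 5x-8x)
SECONDARY_COST_PER_TB = PRIMARY_COST_PER_TB / RATIO

def annual_cost(total_tb: float, static_fraction: float, tiered: bool) -> float:
    """Storage cost with static data left on primary vs. moved to secondary."""
    static_tb = total_tb * static_fraction
    active_tb = total_tb - static_tb
    if tiered:
        return active_tb * PRIMARY_COST_PER_TB + static_tb * SECONDARY_COST_PER_TB
    return total_tb * PRIMARY_COST_PER_TB

# Assume 70% of a 500 TB estate is static (midpoint of SNIA's estimate):
all_primary = annual_cost(500, 0.70, tiered=False)
tiered = annual_cost(500, 0.70, tiered=True)
print(f"all on primary: ${all_primary:,.0f}")
print(f"tiered:         ${tiered:,.0f}")
print(f"savings:        {100 * (1 - tiered / all_primary):.0f}%")
```

Under these assumptions, moving static data to secondary storage cuts the storage bill by more than half, before any deduplication savings are counted.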
Considering the economic benefits of secondary storage, the stage has been set for new solutions that optimize the function and economics of archive storage.
Figure 1. NEC-Unify archive system overview: scalable, resilient archive features
Solving Archive Challenges
Unify and NEC have joined forces to create a comprehensive data archiving solution that meets all core requirements and features four standout characteristics: scalability; low TCO, based on in-place technology refresh on an ageless storage platform (including application-level support for physical and logical migrations); ease of upgrades; and global data deduplication.
The solution is based on NEC’s proven HYDRAstor storage grid technology and makes use of Unify’s Core Archive Platform for Electronically Stored Information with key applications for Search & Discovery, Records Management, Legal Case Management and Content Supervision, all with open web-based integration capabilities.
■■■Storing all data on primary storage is costly, estimated at five to eight times more expensive than using secondary storage systems. The Taneja Group
Scalability
Large enterprises already manage hundreds of terabytes of data, due in large part to the advent of personal computers, which created multiple sources of data and opportunities for repurposing it. This phenomenon, along with the lack of efficient means of data deduplication, has resulted in large amounts of redundant data. With annual data growth over the next five years expected to average 40 to 60 percent (1), most large enterprises can expect to be managing multiple petabytes of data by the early part of the next decade.
To deliver the best possible scalability, NEC and Unify have created a comprehensive core archival platform for ESI. This integrated hardware & software solution has been architected to address corporate data tiering, compliance and legal discovery requirements with a single point of access to disparate archived data formats regardless of originating application or messaging system.
Designed for long-term data management and accessibility, the product’s dynamic data archive translation strategy is designed to promote content accessibility and readability over the lifespan of the data, regardless of availability of the original application or survivability of the originating data format. An integrated grid-based server bank provides memory speed searches across large and growing volumes of archived data.
The storage platform features global data deduplication to reduce storage space up to 50 percent more effectively than alternative solutions. The platform also includes an integrated, auditable chain of custody facility to track up to one trillion objects and events; support for different data types and email formats; and support for global, regional and centralized implementations of archive systems, with federated search to enable fast, accurate access and retrieval of data.
Write-Once Read-Many (WORM) functionality for regulatory compliance and the capacity for discretely incrementing bandwidth or storage depending on workload requirements are just a few of the features that enable the NEC and Unify solution to deliver a better overall value proposition than alternatives.
Ease of upgrades
Enabling easy upgrades of processors and disks without ‘forklift’ replacements allows resources to be added over time and across technology generations on an as-needed basis. When capacity is added, performance increases (unlike monolithic storage architectures, where performance degrades as capacity grows). HYDRAstor takes advantage of the redundancy inherent in grid architectures to maintain availability and high performance without service disruption or data loss.
The NEC-Unify solution relies on HYDRAstor’s in-place technology refresh capability to eliminate downtime due to data migration. This capability reduces or eliminates the expense of data migration, which can run up to $1,800 per terabyte, as well as application and system downtime. Bandwidth and capacity scale independently, allowing deduplication of up to 20TB of data per day and dynamically utilizing whichever combination of bandwidth and storage is required to optimize performance and responsiveness.
Solution Highlights
Ease of Management
• Single point of access
• Corporate data tiering
• Long-term content access
Functionality
• Fast search & retrieval
• Federated search
• Auditable chain of custody
• WORM functionality
Cost-effectiveness
• Global data deduplication
• Bandwidth-friendly
Lower TCO
The NEC-Unify solution provides a simplified single management system and compliance archive application architecture that integrates the management of both archive functions and day-to-day storage. Management costs associated with provisioning (such as deploying multiple appliances, devices and arrays, and managing multiple data formats) are greatly reduced, while centralized control reduces operational expenses for personnel. With only one system and one interface to administer, TCO is lowered for both hardware and data. As a true archive technology, Unify software requires no associated relational database. This architecture ensures metadata and data are stored together within the storage system rather than maintained apart, reducing costs and benefiting backup, restore, and disaster recovery/business continuity. The proven archive architecture also simplifies replication strategies by replicating all metadata and data together, rather than replicating each separately from a relational data store.
Global data deduplication
Global data deduplication enables the NEC-Unify solution to tune and optimize component interoperation to minimize access time, storage usage, hardware deployment, bandwidth consumption, and staff time spent on administration. Data deduplication is a proven means of reducing storage space consumption by 95 percent or more.
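Savings of this magnitude come from storing each unique block of data only once and recording pointers to it thereafter. The sketch below is a generic fixed-block, hash-based deduplication illustration, not NEC's actual implementation; the backup pattern and block contents are hypothetical:

```python
import hashlib

def dedup_store(blocks):
    """Store each unique block once, keyed by its content hash."""
    store = {}          # hash -> block payload (written only once)
    manifest = []       # ordered list of hashes to reconstruct the stream
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # keep payload only if unseen
        manifest.append(digest)
    return store, manifest

# Simulate 100 weekly backups of a 10-block dataset in which only
# block 0 changes each week: 1,000 logical blocks arrive in total.
blocks = []
for week in range(100):
    for i in range(10):
        payload = (f"week{week}" if i == 0 else f"static{i}").encode()
        blocks.append(payload)

store, manifest = dedup_store(blocks)
logical, physical = len(manifest), len(store)
print(f"logical blocks: {logical}, stored blocks: {physical}")
print(f"reduction: {100 * (1 - physical / logical):.0f}%")
```

Because repeated backups of largely static data are mostly duplicates, the reduction ratio climbs with every backup cycle, which is why long-retention archive workloads see the highest deduplication savings.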
One unique aspect of NEC’s grid storage system is its ability to extend deduplication to be application-aware, widening the opportunity for cost savings and operational improvements. Because of application-aware deduplication, HYDRAstor can reduce storage consumption by as much as 130 percent more effectively than application software-based deduplication.
Conclusion
This paper has discussed the advantages of modernizing archive infrastructure. A modernized system offers the opportunity to increase flexibility, scalability and resiliency while lowering both the costs of day-to-day storage management and those specifically associated with eDiscovery.
Because of its scalability, in-place technology refresh capability, and overall resiliency and cost-efficiency, the NEC-Unify solution has the architectural foundation, features and proven deployment model to enable a single, all-encompassing data storage archive regardless of application, server and networking infrastructure, or scope of operations. The offering provides the business continuance technology required for global operations.
NEC and Unify have created an opportunity for enterprise data centers to solve several fundamental problems associated with managing data over the very long term. The system offers comprehensive data management throughout the data lifecycle, can be deployed enterprise-wide, and eliminates the need for planned downtime during system upgrades.
Just as significantly, beyond delivering the scalability and resiliency required at the enterprise level with the lowest TCO, the NEC-Unify archive solution does so at a price point that enables all data to be brought under active management, extending the benefits across the enterprise.
1. Contoural presentation at TechTarget E-mail and File Archiving Seminar, July 2009.
2. The Storage Networking Industry Association, http://www.csi1000.com/docs/100YrATF_Archive-Requirements-Survey_20070619.pdf
3. TheInfoPro Storage Study, Q3 2009.
4. TheInfoPro Storage Study, Q3 2009.
5. Contoural paper, “Understanding Archiving from an IT Perspective”, 2008, page 5.
6. Contoural paper, “Understanding Archiving from an IT Perspective”, 2008, page 5.
7. Contoural paper, “Is there a Return on Investment for e-mail Archiving?”, 2009, page 6.
8. Contoural paper, “e-Discovery: Six Critical Steps for Managing E-mail, Lowering Costs and Reducing Risks.”
9. Taneja Group paper, “Evaluating Grid Storage for Enterprise Backup, DR and Archiving: NEC HYDRAstor”, September 2008, page 4.
10. Peer Incite discussion hosted by Wikibon, July 31, 2009, “Building a Strategic Information Plan to Tame Unstructured Data.”
11. The Storage Networking Industry Association, http://www.csi1000.com/docs/100YrATF_Archive-Requirements-Survey_20070619.pdf
12. Taneja Group paper, “Evaluating Grid Storage for Enterprise Backup, DR and Archiving: NEC HYDRAstor”, September 2008, page 2.
About NEC Corporation of America
NEC Corporation of America is a leading technology provider of network, IT and identity management solutions. Headquartered in Irving, Texas, NEC Corporation of America is the North America subsidiary of NEC Corporation. NEC Corporation of America delivers technology and professional services ranging from server and storage solutions, IP voice and data solutions, optical network and microwave radio communications to biometric security and virtualization. NEC Corporation of America serves carrier, SMB and large enterprise clients across multiple vertical industries. For more information, please visit www.necam.com.
© 2009, NEC Corporation of America. HYDRAstor, DynamicStor, DataRedux, Distributed Resilient Data (DRD), RepliGrid and
HYDRAlock are trademarks of NEC Corporation; NEC is a registered trademark of NEC Corporation. Intel, the Intel logos, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation in the U.S. and other countries. All other trademarks and registered trademarks are the property of their respective owners. All rights reserved. All specifications subject to change. (WP129-1_0909)
NEC Corporation of America
2880 Scott Blvd. Santa Clara, CA 95050 1 866 632-3226 1 408 844-1299 firstname.lastname@example.org www.necam.com/hydrastor