HYDRAstor: New Architecture for Disk-based Backup
© Copyright 2007 GlassHouse Technologies, Inc. All rights reserved. 1
GlassHouse Whitepaper
Introduction
Disk is no longer an option in a properly designed backup system – it is an essential component. At this point in the industry, upgrading your tape drives without adding disk to the picture can actually result in a decrease in backup performance. This two‐part paper will start with an explanation of why this is the case and will then explain how many of the current methods for adding disk to your backup system have limitations that can create significant performance, manageability, and data integrity issues. The second part of the paper will describe an ideal backup architecture and then explain how a new addition to the disk‐based backup market – HYDRAstor from NEC – gives you the benefits of disk without the drawbacks associated with other disk solutions.
The Problem with Tape-based Backups
Most everyone acknowledges that disk increases the performance and reliability of a backup system. They also understand that disk helps them deal with the recurring problems of shorter backup windows, increasing amounts of data, and more stringent restore requirements. What isn’t commonly understood is the actual problem that disk solves. The common misconception is that tape is slow and disk is fast; however, this perception masks the real reason why disk performs so well as a backup target. The real reason is that backups perform better when sent to a device that can match the speed of the backup – whether slow or fast – and disk does that much better than tape. Tape’s inability to match incoming backup speeds is the key reason behind its performance and reliability issues during both backup and restore.
A fundamental architectural issue affecting performance for tape backups exists at a purely physical level. A tape drive recording head must be moved at a brisk pace across its recording medium in order to achieve a high signal to noise ratio, and a high signal to noise ratio is essential in reliably recording data to a tape. Based on this fact, all tape drives have a minimum speed at which they can reliably write data to tape – they cannot write slower than that speed. This is true even for newer drives that support variable speeds; they have a minimum speed below which they cannot operate. In addition, compression actually increases this minimum speed. For example, if a tape drive’s minimum speed is 30 MB/s and if data being sent to the drive is compressing at a rate of 1.5:1, the drive’s real minimum speed is 45 MB/s.
When you send data to a tape drive at a rate slower than its minimum speed, it will write short bursts at that speed, stop, rewind, and then spin up to its minimum speed again once the tape drive buffer is full. This process of stopping, rewinding, and starting again is called backhitching, and if you do it a lot, it’s called shoeshining.
If your backup system is already performing a great deal of “shoeshining” – and this is typically the case – and you upgrade to faster tape drives, things could actually get worse, and you will experience an even greater hit to your performance! For example, if you are currently sending 15 MB/s to a tape drive whose minimum speed is 20 MB/s, and you upgrade to a drive whose minimum speed is 30‐50 MB/s, you will actually increase the level of “shoeshining”. This so‐called ‘upgrade’ will cause a decrease in backup performance and an increase in backup failures due to media problems! Buying faster tape drives may actually be ‘slower’ than you think.
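The arithmetic behind the two examples above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the model assumes (as the text does) that a drive shoeshines whenever the incoming data rate falls below its minimum speed multiplied by the compression ratio.

```python
def effective_minimum_speed(native_min_mb_s: float, compression_ratio: float = 1.0) -> float:
    """Incoming data rate (MB/s) needed to keep the drive streaming.

    Compression raises the bar: at 1.5:1 the drive consumes 1.5 MB of
    incoming data for every 1 MB it physically writes to tape.
    """
    return native_min_mb_s * compression_ratio


def will_shoeshine(incoming_mb_s: float, native_min_mb_s: float,
                   compression_ratio: float = 1.0) -> bool:
    """True if the backup stream is too slow to keep the drive streaming."""
    return incoming_mb_s < effective_minimum_speed(native_min_mb_s, compression_ratio)


# The compression example: 30 MB/s minimum at 1.5:1 compression.
print(effective_minimum_speed(30, 1.5))   # 45.0

# The upgrade example: a 15 MB/s stream before and after the upgrade.
print(will_shoeshine(15, 20))             # True (already shoeshining)
print(will_shoeshine(15, 30))             # True (and further below the minimum)
```

In both cases the stream is too slow, but after the upgrade it supplies only half of the minimum speed rather than three quarters of it, which is why the faster drive can make matters worse.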
Many readers may be surprised to learn that their servers cannot supply this kind of speed to a tape drive for backups. The problem is that fragmented file systems, fragmented databases, and file systems with many files often supply only a few megabytes per second – especially during incremental backup. As to the network, people often upgrade only a few servers to Gigabit, and even Gigabit Ethernet is often unable to supply this kind of throughput. These two bottlenecks – physical attributes of tape architecture, and file fragmentation leading to speed constraints ‐‐ conspire to stop us from sending enough data to stream modern tape drives.
Some backup software products use multiplexing, also called interleaving, to send multiple simultaneous backups to the same tape drive in order to mitigate these issues. However, every additional backup job sent simultaneously to a tape drive reduces the restore speed of all jobs sent to that tape drive because a restore of a single job from a multiplexed tape must read and throw away all other backups to get to the one backup from which it is restoring. This drawback is why many people consider multiplexing a necessary evil when backing up to tape ‐‐ getting more necessary and more evil all the time.
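A rough way to picture the restore penalty of multiplexing is sketched below. It is a simplified model, not a measurement: it assumes the interleaved jobs occupy equal shares of the tape and that the drive reads at full speed during the restore, and both the function name and the 80 MB/s figure are hypothetical.

```python
def single_job_restore_speed(drive_read_mb_s: float, interleaved_jobs: int) -> float:
    """Useful restore rate for one job on a multiplexed tape.

    The drive must read every interleaved block but only the blocks that
    belong to the job being restored are kept; the rest are thrown away.
    Assumes each job contributed an equal share of the data on tape.
    """
    if interleaved_jobs < 1:
        raise ValueError("interleaved_jobs must be at least 1")
    return drive_read_mb_s / interleaved_jobs


# A hypothetical drive reading at 80 MB/s with four jobs interleaved:
print(single_job_restore_speed(80, 4))   # 20.0
```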
Disk-based Products Have Their Pros and Cons
Disk solves these challenges because disk can read or write data at any required speed. As a result, disk can be used to simultaneously and reliably receive all of your slower backups without the issues associated with multiplexing. Once backups have been sent to disk, they can then easily be streamed to tape at tape’s native speeds, or sent to another disk backup device. In addition, there are a number of newer backup and recovery methods (CDP, near‐CDP, and data de‐duplication) that can be applied only if backups are sent to a disk target; they do not work with backups to tape. Now that we’ve established how important disk is to a backup and recovery system and why it is becoming a required element in most enterprises, let’s take a closer look at the usual methods for deploying disk in a backup system.
Disk targets for backup systems are most often categorized into two distinct types: disk‐as‐disk and disk‐as‐tape. A disk‐as‐disk system presents to your backup server a target of either raw disk or a file system. A disk‐as‐tape target presents itself to your backup system as devices that look like tape drives (when in reality they are disk). The first challenge with both types of current disk targets is cost. Where a fully‐configured mid‐range tape library loaded with media tends to range from $2‐5/GB, disk system prices tend to range from $3‐30/GB, with most recognizable names towards the high end of that scale. Most VTL systems range from $4‐$12/GB. The answer to the cost issue lies in the elimination of duplicate data. Writing two sequential full backups of 1TB each to disk or tape requires 2TB of capacity.
Duplicate elimination technologies (also referred to as “commonality factoring”, data de‐duplication, and similar terms) recognize that much data – perhaps as much as 99% of it – does not change from one backup to the next. Because of this fact, these technologies eliminate the duplicate data in the second backup. While you still need 1TB capacity for the first backup, you may need only 0.05TB for the second. For a standard backup policy, this happens week after week. After two to three months of backups, you find that for 20TB written, you need only about 1TB of capacity. This finding is equivalent to a duplicate elimination ratio of 20:1. Therefore, effective pricing per GB will change as data de‐duplication features start to become generally available, as less disk space will be required for data protection.
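The effect on pricing can be made concrete with a short calculation. The sketch below uses the 20TB‐written, 1TB‐stored figures from the text; the function names and the $10/GB raw disk price are assumptions chosen only for illustration (the price falls within the $3‐30/GB range quoted earlier).

```python
def duplicate_elimination_ratio(tb_written: float, tb_stored: float) -> float:
    """Ratio of backup data written to physical capacity actually consumed."""
    return tb_written / tb_stored


def effective_cost_per_gb(raw_cost_per_gb: float, ratio: float) -> float:
    """Cost per GB of backup data protected, after duplicate elimination."""
    return raw_cost_per_gb / ratio


ratio = duplicate_elimination_ratio(tb_written=20, tb_stored=1)
print(ratio)                                  # 20.0, i.e. 20:1

# Hypothetical raw disk price of $10/GB:
print(effective_cost_per_gb(10.0, ratio))     # 0.5
```

Under these assumptions, disk that costs $10/GB raw protects backup data at an effective $0.50/GB, which is below the $2‐5/GB quoted for tape.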
However, some vendors’ implementations of de‐duplication have come at the expense of performance or capacity. For example, some disk targets can de‐duplicate only within a single appliance and can handle only a certain amount of storage within that appliance. If you need more performance or capacity than a single appliance can provide, you’re forced to buy another appliance. Data will not be de‐duplicated across those appliances, resulting in a significant loss of aggregate capacity as the overall duplicate elimination ratio can drop from 20:1 to about 10:1 or even lower with each additional appliance. Thus, the higher the number of appliances, the lower the duplicate elimination ratio and the higher the cost of hardware (more capacity) and cost of operations (more systems to manage)!
The second challenge with most disk targets is how they achieve resiliency, or how they ensure a system survives disk and system failures without suffering data loss. Resiliency is also increasingly important as data de‐duplication becomes popular. In order to eliminate duplicate data, the data stream is split into small chunks, such as a part of a spreadsheet with financial information. The system then checks whether the same chunk has been stored before. If so, only a pointer to the stored copy of the chunk is saved, and the second copy is discarded. One piece of data (sometimes referred to as a fragment or a chunk, indicating a piece below the file or the block level) is now used by potentially many different backups. Without sufficient resiliency, losing one data chunk can result in the loss of many, many backups! Mirroring would be too expensive due to its 100% storage overhead, so most of these units use parity‐based RAID (RAID 3‐6) for resiliency purposes. Unfortunately, parity RAID comes with a number of limitations: RAID levels 3‐5 can handle the loss of only a single disk in the RAID group, while RAID 6 can recover from two simultaneous drive failures.
RAID has another disadvantage, and that is performance. The first performance penalty comes from the calculation and storage of parity information, which is why all parity RAID levels have a performance penalty during write operations. There’s also a significant performance penalty when reading data with a missing or damaged drive – referred to as degraded mode ‐‐ because all data read from the RAID group must be rebuilt from parity.
However, nothing compares to the performance hit of a rebuild. In addition to rebuilding any requested data from parity, an additional process is reading every parity block in the RAID group and using those parity blocks to recalculate the blocks of data for the missing disk, then writing those blocks of data to the replacement drive. These operations are all managed by the same RAID controller that’s using the same RAID group to write new data for backups or to read data for a restore.
Anybody who has suffered a parity RAID rebuild knows how big this performance hit can be – and the recent utilization of large ATA disk drives in RAID arrays has only made this issue worse. These larger disk drives have rebuild times measured in days instead of hours, lengthening the window during which you could lose more drives than your RAID configuration was built to withstand, which leaves your data vulnerable. RAID 5/6 arrays also cannot easily be expanded, which is why most disk units are expanded by adding another RAID 5/6 group to RAID groups that are already configured.
Disk-as-disk targets
Disk‐as‐disk targets often experience a number of challenges when being used with backup systems; the bigger the backup system, the more challenges you are likely to have with a typical disk‐as‐disk unit. To start with, it can be quite a challenge to provision a disk‐as‐disk system for use with multiple backup servers. You must decide how much data each backup server is likely to generate, and then create and provision a file system of appropriate size for each backup server (or group of backup servers in some cases).
To prevent the challenges associated with under‐provisioning, most people over‐provision. Over‐provisioning, however, results in a significant amount of wasted disk, which in turn makes the solution even more expensive. Another method for dealing with the challenges of provisioning disk‐as‐disk systems is to create a large share on a NAS filer and share it between multiple backup servers. However, this method creates another performance bottleneck, as all data is routed through a single filer head. There are newer NAS filers that can help solve this problem through use of a global namespace, but some filers are not as good at reading or writing very fast streams of data. In addition, they’re still based on parity RAID and therefore suffer from the challenges associated with parity RAID.
Speaking of performance, although most backup systems derive the most performance from many smaller volumes, this setup is in direct opposition to the desire to minimize the number of file systems to reduce the amount of work associated with provisioning so many file systems.
Finally, most disk‐as‐disk systems suffer from fragmentation when used as a target for backup and recovery systems. This fragmentation can result in a significant decrease in performance over time and often can be fixed only by re‐provisioning and moving data from one file system to another.
Disk-as-tape (VTL) targets
Virtual tape libraries (i.e., disk‐as‐tape) solve many of the challenges listed above. For example, it is much easier to provision a VTL between multiple backup servers because backup software companies have already figured out how to share a tape library. They also use customized file systems that were designed to store backup data, usually removing the fragmentation issue.
However, VTLs aren’t perfect. Some still require a significant amount of provisioning work behind the scenes. For instance, some VTLs require you to create RAID groups and file systems just as you would with a disk‐as‐disk system. All VTLs also use parity RAID, which means they suffer the previously mentioned limitations of parity RAID, including resiliency, rebuild and risk problems.
Not all VTLs are equally scalable, either. While some can scale with additional disk and CPUs, many are fixed‐capacity units that can be scaled for performance or capacity only by buying another separate unit. These limitations increase management and load balancing related tasks. Finally, while VTLs are designed to enhance traditional backup software, they cannot be used as targets for advanced backup techniques such as CDP or near‐CDP, or for other general purpose storage needs, since these applications require targets that act like disk, not tape.
A new idea
For the second part of this white paper, let’s consider a disk‐as‐disk target that solves all of the previously mentioned limitations of today’s disk‐based targets. To do so, it would have to have the following attributes:
• Affordable effective cost comparable to tape
o System‐wide data de‐duplication that reduces the amount of disk required by a factor of 20:1 or more, but does not slow down as the amount of data stored on it increases.
• Enterprise‐level scalability
o A single system designed to minimize operational cost and provide hundreds of thousands of megabytes per second and handle tens of petabytes as a single logical pool, all of which is managed as a single entity.
o Capacity on demand by simply plugging another “storage node” into the network. The system will automatically see and begin using the additional capacity, without having to enlarge or rebuild any RAID groups (since there aren’t any), provision capacity, or perform any load balancing. Any required behind the scenes work is done automatically without affecting performance.
o Performance on demand by simply adding “accelerator nodes” to boost overall performance of the system.
• Highest resiliency of any system on the market
o User‐configurable resiliency level to provide protection against as many concurrent disk or system failures as you would like, without the limitations of RAID. How many simultaneous failures do you want to prepare for? Three systems or disks? Four? Seventeen? The level of resiliency is up to you.
o No RAID groups, volumes, or RAID controllers, hence none of their limitations.
• Self‐managing and self‐healing system
o No capacity pre‐allocation of any particular file system is necessary. Never again ask, “How big do I need to make this volume behind this backup server? How much data will this backup server (or other application) generate?”
o Automatic load‐balancing among all components to optimize capacity and performance
o Automatic self‐healing of all components; i.e., the system recognizes failed disks and components automatically and rebuilds lost data in the background while maintaining full access and performance to all data.
HYDRAstor™ from NEC
NEC is offering a new answer to these age‐old problems and limitations – the HYDRAstor data protection appliance. Although GlassHouse has not yet been able to perform testing of this new system, we discussed its design extensively with NEC, and they described to us a storage system based on a grid storage architecture that includes the above essential attributes and more.
The name HYDRAstor comes from Greek mythology. The Hydra was a multi‐headed creature that was incredibly hard to kill. It was so hard to kill that killing it was given as a test to Hercules. If he cut off one head, it grew another one, and that is what the modern HYDRAstor is like. Tell it how many heads you think you might lose at one time, and it will automatically “grow” heads as needed.
From one perspective, HYDRAstor is a distributed system of inexpensive, very resilient NAS shares that are managed as a single entity and collectively scale to tens of petabytes without requiring the customer to create or manage RAID groups, logical volumes or physical volumes. All you need to determine in order to set up a HYDRAstor system are the following three requirements for your environment:
• Your throughput requirements (i.e., what you want your aggregate throughput to be)
• Your capacity requirements (i.e., how much data you want to store)
• Your resiliency requirements (i.e., how many simultaneous failures you can withstand without data loss)
Like the mythical hydra, a HYDRAstor consists of many heads – a series of independent, industry standard servers (nodes) that act as a single entity and are connected via a private network. Some of HYDRAstor’s characteristics include:
• Throughput is determined by the number of Accelerator Nodes (ANs) you deploy, as each provides at least one NAS share and over 100 MB/s of throughput.
• Capacity is determined by how many Storage Nodes (SNs) you have, as each Storage Node contains 3 TB of raw storage, which (after resiliency overhead considerations) can store roughly 40‐50 TB of de‐duplicated data.
• Resiliency is determined by your requirements and the total number of nodes you have; the more nodes you have, the more resilient your system will be. Even the smallest HYDRAstor is designed to survive at least three simultaneous independent disk failures and one complete system failure.
Once you determine your throughput, resiliency, and capacity requirements, you simply need to supply HYDRAstor with enough Accelerator Nodes and Storage Nodes to meet your specified requirements (the standard base package is 2 ANs with 4 SNs). Additional nodes can then be easily added at any time to increase throughput or capacity.
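As a back‐of‐the‐envelope illustration, the sizing exercise described above might be sketched as follows. This is not an NEC sizing tool: the function and constant names are hypothetical, the per‐node figures are simply the ones quoted in this paper (100 MB/s per Accelerator Node, and the low end of the 40‐50 TB of de‐duplicated data per Storage Node), and the sketch ignores resiliency settings and any headroom you might want for node failures.

```python
import math

AN_THROUGHPUT_MB_S = 100   # "over 100 MB/s" per Accelerator Node (figure from this paper)
SN_DEDUP_CAPACITY_TB = 40  # low end of "roughly 40-50 TB" per Storage Node (assumption)
MIN_ANS, MIN_SNS = 2, 4    # standard base package: 2 ANs with 4 SNs


def size_hydrastor(throughput_mb_s: float, dedup_data_tb: float) -> tuple[int, int]:
    """Return (accelerator_nodes, storage_nodes) for the stated requirements."""
    ans = max(MIN_ANS, math.ceil(throughput_mb_s / AN_THROUGHPUT_MB_S))
    sns = max(MIN_SNS, math.ceil(dedup_data_tb / SN_DEDUP_CAPACITY_TB))
    return ans, sns


# Hypothetical requirement: 450 MB/s aggregate throughput, 300 TB of backup data.
print(size_hydrastor(450, 300))   # (5, 8)
```

Requirements smaller than the base package simply return the base package, for example `size_hydrastor(150, 50)` gives `(2, 4)`.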
Storage in a HYDRAstor system is found inside Storage Nodes. There are no storage arrays, no RAID controllers – just a series of interconnected nodes. Consider the HYDRAstor depicted in Figure 1, which represents the minimum recommended configuration consisting of four Storage Nodes (SNs), each of which currently has six 500 GB disk drives, and two Accelerator Nodes (ANs), each of which can access all Storage Nodes via a private network. As will be discussed later, additional nodes – including newer Accelerator Nodes with faster processors, and/or Storage Nodes with larger and/or faster disks – can be added independently at any time to add capacity or throughput.
HYDRAstor supports heterogeneous nodes; this factor allows IT to take advantage of new hardware technology without a forklift upgrade and lots of downtime. As you add SNs, you increase both capacity and the compute power for accessing the data on the disks within each SN, and this is a main key to HYDRAstor’s scalability. With a standard storage array, you don’t add more controllers every time you add more storage disks.
Figure 1: A minimum HYDRAstor configuration (two Accelerator Nodes and four Storage Nodes connected via a private network; AN – Accelerator Node, SN – Storage Node)
The configuration as depicted will supply over 200 MB/s of throughput while also providing uninterrupted service to all NAS shares during any of the following scenarios:
• Simultaneous failure of multiple disk drives (the number is up to you)
The default setting protects against a minimum of any three simultaneous disk failures, but you can increase this number if desired.
• Failure of a single Storage Node
The default setting also protects data against the failure of an entire Storage Node. As will be explained later, as more Storage Nodes are added to the system, HYDRAstor automatically increases the number of simultaneous Storage Node failures that can be tolerated without data loss.
• Failure of a single Accelerator Node
Each Accelerator Node in a HYDRAstor grid supplies over 100 MB/s of throughput, and all Accelerator Nodes run in an active‐active configuration. If one Accelerator Node fails, another Accelerator Node can take over and serve the shares that the failed Accelerator Node was providing. In the minimum standard configuration of two ANs and four SNs (illustrated in Figure 1), all shares would end up being served by the remaining Accelerator Node in the event the other Accelerator Node failed. (Throughput would, of course, reduce to 100 MB/s, as that’s the amount of throughput any single Accelerator Node can provide.) In larger HYDRAstor deployments, the load could be spread across all remaining Accelerator Nodes.
Just like Storage Nodes, though, with each Accelerator Node you add to the grid, you increase the number of simultaneous Accelerator Node failures you can tolerate. Since shares can be redistributed among all operational Accelerator Nodes in the event of an Accelerator Node failure, the more Accelerator Nodes you deploy, the less of a performance effect the loss of a single Accelerator Node would have on your system.
But how does HYDRAstor work?
If you’re like I was when NEC first explained HYDRAstor to me, you’re chomping at the bit wanting to know more about how this new thing works. Let’s not prolong the suspense any longer. The following section provides an overview of how HYDRAstor works, starting with a minimal configuration example then following a backup stream throughout its lifecycle to illustrate the unique features and benefits of the HYDRAstor architecture. We will then illustrate how the expansion of a HYDRAstor system increases its capacity, throughput, and resiliency.
The best way to explain how HYDRAstor works is to begin with powering up a system, understand how it initializes itself, and then follow how backup data is handled. Consider the HYDRAstor pictured in Figure 2, which represents the minimum configuration of two Accelerator Nodes and four Storage Nodes. For simplicity’s sake in this example, we will use the default settings. (You wouldn’t need to know or understand these settings for HYDRAstor to work, but we do want to explain them so you can see how HYDRAstor works.) In normal operation, you can, of course, always accept the defaults or change them at any time. In either case, HYDRAstor adjusts automatically.
Initialize HYDRAstor
We power on the HYDRAstor system by turning on all Accelerator Nodes and Storage Nodes in the HYDRAstor rack. Then, all we need to do is set up an administrator account and password and an optional email address for alerts. Then we tell HYDRAstor to create at least one share (i.e., a file system) on each of the Accelerator Nodes. To do so, we simply give each share a name – that’s it, we are done.
We don’t have to specify how large the file system is, or associate it with any volumes, real or otherwise. (Remember, there are no RAID groups or logical volumes in HYDRAstor, a fact which greatly simplifies the storage management overhead and configuration time.)
All other tasks are handled by HYDRAstor’s self‐management feature. The Storage Nodes automatically discover each other, detect the default resiliency setting of three, and automatically start setting up what needs to be created for a new storage system.
Data Backup: HYDRAstor de‐duplicates data
The following section explains the next step in the process, namely, backing up to the HYDRAstor as a backup target.
When backing up to the HYDRAstor, a backup application would write a file to one of the shares served by the HYDRAstor. HYDRAstor would then break that file into chunks. Although a chunk is similar to a block, since HYDRAstor does not always split data chunks at specific block boundaries, the term chunk is used to prevent possible confusion. (The detail of how data are split into chunks is beyond the scope of this paper. Suffice it to say that HYDRAstor dynamically splits data into variable size chunks with very high performance based on NEC’s patented technology. The patents filed by NEC on chunking, hashing, and duplicate elimination focus on optimizing performance and speed.)
Once HYDRAstor splits each file into chunks, it needs to verify that:
1. This same chunk has not been stored before.
2. The chunk is written in a way to meet the default or user‐specified resiliency requirements.
The next thing that happens for each chunk of data is that the Accelerator Node which receives the chunk calculates a hash (“DNA fingerprint”) on the chunk.
Think of the hash as a unique – or virtually unique – multi‐bit “DNA fingerprint” or “digital DNA” that is tied to that chunk of data. Regardless of the hashing algorithm, no vendor can state to a computational certainty that only one chunk of data would result in a given hash value. We can state that the odds of two chunks creating the same hash (referred to as a “hash collision”) are extremely low (1 in 2^160 to 1 in 2^256). However, even though the chances of a hash collision are extremely small, NEC performs a bit‐level comparison as a further “safety net” for data integrity to verify that two chunks are the same before discarding one of them.
Sidebar – Ease of installation: HYDRAstor’s automated setup simplifies data protection by establishing basic system parameters by default while also providing options to customize if you desire more control.
Sidebar – “High Performance Duplicate Elimination”: HYDRAstor uses patent-pending technology to break files into “data chunks”, tag them with their own unique “DNA fingerprint” and eliminate duplicate chunks with very high efficiency.
The hash key is then passed on (based on a least recently used algorithm) to an available Storage Node. Once a Storage Node has been selected to either write or discard a given chunk of data, it requests the actual chunk in question and then has both the hash, which represents the data chunk, as well as the actual data chunk. If the hash lookup matches a hash already seen by HYDRAstor, the Storage Node requests the previous data chunk that the new data chunk allegedly matches from the super node, then performs a bit‐level comparison of the two chunks. As long as the bit level comparison verifies the match, the new data chunk is discarded, and a reference pointer is stored instead. In the extremely rare chance that it doesn’t match, the new chunk will be written to the Storage Node. The system will also write the data chunk to disk, of course, if the hash lookup does not show a match to a previous hash.
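The hash‐then‐verify logic described above can be illustrated with a small, single‐process sketch. It is not NEC’s implementation: the class and method names are hypothetical, SHA‐256 is an assumed hash function (the paper only indicates hashes of 160 to 256 bits), and an in‐memory dictionary stands in for the Storage Nodes and super nodes.

```python
import hashlib


class ChunkStore:
    """Toy duplicate-eliminating store: hash lookup plus bit-level verification."""

    def __init__(self) -> None:
        # hash -> list of distinct chunks sharing that hash
        # (more than one entry only in the event of a hash collision)
        self._chunks: dict[bytes, list[bytes]] = {}

    def put(self, chunk: bytes) -> tuple[tuple[bytes, int], bool]:
        """Store a chunk; return (reference, was_newly_written)."""
        digest = hashlib.sha256(chunk).digest()   # assumed hash function
        candidates = self._chunks.setdefault(digest, [])
        for index, stored in enumerate(candidates):
            if stored == chunk:                   # bit-level comparison
                return (digest, index), False     # duplicate: keep only a pointer
        candidates.append(chunk)                  # new hash, or a collision
        return (digest, len(candidates) - 1), True

    def get(self, reference: tuple[bytes, int]) -> bytes:
        digest, index = reference
        return self._chunks[digest][index]


store = ChunkStore()
first_ref, first_written = store.put(b"quarterly figures")
second_ref, second_written = store.put(b"quarterly figures")
print(first_written, second_written)   # True False
print(first_ref == second_ref)         # True
```

The second write stores nothing new; both backups hold the same reference, which is exactly why the resiliency of the single stored copy matters so much.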
Storing Data with High Resiliency, Increasing Rebuild Performance
This next section will explain how HYDRAstor writes new data chunks in a manner that provides maximum resiliency while guaranteeing the default or user defined resiliency settings. First, each data chunk is broken into nine fragments (by default). Three parity fragments are calculated against the nine original data fragments in such a way that the data chunk can be recreated from any nine of the twelve fragments. To ensure maximum resiliency, HYDRAstor then distributes the twelve fragments across as many Storage Nodes and within each Storage Node across as many disks as possible.
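To put the default nine‐plus‐three scheme in perspective, the sketch below compares its storage overhead with mirroring. It computes only the arithmetic implied by the text (it does not implement the parity calculation itself), and the function name is hypothetical.

```python
def resiliency_profile(data_fragments: int, parity_fragments: int) -> dict:
    """Summarize a scheme in which any `data_fragments` of the total suffice."""
    return {
        "total_fragments": data_fragments + parity_fragments,
        "fragments_needed_to_read": data_fragments,
        "tolerated_fragment_losses": parity_fragments,
        "storage_overhead": parity_fragments / data_fragments,
    }


default = resiliency_profile(data_fragments=9, parity_fragments=3)
mirror = resiliency_profile(data_fragments=1, parity_fragments=1)

print(default["total_fragments"], default["tolerated_fragment_losses"])   # 12 3
print(round(default["storage_overhead"], 3))                              # 0.333
print(mirror["storage_overhead"])                                         # 1.0
```

In other words, the default setting survives the loss of any three fragments for roughly 33% extra storage, whereas mirroring costs 100% extra and survives the loss of only one copy.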
Management of all fragments of a chunk is handled by one of several virtual super nodes. A super node is a collection of several members, where a member is a portion of a disk. Hence in our example, each super node has 12 members. With the default set so that three members share one disk, in the minimum configuration setup we are using for our example, 24 physical disks in the four Storage Nodes now become 72 (24 x 3) members which allows for 6 (72 / 12) super nodes.
Without requiring any user intervention or configuration, HYDRAstor automatically determines how many members are in a super node based on resiliency setting defaults. Simply powering on the HYDRAstor with its default values results in the creation of six super nodes, each with twelve members.
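The member and super node counts above follow directly from the configuration, as this small sketch shows. The function and parameter names are hypothetical; the numbers are those of the minimum configuration described in this paper.

```python
def super_node_layout(storage_nodes: int, disks_per_node: int,
                      members_per_disk: int, members_per_super_node: int) -> tuple[int, int]:
    """Return (total_members, super_nodes) for a given configuration."""
    total_members = storage_nodes * disks_per_node * members_per_disk
    return total_members, total_members // members_per_super_node


# Minimum configuration: 4 SNs x 6 disks, 3 members per disk, 12 members per super node.
print(super_node_layout(4, 6, 3, 12))   # (72, 6)
```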
Given the number of ANs and SNs in the configuration, the distribution of those super nodes is automatically optimized to achieve maximum resiliency for that configuration. A HYDRAstor system with four Storage Nodes automatically configured with standard default settings is depicted in Figure 2. Super Node A consists of the first third of the first twelve disks in the system, Super Node B consists of the second third of the first twelve disks, and so on.
Figure 2: Six super nodes in a four Storage Node configuration
(this is a simplified representation for ease of understanding)
In order to reduce the chance of data loss and to decrease the time needed to restore, not all members of A, B and C (or D, E, and F) share the same physical disks. If you need to restore or just read/write your data, you want to spread the load across as many disks as possible. While the final data write may be considered analogous to RAID, there are key differences between the resiliency offered by HYDRAstor and that offered by RAID technology. Besides the obvious fact that there are no RAID controllers, volume managers, or RAID groups in HYDRAstor (which saves a not insubstantial amount of time and management headache), the parity in HYDRAstor is always calculated in RAM. Calculating parity in RAM means that only the chunk itself needs to be written, with no need to re‐read data to recalculate parity, as is the case with RAID levels 5 and 6, which suffer significant performance impacts as a result.
Compare the performance of an NEC system that’s in rebuild mode for one of its disks or Storage Nodes to that of a RAID system rebuilding a disk, and there will be no doubt that there is a major difference between the two. The following explanation of how reads work will help further explain how HYDRAstor achieves this performance.
HYDRAstor de‐duplicates data. When an Accelerator Node requests a data chunk, it requests the chunk from one of the twelve members of a super node, which then requests the chunk’s fragments from the other members of that super node. HYDRAstor then takes the first nine fragments it is provided in order to recreate the chunk. If some of the nine supplied fragments are parity fragments, they are used to automatically reconstruct the missing data fragments in RAM before the chunk is passed on to the application. This unique rebuilding capability is how HYDRAstor is able to maintain its backup performance, even when one or more disks or Storage Nodes are being rebuilt. This rebuilding optimization is a result of the fact that when you add an SN, you get not only more capacity but also additional CPU processing power, which that SN uses to handle the data on that node. The two‐tier architecture and the unique way in which HYDRAstor provides resiliency allow the system to maintain performance even during a data rebuild operation. Unlike current storage products, HYDRAstor doesn’t have one controller that becomes a bottleneck to the system.
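The idea of rebuilding a missing fragment in RAM from parity can be illustrated with a deliberately simplified sketch. It is not HYDRAstor's actual algorithm: HYDRAstor recreates a chunk from any nine of twelve fragments, whereas this example swaps in plain single-parity XOR, which can repair only one missing fragment. The function names and the fragment layout (data fragments first, parity last) are assumptions made for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(chunk: bytes, data_count: int) -> list[bytes]:
    """Split a chunk into data_count equal fragments plus one XOR parity fragment.

    Hypothetical layout: data fragments first, the parity fragment last.
    """
    frag_len = -(-len(chunk) // data_count)  # ceiling division
    padded = chunk.ljust(frag_len * data_count, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(data_count)]
    parity = reduce(xor_bytes, frags)
    return frags + [parity]


def rebuild_chunk(fragments: list, original_len: int) -> bytes:
    """Recreate the chunk in memory; at most one fragment may be None (missing)."""
    missing = [i for i, frag in enumerate(fragments) if frag is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one missing fragment")
    frags = list(fragments)
    if missing:
        # XOR of every surviving fragment (data and parity) yields the missing one.
        present = [frag for frag in frags if frag is not None]
        frags[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity fragment and the padding added during the split.
    return b"".join(frags[:-1])[:original_len]
```

For example, splitting a chunk into nine data fragments, discarding one of them, and calling `rebuild_chunk` returns the original bytes without any fragment being re-read from disk or rewritten.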
How HYDRAstor Processes Hash Tables
As hashes for all data chunks are calculated as discussed above, a list of all known hashes is also created and is referred to as the hash table.
For performance reasons, each super node is automatically nominated to manage one portion of the overall hash table. Distributing the hash table in this way optimizes the scalability characteristics while greatly increasing the speed of de‐duplication processing of the underlying data structure. To ensure proper distribution of the hashes, which also serve as the de‐duplication keys, each super node is given responsibility only for a certain range of hashes. Figure 2 illustrates this feature; for example, Super Node A will be responsible for any hashes starting with 00, Super Node B will be responsible for those beginning with 010, and so on.
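The range assignment described above can be sketched as a simple prefix lookup. This is a minimal illustration under stated assumptions, not HYDRAstor's implementation: only the assignments of 00 to Super Node A and 010 to Super Node B come from the text, and reading them as binary prefixes is our interpretation. The remaining prefixes are hypothetical, chosen so that every hash falls into exactly one range, and SHA-1 is used purely as a stand-in hash function.

```python
import hashlib

# Only "00" -> A and "010" -> B are taken from the text. The other four
# prefixes are hypothetical and were picked so the six ranges cover every
# possible hash exactly once.
PREFIX_TO_SUPER_NODE = {
    "00": "A",
    "010": "B",
    "011": "C",
    "10": "D",
    "110": "E",
    "111": "F",
}


def hash_bits(chunk: bytes) -> str:
    """Return the chunk's hash as a string of bits (SHA-1 is an assumption)."""
    digest = hashlib.sha1(chunk).digest()
    return "".join(f"{byte:08b}" for byte in digest)


def owning_super_node(bits: str) -> str:
    """Find the super node whose hash range (prefix) contains this hash."""
    for prefix, super_node in PREFIX_TO_SUPER_NODE.items():
        if bits.startswith(prefix):
            return super_node
    raise LookupError(f"no super node owns hash prefix {bits[:3]!r}")
```

Because each lookup touches only the super node that owns the matching range, no single node has to search the whole table.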
For resiliency purposes, each Storage Node containing a member of any given super node will also be given a copy of the portion of the hash table that manages the data for that super node. In the above minimum configuration, every Storage Node has a member from every super node. As a result, every Storage Node also actually has a copy of the entire hash table. As Storage Nodes are added to the configuration, the
distribution would change to spread the hash table across more Storage Nodes. (This built‐in data protection functionality is one reason to consider deploying more than the minimum configuration.)
HYDRAstor’s division of its hash table across multiple Storage Nodes is quite different from today’s systems that use hashing. In those systems, the more data the system references, the bigger the hash table grows and the slower the performance becomes. HYDRAstor’s method of distributing the hash table across multiple Storage Nodes allows it to scale linearly as you add Storage Nodes, increasing the speed of hash table look‐ups as more “HYDRA heads” grow.
It is important to note that all we did to make the steps discussed so far happen was to create the two shares/file systems on the Accelerator Nodes and point backup processes at them. Everything else, including the creation of the Super Nodes, the system’s resiliency, and the distribution of the hash table for maximum system performance, happened automatically when we powered on the system.
Expanding a HYDRAstor System
Even with the storage efficiencies of data de‐duplication methodologies, you will at some point still need more storage. Unlike with other storage systems, adding capacity to an existing HYDRAstor system is simple: if you need more capacity, you simply plug another Storage Node into the private network. HYDRAstor automatically sees the new node, and the existing Accelerator Nodes automatically start using its new capacity to write new data. HYDRAstor also automatically redistributes existing data to the new Storage Node in order to distribute the read‐write load, optimize performance, and increase resiliency.
This protection is automatically achieved by splitting hash tables and extending Super Nodes or creating new ones: the larger the HYDRAstor system gets, the more nodes it can create to distribute the load across, a key reason why HYDRAstor can scale to tens of petabytes with no decrease in performance. Auto‐load balancing occurs as a background task and does not impact the system’s performance.
Figure 3: An expanded HYDRAstor
(this is a simplified representation for ease of understanding)
Consider the HYDRAstor configuration depicted in Figure 3. We have now added eight more Storage Nodes and four more Accelerator Nodes to get a total of six ANs and 12 SNs. At least one share would need to be created on each new Accelerator Node in order to take advantage of that node’s available bandwidth; as you may remember, you can accomplish this task by simply giving each share a name. However, you do not have to associate that name with any volumes, file systems, Storage Nodes, or super nodes, as HYDRAstor automatically completes that level of configuration for you.
If adding ANs sounds good, let’s take a look at what happens when we add new Storage Nodes. This process is the true magic act, especially when you compare the following scenario to a typical storage system expansion process. To expand HYDRAstor, all you have to do is plug additional Storage Nodes into the private network. That’s it!
During expansion, the HYDRAstor system automatically discovers the capacity of the additional Storage Nodes and starts expanding Super Nodes to increase its performance and resiliency. In our example, where we are expanding outward from the basic setup, HYDRAstor notices that the Super Nodes in the original configuration of four Storage Nodes could be organized for even better resiliency by assigning one member to each of the twelve Storage Nodes, instead of keeping three members on each of the original four SNs and none on the eight added SNs. HYDRAstor will therefore automatically move one member to each of the twelve Storage Nodes in our now expanded configuration. Because there are now many more disks available, HYDRAstor has also automatically created more super nodes (super nodes G‐R) and divided the hash table into smaller pieces for improved resiliency and improved speed.
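The effect of spreading a super node's twelve members across more Storage Nodes can be checked with a little arithmetic. The sketch below is illustrative only: the round-robin placement and the function names are assumptions, not HYDRAstor's placement logic. The figures of twelve members per super node and nine fragments needed to recreate a chunk come from the text; the tolerated-failure count is our own derived estimate, not a vendor specification.

```python
from collections import Counter

MEMBERS_PER_SUPER_NODE = 12  # from the text
FRAGMENTS_NEEDED = 9         # from the text: nine fragments recreate a chunk


def place_members(storage_nodes: list[str]) -> list[str]:
    """Return the Storage Node hosting each member (hypothetical round-robin)."""
    return [
        storage_nodes[i % len(storage_nodes)]
        for i in range(MEMBERS_PER_SUPER_NODE)
    ]


def node_failures_tolerated(storage_nodes: list[str]) -> int:
    """Worst-case number of Storage Nodes that can be lost while a chunk
    in this super node remains readable (derived arithmetic)."""
    per_node = Counter(place_members(storage_nodes))
    spare_fragments = MEMBERS_PER_SUPER_NODE - FRAGMENTS_NEEDED
    return spare_fragments // max(per_node.values())


original = [f"SN{i}" for i in range(1, 5)]    # four Storage Nodes
expanded = [f"SN{i}" for i in range(1, 13)]   # twelve Storage Nodes
```

Under these assumptions, the four-node layout places three members on each Storage Node and survives the loss of one node, while the twelve-node layout places one member on each and survives the loss of three.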
Replacing or Upgrading Nodes in a HYDRAstor System
As with any storage system, it’s important to be able to replace or upgrade components on demand, for example when larger, faster hard drives become available, as inevitably seems to occur. When that happens, HYDRAstor gives you the choice of seamlessly replacing the older units or leaving them in place and adding the newer units as additional capacity.
To update a Storage Node configuration, all you need to do is plug the newer, faster Storage Node(s) into the private network, then tell HYDRAstor which Storage Node(s) you want to retire. HYDRAstor then automatically migrates the data off the soon‐to‐be‐retired Storage Node(s) and onto the new Storage Node(s) you’ve added to the system configuration, all without impacting end user performance or requiring scheduled downtime.
It is important to note just how low impact this update process is — you don’t have to actively manage any of this data migration. All you have to do is the equivalent of “say hello to these new nodes, say goodbye to these old nodes,” and HYDRAstor handles everything else.
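The "say hello, say goodbye" workflow can be pictured with a small model. This is a conceptual sketch only: the data structure, the function name, and the least-loaded placement rule are all hypothetical and say nothing about how HYDRAstor actually schedules its migration.

```python
def retire_nodes(
    placement: dict[str, list[str]], retiring: set[str]
) -> dict[str, list[str]]:
    """Move every member off the retiring Storage Nodes.

    placement maps a Storage Node name to the super node members it hosts.
    Each displaced member goes to the remaining node that currently hosts
    the fewest members (a hypothetical balancing rule).
    """
    remaining = {
        node: list(members)
        for node, members in placement.items()
        if node not in retiring
    }
    if not remaining:
        raise ValueError("at least one Storage Node must stay in service")
    for node in sorted(retiring):
        for member in placement.get(node, []):
            # Ties are broken by node name to keep the result deterministic.
            target = min(remaining, key=lambda n: (len(remaining[n]), n))
            remaining[target].append(member)
    return remaining
```

In this model the administrator supplies only the set of nodes to retire; where each member ends up is decided by the function, which mirrors the hands-off process described above.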
Possibility for Improvement
Starting out as a NAS filer makes HYDRAstor easier to deploy and easier to integrate with existing backup systems, and even in its initial rollout HYDRAstor has more available throughput than most backup servers can use. Since network backup servers can typically handle only about 30‐60 MB/s via their Ethernet interface, you would need multiple backup servers to generate enough data to saturate even one Accelerator Node.
In some really large environments, individual servers may have more than a few terabytes behind them. These servers, however, have typically moved to LAN‐free backups due to the IP load they were experiencing with LAN‐based backups. Because HYDRAstor’s initial release is IP‐based, these servers will not be able to use HYDRAstor for LAN‐free backups: they would need to send their backup data via the LAN to the NFS/CIFS share provided by HYDRAstor.
For this reason, GlassHouse believes the addition of block‐level access constitutes a possible area for improvement for HYDRAstor. GlassHouse has discussed this
suggestion with NEC, and NEC has indicated plans to add this functionality according to customer demand. Due to the HYDRAstor architecture, an enhancement of this type would require a change only in the Accelerator Node code and would not require any change at the Storage Node level, thus greatly simplifying future deployment should this change be made available in a future release.
Total Cost of Ownership
On the surface, it’s easy to see that a device that costs at most $0.90/GB would be less expensive to operate than a device that costs $3‐5/GB, which is what most mid‐range fully‐populated tape libraries cost. The real savings, however, come from the cost of management.
The first type of savings will come from doing away with the management activities typically associated with tape. Media errors will disappear, and failed backups due to media errors will disappear. Therefore, system administrators can spend less time ensuring that backups have completed. In addition, all of the time spent trying to ensure that the tape drives are streaming will also cease. A disk device doesn’t need to be streamed, so these activities are simply not needed.
The biggest savings, though, will come to an environment that is deciding between HYDRAstor and any system that requires maintaining RAID arrays. Think of the time associated with laying out and creating RAID volumes. Consider the time spent ensuring that operations continue successfully when any RAID arrays operate in degraded mode. The real time and management savings come when it’s time to upgrade hardware. Most RAID‐based systems require forklift upgrades and migration of data, which cost management time, cause downtime, and reduce system availability. Some of this cost can be mitigated with migration software, but that software doesn’t come free either.
With the HYDRA architecture, there are never any forklift hardware upgrades. You simply plug in new nodes and tell HYDRAstor which nodes you want to retire, and it does everything else. Last but not least, HYDRAstor grows with your needs, and no matter how large it gets, there is only one system to manage, and it is mostly self‐managed: a big difference from managing and provisioning many RAID arrays separately.
Summary
HYDRAstor truly is a unique creature. While much of the industry is talking about grid storage, NEC is delivering it today with the HYDRAstor platform. It’s difficult to find fault with such a scalable, robust system that allows you to start small, yet incrementally and economically grow to hundreds of petabytes by adding only a few terabytes at a time. With beta deployments starting in Winter of 2006, GlassHouse Technologies looks forward to testing its capabilities as soon as we can.
HYDRAstor, DataRedux, and Distributed Resilient Data (DRD) are trademarks of NEC Corporation; NEC is a registered trademark of NEC Corporation. All other trademarks and registered trademarks are the property of their respective owners. All rights reserved. All specifications subject to change.