Enabling NAS Clusters

The Silicon Storage Appliance

WHITE PAPER Revised July 2006

Robert C. Woolery Vice President, Product Marketing

1 Executive Summary

2 Limits on Computing

2.1 Performance Challenges

2.2 Availability Challenges

2.3 Scalability Challenges

2.4 Management Challenges

2.5 Meeting the Challenge, the Silicon Storage Appliance

3 Next Evolution in Computing

3.1 Cluster NAS Hybrid, Shared SAN File System

3.2 Cluster NAS Hybrid, Parallel File System

3.3 Cluster NAS Hybrid, Object-Based Parallel File System

3.4 Improving Clusters, InfiniBand

3.4.1 RDMA versus SCSI Socket for Block Storage I/O

3.5 S2A9550 InfiniBand Solution

4 S2A9550 Enabling Cluster NAS with InfiniBand

4.1 S2A Unique Cluster NAS Architecture

4.2 S2A9550 Unique Benefits

4.3 S2A9550 Enabling Parallel and Shared SAN File Systems

4.4 NAS with Symmetric and Asymmetric Global File Systems

4.5 Example S2A Cluster NAS Implementations

5 S2A9550 Architecture Discussion

5.1 S2A9550 Hardware Architecture

5.1.1 Host-Side Architecture

5.1.2 Disk Controller Engines (DCE)

5.1.3 System Configuration

5.2 S2A9550 Software Architecture

5.2.1 Host Side

5.2.2 DirectRAID Parity Implementation

5.3 Disk Side

5.4 Management Capabilities

6 S2A9550 Cluster NAS Benefit

Appendix A: S2A9550 and PolyServe NAS Solution for Shared Data Clustering

NAS Appliance Pain

The Economics of File Serving Clusters

The Challenge of Scaling and Managing NAS

The PolyServe Solution

PolyServe versus NAS Appliances

Cost-Effective Scalability

Leverage Existing Resources

Improved Manageability

Inherent High Availability

S2A9550, the Right Storage for PolyServe File Serving

S2A9550 versus Generic SAN RAID

S2A9550 Cluster Benefits

Appendix B: S2A9550 and Lustre File System for Large Scale Clustering

The Pain of NFS in Large Compute Clusters

The Economics of NFS Clusters

The Challenge of Scaling and Managing NFS

The Lustre Solution

Lustre versus NFS

Cost-Effective Scalability

Create High Performance Clusters

Leverage Existing Resources

Inherent High Availability

S2A9550 for Lustre High Performance Clusters

S2A9550 versus Generic SAN RAID

S2A9550 Cluster NAS Benefits

1 Executive Summary

Today, many organizations view data availability as critical. Compounding this challenge is the sheer volume of data. Where a few short years ago organizations needed to access gigabytes of data, now they need to access terabytes quickly moving to petabytes. During this same period, there has been a proliferation of low cost commodity servers with the need to access these large data stores.

This confluence has many organizations resembling high performance computing environments where hundreds to thousands of computers access terabytes to petabytes of data. To harness all this compute power and reduce computing costs, these organizations are employing high performance computing architectures like clustering.

Clustering architecture is perfectly suited to harness the compute power of moderately or loosely-coupled parallel problems and may scale to 10,000 processors. This scale of computing has overwhelmed file serving and generic SAN RAID systems forcing the next evolution in storage architectures, the merger of SAN and NAS (NAS Clusters).

DataDirect Networks’ Silicon Storage Appliances (S2As) were designed with this storage architecture, and future interconnects like InfiniBand (IB), in mind. DataDirect’s S2A family is highly enabling in NAS Cluster implementations and in particular for the file systems that support them: parallel and shared SAN.

In the case of parallel file systems, the S2A’s capability to provide wire-speed performance through multiple target LUNs and huge capacity scalability allows a single S2A9550 to saturate the capability of multiple file servers with far fewer components and therefore far better reliability, system efficiency and cost than generic RAID or alternate approaches. For shared SAN file systems, the S2A’s “PowerLUN” feature can support full-speed transfers in parallel to multiple computers simultaneously greatly reducing the software striping and network complexity required of generic RAID system implementations.

Building a high-performance networked computing environment using commodity CPU and storage devices results in complex, hard to scale and hard to manage systems. One of the goals of an InfiniBand topology is to improve performance and scalability by collapsing network and channel fabrics into one consolidated interconnect. The S2A9550 InfiniBand Solution is the next step in this evolution, reducing complexity by using a cache-centric parallel architecture to minimize the encumbrances of the physical hardware stack and software protocol stack, allowing for very tight integration via native parallelism.

Designed for High Performance Computing (HPC) and Rich Media, DataDirect's S2A9550 is 3.5 times faster than other storage systems in the world. With multiple levels of redundancy, double parity protection and Dual Redundancy Disk for RAID 6 functionality, the S2A storage solution delivers enterprise reliability with world-class performance.

In all cases, the S2A9550 can provide higher aggregate NAS system performance with greater scalability and with far fewer components. The result is the least amount of complexity possible, and the best price per performance and best price per capacity per square foot storage solution in the industry.

This paper provides a basic overview describing the most commonly used storage strategies for clusters, and some emerging technologies that attempt to answer the problems associated with larger-scale data distribution in large clusters. It assumes the reader is familiar with the basics of RAID, file serving, Network Attached Storage (NAS), distributed file systems, and their storage management, scalability and performance shortcomings.

The focus of this paper will be on the next level solution that overcomes these shortcomings and enables large-scale shared data clusters via an integrated SAN and NAS solution and the use of InfiniBand. This white paper will provide a technical overview of the S2A9550 Solution architecture and how it enables large-scale shared data clusters that integrate NAS and SAN storage solutions. It will also show why the S2A9550 is the best choice for NAS solutions by providing a more effective utilization of hardware and software resources, simpler scalability, and increased availability and performance.

2 Limits on Computing

Today, many organizations view data availability as critical. Compounding this challenge is the sheer volume of data. Where a few short years ago organizations needed to access gigabytes of data, now they need to access terabytes quickly moving to petabytes.

During this same period, compute platforms have improved their price/performance curve to the point where they can populate an organization easily and cheaply. In fact, many organizations now replace computers rather than repair them. This confluence of events has many organizations resembling high performance computing environments where hundreds or even thousands of low cost “commodity” computers need access to petabytes of data. This confluence has forced computing, in particular file serving and SAN RAID, to evolve.

2.1 Performance Challenges

Performance challenges follow data availability, especially when it comes to large amounts of data. On some levels, servers are not working to full capacity and on other levels they are overutilized.

On some storage levels, capacity is captive in islands and underutilized. On others, file servers and generic SAN RAID are unable to scale linearly or cost effectively to solve performance bottlenecks.

On the network level, the client/server model is leveraging low cost bandwidth and taking advantage of high performance interconnects and messaging protocols. But as channel fabrics scale, they introduce complexity and latency that limit performance.

2.2 Availability Challenges

The effects and costs of downtime are well known. Although downtime for an individual system is inevitable, data does not have to be unavailable. Along with the performance challenges, organizations are looking to evolve their file serving plants to provide high availability to large pools of data.

2.3 Scalability Challenges

Many storage solutions work well with a small number of clients but few work well when the count increases. Typical issues are stable performance, cost effective scalability and the ability to handle a storm of I/O.

It is now common for commodity computers to be clustered together to handle more complex I/O intensive tasks. In fact, clusters are rarely used in isolation; where there is one, there are usually many. These clusters quickly overwhelm NAS and generic SAN RAID architectures, and adding NAS appliances or generic SAN RAID increases complexity and cost with limited aggregate performance gain.

2.4 Management Challenges

Managing a high volume of servers is difficult and time consuming. With each file server implemented, IT must maintain system hardware, deliver software updates, and identify, locate and correct system problems.

Storage management has also become more complex and time consuming. With generic SAN RAID, IT must make extensive use of striping to meet performance demands. This makes it complicated and lengthy to re-architect the environment when performance or capacity demands change. The result is an infrastructure that is difficult to scale and expensive to manage.

2.5 Meeting the Challenge, the Silicon Storage Appliance

To meet these challenges, a storage solution needs to be architected from the ground up with a clustered world in mind: an environment that enables thousands of computers to access petabytes of storage. The architecture must scale linearly and cost effectively for performance, capacity and availability. It must be easy to manage, lowering the cost of management. And it must be flexible enough to leverage evolutions in technology while protecting existing investments.

DataDirect’s S2A family was designed from the ground up to meet the challenges of high performance computing NAS Cluster implementations and in particular the file systems that support them. The Silicon Storage Appliance 9550 (S2A9550) is the world’s fastest storage solution, delivering up to 3 Gigabytes per second whether reading or writing data. Designed for HPC and Rich Media, DataDirect's S2A9550 is 3.5 times faster than other storage systems. With multiple levels of redundancy and double parity protection, the S2A storage solution delivers enterprise reliability with world-class performance.

The S2A9550 leads the industry in scalability and flexibility. The S2A9550 can fit 5 times more capacity than any other single storage system, supporting up to 1,120 disk drives including Fibre Channel, SATA and (in the future) SAS.

The S2A9550 flexibility extends to server connectivity as well. With its modular design, the S2A9550 can support mixing and matching Fibre Channel-4 (FC-4) and InfiniBand-4X (IB-4X) front end server connectivity protecting your current investment with an upgrade path to the future. All this scalability and flexibility is easily managed from a single storage system.

DataDirect’s S2A9550 delivers a world-class storage solution with an industry disruptive price. Regardless of your configuration, the S2A9550 can deliver the industry’s best price/performance or best price/capacity/square foot solutions.

The S2A9550 difference is its architectural approach. A 7th generation RAID storage networking system, the S2A9550 is built around a lightning fast cache-centric design. S2A9550 has 11.2 Gigabytes/second of internal bandwidth, supported by a parallel and fully redundant host and disk side architecture with complete separation of command and data paths.

This cache-centric design is supported by eight plug and play FC-4 or four IB-4X host side connections (FC and IB host ports can be intermixed) and 20 parallel disk side ports. A fully redundant host and disk side architecture with real-time parity checking (dual parity protection) reliably supports up to 480TB in two racks.

This architecture delivers a unique ability to source all of its performance from a single Logical Unit (PowerLUN) through truly parallel host accesses. This architectural approach puts it in a class by itself. The S2A9550’s architectural parallelism and raw performance enables and enhances all high-performance shared file system environments providing concurrent and redundant parallel access, eliminating the need for host-based software striping freeing up server, storage and network resources, reducing software and hardware purchases as well as management costs. Moreover, this architectural parallelism enables mass scalability with robust reliability. Architectural parallelism extends through the entire system to the back end allowing drive configurations that support up to 1,120 Fibre Channel or SATA disks. Built into this parallelism are advanced data protection features enabled by RAID protected cache and disks, dual parity protection and Dual Redundancy Disk (protecting against double disk failures in the same redundancy group). This parallel, fully redundant, dual pathing front and back end delivers true mission critical data integrity and performance.

The S2A9550 overcomes the performance, availability, scalability and management challenges organizations face in deploying cluster architectures with global file systems and NAS storage. Scale as fast and large as you need without expensive re-architecting. Provide the highest levels of Quality of Service for consistent, predictable data delivery to real-life application environments. Deploy, support and manage simply and easily, reducing management costs. All this with enterprise-wide reliability ensuring high data availability and uptime even while servicing full host loads.

As a platform for parallel file systems and shared SAN file systems, high performance computing clusters, visualization and simulation, animation, special effects and post-production, as well as broadcast streaming, the S2A9550 InfiniBand Solution and Fibre Channel 4 Solution are unmatched.

This design, with optimal block level and file system performance and massive scalability, supports a broad range of computational and visualization environments with NAS solutions. DataDirect’s S2A offers broad infrastructure support, powering compute clusters from IBM, Dell, HP, Cray, SGI, Bull and others.

With the S2A9550, organizations can create the industry’s best price per performance and best price per capacity per square foot disk-based solutions for parallel and shared SAN file systems, and high performance NAS computing clusters.

3 Next Evolution in Computing

Clusters using commodity compute nodes are gaining rapid adoption, generally running Linux (although MS Windows and Apple OS X have significant installed bases) on either Intel or AMD CPUs. These systems are perfectly suited for moderately or loosely-coupled parallel problems and may scale into the tens of thousands of processors. These clusters require global file systems and can scale to petabyte levels. These CPU clusters are typically served by an I/O cluster of server nodes that are directly attached to SAN storage. The I/O nodes are often called “NAS Gateways” in the case of standard (NFS/CIFS) protocol servers; when using Lustre, for example, they are called “Object Storage Servers” (OSSs) or “Object Storage Targets” (OSTs). I/O nodes can range in performance from 200-300 MB/s all the way to a recently demonstrated 2.5 GB/second for a high-performance Lustre Object Storage Server.

Another computing architecture is smaller clusters of high-throughput visualization or content-creation systems. These environments typically require cross-platform capability mixing NAS and SAN. Besides direct SAN-attached systems, they will often have small to moderate compute clusters that necessitate attaching to file servers. Often, real-time streams must be served in these environments since video or film-resolution (or greater) images must be generated. Throughputs on a per-seat basis can range up to a Gigabyte per second, or more for larger systems with multiple graphic pipelines.

In an attempt to meet the demands of compute clusters, and accelerated by the InfiniBand trend, SAN and NAS storage have started to merge into an integrated cluster NAS solution. The goal is to deliver the advantages of both (cost effective performance, availability and scalability) without the current limitations.

Cluster NAS solutions, however, are severely limited by generic SAN RAID systems. Generic SAN RAID was designed as a general-purpose device for database and back-office environments, not for large or scalable cluster NAS solutions or InfiniBand fabrics. Generic SAN RAID suffers from poor linear performance scalability. While this may be acceptable when accessing data from a small number of lower performance computers, it does not work well when the client count or the performance demands increase. Lacking a parallel design, generic SAN RAID systems congest fabrics and create switching latencies and port contention in the SAN. Lacking a “PowerLUN” feature, they require host striping, which robs CPU performance from the file servers. Attempts to overcome these shortcomings require many more components, increasing complexity and cost.

3.1 Cluster NAS Hybrid, Shared SAN File System

One of the leading ways to address the performance and availability limitations of NAS file servers is to employ a SAN behind the NAS servers. This scalable NAS hybrid utilizes an exportable SAN file system to allow each NAS file server to mount and re-export the exact same file system located on the shared SAN storage. Thus each NAS file server becomes a window into the same data. Adding NAS file servers adds aggregate performance to the file system (Diagram 1).

This scalable NAS hybrid also leverages the SAN’s ability to consolidate and manage storage centrally, making it easy to scale on demand. This ease of scalability allows organizations to create a more efficient storage infrastructure by reallocating unused capacity while increasing performance.

This approach is quite attractive, as no additional software is required on each individual compute node (or any node on the TCP/IP network, for that matter), since native NFS and/or CIFS clients are available on virtually every operating system. Of course each individual client will still only realize “normal” NAS performance. High performance clients may be direct-attached to the SAN, providing a combined NAS hybrid.

PolyServe Matrix Server, Ibrix Fusion, Red Hat GFS, ADIC StorNext and SGI CxFS are examples of shared SAN file systems.

Diagram 1: Cluster NAS Hybrid, Shared SAN File System

3.2 Cluster NAS Hybrid, Parallel File System

Another approach to meet the cluster challenge is the implementation of a parallel file system with a NAS hybrid. Parallel file systems come in two flavors, parallel NAS or object-based.

Parallel global file systems are client/server implementations that utilize both client node software and server node software to provide data access, generally via a custom application protocol running on top of a TCP/IP protocol stack. This special file system client code spreads file requests across multiple NAS file servers. This protocol may be implemented as an object-based abstraction such as that provided by the open-source Lustre™ file system from Cluster File Systems Inc, or the more traditional general approaches of IBM’s GPFS or the open-source PVFS developed by the university community. Server nodes in parallel file systems typically do not share block-level storage data with other server nodes; rather the client nodes share all the server nodes (Diagram 2).
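To illustrate how client code can spread one file request across multiple file servers, here is a minimal sketch of round-robin striping. It is an illustration only and does not reproduce the layout used by Lustre, GPFS or PVFS; the function name `stripe_requests`, the stripe size and the server count are assumptions introduced for this example.

```python
from typing import List, Tuple


def stripe_requests(offset: int, length: int,
                    stripe_size: int, num_servers: int) -> List[Tuple[int, int, int]]:
    """Split a byte range of a file into per-server chunks.

    Hypothetical round-robin layout: stripe k of the file lives on
    server k % num_servers. Returns a list of
    (server_index, offset_on_server, chunk_length) tuples.
    """
    if stripe_size <= 0 or num_servers <= 0:
        raise ValueError("stripe_size and num_servers must be positive")
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")

    chunks = []
    end = offset + length
    while offset < end:
        stripe_index = offset // stripe_size
        within_stripe = offset % stripe_size
        server = stripe_index % num_servers
        # Each server stores its stripes back to back.
        server_offset = (stripe_index // num_servers) * stripe_size + within_stripe
        chunk_len = min(stripe_size - within_stripe, end - offset)
        chunks.append((server, server_offset, chunk_len))
        offset += chunk_len
    return chunks


# A 4 KiB read with 1 KiB stripes over three servers touches every server,
# so the three servers can work on the request in parallel.
for chunk in stripe_requests(0, 4096, 1024, 3):
    print(chunk)
```

Because consecutive stripes land on different servers, a large sequential request is served by several servers at once, which is where the aggregate throughput gain of a parallel file system comes from.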

The traditional approach can provide the benefit of aggregating the throughput performance of the multiple NAS servers, but single client performance is typically not improved. Unfortunately most parallel NAS file systems do not yet scale efficiently or eliminate performance bottlenecks, and each NAS file server does not provide any better performance than normal NAS systems. Therefore, scaling in general is still very difficult and not linear. Complexity and cost still remain.

Traditional parallel file systems are designed for large numbers of lower bandwidth CPU nodes, but are not well suited for extremely high-bandwidth computing needs since block-level storage is usually a better way to carry very high single-system throughput. They are also typically supported only in homogeneous operating system environments.

Creating a NAS hybrid architecture can be a good complement to drive multiple server nodes in a traditional parallel file system implementation. This requires a SAN RAID system that can easily virtualize access to individual LUNs by individual computers, and can scale performance linearly as additional host ports are utilized. In addition, it should have a wide range of LUN layout options supported by an internal non-blocking architecture to allow for inherent load balancing.

Diagram 2: Cluster NAS Hybrid, Parallel File System

3.3 Cluster NAS Hybrid, Object-Based Parallel File System

There is a next generation approach that shares the same physical hardware architecture as a traditional parallel file system with NAS hybrids, but replaces the file-level calls and server implementation with object-based calls and a streamlined protocol.

This approach yields performance levels very similar to direct SAN implementations at the clients, and better performance through the OSS than NFS/CIFS approaches offer. Additionally, the file system is parallel from the client to the OSSs, thus the aggregate file system performance increases as OSSs are added. Object-based implementations have been deployed in a number of supercomputing applications, demonstrating impressive performance and scalability gains.

Each client must have a file system driver installed, although these drivers are lightweight and consume less CPU overhead than NFS or CIFS client implementations. Lustre is currently the leading open-source object-based storage file system, with numerous sites running in production.

Creating a NAS hybrid solution is also a good complement for object-based parallel file systems. This requires a SAN RAID system that delivers multiple streams simultaneously to multiple computers at wire speed. It also must scale performance linearly as additional host ports are utilized, without additional SAN RAID systems and host based striping. In addition, it should deliver a solution with fewer components, increasing reliability and system efficiency as well as dramatically reducing costs.

Diagram 3: Cluster NAS Hybrid, Object-Based Parallel File System

3.4 Improving Clusters, InfiniBand

Even with all the innovation in cluster NAS architectures and in parallel and shared file systems, there is room for improvement. First, collapse the network and channel fabrics into one consolidated fabric, delivering a better, faster, lower latency interconnect that will reduce the number of ports needed and dramatically lower port and management costs. Second, improve the block-level protocol implementation by using RDMA rather than SCSI sockets. The obvious choice to do this is InfiniBand (IB).

The term InfiniBand is derived from the words “infinite bandwidth.” IB is a low latency, switched-fabric I/O technology that ties together servers, storage devices and network devices, easing the I/O subsystem bottlenecks created by data-intensive files such as streaming video, voice and audio. Unlike the present I/O subsystem in a computer, IB is a full-fledged I/O network and the bus itself is a switch with control information that will determine the route a given message will follow in getting to its destination.

With IB, data is transmitted in packets that together form a communication called a message. A message can be a remote direct memory access (RDMA) read or write operation, a channel send or receive message, a transaction-based operation (that can be reversed), or a multicast transmission. Like the channel model many mainframe users are familiar with, all transmission begins or ends with a channel adapter. Each processor (your PC or a data center server, for example) has what is called a host channel adapter (HCA) and each peripheral device has a target channel adapter (TCA). These adapters can potentially exchange information that ensures security or work with a given Quality of Service level.

IB technology will be used to connect servers with remote storage and networking devices, and other servers. It will also be used inside servers for inter-processor communication in parallel clusters. Customers requiring dense server deployments, such as HPC and ISPs, will also benefit from the small form factors being proposed. Other benefits include greater performance, lower latency, easier and faster data sharing, built in security and quality of service, improved usability (the new form factor will be far easier to add/remove/upgrade than today's shared-bus I/O cards).

Additionally, IB architecture reduces total cost of ownership by focusing on data center reliability and scalability. The technology addresses reliability by creating multiple redundant paths between nodes (reducing hardware that needs to be purchased). It also moves from the load-and-store-based communications methods used by shared local bus I/O to a more reliable message passing approach.

Scalability needs are addressed in two ways. First, the I/O fabric itself is designed to scale without encountering the latencies that some shared bus I/O architectures experience as workload increases. Second, the physical modularity of IB technology will avoid the need for customers to buy excess capacity up-front in anticipation of future growth. Instead, they will be able to buy what they need at the outset and "pay as they grow" to add capacity without impacting operations or installed systems.

3.4.1 RDMA versus SCSI Socket for Block Storage I/O

The second improvement for clustering is combining RDMA messaging with IB. Using SCSI Socket for block storage I/O, the transfer moves through multiple layers from the I/O server node to the RAID storage system creating copies at each stop along the way. In some cases there are at least seven copies of the data. Managing and handling of these copies creates CPU sapping and latency-inducing software heroics at the file system or interface card driver level.

[Diagram 4 labels: CPU nodes; I/O server nodes (file system, system memory, socket buffers, adapter buffers; multiple copies); RAID system (adapter buffers, socket buffers, RAID cache, raw disk; multiple copies); standard SCSI socket block-level transfers; single InfiniBand fabric, logically segmented into file services and storage services.]

Diagram 4: SCSI Socket Block Storage I/O

With RDMA messaging, these copies are reduced to fewer than half, because data is transferred directly from system memory to RAID cache. This creates a more efficient I/O network, reducing the CPU, network and storage resources consumed as well as lowering component costs. Management also becomes less complicated, with fewer components and software layers to handle.

[Diagram 5 labels: CPU nodes; I/O node (file system, system memory); RAID system (RAID cache, raw disk); RDMA block-level transfers with zero memory copy; single InfiniBand fabric, logically segmented into file services and storage services.]

Diagram 5: RDMA Block Storage I/O

3.5 S2A9550 InfiniBand Solution

The S2A9550 can also be configured as a native InfiniBand Solution. The S2A9550 InfiniBand Solution delivers up to 3 GB/second sustained performance and 1,120 drive support making it the fastest, most scalable InfiniBand storage solution in the market. This combination of speed, capacity and density leads the InfiniBand storage industry with the best price/performance and best price/capacity/square foot solution.

The S2A9550 InfiniBand Solution has the same lightning fast cache-centric design as the Fibre Channel 4 solution supporting a parallel and fully redundant host and disk side architecture. The difference is host connectivity. The modular design of the S2A9550 allows for interchangeable front end host ports from Fibre Channel to InfiniBand and back. You can mix and match front end connectivity delivering infrastructure protection.

Building a high-performance networked storage environment using commodity storage devices results in complex, hard to scale and hard to manage systems. One goal of InfiniBand is to reduce complexity, and improve performance and scalability by collapsing network and channel fabrics into one consolidated low cost interconnect. The benefits are a better, faster, lower latency infrastructure. What’s in it for organizations? Cost effective, easy to scale infrastructure that can increase processing, bandwidth and capacity by simply adding building blocks.

The S2A9550 InfiniBand Solution is the next step in this evolution reducing complexity and improving performance and scalability by collapsing network and channel functionality into an InfiniBand RAID solution. Using a cache-centric parallel architecture to minimize the encumbrances of the physical hardware stack and software protocol stack, the S2A9550 InfiniBand allows for very tight integration with InfiniBand via native parallelism.

Modern shared storage networks employ file systems to provide simultaneous and parallel access to thousands of compute nodes. These large global file systems, whether parallel or shared SAN implementations, mandate true parallelism to both host and LUN storage targets.

Native parallelism in the S2A9550 InfiniBand Solution inherently provides multi-pathing and parallel access without requiring CPU sapping and latency-inducing software heroics at the file system or interface card driver level. This parallel architecture around an intelligent cache will reduce CPU, network and storage resources for less complicated, lower cost per port infrastructure.

The S2A9550 InfiniBand Solution’s parallelism allows true simultaneous access from multiple servers at wire speeds. This parallel architecture around an intelligent cache enables optimized, incredibly fast and easily scalable cluster NAS implementations.

The S2A9550 includes RDMA messaging for tight integration with system memory. This innovation, when exploited by file system vendors, will leverage the lightning fast cache and parallel architecture design of the S2A9550, creating the next level InfiniBand solution.

As a tightly integrated platform, the S2A9550 InfiniBand Solution delivers a high performance cluster NAS solution, allowing High Performance Computing and Rich Media environments to create large shared storage network file system implementations easily, reliably and affordably.

4 S2A9550 Enabling Cluster NAS with InfiniBand

The S2A family in general and the S2A9550 in particular have been specifically designed to enable the creation of the world’s fastest and most scalable NAS shared storage InfiniBand networks. The I/O requirements of the high-throughput computing facilities that need these networks are growing exponentially, driven by the rise of parallel compute processing and the workflow requirements of cooperative content acquisition and/or generation, simulation, analysis, visualization and data archival.

Modern shared storage network file systems must provide truly simultaneous and parallel access to potentially many thousands of compute nodes, with aggregate bandwidth requirements reaching into the hundreds of gigabytes per second (GB/s), and individual computers requiring in excess of 1 GB/s.

Examples of these scalable NAS shared storage network implementations are found at HPC facilities performing basic scientific research; earth, atmospheric, and space sciences; biotech and life sciences; and large system engineering design. In addition, the communication, entertainment, and on-line content industries require thousands (or many more) streams of real-time shared content and in some cases multiple individual streams of data whose rates can exceed 1 GB/s. Only highly-scalable NAS shared storage driven by S2A parallel block-level storage controllers has been proven to reliably meet these needs in a manageable and affordable way.

The nature of large-scale InfiniBand NAS access mandates that the block-level storage be capable of supplying true parallelism in both host and Logical Unit (LUN) storage targets. The S2A9550 block-level storage controllers embody this parallelism via a multi-processing, multi-ported core that inherently allows truly simultaneous access from multiple host ports driven at wire speeds and, significantly, even to the same block of data or single LUN volume.

4.1 S2A Unique Cluster NAS Architecture

The S2A family of products is differentiated by several key attributes from any other available storage system. First and foremost is the raw sustained bandwidth from media. The current flagship S2A9550 sustains up to 3 GBytes/s in reads AND writes, and due to the family’s in-band parallel parity processing engine (P3E) calculates not only write parity but checks read parity in real time (dual parity protection) delivering increased data reliability. With its broad backend, there is no performance impact in the case of a disk failure or even under a disk rebuild (and rebuilds still actually happen in a timely manner).

Using 300GB Fibre-Channel disk drives, a single S2A9550 can scale to 336TB in total capacity. With 500GB SATA drives, 560TB of total capacity may be realized. And with 750GB SATA drives (later in 2006), an S2A9550 will scale to 840TB. The S2A’s ability to virtualize the storage through LUN aliasing, WWN masking/filtering and port zoning allows for very easy deployments and ongoing system management. Additionally, a large variety of statistical data is available and clearly presented to enable easy tuning, optimization, and network troubleshooting.

The DataDirect S2A9550’s unique parallel architecture allows even a single “PowerLUN” (RAIDed disk target) to provide the entire aggregate bandwidth of the system – without file system based software striping that can cause excessive switching latencies and storage-port contention in InfiniBand fabric switches. Provided by a shared memory architecture and advanced cache coherency capabilities, this parallelism enables far greater scalability than any system built with generic RAID systems (almost all of which are designed as general-purpose devices for databases and back-office environments). In fact, thanks to the S2A’s multi-port architecture, many SAN storage networks may be built without any InfiniBand switches at all.

Diagram 6 below contrasts the S2A approach on the left with what a “comparable” generic storage area network (SAN) would look like on the right. The generic SAN is characterized by multiple RAID systems (probably dual-controller, with at least two LUNs per controller) with small, limited-performance LUNs striped by the file system across and through an InfiniBand switch (in practice probably two, for redundancy). In this diagram note that each and every computer, for each and every I/O it performs, must stripe across all of the 8 storage ports (an “8-way stripe”). Each one of those I/Os therefore generates 8 switching events, incurs 8 switching latencies, and monopolizes each switch port connected to the RAIDs for 1/8th of the time, not including wait states introduced when ports are busy with other computers’ requests. When all 8 computers operate simultaneously, those RAID switch ports start to exhibit heavy congestion, very similar to dialing through a telephone switch and getting busy signals. That “port contention” and build-up of switching latencies quickly robs the SAN of aggregate performance and severely inhibits scaling.

Diagram 6: S2A and Generic SAN Comparison

In contrast, the S2A provides access in a much more efficient and reliable manner. Access to the S2A behaves much more like straws into a glass of water – each of the 8 computers in the diagram has private and full-time access to the S2A “liquid” (storage), without any switching events or port contention. The liquid can simultaneously be accessed by the straws truly in parallel – in fact a single LUN can be built that can all by itself support up to 3 GBytes/s in the S2A9550, without any file-system based striping! Doubling the aggregate performance using two S2A systems can be done with only a 2-way stripe – versus likely a 32-way stripe using eight “competitive” RAID arrays to attempt to get the same performance.
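The port-contention argument can be put into numbers with a small sketch. It uses only the figures quoted for Diagram 6 (8 computers, an 8-way stripe, 8 storage ports); the function names are illustrative, and the model deliberately ignores switch internals, queue depths and transfer sizes.

```python
def switching_events_per_io(stripe_width: int) -> int:
    """An I/O striped N ways is split into N pieces, one per storage port,
    so it crosses the switch N times."""
    return stripe_width


def requests_per_storage_port(hosts: int, stripe_width: int, storage_ports: int) -> float:
    """Average number of I/O pieces competing for each storage port
    when every host issues one I/O at the same moment."""
    return hosts * stripe_width / storage_ports


# Generic SAN in the diagram: 8 hosts, 8-way stripe, 8 RAID ports behind a switch.
generic_events = switching_events_per_io(8)               # 8 switching events per I/O
generic_contention = requests_per_storage_port(8, 8, 8)   # 8.0 pieces queued per port

# S2A in the diagram: 8 hosts, no file-system striping, one private host port each.
s2a_contention = requests_per_storage_port(8, 1, 8)       # 1.0, no port is shared

print(generic_events, generic_contention, s2a_contention)
```

Under this simplified model, the 2-way versus 32-way comparison in the text corresponds to 2 versus 32 pieces per I/O, which is where the difference in switching overhead comes from.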

The S2A approach is far simpler and scales much better than competing architectures not only because of its tremendous read and write performance, but also because of its unique parallel architecture.

4.2 S2A9550 Unique Benefits

First and foremost, the S2A9550 provides a large increase in throughput over previous S2A models and sustains up to 3 Gigabytes/second in both reads and writes. This unprecedented performance is several times faster than any other available block-level storage device, and extends the S2A family’s already wide throughput performance lead.

In addition to the family benefits shared by all S2A systems, the S2A9550 has a groundbreaking new modular architecture that provides a versatility and customization capability never before available in high-performance storage networking systems.

The S2A9550 features modular front and back-end interface architectures. At the host side, each S2A9550 Couplet (controller pair) may initially be configured with up to eight FC-4 or four InfiniBand ports; or a combination of the two.

The back-end architecture features a similar modularity with supported interface options for native Fibre-Channel and SATA/SAS disk attachment.

This host and disk interface modularity is implemented using a well-defined physical hardware and software driver structure designed to allow for easy porting of future interface types. Faster versions of Fibre-Channel, InfiniBand, and SATA/SAS and other interface options such as Ethernet and PCI Advanced Switching may become common storage interconnects as well. The S2A9550 frees users to choose the right connectivity infrastructure and disk types for their particular requirements, and be assured of a rich growth path and investment protection.

Also implemented for the first time in an S2A system is a true Dual Redundancy Disk capability, which overcomes double disk failures in the same redundancy group without adversely affecting data availability or system bandwidth. This function provides data access protection even in the case of multiple simultaneous disk failures in individual parity groups, which is especially important with very large disk farms and with modern ultra-capacity drives that can take quite a while to rebuild.

4.3 S2A9550 Enabling Parallel and Shared SAN File Systems

The S2A9550 is a perfect complement to drive multiple server nodes in a parallel file system implementation due to its ability to easily virtualize access to individual LUNs by individual computers, and its linear performance scalability as additional host ports are utilized. In addition, the S2A family allows a wide range of LUN layout options to allow for inherent load balancing if so configured.

Parallel file systems typically scale very well for large numbers of CPU nodes due to a tight integration between the CPU client and I/O server software, but are not well suited for extremely high-bandwidth computing; block-level storage is usually a better way to carry very high single-system throughput. Parallel file systems are also typically supported only in homogeneous operating system environments.

Shared SAN file systems by contrast utilize a sharing mechanism at the block storage level. Thus there are no “server nodes” that the client nodes must attach to. Every node running the shared SAN file system software shares and can access all of the block-level storage; there is no privately held storage under the nodes as in parallel file systems. The shared SAN nodes MAY, however, be used to export the shared SAN file system via common networking storage protocols such as NFS or SMB/CIFS to “client nodes”, thus emulating the functionality inherent in “parallel” file systems.

Shared SAN file systems are typically better suited to smaller numbers of high-performance clients than parallel file systems. Individual clients may be provided with extremely high performance, but the re-export of the file systems to the more general clustered CPU nodes is performed using the NFS or CIFS/SMB protocols, which are not as efficient or scalable as the custom protocols found in parallel file systems. Many shared SAN file systems also have the additional benefit of natively supporting cross-platform operating systems.

While the providers of both types of file systems are improving their offerings, and over time the advantages and disadvantages of each approach may become less clear, the S2A architecture provides, and will continue to provide, significantly better system-level efficiency in clustered and networked environments.

Diagram 7: Parallel File System Implementation

Diagram 8: Shared SAN File System Implementation

DataDirect’s S2A family is highly enabling in both parallel and shared SAN file systems with cluster NAS implementations and InfiniBand fabrics, in particular. In the case of parallel file systems, the S2A’s capability to provide wire-speed performance with inherent load balancing through multiple target LUNs and huge capacity scalability allows a single S2A9550, for example, to saturate the capability of multiple file servers with far fewer components and therefore far better reliability and system efficiency than alternate approaches.

For shared SAN file systems the S2A’s “PowerLUN” capability to build LUN targets that can support full-speed transfers in parallel to multiple computers simultaneously greatly reduces the software striping and network complexity required versus generic RAID system implementations.

In all cases, the S2A9550 can provide higher aggregate cluster NAS system performance for global file systems, with greater scalability and with far fewer components than any other solution – resulting in the least amount of complexity possible.

4.4 NAS with Symmetric and Asymmetric Global File Systems

All file systems, whether single-computer, parallel or shared SAN, must maintain the metadata associated with where and how the data has been placed upon the target storage. This metadata is the map that allows the user application to find its data. Most file systems also provide a locking mechanism that ensures two user processes do not try to write to the same place on the target at the same time. Additionally, most modern file systems use a journal to track all file system changes, which is essential if a problem does occur and the file system needs to be “rolled back” to a known good point in time for recovery.

Both parallel and shared SAN file systems may be implemented with their file system metadata either out-of-band (asymmetric) or in-band (symmetric). The S2A family easily supports and enables both symmetric and asymmetric file system types.

Asymmetric global file systems utilize an out-of-band (generally TCP/IP) path to a (usually) dedicated metadata server that is solely responsible for maintaining all metadata transactions. For metadata performance scalability some asymmetric file systems now support clustered metadata servers. At a high level this architecture requires that each client computer request access privileges and file allocation information from the metadata server before it can access data. The benefit of asymmetric architectures is that the in-band data path is utilized solely for data transfers, providing for extremely high sequential data throughput. The S2A9550 is extremely well suited for this type of data transfer, primarily as a result of its nearly perfect streaming performance profile.

Symmetric global file systems access the metadata through the same path as the actual data access. Symmetric file systems might or might not require dedicated metadata servers, and can place the metadata itself either in private LUNs or distributed among the file system data. The S2A architecture is ideal for symmetric file systems due to its truly parallel data access capability and the ability to serve multiple requests concurrently via deep command queues and wide back-end disk architecture. Moreover, the private metadata LUN can be locked in the S2A9550’s cache providing even faster response times.
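The difference between the two metadata paths can be illustrated with a toy model. Everything in it (class names, the dictionary-based "storage", the request counters) is hypothetical and stands in for no particular file system; it only shows where the metadata lookup travels in each case.

```python
from dataclasses import dataclass, field


@dataclass
class BlockStorage:
    """Toy block target that counts every in-band request it serves."""
    blocks: dict = field(default_factory=dict)     # LBA -> data bytes
    metadata: dict = field(default_factory=dict)   # file name -> list of LBAs (symmetric case)
    in_band_requests: int = 0

    def read_block(self, lba):
        self.in_band_requests += 1
        return self.blocks[lba]

    def read_metadata(self, name):
        self.in_band_requests += 1
        return self.metadata[name]


@dataclass
class MetadataServer:
    """Toy dedicated metadata server reached over a separate, out-of-band path."""
    extents: dict = field(default_factory=dict)    # file name -> list of LBAs
    out_of_band_requests: int = 0

    def lookup(self, name):
        self.out_of_band_requests += 1
        return self.extents[name]


def asymmetric_read(name, mds, storage):
    """Ask the metadata server where the file lives, then read blocks in-band."""
    lbas = mds.lookup(name)
    return b"".join(storage.read_block(lba) for lba in lbas)


def symmetric_read(name, storage):
    """Fetch the metadata through the same path as the data itself."""
    lbas = storage.read_metadata(name)
    return b"".join(storage.read_block(lba) for lba in lbas)
```

In the asymmetric case the storage only ever sees data requests, which matches the streaming profile described above; in the symmetric case the small metadata reads share the data path, which is why concurrent request handling and a cache-resident metadata LUN matter there.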

| Global File System | File System Type | Symmetric or Asymmetric | Dedicated Metadata Server | S2A9550 Enabling Benefits |
|---|---|---|---|---|
| Cluster File Systems Lustre | Parallel | Asymmetric | Yes | Native load balancing and failover. 40-50% per-server performance gains. |
| IBM GPFS | Parallel or Shared SAN | Symmetric | No | 50% sustained throughput performance gain. Reduced complexity. |
| SGI CXFS | Shared SAN | Symmetric | Yes | Improved QoS and write performance improvements. 30% component reduction. |
| ADIC StorNext | Shared SAN | Asymmetric | Yes | Native active/active multipathing support. Managed latencies and linear scalability. |
| Red Hat GFS | Shared SAN | Symmetric | No | 75% performance gain for exported NFS. |
| Ibrix Fusion | Parallel and Shared SAN | Symmetric | No | Reduced hotspots and massive capacity scalability. |
| Polyserve Matrix Server | Shared SAN | Symmetric | No | Eliminated management and volume scaling issues. Greatly simplified architecture. |

Table 1: Cluster NAS Implementations with Common Global File Systems

4.5 Example S2A Cluster NAS Implementations

Lawrence Livermore National Laboratory uses IBM’s GPFS file system for ASCI White and Blue Pacific with DataDirect’s S2A Storage Networking systems for solving a wide range of scientific problems, including molecular modeling, quantum molecular dynamics, the dynamics of turbulent fluid flow, and material modeling. GPFS is a symmetric file system that may be used either in parallel or shared SAN mode. In addition, DataDirect S2A technology storage is used in a variety of Lawrence Livermore’s simulation environments, including BlueGene/L, presently at 135 growing to 360 TeraFLOP/s, Thunder at 23 TeraFLOP/s, MCR (at 11.1 TeraFLOP/s) and ALC (at 9.6 TeraFLOP/s). This massive, world-class simulation environment is powered by more than one hundred of DataDirect’s S2A storage controllers closely coupled with GPFS and Lustre systems, and served by over 1.25 petabytes of aggregate DataDirect storage capacity.

The National Center for Supercomputing Applications (NCSA) has implemented a cluster utilizing the Lustre™ file system from Cluster File Systems Inc. Lustre is an object-based parallel, asymmetric file system that is becoming a dominant option for large, extremely scalable compute clusters. 150 Terabytes of capacity managed by thirteen DataDirect S2A8500 appliances with 104 Object Storage Servers provide approximately 12 GBytes/sec throughput from the file system. NCSA is approaching 500TB of total S2A managed capacity.

Warner Brothers uses both FC and SATA S2A systems for their real-time non-compressed film “Digital Intermediate” post production environment. The S2A systems, with the ADIC StorNext asymmetric shared SAN file system, supply multiple real-time 300MB/sec streams to a variety of creative workstations and servers. 20 Terabytes of Real-Time Fibre-Channel disks behind two S2A8000 systems are supplemented by 120 Terabytes of SATA behind another S2A cluster. Data is migrated to and from the different storage tiers to enable efficient use of the storage infrastructure resources.

The Supercomputing Center at NASA Goddard utilizes S2A appliances and the SAM-QFS File System and HSM from Sun Microsystems. QFS is a shared SAN, asymmetric file system that can support active sharing across different operating systems and can be configured to provide very high throughput to individual computers, key to NASA’s environment that includes large SMP compute and visualization systems.

Zoic, a Hollywood animation facility, uses a S2A8500 with the Polyserve Matrix Server symmetric shared SAN file system to provide high-performance shared access to their 700-node render (compute) cluster and creative workstations to create animations for television and film. 2D and 3D animators use high-performance desktop workstations to build the frame-by-frame sequences that are staged to the S2A storage cluster for processing by the cluster for final rendering (a very heavy compute task), to complete the video or film sequence. Facilitated by the S2A8500 shared storage environment, a simple and fast workflow provides for greatly enhanced efficiency.

5 S2A9550 Architecture Discussion

5.1 S2A9550 Hardware Architecture

An S2A9550 system consists of either one or two controllers, called “Singlets”, which are self-contained standard 19” rackmount units, each 2U high. Two Singlets may be configured into a Couplet, utilizing the “directCluster” capabilities to provide a single system image and the unique cache-coherency capability that allows true parallel data access across all host ports. When in Couplet mode the two Singlets are connected directly together via two dedicated interconnects: a private Ethernet connection running a custom inter-appliance link protocol that provides most of the system status and management information, as well as cache-coherency messaging if cache coherency is enabled; and a serial connection that provides a second heartbeat signal and a failover path for system management.

5.1.1 Host-Side Architecture

On the host connectivity side, each S2A9550 Couplet has four PCI-X busses. Each S2A9550 Singlet can contain either two or four FC-4, two or four IB, or two FC plus two IB interfaces.

The PCI-X bus slots are driven by a group of DataDirect custom field-programmable gate arrays (FPGA) that perform a variety of functions. The streamlined architecture separates commands from data, with the command path managed by a dedicated I/O CPU interfaced to the central system management CPU responsible for processing I/O metadata transactions. No data passes through the I/O or system management CPUs.

Diagram 9: S2A Hardware Architecture

The S2A9550’s Parallel Parity Processing Engine, in cooperation with the High-Speed FPGA bridges, takes the serial data streams from the hosts and parallelizes them into 8-way parallel stripes (512-byte data segments) and, depending on whether the system is in the 8+1 or 8+2 parity mode, calculates either one or two 512-byte parity segments in real time. Thus there are nine or ten 512-byte segments coming down from the FPGA group. If only one parity segment is chosen for an 8+1 architecture (determined by the systems administrator on a per-system basis), the tenth segment is dedicated to hot spare disk transfers, which is especially useful for expediting rebuilds in the case of disk failures.
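The write-side split into eight 512-byte segments plus parity can be sketched in software. This is only an illustration: the text does not state how the P3E computes parity, so byte-wise XOR is assumed here for the single-parity (8+1) case, the second parity segment of 8+2 mode is not modeled, and all function names are hypothetical.

```python
SEGMENT_BYTES = 512   # size of one data segment, per the text
DATA_SEGMENTS = 8     # 8-way parallel stripe


def split_stripe(word: bytes) -> list:
    """Cut one 4096-byte host 'word' into eight 512-byte data segments."""
    if len(word) != SEGMENT_BYTES * DATA_SEGMENTS:
        raise ValueError("expected exactly 8 x 512 bytes")
    return [word[i * SEGMENT_BYTES:(i + 1) * SEGMENT_BYTES]
            for i in range(DATA_SEGMENTS)]


def xor_parity(segments: list) -> bytes:
    """Assumed parity function: byte-wise XOR across all data segments."""
    parity = bytearray(SEGMENT_BYTES)
    for segment in segments:
        for i, value in enumerate(segment):
            parity[i] ^= value
    return bytes(parity)


def parity_ok(segments: list, parity: bytes) -> bool:
    """Read-side check: recompute parity and compare with the stored segment."""
    return xor_parity(segments) == parity
```

On a write, `split_stripe` followed by `xor_parity` yields the nine segments of an 8+1 stripe; on a read, `parity_ok` mirrors the kind of on-the-fly parity check the engine is described as performing in hardware.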

In the case of a read, data is retrieved in the same manner in parallel and the data is reconstructed in real-time (with the unique and extremely useful capability to perform parity checking in real time). The ability to perform a parity check on the fly delivers another level of data protection not seen in other solutions. This double parity protection feature ensures data integrity as well as allows the S2A9550 to anticipate problems and proactively correct them before they occur. This Parallel Parity Processing Engine (P3E) provides the heart of the S2A9550’s RAID generation function.

Note that at this level the system is generating either an 8+1 or 8+2 (data+parity) environment that is most closely associated with the classic DataDirect DirectRAID construct, in that parity is dedicated to particular segments (and thus particular disk drives) and not intermixed within data segments. Within each segment the actual disks are free to re-order their own command queues, and the spindles in the parity group are not locked. This provides excellent high-I/O, multi-threading and multi-streaming capability, and sustains very high sequential throughput, equivalent read and write performance, and full host performance (even in crippled mode). This multi-threaded parallel disk capability is managed by the Disk Controller Engines.

5.1.2 Disk Controller Engines (DCE)

Each Disk Controller Engine (DCE) contains a very high speed cache memory along with an FPGA-based memory controller managed by a dedicated I/O CPU. This cache is used to buffer data, and the parallel structure allows a very deep data queue to be managed, enabling very efficient asynchronous I/O, which is key for any multi-threaded or multi-streaming application.

The dedicated I/O CPU on each disk controller not only manages the command queues for its associated segments, but can also provide a “vertical striping” capability down the disk drives in its responsible segments. The S2A PowerLUN capability embodies this 2-dimensional distributed multi-processing striping capability: the P3E FPGA set on the motherboard generates the horizontal stripe and generates parity; and the DCE CPUs stripe vertically down the segments. PowerLUNs can therefore span multiple parity groups and a large number of disks providing for extremely high performance from the single PowerLUN target.

LUNs may be formatted with block sizes from the historic standard 512 Bytes all the way up to 8KB, enabling the creation of LUNs up to 32TB in size. The ability to create LUNs this large can enable simpler network and file system configurations, and provides an inherent load-balancing capability by spreading I/O over a very large number of disk drives.
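The text does not say why an 8KB block size corresponds to a 32TB LUN limit. One explanation that happens to fit the numbers, offered here purely as an assumption, is a 32-bit logical block address, which caps a LUN at 2^32 blocks regardless of block size. The sketch below works through that arithmetic.

```python
MAX_BLOCKS = 2 ** 32   # assumption: 32-bit logical block addressing
TIB = 2 ** 40          # binary terabyte, used for the comparison


def max_lun_bytes(block_size: int) -> int:
    """Largest addressable LUN for a given block size under the 32-bit assumption."""
    return MAX_BLOCKS * block_size


for block_size in (512, 1024, 2048, 4096, 8192):
    print(f"{block_size:5d}-byte blocks -> {max_lun_bytes(block_size) // TIB} TiB")
```

Under this assumption 512-byte blocks top out at 2 TiB and 8192-byte blocks at 32 TiB, which is consistent with the 32TB figure quoted above.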

Due to the parallel nature of the S2A9550 data and parity segments, each of the disk channels must be populated with disks (and chassis to hold them). An absolute minimum configuration requires either 9 drives in an 8+1 configuration or 10 drives in an 8+1+1 configuration. When configured in 8+1+1 mode the 10th segment is reserved for global hot spare drives. When configured in 8+2 mode spares may reside on each data segment and are global only to that data segment.

Diagram 10: S2A9550 Couplet Logical Diagram

5.1.3 System Configuration

DataDirect provides various disk chassis that may be used to populate the 10 disk channels. These include a 3U, 16-bay FC disk chassis with a native FC interface. This chassis, the SFBx016 series, may be configured from a minimum of 2 up to 70 chassis with the S2A9550. In the same form factor, the SA2016 chassis utilizes native SATA disk drives with an FC-to-SATA bridge in the chassis itself. A minimum of 5 up to 70 chassis of this type are supported with the S2A9550. A 4U, 48-slot chassis is also available, creating a very high-density solution that provides native support for SATA and SAS (future) disk drives. See the S2A Nearline SATA Solution or S2A9550 Archive Solution white papers for more details. When configured as a dual controller system, or “Couplet”, two S2A9550 Singlets are directly connected together via their private Ethernet and Serial Inter-Appliance Links, and the drive chassis are attached to both Singlets through their standard dual-port FC interfaces off of the Disk Controller Engines.

The system configuration pictured here utilizes ten 16-slot Fibre-Channel or SATA disk chassis, one to populate each of the ten pairs of loops coming down from the S2A9550 controllers. Each loop is only responsible for servicing its own data segment. Parity groups (called Tiers) have one disk in each chassis (in the case of an 8+2 parity layout), or the last chassis is dedicated to hot spare drives (in the case of an 8+1 parity layout).

With ten 16-slot chassis, up to 16 Tiers may be contained within them, and additional chassis may be daisy-chained to add space for additional Tiers of drives.

Diagram 11: S2A9550 Couplet, SFB 2016 x 10 Logical Rack Cabling

5.2 S2A9550 Software Architecture

5.2.1 Host Side

The raw protocols supported for host connectivity are Fibre-Channel Protocol (FCP) SCSI over Fibre-Channel in the case of the FC-4 host port options, or SCSI RDMA Protocol (SRP) when using the InfiniBand Interfaces. Support for the iSER (iSCSI RDMA) InfiniBand protocol is under investigation. The target drivers are highly optimized, and particularly in the case of InfiniBand provide extremely high performance and low latency. The S2A9550 IB RDMA interface presents the S2A as a shared memory space to the file system obviating the need for memory copies and reducing the time required to navigate the protocol stack. Significant command queues may be managed, essential for maintaining performance in large-scale storage networks.

There are a variety of software virtualization capabilities providing hosts with managed access to the target LUNs. LUNs have native internal numbering (up to 8192 LUNs are supported) but may be exposed to the initiators in any numbering scheme desired. This LUN aliasing allows different computers to see different LUNs with the same numbering scheme (excellent for parallel file systems where servers manage private storage but keeping the servers configured identically simplifies management), or conversely the same LUN with different numbers.

LUNs may be zoned through any or all of the available host ports on a per-computer basis, providing for a very easy way to dedicate multiple paths to an initiator or load balance the computers across the available ports. LUNs of course may be completely filtered from all computers but those they have been specifically assigned to.
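LUN aliasing and filtering amount to a per-initiator lookup table. The sketch below is a hypothetical model of that idea, not the S2A's actual configuration interface: the class, its method names and the initiator labels are invented for illustration, and port zoning is left out. Only the 8192-LUN limit comes from the text.

```python
MAX_INTERNAL_LUNS = 8192   # internal LUN count quoted in the text


class LunMap:
    """Per-initiator table from the LUN number a host sees to the internal LUN."""

    def __init__(self):
        self._maps = {}   # initiator id -> {presented LUN number: internal LUN number}

    def present(self, initiator: str, presented: int, internal: int) -> None:
        """Expose an internal LUN to one initiator under a chosen LUN number."""
        if not 0 <= internal < MAX_INTERNAL_LUNS:
            raise ValueError("internal LUN out of range")
        self._maps.setdefault(initiator, {})[presented] = internal

    def resolve(self, initiator: str, presented: int):
        """Return the internal LUN, or None if this initiator may not see it."""
        return self._maps.get(initiator, {}).get(presented)


lun_map = LunMap()
# Two identically configured servers each see "LUN 0", backed by different storage.
lun_map.present("server-a", 0, 17)
lun_map.present("server-b", 0, 42)
# The same internal LUN can also appear under different numbers on different hosts.
lun_map.present("server-b", 5, 17)
```

A host with no entry in the table resolves to `None` for every LUN number, which models a LUN being filtered from all computers except those it has been assigned to.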

LUNs may be built across a single Tier, across many Tiers, or across partial Tiers. They may be formatted with block sizes ranging from 512 bytes to 8KB, allowing for LUNs up to 32TB in capacity.

5.2.2 DirectRAID Parity implementation

The S2A9550 provides a patent-pending, fully parallelized, distributed and multitasking hardware RAID implementation that uses an 8+1+1 or 8+2 parity scheme called DirectRAID. It is a unique implementation that combines the virtues of RAID 3, RAID 5 and RAID 0. Like RAID 3 a dedicated parity drive is used per 8+1 parity group, or two parity drives are dedicated in the case of an 8+2 parity group. DirectRAID exhibits RAID 3 characteristics such as tremendous large block-transfer - READ and WRITE - capability with NO performance degradation in crippled mode.

Like RAID 5 however, DirectRAID does not lock drive spindles and does allow the disks to re-order commands to minimize seek latency, and the RAID 0 -like functionality allows multiple parity groups to be striped, providing PowerLUNs that can span 100's of disk drives. These "PowerLUNs" support very high throughput (up to 3 GB/s in the S2A9550) and a greatly enhanced ability to handle small I/O (particularly as disk spindles are added) and many, many streams of real-time content. Parity groups (or "Tiers") of drives are added 9 drives (8+1) or 10 drives (8+2) at a time. As additional Tiers are added data may be logically striped across the tiers, over and above the actual parity stripe which occurs on a per-Tier basis. In the 8+1 mode any arbitrary number of global hot spares (up to 125, depending on the particular disk chassis ID numbering scheme) is supported on the dedicated 10th loop. In the 8+2 mode spares are global to each of the ten dual disk loops.

Using the 16-slot Fibre-Channel or SATA disk chassis up to 112 drives may be configured on each of the ten dual-port disk loops. Therefore 112 Tiers of drives may be used. 112 Tiers of drives yields up to 112*8=896 usable data drives for the S2A9550.
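The drive-count arithmetic can be checked with a few lines. The constants come from the text (ten disk loops, 112 drives per loop, 8 data segments per Tier); the assumption made here is that the total capacities quoted in Section 4.1 count every drive slot, including parity and spare positions.

```python
DISK_LOOPS = 10          # ten dual-port disk loops
DRIVES_PER_LOOP = 112    # with 16-slot chassis
DATA_SEGMENTS = 8        # data drives per Tier


def max_tiers() -> int:
    """One drive per loop per Tier, so the Tier count equals drives per loop."""
    return DRIVES_PER_LOOP


def usable_data_drives() -> int:
    return max_tiers() * DATA_SEGMENTS


def total_drive_slots() -> int:
    return DISK_LOOPS * DRIVES_PER_LOOP


def raw_capacity_tb(drive_gb: int) -> int:
    """Raw capacity in decimal TB, counting every slot (assumption)."""
    return total_drive_slots() * drive_gb // 1000


print(usable_data_drives())                              # 896
print(total_drive_slots())                               # 1120
print([raw_capacity_tb(gb) for gb in (300, 500, 750)])   # [336, 560, 840]
```

The 1120 slots also equal 70 chassis of 16 bays each, and the three capacities line up with the 336TB, 560TB and 840TB figures given for 300GB, 500GB and 750GB drives.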

Note also that the architecture of the S2A provides tremendous back-end performance to the disks, substantially more than needed to supply the hosts. This extra performance is essential when background activities like drive rebuild or mirror generation must not affect host performance, yet still get the job done in a timely manner, unlike competing RAID array architectures.

DirectMirror is included standard with the S2A9550. With DirectMirror up to 128 LUNs may be cloned up to seven times. A LUN clone may be broken from its parent LUN and used as a point-in-time copy, or for other purposes such as backup or testing, and then either re-synchronized with its parent LUN or turned into a stand-alone LUN (and then be cloned itself if desired).

5.3 Disk Side

Disk side protocols supported include FCP in the case of Fibre-Channel disk DCE-based systems, or either Serial Attach SCSI (SAS) or Serial ATA (SATA) when using the SATA/SAS DCEs.

The disk side software architecture is implemented as essentially a parallel buffered stripe, with each of the data segments (and optional stripes) managed by each DCE acting independently of and in parallel to the other DCEs. The P3E RAID core steers the data segments, after generating the parity information, across the DCEs, but each DCE manages its own dedicated disks. The disks are allowed to reorder their own command queues, and the deep cache on each DCE is used as a staging area to hold the data that is delivered to and from the disks asynchronously (since the disks will service the data in the most efficient manner by reducing seeks as much as possible). Once all DCEs have the data from a command in cache, the 4KB word across the DCEs will then be transferred synchronously back through the P3E.

In order to manage latencies in the case of slow or “lazy” disks (or retries, bad-block replacement, etc), there is also a unique “Fast AV” mode available that utilizes the tremendous real-time parity calculation capability of the P3E RAID core. When Fast AV is enabled, the system will deliver data after any 8 disks in a Tier are ready. So, even if one of the 8 data drives (in the 8+1 or 8+2 Tier) fails to deliver its data in a timely manner, its data is calculated in real-time and the transfer is delivered to the host.
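The idea behind Fast AV, delivering a read once enough segments have arrived and computing the straggler from parity, can be sketched as follows. Byte-wise XOR parity is an assumption (the text does not name the parity function), only the single-parity case is modeled, and the function names are hypothetical.

```python
def xor_blocks(blocks: list) -> bytes:
    """Byte-wise XOR of equally sized blocks (assumed parity function)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def fast_av_read(data_segments: list, parity: bytes) -> bytes:
    """Assemble a stripe from 8 data segments, where a late drive is marked None.

    If one data segment has not arrived, it is rebuilt from the other seven
    plus parity instead of waiting for the slow disk.
    """
    missing = [i for i, seg in enumerate(data_segments) if seg is None]
    if len(missing) > 1:
        raise ValueError("single parity can stand in for only one late segment")
    segments = list(data_segments)
    if missing:
        present = [seg for seg in segments if seg is not None]
        segments[missing[0]] = xor_blocks(present + [parity])
    return b"".join(segments)
```

Because XOR is its own inverse, combining the seven data segments that did arrive with the parity segment yields exactly the bytes of the missing one, so the host sees the same data whether or not the slow drive ever responds.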

The DCEs also have the unique capability to stripe data, on a per-LUN basis, across drives in its data segments. When LUNs are striped across Tiers, it is the DCEs themselves that perform the striping. This capability is essential in providing the multi-processing PowerLUN feature, whereby a LUN not only exists as a horizontal stripe across a single parity group Tier, but in addition is striped vertically across Tiers. This two-dimensional striping provides for inherently load-balanced, very high-performance, very high-I/O LUNs to be built; and at very large capacities.
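Two-dimensional striping can be pictured as an address calculation. The sketch below is a guess at one possible layout, not the documented S2A mapping: it assumes a simple round-robin of 4KB words across Tiers, with a hypothetical `words_per_chunk` parameter for the vertical stripe depth. Only the 512-byte segment size and 8-way horizontal stripe come from the text.

```python
SEGMENT_BYTES = 512                           # one data segment per DCE
DATA_SEGMENTS = 8                             # horizontal stripe width
WORD_BYTES = SEGMENT_BYTES * DATA_SEGMENTS    # 4KB word across the DCEs


def locate(offset: int, tiers: int, words_per_chunk: int = 1):
    """Map a LUN byte offset to (tier, segment, row).

    segment: which of the 8 data channels holds the byte (horizontal stripe)
    tier:    which parity group holds the word (vertical stripe, round-robin)
    row:     index of the 4KB word within that tier
    """
    word = offset // WORD_BYTES
    segment = (offset % WORD_BYTES) // SEGMENT_BYTES
    chunk = word // words_per_chunk
    tier = chunk % tiers
    row = (chunk // tiers) * words_per_chunk + word % words_per_chunk
    return tier, segment, row
```

With four Tiers and a depth of one word, consecutive 4KB words land on Tiers 0, 1, 2, 3 and then wrap, so a long sequential transfer keeps every Tier, and every channel within each Tier, busy at once.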

5.4 Management Capabilities

There are a variety of methods to manage the S2A9550. A command line interface (CLI) is available via either Telnet or serial port, and a freely available API may be used to create a custom scripted interface. Optionally, a Java-based GUI and a system console may be purchased that can manage multiple S2A systems, record syslogs, provide secure remote access, and send e-mails and pages. SNMP is also supported.

Extensive measurement and optimization tools are available to facilitate fast deployment and maintain optimal performance. Timers are available at all levels of the access stack, making it possible, for example, to create histograms of I/O size and alignment profiles, host response times, disk response times, cache efficiencies, port and disk throughput, and other system activities. These analyzer-like capabilities provide unprecedented tools to gain a very rapid and accurate understanding of what is happening within the storage network.
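To show what an I/O size and alignment profile looks like, here is a small sketch that builds both histograms from a list of requests. It does not use the S2A9550 tools or API; the request format (offset, length in bytes), the 4KB alignment boundary, and the names `size_bucket` and `profile` are assumptions made for the example.

```python
from collections import Counter


def size_bucket(nbytes):
    """Return the smallest power of two that is >= nbytes."""
    bucket = 1
    while bucket < nbytes:
        bucket *= 2
    return bucket


def profile(requests, align_bytes=4096):
    """Build I/O size and alignment histograms.

    requests: iterable of (offset, length) pairs in bytes.
    Returns (sizes, alignment), both Counter objects.
    """
    sizes = Counter()
    alignment = Counter()
    for offset, length in requests:
        sizes[size_bucket(length)] += 1
        key = "aligned" if offset % align_bytes == 0 else "unaligned"
        alignment[key] += 1
    return sizes, alignment


if __name__ == "__main__":
    trace = [(0, 4096), (512, 3000), (8192, 65536)]
    sizes, alignment = profile(trace)
    for bucket in sorted(sizes):
        print(f"<= {bucket:>6} bytes: {sizes[bucket]}")
    print(dict(alignment))
```

A profile like this makes it easy to see whether hosts are issuing small or misaligned requests that would prevent full-stripe transfers.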

6 S2A9550 Cluster NAS Benefit

Whenever it is used in a NAS cluster implementation, the S2A9550 makes the entire solution perform faster. Designed for HPC and Rich Media clusters, DataDirect's S2A9550 is 3.5 times faster than other storage systems, providing higher aggregate NAS system performance with greater scalability and far fewer components. The result is a storage solution with the least amount of complexity possible, and the best price per performance and best price per capacity per square foot in the industry.

The S2A9550 leads the industry in scalability and flexibility. The S2A9550 can fit 5 times more capacity than any other single storage system, supporting up to 480TB, and 840TB later in 2006, in a two-rack footprint. It also supports Fibre Channel, SATA and SAS (future) disk drives. As organizations consolidate greater portions of their data into large disk pools, additional levels of robustness and resiliency will be required. The S2A9550’s groundbreaking ability to withstand multiple drive failures within and across individual parity groups maintains data availability, host performance and quality of service under the most severe failure modes. With multiple levels of redundancy, double parity protection and Dual Redundancy Disk, the S2A storage solution delivers enterprise reliability with world-class scalability.

The S2A9550’s flexibility extends to server connectivity as well. With its modular design, the S2A9550 can support mixing and matching Fibre Channel-4 (FC-4) and InfiniBand-4X (IB-4X) front-end server connectivity, protecting your current investment with an upgrade path to the future. All this scalability and flexibility is easily managed from a single storage system.

The S2A9550 overcomes the performance, availability, scalability and management challenges organizations face in deploying cluster architectures with global file systems and NAS storage. Scale as fast and as large as you need without expensive re-architecting. Provide the highest levels of Quality of Service for consistent, predictable data delivery to real-life application environments. Deploy, support and manage simply and easily, reducing management costs. All this comes with enterprise-wide reliability, ensuring high data availability and uptime even while servicing full host loads.

As a platform for parallel file systems and shared SAN file systems, high performance computing clusters, visualization and simulation, animation, special effects and post-production, as well as broadcast streaming, the S2A9550 InfiniBand Solution and Fibre-4 Solution are unmatched.

This design, with optimal block-level and file system performance and massive scalability, supports a broad range of computational and visualization environments with NAS solutions. DataDirect’s S2A offers broad infrastructure support, powering compute clusters from IBM, Dell, HP, Cray, SGI, Bull and others.

With the S2A9550, organizations can create the industry’s best price per performance and best price per capacity per square foot disk-based solutions for parallel and shared SAN file systems, and high performance NAS and InfiniBand computing clusters.

Appendix A: S2A9550 and PolyServe NAS Solution for Shared Data Clustering¹

NAS Appliance Pain

The confluence of hundreds of clients accessing terabytes of data has overwhelmed the world of file serving, and NAS appliances in particular. While the initial appliance was easy to install and manage and was worth the expense, organizations with cluster requirements have outgrown the benefits of NAS appliances. For example:

• Manageability: Clusters have a large number of compute clients, which drives organizations to deploy multiple UNIX servers or NAS appliances, introducing complexity and overhead in maintaining redundant copies of data.

• Scalability: Clusters generate large amounts of I/O. Some organizations have turned to very high-end servers and NAS appliances to simplify the provisioning of file services or to support heavy workloads. However, this comes at a steep cost and, in many cases, does not deliver the desired I/O performance.

• Performance: The architectural tradeoff that made NAS appliances simple for a single system came at the expense of reduced performance. To meet cluster I/O loads, many NAS solutions must deliver multiple appliances with redundant data copies. This proliferation of appliances has overwhelmed IT staffs while failing to meet performance needs cost-effectively.

• Availability: When NAS appliances are paired to provide cluster availability, the solution cost doubles. Moreover, maintenance and downtime proportionally increase with the doubling of hardware.

• Fit with Existing Infrastructure: With proprietary hardware, software and management tools, NAS appliances add yet more complexity to the cluster and another set of processes for IT to learn.

• Cost Reduction: IT departments are being challenged to reduce operating expenses. The proliferation of NAS appliances has

¹ Source for PolyServe File System information, claims, data, tests, diagrams and text is PolyServe, Inc.’s white paper “The UnAppliance Provides Higher Performance, Lower Cost File Serving.” All trademarks and names are property of their respective owners.