High Performance Computing: Advanced Computational and Three Dimensional Modeling and Simulation



Silicon Storage Appliance Implementation for High Performance Computing

High Performance, Massively Scalable Storage Network with Increased Efficiency and Lower Operating Costs

By Robert Woolery

Vice President, Corporate Development and Strategic Planning


Business Model and Information Challenge: HPC, Network Storage Enabled

HPC Applications

Solution Architecture and DataDirect Networks’ Silicon Storage Appliance

System 1 - Consolidate and Connect Multiple Systems and Pools of Data

System 2 - Architect a Flexible, Scalable Network Topology

IP Model

File Sharing Model

Phase 3 - Automate Storage Management

Benefits of DataDirect Silicon Storage Appliance in a HPC Infrastructure


Ensuring the reliability and safety of nuclear stockpiles, finding cures for life-threatening diseases and stemming the rise of global warming: these are some of the challenges that lie before high performance computing (HPC) centers. The stakes couldn’t be higher.

Along with these missions, high performance computing centers have been besieged with enormous situational changes including implementation of Y2K remediation and Internet connectivity, new legislation and regulations, changes in agency mission, as well as ever-increasing security threats.

Because the cost to maintain HPC center capabilities runs in the tens of millions of dollars, it doesn’t take long for mission performance accountability to occur. IT infrastructures occupy center stage as high performance computing centers look to get more out of their IT investments in terms of increased performance, lower operating costs, extended useful life and return on investment.

Driving IT performance becomes critical while adapting to vertical leaps in requirements, technology, process, investment and mission. Information and its strategic relevance to mission execution now dominates the guidelines for IT planning, acquisition, and implementation. Several factors are radically affecting HPC IT implementation each and every day:

• Exploding functional requirements;

• Exponential scale - functionality, performance, connectivity, capacity;

• Demanding expectations - mission, productivity, cost savings;

• Escalating IT accountability to mission performance;

• Increasing IT costs and shrinking IT staff resources.

HPC centers create huge amounts of sensitive data. That data needs to go through an intensive and complex simulation, analysis, and numeric computation process via sophisticated hardware and software IT technology to increase its mission value and/or profit potential. Minimizing the time and cost to simulate, analyze and predict results is critical to meeting a mission or achieving profits. This process proves to be very problematic for a HPC center as it expands its capture, production and delivery phases.

Recognizing the need to put modeling, analysis, results and data delivery at the center of their operations, HPC centers also realize they cannot support even current growth and cost demands on existing infrastructure. A centralized, flexible, and scalable IT infrastructure would position the HPC center properly for future internal growth as well as growth through regulatory requirements and market changes. A HPC center looks for ways to consolidate its data, workflow and applications to gain efficiencies in production, staffing, data management, and, most important, predictable delivery of mission-critical data. A HPC center wants to focus on simulating, modeling and delivering the next test, drug, or revolutionary outcome, while quickly and cost-effectively bringing resources and assets to the mission. It doesn’t want to worry whether its data is available. Quite simply, if a HPC center can’t simulate and deliver its test, outcome, or drug, it can’t assure its mission. That is the reason HPC IT managers are deciding to move from proprietary server-centric architectures to information-centric storage network architectures that are high performance, scalable, very reliable, and proven with storage network appliances.


HPC centers provide the world with its most advanced scientific research and development with ongoing projects in national security, financial modeling, energy, biomedicine and environmental technologies.

With these HPC centers come new simulation, modeling and rendering technologies, inviting a broad reassessment of established research and development practices, competitive strategies, and regulatory requirements. Digital simulation also represents a new technological infrastructure for HPC centers and thus a new economic and speed-to-mission/market competitive paradigm, particularly for networks focused on the processing and delivery of testing, modeling, health and other time-sensitive content.

To meet the ever increasing high performance, high availability demands of their projects, HPC centers are looking to accelerate visual simulation and rendering work resulting from complex supercomputing applications that model, analyze, and predict the performance of drugs, nuclear weapons, and environmental scenarios. The ability to easily move and process very large datasets at high I/O rates is of paramount importance to HPC facilities and is a key consideration in developing their IT infrastructure.

Complicating this task is the ever-growing gap between Central Processing Unit (CPU) performance (Moore’s Law) and disk subsystem performance. As disk drives double in size every 9 months, their performance capabilities lag in relation to performance innovation of CPU chips. As such, maintaining data flow to faster and faster CPUs has become a necessary requirement of storage networks for an efficient and balanced computing infrastructure. The information technology infrastructure of today’s research and development centers and life science research companies demands the same attention as the lab or field experiments and tests it supports. No longer can one operate without the other. HPC centers are beginning to shift their attention to efficiencies in analyzing data from experimentation and testing, and the challenges it imposes on their processes, procedures and infrastructures. This means simulation, analysis and delivery must all come off without a hitch, providing data integrity and uptime with the highest levels of application availability and performance.

The constantly changing environment, mission evolution and competitive intensity force research and development centers to accelerate program acquisition, simulation, and analysis, speeding time-to-mission/market with a minimum of cost. This puts a new level of pressure on acquisition, simulation and analysis processes and systems, focusing on each step of high performance computing and applying new efficient, rigorous controls to the production and delivery of scientific data. This new level of global competition requires the production of research data to be accurate, efficient and consistent, while modernizing HPC production. Historic processes are costly and inefficient.

The goal of HPC centers is to integrate data acquisition, simulation and analysis into one interlinked environment increasing efficiencies while reducing costs.

IT Challenges

• Very large datasets needing high I/O processing rates.

• Distributed and separate islands of HPC content that cannot be shared.

• Concurrent delivery of terabytes of simulation data.

• Uptime and data integrity for application availability and mission delivery.

• Quick resource reallocation to adjust to changes in mission or budget.

• Ever-growing data stores and delivery requirements.

Business Model and Information Challenge: HPC, Network Storage Enabled


In 1995, the president of the United States directed the Department of Energy (DOE) to undertake all necessary activities to ensure the safety and reliability of the nation’s nuclear weapons stockpile without real-world testing. Since advanced computations, specifically three-dimensional modeling and simulation capability, are viewed as the cornerstone of safe and reliable maintenance of the nation’s nuclear arsenal, the goal was to create a system that moved computing to a new level. The result was the development of the world’s fastest supercomputer, capable of 12 trillion calculations per second.

A network of supercomputing platforms known as the Accelerated Strategic Computing Initiative (ASCI) drives this particular HPC mission. An ASCI supercomputer is more than three times faster than the next most powerful computer in existence today. Along with ASCI White, ASCI Pacific Blue, other ASCI supercomputers, and the HPSS storage resource, the tri-lab ASCI alliance of Lawrence Livermore National Laboratory, Los Alamos National Laboratory and Sandia National Laboratories is creating modeling and simulation data for analysis by a myriad of researchers throughout the country.

With these high performance computing platforms, there came a need for high performance storage networking solutions that could:

• Accelerate time to mission for new research and development;

• Improve productivity for scientists;

• Increase data accuracy;

• Reduce IT costs and IT staff resource drains.

Like the ASCI HPC labs, other HPC centers have a mix of computing platforms, from Linux clusters all the way to monolithic supercomputers, accessed by thousands of users running diverse applications. Recognizing the advantages of advanced storage network technologies, supercomputing labs started seeking out design, engineering and implementation of storage network appliance systems that would increase production efficiencies, scale cost-effectively, and ensure application uptime while being easy to manage.

Investigation showed that many first generation SAN implementations did not satisfy the cost-effective scalability, performance and reliability needs. To provide the required performance, individual scientists within a HPC lab may require 50MB/second or more, as well as concurrent access to the test data. The main compute complex may require aggregate bandwidth of 10GB/second or more. The ability to support several workgroups and the main compute complex with simultaneous access to all content in real-time was paramount. First generation Storage Area Networks can impose many limitations and restrictions on users trying to move data. One option to solve this limitation in first generation SANs is to create multiple SANs and overlay a software layer to transfer data from SAN to SAN. This replication method is costly, difficult to scale and manage, and requires redundant resources solely to meet performance requirements. The second option in first generation SANs is to stripe the content across a large number of RAID controllers. This, however, introduces extra CPU load, fabric contention, and I/O switching latencies within the architecture, which then does not fulfill reliability and performance requirements. Like the first option, it severely limits scalability and is extremely difficult to configure and manage.

For example, a first generation SAN design attempts to improve overall bandwidth by adding multiple RAID systems. However, because each subsystem is a single RAID controller, this results in many separate disk targets called Logical Units (LUNs). A single LUN does not meet the performance or concurrent access requirements for even small-sized HPC systems. Therefore, the RAIDs are striped using the server’s file system or volume manager software to provide striping across multiple LUNs. This configuration mandates a Fibre Channel switch to provide access from each workstation or server to each RAID system. Apart from being an overly complicated design, the performance is disappointing, with the aggregate performance falling far short of the sum of what each individual RAID can provide. Performance degradations of 50% or more are not unusual in even moderate-sized SANs due solely to the large number of collisions and switching events inherent in large, high performance storage networks.

HPC Applications

Figure: Before Storage Network Appliances; After Storage Network Appliances


Following the decision to upgrade their storage infrastructures with high performance, high reliability, and proven storage network appliances, the HPC lab re-organizes and consolidates its disparate pools of content into one information infrastructure. The goal of the HPC lab is to integrate simulation, analysis and predictive systems and processes into one interlinked environment increasing production efficiencies while reducing costs.

In our sample case study HPC facility, there are twelve multi-processor UNIX workstations with Fibre Channel connectivity, thirty-two Linux or Windows clients connected via Ethernet, four Network Attached Storage (NAS) servers and one monolithic supercomputer. The infrastructure must be able to deliver 1.5GB/second bandwidth. The bandwidth requirement is as follows:

• 12 UNIX FC Workstations (25MB/s per System)

• 4 NAS Servers (50MB/s per Server)

• 1 Monolithic Supercomputer (1GB/s)
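The per-system figures listed here can be checked against the stated 1.5GB/second target with a few lines of arithmetic. The sketch below is only an illustration: the names `REQUIREMENTS` and `aggregate_bandwidth_mb_s` are hypothetical, and it assumes 1 GB = 1000 MB.

```python
# Per-system bandwidth figures taken from the case study (MB/s).
# Assumption: 1 GB/s is treated as 1000 MB/s.
REQUIREMENTS = [
    ("UNIX FC workstation", 12, 25),
    ("NAS server", 4, 50),
    ("Monolithic supercomputer", 1, 1000),
]


def aggregate_bandwidth_mb_s(requirements):
    """Sum count * per-system bandwidth over all system classes."""
    return sum(count * per_system for _, count, per_system in requirements)


total = aggregate_bandwidth_mb_s(REQUIREMENTS)
for name, count, per_system in REQUIREMENTS:
    print(f"{name}: {count} x {per_system} MB/s = {count * per_system} MB/s")
print(f"Total: {total} MB/s ({total / 1000} GB/s)")
```

Under these assumptions the three classes contribute 300, 200 and 1000 MB/s, which matches the 1.5GB/second requirement.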

All systems must be able to access the same volume concurrently. Starting on-line content capacity is 20TB, scaling to 200TB. Application availability and scalability are required so that failures have minimal impact on the environment, particularly for simulation.

To do this, the HPC lab worked with DataDirect Networks’ Professional Services to design a new HPC information infrastructure based on DataDirect Silicon Storage Appliance in three phases.

Phase one includes consolidation of twelve UNIX servers, thirty-two NFS clients, four NAS servers and one supercomputer into one storage area network (SAN) accessing the 20TB of storage. The UNIX servers, NAS servers and supercomputer are connected via Fibre Channel to two Silicon Storage Appliances that provide access to three shared LUNs: scratch space (12TB), user space (2TB) and archive cache (6TB).

Solution Architecture and DataDirect Networks’ Silicon Storage Appliance

Figure: High Performance Computing Facility. Components shown: NFS and/or iSCSI clients, Fibre Channel clients, Fibre Channel switch, Silicon Storage Appliance, monolithic supercomputer, Ethernet switch, NAS file servers, iSCSI bridges.

The consolidation reduces high system maintenance costs with centralized administration and a robust tool set to simplify management. Storage amortization reduces acquisition costs by eliminating the need for storage replication. Consolidation also enables concurrent access to content, accelerating the production process for more efficient workflow and increased productivity.

Phase two develops a scalability paradigm for connectivity and performance through the “IP” and “File Sharing” models. A continuous stream of technology and scalability change characterizes HPC environments; therefore, in this phase DataDirect designed the scalability required for infrastructure flexibility, delivering pay-as-you-grow scalability and an overall architecture that allows for infrastructure growth without major reconfigurations or the associated downtime. This ease of scalability and flexibility gives the HPC lab a competitive advantage in its ability to respond to a fast, ever-changing mission landscape.

A SAN file system software solution is employed to provide the locking capability required to share files as well as failover functionality with the Silicon Storage Appliance. A separate Ethernet network is utilized as the command messaging mechanism for the SAN shared file system.

This File Sharing model will provide increased productivity via workgroup collaboration along with decreased TCO from storage consolidation via ease of management.

The Silicon Storage Appliance was configured with multi-pathing and failover software to take advantage of the exceptional high-availability, load balancing and zero-time failover features of the Silicon Storage Appliance.
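The behavior expected from host multi-pathing software (spread I/O over every healthy path, and route around a failed one) can be illustrated with a small sketch. It is not the API of any vendor's product: the class `MultiPathSelector`, its methods and the path names are hypothetical, and simple round-robin is an assumed balancing policy.

```python
class MultiPathSelector:
    """Round-robin over healthy paths; failed paths are skipped."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.failed = set()
        self._next = 0  # index of the next path to try

    def mark_failed(self, path):
        self.failed.add(path)

    def mark_restored(self, path):
        self.failed.discard(path)

    def next_path(self):
        """Return the next healthy path, trying each path at most once."""
        for _ in range(len(self.paths)):
            path = self.paths[self._next]
            self._next = (self._next + 1) % len(self.paths)
            if path not in self.failed:
                return path
        raise RuntimeError("no healthy path to storage")


selector = MultiPathSelector(["hba0->port1", "hba1->port2"])
before = [selector.next_path() for _ in range(4)]  # alternates between paths
selector.mark_failed("hba0->port1")                # simulate a path failure
after = [selector.next_path() for _ in range(3)]   # only the surviving path
print(before)
print(after)
```

In this toy model a path failure needs no separate failover step: the selector simply stops handing out the failed path, so I/O continues on the remaining one.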

Finally, DataDirect works with the HPC lab to implement operational and management policies across the infrastructure to take advantage of the features of simplified management and quick reallocation of storage with no increase in staff.


The HPC lab wanted to build a high performance, reliable and scalable IT infrastructure, so it replaced its direct-attached JBOD and first generation SAN with a pair of Silicon Storage Appliances. Before implementing the Silicon Storage Appliances, the IT staff was spending much of its time replacing, optimizing and reconfiguring storage and the network as well as restoring lost or corrupted data. The staff spent a lot of time repurposing and reconfiguring systems to accommodate changing needs in the work flow process. The staff also spent many hours trying to manage the vast amounts of individual production files, many of which are multi-gigabyte in size. With each failure there was the potential for data loss, loss of valuable compute time or, worse, corruption. The Silicon Storage Appliance’s reliable, highly available architecture delivers not just data but the performance required to feed that data in a timely manner, i.e. high Quality of Service (QoS).

DataDirect Networks’ Silicon Storage Appliance greatly simplified the storage design. With eight full duplex Fibre Channel ports, UNIX servers, NAS servers and supercomputers could connect to the storage access device directly or via Fibre Channel switches. Since the Silicon Storage Appliance virtualizes storage within the engine, each connected host has full bandwidth access to the shared LUNs. This eliminated the additional complexity of LUN striping and the contention it creates within the SAN.

Some of the most important benefits are performance and reduced complexity for higher reliability. The Silicon Storage Appliance architecture provides true parallel and fully pipelined access to storage and has demonstrated 800MB/second through a single Silicon Storage Appliance.

System 1 - Consolidate and Connect Multiple Systems and Pools of Data

Figure: HPC Infrastructure Incorporating Fibre Channel and IP Networks. Components shown: NFS clients, Fibre Channel clients, Silicon Storage Appliance (active/active), monolithic supercomputer, Ethernet switch, NAS servers, Fibre Channel director-class switch, 20TB of storage.

The consolidated storage is configured in an active/active configuration with 14 total tiers of 180GB drives. With consolidated, virtualized storage, the access can be configured via a single LUN with no stripe or two LUNs with a two-way stripe for increased performance. The storage is configured as three LUNs per Silicon Storage Appliance, one each for scratch space, user space and archive cache. The scratch space will be 12 TB total, equally apportioned between the two Silicon Storage Appliances and accessed via a two-way stripe. The user space and archive cache will be 2 TB and 6 TB respectively. Each of these elements will be equally split between the two Silicon Storage Appliances but accessed without any host striping. This management flexibility provides easy performance scalability without the complexity, reduced reliability and reduced application performance of many LUNs and wide stripe sets.
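As a rough check on the layout just described, the sketch below splits each LUN equally between the two appliances and confirms that the pieces add up to the 20TB starting capacity. The names `LUNS` and `per_appliance_layout` are hypothetical and used only for illustration.

```python
APPLIANCES = 2

# LUN sizes in TB, and whether hosts stripe across the two appliances,
# as given in the text.
LUNS = {
    "scratch space": {"total_tb": 12, "host_stripe": True},
    "user space": {"total_tb": 2, "host_stripe": False},
    "archive cache": {"total_tb": 6, "host_stripe": False},
}


def per_appliance_layout(luns, appliances):
    """Split each LUN's capacity equally across the appliances (TB each)."""
    return {name: spec["total_tb"] / appliances for name, spec in luns.items()}


layout = per_appliance_layout(LUNS, APPLIANCES)
total_tb = sum(spec["total_tb"] for spec in LUNS.values())

for name, share in layout.items():
    mode = "two-way host stripe" if LUNS[name]["host_stripe"] else "no host striping"
    print(f"{name}: {share} TB per appliance ({mode})")
print(f"Total capacity: {total_tb} TB")
```

Each appliance therefore carries 6 TB of scratch space, 1 TB of user space and 3 TB of archive cache, and only the scratch space relies on a host-side stripe.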

Performance resiliency was of primary importance, and the Silicon Storage Appliance, with its parallel processing and fully pipelined architecture, ensured HPC performance even in the event of a component failure. The appliance has fully redundant systems including the High Speed Traffic Directors (HSTD) where the appliance’s boards and chip sets reside. The Silicon Storage Appliance provided the added protection of parity protected data with parallel processing RAID engines using a proprietary RAID algorithm for SAN data storage, suited and optimized for distributed, multi-server storage network environments. In the event of a failure of a disk, parity loop or path, data is automatically reconstructed in real time. This advanced failover mechanism ensures that data transfers are not hindered or impeded in the event of a loop failure. The ability to continue delivering data at rated bandwidth in the event of a loop, disk or cache failure contributes to resiliency and ensures QoS commitments can be met during recovery and rebuild operations.

This consolidation model can be configured with a single 64 port director-class Fibre Channel switch or four 16 port Fibre Channel switches. With director-class switches, the UNIX servers, NAS servers and supercomputer would be attached via a Fibre Channel HBA to the 64 port director-class Fibre Channel switch. The director-class switch would connect to each of the HSTDs in the Silicon Storage Appliances. With the Silicon Storage Appliance’s virtualization technology, storage could be configured into a variety of virtualized LUNs up to a single 20TB LUN. The parallel processing architecture and PowerLUN capability allowed the Silicon Storage Appliance to provide higher performance than multiple striped LUNs in a first generation SAN.

Director class switches provide a level of management and network simplicity and resiliency that 16 port switches do not, but at a significantly higher cost per port.


Characterized by a continuous stream of mission and technology change, HPC facilities require infrastructure flexibility and scalability. The ability to quickly adjust to change and respond to mission, legislative and budget change delivers competitive advantage.

The HPC lab was looking for scalability in performance and connectivity without incurring downtime or reducing vendor choice. The IP model provides for connectivity scalability to the larger and ubiquitous Ethernet network. This connectivity allows the HPC lab to cost-effectively scale to include desktop users throughout the facility, extending the benefits of storage network data access to workgroups.

The File Sharing model provides performance and productivity scalability by enabling collaborative computing and increased work flow efficiency. Furthermore, the ability to concurrently share data delivers data coherency, eliminating the need for data replication and thereby reducing acquisition and maintenance costs.

IP Model

The IP model provides the ability to cost-effectively scale storage networking to the desktop via iSCSI technology. In the IP model, the Silicon Storage Appliance can connect directly to an iSCSI router like the Cisco SN 5420 Storage Router, extending storage network connectivity and data access. The SN 5420 storage router is a 1U rack-mountable chassis that has one Gigabit Ethernet port and one Fibre Channel port. Used with a SAN File System, direct SAN to Ethernet connectivity can occur, bringing cost-effective storage networking to thousands of desktop systems. This connectivity scalability provides flexibility and investment protection.

File Sharing Model

Data sharing is compelling for the HPC lab to increase mission productivity and reduce storage acquisition and operating costs. These benefits are derived from workflow efficiencies and collaborative computing advantages. The Silicon Storage Appliance was designed with the understanding that HPC labs would not only consolidate and amortize storage but also ultimately need to share their large stores of data efficiently and cost effectively.

File sharing SANs present special challenges in high performance environments. The HPC facility was looking to overcome these performance and scalability issues including contention and switching latencies, which are exacerbated by the creation of high-speed volumes through striping by the host computers. These extremely high performance and tremendously scalable file sharing SANs present very difficult issues for first generation SAN technology.

The HPC lab wanted to implement a SAN that could share files among up to 20 hosts initially, with room to scale to at least 100.

System 2 - Architect a Flexible, Scalable Network Topology

Figure: SCSI Routing with a Silicon Storage Appliance. Components shown: Silicon Storage Appliance disk storage, application servers (connected via FC), iSCSI clients.

Figure: Small File Sharing SAN

The first diagram displays a small first generation File Sharing SAN using eight typical RAID systems (this could be four dual-controller RAIDs) and a Fibre Channel switch, while the second diagram utilizes a Silicon Storage Appliance in a similar manner.

The Silicon Storage Appliance is designed to address the need for a simple to install and manage, flexible, scalable, intelligent network storage infrastructure optimized for HPC applications. In the past, because of the nature and amount of data, less attention was paid to optimizing network storage architectures as a potential source of competitive advantage. The goal of the Silicon Storage Appliance is to create sustainable competitive advantage with intelligent infrastructure solutions that deliver flexibility, reduce TCO and increase productivity, while delivering simplicity and manageability.

Contention and Switching Latency

Several issues were immediately apparent in the first generation SAN configuration. First, every storage access by any host computer results in eight distinct transactions, one to every RAID system. That means there are 8 switching latency occurrences that reduce performance somewhat.

More important, however, is the effect of multiple hosts making I/O requests concurrently. Not only are there switching latencies to contend with, but far worse is the number of collisions for access to switch ports attached to the individual RAID systems. This contention can severely limit the aggregate throughput provided through the switch. In fact, the combination of switching latencies and contention can degrade the true throughput in a high-bandwidth environment by as much as one-half of what the RAID systems could theoretically provide.

As might be expected, the effects of switching latency and contention continue to worsen as both the number of hosts and the complexity of the switching fabric increase.
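The growth in fabric traffic can be made concrete with a simple count. The sketch below assumes, as a deliberate simplification, that every host issues one volume access at the same moment and that each access touches every member of the stripe; the function name `switch_transactions` is hypothetical, and the host counts of 20 and 100 are the lab's stated initial and target scale.

```python
def switch_transactions(hosts, stripe_width):
    """Fabric transactions generated when every host issues one volume access.

    Assumption: each access fans out to every RAID system in the stripe.
    """
    return hosts * stripe_width


for hosts in (1, 20, 100):
    striped = switch_transactions(hosts, stripe_width=8)
    single = switch_transactions(hosts, stripe_width=1)
    print(
        f"{hosts:>3} hosts: {striped} transactions with an eight-way stripe, "
        f"{single} with a single LUN"
    )
```

The count says nothing about latency itself, but it shows why contention for the switch ports attached to each RAID system grows with both the number of hosts and the stripe width.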

Elimination of Contention and Switching Latency, Silicon Storage Appliance

The HPC lab looked to the Silicon Storage Appliances to eliminate the contention and latency associated with the switch fabric. All the host ports of a Silicon Storage Appliance have parallel access to all the storage resources without having to pay the penalties of contention or switching latency. This approach is analogous to multiple straws in a glass of water; there is no waiting to drink from any of the straws, and the straws do not interfere with each other.

This parallel host port capability of Silicon Storage Appliances is one of the major differentiating features of its architecture, and it allows the HPC facility to remove the scalability- and performance-robbing limitations of the first generation SAN.

Implications of Striping

The HPC lab also wanted to eliminate another insidious limitation caused by striping that severely limits performance in a first generation file-sharing SAN, the reduction in I/O block size requested to each storage unit.
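As a rough illustration of this reduction, the sketch below divides a volume request evenly across the members of a stripe. The even split is an assumption, and the function name `per_unit_request_size` is hypothetical; the 512Kbyte request and the eight-way stripe are the values used in this section.

```python
KIB = 1024


def per_unit_request_size(volume_request_bytes, stripe_width):
    """Request size seen by each storage unit, assuming an even split."""
    return volume_request_bytes // stripe_width


for request_kib in (512, 1024):
    unit_bytes = per_unit_request_size(request_kib * KIB, stripe_width=8)
    print(f"{request_kib} Kbyte volume request -> {unit_bytes // KIB} Kbyte per storage unit")
```

With an eight-way stripe, a 512Kbyte request reaches each storage unit as a 64Kbyte request, well below the request sizes at which storage systems are said to reach their best throughput.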

When a host application makes an I/O request to a volume, it makes a request of a particular I/O block size. Typical block sizes range from bytes File-Sharing Before Silicon Storage Appliance File-Sharing with a Silicon Storage Appliance

(13)

Virtualization with Performance

To eliminate striping, the HPC lab utilized the Silicon Storage Appliances’ unique “PowerLUN Virtualization” capabilities. This means the Silicon Storage Appliance can, without host-based striping, present very high-performance volumes to multiple hosts. The Silicon Storage Appliances provided simultaneous full-speed access from each host port to individual data or to the same data.

Full-speed access from a single storage pool requires that the shared LUN be very high-speed. This is accomplished by the unique S2A Virtualization and load balancing characteristic that allows a single LUN to be spread across an arbitrary number of parity group “Tiers”. In this manner a single PowerLUN can be constructed that sustains 650-700Mbytes/second.

This high-speed LUN was created without the need for striping at the host or SAN file system level. This created three benefits. The first is that large I/O block size requests may be realized at the storage unit. The second is that this also greatly reduces the contention and switching latency “chatter” present in a switched fabric. The third is the reduction in CPU utilization that striping at the host computer consumes.

Along with sharing data, failover functionality is also impacted by the limitations of first generation SAN technology. Therefore, to realize the HPC lab’s performance and scalability requirements while meeting HPC application and

Storage Appliance Single-LUN Virtualization

DataDirect Networks 13

(512 bytes in the case of a standard TAR), up to 1Mbyte or more for high-bandwidth, visualization applications. Generally small I/O block size requests are generated by high-transaction rate applications, while large block size requests are generated by applications that transfer large masses of data (imaging, data mining, supercomputing, etc).

Disk storage systems, in order to achieve high throughput for rich media applications, favor receiving the largest I/O block size request possible. Most storage systems maximize their throughput (measured in megabytes/second) when I/O block request sizes exceed 512Kbytes. Most applications and file systems limit the I/O block request size to a volume to the 1Mbyte range (except in some large supercomputers where block sizes can range all the way to 16MByte).

In the case of the eight-way stripe in a first generation SAN, if the volume received a 512KByte I/O block size request, that would mean each individual storage unit would only see 512KByte / 8 = 64Kbyte I/O block size request. Although not terribly small, a 64KByte block size would likely not come close to maximizing a storage unit’s throughput. This is true with both a first generation SAN and a Silicon Storage Appliance SAN.

Consequently, the HPC facility used the Silicon Storage Appliance as a means to reduce and eliminate striping.

(14)

data availability standards, the Silicon Storage Appliance was implemented in a high availability mode. The Silicon Storage Appliance, with its parallel host-port capability, allows for extremely efficient high-availability failover as well as load balancing.

This capability provides the flexibility to accommodate changes in failover techniques as well as the move to high availability clustering.

SAN Storage Failover

The challenge first generation SAN technology posed to the HPC lab was complex and unreliable failover and the lack of load balancing in a distributed computing model. Local storage failover is the simple case, as it concerns just a single host computer. This is very typical of a direct-attached dual-controller RAID configuration. If a path to either RAID controller is lost (due to a failure in the cable, Host-Bus Adapter, or RAID controller itself), the host may access the storage behind the failed RAID controller through the remaining RAID controller. This re-direction of access generally requires a process on the RAID system to re-assign all of the LUNs to the remaining active controller, and also to redirect data access to the Host-Bus Adapter that is connected to the active controller. Load balancing is not possible with normal RAID systems, as LUNs generally must be accessed only through the controller that “owns” them.

The Silicon Storage Appliance, by virtue of its parallelism and port virtualization, allows for a much faster and more elegant failover operation and also permits a load-balancing capability unavailable with any other storage network infrastructure solution. The Silicon Storage Appliance allows for data access from any port to any storage. LUNs are not tied to ports, eliminating the contention and blocking that occurs with first generation SAN technology and topologies.

Furthermore, if a host fails, all that is required is the intelligence to look at an alternate path. This path could be to the same HSTD or through the redundant HSTD within the same Silicon Storage Appliance. No failover event is required at the Silicon Storage Appliance itself to re-assign LUNs, so there is no time wasted by that process; the alternate path is instantly available. Software that alternates pathing at the host during a failover event was available from several vendors, including the host manufacturers themselves, but an even more elegant solution was found at the Host-Bus Adapter driver level. Providing alternate pathing at the driver level allowed for very fast failover events.

Software at the host level was included to leverage the capabilities of the Silicon Storage Appliance and permit load balancing if alternate paths to the storage are available. Veritas Volume Manager was configured with Dynamic Multi-Pathing (DMP). DMP uses all available paths to transfer data. This ability, known as “trunking” or “bonding” in the telecommunications world, is analogous to striping except that it doesn’t require separate LUNs to stripe against. Instead, the host transfers the data as fast as possible using all available paths. A byproduct of this capability is the ability to route around a failure in any of the paths. While technically not a failover, it performed the same function, only better.

Host-to-Host Failover Cluster

For the future, the HPC lab wanted the flexibility to support host-to-host failover clustering. Host-to-host failover usually provides the same local failover capabilities but also provides for redundant hosts in case of a catastrophic host failure. In this scenario both hosts require access to the same LUNs, although typically a LUN is “owned” by an active host, with the other host supporting it in a passive mode. A “heartbeat” signal connects the redundant hosts, and if the heartbeat disappears the passive host takes ownership of all LUNs (and also the Ethernet address, etc.) and becomes active. Again, the parallel nature of the Silicon Storage Appliance will allow this to occur very quickly and in a very elegant fashion, just as in the local failover situation above. Load balancing may still be employed in this scenario. Fibre-Channel switches may also be employed to increase the port connectivity.

The High-Availability software for Host-to-Host failover clustering is available from a variety of vendors, including the host manufacturers themselves.
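The two host-side behaviors described in this section, spreading I/O across every live path and promoting a passive host when the heartbeat stops, can be illustrated with a small Python sketch. It is a toy model only and is not how Veritas DMP, the Host-Bus Adapter driver, or any vendor's High-Availability software is implemented. The class names (Path, MultipathDevice, FailoverPair), the round-robin policy, and the timeout value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One host-to-storage route (hypothetical model: HBA, cable, port)."""
    name: str
    healthy: bool = True


class MultipathDevice:
    """Round-robins I/O over every healthy path to a single LUN."""

    def __init__(self, lun, paths):
        self.lun = lun
        self.paths = list(paths)
        self._next = 0  # index of the next path to try

    def submit(self, block):
        """Return the name of the path chosen to carry this block."""
        for _ in range(len(self.paths)):
            path = self.paths[self._next % len(self.paths)]
            self._next += 1
            if path.healthy:
                return path.name  # failed paths are simply skipped
        raise OSError(f"no healthy path to {self.lun}")


class FailoverPair:
    """Active/passive hosts joined by a heartbeat (assumed timeout policy)."""

    def __init__(self, active, passive, timeout=3.0):
        self.active = active
        self.passive = passive
        self.timeout = timeout
        self.last_beat = 0.0

    def heartbeat(self, now):
        """Record a heartbeat received from the active host at time `now`."""
        self.last_beat = now

    def check(self, now):
        """Promote the passive host if the heartbeat has gone quiet.

        Returns the host that currently owns the LUNs.
        """
        if now - self.last_beat > self.timeout:
            self.active, self.passive = self.passive, self.active
            self.last_beat = now  # restart the clock for the new active host
        return self.active


if __name__ == "__main__":
    dev = MultipathDevice("lun0", [Path("hba0"), Path("hba1")])
    print([dev.submit(b) for b in range(4)])  # both paths share the load
    dev.paths[0].healthy = False              # simulate a cable failure
    print([dev.submit(b) for b in range(3)])  # traffic routes around it

    pair = FailoverPair("hostA", "hostB", timeout=3.0)
    pair.heartbeat(10.0)
    print(pair.check(12.0))  # heartbeat is recent, hostA stays active
    print(pair.check(14.5))  # heartbeat missed, hostB takes over
```

Note that the multipath model needs no LUN re-assignment step when a path fails: the next request simply goes out on a surviving path, which is the property the text attributes to any-port-to-any-storage access.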

Phase 3 - Automate Storage Management

Protection of information and control of data assets are important components in the HPC lab storage management process, especially when the HPC data costs tens of millions of dollars to build, acquire, simulate and analyze. To address its information management challenge, the HPC lab implemented various modules of DataDirect Networks’ storage management software. This software enabled the HPC lab to configure, monitor and tune its network storage systems simply and easily throughout its entire production infrastructure. The HPC lab is also taking advantage of the tight integration between the Silicon Storage Appliance and the S2A directMONITOR management station.

By utilizing Silicon Storage Appliance technology for its production and delivery infrastructure and S2A directMONITOR for its storage and network management process, the HPC lab can now manage the entire storage infrastructure from one centralized location, thus minimizing staffing and training costs. S2A directMONITOR allows the HPC lab to automate many of its management processes to minimize IT staff intervention.

Moreover, by utilizing a front-end switched-fabric architecture with the fully pipelined parallelism of the Silicon Storage Appliance, the HPC lab benefits by having a highly available and redundant HPC quality infrastructure. If a data pathway fails (e.g. cable, switch port, loop or HSTD failure), the system automatically re-routes the traffic down the surviving pathway, ensuring data availability to the host. The HPC lab now has a proactive approach to monitoring and managing problems that could arise. The Silicon Storage Appliance’s built-in zero-time failover capability allows the HPC lab to share and manage its local and remote data automatically, no matter where the arrays are located: locally, nationally, or internationally.

To augment its IT staff, the HPC lab deployed Silicon Storage Appliance with directREPORT functionality, standard with each appliance, which enables DataDirect Networks service and support to analyze a problem quickly, providing for a fast response and corrective measures. It leverages expert logic built into the sophisticated Silicon Storage Appliance Customer Support Center database to utilize historical data and reported problem information to minimize downtime.

At the conclusion of phase three, a reliable, highly scalable IT environment was in place to protect the most critical lab asset, data.

Benefits of DataDirect Silicon Storage Appliance in a HPC Infrastructure

It took one week for the HPC lab to standardize on the Silicon Storage Appliance infrastructure and consolidate data, applications and platforms on DataDirect Storage Network-Infostructure, a flexible, scalable and continuously upgradable information infrastructure. Residing at the heart of its DataDirect Storage Network, the Silicon Storage Appliances provided the foundation on which an HPC facility can build a flexible, scalable, and continuously available information infrastructure for a connected world.

The HPC lab will realize several benefits, including cost savings and avoidance and, ultimately, increased scientist productivity through accelerated application performance.

The HPC lab will be able to reduce personnel costs by centralizing information management, reducing complexity, and assigning fewer IT personnel to manage, back up and monitor the workload across the HPC data network. The HPC lab further reduces personnel costs because it can now run the entire production and delivery operation with only a single IT person. This simpler infrastructure, which is easier to deploy and manage, will also save on warranty and support costs.

The HPC lab avoids operational costs, including the daily cost of assets and people left idle by system downtime or waiting for data, which is estimated at several million dollars per day. It also eliminates the immediate need to recruit and train more IT professionals. In addition, the HPC facility can now manage its entire production and delivery IT infrastructure (more than 100 clients and servers, and 100 TB) with one IT professional.

Most significantly, the HPC lab is generating outcomes more quickly. True integration of application, information and infrastructure brought a 300% to 500% improvement in productivity in the production-to-air life cycle. This, in turn, accelerates return on capital from productive properties and enables the HPC facility to focus capital on more productive properties more quickly. The HPC lab has also increased collective productivity, which greatly aids the organization’s shared goal of creating and delivering outcomes with greater accuracy, quality and speed.

Armed with a scalable, flexible and reliable HPC infrastructure, the HPC facility is now able to create competitive advantage from its workflow process and data delivery.

SUMMARY OF BENEFITS

• Consolidated HPC content in a centralized location is now easily shared throughout the production facility.
• Workgroup collaboration with concurrent shared access to data, creating tremendous efficiencies and increased productivity for quicker ROI.
• Scalable reliability for increased application and production availability.
• Virtualized storage management for quick and easy resource allocation without an increase in IT staff.
• Easy, cost-effective scalability for ever-growing data stores and delivery requirements, for reduced operating costs.

Next Steps

This DataDirect Networks Silicon Storage Appliance implementation case study provides an analysis of how one HPC customer might build and use a DataDirect Storage Network-Infostructure to surmount its IT challenges and attain its business objectives. Over a hundred businesses around the world are using DataDirect Storage Network-Infostructure today to help them achieve their most aggressive business goals. To discover how a DataDirect Storage Network-Infostructure can help your business achieve its greatest ROI in the shortest time, contact your local DataDirect Networks sales representative or value-added systems integrator. Or visit DataDirect on the Web at www.datadirectnet.com.

Chatsworth, California 91311

(818) 700-7600 (Phone)

(818) 700-7601 (Fax)

www.datadirectnet.com

© 2002 DataDirect Networks, Inc