
Availability analysis of blade server systems

W. E. Smith, K. S. Trivedi, L. A. Tomek, J. Ackaret

The successful development and marketing of commercial high-availability systems requires the ability to evaluate the availability of systems. Specifically, one should be able to demonstrate that projected customer requirements are met, to identify availability bottlenecks, to evaluate and compare different configurations, and to evaluate and compare different designs. For evaluation approaches based on analytic modeling, these systems are often so complex that state-space methods are not effective because of the large number of states, whereas combinatorial methods are inadequate for capturing all significant dependencies. The two-level hierarchical decomposition proposed here is suitable for the availability modeling of blade server systems such as IBM BladeCenter, a commercial, high-availability multicomponent system comprising up to 14 separate blade servers contained within a chassis that provides shared subsystems such as power and cooling. This approach is based on an availability model that combines a high-level fault tree model with a number of lower-level Markov models. It is used to determine component-level contributions to downtime as well as steady-state availability for both standalone and clustered blade servers. Sensitivity of the results to input parameters is examined, extensions to the models are described, and availability bottlenecks and possible solutions are identified.

INTRODUCTION

Blade server systems are available that can be used to meet high-availability requirements for many commercial systems, such as e-commerce, financial, stock trading, and telephone communications, in addition to several types of life-critical and safety-critical systems. However, it is all too common for server modules (blades) to be used as stand-alone servers with shared services without carefully considering how best to configure the environment to maximize availability. Many techniques to achieve high availability are known.1–4 Technical analysis is used frequently for quantifying computer system characteristics such as reliability and availability;5–13 many software packages supporting such analyses are available.9,10,14–19 Nevertheless, such analysis is not routinely carried out on commercial high-availability products like blade server systems. There are only islands of such competency, even in large companies.

© Copyright 2008 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of the paper must be obtained from the Editor. 0018-8670/08/$5.00 © 2008 IBM

When such analysis is attempted, engineers commonly use reliability block diagrams or fault trees to formulate and solve such availability models because of their simplicity and efficiency.18,20 But such combinatorial models cannot easily incorporate realistic system behavior such as less-than-complete (imperfect) fault coverage, multiple failure modes, hot-swap components, etc.21,22 By contrast, such dependencies and multiple failure modes can be easily captured by state-space models such as Markov chains,22 semi-Markov processes,23 and Markov regenerative processes.23 But the computational requirements for building, storing, and solving such state space models for real systems can become prohibitive.7,8 The problem of large model construction can be alleviated by using some variation of stochastic Petri nets,7,8,14,19 but a more practical alternative is to use a hierarchical approach where a judicious combination of state space models and combinatorial models is utilized.15,18 Such hierarchical models have been successfully used to solve many practical problems.9,10,24–26

The purpose here is to provide a practical approach for evaluating the availability of blade server systems, to apply it to the IBM BladeCenter*, and to show how these designs can be easily configured to achieve extremely high availability, suitable even for some mission-critical application environments. In the two-level hierarchical approach that is employed in this analysis, each subsystem is modeled as a Markov chain, whereas the entire system is modeled as a fault tree because some of the system component failures affect several different portions of the system at the same time. Such effects can be captured by fault trees with repeated events but cannot be captured by other methods, such as traditional reliability block diagrams.20,21 The BladeCenter is a system whose complexity precludes its modeling as a single-level state space model. The number of components in the BladeCenter that are subject to failure is close to 140. If each component were to be in one of two states only (although some components actually have more than two states), the size of the state space of the overall Markov chain would be $2^{140}$. However, as various dependencies exist in the system, an overall combinatorial model will not suffice. The dependencies within BladeCenter subsystems are modeled in this paper by employing homogeneous continuous-time Markov chains. Independence across subsystems is assumed and hence, a combinatorial model is used to aggregate the subsystem availabilities into the overall system availability.
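For a rough sense of scale, the arithmetic behind the state-space argument can be sketched in Python. The component count of 140 comes from the text, and treating every component as exactly two-state is the simplification the text itself makes. The variable names and the one-byte-per-state storage estimate are illustrative assumptions only.

```python
# Size of a flat (single-level) Markov model if each of ~140 components
# is either UP or DOWN, as in the simplified argument above.
n_components = 140
flat_states = 2 ** n_components

print(f"flat state space: about {flat_states:.2e} states")

# Illustrative assumption: one byte of storage per state.
# Even that is far out of reach, which motivates small per-subsystem chains.
bytes_per_exabyte = 10 ** 18
exabytes_needed = flat_states // bytes_per_exabyte
print(f"one byte per state would need about {exabytes_needed:.2e} exabytes")
```

By contrast, each Markov submodel used in the hierarchical approach has only a handful of states, so the submodels can be solved independently and their results combined in the fault tree.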

The stochastic models developed here are solved in order to quantify the availability of single and multiple blades as well as to identify key contributors to downtime. The expected uptime of a single server blade in a chassis can be significantly increased by understanding contributors to downtime and configuring the BladeCenter appropriately. Further analysis of the full fault tree shows how, as blade-based server systems scale out by adding more nodes to the system configuration, downtime can be minimized by using spare blades in hot-standby mode. Markov submodels are able to capture the detailed dependencies within each subsystem. However, our assumption that subsystems behave independently does not hold if a single, shared repair person provides repairs for all the subsystems, because of contention for that repair person, repair prioritization, and so on. The effect of such dependencies on the software submodel is addressed in the Extensions section by way of fixed-point iteration.

The rest of the paper is organized as follows. The next section contains a brief description of the BladeCenter product and the key assumptions made in modeling it. In the following section, models are developed for the BladeCenter subsystems, which include chassis components (midplane, cooling subsystem, power domain, and network subsystem) and blade server components (blade server, processor subsystem, software, memory subsystem, and disk storage subsystem). The section that follows describes the way in which a hierarchical model for BladeCenter availability is created that makes use of the submodels developed in the previous section. Hardware and software input parameters are discussed and the model results are examined. Sensitivity of these results to software failure rates, service response times, and midplane common mode faults is provided. Areas for further investigation and future extension of this model are described in the Extensions section. These include a fixed-point iteration methodology to capture repair dependence and techniques to relax the assumption of exponentially distributed failure and repair times. The concluding section examines the availability bottlenecks in the BladeCenter architecture, along with design and configuration alternatives, and closes with a summary of the capabilities of this hardware architecture to support various levels of availability.

SYSTEM DESCRIPTION

Blade servers have been widely adopted, in part because of their modular design, a multi-server frame technology that is available from several server vendors. These designs are based on industry-standard racks, and provide denser packaging facilitated by shared power, cooling, and other common services within the chassis. Integrated network switches provide additional conservation of space and significant reductions in cabling. Total cost of ownership can be reduced through simplification of management tasks such as deploying, reprovisioning, updating, and troubleshooting hundreds of servers. The IBM BladeCenter system shown in Figure 1 is one such design, supporting up to 14 computing elements known as blade servers.27 From the front side, access is provided to the control panel, removable media devices, and the blade servers. Network switch modules, power supplies, cooling devices, and management modules are located at the rear of the chassis. All these devices plug into a central midplane that provides power distribution, sideband management buses, and network interconnections. The midplane is a redundant, fail-in-place design. Power domain 1 consists of power supplies 1 and 2. These power supplies are redundant and all devices attached to power domain 1 remain operational if one of the power supplies fails. Similarly, power domain 2 consists of power supplies 3 and 4, which are also redundant. Power domain 2 supplies blades 7–14, as indicated by shaded blocks in Figure 1. Everything else in the BladeCenter chassis is supplied by power domain 1.

For the work described here, IBM BladeCenter HS20 blade servers are configured with two Intel** Xeon** processors, four 1-gigabyte double-data-rate-2 synchronous dynamic random access memory (DDR2 SDRAM) dual in-line memory modules (DIMMs), and one or two small-form-factor (SFF) serial attached SCSI (SAS) disk drives (SCSI stands for Small Computer System Interface). Disk drives may be configured as Redundant Array of Independent Disks 1 (RAID1) arrays so that the blade server can continue to operate when one of the two disk drives fails. These blade servers also have the ability to recover in degraded mode from processor or memory failures by de-configuration of the failed components and restarting the blade server. Each of these fault recovery capabilities can improve customer uptime and they are reflected in the blade server submodels described in the following sections.

A typical configuration consists of a chassis with redundant blowers, two power domains each containing two redundant power supply modules, redundant management modules, redundant Ethernet switches, and redundant Fiber Channel network switches along with up to 14 blade servers. Configurations are differentiated by the models of the blade servers and by the speeds and types of the network switches. Firmware is treated as part of the hardware.

[Figure 1 (figure not reproduced): front and rear of the BladeCenter chassis. Front of system: control panel, CD-ROM (compact disk read-only memory), FDD (floppy disk drive), and processor blades 1–14. Rear of system: cooling blowers 1 and 2, management modules 1 and 2, switch modules (Ethernet 1 and 2, Fibre Channel 1 and 2), power supplies 1 and 2 (power domain 1), and power supplies 3 and 4 (power domain 2).]

MODELS OF BLADECENTER SUBSYSTEMS

BladeCenter subsystems fall into two categories: shared subsystems that are found in the chassis, and blade subsystems that are part of each blade server. The submodels described here were developed to investigate BladeCenter hardware availability. Software failures are introduced in the model to put the relative contributions of the hardware and of the software stack to downtime into perspective. Generally, permanent hardware faults in components are assumed. In addition, soft or transient faults are also considered in some of the hardware components.1,2 The distinction between permanent and transient faults made here is based on the actions required for fault removal. Faults that can be cleared by means of a reboot or restart are labeled as transient, while those faults that require a repair or replacement are called permanent. Software faults accounted for here are the ones that are sometimes known as soft or transient faults and more recently as Mandelbugs.16,28

In this paper, all times to failure are assumed to be exponentially distributed. Similarly, all other event times, such as the time to repair and field service response time, are assumed to be exponentially distributed. Techniques for relaxing this assumption are discussed in the Extensions section. Time to repair is separated into the response time of the repair person (including travel time) and the actual time to diagnose and to repair or replace the affected component. The phrase imperfect repair refers to a failure to fix a problem on the first attempt and to delays in obtaining necessary replacement parts. Imperfect repair is not included, but the models can be easily extended to include this aspect of time to repair. First, the Markov submodels of each of the subsystems are described.

Chassis midplane submodel

Each BladeCenter chassis contains a midplane that provides the interconnect paths between the blade servers, network switches, management modules, and so on. The midplane was designed to minimize the probability that it causes a blade server outage through the following design decisions: 1) the midplane contains only a few active components; 2) it provides two independent sets of interconnects to each blade; and 3) it has two independent connectors to each set of interconnects. Thus, the chassis midplane is a fail-in-place design, allowing the chassis to tolerate failure of either half of the midplane while the blade servers continue to function without an operational outage by utilizing the backup communication paths.

The midplane is represented by the Markov submodel in Figure 2. The midplane fails with a mean time to failure (MTTF) of mttfmp. The steady-state availability of the midplane submodel is the probability that the system is in operational states 1 or 2. State 2 represents the fault-free state of the midplane. State 1 represents the midplane that is still operational after a fault has been detected. At this point, failover of a communication path, if necessary, was successful and midplane replacement was requested. The midplane has a transition from state 2 to state 1 for most failures; any uncovered cases, such as any common mode failures, are represented by the transition to state 0. State 0 is a DOWN state and the transition rate to state 0 is determined by the common mode factor, fm.

In both state 0 and state 1, the system is in need of timely repair, and the midplane model shows a transition to state 3 on the arrival of the service person with a mean response time to arrival of mttrsp. The midplane is replaced with a mean time to repair of mttrmp. State 3 is a DOWN state because the chassis must be taken out of service to replace the midplane. For the purposes of the BladeCenter model, this time is counted as outage time. In practice, if the midplane is in state 1, the midplane replacement might be scheduled or the workload might be moved to blade servers in another chassis while this repair is performed to avoid interruptions to normal server workload processing. A closed-form solution to the steady-state availability is shown in Equation (1):

$$
A_{midplane} = \frac{\mathit{mttf}_{mp}\,(\mathit{mttf}_{mp} + \mathit{mttr}_{sp})}{(\mathit{mttf}_{mp} + \mathit{mttr}_{mp} + \mathit{mttr}_{sp})\,(\mathit{mttf}_{mp} + f_m \cdot \mathit{mttr}_{sp})} \tag{1}
$$

The midplane also provides power distribution within the chassis through two power domains. This is discussed in more detail in the description of the power domain submodel and in the top-level model description.
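As a cross-check on the midplane submodel, the four-state chain can be solved numerically and compared with the closed form of Equation (1). The following is a minimal Python sketch, not the tooling used in the paper. The parameter values are hypothetical placeholders, the helper `steady_state` is an illustrative name, and the arc from state 1 to state 0 at rate fm/mttfmp is an assumption inferred from the two fm/mttfmp labels in Figure 2. Exact rational arithmetic is used so that the comparison does not depend on rounding.

```python
from fractions import Fraction as Fr

def steady_state(n, rates):
    """Solve pi*Q = 0 with sum(pi) = 1 for a small CTMC.

    rates maps (src, dst) -> transition rate; states are 0..n-1.
    """
    # Row j of `a` is the balance equation of state j (transposed generator).
    a = [[Fr(0)] * n for _ in range(n)]
    for (i, j), r in rates.items():
        a[j][i] += r   # flow into j from i
        a[i][i] -= r   # flow out of i
    b = [Fr(0)] * n
    # Replace one (redundant) balance equation by the normalization condition.
    a[n - 1] = [Fr(1)] * n
    b[n - 1] = Fr(1)
    # Gauss-Jordan elimination with pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
                b[r] -= f * b[col]
    return [b[i] / a[i][i] for i in range(n)]

# Hypothetical inputs in hours; these are NOT the paper's parameter values.
mttfmp, mttrsp, mttrmp, fm = Fr(100000), Fr(4), Fr(2), Fr(1, 100)

rates = {
    (2, 1): (1 - fm) / mttfmp,  # covered fault, failover succeeds
    (2, 0): fm / mttfmp,        # uncovered or common-mode fault
    (1, 0): fm / mttfmp,        # assumed arc (second fm/mttfmp label in Figure 2)
    (1, 3): 1 / mttrsp,         # service person arrives, chassis taken down
    (0, 3): 1 / mttrsp,         # service person arrives
    (3, 2): 1 / mttrmp,         # midplane replaced
}

pi = steady_state(4, rates)
a_numeric = pi[1] + pi[2]       # states 1 and 2 are the UP states
a_closed = (mttfmp * (mttfmp + mttrsp)) / (
    (mttfmp + mttrmp + mttrsp) * (mttfmp + fm * mttrsp)
)
print(float(a_numeric), a_numeric == a_closed)
```

Under these assumptions the numerical solution and Equation (1) agree exactly. The same helper could be applied to the other Markov submodels by changing the rate dictionary and the set of UP states.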

Chassis cooling submodel

The BladeCenter cooling subsystem contains two redundant hot-pluggable blowers. The Markov submodel in Figure 3 is used to represent both the cooling subsystem and the power domain subsystem. States 1, 2, and 3 are UP states. Blowers fail with an MTTF of mttfc. Blower failures are detected by a management module that monitors tachometers on each blower. When one of the blowers or its monitoring hardware fails, the remaining blower enters a full-speed mode. Because of this fail-safe mode, no coverage factor is included in the cooling subsystem model and the dashed transition from state 2 to state 0 is not used in this submodel. State 2 represents the condition when both blowers are operating normally. If one of the two blowers or its monitoring hardware fails, the cooling subsystem transitions from state 2 to state 1. A service person is summoned with a mean service response time of mttrsp. When the service person arrives, the subsystem enters state 3. Since the blower is hot-pluggable and redundant, it can be replaced and the cooling subsystem returns to state 2 with a mean time of mttrc without any blade server downtime. From state 1, the second blower could fail before arrival of the service person, resulting in a shutdown of the entire BladeCenter. In this case, a transition to state 0 occurs. The cooling subsystem then transitions to state 5 on the arrival of the service person with a mean service response time of mttrsp. The repair process completes and the subsystem is restored to state 2 with a mean repair time for both blowers of mttr2c.

Note here that, for simplicity, state 5 could have been omitted and a transition from state 0 to state 2 with a mean transition time of (mttrsp + mttr2c) utilized. Modeled in this way, the transition from state 0 to state 2 is not exponentially distributed but rather two-stage hypo-exponentially distributed.26 Strictly speaking, the model will then be a semi-Markov process (SMP). But it is known that the steady-state solution of an SMP depends only on the mean sojourn times in its states, so treating the SMP as a Markov chain does not introduce any error in the steady-state solution in this case.23,26 However, both states are included here in order to explore the significance of the repair dependency when a single repair person is assumed (see Extensions).

The second blower could also fail while the subsystem is under repair in state 3. Here, the transition occurs to state 4 and, since the repair person is already on-site, only a repair time for both blowers is required to return the subsystem to state 2. Let d(cooling) represent the denominator of the closed-form expression for the steady-state availability of this submodel. Then, the steady-state availability of the cooling subsystem, A_cooling, is given in Equations (2) and (3):

$$
\begin{aligned}
d(\mathit{cooling}) ={}& \mathit{mttf}_{c}^{2}\,(\mathit{mttf}_{c} + 3\,\mathit{mttr}_{c} + 3\,\mathit{mttr}_{sp}) \\
&+ \mathit{mttf}_{c}\,\bigl(2\,\mathit{mttr}_{sp}^{2} + 2\,\mathit{mttr}_{2c}\,(\mathit{mttr}_{c} + \mathit{mttr}_{sp}) + 3\,\mathit{mttr}_{c}\,\mathit{mttr}_{sp}\bigr) \\
&+ 2\,\mathit{mttr}_{c}\,\mathit{mttr}_{sp}\,(\mathit{mttr}_{sp} + \mathit{mttr}_{2c})
\end{aligned} \tag{2}
$$

$$
A_{cooling} = \frac{\mathit{mttf}_{c}}{d(\mathit{cooling})}\,\bigl(2\,\mathit{mttr}_{c}\,\mathit{mttf}_{c} + (\mathit{mttf}_{c} + \mathit{mttr}_{c})\,(\mathit{mttf}_{c} + \mathit{mttr}_{sp}) + 2\,\mathit{mttr}_{sp}\,(\mathit{mttf}_{c} + \mathit{mttr}_{c})\bigr) \tag{3}
$$

Chassis power domain submodel

The BladeCenter contains two identical power domain subsystems. Each power domain contains redundant, hot-pluggable power supply modules and is also represented by the Markov submodel shown in Figure 3. States 1, 2, and 3 are UP states. Power supply modules fail with an MTTF of mttfps and have mean times to repair of mttrps1 and mttrps2 for one or two power supplies, respectively. This model differs from the cooling subsystem model in that, when one of the power supply modules fails in state 2, an uncovered or common mode fault can bring both of the power supplies down with probability (1-cpsub), as depicted by the dashed transition from state 2 to state 0. Let d(power) represent the expression for the denominator. Then, a closed-form expression for the steady-state availability of the power supply subsystem, A_power, is given in Equations (4) and (5):

$$
\begin{aligned}
d(\mathit{power}) ={}& \mathit{mttf}_{ps}^{2}\,\bigl(\mathit{mttf}_{ps} + 2\,c_{psub}\,(\mathit{mttr}_{ps1} - \mathit{mttr}_{ps2}) + \mathit{mttr}_{ps1} + 2\,\mathit{mttr}_{ps2} + 3\,\mathit{mttr}_{sp}\bigr) \\
&+ \mathit{mttf}_{ps}\,\bigl(2\,\mathit{mttr}_{sp}^{2} + 2\,\mathit{mttr}_{ps2}\,(\mathit{mttr}_{ps1} + \mathit{mttr}_{sp}) + 3\,\mathit{mttr}_{ps1}\,\mathit{mttr}_{sp}\bigr) \\
&+ 2\,\mathit{mttr}_{ps1}\,\mathit{mttr}_{sp}\,(\mathit{mttr}_{sp} + \mathit{mttr}_{ps2})
\end{aligned} \tag{4}
$$

$$
A_{power} = \frac{\mathit{mttf}_{ps}}{d(\mathit{power})}\,\bigl(2\,c_{psub}\,\mathit{mttr}_{ps1}\,\mathit{mttf}_{ps} + (\mathit{mttf}_{ps} + \mathit{mttr}_{ps1})\,(\mathit{mttf}_{ps} + \mathit{mttr}_{sp}) + 2\,c_{psub}\,\mathit{mttr}_{sp}\,(\mathit{mttf}_{ps} + \mathit{mttr}_{ps1})\bigr) \tag{5}
$$

[Figure 2 (figure not reproduced): Markov model for the BladeCenter midplane. States 0–3; arc labels: fm/mttfmp (twice), (1-fm)/mttfmp, 1/mttrsp (twice), 1/mttrmp.]

Chassis network switch submodels

Each BladeCenter chassis can accommodate four network switch devices that are configured as two pairs of identical network switches for redundancy. For the purposes of the BladeCenter model presented here, only Ethernet and Fiber Channel networks are represented, although InfiniBand and other technologies are supported.

The network switches are represented by a simple alternating renewal process model that has two states. Switch failures occur with an MTTF of mttfesw for the Ethernet switch (or mttffcsw for the Fiber Channel switch). When down, the switch is replaced and service restored to the UP state in a time that is the sum of the mean time for service response (mttrsp) plus the mean time to repair of the switch. Closed-form expressions for the steady-state availability of the Ethernet switch subsystem and of the Fiber Channel switch subsystem are given in Equations (6) and (7), respectively:

$$
A_{ethsw} = \frac{\mathit{mttf}_{esw}}{\mathit{mttf}_{esw} + \mathit{mttr}_{esw} + \mathit{mttr}_{sp}} \tag{6}
$$

$$
A_{fcsw} = \frac{\mathit{mttf}_{fcsw}}{\mathit{mttf}_{fcsw} + \mathit{mttr}_{fcsw} + \mathit{mttr}_{sp}} \tag{7}
$$

Blade server processor submodel

Each blade server has two processors. The processor or CPU subsystem is represented by the Markov submodel in Figure 4, which has six states. States 2 and 1 are UP states representing the blade server with two or one operational processors, respectively. While in state 2, the processor subsystem may experience a failure represented by the transition to state 4. This event may be due to a transient hardware fault with probability cpt, which is assumed to occur in 1 percent of failures. In this case, the fault clears on reboot and the processor subsystem recovers to state 2, as depicted by the dashed transition, when the blade server reboots. Otherwise, the failure is a hard fault that is detected on blade server reboot and the processor subsystem recovers in degraded mode to state 1 with a single functioning processor. In state 1, a service person is summoned with a mean time to respond of mttrsp. When this person arrives, the blade server must be removed for repair, so the processor subsystem enters the DOWN state 3 for the repair and then returns to state 2 once it is completed. If the second processor fails while the processor subsystem is still in state 1, then no functioning processors are left. The blade server is unable to reboot and the subsystem transitions to the DOWN state 0, from which a repair person is summoned and the repair process is completed, restoring the subsystem to state 2. A closed-form expression for the steady-state availability of the processor subsystem is given in Equation (8):

$$
\begin{aligned}
A_{processor} ={}& \frac{\mathit{mttf}_{cpu}}{2\,\mathit{mttboot}_{1} + \mathit{mttf}_{cpu} + 2\,(1 - c_{pt})\,(\mathit{mttr}_{cpu} + \mathit{mttr}_{sp})} \\
&+ \frac{2\,(1 - c_{pt})\,\mathit{mttf}_{cpu}\,\mathit{mttr}_{sp}}{(\mathit{mttf}_{cpu} + \mathit{mttr}_{sp})\,\bigl(2\,\mathit{mttboot}_{1} + \mathit{mttf}_{cpu} + 2\,(1 - c_{pt})\,(\mathit{mttr}_{cpu} + \mathit{mttr}_{sp})\bigr)}
\end{aligned} \tag{8}
$$

[Figure 3 (figure not reproduced): Markov model for BladeCenter cooling and power domain subsystems. States 0–5; fixed arc label: 1/mttrsp (twice); parameterized arc labels X1, X2, Y, Z, ZZ as tabulated below.]

Transition rate   X1               X2                   Y          Z           ZZ
Cooling           2/mttfc          n/a                  1/mttfc    1/mttrc     1/mttr2c
Power domain      2*cpsub/mttfps   2*(1-cpsub)/mttfps   1/mttfps   1/mttrps1   1/mttrps2

Blade server memory submodel

Each blade server has two banks of memory. Each bank comprises two memory DIMMs. The memory subsystem is also represented by the Markov submodel in Figure 4. States 2 and 1 are UP states representing the blade server with two or one operational banks of error correction code (ECC) protected memory, respectively. While in state 2, the memory subsystem may experience an unrecoverable multi-bit error represented by the transition to state 4. The error is processed by the blade server BIOS and the memory experiencing the multi-bit error is always deconfigured, so the dashed transition from state 4 to state 2 is not used here. The blade server reboots to state 1 at rate 1/mttboot1 and the blade recovers utilizing the remaining memory bank.

In state 1, the repair person is summoned with a mean time to respond of mttrsp. Since the blade server must be removed from the chassis to replace memory, the repair is performed in DOWN state 3 with a mean time to repair of mttrmem. If the second memory bank fails prior to arrival of the repair person, no memory is operational and the memory subsystem enters the DOWN state 0 until the repair person arrives (state 5) and the repair is completed, restoring the memory subsystem to state 2. A closed-form expression for the steady-state availability of the memory subsystem is given in Equation (9):

$$
\begin{aligned}
A_{memory} ={}& \frac{2\,\mathit{mttr}_{sp}\,\mathit{mttf}_{mem}}{(\mathit{mttf}_{mem} + \mathit{mttr}_{sp})\,(2\,\mathit{mttboot}_{1} + \mathit{mttf}_{mem} + 2\,\mathit{mttr}_{mem} + 2\,\mathit{mttr}_{sp})} \\
&+ \frac{\mathit{mttf}_{mem}}{2\,\mathit{mttboot}_{1} + \mathit{mttf}_{mem} + 2\,\mathit{mttr}_{mem} + 2\,\mathit{mttr}_{sp}}
\end{aligned} \tag{9}
$$

Blade server disk and RAID submodels

Two configurations for the disk subsystem are considered. One configuration uses a single disk in each blade, while the other uses a mirrored RAID1 disk per blade. In the first case, a disk is modeled as an alternating renewal process, similarly to the network switch submodel discussed earlier. A closed-form solution for the availability of this simple disk subsystem is found in Equation (10).

[Figure 4 (figure not reproduced): Markov model for BladeCenter memory and processor subsystems. States 0–5; fixed arc label: 1/mttrsp; parameterized arc labels X1, X2, R, Z, ZZ as tabulated below.]

Transition rate   X1          X2          R           Z                  ZZ
Processor         2/mttfcpu   1/mttfcpu   1/mttrcpu   (1-cpt)/mttboot1   cpt/mttboot1
Memory            2/mttfmem   1/mttfmem   1/mttrmem   1/mttboot1         n/a

$$
A_{DISK} = \frac{\mathit{mttf}_{hdd}}{\mathit{mttf}_{hdd} + \mathit{mttr}_{hdd} + \mathit{mttr}_{sp}} \tag{10}
$$
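The two-state alternating renewal form used for the single disk (and, with different parameters, for the network switches) is simple enough to evaluate directly. The sketch below is illustrative only: the function names and the numeric inputs are hypothetical and are not the paper's parameter values, and the conversion to downtime assumes a 365-day year.

```python
def alternating_renewal_availability(mttf, mttr, mttrsp):
    """Steady-state availability of a two-state (UP/DOWN) component.

    mttf   -- mean time to failure
    mttr   -- mean time to repair or replace the component
    mttrsp -- mean service response time of the repair person
    All three arguments must use the same time unit.
    """
    return mttf / (mttf + mttr + mttrsp)

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year

def downtime_minutes_per_year(availability):
    """Expected yearly downtime implied by a steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

# Hypothetical inputs in hours, chosen only to exercise the formula.
a_disk = alternating_renewal_availability(mttf=200_000.0, mttr=0.5, mttrsp=4.0)
print(f"availability: {a_disk:.8f}")
print(f"downtime: {downtime_minutes_per_year(a_disk):.2f} minutes/year")
```

Because the service response time enters the denominator in the same way as the repair time, shortening either one has the same effect on the availability of a non-redundant component.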

For the second case, the Markov submodel captures the behavior of dual disk drives in a mirrored configuration (i.e., RAID 1) where one disk drive can fail and all data and programs are still accessible via the remaining disk drive. The RAID subsystem has six states (see Figure 5). Both disk drives are in operation in state 2. The MTTF of each disk drive is mttf_hdd. States 1 and 5 are UP states representing the blade server with one operational disk drive. When the RAID controller chip on the blade server detects that a drive has failed, the RAID subsystem enters state 1. The remaining drive supplies all data to the blade server. A repair person is summoned with a mean time to respond of mttrsp and the subsystem enters DOWN state 3 since the blade server must be removed from service to replace the drive. If the second drive fails before the arrival of the repair person, the RAID subsystem transitions from state 1 to DOWN state 0, with no remaining drives, where it remains until the repair person arrives and the repair is completed. In state 3, the disk drive is replaced with a mean time to repair of mttr_hdd and the subsystem enters UP state 5. Then, the data must be copied onto the new disk drive with a mean time to completion of mttcopy. If the new drive fails during the copy process, then the subsystem returns to state 3 and that drive is replaced a second time. From state 5, it is also possible for the disk drive holding the data to fail before the copy is completed. In that case, the subsystem enters the DOWN state 4. In both states 0 and 4, both disk drives are replaced with fresh preloaded drives with a mean time to repair of mttr_hdd_2. Let d(raid) represent the denominator; then, a closed-form expression for the steady-state availability of the disk subsystem utilizing RAID 1, A_RAID, is given in Equations (11) and (12):

$$
\begin{aligned}
d(\mathit{raid}) ={}& \mathit{mttf}_{hdd}^{2}\,(\mathit{mttf}_{hdd} + 3\,\mathit{mttcopy} + 2\,\mathit{mttr}_{hdd} + 3\,\mathit{mttr}_{sp}) \\
&+ \mathit{mttf}_{hdd}\,\bigl(2\,\mathit{mttr}_{sp}^{2} + 2\,\mathit{mttcopy}\,\mathit{mttr}_{hdd2} + 4\,\mathit{mttcopy}\,\mathit{mttr}_{hdd} + 3\,\mathit{mttcopy}\,\mathit{mttr}_{sp} + 2\,\mathit{mttr}_{hdd2}\,\mathit{mttr}_{sp}\bigr) \\
&+ 2\,\mathit{mttcopy}\,\mathit{mttr}_{sp}\,(\mathit{mttr}_{sp} + \mathit{mttr}_{hdd2})
\end{aligned} \tag{11}
$$

$$
A_{RAID} = \frac{\mathit{mttf}_{hdd}}{d(\mathit{raid})}\,\bigl(2\,\mathit{mttcopy}\,\mathit{mttf}_{hdd} + (\mathit{mttr}_{sp} + \mathit{mttf}_{hdd})\,(\mathit{mttf}_{hdd} + \mathit{mttcopy}) + 2\,(\mathit{mttr}_{sp}\,\mathit{mttf}_{hdd} + \mathit{mttcopy}\,\mathit{mttr}_{sp})\bigr) \tag{12}
$$

Blade server base hardware and network interface controller submodels

The blade server base hardware and network interface controller (NIC) hardware are each represented by a simple alternating renewal process model that has two states. Failures occur with an MTTF of mttfbase for the base hardware, and with mttfniceth for the Ethernet NIC or mttfnicfc for the Fiber Channel NIC. When down, the blade server hardware is replaced and service restored to the UP state in a time that is the sum of the mean time for service response (mttrsp) plus the mean time to repair of each of these elements of the blade server. Closed-form expressions for the availability of these submodels are given in Equations (13) through (15):

$$
A_{base} = \frac{\mathit{mttf}_{base}}{\mathit{mttf}_{base} + \mathit{mttr}_{base} + \mathit{mttr}_{sp}} \tag{13}
$$

$$
A_{ethp1} = A_{ethp2} = A_{ethp} = \frac{\mathit{mttf}_{niceth}}{\mathit{mttf}_{niceth} + \mathit{mttr}_{niceth} + \mathit{mttr}_{sp}} \tag{14}
$$

$$
A_{fcp1} = A_{fcp2} = A_{fcp} = \frac{\mathit{mttf}_{nicfc}}{\mathit{mttf}_{nicfc} + \mathit{mttr}_{nicfc} + \mathit{mttr}_{sp}} \tag{15}
$$

[Figure 5 (figure not reproduced): Markov model for the RAID 1 disk subsystem. State labels 0–6; arc labels: 2/mttf_hdd, 1/mttf_hdd (three times), 1/mttrsp (twice), 1/mttr_hdd, 1/mttr_hdd_2 (twice), 1/mttcopy.]

Blade server software submodel

The blade server software is represented by the Markov submodel shown in Figure 6. While the software images may be identical or very similar, especially in the case of clustered blades with hot-standby spares, the software environment and workload on each blade server is unique and not replicated elsewhere. The model has five states. State 0 is the UP state. Even without a hardware failure, blade server software can crash or hang and the blade server enters the DOWN state 1 until the operating system can perform a fast reboot and the middleware and applications can restart. A software failure is covered by a fast reboot with probability c1 (definitions of all variables used in this paper are found in the Appendix). If the failure is not covered by fast reboot, then the system moves to state 2, where a longer reboot and recovery actions such as consistency checks and automated data recovery are attempted. This step will be successful with a coverage factor of c2 and the software returns to state 0. Otherwise, the software enters state 3, where a repair person is required to restore corrupted data or to perform a software repair or other recovery action. The dashed transition from state 2 to state 4 is not used in the base model. It is used to account for an onsite repair person as part of the repair dependency discussion in the Extensions section. Conceptually, manual recovery will be required to deal with residual Bohrbugs (i.e., bugs that manifest themselves reliably under a well-defined but possibly unknown set of conditions), while the faster automated recovery will succeed if the failure is caused by a Mandelbug (i.e., a bug whose causes are so complex that its behavior appears chaotic).16 A closed-form expression for the steady-state availability of the software subsystem is given in Equation (16):

$$
A_{software} = \frac{\mathit{mttf}_{sw}}{\mathit{mttboot}_{1} + (1 - c_{1})\,\mathit{mttboot}_{2} + \mathit{mttf}_{sw} + (1 - c_{1} + c_{1}\,c_{2} - c_{2})\,(\mathit{mttr}_{sw} + \mathit{mttr}_{sp})} \tag{16}
$$

BLADECENTER AVAILABILITY MODEL

The BladeCenter availability model was created to represent a fully redundant multi-server chassis configuration using multiple network links (in this case Ethernet and Fiber Channel), such as might be available from any of several vendors. The submodels described in the previous section capture many of the details of the BladeCenter architecture and design that affect system uptime and availability when these platforms are used for customer solutions. By assuming independence, these submodels are now combined with basic events into a fault-tree model.

[Figure 6: Markov model of the software subsystem, states 0 to 4. Fixed arc rates: 1/mttfsw, c1/mttboot1, (1-c1)/mttboot1, c2/mttboot2, 1/mttrsp. Arc rates that depend on the model variant:

Transition rate             B                       B2                          Y
Base                        (1-c2)/mttboot2         n/a                         1/mttrsw
With repair dependencies    poff*(1-c2)/mttboot2    (1-poff)*(1-c2)/mttboot2    r_nu/mttrsw]

Top-level fault tree

The top-level fault tree for the BladeCenter availability model is shown in Figure 7. The BladeCenter system is DOWN if the top event, labeled System failure, is TRUE. Three types of gates are used in this fault tree. The output of an AND gate is TRUE (i.e., DOWN) if all of its inputs are TRUE. The output of an OR gate is TRUE if any of its inputs is TRUE. The output of a k-out-of-n (KOFN) gate is TRUE if k or more of the n inputs are TRUE. The inputs to the KOFN gate are sub-fault-tree models representing the blade servers. The KOFN gate in the fault tree allows evaluation of various combinations of required and hot-spare blades. Inputs 1–6 are used for instances of blade servers in power domain 1. Similarly, inputs 7–14 are used for instances of blade servers in power domain 2.
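As a rough illustration of the KOFN gate semantics described above, the following sketch computes the probability that k or more of n inputs are TRUE (DOWN). It assumes independent, identical inputs, which is a simplification and not the paper's model: the blade subtrees share switches through repeated events. The function name `kofn_down_probability` and the per-blade unavailability `q_blade` are hypothetical and serve only to show how the gate behaves as spares are added.

```python
from math import comb


def kofn_down_probability(k: int, n: int, q: float) -> float:
    """Probability that at least k of n independent inputs are DOWN.

    q is the probability that a single input is DOWN (TRUE).
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return sum(comb(n, j) * q**j * (1.0 - q) ** (n - j) for j in range(k, n + 1))


# Hypothetical per-blade unavailability, for illustration only.
q_blade = 5.0e-5

# All 14 blades required: the gate output is TRUE when 1 or more blades are DOWN.
p_no_spare = kofn_down_probability(1, 14, q_blade)
# 13 required, 1 hot spare: the gate output is TRUE when 2 or more blades are DOWN.
p_one_spare = kofn_down_probability(2, 14, q_blade)

print(f"no spare:  {p_no_spare:.3e}")
print(f"one spare: {p_one_spare:.3e}")
```

Under these assumptions, moving from zero spares to one spare changes the gate's failure probability by several orders of magnitude, which is the qualitative effect the KOFN gate is used to explore.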

[Figure 7: Top-level fault tree for the BladeCenter availability model. Labeled elements: System failure (top event), k/n gate, Midplane, Cooling, Power domain 1, Power domain 2, and, for the blade fault subtrees in power domains 1 and 2, Base, CPU, Memory, Disk, Software, FC port 1, FC port 2, Ethernet port 1, Ethernet port 2, FC switch 1, FC switch 2, Ethernet switch 1, Ethernet switch 2. Legend: basic event; defined submodel; repeated use of same submodel instance; instance of blade in power domain 1 fault subtree; instance of blade in power domain 2 fault subtree.]

Note that both of the Fiber Channel switches and both of the Ethernet switches are shared by all of the n blades, while the dual network interfaces of each type on each blade are matched up with their corresponding switches. An OR gate is used to pair each network switch with a port on the blade. Thus, if either the port on the blade or its switch goes DOWN, communication on that link is DOWN. If the blade cannot communicate through at least one port to each network type, then the blade is considered DOWN. Such interrelations cannot be captured by traditional reliability block diagrams [21], so a fault tree with the repeated events construct is used [18].

This fault tree is used to evaluate the solution-level availability that might be achieved by using the BladeCenter hardware platform in those solutions. Evaluating the top-level fault tree with k and n set to 1 provides the availability offered by any single blade in isolation installed in a slot in power domain 1. Note that each Markov submodel is simple enough to obtain a closed-form analytic solution. The SHARPE software package [29, 18] is used to produce a closed-form answer for the overall availability for the case of a single blade (that is, the case with n = 1) in the fault tree. The availability of a single blade with software and the entire common infrastructure is given by Equation (17).

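$$
\begin{aligned}
A_{\mathit{sbladewcommon}} = {} & A_{\mathit{base}} \times A_{\mathit{cool}} \times A_{\mathit{proc}} \times A_{\mathit{ethsw}} \times A_{\mathit{fcsw}} \\
& \times A_{\mathit{DISK}} \times A_{\mathit{memory}} \times A_{\mathit{midplane}} \times A_{\mathit{ethp}} \\
& \times (2 - A_{\mathit{ethsw}} \times A_{\mathit{ethp}}) \times A_{\mathit{fcp}} \\
& \times (2 - A_{\mathit{fcsw}} \times A_{\mathit{fcp}}) \times A_{\mathit{power}} \\
& \times A_{\mathit{software}}
\end{aligned} \tag{17}
$$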
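The structure of Equation (17) can be checked against the gate logic described earlier. The sketch below evaluates the product form and compares it with an exhaustive enumeration of component states for a single blade, where each network type needs at least one working (switch AND port) path. All availability values are hypothetical placeholders chosen to make the redundancy term visible; they are not the paper's inputs, and the dictionary keys are names introduced here.

```python
from itertools import product


def blade_availability(a: dict) -> float:
    """Evaluate the product form of Equation (17)."""
    eth = a["ethsw"] * a["ethp"]  # one Ethernet path: switch AND port both UP
    fc = a["fcsw"] * a["fcp"]     # one Fiber Channel path
    series = (a["base"] * a["cool"] * a["proc"] * a["disk"] * a["memory"]
              * a["midplane"] * a["power"] * a["software"])
    # Two identical, independent paths: 1 - (1 - p)^2 = p * (2 - p)
    return series * eth * (2.0 - eth) * fc * (2.0 - fc)


def blade_availability_enumerated(a: dict) -> float:
    """Sum the probabilities of all component-state combinations where the blade is UP."""
    singles = ["base", "cool", "proc", "disk", "memory", "midplane", "power", "software"]
    avail = {name: a[name] for name in singles}
    for dup in ("ethsw", "ethp", "fcsw", "fcp"):
        avail[dup + "1"] = avail[dup + "2"] = a[dup]
    names = list(avail)
    total = 0.0
    for states in product((True, False), repeat=len(names)):
        up = dict(zip(names, states))
        prob = 1.0
        for name in names:
            prob *= avail[name] if up[name] else 1.0 - avail[name]
        eth_ok = (up["ethsw1"] and up["ethp1"]) or (up["ethsw2"] and up["ethp2"])
        fc_ok = (up["fcsw1"] and up["fcp1"]) or (up["fcsw2"] and up["fcp2"])
        if all(up[s] for s in singles) and eth_ok and fc_ok:
            total += prob
    return total


# Hypothetical component availabilities, for illustration only.
a = {"base": 0.999, "cool": 0.9999, "proc": 0.9995, "disk": 0.998,
     "memory": 0.999, "midplane": 0.9999, "power": 0.9999, "software": 0.997,
     "ethsw": 0.99, "ethp": 0.995, "fcsw": 0.98, "fcp": 0.99}

closed = blade_availability(a)
brute = blade_availability_enumerated(a)
print(f"product form: {closed:.12f}")
print(f"enumeration:  {brute:.12f}")
```

The enumeration visits 2^16 states for one blade, which hints at why a closed form becomes impractical as more blades and shared components are added.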

SHARPE can produce the closed-form expression for the general case, but the expression is so long, and the storage it requires so large, that a numerical solution is necessary. Therefore, the overall model is solved numerically for larger values of n.

The numerical solution involves a separate steady-state solution of each of the Markov submodels. The results are then fed into the fault tree that, in turn, is solved numerically. For a discussion of numerical methods of Markov model solutions, see References 26 and 30; for fault-tree solution methods, see References 20, 31, and 32. All solutions, including composition of submodels, are automated via SHARPE [16, 18].
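To illustrate what a numerical steady-state solution of one submodel involves, the sketch below builds the generator matrix of the base software submodel (states 0 to 4, using the rates and parameter values given in this paper) and solves the balance equations with plain Gaussian elimination. This is a minimal stand-in written for this example, not SHARPE's algorithm; the helper name `steady_state` is hypothetical.

```python
def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 by Gaussian elimination with partial pivoting."""
    n = len(Q)
    # Row i of A is the balance equation for state i: sum_j pi[j] * Q[j][i] = 0.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    b = [0.0] * n
    # The balance equations are linearly dependent, so the last one is
    # replaced with the normalization condition sum(pi) = 1.
    A[-1] = [1.0] * n
    b[-1] = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - tail) / A[r][r]
    return pi


# Base-model parameters from the text (all times in hours).
mttf_sw, mttboot1, mttboot2 = 17520.0, 20 / 60, 45 / 60
mttr_sp, mttr_sw = 2.5, 2.0
c1, c2 = 0.85, 0.95

# Generator matrix for states 0..4 of the software submodel.
Q = [[0.0] * 5 for _ in range(5)]
Q[0][1] = 1 / mttf_sw          # software failure
Q[1][0] = c1 / mttboot1        # fast reboot succeeds
Q[1][2] = (1 - c1) / mttboot1  # fast reboot fails
Q[2][0] = c2 / mttboot2        # long reboot succeeds
Q[2][3] = (1 - c2) / mttboot2  # long reboot fails, wait for repair person
Q[3][4] = 1 / mttr_sp          # repair person arrives
Q[4][0] = 1 / mttr_sw          # software repair completes
for i in range(5):
    Q[i][i] = -sum(Q[i])       # diagonal makes each row sum to zero

pi = steady_state(Q)
closed_form = mttf_sw / (mttboot1 + (1 - c1) * mttboot2 + mttf_sw
                         + (1 - c1 + c1 * c2 - c2) * (mttr_sw + mttr_sp))
print(f"numerical A = {pi[0]:.10f}, closed form A = {closed_form:.10f}")
```

The probability of state 0 from the linear solve agrees with the closed-form expression of Equation (16) to within floating-point error.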

Hardware input parameters

A variety of methods may be used to predict the reliability of computer hardware. Industry standards include Telcordia** SR332 [33] and MIL-HDBK-217 [34]. While these have been used for many years, some issues may arise with their use. SR332 prior to Issue 2 uses a 90-percent confidence level on its statistics. While this is useful for worst-case analysis, it makes it difficult to predict the actual failure rate and to decide what corrections should be factored into the analysis. Another issue can be the age of these approaches and whether they have current data for the technologies used in the design. Some large organizations have the resources to work with supplier data, test data, and field data to such a degree that they can develop their own reliability data sets for the devices that they design into their products. Whatever method is used, it is important to account for all types of incidents that cause a particular component to cease operating within the design limits of the system. Replacements that, under a given set of operational conditions, result from material degradation and failure mechanisms at the microstructure level are broadly termed physics-of-failure-caused (POF-caused). In addition to POF, other causes of replacement include test escapes; induced damage from manufacturing, shipping, or installation; and improper interaction across the boundaries of field-replaceable units, among others. Even though this paper uses the expected rate of replacement, the terms replacement and failure will be used interchangeably.

The failure rate experienced in the field will seldom be constant, though for purposes of simplified mathematics it is common to treat it as such. To do so, it is necessary to determine which "average" to use in the model. Common choices are the average over the life of the product, over the warranty period, or over the first year, and the value at the point on the failure-rate curve that is reasonably close to constant.

The hardware input parameters used for this model are based on the IBM MTTF predictions for each replaceable hardware unit. These predictions are based on an industry-standard parts-count method adjusted using proprietary methods and field history to approximate the average field replacement rates under nominal conditions. These predictions are not constant values, but rather curves based on hours of operation. Typically, these curves follow a Weibull rather than an exponential distribution. In practice, the end of product design life should occur before the product enters the physical wear-out phase, so that phase need not be considered in the models. Experience has shown that the electronic subsystems in BladeCenter-type products have reached the constant failure-rate phase of the Weibull distribution by the end of the twelfth month of continuous operation (i.e., 8760 hours). That data point on the failure-rate curve for each subsystem is the basis for the MTTF values for the BladeCenter model, and an exponential distribution can be reasonably assumed for the period after the first year of operation. Data of this type is generally vendor-confidential, and the actual values are not available for publication. A range of typical MTTF values for the subassemblies used in this model is listed in Table 1. The assumed mean time to service response, mttrsp, is based on a service agreement that allows an average of 2.5 hours for a service person to arrive on-site for the repair.

Software input parameters

For hardware-only evaluations, the software submodel can be omitted. When the software submodel is included, it provides insight into the relative contribution of software to downtime, and it is used to evaluate the use of spare server blades to overcome software as a source of downtime. For this exercise, a value of two years (17,520 hours) is used for the MTTF of the software, mttfsw, to show the effects of frequent software interruptions on blade server downtime and availability. When a software failure occurs, the success of the quick reboot is based on a coverage factor, c1, of 0.85 with a mean boot time, mttboot1, of 20 minutes. A longer, more complex reboot of the software is attempted if the short boot is not successful. The success of the longer reboot is based on the coverage factor, c2, of 0.95 with a mean boot time, mttboot2, of 45 minutes. Any faults not recovered by the quick or the complex reboots require attention from a repair person, who responds with a mean time of mttrsp and completes the patch or repair of the software in a mean time, mttrsw, of 2 hours. In practice, the actual software parameters depend on both the operating system chosen and the software stack that is implemented. However, the MTTF value chosen is consistent with observations of the software aging processes that led to software rejuvenation as a means to avoid system downtime [12].
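The parameter values above can be plugged into the closed-form expression of Equation (16). The sketch below does so and converts the result to annual downtime; the function and variable names are introduced here for illustration. With mttboot1 taken as exactly 20 minutes, the result is roughly 14.4 minutes per year. If mttboot1 is rounded to 0.33 hours, the same function gives about 14.287 minutes per year, which agrees with the software downtime listed in Table 3; that rounding is an inference made here, not something the paper states.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def software_availability(mttf_sw, mttboot1, mttboot2, mttr_sw, mttr_sp, c1, c2):
    """Equation (16): steady-state availability of the software submodel (times in hours)."""
    uncovered = (1.0 - c1) * (1.0 - c2)  # equals 1 - c1 + c1*c2 - c2
    down_per_cycle = mttboot1 + (1.0 - c1) * mttboot2 + uncovered * (mttr_sw + mttr_sp)
    return mttf_sw / (mttf_sw + down_per_cycle)


# Values from the text: MTTF 17,520 h, boots of 20 and 45 minutes,
# repair 2 h, service response 2.5 h, coverage factors 0.85 and 0.95.
a_sw = software_availability(17520.0, 20 / 60, 45 / 60, 2.0, 2.5, 0.85, 0.95)
downtime_min = (1.0 - a_sw) * MINUTES_PER_YEAR

print(f"software availability: {a_sw:.7%}")
print(f"annual downtime:       {downtime_min:.3f} min")
```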

Model results

Utilizing the low and high MTTF values found in Table 1 as well as the predicted actual MTTF for each field-replaceable unit (FRU), the availability and expected annual downtime for a single blade are examined. In Table 2, the shared hardware, composed of the midplane, the power supplies in power domain 1, and the blowers, is examined. For all three MTTF values, the shared hardware is responsible for less than one minute of annual downtime. This is well below the 5-minutes-per-year limit for a five-nines (i.e., >99.999 percent) availability solution. In addition, row 2 of Table 2 shows that a blade server plus network switches can be expected to have greater than 5 minutes but less than 50 minutes of annual downtime. Thus, the blade server hardware is a four-nines availability solution. Finally, the combined values for blade server, network switches, and shared hardware are shown in row 3, and the total blade server solution is also expected to provide availability greater than 99.99 percent.

In Table 3, the downtime contributions of the hardware components are itemized, along with the contribution from the software. Here, software MTTF is identical for all three cases of hardware MTTF considered. The portion of the hardware and software downtime contributed by service response time was also determined. For predicted hardware MTTF values, service response represents about 35 percent of total downtime and is a significant availability bottleneck that can only be addressed with changes to the service-delivery process. The nonredundant electronics on the blade server (base + NICs + network switches) and the disk drives are the largest contributors of hardware-related downtime. While high-reliability disk drives might appear to be the answer to decreased downtime, high MTTF values for drives may be more a matter of operational and environmental conditions than of any really significant differences in the hardware. So, other options should be considered.
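The relationship between availability, annual downtime, and the "nines" classification used above can be sketched as follows. The helper names are hypothetical, and the two availability values are the predicted-MTTF entries from Table 2; small differences from the tabulated downtimes come from rounding of the published availabilities.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def annual_downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per year for a steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def count_nines(availability: float) -> int:
    """Number of leading nines, e.g. 0.99995 -> 4. Requires availability < 1."""
    if not 0.0 <= availability < 1.0:
        raise ValueError("availability must be in [0, 1)")
    # The small epsilon guards against rounding at exact boundaries such as 0.99999.
    return math.floor(-math.log10(1.0 - availability) + 1e-9)


five_nines_limit = annual_downtime_minutes(0.99999)
print(f"five-nines downtime limit: {five_nines_limit:.3f} min/yr")

for label, a in [("shared hardware", 0.999998512),
                 ("single blade + shared + network switches", 0.999946443)]:
    print(f"{label}: {annual_downtime_minutes(a):.4f} min/yr, {count_nines(a)} nines")
```

The exact five-nines limit is 5.256 minutes per year, which the text rounds to 5 minutes.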
The downtime associated with the disk drive can be virtually eliminated by using a RAID 1 configuration if a scheduled maintenance window can be used for disk replacement. To quantify the downtime when no maintenance window is available, further evaluation was done utilizing the predicted MTTF for the disk drives and the RAID submodel shown in Figure 5. The results revealed that the disk-drive downtime can be reduced to 1.75 minutes/year when the two disk drives are configured for disk mirroring (RAID 1) and disk drives are replaced as the failures occur. In this case, the service person responds while the blade server is still operational using the single remaining disk. Another option is to run the mirrored array until both disks fail. The outage time is longer in that scenario due to the extra downtime resulting from the repair delay while waiting for a service person to arrive after the second disk fails, and the average downtime contribution roughly doubles to 3.6 minutes/year.

Table 1 Typical reliability values (MTBF in hours)

Field replaceable unit (FRU)    Low MTTF     High MTTF
Fiber Channel switch            320,000      440,000
Power supply                    670,000      910,000
Blower                          3,100,000    4,200,000
Midplane                        310,000      420,000
Ethernet daughter card          6,200,000    8,400,000
Fiber Channel daughter card     1,300,000    1,800,000
Hard disk drive                 200,000      350,000
2GB memory bank (2 DIMMs)       480,000      660,000
CPU                             2,500,000    3,400,000
Base blade                      220,000      300,000
Ethernet switch                 120,000      160,000

Spare blades are often considered as a means to achieve high availability. For the configuration and operational assumptions modeled, Table 4 gives the probability of exactly n operational out of 14 total blade servers. Either 14 or 13 operational blades is the most likely scenario, due to 0 or 1 blade server failures. There is only a very small probability of more than 2 simultaneous blade server failures; hence, BladeCenter availability does not improve with more than 2 spare blade servers.

Table 2 Availability and downtime for hardware and software

Configuration                               Low MTTF                    Predicted MTTF              High MTTF
                                            Downtime    Availability    Downtime    Availability    Downtime    Availability
                                            (min/yr)                    (min/yr)                    (min/yr)
Shared hardware (excl. switches)            0.9093      99.9998270%     0.7822      99.9998512%     0.6745      99.9998717%
Single blade + network switches             31.5025     99.9940064%     27.3672     99.9947931%     23.2224     99.9955817%
Single blade + shared + network switches    32.4118     99.9936134%     28.1494     99.9946443%     23.8969     99.9954534%

Table 3 Component contributions to single blade downtime (minutes/year)

Hardware component                      Low hardware MTTF    Predicted hardware MTTF    High hardware MTTF
Blade CPU                               0.34690              0.29700                    0.25507
Blade memory                            1.81770              1.52985                    1.32197
Blade disk drive                        7.88400              5.24879                    2.10240
Blade base + NICs + network switches    7.16677              6.00447                    5.25585
Chassis midplane                        0.86087              0.74105                    0.63885
Chassis power subsystems                0.04839              0.04116                    0.03563
Chassis cooling                         0.00000              0.00000                    0.00000
Total hardware component downtime       18.12463             13.86232                   9.60977
Software downtime                       14.28711             14.28711                   14.28711
Total downtime                          32.41174             28.14943                   23.89688
Minutes of total downtime attributed
to service response time

Additional calculations were performed to predict downtime based on the minimum number of blade servers required for the entire BladeCenter to remain in an operational (UP) state. Using the fault-tree model in SHARPE and the predicted actual MTTF values, Table 4 shows total downtime for all blades based on the number of required blade servers in column 1 and the number of hot-standby spares in column 2. From these results, it can be seen that one hot-standby blade offers a significant improvement in availability of the blade servers in the chassis. A second hot spare offers a further improvement that might be useful for extremely high-availability applications. Utilizing more than two blades as hot-standby spares provides negligible further return on investment in terms of downtime reduction. It is also worth noting that the total downtime for a 14-blade server chassis is slightly less than 14 times the predicted downtime for a single-blade server. This result reflects the effect of multiple, simultaneous blade server failures. Table 4 also shows a small probability of exactly 6 operational blade servers. The most likely cause of this condition would be a power domain 2 failure. In that case, all spare blade servers must reside in power domain 2 so that there are still 6 operable blades under this fault condition.

It is common to utilize models to examine the sensitivity of systems and solutions to various parameters. One such parameter for the BladeCenter model is the common mode factor in the midplane submodel (see Figure 2). The common mode factor might be a major concern when mission-critical solutions depend on the BladeCenter system. The results presented here utilize a common mode factor based on analysis of the design. The expected contribution of the midplane to the total downtime of a blade is 0.74 minutes/year. As shown in Figure 8, overall blade downtime due to midplane outages remains nearly constant as the common mode factor is increased by multiples of up to 10 times the initial value, indicating that uptime is not sensitive to this factor.

Table 4 Probabilities of exactly n operational blades (including software) in a chassis with 14 blades

Number of required    Number of hot standby    Availability    Probability of exactly n     Downtime
blade servers, n      blade servers                            operational blade servers    (min/yr)
14                    0                        99.9269727%     0.999269727                  383.83
13                    1                        99.9998186%     0.000728459                  0.95
12                    2                        99.9998433%     0.000016573                  0.82
11–7                  3–7                      99.9998433%     0                            0.82
6                     8                        99.9998511%     0.000000078                  0.78
5–1                   9–13                     99.9998511%     0                            0.78

[Figure 8: Sensitivity of downtime for a BladeCenter chassis with 14 blades. Vertical axis: minutes/year of blade downtime (0 to 180). Horizontal axis: parameter multiplier (1 to 10). Series: midplane common mode faults; mean time to service response; mean time to software failure.]

Another important parameter is the mean time to service response, the delay between a failure and the arrival of a repair person. This factor is found in all the submodels, and it can significantly increase the duration of a BladeCenter outage by prolonging the service-delivery process. Service response is one factor that can be controlled. Options such as on-site maintenance and quick-response service level agreements are available. In the results presented here, a mean time to service response of 2.5 hours is assumed. However, other (mostly longer) times could have been used, and actual experiences may, in fact, be longer due to an imperfect service-delivery model. In Figure 8, the mean time to service response is multiplied by factors ranging from 1 to 10 to determine the sensitivity of downtime to this parameter. Downtime increases dramatically as the mean time to service response is scaled upward, indicating that blade downtime is highly sensitive to this parameter.

Finally, the sensitivity of overall BladeCenter downtime to the software failure rate is considered. Initially, the mean time to software failure was estimated at 2 years. As the software failure rate is multiplied by factors of 1 to 10, BladeCenter downtime increases even more dramatically than with increased service response time, indicating that blade downtime is even more sensitive to changes in this parameter.
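The kind of parameter sweep described above can be sketched for the software submodel alone, using Equation (16). This is deliberately narrower than Figure 8: it covers one blade's software rather than a 14-blade chassis, and it ignores the service-response delay in the hardware submodels, so it understates the effect of that parameter. The function name and default arguments are introduced here for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def software_downtime(mttf_sw=17520.0, mttboot1=20 / 60, mttboot2=45 / 60,
                      mttr_sw=2.0, mttr_sp=2.5, c1=0.85, c2=0.95):
    """Annual downtime (minutes) of the software submodel, from Equation (16)."""
    down = mttboot1 + (1 - c1) * mttboot2 + (1 - c1) * (1 - c2) * (mttr_sw + mttr_sp)
    return MINUTES_PER_YEAR * down / (mttf_sw + down)


multipliers = range(1, 11)
# Scale the mean time to service response by 1..10.
by_response = [software_downtime(mttr_sp=2.5 * m) for m in multipliers]
# Scale the software failure rate by 1..10, i.e. divide the MTTF by the multiplier.
by_failure_rate = [software_downtime(mttf_sw=17520.0 / m) for m in multipliers]

for m, d_sp, d_sw in zip(multipliers, by_response, by_failure_rate):
    print(f"x{m:<2d}  service response: {d_sp:7.2f}   software failure rate: {d_sw:7.2f}")
```

Even in this reduced setting, downtime grows roughly in proportion to the software failure rate, while the service-response multiplier affects only the small fraction of failures that need a repair person.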

EXTENSIONS

There are several underlying assumptions in this model that warrant further consideration. First, the model assumes a perfect repair process (i.e., that repair attempts are always successful). In practice, faulty spare parts, incorrect fault diagnosis, and lack of training or skill on the part of the repair person can result in imperfect repair. Faulty repair may also result when replacement parts are not readily available, although on-site spare parts may alleviate that concern.

Second, the model assumes that a service person begins the hardware repair after an average service response time of 2.5 hours. Depending on user requirements and system configuration, longer repair times might be acceptable while still maintaining the required system uptime. But if timely repair is required, on-site personnel or a service contract with response-time guarantees might be needed.

Third, the model assumes that the BladeCenter system is operating continuously. Incorporating periods of scheduled downtime for maintenance may reduce unscheduled customer outage time. In particular, midplane and disk drive replacements (RAID only) could be performed during scheduled outages. The models can be extended to further explore these possibilities, and to ensure the highest possible uptime [35]. In addition, this type of model and analysis can be applied to other multi-server chassis configurations from any vendor, or even to other multiple-component devices, such as disk or tape drive clusters, and other similar systems.

Fourth, the model assumes a pool of repair persons, so that one or more additional repair persons can be dispatched in the event of multiple failures. The model can be adjusted to represent a single, shared repair person, which is more practical for a small data center, by adding the following dependencies: 1) the likelihood that the repair person is already on-site; 2) the likelihood that the repair person is already busy when a second failure occurs; and 3) repair prioritization when there are multiple pending repairs. For example, a blade server could suffer a software failure after a memory fault, but replacement of the faulty memory might be prioritized ahead of software service. In practical terms, the shared repair person brings some efficiency to the process, because the response time is typically longer than the repair time. Even waiting for an on-site repair person is often quicker than bringing a new repair person on-site, but any advantages disappear if the inventory of equipment and the frequency of repairs increase.

These dependencies could be modeled by first constructing a stochastic Petri net [7, 12] representation of the model, then automatically converting and solving the underlying Markov model with packages such as SHARPE [16, 18], UltraSAN [19], or SPNP [1]. The underlying Markov model would likely have an extremely large state space. To reduce the size of this state space, state truncation can be applied at the net level [7], and truncation error bounds could be computed [36, 37]. Alternatively, fixed-point iteration methods [25] can be used to avoid a large state space. In any case, the effect of such repair dependence is expected to be minimal, allowing independence to be reasonably assumed.

This point is demonstrated by extending the software Markov model to incorporate the effect of shared repair. The expanded software submodel is depicted in Figure 6 by using the transition rates for the expanded model. Here, the possibility that the repair person is already on-site to repair another subsystem is taken into account by including the dashed transition from state 2 to state 4. The probability that the repair person is not on-site, poff, [...] other submodels where the repair person is not on-site.

The software submodel is further enhanced to account for the cases when the repair person is already engaged in a repair action. This repair action could be on any subsystem of the chassis or on any blade except the blade with the software failure. Additionally, the repair person is considered to be engaged if there exists a pending, higher-priority repair on the blade that encountered the software failure. Specifically, a blade could suffer a processor or memory hardware failure from which it has recovered in degraded mode. When a repair person arrives, the pending hardware repairs are given priority and repaired first. To account for the busy repair person, the repair rate out of state 4 of the software submodel is multiplied by a factor, r_nu, the probability that the repair person is not busy. To obtain r_nu, each submodel is evaluated to determine the probability that the submodel is not in a state with an active or pending higher-priority repair. Then, r_nu is iteratively computed as the product of these probabilities for each of the other submodels. The effect of this dependence on the software submodel is found to be negligible, as shown in Table 5. Under the current set of assumptions for the model parameters, the effect of these two repair dependencies is too small to affect the model results. In fact, accounting for poff improves software availability by sometimes eliminating the service-response delay, while r_nu accounts for the wait for an on-site repair person to complete another repair and slows the mean rate at which software repairs are completed. Also note that poff and r_nu converge here after the first iteration. That occurs because these dependencies are only factored into the software submodel. Recall, however, that most of the BladeCenter submodels represent redundant, hot-pluggable subsystems, so that there is no system downtime due to these dependencies in those cases. Finally, note that the nonredundant subsystems are primarily contained in the blade, and software is no longer running if one of those has previously failed. Had these factors been necessary in other submodels, convergence would require additional iterations.
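The size of this effect can be estimated with a small sketch of the extended software submodel. The expression below is derived here from the Figure 6 transition rates by weighting mean sojourn times with visit probabilities; it is not a formula given in the paper. With repair dependencies, state 3 (waiting for the repair person) is entered with probability poff after an uncovered failure, and the mean time in state 4 (software repair) is stretched by 1/r_nu. The poff and r_nu values are those reported in Table 5; only the difference between the two cases is of interest here.

```python
def software_availability(poff=1.0, r_nu=1.0, mttf_sw=17520.0, mttboot1=20 / 60,
                          mttboot2=45 / 60, mttr_sw=2.0, mttr_sp=2.5,
                          c1=0.85, c2=0.95):
    """Availability of the software submodel with optional repair dependencies.

    poff = 1 and r_nu = 1 reproduce the base model of Equation (16).
    """
    uncovered = (1 - c1) * (1 - c2)           # both reboots fail
    wait = uncovered * poff * mttr_sp         # state 3: only if the repair person is off-site
    repair = uncovered * mttr_sw / r_nu       # state 4: slowed when the repair person is busy
    down = mttboot1 + (1 - c1) * mttboot2 + wait + repair
    return mttf_sw / (mttf_sw + down)


base = software_availability()
dependent = software_availability(poff=0.999996, r_nu=0.999997)

print(f"base:      {base:.12f}")
print(f"dependent: {dependent:.12f}")
print(f"change:    {dependent - base:.3e}")
```

With these inputs the change in availability is many orders of magnitude below the precision reported in the tables, consistent with the conclusion that the dependencies can be neglected.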

While the effect of non-exponentially distributed failure times can be significant, the effect of the deviation of all other distributions from the exponential is often found to be insignificant. Furthermore, the times to failure and the times to repair of a subsystem without redundancy can be general, as the corresponding availability model is the alternating renewal process [26]. The justification for using the exponential distribution for the failure times is the lack of adequate data on the actual distributions. If distribution information is available and one or more distributions are in fact found to be non-exponential, methods are available to enhance the models to take these into account [23, 26, 38] without any significant increase in the complexity of the overall hierarchical modeling technique. While the full treatment of non-exponential distributions is beyond the scope of this paper, the general idea is provided below.

Table 5 Fixed-point iteration for software submodel

                         Probability repair       Probability repair person    Software
                         person offsite (poff)    not utilized (r_nu)          availability
Base case                1.000000                 1.000000                     99.9941111%
After first iteration    0.999996                 0.999997                     99.9941111%

There are two types of states in the software submodel: states 0, 3, and 4 have only one outgoing arc, while states 1 and 2 have multiple outgoing arcs. For the former type of state, it is well known from the theory of semi-Markov processes [23] that steady-state probabilities depend only on the mean sojourn time in those states; hence, the distributions can be general. For the states with multiple outgoing arcs, branching occurs probabilistically after the sojourn is completed. So once again the sojourn time distribution can be general, since the steady-state probabilities depend only on the mean sojourn time (see example 8.27 from Reference 26). The situation is different in several other submodels. For instance, consider state 2 of the memory submodel of Figure 4, where the only outgoing arc, in reality, is based on a competition between times to failure of the two memory modules. If these times to failure were non-exponentially distributed, the resulting semi-Markov process would not have the insensitivity property, and its steady-state probabilities would indeed depend on the nature of the distribution. If the chosen distribution is Weibull, the recommended method is to fit the Weibull distribution to a phase-type distribution and hence expand the Markov chain to incorporate these phases. For further details, see Reference 38.

In this paper, it was necessary to resort to numerical solution of the fault-tree and Markov models to produce the desired results. It is possible to solve the models in closed form, as shown for all Markov submodels in this paper. Such closed-form solutions can potentially provide more insight, and can be used to compute derivatives of measures of interest and hence find bottlenecks. However, closed-form solution even of Markov model steady-state behavior is compute-intensive and can stress tools such as Wolfram Mathematica** [39]. Fault trees can be solved in closed form using SHARPE. In any case, when the models are large, even if a closed-form solution is possible, its size may preclude all advantages.
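As a worked illustration of the insensitivity argument for the software submodel (base rates, without the dashed arc), the steady-state probabilities can be written using only mean sojourn times. The symbols $v_i$ (expected number of visits to state $i$ per visit to state 0) and $h_i$ (mean sojourn time in state $i$) are notation introduced here, not taken from the paper; the first line is the standard steady-state formula for a semi-Markov process.

$$
\pi_i = \frac{v_i h_i}{\sum_{j=0}^{4} v_j h_j}, \qquad
v_0 = v_1 = 1, \quad v_2 = 1 - c_1, \quad v_3 = v_4 = (1 - c_1)(1 - c_2)
$$

$$
h_0 = \mathit{mttf}_{sw}, \quad h_1 = \mathit{mttboot}_1, \quad h_2 = \mathit{mttboot}_2, \quad h_3 = \mathit{mttr}_{sp}, \quad h_4 = \mathit{mttr}_{sw}
$$

$$
\pi_0 = \frac{\mathit{mttf}_{sw}}{\mathit{mttf}_{sw} + \mathit{mttboot}_1 + (1 - c_1)\,\mathit{mttboot}_2 + (1 - c_1)(1 - c_2)\,(\mathit{mttr}_{sp} + \mathit{mttr}_{sw})}
$$

Since $(1 - c_1)(1 - c_2) = 1 - c_1 + c_1 c_2 - c_2$, this agrees with Equation (16). Only the means $h_i$ appear, so the sojourn time distributions in this submodel need not be exponential for the steady-state availability to hold.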

CONCLUSIONS

In this paper, submodels for IBM BladeCenter chassis components and blade servers were first developed using Markov chains. Then, these submodels were used to create a hierarchical fault-tree model for the blade server system. This model was used to determine the downtime and availability of these systems, which were viewed as representatives of a class of high-availability products available in the marketplace. The results shown here were based on statistical mean failure rates and a number of underlying assumptions. Actual performance of a particular BladeCenter system may, of course, vary from these results.

Five major bottlenecks for availability were identified.

1. Software failures are responsible for about one-half of the downtime minutes in these models. Additionally, blade server availability is highly sensitive to software reliability; therefore, poorly behaved software may significantly reduce uptime. Uptime can be increased through improved system design, by selection of highly reliable software, by using software replicas, and by integrating and testing software before deployment.

2. Service response time is distributed across all the submodels in this analysis and accounts for more than one-third of the downtime minutes, and availability is also very sensitive to this parameter. Improving service response would also reduce the downtime attributed to software. Service response time is largely controlled by the service-delivery process and could be mostly eliminated with on-site repair persons.

3. Disk storage is the third-leading cause of downtime. Switching to a RAID 1 array rather than individual disk drives eliminates more than one-half of these downtime minutes. Solid-state drives may be a more reliable alternative to the conventional disk drives modeled. Though not evaluated here, other alternatives are available, such as SAN solutions that could further reduce downtime by eliminating dependency on these devices within the blade servers.

4. The blade server base hardware and associated network ports and switches are the causes of most of the remaining downtime minutes. Since there are dual, redundant network ports and paths, the base blade hardware is actually the primary cause in this category. Base blade hardware could be made redundant by clustering blades and using hot-standby blade servers. With this solution, the downtime attributable to blade memory and CPU would also be virtually eliminated.

5. The only other significant availability bottleneck is the single point of repair created by the fail-in-place, redundant midplane of the chassis. This can be eliminated by clustering blade servers across two or more chassis units. Additionally, this approach has the potential to eliminate most of the downtime associated with any other shared hardware devices in the chassis, primarily because the clusters will remain UP, even if an uncovered power domain or other failure were to occur.

This analysis shows that:

- BladeCenter-type chassis designs generate less [...] achieving five-9s availability with a single blade server is possible;

- Current modular blade server designs deliver nearly five-9s hardware availability; and

- A fully populated chassis with two or more server blades in hot-standby mode can deliver five-9s BladeCenter availability, even with relatively poor software reliability.

Although this model is based on the IBM BladeCenter design, similar results could be derived for any similar product, with the outcome, of course, depending on component reliability, built-in redundancy, and chassis architecture.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of Intel Corporation, Wolfram Research, Inc., or Telcordia Technologies, Inc., in the United States, other countries, or both.

CITED REFERENCES

1. A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, ‘‘Basic Concepts and Taxonomy of Dependable and Secure Computing,’’IEEE Transactions on Dependable

and Secure Computing1, No. 1, 11–33, 2004.

2. J. Gray and D. P. Siewiorek,‘‘High-Availability Computer Systems,’’Computer24, No. 9, 39–48, 1991.

3. E. Marcus and H. Stern,Blueprints for High Availability: Designing Resilient Distributed Systems, John Wiley and Sons, New York (2003).

4. G. Pfister,In Search of Clusters, Prentice Hall, Upper Saddle River, NJ (1997).

5. S. W. Hunter and W. E. Smith,‘‘Availability Modeling and Analysis of a Two Node Cluster,’’Proceedings of the 5th International Conference on Information Systems Analysis and Synthesis (ISAS ’99), Orlando, FL, 1999. 6. R. McDougall,‘‘Availability - What It Means, Why It’s

Important, and How to Improve It,’’ Sun Microsystems, Inc. (October 1999), http://www.sun.com/solutions/ blueprints/1099/availability.pdf.

7. J. K. Muppala, A. Sathaye, R. Howe, and K. S. Trivedi, ‘‘Dependability Modeling of a Heterogeneous VaxCluster System Using Stochastic Reward Nets,’’Hardware and Software Fault Tolerance in Parallel Computing Systems, D. R. Avresky, Ed., 33–59, Ellis Horwood, New York (1992).

8. D. M. Nicol, W. H. Sanders, and K. S. Trivedi, “Model-based Evaluation: From Dependability to Security,” IEEE Transactions on Dependable and Secure Computing 1, No. 1, 48–65 (2004).

9. D. Tang, J. Zhu, and R. Andrada, “Automatic Generation of Availability Models in RAScad,” Proceedings of the International Conference on Dependable Systems and Networks (DSN’02), Bethesda, MD, pp. 488–492 (2002).

10. D. Tang, M. Hecht, J. Miller, and J. Handal, “MEADEP: A Dependability Evaluation Tool for Engineers,” IEEE Transactions on Reliability 47, No. 4, 443–450 (1998).

11. K. Trivedi, R. Vasireddy, D. Trindade, S. Nathan, and R. Castro, “Modeling High Availability Systems,” Proceedings of the 12th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC’06), University of California, Riverside, CA, pp. 154–164 (2006).

12. K. Vaidyanathan, R. E. Harper, S. W. Hunter, and K. S. Trivedi, “Analysis and Implementation of Software Rejuvenation in Cluster Systems,” Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’01), Cambridge, MA, pp. 62–71 (2001).

13. H. Weber, “A New Reliability and Availability Strategy for Communications Servers in Next Generation Networks,” White paper, Motorola, Inc. (July 2005).

14. G. Ciardo, A. Blakemore, P. F. Chimento, Jr., J. K. Muppala, and K. S. Trivedi, “Automated Generation and Analysis of Markov Reward Models Using Stochastic Reward Nets,” Linear Algebra, Markov Chains, and Queueing Models, Carl Meyer and R. J. Plemmons, Eds., IMA Volumes in Mathematics and its Applications 48, pp. 145–191, Springer-Verlag, Heidelberg (1993).

15. D. D. Deavours, G. Clark, T. Courtney, D. Daly, S. Derisavi, J. M. Doyle, W. H. Sanders, and P. G. Webster, “The Möbius Framework and Its Implementation,” IEEE Transactions on Software Engineering 28, No. 10, 956–969 (2002).

16. M. Grottke and K. S. Trivedi, “Fighting Bugs: Remove, Retry, Replicate, and Rejuvenate,” Computer 40, No. 2, 107–109 (2007).

17. Relex Markov: Markov Analysis Software, Relex Software Corporation, http://www.relexsoftware.com/products/markov.asp.

18. R. A. Sahner, K. S. Trivedi, and A. Puliafito, Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer Academic Publishers, Boston (1995).

19. W. H. Sanders, W. D. Obal II, M. A. Qureshi, and F. K. Widjanarko, “The UltraSAN Modeling Environment,” Performance Evaluation 24, No. 1–2, 89–115 (1995).

20. A. Rauzy, “New Algorithms for Fault Trees Analysis,” Reliability Engineering & System Safety 40, No. 3, 203–211 (1993).

21. M. Malhotra and K. S. Trivedi, “Power-hierarchy of Dependability-Model Types,” IEEE Transactions on Reliability 43, No. 3, 493–502 (1994).

22. J. K. Muppala, M. Malhotra, and K. S. Trivedi, “Markov Dependability Models of Complex Systems: Analysis Techniques,” in Reliability and Maintenance of Complex Systems, S. Ozekici, Ed., Springer, Berlin, Germany, 442–486 (1996).

23. V. Kulkarni, Modeling and Analysis of Stochastic Systems, Chapman-Hall, New York (1995).

24. D. Tang and K. S. Trivedi, “Hierarchical Computation of Interval Availability and Related Metrics,” Proceedings of the IEEE International Conference on Dependable Systems and Networks (DSN’04), Florence, Italy, pp. 693–700 (2004).

25. L. A. Tomek and K. S. Trivedi, “Fixed-Point Iteration in Availability Modeling,” Proceedings of the 5th International GI/ITG/GMA Conference on Fault-Tolerant Computing Systems, Tests, Diagnosis, Fault Treatment, Springer-Verlag, London, pp. 229–240 (1991).


26. K. S. Trivedi, Probability and Statistics with Reliability, Queueing and Computer Science Applications, Second Edition, John Wiley, New York (2001).

27. R. Credle, D. Brown, L. Davis, D. Robertson, T. Ternau, and D. Green, The Cutting Edge: IBM eServer BladeCenter, IBM Redpaper REDP-3581-01, IBM Corporation (November 24, 2003).

28. J. Gray, “Why Do Computers Stop and What Can be Done about it?” Proceedings of the Fifth Symposium on Reliability in Distributed Software and Database Systems, IEEE, New York, pp. 3–12 (1986).

29. C. Hirel, R. A. Sahner, X. Zang, and K. S. Trivedi, “Reliability and Performability Modeling Using SHARPE 2000,” Proceedings of the 11th International Conference on Computer Performance (TOOLS2000), pp. 345–349 (2000).

30. W. J. Stewart, Introduction to the Numerical Solution of Markov Chains, Princeton University Press, Princeton (1994).

31. S. Rai, M. Veeraraghavan, and K. S. Trivedi, “A Survey on Efficient Computation of Reliability Using Disjoint Products Approach,” Networks 25, No. 3, 147–163 (1995).

32. X. Zang, H. Sun, and K. S. Trivedi, “A BDD-based Algorithm for Reliability Analysis of Phased-mission Systems,” IEEE Transactions on Reliability 48, No. 1, 50–60 (1999).

33. Reliability Prediction Procedure for Electronic Equipment, Document SR-332, Issue 2, Telcordia Technologies, Inc., September 2006.

34. MIL-HDBK-217 Revision F Reliability Prediction of Electronic Equipment, MIL-Standards.com, http://store.mil-standards.com/index.asp?PageAction=VIEWPROD&ProdID=13.

35. Y. Cao, H. Sun, K. S. Trivedi, and J. J. Han, “System Availability with Non-exponentially Distributed Outages,” IEEE Transactions on Reliability 51, No. 2, 193–198 (2002).

36. S. Mahévas and G. Rubino, “Bound Computation of Dependability and Performance Measures,” IEEE Transactions on Computers 50, No. 5, 399–413 (2001).

37. R. R. Muntz, E. De Souza e Silva, and A. Goyal, “Bounding Availability of Repairable Systems,” IEEE Transactions on Computers 38, No. 12, 1714–1723 (1989).

38. D. Wang, R. M. Fricks, and K. S. Trivedi, “Dealing with Non-Exponential Distributions in Dependability Models,” Symposium on Performance Evaluation - Stories and Perspectives, G. Kotsis, Ed., Oesterreichische Computer Gesellschaft, pp. 273–302 (2003).

39. Wolfram Mathematica 6, Wolfram Research, Inc., http://www.wolfram.com/products/mathematica/index.html.

Accepted for publication , .

APPENDIX

Table 6 List of terms and acronyms

Term Definition

c1 Fast reboot coverage factor for software submodel

c2 Extended reboot coverage factor for software submodel

cpt Probability that a CPU fault is transient

cpsub Power domain submodel fault coverage factor

CPU Blade server processor

DISK Fault tree label of basic event for non-RAID disk drives

FC Fiber Channel

FDD Floppy disk drive

fm Probability of midplane common mode fault

mttboot1 Mean time for fast software reboot

mttboot2 Mean time for extended software reboot

mttfc Mean time to failure for blower

mttcopy Mean time to copy data to mirrored RAID replacement disk drive

mttfcpu Mean time to failure for CPU

mttf_hdd Mean time to failure for hard disk drive

mttfmem Mean time to failure for memory

mttfmp Mean time to failure for midplane

mttfps Mean time to failure for power supply

mttrc Mean time to repair blower

mttr2c Mean time to repair two blowers

mttrcpu Mean time to repair CPU

mttr_hdd Mean time to repair hard disk drive

mttr_hdd_2 Mean time to repair two hard disk drives

mttrmem Mean time to repair memory

mttrmp Mean time to repair midplane

mttrps1 Mean time to repair one power supply

mttrps2 Mean time to repair two power supplies

mttrsp Mean time to respond by service person

mttrsw Mean time to repair software

poff Probability service person is off-site