13. Hypothetical Computer System Example
13.2 Fault Tree Quantification
There are two kinds of failure parameters needed to quantify the basic events for the HECS system: failure rates or probabilities and coverage* parameters. Consider each subsystem in turn.
Failure rates and probabilities
Processing system
The failure rate for each of the processors (when active) is assumed to be a fairly typical value for such systems, λP = 10⁻⁴ per hour. Because the spare is cold, and thus is assumed not to fail before it is used to replace a failed processor, the dormancy factor* is zero.
Memory system
The failure rate for each of the memory units is assumed to be λM = 6×10⁻⁵ per hour, while that of the memory interface units is assumed to be λMIU = 5×10⁻⁵ per hour.
[Figure: fault tree with top event "HECS Failure" and subtrees for processing system failure (A1, A2, cold spare), bus system failure, memory system failure (3/5 gate over M1 to M5, FDEP gates, MIU1, MIU2), and application/interface failure (HW, SW, operator).]
Figure 13-7. Fault Tree for HECS
Bus system
The failure rate for each individual bus is assumed to be λB = 10⁻⁶ per hour.
Application/Interface system
The failure rate for the GUI HW is assumed to be λHW = 5×10⁻⁵ per hour. The two remaining basic events are quantified in terms of failure probabilities rather than rates. The assumed probability of failure for the operator is PO = 0.001, which means that, on average, the operator is assumed to be 99.9% reliable. A similar quantification for the application software results in a software reliability of 97%. Thus the failure probability for the application software is assumed to be PSW = 0.03.
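The basic events above mix failure rates (per hour) with fixed failure probabilities. To put them on a common footing, a rate can be converted into a failure probability over a time interval. The following is a minimal sketch, assuming exponentially distributed failure times; the mission time of 100 hours and the names `failure_probability` and `MISSION_HOURS` are hypothetical and do not come from the text.

```python
import math

# Failure rates per hour, as given in the text.
RATES_PER_HOUR = {
    "processor": 1e-4,
    "memory_unit": 6e-5,
    "memory_interface_unit": 5e-5,
    "bus": 1e-6,
    "gui_hw": 5e-5,
}

# Basic events that the text quantifies directly as probabilities.
FIXED_PROBABILITIES = {"operator": 0.001, "application_sw": 0.03}


def failure_probability(rate_per_hour: float, hours: float) -> float:
    """P(failure by `hours`) under a constant failure rate (exponential model)."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-rate_per_hour * hours)


MISSION_HOURS = 100.0  # hypothetical value, chosen only for illustration

probabilities = {
    name: failure_probability(rate, MISSION_HOURS)
    for name, rate in RATES_PER_HOUR.items()
}
probabilities.update(FIXED_PROBABILITIES)

for name, p in probabilities.items():
    print(f"{name:>22}: {p:.6f}")
```

For small values of rate × time the result is close to the product itself, which is why a rate of 10⁻⁴ per hour gives roughly 0.01 over the assumed 100 hours.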
Coverage parameters
For the fault tolerant components, coverage parameters must be defined. Chapter 8 described the three coverage probabilities (r, c and s) representing, respectively, the probability of successful recovery from a transient fault, the probability of successful recovery from a permanent fault and the probability that a fault is uncovered. To estimate these parameters, some analysis of the fault tolerance mechanisms is in order.
Processing system
A processor contains built-in test functionality so that error checking occurs concurrently with instruction execution. If an error is detected, the instruction is retried immediately. Partial results are stored in case the retry is unsuccessful, so that the computation can be continued from some intermediate point (called a checkpoint). The process of continuing a computation from a previously saved checkpoint is called a rollback. In some cases the fault is such that the rollback is not successful, so the computation must start over after a system-level recovery procedure is invoked.
An example of a processor fault coverage model is shown in Figure 13-8, and represents the following hypothetical recovery procedure. First, assume that the fault is transient, and begin a four-step recovery procedure that continues as long as an error is detected. If an error persists after all steps have been performed, then a permanent recovery procedure must be invoked.
[Figure: Transient Step 1 (Wait), Transient Step 2 (Retry), Transient Step 3 (Rollback) and Transient Step 4 (Restart), leading to Exit R (Transient Restoration); Attempt Permanent Recovery, leading to Exit C (Permanent Coverage) or Exit S (Single-Point Failure).]
Figure 13-8. Coverage Model for HECS Processors
Step 1. Wait for a short time (a few cycles) and do nothing. If the fault is transient, it may disappear during this time, allowing rollback to succeed.
Step 2. Retry the current instruction several times.
Step 3. If an error persists, perform a rollback to a previous checkpoint, and pick up the computation from the checkpoint.
Step 4. Restart the processor and either reload a checkpoint or start the task from the beginning.
If an error persists after the four-step transient recovery process, it is assumed to be caused by a permanent fault. A system-level permanent fault recovery process is begun, to remove the offending processor from the set of active units and to reconfigure the system to continue without it.
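The control flow of this recovery procedure can be sketched as a short routine. This is a simplified illustration, not the handbook's model: the step functions are hypothetical stand-ins, each returning True if an error is still detected after the step, and the sketch assumes the only outcomes are the three exits R, C and S.

```python
from typing import Callable, List, Sequence

# A step performs its recovery action and returns True if an error is
# still detected afterwards. Real step implementations are hypothetical.
Step = Callable[[], bool]


def recover(transient_steps: Sequence[Step],
            permanent_recovery: Callable[[], bool]) -> str:
    """Return the coverage-model exit reached: 'R', 'C' or 'S'."""
    for step in transient_steps:
        error_persists = step()
        if not error_persists:
            return "R"  # transient restoration
    # Every transient step failed: treat the fault as permanent.
    if permanent_recovery():
        return "C"  # permanent coverage: unit removed, system reconfigured
    return "S"  # single-point failure


def make_step(name: str, clears_fault: bool, log: List[str]) -> Step:
    """Build a stand-in step that records its name and reports the outcome."""
    def step() -> bool:
        log.append(name)
        return not clears_fault
    return step


log: List[str] = []
steps = [
    make_step("wait", False, log),
    make_step("retry", False, log),
    make_step("rollback", True, log),
    make_step("restart", True, log),
]
exit_reached = recover(steps, permanent_recovery=lambda: True)
print(exit_reached, log)  # R ['wait', 'retry', 'rollback']
```

In this run the fault clears at the rollback step, so the restart step and the permanent recovery are never invoked.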
The analysis of this coverage model consists of calculating the probability of system recovery for each step of transient recovery and for permanent recovery, given parameters that define the probability of success and duration of each phase and the characteristics of faults themselves. The detail of the analysis is beyond the scope of this handbook. The results of the analysis of the coverage model are summarized in the three coverage probabilities, one for each of the three exits of the coverage model (Exits R, C and S). For this example, transient restoration is assumed to be an effective recovery procedure for 70% of all faults (rP = 0.7). Permanent coverage is assumed to be 98% effective on the remaining 30% of faults (i.e., permanent faults). Thus, cP = 0.3 × 0.98 = 0.294. The probability that a single processor fault is uncovered, and thus leads to system failure, is sP = 1 - cP - rP = 0.006.
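The arithmetic behind the three processor coverage probabilities can be reproduced in a few lines. The helper name `two_stage_coverage` is an assumption made for this sketch; it encodes only the two percentages stated above and treats whatever is left over as uncovered.

```python
def two_stage_coverage(transient_fraction: float,
                       permanent_effectiveness: float):
    """Return (r, c, s) for a fault that is either restored as a transient,
    or else handled by permanent recovery with the given effectiveness."""
    r = transient_fraction
    c = (1.0 - transient_fraction) * permanent_effectiveness
    s = 1.0 - r - c  # everything not covered by r or c is uncovered
    return r, c, s


# Processor values from the text: 70% transient restoration,
# 98% permanent coverage on the remaining 30%.
r_p, c_p, s_p = two_stage_coverage(0.7, 0.98)
print(f"rP = {r_p:.3f}, cP = {c_p:.3f}, sP = {s_p:.3f}")
```

Because s is defined as the remainder, the three probabilities sum to one by construction.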
Memory system
A hypothetical recovery procedure for the memory units is shown in Figure 13-9. The memory uses an error correcting code, so a single-bit error is always detectable and correctable, and no reconfiguration is required. If 98% of all memory faults affect only a single bit, then the probability of reaching the R exit is rM = 0.98.
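[Figure: when an error occurs it is either a single-bit memory error (0.98), which is masked in zero time and leads to Transient Restoration Exit R, or a multiple-bit memory error (0.02). A multiple-bit error is either not detected (0.05), leading to Failure Exit S, or detected (0.95), in which case recovery is attempted: successful (0.85) leads to Permanent Coverage Exit C, unsuccessful (0.15) leads to Failure Exit S.]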
Figure 13-9. Coverage Model for Memory System
The 2% of faults that affect more than one memory bit are assumed to be 95% detectable. When a multiple-bit memory error is detected, the affected portion of memory is discarded, the memory mapping function is updated, and the needed information is reloaded from a previous checkpoint and updated to represent the current state of the system. Experimentation on a prototype system revealed that this recovery from detected multiple-bit memory errors works 85% of the time. Thus, the probability of reaching the C exit, which is the probability that a multiple-bit fault occurs, is detected, and is recovered from, is cM = 0.02 × 0.95 × 0.85 = 0.01615.
There are two paths to the S (single-point failure) exit. First, the memory fault causes a single-point failure if a multiple-bit error is not detected (with probability 0.02 × 0.05). Second, a single-point failure occurs if a multiple-bit memory error is detected, but the attempted recovery is not successful. Thus, sM = 0.02 × (0.05 + 0.95 × 0.15) = 0.00385.
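The memory coverage model is a small probability tree, so the exit probabilities can be obtained by listing each path and summing by exit. The sketch below uses only the branch probabilities quoted in the text; the variable names are chosen here for illustration.

```python
import math

# Branch probabilities quoted in the text for the memory coverage model.
P_SINGLE_BIT = 0.98   # fault affects a single bit (always corrected)
P_DETECTED = 0.95     # multiple-bit error is detected
P_RECOVERED = 0.85    # recovery from a detected multiple-bit error succeeds

p_multi = 1 - P_SINGLE_BIT

# Each path through the tree: (exit reached, probability of the path).
paths = [
    ("R", P_SINGLE_BIT),
    ("C", p_multi * P_DETECTED * P_RECOVERED),
    ("S", p_multi * (1 - P_DETECTED)),                # not detected
    ("S", p_multi * P_DETECTED * (1 - P_RECOVERED)),  # recovery fails
]

exits = {"R": 0.0, "C": 0.0, "S": 0.0}
for exit_name, probability in paths:
    exits[exit_name] += probability

# The paths are exhaustive and mutually exclusive, so they sum to one.
assert math.isclose(sum(exits.values()), 1.0)

for exit_name, probability in exits.items():
    print(f"{exit_name}: {probability:.5f}")
```

Up to floating-point rounding, the three sums agree with rM, cM and sM as derived above.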
Memory interface units
Experience with memory interface units suggests that 95% of all faults are recoverable transients, thus rMIU = 0.95. Of the remaining 5% of faults, which are permanent, 80% are recoverable by discarding the affected MIU. Thus, cMIU = 0.05 × 0.8 = 0.04 and sMIU = 0.05 × 0.2 = 0.01.
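Working directly from the percentages stated for the memory interface units (95% recoverable transients, and 80% of the remaining permanent faults recoverable), the three coverage probabilities can be checked with exact rational arithmetic, which avoids floating-point rounding in the sum-to-one test. The sketch assumes that permanent faults that are not recoverable lead to the S exit.

```python
from fractions import Fraction

# Percentages stated in the text for the memory interface units.
transient = Fraction(95, 100)                    # recoverable transients
recoverable_given_permanent = Fraction(80, 100)  # of the remaining faults

r_miu = transient
c_miu = (1 - transient) * recoverable_given_permanent
s_miu = (1 - transient) * (1 - recoverable_given_permanent)

# With exact fractions the three exits must sum to exactly one.
assert r_miu + c_miu + s_miu == 1

print(float(r_miu), float(c_miu), float(s_miu))  # 0.95 0.04 0.01
```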
Bus, application and interface
For the remaining components (the bus, the application HW and SW and the human operator) it is assumed that all faults are permanent and are perfectly covered, thus c = 1 and s = r = 0.