Hardware Failure Phenomena: How Electronics Fail

Fig. 2.2: Vendors' fault tolerance metrics

year) is said to have five nines of availability.

Also, serviceability is a broad qualitative term describing how easily faulty components are identified, diagnosed and/or isolated.

These three related attributes are commonly referred to as the RAS (Reliability, Availability, Serviceability) features of a system and are considered when designing, manufacturing, purchasing or using a computer product.
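As a quick check of the "nines" arithmetic behind the availability figure above, the following sketch converts a number of nines into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year, 525,600 minutes


def max_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime budget for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. five nines -> 1e-5
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {max_downtime_minutes(n):.2f} minutes/year")
```

Under these assumptions, five nines of availability corresponds to roughly 5.26 minutes of downtime per year.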

2.3 Hardware Failure Phenomena: How Electronics Fail

Traditionally, hardware errors have been divided into four main categories according to their nature and duration: transient faults, intermittent faults, permanent faults and design bugs.

Transient faults are non-permanent faults caused by several phenomena, including voltage fluctuations, electromagnetic interference and electrostatic discharge. However, the major cause is radiation striking the chip [129]. High-energy cosmic particles interact with atmospheric nuclei and trigger a cascade that generates many secondary particles, such as neutrons, protons and muons. These particles, normally neutrons, strike silicon devices randomly in time and location. When a particle hits a silicon device, it generates electron-hole pairs and therefore charge, as Figure 2.3 shows. When this charge exceeds a critical charge (Qcrit) [222], it can corrupt a data bit stored in memory or create a current glitch in a logic gate. Since the corruption does not harm the transistor structure, the fault disappears once the cell or transistor output is overwritten. Transient faults manifest as transient errors, also known as soft errors. Whereas packaging radiation and alpha particles can generally be minimized through specific material manufacturing, cosmic rays are unavoidable and their flux increases exponentially with altitude [222]. Transient faults have been considered one of the predominant sources of errors in microarchitectures for current and past silicon technologies [188].
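The critical-charge condition and the self-clearing nature of a soft error can be illustrated with a toy model. This is a minimal sketch, not a device-level simulation: the class `MemoryCell`, its method names and the charge values (in arbitrary units) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryCell:
    """Toy one-bit storage cell; all names and values are illustrative."""
    value: int     # stored bit, 0 or 1
    q_crit: float  # critical charge, arbitrary units

    def particle_strike(self, collected_charge: float) -> bool:
        """Flip the stored bit only if the collected charge exceeds Qcrit.

        Returns True when the strike caused an upset.
        """
        if collected_charge > self.q_crit:
            self.value ^= 1  # the bit is corrupted, the cell itself is unharmed
            return True
        return False

    def write(self, bit: int) -> None:
        """Overwriting the cell removes any earlier corruption."""
        self.value = bit


if __name__ == "__main__":
    cell = MemoryCell(value=1, q_crit=10.0)
    print(cell.particle_strike(4.0), cell.value)   # charge below Qcrit
    print(cell.particle_strike(15.0), cell.value)  # charge above Qcrit
    cell.write(1)                                  # fault disappears on overwrite
    print(cell.value)
```

The model deliberately keeps no damage state: after the write, the cell behaves exactly as before the strike, which is what distinguishes a transient fault from a permanent one.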


Chapter 2. Background

Fig. 2.3: Particle strike causing current disturbance [111]

time. These faults are non-permanent, as in the case of transient faults. Unlike transient faults, however, an intermittent fault is eliminated by replacing the affected device. Errors induced by intermittent faults usually occur in bursts when the fault location is exercised. Intermittent faults are generally triggered by voltage peaks and drops, as well as by temperature fluctuations. They often precede the occurrence of permanent faults [42]. High-frequency circuits will initially suffer from intermittent delay faults before open faults occur.

Permanent faults, also known as hard faults, involve errors that are irreversible due to physical changes. These faults are either caused by run-time aging or originate during the chip fabrication process. Until disabled or repaired, a permanent fault will potentially keep producing erroneous results. There are two main sources of permanent faults [186]:

• Physical wear-out. Several sources of failure can be classified as aging phenomena. Electromigration [92] refers to the displacement of metal ions caused by the current density flowing through the conductor. As seen in Figure 2.4, the depletion and accumulation of material create voids and hillocks, which can lead to open and short faults, respectively. Negative-bias temperature instability (NBTI) [6] progressively breaks silicon-hydrogen bonds at the silicon/oxide interface whenever a negative voltage is applied at the gate of PMOS transistors. The main consequence is a reduction in the maximum operating frequency and an increase in the minimum supply voltage of storage structures needed to cope with the resulting delay faults. Gate oxide breakdown [194] ultimately manifests as a conduction path from the anode to the cathode through the gate oxide as a result of the reduced dimensions of transistors' gates. Other physical events that can reduce the reliability of devices are stress migration for wires, thermal cycling for the package and pins, and hot carrier injection for transistors.

Fig. 2.4: Physical wear-out phenomena, open and short creation [59]

• Fabrication defects. Chip fabrication is an imperfect process, and product samples can be fabricated with inherent faults. Defects introduced at manufacturing time cause the same problems as wear-out faults, but from the very first moment of operation. In addition, a chip is more likely to contain multiple fabrication defects than to suffer multiple wear-out faults manifesting in the field at the same moment. Similarly, tolerable latent fabrication defects can worsen over the chip's lifetime and lead to intermittent contacts [42].

Design bugs are a special type of permanent fault. Even in an ideal scenario with a perfect manufacturing process and total reliability against transient faults, a fabricated microprocessor may not operate correctly in all situations due to a mismatch between the implementation and the specification, or due to an incomplete specification. These kinds of faults are normally referred to as functional faults or design bugs [35, 208].


2.4 Aspects of Fault Tolerance

Dealing with hardware and design faults involves several challenges that, in a broad sense, constitute the field of fault tolerance research. The fault tolerance area is generally divided into several overlapping fields:

• Error detection. The most crucial aspect of fault tolerance is determining whether the system operation was affected by an error or not. To achieve detection capabilities, error detection mechanisms are included in the microprocessor design in order to regularly check the internal state and activity during its lifetime (after the microprocessor has been sold). Adding error detection (but not correction) to a structure eliminates SDC errors, converting those faults into DUE errors. As a consequence, error detection mechanisms reduce the SDC FIT. Error detection is the foundational capability that enables the other aspects of fault tolerance.

• Error diagnosis. Error diagnosis has traditionally been conducted during the post-silicon validation phases, as a method to understand the reason behind failures and bugs and to guide their correction. However, diagnosis is also used in mission-critical segments during their lifetime. Its objective is to guide an adequate higher-level repair or reconfiguration mechanism that can deal with the underlying fault. Since errors can be caused by faults of different natures, error diagnosis is often needed to pinpoint the error type as well as the location of the error. The diagnosis latency is not generally a problem because its cost is paid only after an error has been detected. Therefore, software solutions are also attractive and cost-effective.

• Hardware repair and error reconfiguration. Once an error has been detected and diagnosed, additional actions are taken to prevent the fault from being exercised again during the processor lifetime. If the fault is permanent or intermittent, repair and reconfiguration can be handled by disabling the faulty parts of the affected component, if possible [26, 149]. Repair and reconfiguration can also be conducted at a coarser granularity, through physical replacement of the microprocessor or by disabling the faulty core and using a spare one: the ubiquity of chip multiprocessor (CMP) systems makes repair and reconfiguration a realistic and simple approach. Software approaches like software circumvention [116] are a viable solution for single-core designs. For transient faults, there is no need for repair or reconfiguration.

• Error Recovery. After repair and reconfiguration, the last step is to recover

In document Low-cost and efficient fault detection and diagnosis schemes for modern cores (Page 44-48)