Other Fault Tolerance Approaches - Contributions to the fault tolerance of soft-core processors

Apart from the well-established fault tolerance strategies presented in the previous sections, the literature also proposes approaches that explore alternative solutions. Although they can be suitable in some scenarios, most of them have a limited application scope, so their adequacy depends on the particular application to be hardened.

A fault tolerance approach based on the concept of self-stabilization, used in distributed computing, was presented in [85, 91]. The research proposed a self-recovering algorithm for processors (hard-core and soft-core). It follows the idea that self-stabilization allows a distributed system to reach a correct state within a finite number of execution steps, regardless of the state in which it was initialized. The proposed algorithm uses specific interruption code combined with asynchronous interruption signals at random clock cycles to modify the content of memory cells randomly. In the presence of corrupt data caused by failures, the algorithm leads the system to an arbitrary configuration that corrects the behaviour in a finite amount of time. The approach focuses on specific resources, such as registers (special and general purpose), internal SRAM, memory caches, etc. Thanks to this approach, the system can withstand transient failures, especially those that do not affect the code executed by any node. However, since the approach is program dependent and mainly focused on preserving the convergence of transmitted data, its application scope is limited. In addition, it is not effective if an error affects one of the registers that contain the variables used by the processor while executing the self-converging program.
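
The convergence property that underlies self-stabilization can be illustrated with a small simulation. The sketch below is not the algorithm of [85, 91]; it uses Dijkstra's classic K-state token ring as a stand-in, and the function names, the ring size and the random scheduler are assumptions made only for illustration. Starting from an arbitrary (possibly corrupted) state, the ring reaches a legitimate configuration with exactly one privilege after a finite number of moves.

```python
import random


def privileged(states, i):
    """Return True if machine i currently holds a privilege."""
    if i == 0:
        # The distinguished machine is privileged when it equals its left
        # neighbour, which is the last machine of the ring.
        return states[0] == states[-1]
    # Every other machine is privileged when it differs from its left neighbour.
    return states[i] != states[i - 1]


def step(states, i, k):
    """Let the privileged machine i make its move (Dijkstra's K-state rule)."""
    if i == 0:
        states[0] = (states[0] + 1) % k
    else:
        states[i] = states[i - 1]


def stabilize(states, k, rng, max_steps=10_000):
    """Schedule one privileged machine at a time until a single privilege remains.

    Returns the number of moves that were needed. Assumes k >= len(states).
    """
    for moves in range(max_steps):
        holders = [i for i in range(len(states)) if privileged(states, i)]
        if len(holders) == 1:
            return moves
        step(states, rng.choice(holders), k)
    raise RuntimeError("ring did not stabilize within max_steps")


if __name__ == "__main__":
    rng = random.Random(1)
    ring = [rng.randrange(6) for _ in range(5)]  # arbitrary initial state
    print("initial state:", ring)
    print("moves until legitimate:", stabilize(ring, 6, rng))
    print("final state:", ring)
```

Once a single privilege remains, every further move preserves that property, which is the closure half of self-stabilization; the loop above demonstrates the convergence half.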

The approach presented in [239] is another algorithm-based hardening alternative. The proposed Algorithm-Based Fault Tolerance (ABFT) technique reduces the susceptibility by 99% with a limited resource overhead (about 25%) by hardening both the datapath and the configuration memory. The application scope of ABFT is linear-algebra operations. However, as the author states, ABFT cannot provide sufficient protection for applications that are not composed of linear-algebra operations.
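
To make the ABFT idea concrete, the following sketch shows the textbook checksum scheme for matrix multiplication: a checksum row is appended to one operand and a checksum column to the other, so that a single corrupted element of the product can be located and corrected. It is a generic illustration and not the hardware implementation of [239]; all function names are hypothetical, and integer matrices are assumed so that the comparisons are exact.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_checksum(a):
    """Append a row that holds the sum of every column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append a column that holds the sum of every row of b."""
    return [row + [sum(row)] for row in b]


def locate_error(c):
    """Return (row, col) of a single corrupted data element, or None.

    c is a full-checksum matrix: last row and last column are checksums.
    This sketch only handles one error located in the data part.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern not correctable by this sketch")


def correct(c):
    """Repair a single corrupted data element in place; return its position."""
    loc = locate_error(c)
    if loc is not None:
        i, j = loc
        m = len(c[0]) - 1
        # Rebuild the element from the row checksum and the other row entries.
        c[i][j] = c[i][m] - sum(c[i][k] for k in range(m) if k != j)
    return loc


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = matmul(column_checksum(a), row_checksum(b))
    c[0][1] = 99                      # emulate an upset in one result element
    print("corrected position:", correct(c))
    print("product:", [row[:2] for row in c[:2]])
```

The scheme relies on the linearity of the operation (the checksum of the product equals the product of the checksums), which is consistent with the restriction to linear-algebra workloads noted above.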

A different strategy was proposed in [284], where the configuration bitstream is modified by hardware inside the chip. A controller implemented with logic resources of the FPGA manages the bitstream adaptation following two algorithms (Modify Placement and Modify Routing). This approach affects the design negatively in terms of area and performance. Besides, since the controller is itself implemented with logic resources of the FPGA, it is also susceptible to induced faults.

Other fault tolerance approaches hinge on design aspects in order to harden FPGA-based designs. In [65], a methodology for the hardware/software co-design of embedded systems was proposed. This method takes advantage of different software and hardware strategies (hardware or software redundancy, etc.). It presents the concept of Software Implemented Hardware Fault Tolerance (SIHFT), which makes it possible to reach a trade-off solution that meets the specific requirements of a design. In this way, after defining the specific requirements, different SIHFT approaches are applied incrementally to obtain a number of candidate implementations of the software. The generated candidates are then compared to estimate their code and execution time overheads. After discarding the implementations that do not meet the requirements, the selected candidates are tested with an SEU emulation tool. The ones that present better fault tolerance results are tested again in deeper fault injection campaigns, identifying the critical elements in order to protect them with hardware redundancy. In this way, a range of hardware/software trade-off implementations is obtained. Despite the positive results achieved when hardening a PicoBlaze-based design, this approach is a software- and hardware-dependent solution that demands a high design effort.
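
The selection flow described above can be summarized as a two-stage filter. The sketch below is only a schematic reading of that flow, not the tooling of [65]: the `Candidate` fields, the overhead limits and the `emulate` callback (standing in for the SEU emulation tool) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One SIHFT-hardened variant of the software (hypothetical fields)."""
    name: str
    code_overhead: float   # code size relative to the unhardened program
    time_overhead: float   # execution time relative to the unhardened program


def filter_by_overhead(candidates, max_code, max_time):
    """Discard the candidates that do not meet the overhead requirements."""
    return [c for c in candidates
            if c.code_overhead <= max_code and c.time_overhead <= max_time]


def rank_by_emulation(candidates, emulate, keep):
    """Keep the `keep` candidates with the lowest failure rate.

    `emulate` is an assumed callback that runs an SEU emulation campaign on a
    candidate and returns the fraction of injected faults that caused a failure.
    """
    return sorted(candidates, key=emulate)[:keep]
```

The candidates returned by `rank_by_emulation` would then go through the deeper fault injection campaigns, whose results decide which elements receive hardware redundancy.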

Another alternative to harden FPGA systems in the design process is to harden designs during synthesis, as in [285], where the In-Place X-Filling (IPF) technique was introduced. IPF is a synthesis-based algorithm that masks SEUs at the logic level in both LUTs and interconnects. This approach requires analysing the configuration bits in order to identify the bits with no failure rate, which demands some kind of analysis mechanism such as a logic simulation. After identifying the bits with no failure rate, they are filled in order to mask errors. The more of these bits are reconfigured to mask errors, the lower the failure rate obtained. This solution also requires a high design effort, and its results depend on the number of configuration bits with no failure rate.
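
The filling step can be illustrated on a single LUT. The sketch below is a deliberately simplified model and not the algorithm of [285]: it assumes that logic simulation has already provided the set of reachable LUT addresses, treats every unreachable address as a bit with no failure rate, and only considers faults that flip one LUT input. Each such bit is filled with the majority value of its reachable neighbours, so that a flipped input is more likely to read the correct output. All names are hypothetical.

```python
def neighbours(addr, n_inputs):
    """Addresses that differ from addr in exactly one LUT input."""
    return [addr ^ (1 << b) for b in range(n_inputs)]


def fill_dont_cares(lut, reachable, n_inputs):
    """Return a copy of lut whose unreachable bits are filled to mask faults.

    lut is the truth table as a list of 0/1 values indexed by address and
    reachable is the set of addresses that occur in fault-free operation.
    """
    filled = list(lut)
    for addr in range(len(lut)):
        if addr in reachable:
            continue
        votes = [lut[n] for n in neighbours(addr, n_inputs) if n in reachable]
        ones = sum(votes)
        zeros = len(votes) - ones
        if ones > zeros:
            filled[addr] = 1
        elif zeros > ones:
            filled[addr] = 0
        # On a tie the original value is kept.
    return filled


def masked_faults(lut, reachable, n_inputs):
    """Count single input flips that still produce the fault-free output."""
    return sum(lut[n] == lut[addr]
               for addr in reachable
               for n in neighbours(addr, n_inputs))


if __name__ == "__main__":
    nand = [1, 1, 1, 0]          # 2-input NAND truth table
    reachable = {0, 1, 2}        # assumption: input pattern 11 never occurs
    hardened = fill_dont_cares(nand, reachable, 2)
    print("masked before:", masked_faults(nand, reachable, 2))
    print("masked after: ", masked_faults(hardened, reachable, 2))
```

Because only unreachable entries are rewritten, the fault-free function of the LUT is unchanged, which is why the technique can be applied in place after synthesis.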

In addition, the results in terms of resource overhead or performance obtained with this kind of approach may be worse than those achieved with the vendor's tools, which are optimized for their devices.

In [11], a hybrid error-detection technique named HETA was introduced. It is based on assertions and a non-intrusive enhanced watchdog module to recognize induced faults in processors. This technique, depicted in Figure 3.17 (obtained from [11]), combines software-based techniques with a non-intrusive hardware module. The approach analyses the original program code and adds static instructions to it. During operation it also constantly updates the content of a signature register, which is connected to different program nodes. These program nodes are the basis for the error-detecting control flow. The purpose of the hardware module, in turn, is to detect incorrect jumps to unused memory positions and control flow loops. This approach uses 66% fewer resources than a TMR implementation. However, the reliability level is lower and the performance is considerably reduced.
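
The software side of such a scheme can be sketched as a generic signature-based control-flow check. The code below is not the HETA implementation of [11]: the control-flow graph, the node names and the signature values are hypothetical, and the check is a plain assertion that the signature register holds a legal predecessor whenever a program node is entered.

```python
class ControlFlowError(Exception):
    """Raised when a program node is entered from an illegal predecessor."""


# Hypothetical control-flow graph: node -> set of legal successor nodes.
CFG = {"entry": {"loop"}, "loop": {"loop", "exit"}, "exit": set()}

# Hypothetical static signatures assigned to the program nodes.
SIGNATURES = {"entry": 0x1, "loop": 0x2, "exit": 0x4}


def run(trace):
    """Replay a sequence of visited nodes while checking the signature register.

    Returns the final content of the signature register, or raises
    ControlFlowError on the first illegal transition.
    """
    allowed = {SIGNATURES[src]: {SIGNATURES[dst] for dst in dsts}
               for src, dsts in CFG.items()}
    register = SIGNATURES[trace[0]]
    for node in trace[1:]:
        signature = SIGNATURES[node]
        # Assertion inserted at the entry of every program node.
        if signature not in allowed[register]:
            raise ControlFlowError(f"illegal jump into {node!r}")
        register = signature  # update the signature register
    return register
```

In this model a legal trace such as entry, loop, loop, exit completes normally, whereas a faulty jump from entry straight to exit raises `ControlFlowError`. Jumps to unused memory positions are outside its reach, which is the part the text above assigns to the hardware module.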

Figure 3.17: Block diagram of the HETA approach (blocks: Enhanced HW Module, Processor, Program Memory; signals: address, data, read_write).

An approach inspired by the immune system found in higher organisms was presented in [286]. The proposed hardware immune system is based on the negative-selection algorithm, which uses binary matching rules to discriminate invalid states. These matching rules are implemented with the logic resources of the FPGA in order to obtain a higher operation speed at the cost of more logic gates. Bearing in mind that almost all hardware systems can be represented as an individual FSM or a set of interconnected FSMs, the discrimination methods are implemented using an FSM in order to represent the system to be immunized. The detection of faults is performed following the idea that an error creates an invalid state. In this way, a controller module brings together the inputs, combining them with previous states to generate different strings for the FSM and the detection. Monitoring the internal states of the FPGA makes it possible to detect an error before it propagates to the output. The results in terms of capacity and performance are not optimal, and the use of an internal controller compromises the reliability of the entire design.
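
A software model helps to show how negative selection discriminates invalid states. The sketch below is not the hardware design of [286]: it assumes that each monitored string is the concatenation of the previous state, the input and the next state of an FSM, it uses the common r-contiguous-bits matching rule, and it generates detectors exhaustively instead of randomly so that the result is deterministic. The example FSM and all names are hypothetical.

```python
from itertools import product


def r_contiguous_match(a, b, r):
    """True if a and b agree in at least r contiguous bit positions."""
    run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        if run >= r:
            return True
    return False


def generate_detectors(self_set, length, r):
    """Keep every candidate string that matches no valid (self) string."""
    candidates = ("".join(bits) for bits in product("01", repeat=length))
    return [c for c in candidates
            if not any(r_contiguous_match(c, s, r) for s in self_set)]


def is_invalid(string, detectors, r):
    """A monitored string is flagged when any detector matches it."""
    return any(r_contiguous_match(string, d, r) for d in detectors)


if __name__ == "__main__":
    # Hypothetical 3-state FSM (encodings 00, 01, 10; 11 unused):
    # input 0 keeps the state, input 1 advances it cyclically.
    # String layout: previous state (2 bits) + input (1 bit) + next state (2 bits).
    self_set = ["00000", "00101", "01001", "01110", "10010", "10100"]
    detectors = generate_detectors(self_set, length=5, r=4)
    print("valid transition flagged:  ", is_invalid("00101", detectors, 4))
    print("invalid transition flagged:", is_invalid("11000", detectors, 4))
```

Since the matching rule is symmetric, a valid string can never be flagged. For r smaller than the string length, however, some invalid strings may go undetected, which is one way of reading the capacity limitation mentioned above.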
