
3.2 Hardware Redundancy

3.2.2 Triple Modular Redundancy

Unlike DMR, TMR and higher redundancy levels make it possible to identify the faulty module and provide higher fault tolerance. However, the resource overhead grows in direct proportion to the redundancy level. Among the different hardware redundancy alternatives, Triple Modular Redundancy (TMR) is considered the most widespread one for hardening FPGA designs because it offers a remarkable trade-off between reliability and resource overhead. Since it was first introduced in [217], this technique has become a standard hardening approach, inspiring a vast body of research in the literature [54, 59, 66, 137, 218–223]. Nevertheless, its biggest handicap is the resource overhead, which is usually about 200%. The work in [224] presents a TMR alternative with reduced resource overhead. It proposes a TMR scheme composed of one full-precision module and two reduced-precision modules. The two reduced-precision blocks generate an upper and a lower bound on the correct function output. The authors themselves state that, although the results with complex functions are acceptable, the results with simple operations are not efficient in terms of fault tolerance.
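The bounding idea behind the reduced-precision scheme can be illustrated with a minimal Python sketch. It is not the implementation from [224]: the squaring function, the helper names and, in particular, the fallback to the midpoint of the two bounds are assumptions made only for illustration.

```python
import math

def square_full(x: float) -> float:
    """Full-precision module (hypothetical example function)."""
    return x * x

def square_lower(x: float) -> float:
    """Reduced-precision module giving a lower bound (valid for x >= 0)."""
    return math.floor(x) ** 2

def square_upper(x: float) -> float:
    """Reduced-precision module giving an upper bound (valid for x >= 0)."""
    return math.ceil(x) ** 2

def bounded_output(full: float, lower: float, upper: float):
    """Accept the full-precision result only if it lies within the bounds.

    Returns (value, fault_detected). The midpoint fallback is an assumption
    of this sketch, not a detail taken from the cited work.
    """
    if lower <= full <= upper:
        return full, False
    return (lower + upper) / 2, True

x = 3.7
value, fault = bounded_output(square_full(x), square_lower(x), square_upper(x))
print(value, fault)

# Emulate an upset that corrupts the full-precision result.
value, fault = bounded_output(1e6, square_lower(x), square_upper(x))
print(value, fault)
```

The sketch also hints at why simple operations benefit less: the cheaper the full-precision function, the smaller the saving obtained from the two reduced-precision replicas relative to the loss of accuracy in the fallback value.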

Essentially, a TMR scheme uses three identical or functionally equivalent units of the element to be hardened. Once the element is triplicated, the final output is determined by majority voting. Thanks to that, as shown in Figure 3.9(a), even when a fault is present in one of the three replicas, the voter is still fed with two correct values. Hence, the wrong third result is masked, providing a valid output without stopping the operation.
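The masking effect can be reproduced with a small software model. The following Python sketch is only illustrative: the `adder` element and the `fault_mask` mechanism used to emulate an upset are hypothetical names introduced here.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three equally wide words."""
    return (a & b) | (b & c) | (a & c)

def tmr(module, x: int, fault_mask=(0, 0, 0)) -> int:
    """Run three replicas of `module` and vote on their outputs.

    fault_mask[i] is XORed onto the output of replica i to emulate an upset.
    """
    outputs = [module(x) ^ mask for mask in fault_mask]
    return majority_vote(*outputs)

def adder(x: int) -> int:
    """Hypothetical 8-bit element to be hardened."""
    return (x + 1) & 0xFF

golden = adder(0x41)
voted = tmr(adder, 0x41, fault_mask=(0, 0b00010000, 0))  # one faulty replica
assert voted == golden  # the single wrong output is masked
```

Note that the voter operates bit by bit, so the faulty replica is outvoted without the operation being interrupted.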

However, as depicted in Figure 3.9(b), when multiple errors are induced in different modules at the same time and provoke an identical faulty output, the voter can accept that output as correct. This is an unlikely but possible situation. Another remarkable drawback of using a voter is that, in the best scenario, errors are masked but not corrected. Hence, when utilizing hardware redundancy there is an inherent risk of SEU accumulation. For this reason, as mentioned in Section 3.1, the common practice is to combine it with scrubbing.
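Both drawbacks, the wrong vote caused by identical faults in two modules and the accumulation of masked errors, can be observed in a simplified model. The sketch below is an assumption-laden illustration: upsets are modelled as persistent bit flips in a stored replica value, and scrubbing is modelled as a periodic restoration of all replicas to the golden value.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three equally wide words."""
    return (a & b) | (b & c) | (a & c)

def first_failure(upsets, n_cycles, golden=0xA5, scrub_period=None):
    """Return the first cycle whose voted output differs from `golden`, or None.

    upsets: dict mapping cycle -> (replica index, bit mask) of a persistent flip.
    scrub_period: if set, all replicas are restored every `scrub_period` cycles.
    """
    replicas = [golden, golden, golden]
    for cycle in range(n_cycles):
        if scrub_period and cycle % scrub_period == 0:
            replicas = [golden, golden, golden]  # scrubbing repairs latent errors
        if cycle in upsets:
            index, mask = upsets[cycle]
            replicas[index] ^= mask              # the flip persists until scrubbed
        if majority_vote(*replicas) != golden:
            return cycle
    return None

# Two upsets hit the same bit of two different replicas at different times.
upsets = {3: (0, 0x01), 12: (2, 0x01)}
print(first_failure(upsets, n_cycles=20))                   # no scrubbing
print(first_failure(upsets, n_cycles=20, scrub_period=8))   # periodic scrubbing
```

Without scrubbing, the first upset stays latent and the second one makes the voter fail; with a scrubbing period shorter than the interval between both upsets, the first error is repaired before the second one arrives.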

Considering the crucial relevance of the voter in this type of scheme, its design is a main concern [191, 225, 226]. Figure 3.10(a) depicts the most basic voter which, as can be observed, consists of three 2-input AND gates connected to a 3-input OR gate. This scheme is capable of providing a correct output in the presence of a faulty module. However, this claim presumes that the voter itself cannot be affected by induced faults, which does not match reality.
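The gate structure of the basic voter, and its vulnerability to its own faults, can be modelled as follows. This is a sketch under stated assumptions: the node names and the `stuck` parameter, which forces one gate output to a fixed value, are hypothetical and serve only to emulate a fault inside the voter.

```python
from itertools import product

def basic_voter(a: int, b: int, c: int, stuck=None) -> int:
    """Single-bit voter: three 2-input AND gates feeding one 3-input OR gate.

    stuck: optional (node, value) pair forcing "and_ab", "and_bc", "and_ac"
    or "or" to 0 or 1, emulating a fault inside the voter itself.
    """
    nodes = {"and_ab": a & b, "and_bc": b & c, "and_ac": a & c}
    if stuck and stuck[0] in nodes:
        nodes[stuck[0]] = stuck[1]
    out = nodes["and_ab"] | nodes["and_bc"] | nodes["and_ac"]
    if stuck and stuck[0] == "or":
        out = stuck[1]
    return out

# The fault-free voter matches the majority function for all eight inputs.
for a, b, c in product((0, 1), repeat=3):
    assert basic_voter(a, b, c) == int(a + b + c >= 2)

# A fault in a single AND gate corrupts the output although the three
# replicas agree on 0.
assert basic_voter(0, 0, 0, stuck=("and_ab", 1)) == 1
```

The last assertion shows the single point of failure: three healthy modules are not enough if the voter itself is upset.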

When a fault affects the voter it can provoke wrong behaviour, from partial failures to total disruptions. The criticality of such a scenario has motivated a number of research works [227–229]. One alternative to harden the voter is to triplicate it, as in [230], where a suitable partition of the replicated module/voter structure is implemented in different devices. However, its hardware overhead and the fact that the three outputs have to be voted at the same point make this alternative inadvisable in many situations. In [231], a voter reliability study and a novel voter (depicted in Figure 3.10(b)), implemented using XOR gates, priority encoders and multiplexers, were presented. This approach increases the reliability of the voter by a factor of six, but also increases the resource overhead. Figure 3.10(c) portrays the approach presented in [232]. In this case, the proposed voter obtains higher robustness than the basic voter but less than that of [231]. The main benefit of this voter is a reduction in resource overhead, even with respect to the basic voter. In [233], a deep analysis of the reliability of the previously presented majority voter approaches was introduced, proposing a new hardened voter for TMR designs.

Figure 3.9: Examples of different TMR scenarios. (a) Correct voting under an SEU in a TMR scheme. (b) Wrong voting due to MBUs in a TMR scheme.

Figure 3.10(d) shows this voter, which is composed of several OR and AND gates. This scheme is similar to the voter proposed in [232]. Nevertheless, it improves the resilience to potential internal and/or external faults. The work in [87] presents a complex voter composed of different logic gates and managed by an external controller, which detects and recovers from permanent faults. In [234], a scan-chain-based approach was proposed, applied to the inputs and outputs of each flip-flop in the circuit to detect any functional fault affecting the majority voter, making it possible to determine which module has to be fixed. Furthermore, extending the scan chain inside the module and wrapping the different combinational blocks and registers with it enables the precise location of the fault.

Although all these approaches improve reliability over a basic TMR voter, they are not capable of eliminating the single point of failure at the output of the final voter.

A totally different approach was presented in [226], where ICAP-based voting is proposed in order to overcome the consequences of faults in the voter. In this case, the voting is performed by an external processor after reading the content of the output registers of each module from the bitstream through the ICAP reconfigurable port, utilizing the GCAPTURE primitive. Nevertheless, this approach does not solve the problem of multiple errors in different modules provoking an identical faulty output. Similarly, it does not prevent the accumulation of errors in registers and data memories. The approach presented in [235] also utilizes the ICAP to read the bitstream and detect errors. However, it requires an external device to implement a watchdog timer. Another similar strategy was presented in [236], where error detection is performed through a direct readback and comparison of the current configuration bitstream. This approach increases the hardware usage, since fault detection and recovery are performed outside the FPGA by a dedicated on-board CPU via the SelectMAP port. In general, bitstream-based voting methods present low availability due to their time requirements.
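The readback-and-compare principle can be sketched in software as a frame-wise comparison between the data read back and a golden copy. The sketch below is not the flow of [236]: the function name, the byte-oriented representation and the `frame_bytes` parameter are assumptions, and the actual frame length depends on the device family.

```python
def faulty_frames(readback: bytes, golden: bytes, frame_bytes: int) -> list:
    """Return the indices of the frames whose readback differs from the golden copy.

    frame_bytes is a hypothetical parameter of this sketch; the real frame
    length is device dependent.
    """
    if len(readback) != len(golden):
        raise ValueError("readback and golden data must have the same length")
    bad = []
    for start in range(0, len(golden), frame_bytes):
        if readback[start:start + frame_bytes] != golden[start:start + frame_bytes]:
            bad.append(start // frame_bytes)
    return bad

golden = bytes(range(16))       # toy "bitstream" made of four 4-byte frames
readback = bytearray(golden)
readback[9] ^= 0x04             # emulate an upset in the third frame
print(faulty_frames(bytes(readback), golden, frame_bytes=4))
```

Since every frame has to be read and compared before an error is located, the detection latency grows with the size of the configuration memory, which is consistent with the low availability attributed to bitstream-based methods.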

In conclusion, although they increase the reliability level, all the published approaches present either hardware overhead or a single point of failure, and the majority of them present both.

Figure 3.10: Voter alternatives for TMR approaches. (b) Voter approach from [231]. (d) Voter approach from [233].

In contrast to coarse-grained schemes, the implementation of fine-grained TMR schemes can be a tedious task. To help designers apply TMR hardening to a design, several software tools are available. The TMRTool by Xilinx [237] is a remarkable tool when working with Xilinx's Virtex-4QV and Virtex-5QV FPGAs. Nevertheless, since it has been discontinued, new tools have to be considered, especially when working with newer devices such as the Zynq. Another relevant tool is BLTmr, developed at Brigham Young University (in collaboration with Los Alamos National Laboratory) [202]. BLTmr is a CAD tool that implements partial TMR designs by applying selective triplication to the most sensitive components of a design, hence reducing the hardware overhead. Another alternative is Synplify Premier by Synopsys, Inc. [238]. This software offers multiple options for implementing error detection and mitigation circuitry, such as memory protection by inferring error-correcting code (ECC) memory primitives and by inserting triple modular redundancy (TMR) on BRAMs to mitigate single-bit errors, safe FSM implementation, and fault-tolerant FSMs with Hamming-3 encoding. In [236], a set of tools for manipulating partial bitstreams to perform fine- and coarse-grained redundancy was presented.
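The idea behind a Hamming-3 state encoding can be illustrated as follows: if every pair of state codes differs in at least three bits, a single bit flip in the state register still leaves the register closer to the original code than to any other. The Python sketch below is only an illustration of this principle; the state names and codes are hypothetical, and it does not describe how Synplify Premier implements the feature.

```python
from itertools import combinations

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def min_distance(codes: dict) -> int:
    """Minimum pairwise Hamming distance of a state encoding."""
    return min(hamming(a, b) for a, b in combinations(codes.values(), 2))

def recover_state(register: int, codes: dict):
    """Return the state whose code is within one bit of `register`, else None."""
    for state, code in codes.items():
        if hamming(register, code) <= 1:
            return state
    return None

# Hypothetical 5-bit encoding of a four-state FSM.
CODES = {"IDLE": 0b00000, "LOAD": 0b00111, "RUN": 0b11001, "DONE": 0b11110}

assert min_distance(CODES) == 3
upset_register = CODES["LOAD"] ^ 0b00100          # single bit flip in the register
assert recover_state(upset_register, CODES) == "LOAD"
```

With a minimum distance of three, at most one valid code lies within one bit of any register value, so the loop in `recover_state` cannot return an ambiguous result for single-bit upsets.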