3.3 Other Types of Redundancy
3.3.1 Data Redundancy
Data redundancy is another notable fault-tolerance alternative for hardening SRAM-based FPGA designs. The concept is to add extra data so that the original information can be verified and even corrected.
Data redundancy is mainly used to reduce error rates in memories, which is especially relevant for SRAM-based ones. The prevalent approach is the use of built-in Cyclic Redundancy Check (CRC) and Error Correction Code (ECC) or Error Detection and Correction (EDAC) strategies [59, 60, 73, 239]. With these techniques a bit-flip can be detected, automatically corrected and even logged, optionally with a time stamp for further actions. ECC techniques can be implemented in hardware or in software [240]. With software ECC approaches, transient faults in the combinational logic are not stored in the storage cells, and bit-flips in the storage cells can be avoided or instantly corrected.
However, software ECC may still fail on a single-bit error if the error arises between the last scrubbing pass and the moment the data is read from memory. In contrast, hardware ECC checks all data read from the memory and corrects single-bit errors. Hence, given that hardware ECC provides better reliability, it is the more advisable strategy.
Thanks to their reliability and the limited hardware overhead they introduce, these techniques are a widely used alternative to TMR schemes when hardening memories in FPGA designs [194, 241]. While TMR boosts reliability at the cost of significantly increasing the area of the memory cells (especially with fine granularity), ECC adds only a small storage overhead but needs large, multi-level logic blocks to implement the coders and decoders, which lengthens the critical paths. The convenience of each approach therefore depends on the design needs.
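The storage side of this trade-off can be illustrated with a short calculation. The sketch below assumes a Hamming code extended with one overall parity bit (single-error correction, double-error detection) and uses the standard condition 2^p >= m + p + 1 for the number of check bits p protecting m data bits. The function names are illustrative only, and the coder, decoder and voter logic are deliberately not counted.

```python
def hamming_check_bits(m: int) -> int:
    """Smallest p such that 2**p >= m + p + 1 (single-error correction)."""
    p = 0
    while (1 << p) < m + p + 1:
        p += 1
    return p


def storage_overhead(m: int) -> dict:
    """Extra storage bits per m-bit word for TMR and for extended Hamming."""
    # One additional overall parity bit enables double-error detection.
    p = hamming_check_bits(m) + 1
    # TMR stores two extra copies of every data bit.
    return {"tmr_extra_bits": 2 * m, "secded_extra_bits": p}


for width in (8, 16, 32, 64):
    print(width, storage_overhead(width))
```

For a 64-bit word this gives 8 check bits against 128 extra bits for TMR, which is why the storage overhead of ECC is small; the cost moves into the coding and decoding logic instead.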
Several ECCs are available in the literature [242, 243], such as Hamming, One-Step Majority-Logic Decoder (serial and parallel), Majority Gate, Bose-Chaudhuri-Hocquenghem, Berger, m-Of-n, m-Out-n, Residue codes, Reed-Solomon, Hsiao, Checksum, etc. One of the most widely used is the Hamming code [66, 75, 192, 194, 241, 244, 245]. As depicted in Figure 3.11, it adopts the parity concept, but uses more than a single parity bit. To detect k single-bit errors, the minimum Hamming distance (D), i.e. the minimum number of bit positions at which two code words differ, must be D = k+1. A code word of n bits with m data bits and p check bits, where n = m+p, can correct (D-1)/2 errors (rounded down) and can detect D-1 errors. Moreover, with the addition of an extra parity bit, it can be determined whether a 1-bit or a 2-bit error has occurred. Nevertheless, even though it is a good candidate for memories or register files, the Hamming code can be inadvisable in some scenarios, especially when a large number of bits leads to long paths of serial XOR gates in the coding and decoding modules. In this scenario a suitable alternative is to divide the data into smaller data words. As [241] states, for data words of up to 16 bits the difference in area and delay between Hamming code and TMR is nearly negligible. For this reason, a TMR scheme is commonly an interesting alternative for blocks made up of registers and pipelines, while Hamming ECC is more appropriate for hardening register files and embedded memories.
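The behaviour described above can be sketched in software. The following is a minimal illustration, not the implementation used in any of the cited works: it assumes check bits placed at the power-of-two positions of the code word, plus one overall parity bit stored at index 0, so that a single flipped bit is corrected and a double flip is reported as uncorrectable. The function names and bit layout are choices made for this example.

```python
def encode(data):
    """Return [overall parity, bit 1, ..., bit n] for a list of 0/1 data bits."""
    m = len(data)
    p = 0
    while (1 << p) < m + p + 1:
        p += 1
    n = m + p
    code = [0] * (n + 1)              # index 0 holds the overall parity bit
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):           # not a power of two: data position
            code[pos] = next(bits)
    for i in range(p):                # fill each check bit (even parity)
        mask = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & mask:
                parity ^= code[pos]
        code[mask] = parity
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos           # XOR of set positions locates a single flip
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome <= n:
        code[syndrome] ^= 1           # syndrome 0 means the overall parity bit flipped
        status = "corrected"
    else:
        status = "uncorrectable"      # e.g. a double-bit error
    data = [code[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data


word = [1, 0, 1, 1, 0, 0, 1, 0]
stored = encode(word)
stored[7] ^= 1                        # corrupt the 7th bit of the code word
print(decode(stored))
```

The serial XOR loops in `encode` and `decode` correspond to the XOR trees of the hardware coder and decoder, whose depth grows with the word width.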
Figure 3.11: Example of error detection using Hamming code (corrupted bit: 7th).
Xilinx has a built-in ECC module to protect the BRAM structures [192]. This module, which uses a Hamming code, is intended for BRAM primitives with data widths greater than 64 bits [246]. Its goal is to detect and correct errors with good performance and small resource utilization. With this module, a single-bit error can be automatically corrected and double-bit errors can be detected. However, its use presents some drawbacks. One arises when the data width chosen by the user is not a multiple of 64, because the double-bit error signal may then flag errors that have occurred in the unused bits. In addition, the built-in ECC imposes some limitations, such as the unavailability of the Byte-Write enable, the RST[AB] pin and the “output reset value” options, and the impossibility of initialization. This last limitation is an important disadvantage for writing the instructions into the program memory. Another problem is the synchronization between the BRAM block and the user application: while processors mainly expect memories with a latency of a single clock cycle, ECC BRAM implementations need two clock cycles to read data, which implies one extra cycle of latency. To circumvent this issue, [54] presents a hardware interface called EPA (ECC Processor Adaptor), which looks ahead for the next instruction; [247] presents a similar approach. The implementation of these interfaces comes with a resource overhead.
In addition to the hardware-based ECC approaches, vendors offer hardening alternatives for the bitstream. Xilinx provides the Frame ECC logic (FRAME_ECC_VIRTEX6 primitive) for Virtex-6 and 7 Series FPGAs [248]. It makes it possible to detect single- or double-bit errors in configuration frame data by means of a 13-bit Hamming code parity value calculated from the frame data generated by BitGen. During the readback process (through the SelectMAP, JTAG or ICAP interfaces), the Frame ECC logic generates a syndrome value using all the bits in the frame (including the ECC bits). If no bit has changed from the originally programmed values, bits [12:0] of the SYNDROME word are all zeros. On the contrary, if a fault provokes a single bit-flip (including in the ECC bits), bits [11:0] of the SYNDROME word indicate its location. If two or more bits are flipped, bits [12:0] of SYNDROME are indeterminate. In this scenario, the error output of the block is asserted. In addition, after each frame is read the syndrome valid signal is asserted. Repairing flipped bits requires a user design, since the Frame ECC logic does not repair them. Xilinx provides an SEU controller IP [186] capable of repairing, and even injecting, configuration-frame faults. In any case, the design has to be able to store at least one frame, or to fetch the original data frames in order to reload them. According to [248], the simplest procedure is to read the frame through the configuration interface, store it in a BRAM, analyse it and, if required, repair it before writing it back. However, if the BRAM which stores the frame is affected by an induced fault, the entire design can be compromised.
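The read, analyse, repair and write-back procedure can be outlined as follows. This is a simplified software model under stated assumptions, not the Xilinx Frame ECC logic or the SEU controller: `read_frame` and `write_frame` are hypothetical callbacks standing in for configuration-interface access, and each frame is assumed to be laid out as an overall parity bit followed by a code word whose position-XOR syndrome is zero when the frame is intact.

```python
def check_frame(frame):
    """Return (syndrome, parity_error) for [overall parity, bit 1, bit 2, ...]."""
    syndrome = 0
    for pos in range(1, len(frame)):
        if frame[pos]:
            syndrome ^= pos
    return syndrome, sum(frame) % 2 == 1


def scrub(read_frame, write_frame, n_frames):
    """Read back every frame once; repair single-bit upsets, report the rest."""
    repaired, uncorrectable = [], []
    for addr in range(n_frames):
        frame = list(read_frame(addr))     # local copy, standing in for the BRAM buffer
        syndrome, parity_error = check_frame(frame)
        if syndrome == 0 and not parity_error:
            continue                       # frame matches its check bits
        if parity_error and syndrome < len(frame):
            frame[syndrome] ^= 1           # single-bit upset: flip it back
            write_frame(addr, frame)
            repaired.append((addr, syndrome))
        else:
            uncorrectable.append(addr)     # needs a reload from the original data
    return repaired, uncorrectable


# Hypothetical configuration memory: four identical, valid frames.
memory = [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(4)]
memory[2][5] ^= 1                          # inject a single-bit upset

def store(addr, frame):
    memory[addr] = frame

print(scrub(lambda addr: memory[addr], store, len(memory)))
```

The local copy of the frame plays the role of the BRAM buffer mentioned above, which is also why an upset in that buffer during the repair step would propagate into the configuration memory.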
Different alternatives have been proposed in the literature to deal with the frame ECC issue. In [249] an embedded IP core that can be implanted into FPGAs to detect and correct soft errors automatically was proposed. [250] also presents an SEU-Monitor System which is capable of injecting, detecting and correcting single-bit errors and of injecting and detecting double-bit errors in the FPGA configuration memory. On the other hand, in [251] a low-cost ECC was presented to detect and correct MBUs in configuration frames. This code is based on the concept of Erasure codes and uses vertical and horizontal parity bits to avoid redundant data. It also proposes the use of Mutation codes. As it states, the delay can be reduced by using Erasure or Mutation codes. This approach does not demand alterations in the FPGA design. However, the parity-bit operation is performed for all configuration frames, which increases the computation time.
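The idea of combining vertical and horizontal parity bits can be illustrated with the classic two-dimensional parity scheme. The sketch below is not the code proposed in [251], which targets MBUs; it only shows how the intersection of a failing row parity and a failing column parity locates a single flipped bit. The function names and the bit-matrix representation are assumptions made for this example.

```python
def parities(rows):
    """Horizontal (per-row) and vertical (per-column) even-parity bits."""
    row_par = [sum(r) % 2 for r in rows]
    col_par = [sum(c) % 2 for c in zip(*rows)]
    return row_par, col_par


def locate_and_fix(rows, row_par, col_par):
    """Compare stored and recomputed parities; fix a single flipped data bit in place."""
    new_row, new_col = parities(rows)
    bad_rows = [i for i, (a, b) in enumerate(zip(row_par, new_row)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(col_par, new_col)) if a != b]
    if not bad_rows and not bad_cols:
        return "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        rows[bad_rows[0]][bad_cols[0]] ^= 1   # the intersection is the faulty bit
        return "corrected"
    return "uncorrectable"


frame = [[1, 0, 1, 1],
         [0, 0, 1, 0],
         [1, 1, 0, 0]]
row_par, col_par = parities(frame)            # stored alongside the data
frame[1][2] ^= 1                              # inject a single bit-flip
print(locate_and_fix(frame, row_par, col_par), frame)
```

Since every frame has to pass through the parity computation, the cost of such a scheme grows with the number of configuration frames, in line with the computation-time remark above.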