Synchronization of Repaired Modules - Dynamic Partial Reconfiguration to Fix Permanent Faults

3.4 Dynamic Partial Reconfiguration to Fix Permanent Faults

3.4.3 Synchronization of Repaired Modules

In most cases, reconfiguration alone does not suffice to recover faulty systems hardened by combining hardware redundancy with partial reconfiguration. If the repaired instance holds some kind of internal state, it must be synchronized with the healthy instances after the reconfiguration.

The straightforward way to synchronize a redundancy scheme is to reset the entire system and restart execution in all replicas from the same point. However, this approach yields poor availability. Higher availability can be obtained by combining checkpointing and rollback techniques, which were discussed at length in Section 3.2.1.

A common synchronization technique for redundancy-based schemes is the so-called roll-forward [60, 68, 89, 210, 213]. This technique consists of copying the correct state from the fault-free replicas and downloading it to the repaired unit. It avoids any re-computation and, as stated in [263], it incurs a lower performance overhead and provides higher reliability than restarting and re-executing the entire process. Its main drawback is that, if the error-free modules continue operating, their states keep changing during the copy, which makes a plain state copy unsuitable in this scenario. A solution for this issue is to pause the system during the synchronization, which comes with a significant performance reduction. The approach presented in [279] proposes a modified roll-forward approach that also includes checkpointing. After detecting an error, the three replicas stop operation while the inputs are buffered. After repairing the faulty module, the buffered data are fed to the three instances. This approach does not add any computation delay. Nevertheless, it increases the resource overhead because of the required buffers, and the synchronization reduces the system’s availability. In addition, it is less capable of recovering from multiple faults than the voting scheme.
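The pause-and-buffer variant of roll-forward can be illustrated with a minimal Python sketch. It is not the implementation of [279]: the `Replica` class, its single-accumulator state, the fixed repair time and all function names are assumptions made only for illustration, and the fault is assumed to be detected in the same cycle in which it is injected.

```python
from collections import Counter, deque

class Replica:
    """Hypothetical replica whose whole internal state is one accumulator."""
    def __init__(self):
        self.state = 0

    def step(self, value):
        self.state += value
        return self.state

def majority(values):
    """Return the value shared by at least two of the three replicas."""
    value, count = Counter(values).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica disagrees")
    return value

def run_with_buffered_roll_forward(inputs, fault_cycle, repair_cycles=2):
    replicas = [Replica() for _ in range(3)]
    pending = deque()        # inputs buffered while the system is paused
    outputs = []
    paused_until = -1
    for cycle, value in enumerate(inputs):
        if cycle == fault_cycle:
            replicas[0].state = -999               # injected upset in replica 0
            paused_until = cycle + repair_cycles   # assumed repair duration
        if cycle < paused_until:
            pending.append(value)                  # paused: only buffer the input
            continue
        if cycle == paused_until:
            # Repair finished: copy the voted (healthy) state to the repaired replica.
            replicas[0].state = majority([r.state for r in replicas])
        pending.append(value)
        while pending:                             # replay buffered inputs in order
            v = pending.popleft()
            outputs.append(majority([r.step(v) for r in replicas]))
    return outputs, [r.state for r in replicas]
```

In this sketch no output is lost, but none is produced while the system is paused, which mirrors the availability penalty discussed above; the `pending` queue stands in for the additional buffer resources.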

In [280], the ScTMR technique was introduced. This technique proposes a roll-forward approach that reuses the scan chain implemented in the processors for testability purposes to recover the system’s fault-free state. This avoids any re-computation and meets the specifications for real-time systems while adding a low resource overhead. Although this approach significantly decreases susceptibility, it is unable to recover the system from simultaneous errors in two modules or from masked errors in a single faulty module. In [87], an updated version that addresses these shortcomings of ScTMR was presented. The proposed technique, named scan chain-based multiple error recovery TMR (SMERTMR), is a roll-forward technique for TMR-based designs that can recover from multiple masked errors and from errors in two faulty modules. This technique is only applicable to systems where the replicas are always synchronized, and it introduces a performance penalty.

The work presented in [10] proposes a present-input and healthy-state based synchronization method for TMR schemes called PIHS3TMR, which is depicted in Figure 3.16 (obtained from [10]). This real-time approach avoids the utilization of checkpointing and provides availability during the synchronization process without stopping the operation. PIHS3TMR is based on modifying the FSM by introducing a healthy present state and a synchronization control signal. This approach is an interesting synchronization alternative. However, its implementation for more complex designs like soft-core processors can be impractical.
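One possible reading of the idea of a healthy present state combined with a synchronization control signal is sketched below in Python. This is an assumption-laden illustration, not the circuit of [10]: the transition function `next_state`, the class `SyncFsm` and the way the healthy state is obtained (a majority vote over the three present states) are all hypothetical.

```python
from collections import Counter

def next_state(state, inp):
    """Hypothetical transition function of an 8-state FSM with a 1-bit input."""
    return (state + inp + 1) % 8

class SyncFsm:
    def __init__(self):
        self.state = 0

    def clock(self, inp, healthy_state, sync):
        # With sync asserted, the transition is computed from the healthy
        # present state instead of the replica's own (possibly stale) state.
        present = healthy_state if sync else self.state
        self.state = next_state(present, inp)
        return self.state

def simulate(inputs, repaired_index, repaired_at):
    fsms = [SyncFsm() for _ in range(3)]
    trace = []
    for cycle, inp in enumerate(inputs):
        if cycle == repaired_at:
            fsms[repaired_index].state = 0   # reconfiguration leaves the reset state
        # Assumption: the healthy present state is the majority of the three states.
        healthy = Counter(f.state for f in fsms).most_common(1)[0][0]
        for i, fsm in enumerate(fsms):
            sync = (i == repaired_index and cycle == repaired_at)
            fsm.clock(inp, healthy, sync)
        trace.append([f.state for f in fsms])
    return trace
```

In this sketch the repaired replica rejoins the others in a single clock cycle and the two healthy replicas never stop. The cost is that every state register needs an extra multiplexed input, which hints at why the method scales poorly to designs with a large state, such as soft-core processors.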

The so-called known-blocking method was presented in [86]. This approach increases the reliability of soft-core processors implemented in SRAM FPGAs by utilizing TMR in combination with DPR. Its key feature is that it prevents the system from entering blocking situations. To this end, the two processors that are still running properly are led to a known safe loop (all outputs reset to a low logic level), so that the system continues running in a safe mode. Although this approach makes it possible to avoid stopping the system, the processors are not available for the target application. Hence, the application cannot continue working during the safe loop.

Figure 3.16: Block diagram of the PIHS3TMR.

Another synchronization method, valid for small FSMs, was proposed in [281], introducing the notion of state prediction. It assumes that each FSM has at least one state to which the machine always returns after a finite amount of time. Therefore, by setting the FSM of a reconfigured module to this state, it is possible to wait for the other two instances to reach this point during their normal operation, and thereafter continue operating seamlessly with all three instances. The work presented in [282] proposes a similar approach for the synchronization of a recovered module in a TMR scheme. It suggests performing the synchronization while the other two modules keep running, by waiting for a predicted future state in which the three modules converge. These approaches are useful for simple designs, such as FSMs. Nevertheless, they are not advisable alternatives for complex architectures where future states cannot be predicted.
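State prediction can be illustrated with a minimal Python sketch. The four-state cyclic FSM, the choice of `IDLE` as the recurring state and the function names are hypothetical; the sketch only shows the principle of holding the repaired module in the predicted state until the healthy modules reach it.

```python
# Hypothetical cyclic FSM: IDLE -> LOAD -> EXEC -> STORE -> IDLE -> ...
ORDER = ["IDLE", "LOAD", "EXEC", "STORE"]
RECURRING = "IDLE"   # state the machine is assumed to always return to

def advance(state):
    """Normal operation of a healthy module: move to the next state."""
    return ORDER[(ORDER.index(state) + 1) % len(ORDER)]

def resynchronize(healthy_state, max_cycles=16):
    """Hold the repaired module in RECURRING until the healthy modules
    reach it, and return the number of clock cycles spent waiting."""
    repaired_state = RECURRING
    for waited in range(max_cycles):
        if healthy_state == repaired_state:
            return waited            # all three modules are aligned again
        healthy_state = advance(healthy_state)
    raise RuntimeError("healthy modules never reached the predicted state")
```

For example, if the healthy modules are in `LOAD` when the reconfiguration finishes, `resynchronize("LOAD")` reports a wait of three cycles. The waiting time is bounded only because the recurring state is known in advance, which is precisely what cannot be guaranteed for complex architectures.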

Synchronization has also been considered in [59], where a fault-tolerant MicroBlaze architecture using TMR and DPR was presented. In that work, three MicroBlaze processors sharing peripherals and memory are implemented in partially reconfigurable regions. The peripherals and the shared memory are protected by TMR and ECC, respectively. Sharing one memory between the three processors reduces synchronization to a process of reading from and writing to the memory. Whenever the processors write data to the memory, a voting process is started. It masks wrong data from the newly reconfigured instance by storing the correct values sent from the two remaining functional instances. In a subsequent read cycle, the three processors can read the synchronized value back to their memories. While this synchronization approach is suitable for MicroBlaze processors, it is not applicable to all processor architectures. For the MicroBlaze processor it is possible to access all registers of relevance for synchronization, such as the stack pointer, the status register and the program counter. In this manner, a synchronization by reading from and writing to the shared memory becomes feasible. On the contrary, many other popular processor architectures (e.g. PicoBlaze or PIC) do not provide read access to all registers representing the state of the processor.
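The write-vote-read cycle described for the shared memory can be sketched in Python as follows. This is only an illustration under stated assumptions, not the design of [59]: the `VotedMemory` class, the register names in `REGISTERS` and the address layout are hypothetical, and the processor state is reduced to three registers.

```python
from collections import Counter

REGISTERS = ("pc", "sp", "status")   # assumed to be readable by software

class VotedMemory:
    """Hypothetical shared memory that votes on every write."""
    def __init__(self):
        self.cells = {}

    def write(self, address, values):
        # Store the value sent by at least two of the three processors.
        value, count = Counter(values).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no two processors agree on the written value")
        self.cells[address] = value

    def read(self, address):
        return self.cells[address]

def synchronize(processors, memory, base_address=0):
    """Write every register through the voter, then read it back everywhere."""
    for offset, name in enumerate(REGISTERS):
        memory.write(base_address + offset, [p[name] for p in processors])
    for p in processors:
        for offset, name in enumerate(REGISTERS):
            p[name] = memory.read(base_address + offset)
```

The sketch makes the limitation explicit: `synchronize` relies on every state-holding register being listed in `REGISTERS`, that is, on the processor exposing its full state for reading and writing.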

Another related contribution which uses rollback is the work in [60], where the synchronization between two MicroBlaze processors operating in DMR is addressed. After one of those processors is partially reconfigured, a technique similar to state prediction is used. Once the faulty processor has been identified, the roll-forward is executed to set the processor to a state that the other MicroBlaze will reach in the future. Since the state is assumed to consist only of the program counter, a synchronization similar to the one in [59] is required after the roll-forward to update the register contents. Hence, this method presents the same drawbacks as those previously identified for the shared memory approach in [59].

In [50], an approach based on the use of the TOPPERS/JSP open-source RTOS kernel was introduced. It utilizes three MicroBlaze processors in a TMR scheme. After detecting a fault and reconfiguring the faulty module, an interrupt in the RTOS triggers the synchronization process. As the work states, this approach requires a large area and decreases the maximum operating frequency of the design.

Another synchronization approach, which uses a down-counter to determine when the newly reconfigured module has re-established its state, was proposed in [283]. To carry out this approach, the countdown value is set to the latency, in clock cycles, of the longest path through the component. In this way, the outputs of the reconfigured block are ignored until the resynchronization is complete. The application scope of this approach is very limited, since applications need to present a cyclic behaviour in order to return to a previous state in which all three modules can be synchronized.
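The down-counter mechanism can be illustrated with a minimal Python sketch. It is not the design of [283]: the component is assumed to be a purely feed-forward delay line (class `DelayLine`), so that its state is fully rebuilt from the inputs after as many cycles as its depth, and all names are hypothetical.

```python
from collections import deque

class DelayLine:
    """Hypothetical feed-forward component: a shift register of fixed depth."""
    def __init__(self, depth):
        self.regs = deque([0] * depth, maxlen=depth)

    def clock(self, value):
        out = self.regs[0]        # oldest stored value leaves the pipeline
        self.regs.append(value)   # maxlen drops the oldest entry automatically
        return out

def simulate(inputs, depth, reconfigured_at):
    """Return, per cycle, (third module trusted?, third output matches module 0?)."""
    modules = [DelayLine(depth) for _ in range(3)]
    countdown = 0
    log = []
    for cycle, value in enumerate(inputs):
        if cycle == reconfigured_at:
            modules[2] = DelayLine(depth)   # fresh configuration, empty registers
            countdown = depth               # latency of the longest path, in cycles
        outs = [m.clock(value) for m in modules]
        trusted = countdown == 0            # outputs are ignored while counting down
        if countdown > 0:
            countdown -= 1
        log.append((trusted, outs[2] == outs[0]))
    return log
```

With `depth=3` and a reconfiguration at cycle 4, the third module is ignored for cycles 4 to 6 and trusted again from cycle 7, when its registers hold the same values as those of the other two modules. A component with feedback would not converge in this way, which reflects the limited application scope noted above.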

In document Contributions to the fault tolerance of soft-core processors implemented in SRAM-based FPGA Systems. (Page 110-113)