RAS Features in the Baseline Processor - Low-cost and efficient fault detection and diagnosis s

This section lists the error protection mechanisms included in the baseline processor. Modern advanced out-of-order processors include a few simple RAS features to protect the structures that are most critical from an area and vulnerability perspective.

Our baseline processor therefore also includes simple error code protection mechanisms in several structures. Figure 4.2 shows in light green the arrays that we assume are protected by an error code scheme, and in light red the blocks that cannot be protected by existing mechanisms (components implemented mostly as combinational logic). Cache structures such as the instruction cache, the data cache, the second-level cache and the TLBs are protected by error detection or correction codes: the TLBs are protected by parity, whereas the caches are protected by ECC, which supports error correction. The second-level and LLC caches are protected by stronger SEC-DED schemes.
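As an illustration of what a SEC-DED (single error correction, double error detection) scheme provides, the following is a minimal sketch using a textbook extended Hamming (8,4) code. The text does not specify which code construction the caches use, and real cache lines protect much wider words, so the code choice, the 4-bit data width and the function names `secded_encode` and `secded_decode` are assumptions made only for this example.

```python
def secded_encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions (parity bits at 1, 2 and 4, data at 3, 5, 6 and 7).
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    for bit in code[1:]:
        code[0] ^= bit                     # overall parity over positions 1..7
    return code


def secded_decode(code):
    """Return (status, nibble); status is "ok", "corrected" or "uncorrectable"."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of the positions holding a 1
    overall = 0
    for bit in code:
        overall ^= bit                     # 0 for any valid codeword
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: assume a single error at `syndrome`
        # (syndrome 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: double error, detect only.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

A plain parity bit, as assumed here for the TLBs, would only detect the single-bit case; the extra check bits are what buy correction and double-error detection.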

Other storage structures, such as buffers, are protected by simple error detection codes. The fetch buffer is protected by parity codes that are extracted from the instruction cache. Other arrays, such as the allocation buffer or the entries in the issue queue payload RAM, are protected by explicitly generated parity bits (these arrays are wide and non-mutable). Faults are detected simply by checking the code, and non-permanent faults are recovered through the pipeline flush and restart mechanism provided by the baseline core. The register files are protected by a parity bit; the parity generators and checkers reside at the inputs of the write ports and the read ports, respectively.
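The register file arrangement described above (parity generated at the write port, checked at the read port) can be sketched behaviourally as follows. This is a software model under simplifying assumptions, not the hardware design: the class name `ParityRegisterFile`, the exception `ParityError` and the use of even parity are choices made for this example.

```python
class ParityError(Exception):
    """Raised when a stored value no longer matches its parity bit."""


def parity(value):
    """Even-parity bit of a non-negative integer (1 if the popcount is odd)."""
    return bin(value).count("1") & 1


class ParityRegisterFile:
    def __init__(self, num_regs):
        self.data = [0] * num_regs
        self.par = [0] * num_regs

    def write(self, idx, value):
        # Parity generator at the write port: the bit is stored with the data.
        self.data[idx] = value
        self.par[idx] = parity(value)

    def read(self, idx):
        # Parity checker at the read port: recompute and compare.
        value = self.data[idx]
        if parity(value) != self.par[idx]:
            raise ParityError(f"parity mismatch in register {idx}")
        return value


rf = ParityRegisterFile(num_regs=8)
rf.write(3, 0b1011)
value = rf.read(3)        # fault-free read returns the stored value
rf.data[3] ^= 1 << 2      # inject a single-bit upset in the storage cell
try:
    rf.read(3)
    detected = False
except ParityError:
    detected = True       # in the core, this would trigger a flush and restart
```

A single parity bit detects any odd number of flipped bits but cannot correct them, which is why recovery relies on the pipeline flush and restart mechanism.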

Like most processors, ours also includes a watchdog timer that monitors the hardware for signs of deadlock. Specifically, the watchdog timer monitors the ROB: if no instruction commits for a period that exceeds a predefined threshold, the watchdog timer reports that an error has occurred, the pipeline is flushed, and execution restarts from the instruction at the head of the ROB.
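The watchdog behaviour can be modelled as a counter of consecutive cycles without a commit. The sketch below is illustrative only: the class name `CommitWatchdog`, the per-cycle `tick` interface and the threshold value are assumptions, and the flush and restart are represented simply by the return value.

```python
class CommitWatchdog:
    def __init__(self, threshold):
        self.threshold = threshold   # maximum tolerated cycles without a commit
        self.idle_cycles = 0

    def tick(self, committed):
        """Advance one cycle; `committed` is the number of instructions
        retired from the ROB this cycle. Returns True when the watchdog
        fires, i.e. when the pipeline should be flushed and restarted
        from the instruction at the head of the ROB."""
        if committed > 0:
            self.idle_cycles = 0
            return False
        self.idle_cycles += 1
        if self.idle_cycles > self.threshold:
            self.idle_cycles = 0     # start counting again after the restart
            return True
        return False


wd = CommitWatchdog(threshold=3)
# Two commits, then a stall: the watchdog fires on the 4th idle cycle.
fired = [wd.tick(c) for c in (2, 1, 0, 0, 0, 0)]
```

In a real design the threshold has to exceed the longest legitimate stall (for example, a chain of long-latency memory accesses), otherwise the timer would report false deadlocks.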

Instruction control flow and allocation logic are protected by this watchdog timer and by a special checker residing in the ROB [49, 155]: the Program Counter (PC) of each instruction is checked against the following instruction's PC to ensure correct program order. Sequentially committing instructions add their length (recorded at decode time) to the retirement PC, and branches update the retirement PC with their calculated PC. Comparing a committing instruction's PC with the retirement PC detects discontinuities. Detected failure scenarios include wrong PC generation, unintended instructions appearing or disappearing in the frontend, instructions being overwritten in the frontend queues, instructions being moved forward out of order, and allocation in the wrong ROB, LSQ or issue queue entries (potentially overwriting existing ones).³
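The retirement PC check can be sketched as below. This is a behavioural model under assumptions: the names `CommittingInsn` and `RetirementPCChecker` are hypothetical, and a branch is modelled as carrying its calculated next PC (the target if taken, the fall-through address otherwise).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommittingInsn:
    pc: int
    length: int                           # recorded at decode time
    branch_next_pc: Optional[int] = None  # set only for branches


class RetirementPCChecker:
    def __init__(self, start_pc):
        self.retirement_pc = start_pc

    def commit(self, insn):
        """Return True if `insn` continues program order, False on a
        discontinuity between its PC and the retirement PC."""
        if insn.pc != self.retirement_pc:
            return False
        if insn.branch_next_pc is not None:
            self.retirement_pc = insn.branch_next_pc       # branch: calculated PC
        else:
            self.retirement_pc = insn.pc + insn.length     # sequential: add length
        return True


checker = RetirementPCChecker(start_pc=0x100)
stream = [
    CommittingInsn(pc=0x100, length=4),
    CommittingInsn(pc=0x104, length=2, branch_next_pc=0x200),  # taken branch
    CommittingInsn(pc=0x200, length=4),
    CommittingInsn(pc=0x208, length=4),  # fault: the instruction at 0x204 vanished
]
results = [checker.commit(i) for i in stream]
```

The last commit fails the check because the retirement PC expects 0x204, which corresponds to the "instructions disappearing in the frontend" scenario listed above.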

Decoder logic and PLAs (Programmable Logic Arrays) are protected using the method described in [37], due to their large area.

³ Allocating an instruction in a wrong ROB entry is detected by the PC checker. If an instruction is wrongly allocated in the issue queue or LSQ, overwriting an existing unexecuted instruction, the ROB complete bit of the overwritten instruction's entry is never set, which leads to a deadlock.

CHAPTER 5 REGISTER DATAFLOW VALIDATION

5.1 Introduction

Whereas classical error detection mechanisms based on re-execution were acceptable in high-end segments, where high area, power and/or performance penalties could be tolerated, the radical increase in raw error rates calls for fault tolerance mechanisms that can be deployed in commodity segments. New requirements include negligible area, power and slowdown overheads, together with the high reliability levels of traditional defect tolerance techniques.

On another axis, whereas critical SRAM structures (such as caches and register files) are already protected with parity or error correction codes in most commercial processors, limited research effort has been devoted to designing cost-effective error detection strategies for the surrounding control logic of high-performance microprocessors. This control logic plays a critical role in the correct operation of the whole microprocessor, and it represents a significant portion of the die area and of the testing and validation costs. In this chapter we propose a low-cost online end-to-end protection mechanism that protects the control logic involved in the register dataflow. This includes the rename tables, wake-up logic, select logic, input multiplexors, operand read and writeback, the register free list, register release, register allocation, and the replay logic. Our proposal is based on microarchitectural invariants (applicable to any processor design) and allows detecting multiple sources of failures, including design bugs.

End-to-end protection is based on generating a protection code at the source where vulnerable data is generated, sending the vulnerable data with the protection code along the path, and checking for errors only at the end of the path, where the data is consumed. Faults caused by any logic gates, storage elements, or buses along the path are detected at the consumption site. Instead of individually checking specific low-level microarchitectural blocks, our solution verifies high-level functionalities whose implementation is scattered across many components.
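The end-to-end principle can be illustrated with a small software model in which a code is attached at the producer, carried untouched through intermediate stages, and checked only at the consumer. This is a generic sketch, not the signature scheme developed in this chapter: the XOR-fold check code, its 4-bit width and all function names are assumptions chosen for brevity.

```python
class EndToEndError(Exception):
    """Raised at the consumer when data and code no longer agree."""


def make_code(value, width=4):
    """Fold `value` into a `width`-bit code by XOR-ing its width-bit chunks."""
    mask = (1 << width) - 1
    code = 0
    while value:
        code ^= value & mask
        value >>= width
    return code


def produce(value):
    # Source: the code is generated once, where the data is generated.
    return (value, make_code(value))


def healthy_stage(packet):
    # A buffer, bus or multiplexor that forwards the packet unchanged.
    return packet


def faulty_stage(packet, flip_bit):
    # Same stage with a fault that flips one data bit; no local check exists.
    value, code = packet
    return (value ^ (1 << flip_bit), code)


def consume(packet):
    # Consumer: the only place where the code is checked.
    value, code = packet
    if make_code(value) != code:
        raise EndToEndError("data corrupted somewhere along the path")
    return value


packet = produce(0b1011_0110_0001)
clean = consume(healthy_stage(healthy_stage(packet)))
try:
    consume(healthy_stage(faulty_stage(packet, flip_bit=5)))
    caught = False
except EndToEndError:
    caught = True
```

The consumer cannot tell which stage failed, only that something on the path did; this is the trade-off for not placing a checker at every block.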

The centerpiece of the proposed solution is a signature-based protection mechanism. The implementation cost and the coverage provided by the protection framework depend primarily on the signature width and secondarily on how signatures are generated. We propose and thoroughly assess several ways of generating and handling signatures. For each policy, we discuss its error coverage and its cost in area and power.

In this chapter, we also study how to extend fault coverage to errors in register values. To achieve this, we first exploit the potential of residue codes to build an end-to-end self-checking microarchitecture that computes with encoded operands. Then, we describe how this end-to-end residue checking system can be smoothly embedded into our register dataflow end-to-end protection scheme in order to amortize costs. The net result is that functional units, load-store queue data and addresses, register file storage and data buses are also protected at a low cost.
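Computing with residue-encoded operands relies on the fact that residues are preserved by addition and multiplication: (a + b) mod m = ((a mod m) + (b mod m)) mod m, and likewise for products. The sketch below illustrates this with modulus 3 on unbounded Python integers. The modulus, the function names and the absence of fixed-width overflow handling are assumptions of this example, not details taken from the chapter.

```python
MOD = 3  # assumed check modulus; 2**k is never divisible by 3


def res_encode(x):
    """Pair a value with its residue; done once where the value is produced."""
    return (x, x % MOD)


def res_add(a, b):
    (va, ra), (vb, rb) = a, b
    # The residue is computed from the input residues, not from the result.
    return (va + vb, (ra + rb) % MOD)


def res_mul(a, b):
    (va, ra), (vb, rb) = a, b
    return (va * vb, (ra * rb) % MOD)


def res_check(operand):
    """Checked where the value is consumed: does the data match its residue?"""
    value, residue = operand
    return value % MOD == residue


x, y = res_encode(1234), res_encode(5678)
total = res_add(x, y)
product = res_mul(x, y)
good = res_check(total) and res_check(product)

# Inject a single-bit fault into the data part of the sum.
value, residue = total
corrupted = (value ^ (1 << 7), residue)
fault_detected = not res_check(corrupted)
```

Because a single bit flip changes the value by plus or minus a power of two, and no power of two is a multiple of 3, any single-bit error in the data part is detected by a mod-3 residue.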

The rest of the chapter is structured as follows: Section 5.2 reviews how faults in the dataflow may manifest. Section 5.3 reviews our framework for a dataflow self-test mechanism. Section 5.4 overviews an end-to-end residue coding scheme and explains how to integrate it with our proposal. In Section 5.5 we propose and assess different policies for generating and handling the signatures. Section 5.6 discusses how the different signature generation policies impact the overall coverage and processor overheads. Section 5.7 reviews some relevant related work. We summarize our main conclusions in Section 5.8.

In document Low-cost and efficient fault detection and diagnosis schemes for modern cores (Page 81-85)