Silent data corruption (SDC) rate

6.2 Reliability models

6.2.2 Silent data corruption (SDC) rate

Errors are rare events, so simulating them is not straightforward. If we expect an error every $\approx 10^{20}$ cycles, we cannot simulate for $\approx 10^{20}$ cycles until the error is observed: the simulation runtime would be prohibitive. Instead, we inject errors on purpose to evaluate system behavior when an error happens, and we weight the observed outcomes by the probability of the fault we forced in order to obtain the SDC rate. Our approach is outlined below.

Each application has the following characteristics, which can be obtained experimentally:

• $d_0$, the depth, or the number of cycles needed between the first input showing up and the last output leaving. The depth can be different for CMP and LWC, so we use $d_{0,\mathrm{LWC}}$ and $d_{0,\mathrm{CMP}}$. For example, a streaming mergesort that sorts $N$ elements and processes $P$ inputs per cycle will have a depth $d_0 = N/P$: it takes $2N/P$ cycles between the first $P$ inputs entering the sorter and the last $P$ outputs leaving the sorter, but during a given cycle half the nodes are being used by another set of data (so we divide by 2).

• $N_t$, the total number of physical locations where a bit flip can occur. Again, it can be different for CMP and LWC, so we use $N_{t,\mathrm{LWC}}$ and $N_{t,\mathrm{CMP}}$. Since the check is lightweight, we will typically have $N_{t,\mathrm{LWC}} < N_{t,\mathrm{CMP}}$. $N_t$ includes many different types of physical locations (wire segments, registers, LUTs, I/O). To simplify our analysis, we assume that every resource type has the same error rate. This will not be the case in practice, but we will perform a sensitivity analysis to ensure we are properly capturing the worst case (Sec. 6.3).

• $p_{\mathrm{bf}}(V)$, the bit flip rate at voltage $V$, or the probability that the logical level of a resource is misinterpreted (0 instead of 1, or 1 instead of 0) in one clock cycle. This alone does not necessarily lead to an error, since the fault could be masked by subsequent logic. For example, if one input to an AND gate is misinterpreted as 1 instead of 0 and the other input is correctly interpreted as 0, the output is still correctly 0, even though the first input experienced a bit flip.

• $p_{\mathrm{prop}}(i)$, the probability that the output is corrupted given $i$ simultaneous bit flips. $p_{\mathrm{prop}}(i)$ captures the logical masking effects explained above and is obtained through simulation.
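To make the logical-masking idea concrete, the following minimal sketch enumerates every way of flipping $i$ inputs of a single AND gate and reports the fraction of cases in which the output changes. The one-gate circuit, the uniform input distribution, and the names `and_gate` and `pprop_toy` are assumptions made purely for illustration; in this work $p_{\mathrm{prop}}(i)$ is obtained by simulating the actual designs.

```python
from itertools import combinations, product

def and_gate(bits):
    """Toy circuit (assumption): a single 2-input AND gate."""
    return bits[0] & bits[1]

def pprop_toy(circuit, n_inputs, i):
    """Fraction of (input vector, flip set) pairs for which flipping
    exactly i input bits changes the circuit output.

    Assumes uniformly distributed inputs and equally likely flip locations.
    """
    corrupted = 0
    total = 0
    for bits in product((0, 1), repeat=n_inputs):
        golden = circuit(bits)                      # fault-free output
        for flipped in combinations(range(n_inputs), i):
            faulty = list(bits)
            for k in flipped:
                faulty[k] ^= 1                      # inject the bit flip
            total += 1
            if circuit(faulty) != golden:
                corrupted += 1
    return corrupted / total

for i in (1, 2):
    print(i, pprop_toy(and_gate, 2, i))
```

For this toy gate, half of the single bit flips are masked: a flip on one input only reaches the output when the other input is 1.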

$p_{\mathrm{bf}}(V)$ depends exponentially on voltage and is given by the following equation (derived in Sec. 6.2.4):

The probability of $i$ simultaneous bit flips while the application is executing its task, $p_{\mathrm{sbf}}(i)$, depends on the size of the design ($N_t$) and on how many cycles it takes to complete its task ($d_0$). Throughout the whole process, there are $d_0 N_t$ nodes that can get upset, leading to:

$$p_{\mathrm{sbf}}(i, V) = \binom{d_0 N_t}{i}\,\bigl(p_{\mathrm{bf}}(V)\bigr)^{i}\,\bigl(1 - p_{\mathrm{bf}}(V)\bigr)^{d_0 N_t - i}. \tag{6.11}$$

Note that $p_{\mathrm{sbf}}(i, V)$ gives simultaneous bit flips while the application is executing its task, not just in one clock cycle, as opposed to $p_{\mathrm{bf}}$. So we have $p_{\mathrm{bf}}(V) \le p_{\mathrm{sbf}}(1, V)$, since a $p_{\mathrm{sbf}}(1, V)$ event can happen on any of the $N_t$ nodes during any of the $d_0$ cycles.
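As a minimal sketch of how the binomial expression in Eq. 6.11 can be evaluated numerically, the function below works in log space so that a large $d_0 N_t$ does not overflow. It takes the bit flip rate directly instead of the voltage, and the numeric values of `d0`, `n_t` and `p_bf` are hypothetical, chosen only for illustration.

```python
import math

def psbf(i, p_bf, d0, n_t):
    """Eq. 6.11: probability of exactly i bit flips among d0 * n_t
    (cycle, location) opportunities, each flipping with probability p_bf.
    """
    n = d0 * n_t
    if i < 0 or i > n:
        return 0.0
    if p_bf == 0.0:
        return 1.0 if i == 0 else 0.0
    if p_bf == 1.0:
        return 1.0 if i == n else 0.0
    # log of the binomial coefficient C(n, i) via the log-gamma function
    log_comb = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
    log_p = log_comb + i * math.log(p_bf) + (n - i) * math.log1p(-p_bf)
    return math.exp(log_p)

# Hypothetical values, not taken from the experiments.
d0, n_t, p_bf = 256, 10_000, 1e-12
for i in range(4):
    print(i, psbf(i, p_bf, d0, n_t))
```

With these illustrative numbers, `psbf(1, ...)` is close to `d0 * n_t * p_bf`, which is much larger than `p_bf` itself, in line with the inequality above.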

We can now simulate these cases separately and combine the results to compute the SDC rate (probability of SDC per cycle) for the “base” design:

$$p_{\mathrm{sdc,base}} = \sum_{i=1}^{d_0 N_t} p_{\mathrm{sbf,CMP}}(i, V_{\mathrm{nom}})\,p_{\mathrm{prop,CMP}}(i). \tag{6.12}$$

i.e., for each possible number of simultaneous bit flips $1 \le i \le d_0 N_t$, we multiply the probability of $i$ simultaneous bit flips by the probability that $i$ simultaneous bit flips cause an error, and we sum these products over $i$.
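The weighted sum for the base design can be sketched as follows. This is an illustration under stated assumptions, not the evaluation flow used in this work: the sum is truncated to the flip counts supplied in `pprop`, the $p_{\mathrm{prop}}$ values are made up, and the bit flip rate is passed in directly rather than derived from $V_{\mathrm{nom}}$.

```python
import math

def psbf(i, p_bf, n):
    """Binomial probability of exactly i flips among n = d0 * Nt
    opportunities, evaluated in log space (requires 0 < p_bf < 1)."""
    log_comb = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
    return math.exp(log_comb + i * math.log(p_bf) + (n - i) * math.log1p(-p_bf))

def psdc_base(p_bf, d0, n_t, pprop):
    """Truncated weighted sum for the base design.

    pprop maps i -> probability that i simultaneous bit flips corrupt
    the output. Only the flip counts present in pprop are summed.
    """
    n = d0 * n_t
    return sum(psbf(i, p_bf, n) * p for i, p in pprop.items() if 1 <= i <= n)

# Hypothetical propagation probabilities, NOT measured data.
pprop_cmp = {1: 0.30, 2: 0.50, 3: 0.65}
print(psdc_base(p_bf=1e-12, d0=256, n_t=10_000, pprop=pprop_cmp))
```

With such a low bit flip rate, the single-flip term dominates the result and the higher-order terms contribute very little.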

For the “protected” design we follow a similar approach, but since CMP and LWC potentially use a different voltage, we cannot do it as directly. The probability of simultaneously seeing i bit flips in CMP and j bit flips in LWC is:

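$$p_{\mathrm{sbf2}}(i, j, V_{\mathrm{CMP}}, V_{\mathrm{LWC}}) = p_{\mathrm{sbf2}}(i, j) = p_{\mathrm{sbf,CMP}}(i, V_{\mathrm{CMP}}) \cdot p_{\mathrm{sbf,LWC}}(j, V_{\mathrm{LWC}}). \tag{6.13}$$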

Furthermore, for the protected design, we need to capture how well the LWC catches an error at the CMP output: when an error occurs in CMP and LWC catches it, we do not consider it to be an SDC event. Instead of the two possible outcomes for the “base” design captured by $p_{\mathrm{prop}}(i)$, we now have four possible outcomes, as shown in Tab. 6.1.

Table 6.1: Possible outcomes of the protected design (CMP+LWC)

Notation   CMP        LWC        Outcome
p00        no error   no catch   expected behavior → proceed
p01        no error   catch      false positive → recompute
p10        error      no catch   SDC → proceed
p11        error      catch      successful catch → recompute

For $i$ simultaneous bit flips in CMP and $j$ simultaneous bit flips in LWC, we have:

• $p_{00}(i, j)$ is the probability that there is no error in CMP and that LWC does not catch an error. This is the common case.

• $p_{01}(i, j)$ is the probability that there is no error in CMP and that LWC thinks there was an error. This is a false positive.

• $p_{10}(i, j)$ is the probability that there is an error in CMP and that LWC does not catch an error. This is an SDC event.

• $p_{11}(i, j)$ is the probability that there is an error in CMP and that LWC successfully catches the error.

Of course, we have:

$$p_{00}(i, j) + p_{01}(i, j) + p_{10}(i, j) + p_{11}(i, j) = 1. \tag{6.14}$$
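One way to estimate these four probabilities for a fixed $(i, j)$ is to tally the outcomes of repeated fault-injection runs. The sketch below assumes each run is summarized as a pair of booleans `(cmp_error, lwc_catch)`; this record format, the function name `outcome_probabilities`, and the sample data are hypothetical.

```python
from collections import Counter

def outcome_probabilities(trials):
    """Estimate p00, p01, p10, p11 for one (i, j) injection setting.

    `trials` is an iterable of (cmp_error, lwc_catch) booleans, one pair
    per fault-injection run. In each key, the first digit encodes the CMP
    error and the second digit encodes the LWC catch.
    """
    counts = Counter((bool(e), bool(c)) for e, c in trials)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no trials")
    return {
        f"p{int(e)}{int(c)}": counts[(e, c)] / total
        for e in (False, True)
        for c in (False, True)
    }

# Hypothetical run log: 10 injections at some fixed (i, j).
trials = [(False, False)] * 6 + [(False, True)] + [(True, True)] * 3
probs = outcome_probabilities(trials)
print(probs)
```

Because every run falls into exactly one of the four categories, the estimates sum to 1 by construction (up to floating-point rounding).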

We can now write an expression similar to Eq. 6.12 for the four outcomes of Tab. 6.1. The SDC rate of the protected design is given by:

$$p_{\mathrm{sdc,protected}} = \sum_{i=0}^{d_{0,\mathrm{CMP}} N_{t,\mathrm{CMP}}} \; \sum_{j=0}^{d_{0,\mathrm{LWC}} N_{t,\mathrm{LWC}}} p_{\mathrm{sbf2}}(i, j)\,p_{10}(i, j). \tag{6.15}$$

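The probability of a false positive event is:

$$p_{\mathrm{false+}} = \sum_{i=0}^{d_{0,\mathrm{CMP}} N_{t,\mathrm{CMP}}} \; \sum_{j=0}^{d_{0,\mathrm{LWC}} N_{t,\mathrm{LWC}}} p_{\mathrm{sbf2}}(i, j)\,p_{01}(i, j). \tag{6.16}$$

The probability of an error correctly caught by the LWC is:

$$p_{\mathrm{good\,catch}} = \sum_{i=0}^{d_{0,\mathrm{CMP}} N_{t,\mathrm{CMP}}} \; \sum_{j=0}^{d_{0,\mathrm{LWC}} N_{t,\mathrm{LWC}}} p_{\mathrm{sbf2}}(i, j)\,p_{11}(i, j). \tag{6.17}$$

The probability of no error and normal continuation is:

$$p_{\mathrm{expected}} = \sum_{i=0}^{d_{0,\mathrm{CMP}} N_{t,\mathrm{CMP}}} \; \sum_{j=0}^{d_{0,\mathrm{LWC}} N_{t,\mathrm{LWC}}} p_{\mathrm{sbf2}}(i, j)\,p_{00}(i, j). \tag{6.18}$$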
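The four double sums share the same structure, so they can be accumulated in a single pass. The sketch below is an illustration under stated assumptions: the sums are truncated at `max_flips` flips per block, the bit flip rates of CMP and LWC are passed in directly instead of being derived from their voltages, and `toy_outcomes` is a made-up outcome model rather than measured data.

```python
import math

def psbf(i, p_bf, n):
    """Binomial probability of exactly i flips among n = d0 * Nt
    opportunities, evaluated in log space (requires 0 < p_bf < 1)."""
    log_comb = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
    return math.exp(log_comb + i * math.log(p_bf) + (n - i) * math.log1p(-p_bf))

def protected_rates(p_bf_cmp, n_cmp, p_bf_lwc, n_lwc, outcomes, max_flips=3):
    """Truncated double sums for the four outcomes of the protected design.

    outcomes(i, j) returns (p00, p01, p10, p11) for i flips in CMP and
    j flips in LWC. Only pairs with i, j <= max_flips are summed.
    """
    totals = [0.0, 0.0, 0.0, 0.0]
    for i in range(max_flips + 1):
        for j in range(max_flips + 1):
            # Joint probability of i flips in CMP and j flips in LWC.
            w = psbf(i, p_bf_cmp, n_cmp) * psbf(j, p_bf_lwc, n_lwc)
            for k, p in enumerate(outcomes(i, j)):
                totals[k] += w * p
    names = ("expected", "false_positive", "sdc", "good_catch")
    return dict(zip(names, totals))

def toy_outcomes(i, j):
    """Hypothetical outcome model, NOT measured data."""
    if i == 0 and j == 0:
        return (1.0, 0.0, 0.0, 0.0)
    if i == 0:
        return (0.8, 0.2, 0.0, 0.0)  # LWC faults only: some false positives
    if i == 1 and j == 0:
        return (0.7, 0.0, 0.0, 0.3)  # a single CMP error never escapes
    return (0.5, 0.1, 0.1, 0.3)

rates = protected_rates(1e-9, 256 * 10_000, 1e-9, 256 * 2_000, toy_outcomes)
print(rates)
```

Since the four outcome probabilities sum to 1 for every $(i, j)$, the four totals sum to the probability mass covered by the truncation, which is very close to 1 when bit flips are rare.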

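Finally, the probability of recomputation is:

$$p_{\mathrm{rcmp}} = p_{\mathrm{false+}} + p_{\mathrm{good\,catch}} = \sum_{i=0}^{d_{0,\mathrm{CMP}} N_{t,\mathrm{CMP}}} \; \sum_{j=0}^{d_{0,\mathrm{LWC}} N_{t,\mathrm{LWC}}} p_{\mathrm{sbf2}}(i, j)\,\bigl(p_{01}(i, j) + p_{11}(i, j)\bigr). \tag{6.19}$$

As we will see in Sec. 6.3, it will be an important design characteristic of our LWCs to have

$$p_{10}(1, 0) = 0, \tag{6.20}$$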

i.e., the LWC catches any single error in CMP. This will allow us to operate at exponentially higher bit flip rates and thus drop the voltage and energy further (more in Sec. 6.3.2.3). This is because when Eq. 6.20 is satisfied, at least two simultaneous errors are needed to cause an SDC, a much less likely event than a single error. Indeed, since we will keep $p_{\mathrm{bf}}$ relatively low, the probability of $n+1$ simultaneous bit flips is exponentially smaller than the probability of $n$ simultaneous bit flips, so we do not need to simulate up to $d_0 N_t$ simultaneous bit flips. In the experiments that follow, we simulate up to 3 simultaneous bit flips, which is sufficient to observe all the interesting results. We do not need to drop the voltage enough to reach the region where $p_{\mathrm{bf}}$ is so high that it would require us to simulate more simultaneous bit flips.
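The claim that each additional simultaneous bit flip is much less likely can be checked directly from the binomial form of Eq. 6.11. As a short worked step (a sketch, valid under the stated approximations), the ratio of consecutive terms is:

```latex
\frac{p_{\mathrm{sbf}}(n+1, V)}{p_{\mathrm{sbf}}(n, V)}
  = \frac{\binom{d_0 N_t}{n+1}}{\binom{d_0 N_t}{n}}
    \cdot \frac{p_{\mathrm{bf}}(V)}{1 - p_{\mathrm{bf}}(V)}
  = \frac{d_0 N_t - n}{n + 1} \cdot \frac{p_{\mathrm{bf}}(V)}{1 - p_{\mathrm{bf}}(V)}
  \approx \frac{d_0 N_t \, p_{\mathrm{bf}}(V)}{n + 1}
  \quad \text{for } n \ll d_0 N_t,\; p_{\mathrm{bf}}(V) \ll 1 .
```

Each extra flip therefore costs roughly a factor of $d_0 N_t\, p_{\mathrm{bf}}(V)/(n+1)$, so truncating the sums after a few flips is reasonable as long as $d_0 N_t\, p_{\mathrm{bf}}(V) \ll 1$.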
