Fault Injection Analysis - LnCm fault model : complexity and validation

3.4 Fault Injection Analysis

This thesis uses the LLVM Fault Injection Tool (LLFI) [156] to introduce faults into target programs. LLFI is a LLVM-based fault injection tool that works at the LLVM [84] compiler’s intermediate representation (IR) level.

3.4.1 LLVM

Low Level Virtual Machine (LLVM) [84] is a compiler infrastructure designed as a set of reusable libraries with well-defined interfaces for program analysis and optimisation. LLVM consists of (i) a front-end to translate program code written in a high-level language such as C/C++ to an intermediate representation and (ii) a backend to translate the intermediate representation (IR) into machine code for specific platforms. The IR is a low-level programming language similar to assembly. The IR is a strongly typed RISC instruction set which abstracts away details of the target. It can be transformed by multiple optimisation passes before being converted to the machine code by the backend. The LLVM intermediate representation is a typed language in which source-level constructs can be easily represented. It preserves the variable and function names, making source mapping feasible. Further, LLVM has extensive support for program analysis and transformations which makes it easier to study the effect of fault injection at a higher level than assembly language.

3.4.2 LLVM Fault Injection (LLFI) Tool

On the account that LLFI target programs at the IR code, it allows fault- injections to be performed at specific program points and into specific instructions. The effect can then be easily tracked back to the source code. LLFI supports various fault injection customisations, and enables tracing the propa- gation of the resulting error among instructions in the program [156].

3.4. FAULT INJECTION ANALYSIS 35 PyYaml configuration file Target system source code IR

bytecode Run LLVM _compiler

Run LLFI instrument

Fault injection

executables executablesProfiling

Run LLFI profile Dynamic count of injection location Run LLFI inject fault Fault injection output

Statistics and logs of injected faults and

experiments

Run-time options Compile-time

options

Figure 3.2: LLFI workflow [156].

LLFI Workflow and System Instrumentation

Figure 3.2 shows the working of LLFI, which consists of three compulsory stages.

First stage, entails running LLFI instrument. In this step, LLFI takes the

program byte-code as input, and applies a configuration instructions specified the fault injection configuration script written in PyYaml format. The configuration file contains both (i) compile-time options, including injection location and register to target in the specified injection location, and (ii) the run-time options, consisting of the fault model, i.e the type of fault to inject, number of experiments to run and time out definitions. The LLFI instruments target injection locations with calls to fault injection functions. Call-backs functions are also instrumented for profiling. Running the LLFI instruments produces as output fault-injection and profiling executables, a dynamic count of injection locations is also logged.

3.4. FAULT INJECTION ANALYSIS 36

Secondly, the LLFI profile runs the profiling executables produced in Step 1

in order to create a golden run of the target program, where a golden run is a reproducible fault free execution of the system. This is done in provide a baseline for comparisons with fault injection executions. This step also create more fault injection setup text files.

Thirdly, the fault injection executables created in Step 1 is executed at runtime,

and LLFI randomly selects one runtime instance of the instrumented instructions to trigger the fault injection function and inject into the selected instruction operand value leveraging the information in setup files created in Step 2. Because hardware faults occur randomly at runtime, LLFI picks a random instruction from the set of all dynamically executed instructions at runtime to inject into. This is possible because the fault injection function is invoked at runtime, and can hence choose which invocation of an instruction to inject into. The output of the fault-injector is the fault injection experiments, including program output, log and stat file. The log files captures execution information including program exceptions and system crashes etc, while the stat files stores execution information such as injected fault type, injection location etc.

Further, by instrumenting the program once with the set of all fault-injection functions, and injecting the fault at runtime, LLFI ensures that the same ex- ecutable file (with the instrumentation in it) is used in all the fault injection runs. Finally, this method makes it unnecessary to recompile the code for each fault injection. Other work on high-level fault injection has followed a similar approach [16, 59, 73]. The result is then logged (i.e., where the fault was injected, what type) and the program allowed to continue. The final output of the program is also logged. This requires the profiling step to have completed suc- cessfully and the corresponding stat file to have been created. However, when any compile-time options is changed, the process has to start from Step 1.

3.4. FAULT INJECTION ANALYSIS 37

and allows visualisation how the IR code are mapped to the source code. Trace is collected only if specified in Step 1. This generates an execution trace after the profiling and fault-injection steps, for each fault injected. The traces can be compared to identify how the fault propagated.

3.4.3 Failure Scheme

To better understand the failure profile of the targeted program, the outcome of the fault injection experiments are categorised, using a purpose-built tool, as follows:

• No Impact: If the program execution terminates normally, and the output produced is identical to the output produced by the fault-free execution of the program, the outcome is labelled asNo Impact. This fault-free execution of the program is referred to asGolden Run.

• Exception Failure (Exception): The outcome is classified as an Ex-

ception Failure, if the program execution encounters an unexpected error

that does not result in the program crashing or hanging, i.e., the program terminates normally but does not produce any output.

• Silent Data Corruption (SDC):If the program execution terminates

normally, but produced an output that deviates from that produces by the golden runs, the outcome is classified asSilent Data Corruption (SDC).

• Time Out Failure (Time Out): The outcome is classified as aTime

Out failure, if the program execution hangs, i.e. fails to terminate within predefined time. This time is arbitrary set to be approximately 15 times larger than the execution time of the golden run.

• Crash Failure (Crash): If the program execution is terminated unex- pectedly when it encounters an unexpected error, the outcome is classified

CHAPTER

4

Problem Statements

Fault tolerance mechanisms have traditionally been validated through the use of software-implemented fault injection (SWIFI). Several SWIFI frameworks exist, however most of them are based on single-fault assumption, i.e., they assume that a single fault will occur in any execution run of the system. During the validation process, this assumption translates into a single fault being injected into an execution of the system, overlooking any potential interactions between simultaneous independent faults. Such a fault assumption has become less ap- propriate and limited as (i) software systems containing more than a single fault are more often the norm rather than the exception [14], (ii) current safety standards require the consideration of multi-point faults [62] and (iii) it has been shown that simultaneous fault injections can efficiently detect robustness vulnerabilities [163].

In general, during fault injection, (i) the type of faults to inject, (ii) which variable to inject in (referred to as the injection location) and (iii) the time at which the fault injection occurs, are usually considered. Under the single soft-error assumption, i.e., single bit-flip, the fault space is linearly large (in the size of the word or register and in the number of variables), making exhaustive fault injection feasible. Generally, when multiple fault injections are considered: (i) several variables can be targeted at any given point, and/or (ii) several different locations can be target at a given time. When multiple soft-errors are considered, important challenges arises, the most important ones being to deal

4.1. SELECTING POTENTIAL INJECTION BLOCKS LOCATIONS 39

with the exponential size (in the number of variables, variables combinations and the bit-positions) of the fault space. To make the multiple fault injection process efficient, it is important to inject fault combinations that are likely to convey more information and uncover more vulnerabilities, i.e., cause the system to fail. Moreover, the values to inject at the injection location is also an important dimension to consider for multiple faults assumption. Thus, the following challenges are identified in support of the thesis statement:

“There exists a computational feasible bits to explore under multiple bit-flip faults that will induce a wider failure profile.”

4.1 Selecting Potential Injection Blocks Loca-

tions

Does a set of locations (blocks) exists which will be suitable can- didate blocks for multiple faults injections? How can these blocks (potential injection locations) be identified?

Having addressed this problem, it is important to discern the best combination of variables to target. Thus following problem needs to be considered.

In document LnCm fault model : complexity and validation (Page 62-67)