Architectural Vulnerability Factor Estimation through Fault Injections

A Thesis Presented by

Fritz Gerald Previlon

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements for the degree of

Master of Science in Computer Engineering

Northeastern University

Boston, Massachusetts

April 2016


List of Figures

1.1 Clock rate and Power for Intel x86 microprocessors over eight generations and 25 years (Source [2])

1.2 OpenCL Programming Model and Evergreen Hardware Architecture

4.1 Possible outcomes for each simulation run

4.2 This graph shows how the Architectural Vulnerability Factor (AVF) value changes based on the number of fault-injection experiments. We notice that the AVF value shows little variation and stabilizes after 5,000 injections

4.3 Formula for the number of faults to select for injection

5.1 Results of Fault injection experiments on the Local Data Share

5.2 Amount of local memory used by each application for each NDRange mapped to the compute device

5.3 Results of Fault injection experiments on the General Purpose Register File

5.4 Number of General-Purpose registers used by each application for each NDRange mapped to the compute device

5.5 Intervals of Vulnerability for Radixsort. This shows that the faults that lead to incorrect output fall only into specific intervals of time

5.6 Intervals for LDS accesses for Radixsort

5.7 Intervals of Vulnerability for MatrixMultiplication. This vulnerability of MatrixMultiplication shows a periodic behavior

5.8 Local Memory accesses in MatrixMultiplication show a periodic behavior


List of Tables

4.1 The GPU configuration used in the experiments

4.2 The benchmarks used in the experiments


List of Acronyms

AVF Architectural Vulnerability Factor. Probability that a soft error will cause an error in a program output. The AVF of an architectural bit can be thought of as the fraction of time the bit matters for the final output of a program.

ACE Architecturally Correct Execution. ACE Analysis is a method to derive an upper bound on AVF using performance simulation.

LDS Local Data Storage. Local memory module in a compute unit of a GPU. This module is shared between work-items in a compute unit and allows for communication between the work-items. It can be manipulated using explicit instructions.


Here I wish to thank those who have supported me during the process of the thesis work. First I would like to thank my family and close friends who have encouraged me and believed in me. Their support was critical to the completion of this work.

I also want to thank the friends from NUCAR and my committee members (Dr. Vilas Sridharan and Prof. Ningfang Mi) who have provided important and constructive feedback throughout the progress of this thesis.

Lastly, my advisor David Kaeli has been a reliable and indispensable guide and support from the beginning of this work. Many thanks!


Abstract of the Thesis

Architectural Vulnerability Factor Estimation through Fault Injections

by

Fritz Gerald Previlon

Master of Science in Electrical and Computer Engineering Northeastern University, April 2016

David Kaeli, Adviser

Given their large number of processing cores and their impressive parallel processing capabilities, Graphics Processing Units (GPUs) have become the accelerator of choice across multiple domains. GPUs are able to accelerate processing in a wide range of applications including scientific computing, bio-informatics, and financial applications. Their presence in the world’s fastest supercomputers has been steadily growing over the last few years.

With technology scaling, soft-error reliability has become a major issue for hardware designers. A soft error is a non-permanent fault in which a bit flip occurs in a latch or memory cell. A recent study by the Department of Energy has identified soft errors as one of the top 10 barriers to exascale computing. The architecture research community needs to pursue solutions to address the challenges presented by the growing presence of soft errors. While some soft errors will not cause an error at the output of a program, many will corrupt vulnerable program state. Since GPUs are increasingly being used for compute instead of just graphics, their reliability has become a concern. Therefore, an important step in tackling soft errors in GPUs is to first assess the impact of soft errors and the robustness of GPUs in the presence of these faults.

In this thesis, we evaluate this question using fault injection on the AMD Evergreen family of GPUs. In this study, we inject bit flips using a detailed architectural simulator. Our results indicate that a GPU can be highly resilient to soft errors. We present a study of trends that appear in common GPU programs when soft errors occur in the GPU memory hierarchy. These trends can be used to inform programmers, as well as system designers, when making decisions about how to increase the reliability of GPU software and hardware.


Introduction

For more than three decades, frequency scaling (increasing a processor’s clock frequency to gain performance) has been the driving force behind Moore’s law [3]. Processor frequencies have increased from 1-8 MHz in the 1970s to 2-4 GHz today (approximately 4,000 times faster). However, power and thermal constraints have made it very challenging to continue increasing the clock frequencies of microprocessors. Figure 1.1 shows how both power and clock rate increased rapidly for decades, but have recently flattened [2]. The microprocessor industry has thus turned to parallelism in order to obtain higher performance at the same frequency, with only a minimal increase in power consumption. For the past decade, general-purpose compute applications have started to leverage the parallelism provided through parallel computing hardware, as well as sophisticated parallel programming interfaces.

As parallelism became more prevalent, the market saw an increase in multi-core processors, able to take advantage of the parallelism inherent in general-purpose applications. New programming interfaces have been developed in order to facilitate the development of parallel applications. Developers have looked for ways to accelerate their performance-critical applications in order to exploit the performance benefits offered by the parallelism in multi-core processors.

With hundreds of cores and streaming processing devices, Graphics Processing Units (GPUs) have become an attractive parallel processing device. Originally, GPU acceleration was limited first by the lack of programmable shaders, and then by the need to use graphics-oriented programming languages. Improved programmability has helped these devices become quite attractive for high-performance computing and other data-intensive applications. Beyond their primary graphics role, they are now used in a growing range of applications, including scientific computing [4], bio-informatics [5], molecular modeling [6] and financial applications [7].


CHAPTER 1. INTRODUCTION

Figure 1.1: Clock rate and Power for Intel x86 microprocessors over eight generations and 25 years (Source [2])

However, because their primary use has been for graphics processing, reliability (or more specifically, soft-error reliability) has never been a major concern for GPU designers. As expressed by Sheaffer et al., a user is quite unlikely to care about or even perceive a single-bit error in a single pixel for a single frame, when running traditional gaming programs [8].

To continue to exploit the GPU’s impressive parallel compute capabilities, and to expand the use of GPUs to a wider range of markets and industries, it is imperative that reliability issues on GPUs be rigorously addressed.

In traditional CPU design, soft-error reliability is not a foreign concept. Reliability has commonly been a key design trade-off considered during processor design. Soft errors are radiation-induced errors and are caused by energetic particles (neutrons from cosmic rays, and alpha particles from packaging materials) generating electron-hole pairs as they pass through semiconductor devices. Soft-error reliability has been well studied on CPUs; numerous techniques have been developed to characterize errors, and to protect microprocessors against these faults [9][10][11]. However, little work has been done on the resiliency of GPUs in the presence of soft errors. We need to first understand how the vulnerability of these devices depends on underlying program characteristics. In this thesis, we present an extensive fault injection study on soft error reliability in the memory hierarchy of a class of GPUs.


GPU applications. We have also found trends in the resiliency of a GPU, which can be exploited by GPU application designers to make their software more robust against soft errors.

Next, we provide background information on General-Purpose Computing on Graphics Processing Units (GPGPU).

1.1 Introduction to GPU Programming

GPUs were originally designed to efficiently render 3-D graphics, providing highly optimized datapaths for generating frame after frame of pixel data. The research community recognized that GPUs could also be used for massive data processing, and started executing floating-point computations using shader languages such as OpenGL and DirectX. The applications that were first ported to GPUs typically involved computations on matrices and vectors. Matrix multiplication was one of the early CPU programs that performed significantly better when run on a graphics card [12]. However, porting these general-purpose applications to GPUs was a very complex and daunting task, as it required the programmer to recast their algorithms in terms of the graphics APIs.

Industry leaders AMD and NVIDIA recognized this trend, and proposed general purpose programming languages that would allow GPUs to be used for a broader class of applications. OpenCL [13] and CUDA [14] have emerged as two standard programming frameworks that allow GPUs to be integrated in supercomputers and desktops as accelerators. Programmers were no longer tied to the underlying graphics programming model. They could focus more on high-performance computing, which attracted many more developers of general purpose applications to the GPU platform.

1.1.1 The Open Compute Language (OpenCL)

As GPU hardware vendors introduced programmable shaders, AMD and NVIDIA introduced support for OpenCL and CUDA, respectively. These C-like parallel programming frameworks provide a Software Development Kit (SDK) that includes a rich set of APIs and compilers/runtimes/drivers. In this thesis we use programs written in OpenCL [13].

OpenCL is an emerging framework for programming heterogeneous devices. It is an industry standard maintained by Khronos, a non-profit technology consortium. OpenCL has been adopted by a growing number of major vendors in industry, including Apple, AMD, ARM, NVIDIA, Intel, Imagination Technologies, Qualcomm and S3. OpenCL provides a number of abstraction models, allowing the model to be applied to a wide range of system architectures. In the OpenCL terminology, the GPU is referred to as the device and the CPU as the host. Figure 1.2 presents an overview of the OpenCL programming model.

a) Elements defined in the OpenCL programming model. Work-items running the same code form work-groups, which in turn, compose the whole ND-Range.

b) Simplified block diagram of the Radeon HD 5870 hardware architecture. This GPU belongs to the Evergreen family of AMD devices.

Figure 1.2: OpenCL Programming Model and Evergreen Hardware Architecture.

1.1.2 The OpenCL Platform Model

The platform model for OpenCL consists of a host connected to one or more OpenCL devices. Each device consists of one or more compute units (CU) and each compute unit further consists of one or more Processing Elements (PE). Within a device, the computations are performed within the processing elements. An OpenCL application runs on the host and sends commands to the device to be executed by the processing elements. OpenCL’s programming model emphasizes parallel processing by assuming a Single Program Multiple Data (SPMD) paradigm, in which a piece of code (called a kernel) maps to multiple subsets of input data, creating a massive number of parallel threads.

The host program is the starting point for the OpenCL program and executes on the CPU. The device kernel is written in OpenCL. In most cases, the OpenCL kernel runs on a GPU device and is usually compiled during the execution of the OpenCL host program.

An instance of the kernel executing in a Processing Element is called a work-item and is identified by a global ID. Each work-item executes the same code, but the specific execution path can vary per work-item by querying its ID.

Work-items are organized into work-groups. Work-groups are assigned a work-group ID. Work-items within a work-group are also assigned a unique local ID. A set of work-groups in turn forms an ND-Range, which is a grid of work-item groups that share a common global memory space.
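The relationship between global IDs, work-group IDs and local IDs can be illustrated with a minimal Python sketch of a one-dimensional ND-Range. It assumes a zero global offset and a global size that is a multiple of the work-group size; the function names `enumerate_work_items` and `vector_add_kernel` are hypothetical and only emulate, sequentially, what a device would run in parallel.

```python
def enumerate_work_items(global_size, local_size):
    """List (global_id, group_id, local_id) for a 1-D ND-Range."""
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work-group size")
    items = []
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            # With a zero global offset, the global ID follows from the
            # work-group ID and the local ID inside that work-group.
            global_id = group_id * local_size + local_id
            items.append((global_id, group_id, local_id))
    return items


def vector_add_kernel(global_id, a, b, out):
    """Kernel body: every work-item runs the same code on its own element."""
    out[global_id] = a[global_id] + b[global_id]


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * 8
out = [0] * 8
items = enumerate_work_items(global_size=8, local_size=4)
for global_id, _group_id, _local_id in items:
    vector_add_kernel(global_id, a, b, out)
```

Here the eight work-items form two work-groups of four, and each work-item selects its data purely from its ID, which is the SPMD pattern described above.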

1.1.3 Architecture of the Evergreen family of GPUs

In this thesis, we have worked with the AMD Evergreen family of GPUs to evaluate soft-error resiliency in GPUs. The Evergreen family was an earlier flagship GPU family developed by AMD. These devices were designed to target general-purpose data-intensive applications, along with the primary graphics applications. While the Evergreen devices are a few years old, they support OpenCL execution.

Figure 1.2b shows the general systems architecture of the Radeon HD 5870 GPU, a popular GPU in the Evergreen family. The GPUs in this family have computational units called compute units that can take advantage of data parallelism.

The Radeon 5870 has 20 compute units. Each compute unit has 16 stream cores. Each stream core in a compute unit is devoted to the execution of one instance of an OpenCL kernel. The stream cores also have access to a 32KB local data storage. Additionally, each stream core has 5 processing elements that execute the machine instructions.

Interestingly, the stream cores in Evergreen are time-multiplexed in 4 slots. This gives the illusion that each stream core is running 4 different kernel instances simultaneously. Furthermore, the Evergreen architecture has support for 5-way Very Long Instruction Word (VLIW) bundles of arithmetic instructions. The hardware support is provided in each stream core in the form of the 5 processing elements, labeled x, y, z, w and t. As a result, each stream core of the Radeon 5870 GPU has the ability to issue up to 5 floating-point operations in one cycle.
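As a quick sanity check, the per-unit figures quoted above can be multiplied out. The short Python sketch below uses only the numbers stated in the text; the constant names are illustrative, and reading the 4 time slots as 64 interleaved work-items per compute unit is an interpretation of the description rather than a vendor specification.

```python
# Figures quoted in the text for the Radeon HD 5870 (Evergreen family).
COMPUTE_UNITS = 20
STREAM_CORES_PER_CU = 16
PES_PER_STREAM_CORE = 5   # VLIW lanes x, y, z, w and t
TIME_SLOTS = 4            # time-multiplexing slots per stream core
LDS_KB_PER_CU = 32

stream_cores = COMPUTE_UNITS * STREAM_CORES_PER_CU
processing_elements = stream_cores * PES_PER_STREAM_CORE
# Work-items interleaved on one compute unit if every slot is occupied.
work_items_per_cu = STREAM_CORES_PER_CU * TIME_SLOTS
total_lds_kb = COMPUTE_UNITS * LDS_KB_PER_CU

print(stream_cores, processing_elements, work_items_per_cu, total_lds_kb)
```

Under these figures the device has 320 stream cores and 1,600 processing elements, which gives a sense of how much state is exposed to particle strikes.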


When an OpenCL kernel is launched on an Evergreen GPU, the ND-Range is initially transferred to it. A dispatcher processes the ND-Range and assigns work-groups to any of the available compute units in any order. Each compute unit contains a set of 16 stream cores, each devoted to the execution of one work-item. All stream cores within the compute unit have access to a common local data storage (LDS), used by the work-items to share data at the work-group level. The LDS is the implementation of the local memory concept as defined in OpenCL. Finally, each stream core contains 5 processing elements to execute Evergreen machine instructions in a work-item, plus a file of general-purpose registers, which provides the support for the private memory concept as defined in OpenCL.

The GPU memory hierarchy is divided into three memory scopes: 1) private memory (the register file), 2) local memory (the local data storage), and 3) global memory. Access to each memory scope is defined by software.

In this thesis, we focus on the vulnerability of the first two memory scopes in the GPU, the Local Data Storage and the Vector Register File. These structures represent a large portion of a GPU chip and can be directly addressed by a programmer. It is crucial for a programmer to understand how to use these memory scopes when resilience is critical to an application. We provide a brief description of each of these two structures in the following paragraphs.

1.1.4 Register File

The register file of a compute unit can be considered its private memory as defined by the OpenCL programming model. The register file provides each work-item in a work-group that is mapped to a compute unit at a given time with their private copy of register values. The register file can be accessed and modified by specific instructions, the Arithmetic Logic Unit (ALU) instructions and the TEX (fetch through a texture cache) instructions during their read or write stages of execution.

The register file in a GPU is significantly larger than that of a CPU. Moreover, since GPUs are throughput-oriented devices, they can usually have hundreds of threads (work-items) running concurrently. GPUs utilize fine-grained scheduling among the individual threads to hide latencies which can be associated with memory operations and dependencies from these threads. Having such a large number of threads and being a throughput-oriented device, a GPU needs to have dedicated hardware to support each running thread in the device. This explains the motivation for a large register file (about 64 KB in the Radeon 5870).


1.1.5 Local Data Storage (LDS)

As previously explained, the register file provides private memory for the individual threads running on a GPU. Each thread is provided with its own set of separate registers. Communication between individual threads through the register file is therefore not allowed. However, the OpenCL programming model supports the sharing of data between work-items within a work-group.

The GPU uses local memory in order to support this feature. Each compute unit contains one local memory module that is accessible by all work-items that are running in the work-group. The local memory functionality is different than that of the cache in the CPU. In a GPU, data in local memory is manipulated using explicit instructions, and the size of the local memory is comparable to the register file size (32 KB in the Radeon 5870).

1.2 Contributions of this Thesis

The contributions of this thesis include:

1. We present a reliability study of the vector register file and the local data share of a GPU.

2. We simulate the presence of single-bit faults using fault injection and carry out a study by simulating GPU workloads on a cycle-based simulation model of the AMD Radeon 5870.

3. We provide a characterization of the resiliency of a suite of OpenCL kernels to the effects of particle-induced faults.

4. We observe how the vulnerability of applications changes over time and provide insights that can be used by application developers to reduce vulnerability.

1.3 Organization of the Thesis

This thesis is organized as follows. In Chapter 2, we discuss prior work on reliability modeling. In Chapter 3, we review the limited prior work on GPU reliability. Chapter 4 describes the framework we use for our fault injection experiments, as well as the details of our fault model. We also discuss the Architectural Vulnerability Factor of the applications that we use. Chapter 5 provides the results of our simulation study. Chapter 6 summarizes lessons learned in this thesis and discusses directions for future work.


Chapter 2

Background

In this chapter, we provide background information on Soft Errors and the methods used to deal with these errors. We discuss techniques and paradigms used at the architectural level to assess the error rate of a processor, then we discuss recent reliability work for general purpose GPUs.

2.1 Soft Error Overview

Soft errors are intermittent malfunctions of the hardware that cannot be reproduced. Soft errors are dynamic and are changes to a cell’s contents, rather than a change in the circuitry. They are caused by single event upsets (SEUs) which are most often the result of particle strikes on silicon devices. Among the most common particles that produce Soft Errors are neutrons from cosmic rays and alpha particles from packaging materials.

When these strikes occur, the particles are able to inject charge into the devices, which can alter the values stored in them. Each cell in a device has a minimum charge needed to change the stored value in the cell. This minimum charge is called the critical charge (Qcrit) for that cell. Following a particle strike, if the accumulated charge exceeds the critical charge of the cell, a Soft Error occurs. In short, particle strikes which generate a charge higher than Qcrit will cause a Soft Error.

2.1.1 Faults vs Errors

A fault is an undesired state change in hardware. A fault in a particular layer in the computing stack may propagate to the next layer. The undesired state change in the next layer is termed an error. In this thesis, we use the term transient faults for the soft errors defined in Section 2.1.


When a transient fault occurs in a bit, this bit can be overwritten to remove the fault. When the bit is not overwritten, the incorrect state that happens as a consequence of this fault is termed an error.

Errors can be classified based on their impact on the system. We identify Correctable Errors, Detected Unrecoverable Errors and Silent Data Corruptions.

2.1.2 Correctable Errors (CE)

Correctable Errors are errors from which the system is able to recover and return to normal operation. This is usually made possible through either hardware or software. Because the system is able to recover from the effect of these errors, they are not usually a cause of concern. Many vendors, however, use the reported rate of Correctable Errors as a warning that a system may have an impending hardware problem [15].

2.1.3 Detected Unrecoverable Errors (DUE)

Detected Unrecoverable Errors are errors that will be discovered either through a program, operating system, or hardware. These errors are typically reported to the system and very often the system cannot recover. They often cause a system to crash.

2.1.4 Silent Data Corruptions (SDC)

Silent Data Corruptions are errors that alter data in a system without being detected, and ultimately permanently corrupt program state or user data. Because they can cause a program to produce incorrect results without the knowledge of the user, these are the most undesirable errors.

2.2 Transient Fault - Background and Terminology

In order to deal with transient faults, microprocessor vendors often establish an error budget for each design. Designers then perform extensive analysis to ensure that a design meets these target budgets. Vendors express their error budget in terms of Mean Time Between Failures (MTBF). For example, for its Power4 processor-based systems, IBM targets 1,000 years system MTBF for SDC errors, 25 years system MTBF for DUE errors that lead to system crash and 10 years system MTBF for DUE errors that lead to application crash [16].


CHAPTER 2. BACKGROUND

Another commonly used measurement unit for error rates is Failures in Time, or FIT, which is inversely related to MTBF. One FIT corresponds to one failure in a billion hours. Therefore, 1,000 years MTBF equals approximately 114 FIT ($10^9 / (1000 \times 365 \times 24) \approx 114$). In the same way, zero FIT means that there is infinite time between failures (infinite MTBF). Designers prefer to work with FIT rather than MTBF because FIT is additive: the FIT rates of individual components sum to the FIT rate of the system.
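The conversion between MTBF and FIT can be made concrete with a small Python sketch. It assumes a 365-day year, as in the calculation above; the helper names are hypothetical. The last lines illustrate additivity by summing the three IBM budget figures quoted earlier, purely as an arithmetic example.

```python
HOURS_PER_YEAR = 365 * 24
BILLION_HOURS = 1e9


def mtbf_years_to_fit(mtbf_years):
    """Failures per billion hours for a given MTBF expressed in years."""
    return BILLION_HOURS / (mtbf_years * HOURS_PER_YEAR)


def fit_to_mtbf_years(fit):
    """Inverse conversion: MTBF in years for a given FIT rate."""
    return BILLION_HOURS / (fit * HOURS_PER_YEAR)


sdc_fit = mtbf_years_to_fit(1000)       # SDC budget
system_crash_fit = mtbf_years_to_fit(25)  # DUE leading to a system crash
app_crash_fit = mtbf_years_to_fit(10)     # DUE leading to an application crash

# FIT rates add, whereas MTBF values do not.
total_fit = sdc_fit + system_crash_fit + app_crash_fit
combined_mtbf_years = fit_to_mtbf_years(total_fit)

print(round(sdc_fit), round(total_fit), round(combined_mtbf_years, 2))
```

Summing the FIT values and converting back gives a combined MTBF of roughly 7 years, which could not be obtained by adding the individual MTBF values directly.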

To evaluate whether a chip meets its FIT target, designers use sophisticated computer models. The effective FIT rate of a structure in a chip is the product of two metrics, its raw circuit FIT rate and its vulnerability factor.

2.2.1 Raw Circuit Fit Rate

The raw circuit FIT rate of a cell, also called its intrinsic FIT rate, is its device-level Transient Fault Rate and includes any extra derating, such as that which may be necessary for a dynamic cell.

2.2.2 Vulnerability Factor

Vulnerability factor (also called derating factor or soft error sensitivity factor) is an indication of the probability that an internal fault in a device will result in an externally-visible error.

Several vulnerability factors affect the FIT rate of a structure. The timing vulnerability factor, for example, measures the fraction of time during which a fault in a structure will lead to an externally-visible error. A strike on the stored bit of a level-sensitive latch may not cause an external error if the strike occurs while the latch is accepting data, because the stored bit will be overwritten by the incoming data. Assuming that the latch is receiving data 50% of the time, its Timing Vulnerability Factor is 50%. Several other vulnerability factors affect the effective FIT rate of a structure. However, in this work, we assume that all vulnerability factors except the architectural vulnerability factor are incorporated into the raw circuit FIT rate.
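Under that assumption, the effective FIT rate of a structure reduces to its raw FIT rate multiplied by its AVF. The Python sketch below illustrates the arithmetic. Modeling the raw FIT of a structure as a per-bit rate times the number of bits is an assumption made here for illustration, and the per-bit rate and AVF values are hypothetical placeholders, not measurements.

```python
def effective_fit(raw_fit_per_bit, num_bits, avf):
    """Effective FIT = raw circuit FIT of the structure x its AVF."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF must lie between 0 and 1")
    return raw_fit_per_bit * num_bits * avf


RAW_FIT_PER_BIT = 0.001          # hypothetical raw FIT per bit
LDS_BITS = 32 * 1024 * 8         # a 32 KB local memory
HYPOTHETICAL_LDS_AVF = 0.25      # placeholder AVF, not a measured value

lds_fit = effective_fit(RAW_FIT_PER_BIT, LDS_BITS, HYPOTHETICAL_LDS_AVF)
print(lds_fit)
```

Because FIT is additive, the effective FIT rates of the individual structures can then be summed to check a chip-level budget.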

2.3 Architectural Vulnerability Factor

The Architectural Vulnerability Factor of a structure is defined as the percentage of bits in the structure that are necessary for correct program execution over the lifetime of a simulation. It expresses the probability that a bit flip in the structure will produce a visible incorrect result at the output of a program.


Current predictions show that the overall raw FIT rate per bit will remain constant for the next several technology generations. Therefore, it is crucial to focus efforts on reducing the architectural vulnerability factor of a chip to make the chip more reliable and competitive.

There are two common methodologies for assessing AVF in silicon devices: fault injection and ACE analysis [17][18][9]. These two methods help designers analyze the AVF of an architecture in various stages of the design process.

2.3.1 Fault Injection

Fault injection is the most widespread method for assessing reliability. A fault injection campaign compares the reference behavior of the circuit for a given workload (that is, the correct behavior validated by the designer) with the behavior obtained in the presence of each fault in a predetermined set [19].

In a fault injection campaign, a fault is injected in a hardware structure at a random time and at a random location, while a workload is being executed on the device under test. The output of the workload is then compared against a golden output to determine whether the injected fault caused a visible failure. This process is repeated a number of times and, as the number of runs becomes statistically significant, the ratio of the number of failing runs to the total number of runs converges towards the Architectural Vulnerability Factor of the structure. Hardware-implemented fault injection and software-implemented fault injection are the most common approaches used to perform fault injection.
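The campaign loop described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the simulator used in this thesis: the "structure" is an 8-word memory, the workload and all names (`run_workload`, `estimate_avf`, `exhaustive_avf`) are hypothetical, and a fault is a single bit flip applied at the start of a chosen cycle.

```python
import random

NUM_WORDS = 8       # words in the toy storage structure
BITS_PER_WORD = 8
NUM_CYCLES = 10     # length of the toy workload


def run_workload(fault=None):
    """Run the toy workload; fault is (cycle, word, bit) or None."""
    mem = [0] * NUM_WORDS
    output = None
    for cycle in range(NUM_CYCLES):
        # Inject the single-bit flip at the start of the chosen cycle.
        if fault is not None and fault[0] == cycle:
            _, word, bit = fault
            mem[word] ^= 1 << bit
        if cycle == 2:
            # Every word is written, which overwrites any earlier fault.
            for i in range(NUM_WORDS):
                mem[i] = i + 1
        if cycle == 7:
            # Only the first four words contribute to the program output.
            output = sum(mem[:4])
    return output


def estimate_avf(num_injections, seed=0):
    """Random campaign: failing runs divided by total runs."""
    rng = random.Random(seed)
    golden = run_workload()
    failures = 0
    for _ in range(num_injections):
        fault = (rng.randrange(NUM_CYCLES),
                 rng.randrange(NUM_WORDS),
                 rng.randrange(BITS_PER_WORD))
        if run_workload(fault) != golden:
            failures += 1
    return failures / num_injections


def exhaustive_avf():
    """Reference value: try every (cycle, word, bit) fault site once."""
    golden = run_workload()
    sites = [(c, w, b) for c in range(NUM_CYCLES)
             for w in range(NUM_WORDS) for b in range(BITS_PER_WORD)]
    failures = sum(run_workload(site) != golden for site in sites)
    return failures / len(sites)


print(exhaustive_avf(), estimate_avf(4000))
```

In this toy model only faults that land in the four live words between the write and the read corrupt the output, so the exhaustive value is 0.25, and the random estimate approaches it as the number of injections grows.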

2.3.1.1 Hardware-implemented Fault Injection

In this method, faults are inserted into the actual device silicon either by using dedicated custom hardware [20] or by injecting the faults into integrated circuits using heavy-ion radiation [21].

Because Hardware Fault Injection is done in actual hardware, there is no need to know the internal details of the hardware, and the method closely mimics what happens in real systems. It is therefore very accurate; the effects of the operating system, the latency from IO operations, and other non-deterministic effects are already taken into account. Furthermore, since injections are done in the actual hardware that is running the workloads, a fault injection campaign takes significantly less time than software-implemented fault injection.

However, the disadvantages of a hardware fault injection campaign make it very difficult to consider. First, hardware fault injection needs to be done post-silicon, as we need at least a hardware prototype. This is usually too late, considering that such reliability analysis is often needed during the architectural exploration phase of a design. However, the results can help make reliability decisions for future devices that use a similar technology or architecture. Second, it is very expensive and time-consuming to build dedicated custom hardware and to subject a device to radiation beam testing.

2.3.1.2 Software-implemented Fault Injection

In Software Fault Injection, faults are injected in the simulated hardware under test. Because this is done in a software implementation of the hardware, it can be done in a performance simulator which is usually available during the architectural exploration phases of a microprocessor design project. Therefore, the results of a software-implemented fault injection campaign can be used to influence the design of a new chip. Moreover, since we are using a software implementation of the hardware, we naturally have more visibility into the internals of the architecture under test. One drawback of the software fault injection method is that simulation tends to be very slow compared to the execution of a workload on the native hardware.

2.3.2 ACE Analysis

ACE analysis was first developed by Mukherjee et al. [9] to calculate the Architectural Vulnerability Factor (AVF) of pipeline structures such as the instruction queue and the Reorder Buffer. Traditional ACE analysis is implemented in simulation and determines the AVF of hardware structures by executing a single pass through a program.

In ACE analysis, the AVF of hardware structures is estimated by tracking the hardware state bits that are required for Architecturally Correct Execution (ACE). If any fault occurs in a storage cell containing these ACE bits, and if there is no error correction technique present on the system, there will be a visible error in the output of the program. The remaining state bits that are not ACE are called un-ACE bits; they are not required for architecturally correct execution of the program and a fault in a storage cell containing an un-ACE bit will not cause a visible error at the output of the program.

The AVF for a single-bit storage cell is the fraction of time it holds an ACE bit. Consequently, the AVF for a hardware structure is the average AVF of its storage cells. ACE analysis on a structure starts by conservatively assuming that all bits in the structure are ACE bits, then proceeds to identify bits that can be marked as un-ACE. Un-ACE bits can be categorized as either architectural or microarchitectural un-ACE bits. Examples of architectural un-ACE bits include bits from NOP instructions, performance-enhancing instructions (e.g., prefetches), predicated-false instructions, dynamically-dead code, and logical masking. Examples of microarchitectural un-ACE bits are idle or invalid bits, mis-speculated bits (wrong-path instructions or predictor structure bits), and microarchitecturally dead bits.
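The definition of AVF in terms of ACE time can be written out as a short Python sketch. It assumes that the number of cycles each storage cell holds an ACE bit has already been obtained from a simulator; the function name `structure_avf` and the example residency figures are hypothetical.

```python
def structure_avf(ace_cycles_per_cell, total_cycles):
    """Average, over all cells, of the fraction of time each cell is ACE."""
    if total_cycles <= 0 or not ace_cycles_per_cell:
        raise ValueError("need at least one cell and one cycle")
    per_cell_avf = [cycles / total_cycles for cycles in ace_cycles_per_cell]
    return sum(per_cell_avf) / len(per_cell_avf)


TOTAL_CYCLES = 10

# Conservative starting point: every cell is assumed ACE for the whole run.
conservative = structure_avf([TOTAL_CYCLES] * 64, TOTAL_CYCLES)

# After refinement: half of the cells are never read (un-ACE), the other
# half only hold live data for 5 of the 10 cycles.
refined = structure_avf([5] * 32 + [0] * 32, TOTAL_CYCLES)

print(conservative, refined)
```

The gap between the conservative and the refined value reflects how much un-ACE state the analysis was able to identify, which is why an incomplete analysis yields an upper bound on AVF.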

Because ACE analysis generates a conservative value for the AVF of a structure, the AVF value obtained through ACE analysis can very often be too conservative. It has been shown that even a refined ACE analysis can overestimate the error vulnerability of a structure by 2-3x [22]. This can result in overprotection of the structure, which makes a processor uncompetitive. Furthermore, although ACE analysis gives more insight into the resilience of a structure, performing ACE analysis on certain structures can be a very involved process.


Chapter 3

Transient Faults on GPUs

In this chapter, we review previous studies on the effects of transient faults on GPUs. We also consider the tools that have been developed to study these effects. The previous studies can be categorized into fault injection studies and ACE analysis studies.

3.1 Fault Injection Studies

Fault injection evaluates the impact of introducing a fault into the execution of a program. The execution can be done on live hardware (typically done through radiation beam testing, or a software-based injector) or in a simulated microprocessor or memory system using software.

3.1.1 GPU-Qin

GPU-Qin [23] is a fault injection tool for GPUs. The tool is built to perform fault injection studies on real GPUs running CUDA-based applications. It uses CUDA-GDB, the NVIDIA tool for debugging GPU applications. The applications are first profiled, and then instructions are selected as fault injection sites. At runtime, GPU-Qin injects a fault into the selected instructions.

The results of a fault injection campaign with GPU-Qin [23] showed that some applications are inherently resilient to transient faults, which should be taken into account when protecting an application against such faults. In addition, there was a wide variation in the rates of Silent Data Corruptions (SDC) and crashes across the studied applications. However, benchmarks with similar behaviors showed similar vulnerability. For example, HashGPU-sha1 and HashGPU-md5 are respectively SHA1 and MD5 hash implementations of StoreGPU [24], a


The reason for this variability in the rate of SDCs in GPU applications is mainly related to the applications’ characteristics. For example, applications based on search algorithms are likely to have lower SDC rates than applications that perform computations such as linear algebra, because a fault that affects a part of the search that does not lead to a match is unlikely to produce an incorrect result. Applications based on “average out” algorithms, such as stencil codes [25], also have a low SDC rate.

These applications have computations in which the final state is a product or average of multiple temporary states. Because the final state will be an average or product of temporary states, a fault affecting an intermediate state is likely to be masked and unlikely to affect the final state.

Another focus of the fault injection study with GPU-Qin was a technique to cluster applications into five resilience categories based on their SDC rates. Because of the variability in SDC rates across the applications, and the similarity in resilience among similar algorithms, the GPU-Qin authors found it useful to categorize the applications based on their SDC rate and the operations they perform. From this clustering, each of the resilience categories seemed to match very well with one or more of the dwarfs defined by Asanovic et al. [26].

3.1.2 SASSIFI

SASSIFI [27] is another fault injection tool for NVIDIA GPUs. SASSIFI is based on SASSI, a low-level, compiler-based assembly-language instrumentation framework that allows the injection of code at specific points in a program [28]. SASSIFI injects faults in the destination values of executing instructions of a running program at the architectural level. This allows for faster fault injection, increased visibility into the applications and the possibility for a detailed study and analysis of the magnitude of Silent Data Corruptions (SDC).

SASSIFI provides the user with the ability to trace an SDC all the way back to the specific fault which produced it, and also the ability to correlate program properties with program vulnerabilities, which is key to developing low-cost error mitigation schemes. Because SASSIFI injects faults at the architecture level (as opposed to the microarchitecture level), fault injection experiments with SASSIFI can only measure the derating that happens at the application level.

A fault injection study using SASSIFI and the Rodinia applications evaluated the variation in the SDC rate of these applications. Further analysis was done on the injected faults that caused different outcomes and it was observed that fault injection outcomes vary with different kernels of the same program, and with different invocations of the same kernel.


SASSIFI is similar to GPU-Qin in that it injects single-bit faults into the destination values of executing instructions of a program. One key difference is that SASSIFI is also able to inject faults into control and predicate registers. A drawback of both tools is that, because the instrumentation code needs to run on the GPU, the injected code may perturb the workload running on the GPU and alter its behavior. This can lead to inaccuracy in the reported reliability results.

Finally, it is worth mentioning that SASSIFI and GPU-Qin inject faults at a level above the microarchitecture. This means that the derating that comes from hardware structures is not taken into account in the results. Furthermore, the location for an injection is carefully selected in order to reduce the population size and easily attain statistical significance.

3.1.3 Multi2Sim Fault Injection [1]

In this thesis we utilize the Multi2Sim simulation framework [29] to provide the basis for our simulation model of the AMD 5870. We leverage prior work by Farazmand et al. [1] to build our fault injection framework in Multi2Sim [29]. Multi2Sim is a simulation framework for CPU-GPU heterogeneous computing. We provide a more detailed description on the environment and infrastructure in Chapter 4.

In the prior work by Farazmand et al. [1], faults are injected into structures of the microarchitecture. The results of this campaign show that a large number of resources are not utilized by the GPU, especially for the small applications that were used. This results in a very low rate of Silent Data Corruptions and crashes. For the injections in utilized resources, the GPU demonstrated high resilience, and in many cases, the applications were able to run to completion without any error in their output.

There were a few interesting implications from this fault injection study. The authors observed that structures with similar functionality in the CPU and GPU were not necessarily similar in terms of their vulnerability. In addition, given that very few injections into the register file led to an error in the program outputs, it makes little sense to dedicate significant resources to protect this structure. This prior work is the only other fault injection campaign that tries to compute the AVF values for structures of the GPU.


3.2 ACE Analysis studies

3.2.1 GPGPU-SODA

GPGPU-SODA [30] is a framework to evaluate the vulnerability of a GPU to transient faults. It is built on GPGPU-Sim, a cycle-accurate, open-source and publicly available simulator. Using ACE analysis, GPGPU-SODA estimates the architectural vulnerability factor (AVF) of the major microarchitectural structures in a Streaming Multiprocessor.

Tan et al.’s study with GPGPU-SODA found that the GPU microarchitecture vulnerability is highly related to workload characteristics such as the percentage of un-ACE instructions, the per-block resource requirements and the degree of branch divergence. They also concluded that several structures are highly susceptible to transient faults, and that the entire GPU should be considered for protection.


Chapter 4

Methodology

As discussed in Chapter 2, fault injection can be performed on real or simulated hardware. GPU-Qin and SASSIFI, for example, inject faults in physical GPUs that are under test. Fault injection on real hardware yields very accurate results and can be significantly faster than simulation.

However, simulation-based fault injection is useful because it can help GPU designers explore potential design decisions. Our simulator, as described in this chapter, is built for architectural exploration. Our fault injection mechanism, which is built into Multi2Sim [29], can be used to carry out fault injection experiments while exploring different microarchitectural, compiler and runtime tradeoffs. In this chapter, we describe the framework that we used to perform the fault injection campaign, as well as the post-experiment analyses.

4.1 Multi2Sim simulation model

We performed our fault injection campaign in an architectural simulator, Multi2Sim [29]. Multi2Sim is an open-source, modular and fully configurable simulation framework for CPU-GPU computing. Multi2Sim provides a wide range of CPU and GPU choices. The specific framework used in this thesis leverages a model of the AMD Evergreen family of GPUs. The Evergreen Instruction Set Architecture has been used in the implementation of AMD’s mainstream Radeon™ 5000 and 6000 series of GPUs. The model implemented in Multi2Sim is similar to the Radeon™ 5870 GPU. Multi2Sim supports both functional simulation and architectural (or detailed) simulation for the Evergreen family of GPUs. Functional simulation provides traces of Evergreen instructions; architectural or detailed simulation tracks the execution time and architectural state of the GPU hardware structures. Simulation of a program in the Evergreen model begins with the execution of the CPU code, which is the host code of the OpenCL program. The host code is run using the CPU simulation module of Multi2Sim. The OpenCL API calls from the host program are intercepted and used to kick off the GPU simulation.

4.2 Multi2Sim for Fault Injection

The authors of Multi2Sim have introduced the ability to perform fault injection into the execution of a program [31]. Using Multi2Sim, we are able to assess the reliability of individual hardware structures. We can inject faults during any cycle of the runtime of any hardware structure that is modelled by Multi2Sim. The fault injection mechanism is not specific to the Evergreen model, and therefore can be implemented and applied to different GPU architectures in Multi2Sim. In addition to the study presented in Chapter 3, the fault injection mechanism in Multi2Sim has been extended and used in other fault injection studies [32] [33] [1].

When using Multi2Sim to inject a fault in a hardware structure during a simulation, a fault definition file is fed to the simulator. This fault definition file contains the following information: a) the targeted hardware structure, b) the specific fault location, and c) the injection time. The fault location is the position of the bit within the hardware structure where the fault should be injected and the injection time is the simulation cycle where the injection is performed.
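The three fields of a fault definition, and the single-bit flip that models a transient fault, can be sketched in Python as follows. The `FaultSite` record, the structure label and the uniform sampling helper are hypothetical names introduced for illustration; they do not reproduce the actual Multi2Sim fault definition file syntax.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FaultSite:
    """One fault definition (hypothetical record, not the Multi2Sim format)."""
    structure: str  # targeted hardware structure, e.g. a label such as "lds"
    bit: int        # bit position within the structure
    cycle: int      # simulation cycle at which the bit is flipped


def sample_fault_sites(structure: str, num_bits: int, num_cycles: int,
                       count: int, seed: int = 0) -> List[FaultSite]:
    """Draw `count` fault sites uniformly over all (bit, cycle) pairs."""
    rng = random.Random(seed)
    return [
        FaultSite(structure, rng.randrange(num_bits), rng.randrange(num_cycles))
        for _ in range(count)
    ]


def flip_bit(word: int, bit: int) -> int:
    """Model a single-bit transient fault as an XOR with a one-hot mask."""
    return word ^ (1 << bit)


if __name__ == "__main__":
    for site in sample_fault_sites("lds", num_bits=1024, num_cycles=5000, count=3):
        print(site)
    print(bin(flip_bit(0b1010, 1)))
```

Sampling bit and cycle independently and uniformly treats every (bit, cycle) pair as an equally likely injection site, which is the assumption behind the statistical sampling argument used later in this chapter.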

Figure 4.1: Possible outcomes for each simulation run

A fault is represented by a bit flip in the simulated hardware structure at the specified cycle and location. The faulty value is either propagated to other locations in the simulation model, or is masked by the program. The programs used in our experiment have a self-check mechanism. This mechanism compares the output of the GPU program to a reference golden precalculated output. The


For each structure and each application, a total of 10,000 single faults are injected. The statistical significance of this number of experiments is discussed in Section 4.3. To calculate the Architectural Vulnerability Factor, we count the fault injections that result in a program failure (SDC) and divide by the total number of faults injected.
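The AVF estimate described above is a simple ratio, which the following Python sketch makes explicit. The outcome labels (`"sdc"`, `"masked"`, `"crash"`) are placeholders chosen for illustration; which outcomes count as failures is left as a parameter.

```python
from collections import Counter
from typing import Iterable, Sequence


def estimate_avf(outcomes: Iterable[str],
                 failure_labels: Sequence[str] = ("sdc",)) -> float:
    """Failing injections divided by the total number of injections."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one injection outcome is required")
    failures = sum(counts[label] for label in failure_labels)
    return failures / total


if __name__ == "__main__":
    # Synthetic outcomes, not results from the experiments in this thesis.
    outcomes = ["masked"] * 9700 + ["sdc"] * 300
    print(estimate_avf(outcomes))
```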

4.3 Statistical Significance

Figure 4.2: This graph shows how the AVF value changes based on the number of fault-injection experiments. We notice that the AVF value shows little variation and stabilizes after 5,000 injections.

Our goal is to statistically estimate the AVF of the workloads running on an AMD Evergreen GPU. We want to choose a number of simulations that is large enough for our results to achieve statistical significance, but not so large that performing the experiments becomes burdensome.

In order to reach this confidence level, we computed the AVF of the structures with a varying number of injections. The results of this experiment are shown in Figure 4.2. We found that the AVF value varies significantly with a small number of injections. However, after the number of injections passes 5,000 there is little variance in the AVF value.
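A convergence check of this kind can be sketched by recomputing the AVF estimate after every injection. The snippet below is a minimal Python illustration; the demo data is drawn from a synthetic Bernoulli process with an assumed failure probability of 0.1 and is not data from the experiments.

```python
import random
from itertools import accumulate
from typing import List, Sequence


def running_avf(failed: Sequence[int]) -> List[float]:
    """AVF estimate after each injection.

    `failed` holds one flag per injection, in injection order:
    1 (or True) if the run produced an incorrect output, else 0.
    """
    return [total / n for n, total in enumerate(accumulate(failed), start=1)]


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic outcomes with an assumed 10% failure probability.
    flags = [rng.random() < 0.1 for _ in range(10_000)]
    curve = running_avf(flags)
    for n in (100, 1_000, 5_000, 10_000):
        print(n, round(curve[n - 1], 4))
```

Plotting `curve` against the injection count gives the same kind of graph as Figure 4.2: large swings for small sample counts and a flat curve once enough faults have been injected.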


Figure 4.3: Formula for the number of faults to select for injection

We also verify that the number of faults injected into every structure is statistically significant using the methodology presented by Leveugle et al. [19]. According to their methodology, given a confidence level, the sample size n, or number of faults to randomly select for injection, can be computed with the formula in Figure 4.3. The variables in this formula are:

• N: initial population size. This is the number of all the potential injection sites.

• p: estimated probability of faults resulting in an error. The authors demonstrated that p = 0.5 is a safe, conservative value, so we use it in our experiments.

• e: margin of error. This is the most sensitive parameter in the formula. Reducing this parameter can increase the sample size very quickly. We chose 0.005 as our margin of error.

• t: cut-off point corresponding to the confidence level. We chose a 95% confidence level, for which t = 1.96.
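For reference, the sample-size expression of Leveugle et al. [19], written with the symbols defined in the list above, is usually stated as follows. This is a transcription of that methodology rather than a reproduction of Figure 4.3, so it should be checked against the figure.

```latex
n = \frac{N}{1 + e^{2}\,\dfrac{N-1}{t^{2}\,p\,(1-p)}}
```

When the population N is very large compared to n, the expression approaches n ≈ t² p (1 − p) / e², which no longer depends on N.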

Using this formula, we computed the required sample size for the vector register file of the GPU while running the ScanLargeArrays workload. ScanLargeArrays runs for less than 7.5 million cycles. To guarantee a margin of error of less than 0.5%, we would be required to inject 9,800 faults. Given that most of our applications run for less than 7.5 million cycles, and that the number of potential injection sites in the local memory is far smaller than that of the register file, the initial population size is largest in the case of ScanLargeArrays. This case therefore gives the highest error margin for a fixed number of injections. Consequently, with the 10,000 injections used in this thesis, our results will have a margin of error of less than 0.5%.
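The sample-size calculation can be scripted so that different populations and margins are easy to compare. The Python sketch below assumes the finite-population form of the formula from Leveugle et al. [19], n = N / (1 + e²(N − 1) / (t² p (1 − p))); the function names and the example population are illustrative and are not the values used in the experiments.

```python
import math


def sample_size(population: int, margin: float,
                t: float = 1.96, p: float = 0.5) -> int:
    """Number of faults to inject for a given margin of error.

    Assumes n = N / (1 + e^2 (N - 1) / (t^2 p (1 - p))).
    """
    denom = 1 + margin ** 2 * (population - 1) / (t ** 2 * p * (1 - p))
    return math.ceil(population / denom)


def margin_of_error(population: int, n: int,
                    t: float = 1.96, p: float = 0.5) -> float:
    """Margin of error obtained with n injections (same formula solved for e)."""
    return t * math.sqrt(p * (1 - p) / n * (population - n) / (population - 1))


if __name__ == "__main__":
    # Illustrative population: number of bits times number of cycles.
    population = 10 ** 12
    for e in (0.02, 0.01):
        print(e, sample_size(population, e))
    print(margin_of_error(population, 10_000))
```

Because e appears squared in the denominator, halving the margin of error roughly quadruples the required number of injections once the population is large.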

4.4 Post-Experiment Analysis

Fault-injection campaigns very often treat the system under evaluation as a black box. After the results of the campaign are reported, researchers come out with a number that measures the vulnerability of their applications. Data obtained from our fault injection campaign allows us to perform further analysis on the injected faults throughout the execution of the programs. Because we have the data from each fault and its outcome at the end of the simulation, we are able to perform a per-time-interval vulnerability study on the applications. That is, we can track the vulnerability of an application during the course of its execution. We hope that these results, coupled with their application profiles, will be helpful to programmers when evaluating the vulnerability of their applications.

4.5 Evaluation Framework

Parameter | Value
Number of compute units | 1
Number of Stream Cores | 16
Number of Vector Registers | 16384
Number of Memory Banks | 32

Table 4.1: The GPU configuration used in the experiments

4.5.1 Platform for Evaluation

The configuration of the GPU model used in the experiments is presented in Table 4.1. We have used the AMD Radeon™ 5870 as the base configuration. An overview of the Evergreen architecture is provided in Figure 1.2.

Because our applications are small compared to the size of the workloads that a GPU can potentially run, we have used only one compute unit in order to maximize the occupancy of the GPU. The faults can therefore be injected into usable resources of the GPU. We argue that focusing on the reliability of a single CU should not significantly impact the fidelity of our AVF values.

4.5.2 Evaluated Benchmarks

The applications are taken from the AMD APP SDK [34] and are common general-purpose GPU applications. These applications were chosen because they provide a representative cross-section of common GPU applications. The list of selected applications is shown in Table 4.2.


Abbreviation | Benchmark | Description
BNSRCH | Binary Search | Finds the position of a given element in a sorted array. Instead of halving the search space at every pass, the search space is divided into N segments (N-ary search). The computational complexity is logarithmic to base N.
BSRT | Bitonic Sort | Sorting network of n log² n comparators. Performs best when sorting a small number of elements.
DCT | DCT | Discrete Cosine Transform is a common transform for compression of 1D and 2D signals such as audio, images and video.
HGRM | Histogram | Calculates the histogram of an array.
MMUL | MatrixMultiplication | Performs the multiplication of two matrices.
MTRNS | MatrixTranspose | Matrix transpose optimized to coalesce accesses to shared memory and avoid bank conflicts.
PSUM | PrefixSum | Computes an array which holds the running totals of the elements of the input array.
RDXS | RadixSort | Radix-based sorting algorithms treat keys as multi-digit numbers in which each digit is an integer with a value ranging from 0 to m − 1, where m is the radix.
SLA | ScanLargeArrays | This scan is based on a prefix sum, but the scan is done block-wise, then the blocks are combined into a single result array.

Table 4.2: The benchmarks used in the experiments


Chapter 5

Results and Analysis

Multiple factors can affect the vulnerability of a structure. Vulnerability can differ based on the kind of computation a program performs, whether or not the application can tolerate approximate output values, and the level of occupancy of the structure where the fault was injected. Also, the microarchitecture can mask faults. Specifically, an identical fault injected in an alternative microarchitecture can lead to very different outcomes.

In this thesis, we look at the liveness of a structure. We examine how a programmer can reduce the vulnerability of a structure by reducing the liveness of that structure.

5.1 Local Data Share

The results of our fault injection study on the Local Data Share (LDS) are presented in Figure 5.1. From these results, we observe that some applications do not make use of the LDS and therefore show no vulnerability. Other applications use the LDS either partially or fully and show a wide range of vulnerability. Figure 5.2 shows the maximum amount of local memory used by each application.

Some applications, notably BitonicSort and BinarySearch, do not use the LDS in their algorithms. In these applications, there is no apparent sharing between work-items within a workgroup. Therefore, any fault injected in the LDS will not have an impact on the visible output of the program. Because the locations for the bit flips were randomly chosen, and because most of the applications are small and make minimal use of the LDS, many of the bit flips landed in unallocated portions of the LDS. This can be seen in applications that have a high percentage of NoFaults (benign faults). For example, in PrefixSum, DCT and ScanLargeArrays, a large number of the faults were injected into non-utilized portions of the LDS.

Figure 5.2: Amount of local memory used by each application for each NDRange mapped to the compute device

Other applications, such as Histogram, MatrixMultiplication, RadixSort and ScanLargeArrays, use the LDS heavily. However, the failure rate for these applications varies widely. Histogram shows very little resilience to faults, while MatrixMultiplication is highly tolerant of faults. When running MatrixMultiplication, the probability that a fault in the LDS leads to a program-visible error in the output matrix is around 1%. This suggests that the inherent resilience of an application should be taken into account when protecting it from transient faults.

5.2 Register File

Figure 5.3: Results of Fault injection experiments on the General Purpose Register File


Figure 5.4: Number of General-Purpose registers used by each application for each NDRange mapped to the compute device


The results of our fault injection experiments on the general-purpose register file are presented in Figure 5.3. These results show that the register file of the GPU is a highly resilient structure. A good portion of the activated faults get masked, and so do not affect the output of the program.

Figure 5.4 shows the maximum number of registers used for each NDRange mapped to a compute device per benchmark.

5.3 How does vulnerability vary over time?

To understand the inherent resilience of an application, we examine how the vulnerability of an application varies over time. Our goal is to identify points in time where the application is most vulnerable (or least vulnerable) and match them with specific architectural events that can be under the control of a developer. This should provide insights to a programmer or application designer on how to make their applications more fault-tolerant.

To identify points in time where an application is most vulnerable, we divide the application simulation time into cycle intervals. Then we look at all the faults that were injected during each interval and their outcomes. We also verify that the smaller number of faults in each interval is statistically significant using the formula from Leveugle et al. [19], presented in Section 4.3. We estimate that the number of faults needed per interval to obtain a 95% confidence level in our results is about 90. Because the faults are uniformly distributed over time, we are confident that our results are statistically meaningful. Our findings can be beneficial to application designers. We cannot guarantee that the observed trends can be seen in every application; however, they can offer some guidelines when designing fault-tolerant applications. We present our findings in the case studies below.
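The per-interval analysis amounts to binning each injection by its cycle and computing a failure rate per bin. The following Python sketch shows one way to do this; the `(cycle, failed)` record layout and the interval length are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def interval_vulnerability(
    injections: Iterable[Tuple[int, bool]], interval_cycles: int
) -> Dict[int, Tuple[int, int, float]]:
    """Group injections into fixed-length cycle intervals.

    `injections` yields (injection_cycle, failed) pairs. The result maps an
    interval index to (failures, injections, failure_rate) for that interval.
    Intervals that received no injection are omitted.
    """
    totals: Dict[int, int] = defaultdict(int)
    failures: Dict[int, int] = defaultdict(int)
    for cycle, failed in injections:
        index = cycle // interval_cycles
        totals[index] += 1
        failures[index] += int(failed)
    return {
        index: (failures[index], totals[index], failures[index] / totals[index])
        for index in sorted(totals)
    }


if __name__ == "__main__":
    # Synthetic records, not results from the experiments.
    records = [(5, False), (15, True), (18, False), (25, True)]
    for index, stats in interval_vulnerability(records, interval_cycles=10).items():
        print(index, stats)
```

The per-interval injection count returned alongside each rate makes it easy to flag intervals whose sample is too small to be statistically meaningful.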

5.3.1 Case Study: LDS - RadixSort

The LDS-AVF variation over time for RadixSort is shown in Figure 5.5. This graph shows that midway through the running of RadixSort, the application is highly vulnerable. Almost any fault in these intervals will lead to an incorrect result. We conduct some analysis to examine the reasons behind the high vulnerability of this application.

The RadixSort application has three phases: a histogram, which runs on the GPU; a scan of the generated histogram, which runs on the CPU; and a permute, which runs again on the GPU. Figure 5.6 shows the in-flight accesses to the LDS during the course of the running of the application. The application loops four times, each time operating on a different part of the 32-bit integers, starting with the least significant.

Figure 5.5: Intervals of Vulnerability for Radixsort. This shows that the faults that lead to incorrect output fall only into specific intervals of time

We see that the first iteration is not vulnerable at all. This is because this iteration is done on low-order bits of the integers. In the other iterations, the histogram kernel is the most vulnerable, with the 3rd iteration being almost 100% vulnerable.
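The phase structure described in this case study can be made concrete with a sequential Python sketch of a least-significant-digit radix sort. It assumes 8-bit digits, which is what four passes over 32-bit keys implies; it is a plain CPU illustration of the histogram, scan and permute phases, not the SDK's OpenCL kernels.

```python
from typing import List, Sequence


def radix_sort_u32(keys: Sequence[int], bits_per_pass: int = 8) -> List[int]:
    """LSD radix sort of unsigned 32-bit integers (sequential sketch)."""
    mask = (1 << bits_per_pass) - 1
    current = list(keys)
    # With 8-bit digits this loop runs four times, least significant digit first.
    for shift in range(0, 32, bits_per_pass):
        # Phase 1, histogram: count how many keys have each digit value.
        histogram = [0] * (mask + 1)
        for key in current:
            histogram[(key >> shift) & mask] += 1

        # Phase 2, scan: an exclusive prefix sum turns counts into offsets.
        offsets = [0] * (mask + 1)
        running = 0
        for digit, count in enumerate(histogram):
            offsets[digit] = running
            running += count

        # Phase 3, permute: move each key to its slot, preserving order
        # among keys with equal digits (stability).
        output = [0] * len(current)
        for key in current:
            digit = (key >> shift) & mask
            output[offsets[digit]] = key
            offsets[digit] += 1
        current = output
    return current


if __name__ == "__main__":
    print(radix_sort_u32([0xDEADBEEF, 7, 0x10000, 42, 7, 0]))
```

In this formulation the histogram of each pass determines where every key is placed, so a corrupted count shifts the offsets of all larger digit values in that pass.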

5.3.2 Case Study: MatrixMultiplication

Figure 5.7: Intervals of Vulnerability for MatrixMultiplication. The vulnerability of MatrixMultiplication shows a periodic behavior

The LDS-AVF time variation for MatrixMultiplication is shown in Figure 5.7. This figure shows that the vulnerability of the LDS with MatrixMultiplication is periodic, with some short intervals of no vulnerability. This can be explained by the fact that the application reads memory locations in chunks and then rewrites other locations. In Figure 5.8, we show how local memory accesses in MatrixMultiplication also behave periodically, with brief spurts of writes followed by longer periods of memory reads. We see that the intervals of low vulnerability of the LDS fall right before the memory writes. This is because the application has finished using the values stored in memory right before it initiates the writes that overwrite those memory locations. Therefore, a fault in an LDS location right before the location gets overwritten is not going to affect the result of the matrix multiplication.
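The masking argument above can be expressed as a small liveness computation on a per-location access trace: a value is live from the write that produces it to the last read before the next write, and a fault outside those windows is overwritten before it is used. The Python sketch below uses an assumed trace format of `(cycle, op)` pairs and conservatively assumes that every value that is read matters for the output.

```python
from typing import List, Sequence, Tuple


def live_intervals(events: Sequence[Tuple[int, str]]) -> List[Tuple[int, int]]:
    """Live windows (write_cycle, last_read_cycle) for one memory location.

    `events` is a cycle-ordered list of (cycle, op) pairs with op "W" or "R".
    Reads that precede the first write are ignored in this sketch.
    """
    intervals: List[Tuple[int, int]] = []
    last_write = None
    last_read = None
    for cycle, op in events:
        if op == "W":
            # The previous value dies at its last read; close its window.
            if last_write is not None and last_read is not None:
                intervals.append((last_write, last_read))
            last_write, last_read = cycle, None
        elif op == "R" and last_write is not None:
            last_read = cycle
    if last_write is not None and last_read is not None:
        intervals.append((last_write, last_read))
    return intervals


def live_fraction(events: Sequence[Tuple[int, str]], total_cycles: int) -> float:
    """Fraction of the run during which a fault in this location would be read."""
    return sum(end - start for start, end in live_intervals(events)) / total_cycles


if __name__ == "__main__":
    trace = [(0, "W"), (10, "R"), (30, "R"), (40, "W"), (45, "R"), (80, "W")]
    print(live_intervals(trace))
    print(live_fraction(trace, total_cycles=100))
```

In the example trace, cycles 30 to 40 and 45 to 80 fall after the last read of a value and before its overwrite, which corresponds to the low-vulnerability windows that precede the write bursts.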


Conclusion

In this thesis, we conducted a thorough characterization of the effects of particle-induced errors in the vector register file and the local data share of a GPU from the AMD Evergreen family. Our study shows that the vulnerability of common GPU applications varies widely depending on the implementation of the application.

We also look at a few trends that can be exploited by application developers in order to make their applications more robust. We observed that some applications show a periodic behavior in their vulnerability and are most vulnerable at specific intervals during their execution. Overall, the longer and the more numerous the intervals of vulnerability, the more vulnerable the application is. One general rule of thumb for a highly resilient application is to reduce the liveness of useful data, i.e., reduce the time between writing to a structure and reading the written data. Also, because caches are more likely to be protected, a GPU application developer seeking to make an application more reliable should consider storing data in global memory, trading some performance for better reliability.

In future work, we plan on developing a more comprehensive vulnerability analysis framework. A fault injection campaign is a very lengthy process and provides very little detail with regard to the sources of masking in a hardware structure. A comprehensive reliability framework will help us identify the sources of vulnerability and provide better insights to both hardware designers and application developers.


Bibliography

[1] N. Farazmand, R. Ubal, and D. Kaeli, “Statistical fault injection-based analysis of a GPU architecture,” in Workshop on Silicon Errors in Logic - System Effects (SELSE), 2012.

[2] D. A. Patterson and J. L. Hennessy, Computer Organization and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design), 4th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.

[3] G. E. Moore, “Cramming more components onto integrated circuits,” Proceedings of the IEEE, vol. 86, no. 1, pp. 82–85, Jan 1998.

[4] E. Alerstam, T. Svensson, and S. Andersson-Engels, “Parallel computing with graphics processing units for high-speed Monte Carlo simulation of photon migration,” Journal of Biomedical Optics, vol. 13, p. 060504, 2008.

[5] M. C. Schatz, C. Trapnell, A. L. Delcher, and A. Varshney, “High-throughput sequence alignment using graphics processing units,” BMC Bioinformatics, vol. 8, no. 1, pp. 1–10, 2007. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-8-474

[6] J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, and K. Schulten, “Accelerating molecular modeling applications with graphics processors,” Journal of Computational Chemistry, vol. 28, pp. 2618–2640, 2007.

[7] S. Grauer-Gray, W. Killian, R. Searles, and J. Cavazos, “Accelerating financial applications on the GPU,” in Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, ser. GPGPU-6. New York, NY, USA: ACM, 2013, pp. 127–136. [Online]. Available: http://doi.acm.org/10.1145/2458523.2458536


[8] J. W. Sheaffer, D. P. Luebke, and K. Skadron, “The visual vulnerability spectrum: Characterizing architectural vulnerability for graphics hardware,” in Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, ser. GH ’06. New York, NY, USA: ACM, 2006, pp. 9–16. [Online]. Available: http://doi.acm.org/10.1145/1283900.1283902

[9] S. Mukherjee, C. Weaver, J. Emer, S. Reinhardt, and T. Austin, “A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor,” in Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on, Dec 2003, pp. 29–40.

[10] D. T. Stott, B. Floering, D. Burke, Z. Kalbarczyk, and R. K. Iyer, “NFTAPE: a framework for assessing dependability in distributed systems with lightweight fault injectors,” in Computer Performance and Dependability Symposium, 2000. IPDS 2000. Proceedings. IEEE International, 2000, pp. 91–100.

[11] J. Aidemark, J. Vinter, P. Folkesson, and J. Karlsson, “GOOFI: generic object-oriented fault injection tool,” in Dependable Systems and Networks, 2001. DSN 2001. International Conference on, July 2001, pp. 83–88.

[12] E. S. Larsen and D. McAllister, “Fast matrix multiplies using graphics hardware,” in Supercomputing, ACM/IEEE 2001 Conference, Nov 2001, pp. 43–43.

[13] The Khronos Group, “The OpenCL Standard,” www.khronos.org/opencl.

[14] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with cuda,” Queue, vol. 6, no. 2, pp. 40–53, Mar. 2008. [Online]. Available: http://doi.acm.org/10.1145/1365490.1365500

[15] S. Mukherjee, Architecture Design for Soft Errors. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.

[16] D. Bossen, “CMOS soft errors and server design,” IEEE 2002 Reliability Physics Tutorial Notes, Reliability Fundamentals, vol. 121, pp. 07–1, 2002.

[17] M.-C. Hsueh, T. K. Tsai, and R. K. Iyer, “Fault injection techniques and tools,” IEEE Computer, vol. 30, no. 4, pp. 75–82, Apr. 1997.


[18] C. Constantinescu, M. Butler, and C. Weller, “Error injection-based study of soft error propagation in AMD Bulldozer microprocessor module,” in DSN, R. S. Swarz, P. Koopman, and M. Cukier, Eds. IEEE Computer Society, 2012, pp. 1–6. [Online]. Available: http://dblp.uni-trier.de/db/conf/dsn/dsn2012.html

[19] R. Leveugle, A. Calvez, P. Maistri, and P. Vanhauwaert, “Statistical fault injection: Quantified error and confidence,” in Design, Automation Test in Europe Conference Exhibition, 2009. DATE ’09., April 2009, pp. 502–506.

[20] J. Arlat, Y. Crouzet, and J.-C. Laprie, “Fault injection for dependability validation of fault-tolerant computing systems,” in Fault-Tolerant Computing, 1989. FTCS-19. Digest of Papers., Nineteenth International Symposium on, June 1989, pp. 348–355.

[21] J. Karlsson, P. Liden, P. Dahlgren, R. Johansson, and U. Gunneflo, “Using heavy-ion radiation to validate fault-handling mechanisms,” Micro, IEEE, vol. 14, no. 1, pp. 8–23, Feb 1994.

[22] N. J. Wang, A. Mahesri, and S. J. Patel, “Examining ACE analysis reliability estimates using fault-injection,” in Proceedings of the 34th Annual International Symposium on Computer Architecture, ser. ISCA ’07. New York, NY, USA: ACM, 2007, pp. 460–469. [Online]. Available: http://doi.acm.org/10.1145/1250662.1250719

[23] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi, “GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications,” in Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on, March 2014, pp. 221–230.

[24] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu, “StoreGPU: Exploiting graphics processing units to accelerate distributed storage systems,” in Proceedings of the 17th International Symposium on High Performance Distributed Computing, ser. HPDC ’08. New York, NY, USA: ACM, 2008, pp. 165–174. [Online]. Available: http://doi.acm.org/10.1145/1383422.1383443

[25] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick, “Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures,” in 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2008, pp. 1–12.


[26] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, “The landscape of parallel computing research: A view from berkeley,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2006-183, Dec 2006. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

[27] S. K. S. H. Hari, T. Tsai, M. Stephenson, S. Keckler, and J. Emer, “SASSIFI: Evaluating resilience of gpu applications,” in Workshop on Silicon Errors in Logic - System Effects (SELSE), 2015.

[28] M. Stephenson, S. K. Sastry Hari, Y. Lee, E. Ebrahimi, D. R. Johnson, D. Nellans, M. O’Connor, and S. W. Keckler, “Flexible software profiling of gpu architectures,” in Proceedings of the 42Nd Annual International Symposium on Computer Architecture, ser. ISCA ’15. New York, NY, USA: ACM, 2015, pp. 185–197. [Online]. Available: http://doi.acm.org/10.1145/2749469.2750375

[29] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “Multi2Sim: A Simulation Framework for CPU-GPU Computing,” in Proc. of the 21st International Conference on Parallel Architectures and Compilation Techniques, Sep. 2012.

[30] J. Tan, N. Goswami, T. Li, and X. Fu, “Analyzing soft-error vulnerability on gpgpu microarchitecture,” in Workload Characterization (IISWC), 2011 IEEE International Symposium on, Nov 2011, pp. 226–235.

[31] R. Ubal, D. Schaa, P. Mistry, X. Gong, Y. Ukidave, Z. Chen, G. Schirner, and D. Kaeli, “Exploring the heterogeneous design space for both performance and reliability,” in Proceedings of the 51st Annual Design Automation Conference, ser. DAC ’14. New York, NY, USA: ACM, 2014, pp. 181:1–181:6. [Online]. Available: http://doi.acm.org/10.1145/2593069.2596680

[32] F. Previlon, M. Wilkening, V. Sridharan, S. Gurumurthi, and D. Kaeli, “Examining the impact of ace interference on multi-bit avf estimates,” Proceedings of SELSE-8: Silicon Errors in Logic-System Effects, 2015.

[33] M. Wilkening, V. Sridharan, S. Li, F. Previlon, S. Gurumurthi, and D. R. Kaeli, “Calculating architectural vulnerability factors for spatial multi-bit transient faults,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-47. Washington, DC, USA: IEEE Computer Society, 2014, pp. 293–305. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2014.15

[34] “AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK),” http://developer.amd.com/sdks/AMDAPPSDK.
