ROBUST NEURAL NETWORKS


António Manuel Delgado Morais


Dissertation in the context of the Master in Informatics Engineering, Specialization in Intelligent Systems, supervised by Professors Raul Barbosa and Nuno Lourenço and presented to the Faculty of Sciences and Technology / Department of Informatics Engineering.

September 2021


I would like to thank my supervisors, Professors Raul Barbosa and Nuno Lourenço, for the continuous assistance, for the availability to answer questions, and for the suggestions which were crucial for the success of this work.

I am also deeply grateful for the support given to me by my family and friends, which motivated me to continue working when I stumbled upon a problem, and encouraged me when I managed to overcome it.

This work is partially funded by the Portuguese Foundation for Science and Technology, I.P., within the project CISUC - UID/CEC/00326/2020, by the European Social Fund, through the Regional Operational Program Centro 2020, and by the project AI4EU – “A European AI On Demand Platform and Ecosystem”, funded by the European Commission, H2020-EU.2.1.1., Grant agreement ID: 825619.




The growing usage of Machine Learning (ML) based systems in safety-critical contexts has prompted increased concerns over the reliability of the models and algorithms used. Despite their effectiveness, these models can make mistakes with serious consequences. These failures are often attributable to some sort of defect in the model architecture or lack of training data. Other times, however, these errors happen due to random hardware faults.

To limit the effects of the latter, several methods have been developed and applied to ML models with the goal of increasing their fault tolerance. Models based on Deep Neural Networks (DNNs), particularly Convolutional Neural Networks (CNNs), are especially significant due to their applications in safety-conscious tasks such as autonomous driving or medical environments.

In this work, we study the effectiveness of existing methods in improving the fault tolerance of CNNs, such as Dropout, Redundancy, Ranger and Stimulated Dropout. We use four datasets of varying complexity that represent diverse applications of ML models, one of which in a safety-critical context. In addition, we combine some of these fault tolerance methods into hybrid approaches.

To measure the fault tolerance of ML models, we devise and implement a model-agnostic experimental process that uses the ucXception framework to inject faults during the testing phase.

Our evaluation of the tested methods shows that only Ranger and Stimulated Dropout consistently improve the fault tolerance of CNN-based ML models. Of these two methods, Stimulated Dropout shows the largest improvement in fault tolerance; however, the high computational costs of this method make its use a challenge for modern architectures in its current form, and further research is required to improve its performance.

Keywords

Safety-Critical Systems, Convolutional Neural Networks, Fault Tolerance, Fault Injection, C++, PyTorch, Dropout, Redundancy, Ranger, Stimulated Dropout


The growing use of systems based on Machine Learning (ML) in safety-critical contexts has led to increased concern about the reliability of the models and algorithms used. Despite their high efficiency, these models can make mistakes with serious consequences. These failures are frequently attributable to some kind of defect in the model architecture or to a lack of training data. However, there are occasions on which these errors happen because of random hardware faults. To limit the effects of these types of faults, several methods have been developed and applied to ML models with the goal of increasing their fault tolerance. Models based on Deep Neural Networks (DNNs), particularly Convolutional Neural Networks (CNNs), are especially significant because of their use in sensitive contexts such as autonomous driving or medical applications.

In this project, we study the efficiency of existing methods for improving the fault tolerance of CNNs, such as Dropout, Redundancy, Ranger and Stimulated Dropout. We use four datasets of varying complexity that represent diverse applications of ML models, one of them in a safety-critical context. In addition, we combine some of these fault tolerance methods into hybrid approaches.

To measure the fault tolerance of ML models, we devise and implement an experimental process, usable with any model, that uses the ucXception framework to inject faults during the testing phase.

Our evaluation of the tested methods shows that only Ranger and Stimulated Dropout consistently improve the fault tolerance of CNN-based ML models. Of these two methods, Stimulated Dropout shows the larger improvement in fault tolerance; however, the high computational cost of this method makes its use challenging in modern architectures in its current form, and more research is needed to improve its performance.

Keywords

Safety-Critical Systems, Convolutional Neural Networks, Fault Tolerance, Fault Injection, C++, PyTorch, Dropout, Redundancy, Ranger, Stimulated Dropout


Acronyms

ANN Artificial Neural Network.
API Application Programming Interface.
BN Batch Normalisation.
CCI Code Change Injection.
CNN Convolutional Neural Network.
CS Critical System.
DEI Data Error Injection.
DL Deep Learning.
DMR Dual Modular Redundancy.
DNN Deep Neural Network.
FC Fully-Connected.
FIT Failure-in-Time.
HFI Hardware Fault Injection.
IEI Interface Error Injection.
LN Layer Normalisation.
LRN Local Response Normalisation.
ML Machine Learning.
MSE Mean Squared Error.
OS Operating System.
PL Programming Language.
PSD Power Supply Disturbance.
RGB Red, Green and Blue.
RNN Recurrent Neural Network.
SCIFI Scan-Chain Implemented Fault Injection.
SCS Safety-Critical System.
SDC Silent Data Corruption.
SEU Single Event Upset.
SFI Software Fault Injection.
SoC System on Chip.
SOTA State-of-the-art.
TMR Triple Modular Redundancy.


List of Figures

2.1 Model representation of biological neuron
2.2 Example Artificial Neural Network
2.3 Mapping of 5x5 region to hidden neuron
2.4 Feature maps in hidden layer
2.5 Max-pooling example
2.6 Example Convolutional Neural Network
2.7 Early Stopping when there is an increase in generalization error
2.8 Example network with dropout
2.9 Example of a random bit-flip caused by a Single Event Upset
2.10 Visualisation of the effect of transient faults in Artificial Neural Networks (ANNs). Darker color means higher values.
2.11 Diagram with types of fault tolerance applied to ANNs
2.12 Ranger
2.13 32 bit representation of neuron output value
4.1 Example 28x28 image with label "7" from the MNIST dataset
4.2 Frequency distribution for Training and Testing sets of MNIST
4.3 Example 28x28 image with label "Ankle boot" from the Fashion-MNIST dataset
4.4 Frequency distribution for Training and Testing sets of Fashion-MNIST
4.5 Example 28x28 image of a traffic signal with label "21" from the GTSRB dataset
4.6 Frequency distribution for Training and Testing sets of GTSRB
4.7 Example 32x32 image with label "Horse" from the CIFAR-10 dataset
4.8 Frequency distribution for Training and Testing sets of CIFAR10
4.9 Diagrams for LeNet networks (1)
4.10 Diagrams for LeNet networks (2)
4.11 Directory tree for the project repository
4.12 Options class
4.13 The outcome for bit 9 of register 14 is Silent Data Corruption (SDC)
4.14 Section of script2.sh where the interval value is calculated (milliseconds)
4.15 Testing for the last bit-flip in bit 51 of register 2 takes too long
4.16 Commands to setup ucXception and run script.sh
4.17 Example command to run script2.sh
4.18 Diagram with application of Ranger
4.19 Diagram of fault injection experiment
5.1 Training and testing accuracy per dataset
5.2 Training and testing Negative Log Likelihood Loss per dataset
5.3 Loading time for each dataset
5.4 Training time per dataset in seconds, using GPU acceleration (CUDA)
5.5 Training time (seconds) for models with Stimulated Dropout per dataset, using CPU
5.6 Testing time per dataset in milliseconds
5.7 Experiment time per dataset in milliseconds
5.8 Experimental outcomes for models with redundancy for the MNIST dataset
5.9 Experimental outcomes for models with Dropout
5.10 Experimental outcomes for models with Ranger
5.11 Experimental outcomes for models with Stimulated Dropout


Introduction

The adoption of Machine Learning (ML) has been expanding in recent years [57]. Recent advances in ML algorithms, especially Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), as well as the growing availability of large datasets and the reduced cost of computation, have been the catalysts for this expansion.

As a consequence, ML is transforming a wide range of domains, from finance and health care, to science and technology [73]. Some of these domains have strict safety requirements; in these cases, the systems used are called Safety-Critical Systems (SCSs). SCSs are defined [62] as systems whose failure can result in loss of life, significant property damage, or damage to the environment.

The usage of ML in safety-critical contexts comes with particular concerns, one of which is the possibility that the model will make an unintended mistake, resulting in adverse consequences. Some of these mistakes are intrinsic to ML models since it is unfeasible to achieve an accuracy of 100% in every possible scenario in real-world circumstances.

However, there are failures which are not related to the accuracy of the model and can occur at random in any computerized system. These mistakes happen due to hardware faults [110], which are caused by external factors that are difficult to control and predict. Because of these characteristics, these failures can compromise the security of these systems and reduce users' trust.

For the use of ML models, particularly those that use DNNs, to become viable in safety-critical scenarios, the consequences of these events have to be reduced as much as possible. In this context, it is vital to create approaches that increase the fault tolerance of these types of models.

The objectives of this work are to examine, compare and discuss the impact of several techniques on the fault tolerance of CNN-based ML models in a data-driven way. These techniques include regularisation with Dropout [106], redundancy [110], modified learning through Stimulated Dropout [15], and Ranger [28]. In addition to applying these techniques to baseline models, we also combine these approaches into hybrid models.


1.1. Contributions

This work aims to investigate techniques for improving the fault tolerance of CNN-based ML models. In this context, our main contributions are detailed in the following paragraphs.

We implemented a well known CNN architecture [67] and created ML models that could be used as a baseline for comparing properties. We applied existing fault tolerance methods [15, 106, 110] as well as our own implementation of a recent method [28], to the baseline models. We selected some of these fault tolerance techniques and combined them, creating hybrid models. In addition, we trained and tested the models on four datasets of varying complexity.

We designed a model-agnostic experimental process to evaluate the fault tolerance of any ML model. This was achieved by using the ucXception [111] framework to inject faults in the models during testing.

Finally, we performed an analysis and comparison between the implemented ML models using several metrics. We quantified the impact of the selected fault tolerance methods on the performance and fault tolerance of ML models, and came to the conclusion that Stimulated Dropout and Ranger provide the most improvement in fault tolerance.

1.2. Structure

This document is organized as follows: in Section 1, we introduce the topic and problem statement and summarise the contributions; in Section 2, we provide the necessary theoretical knowledge to further understand the project topic and implementation; in Section 3, we describe the motivation, objectives and our planned approach to solve the problem; in Section 4, we detail the experimental process, including justifications for the selected tools and resources and explanations for the implementation; in Section 5, we present the experimental findings, analyse and compare the results on different metrics; and finally in Section 6, we present a summary of the work and specify future directions.


Background and Related Work

2.1. Artificial Neural Networks

The development of Artificial Neural Networks (ANNs) was inspired in part by biological learning systems [85, p.82]. Even though there are some differences between them, ANNs can be thought of as loosely analogous to the neural systems in the brain. ANNs are built out of a dense and interconnected set of simple units, where each unit takes a number of inputs and returns a single output value. These inputs can be either the output values of other units, or external data points. Because of the similarity they share with their biological counterparts, these units are interchangeably called neurons or artificial neurons, and these networks can also be referred to as neural networks.

In order to understand how each unit works, we first need to learn about its history. The first representation of an artificial neuron was created by McCulloch and Pitts in 1943 [79], in what is commonly called the McCulloch-Pitts (MCP) neuron. It is described as a simple logic gate with binary outputs. As we can see in figure 2.1, there are multiple input signals which are integrated into the body of the neuron. If the accumulation of the values of those inputs exceeds a certain threshold, an output signal is generated and passed on through the axon.

Figure 2.1: Model representation of biological neuron


Based on this work, Rosenblatt developed the perceptron in 1957 [100]. A perceptron takes in several inputs $x_1, x_2, \dots, x_k$ and produces a single binary output.

For the purpose of generating this output, Rosenblatt developed a simple rule. It starts by introducing weights $w_1, w_2, \dots, w_k$, which are real numbers, each associated with a specific input and representing the importance that input has on the output. The binary output is then determined by whether the weighted sum of the inputs is greater than or less than a given threshold.

This threshold can be transformed into a variable called bias, which is equal to the negative threshold. Bias represents how easy it is to activate the given unit, which in this case means to turn its output from 0 to 1. We can see the expression for this rule in equation 2.1.

$$
\text{output} =
\begin{cases}
0 & \text{if } w \cdot x + b \le 0 \\
1 & \text{if } w \cdot x + b > 0
\end{cases}
\tag{2.1}
$$
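The rule in equation 2.1 can be sketched in a few lines of Python. This is a minimal illustration rather than part of the implementation described in this work; the function name and the weights and bias that make the unit behave like a logical AND are assumptions chosen for the example.

```python
def perceptron(inputs, weights, bias):
    """Return 1 if the weighted sum plus the bias is positive, otherwise 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum + bias > 0 else 0


# Hypothetical weights and bias under which the unit acts as a logical AND.
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], AND_WEIGHTS, AND_BIAS))
```

A bias closer to zero makes the unit easier to activate, which matches the interpretation of the bias as the negative threshold.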

Perceptrons can be joined together into networks. An example of a possible network [88] is represented in Figure 2.2.

Figure 2.2: Example Artificial Neural Network

This structure can be classified as a feedforward neural network, since the output of one layer is used as input to the next. As we can see, each neuron of one layer is connected to each neuron of the next, which means that the layers are fully-connected. Since this is a perceptron network, each neuron receives and outputs a binary value. We can separate this network into sections with different roles: one input layer, one hidden layer and one output layer.

The input layer receives information from an input sample. Input samples can be any encoded information in the format of real or integer values. One example of an input is a grayscale image with dimensions of 28x28 pixels. In this case, each neuron in the input layer would represent one pixel, with the value of each neuron being the intensity of that pixel in binary, either 0 or 1.

The first hidden layer receives the weighted outputs of the input layer. There is usually more than one hidden layer, in such a way that the output of the first hidden layer is the input of the second, the output of the second is the input of the third and so on. Each of these layers can have any desired number of neurons. The total number of layers is called the depth of the network and the number of neurons of a layer is called the layer width.

We can intuitively think of each additional hidden layer as incrementally increasing the complexity of the network. This means that as a general rule, increasing the depth and width of a network will enable it to learn more abstract patterns.

The output layer receives the weighted outputs of the last hidden layer and generates the final return values for the network. The decision that the network makes depends on the neuron of the output layer with the highest value.

Using Figure 2.2 as an example, we can see that there are 10 neurons in the output layer. If the output of the first neuron is 1 and the outputs of the rest are 0, then the decision of the network is that the input image belongs to class “0”.

As previously mentioned, in perceptron networks, all neuron values are binary. This means that learning complex patterns is realistically impossible. Because of this, the function that generates the output of each neuron, called the activation function, is modified in most ANN architectures. One widely used activation function is the sigmoid function, which receives and returns real values.
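As a rough sketch of how a fully-connected feedforward network with sigmoid activations reaches a decision, the following Python code propagates an input through a list of layers and returns the index of the output neuron with the highest value. The function names and the toy weights are assumptions made for this example only.

```python
import math


def sigmoid(z):
    """Sigmoid activation: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Fully-connected layer with one row of weights and one bias per neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def predict(inputs, layers):
    """Feed the input through every layer and return the winning class index."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return max(range(len(activations)), key=activations.__getitem__)


# Toy network (illustrative values): 2 inputs, 2 hidden neurons, 3 outputs.
toy_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[5.0, -5.0], [-5.0, 5.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),
]
print(predict([1.0, 0.0], toy_layers))
```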

In the context of ANNs, learning means adapting the weights and biases of the network to a given training dataset. This means that given a set of examples and their respective labels, the network should be able to output values that correctly classify each of the training inputs. To quantify how well we are achieving this goal, we utilise cost functions.

One example of a cost function is Mean Squared Error (MSE), which simply measures the average of the squares of the errors. In this case, the errors are the difference between the expected labels of a given input and the actual labels that were returned by the network.

We can see an example cost function that utilises MSE in equation 2.2, where $w$ represents the weights, $b$ corresponds to the biases, $n$ is the number of elements in the training dataset, $y$ is the correct label for sample $x$ and $\hat{y}$ is the predicted label for that sample.

$$
C(w, b) \equiv \frac{1}{n} \sum_{x} \lVert y - \hat{y} \rVert^{2} \tag{2.2}
$$
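A minimal sketch of equation 2.2, assuming that labels and predictions are given as lists of equally sized vectors (for instance one-hot encoded classes); the function name is an assumption for this example.

```python
def mse_cost(expected, predicted):
    """Average over all samples of the squared distance ||y - y_hat||^2."""
    n = len(expected)
    total = 0.0
    for y, y_hat in zip(expected, predicted):
        total += sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return total / n


# Two samples with one-hot labels; the second prediction misses its class.
labels = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[1.0, 0.0], [0.0, 0.0]]
print(mse_cost(labels, outputs))
```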

The objective is therefore to minimise this cost function. This can be achieved in many ways, one of which is by using the gradient descent algorithm [88]. This algorithm works by iteratively taking small steps in the direction that decreases the cost function. In practice, this means that the algorithm calculates the gradient for the weights and biases of the network, and then moves in the direction specified by the negative gradient. This process continues until a local minimum is reached.
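To make the update rule concrete, the sketch below applies gradient descent to a deliberately simple model, a single linear unit $\hat{y} = w x + b$ trained with the MSE cost. The model, the learning rate and the number of steps are assumptions chosen so that the example stays short; they are not the settings used in this work.

```python
def gradient_descent(xs, ys, learning_rate=0.1, steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of (1/n) * sum((w*x + b - y)^2).
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move in the direction of the negative gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Samples drawn from y = 2x + 1, so the fit should approach w = 2 and b = 1.
w, b = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)
```

If the learning rate is too large the steps overshoot and the cost diverges, which is why the steps are described as small.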

The most common way of computing the gradient of the cost function for a single input-output pair is through an algorithm called backpropagation [101]. This computation is done one layer at a time, starting from the output layer and ending at the input layer, applying the chain rule so that intermediate terms are reused rather than recalculated. This makes the method more efficient than naive approaches that calculate the gradient separately for every weight.
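The layer-by-layer reuse of intermediate terms can be illustrated on the smallest possible network: one input, one sigmoid hidden neuron and one sigmoid output neuron, with a squared-error cost for a single sample. The network shape and all names below are assumptions for this sketch; the backpropagated gradient is compared against a finite-difference estimate as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, params):
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)      # hidden activation
    y_hat = sigmoid(w2 * h + b2)  # output activation
    return h, y_hat


def cost(x, y, params):
    _, y_hat = forward(x, params)
    return (y - y_hat) ** 2


def backprop(x, y, params):
    """Gradient of the cost with respect to [w1, b1, w2, b2]."""
    _, _, w2, _ = params
    h, y_hat = forward(x, params)
    # Output layer: derivative of the cost with respect to its weighted input.
    delta2 = 2 * (y_hat - y) * y_hat * (1 - y_hat)
    # Hidden layer: reuse delta2 instead of recomputing the output-layer terms.
    delta1 = delta2 * w2 * h * (1 - h)
    return [delta1 * x, delta1, delta2 * h, delta2]


def numerical_gradient(x, y, params, eps=1e-6):
    """Central finite differences, one perturbed parameter at a time."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((cost(x, y, up) - cost(x, y, down)) / (2 * eps))
    return grads


example_params = [0.5, -0.3, 0.8, 0.1]
print(backprop(1.5, 1.0, example_params))
print(numerical_gradient(1.5, 1.0, example_params))
```

The finite-difference version needs two forward passes per parameter, whereas backpropagation obtains all four derivatives from a single forward and backward pass.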

2.2. Deep Neural Networks

Now that we have explained in broad terms how ANNs work, we can introduce specific implementations that are relevant to this thesis.

Deep Neural Networks (DNNs) are ANNs that have multiple layers between the input and output layers. Deep Learning (DL) is a term that encompasses all Machine Learning (ML) methods that utilise DNNs. In recent years, DL has taken a dominant role [74] in a large number of fields, both in terms of state-of-the-art performance and real-world applications.

It is generally agreed upon that DL was officially introduced in 2006 when Hinton proposed a novel deep structured learning architecture called Deep Belief Network [47]. Since then, growing attention has been given to this field, resulting in numerous breakthroughs, for instance in 2012 [53] when a team led by Hinton won the ImageNet image classification competition.

2.3. Convolutional Neural Networks

A Convolutional Neural Network (CNN) is a type of DNN that is especially well suited for the task of image classification [42]. To explain how CNNs work, it is helpful to start with the ANN structure from Figure 2.2.

CNNs are basically ANNs which have two particular types of layers: convolutional and pooling. There are three important concepts to understand how CNNs work: local receptive fields, shared weights and pooling.

In order to understand the concept of local receptive fields, we can visualise the input neurons in the format of the input image. For instance, if we have an input image of 28x28 pixels, the input neurons are represented as a 28x28 square. Now instead of connecting each neuron of the input layer to every neuron of the first hidden layer, we will only connect a localised region of the input space. In Figure 2.3 we can see an example with a region of 5x5 input neurons, which is connected to one hidden neuron. This region is called a local receptive field.

The next step is to iteratively slide this region through the input neurons, from left to right and top to bottom, until the region reaches the bottom right corner of the image. The length of this slide is a modifiable parameter called stride. In most cases, the value of the stride is 1.

Figure 2.3: Mapping of 5x5 region to hidden neuron

In CNNs, the weights and biases are shared between all of the hidden neurons of a given hidden layer. This means that all of the neurons that belong to the same hidden layer are detecting the same pattern or feature, only for different regions of the input space. The mapping between the input layer and a given hidden layer can also be called a feature map.

Convolutional layers are essentially an aggregation of a predetermined number of feature maps, as we can see in Figure 2.4. Each of these feature maps represents the capacity to detect one feature at every possible location of the input space. The dimensions of this feature are the same as the dimensions of the region in the input space.

Figure 2.4: Feature maps in hidden layer
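The ideas of local receptive fields, stride and shared weights can be sketched as follows. The code assumes a square receptive field and no padding, so a 28x28 input with a 5x5 field and a stride of 1 yields a 24x24 feature map; the function names are assumptions for this example, and the activation function is omitted for brevity.

```python
def feature_map_size(input_size, field_size, stride=1):
    """Number of positions the receptive field can take along one dimension."""
    return (input_size - field_size) // stride + 1


def convolve2d(image, kernel, bias=0.0, stride=1):
    """Compute one feature map: the same kernel and bias at every position."""
    k = len(kernel)
    out_rows = feature_map_size(len(image), k, stride)
    out_cols = feature_map_size(len(image[0]), k, stride)
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            total = bias
            # Weighted sum over the local receptive field at (i, j).
            for di in range(k):
                for dj in range(k):
                    total += kernel[di][dj] * image[i * stride + di][j * stride + dj]
            row.append(total)
        output.append(row)
    return output


image = [[1.0] * 28 for _ in range(28)]
kernel = [[1.0] * 5 for _ in range(5)]
feature_map = convolve2d(image, kernel)
print(len(feature_map), len(feature_map[0]))
```

A convolutional layer would repeat this computation once per feature map, each with its own kernel and bias.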

Finally, CNNs have an additional type of layer called a pooling layer. These layers are usually placed directly after a convolutional layer and are meant to summarise each of that layer’s feature maps.

Pooling layers are composed of pooling units. These units contain the output of a pooling operation that is performed on a region of a given feature map. This region is necessarily smaller than the region used to create the feature map. In Figure 2.5, since the previous region had dimensions of 5x5, the pooling region has dimensions of 2x2.

Figure 2.5: Max-pooling example

There are several different pooling operations. Some of the most commonly used include max-pooling, which simply returns the maximum activation value out of a given region, and average-pooling, which calculates the average in the region.
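Both operations can be sketched with one helper that applies a chosen function to non-overlapping regions of a feature map. The assumption that regions do not overlap, as well as the function names, are choices made for this illustration.

```python
def average(values):
    return sum(values) / len(values)


def pool2d(feature_map, size=2, operation=max):
    """Summarise each non-overlapping size x size region with `operation`."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    pooled = []
    for i in range(rows):
        row = []
        for j in range(cols):
            region = [feature_map[i * size + di][j * size + dj]
                      for di in range(size) for dj in range(size)]
            row.append(operation(region))
        pooled.append(row)
    return pooled


example_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(pool2d(example_map))                     # max-pooling
print(pool2d(example_map, operation=average))  # average-pooling
```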

The final scheme of an example CNN can be viewed in Figure 2.6.

Figure 2.6: Example Convolutional Neural Network

In this case, the pooling layer is fully-connected to the output layer. Similarly to the previously mentioned ANN architecture, we can now train the CNN using, for instance, backpropagation and gradient descent.


2.4. Normalisation

Normalisation is the process of standardizing data, i.e. putting data in the same range.

The inputs of DNNs are usually normalized as part of the pre-processing stage. Even with this initial process, the activation values of neurons in the hidden layers of the network can begin to deviate from their usual magnitude between different layers, across neurons in the same layer, and over time [54, 118], which can negatively impact training accuracy. To mitigate this issue, normalisation can be added to the network as a layer. Normalisation layers have two purposes: speeding up training and improving generalization accuracy [108].

Local Response Normalisation (LRN) is a type of normalisation first introduced in the AlexNet architecture [65]. It was inspired by a concept from neurobiology called lateral inhibition, where excited neurons should subdue their neighbours. In ANNs, this means that neurons with higher outputs inhibit adjacent neurons, creating local maxima and increasing the significance of the high value neurons. In CNNs, LRN can be applied within the same channel, or across different channels. Later architectures such as VGGNets [105] removed LRN since its effect was not significant in CNNs with a high number of layers [72].

Batch Normalisation (BN) [54] is a normalisation technique that is applied to individual layers. It works by first normalising the inputs of the specific layer. This normalisation is achieved by subtracting the mean and dividing by the standard deviation, where both values are obtained from the current minibatch. Afterwards, we apply a scale coefficient and an offset. BN has a different implementation depending on the type of layer it is applied to; if it is a Fully-Connected (FC) layer, BN is inserted after the affine transformation (e.g. summation) and before the activation function, whereas in convolutional layers, BN is applied after the convolution and before the activation function.
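The training-time computation for a fully-connected layer can be sketched as below. The names `gamma` (scale coefficient), `beta` (offset) and `eps` (a small constant that avoids division by zero) are assumptions for this example, and the sketch only covers the minibatch statistics described above.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalise each feature across the minibatch, then scale and shift it.

    `batch` is a list of samples, each sample being a list of feature values.
    """
    n = len(batch)
    num_features = len(batch[0])
    output = [[0.0] * num_features for _ in range(n)]
    for f in range(num_features):
        column = [sample[f] for sample in batch]
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n
        for i, v in enumerate(column):
            normalised = (v - mean) / math.sqrt(variance + eps)
            output[i][f] = gamma[f] * normalised + beta[f]
    return output


minibatch = [[1.0, 10.0], [3.0, 30.0]]
print(batch_norm(minibatch, gamma=[1.0, 1.0], beta=[0.0, 0.0]))
```

Because the statistics are taken over the minibatch, the result for one sample depends on the other samples that happen to be in the same batch.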

BN has two major drawbacks. First, it produces inconsistent results in training and testing, which makes it inadequate for complex architectures such as Recurrent Neural Networks (RNNs) [50]. Second, the error increases dramatically when using small batch sizes [115]. This means that in order to use BN, a sufficiently large batch size needs to be selected (e.g. 32, 64), which can be memory-consuming depending on the task. Despite these issues, BN is one of the most common normalisation techniques for DNNs [3] and is an integral part of several prominent architectures [45, 49, 109].

Another widely used normalisation technique is Layer Normalisation (LN) [50], which aims to correct or improve some of the drawbacks of BN. This technique works similarly to BN, with the exception that it estimates the normalisation statistics (mean and variance) from the summed inputs to the neurons within a hidden layer. This is done individually for each training sample, which makes it so that the normalisation does not introduce new dependencies between training cases.
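A minimal sketch of this per-sample computation, assuming that `sample` holds the summed inputs to the neurons of one hidden layer for a single training case; `gain`, `bias` and `eps` are illustrative names for the learned scale, the learned offset and a small stabilising constant.

```python
import math


def layer_norm(sample, gain, bias, eps=1e-5):
    """Normalise one sample across the neurons of a layer, then scale and shift."""
    mean = sum(sample) / len(sample)
    variance = sum((v - mean) ** 2 for v in sample) / len(sample)
    return [g * (v - mean) / math.sqrt(variance + eps) + b
            for v, g, b in zip(sample, gain, bias)]


print(layer_norm([1.0, 2.0, 3.0], gain=[1.0, 1.0, 1.0], bias=[0.0, 0.0, 0.0]))
```

Since the function only ever sees one sample, its output cannot depend on the batch size or on the other training cases.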


2.5. Overfitting in Artificial Neural Networks

Overfitting is an issue in ML in which a model performs well on the training set but fails to generalise to unseen data [118]. In ANNs this happens because the weights are being tuned to fit particularities of the training examples which are not representative of the general distribution of samples [99, p. 73]. This process is particularly apparent during later iterations of the training algorithm, as the decision surface becomes overly complex [85, p. 111].

There are two important concepts related to overfitting: training error and generalization error. The training error is the error of the model as calculated on the training set, whereas generalization error is a measure of the model performance on previously unseen data.

Generalization error is estimated based on a validation set (hence often being referred to as validation error), which is randomly sampled from the training set [118]. Methods that prevent overfitting aim to reduce the generalization error.

2.5.1. Data augmentation

One of the most common ways of preventing overfitting is by using a larger training dataset, either by collecting more data, or through data augmentation. Data augmentation [39, 103] is a set of techniques that aims to enhance the size and quality of training datasets. In image data augmentation, two of the most relevant techniques are data warping and oversampling.

Data warping involves transforming existing images while keeping the same label. Example transformations include altering geometric or color properties [104], erasing parts of the image at random [119] or adversarial training [41].
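Two label-preserving transformations of this kind can be sketched on an image stored as a list of pixel rows. The horizontal flip is a geometric transformation, and the erasing function is a simplified stand-in for random erasing that blanks a fixed-size square patch; both function names and the fixed patch size are assumptions for this example.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right; the label stays the same."""
    return [list(reversed(row)) for row in image]


def random_erase(image, patch_size, rng=random, fill=0):
    """Overwrite a randomly placed square patch with a constant value."""
    rows, cols = len(image), len(image[0])
    top = rng.randrange(rows - patch_size + 1)
    left = rng.randrange(cols - patch_size + 1)
    erased = [list(row) for row in image]
    for i in range(top, top + patch_size):
        for j in range(left, left + patch_size):
            erased[i][j] = fill
    return erased


tiny_image = [[1, 2], [3, 4]]
print(horizontal_flip(tiny_image))
print(random_erase([[1] * 4 for _ in range(4)], patch_size=2))
```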

Oversampling is another data augmentation technique. It is based on creating artificial samples using methods such as mixing existing images [61] or utilising generative adversarial networks (GANs) [23].

These techniques are not suitable for every scenario. Smaller datasets can have inherent biases [104] which cannot be prevented or reduced with data augmentation.

2.5.2. Regularization

Regularization is defined [42, sec. 5.2.2] as any modification that is made to a learning algorithm that is intended to reduce overfitting. Regularization techniques can be used instead of or in addition to the previously mentioned overfitting prevention methods.

There are various regularization methods, of which we will mention Early Stopping, L1/L2 Regularization, Dropout and Stimulated Dropout.

2.5.3. Early Stopping

Early Stopping is one of the most commonly used forms of regularization in ANNs [95].

As we have previously mentioned, after a certain point in the training process, the generalization error starts to increase. We have also noted that this can be seen as an indication that the model is overfitting.


The principle behind early stopping is that we can obtain a model with better generalisation capacity by returning the state of the model at the point in time at which it had the lowest generalisation error. The learning algorithm is therefore terminated whenever there has been no decrease in the generalisation error over a predefined number of iterations. This process can be visualised in Figure 2.7.

Figure 2.7: Early Stopping when there is an increase in generalization error (error plotted against epochs, showing the training error and generalisation error curves and the early stopping point)
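The stopping criterion can be sketched as a loop over the validation error measured after each epoch. The parameter name `patience` for the predefined number of iterations without improvement is an assumption, and in a real training loop the model state would be saved at the best epoch so that it can be returned afterwards.

```python
def early_stopping(validation_errors, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors."""
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            # New lowest generalisation error: remember this epoch.
            best_error = error
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training here.
            return best_epoch, epoch
    return best_epoch, len(validation_errors) - 1


errors = [0.9, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6, 0.7]
print(early_stopping(errors, patience=3))
```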

2.5.4. L1 and L2 Regularization

L1 and L2 regularization are other commonly used techniques. The basic concept behind these methods is that they penalise models that have large weight values, such that these weights are driven towards the origin (zero). They do so by adding a penalty term to the cost function, which is calculated differently for each type. The penalty term for L1 is the sum of the absolute values of the weights, while for L2 it is the sum of the squared values of the weights.

The motivation for L1 and L2 regularization is that if the weight values are kept small, the learning algorithm will become biased against complex decision surfaces, resulting in a simpler model that has a better generalization capacity.

L1 regularization often results in a model that is sparse [42, sec. 2.3.2]. Sparsity in this sense means that most model parameters are zero. For this reason, L1 is utilised in feature selection.
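As a small illustration of the two penalty terms, the sketch below computes them for a flat list of weights and adds them to a data cost. The scaling factor `lam` (regularization strength) and the function names are assumptions introduced for this example; they are not part of the original text.

```python
def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of the squared values of the weights."""
    return sum(w * w for w in weights)


def regularized_cost(data_cost, weights, lam, kind="l2"):
    """Cost with the penalty added, scaled by the assumed strength `lam`."""
    penalty = l1_penalty(weights) if kind == "l1" else l2_penalty(weights)
    return data_cost + lam * penalty


weights = [0.5, -2.0, 0.0, 1.5]
print(l1_penalty(weights), l2_penalty(weights))
print(regularized_cost(1.0, weights, lam=0.1, kind="l1"))
print(regularized_cost(1.0, weights, lam=0.1, kind="l2"))
```

The large weight (-2.0) dominates the L2 penalty because it is squared, whereas under L1 every weight contributes in proportion to its magnitude.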

2.5.5. Dropout

Dropout [106] is another regularization method. It provides a simple and computationally efficient way of preventing overfitting. During each training iteration, each neuron has a predetermined probability p of being omitted. In the original version of dropout, both input and hidden layer neurons can be omitted and the probability p can be different between layers (e.g. 0.2 in the input layer and 0.5 in the hidden layers). If a neuron is omitted, it will not be considered during that iteration, which means that the network changes at every iteration.

Figure 2.8: Example network with dropout

In the testing phase, the full network is utilised, including the neurons that were omitted during training. However, every neuron output is multiplied by the retention probability 1 − p of its corresponding layer, where p is the probability of omission. This serves to compensate for the larger size of the network, since the outputs of each layer would otherwise be larger than in the training phase. This can be interpreted [66] as the averaging of the several different networks that were generated during training. This averaging reduces the testing error, leading to better results than previously mentioned methods such as L2 regularization [92].
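The following minimal sketch illustrates the two phases for a single layer. It assumes that `p_drop` denotes the probability of omitting a neuron, so the test-time scale factor is the retention probability 1 − `p_drop`; the function names are hypothetical.

```python
import random


def dropout_train(outputs, p_drop, rng):
    """Training phase: each neuron output is omitted (zeroed) with probability p_drop."""
    return [0.0 if rng.random() < p_drop else x for x in outputs]


def dropout_test(outputs, p_drop):
    """Testing phase: all neurons are used, scaled by the retention probability."""
    keep = 1.0 - p_drop
    return [x * keep for x in outputs]


rng = random.Random(0)
layer = [1.0, 2.0, 3.0, 4.0]
print(dropout_train(layer, 0.5, rng))  # some outputs are zeroed at random
print(dropout_test(layer, 0.5))        # every output is halved
```

Averaged over many training iterations, the expected output of a neuron equals its scaled test-time output, which is the sense in which the test network approximates the average of the thinned networks.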

2.6. Dependable and Secure Computer Systems

In the past decade, ML has been increasingly used to solve complex problems, particularly through DNN models [17, 117]. Some problems, such as the classification of skin cancer [38] or the detection of arrhythmia [98], can be solved by these kinds of models with an accuracy similar to that of specialists in those fields. Recently there have been major advances in the usage of DNNs for object detection and image classification in autonomous vehicles, with promising results [30]. There are now more autonomous vehicles in circulation than ever before, and this number is expected to grow in the future [24, 117].

Since these types of models are being used more and more in situations that are sensitive in terms of safety, there is a discussion to be had about the possibility of these models making mistakes, as well as the negative impact of those mistakes.

Every computer-based system delivers a service to its users [18]. When that service corresponds to the function of the system (i.e. what the system is intended to do) it is said that the system delivered a correct service. On the other hand, if the service deviates from what is expected, this is referred to as a system failure. A deviation between expected and actual service is called an error.


Errors are caused by faults, which are anomalous physical conditions in the system. Faults can occur for several reasons, such as thermal cycling, transistor variability or Single Event Upsets (SEUs) [90]. SEUs happen when highly energetic particles (e.g. alpha particles) strike sensitive regions of a microelectronic circuit [35]. These events occur arbitrarily in natural environments.

In autonomous driving, one example of a failure is misclassifying an object, for instance by mistaking a vehicle for a bird [28], resulting in a delayed response that can potentially cause an accident. Autonomous driving systems that are reliable have a low probability of such misclassifications happening or at least affecting the correct functioning of the system.

Because of the potentially harmful consequences of failures in sensitive domains, software solutions are required to meet strict standards such as IEC 62304 [55] for medical and ISO 26262 [56] for automotive applications.

In order to set limits for the failures in different domains, failure rate metrics such as Failure-in-Time (FIT) are widely used [70]. FIT represents the number of failures that can be expected in one billion hours of operation. Standards such as ISO 26262 specify the maximum FIT rate of devices in their domain (e.g. ISO 26262 limits the FIT rate to less than 10 [56]).

The FIT rate is calculated for the entire system in question [70], called the System on Chip (SoC), and not just for the part of the system that is running specific processes such as a DNN. This makes the fault tolerance of DNNs even more important, since the failures they contribute need to be minimised in order to make sure the total FIT rate does not exceed the limits.
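The arithmetic behind the FIT metric can be illustrated with a short sketch. The function names and all numbers below are hypothetical, and the assumption that the FIT rate of a system is the sum of the FIT rates of its components (independent components, any of which can fail the system) is a common simplification rather than something stated in the cited standards.

```python
BILLION_HOURS = 1e9


def fit_rate(failures, devices, hours_per_device):
    """Failures expected per one billion device-hours of operation."""
    device_hours = devices * hours_per_device
    return failures / device_hours * BILLION_HOURS


def system_fit(component_fits):
    """Total FIT of a system, assuming component FIT rates simply add up."""
    return sum(component_fits)


# Hypothetical field data: 2 failures over 1000 devices running 10000 hours each.
print(fit_rate(2, 1000, 10000))

# Hypothetical budget check against a limit of 10 FIT for the whole system.
total = system_fit([4.0, 3.5, 1.5])
print(total, total < 10)
```

Under this additive assumption, every FIT contributed by the DNN reduces the budget that remains for the rest of the system.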

2.7. Critical Systems

Critical Systems (CSs) are systems where a failure can have a significant negative impact at a human or economical level [46]. There are three types of CSs: Safety-critical, Mission-critical and Business-critical.

Safety-Critical Systems (SCSs) are systems in which failure can lead to loss of life, injury, property damage and/or damage to the environment [63]. This designation encompasses a wide-ranging selection of fields, such as:

• Infrastructure, for instance in emergency services dispatch systems, electrical distribution and generation and telecommunications;

• Medicine, in devices like mechanical ventilation, dialysis or robotic surgery, as well as in medical imaging devices such as X-rays, computerized tomography (CT) or magnetic resonance imaging (MRI);

• Nuclear Engineering, in nuclear reactor control systems;

• Transport, from aviation, railway and spaceflight control systems to the automotive industry.

Mission-critical systems are essential to the success of a specific task or the survival of a business or organization [46]. Failures in these types of systems result in the termination of a specific operation, at the cost of time, resources or equipment. Examples of this category include spacecraft navigation systems (where a failure can cause the spacecraft to get destroyed), or database systems with no backup (where power outages can lead to loss of important data).

Business-critical systems [46] are systems in which failure leads to significant economic losses, at a tangible (e.g. money) and/or intangible (e.g. reputation) level. If these systems fail, the business or organization can continue to function or partially function, but there is a loss of time or resources. One example is a customer accounting system in a bank, where if it fails, the data is not lost, and the business can continue operating without it.

2.8. Faults

Faults can be classified according to their characteristics [110]. When a fault is constant through time (which usually happens in cases where there is persistent physical damage in the microelectronic components) it is called permanent. When the fault only lasts for a brief interval, it is referred to as a transient fault or soft error. When transient faults are recurring, they are called intermittent. Most of the faults that occur in systems with current semiconductors are transient or intermittent [110].

In modern devices, the decreasing size of computer chips and the adoption of certain energy management techniques (e.g. dynamic voltage scaling) increase those chips' vulnerability to transient faults [22, 91, 102]. Existing problems, such as flaws in the design/manufacturing of the components or using those components beyond their expected lifespan, further worsen this vulnerability. These characteristics contribute to the fact that transient faults are much more common than permanent faults, by a ratio of up to 100:1 [94]. Transient faults have also gathered attention in recent years because of their impact on cloud services [36].

Fault models describe the system components that become defective, as well as the conditions in which this occurs [21]. Two widely used [33] fault models are:

• Stuck-at, where data or a control line is stuck at a high or low value or state (also referred to as stuck-at-1 and stuck-at-0, respectively);

• Random bit-flips, where a data or memory element is assigned an incorrect, randomly generated value.

The stuck-at fault model intends to mimic defects at the transistor and interconnection structures level. These defects are examples of permanent faults [110].

The random bit-flip fault model [110], on the other hand, simulates faults that happen at the register or memory level. These are examples of transient faults. A bit-flip operation takes place when one bit of a binary value is flipped, i.e. switches to the opposite of its current state. This alteration changes the logic value of the corresponding memory element, affecting the processes which are accessing that memory element at the time. A diagram for this operation can be seen in Figure 2.9.


Figure 2.9: Example of a Random bit-flip caused by a Single Event Upset
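Both fault models can be emulated in a few lines of software. The sketch below assumes that the affected memory element holds a 32-bit IEEE 754 floating-point value; the helper names are hypothetical. A bit-flip is an XOR with a one-bit mask, whereas a stuck-at fault forces the bit to a fixed level regardless of its current state.

```python
import struct


def float_to_bits(value):
    """Reinterpret a float as its 32-bit IEEE 754 pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def flip_bit(value, bit):
    """Transient fault: the chosen bit switches to the opposite state."""
    return bits_to_float(float_to_bits(value) ^ (1 << bit))


def stuck_at(value, bit, level):
    """Permanent fault: the chosen bit is forced to `level` (0 or 1)."""
    bits = float_to_bits(value)
    mask = 1 << bit
    if level == 1:
        bits |= mask
    else:
        bits &= ~mask
    return bits_to_float(bits)


print(flip_bit(1.0, 31))      # sign bit flipped
print(flip_bit(0.5, 30))      # highest exponent bit flipped
print(stuck_at(1.0, 31, 0))   # bit was already 0, so the value is unchanged
```

The example shows why the position of the bit matters: flipping the sign bit of 1.0 yields -1.0, while flipping the highest exponent bit of 0.5 yields a value in the order of 10^38.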

2.9. Fault tolerance

Fault-tolerant systems have been studied in depth in research [110]. This involves several key issues, from understanding how faults originate in the first place, to categorising faults into types based on similar characteristics [14, 18, 80], and lastly to evaluating the behaviour that systems have when faults occur. The ultimate goal of this area of study is to understand the impact that faults have on systems, and to develop methods to prevent or mitigate those effects.

There are several concepts [18] which are important to define in order to understand this field:

• Dependability is defined as the capacity of a given system to deliver trustworthy service, i.e. to keep the amount of failures at a reasonable level.

• Trustworthiness is the degree of assurance that a system will perform as expected.

• Availability is the readiness of correct service at any given time.

• Reliability is the ability of a system to perform its required functions under certain conditions for a given time interval [11]. Highly reliable systems have a low probability of failure, which often translates to an increased trustworthiness of that system; this concept is therefore very important when deciding on whether to use computer systems to solve a certain task.

• Safety refers to the absence of harmful consequences on users, property and environment in the case of a system failure.

• Integrity is the ability for a system to prevent information modification or destruction, while at the same time ensuring properties such as non-repudiation and authenticity.

• Maintainability is the ability for a system to be modified or improved.

• Robustness can be defined as dependability with respect to external faults (e.g. hardware-related faults). Robust systems continue to operate as intended regardless of noise or variation of internal values.

Avizienis et al. [18] define several means to improve the dependability of a system.


• Fault prevention refers to preventing faults or fault introductions (e.g. through fault injection) from happening.

• Fault tolerance is the ability of a system to provide correct service despite the occurrence of faults.

• Fault removal refers to means to reduce the number and severity of faults.

• Fault forecasting refers to means to estimate the current number, the future incidence, and the likely consequences of faults.

Traditionally, the fault tolerance of a system was assured by exploiting redundancy at the hardware level, through methods such as replication, where multiple systems perform the same operation and the final output is decided by majority-voting. Examples of this technique include Dual Modular Redundancy (DMR) [71, 113] and Triple Modular Redundancy (TMR) [51, 76, 116], where two and three systems are used, respectively.
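Replication with majority voting can be sketched as follows. The replicas are modelled as plain Python callables with hashable outputs, and the names `majority_vote` and `run_replicated` are hypothetical. Note that with only two replicas a disagreement can be detected but not outvoted, which the sketch signals by raising an exception.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the output produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


def run_replicated(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def healthy(x):
    return x * 2


def faulty(x):
    return x * 2 + 1  # replica whose result is corrupted by a fault


print(run_replicated([healthy, faulty, healthy], 21))  # TMR masks the fault

try:
    run_replicated([healthy, faulty], 21)              # DMR only detects it
except ValueError as exc:
    print("DMR:", exc)
```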

There are major downsides to this type of approach, such as overheads in cost, performance and energy consumption [71]. This is further worsened by the fact that DNN models are often deployed in applications with energy or space constraints (e.g. embedded devices). In addition, replication-based techniques require synchronization between the systems, which may not be trivial. For these reasons, these methods are not suitable for systems that require swift and efficient functioning, such as most SCSs.

2.10. Fault injection

The occurrence of faults in a natural environment is arbitrary and rare. This makes it troublesome to study their effects. Because of this, techniques have been developed to either create or replicate faults at will.

Fault injection is defined [19] as the deliberate introduction of faults into a system. After the injection, the system is usually examined in some way in order to identify any possible errors or effects of that injection. A sequence of fault injections in the same system is referred to as a fault injection campaign.

There are several characteristics which are important to have in fault injection tools:

• Repeatability: repeated fault injections should get the same results;

• Controllability: ability to control the temporal and spatial components of the fault injection;

• Intrusiveness: degree of the unintentional effects that the fault injection tool has on the system at a temporal and spatial level;

• Observability: how easy it is to observe and measure the effects of fault injection;

• Reachability: ability to access locations in the processor on which to inject faults;

• Reproducibility: ability to replicate the results of a fault injection campaign.


One concern in fault injection is that testing every single possible fault bears tremendous computational cost, even for moderately complex systems [64]. Consequently, it is common to test only a fraction of the possible faults for controllability and reproducibility purposes. The results are approximate estimates of those obtained by exhaustive testing [29].
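The relation between a sampled campaign and an exhaustive one can be illustrated on a toy target. Everything in this sketch is an assumption made for illustration: the target is a hypothetical threshold classifier over four 8-bit integers, a fault is a single bit-flip in one input, and an error is any change in the predicted class.

```python
import random


def classify(values):
    """Toy target: class 1 if the inputs sum to at least 100, else class 0."""
    return 1 if sum(values) >= 100 else 0


def fault_causes_error(values, index, bit):
    """Inject one bit-flip and report whether the output class changes."""
    faulty = list(values)
    faulty[index] ^= 1 << bit
    return classify(faulty) != classify(values)


def exhaustive_error_rate(values, bits=8):
    """Test every (element, bit) fault site."""
    sites = [(i, b) for i in range(len(values)) for b in range(bits)]
    errors = sum(fault_causes_error(values, i, b) for i, b in sites)
    return errors / len(sites)


def sampled_error_rate(values, n_samples, rng, bits=8):
    """Test only randomly chosen fault sites; a seeded rng keeps it repeatable."""
    errors = 0
    for _ in range(n_samples):
        i = rng.randrange(len(values))
        b = rng.randrange(bits)
        errors += fault_causes_error(values, i, b)
    return errors / n_samples


inputs = [10, 20, 30, 25]
print(exhaustive_error_rate(inputs))
print(sampled_error_rate(inputs, 2000, random.Random(1)))
```

On such a small target the exhaustive campaign is cheap, but the number of fault sites grows with the number of memory elements and bits, which is why sampling becomes necessary for real systems.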

The earliest methods for fault injection, known as Hardware Fault Injection (HFI), worked by introducing physical disturbances into the hardware components of the system. These types of tools can employ different techniques [19], such as pin-level fault injection, Power Supply Disturbance (PSD), electromagnetic interference or particle radiation (e.g. heavy-ion radiation).

Pin-level fault injection comes in two types: active probe and socket insertion. Active probe utilises specialised probes which are connected to the integrated circuits or hardware components of the target system. These probes inject faults in the attached components by altering their electrical current [48]. One downside of this approach is that if the electrical current is too high, the target hardware is likely to become damaged.

On the other hand, in socket insertion, a socket is added between the target hardware and its corresponding circuit board, allowing for the injection of complex faults into the pins of the target hardware.

Some examples of tools that use pin-level fault injection include RIFLE [77] and Messaline [16].

PSD is another method of HFI. It consists of dropping the supply voltage of the circuits below the expected range [19]. This affects several nodes in the circuit at once and causes multiple transient faults. It represents what might happen in case of a supply outage that affects a computer system in a real-life scenario [60].

Transient faults can also be generated by performing electromagnetic bursts on part or the entirety of the integrated circuits. This can be achieved by placing the target hardware near an electromagnetic field [48]. This is considered a relatively imprecise method [48] because it is difficult to control the moment at which the electromagnetic field is created, and therefore it is also hard to regulate the time and location of the fault injection.

Lastly, particle radiation [35] can also be used to inject transient faults without physical access to the hardware components. It is representative of what can happen in a real-life situation, where neutron and alpha particle hits can cause errors. As the size of modern integrated circuits becomes smaller, the probability of these particle hits increases, making this a problem of increased importance [91].

Software Fault Injection (SFI), on the other hand, uses software tools that emulate hardware faults by perturbing the state of memory or the hardware registers. SFI tools can only inject faults on the resources that the hardware allows them to access.

The main advantages of SFI approaches are that they do not require expensive specialized equipment [40], they can be applied to most computing devices [93] and they can be programmed to inject faults automatically without supervision. They are also capable of simulating an extensive range of fault models, from hardware to software (e.g. data, interface and code) faults [19]. In general, these tools also offer more controllability and repeatability when compared to HFI methods [64].

There are several properties which are important in SFI tools [87]:

• Representativeness is the ability to simulate the actual faults that a system might have in operation. This property depends on the realism of the fault models. Realistic fault models emulate the types of faults that are expected to occur in the system. They should also follow the frequency distribution observed under normal circumstances. These characteristics can be obtained by studying the failures that occur spontaneously on the target system or in a similar system, a process called field failure data analysis. The resulting failures can also be compared with those produced by other fault injection methods that have demonstrated accurate failure representation.

• Usability is the ability to apply the SFI tool on any given system. This property depends on the following aspects:

– Portability: SFI tools should be functional on different operating systems, applications and hardware components;

– Intrusiveness: the SFI tool should not affect the results of the experiments (e.g. by impacting the performance);

– Flexibility: the SFI tool should cover a diverse range of fault models, as well as have the capability to add new ones;

• Efficiency is the ability to produce experimental results with an adequate amount of resources and time. The efficiency of an SFI tool depends on the effort that is put in to obtain useful findings. Common indicators for efficiency include the number of fault injections, and the probability of a fault injection causing an error. The optimal type, location and timing of the injection should be determined by analysing the system.

We can differentiate between SFI tools based on the timing of the injection: compile-time and runtime [48]. In compile-time injection, the program instructions are modified before the program is loaded and executed. This results in errors being injected into the source code or assembly code of the program, which allows for the emulation of different types of faults (e.g. hardware, software, transient and permanent) without any auxiliary software program. One limitation of this approach is that the user needs to have access to the code of the target program.

Runtime injection uses triggering mechanisms to activate the fault. There are different types of triggering mechanisms, such as timers, exceptions/traps, or code insertions.

Timers expire at an exact instant, generating a time-out event. This event creates an interrupt signal that is used to trigger the fault injection. Hardware timers need to be associated with an interrupt handler which injects the fault. This technique is adequate for emulating both transient and intermittent faults.

Instead of only injecting faults at an exact instant, exceptions/traps can be used to perform fault injection before, during or after specific events occur. In the case of software traps, the fault can be injected after a specific program instruction, whereas in hardware exceptions faults are injected when a hardware event occurs (e.g. accessing a specific memory location).

Another runtime fault injection technique is code insertion, which works by inserting certain instructions on the source code of the target program just before an event occurs [75]. These instructions inject the fault and then the program continues. This technique differs from compile-time injection in that it adds instructions instead of modifying existing instructions.
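Code insertion can be illustrated with a small sketch. It assumes a hypothetical target function `accumulate` into which a single extra call has been inserted before each addition; the existing instructions are left unchanged. The helper `make_injector` is also hypothetical: it returns a function that flips one bit on a chosen invocation and passes the value through unchanged otherwise, which emulates a transient fault.

```python
def make_injector(trigger_call, bit):
    """Build an injection routine that fires once, on the given call number."""
    state = {"calls": 0, "injected": False}

    def inject(value):
        state["calls"] += 1
        if state["calls"] == trigger_call:
            state["injected"] = True
            return value ^ (1 << bit)  # flip one bit of the running value
        return value

    return inject, state


def accumulate(values, inject=lambda v: v):
    """Target program: sums integers. The call to `inject` is the inserted code."""
    total = 0
    for v in values:
        total = inject(total)  # inserted instruction, executed before the event
        total += v             # original instruction, left unmodified
    return total


print(accumulate([1, 2, 3, 4]))            # fault-free run

inject, state = make_injector(trigger_call=3, bit=4)
print(accumulate([1, 2, 3, 4], inject))    # run with one injected bit-flip
print(state)
```

After the injection the program simply continues, so the corrupted running total propagates to the final result.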

There are three main classes of SFI [87], divided according to the types of faults they are intended to simulate: Data Error Injection (DEI), Interface Error Injection (IEI) and Code Change Injection (CCI).


DEI works by corrupting the values in a memory location or hardware registers. Though originally intended to simulate the errors caused by hardware faults, DEI tools can also be used to emulate software faults [20]. Different fault models can be used, including stuck-at or bit-flip. When injecting faults in memory, the location of the injection can be either deliberately selected (e.g. a specific variable) or randomly chosen from a memory area, whereas injection in registers can be done in any of the registers that are accessible through the software.

FIAT [20] is a DEI tool that emulates both hardware and software faults. Injections are performed at compile time, by a fault injection library which is linked to the target process. It is possible to control the experiments and analyse the results through a simple user interface. One limitation of this approach is that it requires access to the source code of the target process.

Xception [26] is another DEI tool, which injects faults at runtime. It takes advantage of existing debugging and performance monitoring capabilities of modern CPUs in order to ensure minimum interference in the target system, thus decreasing its intrusiveness. This tool is able to time faults with great accuracy because it uses performance counters instead of the less accurate system clock. One disadvantage of Xception is that it is reliant on features that are only available in modern CPUs, which prevents the usage of this tool on older hardware.

GOOFI [13] is another example of a DEI tool. It is implemented in Java and includes a SQL database that stores information about the experiments. Since it uses programming languages which are available in a wide range of systems, it is easy to adapt this tool to new systems. Because of its object-oriented approach, this tool also provides flexibility in the types of faults that can be injected. Besides single and multiple transient faults, it allows for new types of faults to be implemented easily.

In addition to providing support for compile-time SFI, GOOFI can also perform Scan-Chain Implemented Fault Injection (SCIFI), which uses components that are present in modern VLSI circuits to inject faults directly into the pins and internal state elements of the integrated circuit.

TensorFI [69] is an example of a recent SFI tool that is built onto TensorFlow [10], a popular ML framework. It modifies the TensorFlow graph and allows for the injection of both hardware and software faults.

Since TensorFI is meant to be used with TensorFlow, the user only needs to add a small number of lines in the Python code in order to use this tool. It is also supported by previous versions of TensorFlow, enabling the programmer to use it in existing programs without modifications. Finally, there is a minimal overhead to the execution, which means TensorFI is a very efficient tool.

BinFI [29] is an adapted version of TensorFI. This tool searches for the critical bits in a process (i.e. the bits in which the fault injection causes errors). In comparison with other methods that also identify the critical bits, BinFI produces similar results while ensuring a speedup of 5x. This is because instead of performing an exhaustive search like existing methods, it uses binary-search, which is more efficient.

IEI works by corrupting the input or output values that are exchanged between the target component and the environment or other software/hardware components. This exchange is done through the target component’s interface.

When the inputs are corrupted, this tests the target component's ability to tolerate faults that occur in external components, whereas when the faults are injected in the output values, we are able to evaluate the effects of internal faults on the external components.

Testing with fault injection in the input values has particular importance in programs that include an Application Programming Interface (API), since they need to be able to handle faults that might occur in a wide range of devices.

IEI tools can be based on two types of techniques: test-driver, which uses a special program that is linked to the target component and tests it with corrupted inputs; and interceptor, which corrupts the inputs that are sent from other components to the target component.
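The interceptor technique can be sketched as a wrapper that corrupts the data on its way to the target component. All names below are hypothetical, the input is assumed to be a non-empty byte string, and the target is a toy parser for a length-prefixed message invented for this example.

```python
import random


def corrupt(data, rng):
    """Flip one randomly chosen bit of a non-empty byte string."""
    corrupted = bytearray(data)
    pos = rng.randrange(len(corrupted))
    corrupted[pos] ^= 1 << rng.randrange(8)
    return bytes(corrupted)


def intercept(target, rng):
    """Wrap `target` so that every input is corrupted before it arrives."""
    def wrapper(data):
        return target(corrupt(data, rng))
    return wrapper


def parse_length_prefixed(data):
    """Toy target component: first byte holds the payload length."""
    length = data[0]
    payload = data[1:]
    if length != len(payload):
        raise ValueError("length mismatch")
    return payload


rng = random.Random(0)
faulty_parse = intercept(parse_length_prefixed, rng)
message = bytes([3]) + b"abc"

rejected = accepted = 0
for _ in range(1000):
    try:
        faulty_parse(message)
        accepted += 1   # corrupted payload was accepted silently
    except ValueError:
        rejected += 1   # corruption hit the length byte and was detected
print(rejected, accepted)
```

In this toy case only corruption of the length byte is detected. Corrupted payloads are accepted silently, which is the kind of weakness that interface error injection is meant to expose.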

Fuzz and Ptyjig [83] are two of the earliest IEI tools. They send random data to the target process (in this case, a UNIX utility program) in order to check if the altered data induces errors. The results were that the tools were able to crash or hang between 24% and 33% of utility programs on three UNIX systems. Over the years, these tools were tested on modern Operating Systems (OSs) [82, 84] and the conclusion is that there is some improvement; in 2020, the failure rates were between 12% and 19% using the original methods.

Code Change Injection

CCI [87] is used to emulate software faults by injecting code that causes programming bugs. The injection can be achieved by simply modifying a small set of instructions in the source code or the binary executable. This is often done before execution [32] in order to make the faults permanent. Some studies inject faults during execution [37] in order to have more control over the timing of the injection.

FINE [58] is the earliest tool to employ CCI. It is able to inject hardware-induced software errors as well as software faults. FINE injects faults into the binary code of a target process in systems with a UNIX kernel. This tool was later extended with DEFINE [59], which adds the capacity for executing experiments in a distributed environment and expands the available fault types. Using these tools often results in a high number of inactivated faults, since the inputs required to activate the faults are frequently very specific.

2.11. Effects of faults in Artificial Neural Networks

Any process that is running on a computer-based system can be affected by faults. As mentioned in the previous section, a transient fault can cause a bit-flip in one of the CPU registers of the computer. In practice, this means that the process that is using those registers at a given time will use defective values, which can lead to errors in internal values. If that process happens to be an ANN, these types of errors can impact it in several different ways:

• Input layer: change in the values of the input neurons;

• Weights: change in the values of the weights or in the values of mathematical operations involving the weights;

• Neuron body: change in the value of the summation or the output of the activation function. If the activation function has fixed bounds (e.g. tanh has a range of [−1, 1]), changes in the summation often cause the neuron to become saturated, i.e. stuck at the upper or lower bounds. In the cases where the activation function is unbounded (e.g. ReLU has a range of [0, +∞)), these alterations can result in

It is commonly held that ANNs have some intrinsic tolerance to the impact of faults [110]. In other words, if a fault causes a fluctuation in the value of some neurons in the network, this deviation can be evened out by the remaining neurons. This is attributed to ANNs' distributed structure and over-provisioning [81] (i.e. having more neurons than the minimum number required to perform a computation).

In most cases, this intrinsic capacity makes ANNs capable of withstanding the impact of faults. On rare occasions, however, these faults can overwhelm the network's fault tolerance capacity and affect the output of the neural network, resulting in a misclassification. We can see an example of this in Figure 2.10b.

Figure 2.10: Visualisation of the effect of transient faults in ANNs. Darker color means higher values. (a) ANN with no faults. The output class is 1. (b) ANN with fault and error propagation. Because of the error, the output class is altered to 2.

Nowadays, most research in applying ANNs to the visual domain utilises DNNs. There has been some research into the characteristics of DNNs that impact their fault tolerance the most [68].

According to [68], one of the most important characteristics is the topology of the network. Special emphasis is given to the type of layer that is used. Normalisation layers [65], such as Local Response Normalisation (LRN) layers, are a type of layer that is used to increase the generalization accuracy of the network by bounding the output of specific layers to a predetermined range. They are often applied before unbounded activation functions (e.g. applying LRN before ReLU in AlexNet [65]) in order to prevent the output values of the activation layer from exploding.

In addition to their intended purpose, LRN layers also increase the fault tolerance of the network since they normalize faulty values according to their adjacent fault-free values.

Another important characteristic is the type of data that is used in the implementation of the DNN. When a fault occurs in a specific bit, the vulnerability of that bit to Silent Data Corruptions (SDCs) is proportional to the dynamic value range of the data type. This means that data types with higher value ranges have proportionally higher deviations due to faults, which in turn results in a higher probability of there being an SDC. In [68], it is stated that, when possible, it is preferable to choose a data type with an adequate dynamic value range according to the needs of the specific network. An added benefit of using data types with smaller dynamic value ranges is increased energy efficiency, which is a requirement for some safety-critical DNN systems such as those used in the automotive domain.
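The effect of the dynamic value range can be made concrete by flipping a high-order bit of the same value in two representations. This sketch is only an illustration: the two formats (32-bit IEEE 754 floating point, and a 16-bit two's-complement fixed-point format with 10 fractional bits) and the helper names are assumptions chosen for this example.

```python
import struct


def flip_float32(value, bit):
    """Flip one bit of the 32-bit floating-point encoding of `value`."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]


def flip_fixed16(value, bit, frac_bits=10):
    """Flip one bit of a 16-bit two's-complement fixed-point encoding."""
    raw = int(round(value * (1 << frac_bits))) & 0xFFFF
    raw ^= 1 << bit
    if raw & 0x8000:          # reinterpret the pattern as a signed number
        raw -= 1 << 16
    return raw / (1 << frac_bits)


value = 0.5
# Flip the most significant non-sign bit in each representation.
print(abs(flip_float32(value, 30) - value))  # deviation in float32
print(abs(flip_fixed16(value, 14) - value))  # deviation in 16-bit fixed point
```

The same kind of fault moves the floating-point value by roughly 10^38, but the fixed-point value by only 16, because the latter format cannot represent larger magnitudes.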

It has been shown [68] that in an autonomous driving scenario, the DNNs which are used for object detection might output the wrong class for a stop sign, or fail to detect it. This type of mistake can provoke serious safety violations in the form of loss of life or damage to property. Since traditional DNNs have no intrinsic way of detecting and handling this type of situation, it is essential to research possible solutions to this problem.

2.12. Fault tolerance in Artificial Neural Networks

There have been significant efforts to improve the fault tolerance capabilities of ANNs. A diagram with the main categories of current fault tolerance techniques can be seen in Figure 2.11.

Figure 2.11: Diagram with types of fault tolerance applied to ANNs (fault tolerance is divided into active and passive; passive fault tolerance is further divided into explicit redundancy, optimization under constraints and modified learning)

There are two generally accepted types of fault tolerance [110]. The first is active fault tolerance. Systems that employ this type have two components, one for detecting faults and another for controlling them. The main idea is that these systems detect faults as they appear and handle the effects using mechanisms that compensate for those effects. This compensation is achieved by taking the tasks that were being carried out by the faulty components of the neural network and assigning them to non-defective ones.

The second type is passive fault tolerance. In contrast to the previous type, systems that use passive fault tolerance do not directly detect and manage faults. Instead, these systems make architectural changes to the neural network that increase its redundancy, thus indirectly compensating for the fault effects. These changes ensure that the neural network outputs are as expected even when faults occur. In comparison to active fault tolerance, this approach incurs low overhead, since no additional components have to be in place in order for it to work.


There are several categories of passive fault tolerance [110], of which we will mention the three most relevant ones.

The first is explicitly augmenting redundancy. This method works by inserting redundancy in hidden neurons (those belonging to the hidden layers of the ANN) and their respective weights. The starting point is usually an unmodified ANN which is trained on a given training dataset.

After this stage there is a selection of neurons based on a ranking of their sensitivity. The sensitivity of a neuron can be understood as a measure of the change in the output of the network whenever that neuron is changed. Changing the value of a neuron has an impact on the output of the network that is proportional to that neuron’s sensitivity. Highly sensitive neurons are also called critical neurons.
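The ranking step can be illustrated with a minimal sketch. It assumes a tiny one-hidden-layer network with made-up weights, and it approximates the sensitivity of a hidden neuron as the mean absolute change in the network output when a fixed offset `eps` is added to that neuron's activation. The network, the helper names and this particular sensitivity measure are assumptions for illustration and are not taken from [110].

```python
import math

# Hypothetical network: 2 inputs -> 3 hidden neurons (tanh) -> 1 linear output.
W_HIDDEN = [[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]]   # one row per hidden neuron
W_OUT = [2.0, 0.1, -0.5]                             # one weight per hidden neuron


def forward(x, delta=None):
    """Return the network output; delta optionally offsets hidden activations."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_HIDDEN]
    if delta is not None:
        hidden = [h + d for h, d in zip(hidden, delta)]
    return sum(w * h for w, h in zip(W_OUT, hidden))


def sensitivities(samples, eps=1.0):
    """Mean absolute output change when each hidden neuron is shifted by eps."""
    result = []
    for j in range(len(W_HIDDEN)):
        delta = [eps if k == j else 0.0 for k in range(len(W_HIDDEN))]
        change = sum(abs(forward(x, delta) - forward(x)) for x in samples)
        result.append(change / len(samples))
    return result


samples = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores = sensitivities(samples)
# Rank neurons from most to least sensitive; the top ones are the critical neurons.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Because the output layer in this sketch is linear, each score reduces to the magnitude of the neuron's outgoing weight, so neuron 0 is ranked as the most critical one.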

Several techniques are then applied to the neurons with the highest sensitivity. These techniques include spatial redundancy (i.e. duplicating critical neurons) or an even distribution of weights, combined with neuron pruning and the removal of unnecessary weights.
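As a hedged illustration of spatial redundancy, the sketch below duplicates one hidden neuron of a hypothetical network and splits its outgoing weight equally between the two copies. This specific splitting rule is an assumption chosen so that the fault-free output is unchanged; the techniques surveyed in [110] may differ in detail. A fault is modelled as a hidden neuron whose output is stuck at 0.

```python
import math

# Hypothetical network: 2 inputs -> 2 hidden neurons (tanh) -> 1 linear output.
W_HIDDEN = [[0.5, -0.2], [0.1, 0.8]]
W_OUT = [2.0, 0.1]


def forward(x, w_hidden, w_out, dead=None):
    """Forward pass; `dead` is the index of a hidden neuron stuck at 0, if any."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    if dead is not None:
        hidden[dead] = 0.0
    return sum(w * h for w, h in zip(w_out, hidden))


def duplicate_neuron(w_hidden, w_out, j):
    """Copy hidden neuron j and split its outgoing weight between both copies."""
    new_hidden = [list(row) for row in w_hidden] + [list(w_hidden[j])]
    new_out = list(w_out)
    new_out[j] = w_out[j] / 2.0
    new_out.append(w_out[j] / 2.0)
    return new_hidden, new_out


x = [1.0, 0.5]
wh2, wo2 = duplicate_neuron(W_HIDDEN, W_OUT, 0)
clean = forward(x, W_HIDDEN, W_OUT)
# Output error caused by the same fault, without and with duplication.
err_plain = abs(forward(x, W_HIDDEN, W_OUT, dead=0) - clean)
err_dup = abs(forward(x, wh2, wo2, dead=0) - clean)
```

In this sketch the duplicated network produces the same fault-free output as the original one, while losing one copy of the critical neuron only removes half of its contribution.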

The second category is referred to as optimization under constraints. In this category, the training process is turned into an optimization problem, where the constraints are the fault tolerance capacity of the model and the model’s ability to perform the task that is assigned to it. Optimization algorithms are then used in order to find an ANN architecture and parameters that satisfy these restrictions.

Initially, classical optimization formulations such as minimax were used [34], and the solution was found with conventional methods. One issue with this approach was that these methods were too resource-intensive.

One example of research in this category is the use of genetic algorithms to adapt the network’s architecture, as in the work done in [120]. In this case, each network is an individual and the fitness function depends on the global error of the network, i.e. the difference between the actual output and the expected output. The authors conclude that the evolved networks had improved fault tolerance and generalisation capacity compared to networks trained conventionally.
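The idea of treating each network as an individual whose fitness depends on the global error can be sketched as follows. This is a deliberately simplified toy and not the method of [120]: it evolves only the two weights of a single linear neuron rather than an architecture, and the task, population size, mutation scale and elitist selection scheme are all assumptions.

```python
import random

# Toy task (assumed): reproduce y = 2*x0 - x1 with a single linear neuron.
DATA = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]


def global_error(weights):
    """Mean squared difference between actual and expected outputs."""
    total = 0.0
    for x, expected in DATA:
        actual = sum(w * xi for w, xi in zip(weights, x))
        total += (actual - expected) ** 2
    return total / len(DATA)


def evolve(pop_size=20, generations=30, sigma=0.3, seed=0):
    """Elitist loop: keep the better half, refill by crossover plus mutation."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=global_error)           # lower error means fitter
        history.append(global_error(population[0]))
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            # Uniform crossover followed by Gaussian mutation of every gene.
            child = [rng.choice(genes) + rng.gauss(0, sigma) for genes in zip(p1, p2)]
            children.append(child)
        population = survivors + children
    population.sort(key=global_error)
    history.append(global_error(population[0]))
    return population[0], history


best, history = evolve()
```

Since the survivors are carried over unchanged, the best global error recorded in `history` never increases from one generation to the next. A fault-aware variant could evaluate the error while faults are injected, but that extension is not shown here.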

The final category is modified learning. As the name suggests, alterations are made to the model during training that make it more tolerant to faults in the testing phase. There are two subcategories of modified learning.

The first one is inserting regularization into the learning algorithm. Regularization is defined as any modification made to a learning algorithm that is intended to reduce overfitting.

The second subcategory is characterised by adding noise, perturbations or fault injections to the learning algorithm during the training phase. The reasoning behind this is that when some of the internal values of the ANN change, other neurons will tend to compensate for this change. This results in a more evenly distributed set of weights, which leads to an improvement in the fault tolerance of the network.
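The effect described above can be reproduced in a minimal sketch under strong simplifying assumptions: a single linear unit receives the same input over two redundant lines, and a fault is modelled by forcing a line to 0 for one training step with probability `fault_prob`. The model, the fault model and the hyperparameters are hypothetical and only serve to show how injected faults can push the weights towards a more even distribution.

```python
import random


def train(fault_prob, steps=2000, lr=0.02, seed=0):
    """SGD on y = w0*x + w1*x with target 1.0 for x = 1.0.

    With probability fault_prob each input line is forced to 0 for that step,
    a crude stand-in for an injected fault.
    """
    rng = random.Random(seed)
    w = [1.0, 0.0]                      # deliberately uneven starting point
    for _ in range(steps):
        x = [0.0 if rng.random() < fault_prob else 1.0 for _ in w]
        error = sum(wi * xi for wi, xi in zip(w, x)) - 1.0
        # Gradient of the squared error with respect to each weight.
        w = [wi - lr * 2.0 * error * xi for wi, xi in zip(w, x)]
    return w


plain = train(fault_prob=0.0)    # no faults: the uneven solution already fits
robust = train(fault_prob=0.5)   # faults push the two weights towards each other
gap_plain = abs(plain[0] - plain[1])
gap_robust = abs(robust[0] - robust[1])
```

Without fault injection the uneven starting weights already fit the target exactly, so they never change. With fault injection, the second line has to compensate whenever the first one is faulty, and the gap between the two weights shrinks.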

In Section 2.5.5 we discussed the Dropout technique, which belongs to the first type of modified learning. In the following sections we will detail Stimulated Dropout, a technique that fits in the second type of modified learning, and also one recent technique that does not fit within either subcategory, called Ranger.


2.12.1. Ranger

Constraining the weights of ANNs has proven to be useful for improving their fault tolerance in works such as [114], where the magnitude of the weights in the same layer is bounded to a predetermined range.

One recent technique employing this idea is Ranger [28]. As previously mentioned, one of the possible consequences of transient faults is a large deviation in a neuron output value. If the network cannot intrinsically handle this deviation, the error can be propagated through the network and possibly affect the output. Ranger aims to increase the fault tolerance capability of ANNs by restricting the output values of neurons at certain layers in the network, thus preventing errors that occur before those layers from affecting the following layers.

The selection of the layers where Ranger is applied is essential for this technique to have a significant effect. It has been shown empirically [28] that Ranger is most effective when it is applied directly after activation layers.

An example application of Ranger can be seen in Figure 2.12, which is adapted from a diagram in the original paper [28]. It uses the example fault propagation given in Figure 2.10b and demonstrates the effect of Ranger when applied to the network between the input and hidden layers.

As a reminder, when faults cause the output value of a neuron to increase dramatically, this error can propagate through the network and affect neurons in subsequent layers. By restricting the range of values that pass from the input layer to the hidden layer, Ranger dampens the error that occurred in the input layer and stops it from being propagated to the following layers, which prevents the network from misclassifying the sample. If Ranger were applied between every pair of layers, this protection would extend to errors occurring anywhere in the network, not just in the input layer.

Figure 2.12: Ranger

In order to apply Ranger, one first has to derive the restriction bounds. In cases where the activation function of the selected layer is already bounded (e.g. tanh has a range of [−1, +1]), those bounds are used. In cases where the activation function is unbounded (e.g. ReLU has a range of [0, +∞)), the restriction range is obtained by sampling the values of the activation layer while running the model on 20% of the training data. The upper bound is then set to the 99.9th percentile of that set of values.
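The bound derivation and the restriction itself can be sketched as follows for a ReLU layer. This is an illustrative sketch, not the reference implementation of [28]: the nearest-rank percentile rule, the choice of clamping out-of-range values to the bound, the helper names and the synthetic profiling data are all assumptions.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]; the rank rule is an assumption."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def derive_bounds(sampled_activations, q=99.9, lower=0.0):
    """Restriction range for a ReLU layer: [0, q-th percentile of observed values]."""
    return lower, percentile(sampled_activations, q)


def ranger(activations, bounds):
    """Clamp each activation into the restriction range."""
    low, high = bounds
    return [min(max(a, low), high) for a in activations]


# Stand-in for activations recorded while running 20% of the training data.
profiled = [i / 100.0 for i in range(1000)]          # 0.00 .. 9.99
bounds = derive_bounds(profiled)
# A transient fault turns one activation into a huge value; Ranger truncates it.
restricted = ranger([0.3, 1.5e9, 2.0], bounds)
```

Values already inside the restriction range pass through unchanged, while the faulty activation is reduced to the upper bound, which limits how far the error can propagate to the following layers.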
