2.2 Dependable Real-Time Systems
2.2.2 Dependability
Underpinning any research relating to Quality of Service (QoS) is the concept of dependability, which is the “ability to avoid service failures that are more frequent and more severe than is acceptable” [2]. This concept provides metrics for measuring how “dependable” a system is with respect to specific attributes, and defines the basic constructs for understanding system faults and the methods for handling them.
The concept of dependability was initially explored by J. Laprie [99] and later formalised into a taxonomy by Avizienis et al. [2; 98], shown in Figure 2.11. The concepts of availability and reliability are central to defining and evaluating QoS:
Availability refers to the ability of the system to provide the correct service when required. It is considered a measure of the frequency and consistency of the delivery of the correct capability [100]. In terms of service delivery, it captures the ratio of alternation between correct, incorrect, and no service.
Reliability provides the measure of the continuous duration for which the system is able to maintain correct service delivery [100]. A service’s reliability is therefore a measure or a distribution of the mean time between failures.
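As an illustration of the two attributes above, the following minimal sketch computes both from an observed service timeline. The `Interval` type, the state labels, and the function names are hypothetical and are not taken from [100]. Availability is approximated as the fraction of observed time spent in correct service, and reliability as the mean length of the continuous runs of correct service, used here as a simple stand-in for the mean time between failures. A final run that has not yet ended in a failure is counted as-is, which is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    state: str       # "correct", "incorrect", or "none" (no service)
    duration: float  # length of the interval in seconds


def availability(timeline):
    """Fraction of the observed time spent delivering correct service."""
    total = sum(i.duration for i in timeline)
    correct = sum(i.duration for i in timeline if i.state == "correct")
    return correct / total if total else 0.0


def mean_correct_run(timeline):
    """Mean length of the continuous runs of correct service."""
    runs, current = [], 0.0
    for i in timeline:
        if i.state == "correct":
            current += i.duration  # extend the current run
        elif current > 0:
            runs.append(current)   # the run was ended by a failure
            current = 0.0
    if current > 0:
        runs.append(current)       # trailing run, not yet ended by a failure
    return sum(runs) / len(runs) if runs else 0.0


timeline = [
    Interval("correct", 90), Interval("incorrect", 5),
    Interval("correct", 60), Interval("none", 15),
    Interval("correct", 30),
]
print(availability(timeline))      # 180 of 200 seconds are correct
print(mean_correct_run(timeline))  # runs of 90, 60, and 30 seconds
```

Two services with the same availability can differ sharply in reliability: many short outages and one long outage may yield the same ratio but very different run lengths.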
Faults, Errors, and Failure
The need to specify the attributes of dependability arises from the need to tolerate threats, where a threat may be either a fault, error, or failure. There are four means by which threats may be tolerated [2; 100]:
Fault Prevention, also known as dependability procurement, attempts to prevent faults from occurring through robust system design and development.
Fault Removal, or mitigation, occurs once faults have been identified; this would ideally be during development but is often after execution.
Fault Forecasting refers to the process of evaluating the system behaviour with regard to the existence of faults and their activation (discussed below). These methods attempt to predict the number of faults present in the system and their likelihood of becoming activated and causing the system to enter error states.
Fault Tolerance specifically refers to techniques designed to increase the system dependability in the presence of faults during execution. Such mechanisms are designed to inhibit the development of faults into failures, as discussed below. In the context of SOA, and generally in distributed systems, techniques such as recovery blocks [101], N-versioning [57], and N-copy [58] are commonplace. The former two embrace design diversity, with multiple implementations of the same service that can operate either sequentially or concurrently to handle failures. The latter utilises multiple instances of the same service to mitigate failures caused by either the data or the operational context of a given service. These approaches directly inspire the modular and loosely-coupled nature of SOA and are therefore inherently part of it.
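The sequential and concurrent uses of design diversity mentioned under Fault Tolerance can be sketched as follows. This is a minimal illustration, not the formulation given in [101] or [57]: the function names, the acceptance-test signature, and the strict-majority voting rule are assumptions made for the example, and the versions are called one after another rather than truly concurrently. Results are assumed to be hashable so that they can be counted.

```python
import math
from collections import Counter


def recovery_block(alternates, acceptance_test, request):
    """Try each alternate in turn; return the first accepted result."""
    for alternate in alternates:
        try:
            result = alternate(request)
        except Exception:
            continue  # a crashing alternate counts as a failed attempt
        if acceptance_test(request, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def n_version(versions, request):
    """Run every version and return the strict-majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(request))
        except Exception:
            pass  # a crashed version simply casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) / 2:
            return value
    raise RuntimeError("no majority among versions")


# Three hypothetical implementations of an integer square root service.
def faulty(x):
    return x // 2  # deliberately wrong for most inputs


def by_isqrt(x):
    return math.isqrt(x)


def by_float(x):
    return int(x ** 0.5)


def accept(x, r):
    return r * r <= x < (r + 1) ** 2


print(recovery_block([faulty, by_isqrt], accept, 16))   # faulty is rejected
print(n_version([by_isqrt, by_float, faulty], 16))      # faulty is outvoted
```

The recovery block depends on the quality of its acceptance test, whereas the voting scheme needs no oracle but pays the cost of running every version on every request.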
The above method categories are designed to mitigate faults, errors, and failures which must be understood as a chain of activations, as shown in Figure 2.12a:
Faults can be either dormant or active:
Dormant faults merely exist and have yet to be activated. Dormant faults may never be activated and can continue unnoticed.
Active faults are those where an input to a component transitions it to an erroneous state. Fault activation transforms a fault into an error.
Errors can be either internal or external:
Internal errors are handled internally and do not reach the component interface, and so do not propagate to other components.
External errors result from the propagation of the error to the component’s interface and may therefore propagate into another component. In the context of SOA, this may be an incorrect result from one service being passed on to the next service in the workflow.
Failures occur due to the propagation of errors to the system boundary, resulting in the system deviating from correct service operation. Depending on the system and failure type, they may be permanent or transient. In a system of systems, a failure of one system can be regarded as an activated fault within the wider operational context. Figure 2.12b depicts an example of a C2 system, with a mission and goals, where an internal error occurs, activating a fault which propagates up the tree to cause an external error and consequently mission failure. That failure propagates as an external fault to the wider system. In Figure 2.12b there is one dormant fault in a task and one fault that has been activated in another task, which is regarded as an error. That error can propagate up the hierarchy of the system to cause another error and failure at the system level. When this propagates outwards to another system, it becomes a fault within that system.
Figure 2.12: (a) Fault-Error-Failure chain, adapted from [2]; (b) Fault chain propagation through a workflow or C2 system.
Figure 2.13: Failure modes adapted from Avizienis et al. [98] highlighting the modes of interest to this research.
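The fault-error-failure chain described above can be sketched as a small tree walk. This is an illustrative model only: the `Component` type, its `handles_errors` flag, and the task, goal, and mission names are hypothetical and are not taken from [2]. An activated fault becomes an error in its component; the error propagates to the parent unless a component contains it internally, and it becomes a failure once it passes the root, which stands in for the system boundary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    parent: Optional["Component"] = None
    handles_errors: bool = False  # True: errors are contained internally


def activate_fault(origin):
    """Activate a dormant fault in `origin` and trace the resulting error.

    Returns (path, outcome): the components the error passed through, and
    "internal error" if a component contained it, else "failure".
    """
    path = []
    node = origin
    while node is not None:
        path.append(node.name)
        if node.handles_errors:
            return path, "internal error"  # never reaches the boundary
        node = node.parent                 # external error: propagate upwards
    return path, "failure"                 # passed the root component


mission = Component("mission")
goal = Component("goal", parent=mission)
task = Component("task", parent=goal)
print(activate_fault(task))

guarded_goal = Component("goal", parent=mission, handles_errors=True)
guarded_task = Component("task", parent=guarded_goal)
print(activate_fault(guarded_task))
```

In a system of systems, the "failure" outcome of one such tree would be treated as a fault activation in a component of the enclosing system, which repeats the same walk one level up.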
Faults and Failure Types
Faults, errors, and failures can be of various types and can be permanent or transient in nature, where transient faults are caused either by external interactions or by the operational context of the system [2]. For example, software flaws or production defects are permanent, whilst input mistakes are transient. Figure 2.13 depicts a set of failure modes caused by activated faults and their propagation across system boundaries.
This research focusses specifically on physical SOA operational faults that are non-malicious and not caused by human interactions. These are highlighted in Figure 2.13 and relate to fault activations caused by physical deterioration and interference. In particular, this thesis considers failures relating to timing, which can result in the highlighted symptoms.