2. Background
2.2 Dependability
2.2.1 Dependable computing
The idea of a dependable business process, as introduced in Section 1.1, relies on concepts and principles that come from the dependable computing area [IFI]. In the software engineering literature, the term dependability is defined as the ability to deliver service that can justifiably be trusted [ALRL04]. Such a service is delivered by a system: i.e. an entity made of hardware, software and humans. In the business domain, the service is the business process, whereas the system is the business organisation that owns such a process.
It is assumed that systems are not perfect, i.e. they are expected to fail from time to time. A system fails (aka system failure, or simply failure) when there is a service failure: i.e. the delivered service is judged (by a particular judgemental entity [RK07]) as being different from its intended state. As system failures are unavoidable, the challenge is to reduce their frequency and severity. A dependable system, thus, is one that has the ability to avoid service failures that are more frequent and more severe than is acceptable from the judgemental system’s point of view. The term “judgemental system” covers concepts ranging from failure detectors implemented in hardware or software to the legal justice system. Furthermore, since the judgemental system is itself a system, it might also fail (as judged by another higher level judgemental system). In this work, as will be demonstrated later, the role of judgemental system is played by the stakeholders that request the service that is to be provided. An error is the part of a system state that might lead to a failure. The hypothesised cause of the error is a fault. An error does not necessarily lead to a failure as it may be avoided by chance or design, or simply because it does not constitute a fault for the enclosing system.
There exist four general mechanisms to achieve dependability [ALRL04]: fault prevention, fault tolerance, fault removal, and fault forecasting. Fault prevention deals with the objective of avoiding the introduction of faults during the software development process. As such an objective is a part of the general aim of every software development methodology, fault prevention can be considered as an inherent part of it. Thus, good practices in software development (e.g. modularity with low coupling and high cohesion, information hiding, use of strongly typed languages for the specification, design and implementation) help in reducing the number of faults when developing a system. Fault tolerance is aimed at allowing the system to provide the service in spite of the presence of faults. The basic activities required to achieve fault tolerance are error detection (i.e. to identify the presence of an error) and system recovery (i.e. to lead the system to a well-defined state without detected errors from which the system can continue its normal execution). Fault removal deals with uncovering faults that have been introduced at any phase in the development process. Activities covered by this mechanism range from checking the specification of the system to uncover specification faults up to exercising the system (i.e. testing) to uncover development faults. Fault forecasting is aimed at evaluating the behaviour of the system under the occurrence of faults such that it can be concluded which ones would lead to system failure.
The effectiveness of each mechanism depends on the context and nature of the fault. Thus, the dependability of a system can be increased by a combined use of these mechanisms. It is worth mentioning that a totally dependable system (i.e. a perfect system) is impossible to achieve, but it should be the objective that any development process must try to attain. This work takes the view that fault tolerance is a means to achieve dependability, complemented with the other existing ones (i.e. fault prevention, fault removal, and fault forecasting). Based on the assumption that faults cannot be fully avoided or removed, the choice is to enrich the system with means to detect erroneous system states and then to perform the necessary recovery steps that lead the system to a well-defined state (fault tolerance view). Both the error detection tools and the recovery steps are made part of the model that describes the system. This model is produced during the analysis phase of the development methodology that is being followed (fault prevention view). The model not only describes the functional aspects of the system, but also the fault-tolerant ones. Such a model can be used to perform an early evaluation of the system behaviour with respect to its adherence to the expected functional properties as well as with respect to the occurrence of faults (fault forecasting view)5. If such early evaluation reports that the system model either does not adhere to a certain functional property or does not behave as expected when facing a particular fault, then the model has to be corrected because it is faulty (fault removal view). This early evaluation is carried out until the model fulfils both the functional and fault-tolerant aspects. Once this point is reached, the next phase of the development process (i.e. design) begins.
2.2.2 Fault tolerance
Fault tolerance is achieved by error detection and subsequent system recovery. There are two kinds of error detection techniques: concurrent and preemptive [ALRL04]. Systems that allow errors to be detected during the delivery of the normal service are said to support concurrent error detection. Those that detect errors only while running in specific modes or at a particular period of time (e.g. audit and start up, respectively) are said to support preemptive error detection.
5 Modelling and simulating error detection and recovery activities could be used as an effective method to estimate the consequences of a fault.
The purpose of the recovery phase is to lead the system back to a certain state such that it can continue executing. This can be achieved by modifying the system state such that it does not contain errors. This technique is known as error handling. It can be implemented by backward error recovery (sometimes called rollback), forward error recovery (sometimes called rollforward), compensation, or any combination of these. Backward error recovery returns the system to a saved state that existed before the error occurred. This state is assumed to be correct since in the past it allowed the system to be fully operational. Implementations of this technique include checkpoints [EAWJ02], conversations [Ran75], and transactions [GR92].
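As an illustration of backward error recovery, the following minimal sketch saves a checkpoint before an operation and rolls back to it when an error is detected. It is not taken from any of the cited works: the class `CheckpointedState`, the functions `transfer` and `run_with_rollback`, and the account data are hypothetical names introduced only for this example, and the state is assumed to be small enough to copy in memory.

```python
import copy


class CheckpointedState:
    """Hypothetical wrapper that can save and restore a system state."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Save a copy of the current state, which is assumed to be error-free.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward error recovery: discard the current state, restore the saved one.
        self.state = copy.deepcopy(self._checkpoint)


def transfer(accounts, source, target, amount):
    """Move money between accounts; the error is detected after a partial update."""
    accounts[target] += amount           # first update is already applied ...
    if accounts[source] < amount:        # ... when the error is detected
        raise ValueError("insufficient funds")
    accounts[source] -= amount


def run_with_rollback(system, operation, *args):
    """Run the operation; on a detected error, return to the last checkpoint."""
    system.checkpoint()
    try:
        operation(system.state, *args)
        return True
    except ValueError:
        system.rollback()
        return False


system = CheckpointedState({"a": 100, "b": 0})
ok_first = run_with_rollback(system, transfer, "a", "b", 60)    # succeeds
ok_second = run_with_rollback(system, transfer, "a", "b", 60)   # fails, rolled back
print(ok_first, ok_second, system.state)
```

The second transfer leaves the state partially updated when the error is raised; the rollback removes that partial update without needing to know what kind of error occurred, which is the defining trait of this technique.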
Forward error recovery is aimed at leading the system towards a new (i.e. not reached recently) correct state. Reaching such a state is only possible when there exists precise knowledge about the kind of error that has corrupted the system. When the class of error is known, specific activities meant to deal with this particular error are performed. By executing these procedures, reaching the correct new state should be possible, thus allowing the system to resume its operation either as before the error was detected (best case scenario) or in a mode where not all its services are available (aka degraded mode). Forward error recovery is usually achieved by using exception handling mechanisms (EHM) [Cri89, BM00] as they embody concepts (i.e. exception and handler) and capabilities (e.g. detection, control flow transfer, exception and handler categorisation) that make this type of error recovery mechanism easy to implement.
Compensation aims to allow the system execution to progress as expected despite it being in an erroneous state. This technique is implemented under the assumption that an erroneous system state holds enough redundant information to allow the error to be masked. Hardware redundancy includes supplementary and potentially similar hardware in the system, whereas software redundancy includes additional components, i.e. programs, objects or data. Software redundancy is complemented with software diversity to solve the problem of replicated design and implementation faults. N-version programming [Avi85] and recovery blocks [HLMSR74, Ran75] are the original and basic techniques that implement software diversity.
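The recovery block idea mentioned above can be sketched as follows: several independently written alternates are tried in turn, each starting from the same saved state, and the first result that passes an acceptance test is delivered. This is only a simplified illustration under stated assumptions; the names `recovery_block`, `fast_sort`, `simple_sort` and `make_acceptance_test` are hypothetical, and the faulty primary is deliberately contrived.

```python
import copy


class AllAlternatesFailed(Exception):
    """Raised when no alternate passes the acceptance test."""


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate on a fresh copy of the state until one is accepted."""
    for alternate in alternates:
        trial = copy.deepcopy(state)     # every alternate starts from the saved state
        try:
            result = alternate(trial)
        except Exception:
            continue                     # an exception counts as a detected error
        if acceptance_test(result):
            return result
    raise AllAlternatesFailed()


def fast_sort(values):
    # Deliberately faulty primary: it loses the last element.
    return sorted(values)[:-1]


def simple_sort(values):
    # Independently written alternate (insertion sort).
    out = []
    for v in values:
        i = 0
        while i < len(out) and out[i] <= v:
            i += 1
        out.insert(i, v)
    return out


def make_acceptance_test(original):
    def is_acceptable(result):
        ordered = all(a <= b for a, b in zip(result, result[1:]))
        return ordered and len(result) == len(original)
    return is_acceptable


data = [3, 1, 2]
result = recovery_block(data, [fast_sort, simple_sort], make_acceptance_test(data))
print(result)
```

The design fault in the primary is masked because the alternate was written differently; running two copies of the same faulty code would not help, which is why redundancy is paired with diversity.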
Error handling techniques intended to cope with errors such as those previously described (i.e. rollback, rollforward, and compensation) leave the fault untreated. Thus, for a fault that has already led the system to an erroneous state there is clearly a possibility that the fault will continue to produce errors. As the repeated manifestations of a fault can make a system fail despite the efforts of the fault tolerance technique it implements6, it may be necessary to eradicate the fault from the system.
Techniques aimed at removing the fault from the system to avoid it being reactivated are part of the final phase of fault tolerance. This phase is known as fault handling, and it implies identifying the fault (i.e. diagnosing), isolating the faulty element (e.g. component, module, class), reassigning the tasks performed by the faulty element among non-faulty elements (i.e. system re-configuration), and restarting the system.
2.2.2.1 Fault tolerance in distributed real-time systems
As previously stated, the approach to achieving dependability in business processes is centred around fault tolerance. Mechanisms or tools will have to be used in business processes that are collaborative and hold timing constraints. Thus, it is logical to explore how fault tolerance has been successfully applied in the computing domain when engineering “collaborative systems with timing constraints” (known as distributed real-time systems in the computing field).
6 Either because the consequences of the fault become more and more serious, or because the system can no longer provide its service as expected due to the overheads of dealing with recurring errors [AL81].
A “distributed system” [BW01, Lam78] is defined as a system composed of multiple autonomous processing nodes cooperating toward a common purpose or toward achieving a common goal. These autonomous processing nodes communicate with one another by exchanging messages. It is a necessary condition for a distributed system that the message transmission delay is not negligible compared with the time between events in a single processing node (multiprocessor computers are excluded as the message transmission delay is negligible). This definition is compatible with our way of considering collaboration in business processes.
A system is “real-time” [Bur91] if at least one of the computational tasks that it executes is constrained somehow by time. Such a definition indicates that the correctness of the result provided for the task depends not only on its logical value, but also on the time at which it is provided. These timing constraints appear in the requirements specification in the form of deadlines. This definition is compatible with our way of considering timing constraints in business processes.
Distributed systems (whether real-time or not) have the partial-failure property: the occurrence of a failure (of whatever kind) usually affects only a part of the system [Tel94]. Fault tolerance techniques exploit this property to coordinate the execution of the distributed system so that non-faulty processing nodes can take over the activities of those that are failing. Fault-tolerant algorithms based on replication (i.e. systematic compensation for fault masking) are an option since (potentially) every processing node can be used for redundancy purposes. Every system component that needs to be replicated (nothing prevents the entire system from being replicated) can be deployed on one of the processing nodes that compose the distributed system. There also exist fault-tolerant algorithms designed to ensure the correct behaviour of the system (while certain conditions hold) despite failure occurrences [LSP82]. Still others are meant for identifying the kind of failure (even in the case of multiple occurrences) to be able to perform the necessary recovery actions that will lead the system to either its normal behaviour (best scenario) or a graceful degradation, instead of an overall malfunction [CR86]. Such fault-tolerant algorithms have to coexist with scheduling algorithms that take care of the temporal behaviour of the distributed system, when timing constraints need to be satisfied. In this scenario, a trade-off between dependability and performance must be made.
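Replication-based fault masking can be illustrated with a simple majority vote over the answers returned by replicas. The sketch below is an assumption-laden toy, not one of the cited algorithms: replicas are modelled as plain functions rather than processing nodes, a crash is modelled as an exception, and the names `majority_vote`, `run_replicated`, `healthy`, `faulty` and `crashed` are hypothetical. In particular, it does not address the agreement problem treated in [LSP82].

```python
from collections import Counter


def majority_vote(results, total):
    """Return the value reported by a strict majority of all replicas."""
    if not results:
        raise RuntimeError("no replica answered")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= total:
        raise RuntimeError("no majority among replicas")
    return value


def run_replicated(replicas, request):
    """Send the same request to every replica and mask minority failures."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(request))
        except Exception:
            pass  # a crashed replica contributes no vote (partial failure)
    return majority_vote(results, total=len(replicas))


def healthy(x):
    return x * x


def faulty(x):
    return x * x + 1   # produces a wrong value


def crashed(x):
    raise ConnectionError("node unreachable")


masked = run_replicated([healthy, healthy, faulty], 4)
print(masked)
```

With two healthy replicas out of three, the wrong answer is outvoted. With one healthy, one faulty and one crashed replica no strict majority exists, so the sketch raises an error instead of guessing.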
Fault tolerance techniques can also be applied within each processing node that is part of the real-time distributed system. The advantage of this technique is an increase in the dependability of the local computations carried out in the processing node. One method for achieving this is to combine exception handling principles with scheduling practices for providing fault tolerance by means of forward error recovery [dALB05]. This approach categorises processes or activities as primary or alternative tasks: a primary task is one whose execution is required in error-free scenarios, whereas an alternative task (i.e. a handler) is one that must be executed only when some error is detected (i.e. an exception is raised).
Such a categorisation helps in scheduling the tasks. Since an alternative task is expected to run less often than a primary one, different priorities can be assigned to them. It might be decided, for example, that alternative tasks run with higher priorities as a way to increase the system’s tolerance to detected errors.
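The primary/alternative categorisation can be sketched with a priority queue in which a handler is enqueued, at a higher priority, only when its primary task raises an exception. This is a non-preemptive, single-threaded toy and not the scheme of [dALB05]; the priority values, the function `run_tasks` and the task names are assumptions made for this example.

```python
import heapq
import itertools

PRIMARY, ALTERNATIVE = 1, 0   # assumption of this sketch: lower number = higher priority


def run_tasks(tasks):
    """Run (name, primary, alternative) triples and return the execution log."""
    ticket = itertools.count()   # tie-breaker: FIFO order within one priority level
    queue = []
    for name, primary, alternative in tasks:
        heapq.heappush(queue, (PRIMARY, next(ticket), name, primary, alternative))

    log = []
    while queue:
        priority, _, name, action, alternative = heapq.heappop(queue)
        kind = "primary" if priority == PRIMARY else "alternative"
        try:
            action()
            log.append((name, kind, "ok"))
        except Exception:
            log.append((name, kind, "error"))
            if alternative is not None:
                # The handler is queued ahead of every primary task still waiting.
                heapq.heappush(
                    queue, (ALTERNATIVE, next(ticket), name, alternative, None)
                )
    return log


def ok():
    pass


def failing():
    raise RuntimeError("sensor timeout")


log = run_tasks([("A", ok, None), ("B", failing, ok), ("C", ok, None)])
for entry in log:
    print(entry)
```

Because the alternative for B is given the higher priority, it runs before the primary task C, so recovery is not delayed by work that was queued earlier.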
Another alternative is to use the transaction processing paradigm for executing the timing-constrained tasks within a processing node. Turning every task into a transaction allows fault tolerance to be provided by means of backward error recovery (atomicity property, the A of the ACID7 properties granted by the transaction processing paradigm). However, the transaction processing mechanism (whatever it is) embedded into the processing node has to include scheduling aspects so that the number of transactions missing their timing constraints is minimised [AGM92]. Advanced transaction mechanisms, like the one introduced in the next section, can be used when some of the ACID properties need to be relaxed, while keeping others intact. This is the case for long-running transactions (i.e. transactions intended to run over long periods of time) that maintain consistency and durability (C and D), while they relax atomicity and isolation (A and I) [Gra81].
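To illustrate why the order in which transactions are executed affects the number of missed timing constraints, the following sketch compares arrival order with an earliest-deadline-first order on a simulated clock. The choice of earliest deadline first is an assumption made purely for illustration and is not claimed to be the policy of the cited work; the `Transaction` fields, the function names and the sample workload are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    name: str
    duration: int   # simulated time units needed to commit
    deadline: int   # absolute simulated time by which it must commit


def missed_deadlines(transactions):
    """Run transactions back to back in the given order; return those committing late."""
    clock = 0
    missed = []
    for tx in transactions:
        clock += tx.duration          # the transaction commits at this instant
        if clock > tx.deadline:
            missed.append(tx.name)
    return missed


def earliest_deadline_first(transactions):
    """Order transactions by increasing deadline."""
    return sorted(transactions, key=lambda tx: tx.deadline)


arrivals = [
    Transaction("t1", duration=4, deadline=10),
    Transaction("t2", duration=2, deadline=3),
    Transaction("t3", duration=3, deadline=6),
]
fifo_missed = missed_deadlines(arrivals)
edf_missed = missed_deadlines(earliest_deadline_first(arrivals))
print(fifo_missed, edf_missed)
```

For this workload, executing in arrival order makes t2 and t3 commit late, whereas the deadline-ordered schedule meets every deadline. A real mechanism would also have to account for aborts, restarts and data conflicts, which this sketch ignores.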
2.2.3 Coordinated Atomic Actions
Coordinated Atomic Actions [XRR+95] represent a fault tolerance conceptual framework8 used to increase the dependability of distributed and concurrent object-oriented (OO) software systems9.
The abstract concurrent OO computation model on which the coordinated atomic actions conceptual framework (CaaFWrk) is based is defined as a collection of interacting objects (which may or may not be distributed) where the processes (or threads) executing concurrently (i.e. these processes may, but need not, overlap [BA06]) correspond to the executions of operations (which may be associated with a single object or several different ones) on a group of objects. Those objects that own the concurrently executed operations can be considered as active objects, whereas the objects on which operations are applied can be considered as passive objects. The conceptual framework does not make this distinction, as a particular object can behave both as active and passive during the software system execution. What the conceptual framework does assume is that each object executes just one of its operations at a time.
The kernel of the CaaFWrk is an abstraction, which is defined as a generalised form of the atomic action10 concept. This abstraction allows a set of concurrent processes to perform a group of operations on a collection of objects. Therefore, it represents an atomic logic unit (i.e. indivisible and with well-defined boundaries) for the execution of a set of concurrent processes. Furthermore, the atomicity, consistency, isolation and durability (ACID) properties are ensured for those objects being accessed by the processes forming part of the atomic logic unit.
This atomic logic unit also works as a damage confinement area. It prevents errors from spreading to its enclosing context. This is achieved by allowing recovery procedures to be associated with each atomic logic unit. Exceptions and exception handling features are used to identify the presence of an error and eliminate it by putting in place one of the recovery procedures associated with the atomic logic unit being executed. Therefore, exception handling forms the basis of the CaaFWrk to implement fault tolerance. In the context of the CaaFWrk, this atomic logic unit is known as a coordinated atomic action11 (CAA), the concurrent processes are the participants, and the set of operations that each participant performs inside the CAA is known as the role.
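The notions of participant, role and damage confinement can be made concrete with a deliberately simplified sketch. It is not an implementation of the CaaFWrk: participants are modelled as threads, roles as functions, the shared objects as a dictionary, and the atomic unit works on a private copy that replaces the original only on success. The names `run_action`, `ActionFailed`, `debit` and `credit` are hypothetical.

```python
import copy
import threading


class ActionFailed(Exception):
    """Signalled to the enclosing context when the action cannot complete."""


def run_action(shared, roles, handler=None):
    """Run every role in its own thread on a private working copy of `shared`.

    The working copy replaces the contents of `shared` only if all roles
    finish, or if the handler reports that it repaired the working copy.
    """
    working = copy.deepcopy(shared)
    lock = threading.Lock()    # one operation at a time on the working copy
    errors = []

    def participant(role):
        try:
            role(working, lock)
        except Exception as exc:
            errors.append(exc)

    threads = [threading.Thread(target=participant, args=(role,)) for role in roles]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    if errors and (handler is None or not handler(working, errors)):
        raise ActionFailed(errors)   # `shared` is left untouched
    shared.clear()
    shared.update(working)


def debit(objects, lock):
    with lock:
        if objects["buyer"] < 30:
            raise ValueError("insufficient funds")
        objects["buyer"] -= 30


def credit(objects, lock):
    with lock:
        objects["seller"] += 30


accounts = {"buyer": 50, "seller": 0}
run_action(accounts, [debit, credit])        # both roles succeed
try:
    run_action(accounts, [debit, credit])    # debit raises, nothing is committed
except ActionFailed:
    pass
print(accounts)
```

In the second run the credit role does update the working copy, but because the debit role raised an exception and no handler repaired the state, none of those effects reach the enclosing context.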
The CaaFWrk then can be seen as a tool that designers may use for structuring the software system activities (design) in order to meet the user’s requirements (specification). Concurrent
7 The acronym ACID means: Atomicity, Consistency, Isolation and Durability.
8 A technique with its own terminology and strategy to implement fault tolerance.
9 The expression software systems is used to mean the piece of software that satisfies the user’s requirements along with the hardware and environment where it is deployed.
10 Abstraction that allows the execution of a set of operations to be seen as only one indivisible operation.
11