pos1 pos2 pos3 1 2 3 7 8 4 1 TxA TxB 0 0 0 5 5 t
Figure 2.2: Examples of transactions using a valid (txA) and an invalid (txB) snapshot.
2.3
Fault tolerance in distributed systems
Fault tolerance is a basic requirement for any system used in critical applications. The definition of a critical application is broad and even systems used exclusively for leisure can be considered critical if their unavailability may cause financial losses to the provider (e.g., because customers change providers or forgo using the service). Nevertheless, a distributed system is by definition a set of nodes that cooperate to carry out a common task; the more nodes compose the system, the higher is the probability that at least one will fail, potentially compromising the whole system if no special care is taken. In the next subsections, we review fundamental concepts in fault tolerance that will be relevant in later chapters.
2.3.1 Failure model and failure detection
The first important item to be addressed when considering fault tolerance is the failure model. In practice, there are two commonly used failure models. On the one hand, the crash-failure model[VR01] assumes that nodes fail by halting. In addition, when recovered, nodes come back with different ids. Thus, this is equivalent to having permanent faults. On the other hand, in the arbitrary-failure model[VR01], nodes may behave outside their specification in arbitrary ways. They may halt as in the crash failure model, proceed too slow or too fast (timing failure), produce wrong results, and even produce specially-crafted wrong results that can cause other nodes to fail. When choosing a failure model, it is important to choose the narrowest failure model possible as this greatly reduces the complexity and cost of the system. For example, assume that nodes can be arbitrarily slow and excessive slowness should be detected as a failure and handled. This kind of failures clearly includes crash failures (i.e., nodes can be infinitely slow) and, thus, appear to require a broader failure model. However, in the arbitrary-failure model, simple problems like having distributed nodes to agree on a single value (i.e., consensus [CT96]) will require 3 · f + 1 nodes in order to tolerate f failures [LSP82]. Nevertheless, it may be possible to use techniques to transform these timing failures into crash failures. For example, local hardware watchdogs could be used to crash nodes that are slower than a threshold [Fet03]. Using this transformation the crash-failure model could be used and then, f + 1 nodes suffice to solve the
agreement problem stated above. Similarly, value failures caused by hardware errors (e.g., bit flips) can be detected by encoded processing techniques [WF07] before any output is produced. This can be used to effectively transform value failures into crashes. These two examples argue that although simplistic, the crash-failure model is useful in many practical scenarios.
In the portion of this work that addresses fault tolerance, we assume the crash-failure model. Then, to detect crashes we consider a failure detector based on the one proposed by Fetzer [Fet03]. The failure detector works as follows. The protocol considers a system with three nodes and implements a failure detector that is able to detect one failure and never suspects a node that did not fail. Consider initially the system with nodes p1, p2, and p3, each equipped with a watchdog
(either a hardware watchdog or a software watchdog as available in common Linux distributions) and with a local clock with a bounded drift rate ρ from real time. Each watchdog is programmed to force the local nodes to crash (or restart) if the watchdog is not reset before T · (1+ ρ) time units passed. Besides that, nodes have to acquire a lease from at least one of the other nodes to be allowed to reset their watchdogs. Finally, when granting or requesting leases, nodes exchange a list of other nodes they directly granted a lease.
To illustrate how the protocol above detects failures, assume node p1is granted a lease from
p2. If either p1or p2has granted node p3a lease in the last T · (1+ 2 · ρ) time units (according
their local clocks), both p1and p2learn that p3may be still alive. However, if neither p1or p2
granted node p3such a lease, the node crashed (or was forced to crash by its watchdog, as it did
not have a lease). This protocol can be extended to work with more than three nodes by executing multiple concurrent 3-node instances. A node is then considered failed when all the instances that include that node detect it as failed. On the contrary, a node is able to reset its watchdog as long as it is able to acquire a lease in at least one of the instances.
2.3.2 Recovery semantics
A system that is able to recover from failures may offer different types of recovery, depending on the amount of information that is guaranteed to survive a failure. In this work, we focus on precise recovery. Precise recovery means that failures are completely masked with respect to semantics. Results generated after a failure will be exactly the same as if no failures had occurred. The only possible visible effect of a failure is then a slight decrease in performance.
Alternatives to precise recovery are rollback recovery and gap recovery [HBR+05]. Rollback recovery guarantees that no input information is lost, all input events are considered, even in case of failures. This definition implies also that accumulated state is preserved. However, in contrast to precise recovery, the execution after a failure may follow different paths and, thus, achieve different decisions. Finally, gap recovery accepts that inputs and accumulated states are lost due to failures. In this case, after a failure, a system providing gap recovery is allowed to restart with a fresh state and start processing from the latest input.
2.3.3 Active and passive replication
In later chapters, we address fault tolerant approaches for ESP systems. There are two classic approaches for fault tolerance. The first option is to have the relevant state of the application being periodically saved in stable storage. A stable storage is a storage that is able to survive
2.3. FAULT TOLERANCE IN DISTRIBUTED SYSTEMS 25
the failures that are expected in the system. The second option is to have multiple nodes in lock step so that all accumulate the same state. The state will be stable as long as tolerated failures do not affect all nodes simultaneously (i.e., nodes fail independly). These two options describe the basic approaches for fault tolerance, namely, passive [BMST93] and active replication [Sch90], respectively.
Choosing between passive and active replication requires considering several factors. On the one hand, active replication deals with failures by having redundant nodes, named replicas. Replicas repeat computations and output equivalent results. Because outputs from all replicas are equivalent, any of these can be used and the others ignored. Therefore, failures are transparently masked. A clear advantage is that there is no recovery phase. For the same reason, a clear disadvantage is the amount of resources wasted in failure-free runs. Another disadvantage is that computations need to be deterministic as in a state machine. The next state of the computation must depend exclusively on the current state and the current input. Hence, active replication is also known as the state machine replication approach. Requiring determinism affects several common operations like reading wall-time clocks and the use of multithreading (if the result of these operations may affect the state of the replica). In addition, the state transitions are likely to depend on the order of the inputs, requiring totally ordered communication protocols (e.g., atomic broadcast [CASD95, CT96]) to be used to ensure all replicas process the same messages in the same sequence. Ordered broadcasts are much more expensive than simpler ones (e.g., reliable broadcast [VR01, CT96]) as they require coordination among nodes, incurring extra resource costs to the system.
On the other hand, passive replication [BMST93] and, similarly, rollback-recovery ap- proaches [EAWJ02] deal with failures by having the state of the nodes being periodically check- pointed in a stable storage. For example, if nodes are expected to fail only due to software crashes (either from processes or from the operating system) and are expected to recover after failing (e.g., by having a watchdog that reboots the system if it hangs), then the local disk can be considered a stable storage if synchronous writes are used. Alternatively, if, for example, hardware failures may compromise nodes permanently, stable storage requires writing to a non-local storage (e.g., a disk or memory of a remote node).
In addition to checkpoints, passive replication can also use logging between checkpoints to save information that is relevant for replay [EAWJ02]. When a checkpoint is taken the logs can be cleared. For example, nodes may do periodic checkpoints of the complete state and, between checkpoints, log the messages they process, possibly together with any nondeterministic decisions taken, such as clock values read and scheduling decisions. On recovery, nodes restore the latest checkpoint and then replay messages from the log.
A clear advantage of passive over active replication is that there is no redundant work being done and, thus, less resources are used. In addition, nondeterministic decisions are allowed as long as they can be saved to the log and enforced on replay. As disadvantage, passive replication requires a recovery phase that restores the checkpoint and replays logs, which can take considerable time.