State of the Art in Termination Detection in Distributed Systems
2.1 Introduction and Background
Sometimes one needs to ascertain whether a condition holds for a distributed system when the condition cannot be judged locally but requires global knowledge of the state of the system. Distributed termination detection is one example of this problem; other examples include distributed deadlock detection, distributed garbage collection and distributed debugging.
Distributed termination detection (hereafter referred to as DTD) is a fundamental problem in distributed computing. It is a classical problem of distributed control, and it is considered to be of practical, algorithmic, theoretical and methodical importance [227].
The termination detection problem is related to the more general problem of detection of global predicates, a fundamental problem in debugging and monitoring [16].
In general, a distributed system can be viewed as a set of autonomous processes which cooperate with each other to compute a task. To coordinate computation and exchange data, processes may communicate with each other by message-passing. Termination detection refers to determining whether the system has entered a silent state in which all processes are idle and no further computation can take place [233].
The difficulty of detecting such a state depends on the nature of the distributed system, but it is usually non-trivial because of varying processor speeds, unpredictable message delivery delays and the absence of a global clock. The distributed termination problem was first identified by [78], and has since inspired considerable research interest, as reflected in the literature (e.g. [77, 231]).
DTD is closely related to other important problems such as deadlock detection [178, 44], garbage collection [227, 228] and snapshot computation [46]. Indeed, for garbage collection, [227] has shown that the semantics of the termination detection problem are fully contained in the garbage collection problem and that, with appropriate program transformations, solutions for the garbage collection problem can be applied to termination detection and vice versa.
Applications of Distributed Termination Detection. DTD has many applications:
It plays an important role in the diffusion computation [78, 233] and the distributed work pool models which are commonly used in distributed and parallel computation [7]. The work pool or task pool model is characterized by a dynamic mapping of tasks onto processes for load balancing, in which any task may potentially be performed by any process. There is no desired preassignment of tasks onto processes [7]. The mapping may be centralized or decentralized. [139] observes that, in a work pool, if the work is generated dynamically and a decentralized mapping is used, then a termination detection algorithm is required so that all processes can detect the completion of the entire program (i.e., exhaustion of all potential tasks) and stop looking for more work.
Furthermore, the terminated status of a distributed system is among the stable states (such as global communication deadlock, token loss) that should be known for system administration [233]. It has also been shown that termination detection schemes can be applied to solve other distributed computing problems such as deadlock detection, checkpointing [64], and crash recovery among others.
2.1.1 Overview and terminology
A distributed algorithm terminates when it reaches a terminal state, a configuration in which no further event is applicable.
Techniques have been developed to make termination explicit by distributively detecting that the program has reached a terminal configuration. These are the techniques we set out to explore in this section.
A very informal problem statement can be formulated as follows: given a network of N nodes, implement a distributed termination detection algorithm. Each node can be in either the active or the passive state. Only an active node can send messages to other nodes; each message sent is received some time later. After receiving a message, a passive node becomes active; the receipt of a message is the only mechanism that triggers a passive node's transition to activity. For each node, the transition from the active to the passive state may occur spontaneously. The state in which all nodes are passive and no messages are on their way is stable: the distributed computation is said to have terminated. The purpose of the algorithm is to enable one of the nodes, say node 0, to detect that this stable state has been reached.
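The node model in the problem statement above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions: the names `simulate` and `globally_terminated`, the random scheduler and the `send_budget` cap (which keeps the run finite) are hypothetical and not taken from any cited algorithm. The termination check here is omniscient, i.e. it inspects the whole system at once, which is exactly what a real node cannot do and why a detection algorithm is needed.

```python
import random
from collections import deque

def globally_terminated(active, in_transit):
    """Omniscient check: every node is passive and no message is in transit."""
    return not any(active) and not in_transit

def simulate(num_nodes=4, send_budget=10, seed=0):
    """Run the active/passive node model until the stable terminated state."""
    rng = random.Random(seed)
    active = [False] * num_nodes
    active[0] = True              # assumption: node 0 starts the computation
    in_transit = deque()          # destination ids of messages not yet received
    sends_left = send_budget      # hypothetical cap so that the run is finite
    steps = 0
    while not globally_terminated(active, in_transit):
        running = [i for i, is_active in enumerate(active) if is_active]
        events = []
        if in_transit:
            events.append("deliver")
        if running:
            events.append("passivate")
            if sends_left > 0:
                events.append("send")
        event = rng.choice(events)
        if event == "deliver":
            dest = in_transit.popleft()
            active[dest] = True   # receipt is the only way to become active
        elif event == "passivate":
            active[rng.choice(running)] = False   # spontaneous transition
        else:
            sender = rng.choice(running)          # only active nodes may send
            others = [j for j in range(num_nodes) if j != sender]
            in_transit.append(rng.choice(others))
            sends_left -= 1
        steps += 1
    return steps, active, in_transit
```

Calling `simulate(num_nodes=4, send_budget=10, seed=1)` returns once all nodes are passive and the message queue is empty. Because the state is stable, no further event is possible after that point.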
Definition of termination detection. Consider this informal definition by [161]:
A distributed computation is considered globally terminated if every process is locally terminated and no messages are in transit. Locally terminated can be understood to be a state in which the process has finished its computation and will not restart unless it receives a message.
Consider a formalisation given by [233], summarised here:
A distributed system consists of a set of processes S = {P1, P2, ..., Pn} which cooperate with each other to complete a job. Processes can communicate with each other by message-passing. Logically, from each Pi to each Pj there is a communication channel Ci,j. A process may switch between two states: active and idle. A process that is performing some computation is said to be in the active state. An active process is free to send and receive messages and may become idle spontaneously. In the idle state, a process does not perform any computation, but can passively receive messages, on which event it becomes active immediately and starts computing. For distinction, the computation carried out and the messages transmitted by the system are called basic computation and basic messages respectively.
The distributed system is said to be terminated iff (i) Pi is idle and (ii) Ci,j is empty for all 1 ≤ i, j ≤ n (condition (ii) is necessary because message delays are unpredictable and any hidden message will wake up the system later). When the system has terminated, no process can become active and perform any further computation. To detect such a state, extra messages, called control messages, are sent, or extra information is attached to basic messages. This is the distributed termination detection problem [233].
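The role of condition (ii) can be made concrete with a short sketch. It is illustrative only: the function names `is_terminated` and `deliver`, the 0-based process indices and the dictionary representation of the channels Ci,j are assumptions introduced here, not part of the formalisation in [233].

```python
def is_terminated(idle, channels):
    """Conditions (i) and (ii): all processes idle and all channels empty.

    idle[i] is True when process i is idle; channels[(i, j)] holds the
    basic messages currently in the channel from process i to process j.
    """
    all_idle = all(idle)                                              # (i)
    all_empty = all(len(msgs) == 0 for msgs in channels.values())     # (ii)
    return all_idle and all_empty

def deliver(i, j, idle, channels):
    """Deliver the oldest message in channel (i, j); the receiver wakes up."""
    message = channels[(i, j)].pop(0)
    idle[j] = False        # an idle process becomes active on receipt
    return message

n = 3
idle = [True] * n                                  # condition (i) holds
channels = {(i, j): [] for i in range(n) for j in range(n)}
channels[(0, 2)].append("basic message")           # a hidden message in transit

looks_quiet = all(idle)                            # True: every process is idle
terminated_before = is_terminated(idle, channels)  # False: (ii) is violated
deliver(0, 2, idle, channels)                      # the hidden message arrives
woke_up = not idle[2]                              # True: the system is active again
```

Checking only the process states would wrongly report termination in this configuration, since the pending message later reactivates its receiver.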
The following definition therefore follows: