approaches and protocols that enable distributed applications to recover to a consistent state from saved checkpoints. This is achieved by either optimistic or pessimistic message logging, and by coordinated or uncoordinated checkpoints. The goal of checkpoint-and-recovery algorithms is to give distributed applications the means to survive failures by rolling their state back to a previously globally consistent state. They usually assume the existence of stable storage that survives failures. Strom and Yemini introduced the notion of optimistic recovery. They define it as a technique based on dependency tracking that avoids the domino effect while allowing the computation, the checkpointing, and the "committing" to proceed asynchronously. Their approach requires analyzing the existing checkpoints and computing a globally safe recovery line, which can be done in either a centralized or a distributed fashion. Furthermore, their approach assumes that upon rollback the execution continues on the same path as before, so that messages logged since the checkpoint can be replayed. This holds for most of the subsequent work based on optimistic logging [32, 41]. It is, however, incompatible with speculative execution, since processes may take a different execution path once the speculation is aborted.
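The log-and-replay idea above can be illustrated with a minimal sketch (an illustration only, not the protocol from the cited work): messages are logged to a stand-in for stable storage before being processed, and recovery rolls back to the checkpoint and replays them, which reproduces the pre-failure state only if execution is deterministic and follows the same path.

```python
import copy

class LoggedProcess:
    """Pessimistic message logging: each delivered message is logged to
    (simulated) stable storage before it is processed, so that after a
    rollback to the last checkpoint the same messages can be replayed."""

    def __init__(self):
        self.state = 0
        self.checkpoint_state = 0
        self.message_log = []          # stands in for stable storage

    def checkpoint(self):
        self.checkpoint_state = copy.deepcopy(self.state)
        self.message_log.clear()       # entries before the checkpoint are obsolete

    def deliver(self, msg):
        self.message_log.append(msg)   # log first (pessimistic), then process
        self.state += msg

    def recover(self):
        """Roll back to the checkpoint and replay logged messages in order.
        This reproduces the pre-failure state only if the process re-executes
        the same path -- the assumption that breaks under speculation."""
        self.state = copy.deepcopy(self.checkpoint_state)
        for msg in self.message_log:
            self.state += msg

p = LoggedProcess()
p.deliver(1)
p.checkpoint()
p.deliver(2)
p.deliver(3)
saved = p.state        # state just before the "failure"
p.state = -999         # simulate a crash corrupting volatile state
p.recover()
assert p.state == saved
```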
Hardware fault tolerance: This involves providing supplementary backup hardware such as CPUs, memory, hard disks, and power supply units. Hardware fault tolerance can only support the hardware itself by providing a basic hardware backup system; it cannot stop or detect errors, accidental interference with programs, program errors, and the like. In hardware fault tolerance, computer systems are built that automatically resolve faults arising from hardware components. This technique often partitions the node into units that act as fault-containment areas, with each module backed by defensive redundancy, so that if one module fails, the others can take over its function. There are two approaches to hardware fault recovery: fault masking and dynamic recovery.
Within a banking environment such failures are not acceptable, so fault-tolerant mechanisms must be employed to reduce the susceptibility of a given system to failure. In undertaking this research, potential application-component failures in an existing distributed system were identified, and an architecture to support the development of fault-tolerant distributed application components using CORBA, a distributed object middleware specified by the OMG, was designed and implemented. Central to this architecture is the OMG's CORBA Trading Service as the mechanism to advertise and manage service offers for fault-tolerant application components. Finally, the quality of service (QoS), in terms of scalability and performance, of the suggested architecture was assessed.
Biased Random Sampling and Active Clustering are algorithms that support heterogeneity only partly: as system diversity increases, their efficiency degrades, which suggests that these techniques are better suited to less diverse and less dynamic environments. ACO is a dynamic load-balancing technique that does not support heterogeneity but provides fault tolerance and is easy to apply to Internet, cloud, or grid computing systems. DLT provides job migration for load balancing only and does not consider faulty scenarios, while FTDLB and VM Mapping shift jobs to other nodes both to achieve load balancing and, upon the occurrence of a fault, to improve system performance. O-Ring has lower implementation complexity than FTDLB and thus, with less overhead, reduces implementation cost and performs faster. VM Mapping carries higher overhead than DDFTP, since the former has a resource-monitoring system as a core element.
Plush is a generic, distributed application management infrastructure that provides a set of "building-block" abstractions for specifying, deploying, and monitoring distributed applications. The building-block abstractions are part of an extensible application specification language that Plush uses to define customized flows of control for distributed application management. By defining a custom set of blocks for each application, developers can specify exactly how their application should be executed and monitored, including important behaviors related to failure recovery for application-level errors detected by the Plush monitoring service. Plush also provides advanced support for multi-phased applications (which are common in grid environments) through a set of relaxed synchronization primitives in the form of partial barriers. Partial barriers help applications cope with "straggler" computers by automatically detecting bottlenecks and remapping uncompleted computations as needed. Applications that use partial barriers achieve improved performance in failure-prone environments, such as wide-area testbeds.
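The semantics of a partial barrier can be sketched as follows (a minimal illustration in the spirit of the description above, not the Plush API): waiters are released once a quorum of participants has arrived or a timeout expires, and late stragglers are reported so their work can be remapped.

```python
import threading
import time

class PartialBarrier:
    """Relaxed barrier sketch: release waiters once `quorum` of `expected`
    participants have arrived, or after `timeout` seconds, whichever comes
    first. Participants that never arrive show up as stragglers whose
    uncompleted work a manager could remap."""

    def __init__(self, expected, quorum, timeout):
        self.expected, self.quorum, self.timeout = expected, quorum, timeout
        self.arrived = set()
        self.cond = threading.Condition()

    def enter(self, node_id):
        deadline = time.monotonic() + self.timeout
        with self.cond:
            self.arrived.add(node_id)
            self.cond.notify_all()
            # Wait until quorum is reached or the deadline passes.
            while len(self.arrived) < self.quorum:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                self.cond.wait(remaining)

    def stragglers(self):
        with self.cond:
            return set(range(self.expected)) - self.arrived

# 4 of 5 nodes reach the barrier; quorum of 4 releases them without node 4.
barrier = PartialBarrier(expected=5, quorum=4, timeout=1.0)
threads = [threading.Thread(target=barrier.enter, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
late = barrier.stragglers()   # node 4 never arrived; its work can be remapped
```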
Distributed Knowledge Management Systems (DKMS) often depend on the Semantic Web Peer-to-Peer (SW-P2P) model because of its support for the autonomy of knowledge nodes, ease of accessibility, and scalability. The susceptibility to failure experienced during knowledge retrieval has been a concern for SW-P2P. This paper presents a fault-tolerance system to resolve this problem for DKMS. The architecture of this design consists of five components: Replication Manager (RM), Fault Detector (FD), Fault Notifier (FN), Recovery Mechanism (RMe), and Global Control Monitor (GCM). The design adopts a dynamic replication strategy and a group-constitution procedure to guarantee knowledge availability on knowledge nodes. The dynamic replication strategy creates and deletes replicas based on changes in the DKMS environment, while the group-constitution procedure improves the efficiency of the fault-recovery process by selecting the best available replica within a knowledge-service group. The fault-tolerance system's execution cycle was run on a set of virtual machines (VMs) using VMware Workstation version 7.0.1, and the Java programming language was used to implement the grouped and ungrouped replicas. Sample data of varying magnitude, in the ranges of 225 kilobytes to 512 kilobytes and 450 kilobytes to 512 megabytes, were tested at different time intervals on both the grouped and ungrouped replicas at a knowledge-retrieval threshold between 0.85 and 0.9. The results showed a reduction in the average response time of the grouped replicas, measured at 34 milliseconds and 68.2 milliseconds, against the ungrouped replicas, estimated at 53 milliseconds and 107.2 milliseconds, respectively. The effect of this reduction in response time is that the grouped-replica approach was faster than the ungrouped-replica approach.
In addition, the grouped replicas occupied less memory space because they do not need to store replicas on the active knowledge peer when recovering from failure. This result showed that the system guarantees the fault tolerance of each knowledge node in a DKMS.
Abstract: In terms of network performance, graceful degradation means that, as failures of network devices increase, performance declines gradually rather than with a dramatic fall in throughput. Achieving graceful performance degradation, especially in cloud data center networks, means the design must be fault tolerant. A fault-tolerant data center network should provide alternative paths from source to destination during failures so that there is no abrupt fall in performance. This is increasingly difficult given the growth of Internet-based applications, big data, and the Internet of Things, which has led to several ongoing research efforts to find the most suitable design to alleviate poor fault tolerance and poor graceful performance degradation in cloud data centers. Fat-tree (FT) interconnections have been the most popular topologies used in data centers due to their path diversity and good fault-tolerance characteristics. In this paper, we propose a Reversed Hybrid architecture derived from a more generalized fat-tree structure, the Z-fat tree, and compare it to a fat-tree network with the same amount of resources for client-server communication patterns such as HTTP and e-mail applications in a cloud data center. The results with faulty links show that our proposed Reversed Hybrid outperforms the fat tree. Based on the level of graceful performance degradation achieved, we conclude that fault tolerance in a data center cannot be realized only by adding extra hardware to the network; rather, bespoke design plays a greater role.
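The path diversity that makes fat trees attractive can be quantified for the classic three-tier k-ary fat tree (a standard textbook property; the Z-fat tree and the Reversed Hybrid of this paper generalize this structure and are not modeled here):

```python
def fat_tree_paths(k, same_pod, same_edge_switch=False):
    """Number of shortest equal-cost paths between two hosts in a
    classic k-ary fat tree (k pods, k/2 aggregation switches per pod,
    (k/2)^2 core switches)."""
    if same_edge_switch:
        return 1                 # one hop up and down the shared edge switch
    if same_pod:
        return k // 2            # one path per aggregation switch in the pod
    return (k // 2) ** 2         # any aggregation switch, then any core above it

# k = 4 fat tree: 4 equal-cost paths between hosts in different pods,
# so single link failures leave alternative routes available.
assert fat_tree_paths(4, same_pod=False) == 4
assert fat_tree_paths(4, same_pod=True) == 2
```

This redundancy is what lets throughput fall off gradually as links fail, rather than collapsing once a single path is lost.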
Since distributed computing comprises various software and hardware components, it is heterogeneous in nature, which leads to issues of scalability, availability, and fault tolerance. We examined various techniques for achieving fault tolerance in distributed computing for better performance. The accuracy of the system was evaluated using node mobility, node failure, and finally the rate of fault tolerance. The fault-tolerance approaches discussed in this paper are reliable techniques, and their performance can be further improved towards achieving high reliability. There is considerable research scope in minimizing the recovery time of existing techniques and in implementing dynamically adaptable techniques. We leave the issue of integration in distributed computing as future work.
Comparing the effect of CPU speed on performance, as Figures 7 and 8 show, our approach yields the lowest delay time and the highest number of processes under all classes of CPU speed and under various configuration parameters. As can be seen from the figures and their values, the proposed approach is applicable in all three classes. It shows improvement in delay time in classes A, B, and C of 79%, 81%, and 84%, respectively. In terms of the number of processes, the proposed approach provides a higher number, with improvements of 9%, 19%, and 25% in classes A, B, and C, respectively. This improvement comes from considering the processing capabilities, in addition to the reliability and memory size of hosts, when ranking them as the best hosts to which software components are redeployed or replicated. The host-ranking equations examine the processing capabilities of each host, so that hosts are ranked to provide the best host in terms of memory size and processing power. Systems deployed using the proposed approach exhibit lower delay times, so more processes are executed. As can be seen, the improvement level increases and reaches its highest values in class C, which is expected due to
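A host ranking of the kind described can be sketched as a weighted score over normalized CPU speed, memory size, and reliability (the weights and the linear form here are assumptions for illustration; the paper's actual ranking equations are not reproduced):

```python
def rank_hosts(hosts, w_cpu=0.5, w_mem=0.25, w_rel=0.25):
    """Illustrative host ranking: each host contributes normalized CPU
    speed, normalized memory size, and a reliability value in [0, 1];
    a higher score marks a better redeployment/replication target."""
    max_cpu = max(h["cpu"] for h in hosts)
    max_mem = max(h["mem"] for h in hosts)

    def score(h):
        return (w_cpu * h["cpu"] / max_cpu
                + w_mem * h["mem"] / max_mem
                + w_rel * h["reliability"])

    return sorted(hosts, key=score, reverse=True)

hosts = [
    {"name": "h1", "cpu": 2.4, "mem": 8,  "reliability": 0.99},
    {"name": "h2", "cpu": 3.2, "mem": 16, "reliability": 0.90},
    {"name": "h3", "cpu": 1.6, "mem": 4,  "reliability": 0.95},
]
best = rank_hosts(hosts)[0]["name"]   # fastest, largest host wins here
```

Including CPU speed in the score, rather than ranking on memory and reliability alone, is what drives the reported delay-time reductions in the faster CPU classes.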
As illustrated in the timing diagram in Figure 5.4, communication via CAN consumes much more time than the control-action calculation. The bottleneck of such a distributed system therefore appears to lie in communication, a consequence of the fact that CAN bus communication is not as fast as processor instruction execution. This imposes potential limits on the maximum applicable sampling frequency, restricting the suitability of such a system for fast hard-real-time control loops. This hardware and software configuration was used in the Embedded Systems realization project to determine the maximum possible sampling frequency. At relatively low CAN speeds, the system's bottleneck is determined by CAN communication latency. However, as the CAN speed is gradually increased, at a certain point the system's performance becomes bounded by the highest sampling frequency obtainable from the encoder used. As explained in Section 4.3.3, the way in which the encoder is used enables sampling frequencies of up to 1 kHz. The results of this experiment for several lower CAN speeds are summarized in Table B.2 in Appendix B. The results should be viewed with skepticism, since the measurements were done in the absence of any other traffic on the CAN bus.
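The communication-bound regime can be estimated from first principles (a back-of-the-envelope sketch using the standard CAN 2.0A frame format; the frames-per-period count below is an assumed example, not a figure from the experiment):

```python
def can_frame_time(bitrate, data_bytes=8, worst_case_stuffing=True):
    """Transmission time of one standard (11-bit identifier) CAN frame:
    47 + 8n bits of fixed fields, data, and interframe space; worst-case
    bit stuffing adds floor((34 + 8n - 1) / 4) bits, i.e. 135 bits total
    for an 8-byte frame."""
    bits = 47 + 8 * data_bytes
    if worst_case_stuffing:
        bits += (34 + 8 * data_bytes - 1) // 4
    return bits / bitrate

def max_sampling_freq(bitrate, frames_per_period):
    """Upper bound on the control-loop sampling frequency when each
    period must carry `frames_per_period` CAN frames on the bus
    (e.g. one sensor reading plus one actuator command)."""
    return 1.0 / (frames_per_period * can_frame_time(bitrate))

# Two 8-byte frames per period at 125 kbit/s: the bus, not the CPU,
# caps the loop well below the encoder's 1 kHz limit.
f_slow = max_sampling_freq(125_000, 2)   # ~463 Hz, communication-bound
f_fast = max_sampling_freq(500_000, 2)   # ~1852 Hz, now encoder-bound
```

This matches the qualitative observation above: below some bit rate the CAN latency dominates, and past it the 1 kHz encoder becomes the binding constraint.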
A Byzantine fault is an arbitrary failure occurring in a distributed environment that can cause heavy damage to the system. The word "Byzantine" refers to the Byzantine generals problem, classically formulated in terms of the generals of an army. Tolerating f Byzantine faults requires at least 3f + 1 nodes in the distributed environment. For example, Dynamo uses tens of thousands of servers located in many data centers around the world to build a storage back-end for Amazon's S3 storage service and its e-commerce platform [11, 13]. Byzantine faults can be tolerated using distributed Byzantine quorum system techniques to provide security in database storage [2, 14]. Such a quorum system can be integrated with a protocol for proactive recovery of servers [3, 18]. However, Byzantine faults cannot be completely eradicated, nor always fully understood; they can be tolerated by certain techniques, but these are not sufficient today. Byzantine fault-tolerant algorithms will be increasingly important in the future because malicious attacks and software errors are increasingly common and cause faulty nodes to exhibit arbitrary behavior [4, 19]. In distributed systems, Byzantine failure describes the worst possible failure semantics, in which any type of error can occur [5, 17]. The distributed environment uses a membership service, which periodically notifies the other system nodes of membership changes [6, 16]. The membership service operates dynamically: it allows only authorized services to communicate with the group of systems in the distributed environment. To reduce Byzantine faults in the distributed environment, predefined constraints are provided in the distributed system, and the membership service admits only services that satisfy these constraints.
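The 3f + 1 bound and the matching quorum sizes can be made concrete with a small arithmetic sketch (standard Byzantine quorum sizing, not a reconstruction of the cited systems):

```python
def bft_cluster_size(f):
    """Minimum number of replicas needed to tolerate f Byzantine faults."""
    return 3 * f + 1

def bft_quorum_size(f):
    """Quorum size 2f + 1 out of 3f + 1 replicas: any two quorums then
    intersect in at least f + 1 replicas, so they always share at least
    one correct (non-Byzantine) replica."""
    return 2 * f + 1

n, q = bft_cluster_size(1), bft_quorum_size(1)   # 4 replicas, quorums of 3
# Any two quorums of size q among n replicas overlap in at least 2q - n
# replicas; for f = 1 that overlap is 2, i.e. f + 1.
assert 2 * q - n == 1 + 1
```

The intersection guarantee is what lets a reader of one quorum detect values forged by up to f Byzantine replicas that wrote through another quorum.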
Evolution during service life is mandatory, particularly for long-lived systems. Dependable systems, which continuously deliver trustworthy services, must evolve to accommodate changes, e.g., new fault-tolerance requirements or variations in available resources. The addition of this evolutionary dimension to dependability leads to the notion of resilient computing. Among the various aspects of resilience, we focus on adaptivity. Dependability relies on fault-tolerant computing at runtime, applications being augmented with fault-tolerance mechanisms (FTMs). As such, on-line adaptation of FTMs is a key challenge towards resilience. In related work, on-line adaptation of FTMs is most often performed in a preprogrammed manner or consists in tuning some parameters. Besides, FTMs are replaced monolithically: all the envisaged FTMs must be known at design time and deployed from the beginning. However, dynamics occur along multiple dimensions, and developing a system for the worst-case scenario is impossible. Based on runtime observations, new FTMs can be developed off-line and integrated on-line. We denote this ability as agile adaptation, as opposed to preprogrammed adaptation. In this thesis, we present an approach for developing flexible fault-tolerant systems in which FTMs can be adapted at runtime in an agile manner, through fine-grained modifications that minimize the impact on the initial architecture. We first propose a classification of a set of existing FTMs based on criteria such as fault model, application characteristics, and necessary resources. Next, we analyze these FTMs and extract a generic execution scheme that pinpoints the common parts and the variable features among them. Then, we demonstrate the use of state-of-the-art tools and concepts from the field of software engineering, such as component-based software engineering and reflective component-based middleware, for developing a library of fine-grained adaptive FTMs.
We evaluate the agility of the approach and illustrate its usability through two examples of integration of the library: first, in a design-driven development process for applications in pervasive computing and, second, in a toolkit for developing applications for WSNs.
There are two basic forms of run-time fault tolerance: forward recovery (which includes failure masking and redundancy-based fail-over) and backward recovery (which includes checkpointing) [e.g., Laprie90]. A run-time failure of a system is in fact the result of a series of events, sometimes of very complex and unexpected interactions. Typically, either at design time or at execution time, a developer or researcher makes an error, or the underlying infrastructure (including invoked services) fails for some reason. An initial design error can become a fault in the initial product. This fault may propagate (as a series of defects) to the final executable version of the workflow. When the workflow encounters that defect during execution, the workflow enters an error state. If that error state becomes visible to the end user, it becomes a failure that may have anywhere from
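Backward recovery via checkpointing can be sketched in a few lines (an illustrative toy, under the assumption that the error is detected before the corrupt state escapes): snapshot the state periodically, and on a detected error roll back and retry.

```python
import copy

class CheckpointedTask:
    """Backward-recovery sketch: periodically snapshot state; on a detected
    error, roll back to the last snapshot and retry instead of failing."""

    def __init__(self):
        self.state = {"step": 0, "acc": 0}
        self._snapshot = copy.deepcopy(self.state)

    def checkpoint(self):
        self._snapshot = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._snapshot)

    def run_step(self, value, fail=False):
        self.state["step"] += 1
        if fail:                       # error detected mid-step
            raise RuntimeError("transient fault")
        self.state["acc"] += value

task = CheckpointedTask()
task.run_step(10)
task.checkpoint()
try:
    task.run_step(20, fail=True)       # fault leaves the step counter corrupt
except RuntimeError:
    task.rollback()                    # backward recovery: discard bad state
    task.run_step(20)                  # retry succeeds
```

Forward recovery, by contrast, would mask the error with redundant results (as in the TMR voting discussed later in this survey) rather than re-executing from a past state.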
In a DCG, fault-tolerance techniques support the grid nodes for the successful deployment of computational grids. Grids are also usually loosely connected, often in a decentralized network, rather than contained in a single location as the computers in a cluster often are. Decentralized techniques overcome the limitations of a centralized system. The main objective of this strategy is to reduce the time needed to find the best substitute for a failed node, and various fault-tolerance strategies are used for best-node selection. The PFTJS algorithm follows the maxim "prevention is better than cure": by considering faulty nodes when scheduling jobs, this technique minimizes faults and reduces ART and AWT in the grid. One proposed fault-tolerant model checkpoints tasks immediately to lightly loaded computing nodes. The 'perfect information algorithm' selects a node with minimum finish time and migrates a task to it. The grid environment is extremely unpredictable: processor capabilities are different and usually unknown, computers may connect and disconnect at any time, and their speeds may change over time. In a grid, the vertices of the graph are reassigned colors based on the grid environment. Load balancing is an important concept in the grid: a good load balancer compares the capacities of the nearest-neighbor nodes (the heaviest and lightest nodes) before transferring load. Our model consists of parameters (node parameters, communication parameters, and QoS parameters) whose values are either static or dynamic, depending on the grid environment. Based on these parameters, nodes are classified as identical, more efficient, or less efficient in terms of performance and capabilities. In case of a node failure, the selection of its alternates is based on this classification.
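The classification-based substitute selection described above can be sketched as follows (an illustration only: the single `capacity` figure stands in for the model's node, communication, and QoS parameters, and the tolerance and preference order are assumptions):

```python
def classify(candidate, failed, tol=0.05):
    """Classify a candidate node relative to the failed node by capacity."""
    ratio = candidate["capacity"] / failed["capacity"]
    if abs(ratio - 1.0) <= tol:
        return "identical"
    return "more_efficient" if ratio > 1.0 else "less_efficient"

def best_substitute(candidates, failed):
    """Prefer identical nodes, then more-efficient ones, then less-efficient
    ones; within a class, prefer the capacity closest to the failed node's
    to avoid wasting strong nodes on small workloads."""
    order = {"identical": 0, "more_efficient": 1, "less_efficient": 2}
    return min(
        candidates,
        key=lambda c: (order[classify(c, failed)],
                       abs(c["capacity"] - failed["capacity"])),
    )

failed = {"name": "n0", "capacity": 1.0}
candidates = [
    {"name": "n1", "capacity": 0.7},
    {"name": "n2", "capacity": 1.02},   # within tolerance -> "identical"
    {"name": "n3", "capacity": 1.6},
]
sub = best_substitute(candidates, failed)
```

Precomputing this classification is what lets the strategy keep the time to find a substitute for a failed node low.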
The second strategy returns from the called function as soon as the error is detected. The raised exception is caught after giving a warning or taking some corrective action in the catch block, which helps prevent error propagation. Using this aspect, every call to a critical function is secured under a try/catch block to ensure effective fault tolerance against an erroneous state. As can be seen in the diagram below, the exit_main pointcut expression matches join points of the run-time main() function, whereas the caller_return pointcut expression matches every call to contextMethod(). Moreover, the exit_main and caller_return pointcut expressions are associated with an around advice to implement error handling. The tjp proceed() call executes the run-time main() and called functions inside the try block. The advice block of the catch handler identifies exceptions raised as a result of inappropriate changes in the rate of a signal or data value. Once the exception is identified, a recovery mechanism is initiated that assigns new values to the signal or data variables based on previous trends, i.e., the history of the variable.
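The around-advice pattern described here can be approximated in Python with a decorator (an analogy only: the decorator plays the role of the woven AspectC++ advice, and the function names and fallback logic below are hypothetical, not taken from the paper):

```python
import functools

def fault_tolerant(recover):
    """Python analogue of the around advice: wrap every call to a critical
    function in try/except, and on an exception run a recovery action
    instead of letting the error propagate to the caller."""
    def advice(fn):
        @functools.wraps(fn)
        def around(*args, **kwargs):
            try:
                return fn(*args, **kwargs)       # plays the role of tjp->proceed()
            except ValueError:
                return recover(*args, **kwargs)  # catch-block recovery
        return around
    return advice

def use_last_good_value(signal):
    """Hypothetical recovery action: fall back to a value based on history
    (here simply a fixed safe default)."""
    return 0.0

@fault_tolerant(recover=use_last_good_value)
def context_method(signal):
    if abs(signal) > 100:                        # implausible rate/value
        raise ValueError("signal out of plausible range")
    return signal

ok = context_method(5.0)      # normal path returns the signal
recovered = context_method(1e6)   # erroneous state recovered, not propagated
```

The key property mirrored from the aspect-oriented version is that the critical function itself stays free of error-handling code; the wrapping is applied from the outside.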
Soft errors are transient errors that cause incorrect operation of a digital circuit, and algorithms are used to reduce them. In this paper, the STR and quad algorithms are compared through power analysis. Shrinking device sizes lead to an increase in soft errors, and fault-tolerance techniques cannot be designed with 100% accuracy without increasing design cost. TMR is the most popular SET mitigation technique: single-event transients (SETs) have a major impact on circuit operation and must be treated properly. Triple modular redundancy is a fault-tolerance technique in which each logic element is triplicated.
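The TMR principle can be shown with a bitwise majority voter (a behavioral sketch of the standard voter logic, not a model of the STR or quad circuits compared in the paper):

```python
def tmr_vote(a, b, c):
    """Majority voter for triple modular redundancy: the expression
    (a AND b) OR (b AND c) OR (a AND c) computes the per-bit majority,
    so a single soft error in any one of the three replicated outputs
    is masked by the other two."""
    return (a & b) | (b & c) | (a & c)

correct = 0b1011
flipped = correct ^ 0b0100       # an SET flips one bit in one module
assert tmr_vote(correct, correct, flipped) == correct
```

Note the limits implied in the text: the voter masks a single faulty module, but simultaneous errors in two modules, or an error in the voter itself, defeat it, which is why 100% accuracy is unattainable without further cost.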
Software fault tolerance demands additional tasks such as error detection and recovery through executable assertions, exception handling, and diversity- and redundancy-based mechanisms. These mechanisms do not come for free; rather, they introduce additional complexity on top of the core functionality. This paper presents lightweight error detection and recovery mechanisms based on the rate of change in signal or data values. Maximum instantaneous and mean rates are used as plausibility checks to detect erroneous states and recover from them. These plausibility checks are exercised in a novel aspect-oriented software fault-tolerant design framework that reduces the additional logical complexity. A case study based on a Lego NXT robot demonstrates the effectiveness of the proposed design framework.
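A rate-of-change plausibility check of the kind described can be sketched as follows (the multiplier on the mean rate is an assumed tuning parameter for illustration; the paper's exact thresholds are not reproduced):

```python
def plausible(prev, curr, dt, max_rate, history, mean_factor=3.0):
    """Accept a new sample only if its instantaneous rate of change stays
    below the absolute maximum AND below `mean_factor` times the mean rate
    observed so far; accepted rates are appended to `history`."""
    rate = abs(curr - prev) / dt
    if rate > max_rate:
        return False                                  # violates max rate
    if history and rate > mean_factor * (sum(history) / len(history)):
        return False                                  # implausible jump vs. mean
    history.append(rate)
    return True

history = []
assert plausible(0.0, 1.0, 0.1, max_rate=50.0, history=history)      # 10/s, ok
assert plausible(1.0, 2.0, 0.1, max_rate=50.0, history=history)      # 10/s, ok
assert not plausible(2.0, 9.0, 0.1, max_rate=50.0, history=history)  # 70/s > max
```

On a rejected sample, a recovery step would substitute a value extrapolated from the history rather than abort, keeping the check lightweight compared with full redundancy-based mechanisms.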
Future Work The immediate next step is to apply the theory to a wider spectrum of examples, namely systems using replicas with state and systems employing fault-tolerance techniques such as lazy replication; we postulate that the existing theory should suffice. Other forms of fault contexts that embody different assumptions about failures, such as fault contexts with dependencies between faults, could be explored. Another avenue worth considering is extending the theory to deal with link failure and the interplay between node and link failure. In the long run, we plan to develop a compositional theory of fault tolerance, enabling the construction of fault-tolerant systems from smaller component subsystems. For all these cases, this work should provide a good starting point.
M-DART extends the DART protocol to discover multiple routes between the source and the destination. In this way, M-DART is able to improve the tolerance of a tree-based address space to mobility as well as to channel impairments. Moreover, the multi-path feature also improves performance for static topologies, thanks to route diversity. M-DART has two novel aspects compared to other multi-path routing protocols [6-7]. First, the redundant routes discovered by M-DART are guaranteed to be communication- and coordination-free, i.e., discovering and announcing them through the network does not require any additional communication or coordination overhead. Second, M-DART discovers all the available redundant paths between source and destination, not just a limited number.