To improve the fault tolerance of SRMs, both hardware and software measures can be employed to achieve fault-tolerant operation. On the software side, artificial neural networks, genetic algorithms, and time-dynamic models have been developed for controlling the SRM under normal and abnormal conditions [20]. A model-free fuzzy controller is proposed in [21] to improve the performance of the SRM under fault. On the hardware side, multi-phase SRMs have been developed to improve fault tolerance [22], [23]; an axial-flux five-phase SRM is developed for EV applications [23]. To improve the reliability of the SRM drive system, a dual-channel SRM is developed and a fault-tolerance strategy for it is discussed in [24]. A modular-stator SRM is proposed in [25], which makes it convenient to replace a faulty winding. In [26], a double-layer-per-phase isolated SRM is investigated, which not only improves fault-tolerant capability but also achieves higher torque and lower noise. Furthermore, a segmental-stator SRM with modular construction is proposed for fault-tolerant applications [27]. In [28], two extra switching devices are added to the traditional asymmetric half-bridge topology of an 8/6 SRM drive to reduce the impact of a fault; however, the proposed fault-tolerance topology lacks a modular structure and accommodates only a limited range of fault conditions. A decentralized phase-driving topology is developed in [29] to make full use of the independence of the phase windings, although it requires a large number of asymmetric half-bridge modules. For operation without position sensors, a fault-tolerance control strategy is proposed in [30] to deal with open-phase operation. In [31], a fault-tolerance scheme is proposed for an SRM drive fed by a three-phase bridge inverter with star-connected phase windings. The fault characteristics of the SRM drive under short-circuit and open-circuit faults are analyzed in detail in [32], and a fault-tolerant control method is presented; however, running with a missing phase under open-circuit faults, and advancing the turn-off angle under short-circuit faults, still cause large torque ripple.
The increasing demand for flexibility and scalability, for dynamically obtaining and releasing computing resources in a cost-effective and device-independent manner, and for hosting applications without the burden of installation and maintenance has led to wide adoption of the cloud computing paradigm. The growing popularity of this paradigm as an attractive alternative to classic information processing systems has increased the importance of its correct and continuous operation, even in the presence of faulty components. Among the many open issues in cloud computing [22], such as fault tolerance, workflow scheduling, workflow management, and security, fault tolerance is one of the most important for researchers to address. Cloud computing remains vulnerable to a large number of system failures, and as a result there is increasing concern among users regarding the reliability and availability of cloud services. To minimize the impact of failures on the system and on application execution, failures should be anticipated and proactively handled. Fault-tolerance techniques are used to predict these failures and to take appropriate action before failures actually occur.
K. Mohammadi and H. Hamidi implemented fault-tolerant execution of mobile agents in distributed systems [5]. Reliable execution of the mobile agent is a key issue in building a mobile agent system. Their design offers the user transparent fault tolerance that can be activated only on request or when the task needs it. To build a fault-tolerant mobile agent system they use FATOMAS, which is based on Java. There are two approaches to modelling fault-tolerant mobile agents executing in a distributed system: place-dependent and agent-dependent. In the place-dependent approach, fault tolerance is integrated into the places the agent visits, whereas in the agent-dependent approach the fault tolerance provided by some protocol travels with the agent. They implemented their system on an asynchronous distributed system, and to provide a flexible and adaptive system they change the agent domains dynamically.
Abstract— Grid computing, most simply stated, is distributed computing taken to the next evolutionary level. The goal is to create the illusion of a simple yet large and powerful self-managing virtual computer out of a large collection of connected heterogeneous systems sharing various combinations of resources. However, certain aspects of the grid computing environment reduce the efficiency of the system; job scheduling of resources and fault tolerance are the key aspects for improving efficiency and exploiting the capabilities of emergent computational systems. Because of the dynamic and distributed nature of the grid, traditional scheduling methodologies are inefficient at effectively utilizing the available resources. The proposed fault-tolerance strategy improves the performance of the overall computational grid environment. In this paper we propose efficient job scheduling, replication, and checkpointing to improve the efficiency of the grid environment. The simulation results illustrate that the proposed strategy effectively schedules grid jobs and reduces execution time.
Independent checkpointing mechanisms deal with process dependency in two ways. Pessimistic independent checkpointing [19] requires that each process logs the changes since its last checkpoint whenever it sends or receives a message. As dependencies between processes arise only from the messages passed between them, this ensures that rolling one process back to its previous checkpoint will not affect dependent processes. Optimistic independent checkpointing requires that dependencies are explicitly recorded somewhere in the system, so that when a process rolls back, dependent processes are informed appropriately and possibly also rolled back [6, 11, 20]. The pessimistic approach places more restrictions on a process's autonomy in checkpointing and may require more checkpointing than optimistic approaches; optimistic mechanisms, on the other hand, incur more overhead during rollback. However, it should be noted that no single mechanism is universally applicable. The suitability of algorithms differs with application type: different sets of algorithms exist for batch-processing, shared-memory, and MPI-based applications. In this paper we restrict ourselves to the fault-tolerance requirements of the Web Services architecture.
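As a concrete illustration of the pessimistic rule described above, the following minimal sketch (with hypothetical class and method names; the cited works do not prescribe this code) forces the change log to stable storage at every send and receive, so that a local rollback needs no cooperation from peers:

```python
import copy

class PessimisticProcess:
    """Sketch of pessimistic independent checkpointing: changes since
    the last checkpoint are forced to stable storage at every send or
    receive, so a local rollback never drags in dependent processes."""

    def __init__(self):
        self.state = {}            # application state
        self.checkpoint = {}       # last checkpoint image
        self.pending = []          # changes not yet logged
        self.stable_log = []       # stands in for stable storage

    def update(self, key, value):
        self.state[key] = value
        self.pending.append((key, value))

    def _force_log(self):
        # The pessimistic rule: persist changes before communicating.
        self.stable_log.extend(self.pending)
        self.pending.clear()

    def send(self, channel, message):
        self._force_log()
        channel.append(message)

    def receive(self, channel):
        self._force_log()
        return channel.pop(0)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.stable_log.clear()    # log is subsumed by the checkpoint

    def recover(self):
        # Restore the checkpoint and replay the stable log; peers are
        # unaffected because every message they saw was preceded by a
        # forced log entry here. Unlogged changes are safely lost: no
        # other process ever observed them.
        self.state = copy.deepcopy(self.checkpoint)
        for key, value in self.stable_log:
            self.state[key] = value
```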
two nodes into topology optimization with fault-tolerance requirements. For different fault-tolerance requirements, graph-based simulation results indicated that FICTC outperforms state-of-the-art fault-tolerant topology control algorithms in terms of average node degree, maximum transmit radius, and maximum link interference. Furthermore, the ns-2 simulations showed that FICTC achieves higher throughput and lower end-to-end (E2E) delay. In terms of energy consumption, we used the average expended energy ratio (EER) to evaluate the energy performance of the proposed FICTC; the results indicated that FICTC achieves a lower EER than the other algorithms. These results demonstrate that the proposed solutions are promising for specific fault-tolerance requirements in practical network deployments. However, several problems remain for further research on fault-tolerance-and-interference-aware topology control: the proposed FICTC does not analyze network performance under a sophisticated radio signal propagation model, and the algorithm still needs to handle mobile WMNs. In future work we plan to investigate the network performance for these problems.
Tactics for fault-tolerance detection, recovery, and prevention are shown in Figure 1. Monitors are components that can observe many parts of a system, such as processors, nodes, and network congestion; they may use heartbeat or ping-pong protocols to monitor remote components in distributed systems. In message-passing distributed systems, timestamps can be used to detect or re-order incorrect event sequences. Most existing work on fault tolerance for HPC systems is based on checkpointing or rollback recovery. Checkpointing methods (Chandy & Lamport, 1985) periodically save a global or semi-global state to stable storage. Global checkpointing does not scale, and although asynchronous checkpointing approaches have the potential to scale to larger systems, they face difficult challenges such as rollback propagation, domino effects, and loss of integrity through incorrect rollback in dependency graphs (Elnozahy et al., 2002). Algorithmic-level fault tolerance is an alternative, higher-level approach; examples include leader election algorithms, consensus through voting and quorums, or simply ignoring failures by smoothing over missing results with probabilistic accuracy bounds.
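To make the heartbeat tactic concrete, here is a minimal sketch of a heartbeat-based failure detector; the class name, timeout value, and component identifiers are illustrative assumptions rather than anything from the cited works:

```python
import time

class HeartbeatMonitor:
    """Declares a component suspect when no heartbeat has arrived
    within `timeout` seconds. Note this is only a failure *detector*:
    it cannot distinguish a crashed node from a slow network."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_beat = {}       # component id -> last heartbeat time

    def heartbeat(self, component):
        # Called whenever a heartbeat message arrives from `component`.
        self.last_beat[component] = time.monotonic()

    def suspects(self):
        now = time.monotonic()
        return [c for c, t in self.last_beat.items()
                if now - t > self.timeout]

# Illustrative use:
monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("node-1")
monitor.heartbeat("node-2")
time.sleep(0.1)
print(monitor.suspects())        # [] while both nodes are fresh
```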
Agent- and core-based approaches to fault tolerance can be incorporated within parallel simulations in the area of molecular dynamics. However, which of the two approaches, agent or core intelligence, is the most appropriate? The decomposition techniques considered above establish dependencies between blocks of atoms and between individual atoms, and the degree of dependency affects how a sub-job is relocated and reinstated in the event of a core failure. The dependencies of an atom in the simulation can be based on the input it receives from neighbouring atoms and the output it propagates to them. The intensity of numerical computation and the amount of data managed by a core vary with the number of atoms allocated to the core and the time step of the simulation. Large simulations that extend over long periods of time generate, and need to manage, large amounts of data; consequently the process size on a core will also be large.
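As a toy illustration of relocating a failed core's sub-job, the sketch below reassigns the orphaned block of atoms to the least loaded surviving core; the function name and the load-only heuristic are assumptions, since, as the paragraph notes, a real scheduler would also have to weigh the dependency degree between blocks:

```python
def relocate_failed_block(allocation, failed_core):
    """Reassign the block of atoms owned by `failed_core` to the least
    loaded surviving core. `allocation` maps core id -> list of atom
    ids. Cross-block dependencies (inputs/outputs exchanged between
    neighbouring atoms) are deliberately ignored in this sketch."""
    orphaned = allocation.pop(failed_core)
    if not allocation:
        raise RuntimeError("no surviving cores to host the sub-job")
    # Least-loaded heuristic: pick the core with the smallest block.
    target = min(allocation, key=lambda core: len(allocation[core]))
    allocation[target].extend(orphaned)
    return target

# Illustrative use: core 1 fails, its atoms move to the lightest core.
alloc = {0: [0, 1, 2, 3], 1: [4, 5, 6], 2: [7, 8]}
print(relocate_failed_block(alloc, failed_core=1), alloc)
```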
This paper proposes the design of a fault-tolerance technique that consists of two layers. The first is a special layer derived from other known techniques in a way that exploits their effective characteristics while keeping the complexity of the system to a minimum; this layer can be named 2-Version Software with Acceptance Test Support. The other is a general layer that can be used with the software fault-tolerance technique proposed in the first layer, or with any other software fault-tolerance technique. The second layer proposes the design of a software fault-tolerance mechanism that focuses on unconventional (intelligent) ways of recovering the system from design faults, while also allowing the system operator to intervene in the recovery process. The developed mechanism is used to support the operation of conventional software fault-tolerance techniques.
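The paper's design details are not reproduced at this point, but a minimal sketch of what a 2-Version Software with Acceptance Test scheme could look like follows; the function names and the tolerance used in the acceptance test are illustrative assumptions:

```python
def two_version_with_acceptance_test(primary, secondary, acceptance_test, x):
    """Run the primary version; if its result fails the acceptance test
    (or it raises), fall back to the secondary version. This combines
    the recovery-block idea (an acceptance test) with two-version
    redundancy, as the layer's name suggests."""
    for version in (primary, secondary):
        try:
            result = version(x)
        except Exception:
            continue                       # treat a crash as a failed test
        if acceptance_test(x, result):
            return result
    raise RuntimeError("both versions failed the acceptance test")

# Illustrative use: two square-root implementations, one faulty.
def primary(x):   return x ** 0.49        # deliberately faulty version
def secondary(x): return x ** 0.5

def acceptance_test(x, r):
    return abs(r * r - x) < 1e-6 * max(1.0, x)

print(two_version_with_acceptance_test(primary, secondary,
                                       acceptance_test, 2.0))
```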
Abstract This paper challenges the common assumption that swarm robotic systems are robust and scalable by default. We present an analysis, based on both reliability modelling and experimental trials, of a case-study swarm performing team work in which failures are deliberately induced. Our case study has been carefully chosen to represent a swarm task in which the overall desired system behaviour is an emergent property of the interactions between robots, so that we can assess the fault tolerance of a self-organising system. Our findings show that in the presence of worst-case, partially failed robots the overall system reliability quickly falls with increasing swarm size. We conclude that future large-scale swarm systems will need a new approach to achieving high levels of fault tolerance.
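The paper's reliability model is not reproduced here; as a minimal toy illustration of why reliability can fall with swarm size, suppose a single worst-case partially failed robot is enough to anchor the emergent team behaviour, and that each robot fails that way independently with probability p (an assumed figure):

```python
# Toy illustration (not the paper's model): if any single partially
# failed robot can anchor the emergent team behaviour, and each robot
# fails that way independently with probability p, then
#   R(n) = (1 - p) ** n
# and overall reliability falls as the swarm grows.
p = 0.01                                  # assumed per-robot failure rate
for n in (5, 10, 50, 100, 500):
    print(f"swarm size {n:4d}: R = {(1 - p) ** n:.3f}")
```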
Abstract: Big Data is a term for data sets so huge and complicated that traditional data processing applications cannot cope with them. MapReduce is a programming model for processing and generating these huge data sets; programs written in this model are automatically parallelized and run on a large cluster of commodity machines. MapReduce implements fault tolerance at the task level, so when a task fails the whole task is re-executed from the beginning. For long-running tasks the cost of even a single task failure is therefore very large and cannot be avoided, and the effect of failures on performance can be considerable. As a result, fault tolerance is significant for MapReduce. This paper summarizes checkpointing methods for fault tolerance in MapReduce.
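As a minimal sketch of the task-level checkpointing idea such methods build on (the names, the record-offset granularity, and the interval are illustrative assumptions, not any particular system's API), a task can persist its progress periodically and resume from the last saved offset instead of re-executing from scratch:

```python
def run_task_with_checkpoints(records, process, checkpoint, interval=1000):
    """Process `records` sequentially, saving the current offset every
    `interval` records. On restart, resume from the saved offset instead
    of re-executing the whole task (plain MapReduce would start over).
    `checkpoint` is a dict standing in for durable storage."""
    start = checkpoint.get("offset", 0)
    for i in range(start, len(records)):
        process(records[i])
        if (i + 1) % interval == 0:
            checkpoint["offset"] = i + 1   # persist progress
    checkpoint["offset"] = len(records)

# Illustrative use: the second run would resume where the first "died".
ckpt, seen = {}, []
def process(r):
    if r == 2500:                          # simulated task failure
        raise RuntimeError("worker lost")
    seen.append(r)

try:
    run_task_with_checkpoints(range(5000), process, ckpt)
except RuntimeError:
    pass
print("restart from offset", ckpt.get("offset", 0))   # 2000, not 0
```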
the grid computing environment. Fault tolerance is very important in grid computing, and the GridSim toolkit makes it straightforward to model. A fault-tolerance framework for grids called MAG (Mobile Agents for Grid Computing Environments) is presented in [10]; MAG is a research project that explores the mobile agent paradigm as a way to overcome the design and implementation challenges of developing grid middleware. In this framework, all services register with the grid information service, and all resources are transferred between client and server. The grid system environment maintains a broker [17-18]; the client and the server are the main players in job submission and job execution. A computational grid needs the hardware and software infrastructure to provide consistent, pervasive, and inexpensive access to high-end computational capabilities [8]. The main advantages of this fault tolerance are network independence and flexibility.
A theme running through this survey has been that, as fault tolerance (or resilience), that is, dynamic defences, exists in all kinds of systems, the measures that may be appropriate for studying it also belong to similar categories, and the difficulties in defining measures, performing measurements, and predicting the values of measures likewise belong to common categories. Interest in studying and/or extending the use of fault tolerance or resilience has expanded of late in many areas, and we can all benefit from looking at problems and solutions from different technical areas. I gave special attention to the “resilience engineering” area of study, since its choice of topic problems highlights extreme versions of the measurement and prediction problems about the effectiveness of “resilience” that exist in the ICT area. In all these areas there are spectra of prediction problems, from the probably easy to the intractable. The “resilience engineering” movement has raised important issues related to the measurement and prediction of “resilience” attributes. One is simply the recognition of the multi-dimensionality of “resilience”. For instance, Westrum [27] writes: “Resilience is a family of related ideas, not a single thing. [. . . ] A resilient organization under Situation I will not necessarily be resilient under Situation III [these situations are defined as having different degrees of predictability]. Similarly, because an organization is good at recovery, this does not mean that the organization is good at foresight”.
Decisions become far more difficult for highly dependable systems in critical applications. Some safety-critical applications use multiple-version (custom-built) systems or similar high-cost forms of fault tolerance. The cost of developing multiple versions is justified by the high cost of failure and the consequent very low failure probabilities required. But there may be no statistical knowledge for judging the advantage achieved, because the concern is with very rare, and thus seldom observed, failures, which are nonetheless important because they may lead to serious accidents. Examples include civil aviation: Airbus has used diversity in software and hardware [32], and Boeing in hardware [44]. Other applications include various railway signaling and control systems [40, 45, 46, 34, 47]. Fault tolerance is applied on top of stringent and expensive precautions for fault avoidance; for neither set of precautions is there a quantified assessment of effectiveness, e.g., in terms of the mean number of dangerous failures avoided per year. Fault tolerance is an added “safety margin” in the operation of already very dependable systems. This is similar to “safety margins” in other areas of engineering: a baseline design is demonstrated to be satisfactory under certain assumptions, and the safety margin is a defense against all kinds of unanticipated departures of the real world from those assumptions. It is impossible to quantify this protection with any precision, but it is believed necessary to err, if anything, on the side of excessive prudence. These industrial sectors adopted software fault tolerance when first introducing software in safety-critical roles, as they faced the need to preserve the good safety record previously achieved with non-software technologies that were thought to be better understood.
The second strategy returns from the called function as soon as the error is detected. The raised exception is caught after issuing a warning or taking some corrective action in the catch block; this helps prevent error propagation. Using this aspect, every call to a critical function is secured under a try/catch block to ensure effective fault tolerance against an erroneous state. As the diagram below shows, the exit_main pointcut expression joins the main() run-time function, whereas the caller_return pointcut expression joins every call to contextMethod(). Both the exit_main and caller_return pointcut expressions are associated with an around advice that implements the error handling. tjp->proceed() executes the run-time main() and the called functions inside the try block. The advice block of the catch handler identifies exceptions raised as a result of inappropriate changes in the rate of a signal or data value; once the exception is identified, the recovery mechanism is initiated, assigning new values to the signal or data variables based on previous trends or the history of the variable.
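The aspect itself is written in AspectC++ and its source is not reproduced here; the following Python sketch mimics the around-advice pattern with a decorator, wrapping each call to a critical function in try/except and recovering from the variable's history as described above (all names and the recovery rule are illustrative assumptions):

```python
import functools

def around_error_handler(history):
    """Decorator playing the role of the around advice: run the wrapped
    call inside try/except (the tjp->proceed() step) and, on failure,
    recover with a value taken from the variable's history."""
    def advice(critical_function):
        @functools.wraps(critical_function)
        def wrapper(*args, **kwargs):
            try:
                result = critical_function(*args, **kwargs)  # "proceed"
            except ValueError:
                # Recovery: fall back on the last known good value
                # (a stand-in for trend/history-based estimation).
                result = history[-1]
            history.append(result)
            return result
        return wrapper
    return advice

signal_history = [0.0]

@around_error_handler(signal_history)
def context_method(raw):
    # Hypothetical critical function: rejects implausible rate changes.
    if abs(raw - signal_history[-1]) > 10.0:
        raise ValueError("inappropriate change in signal rate")
    return raw

print(context_method(1.5))    # normal reading: 1.5
print(context_method(99.0))   # faulty reading: recovered as 1.5
```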
Replication is the most basic way to achieve fault tolerance, but it still faces many challenges. Maintaining replica consistency is a typical problem: fault tolerance will obviously fail if replicas are in different states while servicing a request. Another problem is network partition: there is no support for the consistent re-merging of the replicas of CORBA objects following a network partition [3]. This paper focuses on these two problems and introduces a voting mechanism, CRVM, to meet the challenges.
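CRVM is not specified in detail at this point; the sketch below shows only the basic majority-vote step that any such voting mechanism rests on, with illustrative names and reply values, so that a replica whose state has diverged is outvoted rather than trusted:

```python
from collections import Counter

def majority_vote(replies):
    """Accept a value only if a strict majority of replica replies
    agree; divergent or partitioned replicas are outvoted rather than
    trusted."""
    if not replies:
        raise ValueError("no replica replied")
    value, count = Counter(replies).most_common(1)[0]
    if count * 2 > len(replies):
        return value
    raise RuntimeError("no majority: replicas are inconsistent")

# Illustrative use: one replica has diverged, the majority masks it.
print(majority_vote(["committed", "committed", "aborted"]))  # committed
```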
Abstract: Wireless sensor networks have diverse applications such as environmental monitoring, scientific data collection, and battlefield surveillance. The sensor nodes of a wireless sensor network are deployed in all kinds of environments, and the network infrastructure comprises the network itself and the sensor nodes. Reliability is the prime issue in wireless sensor networks; it is affected by errors and faults that occur due to various hardware and software issues. Faults occur persistently both in wireless sensor networks and in traditional wireless networks. If any node fails due to an abnormal condition, communication is obstructed: failures of communication in wireless sensor networks may be caused by faulty nodes or broken links. The concept of fault tolerance enables the wireless sensor network to find and even out such errors, and the self-healing capability of the wireless sensor network makes its nodes more reliable. Keywords: Fault Tolerance, Reliability, Sensor Nodes, Wireless Sensor Networks.
Nowadays the demand for accessing web-based applications has grown enormously, as everything is available on the Internet. Providers of sensitive applications keep their resources safe from unauthorized access by using the single sign-on technique. Under this technique, if a user gives irrelevant information in a particular session, he may be asked to sign on again to continue the session, irrespective of whether the user is a sensitive user. This paper proposes a new strategy that classifies the sensitive users and allows them to continue the session even after a mistake (fault) that can be tolerated to some level. The proposed method focuses on fault identification and classification in order to retain the sensitive user, assigning a tolerance level for accessing the web application. Users are classified based on their access level, by setting the tolerance level for each type of fault identified. An efficient algorithm is proposed for handling fault tolerance in the web application more efficiently. In future, this fault tolerance could be extended with AI-based techniques and multi-level security to improve the performance of online transactions.
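The proposed algorithm is not reproduced here; the following sketch, in which the fault classes, per-fault costs, and budget are all hypothetical, shows one way a tolerance-level policy of the kind described could behave:

```python
# Hypothetical fault classes and per-fault costs (illustrative values).
FAULT_COST = {"typo": 1, "wrong_field": 2, "invalid_credential": 5}

class SessionPolicy:
    """Keep a classified (sensitive) user signed on while the faults
    committed in the session stay within the assigned tolerance budget;
    otherwise force a fresh single sign-on."""

    def __init__(self, tolerance_budget):
        self.budget = tolerance_budget

    def record_fault(self, fault_type):
        self.budget -= FAULT_COST.get(fault_type, 5)  # unknown = severe
        return "continue session" if self.budget >= 0 else "re-authenticate"

# Illustrative use: a sensitive user gets a budget of 4 points.
policy = SessionPolicy(tolerance_budget=4)
print(policy.record_fault("typo"))                # continue session
print(policy.record_fault("wrong_field"))         # continue session
print(policy.record_fault("invalid_credential"))  # re-authenticate
```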
Network fault tolerance appears to be a critical topic for research [11], and suitable mechanisms intended to guarantee fault tolerance have been developed for SDN. FatTire [12] is a language for writing fault-tolerant SDN programs in terms of paths through the network and explicit fault-tolerance requirements. The main features of FatTire are its use of fast-failover OpenFlow mechanisms and its correct behaviour during periods of failure recovery. It focuses on the data plane, including fault tolerance on the path between source and destination and for intermediate functions such as IDS and firewall. Compared to our fault-tolerance framework, FatTire is based on the NetCore compiler and is not compatible with controllers such as OpenDaylight (ODL). Coronet [13], on the other hand, proposes a fault-tolerant system for the NOX SDN controller that protects against data-plane link/switch failures and is based on Dijkstra's shortest-path algorithm. Ravana [14] is a fault-tolerant SDN controller platform, evaluated on the Ryu controller, that supports fault tolerance for both controllers and switches. SMaRtLight [15] is a practical fault-tolerant SDN controller in which primary and backup (master/slave) controllers are used to replicate the SDN controller. LegoSDN [16] is able to tolerate SDN application failures, while AFRO [17]