Fault tolerance using whole-process migration and speculative execution

MCC uses formal reasoning about programs and runtime safety checks to ensure that programs compiled with MCC are safe: they will not attempt to access illegal areas of memory or use values of inappropriate types. MCC's safety properties allow mutually untrusting systems to exchange and evaluate code, and reduce the need for explicit security boundaries in distributed systems. They also allow programmers to develop modules for execution directly in kernel space, with the assurance that no MCC-compiled module will errantly tamper with other kernel modules. MCC imposes an architecture-independent representation of data and can migrate code using the architecture-independent FIR representation, which makes it an ideal platform for efficiently developing whole-process migration primitives. MCC's heap design is also easy to extend to support speculative execution models.

The Modern Approach to Exploit Multiple Cores using Process-Level Redundancy for Transient Fault Tolerance

However, these techniques cannot be directly adopted in the general-purpose computing domain. Compared to ultra-reliable computing environments, general-purpose systems are driven by a different and often conflicting set of factors; reliability need only improve enough to meet user expectations of failure rates. While software techniques cannot deliver the reliability level of hardware techniques, they come at significantly lower cost and with greater flexibility (zero hardware design cost). Existing software transient fault tolerance approaches insert redundant instructions to check computation and control flow and thereby ensure correct execution. This paper presents the design and analysis of transient fault tolerance using PLR (process-level redundancy), exploiting multiple cores for this purpose. PLR creates a set of redundant processes per application process and compares their outputs to ensure correct execution; scheduling the redundant processes across the available cores keeps the overhead low.
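The PLR idea described above can be sketched as follows. This is a minimal simulation of the scheme (in real PLR the replicas are separate OS processes created by the runtime; here each replica is a plain function call, to keep the sketch portable), with all names chosen for illustration:

```python
from collections import Counter

def plr_vote(outputs):
    """Majority-vote the outputs of redundant replicas; a transient fault
    that corrupts one replica is outvoted by the fault-free majority."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("no majority: replica outputs disagree")
    return value

def plr_execute(fn, args, replicas=3):
    # In real PLR these would be separate redundant processes; here each
    # replica is an independent call to the same computation.
    return plr_vote([fn(*args) for _ in range(replicas)])

print(plr_execute(pow, (2, 10)))  # 1024
print(plr_vote([7, 7, 9]))        # the faulty 9 is outvoted -> 7
```

With an odd replica count, any single corrupted output is masked; disagreement among a majority is surfaced as an error instead of a silently wrong result.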

Fault Tolerance using Fitness Algorithm in Cloud Computing

There are various fault tolerance strategies proposed by different authors that play a crucial role in fault tolerance mechanisms. Y. ZHANG, Z. ZHENG, and M. R. LYU proposed BFTCloud [9] for voluntary-resource cloud computing. The architecture operates through five basic operations: primary node selection, replica selection, request execution, primary node updating, and replica updating. The primary node is selected based on QoS requirements, and requests for the service are handled by the primary node. G. FAN, H. YU, L. CHEN, and D. LIU proposed CFN [10], covering service resources, the cloud module, the detection and failure process, etc. It is used to create the different components of cloud computing, which are integrated dynamically into the CFN model. Based on the CFN model, the properties of the components are analysed to develop a fault detection strategy at each level, which dynamically detects faults in the execution process. P. DAS and P. M. KHILAR suggested VFT [11], a reactive fault-tolerance technique; it consists of a Cloud Manager (CM) module and a Decision Maker (DM), which manage virtualization and load balancing and handle faults. The first step involves virtualization and load balancing; in the second step, fault tolerance is achieved by checkpointing and a fault handler. The virtualization part includes the fault handler, which finds unrecoverable faulty nodes and also helps to remove temporary software faults. G. CHEN, H. JIN, D. ZOU, B. B. ZHOU, W. QIANG, and G. HU suggested SHelp [12], an error handler run in different VMs hosted on one physical machine. It uses Checkpoint/Restart as the checkpoint-and-rollback tool. The authors introduced two new techniques, namely weighted rescue points and a two-level rescue point database.
When a fault occurs, the application is rolled back to the latest checkpoint. Error virtualization is first applied at the rescue point with the largest weight value among the candidate rescue points, then at the rescue point with the second largest weight, and so on until the fault is bypassed. S. MALIK and F. HUET suggested FTRT [13]: the highly intensive computing capabilities and scalable virtualized environment of clouds help systems execute tasks in real time. The proposed technique depends on the adaptive behaviour of reliability weights assigned to each processing node and uses a metric to evaluate reliability. Nodes are removed if they fail to achieve the minimum required reliability level.
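SHelp's weight-ordered retry over rescue points can be sketched as below. The rescue-point names and the recovery predicate are hypothetical, and error virtualization is reduced here to a boolean "did re-execution from this point bypass the fault":

```python
def rollback_with_rescue_points(rescue_points, try_recover):
    """rescue_points: list of (name, weight) candidates.
    try_recover(name) -> True if re-executing from that rescue point
    bypasses the fault. Points are tried in descending weight order,
    as in SHelp's weighted rescue point scheme."""
    for name, _weight in sorted(rescue_points, key=lambda p: p[1], reverse=True):
        if try_recover(name):
            return name
    raise RuntimeError("fault could not be bypassed at any rescue point")

# Hypothetical rescue points with weights learned from past recoveries.
points = [("open_file", 0.2), ("parse_request", 0.7), ("init", 0.1)]
# Hypothetical predicate: only re-execution from parse_request succeeds.
print(rollback_with_rescue_points(points, lambda n: n == "parse_request"))
```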

Resource Reliability using Fault Tolerance aware Scheduling in Cloud

To stay within budget by minimizing execution cost [8], the scheduler first identifies all available requests/tasks and the budget required to execute each of them, by calculating their time and resource requirements. It then arranges the tasks in ascending order of required cost and executes them in that order. Replication increases overhead, because a replica always requires storage and execution of the tasks/requests at every location or server; it therefore consumes more of every resource, including CPU (MIPS), memory, bandwidth, and storage servers, and it always increases execution time. During execution of a large job, if the execution engine fails before completing 100% of the task, this approach restarts execution from the beginning, so the portion of the job already executed is executed again.
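The two behaviours described above, cheapest-first ordering and restart-from-the-beginning on failure, can be sketched as follows (task names, costs, and work units are made up for illustration):

```python
def schedule_by_cost(tasks):
    """tasks: list of (name, cost) pairs. Cheapest tasks are executed
    first, so as many tasks as possible fit within the budget."""
    return sorted(tasks, key=lambda t: t[1])

def work_without_checkpoint(total_units, fail_at_units):
    """With no checkpointing, a job that fails after `fail_at_units` of
    work restarts from the beginning, so the completed portion is paid
    for twice."""
    return fail_at_units + total_units

tasks = [("render", 30), ("index", 10), ("backup", 20)]
print(schedule_by_cost(tasks))           # index, then backup, then render
print(work_without_checkpoint(100, 60))  # 160 units spent on a 100-unit job
```

The second function makes the cost of the restart policy concrete: a 100-unit job failing at 60% completion consumes 160 units in total, which is the overhead checkpointing would avoid.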

A theory for observational fault tolerance

are essentially a collection of located processes, or agents, composed in parallel, where location and channel names may be scoped to a subset of agents. The syntax for processes, P, Q, is an extension of that in Dπ. There is input and output on channels; in the latter V represents a tuple of identifiers, while in the former X is a tuple of variables, to be interpreted as a pattern. There are also the standard constructs for parallel composition, replicated input, local declarations, a test for equality between identifiers, migration, and a zero process. The only addition to the original Dπ is ping k.P else Q, which tests for the liveness status of k in the style of [1,15] and branches to P if k is alive and to Q otherwise. For these terms we assume the standard notions of free and bound occurrences of both names and variables, together with the associated concepts of α-conversion and substitution. We also assume that systems are closed, that is, they have no free variable occurrences. Note that all of the examples discussed in Section 1 are valid system-level terms in DπLoc. But it is worth emphasising that when we write definitions of the form
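The branching behaviour of ping k.P else Q can be illustrated with a toy reduction step. The encoding below (strings for names and continuations, a set of alive locations for the liveness status) is our own simplification for illustration, not part of DπLoc:

```python
from dataclasses import dataclass

@dataclass
class Ping:
    """Toy encoding of the construct ping k.P else Q."""
    location: str      # the location k whose liveness is tested
    then_branch: str   # continue as P if the location is alive
    else_branch: str   # continue as Q otherwise

def step_ping(p: Ping, alive: set) -> str:
    # ping k.P else Q reduces to P when k is alive and to Q when k is dead.
    return p.then_branch if p.location in alive else p.else_branch

print(step_ping(Ping("k", "P", "Q"), {"k", "l"}))  # P  (k is alive)
print(step_ping(Ping("k", "P", "Q"), {"l"}))       # Q  (k is dead)
```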

How To Write A Fault Tolerance Manager In A Microsoft System (Fault Tolerance)

Evolution during service life is mandatory, particularly for long-lived systems. Dependable systems, which continuously deliver trustworthy services, must evolve to accommodate changes, e.g., new fault tolerance requirements or variations in available resources. The addition of this evolutionary dimension to dependability leads to the notion of resilient computing. Among the various aspects of resilience, we focus on adaptivity. Dependability relies on fault-tolerant computing at runtime, applications being augmented with fault tolerance mechanisms (FTMs). As such, on-line adaptation of FTMs is a key challenge towards resilience. In related work, on-line adaptation of FTMs is most often performed in a preprogrammed manner or consists of tuning some parameters. Besides, FTMs are replaced monolithically: all the envisaged FTMs must be known at design time and deployed from the beginning. However, dynamics occurs along multiple dimensions, and developing a system for the worst-case scenario is impossible. Based on runtime observations, new FTMs can be developed off-line but integrated on-line. We denote this ability agile adaptation, as opposed to preprogrammed adaptation. In this thesis, we present an approach for developing flexible fault-tolerant systems in which FTMs can be adapted at runtime in an agile manner through fine-grained modifications that minimize the impact on the initial architecture. We first propose a classification of a set of existing FTMs based on criteria such as fault model, application characteristics, and necessary resources. Next, we analyze these FTMs and extract a generic execution scheme which pinpoints the common parts and the variable features between them. Then, we demonstrate the use of state-of-the-art tools and concepts from the field of software engineering, such as component-based software engineering and reflective component-based middleware, for developing a library of fine-grained adaptive FTMs.
We evaluate the agility of the approach and illustrate its usability through two examples of integration of the library: first, in a design-driven development process for applications in pervasive computing and, second, in a toolkit for developing applications for WSNs.

Fault Tolerance Mechanisms in Distributed Systems

This fault tolerance technique is often used for faults that disappear without anything being done to remedy the situation; such faults are known as transient faults. Transient faults occur when there is a temporary malfunction in any system component, or sometimes through environmental interference. The problem with transient faults is that they are hard to handle and diagnose, although they are less severe in nature. For handling transient faults, a software-based fault tolerance technique such as Process-Level Redundancy (PLR) is used, because hardware-based fault tolerance techniques are more expensive to deploy. As shown in Figure 9, PLR creates a set of redundant processes per application process and compares the processes to ensure correct execution. Redundancy at the process level enables the OS to easily schedule processes across all available hardware resources.

On Fault Tolerance of Resources in Grid Environment

A computational grid consists of large sets of diverse, geographically distributed resources that are grouped into virtual computers for executing specific applications. As the number of grid system components increases, the probability of failures in a grid computing environment becomes higher than in traditional parallel computing. Compute-intensive grid applications often require much longer execution times to solve a single problem. Thus, the huge computing potential of grids usually remains unexploited due to their susceptibility to failures such as process failures, machine crashes, and network failures. These may lead to job failures, violated timing deadlines and service level agreements, denial of service, and quality of service degraded below user expectations. Fault management is therefore a very important and challenging task for grid application developers. It has been observed that interaction, timing, and omission faults are the most prevalent in grids. Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults; it makes the system more dependable. A complementary but separate approach to increasing dependability is fault prevention, which consists of techniques, such as inspection, whose intent is to eliminate the circumstances by which faults arise. A failure occurs when an actual running system deviates from its specified behavior. The cause of a failure is called an error: an error represents an invalid system state that does not comply with the system specification. The error itself is the result of a defect in the system, or fault. In other words, a fault is the root cause of a failure. However, a fault may not necessarily result in an error; moreover, the same fault may result in multiple errors. Similarly, a single error may lead to multiple failures. The level of fault tolerance is reflected by quantifying the system's dependability.
Dependability means that our system can be trusted to deliver the service(s) for which it

Fault Tolerance in Live VM Migration: A Review

This mechanism ensures that backups are taken in an elastic optical network. Since the network used is prone to failures, the entire elastic-network process is at stake. To resolve this problem, a novel mutual backup model is proposed in the studied paper. The number of output lines required for transfer and backup is reduced by the use of WDM. Slow migration remains a problem in this case, so an optical medium is suggested: the optical medium transfers data at the speed of light, which improves the overall transfer rate. More data can be transferred, so throughput is also enhanced. (3)

Aspect oriented software fault tolerance

5. Aspect Oriented Exception Handling Patterns
Exception handling has been deployed as a key mechanism for implementing software fault tolerance through forward and backward error recovery. It provides a convenient means of structuring software that has to deal with erroneous conditions [11]. In [8], the author addresses the weaknesses of the exception handling mechanisms provided by mainstream programming languages such as Java, Ada, C++, and C#. In his experience, exception handling code is intertwined with the normal code, which hinders maintenance and reuse of both the normal and the exception handling code.
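One common way to keep exception handling code out of the normal code, in the aspect-oriented spirit discussed above, is to attach the handler declaratively. The sketch below uses a Python decorator as a stand-in for an aspect; the function name and the recovery policy are illustrative assumptions:

```python
import functools

def with_recovery(handler):
    """Aspect-style wrapper: the recovery logic lives outside the normal
    code path, so the wrapped function needs no inline try/except."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)            # normal behaviour
            except Exception as exc:
                return handler(exc, *args, **kwargs)  # forward error recovery
        return wrapper
    return decorate

@with_recovery(lambda exc, x: 0)  # recover by substituting a safe default
def reciprocal(x):
    return 1 / x                  # normal code only, no handling intertwined

print(reciprocal(4))  # 0.25
print(reciprocal(0))  # 0, recovered from the division error
```

The normal and exceptional code can now be maintained and reused independently, which is exactly the separation the mainstream languages criticised in [8] make hard to achieve.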

Aspect Oriented Software Fault Tolerance

The dependability assessment of the proposed scheme has been done via fault injection. All faults are injected into the most critical functionality of the system, that is, reading the ultrasonic sensor, light sensor, and motor speed sensor, and writing motor servo commands. The faults are injected by supplementary code in an aspect-oriented way using AspectC++ [1]. The injected faults are permanent stuck-at faults, noise bursts, and random spikes at pre-defined or random locations.
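A fault injector of this kind can be sketched as a wrapper around the sensor-read function. The version below is a hedged illustration in Python rather than AspectC++, and the mode names, magnitudes, and the clean reading are all assumptions:

```python
import random

def inject_fault(read_sensor, mode, magnitude=5.0, stuck_value=0.0):
    """Wrap a sensor-read function with an injected fault, in the spirit
    of aspect-oriented injection: the original read is left untouched."""
    def faulty_read():
        value = read_sensor()
        if mode == "stuck":
            return stuck_value                                  # permanent stuck-at
        if mode == "spike":
            return value + random.choice([-1, 1]) * magnitude   # random spike
        if mode == "noise":
            return value + random.gauss(0.0, magnitude)         # noise burst
        return value                                            # no fault injected
    return faulty_read

clean = lambda: 20.0  # hypothetical ultrasonic distance reading, in cm
print(inject_fault(clean, "stuck")())  # 0.0
```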


Fault Tolerance Against Design Faults

“Systematic failures” - failures that are not due to random physical decay but to defects in the design or manufacturing of products, and therefore happen systematically when certain circumstances occur in the use of the product - are a common problem in engineering. Among their causes are design faults: avoidable defects due to human error or lack of foresight in developing the system. The advent of software gave new prominence to design faults, since software products are uniquely defined as being a purely intellectual construction, immune from physical damage or decay, and yet they were seen to fail often, to the point of being the main cause of failure of many systems. For computer systems, design faults – in software, less often in hardware, and in organizations and procedures of use – are reported to account for a large fraction of failures, including high consequence ones, in anecdotal evidence [1], and published statistics (e.g. in [2], between 25% and 65% of failures in different categories of systems). The importance of physical faults in determining unreliability of computer systems is decreasing compared to that of design faults and human error in operation and maintenance. (If some human errors were attributed to faults in the design of user interfaces and procedures, the relative weight of design faults would increase further.) This trend may be due in part to some categories of software becoming less reliable, but also to an increasing ability to achieve high reliability against physical faults of hardware, in part via fault tolerance. 
Techniques for avoiding or removing design faults have improved over the years, but the problem of design faults is still serious in at least two respects: i) there are safety- and mission-critical systems that require better dependability than current methods can verify, or possibly even achieve [3]; and ii) for a large part of the industry that produces hardware and software components for purchase “off the shelf”, dependability has low priority [4], so that users, or

Fault Tolerance and Reliability in Scientific Workflows

Complex scientific workflows often run for long periods of time and need to be able to recover from dynamic changes. One issue is how to handle well-defined fault conditions, such as a down server or a resource that cannot service the request at that moment (e.g., out of storage space) when the workflow manager tries to use it. Another issue is how to treat failures that render resources (temporarily) inaccessible, but not necessarily inoperative. Monitoring resources accessed over wide-area networks (WANs) is an example of this: if the network between the WFMS monitoring service and the resource acting on behalf of the workflow partitions (no communication is possible between the two halves), the monitoring service cannot determine whether the resource is still acting on behalf of the workflow manager. This issue is discussed more thoroughly in later paragraphs, where we present a preliminary solution for the case where a service becomes temporarily or permanently inaccessible.
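The monitoring difficulty described above is why a monitor should distinguish "no heartbeat received" from "resource failed". A minimal sketch of such a monitor, with illustrative names and timeout, might look like:

```python
class HeartbeatMonitor:
    """Marks a resource 'suspect' rather than 'failed' when heartbeats
    stop: under a network partition the resource may still be doing
    useful work on the workflow's behalf."""
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, resource, now):
        # Record the time a heartbeat arrived (time passed in for determinism).
        self.last_seen[resource] = now

    def status(self, resource, now):
        if resource not in self.last_seen:
            return "unknown"
        quiet = now - self.last_seen[resource]
        return "alive" if quiet <= self.timeout_s else "suspect"

m = HeartbeatMonitor(timeout_s=10)
m.heartbeat("storage-node", now=0.0)
print(m.status("storage-node", now=5.0))   # alive
print(m.status("storage-node", now=30.0))  # suspect: failed or partitioned
```

Treating a silent resource as merely suspect leaves room for the later reconciliation step the text alludes to, instead of prematurely rescheduling work that may still complete.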

Fault Tolerance Techniques for Combinational Circuits

ABSTRACT: With increasing soft error rates, multiple faults need to be considered when modeling circuit sensitivity and evaluating fault tolerance techniques. This paper proposes a design of combinational circuits for soft error tolerance with minimal area overhead. It is based on analyzing the random-pattern testability of faults in a design and protecting sensitive transistors, whose probability of causing a soft error is relatively high. These transistors are protected by duplicating and sizing a subset of them. The proposed method compares power and output accuracy using the Quad technique and the STR (Selective Transistor Redundancy) algorithm.

Architectural Framework for Cloud Reliability Model using Fault Tolerance Techniques

In the proposed framework, CloudSim will be used to produce the results. CloudSim offers a generalised simulation framework that allows modelling and simulation of application execution. It is a free simulation tool that permits developers to check the performance of their policies in a convenient setting, and it serves as a toolkit for cloud environments that evaluates resource provisioning algorithms. The classes of the package permit the development of fault-tolerance-based algorithms, which in turn can supervise virtual nodes to identify failures and later resolve them. To measure the results through CloudSim, NetBeans will be used and the scripts will be written in the Java programming language. This provision will implement acceptance computation, reliability assessment, and a decision mechanism, and will also apply fine-grained checkpointing and replication mechanisms (Damodhar & Poojitha 2017). The simulator offers the capability to calculate availability, throughput, time overhead, and monetary cost overhead.

Fault Tolerance Techniques in Distributed System

As already discussed, computing systems consist of a multitude of hardware and software components that are bound to fail eventually. In many systems, such component failures can lead to unanticipated, potentially disruptive failure and to service unavailability. Some systems are designed to be fault-tolerant: they either exhibit well-defined failure behaviour when components fail or mask component failures from users, that is, they continue to provide their specified standard service despite the occurrence of component failures. To many users, temporarily errant system failure behaviour or service unavailability is acceptable. There is, however, a growing number of user communities for whom the cost of unpredictable, potentially hazardous failures or system service unavailability can be very significant [6]; examples include the on-line transaction processing, process control, and computer-based communications user communities. To minimize losses due to unpredictable failure behaviour or service unavailability, these users rely on fault-tolerant systems. With the ever-increasing dependence placed on computing services, the number of users who demand fault tolerance is likely to increase. The task of designing and understanding fault-tolerant system architectures is notoriously difficult: one has to stay in control not only of the standard system activities when all components are well, but also of the complex situations which can occur when some components fail. The difficulty of this task is also due to a lack of structuring concepts and the use of different names for the same concepts. For example, what one person calls a failure, a second person calls a fault, and a third person might call an error.
A. Faults, Errors, and Failures

Decentralized Fault Tolerance on Grid Environment

ABSTRACT: Grid computing is a technology for solving huge-scale scientific problems using heterogeneous and geographically distributed resources, and doing so reliably is a major challenge in a grid computing environment. In a grid environment with potentially thousands of nodes connected to each other, the reliability of an individual node cannot be guaranteed; hence faults occur frequently in grid computing, and fault tolerance techniques are essential in grids. In this paper, we propose a fault tolerance model for DCG (dynamic coloured graphs) in a decentralized environment. In our colouring technique, a set of colours (RED, BLUE, GREEN) is assigned to each node based on its level of attribute performance: we classify each node into one of three categories, identical, more efficient, and less efficient, based on its attributes, and assign the RGB colours accordingly. The simulation results show that the model performs very well and reduces the ART and AWT in huge grid environments.
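The colouring step can be sketched as a simple threshold classification. The thresholds and performance scores below are assumptions, since the abstract only names the three categories:

```python
def colour_node(performance, low=0.4, high=0.7):
    """Assign a colour by attribute-performance level (thresholds are
    illustrative): GREEN = more efficient, BLUE = identical/average,
    RED = less efficient."""
    if performance >= high:
        return "GREEN"
    if performance >= low:
        return "BLUE"
    return "RED"

# Hypothetical per-node performance scores in [0, 1].
nodes = {"n1": 0.9, "n2": 0.5, "n3": 0.2}
print({n: colour_node(p) for n, p in nodes.items()})
```

A scheduler could then prefer GREEN nodes for critical tasks and route around RED ones, which is the intuition behind reducing ART and AWT.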

Optimized Fault Tolerance in Distributed Environment

Mohammad Jalil Piran, G. Rama Murthy, and G. Praveen Babu give an overview of vehicular ad hoc and sensor network principles and challenges [7]. In this paper they propose work on vehicular ad hoc networks using GPS (the Global Positioning System). The device works in offline mode as well as in online mode, and with future safety and security in mind they present a good approach for Intelligent Transportation Systems. They build the system by utilizing wireless sensor networks for vehicular networks, describe new challenges and aspects of VANETs, and also explain the communication architecture and feasible topologies applicable in VASNET. They employ wireless sensor networks as a Vehicular Ad Hoc and Sensor Network, VASNET for short. Vehicular nodes are used to check the velocity of each vehicle at every instant. In a VANET there are base stations, such as a rescue team or a police traffic station, and a base station can be mobile or stationary. VASNET provides wireless communication capability between the stationary nodes and the vehicular nodes, and it also increases comfort and safety for vehicles on highway roads.

Fault Tolerance in Interconnection Networks - A Survey

We assume fault diagnosis to be available as needed for the surveyed ICNs and do not discuss it further. Techniques for fault-tolerant design can be categorized by whether they modify the topology (graph) of the system. The three well-known methods that do not modify the topology are error-correcting codes, bit-slice implementation with spare bit slices, and duplicating an entire network (which changes the topology of the larger system using the network). These approaches to fault tolerance can be applied to ICNs. Over the years, a number of techniques have also been developed to suit the particular nature of ICNs and their use. Our survey explores these methods in a systematic order.
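Of the three topology-preserving methods, duplication is the simplest to illustrate: with three copies of a network (or bit slice), a bitwise majority vote masks any single faulty copy. A minimal sketch:

```python
def majority_vote(a, b, c):
    """Bitwise majority over three copies of a word, as produced by a
    triplicated network or bit slice: any single faulty copy is masked,
    because each output bit follows the two copies that agree."""
    return (a & b) | (a & c) | (b & c)

good = 0b1011
faulty = 0b1111  # one copy corrupted by a fault in its network
print(bin(majority_vote(good, good, faulty)))  # 0b1011
```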

Virtualization and Fault Tolerance in Cloud Computing

Fault tolerance in cloud computing is a grand challenge problem nowadays. The main fault tolerance issues in cloud computing are detection and recovery, and many fault tolerance techniques have been designed to combat these problems. In this research work, a Virtualization and Fault Tolerance (VFT) technique is used to reduce the service time and to increase system availability. A Cloud Manager (CM) module and a Decision Maker (DM) are used in this scheme to manage virtualization and load balancing and to handle faults. In the first step, virtualization and load balancing are performed; in the second step, fault tolerance is achieved by redundancy, checkpointing, and a fault handler. The main load balancing issues in cloud computing are load calculation and load distribution, and many load balancing techniques have been designed to distribute tasks properly. In this work, a Load Balancing Technique for Virtualization and Fault Tolerance in Cloud Computing (LBVFT) is applied to assign tasks to the virtual nodes. LBVFT assigns tasks to virtual nodes depending on their success rates (SR) and previous load history: assignment is done by the load balancer (LB) of the Cloud Manager (CM) module on the basis of the higher success rate and lower load of the available nodes. A Randomized Searching Algorithm is designed to select a virtual node; its performance lies between the Binary Search and Linear Search algorithms. VFT is designed to provide reactive as well as proactive fault tolerance. In this approach, a fault handler is included in the virtualization part. The fault handler blocks unrecoverable faulty nodes, along with their virtual nodes, for future requests, and removes temporary software faults from recoverable faulty nodes to make them available for future requests.
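The node-selection policy and the randomized search can be sketched as follows. The scoring rule (success rate minus load) and the node data are our own illustrative assumptions, since the text only states the two criteria:

```python
import random

def select_node(nodes):
    """nodes: {name: (success_rate, load)}. Prefer higher SR and lower
    load; the combined score used here is an assumed stand-in for
    LBVFT's actual rule."""
    return max(nodes, key=lambda n: nodes[n][0] - nodes[n][1])

def randomized_search(nodes, predicate):
    """Probe virtual nodes in random order until one satisfies
    `predicate`; a simple reading of the Randomized Searching Algorithm."""
    order = list(nodes)
    random.shuffle(order)
    for n in order:
        if predicate(n):
            return n
    return None

pool = {"vn1": (0.95, 0.30), "vn2": (0.80, 0.10), "vn3": (0.99, 0.90)}
print(select_node(pool))  # vn2: score 0.70 beats 0.65 (vn1) and 0.09 (vn3)
```

Note how the combined score lets a lightly loaded node with a slightly lower success rate win over a heavily loaded, highly reliable one, which matches the stated goal of balancing the two criteria.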
