
2.5 Workflow Deployment Optimisation

2.5.3 Reliability-driven Workflow Deployment

Node and network failures can have a detrimental impact on workflow performance. A distributed workflow mapping algorithm that takes reliability into consideration is therefore highly desirable to avoid or reduce workflow execution failures.

One study [75] introduced the problem of mapping distributed workflows onto systems where nodes and links are subject to probabilistic failures. Throughput and reliability were both considered within the mapping problem, and a decentralised layer-oriented method was proposed to handle the high throughput of data flows while satisfying a pre-specified overall failure limit. An extension of the LDP (Layer-oriented Dynamic Programming) workflow mapping algorithm [74] was used to minimise, in a distributed manner, the transfer time at global bottleneck links under a reliability constraint.
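
The failure-limit check described above can be illustrated with a minimal sketch. It assumes that node and link failures are independent and that a mapping succeeds only if every node and link it uses stays up; the function names are hypothetical and are not taken from [75].

```python
from math import prod  # available from Python 3.8


def mapping_reliability(node_fail, link_fail):
    """Probability that all nodes and links used by a mapping stay up.

    node_fail and link_fail are lists of failure probabilities.
    Independence of failures is an assumption of this sketch.
    """
    return prod(1.0 - p for p in node_fail) * prod(1.0 - p for p in link_fail)


def satisfies_failure_limit(node_fail, link_fail, failure_limit):
    """True if the overall failure probability does not exceed the limit."""
    return 1.0 - mapping_reliability(node_fail, link_fail) <= failure_limit
```

A candidate mapping would only be kept for throughput optimisation if satisfies_failure_limit returns True for it.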

Assayad et al. [13] explored the trade-off between reliability, power consumption and execution time. They designed a heuristic scheduling algorithm which obtained the Pareto front of each objective, then satisfied pre-defined constraints on two of the objectives (for example, reliability and power consumption), and finally minimised execution time.
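
The idea of constraining two objectives and minimising the third can be sketched as follows. This is a simplified illustration over an already enumerated set of candidate schedules, not the heuristic of [13]; the dictionary keys and the function name are assumptions.

```python
def pick_schedule(schedules, min_reliability, max_power):
    """Return the fastest schedule that meets both constraints, or None.

    schedules is a list of dicts with the hypothetical keys
    "time", "reliability" and "power".
    """
    feasible = [
        s for s in schedules
        if s["reliability"] >= min_reliability and s["power"] <= max_power
    ]
    # Among the feasible schedules, minimise execution time.
    return min(feasible, key=lambda s: s["time"]) if feasible else None
```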

To reduce the effect of failures on an application executing on a failure-prone system, an algorithm has been proposed [52] which not only minimises the execution time but also considers the probability of application failure when deploying the workflow over distributed environments. Both execution time and reliability rate are combined in a single cost function, which is then optimised to find the best deployment solution.

Similarly, BSA (Bi-Objective Scheduling Algorithm) [79] can quickly make a deployment decision for a workflow over heterogeneous distributed computing systems, considering performance and reliability via a bi-objective compromise function. This function is applied in two steps. First, the minimal solutions for both reliability and performance of each task are generated; next, a weight parameter θ is used to privilege one of the objectives.
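
A θ-weighted compromise of this kind can be sketched for a single task as follows. The exact function used by BSA is not reproduced here; the min-max normalisation and the names are assumptions made for illustration.

```python
def normalise(value, best, worst):
    """Scale value to [0, 1], where 0 is the best observed value."""
    return 0.0 if worst == best else (value - best) / (worst - best)


def choose_processor(candidates, theta):
    """Pick a processor for one task using a theta-weighted compromise.

    candidates maps a processor name to (finish_time, failure_probability).
    theta = 1 privileges performance, theta = 0 privileges reliability.
    """
    times = [t for t, _ in candidates.values()]
    fails = [f for _, f in candidates.values()]
    t_best, t_worst = min(times), max(times)  # step 1: per-objective optima
    f_best, f_worst = min(fails), max(fails)

    def score(name):
        t, f = candidates[name]
        return (theta * normalise(t, t_best, t_worst)
                + (1.0 - theta) * normalise(f, f_best, f_worst))

    return min(candidates, key=score)  # step 2: weighted compromise
```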

Dongarra et al. [53] introduced a Pareto curve-based algorithm to optimise the makespan and reliability of workflow execution. The algorithm first generated the Pareto front of reliability for each task of a target workflow, without considering the order of the tasks. Next, based on the Pareto front, the makespan was refined to meet the time constraint M for each task, and the computing resources used were recorded. Finally, a solution was generated from these records, maximising reliability while meeting the makespan constraint.
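
The two notions used above, a Pareto front and the selection of the most reliable option within a time constraint, can be sketched as follows. This is a generic illustration over (time, reliability) pairs and is not the algorithm of [53]; the function names are hypothetical.

```python
def pareto_front(options):
    """Keep the non-dominated (time, reliability) pairs.

    Lower time and higher reliability are both preferred.
    """
    front = set()
    for t, r in options:
        dominated = any(
            t2 <= t and r2 >= r and (t2 < t or r2 > r)
            for t2, r2 in options
        )
        if not dominated:
            front.add((t, r))
    return sorted(front)


def most_reliable_within(options, max_time):
    """Most reliable Pareto-optimal option whose time is at most max_time."""
    feasible = [(t, r) for t, r in pareto_front(options) if t <= max_time]
    return max(feasible, key=lambda o: o[1]) if feasible else None
```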

Another study [142] applied a method to measure the reliability rate dynamically, relating time to the failure rate of the computing resources. In addition, a variant of the genetic algorithm was developed to optimise both the makespan and reliability of a workflow application.
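
One common way to relate time to failure rate is to assume a constant failure rate λ for a resource, so that the probability of surviving an execution of length t is exp(-λt). The sketch below uses this model as an assumption; it is not necessarily the measure used in [142], and the function names are hypothetical.

```python
import math


def task_reliability(failure_rate, exec_time):
    """Probability that a resource with a constant failure rate
    survives for exec_time (exponential model, assumed here)."""
    return math.exp(-failure_rate * exec_time)


def workflow_reliability(assignments):
    """Reliability of a set of tasks, given (failure_rate, exec_time) pairs.

    Assumes independent failures, so the task reliabilities multiply.
    """
    return math.exp(-sum(rate * t for rate, t in assignments))
```

Under this model a longer execution on the same resource is always less reliable, which is why makespan and reliability interact and are optimised together.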

Cao and Zhu [35] attempted to optimise reliability and performance in a distributed workflow system, in which the measure of reliability covers both communication links and computing resources. A method was designed that combines iterative critical path search with layer-based priority assignment (CPL) to minimise the end-to-end delay (EED) by focusing on the optimal allocation of tasks on the critical path. A refinement method was then applied to the tasks that were not on the critical path, so as to increase the reliability of the whole system.
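
The critical path referred to above is the longest path through the workflow graph, and its length is the end-to-end delay. A minimal sketch of finding it is given below; it ignores communication costs and is not the CPL method of [35]. The function name and input layout are assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, predecessors):
    """Return (path, delay) for the longest path through a workflow DAG.

    durations maps task -> execution time; predecessors maps
    task -> set of tasks that must finish first.
    """
    finish, parent = {}, {}
    for task in TopologicalSorter(predecessors).static_order():
        preds = predecessors.get(task, ())
        # The predecessor that finishes last determines the start time.
        parent[task] = max(preds, key=finish.get, default=None)
        start = finish[parent[task]] if parent[task] is not None else 0.0
        finish[task] = start + durations[task]

    node = max(finish, key=finish.get)
    delay = finish[node]
    path = []
    while node is not None:  # walk back from the last task to the source
        path.append(node)
        node = parent[node]
    return path[::-1], delay
```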

In [105], the authors proposed an approach to optimise the performance and reliability of scheduling workflow systems over clouds. As in BSA [79], the Pareto fronts were first generated as two vectors. However, they extended the Euclidean distance [50] to define a distance between the two vectors, so that the two objectives could be expressed in one function. The optimal solution can then be found based on this function.
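
A distance-based combination of two objectives can be sketched as follows. The extension of the Euclidean distance defined in [105] is not reproduced here; this illustration simply normalises both objectives and picks the solution closest to the ideal point, which is an assumption.

```python
import math


def closest_to_ideal(solutions):
    """Pick the (makespan, failure_probability) pair nearest the ideal point.

    Both objectives are minimised and min-max normalised to [0, 1], so the
    ideal point is (0, 0). Names and normalisation are assumptions.
    """
    spans = [m for m, _ in solutions]
    fails = [f for _, f in solutions]

    def scale(value, values):
        lo, hi = min(values), max(values)
        return 0.0 if hi == lo else (value - lo) / (hi - lo)

    def distance(solution):
        m, f = solution
        return math.hypot(scale(m, spans), scale(f, fails))

    return min(solutions, key=distance)
```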

Fault Tolerance

Another way to increase the reliability of workflow deployment is to introduce methods that add fault tolerance. Ideally, this should avoid reproducing and redeploying the entire workflow when nodes or links fail during execution.

AHEFT [152] is a rescheduling algorithm based on [136]. It was used to handle changes in the grid environment, such as computing node failures, and supports the resumption of unfinished executions. A workflow is initially planned and executed over distributed computing resources, minimising the makespan. When the availability of resources changes, a new schedule is generated based on the current computing resources and unfinished tasks, aiming to maximise performance.
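
The rescheduling step can be illustrated with a simplified greedy earliest-finish-time heuristic. This is not AHEFT itself: the sketch assumes the unfinished tasks are independent, ignores data transfer, and uses hypothetical names.

```python
def reschedule(unfinished, resources, now=0.0):
    """Map unfinished tasks onto the currently available resources.

    unfinished maps task -> remaining work; resources maps name -> speed.
    Returns (plan, makespan). Task independence is assumed.
    """
    if not resources:
        raise ValueError("no resources available")
    ready_at = {name: now for name in resources}
    plan = {}
    # Place the largest tasks first, each on the resource that finishes it earliest.
    for task, work in sorted(unfinished.items(), key=lambda kv: -kv[1]):
        best = min(resources, key=lambda r: ready_at[r] + work / resources[r])
        ready_at[best] += work / resources[best]
        plan[task] = best
    return plan, max(ready_at.values())
```

When a node fails, calling reschedule again with the failed node removed from resources and only the unfinished tasks in unfinished yields a new plan without restarting completed work.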

VgrADS [123] is a tool that enables the execution of a workflow over multiple computing resources such as clouds and grids. Furthermore, fault tolerance techniques were added to increase the probability of success for each workflow task. These techniques analyse the initial mapping of the workflow, determine the mapping of the replicated tasks onto the available slots and return the mapping to the planner. The replicated tasks are selected based on the predicted failure rate of each task in the initial mapping.
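
Selecting tasks for replication by predicted failure rate can be sketched as follows. The threshold, the data layout and the function name are assumptions for illustration and do not describe the VgrADS implementation.

```python
def plan_replicas(initial_mapping, failure_rate, free_slots, threshold):
    """Assign replicas of the riskiest tasks to free slots.

    initial_mapping maps task -> resource, failure_rate maps
    task -> predicted failure probability, free_slots is a list of
    resource names with spare capacity.
    """
    risky = sorted(
        (t for t in initial_mapping if failure_rate[t] > threshold),
        key=lambda t: failure_rate[t],
        reverse=True,
    )
    slots = list(free_slots)
    replicas = {}
    for task in risky:
        # A replica on the same resource as the original adds no protection.
        slot = next((s for s in slots if s != initial_mapping[task]), None)
        if slot is None:
            continue
        slots.remove(slot)
        replicas[task] = slot
    return replicas
```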

QAFT [161] is a fault-tolerant scheduling algorithm that can tolerate the permanent failure of one node in a heterogeneous cluster when running real-time tasks with QoS requirements. The fault-tolerant model extended the conventional primary-backup model [122] by scheduling two copies of a task on two different resources. This increases the likelihood of successful execution and also allows correctness to be verified by comparing the two versions of the results. Based on the fault-tolerant model, a reliability model can be used to quantitatively evaluate the system’s level of fault tolerance.
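
The benefit of running two copies on different resources can be quantified with a minimal sketch. It assumes the two resources fail independently; the placement rule (the two most reliable nodes) is a simplification and is not the QAFT policy.

```python
def primary_backup_reliability(r_primary, r_backup):
    """Probability that at least one of the two copies succeeds,
    assuming independent failures."""
    return 1.0 - (1.0 - r_primary) * (1.0 - r_backup)


def place_copies(reliability_by_node):
    """Return (primary, backup): two distinct nodes, most reliable first."""
    if len(reliability_by_node) < 2:
        raise ValueError("primary-backup needs at least two nodes")
    ranked = sorted(reliability_by_node, key=reliability_by_node.get, reverse=True)
    return ranked[0], ranked[1]
```

For example, two nodes with reliabilities 0.9 and 0.8 give a combined success probability of 1 - 0.1 × 0.2 = 0.98 under the independence assumption.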

AhmedIbrahim et al. [127] introduced a robust mechanism (called AGS) which can detect resource failures and continue to offer functionality. A mobile agent model was used to decentralise the workflow execution and enhance the resource discovery and monitoring processes. Once a mobile agent has detected a failure in a computing resource, the agent and the tasks assigned to the failed resource are migrated to another computing resource through ITT (in advance Task Transmission). Workflow execution can then be resumed.

Most relevant research uses the power method to measure the reliability of workflow deployment. However, this can only guarantee the reliability of the whole workflow, whereas a more advantageous entropy-based method is introduced and utilised in this thesis.

In document Partitioning workflow applications over federated clouds to meet non-functional requirements (Page 46-49)