2.5 Workflow Deployment Optimisation
2.5.3 Reliability-driven Workflow Deployment
Node and network failures can have a detrimental impact on workflow performance. A distributed workflow mapping algorithm that takes reliability into consideration is therefore highly desirable, as it can avoid or reduce workflow execution failures.
One study [75] introduced the problem of mapping distributed workflows onto systems where nodes and links are subject to probabilistic failures. Throughput and reliability were both considered within the mapping problem, and a decentralised layer-oriented method was proposed to handle the high throughput of data flows while satisfying a pre-specified overall failure limit. An extension of the LDP (Layer-oriented Dynamic Programming) workflow mapping algorithm [74] was used to minimise, in a distributed manner, the transfer time at global bottleneck links under a reliability constraint.
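To make the notion of an overall failure limit concrete, the following minimal Python sketch computes the probability that a mapping fails and checks it against a limit. It assumes that node and link failures are independent, which is an assumption of this sketch and not necessarily the model used in [75]; the function and parameter names are hypothetical.

```python
from math import prod


def mapping_failure_probability(node_fail_probs, link_fail_probs):
    """Probability that at least one node or link used by the mapping fails.

    Assumes all failures are independent (an assumption of this sketch).
    """
    survival = prod(1.0 - p for p in node_fail_probs)
    survival *= prod(1.0 - p for p in link_fail_probs)
    return 1.0 - survival


def satisfies_failure_limit(node_fail_probs, link_fail_probs, failure_limit):
    """True if the mapping's failure probability does not exceed the limit."""
    probability = mapping_failure_probability(node_fail_probs, link_fail_probs)
    return probability <= failure_limit
```

Under this independence assumption, every additional node or link used by a mapping can only increase its failure probability, which is why throughput and reliability tend to pull the mapping in different directions.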
Assayad et al. [13] explored the trade-off between reliability, power consumption and execution time. They designed a heuristic scheduling algorithm which obtained the Pareto front of each objective, then approached pre-defined constraints on two of the objectives (such as reliability and power consumption), and finally minimised execution time.
To reduce the effect of failures on an application executing on a failure-prone system, an algorithm has been proposed [52] which not only minimises the execution time but also considers the probability of application failure when deploying the workflow over distributed environments. Execution time and reliability are both taken into account in a single cost function, which is then optimised to find the best deployment solution.
Similarly, BSA (Bi-Objective Scheduling Algorithm) [79] can quickly make a deployment decision for a workflow over heterogeneous distributed computing systems, considering performance and reliability via a bi-objective compromise function. This function involves two steps. Firstly, the minimal solutions for both the reliability and the performance of each task are generated; next, a weight parameter θ is used to privilege one of the objectives.
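The role of the weight parameter can be illustrated with a small Python sketch. The exact compromise function of BSA is not reproduced here; the normalised weighted sum below, together with all function and parameter names, is an assumption made purely for illustration.

```python
def compromise_score(time, reliability, best_time, best_reliability, theta):
    """Weighted penalty relative to the best value of each objective (lower is better).

    theta lies in [0, 1]: values near 1 privilege performance,
    values near 0 privilege reliability. This form is hypothetical.
    """
    time_penalty = time / best_time - 1.0  # 0 when the time is the best available
    reliability_penalty = 1.0 - reliability / best_reliability
    return theta * time_penalty + (1.0 - theta) * reliability_penalty


def choose_resource(candidates, theta):
    """Pick a resource for one task.

    candidates maps a resource name to an (execution_time, reliability) pair.
    """
    # Step 1: the best achievable value of each objective, taken separately.
    best_time = min(t for t, _ in candidates.values())
    best_reliability = max(r for _, r in candidates.values())
    # Step 2: use theta to trade the two objectives off against each other.
    return min(
        candidates,
        key=lambda name: compromise_score(
            *candidates[name], best_time, best_reliability, theta
        ),
    )
```

With two candidates, one fast but less reliable and one slow but more reliable, setting theta to 1 selects the fast resource and setting it to 0 selects the reliable one.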
Dongarra et al. [53] introduced a Pareto curve-based algorithm to optimise the makespan and reliability of workflow execution. The algorithm first generated the Pareto front of reliability for each task of a target workflow, without considering the order of the tasks. Next, based on the Pareto front, the makespan was refined to meet the time constraint M for each task, and the computing resources used were recorded. Finally, a solution was generated from these records, maximising the reliability while meeting the makespan constraint.
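The two ingredients of such an approach, a Pareto front and a constrained selection from it, can be sketched in Python as follows. This is a simplified illustration over whole candidate solutions and not the per-task procedure of [53]; the function names are hypothetical.

```python
def pareto_front(options):
    """Return the non-dominated (makespan, reliability) pairs.

    One option dominates another if it is no worse in both objectives
    (lower makespan, higher reliability) and strictly better in at least one.
    """
    front = []
    for i, (m_i, r_i) in enumerate(options):
        dominated = any(
            (m_j <= m_i and r_j >= r_i) and (m_j < m_i or r_j > r_i)
            for j, (m_j, r_j) in enumerate(options)
            if j != i
        )
        if not dominated:
            front.append((m_i, r_i))
    return sorted(set(front))


def most_reliable_within(options, makespan_limit):
    """Most reliable Pareto-optimal option whose makespan meets the limit."""
    feasible = [o for o in pareto_front(options) if o[0] <= makespan_limit]
    return max(feasible, key=lambda o: o[1]) if feasible else None
```

Restricting the search to the Pareto front is safe here because any dominated option that satisfies the makespan limit is matched or beaten by a front member that also satisfies it.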
Another study [142] applied a method to measure the reliability rate dynamically, relating time to the failure rate of the computing resources. In addition, a variant of the genetic algorithm was developed to optimise both the makespan and the reliability of a workflow application.
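One common way of relating time to a resource failure rate is the exponential failure model, in which a task running for time t on a resource with constant failure rate λ succeeds with probability exp(-λt). The Python sketch below uses this model as an assumption; it is not necessarily the formulation used in [142], and the function names are hypothetical.

```python
from math import exp


def task_reliability(failure_rate, execution_time):
    """Probability that a resource with a constant failure rate survives the task.

    Uses the exponential failure model (an assumption of this sketch).
    """
    return exp(-failure_rate * execution_time)


def workflow_reliability(assignments):
    """Reliability of a whole workflow.

    assignments is an iterable of (failure_rate, execution_time) pairs, one per
    task. Task failures are assumed independent, so the per-task reliabilities
    multiply, which is the same as summing the exponents.
    """
    total_hazard = sum(rate * time for rate, time in assignments)
    return exp(-total_hazard)
```

Under this model, a longer execution on the same resource lowers reliability, so shortening the makespan and raising reliability are not always in conflict.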
Cao and Zhu [35] attempted to optimise reliability and performance in a distributed workflow system, in which the measure of reliability covers both communication links and computing resources. A method was designed that combines iterative critical path search with layer-based priority assignment (CPL) to minimise the end-to-end delay (EED) by focusing on the optimal allocation of tasks on the critical path. A refinement method was then applied to the tasks that were not on the critical path, so as to increase the reliability of the whole system.
In [105], the authors proposed an approach to optimise the performance and reliability of workflow scheduling over clouds. As in BSA [79], the first step generated the Pareto fronts as two vectors. However, they extended the Euclidean distance [50] to define a distance between the two vectors, so that the two objectives could be combined into a single function. The optimal solution can then be found based on this function.
Fault Tolerance
Another way to increase the reliability of workflow deployment is by introducing methods to add fault tolerance. Ideally, this should avoid reproducing and redeploying the entire workflow when nodes or links fail during execution.
AHEFT [152] is a rescheduling algorithm based on [136]. It handles changes in the grid environment, such as computing node failures, and supports the resumption of unfinished executions. A workflow is initially planned and executed over distributed computing resources so as to minimise the makespan. When the availability of resources changes, a new schedule is generated based on the current computing resources and the unfinished tasks, aiming to maximise performance.
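The idea of rescheduling only the unfinished tasks on the resources that remain can be sketched in Python as follows. This is a deliberately simplified greedy earliest-finish-time assignment that ignores task dependencies and data transfer; it is not the AHEFT algorithm itself, and all names are hypothetical.

```python
def schedule(tasks, resources, cost):
    """Greedy earliest-finish-time assignment.

    tasks: task names in priority order (dependencies are ignored for brevity).
    resources: dict mapping a resource name to the time at which it is free.
    cost: dict mapping (task, resource) to an execution time.
    Returns (assignment, makespan).
    """
    free_at = dict(resources)
    assignment = {}
    for task in tasks:
        # Pick the resource on which this task would finish earliest.
        best = min(free_at, key=lambda r: free_at[r] + cost[(task, r)])
        free_at[best] += cost[(task, best)]
        assignment[task] = best
    return assignment, max(free_at.values())


def reschedule(tasks, finished, resources, failed, cost, now):
    """Re-plan the unfinished tasks on the resources that are still available."""
    remaining = [t for t in tasks if t not in finished]
    available = {
        r: max(free, now) for r, free in resources.items() if r not in failed
    }
    return schedule(remaining, available, cost)
```

Because completed tasks are excluded from the new plan, their results are reused and the workflow resumes from its current state instead of restarting.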
VgrADS [123] is a tool that enables the execution of a workflow over multiple computing resources such as clouds and grids. Furthermore, fault tolerance techniques were added to increase the probability of success for each workflow task. These techniques analyse the initial mapping of the workflow, determine the mapping of the replicated tasks onto the available slots, and return the mapping to the planner. The tasks to replicate are selected based on the predicted failure rate of each task in the initial mapping.
QAFT [161] is a fault-tolerant scheduling algorithm that can tolerate permanent failures of one node in a heterogeneous cluster while executing real-time tasks with QoS requirements. The fault-tolerant model extends the conventional primary-backup model [122] by scheduling two copies of a task on two different resources. This increases the likelihood of successful execution and also allows correctness to be verified by comparing the two versions of the results. Based on the fault-tolerant model, a reliability model can be used to quantitatively evaluate the system’s level of fault tolerance.
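The benefit of scheduling two copies of a task can be illustrated with a short Python sketch. It assumes that the two resources fail independently and models each resource as a callable that either returns a result or raises an error; both are assumptions of this sketch and not details of QAFT, and the names are hypothetical.

```python
def replicated_reliability(primary_reliability, backup_reliability):
    """Probability that at least one of the two copies succeeds.

    Assumes the two resources fail independently.
    """
    return 1.0 - (1.0 - primary_reliability) * (1.0 - backup_reliability)


def run_with_backup(task, primary, backup):
    """Run the task on two different resources and compare the results.

    primary and backup are callables that take the task and return its result,
    or raise RuntimeError if the resource fails. Returns (result, verified),
    where verified is True only if both copies finished and agreed.
    """
    results = []
    for resource in (primary, backup):
        try:
            results.append(resource(task))
        except RuntimeError:
            pass  # this copy was lost; rely on the other one
    if not results:
        raise RuntimeError("both copies of the task failed")
    verified = len(results) == 2 and results[0] == results[1]
    return results[0], verified
```

For example, two resources with reliabilities 0.9 and 0.8 give a combined reliability of 0.98 for the task, at the cost of occupying a second resource.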
AhmedIbrahim et al. [127] introduced a robust mechanism (called AGS) which can detect resource failures and continue to offer functionality. A mobile agent model was used to decentralise the workflow execution and enhance the resource discovery and monitoring processes. Once a mobile agent has detected a failure in a computing resource, the agent and the tasks assigned to the failed resource are migrated to another computing resource through ITT (in advance Task Transmission). Workflow execution can then be resumed.
Most relevant research uses the power method to measure the reliability of workflow
deployment. However, this can only guarantee the reliability of the whole workflow, whereas a more advantageous entropy-based method is introduced and utilised in this
thesis.