3.4 Exception Handler
3.4.2 Example of Exception Handling
The exception handling algorithm is illustrated using the running example from the
previous study [143], showing how it can be extended to deal with cloud failure. Fig- ure 3.6 shows the workflow (→ denotes a data dependency, D and S represent datum
and service respectively, while the subscript is a unique identifier for each block).
The example of a pipeline workflow is partitioned over a federated cloud that is com-
posed of two clouds: a public cloud and a private cloud. The previously described method [143] uses a static partitioning algorithm that generates six valid partitioning
options (Figure 3.7).
Once the valid options for deploying the services and data to the clouds have been determined, it is necessary to select an option for execution, and a cost model is used
cloud #0 Partition #2 cloud #1 Partition #1 D1 S1 D2 D0 S0 D1 sxfer (a) opt1 cloud #1 Partition #1 Partition #3 cloud #0 Partition #2 Partition #4 D1 sxfer D1 D2 D0 S0 D1 sxfer S1 D2 sxfer (b) opt2 cloud #1 Partition #1 D0 S0 D1 S1 D2 (c) opt3 cloud #0 Partition #2 cloud #1 Partition #1 Partition #3 D1 S1 D2 sxfer D2 D0 S0 D1 sxfer (d) opt4 cloud #0 Partition #2 cloud #1 Partition #1 D2 D0 S0 D1 S1 D2 sxfer (e) opt5 cloud #0 Partition #2 cloud #1 Partition #1 Partition #3 D1 sxfer D1 D0 S0 D1 sxfer S1 D2 (f) opt6
Figure 3.7: The Six Valid Cloud Mappings
Cloud Storage Transfer In Transfer Out CPU (GB / Month) (/GB) (/GB) (/s)
c0 10 10 10 10
c1 20 5 5 20
Table 3.4: Cloud Costs: Example
to find the best option. The factors that affect costs are the price of the Storage,
Transfer and CPU. Table 3.4 gives an example of costs for the two clouds, while Table 3.5 gives examples of costs for the workflow blocks
In the prior study [143], the cost model was used to choose the best option, which was then executed. The model is summarised as follows:
Block Size Longevity CPU (GB) (months) (s) D0 10 12 S0 60 D1 5 0 S1 100 D2 20 12
cost = X
di,j∈D
store(dci
i,j) ∗ size(di,j) ∗ longevity(di,j)+
= X
sj∈S
exec(ci) ∗ time(sj)+
= X
di,j∈D
(coutj + cini ) ∗ size(di,j)
where store(dci
i,j) is the cost of storing a unit data in ci. The size() and longevity()
represent the data size and storage time respectively. Moreover, exec(ci) is the unit
price of ci, which multiplied by with the execution time of sj. coutj and cini give the
cost of transferring a unit data out cj and that in ci.
However, no consideration was given as to the action to take if there was a cloud
failure during execution which prevented its completion. The key contribution of this chapter is to propose a method for extending the previous model to allow the exception
handler to decide on the best way to deal with exceptions that occur during workflow execution. The aim is to choose the cheapest available option at the point when the
exception is raised.
Firstly, the options generated by the partitioning algorithm (see Figure 3.7) are com-
bined into a Partition Tree. This is achieved as follows:
1. A Partition Set is created, consisting of all partitions found in the partitioning options (Figure 3.8 shows all the partitions for the running example)
2. Create a directed graph from this set of partitions by adding directed edges
between partitions wherever a data dependency between those partitions exists in the partitioning options
3. Prune the graph to replace multiple edges between two partitions with a single
edge
4. Label the graph with the cost of executing each partition (including any costs of transferring any input data into that partition)
Partition C C1 Partition D C0 Partition E C0 Partition A C1 Partition B C0 Partition F C1 Partition G C1 Partition H C0 Partition I C1 Partition J C1 D0 S0 D1 sxfer D1 sxfer D1 S1 D2 sxfer D2 D1 S1 D2 sxfer D2 D0 S0 D1 S1 D2 D1 S1 D2 D1 S1 D2 D0 S0 D1 S1 D2 sxfer
Figure 3.8: Valid Partitioning List
Figure 3.9 shows the Partition Tree for the running example. The yellow nodes indicate
that the partitions are deployed on Cloud 1, while the salmon colour represents those deployed on Cloud 0. By convention, the cost of the initial partitions A, G and J, do
not include the cost of transferring data in to the clouds on which they are deployed (we assume that the data is already there); likewise, no outgoing data cost is calculated for
the terminal partitions B, E, G and J. The costs are shown as labels on the incoming edges to the partitions.
Next, the Partition Graph can be used to work out the best option for executing the
workflow. Initially, from Figure 3.9, it is easy to see that the “best path” is A → H.(the weight in each path represents the lowest cost from its start node to the leaf nodes.)
In this case, the workflow application consists of two partitions, which are deployed onto two clouds, Cloud 0 and Cloud 1, as shown in Option1 of Figure 3.7a.
When an exception occurs in a partition currently being executed, the program iden-
tifies the type of exception first, and then decides to how to handle it. If the exception must be solved by replacing the currently running partition, an alternative partition
is found using the Partition Tree as follows
For a Type1 (failed communication) exception all the other child partitions are consid- ered, and tried in order of the cost of the path through that child to a terminal node
PartitionTree START G 10400 A 7075 J 8300 E 3775 B 3550 C 4775 H 3475 I 6875 D 2700 F 5100 2700 5100 3775 4775 3475 6875 2700 5100
Figure 3.9: Partition Tree
(cheapest first). If this node does not have child nodes or none are able to accept the
data then the same approach is taken, starting with the parent node of the partition that raised the exception. If this is unsuccessful, then the process is repeated for the
grandparent partition, and so on, moving up the tree, checking all alternatives, until the data is safely delivered to another partition, or the “Start” node is reached, in
which case there is no possibility of executing the workflow.
To deal with a Type 2 exception (partition execution fails), attempts are first made to re-execute the partition that generated the exception, after a time delay between
each attempt. If this fails then, starting from the parent of the failed partition, all the other child partitions are considered, and tried in order of the cost of the path through
that child to a terminal node (cheapest first). If this is unsuccessful, then the process is repeated for the grandparent partition, and so on, moving up the tree, checking all
alternatives, until the data is safely delivered to another partition, or the “Start” node is reached in which case there is no possibility of executing the workflow.
For the running example, path A → H is the best initial option.
The actions that the exception handling system will take in some example cases are now considered.
Figure 3.10: Exception Handler Works on Option1
1. The first case is a failure of Cloud 1 while partition A is running. When Man- ager(1) detects this exception, it abandons the workflow application as there is
no a partition that can replace partition A.
2. If the exception is caused by losing D1 when it is transferred from Cloud 1 to
Cloud 0 then, as Figure 3.10 shows, after a limited number of retries Manager(3), the controller of partition C, is invoked to replace partition H.
3. When partition A’s execution has finished, the system will try to execute par- tition H. However, if Cloud 0 is not available, running H is not an option.
Keeping the workflow running on Cloud 1 would be the only way to solve this problem. Therefore partition I or C from Option2 and Option6 respectively are
both potential replacement partitions.
4. If the exception is caused by Cloud 0 failing while partition H is running, then partition C will be the best alternative partition because the path C → D is
the cheapest option which is still executable. Cloud 0 may be available when partition C is finished; and even if it is not, partition D can also be replaced by
partition F.