The Taverna Optimization Framework - Automated Optimization Methods for Scientific Workflows in

The workflow management system Taverna [Missier2010] was selected as the reference implementation in this thesis (cf. Chapter 3.1); the framework for scientific workflow optimization is therefore integrated into Taverna. The architecture of this concept is shown in Figure 4.5. The main requirements of the framework were detailed in the previous section:

• Hide complexity from the user.

• Provide an extensible and user-friendly graphical interface.

…

[Figure 4.5 diagram labels: Taverna, Optimization Layer, WMS, Framework, Plugins, UNICORE Plugin, Parameter Optimization, Component Optimization]

Figure 4.5: Architecture of the Taverna Workflow Management System, with the previously described UNICORE plugin (cf. Chapter 3), and the proposed extended optimization framework, including the API (cf. Chapter 4.3) as well as possible optimization plugins (cf. Chapter 5.2).

In summary, the framework must offer an API that provides access to Taverna-specific functionality for model and GUI extensions. However, Taverna was not designed to support scientific workflow optimization methods, so a specific approach had to be developed for the integration. This approach, which was also described in [Holl2013f], is presented in the following and covers the architecture and implementation of the framework.

"To integrate workflow optimization within Taverna and provide Taverna-specific functionalities, a major requirement is the accessibility of all input and output parameters as well as the sub-workflow. This accessibility must be guaranteed, as the purpose of workflow optimization is to modify the workflow parameters or the workflow structure, execute the modified sub-workflows and evaluate the results. Furthermore, a mechanism is required to i) interrupt the execution of a workflow before the entire workflow is executed, and ii) execute a sub-workflow several times for optimization purposes.

In order to implement these requirements and gain access to the Taverna execution model and graphical user interface, the first approach aims at extending and reusing Service Provider Interfaces (SPIs), which are provided by the Taverna system. One such SPI provides access to the processor dispatch stack of Taverna, as described in [Missier2010] and shown in Figure 4.6. The processor dispatch stack is a mechanism to manage the execution of a specific activity. Before the activity is invoked, a predefined stack of layers is called from top to bottom, and after the execution from bottom to top. Each layer implements a specific functionality that is important for the activity invocation. In order to integrate the optimization execution into Taverna, this dispatch stack was extended with a new optimize layer on top of the stack (see Figure 4.6). This makes it possible to interrupt the workflow execution before a specific component is executed.

[Figure 4.6 layers, top to bottom. Extended stack: Optimize, Parallelize, Error bounce, Failover, Retry, Invoke. Common stack: Parallelize, Error bounce, Failover, Retry, Invoke.]

Figure 4.6: The common Taverna dispatch stack in black and the extended optimization layer in red.
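The top-to-bottom and bottom-to-top traversal of the dispatch stack can be illustrated with a minimal, language-neutral sketch. It is not Taverna's actual Java SPI: the class `Layer`, the function `run_stack` and the log format are hypothetical names introduced only for illustration. Only the layer names are taken from Figure 4.6.

```python
from functools import partial
from typing import Callable, List


class Layer:
    """One layer of a hypothetical dispatch stack (not the Taverna class)."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def call(self, job: str, below: Callable[[str], str]) -> str:
        self.log.append(f"down:{self.name}")  # before the activity is invoked
        result = below(job)                   # hand the job to the next layer
        self.log.append(f"up:{self.name}")    # after the activity has returned
        return result


def run_stack(layer_names: List[str], job: str, log: List[str]) -> str:
    """Send one job down the stack; the bottom of the stack invokes it."""

    def invoke(j: str) -> str:
        log.append("invoke")
        return f"result({j})"

    handler: Callable[[str], str] = invoke
    # Build the chain from the bottom up so the first name ends up on top.
    for name in reversed(layer_names):
        handler = partial(Layer(name, log).call, below=handler)
    return handler(job)


# Layer order as in Figure 4.6, with the new optimize layer on top.
STACK = ["Optimize", "Parallelize", "Error bounce", "Failover", "Retry"]
log: List[str] = []
result = run_stack(STACK, "activity", log)
```

Because the optimize layer sits on top, it is the first layer to see a job on the way down and the last to see its result on the way up, which is what allows it to intercept the execution.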

Additionally, advantage was taken of the sub-workflow concept provided by Taverna, which allows for the definition and execution of a sub-workflow. This implies that one can manually select a sub-workflow, which is interpreted by Taverna as a single activity and therefore processed by the dispatch stack. Only at the lowest layer is the sub-workflow decomposed into its individual activities, each of which again traverses the stack itself. This fact was utilized by defining those parts of the workflow that shall be optimized as a Taverna sub-workflow.
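The sub-workflow behaviour described above can be sketched in a few lines. This is a simplified model under stated assumptions, not Taverna code: `SubWorkflow`, `dispatch` and the activity names `align` and `score` are hypothetical. The sub-workflow passes through the stack once as a single unit and is only decomposed at the lowest layer, after which each contained activity traverses the stack on its own.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Layer names as in Figure 4.6 (extended stack).
LAYERS = ["Optimize", "Parallelize", "Error bounce", "Failover", "Retry", "Invoke"]


@dataclass
class SubWorkflow:
    """Hypothetical stand-in for a Taverna sub-workflow."""
    name: str
    activities: List[str]


def dispatch(item: Union[str, SubWorkflow], trace: List[Tuple[str, str]]) -> List[str]:
    """Pass an item through every layer and record (layer, item) pairs."""
    label = item.name if isinstance(item, SubWorkflow) else item
    for layer in LAYERS:
        trace.append((layer, label))
    if isinstance(item, SubWorkflow):
        # Lowest layer reached: decompose into the individual activities,
        # each of which traverses the whole stack again.
        results: List[str] = []
        for activity in item.activities:
            results.extend(dispatch(activity, trace))
        return results
    return [f"ran {item}"]


trace: List[Tuple[str, str]] = []
results = dispatch(SubWorkflow("sub-wf", ["align", "score"]), trace)
```

In this model the optimize layer sees the sub-workflow as a whole exactly once, which is the point at which its parameters and structure are accessible.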

Leveraging these two concepts, the processor dispatch stack provides access to the sub-workflow, which in turn provides access to all parameters, data structures, data flows, etc. during the invocation of the optimize layer (see Figure 4.7 for details).

The optimization framework also utilizes a GUI SPI and adds a new uniform perspective to the Taverna Workbench that provides a general set-up of an optimization run, as shown in Figure 4.4. The new perspective arranges the common workflow diagram (cf. right-hand red box in Figure 4.4) and a selection pane (cf. left-hand red box in Figure 4.4). Within these, the user can define the sub-workflow that shall be subject to the optimization. After the selection, all components that are not subject to optimization appear greyed out in the workflow diagram. Internally, a new sub-workflow is created. The pane at the lower left shows a specific interface implemented by the respective optimization plugin (cf. blue box in Figure 4.4).

[Figure 4.7 steps. (1) User defines the sub-wf and launches optimization (Taverna GUI). (2) Creates sub-wf and forwards it. (3) Plugin creates set of differing workflow trials representing e.g. different parameter combinations. (4) Splits the set and fills the queue, monitors execution. (5) Parallelize layer executes each sub-wf in parallel. (6) Extracts the fitness values and forwards them. (7) Evaluates the set of fitness values. (8) Optimization finished? (9) Finishing the optimization. (10) Displays the optimization results (Taverna GUI). Lanes: Taverna Optimization Framework (Optimize Layer); Optimization Plugin.]

Figure 4.7: The control flow of the new optimize layer and a parameter optimization plugin for workflow optimization. The colored activities in phase (3) define one sub-workflow. Before and after the optimization the workflow is regularly executed ('sub-wf' stands for sub-workflow). (Source: [Holl2012b])

In order to provide access to the execution model and user interface mechanisms, an API was developed for the framework. This API lets developers extend optimization methods without reinventing and reimplementing Taverna-specific execution and GUI functionalities. The API provides methods for i) requesting the GUI pane to provide specific text fields, selection options, etc. for the particular optimization method, ii) starting the optimization process of the specific plugin, iii) receiving the modified parameters or sub-workflows to start the execution, iv) receiving the fitness value from the workflow to forward it to the plugin, v) requesting the termination criteria and deciding whether the optimization is finished, and vi) requesting the best result to return it to the user. The use of this API is described in the following (cf. Figure 4.7), and an example usage is described in Chapter 5.2.
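The six API responsibilities i) to vi) can be pictured as a plugin contract. The sketch below is only an illustration in Python and does not reproduce the framework's actual Java API: all class, method and parameter names (`OptimizationPlugin`, `gui_fields`, `next_trials`, `SweepPlugin`, and so on) are hypothetical, and the convention that a higher fitness value is better is an assumption.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple

Params = Dict[str, float]


class OptimizationPlugin(ABC):
    """Hypothetical plugin contract mirroring API points i) to vi)."""

    @abstractmethod
    def gui_fields(self) -> List[str]: ...           # i) fields for the GUI pane

    @abstractmethod
    def start(self, settings: Dict[str, float]) -> None: ...  # ii) start the run

    @abstractmethod
    def next_trials(self) -> List[Params]: ...       # iii) modified parameter sets

    @abstractmethod
    def receive_fitness(self, trials: List[Params], fitness: List[float]) -> None: ...  # iv)

    @abstractmethod
    def is_finished(self) -> bool: ...               # v) termination criterion

    @abstractmethod
    def best_result(self) -> Optional[Tuple[Params, float]]: ...  # vi) best result


class SweepPlugin(OptimizationPlugin):
    """Toy plugin: evaluates evenly spaced values of one parameter 'x' once."""

    def gui_fields(self) -> List[str]:
        return ["lower", "upper", "steps"]

    def start(self, settings: Dict[str, float]) -> None:
        lo, hi, n = settings["lower"], settings["upper"], int(settings["steps"])
        # Assumes steps >= 2 so that the spacing is well defined.
        self._pending = [{"x": lo + i * (hi - lo) / (n - 1)} for i in range(n)]
        self._best: Optional[Tuple[Params, float]] = None

    def next_trials(self) -> List[Params]:
        trials, self._pending = self._pending, []
        return trials

    def receive_fitness(self, trials: List[Params], fitness: List[float]) -> None:
        for trial, value in zip(trials, fitness):
            # Assumption: a higher fitness value is better.
            if self._best is None or value > self._best[1]:
                self._best = (trial, value)

    def is_finished(self) -> bool:
        return not self._pending

    def best_result(self) -> Optional[Tuple[Params, float]]:
        return self._best


plugin = SweepPlugin()
plugin.start({"lower": 0.0, "upper": 4.0, "steps": 5})
trials = plugin.next_trials()
# Stand-in fitness with its maximum at x = 3; a real run would execute sub-workflows.
plugin.receive_fitness(trials, [-(t["x"] - 3.0) ** 2 for t in trials])
```

Keeping the search strategy behind such a contract means the framework only needs to know how to execute trials and pass fitness values back, while each plugin decides what to try next.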

In the new optimization perspective, which is provided by the framework, the user can select the sub-workflow and optimization-specific information (cf. Figure 4.7(1)). After the user has provided these required inputs, the newly integrated run button activates the fully automated optimization process. The workflow is executed until the respective sub-workflow is reached. Then the new optimize layer extracts the required information and forwards it to the particular optimization plugin for modification (cf. Figure 4.7(2)). The actual optimization method and sub-workflow refinement are then handled by the respective optimization plugin and not by the framework. The optimization plugin returns a set of new sub-workflow entities, each with a different set of parameters (cf. Figure 4.7(3)). The optimization framework executes this sub-workflow set by utilizing the topmost Taverna standard layer, namely parallelize (cf. Figure 4.7(4)). More precisely, it provides a queue filled with sub-workflows for the parallelize layer, which in turn pushes each sub-workflow down the stack in a separate thread. This triggers the parallel execution of all sub-workflows (cf. Figure 4.7(5)). After execution of the sub-workflows, the optimize layer receives a set of results from the parallelize layer. This result set, which represents the fitness values, is again passed through to the specific optimization plugin for analysis and evaluation (cf. Figure 4.7(6 and 7)). If the optimization has reached the termination condition, the optimization is finished (cf. Figure 4.7(8 and 9)). Otherwise, the plugin creates another set of sub-workflows for the next round of sub-workflow executions (cf. Figure 4.7(3)). Once a termination condition is reached, the optimal sub-workflow and parameter set is returned to the user and the workflow execution can be resumed." (Modified from [Holl2013f].) Finally, the result is presented to the user (cf. Figure 4.7(10)), who can store it as a file including statistics and meta-data of the optimization process or as a Taverna-conformant workflow input file for sharing purposes.
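The loop of steps (3) to (9) in Figure 4.7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: a Python thread pool stands in for the parallelize layer, `run_sub_workflow` is a hypothetical stand-in for executing one sub-workflow, and `ShrinkingGridPlugin` is a toy search strategy invented for this example (higher fitness is assumed to be better).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

Params = Dict[str, float]


def run_sub_workflow(params: Params) -> float:
    """Hypothetical stand-in for one sub-workflow run; returns its fitness."""
    return -(params["x"] - 2.5) ** 2


class ShrinkingGridPlugin:
    """Toy plugin: samples a grid, then narrows the interval around the best point."""

    def __init__(self, lower: float, upper: float, points: int = 5, max_rounds: int = 4):
        self.lower, self.upper = lower, upper
        self.points, self.max_rounds = points, max_rounds
        self.rounds = 0
        self.best: Optional[Tuple[Params, float]] = None

    def _step(self) -> float:
        return (self.upper - self.lower) / (self.points - 1)

    def next_trials(self) -> List[Params]:
        step = self._step()
        return [{"x": self.lower + i * step} for i in range(self.points)]

    def evaluate(self, trials: List[Params], fitness: List[float]) -> None:
        round_fitness, round_trial = max(zip(fitness, trials), key=lambda p: p[0])
        if self.best is None or round_fitness > self.best[1]:
            self.best = (round_trial, round_fitness)
        step = self._step()
        # Narrow the search interval to one grid step around this round's best.
        self.lower, self.upper = round_trial["x"] - step, round_trial["x"] + step
        self.rounds += 1

    def finished(self) -> bool:
        return self.rounds >= self.max_rounds


def optimize(plugin: ShrinkingGridPlugin,
             execute: Callable[[Params], float],
             workers: int = 4) -> Optional[Tuple[Params, float]]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            trials = plugin.next_trials()              # step (3): new trial set
            fitness = list(pool.map(execute, trials))  # steps (4) to (6): run in parallel
            plugin.evaluate(trials, fitness)           # step (7): evaluate fitness values
            if plugin.finished():                      # step (8): termination check
                return plugin.best                     # step (9): finish


plugin = ShrinkingGridPlugin(lower=0.0, upper=4.0)
best_params, best_fitness = optimize(plugin, run_sub_workflow)
```

`pool.map` returns the results in the order of the submitted trials, so each fitness value can be matched to its parameter set even though the runs execute concurrently.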

In document Automated Optimization Methods for Scientific Workflows in e-Science Infrastructures (Page 78-82)