
5.4 Simulation of Workflow Structure Optimization

5.4.3 The Topology Level

As stated in Chapter 4.1.2, components used within a scientific workflow may also work in a switched order. This can be the case, for example, when using filtering or clustering mechanisms. Scientists want to evaluate which workflow order performs best for a specific problem. This section shows an example use case utilizing two different filter mechanisms.

BLAST workflow:

A widely used experiment in the life sciences is to search a nucleotide or protein database for sequence similarities. BLAST is the tool commonly used to perform such a search. Using the BLAST Web services, scientists can enter a target search sequence, and BLAST matches this query sequence against a database. The best-matching sequences are returned according to a calculated statistical significance.

Motivation: In order to reduce the size of the result set of sequences, scientists may apply different filter mechanisms. One such filter may select the top k sequences with respect to the expected value (e-value), which can be used as a statistical significance threshold. Another filter may order the result set by the length of the matching sequences and keep the top k longest sequences.

Workflow: An abstract workflow of this experiment is shown in Figure 5.14. This workflow can be executed in the order filter_length → filter_e_value or filter_e_value → filter_length. A scientist may want to know which path performs best regarding a specific data set. The workflow is available at myExperiment: http://www.myexperiment.org/workflows/3706.html
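The two execution orders can be illustrated with a minimal sketch. It is not the published workflow: the `Hit` record, the function bodies and the toy data are assumptions made for illustration, and it assumes that a lower e-value means a more significant match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one BLAST match (not the workflow's data format)."""
    seq_id: str
    e_value: float  # assumption: lower means more significant
    length: int     # length of the matching region


def filter_e_value(hits, k):
    """Keep the k hits with the smallest e-value."""
    return sorted(hits, key=lambda h: h.e_value)[:k]


def filter_length(hits, k):
    """Keep the k hits with the longest match."""
    return sorted(hits, key=lambda h: h.length, reverse=True)[:k]


def run_path(hits, order, top_e_value, top_length):
    """Apply both filters in the requested order."""
    if order == "length_first":
        return filter_e_value(filter_length(hits, top_length), top_e_value)
    if order == "e_value_first":
        return filter_length(filter_e_value(hits, top_e_value), top_length)
    raise ValueError(f"unknown order: {order}")


# Toy data, invented for this sketch.
HITS = [
    Hit("a", 1e-50, 100),
    Hit("b", 1e-40, 300),
    Hit("c", 1e-30, 250),
    Hit("d", 1e-5, 400),
]

length_first = run_path(HITS, "length_first", top_e_value=1, top_length=2)
e_value_first = run_path(HITS, "e_value_first", top_e_value=1, top_length=2)
```

With these toy values the two paths keep different sequences, which is the reason why the order itself is worth optimizing.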

Fitness: The fitness value was defined as the relation between the number of sequences of the original result and the number of sequences of the final result that occur in a gold standard after filtering (using a fixed number of top k sequences).
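One possible reading of this fitness definition is sketched below. The exact formula is not given in this section, so the ratio used here, as well as the function names and the toy identifiers, are assumptions.

```python
def gold_count(seq_ids, gold_standard):
    """Number of sequence identifiers that also occur in the gold standard."""
    return sum(1 for seq_id in seq_ids if seq_id in gold_standard)


def fitness(original_ids, final_ids, gold_standard, k):
    """Assumed fitness: gold-standard members in the final top k,
    relative to gold-standard members in the original top k."""
    n_original = gold_count(original_ids[:k], gold_standard)
    n_final = gold_count(final_ids[:k], gold_standard)
    if n_original == 0:
        raise ValueError("original top k contains no gold-standard sequence")
    return n_final / n_original


# Toy identifiers, invented for this sketch.
gold = {"g1", "g2", "g3"}
original = ["g1", "x1", "g2", "x2"]  # ranked result before filtering
final = ["g1", "g2", "g3", "x1"]     # ranked result after filtering
toy_fitness_value = fitness(original, final, gold, k=3)
```

A value above 1 under this reading would mean that the filtered list contains more gold-standard sequences than the original list of the same length.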

Data input: As example input, a protein family was used as gold standard and one exemplary sequence was applied for the search. The gold standard originated from the ELL-associated factor (EAF) family. The Pfam database entry can be retrieved from http://pfam.sanger.ac.uk/family/PF09816.

Optimization:

Optimized parameters: top_e_value, top_length, flag

Fixed parameters: database, sequence_id

User constraints for parameters: top_e_value ∈ [30, 70] (integer), top_length ∈ [30, 70] (integer), flag ∈ {1, 2, 3} (integer)

Used data set: EAF family
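To give a feeling for the size of the search space defined by these constraints, the following sketch enumerates the allowed values. The dictionary layout and the helper names are assumptions; the meaning of the individual flag values is not stated in this section, so flag is treated as an opaque integer.

```python
# Allowed values per optimized parameter (both interval bounds are inclusive).
CONSTRAINTS = {
    "top_e_value": range(30, 71),
    "top_length": range(30, 71),
    "flag": (1, 2, 3),  # semantics not specified here; treated as opaque
}


def search_space_size(constraints):
    """Number of distinct parameter sets allowed by the constraints."""
    size = 1
    for values in constraints.values():
        size *= len(values)
    return size


def is_valid(candidate, constraints):
    """True if the candidate sets exactly the constrained parameters
    and every value lies in its allowed set."""
    return candidate.keys() == constraints.keys() and all(
        candidate[name] in values for name, values in constraints.items()
    )


size = search_space_size(CONSTRAINTS)  # 41 * 41 * 3 combinations
```

An exhaustive parameter sweep over this space would require one workflow execution per combination, which is why a heuristic search is attractive even for this small example.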

Result: The maximum fitness was reached with the path filter_length → filter_e_value, using the top 67 sequences for the length filter and the top 35 for the e-value filter. Of this top 35 list, 34 sequences occurred in the gold standard. In contrast, only 31 sequences from the original top 35 list occurred in the gold standard.

Scientific Use: The example use case shows that topology optimization is especially useful when using filter or clustering mechanisms. The data formats remained the same, and the example workflow could be designed without using data transformation components.

[Figure 5.14 diagram; labels: filter_length, filter_e_value, calc_Fitness, fitness, sequence_Id, top_e_value, top_length, database]

Figure 5.14: An abstract workflow to perform a BLAST search including two filters. The order of the filters may be changed and optimized.

5.4.4 Discussion

The exemplary use case shown in the previous subsection used the e-value and the length of a sequence match as filters. From this, a workflow with two alternative paths was created. The optimization was simulated with the parameter optimization plugin and illustrated the concept of topology optimization. However, a dedicated plugin for topology optimization may be required in the future, especially when more parameters are added.

The automated process of topology optimization poses the same challenges as those stated for component optimization. The main challenge is that input and output ports may differ in their type, and a proper ontology would be required to find generic transformation components or at least interchangeable components.

5.5 Conclusion

In this chapter, an exemplary plugin for parameter optimization based on a Genetic Algorithm was integrated into the previously developed optimization framework. Developers may use it as a reference implementation for further plugins, as it shows the usage of the proposed framework and API. The plugin did not have to deal with any Taverna internals, such as those required for parallel execution or security mechanisms. Thus, the plugin integration was straightforward: it only required implementing the methods demanded by the API and focusing on the optimization algorithm and level in question.

The plugin makes use of a GA library to implement the heuristic optimization process. The workflow parameters were encoded as genes and one parameter set as a chromosome. To take user knowledge into account, a panel was integrated into the optimization perspective in order to capture ranges for parameters and dependencies.

Four use cases taken from proteomics, ecological niche modeling, biomarker identification and structural bioinformatics have demonstrated that automated parameter optimization gives improved results compared to default values, manual search (trial and error) or parameter sweeps, while being more time and resource efficient. The proteomics use case led to new findings, which were obtained after a second optimization process with higher parameter limits than those typically applied with the observed tools. These findings, together with the fact that three different aspects of this use case were optimized, may inspire scientists to apply similar optimizations to other proteomics use cases.
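The encoding described above (workflow parameters as genes, one parameter set as a chromosome) can be sketched with a very small Genetic Algorithm. This sketch does not use the GA library the plugin relies on and does not reproduce the plugin's operators; the bounds, the operators (tournament selection, uniform crossover, bounded mutation, elitism) and the toy fitness function are all assumptions. In the real setting, each fitness evaluation would correspond to a workflow execution.

```python
import random

# Hypothetical inclusive ranges; one gene per workflow parameter.
BOUNDS = [(30, 70), (30, 70), (1, 3)]


def random_chromosome(rng):
    """One parameter set, drawn uniformly within the user-defined ranges."""
    return [rng.randint(lo, hi) for lo, hi in BOUNDS]


def crossover(a, b, rng):
    """Uniform crossover: each gene is taken from one of the two parents."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(chromosome, rng, rate=0.2):
    """Redraw each gene within its bounds with probability `rate`."""
    return [
        rng.randint(lo, hi) if rng.random() < rate else gene
        for gene, (lo, hi) in zip(chromosome, BOUNDS)
    ]


def tournament(population, fitness, rng, size=3):
    """Pick the fittest of `size` randomly chosen chromosomes."""
    return max(rng.sample(population, size), key=fitness)


def optimize(fitness, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        elite = max(population, key=fitness)  # elitism: keep the best so far
        children = [
            mutate(
                crossover(
                    tournament(population, fitness, rng),
                    tournament(population, fitness, rng),
                    rng,
                ),
                rng,
            )
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
    return max(population, key=fitness)


def toy_fitness(chromosome):
    """Stand-in for a workflow run; its maximum is at (50, 60, 2)."""
    a, b, c = chromosome
    return -((a - 50) ** 2 + (b - 60) ** 2 + (c - 2) ** 2)


best = optimize(toy_fitness)
```

Because a workflow execution is expensive, a practical implementation would cache fitness values rather than re-evaluate the same chromosome, which this sketch omits for brevity.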

"The ecological niche modeling workflow was optimized as the default parameters provided non sufficiently accurate models. The improved SVM algorithm produced more valuable models that could then be used for further processing." [Obst2013]. The optimized biomarker identification workflow ranked all important genes higher than the workflow using the default parameters. The protein structure prediction optimization achieved a 10% better result with four times fewer executions compared to manually performed parameter sweeps.

The parameter optimization plugin was also used for component optimization and showed that component optimization is valuable.

The main contributions of this chapter can be summarized as follows:

• A parameter optimization plugin was implemented as an example from the developer’s perspective and plugged into the framework.

• Four different use cases were tested to evaluate the framework and plugin.

• In particular, the proteomics use case showed that parameter optimization is useful, as new insights into mass spectrometry were obtained.

• Suggestions and simulations were made for component and topology optimization use cases.

Chapter 6

Discussion: Scientific Workflow Optimization in e-Science

Within this thesis, the possibility to adapt optimization to scientific workflows was investigated. It was shown that there is great potential to support scientists in improving scientific experiments. The main problem was that no solution had previously been available to automatically improve the scientific output of a workflow within a SWMS. Researchers could use the existing design frameworks, but these often lack a graphical user interface, new models have to be implemented in a programming language, and interaction with workflows is not possible at all. Commonly, researchers simply adhere to default values or to values already proven to be good. To improve a result, they used trial and error or parameter sweeps, both of which may be inefficient and time consuming. Within this thesis, a novel approach was developed and implemented in order to improve scientific workflows in an automated and generic manner.

This chapter first discusses problems that can occur during the optimization process and presents approaches that can improve the overall optimization process in the future. Some of those approaches have demanding requirements regarding a minimum number of available executions or computing time. To overcome these requirements, the second part of this chapter introduces a novel approach to integrate optimization meta-data into the provenance data of scientific workflows. This extended provenance data could be used to overcome many difficulties of workflow optimization in the future.

6.1 Examination of Scientific Workflow Optimization

Within this thesis, only a subset of the possibilities and improvements that can be achieved by using optimization instead of trial and error or parameter sweeps was discussed. This is due to the large variety of available optimization algorithms and their many improvements and variations. This section explores the field of optimization and investigates several approaches to improve the optimization of life science workflows. To begin with, some general problems of optimization are placed in the context of workflow optimization.