5.3 Layer Decomposition Workflow Similarity

5.4.5 Applicability to Other Datasets

Our gold standard corpus also includes a second set of workflows from another workflow repository, namely the public Galaxy workflow repository. For 8 query workflows, rated lists of compared workflows are available to evaluate ranking performance. This dataset differs from the previous one in various respects: Galaxy workflows are exclusive to the bioinformatics domain, the repository is smaller and curated by a smaller group of people, the annotation is generally sparser (no tags etc.), and the modules used are only local executables (no web services, as frequently used in Taverna). Looking at such diverse data sets is important to show the robustness of any evaluation results.

Figure 5.10 shows ranking correctness for sim_{MS}, sim_{LD}, and sim_{PS} on this second dataset. The module comparison schemes used are gw1, comparing a selection of attributes with uniform weights, and gll, comparing only module labels by their edit distance. While results are generally worse than on the myExperiment data set, sim_{LD} here outperforms the other algorithms even more clearly. We are currently looking to extend this dataset to be able to perform a more complete evaluation and to trace back the observed differences in ranking performance to properties of the data set.
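To make the label-only scheme gll more concrete, the following is a minimal sketch of comparing two module labels by their edit distance. The function names, the lowercasing, and the normalisation by the longer label's length are assumptions made for illustration; the text above only states that labels are compared by edit distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def label_similarity(label_a: str, label_b: str) -> float:
    """Map the edit distance of two labels to a similarity in [0, 1].

    Assumption: labels are lowercased and the distance is divided by
    the length of the longer label; other normalisations are possible.
    """
    a, b = label_a.lower(), label_b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty labels are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

A label-only comparison of this kind needs no annotations beyond the module name, which is why it remains applicable to a sparsely annotated repository such as Galaxy.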

5 Layer Decomposition Similarity of Scientific Workflows

5.5 Conclusion

We introduced Layer Decomposition (LD), a novel approach for workflow comparison specifically tailored to measuring the similarity of scientific workflows. We comparatively evaluated this algorithm against a set of state-of-the-art contenders in terms of workflow ranking and retrieval. We showed that LD provides the best results in both tasks, and that it does so across a variety of different configurations - even those not requiring extensive external knowledge. Results in ranking could be confirmed using a second data set. Considering runtime, we not only showed our algorithm to be faster than other structure-aware approaches, but also demonstrated how different algorithms can be combined to reduce the overall runtime while achieving comparable, or even improved, result quality.

Though we did consider runtimes, our evaluation clearly focusses on the quality of ranking and retrieval. Real-time similarity search at repository scale will require further efforts in terms of properly indexing workflows. Such indexing of workflows is straightforward when considering only their modules (as in sim_{MS}), but requires more sophisticated methods when topology should also be indexed. Therefore, our approach of stacking Layer Decomposition-based ranking onto workflow retrieval by modules provides a good starting point for applying structure-based workflow similarity to scientific workflow discovery at scale.
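The claim that module-only indexing is straightforward can be illustrated with a small inverted index. This is a sketch under stated assumptions: the class `ModuleIndex`, its methods, and the Jaccard overlap used for scoring are hypothetical stand-ins and not the sim_{MS} measure itself.

```python
from collections import defaultdict


class ModuleIndex:
    """Toy inverted index from module labels to workflow identifiers."""

    def __init__(self):
        self._postings = defaultdict(set)  # module label -> workflow ids
        self._modules = {}                 # workflow id -> set of module labels

    def add(self, workflow_id, modules):
        module_set = set(modules)
        self._modules[workflow_id] = module_set
        for module in module_set:
            self._postings[module].add(workflow_id)

    def search(self, query_modules, k=10):
        """Return up to k (workflow id, score) pairs, best first."""
        query = set(query_modules)
        # Only workflows sharing at least one module with the query are scored.
        candidates = set()
        for module in query:
            candidates |= self._postings.get(module, set())
        scored = []
        for workflow_id in candidates:
            other = self._modules[workflow_id]
            # Assumption: Jaccard overlap of the two module sets.
            scored.append((len(query & other) / len(query | other), workflow_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(workflow_id, score) for score, workflow_id in scored[:k]]
```

The index never inspects how modules are connected, which is exactly why it is cheap to build and query, and why topology needs a separate step.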

6 Accelerating Similarity Search in Scientific Workflow Repositories

In the previous chapters, we have shown that structure-based similarity search for scientific workflows can substantially outperform annotation-based approaches with respect to result quality. A drawback to such structure-based methods is that they are comparatively slow to compute. This consideration of speed becomes important when translating our previous results into a real-world system for similarity search over whole repositories: To be accepted by users, search results have to be presented fast, if not near instantly - making indexing a non-optional requirement. Yet, structure-aware indexing of workflows is not straightforward. For instance, the system proposed in [43] uses subgraph matching for similarity search in a scientific workflow repository. Their use of an existing graph indexing library requires significant workarounds for the intended purpose, which cause a substantial slowdown of the resulting system. Not resorting to such existing libraries, [9] introduce a system for workflow similarity search using a two-phase retrieval: After an initial, rough preselection of (potentially) suitable workflows from a fast search over the whole repository, only some candidate workflows are subjected to a more complex graph-based comparison.

In this respect, our previous results are encouraging: Next to retrieval quality of single similarity algorithms, we also investigated how multiple structure-based and annotation-based measures can be stacked and ensembled into combined similarity measures to benefit both result quality and retrieval speed. In particular, we have shown:

1. On the level of whole workflows, combining the use of a (structure-agnostic) Module Set approach for retrieval with a (structure-aware) Layer Decomposition step for reranking of the initial retrieval results maintains result quality in comparison to purely structure-based retrieval (Section 5.4.4).

2. For single module comparison, the edit distance of their labels can be effectively used to assess their functional similarity - for workflows where module labels are telling (Section 4.4.1).

3. External knowledge derived from the repository not only improves result quality, but also reduces the sizes of the compared workflows, which leads to a speedup of the comparison process (4.4.1).

Inspired by these findings, we here present an approach for fast similarity search in scientific workflow repositories that takes the workflows’ structure into account.

Figure 6.1: Schematic overview of scientific workflow similarity search using structure-based reranking.

The main goal is to demonstrate the feasibility of such a system, and to investigate transferability of our previous retrieval results to a repository-scale real-world scenario. In this chapter, we show how our previous findings can be leveraged to a) efficiently index scientific workflows for fast similarity search using off-the-shelf technology, b) improve retrieval precision within the top-x results by reranking the initial structure-agnostic search results with the structure-aware Layer Decomposition algorithm, and c) further speed up the overall retrieval process by tweaking specific subtasks of the reranking algorithm.

In the following, we first give an account of our proposed architecture and how it indexes workflows. In Section 6.3 we evaluate the system for its retrieval quality and runtime. We conclude in Section 6.4.

6.1 Two-phase Retrieval Architecture

Building on our findings on reranked retrieval of scientific workflows described in the previous chapter, we constructed a fast method for scientific workflow similarity search. We use a two-phase approach consisting of an initial, structure-agnostic retrieval step using off-the-shelf indexing technology, and a subsequent step of result reranking using more complex structure-based workflow comparison. An overview of the system as a whole is given in Figure 6.1. In the following, we describe the index, and the steps of retrieval and reranking in detail.
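The control flow of such a two-phase search can be sketched as follows. Everything here is an illustrative assumption: workflows are represented as sets of directed module edges, the first phase scans the repository instead of querying an index, and `edge_score` is a simple placeholder for a structure-aware measure, not the Layer Decomposition algorithm.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]
Workflow = FrozenSet[Edge]  # hypothetical: a workflow as a set of directed edges


def modules_of(workflow: Workflow) -> Set[str]:
    return {module for edge in workflow for module in edge}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def module_score(query: Workflow, candidate: Workflow) -> float:
    """Structure-agnostic: compares module sets only."""
    return jaccard(modules_of(query), modules_of(candidate))


def edge_score(query: Workflow, candidate: Workflow) -> float:
    """Structure-aware placeholder: compares directed edges."""
    return jaccard(set(query), set(candidate))


def two_phase_search(
    query: Workflow,
    repository: Dict[str, Workflow],
    k: int = 10,
    final: int = 5,
    retrieve_score: Callable[[Workflow, Workflow], float] = module_score,
    rerank_score: Callable[[Workflow, Workflow], float] = edge_score,
) -> List[str]:
    # Phase 1: cheap preselection of the k best candidates by modules.
    # A real system would answer this from an index rather than a scan.
    candidates = sorted(
        repository,
        key=lambda wid: (-retrieve_score(query, repository[wid]), wid),
    )[:k]
    # Phase 2: the costlier structural comparison runs on k candidates only.
    reranked = sorted(
        candidates,
        key=lambda wid: (-rerank_score(query, repository[wid]), wid),
    )
    return reranked[:final]
```

Because the second phase touches only k candidates, its cost is independent of repository size. The trade-off is that a workflow missed by the first phase can never be recovered by reranking.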

In document Similarity measures for scientific workflows (Page 101-104)