Evaluation by Life Science Use Cases - Automated Optimization Methods for Scientific Workflows

To evaluate the performance of the UNICORE-Taverna plugin presented in Section 3.2, an example workflow from the proteomics domain was examined. The following describes the use case in detail and then presents a survey of local execution, parallel execution and distributed execution via sweep jobs.

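[Figure 3.5 (diagram): Taverna with the Sweep Generator (steps a to c), the UNICORE/X server with Service Orchestrator and XUUDB (step d), and the UNICORE target system (step e); the inputs a1…an and b1 are combined into the executions a1b1, a2b1, a3b1, …, anb1.]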

Figure 3.5: The figure shows the sweep job mechanism in Taverna. (a) One value set and one single input value arrive at the workflow input ports. (b) Corresponding to the parallel execution in Taverna, several parallel activities are instantiated. (c) The sweep generator collects all instances and creates one job sweep description for submission to UNICORE. (d) The UNICORE server splits up the job description script into the original number of executions. (e) The applications are executed on the target system.
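The mechanism in Figure 3.5 can be illustrated with a minimal sketch. The dictionary layout and the function names `build_sweep_description` and `expand_sweep` are hypothetical and chosen only for illustration; they do not reproduce the actual UNICORE job description format or the plugin's code. The sketch shows the idea of steps (c) and (d): many parallel instances are collected into one description, which the server later expands into the original number of executions.

```python
from itertools import product


def build_sweep_description(executable, shared_inputs, swept_inputs):
    """Client side (step c): collect all instances into ONE job description.

    The dictionary layout is a hypothetical stand-in, not the UNICORE format.
    """
    return {
        "executable": executable,
        "shared": dict(shared_inputs),  # sent once, e.g. the single value b1
        "sweep": {name: list(values) for name, values in swept_inputs.items()},
    }


def expand_sweep(description):
    """Server side (step d): split the description into individual executions."""
    names = sorted(description["sweep"])
    value_lists = [description["sweep"][name] for name in names]
    jobs = []
    for combination in product(*value_lists):
        arguments = dict(description["shared"])
        arguments.update(zip(names, combination))
        jobs.append({"executable": description["executable"], "args": arguments})
    return jobs


# One value set (a1..a3) and one single value (b1), as in Figure 3.5 (a).
description = build_sweep_description("app", {"b": "b1"}, {"a": ["a1", "a2", "a3"]})
jobs = expand_sweep(description)
```

In this sketch the client performs a single submission regardless of the size of the value set, while `jobs` contains one entry per combination (a1b1, a2b1, a3b1).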

In proteomics, fundamental questions are addressed by analyzing proteins experimentally to understand their structures and functions. Mass spectrometry is one of the techniques used to identify the characteristics of proteins. This technique measures the individual masses of protein molecules in a sample to determine its composition and to draw conclusions about structures and functions. The workflow shown in Figure 3.6 (a) was developed within the proteomics domain to identify proteins. This 'bottom-up' identification of proteins via the enzymatic digestion of the proteins into peptides is a typical task in mass spectrometry (MS)-based proteomics [Aebersold2003].

After the digestion of a protein, the peptides are separated by liquid chromatography coupled to mass spectrometry (cf. Figure 3.7 MS1). In liquid chromatography, the peptides are separated by their physicochemical properties, such as size, charge and hydrophobicity. In the mass spectrometer, the peptides are separated with very high resolving power based on their mass-to-charge ratios. Individual peptides are then selected and fragmented using

[Figure 3.6 (a), local X!Tandem workflow; diagram labels: Tandem_Param_File, Tandem, FASTA_File, mzXML_File, pepXML, Read_Text_File, Tandem2XML, PeptideProphet]

[Figure 3.6 (b), Grid X!Tandem workflow; diagram labels: FASTA_File, PeptideProphet, tandem, mzXML_File, mzxmlDecomposer, nrOfDaughters, pepXML, pepXMLComposer]

Figure 3.6: The X!Tandem workflow (a) for local execution on a client machine. Workflow (b) executes the compute-intensive X!Tandem application on the Grid. The Tandem_Param_File is created by the mzxmlDecomposer.

collisions with neutral gas or ion-electron reactions (cf. Figure 3.7 MS2). This complete procedure is called 'tandem mass spectrometry' or 'MS/MS' [Aebersold2003]. The output is a set of tens of thousands of peptide masses and associated fragmentation spectra (mass-to-charge ratio). These measured spectra can then be compared to spectra previously calculated from the genome by one or more algorithms (here X!Tandem [Craig2004]) to identify proteins. For the identification, X!Tandem compares the measured spectra to calculated spectra for all peptides of similar mass that the enzyme used could possibly generate for the biological species. After this step, PeptideProphet [Keller2002] estimates the probability of each peptide-spectrum match assigned by X!Tandem using a mixture model of the X!Tandem score distribution, assuming that some spectra are identified correctly and some incorrectly. This result can be used to assess the quality of the identification process. Both X!Tandem and PeptideProphet are available within the Trans-Proteomic Pipeline (TPP) [Keller2005].
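The mixture-model idea can be made concrete with a small sketch. It assumes, purely for illustration, that the scores of correct and incorrect matches follow two known normal distributions and that the fraction of correct matches is given; the actual PeptideProphet distributions, parameter fitting and additional evidence are not reproduced here. The function names `normal_pdf` and `posterior_correct` are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution (illustrative component shape only)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def posterior_correct(score, prior_correct, correct_pdf, incorrect_pdf):
    """Bayes' rule for a two-component mixture.

    Returns the probability that a match with the given score belongs to the
    'correct' component rather than the 'incorrect' one.
    """
    weighted_correct = prior_correct * correct_pdf(score)
    weighted_incorrect = (1.0 - prior_correct) * incorrect_pdf(score)
    return weighted_correct / (weighted_correct + weighted_incorrect)


# Assumed, made-up components: incorrect matches score around 0, correct ones around 4.
def correct(score):
    return normal_pdf(score, mean=4.0, std=1.0)


def incorrect(score):
    return normal_pdf(score, mean=0.0, std=1.0)


probability_mid = posterior_correct(2.0, 0.5, correct, incorrect)
probability_high = posterior_correct(4.0, 0.5, correct, incorrect)
```

With these assumed components, a score halfway between the two means is equally likely to stem from either component, while a score at the mean of the 'correct' component receives a probability close to one.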

The workflow shown in Figure 3.6 (a) executes X!Tandem and PeptideProphet on a local client machine. X!Tandem takes as input the database (FASTA_File), which stores the calculated spectra for all peptides. Another input file contains all the measured spectra (mzXML_File). The workflow output port returns a summary file that contains, among other things, the number of correctly identified peptides given by PeptideProphet,

Figure 3.7: The concept of tandem mass spectrometry. First a protein is digested and split into peptides (upper left side). Afterwards, the peptides are separated by their physicochemical properties (MS1). These peptides are again split into fragments (MS2). The fragment spectra (lower right side) measure the mass-to-charge ratio.

which statistically evaluates the identified spectra. Figure 3.6 (b) shows the same scientific experiment, ported to a distributed computing environment. For this purpose, the mzXMLDecomposer and pepXMLComposer developed by Mohammed et al. [Mohammed2012] were used. The mzXMLDecomposer splits the original input mzXML file into several small input files. X!Tandem can then be executed in parallel, where each invocation compares the measured spectra of one piece of the input file. The pepXMLComposer joins the results after the X!Tandem executions. The spectra and the database used, which stores the known peptides, can each vary in size between 20 MB and 200 MB. The local execution of the workflow was performed on an Intel i5-2520M processor (2.50 GHz, 4 CPUs, 8 GB RAM). The compute cluster on which the parallel execution was performed has 206 compute nodes, each of which consists of 2 Intel Xeon 6-core processors with 2.66 GHz and 96 GB main memory. For the execution, 4 CPUs per job were requested.
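The decompose, execute in parallel, compose pattern described above can be sketched as follows. This is a simplified stand-in: the functions `decompose`, `search` and `compose` are hypothetical, operate on plain Python lists instead of mzXML and pepXML files, and use local threads instead of Grid jobs. The `search` function merely imitates one X!Tandem invocation by looking up each spectrum in a set.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(spectra, n_parts):
    """Split the spectra into at most n_parts contiguous chunks."""
    if not spectra:
        return []
    size = max(1, -(-len(spectra) // n_parts))  # ceiling division
    return [spectra[i:i + size] for i in range(0, len(spectra), size)]


def search(chunk, database):
    """Stand-in for one search invocation: keep spectra present in the database."""
    return [spectrum for spectrum in chunk if spectrum in database]


def compose(partial_results):
    """Join the partial results in their original order."""
    return [hit for part in partial_results for hit in part]


def run_pipeline(spectra, database, n_parts):
    chunks = decompose(spectra, n_parts)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # map() preserves chunk order, so compose() sees results in input order.
        partial_results = list(pool.map(lambda chunk: search(chunk, database), chunks))
    return compose(partial_results)


# Made-up data: integers stand in for spectra, a set stands in for the database.
hits = run_pipeline(list(range(10)), {2, 3, 7, 11}, n_parts=4)
```

Because every chunk is searched against the same database, the database has to be available to each invocation, which is why the classic submission uploads it once per job.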

                                      Common parallel execution   Parallel sweep execution
Total file upload in MB               1312.0                      151.2
Total Web service calls               4032                        126
Average CPU load on the client in %   43.7                        14.1
Total network packets                 458,353                     15,244
Totally transferred bytes             1,370,513,098               190,571,550

Table 3.3: Statistics of the submission via the conventional submission mechanism and the developed sweep generator.

In the presented example, the sequence database contains 37.4 MB of sequences and the input data set contains 113.8 MB of spectra. To speed up the execution, the input file was split into 32 sub-files (file sizes between 3.0 MB and 4.2 MB). The classic non-sweep solution submits one job to the computing infrastructure for each of those sub-files (32 in total). Each of these jobs consists of the sequence database and one sub-file (41 MB upload per job). Each of these 32 jobs is monitored by a Web service call that is sent every 2 seconds and receives the job status. The developed sweep extension submits only one job and uploads the sequence database together with all sub-files (151.2 MB) only once. The single job is monitored by sending a Web service call every 2 seconds. Table 3.3 shows the analysis in detail. The CPU load of Taverna was monitored during the executions with the top command. The total packets and transferred bytes were captured with the tool Wireshark [Orebaugh2006], applying a filter for the target registry and the corresponding port. The local execution (cf. Figure 3.6 (a)) took 4.02 minutes in Taverna and the parallel execution (cf. Figure 3.6 (b)) took 1.51 minutes in Taverna (the same as for the sweep mechanism).
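The figures in Table 3.3 can be cross-checked against the numbers in the text with a few lines of arithmetic. The sketch below uses only values stated in this section; the variable names are hypothetical, and the 41 MB per job is the rounded figure given in the text.

```python
# Values taken from the text and from Table 3.3.
database_mb = 37.4
spectra_mb = 113.8
n_jobs = 32
upload_per_job_mb = 41.0          # rounded figure from the text
classic_calls = 4032
sweep_calls = 126
classic_packets, sweep_packets = 458_353, 15_244
classic_bytes, sweep_bytes = 1_370_513_098, 190_571_550
local_minutes, parallel_minutes = 4.02, 1.51

# Classic submission: the database travels with every one of the 32 jobs.
classic_upload_mb = n_jobs * upload_per_job_mb

# Sweep submission: database and all sub-files are uploaded exactly once.
sweep_upload_mb = round(database_mb + spectra_mb, 1)

# Status calls issued per classic job, for comparison with the single sweep job.
calls_per_classic_job = classic_calls // n_jobs

# Reduction factors of the sweep mechanism and speedup of parallel execution.
upload_factor = classic_upload_mb / sweep_upload_mb
packet_factor = classic_packets / sweep_packets
byte_factor = classic_bytes / sweep_bytes
speedup = local_minutes / parallel_minutes
```

The computed uploads reproduce the 1312.0 MB and 151.2 MB in the table, and the 4032 calls of the classic submission correspond to 126 calls for each of the 32 jobs, the same number the single sweep job needs. The sweep mechanism thus reduces the upload volume by a factor of about 8.7 and the number of network packets by a factor of about 30, while the parallel execution is roughly 2.7 times faster than the local one.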

Another survey, presented in [Holl2011], used the LOCUSTRA bioinformatics application [Zimmermann2008]. This workflow predicts the secondary structure of a protein from its amino acid sequence by classification with Support Vector Machines (SVMs).
