Interactive Hadoop-based 13C-MFA Workflow

Figure 6.8.: Residual values collected from a fitfluxes run. The total calculation time is 18,000 ms; however, after about 3,000 ms the residual no longer improves significantly. Thus, by monitoring online provenance data, the simulation could be interrupted early, saving a significant share of the overall simulation time.

collection solution enables interaction with 13C-MFA workflows that were out of reach before.

2. The provenance tools integrate well into the SWF and the employed toolchain. The provenance collection framework supports filtering messages by various metadata and regular expressions for the message content.
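The filtering described here can be illustrated with a minimal sketch. The `ProvenanceMessage` record, its field names, and the `filter_messages` helper are hypothetical and are not taken from the framework itself; the sample messages are invented for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ProvenanceMessage:
    """Hypothetical provenance record; the field names are assumptions."""
    text: str
    metadata: dict = field(default_factory=dict)


def filter_messages(messages, pattern=None, **required_metadata):
    """Keep messages whose metadata contain all given key/value pairs
    and whose text matches the optional regular expression."""
    regex = re.compile(pattern) if pattern else None
    selected = []
    for msg in messages:
        # Reject the message if any requested metadata entry differs.
        if any(msg.metadata.get(k) != v for k, v in required_metadata.items()):
            continue
        # Reject the message if the content does not match the pattern.
        if regex and not regex.search(msg.text):
            continue
        selected.append(msg)
    return selected


# Invented sample messages, for illustration only.
messages = [
    ProvenanceMessage("residual=2462.2 at t=3000ms", {"run": "a1"}),
    ProvenanceMessage("starting optimizer", {"run": "a1"}),
    ProvenanceMessage("residual=2460.8 at t=2900ms", {"run": "b2"}),
]
hits = filter_messages(messages, pattern=r"residual=\d+(\.\d+)?", run="a1")
```

Combining a metadata match with a content pattern keeps the filter cheap: the dictionary comparison discards most messages before the regular expression is evaluated.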

The provenance framework is designed to support the requirements of 13C-MFA workflow applications. Because the presented provenance solution is by design non-intrusive and provides clean interfaces, future work includes its application in other scientific domains.

6.8. Interactive Hadoop-based 13C-MFA Workflow

The exploration workflow described in § 6.4 tells the modeler whether the model is – in principle – able to explain the observations. Moreover, as long as the number of flux samples (n) is sufficiently large, it may also yield an overview of regions in the flux space that are competitive in terms of the residual value. The next step in the overall 13C-MFA procedure is the closer inspection and characterization of these regions, which is usually accompanied by modifying the model, e.g., by integrating additional

Chapter 6. Use Cases

Figure 6.9.: Overview of the interactive 13C-MFA workflow. Gray boxes represent automated tasks, while reporting tasks are colored orange. Arrows indicate the flow of control. Step 5 (turquoise) designates an interaction performed by the scientist. The exploration workflow described in § 6.4 is embedded as a sub-workflow (upper branch). The Monte Carlo bootstrap sub-workflow (steps 6-7) is employed as described before (Dalman et al., 2013). The overall workflow includes several iterative steps and depends on the scientist’s decisions upon inspecting intermediate results.

biochemical knowledge or updating measurements. In contrast to the sampling-based explorative workflow described before, here a targeted optimization-driven strategy is utilized which eventually supplies information on flux (non-)identifiabilities (Raue et al., 2011). The latter step is vitally important for any 13C-MFA because it signals whether measured data actually contain the information that is needed to reliably estimate the unknown fluxes.

Due to its conceptual simplicity, the Monte Carlo bootstrap method is utilized as the nonlinear statistical method of choice in combination with a multi-start heuristic (Efron and Tibshirani, 1993; M. Joshi, Seidel-Morgenstern, and Kremling, 2006). If flux parameters are found to be non-identifiable, either new measurement information has to be added or, because such additional observations are rarely available in practice, the non-identifiable fluxes have to be eliminated from the model (Raue et al., 2011). As fluxes may be correlated, the elimination process must be done in an iterative manner. The proposed interactive 13C-MFA workflow consists of the following steps, as shown in fig. 6.9:

• The exploration workflow (steps 1-4) is performed to gain a basic "familiarity" with the model as described before (cf. § 6.4).

Figure 6.10.: Visualizations generated in the course of an interactive refinement workflow. Left: residual distribution of 1,000 flux fits with one of the interim C. glutamicum model variants. Two solution clusters are visible that differ only slightly in their residual values (2,460.8 and 2,462.2). Right: scatter plot of glucose-6-phosphate dehydrogenase (gnd) and pyruvate kinase (pyk) fluxes. The color of the dots encodes the residual value. The visualization reveals that the flux solutions that underlie the two clusters have significantly different pyk fluxes. Moreover, while the flux value of pyk is dispersed for higher residual values, it is much more tightly concentrated in the low-residual solution cluster. The plot provides an indication of the multimodality of the nonlinear least-squares problem.

• With the visualizations of the exploration results at hand, e.g., the distribution of residual values and various scatter plots (cf. fig. 6.10), the researcher updates and refines the model according to the results if necessary (step 5). To assess the impact of the changes made, the scientist may return to step 1 before continuing with step 6.

• The Monte Carlo bootstrap algorithm is applied with the best flux samples (steps 6 and 7).

• The results of the bootstrap are analyzed and visualized (step 8).

• After updating the model (step 5), the researcher may choose to restart from step 6 (or step 1) until the non-identifiable fluxes are removed from the model.
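The control flow of the steps above can be sketched as a simple loop. This is only an illustration under stated assumptions: the function `run_workflow`, its callable parameters, and the toy stand-ins below are hypothetical, the sketch always restarts from step 1 (the text also allows restarting from step 6), and in the actual workflow step 5 is a manual decision by the scientist rather than a function call.

```python
def run_workflow(model, explore, refine, bootstrap, find_non_identifiable,
                 max_iterations=10):
    """Repeat exploration, refinement, and bootstrap until no
    non-identifiable fluxes remain. All callables are placeholders."""
    for iteration in range(1, max_iterations + 1):
        overview = explore(model)                        # steps 1-4
        model = refine(model, overview, [])              # step 5 (user-driven)
        estimates = bootstrap(model)                     # steps 6-7
        problematic = find_non_identifiable(estimates)   # step 8
        if not problematic:
            return model, iteration
        model = refine(model, overview, problematic)     # back to step 5
    raise RuntimeError("non-identifiable fluxes remain after max_iterations")


# Toy stand-ins: a "model" is just the set of free flux names, and the
# fluxes "v2" and "v4" are declared non-identifiable by construction.
TOY_NON_IDENTIFIABLE = {"v2", "v4"}


def toy_explore(model):
    return None  # exploration results are not needed in the toy example


def toy_refine(model, overview, problematic):
    # Eliminate at most one flux per pass, mirroring the one-by-one removal.
    return model - {problematic[0]} if problematic else model


def toy_bootstrap(model):
    return model  # the toy "estimates" are simply the free flux names


def toy_find_non_identifiable(estimates):
    return sorted(estimates & TOY_NON_IDENTIFIABLE)


final_model, iterations = run_workflow(
    frozenset({"v1", "v2", "v3", "v4"}),
    toy_explore, toy_refine, toy_bootstrap, toy_find_non_identifiable,
)
```

Passing the individual steps as callables mirrors the idea that each sub-workflow is an exchangeable component, while the loop itself only encodes the order of the steps and the stopping condition.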

This use case highlights the iterative and interactive nature of typical 13C-MFA modeling workflows. Notably, the single sub-workflow steps have quite heterogeneous runtime profiles: compute-intensive and long-running bootstrap executions alternate with inexpensive analysis tasks. Amazon’s EMR cloud service is an elegant and straightforward way to solve compute-intensive and embarrassingly parallel tasks like the Monte Carlo

bootstrap. Nevertheless, it does not make sense in all cases to await the final result of long-running processes. For instance, flux estimation processes often show no significant improvement in the residual value after an initial phase (cf. fig. 6.8). In such cases, the run can be stopped prematurely. With the provenance logging module, it is possible to safely interrupt processes, saving time and money.
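One possible stopping rule for such a monitored run is sketched below. It is an assumption for illustration and not the criterion used in the thesis: the functions `should_stop` and `first_stop_index`, the window size, the tolerance, and the synthetic residual stream are all hypothetical.

```python
def should_stop(residuals, window=5, rel_tol=1e-3):
    """Return True if the best residual improved by no more than rel_tol
    (relative) during the last `window` observations."""
    if len(residuals) <= window:
        return False  # not enough history to judge a plateau
    best_before = min(residuals[:-window])
    best_now = min(residuals)
    return (best_before - best_now) <= rel_tol * abs(best_before)


def first_stop_index(stream, window=5, rel_tol=1e-3):
    """Feed residuals one at a time, as an online monitor would, and
    return the index at which the run could be interrupted."""
    seen = []
    for index, residual in enumerate(stream):
        seen.append(residual)
        if should_stop(seen, window, rel_tol):
            return index
    return None


# Synthetic residual stream: a fast initial decrease followed by a plateau.
stream = [5000.0, 3500.0, 2800.0, 2500.0, 2465.0, 2462.5,
          2462.3, 2462.2, 2462.2, 2462.2, 2462.2, 2462.2]
stop_at = first_stop_index(stream)
```

Comparing the best value before the window with the best value overall, rather than consecutive values, makes the rule insensitive to occasional non-improving iterations of the optimizer.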

Methods

As indicated in fig. 6.9, the computational parts of the workflow are reused from other use cases, i.e., the exploration workflow and the Monte Carlo bootstrap (MCB) workflow. These sub-workflows are accessed via their web service interfaces. The model refinement and update task (step 5) is purely user-driven. Similarly, the researcher decides at the end of step 8 whether the outcome is sufficient or further iterations are required. In addition to providing visualizations, the matplotlib scripts (plot_histogram, plot_scatter) also compute summary statistics of the results, e.g., minimum, maximum, mean, standard deviation, or median. Because the output file format (CSV) of steps 1-3 is identical to that of the MCB output (steps 6-7), the visualization scripts (steps 4 and 8) are effectively the same.
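The summary statistics mentioned above could be computed along the following lines. This is a sketch, not the code of plot_histogram or plot_scatter: the function `summarize_column`, the assumed CSV layout (one header row, one column per quantity), and the example values are illustrative assumptions.

```python
import csv
import io
import statistics


def summarize_column(csv_text, column):
    """Compute summary statistics for one numeric column of a CSV table.
    The layout (header row, one column per quantity) is an assumption."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in reader]
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "median": statistics.median(values),
    }


# Made-up table for illustration; real output files will differ.
example = "residual,pyk\n2460.8,1.0\n2462.2,3.0\n2460.8,2.0\n"
stats = summarize_column(example, "pyk")
```

Because the same function works on any file with this layout, it reflects why a shared CSV format allows one set of scripts to serve both the exploration and the MCB results.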

Results

Reusability is regarded as one of the most important drivers for service-oriented solutions (Josuttis, 2007). As already stated, 13C-MFA workflows rely on many recurring tasks. These tasks may be seen as standard components, either at the atomic level or as already assembled units. In this particular use case, four iterations of the iterative 13C-MFA workflow were required. In each iteration of the workflow, non-identifiable fluxes are identified and eliminated one by one from the set of unknowns. Additionally, information on (potentially locally) optimal solutions of the least-squares regression is continuously gathered.
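How a candidate for elimination might be picked from bootstrap results is sketched below. The criterion is an assumption made for illustration (the width of the 95 % percentile interval relative to the median), not the procedure used in the thesis; the helper names, the threshold, and the synthetic samples are hypothetical.

```python
import random
import statistics


def relative_spread(values):
    """Width of the empirical 95% percentile interval divided by |median|."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[int(0.025 * (n - 1))]
    upper = ordered[int(0.975 * (n - 1))]
    scale = abs(statistics.median(ordered)) or 1.0  # avoid division by zero
    return (upper - lower) / scale


def next_flux_to_eliminate(samples, max_rel_spread=0.5):
    """Return the flux with the widest relative interval if it exceeds the
    threshold, otherwise None. The threshold is an arbitrary assumption."""
    spreads = {name: relative_spread(values) for name, values in samples.items()}
    name, spread = max(spreads.items(), key=lambda item: item[1])
    return name if spread > max_rel_spread else None


# Synthetic bootstrap estimates: one tightly determined flux and one flux
# that is spread over a wide range. The numbers are not real results.
rng = random.Random(0)
samples = {
    "gnd": [rng.gauss(1.0, 0.02) for _ in range(500)],
    "pyk": [rng.uniform(0.0, 5.0) for _ in range(500)],
}
candidate = next_flux_to_eliminate(samples)
```

Returning only one flux per call matches the one-by-one elimination: because fluxes may be correlated, removing a single flux can change the spread of the remaining ones, so the bootstrap is repeated before the next decision.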

Discussion

With cleanly implemented web service interfaces, it becomes easy to assemble these tasks in the desired order, where necessary bridging the gaps between them with scripts. This use case showed how previously developed workflows (namely, the MCB simulation workflow and the exploration workflow) are reassembled into a full-fledged interactive 13C-MFA application. The described workflow can be seen as a chassis workflow that is extensible by additional sub-workflows. For example, the identification of fluxes can be further automated by applying the X-means clustering tool (Pelleg and Moore, 2000) to the final outcome of the workflow (cf. fig. 6.11).

Likewise, by exchanging the simulation tool and visualization applications, the overall workflow can be flexibly modified to fit a different context. Hence, the scientist is supported in managing the sequence of "daily life" steps.

Figure 6.11.: Identified clusters (blue boxes) of the C. glutamicum model. The cluster image is generated using the X-means tool version 1.15 (Pelleg and Moore, 2000), while the frame and axes are drawn with matplotlib (cf. fig. 6.10; right). X-means is called with kmeans kdtree in pyk_gnd_v4.csv -D_DRAWPOINTS -D_INTERACTIVE, where the CSV file

Chapter 7. Conclusions and Discussion

While the basic 13C-MFA procedure is seemingly straightforward, the specific steps undertaken by a researcher are often driven by modeling decisions that are dictated by the specific biological question under study, the observed data at hand, and computational considerations (Niedenführ, Wiechert, and Nöh, 2015). Thus, building high-quality 13C-MFA workflows requires modeling expertise and familiarity with a broad range of specialized software tools. The examples presented illustrate the diversity of 13C-MFA applications, which range from the modeling of complex microorganisms to sophisticated statistical analyses of large-scale simulation data.

Following the conventional definition of a software framework (R. E. Johnson and Foote, 1988), the outcome of this thesis is an abstract design for solutions to a family of related problems (rather than a mere collection of libraries and tools), which aims at providing solutions to master the 13C-MFA procedure. Exposing special-purpose 13C-MFA tools and data sources as services with unified interfaces allows for flexibly composing computational pipelines and workflow applications. Hence, the approach proposed by this thesis – namely, designing the SWF as a collection of loosely-coupled modules that are glued together with web services – supports scientists in the realization of 13C-MFA workflows. Specifically, the five challenges identified in § 1.3 are addressed in this thesis as discussed in the following.

7.1. Discussion

C1: Heterogeneous and flexible data and tool organization

A usable SWF for model-based evaluations needs to strike a balance between being a mere collection of services and providing fully integrated functionality, such as graphical modeling and easy-to-use HPC deployment. In the chosen design of the SWF, both the use of web services, which allows the flexible integration of otherwise diverse applications, and the introduction of a VCS alongside traditional databases help to organize different knowledge sources, experimental data, and models.

However, this flexibility comes at a price: full automation of 13C-MFA workflows that utilize third-party components is seldom possible out of the box, if at all. Instead, the researcher needs to create a workflow that employs the software and determine which intermediate information requires an expert decision. In addition, input and output formats of some of the employed tools need to be adapted

In document Scientific Workflows for Metabolic Flux Analysis (Page 117-124)