Data Tier - Basic Concepts - Scientific Workflows for Metabolic Flux Analysis


4.1. Data Tier

13C-MFA workflows involve both raw and pre-processed inputs and outputs, i.e., experimental data, biochemical information, metabolites, reaction pathways, atom mappings, models, and simulation data (Wiechert, 2001; Yang, 2013; Zamboni, Fendt, et al., 2009). Both the employed tools and the data have their own formats. Hence, the SWF needs to support, at least to some extent, facilities to load and convert these formats. Besides the data, biological and technical meta-information is vitally important (How was the experiment performed? What simulation parameters were used in the study?). To cope with this heterogeneous information, different data storage technologies are used: relational databases for bulk data, a version control system for documents, and a cloud-based storage solution.

4.1.1. Storage of Scientific Raw Data in Relational Databases

Because scientific raw data (e.g., measurements) is typically recorded once and accessed many times thereafter, relational SQL databases are employed to store this kind of information persistently.

• Design Decision 3: Experimental data is managed in the open source PostgreSQL software, which offers SQL:2008 compliance, database clustering, and transactions (Obe and Hsu, 2014). In addition, PostgreSQL provides a JDBC driver which allows the integration with a Java Enterprise application server (Rubinger and Burke, 2010).

• Design Decision 4: Scientific raw data is distributed across three databases identified in the 13C-MFA context: a database for experimental data and measurements, a model database, and a unified database for reaction information.

4.1.1.1. Measurement Database

This database is part of JuMeDaS (Jülich Measurement Data Selector), a management system for analytical raw data from middle- and high-throughput experiments with 13C isotope tracers (developed by Miebach, 2012). Figure 4.2 summarizes the database schema which consists of eight tables grouped in three categories: analytical data, process data, and measurement meta information.

4.1.1.2. Model Database

The model database provides interfaces to organize and access model documents. The following tables are defined in the database (cf. fig. 4.3).

• The models table organizes the model metadata, i.e., the associated project, name and description of the model, owner, and insertion date.


Figure 4.2.: UML diagram of the JuMeDaS measurement database schema. The tables are grouped in three categories: measurement data (blue), process data (yellow), and metadata (cyan).

• Models are grouped into projects; access permissions are managed per project in the permissions table.

• Because typical 13C-MFA workflows frequently modify a model, the model_versions table tracks the various changesets of a model including a history log.

• The files table contains metadata of the actual model file, i.e., name, type, and location in the filesystem (which is backed by a VCS project repository; cf. § 4.1.2).

Access to the model database is realized in the ModelManager EJB¹. The management of experiments is realized in the ExperimentManager EJB².
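The relationship between the five tables can be sketched as follows. This is a minimal illustration only: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and apart from the table names, all column names, keys, and the link between files and model versions are assumptions that are not taken from the actual schema.

```python
import sqlite3

# Only the five table names come from the text; every column is hypothetical.
SCHEMA = """
CREATE TABLE projects (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE permissions (
    project_id INTEGER NOT NULL REFERENCES projects(id),
    username   TEXT NOT NULL,
    can_write  INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (project_id, username)
);
CREATE TABLE models (
    id          INTEGER PRIMARY KEY,
    project_id  INTEGER NOT NULL REFERENCES projects(id),
    name        TEXT NOT NULL,
    description TEXT,
    owner       TEXT NOT NULL,
    inserted_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE model_versions (
    id          INTEGER PRIMARY KEY,
    model_id    INTEGER NOT NULL REFERENCES models(id),
    version     INTEGER NOT NULL,
    log_message TEXT NOT NULL,
    UNIQUE (model_id, version)
);
CREATE TABLE files (
    id         INTEGER PRIMARY KEY,
    version_id INTEGER NOT NULL REFERENCES model_versions(id),
    name       TEXT NOT NULL,
    type       TEXT NOT NULL,
    location   TEXT NOT NULL  -- path inside the VCS-backed filesystem
);
"""


def create_model_db():
    """Create an in-memory database with the sketched schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_version(conn, model_id, log_message):
    """Append a new changeset to a model and return its version number."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM model_versions WHERE model_id = ?",
        (model_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO model_versions (model_id, version, log_message) VALUES (?, ?, ?)",
        (model_id, latest + 1, log_message),
    )
    return latest + 1


def history(conn, model_id):
    """Return the history log of a model as (version, log_message) tuples."""
    rows = conn.execute(
        "SELECT version, log_message FROM model_versions "
        "WHERE model_id = ? ORDER BY version",
        (model_id,),
    )
    return rows.fetchall()


conn = create_model_db()
conn.execute("INSERT INTO projects (id, name) VALUES (1, 'demo-project')")
conn.execute(
    "INSERT INTO models (id, project_id, name, owner) "
    "VALUES (1, 1, 'demo-model', 'alice')"
)
add_version(conn, 1, "initial network")
add_version(conn, 1, "added atom mappings")
print(history(conn, 1))
```

Keeping the version counter per model (rather than global) mirrors the idea that each model carries its own history log; whether the real schema does this is not stated in the text.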

4.1.1.3. Reaction Database

JuMetReD (Jülich Metabolic Reaction Database) unifies the access to various knowledge sources in the context of metabolic reactions and pathways (Fuhrmann, 2010). Because the tracking of atom transitions is required for 13C-MFA, this information is annotated to the reactions in addition. The database consists of 21 tables and is grouped in four parts: compounds, reactions, hierarchical entities, and literature. Originally developed as an independent Java Enterprise tool, JuMetReD is fully merged into the SWF³.

¹ The WSDL of the model manager web service provides details on the available operations: https://www.13cflux.net/ModelDB-ejb/ModelManager?wsdl; last accessed: May 22, 2017.

² The WSDL of the experiment manager web service provides details on the available operations: https://www.13cflux.net/MeasurementDB-ejb/ExperimentManagerImpl?wsdl; last accessed: May 22, 2017.

Chapter 4. Design and Architecture of the Scientific Workflow Framework

Figure 4.3.: UML diagram of the JuMetReD model database schema. Five tables are defined: models, projects, permissions, model_version, and files.

4.1.2. Version Control Systems for Storing Document Data

Unlike bulk data, document-based files (especially FluxML models, but also spreadsheet files containing evaluated raw data) are constantly reviewed and changed in the course of a typical 13C-MFA study. While it is technically possible to store documents in a relational database, performance degradation is to be expected in the long run, especially for larger documents (Sears, Ingen, and Gray, 2006). In addition, text-based documents in particular occupy their full file size in the RDBMS for every stored revision, even when only small changes were made.

• Design Decision 5: A Version Control System (VCS) is employed to organize these documents efficiently, e.g., by tracking each modification with a change history per file and per folder (Tichy, 1982). Modern VCS solutions store files incrementally, in some cases even binary data, which benefits both space consumption and performance (Hudson, 2002).
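The space argument for incremental storage can be illustrated with a small experiment. The sketch below compares the size of a full copy of a revised document with the size of a line-based delta against its predecessor. It uses Python's difflib as a stand-in; real VCS software uses its own delta encodings, and the document content (1000 hypothetical reaction lines) is made up for illustration.

```python
import difflib


def make_document(n_lines):
    """Build a synthetic text document (hypothetical content)."""
    return [f'<reaction id="r{i}"/>\n' for i in range(n_lines)]


def full_size(lines):
    """Bytes needed to store the complete revision."""
    return sum(len(line.encode("utf-8")) for line in lines)


def delta_size(old, new):
    """Bytes needed to store a unified diff from old to new (no context lines)."""
    diff = difflib.unified_diff(old, new, fromfile="old", tofile="new", n=0)
    return sum(len(line.encode("utf-8")) for line in diff)


rev1 = make_document(1000)
rev2 = list(rev1)
rev2[500] = '<reaction id="r500" bidirectional="true"/>\n'  # one small change

print("full copy of revision 2:", full_size(rev2), "bytes")
print("delta revision 1 -> 2:  ", delta_size(rev1, rev2), "bytes")
```

For a single changed line, the delta is a small fraction of the full file, whereas a naive per-revision copy in a database column grows by the full file size each time.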

4.1.2.1. Subversion and Other VCS Software

Table 4.1 compares five popular VCS solutions that are selected from existing open source and commercial software packages⁴ for inclusion into the SWF: Subversion, CVS, Perforce, Mercurial, and git. The most important decision criteria for the VCS software in the context of 13C-MFA are: the availability of GUI and console clients, support for writing custom clients in Java and Python, atomicity (i.e., check-ins of multiple documents are performed as a whole or not at all), and tracking of empty directories (which is useful for tracking project templates).

                    Subversion     CVS            Perforce       Mercurial      git
GUI client          yes            yes            yes            yes            yes
console client      yes            yes            yes            yes            yes
API support         yes            yes            no             yes            yes
Java client lib.    SVNKit         no             no             no             JGit
Python client lib.  PySVN          no             no             native         GitPython
atomic operations   yes            no             yes            yes            yes
directory tracking  yes            no             no             no             no
open source         yes            yes            no             yes            yes
VCS style           client-server  client-server  client-server  distributed    distributed
comm. support       CollabNet      none           yes            none           GitHub Inc.

Table 4.1.: Comparison table of popular VCSs. The decision criteria include the availability of GUI and console clients, Java and Python API support, atomic committing of a set of changed files, and the ability to track empty folders in the VCS.

³ The integration into the SWF is realized by Runkel, 2009 as part of his diploma thesis.

⁴ A comprehensive list of other VCS software is found at Wikipedia: https://en.wikipedia.org/wiki/List_of_version_control_software; last accessed: May 22, 2017.

• Design Decision 6: The Apache Subversion software is employed to store document-oriented data (Collins-Sussman, Fitzpatrick, and Pilato, 2008). Table 4.2 summarizes the available clients for Subversion: the Microsoft Windows Explorer plug-in TortoiseSVN, the svn command-line client, the Java programming library SVNKit, and the Python SVN client library PySVN.
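The selection in Design Decision 6 can be retraced by applying the stated decision criteria to the entries of Table 4.1. The following sketch encodes the relevant rows as a dictionary and keeps only those systems that satisfy every criterion. The dictionary keys are hypothetical names chosen for this example; the values are transcribed from the table, with "no" represented as None or False.

```python
# Rows of Table 4.1 that correspond to the decision criteria.
VCS_TABLE = {
    "Subversion": {"gui": True, "console": True, "java_lib": "SVNKit",
                   "python_lib": "PySVN", "atomic": True, "dir_tracking": True},
    "CVS":        {"gui": True, "console": True, "java_lib": None,
                   "python_lib": None, "atomic": False, "dir_tracking": False},
    "Perforce":   {"gui": True, "console": True, "java_lib": None,
                   "python_lib": None, "atomic": True, "dir_tracking": False},
    "Mercurial":  {"gui": True, "console": True, "java_lib": None,
                   "python_lib": "native", "atomic": True, "dir_tracking": False},
    "git":        {"gui": True, "console": True, "java_lib": "JGit",
                   "python_lib": "GitPython", "atomic": True, "dir_tracking": False},
}

# Criteria named in the text: GUI and console clients, Java and Python
# client libraries, atomic check-ins, and tracking of empty directories.
REQUIRED = ("gui", "console", "java_lib", "python_lib", "atomic", "dir_tracking")


def candidates(table, required):
    """Return the systems for which every required feature is available."""
    return [
        name
        for name, features in table.items()
        if all(features[key] for key in required)
    ]


print(candidates(VCS_TABLE, REQUIRED))  # -> ['Subversion']
```

Dropping "dir_tracking" from the required criteria would also admit git, which shows that the tracking of empty directories is the criterion that separates the two in this table.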

4.1.3. Cloud Storage

Complementing traditional local devices, cloud computing infrastructures provide storage renting services that have become popular in computational biology, most commonly the Amazon S3 cloud service (Palankar et al., 2008). Web services, one of the pillars of cloud computing, are used to directly integrate the Amazon S3 storage (https://aws.amazon.com/s3/). The S3 command-line utility s3cmd or a web service interface can be employed to access, modify, or delete data in the cloud. As Amazon S3 provides a web-based user interface, published data files can be instantly shared with other scientists via the Internet.
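As an illustration of the command-line route, the sketch below assembles s3cmd invocations for uploading, downloading, and deleting a file. It only builds and prints the commands by default, so it runs without an S3 account. The bucket name, object key, and file name are hypothetical, and the helper functions are not part of the SWF; the s3cmd subcommands put, get, and del and the --acl-public option are taken from the s3cmd tool itself.

```python
import subprocess


def s3_uri(bucket, key):
    """Compose the s3://BUCKET/KEY address expected by s3cmd."""
    return f"s3://{bucket}/{key}"


def upload_cmd(local_path, bucket, key, public=False):
    """Build an 's3cmd put' call; public=True makes the object world-readable."""
    cmd = ["s3cmd", "put"]
    if public:
        cmd.append("--acl-public")
    cmd += [local_path, s3_uri(bucket, key)]
    return cmd


def download_cmd(bucket, key, local_path):
    """Build an 's3cmd get' call that fetches an object to a local file."""
    return ["s3cmd", "get", s3_uri(bucket, key), local_path]


def delete_cmd(bucket, key):
    """Build an 's3cmd del' call that removes an object from the bucket."""
    return ["s3cmd", "del", s3_uri(bucket, key)]


def run(cmd, dry_run=True):
    """Print the command (dry run) or execute it with subprocess."""
    if dry_run:
        print(" ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


# Hypothetical bucket and file names, printed only.
run(upload_cmd("fluxes.csv", "example-bucket", "study-01/fluxes.csv", public=True))
run(download_cmd("example-bucket", "study-01/fluxes.csv", "fluxes_copy.csv"))
run(delete_cmd("example-bucket", "study-01/fluxes.csv"))
```

Calling run with dry_run=False executes the command and requires an installed and configured s3cmd.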


                            TortoiseSVN     svn             SVNKit          PySVN
add/rm/ci/co/diff/log       yes             yes             yes             yes
check-in per file           yes             yes             yes             yes
check-in per folder         yes             yes             no              yes
check-in of file selection  yes             no              no              yes
programmable                no              no              yes (Java)      yes (Python)
usability                   easy            intermediate    easy            difficult
allows partial checkouts    yes             yes             n/a             yes
requires check-out          yes             yes             no              yes
search revision history     yes             yes             yes             yes
search in repository        yes (Explorer)  yes (grep)      no              no

Table 4.2.: Comparison of four Subversion VCS clients: (a) TortoiseSVN, a graphical tool with advanced access features; (b) the standard console tool svn for performing expert tasks; (c) custom Java client applications using the SVNKit library; and (d) custom Python client scripts utilizing the PySVN Python library. The usability aspect is a subjective measure and reflects the user experience required to use the tool effectively.
