Ever since scientists began to study microbial communities, technological advances in both sequencing methods and the tools available for data interpretation have contributed substantially to our understanding of these interacting systems. In many cases, these developments were the key prerequisites that made these valuable resources accessible for the first time and yielded important novel insights. Some natural environments have been shown to be inhabited by highly complex microbial communities for which even current metagenome sizes do not provide sufficient coverage, e.g. soil:
“Based on our analysis, we propose that the sequencing depth required to provide comprehensive coverage of soil metagenomes should be increased by an order of magnitude, to ∼100 Gbp. This is a function of the extreme taxonomic heterogeneity of soil microbial communities . . . ” (Van der Walt et al., 2017).
In particular, the advent of next-generation sequencing resulted in a dramatic decrease in sequencing costs, without which a large number of studies would not have been possible. Since then, unprecedented amounts of sequence data have been generated, and while data interpretation yielded a large number of novel findings, new types of errors and biases were also discovered that needed to be addressed.
With third-generation sequence data, the next round of adaptations and novel opportunities has arrived: the ability to sequence long DNA stretches favors metagenome assembly, while the lower basecall accuracy obstructs established read-based approaches. As a consequence, new approaches are needed to successfully analyze these metagenome datasets.
Bioinformatics tools rapidly conquered the new and evolving field of metagenomics. While tools written for genome analysis were employed initially, a researcher can nowadays choose from a wide range of software packages developed specifically for the processing and analysis of microbial community data. Within a short time frame, a large number of methodological advances have been made; but as the corresponding implementations are almost exclusively provided as command-line tools for the Linux operating system, they are only accessible to users with the corresponding technical knowledge. In addition, the computational effort required for metagenome analysis remains large despite these developments, and without access to appropriate compute infrastructure, timely data analysis is not attainable.
Platforms like IMG/M and MG-RAST avoid this burden by providing easily accessible web interfaces in combination with sizeable compute resources. However, these applications gain ease of use at the cost of customizability: only a single analysis pipeline is offered, which performs a rather generic and often insufficient analysis; parameters cannot be adapted; and visualization and charting capabilities are typically restricted to certain result types. Also, these pipelines are rarely updated and thus often rely on outdated tools (e.g. FragGeneScan instead of the far more recent FragGeneScan+; sequence alignment and best-hit annotation instead of faster and more flexible tools like Kraken) or software components with known deficiencies (Eren et al., 2013). Finally, they do not allow users to include their own data sources (e.g. only a limited predefined set of reference genomes is offered for host contamination removal), and they lack the ability to define custom analysis pipelines for less common use cases, such as the processing of eukaryotic data.
These static, “one-size-fits-all” pipelines have clearly been designed to cover the most common use cases, and while the overall throughput and quality of results are quite impressive, they do not suffice for specific analysis needs where specialized sequence databases are required, and they perform suboptimally when exposed to less common sequence data types (Brown et al., 2017).
MGX: An advanced framework for microbial community analysis
The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
– Tom Cargill, Bell Laboratories
This chapter describes the design and implementation of the MGX framework for metagenome analysis and its accompanying components. Contents are partially based on the original publication describing the MGX framework (Jaenicke et al., 2018) without further explicit attribution.
3.1 Objectives
Based on the interim conclusions given in Section 2.5, it is apparent that metagenome data analysis is a fast-moving field that requires frequent adaptation to keep up with novel developments in sequencing technologies as well as the latest methodological advances in data analysis. The MGX framework has been designed to address the shortcomings of existing applications for metagenome analysis and to provide an environment that can easily be adjusted to future developments.
On several occasions (Su et al., 2011; Zakrzewski et al., 2013), newly developed platforms soon became obsolete once it became clear that they could neither cope with current sequence data volumes nor scale algorithmically, which ruled out adaptation to changed external conditions. Hence, modular approaches are preferable to fixed systems, as they allow components to be improved or exchanged when better tools become available or sequencing strategies change.
The primary target of the MGX software is the storage and analysis of unassembled metagenome sequence data; however, possible future extensions have already been taken into account, and support for e.g. metataxonomics or metagenome assembly is attainable without major structural changes to the underlying data model.
Consequently, the main objectives for the development of the MGX framework have been defined as follows:
Data and metadata storage. For each dataset, the sequence data ought to be stored in conjunction with corresponding metadata detailing origin and treatments applied for data generation (also see Section 3.3), thus facilitating the repetition of an experiment.
Fast adoption of novel tool developments. Newly developed tools should be provided as readily available analysis pipelines as soon as possible in order to allow users to benefit from improved methods, or to use the tool most suitable for their type of data.
Single sequence resolution. All analysis results should be retained in a manner that allows subsets of the available sequencing data to be identified and extracted based on arbitrary criteria. Therefore, results must be traceable to the individual input sequences.
Abstraction to allow use without bioinformatics expertise. Most bioinformatics tools are command-line applications and therefore typically inaccessible to users without at least basic Linux expertise; UI (user interface) components should provide a sufficient degree of abstraction to enable users to configure and execute tools without Linux proficiency.
Provisioning of compute resources. Considerable resources are still necessary to run recent metagenome analysis tools; due to the required hardware investments and associated maintenance costs, compute infrastructure used by MGX should be provided centrally and without imposing any costs on (academic) users.
Multiple specialized workflows for different tasks. Instead of one central pipeline for taxonomic as well as functional metagenome analysis, specialized and modular workflows should be provided, thus permitting users to restrict their selection to those aspects they are interested in; also, this approach eases future adaptations and improvements.
Parameter customization. All workflows should allow certain parameters (such as cutoff values or scores) to be adapted, where applicable. Nonetheless, each parameter should still provide a predefined value, which serves as a sensible default and can be used as a starting point for possible future refinements.
Dynamic visualization options. Instead of fixed visualizations for specific results, a dedicated component should automatically determine whether a user-selected result type and a certain mode of presentation are compatible and offer only appropriate combinations.
Comparative analysis and statistics. MGX should provide different analysis types that allow users to interactively compare several datasets and to perform statistical evaluation using state-of-the-art methods such as compositional data analysis.
Ability to implement and execute own workflows. For advanced users or scientists with highly specific analysis needs, it should be possible to draft and implement their own analysis workflows, which are subsequently scheduled and executed on MGX-provided infrastructure.
Possibility to include own databases and reference genomes. Users should be able to upload and include their own data sources for metagenome analysis, e.g. specialized sequence databases or unpublished reference genomes. Thus, it is desirable to allow the inclusion of arbitrary data and to provide predefined workflow templates for the most frequently used data formats, such as FASTA-based sequence collections or HMM models.
Modular design to allow adaptation to novel developments. A modular software design greatly reduces the effort needed to exchange individual components as long as the interface remains constant.
API provisioning. For programmatic access and automated data analysis, an API (Application Programming Interface) should be provided together with an appropriate library for command-line usage.
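Two of the objectives listed above, single sequence resolution and parameter customization with sensible defaults, can be illustrated with a minimal Python sketch. It is not taken from the MGX code base: the names Observation, DEFAULT_MIN_SCORE, select_reads and abundance, the default cutoff of 0.8, and the example data are all hypothetical. The sketch only assumes that each analysis result is stored together with the identifier of the read it was derived from.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One analysis result tied to a single input sequence (hypothetical model)."""
    read_id: str    # identifier of the individual input sequence
    attribute: str  # assigned value, e.g. a taxon or a functional category
    score: float    # confidence reported by the (hypothetical) tool


# Assumed default cutoff; callers may override it per call.
DEFAULT_MIN_SCORE = 0.8


def select_reads(observations, attribute, min_score=DEFAULT_MIN_SCORE):
    """Return the sorted IDs of reads assigned to `attribute` at or above `min_score`."""
    return sorted({o.read_id for o in observations
                   if o.attribute == attribute and o.score >= min_score})


def abundance(observations, min_score=DEFAULT_MIN_SCORE):
    """Count observations per attribute, derived from the per-read records."""
    return Counter(o.attribute for o in observations if o.score >= min_score)


# Illustrative data only.
observations = [
    Observation("read1", "Bacteroides", 0.95),
    Observation("read2", "Bacteroides", 0.60),
    Observation("read3", "Escherichia", 0.90),
]

print(select_reads(observations, "Bacteroides"))                 # default cutoff
print(select_reads(observations, "Bacteroides", min_score=0.5))  # relaxed cutoff
print(abundance(observations))
```

Because the aggregated counts are derived from per-read records rather than stored as totals, a subset of reads can be extracted, or the counts recomputed under a different cutoff, without rerunning the underlying tool.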