
1 Introduction

1.5 Nucleic-acid based techniques to study microbial ecology

1.5.4 High-throughput sequencing

1.5.4.1 DNA-based high-throughput sequencing

Sequencing the 16S rRNA gene is a widespread and cost-effective high-throughput sequencing (HTS) technique to evaluate microbial diversity at a high resolution. However, shotgun metagenomic sequencing of an environmental sample provides a much deeper snapshot of both the community composition and the metabolic potential of this community (Scholz et al. 2015, Segata et al. 2013a).

16S rRNA amplicon sequencing

Amplicon sequencing refers to the process in which a target gene is first amplified from template DNA by PCR and the sequences of the individual amplicons in the mixed product pool are then determined by HTS. By incorporating so-called barcodes into the primer sequence, PCR amplicons representing multiple samples can be pooled and sequenced in a single HTS run. Usually a region of a highly conserved gene, such as the 16S rRNA gene, is used in order to determine the taxonomic diversity of microbes in an environmental sample. The rapidly decreasing cost of next-generation sequencing in recent years has largely encouraged the use of 16S rRNA gene HTS for studying microbial diversity across multiple disciplines (Ju and Zhang 2015).

Analysis of microbial diversity using 16S rRNA high-throughput amplicon sequencing involves the following steps: (i) DNA extraction, (ii) choosing a sequencing platform, (iii) library preparation with barcoding, (iv) the sequencing process and (v) bioinformatics and statistical analysis of the HTS data.

(i) DNA extraction for 16S rRNA HTS

DNA extraction is the first step in acquiring 16S rRNA high-throughput sequencing data. Since most sequencing approaches require between nanograms and micrograms of DNA, efficient DNA extraction and purification are important for downstream sequencing (Di Bella et al. 2013). Cell lysis can be done chemically (SDS, phenol), mechanically (bead beating, sonication) or enzymatically (proteinase K). The quality and quantity of the isolated and purified DNA need to be confirmed before amplification and sequencing, in order to avoid wasting expensive reagents during library preparation and sequencing (Di Bella et al. 2013). This can be done with the Bioanalyzer from Agilent Technologies or with fluorometers such as the Qubit™ fluorometer (Invitrogen Corporation, Carlsbad, USA). After confirmation of quality and quantity, the DNA templates can be used for library preparation.

(ii) Choosing a sequencing platform

There are several different sequencing platforms on the market, each with its own strengths and weaknesses (Di Bella et al. 2013). Two commonly used HTS platforms are 454 Life Sciences (Roche) and Illumina. So far the majority of amplicon high-throughput sequencing studies have been done using the 454 Life Sciences technology, probably because it was the first available HTS platform (Knief 2014). However, usage has more recently been shifting in favour of the Illumina systems (Oulas et al. 2015). Advantages and disadvantages of these two sequencing platforms are summarized in Table 1.6.

Table 1.6 Advantages and disadvantages of the sequencing platforms 454 Life Sciences and Illumina (according to van Dijk et al. (2014a) and Knief (2014))

454 Life Sciences (Roche)

Advantages
+ run time (~23 h)
+ long reads, which are easier to map to reference genomes

Disadvantages
- Roche announced that it will shut down 454 and discontinue support for the platform by mid-2016
- relatively low throughput (~1 million reads)
- high reagent costs

Illumina

Advantages
+ currently leader in the NGS industry
+ most library preparation protocols are compatible with Illumina systems
+ highest throughput (22-25 million reads) and lowest per-base cost (Liu et al. 2012)
+ paired-end reads with very low error rate (MiSeq)

Disadvantages
- sample loading is difficult: library concentrations must be closely controlled to avoid overloading and overclustering
- requirement for sequence complexity: addition of PhiX might be necessary

The Illumina MiSeq platform was introduced in 2011. It produces 22-25 million paired-end reads with a maximum length of 300 bp at a low per-base cost (Knief 2014). The Illumina sequencing platform is therefore a good choice for 16S rRNA amplicon high-throughput sequencing as well as for metagenomic projects: the low cost per base allows sequencing to a high depth, which makes it possible to gain as much information as possible and to detect less-abundant microorganisms that may nevertheless play an important role in the ecosystem (Knief 2014).

(iii) Library preparation for 16S rRNA HTS

Library preparation can be done directly from the PCR-amplified 16S rRNA gene fragment (amplicon). Libraries are constructed by adding sequencing platform-specific DNA adapters to the DNA amplicons (Knief 2014). Various library preparation kits are commercially available, e.g. the Nextera XT DNA library prep kit by Illumina Inc. (San Diego, USA) or the GS FLX Titanium sequencing kit XL+ by 454 Life Sciences (USA). Even more library preparation protocols have been published, either to adapt to specific studies and research questions, to reduce costs and preparation time (Caporaso et al. 2012, Rohland and Reich 2012), to reduce possible bias by reducing PCR library preparation steps (Caporaso et al. 2012, van Dijk et al. 2014b) or to reduce the amount of required input DNA amplicon (Bowman et al. 2013).

At least one of the library adapters typically contains a barcode, a library-specific DNA sequence that is often 6 to 12 base pairs (bp) long (Knief 2014). The barcode allows various libraries to be pooled and then sequenced in a single sequencing run. For 16S rRNA amplicon HTS the barcode is often already added during the PCR amplification of the 16S rRNA gene to allow parallel sample processing early on (Knief 2014). The process of library preparation for subsequent HTS is shown in Figure 1.5.
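The way barcodes allow pooled reads to be traced back to their samples can be illustrated with a minimal Python sketch. It rests on simplifying assumptions that are not taken from the protocols cited here: the barcode is assumed to sit at the 5' end of each read, the two 8 bp barcodes and the sample names are invented, and one mismatch is tolerated. In practice this step is carried out by tools such as QIIME or mothur.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample map; barcodes and sample names are made up.
BARCODES = {
    "ACGTACGT": "sample_A",
    "TGCATGCA": "sample_B",
}


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read, barcodes=BARCODES, max_mismatches=1):
    """Return (sample, read without barcode), or (None, read) if no unique match."""
    length = len(next(iter(barcodes)))
    tag = read[:length]
    if len(tag) < length:
        return None, read
    hits = [s for b, s in barcodes.items() if hamming(tag, b) <= max_mismatches]
    if len(hits) != 1:
        return None, read
    return hits[0], read[length:]


def demultiplex(reads, barcodes=BARCODES, max_mismatches=1):
    """Group reads by sample; reads without a unique barcode go to 'unassigned'."""
    bins = defaultdict(list)
    for read in reads:
        sample, trimmed = assign_sample(read, barcodes, max_mismatches)
        bins[sample if sample is not None else "unassigned"].append(trimmed)
    return dict(bins)


pooled = ["ACGTACGTGGGG", "TGCATGCATTTT", "ACGTACGAGGCC", "NNNNNNNNAAAA"]
by_sample = demultiplex(pooled)
```

The tolerance of one mismatch is only safe when all barcodes in a pool differ from each other at several positions, which is one reason why barcode sets are designed with a minimum pairwise distance.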

Figure 1.5 Schematic representation of the library preparation for 16S rRNA amplicon high-throughput sequencing according to the protocol by Caporaso et al. (2012). a) Environmental DNA is amplified with specific primers targeting the 16S rRNA gene. The forward primer also comprises a primer linker (for a more efficient translation), a primer pad (makes the sample more complex for sequencing and prevents primer dimer formation) and a specific adapter, e.g. for the Illumina platform. In addition to the primer linker, pad and adapter, the reverse primer comprises the unique barcode. b) DNA amplicon with primer linker and pad, barcode and adapter, ready for pooling and high-throughput sequencing.

(iv) Library amplification and sequencing

On the Illumina platform the pooled DNA amplicons are loaded onto a solid surface, normally a glass surface, coated with adapter oligonucleotides. One end of each DNA amplicon binds to the free end of a surface-bound adapter, and the amplicon then ‘bends over’ and hybridizes to a complementary adapter (van Dijk et al. 2014a). This initiates complementary strand synthesis, also called bridge PCR. This process of solid-phase amplification followed by denaturation is repeated multiple times to create clusters of ~1000 copies of single-stranded DNA amplicons (van Dijk et al. 2014a). The density of the library molecules on the glass surface has to be adequately low to avoid interference between library molecules, even after amplification via bridge PCR (Knief 2014). Sequencing is then performed in parallel for thousands to billions of library amplicons, by repeated cycles of nucleotide addition and incorporation by a DNA polymerase, detection of incorporated nucleotides and washing steps (Knief 2014). Figure 1.6 gives an overview of the Illumina clonal amplification by bridge PCR and the sequencing.

Figure 1.6 Schematic representation of Illumina library amplification by bridge PCR and base calling. a) Single-stranded DNA amplicons with barcodes and adapters are added to the Illumina flow cell and immobilized by hybridization. Isothermal bridge amplification generates clusters. These clusters are denatured and cleaved, and sequencing is initiated by the addition of sequencing primer, polymerase and four reversible dye terminators. b) After incorporation the fluorescence is recorded, which is also known as base calling.

(v) Bioinformatics analysis of amplicon sequencing data

Bioinformatics analysis of amplicon sequencing data includes three main steps: (a) pre-processing of raw amplicon reads, (b) microbial diversity analysis and (c) complex data analysis and visualization.

A variety of bioinformatics tools and pipelines have been developed to analyse and visualize amplicon HTS data. The most widespread pipelines are Quantitative Insights Into Microbial Ecology (QIIME) (Caporaso et al. 2010), mothur (Schloss et al. 2009) and UPARSE (Edgar 2013).

(a) Pre-processing of raw amplicon reads

Pre-processing of raw amplicon reads usually includes joining of paired-end reads, de-multiplexing of barcoded amplicon sequences, quality filtering, chimera checking and data normalization.

Joining paired-end reads can be done, for example, within QIIME or with the software tool Fast Length Adjustment of SHort reads (FLASH) developed by Magoč and Salzberg (2011). De-multiplexing of the barcoded amplicon sequences and quality filtering can be done in e.g. QIIME or mothur. Chimera checking can be done with, for example, UCHIME (Edgar et al. 2011), and data normalization can be done by relative abundance or by rarefaction (QIIME).
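The two normalization approaches named above can be sketched in a few lines of Python. This is an illustration of the general idea only, not the QIIME implementation; the OTU names, the counts and the fixed random seed are assumptions chosen for the example.

```python
import random


def relative_abundance(counts):
    """Convert raw OTU read counts of one sample into proportions."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {otu: n / total for otu, n in counts.items()}


def rarefy(counts, depth, seed=0):
    """Randomly subsample reads without replacement down to a common depth."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    # Expand the count table into one entry per read, then draw `depth` reads.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    picked = random.Random(seed).sample(pool, depth)
    rarefied = {otu: 0 for otu in counts}
    for otu in picked:
        rarefied[otu] += 1
    return rarefied


# Hypothetical OTU counts for a single sample.
sample = {"OTU1": 60, "OTU2": 30, "OTU3": 10}
proportions = relative_abundance(sample)
subsampled = rarefy(sample, depth=50)
```

Relative abundance keeps all reads but removes the information on sequencing depth, whereas rarefaction discards reads so that every sample is compared at the same depth.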

(b) Microbial diversity analysis

The microbial diversity analysis normally consists of setting up an operational taxonomic unit (OTU) table, including OTU picking, picking representative sequences, aligning them and assigning them taxonomically and building a phylogenetic tree.

OTU picking can be done within QIIME using OTU reference databases such as SILVA (Pruesse et al. 2007) or Greengenes (DeSantis et al. 2006). For aligning representative sequences, USEARCH/Uclust (Edgar 2010), mothur or BLAST (Altschul et al. 1990) can be used. The taxonomic assignment can again be done with USEARCH/Uclust, BLAST or mothur. Aligned representative OTU sequences can be visualized in a phylogenetic tree with software packages like MEGA (Tamura et al. 2011), PhyML (Guindon et al. 2010) or ARB (Ludwig et al.).

(c) Complex data analysis and visualization

The more advanced data analysis and visualization usually includes an alpha-diversity (diversity within a sample) and beta-diversity (diversity across samples) analysis. Moreover, it can involve clustering and coordinate analysis (e.g. principal component analysis and plots) and advanced data visualization (e.g. heatmaps and network analysis and plots).

Alpha- and beta-diversity can be analysed using QIIME. Statistical analysis can be done using R packages or Statistical Analysis of Metagenomic Profiles (STAMP) (Parks et al. 2014).
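To make the two diversity concepts concrete, the following Python sketch computes one common alpha-diversity measure (the Shannon index) and one common beta-diversity measure (the Bray-Curtis dissimilarity). These two metrics are chosen here for illustration and are not prescribed by the text; the count vectors are invented, and both samples are assumed to list the same OTUs in the same order.

```python
import math


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) for one sample (alpha-diversity)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no reads")
    h = 0.0
    for n in counts:
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    return h


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples (beta-diversity).

    0 means identical count profiles, 1 means no shared OTUs.
    """
    if len(a) != len(b):
        raise ValueError("samples must cover the same OTUs")
    denominator = sum(a) + sum(b)
    if denominator == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


# Hypothetical OTU count vectors for two samples.
leaf = [10, 10, 10, 10]
root = [40, 0, 0, 0]
alpha_leaf = shannon(leaf)
alpha_root = shannon(root)
beta = bray_curtis(leaf, root)
```

An evenly distributed community reaches the maximum Shannon index for its number of OTUs (ln 4 in this example), while a community dominated by a single OTU has an index of zero.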

Relevant amplicon sequencing studies

High-throughput amplicon sequencing of bacterial marker genes (16S rRNA gene) is increasingly being used in studies of microbial community composition and consequently also for characterisation of the bacterial community in phyllosphere and rhizosphere. So far more than 100 articles on rhizosphere and no less than 40 articles on phyllosphere bacterial communities using HTS technologies have been published (Knief 2014). Most of these studies used the 454 sequencing platform by Roche and only a few applied Illumina MiSeq (Bokulich et al. 2014, Bulgarelli et al. 2015, Jiang et al. 2013). A few of these phyllosphere and rhizosphere studies are summarized in Table 1.3 and Table 1.4, respectively.

The majority of the amplicon sequencing studies of the phyllosphere were completed to define and identify the microorganisms that colonize plants. It was investigated if the plant taxon determines the composition of the community (Bokulich et al. 2014, Delmotte et al. 2009) and if the community composition differs between different plant compartments (Bodenhausen et al. 2013, Ottesen et al. 2013).

Regarding the rhizosphere, it was investigated whether factors like season, plant species, soil and sediment type or plant growth conditions have an impact on the bacterial community associated with the host plant (summarized by Knief 2014). These studies show the advantage of high-throughput amplicon sequencing for investigating microbial colonization of plants.

Shotgun metagenomics

Shotgun metagenomics is the sequencing of the entire community DNA of a sample. It allows the identity of the microorganisms present to be determined and, at the same time, their functionalities and metabolic potentials to be assessed (Di Bella et al. 2013). Therefore, a robust estimation of the microbial community composition and diversity is possible without the need for targeting and amplifying a particular gene (Poretsky et al. 2014). However, metagenomic studies are much more challenging than amplicon studies, e.g. because of challenges associated with the assembly of metagenomic reads and the analysis of the metagenomic data (Di Bella et al. 2013).

Similar to 16S rRNA amplicon sequencing, shotgun metagenomics involves the following steps: (i) DNA extraction, (ii) choosing a sequencing platform, (iii) library preparation, (iv) the sequencing process and (v) bioinformatics and statistical analysis of the HTS data.

(i) DNA extraction for shotgun metagenomic sequencing

The DNA extraction process for shotgun metagenomics is similar to that for 16S rRNA amplicon sequencing and is described above under point (i) DNA extraction for 16S rRNA HTS.

(ii) Choosing a sequencing platform

Sequencing platforms for shotgun metagenomics are the same as for 16S rRNA amplicon sequencing and are described above under point (ii) Choosing a sequencing platform.

(iii) Library preparation for shotgun metagenomic sequencing

Library preparation for shotgun metagenomics normally includes three steps: (a) fragmentation of DNA molecules, (b) generation of blunt ends on the fragments for further processing and (c) ligation of adapters to the fragments.

Fragmentation of genomic DNA is normally done either mechanically or enzymatically. Mechanical fragmentation methods are nebulization, hydrodynamic shearing and ultrasonication. The fragment size is critical and depends on the sequencing platform that will be used (Knief 2014). For Illumina libraries the fragment size is between 300 and 500 bp inclusive of adapters. However, it is also

density within the flow cell to prevent interference of library fragments during the sequencing step (Knief 2014). After fragmentation, the sequencing platform-specific adapters are ligated to the ends of the fragments. These adapters are used not only for attachment to the solid surface within the sequencing flow cell but also as primer target sites for amplification. ‘Tagmentation’ is a novel approach by Illumina Nextera™ that uses a ‘Transposome™’ enzyme, which fragments DNA strands, blunts the ends and attaches sequence tags all in one step.

(iv) Library amplification and sequencing

The process of shotgun metagenomics library amplification and sequencing is similar to that of 16S rRNA amplicon sequencing and is described above under point (iv) Library amplification and sequencing.

(v) Bioinformatics analysis of shotgun metagenomic sequencing data

Shotgun metagenomics generates very large datasets of short sequences that are challenging to analyse (Scholz et al. 2015). Therefore, assembly of the reads into contigs and their orientation into scaffolds is recommended. The short reads can be assembled into contigs by (i) reference-based mapping or (ii) de novo assembly. A tool for reference-based assembly is e.g. MetAMOS (Treangen et al. 2013), and tools for de novo assembly are EULER (Pevzner et al. 2001), Velvet (Zerbino and Birney 2008) and SOAP (Li et al. 2008). Next-generation assembly tools, which combine binning and assembly in order to generate more precise assemblies from more complex samples and datasets containing multiple genomes (Oulas et al. 2015), are MetaVelvet (Namiki et al. 2012) and Meta-IDBA (Peng et al. 2011). Binning is the method of grouping reads or contigs into “bins” that are likely to be from the same or similar species, subspecies or genus. Binning can be composition-based, similarity-based or both, and can be performed before or after assembly. Because assembly of metagenomic reads and contigs can be very challenging, as described by Di Bella et al. (2013), metagenomic datasets are often not assembled. If assembly and binning of reads and contigs are carried out, annotation of the metagenomic sequences and gene finding can follow. Without an assembly it is still possible to identify genes; however, annotation of the metagenomic sequences is not possible.
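Composition-based binning can be illustrated with a small Python sketch that compares the tetranucleotide usage of two sequences. It is a simplified stand-in and not the algorithm of any tool named here: raw tetranucleotide frequencies and a plain Pearson correlation are used, the two example contigs are artificial, and no threshold for assigning sequences to the same bin is implied.

```python
import math
from itertools import product

# All 256 possible tetranucleotides in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freq(seq):
    """Return the relative frequency of every tetranucleotide in a sequence."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence too short or without valid tetranucleotides")
    return [counts[k] / total for k in KMERS]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Artificial contigs with very different composition.
contig_a = "ACGT" * 10
contig_b = "AAAT" * 10
same = pearson(tetra_freq(contig_a), tetra_freq(contig_a))
different = pearson(tetra_freq(contig_a), tetra_freq(contig_b))
```

Sequences originating from the same genome tend to show similar oligonucleotide usage, so a high correlation is taken as evidence that two contigs belong in the same bin; similarity-based binning instead relies on alignments against reference sequences.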

Annotation of the metagenomic sequences involves steps such as (i) trimming of low-quality reads, (ii) masking of low-complexity reads, (iii) de-replication and (iv) a screening step for sequences that match model organisms (Oulas et al. 2015). After that, genes can be predicted within the assembled contigs with tools like Prodigal (Hyatt et al. 2010), Orphelia (Hoff et al. 2009) or FragGeneScan (Rho et al. 2010). The latter is suitable for prokaryotic genomes only and is believed to be one of the most precise gene-prediction tools currently available (Oulas et al. 2015); it is therefore used by EBI Metagenomics and MG-RAST. FragGeneScan can also be used for gene prediction from reads when no assembly was carried out. In order to determine the function of the predicted sequences, different databases can be used for the functional annotation, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Ogata et al. 1999), Clusters of Orthologous Groups (COG) (Tatusov et al. 2000) and SEED (Overbeek et al. 2005).
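Two of the pre-processing steps listed above, trimming of low-quality bases and de-replication, can be sketched as follows. The sketch assumes Phred+33 encoded quality strings and a quality threshold of 20, both of which are assumptions made for the example; the function names are hypothetical and do not correspond to any of the tools cited.

```python
def phred_from_ascii(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def trim_3prime(seq, quals, min_q=20):
    """Remove bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def dereplicate(seqs):
    """Collapse identical sequences into (sequence, copy number) pairs.

    The order of first occurrence is preserved.
    """
    copies = {}
    for seq in seqs:
        copies[seq] = copies.get(seq, 0) + 1
    return list(copies.items())


# Invented read with two low-quality bases at its 3' end.
scores = phred_from_ascii("III##")
trimmed_seq, trimmed_quals = trim_3prime("ACGTT", scores)
unique = dereplicate(["ACG", "TTT", "ACG"])
```

De-replication reduces the number of sequences that have to be compared against databases in the later annotation steps, while the stored copy numbers keep the abundance information.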

Numerous computational open source pipelines can be found online to streamline de-replication, quality control, mapping of fragments against databases and phylogenetic and metabolic reconstruction of metagenomic data. The Metagenomics RAST server (MG-RAST) and the EBI Metagenomics service are two of these pipelines.

Table 1.7 Bioinformatic tools used for metagenomic data analysis

Scope | Tool/software | Function | Reference
Assembly | Velvet | Genome assembly based on de Bruijn graph algorithm | Zerbino and Birney (2008)
Assembly | MetaVelvet | Genome assembly based on de Bruijn graph algorithm | Namiki et al. (2012)
Assembly | Meta-IDBA | Genome assembly based on de Bruijn graph algorithm | Peng et al. (2011)
Assembly | EULER | Online tool to assemble DNA sequences using an Eulerian approach | Pevzner et al. (2001)
Assembly | SOAP | Short-read assembly tool designed for Illumina short-reads | Li et al. (2008)
Assembly | MetAMOS | Integrated assembly pipeline using the Bambus2 metagenomic scaffolder | Treangen et al. (2013)
Binning | TETRA | Calculates correlation of tetranucleotide usage patterns in DNA sequences | Teeling et al. (2004)
Binning | CARMA | Binning based on Pfam domains and | Krause et al. (2008)
Binning | SPHINX | Uses composition- and alignment-based binning algorithms | Mohammed et al. (2011)
Annotation | FASTX-Toolkit | Trimming of low-quality reads | Su et al. (2012)
Annotation | FastQC | Trimming of low-quality reads | Andrews (2010)
Annotation | DUST | Masking of low-complexity DNA sequences | Morgulis et al. (2006)
Annotation | Bowtie 2 | Read-aligner | Langmead and Salzberg (2012)
Annotation | RAST | Automated service for annotation of bacterial and archaeal genomes | Aziz et al. (2008)
Annotation | HMMer3 | Sequence alignments based on profile hidden Markov models | Mistry et al. (2013)
Gene prediction | Prodigal | Bacterial and archaeal gene finding | Hyatt et al. (2010)
Gene prediction | Orphelia | ORF finding tool for the prediction of protein coding genes | Hoff et al. (2009)
Gene prediction | FragGeneScan | Application for finding (fragmented) genes in short reads | Rho et al. (2010)
Taxonomic assignment | PhyloPhlAn | Taxonomic assignment and phylogenomic assessment by placing the contigs into the microbial tree of life; uses 400 markers | Segata et al. (2013b)
Taxonomic assignment | MetaPhlAn | Marker-based approach, short reads are mapped against representative genes | Segata et al. (2012)
Taxonomic assignment | Kraken | Tool for assigning taxonomic labels to short DNA sequences | Wood and Salzberg (2014)
Databases | SEED & KEGG | Functional annotation of genes | Mitra et al. (2011)
Databases | COG/KOG | Functional annotation database | Tatusov et al. (2000)
Databases | Pfam | Protein family database | Finn et al. (2008)
Databases | GOLD | Genomes On Line Database | Liolios et al. (2010)
Databases | Greengenes | 16S rRNA gene sequence database | DeSantis et al. (2006)
Additional analysis | LefSe | LDA Effect Size (LefSe) is an algorithm to identify differences between two or more biological conditions | Segata et al. (2011)
Additional analysis | STAMP | Software package for analysing taxonomic or metabolic profiles | Parks et al. (2014)
Additional analysis | PICRUSt | Prediction of metagenome functional content | Langille et al. (2013)
Servers/open source pipelines | MG-RAST | Automated pipeline to perform quality control, gene and protein prediction, clustering and annotation | Meyer et al. (2008)
Servers/open source pipelines | EBI Meta-

Relevant shotgun metagenomic sequencing studies

So far, only a small number of shotgun metagenomic studies of plant-associated microorganisms have been published. A few of these phyllosphere and rhizosphere studies are summarized in Table 1.3 and Table 1.4, respectively.

Metagenomic data characterizing the phyllosphere microbiome are available from soybean, rice, clover, Arabidopsis thaliana and tomato (Atamna-Ismaeel et al. 2012, Delmotte et al. 2009, Knief 2014, Knief et al. 2012, Ottesen et al. 2013). Knief et al. (2012) and Delmotte et al. (2009) studied metaproteomic data in combination with metagenomic data and found great consistency in the metaproteomes and the microbial community composition (phylum level) of the phyllosphere of different plant species (Knief 2014). Another metaproteogenomic study on rice phyllosphere and rhizosphere samples showed a specific metagenomic and

In document Environmental genomics and proteomics of plant associated microbial dimethylsulfide degradation in a coastal salt marsh (Page 59-70)