Chapter 1 General Introduction
1.5. Bioinformatics analysis of high throughput data
1.5.2. Oligotyping and Minimum Entropy Decomposition
High throughput sequencing of a highly conserved region of the 16S rRNA gene is an appropriate molecular probe for investigating the whole bacterial community present in our environmental reservoirs. This probe can also be used to monitor changes in microbial diversity across climatic conditions or spatial distances (Huber, Mark Welch et al. 2007). To delineate total bacterial diversity, every sequence needs to be considered. However, QIIME has two limitations. The first is that OTUs are generated by abundance-based similarity clustering: non-dominant sequences with subtle differences are not represented in their own OTU but are clustered with more dominant, similar sequences. The second is that a large proportion of the 16S rRNA reference sequences were obtained from cultured isolates, so species classification that relies on curated databases offers poor resolution for environmental diversity analysis (Garrity 2004).
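The first limitation can be illustrated with a minimal sketch of greedy, abundance-sorted clustering. This is a simplified illustration only and is not the algorithm implemented in QIIME or Usearch: the function names `identity` and `greedy_otus`, the 97 % threshold, the position-by-position identity measure on pre-aligned reads and the toy reads are all assumptions made for the example.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length aligned reads."""
    if len(a) != len(b):
        raise ValueError("reads must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads: list[str], threshold: float = 0.97) -> dict[str, int]:
    """Cluster reads around the most abundant sequences first.

    Returns a mapping from each OTU centroid to the number of reads it absorbed.
    The 0.97 default is an assumed threshold, not a value taken from the text.
    """
    otus: dict[str, int] = {}
    # Unique sequences are visited from most to least abundant,
    # so dominant sequences become centroids before rare ones are seen.
    for seq, count in Counter(reads).most_common():
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid] += count  # rare read joins an existing OTU
                break
        else:
            otus[seq] = count  # no centroid is similar enough: new OTU
    return otus


# Toy data: a dominant sequence and a rare variant differing at 1 of 100 positions.
dominant = "ACGT" * 25
rare = "T" + dominant[1:]
reads = [dominant] * 98 + [rare] * 2

otus = greedy_otus(reads)
print(len(otus))  # the rare variant does not receive its own OTU
```

In this toy case the rare variant is 99 % identical to the dominant sequence, so it is merged into the dominant OTU and its single-nucleotide difference is no longer visible downstream.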
The computational method of oligotyping aims to reveal diversity that is concealed in sequence analysis, without comparison to current databases. Species clustering in oligotyping relies on comparing all positions in an amplicon and using entropy analysis to select the positions with appropriate levels of entropy (Figure 1.4, Step 2). The entropy formula used in Oligotyping is given in Equation 2.4; it allows the positions of interest, those showing high variation among the sequence reads, to be accumulated effectively. Random insertion and deletion sequencing errors produce low levels of entropy because they are randomly distributed through the sequence, and they are thus excluded from oligotyping analysis (Eren, Maignien et al. 2013). A previous study reported that oligotyping was able to differentiate between two highly similar bacterial species with only 0.2 % variation in the short hypervariable 16S rRNA region (Eren, Zozaya et al. 2011). Oligotyping was therefore chosen as suitable for differentiating SGM members, especially MAC, because these bacterial species differ by only 0.3 % in the 16S rRNA gene region (Chaves, Sandoval et al. 2010).
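The entropy-based selection of positions described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the oligotyping pipeline itself: it uses the standard Shannon entropy in bits (the log base is an assumption, since Equation 2.4 is not reproduced here), it assumes the reads are already aligned to equal length, and the function names and toy reads are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the characters in one alignment column."""
    total = len(column)
    return sum(
        (n / total) * math.log2(total / n) for n in Counter(column).values()
    )


def entropy_profile(reads: list[str]) -> list[float]:
    """Entropy of every column across a set of equal-length aligned reads."""
    return [column_entropy("".join(col)) for col in zip(*reads)]


def oligotypes(reads: list[str], num_positions: int):
    """Pick the highest-entropy columns and group reads by those nucleotides."""
    profile = entropy_profile(reads)
    ranked = sorted(range(len(profile)), key=lambda i: profile[i], reverse=True)
    positions = sorted(ranked[:num_positions])
    # An oligotype is the concatenation of nucleotides at the chosen positions.
    counts = Counter("".join(read[i] for i in positions) for read in reads)
    return positions, counts


# Toy data: a genuine T/G variant at index 3 and one isolated error at index 6.
reads = ["ACGTACGT"] * 5 + ["ACGGACGT"] * 4 + ["ACGGACTT"]

profile = entropy_profile(reads)
positions, counts = oligotypes(reads, num_positions=1)
print(positions, dict(counts))
```

In the toy data the consistently variable column has a much higher entropy than the column carrying a single isolated error, so only the former is selected and the reads fall into two oligotypes.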
Figure 1.4 The major processes in Oligotyping analysis. Step 1 shows the diverse sequence reads within an OTU in the dataset. Step 2 shows Shannon entropy analysis applied to these reads to reveal the positions of highest variation. In Step 3, the oligotypes are identified according to the entropy analysis (Eren, Maignien et al. 2013).
The species classification in Oligotyping relies on BLAST, using a curated reference database, to identify closely related sequences for taxonomic analysis. As a result, Oligotyping is unable to address the rare fraction of undefined bacterial species. However, Minimum Entropy Decomposition (MED) uses entropy analysis to differentiate homogeneous units by looking at only a fraction of the available nucleotide data, instead of classifying species by sequence similarity, so that diversity analysis can be performed directly (Figure 1.5) (Eren, Morrison et al. 2015). Each node generated by MED is segregated as a new phylogenetically homogeneous unit at the end of the analysis. Sequences are assigned to nodes using the most entropic position, entropy within each node is then recalculated, and new nodes are created based on the most entropic position. This process continues until all sequences are in nodes with no entropy above a user-defined level. These nodes can be used for analysis in the same way as OTUs in the QIIME pipeline. A recent study showed that MED is a sensitive method for diversity analysis of closely related bacterial species and that it facilitates in-depth biomarker analysis of metagenomic data (Eren, Morrison et al. 2015).
Figure 1.5 Flow chart illustrating the principle of MED, which decomposes reads at each high-entropy sequence position into different node groups. In step 0, Shannon entropy analysis is performed as in Oligotyping. In step 1, any node with a Shannon entropy greater than 0 is decomposed and reanalysed to generate the new nodes shown in step 2 (Eren, Morrison et al. 2015).
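The iterative decomposition described above can be sketched in a few lines. This is an illustrative simplification and not the MED implementation: it assumes equal-length aligned reads, uses Shannon entropy in bits, and omits any abundance-based filtering or noise handling that the real software applies. The names `decompose` and `max_entropy` and the toy reads are hypothetical.

```python
import math
from collections import Counter, defaultdict


def column_entropy(column) -> float:
    """Shannon entropy (bits) of the characters in one alignment column."""
    total = len(column)
    return sum(
        (n / total) * math.log2(total / n) for n in Counter(column).values()
    )


def decompose(reads: list[str], max_entropy: float = 0.0) -> list[list[str]]:
    """Split reads into nodes until no column in any node exceeds max_entropy."""
    if max_entropy < 0:
        raise ValueError("max_entropy must be non-negative")
    final_nodes: list[list[str]] = []
    pending = [reads] if reads else []
    while pending:
        node = pending.pop()
        # Recalculate entropy within this node only.
        profile = [column_entropy(col) for col in zip(*node)]
        peak_entropy = max(profile, default=0.0)
        if peak_entropy <= max_entropy:
            final_nodes.append(node)  # node is homogeneous enough
            continue
        peak = profile.index(peak_entropy)  # most entropic position
        children = defaultdict(list)
        for read in node:
            children[read[peak]].append(read)  # split on the nucleotide there
        pending.extend(children.values())
    return final_nodes


# Toy data: three sequence variants at different abundances.
reads = ["AAAA"] * 4 + ["AACA"] * 3 + ["TACA"] * 2

nodes = decompose(reads)
print(sorted(len(node) for node in nodes))
```

With the threshold set to 0, every final node contains identical reads, so the three variants in the toy data end up in three separate nodes. Raising `max_entropy` stops the decomposition earlier and yields fewer, more heterogeneous nodes.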
Two bioinformatics NGS analysis pipelines, QIIME and Oligotyping, were compared together with their associated algorithms, Uparse and MED (Table 1.4). Only QIIME was limited by sequence read length, but the QIIME platform provided several algorithms that support oligotyping analysis, such as denoising, chimera removal, alignment and statistical analysis. Uparse is an independent programme developed to improve on QIIME, generating OTUs with high accuracy and sensitivity using its own denoising, chimera removal and alignment system, although diversity and statistical analysis still rely on QIIME (Edgar 2013). The Oligotyping and MED analysis pathways were found to be superior to QIIME and Uparse in terms of diversity and statistical analysis.
Table 1.4 Comparison of two sequencing analysis methods, QIIME and Oligotyping, with their associated algorithms.

| | QIIME | Uparse | Oligotyping | MED |
|---|---|---|---|---|
| Sequencing platform | 454 Pyroseq & Illumina | 454 Pyroseq & Illumina | 454 Pyroseq & Illumina | 454 Pyroseq & Illumina |
| Read length | 200 ~ 1000 bp | 0 ~ user defined | 0 ~ user defined | 0 ~ user defined |
| Denoising | Denoiser | Usearch | Denoiser, Usearch | Denoiser, Usearch |
| Chimera removal | ChimeraSlayer, Usearch | Usearch | ChimeraSlayer, Usearch | ChimeraSlayer, Usearch |
| Sequence cluster | OTU | OTU based usearch | Shannon entropy algorithm | Shannon entropy based Node algorithm |
| Species classification | BLAST, RDP, Greengenes | BLAST, RDP, Greengenes | BLAST | None |
| Alignment | Pynast, Mothur | Usearch | Pynast, Mothur | Pynast, Mothur |
| Phylogenetic Tree | Phylogenetic Tree | Phylogenetic Tree | Clustering Tree | Clustering Tree |
| Diversity analysis | Alpha & Beta diversity | Alpha & Beta diversity | Clustering & NMDS | Clustering & NMDS |
| Statistical mapping analysis | PCoA, NMDS and ANOSIM etc. | PCoA, NMDS and ANOSIM etc. | NMDS and ANOSIM etc. | NMDS and ANOSIM etc. |