Implementation of Qiime pipeline - Pyrosequencing methods and bioinformatics

2.12 Pyrosequencing methods and bioinformatics

2.12.4 Implementation of Qiime pipeline

Qiime, or Quantitative Insights Into Microbial Ecology, is an open source bioinformatics package implemented in a 64-bit Ubuntu environment and distributed as a virtual disc image (.vdi) mounted through the x86 virtualization software package VirtualBox. Developed at the University of Colorado (Caporaso et al., 2010a), it allows the comparison and analysis of microbial communities based on raw high-throughput amplicon sequencing data. Steps were implemented in the same manner on both the 16S rDNA sequences and the GH18 and GH19 sequences unless otherwise stated.

2.12.4.1 Generation and validation of mapping file

The mapping file is a user-generated, tab-delimited, MIMARKS-formatted metadata file that contains all the information about the samples necessary to perform data analysis. Multiple samples and pyrosequencing runs can be combined and analysed simultaneously. An example mapping file is shown in figure 15. The required fields are SampleID, a short, meaningful, unique identification label that must be the leftmost column; the barcode sequence, a sequence used to identify each sample within a multiplexed 454-pyrosequencing run; a linker primer sequence that follows on from the barcode; and a description of the sample, which must be the last column. Additional fields, which do not necessarily need to be used, can be added to provide more information about the samples that may be of use later in the analysis. Mapping files were validated for illegal characters and other errors before use [15].

Figure 15: An example of a sample mapping file for 16S rDNA data
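The field requirements described for the mapping file can be illustrated with a minimal Python sketch. It is not the validation script used in this work; the header names (#SampleID, BarcodeSequence, LinkerPrimerSequence, Description) are assumed here to follow the usual Qiime convention, and the function name check_mapping is hypothetical.

```python
import csv
import io

# Assumed Qiime-style header names; adjust to match the actual mapping file.
REQUIRED_FIRST = "#SampleID"
REQUIRED_LAST = "Description"
REQUIRED_FIELDS = {"#SampleID", "BarcodeSequence",
                   "LinkerPrimerSequence", "Description"}


def check_mapping(text):
    """Return a list of problems found in a tab-delimited mapping file."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or not rows[0]:
        return ["file is empty or has no header"]
    header, samples = rows[0], rows[1:]
    problems = []
    missing = REQUIRED_FIELDS - set(header)
    if missing:
        problems.append("missing fields: " + ", ".join(sorted(missing)))
    if header[0] != REQUIRED_FIRST:
        problems.append("SampleID must be the leftmost column")
    if header[-1] != REQUIRED_LAST:
        problems.append("Description must be the last column")
    # Sample identifiers must be unique.
    ids = [row[0] for row in samples if row]
    if len(ids) != len(set(ids)):
        problems.append("sample IDs are not unique")
    # Barcodes are checked for characters other than A, C, G and T.
    if "BarcodeSequence" in header:
        col = header.index("BarcodeSequence")
        for row in samples:
            if len(row) > col and set(row[col].upper()) - set("ACGT"):
                problems.append("non-ACGT barcode for sample " + row[0])
    return problems
```

An empty list returned by check_mapping indicates that none of these particular checks failed; it does not guarantee that the file would pass the full Qiime validation.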

2.12.4.2 Similarity-based OTU classification and representative sequence picking

16S, GH18 and GH19 sequences

Conventionally, Operational Taxonomic Units (OTUs) are defined at a phylogenetic distance of 0.03 for species, 0.05 for genus, and 0.10 for family, based on the whole 16S rRNA gene (Kim et al., 2011). For the V1-3 variable region of the 16S rRNA gene a phylogenetic distance of 0.03 for species is acceptable (Kim et al., 2011). Quality controlled sequences were clustered into OTUs using uclust [16] (Edgar, 2010) at >97% sequence similarity. uclust is based on a search paradigm that employs a fast heuristic, designed to identify one or a few good hits rapidly, rather than all hits, in order to increase throughput. Distance measures are first derived from k-mer [17] counting, which has been shown to correlate well with percentage identity derived from sequence alignment methods (Edgar, 2004b); the database sequences are then sorted in order of decreasing number of shared words. If a hit exists in the database it is likely to be found amongst the first few candidates. The probability of finding a hit decreases rapidly as the number of failed attempts increases, so by terminating the search after a pre-set number of failures, hits are obtained rapidly with minimal cost to sensitivity. A query sequence with <97% similarity to the existing seeds becomes the seed sequence for a new cluster. Each OTU is represented by a single sequence, which is used in the downstream analysis. The sequence that initially seeded the OTU was chosen as the representative sequence [18].
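The greedy, seed-based clustering described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the uclust implementation: the identity function is a stand-in based on Python's difflib rather than a sequence alignment, and the word length k, the max_rejects limit and all function names are hypothetical choices made for the example.

```python
from difflib import SequenceMatcher


def kmers(seq, k=4):
    """Set of all contiguous words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def identity(a, b):
    """Stand-in similarity score in [0, 1]; not an alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def find_hit(seq, seeds, threshold, k, max_rejects):
    """Return the index of the first seed passing the threshold, or None."""
    words = kmers(seq, k)
    # Rank existing seeds by decreasing number of shared words.
    order = sorted(range(len(seeds)),
                   key=lambda c: len(words & kmers(seeds[c], k)),
                   reverse=True)
    for rejects, c in enumerate(order):
        if rejects >= max_rejects:
            break  # give up after a pre-set number of failed candidates
        if identity(seq, seeds[c]) >= threshold:
            return c
    return None


def greedy_cluster(seqs, threshold=0.97, k=4, max_rejects=8):
    """Cluster sequences in input order; unmatched sequences seed new clusters."""
    seeds, clusters = [], []
    for idx, seq in enumerate(seqs):
        hit = find_hit(seq, seeds, threshold, k, max_rejects)
        if hit is None:
            seeds.append(seq)        # the seed doubles as the representative
            clusters.append([idx])
        else:
            clusters[hit].append(idx)
    return seeds, clusters
```

Because each cluster is represented by the sequence that seeded it, the returned seeds list corresponds to the representative sequences, and the result depends on the order in which sequences are presented.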

[15] Mapping file validation script: check_id_map.py

[16] OTU picking script: pick_otus.py -i [input file].fas -M 4096 -o [output file].txt

[17] k-mer: a contiguous subsequence of length k, also known as a ‘word’

[18] Representative OTU picking script: pick_rep_set.py -i [otu mapping file] -f [input fasta

The previously most efficient clustering method, CD-HIT-EST (Li and Godzik, 2006), orders sequences by length, longest to shortest, ascribing the representative sequence to the longest sequence within the cluster. Subsequent sequences are then either clustered with a previous group or become the representative sequence for an additional cluster. uclust is significantly faster, demands less RAM, clusters at lower identities, and has greater sensitivity, overlooking fewer matches and more frequently identifying the closest cluster (Edgar, 2010).

2.12.4.3 Assigning Taxonomy

16S rRNA gene sequences

Taxonomy was assigned by searching [19] representative OTU sequences against a BLAST database of pre-assigned reference sequences, the most recent GreenGenes OTU alignment [20]. The quality scores assigned by the BLAST taxonomy assigner are E-values; a stringent cut-off of <0.001 was chosen (Dinsdale et al., 2008; Caporaso et al., 2011). Once taxonomy had been assigned, a workflow script [21] was run to summarize and graphically represent the data in the form of proportional stacked bar charts.

GH18 and GH19 sequences

Taxonomic identities for the functional glycoside hydrolase sequences were assigned using the in-house pipeline and BLAST [22]. The output file required modification in Excel [Microsoft Corp, WA, USA] for compatibility with the Qiime pipeline. In summary, the columns corresponding to OTU_ID, Blast_Hit and E-value were selected from the output file; OTUs whose BLAST hits had E-values >0.001 were removed (explained further in section 3.6.3.4); where multiple hits were present (often for the same gene) the hit with the lowest E-value was chosen; and finally phylogeny information was imported using Excel’s VLOOKUP function against the full GH18 and GH19 databases and correctly formatted in the style of k__#;p__#;c__#;o__#;f__#;g__#;s__# where the letters stand for kingdom, phylum, class, order, family, genus, and species respectively. Once taxonomy had been assigned, graphics were generated as with the 16S rRNA gene sequences.

[19] Qiime taxonomic assignment script: assign_taxonomy.py -i [representative OTU sequences] -m blast -r [aligned reference sequences] -t [mapping template] -e 0.001

[20] http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/Caporaso_Reference_OTUs/gg_otus_4feb2011.tgz

[21] Script to generate taxonomy graphics: summarize_taxa_through_plots.py -i [otu table] -m [samples map] -o [output file]

[22] In-house taxonomic assignment script: blastall -p blastn -d [database] -i [representative OTU
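The post-processing of the GH18 and GH19 BLAST output (E-value filtering, selection of the lowest E-value hit per OTU, and lineage lookup) was carried out in Excel; the same logic can be sketched in Python as below. The function name best_hits, the "Unassigned" fallback label and the input layout are assumptions made for illustration only, with a dictionary lookup standing in for VLOOKUP.

```python
def best_hits(rows, lineages, max_evalue=0.001):
    """Keep the lowest E-value hit per OTU and attach its lineage.

    rows: iterable of (otu_id, blast_hit, evalue) tuples.
    lineages: dict mapping blast_hit -> lineage string in the
              k__#;p__#;c__#;o__#;f__#;g__#;s__# style.
    """
    best = {}
    for otu_id, hit, evalue in rows:
        if evalue > max_evalue:
            continue  # hits above the cut-off are discarded
        if otu_id not in best or evalue < best[otu_id][1]:
            best[otu_id] = (hit, evalue)
    # The dictionary lookup plays the role of VLOOKUP against the database;
    # "Unassigned" is an assumed placeholder for hits without a lineage.
    return {otu: (hit, ev, lineages.get(hit, "Unassigned"))
            for otu, (hit, ev) in best.items()}
```

OTUs with no hit at or below the cut-off are simply absent from the returned dictionary, mirroring their removal from the spreadsheet.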

2.12.4.4 Aligning representative OTUs & creating phylogenetic trees.

16S rRNA gene sequences

The 16S rDNA representative OTUs were aligned using pyNAST, a Python implementation of the NAST alignment algorithm [23] (DeSantis et al., 2006a; Caporaso et al., 2010b). pyNAST aligns to the best-matching sequence in a pre-aligned database of sequences, in this case the pyNAST-aligned GreenGenes core set, which contains ~5 000 non-chimeric candidate sequences. Candidate sequences are not permitted to introduce new gap characters into the template database, so the algorithm introduces local mis-alignments to preserve the existing template sequence. As sequences obtained through next-generation sequencing methods are typically much shorter than full 16S rDNA sequences, gaps are inserted during the alignment. A script was used to remove gaps which occurred in all sequences [24].
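Removing gaps that occur in all sequences amounts to deleting every alignment column that contains only gap characters. The following Python sketch illustrates that single step; it is not the Qiime script itself, and the function name and the choice of "-" and "." as gap characters are assumptions.

```python
def remove_all_gap_columns(alignment, gap_chars="-."):
    """Drop alignment columns in which every sequence has a gap.

    alignment: dict mapping sequence name -> aligned sequence,
               all of the same length.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")
    # Keep a column if at least one sequence has a non-gap character there.
    keep = [i for i in range(length)
            if any(s[i] not in gap_chars for s in seqs)]
    return {name: "".join(s[i] for i in keep)
            for name, s in alignment.items()}
```

Columns in which only some sequences carry a gap are retained, so the relative alignment of the remaining positions is unchanged.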

Some downstream statistical analyses required phylogenetic trees constructed from representative OTU alignments; these trees were created using the FastTree 2 tool [25]. FastTree 2 employs maximum-likelihood nearest-neighbour interchanges (NNI) and minimum-evolution subtree-pruning-regrafting (SPR) for constructing phylogenetic trees from large alignments (Price et al., 2010). NNI and SPR are tree topology search strategies. NNI swaps subtrees across an internal branch to obtain new topological configurations until a maximum likelihood is achieved. SPR removes a subtree and reinserts it onto another branch to form a new tree; this process can be repeated for each combination of subtree and receiving branch until no further improvements can be found.
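As a small illustration of a nearest-neighbour interchange, the sketch below lists the two alternative topologies obtained by swapping subtrees across a single internal branch. It is a toy example, not FastTree's implementation: the nested-tuple tree representation and the function name nni_neighbours are assumptions, and no likelihood is computed.

```python
def nni_neighbours(tree):
    """Return the two NNI rearrangements around one internal branch.

    tree: ((a, b), (c, d)), where a, b, c and d are the four subtrees
          attached to the two ends of the internal branch.
    """
    (a, b), (c, d) = tree
    # Swapping b with c, or b with d, gives the only two other ways of
    # pairing the four subtrees across the branch.
    return [((a, c), (b, d)), ((a, d), (b, c))]
```

A search procedure would score the current tree and both neighbours, keep the best of the three, and repeat for each internal branch until no rearrangement improves the score.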

[23] Qiime sequence alignment script: align_seqs.py -i [representative OTU sequences].fas -t core_set_aligned.fasta.imputed

[24] De-gap alignment script: filter_alignment.py -i [pynast alignment file]

GH18 and GH19 sequences

Due to the lack of a template for alignment, the GH18 and GH19 representative OTUs were aligned using the Multiple Sequence Comparison by Log-Expectation algorithm, Muscle (Edgar, 2004a,b; Goujon et al., 2010), provided by EMBL-EBI, using default settings and a Pearson/FASTA output format. As the sequences were not aligned by gap expansion against a template, filtering and removal of gaps was not necessary.

Phylogenetic trees were created using FastTree 2, as with the 16S sequences. Modifications to the formatting of the resulting .tre tree file were required for compatibility with the downstream statistical analyses in Qiime [26].

2.12.4.5 Alpha & Beta diversity analysis

OTU tables were created [27] and summarized [28]. These contained the frequencies of sequences within each OTU across the samples, along with taxonomy. Together with the previously created phylogenetic trees, the OTU tables were used as the input for the rarefaction plots [29], calculating the alpha diversity metrics observed species and chao1, and the phylogeny-based metric Phylogenetic Diversity. The observed species metric is a simple count of the unique OTUs found in each sample. The chao1 metric estimates species richness based on the concept that the number of rare species (singletons and doubletons) confers information about the number of missing species (Chao, 1984, 2005). The Phylogenetic Diversity metric takes into account the total phylogenetic branch length belonging to each sample, assigning a higher number to more diverse samples (Faith, 1992).
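The two count-based alpha diversity metrics can be written out in a few lines of Python. This is a sketch for illustration only: it uses the bias-corrected form of chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singleton and doubleton OTUs. That form is chosen here because it remains defined when there are no doubletons; whether it matches the exact variant computed by the Qiime script is an assumption.

```python
def observed_species(counts):
    """Number of OTUs with at least one sequence in the sample."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness estimate for one sample.

    counts: per-OTU sequence counts for the sample.
    """
    s_obs = observed_species(counts)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
```

The estimate equals the observed count when a sample contains at most one singleton, and grows as the proportion of rare OTUs increases.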

OTU tables and phylogenetic trees were used to compute beta diversity, both for all sequences within the samples [30] and with random resampling based on the smallest sample size to rarefy the OTU table and remove sample heterogeneity [31]. The output included weighted (relative OTU abundance) and unweighted (OTU presence/absence), discontinuous 2D principal component analysis plots (PCA).

[26] Script to correct formatting of tree files: sed -e "s/\/1\-\(.\{3\}\)//g" -e "s/>//g"

[27] Script for creating OTU summary tables: make_otu_table.py -i [List of OTUs] -t [representative OTU taxonomic assignment] -o [output file]

[28] Script for summarizing OTU tables: per_library_stats.py -i [OTU table] > [Output file]

[29] Script to calculate alpha diversity: alpha_rarefaction.py -i [OTU Table] -m [Samples Map] -t [Phylogenetic Tree] -o [Output file]

[30] Script for calculating beta diversity: beta_diversity_through_plots.py -i [OTU Table] -m [Samples Map] -t [Phylogenetic Tree] -o [Output file]
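Random resampling of an OTU table to the size of the smallest sample can be sketched in Python as below. This is an illustrative example under stated assumptions rather than the Qiime implementation: the nested-dictionary table layout, the function name rarefy and the fixed random seed are all hypothetical choices.

```python
import random


def rarefy(otu_table, depth=None, seed=0):
    """Subsample every sample, without replacement, to the same depth.

    otu_table: dict mapping sample -> dict mapping OTU -> sequence count.
    depth: number of sequences to keep per sample; defaults to the
           size of the smallest sample.
    """
    if depth is None:
        depth = min(sum(counts.values()) for counts in otu_table.values())
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    rarefied = {}
    for sample, counts in otu_table.items():
        # Expand the counts into one entry per sequence, then draw.
        pool = [otu for otu, n in counts.items() for _ in range(n)]
        picked = rng.sample(pool, depth)
        rarefied[sample] = {otu: picked.count(otu) for otu in counts}
    return rarefied
```

After rarefaction every sample holds the same number of sequences, so differences between samples are no longer confounded by sequencing depth.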

In document A polyphasic approach to the study of chitinolytic bacteria in soil (Page 70-75)