Sequence Data - Estimating the public health risk associated with drinking water in New Zealand


6.2.4 Sequence Data

The output from the Illumina MiSeq system consisted of 16S rRNA gene (hereafter referred to simply as 16S) and WGS reads (sequences), organised in fastq files. In every fastq file, each nucleotide was accompanied by a corresponding quality score. A single fastq file contained tens of thousands to millions of unpaired 16S or WGS reads from a single sample. Since the reads were obtained using a pair of primers (forward and reverse), two fastq files per sample were produced, one for each primer. These unpaired (unaligned) raw reads were 150 bases long for 16S and 250 bases long for WGS. Once paired, the reads were expected to be 250 base pairs long for 16S and 400–500 base pairs long for WGS.
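As an illustration of the fastq layout described above, the following Python sketch reads four-line records (identifier, bases, separator, quality string) from an in-memory example. The two records, the names `FastqRecord` and `read_fastq`, and the Phred+33 quality encoding are assumptions made for illustration; they are not taken from the study data or its software.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    bases: str
    qualities: List[int]  # one quality score per base


def read_fastq(handle, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read fastq stream.

    Assumes unwrapped records and Phred+33 encoding (an assumption here).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        bases = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(bases) != len(quality):
            raise ValueError(f"malformed fastq record: {header!r}")
        yield FastqRecord(header[1:], bases, [ord(c) - offset for c in quality])


# Hypothetical two-read example, not real data.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n5?I#\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
```

Each record keeps exactly one quality value per nucleotide, which is the property the per-position quality analyses rely on.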

Sequence classification

Sequences from next-generation sequencing (NGS) platforms, such as MiSeq, are classified using numerical taxonomic methods before downstream analyses can be performed. For clarity, some terms used in metagenomics are defined here, based on the definitions provided by Sneath and Sokal (Sneath and Sokal, 1973; Sokal and Sneath, 1963).

Materials and methods

on the basis of their character states. Taxon (plural taxa) is the taxonomic group of any nature or rank. Operational taxonomic units (OTUs) are the units of study, which could be individual organisms or taxonomic groups such as species, genera and so on. A collection of OTUs makes up a taxon. In the current study, the taxonomic ranks used were (from highest to lowest): kingdom, phylum, class, order, family, genus, and species. In 16S sequence studies, OTUs are commonly formed by clustering reads that are ≥97 % similar, although this threshold can be adjusted by the user. The potential consequences of this are threefold. Firstly, different species that are ≥97 % similar on the sequenced gene(s) are merged, resulting in OTUs with multiple species. Secondly, species that have paralogs with <97 % similarity are split across multiple OTUs. Thirdly, artifacts including read errors and chimaeras may result in spurious clusters. After clustering, the OTUs are matched to a database in order to be assigned to a species. OTUs that are not matched to a species are flagged as novel or unknown.
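The effect of a similarity threshold can be illustrated with a toy greedy clustering in Python. This is only a sketch under strong assumptions: reads are taken to be equal-length and pre-aligned, similarity is the plain fraction of matching positions, and the function names are hypothetical. It is not the clustering algorithm used by QIIME in this study.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy example assumes equal-length, pre-aligned reads")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clusters(reads, threshold=0.97):
    """Assign each read to the first centroid it matches at >= threshold.

    A read that matches no existing centroid founds a new OTU.
    Returns a list of (centroid, members) pairs.
    """
    clusters = []
    for read in reads:
        for centroid, members in clusters:
            if identity(read, centroid) >= threshold:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return clusters


# Hypothetical 100-base reads: one mismatch (99 %) joins the first OTU,
# five mismatches (95 %) fall below the threshold and found a second OTU.
seed = "A" * 100
one_mismatch = "C" + "A" * 99
five_mismatches = "CCCCC" + "A" * 95
otus = greedy_otu_clusters([seed, one_mismatch, five_mismatches])
```

Lowering `threshold` to 0.95 would merge all three reads into one OTU, which shows how the chosen cut-off determines whether closely related sequences are merged or split.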

6.2.5 Data analysis

Sequence statistics and quality

The quality of the input sequences was ascertained using standard quality analyses that included the analysis of base call accuracy, base content, sequence quality and sequence lengths. The Q-score (Illumina Inc., 2014), also known as the Phred quality score, was the main tool used to assess base call accuracy and sequence quality. The Q-score expresses the probability that a given base was called incorrectly by the sequencer. It is logarithmically related to the base calling error probability and is defined by Equation 6.1.

Q = −10 log₁₀ P    (6.1)

where P is the estimated probability that a given base call is incorrect. A higher Q-score indicates a smaller probability that a base was incorrectly called. For example, a Q-score of 20 represents a probability of 1 in 100 of an incorrect base call, or 99.0 % accuracy in the base call. Similarly, a Q-score of 30 represents a probability of 1 in 1000 of an incorrect base call, or 99.9 % accuracy in the base call. The Q-score was used to determine the quality of each position of any given sequence.
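The relationship in Equation 6.1 can be checked numerically with a short Python sketch. The function names are hypothetical and chosen for illustration only; the per-read averaging helper is an added convenience, not a step described in the study.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to get P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P), defined for 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("P must be in (0, 1]")
    return -10 * math.log10(p)


def mean_error_probability(qualities):
    """Average the per-base error probabilities of one read (not the Q-scores)."""
    return sum(phred_to_error_probability(q) for q in qualities) / len(qualities)
```

Because the scale is logarithmic, per-base error probabilities should be averaged rather than the Q-scores themselves; a read with scores 10 and 30 has a mean error probability of about 0.05, not the 0.01 that a mean Q-score of 20 would suggest.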

16S sequence processing and analysis

The 16S raw sequences in each sample were first paired (aligned or combined) to form 253 base pair overlapping sequences using fast length adjustment of short reads (FLASH) 1.2.6 (Magoč and Salzberg, 2011), and were then quality trimmed using SolexaQA 2.2 (Cox et al., 2010). From each sample, a maximum of 300 000 aligned sequences were randomly selected as input into the quantitative insights into microbial ecology (QIIME) process. The output of this process included an OTU table, phylogenetic tree, representative sequences, taxa summary charts and alpha rarefaction curves. The OTU table, phylogenetic tree and the set of representative sequences were the main input files for the R package Phyloseq (McMurdie and Holmes, 2013), which was used to perform various analyses. The OTU table, formatted as a biological observation matrix (BIOM) file (McDonald et al., 2012), contained the 16S OTUs with their corresponding abundance scores (counts of taxa) on a per-sample basis.

Species (taxa) richness of the samples was measured in order to investigate how the number of species (taxa) varied across the sample sources. Three diversity indices were used for this purpose: the Chao1 (Chao, 1984), Shannon (Molles, 2013; Tuomisto, 2010) and Inverse Simpson indices, the last of which is derived from the Simpson index (Simpson, 1949; Southwood and Henderson, 2009). The Chao1 index estimates the true species diversity of a sample. The Shannon index quantifies the uncertainty in predicting the species identity of an individual that is taken at random from the dataset. The Inverse Simpson index indicates the effective number of species that is obtained when the weighted arithmetic mean is used to quantify the average proportional abundance of species in the dataset.
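A minimal Python sketch of the three indices is given below, computed from a vector of per-taxon counts for one sample. It assumes the bias-corrected form of Chao1 and the natural logarithm for the Shannon index; the source does not state which variants were used, so these choices, the function names and the example counts are illustrative assumptions rather than the study's implementation.

```python
import math
from collections import Counter


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = [c for c in counts if c > 0]
    freq = Counter(observed)
    f1, f2 = freq[1], freq[2]  # numbers of singleton and doubleton taxa
    return len(observed) + f1 * (f1 - 1) / (2 * (f2 + 1))


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i), using the natural logarithm."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def inverse_simpson(counts):
    """Inverse Simpson index 1 / sum(p_i ** 2)."""
    total = sum(counts)
    return 1 / sum((c / total) ** 2 for c in counts if c > 0)


# Hypothetical OTU counts for two samples.
even_sample = [10, 10, 10, 10]
uneven_sample = [1, 1, 2, 5]
```

For a perfectly even sample of four taxa, the Inverse Simpson index equals four and the Shannon index equals ln 4; singletons and doubletons in the uneven sample push Chao1 above the observed richness.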

Public health hazard assessment using 16S sequences

Initially, a multivariate analysis using canonical correspondence analysis (CCA) was applied to all the 16S metagenomes. This was done in order to investigate whether the 16S sequence abundance scores could be used to determine similarities and/or differences among metagenomes of different origins. Thereafter, the public health hazard associated with drinking water supplied at the campgrounds was assessed using 16S taxa in two ways. Figure 6.1 is a schematic representation of the procedure used to assess the public health significance of the 16S metagenomes found at the campgrounds. The first approach was based on taxa belonging to the Family Campylobacteraceae. This bacterial Family was chosen because it includes Campylobacter species, which are the leading cause of gastrointestinal illness in New Zealand (Environmental Science and Research, 2014). The Campylobacteraceae phylogenetic tree was extracted and overlaid with taxa abundance scores according to sample sources. This allowed for the visualisation of the phylogenetic relatedness of the Campylobacteraceae taxa from different sources. Then the Campylobacteraceae taxa abundance scores, per sample source, were used to calculate the proportional similarity index (PSI) and construct a tree using the neighbor-net algorithm (Bryant and Moulton, 2004). The second approach was based on taxa related to drinking water-associated bacterial pathogens recognised by the World Health Organization (WHO) (Table 2.1 on page 14). The 16S OTUs (both faecal and water) were queried for eight bacterial genera (Burkholderia, Campylobacter, Escherichia, Francisella, Legionella, Leptospira, Mycobacterium, Salmonella) in order to retrieve the related taxa. However, no taxa were retrieved by the Francisella and Salmonella genus queries. This could be that members of the Francisella and Salmonella


The sequences corresponding to the identified taxa were retrieved from the representative set and matched against the National Center for Biotechnology Information (NCBI)-nr database (Wheeler et al., 2000) in an automated process using the basic local alignment search tool (BLAST) via the internet. This allowed for the identification of taxa that have previously been reported elsewhere. Further, the retrieved taxa with the accompanying abundance scores were analysed using PSI. Inclusion of spurious sequences in the PSI analysis was minimised by setting the minimum number of sequences per taxon to twenty.

PSI is a measure of similarity that estimates the area of congruence between two frequency distributions (Feinsinger et al., 1981). PSI values range from zero to one, with zero indicating distributions with no common elements and one indicating identical distributions. The percentile method described by Efron and Tibshirani (1986) was employed to calculate the bootstrapped 95 % confidence intervals for PSI values using 2000 iterations. To demonstrate taxa dissimilarity (divergence) among metagenomes of different origins, values of 1 − PSI were used to construct a NeighborNet tree in SplitsTree 4.13.1.
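The PSI calculation and its percentile bootstrap interval can be sketched in Python as follows. The sketch uses the standard form PSI = 1 − 0.5 Σ|pᵢ − qᵢ| over taxa shared by position in two count vectors, and assumes that each source is resampled with replacement to its original total count. The exact resampling scheme used in the study is not stated, so that choice, the function names and the example counts are assumptions.

```python
import random


def psi(counts_a, counts_b):
    """PSI = 1 - 0.5 * sum(|p_i - q_i|), equivalently sum(min(p_i, q_i))."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    return 1 - 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(counts_a, counts_b)
    )


def resample(counts, rng):
    """Draw sum(counts) observations with replacement from the observed taxa."""
    draws = rng.choices(range(len(counts)), weights=counts, k=sum(counts))
    resampled = [0] * len(counts)
    for taxon in draws:
        resampled[taxon] += 1
    return resampled


def psi_percentile_ci(counts_a, counts_b, iterations=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for PSI between two sources."""
    rng = random.Random(seed)
    values = sorted(
        psi(resample(counts_a, rng), resample(counts_b, rng))
        for _ in range(iterations)
    )
    tail = (1 - level) / 2
    lower = values[round(tail * iterations)]
    upper = values[round((1 - tail) * iterations) - 1]
    return lower, upper


# Hypothetical abundance scores for the same two taxa in two sources.
water = [30, 70]
faecal = [50, 50]
dissimilarity = 1 - psi(water, faecal)  # the quantity used for tree building
```

Calling `psi_percentile_ci(water, faecal)` returns the lower and upper bounds of the interval; the 1 − PSI values for every pair of sources form the distance matrix from which a NeighborNet tree can be built.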

WGS sequence processing and analysis

Due to computer resource constraints, only a subset of 20 out of the 69 WGS metagenomes was processed, using a large computer server available at Massey University. Processing all 69 WGS metagenomes would have exceeded both the storage capacity and the computational capability of a desktop computer. The WGS sequences in each of the 20 samples were paired using FLASH to form 400–500 base pair overlapping sequences, which were quality trimmed using SolexaQA. The paired sequences were then matched against the NCBI-nr database using protein alignment using a DNA aligner (PAUDA) 1.0.1 (Huson and Xie, 2013) in order to assign taxonomic ranks. The functional content of the resultant metagenome files was analysed using the SEED classification system (Overbeek et al., 2005) within metagenome analyzer (MEGAN) 5.7.0 (Huson et al., 2007). This process resulted in the production of abundance scores for each functional factor on a per-sample basis.

Public health hazard assessment using WGS sequences

Since microbial community profiling of metagenomes was performed using 16S sequences, the WGS analysis was focused on the functional content. Figure 6.2 is a schematic representation of the procedure used to assess the public health significance of WGS metagenomes found at the campgrounds. Among the functional factors identified in the metagenomes, virulence factors were isolated and a tree was constructed based on their abundance scores using the neighbor-net algorithm within MEGAN. This was done in order to identify similarities and differences among metagenomes of different origins. The same process was repeated for antimicrobial/toxic compound resistance factors (a subset of virulence factors).

[Flow diagram: 16S rRNA gene metagenomes → CCA; Campylobacteraceae → phylogenetic tree, and 1 − PSI → NeighborNet tree; WHO pathogens → NCBI search, and 1 − PSI → NeighborNet tree]

Figure 6.1: Flow diagram showing how 16S rRNA gene metagenomes were analysed.

[Figure 6.2 flow diagram: WGS metagenomes → functional factors → virulence factors and resistance factors → abundance scores → NeighborNet trees]


6.3 Results

In document Estimating the public health risk associated with drinking water in New Zealand : a thesis presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy at Massey University (Page 157-162)