Sequencing and bioinformatic analyses - Metatranscriptomic analysis of community structure and metabolism of the rhizosphere microbiome

2.5.1 Preparation of samples for sequencing

For 454 pyrosequencing, Rubicon-generated cDNA was submitted to TGAC as part of a Capacity and Capability Challenge (CCC) project. There it was assessed with a high-sensitivity Agilent bioanalyser to ensure that fragment size profiles were consistent across samples. Multiplexing and sequencing were carried out on a 454 GS FLX sequencer using Titanium chemistry (Roche). For Illumina HiSeq (Illumina, San Diego, CA, USA) sequencing, DNA and rRNA-depleted RNA samples were submitted to TGAC for library construction and sequencing with 100 bp paired-end reads, again as part of a CCC project.

2.5.2 Bioinformatic analysis of 454 pyrosequencing data

Sequences were quality filtered using standard 454 Newbler parameters during conversion of .fna to .fasta file formats. They were then de-multiplexed to provide individual .fasta files for each sample. The conserved tail generated by the Rubicon procedure for cDNA synthesis was removed using a Perl script that removed the first 22 bp of each read. The emulsion PCR step during library preparation for 454 sequencing has been shown to introduce a bias resulting in artificial replicate sequences, which can be filtered out (Gomez-Alvarez et al., 2009). However, due to the dominance of rRNA sequences and their similarity, particularly at their transcription start sites, the filtering step would likely have removed genuine biological replicates, which would have down-weighted abundant taxa. Therefore reads were used in downstream analyses without filtering artificial replicates, as in another metatranscriptomic study (Ottesen et al., 2011).
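The tail-trimming step described above can be illustrated with a minimal Python sketch. This is not the Perl script used in the study; the function names are hypothetical, and dropping reads that become empty after trimming is an assumption made here for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_leading_tail(records, tail_len=22):
    """Remove the first `tail_len` bases of each read.

    Assumption: reads with nothing left after trimming are dropped.
    """
    for header, seq in records:
        trimmed = seq[tail_len:]
        if trimmed:
            yield header, trimmed
```

Applied to a file, `trim_leading_tail(read_fasta(open(path)))` would yield the trimmed records one at a time, so large read files need not be held in memory.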

Read files were used as queries against a cleaned and de-replicated (95% identity) set of sequences in a single database derived from the small subunit (SSU) SILVA (Pruesse et al., 2007) and RDP (Cole et al., 2009) rRNA databases, using USEARCH in UBLAST mode (Edgar, 2010). An E-value cut-off of 1E-7 was applied and the top 100 hits were recorded in an output file; short reads (<10 bp) were discarded in the process. Output files were uploaded into MEGAN (Huson et al., 2007) using default parameters, except that Min. Support was set to 1 and Top Percent to 5.
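To make the role of the Top Percent setting concrete, the following Python sketch shows a simplified lowest-common-ancestor assignment for a single read. It is an illustration only and not MEGAN's implementation: hits are assumed to be (lineage, bit score) pairs with lineages written from root to leaf, and the function names are hypothetical.

```python
def top_percent_hits(hits, top_percent=5.0):
    """Keep hits whose bit score lies within `top_percent` % of the best score."""
    if not hits:
        return []
    best = max(score for _, score in hits)
    threshold = best * (1.0 - top_percent / 100.0)
    return [(lineage, score) for lineage, score in hits if score >= threshold]


def lowest_common_ancestor(lineages):
    """Return the longest root-to-leaf prefix shared by all lineages."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared
```

With Top Percent at 5, a read whose best hits fall in two genera of the same phylum is placed at the phylum rather than at either genus, which is why a stricter setting tends to give more specific assignments.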

To compare groups of samples, comparison files were generated in MEGAN for all relevant samples using absolute counts, and the numbers of assigned reads per taxon were extracted for different taxonomic levels. Read counts were normalised by expressing them as a percentage of the total number of reads assigned in MEGAN, minus any reads that were assigned to Viridiplantae. Means were calculated for each group of samples from the same environment, and differences between environments were tested for statistical significance using an unpaired t-test. Pair-wise comparisons were made between each of the plant rhizospheres and soil, and between the wild-type oat and the sad1 oat mutant. Statistically significant differences were further filtered using an abundance cut-off of 0.01% of assigned reads for the environment in which they were more abundant. For example, a taxon statistically more abundant in the wheat rhizosphere than in bulk soil would be ignored unless it contributed at least 0.01% of the reads assigned to the wheat rhizosphere community. Rarefaction analyses were performed separately on prokaryotes and eukaryotes at the phylum and genus levels for each sample using MEGAN. Data were extracted and absolute read numbers were calculated. Means for both the number of reads sampled and the number of taxa detected were generated for each group of samples, and then used to plot rarefaction curves.
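The normalisation and abundance cut-off described above can be sketched in Python as follows. The sketch assumes flat per-taxon counts at a single taxonomic level, with plant reads stored under the key "Viridiplantae"; the function names and data layout are hypothetical and do not reflect how MEGAN exports its tables.

```python
def normalise_to_percent(counts, excluded=("Viridiplantae",)):
    """Express counts as a percentage of assigned reads, ignoring excluded taxa."""
    total = sum(n for taxon, n in counts.items() if taxon not in excluded)
    if total == 0:
        raise ValueError("no assigned reads remain after exclusion")
    return {
        taxon: 100.0 * n / total
        for taxon, n in counts.items()
        if taxon not in excluded
    }


def passes_abundance_cutoff(mean_pct_a, mean_pct_b, cutoff_pct=0.01):
    """Apply the cut-off in whichever environment has the higher mean abundance."""
    return max(mean_pct_a, mean_pct_b) >= cutoff_pct
```

The significance test itself is not shown; an unpaired t-test on the per-sample percentages could be run with, for example, `scipy.stats.ttest_ind`, before applying `passes_abundance_cutoff` to the group means.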

Additional analyses were performed by Mark Alston (TGAC) as part of the bioinformatic support accompanying the CCC project agreement. Between-classes principal component analysis (PCA) was carried out using the R package ade4 (Dray and Dufour, 2007). Before analysis, the taxon abundance counts for each sample were normalised to 100,000 reads within MEGAN (Huson et al., 2007), and low-abundance taxa were removed if the average abundance across all the samples was < 0.01% or < 0.1%, depending on the taxonomic level being tested. PCAs were performed at both phylum and genus level for both prokaryotes and eukaryotes, and also at genus level for four major eukaryotic groups (Fungi, Nematoda, Amoebozoa, Alveolata).
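The pre-processing applied before PCA can be approximated with the short Python sketch below. The normalisation was done within MEGAN in the study, so this is only an illustration of the arithmetic: the function names are hypothetical, and any rounding MEGAN applies is not reproduced.

```python
def scale_to_depth(counts, depth=100_000):
    """Rescale one sample's counts so that they sum to `depth` reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no assigned reads")
    return {taxon: n * depth / total for taxon, n in counts.items()}


def filter_low_abundance(samples, min_mean_pct=0.01, depth=100_000):
    """Drop taxa whose mean abundance across all samples is below `min_mean_pct`.

    `samples` is a non-empty list of dicts already scaled to `depth` reads.
    """
    taxa = set().union(*(s.keys() for s in samples))
    keep = set()
    for taxon in taxa:
        mean_pct = sum(100.0 * s.get(taxon, 0.0) / depth for s in samples) / len(samples)
        if mean_pct >= min_mean_pct:
            keep.add(taxon)
    return [{taxon: s.get(taxon, 0.0) for taxon in sorted(keep)} for s in samples]
```

Passing `min_mean_pct=0.1` instead of the default gives the stricter of the two thresholds mentioned above.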

2.5.3 Analysis of Illumina HiSeq sequencing data

All samples were de-multiplexed and quality filtered as standard, and data were analysed by two different approaches, at the European Bioinformatics Institute (EBI, Hinxton, UK) and at TGAC. At TGAC, analyses were largely performed by Mark Alston as part of the bioinformatic support accompanying the CCC project agreement. Sequence data from DNA samples were analysed using MetaPhlAn (Segata et al., 2012) and MetaPhyler (Liu et al., 2010) to determine taxonomic composition based on protein coding genes. Data were also uploaded to MG-RAST (Meyer et al., 2008) to assign functional information based on the SEED database (Overbeek et al., 2005) and analysed using default parameters, i.e. an E-value cut-off of 1E-5, a minimum identity cut-off of 60%, and a minimum alignment length cut-off of 15 bp.
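The three default cut-offs listed above act jointly: a hit must satisfy all of them to be counted. MG-RAST applies these filters on the server, so the following Python sketch is purely illustrative; the `Hit` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one similarity hit."""
    read_id: str
    evalue: float
    identity_pct: float
    align_len: int


def passes_cutoffs(hit, max_evalue=1e-5, min_identity=60.0, min_align_len=15):
    """Return True only if the hit meets all three thresholds."""
    return (
        hit.evalue <= max_evalue
        and hit.identity_pct >= min_identity
        and hit.align_len >= min_align_len
    )
```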

For the RNA data, residual rRNA sequences were removed from samples in silico using SortMeRNA (Kopylova et al., 2012), and the number of copies of RIS recovered was determined using USEARCH with an identity cut-off of 1. Sequencing depth and transcriptional activity per gram of soil were then calculated (see 2.5.4). Non-rRNA reads were filtered using Sickle (https://github.com/najoshi/sickle) then analysed using MetaPhyler to determine taxonomic composition. A subset of the data (25 million reads, based on the lowest read count sample) was analysed using RAPSearch2 (Zhao et al., 2012), a reduced-alphabet BLAST-like algorithm, against the non-redundant nucleotide collection at the National Center for Biotechnology Information (NCBI). Output files were uploaded into MEGAN (Huson et al., 2007) using default parameters (min support = 5, min score = 50, top percent = 10) to visualise and compare samples based on taxonomic composition, SEED and KEGG (Kanehisa and Goto, 2000) assignments. Pair-wise comparisons were made between each plant rhizosphere and soil using unpaired t-tests with a 95% confidence interval. Some multiple comparisons were made using analysis of variance (ANOVA). In addition, all samples in full were uploaded to MG-RAST and analysed using default parameters. Multidimensional scaling analysis was performed in PRIMER6. Data were normalised to a percentage then square root transformed before a Bray-Curtis similarity matrix was generated and used to plot the data on x and y axes in Excel.

At EBI, a subset of reads (mean 92 million) was analysed using the EBI Metagenomics Portal courtesy of Peter Sterk. SeqPrep (https://github.com/jstjohn/SeqPrep) was used to merge mate pairs and perform additional quality filtering. The parameters used were as follows: -f -r -1 -2 -3 -4. If reads did not overlap, both reads were used in the analysis. Further filtering, including a 100 bp cut-off, was applied using Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic) with default parameters. Residual rRNA sequences were removed from the RNA sample in silico using rRNASelector (Lee et al., 2011). Non-rRNA reads were analysed by InterProScan 5 (Quevillon et al., 2005; Zdobnov and Apweiler, 2001) to generate InterPro and Gene Ontology (GO) assignments. Pair-wise comparisons were made between each plant rhizosphere and soil using unpaired t-tests with a 95% confidence interval.
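The data preparation used for the multidimensional scaling at TGAC (percentage normalisation, square root transformation, then Bray-Curtis similarity) can be sketched in Python as below. The study used PRIMER6 for this step; the function names here are hypothetical, and the similarity is expressed on a 0 to 100 scale on the assumption that this matches the convention of the software used.

```python
import math


def percent_sqrt(counts):
    """Convert counts to percentages, then take the square root of each value."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counts")
    return {taxon: math.sqrt(100.0 * n / total) for taxon, n in counts.items()}


def bray_curtis_similarity(a, b):
    """Bray-Curtis similarity (0 to 100) between two transformed samples."""
    taxa = set(a) | set(b)
    difference = sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in taxa)
    total = sum(a.get(t, 0.0) + b.get(t, 0.0) for t in taxa)
    if total == 0:
        raise ValueError("both samples are empty")
    return 100.0 * (1.0 - difference / total)
```

Computing `bray_curtis_similarity` for every pair of samples gives the similarity matrix from which the ordination is derived. The square root transformation reduces the weight of the most abundant taxa, so that rarer groups also contribute to the similarity.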

2.5.4 Calculation of sequencing depth and transcript abundances

The length of the RIS generated (967 bp), as determined by the Experion bioanalyser (see 2.3.2), allowed the sequence to be estimated, based on the number of base pairs downstream of the T7 promoter, which in turn allowed calculation of the molecular weight. This was used to determine the number of copies of RIS per µl of the stock solution, and thus how many copies were added to each RNA sample during extraction. Post-sequencing, USEARCH was used to determine the percentage of a subset of reads from each sample that matched the RIS sequence with 100% identity. This percentage was used to calculate the number of RIS sequences recovered in the whole sample. Sequencing depth was calculated using the following equation (Gifford et al., 2011):

(Standards recovered / Standards added) x 100%
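The chain of calculations leading to sequencing depth can be sketched in Python as follows. Two points are assumptions rather than details from the study: the molecular weight is approximated from length alone using a commonly quoted average for single-stranded RNA (320.5 g/mol per nucleotide plus 159), whereas the study calculated it from the estimated sequence, and the stock concentration is taken to be in ng per µl. All function names are hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def ssrna_molecular_weight(length_nt):
    """Approximate molecular weight (g/mol) of single-stranded RNA from its length.

    Assumption: average of 320.5 g/mol per nucleotide plus 159 for the 5' end.
    """
    return length_nt * 320.5 + 159.0


def copies_per_microlitre(conc_ng_per_ul, length_nt):
    """Number of standard molecules in 1 µl of stock solution."""
    grams = conc_ng_per_ul * 1e-9
    return grams / ssrna_molecular_weight(length_nt) * AVOGADRO


def recovered_in_sample(matches_in_subset, subset_size, total_reads):
    """Scale the matches found in a subset of reads up to the whole sample."""
    return matches_in_subset / subset_size * total_reads


def sequencing_depth_pct(standards_recovered, standards_added):
    """(Standards recovered / Standards added) x 100%."""
    return standards_recovered / standards_added * 100.0
```

For a 967-nucleotide standard, this approximation gives a molecular weight of about 310,000 g/mol, so 1 ng of standard corresponds to roughly 1.9 x 10^9 copies.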

The % non-rRNA in the samples was determined by Mark Alston at TGAC using SortMeRNA (Kopylova et al., 2012), and transcript abundance per sample was calculated using the following equation (Moran et al., 2013):

(Standards added / Standards recovered) x non-rRNA transcripts sequenced

This value was then divided by the total mass of input soil for each RNA extraction to obtain a value for transcripts per g soil.

Subsequent analyses provided numbers of reads matching particular protein-coding genes or taxonomic groups in a database. To convert these to a quantitative value of the number of transcripts per g, a modification of the above equation was applied as follows:

(Standards added / Standards recovered) x “specific protein coding transcript” sequenced

Again, this value was then divided by the total mass of input soil for each RNA extraction to obtain a value for transcripts per g soil.
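Both transcript abundance equations share the same form and differ only in which read count is scaled, so a single Python sketch covers them. The function names are hypothetical, and the numbers in the usage note are invented for illustration rather than taken from the study.

```python
def transcripts_in_sample(standards_added, standards_recovered, reads_of_interest):
    """(Standards added / Standards recovered) x reads of interest.

    `reads_of_interest` is either the total non-rRNA reads sequenced or the
    reads matching one specific protein-coding gene or taxonomic group.
    """
    if standards_recovered <= 0:
        raise ValueError("no standards recovered; abundance cannot be estimated")
    return standards_added / standards_recovered * reads_of_interest


def transcripts_per_gram(standards_added, standards_recovered,
                         reads_of_interest, soil_mass_g):
    """Divide the per-sample estimate by the mass of soil used for extraction."""
    if soil_mass_g <= 0:
        raise ValueError("soil mass must be positive")
    per_sample = transcripts_in_sample(
        standards_added, standards_recovered, reads_of_interest
    )
    return per_sample / soil_mass_g
```

With made-up inputs of 1e9 standards added, 1e4 recovered, 2e6 non-rRNA reads and 2 g of soil, `transcripts_per_gram(1e9, 1e4, 2e6, 2.0)` evaluates to 1e11 transcripts per g soil.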
