
CHAPTER 4 Changes in the wheat transcriptome during Pst infection

4.3.1 Processing of raw RNA-seq data from Pst infected wheat

The wheat transcriptome analysis was based on RNA-seq data generated by Dr. Diana Garnica (The Australian National University), who infected young wheat plants with PST-79 and harvested mRNA for RNA-seq at 0, 6 and 9 days after infection (dai), here labelled IT0, IT6 and IT9 respectively. These time points do not correspond to those used in the previous chapters because the data were generated independently. The first step in the pipeline (Figure 4.2) was to clean the raw sequence reads of adaptors and primers. Using the same software, I removed reads with low quality scores, that is, the per-base scores assigned when the nucleotide bases were called. For the IT0 and IT6 samples, about 84% of the input read pairs were recovered as paired-end reads, and 4-7% were recovered as single-end reads in each of the forward-only and reverse-only categories (Table 4.2).
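Quality filtering of this kind operates on the Phred scores stored with each read. As a minimal sketch, assuming Phred+33 encoded FASTQ quality strings, the Python below decodes the scores and cuts a read where the mean quality in a sliding window first drops below a threshold. The window size and threshold are hypothetical values chosen for illustration, not the settings used in Section 4.2.1, and the function is a simplification rather than a reimplementation of the trimming software.

```python
def phred33(quality_string):
    """Decode a Phred+33 FASTQ quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(sequence, quality_string, window=4, min_mean_q=15):
    """Cut the read at the first window whose mean quality is too low.

    window and min_mean_q are hypothetical defaults (assumption).
    Reads shorter than the window are returned unchanged.
    """
    quals = phred33(quality_string)
    for start in range(0, max(len(quals) - window + 1, 0)):
        win = quals[start:start + window]
        if sum(win) / window < min_mean_q:
            # Keep only the bases before the failing window.
            return sequence[:start], quality_string[:start]
    return sequence, quality_string


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
print(sliding_window_trim("ACGTACGT", "IIII####"))
```

A read whose trimmed length falls below a minimum would then be discarded; if its mate survives, that mate becomes one of the single-end reads reported in Table 4.2.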

For IT9, more than 80% of the read pairs were recovered as paired-end reads in only one of the three biological replicates, IT9_3; for IT9_1 and IT9_2 the figures were 69% and 67% respectively. The lower values for these two replicates were compensated by a higher number of single-end reads. Overall, 88-96% of the input read pairs yielded at least one surviving read after cleaning (Section 4.2.1): 67-87% were recovered as paired-end reads and the rest as single-end reads (Table 4.2). A single-end read arises when only one mate of a pair survives trimming, either the forward or the reverse read. Most of the single-end reads were forward reads, suggesting a possible sequencing bias. With the cleaned reads, I proceeded to the next step.

Table 4.2. Illumina RNA-seq data from Pst-infected wheat leaves at 0 (IT0), 6 (IT6) and 9 (IT9) days after infection. The values are the read pairs before trimming (input read pairs) and the reads surviving after trimming with Trimmomatic. Surviving reads are divided into paired-end reads (PE, both mates surviving) and single-end reads (SE, forward only or reverse only). Forward-only and reverse-only reads are the remaining mate of a broken pair in which only the forward or only the reverse read survived. Input read pairs are given in millions (M) and surviving reads as a percentage of the input read pairs. There are three biological replicates for each sample, indicated by the numbers 1-3.

Sample   Input read pairs (M)   Both surviving (PE) (%)   Forward only surviving (%)   Reverse only surviving (%)
IT0_1    13.6                   84.09                     6.91                         3.97
IT0_2    15.6                   83.64                     7.05                         4.06
IT0_3    17.2                   83.81                     7.09                         4.01
IT6_1    18.9                   84.54                     6.75                         3.83
IT6_2    19.9                   84.59                     6.65                         3.89
IT6_3    18.2                   84.18                     6.81                         3.98
IT9_1    116.6                  68.55                     14.75                        5.68
IT9_2    73.2                   66.91                     15.64                        5.63
IT9_3    16.1                   86.95                     5.47                         3.63
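The percentages in Table 4.2 can be converted into read totals with a little arithmetic. The sketch below assumes that every surviving pair contributes two reads and every forward-only or reverse-only survivor contributes one read; the function and variable names are illustrative and are not part of the original pipeline.

```python
# Rows copied from Table 4.2: sample -> (input pairs in millions,
# % both surviving, % forward only, % reverse only).
TABLE_4_2 = {
    "IT0_1": (13.6, 84.09, 6.91, 3.97),
    "IT9_1": (116.6, 68.55, 14.75, 5.68),
    "IT9_3": (16.1, 86.95, 5.47, 3.63),
}


def surviving_summary(input_pairs_m, both_pct, fwd_pct, rev_pct):
    """Return (% of input pairs with at least one surviving read,
    clean reads in millions)."""
    recovered_pct = both_pct + fwd_pct + rev_pct
    # Assumption: a surviving pair counts as two reads.
    paired_reads_m = 2 * input_pairs_m * both_pct / 100
    single_reads_m = input_pairs_m * (fwd_pct + rev_pct) / 100
    return recovered_pct, paired_reads_m + single_reads_m


for sample, row in TABLE_4_2.items():
    recovered, clean = surviving_summary(*row)
    print(f"{sample}: {recovered:.2f}% recovered, {clean:.1f} M clean reads")
```

Under this assumption the totals come out close to the "Total reads" column of Table 4.3 (about 183.7 M for IT9_1, for example); small differences are expected because the input read pairs are rounded to one decimal place.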

The next step in the pipeline was to remove the reads with homology to PST-79. For this step, I mapped the clean reads recovered in Table 4.2 to a PST-79 draft genome generated by Dr. Diana Garnica and Mr Will Jackson (unpublished, The Australian National University). As expected, the percentage of reads that mapped to PST-79 was almost zero in the three biological replicates of IT0 (Table 4.3). In the IT6 replicates, about 3-4 million reads per replicate (8-12% of the clean reads) mapped to the PST-79 genome. In the IT9 data, all three biological replicates contained very large numbers of reads that mapped to the PST-79 genome, but these samples were quite variable (17-66 million reads). The variation between these samples may reflect uneven infection, suggesting that the tissue sampled for IT9_3 and IT9_2 accumulated less fungal biomass than that for IT9_1. Alternatively, poor-quality reads in these replicates may have failed to align to the Pst genome. Overall, the number of mapped reads increased across the time points, in line with the expected increase in fungal biomass.

Table 4.3. Numbers and percentages of reads that mapped to the PST-79 draft genome and to the wheat transcriptome. The clean reads (total reads) from Table 4.2 were aligned to the PST-79 draft genome. The reads that did not map to the Pst genome (selected reads, listed as new input reads) were separated into paired-end and single-end reads, and the two read sets were mapped separately to the wheat transcriptome. Read numbers are given in millions (M). For each sample, there are three biological replicates, indicated by the numbers 1-3.

         Clean reads mapped against PST-79    Selected reads mapped to the wheat transcriptome
                                              Paired-end reads                         Single-end reads
Sample   Total reads (M)   Mapped reads (%)   New input reads (M)   Mapped reads (%)   New input reads (M)   Mapped reads (%)
IT0_1    24.48             0.01               22.91                 92.75              1.49                  89.84
IT0_2    27.93             0.01               26.18                 88.19              1.74                  90.48
IT0_3    30.67             0.02               28.76                 83.33              1.9                   84.99
IT6_1    33.88             11.87              19.15                 46.04              1.81                  87.91
IT6_2    35.81             8.4                28.46                 85.08              1.84                  87.08
IT6_3    32.61             11.58              27.05                 86.08              1.79                  87.47
IT9_1    183.7             35.99              101.05                76.48              16.53                 78.23
IT9_2    113.5             24.79              73.24                 79.46              12.15                 82.85
IT9_3    29.42             56.7               12.04                 66.94              0.7                   68.24
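Because Table 4.3 reports the PST-79 alignments as percentages, the absolute numbers of pathogen reads have to be derived from the totals. A minimal Python sketch of that conversion follows, using the IT6 and IT9 rows of the table; the names are illustrative only.

```python
# Sample -> (total clean reads in millions, % mapped to PST-79), from Table 4.3.
PST_MAPPING = {
    "IT6_1": (33.88, 11.87),
    "IT6_2": (35.81, 8.40),
    "IT6_3": (32.61, 11.58),
    "IT9_1": (183.7, 35.99),
    "IT9_2": (113.5, 24.79),
    "IT9_3": (29.42, 56.70),
}


def mapped_millions(total_m, mapped_pct):
    """Convert a mapped percentage into millions of mapped reads."""
    return total_m * mapped_pct / 100


pst_reads = {s: mapped_millions(*v) for s, v in PST_MAPPING.items()}
for sample, reads_m in pst_reads.items():
    print(f"{sample}: {reads_m:.1f} M reads mapped to PST-79")
```

This gives roughly 3-4 M reads for the IT6 replicates and about 17-66 M for the IT9 replicates. Absolute counts scale with sequencing depth, so the percentage column may be the fairer basis when replicates of very different size are compared.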

The reads that did not map to the PST-79 genome should have been derived from wheat mRNAs. I extracted the unmapped sequences from the alignments summarized in Table 4.3 and separated them into single-end and paired-end reads using SAMtools, bam2fastx and bedtools, as this separation is required by the alignment software. I first tried to map the RNA-seq reads to the wheat genome using STAR (Dobin et al., 2013), SOAP (Li et al., 2008) and TopHat. However, I did not have access to a computing platform that could support mapping to genomes larger than 4 Gb. As an alternative, I mapped the reads onto the wheat transcriptome, which is smaller than 1 Gb, using the Burrows-Wheeler Aligner (BWA). I aligned the paired-end and single-end reads separately because the software does not process both read types in the same run.

For the IT0 replicates, more than 80% of the paired-end and single-end reads mapped to the wheat transcriptome, with low variability between the biological replicates (Table 4.3). More than 80% of the paired-end reads from IT6_2 and IT6_3 mapped to the wheat transcriptome, while for IT6_1 this figure was only 46%. In contrast, 87-88% of the single-end reads from all IT6 replicates mapped to the wheat transcriptome. Finally, the percentage of paired-end and single-end reads that mapped to the wheat transcriptome in the IT9 datasets was very variable. The replicates IT9_1 and IT9_2 contained very large numbers of reads, but only 76-79% of the paired-end reads and 78-83% of the single-end reads mapped to the wheat transcriptome. For IT9_3, only 67% of the paired-end reads and 68% of the single-end reads mapped to wheat genes. This implies that the unmapped reads in Table 4.3 were not wheat reads, or that the wheat transcriptome is incomplete.
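The separation of unmapped reads into paired-end and single-end sets relies on the FLAG field that the SAM format attaches to every alignment record. The Python sketch below shows the bit tests involved. The bit values are those defined by the SAM format, but the routing policy (in particular, treating an unmapped read whose mate aligned to the pathogen as a single-end read) is an assumption made for illustration and may differ from what SAMtools, bam2fastx and bedtools were configured to do in this pipeline.

```python
# Bit values defined for the FLAG field in the SAM format.
PAIRED = 0x1          # read comes from a paired-end template
UNMAPPED = 0x4        # this read did not align
MATE_UNMAPPED = 0x8   # its mate did not align


def classify_unmapped(flag):
    """Decide where a record goes after alignment to the pathogen genome.

    Returns "paired" when both mates are unmapped, "single" for an
    unmapped read without an unmapped mate, and None when the read
    aligned (and is set aside as putative pathogen sequence).
    This routing is a hypothetical policy, not the author's script.
    """
    if not flag & UNMAPPED:
        return None
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "paired"
    return "single"


# 77 and 141: both mates unmapped; 4: unmapped single-end read;
# 133: unmapped read whose mate aligned; 99: properly aligned read.
for flag in (77, 141, 4, 133, 99):
    print(flag, classify_unmapped(flag))
```

Keeping the two classes in separate files matches the requirement that paired-end and single-end reads be aligned to the wheat transcriptome in separate runs.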

I used the paired-end and single-end reads that mapped to the wheat transcriptome from Table 4.6 to count the number of reads that mapped to each wheat gene, which provides a measure of gene expression. In this analysis I was unable to separate the homeologues of each gene because the alignment tool BWA does not distinguish between them. Consequently, the reads were dispersed between homeologues and alternative gene transcripts. I combined the counts from the single-end and paired-end results into a single table of read counts, which I used for differential gene expression analysis in the R programming environment. I first ran this analysis with the R package DESeq. However, the variance between biological replicates was very high (data not shown), so I changed to the R package edgeR, which allows more flexibility in setting the normalization parameters and thus in reducing variability between samples. I plotted the biological coefficient of variation (BCV), that is, the standard deviation of gene expression between biological replicates relative to the mean. This measure reduces the variation due to sample size and describes the dispersion of gene expression in each RNA sample. The BCV plot indicated that the biological replicate IT9_2 differed from the other two IT9 samples (Figure 4.3). To remove the source of this variability, I used the script provided by Dr. Sylvain Floret (see Section 4.2.1).
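Building one count table from the paired-end and single-end alignments can be sketched as follows. This Python example assumes that the two sets of counts are simply summed per gene for each sample; the gene identifiers and numbers are hypothetical, and the original analysis was carried out in R rather than Python.

```python
from collections import Counter


def combine_counts(paired_counts, single_counts):
    """Add per-gene read counts from the paired-end and single-end runs."""
    total = Counter(paired_counts)
    total.update(single_counts)  # Counter.update adds counts, it does not overwrite
    return dict(total)


# Hypothetical gene identifiers and counts for one sample.
paired = {"geneA": 120, "geneB": 0, "geneC": 35}
single = {"geneA": 10, "geneC": 5, "geneD": 2}
print(combine_counts(paired, single))
```

Genes seen in only one of the two runs are kept, so the combined table covers every gene that received at least one read from either read type.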

The filter removed 5-10% of the reads and retained a minimum of 61365 reads for each gene across the nine samples. This step may have removed homeologues, as BWA does not differentiate between them. I plotted the BCV of the filtered data (Figure 4.4). In the new plot the biological replicates lay closer together than in Figure 4.3 and the distance between samples was reduced; the plot also showed that BCV is very sensitive to low-abundance genes. The biological replicates for IT0 clustered together, as in Figure 4.3. The filtering reduced the variation between the replicates for the IT6 and IT9 samples, as these replicates clustered more closely than in Figure 4.3 and the scale of the graph was smaller (Figure 4.4). Because the normalized and filtered data were less variable than before, I proceeded to the next step in the pipeline.
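The general idea behind counts-per-million normalization followed by removal of low-count genes can be illustrated in a few lines of Python. This is only a sketch of the technique: the thresholds `min_cpm` and `min_samples` are hypothetical, the example counts are invented, and the script actually used is the one described in Section 4.2.1.

```python
def counts_per_million(counts):
    """Scale one sample's gene counts by its library size."""
    library_size = sum(counts.values())
    return {gene: 1e6 * n / library_size for gene, n in counts.items()}


def keep_expressed(samples, min_cpm=1.0, min_samples=3):
    """Return genes whose CPM reaches min_cpm in at least min_samples samples.

    min_cpm and min_samples are hypothetical thresholds (assumption).
    Every sample is expected to list the same genes.
    """
    cpm = [counts_per_million(c) for c in samples]
    genes = samples[0].keys()
    return [g for g in genes
            if sum(s[g] >= min_cpm for s in cpm) >= min_samples]


# Invented counts for three replicates, each with a library size of one million.
samples = [
    {"geneA": 500_000, "geneB": 499_999, "geneC": 1},
    {"geneA": 400_000, "geneB": 600_000, "geneC": 0},
    {"geneA": 450_000, "geneB": 549_998, "geneC": 2},
]
print(keep_expressed(samples, min_cpm=1.0, min_samples=3))
```

Dividing by library size makes samples of very different depth comparable, which matters here because the IT9 replicates differ several-fold in read number. Genes with very few reads are dropped because their between-replicate variation is dominated by sampling noise.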

Figure 4.3. Biological coefficient of variation plot for all biological replicates of IT0, IT6 and IT9. The variation was calculated with the counts for all genes in each sample. The dispersion is explained by two vectors.

Figure 4.4. Biological coefficient of variation plot of the samples after counts-per-million normalization and removal of genes with low read counts.

In document Understanding wheat stripe rust through studies on host and pathogen metabolism (Page 141-145)