RNAseq: data analysis - Bioinformatics methods

Chapter 2 Materials and Methods

2.3 Bioinformatics methods

2.3.5 RNAseq: data analysis

Preprocessing RNAseq reads

Before RNAseq analysis, a quality control step was carried out on the raw data to ensure it was of adequate quality and that downstream results would be reliable. Quality was checked using FastQC, a quality control tool for high-throughput sequence data. This program was used to determine the quality (Phred score) at each base position across all reads, highlighting possible sequencing errors, poor-quality reads and over-represented sequences indicating the presence of adaptors and primers.

Phred scores were used to assess data quality: a Phred score <Q20 indicated poor quality data, Q20 - Q30 indicated data of intermediate quality and a Phred score >Q30 indicated high quality data. Should quality dip below Q30 for a large proportion of the reads, preprocessing of the reads was considered necessary, ensuring only high quality reads were passed to the aligner or assembler. Where necessary, this preprocessing was carried out using Trimmomatic, which accepts paired-end reads.

$ java -jar ./trimmomatic.jar PE -threads 4 -phred33 -trimlog trimlog_File.txt -basein R1.fastq R2.fastq SLIDINGWINDOW:4:20 ILLUMINACLIP:overrepresented-adaptor-seqs.fa
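The Phred thresholds and the SLIDINGWINDOW:4:20 setting above can be illustrated with a minimal Python sketch. It assumes Phred+33 encoded quality strings (as implied by -phred33) and uses a simplified trimming rule, cutting at the start of the first window whose mean quality falls below the threshold. The function names are hypothetical and this is not Trimmomatic's own implementation.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q into a base-call error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def quality_band(q):
    """Classify a score with the thresholds used in the text (<Q20, Q20-Q30, >Q30)."""
    if q < 20:
        return "poor"
    if q <= 30:
        return "intermediate"
    return "high"


def sliding_window_cut(scores, window=4, min_mean=20):
    """Return how many leading bases to keep.

    Simplified rule (an assumption): scan from the 5' end and cut at the start
    of the first window whose mean quality is below min_mean.
    """
    for start in range(0, len(scores) - window + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / window < min_mean:
            return start
    return len(scores)


# Toy read: eight high-quality bases ('I' = Q40) followed by four poor ones ('#' = Q2).
scores = phred33_scores("IIIIIIII####")
keep = sliding_window_cut(scores)  # 7 bases are retained under this simplified rule
```

A score of Q20 corresponds to an error probability of 1 in 100 and Q30 to 1 in 1000, which is why Q30 is commonly treated as the boundary for high quality data.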

Genome-guided RNAseq analysis

The inner mate pair distance is a metric required by TopHat and represents the expected distance between the inner ends of the two reads of a pair. To estimate it, a Bowtie run was first carried out on the data with the minimum and maximum insert size set to 0 and 500 respectively:

$ bowtie-build Boleracea.v1.cds/genome.fasta Boleracea

$ bowtie Boleracea -1 R1.reads.fastq -2 R2.reads.fastq -I 0 -X 500 -S

where -I and -X refer to the minimum and maximum insert size and -S indicates the output should be in SAM format. The insert size was calculated using Picard tools:

$ java -Xmx2g -jar picard-tools/CollectInsertSizeMetrics.jar INPUT=bowtiealignment.sam HISTOGRAM_FILE=InsertSizeMetricsHist.pdf OUTPUT=InsertSizeMetrics.txt

The inner mate distance was estimated by:

Inner mate distance = mean insert size - (2 × read length)
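As a worked example of the formula above, the following sketch computes the inner mate distance from a mean insert size and a read length. The input values are hypothetical and are not the values measured for this data set.

```python
def inner_mate_distance(mean_insert_size, read_length):
    """Inner mate distance = mean insert size - (2 x read length).

    The result is rounded to the nearest integer because TopHat's -r option
    takes a whole number. A negative result means the two reads overlap.
    """
    return round(mean_insert_size - 2 * read_length)


# Hypothetical inputs: a 250 bp mean insert size and 100 bp reads.
example = inner_mate_distance(250.0, 100)  # 50
```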

TopHat2 (v2.2.1) was used to align the reads to the reference B. oleracea (TO1000) genome (Trapnell et al., 2012a) using the following parameters:

$ tophat Boleracea.v1.genome.fasta sample1-1.fastq.gz sample1-2.fastq.gz -r 74 -i 50 -I 50000 -p 8 --no-mixed --transcriptome-index Boleracea.v1.cds.fasta

in which compressed fastq read files were aligned to the B. oleracea transcriptome with the following parameters: -r is the inner mate distance, as calculated above; -i is the minimum intron size; -I is the maximum intron size; -p is the number of threads; and the --no-mixed option prevents the reporting of alignments where only one read of a pair has aligned. The output was a .bam file containing the sequence alignment data in binary format, which was used in downstream steps.

Cufflinks (v2.2.1) was used to assemble the transcriptome, including novel transcripts, producing a gtf/gff (Gene Transfer Format/General Feature Format) file. Cuffmerge (v2.2.1), followed by the gffread function supplied with the Cufflinks software, was used to merge multiple gtf files and produce a multi-fasta file of all transcript sequences, as per the following commands, beginning with a sort of the .bam file:

$ samtools sort bamfile1.bam sortedbamfile1

which sorts the alignments by the leftmost coordinate. The transcriptome was assembled using Cufflinks and multiple gtf files were merged using Cuffmerge:

$ cufflinks -g Boleracea.v1.genes.gff3 -o sortedbamfile1.bam

$ cuffmerge -o out -g Boleracea.v1.genes.gff3 -s Boleracea.v1.genome.fasta gtf-to-merge.txt

in which the gtf-to-merge.txt file contains a list of all the gtf/gff files to be merged in this step, all located in the same directory.

$ gffread -w output.fasta -g Boleracea.v1.genome.fasta merged.gtf

where -w is the name of the file to write the output to and -g is the genome from which to extract the sequence data. This produced a multi-fasta file of all features present in the merged.gtf file, as determined from the B. oleracea genome sequence.

De-novo assembly of RNAseq reads

De-novo assembly of RNAseq reads was carried out using the Trinity software (Haas et al., 2013) on iPlant Collaborative cloud computing resources (Goff, 2011). Firstly, all pre-processed read files were concatenated in identical order:

$ cat S1_R1.fastq...Sx_R1.fastq > All.R1.fastq

$ cat S1_R2.fastq...Sx_R2.fastq > All.R2.fastq
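The order of concatenation matters because the two files are read in step, so the n-th read in All.R1.fastq must be the mate of the n-th read in All.R2.fastq. A minimal sketch of the same operation in Python follows. It builds both file lists from a single ordered list of sample names so the order cannot differ. The file naming pattern (sample_R1.fastq, sample_R2.fastq) and the function names are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def paired_file_lists(samples, directory="."):
    """Build R1 and R2 file lists from one ordered list of sample names.

    Assumes (hypothetically) that files are named <sample>_R1.fastq and
    <sample>_R2.fastq.
    """
    base = Path(directory)
    left = [base / f"{sample}_R1.fastq" for sample in samples]
    right = [base / f"{sample}_R2.fastq" for sample in samples]
    return left, right


def concatenate(paths, destination):
    """Append the contents of each file in paths, in order, to destination."""
    with open(destination, "wb") as out:
        for path in paths:
            with open(path, "rb") as source:
                shutil.copyfileobj(source, out)
```

Calling paired_file_lists once and passing the two lists to concatenate guarantees that both output files follow the same sample order.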

Read files were then normalised by k-mer coverage to reduce computational time:

$ Trinity normalize by k-mer coverage --left=./All.R1.fastq --right=./All.R2.fastq --seqType=fq --max_cov=30 --kmer_size=25 --max_pct_stdev=100

where --seqType indicates the type of input file (fastq), --max_cov is the targeted maximum coverage for reads, --kmer_size is the k-mer size and --max_pct_stdev is the maximum permitted standard deviation of k-mer coverage across a read, expressed as a percentage of the mean. The output was a normalised version of all the reads, reducing the memory and time taken to run the Trinity software in the subsequent step:

$ $TRINITY_HOME/Trinity --seqType=fq --left=./norm_R1.fastq --right=./norm_R2.fastq --outputAssemblyFasta=Trinity.fasta

The output of this was a de-novo assembly of all reads in a multi-fasta file format.
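The normalisation step that precedes assembly can be illustrated with the following simplified sketch, which is not Trinity's own code. It assumes that a read is kept with probability equal to the target coverage divided by the read's median k-mer coverage, and that reads whose k-mer coverage varies by more than a given percentage of the mean are discarded. The function names and toy reads are hypothetical, and single-end reads are used for brevity.

```python
import random
import statistics
from collections import Counter


def kmers(sequence, k):
    """Return every overlapping k-mer in a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def normalise_reads(reads, k=25, max_cov=30, max_pct_stdev=100, seed=0):
    """Down-sample reads from highly covered regions (simplified sketch)."""
    rng = random.Random(seed)
    # Count how often each k-mer occurs across the whole read set.
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    kept = []
    for read in reads:
        coverage = [counts[kmer] for kmer in kmers(read, k)]
        if not coverage:
            continue  # read is shorter than k
        mean_cov = statistics.mean(coverage)
        pct_stdev = 100 * statistics.pstdev(coverage) / mean_cov
        if pct_stdev > max_pct_stdev:
            continue  # uneven coverage along the read
        # Keep the read with probability max_cov / median coverage (capped at 1).
        keep_probability = min(1.0, max_cov / statistics.median(coverage))
        if rng.random() < keep_probability:
            kept.append(read)
    return kept


# Toy data: one read sequenced 100 times and one read sequenced once.
toy_reads = ["ACGTACGTAC"] * 100 + ["TTTTGGGGCC"]
normalised = normalise_reads(toy_reads, k=5, max_cov=10)
```

In this toy example the rare read is always retained, while most copies of the abundant read are dropped, which is the behaviour that reduces memory and run time in the assembly step.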

Removing redundant transcripts from multi-fasta file

CD-HIT-EST (Li and Godzik, 2006) was used to cluster highly similar sequences, reducing the size of large multi-fasta files. CD-HIT-EST was run with the following options:

$ cd-hit-est -i input.fasta -o output.fa -c 0.95

where -c is the sequence identity threshold at or above which similar sequences are collapsed into a cluster.

Counting RNAseq reads

Co-ordinate sorted .bam files were passed to HTSeq-count, which counts the number of reads aligning to each feature; here, each gene model was used as the feature.
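HTSeq-count resolves reads that overlap more than one feature according to a chosen mode. The union mode used in this section can be sketched as follows: a read is assigned to a feature only if the set of all features overlapping any position of the read contains exactly one feature. This is a simplified illustration on a single sequence that ignores strand. The coordinates, gene names and function names are hypothetical, and the special counter labels are modelled on those reported by htseq-count.

```python
from collections import Counter


def features_at(position, features):
    """Return the names of all features covering a position (inclusive ends)."""
    return {name for name, (start, end) in features.items()
            if start <= position <= end}


def count_union(reads, features):
    """Count reads per feature using a union-style overlap rule."""
    counts = Counter()
    for read_start, read_end in reads:
        overlapping = set()
        for position in range(read_start, read_end + 1):
            overlapping |= features_at(position, features)
        if len(overlapping) == 1:
            counts[next(iter(overlapping))] += 1
        elif not overlapping:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Two hypothetical gene models that overlap between positions 180 and 200.
genes = {"geneA": (100, 200), "geneB": (180, 300)}
reads = [(110, 150), (90, 120), (150, 190), (400, 450)]
counts = count_union(reads, genes)
# geneA: 2, __ambiguous: 1, __no_feature: 1
```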

$ samtools sort bamfile1.bam sortedbamfile1

$ htseq-count sortedbamfile1.bam Boleracea.v1.genes.gff3 -f bam -r pos -t gene_id -m union > counts.txt

where -f indicates the file type; -r indicates how the paired-end data are sorted, in this case by co-ordinate (as above); -t is the feature type to be used for counting, in this case gene_id; and -m is the method used to handle reads that overlap more than one feature, union being the preferred option.

Identifying differentially expressed genes using count data

DESeq2 (Love et al., 2014) was used for differential expression analysis of replicated RNAseq count data. DESeq2 models read counts as following a negative binomial distribution and uses Empirical Bayes shrinkage for dispersion estimation and fold change estimation.

Finally, a Wald test produces a p-value by comparing the beta estimate β_ir divided by its estimated standard error to a standard normal distribution.
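A Wald test of this kind is commonly computed by dividing a coefficient estimate by its standard error and comparing the resulting z-statistic with a standard normal distribution. The sketch below shows that arithmetic only. It is not DESeq2's implementation, the function name is hypothetical, and the input values are illustrative.

```python
import math


def wald_test(beta, standard_error):
    """Return the z-statistic and two-sided p-value for a coefficient estimate.

    For a standard normal variable Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    """
    z = beta / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative values: a log2 fold change estimate of 1.96 with standard error 1.0.
z, p = wald_test(1.96, 1.0)  # p is approximately 0.05
```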

2.3.6 Creating and querying a BLAST database

In document Transcriptional analysis of salt shock in Brassica oleracea (Page 58-62)