2. MATERIALS AND METHODS
2.4 A standard RNA sequencing workflow
The RNA-seq workflow, from sample preparation through to data analysis, enables rapid profiling and deep investigation of the transcriptome. RNA-seq involves the simultaneous execution of millions of sequencing reactions of relatively short read length (30–500 bp) in parallel, generating massive amounts of sequence data per run (Shendure and Ji, 2008). The term RNA-seq denotes an ever-expanding menagerie of protocols; nevertheless, they all share the same core steps: extracting cellular RNA, removing rRNAs, isolating the poly-A mRNA transcripts and converting this population of mRNA to a library of cDNA fragments, which are then sequenced (Kratz and Carninci, 2014). RNA-seq experiments must be analysed with robust, efficient and statistically principled algorithms (Trapnell et al., 2012; Trapnell et al., 2013). The direct product of an RNA-seq experiment is a large electronic file that contains millions of sequencing reads from each sample. The first step is to align the sequencing reads to the reference genome in order to determine where they originated. Because of the massive number of reads produced, specialized algorithms are needed for the alignment. These algorithms significantly increase alignment speed by indexing the reference sequence in a way that makes it possible to match the reads against the reference quickly (Langmead and Salzberg, 2012). Following read mapping, the data need to be converted into a quantitative measure of gene expression. Because the number of reads produced from an RNA transcript is a function of that transcript’s abundance, read density can be used to measure transcript and gene expression (Cloonan and Grimmond, 2008). Many different packages can be used for RNA-seq data analysis; however, the TopHat and Cufflinks protocol was used in this study.
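The idea of indexing the reference so that reads can be matched quickly can be illustrated with a toy example. The sketch below uses a simple k-mer lookup table with a seed-and-verify step; this is a deliberately simplified stand-in and not the FM-index (Burrows-Wheeler) approach used by Bowtie 2, and it handles exact matches only. The function names, the reference string and the read are hypothetical and are not taken from this study.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def find_exact_matches(read: str, reference: str, index: dict, k: int) -> list:
    """Return the reference positions at which the whole read matches exactly.

    The first k bases of the read are looked up in the index (the seed),
    and each candidate position is then checked against the full read.
    """
    if len(read) < k:
        return []
    candidates = index.get(read[:k], [])
    return [p for p in candidates if reference[p:p + len(read)] == read]


# Hypothetical toy data for illustration only.
reference = "ACGTACGTTAGCACGTTAGG"
index = build_kmer_index(reference, k=4)
hits = find_exact_matches("ACGTTAG", reference, index, k=4)
print(hits)
```

The index is built once and reused for every read, so each lookup inspects only a handful of candidate positions instead of scanning the whole reference.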
2.4.1 cDNA library preparation
The cDNA libraries were prepared using the TruSeq Stranded mRNA preparation kit (Low-Throughput protocol; Illumina) according to the manufacturer’s instructions. Briefly, 2 µg of total RNA from globin mRNA-depleted whole blood and placental tissue samples was used for polyA mRNA selection using polyT oligo-attached magnetic beads and two rounds of purification. During the second elution of the polyA RNA, the RNA was fragmented and primed for cDNA synthesis. cDNA was synthesized from the enriched and fragmented RNA using SuperScript II reverse transcriptase and random primers. The cDNA was converted into double-stranded DNA (dsDNA), which was used for library preparation. The overhangs on the dsDNA resulting from fragmentation were then converted into blunt ends. A single ‘A’ nucleotide was added to the 3’ ends of the blunt fragments to prevent them from ligating to one another during the adapter ligation reaction. A corresponding single ‘T’ nucleotide on the 3’ end of the adapter provides a complementary overhang for ligating the adapter to the fragment. Multiple indexing adapters were then ligated to the ends of the dsDNA, preparing the fragments for hybridization onto a flow cell in the HiSeq 2000 (Illumina). PCR was then used to selectively enrich the DNA fragments that had adapter molecules on both ends and to amplify the amount of DNA in the library (for the full protocol description, see Appendix E).
2.4.1.1 cDNA library QC
The quality and quantity of the sample libraries were assessed before sequencing. To achieve the highest quality data on Illumina sequencing platforms, it is important to create optimum cluster densities across every lane of the flow cell, which in turn requires accurate quantitation of the cDNA library templates. Library concentrations were quantified using the NanoDrop, and library size and purity (quality) were measured using the Agilent Bioanalyzer. For the Bioanalyzer measurement, 1 µl of each cDNA library was loaded onto a DNA-specific chip, the Agilent DNA 1000.
2.4.1.2 Normalization and Pooling of cDNA Libraries
RNA-seq protocols use an RNA fragmentation approach prior to sequencing to gain sequence coverage of the whole transcript. This means that long transcripts will have more reads mapping to them than short transcripts of similar expression level. For this reason, read counts need to be properly normalized to extract meaningful expression estimates (Mortazavi et al., 2008). Moreover, each sequencing run has its own variability, which influences the number of fragments mapped across samples. Hence, it is also necessary to normalize for each sequencing run, so that genes do not appear to be differentially expressed merely because more sequences are present in one condition than in another. One way of normalizing sequencing data is the reads per kilobase of transcript per million mapped reads (RPKM) metric, which normalizes a transcript’s read count by both its length and the total number of reads mapped in the sample. Similarly, the fragments per kilobase of transcript per million mapped fragments (FPKM) metric normalizes paired-end data (Oshlack et al., 2010).
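The RPKM definition above can be written out as a short calculation. The sketch below is a minimal illustration of the metric as defined by Mortazavi et al. (2008), not the estimation procedure implemented in Cufflinks; the function names and the example counts are hypothetical.

```python
def rpkm(read_count: float, transcript_length_bp: float, total_mapped_reads: float) -> float:
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads / (length in kb) / (mapped reads in millions)
         = reads * 1e9 / (length in bp * total mapped reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def fpkm(fragment_count: float, transcript_length_bp: float, total_mapped_fragments: float) -> float:
    """Same formula for paired-end data, counting fragments (read pairs) instead of reads."""
    return rpkm(fragment_count, transcript_length_bp, total_mapped_fragments)


# Hypothetical example: two transcripts expressed at the same level,
# one four times longer than the other, in a sample with 50 million mapped reads.
short_transcript = rpkm(read_count=500, transcript_length_bp=1000, total_mapped_reads=50_000_000)
long_transcript = rpkm(read_count=2000, transcript_length_bp=4000, total_mapped_reads=50_000_000)
print(short_transcript, long_transcript)
```

Although the longer transcript collects four times as many reads, both transcripts receive the same normalized value, which is the length correction described above.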
For each indexed cDNA library, 10 μl was normalized to 10 nM in the DCT (Diluted Cluster Template) plate using 10 mM Tris-HCl, pH 8.5 with 0.1% Tween 20. Each normalized sample library was then transferred from the DCT plate to one well of the PDP (Pooled DCT Plate), where the libraries to be pooled together were combined in equal volumes. The 24 indexed cDNA libraries were thus pooled in equal concentrations into six pools of four samples each (placenta_case; placenta_control; blood_case; blood_control) (Table 2.3).
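The dilution to 10 nM can be sketched as a short calculation. This is an illustration under stated assumptions and not the procedure recorded for this study: it assumes that the library concentration is available in ng/µl, that the mean fragment length is taken from the Bioanalyzer trace, and that the commonly used average mass of 660 g/mol per base pair applies. The function names and example values are hypothetical.

```python
def library_molarity_nm(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA library concentration from ng/µl to nM.

    Assumes an average mass of 660 g/mol per base pair.
    """
    if mean_fragment_bp <= 0:
        raise ValueError("mean fragment length must be positive")
    return conc_ng_per_ul / (660.0 * mean_fragment_bp) * 1e6


def diluent_volume_ul(stock_nm: float, stock_volume_ul: float = 10.0, target_nm: float = 10.0) -> float:
    """Volume of buffer to add so that the library reaches target_nm (C1*V1 = C2*V2)."""
    if stock_nm < target_nm:
        raise ValueError("library is already below the target concentration")
    final_volume_ul = stock_nm * stock_volume_ul / target_nm
    return final_volume_ul - stock_volume_ul


# Hypothetical library: 8.58 ng/µl with a mean fragment length of 260 bp.
stock_nm = library_molarity_nm(conc_ng_per_ul=8.58, mean_fragment_bp=260)
buffer_ul = diluent_volume_ul(stock_nm)  # buffer to add to 10 µl of library
print(round(stock_nm, 1), round(buffer_ul, 1))
```

Because every library is brought to the same molarity first, pooling equal volumes afterwards gives each sample an equal share of the lane.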
Table 2.3: Sample arrangement in the flow cell for sequencing. Each pool indicates the samples run in a single lane of the HiSeq 2000 sequencer. The indexes used in each pool are listed, as this is how the samples are recognized. Indexes used in each lane need to be compatible.

Pool    Sample ID  Tissue    Case/Control  Index  Index Sequence
Pool 1  1048b      Blood     Case          2      CGATGT
Pool 1  1107b      Blood     Control       7      CAGATC
Pool 1  1087p      Placenta  Control       14     AGTTCC
Pool 1  1054p      Placenta  Case          4      TGACCA
Pool 2  1067p      Placenta  Control       5      ACAGTG
Pool 2  1054b      Blood     Case          18     GTCCGC
Pool 2  1094b      Blood     Control       15     ATGTCA
Pool 2  1060p      Placenta  Case          16     CCGTCC
Pool 3  1048p      Placenta  Case          5      ACAGTG
Pool 3  1094p      Placenta  Control       15     ATGTCA
Pool 3  1086b      Blood     Case          12     CTTGTA
Pool 3  1067b      Blood     Control       19     GTGAAA
Pool 4  10225b     Blood     Case          2      CGATGT
Pool 4  1090b      Blood     Control       4      TGACCA
Pool 4  1090p      Placenta  Control       7      CAGATC
Pool 4  1086p      Placenta  Case          16     CCGTCC
Pool 5  1107p      Placenta  Control       12     CTTGTA
Pool 5  1060b      Blood     Case          6      GCCAAT
Pool 5  1087b      Blood     Control       13     AGTCAA
Pool 5  10276p     Placenta  Case          14     AGTTCC
Pool 6  1061p      Placenta  Control       6      GCCAAT
Pool 6  10276b     Blood     Case          13     AGTCAA
Pool 6  10225p     Placenta  Case          18     GTCCGC
Pool 6  1061b      Blood     Control       19     GTGAAA

2.4.2 Sequencing the cDNA libraries
Each of the six pools (containing four samples each) was sequenced (75 bp, paired-end) in a separate lane on the HiSeq 2000 platform in the same sequencing run for side-by-side comparison. This sequencing depth was expected to generate approximately 50,000,000 reads per sample. Before analysing the sequences generated and drawing biological conclusions from them, it is critical to evaluate the quality of the sequences as well as the overall sequencing performance. Therefore, before the sequencing reads are aligned to a reference genome, low-quality bases must be removed. Quality control (QC) of the sequences takes into account duplication rate, rRNA abundance, strand specificity, coverage continuity across all annotated transcripts and performance at the 5’ and 3’ ends. The Phred quality score is used to evaluate the quality of the sequencing, alongside the base content, the number of N bases (positions where no specific nucleotide was called) and the sequenced read lengths. Based on this type of analysis, bases with low sequencing quality should be trimmed to ensure that the sequencing data are of high quality. In this study, the program FastQC was used to check the quality of the high-throughput sequence data. FastQC produces several quality control plots that are important for evaluating the condition of the millions of raw sequences generated before any further analysis is done. If the quality is not optimal, sequence reads must be trimmed and filtered; otherwise the downstream analysis will not provide statistically relevant results (Niiranen, 2015). The FastQC reports for each sample were analysed in order to determine the quality of the generated sequences.
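The relationship between Phred scores and trimming can be made concrete with a small sketch. A Phred score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance of error. The example below assumes that quality strings use the Phred+33 ASCII encoding and applies a simple 3’-end trim at a threshold of Q20; both the encoding and the threshold are assumptions for illustration. It is not how FastQC (which reports quality but does not trim) or any particular trimming tool works internally, and the read shown is hypothetical.

```python
def phred_scores(quality_string: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q: int) -> float:
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def trim_3prime(sequence: str, quality_string: str, min_q: int = 20, offset: int = 33) -> tuple:
    """Remove bases from the 3' end until a base with quality >= min_q is reached."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: eight high-quality bases followed by two low-quality bases.
# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIIIII#!")
print(trimmed_seq, trimmed_qual)
```

Trimming only from the 3’ end reflects the typical decline in base quality towards the end of a read; production trimmers usually add sliding-window and minimum-length rules on top of this.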