2 Methods and Materials
2.9 Whole transcriptome sequencing and analysis
RNA sequencing of the transcriptome is a recently developed method that enables accurate and complete transcript profiling of cells (Wang et al., 2009). The Whole Transcriptome Analysis Kit was used to convert the full set of RNA transcripts expressed within a cell into a cDNA library for sequencing analysis on the Applied Biosystems SOLiD™ Sequencing system, as described in the manufacturer’s instructions (Applied Biosystems, CA, USA). RNA sequencing was carried out by Dr James Colley and colleagues at Wales Gene Park (Cardiff University), supported by additional funding from the Tom Owen Scholarship fund.
SOLiD™ transcriptome sequencing comprised 3 stages: i. sample preparation, ii. substrate preparation, and iii. barcoding on the SOLiD™ analyser. Each stage was carried out as follows:
I. Sample preparation. A mate-paired library was created by shearing sample DNA to a specified size and ligating adapters to the ends to produce 2 DNA fragments at a known distance apart in the target sample. The resultant mate-paired library comprised millions of unique molecules which represented the entire target sequence.
II. Substrate preparation. Mate-paired libraries underwent emulsion PCR to hybridise libraries to beads; libraries were then clonally amplified. Bead enrichment was carried out to capture beads containing amplified library templates, which were then covalently attached to a glass slide.
III. Barcoding on SOLiD™ analyser. Template beads bound to the glass slide were combined with universal sequencing primer, ligase, and a pool of fluorescently labelled probes (4 dyes, each representing 4 of the 16 possible dinucleotide sequences). A complementary probe was hybridised to the template sequence and ligated, after which fluorescence was measured and the dye cleaved off to allow further reactions. A new primer was then hybridised, off-set by 1 base, and the ligation cycle repeated; this primer reset process was repeated for 5 rounds so that each base was measured twice, in cycles separated by several steps, to increase sequence accuracy.
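The dinucleotide encoding described above can be illustrated with a minimal Python sketch. It assumes the standard two-base colour code, in which each base is given a two-bit value and the colour of a dinucleotide is the XOR of the two values, so that each of the 4 colours covers 4 of the 16 dinucleotides. The function names and the numbering of the colours (0 to 3) are illustrative only and are not taken from the instrument software.

```python
# Two-bit value assigned to each base (assumed standard two-base colour code).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {value: base for base, value in BASE_CODE.items()}


def encode_colours(seq):
    """Return one colour call (0-3) per overlapping dinucleotide in seq."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colours(first_base, colours):
    """Rebuild the base sequence from a known first base and the colour calls."""
    bases = [first_base]
    for colour in colours:
        # XOR is its own inverse, so the previous base and the colour give the next base.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ colour])
    return "".join(bases)


colours = encode_colours("ATGGCA")
decoded = decode_colours("A", colours)
```

Because every base contributes to two adjacent colour calls, a single erroneous colour call is inconsistent with its neighbours, which is the basis of the dual measurement referred to above.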
The resultant raw sequence data underwent down-stream analysis in conjunction with Dr Peter Giles and Dr Kevin Ashelford at Wales Gene Park (Cardiff University), according to the flow chart illustrated in Figure 2.10.
2.9.1 Statistical and data analysis
2.9.1.1 Quality control of RNA and sequence data
Concentration and quality of cell line RNA extracts were assessed using a Nanodrop 2000 spectrophotometer and Agilent to ensure RNA was suitable for SOLiD™ sequencing, as previously described. Data yield and mapping success of the resulting SOLiD™ sequence data were determined for all samples to ensure transcript data quality was sufficient for down-stream gene expression analysis.
2.9.1.2 Quantification of gene expression
RPKM values (Reads Per Kilobase of exon model per Million mapped reads) (Mortazavi et al., 2008), the common method for quantifying expression levels in RNA-sequence datasets, were calculated for individual transcripts using in-house software. To define the transcript locations used in the analysis, the RefSeq gene model RefGene, as provided by the UCSC human reference hg19 site, was used along with the gene model for HPV16 downloaded from NCBI (accession no: NC_001526). Integrative Genomics Viewer (IGV) was used to visualise transcript reads mapped to these reference genomes.
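As an illustration of the RPKM definition, the calculation for a single transcript can be sketched in Python as follows. This is not the in-house software used in this study; the function name and the example values are hypothetical.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # 1e9 combines the per-kilobase (1e3) and per-million-reads (1e6) scaling.
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical transcript: 500 reads on a 2,000 bp exon model,
# in a sample with 20 million mapped reads.
example = rpkm(500, 2000, 20_000_000)
```

Scaling by exon length and by the total number of mapped reads allows expression values to be compared between transcripts of different length and between samples sequenced to different depths.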
Figure 2.10. Summary of SOLiD™ sequencing and data analysis.
Clonal cell line total RNA extracts underwent SOLiD™ Whole Transcriptome sequencing and the resultant raw data were analysed. Raw data in the form of transcript reads underwent quality control (QC) to ensure data were suitable for down-stream analysis. Transcripts were then mapped to the human and HPV genomes; this provided an estimation of mapping yield and success to determine whether sequence data were suitable for further down-stream analyses. Sequence data were then subjected to i. gene expression analysis, and ii. differentially expressed gene (DEG) analysis followed by over-representation analysis (ORA) against gene ontology (GO) annotations. Gene expression was assessed by converting sequence data to reads per kilobase per million mapped reads (RPKM values), the common method for quantifying gene expression in this instance. Transcripts were mapped to the RefSeq gene model RefGene, as provided by the UCSC human reference hg19 site, along with the gene model for HPV16, downloaded from NCBI (NC_001526), and visualised through IGV. As part of DEG analysis, p-values were corrected for multiple testing using the FDR method and a cut-off of p<0.05 was applied to retain transcripts that differed significantly between 2 data sets with a high degree of confidence. GO ORA was then applied to the significantly differentially expressed (p<0.05) transcripts to sort them into GO categories; this identified ontology groups in which a number of genes changed in expression, allowing sets of differentially expressed genes that map to the same category to be identified.
2.9.1.3 Quality control of transcript count data
To check for any issues with the underlying transcript data distributions for each sample, the smallest value, lower quartile, median, upper quartile, and largest value were calculated and illustrated in a box plot using R statistical software. MvA plots were also produced using R statistical software to check each sample for irregularities in transcript data in pairwise comparisons with the other samples.
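The quantities underlying these two plots can be sketched in Python as follows; the analysis itself was carried out in R. The sketch assumes Tukey's hinges for the quartiles and a pseudocount of 1 added before taking logarithms, neither of which is specified above, and the function names are illustrative.

```python
import math
from statistics import median


def five_number_summary(values):
    """Minimum, lower hinge, median, upper hinge and maximum (Tukey's hinges)."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2  # both halves include the median when n is odd
    return xs[0], median(xs[:half]), median(xs), median(xs[n - half:]), xs[-1]


def mva_values(counts_a, counts_b, pseudocount=1.0):
    """Per-transcript M (log2 ratio) and A (mean log2 count) for two samples."""
    points = []
    for a, b in zip(counts_a, counts_b):
        log_a = math.log2(a + pseudocount)  # pseudocount avoids log2(0); an assumption
        log_b = math.log2(b + pseudocount)
        points.append((log_a - log_b, (log_a + log_b) / 2))
    return points


summary = five_number_summary([1, 2, 3, 4, 5])
points = mva_values([3], [1])
```

For two comparable samples, the M values are expected to scatter around zero across the range of A; a systematic drift away from zero indicates an irregularity in one of the samples.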
2.9.1.4 Identification of differentially expressed genes (DEG) using edgeR
To identify genes differentially expressed in response to CDV treatment, edgeR analysis (Robinson et al., 2010) was carried out on normalised transcript count data. The resultant p-values were corrected for multiple testing using the false discovery rate (FDR) method (Benjamini and Hochberg, 1995). This provided a list of significant transcripts found to be over- or under-expressed after treatment with CDV. Heat maps of significant transcripts were constructed using R statistical software to reveal patterns within the dataset that could then be visualised more easily.
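The FDR correction step can be illustrated with a minimal Python sketch of the Benjamini and Hochberg (1995) adjustment. It is shown only to make the procedure explicit and is not the code used in this study; the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four transcripts.
raw = [0.001, 0.02, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
```

Each raw p-value is multiplied by the number of tests and divided by its rank, so the correction is less severe than a Bonferroni adjustment while still controlling the expected proportion of false positives among the transcripts called significant.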
2.9.1.5 Over representation analysis (ORA) of gene transcripts against Gene Ontology (GO) annotations
GO ORA was carried out using the goseq method (Young et al., 2010c) to standardise the representation of significant gene transcripts using a controlled vocabulary of terms. Transcripts were grouped into associated functions (GO ID and term), which were then categorised under 3 general themes: 1. Biological processes (BP), 2. Molecular functions (MF), and 3. Cellular component (CC). GO ORA of the top 500 significant gene transcripts was carried out against GO annotations and the resulting p-values were corrected for multiple testing using the FDR method.
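The principle of over-representation analysis can be sketched in Python with a simple hypergeometric test. This is a simplification and not the goseq method itself, which additionally accounts for transcript length bias in RNA-sequence data; the function name and the example counts are hypothetical.

```python
from math import comb


def ora_p_value(n_universe, n_in_category, n_selected, n_overlap):
    """Upper-tail hypergeometric probability of at least n_overlap category genes."""
    upper = min(n_in_category, n_selected)
    total = comb(n_universe, n_selected)
    # Sum the probability of every overlap at least as large as the one observed.
    tail = sum(
        comb(n_in_category, k) * comb(n_universe - n_in_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 10 genes in total, 4 annotated to a GO term,
# 3 genes selected as significant, 2 of which carry the annotation.
p = ora_p_value(10, 4, 3, 2)
```

A small p-value indicates that the GO term is annotated to more of the selected transcripts than would be expected by chance; the p-values for all terms tested would then be FDR-corrected as described above.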