Functional genomics and data analysis methods

Chapter I Introduction

1.10 Functional genomics and data analysis methods

1.10.1 Microarray and next generation sequencing (NGS)

The transcriptome encompasses the whole set of transcripts in a cell or tissue. This includes protein-coding RNA (mRNA) as well as noncoding RNA such as rRNA, tRNA, lncRNA and miRNA. The key goals of transcriptomics are to catalogue all species of transcripts, to determine the transcriptional architecture of genes with respect to their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications, and to measure the varying expression levels of each transcript during development and under different conditions. The transcriptome reflects the genes that are being actively expressed at any given time in cells or tissue. Deciphering the transcriptome is therefore of central importance for understanding molecular mechanisms and signalling pathways. Unravelling the transcriptome can also be a valuable way to trace phylogenetic relationships between individuals and to discover biomarkers (Li et al., 2017).

Several different technologies have been used over the past years to measure gene expression. Recently, advances in genome sequencing have enabled high resolution gene expression analysis. Microarray technology, which is based on hybridisation, allows RNA samples to be interrogated by binding to a chip that carries single-stranded DNA molecules, so-called probes. RNA is extracted from cells or tissue, reverse transcribed and then labelled with a fluorescent dye. Sequences complementary to a probe hybridise to it, so that gene expression can be measured optically from the amount of fluorescence associated with each probe, with multiple samples processed in tandem (Allison et al., 2006; Tarca et al., 2006).

Platforms such as Illumina can simultaneously probe for over 47,000 gene transcripts. However, despite being a powerful and cheap option, microarrays have several limitations. One of the main flaws is background noise from non-specific binding of cDNAs that are only partially complementary to the probe, which results in unreliable expression measurements. Similarly, comparison between different transcripts on the same microarray can be imprecise, so the use of microarrays is restricted to detecting differential expression of the same probe target between different samples (Marioni et al., 2008). Furthermore, microarrays depend on existing knowledge of the sequences of interest, because probe sequences must be pre-specified.

RNA sequencing (RNA-Seq) is a more recent technology which provides greater resolution and also allows mRNA splice variant analysis (Qian et al., 2014). It is gradually replacing microarrays for measuring gene expression levels, and exon arrays in alternative splicing analyses (Wang et al., 2008). Early sequence-based approaches applied Sanger sequencing technology, which was low throughput, costly, error-prone and generally not very quantitative. To overcome these limitations, tag-based methods were developed, such as serial analysis of gene expression (SAGE) (Harbers and Carninci, 2005; Velculescu et al., 1995), cap analysis of gene expression (CAGE) (Kodzius et al., 2006; Shiraki et al., 2003) and massively parallel signature sequencing (MPSS) (Brenner et al., 2000; Peiffer et al., 2008; Reinartz et al., 2002). Although these methods are high throughput and provide precise digital gene expression levels, most of them are still based on Sanger sequencing, and a substantial proportion of the short tags cannot be mapped uniquely to the reference genome.

Only in recent times, with the emergence of next generation sequencing (NGS) technology, can we exploit the full potential of RNA-Seq. This method has already been used to map and quantify transcriptomes of Saccharomyces cerevisiae, Schizosaccharomyces pombe, Arabidopsis thaliana, mouse and human cells (Cloonan et al., 2008; Lister et al., 2008; Marioni et al., 2008; Morin et al., 2008; Mortazavi et al., 2008; Nagalakshmi et al., 2008; Wilhelm et al., 2008).

One of the main advantages of RNA-Seq is its very low, if any, background signal, as sequence reads can be mapped unambiguously to unique regions of the genome. There is no upper limit of quantification, which correlates with the number of sequences obtained. Consequently, transcripts can be detected over a large dynamic range of expression levels (Wang et al., 2009; Westbrook and Lucks, 2017). In contrast, microarrays lack sensitivity at very low or very high levels of expression and therefore have a smaller dynamic range. RNA-Seq has been shown to quantify expression levels with high accuracy when compared with quantitative PCR (qPCR) (Nagalakshmi et al., 2008) and with spike-in RNA controls of known concentration (Mortazavi et al., 2008). RNA-Seq has also shown high levels of reproducibility for both technical and biological replicates (Cloonan et al., 2008). Another advantage is that RNA-Seq requires less input RNA, because there are no cloning steps and, with single-molecule platforms such as Helicos, no amplification step.

The vast number of publications in high profile journals highlights the popularity of this technique (Parkinson et al., 2009). It allows research groups to investigate aspects that were not previously accessible with microarrays, such as allele-specific expression and the identification of previously unknown transcribed regions (Montgomery et al., 2010; Trapnell et al., 2010). It is possible to examine not only gene expression but also alternative splicing (Pan et al., 2008), novel transcript expression (Guttman et al., 2010), allele-specific expression (Degner et al., 2009), gene fusion events (Edgren et al., 2011) and genetic variation. As RNA-Seq is still a relatively novel technique, there is no common gold standard for analysis and there are no standard pipelines yet, and experimental and methodological biases exist (Hayden, 2012). In addition, because of the large volume of data generated, specialised algorithms and more powerful servers are required to analyse the data properly (Pop and Salzberg, 2008).

The levels of different RNA species in a cell at any given time point are controlled by regulatory systems that feed back on each other. These systems allow cells to react to environmental changes and to maintain expression patterns specific to the particular cell type. They comprise (1) the regulation of the timing and rate of transcription initiation and elongation, (2) the regulation of the processing of transcripts, (3) the regulation of the rate of transcript degradation and (4) the post-transcriptional modification of transcripts (Heyn et al., 2015).

1.10.2 RNA sequencing

The main steps in an NGS RNA-Seq workflow are as follows. RNA is first converted into a library of cDNA fragments, by fragmenting either the RNA or the cDNA, and adapters are added to the 5′ and 3′ ends of each fragment. The adapters contain functional elements that allow sequencing, including an amplification element and the primary sequencing site. Adapter-ligated fragments are then PCR amplified. Next, the cDNA library is analysed by NGS, which yields short sequences corresponding to one end (single-end sequencing) or both ends (paired-end sequencing) of each fragment, typically 30-400 bp in length. The required sequencing depth depends on the techniques with which the output data will be analysed. Single-end sequencing, in which cDNA is sequenced from one end only, is cheaper and faster (1% of Sanger sequencing). With paired-end methods, cDNA is sequenced from both ends, which makes the approach more costly and more laborious (Sengupta et al., 2011).

On the sequencer, double-stranded molecules are denatured into single-stranded molecules and loaded into a flow cell, where surface-bound oligos complementary to the library adapters capture the fragments. Bound fragments are amplified by bridge amplification to create clusters of identical molecules, which serve as the templates for sequencing. Sequencing primers are then added, and the clusters are reverse complemented at the same time. During each sequencing cycle, one fluorescently labelled nucleotide is added to each growing complementary strand. Each nucleotide type carries a different dye, and a laser is used to identify the location and identity of the nucleotide incorporated into each cluster. The fluorescent dye is then removed together with the terminating group, and the process is repeated for the desired number of cycles, usually 30 to 200.
At the end of the run, the output is a series of images in which each spot represents a cluster and its colour indicates the base type (Figure 1.15). The resulting nucleotide sequences are usually stored in FASTQ file format. Once the reads have been obtained, gene expression levels can be analysed by aligning them to a reference genome. After alignment, a number of analysis options are available, such as single nucleotide polymorphism (SNP) and insertion-deletion (indel) identification, read counting for expression quantification, phylogenetic or metagenomic analysis and more.
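
As a minimal sketch of the FASTQ format mentioned above, the following Python snippet reads the four lines that make up each record (header, sequence, separator and quality string) and decodes the quality string into Phred scores. It assumes the common Phred+33 encoding and unwrapped four-line records. The function name `parse_fastq` and the example read are illustrative assumptions and not part of the pipeline used in this project.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per read.

    Assumes Phred+33 quality encoding unless another offset is given.
    """
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Hypothetical single-read example held in memory instead of a file.
example = io.StringIO("@read_1\nGATTACA\n+\nIIIIII#\n")
for record in parse_fastq(example):
    print(record.name, record.sequence, record.qualities)
```

A Phred score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so low scores flag bases that are commonly trimmed or filtered before alignment.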

Figure 1.15. Schematic overview of RNA sequencing.

RNAs are first transformed into a library of cDNA fragments through either RNA fragmentation or DNA fragmentation. Sequencing adaptors (blue and orange) are then added to each cDNA fragment and a short sequence is obtained from each cDNA using high-throughput sequencing technology. The resulting sequence reads are aligned with the reference genome or transcriptome. This is then used to generate a base-resolution expression profile for each gene (Wang et al., 2009).
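
The base-resolution expression profile described in the caption can be illustrated, under simplifying assumptions, as a per-base count of the reads that overlap each position of a gene. The sketch below assumes that alignments are already available as 0-based, half-open (start, end) intervals on the same chromosome as the gene, and it ignores strand and spliced reads. The function name `coverage_profile` and all coordinates are hypothetical.

```python
from typing import Iterable, List, Tuple


def coverage_profile(
    alignments: Iterable[Tuple[int, int]],
    gene_start: int,
    gene_end: int,
) -> List[int]:
    """Return the read depth at every base of [gene_start, gene_end)."""
    length = gene_end - gene_start
    # Difference array: +1 where a read starts, -1 just after it ends.
    diff = [0] * (length + 1)
    for read_start, read_end in alignments:
        lo = max(read_start, gene_start)   # clip the read to the gene
        hi = min(read_end, gene_end)
        if lo >= hi:                       # read does not overlap the gene
            continue
        diff[lo - gene_start] += 1
        diff[hi - gene_start] -= 1
    # A running sum of the difference array gives the depth per base.
    depth = []
    running = 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


# Hypothetical gene spanning positions 100-110 and four aligned reads.
reads = [(98, 104), (102, 108), (103, 106), (120, 130)]
print(coverage_profile(reads, 100, 110))
```

The difference-array approach touches each read once and each base once, which matters when millions of reads are summarised; dedicated alignment and coverage tools would normally be used for real data.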

RNA-Seq is a valuable tool for understanding transcriptomic dynamics during development and normal physiological changes, and for analysing and comparing diseased and normal tissues, for example cancerous and normal cells. In this project, RNA-Seq will be used to study drug resistance mechanisms of cancer cells by comparing the gene expression profiles of drug-resistant cancer cell populations with those of their parental lines.
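
To make the idea of comparing expression profiles concrete, the following sketch computes log2 fold changes between a resistant and a parental sample from raw read counts. It is an illustration only: the counts and gene names are made up, normalisation is a simple counts-per-million scaling with a pseudocount, and no replicates or statistical testing are included, all of which a real differential expression analysis would require.

```python
import math
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw read counts by library size (counts per million)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("Library contains no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_fold_changes(
    resistant: Dict[str, int],
    parental: Dict[str, int],
    pseudocount: float = 1.0,
) -> Dict[str, float]:
    """Return log2(resistant / parental) for genes present in both samples.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes with no reads in one of the samples.
    """
    res_cpm = counts_per_million(resistant)
    par_cpm = counts_per_million(parental)
    shared = sorted(res_cpm.keys() & par_cpm.keys())
    return {
        gene: math.log2((res_cpm[gene] + pseudocount) / (par_cpm[gene] + pseudocount))
        for gene in shared
    }


# Hypothetical read counts; gene names are placeholders, not real results.
parental_counts = {"GENE_A": 500, "GENE_B": 1200, "GENE_C": 300}
resistant_counts = {"GENE_A": 2100, "GENE_B": 1150, "GENE_C": 40}

for gene, lfc in log2_fold_changes(resistant_counts, parental_counts).items():
    print(f"{gene}\t{lfc:+.2f}")
```

Positive values indicate higher expression in the resistant sample and negative values indicate lower expression, relative to the parental line.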

In document Drug resistance mechanisms of FGFR-driven cancers (Page 87-92)