CHAPTER 5 EMPIRICAL WORK

5.2 MICROARRAY DATA

The focus in this part of the thesis is on classification in the context of DNA microarray gene expression analysis, with the goal of predicting the diagnostic category of a sample from its gene expression profile. DNA stands for deoxyribonucleic acid, defined as the self-replicating material which is present in nearly all living organisms and which is the basic material that makes up human chromosomes (Oxford Dictionaries, 2017, s.v. ‘DNA’). DNA microarrays are glass microscope slides printed with thousands of tiny spots in defined positions, each spot containing a known DNA sequence of a gene. They are used to measure gene expression in a cell by measuring the amount of mRNA (messenger ribonucleic acid) present for each gene. Buhler (2002) describes mRNA as the type of RNA which codes for proteins. DNA microarrays were invented in the 1990s and are considered a breakthrough technology in biology, facilitating the quantitative study of thousands of genes simultaneously from a single sample of cells. Subsequently, microarrays have been the most popular technology for large-scale studies of gene expression, as they are readily affordable by many laboratories (Zhao et al., 2014:1).

An 𝑁 × 𝑝 gene expression data matrix is the output of 𝑁 microarray experiments (biological samples), each measuring 𝑝 different genes. The symbols 𝑋₁, 𝑋₂, …, 𝑋_𝑝 denote the gene expression measurements. The standard experimental protocol for obtaining gene expression data sets is described schematically in Figure 5.1.

Figure 5.1: Schematic of the steps in an experimental protocol to study differential expression of genes

Source: Buhler, 2002.

The six steps in Figure 5.1 are explained in detail by Buhler (2002). The first step is choosing the cell sample populations. In Figure 5.1, for example, one sample is taken from reference cells (orange) and another from target cells (red). Here, the reference sample could represent normal cells while the target sample represents tumorous cells. In Step 2, mRNA is isolated and extracted from the two cell samples. Buhler (2002) explains that mRNA is easily degraded; it is therefore converted into the more stable complementary DNA (cDNA) form using the enzyme reverse transcriptase, to prevent experimental samples from being lost. In Step 3, the cDNA is labelled with fluorescent markers so that the cDNA of the two samples bound to the microarray can be detected. The reference sample cDNA is labelled with green dye (which replaces the orange used in Step 1) and the target sample cDNA with red dye. These are represented by the red and green circles attached to the cDNA in Step 3.

In Step 4, the two cDNA samples are hybridized onto the same microarray slide, which holds hundreds or thousands of spots, each containing a different DNA sequence.

The term “hybridize” refers to the binding of complementary pairs of DNA molecules (Buhler, 2002). The labelled cDNA from both samples is mixed together and placed onto a DNA microarray slide, where each gene represented by a cDNA molecule will hybridize to the spot containing its complementary DNA sequence on the microarray. The amount of cDNA bound to a spot is directly proportional to the initial number of RNA molecules present for that gene in both samples.

Step 5 involves scanning the hybridized microarray with a laser at suitable wavelengths to detect both the red and the green dye. The laser scanner measures the fluorescence of each spot on the hybridized array, and the results for one colour are then superimposed on those for the other.

The final step is to interpret the scanned image, which consists of thousands of differently coloured dots, as shown in Step 6 in Figure 5.1. The fluorescence intensity of each spot on the array is related to the amount of cDNA in the sample (and therefore also to the mRNA level in the cell) for that gene (Howard Hughes Medical Institute, 2017). In the image in Step 6, a gene that has a high expression level in the target cell but not in the reference cell is represented by a dot that glows bright red. Conversely, a dot that glows bright green represents a high gene expression level in the reference cell but not in the target cell. The Howard Hughes Medical Institute (2017) explains that a gene expressed to the same extent in both samples is represented by a yellow spot, indicating the combination of the red and green light. Finally, if the gene is not expressed in either sample, the spot is black.
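The colour rules described above can be sketched as a small decision function. This is only an illustration: the function name `spot_colour`, the intensity units, the `threshold` below which a channel is treated as not expressed, and the fold `ratio` used to call one channel dominant are all assumptions made for the example and are not taken from Buhler (2002) or the Howard Hughes Medical Institute (2017).

```python
def spot_colour(red, green, threshold=100.0, ratio=2.0):
    """Return the qualitative colour of a two-channel microarray spot.

    red, green : fluorescence intensities of the target (red) and
                 reference (green) channels, in hypothetical units.
    threshold  : assumed intensity below which a channel counts as
                 "not expressed".
    ratio      : assumed fold difference needed to call one channel dominant.
    """
    if red < threshold and green < threshold:
        return "black"    # gene not expressed in either sample
    if red >= ratio * green:
        return "red"      # higher expression in the target sample
    if green >= ratio * red:
        return "green"    # higher expression in the reference sample
    return "yellow"       # similar expression in both samples


# Four hypothetical spots, one for each case described in the text.
for red, green in [(5000, 200), (200, 5000), (3000, 2800), (10, 20)]:
    print(red, green, spot_colour(red, green))
```

In practice the image is not reduced to four colours; the continuous intensities are retained and converted into numbers for the analysis.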

Before the microarray data obtained in Step 6 in Figure 5.1 can be analysed, the different dot colours need to be converted into numbers that represent the intensity of the red and green dye. The relative amount of cDNA hybridized at each spot of the microarray image in Figure 5.1 can be quantified by computing the log of the expression ratio. For more information regarding the expression ratio, refer to the paper by Babu (2004).

The above process results in thousands of numbers, each measuring the expression level of a gene in the target sample relative to the reference sample. A positive value indicates a higher gene expression level in the target sample than in the reference sample, suggesting that the gene was stimulated to make more mRNA by tumour formation (Howard Hughes Medical Institute, 2017). Conversely, a negative value indicates a higher gene expression level in the reference sample than in the target sample.
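As a minimal sketch of this conversion, the code below computes a log expression ratio from the red (target) and green (reference) intensities of a spot. It assumes that the expression ratio is the red intensity divided by the green intensity and that the logarithm is taken to base 2, which is a common convention but is not stated in the text above. The `floor` argument, the gene labels and the intensity values are hypothetical.

```python
import math


def log_ratio(red, green, floor=1.0):
    """Log2 of the red/green intensity ratio for one spot.

    Assumption: expression ratio = target (red) / reference (green).
    `floor` is a hypothetical lower bound that avoids division by zero
    and the logarithm of zero for very faint spots.
    """
    r = max(red, floor)
    g = max(green, floor)
    return math.log2(r / g)


# Hypothetical (red, green) intensities for three genes in one sample.
spots = {"gene_A": (4000.0, 1000.0),
         "gene_B": (1000.0, 4000.0),
         "gene_C": (500.0, 500.0)}

profile = {gene: log_ratio(red, green) for gene, (red, green) in spots.items()}
for gene, value in profile.items():
    print(gene, value)
```

Under these assumptions a fourfold higher intensity in the target channel gives a value of 2, a fourfold higher intensity in the reference channel gives -2, and equal intensities give 0, so the sign carries the interpretation described above.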

The most common and important application of microarray technology is tumour diagnosis through classification (Boulesteix, 2004). Tibshirani et al. (2001:6567) state that there are two main reasons why classification of microarrays is challenging. Firstly, there is a large number of inputs (genes) from which to predict classes and a relatively small number of samples, leading to high-dimensional data analysis problems (gene expression data are usually wide, implying 𝑝 ≫ 𝑁). Secondly, it is important to identify which genes contribute most to the classifier, as not all the genes in the expression profile are informative and many are redundant. Identifying the genes that contribute most to classification helps to overcome the curse of dimensionality in microarray data; it can also aid biological understanding of the disease process and plays a vital role in the development of clinical tests for early diagnosis (Tibshirani et al., 2003:104). Precise identification of tumours is very important for diagnosis and treatment.
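To make the two challenges concrete, the sketch below simulates a wide data matrix (𝑝 ≫ 𝑁) and ranks genes by a simple two-sample statistic. It is a generic illustration and not the method of Tibshirani et al. or of this thesis: the function names, the Welch-type score, the sample sizes and the simulated values are all assumptions chosen for the example.

```python
import random
import statistics


def t_like_score(values_a, values_b):
    """Welch-type two-sample statistic for one gene (illustrative only)."""
    mean_a = statistics.fmean(values_a)
    mean_b = statistics.fmean(values_b)
    var_a = statistics.variance(values_a)
    var_b = statistics.variance(values_b)
    se = (var_a / len(values_a) + var_b / len(values_b)) ** 0.5
    return (mean_a - mean_b) / se if se > 0 else 0.0


def rank_genes(X, y, top_k):
    """Return the indices of the top_k genes with the largest absolute score.

    X : list of N rows, each a list of p expression values.
    y : list of N class labels (1 = target class, 0 = reference class).
    """
    p = len(X[0])
    scores = []
    for j in range(p):
        a = [row[j] for row, label in zip(X, y) if label == 1]
        b = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(t_like_score(a, b)))
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:top_k]


# Simulated wide data: N = 20 samples, p = 500 genes, so p >> N.
# Only the first three genes differ between the classes (hypothetical).
rng = random.Random(0)
N, p = 20, 500
y = [1] * 10 + [0] * 10
informative = {0, 1, 2}
X = []
for i in range(N):
    row = []
    for j in range(p):
        shift = 5.0 if (j in informative and y[i] == 1) else 0.0
        row.append(rng.gauss(shift, 1.0))
    X.append(row)

selected = rank_genes(X, y, top_k=3)
print("N =", len(X), "p =", len(X[0]), "selected genes:", selected)
```

With only 20 samples and 500 genes, most columns are noise, which is why a ranking or selection step of some kind is usually applied before or within the classifier.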

RNA sequencing (RNA-seq) is a more recently developed approach to transcript profiling which makes use of deep-sequencing technologies (Wang et al., 2009:57), and it has been used as an alternative to microarrays. Zhao et al. (2014:1) state that although RNA-seq has benefits compared to microarrays, microarrays are still the more common choice of researchers conducting transcriptional profiling experiments.
