Objectives and statement of problems - Statistical methods for the analysis of RNA sequencing d

1.6.1 Functional clustering for time-course genomic data

Data clustering groups similar data points in order to discover and explain relationships among the data. Functional clustering of genomic data can identify co-expressed genes with similar functions and help explain the complexities of biological systems (Eisen et al., 1998). Exploring the patterns in genomic data from time-course experiments can provide important information on changes in expression levels over time. Major applications of time-course genomic experiments include (Androulakis et al., 2007): understanding the dynamics of biological systems such as cell cycles, examining developmental processes such as cell differentiation in organisms, analyzing response dynamics by monitoring how gene expression changes with varying drug dosages, and studying disease progression over time.

co-expressed genes (Cooke et al., 2011; Grün et al., 2012; Ng et al., 2006; Schliep et al., 2003; Yuan and He, 2008) and described in comprehensive reviews (Androulakis et al., 2007; Bar-Joseph, 2004; Möller-Levet et al., 2003; Wang et al., 2008). However, existing clustering methods developed for microarray data are not appropriate for discrete RNA-seq count data. We give a more detailed review of existing clustering methods for microarray data in a later section. Clustering methods for static data have been applied to RNA-seq datasets (Jäger et al., 2011; Pauli et al., 2012), but these approaches ignore the sequential nature of time-course data. To the best of our knowledge, statistical methods that thoroughly define a model for RNA-seq count data with time dependence and over-dispersion are very limited. There is therefore a clear need for novel clustering methods suited to temporal RNA-seq data. In this dissertation, the first research topic involves developing an efficient clustering method to identify patterns in gene expression data from time-course RNA-seq experiments. The goal is to use a model-based clustering approach to identify co-expressed genes and their expression patterns from gene expression levels measured by read counts over time.

1.6.2 Initialization procedures for finite mixture models

Finite mixture models are commonly used in cluster analyses of genomic data, and the expectation-maximization (EM) algorithm (Dempster et al., 1977) is often used for maximum likelihood (ML) estimation. The EM algorithm is particularly suitable for parameter estimation in situations with incomplete data, such as missing data, truncated distributions, and censored or grouped observations (McLachlan and Krishnan, 2008). The incompleteness of the data may not be natural or evident, in which case it is up to the statistician to formulate the incompleteness in a way that facilitates the application of the EM algorithm. Using finite mixture distributions to model heterogeneous data is a classic example of an incomplete-data problem: the component from which each observation arises is unobserved, and the goal is to estimate the proportions in which the components of the mixture occur along with the parameters of the component densities. The EM algorithm also applies when the likelihood is analytically intractable; the statistician may then simplify the likelihood function by treating the values of additional parameters as missing data. In this case, the incompleteness of the data is not natural but is formulated so that the EM algorithm can be used to optimize the likelihood function.
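The incomplete-data formulation of a finite mixture can be written out as follows. This is a sketch in standard textbook notation rather than notation taken from this chapter: the symbols $K$, $\pi_k$, $\theta_k$, $f_k$, $z_{ik}$, and $\tau_{ik}$ are assumptions introduced here for illustration.

```latex
% Mixture density for observation y_i with K components
f(y_i; \Psi) = \sum_{k=1}^{K} \pi_k \, f_k(y_i; \theta_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1 .

% Complete-data log-likelihood, with z_{ik} = 1 if observation i
% arises from component k and z_{ik} = 0 otherwise (unobserved)
\log L_c(\Psi) = \sum_{i=1}^{n} \sum_{k=1}^{K}
  z_{ik} \left\{ \log \pi_k + \log f_k(y_i; \theta_k) \right\} .

% Conditional expectation of z_{ik} given the data and current estimates
\tau_{ik} = \frac{\pi_k \, f_k(y_i; \theta_k)}
                 {\sum_{h=1}^{K} \pi_h \, f_h(y_i; \theta_h)} .
```

Here the unobserved labels $z_{ik}$ play the role of the missing data, and $\tau_{ik}$ is the posterior probability that observation $i$ belongs to component $k$.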

The EM algorithm is an iterative procedure in which each iteration consists of two steps: the Expectation step (E-step) and the Maximization step (M-step). During the E-step, the algorithm finds the expected value of the complete-data log-likelihood with respect to the unknown data, given the observed data and the current parameter estimates. The M-step then maximizes the expected log-likelihood obtained in the E-step and updates the parameter estimates. Starting from some initial values, the E- and M-steps are repeated until a convergence criterion is satisfied. Each iteration is guaranteed not to decrease the log-likelihood, and thus the algorithm nearly always converges to a local maximum of the likelihood function (McLachlan and Krishnan, 2008). However, convergence to the global maximum is not guaranteed, and the performance of the EM algorithm can be improved by using good starting values. Different initialization procedures for the EM algorithm have been proposed and investigated for ML estimation in Gaussian mixture models (Biernacki et al., 2003) and in mixtures of regression models for time-course microarray data in a model-based clustering setting (Scharl et al., 2010). It is of interest to identify a reliable initialization strategy for clusterwise regression specifically for time-course discrete count data from RNA-seq experiments. This leads to the second research topic addressed in this dissertation, which can be stated as follows: what is an effective initialization procedure for model-based clustering of time-course RNA-seq data?
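The E-step, the M-step, and the dependence on starting values can be illustrated with a minimal sketch. It is not the clustering model developed in this dissertation: it assumes a plain univariate Poisson mixture with no time dependence or over-dispersion, a simple random-start initialization, and a small made-up set of counts. The function names (`poisson_log_pmf`, `em_poisson_mixture`, `best_of_random_starts`) are hypothetical.

```python
import math
import random


def poisson_log_pmf(count, rate):
    """Log of the Poisson probability mass function."""
    return count * math.log(rate) - rate - math.lgamma(count + 1.0)


def em_poisson_mixture(counts, n_components, rng, max_iter=200, tol=1e-8):
    """Run EM for a Poisson mixture from one random start (illustrative only)."""
    # Random start: rates taken from randomly chosen observations, shifted
    # by 0.5 so that no rate is zero, with equal mixing proportions.
    rates = [c + 0.5 for c in rng.sample(list(counts), n_components)]
    props = [1.0 / n_components] * n_components
    loglik_path = []
    for _ in range(max_iter):
        # E-step: posterior probability of each component for each count,
        # computed on the log scale for numerical stability.
        resp, loglik = [], 0.0
        for c in counts:
            log_joint = [math.log(p) + poisson_log_pmf(c, r)
                         for p, r in zip(props, rates)]
            top = max(log_joint)
            log_marginal = top + math.log(sum(math.exp(v - top) for v in log_joint))
            loglik += log_marginal
            resp.append([math.exp(v - log_marginal) for v in log_joint])
        # Observed-data log-likelihood at the parameters used in this E-step.
        loglik_path.append(loglik)
        # M-step: weighted updates of proportions and rates (small floors
        # guard against empty components and zero rates).
        weights = [max(sum(row[k] for row in resp), 1e-12)
                   for k in range(n_components)]
        total = sum(weights)
        props = [w / total for w in weights]
        rates = [max(sum(row[k] * c for row, c in zip(resp, counts)) / weights[k], 1e-8)
                 for k in range(n_components)]
        if len(loglik_path) > 1 and loglik_path[-1] - loglik_path[-2] < tol:
            break
    return props, rates, loglik_path


def best_of_random_starts(counts, n_components, n_starts=10, seed=0):
    """Keep the run whose last recorded log-likelihood is largest."""
    rng = random.Random(seed)
    fits = [em_poisson_mixture(counts, n_components, rng) for _ in range(n_starts)]
    return max(fits, key=lambda fit: fit[2][-1])


# Made-up counts: a low-expression group and a high-expression group.
counts = [2, 3, 4, 3, 1, 5, 2, 4, 3, 2, 18, 22, 20, 19, 25, 21, 17, 23]
props, rates, loglik_path = best_of_random_starts(counts, n_components=2)
print(sorted(rates), props)
```

A start that happens to draw two equal rates leaves the two components identical at every iteration, so that run stalls at a one-component solution. Keeping the best of several starts is one simple way to reduce this sensitivity, and it is the kind of strategy that the initialization studies cited above compare.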

1.6.3 Missing value imputation methods for clustering of time-course genomic data

High-throughput technologies such as microarrays and sequencing, combined with statistical data analyses, give researchers the ability to explore and understand complex biological processes. However, technical limitations can lead to missing values in the data, for example when damaged or suspicious spots on a microarray are filtered out during the image analysis phase. Missing value imputation methods have been reviewed and evaluated for their impact on gene expression profile analyses (e.g. Liew et al., 2010; Oh et al., 2011). Celton et al. (2010) performed an extensive comparison of the effects of imputation methods on cluster analysis of microarray data and noted that even a low rate of missing values affects gene cluster stability. Given the difference in data types between microarray and RNA-seq data, there is a need to evaluate imputation methods for discrete count data, especially in the time-course setting.

Therefore, the third research topic investigated in this dissertation is the biological impact of missing value imputation on clustering analyses of genomic data from time-course sequencing experiments. Limited research has been done on the impact of missingness on sequenced data, so it is desirable to explore this area further and answer the following questions:

1. Are genomic data produced from next generation sequencing technologies often peppered with missing values?

2. What are the key issues that need to be addressed when dealing with missingness in time-course sequenced data with respect to the time-dependent nature of the data?

3. How can clustering of temporal RNA-seq data be improved by imputation methods when

In document Statistical methods for the analysis of RNA sequencing data (Page 38-42)