2.1.3 Clustering of time-course RNA-seq data
As RNA-seq has been widely adopted as an attractive alternative to microarrays for studying gene expression, the development of clustering algorithms suitable for RNA-seq data has become an important area of research, because such algorithms allow the analysis of multiple groups rather than simple two-group comparisons. RNA-seq quantifies gene expression levels using read counts, and the discrete nature of these data differs from the continuous expression measurements produced by microarray experiments. This difference in data type makes it problematic to apply statistical tools developed for microarrays directly to RNA-seq data.
One approach to overcome this issue is to use data transformations. Li et al. (2010) applied a log-transformation to gene expression values from an RNA-seq experiment and identified differentially expressed genes using the K-means clustering algorithm. Jäger et al. (2011) standardized the RNA-seq count values from their experiment and performed hierarchical clustering on the normalized data to obtain groups of genes with similar expression. In another time-course experiment focusing on early zebrafish development, the RNA-seq data were also normalized and clustered using K-means (Pauli et al., 2012). These heuristic approaches have the advantage of easy implementation; however, they have not been evaluated for RNA-seq data analysis, and the clustering methods employed ignore the dependence among observations over time. There are also complications when analyzing transformed count data. Transformed count data cannot be well approximated by continuous distributions, which is particularly problematic for data with small sample sizes and low count ranges (Oshlack et al., 2010). Data with very small counts are far from normally distributed after transformation, and count data usually exhibit a mean-variance relationship that is not addressed by normal-based analyses (McCarthy et al., 2012).
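The mean-variance issue described above can be illustrated with a small simulation. The following is a minimal sketch, not taken from any of the cited studies: it assumes Poisson-distributed counts, a log2(y + 1) transformation, and a hypothetical helper name `log_variance_by_mean`. It shows that the variance of the transformed counts still depends strongly on the mean, so a single normal-based variance would not describe low and high counts equally well.

```python
import numpy as np

def log_variance_by_mean(means, n=20000, seed=0):
    """Simulate Poisson counts at each mean and return the sample
    variance of log2(count + 1) for each mean (illustrative helper)."""
    rng = np.random.default_rng(seed)
    out = {}
    for mu in means:
        counts = rng.poisson(mu, size=n)
        # The +1 offset is an assumed convention to avoid log(0).
        out[mu] = float(np.var(np.log2(counts + 1.0)))
    return out

variances = log_variance_by_mean([2, 20, 200])
for mu, v in variances.items():
    print(f"mean={mu:>4}  var(log2(y+1))={v:.4f}")
```

In this sketch the variance on the log scale shrinks as the mean grows, which is one way to see why low-count genes are handled poorly by methods that assume a common normal error.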
Genome analyses using count models can better distinguish biological from technical variability than analyses of transformed data based on continuous distributions (Robinson and Smyth, 2008). Count models have also been shown to have higher power to detect differential expression than approximate normal models (Robinson and Oshlack, 2010). Following this notion, Si et al. (2014) proposed model-based clustering algorithms with Poisson or negative binomial distributions for analyzing RNA-seq data and compared these methods with K-means and self-organizing maps (SOM). To our knowledge, this was the first statistical research to focus on clustering methodology for RNA-seq data. However, the analysis of time-course RNA-seq data has not been explored.
The clustering of time-course data can be viewed as identifying developmental trajectories (or temporal patterns) within a dataset. The semi-parametric group-based trajectory modeling approach (Nagin, 1999; Nagin, 2005) is an example of a model-based clustering method for longitudinal data. The method models the data as a mixture of distinct groups (clusters) defined by their trajectories. By clustering data with similar trajectories, differences that may explain individual- (or sample-) level variability can be expressed in terms of cluster differences (Nagin, 2005). For analyzing time-course genomic data, the group-based trajectory model can: (1) determine the optimal number of distinct expression patterns and identify those patterns/trajectories, (2) estimate the proportion of samples believed to have produced the expression pattern of each group, (3) relate the cluster assignments to covariates describing characteristics of the genes, and (4) use the cluster membership probabilities for purposes such as creating summaries of the expression patterns of clustered genes.
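For orientation, the general form of such a group-based mixture can be written as follows. This is a sketch under assumed notation (the symbols y_{it}, K, \pi_k, \theta_k and the quadratic trajectory are illustrative choices, not definitions from this section), and it assumes that observations are conditionally independent over time given group membership. The mixing proportions \pi_k correspond to item (2) above, and the posterior probabilities correspond to the cluster membership probabilities in item (4).

```latex
% Assumed notation: y_{it} = count for gene i at time t (t = 1, ..., T),
% K groups, mixing proportions \pi_k, group-specific parameters \theta_k.
f(y_i) = \sum_{k=1}^{K} \pi_k \prod_{t=1}^{T} p(y_{it} \mid \theta_k, t),
\qquad \sum_{k=1}^{K} \pi_k = 1,

% Illustrative quadratic trajectory on the log scale for a count model:
\log \mu_{kt} = \beta_{0k} + \beta_{1k}\, t + \beta_{2k}\, t^{2},

% Posterior probability that gene i belongs to group k:
P(k \mid y_i) = \frac{\pi_k \prod_{t=1}^{T} p(y_{it} \mid \theta_k, t)}
                     {\sum_{j=1}^{K} \pi_j \prod_{t=1}^{T} p(y_{it} \mid \theta_j, t)}.
```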
The model was first applied by Nagin and Land (1993) as a mixed Poisson model to criminal career data. Roeder et al. (1999) extended the idea with a mixture of zero-inflated Poisson distributions to handle situations where the data contain more zeros than expected under a Poisson distribution. The group-based estimation model was subsequently implemented as a SAS-based procedure, Proc Traj, by Jones, Nagin and Roeder (2001). The procedure uses mixtures of zero-inflated Poisson, censored normal, and Bernoulli models for longitudinal count data, scale data, and binary data, respectively. More recently, KmL (Genolini and Falissard, 2011), a K-means clustering algorithm for longitudinal data, was proposed and compared with Proc Traj on the modeling of time-course data. Although it produces results similar to those of Proc Traj when applied to a real dataset with count data, KmL suffers from the limitation that all clusters are assumed to have the same variance, which would affect its ability to correctly identify clusters if the underlying subpopulations have different variances (Genolini and Falissard, 2011).
The objective of this research is to develop a clustering algorithm suitable for time-course RNA-seq data. Since RNA-seq data suffer from over-dispersion (Nagalakshmi et al., 2008; Robinson and Smyth, 2007), a common approach is to model the count data using negative binomial distributions, which accommodate over-dispersion. Here we develop an efficient model-based clustering method with mixtures of negative binomials to cluster time-course RNA-seq data using the semi-parametric group-based modeling approach proposed by Nagin (1999). By identifying clusters of genes with similar expression patterns, differences that may explain individual-level variability can be expressed in terms of cluster differences. The parameters of this model can be estimated by direct maximization, such as a general quasi-Newton procedure, but the results of this procedure are highly dependent on the starting values (Roeder et al., 1999). To avoid this problem, we present an EM algorithm for maximum likelihood estimation of the parameters in our group-based approach.
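To make the idea of EM estimation for a negative binomial mixture concrete, the following is a simplified sketch and not the algorithm developed in this dissertation. It assumes a known dispersion parameter (`size`) shared by all clusters and a free mean for each cluster at each time point instead of a parametric trajectory; under these assumptions the M-step for the means reduces to a responsibility-weighted average. All function names, the farthest-point initialization, and the simulated data are hypothetical choices made for illustration.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)

def farthest_point_init(Y, K):
    """Deterministic start: row 0, then rows farthest away on the log scale."""
    L = np.log(Y + 1.0)
    idx = [0]
    for _ in range(K - 1):
        d = np.min([((L - L[i]) ** 2).sum(axis=1) for i in idx], axis=0)
        idx.append(int(np.argmax(d)))
    return Y[idx] + 0.5  # offset keeps every starting mean positive

def nb_loglik_matrix(Y, mu, size, const):
    """(n, K) log-likelihood of each row of Y under each cluster's NB means."""
    out = np.empty((Y.shape[0], mu.shape[0]))
    for k in range(mu.shape[0]):
        m = mu[k]
        terms = size * np.log(size / (size + m)) + Y * np.log(m / (size + m))
        out[:, k] = terms.sum(axis=1) + const
    return out

def em_nb_mixture(Y, K, size=10.0, n_iter=500, tol=1e-8):
    """EM for a K-component NB mixture with fixed, shared dispersion."""
    Y = np.asarray(Y, dtype=float)
    # Part of the NB log-pmf that does not depend on the cluster means.
    const = (_lgamma(Y + size) - math.lgamma(size) - _lgamma(Y + 1.0)).sum(axis=1)
    mu = farthest_point_init(Y, K)
    pi = np.full(K, 1.0 / K)
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities via a stable log-sum-exp.
        logp = nb_loglik_matrix(Y, mu, size, const) + np.log(pi)
        mx = logp.max(axis=1, keepdims=True)
        ll = mx[:, 0] + np.log(np.exp(logp - mx).sum(axis=1))
        resp = np.exp(logp - ll[:, None])
        # M-step: weighted proportions and weighted means (closed form
        # because the dispersion is held fixed).
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / nk.sum()
        mu = (resp.T @ Y) / nk[:, None] + 1e-8
        total = ll.sum()
        if total - prev < tol:
            break
        prev = total
    return pi, mu, resp, total

# Simulated example: two clusters of 150 genes with different time profiles.
rng = np.random.default_rng(1)
true_mu = np.array([[5.0, 20.0, 80.0, 20.0], [60.0, 30.0, 10.0, 5.0]])
size = 10.0
Y = np.vstack([rng.negative_binomial(size, size / (size + m), size=(150, 4))
               for m in true_mu])
pi_hat, mu_hat, resp, loglik = em_nb_mixture(Y, K=2, size=size)
print("estimated proportions:", np.round(pi_hat, 3))
print("estimated cluster means:\n", np.round(mu_hat, 1))
```

A full treatment would also estimate the dispersion and constrain the means to follow a trajectory over time, which generally removes the closed-form M-step and requires a numerical update within each iteration.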