4.1.1 Parametric Models for Clustering
Parametric models are increasingly used in cluster analysis (McLachlan and Basford, 1988; Kerr and Churchill, 2001). As Aitkin and Aitkin (1996) wrote, “When clustering samples from a population, no clustering method is a priori believable without a statistical model.” Parametric models formalize cluster analysis by allowing statistical models to be developed and tested. One type of parametric cluster analysis uses a class of models known as mixture models, in which a likelihood-based approach is applied to cluster experimental data. The mixture model approach assumes that the observations on the entities to be clustered come from a mixture of a specified number of groups in various proportions. By assuming a parametric form for the density function in each group, a likelihood function can be formed in terms of the mixture density. The unknown parameters of the distribution can then be estimated by the method of maximum likelihood. The maximum likelihood (ML) equations for mixture models are nonlinear and are therefore solved iteratively. This process yields estimates of the cluster-specific parameters, the proportion of observations falling in each cluster, and the posterior probability that each observation belongs to a specific cluster. Clustering proceeds by assigning each entity to the group for which its estimated posterior probability of membership is largest relative to its posterior probabilities of belonging to the other groups.
Karl Pearson (1894) was one of the first to apply parametric mixture models. Pearson fitted a two-component univariate normal mixture model, using the method of moments, to a set of measurements of the ratio of forehead to body length on 1,000 crabs. This data set is reanalyzed in Section 4.3.6.
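The likelihood-based procedure described above can be illustrated with a minimal sketch. It is not the fitting procedure developed in this dissertation (that is given in Section 4.3), and it uses ML rather than Pearson's method of moments. It assumes a two-component univariate normal mixture, a crude starting partition obtained by splitting the sorted data in half, and simulated rather than real data. The function names (normal_pdf, posterior, fit_two_component_mixture, assign) and the min_sd guard are hypothetical choices made only for this illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior(x, props, means, sds):
    """Posterior probability that x belongs to each mixture component."""
    weighted = [p * normal_pdf(x, m, s) for p, m, s in zip(props, means, sds)]
    total = sum(weighted)
    return [w / total for w in weighted]


def fit_two_component_mixture(data, n_iter=200, min_sd=1e-3):
    """Iteratively solve the ML equations for a two-component normal mixture.

    Assumption for illustration: starting means come from splitting the
    sorted data in half; min_sd keeps a component from collapsing.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    means = [sum(ordered[:half]) / half, sum(ordered[half:]) / (n - half)]
    grand_mean = sum(ordered) / n
    grand_sd = math.sqrt(sum((x - grand_mean) ** 2 for x in ordered) / n)
    sds = [grand_sd, grand_sd]
    props = [0.5, 0.5]
    for _ in range(n_iter):
        # Posterior membership probabilities under the current estimates.
        post = [posterior(x, props, means, sds) for x in ordered]
        # Update proportions, means, and standard deviations.
        for c in range(2):
            weight = sum(p[c] for p in post)
            props[c] = weight / n
            means[c] = sum(p[c] * x for p, x in zip(post, ordered)) / weight
            var = sum(p[c] * (x - means[c]) ** 2 for p, x in zip(post, ordered)) / weight
            sds[c] = max(math.sqrt(var), min_sd)
    return props, means, sds


def assign(x, props, means, sds):
    """Assign x to the component with the largest posterior probability."""
    post = posterior(x, props, means, sds)
    return post.index(max(post))
```

With data simulated from two well-separated normal groups, fit_two_component_mixture returns the estimated proportions, means, and standard deviations, and assign places a new observation in the group with the larger posterior probability.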
The type of mixture model used in cluster analysis is often called a finite mixture model because it assumes that there is a finite number, C, of groups in the data. If at least one of the mixture components comes from a discrete distribution (such as the multinomial), the mixture model approach is sometimes called latent class analysis (McLachlan and Peel, 2000; Everitt et al., 2001). Latent class analysis dates back to Lazarsfeld (1950) and has been widely applied in the social sciences. This dissertation focuses on finite mixture models.
One of the first applications of finite mixture models to microarray data was by Ghosh and Chinnaiyan (2002). They define normal mixture models and develop the Expectation-Maximization (EM) algorithm for fitting these models using the approach discussed in Section 4.3. They apply a variety of constraints on the cluster structure. The method proposed in this dissertation relaxes some of these constraints. One advantage of using mixture models for microarray data is that they provide a statistical criterion for assessing the number of clusters present in the data. A strong assumption made in fitting mixture models to microarray data is that the genes are independent and identically distributed according to the mixture density defined in Equation 4.2. More work is needed to relax this assumption. (One way of doing this may be to fit multivariate mixture models, as suggested in Chapter 6.) Ghosh and Chinnaiyan (2002) suggest the Bayesian Information Criterion (discussed in Section 4.3.5) as a statistic for comparing mixture models having different numbers of clusters. They apply average linkage hierarchical clustering with the Euclidean distance measure to obtain starting values for fitting mixture models. They comment that convergence problems frequently occurred with these starting values and suggest using random partitions of the data to generate starting values in such cases. In this dissertation, taking the starting values from a k-means cluster analysis is found to give better convergence properties. When clustering across samples, Ghosh and Chinnaiyan (2002) comment that the number of samples is typically much smaller than the number of genes. For such situations, they suggest reducing the dimension of the data by principal components analysis. This method does not reduce the dimensionality enough in larger microarray experiments. An alternative method of filtering the genes is proposed in Section 4.3.8.
Ghosh and Chinnaiyan (2002) analyze two microarray datasets. The first is from a malignant melanoma study reported by Bittner et al. (2000). There were 31 melanoma samples and 3,613 genes included in the analysis. The data were normalized using the usual log corrected intensity ratios. They found two clusters of melanomas, one of size 19 and the other of size 12. They did not report the results of clustering the genes. The second microarray dataset was from a prostate cancer study (Dhanasekaran et al., 2001). There were 26 samples from 5 different biopsy locations, and 3,955 genes were included in the analysis. They clustered the genes and investigated the results for cluster sizes of 250, 1,000, and 2,000. For 250 clusters, several biologically plausible groups of genes were found.
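Two of the ideas discussed above, k-means starting values and the use of the Bayesian Information Criterion to compare models with different numbers of clusters, can be sketched together. This is only an illustration under stated assumptions: the data are univariate and simulated, the component variances are unconstrained, and BIC is written as -2 log L + p log n so that smaller values are preferred (the convention adopted in Section 4.3.5 may differ in sign). The names kmeans_1d, fit_mixture, and bic, the quantile-based initial centers, and the min_sd guard are hypothetical and are not taken from Ghosh and Chinnaiyan (2002).

```python
import math
import random


def kmeans_1d(data, k, n_iter=50):
    """One-dimensional k-means (Lloyd's algorithm); returns a label per point."""
    ordered = sorted(data)
    n = len(ordered)
    # Assumption: initial centers are evenly spaced sample quantiles.
    centers = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = [0] * n
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in data]
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels


def log_normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * z * z


def log_mixture_terms(x, props, means, sds):
    """log(proportion * density) for each component, evaluated at x."""
    return [math.log(p) + log_normal_pdf(x, m, s) for p, m, s in zip(props, means, sds)]


def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_mixture(data, k, n_iter=200, min_sd=1e-3):
    """Fit a k-component normal mixture by EM, started from a k-means partition."""
    n = len(data)
    labels = kmeans_1d(data, k)
    props, means, sds = [], [], []
    for j in range(k):
        # Fall back to all data if k-means left a cluster empty.
        members = [x for x, lab in zip(data, labels) if lab == j] or list(data)
        mean = sum(members) / len(members)
        var = sum((x - mean) ** 2 for x in members) / len(members)
        props.append(float(len(members)))
        means.append(mean)
        sds.append(max(math.sqrt(var), min_sd))
    total = sum(props)
    props = [p / total for p in props]
    for _ in range(n_iter):
        # E-step: posterior probabilities, computed on the log scale.
        post = []
        for x in data:
            terms = log_mixture_terms(x, props, means, sds)
            norm = log_sum_exp(terms)
            post.append([math.exp(t - norm) for t in terms])
        # M-step: weighted updates of proportions, means, and standard deviations.
        for j in range(k):
            nj = max(sum(p[j] for p in post), 1e-12)
            props[j] = nj / n
            means[j] = sum(p[j] * x for p, x in zip(post, data)) / nj
            var = sum(p[j] * (x - means[j]) ** 2 for p, x in zip(post, data)) / nj
            sds[j] = max(math.sqrt(var), min_sd)
    loglik = sum(log_sum_exp(log_mixture_terms(x, props, means, sds)) for x in data)
    return props, means, sds, loglik


def bic(loglik, k, n):
    """BIC with k means, k standard deviations, and k - 1 free proportions."""
    n_params = 3 * k - 1
    return -2.0 * loglik + n_params * math.log(n)
```

Fitting the model for several values of k and comparing bic(loglik, k, len(data)) across them gives a simple criterion for choosing the number of clusters; under the sign convention assumed here, the model with the smallest value is preferred.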
Several other papers that apply mixture models to microarray data have been published. Mjolsness et al. (2000) apply mixture models to determine the number of clusters present in a microarray data set. They select the “optimal” model based on the maximum log likelihood. They discuss a process known as circuit inference, which uses simulated annealing to establish a network of connection strengths similar to those of a neural network (neural networks are briefly discussed in Chapter 6). McLachlan et al. (2002) suggest applying mixture models as a gene filtering tool. They select a subset of genes by fitting mixtures of t distributions and ranking the genes in order of increasing size of the likelihood ratio statistic for the test of including the gene in the model versus not. They introduce a software package called EMMIX-GENE to perform this ranking. Pan (2002) applies the normal mixture model suggested by Ghosh and Chinnaiyan (2002) to cluster microarray data on the susceptibility of rats to ear infections. He found three clusters, two of which were attributed to genes with no altered expression levels and one of which contained at least 30 genes having differential expression levels. Allison et al. (2002) apply mixture models to cluster p-values indicating whether or not differential gene expression is present. This approach is relevant here only in the sense that mixture models were applied.