A review of mixture model-based clustering methods for mixed data types

Clustering multivariate data with a mixture of multivariate normal densities for continuous variables, or a mixture of multivariate Bernoulli densities for binary variables, as proposed by Wolfe (1970) and Everitt (1984) respectively, is the essential technique known as latent class analysis [66]. There are several approaches to clustering data described by variables of different types (mixed data). Possible approaches are to cluster each type of variable separately; to convert all variables to a single type before clustering; or to cluster a data set containing continuous variables together with binary or ordinal variables, as proposed by Everitt (1988) [66]. The last of these is important early work on model-based clustering of mixed-mode data. Here, the author proposed that the binary and ordinal variables arise from underlying continuous variables that are not directly observable. The method estimates the parameters of the unobservable continuous data by setting certain threshold values as the cut-off points.
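The thresholding idea can be illustrated with a minimal sketch. It assumes a single standard-normal latent variable and two hypothetical cut-off points (-0.5 and 0.8); neither these values nor the function name `discretize_latent` come from the cited work, and the sketch only shows how an ordinal variable can arise from an unobserved continuous one.

```python
import numpy as np

def discretize_latent(z, thresholds):
    """Map latent continuous values to ordinal categories 0..len(thresholds).

    A value z falls in category k when thresholds[k-1] < z <= thresholds[k];
    thresholds must be sorted in increasing order.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    return np.searchsorted(thresholds, z, side="left")

rng = np.random.default_rng(0)
z = rng.normal(size=1000)        # unobserved latent scores (illustrative only)
cuts = [-0.5, 0.8]               # hypothetical cut-off points
y = discretize_latent(z, cuts)   # observed ordinal variable with three levels
```

In a model-based method the direction is reversed: only `y` is observed, and the parameters of the latent distribution are estimated given the thresholds.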

A similar attempt was made by Morlini [67], who proposed a model-based approach, based on a multivariate Gaussian mixture model, for clustering binary and continuous variables using a mixture of discrete (multinomial) and continuous (multivariate) distributions. Assuming that the observed binary values ‘0’ and ‘1’ correspond to small and large latent continuous values (scores) respectively, the author estimates the scores of the latent continuous variables that produce the observed binary values and then combines them with the scores of the original continuous variables for clustering. This involves deciding on a threshold for each binary variable. These ideas were extended to ordinal and nominal variables by McParland and Gormley [68] in their clustering method called ClustMD. They proposed a latent variable model with an underlying mixture of Gaussian distributions to model observed mixed-type data comprising any combination of continuous, binary, ordinal or nominal variables.

The latent continuous data underlying both the ordinal and nominal data were assumed to be Gaussian, as were any observed continuous data. Thus, the joint vector of observed and latent continuous data, z_i, was assumed to follow a multivariate Gaussian distribution, z_i ~ MVN_p(μ, Σ).
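One way to write this latent-variable formulation explicitly is sketched below. The cluster index g, the mixing proportions \pi_g, the number of clusters G and the thresholds \gamma_{j,k} are notation assumed here for illustration, not taken from the text above; the last line covers an ordinal variable j only.

```latex
\begin{align}
  z_i \mid (\text{cluster } g) &\sim \mathrm{MVN}_p(\mu_g, \Sigma_g), \\
  f(z_i) &= \sum_{g=1}^{G} \pi_g \, \mathrm{MVN}_p(z_i \mid \mu_g, \Sigma_g),
  \qquad \sum_{g=1}^{G} \pi_g = 1, \\
  y_{ij} = k &\iff \gamma_{j,k-1} < z_{ij} \le \gamma_{j,k}.
\end{align}
```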

In all of the approaches mentioned above, an expectation-maximization framework was adopted to obtain maximum likelihood estimates from the observed data, which requires manual specification of the number of clusters, and the methods were applied to problems with relatively small numbers of variables (<10 continuous and <20 other). Similar ideas have been explored by Cai and co-workers in a Bayesian context [69]. Here, a generalized latent variable model was proposed in which the cumulative probabilities of the various types of observed variables are specified by a linear model of latent variables. Different density functions are applied to different types of data (i.e. Gaussian for continuous variables, Gaussian with thresholds for ordinal variables, Poisson or binomial for count data, and a multinomial logit link for nominal variables). Although this method can model multiple data types simultaneously, it again depends on a pre-defined number of mixture components or on prior Bayesian estimates, and a fairly large sample size is required to obtain accurate results.
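The role of expectation-maximization with a user-specified number of clusters can be illustrated with a minimal sketch. It is not any of the cited methods: it assumes, purely for illustration, that continuous variables follow diagonal Gaussians and binary variables follow independent Bernoulli distributions within each cluster, and it uses a simple farthest-first initialisation. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def farthest_first_centres(X, n_clusters):
    """Deterministic initial centres: start at X[0], then repeatedly add
    the point farthest from all centres chosen so far."""
    idx = [0]
    dist = np.linalg.norm(X - X[0], axis=1)
    for _ in range(1, n_clusters):
        idx.append(int(dist.argmax()))
        dist = np.minimum(dist, np.linalg.norm(X - X[idx[-1]], axis=1))
    return X[idx].astype(float)

def em_mixed_mixture(X, B, n_clusters, n_iter=50):
    """EM for a mixture with diagonal-Gaussian (X) and Bernoulli (B) parts.

    n_clusters must be chosen by the user; it is not estimated.
    """
    pi = np.full(n_clusters, 1.0 / n_clusters)            # mixing proportions
    mu = farthest_first_centres(X, n_clusters)            # Gaussian means
    var = np.tile(X.var(axis=0) + 1e-6, (n_clusters, 1))  # Gaussian variances
    theta = np.full((n_clusters, B.shape[1]), 0.5)        # Bernoulli probabilities
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities, computed in log space.
        log_gauss = -0.5 * (
            np.log(2 * np.pi * var)[None]
            + (X[:, None, :] - mu[None]) ** 2 / var[None]
        ).sum(axis=2)
        log_bern = B @ np.log(theta).T + (1 - B) @ np.log(1 - theta).T
        log_p = np.log(pi)[None, :] + log_gauss + log_bern
        resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
        # M-step: responsibility-weighted parameter updates.
        nk = resp.sum(axis=0) + 1e-10
        pi = nk / nk.sum()
        mu = (resp.T @ X) / nk[:, None]
        var = np.maximum((resp.T @ X ** 2) / nk[:, None] - mu ** 2, 1e-6)
        theta = np.clip((resp.T @ B) / nk[:, None], 1e-6, 1 - 1e-6)
    return resp.argmax(axis=1), pi, mu, var, theta

# Synthetic example: two well-separated groups, 2 continuous + 3 binary variables.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(10, 1, (100, 2))])
B = np.vstack([rng.random((100, 3)) < 0.9,
               rng.random((100, 3)) < 0.1]).astype(float)
labels, pi, mu, var, theta = em_mixed_mixture(X, B, n_clusters=2)
```

Changing `n_clusters` changes the fitted partition, which is why such methods need the number of clusters to be specified or selected by an external criterion.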

Alternatively, to address problems with large numbers of variables of different types, and incorporating dimension reduction as an integral component, iCluster [70] and the integrative phenotyping framework (iPF) [71] were developed specifically for integrating and clustering mixed genome-scale (‘omics’) data for disease subtype discovery. In iCluster, the link between data types is achieved by assuming a shared underlying latent variable model representing the disease subtypes. It also uses the k-means procedure to find the actual cluster assignments given the latent variable values. iPF is a workflow developed to integrate independent homogeneous clusterings from different omics data in an agglomerative manner. It uses a dissimilarity matrix of features from clusters across omics data. This is then followed by visualization of the heterogeneous clustering of pairwise omics sources.

All of the approaches mentioned above assume a common clustering with a known/common set of clusters across all data types. A different approach was taken by the Bayesian MDI package [72, 73], which first clusters data sets based on pairwise relations (linking coefficients) between data sets, and then fuses entities together if they have the same linking coefficients. MDI combines the entities into statistically distinct clusters while exploiting any latent structure in cluster allocations across data sets. A flexible Bayesian mixture modelling approach was applied. Although MDI provides adequate flexibility for grouping fused entities, it does not clearly encourage any sharing of clusters across more than pairs of data sets. Thus, we believe our work fills this gap by providing an algorithm that is truly flexible in terms of the cluster allocations of entities while exploiting all data sets of different types simultaneously.

Methodology

In this work, our intention was to use mixture-model clustering to cluster genes using both gene expression (continuous) and regulatory (binary) information (e.g. TF binding), and also to classify cancer patients into sub-classes based on gene expression (continuous) and mutation patterns (binary). The probabilistic model explained below relates to the first task, where clustering the expression of genes together with TF binding to those genes is used to infer a relationship between the two. The same probabilistic model of genetic regulation applies to clustering cancer patient data. In our genetic regulation framework, we refer to a set of genes that are regulated by a set of TFs as a regulatory module. Similarly, for cancer patient classification, we refer to a set of patients having different patterns of markers (i.e. gene expression and mutation) as a cluster.

In document Flexible model-based joint probabilistic clustering of binary and continuous inputs and its application to genetic regulation and cancer (Page 46-48)