2.2 Cluster analysis problems
2.2.1 Clustering algorithms
Clustering in gene expression data sets is a challenging problem. One can consider two types of clustering in gene expression data sets: clustering of genes and clustering of samples.
Clustering of genes
Most methods were designed to solve gene clustering problems. In unsupervised methods, current knowledge regarding the functional role of different genes is not considered [30]. Hence, unsupervised microarray data analysis is a process in which the system discovers existing gene categories rather than working from an imposed structure. The system uses the data set to find regularities, patterns or groups.
It is assumed that each gene belongs to a category that is associated with a function or co-regulation. In this case it is expected that unsupervised analysis will reveal associations between gene expression patterns that were not previously evident.
Clustering uses unsupervised methods to determine whether the elements of a gene expression matrix belong to a particular group. It is assumed that similar expression levels indicate the same biological function or co-regulation, so clustering helps to determine the function of unknown genes. In the clustering process, expression values are grouped according to a distance function.
The members of each gene expression cluster are similar to one another but differ from the members of other clusters. The first step in
Clustering in gene expression data sets 2.2. Cluster analysis problems
clustering is to define similarity and dissimilarity by means of a distance function.
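As an illustration of what such a distance function might look like, the following minimal sketch computes two measures commonly used for expression profiles: the Euclidean distance and a correlation-based distance. The text does not say which distance is used, so both choices, the function names and the example profiles are assumptions made for illustration only.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation: near 0 for co-varying profiles, near 2 for opposite ones.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)

# Hypothetical expression profiles (one value per sample).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # same trend as gene_a, larger amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]  # opposite trend to gene_a
```

Under the Euclidean distance gene_a and gene_b are far apart because their amplitudes differ, whereas the correlation-based distance treats them as nearly identical because they rise and fall together. The choice of distance therefore determines which genes end up in the same cluster.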
Different algorithms for clustering of genes have been proposed [47,91,134,135]. Some of the main techniques are described in further detail below.
Hierarchical clustering
Hierarchical clustering is used to identify genes with similar profiles and thus similar functions [30]. In the clustering process, each gene expression value is expressed as coordinates, and its distance from the other genes is computed using pair-wise similarity measures. Hierarchical clustering is divided into two subgroups according to whether it proceeds by dissimilarity (divisive) or similarity (agglomerative). A divisive approach (top-down) starts with all gene expression values in a single cluster and splits it repeatedly until a stopping criterion is met. An agglomerative approach (bottom-up) begins with each gene expression value in its own (singleton) cluster and merges clusters until a stopping criterion is met.
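The agglomerative (bottom-up) procedure described above can be sketched as follows. This is a minimal illustration rather than the method of any cited work: it assumes Euclidean distance, average linkage, and a target number of clusters as the stopping criterion, and the data and function names are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerative(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters (lists of row indices) and the merge history.
    """
    clusters = [[i] for i in range(len(profiles))]  # start from singletons
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical expression profiles (rows = genes, columns = samples).
profiles = [
    [0.1, 0.2, 0.1],
    [0.2, 0.1, 0.2],
    [2.0, 2.1, 1.9],
    [2.2, 2.0, 2.0],
    [5.0, 5.2, 4.9],
]
clusters, merges = agglomerative(profiles, n_clusters=3)
```

The merge history records which clusters were joined and at what distance, which is the information a dendrogram displays. A divisive variant would run in the opposite direction, starting from one cluster and splitting it.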
The result of hierarchical clustering is a tree-shaped graph called a dendrogram, which gives a visual summary of the clustering process. Each gene expression value is a leaf of the tree. In microarray studies the dendrogram is typically drawn alongside a colour-coded display of the expression values, in which red indicates an increase in gene expression levels and green indicates a decrease. The intensity of the colour reflects the magnitude of the change. The length of the line that connects two clusters (nodes) shows their relative closeness.
Bi-clustering (or two-way clustering) is a technique that clusters genes and microarray subsets simultaneously. Hierarchical clustering employs different methods [30], including the single-linkage, complete-linkage, average-linkage, centroid-linkage and median-linkage methods and Ward’s clustering method.
Partitional clustering
Partitional clustering divides g gene expression values into k groups, each of which represents a cluster, with k ≤ g [30]. This process has two requirements. Firstly, each cluster must contain at least one gene expression value. Secondly, each gene expression value must belong to a cluster.
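The two requirements can be stated as a simple check on a candidate assignment. The sketch below is illustrative only; the representation of a partition as a list of cluster labels and the function name are assumptions.

```python
def is_valid_partition(assignments, k):
    """Check the two partitional-clustering requirements.

    assignments[i] is the cluster label (0..k-1) of item i,
    or None if the item has not been assigned to any cluster.
    """
    # Requirement 2: every item belongs to one of the k clusters.
    every_item_assigned = all(
        label is not None and 0 <= label < k for label in assignments
    )
    # Requirement 1: no cluster is empty.
    no_empty_cluster = set(range(k)) <= set(assignments)
    return every_item_assigned and no_empty_cluster
```

Because every cluster must contain at least one item, a valid partition can only exist when k is no larger than the number of items, which is the condition k ≤ g.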
k-means clustering
This is one of the most popular unsupervised methods applied to microarray data [30]. k-means methods are reported to give better results on microarray data sets in which the clusters are expected to be compact, that is, where the members of a cluster have similar gene expression values and therefore similar biological functions. In microarray data analysis, the k-means algorithm represents each gene expression value as a point. The algorithm selects k points (seed points) and treats them as centroids. During each pass through the data set the k centroids are held fixed, and the remaining points are assigned to their nearest centroid so as to minimize the sum of the distances between the centroids and the points assigned to them. The centroids are then recomputed from their assigned points, and the process repeats until the assignments no longer change.
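A minimal sketch of this iteration is given below. It assumes Euclidean distance, randomly chosen seed points, and the usual update in which each centroid moves to the mean of its assigned points; the function name, parameters and data are hypothetical.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(points, k, max_iter=100, seed=0):
    """Return (assignments, centroids) for a basic k-means run."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # seed points
    assignments = None
    for _ in range(max_iter):
        # Assignment step: centroids are held fixed during the pass.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the run has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical data: two compact, well-separated groups of points.
points = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 5.0], [5.1, 4.9],
]
assignments, centroids = k_means(points, k=2)
```

The result depends on the seed points, which is why k-means is often restarted from several initializations and the run with the smallest objective value is kept.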
Fuzzy clustering
In 'fuzzification', numbers such as gene expression levels are converted to qualitative descriptors [30]. The difference between fuzzy k-means clustering and standard k-means clustering is that the fuzzy variant treats each gene as a member of every cluster with a certain degree of membership. This allows a fuzzy k-means algorithm to identify overlapping groups of genes and to identify the role of a gene in different pathways.
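To make the idea of graded membership concrete, the sketch below computes the degrees of membership of one point in each cluster for fixed centroids. It uses the membership formula of the standard fuzzy c-means algorithm with a fuzzifier m, which is an assumption: the text does not specify the formula, and the function name and example values are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def fuzzy_memberships(point, centroids, m=2.0):
    """Degrees of membership of one point in every cluster (they sum to 1).

    m > 1 is the fuzzifier; values closer to 1 give crisper memberships.
    """
    distances = [euclidean(point, c) for c in centroids]
    if 0.0 in distances:
        # The point coincides with a centroid: full membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]

# Hypothetical centroids and a point lying closer to the first one.
centroids = [[0.0, 0.0], [4.0, 0.0]]
memberships = fuzzy_memberships([1.0, 0.0], centroids)
```

A gene that lies between two centroids receives a substantial membership in both clusters, which is how overlapping groups are represented. Standard k-means would instead assign it wholly to the nearer centroid.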
Clustering of samples
VizCluster technique for sample clustering
Zhang et al. [137] present the VizCluster technique, a visualization approach to cluster analysis. The aim of clustering and classification is to discover the pattern or structure of a data set, and visualizing these patterns or structures can help in exploratory data analysis. The technique uses graphical visualization methods to show the data structure or underlying data pattern. It combines high-dimensional scatterplots and parallel coordinate plots to produce a non-linear projection that maps n-dimensional vectors to two-dimensional points.
Zhang et al. have developed two approaches:
1. Supervised maximum entropy approach, which uses pre-known classes of samples as a training set, then applies the maximum entropy model to generate the optimal pattern model which can be used on new samples.
2. Unsupervised interrelated two-way clustering method, which dynamically uses the relationship between the groups of genes and samples while clustering through both gene-dimension and sample-dimension to identify important genes and classify samples simultaneously.
VizCluster supports three types of data analysis: cluster/class discovery (in both supervised and unsupervised analysis), class prediction and class assessment. Here the goal is the classification of samples in gene expression data.
Due to the large number of genes, only a few algorithms can be applied to the clustering of samples [13]. As the number of clusters increases, the number of variables in the clustering problem increases drastically and most clustering algorithms become inefficient. The k-means algorithm and its variations are among the algorithms that remain applicable to the clustering of samples in gene expression data sets. However, as the number of clusters increases, k-means algorithms in general converge only to local minima, and these local minima may differ significantly from the global solution. Recently, the global k-means algorithm has been proposed to improve the global search properties of k-means algorithms [89].
Bagirov et al. [15] propose a new clustering algorithm based on methods of non-smooth optimization. In this algorithm, clusters are calculated incrementally. The algorithm calculates as many clusters as exist in a data set, with respect to a given tolerance.