1.6 Microarray Data Analysis
1.6.2 Clustering
Clustering is the grouping of objects based on similarity. In other words, it is the partitioning of a data set into subsets so that the data in each subset share some common trait. The measure of the common trait is defined before the clustering is performed and is often a distance metric that quantifies the relative similarity between two objects. Data clustering is a common technique for statistical data analysis and has applications in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.
Clustering gene expression data helps in identifying genes of similar function. Co-expression of poorly characterized or novel genes with genes of known function may provide a simple means of gaining insight into the functions of many genes for which information is not currently available (Eisen et al., 1998). Co-regulated families of genes cluster together, as was demonstrated by the clustering of ribosomal genes as a group (Alon et al., 1999). Clustering is also used to identify the grouping patterns of specimens and has been widely used in studying the heterogeneity of cancer. Clinical breast cancers cluster as distinct groups based on their gene expression profiles, and these groups can be correlated with clinical outcomes (Sorlie et al., 2001).
Most clustering techniques use a distance metric to define the similarity or difference between two objects. Some of the most common distance metrics are Euclidean distance, Manhattan distance and correlation distance. Euclidean distance is the distance between two points that would be measured with a ruler, and can also be calculated by repeated application of the Pythagorean theorem. The distance measure is therefore:
$$\text{Distance} = \sqrt{\sum_{i} (X_i - Y_i)^2}$$
X and Y are expression vectors of genes or samples.
Manhattan distance is the distance between two points expressed as the sum of the absolute differences of their coordinates. The distance between point P1 with coordinates (x1, y1) and point P2 with coordinates (x2, y2) is therefore |x1 - x2| + |y1 - y2|.
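As an illustration of the two definitions above, the following minimal sketch computes both distances for vectors of equal length. The function names `euclidean_distance` and `manhattan_distance` and the example points are assumptions chosen for this example, not part of any particular microarray package.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan_distance(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


# Hypothetical points P1 = (0, 0) and P2 = (3, 4).
p1 = (0, 0)
p2 = (3, 4)

print(euclidean_distance(p1, p2))  # 5.0
print(manhattan_distance(p1, p2))  # 7
```

The same functions apply unchanged to longer expression vectors, since they iterate over all coordinates.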
Correlation distance expresses the similarity between two objects as the correlation between them. The Pearson correlation is often taken as the distance measure in microarray data clustering. Correlation values range from -1 to +1. Positive values indicate a positive correlation (i.e. an increase in the value of one corresponds to an increase in the value of the other). Negative values indicate a negative correlation (i.e. an increase in the value of one corresponds to a decrease in the value of the other).
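The Pearson correlation between two expression vectors can be sketched as below. Converting the correlation into a distance as 1 - r is a common convention and is an assumption here; the text does not state which conversion is used. The function names are likewise illustrative.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return covariance / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """Assumed convention: distance = 1 - r, so r = +1 gives distance 0."""
    return 1.0 - pearson_correlation(x, y)


# Hypothetical profiles: gene_b rises with gene_a, gene_c falls as gene_a rises.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]
gene_c = [6.0, 4.0, 2.0]

print(pearson_correlation(gene_a, gene_b))  # positive correlation
print(pearson_correlation(gene_a, gene_c))  # negative correlation
```

Under this convention, perfectly correlated profiles have distance 0 and perfectly anti-correlated profiles have distance 2.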
1.6.2.1 Hierarchical clustering
Hierarchical clustering is a technique to generate a hierarchy among objects based on their similarity or differences. The similarity or difference is measured using the distance measures described above. The hierarchy may be constructed using an agglomerative or a divisive approach. It is represented as a tree, also known as a dendrogram, with individual elements at one end and a single cluster containing every element at the other (Fig 1.6.2.1.1). Agglomerative algorithms begin at the leaves of the tree and merge upwards, whereas divisive algorithms begin at the root and split downwards. Agglomerative clustering can use single linkage, complete linkage or average linkage.
Fig 1.6.2.1.1: An example of a tree or dendrogram. The leaves are shown in red and the nodes are shown in blue. A leaf represents an entity and a node represents the relationship between two entities, between one entity and one node, or between two nodes.
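The agglomerative (bottom-up) approach can be sketched as follows. This is a naive illustration under stated assumptions rather than an optimized implementation: the objects are one-dimensional values, the distance between two values is their absolute difference, and the distance between two groups defaults to that of their closest pair. The function name `agglomerative` and the example values are hypothetical.

```python
def agglomerative(values, linkage=min):
    """Merge the two closest groups repeatedly until one group remains.

    Returns the merges in order as (group_a, group_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [[v] for v in values]  # every object starts as its own leaf
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Group distance: linkage over all cross-group pairs.
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical expression values for five genes.
for group_a, group_b, distance in agglomerative([1.0, 2.0, 6.0, 7.0, 15.0]):
    print(group_a, group_b, distance)
```

Passing `linkage=max` switches the group distance to the most distant pair without changing the rest of the procedure.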
Single linkage clustering: The distance between groups is defined as the distance between the closest pair of objects, and only pairs consisting of one object from each group are considered (Fig 1.6.2.1.2).
Fig 1.6.2.1.2: Single linkage clustering. The closest pair of elements, one from each cluster, is used to calculate the reference distance between the two clusters.
Complete linkage clustering: Complete linkage, also called the farthest neighbour clustering method, is the opposite of single linkage. The distance between groups is defined as the distance between the most distant pair of objects, one from each group (Fig 1.6.2.1.3).
Average linkage clustering: Distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group (Fig 1.6.2.1.4).
Fig 1.6.2.1.4: Average linkage clustering. The average over the elements of each cluster is used to calculate the reference distance between the two clusters. The green point is the average, or centroid, of the cluster.
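The three linkage criteria differ only in how the pairwise distances between two groups are summarized. The sketch below follows the definitions given in the text, assuming Euclidean distance between objects; the function names and the two example clusters are illustrative.

```python
import math
from itertools import product


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise_distances(cluster_a, cluster_b):
    """Distances for every pair made of one object from each cluster."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of objects."""
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Distance between the most distant pair of objects."""
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean of the distances over all cross-cluster pairs."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Two hypothetical clusters lying on a line.
cluster_a = [(0, 0), (1, 0)]
cluster_b = [(3, 0), (5, 0)]

print(single_linkage(cluster_a, cluster_b))    # 2.0
print(complete_linkage(cluster_a, cluster_b))  # 5.0
print(average_linkage(cluster_a, cluster_b))   # 3.5
```

For the same pair of clusters, the single linkage distance is never larger than the average linkage distance, which in turn is never larger than the complete linkage distance.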
Hierarchical clustering has been extensively used in cancer research to identify relationships among genes and samples. Hierarchical clustering using multiple markers can group breast cancers into various classes with clinical relevance and is superior to individual prognostic markers (Makretsov et al., 2004). It has also been widely used in studying the sub-groups of breast cancer (Sorlie et al., 2001; Charafe-Jauffret et al., 2006; Weigelt et al., 2005; Hu et al., 2006).
1.6.2.2 K-Means clustering
The k-means algorithm clusters objects into k partitions using the similarity between the objects. k is the number of partitions/clusters and is provided by the user. The algorithm begins by assigning the objects to k initial sets, either at random or by using some heuristic approach. It then calculates the centroid (mean point) of each set. Thereafter, it constructs a new partition by associating each object with the closest centroid. The centroids are then recalculated for the new clusters, and these two steps are repeated alternately until convergence, which is reached when the objects no longer switch clusters or the centroids no longer change. K-means is one of the most commonly used clustering methods and has wide application in microarray studies (Do and Choi 2008).
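The alternation between assigning objects to the closest centroid and recalculating the centroids can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes squared Euclidean distance, seeds the centroids with k randomly chosen objects unless starting centroids are supplied, and keeps the previous centroid if a cluster becomes empty. The function and parameter names are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Centroid: coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, initial_centroids=None, seed=0, max_iter=100):
    """Return (assignment, centroids) once no object switches cluster."""
    if initial_centroids is None:
        # Assumption: seed with k distinct objects chosen at random.
        initial_centroids = random.Random(seed).sample(points, k)
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Step 1: associate each object with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # convergence: nothing moved
            break
        assignment = new_assignment
        # Step 2: recalculate the centroid of each cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if the cluster is empty
                centroids[j] = mean_point(members)
    return assignment, centroids


# Two hypothetical, well separated groups of samples.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = kmeans(samples, k=2, initial_centroids=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Supplying `initial_centroids` makes a run reproducible; leaving it out falls back to random seeding controlled by `seed`.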
Limitations of k-means clustering (MacKay 2003)
1) Since k-means clustering starts with random seed points, the end result can differ between runs and depends on the initial random vector.
2) K-means clustering needs the number of clusters from the user and forces all the genes/samples to fit into that defined number of clusters.
3) It does not work well with non-globular clusters. Non-globular clusters are those that are not roughly spherical, for example elongated or irregularly shaped clusters, whose boundaries are not well defined by distance to a centroid.
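The first limitation can be illustrated with a small constructed example: four points at the corners of a 4 x 1 rectangle, clustered with k = 2 from two different pairs of starting centroids. The code is a compact sketch of the k-means iteration assuming squared Euclidean distance; the function names `lloyd` and `within_cluster_ss` and the data are assumptions made for this illustration.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centroids, max_iter=100):
    """Plain k-means iteration from the given starting centroids."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: no object switched cluster
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each point to its own centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


corners = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start A: one seed on each short side of the rectangle.
labels_a, centres_a = lloyd(corners, [(0, 0), (4, 0)])
# Start B: both seeds on the same short side.
labels_b, centres_b = lloyd(corners, [(0, 0), (0, 1)])

print(labels_a, within_cluster_ss(corners, labels_a, centres_a))  # [0, 0, 1, 1] 1.0
print(labels_b, within_cluster_ss(corners, labels_b, centres_b))  # [0, 1, 0, 1] 16.0
```

Both runs converge, but to different partitions. Start B settles on a much higher within-cluster sum of squares, which is why k-means is often repeated from several starting points and the best result kept.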