• No results found

1.2 Data analysis methods

1.2.6 Cluster analysis

Cluster analysis has become a standard computational method for gene function dis-covery as well as for more general explanatory data analysis. e objective of cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. In the analysis of RNAi screens, clustering is used to group genes giving similar phenotypes. In transcriptomics, cluster analysis is applied to group genes based on their expression paerns.

Clustering algorithms can be divided into three categories: hierarical, partitional and model-based. Hierarical algorithms find successive clusters using previously es-tablished clusters. ese algorithms can be either agglomerative (‘boom-up’) or divi-sive (‘top-down’). Partitional algorithms typically determine all clusters at once. Model-based algorithms assume that the data were generated by a model and tries to recover the parameters describing it. Calculated parameters are then used to define clusters and the assignment of observations.

Regardless of the used method, the final number of clusters must be decided at some point. Algorithms helping in taking the right decision are described in the last part of this section.

Hierarical clustering

Hierarical clustering creates a hierary of clusters that may be represented in a bi-nary tree data structure called a dendrogram. A ‘root’ of the tree consists of a single cluster containing all observations, and the ‘leafs’ correspond to individual observations.

e final grouping of observations is obtained by cuing the dendrogram at a specified

‘height’ (se Figure 1.5).

1 2 3 4 5 root

leaf observations

cut-off height

A B C

clusters

Figure 1.5: Hierarical clustering dendrogram

Observations 1 to 5 belong to their own clusters - ‘leafs’ of the dendrogram. Observations 2 and 3 are the most similar to ea and thus, they are merged at the lowest ‘height’ into one cluster (B). Sucesive merging of observations and formed clusters lead to the final single cluster called ‘root’. Application of a similarity cut-off allows a final definition of clusters. In this example, three clusters are defined: A, containing a single observation 1; B with observations 2 and 3; C with observations 4 and 5.

A dendrogram can be constructed either by iterative division of bigger clusters into two subclusters or by merging two subcluster into one bigger. In order to decide whi

clusters should be combined (for agglomerative clustering), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required.

In most methods of hierarical clustering, this is aieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criteria that specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

In the analysis of high-throughput data, the commonly used metrics include the Euclidean distance, Mahalonobis distance and other metrics based on correlation coef-fitients (i.e. Pearson product-moment correlation, uncentered correlation).

A dissimilarity between sets of observations can be computed using different linkage criteria (for visual interpretation, see Figure 1.6):

Complete linkage (furthest neighbor). A distance between sets is defined as the maximum distance between any observation from one set, to any observation from the second set. is method usually performs quite well in cases when the samples are naturally separated into distinct groups.

Single linkage (nearest neighbor). As opposite to the previous method, single linkage takes the minimum distance between any two observations from two sets. is method is suitable for more diffused or elongated clusters.

Unweighted pair-group average (UPGMA). In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. UPGMA is a compromise between the two previous methods, performing well in cases of well defined and difused clusters of data. It is the most commonly used linkage tenique.

Weighted pair-group average (WPGMA). is method is identical to the unweighted pair-group average method, except that in the computations, the size of the respective clusters (i.e., the number of observations contained in them) is used as a weight. us, this method (rather than the previous method) should be used when the cluster sizes are suspected to be greatly uneven.

Unweighted pair-group centroid (UPGMC). e distance between clusters is de-fined as the distance between their centroids, dede-fined as an average of all objects in the cluster.

Weighted pair-group centroid (WPGMC). is method is identical to the previous one, except that weighting is introduced into the computations to take into consideration differences in cluster sizes (i.e., the number of objects contained in them). us, when there are (or we suspect there to be) considerable differences in cluster sizes, this method is preferable to the previous one.

Ward’s method. is method is distinct from all other methods because it uses an analysis of variance approa to evaluate the distances between clusters. In short, this method aempts to minimize the Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at ea step. is method is regarded as very efficient, however, it tends to create clusters of small size.

Single linkage Complete linkage

Pair-group average Pair-group centroid

Figure 1.6: Hierarical clustering linkage criteria.

Calculation of distances between two clusters according to different linkage criteria. Dots represent obser-vations in two-dimensional feature space and ellipses indicate obserobser-vations belonging to a single cluster.

Lines show distances taken into consideration for calculating distances between clusters. Crosses in the

‘Pair-group centroid’ axis represent centroids of the clusters.

Partitional clustering

e most widely used partitional clustering method is a 𝑘-means algorithm (Maceen, 1967). is algorithm divides a given set of observations 𝑋 = 𝐱􏷠, 𝐱􏷡, … , 𝐱𝑛), where ea

is a 𝑑-dimensional vector, into an a priori decided 𝑘 sets (𝑘 < 𝑛) {𝑆􏷠, 𝑆􏷡, … , 𝑆𝑘}. e aim is to minimize the within-cluster sum of squares:

arg min

𝐒 𝑘

􏾜

𝑖=􏷠

􏾜

𝐱𝑗∈𝑆𝑗

‖𝐱𝑗− 𝝁𝑖􏷡 (1.17)

where 𝝁𝑖 is the mean of all observations in 𝑆𝑖 and the ‖ • ‖ operator represents any distance metric (e.g. Euclidean distance).

A derivative of the 𝑘-means algorithm is the fuzzy 𝑐-means clustering (Bezdek, 1981).

In the fuzzy clustering, ea observation has a degree of belonging to clusters, rather than belonging completely to just one cluster. e fuzzy logic equations are used in the processes of finding the optimal solution. en, the cluster space is discretized and ea

observation is associated with a single cluster.

Model-based clustering

In model-based methods (Villarroel et al., 2009), ea observation 𝐱𝑖 is modelled by a finite mixture distribution with the prior probability 𝜶𝑗 that every sample 𝐱𝑖is a mem-ber of only one mixture component 𝑗, and the conditional probability modelling ea

component 𝑗 by the parametrized probability density function 𝑃𝑗􏿴𝐱𝑖|𝚯𝑗􏿷 (usually the multivariate Gaussian distribution). e finite mixture model expresses the probability of observing the sample 𝐱𝑖 as a sum of individual components:

𝑃(𝐱𝑖|𝚯) =

𝑘

􏾜

𝑗=􏷠

𝜶𝑗𝑃𝑗􏿴𝐱𝑖|𝚯𝑗􏿷 (1.18)

e aim of this clustering tenique is to find su parameters 𝚯𝑗and 𝜶𝑗, that maximize the joint probability of observing the data set 𝐱. is goal is usually obtained by apply-ing Expectation-Maximization (EM) algorithms. e number of mixture components is analogous to the number of clusters and the association between an observation 𝐱𝑖and a cluster is based on the final probability 𝑃𝑗􏿴𝐱𝑖|𝚯𝑗􏿷 (whi should be maximal for the cluster 𝑗).

e main advantage of all partitional and model-based clustering teniques, com-pared to hierarical clustering, is that they aempt to find centers of natural clusters in the data set. However, their drawba is that an inappropriate oice of the number of clusters 𝑘 may yield misleading results.

Optimizing number of clusters

e best method of determining the optimal number of clusters is to partition data into 𝑘 clusters and measure the quality of this clustering. e final partitioning is found by comparing quality measures for different values of 𝑘. However, the goal of cluster analysis is not to find the best partitioning of the given sample, but to approximate the true partitions of the underlying space. In the analysis of biological data, usually, the best quality measures are obtained for very large number of clusters. is is the classical overfiing problem, whi can be overcame by taking the smallest 𝑘 yielding reasonably good clustering quality and for whi an increase of 𝑘 would not significantly imrpove the clustering. is criterion is called an ‘elbow’ method (Van Ryzin, 1995).

e most straightforward method of measuring how well the clusters are formed is by calculating the average silhouee of all observations (Rousseeuw, 1987). A silhouee 𝑠 of an observation 𝑖 is calculated with the following formula:

𝑠𝑖 = 𝑏𝑖− 𝑎𝑖

max{𝑎𝑖, 𝑏𝑖} (1.19)

where 𝑎𝑖denotes the average dissimilarity of 𝑖 with all other observations belonging to the same cluster, and 𝑏𝑖is the average dissimilarity of 𝑖 with all observations belonging to the closest cluster to 𝑖. e closets cluster is defined by having the smallest average dissimilarity with 𝑖. Values of 𝑠𝑖 can range between -1 and 1, with -1 meaning that the observation 𝑖 should rather be a member of the closest cluster and 1 meaning that the observation 𝑖 fits well to its cluster. An average 𝑠𝑖 for a single cluster tells how compact the cluster is, and the average 𝑠𝑖for all observations defines the whole clustering quality.

Evaluation of model-based clusters can be based on the likelihood that the obtained clustering model describes the data well. e likelihood is one of the parameter calcu-lated by expectation-maximization (EM) clustering algorithms and reported as one of the clustering results.