K-means clustering - Developing an interactive webbased learning environment for bioinformatics

K-means clustering is one of the simplest and fastest clustering algorithms, and is therefore widely used. It is a non-hierarchical algorithm that starts by defining k points as cluster centres, or centroids, in the input space (i.e. the n-dimensional space defined by the number of genes, or the m-dimensional space defined by the number of samples).

The algorithm clusters the objects (e.g. genes/rows or samples/columns) of a dataset by iterating over the objects, assigning each object to one of the centroids, and moving each centroid towards the centre of a cluster. This process is repeated until some termination criterion is reached. When this criterion is reached, each centroid is located at a cluster centre, and the objects that are assigned to a particular centroid form a cluster. Thus, the number of centroids determines the number of possible clusters.

Hence, the number of centroids affects the results of the algorithm; a clustering using five centroids produces different results than a clustering using four centroids. The results are further affected by the initial positions of the centroids; different initial positions may cause an object to be assigned to a different centroid, and the algorithm may therefore yield a different set of clusters. Thus, the number of centroids and their initial positions have to be chosen carefully. There are different ways of implementing this algorithm. The BioTeach system implements two variants: the batch variant and the online variant (the names are taken from Ripley, 1996).
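The sensitivity to the initial centroid positions can be illustrated with a small sketch. It is not the BioTeach implementation: the function name `kmeans_1d`, the one-dimensional toy data and the tie-breaking rule are assumptions made for illustration, and the clustering itself is a compact form of the batch variant described in Section 4.3.1.

```python
def kmeans_1d(points, centroids, max_cycles=100):
    """Batch k-means on 1-D points; returns one sorted list of points per centroid.

    Hypothetical helper for illustration only.
    """
    centroids = list(centroids)
    assignment = None
    for _ in range(max_cycles):
        # Assign every point to its nearest centroid (ties go to the lower index).
        new_assignment = [
            min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # the assignments are stable
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: a centroid without points stays where it is
                centroids[j] = sum(members) / len(members)
    return [
        sorted(p for p, a in zip(points, assignment) if a == j)
        for j in range(len(centroids))
    ]


points = [0, 1, 5, 6, 10, 11]
clusters_a = kmeans_1d(points, centroids=[0, 1])    # both centroids start on the left
clusters_b = kmeans_1d(points, centroids=[10, 11])  # both centroids start on the right
print(clusters_a)  # [[0, 1], [5, 6, 10, 11]]
print(clusters_b)  # [[0, 1, 5, 6], [10, 11]]
```

Both runs use two centroids on the same data, yet the middle points (5 and 6) end up in different clusters depending on where the centroids started.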

4.3.1 The batch variant

The batch variant of the k-means algorithm can be divided into two steps: object assignment and centroid relocation. The first step, object assignment, starts once the centroids have been placed in the input space. In this step the algorithm iterates over the objects in the dataset and assigns each object to the closest centroid. The next step, centroid relocation, moves each centroid to the position in the input space that corresponds to the average position of the vectors representing the objects assigned to that centroid.

Chapter 4 Clustering of microarray data

As the centroids are moved, some of the objects may now be closer to different centroids than the ones they were initially assigned to, requiring the object assignment step to be repeated. As the assignments of the objects are re-evaluated, some centroids may receive additional objects, while others may have some objects removed. The average position of the vectors representing the objects assigned to a centroid may thus shift, requiring the centroids to be relocated again.
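The two steps can also be written compactly. The notation below is an assumption introduced for illustration (it is not taken from the thesis): x_i is the vector representing object i, c_j is the position of centroid j, S_j is the set of objects currently assigned to centroid j, and the norm stands for whichever distance measure is used (for example the Euclidean distance).

```latex
% Object assignment: each object goes to its closest centroid
a(x_i) = \arg\min_{j \in \{1,\dots,k\}} \lVert x_i - c_j \rVert

% Centroid relocation: each centroid moves to the average of its objects
c_j = \frac{1}{\lvert S_j \rvert} \sum_{x_i \in S_j} x_i
```

Relocating c_j can change which centroid minimises the distance for some x_i, which is why the two steps have to be repeated.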

The cycle of object assignment and centroid relocation is repeated until the clusters stabilise (i.e. the objects assigned to the centroids remain the same), or until a predefined maximum number of cycles (e.g. 20,000 to 100,000) has been reached.

The batch variant implemented in the BioTeach system uses the former termination criterion.

Thus, the steps involved in a k-means batch clustering are:

1) iterate over the set of objects, and for each object in the set
   a. find the closest centroid
   b. assign the object to the closest centroid
2) iterate over the centroids, and for each centroid
   a. calculate the average vector of the objects that are assigned to the centroid
   b. relocate the centroid to the position of the average vector of the objects that are assigned to the centroid
3) repeat steps 1 and 2 until the centroids no longer have to be relocated, or until the predefined maximum number of cycles is reached.
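The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the BioTeach code: the function name `kmeans_batch`, the use of the Euclidean distance, the default for `max_cycles` and the handling of centroids without objects are all choices made here for the example.

```python
import math


def kmeans_batch(objects, centroids, max_cycles=1000):
    """Batch k-means sketch; `objects` and `centroids` are lists of equal-length tuples."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_cycles):
        # Step 1: assign each object to its closest centroid
        # (assumption: closeness is measured with the Euclidean distance).
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(obj, centroids[j]))
            for obj in objects
        ]
        # Step 3: stop once no assignment changed, since no centroid would move.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2: relocate each centroid to the average vector of its objects.
        for j in range(len(centroids)):
            members = [obj for obj, a in zip(objects, assignment) if a == j]
            if members:  # assumption: a centroid with no objects stays where it is
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


objects = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignment = kmeans_batch(objects, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(centroids)   # approximately (1.33, 1.33) and (8.33, 8.33)
```

In this toy run the first three objects form one cluster and the last three the other, and the loop stops on the second cycle because the assignments no longer change.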


4.3.2 The online variant

This variant of the k-means algorithm uses the same approach as the batch variant, that is, it can be divided into the same two steps. The two variants do, however, differ in how they execute these steps. While the batch variant iterates over all the objects of the dataset before the centroids are relocated, the online variant moves a centroid at each step of the iteration, that is, each object of the dataset pulls the nearest centroid a certain distance towards itself. In the BioTeach system this distance is 1% of the distance between the object and the centroid. This approach is similar to that of Self-Organizing Maps (which are discussed in the next section), and the result is that the centroids appear to glide rather than jump towards the cluster centres of the dataset.
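The pull exerted by a single object can be written as an update rule. The symbols are assumptions introduced here for illustration: x is the vector of the object, c is the position of its nearest centroid, and η is the fraction of the distance covered per update (0.01 for the 1% stated above).

```latex
c \leftarrow c + \eta \, (x - c), \qquad \eta = 0.01

% After the update the centroid is slightly closer to the object:
\lVert x - c_{\text{new}} \rVert = (1 - \eta) \, \lVert x - c_{\text{old}} \rVert
```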

The online variant also uses a different termination criterion than the batch variant. The centroids are only moved a slight distance each time, and the objects assigned to a centroid could therefore appear to be stable for a while. However, as the centroid moves towards the cluster centre, it could move in such a way that it becomes the closest centroid to objects that are assigned to other centroids. Thus, it is possible for a centroid to "steal" objects from other centroids and change an object assignment that seemed to be stable. The termination criterion used in the batch variant would, in such cases, cause the algorithm to terminate prematurely. One way of ensuring convergence, and of avoiding premature termination, is to gradually reduce the distance a centroid is moved over a number of iterations. The termination criterion implemented in the BioTeach system will be discussed in the next chapter.

The steps involved in a k-means online clustering are thus:

1) iterate over the set of objects, and for each object
   a. find the closest centroid
   b. move the closest centroid a certain distance towards the object
2) repeat step 1 until the termination criterion is reached.
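A minimal Python sketch of these steps is given below. It is an illustration under assumptions rather than the BioTeach implementation: the function name `kmeans_online`, the Euclidean distance, the multiplicative `decay` of the step size and the fixed number of `passes` used as the termination criterion are all hypothetical choices. Only the initial step of 1% comes from the text, and the actual BioTeach termination criterion is not described in this section.

```python
import math


def kmeans_online(objects, centroids, rate=0.01, decay=0.999, passes=2000):
    """Online k-means sketch: every object pulls its nearest centroid towards itself."""
    centroids = [list(c) for c in centroids]
    for _ in range(passes):  # assumption: stop after a fixed number of passes
        for obj in objects:
            # a. find the closest centroid (assumption: Euclidean distance)
            j = min(range(len(centroids)), key=lambda i: math.dist(obj, centroids[i]))
            # b. move it a fraction `rate` of the way towards the object
            centroids[j] = [c + rate * (x - c) for c, x in zip(centroids[j], obj)]
        # Gradually shrink the step so that the centroids settle down.
        rate *= decay
    return centroids


objects = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
online_centroids = kmeans_online(objects, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(online_centroids)  # close to the two cluster centres
```

Because each update covers only a small fraction of the distance, the centroids approach the cluster centres gradually, and the shrinking step size keeps them from oscillating between the objects of a cluster.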


In document Developing an interactive webbased learning environment for bioinformatics. Master thesis. Daniel Løkken Rustad, UNIVERSITY OF OSLO (Page 44-47)