k-means clustering - Unsupervised learning algorithms

1.3.3.1 Unsupervised learning algorithms

1.3.3.1.1 k-means clustering

K-means clustering is an iterative algorithm that seeks to partition a dataset into a fixed number (k) of groups by minimizing some measure of within-group dissimilarity (Forgy, 1965; MacQueen, 1967). Given the number of centres in the dataset, the algorithm assigns each data point to its closest centre, with the objective of minimizing the summed distance between the data points and the centres of the clusters to which they belong.

Figure 1.1 - k-means clustering.

The six panels illustrate the process of k-means clustering. The example displays a collection of 12 data points distributed in a 2-dimensional feature space, with a single feature along each of the horizontal and vertical axes. There are 3 groups within the dataset, each represented by a different colour (a). The starting locations of the centroids can either be explicitly specified or randomly determined. Since k = 3, there are three centroids in this scenario, each represented as a filled circle (b). The algorithm then assigns each data point to the closest of the three centroids using a series of perpendicular bisectors (c). After all data points have been assigned, each centroid is shifted to the mean location of the data points assigned to it (d and e), and the process is repeated until no further displacement occurs (f).

Consider a dataset distributed in 2-dimensional space with one feature along the x axis and another along the y axis, as illustrated in figure 1.1. The value of k refers to the number of starting points (centroids) the algorithm will use to explore the feature space. This in essence specifies the maximum number of clusters expected within the dataset. The example in figure 1.1 uses a k value of 3. The starting locations of the centroids can either be specified by the operator or randomly determined. The data points are then assigned to their closest centroid using a series of perpendicular bisectors, resulting in the formation of 3 clusters based on the entire dataset. For each cluster, the associated centroid is updated to the mean of its constituent data points. The process is then repeated, iteratively moving each centroid to the minimum distortion point (MacKay, 2003), with termination of the process defined by a minimum displacement. In this way the algorithm does not necessarily need to evaluate all pair-wise dissimilarities between the data points.

To evaluate the displacement between iterations, a method to calculate the distance is necessary:

\[ d(\mathbf{x}, \mathbf{y}) = \frac{1}{2} \sum_{i} (x_i - y_i)^2 \]

Variables used: $\mathbf{x}^{(n)}$ is the n-th data point; $\mathbf{m}^{(k)}$ is the location (mean) of the k-th centroid; $x_i$ and $y_i$ are the i-th features of the vectors $\mathbf{x}$ and $\mathbf{y}$; $\hat{k}^{(n)}$ is the index of the centroid nearest to data point n; $r_k^{(n)}$ is the responsibility of centroid k for data point n.

Each data point is assigned to the nearest centroid within the set of centroids using the distance measure above.

\[ \hat{k}^{(n)} = \arg\min_{k} \, d(\mathbf{m}^{(k)}, \mathbf{x}^{(n)}) \]

\[ r_k^{(n)} = \begin{cases} 1 & \text{if } \hat{k}^{(n)} = k \\ 0 & \text{otherwise} \end{cases} \]

After assigning each data point to its closest centroid, the means are adjusted to match the sample means of the data points they are responsible for, i.e. the locations of the set of centroids are updated.

\[ \mathbf{m}^{(k)} = \frac{\sum_{n} r_k^{(n)} \mathbf{x}^{(n)}}{R^{(k)}} \]

Here $R^{(k)} = \sum_{n} r_k^{(n)}$ is the total responsibility of the mean k.

The process is then repeated until there is no further change in location for the set of centroids.
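The assignment and update steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names (`distance`, `assign`, `update`, `k_means`), the `init`, `seed` and `max_iter` parameters, the example data, and the choice to leave a centroid with no assigned points where it is are all hypothetical.

```python
import random


def distance(x, y):
    """Half the squared Euclidean distance between two points."""
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every data point."""
    return [min(range(len(centroids)), key=lambda k: distance(centroids[k], p))
            for p in points]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of the points assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Assumption: a centroid with no assigned points is left in place.
            updated.append(old)
            continue
        total = len(members)  # total responsibility of centroid k
        updated.append(tuple(sum(c) / total for c in zip(*members)))
    return updated


def k_means(points, k, init=None, seed=0, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    if init is None:
        # Randomly determined start locations, drawn from the data points.
        init = random.Random(seed).sample(points, k)
    centroids = [tuple(c) for c in init]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:  # no further displacement
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical example data: three well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
centroids, labels = k_means(points, k=3, init=[(0, 0), (10, 10), (20, 0)])
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Passing `init` corresponds to start locations specified by the operator, while omitting it draws them at random from the data.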

Although the algorithm converges to a local minimum, this is not necessarily the global minimum (Kanungo et al., 2002). In fact, depending on the starting point, the algorithm can arrive at a variety of solutions. To increase the probability of finding the optimal solution, it is recommended to perform multiple runs of the algorithm from different start locations (Bradley and Fayyad, 1998; Duda and Hart, 1973).
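The effect of the start locations, and the multiple-run strategy, can be illustrated with a short Python sketch. It is an assumption-laden example rather than a prescribed procedure: the helper names (`sq_dist`, `k_means`, `random_starts`, `best_of_restarts`), the example data, and the use of the summed squared distance to the nearest centroid as the score for comparing runs are all choices made here for illustration.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_means(points, centroids, max_iter=100):
    """Plain k-means from the given start locations; returns (centroids, distortion)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centroid (an assumption of this sketch).
        updated = [tuple(sum(c) / len(members) for c in zip(*members)) if members else old
                   for members, old in zip(clusters, centroids)]
        if updated == centroids:
            break
        centroids = updated
    distortion = sum(min(sq_dist(p, m) for m in centroids) for p in points)
    return centroids, distortion


def random_starts(points, k, n_restarts, seed=0):
    """Draw n_restarts sets of k start locations at random from the data."""
    rng = random.Random(seed)
    return [rng.sample(points, k) for _ in range(n_restarts)]


def best_of_restarts(points, starts):
    """Run k-means once per set of start locations and keep the lowest distortion."""
    return min((k_means(points, s) for s in starts), key=lambda run: run[1])


# Hypothetical example data: three well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
bad_start = [(0, 0), (0, 1), (10, 10)]     # two centroids start in the same group
good_start = [(0, 0), (10, 10), (20, 0)]   # one centroid starts in each group
_, bad = k_means(points, bad_start)
_, good = k_means(points, good_start)
print(good < bad)  # True
```

From `bad_start` the run settles into a solution that splits one group and merges the other two, which is the kind of local minimum that repeated runs are meant to guard against. In practice the start sets would come from `random_starts`.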

Applying k-means to lesion segmentation requires prior knowledge of the number of centroids, for example whether the data should be divided into healthy versus damaged tissue, or into healthy tissue versus multiple lesions (which form separate clusters if parameterised spatially).

These problems partly arise from the multi-dimensionality of the data and how the algorithm processes this information. Although it is possible to manually screen the images first (thereby forcing the algorithm to be at best semi-automated), this assumes the algorithm will always be able to correctly identify normal tissue and cluster these data points into one group. However, k-means clustering is known to have important limitations. These include its inability to represent the size, shape, weight or breadth of each cluster (MacKay, 2003). Therefore, successfully clustering normal tissue into a single group is unlikely to be the prevailing result, since the component clusters within the dataset are heterogeneous in terms of these features.
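A small numerical sketch may make the breadth limitation concrete. The values below are hypothetical, and the spread-scaled rule is not part of k-means; it is included only as a contrast, to show what information the k-means assignment ignores.

```python
def nearest_by_distance(x, means):
    """k-means rule: choose the closest mean, ignoring how broad each cluster is."""
    return min(range(len(means)), key=lambda k: abs(x - means[k]))


def nearest_by_scaled_distance(x, means, spreads):
    """Contrast rule (not k-means): distance measured in units of each cluster's spread."""
    return min(range(len(means)), key=lambda k: abs(x - means[k]) / spreads[k])


# Hypothetical 1-dimensional example: a narrow cluster at 0 and a broad one at 10.
means = [0.0, 10.0]
spreads = [0.5, 5.0]
x = 4.0

print(nearest_by_distance(x, means))                  # 0
print(nearest_by_scaled_distance(x, means, spreads))  # 1
```

The point at 4.0 lies eight spreads from the narrow cluster but only 1.2 spreads from the broad one, yet the distance-only rule assigns it to the narrow cluster because k-means has no representation of cluster breadth.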

Another drawback of the k-means method is hard clustering, whereby each data point is assigned to exactly one cluster and all points within a cluster carry equal weight. Intuitively, it would appear more appropriate if data points located between 2 or more clusters played a partial role in determining the centroids of all the clusters to which they could plausibly be assigned. To address this criticism, the soft k-means algorithm was developed (MacKay, 2003).

Only the assignment step is modified to account for the “slackness” factor of the algorithm:

\[ r_k^{(n)} = \frac{\exp\left(-\beta \, d(\mathbf{m}^{(k)}, \mathbf{x}^{(n)})\right)}{\sum_{k'} \exp\left(-\beta \, d(\mathbf{m}^{(k')}, \mathbf{x}^{(n)})\right)} \]

where $r_k^{(n)}$ is the responsibility of centroid k for data point n, $\mathbf{m}^{(k)}$ is the location of centroid k, and d is the distance measure used in the original algorithm.

This algorithm is similar to the original k-means formula, but possesses an additional parameter β. The parameter β represents how strictly the algorithm handles its borders: the larger its value, the more closely the soft k-means algorithm resembles the original k-means formula, and the two coincide as β approaches infinity.
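The role of the additional parameter can be seen by computing the soft assignment for a single point at different parameter values. The sketch below is illustrative only: the function names, the argument name `beta`, and the example coordinates are assumptions, and subtracting the smallest distance before exponentiating is a numerical-stability choice made here that leaves the resulting responsibilities unchanged.

```python
import math


def distance(x, y):
    """Half the squared Euclidean distance between two points."""
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def responsibilities(point, centroids, beta):
    """Soft assignment: each centroid's share of a point, proportional to exp(-beta * d)."""
    d = [distance(m, point) for m in centroids]
    d_min = min(d)  # shift so the largest exponent is 0; the ratios are unaffected
    weights = [math.exp(-beta * (dk - d_min)) for dk in d]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example: a point lying between two centroids, nearer the first.
centroids = [(0.0, 0.0), (3.0, 0.0)]
point = (1.0, 0.0)

soft = responsibilities(point, centroids, beta=1.0)    # shared between both centroids
hard = responsibilities(point, centroids, beta=100.0)  # almost entirely the nearest one
print(soft[0] > soft[1], hard[0] > 0.999)  # True True
```

With a small value of `beta` the point contributes to both centroids, whereas with a large value the assignment is effectively the hard assignment of the original algorithm.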

In spite of this, the necessity to specify k prior to execution remains a significant drawback. In a lesion segmentation application, k would be related to the number of lesions – one of the questions we are using lesion segmentation to answer. Removing the need to specify k would avoid the constraints and complications described above associated with k-means clustering. One alternative is the mean-shift clustering algorithm.

In document The foundations of lesion-function inference in the human brain (Page 35-40)