5.6 Object Merging Using Characteristic Rules
5.6.2 Cluster Analysis
Cluster analysis is the organisation of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity; patterns within a cluster are intuitively more similar to each other than they are to patterns belonging to a different cluster [138].
It is crucial to understand the difference between clustering and discriminant analysis, also known respectively as unsupervised and supervised classification [138,139]. In discriminant analysis, a collection of labelled patterns is provided and a learning/training procedure is conducted to produce decision boundaries [139]. The main task is to label a newly encountered, as yet unlabelled pattern based on these decision boundaries. In clustering, the main task is to group a set of unlabelled patterns into clusters whose members share a degree of similarity. In a sense, labels are also associated with clusters, but these category labels are data driven, which means that they are obtained solely from the data.
Humans perform competitively when clustering in two-dimensional space. However, many clustering problems in real applications involve a higher-dimensional feature space, because most patterns are better described using more features. The more features used to describe a pattern, the higher the dimension of the feature space becomes, and this is where humans fail: it is difficult to obtain an intuitive interpretation of data embedded in a high-dimensional space. Different approaches to data clustering are also described in [138,140].
Figure 5.7: Images visualising the features used to characterise a crack-network: a) the ratio of the shorter side against the longer side of an RMBR and the square root of the RMBR area can be used as features; b) the centroid of the crack-networks shown by the “x” sign and the axis of minimum inertia as shown by the dotted lines; c) a node as denoted by the “+” sign is also useful as a feature when their density with respect to either the
5.6.2.1 The Flexibility of Clustering
There is a vast choice of clustering methodologies, with flexibility in their implementation. The choice of a clustering algorithm depends on the preferences of the user and on the problem at hand, which might involve issues such as simplicity, speed, flexibility and reliability. The issues surrounding cluster analysis are as follows [138]:
• Agglomerative versus Divisive: An agglomerative technique begins with each pattern considered as a cluster and successively merges clusters until a stopping criterion is satisfied. On the other hand, a divisive technique begins with all patterns in a single cluster and performs splitting until a stopping criterion is satisfied.
• Monothetic versus Polythetic: This issue relates to the sequential or simultaneous use of features in a clustering process. The majority of techniques are polythetic, where all features are used to compute distances between patterns and decisions are based on these distances. Monothetic approaches consider features sequentially rather than simultaneously. A monothetic clustering approach is explained by Anderberg [141], where a dataset is divided into two by using feature $f_1$. Each of the resultant clusters is further divided independently using feature $f_2$. The main problem with this approach is that it creates $2^d$ clusters in a $d$-dimensional space. In a high-dimensional feature space, the number of clusters will be so large that the resulting clusters will be uninterestingly small and fragmented.
• Hard versus Fuzzy: A hard clustering approach allocates each pattern to a single cluster during operation and as an end result. A fuzzy clustering approach [142] assigns a pattern membership to all clusters in the dataset. Hence, each pattern is associated with every cluster. The patterns will have membership values in [0,1] for each cluster.
• Deterministic versus Stochastic: This issue is the most relevant to partitional approaches designed to optimise a squared error function. This optimisation can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labellings.
• Incremental versus Non-incremental: When a dataset to be clustered is large, execution time or memory space constraints affect the architecture of the algorithm. The advent of data mining has nurtured the development of clustering techniques that minimise the number of patterns examined during execution, or reduce the size of the data structures used in the process.
5.6.2.2 The Hierarchical Agglomerative Clustering Algorithm
Hierarchical clustering methods are categorised into agglomerative (bottom-up) and divisive (top-down) [140, 143]. As mentioned in Section 5.6.2.1, agglomerative clustering starts with one-point (singleton) clusters and recursively merges two or more of the most appropriate clusters. Divisive clustering starts with one cluster containing all data points and recursively splits the most appropriate clusters. The process continues until a stopping criterion is achieved. The advantages of hierarchical clustering include [144]:
• Embedded flexibility regarding the level of granularity.
• Ease of handling of any forms of similarity or distance.
• Consequently, applicability to any attribute types.
The disadvantages of hierarchical clustering relate to:
• Vagueness of termination criteria.
• The fact that most hierarchical algorithms do not revisit intermediate clusters, once constructed, for the purpose of improving them.
A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and the similarity levels at which groupings change. A dendrogram (see Figure 5.11(b)) is an n-tree [145] with the additional property that a height $h$ is associated with each of the internal nodes. For each pair of objects $(i, j)$, $h_{ij}$ is defined as the height of the internal node specifying the smallest class to which both objects $i$ and $j$ belong; the smaller the value of $h_{ij}$, the more similar objects $i$ and $j$ are regarded to be.
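The definition of $h_{ij}$ can be illustrated with a minimal sketch. The merge list, its values and the function name `dendrogram_height` are hypothetical and chosen only for illustration; the sketch assumes that merges are recorded in the order in which they occur, each as a pair of clusters and the height of the internal node that joins them.

```python
def dendrogram_height(merges, i, j):
    """Return h_ij, the height of the smallest class containing both i and j.

    `merges` is a list of (left, right, height) tuples in merge order,
    where `left` and `right` are frozensets of object indices.
    """
    if i == j:
        return 0.0  # an object is in its own singleton class at height 0
    for left, right, height in merges:
        merged = left | right
        if i in merged and j in merged:
            # The first merge containing both objects is the smallest class.
            return height
    raise ValueError("objects i and j are never merged")


# Hypothetical dendrogram over four objects (illustrative heights only).
example_merges = [
    (frozenset({0}), frozenset({1}), 1.0),
    (frozenset({2}), frozenset({3}), 1.5),
    (frozenset({0, 1}), frozenset({2, 3}), 4.0),
]

print(dendrogram_height(example_merges, 0, 1))  # objects joined early
print(dendrogram_height(example_merges, 0, 2))  # objects joined at the root
```

Objects 0 and 1 share a low internal node and are therefore regarded as more similar than objects 0 and 2, which only meet at the root.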
The hierarchical agglomerative clustering algorithm is as follows:
1. Compute the proximity matrix containing the distance between each pair of patterns and treat each pattern as a cluster.
2. Find the most similar pair of clusters using the proximity matrix.
3. Merge these two clusters into a single cluster.
4. Update the proximity matrix to reflect the merge, then repeat from step 2 until the stopping criterion is achieved.
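The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: it assumes Euclidean distance between patterns, single linkage between clusters, and a stopping criterion of a target number of clusters (the parameter `num_clusters` and the function names are hypothetical). Instead of updating a cluster-level matrix in place, it recomputes cluster distances from the stored pattern-level proximity matrix, which is simpler but slower.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two patterns given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def agglomerate(points, num_clusters):
    """Naive agglomerative clustering; returns clusters as lists of indices."""
    n = len(points)
    # Step 1: proximity matrix between patterns; each pattern is a cluster.
    prox = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    # Assumed stopping criterion: a target number of clusters.
    while len(clusters) > num_clusters:
        # Step 2: find the most similar pair of clusters (single linkage).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(prox[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Step 3: merge the two clusters into a single cluster.
        clusters[a] = clusters[a] + clusters[b]
        # Step 4: drop the absorbed cluster; distances to the merged cluster
        # are recomputed from `prox` on the next pass.
        del clusters[b]
    return clusters


sample = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(sample, 3))
```

Replacing `min` with `max` in step 2 would give complete linkage under the same assumptions.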
5.6.2.3 Linkage Metric
Hierarchical agglomerative clustering initialises a cluster system as a set of singleton clusters and proceeds iteratively by merging the most appropriate clusters until a stopping criterion is achieved. The appropriateness of clusters for merging depends on the (dis)similarity of cluster elements. This reflects a general presumption that clusters consist of similar points. An important example of (dis)similarity between two points is the distance between them.
To merge subsets of points rather than individual points, the distance between individual points has to be generalised to the distance between subsets. Such a derived proximity measure is called a linkage metric. The type of linkage metric used significantly affects hierarchical algorithms, since it reflects the particular concept of closeness and connectivity. The main inter-cluster linkages [146] include single linkage, complete linkage, average linkage, centroid linkage and incremental sum of squares (Ward) linkage [145]. Among these, the single linkage and the complete linkage are the most widely used [138].
In the single linkage scheme, the distance between two clusters is taken as the minimum distance between all pairs of patterns drawn from the two clusters. In the complete linkage scheme, the distance between two clusters is the maximum of all possible distances between points of the two clusters.
In practice, the choice of linkage scheme affects the outcome of the hierarchical process. The complete linkage algorithm produces tightly bound or compact clusters [147]. The single linkage, on the other hand, suffers from a chaining effect [148] and has a tendency to produce clusters that are straggly or elongated. In most applications, it has been observed that complete linkage produces more useful hierarchies than single linkage [140].
Let $n_r$ be the number of objects in cluster $r$ and $n_s$ be the number of objects in cluster $s$. $x_{ri}$ is the $i$th object in cluster $r$ and $x_{sj}$ is the $j$th object in cluster $s$. The distance $d(r, s)$ between the two clusters under each linkage metric is as follows:
• Single Linkage: smallest distance between objects in the two clusters.
$$d(r, s) = \min\bigl(d(x_{ri}, x_{sj})\bigr), \quad i \in (1, \ldots, n_r),\; j \in (1, \ldots, n_s) \tag{5.5}$$
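• Complete Linkage: largest distance between objects in the two clusters.
$$d(r, s) = \max\bigl(d(x_{ri}, x_{sj})\bigr), \quad i \in (1, \ldots, n_r),\; j \in (1, \ldots, n_s) \tag{5.6}$$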
• Average Linkage: average distance between all pairs of objects in the two clusters.
$$d(r, s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} d(x_{ri}, x_{sj}) \tag{5.7}$$
• Centroid Linkage: distance between centroids of the two clusters.
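$$d(r, s) = d(\bar{x}_r, \bar{x}_s) \tag{5.8}$$
where
$$\bar{x}_r = \frac{1}{n_r} \sum_{i=1}^{n_r} x_{ri} \quad \text{and} \quad \bar{x}_s = \frac{1}{n_s} \sum_{j=1}^{n_s} x_{sj} \tag{5.9}$$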
• Ward Linkage: incremental sum of squares, which is the increase in the total within-cluster sum of squares as a result of joining clusters $r$ and $s$.
$$d(r, s) = \frac{n_r n_s}{n_r + n_s}\, d_{rs}^{2} \tag{5.10}$$
where $d_{rs}$ is the distance between the centroids of clusters $r$ and $s$, as defined for the centroid linkage.
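The five linkage metrics in Equations 5.5 to 5.10 can be compared side by side with a short sketch. It assumes that clusters are given as lists of coordinate tuples and that the point-to-point distance $d$ is Euclidean; the function names are illustrative and not taken from any particular library.

```python
import math


def euclidean(p, q):
    """Point-to-point distance d(x_ri, x_sj), assumed Euclidean here."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise(r, s):
    """All distances between objects of cluster r and objects of cluster s."""
    return [euclidean(p, q) for p in r for q in s]


def single_linkage(r, s):
    return min(pairwise(r, s))                     # Equation 5.5


def complete_linkage(r, s):
    return max(pairwise(r, s))                     # Equation 5.6


def average_linkage(r, s):
    return sum(pairwise(r, s)) / (len(r) * len(s))  # Equation 5.7


def centroid(cluster):
    """Component-wise mean of the objects in a cluster (Equation 5.9)."""
    return tuple(sum(column) / len(cluster) for column in zip(*cluster))


def centroid_linkage(r, s):
    return euclidean(centroid(r), centroid(s))     # Equation 5.8


def ward_linkage(r, s):
    n_r, n_s = len(r), len(s)
    d_rs = centroid_linkage(r, s)
    return n_r * n_s / (n_r + n_s) * d_rs ** 2     # Equation 5.10


cluster_r = [(0, 0), (0, 2)]
cluster_s = [(3, 0), (3, 2)]
for metric in (single_linkage, complete_linkage, average_linkage,
               centroid_linkage, ward_linkage):
    print(metric.__name__, metric(cluster_r, cluster_s))
```

For these two example clusters the single and centroid linkages both give 3, while the complete linkage gives the longer diagonal distance, which shows how the choice of metric changes which pair of clusters is considered closest.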
5.6.2.4 Distance Metric
The distance between two objects can be calculated using various metrics. Among the common ones are the Euclidean distance and the Manhattan distance (both special cases of the Minkowski distance) and the Mahalanobis distance. They are explained below.
Given two points $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$, the Minkowski metric for calculating the distance between $x$ and $y$ ($d_{xy}$) in Euclidean $n$-dimensional space $\mathbb{R}^n$ is defined as
$$d_{xy} = \left( \sum_{i=1}^{n} |x_i - y_i|^{p} \right)^{1/p}. \tag{5.11}$$
In the special case where $p = 1$, the Minkowski metric gives the Manhattan distance defined in Equation 5.12, and when $p = 2$, the metric represents the Euclidean distance described by Equations 5.13 and 5.14.
The Manhattan distance function ($p = 1$), which is also known as the City-Block distance, computes the distance that would be travelled to get from one data point to the other if a grid-like path is followed. The Manhattan distance between $x$ and $y$ is the sum of the absolute differences of their corresponding components, computed as
$$d_{xy} = \sum_{i=1}^{n} |x_i - y_i|. \tag{5.12}$$
The Euclidean distance ($p = 2$) between $x$ and $y$ is computed as
$$d_{xy} = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \ldots + |x_n - y_n|^2} = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}, \tag{5.13}$$
or, written in matrix form,
$$d_{xy} = \sqrt{(x - y)^{T}(x - y)}. \tag{5.14}$$
The statistical distance, or Mahalanobis distance, between $x$ and $y$ is defined as
$$d_{xy} = \sqrt{(x - y)^{T} C^{-1} (x - y)}, \tag{5.15}$$
where $C^{-1}$ is the inverse of the covariance matrix $C$ of the dataset.
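The distance metrics in Equations 5.11 to 5.15 can be checked numerically with a short sketch. It assumes points are given as tuples of numbers and that the inverse covariance matrix is supplied by the caller as a nested list (the parameter name `cov_inv` and the example values are hypothetical); estimating and inverting the covariance matrix is outside the scope of this example.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (Equation 5.11)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan(x, y):
    """Manhattan (City-Block) distance, the case p = 1 (Equation 5.12)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Euclidean distance, the case p = 2 (Equation 5.13)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance (Equation 5.15) for a given inverse covariance."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    # Quadratic form (x - y)^T C^{-1} (x - y), written out element by element.
    quad = sum(diff[i] * cov_inv[i][j] * diff[j]
               for i in range(n) for j in range(n))
    return math.sqrt(quad)


x = (1, 2)
y = (4, 6)
identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical inverse covariance: variance 4 on the first axis, 1 on the second.
scaled = [[0.25, 0.0], [0.0, 1.0]]

print(manhattan(x, y), minkowski(x, y, 1))
print(euclidean(x, y), minkowski(x, y, 2))
print(mahalanobis(x, y, identity), mahalanobis(x, y, scaled))
```

With the identity matrix the Mahalanobis distance reduces to the Euclidean distance of Equation 5.14, while the scaled matrix down-weights differences along the high-variance first axis.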