CHAPTER 2: HIERARCHIC FUSION
2.1 GENERAL PROCESS
2.2 METHODS
Linkage Techniques
(i) Single linkage. The similarity between two clusters is defined as the highest similarity between two individuals, one from each cluster. This method, which is generally attributed to Sneath (1957; see also Sokal and Sneath, 1963), has evidently been proposed independently by McQuitty ('Linkage analysis', 1957; 1961; 1967a) and Gengerelli (1963), and is also associated with minimum spanning tree techniques (Florek, et al, 1951; Gower and Ross, 1969). The method is well-known for its 'chaining' effect (Forgey, 1964, 1965; Needham, 1965a; Williams, et al, 1966; Hodson, et al, 1966; Lance and Williams, 1967a; Jardine and Sibson, 1968; Shepherd and Willmott, 1968; see also Chapter 6) which produces long straggling clusters. This is generally considered to be undesirable, especially with large populations for which the method tends to isolate the distribution core as one cluster and single peri
The hierarchical algorithm is developed by several writers (Williams, et al, 1966; Johnson, 1967) who select, at each fusion, those two clusters which contain the closest pair of individuals or the 'nearest neighbours'. The hierarchical algorithm is sometimes referred to as 'nearest neighbour', and it is simple to show that it derives all the N possible groupings which can be obtained with Sneath's original algorithm using any threshold.
(ii) Complete linkage. A group of individuals comprises a cluster provided that no two individuals have a similarity which is less than the threshold (Sørensen, 1948; Sokal and Sneath, 1963). This is the exact opposite of single linkage in the sense that the farthest neighbours must satisfy the similarity criterion; when d is used, spherical or 'tight' clusters are obtained. Macnaughton-Smith (1965), McQuitty ('Syndrome analysis', 1966a) and Johnson (1967) evidently suggest the hierarchical algorithm whereby two clusters are fused if the resulting least similarity between pairs of members is greatest. That is, using $d^2$, the diameter of the resulting cluster must be minimum. The method depends for its fusion decision on the vagaries of pairs of points, and is therefore rather unstable; furthermore, the diameter constraint is probably too severe, and sometimes a type of chaining is observed (Wishart, 1969b; Crawford, et al, 1970; see Appendices Ia and Id).
(iii) Average linkage. Sokal and Michener (1958), with their
into account group structure in clustering. Using product-moment correlation coefficients to measure the similarities between individuals, they define the similarity between two clusters as the average of all the similarities between pairs of individuals, one from each cluster. The hierarchical method is proposed by Ray and Berry (1965), Lance and Williams ('group average', 1966a, 1967a), and McQuitty ('similarity analysis', 1966b), and the concept of average linkage as a compromise between the single and complete linkage extremes is discussed by Sokal and Sneath (1963), Hodson, et al (1966), Proctor (1966) and Sneath (1966a). Average linkage is also used to augment single linkage as a de-chaining (Shepherd, 1966; Shepherd and Willmott, 1968) and counter-chaining (Carmichael, et al, 1968) mechanism. On the whole, the method seems to behave well (see Appendix Ic); however, the hierarchical algorithm has been known to chain with very large populations (e.g. Wishart, 1969d; Crawford, et al, 1970).
(iv) Median linkage. As an alternative compromise between the single and complete linkage extremes, Kendrick and Proctor (1964) propose a median linkage method which they say is 'easier than the mean¹ and unaffected by outlying values'; Proctor (1966) further claims that 'in the absence of a computer program it is easier to obtain than the arithmetic average'. The similarity between two clusters is defined as that similarity between two individuals, one from each cluster, which represents the median² of all between-cluster links; that is, one-half of the inter-cluster similarities are less than the median. This method will behave very much like average linkage, but is considerably more difficult to programme (despite the authors' claims). At the fusion of two clusters, the similarities between all other clusters and the new group must be obtained from a search of each submatrix of the similarity matrix which contains the between-group similarities for a cluster pair. Each search must also include an ordering mechanism to isolate the new median, which must then be stored elsewhere (the similarity matrix has to be retained in full). By contrast, average linkage has a very nice 'combinatorial solution' (see Sect. 8.1).
¹ Centroid sorting; see next paragraph.
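As an illustration only, the between-cluster criteria of methods (i) to (iv) can be sketched in Python as follows. The similarity matrix `sim`, the cluster index lists and all function names are hypothetical and are not taken from any of the programs cited above; for an even number 2r of links the median is assumed to be the (r+1)th highest similarity.

```python
def between_cluster_links(sim, cluster_p, cluster_q):
    """Similarities between every pair of individuals, one from each cluster."""
    return [sim[i][j] for i in cluster_p for j in cluster_q]


def single_linkage(sim, cluster_p, cluster_q):
    # (i) highest similarity between two individuals, one from each cluster
    return max(between_cluster_links(sim, cluster_p, cluster_q))


def complete_linkage(sim, cluster_p, cluster_q):
    # (ii) least similarity between pairs of members (farthest neighbours)
    return min(between_cluster_links(sim, cluster_p, cluster_q))


def average_linkage(sim, cluster_p, cluster_q):
    # (iii) average of all between-cluster similarities
    links = between_cluster_links(sim, cluster_p, cluster_q)
    return sum(links) / len(links)


def median_linkage(sim, cluster_p, cluster_q):
    # (iv) median of the between-cluster links; with 2r links this sketch
    # takes the (r+1)th highest (an assumed convention for the even case)
    links = sorted(between_cluster_links(sim, cluster_p, cluster_q), reverse=True)
    return links[len(links) // 2]


# Hypothetical similarity matrix for four individuals (illustration only).
sim = [
    [1.0, 0.9, 0.4, 0.2],
    [0.9, 1.0, 0.6, 0.3],
    [0.4, 0.6, 1.0, 0.8],
    [0.2, 0.3, 0.8, 1.0],
]
cluster_p, cluster_q = [0, 1], [2, 3]

criteria = {
    "single": single_linkage(sim, cluster_p, cluster_q),
    "complete": complete_linkage(sim, cluster_p, cluster_q),
    "average": average_linkage(sim, cluster_p, cluster_q),
    "median": median_linkage(sim, cluster_p, cluster_q),
}
```

Note that the median function has to sort the full list of between-cluster links at every call, which is the computational burden described above, whereas the average can be updated from previously stored values.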
(v) Proportional link linkage (Sneath, 1966a). As its name implies, proportional link linkage would combine two clusters if a specified proportion $\phi$ of the between-cluster similarities exceeded a chosen threshold (a similar suggestion is made by Shepherd and Willmott, 1968). The hierarchical procedure would require that the similarity $S_{pq}(\phi)$ between two clusters be defined as that between-cluster similarity which is the $(\phi k_p k_q)$th member of the ordered list of between-cluster coefficients, where $k_p$, $k_q$ are the cluster sizes, and $\phi k_p k_q$ is a rounded-up integer¹. Hence $\phi = 0$, $\tfrac{1}{2}$ and 1 exactly reproduce single, median and complete linkage. Fusion would be defined for those clusters p and q for which $S_{pq}(\phi)$ is greatest. Although theoretically nice, the method unfortunately suffers the same computational disadvantages as median linkage, and does not seem to have been programmed or used.
² If either of the clusters has an even number of members, then there will be an even number 2r of between-cluster similarities. For convenience we shall adopt the (r+1)th highest similarity as the median in this case.
Centroid Sorting
One very attractive generalisation of the hierarchical fusion method is 'centroid sorting', for which a group of one or more individuals is represented by a point located at the group's mean or centroid. This concept permits us to compare two groups in terms of any quantitative similarity criterion (see Chapter 1): for example, we may use the distance separating their centroids as a measure of similarity, or the cosine of the angle between two lines connecting the origin with the centroids. Obviously the cosine criterion depends on the position of the origin, while distance does not, and therefore centroid sorting permits us to compare very different similarity criteria within the same clustering framework (see Chapter 5).
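The remark about the origin can be checked numerically. The following is a minimal sketch, with hypothetical data and function names, which represents two groups by their centroids and then moves the origin by translating every point: the squared distance between the centroids is unchanged, while the cosine of the angle between them is not.

```python
import math


def centroid(points):
    """Mean of each coordinate over the points of a group."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine of the angle between the lines joining the origin to a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def shift(points, offset):
    """Translate every point, which is equivalent to moving the origin."""
    return [[x + o for x, o in zip(point, offset)] for point in points]


# Hypothetical groups (illustration only).
group_a = [[1.0, 0.0], [3.0, 0.0]]  # centroid (2, 0)
group_b = [[0.0, 1.0], [0.0, 3.0]]  # centroid (0, 2)
offset = [5.0, 5.0]

d2_before = squared_distance(centroid(group_a), centroid(group_b))
d2_after = squared_distance(centroid(shift(group_a, offset)),
                            centroid(shift(group_b, offset)))

cos_before = cosine(centroid(group_a), centroid(group_b))
cos_after = cosine(centroid(shift(group_a, offset)),
                   centroid(shift(group_b, offset)))
```

In this example the two centroids are at right angles as seen from the original origin, but appear nearly collinear once the origin is moved away from the data.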
¹ The term 'rounded-up' is used here in the special sense that for any value $K \le \phi k_p k_q < (K+1)$, we choose the number $(K+1)$, excepting the case when $\phi = 1$, where we use $k_p k_q$.
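The 'rounded-up' rule for proportional link linkage can be sketched as follows, writing the specified proportion as `phi` and assuming that the between-cluster similarities are ordered from highest to lowest. The function names and the similarity matrix are hypothetical. Exact fractions are used for `phi` in the example because a floating-point product may fall just below an integer and so select the wrong rank.

```python
import math
from fractions import Fraction


def proportional_link_rank(phi, k_p, k_q):
    """1-based rank, in the descending list of k_p * k_q links, chosen by phi."""
    n_links = k_p * k_q
    if phi == 1:
        return n_links  # special case: the lowest similarity of all
    # for K <= phi * k_p * k_q < K + 1, choose K + 1
    return math.floor(phi * n_links) + 1


def proportional_link_similarity(sim, cluster_p, cluster_q, phi):
    links = sorted((sim[i][j] for i in cluster_p for j in cluster_q), reverse=True)
    rank = proportional_link_rank(phi, len(cluster_p), len(cluster_q))
    return links[rank - 1]


# Hypothetical similarity matrix for four individuals (illustration only).
sim = [
    [1.0, 0.9, 0.4, 0.2],
    [0.9, 1.0, 0.6, 0.3],
    [0.4, 0.6, 1.0, 0.8],
    [0.2, 0.3, 0.8, 1.0],
]
cluster_p, cluster_q = [0, 1], [2, 3]

as_single = proportional_link_similarity(sim, cluster_p, cluster_q, 0)
as_median = proportional_link_similarity(sim, cluster_p, cluster_q, Fraction(1, 2))
as_complete = proportional_link_similarity(sim, cluster_p, cluster_q, 1)
```

With four links (2r = 4), a proportion of one-half selects the third highest similarity, which agrees with the (r+1)th convention adopted for the median.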
Sokal and Michener (1958) seem to have been the first to adopt centroid sorting, naming it the 'weighted variable-group' method, and in this instance they use the product-moment correlation coefficient as similarity criterion.
In general, with the hierarchical fusion algorithm, we fuse two clusters p and q and then compute the centroid or mean of the new cluster distribution; then the similarities $S_{pq,r}$ between all other clusters r and the new cluster (p+q) are computed using the two cluster centroids as if they were single individuals. The next fusion is then indicated for those two clusters having highest similarity, and the process is repeated N - 1 times.
Ward's Method
Ward (1963) proposed a method for hierarchical fusion which is probably one of the most used procedures of its kind, particularly in the social sciences. The 'disorder' within a cluster is measured by the sum of the squared distances of the points from the cluster mean; hence if $x_{ijt}$ is the value of the jth variable for the ith point of cluster t, which contains $k_t$ points, then
$$E_t = \sum_{i=1}^{k_t} \sum_{j} (x_{ijt} - \mu_{jt})^2$$
where $\mu_{jt}$ is the mean of the jth variable for cluster t. The total 'error sum of squares' E is then defined by Ward as the sum of the $E_t$ values for all T clusters:
$$E = \sum_{t=1}^{T} E_t$$
With hierarchic fusion, two clusters p and q are chosen for fusion in order to minimise E; that is, they are fused if the increase in E
$$I_{pq} = E_{p+q} - E_p - E_q$$
is minimum. This method is independently proposed by Orloci (1967b), and E is considered by Edwards and Cavalli-Sforza (1965) in their exhaustive polythetic divisive method (Sect. 3.2), and Beale (1969) for iterative relocation (Sect. 4.2). Wishart (1969c) found the combinatorial transformation for $I_{pq}$, and proposed an efficient computer algorithm for this and other hierarchical methods (see Sect. 8.4). It is fairly straightforward to prove (Sect. 8.2) that
$$I_{pq} = \frac{k_p k_q}{k_p + k_q}\, d_{pq}^2$$
(equation 8.2.4), where $k_p$, $k_q$ are cluster sizes, and $d_{pq}$ is the distance between the cluster centroids. Using this form, the method can be included in the group of 'centroid sorting' techniques, where clusters are represented by their centroids and the similarity criterion is $I_{pq}$, as stated above.
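The equivalence between the increase in the error sum of squares and the centroid form can be checked on a small example. The sketch below assumes that the centroid form is the cluster-size weight k_p k_q / (k_p + k_q) multiplied by the squared distance between the two centroids; the data and function names are hypothetical.

```python
def centroid(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def error_sum_of_squares(points):
    """Sum of squared distances of the points from their cluster mean."""
    mean = centroid(points)
    return sum((x - m) ** 2 for point in points for x, m in zip(point, mean))


def increase_direct(cluster_p, cluster_q):
    # Increase in E on fusion: E of the fused cluster minus E of each part.
    return (error_sum_of_squares(cluster_p + cluster_q)
            - error_sum_of_squares(cluster_p)
            - error_sum_of_squares(cluster_q))


def increase_from_centroids(cluster_p, cluster_q):
    # Assumed centroid form: k_p * k_q / (k_p + k_q) * squared centroid distance.
    k_p, k_q = len(cluster_p), len(cluster_q)
    d2 = sum((a - b) ** 2
             for a, b in zip(centroid(cluster_p), centroid(cluster_q)))
    return k_p * k_q / (k_p + k_q) * d2


# Hypothetical clusters (illustration only).
cluster_p = [[0.0, 0.0], [2.0, 0.0]]              # centroid (1, 0)
cluster_q = [[5.0, 4.0], [5.0, 6.0], [8.0, 5.0]]  # centroid (6, 5)

direct = increase_direct(cluster_p, cluster_q)
via_centroids = increase_from_centroids(cluster_p, cluster_q)
```

The centroid form needs only the two cluster sizes and centroids, so the increase can be evaluated without returning to the individual points.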
Gower's Median
In introducing his median strategy, Gower (1967) writes:
"[If $S_{ij}$] is a similarity between individuals i and j, then the distance between their point representations $P_i$ and $P_j$ is $\sqrt{1 - S_{ij}}$. Gower (1966) has shown that the latent vectors of the similarity matrix, scaled so that the sum of squares of the elements of the rth vector is equal to the rth latent root, gives directly a set of coordinates with this distance property."
In fact, Gower (1966) considers precisely one definition of $S_{ij}$, namely Sokal's matching coefficient (Chapter 1)
$$S_{ij} = \frac{A + D}{M} = 1 - \frac{B + C}{M} = 1 - d_{ij}^2$$
where $d_{ij}^2$ is the binary squared distance measure. Hence for the matching coefficient the distance property holds. Gower (1967)
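then goes on to define the combinatorial solution (see Sect. 8.1) for the fusion of two clusters p and q by centroid sorting as
$$S_{pq,r} = \frac{k_p}{k_p + k_q} S_{pr} + \frac{k_q}{k_p + k_q} S_{qr} + \frac{k_p k_q}{(k_p + k_q)^2} (1 - S_{pq}) \qquad (2.2.1)$$
which is identical with the centroid sorting combinatorial formula (Sect. 8.1) on substitution of $1 - d_{pq}^2$ for each $S_{pq}$. At this point, in connection with an involved discussion of weighting schemes, Gower suggests that we may "wish to give each cluster unit weight, regardless of the number of individuals in it" from which he deduces the alternative combinatorial formula
$$S_{pq,r} = \tfrac{1}{2} S_{pr} + \tfrac{1}{2} S_{qr} + \tfrac{1}{4} (1 - S_{pq}) \qquad (2.2.2)$$
and when $1 - d_{pq}^2$ is substituted for the matching coefficient $S_{pq}$, this is the same as
$$d_{pq,r}^2 = \tfrac{1}{2} d_{pr}^2 + \tfrac{1}{2} d_{qr}^2 - \tfrac{1}{4} d_{pq}^2 \qquad (2.2.3)$$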
Formula (2.2.3) is correctly interpreted (for distances only) by Lance and Williams (1967a) as a 'median' strategy, in the sense that the new cluster formed by the fusion of p with q is assigned to a point midway between the points representing p and q (the median of the line which connects p with q), regardless of cluster sizes. Formula (2.2.2) will produce an identical fusion hierarchy when $S_{ij}$ is the matching coefficient; however, in deriving (2.2.1) and (2.2.2) Gower (1967) appears to generalise the strategy for all similarity coefficients. The point should be made that Gower's 'median' and 'centroid' sorting strategies, as defined geometrically above, are only obtained with the distance measures $d_{ij}^2$ or (B + C)/M, or the complement (A + D)/M; any other similarity coefficient will either not satisfy the geometrical interpretations of (2.2.1) and (2.2.2), or will require additional proof (see, for example, Sect. 8.3).
This dichotomous situation is best resolved by adopting equation (2.2.3) in connection with the distance statistics $d_{ij}^2$ and (B + C)/M, these being the only measures discussed here which satisfy the geometrical interpretation of Gower's median method.
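The geometrical interpretation of the median strategy can be verified on binary data. The sketch below assumes that the distance form of the median update gives weight one-half to each of the two squared distances from r and subtracts one-quarter of the squared distance between p and q, and that the similarity form is obtained from it by writing S = 1 - d². These forms, the function names and the three individuals are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Binary squared distance (B + C)/M, written for general coordinates."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def median_update_distance(d2_pr, d2_qr, d2_pq):
    # Assumed distance form: each cluster has unit weight.
    return 0.5 * d2_pr + 0.5 * d2_qr - 0.25 * d2_pq


def median_update_similarity(s_pr, s_qr, s_pq):
    # Assumed similarity form: the distance form with S = 1 - d^2.
    return 0.5 * s_pr + 0.5 * s_qr + 0.25 * (1.0 - s_pq)


# Hypothetical individuals scored on M = 4 binary attributes.
p = [1, 1, 0, 0]
q = [1, 0, 0, 1]
r = [0, 0, 0, 0]
midpoint = [(a + b) / 2 for a, b in zip(p, q)]  # point midway between p and q

d2_pr = squared_distance(p, r)
d2_qr = squared_distance(q, r)
d2_pq = squared_distance(p, q)

from_formula = median_update_distance(d2_pr, d2_qr, d2_pq)
from_geometry = squared_distance(midpoint, r)
from_similarity = 1.0 - median_update_similarity(1 - d2_pr, 1 - d2_qr, 1 - d2_pq)
```

All three quantities agree for the matching coefficient; the sketch says nothing about other similarity coefficients, for which the geometrical interpretation may not hold.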
Information Statistic
The 'information' and 'information gain' statistics I and $\Delta I$ are generally introduced (Hyvarinen, 1962; Macnaughton-Smith, 1965; Williams, et al, 1966; Lance and Williams, 1966b; Orloci, 1968a, 1968c, 1969a, 1969b) for binary data, thus:
Shannon (1948) defined the quantity 'information' for a finite discrete probability function taking R states by adopting the entropy function H (Tolman, 1938; Brillouin, 1962) as a measure of the 'disorder' of a system. If $p_r$ is the probability associated with the rth state, then entropy is given by
$$H = -\sum_{r=1}^{R} p_r \log p_r$$
For classification purposes, it is usual to consider the case when there exist only two possible states (presence and absence) of a binary attribute j; however, Hyvarinen (1962) and Orloci (1968a, 1968c, 1969b) continue with the general case of $R_j$ states associated with each multistate character j in order to adapt the binary result to semiquantitative data (such as species density counts within stands). Hence we can write
$$H_j = -\left[ p_j \log p_j + (1 - p_j) \log (1 - p_j) \right]$$
for the binary case. Clearly, when $p_j \to 1$, $H_j \to 0$ since $\log p_j \to 0$, and similarly when $p_j \to 0$, $H_j \to 0$; in fact, the value of $H_j$ achieves a maximum at $p_j = \tfrac{1}{2}$ (Shannon, 1948). If, in a classification process, we derive a group of individuals for which the jth attribute is either completely absent or completely present, then $p_j$ will be 0 or 1 respectively, and we conclude that the group is well-defined for that attribute.
This statistic is further generalised by Shannon for Markoff chain processes to the case when there are M events j each having entropy $H_j$, so that the total disorder of the system may be measured by the total entropy (or average information)
$$H = \sum_{j=1}^{M} H_j$$
which, for 2-state data, reduces to
$$H = -\sum_{j=1}^{M} \left[ p_j \log p_j + (1 - p_j) \log (1 - p_j) \right]$$
By introducing the factor n (group size) we obtain the 'working formula' for total information
$$I = nH = Mn \log n - \sum_{j=1}^{M} \left[ f_j \log f_j + (n - f_j) \log (n - f_j) \right]$$
where the $f_j$'s are attribute frequencies ($f_j = p_j n$).
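The agreement between nH and the working formula can be confirmed numerically. The sketch below uses natural logarithms and the convention 0 log 0 = 0, both of which are assumptions since the base is not specified here; the function names and the example frequencies are hypothetical.

```python
import math


def xlogx(x):
    # Convention 0 log 0 = 0, so p_j = 0 or 1 contributes no disorder.
    return x * math.log(x) if x > 0 else 0.0


def total_entropy(probabilities):
    """H summed over the M binary attributes."""
    return -sum(xlogx(p) + xlogx(1.0 - p) for p in probabilities)


def information_from_entropy(frequencies, n):
    """I = nH, with p_j = f_j / n."""
    return n * total_entropy([f / n for f in frequencies])


def information_working(frequencies, n):
    """Working formula: Mn log n minus the sum of f log f + (n - f) log (n - f)."""
    m = len(frequencies)
    return m * n * math.log(n) - sum(xlogx(f) + xlogx(n - f) for f in frequencies)


# Hypothetical group of n = 4 individuals and M = 3 attributes: the first is
# always present, the second always absent, the third present in half.
n = 4
frequencies = [4, 0, 2]

info_entropy = information_from_entropy(frequencies, n)
info_working = information_working(frequencies, n)
```

Only the third attribute contributes, and both routes give 4 log 2. The working formula has the practical advantage of using the integer frequencies directly.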
In the context of the hierarchical fusion process for binary data, three variants of these 2-state statistics are used to measure dissimilarity. MacArthur and MacArthur (1961) have adopted total entropy H (= I/n) to measure diversity, while Lambert and Williams (1966) use I. In either case, we fuse those two clusters whose resulting H or I is minimum. Alternatively, we can define 'information gain' $\Delta I$ at fusion (Macnaughton-Smith, 1965; Williams, et al, 1966; Lance and Williams, 1966b) as
$$\Delta I_{pq} = I_{p+q} - I_p - I_q$$
and combine those two groups whose $\Delta I$ is minimum. Since the $p_j$'s constitute centroid coordinates, all three variants of the information statistic may be included within the 'centroid sorting' category, since each group at fusion is represented by its centroid. It is for this reason that Williams et al (1966) and others describe 'information analysis' as another variant of centroid sorting.
Kullback (1959) has evidently established a relationship between $\chi^2$ and these functions under random sampling, and Macnaughton-Smith (1965) has compared I with $\chi^2$ (see also Association analysis, Sect. 5.1). Using the relationship, Lambert and Williams (1966) have set up a null-hypothesis test of confidence whereby fusion is terminated when $2\Delta I \ge \chi^2$ (M d.f.). However, it has been generally admitted that this significance test is conservative, and unreliable when used in the context of monothetic division, particularly when M is large (Lambert and Williams, 1966; see also Sect. 3.1). Although Lance and Williams (1966b) appear to be fairly satisfied with the test when used for hierarchic fusion, it would appear that frequently too many final groups are indicated. For instance, with the 450 quadrat x 37 species Andean survey data (Crawford, et al, 1970; Appendix Ia), 35 clusters were indicated by the significance test at p = 0.01. Clearly, this test needs further improvement, and cannot be confidently used in its present form¹.
In practice, information appears to behave very much like the error sum of squares, and at the present time the only realistic approach to determining a cut-off point on the hierarchy is to look for large relative 'jumps' in the fusion coefficient and then examine the prior grouping for 'meaningful' clusters. Sneath (1969) has noted that the information statistic dislikes small groups when large clusters are around; it is readily seen that a single peripheral individual, if grouped to a large cluster, is unlikely to modify the $p_j$'s extensively, regardless of its attribute structure. Hence, the statistic can be accused of tending to force clusters of equal size (see also figure 2.3.2).
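One way to see this effect numerically is to compare the information gain for two fusions that involve the same pair of opposite attribute patterns but different cluster sizes. The sketch below is an illustration under stated assumptions (natural logarithms, 0 log 0 = 0, information gain taken as the information of the fused group minus that of its two parts); the data and function names are hypothetical.

```python
import math


def xlogx(x):
    # Convention 0 log 0 = 0.
    return x * math.log(x) if x > 0 else 0.0


def information(cluster):
    """Total information I of a cluster of binary attribute vectors."""
    n = len(cluster)
    frequencies = [sum(column) for column in zip(*cluster)]  # f_j per attribute
    return (len(frequencies) * xlogx(n)
            - sum(xlogx(f) + xlogx(n - f) for f in frequencies))


def information_gain(cluster_p, cluster_q):
    """Information of the fused cluster minus that of the two parts."""
    return (information(cluster_p + cluster_q)
            - information(cluster_p)
            - information(cluster_q))


pattern_a = [1, 1, 1, 0, 0, 0]
pattern_b = [0, 0, 0, 1, 1, 1]  # opposite on every attribute

# A single peripheral individual is grouped to a homogeneous cluster of 20.
gain_single = information_gain([pattern_a] * 20, [pattern_b])

# Two homogeneous clusters of 10 and 11 individuals are fused.
gain_balanced = information_gain([pattern_a] * 10, [pattern_b] * 11)
```

Although the lone individual differs from the large cluster on every attribute, the fusion costs far less than that of the two groups of comparable size, so the single individual would be absorbed first.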
2.3 DISCUSSION
One of the most attractive features of the generalised hierarchical algorithm is that the entire procedure can be represented by a 'dendrogram' or 'linkage tree'. Every individual is allocated a node (point) on a graph, and each fusion is conveyed by connecting the two branches associated with the fused groups. These connections are usually drawn parallel to points on a coefficient scale which correspond to the fusion coefficient values, so that large jumps in the coefficient can be readily observed. An example is shown in figure 2.3.2(B5) where the large jump from the 2 to 1 cluster level could be 'interpreted' as the transition from a 'well-ordered' to 'disordered' classification.
¹ Orloci (1968a, 1968c, 1969b) merely mentions a relationship between 2I and $\chi^2$ established by Kullback (1959), but does not adopt the significance test, and Lance and Williams (1968) admit that the number of degrees of freedom is usually very large, and the test of significance correspondingly weak.
This device is used by several writers (Sneath, 1966a; Williams, et al, 1966; Lance and Williams, 1966b, 1967a) for the visual comparison of hierarchical methods. Figures 2.3.1 and 2.3.2 are used by Williams et al (1966) to compare single linkage with centroid sorting for five different similarity criteria. The most striking aspect of figure 2.3.1 (single linkage) is the consistent 'chaining' effect throughout, while figure 2.3.2 (centroid sorting) shows the phenomenon known as 'coefficient reversals', for which the fusion coefficient values are not always monotonic decreasing. Another important feature of centroid sorting (shown in figure 2.3.2) is that different similarity criteria often produce very different results. We see that the information gain statistic (B5 of 2.3.2) appears to produce (force?) a nicely nested hierarchy within which variation of cluster size seems to be reduced, while the other criteria show differing ten-
[Figure panel title: CORRELATION COEFFICIENT]