CHAPTER 3: DIVISIVE METHODS

3.2 POLYTHETIC DIVISION

Edwards and Cavalli-Sforza (1965)

The error sum of squares E, although apparently first used by Ward (1963) as an 'objective function' optimised by hierarchic fusion (Sect. 2.2 and 8.2), is often attributed to Edwards and Cavalli-Sforza (1965) as a homogeneity indicator in the context of cluster analysis (Orloci, 1967b; Gower, 1967; Calinski and Harabasz, 1970); in fact, Orloci says (personal communication) that he was "inspired by Edwards and Cavalli-Sforza" when he reproposed Ward's method (Orloci, 1967b).

That there exist one or more absolutely optimum solutions for the error sum of squares E for a given number of clusters has intrigued many writers (Forgey, 1964, 1965; Dagnelie, 1967; Bolshev, 1969; Calinski, 1969; Calinski and Harabasz, 1970), and Edwards and Cavalli-Sforza are credited with the only method which is guaranteed to find the optimum partition. They do so by examining all $2^{N-1} - 1$ possible divisions of the population of N individuals into two classes, computing the error sum of squares in every case. Having found the best two clusters, however, they then abandon the idea of finding the optimum division into 3 groups (because the examination of $(3^{N-1} - 2^{N} + 1)/2$ classifications is "impossible") and prefer instead to partition each of the first two groups into two, using the same division procedure as before. They continue in this way, obtaining a division tree which is the same as the monothetic 'nested sub-division' (Sect. 3.1). Since it cannot be claimed that, in general, successive optimum solutions for E satisfy the imposed hierarchical structure, Edwards and Cavalli-Sforza cannot guarantee to find the optima for other than 2 groups; in fact, this drawback was demonstrated by Calinski and Harabasz (1970).
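The exhaustive two-way search described above can be sketched as follows. This is a minimal illustration and not the authors' program: the function names are hypothetical, points are assumed to be tuples of coordinates, and E is taken to be the sum of squared Euclidean distances of the points from their class centroids. The two counting functions simply evaluate the expressions quoted in the text.

```python
from itertools import combinations


def error_sum_of_squares(points):
    """Sum of squared Euclidean distances of the points from their centroid."""
    if not points:
        return 0.0
    n, dims = len(points), len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dims))


def best_two_way_division(points):
    """Examine every division into two non-empty classes.

    Returns (E, first_class, second_class, number_of_divisions_examined).
    """
    n = len(points)
    if n < 2:
        raise ValueError("at least two individuals are needed")
    best = None
    examined = 0
    # Individual 0 is fixed in the first class so each division is counted once.
    for size in range(n - 1):  # how many of the others join individual 0
        for extra in combinations(range(1, n), size):
            first = [0, *extra]
            second = [i for i in range(n) if i not in first]
            e = (error_sum_of_squares([points[i] for i in first])
                 + error_sum_of_squares([points[i] for i in second]))
            examined += 1
            if best is None or e < best[0]:
                best = (e, first, second)
    return best[0], best[1], best[2], examined


def count_two_way(n):
    """Number of divisions of n individuals into two non-empty classes."""
    return 2 ** (n - 1) - 1


def count_three_way(n):
    """Number of divisions of n individuals into three non-empty classes."""
    return (3 ** (n - 1) - 2 ** n + 1) // 2
```

For N = 20 the two counts are 524,287 and 580,606,446 respectively, which shows how quickly the three-group search outgrows the two-group one.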

It should also be noted that the Edwards and Cavalli-Sforza method is extremely inefficient, being computationally "impossible" for more than about 20 individuals (Macnaughton-Smith, 1965; Lance and Williams, 1966b; Orloci, 1967b; Gower, 1967b) due to the enormous number, $2^{N-1} - 1$, of divisions that have to be examined. The method is a classical example of the misuse of computational facilities, and is of interest solely for its treatment of E.

Calinski-Harabasz Shortest Dendrite Method

The 'shortest dendrite' or minimum spanning tree (Florek et al, 1951; Gower and Ross, 1969) is the graph of N - 1 edges which connects all points in the sample space, has the least overall length and no circuits. It is analogous to the hierarchical fusion process for single linkage, where the pair of nearest neighbours at each step defines an edge of the graph. Calinski and Harabasz (1970; see also Calinski, 1969) reason intuitively that the optimum error sum of squares solution for k clusters may be obtainable by removing k - 1 edges from the shortest dendrite. That this is not always true is demonstrated by figure 3.2.1, for which the optimum solution for E when k = 2 requires the removal of two edges (as indicated by the dotted partition line c). However, the method was shown by the authors to yield a better solution than the Edwards and Cavalli-Sforza method when k = 5 with a population of 12 Indian castes; in fact, the Calinski-Harabasz result confirmed a previous finding of Rao (1952) who used an average distance criterion with principal components analysis.

[Figure 3.2.1: two parallel rows of six points, at (0, 0) to (0, 5) and (1.1, 0) to (1.1, 5), with a further point at (0.6, -0.6), joined by the minimum spanning tree; partition lines (a), (b) and (c) are marked.]

Figure 3.2.1. Example of a minimum spanning tree (solid lines) which cannot be partitioned to find the optimum error sum of squares for two clusters. The points' coordinates are given, and the distance matrix was computed without standardisation. Partitions are: (a) solution for 2 classes by the Calinski-Harabasz method; (b) starting solution for the iterative relocation procedure (Chapter 4); (c) final solution after 6 relocations and 2 iterations. Solution (c) is evidently the optimum.
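The point made by the figure can be checked numerically with a short sketch. It rests on assumptions which should be stated plainly: the thirteen coordinates are read from the figure as (0, j) and (1.1, j) for j = 0 to 5, plus (0.6, -0.6); the tree is built by Prim's algorithm, which is a substitution here and not necessarily the construction used by the authors; and all function names are hypothetical. The sketch removes each single edge of the tree in turn (the case k = 2) and compares the best resulting E with that of a partition which cuts across both rows.

```python
import math


def ess(points):
    """Error sum of squares of a set of 2-D points about its centroid."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    return sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in points)


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns N - 1 edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        _, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree for j in range(n) if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


def components(n, edges):
    """Connected components of the graph on n vertices with the given edges."""
    adjacent = {v: [] for v in range(n)}
    for i, j in edges:
        adjacent[i].append(j)
        adjacent[j].append(i)
    seen, parts = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, part = [start], []
        while stack:
            v = stack.pop()
            part.append(v)
            for w in adjacent[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        parts.append(sorted(part))
    return parts


def best_single_cut(points):
    """Smallest total E obtainable by removing one edge of the tree (k = 2)."""
    edges = minimum_spanning_tree(points)
    best = None
    for removed in edges:
        kept = [edge for edge in edges if edge != removed]
        parts = components(len(points), kept)
        e = sum(ess([points[i] for i in part]) for part in parts)
        if best is None or e < best[0]:
            best = (e, parts)
    return best


# Coordinates as read from figure 3.2.1 (an assumption of this sketch).
points = ([(0.0, float(j)) for j in range(6)]
          + [(1.1, float(j)) for j in range(6)]
          + [(0.6, -0.6)])
left = [p for p in points if p[1] <= 2]    # cuts across both rows
right = [p for p in points if p[1] >= 3]
e_across = ess(left) + ess(right)
e_cut, parts_cut = best_single_cut(points)
```

Under these assumed coordinates the partition across the rows gives E of about 13.83, whereas the best single-edge removal gives about 34.5, so no removal of one edge of the tree reaches the better solution.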

An important feature of the Calinski-Harabasz method is that it is non-hierarchical, as opposed to all other polythetic divisive schemes considered here, and hence it is not subject to the criticism that any one result is dependent on previous partitions for its efficiency (see below). The method is much faster than the Edwards and Cavalli-Sforza technique, requiring computation time proportional to . However, although this permits populations of order about 60 to be considered, the method is computationally slow when compared with the hierarchic fusion and iterative relocation procedures, the time factor being roughly when N k. It can also be argued that the method is inefficient because it considers some partitions of the dendrite which are highly unlikely to be profitable (viz. the removal of k - 1 edges located together at an extreme vertex of the graph).

Dissimilarity Analysis

Macnaughton-Smith et al (1964; see also Macnaughton-Smith, 1965) propose a polythetic divisive scheme which determines a single partition of a cluster, and then derives a nested sub-division tree in the same fashion as Edwards and Cavalli-Sforza. The method works as follows:

1. Each individual is compared with the set of all the rest, and we choose that individual which is least similar to the rest. With the centroid criterion, we would select the point which is farthest from the group centroid.

2. We next consider all pairs of individuals which include the one chosen above, and select the 'best' pair.

3. Each triad containing the pair selected at 2 is considered, and the best such triad chosen.

4. In this way, we develop a partition of the set by moving each individual which belongs to the 'rest' in turn into an accumulating subset, and then examining the resulting similarity between the 'subset' and the 'rest'. At the end of each cycle, we move to the subset that single individual whose move results in the greatest dissimilarity between the subset and the rest.

5. The procedure stops when the 'best' individual is more alike to the 'rest' than to the 'subset', at which stage the move is deemed to be unprofitable.

Mathematically, we denote by $f(P, Q)$ the chosen function which measures the similarity between sets P and Q. Hence if $x_i$ is an individual belonging to the 'rest' R, and G is the growing subset, then we evaluate for each $x_i \in R$

$$d(x_i) = f(G + x_i, x_i) - f(R - x_i, x_i)$$

and choose that individual X for which $d(X)$ is a maximum. Then, provided that $d(X) > 0$, we remove X from R and place X in G. If $d(X) \le 0$ we stop, and the best partition of the set into subsets G and R has been found. Each subset thus found is considered separately for further division, thereby deriving the nested sub-division sequence (regrettably, no rule for the order of such divisions is suggested by the authors).
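A single division of this kind can be sketched as follows. The choice of f is an assumption made purely for illustration: similarity is taken as the negative of the mean Euclidean distance from an individual to the other members of a set (and zero for an empty set), so that d(x) becomes the mean distance from x to the remainder of R minus the mean distance from x to G. The function names are hypothetical, and the binary-attribute function used by Macnaughton-Smith et al is not reproduced.

```python
import math


def mean_distance(x, group, points):
    """Mean Euclidean distance from individual x to the members of group."""
    if not group:
        return 0.0  # convention assumed here for an empty set
    return sum(math.dist(points[x], points[y]) for y in group) / len(group)


def split_off_subset(points):
    """One division of the set into a growing subset G and the rest R."""
    rest = list(range(len(points)))
    subset = []

    def d(x):
        # Positive when x is, on average, nearer to G than to the rest of R.
        others = [y for y in rest if y != x]
        return mean_distance(x, others, points) - mean_distance(x, subset, points)

    while len(rest) > 1:
        best = max(rest, key=d)
        if d(best) <= 0:
            break  # the move is unprofitable, so the partition is complete
        rest.remove(best)
        subset.append(best)
    return subset, rest
```

With G empty at the outset, d(x) reduces to the mean distance from x to all the others, so the first individual moved is the one least similar to the rest, as in step 1. Each cycle evaluates d once for every member of R, which is the source of the operation count discussed in the text.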

The important features of this analysis are as follows:

1. The computation is not fast, being of the order of $(g+1)(2n-g)/2$ when cluster G ends up with g members. If g = n/2 (and this is not necessarily the "worst" case, as suggested by Lance and Williams (1966b), because it is possible that g > n/2), then this reduces to $3n(n+2)/8$, which must be further multiplied by the factor corresponding to the evaluation of $d(x_i)$. In fact, the dissimilarity function used by Macnaughton-Smith et al is of the order M, where M is the number of attributes (binary), so that the time for each division step is proportional to $3Mn(n+2)/8$ in this case, where n is the size of the group being considered (Lance and Williams (1966b) incorrectly deduce the factor 3n/4 for dissimilarity analysis).

2. The method, like all divisive schemes, suffers the drawback that inefficient early partitions cannot be corrected (see Gower, 1967; also Sect. 3.1, and below). For example, a natural 3-cluster grouping is unlikely to be found since one of the clusters will probably be split in two at the first step. Also, the initial direction of the partition is determined by the most remote individual (when the distance criterion is used), which is not particularly likely to indicate the direction of a natural density saddle. Furthermore, this likely peripheral misfit, regardless of its final relationships with R and G, is constrained to belong to G from the very start.

3. The method is evidently suggested for use with nested sub-division, the drawbacks of which have already been outlined (Sect. 3.1).
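The reduction quoted in feature 1 can be checked with a short worked count. It is assumed here (the assumption is not stated in the text) that cycle j evaluates d once for each of the n - j individuals still in R, and that cycles j = 0, 1, ..., g are performed, the last being the cycle in which no profitable move is found.

```latex
\sum_{j=0}^{g} (n - j)
  = (g+1)\,n - \frac{g(g+1)}{2}
  = \frac{(g+1)(2n-g)}{2},
\qquad\text{and for } g = \frac{n}{2}:\quad
\frac{\left(\frac{n}{2}+1\right)\left(\frac{3n}{2}\right)}{2}
  = \frac{3n(n+2)}{8}.
```

Omitting the final, unprofitable cycle would give n(3n+2)/8 at g = n/2, so the quoted factor is consistent only with a count that includes it.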

Conclusions

Many authors express a preference for divisive systems, using such arguments as:

"divisive methods, which start with the whole sample, are in general safer than agglomerative methods"
- Macnaughton-Smith et al (1964)

"since divisive methods are preferable to agglomerative, Similarity Analysis (meaning single linkage) is not considered in the present paper"
- Macnaughton-Smith (1965)

"when monothetic classification by attributes is acceptable or even desirable, a more powerful divisive system is possible, in that the function used for selection of attributes can be calculated over the entire population"
- Lance and Williams (1965)

"the single greatest advantage of a divisive system like association analysis is that the analysis begins at a high information level"
- Lambert and Williams (1966)

By contrast, Gower (1967) writes:

"it is held that divisive methods will not lead to any spurious groupings and although this is probably mostly true there appears to have been no formal investigation. For example, suppose we have three well-defined groups; then no harm is done if division is made as in figure 3.2.2(a), but can we guarantee that it will not occur as in figure 3.2.2(b)? We would, however, probably be happier if divisions were made as in figure 3.2.2(c), which is the type of clustering found by agglomerative methods."

Figure 3.2.2. Possible divisive and agglomerative clustering results: (b) indicates the type of irreversible failure found in monothetic methods; (c) shows the sort of clustering produced by agglomerative methods.

In fairness to the previous advocates of divisive systems (who, incidentally, are all credited with the authorship of divisive methods), it should be noted that the agglomerative methods had not been fully exploited during the period 1964-66 (although most of those discussed in Chapter 2 had been published). In fact, Macnaughton-Smith (1965) mentions only single-linkage before disposing of agglomerative methods (in general), and Williams et al (1966) consider only centroid sorting and single-linkage: is it possible, therefore, that their dissatisfaction with agglomerative methods is merely the disguised practical experience of chaining? Faced with a dendrogram showing extensive chaining (e.g. Appendix Ic, or figures 2.3.1 and 2.3.2) it would be very easy to attribute the failing to "false decisions made in the early stages of the analysis" - Macnaughton-Smith et al (1964).

In any event, Gower's lucid assessment now seems more plausible than the previous unsubstantiated remarks, and it is therefore recommended here that agglomerative (Chapter 2) and iterative relocation (Chapter 4) methods should be used where possible, rather than divisive methods.

CHAPTER 4: ITERATIVE RELOCATION

In document Some problems in the theory and application of the methods of numerical taxonomy (Page 100-109)