HIERARCHICAL METHOD - THE PROBABILISTIC MODEL 6.1 MINIMUM-VARIANCE TECHNIQUES



6.4 HIERARCHICAL METHOD

A major criticism of the one-level test is that two thresholds, r and k, must be chosen by the user. This external control can be reduced by defining a hierarchical algorithm based on the order in which points become dense. The method can be summarized as follows:

(a) Select the density threshold k, compute the inter-point distance matrix and the distances PD from each point to its kth nearest point.

(b) Order the distances PD so that the smallest is first, using the array KP as an index. Thus KP defines the order in which the data points become dense: point KP(1) has the smallest kth distance PD(1) and is first to become dense when r = PD(1), point KP(2) is second at PD(2), and so on.

(c) Select distance thresholds PMIN from successive PD values, initialising a new dense point at each cycle. As the second and each subsequent dense point is introduced, the method tests the new point to determine one of three possible fusion phases:

either (i) the new point does not lie within PMIN of another dense point, in which case it initialises a new cluster mode,

(ii) the point lies within PMIN of dense points from one cluster only, and therefore the point is directly fused to that cluster.


or (iii) the point falls in the saddle region, lying within PMIN of dense points from separate clusters, and the clusters concerned are fused.

(d) Finally, a note must be kept of the nearest-neighbour distance DMIN between dense points of different clusters. When PMIN exceeds DMIN, the direct fusion of the two clusters separated by DMIN is indicated.

This algorithm is concisely represented by the flow chart in figure 6.4.1.
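Steps (a) to (d) can be sketched in a few lines of Python. This is an illustrative reading of the description, not the original computer program: the function name, the use of Euclidean distance, the treatment of "within PMIN" as less-than-or-equal, and the final pass that keeps fusing after the last point has become dense are all assumptions made for the example. Non-dense points carry the label None, and the grouping of dense points is recorded immediately before every cluster fusion.

```python
import math

def hierarchical_mode_analysis(points, k):
    """Sketch of steps (a)-(d); returns PD, KP, the groupings recorded
    just before each cluster fusion, and the final labels."""
    n = len(points)
    # (a) inter-point distances and distance to the kth nearest point
    D = [[math.dist(p, q) for q in points] for p in points]
    PD = [sorted(row)[k] for row in D]          # row[0] is the self-distance
    # (b) KP lists the points in the order in which they become dense
    KP = sorted(range(n), key=lambda i: PD[i])
    labels = [None] * n                         # None means "not yet dense"
    next_label = 0
    outputs = []

    def closest_separate_pair():
        # DMIN and the pair (LIM, LINK) of dense points realising it
        best = (math.inf, None, None)
        dense = [i for i in range(n) if labels[i] is not None]
        for a in dense:
            for b in dense:
                if a < b and labels[a] != labels[b] and D[a][b] < best[0]:
                    best = (D[a][b], a, b)
        return best

    def fuse(keep, drop):
        for i in range(n):
            if labels[i] in drop:
                labels[i] = keep

    # The trailing None is an assumed final pass with an unbounded threshold.
    for kmin in KP + [None]:
        pmin = math.inf if kmin is None else PD[kmin]
        # (d) fuse clusters whose separation DMIN is exceeded by PMIN
        while True:
            dmin, lim, link = closest_separate_pair()
            if not pmin > dmin:
                break
            outputs.append((dmin, "d", list(labels)))
            fuse(labels[lim], {labels[link]})
        if kmin is None:
            break
        # (c) introduce the next dense point and test the three cases
        touched = sorted({labels[j] for j in range(n)
                          if labels[j] is not None and D[kmin][j] <= pmin})
        if not touched:                          # (i) new cluster nucleus
            labels[kmin] = next_label
            next_label += 1
        elif len(touched) == 1:                  # (ii) joins one cluster
            labels[kmin] = touched[0]
        else:                                    # (iii) saddle point
            outputs.append((pmin, "c(iii)", list(labels)))
            fuse(touched[0], set(touched[1:]))
            labels[kmin] = touched[0]
    return PD, KP, outputs, labels
```

For example, calling the function on the six one-dimensional points (0,), (1,), (2,), (10,), (11,), (12,) with k = 2 builds two clusters of three points each and records a single grouping, just before the two are fused at a threshold of 8. The quadratic search for DMIN is kept for clarity; a practical program would update it incrementally.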

Output of classifications

It is conceivable that, at each cycle of the algorithm, all the non-dense points could be reallocated to the cluster nuclei and the cluster groupings made available. However, this leads to a vast collection of results which are confusing, and it is therefore desirable to restrict the output in some way. The fusion of a new dense point to an existing cluster (c(ii) of the algorithm) is probably the least significant step. This can be interpreted as the growth of a mode and corresponds to an information-gain for the cluster concerned; thus the previous grouping has a lower information content and can be considered of less value. Similarly, at the introduction of a new cluster nucleus (c(i) of the algorithm), the groupings become outdated when the cluster subsequently 'grows' and increases in information-content. The really critical phases are therefore those at which existing

[Flow chart not reproduced. Its boxes, as far as they can be recovered, read: Select density k. Compute the distances PD from each point to its kth nearest point; order PD with KP as index, thus point KP(1) is first to become dense when the threshold reaches PD(1). KMIN = KP(IL), PMIN = PD(IL). Compute DMIN, the smallest distance between dense points LIM, LINK of separate clusters. Is PMIN > DMIN? Increase the threshold to DMIN and fuse the two clusters containing points LIM and LINK. Increase the distance threshold to PMIN and introduce the next dense point KMIN. Is KMIN a new cluster nucleus? Does KMIN join one existing cluster? KMIN causes the fusion of two or more clusters. Re-allocate non-dense points by some similarity criterion and output classifications obtained immediately prior to this fusion.]

Figure 6.4.1. Flow Chart for the Hierarchical Mode Analysis Computer Program.


clusters are fused (c(iii) and (d)), and output is restricted to those groupings which are obtained immediately before such a fusion. Two alternative levels of classification are offered to the user: the nuclei level groups only those data points (including any which are non-dense) that lie within those spheres that correspond to dense points, while at the complete classification level each non-dense point is allocated to the cluster containing its nearest dense point. Non-dense points which lie outside dense spheres are denoted unclassifiable at the nuclei level, while, for those users who demand a best-possible fit for all their cases, the complete level of classification allocates the entire population to the cluster modes.
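The two output levels can be illustrated with a short Python sketch. The function name, the argument names, and the choice of the current distance threshold r as the radius of every dense sphere are assumptions made for the example; a non-dense point lying inside spheres of more than one cluster is here given to the cluster of its nearest dense point, which the text does not specify. At least one dense point is assumed to exist.

```python
import math

def classify(points, dense_labels, r):
    """Return (nuclei, complete) label lists.

    dense_labels[i] is the cluster of point i if it is dense, else None.
    At the nuclei level a non-dense point outside every dense sphere of
    radius r stays None (unclassifiable); at the complete level it is
    given to the cluster of its nearest dense point.
    """
    dense = [i for i, lab in enumerate(dense_labels) if lab is not None]
    nuclei, complete = [], []
    for i, p in enumerate(points):
        if dense_labels[i] is not None:
            # dense points keep their cluster at both levels
            nuclei.append(dense_labels[i])
            complete.append(dense_labels[i])
            continue
        nearest = min(dense, key=lambda j: math.dist(p, points[j]))
        complete.append(dense_labels[nearest])
        inside = math.dist(p, points[nearest]) <= r
        nuclei.append(dense_labels[nearest] if inside else None)
    return nuclei, complete
```

With dense points at 0, 1 (cluster 0) and 10, 11 (cluster 1), non-dense points at 1.5 and 5, and r = 1, the point at 1.5 is classified at both levels, while the point at 5 is unclassifiable at the nuclei level and joins cluster 0 at the complete level.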

Unique Features of Mode Analysis

1. For the first and last cycles of the analysis only one cluster is defined. Thus, at some intermediate stage, the number of clusters reaches a maximum that can be interpreted as the widest classification which is 'natural' or 'taxonomically significant'. It is possible that an analysis will never reveal more than one cluster, indicating that the data swarm is unimodal. In a large study of several real data matrices (population sizes ranging from 30 to 350), the method never defined a grouping of more than nine clusters, and the average analysis maximum was about six.


6 depending on the population size. For large data sets (N > 200), empirical trials indicate that values of k in the range 3 to 5 yield practically identical results. Thus the user control is severely restricted, and seldom critical.

3. When k takes the value 1, the algorithm degenerates, by definition, to nearest neighbour, making available this additional method as an option for very small data sets.

4. The number of separate classifications is severely limited, by the output control, to those groupings obtained prior to cluster fusions. During the trials, the largest number of groupings obtained was 24, while the average was about 11. In one case, the method generated only six groupings for a population of size 310.

In document Some problems in the theory and application of the methods of numerical taxonomy (Page 158-162)