CLUSTER SIGNIFICANCE ANALYSIS - Unsupervised Learning


5.5 CLUSTER SIGNIFICANCE ANALYSIS

The advantage of unsupervised learning methods is that any patterns that emerge depend only on the data employed. There is no intervention by the analyst, other than to choose the data in the first place, and there is no attempt by the algorithm to ‘fit’ a pattern to the data, seek a correlation, or produce a discriminating function (see Chapter 7). Any groupings of points seen on a non-linear map, a principal components plot, a dendrogram, or even a simple bivariate plot are due solely to the disposition of samples in the parameter space and are unlikely, although not impossible, to have arisen by chance. There is, however, a major drawback to the unsupervised learning approach: it is difficult to evaluate the quality or ‘significance’ of any clusters of points. Many analytical methods, particularly the parametric techniques based on assumptions about population distributions, have significance tests built in. If we look at the principal component scores plot for the fruit juices (Figure 4.9) or the dendrogram for the same data (Figure 5.8) it seems obvious that the groupings have some ‘significance’, but is this always the case? Is it possible to judge the quality of an unsupervised picture? McFarland and Gans [17] addressed this problem by means of a method which they termed cluster significance analysis (CSA). The concept underlying this method is quite

Figure 5.12 Plot of active (•) and inactive (◦) compounds described by two parameters (reproduced from ref. [17] with permission of the American Chemical Society).

simple: for a given display of N samples which contains a cluster of M active (or otherwise interesting) samples, how ‘tight’ is the cluster of M samples compared with all the other possible clusters of M samples? Various measures of tightness could be used but the one chosen was the mean squared distance (MSD) which involves taking the sum of the squared distances between each pair of points in the cluster divided by the number of points in the cluster (M).

The process is nicely illustrated by a hypothetical example from the original report. Figure 5.12 shows a two-dimensional plot of six compounds, three active and three inactive. The total squared distance (TSD) for the active cluster is given by

$$\mathrm{TSD} = (x_1 - x_2)^2 + (y_1 - y_2)^2 + (x_1 - x_3)^2 + (y_1 - y_3)^2 + (x_2 - x_3)^2 + (y_2 - y_3)^2 \tag{5.4}$$

and the mean squared distance

$$\mathrm{MSD} = \frac{\mathrm{TSD}}{3}. \tag{5.5}$$
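The two quantities can be sketched in a few lines of Python. This is only an illustration of Equations (5.4) and (5.5): the function names and the three coordinate pairs are hypothetical, since the coordinates behind Figure 5.12 are not given in the text.

```python
from itertools import combinations


def total_squared_distance(points):
    """Sum of squared Euclidean distances over every pair of points (TSD)."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in combinations(points, 2)
    )


def mean_squared_distance(points):
    """TSD divided by the number of points in the cluster (MSD)."""
    return total_squared_distance(points) / len(points)


# Hypothetical (x, y) coordinates for three active compounds;
# these are NOT the values plotted in Figure 5.12.
actives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

tsd = total_squared_distance(actives)  # pairs contribute 1 + 1 + 2
msd = mean_squared_distance(actives)   # TSD / 3
print(f"TSD = {tsd:.3f}, MSD = {msd:.3f}")
```

Because the inner sum runs over all coordinates of each point, the same functions apply unchanged when a cluster is described by more than two parameters.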

The probability that a cluster as tight as the active cluster would have arisen by chance involves the calculation of MSD for all the other possible clusters of three compounds. The number of clusters with an MSD value equal to or less than the active MSD is denoted by A (including the active cluster) and a probability is calculated as

$$p = \frac{A}{N} \tag{5.6}$$

Figure 5.13 Plot of active (•) and inactive (◦) inhibitors of monoamine oxidase (from ref. [17] copyright (1986) American Chemical Society).

where N is the total number of possible clusters of that size, in this case clusters of three compounds (note that N here counts clusters, not samples). It is obvious from inspection of Figure 5.12 that there is one other cluster as tight as or tighter than the active cluster (compounds 2, 3, and 4) and that all other clusters have larger MSD values since they include compounds 1, 5, or 6. There are 20 possible clusters of three compounds in this set and thus A = 2, N = 20, and

$$p = \frac{2}{20} = 0.10 \tag{5.7}$$

If a probability of 0.05 or less (95 % certainty or better) is taken as the significance level, then this cluster of actives, with p = 0.10, would be regarded as fortuitous.
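The exhaustive calculation described above can be sketched as follows. This is a minimal illustration, not the original authors' program: the function names are hypothetical, and the six coordinate pairs are invented so that, as in the worked example, exactly one other cluster of three is at least as tight as the active one. They are not the coordinates of Figure 5.12.

```python
from itertools import combinations


def msd(points):
    """Mean squared distance: pairwise squared distances summed, divided by M."""
    tsd = sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in combinations(points, 2)
    )
    return tsd / len(points)


def cluster_significance(coords, active_idx):
    """Return (A, N, p) by examining every subset of the same size as the actives."""
    active_idx = sorted(active_idx)  # same summation order as the enumerated subsets
    active_msd = msd([coords[i] for i in active_idx])
    a = 0          # clusters at least as tight as the actives (actives included)
    n_subsets = 0  # total number of possible clusters of that size
    for subset in combinations(range(len(coords)), len(active_idx)):
        n_subsets += 1
        if msd([coords[i] for i in subset]) <= active_msd:
            a += 1
    return a, n_subsets, a / n_subsets


# Hypothetical layout of six compounds (indices 0-5 stand for compounds 1-6).
coords = [
    (0.0, 0.0),    # compound 1, far from the rest
    (4.0, 4.5),    # compound 2
    (5.0, 4.0),    # compound 3
    (4.5, 5.0),    # compound 4
    (5.5, 5.0),    # compound 5
    (10.0, 10.0),  # compound 6, far from the rest
]
actives = [2, 3, 4]  # compounds 3, 4 and 5 are taken as the actives here

A, N, p = cluster_significance(coords, actives)
print(f"A = {A}, N = {N}, p = {p:.2f}")
```

With this invented layout the count is A = 2 out of N = 20 subsets, giving the same p = 0.10 as the worked example, so the active cluster would again be judged fortuitous at the 0.05 level.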

Figure 5.13 shows a plot of a set of inhibitors of the enzyme monoamine oxidase (MAO) described by steric (Esc) and hydrophobic (π) parameters. It can be seen that the seven active compounds mostly cluster in the top left-hand quadrant of the plot. The original data set involved a dummy parameter, D, to indicate substitution by OCH3 or OH at a particular position, and in the application of CSA to this problem, a set of random numbers, RN, was added to the data. The results of CSA for these data are shown in Table 5.10, where it is seen that the lowest probability of fortuitous clustering is given by the combination of π and Esc. This illustrates another feature of CSA: not only can it be used to judge the significance of a particular set of clusters, it can also be used to test the effect (on the tightness of clusters) of adding or removing a particular descriptor. Thus, it may be used as a selection criterion for the usefulness of parameters. One thing that should be noted from the table is the large number of possible subsets (77 520) that can be generated

Table 5.10 Application of CSA to a set of 20 MAO inhibitors (reproduced from ref. [17] with permission of the American Chemical Society).

Parameters      A*       p
D               21464    0.27688
RN              14825    0.19124
π               1956     0.02523
Esc             118      0.00152
D, π            1299     0.01676
D, Esc          1175     0.01516
RN, Esc         172      0.00222
π, Esc          71       0.00092
RN, π, Esc      151      0.00195
D, π, Esc       78       0.00101

* From a total possible set of 77,520 subsets of 7.

for this data set. This may cause problems in the analysis of larger data sets in terms of the amount of computer time required. An approach to solving this problem is to compute a random sample of the possible combinations rather than exhaustively examining them all [17]. CSA has been compared with three other QSAR techniques in the analysis of three different data sets [18].
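The idea of sampling the subsets can be sketched as below. This is a plain Monte Carlo estimate written for illustration and is not necessarily the sampling scheme used in ref. [17]; the function names, the number of samples, the seeds, and the simulated 20-compound data set are all assumptions.

```python
import math
import random
from itertools import combinations


def msd(points):
    """Mean squared distance: pairwise squared distances summed, divided by M."""
    tsd = sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in combinations(points, 2)
    )
    return tsd / len(points)


def sampled_cluster_significance(coords, active_idx, n_samples=5000, seed=0):
    """Estimate p from randomly drawn subsets instead of all possible subsets."""
    rng = random.Random(seed)
    m = len(active_idx)
    active_msd = msd([coords[i] for i in sorted(active_idx)])
    hits = 0
    for _ in range(n_samples):
        # Draw M distinct compounds at random; sort so sums run in index order.
        subset = sorted(rng.sample(range(len(coords)), m))
        if msd([coords[i] for i in subset]) <= active_msd:
            hits += 1
    return hits / n_samples


# Size of the exhaustive problem for 7 actives among 20 compounds.
n_possible = math.comb(20, 7)

# Simulated data (hypothetical): 7 tightly grouped 'actives', 13 scattered 'inactives'.
data_rng = random.Random(1)
coords = [(data_rng.gauss(0.0, 0.3), data_rng.gauss(0.0, 0.3)) for _ in range(7)]
coords += [(data_rng.uniform(-3.0, 3.0), data_rng.uniform(-3.0, 3.0)) for _ in range(13)]

p_est = sampled_cluster_significance(coords, range(7))
print(f"{n_possible} possible subsets; estimated p = {p_est:.4f}")
```

The estimate carries sampling error, so a very small p may be reported as zero unless enough subsets are drawn; the trade-off between run time and resolution is controlled here by the assumed `n_samples` argument.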

5.6 SUMMARY

Unsupervised learning methods, like the display techniques described in Chapter 4, are very useful in the preliminary stages of data analysis. Cluster analysis and FA produce easily understood displays from high-dimensional data sets and may be used when the number of variables in the set exceeds the number of samples. Although care must be exercised in the choice of class members when using k-nearest-neighbours, this and the other methods described in this chapter should be reasonably safe from the danger of chance correlations. Cluster significance analysis allows us to attempt to assign significance levels to any ‘interesting’ groupings of samples seen using these methods or multivariate display techniques. Finally, as with all of the other methods described in this book, it is not possible to say that any one technique is ‘best’.

In this chapter the following points were covered:

1. classification by k-nearest neighbours;

3. use of factor analysis to see the relationships between variables;

4. use of factor analysis to visualize samples;

5. use of scree plots to choose ‘significant’ factors;

6. cluster analysis to examine relationships between samples;

7. cluster significance analysis to judge the ‘quality’ of clusters.

REFERENCES

[1] Kowalski, B.R. and Bender, C.F. (1972). Analytical Chemistry, 44, 1405–11.

[2] Chu, K.C., Feldman, R.J., Shapur, M.B., Hazard, G.F., and Geran, R.I. (1975). Journal of Medicinal Chemistry, 18, 539–45.

[3] Scarminio, I.S., Bruns, R.E., and Zagatto, E.A.G. (1982). Energia Nuclear e Agricultura, 4, 99–111.

[4] Goux, W.J. and Weber, D.S. (1993). Carbohydrate Research, 240, 57–69.

[5] Chatfield, C. and Collins, A.J. (1980). Introduction to Multivariate Analysis. Chapman & Hall, London.

[6] Malinowski, E.R. (1991). Factor Analysis in Chemistry. John Wiley & Sons, Inc., New York.

[7] Jackson, J.E. (1991). A User’s Guide to Principal Components. John Wiley & Sons, Inc., New York.

[8] Li-Chan, E., Nakai, S., and Wood, D.E. (1987). Journal of Food Science, 52, 31–41.

[9] Takagi, T., Shindo, Y., Fujiwara, H., and Sasaki, Y. (1989). Chemical and Pharmaceutical Bulletin, 37, 1556–60.

[10] Svoboda, P., Pytela, O., and Vecera, M. (1983). Collection of Czechoslovak Chemical Communications, 48, 3287–306.

[11] Ford, M.G., Greenwood, R., Turner, C.H., Hudson, B., and Livingstone, D.J. (1989). Pesticide Science, 27, 305–26.

[12] Hudson, B.D., George, A.R., Ford, M.G., and Livingstone, D.J. (1992). Journal of Computer-aided Molecular Design, 6, 191–201.

[13] Cormack, R.M. (1971). Journal of the Royal Statistical Society, A134, 321–67.

[14] Willett, P. (1987). Similarity and Clustering in Chemical Information Systems. Research Studies Press, John Wiley & Sons, Ltd, Chichester.

[15] Dizy, M., Martin-Alvarez, P.J., Cabezudo, M.D., and Polo, M.C. (1992). Journal of the Science of Food and Agriculture, 60, 47–53.

[16] Lewi, P.J. (1976). Arzneimittel-Forschung, 26, 1295–1300.

[17] McFarland, J.W. and Gans, D.J. (1986). Journal of Medicinal Chemistry, 29, 505–14.


In document Livingstone, Data Analysis (Page 159-164)