An evaluation of cluster analysis and related multivariate techniques for operational research

(1)

City, University of London Institutional Repository

Citation: Newport, R. (1974). An evaluation of cluster analysis and related multivariate techniques for operational research. (Unpublished Doctoral thesis, City University London)

This is the accepted version of the paper.

This version of the publication may differ from the final published version.

Permanent repository link: http://openaccess.city.ac.uk/8585/

Link to published version:

Copyright and reuse: City Research Online aims to make research outputs of City, University of London available to a wider audience. Copyright and Moral Rights remain with the author(s) and/or copyright holders. URLs from City Research Online may be freely distributed and linked to.

City Research Online: http://openaccess.city.ac.uk/ [email protected]

(2)

API EVALUATION OF CLUSTER ANALYSIS

AND RELATED MUIPT I VARI IXTE TECHNIQUES

FOR OPERATIONAL RESEARCH

by RAY NEWPORT

submitted to the Graduate Business Centre, City University, _for _{the degree} _of

Doctor _{of Philosophy}

v

(3)

Al EVALUATION OF CLUSTER ANALYSIS

AND RELATED MULTIVARIATE TECHNIQUES

FOR OP : RATIONAL RESEARCH

(4)

VOLUME 1

(A) _INTRODUCTION

(B)

_MULTIVARIATE

ANALYSIS

Page 1

1. Methods _{of MVA} ₉

2. Cluster _Analysis ₁₆

3. _Ordination

₅₄

4. _Seriation ₆₀

5. _{Dissimilarity} _{and Similarity} _Measures ₆₉

(C) _{CLUSTER ANALYSIS METHODS}

1. General Discussion ₁₁₆

2. Explanation _{and Discussion} _{of Methods} ₁₂₀ 3. Comparisons _{of Other} _Researchers ₂₁₄

4. _Choice _{of Methods} _for _Study ₂₂₆

5. Comparison _{of Methods} ₂₃₃

6. _Conclusions ₃₀₂

VOLUME 2

(D) _{ORDINATION METHODS}

1. General Discussion ₃₂₉

2. Explanation _{of Methods} ₃₃₄

3. _Discussion _{and Comparison} _{of Methods} ₃₅₄

4. _Conclusions ₃₉₄

(E) _{USES IN OPERATIONAL RESEARCH} ₄₀₀

ADDENDA - Operational Research _{Case Studies}

1. Data Investigation ₄₁₆

(a) Input-Output _Analysis (b) Stock Market _Data

(c) _{A Manpower Study}

2. Specific _Applications ₄₇₂

(a) Vehicle Routing

(b) Sewer Pipes Problem

(c) Team Organizing _Problem (d _{Gap Analysis}

(e _Factory _Layout APPENDICES

I. Programs ₅₂₉

2. Data ₅₄₆

(5)

Acknowledgements

I am indebted to Al Russell _of the Graduate _Business

Centre for _his help _{and guidance} during this _project, _{and in}

-particular-to his faith in multivariate methods.

I would also like to pay tribute to the excellent

library _{and computer} _facilities _{of the GBC and City} _University, on which this _{work has depended} so much.

I would like to thank the following _people for their _help

in the _case _studies _given _in _the _Addenda _{- John Faupel} _of _the

Printing _{and Publishing} _Industry _Training _Board _for _the

manpower data, Dr. John Green _of the Local Government

Operational _Research _Unit _for _first _introducing _{me to the}

sewer _{pipes-problem,} _{and Jock} Scholefield _of the GBC for

information _{on the} _{team organizing} _problem.

(6)

Preface

The following _{work. is an investigation} into _methods _of

Cluster Analysis _{and Ordination.} _{The main objective} _{of this} thesis has been to investigate _{the capabilities} _{of these}

methods for practical usage. An important subsidiary aim has been to collect together _related _{work which has been}

carried out in many different areas _{- ecology,} biology, archaeology,

_ psychology, _{, ,}etc.., into, one work.

After _{a_ brief} introduction to the _general _concepts _of

multivariate analysis in Section A of the thesis, Section B

gives an introductory account of the methods of Clustering,

Ordination _{and Seriation,} _putting them into the _context _of

the _{by now better} _established _multivariate techniques.

Section C considers Cluster _Analysis in depth, _explaining _and

examining various methods reported in the literature,

together _with _methods developed _{by the} _author. _The

suitability of the methods for _practical _{use is} discussed and,

decision _rules _are _set _out for the _choice _{of method} to be

used in _{any particular} study, based _{on the} results of

extensive comparative tests _of the _methods.

In Section _{D the various} _ordination _methods _are.

considered, giving an, overall viere and relating the methods to each other. Particular _emphasis _{is paid to the rather}

neglected metric methods.

Section _{E, after} _{a survey} _{of published} _applications _of

the _methods, _suggests _{new areas} _where the _methods _previously

discussed _could _{be valuable} _aids for data investigation _and

problem solving. An Addenda is included which describes

several operational research case studies using these

methods.

Computer _programs _{are given} for the most successful _of the newly introduced cluster methods, and an extensive

(7)

LIST OF SYMBOLS USED CONSISTENTLY THROUGHOUT Nin _{the number of observations} _{or objects}

bS

,m the number- of variables

s any similarity measure, unless otherwise specified

S(a, b) the _similarity between the _objects _{or sets}

aandb

D _{any dissimilarity} _measure

D(a, b) the dissimilarity between the objects _{or sets} a and b.

da dissimilarity _measure, _usually _constrained to

(8)

A

(9)

2.

A. MULTIVARIATE ANALYSIS AND OPERATIONAL RESEARCH In various disciplines _{such as psychology,} biology,

botany, _and, _more _recently, business _studies, there _{are many}

situations where we have large sets of data which depend on

many variables. The analysis of many of these data sets is

complex because of the interaction between variables _{- one}

useful statistical approach is the use of multivariate

analysis techniques, which can take account of inter-variable

relationships. These methods were developed largely in the

psychological and biological fields, where large data sets

are abundant, and few methods previously existed for their

adequate analysis. These methods are now used in most

sciences, _{and their} use is currently being examined in the

social sciences.

The definition _{of multivariate} _analysis (m. v. a. ) is

rather vague; in its widest sense it could be thought of'as encompassing the whole field of statistics, but in practice,

boundaries _{need to be placed} _{on our definition.} For our

purposes we use the definition due to M. Kendall (1968) which extracts two essential features of m. v. a:

(a) _{We are concerned} _with _{a set of n individuals,} _{each of} which bears the value of p different variates. The multivariate character, _{so to speak,} lies in the

multiplicity _{of the p variates,} not in the size of the set n.

(b) _{The variates} _are _dependent _{among themselves} _{so that} _we

cannot split _off _{one or more from} the others and

consider it by itself. The variates _must be considered

(10)

3.

Although there _exist _several _mathematical _methods _of m. v. a., the sum of the methods do not comprise the science. The same remark can be made regarding-operational research. Both m. v. a. and operational research _{are approaches} to

problems which may involve models but they _{are primarily} ways of looking _{at problems.}

The approach of multivariate analysis is that of a

search technique. Information is sought from a set of data

as to Whether or not it has structure, and the nature of any structure _present. Because of the large computer times with

some multivariate methods, this _analysis of data can be costly _{and thus,} as with other search techniques, one

important _aspect is _{to balance} _{the cost of the search with} the expected gain.

The operational research _worker is concerned with the

solution. _{of problems} in the control of all aspects of an

organized system (often by mathematical means), but-not with

the decision taking _within that _system. The trend towards

increased _complexity _of _{organizations,} _{and the} _{need for}

greater control led to the _early growth of O. R. The

continued growth is a measure of the effectiveness of O. R. as

an aid to organizational control, and as a m. -trod of

increasing the knowledge _{of systems} behaviour, _{which may lead} to future benefits.

Most management situations, which are amenable to an

(11)

4.

variables those which seem most relevant. However any

attempt to study the effect of change in one variable in

isolation fmrA the rest _is _often _meaningless. _Multivariate

analysis brings to these _problems _{a battery} of techniques,

which _are _essentially multidimensional in their approach,

taking into _account the interplay between _variables. It is

surprising that _{so many published} works _{and degree} courses

concentrate on methods for deterministic situations. In

organizational problems there are relatively few such

situations. Multivariate _analysis however represents a set

of techniques for predictive and descriptive operational

research.

History

Multivariate _analysis _{has been in existence} _{as a}

statistical method for most of this century.. Its history can be divided into three _parts:

1890 - 1930 Isolated early works on multivariate

methods following Galton's correlation coefficient first published in 1888.

1930 - 1950 The discussion on the structure of human ability by Burt, Thurstone, Spearman, etc., in

psychology, which demanded new methods to analyse sets of intelligence tests to try _{and determine} the structure

of intelligence.

1950 to date _{The rapid} _expansion _{of techniques} to

other disciplines, _{and the increase} in number of methods available and the size of problem that could be tackled,

(12)

5.

The Data Box

One convenient _{way of displaying} the relationships

between _{some of the} _multivariate _analyses _that _{may be carried}

out is by use of what has been called by Cattell (1965). the

data box. This is based _{on a classification} _of

organizational _phenomena into three _classes _{on which}

measurements may be made:

1. Individuals (e. g. people, _countries, flowers)

2. Responses (e. g. size, _temperature, _colour)

3. Situations (e. g. times, _places, _{environments)}

These form _{a three-dimensional} _matrix _which _{can be visualized} as a box., the faces of which refer to types of multivariate

analyses.

1

ýi

1

c TIi

i i

00

i

41

J i

ý yýW _{1 CUnC}º5 J

t 4-U In _AA

Thus we can describe six types of technique:

(13)

6.

R-techniques _analyse the relation between _responses _using individual _scores _{e. g. principal} _components _analysis, factor _analysis.

P-techniques _{are concerned} _with _variations in response _over time _{e. g. spectral} _analysis..

O, S, T-techniques _{are more} difficult to visualize but their

meaning can be extracted from the diagram of the data

box. These _methods have been little _used, but Cattell

(1965) _refers _to _0-technique _being _used _in _meteorology

and Goronzy (1969a) refers to it being used in clinical

psychology.

In _{a particular} _analysis, _which _constitutes the individual,

the _response, _{or the} _situation _{may not} be clear cut _and

indeed _{one may decide} to use a particular technique twice on

the _{same matrix,} by _simply transposing the matrix and

repeating the analysis.

Cattell (1966) has since proposed _{a more comprehensive} data box in ten dimensions _called the Basic Data Relation

Matrix (BDRLI). The suggested dimensions _{of the- BDRTNI}are: 1. Person _{or organism}

2. Focal _stimulus

3. Non-focal _stimulus (environmental background _other than stimulus)

4. Response _{or unitary} _ongoing _process

5. Observer

ö. State _{of the organism}

7. Variant _{of the} _stimulus

8. _{Phase of the environmental} _background 9. Style _{of the response}

(14)

7.

Of these, items _1,2 _{and 4 refer} _{to the co-ordinates} _{of the} original 3D data box.

. Items 6,7 and 9 are related to the

exact nature of the items 1, 2 and 4. Co-ordinates 3 and 8

refer to subsidiary _stimuli _{or situations,} _{and items} 5 and 10

refer to the _observer _{who can be a further} influence _{on the}

situation.

The listing _{of the data box dimensions} _{is of particular}

use in consideration _{of how the} dimensions _which _{we are not}

including, _{in a study may affect} _it. _{Most present} _techniques only _enable _{a face} _{of the box to be examined} _{and thus} _other

dimensions _{we usually} _ignored. _Since _{many of the faces} _of the box represent techniques _which _{are difficult} _{to picture,}

and indeed, the exact face we are examining may be doubtful,

the BDR2,1 is _{of more use as- a set of variables} _{which have an} effect on data, than _{a method of picturing} techniques. The list _{can also be extended} _{to include,} _for _example, _{the effect}

of several _non-focal _stimuli, _several observers, etc. itiultivariate _analysis _{is best} _explained _{by the}

(15)

B

MULTIVARIATE _ANALYSIS

1. Methods _{of DIVA}

2. Cluster _Analysis 3. Ordination

4. Seriation

(16)

9.

B. 1 METHODS OF MULTIVARIATE ANALYSIS

There _{are several} types _{of multivariate} _analyses, _all

having _their _{own approach} _{to simplifying} _{data blocks.} _The major _methods _{may be listed:}

1., Principal _components

2., Factor _analysis

3. Canonical _correlation _analysis 4. Spectral _analysis

5. _Discriminant _analysis 6. _Cluster _analysis

7. _Ordination _techniques

Explanations _{of some of the above methods may be found} in LT. Kendall (1968), _{Hope (1968)} _{and Cooley} _{and Lohnes}

(1971). _{There have also been several} _{books published}

recently _{on the application} _{of these} techniques in particular

disciplines, _{such as L. King} (1969) _{on geography,} _Miller _and Kahn (1902) _{on geology,} _{D. Clarke} (1968) _{on archaeology,} _{' and}

Green and Tull (1970) _{on marketing.} Principal Components _Analysis

This _{method is perhaps} _{the most well-known} _{and most} widely _{used of the multivariate} _methods. If _{we consider}

data _{as a set of points} _{in n-dimensional} _{space then principal} components _analysis is concerned _with the rotation _and

(17)

10.

contributes the maximum to the total variance. The second principal component is orthogonal to the first _{and is} _such

that it _contributes _{the maximum to the residual} _variance.

Further _principal _components _{may be obtained} _until _{the total}

variance has been 'explained'.

The model is therefore _given by:

n

Pi ₌ _{ai jxj} (i=1,

... n) j=1

where Pi is the ith principal component _{and xi} _{are the} original variables, _{and the Pi 's are orthogonal.}

The aim of the model is to determine the effective

dimensionality _{of the original} _variables, _{and to study} the

underlying variables of the data set. By using _{p. c. a. , one} hopes to be able to reduce the number of variables, and to

obtain _{a more meaningful} _set. Since the method is dependent on the total variance _{of the original} variables it is

customary to normalize the variables before _performing _{p. c. a.} in order to have variables _measured in the same units.

Details _{of the mathematics} involved _{in extraction} _of

components can be found in Cooley _{and Lohnes} (1971), Harman (1967), _{and Seal} (1964).

Factor Analysis

Factor _analysis _{is currently} _{one of the most important} techniques _{of multivariate} _analysis. _{The aim of factor}

analysis is the resolution _{of a set of variables} linearly in

(18)

11.

thus _{are trying} _{to explain} _{our M dimensional} data in terms _of a specified' number of factors m, and a unique variation-(an

'error' term) _{in each variate.} _{The model is:} m

Xik

j=1 aijFjk + diUik

(i=i,... _n)(k=1,... _M)

each of the Fjk is called _{a common factor} _{and each} of

the _{Ui a unique} factor. _{The common factors-} _are _usually

orthogonal, although _methods for the construction of

oblique axes exist.

There _are _{an infinite} _number _{of ways of} _obtaining factors from

a set of data, depending on the required form of the factors.

Thus _several _methods _exist (see Harman _1966, M. Kendall _1968,

Horst _1955, Lawley _{and Maxwell} _1971). Perhaps the _most.

widely used methods are Lawley's maximum likelihood (Lawley

1940) _{and the} Minres _method _{of Harman} _{and Jones} (1966). A

typical _example _{of a study} _using factor _analysis is Baehr _and Williams' (1967) _study _{of personal} _{data and its} _relationship

to _occupation.

Canonical Correlation Analysis

This _concerns two different _sets _{of data} _{on the same} individuals _{and the method assesses} _{the correlation} between

the two sets, and the-linear function of each set of variables which gives that correlation. This is repeated as in

principal components to find the largest correlation

orthogonal to the first _{and so on.} The aim of c. c. a. is to determine _{the redundancy} _{in two sets of variables-and} the

(19)

12.

Spectral _Analysis

This is _{the analysis} _{of time} _series, _especially that _of sinusoidal series, and the removal of autocorrelation in

multivariate data, _which is used in forecasting, _{and in}

pattern recognition and signal detection. The methods and

problems of this type of analysis have recently come into

prominence with the introduction of a new technique by Box

and Jenkins (1970) (see also Chatfield and Prothero 1973, and

Box and Jenkins 1973). Spectral _analysis is _also discussed

in _{a book} by M. Kendall (1973).

Discriminant Analysis

If _{a set of data} _consists _{of two or more known classes}

(i. e. two types of plant, _people who failed or passed an

exam, etc. )-then the. data matrix is divided into

. that number

of groups and gives an identification routine so that a new

set of observations _{can be assigned} to these classes so that

the _probability _{of incorrect} _assignment is _minimized. This

is _normally _performed for _only _{a few} _classes (2,3 _{or 4) and}

is designed for _normally _distributed _classes _which _overlap.

For _{a discussion} _of the _method _{see Cooley} _{and Lohnes} (1971)

or M. Kendall (1968), _{or the original} work of R. Fisher (1937). Some interesting _recent _examples _{are Brainerd}

.

(1973), _Massey (1971) _{and Graham (1970).}

Cluster Analysis

This _{is concerned} _with _{the existence} _{of groups} _of

(20)

13.

not, or to divide _{a set of data into} the 'best' _groups for _a particular purpose. The techniques _will be discussed _at

length in this _work. _{There are few (if} _{any) major} _published works _{on cluster} _analysis. The best _references _{are chapters}

from books _{on pattern} _recognition _such _{as Meisel} (1972) _and

Duda and Hart (1973)p there is _also _{a good review} _by

Bolshev (190-9).

Ordination

This is _{the representation} _{of data in ordinate,} _form,

normally with the. view of drawing the objects _{as points} in

one, two or three dimensions, so that their structure can be visually examined. Both principal _components _analysis _and factor _analysis _{can be used as ordination} _methods, _but _other newer techniques _also _exist (such _{as multidimensional}

scaling), which _{are primarily} designed for _producing _ordinate representations.

Classifications _{of rmultivariate} _{methods have been}

attempted by LT. Kendall _{and Babington} Smith (1950), Sheth '(1971) _{and Kinnear} _{and Taylor} (1971). _{In all} three, the methods _{are divided} firstly into dependence _and

interdependence techniques. _{Of the above techniques,}

spectral _{and canonical} _correlation _analyses _{are of the}

dependence type, _{and principal} _components _{and factor} _analysis are interdependence models. The other three techniques do not fall into _either _{of these} _categories, _since they _are

(21)

14.

Of the methods we have outlined _above, we have selected cluster _analysis _{and ordination} _{as the subject} _{of this} _work. The reason for the selection of these two techniques is the lack' _of, _{and need for,} _research into these _areas. All the.

other methods which we outlined have been the subject of

lengthy _analysis _{by other} _workers, _{and, over} the _years _{one or}

two methods _{of performing} _each technique have _{come out} _{as the}

'best' _methods. Also _each _method has been _applied to many

fields _of interest, _{has been} _programmed _into _readily _avail-

able packages, and been discussed in several books. In

contrast, there have been proposed over _{a hundred,} different

cluster analysis methods, and most of these have not been

compared or analysed to any great extent. In fact, the

progress of cluster-analysis has been hampered by the lack of.

a 'best' method. Ordination methods have not as yet

suffered from over-plentiful methods. However, especially

with the new multi-dimensional scaling techniques, there is a lack _{of comparative} _studies between _{the methods.}

Thus one reason for our selection of these two methods is _{the need for} _{a comparison} _study _{in each case.} Another

area of investigation from which each technique would benefit is the practical applications in which each could be used.

The rationale _{in examining} both _methods together is that they _{are complementary} techniques for _examining data

(22)

15.

One of our aims in this _{work will} be to bring together

the large _amount _{of literature} _{on the} _subject, _{and the}

methods _which have been _proposed, into _{one volume.} The need

for _techniques _for _examining _data _structures _in _{many sciences}

has lead _to _the _development _{of cluster} _analysis _and

ordination in many fields, _most _workers having been unaware

of parallel developments in _other _sciences.

Our second aim, following this introduction _of the

techniques, _will _{be the comparison} _{of a variety} _{of cluster}

analysis methods, _{and an investigation} _{of the properties} _of ordination techniques.

The third _part _{of the work will} _{be designed} _{to show}

} a

i

F

A

i

rw, äßy J'

applications of the methods, _{as operational} _research

techniques. _{Case studies} _will _{be given,} _{and a discussion} _of other _{applications,} to shovr the power of cluster _analysis _and ordination _{as management} decision _aids.

(23)

16.

B. 2 CLUSTER ANALYSIS General

in order to reduce large _nasses _{of data into} more

compact _{and usable} sets, it is often necessary. to arrange items into _groups. _{For example,} _{in order} _{to discuss} the

behaviour _{of human beings} it _{would be almost} impossible (and

of doubtful use) to discuss the behaviour of each individual _- thus _{we group} them into _classes, _{by various} _attributes _{- by}

country, age, weight, size of family, etc. _{- depending} upon our need in a particular instance. Thus if we were

concerned with human social behaviour we might group into classes _{such as age, sex or social} class, and if we were

interested _{in genetic} _{characteristics} _{we might} _group

according to colour of hair, height or weight. In order for these _{or other} _groupings _{to be of use, we need the following}

two axioms to be observed:

1. All _members _of the _population _are _assignable to _{a group.}

(A group may consist of _{a single} member, or no members

at all. )

2. All _elements _{of a group possess} _{a property} (or-

properties) which all elements in other groups lack.

For instance, _{when we classify} individuals _according to their age, then the people in a group possess the property of

having _{a certain} _{age, which} _{no other} _people _possess, _{and also} each person has an age and is hence assignable to a group.

Groupings _which _abide by these _{two axioms are called}

(24)

17.

arbitrary, for instance if we group people according to the age groups: under 21,21-40,41-60, over 60, then there is no suggestion that the ages 21,41 _{and 60 are anything} more

than _{a convenient} division _point, _{we might} _easily _{have, chosen} the groups: under 18,18-35,36-53, over 54.

Any set of data can be dissected, but if _{we try} _and

obtain groupings with less _arbitrary divisions between them,

and find 'natural groupings', then the groups contain more

information, _{and may give} _more _of _{an indication} _{of the}

structure of the data. Thus we place _{a stronger} condition

on our dissection to obtain groups that will give insight to

the data, _{or at least} _{a more meaningful} dissection. This

condition replaces axiom 2:

21. _{We require} _{members of each particular} _group _{to possess a} degree _{of similarity} _which _{is not possessed} by items _not from _{the same group as each other.}

This type _{of grouping} _{is called} _clustering. (Note _that overlapping groups whilst _obeying _{axiom 2',} do not obey axiom 2, and that any non-overlapping _groups do form _a

dissection. ) "kith _clustering _{we are searching} _for _evidence that the _{data is multi-modal,} _{and thence} _grouping _the

observations. We can probably best _visualize this _as

looking for 'swarms' _{of points} _{in two dimensions.} _These

'swarm s' may be visualized in _three _dimensions (but _{of course} cannot be displayed _{so easily)} _{and one can hence use the}

analogy for higher dimensions.

(25)

18.

Given _{a set B of N observations} (bl, b2,... _{bN) measured on} a set of M attributes such that bi is defined by the attribute

profile (ail' _{ai2 ,... aiM) for i=1,...} N. We wish to partition B into _{K subsets} Bk(k=1,... K) such that _-

UB _=B k

(Axiom ₁₎

If _{we wish} _{our groups} _{to be non-overlapping} _{then we have} Bi _nBj=0 (for _all _{i, j such that} i/ j)

We define _{a clustering} function _between _{the observation}

bi and a subset A of B where comprises the p elements ba

m (m. 1,... p) by:

s(i, (>) _ f('5º(Gýº)CYT, %) .. o

),

92(&2)

cb

_{, "..}

_..

... 13,1 (iM, N _.

_"m

_{) ...}

where gj( ) is some function relating the jth _attribute of observation bi with the jth _attributes _{of those} bi Eß _and f is _{some function} _combining _{the attribute} _functions.

We can thus express Axiom 2' _as

S(i, q) > S(j, r)

for _all _{i, q such that} b,

_, bq r= Bk, all j such that

bjE BZ and all br BL.

The different _choices _{of the various} _functions _{in the} expression for S gives rise to a wide range of types of

clustering method. An approach _{used by many methods} is to

base the clustering function _between _{a point} _{and a group} _{on a} defined _relationship _between _{a pair} _{of points.} _{Thus we have}

S(i, j)ý, (IN(ct

(26)

18A.

and S(i, ß-) ₌ fa c(L)/?

i) )S (ý) 7z. ) _{, ...} s (. i

)/ P

(27)

19.

a similarity function between points, and from this to

construct groups. Often the functions gi (i=1... J) are identical _{or simply} _multiples _{of each other,} _{and similarly}

with hm (m=1... n) so the expressions simplify further in practice.

The relationships between principal components analysis

and a certain type of clustering called least squares cluster

analysis can be illustrated. v hilst p. c. a. attempts to

summarize the observation vs. variables matrix by reducing

the _number _{of variables,} _with _cluster _analysis _{we attempt} to

reduce the number of observations. However, p. c. a. always gives an optimum result, which is not true for least squares

classification.

E. g. If we are able in p. c. a. to reduce tvio-

dimensional data to one dimension then _graphically _{we are}

approximating a series of points to a series of linear

points.

U

(28)

20.

With _{the same set of points,} if _{we cluster} _{the seven} observations, we obtain two groups.

With p. c. a. we have reduced the 2 variable 7 observation matrix to a1 by 7 matrix and with cluster _analysis obtained

a2 by 2 matrix.

In a particular study _{of a set of data one might} well

use more than one multivariate technique _{- e. g. p. c. a.} and

cluster analysis. However, if data has clusters present

then _principal _components _{may give} _misleading _results _since

each cluster may have different underlying dimensions.

y"ö"

.f

X

zý

(29)

21.

In the above diagram a principal components analysis

would yield a first component of : C', whilst each cluster has its _{own major} _underlying dimension _{of YY' or ZZ'.}

Applications

The number of fields in which cluster analysis is being used is increasing rapidly. Also in each of the fields

where it has been used, its use is becoming more widespread.

Some aspects of most sciences have been subjected to

clustering methods in the past, and now the methods are being

applied in the business and social science fields.

Some examples of the use of clustering include:

1. Palaeontology, Botanical _{and Zoological} Sciences: the main application in this field has been in taxonomy _--the

grouping of plants, animals, insects, etc., into species. Previous to the use of mathematical techniques, detailed

measurement of specimens (a measurement is called an

operational taxonomic unit or OTU), together with the. past experience of the taxonomist, produced possible groupings

into _species. The problem is clearly _{one which} _cluster

analysis can tackle _{- finding} homogenous groups from many measurements of specimens _{- and-thus} it is not surprising

that this field _{was one of the first} to use clustering.

Indeed, _{some of the first} _numerical _methods _{viere propounded} by such practitioners of the zoological sciences as Sneath,

Sol-al, Rohlf _{and L ichener.} _{one of the first} _major _works in this field _{of numerical} taxonomy is Sokal _{and Sneath} (1963)

(30)

22.

clustering methods then available. Earlier articles which introduced _{new methods} _{are Sneath} (1957) _{and Sokal} _and

P:

Zicrener

(1958).

(See also Sneath 1902,1964a,

1964b, 1967;

Rohlf _{and Sokal} 1965,1967; Rohlf 1965,1957,1970;

Kubica _et _al 1973; _{and a book} by Blackith _{and Rayment} 1971")

Similarly Cheesman _{and Berridge} (1959) _attempt _grouping _of

bacteria from _{chromatography} _results.

t1nother _early _application _{was in ecology,} _{where the}

object of study is the relationship between different plant

species, and between them and their environment. By

dividing _{an area} _{of land} into _equal _sized _squares (called

quadrats or stands) each square can be examined and it can be

noted which plants are present in each particular square.

From this, by clustering _methods, _plants _{can be grouped,} so

that _plants _which tend to _co-exist _are in the _{same group.}

From this the _effects _of, _say, burning _off _grass, _{or of}

nearby canals, on vegetation can be assessed, and the effects

of one species on another. The early work in this field was

carried out at about the same time as the first investigations

in taxonomy _{and the} _ecology _methods _were then _quite different.

The pioneers of these methods were Lance and Williams and

their _co-workers, _currently in Australia. Their _papers

include lilliams _{and Lambert} (1959), Williams _{and Dale}

(1965), _{Lance and Williams} (1967). _However, _perhaps _the earliest work in this particular application is Sorenson

(1948) _{who made a study} _{of the vegetation} _{on Danish} _commons. Other _methods have been introduced by Rogers _{and Tanimoto}

(31)

23.

2. Psychology: One of the uses of clustering in this field has been to try _{and cluster} individuals into _groups _{on the}

basis _{of the results} _{of psychological} tests, _either to group people into 'types' as in McQuitty (1957,1964), for use in personnel administration (Ward and Hook 196-1. ), or to

allocate people to jobs, as in the armed services (see

Thorndike 1953). Another _{use is} in trying to _group

psychological tests into similar sets on the basis of test

results, and thus to gain insight into the nature of the

particular tests, as in Harman (1966). In psychology

cluster analysis is used as a method of data investigation in

order to seek structure in the data; this type of

application is very different from the taxonomy problem in

%fai ch the groupings are the required results. Thus in this

type _of data investigation, the _clustering _methods _{are used}

similarly to factor analysis, as a method for possible

hypothesis _generation. A typical _application _{of. this} type

is Liller (1969) _{who investigated} the _perceived _similarity

between _meanings _of _words.

3. Earth Sciences: The problem, in geography, _{of defining} 'regions' is _{one which} _{can be partially} _resolved by the use of cluster analysis. The definition of a region is very.

similar to that of a cluster: an area, the parts of which

are very similar to each other but dissimilar to the parts of other regions. An often desired property of the resultant

groups is that they must for.: contiguous areas (there is a discussion _{of this} in Johnston _1970). This _{can be achieved}

(32)

24.

observing the resultant groups, which _{may well} turn _{out to be} contiguous (as shown in King 1969) or by incorporating _{such a} constraint in the clustering _algorithm. Another _geographic

application is simply as a method of classification, _as-in the classification of American business districts by Berry

(1967). _{or British} _{towns by Andrews} ₍₁₉₇₁₎ _{and Moser and}

Scott (1961). Richter (1969) _{has clustered} _{the geographical}

location _{of firms} to investigate industry _grouping.

Another _problem _similar to that _{of the} _{regionalization}

problem is that of political districting, _{where an area must} be divided into _equal-sized _parts _for _{parliamentary}

constituencies or wards. Here, _apart from the constraints

of contiguousness and equal size, there is normally that of compactness. Whereas geographical _regions _{are often} long

thin _meandering _areas, it _{is normally} _considered _necessary _to have political regions _{as compact} _{as possible.} Given a

completely evenly spaced population the solution is that _of

tesselated hexagons (although _this _is _still _unproven). _See ; leaver _{and Hess (1963),} Harris (1964), _Thoreson _and

Lüttschwaoer (1907), Kaiser (1966 ), Bunge (1966) _and

Garfinkel

_{et al (1969)..}

Another _geographical _aspect _{in which classification}

methods have had some recent success is _.that _{of land} _use

analysis which is of particular _{use in environmental} _planning. Clustering has been used to find _areas (not _contiguous) _in

cities, in which similar _activities _occur. For example,

(33)

25.

city into five regions _{- the} liest End, Nestminster, Soho,

The City _{and Bloomsbury.} _Alexander (1972) _has _divided _the

centre of Perth, Australia into five _regions _{on the} basis _of

the business _activities _present. _In _{an earlier} _paper,

Goddard (1958), _grouped _London _into _five _regions _also _{on the}

basis _of the location _of _types _{of business} _activity _and

grouped the business _activities into _eighteen _groups. See

also Gittus (10954), Jones (1968), Dear (1969) _{and Golder} _and

Yeomans (1973).

Pedology is _another _{of the earth} _sciences _{to which} clustering has been applied in order to discover _soil

- regions. Grouping is carried out by measuring the soil coiposition, colour, type _{of stones,} _roots _present, _etc.

The main exponent _{of modern numerical} _methods _{in soil} _science

is Rayner (see Rayner 1965,1966,1969,

_{and also Bidwell}

_and

Hole _{1964 mid Grigal} _1969).

4. Life Sciences: _Apart _from _the _use _{of classification} _in

biological taxonomy, the _other _main _life _science _in _which

cluster _analysis has been used is that _{of anthropology.} _One of the areas of study has been the evolution _{of the American}

Indian tribes _{- by obtaining} _information _{on the social}

customs _{and behaviour} _{of each tribe,} it is _suggested _that by grouping together tribes _which _{are similar} _{on these} _counts,

one can trace the origins _{of each tribe.} Research _{has been} carried _{out along} these lines by Froeber (1939) _{and Clements}

(1954). _Linguistics _is

another subject to which cluster

(34)

26.

particular tribes or races, _either by comparing languages _or dialects for linguistic _similarity (see Levelt _1970).

Driver (1965) _{in a review} _article _discusses _{the progress} _of numerical classification _{and other} _statistical _methods in

anthropology up to 1963.

5. _Archaeology: This field _is _related _to _{anthropology,} _and

clustering has been used in a similar _way, to discover _past

relationships between _cultures, by measuring _similarities

between _artefacts. _{The use of cluster} _analysis _{has been}

discussed in Hodson, Sneath _{and Doran} (1966) _{who compare.} _the

relationships between 30 brooches found in Iron Age

cemeteries in Switzerland, _{as measured} by numerical _methods

and by the intuitive groupings by four _{archaeologists} _{and an}

anatomist, and conclude that the numerical _methods _{came out}

at least as good as the best intuitive _results. Another well written piece is that of Clarke (1962) _{who applied}

simple cluster analysis to _early British beaker _pottery.

Other _applications _are found _in _the _book _{of readings} _by

Hodson, Kendall _{and Taut-a} (1971). (See also Von Hagen-

Bordaz _{and Bordaz} _{1970. )}

6. _Information _Retrieval: _{In information} _retrieval _the

user's requirements are often that he wants _related

information to _{a particular} _subject. _This _{can be stated} _as

requiring to know other _similar information to _{a particular}

given piece of information, i. e. one wants to find the _other

members of the cluster to which the _possessed information

belongs. For instance _{one could} _measure _the _similarity

between journal _articles _{by the} _number _of _references _to _other

(35)

27.

data, _{or better,} _analyse _articles _{and classify} _{each into}

several descriptors (called keywords) and measure similarity

by the number of descriptors in common. Since _{one piece} _of information _{may be of use in} _several different fields _the

emphasis in this field has been on clusters which overlap, and i&i ch, in information retrieval, have been called

clumps _{- the} subject has been discussed by Minder _et _al

(1973). _{A large} _amount _of _the _English _work _in _this _field

has been done by Needham _{and Jones} _at _the _Cambridge _Language

Research Unit. A great deal _{of work} _{has been} _done _{on the}

type _{of classification} _needed in information _retrieval (this

is _covered in Jones _1971), but little _{has been} _attempted in

terms _{of practical} _results; _{one of the methods used is} explained by Needham (1965).

7. Medicine: One can consider the task _{of a doctor} _or psychiatrist as the allocation of patients to groups

corresponding to diseases, according to their symptoms. Any, patient with a particular disease or disorder _will have symp-

toms similar to those of other _patients _with that _condition. Thus cluster analysis could _possibly be used for fast

diagnosis, _{or as an aid for} human diagnosis. _{The numerical} methods _{are of more use in the psychiatric} field _{where the}

diagnosis _problem is more difficult. By investigation _of

data _{on patients} _with known disorders, _cluster _analysis _may be used to find the major discriminating _variables _between

disorders _{and thus} diagnosis _{of new patients} _{may be achieved}

(36)

28.

1963,1965), _{and an overview} of statistical methods in

psychiatric research is given by Moran (1969). See also

Strauss _et _al (1973), Everitt _et _al (1971), Pilowsky _et _al

(1969) _{and Paykel} (1971). _Applications _in _diagnosis _of

liver diseases _are _given in Baron _{and Fraser} (1968), _{and the}

use of clustering in investigating child deaths has been

investigated by Carpenter (Royal Statistical Society,

Multivariate Study Group Meeting, 9th October 1973).

Another _application in medicine is that _{of clustering}

particular cases of a disease, in space and time, in order to

detect _epidemics _early. This is discussed by Knox (1964),

Armitage (1971) _{and Pike} _{and Smith} (1968).

8. _Signal _Detection _{and Pattern} _Recognition: In the

physical sciences, there has been a large amount of work done in analysing noisy data _{such as in radio} signals, or pictures

from _satellites (see Haralick _{and Dinstein} 1971). Here

cluster analysis can be used to reduce this noise. Another

similar problem is in machines which _{are required} to 'read' or interpret other visual material. With machine readers, clustering can be used to reduce input characters to 26

groups, which hopefully represent each of the letters of the alphabet. There is a large volume of work on this type of

subject, , which also includes work in related fields such as classification into known distributions,, discriminant

analysis, etc. See Sebestyn (1962a, b, 1966), Ball and Hall

(10,66),

Rutovitz (1966),

. Casey. and Nagy (1968), Switzer

(37)

29.

9. Business: Because _{of the varied} _aspects _{of management,}

and the large amounts of data available, business is one of

the largest _areas for _cluster _analysis _{applications.} The

main discussion of the use of clustering methods is left

until later in this work, but at this point we can outline

one or two areas in which clustering is of obvious use.

In _marketing, because _of _the _abundance _of _data _from

surveys and the lack of detailed knowledge _{of many aspects} of

marketing such as buyer behaviour, _advertizing _{effectiveness,}

product market, etc., many multivariate methods have, been

applied. Examples include: clustering towns in order to

select one from each cluster as a representative set in order

to test _market _products (see Green, Frank _{and Robinson} _1967,

Morrison _{1967) ;} _clustering _stores, in _order to find _similar

stores to test product prices (see Day and Feeler 1971); and

clustering different models of the _{same product} to. ensure

your company is-competing in each sector of the _product

market (see Green and Tull 1970, and Frank _{and Green} 1968).

0

In investment _analysis _{the grouping} _{of shares} _into _sets which behave similarly over time can enable one to build a

portfolio which minimizes expected risk, i. e. one selects one investment in each group, _{to protect} _{the portfolio} _from _a

particular slump in one area of the market (see King 1966, Farrar _{1962 and Russell} _{and Taylor} _1068).

Other Applications

(38)

30.

investigated by these _methods _-a few examples are listed below:

Geology _{- Miller} _{and Ka: 1n (1962 ),. Parks} (1966), Gower (1970).

Sociology _{- Chabot} (1950), Alexander (1963),

. Beum and

Brundage (1950). Coleman _{and McRae (1960),} Forsyth

and Katz (1946), Laumann and Guttman (1966).

Criminology _{- Wilkins} _{and BlcNaughton-Smith} (1904).

Experimental Design _{- Cochran} _{and Cox (1957),} Kish

(1965), _Kennard _{and Stone} (1969), _Marriott (1970), Ca. linski (1971), Golder _{and Yeomans (1973).}

Economic Aggregation _{- Skolka} (1964), Fisher (1969).

DNA Classification _{- Ehrlich} (1964), Silvestri _{and Hill} (1904), _Fitch _{and Margoliash} (-1967)

.

Discarding Variables in p. c. a. - Jolliffe (1972,1973). History

The early work on clustering _{was carried} _out in two main

fields _{- those} _{of psychology} _{and of zoological} _{classification}

(numerical _taxonomy). _Cluster _analysis _{was first} _{used in.} the. great debate on the structure _{of human ability} _which.

followed Spear-man's 1927 book 'The Abilities _{of Irian' which}

brought together _all the ideas _{of the structure} _{of the mind's} processes which had begun to germinate in the 1920's. In

the ensuing discussing over the twenty _years _after Spearman' s book, _{psychologists} _{such as Thurstone} _{and Burt} _{used early}

multivariate techniques, especially factor analysis in order

(39)

31.

clustering method, trying to group intelligence tests

according to the precise ability they _{were testing,} _{such as} mathematical or verbal _ability.

Classification _methods in their _{own right} _{were first}

used in the late 1930's (see Holtzinger _{and Harman 1938) and} it _{was about} then that _{the term cluster} _analysis _{was first}

used. The first major _published _{work on method was Tryon} (1939). _{These early} _methods

were simple _{and fairly} _rough since they had to be computed manually _{- they normally}

involved _simple investigations _{of correlation} _matrices.

Because _{of this} limitation _{on methods,} _{the subject} _advanced very slowly after the publication of these few hand methods,

and it was not until the advent of the high _{speed computers} in the _1950's that the _subject _{was to advance to any great}

extent.

The first breakthroughs _{of the computer} _{age were in} zoology, where the first computer-based _methods began to appear in the late 1950's (see Sneath 1957, Sokal _and

I,: ichener _{1958 and Michener} _{and Sokal} _1957). _{These methods} were used as a simulation _{of the way in which the taxonomist} was able to group species, and resulted from the realization

of the fact that taxonomy _{was a simple} _clustering _problem.

With the publication of the first few methods and with larger and faster computers, larger problems could be solved by the existing methods, and more complex _methods _{were introduced}

(40)

32.

other fields _were _also _{experimenting} _with _other forms _of

clustering, _especially in _ecology _{and information} _retrieval,

and also workers in _psychology had continued their _early

work. Because _of this _growth in _parallel, _with _each _stream

of investigation having little knowledge _of _the _work _in _other

streams, _{came a great} deal _of duplication _{of work,} _and

differing _terminology. _However, _{as these} _areas _{of interest}

became _aware _of _each _other, _methods _discovered _in

one subject

were used in others and cross-fertilization _expanded the _use

of cluster _analysis. Thus _cluster _analysis became _{a subject}

in its _{own right} _{and was another} _mathematical _tool _{to be used}

in _{many kinds} _{of research.}

In the last five _{to ten years} _{the subject} _{has expanded}

rapidly in two directions _{- new applications} _{and new methods,} and along with this expansion has been a related increase in the theoretical difficulties. _{One of the more unique} _and

troubling _properties _{of this} _{new technique} _{of cluster}

analysis is that over recent _years the number of methods in use has increased rapidly, _{and there} have been few attempts

to rationalize _{the use. of particular} _methods. If there has been one factor which has inhibited the growth 'of the use of

clustering, it is the fact that there _exists _{no one-method} which has been proved 'best'.

In the last _six _{or seven years,} _operational _research

scientists have begun to look _{at cluster} _analysis _{and other} multivariate _methods _{and to use these} techniques _{as methods}

(41)

33.

first _{area to be subjected} to these methods, being an area in which data is abundant and knowledge about that data is

scarce, and the first published works appeared in 1966 and 1957. Another management function currently using

clustering is the personnel field where studies in psychology are of relevance to selection and placement.

Methodology

One definition _of _cluster _analysis is that _{we are}

searching for 'a degree of similarity between members from the _{same group,} _which is not possessed by members not both

from _{the same group'} _{and another} definition is that _{we obtain} clusters 'which are in some sense optimal'. The exact

meaning of these definitions is one of the central problems in cluster analysis _{- what do we mean by 'a degree} of

similarity', what do we mean by 'similarity', what does 'in

some sense optimal' mean, and in what sense? The answers to these _questions _will _vary _with the particular _application _and

thus the method used will be dependent _{on the application.} In general, however, all cluster methods reach their

conclusions by passing through four stages _{- Measurement,}

Similarity, Method _{and Analysis} _{of Results.} Measurement is the obtaining of the required data and the process of putting this in the required form for _{our investigations,.} Similarity

is the production of a measure of similarity by which we will be able to compare the objects under study, Method is the