Fionn Murtagh
School of Computer Science
The Queen's University of Belfast
Belfast BT7 1NN, Northern Ireland
f.murtagh@qub.ac.uk
July 10, 2000
Abstract

We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Theoretical results developed as far back as the 1960s still very often remain topical. More recent work is also covered in this article. This includes a solution for the statistical question of how many clusters there are in a data set. We also look at one line of inquiry in the use of clustering for human-computer user interfaces. Finally, the visualization of data leads to the consideration of data arrays as images, and we speculate on future results to be expected here.
`Now,' said Rabbit, `this is a Search, and I've Organised it --'
`Done what to it?' said Pooh.
`Organised it. Which means -- well, it's what you do to a Search, when you don't all look in the same place at once.'
A.A. Milne, The House at Pooh Corner (1928)
Topics:

1. Introduction
2. Binning or Bucketing
3. Multidimensional Binary Search or kD Tree
4. Projections and Other Bounds
5. The Special Case of Sparse Binary Data
6. Fast Approximate Nearest Neighbor Finding
7. Hierarchical Agglomerative Clustering
8. Graph Clustering
9. Nearest Neighbor Finding on Graphs
10. K-Means and Family
11. Fast Model-Based Clustering
12. Noise Modeling
14. Images from Data
15. Conclusion
1 Introduction
Nearest neighbor searching is considered in sections 2 to 6 for one main reason: its utility for the clustering algorithms reviewed in sections 7 to 10 and, partially, 11. Nearest neighbor searches are the building blocks for the most efficient implementations of hierarchical clustering algorithms, and they can be used to speed up other families of clustering algorithms.

The best match or nearest neighbor problem is important in many disciplines. In statistics, k-nearest neighbors, where k can be 1, 2, etc., is a method of non-parametric discriminant analysis. In pattern recognition, this is a widely-used method for supervised classification (see Dasarathy, 1991). Nearest neighbor algorithms are the building block of clustering algorithms based on nearest neighbor chains, and of effective heuristic solutions for combinatorial optimization problems such as the traveling salesman problem, which is a paradigmatic problem in many areas. In the database and, more particularly, data mining field, NN searching is called similarity query, or similarity join (Bennett et al., 1999).
In section 2, we begin with data structures where the objective is to break the O(n) barrier for determining the nearest neighbor (NN) of a point. A database record or tuple may be taken as a point in a space of dimensionality m, the latter being the associated number of fields or attributes. These approaches have been very successful, but they are restricted to low-dimensional NN-searching. For higher dimensional data, a wide range of bounding approaches have been proposed, which remain O(n) algorithms but with a low constant of proportionality.

We assume familiarity with basic notions of similarity and distance, the triangular inequality, ultrametric spaces, Jaccard and other coefficients, and normalization and standardization. For an implicit treatment of data theory and data coding, see Murtagh and Heck (1987). Useful background reading can be found in Arabie et al. (1996). In particular, output representational models include discrete structures, e.g. rooted labeled trees or dendrograms, and spatial structures (Arabie and Hubert, 1996), with many hybrids.

Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for the clustering algorithms to follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
2 Binning or Bucketing
In this approach to NN-searching, a preprocessing stage precedes the searching stage. All points are mapped onto indexed cellular regions of space, so that NNs are found in the same or in closely adjacent cells. Taking the plane as an example, and considering points (x_i, y_i), the maximum and minimum values on all coordinates are obtained (e.g. (x_j^min, y_j^min)). Consider the mapping (Fig. 1)

    x_ij → ⌊(x_ij − x_j^min) / r⌋

where the constant r is chosen in terms of the number of equally spaced categories into which the interval [x_j^min, x_j^max] is to be divided. This gives to x_i an integer value between 0 and ⌊(x_j^max − x_j^min) / r⌋ for each attribute j. O(nm) time is required to obtain the transformation of all n points, and the result may be stored as a linked list, with a pointer from each cell identifier to the set of points mapped onto that cell.
Figure 1: Example of simple binning in the plane, with (x_min, y_min) = (0, 0), (x_max, y_max) = (50, 40) and r = 10. Point (22, 32) is mapped onto cell (2, 3); point (8, 13) is mapped onto cell (0, 1).
NN-searching begins by finding the closest point among those which have been mapped onto the same grid cell as the target point. This gives a current NN point. A closer point may be mapped onto some other grid cell if the distance between the target point and the current NN point is greater than the distance between the target point and any of the boundaries of the cell containing it. Some further implementation details can be found in Murtagh (1993a).
A powerful theoretical result regarding this approach is as follows. For uniformly distributed points, the NN of a point is found in O(1), or constant, expected time (see Delannoy, 1980, or Bentley et al., 1980, for proof). Therefore this approach will work well if approximate uniformity can be assumed, or if the data can be broken down into regions of approximately uniformly distributed points.
Simple Fortran code for this approach is listed, and discussed, in Schreiber (1993). The search through adjacent cells requires time which increases exponentially with dimensionality (if it is assumed that the number of points assigned to each cell is approximately equal). As a result, this approach is suitable for low dimensions only. Rohlf (1978) reports on work in dimensions 2, 3, and 4; and Murtagh (1983) in the plane. Rohlf also mentions the use of the first 3 principal components to approximate a set of points in 15-dimensional space.

From the constant expected time NN search result, particular hierarchical agglomerative clustering methods can be shown to be of linear expected time, O(n) (Murtagh, 1983). The expected time complexity for Ward's minimum variance method is given as O(n log n). Results on the hierarchical clustering of up to 12,000 points are discussed.

The limitation on these very appealing computational complexity results is that they are only really feasible for data in the plane. Bellman's curse of dimensionality manifests itself here as always. For dimensions greater than 2 or 3, we proceed to the situation where a binary search tree can provide us with a good preprocessing of our data.
3 Multidimensional Binary Search or kD Tree
A multidimensional binary search tree (MDBST) preprocesses the data to be searched through by two-way subdivision, and subdivisions continue until some prespecified number of data points is arrived at. See the example in Fig. 2. We associate with each node of the decision tree the definition of a subdivision of the data only, and we associate with each terminal node a pointer to the stored coordinates of the points. Using the approximate median of projections keeps the tree balanced, and consequently gives O(log n) levels, at each of which O(n) processing is required. Hence the construction of the tree takes O(n log n) time.

Figure 2: An MDBST using planar data.
The search for a NN then proceeds by a top-down traversal of the tree. The target point is transmitted through successive levels of the tree using the defined separation of the two child nodes at each node. On arrival at a terminal node, all associated points are examined and a current NN selected. The tree is then backtracked: if the points associated with any node could furnish a closer point, then subnodes must be checked out.

The approximately constant number of points associated with terminal nodes (hyper-rectangular cells in the space of points) should be greater than 1, in order that some NNs may be obtained without requiring a search of adjacent cells (other terminal nodes). Friedman et al. (1977) suggest a value of the number of points per bin of between 4 and 32, based on empirical study.
The MDBST approach only works well with small dimensions. To see this, consider each coordinate being used once and once only for the subdivision of points, i.e. each attribute is considered equally useful. Let there be p levels in the tree, i.e. 2^p terminal nodes. Each terminal node contains approximately c points by construction, and so c 2^p = n. Therefore p = log_2 (n/c). As sample values, if n = 32768 and c = 32, then p = 10. I.e. in 10-dimensional space, even using a large number of points associated with terminal nodes, more than 30000 points will need to be considered. For high dimensional spaces, two alternative MDBST specifications are as follows.
All attributes need not be considered for splitting the data if it is known that some are of greater interest than others. Linearity present in the data may manifest itself via the variance of projections of points on the coordinates; choosing the coordinate with greatest variance as the discriminator coordinate at each node may therefore allow repeated use of certain attributes. This has the added effect that the hyper-rectangular cells into which the terminal nodes divide the space will be approximately cubical in shape. In this case, Friedman et al. (1977) show that search time is O(log n) on average for the finding of a NN. Results obtained for dimensionalities of between 2 and 8 are reported on in Friedman et al. (1977), and in the application of this approach to minimal spanning tree construction in Bentley and Friedman (1978). Lisp code for the MDBST is discussed in Broder (1990).
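The maximum-variance discriminator and the backtracking search can be sketched as follows. This is our illustrative Python rendering under the assumptions just described (median split, variance-chosen discriminator, small leaf size), not the code of Friedman et al. or Broder; the dict-based node representation is simply for brevity.

```python
import math

def build_mdbst(points, leaf_size=8):
    """Recursively split on the coordinate of greatest variance, at the
    median, until at most leaf_size points remain (an MDBST / kd-tree)."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    m = len(points[0])
    def variance(j):
        col = [p[j] for p in points]
        mu = sum(col) / len(col)
        return sum((v - mu) ** 2 for v in col)
    disc = max(range(m), key=variance)       # discriminator coordinate
    pts = sorted(points, key=lambda p: p[disc])
    mid = len(pts) // 2                      # approximate median split
    return {"disc": disc, "split": pts[mid][disc],
            "lo": build_mdbst(pts[:mid], leaf_size),
            "hi": build_mdbst(pts[mid:], leaf_size)}

def nn(node, q, best=None):
    """Top-down traversal with backtracking: visit the child on q's side
    of the split first, then the other child only if it could do better."""
    if "leaf" in node:
        for p in node["leaf"]:
            d = math.dist(p, q)
            if best is None or d < best[1]:
                best = (p, d)
        return best
    if q[node["disc"]] < node["split"]:
        near, far = node["lo"], node["hi"]
    else:
        near, far = node["hi"], node["lo"]
    best = nn(near, q, best)
    # Backtrack: the far side can hold a closer point only if the
    # splitting hyperplane is nearer than the current NN distance.
    if best is None or abs(q[node["disc"]] - node["split"]) < best[1]:
        best = nn(far, q, best)
    return best
```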
The MDBST has also been proposed for very high dimensionality spaces, i.e. where the dimensionality may be greater than the number of points, as could be the case in a keyword-based system. Keywords (coordinates) are batched, and the following decision rule is used: if some one of a given batch of node-defining discriminating attributes is present, then take the left subtree, else take the right subtree. Large n, well in excess of 1400, was stated as necessary for good results (Weiss, 1981; Eastman and Weiss, 1982). General guidelines for the attributes which define the direction of search at each level are that they be related, and that the number chosen should keep the tree balanced. On intuitive grounds, our opinion is that this approach will work well if the clusters of attributes, defining the tree nodes, are mutually well separated.

An MDBST approach is used by Moore (1999) in the case of Gaussian mixture clustering.
Figure 3: Two-dimensional example of a projection-based bound. Only points with projections within distance c of the given point's (*) projection are searched. Distance c is defined with reference to a candidate or current nearest neighbor.
Over and above the search for nearest neighbors based on Euclidean distance, Moore allows for the Mahalanobis metric, i.e. distance to cluster centers which are "corrected" for the (Gaussian) spread or morphology of clusters. The information stored at each node of the tree includes covariances. Moore (1999) reports results on numbers of objects of around 160,000, dimensionalities of between 2 and 6, and speedups of 8-fold to 1000-fold. Pelleg and Moore (1999) discuss results on some 430,000 two-dimensional objects from the Sloan Digital Sky Survey (see section 10 below).
4 Projections and Other Bounds
4.1 Bounding using Projection or Properties of Metrics
Making use of bounds is a versatile approach, which may be less restricted by dimensionality. Some lower bound on the dissimilarity is efficiently calculated in order to dispense with the full calculation in many instances.

Using projections on a coordinate axis allows the exclusion of points in the search for the NN of point x_i. Only points x_k are considered such that (x_ij − x_kj)^2 ≤ c^2, where x_ij is the jth coordinate of x_i, and where c is some prespecified distance (see Fig. 3).

Alternatively, more than one coordinate may be used. The prior sorting of coordinate values on the chosen axis or axes expedites the finding of points whose full distance calculation is necessitated. The preprocessing required with this approach involves the sorting of up to m sets of coordinates, i.e. O(mn log n) time.
Using one axis, it is evident that many points may be excluded if the dimensionality is very small, but the approach will worsen as the latter grows. Friedman et al. (1975) give the expected NN search time, under the assumption that the points are uniformly distributed, as O(m n^(1−1/m)). This approaches the brute force O(nm) as m gets large. Reported empirical results are for dimensions 2 to 8.
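The single-axis case can be sketched as follows: after a one-off sort on the chosen coordinate, we scan outwards from the query's projection and stop as soon as the projection difference alone exceeds the current NN distance. This is our illustrative Python version (the function name `nn_projection` and the outward two-sided scan are our choices, consistent with the bound described above).

```python
import bisect, math

def nn_projection(points, q):
    """NN search with a single-coordinate projection bound: a point whose
    first coordinate differs from q's by at least the current NN distance
    cannot be nearer, so its full distance is never computed."""
    # Preprocessing: sort on coordinate 0 (O(n log n), done once).
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    proj = [points[i][0] for i in order]
    # Start from the point whose projection is closest to q's.
    k = bisect.bisect_left(proj, q[0])
    best, best_d = None, float("inf")
    lo, hi = k - 1, k
    # Scan outwards, always taking the side with the smaller projection gap.
    while lo >= 0 or hi < len(proj):
        if hi < len(proj) and (lo < 0 or proj[hi] - q[0] <= q[0] - proj[lo]):
            i, side = hi, "hi"
        else:
            i, side = lo, "lo"
        if abs(proj[i] - q[0]) >= best_d:
            break  # no remaining point can beat the current NN
        d = math.dist(points[order[i]], q)
        if d < best_d:
            best, best_d = order[i], d
        if side == "hi":
            hi += 1
        else:
            lo -= 1
    return best, best_d
```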
Marimont and Shapiro (1979) extend this approach by the use of projections in subspaces of dimension greater than 1 (usually about m/2 is suggested). This can be further improved if the subspace of the principal components is used. Dimensions up to 40 are examined.
The Euclidean distance is very widely used. Two other members of the family of Minkowski metric measures require less computation time to calculate, and they can be used to provide bounds on the Euclidean distance. We have:

    d_∞(x, x') ≤ d_2(x, x') ≤ d_1(x, x')

where d_1 is the Hamming distance defined as Σ_j |x_j − x'_j|; the Euclidean distance, d_2, is given by the square root of Σ_j (x_j − x'_j)^2; and the Chebyshev distance is defined as d_∞(x, x') = max_j |x_j − x'_j|.
Kittler proposed the following rejection rule: reject all points y such that d_1(x, y) ≥ √m δ, where δ is the current NN d_2-distance. The more efficiently calculated d_1-distance may thus allow the rejection of many points (90% in 10-dimensional space is reported on by Kittler). Kittler's rule is obtained by noting that the greatest d_1-distance between x and x' is attained when

    |x_j − x'_j|^2 = d_2^2(x, x') / m

for all coordinates, j. Hence d_1(x, x') = √m d_2(x, x') is the greatest d_1-distance between x and x'. In the case of the rejection of point y, we then have:

    d_1(x, y) ≤ √m d_2(x, y)

and since, by virtue of the rejection,

    d_1(x, y) ≥ √m δ

it follows that δ ≤ d_2(x, y).
Yunck (1976) presents a theoretical analysis for the similar use of the Chebyshev metric. Richetin et al. (1980) propose the use of both bounds. Using uniformly distributed points in dimensions 2 to 5, the latter reference reports the best outcome when the rule "reject all y such that d_∞(x, y) ≥ δ" precedes the rule based on the d_1-distance. Up to 80% reduction in CPU time is reported.
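Both bounds can be sketched together in a few lines of Python. For clarity the coordinate differences are computed in full here (in a tuned implementation the running sums would be cut off early, which is where the saving comes from); the function name is ours.

```python
import math

def nn_with_minkowski_bounds(points, q):
    """Brute-force NN in d_2, but reject candidates via the cheaper
    d_inf and d_1 bounds before any square root is taken:
    reject y if d_inf(q, y) >= delta, or if d_1(q, y) >= sqrt(m) * delta."""
    m = len(q)
    root_m = math.sqrt(m)
    best, delta = None, float("inf")
    for i, y in enumerate(points):
        diffs = [abs(a - b) for a, b in zip(q, y)]
        if max(diffs) >= delta:            # Chebyshev bound: d_inf <= d_2
            continue
        if sum(diffs) >= root_m * delta:   # Kittler's d_1 bound
            continue
        d = math.sqrt(sum(t * t for t in diffs))
        if d < delta:
            best, delta = i, d
    return best, delta
```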
4.2 Bounding using the Triangular Inequality
The triangular inequality is satisfied by distances: d(x, y) ≤ d(x, z) + d(z, y), where x, y and z are any three points. The use of a reference point, z, allows a full distance calculation between point x, whose NN is sought, and y to be avoided if

    |d(y, z) − d(x, z)| ≥ δ

where δ is the current NN distance. The set of all distances to the reference point are calculated and stored in a preprocessing step requiring O(n) time and O(n) space. The above cut-off rule is obtained by noting that

    d(x, y) ≥ |d(y, z) − d(x, z)|

always holds, so that the rejection condition implies d(x, y) ≥ δ. The former inequality reduces to the triangular inequality, irrespective of which of d(y, z) or d(x, z) is the greater.

The set of distances to the reference point, {d(x, z) | x}, may be sorted in the preprocessing stage. Since d(x, z) is fixed during the search for the NN of x, it follows that the cut-off rule will not then need to be applied in all cases.
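A single-reference-point sketch in Python follows; the function name and the returned `skipped` counter (showing how many full distance calculations the cut-off avoided) are our illustrative additions.

```python
import math

def nn_reference_point(points, q, z=0):
    """Reference-point pruning: with all d(y, z) precomputed, skip the full
    distance d(q, y) whenever |d(y, z) - d(q, z)| >= current NN distance."""
    # Preprocessing, done once for the data set: O(n) distances to z.
    ref = [math.dist(p, points[z]) for p in points]
    dqz = math.dist(q, points[z])
    best, delta = None, float("inf")
    skipped = 0
    for i, y in enumerate(points):
        if abs(ref[i] - dqz) >= delta:     # triangular-inequality cut-off
            skipped += 1
            continue
        d = math.dist(q, y)
        if d < delta:
            best, delta = i, d
    return best, delta, skipped
```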
The single reference point approach, due to Burkhard and Keller (1973), was generalized to multiple reference points by Shapiro (1977). The sorted list of distances to the first reference point, {d(x, z_1) | x}, is used as described above as a preliminary bound. Then the subsequent bounds are similarly employed to further reduce the points requiring a full distance calculation. The number and the choice of reference points to be used is dependent on the distributional characteristics of the data. Shapiro (1977) finds that reference points ought to be located away from groups of points. In 10-dimensional simulations, it was found that at best only 20% of full distance calculations were required (although this was very dependent on the choice of reference points).
Hodgson (1988) proposes the following bound, related to the training set of points, y, among which the NN of point x is sought. Determine in advance the NNs and their distances, d(y, NN(y)), for all points y in the training set. For point y, then consider δ_y = ½ d(y, NN(y)). In seeking NN(x), and having at some time in the processing a candidate NN, y', we can exclude all further y from consideration if we find that d(x, y') ≤ δ_{y'}. In this case, we know that we are sufficiently close to y' that we cannot improve on it.
A related approach precomputes the inter-point distances between the members of the training set. Given x, whose NN we require, some member of the training set is used as a reference point. Using the bounding approach based on the triangular inequality, described above, allows other training set members to be excluded from any possibility of being NN(x). Micó et al. (1992) and Ramasubramanian and Paliwal (1992) discuss further enhancements to this approach, focused especially on the storage requirements.
Fukunaga and Narendra (1975) make use of both a hierarchical decomposition of the data set (they employ, repeatedly, the k-means partitioning technique), and bounds based on the triangular inequality. For each node in the decomposition tree, the center of the associated points, and the maximum distance from this center to any associated point (the "radius"), are determined. For 1000 points, 3 levels were used, with a division into 3 classes at each node.

All points associated with a non-terminal node can be rejected in the search for the NN of point x if the following rule (Rule 1) is not verified:

    d(x, g) − r_g < δ

where δ is the current NN distance, g is the center of the cluster of points associated with the node, and r_g is the radius of this cluster. For a terminal node which cannot be rejected on the basis of this rule, each associated point, y, can be tested for rejection using the following rule (Rule 2):

    |d(x, g) − d(y, g)| ≥ δ.

These two rules are direct consequences of the triangular inequality.

A branch and bound algorithm can be implemented using these two rules. This involves determining some current NN (the bound) and subsequently branching out of a traversal path whenever the current NN cannot be bettered. Not being inherently limited by dimensionality, this approach appears particularly attractive for general purpose applications.
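The two rules can be sketched in a single-level form. Fukunaga and Narendra use a multi-level k-means tree; here, as a simplification, the partition is simply given as a list of clusters, and the function name `bb_nn` is ours.

```python
import math

def bb_nn(clusters, q):
    """Single-level sketch of Fukunaga-Narendra branch-and-bound NN search.
    Rule 1 rejects a whole cluster; Rule 2 rejects individual points
    using their precomputed distances d(y, g) to the cluster center."""
    # Preprocessing: center g and radius r_g of each cluster,
    # plus each member's distance to its center.
    meta = []
    for pts in clusters:
        m = len(pts[0])
        g = tuple(sum(p[j] for p in pts) / len(pts) for j in range(m))
        dists = [math.dist(p, g) for p in pts]
        meta.append((g, max(dists), dists))
    best, delta = None, float("inf")
    # Visit clusters nearest-center first (the branch-and-bound ordering).
    order = sorted(range(len(clusters)),
                   key=lambda c: math.dist(q, meta[c][0]))
    for c in order:
        g, r_g, dists = meta[c]
        dqg = math.dist(q, g)
        if dqg - r_g >= delta:
            continue                       # Rule 1: whole cluster rejected
        for y, dyg in zip(clusters[c], dists):
            if abs(dqg - dyg) >= delta:
                continue                   # Rule 2: single point rejected
            d = math.dist(q, y)
            if d < delta:
                best, delta = (c, y), d
    return best, delta
```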
Other rejection rules are considered by Kamgar-Parsi and Kanal (1985). A simpler form of clustering is used in the variant of this algorithm proposed by Niemann and Goppert (1988): a shallow MDBST is used, followed by a variant of the branching and bounding described above.

Bennett et al. (1999) use the nearest neighbor problem as a means towards solving the Gaussian distribution mixture problem. They consider a preprocessing approach similar to Fukunaga and Narendra (1975), but with an important difference: to take better account of cluster structure in the data, the clusters are multivariate normal but not necessarily of diagonal covariance structure. Therefore very elliptical clusters are allowed. This in turn implies that a cluster radius is not of great benefit for establishing a bound on whether or not distances need to be calculated. Bennett et al. (1999) address this problem by seeking a stochastic guarantee on whether or not calculations can be excluded. Technically, however, such stochastic bounds are not easy to determine in a high dimensional space.
An interesting issue raised in Beyer et al. (1999) is discussed also by Bennett et al. (1999): if the ratio of the nearest and furthest neighbor distances converges in probability to 1 as the dimensionality increases, then is it meaningful to search for nearest neighbors? This issue is not all that different from saying that neighbors in an increasingly high dimensional space tend towards being equidistant. In section 5, we will look at approaches for handling particular classes of data of this type.
4.3 Fast Approximate Nearest Neighbor Finding
Kushilevitz et al. (1998), working in Euclidean and L_1 spaces, propose fast approximate nearest neighbor searching, on the grounds that in systems for content-based image retrieval, approximate results are adequate. Projections are used to bound the search. The probability of successfully finding the nearest neighbor is traded off against time and space requirements.
5 The Special Case of Sparse Binary Data
"High-dimensional", "sparse" and "binary" are the characteristics of keyword-based bibliographic data, with maybe values in excess of 10000 for both n and m. Such data is usually stored as list data structures, representing the mapping of documents onto index terms, or vice versa. Commercial document collections are usually searched using a Boolean search environment. Documents associated with particular terms are retrieved, and the intersection (AND), union (OR) or other operations on such sets of documents are obtained. For efficiency, an inverted file which maps terms onto documents must be available for Boolean retrieval. The efficient NN algorithms, to be discussed, make use of both the document-term and the term-document files.
The usual algorithm for NN-searching considers each document in turn, calculates the distance with the given document, and updates the NN if appropriate. This algorithm is shown schematically in Fig. 4 (top). The inner loop is simply an expression of the fact that the distance or similarity will, in general, require O(m) calculation: examples of commonly used coefficients are the Jaccard similarity, and the Hamming (L_1 Minkowski) distance.

If m̄ and n̄ are, respectively, the average number of terms associated with a document, and the average number of documents associated with a term, then an average complexity measure, over n searches, of this usual algorithm is O(n m̄). It is assumed that advantage is taken of some packed form of storage in the inner loop (e.g. using linked lists).
Croft's algorithm (see Croft, 1977, and Fig. 4) is of worst case complexity O(nm^2). However, the number of terms associated with the document whose NN is required will often be quite small. The National Physical Laboratory test collection, for example, which was used in Murtagh (1982), has the following characteristics: n = 11429, m = 7491, m̄ = 19.9, and n̄ = 30.4. The outermost and innermost loops in Croft's algorithm use the document-term file. The center loop uses the term-document inverted file. An average complexity measure (more strictly, the time taken for best match search based on an average document with associated average terms) is seen to be O(n̄ m̄^2).
In the outermost loop of Croft's algorithm, there will eventually come about a situation where, if a document has not been thus far examined, the number of terms remaining for the given document do not permit the current NN document to be bettered. In this case, we can cut short the iterations of the outermost loop. The calculation of a bound, using the greatest possible number of terms which could be shared with a so-far unexamined document, has been exploited by Smeaton and van Rijsbergen (1981) and by Murtagh (1982) in successive improvements on Croft's algorithm.

The complexity of all the above algorithms has been measured in terms of operations to be performed. In practice, however, the actual accessing of term or document information may be of far greater cost. The document-term and term-document files are ordinarily stored on direct access file storage because of their large sizes. The strategy used in Croft's algorithm, and in improvements on it, does not allow any viable approach to batching together the records which are to be read successively, in order to improve accessing-related performance.
The Perry-Willett algorithm (see Perry and Willett, 1983) presents a simple but effective solution to the problem of costly I/O. It focuses on the calculation of the number of terms common to the given document x and each other document, y, in the document collection. This set of values is built up in a computationally efficient fashion. O(n) operations are subsequently required to determine the (dis)similarity, using another vector comprising the total numbers of terms associated with each document. Computation time (the same "average" measure as that used above) is O(n̄ m̄ + n). We now turn attention to the numbers of direct-access reads required.

In Croft's algorithm, all terms associated with the document whose NN is desired may be read in one read operation. Subsequently, we require n̄ m̄ reads, giving in all 1 + n̄ m̄. In the Perry-Willett algorithm, the outer loop again pertains to the one (given) document, and so all terms associated with this document can be read and stored. Subsequently, m̄ reads, i.e. the average number of terms, each of which demands a read of a set of documents, are required. This gives, in all, 1 + m̄.
Usual algorithm:

Initialize current NN
For all documents in turn do:
... For all terms associated with the document do:
... ... Determine (dis)similarity
... Endfor
... Test against current NN
Endfor
Croft's algorithm:
Initialize current NN
For all terms associated with the given document do:
... For all documents associated with each term do:
... ... For all terms associated with a document do:
... ... ... Determine (dis)similarity
... ... Endfor
... ... Test against current NN
... Endfor
Endfor
Perry-Willett algorithm:
Initialize current NN
For all terms associated with the given document, i, do:
... For all documents, i', associated with each term, do:
... ... Increment location i' of counter vector
... Endfor
Endfor
Figure 4: Algorithms for NN-searching using high-dimensional sparse binary data.
Figure 5: Five points, showing NNs and RNNs.
Since these reads are very much the costliest operation in practice, the Perry-Willett algorithm can be recommended for large values of n and m. Its general characteristics are that (i) it requires, as do all the algorithms discussed in this section, the availability of the inverted term-document file; and (ii) it requires in-memory storage of two vectors containing n integer values.
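The counter-vector idea of the Perry-Willett algorithm can be sketched in Python as follows. For illustration the "files" are in-memory dicts and sets (in practice the inverted file lives on direct-access storage, and each term's posting list costs one read, which is where the 1 + m̄ figure comes from); the function name and the use of the Jaccard similarity for the final O(n) pass are our choices.

```python
def perry_willett(doc_terms, x):
    """Perry-Willett NN search for sparse binary data. doc_terms maps each
    document to its term set; the term -> document inverted file is built
    once, and common-term counts with x are accumulated in one pass."""
    # Inverted file: term -> set of documents containing it.
    inverted = {}
    for doc, terms in doc_terms.items():
        for t in terms:
            inverted.setdefault(t, set()).add(doc)
    # Counter vector: number of terms shared with x.
    common = dict.fromkeys(doc_terms, 0)
    for t in doc_terms[x]:
        for doc in inverted.get(t, ()):
            common[doc] += 1
    # O(n) pass: Jaccard similarity from shared counts and set sizes.
    best, best_sim = None, -1.0
    for doc, c in common.items():
        if doc == x:
            continue
        union = len(doc_terms[x]) + len(doc_terms[doc]) - c
        sim = c / union if union else 0.0
        if sim > best_sim:
            best, best_sim = doc, sim
    return best, best_sim
```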
6 Hierarchical Agglomerative Clustering
The algorithms discussed in this section can be characterized as greedy (Horowitz and Sahni, 1979). A sequence of irreversible algorithm steps is used to construct the desired data structure.

We will not review hierarchical agglomerative clustering here. For essential background, the reader is referred to Murtagh and Heck (1987), Gordon (1999), or Jain and Dubes (1988). This section borrows on Murtagh (1992).
One could practically say that Sibson (1973) and Defays (1977) are part of the prehistory of clustering. Their O(n^2) implementations of the single link method and of a (non-unique) complete link method, respectively, have been widely cited.

In the early 1980s a range of significant improvements were made to the Lance-Williams, or related, dissimilarity update schema (de Rham, 1980; Juan, 1982), which had been in wide use since the mid-1960s. Murtagh (1985) presents a survey of these algorithmic improvements. We will briefly describe them here. The new algorithms, which have the potential for exactly replicating results found in the classical but more computationally expensive way, are based on the construction of nearest neighbor chains and reciprocal or mutual NNs (NN-chains and RNNs).

An NN-chain consists of an arbitrary point (a in Fig. 5); followed by its NN (b in Fig. 5); followed by the NN from among the remaining points (c, d, and e in Fig. 5) of this second point; and so on until we necessarily have some pair of points which can be termed reciprocal or mutual NNs. (Such a pair of RNNs may be the first two points in the chain; and we have assumed that no two dissimilarities are equal.)
In constructing an NN-chain, irrespective of the starting point, we may agglomerate a pair of RNNs as soon as they are found. What guarantees that we can arrive at the same hierarchy as if we used traditional "stored dissimilarities" or "stored data" algorithms? Essentially this is the same condition as that under which no inversions or reversals are produced by the clustering method. Fig. 6 gives an example of this, where s is agglomerated at a lower criterion value (i.e. dissimilarity) than was the case at the previous agglomeration between q and r. Our ambient space has thus contracted because of the agglomeration. This is due to the algorithm used, in particular the agglomeration criterion, and it is something we would normally wish to avoid.

This is formulated as:

    Inversion impossible if:  d(i, j) < d(i, k) or d(i, j) < d(j, k)  ⟹  d(i, j) < d(i ∪ j, k)

This is essentially Bruynooghe's reducibility property (Bruynooghe, 1977; see also Murtagh, 1984). Using the Lance-Williams dissimilarity update formula, it can be shown that the minimum variance method does not give rise to inversions; neither do the linkage methods; but the median and centroid methods cannot be guaranteed not to have inversions.

Figure 6: Alternative representations of a hierarchy with an inversion. Assuming dissimilarities, as we go vertically up, criterion values (d_1, d_2) decrease. But here, undesirably, d_2 > d_1.
To return to Fig. 5, if we are dealing with a clustering criterion which precludes inversions, then c and d can justifiably be agglomerated, since no other point (for example, b or e) could have been agglomerated to either of these.

The processing required, following an agglomeration, is to update the NNs of points such as b in Fig. 5 (and on account of such points, this algorithm was dubbed algorithme des célibataires in de Rham, 1980). The following is a summary of the algorithm:
NN-chain algorithm

Step 1: Select a point arbitrarily.
Step 2: Grow the NN-chain from this point until a pair of RNNs is obtained.
Step 3: Agglomerate these points (replacing with a cluster point, or updating the dissimilarity matrix).
Step 4: From the point which preceded the RNNs (or from any other arbitrary point if the first two points chosen in Steps 1 and 2 constituted a pair of RNNs), return to Step 2 until only one point remains.
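The four steps above can be sketched as follows for the single link criterion, which satisfies the reducibility property, so RNN pairs may be merged as soon as they are found. This is our illustrative Python rendering (dissimilarities in a dict keyed by unordered pairs, new clusters given fresh integer labels); it assumes, as in the text, that no two dissimilarities are equal.

```python
def nn_chain(d):
    """NN-chain agglomeration for single link. `d` maps frozenset pairs of
    cluster labels to dissimilarities; returns merges in the order found."""
    def dist(a, b):
        return d[frozenset((a, b))]
    active = {c for pair in d for c in pair}
    merges = []
    chain = []
    next_label = max(active) + 1
    while len(active) > 1:
        if not chain:
            chain = [min(active)]          # Step 1: arbitrary start point
        # Step 2: grow the chain until a reciprocal NN pair is reached.
        while True:
            tip = chain[-1]
            nn = min((c for c in active if c != tip),
                     key=lambda c: dist(tip, c))
            if len(chain) > 1 and nn == chain[-2]:
                break                      # tip and nn are RNNs
            chain.append(nn)
        i, j = chain.pop(), chain.pop()
        # Step 3: agglomerate, with the Lance-Williams single link update
        # d(i U j, k) = min(d(i, k), d(j, k)).
        merges.append((i, j, dist(i, j)))
        active -= {i, j}
        for k in active:
            d[frozenset((next_label, k))] = min(dist(i, k), dist(j, k))
        active.add(next_label)
        next_label += 1
        # Step 4: the remaining chain prefix (the point preceding the
        # RNNs) is continued on the next outer iteration.
    return merges
```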
In Murtagh (1983, 1984, 1985) and Day and Edelsbrunner (1984), one finds discussions of O(n^2) time and O(n) space implementations of Ward's minimum variance (or error sum of squares) method and of the centroid and median methods. The latter two methods are termed the UPGMC and WPGMC criteria by Sneath and Sokal (1973). Now, a problem with the cluster criteria used by these latter two methods is that the reducibility property is not satisfied by them. This means that the hierarchy constructed may not be unique, as a result of inversions or reversals (non-monotonic variation) in the clustering criterion value determined in the sequence of agglomerations.

Murtagh (1984) describes O(n^2) time and O(n^2) space implementations for the single link method, the complete link method, and for the weighted and unweighted group average methods (WPGMA and UPGMA). This approach is quite general vis à vis the dissimilarity used, and can also be used for hierarchical clustering methods other than those mentioned.

Day and Edelsbrunner (1984) prove the exact O(n^2) time complexity of the centroid and median methods using an argument related to the combinatorial problem of optimally packing hyperspheres into an m-dimensional volume. They also address the question of metrics: results are valid in a wide class of distances including those associated with the Minkowski metrics.
The NN-chain algorithm, and the related algorithm which performs agglomerations whenever reciprocal nearest neighbors meet, both offer possibilities for parallelization. Implementations on a SIMD machine were described by Willett (1989).

Evidently both coordinate data and graph (e.g., dissimilarity) data can be input to these agglomerative methods. Gillet et al. (1998), in the context of clustering chemical structure databases, refer to the common use of the Ward method, based on the reciprocal nearest neighbors algorithm, on data sets of a few hundred thousand molecules.

Applications of hierarchical clustering to bibliographic information retrieval are assessed in Griffiths et al. (1984). Ward's minimum variance criterion is favored.

From details in White and McCain (1997), the Institute of Scientific Information (ISI) clusters citations (science, and social science) by first clustering highly cited documents based on a single linkage criterion, and then four more passes are made through the data to create a subset of a single linkage hierarchical clustering.
7 Graph Clustering
Hierarchical clustering methods are closely related to graph-based clustering. For a start, a dendrogram is a rooted labeled tree. Secondly, and more importantly, some methods like the single and complete link methods can be displayed as graphs, and are very closely related to mainstream graph data structures.

An example of the increasing prevalence of graph clustering in the context of data mining on the web is presented in Fig. 7: Amazon.com provides information on what other books were purchased by like-minded individuals.
The single link method was referred to in the previous section as a widely-used agglomerative, hence hierarchical, clustering method. Rohlf (1982) reviews algorithms for the single link method with complexities ranging from O(n log n) to O(n^5). The criterion used by the single link method for cluster formation is weak, meaning that noisy data in particular give rise to results which are not robust.

The minimal spanning tree (MST) and the single link agglomerative clustering method are closely related: the MST can be transformed irreversibly into the single link hierarchy (Rohlf, 1973). The MST is defined as of minimal total weight, it spans all nodes (vertices), and it is an unrooted tree. The MST has been a method of choice for at least four decades now, either in its own right for data analysis (Zahn, 1971), as a data structure to be approximated (e.g. using shortest spanning paths, see Murtagh, 1985, p. 96), or as a basis for clustering. We will look at some fast algorithms for the MST in the remainder of this section.
Perhaps the most basic MST algorithm, due to Prim and Dijkstra, grows a single fragment through n − 1 steps. We find the closest vertex to an arbitrary vertex, calling these a fragment of the MST. We determine the closest vertex, not in the fragment, to any vertex in the fragment, and add this new vertex into the fragment. While there are fewer than n vertices in the fragment, we continue to grow it.

This algorithm leads to a unique solution. A default O(n^3) implementation is clear, and O(n^2) computational cost is possible (Murtagh, 1985, p. 98).
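The O(n^2) version of this fragment-growing procedure can be sketched as follows; the nearest-distance array that achieves the O(n^2) cost is the standard device, and the function name `prim_mst` is ours.

```python
import math

def prim_mst(points):
    """Prim-Dijkstra MST growth: start one fragment from an arbitrary
    vertex and, n - 1 times, attach the outside vertex closest to the
    fragment. The nearest-distance array gives O(n^2) overall cost."""
    n = len(points)
    in_tree = [False] * n
    # best[i] = (distance to fragment, fragment vertex attaining it)
    best = [(math.dist(points[0], points[i]), 0) for i in range(n)]
    in_tree[0] = True
    edges = []
    for _ in range(n - 1):
        v = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best[i][0])
        edges.append((best[v][1], v, best[v][0]))
        in_tree[v] = True
        # Update each outside vertex's distance to the enlarged fragment.
        for i in range(n):
            if not in_tree[i]:
                d = math.dist(points[v], points[i])
                if d < best[i][0]:
                    best[i] = (d, v)
    return edges
```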
Sollin's algorithm constructs the fragments in parallel. For each fragment in turn, at any stage of the construction of the MST, determine its closest fragment. Merge these fragments, and update the list of fragments. A tree can be guaranteed in this algorithm (although care must be taken in cases of equal similarity) and our other requirements (all vertices included, minimal total edge weight) are very straightforward. Given the potential for roughly halving the data remaining to be processed at each step, not surprisingly the computational cost reduces from O(n^3) to O(n^2 log n).
The real interest of Sollin's algorithm arises when we are clustering on a graph and do not have all n(n - 1)/2 edges present. Sollin's algorithm can be shown to have computational cost O(m log n), where m is the number of edges. When m is much less than n(n - 1)/2, we have the potential for appreciable computational savings.
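This parallel fragment-merging scheme can be sketched on an explicit edge list as follows (a minimal version; names are ours, and ties in edge weight are handled by re-checking fragment membership before each merge):

```python
def sollin_mst(n, edges):
    """Sollin's (Boruvka's) algorithm on an edge list [(w, u, v), ...]:
    in each round every fragment finds its cheapest outgoing edge, and
    fragments are merged, roughly halving their number per round."""
    parent = list(range(n))
    def find(x):                      # fragment lookup with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    merged = True
    while merged:
        merged = False
        cheapest = {}                 # fragment root -> cheapest outgoing edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][0]:
                    cheapest[r] = (w, u, v)
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:              # re-check: avoids cycles on equal weights
                parent[ru] = rv
                mst.append((w, u, v))
                merged = True
    return mst

# a sparse 4-vertex graph; the MST has weight 1 + 2 + 3 = 6
edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (10, 0, 3)]
mst = sollin_mst(4, edges)
```

Each round touches every edge once, and the number of fragments at least halves per round, giving the O(m log n) behavior noted above.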
The MST in feature spaces can of course make use of the fast nearest neighbor finding methods studied earlier in this article. See Murtagh (1985, section 4.4) for various examples.
Other graph data structures which have been proposed for data analysis are related to the MST. We know, for example, that the following subset relationship holds:

MST \subseteq RNG \subseteq GG \subseteq DT

where RNG is the relative neighborhood graph, GG is the Gabriel graph, and DT is the Delaunay triangulation. The latter, in the form of its dual, the Voronoi diagram, has been used for analyzing the clustering of galaxy locations. References to these and related methods can be found in Murtagh (1993b).
8 Nearest Neighbor Finding on Graphs
Clustering on graphs may be required because we are working with (perhaps complex non-Euclidean) dissimilarities. In such cases, where we must take into account an edge between each and every pair of vertices, we will generally have an O(m) computational cost, where m is the number of edges. In a metric space we have seen that we can look for various possible ways to expedite the nearest neighbor search. An approach based on visualization, turning our data into an image, will be looked at below. However, there is another aspect of our similarity (or other) graph which we may be able to turn to our advantage. Efficient algorithms for sparse graphs are available. Sparsity can be arranged: we can threshold our edges if the sparsity does not suggest itself more naturally. A special type of sparse graph is a planar graph, i.e. a graph capable of being represented in the plane without any crossovers of edges.
For sparse graphs, algorithms with O(m log log n) computational cost were described by Yao (1975) and Cheriton and Tarjan (1976). A short algorithmic description can be found in Murtagh (1985, pp. 107-108), and we refer in particular to the latter.
The basic idea is to preprocess the graph in order to expedite the sorting of edge weights (why sorting? simply because we must repeatedly find smallest links, and maintaining a sorted list of edges is a good basis for doing this). If we were to sort all edges, the computational requirement would be O(m log m). Instead of doing that, we take the edge set associated with each and every vertex. We divide each such edge set into groups of size k. (The fact that the last such group will usually be of size < k is taken into account when programming.)
Let n_v be the number of incident edges at vertex v, such that \sum_v n_v = 2m.
The sorting operation for each vertex now takes O(k log k) operations for each group, and we have n_v / k groups. For all vertices the sorting requires a number of operations which is of the order of \sum_v n_v \log k = 2m \log k. This looks like a questionable, or small, improvement over O(m log m).
Determining the lightest edge incident on a vertex requires O(n_v / k) comparisons since we have to check all groups. Therefore the lightest edges incident on all vertices are found with O(m/k) operations.
When two vertices, and later fragments, are merged, their associated groups of edges are simply collected together, therefore keeping the total number of groups of edges which we started out with. We will bypass the issue of edges which, over time, are to be avoided because they connect vertices in the same fragment: given the fact that we are building an MST, the total number of such edges-to-be-avoided cannot surpass 2m.
To find what to merge next, again O(m/k) processing is required. Using Sollin's algorithm, the total processing required in finding what to merge next is O((m/k) log n). The total processing required for grouping the edges, and sorting within the edge-groups, is O(m log k), i.e. it is one-off and accomplished at the start of the MST-building process. With the choice k = log n, the merging work is O(m), so the one-off sorting term dominates and gives overall computational complexity as O(m log log n).
This result has been further improved to near linearity in m by Gabow et al. (1986), who develop an algorithm with complexity O(m log log log ... n), where the number of iterated log terms is bounded by m/n.
Motwani and Raghavan (1995, chapter 10) base a stochastic O(m) algorithm for the MST on random sampling to identify and eliminate edges that are guaranteed not to belong to the MST.
Let's turn our attention now to the case of a planar graph. For a planar graph we know that m <= 3n - 6 for m > 1. (For proof, see for example Tucker, 1980, or any book on graph theory.)
Referring to Sollin's algorithm, described above, O(n) operations are needed to establish a least cost edge from each vertex, since there are only O(n) edges present. On the next round, following fragment-creation, there will be at most ceil(n/2) new vertices, implying of the order of n/2 processing to find the least cost edge (where ceil is the ceiling function, or smallest integer greater than the argument). The total computational cost is seen to be proportional to n + n/2 + n/4 + ... = O(n).
So determining the MST of a planar graph is linear in the number of either vertices or edges.
Before ending this review of very efficient clustering algorithms for graphs, we note that algorithms discussed so far have assumed that the similarity graph was undirected. For modeling transport flows, or economic transfers, the graph could well be directed. Components can be defined, generalizing the clusters of the single link method, or the complete link method. Tarjan (1983) provides an algorithm for the latter agglomerative criterion which is of computational cost O(m log n).
9 K-Means and Family
The non-technical person more often than not understands clustering as a partition. K-means, looked at in this section, and the distribution mixture approach, looked at in section 10, provide solutions.
A mathematical definition of a partition implies no multiple assignments of observations to clusters, i.e. no overlapping clusters. Overlapping clusters may be faster to determine in practice, and a case in point is the one-pass algorithm described in Salton and McGill (1983). The general principle followed is: make one pass through the data, assigning each object to the first cluster which is close enough, and making a new cluster for objects that are not close enough to any existing cluster.
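This one-pass principle can be sketched in a few lines (a minimal version of ours; each cluster is represented here by its first member, so the outcome depends on the presentation order and on the threshold):

```python
import numpy as np

def one_pass_cluster(data, threshold):
    """Single pass through the data: assign each object to the first
    existing cluster whose representative is within `threshold`,
    otherwise open a new cluster for it."""
    reps, members = [], []
    for i, x in enumerate(data):
        for c, rep in enumerate(reps):
            if np.linalg.norm(x - rep) <= threshold:
                members[c].append(i)     # close enough: join this cluster
                break
        else:
            reps.append(x)               # not close to any cluster: new one
            members.append([i])
    return reps, members

# two tight pairs of points, far apart: two clusters result
data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
reps, members = one_pass_cluster(data, threshold=1.0)
```

The single pass is what makes the method cheap, and the order-dependence is the price paid, motivating the iterative refinement discussed next.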
Broder et al. (1997) use this algorithm for clustering the web. A feature vector is determined for each HTML document considered, based on sequences of words. Similarity between documents is based on an inverted list, using an approach like those described in section 5. The similarity graph is thresholded, and components sought.
Broder (1998) solves the same clustering objective using a thresholding and overlapping clustering method similar to the Salton and McGill one. The application described is that of clustering the Altavista repository in April 1996, consisting of 30 million HTML and text documents, comprising 150 GBytes of data. The number of serviceable clusters found was 1.5 million, containing 7 million documents. Processing time was about 10.5 days. An analysis of the clustering algorithm used by Broder can be found in Borodin et al. (1999), who also consider the use of approximate minimal spanning trees.
The threshold-based pass of the data, in its basic state, is susceptible to lack of robustness. A bad choice of threshold leads to too many clusters or too few. To remedy this, we can work on a well-defined data structure such as the minimal spanning tree. Or, alternatively, we can iteratively refine the clustering. Partitioning methods, such as k-means, use iterative improvement of an initial estimation of a targeted clustering.
A very widely used family of methods for inducing a partition on a data set is called k-means, which also goes under more general names (non-overlapping non-hierarchical clustering) or more specific names (minimal distance or exchange algorithms).
The usual criterion to be optimized is:

\frac{1}{|I|} \sum_{q \in Q} \sum_{i \in q} \| \vec{x}_i - \vec{q} \|^2     (1)

where I is the object set, |.| denotes cardinality, q is some cluster, and Q is the partition; q denotes a set in the summation, whereas \vec{q} denotes some associated vector in the error term, or metric norm. This criterion ensures that clusters found are compact, and therefore assumed homogeneous. The optimization criterion, by a small abuse of terminology, is often referred to as a minimum variance one.
A necessary condition that this criterion be optimized is that vector \vec{q} be a cluster mean, which for the Euclidean metric case is:

\vec{q} = \frac{1}{|q|} \sum_{i \in q} \vec{x}_i
A batch update algorithm, due to Lloyd (1957), Forgy (1965), and others, makes assignments to a set of initially randomly-chosen vectors, \vec{q}, as step 1. Step 2 updates the cluster vectors, \vec{q}. This is iterated. The distortion error, equation (1), is non-increasing, and a local minimum is achieved in a finite number of iterations.
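A minimal sketch of this batch iteration follows; for reproducibility it assumes a deterministic initialization from the first k points rather than a random one (function name ours):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20):
    """Batch (Lloyd-Forgy) k-means: step 1 assigns each point to its
    nearest cluster vector; step 2 replaces each cluster vector by the
    cluster mean.  The distortion criterion is non-increasing."""
    centers = X[:k].astype(float).copy()   # deterministic init for the sketch
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # step 1: assignment to the nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 2: recompute cluster means (keep old center if cluster empties)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# two well-separated pairs of points
X = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 1.0], [10.0, 1.0]])
centers, labels = lloyd_kmeans(X, k=2)
```

The empty-cluster guard in step 2 reflects the difficulty, discussed below, that clusters may become and stay empty under batch updating.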
An online update algorithm is due to MacQueen (1967). After each presentation of an observation vector, \vec{x}, the closest cluster vector, \vec{q}, is updated to take account of it. Such an approach is well-suited for a continuous input data stream (implying "online" learning of cluster vectors).
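The online update itself is a one-line recurrence: with a 1/(count+1) learning rate, the winning cluster vector remains the running mean of the observations it has absorbed. A minimal single-cluster illustration (names ours):

```python
import numpy as np

def online_update(center, x, count):
    """MacQueen-style online update: move the cluster vector toward
    observation x with learning rate 1/(count+1), so the vector stays
    the running mean of the observations assigned to it."""
    count += 1
    center = center + (x - center) / count
    return center, count

# the center tracks the running mean of the stream 1, 2, 3
center, count = np.array([0.0]), 0
for x in [1.0, 2.0, 3.0]:
    center, count = online_update(center, np.array([x]), count)
```

In a full clustering loop the update would be applied only to the cluster vector closest to each incoming observation.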
Both algorithms are gradient descent ones. In the online case, much attention has been devoted to best learning rate schedules in the neural network (competitive learning) literature: Darken and Moody (1991, 1992), Darken et al. (1992), Fritzke (1997).
A difficulty, less controllable in the case of the batch algorithm, is that clusters may become (and stay) empty. This may be acceptable, but also may be in breach of our original problem formulation. An alternative to the batch update algorithm is Spath's (1985) exchange algorithm. Each observation is considered for possible assignment into any of the other clusters. Updating and "downdating" formulas are given by Spath. This exchange algorithm is stated to be faster to converge and to produce better (smaller) values of the objective function. Over decades of use, we have also verified that it is a superior algorithm to the minimal distance one.
K-means is very closely related to Voronoi (Dirichlet) tessellations, to Kohonen self-organizing feature maps, and various other methods.
The batch learning algorithm above may be viewed as:
1. An assignment step which we will term the E (estimation) step: estimate the posteriors, P(observations | cluster centres);
2. A cluster update step, the M (maximization) step, which maximizes a cluster center likelihood.
Neal and Hinton (1998) cast the k-means optimization problem in such a way that both E- and M-steps monotonically increase the maximand's values. The EM algorithm may, too, be enhanced to allow for online as well as batch learning (Sato and Ishii, 1999).
In Thiesson et al. (1999), k-means is implemented (i) by traversing blocks of data, cyclically, and incrementally updating the sufficient statistics and parameters, and (ii) by sampling from subsets of the data instead of cyclic traversal. Such an approach is admirably suited for very large data sets, where in-memory storage is not feasible. Examples used by Thiesson et al. (1999) include the clustering of a half million 300-dimensional records.
10 Fast Model-Based Clustering

It is traditional to note that models and (computational) speed don't mix. We review recent progress in this section.
10.1 Modeling of Signal and Noise
A simple and applicable model is a distribution mixture, with the signal modeled by Gaussians, in the presence of Poisson background noise.
Consider data which are generated by a mixture of (G - 1) bivariate Gaussian densities, f_k(x; \theta) \sim N(\mu_k, \Sigma_k), for clusters k = 2, ..., G, and with Poisson background noise corresponding to k = 1. The overall population thus has the mixture density

f(x; \theta) = \sum_{k=1}^{G} \pi_k f_k(x; \theta)

where the mixing or prior probabilities, \pi_k, sum to 1, and f_1(x; \theta) = A^{-1}, where A is the area of the data region. This is the basis for model-based clustering (Banfield and Raftery, 1993, Dasgupta and Raftery, 1998, Murtagh and Raftery, 1984, Banerjee and Rosenfeld, 1993).
The parameters, \theta and \pi, can be estimated efficiently by maximizing the mixture likelihood

L(\theta, \pi) = \prod_{i=1}^{n} f(x_i; \theta),

with respect to \theta and \pi, where x_i is the i-th observation.
Now let us assume the presence of two clusters, one of which is Poisson noise, the other Gaussian. This yields the mixture likelihood

L(\theta, \pi) = \prod_{i=1}^{n} \left[ \pi_1 A^{-1} + \pi_2 \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left( -\frac{1}{2} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right) \right],

where \pi_1 + \pi_2 = 1.
An iterative solution is provided by the expectation-maximization (EM) algorithm of Dempster et al. (1977). We have already noted this algorithm in informal terms in the last section, dealing with k-means. Let the "complete" (or "clean" or "output") data be y_i = (x_i, z_i) with indicator set z_i = (z_{i1}, z_{i2}) given by (1, 0) or (0, 1). Vector z_i has a multinomial distribution with parameters (1; \pi_1, \pi_2). This leads to the complete data log-likelihood:

l(y, z; \theta, \pi) = \sum_{i=1}^{n} \sum_{k=1}^{2} z_{ik} [\log \pi_k + \log f_k(x_i; \theta)]
The E-step then computes \hat{z}_{ik} = E(z_{ik} | x_1, ..., x_n, \theta), i.e. the posterior probability that x_i is in cluster k. The M-step involves maximization of the expected complete data log-likelihood:

l^*(y; \theta, \pi) = \sum_{i=1}^{n} \sum_{k=1}^{2} \hat{z}_{ik} [\log \pi_k + \log f_k(x_i; \theta)].

The E- and M-steps are iterated until convergence.
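As an illustration of these E- and M-steps, here is a sketch of EM for a one-dimensional analogue of the above: one Gaussian cluster plus a uniform background density 1/A over a region of length A (the article's case is bivariate with area A; the function and variable names, and the toy data, are ours):

```python
import numpy as np

def em_gaussian_plus_background(x, A, n_iter=50):
    """EM for a 1-d two-component mixture: uniform background density
    1/A over a region of length A (k=1), plus one Gaussian cluster (k=2).
    E-step: posterior memberships; M-step: update pi, mu, var."""
    pi, mu, var = np.array([0.5, 0.5]), x.mean(), x.var()
    for _ in range(n_iter):
        # E-step: posterior probability that each point is in the Gaussian
        f1 = np.full_like(x, 1.0 / A)
        f2 = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        z = pi[1] * f2 / (pi[0] * f1 + pi[1] * f2)
        # M-step: maximize the expected complete-data log-likelihood
        pi = np.array([1 - z.mean(), z.mean()])
        mu = (z * x).sum() / z.sum()
        var = (z * (x - mu) ** 2).sum() / z.sum()
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(0.0, 100.0, 200),   # background, A = 100
                    rng.normal(50.0, 1.0, 200)])    # one Gaussian cluster
pi, mu, var = em_gaussian_plus_background(x, A=100.0)
```

The posterior memberships z are the \hat{z}_{ik} of the E-step, and the weighted means in the M-step are exactly the maximizers of l^* for this model.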
For the 2-class case (Poisson noise and a Gaussian cluster), the complete-data likelihood is

L(y, z; \theta, \pi) = \prod_{i=1}^{n} \left[ \frac{1}{A} \right]^{z_{i1}} \left[ \frac{\pi_2}{2\pi\sqrt{|\Sigma|}} \exp\left( -\frac{1}{2} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right) \right]^{z_{i2}}

The corresponding expected log-likelihood is then used in the EM algorithm. This formulation of the problem generalizes to the case of G clusters, of arbitrary distributions and dimensions.
In order to assess the evidence for the presence of a signal-cluster, we use the Bayes factor for the mixture model, M_2, that includes a Gaussian density as well as background noise, against the "null" model, M_1, that contains only background noise. The Bayes factor is the posterior odds for the mixture model against the pure noise model, when neither is favored a priori. It is defined as B = p(x | M_2) / p(x | M_1), where p(x | M_2) is the integrated likelihood of the mixture model M_2, obtained by integrating over the parameter space. For a general review of Bayes factors, their use in applied statistics, and how to approximate and compute them, see Kass and Raftery (1995).
We approximate the Bayes factor using the Bayesian Information Criterion (BIC) (Schwartz, 1978). For a Gaussian cluster and Poisson noise, this takes the form:

2 \log B \approx \mathrm{BIC} = 2 \log L(\hat{\theta}, \hat{\pi}) + 2n \log A - 6 \log n,

where \hat{\theta} and \hat{\pi} are the maximum likelihood estimators of \theta and \pi, and L(\hat{\theta}, \hat{\pi}) is the maximized mixture likelihood.
A review of the use of the BIC criterion for model selection, and more specifically for choosing the number of clusters in a data set, can be found in Fraley and Raftery (1998).
An application of mixture modeling and the BIC criterion to gamma-ray burst data can be found in Mukherjee et al. (1998). So far around 800 observations have been assessed, but as greater numbers become available we will find the inherent number of clusters in a similar way, in order to try to understand more about the complex phenomenon of gamma-ray bursts.
10.2 Application to Thresholding
Consider an image, or a planar or 3-dimensional set of object positions. For simplicity we consider the case of setting a single threshold in the image intensities, or the point set's spatial density.
We deal with a combined mixture density of two univariate Gaussian distributions f_k(x; \theta) \sim N(\mu_k, \sigma_k). The overall population thus has the mixture density

f(x; \theta) = \sum_{k=1}^{2} \pi_k f_k(x; \theta)

where the mixing or prior probabilities, \pi_k, sum to 1.
When the mixing proportions are assumed equal, the log-likelihood takes the form

l(\theta) = \sum_{i=1}^{n} \ln \left[ \sum_{k=1}^{2} \frac{1}{\sqrt{2\pi\sigma_k}} \exp\left( -\frac{1}{2\sigma_k} (x_i - \mu_k)^2 \right) \right]

with \sigma_k here denoting the variance of component k.
The EM algorithm is then used to iteratively solve this (see Celeux and Govaert, 1995). This method is used for appraisals of textile (jeans and other fabrics) fault detection in Campbell et al. (1999). Industrial vision inspection systems potentially produce large data streams, and fault detection can be a good application for fast clustering methods. We are currently using a mixture model of this sort on SEM (scanning electron microscope) images of cross-sections of concrete to allow for subsequent characterization of physical properties.
Image segmentation, per se, is a relatively straightforward application, but there are novel and interesting aspects to the two studies mentioned. In the textile case, the faults are very often perceptual and relative, rather than "absolute" or capable of being analyzed in isolation. In the SEM imaging case, a first phase of processing is applied to de-speckle the images, using multiple resolution noise filtering.
Turning from concrete to cosmology, the Sloan Digital Sky Survey (SDSS, 1999) is producing a sky map of more than 100 million objects, together with 3-dimensional information (redshifts) for a million galaxies. Pelleg and Moore (1999) describe mixture modeling, using a k-D tree preprocessing to expedite the finding of the class (mixture) parameters, e.g. means, covariances.
11 Noise Modeling

In Starck et al. (1998) and in a wide range of papers, we have pursued an approach for the noise modeling of observed data. A multiple resolution scale vision model, or data generation process, is used, to allow for the phenomenon being observed on different scales. In addition, a wide range of options are permitted for the data generation transfer path, including additive and multiplicative, stationary and non-stationary, Gaussian ("readout" noise), Poisson (random shot noise), and so on.
Given point pattern clustering in two- or three-dimensional spaces, we will limit our overview here to the Poisson noise case.
11.1 Poisson noise with few events using the à trous transform
If a wavelet coefficient w_j(x, y) is due to noise, it can be considered as a realization of the sum \sum_{k \in K} n_k of independent random variables with the same distribution as that of the wavelet function (n_k being the number of events used for the calculation of w_j(x, y)). This allows comparison of the wavelet coefficients of the data with the values which can be taken by the sum of n independent variables.
The distribution of one event in wavelet space is then directly given by the histogram H_1 of the wavelet function. As we consider independent events, the distribution of a coefficient w_n (note the changed subscripting for w, for convenience) related to n events is given by n autoconvolutions of H_1:

H_n = H_1 \star H_1 \star \cdots \star H_1

For a large number of events, H_n converges to a Gaussian.
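The autoconvolution H_n can be computed directly by repeated discrete convolution; a toy sketch (the example single-event histogram is ours):

```python
import numpy as np

def autoconvolve(h1, n):
    """Distribution of the sum of n i.i.d. variables with pmf h1,
    obtained by n-1 discrete convolutions of h1 with itself."""
    h = h1
    for _ in range(n - 1):
        h = np.convolve(h, h1)
    return h

# a toy 'one event' histogram on 3 bins
h1 = np.array([0.25, 0.5, 0.25])
h8 = autoconvolve(h1, 8)        # distribution for n = 8 events
```

The result remains a valid distribution (it sums to 1), and as n grows the shape approaches the Gaussian limit noted above.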
Fig. 8 shows an example of where point pattern clusters (density bumps in this case) are sought, with a great amount of background clutter. Murtagh and Starck (1998) refer to the fact that there is no computational dependence on the number of points (signal or noise) in such a problem, when using a wavelet transform with noise modeling.
Some other alternative approaches will be briefly noted. The Haar transform presents the advantage of its simplicity for modeling Poisson noise. Analytic formulas for wavelet coefficient distributions have been derived by Kolaczyk (1997), and Jammal and Bijaoui (1999). Using a new wavelet transform, the Haar à trous transform, Zheng et al. (1999) appraise a denoising approach for financial data streams, an important preliminary step for subsequent clustering, forecasting, or other processing.
11.2 Poisson noise with nearest neighbor clutter removal
The wavelet approach is certainly appropriate when the wavelet function reflects the type of object sought (e.g. isotropic), and when superimposed point patterns are to be analyzed. However, non-superimposed point patterns of complex shape are very well treated by the approach described in Byers and Raftery (1998). Using a homogeneous Poisson noise model, they derive the distribution of the distance of a point to its kth nearest neighbor.
Next, Byers and Raftery (1998) consider the case of a Poisson process which is signal, superimposed on a Poisson process which is clutter. The kth nearest neighbor distances are modeled as a mixture distribution: a histogram of these, for given k, will yield a bimodal distribution if our assumption is correct. This mixture distribution problem is solved using the EM algorithm. Generalization to higher dimensions, e.g. 10, is also discussed.
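A sketch of this scheme, under our own simplifying assumptions (Euclidean data, and a two-component Gaussian rather than the exact kth-NN distance distributions): compute each point's kth nearest neighbor distance, fit a two-component univariate mixture to these distances by EM, and keep the points assigned to the small-distance component. All names and the toy data are ours:

```python
import numpy as np

def kth_nn_dist(X, k):
    """Distance from each point to its k-th nearest neighbor."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    d.sort(axis=1)
    return d[:, k]                  # column 0 is the zero self-distance

def two_gaussian_em(v, n_iter=100):
    """Fit a two-component univariate Gaussian mixture to v by EM;
    return the posterior probability of the low-mean component."""
    mu = np.array([v.min(), v.max()])
    var = np.array([v.var(), v.var()]) + 1e-12
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        f = np.exp(-0.5 * (v[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        z = pi * f
        z /= z.sum(axis=1, keepdims=True)   # posterior memberships
        pi = z.mean(axis=0)
        mu = (z * v[:, None]).sum(axis=0) / z.sum(axis=0)
        var = (z * (v[:, None] - mu) ** 2).sum(axis=0) / z.sum(axis=0) + 1e-12
    return z[:, 0]

rng = np.random.default_rng(2)
signal = rng.normal(0.0, 0.05, (100, 2))     # one dense cluster
clutter = rng.uniform(-5.0, 5.0, (100, 2))   # homogeneous Poisson clutter
X = np.vstack([signal, clutter])
p_signal = two_gaussian_em(kth_nn_dist(X, k=5))
keep = p_signal > 0.5
```

Signal points sit in dense regions and so have small kth-NN distances; clutter points have large ones, giving the bimodal histogram that the mixture fit separates.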
Similar data was analyzed by noise modeling and a Voronoi tessellation preprocessing of the data in Allard and Fraley (1997). It is pointed out there how this can be a very useful approach where the Voronoi tiles have meaning in relation to the morphology of the point patterns. However, it does not scale well to higher dimensions, and the statistical noise modeling is approximate. Voronoi tessellation has also been used for astronomical X-ray object detection and characterization.
12 Cluster-Based User Interfaces
Information retrieval by means of "semantic road maps" was first detailed by Doyle (1961). The spatial metaphor is a powerful one in human information processing. The spatial metaphor also lends itself well to modern distributed computing environments such as the web. The Kohonen self-organizing feature map (SOM) method is an effective means towards this end of a visual information retrieval user interface. We will also provide an illustration of web-based semantic maps based on hyperlink clustering.
The Kohonen map is, at heart, k-means clustering with the additional constraint that cluster centers be located on a regular grid (or some other topographic structure) and furthermore that their location on the grid be monotonically related to pairwise proximity (Murtagh and Hernandez-Pajares, 1995). The nice thing about a regular grid output representation space is that it lends itself well as a visual user interface.
Fig. 9 shows a visual and interactive user interface map, using a Kohonen self-organizing feature map (SOM). Color is related to density of document clusters located at regularly-spaced nodes of the map, and some of these nodes/clusters are annotated. The map is installed as a clickable imagemap, with CGI programs accessing lists of documents and, through further links, in many cases the full documents. In the example shown, the user has queried a node and results are seen in the right-hand panel. Such maps are maintained for (currently) 12000 articles from the Astrophysical Journal, 7000 from Astronomy and Astrophysics, over 2000 astronomical catalogs, and other data holdings. More information on the design of this visual interface and user assessment can be found in Poincot et al. (1998, 1999).
Guillaume (Guillaume and Murtagh, 1999) developed a Java-based visualization tool for hyperlink-based data, consisting of astronomers, astronomical object names, article titles, and with the possibility of other objects (images, tables, etc.). Through weighting, the various types of links could be prioritized. An iterative refinement algorithm was developed to map the nodes (objects) to a regular grid of cells, which, as for the Kohonen SOM map, are clickable and provide access to the data represented by the cluster. Fig. 10 shows an example for an astronomer (Prof. Jean Heyvaerts, Strasbourg Astronomical Observatory).
These new cluster-based visual user interfaces are not computationally demanding. They are not, however, scalable in their current implementation. Document management (see e.g. Cartia, 1999) is less the motivation here than the interactive user interface.
13 Images from Data
It is quite impressive how 2D (or 3D) image signals can handle with ease the scalability limitations of clustering and many other data processing operations. The contiguity imposed on adjacent pixels bypasses the need for nearest neighbor finding. It is very interesting therefore to consider the feasibility of taking problems of clustering massive data sets into the 2D image domain. We will look at a few recent examples of work in this direction.
Church and Helfman (1993) address the problem of visualizing possibly millions of lines of computer program code, or text. They consider an approach borrowed from DNA sequence analysis. The data sequence is tokenized by splitting it into its atoms (line, word, character, etc.) and then placing a dot at position (i, j) if the ith input token is the same as the jth. The resulting dotplot, it is argued, is not limited by the available display screen space, and can lead to discovery of large-scale structure in the data.
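The dotplot construction itself is a short comprehension over tokens; a toy sketch (pure Python, names and example ours):

```python
def dotplot(tokens):
    """Place a dot at (i, j) whenever token i equals token j; repeated
    lines or words then show up as diagonals and blocks."""
    n = len(tokens)
    return [[1 if tokens[i] == tokens[j] else 0 for j in range(n)]
            for i in range(n)]

# word-level tokenization; 'the' recurs, giving off-diagonal dots
tokens = "the cat sat on the mat".split()
plot = dotplot(tokens)
```

For millions of tokens one would of course store only the nonzero dots (a sparse representation) rather than the full n x n array.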
When data do not have a sequence, we have an invariance problem, which can be resolved by finding some row and column permutation which pulls large array values together, perhaps towards the diagonal. Gathering larger (or nonzero) array elements to the diagonal can be viewed in terms of minimizing the envelope of nonzero values relative to the diagonal. This can be formulated and solved in purely symbolic terms by reordering vertices in a suitable graph representation of the matrix. A widely-used method for symmetric sparse matrices is the Reverse Cuthill-McKee (RCM) method.
The complexity of the RCM method for ordering rows or columns is proportional to the product of the maximum degree of any vertex in the graph representing the array values and the total number of edges (nonzeroes in the matrix). For hypertext matrices with small maximum degree, the method would be extremely fast. The strength of the method is its low time complexity, but it does suffer from certain drawbacks. The heuristic for finding the starting vertex is influenced by the initial numbering of vertices, and so the quality of the reordering can vary slightly for the same problem for different initial numberings. Next, the overall method does not accommodate dense rows (e.g., a common link used in every document), and if a row has a significantly large number of nonzeroes it might be best to process it separately; i.e., extract the dense rows, reorder the remaining matrix and augment it by the dense rows (or common links) numbered last. Elapsed CPU times for a range of arrays and permuting methods are given in Berry et al. (1996), and as an indication show performances between 0.025 and 3.18 seconds for permuting a 4000 x 400 array. A review of public domain software for carrying out SVD and other linear algebra operations on large sparse data sets can be found in Berry et al. (1999, section 8.3).
Once we have a sequence-respecting array, we can immediately apply efficient visualization techniques from image analysis. Murtagh et al. (2000) investigate the use of noise filtering (i.e. to remove less useful array entries) using a multiscale wavelet transform approach.
An example follows. From the Concise Columbia Encyclopedia (1989 2nd ed., online version) a set of data relating to 12025 encyclopedia entries and to 9778 cross-references or links was used. Fig. 11 shows a 500 x 450 subarray, based on a correspondence analysis (i.e. ordering of projections on the first factor).
This part of the encyclopedia data was filtered using the wavelet and noise-modeling methodology described in Murtagh et al. (2000), and the outcome is shown in Fig. 12. Overall the recovery of the more apparent alignments, and hence visually stronger clusters, is excellent. The first relatively long "horizontal bar" was selected; it corresponds to column index (link) 1733 = geological era. The corresponding row indices (articles) are, in sequence:
SILURIAN PERIOD
PLEISTOCENE EPOCH
HOLOCENE EPOCH
PRECAMBRIAN TIME
CARBONIFEROUS PERIOD
OLIGOCENE EPOCH
ORDOVICIAN PERIOD
TRIASSIC PERIOD
CENOZOIC ERA
PALEOCENE EPOCH
MIOCENE EPOCH
DEVONIAN PERIOD
PALEOZOIC ERA
JURASSIC PERIOD
MESOZOIC ERA
CAMBRIAN PERIOD
PLIOCENE EPOCH
CRETACEOUS PERIOD
The work described here is based on a number of technologies: (i) data visualization techniques; (ii) the wavelet transform, which has linear computational cost in terms of image row and column dimensions, and is independent of the pixel values.
14 Conclusion
Viewed from a commercial or managerial perspective, one could justifiably ask where we are now in our understanding of problems in this area, relative to where we were back in the 1960s. Depending on our answer to this, we may well proceed to a second question: why have all important problems not been solved by now in this area, i.e. are there major outstanding problems to be solved?
As described in this chapter, a solid body of experimental and theoretical results has been built up over the last few decades. Clustering remains a requirement which is a central infrastructural element of very many application fields.
There is continual renewal of the essential questions and problems of clustering, relating to new data, new information, and new environments. There is no logjam in clustering research and development simply because the rivers of problems continue to broaden and deepen. Clustering and classification remain quintessential issues in our computing and information technology environments (Murtagh, 1998).
Acknowledgements
Some of this work, in particular in sections 9 to 12, represents various collaborations with the following: J.L. Starck, CEA; A. Raftery, University of Washington; C. Fraley, University of Washington and MathSoft Inc.; D. Washington, MathSoft, Inc.; Ph. Poincot and S. Lesteven, Strasbourg Observatory; D. Guillaume, Strasbourg Observatory, University of Illinois and NCSA.
References
1. Allard, D. and Fraley, C., "Non-parametric maximum likelihood estimation of features in spatial point processes using Voronoi tessellation", Journal of the American Statistical Association, 92, 1485-1493, 1997.
2. Arabie, P. and Hubert, L.J., "An overview of combinatorial data analysis", in Arabie, P., Hubert, L.J. and De Soete, G., Eds., Clustering and Classification, World Scientific, Singapore, 5-63, 1996.
3. Arabie, P., Hubert, L.J. and De Soete, G., Eds., Clustering and Classification, World Scientific, Singapore, 1996.
4. Banerjee, S. and Rosenfeld, A., "Model-based cluster analysis", Pattern Recognition, 26, 963-974, 1993.
5. Banfield, J.D. and Raftery, A.E., "Model-based Gaussian and non-Gaussian clustering", Biometrics, 49, 803-821, 1993.
6. Bennett, K.P., Fayyad, U. and Geiger, D., "Density-based indexing for approximate nearest neighbor queries", Microsoft Research Technical Report MSR-TR-98-58, 1999.
7. Bentley, J.L. and Friedman, J.H., "Fast algorithms for constructing minimal spanning trees in coordinate spaces", IEEE Transactions on Computers, C-27, 97-105, 1978.
8. Bentley, J.L., Weide, B.W. and Yao, A.C., "Optimal expected time algorithms for closest point problems", ACM Transactions on Mathematical Software, 6, 563-580, 1980.
browsing hypertext", in Renegar, J., Shub, M. and Smale, S., Eds., Lectures in Applied Mathematics (LAM) Vol. 32: The Mathematics of Numerical Analysis, American Mathematical Society, 99-123, 1996.
10. Berry, M.W., Drmac, Z. and Jessup, E.R., "Matrices, vector spaces, and information retrieval", SIAM Review, 41, 335-362, 1999.
11. Beyer, K., Goldstein, J., Ramakrishnan, R. and Shaft, U., "When is nearest neighbor meaningful?", in Proceedings of the 7th International Conference on Database Theory (ICDT), Jerusalem, Israel, 1999, in press.
12. Borodin, A., Ostrovsky, R. and Rabani, Y., "Subquadratic approximation algorithms for clustering problems in high dimensional spaces", Proc. 31st ACM Symposium on Theory of Computing (STOC-99), 1999.
13. Broder, A.J., "Strategies for efficient incremental nearest neighbor search", Pattern Recognition, 23, 171-178, 1990.
14. Broder, A.Z., Glassman, S.C., Manasse, M.S. and Zweig, G., "Syntactic clustering of the web", Proc. Sixth International World Wide Web Conference, 391-404, 1997.
15. Broder, A.Z., "On the resemblance and containment of documents", in Compression and Complexity of Sequences (SEQUENCES'97), 21-29, IEEE Computer Society, 1998.
16. Bruynooghe, M., "Méthodes nouvelles en classification automatique des données taxinomiques nombreuses", Statistique et Analyse des Données, no. 3, 24-42, 1977.
17. Burkhard, W.A. and Keller, R.M., "Some approaches to best-match file searching", Communications of the ACM, 16, 230-236, 1973.
18. Byers, S.D. and Raftery, A.E., "Nearest neighbor clutter removal for estimating features in spatial point processes", Journal of the American Statistical Association, 93, 577-584, 1998.
19. Campbell, J.G., Fraley, C., Stanford, D., Murtagh, F. and Raftery, A.E., "Model-based methods for textile fault detection", International Journal of Imaging Science and Technology, 10, 339-346, 1999.
20. Cartia, Inc., Mapping the Information Landscape, client-server software system, http://www.cartia.com, 1999.
21. Celeux, G. and Govaert, G., "Gaussian parsimonious clustering models", Pattern Recognition, 28, 781-793, 1995.
22. Cheriton, D. and Tarjan, R.E., "Finding minimum spanning trees", SIAM Journal on Computing, 5, 724-742, 1976.
23. Church, K.W. and Helfman, J.I., "Dotplot: a program for exploring self-similarity in millions of lines of text and code", Journal of Computational and Graphical Statistics, 2, 153-174, 1993.
24. Croft, W.B., "Clustering large files of documents using the single-link method", Journal of the American Society for Information Science, 28, 341-344, 1977.
25. Darken, C. and Moody, J., "Note on learning rate schedules for stochastic optimization", Advances in Neural Information Processing Systems 3, Lippmann, Moody and Touretzky, Eds., Morgan Kaufmann, Palo Alto, 1991.
search", Neural Networks for Signal Processing 2, Proceedings of the 1992 IEEE Workshop, IEEE Press, Piscataway, 1992.
27. Darken, C. and Moody, J., "Towards faster stochastic gradient search", Advances in Neural Information Processing Systems 4, Moody, Hanson and Lippmann, Eds., Morgan Kaufmann, San Mateo, 1992.
28. Dasarathy, B.V., Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, New York, 1991.
29. Dasgupta, A. and Raftery, A.E., "Detecting features in spatial point processes with clutter via model-based clustering", Journal of the American Statistical Association, 93, 294-302, 1998.
30. Day, W.H.E. and Edelsbrunner, H., "Efficient algorithms for agglomerative hierarchical clustering methods", Journal of Classification, 1, 7-24, 1984.
31. Defays, D., "An efficient algorithm for a complete link method", Computer Journal, 20, 364-366, 1977.
32. Delannoy, C., "Un algorithme rapide de recherche de plus proches voisins", RAIRO Informatique/Computer Science, 14, 275-286, 1980.
33. Dempster, A.P., Laird, N.M. and Rubin, D.B., "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society, Series B, 39, 1-22, 1977.
34. Dobrzycki, A., Ebeling, H., Glotfelty, K., Freeman, P., Damiani, F., Elvis, M. and Calderwood, T., Chandra Detect 1.0 User Guide, Chandra X-Ray Center, Smithsonian Astrophysical Observatory, Version 0.9, 1999.
35. Doyle, L.B., "Semantic road maps for literature searchers", Journal of the ACM, 8, 553-578, 1961.
36. Eastman, C.M. and Weiss, S.F., "Tree structures for high dimensionality nearest neighbor searching", Information Systems, 7, 115-122, 1982.
37. Ebeling, H. and Wiedenmann, G., "Detecting structure in two dimensions combining Voronoi tessellation and percolation", Physical Review E, 47, 704-714, 1993.
38. Forgy, E., "Cluster analysis of multivariate data: efficiency vs. interpretability of classifications", Biometrics, 21, 768, 1965.
39. Fraley, C., "Algorithms for model-based Gaussian hierarchical clustering", SIAM Journal of Scientific Computing, 20, 270-281, 1999.
40. Fraley, C. and Raftery, A.E., "How many clusters? Which clustering method? Answers via model-based cluster analysis", The Computer Journal, 41, 578-588, 1998.
41. Friedman, J.H., Baskett, F. and Shustek, L.J., "An algorithm for finding nearest neighbors", IEEE Transactions on Computers, C-24, 1000-1006, 1975.
42. Friedman, J.H., Bentley, J.L. and Finkel, R.A., "An algorithm for finding best matches in logarithmic expected time", ACM Transactions on Mathematical Software, 3, 209-226, 1977.
43. Fritzke, B., "Some competitive learning methods", http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/JavaPaper