Fionn Murtagh
School of Computer Science
The Queen's University of Belfast
Belfast BT7 1NN, Northern Ireland
f.murtagh@qub.ac.uk
July 10, 2000
Abstract

We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Theoretical results developed as far back as the 1960s still very often remain topical. More recent work is also covered in this article. This includes a solution for the statistical question of how many clusters there are in a data set. We also look at one line of inquiry in the use of clustering for human-computer user interfaces. Finally, the visualization of data leads to the consideration of data arrays as images, and we speculate on future results to be expected here.
`Now,' said Rabbit, `this is a Search, and I've Organised it --'
`Done what to it?' said Pooh.
`Organised it. Which means -- well, it's what you do to a Search, when you don't all look in the same place at once.'
A.A. Milne, The House at Pooh Corner (1928)
Topics:

1. Introduction
2. Binning or Bucketing
3. Multidimensional Binary Search or kD Tree
4. Projections and Other Bounds
5. The Special Case of Sparse Binary Data
6. Fast Approximate Nearest Neighbor Finding
7. Hierarchical Agglomerative Clustering
8. Graph Clustering
9. Nearest Neighbor Finding on Graphs
10. K-Means and Family
11. Fast Model-Based Clustering
12. Noise Modeling
14. Images from Data
15. Conclusion
1 Introduction
Nearest neighbor searching is considered in sections 2 to 6 for one main reason: its utility for the clustering algorithms reviewed in sections 7 to 10 and, partially, 11. Nearest neighbor searches are the building blocks for the most efficient implementations of hierarchical clustering algorithms, and they can be used to speed up other families of clustering algorithms.

The best match or nearest neighbor problem is important in many disciplines. In statistics, k-nearest neighbors, where k can be 1, 2, etc., is a method of non-parametric discriminant analysis. In pattern recognition, this is a widely-used method for supervised classification (see Dasarathy, 1991). Nearest neighbor algorithms are the building block of clustering algorithms based on nearest neighbor chains, and of effective heuristic solutions for combinatorial optimization problems such as the traveling salesman problem, which is a paradigmatic problem in many areas. In the database and, more particularly, data mining field, NN searching is called similarity query, or similarity join (Bennett et al., 1999).
In section 2, we begin with data structures where the objective is to break the O(n) barrier for determining the nearest neighbor (NN) of a point. A database record or tuple may be taken as a point in a space of dimensionality m, the latter being the associated number of fields or attributes. These approaches have been very successful, but they are restricted to low-dimensional NN-searching. For higher dimensional data, a wide range of bounding approaches have been proposed, which remain O(n) algorithms but with a low constant of proportionality.

We assume familiarity with basic notions of similarity and distance, the triangular inequality, ultrametric spaces, Jaccard and other coefficients, and normalization and standardization. For an implicit treatment of data theory and data coding, see Murtagh and Heck (1987). Useful background reading can be found in Arabie et al. (1996). In particular, output representational models include discrete structures, e.g. rooted labeled trees or dendrograms, and spatial structures (Arabie and Hubert, 1996), with many hybrids.

Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for the clustering algorithms to follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
2 Binning or Bucketing
In this approach to NN-searching, a preprocessing stage precedes the searching stage. All points are mapped onto indexed cellular regions of space, so that NNs are found in the same or in closely adjacent cells. Taking the plane as an example, and considering points (x_i, y_i), the maximum and minimum values on all coordinates are obtained (e.g. (x_j^min, y_j^min)). Consider the mapping (Fig. 1)

    x_ij → ⌊(x_ij − x_j^min) / r⌋

where the constant r is chosen in terms of the number of equally spaced categories into which the interval [x_j^min, x_j^max] is to be divided. This gives to x_i an integer value between 0 and ⌊(x_j^max − x_j^min) / r⌋ for each attribute j. O(nm) time is required to obtain the transformation of all n points, and the result may be stored as a linked list, with a pointer from each cell identifier to the set of points mapped onto that cell.
Figure 1: Example of simple binning in the plane, with (x_min, y_min) = (0, 0), (x_max, y_max) = (50, 40) and r = 10. Point (22, 32) is mapped onto cell (2, 3); point (8, 13) is mapped onto cell (0, 1).
NN-searching begins by finding the closest point among those which have been mapped onto the same grid cell as the target point. This gives a current NN point. A closer point may be mapped onto some other grid cell if the distance between the target point and the current NN point is greater than the distance between the target point and any of the boundaries of the cell containing it. Some further implementation details can be found in Murtagh (1993a).
A powerful theoretical result regarding this approach is as follows. For uniformly distributed points, the NN of a point is found in O(1), or constant, expected time (see Delannoy, 1980, or Bentley et al., 1980, for proof). Therefore this approach will work well if approximate uniformity can be assumed, or if the data can be broken down into regions of approximately uniformly distributed points.
Simple Fortran code for this approach is listed, and discussed, in Schreiber (1993). The search through adjacent cells requires time which increases exponentially with dimensionality (if it is assumed that the number of points assigned to each cell is approximately equal). As a result, this approach is suitable for low dimensions only. Rohlf (1978) reports on work in dimensions 2, 3, and 4; and Murtagh (1983) in the plane. Rohlf also mentions the use of the first 3 principal components to approximate a set of points in 15-dimensional space.

From the constant expected time NN search result, particular hierarchical agglomerative clustering methods can be shown to be of linear expected time, O(n) (Murtagh, 1983). The expected time complexity for Ward's minimum variance method is given as O(n log n). Results on the hierarchical clustering of up to 12,000 points are discussed.

The limitation on these very appealing computational complexity results is that they are only really feasible for data in the plane. Bellman's curse of dimensionality manifests itself here as always. For dimensions greater than 2 or 3, we proceed to the situation where a binary search tree can provide us with a good preprocessing of our data.
3 Multidimensional Binary Search or kD Tree
A multidimensional binary search tree (MDBST) preprocesses the data to be searched through by two-way subdivision, and subdivisions continue until some prespecified number of data points is arrived at. See the example in Fig. 2. We associate with each node of the decision tree the definition of a subdivision of the data only, and we associate with each terminal node a pointer to the stored coordinates of the points. Using the approximate median of projections keeps the tree balanced, and consequently gives O(log n) levels, at each of which O(n) processing is required. Hence the construction of the tree takes O(n log n) time.

Figure 2: An MDBST using planar data.
The search for a NN then proceeds by a top-down traversal of the tree. The target point is transmitted through successive levels of the tree using the defined separation of the two child nodes at each node. On arrival at a terminal node, all associated points are examined and a current NN selected. The tree is then backtracked: if the points associated with any node could furnish a closer point, then subnodes must be checked out.

The approximately constant number of points associated with terminal nodes (hyper-rectangular cells in the space of points) should be greater than 1, in order that some NNs may be obtained without requiring a search of adjacent cells (other terminal nodes). Friedman et al. (1977) suggest a value of the number of points per bin of between 4 and 32, based on empirical study.
The MDBST approach only works well with small dimensions. To see this, consider each coordinate being used once and once only for the subdivision of points, i.e. each attribute is considered equally useful. Let there be p levels in the tree, i.e. 2^p terminal nodes. Each terminal node contains approximately c points by construction, and so c 2^p = n. Therefore p = log_2 (n/c). As sample values, if n = 32768 and c = 32, then p = 10. I.e. in 10-dimensional space, even using a large number of points associated with terminal nodes, more than 30000 points will need to be considered. For high dimensional spaces, two alternative MDBST specifications are as follows.
All attributes need not be considered for splitting the data if it is known that some are of greater interest than others. Linearity present in the data may manifest itself via the variance of projections of points on the coordinates; choosing the coordinate with greatest variance as the discriminator coordinate at each node may therefore allow repeated use of certain attributes. This has the added effect that the hyper-rectangular cells into which the terminal nodes divide the space will be approximately cubical in shape. In this case, Friedman et al. (1977) show that search time is O(log n) on average for the finding of a NN. Results obtained for dimensionalities of between 2 and 8 are reported on in Friedman et al. (1977), and in the application of this approach to minimal spanning tree construction in Bentley and Friedman (1978). Lisp code for the MDBST is discussed in Broder (1990).
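The maximum-variance discriminator and the backtracking search can be sketched as follows. This is our illustrative Python rendering under the assumptions just described (median split, variance-chosen discriminator, small leaf size), not the code of Friedman et al. or Broder; the dict-based node representation is simply for brevity.

```python
import math

def build_mdbst(points, leaf_size=8):
    """Recursively split on the coordinate of greatest variance, at the
    median, until at most leaf_size points remain (an MDBST / kd-tree)."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    m = len(points[0])
    def variance(j):
        col = [p[j] for p in points]
        mu = sum(col) / len(col)
        return sum((v - mu) ** 2 for v in col)
    disc = max(range(m), key=variance)       # discriminator coordinate
    pts = sorted(points, key=lambda p: p[disc])
    mid = len(pts) // 2                      # approximate median split
    return {"disc": disc, "split": pts[mid][disc],
            "lo": build_mdbst(pts[:mid], leaf_size),
            "hi": build_mdbst(pts[mid:], leaf_size)}

def nn(node, q, best=None):
    """Top-down traversal with backtracking: visit the child on q's side
    of the split first, then the other child only if it could do better."""
    if "leaf" in node:
        for p in node["leaf"]:
            d = math.dist(p, q)
            if best is None or d < best[1]:
                best = (p, d)
        return best
    if q[node["disc"]] < node["split"]:
        near, far = node["lo"], node["hi"]
    else:
        near, far = node["hi"], node["lo"]
    best = nn(near, q, best)
    # Backtrack: the far side can hold a closer point only if the
    # splitting hyperplane is nearer than the current NN distance.
    if best is None or abs(q[node["disc"]] - node["split"]) < best[1]:
        best = nn(far, q, best)
    return best
```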
The MDBST has also been proposed for very high dimensionality spaces, i.e. where the dimensionality may be greater than the number of points, as could be the case in a keyword-based system. Keywords (coordinates) are batched, and the following decision rule is used: if some one of a given batch of node-defining discriminating attributes is present, then take the left subtree, else take the right subtree. Large n, well in excess of 1400, was stated as necessary for good results (Weiss, 1981; Eastman and Weiss, 1982). General guidelines for the attributes which define the direction of search at each level are that they be related, and that the number chosen should keep the tree balanced. On intuitive grounds, our opinion is that this approach will work well if the clusters of attributes, defining the tree nodes, are mutually well separated.

An MDBST approach is used by Moore (1999) in the case of Gaussian mixture clustering.
Figure 3: Two-dimensional example of a projection-based bound. Only points with projections within distance c of the given point's (*) projection are searched. Distance c is defined with reference to a candidate or current nearest neighbor.
Over and above the search for nearest neighbors based on Euclidean distance, Moore allows for the Mahalanobis metric, i.e. distance to cluster centers which are "corrected" for the (Gaussian) spread or morphology of clusters. The information stored at each node of the tree includes covariances. Moore (1999) reports results on numbers of objects of around 160,000, dimensionalities of between 2 and 6, and speedups of 8-fold to 1000-fold. Pelleg and Moore (1999) discuss results on some 430,000 two-dimensional objects from the Sloan Digital Sky Survey (see section 10 below).
4 Projections and Other Bounds
4.1 Bounding using Projection or Properties of Metrics
Making use of bounds is a versatile approach, which may be less restricted by dimensionality. Some lower bound on the dissimilarity is efficiently calculated in order to dispense with the full calculation in many instances.

Using projections on a coordinate axis allows the exclusion of points in the search for the NN of point x_i. Only points x_k are considered such that (x_ij − x_kj)^2 ≤ c^2, where x_ij is the jth coordinate of x_i, and where c is some prespecified distance (see Fig. 3).

Alternatively, more than one coordinate may be used. The prior sorting of coordinate values on the chosen axis or axes expedites the finding of points whose full distance calculation is necessitated. The preprocessing required with this approach involves the sorting of up to m sets of coordinates, i.e. O(mn log n) time.
Using one axis, it is evident that many points may be excluded if the dimensionality is very small, but the approach will worsen as the latter grows. Friedman et al. (1975) give the expected NN search time, under the assumption that the points are uniformly distributed, as O(m n^(1−1/m)). This approaches the brute force O(nm) as m gets large. Reported empirical results are for dimensions 2 to 8.
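The single-axis case can be sketched as follows: after a one-off sort on the chosen coordinate, we scan outwards from the query's projection and stop as soon as the projection difference alone exceeds the current NN distance. This is our illustrative Python version (the function name `nn_projection` and the outward two-sided scan are our choices, consistent with the bound described above).

```python
import bisect, math

def nn_projection(points, q):
    """NN search with a single-coordinate projection bound: a point whose
    first coordinate differs from q's by at least the current NN distance
    cannot be nearer, so its full distance is never computed."""
    # Preprocessing: sort on coordinate 0 (O(n log n), done once).
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    proj = [points[i][0] for i in order]
    # Start from the point whose projection is closest to q's.
    k = bisect.bisect_left(proj, q[0])
    best, best_d = None, float("inf")
    lo, hi = k - 1, k
    # Scan outwards, always taking the side with the smaller projection gap.
    while lo >= 0 or hi < len(proj):
        if hi < len(proj) and (lo < 0 or proj[hi] - q[0] <= q[0] - proj[lo]):
            i, side = hi, "hi"
        else:
            i, side = lo, "lo"
        if abs(proj[i] - q[0]) >= best_d:
            break  # no remaining point can beat the current NN
        d = math.dist(points[order[i]], q)
        if d < best_d:
            best, best_d = order[i], d
        if side == "hi":
            hi += 1
        else:
            lo -= 1
    return best, best_d
```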
Marimont and Shapiro (1979) extend this approach by the use of projections in subspaces of dimension greater than 1 (usually about m/2 is suggested). This can be further improved if the subspace of the principal components is used. Dimensions up to 40 are examined.
The Euclidean distance is very widely used. Two other members of the family of Minkowski metric measures require less computation time to calculate, and they can be used to provide bounds on the Euclidean distance. We have:

    d_∞(x, x') ≤ d_2(x, x') ≤ d_1(x, x')

where d_1 is the Hamming distance defined as Σ_j |x_j − x'_j|; the Euclidean distance, d_2, is given by the square root of Σ_j (x_j − x'_j)^2; and the Chebyshev distance is defined as d_∞(x, x') = max_j |x_j − x'_j|.
Kittler proposed the following rejection rule: reject all points y such that d_1(x, y) ≥ √m δ, where δ is the current NN d_2-distance. The more efficiently calculated d_1-distance may thus allow the rejection of many points (90% in 10-dimensional space is reported on by Kittler). Kittler's rule is obtained by noting that the greatest d_1-distance between x and x' is attained when

    |x_j − x'_j|^2 = d_2^2(x, x') / m

for all coordinates, j. Hence d_1(x, x') = √m d_2(x, x') is the greatest d_1-distance between x and x'. In the case of the rejection of point y, we then have:

    d_1(x, y) ≤ √m d_2(x, y)

and since, by virtue of the rejection,

    d_1(x, y) ≥ √m δ

it follows that δ ≤ d_2(x, y).
Yunck (1976) presents a theoretical analysis for the similar use of the Chebyshev metric. Richetin et al. (1980) propose the use of both bounds. Using uniformly distributed points in dimensions 2 to 5, the latter reference reports the best outcome when the rule "reject all y such that d_∞(x, y) ≥ δ" precedes the rule based on the d_1-distance. Up to 80% reduction in CPU time is reported.
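Both bounds can be sketched together in a few lines of Python. For clarity the coordinate differences are computed in full here (in a tuned implementation the running sums would be cut off early, which is where the saving comes from); the function name is ours.

```python
import math

def nn_with_minkowski_bounds(points, q):
    """Brute-force NN in d_2, but reject candidates via the cheaper
    d_inf and d_1 bounds before any square root is taken:
    reject y if d_inf(q, y) >= delta, or if d_1(q, y) >= sqrt(m) * delta."""
    m = len(q)
    root_m = math.sqrt(m)
    best, delta = None, float("inf")
    for i, y in enumerate(points):
        diffs = [abs(a - b) for a, b in zip(q, y)]
        if max(diffs) >= delta:            # Chebyshev bound: d_inf <= d_2
            continue
        if sum(diffs) >= root_m * delta:   # Kittler's d_1 bound
            continue
        d = math.sqrt(sum(t * t for t in diffs))
        if d < delta:
            best, delta = i, d
    return best, delta
```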
4.2 Bounding using the Triangular Inequality
The triangular inequality is satisfied by distances: d(x, y) ≤ d(x, z) + d(z, y), where x, y and z are any three points. The use of a reference point, z, allows a full distance calculation between point x, whose NN is sought, and y to be avoided if

    |d(y, z) − d(x, z)| ≥ δ

where δ is the current NN distance. The set of all distances to the reference point are calculated and stored in a preprocessing step requiring O(n) time and O(n) space. The above cut-off rule is obtained by noting that

    d(x, y) ≥ |d(y, z) − d(x, z)|

always holds, so that the rejection condition implies d(x, y) ≥ δ. The former inequality reduces to the triangular inequality, irrespective of which of d(y, z) or d(x, z) is the greater.

The set of distances to the reference point, {d(x, z) | x}, may be sorted in the preprocessing stage. Since d(x, z) is fixed during the search for the NN of x, it follows that the cut-off rule will not then need to be applied in all cases.
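A single-reference-point sketch in Python follows; the function name and the returned `skipped` counter (showing how many full distance calculations the cut-off avoided) are our illustrative additions.

```python
import math

def nn_reference_point(points, q, z=0):
    """Reference-point pruning: with all d(y, z) precomputed, skip the full
    distance d(q, y) whenever |d(y, z) - d(q, z)| >= current NN distance."""
    # Preprocessing, done once for the data set: O(n) distances to z.
    ref = [math.dist(p, points[z]) for p in points]
    dqz = math.dist(q, points[z])
    best, delta = None, float("inf")
    skipped = 0
    for i, y in enumerate(points):
        if abs(ref[i] - dqz) >= delta:     # triangular-inequality cut-off
            skipped += 1
            continue
        d = math.dist(q, y)
        if d < delta:
            best, delta = i, d
    return best, delta, skipped
```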
The single reference point approach, due to Burkhard and Keller (1973), was generalized to multiple reference points by Shapiro (1977). The sorted list of distances to the first reference point, {d(x, z_1) | x}, is used as described above as a preliminary bound. Then the subsequent bounds are similarly employed to further reduce the points requiring a full distance calculation. The number and the choice of reference points to be used is dependent on the distributional characteristics of the data. Shapiro (1977) finds that reference points ought to be located away from groups of points. In 10-dimensional simulations, it was found that at best only 20% of full distance calculations were required (although this was very dependent on the choice of reference points).
Hodgson (1988) proposes the following bound, related to the training set of points, y, among which the NN of point x is sought. Determine in advance the NNs and their distances, d(y, NN(y)), for all points y in the training set. For point y, then consider δ_y = ½ d(y, NN(y)). In seeking NN(x), and having at some time in the processing a candidate NN, y', we can exclude all further y from consideration if we find that d(x, y') ≤ δ_{y'}. In this case, we know that we are sufficiently close to y' that we cannot improve on it.
A related approach precomputes the inter-point distances between the members of the training set. Given x, whose NN we require, some member of the training set is used as a reference point. Using the bounding approach based on the triangular inequality, described above, allows other training set members to be excluded from any possibility of being NN(x). Micó et al. (1992) and Ramasubramanian and Paliwal (1992) discuss further enhancements to this approach, focused especially on the storage requirements.
Fukunaga and Narendra (1975) make use of both a hierarchical decomposition of the data set (they employ, repeatedly, the k-means partitioning technique), and bounds based on the triangular inequality. For each node in the decomposition tree, the center of the associated points, and the maximum distance from this center to any associated point (the "radius"), are determined. For 1000 points, 3 levels were used, with a division into 3 classes at each node.

All points associated with a non-terminal node can be rejected in the search for the NN of point x if the following rule (Rule 1) is not verified:

    d(x, g) − r_g < δ

where δ is the current NN distance, g is the center of the cluster of points associated with the node, and r_g is the radius of this cluster. For a terminal node which cannot be rejected on the basis of this rule, each associated point, y, can be tested for rejection using the following rule (Rule 2):

    |d(x, g) − d(y, g)| ≥ δ.

These two rules are direct consequences of the triangular inequality.

A branch and bound algorithm can be implemented using these two rules. This involves determining some current NN (the bound) and subsequently branching out of a traversal path whenever the current NN cannot be bettered. Not being inherently limited by dimensionality, this approach appears particularly attractive for general purpose applications.
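The two rules can be sketched in a single-level form. Fukunaga and Narendra use a multi-level k-means tree; here, as a simplification, the partition is simply given as a list of clusters, and the function name `bb_nn` is ours.

```python
import math

def bb_nn(clusters, q):
    """Single-level sketch of Fukunaga-Narendra branch-and-bound NN search.
    Rule 1 rejects a whole cluster; Rule 2 rejects individual points
    using their precomputed distances d(y, g) to the cluster center."""
    # Preprocessing: center g and radius r_g of each cluster,
    # plus each member's distance to its center.
    meta = []
    for pts in clusters:
        m = len(pts[0])
        g = tuple(sum(p[j] for p in pts) / len(pts) for j in range(m))
        dists = [math.dist(p, g) for p in pts]
        meta.append((g, max(dists), dists))
    best, delta = None, float("inf")
    # Visit clusters nearest-center first (the branch-and-bound ordering).
    order = sorted(range(len(clusters)),
                   key=lambda c: math.dist(q, meta[c][0]))
    for c in order:
        g, r_g, dists = meta[c]
        dqg = math.dist(q, g)
        if dqg - r_g >= delta:
            continue                       # Rule 1: whole cluster rejected
        for y, dyg in zip(clusters[c], dists):
            if abs(dqg - dyg) >= delta:
                continue                   # Rule 2: single point rejected
            d = math.dist(q, y)
            if d < delta:
                best, delta = (c, y), d
    return best, delta
```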
Other rejection rules are considered by Kamgar-Parsi and Kanal (1985). A simpler form of clustering is used in the variant of this algorithm proposed by Niemann and Goppert (1988): a shallow MDBST is used, followed by a variant of the branching and bounding described above.

Bennett et al. (1999) use the nearest neighbor problem as a means towards solving the Gaussian distribution mixture problem. They consider a preprocessing approach similar to Fukunaga and Narendra (1975), but with an important difference: to take better account of cluster structure in the data, the clusters are multivariate normal but not necessarily of diagonal covariance structure. Therefore very elliptical clusters are allowed. This in turn implies that a cluster radius is not of great benefit for establishing a bound on whether or not distances need to be calculated. Bennett et al. (1999) address this problem by seeking a stochastic guarantee on whether or not calculations can be excluded. Technically, however, such stochastic bounds are not easy to determine in a high dimensional space.
An interesting issue raised in Beyer et al. (1999) is discussed also by Bennett et al. (1999): if the ratio of the nearest and furthest neighbor distances converges in probability to 1 as the dimensionality increases, then is it meaningful to search for nearest neighbors? This issue is not all that different from saying that neighbors in an increasingly high dimensional space tend towards being equidistant. In section 5, we will look at approaches for handling particular classes of data of this type.
4.3 Fast Approximate Nearest Neighbor Finding
Kushilevitz et al. (1998), working in Euclidean and L_1 spaces, propose fast approximate nearest neighbor searching, on the grounds that in systems for content-based image retrieval, approximate results are adequate. Projections are used to bound the search. The probability of successfully finding the nearest neighbor is traded off against time and space requirements.
5 The Special Case of Sparse Binary Data
"High-dimensional", "sparse" and "binary" are the characteristics of keyword-based bibliographic data, with maybe values in excess of 10000 for both n and m. Such data is usually stored as list data structures, representing the mapping of documents onto index terms, or vice versa. Commercial document collections are usually searched using a Boolean search environment. Documents associated with particular terms are retrieved, and the intersection (AND), union (OR) or other operations on such sets of documents are obtained. For efficiency, an inverted file which maps terms onto documents must be available for Boolean retrieval. The efficient NN algorithms, to be discussed, make use of both the document-term and the term-document files.
The usual algorithm for NN-searching considers each document in turn, calculates the distance with the given document, and updates the NN if appropriate. This algorithm is shown schematically in Fig. 4 (top). The inner loop is simply an expression of the fact that the distance or similarity will, in general, require O(m) calculation: examples of commonly used coefficients are the Jaccard similarity, and the Hamming (L_1 Minkowski) distance.

If m̄ and n̄ are, respectively, the average number of terms associated with a document, and the average number of documents associated with a term, then an average complexity measure, over n searches, of this usual algorithm is O(n m̄). It is assumed that advantage is taken of some packed form of storage in the inner loop (e.g. using linked lists).
Croft's algorithm (see Croft, 1977, and Fig. 4) is of worst case complexity O(nm^2). However, the number of terms associated with the document whose NN is required will often be quite small. The National Physical Laboratory test collection, for example, which was used in Murtagh (1982), has the following characteristics: n = 11429, m = 7491, m̄ = 19.9, and n̄ = 30.4. The outermost and innermost loops in Croft's algorithm use the document-term file. The center loop uses the term-document inverted file. An average complexity measure (more strictly, the time taken for best match search based on an average document with associated average terms) is seen to be O(n̄ m̄^2).
In the outermost loop of Croft's algorithm, there will eventually come about a situation where, if a document has not been thus far examined, the number of terms remaining for the given document do not permit the current NN document to be bettered. In this case, we can cut short the iterations of the outermost loop. The calculation of a bound, using the greatest possible number of terms which could be shared with a so-far unexamined document, has been exploited by Smeaton and van Rijsbergen (1981) and by Murtagh (1982) in successive improvements on Croft's algorithm.

The complexity of all the above algorithms has been measured in terms of operations to be performed. In practice, however, the actual accessing of term or document information may be of far greater cost. The document-term and term-document files are ordinarily stored on direct access file storage because of their large sizes. The strategy used in Croft's algorithm, and in improvements on it, does not allow any viable approach to batching together the records which are to be read successively, in order to improve accessing-related performance.
The Perry-Willett algorithm (see Perry and Willett, 1983) presents a simple but effective solution to the problem of costly I/O. It focuses on the calculation of the number of terms common to the given document x and each other document, y, in the document collection. This set of values is built up in a computationally efficient fashion. O(n) operations are subsequently required to determine the (dis)similarity, using another vector comprising the total numbers of terms associated with each document. Computation time (the same "average" measure as that used above) is O(n̄ m̄ + n). We now turn attention to the numbers of direct-access reads required.

In Croft's algorithm, all terms associated with the document whose NN is desired may be read in one read operation. Subsequently, we require n̄ m̄ reads, giving in all 1 + n̄ m̄. In the Perry-Willett algorithm, the outer loop again pertains to the one (given) document, and so all terms associated with this document can be read and stored. Subsequently, m̄ reads, i.e. the average number of terms, each of which demands a read of a set of documents, are required. This gives, in all, 1 + m̄.
Usual algorithm:

Initialize current NN
For all documents in turn do:
... For all terms associated with the document do:
... ... Determine (dis)similarity
... Endfor
... Test against current NN
Endfor
Croft's algorithm:
Initialize current NN
For all terms associated with the given document do:
... For all documents associated with each term do:
... ... For all terms associated with a document do:
... ... ... Determine (dis)similarity
... ... Endfor
... ... Test against current NN
... Endfor
Endfor
Perry-Willett algorithm:
Initialize current NN
For all terms associated with the given document, i, do:
... For all documents, i', associated with each term, do:
... ... Increment location i' of counter vector
... Endfor
Endfor
Figure 4: Algorithms for NN-searching using high-dimensional sparse binary data.
Figure 5: Five points, showing NNs and RNNs.
Since these reads are very much the costliest operation in practice, the Perry-Willett algorithm can be recommended for large values of n and m. Its general characteristics are that (i) it requires, as do all the algorithms discussed in this section, the availability of the inverted term-document file; and (ii) it requires in-memory storage of two vectors containing n integer values.
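The counter-vector idea of the Perry-Willett algorithm can be sketched in Python as follows. For illustration the "files" are in-memory dicts and sets (in practice the inverted file lives on direct-access storage, and each term's posting list costs one read, which is where the 1 + m̄ figure comes from); the function name and the use of the Jaccard similarity for the final O(n) pass are our choices.

```python
def perry_willett(doc_terms, x):
    """Perry-Willett NN search for sparse binary data. doc_terms maps each
    document to its term set; the term -> document inverted file is built
    once, and common-term counts with x are accumulated in one pass."""
    # Inverted file: term -> set of documents containing it.
    inverted = {}
    for doc, terms in doc_terms.items():
        for t in terms:
            inverted.setdefault(t, set()).add(doc)
    # Counter vector: number of terms shared with x.
    common = dict.fromkeys(doc_terms, 0)
    for t in doc_terms[x]:
        for doc in inverted.get(t, ()):
            common[doc] += 1
    # O(n) pass: Jaccard similarity from shared counts and set sizes.
    best, best_sim = None, -1.0
    for doc, c in common.items():
        if doc == x:
            continue
        union = len(doc_terms[x]) + len(doc_terms[doc]) - c
        sim = c / union if union else 0.0
        if sim > best_sim:
            best, best_sim = doc, sim
    return best, best_sim
```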
6 Hierarchical Agglomerative Clustering
The algorithms discussed in this section can be characterized as greedy (Horowitz and Sahni, 1979). A sequence of irreversible algorithm steps is used to construct the desired data structure.

We will not review hierarchical agglomerative clustering here. For essential background, the reader is referred to Murtagh and Heck (1987), Gordon (1999), or Jain and Dubes (1988). This section borrows on Murtagh (1992).
One could practically say that Sibson (1973) and Defays (1977) are part of the prehistory of clustering. Their O(n^2) implementations of the single link method and of a (non-unique) complete link method, respectively, have been widely cited.

In the early 1980s a range of significant improvements were made to the Lance-Williams, or related, dissimilarity update schema (de Rham, 1980; Juan, 1982), which had been in wide use since the mid-1960s. Murtagh (1985) presents a survey of these algorithmic improvements. We will briefly describe them here. The new algorithms, which have the potential for exactly replicating results found in the classical but more computationally expensive way, are based on the construction of nearest neighbor chains and reciprocal or mutual NNs (NN-chains and RNNs).

An NN-chain consists of an arbitrary point (a in Fig. 5); followed by its NN (b in Fig. 5); followed by the NN from among the remaining points (c, d, and e in Fig. 5) of this second point; and so on until we necessarily have some pair of points which can be termed reciprocal or mutual NNs. (Such a pair of RNNs may be the first two points in the chain; and we have assumed that no two dissimilarities are equal.)
In constructing an NN-chain, irrespective of the starting point, we may agglomerate a pair of RNNs as soon as they are found. What guarantees that we can arrive at the same hierarchy as if we used traditional "stored dissimilarities" or "stored data" algorithms? Essentially this is the same condition as that under which no inversions or reversals are produced by the clustering method. Fig. 6 gives an example of this, where s is agglomerated at a lower criterion value (i.e. dissimilarity) than was the case at the previous agglomeration between q and r. Our ambient space has thus contracted because of the agglomeration. This is due to the algorithm used, in particular the agglomeration criterion, and it is something we would normally wish to avoid.

This is formulated as:

    Inversion impossible if:  d(i, j) < d(i, k) or d(i, j) < d(j, k)  ⟹  d(i, j) < d(i ∪ j, k)

This is essentially Bruynooghe's reducibility property (Bruynooghe, 1977; see also Murtagh, 1984). Using the Lance-Williams dissimilarity update formula, it can be shown that the minimum variance method does not give rise to inversions; neither do the linkage methods; but the median and centroid methods cannot be guaranteed not to have inversions.

Figure 6: Alternative representations of a hierarchy with an inversion. Assuming dissimilarities, as we go vertically up, criterion values (d_1, d_2) decrease. But here, undesirably, d_2 > d_1.
To return to Fig. 5, if we are dealing with a clustering criterion which precludes inversions, then c and d can justifiably be agglomerated, since no other point (for example, b or e) could have been agglomerated to either of these.

The processing required, following an agglomeration, is to update the NNs of points such as b in Fig. 5 (and on account of such points, this algorithm was dubbed algorithme des célibataires in de Rham, 1980). The following is a summary of the algorithm:
NN-chain algorithm

Step 1: Select a point arbitrarily.
Step 2: Grow the NN-chain from this point until a pair of RNNs is obtained.
Step 3: Agglomerate these points (replacing with a cluster point, or updating the dissimilarity matrix).
Step 4: From the point which preceded the RNNs (or from any other arbitrary point if the first two points chosen in Steps 1 and 2 constituted a pair of RNNs), return to Step 2 until only one point remains.
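The four steps above can be sketched as follows for the single link criterion, which satisfies the reducibility property, so RNN pairs may be merged as soon as they are found. This is our illustrative Python rendering (dissimilarities in a dict keyed by unordered pairs, new clusters given fresh integer labels); it assumes, as in the text, that no two dissimilarities are equal.

```python
def nn_chain(d):
    """NN-chain agglomeration for single link. `d` maps frozenset pairs of
    cluster labels to dissimilarities; returns merges in the order found."""
    def dist(a, b):
        return d[frozenset((a, b))]
    active = {c for pair in d for c in pair}
    merges = []
    chain = []
    next_label = max(active) + 1
    while len(active) > 1:
        if not chain:
            chain = [min(active)]          # Step 1: arbitrary start point
        # Step 2: grow the chain until a reciprocal NN pair is reached.
        while True:
            tip = chain[-1]
            nn = min((c for c in active if c != tip),
                     key=lambda c: dist(tip, c))
            if len(chain) > 1 and nn == chain[-2]:
                break                      # tip and nn are RNNs
            chain.append(nn)
        i, j = chain.pop(), chain.pop()
        # Step 3: agglomerate, with the Lance-Williams single link update
        # d(i U j, k) = min(d(i, k), d(j, k)).
        merges.append((i, j, dist(i, j)))
        active -= {i, j}
        for k in active:
            d[frozenset((next_label, k))] = min(dist(i, k), dist(j, k))
        active.add(next_label)
        next_label += 1
        # Step 4: the remaining chain prefix (the point preceding the
        # RNNs) is continued on the next outer iteration.
    return merges
```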
In Murtagh (1983, 1984, 1985) and Day and Edelsbrunner (1984), one finds discussions of O(n^2) time and O(n) space implementations of Ward's minimum variance (or error sum of squares) method and of the centroid and median methods. The latter two methods are termed the UPGMC and WPGMC criteria by Sneath and Sokal (1973). Now, a problem with the cluster criteria used by these latter two methods is that the reducibility property is not satisfied by them. This means that the hierarchy constructed may not be unique, as a result of inversions or reversals (non-monotonic variation) in the clustering criterion value determined in the sequence of agglomerations.

Murtagh (1984) describes O(n^2) time and O(n^2) space implementations for the single link method, the complete link method, and for the weighted and unweighted group average methods (WPGMA and UPGMA). This approach is quite general vis à vis the dissimilarity used, and can also be used for hierarchical clustering methods other than those mentioned.

Day and Edelsbrunner (1984) prove the exact O(n^2) time complexity of the centroid and median methods using an argument related to the combinatorial problem of optimally packing hyperspheres into an m-dimensional volume. They also address the question of metrics: results are valid in a wide class of distances including those associated with the Minkowski metrics.
The NN-chain algorithm, and the related algorithm which performs agglomerations whenever reciprocal nearest neighbors meet, both offer possibilities for parallelization. Implementations on a SIMD machine were described by Willett (1989).

Evidently both coordinate data and graph (e.g., dissimilarity) data can be input to these agglomerative methods. Gillet et al. (1998), in the context of clustering chemical structure databases, refer to the common use of the Ward method, based on the reciprocal nearest neighbors algorithm, on data sets of a few hundred thousand molecules.

Applications of hierarchical clustering to bibliographic information retrieval are assessed in Griffiths et al. (1984). Ward's minimum variance criterion is favored.

From details in White and McCain (1997), the Institute of Scientific Information (ISI) clusters citations (science, and social science) by first clustering highly cited documents based on a single linkage criterion, and then four more passes are made through the data to create a subset of a single linkage hierarchical clustering.
7 Graph Clustering
Hierarchical clustering methods are closely related to graph-based clustering. For a start, a dendrogram is a rooted labeled tree. Secondly, and more importantly, some methods like the single and complete link methods can be displayed as graphs, and are very closely related to mainstream graph data structures.

An example of the increasing prevalence of graph clustering in the context of data mining on the web is presented in Fig. 7: Amazon.com provides information on what other books were purchased by like-minded individuals.
The single link method was referred to in the previous section as a widely-used agglomerative, hence hierarchical, clustering method. Rohlf (1982) reviews algorithms for the single link method with complexities ranging from O(n log n) to O(n^5). The criterion used by the single link method for cluster formation is weak, meaning that noisy data in particular give rise to results which are not robust.

The minimal spanning tree (MST) and the single link agglomerative clustering method are closely related: the MST can be transformed irreversibly into the single link hierarchy (Rohlf, 1973). The MST is defined as of minimal total weight, it spans all nodes (vertices), and it is an unrooted tree. The MST has been a method of choice for at least four decades now, either in its own right for data analysis (Zahn, 1971), as a data structure to be approximated (e.g. using shortest spanning paths, see Murtagh, 1985, p. 96), or as a basis for clustering. We will look at some fast algorithms for the MST in the remainder of this section.
Perhaps the most basic MST algorithm, due to Prim and Dijkstra, grows a single fragment through n − 1 steps. We find the closest vertex to an arbitrary vertex, calling these a fragment of the MST. We determine the closest vertex, not in the fragment, to any vertex in the fragment, and add this new vertex into the fragment. While there are fewer than n vertices in the fragment, we continue to grow it.

This algorithm leads to a unique solution. A default O(n^3) implementation is clear, and O(n^2) computational cost is possible (Murtagh, 1985, p. 98).
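The O(n^2) version of this fragment-growing procedure can be sketched as follows; the nearest-distance array that achieves the O(n^2) cost is the standard device, and the function name `prim_mst` is ours.

```python
import math

def prim_mst(points):
    """Prim-Dijkstra MST growth: start one fragment from an arbitrary
    vertex and, n - 1 times, attach the outside vertex closest to the
    fragment. The nearest-distance array gives O(n^2) overall cost."""
    n = len(points)
    in_tree = [False] * n
    # best[i] = (distance to fragment, fragment vertex attaining it)
    best = [(math.dist(points[0], points[i]), 0) for i in range(n)]
    in_tree[0] = True
    edges = []
    for _ in range(n - 1):
        v = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best[i][0])
        edges.append((best[v][1], v, best[v][0]))
        in_tree[v] = True
        # Update each outside vertex's distance to the enlarged fragment.
        for i in range(n):
            if not in_tree[i]:
                d = math.dist(points[v], points[i])
                if d < best[i][0]:
                    best[i] = (d, v)
    return edges
```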
Sollin's algorithm constructs the fragments in parallel. For each fragment in turn, at any stage of the construction of the MST, determine its closest fragment. Merge these fragments, and update the list of fragments. A tree can be guaranteed in this algorithm (although care must be taken in cases of equal similarity) and our other requirements (all vertices included, minimal total edge weight) are very straightforward. Given the potential for roughly halving the data remaining to be processed at each step, not surprisingly the computational cost reduces from O(n^3) to O(n^2 log n).
The real interest of Sollin's algorithm arises when we are clustering on a graph and do not have all n(n - 1)/2 edges present. Sollin's algorithm can be shown to have computational cost O(m log n), where m is the number of edges. When m is much less than n(n - 1)/2, we have the potential for appreciable computational savings.
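This parallel fragment-merging scheme can be sketched on an explicit edge list as follows (a minimal version; names are ours, and ties in edge weight are handled by re-checking fragment membership before each merge):

```python
def sollin_mst(n, edges):
    """Sollin's (Boruvka's) algorithm on an edge list [(w, u, v), ...]:
    in each round every fragment finds its cheapest outgoing edge, and
    fragments are merged, roughly halving their number per round."""
    parent = list(range(n))
    def find(x):                      # fragment lookup with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    merged = True
    while merged:
        merged = False
        cheapest = {}                 # fragment root -> cheapest outgoing edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][0]:
                    cheapest[r] = (w, u, v)
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:              # re-check: avoids cycles on equal weights
                parent[ru] = rv
                mst.append((w, u, v))
                merged = True
    return mst

# a sparse 4-vertex graph; the MST has weight 1 + 2 + 3 = 6
edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (10, 0, 3)]
mst = sollin_mst(4, edges)
```

Each round touches every edge once, and the number of fragments at least halves per round, giving the O(m log n) behavior noted above.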
The MST in feature spaces can of course make use of the fast nearest neighbor finding methods studied earlier in this article. See Murtagh (1985, section 4.4) for various examples.
Other graph data structures which have been proposed for data analysis are related to the MST. We know, for example, that the following subset relationship holds:

MST \subseteq RNG \subseteq GG \subseteq DT

where RNG is the relative neighborhood graph, GG is the Gabriel graph, and DT is the Delaunay triangulation. The latter, in the form of its dual, the Voronoi diagram, has been used for analyzing the clustering of galaxy locations. References to these and related methods can be found in Murtagh (1993b).
8 Nearest Neighbor Finding on Graphs
Clustering on graphs may be required because we are working with (perhaps complex non-Euclidean) dissimilarities. In such cases, where we must take into account an edge between each and every pair of vertices, we will generally have an O(m) computational cost, where m is the number of edges. In a metric space we have seen that we can look for various possible ways to expedite the nearest neighbor search. An approach based on visualization, turning our data into an image, will be looked at below. However, there is another aspect of our similarity (or other) graph which we may be able to turn to our advantage. Efficient algorithms for sparse graphs are available. Sparsity can be arranged: we can threshold our edges if the sparsity does not suggest itself more naturally. A special type of sparse graph is a planar graph, i.e. a graph capable of being represented in the plane without any crossovers of edges.
For sparse graphs, algorithms with O(m log log n) computational cost were described by Yao (1975) and Cheriton and Tarjan (1976). A short algorithmic description can be found in Murtagh (1985, pp. 107-108), and we refer in particular to the latter.
The basic idea is to preprocess the graph in order to expedite the sorting of edge weights (why sorting? simply because we must repeatedly find smallest links, and maintaining a sorted list of edges is a good basis for doing this). If we were to sort all edges, the computational requirement would be O(m log m). Instead of doing that, we take the edge set associated with each and every vertex. We divide each such edge set into groups of size k. (The fact that the last such group will usually be of size < k is taken into account when programming.)
Let n_v be the number of incident edges at vertex v, such that \sum_v n_v = 2m.
The sorting operation for each vertex now takes O(k log k) operations for each group, and we have n_v / k groups. For all vertices the sorting requires a number of operations which is of the order of \sum_v n_v \log k = 2m \log k. This looks like a questionable, or small, improvement over O(m log m).
Determining the lightest edge incident on a vertex requires O(n_v / k) comparisons since we have to check all groups. Therefore the lightest edges incident on all vertices are found with O(m/k) operations.
When two vertices, and later fragments, are merged, their associated groups of edges are simply collected together, therefore keeping the total number of groups of edges which we started out with. We will bypass the issue of edges which, over time, are to be avoided because they connect vertices in the same fragment: given the fact that we are building an MST, the total number of such edges-to-be-avoided cannot surpass 2m.
To find what to merge next, again O(m/k) processing is required. Using Sollin's algorithm, the total processing required in finding what to merge next is O((m/k) log n). The total processing required for grouping the edges, and sorting within the edge-groups, is O(m log k), i.e. it is one-off and accomplished at the start of the MST-building process. With the choice k = log n, the merging work is O(m), so the one-off sorting term dominates and gives overall computational complexity as O(m log log n).
This result has been further improved to near linearity in m by Gabow et al. (1986), who develop an algorithm with complexity O(m log log log ... n), where the number of iterated log terms is bounded by m/n.
Motwani and Raghavan (1995, chapter 10) base a stochastic O(m) algorithm for the MST on random sampling to identify and eliminate edges that are guaranteed not to belong to the MST.
Let's turn our attention now to the case of a planar graph. For a planar graph we know that m <= 3n - 6 for m > 1. (For proof, see for example Tucker, 1980, or any book on graph theory.)
Referring to Sollin's algorithm, described above, O(n) operations are needed to establish a least cost edge from each vertex, since there are only O(n) edges present. On the next round, following fragment-creation, there will be at most ceil(n/2) new vertices, implying of the order of n/2 processing to find the least cost edge (where ceil is the ceiling function, or smallest integer greater than the argument). The total computational cost is seen to be proportional to n + n/2 + n/4 + ... = O(n).
So determining the MST of a planar graph is linear in the number of either vertices or edges.
Before ending this review of very efficient clustering algorithms for graphs, we note that algorithms discussed so far have assumed that the similarity graph was undirected. For modeling transport flows, or economic transfers, the graph could well be directed. Components can be defined, generalizing the clusters of the single link method, or the complete link method. Tarjan (1983) provides an algorithm for the latter agglomerative criterion which is of computational cost O(m log n).
9 K-Means and Family
The non-technical person more often than not understands clustering as a partition. K-means, looked at in this section, and the distribution mixture approach, looked at in section 10, provide solutions.
A mathematical definition of a partition implies no multiple assignments of observations to clusters, i.e. no overlapping clusters. Overlapping clusters may be faster to determine in practice, and a case in point is the one-pass algorithm described in Salton and McGill (1983). The general principle followed is: make one pass through the data, assigning each object to the first cluster which is close enough, and making a new cluster for objects that are not close enough to any existing cluster.
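This one-pass principle can be sketched in a few lines (a minimal version of ours; each cluster is represented here by its first member, so the outcome depends on the presentation order and on the threshold):

```python
import numpy as np

def one_pass_cluster(data, threshold):
    """Single pass through the data: assign each object to the first
    existing cluster whose representative is within `threshold`,
    otherwise open a new cluster for it."""
    reps, members = [], []
    for i, x in enumerate(data):
        for c, rep in enumerate(reps):
            if np.linalg.norm(x - rep) <= threshold:
                members[c].append(i)     # close enough: join this cluster
                break
        else:
            reps.append(x)               # not close to any cluster: new one
            members.append([i])
    return reps, members

# two tight pairs of points, far apart: two clusters result
data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
reps, members = one_pass_cluster(data, threshold=1.0)
```

The single pass is what makes the method cheap, and the order-dependence is the price paid, motivating the iterative refinement discussed next.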
Broder et al. (1997) use this algorithm for clustering the web. A feature vector is determined for each HTML document considered, based on sequences of words. Similarity between documents is based on an inverted list, using an approach like those described in section 5. The similarity graph is thresholded, and components sought.
Broder (1998) solves the same clustering objective using a thresholding and overlapping clustering method similar to the Salton and McGill one. The application described is that of clustering the Altavista repository in April 1996, consisting of 30 million HTML and text documents, comprising 150 GBytes of data. The number of serviceable clusters found was 1.5 million, containing 7 million documents. Processing time was about 10.5 days. An analysis of the clustering algorithm used by Broder can be found in Borodin et al. (1999), who also consider the use of approximate minimal spanning trees.
The threshold-based pass of the data, in its basic state, is susceptible to lack of robustness. A bad choice of threshold leads to too many clusters or too few. To remedy this, we can work on a well-defined data structure such as the minimal spanning tree. Or, alternatively, we can iteratively refine the clustering. Partitioning methods, such as k-means, use iterative improvement of an initial estimation of a targeted clustering.
A very widely used family of methods for inducing a partition on a data set is called k-means, which also goes under more general names (non-overlapping non-hierarchical clustering) or more specific names (minimal distance or exchange algorithms).
The usual criterion to be optimized is:

\frac{1}{|I|} \sum_{q \in Q} \sum_{i \in q} \| \vec{x}_i - \vec{q} \|^2     (1)

where I is the object set, |.| denotes cardinality, q is some cluster, and Q is the partition; q denotes a set in the summation, whereas \vec{q} denotes some associated vector in the error term, or metric norm. This criterion ensures that clusters found are compact, and therefore assumed homogeneous. The optimization criterion, by a small abuse of terminology, is often referred to as a minimum variance one.
A necessary condition that this criterion be optimized is that vector \vec{q} be a cluster mean, which for the Euclidean metric case is:

\vec{q} = \frac{1}{|q|} \sum_{i \in q} \vec{x}_i
A batch update algorithm, due to Lloyd (1957), Forgy (1965), and others, makes assignments to a set of initially randomly-chosen vectors, \vec{q}, as step 1. Step 2 updates the cluster vectors, \vec{q}. This is iterated. The distortion error, equation (1), is non-increasing, and a local minimum is achieved in a finite number of iterations.
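A minimal sketch of this batch iteration follows; for reproducibility it assumes a deterministic initialization from the first k points rather than a random one (function name ours):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20):
    """Batch (Lloyd-Forgy) k-means: step 1 assigns each point to its
    nearest cluster vector; step 2 replaces each cluster vector by the
    cluster mean.  The distortion criterion is non-increasing."""
    centers = X[:k].astype(float).copy()   # deterministic init for the sketch
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # step 1: assignment to the nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 2: recompute cluster means (keep old center if cluster empties)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# two well-separated pairs of points
X = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 1.0], [10.0, 1.0]])
centers, labels = lloyd_kmeans(X, k=2)
```

The empty-cluster guard in step 2 reflects the difficulty, discussed below, that clusters may become and stay empty under batch updating.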
An online update algorithm is due to MacQueen (1967). After each presentation of an observation vector, \vec{x}, the closest cluster vector, \vec{q}, is updated to take account of it. Such an approach is well-suited for a continuous input data stream (implying "online" learning of cluster vectors).
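The online update itself is a one-line recurrence: with a 1/(count+1) learning rate, the winning cluster vector remains the running mean of the observations it has absorbed. A minimal single-cluster illustration (names ours):

```python
import numpy as np

def online_update(center, x, count):
    """MacQueen-style online update: move the cluster vector toward
    observation x with learning rate 1/(count+1), so the vector stays
    the running mean of the observations assigned to it."""
    count += 1
    center = center + (x - center) / count
    return center, count

# the center tracks the running mean of the stream 1, 2, 3
center, count = np.array([0.0]), 0
for x in [1.0, 2.0, 3.0]:
    center, count = online_update(center, np.array([x]), count)
```

In a full clustering loop the update would be applied only to the cluster vector closest to each incoming observation.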
Both algorithms are gradient descent ones. In the online case, much attention has been devoted to best learning rate schedules in the neural network (competitive learning) literature: Darken and Moody (1991, 1992), Darken et al. (1992), Fritzke (1997).
A difficulty, less controllable in the case of the batch algorithm, is that clusters may become (and stay) empty. This may be acceptable, but also may be in breach of our original problem formulation. An alternative to the batch update algorithm is Spath's (1985) exchange algorithm. Each observation is considered for possible assignment into any of the other clusters. Updating and "downdating" formulas are given by Spath. This exchange algorithm is stated to be faster to converge and to produce better (smaller) values of the objective function. Over decades of use, we have also verified that it is a superior algorithm to the minimal distance one.
K-means is very closely related to Voronoi (Dirichlet) tessellations, to Kohonen self-organizing feature maps, and various other methods.
The batch learning algorithm above may be viewed as:
1. An assignment step which we will term the E (estimation) step: estimate the posteriors, P(observations | cluster centres);
2. A cluster update step, the M (maximization) step, which maximizes a cluster center likelihood.
Neal and Hinton (1998) cast the k-means optimization problem in such a way that both E- and M-steps monotonically increase the maximand's values. The EM algorithm may, too, be enhanced to allow for online as well as batch learning (Sato and Ishii, 1999).
In Thiesson et al. (1999), k-means is implemented (i) by traversing blocks of data, cyclically, and incrementally updating the sufficient statistics and parameters, and (ii) by sampling from subsets of the data instead of cyclic traversal. Such an approach is admirably suited for very large data sets, where in-memory storage is not feasible. Examples used by Thiesson et al. (1999) include the clustering of a half million 300-dimensional records.
10 Fast Model-Based Clustering

It is traditional to note that models and (computational) speed don't mix. We review recent progress in this section.
10.1 Modeling of Signal and Noise
A simple and applicable model is a distribution mixture, with the signal modeled by Gaussians, in the presence of Poisson background noise.
Consider data which are generated by a mixture of (G - 1) bivariate Gaussian densities, f_k(x; \theta) \sim N(\mu_k, \Sigma_k), for clusters k = 2, ..., G, and with Poisson background noise corresponding to k = 1. The overall population thus has the mixture density

f(x; \theta) = \sum_{k=1}^{G} \pi_k f_k(x; \theta)

where the mixing or prior probabilities, \pi_k, sum to 1, and f_1(x; \theta) = A^{-1}, where A is the area of the data region. This is the basis for model-based clustering (Banfield and Raftery, 1993, Dasgupta and Raftery, 1998, Murtagh and Raftery, 1984, Banerjee and Rosenfeld, 1993).
The parameters, \theta and \pi, can be estimated efficiently by maximizing the mixture likelihood

L(\theta, \pi) = \prod_{i=1}^{n} f(x_i; \theta),

with respect to \theta and \pi, where x_i is the i-th observation.
Now let us assume the presence of two clusters, one of which is Poisson noise, the other Gaussian. This yields the mixture likelihood

L(\theta, \pi) = \prod_{i=1}^{n} \left[ \pi_1 A^{-1} + \pi_2 \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left( -\frac{1}{2} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right) \right],

where \pi_1 + \pi_2 = 1.
An iterative solution is provided by the expectation-maximization (EM) algorithm of Dempster et al. (1977). We have already noted this algorithm in informal terms in the last section, dealing with k-means. Let the "complete" (or "clean" or "output") data be y_i = (x_i, z_i) with indicator set z_i = (z_{i1}, z_{i2}) given by (1, 0) or (0, 1). Vector z_i has a multinomial distribution with parameters (1; \pi_1, \pi_2). This leads to the complete data log-likelihood:

l(y, z; \theta, \pi) = \sum_{i=1}^{n} \sum_{k=1}^{2} z_{ik} [\log \pi_k + \log f_k(x_i; \theta)]
The E-step then computes \hat{z}_{ik} = E(z_{ik} | x_1, ..., x_n, \theta), i.e. the posterior probability that x_i is in cluster k. The M-step involves maximization of the expected complete data log-likelihood:

l^*(y; \theta, \pi) = \sum_{i=1}^{n} \sum_{k=1}^{2} \hat{z}_{ik} [\log \pi_k + \log f_k(x_i; \theta)].

The E- and M-steps are iterated until convergence.
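As an illustration of these E- and M-steps, here is a sketch of EM for a one-dimensional analogue of the above: one Gaussian cluster plus a uniform background density 1/A over a region of length A (the article's case is bivariate with area A; the function and variable names, and the toy data, are ours):

```python
import numpy as np

def em_gaussian_plus_background(x, A, n_iter=50):
    """EM for a 1-d two-component mixture: uniform background density
    1/A over a region of length A (k=1), plus one Gaussian cluster (k=2).
    E-step: posterior memberships; M-step: update pi, mu, var."""
    pi, mu, var = np.array([0.5, 0.5]), x.mean(), x.var()
    for _ in range(n_iter):
        # E-step: posterior probability that each point is in the Gaussian
        f1 = np.full_like(x, 1.0 / A)
        f2 = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        z = pi[1] * f2 / (pi[0] * f1 + pi[1] * f2)
        # M-step: maximize the expected complete-data log-likelihood
        pi = np.array([1 - z.mean(), z.mean()])
        mu = (z * x).sum() / z.sum()
        var = (z * (x - mu) ** 2).sum() / z.sum()
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(0.0, 100.0, 200),   # background, A = 100
                    rng.normal(50.0, 1.0, 200)])    # one Gaussian cluster
pi, mu, var = em_gaussian_plus_background(x, A=100.0)
```

The posterior memberships z are the \hat{z}_{ik} of the E-step, and the weighted means in the M-step are exactly the maximizers of l^* for this model.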
For the 2-class case (Poisson noise and a Gaussian cluster), the complete-data likelihood is

L(y, z; \theta, \pi) = \prod_{i=1}^{n} \left[ \frac{1}{A} \right]^{z_{i1}} \left[ \frac{\pi_2}{2\pi\sqrt{|\Sigma|}} \exp\left( -\frac{1}{2} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right) \right]^{z_{i2}}

The corresponding expected log-likelihood is then used in the EM algorithm. This formulation of the problem generalizes to the case of G clusters, of arbitrary distributions and dimensions.
In order to assess the evidence for the presence of a signal-cluster, we use the Bayes factor for the mixture model, M_2, that includes a Gaussian density as well as background noise, against the "null" model, M_1, that contains only background noise. The Bayes factor is the posterior odds for the mixture model against the pure noise model, when neither is favored a priori. It is defined as B = p(x | M_2) / p(x | M_1), where p(x | M_2) is the integrated likelihood of the mixture model M_2, obtained by integrating over the parameter space. For a general review of Bayes factors, their use in applied statistics, and how to approximate and compute them, see Kass and Raftery (1995).
We approximate the Bayes factor using the Bayesian Information Criterion (BIC) (Schwartz, 1978). For a Gaussian cluster and Poisson noise, this takes the form:

2 \log B \approx \mathrm{BIC} = 2 \log L(\hat{\theta}, \hat{\pi}) + 2n \log A - 6 \log n,

where \hat{\theta} and \hat{\pi} are the maximum likelihood estimators of \theta and \pi, and L(\hat{\theta}, \hat{\pi}) is the maximized mixture likelihood.
A review of the use of the BIC criterion for model selection, and more specifically for choosing the number of clusters in a data set, can be found in Fraley and Raftery (1998).
An application of mixture modeling and the BIC criterion to gamma-ray burst data can be found in Mukherjee et al. (1998). So far around 800 observations have been assessed, but as greater numbers become available we will find the inherent number of clusters in a similar way, in order to try to understand more about the complex phenomenon of gamma-ray bursts.
10.2 Application to Thresholding
Consider an image, or a planar or 3-dimensional set of object positions. For simplicity we consider the case of setting a single threshold in the image intensities, or the point set's spatial density.
We deal with a combined mixture density of two univariate Gaussian distributions f_k(x; \theta) \sim N(\mu_k, \sigma_k). The overall population thus has the mixture density

f(x; \theta) = \sum_{k=1}^{2} \pi_k f_k(x; \theta)

where the mixing or prior probabilities, \pi_k, sum to 1.
When the mixing proportions are assumed equal, the log-likelihood takes the form

l(\theta) = \sum_{i=1}^{n} \ln \left[ \sum_{k=1}^{2} \frac{1}{\sqrt{2\pi\sigma_k}} \exp\left( -\frac{1}{2\sigma_k} (x_i - \mu_k)^2 \right) \right]

with \sigma_k here denoting the variance of component k.
The EM algorithm is then used to iteratively solve this (see Celeux and Govaert, 1995). This method is used for appraisals of textile (jeans and other fabrics) fault detection in Campbell et al. (1999). Industrial vision inspection systems potentially produce large data streams, and fault detection can be a good application for fast clustering methods. We are currently using a mixture model of this sort on SEM (scanning electron microscope) images of cross-sections of concrete to allow for subsequent characterization of physical properties.
Image segmentation, per se, is a relatively straightforward application, but there are novel and interesting aspects to the two studies mentioned. In the textile case, the faults are very often perceptual and relative, rather than "absolute" or capable of being analyzed in isolation. In the SEM imaging case, a first phase of processing is applied to de-speckle the images, using multiple resolution noise filtering.
Turning from concrete to cosmology, the Sloan Digital Sky Survey (SDSS, 1999) is producing a sky map of more than 100 million objects, together with 3-dimensional information (redshifts) for a million galaxies. Pelleg and Moore (1999) describe mixture modeling, using a k-D tree preprocessing to expedite the finding of the class (mixture) parameters, e.g. means, covariances.
11 Noise Modeling

In Starck et al. (1998) and in a wide range of papers, we have pursued an approach for the noise modeling of observed data. A multiple resolution scale vision model, or data generation process, is used, to allow for the phenomenon being observed on different scales. In addition, a wide range of options are permitted for the data generation transfer path, including additive and multiplicative, stationary and non-stationary, Gaussian ("readout" noise), Poisson (random shot noise), and so on.
Given point pattern clustering in two- or three-dimensional spaces, we will limit our overview here to the Poisson noise case.
11.1 Poisson noise with few events using the à trous transform
If a wavelet coefficient w_j(x, y) is due to noise, it can be considered as a realization of the sum \sum_{k \in K} n_k of independent random variables with the same distribution as that of the wavelet function (n_k being the number of events used for the calculation of w_j(x, y)). This allows comparison of the wavelet coefficients of the data with the values which can be taken by the sum of n independent variables.
The distribution of one event in wavelet space is then directly given by the histogram H_1 of the wavelet function. As we consider independent events, the distribution of a coefficient w_n (note the changed subscripting for w, for convenience) related to n events is given by n autoconvolutions of H_1:

H_n = H_1 \star H_1 \star \cdots \star H_1

For a large number of events, H_n converges to a Gaussian.
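The autoconvolution H_n can be computed directly by repeated discrete convolution; a toy sketch (the example single-event histogram is ours):

```python
import numpy as np

def autoconvolve(h1, n):
    """Distribution of the sum of n i.i.d. variables with pmf h1,
    obtained by n-1 discrete convolutions of h1 with itself."""
    h = h1
    for _ in range(n - 1):
        h = np.convolve(h, h1)
    return h

# a toy 'one event' histogram on 3 bins
h1 = np.array([0.25, 0.5, 0.25])
h8 = autoconvolve(h1, 8)        # distribution for n = 8 events
```

The result remains a valid distribution (it sums to 1), and as n grows the shape approaches the Gaussian limit noted above.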
Fig. 8 shows an example of where point pattern clusters (density bumps in this case) are sought, with a great amount of background clutter. Murtagh and Starck (1998) refer to the fact that there is no computational dependence on the number of points (signal or noise) in such a problem, when using a wavelet transform with noise modeling.
Some other alternative approaches will be briefly noted. The Haar transform presents the advantage of its simplicity for modeling Poisson noise. Analytic formulas for wavelet coefficient distributions have been derived by Kolaczyk (1997), and Jammal and Bijaoui (1999). Using a new wavelet transform, the Haar à trous transform, Zheng et al. (1999) appraise a denoising approach for financial data streams, an important preliminary step for subsequent clustering, forecasting, or other processing.
11.2 Poisson noise with nearest neighbor clutter removal
The wavelet approach is certainly appropriate when the wavelet function reflects the type of object sought (e.g. isotropic), and when superimposed point patterns are to be analyzed. However, non-superimposed point patterns of complex shape are very well treated by the approach described in Byers and Raftery (1998). Using a homogeneous Poisson noise model, they derive the distribution of the distance of a point to its kth nearest neighbor.
Next, Byers and Raftery (1998) consider the case of a Poisson process which is signal, superimposed on a Poisson process which is clutter. The kth nearest neighbor distances are modeled as a mixture distribution: a histogram of these, for given k, will yield a bimodal distribution if our assumption is correct. This mixture distribution problem is solved using the EM algorithm. Generalization to higher dimensions, e.g. 10, is also discussed.
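A sketch of this scheme, under our own simplifying assumptions (Euclidean data, and a two-component Gaussian rather than the exact kth-NN distance distributions): compute each point's kth nearest neighbor distance, fit a two-component univariate mixture to these distances by EM, and keep the points assigned to the small-distance component. All names and the toy data are ours:

```python
import numpy as np

def kth_nn_dist(X, k):
    """Distance from each point to its k-th nearest neighbor."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    d.sort(axis=1)
    return d[:, k]                  # column 0 is the zero self-distance

def two_gaussian_em(v, n_iter=100):
    """Fit a two-component univariate Gaussian mixture to v by EM;
    return the posterior probability of the low-mean component."""
    mu = np.array([v.min(), v.max()])
    var = np.array([v.var(), v.var()]) + 1e-12
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        f = np.exp(-0.5 * (v[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        z = pi * f
        z /= z.sum(axis=1, keepdims=True)   # posterior memberships
        pi = z.mean(axis=0)
        mu = (z * v[:, None]).sum(axis=0) / z.sum(axis=0)
        var = (z * (v[:, None] - mu) ** 2).sum(axis=0) / z.sum(axis=0) + 1e-12
    return z[:, 0]

rng = np.random.default_rng(2)
signal = rng.normal(0.0, 0.05, (100, 2))     # one dense cluster
clutter = rng.uniform(-5.0, 5.0, (100, 2))   # homogeneous Poisson clutter
X = np.vstack([signal, clutter])
p_signal = two_gaussian_em(kth_nn_dist(X, k=5))
keep = p_signal > 0.5
```

Signal points sit in dense regions and so have small kth-NN distances; clutter points have large ones, giving the bimodal histogram that the mixture fit separates.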
Similar data was analyzed by noise modeling and a Voronoi tessellation preprocessing of the data in Allard and Fraley (1997). It is pointed out there how this can be a very useful approach where the Voronoi tiles have meaning in relation to the morphology of the point patterns. However, it does not scale well to higher dimensions, and the statistical noise modeling is approximate. Voronoi tessellation has also been used for astronomical X-ray object detection and characterization.
12 Cluster-Based User Interfaces
Information retrieval by means of "semantic road maps" was first detailed by Doyle (1961). The spatial metaphor is a powerful one in human information processing. The spatial metaphor also lends itself well to modern distributed computing environments such as the web. The Kohonen self-organizing feature map (SOM) method is an effective means towards this end of a visual information retrieval user interface. We will also provide an illustration of web-based semantic maps based on hyperlink clustering.
The Kohonen map is, at heart, k-means clustering with the additional constraint that cluster centers be located on a regular grid (or some other topographic structure) and furthermore that their location on the grid be monotonically related to pairwise proximity (Murtagh and Hernandez-Pajares, 1995). The nice thing about a regular grid output representation space is that it lends itself well as a visual user interface.
Fig. 9 shows a visual and interactive user interface map, using a Kohonen self-organizing feature map (SOM). Color is related to density of document clusters located at regularly-spaced nodes of the map, and some of these nodes/clusters are annotated. The map is installed as a clickable imagemap, with CGI programs accessing lists of documents and, through further links, in many cases the full documents. In the example shown, the user has queried a node and results are seen in the right-hand panel. Such maps are maintained for (currently) 12000 articles from the Astrophysical Journal, 7000 from Astronomy and Astrophysics, over 2000 astronomical catalogs, and other data holdings. More information on the design of this visual interface and user assessment can be found in Poincot et al. (1998, 1999).
Guillaume (Guillaume and Murtagh, 1999) developed a Java-based visualization tool for hyperlink-based data, consisting of astronomers, astronomical object names, article titles, and with the possibility of other objects (images, tables, etc.). Through weighting, the various types of links could be prioritized. An iterative refinement algorithm was developed to map the nodes (objects) to a regular grid of cells, which, as for the Kohonen SOM map, are clickable and provide access to the data represented by the cluster. Fig. 10 shows an example for an astronomer (Prof. Jean Heyvaerts, Strasbourg Astronomical Observatory).
These new cluster-based visual user interfaces are not computationally demanding. They are not, however, scalable in their current implementation. Document management (see e.g. Cartia, 1999) is less the motivation here than the interactive user interface.
13 Images from Data
It is quite impressive how 2D (or 3D) image signals can handle with ease the scalability limitations of clustering and many other data processing operations. The contiguity imposed on adjacent pixels bypasses the need for nearest neighbor finding. It is very interesting therefore to consider the feasibility of taking problems of clustering massive data sets into the 2D image domain. We will look at a few recent examples of work in this direction.
Church and Helfman (1993) address the problem of visualizing possibly millions of lines of computer program code, or text. They consider an approach borrowed from DNA sequence analysis. The data sequence is tokenized by splitting it into its atoms (line, word, character, etc.) and then placing a dot at position (i, j) if the ith input token is the same as the jth. The resulting dotplot, it is argued, is not limited by the available display screen space, and can lead to discovery of large-scale structure in the data.
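The dotplot construction itself is a short comprehension over tokens; a toy sketch (pure Python, names and example ours):

```python
def dotplot(tokens):
    """Place a dot at (i, j) whenever token i equals token j; repeated
    lines or words then show up as diagonals and blocks."""
    n = len(tokens)
    return [[1 if tokens[i] == tokens[j] else 0 for j in range(n)]
            for i in range(n)]

# word-level tokenization; 'the' recurs, giving off-diagonal dots
tokens = "the cat sat on the mat".split()
plot = dotplot(tokens)
```

For millions of tokens one would of course store only the nonzero dots (a sparse representation) rather than the full n x n array.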
When data do not have a sequence, we have an invariance problem, which can be resolved by finding some row and column permutation which pulls large array values together, perhaps towards the diagonal. Gathering larger (or nonzero) array elements to the diagonal can be viewed in terms of minimizing the envelope of nonzero values relative to the diagonal. This can be formulated and solved in purely symbolic terms by reordering vertices in a suitable graph representation of the matrix. A widely-used method for symmetric sparse matrices is the Reverse Cuthill-McKee (RCM) method.
The complexity of the RCM method for ordering rows or columns is proportional to the product of the maximum degree of any vertex in the graph representing the array values and the total number of edges (nonzeroes in the matrix). For hypertext matrices with small maximum degree, the method would be extremely fast. The strength of the method is its low time complexity, but it does suffer from certain drawbacks. The heuristic for finding the starting vertex is influenced by the initial numbering of vertices, and so the quality of the reordering can vary slightly for the same problem for different initial numberings. Next, the overall method does not accommodate dense rows (e.g., a common link used in every document), and if a row has a significantly large number of nonzeroes it might be best to process it separately; i.e., extract the dense rows, reorder the remaining matrix and augment it by the dense rows (or common links) numbered last. Elapsed CPU times for a range of arrays and permuting methods are given in Berry et al. (1996), and as an indication show performances between 0.025 and 3.18 seconds for permuting a 4000 x 400 array. A review of public domain software for carrying out SVD and other linear algebra operations on large sparse data sets can be found in Berry et al. (1999, section 8.3).
Once we have a sequence-respecting array, we can immediately apply efficient visualization techniques from image analysis. Murtagh et al. (2000) investigate the use of noise filtering (i.e. to remove less useful array entries) using a multiscale wavelet transform approach.
An example follows. From the Concise Columbia Encyclopedia (1989 2nd ed., online version) a set of data relating to 12025 encyclopedia entries and to 9778 cross-references or links was used. Fig. 11 shows a 500 x 450 subarray, based on a correspondence analysis (i.e. ordering of projections on the first factor).
This part of the encyclopedia data was filtered using the wavelet and noise-modeling methodology described in Murtagh et al. (2000), and the outcome is shown in Fig. 12. Overall the recovery of the more apparent alignments, and hence visually stronger clusters, is excellent. The first relatively long "horizontal bar" was selected; it corresponds to column index (link) 1733 = geological era. The corresponding row indices (articles) are, in sequence:
SILURIAN PERIOD
PLEISTOCENE EPOCH
HOLOCENE EPOCH
PRECAMBRIAN TIME
CARBONIFEROUS PERIOD
OLIGOCENE EPOCH
ORDOVICIAN PERIOD
TRIASSIC PERIOD
CENOZOIC ERA
PALEOCENE EPOCH
MIOCENE EPOCH
DEVONIAN PERIOD
PALEOZOIC ERA
JURASSIC PERIOD
MESOZOIC ERA
CAMBRIAN PERIOD
PLIOCENE EPOCH
CRETACEOUS PERIOD
The work described here is based on a number of technologies: (i) data visualization techniques; (ii) the wavelet transform, which has linear computational cost in terms of image row and column dimensions, and is independent of the pixel values.
14 Conclusion
Viewed from a commercial or managerial perspective, one could justifiably ask where we are now in our understanding of problems in this area, relative to where we were back in the 1960s. Depending on our answer to this, we may well proceed to a second question: why have all important problems not been solved by now in this area, i.e. are there major outstanding problems to be solved?
As described in this chapter, a solid body of experimental and theoretical results has been built up over the last few decades. Clustering remains a requirement which is a central infrastructural element of very many application fields.
There is continual renewal of the essential questions and problems of clustering, relating to new data, new information, and new environments. There is no logjam in clustering research and development simply because the rivers of problems continue to broaden and deepen. Clustering and classification remain quintessential issues in our computing and information technology environments (Murtagh, 1998).
Acknowledgements
Some of this work, in particular in sections 9 to 12, represents various collaborations with the following: J.L. Starck, CEA; A. Raftery, University of Washington; C. Fraley, University of Washington and MathSoft Inc.; D. Washington, MathSoft, Inc.; Ph. Poincot and S. Lesteven, Strasbourg Observatory; D. Guillaume, Strasbourg Observatory, University of Illinois and NCSA.
References
1. Allard, D. and Fraley, C., "Non-parametric maximum likelihood estimation of features in spatial point processes using Voronoi tessellation", Journal of the American Statistical Association, 92, 1485-1493, 1997.
2. Arabie, P. and Hubert, L.J., "An overview of combinatorial data analysis", in Arabie, P., Hubert, L.J. and De Soete, G., Eds., Clustering and Classification, World Scientific, Singapore, 5-63, 1996.
3. Arabie, P., Hubert, L.J. and De Soete, G., Eds., Clustering and Classification, World Scientific, Singapore, 1996.
4. Banerjee, S. and Rosenfeld, A., "Model-based cluster analysis", Pattern Recognition, 26, 963-974, 1993.
5. Banfield, J.D. and Raftery, A.E., "Model-based Gaussian and non-Gaussian clustering", Biometrics, 49, 803-821, 1993.
6. Bennett, K.P., Fayyad, U. and Geiger, D., "Density-based indexing for approximate nearest neighbor queries", Microsoft Research Technical Report MSR-TR-98-58, 1999.
7. Bentley, J.L. and Friedman, J.H., "Fast algorithms for constructing minimal spanning trees in coordinate spaces", IEEE Transactions on Computers, C-27, 97-105, 1978.
8. Bentley, J.L., Weide, B.W. and Yao, A.C., "Optimal expected time algorithms for closest point problems", ACM Transactions on Mathematical Software, 6, 563-580, 1980.
browsing hypertext", in Renegar, J., Shub, M. and Smale, S., Eds., Lectures in Applied Mathematics (LAM) Vol. 32: The Mathematics of Numerical Analysis, American Mathematical Society, 99-123, 1996.
10. Berry, M.W., Drmac, Z. and Jessup, E.R., "Matrices, vector spaces, and information retrieval", SIAM Review, 41, 335-362, 1999.
11. Beyer, K., Goldstein, J., Ramakrishnan, R. and Shaft, U., "When is nearest neighbor meaningful?", in Proceedings of the 7th International Conference on Database Theory (ICDT), Jerusalem, Israel, 1999, in press.
12. Borodin, A., Ostrovsky, R. and Rabani, Y., "Subquadratic approximation algorithms for clustering problems in high dimensional spaces", Proc. 31st ACM Symposium on Theory of Computing (STOC-99), 1999.
13. Broder, A.J., "Strategies for efficient incremental nearest neighbor search", Pattern Recognition, 23, 171-178, 1990.
14. Broder, A.Z., Glassman, S.C., Manasse, M.S. and Zweig, G., "Syntactic clustering of the web", Proc. Sixth International World Wide Web Conference, 391-404, 1997.
15. Broder, A.Z., "On the resemblance and containment of documents", in Compression and Complexity of Sequences (SEQUENCES'97), 21-29, IEEE Computer Society, 1998.
16. Bruynooghe, M., "Méthodes nouvelles en classification automatique des données taxinomiques nombreuses", Statistique et Analyse des Données, no. 3, 24-42, 1977.
17. Burkhard, W.A. and Keller, R.M., "Some approaches to best-match file searching", Communications of the ACM, 16, 230-236, 1973.
18. Byers, S.D. and Raftery, A.E., "Nearest neighbor clutter removal for estimating features in spatial point processes", Journal of the American Statistical Association, 93, 577-584, 1998.
19. Campbell, J.G., Fraley, C., Stanford, D., Murtagh, F. and Raftery, A.E., "Model-based methods for textile fault detection", International Journal of Imaging Science and Technology, 10, 339-346, 1999.
20. Cartia, Inc., Mapping the Information Landscape, client-server software system, http://www.cartia.com, 1999.
21. Celeux, G. and Govaert, G., "Gaussian parsimonious clustering models", Pattern Recognition, 28, 781-793, 1995.
22. Cheriton, D. and Tarjan, R.E., "Finding minimum spanning trees", SIAM Journal on Computing, 5, 724-742, 1976.
23. Church, K.W. and Helfman, J.I., "Dotplot: a program for exploring self-similarity in millions of lines of text and code", Journal of Computational and Graphical Statistics, 2, 153-174, 1993.
24. Croft, W.B., "Clustering large files of documents using the single-link method", Journal of the American Society for Information Science, 28, 341-344, 1977.
25. Darken, C. and Moody, J., "Note on learning rate schedules for stochastic optimization", Advances in Neural Information Processing Systems 3, Lippmann, Moody and Touretzky, Eds., Morgan Kaufmann, Palo Alto, 1991.
search", Neural Networks for Signal Processing 2, Proceedings of the 1992 IEEE Workshop, IEEE Press, Piscataway, 1992.
27. Darken, C. and Moody, J., "Towards faster stochastic gradient search", Advances in Neural Information Processing Systems 4, Moody, Hanson and Lippmann, Eds., Morgan Kaufmann, San Mateo, 1992.
28. Dasarathy, B.V., Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, New York, 1991.
29. Dasgupta, A. and Raftery, A.E., "Detecting features in spatial point processes with clutter via model-based clustering", Journal of the American Statistical Association, 93, 294-302, 1998.
30. Day, W.H.E. and Edelsbrunner, H., "Efficient algorithms for agglomerative hierarchical clustering methods", Journal of Classification, 1, 7-24, 1984.
31. Defays, D., "An efficient algorithm for a complete link method", Computer Journal, 20, 364-366, 1977.
32. Delannoy, C., "Un algorithme rapide de recherche de plus proches voisins", RAIRO Informatique/Computer Science, 14, 275-286, 1980.
33. Dempster, A.P., Laird, N.M. and Rubin, D.B., "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society, Series B, 39, 1-22, 1977.
34. Dobrzycki, A., Ebeling, H., Glotfelty, K., Freeman, P., Damiani, F., Elvis, M. and Calderwood, T., Chandra Detect 1.0 User Guide, Chandra X-Ray Center, Smithsonian Astrophysical Observatory, Version 0.9, 1999.
35. Doyle, L.B., "Semantic road maps for literature searchers", Journal of the ACM, 8, 553-578, 1961.
36. Eastman, C.M. and Weiss, S.F., "Tree structures for high dimensionality nearest neighbor searching", Information Systems, 7, 115-122, 1982.
37. Ebeling, H. and Wiedenmann, G., "Detecting structure in two dimensions combining Voronoi tessellation and percolation", Physical Review E, 47, 704-714, 1993.
38. Forgy, E., "Cluster analysis of multivariate data: efficiency vs. interpretability of classifications", Biometrics, 21, 768, 1965.
39. Fraley, C., "Algorithms for model-based Gaussian hierarchical clustering", SIAM Journal of Scientific Computing, 20, 270-281, 1999.
40. Fraley, C. and Raftery, A.E., "How many clusters? Which clustering method? Answers via model-based cluster analysis", The Computer Journal, 41, 578-588, 1998.
41. Friedman, J.H., Baskett, F. and Shustek, L.J., "An algorithm for finding nearest neighbors", IEEE Transactions on Computers, C-24, 1000-1006, 1975.
42. Friedman, J.H., Bentley, J.L. and Finkel, R.A., "An algorithm for finding best matches in logarithmic expected time", ACM Transactions on Mathematical Software, 3, 209-226, 1977.
43. Fritzke, B., "Some competitive learning methods", http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/JavaPaper