Machine Learning, STOR 565
Clustering: Overview and Basic Methods
Andrew Nobel
Overview of Clustering
a. The basic problem
General Setting
Given: Objects x1, . . . , xnin feature space X
I Dissimilarity or distance d(xi, xj)between pairs of objects
Goal: Divide x1, . . . , xninto disjoint groups C1, . . . , Ck, called clusters, s.t.
I Objects in same cluster are close together
I Objects in different clusters are far apart
I Number of clusters k is small
Distinction: Clustering is complete if it partitions X and incomplete if it
Clustering: Areas of Application
Genomics, Biology Data Compression Psychology Computer Science
Feature Vectors
Objects x ∈ X typically represented by a feature vector x = (x1, . . . , xp)
where xiis a numerical/categorical measurement of interest:
I xi∈ R numerical feature
Examples
Medicine
I Object = patient
I Feature xi=outcome of a diagnostic test on patient
Microarrays (Genomics) I Object = tissue sample
I Feature xi=measured expression level of gene i in that sample
Data Mining
I Object = consumer
Dissimilarities/Distances Between Feature Vectors
Euclidean d(u, v) =pP
i(ui− vi)2
Manhattan d(u, v) =P
i|ui− vi|
Correlation d(u, v) = 1 − corr(u, v)
Hamming d(u, v) =P
iI{ui6= vi}
Basic Steps in Clustering
Acquisition of Objects x1, . . . , xn
⇓
Selection and Extraction of Features ⇓ Dissimilarity matrix D = {d(xi, xj) : 1 ≤ i, j ≤ n} ⇓ Clustering Algorithm ⇓ Partition π = {C1, . . . , Ck} of x1, . . . , xn.
Some Clustering Methods
Hierarchical: Candidate divisions of data described by a binary tree I Agglomerative (bottom-up)
I Divisive (top-down)
Iterative: Search for local minimum of simple cost function I k-means and variants
I partitioning around medioids, self organizing maps
Model-based: Fit feature vectors with a finite mixture model
Centroids
Definition: The centroid of vectors x1, . . . , xn∈ Rpis their average
c = 1 n n X i=1 xi
The centroid c is the center of mass of the point configuration x1, . . . , xn, and
is an optimal representative for the configuration in the sense that
n X i=1 ||xi− c||2 ≤ n X i=1 ||xi− v||2
Nearest Neighbor Partitions
Definition: Voronoi (nearest neighbor) partition of vectors c1, . . . , ck∈ Rpis
a collection π = {A1, . . . , Ak} where the cell
Aj= {x : ||x − cj|| ≤ ||x − cs|| all s 6= j}
contains vectors that are closer to cjthan any other cs.
Structure of Cells: Note that
Aj=
\
s6=j
{x : ||x − cj|| ≤ ||x − cs||}
is an intersection of half-spaces. Thus it is a polytope in Rpwith at most n − 1
The k-Means Algorithm
Clustering Problem: Divide x1, . . . , xn∈ Rpinto k clusters.
Optimization: Find centers c01, . . . , c0kto minimize sum of squares (SoS) cost
function Cost(c1, . . . , ck) = n X i=1 min 1≤j≤k||xi− cj|| 2
i.e., sum of squared distances from each point to its nearest center. Then the clusters are the Voronoi partition of x1, . . . , xnwith centers c01, . . . , c
0 k
Problem: Exact solution of optimization problem is not computationally
The k-Means Algorithm
Fix in advance
I Number of clusters k
I Initial centers C0= {c1, . . . , ck}
Iterate: For m = 1, 2, . . . do:
I Let πmbe the nearest neighbor (Voronoi) partition of the centers Cm−1.
I Let Cmbe the centroids (averages) of the vectors in each cell of πm
The k-Means Algorithm
Recall: Sum of Squares (SoS) cost function
Cost(c1, . . . , ck) = n X i=1 min 1≤j≤k||xi− cj|| 2
Note: Cost function decreases at each stage of the k-means algorithm. In practice
I Choose multiple initial sets of representative vectors C0= {c1, . . . , ck}
I Run the iterative k-means procedure
I Choose the partition associated with the smallest final cost Example:http://www.onmyphd.com/?p=k-means.clustering.
Features of Clusters
If clusters are present, their features can affect performance of different clustering procedures.
I Spherical or elliptical in shape
I Similar in overall variance/spread
I Similar in size (number of points)
K-Means tends to perform best when clusters are spherical, similar in
Binary Trees
1. Distinguished node called theroot with zero or two children but no parent
2. Every other node has one parent and zero or two children
I Nodes with no children are calledleaves I Nodes with two children are calledinternal
Agglomerative Clustering
Stage 0: Assign each object xito its own cluster
Stage k:
I Find the two closest clusters at stage k − 1
I Combine them into a single cluster
Stop: When all objects xibelong to a single cluster
Output: Binary tree T called a dendrogram
Note: Closeness of clusters C, C0
Distances Between Clusters
Single Linkage ds(C, C 0 ) = min xi∈C, xj∈C0 d(xi, xj) Average Linkage da(C, C 0 ) = 1 |C| |C0| X xi∈C, xj∈C0 d(xi, xj) Total Linkage dt(C, C0) = max xi∈C, xj∈C0 d(xi, xj)Dendrogram
Binary tree associated with the agglomerative clustering procedure: it is a graphical record of the clustering process
Initialize: Each singleton cluster {xi} corresponds to a node at height 0
Update: If two clusters C, C0are combined, their respective nodes are joined
to a parent node at height d(C, C0)
Each node of dendrogram corresponds to a set of objects. Objects associated with two nodes are merged when forming their parent.
I Leaves correspond to individual objects
Dendrogram, cont.
Note: Dendrogram T represents many possible clusterings, one for each
(rooted) subtree.
Methods for selecting a clustering/subtree I Ad hoc selection (by eye)
I “Cutting” dendrogram at fixed level
I Penalized pruning
Visualization of clustering structure
I Order objects in the same way as the leaves of the dendrogram
Cars Data
I Samples: 32 unique cars
I Variables: 11 descriptive variables, including gas mileage, horsepower,
number of cylinders, etc.
Ma se ra ti Bo ra F ord Pa nt era L D ust er 36 0 C ama ro Z 28 C hrysl er Imp eri al C ad ill ac F le et w oo d Li nco ln C on tin en ta l H orn et Sp ort ab ou t Po nt ia c F ire bi rd Me rc 45 0SL C Me rc 45 0SE Me rc 45 0SL Dodge Challenger AMC Ja ve lin H orn et 4 D rive Va lia nt F erra ri D in o H on da C ivi c T oyo ta C oro lla Fiat 128 Fiat X1 -9 Me rc 24 0D Ma zd a R X4 Ma zd a R X4 W ag Me rc 28 0 Me rc 28 0C Lo tu s Eu ro pa Me rc 23 0 D at su n 71 0 Vo lvo 1 42 E T oyo ta C oro na Po rsch e 91 4-2 0 20 40 60 80
Single Linkage Clustering on Cars data
hclust (*, "single")dist(mtcars)
D
ist
an
F erra ri D in o H on da C ivi c T oyo ta C oro lla Fiat 128 Fiat X1 -9 Ma zd a R X4 Ma zd a R X4 W ag Me rc 28 0 Me rc 28 0C Me rc 24 0D Lo tu s Eu ro pa Me rc 23 0 Vo lvo 1 42 E D at su n 71 0 T oyo ta C oro na Po rsch e 91 4-2 Ma se ra ti Bo ra H orn et 4 D rive Va lia nt Me rc 45 0SL C Me rc 45 0SE Me rc 45 0SL Dodge Challenger AMC Ja ve lin C hrysl er Imp eri al C ad ill ac F le et w oo d Li nco ln C on tin en ta l F ord Pa nt era L D ust er 36 0 C ama ro Z 28 H orn et Sp ort ab ou t Po nt ia c F ire bi rd 0 50 100 150 200 250
Average Linkage Clustering on Cars data
hclust (*, "average")dist(mtcars)
D
ist
an
TCGA Data
Gene expression data from The Cancer Genome Atlas (TCGA)
I Samples
I 95 Luminal A breast tumors
I 122 Basal breast tumors
TCGA Data
I Clustered samples (breast tumor subtype)
Important Questions
I What is the right number of clusters?
I What is right measure of distance?
I What is the best clustering method for the data?
I How robust is an observed clustering to small perturbations of the data?