Introduction to Clustering
Yumi Kondo
Student Seminar LSK301
Sep 25, 2010
Outline
Microarray Example
N=65
P=1756
Clustering
The data set $\{x_{ij}\}$, $i = 1, \dots, N$, $j = 1, \dots, P$, consists of $P$ features measured
on $N$ independent observations.
Clustering
Clustering algorithms seek to assign $N$ observations in $P$-dimensional space, labeled $x_1, \dots, x_N$,
to one of $K$ groups, based on some similarity measure.
"Unsupervised" learning: the problem of finding groups in data
without the help of a response variable
No right or wrong partition
What is a similarity measure?
$x_i$ and $x_{i'}$ are observation vectors in $P$ dimensions. Some examples:
Euclidean distance: $\|x_i - x_{i'}\|^2$
[Figure: scatter plot with axes n[,1] and n[,2], both running from 0 to 5]
Absolute distance: $\|x_i - x_{i'}\|_1^2$
Correlation, with $d = 1 - \text{correlation}$
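As a rough illustration of the three measures on this slide, here is a minimal Python sketch. It assumes the norms are squared, as the slide's formulas appear to be written (squaring does not change which pairs are closest), and it takes "correlation" to mean the Pearson correlation across the components of two observation vectors. The function names are illustrative and not from the talk.

```python
import math

def sq_euclidean(x, y):
    """Squared Euclidean distance ||x - y||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def sq_absolute(x, y):
    """Squared absolute (L1) distance ||x - y||_1^2."""
    return sum(abs(a - b) for a, b in zip(x, y)) ** 2

def correlation_dissimilarity(x, y):
    """d = 1 - Pearson correlation between the components of x and y."""
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

x1 = [1.0, 2.0, 3.0]
x2 = [2.0, 4.0, 6.5]
print(sq_euclidean(x1, x2), sq_absolute(x1, x2), correlation_dissimilarity(x1, x2))
```

Note that the correlation-based measure ignores location and scale: two vectors that are far apart in Euclidean terms can still have a dissimilarity near 0.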
Outline what is K-means?
Clustering method
Hierarchical Clustering
Non-hierarchical Clustering
-K-means
Hierarchical Clustering
It produces a dendrogram that represents a nested set of clusters:
depending on where the dendrogram is cut, between 1 and N clusters
can result.
Cool Microarray example
http://genome-www.stanford.edu/breast_cancer/molecularportraits/download.shtml
Hierarchical Clustering
PRO
Nice tree! (dendrogram)
Visualize the different levels of similarity between observations.
CON
computationally expensive!
Non-hierarchical clustering, K-means
K-means with Euclidean distance as a similarity measure
The solution of K-means clustering is the partition such that
$$\min_{C_1, \dots, C_K} WSS = \min_{C_1, \dots, C_K} \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i, i' \in C_k} \|x_i - x_{i'}\|^2$$
-white board
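Note:
$$WSS = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i, i' \in C_k} \|x_i - x_{i'}\|^2$$
$$= \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i \in C_k} \sum_{i' \in C_k} \|x_i - \bar{x}_k + \bar{x}_k - x_{i'}\|^2$$
$$= \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \bar{x}_k\|^2$$
The cross term vanishes because $\sum_{i \in C_k} (x_i - \bar{x}_k) = 0$, where $\bar{x}_k$ is the mean of cluster $C_k$ and $n_k$ is its size.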
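The identity in this note can be checked numerically. The sketch below is only an illustration on a made-up two-cluster toy data set: it computes WSS once from all ordered pairs within each cluster (with the $1/(2 n_k)$ factor) and once from distances to the cluster means. The function names are hypothetical.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def wss_pairwise(clusters):
    """Sum over clusters of (1 / (2 n_k)) * sum over all ordered pairs (i, i')."""
    total = 0.0
    for pts in clusters:
        n_k = len(pts)
        pair_sum = sum(sq_dist(a, b) for a in pts for b in pts)
        total += pair_sum / (2 * n_k)
    return total

def wss_centroid(clusters):
    """Sum over clusters of squared distances to the cluster mean."""
    total = 0.0
    for pts in clusters:
        n_k = len(pts)
        mean = [sum(col) / n_k for col in zip(*pts)]
        total += sum(sq_dist(p, mean) for p in pts)
    return total

# Toy data (made up for illustration): two clusters in the plane.
clusters = [
    [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)],
    [(5.0, 5.0), (7.0, 9.0)],
]
print(wss_pairwise(clusters), wss_centroid(clusters))  # both forms give 18.0
```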
Algorithm for K-means
Step 1 and step 2 are iterated until convergence.
Step 1. Given the cluster assignment $C_1, \dots, C_K$, cluster centroids are
calculated as
$$\hat{\mu}_k = \frac{\sum_{i \in C_k} x_i}{n_k}, \quad k = 1, \dots, K$$
Step 2. Given the cluster centroids, the objective function is minimized
by assigning each observation to the closest cluster mean.
$$I_i = \operatorname{argmin}_{1 \le k \le K} \|x_i - \hat{\mu}_k\|^2$$
white board
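The two steps can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the talk: the slides do not say how the iteration is started, so the caller supplies hypothetical starting centroids, and the loop begins with the assignment step.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, init_centroids, max_iter=100):
    """Alternate Step 1 (centroids) and Step 2 (assignment) until labels stop changing.

    init_centroids is an assumed starting guess; the slides do not specify one.
    """
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to the closest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments did not change
        labels = new_labels
        # Step 1: recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, init_centroids=[points[0], points[3]])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Different starting centroids can lead to different local minima of WSS, so in practice the procedure is usually run from several starts.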
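Correlation as similarity measurement in K-means
It is not so easy to create an algorithm when the similarity
measure is correlation. No simple analytic form for the cluster
centroid :(
Data transformation approach
1. Normalize the observation vector: $\tilde{x}_i = \frac{x_i - \bar{x}}{\sqrt{\|x_i - \bar{x}\|^2 / P}}$, where $\bar{x}$ is the mean of the $P$ components of $x_i$
2. $\|\tilde{x}_i - \tilde{y}_i\|^2 \propto d_\rho(x, y)$, where $d_\rho(x, y) = 1 - \rho_{x,y}$, because
$$\|\tilde{x}_i - \tilde{y}_i\|^2 = \left\| \frac{x_i - \bar{x}}{\sqrt{\|x_i - \bar{x}\|^2 / P}} - \frac{y_i - \bar{y}}{\sqrt{\|y_i - \bar{y}\|^2 / P}} \right\|^2$$
$$= P \frac{\|x_i - \bar{x}\|^2}{\|x_i - \bar{x}\|^2} - 2P \frac{(x_i - \bar{x})'(y_i - \bar{y})}{\|x_i - \bar{x}\| \, \|y_i - \bar{y}\|} + P \frac{\|y_i - \bar{y}\|^2}{\|y_i - \bar{y}\|^2}$$
$$= 2P - 2P \frac{(x_i - \bar{x})'(y_i - \bar{y})}{\|x_i - \bar{x}\| \, \|y_i - \bar{y}\|} = 2P \, d_\rho(x, y)$$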
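A quick numerical check of the data transformation approach, as a sketch under the assumption that each observation is centered by the mean of its own components and scaled so that its squared norm equals the number of components. The two vectors are made up for illustration.

```python
import math

def standardize(x):
    """Center x by its own mean and scale so that ||x_tilde||^2 = P."""
    p = len(x)
    mean = sum(x) / p
    centered = [a - mean for a in x]
    scale = math.sqrt(sum(c * c for c in centered) / p)
    return [c / scale for c in centered]

def correlation(x, y):
    """Pearson correlation between the components of x and y."""
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

x = [1.0, 3.0, 2.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0]
xt, yt = standardize(x), standardize(y)
lhs = sum((a - b) ** 2 for a, b in zip(xt, yt))   # squared Euclidean distance after transformation
rhs = 2 * len(x) * (1 - correlation(x, y))        # 2P * (1 - correlation)
print(lhs, rhs)
```

Because the two quantities agree, ordinary K-means with squared Euclidean distance on the transformed observations behaves like K-means with the correlation-based dissimilarity on the original ones.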
Non-hierarchical Method K-means
Drawbacks of K-means
No pretty tree
The number of clusters must be known in advance!
Not robust
Outline The number of cluster must be pre-known BUT HOW?
"K" must be known in advance, but how?
GAP statistics, Tibshirani et al (2001)
Clest
GAP statistics
Idea behind GAP statistics
Find $\hat{k}$ such that $WSS_k$ shows an elbow decline
-cool example in R
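Definition
$$GAP(k) = \hat{E}_{null}(\log(WSS(k))) - \log(WSS(k))$$
$$\hat{k} = \text{the smallest } k \text{ such that } GAP(k) \ge GAP(k+1) - s_{k+1}$$
Standardize the graph of $\log(WSS(k))$ by comparing it with its
expectation under an appropriate null reference distribution of the
data:
$$\hat{E}_{null}(\log(WSS(k))) = \frac{1}{B} \sum_{b=1}^{B} \log(WSS(k)_b)$$
Here $WSS(k)_b$ is computed on the $b$-th of $B$ reference data sets, and $s_{k+1}$ is the standard deviation of the $B$ values $\log(WSS(k+1)_b)$ multiplied by $\sqrt{1 + 1/B}$ (Tibshirani et al., 2001).
-another cool one in R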
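The definition can be turned into a short Python sketch. Several choices here are assumptions made for brevity, not the talk's implementation: the reference data are drawn uniformly over the range of each original feature (the simpler of the reference distributions in Tibshirani et al.), K-means is restarted a few times from random observations, and the names `kmeans_wss` and `gap_statistic` are hypothetical.

```python
import math
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans_wss(points, k, rng, restarts=5, max_iter=50):
    """Smallest WSS found over a few randomly initialized K-means runs."""
    best = float("inf")
    for _ in range(restarts):
        centroids = [list(p) for p in rng.sample(points, k)]
        labels = None
        for _ in range(max_iter):
            new = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
            if new == labels:
                break
            labels = new
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:  # an empty cluster keeps its previous centroid
                    centroids[j] = [sum(c) / len(members) for c in zip(*members)]
        wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        best = min(best, wss)
    return best

def gap_statistic(points, k_max, B=10, seed=0):
    """Return (k_hat, [GAP(1), ..., GAP(k_max)])."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_wss(points, k, rng))
        ref = []
        for _ in range(B):
            # Assumed reference: uniform over the range of each feature.
            z = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in points]
            ref.append(math.log(kmeans_wss(z, k, rng)))
        mean_ref = sum(ref) / B
        sd = math.sqrt(sum((r - mean_ref) ** 2 for r in ref) / B)
        gaps.append(mean_ref - log_w)
        s.append(sd * math.sqrt(1 + 1 / B))
    # Smallest k with GAP(k) >= GAP(k+1) - s_{k+1}; fall back to k_max.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k, gaps
    return k_max, gaps

# Made-up example: two Gaussian blobs in the plane.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(20)]
data += [(data_rng.gauss(5, 0.5), data_rng.gauss(5, 0.5)) for _ in range(20)]
k_hat, gaps = gap_statistic(data, k_max=4, B=5)
print(k_hat, [round(g, 2) for g in gaps])
```

The result depends on the random seed and on `B`; a larger `B` reduces the simulation error that $s_{k+1}$ is meant to account for.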
Reference distribution: K=1
Generate the reference features from a uniform distribution over a
box aligned with the principal components of the data.
Orthogonally diagonalize $S_X = PDP'$.
$D$ is a diagonal matrix with the eigenvalues $\lambda_1, \dots, \lambda_p$ of $S_X$ on the diagonal, arranged so that $\lambda_1 \ge \dots \ge \lambda_p \ge 0$,
and $P$ is an orthogonal matrix whose columns are the corresponding unit eigenvectors $u_1, \dots, u_p$.
Transform via $X^* = XP$. Then $S_{X^*} = P'S_XP = P'PDP'P = D$.
The transformed data is no longer correlated.
Draw uniform features $Z^*$ over the range of the columns of $X^*$.
Finally we back-transform via $Z = Z^*P'$ to give the reference data set $Z$.
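To make these steps concrete without a linear algebra library, the sketch below handles only the two-feature case, where the eigenvectors of the 2x2 covariance matrix have a closed form (a rotation by the principal-axis angle). This restriction, the toy data, and all function names are assumptions for illustration; a general implementation would use an eigendecomposition or SVD routine.

```python
import math
import random

def covariance_2d(points):
    """Sample variances and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def principal_rotation(points):
    """Return (c, s) so that the columns of P = [[c, -s], [s, c]] are unit eigenvectors."""
    sxx, sxy, syy = covariance_2d(points)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the first principal axis
    return math.cos(theta), math.sin(theta)

def to_pc(points, c, s):
    """X* = X P: coordinates along the principal axes."""
    return [(x * c + y * s, -x * s + y * c) for x, y in points]

def from_pc(points, c, s):
    """Z = Z* P': back to the original coordinates."""
    return [(u * c - v * s, u * s + v * c) for u, v in points]

def reference_sample(points, rng):
    """Uniform draw over the box aligned with the principal components."""
    c, s = principal_rotation(points)
    x_star = to_pc(points, c, s)
    lo = [min(col) for col in zip(*x_star)]
    hi = [max(col) for col in zip(*x_star)]
    z_star = [(rng.uniform(lo[0], hi[0]), rng.uniform(lo[1], hi[1])) for _ in points]
    return from_pc(z_star, c, s)

# Made-up correlated data lying roughly along a diagonal line.
pts = [(0.0, 0.0), (1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8)]
c, s = principal_rotation(pts)
print(covariance_2d(to_pc(pts, c, s))[1])   # covariance of X* columns, close to 0
print(reference_sample(pts, random.Random(0)))
```

Aligning the box with the principal axes keeps the reference points inside a region shaped like the data; a box aligned with the original axes would cover much empty space for strongly correlated features.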
Definition
$$GAP(k) = \hat{E}_{null}(\log(WSS(k))) - \log(WSS(k))$$
$$\hat{k} = \text{the smallest } k \text{ such that } GAP(k) \ge GAP(k+1) - s_{k+1}$$
$$\hat{E}_{null}(\log(WSS(k))) = \frac{1}{B} \sum_{b=1}^{B} \log(WSS(k)_b)$$
Standardize the graph of $\log(WSS(k))$ by comparing it with its
expectation under an appropriate null reference distribution of the
data.
Prediction-based resampling method, Clest
Clest returns the K that has the most stable predictability in the clustering
procedure.
Algorithm
For each K, repeat the following process B times for the data set and the
reference data set:
partition the data set into a learning set and a testing set
perform K-means on the learning set, return classifiers
classify the testing set by the classifiers, return $C_{1,\text{classifier}}, \dots, C_{K,\text{classifier}}$
classify the testing set by K-means, return $C_{1,\text{K-means}}, \dots, C_{K,\text{K-means}}$
measure the similarity of the two partitions $C_{1,\text{classifier}}, \dots, C_{K,\text{classifier}}$ and $C_{1,\text{K-means}}, \dots, C_{K,\text{K-means}}$
Compute $S(k, \text{cluster labels 1}, \text{cluster labels 2})$
Repeat this process B times for each K and obtain the average
of the measure, $S_k$
Repeat the algorithm for the reference dataset B times and obtain $S_k^0$
Obtain the standardized similarity measure $d_k = S_k - S_k^0$ and
$$\hat{K} = \operatorname{argmax}_{k \in \{1, \dots, K\}} d_k$$
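The Clest steps can be sketched end to end in Python. This is a simplified illustration with several assumptions that the slides leave open: the "classifier" is a nearest-centroid rule built from the learning-set K-means, the similarity S is the share of observation pairs on which the two labelings agree, the split is half learning and half testing, and a single reference data set is drawn uniformly over the range of each feature. All function names are hypothetical.

```python
import random
from itertools import combinations

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest(p, centroids):
    return min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))

def kmeans(points, k, rng, max_iter=50):
    """Plain K-means from a random start; returns (labels, centroids)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new = [nearest(p, centroids) for p in points]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

def agreement(a, b):
    """Share of pairs that both labelings put together or both keep apart."""
    pairs = list(combinations(range(len(a)), 2))
    same = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return same / len(pairs)

def mean_similarity(points, k, B, rng):
    """Average agreement over B random learning/testing splits."""
    total = 0.0
    for _ in range(B):
        shuffled = points[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        learn, test = shuffled[:half], shuffled[half:]
        centroids = kmeans(learn, k, rng)[1]
        predicted = [nearest(p, centroids) for p in test]  # labels from the classifier
        clustered = kmeans(test, k, rng)[0]                # labels from K-means on the testing set
        total += agreement(predicted, clustered)
    return total / B

def clest(points, k_max, B=10, seed=0):
    """Return (K_hat, {k: d_k}) with d_k = S_k - S_k^0."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    # Assumed reference data: uniform over the range of each feature.
    reference = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in points]
    d = {}
    for k in range(1, k_max + 1):
        d[k] = mean_similarity(points, k, B, rng) - mean_similarity(reference, k, B, rng)
    return max(d, key=d.get), d

# Made-up example: two Gaussian blobs in the plane.
data_rng = random.Random(2)
data = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(15)]
data += [(data_rng.gauss(5, 0.5), data_rng.gauss(5, 0.5)) for _ in range(15)]
k_hat, d = clest(data, k_max=3, B=5)
print(k_hat, d)
```

With a single cluster every labeling agrees perfectly, so $d_1 = 0$ in this sketch; a larger K is selected only when the data are more predictable than the reference data.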
Similarity measures of two partitions
Let $P$ and $Q$ represent two partitions.
$$CER = \frac{\sum_{i > i'} |I_P(i, i') - I_Q(i, i')|}{\binom{n}{2}}$$
$$I_P(i, i') = \begin{cases} 1 & \text{if } i \text{ and } i' \text{ belong to the same cluster under partition } P \\ 0 & \text{otherwise} \end{cases}$$
$$0 \le CER \le 1$$
CER = 0 means perfect agreement of the two partitions
CER = 1 means complete disagreement of the two partitions
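A minimal Python sketch of CER, assuming each partition is given as a list of cluster labels in the same observation order and that the summand is the absolute difference of the two indicators. The function name is illustrative.

```python
from itertools import combinations

def cer(labels_p, labels_q):
    """Share of observation pairs on which partitions P and Q disagree."""
    n = len(labels_p)
    disagreements = 0
    for i, j in combinations(range(n), 2):
        same_p = labels_p[i] == labels_p[j]   # I_P(i, i')
        same_q = labels_q[i] == labels_q[j]   # I_Q(i, i')
        disagreements += same_p != same_q     # |I_P - I_Q|
    return disagreements / (n * (n - 1) / 2)  # divide by n choose 2

print(cer([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.0: same partition, labels renamed
print(cer([0, 0, 0], [0, 1, 2]))        # 1.0: one cluster versus all singletons
```

Because only co-membership of pairs is compared, CER does not depend on how the clusters are numbered.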
Does Clest outperform GAP statistics?
Tibshirani, Robert, et al. (2001). Estimating the number of clusters in a data
set via the gap statistic.