Introduction to Clustering

(1)

Yumi Kondo

Student Seminar LSK301

Sep 25, 2010

(2)

Outline

Microarray Example

N=65

P=1756

(3)

Clustering

The data set $\{x_{ij}\}$, $i = 1, \dots, N$, $j = 1, \dots, P$, consists of $P$ features measured on $N$ independent observations.

Clustering

Clustering algorithms seek to assign $N$ observations in $P$-dimensional space, labeled $x_1, \dots, x_N$, to one of $K$ groups, based on some similarity measure.

"Unsupervised" learning: the problem of finding groups in data without the help of a response variable

No right or wrong partition

(4)

What is a similarity measure?

$x_i$ and $x_{i'}$ are observation vectors in $P$ dimensions. Some examples:

Euclidean distance $\|x_i - x_{i'}\|^2$

[Figure: scatter plot with axes n[,1] and n[,2]]

Absolute distance $\|x_i - x_{i'}\|_1^2$

Correlation with $d = 1 - \text{correlation}$
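The three measures above can be sketched in a few lines of Python. This is only an illustration: the function names and the two example vectors are made up here, and the absolute distance is computed as the plain L1 norm (squaring it would not change which pairs are closer).

```python
import math

def squared_euclidean(x, y):
    """Squared Euclidean distance ||x - y||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def absolute_distance(x, y):
    """Absolute (L1) distance: sum of |x_j - y_j| over the features."""
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_distance(x, y):
    """d = 1 - Pearson correlation computed across the features of x and y."""
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Hypothetical example vectors: x2 is exactly twice x1.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]
print(squared_euclidean(x1, x2))     # 30.0
print(absolute_distance(x1, x2))     # 10.0
print(correlation_distance(x1, x2))  # close to 0: perfectly correlated
```

Note that the two vectors are far apart in Euclidean and absolute distance but essentially identical under the correlation measure, which only looks at the shape of the profile.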

(6)

What is K-means?

Clustering methods

Hierarchical Clustering

Non-hierarchical Clustering

-K-means

(7)

Hierarchical clustering produces a dendrogram that represents a nested set of clusters: depending on where the dendrogram is cut, between 1 and N clusters can result.

Cool Microarray example

http://genome-www.stanford.edu/breast_cancer/molecularportraits/download.shtml

(8)

PRO

Nice tree! (dendrogram)

Visualize the different levels of similarity between observations.

CON

computationally expensive!

(9)

Non-hierarchical clustering, K-means

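K-means with Euclidean distance as the similarity measure

The solution of K-means clustering is the partition that attains

$$\min_{C_1, \dots, C_K} \mathrm{WSS} = \min_{C_1, \dots, C_K} \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i, i' \in C_k} \|x_i - x_{i'}\|^2$$

where $n_k$ is the number of observations in cluster $C_k$.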

-white board

(10)

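Note:

$$
\begin{aligned}
\mathrm{WSS} &= \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i, i' \in C_k} \|x_i - x_{i'}\|^2 \\
&= \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i \in C_k} \sum_{i' \in C_k} \|x_i - \bar{x}_k + \bar{x}_k - x_{i'}\|^2 \\
&= \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \bar{x}_k\|^2
\end{aligned}
$$

where $\bar{x}_k$ is the mean of the observations in $C_k$. The cross terms vanish because $\sum_{i \in C_k} (x_i - \bar{x}_k) = 0$.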
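The identity in this note can be checked numerically. The sketch below computes WSS both ways (pairwise form with the factor 1/(2 n_k) over all ordered pairs, and centroid form) on a small made-up partition; the helper names and the data are assumptions for illustration only.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(cluster):
    """Component-wise mean of the observations in one cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def wss_pairwise(clusters):
    """Sum over clusters of 1/(2 n_k) times the sum over all ordered pairs."""
    total = 0.0
    for c in clusters:
        pair_sum = sum(sq_dist(x, y) for x in c for y in c)
        total += pair_sum / (2 * len(c))
    return total

def wss_centroid(clusters):
    """Sum over clusters of squared distances to the cluster mean."""
    total = 0.0
    for c in clusters:
        m = centroid(c)
        total += sum(sq_dist(x, m) for x in c)
    return total

# Hypothetical partition of five observations into two clusters.
clusters = [
    [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]],
    [[5.0, 5.0], [7.0, 9.0]],
]
print(wss_pairwise(clusters), wss_centroid(clusters))  # both forms agree
```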

(11)

Algorithm for K-means

Step 1 and Step 2 are iterated until convergence.

Step 1. Given the cluster assignment $C_1, \dots, C_K$, the cluster centroids are calculated as

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i \in C_k} x_i, \quad k = 1, \dots, K$$

Step 2. Given the cluster centroids, the objective function is minimized by assigning each observation to the closest cluster mean:

$$I_i = \operatorname*{argmin}_{1 \le k \le K} \|x_i - \hat{\mu}_k\|^2$$

white board
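A minimal sketch of the two alternating steps, in plain Python. The slides do not say how the iteration is started, so this version assumes the caller supplies initial centroids and begins with the assignment step; the function name, the stopping rule (labels unchanged) and the toy data are all illustrative assumptions.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(data, init_centroids, max_iter=100):
    """Alternate Step 2 (assignment) and Step 1 (centroid update) until labels stop changing."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to the closest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in data]
        if new_labels == labels:
            break  # converged: the assignment did not change
        labels = new_labels
        # Step 1: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean_vector(members)
    return labels, centroids

# Hypothetical data: two well-separated groups of three points each.
data = [[0.0, 0.0], [0.5, 0.2], [0.1, 0.4], [5.0, 5.0], [5.2, 4.8], [4.9, 5.3]]
labels, centroids = kmeans(data, init_centroids=[data[0], data[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the result depends on the starting centroids, so several random starts are usually tried and the partition with the smallest WSS is kept.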

(12)

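Correlation as similarity measure in K-means

It is not so easy to create an algorithm when the similarity measure is correlation: there is no simple analytic form for the cluster centroid :(

Data transformation approach

1. Normalize the observation vector: $\tilde{x}_i = \dfrac{x_i - \bar{x}}{\sqrt{\|x_i - \bar{x}\|^2 / P}}$, where $\bar{x}$ is the mean of the $P$ components of $x_i$.

2. Then $\|\tilde{x}_i - \tilde{y}_i\|^2 \propto d_\rho(x, y)$:

$$
\begin{aligned}
\|\tilde{x}_i - \tilde{y}_i\|^2
&= \left\| \frac{x_i - \bar{x}}{\sqrt{\|x_i - \bar{x}\|^2 / P}} - \frac{y_i - \bar{y}}{\sqrt{\|y_i - \bar{y}\|^2 / P}} \right\|^2 \\
&= P \frac{\|x_i - \bar{x}\|^2}{\|x_i - \bar{x}\|^2} - \frac{2P \, (x_i - \bar{x})'(y_i - \bar{y})}{\|x_i - \bar{x}\| \, \|y_i - \bar{y}\|} + P \frac{\|y_i - \bar{y}\|^2}{\|y_i - \bar{y}\|^2} \\
&= 2P - 2P \frac{(x_i - \bar{x})'(y_i - \bar{y})}{\|x_i - \bar{x}\| \, \|y_i - \bar{y}\|} \\
&= 2P \, d_\rho(x, y)
\end{aligned}
$$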
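A quick numerical check of this transformation: after normalizing each observation vector, the squared Euclidean distance should equal 2P times one minus the correlation. The vectors and function names below are made up for illustration.

```python
import math

def standardize(x):
    """Center x at its own mean and scale so that the squared norm equals P."""
    p = len(x)
    mean = sum(x) / p
    centered = [a - mean for a in x]
    scale = math.sqrt(sum(c * c for c in centered) / p)
    return [c / scale for c in centered]

def correlation(x, y):
    """Pearson correlation computed across the P components of x and y."""
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Hypothetical observation vectors with P = 5 features.
x = [1.0, 3.0, 2.0, 5.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

xt, yt = standardize(x), standardize(y)
lhs = sum((a - b) ** 2 for a, b in zip(xt, yt))  # squared distance after transformation
rhs = 2 * len(x) * (1 - correlation(x, y))       # 2P * (1 - correlation)
print(lhs, rhs)  # the two values agree up to rounding
```

So ordinary K-means with Euclidean distance, run on the transformed vectors, behaves like K-means with the correlation-based distance.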

(14)

Non-hierarchical Method K-means

Drawbacks of K-means

No pretty tree

The number of clusters must be known in advance!

Not robust

(15)

The number of clusters must be known in advance, BUT HOW?

"K" must be known in advance, but how?

GAP statistics, Tibshirani et al. (2001)

Clest

(16)

GAP statistics

Idea behind GAP statistics

Find $\hat{k}$ such that $\mathrm{WSS}_k$ shows an elbow decline

-cool example in R

(17)

Definition

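$$\mathrm{GAP}(k) = E_{\text{null}}\big(\log \mathrm{WSS}(k)\big) - \log \mathrm{WSS}(k)$$

$\hat{k}$ = the smallest $k$ such that $\mathrm{GAP}(k) \ge \mathrm{GAP}(k+1) - s_{k+1}$

Standardize the graph of $\log(\mathrm{WSS}(k))$ by comparing it with its expectation under an appropriate null reference distribution of the data, estimated from $B$ reference data sets:

$$\hat{E}_{\text{null}}\big(\log \mathrm{WSS}(k)\big) = \frac{1}{B} \sum_{b=1}^{B} \log\big(\mathrm{WSS}(k)_b\big)$$

where $\mathrm{WSS}(k)_b$ is the WSS obtained on the $b$-th reference data set.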

-another cool one in R
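A rough Python sketch of the definition above, under several assumptions that are not on the slide: reference data are drawn uniformly over the range of each feature (the simplest choice), K-means is restarted a few times from random observations, and $s_k$ is taken as the standard deviation of the reference log(WSS) values times $\sqrt{1 + 1/B}$, following Tibshirani et al. (2001). All function names and the toy data are hypothetical.

```python
import math
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans_wss(data, k, rng, n_start=5, max_iter=50):
    """Smallest WSS found by K-means over a few random starts."""
    best = float("inf")
    for _ in range(n_start):
        centroids = [list(p) for p in rng.sample(data, k)]
        labels = None
        for _ in range(max_iter):
            new = [min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in data]
            if new == labels:
                break
            labels = new
            for j in range(k):
                members = [x for x, lab in zip(data, labels) if lab == j]
                if members:
                    centroids[j] = [sum(c) / len(members) for c in zip(*members)]
        wss = sum(sq_dist(x, centroids[lab]) for x, lab in zip(data, labels))
        best = min(best, wss)
    return best

def gap_statistic(data, k_max, B=10, seed=0):
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*data)]
    highs = [max(col) for col in zip(*data)]
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_wss = math.log(kmeans_wss(data, k, rng))
        ref_logs = []
        for _ in range(B):
            # Assumption: reference data uniform over the range of each feature.
            ref = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)] for _ in data]
            ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
        e_null = sum(ref_logs) / B  # estimate of E_null(log WSS(k))
        sd = math.sqrt(sum((v - e_null) ** 2 for v in ref_logs) / B)
        gaps.append(e_null - log_wss)
        s.append(sd * math.sqrt(1 + 1 / B))
    # Smallest k with GAP(k) >= GAP(k+1) - s_{k+1}; fall back to k_max if none.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k, gaps
    return k_max, gaps

# Hypothetical data: two Gaussian blobs of 15 points each.
gen = random.Random(1)
data = ([[gen.gauss(0, 0.3), gen.gauss(0, 0.3)] for _ in range(15)]
        + [[gen.gauss(4, 0.3), gen.gauss(4, 0.3)] for _ in range(15)])
k_hat, gaps = gap_statistic(data, k_max=4)
print(k_hat, [round(g, 2) for g in gaps])
```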

(18)

Reference distribution: K=1

Generate the reference features from a uniform distribution over a box aligned with the principal components of the data.

Orthogonally diagonalize $S_X = P D P'$.

$D$ is a diagonal matrix with the eigenvalues $\lambda_1, \dots, \lambda_p$ of $S_X$ on the diagonal, arranged so that $\lambda_1 \ge \dots \ge \lambda_p \ge 0$, and $P$ is an orthogonal matrix whose columns are the corresponding unit eigenvectors $u_1, \dots, u_p$.

Transform via $X^* = XP$. Then $S_{X^*} = P' S_X P = P' P D P' P = D$.

The features of the transformed data are no longer correlated.

Draw uniform features $Z^*$ over the range of the columns of $X^*$.

Finally we back-transform via $Z = Z^* P'$ to give the reference data set $Z$.
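The steps above can be sketched for the special case of two features, where the eigenvectors of the symmetric 2x2 matrix $S_X$ have a closed form (a rotation by an angle $\theta$). This restriction, the function name and the toy data are assumptions made to keep the example short and free of external libraries; for general $p$ a numerical eigendecomposition would be needed.

```python
import math
import random

def reference_sample_2d(X, rng):
    """Draw one reference data set Z for two-feature data X (list of [x1, x2] rows)."""
    n = len(X)
    m1 = sum(r[0] for r in X) / n
    m2 = sum(r[1] for r in X) / n
    # Sample covariance matrix S_X = [[s11, s12], [s12, s22]].
    s11 = sum((r[0] - m1) ** 2 for r in X) / (n - 1)
    s22 = sum((r[1] - m2) ** 2 for r in X) / (n - 1)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in X) / (n - 1)
    # P = [[cos, -sin], [sin, cos]] diagonalizes S_X; its columns are unit eigenvectors.
    theta = 0.5 * math.atan2(2 * s12, s11 - s22)
    c, s = math.cos(theta), math.sin(theta)
    # X* = X P
    X_star = [[r[0] * c + r[1] * s, -r[0] * s + r[1] * c] for r in X]
    lo = [min(col) for col in zip(*X_star)]
    hi = [max(col) for col in zip(*X_star)]
    # Z*: uniform over the range of each column of X*.
    Z_star = [[rng.uniform(lo[0], hi[0]), rng.uniform(lo[1], hi[1])] for _ in range(n)]
    # Z = Z* P'
    Z = [[z[0] * c - z[1] * s, z[0] * s + z[1] * c] for z in Z_star]
    return Z, (c, s), (lo, hi)

# Hypothetical, strongly correlated two-feature data.
rng = random.Random(0)
X = [[t, 2 * t + rng.gauss(0, 0.1)] for t in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]]
Z, (c, s), (lo, hi) = reference_sample_2d(X, rng)
print(Z[:2])
```

Because the box follows the principal axes, the reference points fall in a thin tilted rectangle around the data rather than in the much larger axis-aligned bounding box.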

(24)

Prediction-based resampling method, Clest

Clest returns the K with the most stable predictability in the clustering procedure.

Algorithm

For each K, repeat the following process B times for the data set and for the reference data set:

1. Partition the data set into a learning set and a testing set.

2. Perform K-means on the learning set; return classifiers.

3. Classify the testing set by the classifiers; return $C_{1,\text{classifier}}, \dots, C_{K,\text{classifier}}$.

4. Cluster the testing set by K-means; return $C_{1,K\text{-means}}, \dots, C_{K,K\text{-means}}$.

5. Measure the similarity of the two partitions $C_{1,\text{classifier}}, \dots, C_{K,\text{classifier}}$ and $C_{1,K\text{-means}}, \dots, C_{K,K\text{-means}}$.

(34)

Compute $S(k, \text{cluster labels 1}, \text{cluster labels 2})$.

Repeat this process B times for each K and obtain the average of the measure, $S_k$.

Repeat the algorithm for the reference data set B times and obtain $S_k^0$.

Obtain the standardized similarity measure $d_k = S_k - S_k^0$ and

$$\hat{K} = \operatorname*{argmax}_{k \in \{1, \dots, K\}} d_k$$
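A simplified sketch of the whole procedure in Python. Several choices here are assumptions, not taken from the slides: the classifier is "nearest learning-set centroid", the similarity S is the share of pairs on which the two labelings agree (one minus the CER), the split is half and half, and a single reference data set is drawn uniformly over the range of each feature. Candidate values start at K = 2 because with K = 1 both labelings agree trivially. Function names and data are hypothetical.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest(x, centroids):
    """Index of the centroid closest to x (used as the classifier)."""
    return min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))

def kmeans(data, k, rng, max_iter=50):
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(x, centroids) for x in data]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def agreement(a, b):
    """Assumed similarity S: share of pairs treated the same way by both labelings."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i)]
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def mean_similarity(data, k, B, rng):
    """Average similarity over B random learning/testing splits."""
    scores = []
    for _ in range(B):
        shuffled = list(data)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        learn, test = shuffled[:half], shuffled[half:]
        learn_centroids = kmeans(learn, k, rng)[1]
        by_classifier = [nearest(x, learn_centroids) for x in test]
        by_kmeans = kmeans(test, k, rng)[0]
        scores.append(agreement(by_classifier, by_kmeans))
    return sum(scores) / B

def clest(data, k_values, B=10, seed=0):
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*data)]
    highs = [max(col) for col in zip(*data)]
    # Assumption: one reference data set, uniform over the range of each feature.
    reference = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)] for _ in data]
    d = {}
    for k in k_values:
        d[k] = mean_similarity(data, k, B, rng) - mean_similarity(reference, k, B, rng)
    return max(d, key=d.get), d

# Hypothetical data: two Gaussian blobs of 15 points each.
gen = random.Random(1)
data = ([[gen.gauss(0, 0.3), gen.gauss(0, 0.3)] for _ in range(15)]
        + [[gen.gauss(4, 0.3), gen.gauss(4, 0.3)] for _ in range(15)])
k_hat, d = clest(data, k_values=[2, 3, 4])
print(k_hat, d)
```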

(44)

Similarity measures of two partitions

Let $P$ and $Q$ represent two partitions.

$$\mathrm{CER} = \frac{\sum_{i > i'} \left| I_P(i, i') - I_Q(i, i') \right|}{\binom{n}{2}}$$

$$I_P(i, i') = \begin{cases} 1 & \text{if } i \text{ and } i' \text{ belong to the same cluster under partition } P \\ 0 & \text{otherwise} \end{cases}$$

$0 \le \mathrm{CER} \le 1$

CER = 0 means perfect agreement of the two partitions

CER = 1 means complete disagreement of the two partitions
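The CER can be computed directly from two label vectors by looping over all pairs of observations. A small sketch follows; the function name and the example labelings are illustrative.

```python
from itertools import combinations

def cer(p_labels, q_labels):
    """Share of observation pairs on which partitions P and Q disagree."""
    n = len(p_labels)
    disagreements = 0
    for i, j in combinations(range(n), 2):
        same_p = p_labels[i] == p_labels[j]  # I_P(i, i')
        same_q = q_labels[i] == q_labels[j]  # I_Q(i, i')
        disagreements += same_p != same_q
    return disagreements / (n * (n - 1) / 2)

# Same partition with the cluster names swapped: perfect agreement.
print(cer([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.0
# One big cluster versus all singletons: complete disagreement.
print(cer([0, 0, 0], [0, 1, 2]))        # 1.0
# Partial disagreement: 4 of the 6 pairs are treated differently.
print(cer([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because only pair co-membership is compared, the measure does not depend on how the clusters are numbered.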

(45)

Does Clest outperform GAP statistics?

(46)

Tibshirani, Robert, et al. (2001). Estimating the number of clusters in a data set via the gap statistic.