Introduction to Clustering

Yumi Kondo

Student Seminar LSK301

Sep 25, 2010

Microarray Example

N = 65, P = 1756

Clustering

The data set $\{x_{ij}\}$, $i = 1, \dots, N$, $j = 1, \dots, P$, consists of $P$ features measured on $N$ independent observations.

Clustering algorithms seek to assign $N$ observations in $P$-dimensional space, labeled $x_1, \dots, x_N$, to one of $K$ groups, based on some similarity measure.

"Unsupervised" learning: the problem of finding groups in data without the help of a response variable

No right or wrong partition

What is a similarity measure?

$x_1$ and $x_2$ are observation vectors in $P$ dimensions. Some examples:

Euclidean distance: $\|x_i - x_{i'}\|^2$

[Figure: scatter plot with axes n[,1] and n[,2], both running from 0 to 5]

Absolute distance: $\|x_i - x_{i'}\|_1^2$

Correlation, with $d = 1 - \text{correlation}$
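The three measures above can be sketched in a few lines of NumPy. This is only an illustration: the function names and the toy vectors are assumptions, and the first two functions return the squared norms as written on the slide.

```python
import numpy as np

def euclidean_sq(x, y):
    """Squared Euclidean distance ||x - y||^2."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(d @ d)

def absolute_sq(x, y):
    """Squared absolute (L1) distance ||x - y||_1^2."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.abs(d).sum() ** 2)

def correlation_distance(x, y):
    """d = 1 - Pearson correlation between the entries of x and y."""
    return float(1.0 - np.corrcoef(x, y)[0, 1])

# Toy vectors (hypothetical): x2 is x1 scaled by 2.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 4.0, 6.0, 8.0])

print(euclidean_sq(x1, x2))          # 1 + 4 + 9 + 16 = 30
print(absolute_sq(x1, x2))           # (1 + 2 + 3 + 4)^2 = 100
print(correlation_distance(x1, x2))  # approximately 0: the vectors are perfectly correlated
```

The example shows why the choice matters: the two vectors are far apart in Euclidean terms but identical under the correlation measure.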

What is K-means?

Clustering methods

Hierarchical clustering

Non-hierarchical clustering

- K-means

Hierarchical Clustering

It produces a dendrogram that represents a nested set of clusters: depending on where the dendrogram is cut, between 1 and N clusters can result.

Cool microarray example:

http://genome-www.stanford.edu/breast_cancer/molecularportraits/download.shtml
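Cutting a dendrogram at different levels can be sketched with SciPy's `linkage` and `fcluster`. The toy data and the choice of average linkage are assumptions made for illustration; the slides do not say which linkage was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy data (hypothetical): two well-separated groups of 5 points in 2-D.
X = np.vstack([rng.normal(0.0, 0.1, size=(5, 2)),
               rng.normal(5.0, 0.1, size=(5, 2))])

# Build the full tree once; average linkage on Euclidean distances.
tree = linkage(X, method="average", metric="euclidean")

# Cut the same tree at two levels: at most 1 cluster, then at most 2.
labels_1 = fcluster(tree, t=1, criterion="maxclust")
labels_2 = fcluster(tree, t=2, criterion="maxclust")

print(labels_1)  # every observation in one cluster
print(labels_2)  # the two groups separated
```

The tree is computed once and then cut as often as needed, which is what makes the nested structure convenient.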

Hierarchical Clustering

PRO

Nice tree! (dendrogram)

Visualize the different levels of similarity between observations.

CON

Computationally expensive!

Non-hierarchical clustering, K-means

K-means with Euclidean distance as a similarity measure

The solution of K-means clustering is the partition $C_1, \dots, C_K$ such that

$$\min_{C_1, \dots, C_K} WSS = \min_{C_1, \dots, C_K} \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i, i' \in C_k} \|x_i - x_{i'}\|^2$$

where $n_k$ is the number of observations in cluster $C_k$.

- white board

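Note:

$$
\begin{aligned}
WSS &= \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i, i' \in C_k} \|x_i - x_{i'}\|^2 \\
&= \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i \in C_k} \sum_{i' \in C_k} \|x_i - \bar{x}_k + \bar{x}_k - x_{i'}\|^2 \\
&= \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \bar{x}_k\|^2
\end{aligned}
$$

Here $\bar{x}_k$ is the mean of the observations in $C_k$; the cross term vanishes because $\sum_{i \in C_k} (x_i - \bar{x}_k) = 0$.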
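The identity in this note can be checked numerically. The sketch below uses made-up data and an arbitrary partition (both assumptions) and compares the pairwise form of WSS with the centroid form.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3))                       # toy data: N = 12, P = 3
labels = np.array([0] * 3 + [1] * 4 + [2] * 5)     # arbitrary partition, K = 3

def wss_pairwise(X, labels):
    """sum_k 1/(2 n_k) * sum over all ordered pairs (i, i') in C_k of ||x_i - x_i'||^2."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        diff = Xk[:, None, :] - Xk[None, :, :]     # all pairwise differences within C_k
        total += (diff ** 2).sum() / (2 * len(Xk))
    return total

def wss_centroid(X, labels):
    """sum_k sum_{i in C_k} ||x_i - xbar_k||^2."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        total += ((Xk - Xk.mean(axis=0)) ** 2).sum()
    return total

print(np.isclose(wss_pairwise(X, labels), wss_centroid(X, labels)))  # True
```

The centroid form costs far less to evaluate, which is why the algorithm works with cluster means rather than all pairs.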

Algorithm for K-means

Step 1 and step 2 are iterated until convergence.

Step 1. Given the cluster assignment $C_1, \dots, C_K$, the cluster centroids are calculated as

$$\hat{\mu}_k = \frac{\sum_{i \in C_k} x_i}{n_k}, \quad k = 1, \dots, K$$

Step 2. Given the cluster centroids, the objective function is minimized by assigning each observation to the closest cluster mean:

$$I_i = \arg\min_{1 \le k \le K} \|x_i - \hat{\mu}_k\|^2$$

- white board
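The two steps can be sketched as a short NumPy function. This is a minimal illustration, not the presenter's code: the initialisation (K randomly chosen observations), the stopping rule and the toy data are all assumptions, since the slides do not specify them.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate reassignment and centroid updates until the labels stop changing."""
    rng = np.random.default_rng(seed)
    # Assumed initialisation: K distinct observations serve as starting centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: assign each observation to the closest centroid.
        dist2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # converged
        labels = new_labels
        # Step 1: recompute each centroid as the mean of its cluster.
        for k in range(K):
            if np.any(labels == k):                # keep the old centroid if a cluster empties
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# Toy data (hypothetical): two tight groups around (0, 0) and (3, 3).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, size=(20, 2)),
               rng.normal(3.0, 0.2, size=(20, 2))])
labels, centroids = kmeans(X, K=2)
print(labels)
print(centroids.round(2))
```

Each step can only lower WSS or leave it unchanged, so the loop stops, but possibly at a local minimum; running it from several starting points is a common safeguard.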

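Correlation as similarity measurement in K-means

It is not so easy to create an algorithm when the similarity measurement is correlation: there is no simple analytic form for the cluster centroid :(

Data transformation approach

1. Normalize the observation vector:

$$\tilde{x}_i = \frac{x_i - \bar{x}}{\sqrt{\|x_i - \bar{x}\|^2 / P}}$$

where $\bar{x}$ is the mean of the $P$ entries of $x_i$, subtracted from every entry.

2. Then $\|\tilde{x}_i - \tilde{y}_i\|^2 \propto d_\rho(x, y)$:

$$
\begin{aligned}
\|\tilde{x}_i - \tilde{y}_i\|^2
&= \left\| \frac{x_i - \bar{x}}{\sqrt{\|x_i - \bar{x}\|^2 / P}} - \frac{y_i - \bar{y}}{\sqrt{\|y_i - \bar{y}\|^2 / P}} \right\|^2 \\
&= P \frac{\|x_i - \bar{x}\|^2}{\|x_i - \bar{x}\|^2} - 2P \frac{(x_i - \bar{x})'(y_i - \bar{y})}{\|x_i - \bar{x}\| \, \|y_i - \bar{y}\|} + P \frac{\|y_i - \bar{y}\|^2}{\|y_i - \bar{y}\|^2} \\
&= 2P - 2P \frac{(x_i - \bar{x})'(y_i - \bar{y})}{\|x_i - \bar{x}\| \, \|y_i - \bar{y}\|} \\
&= 2P \, (1 - \rho_{x,y}) = 2P \, d_\rho(x, y)
\end{aligned}
$$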
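A quick numerical check of this transformation, on made-up vectors (an assumption): after normalizing, the squared Euclidean distance should equal $2P(1 - \rho)$, so ordinary K-means on the normalized data behaves like K-means with the correlation measure.

```python
import numpy as np

def normalize(v):
    """(v - mean(v)) / sqrt(||v - mean(v)||^2 / P); the result has squared norm P."""
    c = v - v.mean()
    return c / np.sqrt((c @ c) / len(v))

rng = np.random.default_rng(2)
P = 50
x = rng.normal(size=P)                  # toy observation vectors (hypothetical)
y = 0.5 * x + rng.normal(size=P)

lhs = ((normalize(x) - normalize(y)) ** 2).sum()   # ||x~ - y~||^2
rho = np.corrcoef(x, y)[0, 1]                      # Pearson correlation
rhs = 2 * P * (1 - rho)                            # 2P * d_rho

print(np.isclose(lhs, rhs))  # True
```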

Non-hierarchical Method K-means

Drawbacks of K-means

No pretty tree

The number of clusters must be known in advance!

Not robust

The number of clusters must be known in advance, but how?

Gap statistic, Tibshirani et al. (2001)

Clest

Gap statistic

Idea behind the gap statistic

Find $\hat{k}$ such that $WSS_k$ shows an elbow decline

- cool example in R

Definition

$$GAP(k) = \hat{E}_{null}(\log(WSS(k))) - \log(WSS(k))$$

$$\hat{k} = \text{the smallest } k \text{ such that } GAP(k) \ge GAP(k+1) - s_{k+1}$$

Standardize the graph of $\log(WSS(k))$ by comparing it with its expectation under an appropriate null reference distribution of the data:

$$\hat{E}_{null}(\log(WSS(k))) = \frac{1}{B} \sum_{b=1}^{B} \log(WSS(k)_b)$$

- another cool one in R
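The definition can be sketched as follows. Several choices here are assumptions rather than anything stated on the slide: the reference data are drawn uniformly over the bounding box of the features (the simpler of the reference distributions in Tibshirani et al.), $s_k$ is taken as the standard deviation of the $B$ reference values times $\sqrt{1 + 1/B}$ as in that paper, and K-means is a bare-bones version with a few random restarts.

```python
import numpy as np

def kmeans_wss(X, K, n_init=5, n_iter=30, seed=0):
    """Smallest within-cluster sum of squares over a few random K-means restarts."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_init):
        centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            for k in range(K):
                if np.any(labels == k):
                    centroids[k] = X[labels == k].mean(axis=0)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

def gap_statistic(X, k_max, B=10, seed=0):
    """Return (k_hat, GAP(1..k_max)) using a uniform-box reference distribution."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_wss = np.log(kmeans_wss(X, k))
        # B reference data sets with no cluster structure (assumed uniform box).
        ref = np.array([np.log(kmeans_wss(rng.uniform(lo, hi, size=X.shape), k))
                        for _ in range(B)])
        gaps.append(ref.mean() - log_wss)          # GAP(k)
        s.append(ref.std() * np.sqrt(1 + 1 / B))   # s_k
    gaps, s = np.array(gaps), np.array(s)
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:          # GAP(k) >= GAP(k+1) - s_{k+1}
            return k, gaps
    return k_max, gaps

# Toy data (hypothetical): three groups of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.5]])
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in centers])
k_hat, gaps = gap_statistic(X, k_max=5)
print(k_hat)
print(gaps.round(2))
```

Because both the clustering and the reference draws are random, the estimate can vary with the seed and with the number of restarts; larger `B` and `n_init` make it steadier at the cost of run time.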

Reference distribution: K=1

Generate the reference features from a uniform distribution over a box aligned with the principal components of the data.

Orthogonally diagonalize $S_X = P D P'$. $D$ is a diagonal matrix with the eigenvalues $\lambda_1, \dots, \lambda_p$ of $S_X$ on the diagonal, arranged so that $\lambda_1 \ge \dots \ge \lambda_p \ge 0$, and $P$ is an orthogonal matrix whose columns are the corresponding unit eigenvectors $u_1, \dots, u_p$.

Transform via $X^* = XP$. Then $S_{X^*} = P' S_X P = P' P D P' P = D$.

The transformed data is no longer correlated.

Draw uniform features $Z^*$ over the range of the columns of $X^*$.

Finally we back-transform via $Z = Z^* P'$ to give the reference data set $Z$.
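These steps can be sketched with an eigendecomposition of the sample covariance matrix. The function name and toy data are assumptions, and so is centring the columns before rotating (the slide does not mention it; the mean is added back at the end).

```python
import numpy as np

def pca_reference(X, rng):
    """Draw one reference data set, uniform over a box aligned with the principal axes."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # assumed centring step
    S = np.cov(Xc, rowvar=False)                    # sample covariance S_X
    eigvals, P = np.linalg.eigh(S)                  # S_X = P D P', unit eigenvectors in columns
    P = P[:, np.argsort(eigvals)[::-1]]             # order so that lambda_1 >= ... >= lambda_p
    X_star = Xc @ P                                 # transformed, uncorrelated data
    Z_star = rng.uniform(X_star.min(axis=0), X_star.max(axis=0), size=X_star.shape)
    return Z_star @ P.T + mean                      # back-transform Z = Z* P'

# Toy data (hypothetical): strongly correlated 2-D Gaussian sample.
rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=200)
Z = pca_reference(X, rng)
print(Z.shape)
```

Compared with a box aligned to the original axes, this box follows the elongated shape of the data, so the reference set covers the same region without introducing cluster structure.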

Prediction-based resampling method, Clest

Clest returns the K which has the most stable predictability in the clustering procedure.

Algorithm

For each K, repeat the following process B times for the data set and for the reference data set:

- partition the data set into a learning set and a testing set

- perform K-means on the learning set, return classifiers

- classify the testing set by the classifiers, return $C_{1,\text{classifier}}, \dots, C_{K,\text{classifier}}$

- cluster the testing set by K-means, return $C_{1,\text{Kmeans}}, \dots, C_{K,\text{Kmeans}}$

- measure the similarity of the two partitions $C_{1,\text{classifier}}, \dots, C_{K,\text{classifier}}$ and $C_{1,\text{Kmeans}}, \dots, C_{K,\text{Kmeans}}$

Compute $S(k, \text{cluster labels 1}, \text{cluster labels 2})$

Repeat this process B times for each K and obtain the average of the measure, $S_k$

Repeat the algorithm for the reference data set B times and obtain $S_k^0$

Obtain the standardized similarity measure $d_k = S_k - S_k^0$ and

$$\hat{K} = \arg\max_{k \in \{1, \dots, K\}} d_k$$

Similarity measures of two partitions

Let $P$ and $Q$ represent two partitions.

$$CER = \frac{\sum_{i > i'} |I_P(i, i') - I_Q(i, i')|}{\binom{n}{2}}$$

$$I_P(i, i') = \begin{cases} 1 & \text{if } i \text{ and } i' \text{ belong to the same cluster by partitioning } P \\ 0 & \text{otherwise} \end{cases}$$

$0 \le CER \le 1$

CER = 0 means perfect agreement of two partitions

CER = 1 means complete disagreement of two partitions
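CER can be computed directly from two label vectors. The function below is a minimal sketch of the formula above; its name and the example labels are assumptions.

```python
import numpy as np

def cer(labels_p, labels_q):
    """Fraction of observation pairs on which partitions P and Q disagree."""
    p = np.asarray(labels_p)
    q = np.asarray(labels_q)
    same_p = p[:, None] == p[None, :]        # I_P(i, i') for every pair
    same_q = q[:, None] == q[None, :]        # I_Q(i, i') for every pair
    pairs = np.triu_indices(len(p), k=1)     # count each unordered pair once
    return float(np.mean(same_p[pairs] != same_q[pairs]))

print(cer([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.0: same grouping, only the label names differ
print(cer([0, 0, 1, 1], [0, 1, 0, 1]))  # 4 of the 6 pairs disagree
print(cer([0, 0, 0], [0, 1, 2]))        # 1.0: one cluster versus all singletons
```

Because CER only looks at whether pairs are grouped together, it does not depend on how the clusters are numbered, which is what makes it usable for comparing two independent K-means runs.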

Does Clest outperform the gap statistic?

References

Tibshirani, Robert, et al. (2001). Estimating the number of clusters in a data set via the gap statistic.
