Clustering.ppt

(1)

(2)

What is Cluster Analysis?



Cluster: a collection of data objects



Similar to one another within the same cluster



Dissimilar to the objects in other clusters



Cluster analysis



Finding similarities between data according to the

(3)

What is Cluster Analysis?



Clustering analysis is an important human activity



Early in childhood, we learn how to distinguish between cats and dogs

Unsupervised learning: no predefined classes



Typical applications

(4)

Clustering: Rich Applications and Multidisciplinary Efforts



Pattern Recognition



Spatial Data Analysis



Create thematic maps in GIS by clustering feature spaces



Detect spatial clusters or support other spatial mining tasks



Image Processing



Economic Science (especially market research)



WWW



Document classification

(5)

Quality: What Is Good Clustering?

A good clustering method will produce high-quality clusters with

- high intra-class similarity (similar to one another within the same cluster)

- low inter-class similarity (dissimilar to the objects in other clusters)



The quality of a clustering method is also measured by its ability

(6)

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects



Some popular ones include:

Minkowski distance:

$$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$

where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer

If q = 1, d is Manhattan distance:

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

(7)

Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is Euclidean distance:

$$d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 }$$

Also, one can use weighted distance, parametric Pearson correlation, or other dissimilarity measures
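A minimal Python sketch of the Minkowski family may help make the role of q concrete. The function names and the two sample points are illustrative assumptions, not part of the slides.

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of dimensions")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)


def manhattan(i, j):
    """Special case q = 1."""
    return minkowski(i, j, 1)


def euclidean(i, j):
    """Special case q = 2."""
    return minkowski(i, j, 2)


# Hypothetical sample objects, chosen so the results are easy to check by hand.
a = (1.0, 2.0)
b = (4.0, 6.0)
print(manhattan(a, b))  # 7.0
print(euclidean(a, b))  # 5.0
```

Keeping q as a parameter makes it easy to compare how the same pair of objects ranks under different distance choices.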

(8)

Major Clustering Approaches

• Partitioning approach:

- Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors

- Typical methods: k-means, k-medoids, CLARANS

• Hierarchical approach:

- Create a hierarchical decomposition of the set of data (or objects) using some criterion

- Typical methods: Hierarchical, Diana, Agnes, BIRCH, ROCK, CHAMELEON

• Density-based approach:

(9)

Some Other Major Clustering Approaches

• Grid-based approach:

- Based on a multiple-level granularity structure

- Typical methods: STING, WaveCluster, CLIQUE

• Model-based:

- A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model

- Typical methods: EM, SOM, COBWEB

• Frequent pattern-based:

- Based on the analysis of frequent patterns

- Typical methods: pCluster

• User-guided or constraint-based:

- Clustering by considering user-specified or application-specific constraints

(10)

Typical Alternatives to Calculate the Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min dis(t_ip, t_jq)

Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max dis(t_ip, t_jq)

Average: avg distance between an element in one cluster and an

(11)

Typical Alternatives to Calculate the Distance between Clusters

Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = dis(C_i, C_j)

- Centroid: the “middle” of a cluster

$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$

Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = dis(M_i, M_j)

- Medoid: one chosen, centrally located object in the cluster
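The inter-cluster distances listed on these two slides can be sketched in a few lines of Python. This is an illustration under stated assumptions: objects are one-dimensional numbers, dis is the absolute difference, the average link is taken as the mean over all cross-cluster pairs, and the medoid is picked as the member with the smallest total distance to the other members (one common reading of "centrally located"). All function names and the sample clusters are hypothetical.

```python
from itertools import product


def dist(a, b):
    """Distance between two 1-D objects (assumed: absolute difference)."""
    return abs(a - b)


def single_link(ki, kj):
    """Smallest distance over all cross-cluster pairs."""
    return min(dist(a, b) for a, b in product(ki, kj))


def complete_link(ki, kj):
    """Largest distance over all cross-cluster pairs."""
    return max(dist(a, b) for a, b in product(ki, kj))


def average_link(ki, kj):
    """Mean distance over all cross-cluster pairs (assumed definition)."""
    return sum(dist(a, b) for a, b in product(ki, kj)) / (len(ki) * len(kj))


def centroid(k):
    """Mean of the cluster members."""
    return sum(k) / len(k)


def medoid(k):
    """Member with the smallest total distance to all members (assumed rule)."""
    return min(k, key=lambda c: sum(dist(c, t) for t in k))


def centroid_distance(ki, kj):
    return dist(centroid(ki), centroid(kj))


def medoid_distance(ki, kj):
    return dist(medoid(ki), medoid(kj))


# Hypothetical sample clusters.
k1 = [2, 3, 4, 7, 9]
k2 = [10, 11, 12, 16, 18, 19]
print(single_link(k1, k2))    # 1
print(complete_link(k1, k2))  # 17
print(average_link(k1, k2), centroid_distance(k1, k2), medoid_distance(k1, k2))
```

Single link reacts to the closest pair only, so it can merge clusters connected by a thin chain of points, while complete link is driven by the farthest pair.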

(12)

Clustering Approaches

1.

Partitioning Methods

2.

Hierarchical Methods

(13)

Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t., min sum of squared distance

$$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$
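As a rough illustration of the criterion above, the sum of squared distances can be evaluated for any candidate partition. The sketch assumes one-dimensional objects and uses made-up data; the function names are hypothetical.

```python
def centroid(cluster):
    """C_m: the mean of the objects in cluster K_m."""
    return sum(cluster) / len(cluster)


def sum_of_squared_distance(partition):
    """Sum over clusters of squared distances from each object to its centroid."""
    total = 0.0
    for cluster in partition:
        c = centroid(cluster)
        total += sum((c - t) ** 2 for t in cluster)
    return total


# Two candidate partitions of the same illustrative objects into k = 2 clusters.
good = [[1, 2, 3], [10, 12]]
bad = [[1, 2], [3, 10, 12]]
print(sum_of_squared_distance(good))  # 4.0
print(sum_of_squared_distance(bad))
```

A partitioning method prefers the partition with the smaller value, which here is the first one.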

(14)

Partitioning Algorithms: Basic Concept

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

- Global optimal: exhaustively enumerate all partitions

- Heuristic methods: k-means and k-medoids algorithms

- k-means (MacQueen’67): Each cluster is represented by the center of the cluster

- k-medoids or PAM (Partition around medoids) (Kaufman &

(15)

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:

1. Partition objects into k nonempty subsets

2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)

3. Assign each object to the cluster with the nearest seed point

4. Go back to step 2; stop when no assignment changes
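The steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes one-dimensional objects, uses a round-robin split for the initial partition, and caps the number of passes with a hypothetical max_iter parameter. The loop ends once no object changes cluster, the usual stopping rule for k-means.

```python
def k_means(objects, k, max_iter=100):
    """Return (assignment, centroids) for 1-D objects; names are illustrative."""
    if not 1 <= k <= len(objects):
        raise ValueError("k must be between 1 and the number of objects")
    # Step 1: partition the objects into k nonempty subsets (round-robin split).
    assignment = [idx % k for idx in range(len(objects))]
    centroids = [0.0] * k
    for _ in range(max_iter):
        # Step 2: the seed points are the means of the current clusters.
        for c in range(k):
            members = [x for x, a in zip(objects, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous seed point
                centroids[c] = sum(members) / len(members)
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(k), key=lambda c: abs(x - centroids[c])) for x in objects
        ]
        # Step 4: repeat from step 2 until no object changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids


labels, centers = k_means([1, 2, 3, 10, 11, 12], k=2)
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # [2.0, 11.0]
```

The result depends on the initial partition, so in practice the procedure is often run from several starting points.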

(16)

(17)

(18)

(19)

(20)

(21)

The K-Means Clustering Method

[Figure: k-means example with K=2; both plot axes run from 0 to 10]

(22)

Example



Run K-means clustering with 3 clusters

(23)

Example



Centroids:

3 – objects 2, 3, 4, 7, 9; new centroid: 5

16 – objects 10, 11, 12, 16, 18, 19; new centroid: 14.33

(24)

Example



Centroids:

5 – objects 2, 3, 4, 7, 9; new centroid: 5

14.33 – objects 10, 11, 12, 16, 18, 19; new centroid: 14.33
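The two iterations shown on these example slides can be reproduced with a short Python sketch. It takes the eleven values listed above as the data set and starts from the two centroids shown (3 and 16). The function name and the max_iter cap are assumptions made for illustration.

```python
def k_means_1d(data, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    clusters = []
    for _ in range(max_iter):
        # Assign every value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
            clusters[nearest].append(x)
        # Recompute each centroid as the mean of its cluster.
        new_centroids = [
            sum(members) / len(members) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids


data = [2, 3, 4, 7, 9, 10, 11, 12, 16, 18, 19]
clusters, centroids = k_means_1d(data, [3, 16])
print(clusters)                          # [[2, 3, 4, 7, 9], [10, 11, 12, 16, 18, 19]]
print([round(c, 2) for c in centroids])  # [5.0, 14.33]
```

The second pass leaves both clusters unchanged, which is why the centroids stay at 5 and 14.33.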

(25)

In-Class Practice



Run K-means clustering with 3 clusters

(26)

Comments on the K-Means Method

Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.

Weakness

- Applicable only when the mean is defined; what about categorical data?

- Need to specify k, the number of clusters, in advance

- Unable to handle noisy data and outliers
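The sensitivity to outliers can be seen in a small Python sketch. It compares the mean of a cluster with its medoid before and after one extreme value is added. The data are made up, and the medoid rule (the member with the smallest total distance to the others) is an assumption.

```python
def centroid(cluster):
    """Mean of the cluster, as used by k-means."""
    return sum(cluster) / len(cluster)


def medoid(cluster):
    """Member with the smallest total distance to all members (assumed rule)."""
    return min(cluster, key=lambda c: sum(abs(c - t) for t in cluster))


clean = [1, 2, 3, 4, 5]
noisy = clean + [100]  # one hypothetical outlier

print(centroid(clean), medoid(clean))  # 3.0 3
print(centroid(noisy), medoid(noisy))
```

With the outlier included, the mean moves far outside the range of the original five values, while the medoid remains one of them.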

(27)

Fuzzy C-means Clustering

Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters.

This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in

(28)

(29)

(30)

(31)

(32)

(33)

(34)

(35)

Fuzzy C-means Clustering

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/cmeans.html

(36)

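Fuzzy C-means Clustering

- For example: we have initial centroids 3 & 11 (with m=2)

- For node 2 (1st element):

U11: the membership of the first node to the first cluster

$$U_{11} = \frac{1}{\left(\frac{|2-3|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-3|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + \frac{1}{81}} = \frac{81}{82} = 98.78\%$$

U12: the membership of the first node to the second cluster

$$U_{12} = \frac{1}{\left(\frac{|2-11|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-11|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{81 + 1} = \frac{1}{82} = 1.22\%$$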

(37)

(with m=2)

- For node 3 (2nd element):

U21 = 100%

The membership of second node to first cluster

U22 = 0%

(38)

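Fuzzy C-means Clustering

(with m=2)

- For node 4 (3rd element):

U31: the membership of the third node to the first cluster

$$U_{31} = \frac{1}{\left(\frac{|4-3|}{|4-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|4-3|}{|4-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + \frac{1}{49}} = \frac{49}{50} = 98\%$$

U32: the membership of the third node to the second cluster

$$U_{32} = \frac{1}{\left(\frac{|4-11|}{|4-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|4-11|}{|4-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{49 + 1} = \frac{1}{50} = 2\%$$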

(39)

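(with m=2)

- For node 7 (4th element):

U41: the membership of the fourth node to the first cluster

$$U_{41} = \frac{1}{\left(\frac{|7-3|}{|7-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|7-3|}{|7-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + 1} = \frac{1}{2} = 50\%$$

U42: the membership of the fourth node to the second cluster

$$U_{42} = \frac{1}{\left(\frac{|7-11|}{|7-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|7-11|}{|7-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + 1} = \frac{1}{2} = 50\%$$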
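The membership values worked out for nodes 2, 3, 4 and 7 can be checked with a short Python sketch. It assumes the standard FCM membership rule, u_ij = 1 / Σ_k (|x_i − c_j| / |x_i − c_k|)^(2/(m−1)), with one-dimensional data. The function name is hypothetical, and an object that coincides with a centroid is given full membership in that cluster.

```python
def fcm_memberships(x, centroids, m=2.0):
    """Membership of object x in each cluster, given the cluster centroids."""
    distances = [abs(x - c) for c in centroids]
    # An object sitting exactly on a centroid belongs fully to that cluster
    # (shared equally if several centroids coincide with it).
    on_centroid = [d == 0 for d in distances]
    if any(on_centroid):
        share = 1.0 / sum(on_centroid)
        return [share if hit else 0.0 for hit in on_centroid]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]


centroids = [3, 11]
for x in [2, 3, 4, 7]:
    u = fcm_memberships(x, centroids, m=2.0)
    print(x, [f"{100 * value:.2f}%" for value in u])
```

The memberships of each object sum to 1, and the object at 7, which is equally far from both centroids, is split evenly between the two clusters.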

(40)
