Clustering.ppt

Academic year: 2020


(1)
(2)

What is Cluster Analysis?

Cluster: a collection of data objects

Similar to one another within the same cluster

Dissimilar to the objects in other clusters

Cluster analysis

Finding similarities between data according to the

(3)

What is Cluster Analysis?

Clustering analysis is an important human activity

Early in childhood, we learn how to distinguish between cats and dogs

Unsupervised learning: no predefined classes

Typical applications

(4)

Clustering: Rich Applications and Multidisciplinary Efforts

Pattern Recognition

Spatial Data Analysis

Create thematic maps in GIS by clustering feature spaces

Detect spatial clusters, or use clustering for other spatial mining tasks

Image Processing

Economic Science (especially market research)

WWW

Document classification

(5)

Quality: What Is Good Clustering?

A good clustering method will produce high quality clusters with

▪ high intra-class similarity (similar to one another within the same cluster)

▪ low inter-class similarity (dissimilar to the objects in other clusters)

The quality of a clustering method is also measured by its ability

(6)

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects

Some popular ones include:

Minkowski distance:

$$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and q is a positive integer

If q = 1, d is Manhattan distance:

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$


(7)

Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is Euclidean distance:

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

Also, one can use weighted distance, parametric Pearson correlation, or other dissimilarity measures

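The distance definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `minkowski_distance` and the two sample objects are assumptions chosen for the example.

```python
def minkowski_distance(a, b, q):
    """Minkowski distance of order q between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1.0 / q)


# Hypothetical 2-dimensional objects, chosen only for illustration.
i = (1.0, 2.0)
j = (4.0, 6.0)

manhattan = minkowski_distance(i, j, 1)  # q = 1: |1-4| + |2-6|
euclidean = minkowski_distance(i, j, 2)  # q = 2: sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

For these two objects the Manhattan distance is 7 and the Euclidean distance is 5, which shows how the choice of q changes the measured dissimilarity.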

(8)

Major Clustering Approaches

■ Partitioning approach:

▪ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors

▪ Typical methods: k-means, k-medoids, CLARANS

■ Hierarchical approach:

▪ Create a hierarchical decomposition of the set of data (or objects) using some criterion

▪ Typical methods: Hierarchical, Diana, Agnes, BIRCH, ROCK, CHAMELEON

■ Density-based approach:

(9)

Some Other Major Clustering Approaches

■ Grid-based approach:

▪ Based on a multiple-level granularity structure

▪ Typical methods: STING, WaveCluster, CLIQUE

■ Model-based:

▪ A model is hypothesized for each cluster, and the method tries to find the best fit of the data to that model

▪ Typical methods: EM, SOM, COBWEB

■ Frequent pattern-based:

▪ Based on the analysis of frequent patterns

▪ Typical methods: pCluster

■ User-guided or constraint-based:

▪ Clustering by considering user-specified or application-specific constraints

(10)

Typical Alternatives to Calculate the Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)

Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(tip, tjq)

Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)
(11)

Typical Alternatives to Calculate the Distance between Clusters

Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)

▪ Centroid: the “middle” of a cluster

$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$

Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)

▪ Medoid: one chosen, centrally located object in the cluster
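The inter-cluster distances listed on these two slides can be compared side by side with a short Python sketch. The function names and the sample clusters `ka` and `kb` are assumptions made for illustration. The medoid is taken here to be the object with the smallest total distance to the other objects in its cluster, which is one common reading of "centrally located" rather than a definition given in the slides.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_link(ka, kb):
    """Smallest distance between an element of ka and an element of kb."""
    return min(dist(p, q) for p, q in product(ka, kb))


def complete_link(ka, kb):
    """Largest distance between an element of ka and an element of kb."""
    return max(dist(p, q) for p, q in product(ka, kb))


def average_link(ka, kb):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(ka, kb)) / (len(ka) * len(kb))


def centroid(k):
    """Coordinate-wise mean of the cluster."""
    return tuple(sum(coords) / len(k) for coords in zip(*k))


def medoid(k):
    """Assumed definition: object with the smallest total distance to the rest."""
    return min(k, key=lambda p: sum(dist(p, q) for q in k))


def centroid_distance(ka, kb):
    return dist(centroid(ka), centroid(kb))


def medoid_distance(ka, kb):
    return dist(medoid(ka), medoid(kb))


# Hypothetical clusters on a line, chosen only for illustration.
ka = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
kb = [(5.0, 0.0), (6.0, 0.0), (9.0, 0.0)]

for name, fn in [("single", single_link), ("complete", complete_link),
                 ("average", average_link), ("centroid", centroid_distance),
                 ("medoid", medoid_distance)]:
    print(name, round(fn(ka, kb), 3))
```

On this data single link gives 3 and complete link gives 9, while the average, centroid, and medoid variants fall in between, which is why the choice of linkage can change which clusters a hierarchical method merges first.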

(12)

Clustering Approaches

1. Partitioning Methods

2. Hierarchical Methods

(13)

Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t., min sum of squared distance

$$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$
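As a rough illustration of this criterion, the sketch below computes the sum of squared distances for a given partition of one-dimensional data. The function name is an assumption, and the two groups are the ones that appear in the worked example later in the deck; treating them as a complete partition is a simplification made only for this sketch.

```python
def sum_of_squared_errors(clusters):
    """Sum over clusters of squared distances from each object to its centroid."""
    total = 0.0
    for points in clusters:
        center = sum(points) / len(points)  # centroid C_m of cluster K_m
        total += sum((center - t) ** 2 for t in points)
    return total


# Two groups of 1-D objects (assumed here to form the whole partition).
clusters = [[2, 3, 4, 7, 9], [10, 11, 12, 16, 18, 19]]
sse = sum_of_squared_errors(clusters)
print(round(sse, 2))
```

A partitioning method searches for the assignment of objects to clusters that makes this value as small as possible.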

(14)

Partitioning Algorithms: Basic Concept

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

▪ Global optimal: exhaustively enumerate all partitions

▪ Heuristic methods: k-means and k-medoids algorithms

▪ k-means (MacQueen’67): Each cluster is represented by the center of the cluster

▪ k-medoids or PAM (Partition around medoids) (Kaufman &

(15)

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:

1. Partition objects into k nonempty subsets

2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)

3. Assign each object to the cluster with the nearest seed point
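The steps above can be sketched for one-dimensional data as follows. This is a minimal illustration under stated assumptions: it starts from given seed points instead of an initial partition (a common variant), it repeats the assign-and-recompute loop until no object changes cluster, and the name `kmeans_1d` and the sample data are hypothetical.

```python
def kmeans_1d(points, initial_centroids, max_iter=100):
    """Plain k-means on 1-D data; returns (centroids, assignment)."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the partition is stable
        assignment = new_assignment
        # Recompute each seed point as the mean of its current cluster.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: keep the old centroid if a cluster empties
                centroids[c] = sum(members) / len(members)
    return centroids, assignment


# Hypothetical data and deliberately poor seeds, chosen only for illustration.
points = [1, 2, 3, 10, 11, 12]
centroids, assignment = kmeans_1d(points, initial_centroids=[1, 2])
print(centroids, assignment)
```

Even with both seeds placed in the left group, the loop moves one centroid to each group within a few iterations. Different seeds can still lead to different final partitions, since k-means is a heuristic.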

(16)
(17)
(18)
(19)
(20)
(21)

The K-Means Clustering Method

[Figure: K-means example with K=2; scatter-plot panels with both axes running from 0 to 10]

(22)

Example

Run K-means clustering with 3 clusters

(23)

Example

Centroids:

Centroid 3: objects 2, 3, 4, 7, 9; new centroid: 5

Centroid 16: objects 10, 11, 12, 16, 18, 19; new centroid: 14.33

(24)

Example

Centroids:

Centroid 5: objects 2, 3, 4, 7, 9; new centroid: 5

Centroid 14.33: objects 10, 11, 12, 16, 18, 19; new centroid: 14.33
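The centroid updates in the two example slides can be checked by hand, since each new centroid is the mean of the objects listed for its cluster. Only the two clusters shown in the slides are covered here:

```latex
\[
\frac{2 + 3 + 4 + 7 + 9}{5} = \frac{25}{5} = 5,
\qquad
\frac{10 + 11 + 12 + 16 + 18 + 19}{6} = \frac{86}{6} \approx 14.33
\]
```

In the second pass the listed objects are the same and the same means come out again, so these two centroids have stopped moving.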

(25)

In class Practice

Run K-means clustering with 3 clusters

(26)

Comments on the K-Means Method

Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.

Weakness

■ Applicable only when the mean is defined; what about categorical data?

■ Need to specify k, the number of clusters, in advance

■ Unable to handle noisy data and outliers
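The outlier weakness can be seen in a small sketch: a single extreme value drags the mean far from the bulk of the data, while a medoid-style representative stays put. The data and helper names are assumptions for illustration, and the medoid is taken to be the object with the smallest total distance to the others.

```python
def mean(values):
    return sum(values) / len(values)


def medoid_1d(values):
    """Assumed definition: the object with the smallest total distance to the rest."""
    return min(values, key=lambda v: sum(abs(v - w) for w in values))


# Hypothetical cluster, with and without one outlier.
clean = [1, 2, 3, 4, 5]
noisy = [1, 2, 3, 4, 100]

print(mean(clean), medoid_1d(clean))
print(mean(noisy), medoid_1d(noisy))
```

With the outlier present the mean jumps from 3 to 22, while the medoid remains 3, which is one reason medoid-based methods are often preferred on noisy data.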

(27)

Fuzzy C-means Clustering

Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in

(28)
(29)
(30)
(31)
(32)
(33)
(34)
(35)

Fuzzy C-means Clustering

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/cmeans.html

(36)

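Fuzzy C-means Clustering

▪ For example: we have initial centroids 3 and 11 (with m = 2)

▪ For node 2 (1st element):

U11, the membership of the first node to the first cluster:

$$U_{11} = \frac{1}{\left(\frac{|2-3|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-3|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + \frac{1}{81}} = \frac{81}{82} = 98.78\%$$

U12, the membership of the first node to the second cluster:

$$U_{12} = \frac{1}{\left(\frac{|2-11|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-11|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{81 + 1} = \frac{1}{82} = 1.22\%$$
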

(37)

Fuzzy C-means Clustering

▪ For example: we have initial centroids 3 and 11 (with m = 2)

▪ For node 3 (2nd element):

U21 = 100%, the membership of the second node to the first cluster

U22 = 0%

(38)

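Fuzzy C-means Clustering

▪ For example: we have initial centroids 3 and 11 (with m = 2)

▪ For node 4 (3rd element):

U31, the membership of the third node to the first cluster:

$$U_{31} = \frac{1}{\left(\frac{|4-3|}{|4-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|4-3|}{|4-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + \frac{1}{49}} = \frac{49}{50} = 98\%$$

U32, the membership of the third node to the second cluster:

$$U_{32} = \frac{1}{\left(\frac{|4-11|}{|4-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|4-11|}{|4-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{49 + 1} = \frac{1}{50} = 2\%$$
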

(39)

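Fuzzy C-means Clustering

▪ For example: we have initial centroids 3 and 11 (with m = 2)

▪ For node 7 (4th element):

U41, the membership of the fourth node to the first cluster:

$$U_{41} = \frac{1}{\left(\frac{|7-3|}{|7-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|7-3|}{|7-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + 1} = \frac{1}{2} = 50\%$$

U42, the membership of the fourth node to the second cluster:

$$U_{42} = \frac{1}{\left(\frac{|7-11|}{|7-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|7-11|}{|7-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + 1} = \frac{1}{2} = 50\%$$
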
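The membership values worked out on these slides can be reproduced with a short Python sketch. The function name `fcm_memberships` is an assumption, the data are the one-dimensional nodes and centroids from the example, and the handling of a node that sits exactly on a centroid (full membership there, zero elsewhere) is an assumed convention that matches the 100% / 0% result shown for node 3.

```python
def fcm_memberships(x, centroids, m=2.0):
    """Fuzzy c-means membership of a 1-D point x in each cluster."""
    distances = [abs(x - c) for c in centroids]
    # Assumed convention: a point lying on a centroid belongs fully to it
    # (shared equally if several centroids coincide with the point).
    if 0 in distances:
        hits = [d == 0 for d in distances]
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]


centroids = [3, 11]  # initial centroids from the example, with m = 2
for node in [2, 3, 4, 7]:
    u = fcm_memberships(node, centroids)
    print(node, [f"{100 * value:.2f}%" for value in u])
```

The memberships of each node sum to 1, and node 7, which is equally far from both centroids, is split evenly between the two clusters.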

(40)