• No results found

Clustering UE 141 Spring 2013

N/A
N/A
Protected

Academic year: 2021

Share "Clustering UE 141 Spring 2013"

Copied!
27
0
0

Loading.... (view fulltext now)

Full text

(1)

Clustering

UE 141 Spring 2013

Jing Gao

SUNY Buffalo

(2)

Definition of Clustering

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster similarity are minimized Intra-cluster similarity are maximized

(3)

K-means

Partition {x

1

,…,x

n

} into K clusters

K is predefined

Initialization

Specify the initial cluster centers (centroids)

Iteration until no change

For each object x

i

Calculate the distances between xi and the K centroids (Re)assign xi to the cluster whose centroid is the

closest to xi

Update the cluster centroids based on current

assignment

(4)

0 1 2 3 4 5 0 1 2 3 4 5

K-means: Initialization

Initialization: Determine the three cluster centers

m1

m2

(5)

0 1 2 3 4 5 0 1 2 3 4 5

K-means Clustering: Cluster Assignment

Assign each object to the cluster which has the closet distance from the centroid to the object

m1

m2

(6)

0 1 2 3 4 5 0 1 2 3 4 5

K-means Clustering: Update Cluster Centroid

Compute cluster centroid as the center of the points in the cluster

m1

m2

(7)

K-means Clustering: Update Cluster Centroid

Compute cluster centroid as the center of the points in the cluster

0 1 2 3 4 5 0 1 2 3 4 5 m1 m2 m3

(8)

0 1 2 3 4 5 0 1 2 3 4 5 m1 m2 m3

K-means Clustering: Cluster Assignment

Assign each object to the cluster which has the closet distance from the centroid to the object

(9)

0 1 2 3 4 5 0 1 2 3 4 5 m1 m2 m3

K-means Clustering: Update Cluster Centroid

(10)

0 1 2 3 4 5 0 1 2 3 4 5 m1 m2 m3

K-means Clustering: Update Cluster Centroid

(11)

Sum of Squared Error (SSE)

Suppose the centroid of cluster Cj is mj

For each object x in Cj, compute the squared error between x and the centroid mj

• Sum up the error of all the objects

å å

Î -= j x C j j m x SSE ( )2 1

´

m1= 2 4

´

5 1.5 m2= 4.5 1 ) 5 . 4 5 ( ) 5 . 4 4 ( ) 5 . 1 2 ( ) 5 . 1 1 ( - 2 + - 2 + - 2 + - 2 = = SSE

(12)

How to Minimize SSE

Two sets of variables to minimize

Each object x belongs to which cluster? What’s the cluster centroid? mj

Minimize the error wrt each set of variable

alternatively

– Fix the cluster centroid—find cluster assignment that minimizes the current error

– Fix the cluster assignment—compute the cluster centroids that minimize the current error

å å

Î -j x C j j m x )2 ( min j C xÎ

(13)

Cluster Assignment Step

Cluster centroids (m

j

) are known

For each object

Choose C

j

among all the clusters for x such that

the distance between x and m

j

is the minimum

Choose another cluster will incur a bigger error

Minimize error on each object will minimize

the SSE

å å

Î -j x C j j m x )2 ( min

(14)

0 1 2 3 4 5 6 7 8 9 10 10 9 8 7 6 5 4 3 2 1

m

1

m

2

x

1

x

2

x

3

x

4

x

5

Example—Cluster Assignment

1 3 2 1, x , x C x Î 2 5 4 , x C x Î 2 1 3 2 1 2 2 1 1 ) ( ) ( ) (x m x m x m SSE= - + - + -Given m1, m2, which cluster each of the five points belongs to?

Assign points to the closet centroid— minimize SSE 2 2 5 2 2 4 ) ( ) (x -m + x -m +

(15)

Cluster Centroid Computation Step

For each cluster

Choose cluster centroid m

j

as the center of the

points

Minimize error on each cluster will

minimize the SSE

å å

Î -j x C j j m x )2 ( min

|

|

j C x j

C

x

m

j

å

Î

=

(16)

10 9 8 7 6 5 4 3 2 1

m

1

m

2

x

1

x

2

x

3

x

4

x

5

Example—Cluster Centroid Computation

Given the cluster

assignment, compute the centers of the two clusters

0 1 2 3 4 5 6 7 8 9 10

(17)

Hierarchical Clustering

Agglomerative approach

b d c e a a b d e c d e a b c d e

Step 0 Step 1 Step 2 Step 3 Step 4 bottom-up

Initialization:

Each object is a cluster Iteration:

Merge two clusters which are most similar to each other; Until all objects are merged into a single cluster

(18)

Hierarchical Clustering

Divisive Approaches

b d c e a a b d e c d e a b c d e

Step 4 Step 3 Step 2 Step 1 Step 0 Top-down

Initialization:

All objects stay in one cluster Iteration:

Select a cluster and split it into two sub clusters

Until each leaf cluster contains only one object

(19)

Dendrogram

A tree that shows how clusters are merged/split

hierarchically

Each node on the tree is a cluster; each leaf node is a

singleton cluster

(20)

Dendrogram

A clustering of the data objects is obtained by cutting

the dendrogram at the desired level, then each

(21)

Agglomerative Clustering Algorithm

• More popular hierarchical clustering technique • Basic algorithm is straightforward

1. Compute the distance matrix 2. Let each data point be a cluster

3. Repeat

4. Merge the two closest clusters 5. Update the distance matrix

6. Until only a single cluster remains

• Key operation is the computation of the distance between two clusters

– Different approaches to defining the distance between clusters distinguish the different algorithms

(22)

Starting Situation

• Start with clusters of individual points and a distance matrix

p1 p3 p5 p4 p2 p1 p2 p3 p4 p5 . . . . . . Distance Matrix

...

(23)

Intermediate Situation

• After some merging steps, we have some clusters

• Choose two clusters that has the smallest

distance (largest similarity) to merge

C1 C4 C2 C5 C3 C2 C1 C1 C3 C5 C4 C2 C3 C4 C5 Distance Matrix

...

(24)

Intermediate Situation

• We want to merge the two closest clusters (C2 and C5) and update

the distance matrix.

C1 C4 C2 C5 C3 C2 C1 C1 C3 C5 C4 C2 C3 C4 C5 Distance Matrix

...

(25)

After Merging

• The question is “How do we update the distance matrix?”

C1 C4 C2 U C5 C3 ? ? ? ? ? ? ? C2 U C5 C1 C1 C3 C4 C2 U C5 C3 C4 Distance Matrix

...

(26)

How to Define Inter-Cluster Distance

p1 p3 p5 p4 p2 p1 p2 p3 p4 p5 . . . . . . Distance? l MIN l MAX l Group Average

l Distance Between Centroids

l ……

(27)

Question

Talk about big data

– What’s the definition of big data?

– What applications generate big data? – What are the challenges?

References

Related documents

In vertex-weighted clustering, to partition the given data set into clusters the weighted Laplacian matrix and the cosine similarity matrix of the weighted Fiedler vector are used..

Our method generates a data matrix from the behavioural data of malware, standardise this data matrix and find the number of clusters in a dataset by using intelligent K-Means [20]..

Our study concludes that our predictive data mining model can improve the prediction ability by using all attributes in the different clusters with the nearest distance as

If the approximate nearest clusters returned by the data structure is at distance at most v, then the two clusters are merged and the following operations are performed: the

1985 1990 1995 2000 2005 2010 2015 Year 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Posterior Probability clustering graph matrix points cluster means algorithms distance kernel

Distance Matrix is created from VSM. It is needed to compare the similarity or dissimilarity between documents. Distance matrix is generated in many ways, like by using

Different from existing spectral clustering algorithms for HINs, SClump uses meta-paths in the con- struction of an effective similarity matrix.. Through an iter- ative

In addition, many existing document clustering algorithms require the user to specify the number of clusters as an input parameter and are not robust enough to handle different