
Data Mining:

Concepts and Techniques

(3rd ed.)

— Chapter 10

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign & Simon Fraser University


Chapter 10. Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts

Partitioning Methods

Hierarchical Methods

Density Methods

Grid Based Methods

Evaluation of Clustering

Summary


What is Cluster Analysis?

Cluster: A collection of data objects

similar (or related) to one another within the same group

dissimilar (or unrelated) to the objects in other groups

Cluster analysis

Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

Unsupervised learning: no predefined classes

Typical applications

As a stand-alone tool to get insight into data distribution

As a preprocessing step for other algorithms


What is Clustering?

Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.


Considerations for Cluster Analysis

Partitioning criteria

Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)

Separation of clusters

Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)

Similarity measure

Distance-based (e.g., Euclidean) vs. connectivity-based (e.g., density)

Clustering space

Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)


Quality: What Is Good Clustering?

A good clustering method will produce high quality clusters

high intra-class similarity: cohesive within clusters

low inter-class similarity: distinctive between clusters

The quality of a clustering method depends on

the similarity measure used by the method

its implementation, and

its ability to discover some or all of the hidden patterns


Types of Clustering

1. Partitioning approach: Construct various partitions and then evaluate them by some criterion

2. Hierarchical approach: Create a hierarchical decomposition of the set of data (or objects) using some criterion

3. Density-based approach: Based on connectivity and density functions

4. Grid-based approach: Based on a multiple-level granularity structure


1) Partitioning Method

December 16, 2021, Data Mining: Concepts and Techniques


Algorithms for Partitioning Methods

K-Means Algorithm

K-Medoids Algorithm

CLARANS (Clustering Large Applications based upon RANdomized Search)


2) Hierarchical Methods


Algorithms for Hierarchical Methods

AGNES (AGglomerative NESting)

DIANA (DIvisive ANAlysis)

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

CHAMELEON (Clustering Using Dynamic Modeling)

DENCLUE


3) Density-Based Method


Algorithms for Density-Based Methods

DBSCAN (Density-Based Clustering Based on Connected Regions with High Density)

OPTICS (Ordering Points To Identify the Clustering Structure)


4) Grid-Based Clustering


Algorithms for Grid-Based Methods

STING (STatistical INformation Grid)

CLIQUE: An Apriori-like Subspace Clustering Method

WaveCluster


Common Distance Measures

The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.

Common choices include:

1. The Euclidean distance (also called 2-norm distance) is given by:

\[ d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2} \]

2. The Manhattan distance (also called taxicab norm or 1-norm) is given by:

\[ d(x, y) = \sum_{i} |x_i - y_i| \]
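The two distance measures named above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample points `p` and `q` are assumptions made for the example, not part of the slides.

```python
from math import sqrt


def euclidean_distance(x, y):
    """2-norm distance: square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """1-norm (taxicab) distance: sum of the absolute coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical sample points, chosen so the arithmetic is easy to check by hand.
p, q = (1.0, 1.0), (4.0, 5.0)
print(euclidean_distance(p, q))  # 5.0  (sqrt(3^2 + 4^2))
print(manhattan_distance(p, q))  # 7.0  (3 + 4)
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, so the choice of measure can change which centroid counts as "nearest".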


Partitioning Algorithms: Basic Concept

Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where $c_i$ is the centroid or medoid of cluster $C_i$)

\[ E = \sum_{i=1}^{k} \sum_{p \in C_i} \bigl( d(p, c_i) \bigr)^2 \]

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms
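As a rough illustration of the criterion E, the sketch below sums the squared distances from each object to the centroid of its cluster and compares two partitions of the same four points. The helper names and the data are hypothetical, and the centroid (rather than a medoid) is assumed as the cluster representative.

```python
def squared_distance(p, c):
    """Squared Euclidean distance d(p, c)^2 between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def mean_point(points):
    """Centroid (coordinate-wise mean) of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sum_squared_error(clusters):
    """E: sum over clusters of the squared distances to the cluster centroid."""
    total = 0.0
    for points in clusters:
        c = mean_point(points)
        total += sum(squared_distance(p, c) for p in points)
    return total


# Hypothetical data: the same four points partitioned in two different ways.
tight = [[(1, 1), (2, 1)], [(8, 8), (9, 8)]]
mixed = [[(1, 1), (8, 8)], [(2, 1), (9, 8)]]
print(sum_squared_error(tight))  # 1.0
print(sum_squared_error(mixed))  # 98.0
```

The partition that keeps nearby points together has the much smaller E, which is what a partitioning method tries to achieve without enumerating every possible partition.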


Distance formula (2-D)

\[ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \]


K-MEANS CLUSTERING

The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.

It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data.

It assumes that the object attributes form a vector space.


An algorithm for partitioning (or clustering) N data points into K disjoint subsets $S_j$ of the data points so as to minimize the sum-of-squares criterion

\[ J = \sum_{j=1}^{K} \sum_{n \in S_j} \lVert x_n - \mu_j \rVert^2 \]

where $x_n$ is a vector representing the nth data point and $\mu_j$ is the geometric centroid of the data points in $S_j$.


Simply speaking, k-means clustering is an algorithm to classify or group objects, based on their attributes/features, into K groups.

K is a positive integer.

The grouping is done by minimizing the sum of squared distances between the data and the corresponding cluster centroids.


An Example of K-Means Clustering

K=2

Figure: the initial data set; arbitrarily partition objects into k groups; update the cluster centroids; reassign objects; update the cluster centroids; loop if needed

Partition objects into k nonempty subsets

Repeat

Compute centroid (i.e., mean point) for each partition

Assign each object to the cluster of its nearest centroid

Until no change


How does the K-Means clustering algorithm work?


Step 1: Begin with a decision on the value of k, the number of clusters.

Step 2: Choose any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:

1. Take the first k training samples as single-element clusters.

2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.


Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroids of the cluster gaining the sample and the cluster losing it.

Step 4: Repeat Step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
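Steps 1 to 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the function names, the `max_passes` safety limit, the rule that a cluster's last remaining member is never moved (so no cluster becomes empty), and the toy data are all additions made for illustration.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sequential_kmeans(samples, k, max_passes=100):
    """Cluster samples into k groups; returns (labels, centroids)."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")

    # Step 2.1: the first k samples become single-element clusters.
    members = [[i] for i in range(k)]
    centroids = [tuple(float(v) for v in samples[i]) for i in range(k)]
    labels = list(range(k)) + [None] * (len(samples) - k)

    def nearest(point):
        # Ties are resolved in favour of the lower cluster index.
        return min(range(k), key=lambda j: dist(point, centroids[j]))

    # Step 2.2: assign each remaining sample to the nearest centroid and
    # recompute the centroid of the gaining cluster.
    for i in range(k, len(samples)):
        j = nearest(samples[i])
        members[j].append(i)
        labels[i] = j
        centroids[j] = centroid([samples[m] for m in members[j]])

    # Steps 3-4: sweep through the samples until a full pass changes nothing.
    for _ in range(max_passes):
        changed = False
        for i, point in enumerate(samples):
            old, new = labels[i], nearest(point)
            # Assumption: a cluster's only member stays put, so no cluster empties.
            if new == old or len(members[old]) == 1:
                continue
            members[old].remove(i)
            members[new].append(i)
            labels[i] = new
            centroids[old] = centroid([samples[m] for m in members[old]])
            centroids[new] = centroid([samples[m] for m in members[new]])
            changed = True
        if not changed:
            break
    return labels, centroids


# Hypothetical toy data: two visibly separate groups of three points.
toy = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = sequential_kmeans(toy, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Note that the two seeds here both come from the same group, yet the sweep in Step 3 moves the second seed back to its natural cluster once the centroids have shifted.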


A simple example showing the implementation of the k-means algorithm (using K=2)


Step 1:

Initialization: We randomly choose the following two centroids (k=2) for the two clusters.

In this case the two centroids are m1=(1.0,1.0) and m2=(5.0,7.0).


Step 2:

Thus, we obtain two clusters containing:

{1,2,3} and {4,5,6,7}.

Their new centroids are:


Step 3:

Now using these centroids we compute the Euclidean distance of each object, as shown in table.

Therefore, the new clusters are:

{1,2} and {3,4,5,6,7}

Next centroids are:

m1=(1.25,1.5) and m2=(3.9,5.1)


Step 4 :

The clusters obtained are:

{1,2} and {3,4,5,6,7}

Therefore, there is no change in the clusters.

Thus, the algorithm halts here, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
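The four steps of this example can be replayed with a short batch k-means loop. The data table of the example is not present in the text, so the seven points below are an assumption: they were chosen because they reproduce the initial centroids, the intermediate clusters, and the final centroids quoted in the steps. Object 3 is equidistant from both initial centroids in this assumed data; the sketch resolves the tie in favour of the first cluster.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# ASSUMED data set (objects 1..7); not taken from the slides themselves.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centroids = [(1.0, 1.0), (5.0, 7.0)]  # m1 and m2 from Step 1


def assign(points, centroids):
    """Index of the nearest centroid for each point (ties go to the lower index)."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def recompute(points, labels, centroids):
    """Mean point of each cluster; an empty cluster keeps its old centroid."""
    updated = []
    for j, old in enumerate(centroids):
        cluster = [p for p, label in zip(points, labels) if label == j]
        if cluster:
            updated.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
        else:
            updated.append(old)
    return updated


labels = None
history = []  # cluster membership (1-based object numbers) after each pass
while True:
    new_labels = assign(points, centroids)
    if new_labels == labels:  # Step 4: no object changed cluster, so stop
        break
    labels = new_labels
    centroids = recompute(points, labels, centroids)
    history.append([[i + 1 for i, label in enumerate(labels) if label == j]
                    for j in range(len(centroids))])

for clusters in history:
    print(clusters)
print(centroids)
# Final pass: clusters [[1, 2], [3, 4, 5, 6, 7]], centroids (1.25, 1.5) and (3.9, 5.1)
```

Under these assumptions the first pass yields {1,2,3} and {4,5,6,7}, and the second pass yields {1,2} and {3,4,5,6,7}, after which nothing changes.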

Plot (K=2)

Plot (with K=3)

Exercise

Consider the 1-D data set {1,2,3,4,7,9}, where K=2.

Identify the clusters and their centroids.

Tip use


Homework

Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:

A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).

Plot the resulting clusters in a scatter plot.


What Is the Problem of the K-Means Method?

The k-means algorithm is sensitive to outliers!

An object with an extremely large value may substantially distort the distribution of the data.

K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
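A small sketch makes the difference concrete. It compares the mean and the medoid of a 1-D cluster before and after one value is replaced by an outlier. The data and function names are hypothetical, and the medoid is taken here to be the member with the smallest total distance to all other members.

```python
def mean(values):
    """Arithmetic mean, the reference point used by k-means."""
    return sum(values) / len(values)


def medoid(values):
    """The member of the cluster with the smallest total distance to the others."""
    return min(values, key=lambda v: sum(abs(v - w) for w in values))


# Hypothetical 1-D cluster, and the same cluster with one extreme value.
clean = [1, 2, 3, 4, 5]
noisy = [1, 2, 3, 4, 100]

print(mean(clean), medoid(clean))  # 3.0 3
print(mean(noisy), medoid(noisy))  # 22.0 3
```

The single outlier drags the mean from 3.0 to 22.0, a value far from every actual member, while the medoid stays at 3 because it must be one of the objects in the cluster.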


Determine the Number of Clusters

 Empirical method

# of clusters: k ≈ √(n/2) for a dataset of n points, e.g., n = 200, k = 10

How many for n = 900?
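The empirical rule is easy to evaluate directly. The sketch below reads it as the square root of n/2, which is the reading that matches the n = 200, k = 10 example; the function name and the rounding to the nearest integer are assumptions.

```python
from math import sqrt


def rule_of_thumb_k(n):
    """Empirical estimate k ≈ sqrt(n / 2), rounded to the nearest integer."""
    if n < 1:
        raise ValueError("n must be a positive number of points")
    return round(sqrt(n / 2))


print(rule_of_thumb_k(200))  # 10
print(rule_of_thumb_k(900))  # 21  (sqrt(450) is about 21.2)
```

The result is only a starting point; the rule says nothing about the structure of a particular data set.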


Measuring Clustering Quality

3 kinds of measures: External, internal and relative

External: supervised, employ criteria not inherent to the dataset

Compare a clustering against prior or expert-specified knowledge using a certain clustering quality measure

Internal: unsupervised, criteria derived from the data itself

Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are, e.g., the silhouette coefficient

Relative: directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm
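As an illustration of an internal measure, the sketch below computes the mean silhouette coefficient of a clustering. For each object, a is its average distance to the other members of its own cluster, b is the smallest average distance to the members of any other cluster, and the score is (b - a) / max(a, b). The function name, the score of 0 for single-member clusters, and the sample data are assumptions made for this example.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def silhouette_coefficient(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points; ranges from -1 to 1."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-member clusters
            continue
        # a: compactness, average distance to the other members of p's cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: separation, average distance to the closest other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two tight pairs, labelled sensibly and then badly.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_coefficient(pts, [0, 0, 1, 1]))  # close to 1: compact, well separated
print(silhouette_coefficient(pts, [0, 1, 0, 1]))  # negative: clusters are mixed up
```

Because the score uses only the data and the labels, it can also serve as a relative measure, for example to compare the clusterings obtained for different values of k.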


Visualization of Clustering


Summary

Cluster analysis groups objects based on their similarity and has wide applications

Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods

K-means and K-medoids algorithms are popular partitioning-based clustering algorithms

BIRCH and CHAMELEON are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms

DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms

STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm

Quality of clustering results can be evaluated in various ways
