Data Mining: Concepts and Techniques (3rd ed.)

Chapter 10

Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University

Chapter 10. Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Evaluation of Clustering

Summary

What is Cluster Analysis?

• Cluster: a collection of data objects
  • similar (or related) to one another within the same group
  • dissimilar (or unrelated) to the objects in other groups
• Cluster analysis
  • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
  • As a stand-alone tool to get insight into data distribution
  • As a preprocessing step for other algorithms

What is Clustering?

Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.

Considerations for Cluster Analysis

• Partitioning criteria
  • Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
• Separation of clusters
  • Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
• Similarity measure
  • Distance-based (e.g., Euclidean) vs. connectivity-based (e.g., density)
• Clustering space
  • Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)

Quality: What Is Good Clustering?

• A good clustering method will produce high quality clusters
  • high intra-class similarity: cohesive within clusters
  • low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
  • the similarity measure used by the method,
  • its implementation, and
  • its ability to discover some or all of the hidden patterns

Types of Clustering

1. Partitioning approach:
   • Construct various partitions and then evaluate them by some criterion
2. Hierarchical approach:
   • Create a hierarchical decomposition of the set of data (or objects) using some criterion
3. Density-based approach:
   • Based on connectivity and density functions
4. Grid-based approach:
   • Based on a multiple-level granularity structure

1) Partitioning Method

May 29, 2019 Data Mining: Concepts and Techniques

Algorithms for Partitioning Methods

K-Means Algorithm

K-Medoids Algorithm

CLARANS (Clustering Large Applications based upon RANdomized Search)

2) Hierarchical Methods

Algorithms for Hierarchical Methods

AGNES (AGglomerative NESting)

DIANA (DIvisive ANAlysis)

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

CHAMELEON (Clustering Using Dynamic Modeling)

3) Density-Based Method

Algorithms for Density-Based Methods

DBSCAN (Density-Based Clustering Based on Connected Regions with High Density)

OPTICS (Ordering Points to Identify the Clustering Structure)

4) Grid-Based Clustering

Algorithms for Grid-Based Methods

STING (STatistical Information Grid)

CLIQUE: An Apriori-like Subspace Clustering Method

Common Distance Measures

The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.

They include, for two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$:

1. The Euclidean distance (also called 2-norm distance) is given by:

   $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

2. The Manhattan distance (also called taxicab norm or 1-norm) is given by:

   $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

Distance formula (2-D)
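For two points in the plane, (x1, y1) and (x2, y2), the Euclidean distance reduces to sqrt((x2 - x1)^2 + (y2 - y1)^2). The following is a minimal Python sketch of the Euclidean and Manhattan measures; the function names `euclidean` and `manhattan` are illustrative choices, not part of the slides.

```python
import math


def euclidean(p, q):
    """2-norm distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """1-norm (taxicab) distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    p, q = (1.0, 1.0), (4.0, 5.0)
    print(euclidean(p, q))  # 5.0
    print(manhattan(p, q))  # 7.0
```

The two measures can rank neighbours differently, which is one way the choice of distance influences cluster shape.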

How does the K-Means clustering algorithm work?

An Example of K-Means Clustering (K=2)

[Figure: starting from the initial data set, arbitrarily partition the objects into k groups, update the cluster centroids, reassign objects, and loop if needed.]

• Partition objects into k nonempty subsets
• Repeat
  • Compute centroid (i.e., mean point) for each partition
  • Assign each object to the cluster of its nearest centroid
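The loop above can be sketched in Python roughly as follows. This is a sketch under assumptions: it starts from caller-supplied initial centroids rather than an initial partition, uses Euclidean distance, and keeps the old centroid if a cluster becomes empty. The names `kmeans`, `nearest`, and `mean_point` are hypothetical helpers.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: math.dist(point, centroids[j]))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until no object moves."""
    centroids = list(centroids)       # do not modify the caller's list
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:      # no object changed cluster: stop
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:               # assumption: empty cluster keeps its centroid
                centroids[j] = mean_point(members)
    return labels, centroids


if __name__ == "__main__":
    data = [(1, 1), (1, 2), (8, 8), (9, 8)]
    labels, cents = kmeans(data, [(1, 1), (9, 8)])
    print(labels)  # [0, 0, 1, 1]
    print(cents)   # [(1.0, 1.5), (8.5, 8.0)]
```

The result depends on the starting centroids, so in practice the procedure is often run several times with different starts.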

Step 1: Begin with a decision on the value of k = number of clusters.

Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:

1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.

Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroid of the cluster gaining the new sample and the cluster losing the sample.

Step 4: Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
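Steps 1 to 4 update centroids after every single move, which differs slightly from recomputing all centroids once per pass. A possible Python sketch of this sequential variant follows. The function name `sequential_kmeans`, the `max_passes` cap, and the rule that a sample never leaves a one-element cluster are assumptions added for the illustration.

```python
import math


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: math.dist(point, centroids[j]))


def sequential_kmeans(samples, k, max_passes=100):
    # Step 2.1: the first k samples become single-element clusters.
    clusters = [[i] for i in range(k)]
    centroids = [tuple(float(v) for v in samples[i]) for i in range(k)]
    # Step 2.2: each remaining sample joins the nearest cluster, whose
    # centroid is recomputed right away.
    for i in range(k, len(samples)):
        j = nearest(samples[i], centroids)
        clusters[j].append(i)
        centroids[j] = mean_point([samples[m] for m in clusters[j]])
    # Steps 3-4: move misplaced samples one at a time until a full pass
    # causes no new assignment.
    for _ in range(max_passes):
        moved = False
        for i, s in enumerate(samples):
            cur = next(j for j, c in enumerate(clusters) if i in c)
            best = nearest(s, centroids)
            # Assumption: never empty a cluster by moving its last member.
            if best != cur and len(clusters[cur]) > 1:
                clusters[cur].remove(i)
                clusters[best].append(i)
                centroids[cur] = mean_point([samples[m] for m in clusters[cur]])
                centroids[best] = mean_point([samples[m] for m in clusters[best]])
                moved = True
        if not moved:
            break
    return [sorted(c) for c in clusters], centroids


if __name__ == "__main__":
    data = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0)]
    clusters, cents = sequential_kmeans(data, 2)
    print(clusters)  # [[0, 2, 4], [1, 3]]
```

Clusters are returned as lists of sample indices, so the result can be read against the original ordering of the training samples.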

A simple example showing the implementation of the k-means algorithm (using K=2)

Step 1:

Initialization: we randomly choose the following two centroids (k=2) for the two clusters.

In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

Step 2:

Thus, we obtain two clusters containing {1,2,3} and {4,5,6,7}.

Step 3:

Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.

Therefore, the new clusters are {1,2} and {3,4,5,6,7}.

The next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).

Step 4:

The clusters obtained are {1,2} and {3,4,5,6,7}.

Therefore, there is no change in the clusters. Thus, the algorithm comes to a halt here, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
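The data table for this example did not survive in the text, so the numbers above cannot be checked directly. The sketch below uses seven illustrative points that are an assumption: they were chosen only because they reproduce the clusters and centroids quoted in Steps 2 to 4. The helper names `centroid` and `reassign` are hypothetical.

```python
import math

# ASSUMPTION: illustrative coordinates, not taken from the slides; they are
# consistent with m1 = (1.25, 1.5) and m2 = (3.9, 5.1) quoted in Step 3.
points = {1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
          5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5)}


def centroid(ids):
    """Mean point of the objects with the given ids."""
    xs, ys = zip(*(points[i] for i in ids))
    return (sum(xs) / len(ids), sum(ys) / len(ids))


def reassign(m1, m2):
    """Split all objects by their nearer centroid (ties go to m1)."""
    c1 = [i for i, p in points.items() if math.dist(p, m1) <= math.dist(p, m2)]
    c2 = [i for i in points if i not in c1]
    return c1, c2


# Step 3: centroids of the Step 2 clusters, then one reassignment pass.
m1, m2 = centroid([1, 2, 3]), centroid([4, 5, 6, 7])
c1, c2 = reassign(m1, m2)
m1, m2 = centroid(c1), centroid(c2)

# Step 4: another pass with the updated centroids changes nothing.
final = reassign(m1, m2)

print(c1, c2)        # [1, 2] [3, 4, 5, 6, 7]
print(final == (c1, c2))  # True
```

With these assumed points, object 3 is the only one that switches cluster in Step 3, which matches the move from {1,2,3} to {1,2} described above.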

Class work

Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:

A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).

Measuring Clustering Quality

• 3 kinds of measures: external, internal, and relative
• External: supervised, employ criteria not inherent to the dataset
  • Compare a clustering against prior or expert-specified knowledge using a certain clustering quality measure
• Internal: unsupervised, criteria derived from the data itself
  • Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are, e.g., the silhouette coefficient
• Relative: directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm
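As an illustration of an internal measure, the silhouette coefficient of an object compares a, its mean distance to the other objects in its own cluster, with b, its smallest mean distance to the objects of any other cluster: s = (b - a) / max(a, b). Below is a minimal Python sketch, assuming Euclidean distance, at least two clusters, and the common convention that an object in a one-element cluster scores 0. The function name `silhouette` is illustrative.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)        # convention for singleton clusters
            continue
        # a: mean distance to the other members (dist(p, p) adds 0).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(silhouette(data, [0, 0, 1, 1]))  # close to 1: compact, well separated
    print(silhouette(data, [0, 1, 0, 1]))  # negative: objects sit in the wrong cluster
```

Values near 1 indicate compact, well-separated clusters, while negative values suggest that objects are closer to another cluster than to their own.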

Visualization of Clustering

Summary

• Cluster analysis groups objects based on their similarity and has wide applications
• Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
• K-means and K-medoids algorithms are popular partitioning-based clustering algorithms
• BIRCH and Chameleon are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
• DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
• STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm
