Chapter 3: Cluster Analysis

(1)


` 3.1 Basic Concepts of Clustering

` 3.2 Partitioning Methods

` 3.3 Hierarchical Methods

` 3.4 Density-Based Methods

` 3.5 Model-Based Methods

` 3.6 Clustering High-Dimensional Data

` 3.7 Outlier Analysis

3.7.1 Definition

3.7.2 Statistical-Based Methods

3.7.3 Distance-Based Methods

3.7.4 Density-Based Local Methods

3.7.5 Deviation-Based Methods

(2)

3.7.1 Definition

` Outliers: data objects that do not comply with the general behavior or model of the data

` Outlier detection and analysis is referred to as outlier mining

` Outlier mining has many applications

Fraud detection

Detecting unusual usage of telecommunication services

Identifying the spending behavior of customers with extremely low or extremely high incomes

Finding unusual responses to various medical treatments

Etc.

(3)

Outlier Mining

` Given a set of n data objects and k, the expected number of outliers

` Find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data

` The outlier mining problem can be seen as two sub-problems

1) Define what data can be considered as inconsistent in a given data set

2) Find an efficient method to mine the outliers so defined

` Data visualization methods are weak at detecting outliers in data with many categorical attributes or data of high dimensionality

` Investigate computer-based techniques to detect outliers

(4)

3.7.2 Statistical Distribution-Based Methods

` Assume a distribution model for the given data set (e.g., Normal)

` Identify outliers with respect to the model using a discordancy test

` How does it work?

` Examine two hypotheses: a working hypothesis and an alternative hypothesis

` A working hypothesis H is a statement that the entire data set of n objects comes from an initial distribution model F that is:

H: o_i∈F, where i=1,2,…,n

` The hypothesis H is retained if there is no statistically significant

(5)

Discordancy Test

` Verifies whether an object o_i is significantly large (or small) in relation to the distribution F

` Principle

Choose a statistic T for discordancy testing

Consider the value v_i of the statistic for an object o_i

If the significance probability SP(v_i) = Prob(T > v_i) is sufficiently small, o_i is discordant

The working hypothesis is rejected

An alternative hypothesis ¬H, which says that o_i comes from another distribution model G, is adopted

` The result depends on which model F is chosen, because o_i may be an outlier under one model and a perfectly valid value under

(6)

Discordancy Test: Example

` Let o₁,…,o_n represent the data objects

` Compute the sample mean µ and the standard deviation σ

` If an object o_i is suspected to be an outlier

Compute the test statistic T

$T = \frac{|o_i - \mu|}{\sigma}$

If T exceeds some critical value, then o_i is an outlier

(7)

Discordancy Test: Example

` Consider the following ordered data: 3.84, 4.26, 4.53, 4.60, 5.28, 5.29, 5.74, 5.86

` Consider an additional sample P: 10 (it is suspected that this point might be an outlier)

` Compute µ and σ over all n = 9 values, including the suspected outlier (the figures below are only reproduced when P is included)

$\mu = 5.48, \quad \sigma = 1.82$

$T = \frac{|10 - 5.48|}{1.82} = 2.48$

` With n = 9 and level of significance α = 0.05, the critical value is 2.110

` Since T > 2.110, there is evidence that P is an outlier
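The arithmetic of this example can be checked with a short script. This is a minimal sketch, assuming that µ and σ are taken over all nine values (the eight ordered values plus P) and that σ is the sample standard deviation with an n − 1 denominator, since that choice reproduces the quoted σ = 1.82 and T = 2.48. The variable names are illustrative only.

```python
import statistics

# Ordered data from the example plus the suspected point P = 10.
data = [3.84, 4.26, 4.53, 4.60, 5.28, 5.29, 5.74, 5.86]
suspect = 10.0
sample = data + [suspect]

mu = statistics.mean(sample)       # mean over all n = 9 values
sigma = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)
T = abs(suspect - mu) / sigma      # test statistic T = |o_i - mu| / sigma

CRITICAL_VALUE = 2.110             # quoted in the example for n = 9, alpha = 0.05
is_outlier = T > CRITICAL_VALUE

print(f"mu={mu:.2f} sigma={sigma:.2f} T={T:.2f} outlier={is_outlier}")
```

The unrounded mean is about 5.489, so the script may print 5.49 where the slide shows 5.48; the test statistic and the conclusion are unaffected.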

(8)

Alternative Distributions

Inherent Alternative Distribution

` The working hypothesis that all objects come from distribution F is rejected

` The alternative hypothesis assumes that all objects come from another distribution G

¬H: o_i∈G, where i=1,2,…,n

` F and G may be different distributions

` Or F and G may be the same distribution but with different parameters

` Distribution G must have the potential to produce outliers (a

(9)

Alternative Distributions

Mixture Alternative Distribution

` The discordant values are not outliers in F population but contaminants from some other population G

` The alternative hypothesis is

¬H: o_i∈ (1-λ) F+ λG, where i=1,2,…,n

Slippage Alternative Distribution

` All objects (except a small number) are from initial model F, with its given parameters

` The remaining objects are from a modified version of F in which the parameters have been shifted

(10)

Characteristics of Statistical-Based Methods

` Most tests are for single attributes, whereas many problems need to find outliers in multidimensional space

` Statistical approaches require knowledge about parameters of the data set

` Statistical methods do not guarantee that all outliers will be found, for example when

No specific test has been developed for the case at hand

The observed distribution cannot be adequately modeled with any standard distribution

(11)

3.7.3 Distance-Based Methods

` Generalize the ideas behind statistical discordancy tests

` Distance-based outliers are those objects that do not have “enough” neighbors

` Formally

Define a DB(pct, dmin)-outlier: a distance-based outlier with parameters pct and dmin

An object o is a DB(pct, dmin)-outlier if at least a fraction pct of the objects in the data set lie at a distance greater than dmin from o

` Avoids excessive computation related to fitting the observed data into some standard distribution and selecting discordancy tests
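The DB(pct, dmin) definition translates directly into a brute-force check. The sketch below is one possible reading, not a reference implementation: it assumes Euclidean distance, takes the fraction over all n objects in the data set, and uses a hypothetical helper name `db_outliers`. It runs in O(n²) time.

```python
import math

def db_outliers(points, pct, dmin):
    """Return the points that are DB(pct, dmin)-outliers (naive O(n^2) scan)."""
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        # Count the other objects lying farther than dmin from o.
        far = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(o, q) > dmin
        )
        # Assumption: the fraction is taken over all n objects in the data set.
        if far / n >= pct:
            outliers.append(o)
    return outliers

# Four points close together and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
result = db_outliers(points, pct=0.75, dmin=1.0)
print(result)
```

With these toy values only the isolated point has at least 75% of the data set farther than dmin away, so it is the only object reported.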

(12)

Distance-Based Algorithms

Index-based algorithms

` Use multidimensional indexing structures such as R-trees or k-d trees to search for neighbors of each object o

(13)

Distance-Based Algorithms

` Find the neighbors of object o within a radius dmin

` M is the maximum number of objects within the dmin-neighborhood of an outlier

` Once M+1 neighbors of object o are found, o is not an outlier

` Worst-case complexity of O(n^2 k)

n: number of objects

k: dimensionality

` Complexity is in search time. Building the index can be computationally very expensive

(14)

Distance-Based Algorithms

Cell-based algorithms

` The data space is partitioned into cells with a side length equal to $\frac{\mathit{dmin}}{2\sqrt{k}}$

` dmin: radius around objects

` k: dimensionality

` Each cell has two layers surrounding it

First layer is 1 cell thick

Second layer is $\lceil 2\sqrt{k} - 1 \rceil$ cells thick (rounded up to the closest integer)

(15)

Distance-Based Algorithms

` Count outliers on a cell-by-cell rather than an object-by-object basis

` For a given cell, the algorithm accumulates three counts

The number of objects in the cell: C

The number of objects in the cell and the first layer together: C+1

The number of objects in the cell and both layers together: C+2

(16)

Distance-Based Algorithms

` Assume M to be a threshold used to detect outliers

` An object o in the current cell can be an outlier only if C+1 ≤ M; otherwise none of the objects in the cell are outliers

` If C+2 ≤ M, all the objects in the cell are outliers

` If C+2 > M, some objects in the cell may still be outliers, so object-by-object processing is needed

Only objects that have no more than M objects in their dmin-neighborhood are outliers

The dmin-neighborhood consists of the object's cell, all of its first layer, and some of its second layer
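The cell-level pruning rules can be summarized in a few lines. The following is a hedged sketch under stated assumptions: an outlier may have at most M objects in its dmin-neighborhood, the cell side is dmin / (2√k), and the second layer is ⌈2√k − 1⌉ cells thick, as in the usual formulation of the cell-based algorithm. The function names `cell_geometry` and `cell_verdict` are hypothetical.

```python
import math

def cell_geometry(dmin, k):
    """Cell side length and second-layer thickness (in cells) for k dimensions."""
    side = dmin / (2 * math.sqrt(k))
    layer2 = math.ceil(2 * math.sqrt(k) - 1)
    return side, layer2

def cell_verdict(c1, c2, m):
    """Classify a whole cell from its accumulated counts.

    c1 : objects in the cell plus its first layer
    c2 : objects in the cell plus both layers
    m  : maximum number of objects allowed in an outlier's dmin-neighborhood
    """
    if c1 > m:
        return "no outliers"      # every object already has more than m neighbors
    if c2 <= m:
        return "all outliers"     # no object can have more than m neighbors
    return "check objects"        # fall back to object-by-object processing

side, layer2 = cell_geometry(dmin=1.0, k=2)
print(side, layer2)
print(cell_verdict(c1=5, c2=9, m=3))
print(cell_verdict(c1=2, c2=3, m=3))
print(cell_verdict(c1=2, c2=7, m=3))
```

Only cells that fall into the third case need the more expensive per-object distance computations, which is where the savings over the object-by-object approach come from.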

(17)

Characteristics of Distance-Based Methods

` The cell-based algorithm avoids O(n^2) computational complexity

` Its complexity is O(c^k + n)

c is a constant depending on the number of cells

k is the dimensionality

n is the number of objects

` Developed for memory-resident data sets

` Requires the user to set both dmin and pct

` Finding suitable settings for these parameters can involve much trial and error

(18)

3.7.4 Density-Based Methods

` Statistical and distance-based methods depend on the overall “global” distribution of data

` Data are usually not uniformly distributed

` Data can have different density distributions

[Figure: clusters C₁ and C₂ with objects o₁ and o₂]

(19)

Density-Based Methods

` Define Local Outliers

An object is a local outlier if it is outlying relative to its local neighborhood (with respect to the density of the neighborhood)

` Does not consider being an outlier as a binary property

Assess the degree to which an object is an outlier

The degree of “outlierness” is computed as the Local Outlier Factor (LOF) of an object

The degree depends on how isolated the object is with respect to the surrounding neighborhood

(20)

Density-Based Methods

` To define the local outlier factor of an object, the following concepts should be introduced

k-distance

k-distance neighborhood

Reachability distance

(21)

K-distance & K-distance neighborhood

` The k-distance of an object p is the maximum distance between p and any of its k nearest neighbors

Denoted k-distance(p)

How is k determined?

The LOF method sets k to the parameter MinPts used in density-based clustering (e.g., MinPts=4), giving the MinPts-distance

` The k-distance neighborhood of an object p contains the MinPts-nearest neighbors of p

Denoted N_k-distance(p) or N_k(p), also N_MinPts(p)

(22)

Reachability distance

` The reachability distance of an object p with respect to object o (where o is within the MinPts-nearest neighbors of p) is denoted

reach_dist_MinPts(p,o)

` reach_dist_MinPts(p,o) = max{MinPts_distance(o), d(p,o)}

` If p is far away from o, the reachability distance between the two is simply their actual distance

` If they are close, then the actual distance is replaced by the MinPts_distance of o
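The two cases above can be seen on a toy data set. This is a small sketch assuming Euclidean distance and distinct points; the helper names `k_distance` and `reach_dist` are illustrative and not taken from any library.

```python
import math

def k_distance(p, points, k):
    """Distance from p to its k-th nearest neighbor among the other points."""
    dists = sorted(math.dist(p, q) for q in points if q != p)
    return dists[k - 1]

def reach_dist(p, o, points, min_pts):
    """reach_dist_MinPts(p, o) = max{MinPts_distance(o), d(p, o)}."""
    return max(k_distance(o, points, min_pts), math.dist(p, o))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (6.0, 0.0)]
o = (0.0, 0.0)

near = reach_dist((1.0, 0.0), o, points, min_pts=3)  # p close to o
far = reach_dist((6.0, 0.0), o, points, min_pts=3)   # p far from o
print(near, far)
```

For the nearby point the actual distance (1) is replaced by the MinPts-distance of o (about 1.414), while for the distant point the reachability distance is just the actual distance (6).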

(23)

Local Outlier Factor (LOF)

` The local reachability density of p is the inverse of the average reachability distance based on the MinPts-nearest neighbors of p

$lrd_{MinPts}(p) = \frac{|N_{MinPts}(p)|}{\sum_{o \in N_{MinPts}(p)} reach\_dist_{MinPts}(p, o)}$

` The local outlier factor (LOF) of p captures the degree to which we call p an outlier

$LOF_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}}{|N_{MinPts}(p)|}$
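The local reachability density and LOF definitions can be put together in a brute-force sketch. It assumes Euclidean distance, distinct points, and a neighborhood of exactly MinPts neighbors with ties broken by input order; all function names are hypothetical, and the code is meant for illustration on tiny data sets rather than as an efficient implementation.

```python
import math

def knn(p, points, k):
    """The k nearest neighbors of p among the other points (ties broken by order)."""
    others = [q for q in points if q != p]
    return sorted(others, key=lambda q: math.dist(p, q))[:k]

def k_distance(p, points, k):
    """Distance from p to its k-th nearest neighbor."""
    return math.dist(p, knn(p, points, k)[-1])

def lrd(p, points, k):
    """Local reachability density: |N(p)| divided by the summed reachability distances."""
    neighbors = knn(p, points, k)
    total = sum(max(k_distance(o, points, k), math.dist(p, o)) for o in neighbors)
    return len(neighbors) / total

def lof(p, points, k):
    """Local outlier factor: mean ratio lrd(o) / lrd(p) over the neighbors of p."""
    neighbors = knn(p, points, k)
    lrd_p = lrd(p, points, k)
    return sum(lrd(o, points, k) / lrd_p for o in neighbors) / len(neighbors)

# A tight 2x2 grid plus one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
scores = {p: lof(p, points, k=2) for p in points}
for p, score in scores.items():
    print(p, round(score, 2))
```

Points inside the grid get a LOF close to 1, because their density matches that of their neighbors, while the isolated point gets a score well above 1.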

(24)

3.7.5 Deviation-Based Methods

` Identify outliers by examining the main characteristics of objects in a group

` Objects that deviate from this description are outliers

` The term deviation is used to refer to outliers

` Two main methods

Sequential Exception Technique

(25)

Summary of Chapter 3

` A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters

` Clustering can be used as

a main task to gain insights about the data

a preprocessing step for other data mining algorithms

` Several applications

Market segmentation

Pattern recognition

Biological studies

Spatial data analysis

(26)

Summary of Chapter 3

` The quality of clustering can be assessed based on dissimilarity of objects

` Many techniques have been developed

Partitioning methods

Hierarchical methods

Density-based methods

Grid-based methods

Model-based methods

Clustering high-dimensional data

Constraint-based methods

(27)

Applications and Tools in Data Mining

(28)

1. Financial Data Analysis

` Banks and financial institutions offer a wide variety of banking services

Checking and savings accounts for business or individual customers

Credit (business, mortgage, and automobile loans)

Investment services (mutual funds)

Insurance services and stock investment services

` Financial data is relatively complete, reliable, and of high quality

(29)

1. Financial Data Analysis

` Design of data warehouses for multidimensional data analysis and data mining

Construct data warehouses (data come from different sources)

Multidimensional analysis: e.g., view the revenue changes by month, by region, by sector, etc., along with statistical information such as the mean, the maximum, and the minimum values

Characterization and class comparison

(30)

1. Financial Data Analysis

` Loan payment prediction and customer credit policy analysis

Attribute selection and attribute relevance ranking may help identify important factors and eliminate irrelevant ones

Examples of factors related to the risk of loan payment

Term of the loan

Debt ratio

Payment-to-income ratio

Customer income level

Education level

Residence region

The bank can adjust its decisions

(31)

2. Retail Industry

` Retailers collect huge amounts of data on sales, customer shopping history, goods transportation, consumption and service, etc.

` Many stores have web sites where you can buy online. Some of them exist only online (e.g., Amazon)

` Data mining helps to

Identify customer buying behaviors

Discover customer shopping patterns and trends

Improve the quality of customer service

Achieve better customer satisfaction

Design more effective goods transportation

Reduce the cost of business

(32)

2. Retail Industry

` Design data warehouses

` Multidimensional analysis

` Analysis of the effectiveness of sales campaigns

Advertisements, coupons, discounts, bonuses, etc.

Compare transactions that contain sale items during and after the campaign

` Customer retention

Analyze the change in customers' behaviors

` Product recommendation

Mining association rules

(33)

3. Telecommunication Industry

` Many different ways of communicating

Fax, cellular phone, Internet messenger, images, e-mail, computer and Web data transmission, etc.

` Great demand for data mining to help

Understand the business involved

Identify telecommunication patterns

Catch fraudulent activities

Make better use of resources

Improve the quality of service

(34)

3. Telecommunication Industry

` Multidimensional analysis (several attributes)

Several features: Calling time, Duration, Location of caller, Location of callee, Type of call, etc.

Compare data traffic, system workload, resource usage, user group behavior, and profit

` Fraudulent Pattern Analysis

Identify potential fraudulent users

Detect attempts to gain fraudulent entry to customer accounts

(35)

4. Many Other Applications

` Biological Data Analysis

E.g., identification and analysis of the genomes of humans and other species

` Web Mining

E.g., explore linkage between web pages to compute authority scores (PageRank algorithm)

` Intrusion detection

Detect any action that threatens file integrity,

(36)

How to Choose a Data Mining System (Tool)?

` Do data mining systems share the same well-defined operations and a standard query language?

` No

` Many commercial data mining systems have little in common

Different functionalities

Different methodologies

Different data sets

` You need to carefully choose the data mining system that is appropriate for your task

(37)

How to Choose a Data Mining System (Tool)?

` Data Types

Available systems handle formatted, record-based, relational-like data with numerical and nominal attributes

The data could be in the form of ASCII text, relational databases, or data warehouse data

It is important to check which kinds of data the system you are choosing can handle

` Operating System

A data mining system may run on only one operating system

The most popular operating systems that host data mining tools are UNIX/Linux and Microsoft Windows

Large industry data mining systems adopt a client-server architecture

(38)

How to Choose a Data Mining System (Tool)?

` Data Sources

Data formats

Some systems work only with ASCII text files, whereas many others work with databases

It is important that the data mining system supports ODBC connections (Open Database Connectivity)

` Data Mining Functions and Methodologies

Some systems provide only one data mining function (e.g., classification). Other systems support many functions

For a given data mining function (e.g., classification), some systems support only one method. Other systems may support many methods (k-nearest neighbor, naive Bayesian, etc.)

(39)

How to Choose a Data Mining System (Tool)?

` Coupling data mining with database (data warehouse) systems

No coupling

A DM system will not use any function of a DB/DW system

Fetch data from a particular source (file)

Process data and then store results in a file

Loose coupling

A DM system uses some facilities of a DB/DW system

Fetch data from data repositories managed by a DB/DW

Store results in a file or in the DB/DW

Semi-tight coupling

Efficient implementation of a few essential data mining primitives (sorting, indexing, histogram analysis) is provided by the DB/DW

Tight coupling

A DM system is smoothly integrated into the DB/DW

Data mining queries are optimized

` Tight coupling is highly desirable because it facilitates

(40)

How to Choose a Data Mining System (Tool)?

` Scalability

Query execution time should increase linearly with the number of dimensions

` Visualization

“A picture is worth a thousand words”

The quality and the flexibility of visualization tools may strongly influence usability, interpretability and attractiveness of the system

` Data Mining Query Language and Graphical User Interface

High-quality user interface

(41)

Examples of Commercial Data Mining Tools

Database system and graphics vendors

` Intelligent Miner (IBM)

` Microsoft SQL Server 2005

` MineSet (Purple Insight)

(42)

Examples of Commercial Data Mining Tools

Vendors of statistical analysis or data mining software

` Clementine (SPSS)

` Enterprise Miner (SAS Institute)

(43)

Examples of Commercial Data Mining Tools

Machine learning community

` CART (Salford Systems)

` See5 and C5.0 (RuleQuest)

(44)
