• No results found

Data Mining: An Overview. David Madigan

N/A
N/A
Protected

Academic year: 2021

Share "Data Mining: An Overview. David Madigan"

Copied!
40
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Mining: An Overview

David Madigan

http://www.stat.columbia.edu/~madigan

(2)

Overview

• Brief Introduction to Data Mining

• Data Mining Algorithms

• Specific Examples

– Algorithms: Disease Clusters

– Algorithms: Model-Based Clustering

– Algorithms: Frequent Items and Association Rules

• Future Directions, etc.

(3)

Of “Laws”, Monsters, and Giants…

• Moore’s law: processing “capacity” doubles every 18 months : CPU, cache, memory

• It’s more aggressive cousin:

– Disk storage “capacity” doubles every 9 months

1E+3 1E+4 1E+5 1E+6 1E+7

1988 1991 1994 1997 2000

disk TB growth:

112%/y

Moore's Law:

58.7%/y

ExaByte

Disk TB Shipped per Year

1998 Disk Trend (Jim Porter) http://www.disktrend.com/pdf/portrpkg.pdf.

What do the two

“laws” combined produce?

A rapidly growing gap

between our ability to

generate data, and our

ability to make use of it.

(4)

What is Data Mining?

Finding interesting structure in data

• Structure: refers to statistical patterns, predictive models, hidden relationships

• Examples of tasks addressed by Data Mining

– Predictive Modeling (classification, regression) – Segmentation (Data Clustering )

– Summarization

– Visualization

(5)
(6)
(7)

Ronny Kohavi, ICML 1998

(8)

Data Mining Algorithms

“A data mining algorithm is a well-defined

procedure that takes data as input and produces output in the form of models or patterns”

“well-defined”: can be encoded in software

“algorithm”: must terminate after some finite number of steps

Hand, Mannila, and Smyth

(9)

Algorithm Components

1. The task the algorithm is used to address (e.g.

classification, clustering, etc.)

2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)

3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)

4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)

5. The data management technique used for storing, indexing, and retrieving data (critical when data too large to reside in memory)

(10)
(11)

Backpropagation data mining algorithm

x1 x2 x3 x4

h1 h2

y

•vector of p input values multiplied by p × d1 weight matrix

•resulting d1 values individually transformed by non-linear function

•resulting d1 values multiplied by d1 × d2 weight matrix

4 2

1

i i i

i ixi s x

s1 =

!

4=1# ; 2 =

!

4=1"

( )

si 1 (1 e si )

h = + !

i wihi

y =

!

2=1

(12)

Backpropagation (cont.)

Parameters: "1,…,"4, !1,…, !4, w1, w2

Score:

!

=

"

= n

i

SSE y i y i

S

1

))2

ˆ( ) ( (

Search: steepest descent; search for structure?

(13)

Models and Patterns

Models

Prediction Probability Distributions

Structured Data

•Linear regression

•Piecewise linear

(14)

Models

Prediction Probability Distributions

Structured Data

•Linear regression

•Piecewise linear

•Nonparamatric regression

(15)
(16)

Models

Prediction Probability Distributions

Structured Data

•Linear regression

•Piecewise linear

•Nonparametric regression

•Classification

logistic regression

naïve bayes/TAN/bayesian networks NN

support vector machines Trees

etc.

(17)

Models

Prediction Probability Distributions

Structured Data

•Linear regression

•Piecewise linear

•Nonparametric regression

•Classification

•Parametric models

•Mixtures of

parametric models

•Graphical Markov models (categorical, continuous, mixed)

(18)

Models

Prediction Probability Distributions

Structured Data

•Linear regression

•Piecewise linear

•Nonparametric regression

•Classification

•Parametric models

•Mixtures of

parametric models

•Graphical Markov models (categorical, continuous, mixed)

•Time series

•Markov models

•Mixture Transition Distribution models

•Hidden Markov models

•Spatial models

(19)

Markov Models

First-order:

!

= "

= T

t

t t t

T p y p y y

y y

p

2

1 1

1

1, , ) ( ) ( | )

(

e.g.: 1 ( 1) 2

2 exp 1 2

) 1

|

( !

"

$ #

%

& ' '

= '

' )( (

t t

t t

y g y y

y p

g linear ⇒ standard first-order auto-regressive model

e y

yt ="0 +"1 t!1 + e ~ N(0,!2)

y

1

y

2

y

3

y

T

(20)

First-Order HMM/Kalman Filter

x

1

x

2

x

3

x

T

y

1

y

2

y

3

y

T

!

= "

=

T

t

t t

t t

T

T

x x p x p y x p y x p x x

y y

p

2

1 1

1 1 1

1 1

1

, , , , , ) ( ) ( | ) ( | ) ( | )

( … …

Note: to compute p(y1,…,yT) need to sum/integrate over all possible state sequences...

(21)

Bias-Variance Tradeoff

High Bias - Low Variance Low Bias - High Variance

“overfitting” - modeling the random component

Score function should embody the compromise

(22)

The Curse of Dimensionality

X ~ MVNp (0 , I)

•Gaussian kernel density estimation

•Bandwidth chosen to minimize MSE at the mean

•Suppose want:

0 1

. ) 0

(

)) ( )

ˆ( [(

2

2

< =

! x x p

x p x

p E

Dimension # data points

1 4 2 19 3 67 6 2,790 10 842,000

(23)

Patterns

Global Local

•Clustering via partitioning

•Hierarchical Clustering

•Mixture Models

•Outlier detection

•Changepoint detection

•Bump hunting

•Scan statistics

•Association rules

(24)

x x

x

x xx x x x

xx x xx x

x xx

x xx xx

x

x x x x x

x xx x xxx

The curve represents a road Each “x” marks an accident

Red “x” denotes an injury accident Black “x” means no injury

Is there a stretch of road where there is an unusually large fraction of injury accidents?

Scan Statistics via Permutation Tests

(25)

Scan with Fixed Window

• If we know the length of the “stretch of road” that we seek, e.g., we could slide this window long the road and find the most “unusual” window location

x x

x

x xx x x x

xx x xx x

x xx

x xx xx

x

x x x x x

x xx x xxx

(26)

How Unusual is a Window?

• Let p

W

and p

¬W

denote the true probability of being red inside and outside the window respectively. Let (x

W

,n

W

) and (x

¬W

,n

¬W

) denote the corresponding counts

• Use the GLRT for comparing H

0

: p

W

= p

¬W

versus H

1

: p

W

≠ p

¬W

W W

W W

W W

W W

W W

W W

x n

W W

x W W

x n W W

x W W

x x n

n W

W W

W x

x W W

W W

n x

n x

n x

n x

n n

x x

n n

x x

¬

¬

¬

¬

¬

¬

¬ !

¬

¬

! ¬

!

!

¬ + + ¬

¬

¬ ! !

+ +

! +

= +

)]

/ (

1 [ )

/ (

)]

/ (

1 [ ) /

(

))]

/(

) ((

1 [ )]

/(

)

" [(

−2 log λ here has an asymptotic chi-square distribution with 1df

• lambda measures how unusual a window is

(27)

Permutation Test

• Since we look at the smallest λ over all window

locations, need to find the distribution of smallest- λ under the null hypothesis that there are no clusters

• Look at the distribution of smallest- λ over say 999 random relabellings of the colors of the x’s

xx x xxx x xx x xx x 0.376 xx x xxx x xx x xx x 0.233 xx x xxx x xx x xx x 0.412 xx x xxx x xx x xx x 0.222

smallest-λ

• Look at the position of observed smallest-λ in this distribution to get the scan statistic p-value (e.g., if observed smallest-λ is 5th smallest, p-value is 0.005)

(28)

Variable Length Window

• No need to use fixed-length window. Examine all possible windows up to say half the length of the entire road

O = fatal accident

O = non-fatal accident

(29)

Spatial Scan Statistics

• Spatial scan statistic uses, e.g., circles instead of line

segments

(30)
(31)

Spatial-Temporal Scan Statistics

• Spatial-temporal scan statistic use cylinders where the

height of the cylinder represents a time window

(32)

Other Issues

• Poisson model also common (instead of the bernoulli model)

• Covariate adjustment

• Andrew Moore’s group at CMU: efficient

algorithms for scan statistics

(33)

Software: SaTScan + others

http://www.satscan.org http://www.phrl.org

http://www.terraseer.com

(34)

Association Rules: Support and Confidence

• Find all the rules Y Z with

minimum confidence and support

– support, s, probability that a transaction contains {Y & Z}

– confidence, c, conditional probability that a transaction having Y also

contains Z Transaction ID Items Bought

2000 A,B,C

1000 A,C

4000 A,D

5000 B,E,F

Let minimum support 50%, and minimum confidence 50%, we have

– A C (50%, 66.6%) – C A (50%, 100%)

Customer buys diaper Customer

buys both

Customer buys beer

(35)

Mining Association Rules—An Example

For rule A ⇒ C:

support = support({A &C}) = 50%

confidence = support({A &C})/support({A}) = 66.6%

The Apriori principle:

Any subset of a frequent itemset must be frequent Transaction ID Items Bought

2000 A,B,C

1000 A,C

4000 A,D

5000 B,E,F

Frequent Itemset Support

{A} 75%

{B} 50%

{C} 50%

{A,C} 50%

Min. support 50%

Min. confidence 50%

(36)

Mining Frequent Itemsets: the Key Step

• Find the frequent itemsets: the sets of items that have minimum support

– A subset of a frequent itemset must also be a frequent itemset

• i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemset

– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)

• Use the frequent itemsets to generate association

rules.

(37)

The Apriori Algorithm

• Join Step:

Ck is generated by joining Lk-1with itself

• Prune Step:

Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

• Pseudo-code:

Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items};

for (k = 1; Lk !=∅; k++) do begin

Ck+1 = candidates generated from Lk; for each transaction t in database do

increment the count of all candidates in Ck+1 that are contained in t

Lk+1 = candidates in Ck+1 with min_support end

return ∪k Lk;

(38)

The Apriori Algorithm — Example

TID Items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5

Database D itemset sup.

{1} 2

{2} 3

{3} 3

{4} 1

{5} 3

itemset sup.

{1} 2

{2} 3

{3} 3

{5} 3

Scan D

C1 L1

itemset {1 2}

{1 3}

{1 5}

{2 3}

{2 5}

{3 5}

itemset sup {1 2} 1 {1 3} 2 {1 5} 1 {2 3} 2 {2 5} 3 {3 5} 2

itemset sup {1 3} 2 {2 3} 2 {2 5} 3 {3 5} 2

L2

C2 C2

Scan D

C3 itemset L3

{2 3 5} Scan D itemset sup {2 3 5} 2

(39)

Association Rule Mining: A Road Map

Boolean vs. quantitative associations (Based on the types of values handled)

buys(x, “SQLServer”) ^ buys(x, “DMBook”) → buys(x, “DBMiner”) [0.2%, 60%]

age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]

Single dimension vs. multiple dimensional associations (see ex.

Above)

Single level vs. multiple-level analysis

What brands of beers are associated with what brands of diapers?

Various extensions (thousands!)

(40)

References

Related documents

GAAP and IFRS have general requirements for hedge accounting as well as requirements for specifi c types of hedging relationships (i.e., a fair value hedge or a cash fl ow

A.1 - Necessary risk an and audit techniques are developed and adopted by major tax offices (Large Taxpayers Office (LTO), Beirut, Mount Lebanon) two years after project

As Panel B of Table 6 shows, differences in before- and after-expense performance between SRI funds in generalist companies and matched conventional funds from the whole sample

La CEI collabore étroitement avec l'Organisation Internationale de Normalisation (ISO), selon des conditions fixées par accord entre les deux organisations. 2) Les décisions

Chen in sod. [8] v ˇ clanku navajajo najprepoznavnejˇse napredke na tem podroˇ cju, od odkrivanja rakavih celic do analize kromosomov in ostalih zno- trajceliˇ cnih struktur.

Unlike ulcerative colitis patients, Crohn’s disease pa- tients generally do not undergo the variation of the procedure that eliminates the need to wear an external ostomy

Cultural Tourism in Portugal Cultural Tourists World Heritage by UNESCO World Heritage Historical Centers (WHHC) Intangible Heritage Tourism Experience and Storytelling

While there has been some literature on key substantive aspects of the Directive (see Shaping Private Antitrust Enforcement in Europe edited by Mel Marquis and Giorgio