Data Mining. Toon Calders

(1)

Data Mining

Toon Calders

[email protected]

(2)

What is Data Mining?

(3)

What is Data Mining?

•

Analyzing all data “manually” becomes impossible

•

Data mining emerged from this need

Data mining is …

“the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.”

(4)

2II15: Course Organization

•

Lectures: Thursday 13:30-15:15 in Auditorium 12

•

Lecturer: Toon Calders ([email protected], HG 7.82a)

•

Course website:

http://www.win.tue.nl/~tcalders/teaching/datamining/

•

Book:

Tan, Steinbach, Kumar:

Introduction to Data Mining

(5)

•

Evaluation:

• Written exam 50%

• Group project 50%

•

Without project, no grade

•

Without exam, no grade

•

Project/exam scores can be transferred to August if they are at least 6

(6)

•

Group project

• Groups of 3-4 students

• Pick a dataset to analyze (suggestions online)

• Analyze the dataset; report results

W8: Groups formed, assignment proposal

W14: half-time report (presentation in W16)

W22: end presentation (report in W23)

(7)

Outline

•

Three Main Categories:

• Classification

• Clustering

• Pattern Mining

•

Potential dangers of Data Mining

• Overfitting

• Bad experimental design

• Spurious discoveries

(8)

Technique 1: Classification

•

Learn a model based on labeled data.

•

The model can be used for prediction.

Example: [Figure: decision tree with tests on age (<30 / ≥30), gender (M / F) and car type (sports / family); leaves labelled High, Medium and Low]

(9)

Class:

• Phase (Early, Intermediate, Late)

Attributes:

• Image features

• Wavelengths

Dataset size:

• 72 million stars, 20 million galaxies

• Object catalog: 9 GB

• Image database: 150 GB

(10)

•

Other examples

• Spam filters

Content analysis details: (5.7 points, 5.0 required)
pts   rule name           description
----  ------------------  --------------------------------------------------
0.6   NO_REAL_NAME        From: does not include a real name
0.0   NORMAL_HTTP_TO_IP   URI: Uses a dotted-decimal IP address in URL
2.0   RCVD_IN_SORBS_DUL   RBL: SORBS: sent directly from dynamic IP address
                          [122.164.179.102 listed in dnsbl.sorbs.net]
3.1   RCVD_IN_XBL         RBL: Received via a relay in Spamhaus XBL
                          [122.164.179.102 listed in zen.spamhaus.org]
0.0   RCVD_IN_PBL         RBL: Received via a relay in Spamhaus PBL
                          [122.164.179.102 listed in zen.spamhaus.org]
...

• Classifying solar systems

• Fraud detection

Fight against tax fraud brought in 590 million last year

Source: De Standaard 6/6/08 […]

Data mining

The data mining technique does keep bringing in about as much money: 204.37 million euros last year, compared with 218.3 million in 2006. […]

(11)

Technique 1: Classification

•

The course will cover:

• Different algorithms:

− Decision tree construction

− Nearest neighbor

− Naïve Bayes

• How to combine classifiers

(12)

Technique 2: Clustering

•

Automatically dividing data into homogeneous groups

(13)

(14)

(15)

•

• Agglomerative clustering

− Distance based

− Density based

• Hierarchical clustering

(16)

Technique 3: Pattern Mining

•

Find regularities, trends, and patterns that occur frequently in the data

(17)

Technique 3: Pattern Mining

(18)

Technique 3: Pattern Mining

•

• Algorithms

− Apriori

− FPGrowth

• Output reduction

− Condensed representations
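As a rough illustration of what a frequent-pattern algorithm computes, here is a minimal level-wise sketch in the spirit of Apriori. It is not the course's implementation: the function name `frequent_itemsets`, the `min_support` parameter and the shopping baskets are all assumptions made up for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset that occurs
    in at least min_support transactions (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Count in how many transactions each candidate is contained.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent (k-1)-sets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

result = frequent_itemsets(baskets, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The prune step is what keeps the search manageable: a set can only be frequent if all of its subsets are frequent, so most larger candidates are discarded without ever being counted.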

(19)

Techniques: Summary

•

Current state-of-the-art in Data Mining:

• Toolbox

• Many different techniques;

− Also deviation/outlier detection, regression, web mining, …

•

Typically Data Mining involves many different steps

• Not one “optimal” algorithm

(20)

Outline

•

• Clustering

•

Case Study: Heating and Cooling

•

• Meaningless Discoveries

• Overfitting

(21)

Case Study

•

Optimizing energy usage for heating and cooling

• complex system

• dynamics only partially known

• lots of data being generated

(22)

Case Study

•

Performance of individual components in idealized conditions is well known

• Reality turns out not to be so nice …

•

Different parameters constantly being monitored

• Room temperature

• Temperature in boiler

• Flow of water

(23)

Case Study

•

Data mining helps:

• Model « normal » behavior of the system

− Learned from observations (classification/regression)

− Difficult to model statistically

• Monitor when the system no longer follows the model

− alarm-function: something changed

(24)

Case Study: Conclusion

•

Real applications need …

• Physics

• Statistics to estimate situation-dependent parameters

• Data mining for finding unexpected patterns, modelling complex systems

(25)

Outline

•

• Clustering

•

Case study

•

• Meaningless discoveries

• Overfitting

(26)

Meaningless Discoveries

•

Implication ≠ causality

•

Simpson’s paradox

•

Data dredging

•

Redundancy

•

No new information

(27)

Implication ≠ Causality

• Diet Coke ⇒ Obesity

• Intensive Care ⇒ Death

• Beach:

Ice cream sales go up ⇒ number of drownings goes up

Number of drownings goes up ⇒ ice cream sales go up

(28)

Simpson’s Paradox

•

Two hospitals: Academic hospital, local hospital.

•

Success rate of simple and complex operations is measured:

           Academic   Local
Simple     95%        92%
Complex    75%        60%

(29)

Simpson’s Paradox

•

Two hospitals: Academic hospital, local hospital.

•

Success rate of simple and complex operations is measured:

           Academic    Local
Simple     190/200     920/1000
Complex    750/1000    60/100
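The paradox only becomes visible once the two operation types are pooled per hospital. The short sketch below does that arithmetic with the counts from the table; the variable and function names are chosen for this example only.

```python
from fractions import Fraction

# (successful operations, total operations), taken from the table above.
results = {
    "Academic": {"Simple": (190, 200), "Complex": (750, 1000)},
    "Local": {"Simple": (920, 1000), "Complex": (60, 100)},
}


def rate(successes, total):
    """Success rate as an exact fraction."""
    return Fraction(successes, total)


def overall_rate(per_type):
    """Pool all operation types of one hospital into a single success rate."""
    successes = sum(s for s, _ in per_type.values())
    total = sum(t for _, t in per_type.values())
    return rate(successes, total)


for hospital, per_type in results.items():
    parts = ", ".join(
        f"{kind}: {float(rate(*counts)):.1%}" for kind, counts in per_type.items()
    )
    print(f"{hospital}: {parts}, overall: {float(overall_rate(per_type)):.1%}")
```

The academic hospital is better on both simple and complex operations, yet its pooled rate (940/1200, about 78.3%) is lower than that of the local hospital (980/1100, about 89.1%), because it performs far more complex operations.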

(30)

Data Dredging

•

“Torturing the data until they confess”

(31)

Redundancy

•

Often the number of frequent sets is extremely large.

(32)

No New Information

•

Most frequent patterns = most well-known patterns

•

Many interesting patterns are infrequent; otherwise we would already know them

(33)

Outline

•

• Clustering

•

Case study

•

• Overfitting

(34)

Overfitting

•

Setting:

• Training data

• Separate set for testing the model

•

We keep updating the model

• Make it more and more specific

• Make it better and better on the training data

(35)

(36)

Overfitting and Underfitting

Underfitting: the model did not see enough data

(37)

Overfitting Due to Noise

•

Two-dimensional data, class + or -

[Figure: scatter plot of + and - points over attributes A and B]

(38)

•

Good model

[Figure: scatter plot of + and - points over attributes A and B]

(39)

•

Bad model with better training performance

[Figure: scatter plot of + and - points over attributes A and B]
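A small, fully synthetic sketch can make the effect of label noise concrete. It uses one-dimensional data rather than the two-dimensional data of the slides, and both models are assumptions for illustration: `simple_model` is a single threshold assumed to match the underlying rule, while `memorizing_model` copies the label of the closest training point.

```python
def true_label(x):
    """Underlying rule of the synthetic data: class + from 0.5 upward."""
    return "+" if x >= 0.5 else "-"


def flip(label):
    return "-" if label == "+" else "+"


# Training data: 100 points, with every tenth label flipped to simulate noise.
train = []
for i in range(100):
    x = i / 100
    label = true_label(x)
    if i % 10 == 3:
        label = flip(label)
    train.append((x, label))

# Test data: slightly shifted points with noise-free labels.
test = [(i / 100 + 0.004, true_label(i / 100 + 0.004)) for i in range(100)]


def simple_model(x):
    """One threshold; ignores the noisy training labels."""
    return "+" if x >= 0.5 else "-"


def memorizing_model(x):
    """Copies the label of the closest training point, noise included."""
    _, label = min(train, key=lambda row: abs(row[0] - x))
    return label


def accuracy(model, examples):
    return sum(model(x) == y for x, y in examples) / len(examples)


for name, model in [("simple", simple_model), ("memorizing", memorizing_model)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

The memorizing model reproduces every training label, including the flipped ones, so it looks perfect on the training data while doing worse than the simple model on the test data.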

(40)

Outline

•

• Clustering

•

Case study

•

• Overfitting

(41)

Bad Experimental Design

•

Keep in mind:

• Never, ever test performance of your solutions on data that is used in the training process

• Always keep the scenario in mind in which you will deploy your method

(42)

Bad Experimental Design

•

Example: Nearest Neighbor Classification

• Training set has been given

A     B     C     D     Class
0.5   0.3   0.1   7.5   +
0.3   0.1   0.7   8.9   -
0.4   0.2   0.8   4.2   +

• Classifying a new example p:

− Find “closest” example q in training set

− Assign label of q to p
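The two classification steps can be sketched in a few lines. The training rows are the three examples from the slide; the unweighted Euclidean distance and the new example `p` are assumptions made for this illustration.

```python
import math

# Training examples from the slide: attributes (A, B, C, D) and class label.
training_set = [
    ((0.5, 0.3, 0.1, 7.5), "+"),
    ((0.3, 0.1, 0.7, 8.9), "-"),
    ((0.4, 0.2, 0.8, 4.2), "+"),
]


def euclidean(p, q):
    """Plain (unweighted) Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def classify(p, training):
    """Find the closest training example q and return its label."""
    _, label = min(training, key=lambda row: euclidean(p, row[0]))
    return label


p = (0.3, 0.1, 0.6, 8.5)  # hypothetical new example, not from the slides
print(classify(p, training_set))
```

Note that attribute D spans a much larger range than A, B and C, so with an unweighted distance it largely decides which example counts as closest.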

(43)

(44)

•

How do we measure the distance?

• Weighted Euclidean distance between new point (p1, …, pn) and (q1, …, qn):

$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} w_k \, (p_k - q_k)^2}$$

• We try some different settings for the weights

− Equal weights: accuracy of 56%

− Standardized weights: accuracy of 65%

− Giving more weight to C: accuracy of 75%

− …
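A minimal sketch of a weighted Euclidean distance, assuming the usual form with a square root over the weighted squared differences. The function name `weighted_distance` and the weight vectors are illustrative choices, not the settings behind the accuracies quoted on the slide.

```python
import math


def weighted_distance(p, q, weights):
    """Weighted Euclidean distance: sqrt(sum_k w_k * (p_k - q_k)^2)."""
    if not (len(p) == len(q) == len(weights)):
        raise ValueError("p, q and weights must have the same length")
    return math.sqrt(sum(w * (pk - qk) ** 2 for w, pk, qk in zip(weights, p, q)))


# Two attribute vectors (A, B, C, D) taken from the training set on the earlier slide.
p = (0.5, 0.3, 0.1, 7.5)
q = (0.3, 0.1, 0.7, 8.9)

equal_weights = (1.0, 1.0, 1.0, 1.0)
more_weight_on_c = (1.0, 1.0, 4.0, 1.0)  # hypothetical weights, for illustration

print(weighted_distance(p, q, equal_weights))
print(weighted_distance(p, q, more_weight_on_c))
```

Raising the weight of one attribute makes differences in that attribute count more, which can change which training example ends up being the nearest neighbor.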

(45)

Bad Experimental Design

•

We draw the following conclusions:

• Standardized weights with a small correction to increase the weight of C give the best results

(46)

• We can get an accuracy as high as 75%

WHAT IS WRONG?
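One possible reading of the question, in line with the rule never to test on data used in the training process: if the weights were chosen by comparing accuracies on the same data that produced the 75%, that number is an optimistic estimate. The sketch below shows one common way to avoid this, by choosing the weights on a validation set and reporting accuracy once on an untouched test set. The synthetic data, the candidate weights and all names are assumptions for illustration; they are not the data or settings behind the slide's numbers.

```python
import math
import random


def weighted_distance(p, q, weights):
    return math.sqrt(sum(w * (pk - qk) ** 2 for w, pk, qk in zip(weights, p, q)))


def classify(p, training, weights):
    """Nearest-neighbor label under the given attribute weights."""
    _, label = min(training, key=lambda row: weighted_distance(p, row[0], weights))
    return label


def accuracy(examples, training, weights):
    hits = sum(classify(x, training, weights) == y for x, y in examples)
    return hits / len(examples)


# Synthetic stand-in data (an assumption): the class depends on attribute C only,
# and attribute D is noise on a larger scale.
rng = random.Random(0)
data = []
for _ in range(300):
    a, b, c = rng.random(), rng.random(), rng.random()
    d = rng.uniform(0.0, 10.0)
    data.append(((a, b, c, d), "+" if c > 0.5 else "-"))

# Three disjoint parts: fit on train, choose on validation, report on test.
train, validation, test = data[:150], data[150:225], data[225:]

candidates = {  # hypothetical weight settings
    "equal": (1.0, 1.0, 1.0, 1.0),
    "more weight on C": (1.0, 1.0, 25.0, 1.0),
    "less weight on D": (1.0, 1.0, 1.0, 0.01),
}

# Choose the weights using the validation set only.
best_name = max(candidates, key=lambda name: accuracy(validation, train, candidates[name]))

# Report the accuracy once, on data that played no role in that choice.
test_accuracy = accuracy(test, train, candidates[best_name])
print(best_name, round(test_accuracy, 3))
```

The accuracy measured on the validation set is still biased upward by the selection; only the final number on the test set is a fair estimate of how the chosen setting would perform when deployed.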

(47)

Conclusions

•

Three main techniques:

• Classification

• Pattern Mining

• Clustering

•

Many dangers

• Under/overfitting

• Meaningless discoveries

(48)