Data Mining. Toon Calders

Academic year: 2021

(1)

Data Mining

Toon Calders

[email protected]

(2)

What is Data Mining?

(3)

What is Data Mining?

Analyzing all data “manually” becomes impossible

Data mining emerged from this need

Data mining is …

“the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.”

(4)

2II15: Course Organization

Lectures: Thursday 13:30-15:15 in Auditorium 12

Lecturer: Toon Calders ([email protected], HG 7.82a)

Course website: http://www.win.tue.nl/~tcalders/teaching/datamining/

Book: Tan, Steinbach, Kumar: Introduction to Data Mining

(5)

2II15: Course Organization

Evaluation:

Written exam 50%

Group project 50%

Without project, no grade

Without exam, no grade

Project/exam scores can be transferred to August if at least 6

(6)

2II15: Course Organization

Group project

Groups of 3-4 students

Pick a dataset to analyze (suggestions online)

Analyze the dataset; report results

W8: Groups formed, assignment proposal

W14: half-time report (presentation in W16)

W22: end presentation (report in W23)

(7)

Outline

Three Main Categories:

Classification

Clustering

Pattern Mining

Potential dangers of Data Mining

Overfitting

Bad experimental design

Spurious discoveries

(8)

Technique 1: Classification

Learn a model based on labeled data.

The model can be used for prediction.

Example: [Figure: decision tree with tests on age (<30 / ≥30), gender (M / F) and car type (sports / family), and leaves High, Medium, Low]
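As a rough illustration of "learn a model from labeled data, then predict", the sketch below learns a one-level decision tree (a "decision stump"). This is a deliberate simplification and not the tree from the slide. The records, the attribute names, and the helpers `learn_stump` and `predict` are all invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical labeled records: (age_group, gender, car_type) -> risk label.
# The values are invented for illustration; they are not taken from the slide.
ATTRIBUTES = ["age", "gender", "car_type"]
TRAIN = [
    (("<30", "M", "sports"), "High"),
    (("<30", "F", "sports"), "High"),
    (("<30", "M", "family"), "High"),
    ((">=30", "M", "sports"), "High"),
    ((">=30", "F", "family"), "Low"),
    ((">=30", "M", "family"), "Low"),
    ((">=30", "F", "family"), "Low"),
    (("<30", "F", "family"), "Low"),
]

def learn_stump(rows):
    """Pick the single attribute whose majority-label-per-value rule
    makes the fewest errors on the labeled training rows."""
    best = None
    for idx in range(len(ATTRIBUTES)):
        by_value = defaultdict(Counter)
        for features, label in rows:
            by_value[features[idx]][label] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in by_value.items()}
        errors = sum(rule[f[idx]] != label for f, label in rows)
        if best is None or errors < best[0]:
            best = (errors, idx, rule)
    return best[1], best[2]

def predict(model, features):
    """Apply the learned rule to a new, unlabeled record."""
    idx, rule = model
    return rule[features[idx]]

model = learn_stump(TRAIN)
print(ATTRIBUTES[model[0]], model[1])
print(predict(model, (">=30", "F", "sports")))
```

A full decision tree would apply the same idea recursively to the rows that fall under each attribute value.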

(9)

Technique 1: Classification

Class:
• Phase (Early, Intermediate, Late)

Attributes:
• image features,
• wavelengths

Dataset size:

• 72 million stars, 20 million galaxies

• Object catalog: 9 GB

• Image database: 150 GB

(10)

Technique 1: Classification

Other examples

Spam filters

Content analysis details: (5.7 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.6 NO_REAL_NAME           From: does not include a real name
 0.0 NORMAL_HTTP_TO_IP      URI: Uses a dotted-decimal IP address in URL
 2.0 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [122.164.179.102 listed in dnsbl.sorbs.net]
 3.1 RCVD_IN_XBL            RBL: Received via a relay in Spamhaus XBL
                            [122.164.179.102 listed in zen.spamhaus.org]
 0.0 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
                            [122.164.179.102 listed in zen.spamhaus.org]
 ... ...

Classifying solar systems

Fraud detection

“Fight against tax fraud brought in 590 million last year”

Source: De Standaard 6/6/08 […]

Data mining

The data mining technique does continue to bring in about as much money: 204.37 million euros last year, compared with 218.3 million in 2006. […]

(11)

Technique 1: Classification

The course will cover:

Different algorithms:

Decision tree construction

Nearest neighbor

Naïve Bayes

How to combine classifiers

(12)

Technique 2: Clustering

Automatically dividing data into homogeneous groups

(13)

Technique 2: Clustering

(14)

Technique 2: Clustering

(15)

Technique 2: Clustering

The course will cover:

Agglomerative clustering

Distance based

Density based

Hierarchical clustering
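To make the idea of agglomerative clustering concrete, here is a minimal sketch: start with every point in its own cluster and repeatedly merge the two closest clusters. The choice of single linkage (distance between the two nearest members), the stopping rule "until `k` clusters remain", the function name `single_linkage`, and the sample points are assumptions made for this example.

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering: merge the two closest clusters until k remain.
    Cluster distance is the smallest distance between any two members."""
    clusters = [[p] for p in points]

    def gap(a, b):
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

# Hypothetical 2-D points forming two well-separated groups.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = single_linkage(POINTS, k=2)
print(clusters)
```

Recording the order of the merges, instead of stopping at `k`, would give the hierarchy that hierarchical clustering produces.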

(16)

Technique 3: Pattern Mining

Find regularities, trends, patterns that frequently occur in the data

(17)

Technique 3: Pattern Mining

(18)

Technique 3: Pattern Mining

The course will cover:

Algorithms

Apriori

FPGrowth

Output reduction

Condensed representations
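As a hedged sketch of the level-wise idea behind Apriori: count candidate itemsets of size k, keep the frequent ones, and build size k+1 candidates only from sets whose subsets are all frequent. The baskets, the threshold, and the function name `apriori` are invented for this example. It is a compact illustration, not an optimized implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset contained in at least
    min_support transactions, found level by level."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-sets into k-sets, then prune any candidate
        # that has an infrequent (k-1)-subset.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        level = {c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent

# Hypothetical market baskets.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = apriori(BASKETS, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```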

(19)

Techniques: Summary

Current state-of-the-art in Data Mining:

Toolbox

Many different techniques;

Also deviation/outlier detection, regression, web mining, …

Typically Data Mining involves many different steps

Not one “optimal” algorithm
(20)

Outline

Three Main Categories:

Classification

Clustering

Pattern Mining

Case Study: Heating and Cooling

Potential dangers of Data Mining

Meaningless Discoveries

Overfitting

(21)

Case Study

Optimizing energy usage for heating and cooling

complex system

dynamics only partially known

lots of data being generated

(22)

Case Study

Performance of individual components in idealized conditions well-known

Reality turns out not to be so nice …

Different parameters constantly being monitored

Room temperature

Temperature in boiler

Flow of water

(23)

Case Study

Data mining helps:

Model « normal » behavior of the system

Learned from observations: classification/regression

Difficult to model statistically

Monitor when the system no longer follows the model

alarm function: something changed
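A minimal sketch of this "learn normal behavior, then raise an alarm" idea, assuming the normal behavior can be approximated by a straight line between two monitored quantities. The readings, the linear relationship, the tolerance, and the names `fit_line` and `alarm` are all hypothetical. A real system would likely need a richer model.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept on 'normal' observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def alarm(model, x, y, tolerance):
    """True when an observation deviates from the learned model by more
    than the tolerance, i.e. something may have changed."""
    slope, intercept = model
    return abs(y - (slope * x + intercept)) > tolerance

# Hypothetical readings taken while the system behaved normally:
# room temperature (x) against boiler temperature (y).
room = [0, 5, 10, 15, 20]
boiler = [70, 65, 60, 55, 50]
model = fit_line(room, boiler)

print(alarm(model, 10, 61, tolerance=3))  # close to the model
print(alarm(model, 10, 45, tolerance=3))  # far from the model
```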

(24)

Case Study: Conclusion

Real applications need …

Physics

Statistics for estimating situation-dependent parameters

Data mining for finding unexpected patterns, modelling complex systems

(25)

Outline

Three Main Categories:

Classification

Clustering

Pattern Mining

Case study

Potential dangers of Data Mining

Meaningless discoveries

Overfitting

(26)

Meaningless Discoveries

Implication ≠ causality

Simpson’s paradox

Data dredging

Redundancy

No new information

(27)

Implication ≠ Causality

Diet Coke → Obesity

Intensive Care → Death

Beach:

Ice cream sales go up → # drowned goes up

# drowned goes up → Ice cream sales go up

(28)

Simpson’s Paradox

Two hospitals: Academic hospital, local hospital.

Success rate of simple and complex operations is measured:

          Academic  Local
Simple    95%       92%
Complex   75%       60%

(29)

Simpson’s Paradox

Two hospitals: Academic hospital, local hospital.

Success rate of simple and complex operations is measured:

          Academic   Local
Simple    190/200    920/1000
Complex   750/1000   60/100
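The paradox shows up when the counts are pooled. The academic hospital is better on both simple and complex operations, yet its overall rate is 940/1200 ≈ 78.3% against 980/1100 ≈ 89.1% for the local hospital, because it performs far more complex operations. The short check below uses the counts from the slide. The helper names `rate` and `overall_rate` are only illustrative.

```python
# Counts from the slide: (successes, operations) per hospital and operation type.
COUNTS = {
    "Academic": {"Simple": (190, 200), "Complex": (750, 1000)},
    "Local": {"Simple": (920, 1000), "Complex": (60, 100)},
}

def rate(successes, total):
    return successes / total

def overall_rate(hospital):
    """Pool simple and complex operations for one hospital."""
    successes = sum(s for s, _ in COUNTS[hospital].values())
    total = sum(t for _, t in COUNTS[hospital].values())
    return successes / total

for hospital, by_type in COUNTS.items():
    per_type = {kind: round(rate(*st), 3) for kind, st in by_type.items()}
    print(hospital, per_type, "overall:", round(overall_rate(hospital), 3))
```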

(30)

Data Dredging

“Torturing the data until they confess”
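One way to see the danger is to simulate it. If enough unrelated attributes are tried against a label, some will agree with it well purely by chance. The sketch below is such a simulation on random data. The sizes, the seed, and the function name `dredge` are arbitrary choices for this example.

```python
import random

def dredge(n_rows=20, n_features=1000, seed=0):
    """Try many purely random binary features against a random binary label
    and report the best and the average agreement found."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    scores = []
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_rows)]
        agree = sum(f == y for f, y in zip(feature, labels)) / n_rows
        # A feature that mostly disagrees is just as "predictive" when inverted.
        scores.append(max(agree, 1 - agree))
    return max(scores), sum(scores) / len(scores)

best, average = dredge()
print(f"best agreement by chance: {best:.2f}, average: {average:.2f}")
```

None of these features carries any information about the label, so reporting only the best one would be a spurious discovery.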

(31)

Redundancy

Often the number of frequent sets is extremely large.

(32)

No New Information

Most frequent patterns = most well-known patterns

Many interesting patterns are infrequent; otherwise we would already know them

(33)

Outline

Three Main Categories:

Classification

Clustering

Pattern Mining

Case study

Potential dangers of Data Mining

Meaningless discoveries

Overfitting

(34)

Overfitting

Setting:

Training data

Separate set for testing the data

We keep updating the model

Make it more and more specific

Make it better and better on the training data

(35)
(36)

Overfitting Underfitting

Overfitting

Underfitting: Model did not see enough data

(37)

Overfitting Due to Noise

Two-dimensional data, class + or -

[Figure: scatter plot of + and - points over attributes A and B]

(38)

Overfitting Due to Noise

Good model

[Figure: scatter plot of + and - points over attributes A and B]
(39)

Overfitting Due to Noise

Bad model with better training performance

[Figure: scatter plot of + and - points over attributes A and B]
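The same effect can be reproduced on a tiny invented dataset. The sketch below uses one dimension instead of the two on the slide, with a single noisy training label at x = 2. A simple threshold rule makes one training error, while a model that memorizes every training point is perfect on the training data and worse on fresh data. All data and model names are assumptions for this example.

```python
# Hypothetical 1-D data: the true class is "+" below 4.5 and "-" above it.
# The training label at x = 2 is noise.
TRAIN = [(0, "+"), (1, "+"), (2, "-"), (3, "+"), (4, "+"),
         (5, "-"), (6, "-"), (7, "-"), (8, "-"), (9, "-")]
TEST = [(0.4, "+"), (1.6, "+"), (2.2, "+"), (3.4, "+"),
        (4.4, "+"), (5.6, "-"), (6.6, "-"), (8.4, "-")]

def simple_model(x):
    """One threshold: ignores the noisy point."""
    return "+" if x < 4.5 else "-"

def memorizing_model(x):
    """Copies the label of the nearest training point, noise included."""
    _, label = min(TRAIN, key=lambda row: abs(row[0] - x))
    return label

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

for name, model in [("simple", simple_model), ("memorizing", memorizing_model)]:
    print(name, "train:", accuracy(model, TRAIN), "test:", accuracy(model, TEST))
```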

(40)

Outline

Three Main Categories:

Classification

Clustering

Pattern Mining

Case study

Potential dangers of Data Mining

Meaningless discoveries

Overfitting

(41)

Bad Experimental Design

Keep in mind:

Never, ever test performance of your solutions on data that is used in the training process

Always keep the scenario in mind in which you will deploy your method

(42)

Bad Experimental Design

Example: Nearest Neighbor Classification

Training set has been given

A    B    C    D    Class
0.5  0.3  0.1  7.5  +
0.3  0.1  0.7  8.9  -
0.4  0.2  0.8  4.2  +

• Classifying a new example p:

Find “closest” example q in training set

Assign label of q to p

(43)
(44)

Bad Experimental Design

How do we measure the distance?

Weighted Euclidean distance between new point (p1, …, pk) and (q1, …, qk):

$$\mathrm{dist}(p, q) = \sqrt{\sum_{i=1}^{k} w_i \, (p_i - q_i)^2}$$

We try some different settings for the weights

Equal weights: accuracy of 56%

Standardized weights: accuracy of 65%

Giving more weight to C: accuracy of 75%
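A small sketch of nearest-neighbor classification with a weighted Euclidean distance, assuming the usual form with one weight per attribute and a square root over the weighted sum of squared differences. The three training rows are read from the example table on the earlier slide. The new point `p`, the two weight settings, and the function names are invented for illustration.

```python
import math

# Training rows (A, B, C, D) with class, as read from the example table.
TRAIN = [
    ((0.5, 0.3, 0.1, 7.5), "+"),
    ((0.3, 0.1, 0.7, 8.9), "-"),
    ((0.4, 0.2, 0.8, 4.2), "+"),
]

def weighted_dist(p, q, w):
    """sqrt(sum_i w_i * (p_i - q_i)^2) over the k attributes."""
    return math.sqrt(sum(wi * (pi - qi) ** 2 for wi, pi, qi in zip(w, p, q)))

def classify(p, train, w):
    """Assign p the label of its closest training example under weights w."""
    _, label = min(train, key=lambda row: weighted_dist(p, row[0], w))
    return label

p = (0.3, 0.1, 0.75, 5.0)                 # hypothetical new example
print(classify(p, TRAIN, (1, 1, 1, 1)))   # equal weights: D dominates
print(classify(p, TRAIN, (1, 1, 1, 0)))   # ignoring D changes the neighbor
```

Attribute D has a much larger range than A, B and C, so with equal weights it decides almost alone which example is "closest". That is why the choice of weights changes the accuracy so much.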

(45)

Bad Experimental Design

We draw the following conclusions:

Standardized weights with a small correction to increase the weight of C gives the best results

(46)

Bad Experimental Design

We draw the following conclusions:

Standardized weights with a small correction to increase the weight of C gives the best results

We can get an accuracy as high as 75%

WHAT IS WRONG?
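One likely answer follows from the rule stated earlier, never to test on data used in the training process. If the weights were chosen by looking at the same accuracy figure that is then reported, that figure is optimistic. A common remedy, sketched below under that assumption, is to pick the weights on a separate validation set and report accuracy once on an untouched test set. The data, the candidate weights, and the function names are hypothetical.

```python
def nearest_label(p, train, w):
    """1-NN with squared weighted Euclidean distance (same ranking as with the root)."""
    def d(q):
        return sum(wi * (pi - qi) ** 2 for wi, pi, qi in zip(w, p, q))
    return min(train, key=lambda row: d(row[0]))[1]

def accuracy(data, train, w):
    return sum(nearest_label(x, train, w) == y for x, y in data) / len(data)

# Hypothetical data: the first attribute is informative, the second is not.
train = [((1, 0), "+"), ((2, 50), "+"), ((8, 20), "-"), ((9, 70), "-")]
validation = [((3, 22), "+"), ((7, 48), "-")]
test = [((2.5, 68), "+"), ((8.5, 5), "-"), ((4, 21), "+")]

candidates = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]

# Tune on the validation set only ...
best_w = max(candidates, key=lambda w: accuracy(validation, train, w))
# ... and touch the test set exactly once, for the number that gets reported.
reported = accuracy(test, train, best_w)
print(best_w, reported)
```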

(47)

Conclusions

Three main techniques:

Classification

Pattern Mining

Clustering

Many dangers

Under/overfitting

Meaningless discoveries