What is Data Mining?
•
Analyzing all data “manually” becomes impossible
•
Data mining emerged from this need
Data mining is …
“the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.”
2II15: Course Organization
•
Lectures: Thursday 13:30-15:15 in Auditorium 12
•
Lecturer: Toon Calders (HG 7.82a)
•
Course website:
http://www.win.tue.nl/~tcalders/teaching/datamining/
•
Book:
Tan, Steinbach, Kumar:
Introduction to Data Mining
2II15: Course Organization
•
Evaluation:
• Written exam 50%
• Group project 50%
•
Without project, no grade
•
Without exam, no grade
•
Project/exam scores can be transferred to August if they are at least 6
2II15: Course Organization
•
Group project
• Groups of 3-4 students
• Pick a dataset to analyze (suggestions online)
• Analyze the dataset; report results
W8: Groups formed, assignment proposal
W14: half-time report (presentation in W16)
W22: end presentation (report in W23)
Outline
•
Three Main Categories:
• Classification
• Clustering
• Pattern Mining
•
Potential dangers of Data Mining
• Overfitting
• Bad experimental design
• Spurious discoveries
Technique 1: Classification
•
Learn a model based on labeled data.
•
The model can be used for prediction.
[Figure: example decision tree that splits on age (<30, ≥30), gender (M, F) and car type (sports, family), with leaves High, Medium and Low]
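As a minimal sketch of learning a model from labeled data and then using it for prediction, the Python below learns a one-level decision rule (a decision stump). The records, the attribute names `age_group` and `car_type`, and the labels are invented for illustration; they are not the data behind the example tree.

```python
from collections import Counter, defaultdict

# Hypothetical labeled records: (attributes, class label).
TRAIN = [
    ({"age_group": "<30", "car_type": "sports"}, "High"),
    ({"age_group": "<30", "car_type": "family"}, "High"),
    ({"age_group": ">=30", "car_type": "sports"}, "High"),
    ({"age_group": ">=30", "car_type": "family"}, "Low"),
    ({"age_group": ">=30", "car_type": "family"}, "Low"),
    ({"age_group": "<30", "car_type": "sports"}, "High"),
]

def learn_stump(records):
    """Pick the single attribute whose per-value majority vote makes
    the fewest mistakes on the labeled records."""
    best = None
    for attr in records[0][0]:
        votes = defaultdict(Counter)
        for features, label in records:
            votes[features[attr]][label] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in votes.items()}
        errors = sum(rule[f[attr]] != label for f, label in records)
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    _, attr, rule = best
    return attr, rule

def predict(model, features):
    """Use the learned rule to label a new, unlabeled record."""
    attr, rule = model
    return rule[features[attr]]

model = learn_stump(TRAIN)
print(model)
print(predict(model, {"age_group": ">=30", "car_type": "family"}))
```

A full decision tree would repeat this attribute choice recursively on each branch; the stump only shows the learn-then-predict pattern.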
Technique 1: Classification
Class:
• Phase (Early, Intermediate, Late)
Attributes:
• image features,
• wavelengths
Dataset size:
• 72 million stars, 20 million galaxies
• Object catalog: 9 GB
• Image database: 150 GB
Technique 1: Classification
•
Other examples
• Spam filters
Fight against tax fraud brought in 590 million last year
Source: De Standaard 6/6/08 […]
• Classifying solar systems
• Fraud detection
Content analysis details: (5.7 points, 5.0 required)
pts  rule name          description
---- ------------------ --------------------------------------------------
0.6  NO_REAL_NAME       From: does not include a real name
0.0  NORMAL_HTTP_TO_IP  URI: Uses a dotted-decimal IP address in URL
2.0  RCVD_IN_SORBS_DUL  RBL: SORBS: sent directly from dynamic IP address
                        [122.164.179.102 listed in dnsbl.sorbs.net]
3.1  RCVD_IN_XBL        RBL: Received via a relay in Spamhaus XBL
                        [122.164.179.102 listed in zen.spamhaus.org]
0.0  RCVD_IN_PBL        RBL: Received via a relay in Spamhaus PBL
                        [122.164.179.102 listed in zen.spamhaus.org]
... ...
Data mining
The data mining technique does keep bringing in about as much money: 204.37 million euros last year, compared with 218.3 million in 2006. […]
Technique 1: Classification
•
The course will cover:
• Different algorithms:
− Decision tree construction
− Nearest neighbor
− Naïve Bayes
• How to combine classifiers
Technique 2: Clustering
•
Automatically dividing data into homogeneous groups
Technique 2: Clustering
•
The course will cover:
• Agglomerative clustering
− Distance based
− Density based
• Hierarchical clustering
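A minimal sketch of distance-based agglomerative clustering, assuming single linkage (cluster distance = distance of the closest pair) and a small invented set of 2-D points. Other linkage rules and the density-based variants are not shown.

```python
import math

def single_link(c1, c2):
    """Distance between two clusters = distance of their closest pair."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerative(points, num_clusters):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

# Hypothetical data: two visibly separated groups.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8)]
clusters = agglomerative(points, 2)
print(clusters)
```

Recording the order of the merges instead of stopping at a fixed number of clusters would give the hierarchy (dendrogram) that hierarchical clustering works with.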
Technique 3: Pattern Mining
•
Find regularities, trends, patterns that frequently occur in the data
Technique 3: Pattern Mining
•
The course will cover:
• Algorithms
− Apriori
− FPGrowth
• Output reduction
− Condensed representations
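To give a feel for level-wise frequent itemset mining, here is a minimal Apriori-style sketch. The basket data and the support threshold of 3 are invented for illustration, and the code favors readability over the optimizations a real implementation would use.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets that occur in at
    least min_support transactions, built level by level."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    level = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
        # Join step: unite frequent sets into candidates of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step relies on the fact that a set can only be frequent if all of its subsets are frequent, which is what keeps the candidate space manageable.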
Techniques: Summary
•
Current state-of-the-art in Data Mining:
• Toolbox
• Many different techniques;
− Also deviation/outlier detection, regression, web mining, …
•
Typically Data Mining involves many different steps
• Not one “optimal” algorithm
Outline
•
Three Main Categories:
• Classification
• Clustering
• Pattern Mining
•
Case Study: Heating and Cooling
•
Potential dangers of Data Mining
• Meaningless Discoveries
• Overfitting
Case Study
•
Optimizing energy usage for heating and cooling
• complex system
• dynamics only partially known
• lots of data being generated
Case Study
•
Performance of individual components in idealized conditions is well known
• Reality turns out not to be so nice …
•
Different parameters constantly being monitored
• Room temperature
• Temperature in boiler
• Flow of water
Case Study
•
Data mining helps:
• Model « normal » behavior of the system
− Learned from observations (classification/regression)
− Difficult to model statistically
• Monitor when the system no longer follows the model
− alarm-function: something changed
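A minimal sketch of this idea, under strong simplifying assumptions: "normal" behavior is modeled as a straight line relating boiler temperature to room temperature, and the alarm fires when an observation deviates from the line by more than three standard deviations of the training residuals. The measurements, the linear model and the threshold rule are all hypothetical.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b."""
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical observations of normal operation.
boiler = [50.0, 55.0, 60.0, 65.0, 70.0]
room = [18.1, 19.0, 20.1, 20.9, 22.0]

# Learn the model of normal behavior from the observations.
a, b = fit_line(boiler, room)
residuals = [y - (a * x + b) for x, y in zip(boiler, room)]
threshold = 3 * pstdev(residuals)  # assumed alarm rule

def alarm(boiler_temp, room_temp):
    """True when the observation no longer follows the learned model."""
    return abs(room_temp - (a * boiler_temp + b)) > threshold

print(alarm(62.0, 20.4))  # close to the learned line
print(alarm(62.0, 16.0))  # far below what the model expects
```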
Case Study: Conclusion
•
Real applications need …
• Physics
• Statistics to estimate situation-dependent parameters
• Data mining for finding unexpected patterns, modelling complex systems
Outline
•
Three Main Categories:
• Classification
• Clustering
• Pattern Mining
•
Case study
•
Potential dangers of Data Mining
• Meaningless discoveries
• Overfitting
Meaningless Discoveries
•
Implication ≠ causality
•
Simpson’s paradox
•
Data dredging
•
Redundancy
•
No new information
Implication ≠ Causality
•
Diet Coke ⇒ Obesity
•
Intensive Care ⇒ Death
•
Beach:
• Ice cream sales go up ⇒ # drowned goes up
• # drowned goes up ⇒ Ice cream sales go up
Simpson’s Paradox
•
Two hospitals: Academic hospital, local hospital.
•
Success rate of simple and complex operations is
measured:
          Academic   Local
Simple    95%        92%
Complex   75%        60%
Simpson’s Paradox
•
Two hospitals: Academic hospital, local hospital.
•
Success rate of simple and complex operations is
measured:
          Academic   Local
Simple    190/200    920/1000
Complex   750/1000   60/100
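The paradox becomes visible once the counts are added up per hospital. The short calculation below uses only the numbers from the table; the dictionary layout and function name are just one way to organize them.

```python
# (successes, operations) per hospital and operation type, from the table.
results = {
    "Academic": {"Simple": (190, 200), "Complex": (750, 1000)},
    "Local": {"Simple": (920, 1000), "Complex": (60, 100)},
}

def overall_rate(hospital):
    """Success rate over all operations of one hospital."""
    successes = sum(s for s, _ in results[hospital].values())
    total = sum(n for _, n in results[hospital].values())
    return successes / total

for hospital in results:
    print(hospital, f"{overall_rate(hospital):.1%}")
# prints: Academic 78.3%
#         Local 89.1%
```

The academic hospital is better on both simple and complex operations, yet its overall rate (940/1200) is lower than that of the local hospital (980/1100), because most of its operations are complex ones.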
Data Dredging
•
“Torturing the data until they confess”
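A small simulation can illustrate the danger. In this sketch, which assumes purely random data, 1000 random binary features are compared against a random binary label on only 20 rows; trying enough features nearly always turns up one that seems to "predict" the label. The sizes and the seed are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n_rows, n_features = 20, 1000
labels = [random.randint(0, 1) for _ in range(n_rows)]
features = [[random.randint(0, 1) for _ in range(n_rows)] for _ in range(n_features)]

def agreement(feature):
    """Fraction of rows where a random feature happens to equal the label."""
    return sum(f == y for f, y in zip(feature, labels)) / n_rows

best = max(agreement(f) for f in features)
print(f"best of {n_features} random features agrees on {best:.0%} of rows")
```

No feature carries any information here, so whatever agreement the best one reaches is pure chance and would not carry over to new data.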
Redundancy
•
Often the number of frequent sets is extremely large.
No New Information
•
Most frequent patterns = most well-known patterns
•
Many interesting patterns are infrequent; otherwise
we would already know them
Outline
•
Three Main Categories:
• Classification
• Clustering
• Pattern Mining
•
Case study
•
Potential dangers of Data Mining
• Meaningless discoveries
• Overfitting
Overfitting
•
Setting:
• Training data
• Separate set for testing the model
•
We keep updating the model
• Make it more and more specific
• Make it better and better on the training data
Overfitting
[Figure: underfitting and overfitting regions]
Underfitting: Model did not see enough data
Overfitting Due to Noise
•
Two-dimensional data, class + or -
[Figure: scatter plot of + and - points over attributes A and B]
Overfitting Due to Noise
•
Good model
[Figure: the scatter plot over A and B with the good model]
Overfitting Due to Noise
•
Bad model with better training performance
[Figure: the scatter plot over A and B with the bad model]
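The effect can be reproduced on a toy example. In the sketch below the points, the noisy label at (8, 2) and both models are invented: the "good" model is a simple threshold on A, while the "bad" model memorizes the training set (nearest training point), so it also reproduces the noise.

```python
import math

# Hypothetical 2-D training data: (A, B) -> class. The true class is "+"
# when A > 5; the point (8, 2) carries a wrong (noisy) label.
train = [
    ((1, 1), "-"), ((2, 4), "-"), ((3, 2), "-"), ((4, 5), "-"),
    ((6, 1), "+"), ((7, 4), "+"), ((9, 5), "+"), ((8, 2), "-"),
]
# Noise-free test points that were not used for training.
test = [((2, 2), "-"), ((4, 1), "-"), ((6, 4), "+"), ((8, 1), "+"), ((9, 2), "+")]

def good_model(point):
    """Simple rule that ignores the noisy point."""
    return "+" if point[0] > 5 else "-"

def bad_model(point):
    """Memorizes the training set: label of the nearest training point."""
    return min(train, key=lambda ex: math.dist(ex[0], point))[1]

def accuracy(model, data):
    return sum(model(p) == label for p, label in data) / len(data)

for name, model in [("good", good_model), ("bad", bad_model)]:
    print(name, accuracy(model, train), accuracy(model, test))
# prints: good 0.875 1.0
#         bad 1.0 0.6
```

The bad model wins on the training data because it fits the noisy point, and loses on the test data for the same reason.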
Outline
•
Three Main Categories:
• Classification
• Clustering
• Pattern Mining
•
Case study
•
Potential dangers of Data Mining
• Meaningless discoveries
• Overfitting
Bad Experimental Design
•
Keep in mind:
• Never, ever test the performance of your solution on data that was used in the training process
• Always keep in mind the scenario in which you will deploy your method
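One common way to follow the first rule is to set a test part aside before any training or tuning happens. A minimal sketch, assuming a random 70/30 split is appropriate for the deployment scenario; the function name, the fraction and the stand-in data are illustrative choices.

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and hold out a test part that the
    training process never gets to see."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(10))  # stand-in for labeled examples
train, test = train_test_split(examples)
print(len(train), len(test))
```

Every choice made while building the model, including parameter settings, should be made using `train` only; `test` is used once, at the end.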
Bad Experimental Design
•
Example: Nearest Neighbor Classification
• Training set has been given
A    B    C    D    Class
0.5  0.3  0.1  7.5  +
0.3  0.1  0.7  8.9  -
0.4  0.2  0.8  4.2  +
• Classifying a new example p:
− Find “closest” example q in training set
− Assign label of q to p
Bad Experimental Design
•
How do we measure the distance?
• Weighted Euclidean distance between new point (p1, …, pk) and (q1, …, qk):
$\mathrm{dist}(p, q) = \sqrt{\sum_{i=1}^{k} w_i \,(p_i - q_i)^2}$
• We try some different settings for the weights
− Equal weights: accuracy of 56%
− Standardized weights: accuracy of 65%
− Giving more weight to C: accuracy of 75%
− …
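The nearest-neighbor rule with a weighted distance can be sketched as follows. The three training rows come from the slide; the new example `p` and the two weight settings are hypothetical. Taking the square root follows the usual definition of Euclidean distance and does not change which neighbor is closest.

```python
import math

# Training set from the slide: attributes (A, B, C, D) and class.
train = [
    ((0.5, 0.3, 0.1, 7.5), "+"),
    ((0.3, 0.1, 0.7, 8.9), "-"),
    ((0.4, 0.2, 0.8, 4.2), "+"),
]

def weighted_dist(p, q, w):
    """Weighted Euclidean distance between points p and q."""
    return math.sqrt(sum(wi * (pi - qi) ** 2 for wi, pi, qi in zip(w, p, q)))

def classify(p, w):
    """Assign p the label of its closest training example under weights w."""
    _, label = min(train, key=lambda ex: weighted_dist(p, ex[0], w))
    return label

p = (0.4, 0.2, 0.7, 8.0)          # hypothetical new example
equal = (1.0, 1.0, 1.0, 1.0)      # equal weights
c_heavy = (1.0, 1.0, 100.0, 1.0)  # assumed setting that emphasizes C
print(classify(p, equal), classify(p, c_heavy))
# prints: + -
```

The same point receives a different label under the two settings, which shows how strongly the choice of weights can steer the result.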
Bad Experimental Design
•
We draw the following conclusions:
• Standardized weights with a small correction to increase the weight of C give the best results
• We can get an accuracy as high as 75%
WHAT IS WRONG?
Conclusions
•
Three main techniques:
• Classification
• Pattern Mining
• Clustering
•
Many dangers
• Under/overfitting
• Meaningless discoveries