Part I
Data Mining: A First View
1.1 Data Mining: A Definition
Data Mining
The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.
Induction-based Learning
The process of forming general concept definitions by observing specific examples of concepts to be learned.
Knowledge Discovery in
Databases (KDD)
The application of the scientific method to data mining. Data mining is one step of the KDD process.
Four Levels of Learning
• Facts
• Concepts
• Procedures
• Principles
Facts
Concepts
A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.
Procedures
A procedure is a step-by-step course of action to achieve a goal.
Principles
Principles are general truths or laws that are basic to other truths.
Computers & Learning
Computers are good at learning concepts. Concepts are the output of a data mining session.
Three Concept Views
• Classical View
• Probabilistic View
• Exemplar View
Classical View
All concepts have definite defining properties.
Probabilistic View
People store and recall concepts as generalizations created by observations.
Exemplar View
People store and recall likely concept exemplars that are used to classify unknown instances.
Supervised Learning
• Build a learner model using data instances of known origin.
• Use the model to determine the outcome for new instances of unknown origin.
Supervised Learning:
A Decision Tree Example
Decision Tree
A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
Table 1.1 • Hypothetical Training Data for Disease Diagnosis
Patient  Sore             Swollen
ID#      Throat   Fever   Glands    Congestion   Headache   Diagnosis
1        Yes      Yes     Yes       Yes          Yes        Strep throat
2        No       No      No        Yes          Yes        Allergy
3        Yes      Yes     No        Yes          No         Cold
4        Yes      No      Yes       No           No         Strep throat
5        No       Yes     No        Yes          No         Cold
6        No       No      No        Yes          No         Allergy
7        No       No      Yes       No           No         Strep throat
8        Yes      No      No        Yes          Yes        Allergy
9        No       Yes     No        Yes          Yes        Cold
Swollen Glands
  Yes: Diagnosis = Strep Throat
  No: Fever
    Yes: Diagnosis = Cold
    No: Diagnosis = Allergy
Figure 1.1 A decision tree for the data in Table 1.1
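As a rough illustration, the tree in Figure 1.1 can be written as nested tests and checked against the training data in Table 1.1. This is a minimal Python sketch; the function name `diagnose` and the tuple layout are assumptions made for the example, not part of the original material.

```python
# Training instances from Table 1.1, reduced to the two attributes that the
# tree in Figure 1.1 actually tests: (patient ID, fever, swollen glands, diagnosis).
TRAINING_DATA = [
    (1, "Yes", "Yes", "Strep throat"),
    (2, "No", "No", "Allergy"),
    (3, "Yes", "No", "Cold"),
    (4, "No", "Yes", "Strep throat"),
    (5, "Yes", "No", "Cold"),
    (6, "No", "No", "Allergy"),
    (7, "No", "Yes", "Strep throat"),
    (8, "No", "No", "Allergy"),
    (9, "Yes", "No", "Cold"),
]


def diagnose(fever, swollen_glands):
    """Follow the tree from the root test to a terminal node (hypothetical helper)."""
    if swollen_glands == "Yes":      # root test
        return "Strep throat"
    if fever == "Yes":               # second-level test, reached when glands are not swollen
        return "Cold"
    return "Allergy"


# Collect the IDs of any training patients the tree gets wrong.
misclassified = [
    patient_id
    for patient_id, fever, glands, label in TRAINING_DATA
    if diagnose(fever, glands) != label
]
print(misclassified)
```

The tree tests only two of the five attributes, yet under these assumptions it reproduces every diagnosis listed in Table 1.1.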
Table 1.2 • Data Instances with an Unknown Classification
Patient  Sore             Swollen
ID#      Throat   Fever   Glands    Congestion   Headache   Diagnosis
11       No       No      Yes       Yes          Yes        ?
12       Yes      Yes     No        No           Yes        ?
Production Rules
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
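One way to see how production rules are used is to store them as data and apply them to the unclassified instances in Table 1.2. The sketch below is illustrative only: the names `RULES`, `apply_rules`, and `UNKNOWN` are assumptions, and the conclusion of the third rule (Allergy) is read from the tree in Figure 1.1.

```python
# Each rule pairs the attribute values it requires (the IF part)
# with a diagnosis (the THEN part).
RULES = [
    ({"swollen_glands": "Yes"}, "Strep throat"),
    ({"swollen_glands": "No", "fever": "Yes"}, "Cold"),
    ({"swollen_glands": "No", "fever": "No"}, "Allergy"),
]


def apply_rules(instance):
    """Return the conclusion of the first rule whose conditions all hold, else None."""
    for conditions, diagnosis in RULES:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return diagnosis
    return None


# Unclassified instances from Table 1.2, keyed by patient ID.
UNKNOWN = {
    11: {"sore_throat": "No", "fever": "No", "swollen_glands": "Yes",
         "congestion": "Yes", "headache": "Yes"},
    12: {"sore_throat": "Yes", "fever": "Yes", "swollen_glands": "No",
         "congestion": "No", "headache": "Yes"},
}

predictions = {patient_id: apply_rules(inst) for patient_id, inst in UNKNOWN.items()}
print(predictions)
```

Under these rules patient 11 is assigned Strep throat (swollen glands) and patient 12 is assigned Cold (no swollen glands, fever).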
Unsupervised Clustering
A data mining method that builds models from data without predefined classes.
The Acme Investors Dataset
Table 1.3 • Acme Investors Incorporated
Customer  Account     Margin   Transaction  Trades/              Favorite    Annual
ID        Type        Account  Method       Month    Sex  Age    Recreation  Income
1005      Joint       No       Online       12.5     F    30–39  Tennis      40–59K
1013      Custodial   No       Broker       0.5      F    50–59  Skiing      80–99K
1245      Joint       No       Online       3.6      M    20–29  Golf        20–39K
2110      Individual  Yes      Broker       22.3     M    30–39  Fishing     40–59K
1001      Individual  Yes      Online       5.0      M    40–49  Golf        60–79K
The Acme Investors Dataset &
Supervised Learning
1. Can I develop a general profile of an online investor?
2. Can I determine if a new customer is likely to open a margin account?
3. Can I build a model to predict the average number of trades per month for a new investor?
4. What characteristics differentiate female and male investors?
The Acme Investors Dataset &
Unsupervised Clustering
1. What attribute similarities group customers of Acme Investors together?
2. What differences in attribute values segment the customer database?
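A first step toward answering the similarity question can be sketched by counting, for each pair of customers in Table 1.3, how many categorical attribute values they share. This is not a clustering algorithm, only a simple matching score that a clustering method might build on; the names `CUSTOMERS` and `similarity` are assumptions for the example, and the numeric Trades/Month column is left out.

```python
from itertools import combinations

# Categorical attributes from Table 1.3, in the order:
# account type, margin account, transaction method, sex, age, recreation, income.
CUSTOMERS = {
    1005: ("Joint", "No", "Online", "F", "30-39", "Tennis", "40-59K"),
    1013: ("Custodial", "No", "Broker", "F", "50-59", "Skiing", "80-99K"),
    1245: ("Joint", "No", "Online", "M", "20-29", "Golf", "20-39K"),
    2110: ("Individual", "Yes", "Broker", "M", "30-39", "Fishing", "40-59K"),
    1001: ("Individual", "Yes", "Online", "M", "40-49", "Golf", "60-79K"),
}


def similarity(a, b):
    """Count the attributes on which customers a and b have the same value."""
    return sum(x == y for x, y in zip(CUSTOMERS[a], CUSTOMERS[b]))


# Score every unordered pair of customers, then list the most similar pairs first.
pair_scores = {pair: similarity(*pair) for pair in combinations(CUSTOMERS, 2)}
for pair, score in sorted(pair_scores.items(), key=lambda item: -item[1]):
    print(pair, score)
```

With only five customers the scores are small and several pairs tie, so this illustrates the idea rather than yielding a meaningful segmentation.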
1.3 Is Data Mining Appropriate for My Problem?
Data Mining or Data Query?
• Shallow Knowledge
• Multidimensional Knowledge
• Hidden Knowledge
• Deep Knowledge
Shallow Knowledge
Shallow knowledge is factual. It can be easily stored and manipulated in a database.
Multidimensional Knowledge
Multidimensional knowledge is also factual. Online Analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge.
Hidden Knowledge
Hidden knowledge represents patterns or regularities in data that cannot be easily found using database queries. However, data mining algorithms can find such patterns with ease.
Deep Knowledge
Deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.
Data Mining vs. Data Query: An Example
• Use data query if you already almost know what you are looking for.
• Use data mining to find regularities in data that are not obvious.
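To make the data query case concrete, here is a small sketch that loads the Table 1.3 rows into an in-memory SQLite database and asks a question whose form is already known: which customers trade online? The table name `customers` and the column names are assumptions chosen for the example.

```python
import sqlite3

# In-memory database holding the rows of Table 1.3 (column names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE customers (
           customer_id INTEGER, account_type TEXT, margin_account TEXT,
           transaction_method TEXT, trades_per_month REAL, sex TEXT,
           age TEXT, favorite_recreation TEXT, annual_income TEXT)"""
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1005, "Joint", "No", "Online", 12.5, "F", "30-39", "Tennis", "40-59K"),
        (1013, "Custodial", "No", "Broker", 0.5, "F", "50-59", "Skiing", "80-99K"),
        (1245, "Joint", "No", "Online", 3.6, "M", "20-29", "Golf", "20-39K"),
        (2110, "Individual", "Yes", "Broker", 22.3, "M", "30-39", "Fishing", "40-59K"),
        (1001, "Individual", "Yes", "Online", 5.0, "M", "40-49", "Golf", "60-79K"),
    ],
)

# A data query: we already know exactly which fact we want to retrieve.
online = conn.execute(
    "SELECT customer_id FROM customers "
    "WHERE transaction_method = 'Online' ORDER BY customer_id"
).fetchall()
print(online)
```

The query returns stored facts directly. It does not say what distinguishes online investors from the rest, which is the kind of question data mining is meant to address.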
1.4 Expert Systems or Data Mining?
Expert System
A computer program that emulates the problem-solving skills of one or more human experts.
Knowledge Engineer
A person trained to interact with an expert in order to capture their knowledge.
Data → Data Mining Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat
Human Expert → Knowledge Engineer → Expert System Building Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat
Figure 1.2 Data mining vs. expert systems
1.5 A Simple Data Mining Process Model
[Figure: Operational Database (via SQL Queries) and Data Warehouse → Data Mining → Interpretation & Evaluation → Result Application]
Figure 1.3 A simple data mining process model
Assembling the Data
• The Data Warehouse
The Data Warehouse
The data warehouse is a historical database designed for decision support.
Mining the Data
Interpreting the Results
Result Application
1.6 Why Not Simple Search?
• Nearest Neighbor Classifier
• K-nearest Neighbor Classifier
Nearest Neighbor Classifier
Classification is performed by searching the training data for the instance closest in distance to the unknown instance.
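The search described above can be sketched on the Table 1.1 training data. The distance measure is not specified here, so this example assumes Hamming distance (the number of attributes on which two instances differ); the k-nearest variant is assumed to take a majority vote among the k closest instances. The names `distance`, `k_nearest`, and `classify` are hypothetical.

```python
from collections import Counter

# Table 1.1: (patient ID, attribute values, diagnosis), with attributes ordered as
# sore throat, fever, swollen glands, congestion, headache.
TRAINING_DATA = [
    (1, ("Yes", "Yes", "Yes", "Yes", "Yes"), "Strep throat"),
    (2, ("No", "No", "No", "Yes", "Yes"), "Allergy"),
    (3, ("Yes", "Yes", "No", "Yes", "No"), "Cold"),
    (4, ("Yes", "No", "Yes", "No", "No"), "Strep throat"),
    (5, ("No", "Yes", "No", "Yes", "No"), "Cold"),
    (6, ("No", "No", "No", "Yes", "No"), "Allergy"),
    (7, ("No", "No", "Yes", "No", "No"), "Strep throat"),
    (8, ("Yes", "No", "No", "Yes", "Yes"), "Allergy"),
    (9, ("No", "Yes", "No", "Yes", "Yes"), "Cold"),
]


def distance(a, b):
    """Hamming distance (an assumption): count of attributes whose values differ."""
    return sum(x != y for x, y in zip(a, b))


def k_nearest(instance, k=1):
    """Return the k training rows closest to instance; ties keep table order."""
    return sorted(TRAINING_DATA, key=lambda row: distance(row[1], instance))[:k]


def classify(instance, k=1):
    """Assign the most common diagnosis among the k nearest training rows."""
    votes = Counter(label for _, _, label in k_nearest(instance, k))
    return votes.most_common(1)[0][0]


patient_11 = ("No", "No", "Yes", "Yes", "Yes")   # first row of Table 1.2
nearest_id = k_nearest(patient_11)[0][0]
print(nearest_id, classify(patient_11))
```

Under these assumptions the closest training instance to patient 11 is patient 2, which differs only in swollen glands, so the nearest neighbor answer is Allergy. Results of this kind depend heavily on the chosen distance measure and on how ties are broken.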
[Figure: customers plotted by intrinsic (predicted) value and actual value]
Figure 1.4 Intrinsic vs. actual customer value