Data Mining Fundamentals

(1)

Part I

(2)

Data Mining: A First View

(3)

1.1 Data Mining: A Definition

(4)

Data Mining

The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data contained within a database.

(5)

Induction-based Learning

The process of forming general concept definitions by observing specific examples of concepts to be learned.

(6)

Knowledge Discovery in Databases (KDD)

The application of the scientific method to data mining. Data mining is one step of the KDD process.

(7)

(8)

Four Levels of Learning

• Facts

• Concepts

• Procedures

• Principles

(9)

Facts

(10)

Concepts

A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.

(11)

Procedures

A procedure is a step-by-step course of action to achieve a goal.

(12)

Principles

Principles are general truths or laws that are basic to other truths.

(13)

Computers & Learning

Computers are good at learning concepts. Concepts are the output of a data mining session.

(14)

Three Concept Views

• Classical View

• Probabilistic View

• Exemplar View

(15)

Classical View

All concepts have definite defining properties.

(16)

Probabilistic View

People store and recall concepts as generalizations created by

(17)

Exemplar View

People store and recall likely concept exemplars that are used to classify unknown instances.

(18)

Supervised Learning

• Build a learner model using data instances of known origin.

• Use the model to determine the outcome for new instances of unknown origin.

(19)

Supervised Learning: A Decision Tree Example

(20)

Decision Tree

A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

(21)

Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient  Sore                Swollen
ID#      Throat    Fever     Glands    Congestion  Headache  Diagnosis
1        Yes       Yes       Yes       Yes         Yes       Strep throat
2        No        No        No        Yes         Yes       Allergy
3        Yes       Yes       No        Yes         No        Cold
4        Yes       No        Yes       No          No        Strep throat
5        No        Yes       No        Yes         No        Cold
6        No        No        No        Yes         No        Allergy
7        No        No        Yes       No          No        Strep throat
8        Yes       No        No        Yes         Yes       Allergy
9        No        Yes       No        Yes         Yes       Cold

(22)

Swollen Glands?
  Yes: Diagnosis = Strep Throat
  No:  Fever?
         Yes: Diagnosis = Cold
         No:  Diagnosis = Allergy

Figure 1.1 A decision tree for the data in Table 1.1
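The tree in Figure 1.1 can be written down as a small nested structure and checked against the training data in Table 1.1. The following is a minimal Python sketch. The dictionary layout and the names TREE, TRAINING, and classify are illustrative assumptions, not something the slides define.

```python
# Figure 1.1 as a nested dict (layout chosen for illustration only):
# an inner node is {"attribute": ..., "branches": {...}}, a leaf is a diagnosis string.
TREE = {
    "attribute": "swollen_glands",
    "branches": {
        "Yes": "Strep throat",
        "No": {
            "attribute": "fever",
            "branches": {"Yes": "Cold", "No": "Allergy"},
        },
    },
}

# The two attributes the tree tests, plus the diagnosis, for patients 1-9 of Table 1.1.
TRAINING = [
    # (patient_id, fever, swollen_glands, diagnosis)
    (1, "Yes", "Yes", "Strep throat"),
    (2, "No", "No", "Allergy"),
    (3, "Yes", "No", "Cold"),
    (4, "No", "Yes", "Strep throat"),
    (5, "Yes", "No", "Cold"),
    (6, "No", "No", "Allergy"),
    (7, "No", "Yes", "Strep throat"),
    (8, "No", "No", "Allergy"),
    (9, "Yes", "No", "Cold"),
]


def classify(node, instance):
    """Walk from the root to a leaf, following the branch for each tested attribute."""
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node


# Print each patient's predicted diagnosis and whether it matches the table.
for pid, fever, glands, diagnosis in TRAINING:
    predicted = classify(TREE, {"fever": fever, "swollen_glands": glands})
    print(pid, predicted, predicted == diagnosis)
```

Every listed training instance should come out with its recorded diagnosis, which is what it means for the tree to be consistent with the table.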

(23)

Table 1.2 • Data Instances with an Unknown Classification

Patient  Sore                Swollen
ID#      Throat    Fever     Glands    Congestion  Headache  Diagnosis
11       No        No        Yes       Yes         Yes       ?
12       Yes       Yes       No        No          Yes       ?

(24)

Production Rules

IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat

IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold

IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
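As a rough illustration, the production rules can be stored as data and applied to the unclassified instances of Table 1.2. This Python sketch makes a few assumptions: the names RULES and apply_rules are hypothetical, and the conclusion of the third rule (Allergy) is read off the decision tree in Figure 1.1.

```python
# Each rule: (conditions that must all hold, conclusion).
# The "Allergy" conclusion is taken from Figure 1.1.
RULES = [
    ({"swollen_glands": "Yes"}, "Strep Throat"),
    ({"swollen_glands": "No", "fever": "Yes"}, "Cold"),
    ({"swollen_glands": "No", "fever": "No"}, "Allergy"),
]


def apply_rules(instance):
    """Return the conclusion of the first rule whose conditions all match, else None."""
    for conditions, conclusion in RULES:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return conclusion
    return None


# Patients 11 and 12 from Table 1.2 (only the attributes the rules test).
unknown = {
    11: {"fever": "No", "swollen_glands": "Yes"},
    12: {"fever": "Yes", "swollen_glands": "No"},
}
for pid, instance in unknown.items():
    print(pid, apply_rules(instance))
# prints:
# 11 Strep Throat
# 12 Cold
```

Keeping the rules as data rather than as nested if statements means a rule set produced by a mining tool could be loaded without changing the matching code.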

(25)

Unsupervised Clustering

A data mining method that builds models from data without predefined classes.

(26)

The Acme Investors Dataset

Table 1.3 • Acme Investors Incorporated

Customer  Account     Margin   Transaction  Trades/                 Favorite    Annual
ID        Type        Account  Method       Month    Sex  Age       Recreation  Income
1005      Joint       No       Online       12.5     F    30–39     Tennis      40–59K
1013      Custodial   No       Broker       0.5      F    50–59     Skiing      80–99K
1245      Joint       No       Online       3.6      M    20–29     Golf        20–39K
2110      Individual  Yes      Broker       22.3     M    30–39     Fishing     40–59K
1001      Individual  Yes      Online       5.0      M    40–49     Golf        60–79K

(27)

The Acme Investors Dataset & Supervised Learning

1. Can I develop a general profile of an online investor?

2. Can I determine if a new customer is likely to open a margin account?

3. Can I build a model to predict the average number of trades per month for a new investor?

4. What characteristics differentiate female and male investors?

(28)

The Acme Investors Dataset & Unsupervised Clustering

1. What attribute similarities group customers of Acme Investors together?

2. What differences in attribute values segment the customer database?
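To make the idea of grouping without predefined classes concrete, here is a minimal sketch that clusters the five customers of Table 1.3 on a single attribute, trades per month. The slides do not name a clustering algorithm. Plain k-means with two clusters and hand-picked starting centers is used here purely as an assumption, and five rows are far too few to draw real conclusions.

```python
# Trades/month for the five customers listed in Table 1.3.
trades = {1005: 12.5, 1013: 0.5, 1245: 3.6, 2110: 22.3, 1001: 5.0}


def kmeans_1d(values, centers, max_iter=100):
    """Plain k-means on one numeric attribute, starting from the given centers."""
    centers = list(centers)  # work on a copy
    assignment = None
    for _ in range(max_iter):
        # Assign every value to its closest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            for v in values
        ]
        if new_assignment == assignment:
            break  # no change, so the clustering has converged
        assignment = new_assignment
        # Move each center to the mean of its members (keep it if it has none).
        for c in range(len(centers)):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assignment


ids = list(trades)
labels = kmeans_1d([trades[i] for i in ids], centers=[12.5, 0.5])

# Collect customer IDs per cluster label.
clusters = {}
for cid, label in zip(ids, labels):
    clusters.setdefault(label, []).append(cid)
print(clusters)
```

No class labels are supplied anywhere. The two groups (frequent and infrequent traders) emerge from the attribute values alone, which is the distinction from the supervised examples earlier.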

(29)

1.3 Is Data Mining Appropriate for My Problem?

(30)

Data Mining or Data Query?

• Shallow Knowledge

• Multidimensional Knowledge

• Hidden Knowledge

• Deep Knowledge

(31)

Shallow Knowledge

Shallow knowledge is factual. It can be easily stored and manipulated in a database.

(32)

Multidimensional Knowledge

Multidimensional knowledge is also factual. Online Analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge.

(33)

Hidden Knowledge

Hidden knowledge represents patterns or regularities in data that cannot be easily found using database query. However, data mining algorithms can find such patterns with ease.

(34)

Deep Knowledge

Deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.

(35)

Data Mining vs. Data Query: An Example

• Use data query if you already almost know what you are looking for.

• Use data mining to find regularities in data that are not obvious.
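A data query fits when the question is already fully specified. The sketch below loads part of the Acme Investors table (Table 1.3) into an in-memory SQLite database and asks for the online investors. The table name, column names, and the choice of SQLite are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investors (customer_id INTEGER, account_type TEXT, "
    "margin_account TEXT, transaction_method TEXT, trades_per_month REAL)"
)
# A subset of the columns of Table 1.3.
conn.executemany(
    "INSERT INTO investors VALUES (?, ?, ?, ?, ?)",
    [
        (1005, "Joint", "No", "Online", 12.5),
        (1013, "Custodial", "No", "Broker", 0.5),
        (1245, "Joint", "No", "Online", 3.6),
        (2110, "Individual", "Yes", "Broker", 22.3),
        (1001, "Individual", "Yes", "Online", 5.0),
    ],
)

# A data query: we already know what we are looking for (online investors).
online = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM investors "
        "WHERE transaction_method = 'Online' ORDER BY customer_id"
    )
]
print(online)  # [1001, 1005, 1245]
conn.close()
```

The query returns stored facts. It does not say which attribute combinations characterize an online investor, which is the kind of regularity a data mining session is meant to find.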

(36)

1.4 Expert Systems or Data Mining?

(37)

Expert System

A computer program that emulates the problem-solving skills of one or more human experts.

(38)

Knowledge Engineer

A person trained to interact with an expert in order to capture their knowledge.

(39)

Data -> Data Mining Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat

Human Expert -> Knowledge Engineer -> Expert System Building Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat

Figure 1.2 Data mining vs. expert systems

(40)

1.5 A Simple Data Mining Process Model

(41)

[Diagram components: Operational Database, SQL Queries, Data Warehouse, Data Mining, Interpretation & Evaluation, Result Application]

Figure 1.3 A simple data mining process model

(42)

Assembling the Data

• The Data Warehouse

(43)

The Data Warehouse

The data warehouse is a historical database designed for decision support.

(44)

Mining the Data

(45)

Interpreting the Results

(46)

Result Application

(47)

1.6 Why Not Simple Search?

• Nearest Neighbor Classifier

• K-nearest Neighbor Classifier

(48)

Nearest Neighbor Classifier

Classification is performed by searching the training data for the instance closest in distance to the unknown instance.
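A nearest neighbor search over the training data of Table 1.1 might look like the sketch below. The slides do not say how distance is measured. Counting the attributes on which two instances disagree (Hamming distance) is an assumption, as are the names TRAINING, distance, and nearest_neighbor.

```python
# Table 1.1, patients 1-9:
# (sore throat, fever, swollen glands, congestion, headache), diagnosis
TRAINING = [
    (("Yes", "Yes", "Yes", "Yes", "Yes"), "Strep throat"),
    (("No", "No", "No", "Yes", "Yes"), "Allergy"),
    (("Yes", "Yes", "No", "Yes", "No"), "Cold"),
    (("Yes", "No", "Yes", "No", "No"), "Strep throat"),
    (("No", "Yes", "No", "Yes", "No"), "Cold"),
    (("No", "No", "No", "Yes", "No"), "Allergy"),
    (("No", "No", "Yes", "No", "No"), "Strep throat"),
    (("Yes", "No", "No", "Yes", "Yes"), "Allergy"),
    (("No", "Yes", "No", "Yes", "Yes"), "Cold"),
]


def distance(a, b):
    """Number of attributes on which two instances disagree (Hamming distance)."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor(unknown):
    """Return the diagnosis of the training instance closest to `unknown`."""
    _, diagnosis = min(TRAINING, key=lambda row: distance(row[0], unknown))
    return diagnosis


patient_11 = ("No", "No", "Yes", "Yes", "Yes")  # from Table 1.2
print(nearest_neighbor(patient_11))  # Allergy (closest is patient 2, distance 1)
```

Under this distance, patient 11 is labeled Allergy, whereas the decision tree of Figure 1.1 assigns Strep Throat. The search weights all five attributes equally, while the tree tests only Swollen Glands and Fever. Ties are also possible, in which case min simply returns the first closest instance.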

(49)

(50)

(51)

[Plot: Intrinsic (Predicted) Value vs. Actual Value]

Figure 1.4 Intrinsic vs. actual customer value