
Data Mining Fundamentals


Academic year: 2021


(1)

Part I


(2)

Data Mining: A First View


(3)


1.1 Data Mining: A Definition

(4)

Data Mining

The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data contained within a database.

(5)

Induction-based Learning

The process of forming general concept definitions by observing specific examples of concepts to be learned.

(6)

Knowledge Discovery in Databases (KDD)

The application of the scientific method to data mining. Data mining is one step of the KDD process.

(7)
(8)

Four Levels of Learning

• Facts

• Concepts

• Procedures

• Principles

(9)

Facts

(10)

Concepts

A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.

(11)

Procedures

A procedure is a step-by-step course of action to achieve a goal.

(12)

Principles

Principles are general truths or laws that are basic to other truths.

(13)

Computers & Learning

Computers are good at learning concepts. Concepts are the output of a data mining session.

(14)

Three Concept Views

• Classical View

• Probabilistic View

• Exemplar View

(15)

Classical View

All concepts have definite defining properties.

(16)

Probabilistic View

People store and recall concepts as generalizations created by ...

(17)

Exemplar View

People store and recall likely concept exemplars that are used to classify unknown instances.

(18)

Supervised Learning

• Build a learner model using data instances of known origin.

• Use the model to determine the outcome for new instances of unknown origin.

(19)

Supervised Learning: A Decision Tree Example

(20)

Decision Tree

A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

(21)

Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold

(22)

Swollen Glands
    Yes: Diagnosis = Strep Throat
    No:  Fever
             Yes: Diagnosis = Cold
             No:  Diagnosis = Allergy

Figure 1.1 A decision tree for the data in Table 1.1
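As a rough illustration, the tree in Figure 1.1 can be written as two nested tests and checked against the rows of Table 1.1. This is a minimal sketch, not the algorithm that built the tree. The names `TRAINING`, `classify`, and `training_errors` are made up for this example.

```python
# Rows of Table 1.1 as listed in the slides:
# (patient id, sore throat, fever, swollen glands, congestion, headache, diagnosis)
TRAINING = [
    (1, "Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    (2, "No", "No", "No", "Yes", "Yes", "Allergy"),
    (3, "Yes", "Yes", "No", "Yes", "No", "Cold"),
    (4, "Yes", "No", "Yes", "No", "No", "Strep throat"),
    (5, "No", "Yes", "No", "Yes", "No", "Cold"),
    (6, "No", "No", "No", "Yes", "No", "Allergy"),
    (7, "No", "No", "Yes", "No", "No", "Strep throat"),
    (8, "Yes", "No", "No", "Yes", "Yes", "Allergy"),
    (9, "No", "Yes", "No", "Yes", "Yes", "Cold"),
]


def classify(fever, swollen_glands):
    """Follow the tree: test Swollen Glands at the root, then Fever."""
    if swollen_glands == "Yes":
        return "Strep throat"
    if fever == "Yes":
        return "Cold"
    return "Allergy"


def training_errors():
    """Return the IDs of training patients the tree gets wrong."""
    errors = []
    for pid, _sore, fever, swollen, _congestion, _headache, diagnosis in TRAINING:
        if classify(fever, swollen) != diagnosis:
            errors.append(pid)
    return errors


print(training_errors())
```

The tree uses only two of the five attributes, and it still classifies every listed training row correctly.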

(23)

Table 1.2 • Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?

(24)

Production Rules

IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat

IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold

IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
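The production rules can be stored as data and applied to the unclassified instances of Table 1.2. The sketch below assumes a first-match rule order, and assumes the third rule concludes Allergy, which is what the Table 1.1 rows with no swollen glands and no fever show. `RULES`, `UNKNOWN`, and `apply_rules` are illustrative names.

```python
# Each rule pairs a set of required attribute values with a conclusion.
RULES = [
    ({"Swollen Glands": "Yes"}, "Strep Throat"),
    ({"Swollen Glands": "No", "Fever": "Yes"}, "Cold"),
    # Assumed conclusion: inferred from Table 1.1, not stated in full on the slide.
    ({"Swollen Glands": "No", "Fever": "No"}, "Allergy"),
]

# The two unclassified patients from Table 1.2.
UNKNOWN = {
    11: {"Sore Throat": "No", "Fever": "No", "Swollen Glands": "Yes",
         "Congestion": "Yes", "Headache": "Yes"},
    12: {"Sore Throat": "Yes", "Fever": "Yes", "Swollen Glands": "No",
         "Congestion": "No", "Headache": "Yes"},
}


def apply_rules(instance):
    """Return the conclusion of the first rule whose conditions all hold."""
    for conditions, conclusion in RULES:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return conclusion
    return None  # no rule fired


predictions = {pid: apply_rules(instance) for pid, instance in UNKNOWN.items()}
print(predictions)
```

Under these rules patient 11 is classified as Strep Throat and patient 12 as Cold.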

(25)

Unsupervised Clustering

A data mining method that builds models from data without predefined classes.

(26)

The Acme Investors Dataset

Table 1.3 • Acme Investors Incorporated

Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5.0           M    40–49  Golf                 60–79K

(27)

The Acme Investors Dataset & Supervised Learning

1. Can I develop a general profile of an online investor?

2. Can I determine if a new customer is likely to open a margin account?

3. Can I build a model to predict the average number of trades per month for a new investor?

4. What characteristics differentiate female and male investors?

(28)

The Acme Investors Dataset & Unsupervised Clustering

1. What attribute similarities group customers of Acme Investors together?

2. What differences in attribute values segment the customer database?
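One simple way to look for attribute similarities is to count how many categorical attributes two customers share. The sketch below applies a simple matching score to the Table 1.3 rows. It is not a clustering algorithm, only the kind of similarity measure one could be built on. It leaves out the numeric and range attributes (trades per month, age, income), and all names in it are illustrative.

```python
ATTRIBUTES = ("account_type", "margin_account", "transaction_method",
              "sex", "favorite_recreation")

# Categorical attributes of the five customers in Table 1.3.
CUSTOMERS = {
    1005: ("Joint", "No", "Online", "F", "Tennis"),
    1013: ("Custodial", "No", "Broker", "F", "Skiing"),
    1245: ("Joint", "No", "Online", "M", "Golf"),
    2110: ("Individual", "Yes", "Broker", "M", "Fishing"),
    1001: ("Individual", "Yes", "Online", "M", "Golf"),
}


def shared_attributes(id_a, id_b):
    """Names of the attributes on which two customers have the same value."""
    pairs = zip(ATTRIBUTES, CUSTOMERS[id_a], CUSTOMERS[id_b])
    return [name for name, a, b in pairs if a == b]


def similarity(id_a, id_b):
    """Simple matching score: fraction of attributes with equal values."""
    return len(shared_attributes(id_a, id_b)) / len(ATTRIBUTES)


print(shared_attributes(1005, 1245), similarity(1005, 1245))
print(shared_attributes(1005, 2110), similarity(1005, 2110))
```

Customers 1005 and 1245 match on account type, margin account, and transaction method, while 1005 and 2110 share no categorical value.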

(29)

1.3 Is Data Mining Appropriate for My Problem?

(30)

Data Mining or Data Query?

• Shallow Knowledge

• Multidimensional Knowledge

• Hidden Knowledge

• Deep Knowledge

(31)

Shallow Knowledge


Shallow knowledge is factual. It can be easily stored and manipulated in a database.

(32)

Multidimensional Knowledge

Multidimensional knowledge is also factual. Online analytical processing (OLAP) tools are used to manipulate multidimensional knowledge.
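To make "multidimensional" concrete, the sketch below rolls up trades per month from Table 1.3 along two dimensions, transaction method and sex. Real OLAP tools work on dedicated data cubes. This is only a plain-Python approximation, and `FACTS` and `mean_trades_by` are names invented for the example.

```python
from collections import defaultdict

# (transaction method, sex, trades per month) for the customers in Table 1.3.
FACTS = [
    ("Online", "F", 12.5),
    ("Broker", "F", 0.5),
    ("Online", "M", 3.6),
    ("Broker", "M", 22.3),
    ("Online", "M", 5.0),
]


def mean_trades_by(facts, *dims):
    """Average trades per month, grouped by the given column indexes (0 or 1)."""
    groups = defaultdict(list)
    for row in facts:
        key = tuple(row[d] for d in dims)
        groups[key].append(row[2])
    return {key: sum(values) / len(values) for key, values in groups.items()}


by_method_and_sex = mean_trades_by(FACTS, 0, 1)  # one cell per (method, sex)
by_method = mean_trades_by(FACTS, 0)             # rolled up over sex

print(by_method_and_sex)
print(by_method)
```

Both results are factual summaries of stored values. Nothing is learned from the data, which is why the slides group this with shallow knowledge and not with data mining.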

(33)

Hidden Knowledge

Hidden knowledge represents patterns or regularities in data that cannot be easily found using database queries. However, data mining algorithms can find such patterns with ease.

(34)

Deep Knowledge


Deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.

(35)

Data Mining vs. Data Query: An Example

• Use data query if you already almost know what you are looking for.

• Use data mining to find regularities in data that are not obvious.

(36)

1.4 Expert Systems or Data Mining?

(37)

Expert System

A computer program that emulates the problem-solving skills of one or more human experts.

(38)

Knowledge Engineer

A person trained to interact with an expert in order to capture their knowledge.

(39)

Data -> Data Mining Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat

Human Expert -> Knowledge Engineer -> Expert System Building Tool -> If Swollen Glands = Yes ...

Figure 1.2 Data mining vs. expert systems

(40)

1.5 A Simple Data Mining Process Model

(41)

[Figure components: Operational Database, Data Warehouse, SQL Queries, Data Mining, Interpretation & Evaluation, Result Application]

Figure 1.3 A simple data mining process model

(42)


Assembling the Data

• The Data Warehouse

(43)

The Data Warehouse

The data warehouse is a historical database designed for decision support.

(44)

Mining the Data


(45)

Interpreting the Results


(46)

Result Application


(47)

1.6 Why Not Simple Search?

• Nearest Neighbor Classifier

• K-nearest Neighbor Classifier

(48)

Nearest Neighbor Classifier

Classification is performed by searching the training data for the instance closest in distance to the unknown instance.
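A minimal sketch of nearest neighbor search over the Table 1.1 rows follows. The slides do not name a distance measure, so the sketch assumes a count of differing attribute values (Hamming distance). The k-nearest variant takes a majority vote among the k closest rows. `TRAINING`, `distance`, and `k_nearest` are illustrative names.

```python
from collections import Counter

# Table 1.1 rows: ((sore throat, fever, swollen glands, congestion, headache), diagnosis)
TRAINING = [
    (("Yes", "Yes", "Yes", "Yes", "Yes"), "Strep throat"),
    (("No", "No", "No", "Yes", "Yes"), "Allergy"),
    (("Yes", "Yes", "No", "Yes", "No"), "Cold"),
    (("Yes", "No", "Yes", "No", "No"), "Strep throat"),
    (("No", "Yes", "No", "Yes", "No"), "Cold"),
    (("No", "No", "No", "Yes", "No"), "Allergy"),
    (("No", "No", "Yes", "No", "No"), "Strep throat"),
    (("Yes", "No", "No", "Yes", "Yes"), "Allergy"),
    (("No", "Yes", "No", "Yes", "Yes"), "Cold"),
]


def distance(a, b):
    """Assumed measure: number of attributes on which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def k_nearest(instance, k=1):
    """Majority diagnosis among the k training rows closest to the instance."""
    ranked = sorted(TRAINING, key=lambda row: distance(row[0], instance))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


patient_11 = ("No", "No", "Yes", "Yes", "Yes")  # from Table 1.2
patient_12 = ("Yes", "Yes", "No", "No", "Yes")  # from Table 1.2

print(k_nearest(patient_11, k=1))
print(k_nearest(patient_12, k=4))
```

Patient 11 has a single closest row (patient 2, distance 1), so k=1 is enough. Patient 12 has four rows tied at distance 2, so the example uses k=4 and lets them vote. With this distance, patient 11 comes out as Allergy, while the decision tree in Figure 1.1 would say Strep throat, which shows how much the result depends on the chosen measure.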

(49)
(50)
(51)

[Figure axes: Intrinsic (Predicted) Value, Actual Value]

Figure 1.4 Intrinsic vs. actual customer value
