
Learning Supervised Learning Classification Prediction


Academic year: 2020


(1)

Lecture 07

(2)

Outline

Learning

Supervised Learning

Classification

Prediction

Hi Humans !! Hi Machine

(3)

Machine Learning

• Supervised (Task Driven): develop a predictive model based on both input and output data

  • Classification: Decision Tree, NN, Naïve Bayes, KNN, SVM, Discriminant Analysis, Ensemble Methods, Random Forest

  • Regression: Ordinary LSR, Linear Regression, Logistic Regression, MARS, LOESS

• Unsupervised (Data Driven): discover an internal representation

  • Clustering: k-Means, k-Medoids, Fuzzy C-means, Hierarchical, SOM, Hidden Markov Model, Gaussian Mixture

  • Dimension Reduction: PCA, LDA

• Reinforcement (algorithms learn to react to an environment)

  • Decision Process

  • Reward System

  • Recommendation Systems

(4)

Comparison with Traditional Programming

Traditional Programming: Data + Program → Computer → Output

Machine Learning: Data + Output → Computer → Program

(5)

Webster's definition of "to learn":

"To gain knowledge or understanding of, or skill in, by study, instruction or experience"

Learning a set of new facts

Learning HOW to do something

Improving ability of something already learned

(6)

There is no need to “learn” to calculate payroll

Learning is used when:

Human expertise does not exist

Humans are unable to explain their expertise (speech recognition)

Solution changes in time (routing on a computer network)

Solution needs to be adapted to particular cases (user biometrics)

(7)

Examples

Walking (motor skills)

Riding a bike (motor skills)

Telephone number (memorizing)

Playing backgammon (strategy)

Develop scientific theory (abstraction)

Language

Recognize fraudulent credit card transactions

Etc.

(8)
(9)

Supervised learning is the machine learning task of inferring a function from supervised training data.

The training data consist of a set of training examples.

In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

(10)

A supervised learning algorithm analyzes the training data and produces an inferred function, which is called:

• a classifier, if the output is discrete, or

• a regression function, if the output is continuous.

(11)

Given a set of training examples of the form

$\{(x_1, y_1), \ldots, (x_N, y_N)\}$

a learning algorithm seeks a function

$g : X \to Y$

where $X$ is the input space, $Y$ is the output space, and the function $g$ is an element of some space of possible functions $G$, usually called the hypothesis space.
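As a minimal sketch of this definition, the snippet below picks a function g from a tiny hypothesis space G of threshold functions by minimizing the error on the training examples. The toy data, the threshold family and the names `training_examples`, `make_threshold` and `training_error` are illustrative assumptions, not part of the lecture.

```python
# Toy training set {(x_1, y_1), ..., (x_N, y_N)}: x is a number, y is 0 or 1.
training_examples = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]


def make_threshold(t):
    """Return the hypothesis g_t(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0


# Hypothesis space G: a small, finite family of candidate functions g : X -> Y.
thresholds = [0.5, 1.5, 2.5, 3.5, 4.5]
G = {t: make_threshold(t) for t in thresholds}


def training_error(g, examples):
    """Fraction of training examples that g labels incorrectly."""
    return sum(g(x) != y for x, y in examples) / len(examples)


# The "learning algorithm": choose the g in G with the lowest training error.
best_t = min(G, key=lambda t: training_error(G[t], training_examples))
g = G[best_t]
print(best_t, training_error(g, training_examples))
```

Real learning algorithms search far larger hypothesis spaces, but the structure is the same: a set of candidate functions and a criterion for choosing among them.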

(12)
(13)

Classification: Definition

Given a collection of records (training set)

• Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

• A test set is used to determine the accuracy of the model.

(14)

[Figure: learning step. Training data (attribute vectors with class labels) → Classification Algorithm → Classification rules]

(15)

[Figure: classification step. Test data → Classification rules → estimate classifier accuracy (to avoid overfitting), measured as the % of test set tuples correctly classified. New data → Classification rules → predict classification of new data, e.g. (Fahim Anwar, youth, medium): Loan decision? Risky]
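The accuracy estimate described here (the percentage of test set tuples correctly classified) can be sketched as follows. The rule `loan_rule` and the four test tuples are hypothetical and exist only to make the example runnable; they are not the lecture's data.

```python
def accuracy(classifier, test_set):
    """Percentage of test tuples whose predicted class equals the true class."""
    correct = sum(classifier(x) == label for x, label in test_set)
    return 100.0 * correct / len(test_set)


# Hypothetical classification rule, for illustration only.
def loan_rule(applicant):
    age, _income = applicant
    return "risky" if age == "youth" else "safe"


# Hypothetical labelled test tuples: ((age, income), true class).
test_set = [
    (("youth", "medium"), "risky"),
    (("senior", "high"), "safe"),
    (("middle_aged", "low"), "risky"),
    (("youth", "low"), "risky"),
]

print(accuracy(loan_rule, test_set))  # 3 of the 4 tuples are classified correctly
```

The test tuples must be kept separate from the training data; accuracy measured on the training set itself tends to be optimistic.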

(16)

Examples of Classification Task

Applications

• Predicting tumor cells as benign or malignant

• Classifying credit card transactions as legitimate or fraudulent

• Categorizing news stories as finance, weather,

(17)

Back propagation

Bayesian Classifiers

Decision Trees

Density estimation methods

Fuzzy set theory

Linear discriminant analysis (LDA)

Logistic regression

Naive Bayes classifier

Nearest neighbor classification

Neural networks

Quadratic discriminant analysis (QDA)

Support Vector Machine

many more…

(18)
(19)

Definition

A decision tree is a classifier in the form of a tree structure

• Decision node: specifies a test on a single attribute

• Leaf node: indicates the value of the target attribute

• Arc/edge: split of one attribute

• Path: a conjunction of tests that leads to the final decision

(20)

Attribute-value description: an object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold).

Predefined classes (target values): the target function has discrete output values (single or multiclass)

Sufficient data: enough training cases should be provided to learn the model.

(21)

Consider an example of playing tennis

• Attributes (features): Outlook, Temp, Humidity, Wind

• Values: description of the features, e.g. Outlook values: sunny, cloudy, overcast

• Target: Play, which represents the output of the model

• Instances: examples D1 to D14 of the dataset

• Concept: learn to decide whether to play tennis, i.e. find h from the given data set

Day Outlook Temp Humidity Wind Play

(22)

Shall we play tennis today?

(23)

• Fast to implement

• Simple to implement, because it performs classification without much computation

• Can convert the result to a set of easily interpretable rules that can be used in a knowledge system such as a database, where rules are built from the labels of the nodes and the labels of the arcs

• Can handle continuous and categorical variables

• Can handle noisy data

• Provides a clear indication of which fields are most important for prediction or classification

(24)

"Univariate"

splits/partitioning use only one attribute at a time, which limits the types of possible trees

Large decision trees may be hard to understand

Perform poorly with many classes and little data

(25)

ID3 Algorithm

Other Decision Tree Learning Algorithms

• C4.5

• CART (Classification and Regression Tree)

• CHAID (CHi-squared Automatic Interaction Detection)

• MARS

• QUEST (Quick, Unbiased, Efficient, Statistical Tree)

• SLIQ

• SPRINT

(26)

Top-down tree construction schema:

1. Examine the training database and find the best splitting predicate for the root node

2. Partition the training database

3. Recurse on each child node
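The three steps above can be sketched as a short recursive function. This is a simplified illustration, assuming categorical attributes, rows stored as dictionaries, and the split with the lowest weighted entropy (equivalently, the highest information gain) as the "best splitting predicate"; the helper names and the four-row dataset are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def partition(rows, attr):
    """Group rows by their value of attribute attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups


def build_tree(rows, attrs, target):
    """Top-down construction: pick a split, partition, recurse on each child."""
    labels = [row[target] for row in rows]
    # Stop when the partition is pure or no attributes remain:
    # return a leaf labelled with the majority class.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]

    def expected_info(attr):
        # Weighted entropy of the partitions produced by splitting on attr.
        return sum(len(g) / len(rows) * entropy([r[target] for r in g])
                   for g in partition(rows, attr).values())

    # Step 1: lowest expected information == highest information gain.
    best = min(attrs, key=expected_info)
    rest = [a for a in attrs if a != best]
    # Steps 2 and 3: partition on the chosen attribute and recurse.
    return {best: {value: build_tree(group, rest, target)
                   for value, group in partition(rows, best).items()}}


# Tiny made-up dataset, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
```

The nested dictionary mirrors the tree: each key is a decision node, each inner key an arc, and each string a leaf.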

(27)

Attribute selection measure: a heuristic for selecting the splitting criterion that "best" splits a given data partition into smaller, mutually exclusive classes

Attributes are ranked according to a measure

• The attribute having the best score is chosen as the splitting attribute

• A split-point is chosen for continuous attributes

• A splitting subset is chosen for discrete attributes with binary trees

Measures: Information Gain, Gain Ratio, Gini Index

(28)

Information Gain

Based on Shannon's information theory

Goal is to minimize the expected number of tests needed to classify a tuple

• guarantees that a simple tree is found

The attribute with the highest information gain is chosen as the splitting attribute

• minimizes the information needed to classify tuples in the resulting partitions

• reflects the least "impurity" in the resulting partitions
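For reference, the standard definitions behind this measure are restated below. The notation ($D$ for the data partition, $p_i$ for the fraction of tuples in class $C_i$, $D_j$ for the subset of $D$ with the $j$-th value of attribute $A$) is the usual textbook notation and is an assumption about what the lecture's own slides used.

```latex
\begin{align*}
\mathrm{Info}(D) &= -\sum_{i=1}^{m} p_i \log_2 p_i,
  \qquad p_i = \frac{|C_{i,D}|}{|D|} \\
\mathrm{Info}_A(D) &= \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) \\
\mathrm{Gain}(A) &= \mathrm{Info}(D) - \mathrm{Info}_A(D)
\end{align*}
```

For a partition with 9 tuples of one class and 5 of the other, $\mathrm{Info}(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940$ bits.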

(29)
(30)
(31)
(32)

RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle aged  medium  no       excellent      yes
13   middle aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

Class C1 (buys_computer = yes): 9 tuples

Class C2 (buys_computer = no): 5 tuples

(33)


(34)


(35)


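3. Compute the gain as the information needed before the split minus the information needed after splitting on age:

Gain(age) = 0.94 - 0.694 = 0.246 bits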

(36)


Gain(age) = 0.246 bits

Gain(income) = 0.029 bits

Gain(student) = 0.151 bits

Gain(credit_rating) = 0.048 bits

age has the highest information gain, so it is chosen as the splitting attribute at the root
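The gains listed above can be checked with a few lines of Python. This is a sketch using the 14-tuple buys_computer table; the names `info` and `gain` and the tuple layout are illustrative choices. Computed without intermediate rounding, the values for age and student come out one unit higher in the third decimal than the figures quoted above, which appear to have been obtained from rounded intermediate results.

```python
import math
from collections import Counter

COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def info(labels):
    """Expected information (entropy, in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, attr, target="buys_computer"):
    """Information gain obtained by splitting rows on attr."""
    a, t = COLUMNS.index(attr), COLUMNS.index(target)
    groups = {}
    for row in rows:
        groups.setdefault(row[a], []).append(row[t])
    info_after = sum(len(g) / len(rows) * info(g) for g in groups.values())
    return info([row[t] for row in rows]) - info_after


for attr in COLUMNS[:-1]:
    print(f"Gain({attr}) = {gain(ROWS, attr):.3f} bits")

# The same function applies to a branch, e.g. the tuples with age = youth.
youth = [row for row in ROWS if row[0] == "youth"]
print(f"Gain(student | age = youth) = {gain(youth, 'student'):.2f} bits")
```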

(37)
(38)
(39)

Decision Tree Induction - Attribute Selection Measures

The same process of splitting has to happen for the two remaining branches.

For branch age<=30 (youth) we still have the attributes income, student and credit_rating.

The expected information for this branch is I(2 yes, 3 no) = I(2,3) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97

For income we have three values: high (0 yes and 2 no), medium (1 yes and 1 no) and low (1 yes and 0 no)

Entropy(income) = 2/5 (0) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) + 1/5 (0) = 2/5 (1) = 0.4

Gain(income) = 0.97 - 0.4 = 0.57

(40)


For student we have two values: yes (2 yes and 0 no) and no (0 yes and 3 no)

Entropy(student) = 2/5 (0) + 3/5 (0) = 0

Gain(student) = 0.97 - 0 = 0.97

We can then safely split on attribute student without

(41)


(42)
(43)


Again the same process is needed for the other branch of age (senior).

The expected information for this branch is I(3 yes, 2 no) = I(3,2) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97

For income we have two values: medium (2 yes and 1 no) and low (1 yes and 1 no)

Entropy(income) = 3/5 (-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) = 0.95

Gain(income) = 0.97 - 0.95 = 0.02

(44)


For student we have two values: yes (2 yes and 1 no) and no (1 yes and 1 no)

Entropy(student) = 3/5 (-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) = 0.95

Gain(student) = 0.97 - 0.95 = 0.02

For credit_rating we have two values: fair (3 yes and 0 no) and excellent (0 yes and 2 no)

Entropy(credit_rating) = 0

Gain(credit_rating) = 0.97 - 0 = 0.97

(45)
(46)


We then split based on credit_rating.

These splits give partitions each with records from the same class, so we make them into leaf nodes with their class label attached.

New example: age<=30, income=medium, student=yes, credit_rating=fair

(47)
(48)

The tree may be overfitted to dataset anomalies and outliers

Pruning removes the least reliable branches

• The decision tree (DT) becomes less complex

Prepruning: statistically assess the goodness of a split before it takes place

• It is hard to choose thresholds for statistical significance

Postpruning: remove sub-trees from already constructed trees

(49)
(50)

(51)


R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)

$\mathit{coverage}(R1) = \frac{2}{14} = 14.28\%$

$\mathit{accuracy}(R1) = \frac{2}{2} = 100\%$

X: (age = youth, income = medium, student = yes, credit_rating = fair)
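A short sketch of how these two measures can be computed for R1, assuming the usual definitions: coverage is the share of all tuples that satisfy the rule's condition, and accuracy is the share of those covered tuples that the rule classifies correctly. Only the three columns that R1 needs are copied from the buys_computer table; the variable names are illustrative.

```python
# (age, student, buys_computer) for RID 1..14 of the buys_computer table.
ROWS = [
    ("youth", "no", "no"),
    ("youth", "no", "no"),
    ("middle aged", "no", "yes"),
    ("senior", "no", "yes"),
    ("senior", "yes", "yes"),
    ("senior", "yes", "no"),
    ("middle aged", "yes", "yes"),
    ("youth", "no", "no"),
    ("youth", "yes", "yes"),
    ("senior", "yes", "yes"),
    ("youth", "yes", "yes"),
    ("middle aged", "no", "yes"),
    ("middle aged", "yes", "yes"),
    ("senior", "no", "no"),
]

# Tuples covered by R1: the condition (age = youth) AND (student = yes) holds.
covered = [row for row in ROWS if row[0] == "youth" and row[1] == "yes"]
# Covered tuples that R1 also classifies correctly (buys_computer = yes).
correct = [row for row in covered if row[2] == "yes"]

coverage = 100.0 * len(covered) / len(ROWS)
accuracy = 100.0 * len(correct) / len(covered)
print(f"coverage(R1) = {len(covered)}/{len(ROWS)} = {coverage:.2f}%")
print(f"accuracy(R1) = {len(correct)}/{len(covered)} = {accuracy:.0f}%")
```

A rule can be perfectly accurate and still cover only a small part of the data, which is why both numbers are reported.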

(52)

Rule Extraction from a Decision Tree

Rules are easier to understand than large trees.

One rule is created for each path from the root to a leaf

Each attribute-value pair along a path forms a conjunction (ANDed); the leaf holds the class prediction (THEN)

Rules are mutually exclusive and exhaustive

Example: Rule extraction from our buys_computer decision tree

R1: IF age = youth AND student = no THEN buys_computer = no

R2: IF age = youth AND student = yes THEN buys_computer = yes

R3: IF age = middle aged THEN buys_computer = yes
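The extracted rules can be applied mechanically, as in the sketch below. R1 to R3 are the rules listed above. The two rules for age = senior are an assumption: they are not listed in the text and are read off the credit_rating split worked out for the senior branch (fair: 3 yes and 0 no; excellent: 0 yes and 2 no). The names `RULES`, `matching_rules` and `classify` are illustrative.

```python
# Each rule: (conditions that must all hold, predicted buys_computer class).
RULES = [
    ({"age": "youth", "student": "no"}, "no"),    # R1
    ({"age": "youth", "student": "yes"}, "yes"),  # R2
    ({"age": "middle aged"}, "yes"),              # R3
    # Assumed rules for the senior branch (not listed in the text).
    ({"age": "senior", "credit_rating": "fair"}, "yes"),
    ({"age": "senior", "credit_rating": "excellent"}, "no"),
]


def matching_rules(x):
    """Return every rule whose conditions are all satisfied by tuple x."""
    return [(cond, label) for cond, label in RULES
            if all(x[attr] == value for attr, value in cond.items())]


def classify(x):
    """Predict the class of x; exactly one rule should fire."""
    matches = matching_rules(x)
    # Mutually exclusive and exhaustive rules: one and only one match.
    assert len(matches) == 1, "rule set is not exclusive/exhaustive for this tuple"
    return matches[0][1]


X = {"age": "youth", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(X))  # R2 fires
```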

(53)
(54)

What Is Prediction?

Prediction is similar to classification

• First, construct a model

• Second, use the model to predict an unknown value

The major method for prediction is regression

• Linear and multiple regression

• Non-linear regression

Prediction is different from classification

• Classification refers to predicting a categorical class label

(55)

Linear and Multiple Regression Analysis

Linear regression: $Y = \alpha + \beta X$

• Two parameters, $\alpha$ and $\beta$, specify the line and are to be estimated by using the data at hand,

• using the least squares criterion on the known values of $Y_1, Y_2, \ldots$, $X_1, X_2, \ldots$

Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2$

• Many nonlinear functions can be transformed into the above.

(56)

Example: Linear Regression Analysis

Given data

X (Years experience)   Y (Salary)
3                      30K
8                      57K
9                      64K
13                     75K
3                      36K
6                      43K
11                     59K
21                     90K
1                      20K
16                     83K

y = 23.6 + 3.5x
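The least squares estimates for this data can be reproduced with the closed-form formulas for simple linear regression: the slope is the sum of products of deviations divided by the sum of squared deviations of x, and the intercept follows from the means. This is a sketch with illustrative variable names. Computed without rounding, it gives a slope of about 3.57 and an intercept of about 23.2, close to the rounded line y = 23.6 + 3.5x quoted above.

```python
years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]           # X: years of experience
salary = [30, 57, 64, 75, 36, 43, 59, 90, 20, 83]    # Y: salary in thousands

n = len(years)
x_mean = sum(years) / n
y_mean = sum(salary) / n

# Least squares slope: covariance of (x, y) divided by variance of x.
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(years, salary))
s_xx = sum((x - x_mean) ** 2 for x in years)
slope = s_xy / s_xx

# The fitted line passes through the point of means.
intercept = y_mean - slope * x_mean

print(f"y = {intercept:.2f} + {slope:.2f} x")
```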

(57)

X (Years experience)   Y (Salary)   Y (Predicted) [Y = 23.6 + 3.5X]   abs(Y-Yp)   Sqr(Y-Yp)
3                      30K          34.1K                             4.1         16.81
8                      57K          51.6K                             5.4         29.16
9                      64K          55.1K                             8.9         79.21
13                     75K          69.1K                             5.9         34.81
3                      36K          34.1K                             1.9         3.61
6                      43K          44.6K                             1.6         2.56
11                     59K          62.1K                             3.1         9.61
21                     90K          97.1K                             7.1         50.41
1                      20K          27.1K                             7.1         50.41
16                     83K          79.6K                             3.4         11.56

Mean Absolute Error = 4.85

Root Mean Squared Error = 5.37
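The two error measures in this table can be recomputed as follows. The sketch uses the line Y = 23.6 + 3.5X from the table header; the variable names are illustrative.

```python
import math

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]           # X: years of experience
salary = [30, 57, 64, 75, 36, 43, 59, 90, 20, 83]    # Y: salary in thousands

# Predictions from the fitted line Y = 23.6 + 3.5 X.
predicted = [23.6 + 3.5 * x for x in years]

abs_errors = [abs(y - yp) for y, yp in zip(salary, predicted)]
sq_errors = [(y - yp) ** 2 for y, yp in zip(salary, predicted)]

n = len(years)
mae = sum(abs_errors) / n                 # Mean Absolute Error
rmse = math.sqrt(sum(sq_errors) / n)      # Root Mean Squared Error

print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

RMSE is never smaller than MAE because squaring gives larger errors more weight before the average is taken.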

(58)

Classification vs Regression

Classification means to group the output into a class.

• e.g. classification to predict the type of tumor, i.e. harmful or not harmful, using training data

• if it is a discrete/categorical variable, then it is a classification problem

Regression means to predict the output value using training data.

• e.g. regression to predict the house price from training data

• if it is a real

(59)

1. Compare traditional programming with machine learning.

2. What is learning? Why learn?

3. What is supervised learning? What are supervised learning algorithms?

4. What is a classifier? Explain with an example.

5. What is a decision tree? What are the pros and cons of decision trees?

6. What are attribute selection measures?

7. What are the criteria to prune a tree?

8. How are rules extracted from a decision tree? Give an example.

9. How is the goodness of a rule measured?

10. What is prediction? How is it different from classification?

(60)

References
