Learning Supervised Learning Classification Prediction

(1)

Lecture 07

(2)

Outline

• Learning
• Supervised Learning
  – Classification
  – Prediction

Hi Humans !! Hi Machine

(3)

Machine Learning

• Supervised (Task Driven): develop a predictive model based on both input and output data
  – Classification: Decision Tree, NN, Naïve Bayes, KNN, SVM, Discriminant Analysis, Ensemble Methods, Random Forest
  – Regression: Ordinary LSR, Linear Regression, Logistic Regression, MARS, LOESS
• Unsupervised (Data Driven): discover an internal representation
  – Clustering: k-Means, k-Medoids, Fuzzy C-means, Hierarchical, SOM, Hidden Markov Model, Gaussian Mixture
  – Dimension Reduction: PCA, LDA
• Reinforcement (algorithms learn to react to an environment)
  – Decision Process
  – Reward System
  – Recommendation Systems

(4)

Comparison with Traditional Programming

• Traditional Programming: Data + Program → Computer → Output
• Machine Learning: Data + Output → Computer → Program

(5)

• Webster's definition of “to learn”:
  “To gain knowledge or understanding of, or skill in, by study, instruction or experience”
• Learning a set of new facts
• Learning HOW to do something
• Improving ability of something already learned

(6)

• There is no need to “learn” to calculate payroll
• Learning is used when:
  – Human expertise does not exist
  – Humans are unable to explain their expertise (speech recognition)
  – Solution changes in time (routing on a computer network)
  – Solution needs to be adapted to particular cases (user biometrics)

(7)

• Examples
  – Walking (motor skills)
  – Riding a bike (motor skills)
  – Telephone number (memorizing)
  – Playing backgammon (strategy)
  – Develop scientific theory (abstraction)
  – Language
  – Recognize fraudulent credit card transactions
  – Etc.

(8)

(9)

• Supervised learning is the machine learning task of inferring a function from supervised training data.
• The training data consist of a set of training examples.
• In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

(10)

A supervised learning algorithm analyzes the training data and produces an inferred function:

• Classifier
  – If the output is discrete.
OR
• Regression function
  – If the output is continuous.

(11)

Given a set of training examples of the form:

$\{(x_1, y_1), \ldots, (x_N, y_N)\}$

a learning algorithm seeks a function

$g : X \to Y$

where X is the input space and Y is the output space and the function g is an element of some space of possible functions G, usually called the hypothesis space.
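The search for g inside a hypothesis space G can be sketched in a few lines. This is a minimal illustration only: the training pairs are hypothetical toy data, and the hypothesis space (one-dimensional threshold classifiers) is an assumption chosen for brevity, not something prescribed by the slides.

```python
from typing import Callable, List, Tuple

# Hypothetical toy training set {(x_1, y_1), ..., (x_N, y_N)}:
# X is the real line, Y = {0, 1}.
training_examples: List[Tuple[float, int]] = [
    (1.0, 0), (2.0, 0), (3.5, 0), (4.0, 1), (5.5, 1), (7.0, 1),
]


def make_threshold_classifier(t: float) -> Callable[[float], int]:
    """One member g of the assumed hypothesis space G: g(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0


def training_errors(g: Callable[[float], int],
                    examples: List[Tuple[float, int]]) -> int:
    """Number of training examples on which g(x) disagrees with y."""
    return sum(1 for x, y in examples if g(x) != y)


# Assumed finite hypothesis space: one threshold per observed x value.
candidate_thresholds = [x for x, _ in training_examples]

# "Learning" here is simply picking the g in G with the fewest training errors.
best_threshold = min(
    candidate_thresholds,
    key=lambda t: training_errors(make_threshold_classifier(t), training_examples),
)
g = make_threshold_classifier(best_threshold)

print("chosen threshold:", best_threshold)
print("training errors:", training_errors(g, training_examples))
```

Real learning algorithms search much larger hypothesis spaces (trees, linear functions, networks), but the structure is the same: a set of candidate functions and a criterion for choosing among them.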

(12)

(13)

Classification: Definition

• Given a collection of records (training set)
• Each record contains a set of attributes, one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model.

(14)

[Figure: learning step. Training data (attribute vectors with class labels) → Classification Algorithm → Classification rules]

(15)

[Figure: classification step]

• Test data → Classification rules → Estimate classifier accuracy (to avoid overfitting): % of test set tuples correctly classified
• New data → Classification rules → Predict classification of new data, e.g. (Fahim Anwar, youth, medium), Loan decision? → Risky
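The accuracy estimate described here (percentage of test set tuples correctly classified) can be sketched as follows. The rule inside `classify_loan` and the test tuples are hypothetical stand-ins for a learned model and a real held-out test set; only the accuracy computation itself is the point.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def classify_loan(record: Record) -> str:
    """Hypothetical classification rules, standing in for a learned model."""
    if record["age"] == "youth" and record["income"] in ("low", "medium"):
        return "risky"
    return "safe"


def accuracy(classifier: Callable[[Record], str],
             test_set: List[Tuple[Record, str]]) -> float:
    """Percentage of test set tuples whose predicted class equals the known class."""
    correct = sum(1 for record, label in test_set if classifier(record) == label)
    return 100.0 * correct / len(test_set)


# Hypothetical labelled test tuples, kept separate from the training data.
test_set: List[Tuple[Record, str]] = [
    ({"age": "youth", "income": "low"}, "risky"),
    ({"age": "youth", "income": "high"}, "safe"),
    ({"age": "senior", "income": "low"}, "risky"),
    ({"age": "middle_aged", "income": "high"}, "safe"),
]

test_accuracy = accuracy(classify_loan, test_set)
print(f"accuracy on test set: {test_accuracy:.1f}%")

# Once the accuracy is acceptable, the rules are applied to new, unlabelled data.
new_applicant = {"age": "youth", "income": "medium"}
prediction = classify_loan(new_applicant)
print("loan decision:", prediction)
```

Measuring accuracy on tuples that were not used for training is what makes the estimate meaningful; accuracy on the training data alone tends to be optimistic.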

(16)

Examples of Classification Task Applications

• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Categorizing news stories as finance, weather, etc.

(17)

• Back propagation
• Bayesian classifiers
• Decision trees
• Density estimation methods
• Fuzzy set theory
• Linear discriminant analysis (LDA)
• Logistic regression
• Naive Bayes classifier
• Nearest neighbor classification
• Neural networks
• Quadratic discriminant analysis (QDA)
• Support vector machine
• many more…

(18)

(19)

Definition

• A decision tree is a classifier in the form of a tree structure
  – Decision node: specifies a test on a single attribute
  – Leaf node: indicates the value of the target attribute
  – Arc/edge: one outcome of the split on an attribute
  – Path: a conjunction of tests leading to the final decision

(20)

• Attribute-value description: the object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold).
• Predefined classes (target values): the target function has discrete output values (single or multiclass)
• Sufficient data: enough training cases should be provided to learn the model.

(21)

• Consider an example of playing tennis
• Attributes (features)
  – Outlook, temp, humidity, wind
• Values
  – Description of features
  – E.g.: Outlook values: sunny, cloudy, overcast
• Target
  – Play
  – Represents the output of the model
• Instances
  – Examples D1 to D14 of the dataset
• Concept
  – Learn to decide whether to play tennis, i.e. find h from the given data set

Dataset columns: Day, Outlook, Temp, Humidity, Wind, Play

(22)

Shall we play tennis today?

(23)

• Fast to implement
• Simple to implement because it performs classification without much computation
• Can convert the result to a set of easily interpretable rules that can be used in knowledge systems such as databases, where rules are built from the labels of the nodes and the labels of the arcs.
• Can handle continuous and categorical variables
• Can handle noisy data
• Provides a clear indication of which fields are most important for prediction or classification

(24)

• "Univariate" splits: partitioning uses only one attribute at a time, which limits the types of possible trees
• Large decision trees may be hard to understand
• Performs poorly with many classes and little data.

(25)

ID3 Algorithm

Other Decision Tree Learning Algorithms

• C4.5
• CART (Classification and Regression Tree)
• CHAID (CHi-squared Automatic Interaction Detection)
• MARS
• QUEST (Quick, Unbiased, Efficient, Statistical Tree)
• SLIQ
• SPRINT

(26)

• Top-down tree construction schema:
  – Examine the training database and find the best splitting predicate for the root node
  – Partition the training database
  – Recurse on each child node

(27)

• Attribute selection measure: a heuristic for selecting the splitting criterion that “best” splits a given data partition into smaller, mutually exclusive classes
• Attributes are ranked according to a measure
  – the attribute having the best score is chosen as the splitting attribute
  – split-point for continuous attributes
  – splitting subset for discrete attributes with binary trees
• Measures: Information Gain, Gain Ratio, Gini Index

(28)

Information Gain

• Based on Shannon's information theory
• Goal is to minimize the expected number of tests needed to classify a tuple
  – guarantees that a simple tree is found
• The attribute with the highest information gain is chosen as the splitting attribute
  – minimizes the information needed to classify tuples in the resulting partitions
  – reflects the least “impurity” in the resulting partitions
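For reference, the standard definitions behind these statements can be written out as follows. The notation is an assumption made for this note: D is the set of training tuples, p_i the fraction of tuples in D belonging to class C_i (m classes), and D_1, ..., D_v the partitions produced by splitting D on attribute A.

```latex
% Expected information (entropy) needed to classify a tuple in D
\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i

% Expected information still needed after partitioning D on attribute A
\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)

% Information gained by splitting on A
\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)
```

A pure partition (all tuples in one class) has Info = 0, so an attribute whose partitions are all pure attains the largest possible gain, Info(D).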

(29)

(30)

(31)

(32)

RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle aged  medium  no       excellent      yes
13   middle aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

C₁ = Yes = 9
C₂ = No = 5

(33)


(34)


(35)

3- Compute Gain(age) = 0.940 - 0.694 = 0.246 bits
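The two numbers being subtracted can be reconstructed from the class counts in the table (9 yes, 5 no), assuming the standard entropy definition. The per-value counts for age are read off the table: youth (2 yes, 3 no), middle aged (4 yes, 0 no), senior (3 yes, 2 no).

```latex
% Step 1: expected information for the whole data set
\mathrm{Info}(D) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \text{ bits}

% Step 2: expected information after splitting on age
\mathrm{Info}_{age}(D) = \tfrac{5}{14}\,I(2,3) + \tfrac{4}{14}\,I(4,0) + \tfrac{5}{14}\,I(3,2)
                       = \tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.971) \approx 0.694 \text{ bits}

% Step 3: information gain
\mathrm{Gain}(age) = 0.940 - 0.694 = 0.246 \text{ bits}
```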

(36)

Gain(age) = 0.246 bits
Gain(income) = 0.029 bits
Gain(student) = 0.151 bits
Gain(credit_rating) = 0.048 bits

age has the highest information gain, so it is chosen as the splitting attribute at the root
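These gains can be checked with a short script. The sketch below assumes the standard entropy-based definition of information gain and encodes the 14 tuples of the buys_computer table; the function and variable names are illustrative.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# The 14 training tuples (RID 1..14); the last field is the class buys_computer.
ROWS = """
youth high no fair no
youth high no excellent no
middle_aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle_aged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle_aged medium no excellent yes
middle_aged high yes fair yes
senior medium no excellent no
"""
DATA = [line.split() for line in ROWS.strip().splitlines()]


def info(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows, attr_index):
    """Information gain obtained by partitioning rows on one attribute."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[-1])
    expected = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info([row[-1] for row in rows]) - expected


gains = {name: gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)

for name, value in gains.items():
    print(f"Gain({name}) = {value:.3f} bits")
print("splitting attribute:", best_attribute)
```

The slide values are computed from rounded intermediate results, so the script's output may differ from them in the third decimal place; the ranking of the attributes is the same.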

(37)

(38)

(39)

Decision Tree Induction - Attribute Selection Measures

• The same process of splitting has to happen for the two remaining branches.
• For branch age<=30 (youth) we still have the attributes income, student and credit_rating.
• The expected information (entropy) of this partition is I(2 Yes, 3 No) = I(2,3) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97
• For income we have three values: high (0 yes and 2 no), medium (1 yes and 1 no) and low (1 yes and 0 no)
• Entropy(income) = 2/5 (0) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) + 1/5 (0) = 2/5 (1) = 0.4
• Gain(income) = 0.97 - 0.4 = 0.57

(40)

Decision Tree Induction - Attribute Selection Measures

• For student we have two values: yes (2 yes and 0 no) and no (0 yes and 3 no)
• Entropy(student) = 2/5 (0) + 3/5 (0) = 0
• Gain(student) = 0.97 - 0 = 0.97
• We can then safely split on attribute student without checking the other attributes, since the information gain is already at its maximum (0.97)

(41)


(42)

(43)

Decision Tree Induction - Attribute Selection Measures

• Again the same process is needed for the other branch of age (senior).
• The expected information (entropy) of this partition is I(3 Yes, 2 No) = I(3,2) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97
• For income we have two values: medium (2 yes and 1 no) and low (1 yes and 1 no)
• Entropy(income) = 3/5 (-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) = 0.95
• Gain(income) = 0.97 - 0.95 = 0.02

(44)

Decision Tree Induction - Attribute Selection Measures

• For student we have two values: yes (2 yes and 1 no) and no (1 yes and 1 no)
• Entropy(student) = 3/5 (-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) = 0.95
• Gain(student) = 0.97 - 0.95 = 0.02
• For credit_rating we have two values: fair (3 yes and 0 no) and excellent (0 yes and 2 no)
• Entropy(credit_rating) = 0
• Gain(credit_rating) = 0.97 - 0 = 0.97

(45)

(46)

Decision Tree Induction - Attribute Selection Measures

• We then split based on credit_rating.
• These splits give partitions each with records from the same class
• We make these into leaf nodes with their class label attached

New example: age<=30, income=medium, student=yes, credit_rating=fair
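The whole induction process, including the classification of the new example, can be sketched as a small recursive program. This is a minimal ID3-style sketch under stated assumptions: information gain as the selection measure, categorical attributes only, no pruning, and age<=30 encoded as "youth" as in the training table. Function names are illustrative.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]
ROWS = """
youth high no fair no
youth high no excellent no
middle_aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle_aged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle_aged medium no excellent yes
middle_aged high yes fair yes
senior medium no excellent no
"""
DATA = [dict(zip(ATTRIBUTES + ["class"], line.split()))
        for line in ROWS.strip().splitlines()]


def info(rows):
    """Entropy (in bits) of the class distribution in rows."""
    counts = Counter(r["class"] for r in rows)
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())


def gain(rows, attr):
    """Information gain of splitting rows on attr."""
    expected = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        expected += len(subset) / len(rows) * info(subset)
    return info(rows) - expected


def build_tree(rows, attributes):
    """Top-down construction: pick the best split, partition, recurse."""
    classes = Counter(r["class"] for r in rows)
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]  # leaf node: class label
    best = max(attributes, key=lambda a: gain(rows, a))
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree([r for r in rows if r[best] == value], remaining)
                   for value in sorted({r[best] for r in rows})}}


def classify(tree, example):
    """Follow the path selected by the example's attribute values down to a leaf."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][example[attribute]]
    return tree


tree = build_tree(DATA, ATTRIBUTES)
new_example = {"age": "youth", "income": "medium",
               "student": "yes", "credit_rating": "fair"}
prediction = classify(tree, new_example)
print(tree)
print("buys_computer =", prediction)
```

The resulting tree splits on age at the root, on student under youth and on credit_rating under senior, matching the hand calculation. Note that `classify` raises a `KeyError` for attribute values never seen in training; a production implementation would need a fallback such as the majority class.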

(47)

(48)

• The tree may overfit the training data, reflecting dataset anomalies and outliers
• Pruning removes the least reliable branches
  – the DT becomes less complex
• Prepruning: statistically assess the goodness of a split before it takes place
  – hard to choose thresholds for statistical significance
• Postpruning: remove sub-trees from already constructed trees

(49)

(50)


(51)

R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)

$\mathrm{coverage}(R1) = \frac{2}{14} = 14.28\%$

$\mathrm{accuracy}(R1) = \frac{2}{2} = 100\%$

X: (age = youth, income = medium, student = yes, credit_rating = fair)
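Coverage and accuracy of a rule can be computed directly from the training table. The sketch below assumes the usual definitions: coverage is the fraction of all tuples that satisfy the rule's conditions, and accuracy is the fraction of those covered tuples that the rule classifies correctly. The dictionary encoding of the rule is an illustrative choice.

```python
ATTRIBUTES = ["age", "income", "student", "credit_rating", "buys_computer"]
ROWS = """
youth high no fair no
youth high no excellent no
middle_aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle_aged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle_aged medium no excellent yes
middle_aged high yes fair yes
senior medium no excellent no
"""
DATA = [dict(zip(ATTRIBUTES, line.split())) for line in ROWS.strip().splitlines()]

# R1: (age = youth) AND (student = yes) => (buys_computer = yes)
rule_conditions = {"age": "youth", "student": "yes"}
rule_class = "yes"


def covers(conditions, row):
    """True if the row satisfies every attribute test in the rule antecedent."""
    return all(row[attr] == value for attr, value in conditions.items())


covered = [row for row in DATA if covers(rule_conditions, row)]
correct = [row for row in covered if row["buys_computer"] == rule_class]

coverage = len(covered) / len(DATA)     # n_covers / |D|
accuracy = len(correct) / len(covered)  # n_correct / n_covers

print(f"coverage(R1) = {len(covered)}/{len(DATA)} = {coverage:.2%}")
print(f"accuracy(R1) = {len(correct)}/{len(covered)} = {accuracy:.0%}")
```

The tuple X listed above satisfies both conditions of R1, so the rule fires for it and predicts buys_computer = yes.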

(52)

Rule Extraction from a Decision Tree

• Rules are easier to understand than large trees.
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction (ANDed); the leaf holds the class prediction (THEN)
• Rules are mutually exclusive and exhaustive

Example: Rule extraction from our buys_computer decision tree

R1: IF age = youth AND student = no THEN buys_computer = no
R2: IF age = youth AND student = yes THEN buys_computer = yes
R3: IF age = middle aged THEN buys_computer = yes
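Rule extraction amounts to enumerating every root-to-leaf path. The sketch below assumes the tree is stored as nested dictionaries of the form {attribute: {value: subtree_or_label}}; the tree literal follows the splits derived in the worked example (age at the root, then student for youth and credit_rating for senior). The representation and function names are illustrative choices.

```python
# Decision tree from the worked example, as nested dictionaries (assumed encoding).
TREE = {"age": {
    "youth": {"student": {"no": "no", "yes": "yes"}},
    "middle aged": "yes",
    "senior": {"credit_rating": {"fair": "yes", "excellent": "no"}},
}}


def extract_rules(tree, conditions=()):
    """Return one (conditions, class_label) pair per root-to-leaf path."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]  # reached a leaf: emit the rule
    rules = []
    (attribute, branches), = tree.items()  # each internal node tests one attribute
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules


def format_rule(conditions, label, target="buys_computer"):
    """Render a rule as IF ... AND ... THEN target = label."""
    antecedent = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    return f"IF {antecedent} THEN {target} = {label}"


rules = extract_rules(TREE)
formatted = [format_rule(conditions, label) for conditions, label in rules]
for number, text in enumerate(formatted, start=1):
    print(f"R{number}: {text}")
```

Because every tuple follows exactly one path through the tree, the extracted rules are mutually exclusive and exhaustive, as stated above. The first three printed rules correspond to R1 to R3; the remaining two come from the senior branch.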

(53)

(54)

What Is Prediction?

• Prediction is similar to classification
  – First, construct a model
  – Second, use the model to predict an unknown value
• The major method for prediction is regression
  – Linear and multiple regression
  – Non-linear regression
• Prediction is different from classification
  – Classification predicts a categorical class label
  – Prediction models continuous-valued functions

(55)

Linear and Multiple Regression Analysis

• Linear regression: $Y = \alpha + \beta X$
  – Two parameters, $\alpha$ and $\beta$, specify the line and are to be estimated by using the data at hand.
  – They are estimated by applying the least squares criterion to the known values of $Y_1, Y_2, \ldots$ and $X_1, X_2, \ldots$
• Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2$
  – Many nonlinear functions can be transformed into the above.
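For the single-predictor case, the least squares criterion has a well-known closed-form solution. In the notation assumed here, α is the intercept and β the slope of the line, and x̄ and ȳ denote the sample means of the N observed pairs (x_i, y_i).

```latex
% Least squares criterion: choose alpha and beta to minimize the sum of squared residuals
\min_{\alpha,\beta} \sum_{i=1}^{N} \bigl(y_i - \alpha - \beta x_i\bigr)^2

% Closed-form estimates
\beta = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2},
\qquad
\alpha = \bar{y} - \beta\,\bar{x}
```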

(56)

Example: Linear Regression Analysis

• Given data

X (Years experience)   Y (Salary)
3                      30K
8                      57K
9                      64K
13                     75K
3                      36K
6                      43K
11                     59K
21                     90K
1                      20K
16                     83K

y = 23.6 + 3.5 x
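The fitted line can be reproduced approximately with the closed-form least squares estimates. The sketch below uses only the ten (years, salary) pairs from the table, with salaries in thousands; variable names are illustrative.

```python
# Years of experience (x) and salary in thousands (y), from the table.
years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 75, 36, 43, 59, 90, 20, 83]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(salary) / n

# Least squares estimates:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
#   intercept = mean_y - slope * mean_x
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, salary))
s_xx = sum((x - mean_x) ** 2 for x in years)

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(f"y = {intercept:.2f} + {slope:.2f} x")

# Using the fitted line to predict the salary for 10 years of experience.
predicted_10_years = intercept + slope * 10
print(f"predicted salary for 10 years: {predicted_10_years:.1f}K")
```

On this data the unrounded estimates come out at roughly 3.57 for the slope and 23.2 for the intercept, so the line y = 23.6 + 3.5 x on the slide should be read as an approximation with rounded coefficients.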

(57)

X (Years experience)   Y (Salary)   Y (Predicted) [Y = 23.6 + 3.5X]   abs(Y-Yp)   Sqr(Y-Yp)
3                      30K          34.1K                             4.1         16.81
8                      57K          51.6K                             5.4         29.16
9                      64K          55.1K                             8.9         79.21
13                     75K          69.1K                             5.9         34.81
3                      36K          34.1K                             1.9         3.61
6                      43K          44.6K                             1.6         2.56
11                     59K          62.1K                             3.1         9.61
21                     90K          97.1K                             7.1         50.41
1                      20K          27.1K                             7.1         50.41
16                     83K          79.6K                             3.4         11.56

Mean Absolute Error = 4.85
Root Mean Squared Error = 5.37
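Both error measures follow directly from the predicted values. The sketch below assumes the line y = 23.6 + 3.5 x from the slide and the ten data points of the example; MAE is the mean of |Y - Yp| and RMSE is the square root of the mean of (Y - Yp)^2.

```python
from math import sqrt

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 75, 36, 43, 59, 90, 20, 83]  # in thousands

# Predictions from the fitted line on the slide.
predicted = [23.6 + 3.5 * x for x in years]

abs_errors = [abs(y - yp) for y, yp in zip(salary, predicted)]
sq_errors = [(y - yp) ** 2 for y, yp in zip(salary, predicted)]

mae = sum(abs_errors) / len(abs_errors)       # Mean Absolute Error
rmse = sqrt(sum(sq_errors) / len(sq_errors))  # Root Mean Squared Error

print(f"MAE  = {mae:.2f}")
print(f"RMSE = {rmse:.2f}")
```

RMSE is never smaller than MAE because squaring gives more weight to the larger errors, such as the 8.9 error at 9 years of experience.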

(58)

Classification vs Regression

Classification
• Classification means to group the output into a class.
• Example: classification to predict the type of tumor, i.e. harmful or not harmful, using training data
• If the output is a discrete/categorical variable, then it is a classification problem

Regression
• Regression means to predict the output value using training data.
• Example: regression to predict the house price from training data
• If the output is a real number/continuous variable, then it is a regression problem

(59)

1. Compare traditional programming with machine learning.

2. What is learning? Why learn?

3. What is supervised learning? What are supervised learning algorithms?

4. What is a classifier? Explain with an example.

5. What is a decision tree? What are the pros and cons of decision trees?

6. What are attribute selection measures?

7. What are the criteria to prune a tree?

8. How are rules extracted from a decision tree? Give an example.

9. How is the goodness of a rule measured?

10. What is prediction? How is it different from classification?

(60)