Lecture 07
Outline
• Learning
• Supervised Learning
– Classification
– Prediction
Hi Humans !! Hi Machine
Machine Learning
• Supervised (Task Driven): develop predictive model based on both input and output data
– Classification: Decision Tree, NN, Naïve Bayes, KNN, SVM, Discriminant Analysis, Ensemble Methods, Random Forest
– Regression: Ordinary LSR, Linear Regression, Logistic Regression, MARS, LOESS
• Unsupervised (Data Driven): discover an internal representation
– Clustering: k-Means, k-Medoids, Fuzzy C-means, Hierarchical, SOM, Hidden Markov Model, Gaussian Mixture
– Dimension Reduction: PCA, LDA
• Reinforcement (algorithms learn to react to an environment)
– Decision Process
– Reward System
– Recommendation Systems
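Comparison with Traditional Programming
[Figure: two block diagrams. Traditional Programming: Data and Program go into the Computer, which produces Output. Machine Learning: Data and Output go into the Computer, which produces a Program.]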
• Webster's definition of “to learn”:
“To gain knowledge or understanding of, or skill in, by study, instruction or experience”
• Learning a set of new facts
• Learning HOW to do something
• Improving ability of something already learned
• There is no need to “learn” to calculate payroll
• Learning is used when:
– Human expertise does not exist
– Humans are unable to explain their expertise (speech recognition)
– Solution changes in time (routing on a computer network)
– Solution needs to be adapted to particular cases (user biometrics)
• Examples
– Walking (motor skills)
– Riding a bike (motor skills)
– Telephone number (memorizing)
– Playing backgammon (strategy)
– Develop scientific theory (abstraction)
– Language
– Recognize fraudulent credit card transactions
– Etc.
• Supervised learning is the machine learning task of inferring a function from supervised (labeled) training data.
• The training data consist of a set of training examples.
• In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
• A supervised learning algorithm analyzes the training data and produces an inferred function, which is called:
– a classifier, if the output is discrete, or
– a regression function, if the output is continuous.
Given a set of training examples of the form
$\{(x_1, y_1), \ldots, (x_N, y_N)\}$,
a learning algorithm seeks a function
$g : X \to Y$,
where $X$ is the input space and $Y$ is the output space. The function $g$ is an element of some space of possible functions $G$, usually called the hypothesis space.
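As a minimal sketch of this setup, the code below learns a function g from a handful of (x, y) pairs. The learner (1-nearest-neighbour), the function name `fit_1nn`, and the toy data are all assumptions chosen for illustration; the lecture does not prescribe them. The hypothesis space G here is the set of all functions that copy the label of the closest training input.

```python
from math import dist

def fit_1nn(training_examples):
    """Return g: X -> Y that copies the label of the closest training input."""
    def g(x):
        nearest_x, nearest_y = min(training_examples,
                                   key=lambda pair: dist(pair[0], x))
        return nearest_y
    return g

# Hypothetical training set {(x_1, y_1), ..., (x_N, y_N)} with 2-D inputs.
training_examples = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((5.0, 5.0), "high"),
    ((6.0, 4.5), "high"),
]

g = fit_1nn(training_examples)
print(g((1.2, 1.4)))  # label of the nearest training input
print(g((5.5, 5.0)))
```

Because the labels are discrete, the inferred g is a classifier; with numeric labels the same pairing of inputs and outputs would give a regression function.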
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model.
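The role of the test set can be sketched in a few lines: accuracy is the fraction of held-out records whose predicted class matches the true class. The `model` function and the records below are hypothetical stand-ins, not data from the lecture.

```python
def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, true_class in test_set
                  if model(attributes) == true_class)
    return correct / len(test_set)

# Hypothetical model: flag a loan applicant as "risky" when income is low.
def model(attributes):
    return "risky" if attributes["income"] == "low" else "safe"

# Hypothetical held-out test set of (attributes, class) records.
test_set = [
    ({"age": "youth", "income": "low"}, "risky"),
    ({"age": "senior", "income": "low"}, "safe"),
    ({"age": "middle_aged", "income": "high"}, "safe"),
    ({"age": "youth", "income": "medium"}, "safe"),
]

print(f"{100 * accuracy(model, test_set):.0f}% of test records correctly classified")
```

The test records must not be used for training; otherwise the estimate is optimistic and hides overfitting.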
[Figure: classification as a two-step process.
Step 1: training data (attribute vectors with class labels) are fed to a classification algorithm, which produces classification rules.
Step 2: test data are used to estimate classifier accuracy (% of test set tuples correctly classified, to avoid overfitting); the classification rules then predict the classification of new data, e.g. (Fahim Anwar, youth, medium), Loan decision? → Risky.]
Examples of Classification Task
Applications
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Categorizing news stories as finance, weather,
• Back propagation
• Bayesian classifiers
• Decision trees
• Density estimation methods
• Fuzzy set theory
• Linear discriminant analysis (LDA)
• Logistic regression
• Naive Bayes classifier
• Nearest neighbor classification
• Neural networks
• Quadratic discriminant analysis (QDA)
• Support vector machine
• many more…
• A decision tree is a classifier in the form of a tree structure
– Decision node: specifies a test on a single attribute
– Leaf node: indicates the value of the target attribute
– Arc/edge: one outcome of the split on an attribute
– Path: a conjunction of tests that leads to the final decision
Definition
• Attribute-value description: the object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold).
• Predefined classes (target values): the target function has discrete output values (single or multiclass)
• Sufficient data: enough training cases should be provided to learn the model.
• Consider an example of playing tennis
• Attributes (features)
– Outlook, temp, humidity, wind
• Values
– Description of features
– E.g. Outlook values: sunny, cloudy, overcast
• Target
– Play
– Represents the output of the model
• Instances
– Examples D1 to D14 of the dataset
• Concept
– Learn to decide whether to play tennis, i.e. find h from the given data set
[Table: play-tennis dataset with columns Day, Outlook, Temp, Humidity, Wind, Play]
Shall we play tennis today?
• Fast to implement
• Simple to implement because it performs classification without much computation
• Can convert the result to a set of easily interpretable rules that can be used in a knowledge system such as a database, where rules are built from the labels of the nodes and the labels of the arcs.
• Can handle continuous and categorical variables
• Can handle noisy data
• Provide a clear indication of which fields are most important for prediction or classification
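The conversion of a tree into rules can be sketched as below: each root-to-leaf path becomes one IF-THEN rule built from node labels (attributes) and arc labels (values). The nested-dict representation and the name `extract_rules` are assumptions made for this sketch; the tree itself is the buys_computer tree that the worked example later in this lecture arrives at.

```python
# Nested-dict tree: {attribute: {arc label: subtree or class label}}.
tree = {"age": {
    "youth": {"student": {"no": "no", "yes": "yes"}},
    "middle_aged": "yes",
    "senior": {"credit_rating": {"fair": "yes", "excellent": "no"}},
}}

def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if not isinstance(node, dict):            # leaf: holds the class label
        tests = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {tests} THEN buys_computer = {node}"]
    (attribute, branches), = node.items()     # decision node tests one attribute
    rules = []
    for value, child in branches.items():
        rules.extend(extract_rules(child, conditions + ((attribute, value),)))
    return rules

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Each rule is a conjunction of the tests along one path, so the rule set has exactly as many rules as the tree has leaves.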
• "Univariate" splits/partitioning use only one attribute at a time, which limits the types of possible trees
• Large decision trees may be hard to understand
• Perform poorly with many classes and little data.
ID3 Algorithm
Other Decision Tree Learning Algorithms
• C4.5
• CART (Classification and Regression Tree)
• CHAID (CHi-squared Automatic Interaction Detection)
• MARS
• QUEST (Quick, Unbiased, Efficient, Statistical Tree)
• SLIQ
• SPRINT
• Top-down tree construction schema:
– Examine the training database and find the best splitting predicate for the root node
– Partition the training database
– Recurse on each child node
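The three steps of the schema map onto a short recursive function. This is a sketch under assumptions: records are dicts, the tree is a nested dict, and the attribute selection rule is passed in as `choose_attribute` (a hypothetical parameter) so that any measure can be plugged in. The stopping rule (pure partition or no attributes left, then majority class) is one common choice, not necessarily the lecture's.

```python
from collections import Counter

def build_tree(records, attributes, target, choose_attribute):
    """Top-down induction: pick a split, partition, recurse on each child."""
    classes = [r[target] for r in records]
    # Stop when the partition is pure or no attribute is left to test.
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]      # leaf: majority class
    best = choose_attribute(records, attributes, target)  # best splitting predicate
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in sorted({r[best] for r in records}):
        subset = [r for r in records if r[best] == value]  # partition the data
        children[value] = build_tree(subset, remaining, target, choose_attribute)
    return {best: children}

# Hypothetical toy data and a placeholder selection rule (first attribute).
records = [
    {"wind": "weak", "play": "yes"},
    {"wind": "weak", "play": "yes"},
    {"wind": "strong", "play": "no"},
]
tree = build_tree(records, ["wind"], "play", lambda recs, attrs, tgt: attrs[0])
print(tree)
```

In ID3 the placeholder rule would be replaced by one that returns the attribute with the highest information gain.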
• Attribute selection measure: a heuristic for selecting the splitting criterion that “best” splits a given data partition into smaller mutually exclusive classes
• Attributes are ranked according to a measure
– attribute having the best score is chosen as the splitting attribute
– split-point for continuous attributes
– splitting subset for discrete attributes with binary trees
• Measures: Information Gain, Gain Ratio, Gini Index
Information Gain
• Based on Shannon’s information theory
• Goal is to minimize the expected number of tests needed to classify a tuple
– guarantees that a simple (but not necessarily the simplest) tree is found
• Attribute with the highest information gain is chosen as the splitting attribute
– minimizes the information needed to classify tuples in the resulting partitions
– reflects the least “impurity” in the resulting partitions
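For reference, the standard definitions behind these bullets can be written as follows. The symbols are assumptions of this note rather than notation taken from the slides: $D$ is the data partition, $p_i$ the fraction of tuples in $D$ belonging to class $i$ (of $m$ classes), and $D_1, \ldots, D_v$ the partitions produced by splitting on attribute $A$.

```latex
\begin{align*}
\mathrm{Info}(D)   &= -\sum_{i=1}^{m} p_i \log_2 p_i \\
\mathrm{Info}_A(D) &= \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) \\
\mathrm{Gain}(A)   &= \mathrm{Info}(D) - \mathrm{Info}_A(D)
\end{align*}
```

$\mathrm{Info}(D)$ is the entropy of the class distribution, $\mathrm{Info}_A(D)$ is the weighted entropy left after the split, and the gain is the reduction between the two, measured in bits.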
RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle aged  medium  no       excellent      yes
13   middle aged  high    yes      fair           yes
14   senior       medium  no       excellent      no
Class C1: buys_computer = yes (9 tuples)
Class C2: buys_computer = no (5 tuples)
3- compute
Gain(age) = 0.94 - 0.694 = 0.246 bits
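The two numbers in this step can be traced back to the class counts in the table. The derivation below is a reconstruction using the standard definitions; the names $\mathrm{Info}(D)$ and $\mathrm{Info}_{age}(D)$ are assumed here, and $I(a,b)$ denotes the entropy of a partition with $a$ yes and $b$ no tuples. The counts per age value are youth (2 yes, 3 no), middle aged (4 yes, 0 no) and senior (3 yes, 2 no).

```latex
\begin{align*}
\mathrm{Info}(D) &= I(9,5)
  = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14}
  \approx 0.940 \\
\mathrm{Info}_{age}(D) &= \tfrac{5}{14}\,I(2,3) + \tfrac{4}{14}\,I(4,0) + \tfrac{5}{14}\,I(3,2)
  \approx 0.694 \\
\mathrm{Gain}(age) &= \mathrm{Info}(D) - \mathrm{Info}_{age}(D)
  \approx 0.940 - 0.694 = 0.246
\end{align*}
```

The middle term vanishes because the middle aged partition contains only one class, so $I(4,0) = 0$.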
Gain(age) = 0.246 bits
Gain(income) = 0.029 bits
Gain(student) = 0.151 bits
Gain(credit_rating) = 0.048 bits
age has the highest information gain, so it is chosen as the splitting attribute at the root.
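The four gains can be recomputed directly from the 14 training tuples. The sketch below is one possible implementation; the function names `info` and `gain` and the `middle_aged` spelling (underscore added so each row splits cleanly on whitespace) are assumptions of this example. Computed at full precision, the last printed digit can differ by one from the slide values, which subtract already rounded intermediates.

```python
from collections import Counter
from math import log2

COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
ROWS = """youth high no fair no
youth high no excellent no
middle_aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle_aged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle_aged medium no excellent yes
middle_aged high yes fair yes
senior medium no excellent no"""
DATA = [dict(zip(COLUMNS, line.split())) for line in ROWS.splitlines()]

def info(records, target="buys_computer"):
    """Expected information (entropy, in bits) of the class distribution."""
    counts = Counter(r[target] for r in records)
    return -sum(c / len(records) * log2(c / len(records)) for c in counts.values())

def gain(records, attribute):
    """Information gain from partitioning the records on one attribute."""
    after = 0.0
    for value in {r[attribute] for r in records}:
        part = [r for r in records if r[attribute] == value]
        after += len(part) / len(records) * info(part)
    return info(records) - after

gains = {a: gain(DATA, a) for a in COLUMNS[:-1]}
for attribute, g in gains.items():
    print(f"Gain({attribute}) = {g:.3f} bits")
print("split on:", max(gains, key=gains.get))
```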
Decision Tree Induction - Attribute Selection
Measures
• The same process of splitting has to happen for the two remaining branches, youth and senior; the middle aged branch is already pure (4 yes, 0 no).
• For branch age<=30 (youth) we still have attributes income, student and credit_rating.
• The expected information (entropy) is I(2 Yes, 3 No) = I(2,3) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97
• For income we have three values: income high (0 yes and 2 no), income medium (1 yes and 1 no) and income low (1 yes and 0 no)
• Entropy(income) = 2/5 (0) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) + 1/5 (0) = 2/5 (1) = 0.4
• Gain(income) = 0.97 - 0.4 = 0.57
• For student we have two values: student yes (2 yes and 0 no) and student no (0 yes and 3 no)
• Entropy(student) = 2/5 (0) + 3/5 (0) = 0
• Gain(student) = 0.97 - 0 = 0.97
• We can then safely split on attribute student: its gain equals the full 0.97 bits of the branch, so no other attribute can score higher.
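The arithmetic for the youth branch can be checked from the yes/no counts alone. In this sketch the helper names `I` and `entropy_after_split` are assumptions; the income and student counts are the ones listed above, and the credit_rating counts (fair: 1 yes, 2 no; excellent: 1 yes, 1 no) are read off the youth rows of the training table.

```python
from math import log2

def I(*counts):
    """Expected information (entropy, in bits) of a tuple of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def entropy_after_split(partitions):
    """Weighted entropy of a split; partitions holds (yes, no) counts per value."""
    total = sum(yes + no for yes, no in partitions)
    return sum((yes + no) / total * I(yes, no) for yes, no in partitions)

# (yes, no) counts per attribute value within the youth branch.
youth = {
    "income": [(0, 2), (1, 1), (1, 0)],   # high, medium, low
    "student": [(2, 0), (0, 3)],          # yes, no
    "credit_rating": [(1, 2), (1, 1)],    # fair, excellent
}

before = I(2, 3)  # entropy of the branch: 2 yes, 3 no
gains = {a: before - entropy_after_split(p) for a, p in youth.items()}
for attribute, g in gains.items():
    print(f"Gain({attribute}) = {g:.2f}")
print("split on:", max(gains, key=gains.get))
```

The same two helpers apply unchanged to any other branch once its counts are tabulated.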
• Again the same process is needed for the other branch of age (senior).
• The expected information (entropy) is I(3 Yes, 2 No) = I(3,2) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97
• For income we have two values: income medium (2 yes and 1 no) and income low (1 yes and 1 no)
• Entropy(income) = 3/5 (-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) = 0.95
• Gain(income) = 0.97 - 0.95 = 0.02
• For student we have two values: student yes (2 yes and 1 no) and student no (1 yes and 1 no)
• Entropy(student) = 3/5 (-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) = 0.95
• Gain(student) = 0.97 - 0.95 = 0.02
• For credit_rating we have two values: credit_rating fair (3 yes and 0 no) and credit_rating excellent (0 yes and 2 no)
• Entropy(credit_rating) = 3/5 (0) + 2/5 (0) = 0
• Gain(credit_rating) = 0.97 - 0 = 0.97
• We then split based on credit_rating.
• These splits give partitions each with records from the same class
• Make these into leaf nodes with their class label attached
New example: age<=30, income=medium, student=yes, credit_rating=fair
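Classifying the new example means following one path from the root to a leaf. The sketch below encodes the tree derived in the steps above as a nested dict; that representation, the function name `classify`, and the mapping of age<=30 to the table's "youth" value are assumptions of this example.

```python
# Tree from the worked example: {attribute: {value: subtree or class label}}.
tree = {"age": {
    "youth": {"student": {"no": "no", "yes": "yes"}},
    "middle_aged": "yes",
    "senior": {"credit_rating": {"fair": "yes", "excellent": "no"}},
}}

def classify(node, example):
    """Follow the tests from the root until a leaf (class label) is reached."""
    while isinstance(node, dict):
        (attribute, branches), = node.items()   # the attribute tested here
        node = branches[example[attribute]]     # take the matching arc
    return node

# age<=30 corresponds to "youth" in the training table.
example = {"age": "youth", "income": "medium",
           "student": "yes", "credit_rating": "fair"}
print(classify(tree, example))  # yes
```

Only age and student are consulted for this example; income and credit_rating do not appear on its path.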
• The tree may overfit anomalies and outliers in the dataset
• Pruning removes the least reliable branches
– the DT becomes less complex
• Prepruning: statistically assess the goodness of a split before it takes place
– hard to choose thresholds for statistical significance
• Postpruning: remove sub-trees from already constructed trees