Introduction to Data Mining
Ming Li
Department of Computer Science and Technology Nanjing University
Spring 2015
Prediction
• Predictive modeling can be thought of as learning a mapping from an input instance x to a label y
• Prediction tasks:
– Classification: predicts categorical labels
– Regression: predicts numerical labels
– Ranking
Two step process of prediction (I)
• Step 1: Construct a model based on a training set
– the set of tuples used for model construction is called the training set
– the set of tuples can be called a sample (a single tuple can also be called a sample)
– a tuple is usually called an example (usually with the label) or an instance (usually without the label)
– the attribute to be predicted is called the label
Name  Rank            Years  Tenured
Mike  Assistant Prof  3      No
Mary  Assistant Prof  7      Yes
Bill  Professor       2      Yes
Jim   Associate Prof  7      Yes
Dave  Assistant Prof  6      No
Anne  Associate Prof  3      No
Training data (label: Tenured) → Learning algorithm → Prediction model
e.g., IF rank = professor OR years > 6 THEN tenured = yes
• Step 2: Use the model to predict unseen instances
– before using the model, we can estimate its accuracy on a test set
– the test set is different from the training set
– the desired output of a test instance is compared with the actual output of the model
– for classification, the accuracy is usually measured by the percentage of test instances that are correctly classified by the model
– for regression, the accuracy is usually measured by the mean squared error
Two step process of prediction (II)
Name     Rank            Years  Tenured
Tom      Assistant Prof  2      No
Merlisa  Associate Prof  7      No
George   Professor       5      Yes
Test data → Prediction model → accuracy
Unseen data → Prediction model → Tenured? Yes
PAC (Probably Approximately Correct):
There exists a sample size m such that, with probability at least 1 − δ, a model learned from m training examples has error at most ε
L. G. Valiant. A theory of the learnable. Communications of the ACM, 1984, 27(11): 1134-1142
Leslie Valiant (1949 - ), Harvard Univ., Turing Award 2010
Supervised vs. Unsupervised learning
• Supervised learning
- the training data are accompanied by labels indicating the desired outputs of the observations
- the concerned property of unseen data is predicted
- usually: classification, regression
• Unsupervised learning
- the labels of training data are unknown
- given a set of observations, the goal is to discover the inherent properties of the data, such as the existence of classes or clusters
How to evaluate prediction algorithms?
• Generalization
– the ability of the model to correctly predict unseen instances
• Speed
– the computational cost involved in generating and using the model
– training time cost vs. test time cost: usually, a larger training time cost but a smaller test time cost
• Robustness
– the ability of the model to deal with noise or missing values
• Scalability
– the ability of the model to deal with huge volumes of data
• Comprehensibility
How to evaluate generalization?
• Classification
– Accuracy
– Overall cost
– Precision, Recall, F1
– AUC (area under the ROC curve)
• Regression
– MSE (mean squared error)
• Ranking
– Ranking loss
– MAP (mean average precision)
– NDCG
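The classification and regression measures listed above can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the function names and the convention that the positive class is labeled 1 are assumptions made for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class treated as positive (assumed: 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted numerical labels."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 0]
    print(accuracy(y_true, y_pred))
    print(precision_recall_f1(y_true, y_pred))
    print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Accuracy alone can be misleading on imbalanced classes, which is one reason precision, recall and F1 are listed next to it.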
How to measure?
• Two widely used methods
– Hold-out:
• Randomly partition the data into two disjoint sets, one for training and one for testing (usually 2/3 vs. 1/3, or 75% vs. 25%)
• Repeat the process many times to achieve a good estimate
– Cross-validation:
• Randomly partition the data into k disjoint sets of equal size
• In turn, choose one set for testing and the others for training, so training/testing is conducted k times
In most cases, we don’t have a test set at all! So we have to
leverage our provided data to measure
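A minimal sketch of k-fold cross-validation as described above. The names `k_fold_splits`, `cross_validate`, `train_fn` and `error_fn` are hypothetical, and the folds differ in size by at most one when k does not divide the number of examples.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::k] for i in range(k)]     # k disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(examples, k, train_fn, error_fn, seed=0):
    """Average test error over k train/test rounds.

    train_fn(training_examples) -> model and error_fn(model, test_examples) -> float
    are supplied by the caller (hypothetical interfaces for this sketch).
    """
    errors = []
    for train_idx, test_idx in k_fold_splits(len(examples), k, seed):
        model = train_fn([examples[i] for i in train_idx])
        errors.append(error_fn(model, [examples[i] for i in test_idx]))
    return sum(errors) / len(errors)
```

Hold-out corresponds to using a single split, for example the first pair yielded by `k_fold_splits(n, 3)`, which reserves roughly one third of the data for testing.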
How good can a prediction model be?
• Perfect prediction is what we hope for. However, for some problems, no perfect prediction can be achieved if no knowledge other than the data is used
[Figure: two classes that are not separable; mistakes are inevitable in this case]
How good can a predictive model be?
• Bayes decision and Bayes error
Bayes error: no other classifier can achieve a lower expected error rate on unseen data. It is a lower bound on the error of the best classifier for this problem
No Free Lunch
Generally, no algorithm is consistently better than all other algorithms
The No Free Lunch Theorem states that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems
• D.H. Wolpert and W.G. Macready. No free lunch theorems for optimization. IEEE TEC, 1997, 1(1):67-82
• D.H. Wolpert and W.G. Macready. No free lunch theorems for search. Tech. Rep. SFI-TR-95-02-010, Santa Fe Institute, 1995.
Different algorithms usually have different pros and cons; therefore, it is important to know the strengths and weaknesses of an algorithm and when it should be used
Different types of classifiers
• Discriminative: aims to model how the data can be separated
– Models the decision boundary directly, e.g., perceptron, neural networks, decision trees, SVM, …
– Models the posterior class probabilities p(ck|x) directly, e.g., logistic regression, …
• Generative: aims to find the model that generates the data
– e.g., naïve Bayes, Bayesian networks, …
– A model assumption is required; a mismatch might lead to poor prediction
What is decision tree?
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf represents a class or class distribution
An example:
age?
  <30 → student?
    no → no
    yes → yes
  30-40 → yes
  >40 → credit_rating?
    excellent → yes
    fair → no
the topmost node in a tree is the root node
in order to classify an unseen instance, the attribute values of the instance are tested against the decision tree. A path is traced from the root to a leaf which holds the class prediction for the instance
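Tracing a path from the root to a leaf can be sketched as follows. The nested-dictionary encoding of the example tree is an assumption made for illustration, with leaves stored as plain class labels for buys_computer.

```python
# Example tree from the slide, encoded as nested dicts (hypothetical encoding):
# an internal node names the attribute it tests and maps each outcome to a subtree.
TREE = {
    "attribute": "age",
    "branches": {
        "<30": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "30-40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "yes", "fair": "no"},
        },
    },
}


def classify(tree, instance):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


if __name__ == "__main__":
    print(classify(TREE, {"age": "<30", "student": "yes", "credit_rating": "fair"}))
```

Only the attributes on the traced path are inspected, so an instance with age 30-40 is classified without looking at student or credit_rating.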
Brief history of decision tree (I)
• The first decision tree algorithm is CLS (Concept Learning System)
[E. B. Hunt, J. Marin, and P. T. Stone’s book “Experiments in Induction” published by Academic Press in 1966]
• The algorithm that raised interest in decision trees is ID3
[J. R. Quinlan’s paper in the book “Expert Systems in the Micro-Electronic Age” edited by D. Michie, published by Edinburgh University Press in 1979]
• The most popular decision tree algorithm is C4.5
[J. R. Quinlan’s book “C4.5: Programs for Machine Learning” published by Morgan Kaufmann in 1993]
J. Ross Quinlan
SIGKDD Innovation Award Winner (2011)
Brief history of decision tree (II)
• The most popular decision tree algorithm that can be used in regression is CART (Classification and Regression Tree)
[L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone’s book “Classification and Regression Trees” published by Wadsworth in 1984]
• The strongest decision tree-based learning algorithm is Random Forests, a tree ensemble algorithm
[L. Breiman’s MLJ’01 paper “Random Forests”]
Leo Breiman
SIGKDD Innovation Award Winner (2005)
How to construct a decision tree? (I)
Basic strategy:
• A tree is constructed in a top-down recursive divide-and-conquer manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are discretized in advance)
• Examples are partitioned recursively based on selected attributes
  a selected attribute is also called a split or a test
• Splits are selected based on a heuristic or statistical measure (e.g., information gain)
• The partitioning terminates if any of the following conditions is met:
- all examples falling into a node belong to the same class
  this node becomes a leaf whose label is the class
- no attribute can be used to further partition the data
  this node becomes a leaf whose label is the majority class of the examples falling into the node
- no instance falls into a node
  this node becomes a leaf whose label is the majority class of the examples falling into the parent of the node
How to construct a decision tree? (II)
Algorithm of the basic strategy (ID3):
generate_decision_tree (samples, attribute_list)
1) create a node N;
2) if samples are all of the same class, C, then
   return N as a leaf node labeled with class C;
3) if attribute_list is empty then
   return N as a leaf node labeled with the most common class in samples;
4) select test_attribute, the attribute among attribute_list with the highest information gain;
5) label node N with test_attribute;
6) for each known value ai of test_attribute
   1) grow a branch from node N for the condition test_attribute = ai;
   2) let si be the set of samples in samples for which test_attribute = ai;
   3) if si is empty then
      attach a leaf labeled with the most common class in samples;
   4) else attach the node returned by
      generate_decision_tree (si, attribute_list - test_attribute);
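The pseudocode above can be sketched in Python roughly as follows. This is an illustrative sketch, not Quinlan's implementation: samples are assumed to be dictionaries, `domains` (the known values of each attribute) and `label_key` are hypothetical parameters, and leaves are returned as plain class labels.

```python
import math
from collections import Counter


def entropy(labels):
    """Information needed to classify a list of class labels (in bits)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(samples, attribute, label_key):
    """Entropy of the labels minus the expected entropy after splitting on attribute."""
    labels = [s[label_key] for s in samples]
    expected = 0.0
    for value in set(s[attribute] for s in samples):
        subset = [s[label_key] for s in samples if s[attribute] == value]
        expected += len(subset) / len(samples) * entropy(subset)
    return entropy(labels) - expected


def majority_class(samples, label_key):
    return Counter(s[label_key] for s in samples).most_common(1)[0][0]


def generate_decision_tree(samples, attribute_list, domains, label_key="label"):
    classes = set(s[label_key] for s in samples)
    if len(classes) == 1:                       # step 2: all of the same class
        return classes.pop()
    if not attribute_list:                      # step 3: no attribute left
        return majority_class(samples, label_key)
    # step 4: attribute with the highest information gain
    test_attribute = max(
        attribute_list, key=lambda a: information_gain(samples, a, label_key)
    )
    remaining = [a for a in attribute_list if a != test_attribute]
    branches = {}
    for value in domains[test_attribute]:       # step 6: one branch per known value
        subset = [s for s in samples if s[test_attribute] == value]
        if not subset:                          # empty branch: majority class of parent
            branches[value] = majority_class(samples, label_key)
        else:
            branches[value] = generate_decision_tree(
                subset, remaining, domains, label_key
            )
    return {"attribute": test_attribute, "branches": branches}
```

Passing the attribute domains explicitly is a design choice of this sketch: it lets the empty-subset case in step 6 occur for values that are known but absent from the current samples.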
Splitting criteria: Information gain
S: training set
Si: training instances of class Ci (i = 1,…,m)
aj: values of attribute A (j = 1,…,v)
the information needed to correctly classify the training set is
$$I(S_1, \dots, S_m) = -\sum_{i=1}^{m} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$
suppose attribute A is selected to partition the training set into the subsets {S^A_1, S^A_2, …, S^A_v}; then the entropy of the subsets, i.e., the information needed to classify all the instances in those subsets, is
$$E(A) = \sum_{i=1}^{v} \frac{|S^A_i|}{|S|} \, I(S^A_{i1}, \dots, S^A_{im})$$
where S^A_ij is the set of instances of class Cj contained in S^A_i
then the information gain of selecting A is
$$Gain(A) = I(S_1, \dots, S_m) - E(A)$$
Splitting criteria: Example of information gain (I)
Target class: Graduate students (Σ=120)
gender  major        birth_country  age_range  gpa        count
M       Science      Canada         20-25      Very_good  16
F       Science      Foreign        25-30      Excellent  22
M       Engineering  Foreign        25-30      Excellent  18
F       Science      Foreign        25-30      Excellent  25
M       Science      Canada         20-25      Excellent  21
F       Engineering  Canada         20-25      Excellent  18
Contrasting class: Undergraduate students (Σ=130)
gender  major        birth_country  age_range  gpa        count
M       Science      Foreign        <20        Very_good  18
F       Business     Canada         <20        Fair       20
M       Business     Canada         <20        Fair       22
F       Science      Canada         20-25      Fair       24
M       Engineering  Foreign        20-25      Very_good  22
F       Engineering  Canada         <20        Excellent  24
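Splitting criteria: Example of information gain (II)
the information needed to correctly classify the training set is
$$I(S_1, S_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$$
suppose attribute major is selected to partition the training set
for major = “Science”: S11 = 84, S12 = 42
for major = “Engineering”: S21 = 36, S22 = 46
for major = “Business”: S31 = 0, S32 = 42
then the entropy of major is
$$E(major) = \frac{126}{250} I(84, 42) + \frac{82}{250} I(36, 46) + \frac{42}{250} I(0, 42) = 0.7873$$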
Splitting criteria: Example of information gain (III)
then the information gain of major is
Gain(major) = I(S1,S2) – E(major) = 0.2115
we can also get the information gain of other attributes:
Gain(gender) = 0.0003
Gain(birth_country) = 0.0407
Gain(gpa) = 0.4490
Gain(age_range) = 0.5971
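The gains in this example can be recomputed from the two count tables. The sketch below assumes the rows are stored as tuples with a trailing count; `ROWS`, `COLUMNS`, `info` and `gain` are names introduced only for this illustration.

```python
import math

# (class, gender, major, birth_country, age_range, gpa, count), copied from the tables
ROWS = [
    ("graduate", "M", "Science", "Canada", "20-25", "Very_good", 16),
    ("graduate", "F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("graduate", "M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("graduate", "F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("graduate", "M", "Science", "Canada", "20-25", "Excellent", 21),
    ("graduate", "F", "Engineering", "Canada", "20-25", "Excellent", 18),
    ("undergraduate", "M", "Science", "Foreign", "<20", "Very_good", 18),
    ("undergraduate", "F", "Business", "Canada", "<20", "Fair", 20),
    ("undergraduate", "M", "Business", "Canada", "<20", "Fair", 22),
    ("undergraduate", "F", "Science", "Canada", "20-25", "Fair", 24),
    ("undergraduate", "M", "Engineering", "Foreign", "20-25", "Very_good", 22),
    ("undergraduate", "F", "Engineering", "Canada", "<20", "Excellent", 24),
]
COLUMNS = {"gender": 1, "major": 2, "birth_country": 3, "age_range": 4, "gpa": 5}


def info(counts):
    """I(s1, ..., sm) for a list of per-class counts; zero counts contribute nothing."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain(attribute):
    """Information gain of splitting the 250 students on the given attribute."""
    col = COLUMNS[attribute]
    class_totals = {}
    per_value = {}
    for row in ROWS:
        cls, count = row[0], row[-1]
        class_totals[cls] = class_totals.get(cls, 0) + count
        by_class = per_value.setdefault(row[col], {})
        by_class[cls] = by_class.get(cls, 0) + count
    total = sum(class_totals.values())
    expected = sum(
        sum(by_class.values()) / total * info(list(by_class.values()))
        for by_class in per_value.values()
    )
    return info(list(class_totals.values())) - expected


if __name__ == "__main__":
    for name in COLUMNS:
        print(name, round(gain(name), 4))
```

Under this criterion age_range has the largest gain, so it would be chosen as the first split.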
Splitting criteria: Other kinds of split selection criteria (I)
• Gain ratio
a shortcoming of information gain is its bias towards attributes with many values.
In order to reduce the influence of this bias, J. R. Quinlan used gain ratio in C4.5 instead of information gain
$$Gain\_ratio(A) = \frac{Gain(A)}{IV(A)}, \qquad IV(A) = -\sum_{j=1}^{v} \frac{|S'_j|}{|S|} \log_2 \frac{|S'_j|}{|S|}$$
where attribute A is selected to partition the instance set into the subsets {S’1, S’2,…, S’v}
if IV(A) ≠ 0, J. R. Quinlan recommended selecting, from the attributes with an average-or-better Gain(A), the attribute A that maximizes Gain_ratio(A)
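A minimal sketch of the gain ratio, assuming a split is described by one list of per-class counts for each attribute value (a hypothetical representation). Returning 0.0 when IV is zero is a choice made for this sketch only; the slide merely says the ratio is used when IV ≠ 0.

```python
import math


def info(counts):
    """Entropy (in bits) of a list of counts; zero counts contribute nothing."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain_ratio(partition):
    """partition[j][i] = number of instances of class i in the j-th subset."""
    total = sum(sum(subset) for subset in partition)
    n_classes = len(partition[0])
    class_totals = [sum(subset[i] for subset in partition) for i in range(n_classes)]
    expected = sum(sum(subset) / total * info(subset) for subset in partition)
    gain = info(class_totals) - expected
    iv = info([sum(subset) for subset in partition])   # split information IV
    return gain / iv if iv > 0 else 0.0


if __name__ == "__main__":
    # The split on major from the information gain example:
    # Science (84, 42), Engineering (36, 46), Business (0, 42)
    print(gain_ratio([[84, 42], [36, 46], [0, 42]]))
```

Because IV grows with the number and evenness of the subsets, dividing by it penalizes attributes that fragment the data into many small partitions.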
Splitting criteria: Other kinds of split selection criteria (II)
• Gini index (used in CART)
S: training set
Si: training instances of class Ci (i = 1,…,m)
aj: values of attribute A (j = 1,…,v)
the gini of S is
$$gini(S) = 1 - \sum_{i=1}^{m} \left(\frac{|S_i|}{|S|}\right)^2$$
suppose attribute A is selected to partition the instance set into the subsets {S’1, S’2,…, S’v}, then the gini of A is
$$gini_A(S) = \sum_{j=1}^{v} \frac{|S'_j|}{|S|} \, gini(S'_j)$$
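The Gini index can be sketched with the same count-based representation as the other criteria (one list of per-class counts per subset, an assumption of this example). The attribute with the smallest weighted gini is the preferred split.

```python
def gini(counts):
    """Gini impurity of a set described by its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(partition):
    """Size-weighted gini of the subsets produced by a split."""
    total = sum(sum(subset) for subset in partition)
    return sum(sum(subset) / total * gini(subset) for subset in partition)


if __name__ == "__main__":
    print(gini([120, 130]))                            # whole student data set
    print(gini_split([[84, 42], [36, 46], [0, 42]]))   # after splitting on major
```

A pure subset such as Business (0, 42) has gini 0, so it contributes nothing to the weighted sum.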
Why pruning?
overfitting: the trained model fits the training set so closely that it deviates from the real distribution of the instance space
main reasons: finite training set; noise
when a decision tree is built, many branches may reflect anomalies in the training set due to noise or outliers
How to prune decision tree?
Two popular methods:
• Prepruning
Terminate tree construction early: do not split a node if the split would result in the goodness measure falling below a threshold
It is hard to choose an appropriate threshold
• Postpruning
Remove branches from a “fully grown” tree: progressively prune the tree as long as the goodness measure can be improved
The “goodness” is usually measured with the help of a validation set, which is a set of data different from the training data
In general, postpruning is more accurate than prepruning, yet requires more computation
Mapping decision tree to rules
A rule is created for each path from the root to a leaf: Each attribute-value pair along a given path forms a conjunction in the rule antecedent. The class label held by the leaf forms the rule consequent
Rules for the example tree on age, student and credit_rating:
IF age = “<30” AND student = no THEN buys_computer = no
IF age = “<30” AND student = yes THEN buys_computer = yes
IF age = “30-40” THEN buys_computer = yes
IF age = “>40” AND credit_rating = excellent THEN buys_computer = yes
IF age = “>40” AND credit_rating = fair THEN buys_computer = no
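The mapping from paths to rules can be sketched as a recursive walk. The nested-dictionary tree encoding and the function name `tree_to_rules` are assumptions made for this illustration; attribute values are printed without quotation marks.

```python
# The example tree, encoded as nested dicts (hypothetical encoding);
# leaves hold the value of the class attribute buys_computer.
TREE = {
    "attribute": "age",
    "branches": {
        "<30": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "30-40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "yes", "fair": "no"},
        },
    },
}


def tree_to_rules(tree, target="buys_computer", conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        # Leaf: the attribute-value pairs collected on the path form the antecedent.
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {antecedent} THEN {target} = {tree}"]
    rules = []
    for value, subtree in tree["branches"].items():
        path = conditions + ((tree["attribute"], value),)
        rules.extend(tree_to_rules(subtree, target, path))
    return rules


if __name__ == "__main__":
    for rule in tree_to_rules(TREE):
        print(rule)
```

The number of rules equals the number of leaves, five for this tree.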
Enhance basic decision tree algorithm (I)
• Allow for continuous-valued attributes
– a test on a continuous-valued attribute A results in two branches corresponding to A ≤ V and A > V
– given v values of A, (v-1) possible splits may be considered
• Handle missing attribute values
– assign the most common value of the attribute
– assign a probability to each of the possible values
• Incremental induction
– it is not good to generate the tree from scratch every time new instances arrive
– dynamically adjust the splits in the tree
• Attribute construction
Enhance basic decision tree algorithm (II)
Scalable decision tree algorithm:
most studies focus on improving the data structures
• SLIQ [Mehta et al., EDBT96]
– build an index for each attribute. Only class list and the current attribute list reside in memory
• SPRINT [J. Shafer et al., VLDB96]
– construct an attribute list. When a node is partitioned, the attribute list is also partitioned
• RainForest [J. Gehrke et al., VLDB98]
– build an AVC-list (attribute, value, class label)
– separate the scalability aspects from the criteria that determine the quality of the tree
What is neural network?
also called artificial neural network
neural networks are massively parallel interconnected networks of
simple (usually adaptive) elements and their hierarchical organizations
which are intended to interact with the objects of the real world in the
same way as biological nervous systems do
[T. Kohonen, NN88]
The basic components of a neural network: neurons and weights
[Figure: the M-P neuron model]
neurons are connected by weights
a neuron is also called a unit
the bias is also called the threshold
The knowledge learned by a neural network is encoded in the weights and biases
Perceptron
The model of the perceptron
Aims to find a hyperplane that separates the different classes
It can be learned by gradient descent (w ← w + Δw):
Repeat until convergence
  For each training example (xi, yi) do
    w ← w − η (w^T xi − yi) xi
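The update rule above can be sketched as follows. Several details are assumptions of this example rather than part of the slide: labels are coded as +1/−1, a constant first input of 1 plays the role of the bias, the class is read off the sign of w^T x, and "until convergence" is replaced by a fixed number of passes.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(examples, eta=0.05, epochs=500):
    """Apply w <- w - eta * (w^T x_i - y_i) * x_i to each example, repeatedly."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):                 # fixed number of passes (assumption)
        for x, y in examples:
            error = dot(w, x) - y
            w = [wi - eta * error * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Side of the hyperplane w^T x = 0 on which x falls."""
    return 1 if dot(w, x) >= 0 else -1


# Logical AND with a leading constant input of 1 for the bias; linearly separable.
AND_DATA = [
    ((1, 0, 0), -1),
    ((1, 0, 1), -1),
    ((1, 1, 0), -1),
    ((1, 1, 1), 1),
]

if __name__ == "__main__":
    w = train_perceptron(AND_DATA)
    print(w, [predict(w, x) for x, _ in AND_DATA])
```

On data that are not linearly separable, such as XOR, no weight vector classifies every example correctly, whatever the number of passes.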
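Perceptron
• Limitations of the perceptron: it can only separate classes that are linearly separable
[Figure: a linearly separable case (✔) and a case that is not linearly separable (✗)]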
What is multilayer feedforward NN?
Feedforward neural network: a kind of neural network where a unit is only connected with the units in its next neighboring layer
Hidden units and output units are called functional units, which are usually equipped with a non-linear function (e.g., the sigmoid function)
There is no rule indicating how to get the best network; therefore, network design is a trial-and-error process
Backpropagation (I)
Abbreviated as BP
The most popular neural network algorithm; it can be used in both classification and regression
First proposed by P. Werbos in his Ph.D. dissertation:
P. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University, 1974
The BP algorithm sketch:
Step 1: feed the input forward from the input layer to the hidden layer to the output layer
Step 2: compute the error of the output layer
Step 3: backpropagate the error from the output layer to the hidden layer
Step 4: adjust weights and biases
Backpropagation (II)
Backpropagation procedure:
For each given training example (x, y), do
Propagate the input forward through the network:
1. Input the instance x to the NN and compute the output value ou of every output unit u of the network
Propagate the errors backward through the network:
2. For each network output unit k, calculate its error term δk
3. For each hidden unit h, calculate its error term δh
$$\delta_h = o_h (1 - o_h) \sum_{k \in output} w_{hk} \delta_k$$
4. Update each network weight wji, which is the weight associated with the i-th input value to the unit j
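One pass of the procedure for a network with a single hidden layer can be sketched as below. The output error term δk = ok(1 − ok)(yk − ok) and the update wji ← wji + η δj xji are the usual choices for sigmoid units with squared error and are assumptions here, since the slide does not show them; the bias is handled as an extra constant input of 1, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w_hidden, w_output, x):
    """Step 1: propagate the input forward; the last weight of each row is the bias."""
    xb = list(x) + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]
    output = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_output]
    return hidden, output


def backprop_step(w_hidden, w_output, x, y, eta=0.5):
    """Update the weights in place for one example; return the error before the update."""
    hidden, output = forward(w_hidden, w_output, x)
    # Step 2: error terms of the output units (assumed form, see lead-in)
    delta_out = [o * (1 - o) * (t - o) for o, t in zip(output, y)]
    # Step 3: error terms of the hidden units, using the not yet updated output weights
    delta_hid = [
        h * (1 - h) * sum(w_output[k][j] * delta_out[k] for k in range(len(output)))
        for j, h in enumerate(hidden)
    ]
    # Step 4: adjust every weight by eta * delta_j * (input feeding that weight)
    hb = hidden + [1.0]
    xb = list(x) + [1.0]
    for k, row in enumerate(w_output):
        for j in range(len(row)):
            row[j] += eta * delta_out[k] * hb[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hid[j] * xb[i]
    return 0.5 * sum((t - o) ** 2 for o, t in zip(output, y))
```

Repeating `backprop_step` over the training examples gives the stochastic gradient descent loop; in practice the initial weights are small random values rather than fixed numbers.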
Let’s derive the BP.
Backpropagation (III)
Reformulate the score function as
Since we adopt stochastic gradient descent, we calculate the gradient on receiving the training example (xp, yp) as
, where
Then we derive the corresponding term for each type of unit in the network, such that the update rule can be
• For the weights associated with an output unit
Backpropagation (IV)
By using the chain rule, we get
Plugging in oj yields
, where
• For the weights associated with a hidden unit
Backpropagation (V)
By using the chain rule, we get
where and Nextlayer(j) is the set of units whose immediate input is j
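The equations of this derivation did not survive in the text. As a hedged sketch only, the standard textbook derivation for sigmoid units with a squared-error score runs as follows; the symbols E_p and net_j are assumptions introduced here, and w_{kj} denotes the weight on the input that unit k receives from unit j, following the w_ji convention of the procedure (the hidden-unit formula of the procedure writes the same weight as w_hk).

```latex
% Assumed per-example score: squared error over the output units
E_p(w) = \frac{1}{2} \sum_{k \in outputs} (y_k - o_k)^2

% net_j: weighted input of unit j (assumed name); sigmoid output and its derivative
net_j = \sum_i w_{ji} x_{ji}, \qquad o_j = \sigma(net_j), \qquad
\frac{\partial o_j}{\partial net_j} = o_j (1 - o_j)

% Stochastic gradient step, with \delta_j := -\partial E_p / \partial net_j
\Delta w_{ji} = -\eta \frac{\partial E_p}{\partial w_{ji}}
             = -\eta \frac{\partial E_p}{\partial net_j} \, x_{ji}
             = \eta \, \delta_j \, x_{ji}

% Output unit j: chain rule through o_j
\frac{\partial E_p}{\partial net_j}
  = \frac{\partial E_p}{\partial o_j} \frac{\partial o_j}{\partial net_j}
  = -(y_j - o_j) \, o_j (1 - o_j)
\quad \Rightarrow \quad \delta_j = o_j (1 - o_j)(y_j - o_j)

% Hidden unit j: chain rule through every unit k in Nextlayer(j)
\frac{\partial E_p}{\partial net_j}
  = \sum_{k \in Nextlayer(j)} \frac{\partial E_p}{\partial net_k}
    \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j}
  = \sum_{k \in Nextlayer(j)} (-\delta_k) \, w_{kj} \, o_j (1 - o_j)
\quad \Rightarrow \quad \delta_j = o_j (1 - o_j) \sum_{k \in Nextlayer(j)} w_{kj} \delta_k
```

The last line has the same form as the hidden-unit error term used in the backpropagation procedure.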
Example
• [Figure: MSE of the output layer]
• [Figure: hidden unit encoding for input 01000000]
• [Figure: weights from inputs to one hidden unit]
Backpropagation (VI)
• Remarks
– Solution of BP:
• Only convergence to a local minimum is guaranteed
• Random initialization and stochastic gradient descent make it less likely to get stuck in a poor local minimum
– Representation power of neural networks:
• Boolean functions: can be exactly represented by 2 layers of units
• (Bounded) continuous functions: can be approximated with arbitrarily small error by 2 layers of units: 1 layer of sigmoid units (hidden) and 1 layer of linear units
• Arbitrary functions: can be approximated to arbitrary accuracy by 3 layers of units, with 2 hidden layers of sigmoid units
[Figure: local minima]
What is support vector machine?
Support vector machines are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory
SVM is closely related to neural networks, e.g., an SVM with a Gaussian kernel is actually an RBF neural network
Although SVM has been popular since the mid-1990s, the idea of support vectors was proposed by V. Vapnik in 1963, and some key ingredients of Statistical Learning Theory were obtained in the 1960s and 1970s (mainly by V. Vapnik and A. Chervonenkis): VC dimension (1968), structural risk minimization inductive principle (1974)
Linear hyperplane
Binary classification can be viewed as the task of separating
classes in feature space
Margin
Margin: the width between the two classes
Support vectors: the examples that are closest to the hyperplane
$$r(x) = w^T x + b$$
Assuming all data points are at least distance 1 from the hyperplane, the following two constraints hold for a training set {(xi, yi)}
$$w^T x_i + b \ge 1 \ \text{if } y_i = 1, \qquad w^T x_i + b \le -1 \ \text{if } y_i = -1$$
General model of SVM
• Model: linear classifier in a high-dimensional feature space
• Score function & optimization:
$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \ \xi_i \ge 0$$
(the first term is the structural risk, the second term the empirical risk)
ϕ is a mapping from the input space to a high-dimensional feature space. If ϕ(x) = x, the SVM is a linear SVM
Aims to find a hyperplane in the high-dimensional feature space that separates the data while maximizing the “margin” between the two classes
From linear to non-linear (I)
• Why the mapping matters
By mapping the input space to a higher-dimensional feature space, we can achieve nonlinearity by linear means.
[Figure: the mapping takes the input to a higher-dimensional space; projecting the hyperplane back to the input space yields the non-linear boundary]
From linear to non-linear (II)
• High dimensionality causes problems for learning (curse of dimensionality)
• Solution: kernel trick
– A kernel function corresponds to an inner product in some high-dimensional feature space.
– Thus, we can implicitly compute inner products in that feature space using the kernel function, without actually conducting the mapping
The kernel trick allows us to work in the original space while benefiting from the mapping to a high-dimensional feature space.
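A small numerical check of the kernel trick, using the degree-2 polynomial kernel K(x, z) = (x·z)² on 2-D inputs as an illustrative choice (it is not taken from the slides). Its explicit feature map is φ(x) = (x1², √2·x1·x2, x2²), so the kernel value computed in the input space equals the inner product in the 3-D feature space.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever building phi(x) or phi(z)."""
    return dot(x, z) ** 2


if __name__ == "__main__":
    x, z = (1.0, 2.0), (3.0, -1.0)
    print(poly_kernel(x, z), dot(phi(x), phi(z)))
```

For kernels such as the Gaussian kernel the corresponding feature space is infinite-dimensional, so the implicit computation is the only practical option.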
What is Bayesian classification?
Bayesian classification is based on Bayes rule
Bayesian classifiers have exhibited high accuracy and fast speed when applied to large databases
Bayes rule
$$P(H|X) = \frac{P(X|H) \, P(H)}{P(X)}$$
where P(H|X) is the posterior probability of the hypothesis H conditioned on the data sample X, P(H) is the prior probability of H, P(X|H) is the posterior probability of X conditioned on H, and P(X) is the prior probability of X
Naïve Bayes classifier (I)
also called simple Bayes classifier
class conditional independence: assume that the effect of an attribute value on a given class is independent of the values of the other attributes
class Ci (i = 1,…,m)
attribute Ak (k = 1,…,n)
feature vector X = (x1,x2,…,xn), where xk is the value of X on Ak
The naïve Bayes classifier returns the maximum a posteriori hypothesis Ci, i.e., the class for which
$$P(C_i | X) > P(C_j | X) \quad \text{for } 1 \le j \le m, \ j \ne i$$
according to Bayes rule,
$$P(C_i | X) = \frac{P(X | C_i) \, P(C_i)}{P(X)}$$
because P(X) is a constant for all classes, only P(X|Ci)P(Ci) needs to be maximized
INTRODUCTIONTO DATA MINING: PART 5
Naïve Bayes classifier (II)
to maximize P(X | Ci)P(Ci):
P(Ci) can be estimated by
$$P(C_i) = \frac{S_i}{S}$$
where Si is the number of training instances of class Ci, and S is the total number of training instances
since the naïve Bayes classifier assumes class conditional independence, P(X|Ci) can be estimated by
$$P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i)$$
• if Ak is a categorical attribute, then we can take
$$P(x_k | C_i) = \frac{S_{ik}}{S_i}$$
where Sik is the number of training instances of class Ci having the value xk for Ak, and Si is the number of training instances of class Ci
• if Ak is a continuous attribute, then usually we can take
$$P(x_k | C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi} \, \sigma_{C_i}} \exp\left(-\frac{(x_k - \mu_{C_i})^2}{2 \sigma_{C_i}^2}\right)$$
where g(xk, µCi, σCi) is the Gaussian density function for Ak, and µCi and σCi are the mean and standard deviation, respectively, of the values of Ak for the training instances of class Ci
Example of naïve Bayes classifier (I)
training set:
C1: buys_computer = “yes”, C2: buys_computer = “no”
rid  age    income  student  credit_rating  Class: buys_computer
1    <30    high    no       fair           no
2    <30    high    no       excellent      no
3    30-40  high    no       fair           yes
4    >40    medium  no       fair           yes
5    >40    low     yes      fair           yes
6    >40    low     yes      excellent      no
7    30-40  low     yes      excellent      yes
8    <30    medium  no       fair           no
9    <30    low     yes      fair           yes
10   >40    medium  yes      fair           yes
11   <30    medium  yes      excellent      yes
12   30-40  medium  no       excellent      yes
13   30-40  high    yes      fair           yes
14   >40    medium  no       excellent      no
Example of naïve Bayes classifier (II)
Given an instance to be classified:
X = (age = “< 30”, income = “medium”, student = “yes”, credit_rating = “fair”)
P(Ci):
P(C1) = P(buys_computer = “yes”) = 9 / 14 = 0.643
P(C2) = P(buys_computer = “no”) = 5 / 14 = 0.357
P(X|Ci): since
P(age = “<30” | buys_computer = “yes”) = 2 / 9 = 0.222
P(age = “<30” | buys_computer = “no”) = 3 / 5 = 0.600
P(income = “medium” | buys_computer = “yes”) = 4 / 9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2 / 5 = 0.400
P(student = “yes” | buys_computer = “yes”) = 6 / 9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1 / 5 = 0.200
P(credit_rating = “fair” | buys_computer = “yes”) = 6 / 9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2 / 5 = 0.400
then
P(X|C1) = P(X|buys_computer = “yes”) = 0.222*0.444*0.667*0.667 = 0.044
P(X|C2) = P(X|buys_computer = “no”) = 0.600*0.400*0.200*0.400 = 0.019
P(X|Ci)P(Ci):
P(X|C1)P(C1) = 0.044*0.643 = 0.028
P(X|C2)P(C2) = 0.019*0.357 = 0.007
therefore C1, i.e., buys_computer = “yes”, is returned
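The computation in this example can be reproduced with a short script. This is a minimal sketch for categorical attributes only; `TRAINING` holds the 14 training tuples of the buys_computer data (with the class as the last field), and `naive_bayes_scores` is a name introduced for the illustration.

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer), rid 1 to 14
TRAINING = [
    ("<30", "high", "no", "fair", "no"),
    ("<30", "high", "no", "excellent", "no"),
    ("30-40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("30-40", "low", "yes", "excellent", "yes"),
    ("<30", "medium", "no", "fair", "no"),
    ("<30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<30", "medium", "yes", "excellent", "yes"),
    ("30-40", "medium", "no", "excellent", "yes"),
    ("30-40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def naive_bayes_scores(instance, training=TRAINING):
    """Return P(X|Ci) * P(Ci) for every class Ci, using relative frequencies."""
    class_counts = Counter(row[-1] for row in training)
    total = len(training)
    scores = {}
    for cls, s_i in class_counts.items():
        score = s_i / total                              # P(Ci) = Si / S
        for k, value in enumerate(instance):
            s_ik = sum(1 for row in training if row[-1] == cls and row[k] == value)
            score *= s_ik / s_i                          # P(xk | Ci) = Sik / Si
        scores[cls] = score
    return scores


if __name__ == "__main__":
    scores = naive_bayes_scores(("<30", "medium", "yes", "fair"))
    print(scores, max(scores, key=scores.get))
```

Because the probabilities are multiplied, a single zero count wipes out the whole score of a class.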
A problem with naïve Bayes classifier
rid age income student credit_rating Class: buys_computer
1 <30 high no fair no
2 <30 high no excellent no
3 30-40 high no fair yes
4 >40 medium no fair yes
5 >40 low yes fair yes
6 >40 low yes excellent no
7 30-40 low yes excellent yes
8 <30 medium no fair no
9 <30 low yes fair yes
10 >40 medium yes fair yes
11 <30 medium yes excellent yes
12 30-40 medium no excellent yes
13 30-40 high yes fair yes
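14 >40 medium no excellent no
What happens if we remove this example?
e.g., if example 6 is removed, P(student = “yes” | buys_computer = “no”) = 0 / 4 = 0, so every instance with student = “yes” will always be classified as buys_computer = “yes”, no matter what values appear in the other attributes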
Laplacian correction to naïve Bayes
If some attribute values are not observed in the training data, how do we perform naïve Bayes learning?
e.g.
student  credit  buys
yes      no      yes
no       yes     yes
…        …       yes
yes      yes     no
no       yes     no
all credit entries are “yes” for buys = “no”
Then, how about (student = “yes”, credit = “no”)?
The number of training instances with property X is denoted as #X
Laplacian correction: add 1 to each count #X, so that no estimated probability is exactly 0
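The correction formula itself is not shown in the text. A common form, sketched here as an assumption, is add-one smoothing: P(xk | Ci) = (#(Ak = xk and Ci) + 1) / (#Ci + number of distinct values of Ak). The function name and arguments are hypothetical.

```python
def laplace_estimate(count_value_and_class, count_class, n_values):
    """Add-one (Laplacian) estimate of P(attribute = value | class).

    count_value_and_class: training instances of the class having the value
    count_class: training instances of the class
    n_values: number of distinct values the attribute can take
    """
    return (count_value_and_class + 1) / (count_class + n_values)


if __name__ == "__main__":
    # Toy table: for buys = "no" there are 2 instances, both with credit = "yes"
    print(laplace_estimate(0, 2, 2))  # P(credit = "no"  | buys = "no")
    print(laplace_estimate(2, 2, 2))  # P(credit = "yes" | buys = "no")
```

The smoothed estimates over all values of an attribute still sum to 1, and an unseen value no longer forces the product of probabilities to 0.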
Bayesian network
also called belief network, Bayesian belief network, or probabilistic network
a graphical model which allows the representation of
dependencies among subsets of attributes
a standard Bayesian network is defined by two components:
• a directed acyclic graph, where each node represents a random variable and each arc represents a probabilistic dependence
• a conditional probability table (CPT) for each variable
Example of Bayesian network
[Figure: Bayesian network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]
conditional probability table for the variable LungCancer:
      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9
P(LungCancer = “yes” | FamilyHistory = “yes”, Smoker = “yes”) = 0.8
How to construct a Bayesian network?
• if the network structure and all the variables are known
- it is easy to calculate the CPT entries, in the same way as calculating the probabilities in naïve Bayes classifiers
• if the network structure is known, but some variables are unknown
- gradient descent methods are often used to generate the values of the CPT entries
• if the network structure is unknown
- discrete optimization techniques are often used to generate the network structure from known variables
Turing Award 2011
Judea Pearl (1936 - ), UCLA
Pioneer of Probabilistic and Causal Reasoning (Bayesian networks, graphical models)
J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
What is ensemble learning?
Ensemble learning is a machine learning paradigm where multiple (homogeneous/heterogeneous) individual learners are trained for the same problem
[Figure: one problem addressed by multiple learners]
Why ensemble learning?
The generalization ability of an ensemble is usually significantly better than that of the corresponding single learner
Ensemble learning was regarded as one of the four current directions in machine learning research [T.G. Dietterich, AIMag97]
The more accurate and the more diverse the individual learners, the better the ensemble [A. Krogh & J. Vedelsby, NIPS94]
How to build an ensemble?
An ensemble is built in two steps:
1) obtain the base learners
2) combine the individual predictions
Many ensemble methods
According to how the base learners can be generated:
• Parallel methods
  • Bagging [L. Breiman, MLJ96]
  • Random Subspace [T. K. Ho, TPAMI98]
  • Random Forests [L. Breiman, MLJ01]
  • … …
• Sequential methods
  • AdaBoost [Y. Freund & R. Schapire, JCSS97]
  • Arc-x4 [L. Breiman, AnnStat98]
  • LPBoost [A. Demiriz et al., MLJ06]
Bagging
[Figure: the data set is resampled into data set 1, data set 2, …, data set n; learner i is trained on data set i]
Bootstrap a set of learners: generate a set of training sets by bootstrap sampling from the original data set, and then train a learner on each generated data set
Voting for classification: output the label with the most votes
Averaging for regression: output the average of the individual learners' outputs
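Bagging can be sketched as below. The base learner (a 1-nearest-neighbor rule on 1-D points) and all function names are assumptions chosen only to keep the example short; any learner that can be trained on a resampled data set would do.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def train_1nn(sample):
    """Toy base learner: predict the label of the nearest stored 1-D point."""
    def predict(x):
        nearest = min(sample, key=lambda example: abs(example[0] - x))
        return nearest[1]
    return predict


def bagging(data, train_fn, n_learners, seed=0):
    """Train one learner per bootstrap sample of the original data set."""
    rng = random.Random(seed)
    return [train_fn(bootstrap_sample(data, rng)) for _ in range(n_learners)]


def vote(learners, x):
    """Classification: output the label with the most votes."""
    return Counter(learner(x) for learner in learners).most_common(1)[0][0]


def average(learners, x):
    """Regression: output the average of the individual predictions."""
    return sum(learner(x) for learner in learners) / len(learners)


if __name__ == "__main__":
    data = [(float(v), "a") for v in range(5)] + [(float(v), "b") for v in range(10, 15)]
    learners = bagging(data, train_1nn, n_learners=25)
    print(vote(learners, 1.0), vote(learners, 13.0))
```

Each bootstrap sample leaves out some of the original examples, which is what makes the individual learners differ from one another.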
Boosting
[Figure: the original training set yields data set 1 → learner 1, data set 2 → learner 2, …, data set T → learner T; the learners are merged by weighted combination]
training instances that are wrongly predicted by Learner 1 will play more important roles in the training of Learner 2
Gödel Prize (2003):
Freund & Schapire, A decision theoretic generalization of on-line learning and an application to Boosting. Journal of Computer and System Sciences, 1997, 55: 119-139.
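The reweighting idea can be made concrete with a compact AdaBoost sketch. The base learners are decision stumps on a single numeric input and the labels are coded as +1/−1; both are assumptions of this example, as are the function names.

```python
import math


def stump_predict(stump, x):
    """A stump (threshold, sign) predicts sign if x > threshold, else -sign."""
    threshold, sign = stump
    return sign if x > threshold else -sign


def best_stump(xs, ys, weights):
    """Return the stump with the smallest weighted training error, and that error."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best, best_error = None, float("inf")
    for threshold in thresholds:
        for sign in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict((threshold, sign), x) != y
            )
            if error < best_error:
                best, best_error = (threshold, sign), error
    return best, best_error


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, error = best_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1 - 1e-12)      # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)    # weight of this learner
        ensemble.append((alpha, stump))
        # Wrongly predicted instances get larger weights for the next learner.
        weights = [
            w * math.exp(-alpha * y * stump_predict(stump, x))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted combination of the individual learners."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    xs = [0, 1, 2, 3, 4, 5]
    ys = [1, 1, -1, -1, 1, 1]        # no single stump fits this labeling
    ensemble = adaboost(xs, ys, rounds=3)
    print([predict(ensemble, x) for x in xs])
```

On this toy data every single stump misclassifies at least two of the six points, while the weighted combination of three stumps fits all of them.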
Simple yet effective; can be applied to almost all tasks where one wants to apply machine learning techniques
For example, in computer vision, the Viola-Jones detector: AdaBoost using Haar-like features in a cascade structure
on average, only 8 features need to be evaluated per image
The Viola-Jones detector: “the first real-time face detector”
Comparable accuracy, but 15 times faster than state-of-the-art face detectors (at that time)
Longuet-Higgins Prize (2011):
Viola & Jones, Rapid object detection using a Boosted cascade of simple features. CVPR, 2001.
Selective ensemble
Many Could be Better Than All: Given a set of trained learners, … …, ensembling many of the trained learners may be better than ensembling all of them [Z.-H. Zhou et al., AIJ02]
The basic idea of selective ensemble: use multiple solutions and perform some kind of selection
[Figure: a problem with multiple individual solutions, some of which are selected]
Principles for selection:
• the effectiveness of the individuals
• the complementarity of the individuals
More about ensemble methods:
Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms, Boca Raton, FL: Chapman & Hall/CRC, Jun. 2012.
k-nearest neighbor classifier
• store all the training examples
• each training instance represents a point in n-D instance space
• the nearest neighbors are identified based on some distance measure
• for classification, returns the most common value among the k training instances nearest to the unseen instance
• for regression estimation, returns the average value among the k training instances nearest to the unseen instance
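A minimal sketch of k-nearest neighbor prediction. Euclidean distance is assumed as the distance measure, training examples are assumed to be (point, target) pairs, and the function names are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def k_nearest(training, query, k):
    """The k stored examples whose points are closest to the query point."""
    return sorted(training, key=lambda example: euclidean(example[0], query))[:k]


def knn_classify(training, query, k):
    """Most common label among the k nearest training instances."""
    neighbors = k_nearest(training, query, k)
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def knn_regress(training, query, k):
    """Average target value among the k nearest training instances."""
    neighbors = k_nearest(training, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)


if __name__ == "__main__":
    points = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
              ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
    print(knn_classify(points, (0.5, 0.5), k=3))
```

All the work happens at prediction time: training only stores the examples, which is why the method is often called a lazy learner.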
Case-based reasoning
• “cases” are complex symbolic descriptions
• store all cases
• compare the cases (or components of cases) with unseen cases (or components of cases)
• combine the solutions of similar previous cases (or components of cases) into the solution of unseen cases
previous cases:
case 1: a man who stole $100 was punished with 1 year of imprisonment
case 2: a man who spat in a public area was punished with 10 whips
unseen case:
a man stole $200 and spat in a public area
punished with 1.5 years of imprisonment (similar to case 1) plus 10 whips (identical to case 2)
Linear regression
• linear regression
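data are modeled to fit a straight line
$$Y = \alpha + \beta X$$
where α and β can be estimated from a training set {(xi, yi)} by least squares:
$$\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \alpha = \bar{y} - \beta \bar{x}$$
• multiple regression
a response variable Y is modeled as a linear function of multiple predictor variables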
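Estimating α and β for the straight-line model Y = α + βX can be sketched with the usual least-squares formulas, β = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and α = ȳ − βx̄. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for the model y = alpha + beta * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta = s_xy / s_xx
    alpha = y_mean - beta * x_mean
    return alpha, beta


if __name__ == "__main__":
    alpha, beta = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
    print(alpha, beta)
```

The same function covers models that are linear after a change of variable: fitting y against x squared, for instance, only requires transforming the inputs before calling it.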
Nonlinear regression
• polynomial regression
data are modeled to fit a polynomial function
it can be converted to a linear regression problem, e.g., for Y = α + β1 X + β2 X² + β3 X³, let X1 = X, X2 = X², X3 = X³
there are also nonlinear regression models that cannot be converted to a linear model. For such cases, it may be possible to obtain least-squares estimates through extensive calculations on more complex formulae
Let’s move to
Part 6