Introduction to Data Mining

Academic year: 2021

(1)

Introduction to Data Mining

Ming Li

Department of Computer Science and Technology, Nanjing University

Spring 2015

(2)

Prediction

•  Predictive modeling can be thought of as learning a mapping from an input instance x to a label y

•  Types of prediction:

–  Classification: predicts categorical labels

–  Regression: predicts numerical labels

–  Ranking

(3)

Two-step process of prediction (I)

•  Step 1: Construct a model based on a training set

–  the set of tuples used for model construction is called the training set

–  the set of tuples can also be called a sample (a single tuple can also be called a sample)

–  a tuple is usually called an example (usually with the label) or an instance (usually without the label)

–  the attribute to be predicted is called the label

Training data (the Tenured column is the label):

Name   Rank             Years   Tenured
Mike   Assistant Prof   3       No
Mary   Assistant Prof   7       Yes
Bill   Professor        2       Yes
Jim    Associate Prof   7       Yes
Dave   Assistant Prof   6       No
Anne   Associate Prof   3       No

Training data → Learning algorithm → Prediction model

e.g., IF rank = professor OR years > 6 THEN tenured = yes

(4)

Two-step process of prediction (II)

•  Step 2: Use the model to predict unseen instances

Before using the model, we can estimate its accuracy on a test set:

–  the test set is different from the training set

–  the desired output of a test instance is compared with the actual output from the model

–  for classification, the accuracy is usually measured by the percentage of test instances that are correctly classified by the model

–  for regression, the accuracy is usually measured by mean squared error

Test data:

Name      Rank             Years   Tenured
Tom       Assistant Prof   2       No
Merlisa   Associate Prof   7       No
George    Professor        5       Yes

Test data → Prediction model → accuracy

Unseen data → Prediction model → Tenured? Yes
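As a small illustration of Step 2, the sketch below applies the example rule (IF rank = professor OR years > 6 THEN tenured = yes) to the three test tuples and reports the fraction classified correctly. The function name `predict_tenured` is an illustrative choice, not part of the slides.

```python
def predict_tenured(rank, years):
    """Example rule from the slides: professor OR more than 6 years."""
    return "Yes" if rank == "Professor" or years > 6 else "No"

# Test tuples: (name, rank, years, true label)
test_set = [
    ("Tom", "Assistant Prof", 2, "No"),
    ("Merlisa", "Associate Prof", 7, "No"),
    ("George", "Professor", 5, "Yes"),
]

# Count test instances whose predicted label matches the true label.
correct = sum(
    predict_tenured(rank, years) == label for _, rank, years, label in test_set
)
accuracy = correct / len(test_set)
print(f"accuracy = {correct}/{len(test_set)} = {accuracy:.3f}")
```

The rule misclassifies Merlisa (7 years, but not tenured), so it gets 2 of the 3 test instances right.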
(5)

Turing Award 2010

PAC (Probably Approximately Correct):

There exists a sample size m,

L. G. Valiant. A theory of the learnable. Communications of the ACM, 1984, 27(11): 1134-1142

Leslie Valiant (1949 - ) (Harvard Univ.)

(6)

Supervised vs. Unsupervised learning

Supervised learning

-  the training data are accompanied by labels indicating the desired outputs of the observations

-  the concerned property of unseen data is predicted

-  usually: classification, regression

Unsupervised learning

-  the labels of training data are unknown

-  given a set of observations, the goal is to discover the inherent properties of the data, such as the existence of classes or clusters

(7)

How to evaluate prediction algorithms?

Generalization

–  the ability of the model to correctly predict unseen instances

Speed

–  the computational cost involved in generating and using the model

–  training time cost vs. test time cost: usually the training time cost is larger and the test time cost is smaller

Robustness

–  the ability of the model to deal with noise or missing values

Scalability

–  the ability of the model to deal with huge volumes of data

Comprehensibility

(8)

How to evaluate the generalization?

•  Classification

Accuracy

Overall cost

Precision, Recall, F1

AUC (area under the ROC curve)

•  Regression

MSE (mean squared error)

•  Ranking

Ranking loss

MAP (mean average precision)

NDCG
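A minimal sketch of a few of these measures, using their standard textbook definitions (the slides only name them). The function names and the toy label vectors are illustrative assumptions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, F1) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

def mean_squared_error(y_true, y_pred):
    """MSE for regression: average squared difference."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data: 3 positives, 5 negatives; one false negative and one false positive.
acc, prec, rec, f1 = classification_metrics(
    [1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0]
)
mse = mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

On the toy data, accuracy is 6/8 while precision, recall and F1 are all 2/3, which shows why accuracy alone can look better than the positive-class measures.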

(9)

How to measure?

In most cases, we don't have a separate test set at all, so we have to leverage the provided data to measure generalization.

•  Two widely used methods

Hold-out:

•  Randomly partition the data into two disjoint sets, one for training and one for testing (usually 2/3 vs. 1/3, or 75% vs. 25%)

•  Repeat the process many times to achieve a good estimate

Cross validation:

•  Randomly partition the data into k disjoint sets of equal size

•  Sequentially choose one set for testing and the others for training, so training and testing are conducted k times
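The cross-validation partitioning can be sketched as follows. This is an illustrative helper (the name `k_fold_splits` and the fixed seed are assumptions); it only produces index sets and leaves model training to the caller.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random partition
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds, sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Each of the 10 instances is used for testing exactly once across the 5 rounds.
splits = list(k_fold_splits(n=10, k=5))
```

A hold-out split corresponds to taking a single (train, test) pair with unequal sizes and repeating with different seeds.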

(10)

How good can a prediction model be?

•  Perfect prediction is what we hope for. However, for some problems, no perfect prediction can be achieved if no knowledge other than the data is used

[Figure: two classes that are not separable; mistakes are inevitable in this case]

(11)

How good can a predictive model be?

•  Bayes decision and Bayes error

Bayes error: no other classifier can achieve a lower expected error rate on unseen new data. It is a lower bound on the error of the best classifier for this problem

(12)

No Free Lunch

Generally, there is no algorithm that is consistently better than all other algorithms

The No Free Lunch Theorem states that if an algorithm performs well on a certain class of problems, then it necessarily pays for that with degraded performance on the set of all remaining problems

•  D.H. Wolpert and W.G. Macready. No free lunch theorems for optimization. IEEE TEC, 1997, 1(1):67-82

•  D.H. Wolpert and W.G. Macready. No free lunch theorems for search. Tech. Rep. SFI-TR-95-02-010, Santa Fe Institute, 1995.

Different algorithms usually have different pros and cons; therefore, it is important to know the strengths and weaknesses of an algorithm and when it should be used

(13)

Different types of classifier

•  Discriminative: aims to model how the data can be separated

–  Models the decision boundary directly, e.g., perceptron, neural networks, decision trees, SVM, …

–  Models the posterior class probabilities p(ck|x) directly, e.g., logistic regression, …

•  Generative: aims to find the model that generates the data

–  e.g., Naïve Bayes, Bayes network, …

–  A model assumption is required. A mismatch might lead to poor prediction

(14)

What is decision tree?

A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf represents a class or class distribution

An example (leaves are class labels):

age?
    <30: student?
        no: no
        yes: yes
    30-40: yes
    >40: credit_rating?
        excellent: yes
        fair: no

The topmost node in a tree is the root node

In order to classify an unseen instance, the attribute values of the instance are tested against the decision tree. A path is traced from the root to a leaf, which holds the class prediction for the instance
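Tracing a path from the root to a leaf can be sketched with a nested structure. The representation below (a tuple of attribute name and branch dictionary for internal nodes, a plain string for leaves) is an illustrative assumption; the tree itself is the age / student / credit_rating example.

```python
# Internal node: (attribute, {attribute value: subtree}); leaf: class label string.
tree = ("age", {
    "<30": ("student", {"no": "no", "yes": "yes"}),
    "30-40": "yes",
    ">40": ("credit_rating", {"excellent": "yes", "fair": "no"}),
})

def classify(node, instance):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

young_student = {"age": "<30", "student": "yes", "credit_rating": "fair"}
senior_fair = {"age": ">40", "student": "no", "credit_rating": "fair"}
print(classify(tree, young_student), classify(tree, senior_fair))
```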

(15)

Brief history of decision tree (I)

•  The first decision tree algorithm is CLS (Concept Learning System)

[E. B. Hunt, J. Marin, and P. T. Stone’s book “Experiments in Induction”, published by Academic Press in 1966]

•  The algorithm that raised interest in decision trees is ID3

[J. R. Quinlan’s paper in the book “Expert Systems in the Micro Electronic Age”, edited by D. Michie, published by Edinburgh University Press in 1979]

•  The most popular decision tree algorithm is C4.5

[J. R. Quinlan’s book “C4.5: Programs for Machine Learning”, published by Morgan Kaufmann in 1993]

J. Ross Quinlan, SIGKDD Innovation Award Winner (2011)

(16)

Brief history of decision tree (II)

•  The most popular decision tree algorithm that can be used in regression is CART (Classification and Regression Tree)

[L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone’s book “Classification and Regression Trees”, published by Wadsworth in 1984]

•  The strongest decision tree-based learning algorithm is Random Forests, a tree ensemble algorithm

[L. Breiman’s MLJ’01 paper “Random Forests”]

Leo Breiman, SIGKDD Innovation Award Winner (2005)

(17)

How to construct a decision tree? (I)

Basic strategy:

•  A tree is constructed in a top-down recursive divide-and-conquer manner

•  At the start, all the training examples are at the root

•  Attributes are categorical (if continuous-valued, they are discretized in advance)

•  Examples are partitioned recursively based on selected attributes; a selected attribute is also called a split or a test

•  Splits are selected based on a heuristic or statistical measure (e.g., information gain)

•  The partitioning terminates if any of the following conditions is met:

-  all examples falling into a node belong to the same class: this node becomes a leaf whose label is that class

-  no attribute can be used to further partition the data: this node becomes a leaf whose label is the majority class of the examples falling into the node

-  no instance falls into a node: this node becomes a leaf whose label is the majority class of the examples falling into the parent of the node

(18)

How to construct a decision tree? (II)

Algorithm of the basic strategy (ID3):

generate_decision_tree (samples, attribute_list)

1)  create a node N;

2)  if samples are all of the same class, C, then
        return N as a leaf node labeled with class C;

3)  if attribute_list is empty then
        return N as a leaf node labeled with the most common class in samples;

4)  select test_attribute, the attribute among attribute_list with the highest information gain;

5)  label node N with test_attribute;

6)  for each known value ai of test_attribute
        1)  grow a branch from node N for the condition test_attribute = ai;
        2)  let si be the set of samples in samples for which test_attribute = ai;
        3)  if si is empty then
                attach a leaf labeled with the most common class in samples;
        4)  else attach the node returned by
                generate_decision_tree (si, attribute_list - test_attribute);
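The procedure above could be rendered in Python roughly as follows. This is a sketch under assumptions: samples are (attribute dictionary, label) pairs, `domains` lists the known values of each attribute, and the tiny data set is made up for illustration.

```python
import math
from collections import Counter

def entropy(samples):
    counts = Counter(label for _, label in samples)
    return -sum(c / len(samples) * math.log2(c / len(samples))
                for c in counts.values())

def information_gain(samples, attribute):
    remainder = 0.0
    for value in {x[attribute] for x, _ in samples}:
        subset = [(x, y) for x, y in samples if x[attribute] == value]
        remainder += len(subset) / len(samples) * entropy(subset)
    return entropy(samples) - remainder

def generate_decision_tree(samples, attribute_list, domains):
    labels = [label for _, label in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:      # step 2: all samples share one class
        return labels[0]
    if not attribute_list:         # step 3: no attribute left
        return majority
    # steps 4-5: pick the attribute with the highest information gain
    test_attribute = max(attribute_list,
                         key=lambda a: information_gain(samples, a))
    remaining = [a for a in attribute_list if a != test_attribute]
    branches = {}
    for value in domains[test_attribute]:   # step 6: one branch per known value
        subset = [(x, y) for x, y in samples if x[test_attribute] == value]
        branches[value] = (generate_decision_tree(subset, remaining, domains)
                           if subset else majority)
    return (test_attribute, branches)

# Hypothetical toy data in which the label depends only on "student".
samples = [
    ({"student": "yes", "credit": "fair"}, "yes"),
    ({"student": "yes", "credit": "excellent"}, "yes"),
    ({"student": "no", "credit": "fair"}, "no"),
    ({"student": "no", "credit": "excellent"}, "no"),
]
domains = {"student": ["yes", "no"], "credit": ["fair", "excellent"]}
learned_tree = generate_decision_tree(samples, ["credit", "student"], domains)
```

Here "student" has gain 1 and "credit" has gain 0, so the root tests "student" and both branches become leaves immediately.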

(19)

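Splitting criteria: Information gain

S: training set

Si: training instances of class Ci (i = 1,…,m)

aj: values of attribute A (j = 1,…,v)

the information needed to correctly classify the training set is

$$I(S_1, \dots, S_m) = -\sum_{i=1}^{m} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

suppose attribute A is selected to partition the training set into the subsets {SA1, SA2,…,SAv}, then the entropy of the subsets, i.e., the information needed to classify all the instances in those subsets, is

$$E(A) = \sum_{i=1}^{v} \frac{|S^{A}_{i}|}{|S|}\, I(S^{A}_{i1}, \dots, S^{A}_{im})$$

where SAij is the set of instances of class Cj contained in SAi (here i indexes the values of A and j indexes the classes)

then the information gain of selecting A is

$$\mathrm{Gain}(A) = I(S_1, \dots, S_m) - E(A)$$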

(20)

Splitting criteria: Example of information gain (I)

Target class: Graduate students (Σ=120)

gender   major         birth_country   age_range   gpa         count
M        Science       Canada          20-25       Very_good   16
F        Science       Foreign         25-30       Excellent   22
M        Engineering   Foreign         25-30       Excellent   18
F        Science       Foreign         25-30       Excellent   25
M        Science       Canada          20-25       Excellent   21
F        Engineering   Canada          20-25       Excellent   18

Contrasting class: Undergraduate students (Σ=130)

gender   major         birth_country   age_range   gpa         count
M        Science       Foreign         <20         Very_good   18
F        Business      Canada          <20         Fair        20
M        Business      Canada          <20         Fair        22
F        Science       Canada          20-25       Fair        24
M        Engineering   Foreign         20-25       Very_good   22
F        Engineering   Canada          <20         Excellent   24

(21)

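Splitting criteria: Example of information gain (II)

the information needed to correctly classify the training set is

$$I(S_1, S_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$$

suppose attribute major is selected to partition the training set

for major = “Science”: S11 = 84, S12 = 42

for major = “Engineering”: S21 = 36, S22 = 46

for major = “Business”: S31 = 0, S32 = 42

then the entropy of major is

$$E(\mathrm{major}) = \frac{126}{250} I(84, 42) + \frac{82}{250} I(36, 46) + \frac{42}{250} I(0, 42) = 0.7873$$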

(22)

Splitting criteria: Example of information gain (III)

then the information gain of major is

Gain(major) = I(S1,S2) – E(major) = 0.2115

we can also get the information gain of the other attributes:

Gain(gender) = 0.0003

Gain(birth_country) = 0.0407

Gain(gpa) = 0.4490

Gain(age_range) = 0.5971
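The gains quoted above can be re-derived from the two count tables. The sketch below is one way to do it; the data layout and the helper names `info` and `gain` are illustrative choices, and small differences in the last decimal place may come from rounding.

```python
import math

ATTRS = ["gender", "major", "birth_country", "age_range", "gpa"]

# Rows: (gender, major, birth_country, age_range, gpa, count)
graduate = [
    ("M", "Science", "Canada", "20-25", "Very_good", 16),
    ("F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("M", "Science", "Canada", "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada", "20-25", "Excellent", 18),
]
undergraduate = [
    ("M", "Science", "Foreign", "<20", "Very_good", 18),
    ("F", "Business", "Canada", "<20", "Fair", 20),
    ("M", "Business", "Canada", "<20", "Fair", 22),
    ("F", "Science", "Canada", "20-25", "Fair", 24),
    ("M", "Engineering", "Foreign", "20-25", "Very_good", 22),
    ("F", "Engineering", "Canada", "<20", "Excellent", 24),
]

def info(counts):
    """I(s1, ..., sm): information needed to classify, from class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(attr):
    idx = ATTRS.index(attr)
    per_value = {}  # attribute value -> [graduate count, undergraduate count]
    for cls, rows in enumerate((graduate, undergraduate)):
        for row in rows:
            per_value.setdefault(row[idx], [0, 0])[cls] += row[-1]
    total = sum(sum(c) for c in per_value.values())
    class_totals = [sum(c[k] for c in per_value.values()) for k in range(2)]
    expected = sum(sum(c) / total * info(c) for c in per_value.values())
    return info(class_totals) - expected

for attr in ATTRS:
    print(f"Gain({attr}) = {gain(attr):.4f}")
```

Under information gain, age_range would be chosen as the first split, since it has the largest value.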

(23)

Splitting criteria: Other kinds of split selection criteria (I)

•  Gain ratio

A shortcoming of information gain is its bias toward attributes with many values. To reduce the influence of this bias, J. R. Quinlan used gain ratio in C4.5 instead of information gain:

$$\mathrm{Gain\_ratio}(A) = \frac{\mathrm{Gain}(A)}{IV(A)}, \qquad IV(A) = -\sum_{j=1}^{v} \frac{|S'_j|}{|S|} \log_2 \frac{|S'_j|}{|S|}$$

where attribute A is selected to partition the instance set S into the subsets {S’1, S’2,…, S’v}

Provided IV(A) ≠ 0, J. R. Quinlan recommended selecting, from the attributes with an average-or-better Gain(A), the attribute A that maximizes Gain_ratio(A)

(24)

Splitting criteria: Other kinds of split selection criteria (II)

•  Gini index (used in CART)

S: training set

Si: training instances of class Ci (i=1,…,m)

aj: values of attribute A (j=1,…,v)

the gini of S is

$$\mathrm{gini}(S) = 1 - \sum_{i=1}^{m} \left(\frac{|S_i|}{|S|}\right)^{2}$$

suppose attribute A is selected to partition the instance set into the subsets {S’1, S’2,…, S’v}, then the gini of A is

$$\mathrm{gini}_A(S) = \sum_{j=1}^{v} \frac{|S'_j|}{|S|}\, \mathrm{gini}(S'_j)$$
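A minimal sketch of the Gini computation, assuming the standard CART definition (one minus the sum of squared class proportions, and a size-weighted average over the subsets). It reuses the graduate/undergraduate counts per major from the information gain example; the function names are illustrative.

```python
def gini(counts):
    """Gini impurity of a set described by its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Size-weighted Gini of the subsets produced by a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions if sum(p) > 0)

whole = gini([120, 130])  # 120 graduates, 130 undergraduates
by_major = gini_split([[84, 42], [36, 46], [0, 42]])  # Science, Engineering, Business
print(f"gini(S) = {whole:.4f}, gini_major(S) = {by_major:.4f}")
```

A split is useful under this criterion when the weighted Gini of the subsets is lower than the Gini of the whole set, as is the case for major here.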

(25)

Why pruning?

overfitting: the trained model fits the training set so closely that it deviates from the real distribution of the instance space

main reasons: finite training set; noise

when a decision tree is built, many branches may reflect anomalies in the training set due to noise or outliers

(26)

How to prune decision tree?

Two popular methods:

•  Prepruning

Terminate tree construction early: do not split a node if it would result in the goodness measure falling below a threshold

It is hard to choose an appropriate threshold

•  Postpruning

Remove branches from a “fully grown” tree: progressively prune the tree if the goodness measure can be improved

The “goodness” is usually measured with the help of a validation set, which is a set of data different from the training data

In general, postpruning is more accurate than prepruning, yet requires more computational cost

(27)

Mapping decision tree to rules

A rule is created for each path from the root to a leaf: each attribute-value pair along a given path forms a conjunction in the rule antecedent. The class label held by the leaf forms the rule consequent

age?
    <30: student?
        no: no
        yes: yes
    30-40: yes
    >40: credit_rating?
        excellent: yes
        fair: no

IF age = “<30” AND student = no THEN buys_computer = no

IF age = “<30” AND student = yes THEN buys_computer = yes

IF age = “30-40” THEN buys_computer = yes

IF age = “>40” AND credit_rating = excellent THEN buys_computer = yes

IF age = “>40” AND credit_rating = fair THEN buys_computer = no
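The path-to-rule mapping can be sketched as a recursive traversal. The tuple/dictionary tree encoding and the function name `tree_to_rules` are illustrative assumptions; the tree is the buys_computer example.

```python
# Internal node: (attribute, {attribute value: subtree}); leaf: class label string.
tree = ("age", {
    "<30": ("student", {"no": "no", "yes": "yes"}),
    "30-40": "yes",
    ">40": ("credit_rating", {"excellent": "yes", "fair": "no"}),
})

def tree_to_rules(node, conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):  # leaf: the accumulated tests form the antecedent
        antecedent = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f"IF {antecedent} THEN buys_computer = {node}"]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(tree_to_rules(child, conditions + ((attribute, value),)))
    return rules

rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

The tree has five leaves, so five rules are produced, one per leaf.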

(28)

Enhance basic decision tree algorithm (I)

•  Allow for continuous-valued attributes

–  a test on a continuous-valued attribute A results in two branches, corresponding to A ≤ V and A > V

–  given v values of A, (v-1) possible splits may be considered

•  Handle missing attribute values

–  assign the most common value of the attribute

–  assign a probability to each of the possible values

•  Incremental induction

–  it is not good to regenerate the tree from scratch every time new instances arrive

–  dynamically adjust the splits in the tree

•  Attribute construction

(29)

Enhance basic decision tree algorithm (II)

Scalable decision tree algorithm:

most studies focus on improving the data structures

•  SLIQ [Mehta et al., EDBT96]

–  build an index for each attribute. Only the class list and the current attribute list reside in memory

•  SPRINT [J. Shafer et al., VLDB96]

–  construct an attribute list. When a node is partitioned, the attribute list is also partitioned

•  RainForest [J. Gehrke et al., VLDB98]

–  build an AVC-list (attribute, value, class label)

–  separate the scalability aspects from the criteria that determine the quality of the tree

(30)

What is neural network?

also called artificial neural network

neural networks are massively parallel interconnected networks of simple (usually adaptive) elements and their hierarchical organizations which are intended to interact with the objects of the real world in the same way as biological nervous systems do [T. Kohonen, NN88]

The basic components of a neural network: neurons and weights

–  M-P model: neurons are connected by weights

–  a neuron is also called a unit

–  a bias is also called a threshold

The knowledge learned by a neural network is encoded in the weights and biases

(31)

Perceptron

[Figure: the model of the perceptron]

Aims to find a hyperplane that separates the different classes

It can be learned by:

Repeat

    For each training example (xi, yi) do

        $w \leftarrow w - \eta\,(w^{T} x_i - y_i)\,x_i$

Until convergence
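A minimal sketch of the learning loop, using the update w ← w − η (wᵀxi − yi) xi from the slide. Several details are assumptions made for illustration: the bias is folded into w by appending a constant input of 1, a fixed number of epochs stands in for "until convergence", labels are +1/−1, and the four toy points are made up so that the class equals the sign of the first coordinate.

```python
def train_perceptron(examples, eta=0.1, epochs=50):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)  # last component acts as the bias
    for _ in range(epochs):
        for x, y in examples:
            xb = list(x) + [1.0]
            error = sum(wi * xi for wi, xi in zip(w, xb)) - y  # w^T x_i - y_i
            w = [wi - eta * error * xi for wi, xi in zip(w, xb)]
    return w

def predict(w, x):
    xb = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else -1

examples = [((1, 1), 1), ((1, -1), 1), ((-1, 1), -1), ((-1, -1), -1)]
w = train_perceptron(examples)
```

The learned hyperplane is the set of points where the weighted sum equals zero; on this toy data the weight on the first coordinate approaches 1 and the other weights approach 0.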

(32)

Gradient  Descent

w ← w + Δw
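A minimal sketch of this update, assuming the usual choice Δw = −η ∇E(w). The step size name `eta`, the fixed step count, and the toy objective E(w) = (w − 3)² are illustrative assumptions, not taken from the slides.

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Repeat w <- w + delta_w with delta_w = -eta * grad(w)."""
    w = w0
    for _ in range(steps):
        delta_w = -eta * grad(w)
        w = w + delta_w
    return w

# E(w) = (w - 3)^2 has gradient 2 (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

With this step size, each iteration shrinks the distance to the minimum by a constant factor, so w_min ends up very close to 3.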

(33)

Perceptron

•  Limitations of perceptron

(34)

What is multilayer feedforward NN?

Feedforward neural network: a kind of neural network where a unit is only connected with the units in its next neighboring layer

Hidden units and output units are called functional units, which are usually equipped with a non-linear function (sigmoid function)

There is no rule indicating how to get the best network, therefore the network design is a trial-and-error process

(35)

Backpropagation (I)

Abbreviated as BP

The most popular neural network algorithm; it can be used in both classification and regression

First proposed by P. Werbos in his Ph.D. dissertation:

P. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University, 1974

Sketch of the BP algorithm:

Step 1: feed the input forward from the input layer to the hidden layer to the output layer

Step 2: compute the error of the output layer

Step 3: backpropagate the error from the output layer to the hidden layer

Step 4: adjust weights and biases

(36)

Backpropagation (II)

Backpropagation procedure:

For each given training example (x, y), do

Propagate the input forward through the network:

1. Input the instance x to the NN and compute the output value ou of every output unit u of the network

Propagate the errors backward through the network:

2. For each network output unit k, calculate its error term δk

3. For each hidden unit h, calculate its error term δh

$$\delta_h = o_h (1 - o_h) \sum_{k \in \mathrm{outputs}} w_{kh}\, \delta_k$$

4. Update each network weight wji, which is the weight associated with the i-th input value to the unit j
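The four steps can be sketched for a network with one hidden layer. This is an illustration under assumptions the extracted slide does not spell out: sigmoid units throughout, a squared-error objective (giving the output error term o(1 − o)(y − o)), biases handled as weights on a constant input of 1, and the OR function as made-up training data.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, W_out):
    """Step 1: propagate the input forward; each weight row ends with a bias weight."""
    xb = x + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W_hidden]
    hb = hidden + [1.0]
    out = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in W_out]
    return xb, hb, out

def bp_step(x, y, W_hidden, W_out, eta):
    xb, hb, out = forward(x, W_hidden, W_out)
    # Step 2: error term of each output unit k
    delta_out = [o * (1 - o) * (t - o) for o, t in zip(out, y)]
    # Step 3: error term of each hidden unit h (uses the weights before the update)
    delta_hid = [
        hb[h] * (1 - hb[h]) * sum(W_out[k][h] * delta_out[k] for k in range(len(W_out)))
        for h in range(len(W_hidden))
    ]
    # Step 4: update every weight w_ji by eta * delta_j * (i-th input to unit j)
    for k, row in enumerate(W_out):
        for i in range(len(row)):
            row[i] += eta * delta_out[k] * hb[i]
    for h, row in enumerate(W_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hid[h] * xb[i]

def mse(data, W_hidden, W_out):
    total = sum((t - o) ** 2
                for x, y in data
                for t, o in zip(y, forward(x, W_hidden, W_out)[2]))
    return total / len(data)

rng = random.Random(0)
W_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]  # 2 inputs + bias
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(3)]]                        # 2 hidden + bias
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]

error_before = mse(data, W_hidden, W_out)
for _ in range(2000):
    for x, y in data:
        bp_step(x, y, W_hidden, W_out, eta=0.5)
error_after = mse(data, W_hidden, W_out)
```

Comparing `error_before` with `error_after` gives a quick check that the updates are reducing the training error.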

(37)

Backpropagation (III)

Let’s derive the BP.

Reform the score function as

Since we adopt stochastic gradient descent, we calculate the gradient when receiving the training example (xp, yp) as

, where

Then, we derive the corresponding term for the different types of units in the network, such that the update rule can be

(38)

Backpropagation (IV)

•  For the weights associated with an output unit

By using the chain rule, we get

Plugging in oj yields

, where

(39)

Backpropagation (V)

•  For the weights associated with a hidden unit

By using the chain rule, we get

where Nextlayer(j) is the set of units whose immediate inputs include the output of unit j
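Since the equations of this derivation did not survive in the text, the following is a hedged reconstruction of the standard argument, not necessarily the exact form on the slides. It assumes a per-example squared error $E_p$, sigmoid units with $o_j = \sigma(net_j)$, $net_j = \sum_i w_{ji} x_{ji}$, and learning rate $\eta$; $x_{ji}$ denotes the i-th input to unit j.

```latex
% Assumed score function for the training example (x_p, y_p):
E_p = \frac{1}{2} \sum_{k \in \mathrm{outputs}} (y_k - o_k)^2

% Stochastic gradient step for any weight w_{ji}:
\Delta w_{ji} = -\eta \frac{\partial E_p}{\partial w_{ji}}
             = -\eta \frac{\partial E_p}{\partial net_j} \, x_{ji}
             = \eta \, \delta_j \, x_{ji},
\qquad \delta_j = -\frac{\partial E_p}{\partial net_j}

% Output unit j (chain rule through o_j, with \sigma'(net_j) = o_j (1 - o_j)):
\delta_j = -\frac{\partial E_p}{\partial o_j} \frac{\partial o_j}{\partial net_j}
         = (y_j - o_j) \, o_j (1 - o_j)

% Hidden unit j (chain rule through every unit k fed by j):
\delta_j = -\sum_{k \in \mathrm{Nextlayer}(j)}
             \frac{\partial E_p}{\partial net_k}
             \frac{\partial net_k}{\partial o_j}
             \frac{\partial o_j}{\partial net_j}
         = o_j (1 - o_j) \sum_{k \in \mathrm{Nextlayer}(j)} \delta_k \, w_{kj}
```

Under these assumptions, the hidden-unit result is the same error term used in step 3 of the backpropagation procedure, with Nextlayer(j) equal to the output layer when there is a single hidden layer.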

(40)
(41)

Example

•  MSE of output layer

(42)

Example

•  Hidden unit encoding for input 01000000

(43)

Example

•  Weights from inputs to one hidden unit

(44)

Backpropagation (VI)

•  Remarks

Solution of BP:

•  Only guaranteed to converge to a local minimum, not necessarily the global one

•  Random initialization and stochastic gradient descent make it less likely to get stuck in a poor local minimum

Representation power of neural networks:

•  Boolean functions: can be exactly represented by 2 layers of units

•  (Bounded) continuous functions: can be approximated with arbitrarily small error by 2 layers of units, with 1 (hidden) layer of sigmoid units and 1 layer of linear units

•  Arbitrary functions: can be approximated to arbitrary accuracy by 3 layers of units, with 2 hidden layers of sigmoid units

(45)

What is support vector machine?

Support vector machines are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory

SVM has a close relationship with neural networks, e.g., an SVM with a Gaussian kernel is actually an RBF neural network

Although SVM has been popular since the mid-1990s, the idea of support vectors was in fact proposed by V. Vapnik in 1963, and some key ingredients of Statistical Learning Theory were obtained in the 1960s and 1970s (mainly by V. Vapnik and A. Chervonenkis): VC dimension (1968), structural risk minimization inductive principle (1974)

(46)

Linear  hyperplane

Binary classification can be viewed as the task of separating

classes in feature space

(47)

Margin

Margin: the width between the two classes

Support vectors: the examples that are closest to the hyperplane

$r(x) = w^{T} x + b$

Assuming all data points are at least distance 1 from the hyperplane, the following two constraints hold for a training set {(xi, yi)}:

$$w^{T} x_i + b \ge 1 \ \text{ if } y_i = +1, \qquad w^{T} x_i + b \le -1 \ \text{ if } y_i = -1$$

(48)

General Model of SVM

•  Model: linear classifier in a high-dimensional feature space

•  Score function & optimization:

[Equation: an objective minimized over w, with one term labeled "structural risk" and one labeled "empirical risk", subject to constraints (s.t.)]

ϕ is a mapping from the input space to a high-dimensional feature space. If ϕ(x) = x, the SVM is a linear SVM

Aims to find a hyperplane in the high-dimensional feature space that separates the data, such that the “margin” between the two classes is maximized

(49)

From linear to non-linear (I)

•  Why the mapping matters

By mapping the input space to a higher-dimensional feature space, we can achieve nonlinearity by linear means.

The mapping takes the input to a higher-dimensional space

Projecting the hyperplane back to the input space yields the non-linear boundary

(50)

From linear to non-linear (II)

•  High dimensionality causes problems for learning (curse of dimensionality)

•  Solution: kernel trick

–  A kernel function corresponds to an inner product in some high-dimensional feature space.

–  Thus, we can implicitly compute inner products in that high-dimensional feature space using the kernel function, without actually conducting the mapping

The kernel trick allows us to work in the original space while benefiting from the mapping to a high-dimensional feature space.
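A small numerical check of this idea, using a kernel chosen for illustration (it does not appear in the slides): for 2-D inputs, the degree-2 polynomial kernel k(x, z) = (x·z)² equals the inner product of the explicit feature vectors ϕ(x) = (x1², √2·x1·x2, x2²).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit mapping into a 3-D feature space for the kernel below."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Computed entirely in the original 2-D space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # inner product after mapping
implicit = poly_kernel(x, z)     # same value without mapping
print(explicit, implicit)
```

The two values agree up to floating-point rounding, and the kernel never constructs ϕ(x). For kernels such as the Gaussian kernel, the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.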

(51)

What is Bayesian classification?

Bayesian classification is based on Bayes rule

Bayesian classifiers have exhibited high accuracy and fast speed when applied to large databases

Bayes rule:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$

where P(H|X) is the posterior probability of the hypothesis H conditioned on the data sample X, P(H) is the prior probability of H, P(X|H) is the posterior probability of X conditioned on H, and P(X) is the prior probability of X

(52)

Naïve Bayes classifier (I)

also called simple Bayes classifier

class conditional independence: assume that the effect of an attribute value on a given class is independent of the values of the other attributes

class Ci (i = 1,…,m), attribute Ak (k = 1,…,n)

feature vector X = (x1,x2,…,xn), where xk is the value of X on Ak

Naïve Bayes classifier wants to get the maximum a posteriori hypothesis Ci, i.e., the class for which

$$P(C_i \mid X) > P(C_j \mid X) \quad \text{for } 1 \le j \le m,\ j \ne i$$

according to Bayes rule,

$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$$

because P(X) is a constant for all classes, only P(X|Ci)P(Ci) needs to be maximized

(53)

INTRODUCTION TO DATA MINING: PART 5

Naïve Bayes classifier (II)

to maximize P(X | Ci)P(Ci):

P(Ci) can be estimated by

$$P(C_i) = \frac{S_i}{S}$$

where Si is the number of training instances of class Ci, and S is the total number of training instances

since naïve Bayes classifier assumes class conditional independence, P(X|Ci) can be estimated by

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

•  if Ak is a categorical attribute, then we can take

$$P(x_k \mid C_i) = \frac{S_{ik}}{S_i}$$

where Sik is the number of training instances of class Ci having the value xk for Ak, and Si is the number of training instances of class Ci

•  if Ak is a continuous attribute, then usually we can take

$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\!\left(-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right)$$

where g(xk, μCi, σCi) is the Gaussian density function for Ak, and μCi and σCi are the mean and standard deviation, respectively, of the values of Ak for the training instances of class Ci

(54)

Example of naïve Bayes classifier (I)

training set:

C1: buys_computer = “yes”, C2: buys_computer = “no”

rid   age     income   student   credit_rating   Class: buys_computer
1     <30     high     no        fair            no
2     <30     high     no        excellent       no
3     30-40   high     no        fair            yes
4     >40     medium   no        fair            yes
5     >40     low      yes       fair            yes
6     >40     low      yes       excellent       no
7     30-40   low      yes       excellent       yes
8     <30     medium   no        fair            no
9     <30     low      yes       fair            yes
10    >40     medium   yes       fair            yes
11    <30     medium   yes       excellent       yes
12    30-40   medium   no        excellent       yes
13    30-40   high     yes       fair            yes
14    >40     medium   no        excellent       no

(55)


Example of naïve Bayes classifier (II)

Given an instance to be classified:

X = (age = “< 30”, income = “medium”, student = “yes”, credit_rating = “fair”)

P(Ci):

P(C1) = P(buys_computer = “yes”) = 9 / 14 = 0.643

P(C2) = P(buys_computer = “no”) = 5 / 14 = 0.357

P(X|Ci): since

P(age = “<30” | buys_computer = “yes”) = 2 / 9 = 0.222

P(age = “<30” | buys_computer = “no”) = 3 / 5 = 0.600

P(income = “medium” | buys_computer = “yes”) = 4 / 9 = 0.444

P(income = “medium” | buys_computer = “no”) = 2 / 5 = 0.400

P(student = “yes” | buys_computer = “yes”) = 6 / 9 = 0.667

P(student = “yes” | buys_computer = “no”) = 1 / 5 = 0.200

P(credit_rating = “fair” | buys_computer = “yes”) = 6 / 9 = 0.667

P(credit_rating = “fair” | buys_computer = “no”) = 2 / 5 = 0.400

then

P(X|C1) = P(X|buys_computer = “yes”) = 0.222*0.444*0.667*0.667 = 0.044

P(X|C2) = P(X|buys_computer = “no”) = 0.600*0.400*0.200*0.400 = 0.019

P(X|Ci)P(Ci):

P(X|C1)P(C1) = 0.044*0.643 = 0.028

P(X|C2)P(C2) = 0.019*0.357 = 0.007

therefore C1, i.e., buys_computer = “yes”, is returned
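The same calculation can be reproduced by counting over the 14 training instances. This is a sketch for categorical attributes only, without any smoothing; the function name `naive_bayes_scores` and the tuple layout are illustrative choices.

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer)
rows = [
    ("<30", "high", "no", "fair", "no"),
    ("<30", "high", "no", "excellent", "no"),
    ("30-40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("30-40", "low", "yes", "excellent", "yes"),
    ("<30", "medium", "no", "fair", "no"),
    ("<30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<30", "medium", "yes", "excellent", "yes"),
    ("30-40", "medium", "no", "excellent", "yes"),
    ("30-40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_scores(x):
    """Return P(X|Ci) * P(Ci) for every class Ci, estimated by counting."""
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(rows)                      # P(Ci) = Si / S
        for k, value in enumerate(x):
            n_match = sum(1 for r in rows if r[-1] == c and r[k] == value)
            score *= n_match / n_c                   # P(xk | Ci) = Sik / Si
        scores[c] = score
    return scores

scores = naive_bayes_scores(("<30", "medium", "yes", "fair"))
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The class with the larger product is returned, which matches the hand calculation (about 0.028 for "yes" against about 0.007 for "no").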

(56)

A problem with naïve Bayes classifier

rid   age     income   student   credit_rating   Class: buys_computer
1     <30     high     no        fair            no
2     <30     high     no        excellent       no
3     30-40   high     no        fair            yes
4     >40     medium   no        fair            yes
5     >40     low      yes       fair            yes
6     >40     low      yes       excellent       no
7     30-40   low      yes       excellent       yes
8     <30     medium   no        fair            no
9     <30     low      yes       fair            yes
10    >40     medium   yes       fair            yes
11    <30     medium   yes       excellent       yes
12    30-40   medium   no        excellent       yes
13    30-40   high     yes       fair            yes
14    >40     medium   no        excellent       no

What happens if we remove this example?

will always be classified as “buys-computer = yes”, no matter what values appear in other

(57)

Laplacian correction to naïve Bayes

If some attribute values are not observed in the training data, how can naïve Bayes learning be performed?

e.g.

student   credit   buys
yes       no       yes
no        yes      yes
…         …        yes
yes       yes      no
no        yes      no

all credit entries are “yes” for buys = “no”

Then, how about: (student = “yes”, credit = “no”)?

the number of training instances with property X is denoted as #X

Laplacian  
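The slide's own formula did not survive in the text, so the sketch below uses one common form of the Laplacian correction as an assumption: add 1 to each count and add the number of possible attribute values to the denominator. The function name and the counts (taken from the two visible rows with buys = "no") are illustrative.

```python
def laplace_estimate(count_value_and_class, count_class, n_values):
    """Smoothed estimate of P(attribute = value | class).

    count_value_and_class: #(attribute = value, class)
    count_class:           #(class)
    n_values:              number of distinct values the attribute can take
    """
    return (count_value_and_class + 1) / (count_class + n_values)

# credit = "no" never occurs together with buys = "no" among those two rows,
# so the unsmoothed estimate would be 0 / 2 = 0.
p_credit_no = laplace_estimate(0, 2, n_values=2)
p_credit_yes = laplace_estimate(2, 2, n_values=2)
print(p_credit_no, p_credit_yes)
```

With the correction, the unseen combination gets a small non-zero probability instead of forcing the whole product P(X|Ci) to zero, and the smoothed estimates over the attribute's values still sum to 1.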

(58)

Bayesian network

also called belief network, Bayesian belief network, or probabilistic network

a graphical model which allows the representation of dependencies among subsets of attributes

a standard Bayesian network is defined by two components:

•  a directed acyclic graph, where each node represents a random variable and each arc represents a probabilistic dependence

•  a conditional probability table (CPT) for each variable

(59)

Example of Bayesian network

[Figure: a Bayesian network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

conditional probability table for the variable LungCancer:

       (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC     0.8       0.5        0.7        0.1
~LC    0.2       0.5        0.3        0.9

P(LungCancer = “yes” | FamilyHistory = “yes”, Smoker = “yes”) = 0.8

(60)

How to construct a Bayesian network?

•  if the network structure and all the variables are known

-  it is easy to calculate the CPT entries, in the same way as the probabilities in naïve Bayes classifiers are calculated

•  if the network structure is known, but some variables are unknown

-  gradient descent methods are often used to generate the values of the CPT entries

•  if the network structure is unknown

-  discrete optimization techniques are often used to generate the network structure from known variables

(61)

Turing Award 2011

Judea Pearl (1936 - ) (UCLA)

Pioneer of Probabilistic and Causal Reasoning (Bayesian networks, graphical model)

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.

(62)

What is ensemble learning?

Ensemble learning is a machine learning paradigm where multiple (homogeneous / heterogeneous) individual learners are trained for the same problem

[Figure: one Problem feeding several Learners]

(63)

Why ensemble learning?

The generalization ability of an ensemble is usually significantly better than that of the corresponding single learner

Ensemble learning was regarded as one of the four current directions in machine learning research [T.G. Dietterich, AIMag97]

The more accurate and the more diverse, the better [A. Krogh & J. Vedelsby, NIPS94]

(64)

How to build an ensemble?

An ensemble is built in two steps:

1)  obtain the base learners

2)  combine the individual predictions

(65)

Many ensemble methods

According to how the base learners can be generated:

•  Parallel methods

•  Bagging [L. Breiman, MLJ96]

•  Random Subspace [T. K. Ho, TPAMI98]

•  Random Forests [L. Breiman, MLJ01]

•  … …

•  Sequential methods

•  AdaBoost [Y. Freund & R. Schapire, JCSS97]

•  Arc-x4 [L. Breiman, AnnStat98]

•  LPBoost [A. Demiriz et al., MLJ06]

(66)

Bagging

[Figure: Data set → bootstrap samples Data set 1, Data set 2, … …, Data set n → Learner 1, Learner 2, … …, Learner n]

Bootstrap a set of learners: generate a set of training sets by bootstrap sampling from the original data set, and then train a learner on each generated data set

Voting for classification: output the label with the most votes

Averaging for regression: output the average of the individual learners
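A compact sketch of bagging for classification. The base learner here is a 1-nearest-neighbor rule on 1-D inputs, chosen only to keep the example short; the data, function names and seed are likewise illustrative assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def train_1nn(sample):
    """Base learner: predict the label of the closest training point."""
    def predict(x):
        nearest = min(sample, key=lambda ex: abs(ex[0] - x))
        return nearest[1]
    return predict

def bagging(data, n_learners, seed=0):
    rng = random.Random(seed)
    return [train_1nn(bootstrap_sample(data, rng)) for _ in range(n_learners)]

def vote(learners, x):
    """Output the label with the most votes."""
    return Counter(h(x) for h in learners).most_common(1)[0][0]

data = [(float(v), "neg") for v in range(-5, 0)] + \
       [(float(v), "pos") for v in range(1, 6)]
ensemble = bagging(data, n_learners=11)
print(vote(ensemble, 4.0), vote(ensemble, -4.0))
```

For regression, `vote` would be replaced by the mean of the individual predictions.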

(67)


Boosting

[Figure: Original training set → Data set 1 → Learner 1; Data set 2 → Learner 2; … …; Data set T → Learner T; the learners are merged by a weighted combination]

training instances that are wrongly predicted by Learner 1 will play more important roles in the training of Learner 2

Gödel Prize (2003)

Freund & Schapire, A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997, 55: 119-139.
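How wrongly predicted instances come to "play more important roles" can be sketched with the AdaBoost reweighting step. The formulas below are the standard AdaBoost ones (learner weight α = ½ ln((1 − ε)/ε), instance weights multiplied by e^(±α) and renormalized); the slide text itself does not spell them out, and the function name is illustrative.

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round: return (alpha, new instance weights).

    weights: current instance weights, summing to 1
    correct: for each instance, whether the current learner predicted it correctly
    Assumes the weighted error epsilon satisfies 0 < epsilon < 1.
    """
    epsilon = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - epsilon) / epsilon)
    # Shrink weights of correct instances, grow weights of wrong ones.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    z = sum(updated)
    return alpha, [w / z for w in updated]

# Four equally weighted instances; the learner gets only the last one wrong.
alpha, new_weights = adaboost_reweight([0.25] * 4, [True, True, True, False])
print(alpha, new_weights)
```

After this round the single misclassified instance carries half of the total weight, so the next learner is pushed to get it right; α is also the learner's weight in the final weighted combination.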

(68)

Simple yet effective

can be applied to almost all tasks where one wants to apply machine learning techniques

For example, in computer vision, the Viola-Jones detector: AdaBoost using Haar-like features in a cascade structure

on average, only 8 features need to be evaluated per image

(69)

The Viola-Jones detector

“the first real-time face detector”

Comparable accuracy, but 15 times faster than state-of-the-art face detectors (at that time)

Longuet-Higgins Prize (2011)

Viola & Jones, Rapid object detection using a Boosted cascade of simple features. CVPR, 2001.

(70)

Selective ensemble

Many Could be Better Than All: Given a set of trained learners, … …, ensembling many of the trained learners may be better than ensembling all of them [Z.-H. Zhou et al., AIJ02]

The basic idea of selective ensemble: use multiple solutions and perform some kind of selection

[Figure: one Problem with several individual solutions, from which some are selected]

Principles for selection:

•  the effectiveness of the individuals

•  the complementarity of the individuals

(71)

More about ensemble methods:

Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms, Boca Raton, FL: Chapman & Hall/CRC, Jun. 2012.

(72)

k-nearest neighbor classifier

•  store all the training examples

•  each training instance represents a point in n-D instance space

•  the nearest neighbors are identified based on some distance measure

•  for classification, returns the most common value among the k training instances nearest to the unseen instance

•  for regression estimation, returns the average value among the k training instances nearest to the unseen instance
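A minimal sketch of both uses, assuming Euclidean distance as the distance measure (the slides leave the measure open). The function names and the toy points are illustrative.

```python
import math
from collections import Counter

def k_nearest(train, x, k):
    """Return the k stored examples closest to x (Euclidean distance)."""
    return sorted(train, key=lambda ex: math.dist(ex[0], x))[:k]

def knn_classify(train, x, k=3):
    """Most common label among the k nearest training instances."""
    labels = [label for _, label in k_nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, x, k=3):
    """Average target value among the k nearest training instances."""
    neighbors = k_nearest(train, x, k)
    return sum(value for _, value in neighbors) / len(neighbors)

labeled = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
           ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
valued = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0), ((10.0,), 50.0)]

print(knn_classify(labeled, (0.5, 0.5)), knn_regress(valued, (1.0,)))
```

There is no training phase: all the work happens at prediction time, which is why the method must store every training example.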

(73)

Case-based reasoning

•  “cases” are complex symbolic descriptions

•  store all cases

•  compare the cases (or components of cases) with unseen cases (or components of cases)

•  combine the solutions of similar previous cases (or components of cases) to form the solution of unseen cases

previous cases:

case 1: a man who stole $100 was punished with 1 year of imprisonment

case 2: a man who spat in a public area was punished with 10 whips

unseen case:

a man stole $200 and spat in a public area

punished with 1.5 years of imprisonment plus 10 whips (the theft is similar to case 1; the spitting is identical to case 2)

(74)

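Linear regression

•  linear regression

data are modeled to fit a straight line

$$Y = \alpha + \beta X$$

where α and β can be estimated from a training set {(x_i, y_i)}, i = 1,…,s, by the method of least squares:

$$\beta = \frac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2}, \qquad \alpha = \bar{y} - \beta \bar{x}$$

•  multiple regression

a response variable Y is modeled as a linear function of multiple predictor variables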
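Estimating the line's intercept α and slope β from a training set can be sketched with the ordinary least-squares formulas (an assumption here, since the slide's estimation formula did not survive in the text). The function name and the toy data are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for the model Y = alpha + beta * X."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    beta = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
            / sum((x - x_mean) ** 2 for x in xs))
    alpha = y_mean - beta * x_mean
    return alpha, beta

# Points lying exactly on Y = 1 + 2 X.
alpha, beta = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(alpha, beta)
```

Because the toy points lie exactly on a line, the fit recovers that line's intercept and slope; with noisy data the same formulas give the line minimizing the sum of squared residuals.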

(75)

Nonlinear regression

•  polynomial regression

data are modeled to fit a polynomial function

it can be converted to a linear regression problem, e.g., for $Y = \alpha + \beta_1 X + \beta_2 X^2 + \beta_3 X^3$, let $X_1 = X$, $X_2 = X^2$, $X_3 = X^3$

there are also nonlinear regression models that cannot be converted to a linear model. For such cases, it may be possible to obtain least-squares estimates through extensive calculations on more complex formulae

(76)

Let’s move to Part 6
