• No results found

Computational Learning Theory: Agnostic Learning

N/A
N/A
Protected

Academic year: 2021

Share "Computational Learning Theory: Agnostic Learning"

Copied!
38
0
0

Loading.... (view fulltext now)

Full text

(1)

Computational Learning Theory: Agnostic Learning

Machine Learning Fall 2017

Supervised Learning: The Setup

Machine Learning Spring 2018

(2)

This lecture: Computational Learning Theory

The Theory of Generalization

Probably Approximately Correct (PAC) learning

Positive and negative learnability results

Agnostic Learning

Shattering and the VC dimension

(3)

This lecture: Computational Learning Theory

The Theory of Generalization

Probably Approximately Correct (PAC) learning

Positive and negative learnability results

Agnostic Learning

Shattering and the VC dimension

(4)

So far we have seen…

The general setting for batch learning

PAC learning and Occam’s Razor

How good will a classifier that is consistent on a training set be?

Two assumptions so far:

1. Training and test examples come from the same distribution

2. For any concept, there is some function in the hypothesis space that is consistent with the training set

(5)

So far we have seen…

The general setting for batch learning

PAC learning and Occam’s Razor

How good will a classifier that is consistent on a training set be?

Two assumptions so far:

1. Training and test examples come from the same distribution

2. For any concept, there is some function in the hypothesis space that is consistent with the training set

(6)

So far we have seen…

The general setting for batch learning

PAC learning and Occam’s Razor

How good will a classifier that is consistent on a training set be?

Two assumptions so far:

1. Training and test examples come from the same distribution

2. For any concept, there is some function in the hypothesis space that is consistent with the training set

(7)

What is agnostic learning?

So far, we have assumed that the learning algorithm could find consistent hypothesis

What if: We are trying to learn a concept f using hypotheses in H, but f Ï H

That is C is not a subset of H

This setting is called agnostic learning

Can we say something about sample complexity?

More realistic setting than before

(8)

What is agnostic learning?

So far, we have assumed that the learning algorithm could find consistent hypothesis

What if: We are trying to learn a concept f using hypotheses in H, but f Ï H

That is C is not a subset of H

This setting is called agnostic learning

Can we say something about sample complexity?

More realistic setting than before

C H

(9)

What is agnostic learning?

So far, we have assumed that the learning algorithm could find consistent hypothesis

What if: We are trying to learn a concept f using hypotheses in H, but f Ï H

That is C is not a subset of H

This setting is called agnostic learning

Can we say something about sample complexity?

More realistic setting than before

C H

(10)

Agnostic Learning

Are we guaranteed that training error will be zero?

No. There may be no consistent hypothesis in the hypothesis space!

Our goal should be to find a classifier h ∈H that has low training error

This is the fraction of training examples that are misclassified

Learn a concept f using hypotheses in H, but f Ï H

(11)

Agnostic Learning

Our goal should be to find a classifier h∈H that has low training error

What we want: A guarantee that a hypothesis with small training error will have a good accuracy on

unseen examples

Learn a concept f using hypotheses in H, but f Ï H

(12)

We will use Tail bounds for analysis

How far can a random variable get from its mean?

(13)

We will use Tail bounds for analysis

How far can a random variable get from its mean?

(14)

Bounding probabilities

Markov’s inequality: Bounds the probability that a non- negative random variable exceeds a fixed value

Chebyshev’s inequality: Bounds the probability that a random variable differs from its expected value by more than a fixed number of standard deviations

What we want: To bound average of random variables

Why? Because the training error depends on the number of errors on the training set

(15)

Hoeffding’s inequality

Upper bounds on how much the average of a set of random variables differs from its expected value

(16)

Hoeffding’s inequality

Upper bounds on how much the average of a set of random variables differs from its expected value

True mean

(Eg. For a coin toss, the probability of seeing heads)

(17)

Hoeffding’s inequality

Upper bounds on how much the average of a set of random variables differs from its expected value

True mean

(Eg. For a coin toss, the probability of seeing heads)

Empirical mean, computed over m independent trials

(18)

Hoeffding’s inequality

Upper bounds on how much the average of a set of random variables differs from its expected value

True mean

(Eg. For a coin toss, the probability of seeing heads)

Empirical mean, computed over m independent trials

What this tells us: The empirical mean will not be too far from the true mean if there are many samples.

(19)

Back to agnostic learning

Suppose we consider the true error (a.k.a generalization error) errD(h) to be the true mean

The training error over m examples errS(h) is the empirical estimate of this true error

Let’s apply Hoeffding’s inequality

(20)

Back to agnostic learning

Suppose we consider the true error (a.k.a generalization error) errD(h) to be the true mean

The training error over m examples errS(h) is the empirical estimate of this true error, why?

Let’s apply Hoeffding’s inequality

(21)

Back to agnostic learning

Suppose we consider the true error (a.k.a generalization error) errD(h) to be the true mean

The training error over m examples errS(h) is the empirical estimate of this true error

We can ask: What is the probability that the true error is more than ! away from the empirical error?

(22)

Back to agnostic learning

Suppose we consider the true error (a.k.a generalization error) errD(h) to be the true mean

The training error over m examples errS(h) is the empirical estimate of this true error

Let’s apply Hoeffding’s inequality

(23)

Back to agnostic learning

Suppose we consider the true error (a.k.a generalization error) errD(h) to be the true mean

The training error over m examples errS(h) is the empirical estimate of this true error

Let’s apply Hoeffding’s inequality

(24)

Back to agnostic learning

Suppose we consider the true error (a.k.a generalization error) errD(h) to be the true mean

The training error over m examples errS(h) is the empirical estimate of this true error

Let’s apply Hoeffding’s inequality

(25)

Agnostic learning

The probability that a single hypothesis h has a true error that is more than ! away from the training error is bounded above

The learning algorithm looks for the best one of the |H| possible hypotheses

The probability that there exists a hypothesis in |H| whose training error is ² away from the true error is bounded above

(26)

Agnostic learning

The probability that a single hypothesis h has a true error that is more than ! away from the training error is bounded above

The learning algorithm looks for the best one of the |H| possible hypotheses

The probability that there exists a hypothesis in |H| whose training error is ² away from the true error is bounded above

(27)

Agnostic learning

The probability that a single hypothesis h has a true error that is more than ! away from the training error is bounded above

The learning algorithm looks for the best one of the |H| possible hypotheses (let us assume we do not know what is the best)

The probability that there exists a hypothesis in H whose true error is ! away from the training error is bounded above

(28)

Agnostic learning

The probability that there exists a hypothesis in H whose true error is !

away from the training error is bounded above

Same game as before: We want this probability to be smaller than ±

Rearranging this gives us

(29)

Agnostic learning

The probability that there exists a hypothesis in H whose true error is !

away from the training error is bounded above

Same game as before: We want this probability to be smaller than "

Rearranging this gives us

(30)

Agnostic learning

The probability that there exists a hypothesis in H whose true error is ! away from the training error is bounded above

Same game as before: We want this probability to be smaller than "

Rearranging this gives us

(31)

Agnostic learning

The probability that there exists a hypothesis in H whose true error is !

away from the training error is bounded above

Same game as before: We want this probability to be smaller than "

Rearranging this gives us

(32)

Agnostic learning: Interpretations

1. An agnostic learner makes no commitment to whether f is in H and returns the hypothesis with least training error over at least m examples.

It can guarantee with probability 1 - ! that the true error is not off by more than

" from the training error if

(33)

Agnostic learning: Interpretations

1. An agnostic learner makes no commitment to whether f is in H and returns the hypothesis with least training error over at least m examples.

It can guarantee with probability 1 - ! that the training error is not off by more than " from the training error if

Difference between generalization and training errors: How much worse will the classifier be in the future than it is at training time?

(34)

Agnostic learning: Interpretations

1. An agnostic learner makes no commitment to whether f is in H and returns the hypothesis with least training error over at least m examples.

It can guarantee with probability 1 - ! that the training error is not off by more than " from the training error if

Difference between generalization and training errors: How much worse will the classifier be in the

future than it is at training time? Size of the hypothesis class:

Again an Occam’s razor

argument – prefer smaller sets

(35)

Agnostic learning: Interpretations

1. An agnostic learner makes no commitment to whether f is in H and returns the hypothesis with least training error over at least m examples.

It can guarantee with probability 1 - ! that the training error is not off by more than " from the training error if

2. We have a generalization bound: A bound on how much the true error will

deviate from the training error. If we have more than m examples, then with high probability

Generalization error Training error

(36)

What we have seen so far

Occam’s razor: When the hypothesis space contains the true concept

Agnostic learning: When the hypothesis space may not contain the true concept

Learnability depends on the log of the size of the hypothesis space

(37)

What we have seen so far

Occam’s razor: When the hypothesis space contains the true concept

Agnostic learning: When the hypothesis space may not contain the true concept

Learnability depends on the log of the size of the hypothesis space

(38)

What we have seen so far

Occam’s razor: When the hypothesis space contains the true concept

Agnostic learning: When the hypothesis space may not contain the true concept

Learnability depends on the log of the size of the hypothesis space

References

Related documents

Servo on Home return Target position Movement parameter number Pause Alarm reset Gateway head address Axis number. Controller ready Servo status Home return status

The fourth recommendation covers the need for empirical support before the introduction of new data exclusivity protection or an increase in such protection. Thus far, policymakers

These profiles presented similar configurations of motivation (i.e., highest on intrinsic and identified regulation, followed by introjected, and lowest on external regulation)

This report is the result of my bachelor assignment about employee motivation within PT. Sarandi Karya Nugraha in Sukabumi, Indonesia. Sarandi Karya Nugraha is

After this introduction part rest of the paper is organized as follows: In Section 2 investigation of problem with DDoS attacks is given; in Section 3 DDoS Attack

The current study examined the relationship between diagnosed mental illness of the caregiver, a presence of domestic violence, caregiver age, caregiver sex, marital status, number

Significant, positive coefficients denote that the respective sector (commercial banking, investment banking, life insurance or property-casualty insurance) experienced

We take advantage of an unprecedented change in policy in a number of Swiss occupational pension plans: The 20 percent reduction in the rate at which retirement capital is