Agnostic active learning

(1)

Contents lists available atScienceDirect

Journal of Computer and System Sciences

www.elsevier.com/locate/jcss

✩

Maria-Florina Balcan

a

, Alina Beygelzimer

b

,

∗

, John Langford

c a_{Carnegie Mellon University, Pittsburgh, PA 15213, USA}

b_{IBM T. J. Watson Research Center, Hawthorne, NY 10532, USA} c_{Yahoo! Research, New York, NY 10011, USA}

a r t i c l e i n f o a b s t r a c t

Article history: Received 1 June 2007

Received in revised form 18 June 2008 Available online 18 July 2008 Keywords:

Active learning Agnostic setting Sample complexity Linear separators

We state and analyze the ﬁrst active learning algorithm that ﬁnds an

-optimal hypothesis in any hypothesis class, when the underlying distribution has arbitrary forms of noise. The algorithm, A2 _(for _A_gnostic _A_{ctive), relies only upon the assumption that it has}

access to a stream of unlabeled examples drawni.i.d. from a ﬁxed distribution. We show that A2achieves an exponential improvement (i.e., requires onlyO(ln1)samples to ﬁnd an

-optimal classiﬁer) over the usual sample complexity of supervised learning, for several settings considered before in the realizable case. These include learning threshold classiﬁers and learning homogeneous linear separators with respect to an input distribution which is uniform over the unit sphere.

1. Introduction

Traditionally, machine learning has focused on the problem of learning a task from labeled examples only. In many applications, however, labeling is expensive while unlabeled data is usually ample. This observation motivated substantial work on properly using unlabeled data to beneﬁt learning, and there are many examples showing that unlabeled data can signiﬁcantly help [4,9,10,27,29,31–33].

There are two main frameworks for incorporating unlabeled data into the learning process. The ﬁrst framework is semi-supervised learning [17], where in addition to a set of labeled examples, the learning algorithm can also use a (usually larger) set of unlabeled examples drawn at random from the same underlying data distribution. In this setting, unlabeled data becomes useful under additional assumptions and beliefs about the learning problem. For example, transductive SVM learning [27] assumes that the target function cuts through low density regions of the space, while co-training [10] assumes that the target should be self-consistent in some way. Unlabeled data is potentially useful in this setting because it allows one to reduce the search space to a set which is a-priori reasonable with respect to the underlying distribution.

The second setting, which is the main focus of this paper, isactive learning[18,21]. Here the learning algorithm is allowed to draw random unlabeled examples from the underlying distribution and ask for the labels of any of these examples. The hope is that a good classiﬁer can be learned with signiﬁcantly fewer labels by actively directing the queries to informative examples.

As in passive supervised learning, but unlike in semi-supervised learning, the only prior belief about the learning problem here is that the target function (or a good approximation of it) belongs to a given concept class. For some concept classes such as thresholds on the line, one can achieve an exponential improvement over the usual sample complexity of supervised learning, under no additional assumptions about the learning problem [18,21]. In general, the speedups achievable in active

✩ _{A preliminary version of this paper appeared in the 23rd International Conference on Machine Learning, ICML 2006.}

*

Corresponding author.

(2)

learning depend on the match between the data distribution and the hypothesis class, and therefore on the target hypothesis in the class. The most noteworthy non-trivial example of improvement is the case of homogeneous (i.e., through the origin) linear separators, when the data is linearly separable and distributed uniformly over the unit sphere [21,23,24]. There are also simple examples where active learning does not help at all, even in the realizable case [21].

Most of the previous work on active learning has focused on the realizable case. In fact, many of the existing active learn-ing strategies arenoise seekingon natural learning problems, because the process of actively ﬁnding an optimal separation between one class and another often involves label queries for examples close to the decision boundary, and such examples often used a large conditional noise rate (e.g., due to a mismatch between the hypothesis class and the data distribution). Thus the most informative examples are also the ones that are typically the most noise-prone.

Consider an active learning algorithm which searches for the optimal threshold on an interval using binary search. This example is often used to demonstrate the potential of active learning in the noise-free case when there is a perfect threshold separating the classes [18]. Binary search needs O

(

ln1

)

labeled examples to learn a threshold with error less than

, while learning passively requires O

(

1

)

labels. A fundamental drawback of this algorithm is that a small amount of adversarial noise can force the algorithm to behave badly. Is this extreme brittleness to small amounts of noise essential? Can an exponential decrease in sample complexity be achieved? Can assumptions about the mechanism producing noise be avoided? These are the questions addressed here.

1.1. Previous work on active learning

There has been substantial work on active learning under additional assumptions. For example, the Query by Committee analysis [24] assumes realizability (i.e., existence of a perfect classifier in a known set), and a correct Bayesian prior on the set of hypotheses. Dasgupta [21] has identified sufficient conditions (which are also necessary against an adversarially chosen distribution) for active learning given only the additional realizability assumption. There are several other papers that assume only realizability [20,23]. If there exists a perfect separator amongst hypotheses, any informative querying strategy can direct the learning process without the need to worry about the distribution it induces—any inconsistent hypothesis can be eliminated based on a single query, regardless of which distribution this query comes from. In the agnostic case, however, a hypothesis that performs badly on the query distribution may well be the optimal hypothesis with respect to the input distribution. This is the main challenge in agnostic active learning that is not present in the non-agnostic case. Burnashev and Zigangirov [14] allow noise, but require a correct Bayesian prior on threshold functions. Some papers require specific noise models such as a constant noise rate everywhere [16] or Tsybakov noise conditions [5,15].

The membership-query setting [1,2,13,26] is similar to active learning considered here, except that no unlabeled data is given. Instead, the learning algorithm is allowed to query examples of its own choice. This is problematic in several applications because natural oracles, such as hired humans, have difficulty labeling synthetic examples [8]. Ulam’s Problem (quoted in [19]), where the goal is find a distinguished element in a set by asking subset membership queries, is also related. The quantity of interest is the smallest number of such queries required to find the element, given a bound on the number of queries that can be answered incorrectly. But both types of results do not apply here since an active learning strategy can only buy labels of the examples it observes. For example, a membership query algorithm can be used to quickly find a separating hyperplane in a high-dimensional space. An active learning algorithm cannot do so when the data distribution does not support queries close to the decision boundary.

1.2. Our contributions

This paper presents the firstagnosticactive learning algorithm, A2. The only necessary assumption is that the algorithm has access to a stream of examples drawni.i.d.from some fixed distribution. No additional assumptions are made about the mechanism producing noise (e.g., class/target misfit, fundamental randomization, adversarial situations). The main contribu-tion of this paper is to prove the feasibility of agnostic active learning.

Two comments are in order:

(1) We define the noise rate of a hypothesis class H with respect to a fixed distribution P on labeled examples as the minimum error rate of any hypothesis in H on P (see section 2 for a formal definition). Note that for the special case of so-calledlabel noise(where for each example, the probability that it is mislabeled with respect to the best hypothesis is a fixed constant) these definitions coincide.

(2) We regard unlabeled data as being free so as to focus exclusively on the question of whether or not agnostic active learning is possible at all. Substantial follow-up work to this paper has successfully optimized unlabeled data usage to be on the same order as passive learning [22].

A2 is provably correct (for any 0

<

1

/

2 and 0

< δ <

1

/

2, it outputs an

-optimal hypothesis with probability at least 1

−

δ

) and it is never harmful (it never requires signiﬁcantly more labeled examples than batch learning). A2_provides exponential sample complexity reductions in several settings previously analyzed without noise or with known noise con-ditions. This includes learning threshold functions with small noise with respect to

and hypothesis classes consisting of

(3)

homogeneous (through the origin) linear separators with the data distributed uniformly over the unit sphere in

R

d. The last example has been the most encouraging theoretical result so far in the realizable case [23].

The A2 _{analysis achieves an almost contradictory property: for some sets of classiﬁers, an}

_{-optimal classiﬁer can be} output with fewer labeled examples than are needed to estimate the error rate of the chosen classiﬁer with precision

from random examples only.

1.3. Lower bounds

It is important to keep in mind that the speedups achievable with active learning depend on the match between the distribution over example-label pairs and the hypothesis class, and therefore on the target hypothesis in the class. Thus one should expect the results to be distribution-dependent. There are simple examples where active learning does not help at all in the model analyzed in this paper, even if there is no noise [21]. These lower bounds essentially result from an “aliasing” effect and they are unavoidable in the setting we analyze in this paper (where we bound the number of queries an algorithm makes before itcan proveit has found a good function).1

In the noisy situation, the target function itself can be very simple (e.g., a threshold function), but if the error rate is very close to 1

/

2 in a sizeable interval near the threshold, then no active learning procedure can signiﬁcantly outperform passive learning. In particular, in the pure agnostic setting onecannothope to achieve speedups when the noise rate

η

is large, due to a lower bound of

Ω(

η2

2

)

on the sample complexity of any active learner [28]. However, under specific noise models (such as a constant noise rate everywhere [16] or Tsybakov noise conditions [5,15]) and for specific classes, one can still show significant improvement over supervised learning.

1.4. Structure of this paper

Preliminaries and notation are covered in Section 2, then A2 _{is presented in Section 3. Section 3.1 proves that} _A2 _is correct and Section 3.2 proves it is never harmful (i.e., it never requires signiﬁcantly more samples than batch learning). Threshold functions such as ft

(

x

)

=

I

(

x

>

t

)

and homogeneous linear separators under the uniform distribution over the unit sphere are analyzed in Section 4. Here and in the rest of the paper, I

(

·

)

is the indicator function which is 1 if its argument is true, and 0 otherwise. Conclusions, a discussion of subsequent work, and open questions are covered in Section 5. 2. Preliminaries

We consider a binary agnostic learning problem speciﬁed as follows. Let X be an instance space and Y

= {−

1

,

1

}

be the set of possible labels. Let H be the hypothesis class, a set of functions mapping from X to Y. We assume there is a distribution D over instances in X, and that the instances are labeled by a possibly randomized oracle O. The ora-cle O can be thought of as taking an unlabeled example x in, choosing a biased coin based on x, then ﬂipping it ﬁnd the label

−

1 or 1. The error rate of a hypothesis h with respect to a distribution P over X

×

Y is deﬁned as errP

(

h

)

=

Prx,y∼P

[

h

(

x

)

=

y

]

. The error rate errP

(

h

)

is not generally known since P is unknown, however the empirical version

errP

(

h

)

=

Prx,y∼S

[

h

(

x

)

=

y

] =

1_S

x,y∈SI

(

h

(

x

)

=

y

)

is computable based upon an observed sample set S. Let

η

=

min

h∈H

(

errD,O

(

h

))

denote the minimum error rate of any hypothesis in H with respect to the distribution

(

D

,

O

)

induced by D and the labeling oracle O. The goal is to ﬁnd an

-optimal hypothesis, i.e. a hypothesish

∈

H with errD,O

(

h

)

within

of

η

, where

is some target error.

The algorithm A2 relies on a subroutine, which computes a lower bound LB

(

S

,

h

, δ)

and an upper bound UB

(

S

,

h

, δ)

on the true error rate errP

(

h

)

ofhby using a sample S of examples drawni.i.d.from P. Each of these bounds must hold for allhsimultaneously with probability at least 1

−

δ

. The subroutine is formally deﬁned below.

Deﬁnition 1.A subroutine for computing LB

(

S

,

h

, δ)

and UB

(

S

,

h

, δ)

is said to belegalif for all distributions P over X

×

Y, for all 0

< δ <

1

/

2 andm

∈

N

,

LB

(

S

,

h

, δ)

errP

(

h

)

UB

(

S

,

h

, δ)

holds for allh

∈

Hsimultaneously, with probability 1

−

δ

over the draw of Saccording to Pm_.

Classic examples of such subroutines are the (distribution independent) VC bound [34] and the Occam Razor bound [11], or the newer data dependent generalization bounds such as those based on Rademacher Complexities [12]. For concreteness, a VC bound subroutine is stated in Appendix A.

1 _{In recent work, Balcan et al. [6,7] have shown that in an asymptotic model for Active Learning where one bounds the number of queries the algorithm} makes before it ﬁnds a good function (i.e. one of arbitrarily small error rate), but not the number of queries before it can prove or it knows it has found a good function, one can obtain signiﬁcantly better bounds on the number of label queries required to learn.

(4)

seti_←1,Di←D,Hi←H,Hi−1←H,Si−1← ∅, andk←1. (1)while Disagree_D₍H_i₋₁₎₍ min

h∈Hi−1 UB(Si−1,h, δk)− min h∈Hi−1 LB(Si−1,h, δk)) >

set Si← ∅,Hi←Hi,k←k+1 (2)while Disagree_D₍_H i) 1 2DisagreeD(Hi) ifDisagree_D₍_H_i_)(min h∈H_iUB(Si,h, δk)−hmin∈H_iLB(Si,h, δk))

(∗)returnh=argminh∈H_iUB(Si,h, δk).

else S_i=rejection sample 2|Si| +1 samplesxfromDsatisfying ∃h1,h2∈Hi:h1(x)=h2(x). Si←Si∪ {(x,O(x)):x∈Si},k←k+1 (∗∗)H_i= {h∈Hi: LB(Si,h, δk, ) min h∈Hi UB(Si,h, δk)},k←k+1 end if end while Hi+1←Hi,Di+1←Direstricted to{x:∃h1,h2∈H_i:h1(x)=h2(x)} i_←i₊1 end while

returnh=argminh∈Hi−1UB(Si−1,h, δk).

Algorithm 1.A2 _{(allowed error rate}_{, sampling oracle for}_D_{, labeling oracle}_O_{, hypothesis class} H).

3. TheA2agnostic active learner

At a high level, A2 _{can be viewed as a robust version of the selective sampling algorithm of [18]. Selective sampling is} a sequential process that keeps track of two spaces—the current version space Hi, deﬁned as the set of hypotheses in H consistent with all labels revealed so far, and the currentregion of uncertainty Ri, deﬁned as the set of all x

∈

X, for which there exists a pair of hypotheses in Hi that disagrees onx. In roundi, the algorithm picks a random unlabeled example fromRiand queries it, eliminating all hypotheses in Hiinconsistent with the received label. The algorithm then eliminates thosex

∈

Rion which all surviving hypotheses agree, and recurses. This process fundamentally relies on the assumption that there exists a consistent hypothesis in H. In the agnostic case, a hypothesis cannot be eliminated based on its disagreement with a single example. Any algorithm must be more conservative without risking eliminating the best hypotheses in the class.

A formal speciﬁcation of A2 _{is given in Algorithm 1. Let} _H

i be the set of hypotheses still under consideration by A2 in roundi. If all hypotheses inHi agree on some region of the instance space, this region can be safely eliminated. To help us keep track of progress in decreasing the region of uncertainty, deﬁneDisagree_D

(

H_i

)

as the probability that there exists a

pair of hypotheses inHithat disagrees on a random example drawn from D:

Disagree_D

(

_H_i

)

=

_Pr_x_∼_D

∃

_h₁

,

_h₂

∈

_H_i_: _h₁

(

_x

)

=

_h₂

(

_x

)

.

HenceDisagree_D

(

H_i

)

is the volume of the current region of uncertainty with respect toD.

It is important to understand that the ability to sample from the unlabeled data distribution D implies that ability to compute Disagree_D

(

H_i

)

. To see this, note that: Disagree_D

(

H_i

)

=

E_x_∼_DI

(

∃

h₁

,

h₂

∈

H_i: h₁

(

x

)

=

h₂

(

x

))

is an expectation over unlabeled points drawn from D. Consequently, Chernoff bounds on the empirical expectation of a

{

0

,

1

}

random variable imply thatDisagree_D

(

H_i

)

can be estimated to any desired precision with probability 1 using an unlabeled dataset with size

limiting to inﬁnity.

Let Di be the distribution D restricted to the current region of uncertainty. Formally, Di

=

D

(

x

| ∃

h1

,

h2

∈

Hi: h1

(

x

)

=

h2

(

x

))

. In round i, A2 samples a fresh set of examples S from Di

,

O, and uses it to compute upper and lower bounds for all hypotheses in Hi. It then eliminates all hypotheses whose lower bound is greater than the minimum upper bound. Fig. 3.1 shows the algorithm in action for the case when the data lie in the

[

0

,

1

]

interval on the real line, and His the set of thresholding functions. The horizontal axis denotes both the instance space and the hypothesis space, superim-posed. The vertical axis shows the error rates. Round i completes when S is large enough to eliminate at least half of the current region of uncertainty.

(5)

Fig. 3.1.A2_{in action: sampling, bounding, eliminating.}

Since A2 does not label examples on which the surviving hypotheses agree, an optimal hypothesis in Hi with respect to Di remains an optimal hypothesis in Hi+1 with respect to Di+1. Since each roundi cutsDisagree_D

(

H_i

)

down by half,

the number of rounds is bounded by log1. Sections 4 gives examples of distributions and hypothesis classes for which A2 requires only a small number of labeled examples to transition between rounds, yielding an exponential improvement in sample complexity.

When evaluating bounds during the course of Algorithm 1, A2 _{uses a schedule of}

δ

_{according to the following rule:} the kth bound evaluation has conﬁdence

δ

k

=

k(kδ+1), for k

1. In Algorithm 1, k keeps track of the number of bound computations andi of the number of rounds.

Note.It is important to note that A2_{does not need to know}

η

_{in advance. Similarly, it does not need to know}_D_{in advance.} 3.1. Correctness

Theorem 3.1(Correctness). For all H , for all

(

D

,

O

)

, for all valid subroutines for computing UB and LB, for all0

<

1

/

2and 0

< δ <

1

/

2, with probability1

−

δ

, A2_{returns an}

_{-optimal hypothesis or does not terminate.}

Note.For most “reasonable” subroutines for computing UB and LB, A2 _{terminates with probability at least 1}

−

δ

_{. For more} discussion and a proof of this fact see Section 3.2.

Proof. The ﬁrst claim is that all bound evaluations are valid simultaneously with probability at least 1

−

δ

, and the second is that the procedure produces an

-optimal hypothesis upon termination.

To prove the ﬁrst claim, notice that the samples on which each bound is evaluated are drawni.i.d.from some distribution over X

×

Y. This can be veriﬁed by noting that the distributionDi used in roundi is precisely that given by drawingxfrom

(6)

the underlying distribution D conditioned on the disagreement

∃

h1

,

h2

∈

Hi: h1

(

x

)

=

h2

(

x

)

, and then labeling according to the oracleO.

Thekth bound evaluation fails with probability at most _k₍_kδ₊₁₎. By the union bound, the probability that any bound fails is less then the sum of the probabilities of individual bound failures. This sum is bounded by

∞_k₌₁_k₍_kδ₊₁₎

=

δ

.

To prove the second claim, notice ﬁrst that since every bound evaluation is correct, step

(

∗∗

)

never eliminates a hypoth-esis that has minimum error rate with respect

(

D

,

O

)

.

Let us now introduce the following notation. For a hypothesish

∈

H andG

⊆

Hdeﬁne: eD,G,O

(

h

)

=

Prx,y∼D,O|∃h1,h2∈G:h1(x)=h2(x)

h

(

x

)

=

y

,

fD,G,O

(

h

)

=

Prx,y∼D,O|∀h1,h2∈G:h1(x)=h2(x)

h

(

x

)

=

y

.

Notice that eD,G,O

(

h

)

is in fact errDG,O

(

h

)

, where DG is D conditioned on the disagreement

∃

h1

,

h2

∈

G: h1

(

x

)

=

h2

(

x

)

. Moreover, given anyG

⊆

H, the error rate of every hypothesishdecomposes into two parts as follows:

errD,O

(

h

)

=

eD,G,O

(

h

)

·

Disagree_D

(

G

)

+

f_D_,_G_,_O

(

h

)

·

1

−

Disagree_D

(

G

)

=

errDG,O

(

h

)

·

DisagreeD

(

G

)

+

fD,G,O

(

h

)

·

1

−

Disagree_D

(

G

)

.

Notice that the only term that varies with h

∈

G in the above decomposition, is eD,G,O

(

h

)

. Consequently, ﬁnding an

-optimal hypothesis requires only bounding errDG,O

(

h

)

·

Disagree_D

(

G

)

to precision

. But this is exactly what the negation

of the main while-loop guard does, and this is also the condition used in the ﬁrst step of the second while loop of the algorithm. In other words, upon termination A2 _satisﬁes

Disagree_D

(

_H_i

)

min h∈Hi UB

(

Si

,

h

, δ

k

)

−

min h∈Hi LB

(

Si

,

h

, δ

k

)

,

which proves the desired result.

2

3.2. Fall-back analysis

This section shows thatA2_{is never much worse than a standard batch, bound-based algorithm in terms of the number of} samples required in order to learn. (A standard example of a bound-based learning algorithm is Empirical Risk Minimization (ERM) [35].)

The sample complexitym

(

, δ,

H

)

required by a batch algorithm that uses a subroutine for computing LB

(

S

,

h

, δ)

and UB

(

S

,

h

, δ)

is deﬁned as the minimum number of samplesm such that for all S

∈

Xm_,

|

_UB

(

_S

,

_h

, δ)

−

_LB

(

_S

,

_h

, δ)

|

_for allh

∈

H. For concreteness, this section uses the following bound onm

(

, δ,

H

)

stated as Theorem A.1 in Appendix A:

m

(

, δ,

H

)

=

64

2

2VHln

12

+

ln

4

δ

.

Here VH is the VC-dimension of H. Assume that m

(

2

, δ,

H

)

m(,δ,2 H), and also that the functionm is monotonically increasing in 1

/δ

. These conditions are satisﬁed by many subroutines for computing UB and LB, including those based on the VC-bound [34] and the Occam’s Razor bound [11].

Theorem 3.2.For all H , for all

(

D

,

O

)

, for all UB and LB satisfying the assumption above, for all0

<

1

/

2and0

< δ <

1

/

2, the algorithm A2_{makes at most}₂_m

(

, δ

,

_H

)

_{calls to the oracle O , where}

δ

=

δ

N(,δ,H)(N(,δ,H)+1)and N

(

, δ,

H

)

satisﬁes N

(

, δ,

H

)

ln1lnm

(

,

_N₍_,δ,_H₎₍_Nδ₍_,δ,_H₎₊₁₎

,

H

)

. Here m

(

, δ,

H

)

is the sample complexity of UB and LB.

Proof. Let

δ

k

=

_k₍_kδ₊₁₎ be the conﬁdence parameter used in the kth application of the subroutine for computing UB and LB. The proof works by ﬁnding an upper bound N

(

, δ,

H

)

on the number of bound evaluations throughout the life of the algorithm. This implies that the conﬁdence parameter

δ

kis always greater than

δ

=

N(,δ,H)(Nδ(,δ,H)+1).

Recall that Diis the distribution overxused on theith iteration of the ﬁrst while loop.

Consider i

=

1. If condition 2 of Algorithm A2 _{is repeatedly satisﬁed then after labeling}_m

(

, δ

,

_H

)

_{examples from} _D 1 for all hypothesesh

∈

H1,

UB

(

S1

,

h

, δ

)

−

LB

(

S1

,

h

, δ

)

simultaneously. Note that in these conditions A2safely halts. Notice also that the number of bound evaluations during this process is at most log2m

(

, δ

,

H

)

.

On the other hand, if loop

(

2

)

ever completes and i increases, then it is enough, if you ﬁnish when i

=

2, to have uniformly for allh

∈

H2,

(7)

(This follows from the exit conditions in the outer while-loop and the ‘if’ in Step 2 of A2.) Uniformly bounding the gap between upper and lower bounds over all hypothesesh

∈

H2to within 2

, requiresm

(

2

, δ

,

H

)

m(,δ

_,_H₎

2 labeled examples fromD2and the number of bound evaluations in round i

=

2 is at most log2m

(

, δ

,

H

)

.

In general, in roundiit is enough to have uniformly for allh

∈

Hi,

UB

(

Si

,

h

, δ

)

−

LB

(

Si

,

h

, δ

)

2i−1

,

and which requiresm

(

2i−1

, δ

,

H

)

m(₂,δi−1,H) labeled examples from Di. Also the number of bound evaluations in round i is at most log₂m

(

, δ

,

H

)

.

Since the number of rounds is bounded by log21, it follows that the maximum number of bound evaluations throughout the life of the algorithm is at most log₂1log₂m

(

, δ

,

H

)

. This implies that in order to determine an upper boundN

(

, δ,

H

)

only a solution to the inequality:

N

(

, δ,

H

)

log2 1

log2m

,

δ

N

(

, δ,

H

)(

N

(

, δ,

H

)

+

1

)

,

H

is required.

Finally, adding up the number of calls to the label oracle O in all rounds yields at most 2m

(

, δ

,

H

)

over the life of the algorithm.

2

Let VH denote the VC-dimension of H, and letm

(

, δ,

H

)

be the number of examples required by the ERM algorithm. As stated in Theorem A.1 in Appendix A, a classic bound onm

(

, δ,

H

)

ism

(

, δ,

H

)

=

642

(

2VHln

(

12

)

+

ln

(

4_δ

))

. Using Theo-rem 3.2, the following corollary holds.

Corollary 3.3.For all hypothesis classes H of VC-dimension VH, for all distributions

(

D

,

O

)

over X

×

Y , for all0

<

1

/

2and 0

< δ <

1

/

2, the algorithm A2requires at mostO

˜

(

12

(

VHln1

+

ln1_δ

))

labeled examples the oracle O .2

Proof. The form ofm

(

, δ,

H

)

and Theorem 3.2 implies an upper bound on N

=

N

(

, δ,

H

)

. It is enough to ﬁnd the small-est N satisfying N

ln

1

ln

64

2

2VHln

12

+

ln

4N2

δ

.

Using the inequality lna

ab

−

lnb

−

1 for alla

,

b

>

0 and some simple algebraic manipulations, the desired upper bound on N

(

, δ,

H

)

holds. The result then follows from Theorem 3.2.

2

4. Active learning speedups

This section gives examples of exponential sample complexity improvements achieved by A2. 4.1. Learning threshold functions

Linear threshold functions are the simplest and easiest to analyze class. It turns out that even for this class, exponential reductions in sample complexity are not achievable when the noise rate

η

is large [28]. We prove the following three results:

(1) An exponential improvement in sample complexity when the noise rate is small (Theorem 4.1).

(2) A slower improvement when the noise rate is large (Theorem 4.2). In the extreme case where the noise rate is 1

/

2, there is no improvement.

(3) An exponential improvement when the noise rate is large but due to constant label noise (Theorem 4.3). This shows that for some forms of high noise exponential improvement remains possible.

All results in this section assume that subroutines LB and UB in A2 are based on the VC bound.

Theorem 4.1.Let H be the set of thresholds on an interval. For all distributions

(

D

,

O

)

where D is a continuous probability distribution function, for any

<

1₂and₁₆

η

, the algorithm A2_makes

O

ln

1

ln

_ln

(

1_δ

)

δ

calls to the oracle O on examples drawni.i.d.from D, with probability1

−

δ

.

(8)

Proof. Consider roundi

1 of the algorithm. Forh1

,

h2

∈

Hi, letdi

(

h1

,

h2

)

be the probability thath1 andh2 predict differ-ently on a random example drawn according to the distribution Di, i.e.,di

(

h1

,

h2

)

=

Prx∼Di

[

h1

(

x

)

=

h2

(

x

)

]

.

Let h∗ be any minimum error rate hypothesis in H. Note that for any hypothesis h

∈

Hi, we have errDi,O

(

h

)

di

(

h

,

h∗

)

−

errDi,O

(

h∗

)

and errDi,O

(

h∗

)

η

/

Zi, where Zi

=

Prx∼D

[

x

∈ [

loweri

,

upperi]]is a shorthand forDisagreeD

(

Hi

)

and

[

loweri

,

upperi]denotes the support of Di. Thus errDi,O

(

h∗

)

di

(

h

,

h∗

)

−

η

/

Zi.

We will show that at least a 1₂-fraction (measured with respect to Di) of thresholds in Hi satisfy di

(

h

,

h∗

)

1₄, and these thresholds are located at the ends of the interval

[

loweri

,

upperi]. Assume ﬁrst that both di

(

h∗

,

loweri

)

1₄ anddi

(

h∗

,

upperi

)

14, then let li and ui be the hypotheses to the left and to the right of h∗, respectively, that satisfy di

(

h∗

,

li

)

=

1₄ anddi

(

h∗

,

ui

)

=

1₄. Allh

∈ [

loweri

,

li] ∪ [ui

,

upperi]satisfydi

(

h∗

,

h

)

1₄ and moreover

Prx∼Di

x

∈ [

loweri

,

li] ∪ [ui

,

upperi]

1 2

.

Now suppose that di

(

h∗

,

loweri

)

1₄. Let ui be the hypothesis to the right of h∗ with di

(

h∗

,

upperi

)

=

12. Then all h

∈

[

ui

,

upperi]satisfydi

(

h∗

,

h

)

1₄ and moreoverPrx∼Di

[

x

∈ [

ui

,

upperi]]

1

2. A similar argument holds fordi

(

h∗

,

upperi

)

14. Using the VC bound, with probability 1

−

δ

, if

|

Si| =O

_ln1 δ

(

1₈

−

_Zη i

)

2

,

then for all hypothesesh

∈

Hi simultaneously,

|

UB

(

Si

,

h

, δ)

−

LB

(

Si

,

h

, δ)

|

1₈

−

_Zη_i holds. Note that

η

/

Zi is always upper bounded by ₁₆1.

Consider a hypothesis h

∈

Hi with di

(

h

,

h∗

)

14. For any such h, errDi,O

(

h

)

di

(

h

,

h∗

)

−

η

/

Zi

1 4

−

η

Zi, and so

LB

(

Si

,

h

, δ)

₄1

−

_Zη_i

−

(

1₈

−

_Zη_i

)

=

1₈. On the other hand, errDi,O

(

h∗

)

η Zi, and so UB

(

Si

,

h ∗

, δ)

η Zi

+

1 8

−

η Zi

=

1 8. Thus A 2 eliminates allh

∈

Hiwithdi

(

h

,

h∗

)

1₄. But that means Disagree_D

(

H

i

)

1

2DisagreeD

(

Hi

)

, thus terminating roundi. 3 Each exit fromwhileloop

(

2

)

decreasesDisagree_D

(

_H_i

)

_{by at least a factor of 2, implying that the number of executions}

is bounded by log1. The algorithm makes O

(

ln

(

_δ1

)

ln

(

1

))

calls to the oracle, where

δ

=

N(,δ,H)(Nδ(,δ,H)+1) andN

(

, δ,

H

)

is an upper bound on the number of bound evaluations throughout the life of the algorithm.

The number of bound evaluations required in round i is O

(

ln_δ1

)

, which implies that N

(

, δ,

H

)

should satisfy

cln

(

N(,δ,H)(N_δ(,δ,H)+1)

)

ln

(

1

)

N

(

, δ,

H

)

, for some constantc. Solving this inequality completes the proof.

2

Theorem 4.2 below asymptotically matches a lower bound of Kääriäinen [28]. Recall that A2 _{does not need to know}

η

in advance.

Theorem 4.2.Let H be the set of thresholds on an interval. Suppose that

<

1₂ and

η

>

16

. For all D, with probability1

−

δ

, the algorithm A2_{requires at most}_O

˜

(

η2ln1δ

2

)

labeled samples.

Proof. The proof is similar to the previous proof. Theorem 4.1 implies that loop (2) completes

Θ(

log_η1

)

times. At this point, the minimum error rate of the remaining hypotheses conditioned on disagreement becomes suﬃcient so that the algorithm may only halt via the return step

(

∗

)

. In this case,Disagree_D

(

H

)

=

Θ(

η

)

implying that the number of samples required is

˜

O

(

η

2_ln1 δ 2

)

.

2

The ﬁnal theorem is for theconstant label noisecase wherePry∼O|x[h∗

(

x

)

=

y

] =

η

for allx

∈

X. The theorem is similar to earlier work [14], except that we achieve these improvements with a general purpose active learning algorithm that does not use any prior over the hypothesis space or knowledge of the noise rate, and is applicable to arbitrary hypothesis spaces. Theorem 4.3.Let H be the set of thresholds on an interval. For all unlabeled data distributions D, for all labeled data distributions O , for any constant label noise

η

<

1

/

2and

<

1₂, the algorithm A2makes O

(

₍₁₋1₂

η)2ln

(

1

)

ln

(

ln(1 δ)

δ

))

calls to the oracle O on examples drawni.i.d.from D, with probability1

−

δ

.

The proof is essentially the same as for Theorem 4.1, except that the constant label noise condition implies that the amount of noise in the remaining actively labeled subset stays bounded through the recursions.

Proof. Consider roundi

1. For h1

,

h2

∈

Hi, letdi

(

h1

,

h2

)

=

Prx∼Di

[

h1

(

x

)

=

h2

(

x

)

]

. Note that for any hypothesish

∈

Hi, we

have errDi,O

(

h

)

=

di

(

h

,

h∗

)(

1

−

2

η

)

+

η

and errDi,O

(

h∗

)

=

η

, whereh∗ is a minimum error rate threshold.

3 _{The assumption in the theorem statement can be weakened to}_η_<

(9)

As in the proof of Theorem 4.1, at least a 1₂-fraction (measured with respect to Di) of thresholds in Hi satisfy di

(

h

,

h∗

)

1₄, and these thresholds are located at the ends of the support

[

loweri

,

upperi]ofDi.

The VC bound implies that for any

δ

>

0 with probability 1

−

δ

, if

|

Si| =O

(

₍₁ln(1/δ₋₂_η₎)2

)

, then for all hypotheses h

∈

Hi simultaneously,

|

UB

(

Si

,

h

, δ)

−

LB

(

Si

,

h

, δ)

|

<

1−82η.

Consider a hypothesis h

∈

Hi withdi

(

h

,

h∗

)

1₄. For any suchh, errDi,O

(

h

)

1−2η 4

+

η

=

1 4

+

η 2, and so LB

(

Si

,

h

, δ) >

1 4

+

η 2

−

1 8

(

1

−

2

η

)

=

1 8

+

3η

4. On the other hand, errDi,O

(

h∗

)

=

η

, and so UB

(

Si

,

h∗

, δ) <

η

+

(

1 8

−

η 4

)

=

1 8

+

3η 4. Thus A2 eliminates allh

∈

Hi withdi

(

h

,

h∗

)

1₄. But this means that Disagree_D

(

_H

i

)

12DisagreeD

(

Hi

)

, thus terminating roundi. Finally notice that A2 _makes _O

(

_ln

(

1

δ

)

ln

(

1

))

δ

=

δ

N(,δ,H)(N(,δ,H)+1) and N

(

, δ,

H

)

is an upper bound on the number of bound evaluations throughout the life of the algorithm. The number of bound evaluations required in roundiis O

(

ln

(

1

/δ

))

, which implies that the number of bound evaluations throughout the life of the algorithm N

(

, δ,

H

)

should satisfy cln

(

N(,δ,H)(N_δ(,δ,H)+1)

)

ln

(

1

)

N

(

, δ,

H

)

, for some constantc. Solving this inequality, completes the proof.

2

4.2. Linear separators under the uniform distribution

A commonly analyzed case for which active learning is known to give exponential savings in the number of labeled examples is when the data is drawn uniformly from the unit sphere in

R

d_{, and the labels are consistent with a linear} separator going through the origin. Note that even in this seemingly simple scenario, there exists an

Ω(

1

(

d

+

log1_δ

))

lower bound on the PAC passive supervised learning sample complexity [30]. We will show that A2 _{provides exponential savings} in this case even in the presence of arbitrary forms of noise.

Let X

= {

x

∈

R

d_:

_x

=

₁

}

_{, the unit sphere in}

R

d_{. Assume that} _D _{is uniform over} _X_{, and let} _H _{be the class of linear} separators through the origin. Anyh

∈

His a homogeneous hyperplane represented by a unit vector w

∈

X with the clas-siﬁcation ruleh

(

x

)

=

sign

(

w

·

x

)

; for concreteness, leth

(

0

)

=

1. The distance between two hypothesesu and v in H with respect to a distribution D (i.e., the probability that they predict differently on a random example drawn from D) is given bydD

(

u

,

v

)

=

arccos(_πu·v). Finally, let

θ (

u

,

v

)

=

arccos

(

u

·

v

)

. ThusdD

(

u

,

v

)

=

θ (u_π,v).

Theorem 4.4.Let X , H , and D be as deﬁned above, and let LB and UB be the VC bound. Then for any0

<

1₂,0

<

η

<

16√d, and

δ >

0, with probability1

−

δ

, A2_requires O

d

dlnd

+

ln1

δ

ln1

calls to the labeling oracle, where

δ

=

_N₍_,δ,_H₎₍_Nδ₍_,δ,_H₎₊₁₎and N

(

, δ,

H

)

=

O

ln1

d2lnd

+

dlndln 1

δ

.

Proof. Let w∗

∈

H be a hypothesis with the minimum error rate

η

. Denote the region of uncertainty in round i by Ri. Thus Prx∼D

[

x

∈

Ri] =Disagree_D

(

H_i

)

. Fig. 4.1 schematically shows the region of uncertainty after the ﬁrst iteration of the

algorithm.

Consider round i of A2_{. We prove that the round completes with high probability if a certain threshold on the number} of labeled examples is reached. The round may complete with a smaller number of examples, but this is ﬁne because the metric of progress Disagree_D

(

H_i

)

must halve in order for the round to complete.

Theorem A.1 says that it suﬃces to query the oracle on a set S of O

(

d2_ln_d

+

_d_ln1

δ

)

examples fromith distribution Di to guarantee, with probability 1

−

δ

, that for allw

∈

Hi,

errDi,O

(

w

)

−

err

Di,O

(

w

)

<

1 2

1 8

√

d

−

η

ri

,

whereriis a shorthand forDisagree_D

(

H_i

)

. (By assumption,

η

<

16√d and the loop guard guarantees thatDisagreeD

(

Hi

)

. Thus the precision above is at least 1

32√d.)

4 _{This implies that UB}

(

_S

,

_w

, δ

)

−

_err_D

i,O

(

w

) <

1 8√d

−

η ri

,

and errDi,O

(

w

)

−

LB

(

S

,

w

, δ

) <

1 8√d

−

η

ri

.

Consider any w

∈

Hi with dDi

(

w

,

w

∗

)

1

4√d. For any such w, errDi,O

(

w

)

dDi

(

w

,

w∗

)

−

errDi,O

(

w∗

)

1 4√d

−

η ri, since errDi,O

(

w ∗

)

η ri. Therefore LB

(

S

,

w

, δ

) >

1 4

√

d

−

η

ri

−

1 8

√

d

+

η

ri

=

1 8

√

d

.

4 _{The assumption in the theorem statement can be weakened to}_η_<

(10)

Fig. 4.1.The region of uncertainty after the ﬁrst iteration (schematic).

On the other hand, we have UB

(

S

,

w∗

, δ

) <

η_r

i

+

1 8√d

−

η ri

=

1 8√d, so A 2_eliminates _w _{in step}

(

∗∗

)

_. Thus roundieliminates all hypothesesw

∈

HiwithdDi

(

w

,

w∗

)

1

4√d. Since all hypotheses inHiagree on everyx

∈

/

Ri, dDi

(

w

,

w∗

)

=

1 ri dD

(

w

,

w∗

)

=

θ (

w

,

w∗

)

π

ri

.

Thus round i eliminates all hypotheses w

∈

Hi with

θ (

w

,

w∗

)

₄πr√i_d. But since 2

θ/

π

sin

θ

, for

θ

∈

(

0

,

π₂

]

, it certainly eliminates all wwith sin

θ (

w

,

w∗

)

ri

2√d.

Consider anyx

∈

Ri+1 and the value

|

w∗

·

x

| =

cos

θ (

w∗

,

x

)

. There must exist a hypothesis w

∈

Hi+1 that disagrees with w∗ onx; otherwisexwould not be inRi+1. But then cos

θ (

w∗

,

x

)

cos

(

π₂

−

θ (

w

,

w∗

))

=

sin

θ (

w

,

w∗

) <

ri

2√d, where the last inequality is due to the fact thatA2 _{eliminates all}_w_{with sin}

θ (

_w

,

_w∗

)

ri

2√d. Thus anyx

∈

Ri+1must satisfy

|

w

∗

·

_x

|

<

ri

2√d. Using the fact thatPr

[

A

|

B

] =

Pr_Pr[_[A B_B_]]

Pr_Pr[_[A_B]_] for any AandB,

Prx∼Di

[

x

∈

Ri+1

]

Prx∼Di

|

w

·

x

|

ri 2

√

d

Prx∼D

[|

w

·

x

|

ri 2√d

]

Prx∼D

[

x

∈

Ri]

ri 2ri

=

1 2

,

where the third inequality follows from Lemma A.2. ThusDisagree_D

(

H_i₊₁

)

1

2DisagreeD

(

Hi

)

, as desired.

In order to ﬁnish the argument, it suﬃces to notice that since every round cuts Disagree_D

(

H_i

)

at least in half, the

total number of rounds is bounded by log1. Notice also that A2 makes O

(

d2lnd

+

dln_δ1

)

ln

(

1

)

δ

=

_N₍_,δ,_H₎₍_Nδ₍_,δ,_H₎₊₁₎ andN

(

, δ,

H

)

is an upper bound on the number of bound evaluations throughout the life of the algorithm. The number of bound evaluations required in round i is O

(

d2_ln_d

+

_d_ln1

δ

)

. This implies that the number of bound evaluations throughout the life of the algorithmN

(

, δ,

H

)

should satisfyc

(

d2_ln_d

+

_d_ln

(

N(,δ,H)(N(,δ,H)+1)

δ

))

ln

(

1

)

N

(

, δ,

H

)

for some constantc. Solving this inequality, completes the proof.

2

Note.For comparison, the query complexity of the Perceptron-based active learner of [23] is O

(

dln1_δ

(

lnd_δ

+

ln ln1

))

, for the sameH,X, andD, but only for the realizable case when

η

=

0. Similar bounds are obtained in [5] both in the realizable case and for a specific form of noise related to the Tsybakov small noise condition. The cleanest and simplest argument that exponential improvement is in principle possible in the realizable case for the sameH, X, andD appears in [21]. Our work provides the first justification of why one can hope to achieve similarly strong guarantees in the much harder agnostic case, when the noise rate is sufficiently small with respect to the desired error.

5. Conclusions, discussion and open questions

This paper presents A2_{, the ﬁrst active learning algorithm that ﬁnds an}

_{-optimal hypothesis in any hypothesis class,} when the distribution has arbitrary forms of noise. The algorithm relies only upon the assumption that the samples are drawni.i.d.from a ﬁxed (unknown) distribution, and it does not need to know the error rate of the best classiﬁer in the class in advance. We analyze A2 _{for several settings considered before in the realizable case, showing that} _A2 _achieves an exponential improvement over the usual sample complexity of supervised learning in these settings. We also provide a guarantee that A2_{never requires substantially more labeled examples than passive learning.}

(11)

5.1. Subsequent work

Following the initial publication of A2_{, Hanneke has further analyzed the} _A2 _{algorithm [25], deriving a general upper} bound on the number of label requests made by A2. This bound is expressed in terms of particular quantity called the disagreement coeﬃcient, which roughly quantiﬁes how quickly the region of disagreement can grow as a function of the radius of the version space. For concreteness the bound is included in Appendix B.

In addition, Dasgupta, Hsu, and Monteleoni [22] introduce and analyze a new agnostic active learning algorithm. While similar to A2_{, this algorithm simpliﬁes the maintenance of the region of uncertainty with a reduction to supervised learning,} keeping track of the version space implicitly via label constraints.

5.2. Open questions

A common feature of the selective sampling algorithm [18], A2, and others [22] is that they are all non-aggressive in their choice of query points. Even points on which there is a small amount of uncertainty are queried, rather than pursuing the maximally uncertain point. In recent work Balcan, Broder and Zhang [5] have shown that more aggressive strategies can generally lead to better bounds. However the analysis in [5] was speciﬁc to the realizable case, or done for a very speciﬁc type of noise. It is an open question to design aggressive agnostic active learning algorithms.

A more general open question is what conditions are suﬃcient and necessary for active learning to succeed in the agnos-tic case. What is the right quantity that can characterize the sample complexity of agnosagnos-tic active learning? As mentioned already, some progress in this direction has been recently made in [25] and [22]; however, those results characterize non-aggressive agnostic active learning. Deriving and analyzing the optimal agnostic active learning strategy is still an open question.

Much of the existing literature on active learning has been focused on binary classiﬁcation; it would be interesting to analyze active learning for other loss functions. The key ingredient allowing recursion in the proof of correctness is a loss that is unvarying with respect to substantial variation over the hypothesis space. Many losses such as squared error loss do not have this property, so achieving substantial speedups, if that is possible, requires new insights. For other losses with this property (such as hinge loss or clipped squared loss), generalizations of A2 _{appear straightforward.}

Acknowledgments

Maria-Florina has been supported in part by the NSF grant CCF-0514922, by an IBM Graduate Fellowship, and by a Google Research Grant. Part of this work was done while she was visiting IBM T.J. Watson Research Center in Hawthorne, NY, and Toyota Technological Institute at Chicago.

Appendix A. Standard results

The following standard Sample Complexity bound is in [3].

Theorem A.1.Suppose that H is a set of functions from X to

{−

1

,

1

}

with ﬁnite VC-dimension VH

1. Let D be an arbitrary, but ﬁxed probability distribution over X

× {−

1

,

1

}

. For any

,

δ >

0, if a sample is drawn from D of size

m

(

, δ,

VH

)

=

64

2

2VHln

12

+

ln

4

δ

,

then with probability at least1

−

δ

,

|

err

(

h

)

−

err

(

h

)

|

for all h

∈

H .

Section 4.2 uses the following a classic lemma about the uniform distribution. For a proof see, for example, [5,23]. Lemma A.2. For any ﬁxed unit vector w and any0

<

γ

1,

γ

4

Prx

|

w

·

x

|

√

γ

d

γ

,

where x is drawn uniformly from the unit sphere. Appendix B. Subsequent guarantees forA2

This section describes the disagreement coeﬃcient [25] and the guarantees it provides for the A2 _{algorithm. We begin} with a few additional deﬁnitions, in the notation of Section 2.

Deﬁnition 2.Thedisagreement rate

Δ(

V

)

of a set V

⊆

His deﬁned as

Δ(

V

)

=

Prx∼D

(12)

Deﬁnition 3.Forh

∈

H,r

>

0, letB

(

h

,

r

)

= {

h

∈

H: d

(

h

,

h

)

r

}

and deﬁne thedisagreement rate at radius r as

Δ

r

=

sup

h∈H

Δ

B

(

h

,

r

)

.

Thedisagreement coeﬃcientis the inﬁmum value of

θ >

0 such that

∀

r

>

η

+

,

Δ

r

θ

r

.

Theorem B.1.(See [25].) If

θ

is the disagreement coeﬃcient for H , then with probability at least1

−

δ

, given the inputs

and

δ

, A2 outputs an

-optimal hypothesis h. Moreover, the number of label requests made by A2is at most:

˜

O

θ

2

η

2

+

1

VHln 1

+

ln 1

δ

ln1

,

where VH

1is the VC-dimension of H . References

[1] D. Angluin, Queries and concept learning, Mach. Learn. 2 (1998) 319–342. [2] D. Angluin, Queries revisited, Theoret. Comput. Sci. 313 (2) (2004) 175–194.

[3] M. Anthony, P. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, 1999.

[4] M.-F. Balcan, A. Blum, A PAC-style model for learning from labeled and unlabeled data, in: Proceedings of the 18th Annual Conference on Computational Learning Theory (COLT), 2005.

[5] M.-F. Balcan, A. Broder, T. Zhang, Margin based active learning, in: Proceedings of the 20th Annual Conference on Computational Learning Theory (COLT), 2007.

[6] M.-F. Balcan, E. Even-Dar, S. Hanneke, M. Kearns, Y. Mansour, J. Wortman, Asymptotic active learning, in: Workshop on Principles of Learning Design Problem, in Conjunction with the 21st Annual Conference on Neural Information Processing Systems (NIPS), 2007.

[7] M.-F. Balcan, S. Hanneke, J. Wortman, The true sample complexity of active learning, in: Proceedings of the 21st Annual Conference on Computational Learning Theory (COLT), 2008.

[8] E. Baum, K. Lang, Query learning can work poorly when a human oracle is used, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 1993.

[9] A. Blum, S. Chawla, Learning from labeled and unlabeled data using graph mincuts, in: Proceedings of the 18th International Conference on Machine Learning (ICML), 2001.

[10] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: The 11th Annual Conference on Computational Learning Theory (COLT), 1998.

[11] A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth, Occam’s razor, Inform. Process. Lett. 24 (1987) 377–380.

[12] O. Bousquet, S. Boucheron, G. Lugosi, Theory of classiﬁcation: A survey of recent advances, ESAIM Probab. Stat. 9 (2005) 323–375.

[13] N.H. Bshouty, N. Eiron, Learning monotone dnf from a teacher that almost does not answer membership queries, J. Mach. Learn. Res. 3 (2002) 49–57. [14] M. Burnashev, K. Zigangirov, An interval estimation problem for controlled observations, Probl. Inf. Transm. 10 (1974) 223–231.

[15] R. Castro, R. Nowak, Minimax bounds for active learning, in: Proceedings of the 20th Annual Conference on Computational Learning Theory (COLT), 2007.

[16] R. Castro, R. Willett, R. Nowak, Faster rates in regression via active learning, Adv. Neural Inform. Process. Systems 18 (2006). [17] O. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, MIT Press, 2006.

[18] D. Cohn, L. Atlas, R. Ladner, Improving generalization with active learning, in: Proceedings of the 15th International Conference on Machine Learning (ICML), 1994, pp. 201–221.

[19] J. Czyzowicz, D. Mundici, A. Pelc, Ulam’s searching game with lies, J. Combin. Theory Ser. A 52 (1989) 62–76. [20] S. Dasgupta, Analysis of a greedy active learning strategy, Adv. Neural Inform. Process. Systems 17 (2004). [21] S. Dasgupta, Coarse sample complexity bounds for active learning, Adv. Neural Inform. Process. Systems 18 (2005).

[22] S. Dasgupta, D.J. Hsu, C. Monteleoni, A general agnostic active learning algorithm, Adv. Neural Inform. Process. Systems 20 (2007).

[23] S. Dasgupta, A. Kalai, C. Monteleoni, Analysis of perceptron-based active learning, in: Proceedings of the 18th Annual Conference on Computational Learning Theory (COLT), 2005.

[24] Y. Freund, H.S. Seung, E. Shamir, N. Tishby, Selective sampling using the query by committee algorithm, Mach. Learn. 28 (2–3) (1997) 133–168. [25] S. Hanneke, A bound on the label complexity of agnostic active learning, in: Proceedings of the 24th Annual International Conference on Machine

Learning (ICML), 2007.

[26] J. Jackson, An eﬃcient membership-query algorithm for learning dnf with respect to the uniform distribution, J. Comput. System Sci. 57 (3) (1995) 414–440.

[27] T. Joachims, Transductive inference for text classiﬁcation using support vector machines, in: Proceedings of the 16th International Conference on Machine Learning (ICML), 1999, pp. 200–209.

[28] M. Kääriäinen, On active learning in the non-realizable case, in: Proceedings of 17th International Conference on Algorithmic Learning Theory (ALT), 2006.

[29] A. Levin, P. Viola, Y. Freund, Unsupervised improvement of visual detectors using co-training, in: Proceedings of the 9th International Conference on Computer Vision (ICCV), 2003.

[30] P.M. Long, On the sample complexity of PAC learning halfspaces against the uniform distribution, IEEE Trans. Neural Networks 6 (6) (1995) 1556–1559. [31] T. Mitchell, The discipline of machine learning, Technical Report CMU ML-06 108, 2006.

[32] K. Nigam, A. McCallum, S. Thrun, T.M. Mitchell, Text classiﬁcation from labeled and unlabeled documents using EM, Mach. Learn. 39 (2/3) (2000) 103–134.

[33] S.B. Park, B.T. Zhang, Co-trained support vector machines for large scale unstructured document classiﬁcation using unlabeled data and syntactic information, Inform. Process. Manag. 40 (3) (2004) 421–439.

[34] V. Vapnik, A. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory Probab. Appl. 16 (2) (1971) 264–280.

ScienceDirect

www.elsevier.com/locate/jcss

doi:10.1016/j.jcss.2008.07.003