For any label complexity $\Lambda_a$ achieved by any active learning algorithm $\mathcal{A}_a$, there exists a $\mathcal{P}_{XY} \in D$ such that $\Lambda_a(\nu + \varepsilon, \mathcal{P}_{XY}) \neq o(1/\varepsilon)$.

Proof The idea here is to reduce from the task of estimating the mean of i.i.d. Bernoulli trials, corresponding to the $Y_i$ values. Specifically, consider any active learning algorithm $\mathcal{A}_a$; we use $\mathcal{A}_a$ to construct an estimator for the mean of i.i.d. Bernoulli trials as follows. Suppose we have $B_1, B_2, \ldots, B_n$ i.i.d. Bernoulli($p$), for some $p \in (1/8, 3/8)$ and $n \in \mathbb{N}$. We take the sequence of $X_1, X_2, \ldots$ random variables i.i.d. with distribution $\mathcal{P}$ defined above (independent from the $B_j$ variables). For each $i$, we additionally have a random variable $C_i$ with conditional distribution Bernoulli($X_i/2$) given $X_i$, where the $C_i$ are conditionally independent given the $X_i$ sequence, and independent from the $B_i$ sequence as well.

We run $\mathcal{A}_a$ with this sequence of $X_i$ values. For the $t$-th label request made by the algorithm, say for the $Y_i$ value corresponding to some $X_i$, if it has previously requested this $Y_i$ already, then we simply repeat the same answer for $Y_i$ again, and otherwise we return to the algorithm the value $2\max\{B_t, C_i\} - 1$ for $Y_i$. Note that in the latter case, the conditional distribution of $\max\{B_t, C_i\}$ is Bernoulli($p + (1-p)X_i/2$), given the $X_i$ that $\mathcal{A}_a$ requests the label of; thus, the $Y_i$ response has the same conditional distribution given $X_i$ as it would have for the $\mathcal{P}_{XY} \in D$ with $\eta(0; \mathcal{P}_{XY}) = p$ (i.e., $\eta(X_i; \mathcal{P}_{XY}) = p + (1-p)X_i/2$). Since this $Y_i$ value is conditionally (given $X_i$) independent from the previously returned labels and $X_j$ sequence, this is distributionally equivalent to running $\mathcal{A}_a$ under the $\mathcal{P}_{XY} \in D$ with $\eta(0; \mathcal{P}_{XY}) = p$.
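The label-answering rule in this reduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ReductionOracle`, the helper `empirical_eta`, the 0-based indexing, and the requirement that each $X_i$ lies in $[0, 1]$ are choices made here, not part of the original construction. Drawing $C_i$ lazily on the first request for $Y_i$ is assumed to be distributionally equivalent to drawing it up front.

```python
import random


class ReductionOracle:
    """Answers label requests from Bernoulli trials B_1..B_n and points X_1, X_2, ..."""

    def __init__(self, b_values, x_values, seed=0):
        self.b_values = b_values      # B_1, ..., B_n, each 0 or 1
        self.x_values = x_values      # X_1, X_2, ..., assumed to lie in [0, 1]
        self.rng = random.Random(seed)
        self.answers = {}             # index i -> label already returned for Y_i
        self.t = 0                    # number of label requests made so far

    def request(self, i):
        self.t += 1
        if i in self.answers:         # repeated request: repeat the same answer
            return self.answers[i]
        b_t = self.b_values[self.t - 1]
        # C_i ~ Bernoulli(X_i / 2), drawn lazily on the first request for Y_i.
        c_i = 1 if self.rng.random() < self.x_values[i] / 2 else 0
        self.answers[i] = 2 * max(b_t, c_i) - 1
        return self.answers[i]


def empirical_eta(p, x, trials, seed=0):
    """Estimate P(Y = +1 | X = x) under the oracle, with every X_i fixed to x."""
    rng = random.Random(seed)
    b_values = [1 if rng.random() < p else 0 for _ in range(trials)]
    oracle = ReductionOracle(b_values, [x] * trials, seed=seed + 1)
    positives = sum(oracle.request(i) == 1 for i in range(trials))
    return positives / trials
```

For a fixed $x$, `empirical_eta(p, x, trials)` should approach $p + (1-p)x/2$ as `trials` grows, which is the conditional distribution claimed in the proof.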

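Let $\hat{h}_n$ be the classifier returned by $\mathcal{A}_a(n)$ in the above context, and let $\hat{z}_n$ denote the value of $z \in [2/5, 6/7]$ with minimum $\mathcal{P}(x : h_z(x) \neq \hat{h}_n(x))$. Then define $\hat{p}_n = \frac{1 - \hat{z}_n}{2 - \hat{z}_n} \in [1/8, 3/8]$ and $z^* = \frac{1 - 2p}{1 - p} \in (2/5, 6/7)$. By a triangle inequality, we have $|\hat{z}_n - z^*| = 2\mathcal{P}(x : h_{\hat{z}_n}(x) \neq h_{z^*}(x)) \leq 4\mathcal{P}(x : \hat{h}_n(x) \neq h_{z^*}(x))$. Combining this with (80) and (78) implies that

$$\mathrm{er}(\hat{h}_n) - \mathrm{er}(h_{z^*}) \geq \frac{1}{8}\,\mathcal{P}\left(x : \hat{h}_n(x) \neq h_{z^*}(x)\right)^2 \geq \frac{1}{128}(\hat{z}_n - z^*)^2 \geq \frac{1}{128}(\hat{p}_n - p)^2. \tag{84}$$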
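The last two inequalities in (84) can be checked directly from the definitions of $\hat{p}_n$ and $z^*$. The following is a sketch of that check; the name $g$ for the map from $z$ to the estimate of $p$ is introduced here for illustration only.

```latex
% g is an illustrative name; it is not part of the original notation.
\begin{align*}
g(z) &= \frac{1-z}{2-z}, \qquad \hat{p}_n = g(\hat{z}_n), \qquad
g(z^*) = \frac{1 - \frac{1-2p}{1-p}}{2 - \frac{1-2p}{1-p}}
       = \frac{p/(1-p)}{1/(1-p)} = p, \\
g'(z) &= -\frac{1}{(2-z)^2}, \qquad
|g'(z)| \le \left(\tfrac{7}{8}\right)^2 < 1 \quad \text{for } z \in [2/5, 6/7], \\
|\hat{p}_n - p| &= |g(\hat{z}_n) - g(z^*)| \le |\hat{z}_n - z^*|
  \quad \text{(mean value theorem)}, \\
\frac{1}{8}\,\mathcal{P}\big(x : \hat{h}_n(x) \ne h_{z^*}(x)\big)^2
  &\ge \frac{1}{8}\left(\frac{|\hat{z}_n - z^*|}{4}\right)^2
   = \frac{1}{128}(\hat{z}_n - z^*)^2 .
\end{align*}
```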

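In particular, by Lemma 55, we can choose $p \in (1/8, 3/8)$ so that $\mathbb{E}\left[(\hat{p}_n - p)^2\right] \neq o(1/n)$, which, by (84), implies $\mathbb{E}\left[\mathrm{er}(\hat{h}_n)\right] - \nu \neq o(1/n)$. This means there is an increasing infinite sequence of values $n_k \in \mathbb{N}$, and a constant $c \in (0, \infty)$ such that $\forall k \in \mathbb{N}$, $\mathbb{E}\left[\mathrm{er}(\hat{h}_{n_k})\right] - \nu \geq c/n_k$. Supposing $\mathcal{A}_a$ achieves label complexity $\Lambda_a$, and taking the values $\varepsilon_k = c/(2n_k)$, we have $\Lambda_a(\nu + \varepsilon_k, \mathcal{P}_{XY}) > n_k = c/(2\varepsilon_k)$. Since $\varepsilon_k > 0$ and approaches 0 as $k \to \infty$, we have $\Lambda_a(\nu + \varepsilon, \mathcal{P}_{XY}) \neq o(1/\varepsilon)$.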
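The final step of the argument can be spelled out as follows. This sketch assumes the usual reading of "achieves label complexity $\Lambda_a$", namely that any budget $n \geq \Lambda_a(\varepsilon', \mathcal{P}_{XY})$ guarantees expected error at most $\varepsilon'$; that definition is stated elsewhere in the paper and is only paraphrased here.

```latex
% Assumed definition (paraphrased): n >= \Lambda_a(\varepsilon', P_{XY})
% implies E[er(\hat{h}_n)] <= \varepsilon'.
\begin{align*}
\mathbb{E}\big[\mathrm{er}(\hat{h}_{n_k})\big]
  &\ge \nu + \frac{c}{n_k} = \nu + 2\varepsilon_k > \nu + \varepsilon_k
  && \Longrightarrow \quad
  \Lambda_a(\nu + \varepsilon_k, \mathcal{P}_{XY}) > n_k = \frac{c}{2\varepsilon_k}, \\
\varepsilon_k \,\Lambda_a(\nu + \varepsilon_k, \mathcal{P}_{XY})
  &> \frac{c}{2} \quad \forall k, \qquad \varepsilon_k \to 0
  && \Longrightarrow \quad
  \limsup_{\varepsilon \to 0} \varepsilon \,\Lambda_a(\nu + \varepsilon, \mathcal{P}_{XY})
  \ge \frac{c}{2} > 0 .
\end{align*}
```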

Proof [of Theorem 22] The result follows from Lemmas 54 and 56.

E.2 Proof of Lemma 26: Label Complexity of Algorithm 5

The proof of Lemma 26 essentially runs parallel to that of Theorem 16, with variants of each lemma from that proof adapted to the noise-robust Algorithm 5.

As before, in this section we will fix a particular joint distribution $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1, +1\}$ with marginal $\mathcal{P}$ on $\mathcal{X}$, and then analyze the label complexity achieved by Algorithm 5 for that particular distribution. For our purposes, we will suppose $\mathcal{P}_{XY}$ satisfies Condition 1 for some finite parameters $\mu$ and $\kappa$. We also fix any $f \in \bigcap_{\varepsilon > 0} \mathrm{cl}(\mathbb{C}(\varepsilon))$. Furthermore, we will continue using the notation of Appendix B, such as $\mathcal{S}^k(\mathcal{H})$, etc., and in particular we continue to denote $V^\star_m = \{h \in \mathbb{C} : \forall \ell \leq m, h(X_\ell) = f(X_\ell)\}$ (though note that in this case, we may sometimes have $f(X_\ell) \neq Y_\ell$, so that $V^\star_m \neq \mathbb{C}[Z_m]$). As in the above proofs, we will prove a slightly more general result in which the "1/2" threshold in Step 5 can be replaced by an arbitrary constant $\gamma \in (0, 1)$.

For the estimators $\hat{P}_{4m}$ used in the algorithm, we take the same definitions as in Appendix B.1. To be clear, we assume the sequences $W_1$ and $W_2$ mentioned there are independent from the entire $(X_1, Y_1), (X_2, Y_2), \ldots$ sequence of data points; this is consistent with the earlier discussion of how these $W_1$ and $W_2$ sequences can be constructed in a preprocessing step.

We will consider running Algorithm 5 with label budget $n \in \mathbb{N}$ and confidence parameter $\delta \in (0, e^{-3})$, and analyze properties of the internal sets $V_i$. We will denote by $\hat{V}_i$, $\hat{L}_i$, and $\hat{i}_k$, the final values of $V_i$, $L_i$, and $i_k$, respectively, for each $i$ and $k$ in Algorithm 5. We also denote by $\hat{m}^{(k)}$ and $\hat{V}^{(k)}$ the final values of $m$ and $V_{i_k+1}$, respectively, obtained while $k$ has the specified value in Algorithm 5; $\hat{V}^{(k)}$ may be smaller than $\hat{V}_{\hat{i}_k}$ when $\hat{m}^{(k)}$ is not a power of 2. Additionally, define $L^\star_i = \{(X_m, Y_m)\}_{m=2^{i-1}+1}^{2^i}$. After establishing a few results concerning these, we will show that for $n$ satisfying the condition in Lemma 26, the conclusion of the lemma holds. First, we have a few auxiliary definitions. For $\mathcal{H} \subseteq \mathbb{C}$, and any $i \in \mathbb{N}$, define

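$$\phi_i(\mathcal{H}) = \mathbb{E} \sup_{h_1, h_2 \in \mathcal{H}} \left| \left(\mathrm{er}(h_1) - \mathrm{er}_{L^\star_i}(h_1)\right) - \left(\mathrm{er}(h_2) - \mathrm{er}_{L^\star_i}(h_2)\right) \right|$$

and

$$\tilde{U}_i(\mathcal{H}, \delta) = \min\left\{ \tilde{K}\left( \phi_i(\mathcal{H}) + \sqrt{\mathrm{diam}(\mathcal{H})\,\frac{\ln(32 i^2/\delta)}{2^{i-1}}} + \frac{\ln(32 i^2/\delta)}{2^{i-1}} \right),\ 1 \right\},$$

where for our purposes we can take $\tilde{K} = 8272$. It is known (see, e.g., Massart and Nédélec, 2006; Giné and Koltchinskii, 2006) that for some universal constant $c' \in [2, \infty)$,

$$\phi_{i+1}(\mathcal{H}) \leq c' \max\left\{ \sqrt{\mathrm{diam}(\mathcal{H})\, 2^{-i}\, d \log_2 \frac{2}{\mathrm{diam}(\mathcal{H})}},\ 2^{-i} d i \right\}. \tag{85}$$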

We also generally have $\phi_i(\mathcal{H}) \leq 2$ for every $i \in \mathbb{N}$. The next lemma is taken from the work of Koltchinskii (2006) on data-dependent Rademacher complexity bounds on the excess risk.
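To get a feel for the scale of $\tilde{U}_i(\mathcal{H}, \delta)$, the definition can be evaluated numerically. The sketch below is illustrative only: the function name `u_tilde` is introduced here, and $\phi_i(\mathcal{H})$ and $\mathrm{diam}(\mathcal{H})$ are passed in as plain numbers, since computing them depends on $\mathcal{H}$ and on the distribution.

```python
import math

K_TILDE = 8272  # value of the constant K-tilde stated in the text


def u_tilde(phi_i, diam, i, delta):
    """Evaluate U-tilde_i(H, delta) given phi_i(H) and diam(H) as numbers."""
    log_term = math.log(32 * i**2 / delta)
    n_i = 2 ** (i - 1)  # number of samples in L*_i
    inner = phi_i + math.sqrt(diam * log_term / n_i) + log_term / n_i
    return min(K_TILDE * inner, 1.0)
```

Because $\tilde{K}$ is large, the value is capped at 1 unless $\phi_i(\mathcal{H})$ and the two logarithmic terms are all quite small, so the bound only becomes informative for large $i$ or for sets $\mathcal{H}$ of small diameter.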

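Lemma 57 For any $\delta \in (0, e^{-3})$, any $\mathcal{H} \subseteq \mathbb{C}$ with $f \in \mathrm{cl}(\mathcal{H})$, and any $i \in \mathbb{N}$, on an event $K_i$ with $\mathbb{P}(K_i) \geq 1 - \delta/(4i^2)$, $\forall h \in \mathcal{H}$,

$$\mathrm{er}_{L^\star_i}(h) - \min_{h' \in \mathcal{H}} \mathrm{er}_{L^\star_i}(h') \leq \mathrm{er}(h) - \mathrm{er}(f) + \hat{U}_i(\mathcal{H}, \delta)$$

$$\mathrm{er}(h) - \mathrm{er}(f) \leq \mathrm{er}_{L^\star_i}(h) - \mathrm{er}_{L^\star_i}(f) + \hat{U}_i(\mathcal{H}, \delta)$$

$$\min\left\{\hat{U}_i(\mathcal{H}, \delta),\ 1\right\} \leq \tilde{U}_i(\mathcal{H}, \delta).$$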

Lemma 57 essentially follows from a version of Talagrand's inequality. The details of the proof may be extracted from the proofs of Koltchinskii (2006), and related derivations have previously been presented by Hanneke (2011) and Koltchinskii (2010). The only minor twist here is that $f$ need only be in $\mathrm{cl}(\mathcal{H})$, rather than in $\mathcal{H}$ itself, which easily follows from Koltchinskii's original results, since the Borel-Cantelli lemma implies that with probability one, every $\varepsilon > 0$ has some $g \in \mathcal{H}(\varepsilon)$ (very close to $f$) with $\mathrm{er}_{L^\star_i}(g) = \mathrm{er}_{L^\star_i}(f)$.

For our purposes, the important implications of Lemma 57 are summarized by the following
