a $\mathcal{P}_{XY} \in \mathcal{D}$ such that $\Lambda_a(\nu+\varepsilon, \mathcal{P}_{XY}) \neq o(1/\varepsilon)$.
Proof The idea here is to reduce from the task of estimating the mean of i.i.d. Bernoulli trials, corresponding to the $Y_i$ values. Specifically, consider any active learning algorithm $\mathcal{A}_a$; we use $\mathcal{A}_a$ to construct an estimator for the mean of i.i.d. Bernoulli trials as follows. Suppose we have $B_1, B_2, \ldots, B_n$ i.i.d. Bernoulli($p$), for some $p \in (1/8, 3/8)$ and $n \in \mathbb{N}$. We take the sequence of $X_1, X_2, \ldots$ random variables i.i.d. with distribution $\mathcal{P}$ defined above (independent from the $B_j$ variables). For each $i$, we additionally have a random variable $C_i$ with conditional distribution Bernoulli($X_i/2$) given $X_i$, where the $C_i$ are conditionally independent given the $X_i$ sequence, and independent from the $B_i$ sequence as well.
We run $\mathcal{A}_a$ with this sequence of $X_i$ values. For the $t$th label request made by the algorithm, say for the $Y_i$ value corresponding to some $X_i$, if it has previously requested this $Y_i$ already, then we simply repeat the same answer for $Y_i$ again, and otherwise we return to the algorithm the value $2\max\{B_t, C_i\} - 1$ for $Y_i$. Note that in the latter case, the conditional distribution of $\max\{B_t, C_i\}$ is Bernoulli($p + (1-p)X_i/2$), given the $X_i$ that $\mathcal{A}_a$ requests the label of; thus, the $Y_i$ response has the same conditional distribution given $X_i$ as it would have for the $\mathcal{P}_{XY} \in \mathcal{D}$ with $\eta(0; \mathcal{P}_{XY}) = p$ (i.e., $\eta(X_i; \mathcal{P}_{XY}) = p + (1-p)X_i/2$). Since this $Y_i$ value is conditionally (given $X_i$) independent from the previously returned labels and $X_j$ sequence, this is distributionally equivalent to running $\mathcal{A}_a$ under the $\mathcal{P}_{XY} \in \mathcal{D}$ with $\eta(0; \mathcal{P}_{XY}) = p$.
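The key identity in the reduction, $\mathbb{P}(\max\{B_t, C_i\} = 1 \mid X_i = x) = p + (1-p)x/2$, can be checked by enumerating the four outcomes of $(B_t, C_i)$. The following is only a small sketch of that check using exact rational arithmetic; the function names are hypothetical, and it assumes $x \in [0,1]$ so that $x/2$ is a valid Bernoulli parameter.

```python
from fractions import Fraction
from itertools import product


def max_label_probability(p, x):
    """Exact P(max{B, C} = 1) for independent B ~ Bernoulli(p), C ~ Bernoulli(x/2).

    Assumes 0 <= p <= 1 and 0 <= x <= 1 (illustrative ranges, not taken from the text).
    """
    q = x / 2
    prob_one = Fraction(0)
    # Enumerate the four joint outcomes (b, c) and sum the mass where max(b, c) = 1.
    for b, c in product((0, 1), repeat=2):
        weight = (p if b else 1 - p) * (q if c else 1 - q)
        if max(b, c) == 1:
            prob_one += weight
    return prob_one


def eta(p, x):
    """The regression function value p + (1 - p) x / 2 from the proof."""
    return p + (1 - p) * x / 2


if __name__ == "__main__":
    p, x = Fraction(1, 4), Fraction(1, 2)
    print(max_label_probability(p, x), eta(p, x))
```

Because the two functions agree exactly for every rational $p$ and $x$ tried, the label $2\max\{B_t, C_i\} - 1$ has conditional mean parameter $\eta(X_i; \mathcal{P}_{XY})$, which is what the reduction needs.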
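Let $\hat{h}_n$ be the classifier returned by $\mathcal{A}_a(n)$ in the above context, and let $\hat{z}_n$ denote the value of $z \in [2/5, 6/7]$ with minimum $\mathcal{P}(x : h_z(x) \neq \hat{h}_n(x))$. Then define $\hat{p}_n = \frac{1-\hat{z}_n}{2-\hat{z}_n} \in [1/8, 3/8]$ and $z^* = \frac{1-2p}{1-p} \in (2/5, 6/7)$. By a triangle inequality, we have $|\hat{z}_n - z^*| = 2\mathcal{P}(x : h_{\hat{z}_n}(x) \neq h_{z^*}(x)) \leq 4\mathcal{P}(x : \hat{h}_n(x) \neq h_{z^*}(x))$. Combining this with (80) and (78) implies that
$$\mathrm{er}(\hat{h}_n) - \mathrm{er}(h_{z^*}) \geq \frac{1}{8}\left(\mathcal{P}\left(x : \hat{h}_n(x) \neq h_{z^*}(x)\right)\right)^2 \geq \frac{1}{128}(\hat{z}_n - z^*)^2 \geq \frac{1}{128}(\hat{p}_n - p)^2. \tag{84}$$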
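The last inequality in (84) uses two facts about the change of variables: $z \mapsto (1-z)/(2-z)$ inverts $p \mapsto (1-2p)/(1-p)$, and it is a contraction on $[2/5, 6/7]$. Indeed, $\frac{1-a}{2-a} - \frac{1-b}{2-b} = \frac{b-a}{(2-a)(2-b)}$, and $2 - z \geq 8/7$ on this interval. The sketch below checks both facts on a grid with exact arithmetic; the helper names and the grid resolution are illustrative choices, not part of the original proof.

```python
from fractions import Fraction


def z_star(p):
    """Map a Bernoulli parameter p to the threshold z* = (1 - 2p) / (1 - p)."""
    return (1 - 2 * p) / (1 - p)


def p_from_z(z):
    """Inverse map: p = (1 - z) / (2 - z)."""
    return (1 - z) / (2 - z)


def max_slope(points):
    """Largest |p(a) - p(b)| / |a - b| over distinct pairs of grid points."""
    return max(
        abs(p_from_z(a) - p_from_z(b)) / abs(a - b)
        for a in points
        for b in points
        if a != b
    )


LOW, HIGH = Fraction(2, 5), Fraction(6, 7)
# 101 evenly spaced rational points covering [2/5, 6/7], endpoints included.
GRID = [LOW + Fraction(k, 100) * (HIGH - LOW) for k in range(101)]

if __name__ == "__main__":
    print(p_from_z(LOW), p_from_z(HIGH))  # endpoints map to 3/8 and 1/8
    print(float(max_slope(GRID)))         # stays below 49/64 < 1
```

Since the slope never exceeds $49/64$, we get $|\hat{p}_n - p| \leq |\hat{z}_n - z^*|$, which gives the final step of (84).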
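In particular, by Lemma 55, we can choose $p \in (1/8, 3/8)$ so that $\mathbb{E}\left[(\hat{p}_n - p)^2\right] \neq o(1/n)$, which, by (84), implies $\mathbb{E}\left[\mathrm{er}(\hat{h}_n)\right] - \nu \neq o(1/n)$. This means there is an increasing infinite sequence of values $n_k \in \mathbb{N}$, and a constant $c \in (0,\infty)$ such that $\forall k \in \mathbb{N}$, $\mathbb{E}\left[\mathrm{er}(\hat{h}_{n_k})\right] - \nu \geq c/n_k$. Supposing $\mathcal{A}_a$ achieves label complexity $\Lambda_a$, and taking the values $\varepsilon_k = c/(2n_k)$, we have $\Lambda_a(\nu+\varepsilon_k, \mathcal{P}_{XY}) > n_k = c/(2\varepsilon_k)$. Since $\varepsilon_k > 0$ and approaches $0$ as $k \to \infty$, we have $\Lambda_a(\nu+\varepsilon, \mathcal{P}_{XY}) \neq o(1/\varepsilon)$.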
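The step from the lower bound on the expected error to the lower bound on $\Lambda_a$ can be spelled out as follows. This is a sketch that assumes the usual reading of "achieves label complexity $\Lambda_a$", namely that $n \geq \Lambda_a(\epsilon, \mathcal{P}_{XY})$ implies $\mathbb{E}[\mathrm{er}(\hat{h}_n)] \leq \epsilon$.

```latex
% Assumption: n >= \Lambda_a(\epsilon, P_{XY})  implies  E[er(\hat{h}_n)] <= \epsilon.
\begin{align*}
\Lambda_a(\nu + \varepsilon_k, \mathcal{P}_{XY}) \leq n_k
  &\implies \mathbb{E}\left[\mathrm{er}(\hat{h}_{n_k})\right]
     \leq \nu + \varepsilon_k = \nu + \frac{c}{2 n_k} < \nu + \frac{c}{n_k},
\end{align*}
% which contradicts E[er(\hat{h}_{n_k})] - \nu >= c / n_k. Hence
\begin{align*}
\Lambda_a(\nu + \varepsilon_k, \mathcal{P}_{XY}) > n_k
  \quad\text{and}\quad
  \varepsilon_k \, \Lambda_a(\nu + \varepsilon_k, \mathcal{P}_{XY})
     > \varepsilon_k n_k = \frac{c}{2} .
\end{align*}
```

Since $\varepsilon_k \Lambda_a(\nu+\varepsilon_k, \mathcal{P}_{XY})$ stays above $c/2$ along a sequence $\varepsilon_k \to 0$, the product cannot tend to zero, which is exactly the statement that $\Lambda_a(\nu+\varepsilon, \mathcal{P}_{XY})$ is not $o(1/\varepsilon)$.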
Proof [of Theorem 22] The result follows from Lemmas 54 and 56.
E.2 Proof of Lemma 26: Label Complexity of Algorithm 5
The proof of Lemma 26 essentially runs parallel to that of Theorem 16, with variants of each lemma from that proof adapted to the noise-robust Algorithm 5.
As before, in this section we will fix a particular joint distribution $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1,+1\}$ with marginal $\mathcal{P}$ on $\mathcal{X}$, and then analyze the label complexity achieved by Algorithm 5 for that particular distribution. For our purposes, we will suppose $\mathcal{P}_{XY}$ satisfies Condition 1 for some finite parameters $\mu$ and $\kappa$. We also fix any $f \in \bigcap_{\varepsilon > 0} \mathrm{cl}(\mathbb{C}(\varepsilon))$. Furthermore, we will continue using the notation of Appendix B, such as $\mathcal{S}^k(\mathcal{H})$, etc., and in particular we continue to denote $V^\star_m = \{h \in \mathbb{C} : \forall \ell \leq m, h(X_\ell) = f(X_\ell)\}$ (though note that in this case, we may sometimes have $f(X_\ell) \neq Y_\ell$, so that $V^\star_m \neq \mathbb{C}[Z_m]$). As in the above proofs, we will prove a slightly more general result in which the “1/2” threshold in Step 5 can be replaced by an arbitrary constant $\gamma \in (0,1)$.
For the estimators $\hat{P}_{4m}$ used in the algorithm, we take the same definitions as in Appendix B.1. To be clear, we assume the sequences $W_1$ and $W_2$ mentioned there are independent from the entire $(X_1,Y_1), (X_2,Y_2), \ldots$ sequence of data points; this is consistent with the earlier discussion of how these $W_1$ and $W_2$ sequences can be constructed in a preprocessing step.
We will consider running Algorithm 5 with label budget $n \in \mathbb{N}$ and confidence parameter $\delta \in (0, e^{-3})$, and analyze properties of the internal sets $V_i$. We will denote by $\hat{V}_i$, $\hat{\mathcal{L}}_i$, and $\hat{i}_k$, the final values of $V_i$, $\mathcal{L}_i$, and $i_k$, respectively, for each $i$ and $k$ in Algorithm 5. We also denote by $\hat{m}^{(k)}$ and $\hat{V}^{(k)}$ the final values of $m$ and $V_{i_k+1}$, respectively, obtained while $k$ has the specified value in Algorithm 5; $\hat{V}^{(k)}$ may be smaller than $\hat{V}_{\hat{i}_k}$ when $\hat{m}^{(k)}$ is not a power of 2. Additionally, define $\mathcal{L}^\star_i = \{(X_m, Y_m)\}_{m=2^{i-1}+1}^{2^i}$. After establishing a few results concerning these, we will show that for $n$ satisfying the condition in Lemma 26, the conclusion of the lemma holds. First, we have a few auxiliary definitions. For $\mathcal{H} \subseteq \mathbb{C}$, and any $i \in \mathbb{N}$, define
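$$\phi_i(\mathcal{H}) = \mathbb{E}\left[\sup_{h_1,h_2 \in \mathcal{H}} \left|\left(\mathrm{er}(h_1) - \mathrm{er}_{\mathcal{L}^\star_i}(h_1)\right) - \left(\mathrm{er}(h_2) - \mathrm{er}_{\mathcal{L}^\star_i}(h_2)\right)\right|\right]$$
and
$$\tilde{U}_i(\mathcal{H},\delta) = \min\left\{\tilde{K}\left(\phi_i(\mathcal{H}) + \sqrt{\mathrm{diam}(\mathcal{H})\frac{\ln(32 i^2/\delta)}{2^{i-1}}} + \frac{\ln(32 i^2/\delta)}{2^{i-1}}\right), 1\right\},$$
where for our purposes we can take $\tilde{K} = 8272$. It is known (see, e.g., Massart and Nédélec, 2006; Giné and Koltchinskii, 2006) that for some universal constant $c' \in [2,\infty)$,
$$\phi_{i+1}(\mathcal{H}) \leq c' \max\left\{\sqrt{\mathrm{diam}(\mathcal{H})\, 2^{-i} d \log_2 \frac{2}{\mathrm{diam}(\mathcal{H})}},\ 2^{-i} d\, i\right\}. \tag{85}$$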
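To get a feel for how $\tilde{U}_i$ behaves, one can plug the bound (85) into its definition and evaluate it numerically. The sketch below does this under explicit assumptions: the value `c_prime=2.0` is only a placeholder for the unspecified universal constant $c'$, `d` is treated as a given complexity parameter (presumably the VC dimension), the bound on $\phi_i$ is capped at 2 because a difference of two deviations each in $[-1,1]$ can never exceed 2, and the function names are hypothetical.

```python
import math

K_TILDE = 8272  # the constant \tilde{K} stated in the text


def phi_upper_bound(i, diam, d, c_prime=2.0):
    """Upper bound on phi_i(H) from (85), capped at the trivial bound 2.

    (85) is stated for phi_{i+1}, so for phi_i we substitute j = i - 1.
    c_prime is a placeholder for the unspecified universal constant c'.
    """
    if i < 2:
        return 2.0  # (85) starts at phi_2; fall back to the trivial bound
    j = i - 1
    if diam > 0:
        first = math.sqrt(diam * 2.0 ** (-j) * d * math.log2(2.0 / diam))
    else:
        first = 0.0  # treat the diam = 0 case as the limit of the first term
    second = 2.0 ** (-j) * d * j
    return min(c_prime * max(first, second), 2.0)


def u_tilde(i, diam, delta, phi):
    """Evaluate U~_i(H, delta) given diam(H) and a value (or bound) for phi_i(H)."""
    log_term = math.log(32 * i * i / delta)
    n = 2.0 ** (i - 1)  # half the size of the batch L*_i, as in the definition
    return min(K_TILDE * (phi + math.sqrt(diam * log_term / n) + log_term / n), 1.0)


if __name__ == "__main__":
    for i in (1, 20, 40):
        phi = phi_upper_bound(i, diam=1e-6, d=5)
        print(i, u_tilde(i, diam=1e-6, delta=0.01, phi=phi))
```

With the large constant $\tilde{K}$, the quantity is pinned at 1 for small $i$ and only becomes informative once $2^{i-1}$ is large and $\mathrm{diam}(\mathcal{H})$ is small.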
We also generally have φi(H)≤2 for every i∈N. The next lemma is taken from the work of Koltchinskii (2006) on data-dependent Rademacher complexity bounds on the excess risk.
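Lemma 57 For any $\delta \in (0, e^{-3})$, any $\mathcal{H} \subseteq \mathbb{C}$ with $f \in \mathrm{cl}(\mathcal{H})$, and any $i \in \mathbb{N}$, on an event $K_i$ with $\mathbb{P}(K_i) \geq 1 - \delta/4i^2$, $\forall h \in \mathcal{H}$,
$$\begin{aligned}
\mathrm{er}_{\mathcal{L}^\star_i}(h) - \min_{h' \in \mathcal{H}} \mathrm{er}_{\mathcal{L}^\star_i}(h') &\leq \mathrm{er}(h) - \mathrm{er}(f) + \hat{U}_i(\mathcal{H},\delta) \\
\mathrm{er}(h) - \mathrm{er}(f) &\leq \mathrm{er}_{\mathcal{L}^\star_i}(h) - \mathrm{er}_{\mathcal{L}^\star_i}(f) + \hat{U}_i(\mathcal{H},\delta) \\
\min\left\{\hat{U}_i(\mathcal{H},\delta), 1\right\} &\leq \tilde{U}_i(\mathcal{H},\delta).
\end{aligned}$$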
Lemma 57 essentially follows from a version of Talagrand’s inequality. The details of the proof may be extracted from the proofs of Koltchinskii (2006), and related derivations have previously been presented by Hanneke (2011) and Koltchinskii (2010). The only minor twist here is that $f$ need only be in $\mathrm{cl}(\mathcal{H})$, rather than in $\mathcal{H}$ itself, which easily follows from Koltchinskii’s original results, since the Borel-Cantelli lemma implies that with probability one, every $\varepsilon > 0$ has some $g \in \mathcal{H}(\varepsilon)$ (very close to $f$) with $\mathrm{er}_{\mathcal{L}^\star_i}(g) = \mathrm{er}_{\mathcal{L}^\star_i}(f)$.
For our purposes, the important implications of Lemma 57 are summarized by the following