Upper Bounds for Sample Complexity

3.2 Improving the Sample Complexity

In this section, we give some special cases of agnostic learning which allow improved bounds on the sample complexity. We also give improved bounds for closure-convex function classes.

3.2.1 Function Learning

For function learning, Theorem 3.2 can be used to obtain a better bound for the sample complexity. Setting $\nu = \epsilon$ and $\alpha = 1/2$, we get $\mathbf{E}(Q_f) < 3\hat{\mathbf{E}}_z(Q_f) + \epsilon$. For function learning, it is possible to set the empirical loss to zero by choosing an appropriate function $f$, hence giving $\mathbf{E}(Q_f) < \epsilon$.
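The step from the parameter choice to the factor 3 can be sketched as follows. This is only an illustration: it assumes that Theorem 3.2 (not reproduced in this section) bounds a relative deviation of the form shown in the first line, which is an assumption about its statement rather than a quotation of it.

```latex
% Assumption: Theorem 3.2 gives, with high probability and for all f,
% a relative-deviation bound of this form (nu > 0, 0 < alpha < 1).
\[
\frac{\mathbf{E}(Q_f) - \hat{\mathbf{E}}_z(Q_f)}
     {\nu + \mathbf{E}(Q_f) + \hat{\mathbf{E}}_z(Q_f)} < \alpha
\;\Longrightarrow\;
\mathbf{E}(Q_f) < \frac{1+\alpha}{1-\alpha}\,\hat{\mathbf{E}}_z(Q_f)
                + \frac{\alpha}{1-\alpha}\,\nu .
\]
% Substituting alpha = 1/2 and nu = epsilon:
\[
\mathbf{E}(Q_f) < 3\,\hat{\mathbf{E}}_z(Q_f) + \epsilon ,
\qquad\text{so}\qquad
\hat{\mathbf{E}}_z(Q_f) = 0 \;\Longrightarrow\; \mathbf{E}(Q_f) < \epsilon .
\]
```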

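With these values of $\alpha$ and $\nu$ together with Theorem 3.2 and Lemma 3.4, a sample size of

$$m \geq \frac{128T^2}{\epsilon}\left(\ln\left(\max_{x \in X^{2m}} \mathcal{N}\left(\frac{\epsilon}{96T}, \mathcal{F}_{|x}, l_1\right)\right) + \ln\frac{4}{\delta}\right)$$

suffices for agnostically learning $\mathcal{F}$.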

3.2.2 Learning with Noise

For learning with noise, Barron (1990) and McCaffrey & Gallant (1994) have shown that the sample complexity for functions with finite $L_\infty$ covering number is $O\left(\frac{1}{\epsilon}\left(\ln \mathcal{N}(\epsilon, \mathcal{F}, L_\infty) + \ln\frac{1}{\delta}\right)\right)$. The $L_\infty$ covering number is always at least as large as the $l_1$ covering number but may be considerably larger. For example, the class of sigmoid functions without a bound on the input weight size has a finite $l_1$ covering number (see Section 3.3) but cannot have a finite $L_\infty$ cover. (It is easy to see that for any finite set of functions, we can always find a sigmoid function with distance close to 1/2 from all the functions in the set, by considering linear threshold functions, which can be approximated arbitrarily closely by sigmoid functions.) We extend the result for learning with noise (Barron 1990, McCaffrey & Gallant 1994) to function classes with finite $l_1$ covering numbers by using the following theorem.
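The gap between the two notions of distance can be illustrated numerically. The sketch below compares two steep sigmoids whose thresholds differ slightly: on a finite sample their average ($l_1$) distance is small, while their pointwise (sup) distance is close to 1. The weight `W`, the thresholds, and the grids are hypothetical values chosen for illustration only; this is not the construction used in Section 3.3.

```python
import math


def sigmoid(w: float, t: float, x: float) -> float:
    """Logistic sigmoid with input weight w and threshold t, evaluated stably."""
    z = w * (x - t)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def empirical_l1(f, g, points):
    """Average absolute difference over a finite sample (empirical l_1 distance)."""
    return sum(abs(f(x) - g(x)) for x in points) / len(points)


def sup_distance(f, g, grid):
    """Largest absolute difference over a grid (a lower bound on the sup distance)."""
    return max(abs(f(x) - g(x)) for x in grid)


W = 1e4  # hypothetical large input weight: the sigmoid is nearly a threshold


def f(x):
    return sigmoid(W, 0.5, x)


def g(x):
    return sigmoid(W, 0.501, x)


sample = [i / 1000 for i in range(1001)]       # stand-in for a sample x_1..x_m
fine_grid = [i / 10000 for i in range(10001)]  # finer grid to probe the sup distance

l1 = empirical_l1(f, g, sample)
linf = sup_distance(f, g, fine_grid)
print(f"empirical l_1 distance: {l1:.4f}")
print(f"sup distance on grid:   {linf:.4f}")
```

Because the two functions differ appreciably only on a very short interval, few sample points fall where they disagree, which is why a finite $l_1$ cover can exist even when no finite $L_\infty$ cover does.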

Theorem 3.6 Let $\mathcal{F}$ be a permissible class of functions mapping from $X$ to $Y \subset [-T, T]$. Let $P$ be an arbitrary probability distribution on $Z = X \times Y$. Let $C = \max\{T, 1\}$. Assume $\nu, \nu_c > 0$, $0 < \alpha \leq 1/2$. Let $f^* \in \mathcal{F}$ where $f^*(x) = \mathbf{E}[Y \mid X = x]$ and $g_f(x, y) = (y - f(x))^2 - (y - f^*(x))^2$. Then for $m \geq 1$,

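$$P^m\left\{ z \in Z^m : \exists f \in \mathcal{F},\ \frac{\mathbf{E}(g_f) - \hat{\mathbf{E}}_z(g_f)}{\nu + \nu_c + \mathbf{E}(g_f)} \geq \alpha \right\} \leq \max_{x \in X^{2m}} \ldots\,\mathcal{N}(\ldots)\,\exp\left(-\alpha^2 \nu m / (\ldots)\right). \qquad (3.8)$$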

The proof is included in Appendix A. The main idea (in addition to the ideas of the proof of Theorem 3.2) is to bound the variance of the random variable $g_f(X, Y)$ in terms of its expectation and to use Bernstein's inequality to take advantage of the variance bound.
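A minimal sketch of the kind of variance bound meant here, assuming only that $Y$, $f$ and $f^*$ all take values in $[-T, T]$ and that $f^*(x) = \mathbf{E}[Y \mid X = x]$. The constant 16 comes from this crude argument and need not match the constants used in Appendix A.

```latex
% Factor the difference of squares:
\[
g_f(x, y) = (y - f(x))^2 - (y - f^*(x))^2
          = \bigl(f^*(x) - f(x)\bigr)\bigl(2y - f(x) - f^*(x)\bigr).
\]
% Conditioning on X and using E[Y | X = x] = f^*(x):
\[
\mathbf{E}\, g_f(X, Y) = \mathbf{E}\bigl(f(X) - f^*(X)\bigr)^2 .
\]
% Since |2y - f(x) - f^*(x)| <= 4T:
\[
\operatorname{Var}\, g_f(X, Y) \le \mathbf{E}\, g_f(X, Y)^2
  \le 16T^2\, \mathbf{E}\bigl(f(X) - f^*(X)\bigr)^2
  = 16T^2\, \mathbf{E}\, g_f(X, Y).
\]
```

A variance that shrinks with the expectation is what lets Bernstein's inequality give a deviation bound of order $1/\epsilon$ rather than $1/\epsilon^2$.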

To obtain a bound on the sample complexity, we first rescale the function class and target random variable by dividing by $T$ to give $C = 1$ and consider the new learning problem. (This rescaling trick allows us to obtain a sample complexity which has a $T^2$ term instead of a $T^4$ term.) The $\epsilon$ covering number of the scaled function class is the same as the $T\epsilon$ covering number of the unscaled function class. To get the correct accuracy when the function class is scaled back to the original scale, we need to learn to accuracy $\epsilon/T^2$. Assume the scaled function class is $\mathcal{F}$. Setting $\nu = \nu_c = \epsilon/(2T^2)$ and $\alpha = 1/2$ and the right hand side of (3.8) to $\delta$, we get, with probability $1 - \delta$, $\mathbf{E}(g_f) < 2\hat{\mathbf{E}}_z(g_f) + \epsilon/T^2$ for all $f \in \mathcal{F}$. From the definition of $g_f$, notice that it is possible to choose $f$ such that $\hat{\mathbf{E}}_z(g_f) \leq 0$ (since it is possible to choose the function giving the best empirical loss, which is no more than the empirical loss for $f^*$), giving $\mathbf{E}(g_f) < \epsilon/T^2$. Setting the right hand side of (3.8) to $\delta$ and solving for $m$ shows that

$$m \geq \frac{1000T^2}{\epsilon}\,(\ldots)$$

observations suffice for agnostically learning the function class.
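The algebra behind the rescaling argument can be sketched as follows, assuming the parameter choice $\nu = \nu_c = \epsilon/(2T^2)$ and $\alpha = 1/2$ (the same choice as in Section 3.2.3) and the relative-deviation form of (3.8). Primes denote the rescaled quantities; that notation is introduced here for illustration only.

```latex
% Rescaling: f' = f / T, Y' = Y / T, so squared losses shrink by T^2.
\[
(Y' - f'(X))^2 = \frac{(Y - f(X))^2}{T^2}
\quad\Longrightarrow\quad
\text{accuracy } \epsilon \text{ in the original scale}
\;\Leftrightarrow\;
\text{accuracy } \epsilon / T^2 \text{ in the new scale}.
\]
% On the complement of the event in (3.8):
\[
\frac{\mathbf{E}(g_f) - \hat{\mathbf{E}}_z(g_f)}{\nu + \nu_c + \mathbf{E}(g_f)} < \alpha
\;\Longrightarrow\;
\mathbf{E}(g_f) < \frac{1}{1-\alpha}\,\hat{\mathbf{E}}_z(g_f)
                + \frac{\alpha}{1-\alpha}\,(\nu + \nu_c).
\]
% With alpha = 1/2 and nu = nu_c = epsilon / (2 T^2):
\[
\mathbf{E}(g_f) < 2\,\hat{\mathbf{E}}_z(g_f) + \frac{\epsilon}{T^2},
\qquad
\hat{\mathbf{E}}_z(g_f) \le 0 \;\Longrightarrow\; \mathbf{E}(g_f) < \frac{\epsilon}{T^2}.
\]
```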

3.2.3 Agnostic Learning of Closure-Convex Function Classes

Given that it is possible to obtain better sample complexity (with respect to $\epsilon$) for the special cases of function learning and learning with noise, we would also like to investigate the possibilities for the more general agnostic case. However, better sample complexity is not possible without some conditions on the function class if we are restricted to hypotheses from the same class. For example, consider the class of functions which consists only of $f_1(x) = 0$ and $f_2(x) = 1$. Let the target be a $\{0, 1\}$ random variable which is 1 with probability $p$ and 0 with probability $1 - p$. The sample complexity for properly learning this function class with this type of target is $\Omega\left(\frac{1}{\epsilon^2}\ln\frac{1}{\delta}\right)$ (see Lemma 4.5).
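The two-function example can be explored numerically. The sketch below computes, exactly from the binomial distribution, how often empirical risk minimisation over $\{f_1, f_2\}$ picks the worse function, and searches for the smallest odd sample size at which that failure probability drops below $\delta$. It illustrates the behaviour of one particular learner only and is not the lower-bound argument of Lemma 4.5; the function names and the choice $p = 1/2 + \epsilon/2$ are assumptions made for this illustration.

```python
from math import comb


def erm_failure_prob(m: int, p: float) -> float:
    """Probability that ERM over {f1 = 0, f2 = 1} picks the worse function.

    Assumes p > 1/2, so f2 = 1 has the smaller expected squared loss (1 - p < p).
    The empirical loss of f1 is the fraction of ones, so ERM picks f1 when at
    most half of the m labels are 1 (ties are counted as failures).
    """
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(m // 2 + 1))


def samples_needed(eps: float, delta: float) -> int:
    """Smallest odd m for which ERM fails with probability at most delta.

    The target has p = 1/2 + eps/2, so choosing f1 costs an excess loss of
    p - (1 - p) = eps.
    """
    p = 0.5 + eps / 2
    m = 1
    while erm_failure_prob(m, p) > delta:
        m += 2  # odd sample sizes avoid ties
    return m


for eps in (0.2, 0.1):
    print(f"eps = {eps}: m = {samples_needed(eps, 0.05)}")
```

Halving $\epsilon$ roughly quadruples the required sample size, in line with a $1/\epsilon^2$ dependence.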

However, with closure-convex function classes, it is possible to obtain the same sample complexity bound as in the case of learning with noise. This is done by using the following theorem. The proof of the theorem is given in Appendix A. The convexity of the function class allows us to bound the variance of the random variable $g_f(X, Y)$ in terms of its expectation, hence giving the better bound. The theorem is given in a more general form which is useful for learning the convex hull of function classes in Chapter 5.

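Theorem 3.7 Let $\mathcal{F} = \bigcup_{k=1}^{\infty} \mathcal{F}_k$ be a closure-convex class of functions mapping from $X$ to $Y \subset [-T, T]$ such that each $\mathcal{F}_k$ is permissible. Let $P$ be an arbitrary probability distribution on $Z = X \times Y$. Let $\bar{\mathcal{F}}$ be the closure of $\mathcal{F}$ in the space with inner product $\langle f, g \rangle = \int f(x) g(x)\, dP_X(x)$. Let $C = \max\{T, 1\}$. Assume $\nu, \nu_c > 0$, $0 < \alpha \leq 1/2$. Let $f^*(x) = \mathbf{E}[Y \mid X = x]$ and $g_f(x, y) = (y - f(x))^2 - (y - f_a(x))^2$, where $f_a \in \bar{\mathcal{F}}$ and $f_a = \arg\min_{f \in \bar{\mathcal{F}}} \int (f(x) - f^*(x))^2\, dP_X(x)$. Then for $m \geq 1$ and each $k$,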

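$$P^m\left\{ z \in Z^m : \exists f \in \mathcal{F}_k,\ \frac{\mathbf{E}(g_f) - \hat{\mathbf{E}}_z(g_f)}{\nu + \nu_c + \mathbf{E}(g_f)} \geq \alpha \right\} \leq \max_{x \in X^{2m}} 6\,\mathcal{N}(\ldots)\,\exp(\ldots). \qquad (3.9)$$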

Let $\mathcal{F}_k = \mathcal{G}$ for $k = 1, \ldots, \infty$ where $\mathcal{G}$ is a closure-convex class of functions (hence $\mathcal{F} = \mathcal{G}$). As in the learning with noise case, we first rescale the function class and target random variable by dividing by $T$ to give $C = 1$ and consider the new learning problem. The $\epsilon$ covering number of the scaled function class is the same as the $T\epsilon$ covering number of the unscaled function class. To get the correct accuracy when the function class is scaled back to the original scale, we need to learn to accuracy $\epsilon/T^2$. Assume the scaled function class is $\mathcal{F}$. Set $\nu = \nu_c = \epsilon/(2T^2)$ and $\alpha = 1/2$, and set the right hand side of (3.9) to $\delta$ to get, with probability $1 - \delta$, $\mathbf{E}(g_f) < 2\hat{\mathbf{E}}_z(g_f) + \epsilon/T^2$ for all $f \in \mathcal{F}$. From the definition of $g_f$, again it is possible to choose $f$ such that $\hat{\mathbf{E}}_z(g_f) \leq 0$, giving $\mathbf{E}(g_f) < \epsilon/T^2$. Setting the right hand side of (3.9) to $\delta$ and solving for $m$ shows that

$$m \geq \frac{7000T^2}{\epsilon}\left(\ln\left(\max_{x \in X^{2m}} \mathcal{N}\left(\frac{\epsilon}{512T}, \mathcal{F}_{|x}, l_1\right)\right) + \ln\frac{6}{\delta}\right)$$

observations suffice for agnostically learning the function class.
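To get a feel for the size of such a bound, the following sketch evaluates an expression of the form $\frac{cT^2}{\epsilon}\left(\ln \mathcal{N} + \ln\frac{k}{\delta}\right)$. The default constants `lead=7000` and `conf_const=6` are read from the bound above; the log covering number is supplied by the caller as a plain number, which is a simplification, since in the bound it depends on $\epsilon$, $T$ and $m$. The function name and parameters are hypothetical.

```python
import math


def sample_size_bound(
    T: float,
    eps: float,
    delta: float,
    log_cover: float,
    lead: float = 7000.0,
    conf_const: float = 6.0,
) -> int:
    """Evaluate lead * T^2 / eps * (log_cover + ln(conf_const / delta)).

    log_cover stands in for the natural log of the l_1 covering number; it is
    treated here as a fixed input (an assumption made for illustration).
    """
    if not (eps > 0 and 0 < delta < 1 and T > 0):
        raise ValueError("need T > 0, eps > 0 and 0 < delta < 1")
    bound = lead * T**2 / eps * (log_cover + math.log(conf_const / delta))
    return math.ceil(bound)


# Hypothetical values: T = 1, log covering number 10, confidence 95%.
for eps in (0.1, 0.05, 0.025):
    print(f"eps = {eps}: m >= {sample_size_bound(1.0, eps, 0.05, 10.0)}")
```

With the covering term held fixed, the bound scales linearly in $1/\epsilon$ and quadratically in $T$.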

In document Agnostic learning and single hidden layer neural networks (Page 36-39)