In Chapter 4, we showed that if a function class is not closure-convex, then the sample complexity
Lemma 5.6 Let $x = (x_1, \ldots, x_m) \in X^m$. Then
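Proof. Let $U$ be an $\epsilon/K$-cover for $\mathcal{G}_{|x}$ with $|U| = N(\epsilon/K, \mathcal{G}_{|x}, d_{l^1})$. Pick an arbitrary function $Kf$ where $f \in \mathcal{G}$. Pick $v \in U$ such that $d_{l^1}(f_{|x}, v) < \epsilon/K$. Then
$$\frac{1}{m}\sum_{i=1}^{m} |Kf(x_i) - Kv_i| < \epsilon.$$
Obviously, $\{Kv, -Kv : v \in U\} \cup \{(K, \ldots, K), (-K, \ldots, -K)\}$ is an $\epsilon$-cover. •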
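The scaling step of this proof can be checked numerically on a toy example. The sketch below is illustrative only: the helper names (`d_l1`, `greedy_cover`, `is_cover`, `check_scaling`) and the random vectors standing in for restrictions of functions to a sample are assumptions, not part of the text. It builds a cover at scale $\epsilon/K$ under the normalized $l^1$ distance and verifies that multiplying its elements by $\pm K$ covers the scaled vectors at scale $\epsilon$. The two constant vectors from the proof are left out because the toy class does not need them.

```python
import numpy as np

def d_l1(u, v):
    """Normalized l1 distance (1/m) * sum_i |u_i - v_i|."""
    return float(np.mean(np.abs(np.asarray(u) - np.asarray(v))))

def greedy_cover(points, eps):
    """Greedily select points until every point is within eps of a selected one."""
    cover = []
    for p in points:
        if not any(d_l1(p, c) < eps for c in cover):
            cover.append(p)
    return cover

def is_cover(cover, points, eps):
    """True if each point is at distance < eps from some element of cover."""
    return all(any(d_l1(p, c) < eps for c in cover) for p in points)

def check_scaling(K=3.0, eps=0.3, m=8, n_funcs=200, seed=0):
    """Build an (eps/K)-cover of toy vectors and test the scaled cover at scale eps."""
    rng = np.random.default_rng(seed)
    # Toy stand-ins for the restrictions g|x of functions g to a sample x (assumption).
    g_x = [rng.uniform(-1.0, 1.0, size=m) for _ in range(n_funcs)]
    U = greedy_cover(g_x, eps / K)
    scaled_class = [K * g for g in g_x] + [-K * g for g in g_x]
    scaled_cover = [K * v for v in U] + [-K * v for v in U]
    return len(U), len(scaled_cover), is_cover(scaled_cover, scaled_class, eps)

size_U, size_scaled, covered = check_scaling()
print(size_U, size_scaled, covered)
```

The scaled cover has exactly twice as many elements as `U`, which is where the factor of two in the covering number of the scaled class comes from.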
Lemma 5.7 Let $x = (x_1, \ldots, x_m) \in X^m$. Then
< < + l ) ' - (5.1)
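Proof. Let $U$ be an $\epsilon$-cover for $\mathcal{G}_{K|x}$ with $|U| = N(\epsilon, \mathcal{G}_{K|x}, d_{l^1})$. Let $f = \frac{1}{k}\sum_{i=1}^{k} h_i$ ($h_i \in \mathcal{G}_K$, $i = 1, \ldots, k$) be a function in the class. For each $h_i$, pick a member $u_i$ of $U$ such that $d_{l^1}(h_{i|x}, u_i) < \epsilon$, and write $u_i = (u_{i1}, \ldots, u_{im})$. Then
$$\frac{1}{m}\sum_{j=1}^{m}\left| f(x_j) - \frac{1}{k}\sum_{i=1}^{k} u_{ij} \right| = \frac{1}{m}\sum_{j=1}^{m}\left| \frac{1}{k}\sum_{i=1}^{k} \bigl(h_i(x_j) - u_{ij}\bigr) \right| \le \frac{1}{k}\sum_{i=1}^{k}\frac{1}{m}\sum_{j=1}^{m} |h_i(x_j) - u_{ij}| < \epsilon.$$
So for any $f_{|x}$ there is a vector in the set $\{\frac{1}{k}\sum_{i=1}^{k} u_i : u_i \in U\}$ with distance less than $\epsilon$ from it. Since $|\{\frac{1}{k}\sum_{i=1}^{k} u_i : u_i \in U\}| \le |U|^k$, the first inequality in (5.1) follows. The second inequality in (5.1) follows from Lemma 5.6. •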
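The averaging argument can also be illustrated numerically. The following is a minimal sketch under stated assumptions: the function names, the random toy vectors, and the greedy cover construction are hypothetical and chosen only for illustration. It checks that replacing each term of an average by a nearby cover element keeps the average within $\epsilon$ in the normalized $l^1$ distance, and that the set of averaged cover elements has at most $|U|^k$ members.

```python
import itertools
import numpy as np

def d_l1(u, v):
    """Normalized l1 distance (1/m) * sum_j |u_j - v_j|."""
    return float(np.mean(np.abs(np.asarray(u) - np.asarray(v))))

def greedy_cover(points, eps):
    """Greedily select points until every point is within eps of a selected one."""
    cover = []
    for p in points:
        if not any(d_l1(p, c) < eps for c in cover):
            cover.append(p)
    return cover

def averaged_cover(U, k):
    """All vectors (1/k) * (u_1 + ... + u_k) with each u_i drawn from U."""
    return [np.mean(combo, axis=0) for combo in itertools.product(U, repeat=k)]

def check_averaging(k=3, eps=0.4, m=6, n_funcs=12, trials=50, seed=1):
    """Return |U|, the number of averaged cover vectors, and the worst distance seen."""
    rng = np.random.default_rng(seed)
    # Toy stand-ins for the restrictions h|x of the component functions (assumption).
    h_x = [rng.uniform(-1.0, 1.0, size=m) for _ in range(n_funcs)]
    U = greedy_cover(h_x, eps)
    worst = 0.0
    for _ in range(trials):
        idx = rng.integers(0, n_funcs, size=k)
        hs = [h_x[i] for i in idx]
        f_x = np.mean(hs, axis=0)                       # f = (1/k) * sum_i h_i
        us = [min(U, key=lambda u: d_l1(h, u)) for h in hs]  # nearest cover element
        worst = max(worst, d_l1(f_x, np.mean(us, axis=0)))
    return len(U), len(averaged_cover(U, k)), worst

n_U, n_avg, worst = check_averaging()
print(n_U, n_avg, worst)
```

Enumerating all $|U|^k$ averages is only feasible for very small `k`; the proof needs the count, not the enumeration.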
We now give the proof of Theorem 5.2.
Proof. (Theorem 5.2) First note that M ^ is convex and hence closure-convex. We also have
-^ik ^ -^K permissible for each k. Scale the function class and target random variable by dividing by $C$. The $\epsilon$ covering number of the scaled function class is the same as the $C\epsilon$ covering number of the unscaled class. By learning to accuracy $\epsilon/C^2$ and rescaling back, we obtain the desired bound. Assume the scaled function class is $\mathcal{F}$. In Theorem 3.7, set a = X/l and use
Theorem 3.7 and Lemma 5.7 with = f c = e / 4 C ^ to get
P - { z G : 3 / e E [(y - f{x)f - {y - / , ( x ) )
> lEz [(y - f { x ) f - (y - fa{x)Y + e /{IC^)] < 6 max N . , ^^, „,
~ V 1 0 2 4 C < 6 x 2 * = max N
\Q2ACK' + e x p ( - e m / 1 4 0 0 0 C 2 ) . ( 5 2 )
Suppose $f'(x) = E_z[y \mid X = x]$. Let $f_k$ be the estimated function and let $f_a$ be the function in the convex closure which minimizes the empirical error. Then $E_z[(y - f_k(x))^2 - (y - f_a(x))^2] = E_z[(f'(x) - f_k(x))^2 - (f'(x) - f_a(x))^2]$. Note that Bz
Ez
( f i x ) - A(x))2 - ( f i x ) - fa{x)f
ihix) - fa{x))^
< Ez {f'{x)-Mx))^-{f'{x)-fa{x)y
In Lemma 5.5, set $c = 1$. To get approximation within $\epsilon/(4C^2)$ (with respect to the empirical mean squared error), we require $k \ge 4C^2/\epsilon$. Setting the right hand side of (5.2) to be $\delta$ and $k = 4C^2/\epsilon$, we see that
m -- 14000^2 e In max N
e \ e xex^rn \ \ 1 0 2 4 C K
\ \ AC^ 6 + l n 2 + l n -
will suffice for agnostic learning. •
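The last step, solving for $m$ once the right hand side of (5.2) is set to $\delta$, can be sketched in code. This is only a sketch under an explicit assumption: the bound is taken to have the form $A \exp(-\epsilon m / (14000 C^2))$, where $A$ collects the covering-number prefactor. The function name `sample_size` and the argument `log_prefactor` (standing for $\ln A$) are hypothetical; the value of $A$ has to be supplied from the covering number bound and is not computed here.

```python
import math

def sample_size(eps, delta, C, log_prefactor):
    """Smallest integer m with exp(log_prefactor - eps * m / (14000 * C**2)) <= delta.

    log_prefactor is ln(A), where A is the (assumed) multiplicative prefactor
    in front of the exponential term of the tail bound.
    """
    if not (eps > 0 and 0 < delta < 1 and C > 0):
        raise ValueError("need eps > 0, 0 < delta < 1 and C > 0")
    # Solve A * exp(-eps * m / (14000 C^2)) = delta for m.
    m = (14000.0 * C ** 2 / eps) * (log_prefactor + math.log(1.0 / delta))
    return max(0, math.ceil(m))

# Hypothetical inputs, chosen only to exercise the function.
m_needed = sample_size(eps=0.1, delta=0.05, C=1.0, log_prefactor=10.0)
print(m_needed)
```

If the prefactor grows like $B^k$ for some base $B$, then $\ln A$ contains the term $k \ln B$, so the choice $k = 4C^2/\epsilon$ contributes an extra factor of order $1/\epsilon$ to the sample size.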
5.3 Discussion
Corollary 5.3 shows that for function classes with finite pseudo-dimension which are not closure-convex, the sample complexity for agnostically learning the convex hull of the function class is at worst within a logarithmic factor of the sample complexity for properly agnostically learning the function class itself. The convex hull can be learned by increasing the number of hidden units as a function of the required accuracy. Learning the convex hull gives better approximation capabilities and hence may be preferable to properly agnostically learning the function class in view of the sample complexity bounds.
The function class $\mathcal{F}_1$ is in the closure of the convex hull of single hidden layer neural networks with linear threshold hidden units. Since the pseudo-dimension of linear threshold units is $n + 1$, the class $\mathcal{F}_1$ is agnostically learnable with sample complexity $O\left(\frac{1}{\epsilon}\left(\frac{n}{\epsilon}\ln\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)\right)$. Barron (1992) has also shown a lower bound on the sample complexity for learning $\mathcal{F}_1$ using an arbitrary estimator. Hence the sample complexity bound is close to optimal for learning $\mathcal{F}_1$. The bound is also close to optimal for learning the class of single hidden layer neural networks with linear threshold hidden units, since functions in $\mathcal{F}_1$ can be approximated arbitrarily closely by single hidden layer neural networks.
There are also function classes for which using the convex hull as the hypothesis class (instead of doing proper agnostic learning) results in a much better sample complexity. For example, if $\mathcal{G}$ has a finite number of functions, then the pseudo-dimension of the convex hull of $\mathcal{G}$ is bounded by $|\mathcal{G}|$ (the convex hull is a subset of a $|\mathcal{G}|$-dimensional vector space of functions, hence, as mentioned in Section 3.3.1, the pseudo-dimension is bounded by $|\mathcal{G}|$ (Dudley 1978)). Since the pseudo-dimension is finite, Corollary 3.10 shows that the sample complexity is $O\left(\frac{1}{\epsilon}\left(|\mathcal{G}|\ln\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)\right)$. In contrast, since $\mathcal{G}$ is not closure-convex, the sample complexity for properly agnostically learning $\mathcal{G}$ is $\Omega(\ln(1/\delta)/\epsilon^2)$. This shows that for such classes, by learning the convex hull of the function class, not only do we get better approximation, we also get a better sample complexity. Since the upper bound is smaller than $\Omega(\ln(1/\delta)/\epsilon^2)$ (for small enough $\epsilon$), this also shows that the lower bound for the sample complexity for learning function classes which are not closure-convex only holds for proper agnostic learning and not agnostic learning in general.
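The comparison between the two rates can be made concrete with a small numerical sketch. It rests on assumptions that should be read as such: the upper bound for the convex hull is taken to have the form $\frac{1}{\epsilon}(d \ln\frac{1}{\epsilon} + \ln\frac{1}{\delta})$ with $d$ the number of functions in the finite class, the lower bound for proper learning is taken as $\ln(1/\delta)/\epsilon^2$, and all constants are dropped. The function names `hull_rate` and `proper_rate` are hypothetical.

```python
import math

def hull_rate(eps, delta, d):
    """Assumed upper-bound rate (1/eps) * (d * ln(1/eps) + ln(1/delta)), constants dropped."""
    return (d * math.log(1.0 / eps) + math.log(1.0 / delta)) / eps

def proper_rate(eps, delta):
    """Lower-bound rate ln(1/delta) / eps^2 for proper agnostic learning, constants dropped."""
    return math.log(1.0 / delta) / eps ** 2

# Illustrative values only: a finite class of d = 50 functions, delta = 0.01.
d, delta = 50, 0.01
for eps in (0.5, 0.1, 1e-2, 1e-3, 1e-4):
    print(eps, hull_rate(eps, delta, d), proper_rate(eps, delta))
```

The ratio of the two rates is $\epsilon\,(d \ln(1/\epsilon) + \ln(1/\delta)) / \ln(1/\delta)$, which tends to zero as $\epsilon \to 0$. For large $\epsilon$ the hull rate can be the larger of the two, which is why the statement is qualified with "for small enough $\epsilon$".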
eters I am not worthy to calculate and yet I will design it for you. A computer which can calculate the Question to the Ultimate Answer. '
— Deep Thought, in The Hitchhiker's Guide to the Galaxy.