From Finite to Infinite Classes: First Attempt

Just as we did in Eq. (8.4), we pass to the tree process on $\mathcal{F}$ itself, rather than the loss class. This step is a version of the contraction principle as stated in Lemma 12.9 for the i.i.d. case.

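Lemma 13.1. For $\mathcal{Y}=\{0,1\}$, $\mathcal{F}\subseteq\mathcal{Y}^{\mathcal{X}}$, and $\ell(f,(x,y))=\mathbb{I}\{f(x)\neq y\}$,

$$\sup_{\mathbf{x},\mathbf{y}}\ \mathbb{E}\sup_{g\in\ell(\mathcal{F})} T_g \;=\; \sup_{\mathbf{x}}\ \mathbb{E}\sup_{f\in\mathcal{F}} T_f \tag{13.2}$$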

With the definition of sequential Rademacher complexity, the statement can be written more succinctly as

$$\mathcal{R}^{\mathrm{seq}}(\ell(\mathcal{F})) = \mathcal{R}^{\mathrm{seq}}(\mathcal{F}).$$

A proof of this lemma requires a few intermediate steps, and postponing it until the end of the lecture seems like a good idea.

13.1 From Finite to Infinite Classes: First Attempt

Lemma 9.5 gives us a handle on the supremum of the tree process for a finite class $\mathcal{F}$. Mimicking the development for the i.i.d. case, we would like to now encompass infinite classes as well. The threshold example is not interesting since the tree process does not converge to zero, according to Theorem 8.1. Consider a different example instead. The example is rather trivial, but will serve us for the purposes of demonstration. Let $\mathcal{X}=[0,1]$ and

$$\mathcal{F}=\{f_a : a\in[0,1],\ f_a(x)=0\ \ \forall x\neq a,\ f_a(a)=1\}, \tag{13.3}$$

the class of functions that are zero everywhere except on one designated point. This is an infinite class and the question is whether the expected supremum of the tree process decays to zero with increasing $n$ for any $\mathcal{X}$-valued tree $\mathbf{x}$.

Now, recall our development for the i.i.d. case with the class of thresholds. We argued that if we condition on the data, the effective number of possible values that the functions take is finite even if the class is infinite. This led us to the idea of the growth function.

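Following the analogy, let us define $\mathcal{F}|_{\mathbf{x}}$ as the set of all $\{0,1\}$-valued trees

$$\mathcal{F}|_{\mathbf{x}}=\{f\circ\mathbf{x} : f\in\mathcal{F}\}$$

where $f\circ\mathbf{x}$ is defined as a tree $(f\circ\mathbf{x}_1, f\circ\mathbf{x}_2, \ldots, f\circ\mathbf{x}_n)$. Since $\mathbf{x}_t$ is a function from $\{\pm1\}^{t-1}$ to $\mathcal{X}$, the $t$-th level $f\circ\mathbf{x}_t$ of the $f\circ\mathbf{x}$ tree is a function from $\{\pm1\}^{t-1}$ to $\{0,1\}$. Thinking of $\mathcal{F}|_{\mathbf{x}}$ as the projection (or, an “imprint”) of $\mathcal{F}$ on $\mathbf{x}$, we can write

$$\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}\epsilon_t f(\mathbf{x}_t(\epsilon_{1:t-1})) \;=\; \mathbb{E}\max_{\mathbf{v}\in\mathcal{F}|_{\mathbf{x}}}\frac{1}{n}\sum_{t=1}^{n}\epsilon_t \mathbf{v}_t(\epsilon_{1:t-1}) \tag{13.4}$$

because the set $\mathcal{F}|_{\mathbf{x}}$ is clearly finite.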

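Given $\mathbf{x}$, how can we describe $\mathcal{F}|_{\mathbf{x}}$? For any $f_a\in\mathcal{F}$, $f_a\circ\mathbf{x}$ is a tree that is zero everywhere except for those (if any) nodes in the tree which are equal to $a$. Observe that for two distinct functions $f_a,f_b\in\mathcal{F}$, $f_a\circ\mathbf{x}=f_b\circ\mathbf{x}$ if and only if $a,b\in[0,1]$ both do not appear in the tree $\mathbf{x}$:

$$f_a\circ\mathbf{x}=f_b\circ\mathbf{x} \iff a,b\notin\mathrm{Img}(\mathbf{x})\ \text{ or }\ a=b.$$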

Suppose $\mathbf{x}$ is such that $\mathbf{x}_n:\{\pm1\}^{n-1}\to[0,1]$ takes on $2^{n-1}$ distinct values, one for each path. Then

$$\mathrm{card}(\mathcal{F}|_{\mathbf{x}})\ \ge\ 2^{n-1}.$$
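The size of the projection can be checked by brute force on small trees. The following is a minimal sketch, not part of the original notes: the tree representation (a dictionary keyed by sign prefixes) and the helper names `distinct_tree` and `projection` are assumptions made for illustration. It builds a tree whose labels are all distinct and counts the distinct trees $f_a\circ\mathbf{x}$, using the fact that every $a$ outside the image of $\mathbf{x}$ yields the same zero tree.

```python
from fractions import Fraction
from itertools import product

def distinct_tree(n):
    """Hypothetical depth-n tree with pairwise distinct labels in (0, 1).
    Keys are sign prefixes (eps_1, ..., eps_{t-1}) for t = 1, ..., n."""
    nodes = [p for t in range(n) for p in product((-1, 1), repeat=t)]
    return {p: Fraction(i + 1, len(nodes) + 1) for i, p in enumerate(nodes)}

def projection(x):
    """F|_x as a set of trees, each stored as the frozenset of nodes where f_a o x = 1."""
    outside = Fraction(0)  # every label above is > 0, so f_0 o x is the zero tree
    candidates = set(x.values()) | {outside}
    return {frozenset(p for p, lab in x.items() if lab == a) for a in candidates}

for n in range(1, 6):
    size = len(projection(distinct_tree(n)))
    # Compare the projection size with the lower bound 2^(n-1) from the text.
    print(n, size, 2 ** (n - 1))
```

With all labels distinct, each of the $2^n-1$ nodes contributes its own tree and the zero tree adds one more, so the count grows exponentially in $n$ and is at least $2^{n-1}$.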

Since this cardinality is exponential in $n$, Lemma 9.5 gives a vacuous bound. It appears that there is no hope for passing from a finite to an infinite class by studying the size of the projection of $\mathcal{F}$ on $\mathbf{x}$, as it will be exponential in $n$ for any nontrivial class and a “diverse enough” $\mathbf{x}$. But maybe this is for a good reason, and the problem is not learnable, just like in the case of thresholds? This turns out to be not the case:

Exercise (??): Provide a strategy for the learner that will suffer $O\big(\sqrt{(\log n)/n}\big)$ regret for the binary prediction problem with the class defined in (13.3). Can you get an $O(1/\sqrt{n})$ algorithm? (This should be doable.)

Of course, one may hypothesize that the problem might be learnable but the bound of Theorem 7.9 is loose. This is again not the case, and one can prove that the expected supremum of the tree process indeed converges to zero for any $\mathbf{x}$, but

13.2 From Finite to Infinite Classes: Second Attempt

As we have already noticed, the cardinality of the set $\mathcal{F}|_{\mathbf{x}}$ is too large. However, this does not reflect the fact that every $f_a\circ\mathbf{x}$, in the case of the function class defined in (13.3), is quite “simple”: the values are zero, except when $\mathbf{x}_t=a$.

First, for illustration purposes, consider the case when the tree $\mathbf{x}$ contains unique elements along any path (left-most tree in Figure 13.1). That is, for any $(\epsilon_1,\ldots,\epsilon_n)$, if $\mathbf{x}_t(\epsilon_{1:t-1})=\mathbf{x}_s(\epsilon_{1:s-1})$ then $s=t$. In this case, $f\circ\mathbf{x}$ is a tree with at most a single value 1 along any path. Consider the set $V=\{\mathbf{v}^{(0)},\mathbf{v}^{(1)},\ldots,\mathbf{v}^{(n)}\}$ of $n+1$ binary-valued trees defined as follows. The tree $\mathbf{v}^{(0)}$ is identically 0-valued. For $j\geq 1$, the tree $\mathbf{v}^{(j)}$ has zeros everywhere except for the $j$-th level. That is, for any $j\in\{1,\ldots,n\}$, $\mathbf{v}^{(j)}_t(\epsilon_{1:t-1})=1$ whenever $t=j$ and zero otherwise. These trees are depicted in Figure 13.1.

Figure 13.1: Left: an $\mathcal{X}$-valued tree $\mathbf{x}$ with unique elements within any path. Right: first three of the $n+1$ covering trees $\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \mathbf{v}^{(3)}$. Circled nodes are defined to be 1 while the rest are 0.

Given the uniqueness of values of $\mathbf{x}$ along any path, we have the following important property:

$$\forall f\in\mathcal{F},\ \ \forall(\epsilon_1,\ldots,\epsilon_n)\in\{\pm1\}^n,\ \ \exists\,\mathbf{v}\in V\ \ \text{s.t.}\ \ f(\mathbf{x}_t(\epsilon_{1:t-1}))=\mathbf{v}_t(\epsilon_{1:t-1})\ \ \forall t\in\{1,\ldots,n\} \tag{13.5}$$

That is, for any $f\in\mathcal{F}$ and any path, there exists a “covering tree” $\mathbf{v}$ in $V$ such that $f$ on $\mathbf{x}$ agrees with $\mathbf{v}$ on the given path. For instance, consider a function $f_d\in\mathcal{F}$ which takes values zero everywhere except on $x=d$. Consider the left-most path $\epsilon=(-1,-1,-1,\ldots)$. Then, for the $\mathbf{x}$ given in Figure 13.1, the function takes on a value 1 at the third node, since $\mathbf{x}_3(-1,-1)=d$, and zero everywhere else. But then the covering tree $\mathbf{v}^{(3)}$ in Figure 13.1 provides exactly these values on the path $(-1,-1,-1,\ldots)$. It is not hard to convince oneself that the property (13.5) holds true.

What is crucial is that for such a tree $\mathbf{x}$,

$$\mathcal{R}^{\mathrm{seq}}(\mathcal{F},\mathbf{x}) = \mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}\epsilon_t f(\mathbf{x}_t(\epsilon_{1:t-1})) = \mathbb{E}\max_{\mathbf{v}\in V}\frac{1}{n}\sum_{t=1}^{n}\epsilon_t \mathbf{v}_t(\epsilon_{1:t-1}) \le \sqrt{\frac{2\log(n+1)}{n}} \tag{13.6}$$

by Lemma 9.5 since $\mathrm{card}(V)=n+1$. While this achieves our goal, the argument crucially relied on the fact that $\mathbf{x}$ contains unique elements along any path. What if $\mathbf{x}$ has several identical elements $a$ along some path? In this case, the function $f_a$ will take on the value 1 on multiple nodes on this path, and the set $V$ defined above no longer does the job. A straightforward attempt to create a set $V$ with all possible subsets of rows (taking on the value 1) will fail, as there is an exponential number of such possibilities.
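Both halves of this discussion can be checked numerically on a small example. The sketch below is illustrative only: the depth-3 tree, its letter labels (standing for points of $[0,1]$) and the helper names are assumptions, not taken from the notes. It verifies property (13.5) for the level trees when labels are unique along every path, compares the two expectations in (13.6) by enumerating all sign sequences, and then shows that repeating a label along a path breaks the cover.

```python
from itertools import product
from math import log, sqrt

n = 3
paths = list(product((-1, 1), repeat=n))

# Hypothetical depth-3 tree: keys are sign prefixes (eps_1, ..., eps_{t-1});
# the letters stand for points of [0, 1] and never repeat along a path.
x = {(): "a", (-1,): "c", (1,): "b",
     (-1, -1): "d", (-1, 1): "e", (1, -1): "e", (1, 1): "d"}

def on_path(tree, eps):
    """Values (tree_1, tree_2(eps_1), ..., tree_n(eps_{1:n-1})) along the path eps."""
    return tuple(tree[eps[:t]] for t in range(n))

def projection(x):
    """F|_x: one tree f_a o x per label a in Img(x), plus the zero tree."""
    trees = [{p: int(lab == a) for p, lab in x.items()} for a in sorted(set(x.values()))]
    return trees + [{p: 0 for p in x}]

def covers(V, trees):
    """Property (13.5): on every path, every tree agrees with some v in V."""
    return all(any(on_path(f, eps) == on_path(v, eps) for v in V)
               for f in trees for eps in paths)

def expected_max(trees):
    """E max over trees of (1/n) sum_t eps_t * tree_t(eps_{1:t-1}), enumerating all paths."""
    best = [max(sum(e * b for e, b in zip(eps, on_path(tr, eps))) for tr in trees)
            for eps in paths]
    return sum(best) / (n * len(paths))

# Level trees v^(0), ..., v^(n): v^(j) equals 1 exactly on level j.
V = [{p: int(len(p) == j - 1) for p in x} for j in range(n + 1)]

bound = sqrt(2 * log(n + 1) / n)
print(covers(V, projection(x)), expected_max(projection(x)), expected_max(V), bound)

# Repeating the root label further down a path breaks the level-tree cover.
x_rep = dict(x)
x_rep[(-1,)] = "a"
print(covers(V, projection(x_rep)))
```

In the modified tree `x_rep`, the function $f_a$ takes the value 1 at both the first and second node of every path that starts with $-1$, which no single level tree can reproduce.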

It is indeed remarkable that there exists a set of $\{0,1\}$-valued trees $V$ of cardinality at most $n+1$ that satisfies property (13.5) without the assumption of uniqueness of elements along the paths. Here is how such a set can be constructed inductively. Suppose we have at our disposal two sets $V_\ell$ and $V_r$ of covering trees of depth $n-1$ for the left and right subtrees of $\mathbf{x}$ at the root. Suppose that, inductively, on the two subtrees the sets $V_\ell$ and $V_r$ satisfy property (13.5). For a $\mathbf{v}_\ell\in V_\ell$ and $\mathbf{v}_r\in V_r$ define a joined tree $\mathbf{v}$ as having 0 at the root and $\mathbf{v}_\ell$ and $\mathbf{v}_r$ as the two subtrees at the root (see Figure 13.2). In this fashion, take pairs from both sets such that each element of $V_\ell$ and $V_r$ occurs in at least one pair. This construction gives a set $V$ of size $\max\{\mathrm{card}(V_\ell),\mathrm{card}(V_r)\}$. To this set we add one more tree $\mathbf{v}^0$, which is zero everywhere except on those nodes where $\mathbf{x}$ has the same value as the root $\mathbf{x}_1$ (in Figure 13.2, $\mathbf{x}_1=a$ and $\mathbf{v}^0$ is constructed accordingly). That is,

$$\mathbf{v}^0_t(\epsilon_{1:t-1})=1\ \text{ if }\ \mathbf{x}_t(\epsilon_{1:t-1})=\mathbf{x}_1\quad\text{and}\quad \mathbf{v}^0_t(\epsilon_{1:t-1})=0\ \text{ otherwise.}$$

Figure 13.2: $V$ is constructed inductively by pairing up trees in $V_\ell$ and $V_r$, plus an additional tree $\mathbf{v}^0$ (bottom right) which takes on value 1 only when $\mathbf{x}$ (top right) at the corresponding node takes on the value $\mathbf{x}_1$.

We claim that the set $V$ satisfies (13.5) and its size is larger than that of $V_\ell$ and $V_r$ by at most 1. Indeed, for $f_{\mathbf{x}_1}\in\mathcal{F}$ the tree $\mathbf{v}^0$ matches the values along any path. For other $f\in\mathcal{F}$, the value at the root is zero, and (13.5) holds by induction on both subtrees.

The size of $V$ increases by at most one when passing from depth $n-1$ to $n$. For the base of the induction, we require 2 trees for $n=1$: one with the value 0 and one with the value 1 at the single vertex. We conclude that the size of the set satisfying (13.5) is at most $n+1$. As a consequence, the bound of (13.6) holds for any $\mathbf{x}$. We conclude that for the example under consideration with the function class $\mathcal{F}$ defined in (13.3),

$$\mathcal{V}^{\mathrm{seq}}(\mathcal{F},n)\ \le\ 2\sup_{\mathbf{x}}\mathbb{E}\sup_{f\in\mathcal{F}} T_f\ \le\ 2\sqrt{\frac{2\log(n+1)}{n}} \tag{13.7}$$
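The inductive construction translates directly into a short recursive procedure. The following is a sketch under stated assumptions rather than code from the notes: trees are dictionaries keyed by sign prefixes, the names `subtree`, `build_cover` and `satisfies_13_5` are hypothetical, and pairing is done by index, reusing the last tree if one side runs out. The test tree draws labels from a three-letter alphabet so that repeats along paths are common.

```python
from itertools import product
import random

def subtree(x, sign):
    """Subtree of x hanging off the root in direction sign, re-rooted at the empty prefix."""
    return {p[1:]: lab for p, lab in x.items() if p and p[0] == sign}

def build_cover(x, n):
    """Covering trees for the depth-n tree x (dict: sign prefix -> label), built inductively."""
    if n == 1:
        return [{(): 0}, {(): 1}]  # base case: value 0 or value 1 at the single vertex
    V_l = build_cover(subtree(x, -1), n - 1)
    V_r = build_cover(subtree(x, 1), n - 1)
    cover = []
    for i in range(max(len(V_l), len(V_r))):
        v_l = V_l[min(i, len(V_l) - 1)]  # reuse the last tree if one side runs out
        v_r = V_r[min(i, len(V_r) - 1)]
        joined = {(): 0}                 # joined tree has 0 at the root
        joined.update({(-1,) + p: b for p, b in v_l.items()})
        joined.update({(1,) + p: b for p, b in v_r.items()})
        cover.append(joined)
    # Extra tree v^0: equals 1 exactly where x repeats the root label x_1.
    cover.append({p: int(lab == x[()]) for p, lab in x.items()})
    return cover

def satisfies_13_5(x, V, n):
    """Brute-force check of property (13.5) over all f_a and all paths."""
    labels = set(x.values()) | {None}  # None stands for any a outside Img(x)
    for a in labels:
        for eps in product((-1, 1), repeat=n):
            f_path = tuple(int(x[eps[:t]] == a) for t in range(n))
            if not any(tuple(v[eps[:t]] for t in range(n)) == f_path for v in V):
                return False
    return True

rng = random.Random(0)
n = 4
x = {p: rng.choice("abc") for t in range(n) for p in product((-1, 1), repeat=t)}
V = build_cover(x, n)
print(len(V), n + 1, satisfies_13_5(x, V, n))
```

Pairing by index is only one way to ensure that every element of $V_\ell$ and $V_r$ occurs in at least one pair; any other pairing with that property would serve equally well for the argument.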
