The Growth Function and the VC Dimension - Statistical Learning and Sequential Prediction

The size of the projection F|Xn played an important role in obtaining an upper bound on Rademacher averages for the case of classification with thresholds. The same ideas carry over to a general classification setting.

Definition 11.1. For a binary valued function classF⊂{0, 1}X, thegrowth function

is defined as ΠF(n)=max n card¡F|x1,...,xn ¢ :x1, . . . ,xn∈X o (11.5) We have the following proposition that follows immediately from our previous arguments.

Proposition 11.2. For classification with some domainX, label setY={0, 1}, class of functionF⊆YXand loss function`(f, (x,y))=I©f(x)6=yª

, it holds that for any distribution PX×Y, E sup g∈`(F) Gg≤2Esup f Sf ≤2 s 2 logΠ_F(n) n

The growth function measures expressiveness ofF. In particular, ifFcan produce all possible signs (that is,Π_F(n)=2n), the bound becomes useless.

Definition 11.3. We say thatFshatterssome setx1, . . . ,xn(a term due to J. Michael Steele) ifF|xn={0, 1}n. That is,

It is possible to show that if a setx1, . . . ,xn is shattered for somen, then not only is the bound of Proposition11.2vacuous, but the learning problem itself (for the givenn) is impossible.

P Exercise (? ? ?): Prove the above statement.

We see that the growth function is quite important as a measure of complexity ofFwhen it comes to distribution-free learning. But what is the behavior of the growth function? The situation turns out to be quite interesting. Define the following combinatorial parameter ofF:

Definition 11.4. The Vapnik-Chervonenkis (VC) dimension of the classF⊆{0, 1}X is defined as

vc (F),max

t : Π_F(t)=2t

or vc (F)= ∞if max does not exist. Further, forx1, . . . ,xn∈X, define vc (F,x1, . . . ,xn),vc ¡ F|x1,...,xn ¢ ,max n t : ∃i1, . . . ,it ∈{1, . . . ,n} s.t. card ³ F|xi1,...,xit ´ =2t o

Vapnik-Chervonenkis dimension is the largestt such thatFcan produce all possible sequences of bits {0, 1} on some set of examples. Clearly, the bound of Proposition 11.2is useless forn ≤vc (F). What aboutn >vc (F)? The following beautiful and surprising result was proved around the same time by Vapnik and Chervonenkis, Sauer, and Shelah.1

Lemma 11.5(Vapnik-Chervonenkis, Sauer, Shelah). It holds that

ΠF(n)≤ d X i=0 Ã n i ! whenever vc(F)=d< ∞.

There are several ways to prove this lemma, and we give one that will be useful later in the course. Importantly, the sum that upper bounds the growth function has aO(nd) behavior. In fact,

d X i=0 Ã n i ! ≤ ³en d ´d , (11.6)

1_{Thanks to Léon Bottou}_{http://leon.bottou.org/news/vapnik-chervonenkis_sauer}_for

figuring out the sequence of events (which, by the way, involves Paul Erdös) that led to these pub- lications.

and we leave the proof as an easy exercise. It is quite remarkable that the growth function is 2t fort≤dand polynomial afterwards, a non-obvious fact that makes the Vapnik-Chervonenkis results appealing.

As an immediate corollary of Proposition11.2we have

Corollary 11.6. Under the setting of Proposition11.2, if vc(F)=d , E sup g∈`(F) Gg ≤2Esup f Sf ≤2 s 2dlog(en/d) n Example 7. The class of thresholds

F={f_θ(x)=I{x≤θ} : θ∈[0, 1]}

onX=[0, 1] has vc (F)=1. Indeed, one cannot findx1,x2∈X,x1<x2, such that

some threshold gives a label 0 to x1 and 1 to x2. Thus, no set of two points is

shattered byF.

Example 8. The class of step-up and step-down thresholds

F={f_θ(x)=I{x≤θ} : θ∈[0, 1]}∪{g_θ(x)=I{x≥θ} : θ∈[0, 1]} onX=[0, 1] has vc (F)=2.

The following analogue holds inddimensions:

Example 9. Define the class of linear thresholds inRd by

F=nf_θ(x)=I{〈θ,x〉 ≥0} :θ∈Rdo

Then vc (F)=d+1, which justifies our use of the same letterd for both dimen- sionality of the space and the combinatorial dimension ofF.

Example 10. Let the set H of functionsX→ Rbe a subset of a d-dimensional vector space. Then the class

F=©I{h(x)≥0} :h∈Hª has VC dimension at mostd.

As the next example shows, it is not correct to think of the VC dimension as the number of parameters.

Example 11. The VC dimension of the class

f_α(x)=I{sin(αx)≥0} :α∈R o

is infinite despite the fact thatFis parametrized by a single number.

We now turn to the proof of the Vapnik-Chervonenkis-Sauer-Shelah lemma:

Proof of Lemma11.5. Define a function

g(d,n)= d X i=0 Ã n i !

ford,n≥0. We would like to proveΠ_F(n)≤g(d,n) ford=vc (F). By the way of induction, assume that the statement holds for (d−1,n−1) and (d−1,n). Fix any

x1, . . . ,xn∈X, and suppose vc (F,x1, . . . ,xn)=d. Define

F = n

r∈F|x2,...,xn: (0,r), (1,r)∈F|x1,...,xn o

the set of projections ofFon (x2, . . . ,xn) such that both labels are realized onx1.

LetF0=F|x2,...,xn, the set of all projections ofFon (x2, . . . ,xn). Observe thatF ⊂F0.

We claim that card¡ F|x1,...,xn ¢ =card (F)+card¡ F0¢ . (11.7)

To see this, supposer ∈F. Then both (0,r) and (1,r) are counted on the left-hand side of (11.7), and this is matched by countingr twice (inF andF0). Ifr ∈F0\F, then only (0,r) or (1,r) appears on the left-hand side, and the same is true for the right-hand side.

We now claim that vc (F) is at mostd−1, for otherwise vc¡

F|x1,...,xn ¢

=d+1 (indeed, both 0 and 1 can be appended to the full projection of size 2d to create a projection of size 2d+1). Then

card¡ F|x1,...,xn ¢ =card (F)+card¡ F0¢ ≤g(d−1,n−1)+g(d,n−1) (11.8) by the induction hypothesis. Since this holds for anyx1, . . . ,xn, we also have

ΠF(n)≤g(d−1,n−1)+g(d,n−1). The induction step follows from the identity

g(d,n−1)+g(d−1,n−1)=g(d,n). The base of the induction is easy to verify.

The proof of Lemma11.5might appear magical and unintuitive. Let us point (in a very informal way) to a few important pieces that make it work. In fact, the technique is rather general and will be used once again in a few lectures when we prove an analogue of the combinatorial lemma for sequential prediction. The ab- stract idea is that the behavior ofFon a sample can be separated into two pieces (F

andF0). Let us now forget about the way that these pieces are defined, and rather look into how their respective sizes can be used to unwind a recursion. Imagine that a split intoF andF0can be done so that the smaller pieceF is always of size 1. The recursion would then unfold asg(d,n)≤g(d,n−1)+1≤g(d,n−2)+2≤. . .≤n. Of course, such a recursion is rather trivial. Now, imagine another type of recursion: both pieces are of the same size. Then we would haveg(d,n)≤2g(d,n−1)≤ 4g(d,n−2)≤. . .≤2n. Once again, such a recursion is rather trivial, and the bound is not useful. In some sense, these two cases are extremes of how a recursion might play out. But note that we did not used in the two recursions. Imagine now that each time we perform the split intoF andF0, a “counter”d is decreased for the smaller piece. That is, g(d,n)≤g(d−1,n−1)+g(d,n−1). This situation is in- between the bounds of n and 2n, and, in fact, it is of the order nd as the proof shows.

The above observation is a rather general recipe: for a problem at hand, find a parameter (“counter”) so that the size of the smaller set being split off has to have a smaller value for this parameter. In this case, the recursion gives a polynomial rather than exponential size.

12

Statistical Learning: Real-Valued

Functions

In the real-valued supervised learning problem with i.i.d. data, we consider a bounded setY=[−1, 1]. Fix a setFof functionsX→Y, for some input spaceX. In the distribution-free scenario, we have i.i.d. data {(Xt,Yt)}n_t₌₁from an unknown

PX×Y, on which we place no assumption. Recall that the cardinality of the set

F|x1,...,xn= n

(f(x1), . . . ,f(xn)) :f ∈F

played an important role in the analysis of classification problems. For real-valued functions, however, the cardinality of this set is of little use since it is, in general, uncountable. However, two functions f andg with almost identical values

f(xt)≈g(xt), t∈{1, . . . ,n}

on the sample are essentially the same for our purposes, and we need to find a way to measure complexity ofF|x1,...,xnwithout paying attention to small discrepancies

between functions. This idea is captured by the notion ofcovering numbers.

12.1 Covering Numbers

Recall our shorthandxn={x1, . . . ,xn}.

Definition 12.1. A setV ⊂Rnis anα-coveronx1, . . . ,xnwith respect to`2norm if

∀f ∈F, ∃v∈V s.t. µ 1 n n X t=1 (f(xt)−vt)2 ¶1/2 ≤α. (12.1)

Anα-covering numberis defined as

N2(F,α,xn)=minncard (V) : V is an α-covero.

x1 x2 xT

Figure 12.1: Two sets of levels provide anα-cover for the four functions. Only the values of functions onx1, . . . ,xnare relevant.

As far as the cover is concerned, the values of functions outsidex1, . . . ,xn are immaterial. Hence, we may equivalently talk of a cover for the setF|xn ⊆[−1, 1]n.

We may then writeN2(F|xn,α) for theα-covering number.

It is important to note that here and below we refer to`p norms, but the definitions in fact use an extra normalization factor 1/n. We may equivalently write theα-closeness requirement (12.1) askf−vk2≤pnαwhere f=(f(x1), . . . ,f(xt))∈

F|x1,...,xn.

If the setV is realized asV = n

(g(x1), . . . ,g(xn)) :g∈G

for some collectionG⊆

F, we say that the cover isproper. A proper cover picks a subsetGofFso that any

f ∈Fis close to someg ∈Gonx1, . . . ,xnin the`2sense. The distinction between

proper and improper covers is minor, and one can show that a proper covering number at the αlevel is sandwiched between improper covering numbers at α andα/2 levels.

Of course, we can define`p-covers as

Definition 12.2. A setV ⊂Rnis anα-cover onx1, . . . ,xn with respect to`p norm, forp∈[1,∞), if ∀f ∈F,∃v∈V s.t. µ 1 n n X t=1 |f(xt)−vt|p ¶1/p ≤α, and with respect to`∞norm if

Theα-covering numbersNp(F,α,xn) andN∞(F,α,xn) are defined as before. The above notions of a cover can be seen as instances of a more general idea of anα-cover for a metric space (T,d). A subsetT0⊂T is anα-cover ofT if for anyt∈Tthere exists as∈T0such thatd(t,s)≤α. Above definitions correspond to endowingFwith an empirical metricLp(µn) whereµnis the empirical distribution on the datax1, . . . ,xn, thus explaining the extra 1/nnormalization factor.

↵

Figure 12.2: Anα-cover for a metric space (n,d).

P Exercise (?): Prove that for any 1≤p≤q≤ ∞, Np(F,α,xn)≤Nq(F,α,xn). Covering ofFallows us to squint our eyes and view the setF|xn as a finite set

at a granularityα. We can then apply the maximal inequalities for stochastic pro- cesses indexed by finite sets. Of course, by doing so we loseαof precision. Clearly, smallerαmeans larger α-cover, so there is tension between fine granularity and small cover. This is quantified in the following proposition.

Proposition 12.3. For any xn={x1, . . . ,xn}, conditional Rademacher averages of a

function classF⊆[−1, 1]Xsatisfy

b Ri i d₍_F₎ ≤inf α≥0    α+ s 2 logN1(F,α,xn₎ n   

Proof. Fixx1, . . . ,xnandα>0. LetV be a minimalα-cover ofFonxnwith respect to`1-norm. For f ∈F, denote by v[f]∈V any element that “α-covers” f in the

Write the conditional Rademacher averages as b Ri i d (F)=E ( sup f∈F 1 n n X t=1 ²tf(xt) ) =E ( sup f∈F 1 n n X t=1 ²t¡ f(xt)−v[f]t ¢ +²tv[f]t ) ≤E ( sup f∈F 1 n n X t=1 ²t¡ f(xt)−v[f]t ¢ ) +E ( sup f∈F 1 n n X t=1 ²tv[f]t ) ≤sup f∈F 1 n n X t=1 |f(xt)−v[f]t| +E ½ max v∈V 1 n n X t=1 ²tvt ¾

By definition ofα-cover, the first term is upper bounded byα. The second term is precisely controlled by the maximal inequality of Lemma 9.4, which gives an upper bound of

2 log|V| n ,

due to the fact that kvk2≤

n for any v ∈V. Note that if any coordinate ofv is outside the interval [−1, 1], it can be truncated.

Sinceα>0 is arbitrary, the statement follows.

Example 12. For certain classesF, one can show thatFx1,...,xnis a subset of [−1, 1]

n of dimension lower thann. This happens, for instance, if

F=nf(x)=f,x®: f ∈Bd_po and X=Bd_q,

whereBd_p andBd_q are unit balls inRd, with 1/p+1/q=1. For instance, forp= ∞, the function class is identified with [−1, 1]d and it is easy to check that

N∞(F,α,xn)≤

2 α

¶d

by a discretization argument. The bound of Proposition12.3then yields

b Ri i d (F)≤inf α≥0    α+ s 2dlog(2/α) n    (12.2)

Up to constant factors, the best choice forαisn−1/2, which yields an upper bound of the orderO µq dlogn n ¶ .

SinceBd_p⊆Bd_∞for anyp>0, the above bound also holds for otherF, though it might be loose. We also remark that an`∞cover is quite a stringent requirement, as Proposition12.3only requires the`1cover. Finally, we note that the discretiza-

tion argument yields more than an`_∞cover onxn: it yields a pointwise coverfor all x∈X. In general such pointwise bounds are not possible as soon as we consider infinite-dimensionalF.

12.2 Chaining Technique and the Dudley Entropy In-

In document Statistical Learning and Sequential Prediction - Free Computer, Programming, Mathematics, Technical Books, Lecture Notes and Tutorials (Page 100-109)