PROOF OF THEOREM 6.7 - The VC-Dimension - Understanding Machine Learning


6.5 PROOF OF THEOREM 6.7


We have already seen that 1 → 2 in Chapter 4. The implications 2 → 3 and 3 → 4 are trivial and so is 2 → 5. The implications 4 → 6 and 5 → 6 follow from the No-Free-Lunch theorem. The difficult part is to show that 6 → 1. The proof is based on two main claims:

If VCdim(H) = d, then even though H might be infinite, when restricting it to a finite set C ⊂ X, its “effective” size, |H_C|, is only O(|C|^d). That is, the size of H_C grows polynomially rather than exponentially with |C|. This claim is often referred to as Sauer’s lemma, but it has also been stated and proved independently by Shelah and by Perles. The formal statement is given in Section 6.5.1 later.

In Chapter 4 we have shown that finite hypothesis classes enjoy the uniform convergence property. In Section 6.5.2 later we generalize this result and show that uniform convergence holds whenever the hypothesis class has a “small effective size.” By “small effective size” we mean classes for which |H_C| grows polynomially with |C|.

6.5.1 Sauer’s Lemma and the Growth Function

We defined the notion of shattering by considering the restriction of H to a finite set of instances. The growth function measures the maximal “effective” size of H on a set of m examples. Formally:

Definition 6.9 (Growth Function). Let H be a hypothesis class. Then the growth function of H, denoted τ_H : N → N, is defined as

$$\tau_H(m) = \max_{C \subset X \,:\, |C| = m} |H_C|.$$

In words, τ_H(m) is the number of different functions from a set C of size m to {0,1} that can be obtained by restricting H to C.
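On a small finite domain the growth function can be computed by brute force. The Python sketch below is only an illustration: the helper names `restrict` and `growth_function`, and the class of thresholds on {0,...,5}, are assumptions made for this example and do not appear in the text.

```python
from itertools import combinations

def restrict(hypotheses, points):
    """Return H_C: the set of distinct label tuples the class induces on `points`."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def growth_function(hypotheses, domain, m):
    """tau_H(m): the largest |H_C| over all subsets C of `domain` with |C| = m."""
    return max(len(restrict(hypotheses, C)) for C in combinations(domain, m))

# Hypothetical example class: thresholds h_a(x) = 1 if x < a else 0 on {0,...,5}.
domain = list(range(6))
thresholds = [lambda x, a=a: int(x < a) for a in range(7)]

tau = [growth_function(thresholds, domain, m) for m in range(1, 5)]
print(tau)  # [2, 3, 4, 5]: thresholds realize m + 1 labelings on m distinct points
```

For this class the growth function is linear in m, far below 2^m once m exceeds 1, which is the behavior the section goes on to quantify.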

Obviously, if VCdim(H) = d then for any m ≤ d we have τ_H(m) = 2^m. In such cases, H induces all possible functions from C to {0,1}. The following beautiful lemma, proposed independently by Sauer, Shelah, and Perles, shows that when m becomes larger than the VC-dimension, the growth function increases polynomially rather than exponentially with m.

Lemma 6.10 (Sauer-Shelah-Perles). Let H be a hypothesis class with VCdim(H) ≤ d < ∞. Then, for all m,

$$\tau_H(m) \le \sum_{i=0}^{d} \binom{m}{i}.$$

In particular, if m > d + 1 then τ_H(m) ≤ (em/d)^d.
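To get a feel for the two bounds in the lemma, the following Python sketch tabulates the binomial sum, the polynomial bound (em/d)^d, and the trivial count 2^m for a few values of m. The function names `sauer_bound` and `poly_bound` and the choice d = 3 are assumptions made for illustration.

```python
from math import comb, e

def sauer_bound(m, d):
    """Sum of binomial coefficients C(m, i) for i = 0..d."""
    return sum(comb(m, i) for i in range(d + 1))

def poly_bound(m, d):
    """The polynomial bound (e*m/d)^d, stated in the lemma for m > d + 1."""
    return (e * m / d) ** d

d = 3
for m in (5, 10, 50, 100):
    # columns: m, binomial sum, polynomial bound (rounded), number of all labelings
    print(m, sauer_bound(m, d), round(poly_bound(m, d)), 2 ** m)
```

For m ≤ d the binomial sum equals 2^m, so the lemma says nothing new there; the gap to 2^m opens only once m exceeds d.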

Proof of Sauer’s Lemma*

To prove the lemma it suffices to prove the following stronger claim: For any C = {c_1,...,c_m} we have

∀H, |H_C| ≤ |{B ⊆ C : H shatters B}|. (6.3)


The reason why Equation (6.3) is sufficient to prove the lemma is that if VCdim(H) ≤ d then no set whose size is larger than d is shattered by H and therefore

$$|\{B \subseteq C : H \text{ shatters } B\}| \le \sum_{i=0}^{d} \binom{m}{i}.$$

When m > d + 1 the right-hand side of the preceding is at most (em/d)^d (see Lemma A.5 in Appendix A).

We are left with proving Equation (6.3) and we do it using an inductive argument. For m = 1, no matter what H is, either both sides of Equation (6.3) equal 1 or both sides equal 2 (the empty set is always considered to be shattered by H).

Assume Equation (6.3) holds for sets of size k < m and let us prove it for sets of size m. Fix H and C = {c_1,...,c_m}. Denote C' = {c_2,...,c_m} and in addition, define the following two sets:

Y_0 = {(y_2,...,y_m) : (0,y_2,...,y_m) ∈ H_C ∨ (1,y_2,...,y_m) ∈ H_C}, and

Y_1 = {(y_2,...,y_m) : (0,y_2,...,y_m) ∈ H_C ∧ (1,y_2,...,y_m) ∈ H_C}.

It is easy to verify that |H_C| = |Y_0| + |Y_1|: a tuple in Y_1 extends to two labelings in H_C, while a tuple in Y_0 that is not in Y_1 extends to exactly one. Additionally, since Y_0 = H_{C'}, using the induction assumption (applied on H and C') we have that

|Y_0| = |H_{C'}| ≤ |{B ⊆ C' : H shatters B}| = |{B ⊆ C : c_1 ∉ B ∧ H shatters B}|.
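The identity |H_C| = |Y_0| + |Y_1| used in this step can be checked numerically. The Python sketch below represents H_C directly as a set of label tuples; the helper name `split_on_first` and the random choice of H_C are assumptions made for illustration only.

```python
import random

def split_on_first(HC):
    """Given H_C as a set of label tuples (y_1,...,y_m), return (Y_0, Y_1).

    Y_0 holds the tails (y_2,...,y_m) that occur with y_1 = 0 OR y_1 = 1;
    Y_1 holds the tails that occur with y_1 = 0 AND y_1 = 1.
    """
    tails0 = {y[1:] for y in HC if y[0] == 0}
    tails1 = {y[1:] for y in HC if y[0] == 1}
    return tails0 | tails1, tails0 & tails1

random.seed(0)
m = 4
all_labelings = [tuple((n >> i) & 1 for i in range(m)) for n in range(2 ** m)]

for _ in range(100):
    HC = set(random.sample(all_labelings, random.randint(1, 2 ** m)))
    Y0, Y1 = split_on_first(HC)
    assert len(HC) == len(Y0) + len(Y1)
```

The check is just inclusion-exclusion: the two groups of tails have sizes summing to |H_C|, and their union and intersection are Y_0 and Y_1.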

Next, define H' ⊆ H to be

H' = {h ∈ H : ∃h' ∈ H s.t. (1 − h'(c_1), h'(c_2),...,h'(c_m)) = (h(c_1), h(c_2),...,h(c_m))},

namely, H' contains pairs of hypotheses that agree on C' and differ on c_1. Using this definition, it is clear that if H' shatters a set B ⊆ C' then it also shatters the set B ∪ {c_1} and vice versa. Combining this with the fact that Y_1 = H'_{C'} and using the inductive assumption (now applied on H' and C') we obtain that

|Y_1| = |H'_{C'}| ≤ |{B ⊆ C' : H' shatters B}| = |{B ⊆ C' : H' shatters B ∪ {c_1}}|

= |{B ⊆ C : c_1 ∈ B ∧ H' shatters B}| ≤ |{B ⊆ C : c_1 ∈ B ∧ H shatters B}|.

Overall, we have shown that

|H_C| = |Y_0| + |Y_1|

≤ |{B ⊆ C : c_1 ∉ B ∧ H shatters B}| + |{B ⊆ C : c_1 ∈ B ∧ H shatters B}|

= |{B ⊆ C : H shatters B}|,

which concludes our proof.
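Equation (6.3) can also be checked empirically on small random classes. In the Python sketch below a class restricted to C is given directly as a set of label tuples over m = 4 points; the helper names `shatters` and `count_shattered` are assumptions introduced for this illustration.

```python
from itertools import combinations
import random

def shatters(HC, B):
    """True if the labelings in HC realize all 2^|B| patterns on the index set B."""
    return len({tuple(y[i] for i in B) for y in HC}) == 2 ** len(B)

def count_shattered(HC, m):
    """Number of subsets B of {0,...,m-1}, the empty set included, shattered by HC."""
    return sum(shatters(HC, B)
               for k in range(m + 1)
               for B in combinations(range(m), k))

random.seed(0)
m = 4
all_labelings = [tuple((n >> i) & 1 for i in range(m)) for n in range(2 ** m)]

for _ in range(200):
    HC = set(random.sample(all_labelings, random.randint(1, 2 ** m)))
    assert len(HC) <= count_shattered(HC, m)  # Equation (6.3)
```

Such a random check does not replace the inductive proof, but it is a quick way to catch a misreading of the definitions, for example forgetting that the empty set counts as shattered.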

6.5.2 Uniform Convergence for Classes of Small Effective Size

In this section we prove that if H has small effective size then it enjoys the uniform convergence property. Formally,


Theorem 6.11. Let H be a class and let τ_H be its growth function. Then, for every D and every δ ∈ (0,1), with probability of at least 1 − δ over the choice of S ∼ D^m we have

$$|L_D(h) - L_S(h)| \le \frac{4 + \sqrt{\log(\tau_H(2m))}}{\delta\sqrt{2m}}.$$

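Before proving the theorem, let us first conclude the proof of Theorem 6.7.

Proof of Theorem 6.7. It suffices to prove that if the VC-dimension is finite then the uniform convergence property holds. We will prove that

$$m^{UC}_H(\epsilon,\delta) \le 4\,\frac{16d}{(\delta\epsilon)^2}\,\log\!\left(\frac{16d}{(\delta\epsilon)^2}\right) + \frac{16\,d\,\log(2e/d)}{(\delta\epsilon)^2}.$$

From Sauer’s lemma we have that for m > d, τ_H(2m) ≤ (2em/d)^d. Combining this with Theorem 6.11 we obtain that with probability of at least 1 − δ,

$$|L_S(h) - L_D(h)| \le \frac{4 + \sqrt{d\log(2em/d)}}{\delta\sqrt{2m}}.$$

For simplicity assume that √(d log(2em/d)) ≥ 4; hence,

$$|L_S(h) - L_D(h)| \le \frac{1}{\delta}\sqrt{\frac{2d\log(2em/d)}{m}}.$$

To ensure that the preceding is at most ε we need that

$$m \ge \frac{2d\log(m)}{(\delta\epsilon)^2} + \frac{2d\log(2e/d)}{(\delta\epsilon)^2}.$$

Standard algebraic manipulations (see Lemma A.2 in Appendix A) show that a sufficient condition for the preceding to hold is that

$$m \ge 4\,\frac{2d}{(\delta\epsilon)^2}\,\log\!\left(\frac{2d}{(\delta\epsilon)^2}\right) + \frac{4\,d\,\log(2e/d)}{(\delta\epsilon)^2}.$$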
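The implicit condition m ≥ 2d log(m)/(δε)² + 2d log(2e/d)/(δε)² has m on both sides, so it is instructive to solve it numerically for concrete parameters. The Python sketch below does this by doubling and bisection; the function names `satisfies` and `smallest_m`, and the sample values d = 3, ε = δ = 0.1, are assumptions made for illustration. The search starts at a = 2d/(δε)², beyond which m − a·log(m) is increasing, so bisection is valid there.

```python
from math import log, e, ceil

def satisfies(m, d, eps, delta):
    """Check m >= 2d*log(m)/(delta*eps)^2 + 2d*log(2e/d)/(delta*eps)^2."""
    s = (delta * eps) ** 2
    return m >= (2 * d * log(m) + 2 * d * log(2 * e / d)) / s

def smallest_m(d, eps, delta):
    """Smallest integer m >= 2d/(delta*eps)^2 that satisfies the condition."""
    lo = max(2, ceil(2 * d / (delta * eps) ** 2))
    if satisfies(lo, d, eps, delta):
        return lo
    hi = 2 * lo
    while not satisfies(hi, d, eps, delta):
        lo, hi = hi, 2 * hi
    # Invariant: the condition fails at lo and holds at hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if satisfies(mid, d, eps, delta):
            hi = mid
        else:
            lo = mid
    return hi

m_star = smallest_m(d=3, eps=0.1, delta=0.1)
print(m_star)
```

The number printed is far larger than what tighter analyses give for the same parameters, which is consistent with the remark that this bound is not the tightest possible.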

Remark 6.4. The upper bound on m^UC_H we derived in the proof of Theorem 6.7 is not the tightest possible. A tighter analysis that yields the bounds given in Theorem 6.8 can be found in Chapter 28.

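Proof of Theorem 6.11*

We will start by showing that

$$\mathbb{E}_{S\sim D^m}\left[\sup_{h\in H}|L_D(h) - L_S(h)|\right] \le \frac{4 + \sqrt{\log(\tau_H(2m))}}{\sqrt{2m}}. \tag{6.4}$$

Since the random variable sup_{h∈H} |L_D(h) − L_S(h)| is nonnegative, the proof of the theorem follows directly from the preceding using Markov’s inequality (see Section B.1).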


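To bound the left-hand side of Equation (6.4) we first note that for every h ∈ H, we can rewrite L_D(h) = E_{S'∼D^m}[L_{S'}(h)], where S' = z'_1,...,z'_m is an additional i.i.d. sample. Therefore,

$$\mathbb{E}_{S\sim D^m}\left[\sup_{h\in H}|L_D(h) - L_S(h)|\right] = \mathbb{E}_{S\sim D^m}\left[\sup_{h\in H}\left|\mathbb{E}_{S'\sim D^m}\left[L_{S'}(h) - L_S(h)\right]\right|\right].$$

A generalization of the triangle inequality yields

$$\left|\mathbb{E}_{S'\sim D^m}\left[L_{S'}(h) - L_S(h)\right]\right| \le \mathbb{E}_{S'\sim D^m}\left|L_{S'}(h) - L_S(h)\right|,$$

and the fact that the supremum of an expectation is smaller than the expectation of the supremum yields

$$\sup_{h\in H}\mathbb{E}_{S'\sim D^m}\left|L_{S'}(h) - L_S(h)\right| \le \mathbb{E}_{S'\sim D^m}\sup_{h\in H}\left|L_{S'}(h) - L_S(h)\right|.$$

Formally, the previous two inequalities follow from Jensen’s inequality. Combining all we obtain

$$\mathbb{E}_{S\sim D^m}\left[\sup_{h\in H}|L_D(h) - L_S(h)|\right] \le \mathbb{E}_{S,S'\sim D^m}\left[\sup_{h\in H}\left|L_{S'}(h) - L_S(h)\right|\right] = \mathbb{E}_{S,S'\sim D^m}\left[\sup_{h\in H}\frac{1}{m}\left|\sum_{i=1}^{m}\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right]. \tag{6.5}$$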

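The expectation on the right-hand side is over a choice of two i.i.d. samples S = z_1,...,z_m and S' = z'_1,...,z'_m. Since all of these 2m vectors are chosen i.i.d., nothing will change if we replace the name of the random vector z_i with the name of the random vector z'_i. If we do it, instead of the term (ℓ(h,z'_i) − ℓ(h,z_i)) in Equation (6.5) we will have the term −(ℓ(h,z'_i) − ℓ(h,z_i)). It follows that for every σ ∈ {±1}^m we have that Equation (6.5) equals

$$\mathbb{E}_{S,S'\sim D^m}\left[\sup_{h\in H}\frac{1}{m}\left|\sum_{i=1}^{m}\sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right].$$

Since this holds for every σ ∈ {±1}^m, it also holds if we sample each component of σ uniformly at random from the uniform distribution over {±1}, denoted U_±. Hence, Equation (6.5) also equals

$$\mathbb{E}_{\sigma\sim U_\pm^m}\,\mathbb{E}_{S,S'\sim D^m}\left[\sup_{h\in H}\frac{1}{m}\left|\sum_{i=1}^{m}\sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right],$$

and by the linearity of expectation it also equals

$$\mathbb{E}_{S,S'\sim D^m}\,\mathbb{E}_{\sigma\sim U_\pm^m}\left[\sup_{h\in H}\frac{1}{m}\left|\sum_{i=1}^{m}\sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right].$$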


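Next, fix S and S', and let C be the instances appearing in S and S'. Then, we can take the supremum only over h ∈ H_C. Therefore,

$$\mathbb{E}_{\sigma\sim U_\pm^m}\left[\sup_{h\in H}\frac{1}{m}\left|\sum_{i=1}^{m}\sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right] = \mathbb{E}_{\sigma\sim U_\pm^m}\left[\max_{h\in H_C}\frac{1}{m}\left|\sum_{i=1}^{m}\sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right].$$

Fix some h ∈ H_C and denote θ_h = (1/m) Σ_{i=1}^m σ_i (ℓ(h,z'_i) − ℓ(h,z_i)). Since E[θ_h] = 0 and θ_h is an average of independent variables, each of which takes values in [−1,1], we have by Hoeffding’s inequality that for every ρ > 0,

$$\mathbb{P}\left[|\theta_h| > \rho\right] \le 2\exp\left(-2m\rho^2\right).$$

Applying the union bound over h ∈ H_C, we obtain that for any ρ > 0,

$$\mathbb{P}\left[\max_{h\in H_C}|\theta_h| > \rho\right] \le 2\,|H_C|\exp\left(-2m\rho^2\right).$$

Finally, Lemma A.4 in Appendix A tells us that the preceding implies

$$\mathbb{E}\left[\max_{h\in H_C}|\theta_h|\right] \le \frac{4 + \sqrt{\log(|H_C|)}}{\sqrt{2m}}.$$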
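The concentration step can be illustrated by simulation. The Python sketch below fixes hypothetical loss differences ℓ(h,z'_i) − ℓ(h,z_i) in {−1, 0, 1} for a finite set of hypotheses, samples the random signs σ, and compares the average of max_h |θ_h| with the quantity (4 + √log|H_C|)/√(2m), assuming that is the form of the bound this step yields. All names and parameter values here are assumptions made for illustration.

```python
import random
from math import log, sqrt

random.seed(1)
m, n_hyp, trials = 100, 32, 500

# Hypothetical fixed loss differences l(h, z'_i) - l(h, z_i), one row per hypothesis.
diffs = [[random.choice((-1, 0, 1)) for _ in range(m)] for _ in range(n_hyp)]

def max_abs_theta(diffs, sigma):
    """max over hypotheses of |theta_h|, where theta_h = (1/m) * sum_i sigma_i * diff_i."""
    m = len(sigma)
    return max(abs(sum(s * d for s, d in zip(sigma, row))) / m for row in diffs)

# Monte Carlo estimate of E_sigma[max_h |theta_h|] over uniform random signs.
est = sum(
    max_abs_theta(diffs, [random.choice((-1, 1)) for _ in range(m)])
    for _ in range(trials)
) / trials

bound = (4 + sqrt(log(n_hyp))) / sqrt(2 * m)
print(est, bound)
```

In runs of this kind the estimate sits well below the bound, as expected from an argument that combines a union bound with a generic tail-to-expectation lemma.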

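Combining all with the definition of τ_H, we have shown that

$$\mathbb{E}_{S\sim D^m}\left[\sup_{h\in H}|L_D(h) - L_S(h)|\right] \le \frac{4 + \sqrt{\log(\tau_H(2m))}}{\sqrt{2m}}.$$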
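As a sanity check on the end result, one can estimate the expected worst-case gap between true and empirical loss for a simple class and compare it with (4 + √log τ_H(2m))/√(2m), assuming that is the bound established here. The Python sketch below uses assumptions throughout: D is uniform on [0,1] with labels 1[x < 0.5], the class is a finite grid of thresholds h_a(x) = 1[x < a], and its growth function is bounded by 2m + 1 on 2m points.

```python
import random
from math import log, sqrt

def sup_deviation(sample, thresholds):
    """max over thresholds a of |L_D(h_a) - L_S(h_a)| under the assumed distribution."""
    m = len(sample)
    worst = 0.0
    for a in thresholds:
        true_loss = abs(a - 0.5)  # mass of the region where h_a disagrees with the labels
        emp_loss = sum((x < a) != (x < 0.5) for x in sample) / m
        worst = max(worst, abs(true_loss - emp_loss))
    return worst

random.seed(0)
m, trials = 50, 200
thresholds = [k / 100 for k in range(101)]

# Monte Carlo estimate of E_S[sup_h |L_D(h) - L_S(h)|].
avg = sum(
    sup_deviation([random.random() for _ in range(m)], thresholds)
    for _ in range(trials)
) / trials

bound = (4 + sqrt(log(2 * m + 1))) / sqrt(2 * m)  # uses tau_H(2m) <= 2m + 1
print(avg, bound)
```

The bound is loose for this class, but it holds with the growth function in place of the (infinite, for real-valued thresholds) cardinality of the class, which is the point of the theorem.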

6.6 SUMMARY

The fundamental theorem of learning theory characterizes PAC learnability of classes of binary classifiers using VC-dimension. The VC-dimension of a class is a combinatorial property that denotes the maximal sample size that can be shattered by the class. The fundamental theorem states that a class is PAC learnable if and only if its VC-dimension is finite and specifies the sample complexity required for PAC learning. The theorem also shows that if a problem is at all learnable, then uniform convergence holds and therefore the problem is learnable using the ERM rule.

6.7 BIBLIOGRAPHIC REMARKS

The definition of VC-dimension and its relation to learnability and to uniform convergence is due to the seminal work of Vapnik and Chervonenkis (1971). The relation to the definition of PAC learnability is due to Blumer, Ehrenfeucht, Haussler, and Warmuth (1989).

Several generalizations of the VC-dimension have been proposed. For example, the fat-shattering dimension characterizes learnability of some regression problems (Kearns, Schapire & Sellie 1994; Alon, Ben-David, Cesa-Bianchi & Haussler 1997; Bartlett, Long & Williamson 1994; Anthony & Bartlett 1999), and the Natarajan dimension characterizes learnability of some multiclass learning problems (Natarajan 1989). However, in general, there is no equivalence between learnability and uniform convergence. See (Shalev-Shwartz, Shamir, Srebro & Sridharan 2010; Daniely, Sabato, Ben-David & Shalev-Shwartz 2011).

Sauer’s lemma was proved by Sauer in response to a problem of Erdős (Sauer 1972). Shelah (with Perles) proved it as a useful lemma for Shelah’s theory of stable models (Shelah 1972). Gil Kalai tells us¹ that at some later time, Benjy Weiss asked Perles about such a result in the context of ergodic theory, and Perles, who forgot that he had proved it once, proved it again. Vapnik and Chervonenkis proved the lemma in the context of statistical learning theory.

6.8 EXERCISES

6.1 Show the following monotonicity property of VC-dimension: For every two hypothesis classes, if H' ⊆ H then VCdim(H') ≤ VCdim(H).

6.2 Given some finite domain set, X, and a number k ≤ |X|, figure out the VC-dimension of each of the following classes (and prove your claims):

1. H^X_{=k} = {h ∈ {0,1}^X : |{x : h(x) = 1}| = k}: that is, the set of all functions that assign the value 1 to exactly k elements of X.

2. H_{at-most-k} = {h ∈ {0,1}^X : |{x : h(x) = 1}| ≤ k or |{x : h(x) = 0}| ≤ k}.

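6.3 Let X be the Boolean hypercube {0,1}^n. For a set I ⊆ {1,2,...,n} we define a parity function h_I as follows. On a binary vector x = (x_1,x_2,...,x_n) ∈ {0,1}^n,

$$h_I(x) = \left(\sum_{i\in I} x_i\right) \bmod 2.$$

(That is, h_I computes parity of bits in I.) What is the VC-dimension of the class of all such parity functions, H_{n-parity} = {h_I : I ⊆ {1,2,...,n}}?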

6.4 We proved Sauer’s lemma by proving that for every class H of finite VC-dimension d, and every subset A of the domain,

$$|H_A| \le |\{B \subseteq A : H \text{ shatters } B\}| \le \sum_{i=0}^{d} \binom{|A|}{i}.$$

Show that there are cases in which the previous two inequalities are strict (namely, the ≤ can be replaced by <) and cases in which they can be replaced by equalities. Demonstrate all four combinations of = and <.

6.5 VC-dimension of axis aligned rectangles in R^d: Let H^d_rec be the class of axis aligned rectangles in R^d. We have already seen that VCdim(H^2_rec) = 4. Prove that in general, VCdim(H^d_rec) = 2d.

6.6 VC-dimension of Boolean conjunctions: Let H^d_con be the class of Boolean conjunctions over the variables x_1,...,x_d (d ≥ 2). We already know that this class is finite and thus (agnostic) PAC learnable. In this question we calculate VCdim(H^d_con).

1. Show that |H^d_con| ≤ 3^d + 1.

2. Conclude that VCdim(H^d_con) ≤ d log 3.

3. Show that H^d_con shatters the set of unit vectors {e_i : i ≤ d}.

1 http://gilkalai.wordpress.com/2008/09/28/extremal-combinatorics-iii-some-basic-theorems


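4. Show that VCdim(H^d_con) ≤ d.

Hint: Assume by contradiction that there exists a set C = {c_1,...,c_{d+1}} that is shattered by H^d_con. Let h_1,...,h_{d+1} be hypotheses in H^d_con that satisfy

$$\forall i,j \in [d+1],\quad h_i(c_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{otherwise.} \end{cases}$$

For each i ∈ [d + 1], h_i (or more accurately, the conjunction that corresponds to h_i) contains some literal ℓ_i which is false on c_i and true on c_j for each j ≠ i. Use the Pigeonhole principle to show that there must be a pair i < j ≤ d + 1 such that ℓ_i and ℓ_j use the same x_k and use that fact to derive a contradiction to the requirements from the conjunctions h_i, h_j.

5. Consider the class H^d_mcon of monotone Boolean conjunctions over {0,1}^d. Monotonicity here means that the conjunctions do not contain negations. As in H^d_con, the empty conjunction is interpreted as the all-positive hypothesis. We augment H^d_mcon with the all-negative hypothesis h^−. Show that VCdim(H^d_mcon) = d.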

6.7 We have shown that for a finite hypothesis class H, VCdim(H) ≤ log(|H|). However, this is just an upper bound. The VC-dimension of a class can be much lower than that:

1. Find an example of a class H of functions over the real interval X = [0,1] such that H is infinite while VCdim(H) = 1.

2. Give an example of a finite hypothesis class H over the domain X = [0,1], where VCdim(H) = log_2(|H|).

6.8 (*) It is often the case that the VC-dimension of a hypothesis class equals (or can be bounded above by) the number of parameters one needs to set in order to define each hypothesis in the class. For instance, if H is the class of axis aligned rectangles in R^d, then VCdim(H) = 2d, which is equal to the number of parameters used to define a rectangle in R^d. Here is an example that shows that this is not always the case. We will see that a hypothesis class might be very complex and even not learnable, although it has a small number of parameters.

Consider the domain X = R, and the hypothesis class

H = {x ↦ ⌈sin(θx)⌉ : θ ∈ R}

(here, we take ⌈−1⌉ = 0). Prove that VCdim(H) = ∞.

Hint: There is more than one way to prove the required result. One option is by applying the following lemma: If 0.x_1x_2x_3..., is the binary expansion of x ∈ (0,1), then for any natural number m, ⌈sin(2^m π x)⌉ = (1 − x_m), provided that ∃k ≥ m s.t. x_k = 1.

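6.9 Let H be the class of signed intervals, that is, H = {h_{a,b,s} : a ≤ b, s ∈ {−1,1}} where

$$h_{a,b,s}(x) = \begin{cases} s & \text{if } x \in [a,b] \\ -s & \text{if } x \notin [a,b]. \end{cases}$$

Calculate VCdim(H).

6.10 Let H be a class of functions from X to {0,1}.

1. Prove that if VCdim(H) ≥ d, for any d, then for some probability distribution D over X × {0,1}, for every sample size, m,

$$\mathbb{E}_{S\sim D^m}\left[L_D(A(S))\right] \ge \min_{h\in H} L_D(h) + \frac{d-m}{2d}.$$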


Hint: Use Exercise 6.3 in Chapter 5.

2. Prove that for every H that is PAC learnable, VCdim(H) < ∞. (Note that this is the implication 3 → 6 in Theorem 6.7.)

6.11 VC of union: Let H_1,...,H_r be hypothesis classes over some fixed domain set X. Let d = max_i VCdim(H_i) and assume for simplicity that d ≥ 3.

1. Prove that

$$\mathrm{VCdim}\left(\bigcup_{i=1}^{r} H_i\right) \le 4d\log(2d) + 2\log(r).$$

Hint: Take a set of k examples and assume that they are shattered by the union class. Therefore, the union class can produce all 2^k possible labelings on these examples. Use Sauer’s lemma to show that the union class cannot produce more than r k^d labelings. Therefore, 2^k < r k^d. Now use Lemma A.2.

2. (*) Prove that for r = 2 it holds that

VCdim(H_1 ∪ H_2) ≤ 2d + 1.

6.12 Dudley classes: In this question we discuss an algebraic framework for defining
