Covering, packing, and approximation - Probability in High Dimension

Part I Concentration

5.2 Covering, packing, and approximation

Let $x_1,\ldots,x_n$ be points in a Hilbert space $H$. Then for every $0<\varepsilon<1$ and $k\gtrsim\varepsilon^{-2}\log n$, there exists a linear map $T:H\to\mathbb{R}^k$ such that

$$(1-\varepsilon)\|x_i-x_j\| \le \|Tx_i-Tx_j\| \le (1+\varepsilon)\|x_i-x_j\| \quad\text{for all } 1\le i,j\le n.$$

This result should be interpreted in terms of compression: if we want to store the distances between $n$ points in a data structure, and if we tolerate a small distortion of order $\varepsilon$, it suffices to store an $n\times k$ matrix of size $\sim n\log n$ rather than the full $n\times n$ distance matrix of size $\sim n^2$.

At first sight, the Johnson-Lindenstrauss lemma has nothing to do with probability: it is a deterministic statement about the geometry of Hilbert spaces. However, the easiest way to find $T$ is to select it randomly!

a. Argue that we can assume without loss of generality that $H=\mathbb{R}^n$.

b. For a $k\times n$ random matrix $T$ such that $T_{ij}$ are i.i.d. $N(0,k^{-1})$, show that

$$P\big[\,|\|Tz\|-E\|Tz\||\ge\varepsilon\|z\|\,\big]\le 2e^{-k\varepsilon^2/2}\quad\text{for } z\in\mathbb{R}^n.$$

Hint: Gaussian concentration.

c. Show that $\sqrt{1-k^{-1}}\,\|z\|\le E\|Tz\|\le\|z\|$, and conclude that for $0<\varepsilon<1$ and $k\ge\varepsilon^{-1}$

$$P\big[(1-\varepsilon)\|z\|<\|Tz\|<(1+\varepsilon)\|z\|\big]\ge 1-2e^{-k\varepsilon^2/8}\quad\text{for } z\in\mathbb{R}^n.$$

Hint: Use $E\|Tz\|\le E[\|Tz\|^2]^{1/2}$ for the upper bound. For the lower bound, estimate $\mathrm{Var}\,\|Tz\|$ from above using the Gaussian Poincaré inequality.

d. Show that if $k>24\,\varepsilon^{-2}\log n$, then

$$P\big[(1-\varepsilon)\|x_i-x_j\|<\|Tx_i-Tx_j\|<(1+\varepsilon)\|x_i-x_j\|\ \text{for all } i,j\big]>0.$$

Hint: use a union bound.
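The random construction in parts b to d can be tried numerically. The following is a minimal sketch, not part of the problem: it draws a matrix with i.i.d. $N(0,k^{-1})$ entries and reports the smallest and largest ratio $\|Tx_i-Tx_j\|/\|x_i-x_j\|$. The point cloud, the seed, and the helper names (`random_projection`, `project`, `distortion_range`) are assumptions made for illustration. With the crude constant 24 the resulting $k$ exceeds the ambient dimension chosen here, so the run illustrates the distortion guarantee rather than actual compression.

```python
import math
import random


def random_projection(k, n, rng):
    """Return a k x n matrix with i.i.d. N(0, 1/k) entries (as in part b)."""
    s = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, s) for _ in range(n)] for _ in range(k)]


def project(T, x):
    """Matrix-vector product T x for a matrix stored as a list of rows."""
    return [sum(t * xi for t, xi in zip(row, x)) for row in T]


def distortion_range(points, T):
    """Smallest and largest ratio |Tx_i - Tx_j| / |x_i - x_j| over all pairs."""
    images = [project(T, x) for x in points]
    ratios = [
        math.dist(images[i], images[j]) / math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    return min(ratios), max(ratios)


rng = random.Random(0)                      # fixed seed, illustrative only
n_points, dim, eps = 20, 50, 0.5            # hypothetical test configuration
k = math.floor(24 * eps ** -2 * math.log(n_points)) + 1   # k > 24 eps^-2 log n
points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
T = random_projection(k, dim, rng)
lo, hi = distortion_range(points, T)
print(k, round(lo, 3), round(hi, 3))
```

In such runs the observed ratios typically sit much closer to 1 than the guaranteed window $(1-\varepsilon,1+\varepsilon)$, which reflects how conservative the union bound is.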

5.2 Covering, packing, and approximation

If the set $T$ is infinite, the maximal inequalities of the previous section provide no information. This is, however, not surprising. We have seen that the inequalities for finite maxima work well when the random variables are independent. On the other hand, suppose that $T$ is infinite but that $t\mapsto X_t$ is continuous in a suitable sense. Then $\lim_{t\to s}X_t=X_s$, so $X_t$ and $X_s$ must be strongly dependent when $t$ and $s$ are nearby points! Thus the lack of independence should in fact help us to control the infinite supremum: we should apply the maximal inequalities of the previous section only to a finite number of well-separated points (at which the process might be expected to be nearly independent), and use continuity to control the fluctuations of the remaining (strongly dependent) degrees of freedom. In this section, we will develop the crudest illustration of this principle, which will be systematically developed in the sequel into a powerful machinery to control suprema.

To implement the above idea, we need to have a quantitative notion of continuity. In this section, we will use the simplest (but, as we will see, often unsatisfactory) such notion for random processes.

Definition 5.4 (Lipschitz process). The random process $\{X_t\}_{t\in T}$ is called Lipschitz for a metric $d$ on $T$ if there exists a random variable $C$ such that

$$|X_t-X_s|\le C\,d(t,s)\quad\text{for all } t,s\in T.$$

Given a Lipschitz process, our aim is to approximate the supremum over $T$ by the maximum over a finite set $N$, to which we will apply the inequalities of the previous section. To obtain a good bound, we have two competing demands: on the one hand, we would like the set $N$ to be as small as possible (so that the bound on the maximum is small); on the other hand, to control the approximation error, we must make sure that every point in $T$ is close to at least one of the points in $N$. This leads to the following concept.

Definition 5.5 ($\varepsilon$-net and covering number). A set $N$ is called an $\varepsilon$-net for $(T,d)$ if for every $t\in T$, there exists $\pi(t)\in N$ such that $d(t,\pi(t))\le\varepsilon$. The smallest cardinality of an $\varepsilon$-net for $(T,d)$ is called the covering number

$$N(T,d,\varepsilon):=\inf\{|N| : N \text{ is an } \varepsilon\text{-net for } (T,d)\}.$$

The covering number $N(T,d,\varepsilon)$ should be viewed as a measure of the complexity of the set $T$ at the scale $\varepsilon$: the more complex $T$, the more points we will need to approximate its structure up to a fixed precision. Alternatively, we can interpret the covering number as describing the geometry of the metric space $(T,d)$. Indeed, let $B(t,\varepsilon)=\{s : d(t,s)\le\varepsilon\}$ be a ball of radius $\varepsilon$. Then

$$N \text{ is an } \varepsilon\text{-net if and only if}\quad T\subseteq\bigcup_{t\in N}B(t,\varepsilon),$$

so that the covering number $N(T,d,\varepsilon)$ is the smallest number of balls of radius $\varepsilon$ needed to cover $T$ (hence the name). We can therefore interpret the covering number as a measure of the degree of (non-)compactness of $(T,d)$.

Remark 5.6. In many applications, we may want to compute the supremum $\sup_{t\in T}X_t$ of a stochastic process $\{X_t\}_{t\in S}$ that is defined on a larger index set $S\supset T$. In this case, even though we are only interested in the process on the set $T$, it is not necessary to require that the $\varepsilon$-net $N$ is a subset of $T$: it can be convenient to approximate the set $T$ by points in $S\setminus T$ also. For this reason, we have not insisted in the above definition that $N\subseteq T$.

We are now ready to develop our first bound on the supremum of a random process. We adopt the notation of Definitions 5.4 and 5.5.

Lemma 5.7 (Lipschitz maximal inequality). Suppose $\{X_t\}_{t\in T}$ is a Lipschitz process (Definition 5.4) and $X_t$ is $\sigma^2$-subgaussian for every $t\in T$. Then

$$E\sup_{t\in T}X_t\le\inf_{\varepsilon>0}\Big\{\varepsilon E[C]+\sqrt{2\sigma^2\log N(T,d,\varepsilon)}\Big\}.$$

Note that this result is indeed a simple incarnation of the informal principle formulated in Chapter 1: if the process $X_t$ is "sufficiently continuous," then $\sup_{t\in T}X_t$ is controlled by the "complexity" of the index set $T$.

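Proof. Let $\varepsilon>0$ and let $N$ be an $\varepsilon$-net. Then

$$\sup_{t\in T}X_t\le\sup_{t\in T}\{X_t-X_{\pi(t)}\}+\sup_{t\in T}X_{\pi(t)}\le C\varepsilon+\max_{t\in N}X_t.$$

Taking the expectation and using Lemma 5.1 yields

$$E\sup_{t\in T}X_t\le\varepsilon E[C]+\sqrt{2\sigma^2\log|N|}.$$

Optimizing over $\varepsilon$-nets $N$ and $\varepsilon>0$ yields the result. □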
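To get a feeling for the trade-off in $\varepsilon$, one can evaluate the right-hand side of Lemma 5.7 numerically. The sketch below is only an illustration under stated assumptions: the values $E[C]=10$ and $\sigma^2=1$ are hypothetical, and the covering number is replaced by a hypothetical upper bound of the form $(3/\varepsilon)^n$ with $n=5$. Minimizing over a finite grid of scales still gives a valid upper bound, since the infimum is at most the value at any single $\varepsilon$.

```python
import math


def lipschitz_bound(eps, mean_C, sigma2, log_cov):
    """Right-hand side of Lemma 5.7 at a fixed scale eps."""
    return eps * mean_C + math.sqrt(2.0 * sigma2 * log_cov(eps))


def best_bound(mean_C, sigma2, log_cov, grid):
    """Minimize the bound over a finite grid of scales.

    Returns (value, eps) for the best grid point; the value is an upper
    bound on the infimum over all eps > 0.
    """
    return min((lipschitz_bound(e, mean_C, sigma2, log_cov), e) for e in grid)


def log_cov(eps, n=5):
    """Hypothetical bound log N(T, d, eps) <= n log(3/eps) for eps < 1."""
    return n * math.log(3.0 / eps) if eps < 1 else 0.0


grid = [j / 1000 for j in range(1, 1000)]
value, eps_star = best_bound(10.0, 1.0, log_cov, grid)   # assumed E[C], sigma^2
print(round(value, 3), eps_star)
```

The minimizer lies strictly inside the grid: very small $\varepsilon$ inflates the entropy term, while $\varepsilon$ close to 1 inflates the Lipschitz term.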

Remark 5.8. The idea behind Lemma 5.7 is that it allows us to trade off between exploiting independence (better at large scales) and controlling for dependence (worse at large scales). However, note that we never explicitly assume or use independence in the proof: instead, the distance $d$ could be interpreted as a proxy for the degree of independence. While the conclusion of Lemma 5.7 does not depend on the validity of this interpretation, we expect that such bounds (and the more powerful bounds to be developed in the sequel) will be the most effective when the distance $d$ is chosen in such a way that large distance does indeed correspond to more independence. This is often the case in practice. In the case of Gaussian processes, for example, we will see in the next chapter that this idea holds to such a degree that we can obtain matching upper and lower bounds for the supremum of Gaussian processes in terms of the geometry of the index set $(T,d)$, albeit in a much more sophisticated manner than is captured by the trivial Lemma 5.7.

Remark 5.9. When $N(T,d,\varepsilon)=\infty$, the bound of Lemma 5.7 is infinite. However, note that if $X_1,X_2,\ldots$ are i.i.d. unbounded random variables, then we already have $\sup_iX_i=\infty$ a.s. It is therefore to be expected that the supremum of a random process will typically indeed be infinite if it contains infinitely many independent degrees of freedom. Thus the fact that $N(T,d,\varepsilon)=\infty$ (which means there are infinitely many points in $T$ that are well separated) yields an infinite bound is not a shortcoming of Lemma 5.7. To obtain a finite supremum for noncompact index sets $T$ one must often add a penalty inside the supremum; such problems will be investigated in Section 5.4 below.

In the remainder of this section, we will illustrate the application of Lemma 5.7 using two illuminating examples. Along the way, we will develop some useful examples of how one can control covering numbers.

Example 5.10 (Random matrices). Let $M$ be an $n\times m$ random matrix such that $M_{ij}$ are independent $\sigma^2$-subgaussian random variables. We would like to estimate the magnitude of the operator norm

$$\|M\|:=\sup_{v\in B_2^n,\,w\in B_2^m}\langle v,Mw\rangle=\sup_{(v,w)\in T}X_{v,w},$$

where $B_2^n=\{x\in\mathbb{R}^n:\|x\|\le1\}$ is the Euclidean unit ball in $\mathbb{R}^n$ and

$$T:=B_2^n\times B_2^m,\qquad X_{v,w}:=\langle v,Mw\rangle=\sum_{i=1}^n\sum_{j=1}^m v_iM_{ij}w_j.$$

It follows immediately from Azuma's inequality (Lemma 3.7) that $X_{v,w}$ is $\sigma^2$-subgaussian for every $(v,w)\in T$. On the other hand, note that

$$\begin{aligned}
|X_{v,w}-X_{v',w'}| &= |\langle v,Mw\rangle-\langle v',Mw'\rangle| \\
&\le |\langle v-v',Mw\rangle|+|\langle v',M(w-w')\rangle| \\
&\le \|v-v'\|\,\|M\|\,\|w\|+\|v'\|\,\|M\|\,\|w-w'\| \\
&\le \|M\|\,\{\|v-v'\|+\|w-w'\|\}
\end{aligned}$$

for $(v,w),(v',w')\in T$, where the last step uses $\|w\|\le1$ and $\|v'\|\le1$. If we define a metric on $T$ as

$$d((v,w),(v',w')):=\|v-v'\|+\|w-w'\|,$$

we see that the random process $\{X_{v,w}\}_{(v,w)\in T}$ is Lipschitz for the metric $d$.

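Note that the random Lipschitz constant happens to be $\|M\|$, which is in fact the quantity we are trying to control in the first place! This is a rather peculiar situation, but we can nonetheless readily apply Lemma 5.7: this yields

$$E[\|M\|]\le\varepsilon E[\|M\|]+\sqrt{2\sigma^2\log N(T,d,\varepsilon)}\quad\text{for every }\varepsilon>0,$$

which we can rearrange (for $\varepsilon<1$, so that $1-\varepsilon>0$) to obtain

$$E[\|M\|]\le\inf_{0<\varepsilon<1}\frac{\sqrt{2\sigma^2}}{1-\varepsilon}\sqrt{\log N(T,d,\varepsilon)}.$$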

What remains is to estimate the covering number. To this end, we must introduce an additional idea that will be of significant importance in the sequel.

How can one construct a small $\varepsilon$-net $N$? The defining property of an $\varepsilon$-net is that every point in $T$ is within a distance at most $\varepsilon$ of some point in $N$. We can always achieve this by choosing a very dense set $N$. However, if we want $|N|$ to be small, we should intuitively choose the points in $N$ to be as far apart as possible. This motivates the following definition.

Definition 5.11 ($\varepsilon$-packing and packing number). A set $N\subseteq T$ is called an $\varepsilon$-packing of $(T,d)$ if $d(t,t')>\varepsilon$ for every $t,t'\in N$, $t\ne t'$. The largest cardinality of an $\varepsilon$-packing of $(T,d)$ is called the packing number

$$D(T,d,\varepsilon):=\sup\{|N| : N \text{ is an } \varepsilon\text{-packing of } (T,d)\}.$$
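On a finite point set, an $\varepsilon$-packing can be built greedily, which makes the heuristic of choosing points "as far apart as possible" concrete. The sketch below is an illustration under assumptions: the finite set of random points in the unit square stands in for $T$, and the function names are hypothetical. The greedy output is maximal with respect to inclusion (no further point can be added), though not necessarily of maximum cardinality; the final check confirms numerically that it is nonetheless an $\varepsilon$-net of the point set.

```python
import math
import random


def greedy_packing(points, eps, dist):
    """Scan the points once, keeping each point that is > eps from all kept ones.

    The result is an eps-packing by construction, and every rejected point
    lies within eps of some kept point.
    """
    kept = []
    for p in points:
        if all(dist(p, q) > eps for q in kept):
            kept.append(p)
    return kept


def is_packing(pts, eps, dist):
    """Check the eps-packing property: distinct points are > eps apart."""
    return all(dist(p, q) > eps
               for i, p in enumerate(pts) for q in pts[i + 1:])


def is_net(net, points, eps, dist):
    """Check the eps-net property: every point is within eps of the net."""
    return all(any(dist(p, q) <= eps for q in net) for p in points)


rng = random.Random(1)   # fixed seed, illustrative only
# Finite stand-in for T: random points in the unit square, Euclidean metric.
T = [(rng.random(), rng.random()) for _ in range(500)]
eps = 0.2
D = greedy_packing(T, eps, math.dist)
print(len(D), is_packing(D, eps, math.dist), is_net(D, T, eps, math.dist))
```

The same routine works for any metric passed as `dist`, since it only uses pairwise distances.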

The key idea, which was already hinted at above, is that the notion of packing is dual to the notion of covering, as is made precise by the following result. This means that we can use covering and packing interchangeably (up to constants). In some cases it is easier to estimate packing numbers than covering numbers, as we will see shortly. On the other hand, we will see in the following chapter that packing numbers arise naturally when we aim to prove lower bounds for the suprema of random processes (as opposed to upper bounds, which are considered exclusively in this chapter).

Lemma 5.12 (Duality between covering and packing). For every $\varepsilon>0$

$$D(T,d,2\varepsilon)\le N(T,d,\varepsilon)\le D(T,d,\varepsilon).$$

Note that this can indeed be viewed as a form of duality (in the sense of optimization): the packing number is defined in terms of a supremum, but the covering number is defined in terms of an infimum.

Proof. Let $D$ be a $2\varepsilon$-packing and let $N$ be an $\varepsilon$-net. For every $t\in D$, choose $\pi(t)\in N$ such that $d(t,\pi(t))\le\varepsilon$. Then for $t\ne t'$, we have

$$2\varepsilon<d(t,t')\le d(t,\pi(t))+d(\pi(t),\pi(t'))+d(\pi(t'),t')\le2\varepsilon+d(\pi(t),\pi(t')),$$

which implies $\pi(t)\ne\pi(t')$. Thus $\pi:D\to N$ is one-to-one, and therefore $|D|\le|N|$. This yields the first inequality $D(T,d,2\varepsilon)\le N(T,d,\varepsilon)$.

To obtain the second inequality, let $D$ be a maximal $\varepsilon$-packing of $(T,d)$ (that is, $|D|=D(T,d,\varepsilon)$). We claim that $D$ is necessarily an $\varepsilon$-net. Indeed, suppose this is not the case; then there is a point $t\in T$ such that $d(t,t')>\varepsilon$ for every $t'\in D$. But then $D\cup\{t\}$ must be an $\varepsilon$-packing also, which contradicts the maximality of $D$. Thus we have $D(T,d,\varepsilon)=|D|\ge N(T,d,\varepsilon)$. □

We are now in a position to bound the covering number of the Euclidean ball $B_2^n$ with respect to the Euclidean distance. The proof of this elementary result uses a clever technique known as a volume argument.

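Lemma 5.13. We have $N(B_2^n,\|\cdot\|,\varepsilon)=1$ for $\varepsilon\ge1$ and

$$\Big(\frac{1}{\varepsilon}\Big)^n\le N(B_2^n,\|\cdot\|,\varepsilon)\le\Big(\frac{3}{\varepsilon}\Big)^n\quad\text{for } 0<\varepsilon<1.$$

Proof. That $N(B_2^n,\|\cdot\|,\varepsilon)=1$ for $\varepsilon\ge1$ is obvious: by definition, we have $\|t\|=\|t-0\|\le1$ for every $t\in B_2^n$, so the singleton $\{0\}$ is an $\varepsilon$-net. The main part of the proof is illustrated in the following figure:

[Figure: left panel, a $2\varepsilon$-packing of $B_2^n$ with disjoint balls of radius $\varepsilon$ inside a larger ball; right panel, an $\varepsilon$-net of $B_2^n$ whose balls of radius $\varepsilon$ cover $B_2^n$.]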

The colored ball is $B_2^n$. To obtain an upper bound on the covering number, we choose a $2\varepsilon$-packing $D$ of $B_2^n$ (black dots in left figure). Then the balls of radius $\varepsilon$ around $t\in D$ must be disjoint, and all these balls are contained in a large ball of radius $1+\varepsilon$. As the sum of the volumes of the small balls (of which there are $|D|$) is bounded above by the volume of the large ball, we obtain an upper bound on the size of $D$ (and thus on the covering number by Lemma 5.12). To obtain a lower bound on the covering number, we choose an $\varepsilon$-net $N$ of $B_2^n$ (black dots in right figure). As the balls of radius $\varepsilon$ around $t\in N$ cover $B_2^n$, the sum of the volumes of these balls (of which there are $|N|$) is bounded below by the volume of $B_2^n$. This yields a lower bound on the size of $N$.

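We now proceed to make this argument precise. Let us begin with the upper bound. Let $D$ be a $2\varepsilon$-packing of $B_2^n$. As $d(t,t')>2\varepsilon$ for all $t\ne t'$ in $D$, the balls $\{B(t,\varepsilon):t\in D\}$ must be disjoint. On the other hand, every ball $B(t,\varepsilon)$ for $t\in B_2^n$ must be contained in the larger ball $B(0,1+\varepsilon)$. Thus

$$\sum_{t\in D}\lambda(B(t,\varepsilon))=\lambda\Big(\bigcup_{t\in D}B(t,\varepsilon)\Big)\le\lambda(B(0,1+\varepsilon)),$$

where $\lambda$ denotes the Lebesgue measure on $\mathbb{R}^n$. By homogeneity of the Lebesgue measure, $\lambda(B(t,\alpha))=\lambda(B(0,\alpha))=\lambda(\alpha B(0,1))=\alpha^n\lambda(B(0,1))$. Thus

$$|D|\le\frac{\lambda(B(0,1+\varepsilon))}{\lambda(B(0,\varepsilon))}=\Big(\frac{1+\varepsilon}{\varepsilon}\Big)^n.$$

As this holds for every $2\varepsilon$-packing $D$, we have evidently proved the upper bound $N(B_2^n,\|\cdot\|,2\varepsilon)\le D(B_2^n,\|\cdot\|,2\varepsilon)\le(1+1/\varepsilon)^n\le(3/(2\varepsilon))^n$ for $2\varepsilon<1$.

To obtain the lower bound, let $N$ be an $\varepsilon$-net for $B_2^n$. Then

$$\lambda(B_2^n)\le\lambda\Big(\bigcup_{t\in N}B(t,\varepsilon)\Big)\le\sum_{t\in N}\lambda(B(t,\varepsilon)),$$

so we obtain

$$|N|\ge\frac{\lambda(B_2^n)}{\lambda(B(0,\varepsilon))}=\Big(\frac{1}{\varepsilon}\Big)^n.$$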
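The upper half of the volume argument can be checked on a simulated example. The following sketch is an illustration under assumptions (dimension $n=2$, scale $\varepsilon=0.25$, a random sample of the unit disc, hypothetical function names). It builds an $\varepsilon$-packing of the sample and compares its size with the volume bound for $\varepsilon$-packings, $(1+2/\varepsilon)^n$, which is the bound from the proof after replacing $2\varepsilon$ by $\varepsilon$. The lower bound $(1/\varepsilon)^n$ is printed for reference only, since it concerns nets of the whole ball rather than of a finite sample.

```python
import math
import random


def volume_bounds(n, eps):
    """Return (1/eps)^n, (1 + 2/eps)^n and (3/eps)^n for 0 < eps < 1."""
    return (1 / eps) ** n, (1 + 2 / eps) ** n, (3 / eps) ** n


def sample_ball(n, rng):
    """Uniform random point in the Euclidean unit ball (rejection sampling)."""
    while True:
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if math.hypot(*x) <= 1.0:
            return x


def greedy_packing(points, eps):
    """Keep each point that is more than eps away from all points kept so far."""
    kept = []
    for p in points:
        if all(math.dist(p, q) > eps for q in kept):
            kept.append(p)
    return kept


rng = random.Random(2)       # fixed seed, illustrative only
n, eps = 2, 0.25             # hypothetical dimension and scale
lower, middle, upper = volume_bounds(n, eps)
sample = [sample_ball(n, rng) for _ in range(2000)]
D = greedy_packing(sample, eps)
print(lower, len(D), middle, upper)
```

Rejection sampling is only practical in low dimension; it is used here because the example is two-dimensional.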


Remark 5.14. Lemma 5.13 quantifies explicitly the dependence of the covering number on dimension: the number of balls of radius $\varepsilon$ needed to cover a ball in $\mathbb{R}^n$ is polynomial in $1/\varepsilon$ of order $n$. This is not surprising: think of how many cubes of side length $\varepsilon$ can fit into the unit cube in $\mathbb{R}^n$. While balls do not pack as nicely as cubes, the ultimate conclusion is the same (in fact, the conclusion of Lemma 5.13 carries over to any norm on $\mathbb{R}^n$, see Problem 5.5). In this manner, the dependence on dimension will enter explicitly into our estimates of the suprema of random processes.

Beyond the concrete result on covering numbers in $\mathbb{R}^n$, Lemma 5.13 provides a good way to think about the notion of dimension in the first place. The classical idea that $\mathbb{R}^n$ is $n$-dimensional stems from its linear structure: there is a basis of size $n$ such that any vector in $\mathbb{R}^n$ can be written as a linear combination of these basis elements. This linear-algebraic notion of dimension is not very useful in general spaces where one does not need to have any linear structure. Lemma 5.13 motivates a different notion of dimension that makes sense in any metric space: we say that a metric space $(T,d)$ has metric dimension $n$ if $N(T,d,\varepsilon)\sim\varepsilon^{-n}$. Lemma 5.13 shows that for (bounded subsets of) $\mathbb{R}^n$, the linear-algebraic and metric notions of dimension coincide; however, the definition of metric dimension is independent of the linear structure of the space. The notion of metric dimension certainly conforms to the intuitive notion that a high-dimensional space has more "room" than a low-dimensional space (the number of balls of fixed radius needed to cover the space increases exponentially in the dimension). Of course, not every metric space has finite metric dimension: we will shortly encounter an infinite-dimensional space $(T,d)$ for which the covering numbers grow exponentially in $1/\varepsilon$.

Having developed some basic estimates, we can now complete the example of random matrices. Here we are not interested in the covering number of $B_2^n$ itself, but rather in the covering number of $T=B_2^n\times B_2^m$ with respect to the metric $d$. The latter is however easily estimated using Lemma 5.13. Let $N$ be an $\varepsilon$-net for $B_2^n$ and let $M$ be an $\varepsilon$-net for $B_2^m$. Then $N\times M$ is a $2\varepsilon$-net for $T$ of cardinality $|N||M|$: indeed, setting $\pi((t,s))=(\pi(t),\pi(s))$, we have

$$d((t,s),\pi((t,s)))=\|t-\pi(t)\|+\|s-\pi(s)\|\le2\varepsilon.$$

This evidently implies that

$$N(T,d,2\varepsilon)\le N(B_2^n,\|\cdot\|,\varepsilon)\,N(B_2^m,\|\cdot\|,\varepsilon)\le\Big(\frac{3}{\varepsilon}\Big)^{n+m}$$

for $\varepsilon\le1$. We therefore obtain

$$E[\|M\|]\le\inf_{0<\varepsilon<1}\frac{\sqrt{2\sigma^2}}{1-\varepsilon}\sqrt{\log N(T,d,\varepsilon)}\lesssim\sqrt{n+m}.$$

It turns out that this crude bound already captures the correct order of magnitude of the matrix norm! In particular, for square matrices, we obtain $E[\|M\|]\lesssim\sqrt{n}$, as was already alluded to in Example 2.5.
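The order of magnitude $\sqrt{n+m}$ can be observed in simulation. The sketch below is an illustration under assumptions: the entries are taken to be standard Gaussian (one particular case with $\sigma^2=1$), the sizes $n=30$, $m=20$ are arbitrary, and the operator norm is approximated by power iteration on $M^\top M$, which approaches $\|M\|$ from below. The function names are hypothetical.

```python
import math
import random


def operator_norm(M, iters=60, seed=0):
    """Approximate ||M|| = sup <v, M w> by power iteration on M^T M.

    Each iterate returns |M w| for a unit vector w, hence a lower bound
    on the operator norm that improves with the number of iterations.
    """
    rng = random.Random(seed)
    n, m = len(M), len(M[0])
    w = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = 0.0
    for _ in range(iters):
        scale = math.sqrt(sum(x * x for x in w))
        w = [x / scale for x in w]
        Mw = [sum(M[i][j] * w[j] for j in range(m)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in Mw))
        w = [sum(M[i][j] * Mw[i] for i in range(n)) for j in range(m)]
    return norm


def mean_norm(n, m, trials, rng):
    """Monte Carlo average of the approximate norm of an n x m Gaussian matrix."""
    total = 0.0
    for _ in range(trials):
        M = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
        total += operator_norm(M)
    return total / trials


rng = random.Random(3)       # fixed seed, illustrative only
n, m = 30, 20                # hypothetical matrix size
estimate = mean_norm(n, m, trials=10, rng=rng)
reference = math.sqrt(n) + math.sqrt(m)
print(round(estimate, 2), round(reference, 2))
```

The reference value $\sqrt{n}+\sqrt{m}$ is comparable to $\sqrt{n+m}$ up to a factor of at most $\sqrt{2}$, so either can serve as the yardstick here.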

We now turn to our second example. Unlike in the previous example, where we got a sharp result with little work, we will not be so lucky here: we will derive a nontrivial bound from Lemma 5.7, but the methods we developed so far will prove to be too crude to capture the correct order of magnitude.

Example 5.15 (Wasserstein law of large numbers). Let $X_1,X_2,\ldots$ be i.i.d. random variables with values in the interval $[0,1]$. We denote their distribution as $X_i\sim\mu$. Define the empirical measure of $X_1,\ldots,X_n$ as

$$\mu_n:=\frac{1}{n}\sum_{k=1}^n\delta_{X_k}.$$

Then it is easy to estimate

$$E|\mu_nf-\mu f|\le E[|\mu_nf-\mu f|^2]^{1/2}\le\frac{\|f\|_\infty}{\sqrt{n}}.$$
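The $n^{-1/2}$ bound is easy to observe by Monte Carlo. The sketch below rests on assumptions made for illustration only: $\mu$ is taken to be the uniform distribution on $[0,1]$, the test function is $f(x)=\cos(2\pi x)$ (so that $\|f\|_\infty=1$ and $\mu f=0$, reading $\mu f$ as the integral of $f$ against $\mu$), and the function names are hypothetical.

```python
import math
import random


def f(x):
    """Bounded test function with sup norm 1 and integral 0 over [0, 1]."""
    return math.cos(2.0 * math.pi * x)


def mean_abs_error(n, trials, rng):
    """Monte Carlo estimate of E|mu_n f - mu f| for uniform samples on [0, 1]."""
    total = 0.0
    for _ in range(trials):
        mu_n_f = sum(f(rng.random()) for _ in range(n)) / n
        total += abs(mu_n_f - 0.0)          # mu f = 0 for this choice of f
    return total / trials


rng = random.Random(4)       # fixed seed, illustrative only
for n in (25, 100, 400):
    err = mean_abs_error(n, trials=300, rng=rng)
    print(n, round(err, 4), round(1.0 / math.sqrt(n), 4))
```

Quadrupling $n$ should roughly halve the estimated error, in line with the $n^{-1/2}$ rate.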

In particular, we have $\mu_nf\to\mu f$ in $L^1$ for every bounded function $f$: this is none other than the weak law of large numbers with the optimal $n^{-1/2}$ rate.

At what rate does the law of large numbers $\mu_n\to\mu$ hold when we consider other notions of distance between probability measures? In this spirit, we will presently attempt to estimate the expected Wasserstein distance $E[W_1(\mu_n,\mu)]$ between the empirical measure and the underlying distribution. Recall that

$$W_1(\mu_n,\mu)=\sup_{f\in\mathrm{Lip}([0,1])}\{\mu_nf-\mu f\}=\sup_{f\in F}$$
