2.1 Tail bounds and concentration
2.2 Moment generating functions
2.3 Orlicz norms
2.4 Subgaussian tails
2.5 Bennett’s inequality
2.6 Bernstein’s inequality
2.7 Subexponential distributions?
2.8 The Hanson-Wright inequality for subgaussians
2.9 Problems
2.10 Notes

Printed: 27 November 2015

version: 20nov15 Mini-empirical


Chapter 2

A few good inequalities

Section 2.1 introduces exponential tail bounds.

Section 2.2 introduces the method for bounding tail probabilities using moment generating functions.

Section 2.3 describes analogs of the usual $L^p$ norms, which have proved useful in empirical process theory.

Section 2.4 discusses subgaussianity, a most useful extension of the gaussianity property.

Section 2.5 discusses the Bennett inequality for sums of independent random variables that are bounded above by a constant.

Section 2.6 shows how the Bennett inequality implies one form of the Bernstein inequality, then discusses extensions to unbounded summands.

Section 2.7 describes two ways to capture the idea of tails that decrease like those of the exponential distribution.

Section 2.8 illustrates the ideas from the previous Section by deriving a subgaussian/subexponential tail bound for quadratic forms in independent subgaussian random variables.

2.1 Tail bounds and concentration


Sums of independent random variables are often approximately normally distributed. In particular, their tail probabilities often decrease in ways similar to the tails of a normal distribution. Several classical inequalities make this vague idea more precise. Such inequalities provided the foundation on which empirical process theory was built.


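We know a lot about the normal distribution. For example, if $W$ has a $N(\mu, \sigma^2)$ distribution then (Feller, 1968, Section 7.1)

$$\left(\frac{1}{x} - \frac{1}{x^3}\right) \frac{e^{-x^2/2}}{\sqrt{2\pi}} \le P\{W \ge \mu + \sigma x\} \le \frac{1}{x}\,\frac{e^{-x^2/2}}{\sqrt{2\pi}} \qquad\text{for all } x > 0.$$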

Clearly the inequalities are useful only for larger $x$: as $x$ decreases to zero the lower bound goes to $-\infty$ and the upper bound goes to $+\infty$. For many purposes the $\exp(-x^2/2)$ factor matters the most. Indeed, the simpler tail bound $P\{W \ge \mu + \sigma x\} \le \exp(-x^2/2)$ for $x \ge 0$ often suffices for asymptotic arguments. Sometimes we need a better inequality showing that the distribution concentrates most of its probability mass in a small region, such as

$$P\{|W - \mu| \ge \sigma x\} \le 2e^{-x^2/2} \qquad\text{for all } x \ge 0,$$

a concentration inequality.

Similar-looking tail bounds exist for random variables that are not exactly normally distributed. Particularly useful are the so-called exponential tail bounds, which take the form

<1> $$P\{X \ge x + PX\} \le e^{-g(x)} \qquad\text{for } x \ge 0,$$

with g a function that typically increases like a power of x. Sometimes g behaves like a quadratic for some range of x and like a linear function further out in the tails; sometimes some logarithmic terms creep in; and sometimes it depends on var(X) or on some more exotic measure of size, such as an Orlicz norm (see Section 2.3).

For example, suppose $S = X_1 + \dots + X_n$, a sum of independent random variables with $PX_i = 0$ and each $X_i$ bounded above by some constant $b$. The Bernstein inequality (Section 2.6) tells us that

$$P\{S \ge x\sqrt{V}\} \le \exp\left(\frac{-x^2/2}{1 + \tfrac13 bx/\sqrt{V}}\right) \qquad\text{for } x \ge 0,$$

where $V$ is any upper bound for $\mathrm{var}(S)$. If each $X_i$ is also bounded below by $-b$ then a similar tail bound exists for $-S$, which leads to a concentration inequality, an upper bound for $P\{|S| \ge x\sqrt{V}\}$.

For $x$ much smaller than $\sqrt{V}/b$ the exponent in the Bernstein inequality behaves like $-x^2/2$; for much larger $x$ the exponent behaves like a negative multiple of $x$. In a sense that is made more precise in Sections 2.6 and 2.7, the inequality gives a subgaussian bound for $x$ not too big and a subexponential bound for very large $x$.


2.2 Moment generating functions


The most common way to get a bound like <1> uses the moment generating function, $M(\theta) := Pe^{\theta X}$. From the fact that $\exp$ is nonnegative and $\exp(\theta(X - w)) \ge 1$ when $X \ge w$ and $\theta \ge 0$ we have

<2> $$P\{X \ge w\} \le \inf_{\theta \ge 0} Pe^{\theta(X - w)} = \inf_{\theta \ge 0} e^{-\theta w} M(\theta).$$

<3> Example. If $X$ has a $N(\mu, \sigma^2)$ distribution then $M(\theta) = \exp(\theta\mu + \sigma^2\theta^2/2)$ is finite for all real $\theta$. For $x \ge 0$ inequality <2> gives

$$P\{X \ge \mu + \sigma x\} \le \inf_{\theta \ge 0} \exp\left(-\theta(\mu + \sigma x) + \theta\mu + \theta^2\sigma^2/2\right) = \exp(-x^2/2) \qquad\text{for all } x \ge 0,$$

the minimum being achieved by $\theta = x/\sigma$. Analogous arguments, with $X - \mu$ replaced by $\mu - X$, give an analogous bound for the lower tail,

$$P\{X \le \mu - \sigma x\} \le \exp(-x^2/2) \qquad\text{for all } x > 0,$$

leading to the concentration inequality $P\{|X - \mu| \ge \sigma x\} \le 2e^{-x^2/2}$.
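The minimization in the normal example can be checked numerically. The following is a minimal sketch, not part of the original argument: the function name `chernoff_normal_tail` and the grid search over θ are illustrative assumptions, chosen only to show that the infimum of exp(−θw)M(θ) reproduces exp(−x²/2).

```python
import math

def chernoff_normal_tail(mu, sigma, x, grid=20001):
    """Grid-minimise exp(-theta*w) * M(theta) over theta >= 0 for X ~ N(mu, sigma^2),
    where w = mu + sigma*x and M(theta) = exp(theta*mu + sigma^2 theta^2 / 2)."""
    w = mu + sigma * x
    theta_max = 2.0 * x / sigma + 1.0      # hypothetical search range, wide enough to contain x/sigma
    best_log = float("inf")
    for k in range(grid):
        theta = theta_max * k / (grid - 1)
        log_bound = -theta * w + theta * mu + 0.5 * (sigma * theta) ** 2
        best_log = min(best_log, log_bound)
    return math.exp(best_log)

bound = chernoff_normal_tail(mu=1.0, sigma=2.0, x=1.5)
exact_tail = 0.5 * math.erfc(1.5 / math.sqrt(2.0))   # P{N(0,1) >= 1.5}
print(bound, math.exp(-1.5 ** 2 / 2), exact_tail)
```

The grid value should agree with exp(−x²/2) up to discretization error, and the exact normal tail should sit below it.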



Remark. Of course the algebra would have been a tad simpler if I had worked with the standardized variable (X − µ)/σ. I did things the messier way in order to make a point about the centering in Section 2.4.

Exactly the same arguments work for any random variable whose moment generating function is bounded above by $\exp(\theta\mu + \sigma^2\theta^2/2)$, for all real $\theta$ and some finite constants $\mu$ and $\sigma$. This property essentially defines the subgaussian distributions, which are discussed in Section 2.4.

The function $L(\theta) := \log M(\theta)$ is called the cumulant-generating function. The upper bound in <2> is often rewritten as

$$e^{-L^*(w)} \qquad\text{where } L^*(w) := \sup_{\theta \ge 0}\,(\theta w - L(\theta)).$$

Remark. As Boucheron et al. (2013, Section 2.2) and others have noted, the function $L^*$ is the Fenchel-Legendre transform of $L$, which is also known as the conjugate function (Boyd and Vandenberghe, 2004, Section 3.3).


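Clearly $L(0) = 0$. The Hölder inequality shows that $L$ is convex: if $\theta$ is a convex combination $\alpha_1\theta_1 + \alpha_2\theta_2$ then

$$e^{L(\theta)} = M(\alpha_1\theta_1 + \alpha_2\theta_2) = P\left(e^{\alpha_1\theta_1 X}\, e^{\alpha_2\theta_2 X}\right) \le \left(Pe^{\theta_1 X}\right)^{\alpha_1} \left(Pe^{\theta_2 X}\right)^{\alpha_2} = e^{\alpha_1 L(\theta_1) + \alpha_2 L(\theta_2)}.$$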

More precisely (Problem [1]), the set $J = \{\theta : M(\theta) < \infty\}$ is an interval, possibly degenerate or unbounded, containing the origin. For example, for the Cauchy, $M(\theta) = \infty$ for all real $\theta \ne 0$; for the standard exponential distribution (density $e^{-x}\{x > 0\}$ with respect to Lebesgue measure),

$$M(\theta) = \begin{cases} (1 - \theta)^{-1} & \text{for } \theta < 1 \\ \infty & \text{otherwise.} \end{cases}$$

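Problem [1] also justifies the interchange of differentiation and expectation on the interior of $J$, which leads to

$$L'(\theta) = \frac{M'(\theta)}{M(\theta)} = \frac{PXe^{\theta X}}{M(\theta)}$$

$$L''(\theta) = \frac{M''(\theta)}{M(\theta)} - \left(\frac{M'(\theta)}{M(\theta)}\right)^2 = \frac{PX^2e^{\theta X}}{M(\theta)} - \left(\frac{PXe^{\theta X}}{M(\theta)}\right)^2.$$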

If we define a probability measure $P_\theta$ by its density, $dP_\theta/dP = e^{\theta X}/M(\theta)$, then the derivatives of $L$ can be re-expressed using moments of $X$ under $P_\theta$:

$$L'(\theta) = P_\theta X \qquad\text{and}\qquad L''(\theta) = P_\theta(X - P_\theta X)^2 = \mathrm{var}_\theta(X) \ge 0.$$

In particular, $L'(0) = PX$, so that Taylor expansion gives

<4> $$L(\theta) = \theta PX + \tfrac12\theta^2\,\mathrm{var}_{\theta^*}(X) \qquad\text{with } \theta^* \text{ lying between } 0 \text{ and } \theta.$$

Bounds on the P θ variance of X lead to bounds on L and tail bounds for X − PX. The Hoeffding inequality (Example < 14 >) provides a good illustration of this line of attack.
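The identification of the derivatives of L with the mean and variance of X under the tilted measure can be illustrated for a finite distribution. The sketch below is an assumption-laden illustration: the three-point distribution and the helper names `cgf` and `tilted_mean_var` are hypothetical, and the derivatives of L are approximated by finite differences.

```python
import math

def cgf(theta, xs, ps):
    """L(theta) = log P exp(theta X) for X taking value xs[i] with probability ps[i]."""
    return math.log(sum(p * math.exp(theta * x) for x, p in zip(xs, ps)))

def tilted_mean_var(theta, xs, ps):
    """Mean and variance of X under P_theta, whose density is exp(theta X)/M(theta)."""
    weights = [p * math.exp(theta * x) for x, p in zip(xs, ps)]
    total = sum(weights)
    q = [w / total for w in weights]
    mean = sum(qi * x for qi, x in zip(q, xs))
    var = sum(qi * (x - mean) ** 2 for qi, x in zip(q, xs))
    return mean, var

xs, ps, theta, h = [-1.0, 0.0, 2.0], [0.3, 0.5, 0.2], 0.7, 1e-4
d1 = (cgf(theta + h, xs, ps) - cgf(theta - h, xs, ps)) / (2 * h)
d2 = (cgf(theta + h, xs, ps) - 2 * cgf(theta, xs, ps) + cgf(theta - h, xs, ps)) / h ** 2
print(d1, d2, tilted_mean_var(theta, xs, ps))
```

The finite-difference estimates of L′(θ) and L″(θ) should match the tilted mean and variance to several decimal places.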

2.3 Orlicz norms


There is a fruitful way to generalize the concept of an $L^p$ norm by means of a Young function, that is, a convex, increasing function $\Psi : \mathbb{R}^+ \to \mathbb{R}^+$ with $\Psi(0) = 0$ and $\Psi(x) \to \infty$ as $x \to \infty$. The concept applies with any measure $\mu$ but I need it mostly for probability measures.


Remark. The assumption that $\Psi(x) \to \infty$ as $x \to \infty$ is needed only to rule out the trivial case where $\Psi \equiv 0$. Indeed, if $\Psi(x_0) > 0$ for some $x_0$ then, for $x = x_0/\alpha$ with $0 < \alpha < 1$,

$$\Psi(x_0) = \Psi(\alpha x + (1 - \alpha)0) \le \alpha\Psi(x).$$

That is, $\Psi(x)/x \ge \Psi(x_0)/x_0$ for all $x \ge x_0$. If $\Psi$ is not identically zero then it must increase at least linearly with increasing $x$.

At one time Orlicz norms became very popular in the empirical process literature, in part (I suspect) because chaining arguments (see Chapter 4) are cleaner with norms than with tail probabilities.

The most important Young functions for my purposes are the power functions, $\Psi(x) = x^p$ for a fixed $p \ge 1$, and the exponential power functions $\Psi_\alpha(x) := \exp(x^\alpha) - 1$ for $\alpha \ge 1$. The function $\Psi_2$ is particularly useful (see Section 2.4) for working with distributions whose tails decrease like a Gaussian. It is also possible (Problem [6]) to define a Young function $\Psi_\alpha$, for $0 < \alpha < 1$, for which

$$\Psi_\alpha(x) \le \exp(x^\alpha) \le K_\alpha + \Psi_\alpha(x) \qquad\text{for all } x \ge 0,$$

where $K_\alpha$ is a positive constant. Some linear interpolation near 0 is needed to overcome the annoyance that $x \mapsto \exp(x^\alpha)$ is concave on the interval $[0, z_\alpha]$, for $z_\alpha := ((1 - \alpha)/\alpha)^{1/\alpha}$.

<5> Definition. Let $(\mathcal{X}, \mathcal{A}, \mu)$ be a measure space and $f$ be an $\mathcal{A}$-measurable real function on $\mathcal{X}$. For each Young function $\Psi$ the corresponding Orlicz norm $\|f\|_\Psi$ (or $\Psi$-norm) is defined by

$$\|f\|_\Psi = \inf\{c > 0 : \mu\Psi(|f(x)|/c) \le 1\},$$

with $\|f\|_\Psi = \infty$ if the infimum runs over an empty set.

Remark. It is not hard to show that $\|f\|_\Psi < \infty$ if and only if $\mu\Psi(|f|/C_0) < \infty$ for at least one finite constant $C_0$. Moreover, if $0 < c = \|f\|_\Psi < \infty$ then $\mu\Psi(|f|/c) \le 1$; the infimum is achieved.

When restricted to the set $L^\Psi(\mathcal{X}, \mathcal{A}, \mu)$ of all measurable real functions on $\mathcal{X}$ with finite $\Psi$-norm, $\|\cdot\|_\Psi$ is a seminorm. It fails to be a norm only because $\|f\|_\Psi = 0$ if and only if $\mu\{x : f(x) \ne 0\} = 0$.

The corresponding set of $\mu$-equivalence classes is a Banach space. It is traditional and benign to call $\|\cdot\|_\Psi$ a norm even though it is only a seminorm on $L^\Psi(\mathcal{X}, \mathcal{A}, \mu)$.

The assumption $\Psi(0) = 0$ can be discarded if $\mu$ is a finite measure. For each finite constant $A > \Psi(0)$, the quantity $\inf\{c > 0 : \mu\Psi(|f|/c) \le A\}$ also defines a seminorm.


Finiteness of the $\Psi$-norm for a random variable $Z$ immediately implies a tail bound: if $0 < \sigma = \|Z\|_\Psi < \infty$ then

<6> $$P\{|Z| \ge x\sigma\} \le \min\left(1, \frac{P\Psi(|Z|/\sigma)}{\Psi(x)}\right) \le \min\left(1, \frac{1}{\Psi(x)}\right) \qquad\text{for } x \ge 0.$$

For example, for each $\alpha > 0$ there is a constant $C_\alpha$ for which $P\{|Z| \ge x\|Z\|_{\Psi_\alpha}\} \le C_\alpha e^{-x^\alpha}$ for $x \ge 0$.

The constant can be chosen so that $C_\alpha e^{-x^\alpha} \ge 1$ for all $x$ so small that $\Psi_\alpha(x)e^{-x^\alpha}$ is not close to 1.

When seeking to bound an Orlicz norm I often find it easier to first obtain a constant $C_0$ for which $P\Psi(|X|/C_0) \le K_0$, where $K_0$ is a constant strictly larger than 1. As the next Lemma shows, a simple convexity argument turns such an inequality into a bound on $\|X\|_\Psi$.

<7> Lemma. Suppose $\Psi$ is a Young function. If $P\Psi(|X|/C_0) \le K_0$ for some finite constants $C_0$ and $K_0 > 1$ then $\|X\|_\Psi \le C_0K_0$.

Proof For each $\theta$ in $[0, 1]$ convexity of $\Psi$ gives

$$P\Psi\left(\frac{\theta|X|}{C_0}\right) \le \theta P\Psi\left(\frac{|X|}{C_0}\right) + (1 - \theta)\Psi(0) \le \theta K_0.$$

The choice $\theta = 1/K_0$ makes the last bound equal to 1.



Beware. The Lemma does not give a foolproof way of determining reasonable bounds for the $\Psi$-norm. For example, if $X$ has a $N(0, 1)$ distribution then simple integration shows that

$$P\Psi_2(|X|/c) = g(c) := c/\sqrt{c^2 - 2} - 1 \qquad\text{for } c > \sqrt{2}.$$

Thus $\|X\|_{\Psi_2} = \sqrt{8/3}$ but $c\,g(c) \to \infty$ as $c$ decreases to $\sqrt{2}$.
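The warning can be made concrete with a few lines of arithmetic. The sketch below assumes only the closed form for g(c) quoted above; the bisection routine and the helper `lemma_bound`, which evaluates the Lemma's bound C₀K₀ with C₀ = c and K₀ = g(c), are illustrative names introduced here.

```python
import math

def g(c):
    """P Psi_2(|X|/c) for X ~ N(0,1); the closed form is valid only for c > sqrt(2)."""
    return c / math.sqrt(c * c - 2.0) - 1.0

def psi2_norm_standard_normal(tol=1e-12):
    """Smallest c with g(c) <= 1, found by bisection (g is decreasing in c)."""
    lo, hi = math.sqrt(2.0) + 1e-9, 10.0   # g(lo) is huge, g(hi) is close to 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

def lemma_bound(c):
    """Bound C0*K0 from the Lemma with C0 = c and K0 = g(c), assuming g(c) > 1."""
    return c * g(c)

print(psi2_norm_standard_normal(), math.sqrt(8.0 / 3.0))
print([round(lemma_bound(c), 3) for c in (1.6, 1.5, 1.45, 1.42)])
```

The bisection should recover sqrt(8/3), while the values of `lemma_bound` grow without limit as c approaches sqrt(2) from above.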

2.4 Subgaussian tails


As noted in Section 2.2, if $X$ is a random variable for which there exist constants $\mu$ and $\sigma^2$ for which $Pe^{\theta X} \le \exp(\mu\theta + \tfrac12\sigma^2\theta^2)$ for all real $\theta$ then

<8> $$P\{X \ge \mu + \sigma x\} \le e^{-x^2/2} \qquad\text{for all } x \ge 0.$$


The approximations for $\theta$ near zero,

$$Pe^{\theta X} = 1 + \theta PX + \tfrac12\theta^2 PX^2 + o(\theta^2),$$

$$\exp(\mu\theta + \tfrac12\sigma^2\theta^2) = 1 + \mu\theta + \tfrac12(\sigma^2 + \mu^2)\theta^2 + o(\theta^2),$$

would force $PX = \mu$ and $PX^2 \le \sigma^2 + \mu^2$, that is, $\mathrm{var}(X) \le \sigma^2$. As Problem [11] shows, $\sigma^2$ might need to be much larger than the variance. To avoid any hint of the convention that $\sigma^2$ denotes a variance, I feel it is safer to replace $\sigma$ by another symbol.

<9> Definition. Say that a random variable $X$ with $PX = \mu$ has a subgaussian distribution with scale factor $\tau < \infty$ if $Pe^{\theta X} \le \exp(\mu\theta + \tau^2\theta^2/2)$ for all real $\theta$. Write $\tau(X)$ for the smallest $\tau$ for which such a bound holds.

Remark. Some authors call such an $X$ a $\tau$-subgaussian random variable. Some authors require $\mu = 0$ for subgaussianity; others use the qualifier centered to refer to the case where $\mu$ is zero.

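For a subgaussian $X$, the two-sided analog of <8> gives

<10> $$P\{|X - \mu| \ge x\} \le 2\exp\left(-\frac{x^2}{2\beta^2}\right) \qquad\text{for all } x \ge 0,$$

with $\beta = \tau(X)$; the smallest $\beta$ for which such a bound holds therefore satisfies $0 < \beta \le \tau(X)$. In fact (Theorem <16>) existence of such a tail bound is equivalent to subgaussianity.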

<11> Example. Suppose $Y = (Y_1, \dots, Y_n)$ has a multivariate normal distribution. Define $M := \max_{i \le n} |Y_i|$ and $\sigma^2 := \max_{i \le n} \mathrm{var}(Y_i)$. Chapter 6 explains why

$$P\{|M - PM| \ge \sigma x\} \le 2\exp(-x^2/2) \qquad\text{for all } x \ge 0.$$

That is, $M$ is subgaussian, a most surprising and wonderful fact.



Many useful results that hold for gaussian variables can be extended to subgaussians. In empirical process theory, conditional subgaussianity plays a key role in symmetrization arguments (see Chapter 8). The first example captures the main idea.

<12> Example. Suppose $X = \sum_{i \le n} b_i s_i$, where the $b_i$'s are constants and the $s_i$'s are independent Rademacher random variables, that is, $P\{s_i = +1\} = \tfrac12 = P\{s_i = -1\}$. Then

$$Pe^{\theta b_i s_i} = \tfrac12\left(e^{\theta b_i} + e^{-\theta b_i}\right) = \sum_{k=0}^{\infty} \frac{(\theta b_i)^{2k}}{(2k)!} \le \exp(\theta^2 b_i^2/2)$$

so that $Pe^{\theta X} \le \prod_{i \le n} Pe^{\theta b_i s_i} \le e^{\theta^2\tau^2/2}$ with $\tau^2 = \sum_{i \le n} b_i^2$. That is, $X$ has a centered subgaussian distribution with $\tau(X)^2 \le \sum_{i \le n} b_i^2$.
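The moment generating function of a Rademacher sum is a product of hyperbolic cosines, so the subgaussian bound in the example can be checked directly. This is a small sketch under that observation; the coefficient vector and the function names are hypothetical choices for illustration.

```python
import math

def rademacher_sum_mgf(theta, b):
    """Exact P exp(theta * sum_i b_i s_i) = prod_i cosh(theta * b_i), by independence."""
    return math.prod(math.cosh(theta * bi) for bi in b)

def subgaussian_mgf_bound(theta, b):
    """exp(theta^2 tau^2 / 2) with tau^2 = sum_i b_i^2."""
    tau2 = sum(bi * bi for bi in b)
    return math.exp(0.5 * theta * theta * tau2)

b = [0.5, -1.0, 2.0, 0.25]          # hypothetical coefficients
for theta in (-2.0, -0.5, 0.0, 0.3, 1.0, 3.0):
    print(theta, rademacher_sum_mgf(theta, b), subgaussian_mgf_bound(theta, b))
```

For every θ the first printed value should not exceed the second, with equality at θ = 0.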



More generally, if $X$ is a sum of independent subgaussians $X_1, \dots, X_n$ then the equality

$$Pe^{\theta(X - PX)} = \prod_i Pe^{\theta(X_i - PX_i)}$$

shows that $X$ is subgaussian with $\tau^2(X) \le \sum_i \tau^2(X_i)$. For sums of independent variables we need consider only the subgaussianity of each summand.

<13> Example. Suppose $X$ is a random variable with $PX = \mu$ and $a \le X \le b$, for finite constants $a$ and $b$. By representation <4>, its moment generating function $M(\theta)$ satisfies

$$\log M(\theta) = \theta PX + \tfrac12\theta^2\,\mathrm{var}_{\theta^*}(X) \qquad\text{with } \theta^* \text{ lying between } 0 \text{ and } \theta.$$

Write $\mu^*$ for $P_{\theta^*}X$ and $m$ for the midpoint $(a + b)/2$. Note that $|X - m| \le (b - a)/2$. Thus

$$\mathrm{var}_{\theta^*}(X) = P_{\theta^*}|X - \mu^*|^2 \le P_{\theta^*}|X - m|^2 \le (b - a)^2/4$$

and $\log Pe^{\theta(X - \mu)} \le (b - a)^2\theta^2/8$. The random variable $X$ is subgaussian with $\tau(X) \le (b - a)/2$.



Remark. Hoeffding (1963, page 22) derived the same bound on the moment generating function by a direct appeal to convexity of the exponential function:

$$Pe^{\theta X} \le qe^{\theta a} + pe^{\theta b} \qquad\text{where } 1 - q = p = \frac{PX - a}{b - a}.$$

In effect, the bound represents the extremal case where $X$ concentrates on the two-point set $\{a, b\}$. He then calculated a second derivative for $L$, without interpreting the results as a variance but using an inequality that can be given a variance interpretation.

<14> Example. (Hoeffding’s inequality for independent summands) Suppose random variables $X_1, \dots, X_n$ are independent with $PX_i = \mu_i$ and $a_i \le X_i \le b_i$ for each $i$, for constants $a_i$ and $b_i$.

By Example <13>, each $X_i$ is subgaussian with $\tau(X_i) \le (b_i - a_i)/2$. The sum $S = \sum_{i \le n} X_i$ is also subgaussian, with expected value $\mu = \sum_{i \le n} \mu_i$ and $\tau^2(S) \le \sum_{i \le n} \tau^2(X_i) \le \sum_{i \le n} (b_i - a_i)^2/4$. The one-sided version of inequality <10> gives

<15> $$P\left\{\sum_{i \le n}(X_i - \mu_i) \ge x\right\} \le \exp\left(-2x^2 \Big/ \sum_{i \le n}(b_i - a_i)^2\right) \qquad\text{for each } x \ge 0.$$

See Chapter 3 for an extension of Hoeffding’s inequality < 15 > to sums of bounded martingale differences.
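Inequality <15> can be compared with an exact tail in the simplest case. The sketch below assumes independent Bernoulli(1/2) summands, so that each summand lies in [0, 1] and the sum is Binomial(n, 1/2); the function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
import math

def hoeffding_bound(x, ranges):
    """Right-hand side of <15>: exp(-2 x^2 / sum_i (b_i - a_i)^2)."""
    return math.exp(-2.0 * x * x / sum((b - a) ** 2 for a, b in ranges))

def binomial_upper_tail(n, p, k):
    """Exact P{S >= k} for S ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

n = 20
ranges = [(0.0, 1.0)] * n            # each Bernoulli summand lies in [0, 1]
for x in range(0, 11):
    exact = binomial_upper_tail(n, 0.5, n // 2 + x)   # P{S - n/2 >= x}
    print(x, exact, hoeffding_bound(x, ranges))
```

The exact tail should stay below the Hoeffding bound for every x, with the gap widening in relative terms as x grows.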



The next Theorem provides other characterizations of subgaussianity, using the Orlicz norm for the Young function $\Psi_2(x) = \exp(x^2) - 1$ and moment inequalities. The proof of the Theorem provides a preview of a “symmetrization” argument that plays a major role in Chapter 8. I prefer to write the argument using product measures rather than conditional expectations.

Here is the idea. Suppose $X_1$ is a random variable with $PX_1 = \mu$, defined on a probability space $(\Omega_1, \mathcal{F}_1, P_1)$. Make an “independent copy” $(\Omega_2, \mathcal{F}_2, P_2)$ of the probability space then carry $X_1$ over to a new random variable $X_2$ on $\Omega_2$. Under $P_1 \otimes P_2$ the random variables $X_1$ and $X_2$ are independent and identically distributed.

Remark. You might prefer to start with an $X$ defined on $(\Omega, \mathcal{F}, P)$ then define random variables on $\Omega \times \Omega$ by $X_1(\omega_1, \omega_2) = X(\omega_1)$ and $X_2(\omega_1, \omega_2) = X(\omega_2)$. Under $P \otimes P$ the random variables $X_1$ and $X_2$ are independent and each has the same distribution as $X$ (under $P$). I once referred to independent copies during a seminar in a Math department. One member of the audience protested vehemently that a copy of $X$ is the same as $X$ and hence can only be independent of $X$ in trivial cases.

Note that $P_2X_2 = P_1X_1 = \mu$. For each convex function $f$,

$$P_1 f(X_1 - \mu) = P_1^{\omega_1} f\left[X_1(\omega_1) - P_2^{\omega_2}X_2(\omega_2)\right] = P_1^{\omega_1} f\left[P_2^{\omega_2}\left(X_1(\omega_1) - X_2(\omega_2)\right)\right] \le P_1^{\omega_1}P_2^{\omega_2} f\left[X_1(\omega_1) - X_2(\omega_2)\right],$$

the last line coming from Jensen’s inequality for $P_2$.

<16> Theorem. For each random variable $X$ with expected value $\mu$, the following assertions are equivalent.

(i) $X$ has a subgaussian distribution (with $\tau(X) < \infty$)

(ii) $L(X) := \sup_{r \ge 1} \|X\|_r/\sqrt{r} < \infty$

(iii) $\|X\|_{\Psi_2} < \infty$

(iv) there exists a positive constant $\beta$ for which $P\{|X - \mu| \ge x\} \le 2\exp\left(-x^2/(2\beta^2)\right)$ for all $x \ge 0$

More precisely,

(a) $c_0\|X\|_{\Psi_2} \le L(X) \le \|X\|_{\Psi_2}$ with $c_0 = 1/\sqrt{4e}$;

(b) if $\beta(X)$ denotes the smallest $\beta$ for which inequality (iv) holds, $c_1\|X - \mu\|_{\Psi_2} \le \beta(X) \le \tau(X) \le 4L(X - \mu)$ with $c_1 = 1/\sqrt{6}$.

Remark. Finiteness of $L(X)$ is also equivalent to finiteness of $L(X - \mu)$ because $|\mu| \le \|X\|_1 \le L(X)$ and $|L(X) - L(X - \mu)| \le |\mu|$. Consequently $L(X - \mu) \le 2L(X)$, but there can be no universal inequality in the other direction: $L(X - \mu)$ is unchanged by addition of a constant to $X$ but $L(X)$ can be made arbitrarily large.

The precise values of the constants c i are not important. Most likely they could be improved, but I can see no good reason for trying.

Proof Implicit in the sequences of inequalities is the assertion that finiteness of one quantity implies finiteness of another quantity.

Proof of (a)

Problem [8] shows that $k^k \ge k! \ge (k/e)^k$ for all $k \in \mathbb{N}$.

Suppose $P\Psi_2(|X|/D) \le 1$. Then $\sum_{k \in \mathbb{N}} P(X/D)^{2k}/k! \le 1$ so that $\|X\|_{2k} \le D(k!)^{1/2k} \le D\sqrt{k}$ for each $k \in \mathbb{N}$ and $L(X) \le \sqrt{2}\,\sup_{k \in \mathbb{N}} \|X\|_{2k}/\sqrt{2k} \le D$ by Problem [9].

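Conversely, if $\infty > L \ge \|X\|_r/\sqrt{r}$ for all $r \ge 1$ then

$$P\Psi_2\left(\frac{|X|}{D}\right) = \sum_{k \in \mathbb{N}} \frac{\|X/D\|_{2k}^{2k}}{k!} \le \sum_{k \in \mathbb{N}} \left(L\sqrt{2k}/D\right)^{2k}(e/k)^k,$$

which equals 1 if $D = \sqrt{4e}\,L$.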

Proof of (b)

None of the quantities appearing in the inequalities depend on µ, so we may as well assume µ = 0.

Inequality < 10 > gives β(X) ≤ τ (X).


Simplify notation by writing $\beta$ for $\beta(X)$ and $\tau$ for $\tau(X)$ and $L$ for $L(X)$, and define $\gamma := \|X\|_{\Psi_2}$. Also, assume that $X$ is not a constant, to avoid annoying trivial cases.

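To show that $\gamma \le \beta\sqrt{6}$:

$$P\Psi_2\left(\frac{|X - \mu|}{\beta\sqrt{6}}\right) = P\int_0^\infty e^x\{|X - \mu|^2 \ge 6\beta^2x\}\,dx = \int_0^\infty e^x P\{|X - \mu| \ge \beta\sqrt{6x}\}\,dx \le 2\int_0^\infty \exp(x - 3x)\,dx = 1.$$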

To show that $\tau \le 4L$: Symmetrize. Construct a random variable $X'$ with the same distribution as $X$ but independent of $X$. By the triangle inequality, $\|X - X'\|_r \le 2\|X\|_r \le 2L\sqrt{r}$ for all $r \ge 1$. By Jensen’s inequality (symmetry kills odd moments in the second line),

$$Pe^{\theta X} \le Pe^{\theta(X - X')} = 1 + \sum_{k \in \mathbb{N}} \frac{\theta^{2k}P(X - X')^{2k}}{(2k)!} \le 1 + \sum_{k \in \mathbb{N}} \frac{\theta^{2k}(2L\sqrt{2k})^{2k}}{k^k\,k!} = 1 + \sum_{k \in \mathbb{N}} \frac{(8\theta^2L^2)^k}{k!} = e^{8L^2\theta^2},$$

and so on.



2.5 Bennett’s inequality


The Hoeffding inequality depends on the somewhat crude bound

$$\mathrm{var}(W) \le (b - a)^2/4 \qquad\text{if } a \le W \le b.$$

Such an inequality can be too pessimistic if $W$ has only a small probability of taking values near the endpoints of the interval $[a, b]$. The inequalities can sometimes be improved by introducing second moments explicitly into the upper bound. Bennett’s one-sided tail bound provides a particularly elegant illustration of the principle.


<17> Theorem. Suppose $X_1, \dots, X_n$ are independent random variables for which $X_i \le b$ and $PX_i = \mu_i$ and $PX_i^2 \le v_i$ for each $i$, for nonnegative constants $b$ and $v_i$. Let $W$ equal $\sum_{i \le n} v_i$. Then

$$P\left\{\sum_{i \le n}(X_i - \mu_i) \ge x\right\} \le \exp\left(-\frac{x^2}{2W}\,\psi_{\mathrm{Benn}}\left(\frac{bx}{W}\right)\right) \qquad\text{for } x \ge 0,$$

where $\psi_{\mathrm{Benn}}$ denotes the function defined on $[-1, \infty)$ by

<18> $$\psi_{\mathrm{Benn}}(t) := \frac{(1 + t)\log(1 + t) - t}{t^2/2} \quad\text{for } t \ne 0, \qquad\text{and}\quad \psi_{\mathrm{Benn}}(0) = 1.$$

Remark. By Problem [14], $\psi_{\mathrm{Benn}}$ is positive, decreasing, and convex, with $\psi_{\mathrm{Benn}}(-1) = 2$ and $\psi_{\mathrm{Benn}}(t)$ decreasing like $(2/t)\log t$ as $t \to \infty$.

When $bx/W$ is close to zero the $\psi_{\mathrm{Benn}}$ contribution to the upper bound is close to 1 and the exponent is close to a subgaussian $-x^2/(2W)$. (If $b = 0$ then it is exactly subgaussian.) Further out in the tail, for larger $x$, the exponent decreases like $-x\log(bx/W)/b$, which is slightly faster than exponential.
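The function in <18> and the resulting tail bound are easy to evaluate. The following sketch is illustrative only: the names `psi_benn` and `bennett_bound` are introduced here, and the short series used near t = 0 is an added numerical safeguard against cancellation, not something taken from the text.

```python
import math

def psi_benn(t):
    """psi_Benn(t) = ((1+t) log(1+t) - t) / (t^2/2) on [-1, inf), with psi_Benn(0) = 1."""
    if t < -1.0:
        raise ValueError("psi_Benn is defined only for t >= -1")
    if t == -1.0:
        return 2.0                                  # (1+t) log(1+t) tends to 0 as t -> -1
    if abs(t) < 1e-4:
        return 1.0 - t / 3.0 + t * t / 6.0          # Taylor series, avoids cancellation
    return ((1.0 + t) * math.log1p(t) - t) / (0.5 * t * t)

def bennett_bound(x, b, W):
    """exp(-(x^2 / 2W) * psi_Benn(b x / W)), the upper bound in Theorem <17>."""
    return math.exp(-(x * x) / (2.0 * W) * psi_benn(b * x / W))

for t in (-1.0, -0.5, 0.0, 0.5, 2.0, 10.0, 50.0):
    print(t, psi_benn(t), 1.0 / (1.0 + t / 3.0))
print(bennett_bound(3.0, 1.0, 2.0), bennett_bound(3.0, 0.0, 2.0))
```

The printed values should decrease in t, start at 2 for t = −1, and dominate 1/(1 + t/3); with b = 0 the bound reduces to the purely subgaussian exp(−x²/(2W)).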

Proof Define $\Phi_1(x) = e^x - 1 - x$, the convex function $\Psi_1$ adjusted to have zero derivative at the origin, as in Problem [7]. From Problem [14], the positive function

<19> $$\Delta(x) = \frac{\Phi_1(x)}{x^2/2} \quad\text{for } x \ne 0, \qquad\text{and}\quad \Delta(0) = 1,$$

is continuous and increasing on the whole real line. For $\theta \ge 0$,

<20> $$\begin{aligned} Pe^{\theta X_i} &= 1 + \theta PX_i + P\Phi_1(\theta X_i) \\ &= 1 + \theta\mu_i + P\left(\tfrac12(\theta X_i)^2\Delta(\theta X_i)\right) \\ &\le 1 + \theta\mu_i + \tfrac12\theta^2P(X_i^2)\Delta(\theta b) \qquad\text{because } \Delta \text{ is increasing} \\ &\le \exp\left(\theta\mu_i + \tfrac12\theta^2v_i\Delta(\theta b)\right). \end{aligned}$$

That is,

<21> $$Pe^{\theta(X_i - \mu_i)} \le \exp\left(\tfrac12\theta^2v_i\Delta(\theta b)\right) = \exp\left(\frac{v_i}{b^2}\Phi_1(\theta b)\right) \qquad\text{for } \theta \ge 0.$$

Remark. In the last expression I am implicitly assuming that $b$ is nonzero. We can recover the bound when $b = 0$ by a passage to the limit as $b$ decreases to zero. Actually, the $b = 0$ case is quite noteworthy because the $\Delta$ contribution disappears from the upper bound for $P\exp(\theta X_i)$ and the $\psi_{\mathrm{Benn}}$ disappears from the final inequality.


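Abbreviate $\sum_{i \le n}(X_i - \mu_i)$ to $T$. By independence and <21>,

$$Pe^{\theta T} = \prod_{i \le n} Pe^{\theta(X_i - \mu_i)} \le \exp\left(\frac{W}{b^2}\Phi_1(\theta b)\right)$$

and, by <2>,

$$P\{T \ge x\} \le \inf_{\theta \ge 0}\exp\left(-x\theta + \frac{W}{b^2}\Phi_1(\theta b)\right) = \exp\left(-\frac{W}{b^2}\sup_{b\theta \ge 0}\left(\frac{bx}{W}\,b\theta - \Phi_1(\theta b)\right)\right) = \exp\left(-\frac{W}{b^2}\Phi_1^*(bx/W)\right),$$

where $\Phi_1^*$ denotes the conjugate of $\Phi_1$,

$$\Phi_1^*(y) := \sup_{t \in \mathbb{R}}\,(ty - \Phi_1(t)) = \begin{cases} (1 + y)\log(1 + y) - y & \text{if } y \ge -1 \\ +\infty & \text{otherwise.} \end{cases}$$

When $y \ge -1$ the supremum is achieved at $t = \log(1 + y)$. Thus $\tfrac12y^2\psi_{\mathrm{Benn}}(y) = \Phi_1^*(y)$ for $y \ge -1$ and $(W/b^2)\Phi_1^*(bx/W) = \left(x^2/(2W)\right)\psi_{\mathrm{Benn}}(bx/W)$, as asserted.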



Remark. The inequality still holds if 0 > b ≥ −x/W , although the proof changes slightly. What happens for smaller b?

Unlike the case of the two-sided Hoeffding inequality, there is no automatic lower tail analog for the Bennett inequality without an explicit lower bound for the summands. One particularly interesting case occurs when $X_i \ge 0$:

$$P\left\{\sum_{i \le n}(X_i - \mu_i) \le -x\right\} = P\left\{\sum_{i \le n}(\mu_i - X_i) \ge x\right\} \le \exp\left(-x^2/(2W)\right) \qquad\text{for all } x \ge 0,$$

a clean (one-sided) subgaussian tail bound.

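The Bennett inequality also works for the Poisson distribution. If $X$ has a Poisson($\lambda$) distribution then (Problem [15])

$$P\{X \ge \lambda + x\} \le \exp\left(-\frac{x^2}{2\lambda}\,\psi_{\mathrm{Benn}}\left(\frac{x}{\lambda}\right)\right) \qquad\text{for } x \ge 0$$

and, for $0 \le x \le \lambda$,

$$P\{X \le \lambda - x\} \le \exp\left(-\frac{x^2}{2\lambda}\,\psi_{\mathrm{Benn}}\left(-\frac{x}{\lambda}\right)\right) \le \exp\left(-\frac{x^2}{2\lambda}\right),$$

a perfect subgaussian lower tail. See Problem [16] if you have any doubts about the boundedness of Poisson random variables.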
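The Poisson bounds can be compared with exact tail probabilities. The sketch below uses the identity (x²/(2λ))ψ_Benn(y) = λ((1 + y)log(1 + y) − y) with y = ±x/λ, taken from the proof of the Bennett inequality, so that the exponent is evaluated through the conjugate function directly; the helper names and the choice λ = 4 are assumptions made for illustration.

```python
import math

def phi1_conjugate(y):
    """(1 + y) log(1 + y) - y for y >= -1, with the limiting value 1 at y = -1."""
    if y == -1.0:
        return 1.0
    return (1.0 + y) * math.log1p(y) - y

def poisson_pmf(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_lower_tail(lam, k):
    """Exact P{X <= k} for X ~ Poisson(lam)."""
    return sum(poisson_pmf(j, lam) for j in range(k + 1))

def poisson_upper_tail(lam, k):
    """Exact P{X >= k} for X ~ Poisson(lam), for an integer k >= 0."""
    return 1.0 - sum(poisson_pmf(j, lam) for j in range(k))

lam = 4.0
for x in range(1, 11):
    print("upper", x, poisson_upper_tail(lam, int(lam) + x),
          math.exp(-lam * phi1_conjugate(x / lam)))
for x in range(0, 5):
    print("lower", x, poisson_lower_tail(lam, int(lam) - x),
          math.exp(-lam * phi1_conjugate(-x / lam)), math.exp(-x * x / (2 * lam)))
```

Each exact tail should lie below the corresponding Bennett-type bound, and the lower-tail bound should in turn lie below the subgaussian exp(−x²/(2λ)).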

See Chapter 3 for an extension of the Bennett inequality to sums of martingale differences.

2.6 Bernstein’s inequality


As shown by Problem [14], $\psi_{\mathrm{Benn}}(t) \ge (1 + t/3)^{-1}$ for $t \ge -1$. Thus the Bennett inequality implies a weaker result: if $X_1, \dots, X_n$ are independent with $PX_i = \mu_i$ and $PX_i^2 \le v_i$ and $X_i \le b$ then

<22> $$P\left\{\sum_{i \le n}(X_i - \mu_i) \ge x\right\} \le \exp\left(-\frac{x^2/2}{W + bx/3}\right) \qquad\text{for } x \ge 0,$$

a form of Bernstein’s inequality. When $bx/W$ is close to zero the exponent is again close to $-x^2/(2W)$; when $bx/W$ is large, the Bernstein inequality loses the extra log term in the exponent. If the summands $X_i$ are bounded in absolute value (not just bounded above) we get an analogous two-sided inequality.

The assumptions $PX_i^2 \le v_i$ and $|X_i| \le b$ imply $P|X_i|^k \le PX_i^2\,b^{k-2} \le v_ib^{k-2}$ for $k \ge 2$.

A much weaker condition on the growth of $P|X_i|^k$ with $k$, without any assumption of boundedness, leads to a useful upper bound for the moment generating function and a more general version of the Bernstein inequality.

<23> Theorem. (Bernstein inequality) Suppose $X_1, \dots, X_n$ are independent random variables with $PX_i = \mu_i$ and

<24> $$P|X_i|^k \le \tfrac12v_iB^{k-2}k! \qquad\text{for } k \ge 2.$$

Define $W = \sum_{i \le n} v_i$. Then, for $x \ge 0$,

$$P\left\{\sum_{i \le n}(X_i - \mu_i) \ge x\right\} \le e^{-H_{\mathrm{Benn}}(x, B, W)} \le e^{-H_{\mathrm{Bern}}(x, B, W)}$$

where

$$H_{\mathrm{Bern}}(x, B, W) = \frac{x^2/W}{2(1 + xB/W)}, \qquad H_{\mathrm{Benn}}(x, B, W) = \frac{x^2/W}{1 + xB/W + \sqrt{1 + 2xB/W}}.$$


Remark. The $B$ here does not represent an upper bound for $X_i$. The traditional form of Bernstein’s inequality, with $H_{\mathrm{Bern}}$, corresponds to the bound <22> derived from the Bennett inequality under the assumption $X_i \le b$, with $B = b/3$.

Proof As before,

$$F := P\left\{\sum_i X_i \ge x + \sum_i \mu_i\right\} \le \inf_{\theta \ge 0} e^{-\theta(x + \sum_i \mu_i)} \prod_{i \le n} Pe^{\theta X_i}.$$

Use the moment inequality <24> to bound the moment generating function for $X_i$.

$Pe^{\theta X_i} = 1 + \theta\mu_i + P\Phi_1(\theta X_i)$ where $\Phi_1(t) := e^t - 1 - t$ and, for $\theta \ge 0$,

<25> $$P\Phi_1(\theta X_i) \le P\Phi_1(\theta|X_i|) = \sum_{k \ge 2} \frac{\theta^kP|X_i|^k}{k!} \le \tfrac12v_i\theta^2\sum_{k \ge 2}(\theta B)^{k-2} = \frac{\tfrac12v_i\theta^2}{1 - B\theta} \qquad\text{provided } 0 \le \theta < 1/B.$$

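Thus

$$Pe^{\theta X_i} \le 1 + \theta\mu_i + \frac{\tfrac12v_i\theta^2}{1 - B\theta} \le \exp\left(\theta\mu_i + \frac{\tfrac12v_i\theta^2}{1 - B\theta}\right) \qquad\text{for } 0 \le \theta B < 1$$

and

$$F \le \exp\left(-\sup_{0 \le B\theta < 1} g(\theta)\right) \qquad\text{where } g(\theta) := x\theta - \frac{\tfrac12W\theta^2}{1 - B\theta}.$$

The $H_{\mathrm{Bern}}$ bound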

As an approximate maximizer for $g$ choose $\theta_0 = x/(W + xB)$ so that

$$F \le e^{-g(\theta_0)} = \exp\left(-\frac{x^2/2}{W + xB}\right).$$

Bennett (1962, equation (7)) made this strange choice by defining $G = 1 - \theta B$ then minimizing $-x\theta + \tfrac12\theta^2W/G$ while ignoring the fact that $G$ depends on $\theta$. That gives $\theta = xG/W = (1 - \theta B)x/W$, a linear equation with solution $\theta_0$. Very confusing. As Bennett noted, a boundedness assumption $|X_i| \le b$ actually implies <24> with $B = b/3$. The exponential upper bound then agrees with <22>.

The $H_{\mathrm{Benn}}$ bound

Following Bennett (1962, equation (7a)), this time we really maximize $g(\theta)$. The calculation is cleaner if we rewrite $g$ with $u := xB/W$ and $s := 1 - \theta B$. With those substitutions,

$$g(\theta) = g_0(s) = \frac{W}{2B^2}\left(2u(1 - s) - (1 - s)^2/s\right) = \frac{W}{2B^2}\left(-s^{-1} - (1 + 2u)s + 2(1 + u)\right).$$

The maximum is achieved at $s_0 = (1 + 2u)^{-1/2}$, giving $F \le e^{-g_0(s_0)}$ where

<26> $$g_0(s_0) = \frac{W}{B^2}\left(1 + u - \sqrt{1 + 2u}\right) = \frac{W}{B^2}\left(\frac{u^2}{1 + u + \sqrt{1 + 2u}}\right).$$



Both inequalities in Theorem <23> have companion lower bounds, obtained by replacing $X_i$ by $-X_i$ in all the calculations. This works because the Bernstein condition <24> also holds with $X_i$ replaced by $-X_i$. The inequalities could all have been written as two-sided bounds, that is, as upper bounds for $P\{|\sum_i(X_i - \mu_i)| \ge x\}$.
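The two exponents in Theorem <23> correspond to two choices of θ in the function g(θ) = xθ − ½Wθ²/(1 − Bθ) from the proof: the approximate maximizer θ₀ = x/(W + xB) and the true maximizer. A crude grid search makes the comparison concrete. This is only a sketch; the grid size and the parameter values are arbitrary assumptions.

```python
import math

def g(theta, x, B, W):
    """Exponent x*theta - (W theta^2 / 2) / (1 - B theta), for 0 <= B*theta < 1."""
    return x * theta - 0.5 * W * theta * theta / (1.0 - B * theta)

def h_bern(x, B, W):
    return (x * x / W) / (2.0 * (1.0 + x * B / W))

def h_benn(x, B, W):
    u = x * B / W
    return (x * x / W) / (1.0 + u + math.sqrt(1.0 + 2.0 * u))

def grid_sup_g(x, B, W, n=200000):
    """Maximum of g over the grid theta = k/(n B), k = 0, ..., n-1."""
    return max(g(k / (n * B), x, B, W) for k in range(n))

x, B, W = 3.0, 0.5, 2.0
theta0 = x / (W + x * B)
print(g(theta0, x, B, W), h_bern(x, B, W))      # value at the approximate maximizer
print(grid_sup_g(x, B, W), h_benn(x, B, W))     # value at the (numerical) maximizer
```

The first pair should agree exactly up to rounding, the second pair up to grid error, and the second exponent should be at least as large as the first.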

If we are really only interested in a one-sided bound then it seems superfluous to control the size of $|X_i|$. For example, for the upper tail it should be enough to control $X_i^+$. Boucheron et al. (2013, page 37), henceforth BLM, derived such a bound (which they attributed to Emmanuel Rio) by means of the elementary inequality

$$e^x - 1 - x \le \tfrac12x^2 + \sum_{k \ge 3}(x^+)^k/k! \qquad\text{for all real } x.$$

(For $x \ge 0$ both sides are equal. For $x < 0$ it reflects the fact that $g(x) = e^x - (1 + x + \tfrac12x^2)$ has a positive derivative with $g(0) = 0$.) They replaced the assumption <24> by

$$PX_i^2 = v_i \qquad\text{and}\qquad P|X_i|^k \le \tfrac12v_iB^{k-2}k! \quad\text{for } k \ge 3.$$

For $1/B > \theta \ge 0$, we then get

$$Pe^{\theta X_i} \le 1 + \theta\mu_i + \tfrac12\theta^2v_i + \tfrac12v_i\theta^2\sum_{k \ge 3}(\theta B)^{k-2} \le \exp\left(\theta\mu_i + \frac{\tfrac12v_i\theta^2}{1 - \theta B}\right),$$

which is exactly the same inequality as in the proof of Theorem <23>. As with the $H_{\mathrm{Benn}}$ calculation it follows that

$$P\left\{\sum_i X_i \ge x + \sum_i \mu_i\right\} \le \exp\left(-\frac{W}{B^2}\left(1 + u - \sqrt{1 + 2u}\right)\right)$$

where $W = \sum_i v_i$ and $u = xB/W$. BLM observed that the upper bound can be inverted by solving a quadratic in $u$,

$$1 + 2u = \left(1 + u - tB^2/W\right)^2,$$

which has a positive solution $u = tB^2/W + \sqrt{2tB^2/W}$, corresponding to $x = Wu/B = tB + \sqrt{2tW}$. The one-sided Bernstein bound then takes an elegant form,

<27> $$P\left\{\sum_i(X_i - \mu_i) \ge tB + \sqrt{2tW}\right\} \le e^{-t} \qquad\text{for } t \ge 0.$$
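The inversion can be verified by substituting x = tB + √(2tW) back into the exponent (W/B²)(1 + u − √(1 + 2u)) with u = xB/W. The snippet below is a sketch of that check; the function names and parameter values are illustrative assumptions.

```python
import math

def bernstein_exponent(x, B, W):
    """(W / B^2) * (1 + u - sqrt(1 + 2u)) with u = x B / W."""
    u = x * B / W
    return (W / (B * B)) * (1.0 + u - math.sqrt(1.0 + 2.0 * u))

def bernstein_quantile(t, B, W):
    """The deviation x = t B + sqrt(2 t W) at which the exponent equals t."""
    return t * B + math.sqrt(2.0 * t * W)

B, W = 0.5, 3.0
for t in (0.1, 1.0, 2.0, 10.0):
    x = bernstein_quantile(t, B, W)
    print(t, x, bernstein_exponent(x, B, W))
```

The last column should reproduce t, up to floating-point rounding.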

By splitting according to whether $tB \ge \sqrt{2tW}$ or not, and by introducing an extraneous factor of 2, we can split the elegant form into two more suggestive inequalities,

$$P\left\{\sum_i(X_i - \mu_i) \ge 2tB\right\} \le e^{-t} \qquad\text{for } t \ge 2W/B^2,$$

$$P\left\{\sum_i(X_i - \mu_i) \ge 2t\sqrt{W}\right\} \le e^{-t^2/2} \qquad\text{for } t^2/2 \le 2W/B^2.$$

These inequalities make clear that the subexponential part (large t) of the tail is controlled by B and the subgaussian part (moderate t) is controlled by W .

2.7 Subexponential distributions?


To me, subexponential behavior refers to functions that are bounded above by a constant multiple of $\exp(-c|x|)$, for some positive constant $c$. For example, the Bernstein inequality shows that, far enough from the mean, the tail probabilities for a large class of sums of independent random variables are subexponential. Closer to the mean the tails are subgaussian, which one might think is better than subexponential. As another example, if a random variable $X$ takes values in $[-1, +1]$ we have $P\{|X| \ge x\} = 0 \le \exp(-10^{10}x)$ for all $x \ge 1$, but I would hesitate to call that behavior subexponential.

What then does it mean for a whole distribution to be subexponential? And how is subexponentiality related to the Bernstein moment condition,

<28> $$\frac{P|X|^k}{B^kk!} \le \tfrac12v/B^2 \qquad\text{for } k \ge 2?$$

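As you already know, condition <28> implies

<29> $$Pe^{\theta X} - 1 - \theta PX = P\Phi_1(\theta X) \le P\Phi_1(\theta|X|) \le \frac{\tfrac12v\theta^2}{1 - \theta B} \qquad\text{for } 0 \le \theta B < 1,$$

which implies

$$Pe^{\theta X} \le \exp\left(\theta\mu + \frac{\tfrac12v\theta^2}{1 - \theta B}\right) \qquad\text{for } 0 \le \theta B < 1 \text{ and } \mu = PX.$$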

Van der Vaart and Wellner (1996, page 103) pointed out that condition <28> is implied by

$$\tfrac12v/B^2 \ge P\Phi_1(|X|/B) = \sum_{k \ge 2} \frac{P|X|^k}{B^kk!} \qquad\text{where } \Phi_1(x) := e^x - 1 - x.$$

And, if <28> holds, then

$$P\Phi_1(|X|/2B) = \sum_{k \ge 2} \frac{P|X|^k}{2^kB^kk!} \le \tfrac12(2v)/(2B)^2,$$

which is just <29> with $\theta = 1/(2B)$. It would seem that the moment condition is somehow related to $\|X\|_{\Phi_1}$. Indeed, Vershynin (2010, Section 5.2.4) defined a random variable $X$ to be subexponential if $\sup_{k \in \mathbb{N}} \|X\|_k/k < \infty$. By Problems [7] and [10], this condition is equivalent to finiteness of the Orlicz norm $\|X\|_{\Phi_1}$.

Boucheron et al. (2013, Section 2.4) defined a centered random variable $X$ to be sub-gamma on the right tail with variance factor $v$ and scale factor $B$ if

<30> $$Pe^{\theta X} \le \exp\left(\frac{v\theta^2/2}{1 - \theta B}\right) \qquad\text{for } 0 \le \theta B < 1.$$

They pointed out that if $X$ has a gamma($\alpha, \beta$) distribution (with mean $\mu = \alpha\beta$ and variance $v = \alpha\beta^2$) then

$$Pe^{\theta(X - \mu)} = e^{-\theta\mu}(1 - \theta\beta)^{-\alpha} = \exp\left(-\theta\alpha\beta - \alpha\log(1 - \theta\beta)\right) \le \exp\left(\frac{\alpha\beta^2\theta^2/2}{1 - \theta\beta}\right) \qquad\text{for } 0 \le \theta\beta < 1.$$

Hence the name. The special case where $\alpha = 1$ corresponds to the exponential.


Let me try to tie these definitions and facts together.

First note that the upper bound in <29> equals 1 if $\theta = 2/\left(B + \sqrt{B^2 + 2v}\right)$. Thus the Bernstein assumption <28> implies

$$\gamma := \|X\|_{\Phi_1} \le \left(B + \sqrt{B^2 + 2v}\right)/2.$$

If, as is sometimes assumed, $v = PX^2$ then <28> also implies

$$v^{3/2} \le P|X|^3 \le \tfrac12v\,3!\,B.$$

That is, $v \le (3B)^2$ and $B < \gamma \le 2.7B$. It appears that the $B$, or $\|X\|_{\Phi_1}$, controls the subexponential part of the tail probabilities, at least as far as the Bernstein inequality is concerned. Inequality <27> conveyed the same message. Compare with the subexponential bound

$$P\{|X| \ge x\} \le \min\left(1, 1/\Phi_1(x/\gamma)\right) \qquad\text{for } \gamma = \|X\|_{\Phi_1}.$$

The inequality $P\Phi_1(|X|/\gamma) \le 1$ also implies

$$\frac{P|X|^k}{\gamma^kk!} \le 1 \le \tfrac12(2\gamma^2)/\gamma^2 \qquad\text{for } k \ge 2.$$

That is, <28> holds with $v = 2\gamma^2$ and $B = \gamma$. Theorem <23> then gives

\[
P\{X - \mu \ge x\} \le \exp\left(-\frac{x^2}{4\gamma^2 + 2x\gamma}\right) \le \exp\bigl(-x^2/(8\gamma^2)\bigr)
\qquad\text{for } 0 \le x \le 2\gamma.
\]

Is that not a subgaussian bound? Unfortunately, the bound, subgaussian or not, has little value for such a small range near zero. The upper bound always exceeds $1/\sqrt{e}$.

The useful subgaussian part of the bound is contributed by the v in < 28 >.

The effect is clearer for a sum of independent random variables $X_1, \dots, X_n$, each distributed like $X$:

\[
P\exp\Bigl(\theta \sum_{i\le n} X_i\Bigr) \le \exp\left(\frac{nv\theta^2/2}{1-\theta B}\right)
\qquad\text{for } 0 \le \theta B < 1,
\]

which leads to tail bounds like

\[
P\Bigl\{\sum_i X_i \ge x\Bigr\} \le \exp\left(-\frac{x^2/2}{nv + xB}\right),
\]

with useful subgaussian behavior when $0 \le x \le nv/B$.
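To make the phrase "useful subgaussian behavior" concrete: when $0 \le x \le nv/B$ the denominator $nv + xB$ is at most $2nv$, so the bound is no larger than $\exp(-x^2/(4nv))$. The sketch below checks this numerically; the function names and parameter values are illustrative assumptions, not taken from the text.

```python
import math

def bernstein_sum_tail(x, n, v, B):
    """The tail bound exp(-(x^2 / 2) / (n v + x B)) for a sum of n summands."""
    if x < 0:
        raise ValueError("x must be nonnegative")
    return math.exp(-0.5 * x ** 2 / (n * v + x * B))

def subgaussian_part(x, n, v):
    """exp(-x^2 / (4 n v)), which dominates the bound when 0 <= x <= n v / B."""
    return math.exp(-x ** 2 / (4 * n * v))

n, v, B = 100, 1.0, 2.0              # illustrative values only
for x in [0.0, 10.0, 25.0, 50.0]:    # all within 0 <= x <= n v / B = 50
    assert bernstein_sum_tail(x, n, v, B) <= subgaussian_part(x, n, v) + 1e-15
    print(x, bernstein_sum_tail(x, n, v, B), subgaussian_part(x, n, v))
```

At the endpoint $x = nv/B$ the two expressions coincide; beyond it the term $xB$ dominates the denominator and the bound decays only exponentially in $x$.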

In my opinion, the sub-gamma property <30> is better thought of as a restricted subgaussian assumption that automatically implies subexponential behavior far out in the tails. For each $\alpha$ in $(0, 1)$ it implies

\[
Pe^{\theta X} \le \exp\left(\frac{v\theta^2/2}{1-\alpha}\right)
\qquad\text{for } 0 \le \theta B \le \alpha < 1.
\]

Remark. Uspensky (1937, page 204) used this approach for the Bernstein inequality, before making the choice α = xB/(W + xB) to recover the traditional HBern form of the upper bound.

More generally, if we have a bound

<31>
\[
Pe^{\theta(X-\mu)} \le e^{v\theta^2/2} \qquad\text{for } 0 \le \theta \le \rho,
\]

with $v$ and $\rho$ positive constants and $\mu = PX$, then for $x \ge 0$,

<32>
\[
P\{X \ge \mu + x\} \le \min_{0\le\theta\le\rho} \exp\bigl(-x\theta + \tfrac12 v\theta^2\bigr)
= \begin{cases}
\exp\bigl(-\tfrac12 x^2/v\bigr) & \text{for } 0 \le x \le \rho v \\
\exp\bigl(-\rho x + \tfrac12 v\rho^2\bigr) < e^{-x\rho/2} & \text{for } x > \rho v .
\end{cases}
\]

The minimum is achieved by $\theta = \min(\rho, x/v)$.
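The two regimes in <32> are easy to explore numerically. The following sketch (function names are illustrative, not from the text) evaluates the bound at the minimizing $\theta = \min(\rho, x/v)$ and compares it with a brute-force minimum over a grid on $[0, \rho]$.

```python
import math

def restricted_subgaussian_tail(x, v, rho):
    """Upper bound from <32> for P{X >= mu + x}, given <31> on 0 <= theta <= rho."""
    if x < 0 or v <= 0 or rho <= 0:
        raise ValueError("need x >= 0 and v, rho > 0")
    theta = min(rho, x / v)          # the minimizing theta
    return math.exp(-x * theta + 0.5 * v * theta ** 2)

def grid_minimum(x, v, rho, steps=10_000):
    """Brute-force minimum of exp(-x theta + v theta^2 / 2) over a grid on [0, rho]."""
    return min(
        math.exp(-x * th + 0.5 * v * th ** 2)
        for th in (rho * k / steps for k in range(steps + 1))
    )

v, rho = 2.0, 1.5                    # arbitrary illustrative constants; rho * v = 3
for x in (0.5, 2.0, 5.0):            # two subgaussian cases, one subexponential case
    print(x, restricted_subgaussian_tail(x, v, rho), grid_minimum(x, v, rho))
```

For $x \le \rho v$ the unconstrained minimizer $x/v$ lies inside $[0, \rho]$ and the bound is gaussian in shape; for larger $x$ the constraint binds at $\theta = \rho$ and the bound decays like $e^{-x\rho/2}$.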

The proof in the next Section of a generalized version of the Hanson-Wright inequality, showing that a quadratic $\sum_{i,j} X_i A_{i,j} X_j$ in independent subgaussians has a subgaussian/subexponential tail, illustrates this approach.

2.8 The Hanson-Wright inequality for subgaussians


The following is based on an elegant argument by Rudelson and Vershynin (2013) = R&V. To shorten the proof I assume that the $A$ matrix has zeros down its diagonal, so that $P\sum_{i,j} X_i A_{i,j} X_j = 0$ and I can skip the part of the argument (see Problem [17]) that deals with $\sum_i A_{i,i}(X_i^2 - PX_i^2)$.

<33> Theorem. Let $X_1, \dots, X_n$ be independent, centered subgaussian random variables with $\tau \ge \max_i \tau(X_i)$. Let $A$ be an $n \times n$ matrix of real numbers with $A_{ii} = 0$ for each $i$. Then

\[
P\{X'AX \ge t\} \le \exp\left(-\min\left(\frac{t^2}{64\tau^4\|A\|_{HS}^2},\ \frac{t}{8\sqrt{2}\,\tau^2\|A\|_2}\right)\right)
\qquad\text{for } t \ge 0,
\]

where $\|A\|_{HS} := \sqrt{\operatorname{trace}(A'A)} = \sqrt{\sum_{i,j} A_{i,j}^2}$ and $\|A\|_2 := \sup_{\|u\|_2 \le 1} \|Au\|_2$.
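The bound in the Theorem is straightforward to evaluate once the two norms are known. The sketch below is illustrative only: the function names are hypothetical, the first term in the minimum is read as $t^2/(64\tau^4\|A\|_{HS}^2)$ with the Hilbert-Schmidt norm squared (an assumption made here so that the bound is unchanged when $A$ and $t$ are rescaled together), and the operator norm is supplied by the caller rather than computed.

```python
import math

def hs_norm(A):
    """Hilbert-Schmidt (Frobenius) norm: square root of the sum of squared entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

def hanson_wright_bound(t, tau, hs, op):
    """exp(-min(t^2 / (64 tau^4 hs^2), t / (8 sqrt(2) tau^2 op))) for t >= 0.

    hs is the Hilbert-Schmidt norm and op the operator norm of A (both assumed given).
    """
    if t < 0 or tau <= 0 or hs <= 0 or op <= 0:
        raise ValueError("need t >= 0 and tau, hs, op > 0")
    gaussian_part = t ** 2 / (64 * tau ** 4 * hs ** 2)
    exponential_part = t / (8 * math.sqrt(2) * tau ** 2 * op)
    return math.exp(-min(gaussian_part, exponential_part))

# A 2 x 2 example with zero diagonal: operator norm 1, Hilbert-Schmidt norm sqrt(2).
A = [[0.0, 1.0], [1.0, 0.0]]
for t in (1.0, 4.0, 16.0):
    print(t, hanson_wright_bound(t, 1.0, hs_norm(A), 1.0))
```

With these constants the bound is close to 1 unless $t$ is a sizeable multiple of $\tau^2\|A\|_{HS}$, which is consistent with the Remark that the constants 64 and $8\sqrt{2}$ are not important.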


Remark. There are no assumptions of symmetry or positive definiteness on $A$. The proof also works with $A$ replaced by $-A$, leading to a similar upper bound for $P\{|X'AX| \ge t\}$. The Hilbert-Schmidt norm $\|A\|_{HS}$ is also called the Frobenius norm. The constants 64 and $8\sqrt{2}$ are not important.

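Proof. Without loss of generality, suppose $\tau = 1$, so that $Pe^{\theta X_i} \le e^{\theta^2/2}$ for all $i$ and all real $\theta$ and $P\exp(\alpha'X) \le \exp(\|\alpha\|_2^2/2)$ for each $\alpha \in \mathbb{R}^n$.

As explained at the end of Section 2.7, it suffices to show

\[
Pe^{\theta X'AX} \le \exp\bigl(\tfrac12 v\theta^2\bigr) \qquad\text{for } 0 \le \theta \le \rho,
\]

where $v = 32\|A\|_{HS}^2$ and $\rho = \bigl(\sqrt{32}\,\|A\|_2\bigr)^{-1}$.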

To avoid conditioning arguments, assume $P = \otimes_{i\in[n]} P_i$ on $\mathcal{B}(\mathbb{R}^{[n]})$, where $[n] := \{1, 2, \dots, n\}$, and each $X_i$ is a coordinate map, that is, $X_i(x) = x_i$ and $X(x) = x$. Write $G$ for the $N(0, I_n)$ distribution, another probability measure on $\mathcal{B}(\mathbb{R}^{[n]})$. Of course $G = \otimes_{i\in[n]} G_i$ with each $G_i$ equal to $N(0, 1)$.

For each subset $J$ of $[n]$ write $P_J$ for $\otimes_{i\in J} P_i$ and $G_J$ for $\otimes_{i\in J} G_i$ and, for generic vectors $w = (w_i : i \in [n])$ in $\mathbb{R}^{[n]}$, write $w_J$ for $(w_i : i \in J) \in \mathbb{R}^J$.

The proof works by bounding $Pe^{\theta X'AX}$ by the $G$ expected value of an analogous quadratic, using a decoupling argument that R&V attributed to Bourgain (1998).

For each $\delta = (\delta_i : i \in [n]) \in \{0, 1\}^{[n]}$ define $A_\delta$ to be the $n \times n$ matrix with $(i, j)$th element $\delta_i(1 - \delta_j)A_{i,j}$. Write $D_\delta$ for $\{i \in [n] : \delta_i = 1\}$ and $N_\delta$ for $[n] \setminus D_\delta$. Then $X'A_\delta X = X_{D_\delta}' E_\delta X_{N_\delta}$ where $E_\delta := A[D_\delta, N_\delta]$. If the rows and columns of $A$ were permuted so that $i < i'$ for all $i \in D_\delta$ and $i' \in N_\delta$ then $A_\delta$ would look like

\[
A_\delta = \begin{pmatrix} 0 & E_\delta \\ 0 & 0 \end{pmatrix}.
\]

Now let $Q$ denote the uniform distribution on $\{0, 1\}^{[n]}$. Under $Q$ the $\delta_i$'s are independent Ber(1/2) random variables, so that $QA_\delta = \tfrac14 A$. For each $\theta$, Jensen's inequality and Fubini give

\[
P\exp\bigl(\theta X'AX\bigr) = P^x \exp\bigl(4\theta\, Q^\delta X'A_\delta X\bigr) \le Q^\delta P^x \exp\bigl(4\theta X'A_\delta X\bigr).
\]
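The identity $QA_\delta = \tfrac14 A$ rests on the zero diagonal of $A$: for $i \ne j$ the product $\delta_i(1-\delta_j)$ has expected value $1/4$, whereas $\delta_i(1-\delta_i)$ is always zero. For small $n$ it can be confirmed by brute-force enumeration, as in this sketch (the function names and the matrix are illustrative, not from the text).

```python
from itertools import product

def decoupled(A, delta):
    """A_delta: the (i, j) entry is delta_i * (1 - delta_j) * A[i][j]."""
    n = len(A)
    return [[delta[i] * (1 - delta[j]) * A[i][j] for j in range(n)] for i in range(n)]

def average_over_delta(A):
    """Average of A_delta over all 2^n vectors delta, that is, under the uniform Q."""
    n = len(A)
    total = [[0.0] * n for _ in range(n)]
    for delta in product((0, 1), repeat=n):
        A_delta = decoupled(A, delta)
        for i in range(n):
            for j in range(n):
                total[i][j] += A_delta[i][j]
    return [[entry / 2 ** n for entry in row] for row in total]

# An arbitrary 3 x 3 matrix with zeros down the diagonal.
A = [[0.0, 2.0, -1.0],
     [4.0, 0.0, 3.0],
     [0.5, -2.0, 0.0]]
print(average_over_delta(A))         # each entry is the matching entry of A divided by 4
```

The enumeration costs $2^n$ matrix evaluations, so it is only a check for small $n$; the proof needs nothing beyond the expectation of each entry.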

Write $Q_\delta$ for $X'A_\delta X$. It therefore suffices to show that

\[
P\exp(4\theta Q_\delta) \le \exp\bigl(\tfrac12 v\theta^2\bigr) \qquad\text{for each } \delta \text{ and all } 0 \le \theta \le \rho.
\]

From this point onwards the $\delta$ is held fixed. To simplify notation I drop most of the subscript $\delta$'s, writing $X_D$ instead of $X_{D_\delta}$, and so on. Of course $A_\delta$ cannot drop its subscript.


Under $P_D$, with $X_N$ held fixed, the quadratic $Q_\delta := X'A_\delta X = X_D'EX_N$ is a linear combination of the independent subgaussians $X_D$. Thus

\[
P_D e^{4\theta Q_\delta} \le \exp\bigl(\tfrac12 (4\theta)^2 \|EX_N\|_2^2\bigr) = G_D \exp(4\theta Q_\delta).
\]

Similarly, $P_N \exp(4\theta Q_\delta) \le G_N \exp(4\theta Q_\delta)$. Thus

\begin{align*}
P\exp(4\theta Q_\delta) &= P_D P_N \exp(4\theta Q_\delta) \\
&\le P_D G_N \exp(4\theta Q_\delta) = G_N P_D \exp(4\theta Q_\delta) \\
&\le G_N G_D \exp(4\theta Q_\delta) = G_N e^{8\theta^2 X_N'E'EX_N}.
\end{align*}

The problem is now reduced to its gaussian analog, for which rotational symmetry of $G_N$ allows further simplification by means of a diagonalization of a positive semidefinite matrix, $E'E = L'\Lambda L$, where $L$ is orthogonal and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$, with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$ the eigenvalues of $E'E$.

<34>
\begin{align*}
G_N e^{8\theta^2 X_N'E'EX_N} &= G_N \exp\bigl(8\theta^2 X_N'\Lambda X_N\bigr) = \prod_{i\in N} G_i \exp(8\theta^2\lambda_i x_i^2) \\
&\le \exp\Bigl(\sum_{i\in N} 16\theta^2\lambda_i\Bigr) \qquad\text{provided } 8\theta^2 \max_{i\in N} \lambda_i \le 1/4.
\end{align*}

The last inequality uses the fact that, if $Z \sim N(0, 1)$,

\[
Pe^{sZ^2} = (1 - 2s)^{-1/2} = \exp\bigl(-\tfrac12 \log(1 - 2s)\bigr) \le \exp(2s) \qquad\text{if } 0 \le s \le \tfrac14.
\]

Now we need some matrix facts to bring everything back to an expression involving $A$. For notational simplicity again assume that the rows $D$ all precede the rows $N$, so that

\[
A = \begin{pmatrix} B_1 & E \\ B_2 & B_3 \end{pmatrix}
\quad\text{and}\quad
A'A = \begin{pmatrix} B_1'B_1 + B_2'B_2 & B_1'E + B_2'B_3 \\ E'B_1 + B_3'B_2 & E'E + B_3'B_3 \end{pmatrix}.
\]

It follows that

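\[
\|A\|_{HS}^2 = \operatorname{trace}(A'A) \ge \operatorname{trace}(E'E) = \sum_{i\in N} \lambda_i
\]

and

\[
\|A\|_2^2 = \sup_{\|v\|_2 \le 1} \|Av\|_2^2 \ge \sup_{\|u\|_2 \le 1} \|Eu\|_2^2 = \lambda_1 .
\]

Combine the various inequalities to conclude that

\[
P\exp\bigl(\theta X'AX\bigr) \le \exp\bigl(16\theta^2 \|A\|_{HS}^2\bigr) \qquad\text{provided } 0 \le \theta \le \bigl(\sqrt{32}\,\|A\|_2\bigr)^{-1}.
\]

An appeal to inequality <32> completes the proof.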
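The gaussian fact behind <34>, namely $(1 - 2s)^{-1/2} \le e^{2s}$ for $0 \le s \le 1/4$, can be checked on a grid. This is a numerical sanity check only, with an illustrative function name; it does not replace the calculus argument.

```python
import math

def gaussian_square_mgf(s):
    """P exp(s Z^2) = (1 - 2s)^(-1/2) for Z ~ N(0, 1), valid for s < 1/2."""
    if s >= 0.5:
        raise ValueError("the moment generating function is infinite for s >= 1/2")
    return (1 - 2 * s) ** -0.5

# Ratio of the exact value to the bound exp(2s) on a grid over [0, 1/4].
grid = [0.25 * k / 1000 for k in range(1001)]
worst = max(gaussian_square_mgf(s) / math.exp(2 * s) for s in grid)
print(worst)                         # never exceeds 1, up to rounding
```

The ratio equals 1 at $s = 0$ and decreases on $(0, 1/4]$, because the derivative of $-\tfrac12\log(1-2s) - 2s$ is $1/(1-2s) - 2$, which is negative there.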




2.9 Problems


[1] Define $a = \sup\{\theta < 0 : M(\theta) = +\infty\}$ and $b = \inf\{\theta > 0 : M(\theta) = +\infty\}$, where $M(\theta) = Pe^{\theta X}$. For simplicity, suppose both $a$ and $b$ are finite and $a < b$.

(i) Show that the restriction of $M$ to the interval $[a, b]$ is continuous. (For example, if $M(b) = +\infty$ you should show that $M(\theta) \to +\infty$ as $\theta$ increases to $b$.)

(ii) Suppose $\theta \in (a, b)$. Choose $\delta > 0$ so that $[\theta - 2\delta, \theta + 2\delta] \subset (a, b)$. For $|h| < \delta$, establish the domination bound

\[
\bigl|e^{(\theta+h)X} - e^{\theta X}\bigr|/|h| \le \int_0^1 |X| e^{(\theta+sh)X}\, ds
\le \max\bigl(e^{(\theta+2\delta)X}, e^{\theta X}, e^{(\theta-2\delta)X}\bigr)/\delta.
\]

Deduce via a Dominated Convergence argument that $M'(\theta) = P(Xe^{\theta X})$.

(iii) Argue similarly to justify another differentiation inside the expectation, leading to the expression for $M''(\theta)$.

[2] Suppose $h(x) = e^{g(x)}$, where $g$ is a twice differentiable real-valued function on $\mathbb{R}^+$.

(i) Show that $h$ is convex on any interval $J$ for which $(g'(x))^2 + g''(x) \ge 0$ for $x \in \operatorname{int}(J)$.

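[3] For a fixed $\alpha \ge 1$ define $\Psi_\alpha(x) = e^{x^\alpha} - 1$ for $x \ge 0$.

(i) Show that $\Psi_\alpha(x)$ is a Young function.

(ii) Show that

\[
\Psi_\alpha(x)\Psi_\alpha(y) \le \exp(x^\alpha + y^\alpha) - 1 \le
\begin{cases}
\Psi_\alpha(x + y) & \text{for all } x, y \ge 0 \\
\Psi_\alpha(2^{1/\alpha} xy) & \text{for } x \wedge y \ge 1 .
\end{cases}
\]

(iii) Deduce from the inequality $\Psi_\alpha(x)\Psi_\alpha(y) \le \Psi_\alpha(x + y)$ that $\Psi_\alpha^{-1}(uv) \le \Psi_\alpha^{-1}(u) + \Psi_\alpha^{-1}(v)$ for all $u, v \ge 0$.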

[4] Suppose $f$ is a convex, nonnegative, increasing function on an interval $[a, \infty)$, with $a > 0$. Suppose also that there exists an $x_0 \in [a, \infty)$ for which $f(x_0)/x_0 = \tau := \inf_{x\ge a} f(x)/x$. Show that $h(x) := \tau x\{0 \le x < x_0\} + f(x)\{x \ge x_0\}$ is a Young function.

[5] Suppose $f : \mathbb{R}^+ \to \mathbb{R}^+$ is strictly increasing with $f(0) = 0$ and

\[
f(u_1u_2) \le c_0\bigl(f(u_1) + f(u_2)\bigr) \qquad\text{for } \min(u_1, u_2) \ge v_0,
\]

for some constants $c_0$ and $v_0 > 0$. For each $v_1$ in $(0, v_0)$ prove existence of a larger constant $c_1$ for which

\[
f(u_1u_2) \le c_1\bigl(f(u_1) + f(u_2)\bigr) \qquad\text{for } \min(u_1, u_2) \ge v_1.
\]

Hint: For $\bar{u}_i = \max(u_i, v_0)$ show that $f(\bar{u}_i) \le f(u_i)f(v_0)/f(v_1)$.

[6] For a fixed $\alpha \in (0, 1)$ define $f_\alpha(x) = e^{x^\alpha}$ for $x \ge 0$.

[Figure for the case $\alpha = 1/2$, with the points $z_\alpha$ and $x_\alpha$ marked.]

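(i) Show that $x \mapsto f_\alpha(x)$ is convex for $x \ge z_\alpha := (\alpha^{-1} - 1)^{1/\alpha}$ and concave for $0 \le x \le z_\alpha$.

(ii) Define $x_\alpha = (1/\alpha)^{1/\alpha}$ and $\tau_\alpha = f_\alpha(x_\alpha)/x_\alpha = (\alpha e)^{1/\alpha}$. Show that

\[
\Psi_\alpha(x) := \tau_\alpha x\{x < x_\alpha\} + f_\alpha(x)\{x \ge x_\alpha\}
\]

is a Young function.

(iii) Show that there exists a constant $K_\alpha$ for which

\[
\Psi_\alpha(x) \le \exp(x^\alpha) \le K_\alpha + \Psi_\alpha(x) \qquad\text{for all } x \ge 0.
\]

(iv) Show that $\Psi_\alpha(x)\Psi_\alpha(y) \le \Psi_\alpha(2^{1/\alpha} xy)$ for $x \wedge y \ge x_\alpha$.

(v) Deduce that there exists a constant $C_\alpha$ for which

\[
C_\alpha\bigl(\Psi_\alpha^{-1}(u) + \Psi_\alpha^{-1}(v)\bigr) \ge \Psi_\alpha^{-1}(uv) \qquad\text{when } u \wedge v \ge 1.
\]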

[7] Suppose $\Psi$ is a Young function with right-hand derivative $\tau > 0$ at 0. Define $\Phi(x) = \Psi(x) - \tau x$ for $x \ge 0$. If $\Phi$ is not identically zero then it is a Young function. For that case show that

\[
\|f\|_\Psi \ge \|f\|_\Phi \ge \|f\|_\Psi/K \qquad\text{where } K = 1 + \tau\Phi^{-1}(1),
\]

for each measurable real function $f$.


[8] For each $k$ in $\mathbb{N}$ prove that $e^k \ge k^k/k! \ge 1$. Compare with the sharper estimate via Stirling's formula (Feller, 1968, Section 2.9),

\[
\frac{k^k e^{-k}}{k!} = \frac{e^{-r_k}}{\sqrt{2\pi k}} \qquad\text{where } \frac{1}{12k} > r_k > \frac{1}{12k + 1}.
\]
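Before attempting a proof, the two claims in Problem [8] can be checked for small $k$. The sketch below is a numerical sanity check only (the function name is illustrative); it relies on `math.lgamma` for $\log k!$.

```python
import math

def stirling_remainder(k):
    """r_k defined by k^k e^{-k} / k! = exp(-r_k) / sqrt(2 pi k)."""
    log_ratio = k * math.log(k) - k - math.lgamma(k + 1)   # log of k^k e^{-k} / k!
    return -(log_ratio + 0.5 * math.log(2 * math.pi * k))

for k in range(1, 21):
    ratio = k ** k / math.factorial(k)
    assert 1 <= ratio <= math.e ** k                        # e^k >= k^k / k! >= 1
    r = stirling_remainder(k)
    assert 1 / (12 * k + 1) < r < 1 / (12 * k)              # the Stirling remainder bounds
    print(k, ratio, r)
```

The crude inequality $e^k \ge k^k/k!$ loses a factor of order $\sqrt{k}$ relative to Stirling's formula, which is harmless for the uses made of it in this chapter.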

[9] Suppose $n_0 = 1 \le n_1 < n_2 < \dots$ is an increasing sequence of positive integers with $\sup_{k\in\mathbb{N}} n_k/n_{k-1} = B < \infty$ and $f : [1, \infty) \to \mathbb{R}^+$ is an increasing function with $\sup_{t\ge1} f(Bt)/f(t) = A < \infty$. For each random variable $X$ show that $\sup_{r\ge1} \|X\|_r/f(r) \le A \sup_{k\in\mathbb{N}} \|X\|_{n_k}/f(n_k)$.

[10] With $\Psi_\alpha$ as in Problem [2], prove the existence of positive constants $c_\alpha$ and $C_\alpha$ for which

\[
c_\alpha \|X\|_{\Psi_\alpha} \le \sup_{r\ge1} \frac{\|X\|_r}{r^{1/\alpha}} \le C_\alpha \|X\|_{\Psi_\alpha}
\]

for every random variable $X$.

[11] Theorem <16> suggests that $\|X\|_2$ might be comparable to $\beta(X)$ if $X$ has a subgaussian distribution. It is easy to deduce from inequality <10> that $\|X\|_2 \le 2\beta(X)$. Show that there is no companion inequality in the other direction by considering the bounded, symmetric random variable $X$ for which $P\{X = \pm M\} = \delta$ and $P\{X = 0\} = 1 - 2\delta$. If $2\delta = M^{-2}$ show that $PX^2 = 1$ but

\[
\log\bigl(1 + \theta^2/2! + (\theta^4M^2)/4!\bigr) \le \log Pe^{\theta X} \le \tau^2\theta^2/2 \qquad\text{for all real } \theta
\]

would force $\tau^2 \ge 2\log\bigl(3/2 + M^2/24\bigr)$.

[12] Show that a random variable $X$ that has either of the following two properties is subgaussian.

(i) For some positive constants $c_1$, $c_2$, and $c_3$,

\[
P\{|X| \ge x\} \le c_1\exp(-c_2x^2) \qquad\text{for all } |x| \ge c_3.
\]

(ii) For some positive constants $c_1$, $c_2$, and $c_3$,

\[
Pe^{\theta X} \le c_1\exp(c_2\theta^2) \qquad\text{for all } |\theta| \ge c_3.
\]

[13] Suppose $\{X_n\}$ is a sequence of random variables for which

\[
P\exp(\theta X_n) \le \exp(\tau^2\theta^2/2) \qquad\text{for all } n \text{ and all real } \theta,
\]

where $\tau$ is a fixed positive constant. If $X_n \to X$ almost surely, show that $Pe^{\theta X} \le \exp(\tau^2\theta^2/2)$ for all real $\theta$.

[14] Suppose $f$ is a sufficiently smooth real-valued function defined at least in a neighborhood of the origin of the real line. Define

\[
G(x) = \frac{f(x) - f(0) - xf'(0)}{x^2/2} \quad\text{if } x \ne 0, \qquad\text{and } G(0) = f''(0).
\]

Prove the following facts.

(i) $f(x) - f(0) - xf'(0) = x^2 \iint f''(xs)\{0 < s < t < 1\}\, ds\, dt$.

(ii) If $f$ is convex then $G$ is nonnegative.

(iii) If $f''$ is an increasing function then so is $G(x)$.

(iv) If $f''$ is a convex function then so is $G$. Moreover $G(x) \ge f''(x/3)$. Hint: Jensen and $2\iint s\{0 < s < t < 1\}\, ds\, dt = 1/3$.

(v) For the special case where $f(x) = (1 + x)\log(1 + x) - x$ show that $G = \psi_{Benn}$ from <18> and $G(x) \ge (1 + x/3)^{-1}$.

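[15] Recall the notation $F(x) = e^x - 1 - x$ for $x \in \mathbb{R}$ and

\[
\psi_{Benn}(y) = F^*(y)/(y^2/2) \qquad\text{where } F^*(y) = (1 + y)\log(1 + y) - y \text{ for } y \ge -1.
\]

Suppose $X$ has a Poisson($\lambda$) distribution.

(i) Show that $L(\theta) = \log Pe^{\theta X} = \lambda\theta + \lambda F(\theta)$.

(ii) For $x \ge 0$, deduce that

\[
P\{X \ge \lambda + x\} \le \exp\bigl(-\lambda F^*(x/\lambda)\bigr) = \exp\left(-\frac{x^2}{2\lambda}\,\psi_{Benn}(x/\lambda)\right).
\]

(iii) For $0 \le x \le \lambda$ show that

\[
P\{X \le \lambda - x\} \le \inf_{\theta\ge0} \exp\bigl(\lambda F(-\theta) - \theta x\bigr)
= \exp\left(-\frac{x^2}{2\lambda}\,\psi_{Benn}(-x/\lambda)\right)
\le \exp\left(-\frac{x^2}{2\lambda}\right).
\]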

[16] To what extent can the bounds from Problem [15] be recovered as limiting cases of the bounds derived via the Bennett inequality applied to the Bin($n$, $\lambda/n$) distribution?
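For Problems [15] and [16] it can help to see how the upper-tail bound $\exp(-\lambda h(x/\lambda))$, with $h(y) = (1 + y)\log(1 + y) - y$, compares with exact Poisson tail probabilities. The following is a numerical sketch only: the function names are illustrative, $\lambda$ is taken to be an integer so that $\lambda + x$ is an integer, and the exact tail is computed by direct summation.

```python
import math

def h(y):
    """(1 + y) log(1 + y) - y for y >= -1, with the limiting value 1 at y = -1."""
    if y < -1:
        raise ValueError("need y >= -1")
    if y == -1:
        return 1.0
    return (1 + y) * math.log1p(y) - y

def poisson_upper_tail(lam, m):
    """Exact P{X >= m} for X ~ Poisson(lam) and a nonnegative integer m."""
    below = sum(
        math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1)) for j in range(m)
    )
    return 1.0 - below

lam = 4                              # illustrative integer mean
for x in (1, 2, 4, 8):
    bound = math.exp(-lam * h(x / lam))
    exact = poisson_upper_tail(lam, lam + x)
    assert exact <= bound            # the exponential bound dominates the exact tail
    print(x, exact, bound)
```

The bound is valid but not tight: it is derived from the moment generating function alone, so a gap of a polynomial factor in $x$ is to be expected.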

[17] Suppose $\infty > \gamma = \|X\|_{\Psi_2}$. Use Theorem <16> to deduce that $PX^{2k} \le \bigl(\sqrt{2k}\,\gamma\bigr)^{2k} \le (2e\gamma^2)^k k!$. Use Bernstein's inequality from Theorem <23> to deduce a tail bound analogous to Theorem <33> for $\sum_i \alpha_i(X_i^2 - PX_i^2)$ with independent subgaussians $X_1, \dots, X_n$ and constants $\alpha_i$.


2.10 Notes


Anyone who has read Massart's Saint-Flour lectures (Massart, 2003) or Lugosi's lecture notes (Lugosi, 2003) will realize how much I have been influenced by those authors. Many of the ideas in those lectures reappear in the book by Boucheron, Lugosi, and Massart (2013).

Many authors seem to credit Chernoff (1952) with the moment generating trick in <2>, even though the idea is obviously much older: compare with the treatment of the Bernstein inequality by Uspensky (1937, pages 204–206). Chernoff himself gave credit to Cramér for “extremely more powerful results” obtained under stronger assumptions.

The characterizations for subgaussians in Theorem < 16 > are essentially due to Kahane (1968, Exercise 6.10).

Hoeffding (1963) established many tail bounds for sums of independent random variables. He also commented that the results extend to martingales.

The Bennett inequality for independent summands was proved by Bennett (1962) by a more complicated method.

Section 2.6 on the Bernstein inequality is based on the exposition by Uspensky (1937, pages 204–206), with refinements from Bennett (1962) and Massart (2003, Section 1.2.3). See also Boucheron et al. (2013, Sections 2.4, 2.8) for a slightly more detailed exposition, including the idea of using the positive part to get a one-sided bound, which they credited to Emmanuel Rio. Van der Vaart and Wellner (1996, page 103) noted the connection between the moment assumption <24> and the behavior of $P\Phi_1(\theta|X_i|)$.

Dudley (1999, Appendix H) used the name Young-Orlicz modulus for what I am calling a Young function.

Every Young function can be written as $\Psi(t) = \int_0^t \psi(x)\, dx$, where $\psi$ is the right-hand derivative of $\Psi$. The function $\psi$ is increasing and right-continuous. For their $N$-functions, Krasnosel'skiĭ and Rutickiĭ (1961, Section 1.3) added the requirements that $\psi(0) = 0 < \psi(x)$ for $x > 0$ and $\psi(x) \to \infty$ as $x \to \infty$, which ensures that $\Psi$ is strictly increasing with $\Psi(t)/t \to \infty$ as $t \to \infty$. Dudley called such a function an Orlicz modulus.

When $\psi(0) = 0$ there is a measure $\mu$ on $\mathcal{B}(\mathbb{R}^+)$ for which $\psi(b) = \mu(0, b]$ for all intervals $(0, b]$. In that case $P\Psi(X) = \mu^s P(X - s)^+$, a representation closely related to Lemma 1 of Panchenko (2003).
