2.5 Bennett’s inequality . . . . 11

(1)

2.1 Tail bounds and concentration . . . . 1

2.2 Moment generating functions . . . . 3

2.3 Orlicz norms . . . . 4

2.4 Subgaussian tails . . . . 6

2.5 Bennett’s inequality . . . . 11

2.6 Bernstein’s inequality . . . . 14

2.7 Subexponential distributions? . . . . 17

2.8 The Hanson-Wright inequality for subgaussians . . . . 20

2.9 Problems . . . . 23

2.10 Notes . . . . 27

Printed: 27 November 2015

version: 20nov15 Mini-empirical

(2)

Chapter 2

A few good inequalities

Section 2.1 introduces exponential tail bounds.

Section 2.2 introduces the method for bounding tail probabilities using mo- ment generating functions.

Section 2.3 describes analogs of the usual L ^p norms, which have proved useful in empirical process theory.

Section 2.4 discusses subgaussianity, a most useful extension of the gaus- sianity property.

Section 2.5 discusses the Bennett inequality for sums of independent ran- dom variables that are bounded above by a constant.

Section 2.6 shows how the Bennett inequality implies one form of the Bern- stein inequality, then discusses extensions to unbounded summands.

Section 2.7 describes two ways to capture the idea of tails that decrease like those of the exponential distribution.

Section 2.8 illustrates the ideas from the previous Section by deriving a subgaussian/subexponential tail bound for quadratic forms in independent subgaussian random variables.

2.1 Tail bounds and concentration

Basic::S:intro

Sums of independent random variables are often approximately normal dis- tributed. In particular, their tail probabilities often decrease in ways similar to the tails of a normal distribution. Several classical inequalities make this vague idea more precise. Such inequalities provided the foundation on which empirical process theory was built.

version: 20nov15 Mini-empirical

(3)

We know a lot about the normal distribution. For example, if W has a N (µ, σ ² ) distribution then (Feller, 1968, Section 7.1)

1 x − 1

x ³

e ^−x ² ^/2

√

2π ≤ P{W ≥ µ + σx} ≤ 1 x

e ^−x ² ^/2

√

2π for all x > 0.

Clearly the inequalities are useful only for larger x: as x decreases to zero the lower bound goes to −∞ and the upper bound goes to +∞. For many purposes the exp(−x ² /2) factor matters the most. Indeed, the simpler tail bound P{W ≥ µ + σx} ≤ exp(−x ² /2) for x ≥ 0 often suffices for asymptotic arguments. Sometimes we need a better inequality showing that the dis- tribution concentrates most of its probability mass in a small region, such as

P{|W − µ| ≥ σx} ≤ 2e ^−x

2 /2 for all x ≥ 0, a concentration inequality.

Similar-looking tail bounds exist for random variables that are not ex- actly normally distributed. Particularly useful are the so-called exponential tail bounds, which take the form

\E@ exp.bnd

\E@ exp.bnd <1> P{X ≥ x + PX} ≤ e ^−g(x) for x ≥ 0,

with g a function that typically increases like a power of x. Sometimes g behaves like a quadratic for some range of x and like a linear function further out in the tails; sometimes some logarithmic terms creep in; and sometimes it depends on var(X) or on some more exotic measure of size, such as an Orlicz norm (see Section 2.3).

For example, suppose S = X 1 + · · · + X n , a sum of independent random variables with PX i = 0 and each X i bounded above by some constant b.

The Bernstein inequality (Section 2.6) tells us that

P{S ≥ x

√

V } ≤ exp −x ² /2 1 + ¹ ₃ bx/ √

V

!

for x ≥ 0,

where V is any upper bound for var(S). If each X _i is also bounded below by −b then a similar tail bound exists for −S, which leads to a concentration inequality, an upper bound for P{|S| ≥ x √

V }.

For x much smaller than √

V /b the exponent in the Bernstein inequality

behaves like −x ² /2; for much larger x the exponent behaves like a negative

multiple of x. In a sense that is made more precise in Sections 2.6 and 2.7, the

inequality gives a subgaussian bound for x not too big and a subexponential

bound for very large x.

(4)

§2.2 Moment generating functions 3

2.2 Moment generating functions

Basic::S:mgf

The most common way to get a bound like < 1 > uses the moment gener- ating function, M (θ) := Pe ^θX . From the fact that exp is nonnegative and exp(θ(X − w)) ≥ 1 when X ≥ w and θ ≥ 0 we have

\E@ mgf.tail

\E@ mgf.tail <2> P{X ≥ w} ≤ inf

θ≥0 Pe ^θ(X−w) = inf

θ≥0 e ^−θw M (θ).

Basic::normal <3> Example. If X has a N (µ, σ ² ) distribution then M (θ) = exp(θµ + σ ² θ ² /2) is finite for all real θ. For x ≥ 0 inequality < 2 > gives

P{X ≥ µ + σx} ≤ inf

θ≥0 exp(−θ(µ + σx) + θµ + θ ² σ ² /2)

= exp(−x ² /2) for all x ≥ 0, the minimum being achieved by θ = x/σ.

Analogous arguments, with X − µ replaced by µ − X, give an analogous bound for the lower tail,

P{X ≤ µ − σx} ≤ exp(−x ² /2) for all x > 0,

leading to the concentration inequality P{|X − µ| ≥ σx} ≤ 2e ^−x ² ^/2 .

Remark. Of course the algebra would have been a tad simpler if I had worked with the standardized variable (X − µ)/σ. I did things the messier way in order to make a point about the centering in Section 2.4.

Exactly the same arguments work for any random variable whose mo- ment generating is bounded above by exp(θµ + σ ² θ ² /2), for all real θ and some finite constants µ and τ . This property essentially defines the sub- gaussian distributions, which are discussed in Section 2.4.

The function L(θ) := log M (θ) is called the cumulant-generating function. The upper bound in < 2 > is often rewritten as

e ^−L ^∗ ^(θ) where L ^∗ (θ) := sup _θ≥0 (θw − L(θ)).

Remark. As Boucheron et al. (2013, Sections 2.2) and others have

noted, the function L ^∗ is the Fenchel-Lagrange transform of L, which is

also known as the conjugate function (Boyd and Vandenberghe, 2004,

Section 3.3).

(5)

Clearly L(0) = 0. The H¨ older inequality shows that L is convex: if θ is a convex combination α ₁ θ ₁ + α ₂ θ ₂ then

e ^L(θ) = M (α 1 θ 1 + α 2 θ 2 )

= P

e ^α ¹ ^θ ¹ ^X e ^α ² ^θ ² ^X

≤ Pe ^θ ¹ ^X

α 1 Pe ^θ ² ^X

α 2

= e ^α ¹ ^L(θ ¹ ^)+α ² ^L(θ ² ⁾ .

More precisely (Problem [1]), the set J = {θ : M (θ) < ∞} is an interval, possibly degenerate or unbounded, containing the origin. For example, for the Cauchy, M (θ) = ∞ for all real θ 6= 0; for the standard exponential distribution (density e ^−x {x > 0} with respect to Lebesgue measure),

M (θ) =

(1 − θ) ⁻¹ for θ < 1

∞ otherwise .

Problem [1] also justifies the interchange of differentiation and expecta- tion on the interior of J , which leads to

L ⁰ (θ) = M ⁰ (θ)

M (θ) = PX e ^θX M (θ) L ⁰⁰ (θ) = M ⁰⁰ (θ)

M (θ) − M ⁰ (θ) M (θ)

2 = PX ² e ^θX M (θ) −

PX e ^θX M (θ)

2 .

If we define a probability measure P θ by its density, dP θ /dP = e ^θX /M (θ), then the derivatives of L can be re-expressed using moments of X under P θ :

L ⁰ (θ) = P θ X and L ⁰⁰ (θ) = P θ (X − P θ X) ² = var _θ (X) ≥ 0.

In particular, L ⁰ (0) = PX, so that Taylor expansion gives

\E@ L.Taylor

\E@ L.Taylor <4> L(θ) = θPX + ¹ ₂ θ ² var _θ∗ (X) with θ ^∗ lying between 0 and θ.

Bounds on the P θ variance of X lead to bounds on L and tail bounds for X − PX. The Hoeffding inequality (Example < ¹⁴ >) provides a good illustration of this line of attack.

2.3 Orlicz norms

Basic::S:orlicz

There is a fruitful way to generalize the the concept of an L ^p norm by means

of a Young function, that is, a convex, increasing function Ψ : R ⁺ → R ⁺

with Ψ(0) = 0 and Ψ(x) → ∞ as x → ∞. The concept applies with any

measure µ but I need it mostly for probability measures.

(6)

§2.3 Orlicz norms 5

Remark. The assumption that Ψ(x) → ∞ as x → ∞ is needed only to rule out the trivial case where Ψ ≡ 0. Indeed, if Ψ(x 0 ) > 0 for some x 0

then, for x = x 0 /α with α < 1,

Ψ(x ₀ ) = Ψ(αx + (1 − α)0) ≤ αΨ(x)

That is, Ψ(x)/x ≥ Ψ(x 0 )/x 0 for all x ≥ x 0 . If Ψ is not identically zero then it must increase at least linearly with increasing x.

At one time Orlicz norms became very popular in the empirical process literature, in part (I suspect) because chaining arguments (see Chapter 4) are cleaner with norms than with tail probabilities.

The most important Young functions for my purposes are the power functions, Ψ(x) = x ^p for a fixed p ≥ 1, and the exponential power func- tions Ψ α (x) := exp(x ^α ) − 1 for α ≥ 1. The function Ψ 2 is particularly useful (see Section 2.4) for working with distributions whose tails decrease like a Gaussian. It is also possible (Problem [6]) to define a Young function Ψ _α , for 0 < α < 1, for which

Ψ _α (x) ≤ exp(x ^α ) ≤ K _α + Ψ _α (x) for all x ≥ 0,

where K _α is a positive constant. Some linear interpolation near 0 is needed to overcome the annoyance that x 7→ exp(x ^α ) is concave on the inter- val [0, z α ], for z α := ((1 − α)/α) ^1/α .

Basic::Orlicz.norm <5> Definition. Let ( X, A, µ) be a measure space and f be an A-measuable real function on X. For each Young function Ψ the corresponding Orlicz norm kf k _Ψ (or Ψ-norm) is defined by

kf k _Ψ = inf{c > 0 : µΨ(|f (x)|/c) ≤ 1},

with kf k _Ψ = ∞ if the infimum runs over an empty set.

Remark. It is not hard to show that kf k _Ψ < ∞ if and only if µΨ(|f |/C ₀ ) < ∞ for at least one finite constant C ₀ . Moreover, if 0 < c = kf k _Ψ < ∞ then µΨ(|f |/c) ≤ 1; the infimum is achieved.

When restricted to the set L ^Ψ ( X, A, µ) of all measurable real functions on X with finite Ψ-norm, k·k _Ψ is a seminorm. It fails to be a norm only because kf k _Ψ = 0 if and only if µ{x : f (x) 6= 0} = 0.

The corresponding set of µ-equivalence classes is a Banach space. It is traditional and benign to call k·k _Ψ a norm even though it is only a seminorm on L ^Ψ ( X, A, µ).

The assumption Ψ(0) = 0 can be discarded if µ is a finite measure.

For each finite constant A > Ψ(0), the quantity inf{c > 0 : µΨ(|f |/c) ≤

A} also defines a seminorm.

(7)

Finiteness of the Ψ-norm for a random variable Z immediately implies a tail bound: if 0 < σ = kZk _Ψ < ∞ then

\E@ orlicz.tail

\E@ orlicz.tail <6> P{|Z| ≥ xσ} ≤ max

1, P Ψ (|Z|/σ) Ψ(x)

= max

1, 1

Ψ(x)

for x ≥ 0.

For example, for each α > 0 there is a constant C α for which P{|Z| ≥ x kZk _Ψ _α } ≤ C _α e ^−x ^α for x ≥ 0.

The constant can be chosen so that C α e ^−x ^α ≥ 1 for all x so small that Ψ α (x)e ^−x ^α is not close to 1.

When seeking to bound an Orlicz norm I often find it easier to first obtain a constant C 0 for which PΨ(|X|/C 0 ) ≤ K 0 , where K 0 is a constant strictly larger than 1. As the next Lemma shows, a simple convexity argument turns such an inequality into a bound on kXk _Ψ .

Basic::Psi.bound <7> Lemma. Suppose Ψ is a Young function. If PΨ(|X|/C 0 ) ≤ K ₀ for some finite constants C ₀ and K ₀ > 1 then kXk _Ψ ≤ C ₀ K ₀ .

Proof For each θ in [0, 1] convexity of Ψ gives PΨ θ|X|

C 0

≤ θPΨ |X|

C 0

+ (1 − θ)Ψ(0) ≤ θK ₀ . The choice θ = 1/K ₀ makes the last bound equal to 1.

Beware. The Lemma does not give a foolproof way of determining rea- sonable bounds for the Ψ-norm. For example, if X has a N (0, 1) distribution then simple integration shows that

PΨ 2 (|X|/c) = g(c) := c/ p

c ² − 2 − 1 for c > √ 2.

Thus kXk _Ψ

2 = p8/3 but cg(c) → ∞ as c decreases to √ 2.

2.4 Subgaussian tails

Basic::S:subgaussian

As noted in Section 2.2, if X is a random variable for which there exist constants µ and σ ² for which Pe ^θX ≤ exp(µθ + ¹ ₂ σ ² θ ² ) for all real θ then

\E@ subg.oneside

\E@ subg.oneside <8> P{X ≥ µ + σx} ≤ e ^−x

2 /2 for all x ≥ 0.

(8)

§2.4 Subgaussian tails 7

The approximations for θ near zero,

Pe ^θX = 1 + θPX + ¹ ₂ θ ² PX ² + o(θ ² ), exp(µθ + ¹ ₂ σ ² θ ² ) = 1 + µθ + ¹ ₂ (σ ² + µ ² )θ ² + o(θ ² ),

would force PX = µ and PX ² ≤ σ ² + µ ² , that is, var(X) ≤ σ ² . As Prob- lem [11] shows, σ ² might need to be much larger than the variance. To avoid any hint of the convention that σ ² denotes a variance, I feel it is safer to replace σ by another symbol.

Basic::subgaussian <9> Definition. Say that a random variable X with PX = µ has a subgaussian distribution with scale factor τ < ∞ if Pe ^θX ≤ exp(µθ + τ ² θ ² /2) for all real θ. Write τ (X) for the smallest τ for which such a bound holds.

Remark. Some authors call such an X a τ -subgaussian random variable. Some authors require µ = 0 for subgaussianity; others use the qualifier centered to refer to the case where µ is zero.

For a subgaussian X, the two-sided analog of < 8 > gives

\E@ beta.def

\E@ beta.def <10> P{|X − µ| ≥ x} ≤ 2 exp

− x ² 2β ²

for all x ≥ 0,

for 0 < β ≤ τ (X). In fact (Theorem < 16 >) existence of such a tail bound is equivalent to subgaussianity.

Basic::gaussian <11> Example. Suppose Y = (Y 1 , . . . , Y n ) has a multivariate normal distribu- tion. Define M := max _i≤n |Y _i | and σ ² := max _i≤n var(Y _i ). Chapter 6 explains why

P{|M − PM | ≥ σx} ≤ 2 exp(−x ² /2) for all x ≥ 0.

That is, M is subgaussian—a most surprising and wonderful fact.

Many useful results that hold for gaussian variables can be extended to subgaussians. In empirical process theory, conditional subgaussianity plays a key role in symmetrization arguments (see Chapter 8). The first example captures the main idea.

Basic::Rademachers <12> Example. Suppose X = P

i≤n b i s _i , where the b i ’s are constants and the s i ’s are independent Rademacher random variables, that is, P{s i = +1} = ¹ ₂ = P{s i = −1}. Then

Pe ^θb ⁱ ^s ⁱ = ¹ ₂

e ^θb ⁱ + e ^−θb ⁱ

=

∞

X

k=0

(θb _i ) ^2k

(2k)! ≤ exp(θ ² b ² _i /2)

(9)

so that Pe ^θX ≤ Q

i≤n Pe ^θb ⁱ ^s ⁱ ≤ e ^θ ² ^τ ² ^/2 with τ ² = P

i≤n b ² _i . That is, X has a centered subgaussian distribution with τ (X) ² ≤ P

i≤n b ² _i .

More generally, if X is a sum of independent subgaussians X ₁ , . . . , X _n then the equality

Pe ^θ(X−PX) = Y

i Pe ^θ(X ⁱ ^−PX ⁱ ⁾

shows that X is subgaussian with τ ² (X) ≤ P

i τ ² (X _i ). For sums of indepen- dent variables we need consider only the subgaussianity of each summand.

Basic::interval.range <13> Example. Suppose X is a random variable with PX = µ and a ≤ X ≤ b, for finite constants a and b. By representation < 4 >, its moment generating function M (θ) satisfies

log M (θ) = θPX + ¹ ₂ θ ² var θ∗ (X) with θ ^∗ lying between 0 and θ.

Write µ ^∗ for P θ ^∗ X and m for (b − a)/2. Note that |X − m| ≤ (b − a)/2.

Thus

var _θ∗ (X) = P θ ^∗ |X − µ ^∗ | ² ≤ P θ ^∗ |X − m| ² ≤ (b − a) ² /4.

and log Pe ^θ(X−µ) ≤ (b − a) ² θ ² /8. The random variable X is subgaussian with τ (X) ≤ (b − a)/2.

Remark. Hoeffding (1963, page 22) derived the same bound on the moment generating function by a direct appeal to convexity of the exponential function:

Pe ^θX ≤ qe ^θa + pe ^θb where 1 − q = p = PX − a b − a

In effect, the bound represents the extremal case where X concentrates on the the two-point set {a, b}. He then calculated a second derivative for L, without interpreting the results as a variance but using an inequality that can be given a variance interpretation.

Basic::Hoeffding.ineq <14> Example. (Hoeffding’s inequality for independent summands) Suppose ran- dom variables X 1 , . . . , X n are independent with PX i = µ i and a i ≤ X _i ≤ b _i for each i, for constants a _i and b _i .

By Example < 13 >, each X _i is subgaussian with τ (X _i ) ≤ (b _i −a _i )/2. The sum S = P

i≤n X i is also subgaussian, with expected value µ = P

i≤n µ i

(10)

§2.4 Subgaussian tails 9

and τ ² (S) = P

i≤n τ ² (X i ) = P

i≤n (b i − a _i ) ² /4. The one-sided version of inequality < 10 > gives

\E@ Hoeff.indep

\E@ Hoeff.indep <15>

P nX

i≤n (X _i − µ _i ) ≥ x o

≤ exp

−2x ² / X

i≤n (b _i − a _i ) ²

for each x ≥ 0.

See Chapter 3 for an extension of Hoeffding’s inequality < 15 > to sums of bounded martingale differences.

The next Theorem provides other characterizations of subgaussianity, using the Orlicz norm for the Young function Ψ ₂ (x) = exp(x ² ) − 1 and mo- ment inequalities. The proof of the Theorem provides a preview of a “sym- metrization” argument that plays a major role in Chapter 8. I prefer to write the argument using product measures rather than conditional expectations.

Here is the idea. Suppose X 1 is a random variable with PX 1 = µ, defined on a probability space (Ω ₁ , F 1 , P 1 ). Make an “independent copy” (Ω ₂ , F 2 , P 2 ) of the probability space then carry X ₁ over to a new random variable X ₂ on Ω 2 . Under P 1 ⊗ P 2 the random variables X 1 and X 2 are independent and identically distributed.

Remark. You might prefer to start with an X defined on (Ω, F, P) then define random variables on Ω × Ω by X 1 (ω 1 , ω 2 ) = X(ω 1 ) and X ₂ (ω ₁ , ω ₂ ) = X(ω ₂ ). Under P ⊗ P the random variables X 1 and X ₂ are independent and each has the same distribution as X (under P). I once referred to independent copies during a seminar in a Math department.

One member of the audience protested vehemently that a copy of X is the same as X and hence can only be independent of X in trivial cases.

Note that P 2 X ₂ = P 1 X ₁ = µ. For each convex function f , P 1 f (X 1 − µ) = P ^ω 1 ¹ f [X 1 (ω 1 ) − P ^ω 2 ² X 2 (ω 2 )]

= P ^ω ₁ ¹ f [P ^ω ₂ ² (X ₁ (ω ₁ ) − X ₂ (ω ₂ ))]

≤ P ^ω ₁ ¹ P ^ω ₂ ² f [X ₁ (ω ₁ ) − X ₂ (ω ₂ )] , the last line coming from Jensen’s inequality for P 2 .

Basic::subgauss.equiv <16> Theorem. For each random variable X with expected value µ, the following assertions are equivalent.

(i) X has a subgaussian disribution (with τ (X) < ∞)

Basic::L.finite (ii) L(X) := sup _r≥1 kXk _r

√ r < ∞

(11)

Basic::Psitwo.finite (iii) kXk _Ψ

2 < ∞

Basic::beta.def2 (iv) there exists a positive constant β for which P{|X−µ| ≥ x} ≤ 2 exp

− x ² 2β ²

for all x ≥ 0

More precisely,

Basic::Psi2L.equiv (a) c 0 kXk _Ψ

2 ≤ L(X) ≤ kXk _Ψ

2 with c 0 = 1/ √ 4e ;

Basic::centered (b) if β(X) denotes the smallest β for which inequality (iv) holds, c 1 kX − µk _Ψ

2 ≤ β(X) ≤ τ (X) ≤ 4L(X − µ) with c 1 = 1/ √ 6.

Remark. Finiteness of L(X) is also equivalent to finiteness of L(X − µ) because |µ| ≤ kXk ₁ ≤ L(X) and |L(X)−L(X −µ)| ≤ |µ|. Consequently L(X − µ) ≤ 2L(X), but there can be no universal inequality in the other direction: L(X − µ) is unchanged by addition of a constant to X but L(X) can be made arbitrarily large.

The precise values of the constants c _i are not important. Most likely they could be improved, but I can see no good reason for trying.

Proof Implicit in the sequences of inequalities is the assertion that finite- ness of one quantity implies finiteness of another quantity.

Proof of (a)

Problem [8] shows that k ^k ≥ k! ≥ (k/e) ^k for all k ∈ N.

Suppose PΨ 2 (|X|/D) ≤ 1. Then P

k∈N P(X/D) ^2k /k! ≤ 1 so that kXk _2k ≤ D(k!) ^1/2k ≤ D √

k for each k ∈ N and L(X) ≤ √

2 sup _k∈N kXk _2k / √

2k ≤ D by Problem [9].

Conversely, if ∞ > L ≥ kXk _r / √

r for all r ≥ 1 then

PΨ 2

|X|

D

= X

k∈N

kX/Dk ^2k _2k

k! ≤ X

k∈N

(L √

2k/D) ^2k (e/k) ^k ,

which equals 1 if D = √ 4e L.

Proof of (b)

None of the quantities appearing in the inequalities depend on µ, so we may as well assume µ = 0.

Inequality < 10 > gives β(X) ≤ τ (X).

(12)

§2.5 Bennett’s inequality 11

Simplify notation by writing β for β(X) and τ for τ (X) and L for L(X), and define γ := kXk _Ψ

2 . Also, assume that X is not a constant, to avoid annoying trivial cases.

To show that γ ≤ β √ 6:

PΨ 2

|X − µ|

β √ 6

= P Z ∞

0 e ^x {|X − µ| ² ≥ 6β ² x} dx

= Z ∞

0 e ^x P{|X − µ| ≥ β

√ 6x} dx

≤ 2 Z ∞

0 exp (x − 3x) dx = 1.

To show that τ ≤ 4L: Symmetrize. Construct a random variable X ⁰ with the same distribution as X but independent of X. By the triangle inequality, kX − X ⁰ k _r ≤ 2 kXk _r ≤ 2L √

r for all r ≥ 1. By Jensen’s inequality Pe ^θX ≤ Pe ^θ(X−X ⁰ ⁾

= 1 + X

k∈N

θ ^2k P(X − X ⁰ ) ^2k

(2k)! symmetry kills odd moments

≤ 1 + X

k∈N

(2L √ 2k ) ^2k k ^k k!

= 1 + X

k∈N

(8θ ² L ² ) ^k

k! = e ^8L ² ^θ ² , and so on.

2.5 Bennett’s inequality

Basic::S:Bennett

The Hoeffding inequality depends on the somewhat crude bound var(W ) ≤ (b − a) ² /4 if a ≤ W ≤ b.

Such an inequality can be too pessimistic if W has only a small probability

of taking values near the endpoints of the interval [a, b]. The inequalities

can sometimes be improved by introducing second moments explicitly into

the upper bound. Bennett’s one-sided tail bound provides a particularly

elegant illustration of the principle.

(13)

Basic::Bennett.ineq <17> Theorem. Suppose X 1 , . . . , X n are independent random variables for which X _i ≤ b and PX i = µ _i and PX _i ² ≤ v _i for each i, for nonnegative constants b and v i . Let W equal P

i≤n v i . Then P

nX

i≤n (X _i − µ _i ) ≥ x o

≤ exp

− x ² 2W ψ Benn

bx W

for x ≥ 0 where ψ Benn denotes the function defined on [−1, ∞) by

\E@ Benpsi.def

\E@ Benpsi.def <18> ψ Benn (t) := (1 + t) log(1 + t) − t

t ² /2 for t 6= 0, and ψ Benn (0) = 1.

Remark. By Problem [14], ψ Benn is positive, decreasing, and convex, with ψ Benn (−1) = 2 and ψ Benn (t) decreasing like (2/t) log t as t → ∞.

When bx/W is close to zero the ψ Benn contribution to the upper bound is close to 1 and the exponent is close to a subgaussian −x ² /(2W ).

(If b = 0 then it is exactly subgaussian.) Further out in the tail, for larger x, the exponent decreases like −x log(bx/W )/b, which is slightly faster than exponential.

Proof Define Φ 1 (x) = e ^x − 1 − x, the convex function Ψ ₁ adjusted to have zero derivative at the origin, as in Problem [7]. From Problem [14], the positive function

\E@ Del.def

\E@ Del.def <19> ∆(x) = Φ ₁ (x)

x ² /2 for x 6= 0, and ∆(0) = 1,

is continuous and increasing on the whole real line. For θ ≥ 0, Pe ^θX ⁱ = 1 + θPX i + PΦ 1 (θX _i )

= 1 + θµ i + P ¹ ₂ (θX i ) ² ∆(θX i )

≤ 1 + θµ _i + ¹ ₂ θ ² P(X i ² )∆(θb) because ∆ is increasing

≤ exp

θµ _i + ¹ ₂ θ ² v _i ∆(θb) .

\E@ Bennett.mgf

\E@ Bennett.mgf <20>

That is,

\E@ bdd1

\E@ bdd1 <21> Pe ^θ(X ⁱ ^−µ ⁱ ⁾ ≤ exp ¹ ₂ θ ² v i ∆(θb) = exp v i

b ² Φ 1 (θb)

for θ ≥ 0.

Remark. In the last expression I am implicitly assuming that b

is nonzero. We can recover the bound when b = 0 by a passage

to the limit as b decreases to zero. Actually, the b = 0 case is

quite noteworthy because the ∆ contribution disappears from the

upper bound for P exp(θX ⁱ ) and the ψ Benn disappears from the final

inequality.

(14)

§2.5 Bernstein’s inequality 13

Abbreviate P

i≤n (X i − µ _i ) to T . By independence and < 21 >, Pe ^θT = Y

i≤n Pe ^θ(X ⁱ ^−µ ⁱ ⁾ ≤ exp W

b ² Φ 1 (θb)

and, by < 2 >,

P{T ≥ x} ≤ inf θ≥0 exp

−xθ + W

b ² Φ 1 (θb)

= exp − W b ² sup

bθ≥0

bx

W bθ − Φ 1 (θb)

!

= exp

− W

b ² Φ ^∗ ₁ (bx/W )

, where Φ ^∗ ₁ denotes the conjugate of Φ ₁ ,

Φ ^∗ ₁ (y) := sup

t∈R

(ty − Φ ₁ (t)) = (1 + y) log(1 + y) − y if y ≥ −1

+∞ otherwise .

When y ≥ −1 the supremum is achieved at t = log(1+y). Thus ¹ ₂ y ² ψ Benn (y) = Φ ^∗ ₁ (y) for y ≥ −1 and (W/b ² )Φ ^∗ ₁ (bx/W ) = x ² /(2W )ψ Benn (bx/W ), as as- serted.

Remark. The inequality still holds if 0 > b ≥ −x/W , although the proof changes slightly. What happens for smaller b?

Unlike the case of the two-sided Hoeffding inequality, there is no au- tomatic lower tail analog for the Bennett inequality without an explicit lower bound for the summands. One particularly interesting case occurs when X _i ≥ 0:

P{

X

i≤n (X _i − µ _i ) ≤ −x} = P{ X

i≤n (µ _i − X _i ) ≥ x} ≤ exp(− ² /(2W )) for all x ≥ 0, a clean (one-sided) subgaussian tail bound.

The Bennett inequality also works for the Poisson distribution. If X has a Poisson(λ) distribution then (Problem [15])

P{X ≥ λ + x} exp

− x ² 2λ ψ Benn

x λ

for x ≥ 0 and, for 0 ≤ x ≤ λ,

P{X ≤ λ − x} = exp

− x ² 2λ ψ Benn

− x λ

≤ exp

− x ² 2λ

,

(15)

a perfect subgaussian lower tail. See Problem [16] if you have any doubts about the boundedness of Poisson random variables.

See Chapter 3 for an extension of the Bennett inequality to sums of martingale differences.

2.6 Bernstein’s inequality

Basic::S:Bernstein

As shown by Problem [14], ψ Benn (t) ≥ (1 + t/3) ⁻¹ for t ≥ −1. Thus the Bennett inequality implies a weaker result: if X 1 , . . . , X n are independent with PX i = µ _i and PX _i ² ≤ v _i and X _i ≤ b then

\E@ Bernstein.indep0

\E@ Bernstein.indep0 <22> P{

X

i≤n (X _i − µ _i ) ≥ x} ≤ exp

− x ² /2 W + bx/3

for x ≥ 0,

a form of Bernstein’s inequality. When bx/W is close to zero the expo- nent is again close to x ² /(2W ); when bx/W is large, the Bernstein inequality loses the extra log term in the exponent. If the summands X i are bounded in absolute value (not just bound above) we get an analogous two-sided inequality.

The assumptions PX i ² ≤ v _i and X i ≤ b imply P|X i | ^k ≤ PX i ² b ^k−2 ≤ v _i b ^k−2 for k ≥ 2.

A much weaker condition on the growth of P|X i | ^k with k, without any assumption of boundedness, leads to a useful upper bound for the moment generating function and a more genral version of the Bernstein ineqi=uality..

Basic::Bernstein.unbdd <23> Theorem. (Bernstein inequality) Suppose X ₁ , . . . , X _n are independent ran- dom variables with PX i = µ _i and

\E@ Bernstein.moment

\E@ Bernstein.moment <24> P|X i | ^k ≤ ¹ ₂ v _i B ^k−2 k! for k ≥ 2.

Define W = P

i≤n v i . Then, for x ≥ 0, P

nX

i≤n (X i − µ _i ) ≥ x o

≤ e ^−H ^Benn ^{(x,B,W )} ≤ e ^−H ^Bern ^{(x,B,W )} where

H _Bern (x, B, W ) = x ² /W 2(1 + xB/W ) H _Benn (x, B, W ) = x ² /W

1 + xB/W + p1 + 2xB/W

(16)

§2.6 Bernstein’s inequality 15

Remark. The B here does not represent an upper bound for X i . The traditional form of Bernstein’s inequality, with HBern, corresponds to the bound < 22 > derived from the Bennett inequality under the assumption X i ≤ b = B/3.

Proof As before, F := P nX

i X i ≥ x + X

i µ i

o

≤ inf

θ≥0 e ^−(x+ ^P ⁱ ^µ ⁱ ^)θ Y

i≤n Pe ^θX ⁱ . Use the moment inequality < 24 > to bound the moment generating func- tion for X i .

Pe ^θX ⁱ = 1 + θµ _i + PΦ 1 (θX _i ) where Φ ₁ (t) := e ^t − 1 − t and, for θ ≥ 0,

PΦ 1 (θX _i ) ≤ PΦ 1 (θ|X _i |) = X

k≥2

θ ^k P|X i | ^k k!

≤ ¹ ₂ v _i θ ² X

k≥2 (θB) ^k−2

= ¹ ₂ v i θ ²

1 − Bθ provided 0 ≤ θ < 1/B.

\E@ Bernstein.mgf

\E@ Bernstein.mgf <25>

Thus

Pe ^θX ⁱ ≤ 1 + θµ _i + ¹ ₂ v _i θ ²

1 − bθ ≤ exp

θµ _i + ¹ ₂ v _i θ ² 1 − bθ

for 0 ≤ θB < 1 and

F ≤ exp − sup 0≤Bθ<1 g(θ)

where g(θ) := xθ − ¹ ₂ W θ ² 1 − Bθ . The H _Bern bound

As an approximate maximizer for g choose θ ₀ = x/(W + xB) so that F ≤ e ^−g(θ ⁰ ⁾ = exp

− x ² /2 W + xB

.

Bennett (1962, equation (7)) made this strange choice by defining G = 1−θB then minimizing −xθ+ ¹ ₂ θ ² W/G while ignoring the fact that G depends on θ.

That gives θ = xG/W = (1 − θB)x/W , a linear equation with solution θ ₀ .

Very confusing. As Bennett noted, a boundedness assumption |X i | ≤ b

(17)

actually implies < 24 > with B = b/3. The exponential upper bound then agrees with < 22 >.

The H _Benn bound

Following Bennett (1962, equation (7a)), this time we really maximize g(θ).

The calculation is cleaner if we rewrite g with u := xB/W and s := 1 − θB.

With those substitutions, g(θ) = g 0 (s) = W

2b ² 2us − (1 − s) ² /s = W

2B ² −s ⁻¹ − (1 + 2u)s − 2(1 + u) . The maximum is achieved at s 0 = (1 + 2u) ^−1/2 giving F ≤ e ^−g ⁰ ^(s ⁰ ⁾ where

\E@ u.bound

\E@ u.bound <26> g 0 (s 0 ) = W

B ² 1 + u − √

1 + 2u = W B ²

u ²

1 + u + √ 1 + 2u

.

Both inequalities in Theorem < 23 > have companion lower bounds, ob- tained by replacing X _i by −X _i in all the calculations. This works because the Bernstein condition < 24 > also holds with X _i replaced by −X _i . The in- equalities could all have been written as two-sided bounds, that is, as upper bounds for P{| P

i (X i − µ _i )| ≥ x}.

If we are really only interested in a one-sided bound then it seems super- fluous to control the size of |X i |. For example, for the upper tail it should be enough to control X _i ⁺ . Boucheron et al. (2013, page 37)=BLM derived such a bound (which they attributed to Emmanuel Rio) by means of the elementary inequality

e ^x − 1 − x ≤ ¹ ₂ x ² + X

k≥3 (x+) ^k /k! for all real x.

(For x ≥ 0 both sides are equal. For x < 0 it reflects the fact that g(x) = e ^x − (1 + x + ¹ ₂ x ² ) has a positive derivative with g(0) = 0.) They replaced the assumption < ²⁴ > by

PX i ² = v _i and P|X i | ^k ≤ ¹ ₂ v _i B ^k−2 k! for k ≥ 3.

For B > θ ≥ 0, we then get

Pe ^θX ⁱ ≤ 1 + θµ _i + ¹ ₂ θ ² v i + ¹ ₂ v i θ ² X

k≥3

(θB) ^k−2 ≤ exp

θµ i + ¹ ₂ v i θ ² 1 − θB

,

(18)

§2.7 Subexponential distributions? 17

which is exactly the same inequality as in the proof of Theorem < 23 >. As with the H _Benn calculation it follows that

P nX

i X _i ≥ x + X

i µ _i o

≤ exp

− W

B ² 1 + u − √

1 + 2u

where W = P

i v _i and u = xB/W . BLM observed that the upper bound can be inverted by solving a quadratic in u,

1 + 2u = 1 + u − tB ² /W 2

,

which has a positive solution u = tB ² /W +p2tB ² /W , corresponding to x = W u/B = tB + √

2tW . The one-sided Bernstein bound then takes an elegant form,

\E@ one.sided.Bernstein

\E@ one.sided.Bernstein <27> P nX

i (X i − µ _i ) ≥ tB +

√ 2tW

o

≤ e ^−t for t ≥ 0.

By splitting according to whether tB ≥ √

2tW or not, and by introducing an extraneous factor of 2, we can split the elegant form into two more suggestive inequalities,

P{

X

i (X _i − µ _i ) ≥ 2tB} ≤ e ^−t for t ≥ 2W/B ² P{

X

i (X _i − µ _i ) ≥ 2t √

W } ≤ e ^−t ² ^/2 for t ² /2 ≥ 2W/B ² .

These inequalities make clear that the subexponential part (large t) of the tail is controlled by B and the subgaussian part (moderate t) is controlled by W .

2.7 Subexponential distributions?

Basic::S:subexp

To me, subexponential behavior refers to functions that are bounded above by a constant multiple of exp(−c|x|), for some positive constant c. For example, the Bernstein inequality shows that, far enough from the mean, the tail probabilities for a large class of sums of independent random variables are subexponential. Closer to the mean the tail are subgaussian, which one might think is better than subexponential. As another example, if a random variable X takes values in [−1, +1] we have P{|X| ≥ x} = 0 ≤ exp(−10 ¹⁰ x) for all x ≥ 1, but I would hesitate to call that behavior subexponential.

What then does it mean for a whole distribution to be subexponential?

And how is subexponentiality related to the Bernstein moment condition,

\E@ Bernstein.moment0

\E@ Bernstein.moment0 <28> P |X| ^k

B ^k k! ≤ ¹ ₂ v/B ² for k ≥ 2?

(19)

As you already know, condition < 28 > implies

\E@ Bernstein.mgf0

\E@ Bernstein.mgf0 <29> Pe ^θX −1−θPX = PΦ 1 (θX) ≤ PΦ 1 (θ|X|) ≤ ¹ ₂ vθ ²

1 − θB for 0 ≤ θB < 1, which implies

Pe ^θX ≤ exp

θµ + ¹ ₂ vθ ² 1 − θB

for 0 ≤ θB < 1 and µ = PX.

Van der Vaart and Wellner (1996, page 103) pointed out that condition < 28 >

is implied by

1 2 v/B ² ≥ PΦ 1 (|X|/B) = X

k≥2

P|X| ^k

B ^k k! where Φ 1 (x) := e ^x − 1 − x.

And, if < 28 > holds, then

PΦ 1 (|X|/2B) = X

k≥2

P|X| ^k

2 ^k B ^k k! ≤ ¹ ₂ (2v)/(2B) ² ,

which is just < 29 > with θ = 1/(2B). It would seem that the moment condi- tion is somehow related to kXk _Φ ₁ . Indeed, Vershynin (2010, Section 5.2.4) defined a random variable X to be subexponential if sup _k∈N kXk _k /k < ∞.

By Problems [7] and [10], this condition is equivalent to finiteness of the Orlicz norm kXk _Φ ₁ .

Boucheron et al. (2013, Section 2.4) defined a centered random vari- able X to be sub-gamma on the right tail with variance factor v and scale factor B if

\E@ sub.gamma

\E@ sub.gamma <30> Pe ^θX ≤ exp vθ ² /2 1 − θB

for 0 ≤ θB < 1.

They pointed out that if X has a gamma(α, β) distribution (with mean µ = αβ and variance v = αβ ² ) then

Pe ^θ(X−µ) = e ^−θµ (1 − θβ) ^−α = exp (−θαβ − α log(1 − θβ))

≤ exp αβ ² θ ² /2 1 − θβ

for 0 ≤ θβ < 1.

Hence the name. The special case where α = 1 corresponds to the exponen-

tial.

(20)

§2.7 Subexponential distributions? 19

Let me try to tie these definitions and facts together.

First note that the upper bound in < 29 > equals 1 if θ = 2/ B + √

B ² + 2v . Thus the Bernstein assumption < 28 > implies

γ := kXk _Φ ₁ ≤ B + p

B ² + 2v

/2.

If, as is sometimes assumed, v = PX ² then < 28 > also implies v ^3/2 ≤ P|X| ³ ≤ ¹ ₂ v3!B ² .

That is, v ≤ (3B) ² and B < γ ≤ 2.7B. It appears that the B, or kXk _Φ ₁ , controls the subexponential part of the tail probabilities, at least as far as the Bernstein inequality is concerned. Inequality < 27 > conveyed the same message. Compare with the subexponential bound

P{|X| ≥ x} ≤ min (1, 1/Φ 1 (x/γ)) for γ = kXk _Φ

1 . The inequality PΦ 1 (|X|/γ) ≤ 1 also implies

P

|X| ^k

γ ^k k! ≤ 1 ≤ ¹ ₂ (2γ ² )/γ ² for k ≥ 2.

That is, < 28 > holds with v = 2γ ² and B = γ. Theorem < 23 > then gives

P{X − µ ≥ x} ≤ exp

− x ² 4γ ² + 2xγ

≤ exp(−x ² /(8γ ² )) for 0 ≤ x ≤ 2γ.

Is that not a subgaussian bound? Unfortunately, the bound—subgaussian or not—has little value for such a small range near zero. The upper bound always exceeds 1/ √

e .

The useful subgaussian part of the bound is contributed by the v in < 28 >.

The effect is clearer for a sum of independent random variables X ₁ , . . . , X _n , each distributed like X:

P exp(θ X

i≤n X _i ) ≤ exp nvθ ² /2 1 − θB

for 0 ≤ θB < 1, which leads to tail bounds like

P{

X

i X i ≥ x} ≤ exp

− x ² /2 nv + xB

,

(21)

useful subgaussian behavior when 0 ≤ x ≤ nv/B.

In my opinion, the sub-gamma property < 30 > is better thought of as a restricted subgaussian assumption that automatically implies subexponen- tial behavior far out in the tails. For each α in (0, 1) it implies

Pe ^θX ≤ exp vθ ² /2 1 − α

for 0 ≤ θB ≤ α < 1.

Remark. Uspensky (1937, page 204) used this approach for the Bernstein inequality, before making the choice α = xB/(W + xB) to recover the traditional HBern form of the upper bound.

More generally, if we have a bound

\E@ restricted.subg

\E@ restricted.subg <31> Pe ^θ(X−µ) ≤ e ^vθ ² ^/2 for 0 ≤ θ ≤ ρ,

with v and ρ positive constants and µ = PX, then for x ≥ 0, P{X ≥ µ + x} ≤ min

0≤θ≤ρ exp −xθ + ¹ ₂ vθ ²

= exp − ¹ ₂ x ² /v

for 0 ≤ x ≤ ρv exp −ρx + ¹ ₂ vρ ² < e ^−xρ/2 for x > ρv .

\E@ gauss.exp

\E@ gauss.exp <32>

The minimum is achieved by θ = min(ρ, x/v).

The proof in the next Section of a generalized version of the Hanson- Wright inequality—showing that a quadratic P

i,j X _i A _i,j X _j in independent subgaussians has a subgaussian/subexponential tail— illustrates this ap- proach.

2.8 The Hanson-Wright inequality for subgaussians

Basic::S:HW

The following is based on an elegant argument by Rudelson and Vershynin (2013) = R&V. To shorten the proof I assume that the A matrix has zeros down its diagonal, so that P P

i,j X i A i,j X j = 0 and I can skip the part of the argument (see Problem [17]) that deals with P

i A i,i (X _i ² − PX i ² ).

Basic::HW <33> Theorem. Let X ₁ , . . . , X _n be independent, centered subgaussian random variables with τ ≥ max _i τ (X _i ). Let A be an n × n matrix of real numbers with A ii = 0 for each i. Then

P{X ⁰ AX ≥ t} ≤ exp − min t ²

64τ ⁴ kAk _HS , t 8 √

2 τ ² kAk ₂

!!

for t ≥ 0,

where kAk _HS := ptrace(A ⁰ A) = q P

i,j A ² _i,j and kAk ₂ := sup _kuk

2 ≤1 kAuk ₂ .

(22)

§2.8 The Hanson-Wright inequality for subgaussians 21

Remark. There are no assumptions of symmetry or positive definiteness on A. The proof also works with A replaced by −A, leading to a similar upper bound for P{|X ⁰ AX| ≥ t}. The Hilbert-Schmidt norm kAk _HS is also called the Frobenius norm. The constants 64 and 8 √

2 are not important.

Proof Without loss of generality, suppose τ = 1, so that Pe ^θX ⁱ ≤ e ^θ ² ^/2 for all i and all real θ and P exp(α ⁰ X) ≤ exp(kαk ² ₂ /2) for each α ∈ R ⁿ .

As explained at the end of Section 2.7, it suffices to show Pe ^θX

0 AX ≤ exp( ¹ ₂ vθ ² ) for 0 ≤ θ ≤ ρ where v = 16 kAk ² _HS and ρ = √

32 kAk ₂ −1

.

To avoid conditioning arguments, assume P = ⊗ i∈[n] P i on B(R ^[n] ), where [n] := {1, 2, . . . , n}, and each X _i is a coordinate map, that is, X _i (x) = x _i and X(x) = x. Write G for the N (0, I n ) distribution, another probability measure on B(R ^[n] ). Of course G = ⊗ i∈[n] G i with each G i equal to N (0, 1).

For each subset J of [n] write P J for ⊗ _i∈J P _i and G J for ⊗ _i∈J G _i and, for generic vectors w = (w i : i ∈ [n]) in R ^[n] , write w J for (w i : i ∈ J ) ∈ R ^J .

The proof works by bounding Pe ^θX ⁰ ^AX by the G expected value of an analogous quadratic, using a decoupling argument that R&V attributed to Bourgain (1998).

For each δ = (δ i : i ∈ I) ∈ {0, 1} ^[n] define A _δ to be the n × n matrix with (i, j)th element δ _i (1 − δ _j )A _i,j . Write D _δ for {i ∈ [n] : δ _i = 1} and N _δ for [n]\D _δ . Then X ⁰ A _δ X = X _D ⁰

δ EX _N _δ where E _δ := A[D _δ , N _δ ]. If the rows and columns of A were permuted so that i < i ⁰ for all i ∈ D _δ and i ⁰ ∈ N _δ then A _δ would look like

A _δ = 0 E _δ 0 0

where E _δ := A[D _δ , N _δ ] .

Now let Q denote the uniform distribution on {0, 1} ^[n] . Under Q the δ i ’s are independent Ber(1/2) random variables, so that QA δ = ¹ ₄ A. For each θ, Jensen’s inequality and Fubini give

P exp θX ⁰ AX = P ^x exp(4θQ ^δ X ⁰ A _δ X) ≤ Q ^δ P ^x exp(4θX ⁰ A _δ X).

Write Q δ for X ⁰ A δ X. It therefore suffices to show that

P exp(4θQ δ ) ≤ exp( ¹ ₂ vθ ² ) for each δ and all 0 ≤ θ ≤ ρ.

From this point onwards the δ is held fixed. To simplify notation I drop

most of the subscript δ’s, writing X _D instead of X _D _δ , and so on. Of course

A δ cannot drop its subscript.

(23)

Under P D , with X N held fixed, the quadratic Q δ := X ⁰ A _δ X = X _D ⁰ EX N

is a linear combination of the independent subgaussians X _D . Thus P D e ^4θ ^Q ^δ ≤ exp

1 2 (4θ) ² kEX _N k ² ₂

= G D exp (4θ Q δ ) . Similarly, P N exp (4θQ δ ) ≤ G N exp (θQ δ ). Thus

P exp(4θQ δ ) = P D P N exp(4θQ δ )

≤ P D G N exp(4θ Q δ ) = G N P D exp(4θ Q δ )

≤ G N G D exp(4θ Q δ ) = G N e ^8θ ² ^X ^N ⁰ ^E ⁰ ^EX ^N .

The problem is now reduced to its gaussian analog, for which rotational symmetry of G N allows further simplification by means of a diagonalization of a positive definite matrix, E ⁰ E = L ⁰ ΛL, where L is orthogonal and Λ = diag(λ ₁ , . . . , λ _n ), with λ ₁ ≥ λ ₂ ≥ . . . λ _n ≥ 0 are the eigenvalues of E ⁰ E.

G N e ^8θ ² ^X ^N ⁰ ^E ⁰ ^EX ^N = G N exp 8θ ² X _N ⁰ ΛX N = Y

i∈N G i exp(8θ ² λ i x ² _i )

≤ exp X

i∈N 16θ ² λ _i

provided 8θ ² max

i∈N λ _i ≤ 1/4.

\E@ G.bnd

\E@ G.bnd <34>

The last inequality uses the fact that, if Z ∼ N (0, 1), Pe ^sZ

2 = (1 − 2s) ^−1/2 = exp − ¹ ₂ log(1 − 2s) ≤ exp(2s) if 0 ≤ s ≤ ¹ ₄ . Now we need some matrix facts to bring everything back to an expression involving A. For notational simplicity again assume that the rows D all precede the rows N , so that

A = B ₁ E B ₂ B ₃

and A ⁰ A = B ₁ ⁰ B ₁ + B ₂ ⁰ B ₂ B ₁ ⁰ E + B ₂ ⁰ B ₃ E ⁰ B ₁ + B ₃ ⁰ B ₂ E ⁰ E + B ₃ ⁰ B ₃

. It follows that

kAk ² _HS = trace(A ⁰ A) ≥ trace(E ⁰ E) = X

i∈N λ _i and

kAk ₂ = sup

kvk ₂ ≤1

kAvk ₂ ≥ sup

kuk ₂ ≤1

kEuk ₂ = λ ₁ . Combine the various inequalities to conclude that P exp θX ⁰ AX ≤ exp

16θ ² kAk ² _HS

provide 0 ≤ θ ≤ √

32 kAk ₂ −1

. An appeal to inequality < 32 > completes the proof.

(24)

§2.9 Problems 23

2.9 Problems

Basic::S:Problems

[1] Define a = sup{θ < 0 : M (θ) = +∞} and b = inf{θ > 0 : M (θ) = +∞}

Basic::P:convex.interval

where M (θ) = Pe ^θX . For simplicity, suppose both a and b are finite and a < b.

(i) Show that the restriction of M to the interval [a, b] is continuous. (For example, if M (b) = +∞ you should show that M (θ) → +∞ as θ increases to b.)

(ii) Suppose θ ∈ (a, b). Choose δ > 0 so that [θ − 2δ, θ + 2δ ₀ ] ⊂ (a, b). For

|h| < δ, establish the domination bound

|e ^(θ+h)X − e ^θX |/h ≤

Z 1 0

Xe ^(θ+sh)X ds

≤ max

e ^(θ+2δ)X , e ^θX , e ^(θ−2sδ)X

/δ.

Deduce via a Dominated Convergence argument that M ⁰ (θ) = P(Xe ^θX ).

(iii) Argue similarly to justify another differentiation inside the expectation, lead- ing to the expression for M ⁰⁰ (θ).

[2] Suppose h(x) = e ^g(x) , where g is a twice differentiable real-valued function

Basic::P:e.to.g

on R ⁺ .

(i) Show that h is convex on any interval J for which (g ⁰ (x)) ² + g ⁰⁰ (x) ≥ 0 for x ∈ int(J ).

[3] For a fixed α ≥ 1 define Ψ _α (x) = e ^x ^α − 1 for x ≥ 0.

Basic::P:Psia.ge1

(i) Show that Ψ _α (x) is a Young function..

(ii) Show that

Ψ _α (x)Ψ _α (y) ≤ exp(x ^α + y ^α ) − 1 ≤ Ψ _α (x + y) for all x, y ≥ 0

≤ Ψ _α (2 ^1/a xy) for x ∧ y ≥ 1.

(iii) Deduce from the inequality Ψ _α (x)Ψ _α (y) ≤ Ψ _α (x + y) that Ψ ⁻¹ _α (uv) ≤ Ψ ⁻¹ _α (u) + Ψ ⁻¹ _α (v) for all u, v ≥ 0.

[4] Suppose f is a convex, nonnegative, increasing function on an interval [a, ∞),

Basic::P:extend.to.Young

with a > 0. Suppose also that there exists an x 0 ∈ [a, ∞) for which

(25)

f (x 0 )/x 0 = τ := inf x≥a f (x)/x. Show that h(x) := τ x{0 ≤ x < x 0 } + f (x){x ≥ x ₀ } is a Young function.

[5] Suppose f : R ⁺ → R ⁺ is strictly increasing with f (0) = 0 and

Basic::P:extend.range

f (u ₁ u ₂ ) ≤ c ₀ (f (u ₁ ) + f (u ₂ )) for min(u ₁ , u ₂ ) ≥ v ₀ ,

for some constants c ₀ and v ₀ > 0. For each v ₁ in (0, v ₀ ) prove existence of a larger constant c ₁ for which

f (u ₁ u ₂ ) ≤ c ₁ (f (u ₁ ) + f (u ₂ )) for min(u ₁ , u ₂ ) ≥ v ₁ . Hint: For u ^∗ _i = max(u i , v 0 ) show that f (u ^∗ _i ) ≤ f (u i )f (v 0 )/f (v 1 ).

[6] For a fixed α ∈ (0, 1) define f _α (x) = e ^x ^α for x ≥ 0.

0 2 6 8

051015

zα xα

fα

α=1/2

Basic::P:Psia.lt1

(i) Show that x 7→ f α (x) is convex for x ≥ z α := (α ⁻¹ − 1) ^1/α and concave for 0 ≤ x ≤ z _α .

(ii) Define x α = (1/α) ^1/α and τ α = f α (x α )/x α = (αe) ^1/α . Show that Ψ α (x) := τ x{x < x α } + f _α (x){x ≥ x α }

is a Young function.

(iii) Show that there exists a constant K _α for which

Ψ _α (x) ≤ exp(x ^α ) ≤ K _α + Ψ _α (x) for all x ≥ 0.

(iv) Show that Ψ _α (x)Ψ _α (y) ≤ Ψ _α (2 ^1/α xy) for x ∧ y ≥ x _α . (v) Deduce that there exists a constant C _α for which

C α Ψ ⁻¹ _α (u) + Ψ ⁻¹ _α (v) ≥ Ψ ⁻¹ _α (uv) when u ∧ v ≥ 1.

[7] Suppose Ψ is a Young function with right-hand derivative τ > 0 at 0. Define

Basic::P:slope.zero

Φ(x) = Ψ(x) − τ x for x ≥ 0. If Φ is not identically zero then it is a Young function. For that case show that

kf k _Ψ ≥ kf k _Φ ≥ kf k _Ψ /K where K = 1 + τ Φ ⁻¹ (1).

for each measurable real function f .

(26)

§2.9 Problems 25

[8] For each k in N prove that e ^k ≥ k ^k /k! ≥ 1. Compare with the sharper

Basic::P:Stirling

estimate via Stirling’s formula (Feller, 1968, Section 2.9), k ^k e ^−k

k! = e ^−r ^k / √

2πk where 1

12k > r _k > 1 12k + 1 .

[9] Suppose n ₀ = 1 ≤ n ₁ < n ₂ < . . . is an increasing sequence of positive

Basic::P:interp

integers with sup _k∈N n k /n k−1 = B < ∞ and f : [1, ∞) → R ⁺ is an increasing function with sup _t≥1 f (Bt)/f (t) = A < ∞. For each random variable X show that sup _r≥1 kXk _r /f (r) ≤ A sup _k∈N kXk _n

k /f (n _k ).

[10] With Ψ _α as in Problem [2], prove the existence of positive constants c _α an

Basic::P:moment.Psia

C α for which c _α kXk _Ψ

α ≤ sup

r≥1

kXk _r

r ^1/α ≤ C _α kXk _Ψ

α

for every random variable X.

[11] Theorem < 16 > suggests that kXk ₂ might be comparable to β(X) if X has

Basic::P:subg.var

a subgaussian distribution. It is easy to deduce from inequality < 10 > that kXk ₂ ≤ 2β(X). Show that there is no companion inequality in the other direction by considering the bounded, symmetric random variable X for which P{X = ±M } = δ and P{X = 0} = 1 − 2δ. If 2δ = M show that PX ² = 1 but

log 1 + θ ² /2! + (θ ⁴ M ² )/4! ≤ log Pe ^Xθ ≤ τ ² θ ² /2 for all real θ would force τ ² ≥ 2 log 3/2 + M ² /24.

[12] Show that a random variable X that has either of the following two prop-

Basic::P:subg.suff

erties is subgaussian.

(i) For some positive constants c 1 , c 2 , and c 3 ,

P{|X| ≥ x} ≤ c 1 exp(−c ₂ x ² ) for all |x| ≥ c ₃ . (ii) For some positive constants c 1 , c 2 , and c 3 ,

Pe ^θX ≤ c ₁ exp(c ₂ θ ² ) for all |θ| ≥ c ₃

[13] Suppose {X _n } is a sequence of random variables for which

Basic::P:subg.limit1

P exp(θX n ) ≤ exp(τ ² θ ² /2) for all n and all real θ,

(27)

where τ is a fixed positive constant. If X n → X almost surely, show that Pe ^θX ≤ exp(τ ² θ ² /2) for all real θ.

[14] Suppose f is a sufficiently smooth real-valued function defined at least in a

Basic::P:lagrange

neighborhood of the origin of the real line. Define G(x) = f (x) − f (0) − xf ⁰ (0)

x ² /2 if x 6= 0, and G(0) = f ⁰⁰ (0).

Prove the following facts.

(i) f (x) − f (0) − xf ⁰ (0) = x ² RR f ⁰⁰ (xs){0 < s < t < 1} ds dt (ii) If f is convex then G is nonnegative.

(iii) If f ⁰⁰ is an increasing function then so is G(x).

(iv) If f ⁰⁰ is a convex function then so is G. Moreover G(x) ≥ f ⁰⁰ (x/3). Hint:

Jensen and 2 RR s{0 < s < t < 1} ds dt = 1/3.

(v) For the special case where f (x) = (1 + x) log(1 + x) − x show that G = ψ Benn

from < 18 > and G(x) ≥ (1 + x/3) ⁻¹ .

[15] Recall the notation F (x) = e ^x − 1 − x for x ∈ R and

Basic::P:Poisson

ψ Benn (y) = F ^∗ (y)/(y ² /2) where F ^∗ (y) = (1 + y) log(1 + y) − y for y ≥ −1.

Suppose X has a Poisson(λ) distribution (i) Show that L(θ) = log P ^θX = λθ + λF (θ).

(ii) For x ≥ 0, deduce that

P{X ≥ λ + x} ≤ exp (−λF ^∗ (x/λ)) = exp

− x ²

2λ ψ Benn (x/λ)

. (iii) For 0 ≤ x ≤ λ show that

P{X ≤ λ − x} ≤ inf

θ≥0 exp (λF (−θ) − λx)

= exp

− x ²

2λ ψ Benn (−x/λ)

≤ exp

− x ² 2λ

.

[16] To what extent can the bounds from Problem [15] be recovered as limit-

Basic::P:BinPoisson

ing cases of the bounds derived via the Bennett inequality applied to the Bin(n, λ/n) distribution?

[17] Suppose ∞ > γ = kXk _Ψ

2 . Use Theorem < 16 > to deduce that PX ^2k ≤

Basic::P:diag.HW

√ 2kγ 2k

≤ (2eγ ² ) ^k k!. Use Bernstein’s inequality from Theorem < 23 > to deduce a tail bound analogous to Theorem < 33 > for P

i α _i (X _i ² − PX _i ² ) with

independent subgaussians X 1 , . . . , X n and constants α i .

(28)

§2.10 Notes 27

2.10 Notes

Basic::S:Notes

Anyone who has read Massart’s Saint-Flour lectures (Massart, 2003) or Lu- gosi’s lecture notes (Lugosi, 2003) will realize how much I have been influ- enced by those authors. Many of the ideas in those lectures reappear in the book by Boucheron, Lugosi, and Massart (2013).

Many authors seem to credit Chernoff (1952) with the moment gener- ating trick in < 2 >, even though the idea is obviously much older: com- pare with the treatment of the Bernstein inequality by Uspensky (1937, pages 204–206)). Chernoff himself gave credit to Cram´ er for “extremely more powerful results” obtained under stronger assumptions.

The characterizations for subgaussians in Theorem < 16 > are essentially due to Kahane (1968, Exercise 6.10).

Hoeffding (1963) established many tail bounds for sums of independent random variables. He also commented that the results extend to martin- gales.

The Bennett inequality for independent summands was proved by Ben- nett (1962) by a more complicated method.

Section 2.6 on the Bernstein inequality is based on the exposition by Uspensky (1937, pages 204–206), with refinements from Bennett (1962) and Massart (2003, Section 1.2.3). See also Boucheron et al. (2013, Sections 2.4, 2.8) for a slightly more detailed exposition, including the idea of using the positive part to get a one-sided bound, which they credited to Emmanuel Rio. Van der Vaart and Wellner (1996, page 103) noted the connection between the moment assumption < 24 > and the behavior of PΦ 1 (θ|X _i |).

Dudley (1999, Appendix H) used the name Young-Orlicz modulus for what I am calling a Young function. to such a function as an Orlicz modulus.

Every Young function can be written as Ψ(t) = R t

0 ψ(x) dx, where ψ is the right-hand derivative of Ψ. The function ψ is increasing and right- continuous. For their N -functions, Krasnosel’ski˘ı and Ruticki˘ı (1961, Sec- tion 1.3) added the requirements that ψ(0) = 0 < ψ(x) for x > 0 and ψ(x) →

∞ as x → ∞, which ensures that Ψ is strictly increasing with Ψ(t)/t → ∞ as t → ∞. Dudley called such a function an Orlicz modulus.

When ψ(0) = 0 there is a measure µ on B(R ⁺ ) for which ψ(b) = µ(0, b]

for all intervals (0, b]. In that case PΨ(X) = µ ^s P(X − s) ⁺ , a representation

closely related to Lemma 1 of Panchenko (2003).

2.5 Bennett’s inequality . . . . 11

2.1 Tail bounds and concentration . . . . 1

2.2 Moment generating functions . . . . 3

2.3 Orlicz norms . . . . 4

2.4 Subgaussian tails . . . . 6

2.5 Bennett’s inequality . . . . 11

2.6 Bernstein’s inequality . . . . 14

2.7 Subexponential distributions? . . . . 17

2.8 The Hanson-Wright inequality for subgaussians . . . . 20

2.9 Problems . . . . 23

2.10 Notes . . . . 27

Printed: 27 November 2015

version: 20nov15 Mini-empirical

Chapter 2

A few good inequalities

Section 2.1 introduces exponential tail bounds.

Section 2.2 introduces the method for bounding tail probabilities using mo- ment generating functions.

Section 2.3 describes analogs of the usual L p norms, which have proved useful in empirical process theory.

Section 2.4 discusses subgaussianity, a most useful extension of the gaus- sianity property.

Section 2.5 discusses the Bennett inequality for sums of independent ran- dom variables that are bounded above by a constant.

Section 2.6 shows how the Bennett inequality implies one form of the Bern- stein inequality, then discusses extensions to unbounded summands.

Section 2.7 describes two ways to capture the idea of tails that decrease like those of the exponential distribution.

Section 2.8 illustrates the ideas from the previous Section by deriving a subgaussian/subexponential tail bound for quadratic forms in independent subgaussian random variables.

2.1 Tail bounds and concentration

Basic::S:intro

version: 20nov15 Mini-empirical

We know a lot about the normal distribution. For example, if W has a N (µ, σ 2 ) distribution then (Feller, 1968, Section 7.1)

 1 x − 1

x 3

 e −x 2 /2

√

2π ≤ P{W ≥ µ + σx} ≤ 1 x

e −x 2 /2

√

2π for all x > 0.

P{|W − µ| ≥ σx} ≤ 2e −x

2 /2 for all x ≥ 0, a concentration inequality.

Similar-looking tail bounds exist for random variables that are not ex- actly normally distributed. Particularly useful are the so-called exponential tail bounds, which take the form

\E@ exp.bnd

\E@ exp.bnd <1> P{X ≥ x + PX} ≤ e −g(x) for x ≥ 0,

For example, suppose S = X 1 + · · · + X n , a sum of independent random variables with PX i = 0 and each X i bounded above by some constant b.

The Bernstein inequality (Section 2.6) tells us that

P{S ≥ x

√

V } ≤ exp −x 2 /2 1 + 1 3 bx/ √

V

!

for x ≥ 0,

where V is any upper bound for var(S). If each X i is also bounded below by −b then a similar tail bound exists for −S, which leads to a concentration inequality, an upper bound for P{|S| ≥ x √

V }.

For x much smaller than √

V /b the exponent in the Bernstein inequality

behaves like −x 2 /2; for much larger x the exponent behaves like a negative

multiple of x. In a sense that is made more precise in Sections 2.6 and 2.7, the

inequality gives a subgaussian bound for x not too big and a subexponential

bound for very large x.

§2.2 Moment generating functions 3

2.2 Moment generating functions

Basic::S:mgf

The most common way to get a bound like < 1 > uses the moment gener- ating function, M (θ) := Pe θX . From the fact that exp is nonnegative and exp(θ(X − w)) ≥ 1 when X ≥ w and θ ≥ 0 we have

\E@ mgf.tail

\E@ mgf.tail <2> P{X ≥ w} ≤ inf

θ≥0 Pe θ(X−w) = inf

θ≥0 e −θw M (θ).

Basic::normal <3> Example. If X has a N (µ, σ 2 ) distribution then M (θ) = exp(θµ + σ 2 θ 2 /2) is finite for all real θ. For x ≥ 0 inequality < 2 > gives

P{X ≥ µ + σx} ≤ inf

θ≥0 exp(−θ(µ + σx) + θµ + θ 2 σ 2 /2)

= exp(−x 2 /2) for all x ≥ 0, the minimum being achieved by θ = x/σ.

Analogous arguments, with X − µ replaced by µ − X, give an analogous bound for the lower tail,

P{X ≤ µ − σx} ≤ exp(−x 2 /2) for all x > 0,

leading to the concentration inequality P{|X − µ| ≥ σx} ≤ 2e −x 2 /2 .

Remark. Of course the algebra would have been a tad simpler if I had worked with the standardized variable (X − µ)/σ. I did things the messier way in order to make a point about the centering in Section 2.4.

Exactly the same arguments work for any random variable whose mo- ment generating is bounded above by exp(θµ + σ 2 θ 2 /2), for all real θ and some finite constants µ and τ . This property essentially defines the sub- gaussian distributions, which are discussed in Section 2.4.

The function L(θ) := log M (θ) is called the cumulant-generating function. The upper bound in < 2 > is often rewritten as

e −L ∗ (θ) where L ∗ (θ) := sup θ≥0 (θw − L(θ)).

Remark. As Boucheron et al. (2013, Sections 2.2) and others have

noted, the function L ∗ is the Fenchel-Lagrange transform of L, which is

also known as the conjugate function (Boyd and Vandenberghe, 2004,

Section 3.3).

Clearly L(0) = 0. The H¨ older inequality shows that L is convex: if θ is a convex combination α 1 θ 1 + α 2 θ 2 then

Section 2.3 describes analogs of the usual L ^p norms, which have proved useful in empirical process theory.

We know a lot about the normal distribution. For example, if W has a N (µ, σ ² ) distribution then (Feller, 1968, Section 7.1)

1 x − 1

x ³

e ^−x ² ^/2

e ^−x ² ^/2

P{|W − µ| ≥ σx} ≤ 2e ^−x

\E@ exp.bnd <1> P{X ≥ x + PX} ≤ e ^−g(x) for x ≥ 0,

V } ≤ exp −x ² /2 1 + ¹ ₃ bx/ √

where V is any upper bound for var(S). If each X _i is also bounded below by −b then a similar tail bound exists for −S, which leads to a concentration inequality, an upper bound for P{|S| ≥ x √

behaves like −x ² /2; for much larger x the exponent behaves like a negative

The most common way to get a bound like < 1 > uses the moment gener- ating function, M (θ) := Pe ^θX . From the fact that exp is nonnegative and exp(θ(X − w)) ≥ 1 when X ≥ w and θ ≥ 0 we have

θ≥0 Pe ^θ(X−w) = inf

θ≥0 e ^−θw M (θ).

Basic::normal <3> Example. If X has a N (µ, σ ² ) distribution then M (θ) = exp(θµ + σ ² θ ² /2) is finite for all real θ. For x ≥ 0 inequality < 2 > gives

θ≥0 exp(−θ(µ + σx) + θµ + θ ² σ ² /2)

= exp(−x ² /2) for all x ≥ 0, the minimum being achieved by θ = x/σ.

P{X ≤ µ − σx} ≤ exp(−x ² /2) for all x > 0,

leading to the concentration inequality P{|X − µ| ≥ σx} ≤ 2e ^−x ² ^/2 .

Exactly the same arguments work for any random variable whose mo- ment generating is bounded above by exp(θµ + σ ² θ ² /2), for all real θ and some finite constants µ and τ . This property essentially defines the sub- gaussian distributions, which are discussed in Section 2.4.

e ^−L ^∗ ^(θ) where L ^∗ (θ) := sup _θ≥0 (θw − L(θ)).

noted, the function L ^∗ is the Fenchel-Lagrange transform of L, which is

Clearly L(0) = 0. The H¨ older inequality shows that L is convex: if θ is a convex combination α ₁ θ ₁ + α ₂ θ ₂ then

e ^L(θ) = M (α 1 θ 1 + α 2 θ 2 )

e ^α ¹ ^θ ¹ ^X e ^α ² ^θ ² ^X

≤ Pe ^θ ¹ ^X

α 1 Pe ^θ ² ^X

α 2

= e ^α ¹ ^L(θ ¹ ^)+α ² ^L(θ ² ⁾ .

(1 − θ) ⁻¹ for θ < 1

L ⁰ (θ) = M ⁰ (θ)

M (θ) = PX e ^θX M (θ) L ⁰⁰ (θ) = M ⁰⁰ (θ)

M (θ) − M ⁰ (θ) M (θ)

2

= PX ² e ^θX M (θ) −

PX e ^θX M (θ)

2

If we define a probability measure P θ by its density, dP θ /dP = e ^θX /M (θ), then the derivatives of L can be re-expressed using moments of X under P θ :

L ⁰ (θ) = P θ X and L ⁰⁰ (θ) = P θ (X − P θ X) ² = var _θ (X) ≥ 0.

In particular, L ⁰ (0) = PX, so that Taylor expansion gives

\E@ L.Taylor <4> L(θ) = θPX + ¹ ₂ θ ² var _θ∗ (X) with θ ^∗ lying between 0 and θ.

Bounds on the P θ variance of X lead to bounds on L and tail bounds for X − PX. The Hoeffding inequality (Example < ¹⁴ >) provides a good illustration of this line of attack.

There is a fruitful way to generalize the the concept of an L ^p norm by means

of a Young function, that is, a convex, increasing function Ψ : R ⁺ → R ⁺

Ψ(x ₀ ) = Ψ(αx + (1 − α)0) ≤ αΨ(x)

Ψ _α (x) ≤ exp(x ^α ) ≤ K _α + Ψ _α (x) for all x ≥ 0,

where K _α is a positive constant. Some linear interpolation near 0 is needed to overcome the annoyance that x 7→ exp(x ^α ) is concave on the inter- val [0, z α ], for z α := ((1 − α)/α) ^1/α .

Basic::Orlicz.norm <5> Definition. Let ( X, A, µ) be a measure space and f be an A-measuable real function on X. For each Young function Ψ the corresponding Orlicz norm kf k _Ψ (or Ψ-norm) is defined by

kf k _Ψ = inf{c > 0 : µΨ(|f (x)|/c) ≤ 1},

with kf k _Ψ = ∞ if the infimum runs over an empty set.

Remark. It is not hard to show that kf k _Ψ < ∞ if and only if µΨ(|f |/C ₀ ) < ∞ for at least one finite constant C ₀ . Moreover, if 0 < c = kf k _Ψ < ∞ then µΨ(|f |/c) ≤ 1; the infimum is achieved.

When restricted to the set L ^Ψ ( X, A, µ) of all measurable real functions on X with finite Ψ-norm, k·k _Ψ is a seminorm. It fails to be a norm only because kf k _Ψ = 0 if and only if µ{x : f (x) 6= 0} = 0.

The corresponding set of µ-equivalence classes is a Banach space. It is traditional and benign to call k·k _Ψ a norm even though it is only a seminorm on L ^Ψ ( X, A, µ).

Finiteness of the Ψ-norm for a random variable Z immediately implies a tail bound: if 0 < σ = kZk _Ψ < ∞ then

1, 1

For example, for each α > 0 there is a constant C α for which P{|Z| ≥ x kZk _Ψ _α } ≤ C _α e ^−x ^α for x ≥ 0.

The constant can be chosen so that C α e ^−x ^α ≥ 1 for all x so small that Ψ α (x)e ^−x ^α is not close to 1.