2.1 Tail bounds and concentration . . . . 1
2.2 Moment generating functions . . . . 3
2.3 Orlicz norms . . . . 4
2.4 Subgaussian tails . . . . 6
2.5 Bennett’s inequality . . . . 11
2.6 Bernstein’s inequality . . . . 14
2.7 Subexponential distributions? . . . . 17
2.8 The Hanson-Wright inequality for subgaussians . . . . 20
2.9 Problems . . . . 23
2.10 Notes . . . . 27
Printed: 27 November 2015
version: 20nov15 Mini-empirical
Chapter 2
A few good inequalities
Section 2.1 introduces exponential tail bounds.
Section 2.2 introduces the method for bounding tail probabilities using mo- ment generating functions.
Section 2.3 describes analogs of the usual L p norms, which have proved useful in empirical process theory.
Section 2.4 discusses subgaussianity, a most useful extension of the gaus- sianity property.
Section 2.5 discusses the Bennett inequality for sums of independent ran- dom variables that are bounded above by a constant.
Section 2.6 shows how the Bennett inequality implies one form of the Bern- stein inequality, then discusses extensions to unbounded summands.
Section 2.7 describes two ways to capture the idea of tails that decrease like those of the exponential distribution.
Section 2.8 illustrates the ideas from the previous Section by deriving a subgaussian/subexponential tail bound for quadratic forms in independent subgaussian random variables.
2.1 Tail bounds and concentration
Basic::S:intro
Sums of independent random variables are often approximately normal dis- tributed. In particular, their tail probabilities often decrease in ways similar to the tails of a normal distribution. Several classical inequalities make this vague idea more precise. Such inequalities provided the foundation on which empirical process theory was built.
version: 20nov15 Mini-empirical
We know a lot about the normal distribution. For example, if W has a N (µ, σ 2 ) distribution then (Feller, 1968, Section 7.1)
1 x − 1
x 3
e −x 2 /2
√
2π ≤ P{W ≥ µ + σx} ≤ 1 x
e −x 2 /2
√
2π for all x > 0.
Clearly the inequalities are useful only for larger x: as x decreases to zero the lower bound goes to −∞ and the upper bound goes to +∞. For many purposes the exp(−x 2 /2) factor matters the most. Indeed, the simpler tail bound P{W ≥ µ + σx} ≤ exp(−x 2 /2) for x ≥ 0 often suffices for asymptotic arguments. Sometimes we need a better inequality showing that the dis- tribution concentrates most of its probability mass in a small region, such as
P{|W − µ| ≥ σx} ≤ 2e −x
2 /2 for all x ≥ 0, a concentration inequality.
Similar-looking tail bounds exist for random variables that are not ex- actly normally distributed. Particularly useful are the so-called exponential tail bounds, which take the form
\E@ exp.bnd
\E@ exp.bnd <1> P{X ≥ x + PX} ≤ e −g(x) for x ≥ 0,
with g a function that typically increases like a power of x. Sometimes g behaves like a quadratic for some range of x and like a linear function further out in the tails; sometimes some logarithmic terms creep in; and sometimes it depends on var(X) or on some more exotic measure of size, such as an Orlicz norm (see Section 2.3).
For example, suppose S = X 1 + · · · + X n , a sum of independent random variables with PX i = 0 and each X i bounded above by some constant b.
The Bernstein inequality (Section 2.6) tells us that
P{S ≥ x
√
V } ≤ exp −x 2 /2 1 + 1 3 bx/ √
V
!
for x ≥ 0,
where V is any upper bound for var(S). If each X i is also bounded below by −b then a similar tail bound exists for −S, which leads to a concentration inequality, an upper bound for P{|S| ≥ x √
V }.
For x much smaller than √
V /b the exponent in the Bernstein inequality
behaves like −x 2 /2; for much larger x the exponent behaves like a negative
multiple of x. In a sense that is made more precise in Sections 2.6 and 2.7, the
inequality gives a subgaussian bound for x not too big and a subexponential
bound for very large x.
§2.2 Moment generating functions 3
2.2 Moment generating functions
Basic::S:mgf
The most common way to get a bound like < 1 > uses the moment gener- ating function, M (θ) := Pe θX . From the fact that exp is nonnegative and exp(θ(X − w)) ≥ 1 when X ≥ w and θ ≥ 0 we have
\E@ mgf.tail
\E@ mgf.tail <2> P{X ≥ w} ≤ inf
θ≥0 Pe θ(X−w) = inf
θ≥0 e −θw M (θ).
Basic::normal <3> Example. If X has a N (µ, σ 2 ) distribution then M (θ) = exp(θµ + σ 2 θ 2 /2) is finite for all real θ. For x ≥ 0 inequality < 2 > gives
P{X ≥ µ + σx} ≤ inf
θ≥0 exp(−θ(µ + σx) + θµ + θ 2 σ 2 /2)
= exp(−x 2 /2) for all x ≥ 0, the minimum being achieved by θ = x/σ.
Analogous arguments, with X − µ replaced by µ − X, give an analogous bound for the lower tail,
P{X ≤ µ − σx} ≤ exp(−x 2 /2) for all x > 0,
leading to the concentration inequality P{|X − µ| ≥ σx} ≤ 2e −x 2 /2 .
Remark. Of course the algebra would have been a tad simpler if I had worked with the standardized variable (X − µ)/σ. I did things the messier way in order to make a point about the centering in Section 2.4.
Exactly the same arguments work for any random variable whose mo- ment generating is bounded above by exp(θµ + σ 2 θ 2 /2), for all real θ and some finite constants µ and τ . This property essentially defines the sub- gaussian distributions, which are discussed in Section 2.4.
The function L(θ) := log M (θ) is called the cumulant-generating function. The upper bound in < 2 > is often rewritten as
e −L ∗ (θ) where L ∗ (θ) := sup θ≥0 (θw − L(θ)).
Remark. As Boucheron et al. (2013, Sections 2.2) and others have
noted, the function L ∗ is the Fenchel-Lagrange transform of L, which is
also known as the conjugate function (Boyd and Vandenberghe, 2004,
Section 3.3).
Clearly L(0) = 0. The H¨ older inequality shows that L is convex: if θ is a convex combination α 1 θ 1 + α 2 θ 2 then
e L(θ) = M (α 1 θ 1 + α 2 θ 2 )
= P
e α 1 θ 1 X e α 2 θ 2 X
≤ Pe θ 1 X
α 1 Pe θ 2 X
α 2
= e α 1 L(θ 1 )+α 2 L(θ 2 ) .
More precisely (Problem [1]), the set J = {θ : M (θ) < ∞} is an interval, possibly degenerate or unbounded, containing the origin. For example, for the Cauchy, M (θ) = ∞ for all real θ 6= 0; for the standard exponential distribution (density e −x {x > 0} with respect to Lebesgue measure),
M (θ) =
(1 − θ) −1 for θ < 1
∞ otherwise .
Problem [1] also justifies the interchange of differentiation and expecta- tion on the interior of J , which leads to
L 0 (θ) = M 0 (θ)
M (θ) = PX e θX M (θ) L 00 (θ) = M 00 (θ)
M (θ) − M 0 (θ) M (θ)
2
= PX 2 e θX M (θ) −
PX e θX M (θ)
2
.
If we define a probability measure P θ by its density, dP θ /dP = e θX /M (θ), then the derivatives of L can be re-expressed using moments of X under P θ :
L 0 (θ) = P θ X and L 00 (θ) = P θ (X − P θ X) 2 = var θ (X) ≥ 0.
In particular, L 0 (0) = PX, so that Taylor expansion gives
\E@ L.Taylor
\E@ L.Taylor <4> L(θ) = θPX + 1 2 θ 2 var θ∗ (X) with θ ∗ lying between 0 and θ.
Bounds on the P θ variance of X lead to bounds on L and tail bounds for X − PX. The Hoeffding inequality (Example < 14 >) provides a good illustration of this line of attack.
2.3 Orlicz norms
Basic::S:orlicz
There is a fruitful way to generalize the the concept of an L p norm by means
of a Young function, that is, a convex, increasing function Ψ : R + → R +
with Ψ(0) = 0 and Ψ(x) → ∞ as x → ∞. The concept applies with any
measure µ but I need it mostly for probability measures.
§2.3 Orlicz norms 5
Remark. The assumption that Ψ(x) → ∞ as x → ∞ is needed only to rule out the trivial case where Ψ ≡ 0. Indeed, if Ψ(x 0 ) > 0 for some x 0
then, for x = x 0 /α with α < 1,
Ψ(x 0 ) = Ψ(αx + (1 − α)0) ≤ αΨ(x)
That is, Ψ(x)/x ≥ Ψ(x 0 )/x 0 for all x ≥ x 0 . If Ψ is not identically zero then it must increase at least linearly with increasing x.
At one time Orlicz norms became very popular in the empirical process literature, in part (I suspect) because chaining arguments (see Chapter 4) are cleaner with norms than with tail probabilities.
The most important Young functions for my purposes are the power functions, Ψ(x) = x p for a fixed p ≥ 1, and the exponential power func- tions Ψ α (x) := exp(x α ) − 1 for α ≥ 1. The function Ψ 2 is particularly useful (see Section 2.4) for working with distributions whose tails decrease like a Gaussian. It is also possible (Problem [6]) to define a Young function Ψ α , for 0 < α < 1, for which
Ψ α (x) ≤ exp(x α ) ≤ K α + Ψ α (x) for all x ≥ 0,
where K α is a positive constant. Some linear interpolation near 0 is needed to overcome the annoyance that x 7→ exp(x α ) is concave on the inter- val [0, z α ], for z α := ((1 − α)/α) 1/α .
Basic::Orlicz.norm <5> Definition. Let ( X, A, µ) be a measure space and f be an A-measuable real function on X. For each Young function Ψ the corresponding Orlicz norm kf k Ψ (or Ψ-norm) is defined by
kf k Ψ = inf{c > 0 : µΨ(|f (x)|/c) ≤ 1},
with kf k Ψ = ∞ if the infimum runs over an empty set.
Remark. It is not hard to show that kf k Ψ < ∞ if and only if µΨ(|f |/C 0 ) < ∞ for at least one finite constant C 0 . Moreover, if 0 < c = kf k Ψ < ∞ then µΨ(|f |/c) ≤ 1; the infimum is achieved.
When restricted to the set L Ψ ( X, A, µ) of all measurable real functions on X with finite Ψ-norm, k·k Ψ is a seminorm. It fails to be a norm only because kf k Ψ = 0 if and only if µ{x : f (x) 6= 0} = 0.
The corresponding set of µ-equivalence classes is a Banach space. It is traditional and benign to call k·k Ψ a norm even though it is only a seminorm on L Ψ ( X, A, µ).
The assumption Ψ(0) = 0 can be discarded if µ is a finite measure.
For each finite constant A > Ψ(0), the quantity inf{c > 0 : µΨ(|f |/c) ≤
A} also defines a seminorm.
Finiteness of the Ψ-norm for a random variable Z immediately implies a tail bound: if 0 < σ = kZk Ψ < ∞ then
\E@ orlicz.tail
\E@ orlicz.tail <6> P{|Z| ≥ xσ} ≤ max
1, P Ψ (|Z|/σ) Ψ(x)
= max
1, 1
Ψ(x)
for x ≥ 0.
For example, for each α > 0 there is a constant C α for which P{|Z| ≥ x kZk Ψ α } ≤ C α e −x α for x ≥ 0.
The constant can be chosen so that C α e −x α ≥ 1 for all x so small that Ψ α (x)e −x α is not close to 1.
When seeking to bound an Orlicz norm I often find it easier to first obtain a constant C 0 for which PΨ(|X|/C 0 ) ≤ K 0 , where K 0 is a constant strictly larger than 1. As the next Lemma shows, a simple convexity argument turns such an inequality into a bound on kXk Ψ .
Basic::Psi.bound <7> Lemma. Suppose Ψ is a Young function. If PΨ(|X|/C 0 ) ≤ K 0 for some finite constants C 0 and K 0 > 1 then kXk Ψ ≤ C 0 K 0 .
Proof For each θ in [0, 1] convexity of Ψ gives PΨ θ|X|
C 0
≤ θPΨ |X|
C 0
+ (1 − θ)Ψ(0) ≤ θK 0 . The choice θ = 1/K 0 makes the last bound equal to 1.
Beware. The Lemma does not give a foolproof way of determining rea- sonable bounds for the Ψ-norm. For example, if X has a N (0, 1) distribution then simple integration shows that
PΨ 2 (|X|/c) = g(c) := c/ p
c 2 − 2 − 1 for c > √ 2.
Thus kXk Ψ
2 = p8/3 but cg(c) → ∞ as c decreases to √ 2.
2.4 Subgaussian tails
Basic::S:subgaussian
As noted in Section 2.2, if X is a random variable for which there exist constants µ and σ 2 for which Pe θX ≤ exp(µθ + 1 2 σ 2 θ 2 ) for all real θ then
\E@ subg.oneside
\E@ subg.oneside <8> P{X ≥ µ + σx} ≤ e −x
2 /2 for all x ≥ 0.
§2.4 Subgaussian tails 7
The approximations for θ near zero,
Pe θX = 1 + θPX + 1 2 θ 2 PX 2 + o(θ 2 ), exp(µθ + 1 2 σ 2 θ 2 ) = 1 + µθ + 1 2 (σ 2 + µ 2 )θ 2 + o(θ 2 ),
would force PX = µ and PX 2 ≤ σ 2 + µ 2 , that is, var(X) ≤ σ 2 . As Prob- lem [11] shows, σ 2 might need to be much larger than the variance. To avoid any hint of the convention that σ 2 denotes a variance, I feel it is safer to replace σ by another symbol.
Basic::subgaussian <9> Definition. Say that a random variable X with PX = µ has a subgaussian distribution with scale factor τ < ∞ if Pe θX ≤ exp(µθ + τ 2 θ 2 /2) for all real θ. Write τ (X) for the smallest τ for which such a bound holds.
Remark. Some authors call such an X a τ -subgaussian random variable. Some authors require µ = 0 for subgaussianity; others use the qualifier centered to refer to the case where µ is zero.
For a subgaussian X, the two-sided analog of < 8 > gives
\E@ beta.def
\E@ beta.def <10> P{|X − µ| ≥ x} ≤ 2 exp
− x 2 2β 2
for all x ≥ 0,
for 0 < β ≤ τ (X). In fact (Theorem < 16 >) existence of such a tail bound is equivalent to subgaussianity.
Basic::gaussian <11> Example. Suppose Y = (Y 1 , . . . , Y n ) has a multivariate normal distribu- tion. Define M := max i≤n |Y i | and σ 2 := max i≤n var(Y i ). Chapter 6 explains why
P{|M − PM | ≥ σx} ≤ 2 exp(−x 2 /2) for all x ≥ 0.
That is, M is subgaussian—a most surprising and wonderful fact.
Many useful results that hold for gaussian variables can be extended to subgaussians. In empirical process theory, conditional subgaussianity plays a key role in symmetrization arguments (see Chapter 8). The first example captures the main idea.
Basic::Rademachers <12> Example. Suppose X = P
i≤n b i s i , where the b i ’s are constants and the s i ’s are independent Rademacher random variables, that is, P{s i = +1} = 1 2 = P{s i = −1}. Then
Pe θb i s i = 1 2
e θb i + e −θb i
=
∞
X
k=0
(θb i ) 2k
(2k)! ≤ exp(θ 2 b 2 i /2)
so that Pe θX ≤ Q
i≤n Pe θb i s i ≤ e θ 2 τ 2 /2 with τ 2 = P
i≤n b 2 i . That is, X has a centered subgaussian distribution with τ (X) 2 ≤ P
i≤n b 2 i .
More generally, if X is a sum of independent subgaussians X 1 , . . . , X n then the equality
Pe θ(X−PX) = Y
i Pe θ(X i −PX i )
shows that X is subgaussian with τ 2 (X) ≤ P
i τ 2 (X i ). For sums of indepen- dent variables we need consider only the subgaussianity of each summand.
Basic::interval.range <13> Example. Suppose X is a random variable with PX = µ and a ≤ X ≤ b, for finite constants a and b. By representation < 4 >, its moment generating function M (θ) satisfies
log M (θ) = θPX + 1 2 θ 2 var θ∗ (X) with θ ∗ lying between 0 and θ.
Write µ ∗ for P θ ∗ X and m for (b − a)/2. Note that |X − m| ≤ (b − a)/2.
Thus
var θ∗ (X) = P θ ∗ |X − µ ∗ | 2 ≤ P θ ∗ |X − m| 2 ≤ (b − a) 2 /4.
and log Pe θ(X−µ) ≤ (b − a) 2 θ 2 /8. The random variable X is subgaussian with τ (X) ≤ (b − a)/2.
Remark. Hoeffding (1963, page 22) derived the same bound on the moment generating function by a direct appeal to convexity of the exponential function:
Pe θX ≤ qe θa + pe θb where 1 − q = p = PX − a b − a
In effect, the bound represents the extremal case where X concentrates on the the two-point set {a, b}. He then calculated a second derivative for L, without interpreting the results as a variance but using an inequality that can be given a variance interpretation.
Basic::Hoeffding.ineq <14> Example. (Hoeffding’s inequality for independent summands) Suppose ran- dom variables X 1 , . . . , X n are independent with PX i = µ i and a i ≤ X i ≤ b i for each i, for constants a i and b i .
By Example < 13 >, each X i is subgaussian with τ (X i ) ≤ (b i −a i )/2. The sum S = P
i≤n X i is also subgaussian, with expected value µ = P
i≤n µ i
§2.4 Subgaussian tails 9
and τ 2 (S) = P
i≤n τ 2 (X i ) = P
i≤n (b i − a i ) 2 /4. The one-sided version of inequality < 10 > gives
\E@ Hoeff.indep
\E@ Hoeff.indep <15>
P nX
i≤n (X i − µ i ) ≥ x o
≤ exp
−2x 2 / X
i≤n (b i − a i ) 2
for each x ≥ 0.
See Chapter 3 for an extension of Hoeffding’s inequality < 15 > to sums of bounded martingale differences.
The next Theorem provides other characterizations of subgaussianity, using the Orlicz norm for the Young function Ψ 2 (x) = exp(x 2 ) − 1 and mo- ment inequalities. The proof of the Theorem provides a preview of a “sym- metrization” argument that plays a major role in Chapter 8. I prefer to write the argument using product measures rather than conditional expectations.
Here is the idea. Suppose X 1 is a random variable with PX 1 = µ, defined on a probability space (Ω 1 , F 1 , P 1 ). Make an “independent copy” (Ω 2 , F 2 , P 2 ) of the probability space then carry X 1 over to a new random variable X 2 on Ω 2 . Under P 1 ⊗ P 2 the random variables X 1 and X 2 are independent and identically distributed.
Remark. You might prefer to start with an X defined on (Ω, F, P) then define random variables on Ω × Ω by X 1 (ω 1 , ω 2 ) = X(ω 1 ) and X 2 (ω 1 , ω 2 ) = X(ω 2 ). Under P ⊗ P the random variables X 1 and X 2 are independent and each has the same distribution as X (under P). I once referred to independent copies during a seminar in a Math department.
One member of the audience protested vehemently that a copy of X is the same as X and hence can only be independent of X in trivial cases.
Note that P 2 X 2 = P 1 X 1 = µ. For each convex function f , P 1 f (X 1 − µ) = P ω 1 1 f [X 1 (ω 1 ) − P ω 2 2 X 2 (ω 2 )]
= P ω 1 1 f [P ω 2 2 (X 1 (ω 1 ) − X 2 (ω 2 ))]
≤ P ω 1 1 P ω 2 2 f [X 1 (ω 1 ) − X 2 (ω 2 )] , the last line coming from Jensen’s inequality for P 2 .
Basic::subgauss.equiv <16> Theorem. For each random variable X with expected value µ, the following assertions are equivalent.
(i) X has a subgaussian disribution (with τ (X) < ∞)
Basic::L.finite (ii) L(X) := sup r≥1 kXk r
√ r < ∞
Basic::Psitwo.finite (iii) kXk Ψ
2 < ∞
Basic::beta.def2 (iv) there exists a positive constant β for which P{|X−µ| ≥ x} ≤ 2 exp
− x 2 2β 2
for all x ≥ 0
More precisely,
Basic::Psi2L.equiv (a) c 0 kXk Ψ
2 ≤ L(X) ≤ kXk Ψ
2 with c 0 = 1/ √ 4e ;
Basic::centered (b) if β(X) denotes the smallest β for which inequality (iv) holds, c 1 kX − µk Ψ
2 ≤ β(X) ≤ τ (X) ≤ 4L(X − µ) with c 1 = 1/ √ 6.
Remark. Finiteness of L(X) is also equivalent to finiteness of L(X − µ) because |µ| ≤ kXk 1 ≤ L(X) and |L(X)−L(X −µ)| ≤ |µ|. Consequently L(X − µ) ≤ 2L(X), but there can be no universal inequality in the other direction: L(X − µ) is unchanged by addition of a constant to X but L(X) can be made arbitrarily large.
The precise values of the constants c i are not important. Most likely they could be improved, but I can see no good reason for trying.
Proof Implicit in the sequences of inequalities is the assertion that finite- ness of one quantity implies finiteness of another quantity.
Proof of (a)
Problem [8] shows that k k ≥ k! ≥ (k/e) k for all k ∈ N.
Suppose PΨ 2 (|X|/D) ≤ 1. Then P
k∈N P(X/D) 2k /k! ≤ 1 so that kXk 2k ≤ D(k!) 1/2k ≤ D √
k for each k ∈ N and L(X) ≤ √
2 sup k∈N kXk 2k / √
2k ≤ D by Problem [9].
Conversely, if ∞ > L ≥ kXk r / √
r for all r ≥ 1 then
PΨ 2
|X|
D
= X
k∈N
kX/Dk 2k 2k
k! ≤ X
k∈N
(L √
2k/D) 2k (e/k) k ,
which equals 1 if D = √ 4e L.
Proof of (b)
None of the quantities appearing in the inequalities depend on µ, so we may as well assume µ = 0.
Inequality < 10 > gives β(X) ≤ τ (X).
§2.5 Bennett’s inequality 11
Simplify notation by writing β for β(X) and τ for τ (X) and L for L(X), and define γ := kXk Ψ
2 . Also, assume that X is not a constant, to avoid annoying trivial cases.
To show that γ ≤ β √ 6:
PΨ 2
|X − µ|
β √ 6
= P Z ∞
0
e x {|X − µ| 2 ≥ 6β 2 x} dx
= Z ∞
0
e x P{|X − µ| ≥ β
√ 6x} dx
≤ 2 Z ∞
0
exp (x − 3x) dx = 1.
To show that τ ≤ 4L: Symmetrize. Construct a random variable X 0 with the same distribution as X but independent of X. By the triangle inequality, kX − X 0 k r ≤ 2 kXk r ≤ 2L √
r for all r ≥ 1. By Jensen’s inequality Pe θX ≤ Pe θ(X−X 0 )
= 1 + X
k∈N
θ 2k P(X − X 0 ) 2k
(2k)! symmetry kills odd moments
≤ 1 + X
k∈N
(2L √ 2k ) 2k k k k!
= 1 + X
k∈N
(8θ 2 L 2 ) k
k! = e 8L 2 θ 2 , and so on.
2.5 Bennett’s inequality
Basic::S:Bennett
The Hoeffding inequality depends on the somewhat crude bound var(W ) ≤ (b − a) 2 /4 if a ≤ W ≤ b.
Such an inequality can be too pessimistic if W has only a small probability
of taking values near the endpoints of the interval [a, b]. The inequalities
can sometimes be improved by introducing second moments explicitly into
the upper bound. Bennett’s one-sided tail bound provides a particularly
elegant illustration of the principle.
Basic::Bennett.ineq <17> Theorem. Suppose X 1 , . . . , X n are independent random variables for which X i ≤ b and PX i = µ i and PX i 2 ≤ v i for each i, for nonnegative constants b and v i . Let W equal P
i≤n v i . Then P
nX
i≤n (X i − µ i ) ≥ x o
≤ exp
− x 2 2W ψ Benn
bx W
for x ≥ 0 where ψ Benn denotes the function defined on [−1, ∞) by
\E@ Benpsi.def
\E@ Benpsi.def <18> ψ Benn (t) := (1 + t) log(1 + t) − t
t 2 /2 for t 6= 0, and ψ Benn (0) = 1.
Remark. By Problem [14], ψ Benn is positive, decreasing, and convex, with ψ Benn (−1) = 2 and ψ Benn (t) decreasing like (2/t) log t as t → ∞.
When bx/W is close to zero the ψ Benn contribution to the upper bound is close to 1 and the exponent is close to a subgaussian −x 2 /(2W ).
(If b = 0 then it is exactly subgaussian.) Further out in the tail, for larger x, the exponent decreases like −x log(bx/W )/b, which is slightly faster than exponential.
Proof Define Φ 1 (x) = e x − 1 − x, the convex function Ψ 1 adjusted to have zero derivative at the origin, as in Problem [7]. From Problem [14], the positive function
\E@ Del.def
\E@ Del.def <19> ∆(x) = Φ 1 (x)
x 2 /2 for x 6= 0, and ∆(0) = 1,
is continuous and increasing on the whole real line. For θ ≥ 0, Pe θX i = 1 + θPX i + PΦ 1 (θX i )
= 1 + θµ i + P 1 2 (θX i ) 2 ∆(θX i )
≤ 1 + θµ i + 1 2 θ 2 P(X i 2 )∆(θb) because ∆ is increasing
≤ exp
θµ i + 1 2 θ 2 v i ∆(θb) .
\E@ Bennett.mgf
\E@ Bennett.mgf <20>
That is,
\E@ bdd1
\E@ bdd1 <21> Pe θ(X i −µ i ) ≤ exp 1 2 θ 2 v i ∆(θb) = exp v i
b 2 Φ 1 (θb)
for θ ≥ 0.
Remark. In the last expression I am implicitly assuming that b
is nonzero. We can recover the bound when b = 0 by a passage
to the limit as b decreases to zero. Actually, the b = 0 case is
quite noteworthy because the ∆ contribution disappears from the
upper bound for P exp(θX i ) and the ψ Benn disappears from the final
inequality.
§2.5 Bernstein’s inequality 13
Abbreviate P
i≤n (X i − µ i ) to T . By independence and < 21 >, Pe θT = Y
i≤n Pe θ(X i −µ i ) ≤ exp W
b 2 Φ 1 (θb)
and, by < 2 >,
P{T ≥ x} ≤ inf θ≥0 exp
−xθ + W
b 2 Φ 1 (θb)
= exp − W b 2 sup
bθ≥0
bx
W bθ − Φ 1 (θb)
!
= exp
− W
b 2 Φ ∗ 1 (bx/W )
, where Φ ∗ 1 denotes the conjugate of Φ 1 ,
Φ ∗ 1 (y) := sup
t∈R
(ty − Φ 1 (t)) = (1 + y) log(1 + y) − y if y ≥ −1
+∞ otherwise .
When y ≥ −1 the supremum is achieved at t = log(1+y). Thus 1 2 y 2 ψ Benn (y) = Φ ∗ 1 (y) for y ≥ −1 and (W/b 2 )Φ ∗ 1 (bx/W ) = x 2 /(2W )ψ Benn (bx/W ), as as- serted.
Remark. The inequality still holds if 0 > b ≥ −x/W , although the proof changes slightly. What happens for smaller b?
Unlike the case of the two-sided Hoeffding inequality, there is no au- tomatic lower tail analog for the Bennett inequality without an explicit lower bound for the summands. One particularly interesting case occurs when X i ≥ 0:
P{
X
i≤n (X i − µ i ) ≤ −x} = P{ X
i≤n (µ i − X i ) ≥ x} ≤ exp(− 2 /(2W )) for all x ≥ 0, a clean (one-sided) subgaussian tail bound.
The Bennett inequality also works for the Poisson distribution. If X has a Poisson(λ) distribution then (Problem [15])
P{X ≥ λ + x} exp
− x 2 2λ ψ Benn
x λ
for x ≥ 0 and, for 0 ≤ x ≤ λ,
P{X ≤ λ − x} = exp
− x 2 2λ ψ Benn
− x λ
≤ exp
− x 2 2λ
,
a perfect subgaussian lower tail. See Problem [16] if you have any doubts about the boundedness of Poisson random variables.
See Chapter 3 for an extension of the Bennett inequality to sums of martingale differences.
2.6 Bernstein’s inequality
Basic::S:Bernstein
As shown by Problem [14], ψ Benn (t) ≥ (1 + t/3) −1 for t ≥ −1. Thus the Bennett inequality implies a weaker result: if X 1 , . . . , X n are independent with PX i = µ i and PX i 2 ≤ v i and X i ≤ b then
\E@ Bernstein.indep0
\E@ Bernstein.indep0 <22> P{
X
i≤n (X i − µ i ) ≥ x} ≤ exp
− x 2 /2 W + bx/3
for x ≥ 0,
a form of Bernstein’s inequality. When bx/W is close to zero the expo- nent is again close to x 2 /(2W ); when bx/W is large, the Bernstein inequality loses the extra log term in the exponent. If the summands X i are bounded in absolute value (not just bound above) we get an analogous two-sided inequality.
The assumptions PX i 2 ≤ v i and X i ≤ b imply P|X i | k ≤ PX i 2 b k−2 ≤ v i b k−2 for k ≥ 2.
A much weaker condition on the growth of P|X i | k with k, without any assumption of boundedness, leads to a useful upper bound for the moment generating function and a more genral version of the Bernstein ineqi=uality..
Basic::Bernstein.unbdd <23> Theorem. (Bernstein inequality) Suppose X 1 , . . . , X n are independent ran- dom variables with PX i = µ i and
\E@ Bernstein.moment
\E@ Bernstein.moment <24> P|X i | k ≤ 1 2 v i B k−2 k! for k ≥ 2.
Define W = P
i≤n v i . Then, for x ≥ 0, P
nX
i≤n (X i − µ i ) ≥ x o
≤ e −H Benn (x,B,W ) ≤ e −H Bern (x,B,W ) where
H Bern (x, B, W ) = x 2 /W 2(1 + xB/W ) H Benn (x, B, W ) = x 2 /W
1 + xB/W + p1 + 2xB/W
§2.6 Bernstein’s inequality 15
Remark. The B here does not represent an upper bound for X i . The traditional form of Bernstein’s inequality, with HBern, corresponds to the bound < 22 > derived from the Bennett inequality under the assumption X i ≤ b = B/3.
Proof As before, F := P nX
i X i ≥ x + X
i µ i
o
≤ inf
θ≥0 e −(x+ P i µ i )θ Y
i≤n Pe θX i . Use the moment inequality < 24 > to bound the moment generating func- tion for X i .
Pe θX i = 1 + θµ i + PΦ 1 (θX i ) where Φ 1 (t) := e t − 1 − t and, for θ ≥ 0,
PΦ 1 (θX i ) ≤ PΦ 1 (θ|X i |) = X
k≥2
θ k P|X i | k k!
≤ 1 2 v i θ 2 X
k≥2 (θB) k−2
= 1 2 v i θ 2
1 − Bθ provided 0 ≤ θ < 1/B.
\E@ Bernstein.mgf
\E@ Bernstein.mgf <25>
Thus
Pe θX i ≤ 1 + θµ i + 1 2 v i θ 2
1 − bθ ≤ exp
θµ i + 1 2 v i θ 2 1 − bθ
for 0 ≤ θB < 1 and
F ≤ exp − sup 0≤Bθ<1 g(θ)
where g(θ) := xθ − 1 2 W θ 2 1 − Bθ . The H Bern bound
As an approximate maximizer for g choose θ 0 = x/(W + xB) so that F ≤ e −g(θ 0 ) = exp
− x 2 /2 W + xB
.
Bennett (1962, equation (7)) made this strange choice by defining G = 1−θB then minimizing −xθ+ 1 2 θ 2 W/G while ignoring the fact that G depends on θ.
That gives θ = xG/W = (1 − θB)x/W , a linear equation with solution θ 0 .
Very confusing. As Bennett noted, a boundedness assumption |X i | ≤ b
actually implies < 24 > with B = b/3. The exponential upper bound then agrees with < 22 >.
The H Benn bound
Following Bennett (1962, equation (7a)), this time we really maximize g(θ).
The calculation is cleaner if we rewrite g with u := xB/W and s := 1 − θB.
With those substitutions, g(θ) = g 0 (s) = W
2b 2 2us − (1 − s) 2 /s = W
2B 2 −s −1 − (1 + 2u)s − 2(1 + u) . The maximum is achieved at s 0 = (1 + 2u) −1/2 giving F ≤ e −g 0 (s 0 ) where
\E@ u.bound
\E@ u.bound <26> g 0 (s 0 ) = W
B 2 1 + u − √
1 + 2u = W B 2
u 2
1 + u + √ 1 + 2u
.
Both inequalities in Theorem < 23 > have companion lower bounds, ob- tained by replacing X i by −X i in all the calculations. This works because the Bernstein condition < 24 > also holds with X i replaced by −X i . The in- equalities could all have been written as two-sided bounds, that is, as upper bounds for P{| P
i (X i − µ i )| ≥ x}.
If we are really only interested in a one-sided bound then it seems super- fluous to control the size of |X i |. For example, for the upper tail it should be enough to control X i + . Boucheron et al. (2013, page 37)=BLM derived such a bound (which they attributed to Emmanuel Rio) by means of the elementary inequality
e x − 1 − x ≤ 1 2 x 2 + X
k≥3 (x+) k /k! for all real x.
(For x ≥ 0 both sides are equal. For x < 0 it reflects the fact that g(x) = e x − (1 + x + 1 2 x 2 ) has a positive derivative with g(0) = 0.) They replaced the assumption < 24 > by
PX i 2 = v i and P|X i | k ≤ 1 2 v i B k−2 k! for k ≥ 3.
For B > θ ≥ 0, we then get
Pe θX i ≤ 1 + θµ i + 1 2 θ 2 v i + 1 2 v i θ 2 X
k≥3
(θB) k−2 ≤ exp
θµ i + 1 2 v i θ 2 1 − θB
,
§2.7 Subexponential distributions? 17
which is exactly the same inequality as in the proof of Theorem < 23 >. As with the H Benn calculation it follows that
P nX
i X i ≥ x + X
i µ i o
≤ exp
− W
B 2 1 + u − √
1 + 2u
where W = P
i v i and u = xB/W . BLM observed that the upper bound can be inverted by solving a quadratic in u,
1 + 2u = 1 + u − tB 2 /W 2
,
which has a positive solution u = tB 2 /W +p2tB 2 /W , corresponding to x = W u/B = tB + √
2tW . The one-sided Bernstein bound then takes an elegant form,
\E@ one.sided.Bernstein
\E@ one.sided.Bernstein <27> P nX
i (X i − µ i ) ≥ tB +
√ 2tW
o
≤ e −t for t ≥ 0.
By splitting according to whether tB ≥ √
2tW or not, and by introducing an extraneous factor of 2, we can split the elegant form into two more suggestive inequalities,
P{
X
i (X i − µ i ) ≥ 2tB} ≤ e −t for t ≥ 2W/B 2 P{
X
i (X i − µ i ) ≥ 2t √
W } ≤ e −t 2 /2 for t 2 /2 ≥ 2W/B 2 .
These inequalities make clear that the subexponential part (large t) of the tail is controlled by B and the subgaussian part (moderate t) is controlled by W .
2.7 Subexponential distributions?
Basic::S:subexp
To me, subexponential behavior refers to functions that are bounded above by a constant multiple of exp(−c|x|), for some positive constant c. For example, the Bernstein inequality shows that, far enough from the mean, the tail probabilities for a large class of sums of independent random variables are subexponential. Closer to the mean the tail are subgaussian, which one might think is better than subexponential. As another example, if a random variable X takes values in [−1, +1] we have P{|X| ≥ x} = 0 ≤ exp(−10 10 x) for all x ≥ 1, but I would hesitate to call that behavior subexponential.
What then does it mean for a whole distribution to be subexponential?
And how is subexponentiality related to the Bernstein moment condition,
\E@ Bernstein.moment0
\E@ Bernstein.moment0 <28> P |X| k
B k k! ≤ 1 2 v/B 2 for k ≥ 2?
As you already know, condition < 28 > implies
\E@ Bernstein.mgf0
\E@ Bernstein.mgf0 <29> Pe θX −1−θPX = PΦ 1 (θX) ≤ PΦ 1 (θ|X|) ≤ 1 2 vθ 2
1 − θB for 0 ≤ θB < 1, which implies
Pe θX ≤ exp
θµ + 1 2 vθ 2 1 − θB
for 0 ≤ θB < 1 and µ = PX.
Van der Vaart and Wellner (1996, page 103) pointed out that condition < 28 >
is implied by
1
2 v/B 2 ≥ PΦ 1 (|X|/B) = X
k≥2
P|X| k
B k k! where Φ 1 (x) := e x − 1 − x.
And, if < 28 > holds, then
PΦ 1 (|X|/2B) = X
k≥2
P|X| k
2 k B k k! ≤ 1 2 (2v)/(2B) 2 ,
which is just < 29 > with θ = 1/(2B). It would seem that the moment condi- tion is somehow related to kXk Φ 1 . Indeed, Vershynin (2010, Section 5.2.4) defined a random variable X to be subexponential if sup k∈N kXk k /k < ∞.
By Problems [7] and [10], this condition is equivalent to finiteness of the Orlicz norm kXk Φ 1 .
Boucheron et al. (2013, Section 2.4) defined a centered random vari- able X to be sub-gamma on the right tail with variance factor v and scale factor B if
\E@ sub.gamma
\E@ sub.gamma <30> Pe θX ≤ exp vθ 2 /2 1 − θB
for 0 ≤ θB < 1.
They pointed out that if X has a gamma(α, β) distribution (with mean µ = αβ and variance v = αβ 2 ) then
Pe θ(X−µ) = e −θµ (1 − θβ) −α = exp (−θαβ − α log(1 − θβ))
≤ exp αβ 2 θ 2 /2 1 − θβ
for 0 ≤ θβ < 1.
Hence the name. The special case where α = 1 corresponds to the exponen-
tial.
§2.7 Subexponential distributions? 19
Let me try to tie these definitions and facts together.
First note that the upper bound in < 29 > equals 1 if θ = 2/ B + √
B 2 + 2v . Thus the Bernstein assumption < 28 > implies
γ := kXk Φ 1 ≤ B + p
B 2 + 2v
/2.
If, as is sometimes assumed, v = PX 2 then < 28 > also implies v 3/2 ≤ P|X| 3 ≤ 1 2 v3!B 2 .
That is, v ≤ (3B) 2 and B < γ ≤ 2.7B. It appears that the B, or kXk Φ 1 , controls the subexponential part of the tail probabilities, at least as far as the Bernstein inequality is concerned. Inequality < 27 > conveyed the same message. Compare with the subexponential bound
P{|X| ≥ x} ≤ min (1, 1/Φ 1 (x/γ)) for γ = kXk Φ
1 . The inequality PΦ 1 (|X|/γ) ≤ 1 also implies
P
|X| k
γ k k! ≤ 1 ≤ 1 2 (2γ 2 )/γ 2 for k ≥ 2.
That is, < 28 > holds with v = 2γ 2 and B = γ. Theorem < 23 > then gives
P{X − µ ≥ x} ≤ exp
− x 2 4γ 2 + 2xγ
≤ exp(−x 2 /(8γ 2 )) for 0 ≤ x ≤ 2γ.
Is that not a subgaussian bound? Unfortunately, the bound—subgaussian or not—has little value for such a small range near zero. The upper bound always exceeds 1/ √
e .
The useful subgaussian part of the bound is contributed by the v in < 28 >.
The effect is clearer for a sum of independent random variables X 1 , . . . , X n , each distributed like X:
P exp(θ X
i≤n X i ) ≤ exp nvθ 2 /2 1 − θB
for 0 ≤ θB < 1, which leads to tail bounds like
P{
X
i X i ≥ x} ≤ exp
− x 2 /2 nv + xB
,
useful subgaussian behavior when 0 ≤ x ≤ nv/B.
In my opinion, the sub-gamma property < 30 > is better thought of as a restricted subgaussian assumption that automatically implies subexponen- tial behavior far out in the tails. For each α in (0, 1) it implies
Pe θX ≤ exp vθ 2 /2 1 − α
for 0 ≤ θB ≤ α < 1.
Remark. Uspensky (1937, page 204) used this approach for the Bernstein inequality, before making the choice α = xB/(W + xB) to recover the traditional HBern form of the upper bound.
More generally, if we have a bound
\E@ restricted.subg
\E@ restricted.subg <31> Pe θ(X−µ) ≤ e vθ 2 /2 for 0 ≤ θ ≤ ρ,
with v and ρ positive constants and µ = PX, then for x ≥ 0, P{X ≥ µ + x} ≤ min
0≤θ≤ρ exp −xθ + 1 2 vθ 2
= exp − 1 2 x 2 /v
for 0 ≤ x ≤ ρv exp −ρx + 1 2 vρ 2 < e −xρ/2 for x > ρv .
\E@ gauss.exp
\E@ gauss.exp <32>
The minimum is achieved by θ = min(ρ, x/v).
The proof in the next Section of a generalized version of the Hanson- Wright inequality—showing that a quadratic P
i,j X i A i,j X j in independent subgaussians has a subgaussian/subexponential tail— illustrates this ap- proach.
2.8 The Hanson-Wright inequality for subgaussians
Basic::S:HW
The following is based on an elegant argument by Rudelson and Vershynin (2013) = R&V. To shorten the proof I assume that the A matrix has zeros down its diagonal, so that P P
i,j X i A i,j X j = 0 and I can skip the part of the argument (see Problem [17]) that deals with P
i A i,i (X i 2 − PX i 2 ).
Basic::HW <33> Theorem. Let X 1 , . . . , X n be independent, centered subgaussian random variables with τ ≥ max i τ (X i ). Let A be an n × n matrix of real numbers with A ii = 0 for each i. Then
P{X 0 AX ≥ t} ≤ exp − min t 2
64τ 4 kAk HS , t 8 √
2 τ 2 kAk 2
!!
for t ≥ 0,
where kAk HS := ptrace(A 0 A) = q P
i,j A 2 i,j and kAk 2 := sup kuk
2 ≤1 kAuk 2 .
§2.8 The Hanson-Wright inequality for subgaussians 21
Remark. There are no assumptions of symmetry or positive definiteness on A. The proof also works with A replaced by −A, leading to a similar upper bound for P{|X 0 AX| ≥ t}. The Hilbert-Schmidt norm kAk HS is also called the Frobenius norm. The constants 64 and 8 √
2 are not important.
Proof Without loss of generality, suppose τ = 1, so that Pe θX i ≤ e θ 2 /2 for all i and all real θ and P exp(α 0 X) ≤ exp(kαk 2 2 /2) for each α ∈ R n .
As explained at the end of Section 2.7, it suffices to show Pe θX
0 AX ≤ exp( 1 2 vθ 2 ) for 0 ≤ θ ≤ ρ where v = 16 kAk 2 HS and ρ = √
32 kAk 2 −1
.
To avoid conditioning arguments, assume P = ⊗ i∈[n] P i on B(R [n] ), where [n] := {1, 2, . . . , n}, and each X i is a coordinate map, that is, X i (x) = x i and X(x) = x. Write G for the N (0, I n ) distribution, another probability measure on B(R [n] ). Of course G = ⊗ i∈[n] G i with each G i equal to N (0, 1).
For each subset J of [n] write P J for ⊗ i∈J P i and G J for ⊗ i∈J G i and, for generic vectors w = (w i : i ∈ [n]) in R [n] , write w J for (w i : i ∈ J ) ∈ R J .
The proof works by bounding Pe θX 0 AX by the G expected value of an analogous quadratic, using a decoupling argument that R&V attributed to Bourgain (1998).
For each δ = (δ i : i ∈ I) ∈ {0, 1} [n] define A δ to be the n × n matrix with (i, j)th element δ i (1 − δ j )A i,j . Write D δ for {i ∈ [n] : δ i = 1} and N δ for [n]\D δ . Then X 0 A δ X = X D 0
δ EX N δ where E δ := A[D δ , N δ ]. If the rows and columns of A were permuted so that i < i 0 for all i ∈ D δ and i 0 ∈ N δ then A δ would look like
A δ = 0 E δ 0 0
where E δ := A[D δ , N δ ] .
Now let Q denote the uniform distribution on {0, 1} [n] . Under Q the δ i ’s are independent Ber(1/2) random variables, so that QA δ = 1 4 A. For each θ, Jensen’s inequality and Fubini give
P exp θX 0 AX = P x exp(4θQ δ X 0 A δ X) ≤ Q δ P x exp(4θX 0 A δ X).
Write Q δ for X 0 A δ X. It therefore suffices to show that
P exp(4θQ δ ) ≤ exp( 1 2 vθ 2 ) for each δ and all 0 ≤ θ ≤ ρ.
From this point onwards the δ is held fixed. To simplify notation I drop
most of the subscript δ’s, writing X D instead of X D δ , and so on. Of course
A δ cannot drop its subscript.
Under P D , with X N held fixed, the quadratic Q δ := X 0 A δ X = X D 0 EX N
is a linear combination of the independent subgaussians X D . Thus P D e 4θ Q δ ≤ exp
1
2 (4θ) 2 kEX N k 2 2
= G D exp (4θ Q δ ) . Similarly, P N exp (4θQ δ ) ≤ G N exp (θQ δ ). Thus
P exp(4θQ δ ) = P D P N exp(4θQ δ )
≤ P D G N exp(4θ Q δ ) = G N P D exp(4θ Q δ )
≤ G N G D exp(4θ Q δ ) = G N e 8θ 2 X N 0 E 0 EX N .
The problem is now reduced to its gaussian analog, for which rotational symmetry of G N allows further simplification by means of a diagonalization of a positive definite matrix, E 0 E = L 0 ΛL, where L is orthogonal and Λ = diag(λ 1 , . . . , λ n ), with λ 1 ≥ λ 2 ≥ . . . λ n ≥ 0 are the eigenvalues of E 0 E.
G N e 8θ 2 X N 0 E 0 EX N = G N exp 8θ 2 X N 0 ΛX N = Y
i∈N G i exp(8θ 2 λ i x 2 i )
≤ exp X
i∈N 16θ 2 λ i
provided 8θ 2 max
i∈N λ i ≤ 1/4.
\E@ G.bnd
\E@ G.bnd <34>
The last inequality uses the fact that, if Z ∼ N (0, 1), Pe sZ
2 = (1 − 2s) −1/2 = exp − 1 2 log(1 − 2s) ≤ exp(2s) if 0 ≤ s ≤ 1 4 . Now we need some matrix facts to bring everything back to an expression involving A. For notational simplicity again assume that the rows D all precede the rows N , so that
A = B 1 E B 2 B 3
and A 0 A = B 1 0 B 1 + B 2 0 B 2 B 1 0 E + B 2 0 B 3 E 0 B 1 + B 3 0 B 2 E 0 E + B 3 0 B 3
. It follows that
kAk 2 HS = trace(A 0 A) ≥ trace(E 0 E) = X
i∈N λ i and
kAk 2 = sup
kvk 2 ≤1
kAvk 2 ≥ sup
kuk 2 ≤1
kEuk 2 = λ 1 . Combine the various inequalities to conclude that P exp θX 0 AX ≤ exp
16θ 2 kAk 2 HS
provide 0 ≤ θ ≤ √
32 kAk 2 −1
. An appeal to inequality < 32 > completes the proof.
§2.9 Problems 23
2.9 Problems
Basic::S:Problems
[1] Define a = sup{θ < 0 : M (θ) = +∞} and b = inf{θ > 0 : M (θ) = +∞}
Basic::P:convex.interval
where M (θ) = Pe θX . For simplicity, suppose both a and b are finite and a < b.
(i) Show that the restriction of M to the interval [a, b] is continuous. (For example, if M (b) = +∞ you should show that M (θ) → +∞ as θ increases to b.)
(ii) Suppose θ ∈ (a, b). Choose δ > 0 so that [θ − 2δ, θ + 2δ 0 ] ⊂ (a, b). For
|h| < δ, establish the domination bound
|e (θ+h)X − e θX |/h ≤
Z 1 0
Xe (θ+sh)X ds
≤ max
e (θ+2δ)X , e θX , e (θ−2sδ)X
/δ.
Deduce via a Dominated Convergence argument that M 0 (θ) = P(Xe θX ).
(iii) Argue similarly to justify another differentiation inside the expectation, lead- ing to the expression for M 00 (θ).
[2] Suppose h(x) = e g(x) , where g is a twice differentiable real-valued function
Basic::P:e.to.g
on R + .
(i) Show that h is convex on any interval J for which (g 0 (x)) 2 + g 00 (x) ≥ 0 for x ∈ int(J ).
[3] For a fixed α ≥ 1 define Ψ α (x) = e x α − 1 for x ≥ 0.
Basic::P:Psia.ge1
(i) Show that Ψ α (x) is a Young function..
(ii) Show that
Ψ α (x)Ψ α (y) ≤ exp(x α + y α ) − 1 ≤ Ψ α (x + y) for all x, y ≥ 0
≤ Ψ α (2 1/a xy) for x ∧ y ≥ 1.
(iii) Deduce from the inequality Ψ α (x)Ψ α (y) ≤ Ψ α (x + y) that Ψ −1 α (uv) ≤ Ψ −1 α (u) + Ψ −1 α (v) for all u, v ≥ 0.
[4] Suppose f is a convex, nonnegative, increasing function on an interval [a, ∞),
Basic::P:extend.to.Young
with a > 0. Suppose also that there exists an x 0 ∈ [a, ∞) for which
f (x 0 )/x 0 = τ := inf x≥a f (x)/x. Show that h(x) := τ x{0 ≤ x < x 0 } + f (x){x ≥ x 0 } is a Young function.
[5] Suppose f : R + → R + is strictly increasing with f (0) = 0 and
Basic::P:extend.range
f (u 1 u 2 ) ≤ c 0 (f (u 1 ) + f (u 2 )) for min(u 1 , u 2 ) ≥ v 0 ,
for some constants c 0 and v 0 > 0. For each v 1 in (0, v 0 ) prove existence of a larger constant c 1 for which
f (u 1 u 2 ) ≤ c 1 (f (u 1 ) + f (u 2 )) for min(u 1 , u 2 ) ≥ v 1 . Hint: For u ∗ i = max(u i , v 0 ) show that f (u ∗ i ) ≤ f (u i )f (v 0 )/f (v 1 ).
[6] For a fixed α ∈ (0, 1) define f α (x) = e x α for x ≥ 0.
0 2 6 8
051015
zα xα
fα
α=1/2