
Sequential and Adaptive Inference Based on Martingale Concentration


UC Berkeley Electronic Theses and Dissertations

Title: Sequential and Adaptive Inference Based on Martingale Concentration

Permalink: https://escholarship.org/uc/item/63m9j4hw

Author: Howard, Steven R

Publication Date: 2019

Peer reviewed | Thesis/dissertation

eScholarship.org Powered by the California Digital Library University of California


by

Steven R. Howard

A dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy in Statistics in the Graduate Division of the

University of California, Berkeley

Committee in charge:

Professor Jon McAuliffe, Co-chair

Professor Jasjeet Sekhon, Co-chair

Professor Bin Yu

Professor Nikhil Srivastava


Copyright 2019 by


Abstract

Sequential and Adaptive Inference Based on Martingale Concentration

by

Steven R. Howard

Doctor of Philosophy in Statistics

University of California, Berkeley

Professor Jon McAuliffe, Co-chair

Professor Jasjeet Sekhon, Co-chair

Randomized experiments hold a well-deserved place at the top of the hierarchy of scientific evidence, and as such have received a great deal of attention from the statistical research community. In the simplest setting, a fixed group of subjects is available to the experimenter, who assigns one of two treatments to each subject via randomization, then observes corresponding outcomes. The goal is to draw inference about the effect of the experimental treatment on the observed outcome.

Classical, frequentist statistical inference provides a powerful set of tools for this fixed-sample setting. We begin with an observed sample of some deterministic size and seek procedures which yield valid hypothesis tests, p-values, and confidence intervals: for example, a t-test of the null hypothesis that the experimental treatment has no effect, on average, or a corresponding confidence interval for the average treatment effect. The fixed-sample paradigm demands that we plan the experiment ahead of time, including the size of the experimental sample and the exact hypotheses to be tested, and that we adhere rigidly to this plan.

In contrast, modern data analysis demands adaptivity. In particular, often the sample we choose to analyze is itself selected on the basis of observed data. For example, in an online A/B test, we may observe an ongoing stream of visitors enrolled into an experiment, so that the experimental sample is growing over time. The final experimental sample will include all of the visitors observed up to the time we decide to stop the experiment. The decision to stop could be made adaptively, by monitoring observed results and stopping early if a strong effect is observed, later if not. This is the realm of sequential, as opposed to fixed-sample, analysis.

There are many other kinds of adaptivity that arise in practice. A second example is in the analysis of nonrandomized, or observational, studies of causal effects. In testing for statistical evidence of an effect, we may choose to focus on a subpopulation which we believe to be highly affected by the treatment of interest. For example, in studying the effect of fish consumption on mercury levels in the blood, we may focus on individuals whose diets are especially high in fish. Classical statistics requires that we define precisely which diets will be classified as “especially high in fish” before we analyze outcomes, but experimenters may prefer for this choice to be guided by the observed outcomes themselves.

In both of the above examples (the sequential stopping of a randomized experiment and the adaptive choice of subgroup in an observational study), the use of fixed-sample methods, which do not account for adaptivity, will lead to violations of statistical guarantees such as false positive control. These violations are commonly included under the label “p-hacking” and have received much blame for the lack of reproducibility in various fields of scientific research. Fortunately, alternative statistical methods are available, methods that explicitly account for adaptivity to yield robust inference while placing fewer restrictions on the researcher. Such methods are the ultimate aim of the present work.

This thesis develops a framework for constructing sequential and adaptive statistical procedures by taking advantage of the time-uniform concentration properties of certain martingales. Chapter 1 begins by laying out a mathematical framework for the derivation of time-uniform concentration inequalities for various classes of martingales. This framework unifies and strengthens a plethora of results from the exponential concentration literature and provides a toolbox for developing sequential and adaptive statistical procedures. The remaining three chapters develop such procedures.

Chapter 2 builds upon the techniques of Chapter 1 to develop uniform concentration bounds which are somewhat more analytically and computationally complex but are much more useful for statistical applications. We frame these methods in terms of confidence sequences, that is, sequences of confidence intervals that are uniformly valid over an unbounded time horizon. One of the key results of this work is an empirical-Bernstein confidence sequence which provides a time-uniform, nonparametric, and non-asymptotic analogue of the t-test applicable to any distribution with bounded support. We explore applications to sequential estimation of average treatment effects in a randomized experiment, our first example above, as well as sequential estimation of a covariance matrix.

Chapter 3 applies ideas from Chapters 1 and 2 to develop methods for the two related problems of estimating quantiles and estimating the entire cumulative distribution function, based on i.i.d. samples. We present confidence sequences for these estimands which are valid uniformly over time for any distribution, and we explore applications to A/B testing and best-arm identification when objectives are based on quantiles rather than means.

Finally, Chapter 4 explores an application of uniform martingale concentration to the second example given above, the adaptive choice of subgroup within the analysis of an observational study. We introduce Rosenbaum’s sensitivity analysis framework for observational studies, and show how our procedure yields qualitative improvements over existing methods within this framework.

The martingale-based inferential methods we explore in this work trace their origins to Abraham Wald’s work on the sequential probability ratio test during the 1940s, as well as to pioneering extensions developed in the late 1960s and early 1970s by Herbert Robbins, Donald Darling, David Siegmund, and Tze Leung Lai, not to mention many others. However, despite the decades of relevant literature, we believe most of the potential of the core ideas has yet to be realized. The key to unlocking this potential, we hope, is a fuller understanding of the nonparametric applicability of these methods, a detailed study of their implementation and tuning in practice, and an exploration of their utility beyond the sequential setting. While we propose several procedures that have immediate practical utility, we hope the larger contribution of the work will be as a first step towards a deeper appreciation of the power of martingale-based methods for adaptive inference, and ultimately to the development of a new class of statistical procedures which permit the kinds of adaptivity contemporary data analysts desire.


Contents

List of Figures

List of Tables

1 Exponential line-crossing inequalities
1.1 Introduction
1.2 Main results
1.3 Sufficient conditions for sub-ψ processes
1.4 Applications of Theorem 1.1
1.5 Discussion and extensions
1.6 Proofs
1.7 Appendix

2 Nonparametric confidence sequences
2.1 Introduction
2.2 Preliminaries: confidence sequences based on linear boundaries
2.3 Curved uniform boundaries
2.4 Applications
2.5 Simulations
2.6 Extensions
2.7 Summary and future work
2.8 Proofs of main results
2.9 Appendix

3 Sequential estimation of quantiles
3.1 Introduction
3.3 Confidence sequences for a fixed quantile
3.4 Confidence sequences for all quantiles simultaneously
3.5 Graphical comparison of bounds
3.6 Quantile ε-best-arm identification
3.7 Sequential hypothesis tests based on quantiles
3.8 Proofs
3.9 Appendix

4 The uniform general signed rank test
4.1 Introduction
4.2 Background and notation
4.3 A uniform general signed rank test
4.4 Design sensitivity of the uniform test
4.5 Simulations
4.6 Handling ties
4.7 Application: impact of fish consumption on mercury concentration
4.8 Conclusion and future work
4.9 Appendix


List of Figures

1.1 Equivalence of Freedman-style inequalities and de la Peña-style inequalities via Theorem 1.1
1.2 Illustration of the equivalent statements of Theorem 1.1
1.3 Schematic of implications among sub-ψ conditions
1.4 Comparison of fixed-time Cramér-Chernoff bound, Freedman-style constant uniform bound, and linear uniform bound from Theorem 1.1(b)
1.5 Comparison of our decreasing boundary from Theorem 1.1(c) to a de la Peña-style constant uniform bound
1.6 Geometric illustration of Theorem 1.1(b) and its relation to fixed-time Cramér-Chernoff bounds
1.7 Comparison of ψ functions given in Table 1.2

2.1 Introductory illustration of confidence sequences
2.2 Relations among sub-ψ boundaries
2.3 Comparison of finite LIL bounds for independent 1-sub-Gaussian observations
2.4 Comparison of normalized uniform boundaries $u(v)/\sqrt{v}$ optimized for different intrinsic times
2.5 Empirical-Bernstein confidence sequence for $\mathrm{ATE}_t$ under Bernoulli randomization
2.6 Illustration of covariance matrix confidence sequence
2.7 Simulations illustrating confidence sequence size and coverage for bounded observations
2.8 Illustration of Theorem 2.1, stitching together linear boundaries to construct a curved boundary
2.9 Illustration of Theorem 2.2, the discrete mixture bound

3.2 Comparison of upper confidence bound radii used in quantile confidence sequences
3.3 Illustration of tuning quantile confidence sequences
3.4 The QLUCB algorithm
3.5 Average sample size for various quantile best-arm identification algorithms based on simulations
3.6 Average ratio of sample size for Theorem 3.5 to sample size for naive strategy, based on simulations
3.7 Extended comparison of upper confidence bound radii used in quantile confidence sequences
3.8 Average sample size for additional quantile best-arm identification algorithms based on simulations

4.1 The four score functions $\varphi(q)$ used in this chapter
4.2 Illustration of Theorem 4.1 and the uniform bound (4.10) for the uniform sign test
4.3 $\pi(x)$ from Theorem 4.2 for sign and WSRT score functions when $G$ is standard normal, Laplace (double exponential) or Cauchy
4.4 Comparison of simulated power for fixed-sample tests vs. uniform tests
4.5 Comparison of simulated power for uniform tests using different score functions
4.6 $\pi(x)$ from Theorem 4.2 for additional score functions not included in Figure 4.3


List of Tables

1.1 Some existing results which are strengthened by Theorem 1.1
1.2 Summary of common ψ functions and related transforms
1.3 Summary of sufficient conditions for a real-valued, discrete- or continuous-time martingale $(S_t)$ to be sub-ψ with the given variance process
1.4 Summary of sufficient conditions for an $\mathcal{H}^d$-valued, discrete-time martingale $(Y_t)$ to have a sub-ψ maximum eigenvalue process $S_t = \gamma_{\max}(Y_t)$ with variance process $V_t = \gamma_{\max}(Z_t)$
1.5 Implications among sub-ψ conditions

2.1 Comparison of parameters for finite LIL boundaries

4.1 Balance table for 1,672 matched pairs formed from NHANES data


Acknowledgments

This work would not have been possible without the close collaboration and constant guidance of Aaditya Ramdas. Over the years since I started down the path of my present research agenda, Aaditya has given me weekly and often daily feedback and ideas for improvement, not only on my research, but on my writing and paper organization and generally on how to approach Ph.D. studies. I have been extremely lucky that he happened to be at Berkeley at the right time for me, that he was curious enough to read my incoherent initial drafts, and that he had the vision to see the possibilities lying within. Aaditya co-authored the material of Chapters 1 to 3.

My advisors Jon McAuliffe and Jasjeet Sekhon have played huge roles in shaping my philosophy as a statistician and my approach as a researcher, from my first year in the program to the present day. I am grateful for all of the opportunities I’ve had to learn from them and look forward to continuing to do so. Jon and Jas co-authored the material of Chapters 1 and 2.

All of my teachers at Berkeley have been generous, wise, patient, and extremely influential on my thinking. This includes Sam Pimentel (who co-authored the material of Chapter 4), Bin Yu, David Aldous, Peng Ding, Will Fithian, Avi Feller, Martin Wainwright, Michael Jordan, Laurent El Ghaoui, and Elizabeth Purdom, among others. I am grateful as well to my fellow students, who have been friendly and ever-willing to explain concepts I struggled to grasp, and especially to Eli Ben-Michael, whose conversation sparked the research presented in Chapter 4, and whose friendship I value.

No project of this scope would be possible without my friends and family. Among others, I must mention Dan Birken and Meghan Loisel, Andrew Junkin, Eric Konieczny, Brian McDonald, Chris and Megan Mueller (and Sky!), Jess Riedel, Trevor and Danielle Seret, Ian Shea, Ben Shestakofsky and Isheh Beck, and Josh Specht, for not letting me descend into all work and no play. My siblings Danielle, JP, Mike and Lina have supported me with encouragement and humor all along. My parents Robin and Andy have always been and continue to be my number one fans, and I never could have completed graduate studies without the work ethic and self-confidence they instilled in me. My daughter Ellie has brought joy to the last year of my studies and I can’t wait for her sister to join the fray. Most of all, I’m thankful for the support and patience of my partner Jessie, who celebrated every little accomplishment along the way, had complete faith in me when I doubted myself, and never complained when I spaced out on a hike thinking about math. Thank you.


Chapter 1

Exponential line-crossing inequalities

We begin by developing a class of exponential bounds for the probability that a martingale sequence crosses a time-dependent linear threshold. Our key insight is that it is both natural and fruitful to formulate exponential concentration inequalities in this way. We illustrate this point by presenting a single assumption and a single theorem that together unify and strengthen many tail bounds for martingales, including classical inequalities (1960-80) by Bernstein, Bennett, Hoeffding, and Freedman; contemporary inequalities (1980-2000) by Shorack and Wellner, Pinelis, Blackwell, van de Geer, and de la Peña; and several modern inequalities (post-2000) by Khan, Tropp, Bercu and Touati, Delyon, and others. In each of these cases, we give the strongest and most general statements to date, quantifying the time-uniform concentration of scalar, matrix, and Banach-space-valued martingales, under a variety of nonparametric assumptions in discrete and continuous time. In doing so, we bridge the gap between existing line-crossing inequalities, the sequential probability ratio test, the Cramér-Chernoff method, self-normalized processes, and other parts of the literature. Additionally, this chapter lays the foundation for most of the methods developed in the remaining chapters.

1.1 Introduction

Concentration inequalities play an important role in probability and statistics, giving non-asymptotic tail probability bounds for random variables or suprema of random processes. In this chapter, we consider a method to bound the probability that a martingale ever crosses a time-dependent linear threshold. We were motivated by the fact that such bounds are the key ingredient in many sequential inference procedures. We argue, however, that this formulation is materially better for the development of exponential concentration inequalities, even in some non-sequential settings. We give a master assumption and theorem which handle all of these cases, in discrete and continuous time, for scalar-valued, matrix-valued, and smooth Banach-space-valued martingales. By unifying and organizing dozens of results, we illustrate how these results relate to one another and highlight the specific ingredients contributed by each author. Our improvements to existing results come in the form of weakened assumptions, extension of fixed-time or finite-horizon bounds to infinite-horizon uniform bounds, and improved exponents.

Our main results are presented in full generality in the following section. To motivate these results, we first contrast a small handful of well-known, concrete results from the exponential concentration literature; see Section 1.1 for a more detailed overview of the literature we draw upon. Throughout the chapter, most of our results are presented for filtered probability spaces, and we use $\mathbb{E}_t$ to denote expectation conditional on the underlying filtration $\mathcal{F}_t$ at time $t$. For any discrete-time process $(Y_t)_{t \in \mathbb{N}}$, we write $\Delta Y_t := Y_t - Y_{t-1}$ for the increments. Finally, we write $\mathcal{H}^d$ for the space of $d \times d$ Hermitian matrices. The relation $A \preceq B$ denotes the semidefinite order on $\mathcal{H}^d$, while $\gamma_{\max} : \mathcal{H}^d \to \mathbb{R}$ denotes the maximum eigenvalue map.

Example 1.1. Unless indicated otherwise, let $(S_t)_{t=0}^\infty$ be a real-valued martingale with respect to a filtration $(\mathcal{F}_t)_{t=0}^\infty$, with $S_0 = 0$.

(a) Three of the earliest and most well-known results for exponential concentration are attributed to Bernstein, Bennett, and Hoeffding. Assume the increments $(\Delta S_t)$ are independent, and let $v_t := \sum_{i=1}^t \mathbb{E}(\Delta S_i)^2$. We present Bernstein’s inequality (Bernstein, 1927) in a widely used form (e.g., Boucheron et al., 2013, Corollary 2.11): if, for some fixed $m \in \mathbb{N}$ and $c > 0$, the increments satisfy the moment condition $\sum_{i=1}^m \mathbb{E}(\Delta S_i)^k \le \frac{k!}{2} c^{k-2} v_m$ for all integers $k \ge 3$, then for any $x > 0$, we have

$$\mathbb{P}(S_m \ge x) \le \exp\left\{-\frac{x^2}{2(v_m + cx)}\right\}. \tag{1.1}$$

Bernstein’s moment condition is easily seen to be satisfied if the increments are bounded. Bennett (1962, eq. 8b) improved Bernstein’s result for bounded increments: if $\Delta S_t \le 1$ for all $t$, then for any $x > 0$ and $m \in \mathbb{N}$, we have

$$\mathbb{P}(S_m \ge x) \le \left(\frac{v_m}{x + v_m}\right)^{x + v_m} e^{x}. \tag{1.2}$$

Finally, Hoeffding (1963, eq. 2.3) gave a simplified result for increments bounded from above and below: if $|\Delta S_t| \le 1$ for all $t$, then for any $x > 0$ and $m \in \mathbb{N}$, we have

$$\mathbb{P}(S_m \ge x) \le e^{-x^2/2m}. \tag{1.3}$$

(b) Blackwell (1997, Theorem 1): if $|\Delta S_t| \le 1$ for all $t$, then for any $a, b > 0$, we have

$$\mathbb{P}(\exists t \in \mathbb{N} : S_t \ge a + bt) \le e^{-2ab}. \tag{1.4}$$

Relative to Hoeffding’s inequality, Blackwell removes the assumption of independent increments, though this possibility was noted by Hoeffding himself (Hoeffding, 1963, p. 18). More importantly, Blackwell replaces the event $\{S_m \ge x\}$ for fixed time $m$ with the time-uniform event $\{\exists t \in \mathbb{N} : S_t \ge a + bt\}$. To see that Blackwell’s result recovers and strengthens that of Hoeffding, set $a = x/2$, $b = x/2m$ and note that Blackwell’s uniform bound recovers Hoeffding’s bound at time $t = m$, so that Blackwell obtains the same probability bound for a larger event.

(c) Freedman (1975, Theorem 1.6): if $|\Delta S_t| \le 1$ for all $t$, then writing $V_t := \sum_{i=1}^t \operatorname{Var}(\Delta S_i \mid \mathcal{F}_{i-1})$, for any $x, m > 0$ we have

$$\mathbb{P}(\exists t \in \mathbb{N} : V_t \le m \text{ and } S_t \ge x) \le \left(\frac{m}{x + m}\right)^{x + m} e^{x}. \tag{1.5}$$

Similar to Bernstein’s and Bennett’s inequalities, but unlike those of Hoeffding and Blackwell, Freedman’s inequality measures time in terms of a predictable quantity, the accumulated conditional variance $V_t$, rather than simply the number of observations $t$. Freedman’s inequality bounds the deviations of $(S_t)$ uniformly over time, but only up to the finite time horizon defined by $V_t \le m$.

(d) de la Peña (1999, Theorem 6.2, eq. 6.4): if the increments are conditionally symmetric, that is, $\Delta S_t \sim -\Delta S_t \mid \mathcal{F}_{t-1}$ for all $t$, then letting $V_t = \sum_{i=1}^t \Delta S_i^2$, for any $\alpha \ge 0$ and $\beta, x, m > 0$ we have

$$\mathbb{P}\left(\exists t \in \mathbb{N} : V_t \ge m \text{ and } \frac{S_t}{\alpha + \beta V_t} \ge x\right) \le \exp\left\{-x^2\left(\frac{\beta^2 m}{2} + \alpha\beta\right)\right\}. \tag{1.6}$$

A remarkable feature of this result is that we measure time via the adapted quantity $V_t$. Unlike Freedman’s inequality, which uses the true conditional variance to measure time, de la Peña’s inequality relies only on empirical quantities. In further contrast to Freedman’s inequality, de la Peña’s bound holds uniformly over $V_t \ge m$ rather than $V_t \le m$, and we bound the deviations of the self-normalized process $S_t/(\alpha + \beta V_t)$.

(e) Tropp (2012, Theorem 6.2): departing from the above results for real-valued martingales, here we begin with a martingale $(Y_t)_{t \in \mathbb{N}}$ taking values in $\mathcal{H}^d$. Assume that the increments $\Delta Y_t$ are independent and, for some fixed $c > 0$ and $\mathcal{H}^d$-valued sequence $(W_t)_{t \in \mathbb{N}}$, the moments of the increments satisfy $\mathbb{E}\left(\Delta Y_t^k \mid \mathcal{F}_{t-1}\right) \preceq \frac{k!}{2} c^{k-2} \Delta W_t$ for all $t$ and all $k \ge 2$. Then, writing $S_t = \gamma_{\max}(Y_t)$ and $V_t = \gamma_{\max}(W_t)$, for any $x > 0$ and $t \ge 1$, we have

$$\mathbb{P}(S_t \ge x) \le d \cdot \exp\left\{-\frac{x^2}{2(V_t + cx)}\right\}. \tag{1.7}$$

This elegant result extends Bernstein’s inequality to the matrix setting. Note the prefactor of $d$ that appears when we bound the deviations of the maximum eigenvalue of a $d \times d$ matrix-valued process.

(f) Finally, we recall a textbook result for Brownian motion (e.g., Durrett, 2017, Exercise 7.5.2): if $(S_t)_{t \in (0,\infty)}$ is a standard Brownian motion, then for any $a, b > 0$, we have

$$\mathbb{P}(\exists t \in (0, \infty) : S_t \ge a + bt) = e^{-2ab}. \tag{1.8}$$

The result closely resembles Blackwell’s inequality for discrete-time martingales with bounded increments, but here we have an equality.
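The relationships among the bounds in Example 1.1 can be checked numerically. The following is a minimal sketch, not part of the original development: it evaluates the right-hand sides of (1.1) through (1.4) and confirms that Blackwell’s line with $a = x/2$, $b = x/2m$ passes through the point $(m, x)$ and reproduces Hoeffding’s bound. The function names and the values of $m$, $x$, and $v_m$ are illustrative assumptions, and the choice $c = 1/3$ for Bernstein’s constant is an assumption as well (the value commonly paired with increments bounded by one).

```python
import math


def bernstein_bound(x, v, c):
    """Right-hand side of (1.1): exp(-x^2 / (2 (v + c x)))."""
    return math.exp(-x**2 / (2.0 * (v + c * x)))


def bennett_bound(x, v):
    """Right-hand side of (1.2): (v / (x + v))^(x + v) * e^x, computed in log space."""
    return math.exp((x + v) * math.log(v / (x + v)) + x)


def hoeffding_bound(x, m):
    """Right-hand side of (1.3): exp(-x^2 / (2 m))."""
    return math.exp(-x**2 / (2.0 * m))


def blackwell_bound(a, b):
    """Right-hand side of (1.4): exp(-2 a b), for the linear boundary a + b t."""
    return math.exp(-2.0 * a * b)


# Illustrative values (assumptions, not taken from the text).
m, x = 100, 20.0
v = 25.0        # accumulated variance v_m, well below its largest possible value m
c = 1.0 / 3.0   # assumed Bernstein constant for increments bounded by one

# Blackwell's line with a = x/2, b = x/(2m) passes through the point (m, x),
# and its crossing bound coincides with Hoeffding's fixed-time bound.
a, b = x / 2.0, x / (2.0 * m)
line_at_m = a + b * m

print("line value at t = m:", line_at_m)
print("Blackwell:", blackwell_bound(a, b), "Hoeffding:", hoeffding_bound(x, m))
print("Bernstein:", bernstein_bound(x, v, c), "Bennett:", bennett_bound(x, v))
```

With these particular values, in which $v_m$ is small relative to $m$, Bennett’s bound is no larger than Bernstein’s, and both variance-sensitive bounds are smaller than Hoeffding’s.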

Clearly, these results have much in common with each other and with myriad other results from the exponential concentration literature. Examining the proofs, we find many shared ingredients which are now well known: the notions of sub-Gaussian and sub-exponential random variables, the Cramér-Chernoff method, the large-deviations supermartingale, and so on. Nonetheless, there are enough differences among the results and their proofs to leave us wondering whether these results are merely similar in appearance, or whether they are all special cases of some underlying, general argument.

In this chapter, we give a framework which formally unifies the above results along with many others. Our framework consists of two pieces. First, we crystallize the notion of a sub-ψ process (Definition 1.1), a sufficient condition general enough to encompass a broad set of results not previously treated together, yet specific enough to derive a useful set of equivalent concentration inequalities. This definition provides a convenient categorization of exponential concentration results into sub-Bernoulli, sub-Gaussian, sub-Poisson, sub-exponential, and sub-gamma bounds. Second, we give a generalization of the Cramér-Chernoff argument, Theorem 1.1. This result yields strengthened versions of many existing inequalities and illustrates equivalences among different forms of exponential bounds. For example, Theorem 1.1 strengthens both “Freedman-style” inequalities such as (1.5) and “de la Peña-style” inequalities such as (1.6) to hold uniformly over all time, and in these strengthened forms, the two styles of inequality are shown to be equivalent, as depicted in Figure 1.1. We remark that the seminal works from which these examples are drawn, like others referenced below, include many other important contributions, and our claims about Theorem 1.1 refer only to the particular inequalities cited from each work.

[Figure 1.1 diagram: Theorem 1.1(b) implies Freedman-style inequalities, such as (1.5); Theorem 1.1(c,d) implies de la Peña-style inequalities, such as (1.6); Theorem 1.1(b) and Theorem 1.1(c,d) imply each other; the Freedman-style and de la Peña-style inequalities do not imply each other.]

Figure 1.1: This chapter’s Theorem 1.1 implies both Freedman-style inequalities such as (1.5) and de la Peña-style inequalities such as (1.6). Refer also to Figures 1.4 and 1.5 for visualizations of these implications.

Once the framework is in place, the proof of the main result follows using tools from classical large-deviation theory (Dembo and Zeitouni, 2010). We construct a nonnegative supermartingale as in Freedman (1975), and we obtain a bound on its entire trajectory using Ville’s maximal inequality (Ville, 1939). We invoke Tropp’s ideas (Tropp, 2011) to extend the results to the matrix setting. The equivalences that follow from optimizing linear bounds are obtained using convex analysis (Rockafellar, 1970). By drawing together various proof ingredients from different sources, we elucidate previously unrecognized connections, for example demonstrating how self-normalized matrix inequalities follow easily upon combining ideas from the literature on self-normalized processes with those from matrix concentration.
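The supermartingale argument sketched above can be illustrated in its simplest special case. The following simulation is only a sketch under stated assumptions, not the chapter’s general proof: we assume increments that are independent fair $\pm 1$ coin flips, for which $L_t = \exp(\lambda S_t - \lambda^2 t/2)$ is a nonnegative supermartingale with $L_0 = 1$, since $\cosh\lambda \le e^{\lambda^2/2}$. Taking $\lambda = 2b$, the event $L_t \ge e^{2ab}$ is exactly $S_t \ge a + bt$, so Ville’s maximal inequality gives the line-crossing bound $e^{-2ab}$ of (1.4). The function name, the finite simulation horizon, and the parameter values are hypothetical choices made for illustration.

```python
import math
import random


def crossing_frequency(a, b, horizon, n_paths, seed=0):
    """Fraction of simulated +/-1 random walks with S_t >= a + b*t for some t <= horizon.

    With lam = 2*b, this is the fraction of paths on which the nonnegative
    supermartingale L_t = exp(lam*S_t - lam^2 * t / 2) reaches exp(2*a*b).
    """
    rng = random.Random(seed)
    lam = 2.0 * b
    log_threshold = 2.0 * a * b
    crossed = 0
    for _ in range(n_paths):
        s = 0
        for t in range(1, horizon + 1):
            s += 1 if rng.random() < 0.5 else -1
            # log L_t >= 2ab  <=>  lam*s - lam^2 * t / 2 >= 2ab  <=>  s >= a + b*t
            log_l = lam * s - 0.5 * lam * lam * t
            if log_l >= log_threshold:
                crossed += 1
                break
    return crossed / n_paths


# Illustrative parameters (assumptions). The finite horizon truncates the
# infinite-horizon event, so the simulated frequency can only undercount.
a, b = 1.0, 0.5
freq = crossing_frequency(a, b, horizon=200, n_paths=5000)
bound = math.exp(-2.0 * a * b)
print(f"empirical crossing frequency: {freq:.3f}, Ville bound: {bound:.3f}")
```

The comparison is done on the log scale to avoid overflow in the exponential; the simulated frequency should fall below the bound, up to Monte Carlo error.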


Chapter organization

Section 1.2 lays out our framework for exponential line-crossing inequalities. Specifically, we formally state Definition 1.1 and Theorem 1.1 that together describe a novel formulation of the Cramér-Chernoff method. After stating Theorem 1.1, we give a quick overview of existing results which can be recovered in our framework and the improvements thus obtained. A short proof of our master theorem comes next, and following some remarks, we provide three simple, illustrative examples.

Sections 1.3 and 1.4 are devoted to a catalog of important results from the literature on exponential concentration which fit into our framework, often yielding results which are stronger than those originally published. In Section 1.3, we consider the maximum-eigenvalue process of a matrix-valued martingale and enumerate useful sufficient conditions for such a process to be sub-ψ, collecting and in some cases generalizing a variety of ingenious results from the literature. Section 1.4 examines various instantiations of our master theorem, obtaining corollaries by combining one of the sufficient conditions from Section 1.3 with one of the four equivalent conclusions of Theorem 1.1. These illustrate how our framework recovers and strengthens existing exponential concentration results. We discuss sharpness, another geometrical insight, and future work in Section 1.5. Proofs of most results are in Section 1.6.

Historical context

To aid the reader, we give here some historical context for the existing results discussed below. This is not intended to be a comprehensive history of the literature on exponential concentration, and we focus on the specific results discussed in Section 1.4, giving pointers to further references as appropriate.

The Cramér-Chernoff method takes its name from the works of Cramér (1938) and Chernoff (1952). Both of these authors were concerned with a precise characterization of the asymptotic decay of tail probabilities beyond the regime in which the central limit theorem applies; Cramér provided the first proof of such a “large deviation principle”, while Chernoff gave a more general formulation and placed more emphasis on the non-asymptotic upper bound which is our focus. These results spawned a vast literature on large deviation principles, with the goal of giving sharp upper and lower bounds on the limiting exponential decay of certain probabilities under a sequence of measures; see Dembo and Zeitouni (2010) for an excellent presentation of this literature. Our focus, on non-asymptotic upper bounds for nonparametric classes of distributions, is rather different, though such upper bounds often make an appearance in proofs of large deviation principles.


Bernstein was perhaps the earliest proponent of the sort of exponential tail bounds that are the focus of this chapter, having proposed his famous inequality in 1911, according to Prokhorov (1995); see also Craig (1933), Uspensky (1937, ch. 10, ex. 12-14, pp. 204-205) and Bernstein (1927), though the last source appears rather inaccessible. The modern theory of exponential concentration began to take shape in the 1960s, as (using the terminology of this chapter, from Section 1.3) Bennett (1962) improved Bernstein’s sub-gamma inequality to sub-Bernoulli and sub-Poisson ones for random variables bounded from above. Hoeffding (1963) gave alternative sub-Bernoulli and sub-Gaussian bounds for random variables bounded from both above and below. For further references on this line of work, see Boucheron et al. (2013), whose treatment of the Cramér-Chernoff method has been invaluable in formulating our own framework, as well as McDiarmid (1998).

Godwin (1955, p. 936) reports that Bernstein generalized his inequality to dependent random variables. Hoeffding (1963, pp. 17-18) considered the generalization of his sub-Bernoulli and sub-Gaussian bounds to martingales and the possibility of finite-horizon uniform inequalities based on Doob’s maximal inequality; the martingale generalization was later explored by Azuma (1967). Freedman (1975) extended Bennett’s sub-Poisson bound to martingales, giving a uniform bound subject to a maximum value of the predictable quadratic variation of the martingale. This “Freedman-style” bound has been generalized to other settings in many subsequent works (de la Peña, 1999; Khan, 2009; Tropp, 2011; Fan et al., 2015).

The extension of these methods to matrix-valued processes, via control of the matrix moment-generating function, originated with Ahlswede and Winter (2002). The method was refined by Christofides and Markström (2007), Oliveira (2010a,b) and then by Tropp (2011, 2012), whose influential treatment synthesized and improved upon past work, generalizing many scalar exponential inequalities to operator-norm inequalities for matrix martingales. We have incorporated Tropp’s formulation into our framework, and we focus on his theorem statements for our matrix bound statements. See Tropp (2015) for a recent exposition and further references.

There is a long history of investigation of the concentration of Student’s t-statistic under non-normal sampling. Efron (1969) gives many references to early work. He also shows, by making use of Hoeffding’s sub-Gaussian bound, that the equivalent self-normalized statistic $\left(\sum_i X_i\right) / \sqrt{\sum_i X_i^2}$ satisfies a 1-sub-Gaussian tail bound whenever the $X_i$ satisfy a symmetry condition, a result he attributes to Bahadur and Eaton (Efron, 1969, p. 1284). Starting with Logan et al. (1973), there has been a great deal of work on limiting distributions and large deviation principles for self-normalized statistics; see Shao (1997) and references therein. In terms of exponential tail bounds, de la Peña (1999) explored general conditions for bounding the deviations of a martingale, introduced new decoupling techniques (cf. de la Peña and Giné, 1999), and showed that any martingale with conditionally symmetric increments satisfies a self-normalized sub-Gaussian bound with no integrability condition. This work laid the foundation for the type of self-normalized exponential inequalities which we explore in this chapter. These methods were extended by de la Peña et al. (2000, 2004), which introduced a general supermartingale “canonical assumption” that is a key precursor of our sub-ψ condition, and initiated a flurry of subsequent activity on self-normalized exponential inequalities (cf. de la Peña et al., 2007; de la Peña, Klass and Lai, 2009). We note in particular inequality (3.9) of de la Peña et al. (2001), which gives an infinite-horizon boundary-crossing inequality based on a mixture extension of their canonical assumption, as well as the multivariate inequalities (3.24) (for a t-statistic) and (3.29) (for general mixture boundaries) given by de la Peña, Klass and Lai (2009). Bercu and Touati (2008) gave a self-normalized sub-Gaussian bound without symmetry by incorporating the conditional quadratic variation, requiring only finite second moments, and some ingenious further extensions have been given by Delyon (2009), Fan et al. (2015), and Bercu et al. (2015), many of which we include in our collection of sufficient conditions for a process to be sub-ψ (Section 1.3). See de la Peña, Lai and Shao (2009) and Bercu et al. (2015) for further references.

Ville’s maximal inequality for nonnegative supermartingales, the technical underpinning of Theorem 1.1, originates with Ville (1939, p. 101). It is commonly attributed to Doob, though Doob acknowledged Ville’s priority extensively in his works, e.g., Doob (1940, pp. 458-460). Mazliak and Shafer (2009) contains further historical discussion and sources.

1.2 Main results

Let $(S_t)_{t \in \mathcal{T} \cup \{0\}}$ be a real-valued process adapted to an underlying filtration $(\mathcal{F}_t)_{t \in \mathcal{T} \cup \{0\}}$, where either $\mathcal{T} = \mathbb{N}$ for discrete-time processes or $\mathcal{T} = (0, \infty)$ for continuous-time processes. In continuous time, we assume $(\mathcal{F}_t)$ satisfies the “usual hypotheses”, namely, that it is right-continuous and complete, and we assume $(S_t)$ is càdlàg; see, e.g., Protter (2005). In a statistical setting, we may think of $(S_t)$ as a summary statistic accumulating over time, for example a cumulative sum of observations, whose deviations from zero we would like to bound under some null hypothesis. In this setting, a bound on the deviations of $(S_t)$ holding uniformly over time can be used to construct an appropriate sequential hypothesis test, a special case of which is Wald’s sequential probability ratio test discussed in Section 1.4. We first explain our key condition on $(S_t)$, the sub-ψ condition. We then state, prove, and interpret our main theorem.

The sub-ψ condition

Our key condition on $(S_t)$ is stated in terms of two additional objects. The first object is a real-valued, nondecreasing process $(V_t)_{t \in \mathcal{T} \cup \{0\}}$, also adapted to $(\mathcal{F}_t)$ (and càdlàg in the continuous-time case), an “accumulated variance” process which serves as a measure of intrinsic time, an appropriate quantity to control the deviations of $S_t$ from zero (Blackwell and Freedman, 1973). The second object is a function $\psi : \mathbb{R}_{\ge 0} \to \mathbb{R}$, reminiscent of a cumulant-generating function, which quantifies the relationship between $S_t$ and $V_t$. The simplest case is when $S_t$ is a cumulative sum of i.i.d., real-valued, mean-zero random variables with distribution $F$, in which case we take $V_t = t$ and let $\psi(\lambda) = \log \int e^{\lambda x} \, dF(x)$ be the CGF of $F$. Our key condition requires that $S_t$ is unlikely to grow too quickly relative to intrinsic time $V_t$; it generalizes developments from Freedman (1975); de la Peña et al. (2004); Tropp (2011), and others.

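Definition 1.1 (Sub-ψ process). Let $(S_t)_{t \in \mathcal{T} \cup \{0\}}$ and $(V_t)_{t \in \mathcal{T} \cup \{0\}}$ be two real-valued processes adapted to an underlying filtration $(\mathcal{F}_t)_{t \in \mathcal{T} \cup \{0\}}$ with $S_0 = V_0 = 0$ a.s. and $V_t \ge 0$ a.s. for all $t \in \mathcal{T}$. For a function $\psi : [0, \lambda_{\max}) \to \mathbb{R}$ and a scalar $l_0 \in [1, \infty)$, we say $(S_t)$ is $l_0$-sub-ψ with variance process $(V_t)$ if, for each $\lambda \in [0, \lambda_{\max})$, there exists a supermartingale $(L_t(\lambda))_{t \in \mathcal{T} \cup \{0\}}$ with respect to $(\mathcal{F}_t)$ such that $L_0(\lambda) \le l_0$ a.s. and

$$\exp\{\lambda S_t - \psi(\lambda) V_t\} \le L_t(\lambda) \quad \text{a.s. for all } t \in \mathcal{T}. \tag{1.9}$$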

We often say simply that a process is sub-ψ, omitting $l_0$ from our terminology for simplicity. For all cases considered in this chapter, we have either $l_0 = 1$, when deriving one-sided bounds on scalar martingales; $l_0 = 2$, when deriving bounds on the norm of certain Banach-space-valued martingales; or $l_0 = d$, when deriving bounds on the maximum-eigenvalue process of a $d \times d$ matrix-valued martingale. We also wish to point out that, although we often speak of a process $(S_t)$ being sub-ψ, the sub-ψ condition formally applies to the pair $(S_t, V_t)$ and not to the process $(S_t)$ alone, so that meaningful statements are always made in the context of a specific intrinsic time process $(V_t)$.

Although Definition 1.1 may defy intuition upon first glance, we can motivate it from several angles:

• Suppose $S_t$ is a scalar-valued martingale whose deviations we wish to bound uniformly over time. We might like to apply Ville’s maximal inequality (see Section 1.2), but must first transform $S_t$ into a nonnegative supermartingale. It is natural to consider the exponential transform $e^{\lambda S_t}$ for some $\lambda > 0$, which immediately yields a submartingale. Our task, then, is to find some appropriate ψ and $(V_t)$ which “pull down” the submartingale so that the process $\exp\{\lambda S_t - \psi(\lambda) V_t\}$ is a supermartingale. Intuitively, the exponential process $\exp\{\lambda S_t - \psi(\lambda) V_t\}$ measures how quickly $S_t$ has grown relative to intrinsic time $V_t$, and the free parameter λ determines the relative emphasis placed on the tails of the distribution of $S_t$, i.e., on the higher moments. Larger values of λ exaggerate larger movements in $S_t$, and ψ captures how much we must correspondingly exaggerate $V_t$.

• Consider again the simple case in which $S_t$ is a cumulative sum of i.i.d. draws from a distribution $F$ over the reals with mean zero and CGF $\psi(\lambda) < \infty$ for $\lambda \in [0, \lambda_{\max})$. Then, setting $V_t = t$, we may take $L_t(\lambda)$ equal to the exponential process $\exp\{\lambda S_t - \psi(\lambda) t\}$, which is a martingale in this case, so that the defining inequality of Definition 1.1 is an equality. The exponential process may be interpreted as the likelihood ratio in an exponential family passing through $F$ with sufficient statistic $S_t$. See Example 1.2 for a more detailed exposition of this setting and Section 1.4 for more on the connection with exponential families.

• Alternatively, we may begin from the martingale method for concentration inequalities (Hoeffding, 1963; Azuma, 1967; McDiarmid, 1998; Raginsky and Sason, 2012, section 2.2), itself based on the classical Cramér-Chernoff method (Cramér, 1938; Chernoff, 1952; Boucheron et al., 2013, section 2.2). The martingale method starts from an assumption such as $\mathbb{E}\left[ e^{\lambda (X_t - \mathbb{E}(X_t \mid \mathcal{F}_{t-1}))} \,\middle|\, \mathcal{F}_{t-1} \right] \le e^{\psi(\lambda) \sigma_t^2}$ for all $t \ge 1$ and $\lambda \in [0, \lambda_{\max})$. When $\psi(\lambda) = \lambda^2/2$ and $\lambda_{\max} = \infty$ (and the condition holds for $\lambda < 0$ as well), this is the definition of a conditionally sub-Gaussian random variable with variance parameter $\sigma_t^2$. When $\psi(\lambda) = \lambda^2 / (2(1 - c\lambda))$ and $\lambda_{\max} = 1/c$, we have the definition of a random variable which is conditionally sub-gamma on the right tail with variance parameter $\sigma_t^2$ and scale parameter $c$ (Boucheron et al., 2013). Writing $S_t := \sum_{i=1}^t (X_i - \mathbb{E}_{i-1} X_i)$ and $V_t := \sum_{i=1}^t \sigma_i^2$, the process $\exp\{\lambda S_t - \psi(\lambda) V_t\}$ is then a supermartingale for each $\lambda \in [0, \lambda_{\max})$. For example, if $\Delta S_t \in [a_t, b_t]$ for all $t$, then $(S_t)$ is 1-sub-ψ with $\psi(\lambda) = \lambda^2/2$ on $\lambda \in [0, \infty)$, and $V_t = \sum_{i=1}^t \left( \frac{b_i - a_i}{2} \right)^2$; this fact underlies Example 1.1(a,b). Or, if $S_t \le 1$ for all $t$, then $(S_t)$ is 1-sub-ψ with $\psi(\lambda) = e^{\lambda} - \lambda - 1$ on $\lambda \in [0, \infty)$, a fact which leads to Example 1.1(c).

• Unlike the martingale method assumption, Definition 1.1 allows $(V_t)$ to be adapted rather than predictable, which leads to a variety of self-normalized inequalities (de la Peña, 1999; de la Peña et al., 2004; de la Peña, Lai and Shao, 2009; Bercu et al., 2015; Fan et al., 2015), for example yielding bounds on the deviation of a martingale in terms of its quadratic variation. In this context, Definition 1.1 is closely related to the “canonical assumption” of de la Peña et al. (2004, eq. 1.6), which requires that $\exp\{\lambda S_t - \Phi(\lambda V_t)\}$ is a supermartingale for certain nonnegative, strictly convex functions Φ. We have found it more useful to separate the second term into $\psi(\lambda) V_t$, though both formulations yield interesting results. For example, if $\Delta S_t \sim -\Delta S_t \mid \mathcal{F}_{t-1}$, then $(S_t)$ is 1-sub-ψ with $\psi(\lambda) = \lambda^2/2$ over $\lambda \in [0, \infty)$, and $V_t = \sum_{i=1}^t \Delta S_i^2$, from which we may obtain Example 1.1(d).

• Also in contrast to de la Peña et al. (2004), we allow the exponential process to be merely upper bounded by a supermartingale, rather than being a supermartingale itself; this permits us to handle bounds on the maximum eigenvalue process of a matrix-valued martingale, using techniques from Tropp (2011). For example, under the conditions of Example 1.1(e), the maximum eigenvalue process $(S_t)$ is $d$-sub-ψ with $\psi(\lambda) = \lambda^2 / [2(1 - c\lambda)]$ on $\lambda \in [0, 1/c)$. In this case, the exponential process $\exp\{\lambda S_t - \psi(\lambda) V_t\}$ is not a supermartingale, but is upper bounded by the trace-exponential supermartingale $\operatorname{tr} \exp\{\lambda Y_t - \psi(\lambda) W_t\}$. The initial value of this trace-exponential process is $l_0 = d$, which leads to the pre-factor of $d$ in the bound (1.7).
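The i.i.d. motivation above can be checked by brute force in a small case. The sketch below is an illustration only: it assumes fair ±1 steps, for which the CGF is log cosh(λ), and the helper names are hypothetical. It enumerates all sign sequences to confirm that the exponential process has mean exactly one when ψ is the CGF, and mean at most one when ψ(λ) = λ²/2 is used instead, which is consistent with the martingale and supermartingale properties respectively (a constant or shrinking mean is necessary for these properties, not a proof of them).

```python
import math
from itertools import product

def cgf_rademacher(lam):
    # CGF of a fair +/-1 step (assumed for this sketch): log E exp(lam * X).
    return math.log(math.cosh(lam))

def psi_normal(lam):
    # Sub-Gaussian psi from the text: psi(lam) = lam^2 / 2.
    return lam ** 2 / 2

def mean_exponential_process(lam, t, psi):
    # E exp{lam * S_t - psi(lam) * V_t} with V_t = t, computed exactly by
    # enumerating all 2^t equally likely sign sequences.
    total = 0.0
    for signs in product((-1, 1), repeat=t):
        total += math.exp(lam * sum(signs) - psi(lam) * t)
    return total / 2 ** t

for lam in (0.3, 1.0, 2.0):
    with_cgf = mean_exponential_process(lam, 8, cgf_rademacher)
    with_normal = mean_exponential_process(lam, 8, psi_normal)
    print(f"lambda={lam}: CGF {with_cgf:.6f}, lam^2/2 {with_normal:.6f}")
```

Using the exact CGF gives equality in Definition 1.1, while the larger function λ²/2 gives the inequality, at the cost of a looser but more convenient ψ.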

Section 1.3 collects a variety of sufficient conditions from the literature for a process to be sub-ψ, including all of the examples given above. These conditions illustrate the broad applicability of Definition 1.1 in nonparametric settings, i.e., those which restrict the distribution of $(S_t)$ to some infinite-dimensional class, for example all processes with bounded increments, or with increments having finite variance. Even in such nonparametric cases, ψ is still a CGF of some distribution in all of our examples, though this is not required for the most basic conclusion of Theorem 1.1. Indeed, the full force of Theorem 1.1 comes into effect only when ψ satisfies certain properties which hold for CGFs of zero-mean, non-constant random variables (Jorgensen, 1997, Theorem 2.3):

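Definition 1.2. A real-valued function ψ with domain $[0, \lambda_{\max})$ is called CGF-like if it is strictly convex and twice continuously differentiable with $\psi(0) = \psi'(0_+) = 0$ and $\sup_{\lambda \in [0, \lambda_{\max})} \psi(\lambda) = \infty$. For such a function we define $\bar b = \bar b(\psi) := \sup_{\lambda \in [0, \lambda_{\max})} \psi'(\lambda) \in (0, \infty]$.

In many typical cases we have $\lambda_{\max} = \infty$ and $\bar b = \infty$. With Definitions 1.1 and 1.2 in place, we are ready to state our main result.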

The master theorem

To state our main theorem on general exponential line-crossing inequalities, we will make use of the following transforms of ψ:

The Legendre-Fenchel transform $\psi^\star(u) := \sup_{\lambda \in [0, \lambda_{\max})} [\lambda u - \psi(\lambda)]$, for $u \ge 0$.

The “decay” transform $D(u) := \sup \left\{ \lambda \in (0, \lambda_{\max}) : \frac{\psi(\lambda)}{\lambda} \le u \right\}$, for $u \ge 0$.

The “slope” transform $s(u) := \frac{\psi(\psi^{\star\prime}(u))}{\psi^{\star\prime}(u)}$, for $u \in (0, \bar b)$.

In the definition of $D(u)$, we take the supremum of the empty set to equal zero instead of the usual $-\infty$. For $u > 0$, this case can arise in general, but not when ψ is CGF-like. Note that $D(u)$ can also be infinite. We call $D(u)$ the “decay” transform because it determines the rate of exponential decay of the upcrossing probability bound in Theorem 1.1(a) below. We call $s(u)$ the “slope” transform because it gives the slope of the linear boundary in Theorem 1.1(b); this is defined only when ψ is CGF-like. Defining $s(0) = 0$ and $s(\bar b) = \bar b$ when $\bar b < \infty$, we find that $s(u)$ is continuous, strictly increasing, and $0 \le s(u) < u$ on $u \in [0, \bar b)$ (see Lemma 1.2).
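The three transforms can be evaluated numerically for a given ψ. The following is a minimal sketch, assuming ψ is CGF-like so that ψ(λ)/λ and ψ′(λ) are increasing and bisection applies; the function names, the bisection helper, and the choice of the sub-gamma ψ(λ) = λ²/(2(1 − cλ)) with c = 1 are illustrative assumptions rather than anything prescribed by the text.

```python
def sup_where(pred, lo, hi, iters=200):
    # Largest point of (lo, hi] at which pred holds, assuming pred is true
    # on an initial stretch and false afterwards (bisection). Returns lo if
    # pred fails everywhere, matching the convention sup(empty set) = 0.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pred(mid):
            lo = mid
        else:
            hi = mid
    return lo

def psi_gamma(lam, c=1.0):
    # Sub-gamma psi with scale c, defined for 0 <= lam < 1/c.
    return lam ** 2 / (2 * (1 - c * lam))

def dpsi_gamma(lam, c=1.0):
    # Derivative of psi_gamma, worked out by hand for this sketch.
    return lam * (2 - c * lam) / (2 * (1 - c * lam) ** 2)

def transforms(psi, dpsi, u, lam_hi):
    # decay: D(u) = sup{lam : psi(lam)/lam <= u}
    decay = sup_where(lambda lam: psi(lam) / lam <= u, 0.0, lam_hi)
    # lam_star = (psi')^{-1}(u), which equals the derivative of psi* at u
    lam_star = sup_where(lambda lam: dpsi(lam) <= u, 0.0, lam_hi)
    legendre = u * lam_star - psi(lam_star)   # psi*(u)
    slope = psi(lam_star) / lam_star          # s(u)
    return {"legendre": legendre, "decay": decay, "slope": slope}

# lam_hi sits just below lambda_max = 1/c = 1.
print(transforms(psi_gamma, dpsi_gamma, u=4.0, lam_hi=1.0 - 1e-9))
```

For this ψ and u = 4, solving the defining equations by hand gives ψ⋆(4) = 2, D(4) = 8/9 and s(4) = 1, which the numerical values should match closely; note that s(4) < 4, as the text asserts.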

Our main theorem has four parts, each of which facilitates comparisons with a particular related literature, as we discuss in Section 1.4. Recall Definition 1.1 of a sub-ψ process and the underlying filtration (Ft) to which (St) and (Vt) are adapted.

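Theorem 1.1. If $(S_t)$ is $l_0$-sub-ψ with variance process $(V_t)$, then

(a) For any $a, b > 0$, we have

$$\mathbb{P}\left( \exists t \in \mathcal{T} : S_t \ge a + b V_t \mid \mathcal{F}_0 \right) \le l_0 \exp\{-a D(b)\}.$$

Additionally, whenever ψ is CGF-like, the following three statements are equivalent to statement (a).

(b) For any $m > 0$ and $x \in (0, m \bar b)$, we have

$$\mathbb{P}\left( \exists t \in \mathcal{T} : S_t \ge x + s\!\left(\frac{x}{m}\right) \cdot (V_t - m) \,\middle|\, \mathcal{F}_0 \right) \le l_0 \exp\left\{ -m \psi^\star\!\left(\frac{x}{m}\right) \right\}.$$

(c) For any $m > 0$ and $x \in (0, \bar b)$, we have

$$\mathbb{P}\left( \exists t \in \mathcal{T} : \frac{S_t}{V_t} \ge x - \left( \frac{x - s(x)}{V_t} \right) \cdot (V_t - m) \,\middle|\, \mathcal{F}_0 \right) \le l_0 \exp\{-m \psi^\star(x)\}.$$

(d) For any $m \ge 0$, $x > 0$ and $b > 0$, we have (below we take $m \bar b = \infty$ whenever $\bar b = \infty$)

$$\mathbb{P}\left( \exists t \in \mathcal{T} : V_t \ge m \text{ and } S_t \ge x + b(V_t - m) \mid \mathcal{F}_0 \right) \le \begin{cases} l_0 \exp\left\{ -(x - (b \wedge \bar b) m) D(b) \right\}, & x > m \bar b \text{ or } s\!\left(\frac{x}{m}\right) > b, \\ l_0 \exp\left\{ -m \psi^\star\!\left(\frac{x}{m}\right) \right\}, & x \le m \bar b \text{ and } s\!\left(\frac{x}{m}\right) \le b. \end{cases} \tag{1.10}$$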

We give a straightforward proof in Section 1.2 that uses only Ville’s maximal inequality for nonnegative supermartingales (Ville, 1939) and elementary convex analysis. Theorem 1.1 can be seen to unify and strengthen many known exponential bounds, showing that we lose nothing in going from a fixed-time to a uniform bound. This includes classical inequalities by Hoeffding (Corollary 1.1a), Bennett and Freedman (Corollary 1.1b), and Bernstein (Corollary 1.1c), along with their matrix extensions due to Tropp and Mackey et al. (Corollary 1.1a-c); discrete-time scalar line-crossing inequalities due to Blackwell (Corollaries 1.4 and 1.5) and Khan (Section 1.4); self-normalized bounds due to de la Peña (Corollaries 1.6 and 1.7), Delyon (Corollary 1.8), Bercu and Touati (Corollary 1.8), and Fan (Corollary 1.9); bounds for martingales in smooth Banach spaces due to Pinelis (Corollary 1.10); continuous-time bounds due to Shorack and Wellner (Corollary 1.11) and van de Geer (Corollary 1.11); and Wald’s sequential probability ratio test (Corollary 1.12). Visualizations of how the bounds of Theorem 1.1 relate to Freedman’s and de la Peña’s inequalities are provided in Figures 1.4 and 1.5. For convenience, Table 1.1 lists the existing results we recover and our corresponding corollaries, along with ways in which our analysis strengthens conclusions.

For the remainder of the chapter after Section 1.2, we will assume $\mathcal{F}_0$ is the trivial σ-field and omit from our notation the conditioning on $\mathcal{F}_0$ in the results of Theorem 1.1 and its corollaries.
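Theorem 1.1(a) can be checked exactly in a small discrete case. The sketch below assumes a fair ±1 random walk, which has increments in [−1, 1] and is therefore 1-sub-ψ with ψ(λ) = λ²/2 and V_t = t; for this ψ the decay transform is D(b) = 2b, so the bound reads exp(−2ab). The function name and the finite horizon are illustrative choices: the dynamic program computes the crossing probability up to the horizon, which can only be smaller than the infinite-horizon probability the theorem controls.

```python
import math

def line_crossing_probability(a, b, horizon):
    # Exact P(exists t <= horizon : S_t >= a + b*t) for a fair +/-1 walk.
    # alive[s] = P(S_t = s and the line has not been crossed up to time t).
    alive = {0: 1.0}
    crossed = 0.0
    for t in range(1, horizon + 1):
        nxt = {}
        for s, prob in alive.items():
            for step in (-1, 1):
                s_new = s + step
                if s_new >= a + b * t:
                    crossed += prob / 2      # first crossing happens at time t
                else:
                    nxt[s_new] = nxt.get(s_new, 0.0) + prob / 2
        alive = nxt
    return crossed

for a, b in ((2.0, 0.5), (3.0, 0.25), (5.0, 0.1)):
    exact = line_crossing_probability(a, b, horizon=400)
    bound = math.exp(-2 * a * b)             # l0 * exp(-a * D(b)), l0 = 1
    print(f"a={a}, b={b}: exact {exact:.4f} <= bound {bound:.4f}")
```

The gap between the exact probability and the bound reflects both the discreteness of the walk and the looseness of λ²/2 relative to the true CGF log cosh(λ).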

Proof of Theorem 1.1

Throughout the proof, we write $\mathbb{P}_0(\cdot)$ for the conditional probability $\mathbb{P}(\cdot \mid \mathcal{F}_0)$. Ville’s maximal inequality for nonnegative supermartingales (Ville, 1939; Durrett, 2017, exercise 4.8.2) is the foundation of all uniform bounds in this chapter. It is an infinite-horizon uniform extension of Markov’s inequality:

Lemma 1.1 (Ville’s inequality). If $(L_t)_{t \in \mathcal{T} \cup \{0\}}$ is a nonnegative supermartingale with respect to the filtration $(\mathcal{F}_t)_{t \in \mathcal{T} \cup \{0\}}$, then for any $a > 0$, we have

$$\mathbb{P}_0(\exists t \in \mathcal{T} : L_t \ge a) \le \frac{L_0}{a}. \tag{1.11}$$
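Ville’s inequality can be illustrated with an exact computation on a toy process. The sketch below is a hypothetical example, not taken from the text: it assumes the product martingale L_t = Y_1 ⋯ Y_t with L_0 = 1 and i.i.d. factors Y_i equal to 1.5 or 0.5 with probability 1/2 each, so that each factor has mean one. A dynamic program over the number of up-moves gives the probability that L_t ever reaches level a within a finite horizon.

```python
def ville_crossing_probability(a, horizon, up=1.5, down=0.5):
    # L_t = up^k * down^(t - k) after k up-moves; (up + down) / 2 = 1 makes
    # (L_t) a nonnegative martingale started at L_0 = 1.
    # alive[k] = P(k up-moves so far and L has stayed strictly below a).
    alive = {0: 1.0}
    crossed = 0.0
    for t in range(1, horizon + 1):
        nxt = {}
        for k, prob in alive.items():
            for k_new in (k, k + 1):
                value = up ** k_new * down ** (t - k_new)
                if value >= a:
                    crossed += prob / 2      # first time L reaches level a
                else:
                    nxt[k_new] = nxt.get(k_new, 0.0) + prob / 2
        alive = nxt
    return crossed

for a in (2.0, 5.0, 10.0):
    exact = ville_crossing_probability(a, horizon=200)
    print(f"a={a}: P(max L_t >= a) = {exact:.4f} <= L_0 / a = {1 / a:.4f}")
```

Markov’s inequality would give the same bound 1/a at any single time t; the point of the lemma is that the bound survives taking the maximum over all times.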

Existing result Our result [A] [B] [C] [D] [E]

Bernstein (1927) Corollary 1.1(c) X X X

Bennett (1962, eq. 8b) Corollary 1.1(b) X X X X

Hoeffding (1963, Theorem 2) Corollary 1.1(a) X X X

Freedman (1975, Theorem 1.6) Corollary 1.1(b) X X X

Shorack and Wellner (1986, eq. B.1) Corollary 1.11(b) X

Pinelis (1994, Theorems 3.4, 3.5) Corollary 1.10 X

van de Geer (1995, Lemma 2.2) Corollary 1.11(c) X X

Blackwell (1997, Theorem 1) Corollary 1.4(a) X X X

Blackwell (1997, Theorem 2) Corollary 1.5 X

Blackwell (1997, Theorem 2) Corollary 1.4(b) X X X

de la Peña (1999, Thms. 6.1, 1.2B) Corollary 1.6 X X X

de la Peña (1999, Theorem 6.2) Corollary 1.7 X X X

Bercu and Touati (2008, Thm. 2.1) Corollary 1.8 X X X

Delyon (2009, Theorem 4) Corollary 1.8 X X

Khan (2009, Theorem 4.2) Theorem 1.1(b) X X X

Khan (2009, Theorem 4.3) Theorem 1.1(d) X X X

Tropp (2011, Theorem 1.2) Corollary 1.1(b) X

Tropp (2012, Theorem 1.3) Corollary 1.1(a) X X

Tropp (2012, Theorem 1.4) Corollary 1.1(c) X

Mackey et al. (2014, Corollary 4.2) Corollary 1.1(a) X X

Table 1.1: Some existing results which are strengthened by Theorem 1.1, as detailed in Section 1.4. For clarity, we enumerate the different ways in which we strengthen or generalize existing results with the following mnemonics:

[A] Assumptions: we recover the result under weaker conditions on the distributional or dependence structure of the process.

[B] Boundary: we strengthen the result by replacing a fixed-time bound or a finite-horizon constant uniform boundary with an infinite-horizon linear uniform boundary which is everywhere at least as strong (i.e., low) as the fixed-time or finite-horizon bound.

[C] Continuous time: we extend a discrete-time result to include continuous time.

[D] Dimension: we extend a result for scalar process to one for $\mathcal{H}^d$-valued processes, recovering the scalar result at $d = 1$.

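Applying Ville’s inequality to Definition 1.1 gives, for any $\lambda \in (0, \lambda_{\max})$ and $z \in \mathbb{R}$,

$$\mathbb{P}_0\left( \exists t \in \mathcal{T} : \exp\{\lambda S_t - \psi(\lambda) V_t\} \ge e^z \right) \le \mathbb{P}_0\left( \exists t \in \mathcal{T} : L_t \ge e^z \right) \le L_0 e^{-z} \le l_0 e^{-z}. \tag{1.12}$$

To derive Theorem 1.1(a) from (1.12), fix $a, b > 0$ and choose $\lambda \in [0, \lambda_{\max})$ such that $\psi(\lambda) \le b\lambda$, supposing for the moment that some such value of λ exists. Then

$$\begin{aligned} \mathbb{P}_0\left( \exists t \in \mathcal{T} : S_t \ge a + b V_t \right) &= \mathbb{P}_0\left( \exists t \in \mathcal{T} : \exp\{\lambda S_t - b \lambda V_t\} \ge e^{a\lambda} \right) \\ &\le \mathbb{P}_0\left( \exists t \in \mathcal{T} : \exp\{\lambda S_t - \psi(\lambda) V_t\} \ge e^{a\lambda} \right) \\ &\le l_0 e^{-a\lambda}, \end{aligned}$$

applying (1.12) in the last step. This bound holds for all choices of λ in the set $\{\lambda \in [0, \lambda_{\max}) : \psi(\lambda)/\lambda \le b\}$, so to minimize the final bound, we take the supremum over this set, recovering the stated bound $l_0 e^{-a D(b)}$ by the definition of $D(b)$. If no value $\lambda \in [0, \lambda_{\max})$ satisfies $\psi(\lambda) \le b\lambda$, then $D(b) = 0$ by definition, so that the bound holds trivially. This shows that Definition 1.1 implies Theorem 1.1(a).

To complete the proof we will show that the four parts of Theorem 1.1 are equivalent whenever ψ is CGF-like. We repeatedly use the well-known fact about the Legendre-Fenchel transform that ${\psi'}^{-1}(u) = \psi^{\star\prime}(u)$ for $0 < u < \bar b$, which follows by differentiating the identity $\psi^\star(u) = u {\psi'}^{-1}(u) - \psi({\psi'}^{-1}(u))$. We also require some simple facts about $\psi(\lambda)/\lambda$: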

Lemma 1.2. Suppose ψ is CGF-like with domain $[0, \lambda_{\max})$.

(i) $\psi(\lambda)/\lambda < \psi'(\lambda)$ for all $\lambda \in (0, \lambda_{\max})$.

(ii) $\lambda \mapsto \psi(\lambda)/\lambda$ is continuous and strictly increasing on $\lambda > 0$.

(iii) $\inf_{\lambda \in (0, \lambda_{\max})} \psi(\lambda)/\lambda = \lim_{\lambda \downarrow 0} \psi(\lambda)/\lambda = 0$.

(iv) $\sup_{\lambda \in (0, \lambda_{\max})} \psi(\lambda)/\lambda = \lim_{\lambda \uparrow \lambda_{\max}} \psi(\lambda)/\lambda = \bar b$.

(v) $\psi(D(b))/D(b) = b$ for any $b \in (0, \bar b)$. That is, $D(b)$ is the inverse of $\psi(\lambda)/\lambda$.

(vi) $s(u)$ is continuous, strictly increasing, and $0 < s(u) < u$ for all $u \in (0, \bar b)$.

Proof of Lemma 1.2. To see (i), write $\psi(\lambda) = \int_0^\lambda \psi'(t) \, dt < \lambda \psi'(\lambda)$, where the inequality follows since ψ is strictly convex so that $\psi'$ is strictly increasing. For (ii), the function is continuous because ψ is continuous, and differentiating reveals it to be strictly increasing by part (i). L’Hôpital’s rule implies (iii) along with the assumptions $\psi(\lambda) = \psi'(\lambda) = 0$ at $\lambda = 0$, and implies (iv) along with the CGF-like assumption $\sup_\lambda \psi(\lambda) = \infty$, which means $\psi(\lambda) \uparrow \infty$ as $\lambda \uparrow \lambda_{\max}$ since ψ is convex. Part (v) follows from the definition of $D(\cdot)$ and parts (ii), (iii) and (iv). To obtain (vi), note that $s$ is the composition of $\lambda \mapsto \psi(\lambda)/\lambda$ with $\psi^{\star\prime}$. Both of these are continuous and strictly increasing, the former by part (ii) and the latter since $\psi^{\star\prime} = {\psi'}^{-1}$ and $\psi'$ is continuous and strictly increasing by the CGF-like assumption. As $u \downarrow 0$, we have $\psi^{\star\prime}(u) = {\psi'}^{-1}(u) \downarrow 0$, so $s(u) \downarrow 0$ since $\psi(0) = \psi'(0_+) = 0$. Likewise, if $\bar b < \infty$, then as $u \uparrow \bar b$, $\psi^{\star\prime}(u) \uparrow \lambda_{\max}$ and $s(u) \uparrow \bar b$. Hence $s(u)$ is continuous as defined. Next, note that $\psi(u) > 0$ for $u > 0$ since ψ is strictly convex with $\psi(0) = \psi'(0_+) = 0$, and $\psi^{\star\prime}(u) = {\psi'}^{-1}(u) > 0$ since $\psi'(\lambda)$ increases from zero at $\lambda = 0$ to $\bar b$ as $\lambda \uparrow \lambda_{\max}$. Hence $s(u) > 0$ for $u > 0$. Finally, use part (i) to write $s(u) = \psi(\psi^{\star\prime}(u)) / \psi^{\star\prime}(u) < \psi'(\psi^{\star\prime}(u)) = u$, using the fact that $\psi^{\star\prime}(u) = {\psi'}^{-1}(u)$ for $u \in (0, \bar b)$.

Lemma 1.2 allows us to prove the equivalences among the parts of Theorem 1.1 as follows.

• (a) ⇒ (b): Fix $m > 0$ and $x \in (0, m \bar b)$. Any line with slope $b \in (0, x/m)$ and intercept $x - bm$ passes through the point $(m, x)$ in the $(V_t, S_t)$ plane, and part (a) yields

$$\mathbb{P}_0\left( \exists t \in \mathcal{T} : S_t \ge x + b(V_t - m) \right) \le l_0 \exp\{-(x - bm) D(b)\} = l_0 \exp\left\{ -m \left[ \frac{x}{m} \cdot D(b) - \psi(D(b)) \right] \right\},$$

using Lemma 1.2(v) in the second step. Now we choose the slope $b$ to minimize the probability bound. The unconstrained optimizer $b^\star$ satisfies $\psi'(D(b^\star)) = x/m$, and a solution is guaranteed to exist by our restriction on $x$. This solution is given by $D(b^\star) = {\psi'}^{-1}(x/m) = \psi^{\star\prime}(x/m)$. Hence $b^\star = s(x/m)$ using Lemma 1.2(v) and the definition of $s(\cdot)$. Lemma 1.2(vi) shows $0 < b^\star < x/m$, verifying that $b^\star$ is feasible. Identify the Legendre-Fenchel transformation $\psi^\star(x/m) = (x/m) D(b^\star) - \psi(D(b^\star))$ to complete the proof of part (b).

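• (b) ⇒ (c): Fix $m > 0$ and $x \in (0, \bar b)$ and observe that

$$\mathbb{P}_0\left( \exists t \in \mathcal{T} : \frac{S_t}{V_t} \ge x - \left( \frac{x - s(x)}{V_t} \right) \cdot (V_t - m) \right) = \mathbb{P}_0\left( \exists t \in \mathcal{T} : S_t \ge mx + s(x) \cdot (V_t - m) \right).$$

Applying part (b) with $mx \in (0, m \bar b)$ in place of $x$ bounds the right-hand side by $l_0 \exp\{-m \psi^\star(x)\}$, which is part (c).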

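• (c) ⇒ (a): Fix $a, b > 0$. Suppose first that $b < \bar b$, and set $x = \psi'(D(b))$ and $m = a / (x - s(x))$. Recalling $\psi^{\star\prime} = {\psi'}^{-1}$ we see that $s(x) = \psi(D(b))/D(b) = b$ by Lemma 1.2(v). Also, Lemma 1.2(vi) shows that $m > 0$. Now apply part (c) to obtain

$$\mathbb{P}_0\left( \exists t \in \mathcal{T} : S_t \ge a + b V_t \right) \le l_0 \exp\left\{ -a \cdot \frac{\psi^\star(x)}{x - s(x)} \right\} = l_0 \exp\left\{ -a \cdot \frac{\psi^\star(x) \cdot \psi^{\star\prime}(x)}{x \psi^{\star\prime}(x) - \psi(\psi^{\star\prime}(x))} \right\}.$$

Recognizing the Legendre-Fenchel transform in the denominator of the final exponent, we see that the probability bound equals $l_0 \exp\{-a \psi^{\star\prime}(x)\}$. Again using $\psi^{\star\prime}(x) = {\psi'}^{-1}(x) = D(b)$ yields part (a).

If instead $b \ge \bar b$, then the above argument yields

$$\mathbb{P}_0\left( \exists t \in \mathcal{T} : S_t \ge a + b V_t \right) \le \inf_{b' < \bar b} \mathbb{P}_0\left( \exists t \in \mathcal{T} : S_t \ge a + b' V_t \right) \tag{1.13}$$

$$\le l_0 \exp\left\{ -a \sup_{b' < \bar b} D(b') \right\}. \tag{1.14}$$

But $\sup_{b' < \bar b} D(b') = \lambda_{\max} = D(b)$ from the definition of $D(\cdot)$.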

• (a) ⇒ (d): Fix $m \ge 0$ and $x, b > 0$. Observe that $\{\exists t \in \mathcal{T} : V_t \ge m, S_t \ge x + b(V_t - m)\} \subseteq \{\exists t \in \mathcal{T} : S_t \ge x' + b'(V_t - m)\}$ for any $0 < x' \le x$ and $0 < b' \le b$, so part (a) yields

$$\mathbb{P}_0\left( \exists t \in \mathcal{T} : V_t \ge m, S_t \ge x + b(V_t - m) \right) \le l_0 \exp\{-(x' - b'm) D(b')\} \tag{1.15}$$

for any $(x', b')$ in the feasible set $\{x' \in (0, x], b' \in (0, b] : x' > m b'\}$. If $x > m \bar b$, then $(x, b \wedge \bar b)$ is feasible; note that $D(b \wedge \bar b) = D(b)$ by the definition of $D(\cdot)$. If $x \le m \bar b$ and $b < s(x/m)$, then by Lemma 1.2(vi) and the definition $s(\bar b) := \bar b$, we have $b < x/m$, so $(x, b)$ is feasible and $b \le \bar b$. Combining these two cases, we have

$$\mathbb{P}_0\left( \exists t \in \mathcal{T} : V_t \ge m, S_t \ge x + b(V_t - m) \right) \le l_0 \exp\left\{ -(x - (b \wedge \bar b) m) D(b) \right\} \tag{1.16}$$

whenever $x > m \bar b$ or $b < s(x/m)$, proving the first case in (1.10). On the other hand, if $x \le m \bar b$ and $s(x/m) \le b$, then $(x', s(x'/m))$ is feasible for any $x' < x$, by Lemma 1.2(vi). This yields

$$\mathbb{P}_0\left( \exists t \in \mathcal{T} : V_t \ge m, S_t \ge x + b(V_t - m) \right) \le l_0 \exp\left\{ -m \psi^\star\!\left( \frac{x'}{m} \right) \right\} \tag{1.17}$$

as in part (b). We minimize the probability bound over $x' < x$, noting that $\sup_{x' < x} \psi^\star(x'/m) = \psi^\star(x/m)$ since $\psi^\star$ is increasing (as ψ is CGF-like) and closed (Rockafellar, 1970, Theorem 12.2). This proves the second case in (1.10).

• (d) ⇒ (a): set $m = 0$ and $x = a$ to recover part (a).
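The optimization in the step (a) ⇒ (b) can be reproduced numerically. The sketch below assumes the sub-gamma function ψ(λ) = λ²/(2(1 − cλ)) with c = 1, for which solving the defining equations by hand gives D(b) = 2b/(1 + 2cb), s(u) = u/(1 + √(1 + 2cu)) and ψ⋆(u) = (1 + cu − √(1 + 2cu))/c²; these closed forms and the function names are assumptions of this illustration. A grid search over slopes b in (0, x/m) then locates the slope that maximizes the exponent (x − bm)D(b).

```python
import math

C = 1.0  # scale parameter of the sub-gamma psi (illustrative choice)

def decay_gamma(b):
    # D(b): solves psi(lam) / lam = b for psi(lam) = lam^2 / (2 (1 - C lam)).
    return 2 * b / (1 + 2 * C * b)

def slope_gamma(u):
    # s(u) for the same psi.
    return u / (1 + math.sqrt(1 + 2 * C * u))

def legendre_gamma(u):
    # Legendre-Fenchel transform psi*(u) for the same psi.
    return (1 + C * u - math.sqrt(1 + 2 * C * u)) / C ** 2

def best_slope_on_grid(x, m, grid=100_000):
    # Maximize the exponent (x - b*m) * D(b) from part (a) over slopes
    # b in (0, x/m); every such line passes through the point (m, x).
    best_b, best_exponent = 0.0, -math.inf
    for i in range(1, grid):
        b = (x / m) * i / grid
        exponent = (x - b * m) * decay_gamma(b)
        if exponent > best_exponent:
            best_b, best_exponent = b, exponent
    return best_b, best_exponent

x, m = 4.0, 1.0
b_star, exponent = best_slope_on_grid(x, m)
print(b_star, slope_gamma(x / m))           # optimal slope vs. s(x/m)
print(exponent, m * legendre_gamma(x / m))  # optimal exponent vs. m * psi*(x/m)
```

The grid optimum should agree with s(x/m) and m ψ⋆(x/m) up to grid resolution, matching the identities used in the proof.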

It is worth noting here that, unlike the proofs of Freedman (1975), Khan (2009), Tropp (2011), and Fan et al. (2015), we do not explicitly construct a stopping time in our proof. While an optional stopping argument is hidden within the proof of Ville’s inequality, the underlying stopping time here is different from that in the aforementioned citations.

Interpreting the theorem

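[Figure 1.2: four panels, labeled Theorem 1(a) through Theorem 1(d). Panels (a), (b), and (d) have horizontal axis $V_t$ and vertical axis $S_t$; panel (c) has horizontal axis $V_t$ and vertical axis $S_t / V_t$. Panel (a) marks the intercept $a$ and slope $b$; panels (b) and (c) mark $m$ and $x$; panel (d) marks $m$, $x$, and slope $b$.]

Figure 1.2: Illustration of the equivalent statements of Theorem 1.1, as described in the text.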

It is instructive to think of the parts of Theorem 1.1 as statements about the process $(V_t, S_t)$ or $(V_t, S_t/V_t)$ in $\mathbb{R}^2$. Many of our results are better understood via this geometric intuition. Specifically, Figure 1.2 illustrates the following points:

• Theorem 1.1(a) takes a given line $a + b V_t$ and bounds its $S_t$-upcrossing probability.

• Theorem 1.1(b) takes a point $(m, x)$ in the $(V_t, S_t)$-plane and, out of the infinitely many lines passing through it, chooses the one which yields the tightest upper bound on the corresponding $S_t$-upcrossing probability.

• Theorem 1.1(c) is like part (b), but instead of looking at $S_t$, we look at $S_t/V_t$, fix a point $(m, x)$ in the $(V_t, S_t/V_t)$-plane, and choose from among the infinitely many curves $b + a/V_t$ passing through it to minimize the probability bound.

• The intuition for Theorem 1.1(d) is as follows. If we want to bound the upcrossing probability of the line $(x - bm) + b V_t$ on $\{V_t \ge m\}$, we can clearly obtain a conservative bound from Theorem 1.1(a) with $a = x - bm$. This yields the first case in (1.10). However, we can also apply Theorem 1.1(b) with the values $m, x$, obtaining a bound on the upcrossing probability for a line which passes through the point $(m, x)$ in the $(V_t, S_t)$-plane, and this line yields the minimum possible probability bound among all lines passing through $(m, x)$. If the slope of this line, $s(x/m)$, is less than $b$, then this optimal probability bound is conservative for the upcrossing probability over the original line $x + b(V_t - m)$ on $\{V_t \ge m\}$. This gives the second case in (1.10), which is guaranteed to be at least as small as the bound in the first case when $s(x/m) \le b$.

We make some additional remarks below:

• We extend bounds for discrete-time scalar-valued processes to include both discrete-time matrix-valued processes and continuous-time scalar-valued processes, but we do not handle continuous-time matrix-valued processes, as this seems to require further technical developments beyond the scope of this chapter (see Bacry et al. (2018) for one approach to exponential bounds in this case). We write [C or D] when discussing extensions to existing results to emphasize this fact.

• Most of this chapter is concerned with right-tail bounds, hence the restriction to $\lambda \ge 0$ in Definition 1.1. It is understood that identical techniques yield left-tail bounds upon verifying that Definition 1.1 holds for $(-S_t)$.

• The purpose of excluding ψ being CGF-like from Definition 1.1 is to separate the truth of statement (a), which follows solely from the assumption, from its equivalence to (b), (c), and (d), which follows from ψ being CGF-like.

Three simple examples

We illustrate some simple instantiations of our theorem with three examples: a sum of coin flips, a discrete-time concentration inequality for random matrices, and a continuous-time scalar Brownian motion. These examples make use of several results from Section 1.3 describing conditions under which a process is sub-ψ; such results may be taken for granted on a first reading.

Example 1.2 (Coin flipping). Suppose $X_i \overset{iid}{\sim} \mathrm{Ber}(p)$, and let $S_t = \sum_{i=1}^t (X_i - p)$ denote the centered sum. The CGF of each increment of $S_t$, scaled by $1/[p(1-p)]$, is

$$\psi_B(\lambda) := [p(1-p)]^{-1} \log \mathbb{E} \exp\{\lambda (X_i - p)\} = [p(1-p)]^{-1} \log\left( p e^{(1-p)\lambda} + (1-p) e^{-p\lambda} \right),$$

so that $\lambda_{\max} = \infty$ and $\bar b = 1/p$. One may directly check the martingale property to confirm that $L_t(\lambda) := \exp\{\lambda S_t - \psi_B(\lambda) p(1-p) t\}$ is a martingale for any λ, so that $(S_t)$ is 1-sub-$\psi_B$ with $V_t = p(1-p)t$. Then, for any $t_0 \in \mathbb{N}$ and $x \in (0, (1-p)t_0)$, setting $m = p(1-p)t_0$ in Theorem 1.1(b) yields

$$\mathbb{P}\left( \exists t \in \mathbb{N} : S_t \ge x + p(1-p) \, s_B\!\left( \frac{x}{p(1-p)t_0} \right) \cdot (t - t_0) \right) \le \exp\left\{ -t_0 \, \mathrm{KL}\!\left( p + \frac{x}{t_0} \,\middle\|\, p \right) \right\} = \left[ \left( \frac{p}{p + x/t_0} \right)^{p + x/t_0} \left( \frac{1-p}{1 - p - x/t_0} \right)^{1 - p - x/t_0} \right]^{t_0}.$$

Here KL denotes the Bernoulli Kullback-Leibler divergence, $\mathrm{KL}(q \,\|\, p) = q \log\frac{q}{p} + (1-q) \log\frac{1-q}{1-p}$. It takes some algebra to obtain this KL as the Legendre-Fenchel transform of $\psi_B$; in Table 1.2 we summarize all such transforms used in this chapter.

The final expression is Equation (2.1) of Hoeffding (1963), but here we have a bound not just for the deviation of $S_{t_0}$ above its expectation at the fixed time $t_0$, but for the upper deviations of $S_t$ for all $t \in \mathbb{N}$, simultaneously. We can use this to sequentially test a hypothesis about $p$, or to construct a sequence of confidence intervals for $p$ possessing a coverage guarantee holding uniformly over unbounded time.
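The two forms of the bound in Example 1.2 can be compared numerically. The sketch below uses hypothetical function names and illustrative parameter values; it evaluates the KL form and the product form, which should coincide, and also prints the cruder quantity exp(−2x²/t₀), which is never smaller by Pinsker’s inequality KL(q ‖ p) ≥ 2(q − p)².

```python
import math

def bernoulli_kl(q, p):
    # KL(q || p) between Bernoulli(q) and Bernoulli(p), for 0 < q, p < 1.
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_bound(p, x, t0):
    # exp{-t0 * KL(p + x/t0 || p)}: the exponential form of the bound.
    return math.exp(-t0 * bernoulli_kl(p + x / t0, p))

def product_form(p, x, t0):
    # The same quantity written as the product from Hoeffding's Eq. (2.1).
    q = p + x / t0
    return ((p / q) ** q * ((1 - p) / (1 - q)) ** (1 - q)) ** t0

p, x, t0 = 0.3, 5.0, 50        # illustrative values with 0 < x < (1 - p) * t0
print(kl_bound(p, x, t0), product_form(p, x, t0))
print(math.exp(-2 * x ** 2 / t0))   # sub-Gaussian bound, never smaller
```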

The slope transform $s_B(u)$ for $\psi_B$, given in Table 1.2, is unwieldy. To derive a more analytically convenient bound, we use the fact that $p(1-p)\psi_B(\lambda) \le \lambda^2/8$ for all $\lambda \ge 0$; see the proof of Proposition 1.2, part 2. Hence $\exp\{\lambda S_t - \lambda^2 t/8\} \le L_t(\lambda)$ with $L_t$ defined as above, so $(S_t)$ is also 1-sub-ψ with $\psi(\lambda) = \lambda^2/8$ and $V_t = t$. Now Theorem 1.1(b) yields

$$\mathbb{P}\left( \exists t \in \mathbb{N} : S_t \ge x + \frac{x}{2m} \cdot (t - m) \right) \le \exp\left\{ -\frac{2x^2}{m} \right\}. \tag{1.18}$$

This is equivalent to Blackwell’s line-crossing inequality (1.4), and in the form (1.18) it is clear that it recovers Hoeffding’s inequality at the fixed time $t = m$. Instead of using $p(1-p)\psi_B(\lambda) \le \lambda^2/8$, we might alternatively use $\psi_B(\lambda) \le (1-2p)^{-2} \left( e^{(1-2p)\lambda} - (1-2p)\lambda - 1 \right)$ to obtain an extension of Bennett’s inequality (1.2) which improves upon Hoeffding’s inequality substantially for values of $p$ near zero and one. We will see other examples of such “sub-Poisson” bounds below.
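As one possible use of (1.18), the following sketch builds a sequence of confidence intervals for p from a stream of coin flips. It is an illustration under stated assumptions, not the chapter’s own construction: it applies (1.18) to both (S_t) and (−S_t), combines the two with a union bound, and chooses x so that 2 exp(−2x²/m) = α. The function names and the tuning parameter m (the time at which the boundary is tightest) are hypothetical.

```python
import math

def boundary(t, m, alpha):
    # Linear boundary x + (x / (2m)) * (t - m) from (1.18), with x chosen so
    # that each one-sided crossing probability is at most alpha / 2.
    x = math.sqrt(m * math.log(2 / alpha) / 2)
    return x + x / (2 * m) * (t - m)

def confidence_sequence(flips, m, alpha=0.05):
    # Intervals [mean - r_t, mean + r_t] with r_t = boundary(t) / t, clipped
    # to [0, 1]. Under the assumptions above they cover p simultaneously for
    # all t with probability at least 1 - alpha.
    intervals = []
    total = 0
    for t, flip in enumerate(flips, start=1):
        total += flip
        mean = total / t
        radius = boundary(t, m, alpha) / t
        intervals.append((max(0.0, mean - radius), min(1.0, mean + radius)))
    return intervals

flips = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]      # made-up data for illustration
for t, (lo, hi) in enumerate(confidence_sequence(flips, m=5), start=1):
    print(t, round(lo, 3), round(hi, 3))
```

At t = m the radius equals the usual two-sided Hoeffding radius √(log(2/α)/(2m)); at other times it is wider, which is the price of validity at every t.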

Example 1.3 (Covariance estimation for a spiked random vector ensemble). The estimation of a covariance matrix via an i.i.d. sample is a common application of exponential matrix concentration, starting with Rudelson (1999). See also Vershynin (2012), Gittens and Tropp (2011), Tropp (2015), and Koltchinskii and Lounici (2017) for more recent treatments; this particular example is drawn from Wainwright (2017). Let $d \ge 2$ and consider $\mathbb{R}^d$-valued, mean-zero observations $X_i = \sqrt{d} \, \xi_i e_{U_i}$, where $\xi_i \overset{iid}{\sim} \text{Rademacher}$, $(e_k)_{k=1}^d$ are the standard basis vectors and $U_i \overset{iid}{\sim} \mathrm{Unif}\{1, \dots, d\}$. What can we say about the concentration of the sample covariance matrix $\hat\Sigma_t := t^{-1} \sum_{i=1}^t X_i X_i^T$ around the true covariance $I_d$, the $d \times d$ identity matrix? Let $\gamma_{\max}(A)$ denote the maximum eigenvalue of a matrix $A$. We have $\gamma_{\max}(X_i X_i^T - I_d) = d - 1$ always, and $\mathbb{E}(X_i X_i^T - I_d)^2 = \left( \frac{(d-1)^2}{d} \right) I_d$. Hence Fact 1.1(c) shows that $S_t = t \gamma_{\max}(\hat\Sigma_t - I_d)$ is $d$-sub-ψ with variance process $V_t = \frac{(d-1)^2 t}{d}$, where

$$\psi(\lambda) = \frac{e^{(d-1)\lambda} - (d-1)\lambda - 1}{(d-1)^2} \le \frac{\lambda^2}{2(1 - (d-1)\lambda/3)}. \tag{1.19}$$

Here the inequality holds for all $\lambda \in [0, 3/(d-1))$ as demonstrated in the proof of Proposition 1.2, part 5. Applying Theorem 1.1(c) with ψ equal to the final expression in (1.19), we obtain, after some algebra, for any $x, m > 0$,

$$\mathbb{P}\left( \exists t \in \mathbb{N} : \gamma_{\max}\left( \hat\Sigma_t - I_d \right) \ge x \left( \frac{1 + \frac{m}{t} \sqrt{1 + 2x/3(d-1)}}{1 + \sqrt{1 + 2x/3(d-1)}} \right) \right) \le d \exp\left\{ -\frac{m x^2}{2(d-1)\left[ (d-1)/d + x/3 \right]} \right\}. \tag{1.20}$$

At the fixed time $t = m$, this implies

$$\gamma_{\max}\left( \hat\Sigma_m - I_d \right) \le \sqrt{\frac{2(d-1)^2 \log(d/\alpha)}{dm}} + \frac{2(d-1)\log(d/\alpha)}{3m}$$

with probability at least $1 - \alpha$, a known fixed-sample result (Wainwright, 2017). However, as above, (1.20) gives a bound on the upper deviations of $\hat\Sigma_t$ for all $t \in \mathbb{N}$ simultaneously. Such a bound enables, for example, sequential hypothesis tests concerning the true covariance matrix.
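The relationship between the tail bound on the right of (1.20) and the fixed-time radius can be checked numerically. The sketch below simply transcribes the two displayed formulas (the function names are hypothetical and the parameter values are illustrative) and confirms that plugging the radius into the tail bound gives a probability of at most α.

```python
import math

def tail_bound(x, m, d):
    # Right-hand side of (1.20) as displayed in the text.
    return d * math.exp(-m * x ** 2 / (2 * (d - 1) * ((d - 1) / d + x / 3)))

def fixed_time_radius(m, d, alpha):
    # Deviation level stated in the text for the fixed time t = m.
    log_term = math.log(d / alpha)
    return (math.sqrt(2 * (d - 1) ** 2 * log_term / (d * m))
            + 2 * (d - 1) * log_term / (3 * m))

d, m, alpha = 10, 500, 0.05
radius = fixed_time_radius(m, d, alpha)
print(radius, tail_bound(radius, m, d))   # second value should not exceed alpha
```

The radius is slightly conservative: it comes from the standard Bernstein-style inversion, which upper bounds the exact solution of tail_bound(x) = α.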


Example 1.4 (Line-crossing for Brownian motion). Let $(S_t)_{t \in [0, \infty)}$ denote standard Brownian motion. It is a standard fact that the process $\exp\{\lambda S_t - \lambda^2 t/2\}$ is a martingale, so that $(S_t)$ is 1-sub-ψ with $\psi(\lambda) = \lambda^2/2$ and $V_t = t$. In this case, Theorem 1.1 says that, for any $a, b > 0$,

$$\mathbb{P}\left( \exists t \in (0, \infty) : S_t \ge a + bt \right) \le e^{-2ab},$$

a well-known line-crossing bound for Brownian motion, which in fact holds with equality (Durrett, 2017, Exercise 7.5.2).

1.3 Sufficient conditions for sub-ψ processes

Much of the power of Definition 1.1 comes from the array of sufficient conditions for it which have been discovered under diverse, nonparametric conditions. In this section, we define some standard ψ functions and collect a broad set of conditions from the literature for a process $(S_t)$ to be sub-ψ with one of these functions, summarized in Tables 1.3 and 1.4. All discrete-time results in this chapter use $S_t = \gamma_{\max}(Y_t)$ where $(Y_t)_{t \in \mathbb{N}}$ is a martingale taking values in $\mathcal{H}^d$, with the exception of Section 1.4, which deals with martingales in abstract Banach spaces. Typically, setting $d = 1$ recovers the corresponding known scalar result exactly. We note also that our results for Hermitian matrices extend directly to rectangular matrices using Hermitian dilations (Tropp, 2012), as we illustrate in Corollary 1.2.

Five useful ψ functions

We define five particular ψ functions corresponding to five sub-ψ cases: the sub-Gaussian case in Hoeffding’s inequality, the “sub-gamma” case corresponding to Bernstein’s inequality, the sub-Poisson case from Bennett’s and Freedman’s inequalities, and the sub-exponential and sub-Bernoulli cases which are used in several other existing bounds. The ψ functions and corresponding transforms for these five cases are summarized in Table 1.2, while Figure 1.3 summarizes relationships among these cases, with Proposition 1.2 containing the formal statements. Recall $\bar b = \sup_{\lambda \in [0, \lambda_{\max})} \psi'(\lambda)$ from Definition 1.2, and note that we take $1/0 = \infty$ by convention in the expressions for $\lambda_{\max}$ and $\bar b$ below.

1. We say $(S_t)$ is sub-Bernoulli with range parameters $g, h > 0$ when it is sub-$\psi_{B,g,h}$ for some suitable variance process $(V_t)$, where

$$\psi_{B,g,h}(\lambda) := \frac{1}{gh} \log\left(\frac{g e^{h\lambda} + h e^{-g\lambda}}{g + h}\right) \quad \text{for } 0 \leq \lambda < \infty = \lambda_{\max},$$

which is the scaled CGF of a mean-zero random variable taking values $-g$ and $h$. Here $\bar{b} = 1/g$.

2. We say $(S_t)$ is sub-Gaussian when it is sub-$\psi_N$ for some suitable variance process $(V_t)$, where

$$\psi_N(\lambda) := \lambda^2/2 \quad \text{for } 0 \leq \lambda < \infty = \lambda_{\max}.$$

Here $\bar{b} = \infty$.

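3. We say $(S_t)$ is sub-Poisson with scale parameter $c \in \mathbb{R}$ when it is sub-$\psi_{P,c}$ for some suitable variance process $(V_t)$, where

$$\psi_{P,c}(\lambda) := \frac{e^{c\lambda} - c\lambda - 1}{c^2} \quad \text{for } 0 \leq \lambda < \infty = \lambda_{\max}.$$

By taking the limit, we define $\psi_{P,c} = \psi_N$ when $c = 0$. Here $\bar{b} = |c \wedge 0|^{-1}$.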

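4. We say $(S_t)$ is sub-gamma with scale parameter $c \in \mathbb{R}$ when it is sub-$\psi_{G,c}$ for some suitable variance process $(V_t)$, where

$$\psi_{G,c}(\lambda) := \frac{\lambda^2}{2(1 - c\lambda)} \quad \text{for } 0 \leq \lambda < \frac{1}{c \vee 0} = \lambda_{\max}.$$

Here $\bar{b} = |2c \wedge 0|^{-1}$.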

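5. We say $(S_t)$ is sub-exponential with scale parameter $c \in \mathbb{R}$ when it is sub-$\psi_{E,c}$ for some suitable variance process $(V_t)$, where

$$\psi_{E,c}(\lambda) := \frac{-\log(1 - c\lambda) - c\lambda}{c^2} \quad \text{for } 0 \leq \lambda < \frac{1}{c \vee 0} = \lambda_{\max}.$$

By taking the limit, we define $\psi_{E,c} = \psi_N$ when $c = 0$. Here $\bar{b} = |c \wedge 0|^{-1}$.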
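The five definitions above translate directly into code. The following is a minimal sketch in which the function names (`psi_bernoulli`, `psi_normal`, and so on) are hypothetical and chosen only for this example. The $c = 0$ cases are handled by returning $\psi_N$, matching the limiting convention in the list, and `math.expm1` and `math.log1p` are used as a numerical convenience for small $c\lambda$.

```python
import math


def psi_normal(lam):
    """Sub-Gaussian: lam^2 / 2 on [0, inf)."""
    return lam * lam / 2.0


def psi_bernoulli(lam, g, h):
    """Sub-Bernoulli with range parameters g, h > 0, on [0, inf)."""
    return math.log((g * math.exp(h * lam) + h * math.exp(-g * lam)) / (g + h)) / (g * h)


def psi_poisson(lam, c):
    """Sub-Poisson with scale c, on [0, inf); equals psi_normal when c == 0."""
    if c == 0:
        return psi_normal(lam)
    return (math.expm1(c * lam) - c * lam) / (c * c)


def _check_domain(lam, c):
    # For c > 0 the sub-gamma and sub-exponential functions require lam < 1/c.
    if c > 0 and lam >= 1.0 / c:
        raise ValueError("lam must be smaller than 1/c when c > 0")


def psi_gamma(lam, c):
    """Sub-gamma with scale c, on [0, 1/(c v 0)); equals psi_normal when c == 0."""
    _check_domain(lam, c)
    return lam * lam / (2.0 * (1.0 - c * lam))


def psi_exponential(lam, c):
    """Sub-exponential with scale c, on [0, 1/(c v 0)); equals psi_normal when c == 0."""
    if c == 0:
        return psi_normal(lam)
    _check_domain(lam, c)
    return (-math.log1p(-c * lam) - c * lam) / (c * c)


# Small c should be close to the sub-Gaussian case.
print(psi_poisson(1.0, 1e-4), psi_exponential(1.0, 1e-4), psi_normal(1.0))
# The slope of psi_bernoulli at large lam approaches b-bar = 1/g.
print(psi_bernoulli(51.0, 2.0, 1.0) - psi_bernoulli(50.0, 2.0, 1.0))
```

All five functions vanish at $\lambda = 0$ and behave like $\lambda^2/2$ near zero, which is why each reduces to or approximates the sub-Gaussian case for small $\lambda$.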

We will typically write $\psi_B$, $\psi_P$, $\psi_G$, and $\psi_E$, omitting the range or scale parameters from the notation when they are clear from the context. We follow the definition of sub-gamma from Boucheron et al. (2013), despite the somewhat inconsistent terminology: unlike the other four cases, $\psi_G$ is not the CGF of a gamma-distributed random variable. This definition is convenient for a number of reasons: it includes $\psi_N$ as a special case, it gives a useful upper bound for $\psi_P$ (see Proposition 1.2 part 5, below), it falls naturally out of the use of a Bernstein condition on higher moments to bound the CGF, and it is simple enough to permit analytically tractable results for the slope and decay transforms and the various bounds to follow. We remark also that our definition of sub-exponential in terms of the CGF of the exponential distribution follows that of Boucheron et al. (2013, Exercise 2.22), but differs from another well-known definition which says that the CGF is bounded by $\lambda^2/2$ for $\lambda$ in some neighborhood of zero. The two are equivalent up to appropriate choice of constants, as detailed in Section 1.7.

The sub-gamma and sub-exponential functions $\psi_{G,c}$ and $\psi_{E,c}$ possess the following universality property, which we prove in Section 1.6.

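Proposition 1.1. For any twice-differentiable $\psi : [0, \lambda_{\max}) \to \mathbb{R}$ with $\psi(0) = \psi'(0+) = 0$, there exist constants $a, c > 0$ such that $\psi(\lambda) \leq a\psi_{G,c}(\lambda)$ for all $\lambda \in [0, \lambda_{\max})$. Likewise, there exist constants $\tilde{a}, \tilde{c} > 0$ such that $\psi(\lambda) \leq \tilde{a}\psi_{E,\tilde{c}}(\lambda)$ for all $\lambda \in [0, \lambda_{\max})$.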

In particular, this means that if $S_t = \sum_{i=1}^t X_i$ for any zero-mean, i.i.d. sequence $(X_i)$ satisfying $\mathbb{E} e^{\lambda X_1} < \infty$ for some $\lambda > 0$, then $(S_t)$ is sub-gamma and sub-exponential with appropriate scale constants and variance process $V_t$ proportional to $t$. Furthermore, any process which is sub-$\psi$ with a CGF-like $\psi$ function is also sub-gamma and sub-exponential with appropriate scaling of the variance process by a constant.
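Two concrete instances of this kind of domination can be checked numerically. For $c > 0$, comparing power series term by term gives $\psi_{E,c}(\lambda) = \sum_{k \geq 2} c^{k-2}\lambda^k / k \leq \sum_{k \geq 2} c^{k-2}\lambda^k / 2 = \psi_{G,c}(\lambda)$ on $[0, 1/c)$, and, since $k! \geq 2 \cdot 3^{k-2}$, $\psi_{P,c}(\lambda) \leq \psi_{G,c/3}(\lambda)$ on $[0, 3/c)$, the range where $\psi_{G,c/3}$ is finite. These series comparisons are elementary observations added here for illustration and are not the proof given in Section 1.6. The constant $c/3$ is the classical Bernstein-type constant, which is assumed here to be what Proposition 1.2 part 5 refers to. The helper name `dominated_on_grid` is hypothetical.

```python
import math


def psi_gamma(lam, c):
    return lam * lam / (2.0 * (1.0 - c * lam))


def psi_exponential(lam, c):
    return (-math.log1p(-c * lam) - c * lam) / (c * c)


def psi_poisson(lam, c):
    return (math.expm1(c * lam) - c * lam) / (c * c)


def dominated_on_grid(psi, psi_upper, lam_max, n=1000):
    """Return True if psi(lam) <= psi_upper(lam) at n interior grid points of (0, lam_max).

    A grid check is only numerical evidence, not a proof; the small relative
    tolerance guards against floating-point rounding.
    """
    grid = [lam_max * k / (n + 1) for k in range(1, n + 1)]
    return all(psi(lam) <= psi_upper(lam) * (1.0 + 1e-12) for lam in grid)


c = 1.0
# psi_E,c <= psi_G,c on (0, 1/c): constants a = 1 and the same scale c.
print(dominated_on_grid(lambda x: psi_exponential(x, c), lambda x: psi_gamma(x, c), 1.0 / c))
# psi_P,c <= psi_G,c/3 on (0, 3/c).
print(dominated_on_grid(lambda x: psi_poisson(x, c), lambda x: psi_gamma(x, c / 3.0), 3.0 / c))
# The reverse comparison fails, so the ordering is strict.
print(dominated_on_grid(lambda x: psi_gamma(x, c), lambda x: psi_exponential(x, c), 1.0 / c))
```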

Conditions for sub-ψ processes

In Tables 1.3 and 1.4, we summarize a variety of standard and novel conditions for a process $(S_t)$ to be sub-$\psi$. Fact 1.1 and Lemma 1.3 contain discrete-time results, while results for continuous time are in Fact 1.2. We let $I_d$ denote the $d \times d$ identity matrix. For a process $(Y_t)_{t \in \mathcal{T}}$, $[Y]_t$ denotes the quadratic variation and $\langle Y \rangle_t$ the conditional quadratic variation; in discrete time, $[Y]_t := \sum_{i=1}^t \Delta Y_i^2$ and $\langle Y \rangle_t := \sum_{i=1}^t \mathbb{E}_{i-1} \Delta Y_i^2$. We extend a function $f : \mathbb{R} \to \mathbb{R}$ on the real line to an operator $f : \mathcal{H}^d \to \mathcal{H}^d$ on the space of Hermitian matrices in the standard way: if $A \in \mathcal{H}^d$ has the spectral decomposition $U \Lambda U^\star$ where $\Lambda$ is diagonal with elements $\lambda_1, \dots, \lambda_d$, then $f(A) = U f(\Lambda) U^\star$ where $f(\Lambda)$ is diagonal with elements $f(\lambda_1), \dots, f(\lambda_d)$. In particular, the absolute value function extends to $\mathcal{H}^d$ by taking absolute values of the eigenvalues, while $[Y_+]_t := \sum_{i=1}^t \max(0, \Delta Y_i)^2$ and $\langle Y_- \rangle_t := \sum_{i=1}^t \mathbb{E}_{i-1} \min(0, \Delta Y_i)^2$ operate by truncating the eigenvalues.
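To make the spectral extension concrete, the sketch below applies a scalar function to a real symmetric $2 \times 2$ matrix, a special case of $\mathcal{H}^d$ chosen so that the eigendecomposition has a closed form and no linear-algebra library is needed. The names `apply_spectral` and `positive_quadratic_variation` are hypothetical. The second function accumulates $[Y_+]_t$ from a list of matrix increments $\Delta Y_i$ supplied by the caller.

```python
import math


def apply_spectral(f, A):
    """Apply scalar f to a real symmetric 2x2 matrix A through its eigenvalues.

    Uses f(A) = f(lo) * P_lo + f(hi) * P_hi, where P_lo and P_hi are the
    spectral projectors onto the two eigenspaces.
    """
    (a, b), (b2, d) = A
    if b != b2:
        raise ValueError("A must be symmetric")
    mid = (a + d) / 2.0
    rad = math.hypot((a - d) / 2.0, b)
    if rad == 0.0:  # A is a multiple of the identity
        v = f(mid)
        return [[v, 0.0], [0.0, v]]
    lo, hi = mid - rad, mid + rad
    gap = hi - lo
    # P_hi = (A - lo * I) / (hi - lo), and P_lo = I - P_hi.
    p_hi = [[(a - lo) / gap, b / gap], [b / gap, (d - lo) / gap]]
    f_lo, f_hi = f(lo), f(hi)
    return [
        [f_lo * ((1.0 if i == j else 0.0) - p_hi[i][j]) + f_hi * p_hi[i][j] for j in range(2)]
        for i in range(2)
    ]


def positive_quadratic_variation(increments):
    """[Y_+]_t: sum over increments of max(0, dY)^2, truncating eigenvalues at zero."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    for dY in increments:
        term = apply_spectral(lambda x: max(0.0, x) ** 2, dY)
        total = [[total[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return total


A = [[0.0, 1.0], [1.0, 0.0]]  # eigenvalues -1 and +1
abs_A = apply_spectral(abs, A)  # absolute values of the eigenvalues
qv_plus = positive_quadratic_variation([A, [[-1.0, 0.0], [0.0, 2.0]]])
print(abs_A)
print(qv_plus)
```

For general $d$, the same construction would use a numerical Hermitian eigendecomposition in place of the closed-form projectors.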

In the discrete-time case, we have the following known results.

Fact 1.1. Let $(Y_t)_{t \in \mathbb{N}}$ be any $\mathcal{H}^d$-valued martingale, and let $S_t := \gamma_{\max}(Y_t)$ for $t \in \mathbb{N}$.
