1.3 Sequences of Spaces, Events, and Random Variables
1.3.3 Types of Convergence
The first important point to understand about asymptotic theory is that there are different kinds of convergence of a sequence of random variables, {Xn}. Three of these kinds of convergence have analogues in convergence of general measurable functions (see Appendix 0.1) and a fourth type applies to convergence of the measures themselves. Different types of convergence apply to
• a function, that is, directly to the random variable (Definition 1.35). This is the convergence that is ordinarily called “strong convergence”.
• expected values of powers of the random variable (Definition 1.36). This is also a type of strong convergence.
• probabilities of the random variable being within a range of another random variable (Definition 1.37). This is a weak convergence.
• the distribution of the random variable (Definition 1.39, stated in terms of weak convergence of probability measures, Definition 1.38). This is the convergence that is ordinarily called “weak convergence”.
In statistics, we are interested in various types of convergence of procedures of statistical inference. Depending on the kind of inference, one type of convergence may be more relevant than another. We will discuss these in later chapters. At this point, however, it is appropriate to point out that an important property of point estimators is consistency, and the various types of consistency of point estimators, which we will discuss in Section 3.8.1, correspond directly to the types of convergence of sequences of random variables we discuss below.
Almost Sure Convergence
Definition 1.35 (almost sure (a.s.) convergence) We say that {Xn} converges almost surely to X if
\[ \lim_{n\to\infty} X_n = X \quad \text{a.s.} \tag{1.155} \]
We write $X_n \overset{\text{a.s.}}{\to} X$.
Writing this definition in the form of Definition 0.1.38 on page 726, with Xn and X defined on the probability space (Ω, F, P), we have
\[ P(\{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}) = 1. \tag{1.156} \]
This expression provides a very useful heuristic for distinguishing a.s. convergence from other types of convergence.
Almost sure convergence is equivalent to
\[ \lim_{n\to\infty} \Pr\left( \bigcup_{m=n}^{\infty} \{\|X_m - X\| > \epsilon\} \right) = 0, \tag{1.157} \]
for every ε > 0 (exercise).
Almost sure convergence is also called “almost certain” convergence, and written as $X_n \overset{\text{a.c.}}{\to} X$.
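The condition (1.155) can also be written as
\[ \Pr\left( \lim_{n\to\infty} \|X_n - X\| < \epsilon \right) = 1, \tag{1.158} \]
for every ε > 0. For this reason, almost sure convergence is also called convergence with probability 1, and may be indicated by writing $X_n \overset{\text{wp1}}{\to} X$. Hence, we may encounter three equivalent expressions:
\[ \overset{\text{a.s.}}{\to} \;\equiv\; \overset{\text{a.c.}}{\to} \;\equiv\; \overset{\text{wp1}}{\to}. \]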
Almost sure convergence of a sequence of random variables {Xn} to a constant c implies lim sup_n Xn = lim inf_n Xn = c a.s. It does not imply {Xn = c i.o.} (consider Xn = c + 1/n); conversely, {Xn = c i.o.} by itself does not imply any kind of convergence of {Xn}.
Convergence in rth Moment
Definition 1.36 (convergence in rth moment (convergence in Lr)) For fixed r > 0, we say that {Xn} converges in rth moment to X if
\[ \lim_{n\to\infty} \mathrm{E}(\|X_n - X\|_r^r) = 0. \tag{1.159} \]
We write $X_n \overset{L_r}{\to} X$. (Compare Definition 0.1.50 on page 748.)
Convergence in rth moment requires that $\mathrm{E}(\|X_n\|_r^r) < \infty$ for each n. Convergence in rth moment implies convergence in sth moment for s ≤ r (and, of course, it implies that $\mathrm{E}(\|X_n\|_s^s) < \infty$ for each n). (See Theorem 1.16, which was stated only for scalar random variables.)
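The implication from rth to sth moment can be checked in one line for scalar random variables. The following is a sketch, assuming 0 < s ≤ r and writing Yn = Xn − X (a symbol introduced here only for brevity); it uses Jensen's inequality for the concave function t ↦ t^{s/r} on [0, ∞).

```latex
\mathrm{E}(|Y_n|^s)
  = \mathrm{E}\!\left( (|Y_n|^r)^{s/r} \right)
  \le \left( \mathrm{E}(|Y_n|^r) \right)^{s/r}
  \longrightarrow 0
  \quad \text{as } n \to \infty, \text{ whenever } \mathrm{E}(|Y_n|^r) \to 0 .
```

The same inequality, applied to Xn in place of Yn, gives the finiteness of the sth moment whenever the rth moment is finite.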
For r = 1, convergence in rth moment is called convergence in absolute mean. For r = 2, it is called convergence in mean square or convergence in second moment, and of course, it implies convergence in mean. (Recall our notational convention: $\|X_n - X\| = \|X_n - X\|_2$.)
The Cauchy criterion (see Exercise 0.0.6d on page 689) is often useful for proving convergence in mean or convergence in mean square, without specifying the limit of the sequence. The sequence {Xn} converges in mean square (to some random variable) iff
\[ \lim_{n,m\to\infty} \mathrm{E}(\|X_n - X_m\|^2) = 0. \tag{1.160} \]
Convergence in Probability
Definition 1.37 (convergence in probability)
We say that {Xn} converges in probability to X if for every ε > 0,
\[ \lim_{n\to\infty} \Pr(\|X_n - X\| > \epsilon) = 0. \tag{1.161} \]
We write $X_n \overset{p}{\to} X$.
(Compare Definition 0.1.51 on page 748 for general measures.)
Notice the difference in convergence in probability and convergence in rth moment. Convergence in probability together with uniform integrability implies convergence in mean, but not in higher rth moments. It is easy to construct examples of sequences that converge in probability but that do not converge in second moment (exercise).
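One direction of this comparison follows from Markov's inequality. A sketch for scalar random variables, with r > 0 and ε > 0 fixed:

```latex
\Pr(|X_n - X| > \epsilon)
  = \Pr(|X_n - X|^r > \epsilon^r)
  \le \frac{\mathrm{E}(|X_n - X|^r)}{\epsilon^r} .
```

If the rth moment of the difference goes to 0, the right side goes to 0 for each fixed ε, which is convergence in probability. No general bound holds in the other direction, because a small probability can be attached to arbitrarily large values of |Xn − X|.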
Notice the difference in convergence in probability and almost sure convergence; in the former case the limit of probabilities is taken, in the latter case a probability of a limit is evaluated; compare equations (1.157) and (1.161). It is easy to construct examples of sequences that converge in probability but that do not converge almost surely (exercise).
Although convergence in probability does not imply almost sure convergence, it does imply the existence of a subsequence that does converge almost surely, as stated in the following theorem.
Theorem 1.31
Suppose {Xn} converges in probability to X. Then there exists a subsequence {Xni} that converges almost surely to X.
Stated another way, this theorem says that if {Xn} converges in probability to X, then there is an increasing sequence {ni} of positive integers such that
\[ \lim_{i\to\infty} X_{n_i} \overset{\text{a.s.}}{=} X. \]
Proof. The proof is an exercise. You could first show that there is an increasing sequence {ni} such that
\[ \sum_{i=1}^{\infty} \Pr(|X_{n_i} - X| > 1/i) < \infty, \]
and from this conclude that $X_{n_i} \overset{\text{a.s.}}{\to} X$.
Weak Convergence
There is another type of convergence that is very important in statistical applications; in fact, it is the basis for asymptotic statistical inference. This convergence is defined in terms of pointwise convergence of the sequence of CDFs; hence it is a weak convergence. We will give the definition in terms of the sequence of CDFs or, equivalently, of probability measures, and then state the definition in terms of a sequence of random variables.
Definition 1.38 (weak convergence of probability measures)
Let {Pn} be a sequence of probability measures and {Fn} be the sequence of corresponding CDFs, and let F be a CDF with corresponding probability measure P. If at each point of continuity t of F,
\[ \lim_{n\to\infty} F_n(t) = F(t), \tag{1.162} \]
we say that the sequence of CDFs {Fn} converges weakly to F, and, equivalently, we say that the sequence of probability measures {Pn} converges weakly to P. We write
\[ F_n \overset{w}{\to} F \]
or
\[ P_n \overset{w}{\to} P. \]
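The restriction to continuity points of F matters. A minimal sketch, assuming a hypothetical sequence in which Pn puts mass one at 1/n and P puts mass one at 0 (these measures and the function names are illustrative choices, not taken from the text): Fn(t) agrees with F(t) for large n at every t ≠ 0, but Fn(0) = 0 for all n while F(0) = 1.

```python
def cdf_point_mass(c):
    """Return the CDF of the distribution that puts mass one at the point c."""
    def cdf(t):
        return 1.0 if t >= c else 0.0
    return cdf

# Hypothetical example: P_n puts mass one at 1/n; the limit P puts mass one at 0.
F = cdf_point_mass(0.0)

def F_n(n):
    """CDF of the point mass at 1/n."""
    return cdf_point_mass(1.0 / n)

def converges_at(t, n_large=10**6):
    """Finite-n proxy for the limit: does F_n(t) equal F(t) at one large n?"""
    return F_n(n_large)(t) == F(t)

if __name__ == "__main__":
    for t in (-0.5, 0.0, 0.001, 0.5):
        values = [F_n(n)(t) for n in (10, 100, 1000, 10**6)]
        print(t, values, F(t), converges_at(t))
```

The only point at which the check fails is t = 0, the single discontinuity of F, so the sequence still converges weakly in the sense of the definition.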
Definition 1.39 (convergence in distribution (in law))
If {Xn} have CDFs {Fn} and X has CDF F, we say that {Xn} converges in distribution or in law to X iff $F_n \overset{w}{\to} F$. We write $X_n \overset{d}{\to} X$.
Because convergence in distribution is not precisely a convergence of the random variables themselves, it may be preferable to use a notation of the form
\[ \mathcal{L}(X_n) \to \mathcal{L}(X), \]
where the symbol $\mathcal{L}(\cdot)$ refers to the distribution or the “law” of the random variable.
When a random variable converges in distribution to a distribution for which we have adopted a symbol such as N(µ, σ2), for example, we may use notation of the form
\[ X_n \to\; \sim \mathrm{N}(\mu, \sigma^2). \]
Because this notation only applies in this kind of situation, we often write it more simply as just
\[ X_n \to \mathrm{N}(\mu, \sigma^2), \]
or in the “law” notation, $\mathcal{L}(X_n) \to \mathrm{N}(\mu, \sigma^2)$.
For certain distributions we have special symbols to represent a random variable. In such cases, we may use notation of the form
\[ X_n \overset{d}{\to} \chi^2_\nu, \]
which in this case indicates that the sequence {Xn} converges in distribution to a random variable with a chi-squared distribution with ν degrees of freedom. The “law” notation for this would be $\mathcal{L}(X_n) \to \mathcal{L}(\chi^2_\nu)$.
Determining Classes
In the case of multiple probability measures over a measurable space, we may be interested in how these measures behave over different sub-σ-fields, in particular, whether there is a determining class smaller than the σ-field of the given measurable space. For convergent sequences of probability measures, the determining classes of interest are those that preserve convergence of the measures for all sets in the σ-field of the given measurable space.
Definition 1.40 (convergence-determining class)
Let {Pn} be a sequence of probability measures defined on the measurable space (Ω, F) that converges (weakly) to P, also a probability measure defined on (Ω, F). A collection of subsets C ⊆ F is called a convergence-determining class of the sequence, iff
\[ P_n(A) \to P(A)\ \ \forall A \in \mathcal{C} \ni P(\partial A) = 0 \implies P_n(B) \to P(B)\ \ \forall B \in \mathcal{F}. \]
It is easy to see that a convergence-determining class is a determining class (exercise), but the converse is not true, as the following example from Romano and Siegel (1986) shows.
Example 1.20 a determining class that is not a convergence-determining class
For this example, we use the familiar measurable space (IR, B), and construct a determining class C whose sets exclude exactly one point, and then define a probability measure P that puts mass one at that point. All that is then required is to define a sequence {Pn} that converges to P. The example given by Romano and Siegel (1986) is the collection C of all finite open intervals that do not include the single mass point of P. (It is an exercise to show that this is a determining class.) For definiteness, let that special point be 0, and let Pn be the probability measure that puts mass one at n. Then, for any A ∈ C, Pn(A) → 0 = P(A), but for any interval (a, b) where a < 0 and 0 < b < 1, Pn((a, b)) = 0 but P((a, b)) = 1.
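The computation in Example 1.20 can be traced numerically. This is a small sketch whose function names are hypothetical; it evaluates the point masses at n and at 0 on open intervals (a, b).

```python
def point_mass_prob(c, a, b):
    """Probability that a unit point mass at c assigns to the open interval (a, b)."""
    return 1.0 if a < c < b else 0.0

def P(a, b):
    """Limit measure of Example 1.20: mass one at 0."""
    return point_mass_prob(0.0, a, b)

def P_n(n, a, b):
    """n-th measure of Example 1.20: mass one at n."""
    return point_mass_prob(float(n), a, b)

if __name__ == "__main__":
    # An interval in the class C (it excludes 0): P_n(A) is eventually 0 = P(A).
    print([P_n(n, 1.0, 5.0) for n in (2, 4, 5, 100)], P(1.0, 5.0))
    # An interval containing 0: P_n stays at 0 while P assigns probability 1.
    print([P_n(n, -1.0, 0.5) for n in (1, 10, 100)], P(-1.0, 0.5))
```

Agreement of the limits on every set of C therefore says nothing about the interval (−1, 0.5), which is the point of the example.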
Both convergence in probability and convergence in distribution are weak types of convergence. Convergence in probability, however, means that the probability is high that the two random variables are close to each other, while convergence in distribution means only that the two random variables have nearly the same distribution. That does not mean that they are very close to each other.
The term “weak convergence” is often used specifically for convergence in distribution because this type of convergence has so many applications in asymptotic statistical inference. In many interesting cases the limiting distribution of a sequence {Xn} may be degenerate, but for some sequence of constants an, the limiting distribution of {anXn} may not be degenerate and in fact may be very useful in statistical applications. The limiting distribution of {anXn} for a reasonable choice of a sequence of normalizing constants {an} is called the asymptotic distribution of {Xn}. After some consideration of the relationships among the various types of convergence, in Section 1.3.7, we will consider the “reasonable” choice of normalizing constants and other properties of weak convergence in distribution in more detail. The relevance of the limiting distribution of {anXn} will become more apparent in the statistical applications in Section 3.8.2 and later sections.
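A standard illustration of normalizing constants, offered here as a sketch rather than as the text's own example: let Xn be the minimum of n i.i.d. U(0,1) variables. Its limiting distribution is degenerate at 0, but with an = n the CDF of nXn is 1 − (1 − t/n)^n for 0 < t < n, which converges to the exponential CDF 1 − e^{−t}. The function names and the grid are assumptions made for the illustration.

```python
import math

def cdf_scaled_min(n, t):
    """CDF at t of n * X_n, where X_n is the minimum of n i.i.d. U(0,1) variables."""
    if t <= 0:
        return 0.0
    if t >= n:
        return 1.0
    return 1.0 - (1.0 - t / n) ** n

def cdf_exp(t):
    """CDF of the standard exponential distribution."""
    return 1.0 - math.exp(-t) if t > 0 else 0.0

def max_gap(n, grid):
    """Largest absolute difference between the two CDFs over the grid points."""
    return max(abs(cdf_scaled_min(n, t) - cdf_exp(t)) for t in grid)

# Evaluation points 0.1, 0.2, ..., 10.0 (an arbitrary choice).
grid = [0.1 * i for i in range(1, 101)]

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, max_gap(n, grid))
```

The gap shrinks as n grows, so the asymptotic distribution of {Xn} under this normalization is the standard exponential.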
Relationships among Types of Convergence
Almost sure convergence and convergence in rth moment are both strong types of convergence, but they are not closely related to each other. We have the logical relations shown in Figure 1.3.
The directions of the arrows in Figure 1.3 correspond to theorems with straightforward proofs. Where there are no arrows, as between Lr and a.s., we can find examples that satisfy one condition but not the other (see Examples 1.21 and 1.22 below). For relations in the opposite direction of the arrows, we can construct counterexamples, as for example, the reader is asked to do in Exercises 1.54a and 1.54b.
[Diagram: Lr ⇒ L1 ⇒ p; a.s. ⇒ p; p ⇒ d (or w); p ⇒ L1 if uniformly integrable; p ⇒ a.s. for a subsequence.]
Figure 1.3. Relationships of Convergence Types
Useful Sequences for Studying Types of Convergence
Just as for working with limits of unions and intersections of sets where we find it useful to identify sequences of sets that behave in some simple way (such as the intervals [a+ 1/n, b−1/n] on page 646), it is also useful to identify sequences of random variables that behave in interesting but simple ways.
One useful sequence begins with U ∼ U(0,1). With I_S(·) denoting the indicator function of the set S, we define
\[ X_n = n\, \mathrm{I}_{]0,1/n[}(U). \tag{1.163} \]
This sequence can be used to show that an a.s. convergent sequence may not converge in L1.
Example 1.21 converges a.s. but not in mean
Let {Xn} be the sequence defined in equation (1.163). Since Pr(lim_{n→∞} Xn = 0) = 1, $X_n \overset{\text{a.s.}}{\to} 0$. The mean and in fact the rth moment (for r > 0) of the limit is 0. However,
\[ \mathrm{E}(|X_n - 0|^r) = \int_0^{1/n} n^r \, du = n^{r-1}. \]
For r = 1, this does not converge to the mean of 0, and for r > 1, it diverges; hence {Xn} does not converge to 0 in rth moment for any r ≥ 1. (It does converge to the correct rth moment for 0 < r < 1, however.)
This example is also an example of a sequence that converges in probability (since a.s. convergence implies that), but does not converge in rth moment.
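The two behaviors in Example 1.21 can be checked numerically. The sketch below assumes the indicator form Xn = n on the event 0 < U < 1/n and 0 otherwise, with U ∼ U(0,1), which is the form that matches the integral in the example; the function names, the seed, and the number of draws are illustrative choices.

```python
import random

def x_n(n, u):
    """Value of X_n = n * 1{0 < u < 1/n} at the point u of ]0,1[."""
    return float(n) if 0.0 < u < 1.0 / n else 0.0

def exact_rth_moment(n, r):
    """E(|X_n|^r) = n^r * Pr(U < 1/n) = n^(r-1)."""
    return float(n) ** (r - 1.0)

def mc_rth_moment(n, r, draws=200_000, seed=0):
    """Monte Carlo estimate of E(|X_n|^r) from pseudo-random U(0,1) draws."""
    rng = random.Random(seed)
    return sum(x_n(n, rng.random()) ** r for _ in range(draws)) / draws

if __name__ == "__main__":
    u = 0.3  # one fixed outcome: the path X_n(u) is 0 as soon as 1/n < u
    print([x_n(n, u) for n in range(1, 8)])
    for r in (0.5, 1.0, 2.0):
        print(r, [exact_rth_moment(n, r) for n in (10, 100, 1000)])
```

Along any fixed outcome u the path is eventually 0, while the rth moments stay at 1 for r = 1 and grow without bound for r > 1.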
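Other kinds of interesting sequences can be constructed as indicators of events; that is, 0-1 random variables. One such simple sequence is the Bernoulli random variables {Xn} with probability that Xn = 1 being 1/n. This sequence converges to 0 in probability, since Pr(|Xn| > ε) ≤ 1/n → 0; if the Xn are independent, however, the second Borel-Cantelli lemma gives Xn = 1 infinitely often with probability 1, because the sum of the probabilities 1/n diverges. Hence this sequence can be used to show that a sequence that converges to X in probability does not necessarily converge to X a.s.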
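The contrast between the limit of probabilities in (1.161) and the tail-union probability in (1.157) can be computed exactly for 0-1 variables. This sketch assumes the Bernoulli variables are independent (an assumption added here); the comparison sequence with success probability 1/m² and all function names are hypothetical choices for illustration.

```python
def tail_union_prob(p, n, N):
    """Pr(X_m = 1 for some n <= m <= N) for independent Bernoulli(p(m)) variables."""
    prob_all_zero = 1.0
    for m in range(n, N + 1):
        prob_all_zero *= 1.0 - p(m)
    return 1.0 - prob_all_zero

def p_harmonic(m):
    """Success probability 1/m, as in the Bernoulli sequence of the text."""
    return 1.0 / m

def p_square(m):
    """Success probability 1/m^2, a summable comparison case."""
    return 1.0 / m ** 2

if __name__ == "__main__":
    # For p(m) = 1/m the product telescopes to (n-1)/N, so the tail-union
    # probability tends to 1 as N grows, for every fixed n >= 2.
    print([tail_union_prob(p_harmonic, 10, N) for N in (100, 1000, 10000)])
    # For p(m) = 1/m^2 the tail-union probability stays near 1/n.
    print([tail_union_prob(p_square, n, 100000) for n in (10, 100, 1000)])
```

In the first case each single probability 1/m goes to 0 but the union over the tail does not, so (1.161) holds while (1.157) fails; in the second case both go to 0.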
Other ways of defining 0-1 random variables involve breaking a U(0,1) distribution into uniform distributions on partitions of ]0,1[. For example, for a positive integer k, we may form 2^k subintervals of ]0,1[ for j = 1, . . . , 2^k as
\[ \left] \frac{j-1}{2^k}, \frac{j}{2^k} \right[ . \]
As k gets larger, the Lebesgue measure of these subintervals approaches 0 rapidly. Romano and Siegel (1986) build an indicator sequence using random variables on these subintervals for various counterexamples. This sequence can be used to show that an L2 convergent sequence may not converge a.s., as in the following example.
Example 1.22 converges in second moment but not a.s.
Let U ∼ U(0,1) and define
\[ X_n = \begin{cases} 1 & \text{if } \dfrac{j_n - 1}{2^{k_n}} < U < \dfrac{j_n}{2^{k_n}} \\ 0 & \text{otherwise,} \end{cases} \]
where $j_n = 1, \ldots, 2^{k_n}$ and $k_n \to \infty$ as $n \to \infty$. We see that $\mathrm{E}((X_n - 0)^2) = 1/2^{k_n}$, hence {Xn} converges in quadratic mean (or in mean square) to 0. We see, however, that lim_{n→∞} Xn does not exist (since for any value of U, Xn takes on each of the values 0 and 1 infinitely often). Therefore, {Xn} cannot converge a.s. (to anything!).
This is another example of a sequence that converges in probability (since convergence in rth moment implies that), but does not converge a.s.
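The sequence of Example 1.22 is easy to trace once an enumeration of the pairs (kn, jn) is fixed. The sketch below assumes one particular enumeration, running through j = 1, …, 2^k within block k and then moving to block k + 1; this ordering and the function names are assumptions made for the illustration.

```python
def block_index(n):
    """Map n = 1, 2, ... to (k, j) with j = 1..2**k, taking blocks k = 1, 2, ... in order."""
    k, start = 1, 1
    while n >= start + 2 ** k:
        start += 2 ** k
        k += 1
    return k, n - start + 1

def x_n(n, u):
    """Indicator that u lies in the open interval ((j-1)/2^k, j/2^k) for (k, j) = block_index(n)."""
    k, j = block_index(n)
    return 1 if (j - 1) / 2 ** k < u < j / 2 ** k else 0

def second_moment(n):
    """E(X_n^2) = Pr(X_n = 1) = 2^(-k_n)."""
    k, _ = block_index(n)
    return 1.0 / 2 ** k

if __name__ == "__main__":
    u = 0.3  # a fixed outcome that is not a dyadic rational
    print([x_n(n, u) for n in range(1, 31)])
    print([second_moment(n) for n in (1, 3, 7, 15, 31)])
```

For a non-dyadic outcome such as u = 0.3 the path contains exactly one 1 in every block, so it never settles, while the second moments decrease to 0.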
Convergence of PDFs
The weak convergence of a sequence of CDFs {Fn} is the basis for most asymptotic statistical inference. The convergence of a sequence of PDFs {fn} is a stronger form of convergence because it implies uniform convergence of probability on any given Borel set.
Theorem 1.32 (Scheffé)
Let {fn} be a sequence of PDFs that converge pointwise to a PDF f; that is, at each x
\[ \lim_{n\to\infty} f_n(x) = f(x). \]
Then, for any Borel set B,
\[ \lim_{n\to\infty} \int_B |f_n(x) - f(x)| \, dx = 0. \tag{1.164} \]
For a proof see Scheffé (1947).
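A numerical sketch of the conclusion of Scheffé's theorem, assuming the hypothetical sequence fn = PDF of N(1/n, 1), which converges pointwise to the N(0, 1) PDF. The integral is taken over the interval [−10, 10] by the trapezoidal rule; the integration range, step count, and function names are illustrative choices. Because the integral of |fn − f| over any Borel set B is at most the integral over the whole line, a small value here bounds the difference in probabilities uniformly over B.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """PDF of the N(mu, sigma^2) distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def l1_distance(n, lo=-10.0, hi=10.0, steps=20_000):
    """Trapezoidal approximation of the integral of |f_n - f| over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * abs(normal_pdf(x, mu=1.0 / n) - normal_pdf(x))
    return total * h

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, l1_distance(n))
```

For this particular pair of normal densities the distance over the whole line has the closed form 2 erf(1/(2√2 n)), which gives a check on the numerical values.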
Hettmansperger and Klimko (1974) showed that if a weakly convergent sequence of CDFs {Fn} has an associated sequence of PDFs {fn}, and if these PDFs are unimodal at a given point, then on any closed interval that does not contain the modal point the sequence of PDFs converges uniformly to a PDF.
Big O and Little o Almost Surely
We are often interested in the nature of the convergence or the rate of convergence of a sequence of random variables to another sequence of random variables. As in general spaces of real numbers that we consider in Section 0.0.5 on page 652, we distinguish two types of limiting behavior by big O and little o. These involve the asymptotic ratio of the elements of one sequence to the elements of a given sequence {an}. We defined two order classes, O(an) and o(an). In this section we begin with a given sequence of random variables {Yn} and define four different order classes, O(Yn) a.s., o(Yn) a.s., OP(Yn), and oP(Yn), based on whether or not the ratio is approaching 0 (that is, big O or little o) and on whether the convergence is almost sure or in probability.
For sequences of random variables {Xn} and {Yn} defined on a common probability space, we identify different types of convergence, either almost sure or in probability.
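• Big O almost surely, written O(Yn) a.s.
$X_n \in O(Y_n)$ a.s. iff $\Pr(\|X_n\| \in O(\|Y_n\|)) = 1$.
• Little o almost surely, written o(Yn) a.s.
$X_n \in o(Y_n)$ a.s. iff $\|X_n\|/\|Y_n\| \overset{\text{a.s.}}{\to} 0$. Compare $X_n/Y_n \overset{\text{a.s.}}{\to} 0$ for $X_n \in \mathrm{I\!R}^m$ and $Y_n \in \mathrm{I\!R}$.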
Big O and Little o Weakly
We also have relationships in which one sequence converges to another in probability.
• Big O in probability, written OP(Yn).
$X_n \in O_P(Y_n)$ iff $\forall \epsilon > 0\ \exists$ constant $C_\epsilon > 0 \ni \sup_n \Pr(\|X_n\| \ge C_\epsilon \|Y_n\|) < \epsilon$. If Xn ∈ OP(1), Xn is said to be bounded in probability.
• Little o in probability, written oP(Yn).
$X_n \in o_P(Y_n)$ iff $\|X_n\|/\|Y_n\| \overset{p}{\to} 0$.
If Xn ∈ oP(1), then Xn converges in probability to 0, and conversely. If Xn ∈ oP(1), then also Xn ∈ OP(1). (Exercise.)
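Boundedness in probability can be seen in a small simulation. This sketch assumes the hypothetical sequence Xn = (mean of n i.i.d. U(0,1) draws) − 1/2, which is OP(n^{−1/2}) by Chebyshev's inequality, and estimates a high quantile of √n |Xn| for several n; the seed, number of replications, and function names are illustrative choices.

```python
import math
import random

def scaled_deviation(n, rng):
    """Return sqrt(n) * |mean of n U(0,1) draws - 1/2|."""
    mean = sum(rng.random() for _ in range(n)) / n
    return math.sqrt(n) * abs(mean - 0.5)

def empirical_quantile(n, q=0.95, reps=1000, seed=0):
    """Empirical q-quantile of sqrt(n) * |X_n| over reps independent replications."""
    rng = random.Random(seed)
    values = sorted(scaled_deviation(n, rng) for _ in range(reps))
    return values[int(q * (reps - 1))]

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(empirical_quantile(n), 3))
```

The estimated quantiles stay in a common range as n grows, which is what the constant Cε in the definition of OP expresses; the unscaled |Xn| itself shrinks, consistent with Xn ∈ oP(1).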
Instead of a defining sequence {Yn} of random variables, the sequence of interest may be a sequence of constants {an}.
Some useful properties are the following, in which {Xn}, {Yn}, and {Zn} are random variables defined on a common probability space, and {an} and {bn} are sequences of constants.
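\[ X_n \in o_P(a_n) \implies X_n \in O_P(a_n) \tag{1.165} \]
\[ X_n \in o_P(1) \iff X_n \overset{p}{\to} 0. \tag{1.166} \]
\[ X_n \in O_P(1/a_n),\ \lim b_n/a_n < \infty \implies X_n \in O_P(m_n). \tag{1.167} \]
\[ X_n \in O_P(a_n) \implies X_n Y_n \in O_P(a_n Y_n). \tag{1.168} \]
\[ X_n \in O_P(a_n),\ Y_n \in O_P(b_n) \implies X_n Y_n \in O_P(a_n b_n). \tag{1.169} \]
\[ X_n \in O_P(a_n),\ Y_n \in O_P(b_n) \implies X_n + Y_n \in O_P(\|a_n\| + \|b_n\|). \tag{1.170} \]
\[ X_n \in O_P(Z_n),\ Y_n \in O_P(Z_n) \implies X_n + Y_n \in O_P(Z_n). \tag{1.171} \]
\[ X_n \in o_P(a_n),\ Y_n \in o_P(b_n) \implies X_n Y_n \in o_P(a_n b_n). \tag{1.172} \]
\[ X_n \in o_P(a_n),\ Y_n \in o_P(b_n) \implies X_n + Y_n \in o_P(\|a_n\| + \|b_n\|). \tag{1.173} \]
\[ X_n \in o_P(a_n),\ Y_n \in O_P(b_n) \implies X_n Y_n \in o_P(a_n b_n). \tag{1.174} \]
You are asked to prove these statements in Exercise 1.61. There are, of course, other variations on these relationships.
The order of convergence of a sequence of absolute expectations can be related to order of convergence in probability:
\[ a_n \in \mathrm{I\!R}_+,\ \mathrm{E}(|X_n|) \in O(a_n) \implies X_n \in O_P(a_n). \tag{1.175} \]
Almost sure convergence implies that the sup is bounded in probability. For any random variable X (recall that a random variable is finite a.s.),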