1.3 Sequences of Spaces, Events, and Random Variables
1.3.4 Weak Convergence in Distribution
Convergence in distribution, sometimes just called “weak convergence”, plays a fundamental role in statistical inference. It is the type of convergence in the central limit theorems (see Section 1.4.2), and it is the basis for the definition of asymptotic expectation (see Section 1.3.8), which, in turn, is the basis for most of the concepts of asymptotic inference. (Asymptotic inference is not based on the limits of the properties of the statistics in a sequence, and in Section 3.8.3, beginning on page 311, we will consider some differences between “asymptotic” properties and “limiting” properties.)
In studying the properties of a sequence of random variables $\{X_n\}$, the holy grail often is to establish that $a_n X_n \overset{d}{\to} \mathrm{N}(\mu, \sigma^2)$ for some sequence $\{a_n\}$, and to determine reasonable estimates of $\mu$ and $\sigma^2$. In this section we will show how this is sometimes possible, and we will consider it further in Section 1.3.7, and later in Section 3.8, where we will emphasize the statistical applications. Weak convergence to normality under less rigorous assumptions will be discussed in Section 1.4.
Convergence in distribution of a sequence of random variables is defined in terms of convergence of a sequence of CDFs. For a sequence that converges to a continuous CDF $F$, the Chebyshev norm of the difference between a function in the sequence and $F$ goes to zero, as stated in the following theorem.
Theorem 1.33 (Polya’s theorem)
If $F_n \overset{w}{\to} F$ and $F$ is continuous in $\mathrm{I\!R}^k$, then
$$\lim_{n\to\infty} \sup_{t\in\mathrm{I\!R}^k} |F_n(t) - F(t)| = 0.$$
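The role of the continuity of $F$ in Polya’s theorem can be seen in a small numerical sketch. The two sequences used here are illustrative choices, not taken from the text: the CDFs of N(1/n, 1), which converge weakly to the continuous CDF of N(0, 1), and the CDFs of point masses at 1/n, which converge weakly to the discontinuous CDF of a point mass at 0. The supremum is approximated by a maximum over a finite grid, which is also an assumption of the sketch.

```python
import math

def norm_cdf(x, mean=0.0):
    """CDF of N(mean, 1), computed from the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

def point_mass_cdf(location):
    """CDF of the degenerate distribution concentrated at `location`."""
    return lambda t: 1.0 if t >= location else 0.0

def sup_distance(cdf_a, cdf_b, grid):
    """Approximate sup_t |cdf_a(t) - cdf_b(t)| by a maximum over a finite grid."""
    return max(abs(cdf_a(t) - cdf_b(t)) for t in grid)

# Grid on [-8, 8] with spacing 0.001; built from integers so that 0.0 is exact.
grid = [i / 1000 for i in range(-8000, 8001)]
ns = (1, 10, 100, 1000)

# Continuous limit: the sup distance shrinks toward zero.
continuous_case = {
    n: sup_distance(lambda t, n=n: norm_cdf(t, 1.0 / n), norm_cdf, grid)
    for n in ns
}

# Discontinuous limit: weak convergence holds, but the sup distance stays at 1.
degenerate_case = {
    n: sup_distance(point_mass_cdf(1.0 / n), point_mass_cdf(0.0), grid)
    for n in ns
}

for n in ns:
    print(n, continuous_case[n], degenerate_case[n])
```

In the second sequence the CDFs differ by 1 at every $t$ in $[0, 1/n)$, so uniform convergence fails even though $F_n(t) \to F(t)$ at every continuity point of $F$.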
Theorem 1.34
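Let $\{F_n\}$ be a sequence of CDFs on $\mathrm{I\!R}$. Let
$$G_n(x) = F_n(b_{gn}x + a_{gn})$$
and
$$H_n(x) = F_n(b_{hn}x + a_{hn}),$$
where $\{b_{gn}\}$ and $\{b_{hn}\}$ are sequences of positive real numbers and $\{a_{gn}\}$ and $\{a_{hn}\}$ are sequences of real numbers. Suppose
$$G_n \overset{w}{\to} G \quad\text{and}\quad H_n \overset{w}{\to} H,$$
where $G$ and $H$ are nondegenerate CDFs. Then
$$b_{gn}/b_{hn} \to b > 0, \qquad (a_{gn} - a_{hn})/b_{hn} \to a \in \mathrm{I\!R},$$
and
$$H(bx + a) = G(x) \quad \forall x \in \mathrm{I\!R}.$$
Proof. ** fix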
The distributions in Theorem 1.34 are in a location-scale family (see Section 2.6, beginning on page 178).
There are several necessary and sufficient conditions for convergence in distribution. A set of such conditions is given in the following “portmanteau” theorem.
Theorem 1.35 (characterizations of convergence in distribution; “portmanteau” theorem)
Given the sequence of random variables $\{X_n\}$ and the random variable $X$, all defined on a common probability space, each of the following is a necessary and sufficient condition that $X_n \overset{d}{\to} X$.
(i) $\mathrm{E}(g(X_n)) \to \mathrm{E}(g(X))$ for all real bounded continuous functions $g$.
(ii) $\mathrm{E}(g(X_n)) \to \mathrm{E}(g(X))$ for all real continuous functions $g$ such that $g(x) \to 0$ as $|x| \to \infty$.
(iii) $\Pr(X_n \in B) \to \Pr(X \in B)$ for all Borel sets $B$ such that $\Pr(X \in \partial B) = 0$.
(iv) $\liminf \Pr(X_n \in S) \geq \Pr(X \in S)$ for all open sets $S$.
(v) $\limsup \Pr(X_n \in T) \leq \Pr(X \in T)$ for all closed sets $T$.
Proof. The proofs of the various parts of this theorem are in Billingsley (1995), among other resources.
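A degenerate example may help in reading conditions (iii) through (v). The sketch below assumes, purely for illustration, that $X_n$ is the constant $1/n$ and $X$ is the constant 0, so that $X_n \overset{d}{\to} X$; the function $g = \cos$ and the sets $S = (0,1)$ and $T = \{0\}$ are likewise arbitrary choices. Because the laws are point masses, every probability is either 0 or 1 and can be evaluated exactly.

```python
import math

# Illustrative assumption: X_n = 1/n (constant) and X = 0 (constant).
ns = [1, 10, 100, 1000, 10000]
x_n = [1.0 / n for n in ns]
x_limit = 0.0

# Condition (i) with the bounded continuous g(x) = cos(x):
# E(g(X_n)) = cos(1/n) approaches E(g(X)) = cos(0).
expectation_gaps = [abs(math.cos(x) - math.cos(x_limit)) for x in x_n]

def in_open_set(x):
    """Membership in the open set S = (0, 1)."""
    return 0.0 < x < 1.0

def in_closed_set(x):
    """Membership in the closed set T = {0}."""
    return x == 0.0

# For point masses, Pr(X_n in A) is the indicator that 1/n lies in A.
open_probs = [1.0 if in_open_set(x) else 0.0 for x in x_n]
closed_probs = [1.0 if in_closed_set(x) else 0.0 for x in x_n]
open_limit_prob = 1.0 if in_open_set(x_limit) else 0.0
closed_limit_prob = 1.0 if in_closed_set(x_limit) else 0.0

print("gaps in E(g(X_n)):", expectation_gaps)
print("Pr(X_n in S):", open_probs, " Pr(X in S):", open_limit_prob)
print("Pr(X_n in T):", closed_probs, " Pr(X in T):", closed_limit_prob)
```

Here $\Pr(X_n \in S) = 1$ for all $n \geq 2$ while $\Pr(X \in S) = 0$, and $\Pr(X_n \in T) = 0$ for all $n$ while $\Pr(X \in T) = 1$. Both inequalities in (iv) and (v) hold strictly, and neither set is covered by (iii), because $\Pr(X \in \partial S) = \Pr(X \in \partial T) = 1$.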
Although convergence in distribution does not imply a.s. convergence, convergence in distribution does allow us to construct an a.s. convergent sequence. This is stated in Skorokhod’s representation theorem.
Theorem 1.36 (Skorokhod’s representation theorem)
If for the random variables (vectors!) $X_1, X_2, \ldots$, we have $X_n \overset{d}{\to} X$, then there exist random variables $Y_1 \overset{d}{=} X_1$, $Y_2 \overset{d}{=} X_2, \ldots$, and $Y \overset{d}{=} X$, such that $Y_n \overset{\mathrm{a.s.}}{\to} Y$.
Proof. Exercise.
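For scalar random variables, one standard way to build such a sequence is the quantile transform: on the probability space $((0,1), \mathcal{B}, \text{Lebesgue})$, set $Y_n(u) = F_n^{-1}(u)$ and $Y(u) = F^{-1}(u)$. The sketch below is only an illustration of that construction, not a proof of the theorem. It assumes, as an example not taken from the text, that $X_n$ is exponential with rate $1 + 1/n$ and $X$ is exponential with rate 1.

```python
import math

def quantile_exponential(u, rate):
    """Quantile function (inverse CDF) of the exponential distribution."""
    return -math.log(1.0 - u) / rate

def y_n(u, n):
    """Y_n(u) = F_n^{-1}(u), with F_n the Exponential(rate = 1 + 1/n) CDF."""
    return quantile_exponential(u, 1.0 + 1.0 / n)

def y(u):
    """Y(u) = F^{-1}(u), with F the Exponential(rate = 1) CDF."""
    return quantile_exponential(u, 1.0)

# Each u is a fixed outcome in (0, 1); convergence is checked outcome by outcome.
outcomes = [0.05, 0.25, 0.5, 0.75, 0.95]
gaps = {
    n: max(abs(y_n(u, n) - y(u)) for u in outcomes)
    for n in (1, 10, 100, 1000)
}

for n, gap in gaps.items():
    print(n, gap)
```

Each $Y_n$ has the same distribution as $X_n$ because it is a quantile transform of a uniform variable, and for every fixed outcome $u$ the gap $|Y_n(u) - Y(u)| = -\log(1-u)/(n+1)$ goes to zero.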
Theorem 1.37 (continuity theorem)
Let $X_1, X_2, \cdots$ be a sequence of random variables (not necessarily independent) with characteristic functions $\varphi_{X_1}, \varphi_{X_2}, \cdots$ and let $X$ be a random variable with characteristic function $\varphi_X$. Then
$$X_n \overset{d}{\to} X \iff \varphi_{X_n}(t) \to \varphi_X(t) \;\; \forall t.$$
Proof. Exercise.
The $\Longleftarrow$ part of the continuity theorem is called the Lévy-Cramér theorem and the $\Longrightarrow$ part is sometimes called the first limit theorem.
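Both directions of the continuity theorem can be observed in a familiar example that is not part of the text: the binomial$(n, \lambda/n)$ distributions, which converge in distribution to the Poisson$(\lambda)$ distribution. The sketch below compares the CFs at a few arbitrary values of $t$ and the probability mass functions at a few arbitrary values of $k$; the value $\lambda = 2$ is also an arbitrary choice.

```python
import cmath
import math

LAM = 2.0  # Poisson mean; an arbitrary value chosen for illustration

def cf_binomial(t, n, p):
    """CF of Binomial(n, p): (1 - p + p e^{it})^n."""
    return (1.0 - p + p * cmath.exp(1j * t)) ** n

def cf_poisson(t, lam):
    """CF of Poisson(lam): exp(lam (e^{it} - 1))."""
    return cmath.exp(lam * (cmath.exp(1j * t) - 1.0))

def pmf_binomial(k, n, p):
    """Pr(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def pmf_poisson(k, lam):
    """Pr(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

t_values = [-3.0, -1.0, 0.5, 2.0]
k_values = range(0, 11)

cf_gap = {}
pmf_gap = {}
for n in (10, 100, 1000, 10000):
    p = LAM / n
    # Largest discrepancy between the CFs over the chosen t values.
    cf_gap[n] = max(abs(cf_binomial(t, n, p) - cf_poisson(t, LAM)) for t in t_values)
    # Largest discrepancy between the probability mass functions for k = 0, ..., 10.
    pmf_gap[n] = max(abs(pmf_binomial(k, n, p) - pmf_poisson(k, LAM)) for k in k_values)
    print(n, cf_gap[n], pmf_gap[n])
```

The two columns of discrepancies shrink together, which is what the equivalence in the theorem leads us to expect.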
The continuity theorem also applies to MGFs if they exist for all $X_n$. A nice use of the continuity theorem is in the proof of a simple form of the central limit theorem, or CLT. Here I will give the proof for scalar random variables. There are other forms of the CLT, and other important limit theorems, which will be the topic of Section 1.4. Another reason for introducing this simple CLT now is so we can use it for some other results that we discuss before Section 1.4.
Theorem 1.38 (central limit theorem)
If $X_1, \ldots, X_n$ are iid with mean $\mu$ and variance $0 < \sigma^2 < \infty$, then $Y_n = (\sum X_i - n\mu)/(\sqrt{n}\sigma)$ has limiting distribution $\mathrm{N}(0,1)$.
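Proof. It will be convenient to define a function related to the CF: let $h$ be such that $h(\mathrm{i}t) = e^{-\mathrm{i}\mu t}\varphi_X(t)$, which is the CF of $X - \mu$; hence $h(0) = 1$, $h'(0) = 0$, and $h''(0) = \sigma^2$. Now expand $h$ in a Taylor series about 0:
$$h(\mathrm{i}t) = h(0) + h'(0)\mathrm{i}t - \frac{1}{2}h''(\xi)t^2,$$
for some $\xi$ between 0 and $t$. Substituting for $h(0)$ and $h'(0)$, and adding and subtracting $\sigma^2t^2/2$ to this, we have
$$h(\mathrm{i}t) = 1 - \frac{\sigma^2t^2}{2} - \frac{(h''(\xi) - \sigma^2)t^2}{2}.$$
This is the form we will find useful. Now, consider the CF of $Y_n$:
$$\varphi_{Y_n}(t) = \mathrm{E}\left(\exp\left(\mathrm{i}t\,\frac{\sum X_i - n\mu}{\sqrt{n}\sigma}\right)\right) = \left(\mathrm{E}\left(\exp\left(\mathrm{i}t\,\frac{X - \mu}{\sqrt{n}\sigma}\right)\right)\right)^n = \left(h\left(\frac{\mathrm{i}t}{\sqrt{n}\sigma}\right)\right)^n.$$
From the expansion of $h$, we have
$$h\left(\frac{\mathrm{i}t}{\sqrt{n}\sigma}\right) = 1 - \frac{t^2}{2n} - \frac{(h''(\xi) - \sigma^2)t^2}{2n\sigma^2}.$$
So,
$$\varphi_{Y_n}(t) = \left(1 - \frac{t^2}{2n} - \frac{(h''(\xi) - \sigma^2)t^2}{2n\sigma^2}\right)^n.$$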
Now we need a well-known (but maybe forgotten) result (see page 652): If $\lim_{n\to\infty} f(n) = 0$, then
$$\lim_{n\to\infty}\left(1 + \frac{a}{n} + \frac{f(n)}{n}\right)^{bn} = e^{ab}.$$
Therefore, because $\lim_{n\to\infty} h''(\xi) = h''(0) = \sigma^2$, $\lim_{n\to\infty}\varphi_{Y_n}(t) = e^{-t^2/2}$, which is the CF of the N(0,1) distribution. (Actually, the conclusion relies on the Lévy-Cramér theorem, the $\Longleftarrow$ part of the continuity theorem, Theorem 1.37 on page 87; that is, while we know that the CF determines the distribution, we must also know that the convergence of a sequence of CFs determines a convergent distribution.)
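The conclusion of the theorem can be checked exactly, without simulation, in one special case. The sketch below assumes iid Bernoulli($p$) summands with $p = 0.3$ (an arbitrary choice, not from the text), so that $\sum X_i$ is binomial and the CDF of $Y_n$ is a step function with $n + 1$ jumps. It computes the Chebyshev (sup) distance between that CDF and the N(0,1) CDF, which by Polya’s theorem must go to zero.

```python
import math

def norm_cdf(x):
    """CDF of N(0, 1), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kolmogorov_distance_bernoulli_sum(n, p):
    """sup_y |Pr(Y_n <= y) - Phi(y)| for Y_n = (sum X_i - n p) / (sqrt(n) sigma),
    where the X_i are iid Bernoulli(p).

    The CDF of Y_n is constant between jumps and Phi is increasing, so the
    supremum is approached at a jump point, either just before or at the jump.
    """
    sigma = math.sqrt(p * (1.0 - p))
    cdf_before = 0.0  # Pr(sum X_i < k)
    worst = 0.0
    for k in range(n + 1):
        pmf = math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        cdf_after = cdf_before + pmf  # Pr(sum X_i <= k)
        y = (k - n * p) / (math.sqrt(n) * sigma)  # standardized jump location
        phi = norm_cdf(y)
        worst = max(worst, abs(cdf_before - phi), abs(cdf_after - phi))
        cdf_before = cdf_after
    return worst

P = 0.3  # Bernoulli success probability; an arbitrary illustrative value
distances = {n: kolmogorov_distance_bernoulli_sum(n, P) for n in (10, 100, 1000)}

for n, d in distances.items():
    print(n, d)
```

The distances decrease as $n$ grows, roughly in proportion to $1/\sqrt{n}$ in this example; the theorem itself asserts only convergence, not a rate.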
An important CLT has a weaker hypothesis than the simple one above; instead of iid random variables, we only require that they be independent (and have finite first and second moments, of course). In Section 1.6, we relax the hypothesis in the other direction; that is, we allow dependence in the random variables. (In that case, we must impose some conditions of similarity of the distributions of the random variables.)
Tightness of Sequences
In a convergent sequence of probability measures on a metric space, we may be interested in how concentrated the measures in the sequence are. (If the space does not have a metric, this question would not make sense.) We refer to this as “tightness” of the sequence, and we will define it only on the metric space $\mathrm{I\!R}^d$.
Definition 1.41 (tightness of a sequence of probability measures)
Let $\{P_n\}$ be a sequence of probability measures on $(\mathrm{I\!R}^d, \mathcal{B}^d)$. The sequence is said to be tight iff for every $\epsilon > 0$, there is a compact (bounded and closed) set $C \in \mathcal{B}^d$ such that
$$\inf_n P_n(C) > 1 - \epsilon.$$
Notice that this definition does not require that $\{P_n\}$ be convergent, but of course, we are interested primarily in sequences that converge. The following theorem, whose proof can be found in Billingsley (1995) on page 336, among other places, connects tightness to convergence.
Theorem 1.39
Let $\{P_n\}$ be a sequence of probability measures on $(\mathrm{I\!R}^d, \mathcal{B}^d)$.
(i) The sequence $\{P_n\}$ is tight iff for every subsequence $\{P_{n_i}\}$ there exists a further subsequence $\{P_{n_j}\} \subseteq \{P_{n_i}\}$ and a probability measure $P$ on $(\mathrm{I\!R}^d, \mathcal{B}^d)$ such that $P_{n_j} \overset{w}{\to} P$, as $j \to \infty$.
(ii) If $\{P_n\}$ is tight and each weakly convergent subsequence converges to the same measure $P$, then $P_n \overset{w}{\to} P$.
Tightness of a sequence of random variables is defined in terms of tightness of their associated probability measures.
Definition 1.42 (tightness of a sequence of random variables)
Let $\{X_n\}$ be a sequence of random variables, with associated probability measures $\{P_n\}$. The sequence $\{X_n\}$ is said to be tight iff
$$\forall \epsilon > 0 \;\; \exists M < \infty \;\ni\; \sup_n P_n(|X_n| > M) < \epsilon.$$
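The definition can be explored numerically by computing $\sup_n \Pr(|X_n| > M)$ for increasing $M$. The sketch below contrasts two sequences chosen only for illustration: $X_n \sim \mathrm{N}(1/n, 1)$, which is tight, and $X_n \sim \mathrm{N}(n, 1)$, whose mass drifts off to infinity. The supremum is taken over the first 200 terms only, which is an assumption of the sketch; a finite collection of random variables is always tight, so the second column merely suggests what happens as more terms are included.

```python
import math

def two_sided_tail(mean, m):
    """Pr(|X| > m) for X ~ N(mean, 1)."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))
    return 1.0 - (cdf(m) - cdf(-m))

def sup_tail(means, m):
    """Largest value of Pr(|X_n| > m) over the listed members of the sequence."""
    return max(two_sided_tail(mu, m) for mu in means)

indices = range(1, 201)  # only the first 200 terms are examined
tight_means = [1.0 / n for n in indices]       # X_n ~ N(1/n, 1)
drifting_means = [float(n) for n in indices]   # X_n ~ N(n, 1)

tight_sup = {}
drifting_sup = {}
for m in (2.0, 5.0, 10.0, 50.0):
    tight_sup[m] = sup_tail(tight_means, m)
    drifting_sup[m] = sup_tail(drifting_means, m)
    print(m, tight_sup[m], drifting_sup[m])
```

For the first sequence a single $M$ works for all $n$ at once, because the means are bounded. For the second, whatever $M$ is chosen, terms with $n$ much larger than $M$ place almost all of their mass outside $[-M, M]$.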