NOTES ON PROBABILITY Greg Lawler
Last Updated: March 21, 2016
Overview
This is an introduction to the mathematical foundations of probability theory. It is intended as a supplement or follow-up to a graduate course in real analysis. The first two sections assume the knowledge of measure spaces, measurable functions, Lebesgue integral, and notions of convergence of functions; the third assumes Fubini’s Theorem; the fifth assumes knowledge of Fourier transform of nice (Schwartz) functions onR; and Section 6 uses the Radon-Nikodym Theorem.
The mathematical foundations of probability theory are exactly the same as those of Lebesgue integration. However, probability adds much intuition and leads to different developments of the area. These notes are only intended to be a brief introduction — this might be considered what every graduate student should know about the theory of probability.
Probability uses some different terminology than that of Lebesgue integration in R. These notes will introduce the terminology and will also relate these ideas to those that would be encountered in an ele-mentary (by which we will mean pre-measure theory) course in probability or statistics. Graduate students encountering probabilty for the first time might want to also read an undergraduate book in probability.
1
Probability spaces
Definition Aprobability spaceis a measure space with total measure one. The standard notation is (Ω,F,P) where:
• Ω is a set (sometimes called asample spacein elementary probability). Elements of Ω are denotedω
and are sometimes calledoutcomes.
• F is aσ-algebra (orσ-field, we will use these terms synonymously) of subsets of Ω. Sets inFare called events.
• Pis a function from F to [0,1] withP(Ω) = 1 and such that ifE1, E2, . . .∈ F are disjoint,
P ∞ [ j=1 Ej = ∞ X j=1 P[Ej].
We say “probability of E” forP(E).
A discrete probability spaceis a probability space such that Ω is finite or countably infinite. In this case we usually chooseF to be all the subsets of Ω (this can be writtenF = 2Ω), and the probability measure
P is given by a functionp: Ω→[0,1] withP
ω∈Ωp(ω) = 1.Then,
P(E) = X
ω∈E p(ω).
We will consider another important example here, the probability space associated to an infinite number of flips of a coin. Let
We think of 0 as “tails” and 1 as “heads”. For each positive integern, let Ωn ={(ω1, ω2, . . . , ωn) :ωj = 0 or 1}.
Each Ωnis a finite set with 2nelements. We can consider Ωn as a probability space withσ-algebra 2Ωn and
probabilityPn induced by
pn(ω) = 2−n, ω∈Ωn.
LetFn be the σ-algebra on Ω consisting of all events that depend only on the first nflips. More formally,
we defineFn to be the collection of all subsetsAof Ω such that there is anE∈2Ωn with
A={(ω1, ω2, . . .) : (ω1, ω2, . . . , ωn)∈E}. (1)
Note thatFn is a finiteσ-algebra (containing 22
n
subsets) and F1⊂ F2⊂ F3⊂ · · ·.
IfAis of the form (1), we let
P(A) =Pn(E).
One can easily check that this definition is consistent. This gives a functionPon F0:=
∞
[
j=1
Fj.
Recall that analgebraof subsets of Ω is a collection of subsets containing Ω, closed under complementation, and closed underfinite unions. To show closure under finite unions, it suffices to show that ifE1, E2∈ F0,
thenE1∪E2∈ F0.
Proposition 1.1. F0 is an algebra but not a σ-algebra.
Proof. Ω∈ F0since Ω∈ F
1. SupposeE ∈ F0. ThenE ∈ Fn for some n, and henceEc∈ Fn andEc∈ F.
Also, suppose E1, E2 ∈ F. Then there existsj, k with E1 ∈ Fj, E2 ∈ Fk. Letn= max{j, k}. Then since
theσ-algebras are increasing,E1, E2∈ Fn and henceE1∪E2∈ Fn. Therefore,E1∪E2∈ F0andF0 is an
algebra.
To see thatF0 is not aσ-algebra consider the singleton set E={(1,1,1, . . .)}.
E is not inF0 butE can be written as a countable intersection of events inF0,
E=
∞
\
j=1
{(ω1, ω2, . . .) :ω1=ω2=· · ·=ωj= 1}.
Proposition 1.2. The functionPis a (countably additive) measure onF0, i.e., it satisfies
P[∅] = 0,and if
E1, E2, . . .∈ F0 are disjoint with∪∞n=1En∈ F0, then
P "∞ [ n=1 En # = ∞ X n=1 P(En).
Proof. P[∅] = 0 is immediate. Also it is easy to see thatPisfinitely additive, i.e., ifE1, E2, . . . , En∈ F0are disjoint then P n [ j=1 Ej = n X j=1 P(Ej).
(To see this, note that there must be anN such that E1, . . . , En ∈ FN and then we can use the additivity
ofPN.)
Showing countable subadditivity is harder. In fact, the following stronger fact holds: supposeE1, E2, . . .∈ F0 are disjoint andE=∪∞
n=1En ∈ F0. Then there is anN such thatEj =∅ forj > N. Once we establish
this, countable additivity follows from finite additivity. To establish this, we consider Ω as thetopologicalspace
{0,1} × {0,1} × · · ·
with the product topology where we have given each{0,1}the discrete topology (all four subsets are open). The product topology is the smallest topology such that all the sets inF0are open. Note also that all sets
in F0 are closed since they are complements of sets in F0. It follows from Tychonoff’s Theorem that Ω is
a compact topological space under this topology. SupposeE1, E2, . . .are as above withE=∪∞
n=1En ∈ F0.
Then E1, E2, . . . is an open cover of the closed (and hence compact) set E. Therefore there is a finite subcover,E1, E2, . . . , EN. Since theE1, E2, . . .are disjoint, this must imply thatEj=∅ forj > N.
LetFbe the smallestσ-algebra containingF0. Then the Carath´eodory Extension Theorem tells us that
Pcan be extended uniquely to a complete measure space (P,F¯,P) whereF ⊂F.¯
The astute reader will note that the construction we just did is exactly the same as the construction of Lebesgue measure on [0,1]. Here we denote a real numberx∈[0,1] by its dyadic expansion
x=.ω1ω2ω3· · ·= ∞ X j=1 ωj 2j.
(There is a slight nuisance with the fact that .011111· · ·=.100000· · ·, but this can be handled.) The
σ-algebra F above corresponds to the Borel subsets of [0,1] and the completionF corresponds to the Lebesgue measurable sets. If the most complicated probability space we were interested were the space above, then we could just use Lebesgue measure on [0,1]. In fact, for almost all important applications of probability, onecouldchoose the measure space to be [0,1] with Lebesgue measure (see Exercise 3). However, this choice is not always the most convenient or natural.
Continuing on the last remark, one generally does not care what probability space one is working on. What one observes are “random variables” which are discussed in the next section.
2
Random Variables and Expectation
DefinitionArandom variableX is a measurable function from a probability space (Ω,F,P) to the reals1, i.e., it is a function
X: Ω−→(−∞,∞) such that for every Borel setB,
X−1(B) ={X ∈B} ∈ F.
Here we use the shorthand notation
{X ∈B}={ω∈Ω :X(ω)∈B}.
If X is a random variable, then for every Borel subsetB ofR, X−1(B)∈ F. We can define a function on Borel sets by
µX(B) =P{X ∈B}=P[X−1(B)].
This function is in fact a measure, and (R,B, µX) is a probability space. The measure µX is called the
distributionof the random variable. If µX gives measure one to a countable set of reals, thenX is called
a discrete random variable. If µX gives zero measure to every singleton set, and hence to every countable
set,X is called acontinuous random variable. Every random variable can be written as a sum of a discrete random variable and a continuous random variable. All random variables defined on a discrete probability space are discrete.
The distribution µX is often given in terms of thedistribution function2 defined by FX(x) =P{X ≤x}=µX(−∞, x].
Note thatF =FX satisfies the following:
• limx→−∞F(x) = 0.
• limx→∞F(x) = 1.
• F is a nondecreasing function.
• F is right continuous, i.e., for everyx,
F(x+) := lim
↓0F(x+) =F(x).
Conversely, any F satisfying the conditions above is the distribution function of a random variable. The distribution can be obtained from the distribution function by setting
µX(−∞, x] =FX(x),
and extending uniquely to the Borel sets.
For some continuous random variablesX, there is a functionf =fX :R→[0,∞) such that P{a≤X ≤b}=
Z b
a
f(x)dx.
Such a function, if it exists, is called thedensity3 of the random variable. If the density exists, then
F(x) = Z x
−∞
f(t)dt.
1or, more generally, to any topological space
2often called cumulative distribution function (cdf) in elementary courses
3More precisely, it is the density or Radon-Nikoydm derivative with respect to Lebesgue measure. In elementary courses, the term probability density function (pdf) is often used.
Iff is continuous att, then the fundamental theorem of calculus implies that f(x) =F0(x). A densityf satisfies Z ∞ −∞ f(x)dx= 1.
Conversely, any nonnegative function that integrates to one is the density of a random variable. Example IfE is an event, theindicator functionofE is the random variable
1E(ω) =
1, ω∈E,
0, ω6∈E.
(The corresponding function in analysis is often called the characteristic function and denotedχE.
Proba-bilists never use the term characteristic function for the indicator function because the term characteristic function has another meaning. The term indicator function has no ambiguity.)
Example Let (Ω,F,P) be the probability space for infinite tossings of a coin as in the previous section. Let
Xn(ω1, ω2, . . .) =ωn=
1, ifnth flip heads,
0, ifnth flip tails. (2)
Sn=X1+· · ·+Xn= # heads on firstnflips. (3)
Then X1, X2, . . . , andS1, S2, . . .are discrete random variables. IfFn denotes the σ-algebra of events that
depend only on the first n flips, then Sn is also a random variable on the probability space (Ω,Fn,P). However,Sn+1 is not a random variable on (Ω,Fn,P).
Example Letµbe any probability measure on (R,B). Consider the trivial random variable
X =x,
defined on the probability space (R,B, µ). ThenXis a random variable andµX=µ. Hence every probability
measure onRis the distribution of a random variable.
Example A random variableX has anormal distributionwith meanµand varianceσ2 if it has density f(x) =√ 1
2πσ2e
−(x−µ)2/2σ2
, −∞< x <∞.
If µ = 0 and σ2 = 1, X is said to have a standard normal distribution. The distribution function of the
standard normal is often denoted Φ,
Φ(x) = Z x −∞ 1 √ 2πe −t2/2dt.
Example IfX is a random variable and
g: (R,B)→R
is a Borel measurable function, thenY =g(X) is also a random variable.
Example Recall that the Cantor function is a continuous functionF : [0,1]→[0,1] withF(0) = 0, F(1) = 1 and such thatF0(x) = 0 for allx∈[0,1]\AwhereAdenotes the “middle thirds” Cantor set. ExtendF toR by settingF(x) = 0 forx≤0 andF(x) = 1 forx≥1. ThenF is a distribution function. A random variable with this distribution function is continuous, sinceF is continuous. However, such a random variable has no density.
The term random variable is a little misleading but it standard. It is perhaps easier to think of a random variable as a “random number”. For example, in the case of coin-flips we get an infinite sequence of random numbers corresponding to the results of the flips.
Definition IfX is a nonnegative random variable, the expectationofX, denotedE(X), is E(X) =
Z
X dP,
where the integral is the Lebesgue integral. IfX is a random variable withE(|X|)<∞, then we also define the expectation by
E(X) = Z
X dP.
IfX takes on positive and negative values, and E(|X|) =∞, the expectation is not defined.
Other terms used for expectation are expected value and mean (the term “expectation value” can be found in the physics literature but it is not good English and sometimes refers to integrals with respect to measures that are not probability measures). The letterµis often used for mean. IfX is a discrete random variable taking on the valuesa1, a2, a3, . . .we have the standard formula from elementary courses:
E(X) =
∞
X
j=1
ajP{X=aj}, (4)
provided the sum is absolutely convergent.
Lemma 2.1. SupposeX is a random variable with distributionµX. Then, E(X) =
Z
R x dµX.
(Either side exists if and only if the other side exists.)
Proof. First assume thatX ≥0. Ifn, k are positive integers let
Ak,n= ω: k−1 n ≤X(ω)< k n .
For everyn <∞, consider the discrete random variableXn taking values in{k/n:k∈Z},
Xn= ∞ X k=1 k n1Ak,n. Then, Xn− 1 n ≤X≤Xn. Hence E[Xn]− 1 n ≤E[X]≤E[Xn]. But, E Xn− 1 n = ∞ X k=1 k−1 n P(Ak,n),
E[Xn] = ∞ X k=1 k nP(Ak,n), and k−1 n µX[ k−1 n , k n)≤ Z [k−1 n , k n) x dµX ≤ k nµX[ k−1 n , k n). By summing we get E Xn− 1 n ≤ Z [0,∞) x dµX≤E[Xn].
By lettingn go to infinity we get the result. The general case can be done by writingX =X+−X−. We omit the details.
In particular, the expectation of a random variable depends only on its distribution and not on the probability space on which it is defined. IfX has a densityf, then the measureµX is the same asf(x)dx
so we can write (as in elementary courses)
E[X] = Z ∞
−∞
x f(x)dx,
where again the expectation exists if and only if Z ∞
−∞
|x|f(x)dx <∞.
Since expectation is defined using the Lebesgue integral, all results about the Lebesgue integral (e.g., linearity, Monotone Convergence Theorem, Fatou’s Lemma, Dominated Convergence Theorem) also hold for expectation.
Exercise 2.2. Use H¨older’s inequality to show that if X is a random variable and q ≥1, then E[|X|q]≥ (E[X])q. For whichX does equality hold?
Example Consider the coin flipping random variables (2) and (3). Clearly E[Xn] = 1· 1 2 + 0· 1 2 = 1 2. Hence E[Sn] =E[X1+· · ·+Xn] =E[X1] +· · ·+E[Xn] = n 2. Also, E[X1+X1+· · ·+X1] =E[X1] +E[X1] +· · ·+E[X1] = n 2.
Note that in the above example, X1+· · ·+Xn and X1+· · ·+X1 do not have the same distribution,
yet the linearity rule for expectation show that they have the same expectation. A very powerful (although trivial!) tool in probability is the linearity of the expectation even for random variables that are “correlated”. IfX1, X2, . . .are nonnegative we can even take countable sums,
E "∞ X n=1 Xn # = ∞ X n=1 E[Xn].
This can be derived from the rule for finite sums and the monotone convergence theorem by considering the increasing sequence of random variables
Yn= n
X
Lemma 2.3. SupposeX is a random variable with distributionµX, and g is a Borel measurable function. Then, E[g(X)] = Z R g(x)dµX.
(Either side exists if and only if the other side exists.)
Proof. Assume first thatgis a nonnegative function. Then there exists an increasing sequence of nonnegative simple functions withgn approaching g. Note that gn(X) is then a sequence of nonnegative simple random
variables approachingg(X). For the simple functionsgn it is immediate from the definition that
E[gn(X)] =
Z
R
gn(x)dµX.
and hence by the monotone convergence theorem,
E[g(X)] = Z
R
g(x)dµX.
The general case can be done by writingg=g+−g− andg(X) =g+(X)−g−(X).
IfX has a densityf, then we get the following formula found in elementary courses
E[g(X)] = Z ∞
−∞
g(x)f(x)dx.
Definition Thevarianceof a random variable X is defined by Var(X) =E[(X−E(X))2],
provided this expecation exists. Note that
Var(X) = E[(X−E(X))2]
= E[X2−2XE(X) + (E(X))2] = E[X2]−(E(X))2.
The variance is often denoted σ2 = σ2(X) and the square root of the variance, σ, is called the standard
deviation
There is a slightly different use of the term standard deviation in statistics; see any textbook on statistics for a discussion.
Lemma 2.4. Let X be a random variable. (Markov’s Inequality) If a >0, P{|X| ≥a} ≤ E[|X|] a . (Chebyshev’s Inequality) If a >0, P{|X−E(X)| ≥a} ≤ Var(X) a2 .
Proof. LetXa=a1{|X|≥a} . ThenXa ≤ |X|and hence
aP{|X| ≥a}=E[Xa]≤E[X].
This gives Markov’s inequality. For Chebyshev’s inequality, we apply Markov’s inequality to the random variable [X−E(X)]2 to get
P{[X−E(X)]2≥a2} ≤ E([X−E(X)]
2)
a2 .
Exercise 2.5(Generalized Chebyshev Inequality). Letf : [0,∞)→[0,∞)be a nondecreasing Borel function and letX be a nonnegative random variable. Then for alla >0,
P{X ≥a} ≤ E [f(X)]
f(a) .
3
Independence
Throughout this section we will assume that there is a probability space (Ω,A,P). Allσ-algebras will be sub-σ-algebras ofA, and all random variables will be defined on this space.
Definition
• Two eventsA, B are called independentifP(A∩B) =P(A)P(B).
• A collection of events{Aα}is calledindependentif for every distinctAα1, . . . , Aαn,
P(Aα1∩ · · · ∩Aαn) =P(Aα1)· · ·P(Aαn).
• A collection of events{Aα}is calledpairwise independentif for each distinctAα1, Aα2,
P(Aα1∩Aα2) =P(Aα1)P(Aα2).
Clearly independence implies pairwise independence but the converse is false. An easy example can be seen by rolling two dice and lettingA1={ sum of rolls is 7},A2={ first roll is 1},A3={ second roll is 6}. Then P(A1) =P(A2) =P(A3) =1 6, P(A1∩A2) =P(A1∩A3) =P(A2∩A3) = 1 36, butP(A1∩A2∩A3) = 1/366=P(A1)P(A2)P(A3).
Definition A finite collection ofσ-algebrasF1, . . . ,Fnis calledindependentif for anyA1∈ F1, . . . , An∈ Fn, P(A1∩ · · · ∩An) =P(A1)· · ·P(An).
An infinite collection ofσ-algebras is called independent if every finite subcollection is independent. Definition IfX is a random variable and F is aσ-algebra, we say thatX isF-measurable ifX−1(B)∈ F for every Borel setB, i.e., ifX is a random variable on the probability space (Ω,F,P). We defineF(X) to be the smallestσ-algebraF for whichX isF-measurable.
In fact,
F(X) ={X−1(B) :B Borel}.
(ClearlyF(X) contains the right hand side, and we can check that the right hand side is a σ-algebra.) If {Xα} is a collection of random variables, then F({Xα}) is the smallestσ-algebra F such that each Xα is
F-measurable.
Probabilists think of a σ-algebra as “information”. Roughly speaking, we have the information in the
σ-algebra F if for every event E ∈ F we know whether or not the event E has occured. A random variableX isF-measurable if we can determine the value ofX by knowing whether or notEhas occured for everyF-measurable eventE. For example suppose we roll two dice andX is the sum of the two dice. To determine the value ofX it suffices to know which of the following events occured:
Ej,1={first roll isj}, j= 1, . . .6,
Ej,2={second roll isj}, j= 1, . . .6.
More generally, if X is any random variable, we can determine X by knowing which of the events
Ea:={X < a}have occured.
Exercise 3.1. SupposeF1⊂ F2⊂ · · · is an increasing sequence ofσ-algebras and
F0= ∞ [ n=1 Fn. ThenF0 is an algebra.
Exercise 3.2. SupposeX is a random variable and g is a Borel measurable function. LetY =g(X). Then F(Y)⊂ F(X). The inclusion can be a strict inclusion.
The following is a very useful way to check independence.
Lemma 3.3. SupposeF0andG0are two algebras of events that are independent, i.e.,
P(A∩B) =P(A)P(B) forA∈ F0, B∈ G0. ThenF :=σ(F0)andG:=σ(G0)are independentσ-algebras.
Proof. LetB∈ G0and define the measure
µB(A) =P(A∩B), µ˜B(A) =P(A)P(B).
Note thatµB,µ˜B are finite measures and they agree onF0. Hence (using Cartheodory extension) they must
agree onF. HenceP(A∩B) =P(A)P(B) forA∈ F, B∈ G0. Now takeA∈F and consider the measures νA(B) =P(A∩B), ν˜A(B) =P(A)P(B).
If X1, . . . , Xn are random variables, we can consider them as a random vector (X1, . . . , Xn) and hence
as a random variable taking values in Rn. The joint distribution of the random variables is the measure
µ=µX1,...,Xn on Borel sets inR
n given by
µ(B) =P{(X1, . . . , Xn)∈B}.
Each random variableX1, . . . , Xn also has a distribution onR,µXi; these are called themarginal distribu-tions. Thejoint distribution functionF =FX1,...,Xn is defined by
Definition Random variablesX1, X2, . . . , Xn are said to beindependentif any of these (equivalent)
condi-tions hold:
• For all Borel setsB1, . . . , Bn,
P{X1∈B1, . . . , Xn ∈Bn}=P{X1∈B1}P{X2∈B2} · · ·P{Xn∈Bn}.
• Theσ-algebrasF(X1), . . . ,F(Xn) are independent.
• µ=µX1×µX2× · · · ×µXn. • For all realt1, . . . , tn,
F(t1, . . . , tn) =FX1(t1)FX2(t2)· · ·FXn(tn).
The random variables X1, . . . , Xn have ajoint density(joint probability density function)f(x1, . . . , xn)
if for all Borel setsB∈Rn,
P{(X1, . . . , Xn)∈B}=
Z
B
f(x1, . . . , xn)dx1dx2 · · · dxn.
Random variables with a joint density are independent if and only if •
f(x1, . . . , xn) =fX1(x1)· · ·fXn(xn), where fXj(xj) denotes the density ofXj.
An infinite collection of random variables is called independent if each finite subcollection is independent. The following follows immediately from the definition and Exercise 3.2.
Proposition 3.4. If X1, . . . , Xn are independent random variables and g1, . . . , gn are Borel measurable
functions, then g1(X1), . . . , gn(Xn)are independent.
Proposition 3.5. If X, Y are independent random variables withE|X|<∞,E|Y|<∞, thenE[|XY|]<∞ and
E[XY] =E[X]E[Y]. (5)
Proof. First consider simple random variables
S= m X i=1 ai1Ai, T = n X j=1 bj1Bj,
whereSisF(X) measurable andT isF(Y) measurable. Then eachAiis independent of eachBjand (5) can
be derived immediately usingP(Ai∩Bj) =P(Ai)P(Bj). Suppose X, Y are nonnegative random variables.
Then there exists simple random variables Sn, measurable with respect to F(X), that increase to X and
simple random variablesTn, measurable with respect to F(Y), increasing to Y. Note thatSnTn increases
toXY. Hence by the monotone convergence theorem, E[XY] = lim
n→∞E[SnTn] = limn→∞E[Sn]E[Tn] =E[X]E[Y].
Finally for generalX, Y, write X =X+−X−, Y+−Y−. Note that X+, X− are independent of Y+, Y−
The multiplication rule for expectation can also be viewed as an application of Fubini’s Theorem. Let
µ, ν denote the distributions ofX andY. If they are independent, then the distribution of (X, Y) isµ×ν. Hence E[XY] = Z R2 xy d(µ×ν) = Z Z xy dµ(x) dν(y) = Z E[X]y dν(y) =E[X]E[Y].
Random variablesX, Y that satisfy E[XY] =E[X]E[Y] are called orthogonal. The last proposition shows that independent integrable random variables are orthogonal.
Exercise 3.6. Give an example of orthogonal random variables that are not independent.
Proposition 3.7. SupposeX1, . . . , Xnare random variables with finite variance. IfX1, . . . , Xn are pairwise
orthogonal, then
Var[X1+· · ·+Xn] = Var[X1] +· · ·+ Var[Xn].
Proof. Without loss of generality we may assume that all the random variables have zero mean (otherwise, we subtract the mean without changing any of the variances). Then
Var n X j=1 Xj = E ( n X j=1 Xj)2 = n X j=1 E[Xj2] + X i6=j E[XiXj] = n X j=1 Var(Xj) + X i6=j E[XiXj]. But by orthogonality, ifi6=j,E[XiXj] =E[Xi]E[Xj] = 0
This proposition is most often used when the random variables are independent. It is not true that Var(X+Y) = Var(X) + Var(Y) for all random variables. Two extreme cases are Var(X+X) = Var(2X) = 4Var(X) and Var(X−X) = 0.
The last proposition is really just a generalization of the Pythagorean Theorem. A natural framework for discussing random variables with zero mean and finite variance is the Hilbert spaceL2 with the inner
producthX, Yi=E[XY].In this case,X, Y are orthogonal iffhX, Yi= 0.
Exercise 3.8. Let (Ω,F,P) be the unit interval with Lebesgue measure on the Borel subsets. Follow this outline to show that one can find independent random variables X1, X2, X3, . . . defined on (Ω,F,P), each normal mean zero, variance 1. Let use writex∈[0,1]in its dyadic expansion
x=.ω1ω2ω3· · · , ωj ∈ {0,1}.
as in Section 1.
(i) Show that the random variable X(x) = x is a uniform random variable on [0,1], i.e., has density
f(x) = 1[0,1].
(ii)Use a diagonalization process to defineU1, U2, . . ., independent random variables each with a uniform distribution on[0,1].
(iii) Show that Xj = Φ−1(Uj) has a normal distribution with mean zero, variance one. Here Φ is the
4
Sums of independent random variables
Definition• A sequence of random variablesYn converges toY in probabilityif for every >0,
lim
n→∞P{|Yn−Y|> }= 0.
• A sequence of random variablesYn converges toY almost surely (a.s.) orwith probability one (w.p.1)
if there is an event EwithP(E) = 1 and such that forω∈E,Yn(ω)→Y(ω).
In other words, “in probability” is the same as “in measure” and “almost surely” is the analogue of “almost everywhere”. The term almost surely is not very accurate (something that happens with probability one happenssurely!) so many people prefer to say “with probability one”.
Exercise 4.1. SupposeY1, Y2, . . .is a sequence of random variables withE[Yn]→µandVar[Yn]→0. Show
that Yn →µin probability. Give an example to show that it is not necessarily true that Yn→µw.p.1.
Much of the focus of classical probability is on sums of independent random variables. Let (Ω,A,P) be a probability space and assume that
X1, X2, . . .
are independent random variables on this probability space. We say that the random variables areidentically distributed if they all have the same distribution, and we write i.i.d. for “independent and identically distributed.”
If X1, X2, . . .are independent (or, more generally, orthogonal) random variables each with mean µand
varianceσ2, then E X 1+· · ·+Xn n = 1 n[E(X1) +· · ·+E(Xn)] =µ, Var X 1+· · ·+Xn n = 1 n2[Var(X1) +· · ·+ Var(Xn)] = σ2 n .
The equality for the expectations does not require the random variables to be orthogonal, but the expression for the variance requires it. The mean of the weighted average is the same as the mean of one of the random variables, but the standard deviation of the weighted average isσ/√nwhich tends to zero.
Theorem 4.2 (Weak Law of Large Numbers). If X1, X2, . . . are independent random variables such that
E[Xn] =µandVar[Xn]≤σ2 for each n, then
X1+· · ·+Xn n −→µ in probability. Proof. We have E X1+· · ·+X n n =µ, Var X1+· · ·+X n n ≤ σ 2 n →0.
TheStrong Law of Large Numbers (SLLN)is a statement about almost sure convergence of the weighted sums. We will prove a version of the strong law here. We start with an easy, but very useful lemma. Recall that the limsup of events is defined by
lim supAn= ∞ \ n=1 ∞ [ m=n Am.
Probabilists often write{An i.o.} for lim supAn; here “i.o.” stands for “infinitely often”.
Lemma 4.3 (Borel-Cantelli Lemma). SupposeA1, A2, . . .is a sequence of events.
(i) IfP
P(An)<∞, then
P{Ani.o.}=P(lim supAn) = 0.
(ii) If P
P(An) =∞, and the A1, A2, . . .are independent, then P{Ani.o.}=P(lim supAn) = 1.
The conclusion in (ii) is not necessarily true for dependent events. For example, ifA is an event of probability 1/2, andA1 =A2 =· · ·=A, then lim supAn=Awhich has probability 1/2.
Proof. (i) IfP
P(An)<∞,
P(lim supAn) = lim n→∞P ∞ [ m=n An ! ≤ lim n→∞ ∞ X m=n P(Am) = 0. (ii) AssumeP
P(An) =∞andA1, A2, . . .are independent. We will show that
P[(lim supAn)c] =P "∞ [ n=1 ∞ \ m=n Acm # = 0.
To show this it suffices to show that for eachn
P ∞ \ m=n Acm ! = 0,
and for this we need to show for eachn,
lim M→∞P M \ m=n Acm ! = 0. But by independence, P M \ m=n Acm ! = M Y m=n [1−P(Am)]≤exp ( − M X n=m P(Am) ) →0.
Theorem 4.4 (Strong Law of Large Numbers). Let X1, X2, . . .be independent random variables each with mean µ. Suppose there exists an M <∞ such thatE[Xn4]≤M for each n (this holds, for example, when
the sequence is i.i.d. andE[X14]<∞). Then w.p.1,
X1+X2+· · ·+Xn
n −→µ.
Proof. Without loss of generality we may assume that µ = 0 for otherwise we can consider the random variablesX1−µ, X2−µ, . . .. Our first goal is to show that
E[(X1+X2+· · ·+Xn)4]≤3M n2. (6)
To see this, we first note that ifi6∈ {j, k, l},
E[XiXjXkXl] =E[Xi]E[XjXkXl] = 0. (7)
When we expand (X1+X2+· · ·+Xn)4and then use the sum rule for expectation we getn4terms. We get nterms of the formE[Xi4] and 3n(n−1) terms of the formE[X
2 iX
2
j] withi6=j. There are also terms of the
formE[XiXj3],E[XiXjXk2] andE[XiXjXkXl] for distincti, j, k, l; all of these expectations are zero by (7) so
we will not bother to count the exact number. We know thatE[Xi4]≤M. Also (see Exercise 2.2), ifi6=j, E[Xi2X 2 j]≤E[X 2 i]E[X 2 j]≤ E[X 4 i]E[X 4 j] 1/2 ≤M.
Since the sum consists of at most 3n2terms each bounded byM, we get (6).
LetAn be the event
An = X1+· · ·+Xn n ≥ 1 n1/8 ={|X1+X2+· · ·+Xn| ≥n7/8}.
By the generalized Chebyshev inequality (Exercise 2.5),
P(An)≤E [(X1+· · ·+Xn)4] (n7/8)4 ≤ 3M n3/2. In particular,P
P(An)<∞. Hence by the Borel-Cantelli Lemma, P(lim supAn) = 0.
Suppose for someω, (X1(ω) +· · ·+Xn(ω))/ndoes not converge to zero. Then there is an >0 such that
|X1(ω) +X2(ω) +· · ·+Xn(ω)| ≥n,
for infinitely many values of n. This implies thatω ∈ lim supAn. Hence the set of ω at which we do not
have convergence has probability zero.
The strong law of large numbers holds in greater generality than under the conditions given here. In fact, ifX1, X2, . . .is any i.i.d. sequence of random variables with E(X1) =µ, then the strong law holds
X1+· · ·+Xn
n →µ, w.p.1.
We will not prove this here. However, as the next exercise shows, the strong law does not hold for all independent sequences of random variables with the same mean.
Exercise 4.5. Let X1, X2, . . .be independent random variables. Suppose that
P{Xn= 2n}= 1−P{Xn = 0}= 2−n.
Show that thatE[Xn] = 1for eachnand that w.p.1,
X1+X2+· · ·+Xn
n −→0.
Verify that the hypotheses of Theorem 4.4 do not hold in this case. Exercise 4.6.
a. LetX be a random variable. Show that E[|X|]<∞if and only if
∞
X
n=1
P{|X| ≥n}<∞. b. Let X1, X2, . . . be i.i.d. random variables. Show that
P{|Xn| ≥n i.o.}= 0
if and only if E[|X1|]<∞.
We now establish a result known as the Kolmogorov Zero-One Law. AssumeX1, X2, . . .are independent.
LetFnbe theσ-algebra generated byX1, . . . , Xn, and letGnbe theσ-algebra generated byXn+1, Xn+2, . . ..
Note thatFn andGn are independentσ-algebras, and
F1⊂ F2⊂ · · · , G1⊃ G2⊃ · · ·.
LetF be theσ-algebra generated byX1, X2, . . ., i.e., the smallestσ-algebra containing the algebra
F0:= ∞
[
n=1
Fn.
Thetailσ-algebraT is defined by
T =
∞
\
n=1
Gn.
(The intersection ofσ-algebras is aσ-algebra, soT is a σ-algebra.) An example of an event inT is lim m→∞ X1+· · ·+Xm m = 0 .
It is easy to see that this event is inGn for eachnsince
lim m→∞ X1+· · ·+Xm m = 0 = lim m→∞ Xn+1+· · ·+Xm m = 0 .
We write 4 for symmetric difference,A4B = (A\B)∪(B\A).The next lemma says roughly that F0 is
“dense” inF.
Lemma 4.7. SupposeF0 is an algebra of events andF =σ(F0). If A∈ F, then for every >0 there is
anA0∈ F0 with
Proof. Let B be the collection of all events A such that for every > 0 there is an A0 ∈ F0 such that
P(A4A0) < . Trivially, F0 ⊂ B. We will show that B is a σ-algebra and this will imply that F ⊂ B. Obviously, Ω∈ B.
SupposeA∈ B and let >0. FindA0∈ F0such that
P(A4A0)< .ThenAc0∈ F0 and
P(Ac4Ac0) =P(A4A0)< .
HenceAc∈ B.
SupposeA1, A2, . . .∈ Band letA=∪Aj. Let >0, and findnsuch that
P n [ j=1 Aj ≥P[A]− 2.
Forj= 1, . . . , n, letAj,0∈ F0 such that
P[Aj4Aj,0]≤2−j−1.
LetA0=A1,0∪ · · · ∪An,0 and note that
A4A0⊂ n [ j=1 Aj4Aj,0 ∪ A\ n [ j=1 Aj ,
and henceP(A4A0)< andA∈ B.
Theorem 4.8 (Kolmogorov Zero-One Law). If A∈ T thenP(A) = 0 orP(A) = 1.
Proof. LetA∈ T and let >0. Find a setA0∈ F0 withP(A4A0)< . ThenA0∈ Fn for somen. Since
T ⊂ Gn andFn is independent ofGn,A andA0 are independent. ThereforeP(A∩A0) =P(A)P(A0). Note that P(A∩A0) ≥ P(A)−P(A4A0)≥ P(A)−. Letting → 0, we get P(A) = P(A)P(A). This implies P(A) = 0 orP(A) = 1.
As an example, supposeX1, X2, . . .are independent random variables. Then the probability of the event ω: lim n→∞ X1(ω) +· · ·+Xn(ω) n = 0 is zero or one.
5
Central Limit Theorem
In this section, we will use some knowledge of the Fourier transform that we review here. Ifg:R→R, the Fourier transformofg is defined by
ˆ
g(y) = Z ∞
−∞
e−ixyg(x)dx.
We will callgaSchwartzfunction it isC∞and all of its derivatives decay at±∞faster than every polynomial. If g is Schwartz, then ˆg is Schwartz. (Roughly speaking, the Fourier transform sends smooth functions to bounded functions and bounded functions to smooth functions. Very smooth and very bounded functions are sent to very smooth and very bounded functions.) The inversion formula
g(x) = 1 2π
Z ∞
−∞
holds for Schwartz functions. A straightforward computation shows that \ e−x2/2 =√2πe−y2/2. We also need Z ∞ −∞ f(x) ˆg(x)dx = Z ∞ −∞ f(x) Z ∞ −∞ g(y)e−ixydy = Z ∞ −∞ g(y) Z ∞ −∞ f(x)e−ixydx dy = Z ∞ −∞ ˆ f(y)g(y)dy.
This is valid provided that the interchange of integrals can be justified; in particular, it holds for Schwartz
f, g. The characteristic function is a version of the Fourier transform.
Definition Thecharacteristic functionof a a random variableX is the function φ=φX:R−→Cdefined by
φ(t) =E[eiXt]. Properties of Characteristic Functions
• φ(0) = 1 and for allt
|φ(t)|=E[eiXt] ≤E |eiXt| = 1.
• φis a continuous function of t. To see this note that the collection of random variablesYs=eisX are
dominated by the random variable 1 which has finite expectation. Hence by the dominated convergence theorem, lim s→tφ(s) = lims→tE[e isX] = E h lim s→te isXi=φ(t).
In fact, it is uniformly continuous (Exercise 5.1). • IfY =aX+bwhere a, bare constants, then
φY(t) =E[ei(aX+b)t] =eibtE[eiX(at)] =eibtφX(at).
• The functionMX(t) =etX is often called themoment generating functionofX. Unlike the
character-istic function, the moment generating function does not always exist for all values of t. When it does exist, however, we can use the formal relationφ(t) =M(it).
• If a random variable X has a densityf then
φX(t) =
Z ∞
−∞
eitxf(x)dx= ˆf(−t).
• If two random variables have the same characteristic function, then they have the same distribution. This fact is not obvious — this will follow from Proposition 5.6 which gives an inversion formula for the characteristic function similar to (8).
• IfX1, . . . , Xn are independent random variables, then φX1+···+Xn(t) = E[e
it(X1+···+Xn)]
= E[eitX1]E[eitX2]· · ·E[e−tXn] = φX1(t)φX2(t)· · ·φXn(t).
• If X is a normal random variable, mean zero, variance 1, then by completing the square in the expo-nential one can compute
φ(t) = Z ∞ −∞ eixt√1 2πe −x2/2 dx=e−t2/2.
IfY is normal meanµ, varianceσ2, then Y has the same distribution asσX+µ, hence φY(t) =φσX+µ(t) =eiµtφX(σt) =eiµte−σ
2t2/2
.
Exercise 5.1. Show that φis a uniformly continuous function of t.
Proposition 5.2. Suppose X is a random variable with characteristic function φ. Suppose E[|X|] <∞. Thenφis continuously differentiable and
φ0(0) =iE(X).
Proof. Note that for realθ
|eiθ−1| ≤ |θ|.
This can be checked geometrically or by noting that
|eiθ−1|= Z θ 0 ieisds ≤ Z θ 0 |ieis|ds=θ.
We write the difference quotient
φ(t+δ)−φ(t) δ = Z ∞ −∞ ei(t+δ)x−eitx δ dµ(x). Note that ei(t+δ)x−eitx δ ≤ |δx| δ ≤ |x|.
Saying E[|X|] < ∞ is equivalent to saying that |x| is integrable with respect to the measure µ. By the dominated convergence theorem,
lim δ→0 Z ∞ −∞ ei(t+δ)x−eitx δ dµ(x) = Z ∞ −∞ lim δ→0 ei(t+δ)x−eitx δ dµ(x). = Z ∞ −∞ ixeitxdµ(x). Hence, φ0(t) = Z ∞ −∞ ixeitxdµ(x), and, in particular, φ0(0) =iE[X].
The following proposition can be proved in the same way.
Proposition 5.3. SupposeX is a random variable with characteristic functionφ. SupposeE[|X|k]<∞for
some positive integerk. Thenφhas k continuous derivatives and
Formally, (9) can be derived by writing E[eiXt] =E ∞ X j=0 (iX)jtj j! = ∞ X j=0 ijE(Xj) j! t j,
differentiating both sides, and evaluating at t = 0. Proposition 5.3 tells us that the existence of E[|X|k] allows one to justify this rigorously up to thekth derivative.
Proposition 5.4. LetX1, X2, . . .be independent, identically distributed random variables with meanµand variance σ2 and characteristic function φ. Let
Zn =
X1+· · ·+Xn−nµ
√
σ2n ,
and letφn be the characteristic function of Zn. Then for eacht,
lim
n→∞φn(t) =e −t2/2.
Recall that e−t2/2is the characteristic function of a normal mean zero, variance 1 random variable. Proof. Without loss of generality we will assumeµ= 0, σ2= 1 for otherwise we can considerY
j= (Xj−µ)/σ. By Proposition 5.3, we have φ(t) = 1 +φ0(0)t+1 2φ 00(0)t2+ tt2= 1− t2 2 +tt 2,
wheret→0 ast→0. Note that
φn(t) = φ t √ n n ,
and hence for fixedt,
lim n→∞φn(t) = limn→∞ φ t √ n n = lim n→∞ 1− t 2 2n+ δn n n ,
where δn →0 as n → ∞. Oncen is sufficiently large that |δn−(t2/2)| < n we can take logarithms and
expand in a Taylor series,
log 1− t 2 2n+ δn n =−t 2 2n+ ρn n, n→ ∞
where ρn →0. (Sinceφcan take on complex values, we need to take a complex logarithm with log 1 = 0,
but there is no problem provided that|δn−(t2/2)|< n.) Therefore
lim
n→∞logφn(t) =− t2
2.
Theorem 5.5 (Central Limit Theorem). Let X1, X2, . . . be independent, identically distributed random
variables with meanµ and finite variance. If−∞< a < b <∞, then
lim n→∞P a≤ X1+X2+· · ·+Xn−nµ σ√n ≤b = Z b a 1 √ 2πe −x2/2 dx.
Proof. Letφn be the characteristic function ofn−1/2σ−1(X1+· · ·+Xn−nµ).We have already shown that
for each t, φn(t) → e−t 2/2
. It suffices to show that if µn is any sequence of distributions such that their
characteristic functionsφn converge pointwise to e−t 2/2
then for every a < b, lim n→∞µn[a, b] = Φ(b)−Φ(a), where Φ(x) = Z x −∞ 1 √ 2πe −y2/2dy.
Fixa, band let >0. We can find aC∞functiong=ga,b,such that
0≤g(x)≤1, −∞< x <∞, g(x) = 1, a≤x≤b, g(x) = 0, x /∈[a−, b+].
We will show that
lim n→∞ Z ∞ −∞ g(x)dµn(x) = Z ∞ −∞ g(x)√1 2πe −x2/2 dx. (10)
Since for any distributionµ,
µ[a, b]≤ Z ∞
−∞
g(x)dµ(x)≤µ[a−, b+],
we then get the theorem by letting→0.
Sinceg is a C∞ function with compact support, it is a Schwartz function. Let
ˆ
g(y) = Z ∞
−∞
e−ixyg(x)dx,
be the Fourier transform. Then, using the inversion formula (8), Z ∞ −∞ g(x)dµn(x) = Z ∞ −∞ 1 2π Z ∞ −∞ eixygˆ(y)dy dµn(x) = 1 2π Z ∞ −∞ Z ∞ −∞ eixydµn(x) ˆ g(y)dy = 1 2π Z ∞ −∞ φn(y)ˆg(y)dy.
Since|φngˆ| ≤ |ˆg|and ˆg isL1, we can use the dominated convergence theorem to conclude
lim n→∞ Z ∞ −∞ g(x)dµn(x) = 1 2π Z ∞ −∞ e−y2/2ˆg(y)dy. Recalling thate\−x2/2 =√2πe−y2/2 , and that Z ˆ f g= Z fg,ˆ we get (10).
The next proposition establishes the uniqueness of characteristic functions by giving an inversion formula. Note that in order to specify a distributionµ, it suffices to giveµ[a, b] at alla, bwithµ{a, b}= 0.
Proposition 5.6. LetXbe a random variable with distributionµ, distribution functionF, and characteristic function φ. Then for every a < bsuch that F is continuous ataandb,
µ[a, b] =F(b)−F(a) = lim T→∞ 1 2π Z T −T e−iya−e−iyb iy φ(y)dy. (11)
Note that it is tempting to write the right hand side of (11) as 1 2π Z ∞ −∞ e−iya−e−iyb iy φ(y)dy, (12)
but this integrand is not necessarily integrable on (−∞,∞). We can write (12) if we interpret this as the improper Riemann integral and not as the Lebesgue integral on (−∞,∞).
Proof. We first fixT <∞. Note that e−iya−e−iyb iy |φ(y)| ≤(b−a), (13)
and hence the integral on the right-hand side of (11) is well defined. The bound (13) allows us to use Fubini’s Theorem: 1 2π Z T −T e−iya−e−iyb iy φ(y)dy = 1 2π Z T −T e−iya−e−iyb iy Z ∞ −∞ eiyxdµ(x) dy = 1 2π Z ∞ −∞ " Z T −T eiy(x−a)−eiy(x−b) iy dy # dµ(x)
It is easy to check that for any realc, Z T −T eicx ix dx= 2 Z |c|T 0 sinx x dx.
We will need the fact that
lim T→∞ Z T 0 sinx x dx= π 2. There are two relevant cases.
• Ifx−aandx−bhave the same sign (i.e., x < aorx > b),
lim T→∞ Z T −T eiy(x−a)−eiy(x−b) iy dy= 0. • Ifx−a >0 andx−b <0, lim T→∞ Z T −T eiy(x−a)−eiy(x−b) iy dy= 4 limT→∞ Z T 0 sinx x dx= 2π.
We do not need to worry about what happens atx=aorx=b sinceµgives zero measure to these points. Since the function
gT(a, b, x) =
Z T
−T
eiy(x−a)−eiy(x−b)
iy dy
is uniformly bounded inT, a, b, x, we can use dominated convergence theorem to conclude that
lim T→∞ Z ∞ −∞ " Z T −T eiy(x−a)−eiy(x−b) iy dy # dµ(x) = Z 2π1(a,b)dµ(x) = 2πµ[a, b]. Therefore, lim T→∞ 1 2π Z T −T e−iya−e−iyb iy φ(y)dy=µ[a, b].
Exercise 5.7. A random variable X has a Poisson distribution with parameter λ >0 if X takes on only nonnegative integer values and
P{X=k}=e−λ
λk
k!, k= 0,1,2, . . . .
In this problem, suppose that X, Y are independent Poisson random variables on a probability space with parametersλX, λY, respectively.
(i) FindE[X],Var[X],E[Y],Var[Y].
(ii)Show thatX+Y has a Poisson distribution with parameter λX+λY by computing directly
P{X+Y =k}.
(iii)Find the characteristic function forX, Y and use this to find an alternative deriviation of(ii). (iv)Suppose that for each integern > λwe have independent random variablesZ1,n, Z2,n, . . . , Zn,n with
P{Zj,n= 1}= 1−P{Zj,n= 0}= λ n. Find lim n→∞P{Z1,n+Z2,n+· · ·+Zn,n=k}.
Exercise 5.8. A random variableXisinfinitely divisibleif for each positive integern, we can find i.i.d. ran-dom variablesX1, . . . , Xn such that X1+· · ·+Xn have the same distribution asX.
(i) Show that Poisson random variables and normal random variables are infinitely divisible. (ii)Suppose thatX has a uniform distribution on[0,1]. Show thatX is not infinitely divisible. (iii)Find all infinitely divisible distributions that take on only a finite number of values.
6
Conditional Expectation
6.1
Definition and properties
SupposeX is a random variable withE[|X|]<∞. We can think of E[X] as the best guess for the random variable given no information about the outcome of the random experiment which produces the numberX. Conditional expectation deals with the case where one has some, but not all, information about an event.
Consider the example of flipping coins, i.e., the probability space Ω ={(ω1, ω2, . . .) :ωj = 0 or 1}.
LetFnbe theσ-algebra of all events that depend on the firstnflips of the coins. Intutitively we think of the σ-algebra Fn as the “information” contained in viewing the first nflips. LetXn(ω) =ωn be the indicator
function of the event “thenth flip is heads”. Let
Sn=X1+· · ·+Xn
be the total number of heads in the firstn flips. ClearlyE[Sn] =n/2. Suppose we are interested in S3 but
we get to see the flip of the first coin. Then our “best guess” for S3 depends on the value of the first flip.
Using the notation of elementary probability courses we can write
E[S3|X1= 1] = 2, E[S3|X1= 0] = 1.
More formally we can write E[S3 | F1] for the random variable that is measurable with respect to the σ-algebraF1 that takes on the value 2 on the event{X1= 1}and 1 on the event {X1= 0}.
Let (Ω,F,P) be a probability space and G ⊂ F another σ-algebra. Let X be an integrable random variable.
Let us first consider the case when G is a finiteσ-algebra. In this case there exist disjoint, nonempty eventsA1, A2, . . . , Akthat generateGin the sense thatGconsists precisely of those sets obtained from unions
of the eventsA1, . . . , Ak. IfP(Aj)>0, we defineE[X| G] onAj to be the average ofX onAj,
E[X| G] = R AjX dP P(Aj) =E[X1Aj] P(Aj) , onAj. (14)
IfP(Aj) = 0, this definition does not make sense, and we just letE[X | G] take on an arbitrary value, say
cj, onAj. Note that E[X | G] is a G-measurable random variable and that its definition is unique except perhaps on a set of total probability zero.
Proposition 6.1. If G is a finite σ-algebra andE[X| G] is defined as in (14), then for every eventA∈ G, Z A E[X | G]dP= Z A X dP.
Proof. LetG be generated byA1, . . . , Ak and assume thatA=A1∪ · · · ∪Al. Then
Z A E[X | G]dP = l X j=1 Z Aj E[X | G]dP = l X j=1 P(Aj) R AjX dP P(Aj) = l X j=1 Z Aj X dP = Z A X dP.
This proposition gives us a clue as how to defineE[X | G] for infiniteG.
Proposition 6.2. SupposeX is an integrable random variable on(Ω,F,P), andG ⊂ F is aσ-algebra. Then there exists aG-measurable random variableY such that ifA∈ G,
Z A Y dP= Z A X dP. (15)
Proof. The uniqueness is immediate since ifY,Y˜ areG-measurable random variables satisfying (15), then Z
A
(Y −Y˜)dP= 0.
for allA∈ G and henceY −Y˜ = 0 almost surely.
To show existence consider the probability space (Ω,G,P). Consider the (signed) measure on (Ω,G),
ν(A) = Z
A X dP.
Note thatν << P. Hence by the Radon-Nikodym theorem there is aG-measurable random variableY with
ν(A) = Z
A Y dP.
Using the proposition, we can define the conditional expectationE[X | G] to be the random variable Y in the proposition. It is characterized by the properties
• E[X | G] isG-measurable. • For every A∈ G, Z A E[X | G]dP = Z A X dP. (16)
The random variableE[X| G] is only defined up to a set of probability zero.
Exercise 6.3. This exercise gives an alternative proof of the existence ofE[X | G]using Hilbert spaces (but not using Radon-Nikodym theorem). Let H denote the real Hilbert space of all square-integrable random variables on(Ω,F,P).
1. Show that the space ofG-measurable, square-integrable random varaibles is a closed subspace ofH. 2. For square-integrable X, define E(X | G) to be the Hilbert space projection onto the space of
G-measurable random variables. Show that E(X | G)satisfies (16).
3. Show thatE(X | G) exists for integrable X. (Hint: if X ≥0, we can approximate X from below by simple random variablesXn. Define
E(X| G) = lim
n→∞E(Xn| G).
For general X, writeX =X+−X−.)
Properties of Conditional Expectation • E[E[X| G]] =E[X].
Proof. This is a special case of (16) whereA= Ω.
• If a, b ∈ R, E[a X +b Y | G] = aE[X | G] +bE[Y | G]. (This equality and others below are really equalities up an event of probability zero.)
Proof. Note thataE[X| G] +bE[Y | G] isG-measurable. Also, ifA∈ G, Z A (aE[X | G] +bE[Y | G])dP = a Z A E[X | G]dP+b Z A E[Y | G]dPP = a Z A X dP+b Z A Y dP = Z A (aX+bY)dP.
The result follows by uniqueness of the conditional expectation. • IfX isG-measurable, thenE[X | G] =X.
• If G is independent of X, then E[X | G] =E[X]. (G is independent of X if G is independent of the
σ-algebra generated by X. Equivalently, they are independent if for everyA∈ G, 1A is independent
ofX.)
Proof. The constant random variableE[X] is certainlyG-measurable. Also, ifA∈ G, Z A E[X]dP=P(A)E[X] =E[1A]E[X] =E[1AX] = Z A X dP. • IfH ⊂ G ⊂ F areσ-algebras, then
E[X | H] =E[E[X| G]| H]. (17) Proof. Clearly,E[E[X| G]| H] isH-measurable. IfA∈ H, thenA∈ G, and hence
Z A E[E[X | G]| H]dP= Z A E[X | G]dP= Z A X dP. • IfY isG-measurable, then E[XY | G] =Y E[X | G]. (18)
Proof. ClearlyY E[X | G] isG-measurable. We need to show that for everyA∈ G, Z A YE[X | G]dP= Z A Y X dP. (19)
We will first consider the case where X, Y ≥ 0. Note that this implies that E[X | G] ≥ 0 a.s., since R
AE[X | G] dP =
R
AX dP ≥ 0 for every A ∈ G. Find simple G-measurable random variables
0 ≤Y1 ≤Y2 ≤ · · · such that Yn →Y. Then YnX converges monotonically toY X and YnE[X | G] converges monotonically toY E[X | G]. IfA∈ G and
Z=
n
X
j=1
is a G-measurable simple random variable, Z A ZE[X | G]dP = n X j=1 cj Z A 1BjE[X | G]dP = n X j=1 cj Z A∩Bj E[X | G]dP = n X j=1 cj Z A∩Bj X dP = Z A [ n X j=1 cj1Bj]X dP = Z A ZX dP.
Hence (19) holds for simple nonnegativeY and nonnegativeX, and hence by the monotone convergence theorem it holds for all nonnegative Y and nonnegative X. For general Y, X, we write Y = Y+−
Y−, X =X+−X− and use linearity of expectation and conditional expectation.
IfX, Y are random variables andX is integrable we writeE[X |Y] forE[X |σ(Y)], Hereσ(Y) denotes theσ-algebra generated byY. Intuitively, E[X |Y] is the best guess ofX given the value of Y. Note that E[X | Y] can be written as φ(Y) for some function φ, i.e., for each possible value of Y there is a value of E[X |Y]. Elementary texts often write this asE[X |Y =y] to indicate that for each value of y there is a value ofE[X|Y]. Similarly, we can defineE[X |Y1, . . . Yn].
Example Consider the coin-tossing random variables discussed earlier in the section. What isE[X1|Sn]?
Symmetry tells us that
E[X1|Sn] =E[X2|Sn] =· · ·=E[Xn|Sn].
Also, linearity gives
E[X1|Sn] +E[X2|Sn] +· · ·+E[Xn|Sn] = E[X1+· · ·+Xn |Sn] =E[Sn|Sn] =Sn. Hence, E[Xn|Sn] = Sn n .
Note that E[X1 | Sn] is not the same thing at E[X1 | X1, . . . , Xn] = E[X1 | Fn] which would be just X1.
The first conditioning is given only the number of heads on the firstnflips while the second conditioning is given all the outcomes of thenflips.
Example Suppose X, Y are two random variables with joint density function f(x, y). Let us take our probability space to beR2, letF be the Borel sets, and let
Pbe the measure P(A) =
Z
A
f(x, y)dx dy.
Then the random variables X(x, y) =xand Y(x, y) =y have the joint density f(x, y). Then E[Y |X] is a random variable with density
f(y|X) = R∞f(X, y) f(X, z)dz.
This formula is not valid when the denominator is zero. However, the set ofxsuch that Z ∞
−∞
f(x, z)dz= 0,
is a null set under the measurePand hence the densityf(y|X) is well defined almost surely. This density is often called the conditional density ofY givenX.
6.2
Martingales
A martingale is a model of a fair game. Suppose we have a probability space (Ω,F,P) and an increasing sequence ofσ-algebras
F0⊂ F1⊂ · · ·
Such a sequence is called afiltration, and we think ofFn as representing the amount of knowledge we have
at timen. The inclusion of theσ-algebras imply that we do not lose any knowledge that we have gained. Definition A martingale (with respect to the filtration {Fn})is a sequence of integrable random variables Mn such thatMn isFn-measurable for eachnand for alln < m,
E[Mm| Fn] =Mn. (20)
We note that to prove (20), it suffices to show that for alln,
E[Mn+1| Fn] =Mn.
This follows from successive applications of (17). If the filtration is not specified explicitly, then one assumes that one uses the filtration generated byMn, i.e.,Fn is theσ-algebra generated byM1, . . . , Mn.
Examples
We list some examples. We leave it as exercises to show that they are martingales.
• If X1, X2, . . . are independent, mean-zero random variables and Sn = X1+· · ·+Xn, then Sn is a
martingale with respect to the filtration generated by X1, X2, . . . .
• IfX1, X2, . . .are as above,E[Xj2] =σ 2
j <∞for eachj, then
Mn=Sn2− n
X
j=1 σj2
is a martingale with respect to the filtration generated byX1, X2, . . . .
• Suppose X1, X2, . . . are as in the first example andFn is the filtration generated byX1, X2, . . . .We
choose F0 to be the trivial σ-algebra. For each n, let Bn be a bounded random variable that is
measurable with respect toFn−1. We think ofBn as being the “bet” on the gameXn; we can see the
results of X1, . . . , Xn−1 before choosing a bet but one cannot see Xn. The total fortune by time nis
given byW0= 0 and Wn= n X j=1 BjXj. ThenWn is a martingale.
• SupposeX1, X2, . . .are independent coin-flipping random variables, i.e., with distribution
P{Xj = 1}=P{Xj =−1}=
1 2.
A particular case of the last example is whereB0= 1 andBn= 2n ifX1=X2=· · ·=X−1=−1 and Bn= 0 otherwise. In other words, one keeps doubling one’s bet until one wins. Note that
P{Wn = 1}= 1−2−n, P{Wn= 1−2n}= 2−n. (21)
This betting strategy is sometimes called the martingale betting strategy.
We want to prove a version of theoptional stopping theoremwhich is a mathematical way of saying that one cannot beat a fair game. The last example above shows that we need to be careful — if I play a fair game and keep doubling my bet, I will eventually win and come out ahead (provided that I always have enough money to put down the bet when I lose!)
Definition Astopping timewith respect to a filtration{Fn}is a random variable taking values in{0,1,2. . .}∪
{∞}such that for eachn, the event
{T≤n} ∈ Fn.
In other words, we only need the information at time n to know whether we have stopped at time n. Examples of stopping times are:
• Constant random variables are stopping times.
• IfMn is a martingale with respect to{Fn}and V ⊂Ris a Borel set, then
T = min{j:Mj ∈V}
is a stopping time.
• IfT1, T2 are stopping times, then so areT1∧T2andT1∨T2.
• In particular, ifT is a stopping time then so areTn=T∧n. Note thatT0≤T1≤. . .andT = limTn.
Proposition 6.4. If Mn is a martingale and T is a stopping time with respect to {Fn}, then Yn =MT∧n
is a martingale with respect to {Fn}.In particular, E[Yn] =E[Y0] =E[M0]. Proof. Note that
Yn=Mn1{T ≥n}+ n−1
X
j=0
Mj1{T =j}.
From this it is easy to see thatYnisFn-measurable andE[|Yn|]<∞. Also note that the event 1{T ≥n+ 1}
is the complement of the event 1{T ≤n}and hence isFn-measurable. Therefore, (18) implies E(Mn+11{T ≥n+ 1} | Fn) = 1{T ≥n+ 1}E(Mn+1| Fn) = 1{T ≥n+ 1}Mn.
Therefore, using linearity,
E(Yn+1| Fn) = 1{T ≥n+ 1}Mn+ n X j=0 Mj1{T =j} = 1{T ≥n}Mn+ n−1 X j=0 Mj1{T =j}=Yn. Examples
• LetX1, X2, . . .be independent, each withP{Xj = 1}=P{Xj =−1}= 1/2.LetMn= 1+X1+· · ·+Xn, N an integer greater than 1, and
T = min{j:Mn= 0 orMn =N}.
T is the first time that a “random walker” starting at 1 reaches 0 or N. Then Mn∧T is a martingale.
In particular,
1 =E[M0] =E[Mn∧T].
It is easy to check that w.p.1 T < ∞. Since Mn∧T is a sequence of bounded random variables
approaching MT, the dominated convergence theorem implies E[MT] = 1.
Note that
E[MT] =NP{MT =N},
and hence we have computed a probability,
P{MT =N}=
1
N.
• LetX1, X2, . . .be as above and let
Sn =X1+· · ·+Xn.
We have noted thatMn=Sn2−nis a martingale. LetN be a positive integer and T = min{n:|Sn|=N}.
ThenE(Mn∧T) is a martingale and hence
0 =E[M0] =E[Mn∧T] =E[Sn2∧T −(n∧T) 2].
With probability one,
lim
n→∞Mn∧T =MT.
We would like to say that
E[MT] = lim
n→∞E[Mn∧T] = 0. (22)
Indeed, we can justify this using the dominated convergence. It is not difficult to show (Exercise 6.5) thatE[T]<∞, and henceMn∧T is dominated by the integrable random variableN2+T. This justifies
(22). Note that this implies that
E[T] =N2. • Consider the martingale betting strategy as above and let
T = min{n:Xn = 1}= min{x:Wn= 1}.
ThenWT∧n is a martingale, and henceE[WT∧n] =E[W0] = 0. This could be checked easily using (21). Note thatP{T <∞}= 1 and
lim
T→∞WT∧n=WT = 1.
Hence,
1 =E[WT]6= lim
T→∞E[WT∧n] = 0.
Exercise 6.5. Consider the second example above.
• Show that there exists c <∞, ρ <1(depending onN) such that for alln, P{T > n} ≤c ρn.
Use this to conclude that E[T]<∞.
• Suppose we changed the problem by assuming that M0=k wherek < N is a positive integer. In this case, what is E[T]?.
7
Brownian motion
Brownian motion is a model for random continuous motion. We will consider only the one-dimensional version. Imagine thatBt=Bt(ω) denotes the position of a random particle at timet. For fixedt,Bt(ω) is a
random variable. For fixedω, we have a function t7→Bt(ω). Hence, we can consider the Brownian motion
either as a collection of random variblesBtparametrized by time or as a random function from [0,∞) into R.
Definition A collection of random variablesBt, t∈[0,∞) defined on a probability space (Ω,F,P) is called aBrownian motion if it satisfies the following.
(i)B0= 0 w.p.1.
(ii) For each 0≤s1≤t1≤s2≤t2≤ · · · ≤sn≤tn, the random variables Bt1−Bs1, . . . , Btn−Bsn
are independent.
(iii) For each 0≤s≤t, the random variableBt−Bshas the same distribution asBt−s.
(iv) W.p.1., t7→Bt(ω) is a continuous function oft.
It is a nontrivial theorem, which we will not prove here, that states that if Bt is a processes satisfying
(i)–(iv) above, then the distribution ofBt−Bsmust be normal. (This fact is closely related to but does not
immediately follow from the central limit theorem.) Note that
B1= n X j=1 h Bj n−B j−1 n i ,
so a normal distribution is reasonable. However, one cannot conclude normality only from assumptions (i)–(iii), see Exercise 5.8. Since we know Brownian motion must have normal increments (but we do not want to prove it here!), we will modify our definition to include this fact.
Definition A collection of random variablesBt, t∈[0,∞) defined on a probability space (Ω,F,P) is called aBrownian motion with driftµand variance parameterσ2, if it satisfies the following.
(i)B0= 0 w.p.1.
(ii) For each 0≤s1≤t1≤s2≤t2≤ · · · ≤sn≤tn, the random variables Bt1−Bs1, . . . , Btn−Bsn
are independent.
(iii) For each 0≤s≤t, the random variableBt−Bshas a normal distribution with mean µ(t−s) and
varianceσ2(t−s).
(iv) W.p.1., t7→Bt(ω) is a continuous function oft. Bt is called astandardBrownian motion ifµ= 0, σ2= 1.
Exercise 7.1. SupposeBtis a standard Brownian motion. Show thatBˆt:=σBt+µtis a Brownian motion
with driftµand variance parameterσ2.
In this section, we will show that Brownian motion exists. We will take any probability space (Ω,F,P) sufficiently large that there exist a countable sequence of standard normal random variables N1, N2, . . .
defined on the space. Exercise 3.8 shows that the unit interval with Lebesgue measure is large enough. For ease, we will show how to construct a standard Brownian motionBt,0≤t≤1. Let
Dn={j/2n :j= 1, . . . ,2n}, D= ∞
[
n=0 Dn,
and suppose that the random variables{Nt:t∈D}have been relabeled so they are indexed by the countable
setD. The basic strategy is as follows:
• Step 1. Use the random variablesNt, t∈D to define random variablesBt, t∈Dthat satisfy (i)–(iii),
restricted tos, t, sj, tj ∈D.
• Step 2. Show that w.p.1 the functiont7→Bt, t∈Dis uniformly continuous.
• Step 3. On the event thatt7→Bt, t∈D is uniformly continuous, define Bt, t∈[0,1] by continuity.
Show this satisfies the defintion.
7.1
Step 1.
We will recursively define Bt for t ∈ Dn. In our construction we will see that at each stage the random
variables
∆(j, n) :=Bj2−n−B(j−1)2−n, j = 1, . . . ,2n,
are independent, normal, variance 2−n.We start withn= 0, and setB0= 0, B1=N1. To do the next step, we use the following lemma.
Lemma 7.2. SupposeX, Y are independent normal random variables with mean zero, variance σ2, and let Z=X+Y, W =X−Y. Then Z, W are independent random variables with mean zero and variance2σ2.
Proof. For ease, assumeσ2= 1. IfV ∈
R2 is a Borel set, let ˜V ={(x, y) : (x+y, x−y)∈V}. Then P{(Z, W)∈V} = P{(X, Y)∈V˜} = Z ˜ V 1 2πe −(x2+y2)/2dx dy= Z V 1 4πe −(z2+w2)/4dz dw.
The second equality uses the change of variablesx= (z+w)/2, y= (z−w)/2. We define B1/2= 1 2B1+ 1 2N1/2= 1 2N1+ 1 2N1/2, and hence B1−B1/2= 1 2N1− 1 2N1/2.
SinceN1/2, N1 are independent, standard normal random variables, the lemma tells us that
∆(1,1),∆(2,1)
are independent, mean-zero normals with variance 1/2. Recursively, ifBtis defined for t∈Dn, we define
∆(2j+ 1, n+ 1) = 1 2∆(j, n) + 2 −(n+2)/2N (2j+1)2−(n+1), and hence ∆(2j+ 2, n+ 1) = 1 2∆(j, n)−2 −(n+2)/2N (2j+1)2−(n+1). Then continual use of the lemma shows that this works.
7.2
Step 2.
LetMn = sup{|Bs−Bt|:s, t∈D,|s−t| ≤2−n}.
Showing continuity is equivalent to showing that w.p.1,Mn →0 asn→ ∞, or, equivalently, for each >0, P{Mn> i.o.}= 0.
It is easier to work with a different quantity:
Z(j, n) = sup
Bt−B(j−1)2−n
:t∈D,(j−1)2−n ≤t≤j2−n ,
Zn = max{Z(j, n) :j= 1, . . . ,2n}.
Note thatMn≤3Zn. Hence, for any >0,
P{Mn >3} ≤P{Zn> } ≤ 2n
X
j=1
P{Z(j, n)> }= 2nP{Z(1, n)> }. SinceBthas the same distribution as
√
t B1, a little thought will show that the distribution ofZ(1, n) is the
same as the distribution of
2−n/2Z(1,0) = 2−n/2 sup{|Bt|:t∈D}= 2−n/2 lim
m→∞max{|Bt|:t∈Dm}.
For anyδ >0, let us consider
P{max{|Bt|:t∈Dm}> δ} ≤2P{max{Bt:t∈Dm}> δ}. We now write {max{Bt:t∈Dm}> δ}= 2m [ k=1 Ak, where Ak=Ak,m= Bk2−m> δ, Bj2−m ≤δ, j < k . Note that P(Ak∩ {B1> δ})≥P(Ak∩ {B1−Bk2−m ≥0}) =P(Ak)P{B1−Bk2−m≥0}= 1 2P(Ak). Since theAk are disjoint, we get
P{B1> δ}= 2m X k=1 P(Ak∩ {B1> δ})≥ 1 2 2m X k=1 P(Ak) = 1 2P{max{Bt:t∈Dm}> δ}, or, forδ≥p π/2, P{max{Bt:t∈Dm}> δ} ≤2P{B1> δ}= r 2 π Z ∞ δ e−x2/2dx≤ r 2 π Z ∞ δ e−xδ/2dx≤e−δ2/2.
Since this holds for allm, Putting this all together, we have
P{Mn >3} ≤2nP{Z(1, n)> 2−n/2} ≤2n exp−22n−1 . Therefore, ∞ X n=1 P{Mn>3}<∞,
and by the Borel-Cantelli lemma,
Exercise 7.3. Let α <1/2and Y = sup |B t−Bs| (t−s)α : 0≤s < t≤1 .
Show that w.p,.1.,Y <∞. Show that the result is not true forα= 1/2.
7.3
Step 3.
This step is straightforward and left as an exercise. One important fact that is used is that ifX = limXn
andXn is normal mean zero, varianceσn2 andσn→σ, thenX is normal mean zero, varianceσ2.
7.4
Wiener measure
Brownian motion Bt,0 ≤t ≤1 can be considered as a random variable taking values in the metric space C[0,1], the set of continuous functions with the usual sup norm. The distribution of this random variable is a probability measure onC[0,1] and is called Wiener measure. Iff ∈C[0,1] and >0, then the Wiener measure of the open ball of radiusaboutf is
P{|Bt−f(t)|< ,0≤t≤1}.
Note that C[0,1] with Wiener measure is a probability space. In fact, if we start with this probability space, then it is trivial to define a Brownian motion — we just letBt(f) =f(t).
8
Elementary probability
In this section we discuss some results that would usually be covered in an undergraduate calculus-based course in probability.
8.1
Discrete distributions
A discrete distribution onRcan be considered as a functionq:R→[0,1] such thatSq :={x:q(x)>0} is
finite or countably infinite, and
X
x∈Sq
q(x) = 1.
We callSq thesupportofq. IfX is a random variable with distributionq, then P{X =x}=q(x).
8.1.1 Bernoulli and Binomial
Letp∈(0,1). Independent repetitions of an experiment with probability pof success are calledBernoulli trials. The results of the trials can be represented by independentBernoullirandom variables
X1, X2, X3, . . .
withP{Xj= 1}= 1−P{Xj= 0}=p. Ifnis a positive integer, then the random variable Y =X1+· · ·+Xn
is said to have abinomial distribution with parametersn andp. Y represents the number of sucesses in n
Bernoulli trials each with probabilitypof success. Note that
P{Y =k}= n
k
The binomial coefficient nk
represents the number of orderedn-tuples of 0s and 1s that have exactlyk1s. ClearlyE[Xj] =pand
Var[Xj] =E[Xj2]−(E[Xj]) 2
=p−p2=p(1−p).
Hence, ifY has a binomial distribution with parameters n, p,
E[Y] = n X j=1 E[Xj] =np, Var[Y] = n X j=1 Var[Xj] =np(1−p). 8.1.2 Poisson distribution
A random variableX has aPoisson distributionwith parameterλ >0, if it has distribution
P{X =k}=e−λ λ k
k!, k= 0,1,2, . . . The Taylor series for the exponential shows that
∞
X
k=0
P{X =k}= 1.
The Poisson distribution arises as a limit of the binomial distribuiton asn→ ∞when the expected number of successes stays fixed at λ. In other words, it is the limit of binomial random variables Yn = Yn,λ with
parametersnandλ/n. This limit is easily derived; ifkis a nonnegative integer,
lim n→∞P{Yn=k} = nlim→∞ n k λ n k 1−λ n n−k = lim n→∞ n(n−1)· · ·(n−k+ 1) k! λ n k 1−λ n n 1−λ n −k = λ ke−λ k! . It is easy to check that
E[X] =λ, E[X(X−1)] =λ2, Var[X] =E[X2]−E[X]2=λ. The expectation and variances could be derived from the corresponding quanitities forYn,
E[Yn] =λ, Var[Yn] =λ 1−λ n . 8.1.3 Geometric distribution
Ifp∈(0,1), thenX has ageometric distributionwith parameterpif
P{X =k}= (1−p)k−1p, k= 1,2, . . . (23)
X represents the number of Bernoulli trials needed until the first success assuming the probability of success on each trial isp. (The event that thekth trial is the first success is the event that the firstk−1 trials are failures and the last one is a success. From this, (23) follows immediately.) Note that
E[X] = ∞ X k(1−p)k−1p=−p d dp ∞ X (1−p)k=−p d dp 1−p p =1 p.