INTRODUCTION
TO
STOCHASTIC
PROCESSES
ERHAN ÇINLAR
Norman J. Sollenberger Professor in Engineering
Princeton University
DOVER PUBLICATIONS, INC.
Mineola, New York
Copyright © 1975 by Erhan Çinlar
All rights reserved.
Bibliographical Note
This Dover edition, first published in 2013, is an unabridged republication of the work originally published in 1975 by Prentice-Hall, Inc., Englewood Cliffs, New Jersey.
Library of Congress Cataloging-in-Publication Data Çinlar, E. (Erhan), 1941–
Introduction to stochastic processes / Erhan Çinlar. — Dover edition. pages cm.
Summary: “This clear presentation of the most fundamental models of random phenomena employs methods that recognize computer-related aspects of theory. Topics include probability spaces and random variables, expectations and independence, Bernoulli processes and sums of independent random variables, Poisson processes, Markov chains and processes, and renewal theory. Includes an introduction to basic stochastic processes. 1975 edition”— Provided by publisher.
Includes bibliographical references and index. eISBN-13: 978-0-486-27632-8
1. Stochastic processes. I. Title. QA274.C56 2013
519.2′3—dc23
2012028204 Manufactured in the United States by Courier Corporation
49797601
Contents
Preface
Chapter 1 Probability Spaces and Random Variables
1. Probability Spaces
2. Random Variables and Stochastic Processes
3. Conditional Probability
4. Exercises
Chapter 2 Expectations and Independence
1. Expected Value
2. Conditional Expectations
3. Exercises
Chapter 3 Bernoulli Processes and Sums of Independent Random Variables
1. Bernoulli Process
2. Numbers of Successes
3. Times of Successes
4. Sums of Independent Random Variables
5. Exercises
Chapter 4 Poisson Processes
1. Arrival Counting Process
2. Times of Arrivals
3. Forward Recurrence Times
4. Superposition of Poisson Processes
5. Decomposition of Poisson Processes
6. Compound Poisson Processes
7. Non-stationary Poisson Processes
8. Exercises
Chapter 5 Markov Chains
1. Introduction
2. Visits to a Fixed State
3. Classification of States
4. Exercises
Chapter 6 Limiting Behavior and Applications of Markov Chains
1. Computation of R and F
2. Recurrent States and the Limiting Probabilities
3. Periodic States
4. Transient States
5. Applications to Queueing Theory: M/G/1 Queue
6. Queueing System G/M/1
7. Branching Processes
8. Exercises
Chapter 7 Potentials, Excessive Functions, and Optimal Stopping of Markov Chains
1. Potentials
2. Excessive Functions
3. Optimal Stopping
4. Games with Discounting and Fees
5. Exercises
Chapter 8 Markov Processes
1. Markov Processes
2. Sample Path Behavior
3. Structure of a Markov Process
4. Potentials and Generators
5. Limit Theorems
6. Birth and Death Processes
7. Exercises
Chapter 9 Renewal Theory
1. Renewal Processes
2. Regenerative Processes and Renewal Theory
3. Delayed and Stationary Processes
4. Exercises
Chapter 10 Markov Renewal Theory
1. Markov Renewal Processes
2. Markov Renewal Functions and Classification of States
3. Markov Renewal Equations
4. Limit Theorems
5. Semi-Markov Processes
6. Semi-Regenerative Processes
7. Applications to Queueing Theory
8. Exercises
Afterword
Appendix. Non-Negative Matrices
1. Eigenvalues and Eigenvectors
2. Spectral Representations
3. Positive Matrices
4. Non-Negative Matrices
5. Limits and Rates of Convergence
References
Answers to Selected Exercises
Index of Notations
Preface
A man wanted to dock the tail of his horse. He consulted a wise man on how short to make it. “Make it as short as it pleases you,” said the wise man, “for, no matter what you do, some will say it is too long, some too short, and your opinion itself will change from time to time.”
This book is an introduction to stochastic processes. It covers most of the basic processes of interest except for two glaring omissions. Topics covered are developed in some depth, some fairly recent results are included, and references are given for further reading. These features should make the book useful to scientists and engineers looking for models and results to use in their work. However, this is primarily a textbook; a large number of numerical examples are worked out in detail, and the exercises at the end of each chapter are there to illustrate the theory rather than to extend it.
When theorems are proved, this is done in some detail; even the ill-prepared student should be able to follow them. On the other hand, not all theorems are proved. If a result is of sufficient intrinsic interest, and if it can be explained and understood, then it is listed as a theorem even though it could not be proved with the elementary tools available in this book. This freedom made it possible to include two characterization theorems on Poisson processes, several limit theorems on the ratios of additive functionals of Markov processes, several results on the sample path behavior of (continuous parameter) Markov processes, and a large number of results dealing with stopping times and the strong Markov property.
The book assumes a background in calculus but no measure theory; thus, the treatment is elementary. At the same time, it is from a modern viewpoint. In the modern approach to stochastic processes, the primary object is the behavior of the sample paths. This is especially so for the applied scientist and the engineer, since it is the sample path which he observes and tries to control. Our approach capitalizes on this happy harmony between the methods of the better mathematician and the intuition of the honest engineer.
We have also followed the modern trends in preferring matrix algebra and recursive methods to transform methods. The early probability theorist’s desire to cast his problem into one concerning distribution functions and Fourier transforms is understandable in view of his background in classical analysis and the notion of an acceptable solution prevailing then. The present generation, however, influenced especially by the availability of computers, prefers a characterization of the solution coupled with a recursive method of obtaining it to an explicit closed form expression in terms of the generating functions or the Laplace transforms of the quantities of actual interest.
The first part of the book is based on a set of lecture notes which were used by myself and several colleagues in a variety of “applied” courses on stochastic processes over the last five years. In that stage I was helped by P. A. Jacobs and C. G. Gilbert in collecting problems and abstracting papers from the applied literature. Most of the final version was written during my sabbatical stay at Stanford University; I should like to thank the department of Operations Research for their hospitality then. While there I have benefited from conversations with K. L. Chung and D. L. Iglehart; it is a pleasure to acknowledge my debt to them. I am especially indebted to A. F. Karr for his help throughout this project; he eliminated many inaccuracies and obscurities. I had the good fortune to have G. Lemmond to type the manuscript, and finally, the National Science Foundation to support my work.
CHAPTER 1
Probability Spaces and Random Variables
The theory of probability now constitutes a formidable body of knowledge of great mathematical interest and of great practical importance. It has applications in every area of natural science: in minimizing the unavoidable errors of observations, in detecting the presence of assignable causes in observed phenomena, and in discovering the basic laws obscured by chance variations. It is also an indispensable tool in engineering and business: in deducing the true lessons from statistics, in forecasting the future, and in deciding which course to pursue.
In this chapter the basic vocabulary of probability theory will be introduced. Most of these terms have cognates in ordinary language, and the reader would do well not to fall into a false sense of security because of his previous familiarity with them. Instead he should try to refine his own use of these terms and watch for the natural context within which each term appears.
1. Probability Spaces
The basic notion in probability theory is that of a random experiment: an experiment whose outcome cannot be determined in advance. The set of all possible outcomes of an experiment is called the sample space of that experiment.
An event is a subset of a sample space. An event A is said to occur if and only if the observed outcome ω of the experiment is an element of the set A.
(1.1) EXAMPLE. Consider an experiment that consists of counting the number of traffic accidents at a given intersection during a specified time interval. The sample space is the set {0, 1, 2, 3, . . .}. The statement “the number of accidents is less than or equal to seven” describes the event {0, 1, . . ., 7}. The event A = {5, 6, 7, . . .} occurs if and only if the number of accidents is 5 or 6 or 7 or . . . .
Given a sample space Ω and an event A, the complement $A^c$ of A is defined to be the event which occurs if and only if A does not occur, that is,
$$A^c = \{\omega \in \Omega : \omega \notin A\}.$$
Given two events A and B, their union is the event which occurs if and only if either A or B (or both) occurs, that is,
$$A \cup B = \{\omega \in \Omega : \omega \in A \text{ or } \omega \in B\}.$$
The intersection of A and B is the event which occurs if and only if both A and B occur, that is,
$$A \cap B = \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\}.$$
The operations of taking unions, intersections, and complements may be combined to obtain new events. In particular, the following identities are of value:
$$(A \cup B)^c = A^c \cap B^c, \qquad (A \cap B)^c = A^c \cup B^c. \tag{1.5}$$
The set Ω is also called the certain event. The set containing no elements is called the empty event and is denoted by Ø. Note that $Ø = \Omega^c$ and $\Omega = Ø^c$. Two events are said to be disjoint if they have no elements in common, that is, A and B are disjoint if
$$A \cap B = \emptyset.$$
If two events are disjoint, the occurrence of one implies that the other has not occurred. A family of events is called disjoint if every pair of them are disjoint.
Event A is said to imply the event B, written A ⊂ B, if every ω in A belongs also to B. To show that two events A and B are the same, then, it is sufficient to show that A implies B and B implies A.
If $A_1, A_2, \ldots$ are events, then their union
$$\bigcup_{i=1}^{\infty} A_i$$
is the event which occurs if and only if at least one of them occurs. Their intersection
$$\bigcap_{i=1}^{\infty} A_i$$
is the event which occurs if and only if all of them occur.
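For a finite sample space these operations can be tried out directly. The following is a minimal sketch in Python and is not part of the original text: the sample space is a truncated version of Example (1.1) (at most nine accidents), and the names `omega`, `A`, `B`, and `complement` are chosen here purely for illustration. It checks De Morgan's identities for the complement of a union and of an intersection.

```python
# Events as Python sets over a small, made-up sample space (illustration only).
omega = set(range(10))            # truncated sample space {0, 1, ..., 9}
A = {5, 6, 7, 8, 9}               # "five or more accidents"
B = {0, 1, 2, 3, 4, 5, 6, 7}      # "seven or fewer accidents"

def complement(event):
    """Outcomes of the sample space that do not belong to the event."""
    return omega - event

union = A | B                     # occurs iff A or B (or both) occurs
intersection = A & B              # occurs iff both A and B occur

# De Morgan's identities for complements of unions and intersections.
de_morgan_union = complement(A | B) == complement(A) & complement(B)
de_morgan_intersection = complement(A & B) == complement(A) | complement(B)

# "A implies B" means that A is a subset of B.
implies = {5, 6} <= A
```

Here `union` is the whole sample space and `intersection` is {5, 6, 7}, so A and B are not disjoint.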
Next, corresponding to our intuitive notion of the chances of an event occurring, we introduce a function defined on a collection of events.
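(1.6) DEFINITION. Let Ω be a sample space and P a function which associates a number with each event. Then P is called a probability measure provided that
(a) for any event A, 0 ≤ P(A) ≤ 1;
(b) P(Ω) = 1;
(c) for any sequence $A_1, A_2, \ldots$ of events which are disjoint,
$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i).$$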
By axiom (b), the probability assigned to Ω is 1. Usually, there will be other events A ⊂ Ω such that P(A) = 1. If a statement holds for all ω in such a set A with P(A) = 1, then it is customary to say that the statement is true almost surely or that the statement holds for almost all ω ∈ Ω.
Axiom (c) above is a severe condition on the manner in which probabilities are assigned to events. Indeed, it is usually impossible to assign a probability P(A) to every subset A and still satisfy (c). Because of this, it is customary to define P(A) only for certain subsets A. Throughout this book we will avoid the issue by using the term “event” only for those subsets A of Ω for which P(A) is defined. If the sample space Ω is {0, 1, 2, . . .} as in Example (1.1), then there are as many subsets of Ω as there are points on the real line. Therefore, it might be difficult to assign a probability to each event in an explicit fashion. Furthermore, almost any meaningful real-life problem requires considering much more complex sample spaces. Usually, in such situations, the probabilities of only a few key events are specified and the remaining probabilities are left to be computed from the axioms (1.6a, b, c) by considering the various relationships which might exist between events. In fact, most of probability theory concerns itself with finding methods of doing just this. The following are the first few steps in this direction.
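(1.7) PROPOSITION. If $A_1, \ldots, A_n$ are disjoint events, then
$$P(A_1 \cup \cdots \cup A_n) = P(A_1) + \cdots + P(A_n).$$
Proof. First, we establish that P(Ø) = 0. In (1.6c) we may take $A_1 = A_2 = \cdots = \emptyset$. Then also $\bigcup_{i=1}^{\infty} A_i = \emptyset$, and (1.6c) implies that P(Ø) = P(Ø) + P(Ø) + · · ·. But this is possible only if P(Ø) = 0, since 0 ≤ P(Ø) ≤ 1 by (1.6a).
Next, let $A_1, \ldots, A_n$ be disjoint events, and define $A_{n+1} = A_{n+2} = \cdots = \emptyset$. Then $A_1, A_2, \ldots$ are disjoint, and $\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{n} A_i$. By (1.6c) we have $P\big(\bigcup_{i=1}^{n} A_i\big) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{n} P(A_i)$, since $P(A_{n+1}) = P(A_{n+2}) = \cdots = P(\emptyset) = 0$.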
(1.8) PROPOSITION. If A ⊂ B, then P(A) ≤ P(B).
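Proof. If A ⊂ B, we can write
$$B = A \cup (A^c \cap B).$$
Since A and $A^c \cap B$ are disjoint, by (1.7),
$$P(B) = P(A) + P(A^c \cap B).$$
By (1.6a), $P(A^c \cap B) \ge 0$; thus, P(B) ≥ P(A).
(1.9) PROPOSITION. For any event A, $P(A^c) = 1 - P(A)$.
Proof. The events A and $A^c$ are disjoint, and by (1.7), $P(A \cup A^c) = P(A) + P(A^c)$. On the other hand, $A \cup A^c = \Omega$, and P(Ω) = 1 by (1.6b).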
Proposition (1.8) expresses our intuitive feeling that, if the occurrence of A implies that of B, then the probability of B should be at least as great as the probability of A. Proposition (1.9), restated in words, states that the probability that A does not occur is one minus the probability that A does occur.
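On a finite sample space a probability measure is determined by the weights it gives to the individual outcomes, and the propositions above can then be checked by direct computation. The sketch below is an illustration only: the fair-die model and the names `weights` and `prob` are assumptions introduced here, not something taken from the text.

```python
from fractions import Fraction

# Hypothetical model: one roll of a fair die, every outcome has weight 1/6.
omega = frozenset(range(1, 7))
weights = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """P(event), obtained by adding the weights of the outcomes in the event."""
    return sum((weights[w] for w in event), Fraction(0))

A = {2, 4}        # A is a subset of B
B = {2, 4, 6}
C = {1, 3}        # C is disjoint from A

additivity = prob(A | C) == prob(A) + prob(C)      # Proposition (1.7)
monotonicity = prob(A) <= prob(B)                  # Proposition (1.8)
complement_rule = prob(omega - A) == 1 - prob(A)   # Proposition (1.9)
```

Exact fractions are used instead of floating-point numbers so that the equalities hold without rounding error.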
Next is a theorem which is used quite often. It is useful in evaluating the probability of a “complex” event by breaking it into components whose probabilities may be simpler to evaluate.
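(1.10) THEOREM. If $B_1, B_2, \ldots$ are disjoint events with $\bigcup_i B_i = \Omega$, then for any event A,
$$P(A) = \sum_i P(A \cap B_i).$$
Proof. For any event A we can write
$$A = A \cap \Omega = \bigcup_i (A \cap B_i).$$
Since $B_1, B_2, \ldots$ are disjoint, $A \cap B_1, A \cap B_2, \ldots$ are disjoint. By (1.6c) then,
$$P(A) = \sum_i P(A \cap B_i).$$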
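(1.11) PROPOSITION. Let $A_1, A_2, \ldots$ be a sequence of events such that $A_1 \subset A_2 \subset A_3 \subset \cdots$, and put $A = \bigcup_{n=1}^{\infty} A_n$. Then
$$\lim_{n \to \infty} P(A_n) = P(A).$$
Proof. Let $B_1 = A_1$, $B_2 = A_2 \cap A_1^c$, $B_3 = A_3 \cap A_2^c$, . . . . Then $B_1, B_2, \ldots$ are disjoint, and
$$\bigcup_{i=1}^{n} B_i = A_n, \qquad \bigcup_{i=1}^{\infty} B_i = A.$$
Thus, by (1.7),
$$P(A_n) = \sum_{i=1}^{n} P(B_i).$$
But by (1.6c),
$$P(A) = \sum_{i=1}^{\infty} P(B_i) = \lim_{n \to \infty} \sum_{i=1}^{n} P(B_i) = \lim_{n \to \infty} P(A_n).$$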
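(1.12) COROLLARY. Let $A_1, A_2, \ldots$ be a sequence of events such that $A_1 \supset A_2 \supset A_3 \supset \cdots$, and put $A = \bigcap_{n=1}^{\infty} A_n$. Then
$$\lim_{n \to \infty} P(A_n) = P(A).$$
Proof. The sequence of events $A_1^c, A_2^c, \ldots$ satisfies $A_1^c \subset A_2^c \subset \cdots$, and by (1.5),
$$\bigcup_{n=1}^{\infty} A_n^c = \Big(\bigcap_{n=1}^{\infty} A_n\Big)^c = A^c.$$
Thus, by Proposition (1.11),
$$\lim_{n \to \infty} P(A_n^c) = P(A^c).$$
By Proposition (1.9), $P(A) = 1 - P(A^c)$, and $P(A_n) = 1 - P(A_n^c)$ for any n. Thus,
$$\lim_{n \to \infty} P(A_n) = 1 - \lim_{n \to \infty} P(A_n^c) = 1 - P(A^c) = P(A).$$
Proposition (1.11) gives a continuity property of P: if the events $A_1, A_2, \ldots$ “increase to” A, then their probabilities $P(A_1), P(A_2), \ldots$ increase to P(A). Corollary (1.12) is similar; if the events $A_1, A_2, \ldots$ “decrease to” A, then $P(A_1), P(A_2), \ldots$ decrease to P(A).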
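The continuity property can be watched numerically on the sample space {0, 1, 2, . . .} of Example (1.1). The sketch below assumes, purely for illustration, that the outcome k has probability $(1 - q)q^k$ with q = 0.5; this choice of weights and the function names are not from the text. The events $A_n$ = {0, 1, . . ., n} increase to Ω, and their complements {n + 1, n + 2, . . .} decrease to Ø.

```python
q = 0.5  # hypothetical parameter; outcome k has weight (1 - q) * q**k

def p_point(k):
    """Probability assigned to the single outcome k."""
    return (1 - q) * q ** k

def p_initial_segment(n):
    """P(A_n) for A_n = {0, 1, ..., n}, by finite additivity (1.7)."""
    return sum(p_point(k) for k in range(n + 1))

# P(A_0), P(A_1), ... should increase to P(Omega) = 1 (Proposition 1.11).
increasing = [p_initial_segment(n) for n in range(60)]

# P(A_n^c) = 1 - P(A_n) should decrease to P(empty event) = 0 (Corollary 1.12).
decreasing = [1 - p for p in increasing]
```

With these weights $P(A_n) = 1 - q^{n+1}$, so the convergence is geometric; other weight sequences summing to one would show the same monotone behavior at a different rate.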
2. Random Variables and Stochastic Processes
Suppose we are given a sample space Ω and a probability measure P. Most often, especially in applied problems, we are interested in functions of the outcomes rather than the outcomes themselves.
(2.1) DEFINITION. A random variable X with values in the set E is a function which assigns a value X(ω) in E to each outcome ω in Ω.
The most usual examples of E are the set of non-negative integers $\mathbb{N} = \{0, 1, 2, \ldots\}$, the set of all integers {. . ., –1, 0, 1, . . .}, the set of all real numbers $\mathbb{R} = (-\infty, +\infty)$, and the set of all non-negative real numbers $\mathbb{R}_+ = [0, \infty)$. In the first two cases and, more generally, when E is finite or countably infinite, X is said to be a discrete random variable.
(2.2) EXAMPLE. Consider the experiment of flipping a coin once. The two possible outcomes are “Heads” and “Tails,” that is, Ω = {H, T}. Suppose X is defined by putting X(H) = 1, X(T) = –1. Then X is a random variable taking values in the set E = {1, −1}. We may think of it as the earning of the player who receives or loses a dollar according as the outcome is heads or tails.
(2.3) EXAMPLE. Let an experiment consist of measuring the lifetimes of twelve electric bulbs. The sample space Ω is the set of all 12-tuples $\omega = (\omega_1, \ldots, \omega_{12})$ where $\omega_i \ge 0$ for all i. Then
$$X(\omega) = \frac{1}{12}(\omega_1 + \omega_2 + \cdots + \omega_{12})$$
defines a random variable on this sample space Ω. It represents the average lifetime of the 12 bulbs.
(2.4) EXAMPLE. Suppose an experiment consists of observing the acceleration of a vehicle during the first 60 seconds of a race. Then each possible outcome is a real-valued right continuous function ω defined for 0 ≤ t ≤ 60, and the sample space Ω is the set of all such functions ω. For t ∈ [0, 60], let
$$X_t(\omega) = \omega(t), \qquad Y_t(\omega) = \int_0^t \omega(s)\,ds, \qquad Z_t(\omega) = \int_0^t Y_s(\omega)\,ds$$
for each ω ∈ Ω. Then $X_t$, $Y_t$, and $Z_t$ are random variables on Ω. For the outcome ω, $X_t(\omega)$ is the acceleration at time t, $Y_t(\omega)$ the velocity, and $Z_t(\omega)$ the position.
Let X be a random variable taking values in a set E, and let f be a real-valued function defined on the set E. Then for each ω ∈ Ω, X(ω) is a point in E and f assigns the value f(X(ω)) to that point. By f(X) we mean the random variable whose value at ω ∈ Ω is f(X(ω)). A particular function of some use is the indicator function $I_B$ of a subset B of E; $I_B(x)$ is 1 or 0 according as x ∈ B or x ∉ B. Then $I_B(X)$ is a random variable which is equal to 1 if the event {X ∈ B} occurs and is equal to 0 otherwise. Quite often there will be a number of random variables $X_1, \ldots, X_n$ and we will be concerned with functions of them. If $X_1, \ldots, X_n$ take values in E, and if f is a real-valued function defined on $E \times \cdots \times E = E^n$, then $f(X_1, \ldots, X_n)$ is a real-valued random variable whose value at ω ∈ Ω is $f(X_1(\omega), \ldots, X_n(\omega))$.
A stochastic process with state space E is a collection $\{X_t;\ t \in T\}$ of random variables $X_t$ defined on the same probability space and taking values in E. The set T is called its parameter set. If T is countable, especially if $T = \mathbb{N} = \{0, 1, 2, \ldots\}$, the process is said to be a discrete parameter process. Otherwise, if T is not countable, the process is said to have a continuous parameter. In the latter case the usual examples are $T = \mathbb{R}_+ = [0, \infty)$ and $T = [a, b] \subset \mathbb{R}$. It is customary to think of the index t as representing time, and then one thinks of $X_t$ as the “state” or the “position” of the process at time t.
(2.5) EXAMPLE. In Example (2.4) $Y_t$ is the velocity of the vehicle at time t, and the collection $\{Y_t;\ 0 \le t \le 60\}$ is a continuous time-parameter stochastic process with state space $\mathbb{R}$. Similarly for $\{Z_t;\ 0 \le t \le 60\}$.
(2.6) EXAMPLE. Consider the process of arrivals of customers at a store, and suppose the experiment is set up to measure the interarrival times. Then the sample space Ω is the set of all sequences $\omega = (\omega_1, \omega_2, \ldots)$ of non-negative real numbers $\omega_i$. For each ω ∈ Ω and $t \ge 0$, we put $N_t(\omega) = k$ if and only if the integer k is such that $\omega_1 + \cdots + \omega_k \le t < \omega_1 + \cdots + \omega_{k+1}$ ($N_t(\omega) = 0$ if $t < \omega_1$). Then for the outcome ω, $N_t(\omega)$ is the number of arrivals in the time interval (0, t]. For each $t \ge 0$, $N_t$ is a random variable taking values in the set E = {0, 1, . . .}. Thus, $\{N_t;\ t \ge 0\}$ is a continuous time-parameter stochastic process with state space E = {0, 1, . . . }. Considered as a function in t, for a fixed ω, the function $N_t(\omega)$ is non-decreasing, right continuous, and increases by jumps only; see Figure 1.2.1.
Figure 1.2.1 A possible realization of an arrival process. The picture is for the outcome ω = (1.2, 3.0, 1.7, 0.5, 2.6, . . .).
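The definition of the arrival counting process in Example (2.6) translates directly into a few lines of code. The following is a minimal sketch, with the function name `arrival_count` chosen here for illustration; it uses only the first five interarrival times of the outcome shown in Figure 1.2.1, so the counts are meaningful only for times t before the fifth arrival.

```python
from bisect import bisect_right
from itertools import accumulate

def arrival_count(interarrival_times, t):
    """N_t for one outcome: the number of arrival times w_1 + ... + w_k <= t.

    The partial sums are non-decreasing because the interarrival times are
    non-negative, so a binary search over them is valid.
    """
    arrival_times = list(accumulate(interarrival_times))
    return bisect_right(arrival_times, t)

# First five coordinates of the outcome pictured in Figure 1.2.1.
omega = [1.2, 3.0, 1.7, 0.5, 2.6]

# Sample path evaluated at a few time points.
path = [(t, arrival_count(omega, t)) for t in (0.0, 1.0, 2.0, 5.0, 6.0, 8.0)]
```

Using `bisect_right` rather than `bisect_left` counts an arrival occurring exactly at time t, which is what makes the path right continuous.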
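Let X be a random variable taking values in $\mathbb{R}$. If $b \in \mathbb{R}$, the set of all outcomes ω for which X(ω) ≤ b is an event, namely, the event {ω: X(ω) ≤ b}. We will write {X ≤ b} for short, instead of {ω: X(ω) ≤ b}, and we will write P{X ≤ b} for P({X ≤ b}). If a ≤ b, then
$$\{X \le a\} \subset \{X \le b\}, \tag{2.7}$$
and Proposition (1.8) implies that
$$P\{X \le a\} \le P\{X \le b\}. \tag{2.8}$$
Noting that
$$\{X \le b\} = \{X \le a\} \cup \{a < X \le b\}$$
and that the events {X ≤ a} and {a < X ≤ b} are disjoint, we get, by (1.7),
$$P\{a < X \le b\} = P\{X \le b\} - P\{X \le a\}. \tag{2.9}$$
Next, note that $\bigcup_n \{X \le b_n\} = \Omega$ for any sequence $b_1, b_2, \ldots$ increasing to +∞. Since $\{X \le b_1\} \subset \{X \le b_2\} \subset \cdots$ by (2.7), Proposition (1.11) applies, and we have
$$\lim_{n \to \infty} P\{X \le b_n\} = P(\Omega) = 1. \tag{2.10}$$
Let $b_1, b_2, \ldots$ be a decreasing sequence with $\lim_n b_n = b$. Then $\{X \le b_1\} \supset \{X \le b_2\} \supset \cdots$ by (2.7), and $\bigcap_n \{X \le b_n\} = \{X \le b\}$. By Corollary (1.12), therefore,
$$\lim_{n \to \infty} P\{X \le b_n\} = P\{X \le b\}. \tag{2.11}$$
In particular, if the $b_n$ decrease to –∞, the limit in (2.11) becomes zero. The function φ defined by
$$\varphi(b) = P\{X \le b\}, \qquad b \in \mathbb{R}, \tag{2.12}$$
is called the distribution function of the random variable X. If φ is a distribution function, then (also see Figure 1.2.2)
(2.13)
(a) φ is non-decreasing by (2.8),
(b) φ is right continuous by (2.11),
(c) $\lim_{b \to \infty} \varphi(b) = 1$ by (2.10),
(d) $\lim_{b \to -\infty} \varphi(b) = 0$ by (2.11).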
Figure 1.2.2 A distribution function is non-decreasing and right continuous and lies between 0 and 1.
In the opposite direction, if φ is any function defined on the real line such that (2.13a)–(2.13d) hold, then by taking
$$\Omega = \mathbb{R}, \qquad P((-\infty, b]) = \varphi(b) \quad \text{for all } b \in \mathbb{R},$$
and letting
$$X(\omega) = \omega, \qquad \omega \in \Omega,$$
we see that X is a random variable with the distribution function φ. Hence, for any such function φ, there exists a random variable X which has φ as its distribution function.
Next, let X be a discrete random variable taking values in the (countable) set E. Then for any a ∈ E,
$$\pi(a) = P\{X = a\}$$
is a non-negative number, and we must have
$$\sum_{a \in E} \pi(a) = 1.$$
The collection {π(a): a ∈ E} is called the probability distribution of X.
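When X is discrete with real values, its distribution function is obtained from the probability distribution by adding π(a) over all a ≤ b. The sketch below illustrates this for the coin of Example (2.2), where the value 0.4 for the probability of heads is taken as an assumption and the helper name `distribution_function` is introduced here only for illustration.

```python
def distribution_function(pi):
    """Return b -> P{X <= b} for a discrete X with probability distribution pi."""
    def phi(b):
        # Add the probabilities of all values a that do not exceed b.
        return sum(p for a, p in pi.items() if a <= b)
    return phi

# Coin of Example (2.2): X = +1 on heads, X = -1 on tails,
# with an assumed probability 0.4 of heads.
pi = {-1: 0.6, 1: 0.4}
phi = distribution_function(pi)

values = [phi(b) for b in (-2, -1, 0, 1, 5)]
```

The resulting φ is a step function that jumps by π(a) at each value a, stays constant in between, and is right continuous because the comparison `a <= b` includes the jump point itself.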
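In the case of non-discrete X, it is sometimes possible to differentiate the distribution function. Then the derivative of the distribution function of X is called the probability density function of X.
(2.16) EXAMPLE. Consider Example (2.2). If the probability of “Heads” is 0.4, then
$$P(\{H\}) = 0.4, \qquad P(\{T\}) = 0.6.$$
The random variable X defined there takes only two values: −1 and +1, and
$$P\{X = -1\} = P(\{T\}) = 0.6, \qquad P\{X = 1\} = P(\{H\}) = 0.4.$$
Then
$$\varphi(b) = P\{X \le b\} = \begin{cases} 0 & \text{if } b < -1, \\ 0.6 & \text{if } -1 \le b < 1, \\ 1 & \text{if } b \ge 1. \end{cases}$$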
(2.17) EXAMPLE. In simulation studies using computers, the following setup is utilized in “generating random variables from a given distribution φ.”
A table of “random numbers” is a collection of numbers ω lying in the interval [0, 1] such that a number picked “at random” is in the interval [a, b] with probability b − a. In our terminology, what this means is that we have a sample space Ω = [0, 1] and a probability measure P on Ω defined so that P([a, b]) = b − a for all 0 ≤ a ≤ b ≤ 1. Then the event “the picked number ω is less than or equal to b,” that is, the event [0, b], has probability b.
Suppose the given distribution function φ is continuous and strictly increasing. Then for any ω ∈ Ω = [0, 1], there is one and only one $a \in [-\infty, +\infty]$ satisfying φ(a) = ω. Therefore, setting
$$X(\omega) = a \quad \text{if and only if} \quad \varphi(a) = \omega$$
defines a function X from Ω into $[-\infty, +\infty]$ (see Figure 1.2.3). Since φ is increasing, X(ω) ≤ b if and only if ω ≤ φ(b), and therefore
$$P\{X \le b\} = P([0, \varphi(b)]) = \varphi(b).$$
Figure 1.2.3 Defining a random variable X with a given distribution function φ.
In other words, the random variable X we have defined has the given function φ as its distribution. Hence, by picking a number ω at random from a table of “random numbers” and then computing X(ω) corresponding to ω from Figure 1.2.3, one obtains a possible value of a random variable X having the distribution function φ.
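The construction of Example (2.17) is what is nowadays often called inverse transform sampling. The sketch below is illustrative only and rests on assumptions not made in the text: the pseudo-random generator of Python's `random` module plays the role of the table of random numbers, and the logistic function is used as a convenient continuous, strictly increasing distribution function whose inverse has a closed form.

```python
import math
import random

def phi(a):
    """Logistic distribution function (an assumed example of a suitable phi)."""
    return 1.0 / (1.0 + math.exp(-a))

def x_of(omega):
    """X(omega): the unique a with phi(a) = omega, valid for 0 < omega < 1."""
    return math.log(omega / (1.0 - omega))

def pick_omega(rng):
    """Pick a 'random number' in (0, 1), redrawing in the rare case of 0.0."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

rng = random.Random(0)  # fixed seed so that the run is reproducible
samples = [x_of(pick_omega(rng)) for _ in range(20000)]

def empirical_cdf(b):
    """Fraction of the generated values that are <= b."""
    return sum(1 for x in samples if x <= b) / len(samples)
```

For a moderately large sample, `empirical_cdf(b)` should be close to `phi(b)` for each b, which is the numerical counterpart of the statement that X has φ as its distribution function.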
(2.18) EXAMPLE. A mid-course trajectory correction calls for a velocity increase of 135 ft/sec. The spacecraft’s engine provides a thrust which causes a constant acceleration of 15 ft/sec². Based on this, it is decided to fire the engine for 9 seconds. But the engine performance indicates that the actual length of burn time will be a random variable T with
What is the increase in velocity due to this burn?
Let Ω be the set of all possible burn times, i.e., Ω = [0, ∞), and define T(ω) = ω. Then, for the outcome ω, the velocity increase will be
$$15\,T(\omega) = 15\,\omega.$$
Thus,
Suppose we have, defined on the sample space Ω, a number of random variables $X_1, \ldots, X_n$ taking values in the countable set E. Then the probabilities of any events associated with $X_1, \ldots, X_n$ can be computed (by using the results of Section 1) once their joint distribution is specified by giving
$$P\{X_1 = a_1, \ldots, X_n = a_n\}$$
for all n-tuples $(a_1, \ldots, a_n)$ with $a_i \in E$. In the case of random variables $X_1, \ldots, X_n$ taking values in $\mathbb{R}$, the joint distribution is specified by giving
$$P\{X_1 \le b_1, \ldots, X_n \le b_n\}$$
for all numbers $b_1, \ldots, b_n \in \mathbb{R}$. Specification of these probabilities themselves can be difficult at times. The concept we introduce next simplifies such tasks (when used properly).
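(2.19) DEFINITION. The discrete random variables $X_1, \ldots, X_n$ are said to be independent if
$$P\{X_1 = a_1, \ldots, X_n = a_n\} = P\{X_1 = a_1\} \cdots P\{X_n = a_n\} \tag{2.20}$$
for all $a_1, \ldots, a_n \in E$. If the $X_i$ take values in $\mathbb{R}$, they are said to be independent if
$$P\{X_1 \le b_1, \ldots, X_n \le b_n\} = P\{X_1 \le b_1\} \cdots P\{X_n \le b_n\} \tag{2.21}$$
for all $b_1, \ldots, b_n \in \mathbb{R}$. An infinite collection $\{X_t;\ t \in T\}$ of random variables is called independent if any finite number of them are independent.
In particular, (2.21) implies, and is implied by, the condition that
$$P\{a_1 < X_1 \le b_1, \ldots, a_n < X_n \le b_n\} = P\{a_1 < X_1 \le b_1\} \cdots P\{a_n < X_n \le b_n\}$$
for all intervals $(a_1, b_1], \ldots, (a_n, b_n]$.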
We close this section by illustrating the concept through examples.
(2.22) EXAMPLE. Let X and Y be two discrete random variables taking values in {1, 2, 3, . . . }.
Suppose
For any m = 1, 2, . . ., using Theorem (1.10) with A = {X = m} and Bi = {Y = i}, we get†
Similarly, for any n ∈ {1, 2, . . . }, using (1.10) with A = {Y = n} and Bi = {X = i} now, we get
Since
(2.23) EXAMPLE. Let X and Y be two discrete random variables taking values in
and having the joint distribution
Then, using Theorem (1.10) with A = {X = m} and Bi = {Y = i}, for any , we have (see the footnote below)
Similarly, using Theorem (1.10) again, for any ,
Since P{X = m, Y = n} ≠ P{X = m}P{Y = n}, X and Y are not independent.
(2.25) EXAMPLE. Let X and Y be as in the preceding example. Then X and Y – X are independent. To
show this, we note that
for all m, and that
for all m, , as claimed.
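For two discrete random variables with finitely many values, the condition in Definition (2.19) can be checked mechanically: compute the two marginal distributions by summing the joint distribution as in Theorem (1.10), and compare every joint probability with the product of the marginals. The sketch below uses two made-up joint distributions built from a pair of fair coins; the function names and the data are assumptions introduced only for illustration.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Marginal distributions of X and Y from {(a, b): P{X = a, Y = b}}."""
    px, py = {}, {}
    for (a, b), p in joint.items():
        px[a] = px.get(a, 0) + p
        py[b] = py.get(b, 0) + p
    return px, py

def is_independent(joint):
    """True when P{X = a, Y = b} = P{X = a} P{Y = b} for every pair (a, b)."""
    px, py = marginals(joint)
    return all(joint.get((a, b), 0) == px[a] * py[b] for a, b in product(px, py))

# Two fair coins with values +1 and -1: X and Y are independent.
two_coins = {(x, y): Fraction(1, 4) for x in (1, -1) for y in (1, -1)}

# The first coin and the sum of both coins: X and X + Y are not independent.
coin_and_sum = {(x, x + y): Fraction(1, 4) for x in (1, -1) for y in (1, -1)}
```

Note that pairs missing from the joint table are treated as having probability zero, so a zero entry where both marginals are positive is enough to rule out independence.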
3. Conditional Probability
Let Ω be a sample space and P a probability measure on it.
(3.1) DEFINITION. Let A and B be two events. The conditional probability of A given B, written P(A | B), is a number satisfying
(a) 0 ≤ P(A | B) ≤ 1,
(b) P(A ∩ B) = P(A | B)P(B).
If P(B) > 0, then P(A | B) is uniquely defined by (3.1b). Otherwise, if P(B) = 0, P(A | B) can be taken to be any number in [0, 1].
For fixed B with P(B) > 0, considered as a function of A, P(A | B) satisfies the conditions (1.6) for a probability measure. That is,
(3.2)
(a) 0 ≤ P(A | B) ≤ 1,
(b) P(Ω | B) = 1,
(c) $P\big(\bigcup_i A_i \mid B\big) = \sum_i P(A_i \mid B)$ provided the events $A_1, A_2, \ldots$ are disjoint;
and thus Propositions (1.8)–(1.12) hold.
Heuristically, we think as follows. Suppose the outcome ω of the experiment is known to be in B, that is, B has occurred. Then event A can occur if and only if ω ∈ A ∩ B. And our estimate of the chances of A occurring given that B has occurred becomes the relative measure of A ∩ B with respect to B.
However, in practice, we usually have the various basic conditional probabilities specified, and our task then becomes the computation of other probabilities and conditional probabilities. The following proposition provides a simple tool for computing the probability of an event by conditioning on other events. It is sometimes referred to as the theorem of total probability.
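(3.3) THEOREM. If $B_1, B_2, \ldots$ are disjoint events with $\bigcup_i B_i = \Omega$, then for any event A,
$$P(A) = \sum_i P(A \mid B_i)\,P(B_i).$$
Proof. By Theorem (1.10),
$$P(A) = \sum_i P(A \cap B_i),$$
and by (3.1b), $P(A \cap B_i) = P(A \mid B_i)P(B_i)$ for each i.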
A simple consequence of this theorem is known as Bayes’ formula.
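(3.4) COROLLARY. If $B_1, B_2, \ldots$ are disjoint events with union Ω, then for any event A with P(A) > 0, and any j,
$$P(B_j \mid A) = \frac{P(A \mid B_j)\,P(B_j)}{\sum_i P(A \mid B_i)\,P(B_i)}.$$
Proof. Using (3.1b),
$$P(B_j \mid A) = \frac{P(A \cap B_j)}{P(A)}.$$
Writing P(A) as the sum in Theorem (3.3) and using (3.1b) to write $P(A \cap B_j) = P(A \mid B_j)P(B_j)$ completes the proof.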
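Theorem (3.3) and Bayes' formula are easy to mechanize once the probabilities $P(B_i)$ and the conditional probabilities $P(A \mid B_i)$ are given. The sketch below is an illustration under assumed data: the three-event partition, its numbers, and the function names are hypothetical and do not come from the text.

```python
from fractions import Fraction

def total_probability(prior, likelihood):
    """P(A) = sum over i of P(A | B_i) P(B_i), for a partition B_1, B_2, ..."""
    return sum(likelihood[i] * prior[i] for i in prior)

def bayes(prior, likelihood, j):
    """P(B_j | A) by Bayes' formula; requires P(A) > 0."""
    return likelihood[j] * prior[j] / total_probability(prior, likelihood)

# Hypothetical partition: P(B_i) in `prior`, P(A | B_i) in `likelihood`.
prior = {"B1": Fraction(1, 2), "B2": Fraction(3, 10), "B3": Fraction(1, 5)}
likelihood = {"B1": Fraction(1, 100), "B2": Fraction(2, 100), "B3": Fraction(5, 100)}

p_a = total_probability(prior, likelihood)
posterior = {j: bayes(prior, likelihood, j) for j in prior}
```

With these numbers P(A) = 21/1000, and the least likely event beforehand, B3, becomes the most likely one once A is known to have occurred.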
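(3.5) EXAMPLE. A coin is flipped until heads occur twice. Let X and Y denote, respectively, the trial numbers at which the first and the second heads are observed. If p is the probability of heads occurring at any one trial, then
$$P\{X = m, Y = n\} = p^2 q^{n-2}, \qquad m = 1, \ldots, n - 1;\ n = 2, 3, \ldots,$$
where q = 1 – p, 0 < p < 1. By Theorem (1.10), for any m = 1, 2, . . .,
$$P\{X = m\} = \sum_{n=m+1}^{\infty} p^2 q^{n-2} = p\,q^{m-1},$$
and similarly, for any n = 2, 3, . . .,
$$P\{Y = n\} = \sum_{m=1}^{n-1} p^2 q^{n-2} = (n - 1)\,p^2 q^{n-2}.$$
Thus, using Definition (3.1b),
$$P\{X = m \mid Y = n\} = \frac{p^2 q^{n-2}}{(n - 1)\,p^2 q^{n-2}} = \frac{1}{n - 1}$$
for any n = 2, 3, . . . and m = 1, 2, . . ., n – 1. That is, if it is known that the second heads occurred at the nth trial, the first heads must have occurred during the first n – 1 trials, and all these n – 1 possibilities are equally likely.
We have, for any m = 1, 2, . . ., n – 1 and n = 2, 3, . . .,
$$P\{Y = n \mid X = m\} = \frac{p^2 q^{n-2}}{p\,q^{m-1}} = p\,q^{n-m-1},$$
but a more instructive computation is the following. For m = 1, 2, . . . and k = 1, 2, . . .,
$$P\{Y - X = k \mid X = m\} = \frac{P\{X = m, Y = m + k\}}{P\{X = m\}} = \frac{p^2 q^{m+k-2}}{p\,q^{m-1}} = p\,q^{k-1}. \tag{3.7}$$
There are two conclusions to be drawn from (3.7): Y – X is independent of X, and Y – X has the same distribution as X.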
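The conclusions of Example (3.5) can be confirmed with exact arithmetic. The sketch below starts from the fact that a first heads at trial m and a second heads at trial n > m has probability $q^{m-1}\,p\,q^{n-m-1}\,p = p^2 q^{n-2}$; the value p = 2/5 and the function names are assumptions made only for illustration.

```python
from fractions import Fraction

p = Fraction(2, 5)   # hypothetical probability of heads
q = 1 - p

def joint(m, n):
    """P{X = m, Y = n}: first heads at trial m, second heads at trial n."""
    return p * p * q ** (n - 2) if 1 <= m < n else Fraction(0)

def p_y(n):
    """P{Y = n}, by summing the joint distribution over m = 1, ..., n - 1."""
    return sum(joint(m, n) for m in range(1, n))

def p_x(m):
    """P{X = m} in closed form p q^(m-1) (a geometric series over n > m)."""
    return p * q ** (m - 1)

# Given Y = 6, the five possible positions of the first heads are equally likely.
given_y = [joint(m, 6) / p_y(6) for m in range(1, 6)]

# P{Y - X = k | X = m} does not depend on m and equals p q^(k-1).
gap = {(m, k): joint(m, m + k) / p_x(m) for m in (1, 3, 7) for k in (1, 2, 4)}
```

Since the entries of `gap` depend on k only, the waiting time between the two heads is independent of X and has the same distribution as X.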
(3.8) EXAMPLE. Suppose the length of a telephone conversation between two ladies is a random variable X with distribution function
$$P\{X \le t\} = 1 - e^{-0.03t}, \qquad t \ge 0,$$
where the time is measured in minutes. Given that the conversation has been going on for 30 minutes, let us compute the probability that it continues for at least another 20 minutes. That is, we want to compute P{X > 50 | X > 30}. Since the event {X > 50, X > 30} is the same as the event {X > 50}, we have, by (3.1b),
$$P\{X > 50 \mid X > 30\} = \frac{P\{X > 50\}}{P\{X > 30\}} = \frac{e^{-1.5}}{e^{-0.9}} = e^{-0.6}.$$
Noting that $e^{-0.6}$ = P{X > 20}, we have this interesting result: the probability that the conversation goes on another 20 minutes is independent of the fact that it has already lasted 30 minutes. Indeed, for any t, s ≥ 0,
$$P\{X > t + s \mid X > t\} = \frac{e^{-0.03(t+s)}}{e^{-0.03t}} = e^{-0.03s} = P\{X > s\}.$$
That is, the probability that the conversation continues another s units of time is independent of how long it has been going on. Or, at every instant t, the ladies’ conversation starts afresh!
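The lack of memory in Example (3.8) is easily checked numerically. The sketch below assumes the exponential form $P\{X > t\} = e^{-0.03t}$, with the rate 0.03 per minute read off from the relation $e^{-0.6}$ = P{X > 20} in the example; the function names are chosen here for illustration.

```python
import math

RATE = 0.03  # per minute; inferred from P{X > 20} = exp(-0.6)

def survival(t):
    """P{X > t} for t >= 0, under the assumed exponential form."""
    return math.exp(-RATE * t)

def conditional_survival(t, s):
    """P{X > t + s | X > t} = P{X > t + s} / P{X > t}."""
    return survival(t + s) / survival(t)

# Probability of at least 20 more minutes, after 0, 10, 30, or 120 minutes.
after = [conditional_survival(t, 20) for t in (0, 10, 30, 120)]
```

All four entries of `after` agree with `survival(20)` up to rounding, whatever the time already elapsed.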
4. Exercises
(4.1) An experiment consists of drawing three flash bulbs from a lot and classifying each as defective (D) or non-defective (N). A drawing, then, can be described by a triplet; for example, (N, N, D) represents the outcome where the first and the second bulbs were found non-defective and the third defective. Let A denote the event “the first bulb drawn was defective,” B the event “the second bulb drawn was defective,” and C the event “the third bulb drawn was defective.”
(a) Describe the sample space by listing all possible outcomes.
(b) List all outcomes in A, B, B ∪ C, A ∪ C, A ∪ B ∪ C, A ∩ B, $A^c \cap B^c \cap C$, $A \cap B^c \cap C$, $(A \cup B^c) \cap C$, $(A \cup C) \cap (B^c \cup C)$.
(4.2) Describe in detail the sample spaces for the following experiments:
(a) Three tosses of a coin.
(b) An infinite number of coin tosses.
(c) Measurement of the speeds of cars passing a given point.
(d) Scores of a class of 20 on an examination.
(e) Measurement of noontime temperatures at a certain locality.
(f) Observation of arrivals at a store.
(4.3) An experiment consists of firing a projectile at a target and observing the position of the point of impact. (Suppose the origin of the coordinate system is placed at the target.) Then, an outcome is a pair ω = (ω1, ω2), where ω1 is the abscissa and ω2 the ordinate of the point of impact. The sample space Ω consists of all such pairs ω. For each ω ∈ Ω, let
(a) What do X, Y, Z stand for?
(b) Suppose the probability measure P is such that
Then show that X and Y are independent random variables with the same distribution function
(4.4) Let X, Y be as defined in (2.22), and put Z = X + Y. Compute
(a) P{X = m, Z = k} for m = 5, k = 7 first, and in general after that;
(b) P{X = 5 | Z = 7};
(c) P{Z = 7 | X = 5};
(d) P{Y = 3 | Z = 14};
(e) P{X = 5, Y = 3 | Z = 8};
(f) P{Z = 8 | X = 6, Y = 2}.
(4.5) Let X be a discrete random variable with
where p > 0, p + q = 1. Show that for any m, n ∈ {1, 2, . . .},
(4.6) Let X and Y denote, respectively, the number of babies born on a certain day in a hospital and the number of them which are boys. Suppose their joint distribution is
Find P{X = n} and P{Y = m} for all m, n ∈ {0, 1, . . .}. Compute P{X = n | Y = m}, P{Y = m | X = n}, P{X – Y = k | X = n}, P{X – Y = k | Y = m}. Interpret the results.
What is the probability that
has real roots?
(4.8) Reliability is the probability of a device performing its purpose adequately for the period of time intended under the operating conditions encountered.
A piece of equipment consists of three components in series: for the equipment to function, all three components must be functioning. Let X1, X2, and X3 be the respective lifetimes of the components 1, 2, and 3 measured in hours. Suppose
If the lifetimes of the components are independent, what is the reliability of the equipment in a mission requiring 4,000 hours?
(4.9) Let X be the lifetime of a component measured in hours, and let its distribution function be
Given that the item has worked successfully for the first 3 hours, what is the probability that it does not fail within the next 1.5 hours?
In general, compute the conditional probability that X > t + s given that X > t.
(4.10) A printing machine capable of printing any of n characters $a_1, \ldots, a_n$ is operated by electrical impulses, each character, in theory, being produced by a different impulse. Suppose the machine has probability p (0 < p < 1) of producing the character corresponding to the impulse received, independent of past behavior. If it prints the wrong character, the probabilities that any of the (n – 1) other characters will appear are equal.
(a) Suppose that one of the n impulses is chosen at random and fed into the machine twice, and that the character ai is printed both times. What is the probability that the impulse chosen is the one designed to produce ai?
(b) Suppose now that $a_i$ was printed on the first trial and $a_j$ (j ≠ i) on the second. What is the probability that the impulse was designed to produce $a_i$?
(c) In (b), what is the probability that the impulse was designed to produce aj?
(d) In (b), suppose that ai is printed on the first trial and that it is known only that some other character (not ai) appeared on the second. Does this change the answer to (b)?
Then said they unto him, Say now Shibboleth: and he said Sibboleth: for he could not frame to pronounce it right. Then they took him, and slew him.
THE BIBLE, THE BOOK OF JUDGES 12:6
Failure to distinguish between an outcome and an event is unlikely to have as severe a consequence as the fate which befell the warrior challenged by the men of ancient Gilead. Yet, probabilistic reasoning is a part of our culture today, and an appreciation of it cannot be acquired without developing a precise feeling for its essential ingredients.
It is important to remember always that an event is a collection of outcomes, while a random variable is a function. A random variable assigns a value to each outcome; a probability measure assigns a value to each event. One talks of the probability of an event, never of the probability of an outcome. A certain amount of confusion is caused by the historical mistakes made while the subject was developing, and the insistence of certain teachers on repeating them.
The present axiomatic foundations of the theory were laid by KOLMOGOROV [1] in 1933. Since then, the progress in probability theory has been very rapid. This progress was especially aided by the discovery of unsuspected applications to pure mathematics on the one hand, and by an ever increasing demand from other scientists and engineers on the other.
† As a reminder, 1 + x + x^2 + · · · = 1/(1 – x) for x ∈ [0, 1). Also, in Example (2.23), 1 + x + x^2/2! + x^3/3! + · · · = e^x for any real x.
CHAPTER 2
Expectations and Independence
Consider a large piece of land whose area we take to be one unit. The land is divided into n lots which, for purposes of buying and selling, are indivisible. Let Ω denote the whole land and P(A) the area of the region A. Let X(ω) denote the price per unit area of the lot containing the point ω. Then X takes only n values, say b1, b2, . . ., bn. The region on which X is equal to bk has area P{X = bk} and hence, the value of the total land is
$$E[X] = \sum_{k=1}^{n} b_k \, P\{X = b_k\}.$$
Note that E[X] is, in a sense, the integral of the function X over the set Ω, and since the total area of Ω is P(Ω) = 1, E[X] is also the average unit price.
If we think of Ω as a sample space and P as a probability measure (as we may), then X becomes a random variable. The integral of X that we obtain is then called the expected value of X. The justification for the term “expected value” lies in our interpretation of E[X] as the average of X over Ω. This concept of integrating a random variable X over a sample space Ω with respect to a probability measure P extends also to arbitrary probability spaces. The present chapter is devoted to this and related concepts.
In Section 1 we give an account of expectation taking. Then in Section 2 we introduce conditional expectations and list many of their properties. The reader should study Section 1, and read Section 2 once or twice. He is urged not to dwell too long on Section 2 but to pass on to Chapters 3, 4, and 5 instead. In these later chapters there will be many opportunities to observe the workings of conditional expectations; by referring back to the cited theorems of this chapter, the reader will learn and appreciate them. Theorems on expectations and conditional expectations form the grammar of the language of probability, and are indispensable to anyone who wants to get acquainted with that language. But one does not start learning a language by memorizing the rules of grammar.
1. Expected Value
Let Ω be a sample space, P a probability measure, and X a discrete random variable defined on Ω. Let the values X takes be b0, b1, . . . (all non-negative), and put Bn = {ω: X(ω) = bn}. Then B0, B1, . . . are disjoint and their union is Ω. The function X is equal to bn on the set Bn whose measure is P(Bn). So the integral of the function X with respect to the measure P is
$$\sum_{n} b_n \, P(B_n) \tag{1.1}$$
(we allow it to be + ∞). (See Figure 2.1.1.) Note that the right hand side divided by 1 = P(Ω) can also be looked upon as the weighted average of the function X with respect to the weight distribution given by P. Replacing P(Bn) by P{X = bn} in (1.1), we now make the following
Figure 2.1.1 Expected value of a discrete random variable is the sum of its values weighted by the corresponding probabilities.
(1.2) DEFINITION. The expected value of a discrete random variable X taking values in the set E ⊂ [0, +∞] is
$$E[X] = \sum_{a \in E} a \, P\{X = a\}.$$
The preceding defines the expected value of X when it is a discrete non-negative random variable. We extend this first to arbitrary non-negative random variables and then to arbitrary random variables.
Suppose X is a non-negative real-valued random variable. Then it is possible to find discrete random variables X1, X2, . . . such that
$$0 \le X_1(\omega) \le X_2(\omega) \le \cdots, \tag{1.3}$$
$$\lim_{n \to \infty} X_n(\omega) = X(\omega) \tag{1.4}$$
for all ω. Since each Xn is discrete, its expected value E[Xn] is well defined by (1.2). By our interpretation of E[Xn] as an integral it is easy to see that
$$E[X_1] \le E[X_2] \le \cdots,$$
and it seems reasonable to make the following
(1.5) DEFINITION. Let X be a non-negative random variable, and let X1, X2, . . . be discrete random variables satisfying (1.3) and (1.4). Then we define the expected value of X to be
$$E[X] = \lim_{n \to \infty} E[X_n].$$
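Finally, if X is an arbitrary real-valued random variable (not necessarily non-negative), and if we define
$$Y(\omega) = \max\{X(\omega), 0\} \tag{1.6}$$
and
$$Z(\omega) = \max\{-X(\omega), 0\} \tag{1.7}$$
for all ω ∈ Ω, then both Y and Z are non-negative random variables, and
$$X = Y - Z.$$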
Definition (1.5) provides the meanings for the expected values of Y and Z, and we now make the following
(1.8) DEFINITION. Let X be an arbitrary random variable with values in ℝ, and let Y and Z be defined by (1.6) and (1.7). Then
$$E[X] = E[Y] - E[Z],$$
provided that at least one of the numbers E[Y] and E[Z] is finite. If E[Y] = E[Z] = + ∞, then X is said to have no expected value.
Definitions (1.2) and (1.8) are quite workable, but (1.5) is not. In fact, we have not even settled the matter of nonambiguity. If {Xn} is a sequence of discrete random variables increasing to X, and if {Yn} is another sequence of discrete random variables also increasing to X, then Definition (1.5) would put E[X] = limn E[Xn] and E[X] = limn E[Yn]. How do we know that these two numbers are the same? Indeed, they are the same, as the proof of the next theorem shows. As a by-product we obtain a nice computational formula.
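(1.9) THEOREM. For any non-negative random variable X,
$$E[X] = \int_0^\infty P\{X > t\}\,dt. \tag{1.10}$$
Proof. First, suppose X is discrete with values in E. Then using Definition (1.2) and changing the order of summation and integration, we get (see Figure 2.1.2)
$$E[X] = \sum_{a \in E} a\,P\{X = a\} = \sum_{a \in E} \left(\int_0^a dt\right) P\{X = a\} = \int_0^\infty \sum_{a \in E,\ a > t} P\{X = a\}\,dt = \int_0^\infty P\{X > t\}\,dt. \tag{1.11}$$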
This establishes (1.10) for X discrete.
Let X be an arbitrary non-negative random variable, and let X1, X2, . . . be discrete random variables increasing to X. Then (1.11) applies to each Xn to give
$$E[X_n] = \int_0^\infty P\{X_n > t\}\,dt. \tag{1.12}$$
Figure 2.1.2 For non-negative discrete X, E[X] is the shaded area no matter how that is sliced.
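On the other hand, since the Xn increase to X, for any t ≥ 0,
$$\{X_1 > t\} \subset \{X_2 > t\} \subset \cdots, \qquad \bigcup_{n} \{X_n > t\} = \{X > t\}.$$
Thus, Proposition (1.1.11) applies to give
$$\lim_{n \to \infty} P\{X_n > t\} = P\{X > t\}. \tag{1.13}$$
It follows from (1.12) and (1.13) and the monotonicity of the convergence that, by Definition (1.5), we have
$$E[X] = \lim_{n \to \infty} E[X_n] = \lim_{n \to \infty} \int_0^\infty P\{X_n > t\}\,dt = \int_0^\infty P\{X > t\}\,dt.$$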
This completes the proof.
We note that, in Definition (1.5), the sequence chosen to approximate X has nothing to do with the value E[X]. Formula (1.10) is in general easy to use if the distribution of X is known. (See Figure 2.1.3.) In the case of discrete random variables taking integer values 0, 1, 2, . . ., it reduces further to a simpler sum:
$$E[X] = \sum_{n=0}^{\infty} P\{X > n\}.$$
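For an integer-valued X, the tail-sum form of Theorem (1.9), E[X] = P{X > 0} + P{X > 1} + · · ·, can be checked against Definition (1.2) on a small case. The following is a minimal sketch in exact arithmetic; the distribution `example_pmf` and the function names are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def expected_value_direct(pmf):
    """E[X] as the sum of n * P{X = n}, in the manner of Definition (1.2)."""
    return sum(n * p for n, p in pmf.items())

def expected_value_tail_sum(pmf):
    """E[X] as the sum over n >= 0 of P{X > n}; X takes values in {0, 1, 2, ...}."""
    largest = max(pmf)
    total = Fraction(0)
    for n in range(largest):  # P{X > n} = 0 once n >= largest
        total += sum(p for k, p in pmf.items() if k > n)
    return total

# Hypothetical distribution on {0, 1, 2, 3}, used only as an illustration.
example_pmf = {
    0: Fraction(1, 8),
    1: Fraction(3, 8),
    2: Fraction(3, 8),
    3: Fraction(1, 8),
}

direct = expected_value_direct(example_pmf)
tail = expected_value_tail_sum(example_pmf)
print(direct, tail)
```

Both routes slice the same shaded area of Figure 2.1.2, one by values and the other by levels, so the two numbers agree.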
Figure 2.1.3 Expected value of a non-negative random variable is the shaded area lying above its distribution function.
Figure 2.1.4. Expected value of the random variable X is the difference E[Y] – E[Z] of the two shaded areas.
In the case of arbitrary random variables, using Theorem (1.9) to compute E[Y] and E[Z] in Definition (1.8), we obtain (see Figure 2.1.4)
(1.15) COROLLARY. For any real-valued random variable X,
$$E[X] = \int_0^\infty P\{X > t\}\,dt - \int_{-\infty}^0 P\{X \le t\}\,dt,$$
provided that at least one term on the right is finite.
In the formula given by (1.15), if we integrate by parts we obtain
(1.16) COROLLARY. For any real-valued random variable X with distribution function φ,
$$E[X] = \int_{-\infty}^{\infty} t \, d\varphi(t),$$
provided the integral converges absolutely.
In computing a particular expectation, the choice of one formula over another is largely a matter of convenience. If there is a closed form expression for P{X > t}, in general, it is easier to use Theorem (1.9) and Corollary (1.15). Otherwise, it is easier to use Corollary (1.16) or its discrete equivalent, Definition (1.2).
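When P{X > t} has a closed form, Theorem (1.9) also lends itself to a direct numerical check. The sketch below assumes, purely for illustration, a lifetime with P{X > t} = e^(–ct) and c = 0.02; this distribution, the truncation point, and the helper name `tail_integral` are assumptions and are not taken from the examples of this section.

```python
import math

def tail_integral(survival, upper, steps=100_000):
    """Trapezoidal approximation of the integral of P{X > t} over [0, upper]."""
    h = upper / steps
    total = 0.5 * (survival(0.0) + survival(upper))
    for i in range(1, steps):
        total += survival(i * h)
    return total * h

rate = 0.02  # assumed parameter c of the hypothetical lifetime distribution

def survival(t):
    """P{X > t} for the assumed exponential lifetime."""
    return math.exp(-rate * t)

# The tail beyond t = 2000 is of order exp(-40) and is neglected here.
approx_mean = tail_integral(survival, upper=2000.0)
print(approx_mean, 1.0 / rate)
```

For this assumed distribution the integral can also be done by hand and equals 1/c, which the numerical value should reproduce to several digits.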
(1.17) EXAMPLE. The number of arrivals into a store during a specified time interval is a random
Then, using Definition (1.2),
(1.18) EXAMPLE. The lifetime X of an item has the distribution
This is a non-negative random variable; it is easier to compute E[X] by using Theorem (1.9). We obtain
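(1.19) EXAMPLE. The intensity X of light falling on a certain surface has a distribution φ given by
$$\varphi(t) = \int_{-\infty}^{t} \frac{1}{\beta\sqrt{2\pi}} \exp\left(-\frac{(s-\alpha)^2}{2\beta^2}\right) ds.$$
This distribution is called “the normal distribution with mean α and variance β^2.” Using Corollary (1.16),
$$E[X] = \int_{-\infty}^{\infty} t \, d\varphi(t) = \alpha.$$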
(1.20) EXAMPLE. A discrete random variable X has the distribution
where p, q > 0, p + q = 1. If we use Definition (1.2),†
for all ; then
(1.21) EXAMPLE. A piece of equipment has two components whose lifetimes X and Y are independent random variables with distributions
The equipment fails if either one of the two components does; that is, the lifetime of the equipment is Z = min(X, Y). To compute E[Z] we use Theorem (1.9). Now, Z > t if and only if both X > t and Y > t. So
$$P\{Z > t\} = P\{X > t,\ Y > t\} = P\{X > t\}\,P\{Y > t\},$$
where the second equality follows from the independence of X and Y (see Definition (1.2.21)). So
If X is a random variable taking values in E, and if f is a function from E into ℝ, then f(X) is a random variable with values in ℝ. Given the distribution of X, one can obtain the distribution of Y = f(X) and, using that, compute the expected value of Y by using the formulae of the preceding propositions. However, it is much easier to think of E[Y] as the integral of the function Y with respect to P and obtain it as in Definition (1.2):
(1.22) PROPOSITION. Let X be a discrete random variable taking values in E, and let f be a function from E into ℝ. Then
$$E[f(X)] = \sum_{a \in E} f(a)\,P\{X = a\},$$
provided the sum is absolutely convergent.
Proof. The random variable Y = f(X) takes the value f(a) on the set {X = a}, whose measure is P{X = a}. The integral of Y therefore is Σ f(a) P{X = a} where the summation is over all a ∈ E.
When X is not necessarily discrete, the corresponding result is the following
(1.23) PROPOSITION. Let X be a random variable with values in E, and let f be a function from E into ℝ. Then
$$E[f(X)] = \int_E f(a)\,d\varphi(a),$$
where φ is the distribution of X, provided that the integral is absolutely convergent.
Proof is omitted. If instead we had a function of more than one random variable, the preceding two propositions become as follows. Again, we omit the proof.
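(1.24) THEOREM. Let X1, . . ., Xn be random variables taking values in E, and let f be a function from E × · · · × E = E^n into ℝ. Then
$$E[f(X_1, \ldots, X_n)] = \int_{E^n} f(a_1, \ldots, a_n)\,d\varphi(a_1, \ldots, a_n),$$
where φ is the joint distribution of X1, . . ., Xn. In case X1, . . ., Xn are discrete, this becomes
$$E[f(X_1, \ldots, X_n)] = \sum_{a} f(a_1, \ldots, a_n)\,P\{X_1 = a_1, \ldots, X_n = a_n\},$$
where the summation is over all n-tuples a = (a1, . . ., an) with ai ∈ E.
(1.25) COROLLARY. Let X, Y, X1, . . ., Xn be random variables with finite expected values, and let c, c1, . . ., cn be constants. Then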
(a) E[cX] = cE[X];
(b) E[X + Y] = E[X] + E[Y];
(c) E[c1X1 + · · · + cn Xn] = c1E[X1] + · · · + cn E[Xn].
Proof of (a) is immediate from Proposition (1.23), where we take f(a) = ca and then use Corollary (1.16). Proof of (b) follows from (1.24) by taking f(a, b) = a + b, and (c) is immediate from (a) and (b).
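Corollary (1.25) can be checked by direct integration over a small sample space. The sketch below is only an illustration under assumed data: the sample space is taken to be the six faces of a fair die, and Y is deliberately chosen as a function of X so that the two variables are far from independent.

```python
from fractions import Fraction

# Assumed sample space: the faces of a fair die, each with probability 1/6.
outcomes = range(1, 7)
prob = {w: Fraction(1, 6) for w in outcomes}

def expectation(random_variable):
    """Integrate a function of the outcome with respect to the measure prob."""
    return sum(random_variable(w) * prob[w] for w in outcomes)

def X(w):
    return w          # face value

def Y(w):
    return w * w      # a function of X, hence strongly dependent on X

lhs = expectation(lambda w: 3 * X(w) + 2 * Y(w))
rhs = 3 * expectation(X) + 2 * expectation(Y)
print(lhs, rhs)
```

The two sides agree exactly even though Y is determined by X, since the argument only uses the fact that an integral of a sum is the sum of the integrals.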
We note that in the preceding corollary, we made no assumption of independence: whether or not the random variables are independent, the expected value of any linear combination of them is equal to the same linear combination of their expectations. The following is the analog for the case of multiplication.
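(1.26) PROPOSITION. Let X and Y be two independent random variables taking values in E, and let g and h be two functions from E into ℝ. Then
$$E[g(X)h(Y)] = E[g(X)]\,E[h(Y)],$$
provided the expectations exist.
Proof for discrete X, Y. Put f(a, b) = g(a)h(b) in Theorem (1.24). Then
$$E[g(X)h(Y)] = \sum_{a} \sum_{b} g(a)h(b)\,P\{X = a,\ Y = b\}.$$
But the independence of X and Y implies that
$$P\{X = a,\ Y = b\} = P\{X = a\}\,P\{Y = b\}$$
for any a and b. So
$$E[g(X)h(Y)] = \sum_{a} g(a)\,P\{X = a\} \sum_{b} h(b)\,P\{Y = b\} = E[g(X)]\,E[h(Y)].$$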
The assumption of independence in this proposition is crucial: if X and Y are not independent, then E[g(X)h(Y)] might differ from E[g(X)]E[h(Y)].
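The role of independence in Proposition (1.26) shows up already with two coin tosses. The sketch below compares E[XY] with E[X]E[Y] for two hypothetical joint distributions, one independent and one in which Y = X; the dictionaries and the function name `product_gap` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def expect(joint, f):
    """E[f(X, Y)] for a joint distribution given as {(a, b): probability}."""
    return sum(f(a, b) * p for (a, b), p in joint.items())

def product_gap(joint):
    """E[XY] - E[X]E[Y]; this is zero when X and Y are independent."""
    e_xy = expect(joint, lambda a, b: a * b)
    e_x = expect(joint, lambda a, b: a)
    e_y = expect(joint, lambda a, b: b)
    return e_xy - e_x * e_y

# Two independent fair coins, coded as 0 and 1.
independent = {(a, b): Fraction(1, 4) for a, b in product((0, 1), repeat=2)}

# The same marginal distributions, but with Y = X.
dependent = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(product_gap(independent), product_gap(dependent))
```

Both joint distributions have the same marginals, so the difference in the second case comes entirely from the dependence between X and Y.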
For certain special functions f, E[f(X)] is given certain special names. In particular, if f(b) = b^n then E[f(X)] = E[X^n] is called the nth moment of X about the origin. If f(b) = (b − μ)^2, where μ = E[X], then E[f(X)] = E[(X – μ)^2] is called the variance of X and is denoted by Var(X). If X is a non-negative integer-valued random variable, and if f(b) = α^b for some α ∈ [0, 1], then E[f(X)] = E[α^X] is a number between 0 and 1. Considered as a function of α ∈ [0, 1], G(α) = E[α^X] is called the generating function of X. If X is a non-negative random variable, and if f(b) = e^(–αb) for some α ≥ 0, then E[f(X)] is again a number between 0 and 1. Considered as a function of α, F(α) = E[e^(–αX)] is called the Laplace transform of X.
The expected value of X is a rough guide to the value X is likely to be near. The variance of X measures the deviation of X from this likely value E[X]. If the variance is small, then X is more likely to be near E[X]. The following is an estimate that can be used when the distribution of X is not known. It is called Chebyshev’s inequality.
(1.27) PROPOSITION. Let X be a random variable with expectation a and variance b^2. Then for any ε > 0,
$$P\{|X - a| > \varepsilon\} \le \frac{b^2}{\varepsilon^2}.$$
Proof. Consider the expectation of the non-negative random variable Y = (X − a)^2; it is E[Y] = b^2. Now, E[Y] is the integral of Y over all of Ω, and as such it is at least as large as the integral of Y on the set {Y > ε^2}. The measure of that set is P{Y > ε^2}, and Y > ε^2 on that set. So the integral must be at least ε^2 P{Y > ε^2}. That is,
$$b^2 = E[Y] \ge \varepsilon^2\,P\{Y > \varepsilon^2\} = \varepsilon^2\,P\{|X - a| > \varepsilon\},$$
from which the proposition follows.
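To see how conservative Chebyshev's inequality can be, one may compare the exact probability with the bound for a distribution that is fully known. The sketch below does this for a fair die, which is an assumed example; the function name `chebyshev_check` is likewise hypothetical.

```python
from fractions import Fraction

# Assumed distribution: a fair die with faces 1, ..., 6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

mean = sum(k * p for k, p in pmf.items())
variance = sum((k - mean) ** 2 * p for k, p in pmf.items())

def chebyshev_check(eps):
    """Return (exact P{|X - a| > eps}, Chebyshev bound b^2 / eps^2)."""
    exact = sum(p for k, p in pmf.items() if abs(k - mean) > eps)
    bound = variance / eps ** 2
    return exact, bound

exact, bound = chebyshev_check(2)
print(exact, bound)
```

The bound holds, as it must, but it is far from tight here; this is the price of using no information about the distribution beyond its mean and variance.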
In computing the variance it is usually worth noting that
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2. \tag{1.28}$$
Following are some examples of these computations.
(1.29) EXAMPLE. Consider the random variable X of Example (1.17). We had already computed E[X] = 8. Now, to obtain the variance we use the formula (1.28). To compute E[X^2], note that it is easier to compute E[X(X − 1)] first and then use E[X^2] = E[X(X − 1)] + E[X]. Now,
Hence,
and
Next we compute its generating function. We have
for any α ∈ [0, 1]. Note that the derivative of G at α = 1 is
and the third derivative is
(1.30) EXAMPLE. Consider the lifetime X of the item discussed in Example (1.18). Its expectation was E[X] = 50. Now,
and hence
Computing the Laplace transform of X, we find
We note that the derivative of F at α = 0 is
and that the second derivative at α = 0 is
The results of Example (1.29) concerning the generating function generalize: we have, for the generating function G of a non-negative integer-valued random variable X,
$$E[X(X-1)\cdots(X-k+1)] = G^{(k)}(1),$$
where G^(k) is the kth derivative of G. Similarly, the results in Example (1.30) concerning the Laplace transform also generalize. For any non-negative random variable X with Laplace transform F,
$$E[X^k] = (-1)^k F^{(k)}(0).$$
It is also worth mentioning that a generating function determines the probability distribution associated with it; this is true because
$$G(\alpha) = \sum_{n=0}^{\infty} \alpha^n\,P\{X = n\},$$
which means that P{X = n} is the coefficient of α^n in the power-series expansion of G(α). Similarly, the Laplace transform determines the associated distribution function.
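For a random variable taking finitely many values, the generating function is a polynomial, and the facts above reduce to elementary operations on its coefficients. The following sketch stores G as a coefficient list; the distribution `pmf` and the helper names are hypothetical.

```python
from fractions import Fraction

def generating_function(pmf):
    """Coefficients of G: G(alpha) = sum over n of coeffs[n] * alpha**n."""
    coeffs = [Fraction(0)] * (max(pmf) + 1)
    for n, p in pmf.items():
        coeffs[n] += p
    return coeffs

def derivative(coeffs):
    """Coefficients of the derivative of the polynomial with these coefficients."""
    return [n * c for n, c in enumerate(coeffs)][1:]

def evaluate(coeffs, alpha):
    """Value of the polynomial at alpha."""
    return sum(c * alpha ** n for n, c in enumerate(coeffs))

# Hypothetical distribution on {0, 1, 2}.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

G = generating_function(pmf)
first = evaluate(derivative(G), 1)                # E[X]
second = evaluate(derivative(derivative(G)), 1)   # E[X(X - 1)]
variance = second + first - first ** 2
print(G, first, second, variance)
```

Reading the coefficient list back gives the distribution itself, which is the sense in which G determines the distribution.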
We close this section with two theorems on the expected value of the limit of a sequence of random variables. The first is called the monotone convergence theorem and the second the bounded convergence theorem. The proof of the first is the same as that of (1.9), and we will not repeat it; we also omit the proof of the second.
(1.34) THEOREM. If X1, X2, . . . is a sequence of non-negative random variables increasing to the random variable X, then the expectations E[X1], E[X2], . . . increase to E[X].
(1.35) THEOREM. Let X1, X2, . . . be a sequence of random variables which are bounded in absolute value by a random variable Y such that E[Y] < ∞. If
$$\lim_{n \to \infty} X_n(\omega) = X(\omega)$$
for almost all ω ∈ Ω, then
$$\lim_{n \to \infty} E[X_n] = E[X].$$
2. Conditional Expectations
Let Y be a discrete random variable taking values in ℝ, and let A be an event with P(A) > 0. Then the conditional probability that Y = b given the event A is (see (1.3.1))
$$P\{Y = b \mid A\} = \frac{P(\{Y = b\} \cap A)}{P(A)}. \tag{2.1}$$
As b varies, this is called the conditional distribution of Y given the event A. We define the conditional expectation of Y given the event A as
$$E[Y \mid A] = \sum_{b} b\,P\{Y = b \mid A\}. \tag{2.2}$$
In particular, when A = {X = a} for a discrete random variable X taking values in a set E,
$$E[Y \mid X = a] = \sum_{b} b\,P\{Y = b \mid X = a\} \tag{2.3}$$
is called the conditional expectation of Y given that X = a. As a varies, (2.3) defines a function f on E by
$$f(a) = E[Y \mid X = a]. \tag{2.4}$$
By the conditional expectation of Y given X, written E[Y | X], we mean the random variable f(X); that is,
$$E[Y \mid X] = f(X), \tag{2.5}$$
where f is as defined by (2.4). The following definition is the generalized version of this.
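(2.6) DEFINITION. Let X1, . . ., Xn be discrete random variables taking values in E, and let Y be a discrete random variable with values in ℝ. Then the conditional expectation of Y given X1, . . ., Xn is
$$E[Y \mid X_1, \ldots, X_n] = f(X_1, \ldots, X_n),$$
where for any n-tuple (a1, . . ., an) with ai ∈ E,
$$f(a_1, \ldots, a_n) = \sum_{b} b\,P\{Y = b \mid X_1 = a_1, \ldots, X_n = a_n\}.$$
If Y is not discrete, then a similar definition is given in terms of its conditional distribution P{Y ≤ t | X1 = a1, . . ., Xn = an}. For example, if Y is non-negative,
$$E[Y \mid X_1, \ldots, X_n] = f(X_1, \ldots, X_n),$$
where
$$f(a_1, \ldots, a_n) = \int_0^\infty P\{Y > t \mid X_1 = a_1, \ldots, X_n = a_n\}\,dt$$
for all a1, . . ., an ∈ E.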
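For discrete variables, the function f in the definition of E[Y | X] can be tabulated directly from a joint distribution. The sketch below does so for a single conditioning variable; the joint distribution `joint` and the function name are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint distribution of (X, Y), given as {(a, b): P{X = a, Y = b}}.
joint = {
    (1, 0): Fraction(1, 4),
    (1, 2): Fraction(1, 4),
    (2, 1): Fraction(1, 6),
    (2, 4): Fraction(1, 3),
}

def conditional_expectation(joint):
    """Return the function f as a dict: f[a] = E[Y | X = a]."""
    marginal = defaultdict(Fraction)   # P{X = a}
    weighted = defaultdict(Fraction)   # sum over b of b * P{X = a, Y = b}
    for (a, b), p in joint.items():
        marginal[a] += p
        weighted[a] += b * p
    return {a: weighted[a] / marginal[a] for a in marginal}

f = conditional_expectation(joint)
print(f)
```

The random variable E[Y | X] is then obtained by evaluating this table at the observed value of X, that is, E[Y | X](ω) = f(X(ω)).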
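If A is an event, then its indicator I_A (which is 1 if A occurs and 0 if not) is a random variable. Then we define the conditional probability of A given X1, . . ., Xn as
$$P\{A \mid X_1, \ldots, X_n\} = E[I_A \mid X_1, \ldots, X_n].$$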
The following are some easy properties of conditional expectations. These are analogous to Propositions (1.22), (1.23), (1.24), and (1.25). We omit the proofs.
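(2.10) PROPOSITION. Let Y be a discrete random variable with values in E and g a function from E into ℝ. Then
$$E[g(Y) \mid X_1, \ldots, X_n] = \sum_{b \in E} g(b)\,P\{Y = b \mid X_1, \ldots, X_n\}.$$
(2.11) PROPOSITION. Let Y1, . . ., Ym be discrete random variables with values in E, and let g be a function from E^m into ℝ. Then
$$E[g(Y_1, \ldots, Y_m) \mid X_1, \ldots, X_n] = \sum_{b} g(b_1, \ldots, b_m)\,P\{Y_1 = b_1, \ldots, Y_m = b_m \mid X_1, \ldots, X_n\},$$
where the summation is over all m-tuples b = (b1, . . ., bm) with bi ∈ E.
(2.12) COROLLARY. If Y1, . . ., Ym take values in ℝ and c1, . . ., cm are constants, then
$$E[c_1 Y_1 + \cdots + c_m Y_m \mid X_1, \ldots, X_n] = c_1 E[Y_1 \mid X_1, \ldots, X_n] + \cdots + c_m E[Y_m \mid X_1, \ldots, X_n].$$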
(2.13) EXAMPLE. Let X and Y be two random variables with
Let f(b) = E[Y | X = b], b = 1, 2. Then
Thus,
for k = 1, . . ., m – 1; m = 2, . . ., n – 1; n = 3, 4, . . ., where 0 < p < 1, p + q = 1. Then for k = 1, . . ., m – 1; m = 2, 3, . . .;
Thus, for k = 1, . . ., m – 1 and m = 2, . . ., n – 1, we have
Hence, for k = 1, . . ., m – 1 and m = 2, 3, . . .,
Thus,
We note that for any bounded function g,
so that
In particular, if g(b) = α^b for some α ∈ [0, 1], then
The next proposition states that if the knowledge of X1, . . ., Xn determines Y completely, then the conditional expectation of Y given X1, . . ., Xn is equal to Y itself. The proof is very easy and we omit it.
(2.15) PROPOSITION. If Y can be written as
$$Y = f(X_1, \ldots, X_n)$$
for some function f, then
$$E[Y \mid X_1, \ldots, X_n] = Y.$$
Next is a very useful result used in situations where E[Y | X1, . . ., Xn] is easy to obtain or known somehow. Since E[Y | X1, . . ., Xn] is a random variable taking real values, we can talk about its expected value. That expected value is the same as the expectation of Y. In words, the expected value of any conditional expectation of Y is equal to the expected value of Y.
(2.16) PROPOSITION. E[E[Y | X1, . . ., Xn]] = E[Y].
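Proof for discrete Y, X1, . . ., Xn. Let
$$f(a_1, \ldots, a_n) = E[Y \mid X_1 = a_1, \ldots, X_n = a_n];$$
then
$$E[E[Y \mid X_1, \ldots, X_n]] = E[f(X_1, \ldots, X_n)] = \sum_{a} f(a_1, \ldots, a_n)\,P\{X_1 = a_1, \ldots, X_n = a_n\}. \tag{2.17}$$
On the other hand,
$$f(a_1, \ldots, a_n) = \sum_{b} b\,P\{Y = b \mid X_1 = a_1, \ldots, X_n = a_n\}. \tag{2.18}$$
Putting (2.18) into (2.17) and changing the order of summation, noting Definition (1.3.1) of conditional probabilities, we obtain
$$E[E[Y \mid X_1, \ldots, X_n]] = \sum_{b} b \sum_{a} P\{Y = b,\ X_1 = a_1, \ldots, X_n = a_n\} = \sum_{b} b\,P\{Y = b\} = E[Y].$$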
The next result is very important in the theory of stochastic processes. It shows how to obtain the conditional expectation of Y given X1, . . ., Xn when it is easy to obtain the same given X1, . . ., Xn plus some extra information contained in Xn+1, . . ., Xn+m.
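(2.19) THEOREM. For any n, m ≥ 1,
$$E[E[Y \mid X_1, \ldots, X_{n+m}] \mid X_1, \ldots, X_n] = E[Y \mid X_1, \ldots, X_n].$$
Proof for n = 2, m = 1, X1, X2, X3, Y discrete. Let Z = f(X1, X2, X3) = E[Y | X1, X2, X3]. We need to show that
$$E[Z \mid X_1, X_2] = E[Y \mid X_1, X_2].$$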
We have
and
Putting the two computations together, we get
since
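by (1.3.1) and (1.3.2). Noting that (2.21) is the same as (2.20) completes the proof.
(2.22) COROLLARY. If
$$E[Y \mid X_1, \ldots, X_{n+m}] = f(X_1, \ldots, X_n)$$
for some function f, then
$$E[Y \mid X_1, \ldots, X_n] = E[Y \mid X_1, \ldots, X_{n+m}].$$
Proof. By Theorem (2.19) and Proposition (2.15),
$$E[Y \mid X_1, \ldots, X_n] = E[E[Y \mid X_1, \ldots, X_{n+m}] \mid X_1, \ldots, X_n] = E[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_n] = f(X_1, \ldots, X_n) = E[Y \mid X_1, \ldots, X_{n+m}].$$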
The preceding corollary has the following interpretation. Suppose, given X1, . . ., Xn+m, that the conditional expectation of Y depends only on X1, . . ., Xn. This means that, as far as predicting the value of Y is concerned, knowledge of X1, . . ., Xn makes further knowledge concerning Xn+1, . . ., Xn+m irrelevant. Therefore, the conditional expectation of Y given X1, . . ., Xn is the same as that of Y given X1, . . ., Xn+m.
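Theorem (2.19) can be verified by brute force on a finite sample space. The sketch below takes n = 2 and m = 1 with binary variables; the joint weights are an arbitrary assumption chosen so that Y depends on X3, and the helper `cond_exp` is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def cond_exp(joint, value, key):
    """Table of E[value | key]: maps each value of key(outcome) to the conditional mean."""
    mass = defaultdict(Fraction)
    total = defaultdict(Fraction)
    for outcome, p in joint.items():
        mass[key(outcome)] += p
        total[key(outcome)] += value(outcome) * p
    return {k: total[k] / mass[k] for k in mass}

# Hypothetical joint distribution of (X1, X2, X3, Y) on {0, 1}^4 with unequal weights.
weights = {
    (x1, x2, x3, y): Fraction(1 + x1 + 2 * x2 * x3 + 3 * y * x3)
    for x1, x2, x3, y in product((0, 1), repeat=4)
}
norm = sum(weights.values())
joint = {outcome: w / norm for outcome, w in weights.items()}

# Z = E[Y | X1, X2, X3], stored as a table indexed by (x1, x2, x3).
inner = cond_exp(joint, lambda o: o[3], lambda o: o[:3])
# E[Z | X1, X2] and E[Y | X1, X2], both indexed by (x1, x2).
lhs = cond_exp(joint, lambda o: inner[o[:3]], lambda o: o[:2])
rhs = cond_exp(joint, lambda o: o[3], lambda o: o[:2])
print(lhs == rhs)
```

Because the arithmetic is exact, the two tables coincide entry by entry, even though the inner conditional expectation genuinely depends on X3.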
A particular case of this happens to come up fairly often. Suppose we have E[Y | X1, . . ., Xn] computed; and suppose Y1, . . ., Ym are functions of X1, . . ., Xn, that is, Y1 = g1(X1, . . ., Xn), . . ., Ym = gm(X1, . . ., Xn). Then E[Y | X1, . . ., Xn, Y1, . . ., Ym] = E[Y | X1, . . ., Xn].
Another important concept is contained in the following theorem. Suppose {Y1, . . ., Ym} and {X1, . . ., Xn} are such that knowing the values of one set determines the values of the other. This is especially the case when Y1 = g1(X1, . . ., Xn), . . ., Ym = gm(X1, . . ., Xn) and conversely X1 = f1(Y1, . . ., Ym) . . ., Xn = fn(Y1, . . ., Ym). Then for any random variable Y, the conditional expectation of Y given X1, . . ., Xn is the same as the conditional expectation of Y given Y1, . . ., Ym. This is so since {X1, . . ., Xn} carries the same information as {Y1, . . ., Ym}. The proof is easy and will be omitted.
(2.23) THEOREM. Suppose the collections {X1, . . ., Xn} and {Y1, . . ., Ym} are such that the knowledge of the random variables in one collection determines the values of the random variables in the other. Then for any Y,
$$E[Y \mid X_1, \ldots, X_n] = E[Y \mid Y_1, \ldots, Y_m].$$
We close this section by giving an extension of the concept of independence.
(2.24) DEFINITION. The set of random variables {Y1, . . ., Ym} is said to be independent of {X1, . . ., Xn} if
$$E[g(Y_1, \ldots, Y_m) \mid X_1, \ldots, X_n] = E[g(Y_1, \ldots, Y_m)]$$
for all non-negative functions g. Two stochastic processes {Yt; t ∈ T1} and {Xt; t ∈ T2} are said to be independent of each other if any finite collection {Yt1, . . ., Ytm} from the first is independent of any finite collection {Xs1, . . ., Xsn} from the second.
It can be shown that the independence of random variables in the sense of Definition (1.2.21) is equivalent to the independence, in the sense of (2.24), of any two subcollections. As such, we will not distinguish between the two.
Next is a new concept, that of conditional independence.
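(2.25) DEFINITION. {Y1, . . ., Ym} is said to be conditionally independent of {Z1, . . ., Zk} given {X1, . . ., Xn} provided that
$$E[g(Y_1, \ldots, Y_m) \mid X_1, \ldots, X_n, Z_1, \ldots, Z_k] = E[g(Y_1, \ldots, Y_m) \mid X_1, \ldots, X_n]$$
for all non-negative functions g. The collection {Yt; t ∈ T1} is said to be conditionally independent of the collection {Zt; t ∈ T2} given the collection {Xt; t ∈ T3} provided that for any finite collection {Yt1, . . ., Ytm} from the first and any finite collection {Zs1, . . ., Zsn} from the second,
$$E[g(Y_{t_1}, \ldots, Y_{t_m}) \mid X_t,\ t \in T_3;\ Z_{s_1}, \ldots, Z_{s_n}] = E[g(Y_{t_1}, \ldots, Y_{t_m}) \mid X_t,\ t \in T_3]$$
for all non-negative functions g.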
In words, {Y1, . . ., Ym} is conditionally independent of {Z1, . . ., Zk} given {X1, . . ., Xn} provided that, as far as predicting the value of any function of Y1, . . ., Ym is concerned, the extra knowledge provided by Z1, . . ., Zk loses all its significance once the values of X1, . . ., Xn are known.
(2.26) EXAMPLE. Consider the random variables X, Y, Z of Example (2.14). We had shown that
The right hand side being independent of X, we see that Z is conditionally independent of X given Y. We also note that Z is not independent of X.
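(2.27) EXAMPLE. Let X1, X2, . . . be a sequence of random variables with E[Xi] = μ for all i. Let N be a non-negative integer-valued random variable independent of X1, X2, . . . with E[N] = λ. For each ω ∈ Ω, let
$$Y(\omega) = \begin{cases} 0 & \text{if } N(\omega) = 0, \\ X_1(\omega) + \cdots + X_{N(\omega)}(\omega) & \text{if } N(\omega) \ge 1. \end{cases}$$
We would like to compute E[Y]. We may think of X1, X2, . . . as the amounts spent by customers 1, 2, . . . and of N as the number of arrivals within the first hour. Then Y is the total revenue within that hour. By Proposition (2.16),
$$E[Y] = E[E[Y \mid N]]. \tag{2.28}$$
On the other hand, since N is independent of X1, X2, . . ., for n ≥ 1,
$$E[Y \mid N = n] = E[X_1 + \cdots + X_n \mid N = n] = E[X_1 + \cdots + X_n] = n\mu.$$
Hence
$$E[Y \mid N] = \mu N,$$
and by (2.28)
$$E[Y] = E[\mu N] = \mu\lambda.$$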
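The computation in Example (2.27) can be confirmed by enumerating every outcome of a small instance. The sketch below assumes, more strongly than the example requires, that the Xi are independent and identically distributed; the two distributions and the function name are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical distribution of each amount Xi and of the number of arrivals N.
x_pmf = {1: Fraction(1, 2), 2: Fraction(1, 4), 5: Fraction(1, 4)}
n_pmf = {0: Fraction(1, 5), 1: Fraction(2, 5), 3: Fraction(2, 5)}

mu = sum(x * p for x, p in x_pmf.items())    # E[Xi]
lam = sum(n * p for n, p in n_pmf.items())   # E[N]

def expected_random_sum(x_pmf, n_pmf):
    """E[Y] for Y = X1 + ... + XN, summing over every outcome (n, x1, ..., xn)."""
    total = Fraction(0)
    for n, p_n in n_pmf.items():
        for xs in product(x_pmf, repeat=n):
            p = p_n
            for x in xs:
                p *= x_pmf[x]          # independence of N, X1, ..., Xn
            total += sum(xs) * p       # the empty sum (n = 0) contributes 0
    return total

e_y = expected_random_sum(x_pmf, n_pmf)
print(e_y, mu * lam)
```

The enumeration grows exponentially in the largest value of N, so it is only meant as a check on small cases; the conditioning argument gives the answer with no enumeration at all.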
3. Exercises
(3.1) Find the expected value of the random variable X taking the values –5, 1, 4, 8, 10 with probabilities 0.3, 0.2, 0.2, 0.1, 0.2 respectively.
(3.2) Consider the random variable X taking the values –2, 0, 2 with probabilities 0.4, 0.3, 0.3 respectively. Compute the expected values of X, X^2, 3X^2 + 5.
(3.3) Compute the variance and the generating function of the random variable in Example (1.20).
(3.4) A random variable X is said to have the uniform distribution over [a, b] if
(a) Compute E[X], Var(X), E[(X – a)/(b – a)].
(b) Find the distribution of Y = (X – a)/(b – a).
(3.5) Compute the variance and the Laplace transform of the lifetime in Example (1.21).
(3.6) Compute the variance of the intensity of light in Example (1.19).
(3.7) The headway X between two vehicles at a fixed instant is a random variable with
Find the expected value and the variance of the headway.
(3.8) Show that for any constants a and b,
for any random variable X.
(3.9) Show that for any two independent random variables X and Y,
(3.10) The lifetime X of a device has the distribution
(a) Show that E[X] = 1/c.
(b) Show that (see also Example (1.3.8))
(3.11) Let X, Y be as defined in Example (1.2.23).
(a) Compute E[X], E[Y – X], E[Y].
(b) Find E[Y – X | X], E[Y | X].
(c) Show by direct computation that E[E[Y | X]] = E[Y].
(3.12) Suppose X1, X2, . . . are independent and identically distributed non-negative random variables with
Let N be a non-negative integer-valued random variable which is independent of {X1, X2, . . .}, and let
Let S0 = 0, S1 = X1, S2 = X1 + X2, . . ., and let Y = SN.
(a) Compute E[Y | N], E[Y^2 | N].
(b) Compute E[Y], E[Y^2], Var(Y).
(c) Show that for any ,
The chapter just finished completes our account of the preliminaries necessary for studying stochastic processes. Especially in view of the monstrous looks of the last section, it seems all too advisable to inquire how the reader’s patience is holding out and to assure him that he will in time come to appreciate the true friendliness of these concepts. For a deeper treatment and for the proofs which we have omitted, we refer the reader to CHUNG [2].
† Note that, for x ∈ (0, 1),
$$\sum_{n=0}^{\infty} x^n = \frac{1}{1-x}.$$
If we differentiate both sides we get
$$\sum_{n=1}^{\infty} n x^{n-1} = \frac{1}{(1-x)^2}.$$
CHAPTER 3
Bernoulli Processes and Sums of Independent Random Variables
Consider an experiment consisting of an infinite sequence of trials. Suppose that the trials are independent of each other and that each one has only two possible outcomes: “success” and “failure.” A possible outcome of such an experiment is (S, F, F, S, F, S, . . .), which stands for the outcome where the first trial resulted in success, the second and third in failure, fourth in success, fifth in failure, sixth in success, etc. Thus, the sample space Ω of the experiment consists of all sequences of two letters S and F, that is,
$$\Omega = \{\omega = (\omega_1, \omega_2, \ldots) : \omega_i \in \{S, F\} \text{ for each } i\}.$$
We next describe a probability measure P on all subsets of Ω. Let 0 ≤ p ≤ 1, and define q = 1 – p. We think of p and q as the probabilities of success and failure at any one trial. The probability of the event {ω: ω1 = S, ω2 = S, ω3 = F} should then be p · p · q = p^2 q, and the probability of the event {ω: ω1 = S, ω2 = F, ω3 = F, ω4 = F, ω5 = S} should be p · q · q · q · p = p^2 q^3. For each n, we consider all events which are specified by the first n trials and define their probabilities in this manner. This, in addition to the conditions in Definition (1.1.6), completely specifies the probability P.
For each ω ∈ Ω and n ∈ {1, 2, . . .}, define Xn(ω) = 1 or 0 according as ωn = S or ωn = F. Then for each n, we have a random variable Xn whose only values are 1 and 0. It follows from the description of P that X1, X2, . . . are independent and identically distributed with P{Xn = 1} = p, P{Xn = 0} = q. In the first three sections of this chapter we will be interested in the properties of stochastic processes such as {Xn; n = 1, 2, . . .} and other processes defined in terms of {Xn}.
The process {Xn} is very simple in nature, and the answers to most problems raised here are easy to obtain. Our object here is to demonstrate some of the questions that are raised in studying a stochastic process, and to provide some experience in using the tools of Chapters 1 and 2.
The processes of Sections 2 and 3 provide two examples of sums of independent and identically distributed random variables. We collect results of a more general nature about such processes in Section 4 along with certain classical limit theorems.
1. Bernoulli Processes
Let Ω be a sample space and P a probability measure on it. Let {Xn; n = 1, 2, . . .} be a sequence of random variables defined on Ω and taking only the two values 0 and 1.
(1.1) DEFINITION. The stochastic process {Xn; n = 1, 2, . . .} is called a Bernoulli process with success probability p provided that
(a) X1, X2, . . . are independent, and
(b) P{Xn = 1} = p, P{Xn = 0} = q = 1 − p for all n.
(1.2) EXAMPLE. Finished products coming off an assembly line are given a routine inspection. If the nth item is “defective” we put Xn = 1, otherwise Xn = 0. If the production process is “under control,” the successive items produced will show only random chance effects, and X1, X2, . . . will be independent. If, further, the defective rate p remains constant over time, then {Xn; n = 1, 2, . . .} will be a Bernoulli process.
(1.3) EXAMPLE. At a certain fork on a road, about 62 percent of the vehicles turn left. We define Xn to be 1 or 0 according as the nth vehicle turns left or right. The random variables X1, X2, . . . are independent of each other if the drivers act independently of each other in choosing their directions to turn. Then {Xn; n = 1, 2, . . .} is a Bernoulli process with probability 0.62 for success.
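A finite stretch of a Bernoulli process such as the one in Example (1.3) is easy to simulate. The following is a minimal sketch using a pseudo-random number generator; the seed, the number of vehicles, and the function name `bernoulli_process` are assumptions made for illustration.

```python
import random

def bernoulli_process(p, n, rng):
    """First n terms X_1, ..., X_n: independent, each equal to 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(12345)  # fixed seed so that the run is repeatable
turns = bernoulli_process(0.62, 100_000, rng)  # 1 = left turn, 0 = right turn

fraction_left = sum(turns) / len(turns)
print(fraction_left)
```

With this many trials the observed fraction of left turns should lie close to 0.62, although any single run will differ slightly from it.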
(1.4) EXAMPLE. Diameters of bearings coming off a production line are measured, and those that do not meet the specifications are rejected. Let Y1, Y2, . . . be the diameters in inches of the first, second, . . . bearings. Let a = 2.994 and b = 3.006 be the lower and the upper tolerance limits. Thus, the nth bearing is not rejected if and only if 2.994 ≤ Yn ≤ 3.006.
Suppose Y1, Y2, . . . are independent and identically distributed with the common distribution function φ given by
$$\varphi(t) = \int_{-\infty}^{t} \frac{1}{0.002\sqrt{2\pi}} \exp\left(-\frac{(s-3)^2}{2(0.002)^2}\right) ds$$
(this is the normal distribution with mean 3 and variance (0.002)^2). For each n let
$$X_n = \begin{cases} 1 & \text{if } Y_n \in B, \\ 0 & \text{otherwise,} \end{cases}$$
where B = [2.994, 3.006]. Then Xn is 1 if the nth bearing meets the specifications and is 0 otherwise. Since Y1, Y2, . . . are independent, X1, X2, . . . are also independent; and since the Yn have a common distribution, the Xn also have a common distribution. Hence, {Xn; n = 1, 2, . . .} is a Bernoulli process. The probability p = P{Xn = 1} is now computed as
$$p = P\{2.994 \le Y_n \le 3.006\} = \varphi(3.006) - \varphi(2.994) \approx 0.997,$$
where the table for the standard normal distribution is used to obtain the last number (see Figure 3.4.3 on page 65).
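In place of a table, the success probability of Example (1.4) can be evaluated with the error function. The sketch below relies on the standard identity between the normal distribution function and `math.erf`; the function names are hypothetical.

```python
import math

def standard_normal_cdf(z):
    """Distribution function of the normal distribution with mean 0 and variance 1."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def acceptance_probability(lower, upper, mean, std_dev):
    """P{lower <= Y <= upper} for a normal Y with the given mean and standard deviation."""
    z_upper = (upper - mean) / std_dev
    z_lower = (lower - mean) / std_dev
    return standard_normal_cdf(z_upper) - standard_normal_cdf(z_lower)

# Tolerance limits, mean, and standard deviation as stated in Example (1.4).
p = acceptance_probability(2.994, 3.006, mean=3.0, std_dev=0.002)
print(p)
```

The tolerance limits sit three standard deviations on either side of the mean, so p is the familiar three-sigma probability.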
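Let {Xn; n = 1, 2, . . .} be a Bernoulli process with success probability p. Then for any n,
$$E[X_n] = E[X_n^2] = p, \tag{1.5}$$
$$\mathrm{Var}(X_n) = p - p^2 = pq,$$
$$E[\alpha^{X_n}] = q + p\alpha$$
for any α ≥ 0.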
2. Numbers of Successes
Let {Xn; n = 1, 2, . . .} be a Bernoulli process with probability of success p. We think of Xn as the number of successes at the nth trial. Define
$$N_n(\omega) = \begin{cases} 0 & \text{if } n = 0, \\ X_1(\omega) + \cdots + X_n(\omega) & \text{if } n = 1, 2, \ldots \end{cases}$$
for each ω ∈ Ω. Then Nn is the number of successes in the first n trials, and Nn+m – Nn is the number of successes in the trials numbered n + 1, n + 2, . . ., n + m. In this section we are interested in the stochastic process {Nn; n = 0, 1, . . .}, whose time parameter is discrete and whose state space is discrete and again is {0, 1, . . .}.
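Since Nn is built up one trial at a time, its distribution can be computed by the same recursion: each new trial either leaves the count unchanged (with probability q) or raises it by one (with probability p). The sketch below carries this out in exact arithmetic; the value p = 62/100 is borrowed from Example (1.3), and the function name is hypothetical.

```python
from fractions import Fraction

def successes_pmf(n, p):
    """Distribution of N_n = X_1 + ... + X_n as a list: pmf[k] = P{N_n = k}."""
    q = 1 - p
    pmf = [Fraction(1)]  # N_0 = 0 with probability 1
    for _ in range(n):
        nxt = [Fraction(0)] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * q       # the next trial is a failure
            nxt[k + 1] += mass * p   # the next trial is a success
        pmf = nxt
    return pmf

p = Fraction(62, 100)
pmf = successes_pmf(10, p)
mean = sum(k * mass for k, mass in enumerate(pmf))
print(mean)
```

The multiplication of probabilities at each step is where the independence of the trials enters.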
We start with simple descriptive quantities. Using the definition of Nn, Corollary (2.1.25), and the result (1.5) above, we obtain
$$E[N_n] = E[X_1] + \cdots + E[X_n] = np.$$