
INTRODUCTION TO STOCHASTIC PROCESSES

ERHAN ÇINLAR

Norman J. Sollenberger Professor in Engineering

Princeton University

DOVER PUBLICATIONS, INC.

Mineola, New York

Copyright © 1975 by Erhan Çinlar

All rights reserved.

Bibliographical Note

This Dover edition, first published in 2013, is an unabridged republication of the work originally published in 1975 by Prentice-Hall, Inc., Englewood Cliffs, New Jersey.

Library of Congress Cataloging-in-Publication Data Çinlar, E. (Erhan), 1941–

Introduction to stochastic processes / Erhan Çinlar. — Dover edition. pages cm.

Summary: “This clear presentation of the most fundamental models of random phenomena employs methods that recognize computer-related aspects of theory. Topics include probability spaces and random variables, expectations and independence, Bernoulli processes and sums of independent random variables, Poisson processes, Markov chains and processes, and renewal theory. Includes an introduction to basic stochastic processes. 1975 edition”— Provided by publisher.

Includes bibliographical references and index. eISBN-13: 978-0-486-27632-8

1. Stochastic processes. I. Title. A274.C56 2013

519.2′3—dc23

2012028204 Manufactured in the United States by Courier Corporation

49797601


Contents

Preface

Chapter 1 Probability Spaces and Random Variables
1. Probability Spaces
2. Random Variables and Stochastic Processes
3. Conditional Probability
4. Exercises

Chapter 2 Expectations and Independence
1. Expected Value
2. Conditional Expectations
3. Exercises

Chapter 3 Bernoulli Processes and Sums of Independent Random Variables
1. Bernoulli Process
2. Numbers of Successes
3. Times of Successes
4. Sums of Independent Random Variables
5. Exercises

Chapter 4 Poisson Processes
1. Arrival Counting Process
2. Times of Arrivals
3. Forward Recurrence Times
4. Superposition of Poisson Processes
5. Decomposition of Poisson Processes
6. Compound Poisson Processes
7. Non-stationary Poisson Processes
8. Exercises

Chapter 5 Markov Chains
1. Introduction
2. Visits to a Fixed State
3. Classification of States
4. Exercises

Chapter 6 Limiting Behavior and Applications of Markov Chains
1. Computation of R and F
2. Recurrent States and the Limiting Probabilities
3. Periodic States
4. Transient States
5. Applications to Queueing Theory: M/G/1 Queue
6. Queueing System G/M/1
7. Branching Processes
8. Exercises

Chapter 7 Potentials, Excessive Functions, and Optimal Stopping of Markov Chains
1. Potentials
2. Excessive Functions
3. Optimal Stopping
4. Games with Discounting and Fees
5. Exercises

Chapter 8 Markov Processes
1. Markov Processes
2. Sample Path Behavior
3. Structure of a Markov Process
4. Potentials and Generators
5. Limit Theorems
6. Birth and Death Processes
7. Exercises

Chapter 9 Renewal Theory
1. Renewal Processes
2. Regenerative Processes and Renewal Theory
3. Delayed and Stationary Processes
4. Exercises

Chapter 10 Markov Renewal Theory
1. Markov Renewal Processes
2. Markov Renewal Functions and Classification of States
3. Markov Renewal Equations
4. Limit Theorems
5. Semi-Markov Processes
6. Semi-Regenerative Processes
7. Applications to Queueing Theory
8. Exercises

Afterword

Appendix. Non-Negative Matrices
1. Eigenvalues and Eigenvectors
2. Spectral Representations
3. Positive Matrices
4. Non-Negative Matrices
5. Limits and Rates of Convergence

References

Answers to Selected Exercises

Index of Notations


Preface

A man wanted to dock the tail of his horse. He consulted a wise man on how short to make it. “Make it as short as it pleases you,” said the wise man, “for, no matter what you do, some will say it is too long, some too short, and your opinion itself will change from time to time.”

This book is an introduction to stochastic processes. It covers most of the basic processes of interest except for two glaring omissions. Topics covered are developed in some depth, some fairly recent results are included, and references are given for further reading. These features should make the book useful to scientists and engineers looking for models and results to use in their work. However, this is primarily a textbook; a large number of numerical examples are worked out in detail, and the exercises at the end of each chapter are there to illustrate the theory rather than to extend it.

When theorems are proved, this is done in some detail; even the ill-prepared student should be able to follow them. On the other hand, not all theorems are proved. If a result is of sufficient intrinsic interest, and if it can be explained and understood, then it is listed as a theorem even though it could not be proved with the elementary tools available in this book. This freedom made it possible to include two characterization theorems on Poisson processes, several limit theorems on the ratios of additive functionals of Markov processes, several results on the sample path behavior of (continuous parameter) Markov processes, and a large number of results dealing with stopping times and the strong Markov property.

The book assumes a background in calculus but no measure theory; thus, the treatment is elementary. At the same time, it is from a modern viewpoint. In the modern approach to stochastic processes, the primary object is the behavior of the sample paths. This is especially so for the applied scientist and the engineer, since it is the sample path which he observes and tries to control. Our approach capitalizes on this happy harmony between the methods of the better mathematician and the intuition of the honest engineer.

We have also followed the modern trends in preferring matrix algebra and recursive methods to transform methods. The early probability theorist’s desire to cast his problem into one concerning distribution functions and Fourier transforms is understandable in view of his background in classical analysis and the notion of an acceptable solution prevailing then. The present generation, however, influenced especially by the availability of computers, prefers a characterization of the solution coupled with a recursive method of obtaining it to an explicit closed form expression in terms of the generating functions of the Laplace transforms of the quantities of actual interest.

The first part of the book is based on a set of lecture notes which were used by myself and several colleagues in a variety of “applied” courses on stochastic processes over the last five years. At that stage I was helped by P. A. Jacobs and C. G. Gilbert in collecting problems and abstracting papers from the applied literature. Most of the final version was written during my sabbatical stay at Stanford University; I should like to thank the department of Operations Research for their hospitality then. While there I benefited from conversations with K. L. Chung and D. L. Iglehart; it is a pleasure to acknowledge my debt to them. I am especially indebted to A. F. Karr for his help throughout this project; he eliminated many inaccuracies and obscurities. I had the good fortune to have G. Lemmond to type the manuscript, and finally, the National Science Foundation to support my work.


CHAPTER 1

Probability Spaces and Random Variables

The theory of probability now constitutes a formidable body of knowledge of great mathematical interest and of great practical importance. It has applications in every area of natural science: in minimizing the unavoidable errors of observations, in detecting the presence of assignable causes in observed phenomena, and in discovering the basic laws obscured by chance variations. It is also an indispensable tool in engineering and business: in deducing the true lessons from statistics, in forecasting the future, and in deciding which course to pursue.

In this chapter the basic vocabulary of probability theory will be introduced. Most of these terms have cognates in ordinary language, and the reader would do well not to fall into a false sense of security because of his previous familiarity with them. Instead he should try to refine his own use of these terms and watch for the natural context within which each term appears.

1. Probability Spaces

The basic notion in probability theory is that of a random experiment: an experiment whose outcome cannot be determined in advance. The set of all possible outcomes of an experiment is called the sample space of that experiment.

An event is a subset of a sample space. An event A is said to occur if and only if the observed outcome ω of the experiment is an element of the set A.

(1.1) EXAMPLE. Consider an experiment that consists of counting the number of traffic accidents at a given intersection during a specified time interval. The sample space is the set {0, 1, 2, 3, . . .}. The statement “the number of accidents is less than or equal to seven” describes the event {0, 1, . . ., 7}. The event A = {5, 6, 7, . . .} occurs if and only if the number of accidents is 5 or 6 or 7 or . . . .

Given a sample space Ω and an event A, the complement Ac of A is defined to be the event which occurs if and only if A does not occur, that is,

$$A^c = \{\omega \in \Omega : \omega \notin A\}. \tag{1.2}$$

Given two events A and B, their union is the event which occurs if and only if either A or B (or both) occurs, that is,

$$A \cup B = \{\omega \in \Omega : \omega \in A \text{ or } \omega \in B\}. \tag{1.3}$$

The intersection of A and B is the event which occurs if and only if both A and B occur, that is,

$$A \cap B = \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\}. \tag{1.4}$$

The operations of taking unions, intersections, and complements may be combined to obtain new events. In particular, the following identities are of value:

$$(A \cup B)^c = A^c \cap B^c, \qquad (A \cap B)^c = A^c \cup B^c. \tag{1.5}$$

The set Ω is also called the certain event. The set containing no elements is called the empty event and is denoted by Ø. Note that Ø = Ωc and Ω = Øc. Two events are said to be disjoint if they have no elements in common, that is, A and B are disjoint if

$$A \cap B = \emptyset.$$

If two events are disjoint, the occurrence of one implies that the other has not occurred. A family of events is called disjoint if every pair of them are disjoint.

Event A is said to imply the event B, written A ⊂ B, if every ω in A belongs also to B. To show that two events A and B are the same, then, it is sufficient to show that A implies B and B implies A.
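The event operations described so far correspond directly to operations on Python's built-in `set` type. The following is a minimal sketch, assuming a truncated version of the sample space of Example (1.1); the cap of 20 accidents and the event B are illustrative assumptions, made only so that the sets are finite.

```python
# Truncated sample space for Example (1.1); the cap at 20 is an assumption
# made only so that every set is finite.
omega = set(range(21))

A = {n for n in omega if n >= 5}   # "five or more accidents"
B = {n for n in omega if n <= 7}   # "at most seven accidents" (illustrative)

def complement(event):
    """Complement of an event relative to the sample space omega."""
    return omega - event

union = A | B          # occurs if A or B (or both) occurs
intersection = A & B   # occurs if both A and B occur

# Complements turn unions into intersections and vice versa.
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)

# A and its complement are disjoint and together make up omega.
assert A & complement(A) == set()
assert A | complement(A) == omega

# "The event {5, 6} implies A" is the subset relation.
implies = {5, 6} <= A
print(sorted(intersection))
```

Showing that two events are the same by proving each implies the other is, in this representation, the statement that `E <= F and F <= E` holds exactly when `E == F`.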

If A1, A2, . . . are events, then their union

$$\bigcup_{i=1}^{\infty} A_i$$

is the event which occurs if and only if at least one of them occurs. Their intersection

$$\bigcap_{i=1}^{\infty} A_i$$

is the event which occurs if and only if all of them occur.

Next, corresponding to our intuitive notion of the chances of an event occurring, we introduce a function defined on a collection of events.

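(1.6) DEFINITION. Let Ω be a sample space and P a function which associates a number with each event. Then P is called a probability measure provided that
(a) for any event A, 0 ≤ P(A) ≤ 1;
(b) P(Ω) = 1;
(c) for any sequence A1, A2, . . . of disjoint events,

$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i).$$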

By axiom (b), the probability assigned to Ω is 1. Usually, there will be other events A ⊂ Ω such that P(A) = 1. If a statement holds for all ω in such a set A with P(A) = 1, then it is customary to say that the statement is true almost surely or that the statement holds for almost all ω ∈ Ω.

Axiom (c) above is a severe condition on the manner in which probabilities are assigned to events. Indeed, it is usually impossible to assign a probability P(A) to every subset A and still satisfy (c). Because of this, it is customary to define P(A) only for certain subsets A. Throughout this book we will avoid the issue by using the term “event” only for those subsets A of Ω for which P(A) is defined. If the sample space Ω is {0, 1, 2, . . .} as in Example (1.1), then there are as many subsets of Ω as there are points on the real line. Therefore, it might be difficult to assign a probability to each event in an explicit fashion. Furthermore, almost any meaningful real-life problem requires considering much more complex sample spaces. Usually, in such situations, the probabilities of only a few key events are specified and the remaining probabilities are left to be computed from the axioms (1.6a, b, c) by considering the various relationships which might exist between events. In fact, most of probability theory concerns itself with finding methods of doing just this. The following are the first few steps in this direction.

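(1.7) PROPOSITION. If A1, . . ., An are disjoint events, then

$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i=1}^{n} P(A_i).$$

Proof. First, we establish that P(Ø) = 0. In (1.6c) we may take A1 = A2 = · · · = Ø. Then also $\bigcup_{i=1}^{\infty} A_i = \emptyset$, and (1.6c) implies that P(Ø) = P(Ø) + P(Ø) + · · ·. But this is possible only if P(Ø) = 0, since 0 ≤ P(Ø) ≤ 1 by (1.6a).

Next, let A1, . . ., An be disjoint events, and define An+1 = An+2 = ··· = Ø. Then A1, A2, . . . are disjoint, and $\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{n} A_i$. By (1.6c) we have

$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{n} P(A_i),$$

since P(An+1) = P(An+2) = ··· = P(Ø) = 0.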

(1.8) PROPOSITION. If A ⊂ B, then P(A) ≤ P(B).

Proof. If A ⊂ B, we can write

$$B = A \cup (A^c \cap B).$$

Since A and Ac ∩ B are disjoint, by (1.7),

$$P(B) = P(A) + P(A^c \cap B).$$

By (1.6a), P(Ac ∩ B) ≥ 0; thus, P(B) ≥ P(A).

(1.9) PROPOSITION. For any event A, P(Ac) = 1 − P(A).

Proof. The events A and Ac are disjoint, and by (1.7), P(A ∪ Ac) = P(A) + P(Ac). On the other hand, A ∪ Ac = Ω, and P(Ω) = 1 by (1.6b).

Proposition (1.8) expresses our intuitive feeling that, if the occurrence of A implies that of B, then the probability of B should be at least as great as the probability of A. Proposition (1.9), restated in words, states that the probability that A does not occur is one minus the probability that A does occur.

Next is a theorem which is used quite often. It is useful in evaluating the probability of a “complex” event by breaking it into components whose probabilities may be simpler to evaluate.

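(1.10) THEOREM. If B1, B2, . . . are disjoint events with $\bigcup_{i} B_i = \Omega$, then for any event A,

$$P(A) = \sum_{i=1}^{\infty} P(A \cap B_i).$$

Proof. For any event A we can write

$$A = A \cap \Omega = \bigcup_{i=1}^{\infty} (A \cap B_i).$$

Since B1, B2, . . . are disjoint, A ∩ B1, A ∩ B2, . . . are disjoint. By (1.6c) then,

$$P(A) = \sum_{i=1}^{\infty} P(A \cap B_i).$$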
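On a finite sample space the axioms and the results (1.8) to (1.10) can be checked mechanically. The sketch below assumes a hypothetical experiment (one roll of a fair die) that does not appear in the text; exact rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction

# Hypothetical finite sample space: one roll of a fair die.
omega = {1, 2, 3, 4, 5, 6}
weight = {w: Fraction(1, 6) for w in omega}

def P(event):
    """Probability of an event, i.e. of a subset of omega."""
    return sum((weight[w] for w in event), Fraction(0))

A = {2, 4, 6}                           # "the roll is even"
partition = [{1, 2}, {3, 4}, {5, 6}]    # disjoint events B1, B2, B3 with union omega

# Proposition (1.8): if one event implies another, its probability is no larger.
assert P({2}) <= P(A)

# Proposition (1.9): the complement has probability one minus P(A).
assert P(omega - A) == 1 - P(A)

# Theorem (1.10): P(A) is the sum of the P(A ∩ B_i) over the partition.
total = sum((P(A & B) for B in partition), Fraction(0))
assert total == P(A)
print(P(A), total)
```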

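(1.11) PROPOSITION. Let A1, A2, . . . be a sequence of events such that A1 ⊂ A2 ⊂ A3 ⊂ ···, and put $A = \bigcup_{n=1}^{\infty} A_n$. Then

$$\lim_{n \to \infty} P(A_n) = P(A).$$

Proof. Let B1 = A1, $B_2 = A_2 \cap A_1^c$, $B_3 = A_3 \cap A_2^c$, . . . . Then B1, B2, . . . are disjoint, and

$$\bigcup_{i=1}^{n} B_i = A_n, \qquad \bigcup_{i=1}^{\infty} B_i = A.$$

Thus, by (1.7),

$$P(A_n) = \sum_{i=1}^{n} P(B_i).$$

But by (1.6c),

$$\lim_{n \to \infty} \sum_{i=1}^{n} P(B_i) = \sum_{i=1}^{\infty} P(B_i) = P(A).$$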

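(1.12) COROLLARY. Let A1, A2, . . . be a sequence of events such that A1 ⊃ A2 ⊃ A3 ⊃ . . ., and put $A = \bigcap_{n=1}^{\infty} A_n$. Then

$$\lim_{n \to \infty} P(A_n) = P(A).$$

Proof. The sequence of events $A_1^c$, $A_2^c$, . . . satisfies $A_1^c \subset A_2^c \subset \cdots$, and by (1.5),

$$\bigcup_{n=1}^{\infty} A_n^c = \Big(\bigcap_{n=1}^{\infty} A_n\Big)^c = A^c.$$

Thus, by Proposition (1.11),

$$\lim_{n \to \infty} P(A_n^c) = P(A^c).$$

By Proposition (1.9), P(A) = 1 − P(Ac), and $P(A_n) = 1 - P(A_n^c)$ for any n. Thus,

$$\lim_{n \to \infty} P(A_n) = 1 - \lim_{n \to \infty} P(A_n^c) = 1 - P(A^c) = P(A).$$

Proposition (1.11) gives a continuity property of P: if the events A1, A2, . . . “increase to” A, then their probabilities P(A1), P(A2), . . . increase to P(A). Corollary (1.12) is similar; if the events A1, A2, . . . “decrease to” A, then P(A1), P(A2), . . . decrease to P(A).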

2. Random Variables and Stochastic Processes

Suppose we are given a sample space Ω and a probability measure P. Most often, especially in applied problems, we are interested in functions of the outcomes rather than the outcomes themselves.

(2.1) DEFINITION. A random variable X with values in the set E is a function which assigns a value X(ω) in E to each outcome ω in Ω.

The most usual examples of E are the set of non-negative integers ℕ = {0, 1, 2, . . .}, the set of all integers {. . ., –1, 0, 1, . . .}, the set of all real numbers ℝ = (–∞, ∞), and the set of all non-negative real numbers ℝ₊ = [0, ∞). In the first two cases and, more generally, when E is finite or countably infinite, X is said to be a discrete random variable.

(2.2) EXAMPLE. Consider the experiment of flipping a coin once. The two possible outcomes are “Heads” and “Tails,” that is, Ω = {H, T}. Suppose X is defined by putting X(H) = 1, X(T) = –1. Then X is a random variable taking values in the set E = {1, −1}. We may think of it as the earning of the player who receives or loses a dollar according as the outcome is heads or tails.

(2.3) EXAMPLE. Let an experiment consist of measuring the lifetimes of twelve electric bulbs. The sample space Ω is the set of all 12-tuples ω = (ω1, . . ., ω12) where ωi ≥ 0 for all i. Then

$$X(\omega) = \frac{1}{12}(\omega_1 + \omega_2 + \cdots + \omega_{12})$$

defines a random variable on this sample space Ω. It represents the average lifetime of the 12 bulbs.

(2.4) EXAMPLE. Suppose an experiment consists of observing the acceleration of a vehicle during the first 60 seconds of a race. Then each possible outcome is a real-valued right continuous function ω defined for 0 ≤ t ≤ 60, and the sample space Ω is the set of all such functions ω. For t ∈ [0, 60], let

$$X_t(\omega) = \omega(t), \qquad Y_t(\omega) = \int_0^t \omega(s)\,ds, \qquad Z_t(\omega) = \int_0^t Y_s(\omega)\,ds$$

for each ω ∈ Ω. Then Xt, Yt, and Zt are random variables on Ω. For the outcome ω, Xt(ω) is the acceleration at time t, Yt(ω) the velocity, and Zt(ω) the position.

Let X be a random variable taking values in a set E, and let f be a real-valued function defined on the set E. Then for each ω ∈ Ω, X(ω) is a point in E and f assigns the value f(X(ω)) to that point. By f(X) we mean the random variable whose value at ω ∈ Ω is f(X(ω)). A particular function of some use is the indicator function IB of a subset B of E; IB(x) is 1 or 0 according as x ∈ B or x ∉ B. Then IB(X) is a random variable which is equal to 1 if the event {X ∈ B} occurs and is equal to 0 otherwise. Quite often there will be a number of random variables X1, . . ., Xn and we will be concerned with functions of them. If X1, . . ., Xn take values in E, and if f is a real-valued function defined on E × · · · × E = E^n, then f(X1, . . ., Xn) is a real-valued random variable whose value at ω ∈ Ω is f(X1(ω), . . ., Xn(ω)).
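Because a random variable is simply a function on the sample space, the constructions f(X) and IB(X) are ordinary function compositions. A minimal sketch for the coin flip of Example (2.2) follows; the helper names `compose` and `indicator` are illustrative and not from the text.

```python
# Sample space of Example (2.2) and the random variable X defined there.
omega = ["H", "T"]

def X(w):
    """Earnings of the player: +1 on heads, -1 on tails."""
    return 1 if w == "H" else -1

def compose(f, rv):
    """Return the random variable f(X), whose value at w is f(X(w))."""
    return lambda w: f(rv(w))

def indicator(B):
    """Indicator function I_B of a subset B of E."""
    return lambda x: 1 if x in B else 0

X_squared = compose(lambda x: x * x, X)
I_win = compose(indicator({1}), X)   # equals 1 exactly when the event {X in {1}} occurs

values = [(w, X(w), X_squared(w), I_win(w)) for w in omega]
print(values)
```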

A stochastic process with state space E is a collection {Xt; t ∈ T} of random variables Xt defined on the same probability space and taking values in E. The set T is called its parameter set. If T is countable, especially if T = ℕ, the process is said to be a discrete parameter process. Otherwise, if T is not countable, the process is said to have a continuous parameter. In the latter case the usual examples are T = ℝ₊ and T = ℝ. It is customary to think of the index t as representing time, and then one thinks of Xt as the “state” or the “position” of the process at time t.

(2.5) EXAMPLE. In Example (2.4) Yt is the velocity of the vehicle at time t, and the collection {Yt; 0 ≤ t ≤ 60} is a continuous time-parameter stochastic process with state space ℝ. Similarly for {Zt; 0 ≤ t ≤ 60}.

(2.6) EXAMPLE. Consider the process of arrivals of customers at a store, and suppose the experiment is set up to measure the interarrival times. Then the sample space Ω is the set of all sequences ω = (ω1, ω2, . . .) of non-negative real numbers ωi. For each ω ∈ Ω and t ≥ 0, we put Nt(ω) = k if and only if the integer k is such that $\omega_1 + \cdots + \omega_k \le t < \omega_1 + \cdots + \omega_{k+1}$ (Nt(ω) = 0 if t < ω1). Then for the outcome ω, Nt(ω) is the number of arrivals in the time interval (0, t]. For each t ≥ 0, Nt is a random variable taking values in the set E = {0, 1, . . .}. Thus, {Nt; t ≥ 0} is a continuous time-parameter stochastic process with state space E = {0, 1, . . . }. Considered as a function in t, for a fixed ω, the function Nt(ω) is non-decreasing, right continuous, and increases by jumps only; see Figure 1.2.1.

Figure 1.2.1 A possible realization of an arrival process. The picture is for the outcome ω = (1.2, 3.0, 1.7, 0.5, 2.6, . . .).
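The definition of Nt(ω) in Example (2.6) can be evaluated directly from the interarrival times. The sketch below uses the first five terms of the outcome quoted in the caption of Figure 1.2.1; the function name `arrival_count` is illustrative, and the result is only meaningful for t smaller than the sum of the interarrival times supplied, since later terms of ω are not known.

```python
from bisect import bisect_right
from itertools import accumulate

def arrival_count(interarrivals, t):
    """N_t(omega): number of arrivals in (0, t] for the given interarrival times."""
    # Arrival times are the partial sums omega_1, omega_1 + omega_2, ...
    arrival_times = list(accumulate(interarrivals))
    # Count the arrival times that are <= t; including equality is what
    # makes t -> N_t(omega) right continuous.
    return bisect_right(arrival_times, t)

# First five interarrival times of the outcome shown in Figure 1.2.1.
omega = [1.2, 3.0, 1.7, 0.5, 2.6]

for t in (1.0, 2.0, 5.0, 6.0, 7.0):
    print(t, arrival_count(omega, t))
```

For a fixed ω the printed counts never decrease as t grows, and they change only at arrival times, in line with the sample path properties stated in the example.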

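Let X be a random variable taking values in ℝ. If b ∈ ℝ, the set of all outcomes ω for which X(ω) ≤ b is an event, namely, the event {ω: X(ω) ≤ b}. We will write {X ≤ b} for short, instead of {ω: X(ω) ≤ b}, and we will write P{X ≤ b} for P({X ≤ b}). If a ≤ b, then

$$\{X \le a\} \subset \{X \le b\}, \tag{2.7}$$

and Proposition (1.8) implies that

$$P\{X \le a\} \le P\{X \le b\}. \tag{2.8}$$

Noting that

$$\{X \le b\} = \{X \le a\} \cup \{a < X \le b\}$$

and that the events {X ≤ a} and {a < X ≤ b} are disjoint, we get, by (1.7),

$$P\{a < X \le b\} = P\{X \le b\} - P\{X \le a\}.$$

Next, note that $\bigcup_{n} \{X \le b_n\} = \Omega$ for any sequence b1, b2, . . . increasing to +∞. Since {X ≤ b1} ⊂ {X ≤ b2} ⊂ · · · by (2.7), Proposition (1.11) applies, and we have

$$\lim_{n \to \infty} P\{X \le b_n\} = 1. \tag{2.10}$$

Let b1, b2, . . . be a decreasing sequence with limn bn = b. Then {X ≤ b1} ⊃ {X ≤ b2} ⊃ · · · by (2.7), and $\bigcap_{n} \{X \le b_n\} = \{X \le b\}$. By Corollary (1.12), therefore,

$$\lim_{n \to \infty} P\{X \le b_n\} = P\{X \le b\}. \tag{2.11}$$

In particular, if the bn decrease to –∞, the limit in (2.11) becomes zero. The function φ defined by

$$\varphi(b) = P\{X \le b\}, \qquad -\infty < b < \infty,$$

is called the distribution function of the random variable X. If φ is a distribution function, then (also see Figure 1.2.2)

(2.13)
(a) φ is non-decreasing by (2.8),
(b) φ is right continuous by (2.11),
(c) lim b→∞ φ(b) = 1 by (2.10),
(d) lim b→–∞ φ(b) = 0 by the remark following (2.11).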

Figure 1.2.2 A distribution function is non-decreasing and right continuous and lies between 0 and 1.

In the opposite direction, if φ is any function defined on the real line such that (2.13a)–(2.13d) hold, then by taking

$$\Omega = (-\infty, \infty), \qquad P((-\infty, b]) = \varphi(b) \quad \text{for all } b,$$

and letting

$$X(\omega) = \omega, \qquad \omega \in \Omega,$$

we see that X is a random variable with the distribution function φ. Hence, for any such function φ, there exists a random variable X which has φ as its distribution function.

Next, let X be a discrete random variable taking values in the (countable) set E. Then for any a ∈ E,

$$\pi(a) = P\{X = a\}$$

is a non-negative number, and we must have

$$\sum_{a \in E} \pi(a) = 1.$$

The collection {π(a): a ∈ E} is called the probability distribution of X.

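In the case of non-discrete X, it is sometimes possible to differentiate the distribution function. Then the derivative of the distribution function of X is called the probability density function of X.

(2.16) EXAMPLE. Consider Example (2.2). If the probability of “Heads” is 0.4, then

$$P(\{H\}) = 0.4, \qquad P(\{T\}) = 0.6.$$

The random variable X defined there takes only two values: −1 and +1, and

$$P\{X = -1\} = 0.6, \qquad P\{X = 1\} = 0.4.$$

Then

$$\varphi(b) = P\{X \le b\} = \begin{cases} 0, & b < -1, \\ 0.6, & -1 \le b < 1, \\ 1, & b \ge 1. \end{cases}$$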

(2.17) EXAMPLE. In simulation studies using computers, the following setup is utilized in “generating random variables from a given distribution φ.”

A table of “random numbers” is a collection of numbers ω lying in the interval [0, 1] such that a number picked “at random” is in the interval [a, b] with probability b − a. In our terminology, what this means is that we have a sample space Ω = [0, 1] and a probability measure P on Ω defined so that P([a, b]) = b − a for all 0 ≤ a ≤ b ≤ 1. Then the event “the picked number ω is less than or equal to b,” that is, the event [0, b], has probability b.

Suppose the given distribution function φ is continuous and strictly increasing. Then for any ω ∈ Ω = [0, 1], there is one and only one a ∈ ℝ satisfying φ(a) = ω. Therefore, setting

$$X(\omega) = a \quad \text{if and only if} \quad \varphi(a) = \omega$$

defines a function X from Ω into ℝ (see Figure 1.2.3).

Figure 1.2.3 Defining a random variable X with a given distribution function φ.

Since φ is increasing, X(ω) ≤ b if and only if ω ≤ φ(b); hence

$$P\{X \le b\} = P([0, \varphi(b)]) = \varphi(b).$$

In other words, the random variable X we have defined has the given function φ as its distribution. Hence, by picking a number ω at random from a table of “random numbers” and then computing X(ω) corresponding to ω from Figure 1.2.3, one obtains a possible value of a random variable X having the distribution function φ.
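The table lookup of Example (2.17) is what is now usually called inverse transform sampling. The following sketch replaces the table of random numbers with Python's pseudo-random generator and, as an assumption for illustration only, takes φ to be an exponential distribution function with rate 0.5, for which the equation φ(a) = ω can be solved in closed form. Neither the rate nor the names `phi`, `X`, and `RATE` come from the text.

```python
import math
import random

RATE = 0.5  # hypothetical parameter of the illustrative distribution

def phi(b):
    """Illustrative distribution function: exponential with rate RATE."""
    return 1.0 - math.exp(-RATE * b) if b >= 0 else 0.0

def X(w):
    """Solve phi(a) = w for a; this is the construction X(w) = a of the example."""
    return -math.log(1.0 - w) / RATE

rng = random.Random(0)                     # fixed seed so the run is repeatable
sample = [X(rng.random()) for _ in range(100_000)]   # rng.random() lies in [0, 1)

b = 2.0
empirical = sum(x <= b for x in sample) / len(sample)
print(empirical, phi(b))   # the two numbers should be close
```

The fraction of generated values not exceeding b approximates P{X ≤ b}, so comparing it with φ(b) for several values of b is a quick sanity check that the construction reproduces the intended distribution.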

(2.18) EXAMPLE. A mid-course trajectory correction calls for a velocity increase of 135 ft/sec. The spacecraft’s engine provides a thrust which causes a constant acceleration of 15 ft/sec². Based on this, it is decided to fire the engine for 9 seconds. But the engine performance indicates that the actual length of burn time will be a random variable T with

What is the increase in velocity due to this burn?

Let Ω be the set of all possible burn times, i.e., Ω = [0, ∞), and define T(ω) = ω. Then, for the outcome ω, the velocity increase will be

Thus,

Suppose we have, defined on the sample space Ω, a number of random variables X1, . . ., Xn taking values in the countable set E. Then the probabilities of any events associated with X1, . . ., Xn can be computed (by using the results of Section 1) once their joint distribution is specified by giving

$$P\{X_1 = a_1, \ldots, X_n = a_n\}$$

for all n-tuples (a1, . . ., an) with ai ∈ E. In the case of random variables X1, . . ., Xn taking values in ℝ, the joint distribution is specified by giving

$$P\{X_1 \le b_1, \ldots, X_n \le b_n\}$$

for all numbers b1, . . ., bn ∈ ℝ. Specification of these probabilities themselves can be difficult at times. The concept we introduce next simplifies such tasks (when used properly).

(2.19) DEFINITION. The discrete random variables X1, . . ., Xn are said to be independent if

$$P\{X_1 = a_1, \ldots, X_n = a_n\} = P\{X_1 = a_1\} \cdots P\{X_n = a_n\}$$

for all a1, . . ., an ∈ E. If the Xi take values in ℝ, they are said to be independent if

$$P\{X_1 \le b_1, \ldots, X_n \le b_n\} = P\{X_1 \le b_1\} \cdots P\{X_n \le b_n\} \tag{2.21}$$

for all b1, . . ., bn ∈ ℝ. An infinite collection {Xt; t ∈ T} of random variables is called independent if any finite number of them are independent.

In particular, (2.21) implies, and is implied by, the condition that

$$P\{X_1 \in B_1, \ldots, X_n \in B_n\} = P\{X_1 \in B_1\} \cdots P\{X_n \in B_n\}$$

for all intervals B1, . . ., Bn ⊂ ℝ.

We close this section by illustrating the concept through examples.

(2.22) EXAMPLE. Let X and Y be two discrete random variables taking values in {1, 2, 3, . . . }. Suppose

For any m = 1, 2, . . ., using Theorem (1.10) with A = {X = m} and Bi = {Y = i}, we get†

Similarly, for any n ∈ {1, 2, . . . }, using (1.10) with A = {Y = n} and Bi = {X = i} now, we get

Since


(2.23) EXAMPLE. Let X and Y be two discrete random variables taking values in

and having the joint distribution

Then, using Theorem (1.10) with A = {X = m} and Bi = {Y = i}, for any , we have (see the footnote below)

Similarly, using Theorem (1.10) again, for any ,

Since P{X = m, Y = n} ≠ P{X = m}P{Y = n}, X and Y are not independent.

(2.25) EXAMPLE. Let X and Y be as in the preceding example. Then X and Y – X are independent. To show this, we note that

for all m, and that


for all m, , as claimed.

3. Conditional Probability

Let Ω be a sample space and P a probability measure on it.

(3.1) DEFINITION. Let A and B be two events. The conditional probability of A given B, written P(A | B), is a number satisfying
(a) 0 ≤ P(A | B) ≤ 1,
(b) P(A ∩ B) = P(A | B)P(B).

If P(B) > 0, then P(A | B) is uniquely defined by (3.1b). Otherwise, if P(B) = 0, P(A | B) can be taken to be any number in [0, 1].

For fixed B with P(B) > 0, considered as a function of A, P(A | B) satisfies the conditions (1.6) for a probability measure. That is,

(3.2)
(a) 0 ≤ P(A | B) ≤ 1,
(b) P(Ω | B) = 1,
(c) $P\big(\bigcup_{i} A_i \mid B\big) = \sum_{i} P(A_i \mid B)$ provided the events A1, A2, . . . be disjoint;

and thus Propositions (1.8)–(1.12) hold.

Heuristically, we think as follows. Suppose the outcome ω of the experiment is known to be in B, that is, B has occurred. Then event A can occur if and only if ω ∈ A ∩ B. And our estimate of the chances of A occurring given that B has occurred becomes the relative measure of A ∩ B with respect to B.

However, in practice, we usually have the various basic conditional probabilities specified, and our task then becomes the computation of other probabilities and conditional probabilities. The following proposition provides a simple tool for computing the probability of an event by conditioning on other events. It is sometimes referred to as the theorem of total probability.

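(3.3) THEOREM. If B1, B2, . . . are disjoint events with $\bigcup_{i} B_i = \Omega$, then for any event A,

$$P(A) = \sum_{i} P(A \mid B_i) P(B_i).$$

Proof. By Theorem (1.10), $P(A) = \sum_{i} P(A \cap B_i)$, and by (3.1b), P(A ∩ Bi) = P(A | Bi)P(Bi) for each i.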

A simple consequence of this theorem is known as Bayes’ formula.

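(3.4) COROLLARY. If B1, B2, . . . are disjoint events with union Ω, then for any event A with P(A) > 0, and any j,

$$P(B_j \mid A) = \frac{P(A \mid B_j) P(B_j)}{\sum_{i} P(A \mid B_i) P(B_i)}.$$

Proof. Using (3.1b),

$$P(B_j \mid A) = \frac{P(A \cap B_j)}{P(A)}.$$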

Writing P(A) as the sum in Theorem (3.3) and using (3.1b) to write P(A ∩ Bj) = P(A | Bj)P(Bj) completes the proof.
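The theorem of total probability and Bayes' formula translate into a few lines of code once the probabilities P(Bi) and the conditional probabilities P(A | Bi) are given. The sketch below is illustrative only: the three-machine scenario and all of its numbers are hypothetical and do not come from the text, and the function name `bayes` is likewise an assumption.

```python
from fractions import Fraction

def bayes(prior, likelihood):
    """Return P(B_j | A) for every j.

    prior[j]      = P(B_j), for disjoint events B_j whose union is the sample space
    likelihood[j] = P(A | B_j)
    """
    # Total probability: P(A) = sum over i of P(A | B_i) P(B_i)
    p_a = sum(likelihood[j] * prior[j] for j in prior)
    # Bayes' formula: P(B_j | A) = P(A | B_j) P(B_j) / P(A)
    return {j: likelihood[j] * prior[j] / p_a for j in prior}

# Hypothetical numbers: three machines produce 50%, 30%, and 20% of the output,
# with defect rates 1%, 2%, and 3%. A = "a randomly chosen item is defective".
prior = {"B1": Fraction(50, 100), "B2": Fraction(30, 100), "B3": Fraction(20, 100)}
likelihood = {"B1": Fraction(1, 100), "B2": Fraction(2, 100), "B3": Fraction(3, 100)}

posterior = bayes(prior, likelihood)
print(posterior)
```

Using exact fractions makes it easy to confirm that the conditional probabilities P(Bj | A) add up to one, as they must by (3.2).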

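(3.5) EXAMPLE. A coin is flipped until heads occur twice. Let X and Y denote, respectively, the trial numbers at which the first and the second heads are observed. If p is the probability of heads occurring at any one trial, then

$$P\{X = m, Y = n\} = p^2 q^{n-2}, \qquad 1 \le m < n,$$

where q = 1 – p, 0 < p < 1. By Theorem (1.10), for any m = 1, 2, . . .,

$$P\{X = m\} = \sum_{n=m+1}^{\infty} p^2 q^{n-2} = p q^{m-1},$$

and similarly, for any n = 2, 3, . . .,

$$P\{Y = n\} = \sum_{m=1}^{n-1} p^2 q^{n-2} = (n-1) p^2 q^{n-2}.$$

Thus, using Definition (3.1b),

$$P\{X = m \mid Y = n\} = \frac{P\{X = m, Y = n\}}{P\{Y = n\}} = \frac{1}{n-1}$$

for any n = 2, 3, . . . and m = 1, 2, . . ., n – 1. That is, if it is known that the second heads occurred at the nth trial, the first heads must have occurred during the first n – 1 trials, and all these n – 1 possibilities are equally likely.

We have, for any m = 1, 2, . . ., n – 1 and n = 2, 3, . . .,

$$P\{Y = n \mid X = m\} = \frac{p^2 q^{n-2}}{p q^{m-1}} = p q^{n-m-1},$$

but a more instructive computation is the following. For m = 1, 2, . . . and k = 1, 2, . . .,

$$P\{Y - X = k \mid X = m\} = P\{Y = m + k \mid X = m\} = p q^{k-1}. \tag{3.7}$$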

There are two conclusions to be drawn from (3.7): Y – X is independent of X, and Y – X has the same distribution as X.
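The two conclusions of Example (3.5) can be checked with exact arithmetic. The sketch below assumes the joint distribution P{X = m, Y = n} = p² q^(n−2) for 1 ≤ m < n, which follows from the description of the experiment (m − 1 tails, heads, n − m − 1 tails, heads), and the marginal P{X = m} = p q^(m−1); the value p = 1/3 and the function names are illustrative.

```python
from fractions import Fraction

p = Fraction(1, 3)   # hypothetical probability of heads
q = 1 - p

def joint(m, n):
    """P{X = m, Y = n}: first heads at trial m, second heads at trial n."""
    return p * p * q ** (n - 2) if 1 <= m < n else Fraction(0)

def marginal_Y(n):
    """P{Y = n}, obtained by summing the joint distribution over m."""
    return sum((joint(m, n) for m in range(1, n)), Fraction(0))

def marginal_X(m):
    """P{X = m} = p q^(m-1): m - 1 tails followed by heads."""
    return p * q ** (m - 1)

# Given Y = n, the position of the first heads is uniform on 1, ..., n - 1.
n = 6
cond_X_given_Y = [joint(m, n) / marginal_Y(n) for m in range(1, n)]

# P{Y - X = k | X = m} does not depend on m and equals P{X = k}.
k = 3
cond_gap = [joint(m, m + k) / marginal_X(m) for m in range(1, 8)]
print(cond_X_given_Y, cond_gap)
```

Every entry of `cond_gap` is the same number, p q^(k−1), which is both the independence of Y − X from X and the statement that Y − X has the distribution of X.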

(3.8) EXAMPLE. Suppose the length of a telephone conversation between two ladies is a random variable X with distribution function

$$\varphi(t) = P\{X \le t\} = \begin{cases} 1 - e^{-0.03 t}, & t \ge 0, \\ 0, & t < 0, \end{cases}$$

where the time is measured in minutes. Given that the conversation has been going on for 30 minutes, let us compute the probability that it continues for at least another 20 minutes. That is, we want to compute P{X > 50 | X > 30}. Since the event {X > 50, X > 30} is the same as the event {X > 50}, we have, by (3.1b),

$$P\{X > 50 \mid X > 30\} = \frac{P\{X > 50\}}{P\{X > 30\}} = \frac{e^{-1.5}}{e^{-0.9}} = e^{-0.6}.$$

Noting that e^{−0.6} = P{X > 20}, we have this interesting result: the probability that the conversation goes on another 20 minutes is independent of the fact that it has already lasted 30 minutes. Indeed, for any t, s ≥ 0,

$$P\{X > t + s \mid X > t\} = \frac{e^{-0.03 (t+s)}}{e^{-0.03 t}} = e^{-0.03 s} = P\{X > s\}.$$

That is, the probability that the conversation continues another s units of time is independent of how long it has been going on. Or, at every instant t, the ladies’ conversation starts afresh!
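The lack of memory in Example (3.8) is easy to confirm numerically. The sketch below assumes an exponential distribution for the length of the conversation, with the rate 0.03 per minute inferred from the stated value P{X > 20} = e^(−0.6); the function names are illustrative.

```python
import math

RATE = 0.03   # per minute; chosen so that P{X > 20} = exp(-0.6), as in the example

def survival(t):
    """P{X > t} for the exponential conversation length, t >= 0."""
    return math.exp(-RATE * t)

def conditional_survival(t, s):
    """P{X > t + s | X > t} = P{X > t + s} / P{X > t}, by (3.1b)."""
    return survival(t + s) / survival(t)

print(conditional_survival(30, 20), survival(20))

# The conditional probability is the same however long the call has lasted.
for t in (0, 5, 30, 120):
    assert math.isclose(conditional_survival(t, 20), survival(20))
```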

4. Exercises

(4.1) An experiment consists of drawing three flash bulbs from a lot and classifying each as defective (D) or non-defective (N). A drawing, then, can be described by a triplet; for example, (N, N, D) represents the outcome where the first and the second bulbs were found non-defective and the third defective. Let A denote the event “the first bulb drawn was defective,” B the event “the second bulb drawn was defective,” and C the event “the third bulb drawn was defective.”

(a) Describe the sample space by listing all possible outcomes.

(b) List all outcomes in A, B, B ∪ C, A ∪ C, A ∪ B ∪ C, A ∩ B, Ac ∩ Bc ∩ C, A ∩ Bc ∩ C, (A ∪ Bc) ∩ C, (A ∪ C) ∩ (Bc ∪ C).

(4.2) Describe in detail the sample spaces for the following experiments:

(a) Three tosses of a coin.

(b) An infinite number of coin tosses.

(c) Measurement of the speeds of cars passing a given point.

(d) Scores of a class of 20 on an examination.

(e) Measurement of noontime temperatures at a certain locality.

(f) Observation of arrivals at a store.

(4.3) An experiment consists of firing a projectile at a target and observing the position of the point of impact. (Suppose the origin of the coordinate system is placed at the target.) Then, an outcome is a pair ω = (ω1, ω2), where ω1 is the abscissa and ω2 the ordinate of the point of impact. The sample space Ω consists of all such pairs ω. For each ω ∈ Ω, let


(a) What do X, Y, Z stand for?

(b) Suppose the probability measure P is such that

Then show that X and Y are independent random variables with the same distribution function

(4.4) Let X, Y be as defined in (2.22), and put Z = X + Y. Compute

(a) P{X = m, Z = k} for m = 5, k = 7 first, and in general after that;

(b) P{X = 5 | Z = 7};

(c) P{Z = 7 | X = 5};

(d) P{Y = 3 | Z = 14};

(e) P{X = 5, Y = 3 | Z = 8};

(f) P{Z = 8 | X = 6, Y = 2}.

(4.5) Let X be a discrete random variable with

where p > 0, p + q = 1. Show that for any m, n ∈ {1, 2, . . .},

(4.6) Let X and Y denote, respectively, the number of babies born on a certain day in a hospital and the number of them which are boys. Suppose their joint distribution is

Find P{X = n} and P{Y = m} for all m, n ∈ {0, 1, . . .}. Compute P{X = n | Y = m}, P{Y = m | X = n}, P{X – Y = k | X = n}, P{X – Y = k | Y = m}. Interpret the results.


What is the probability that

has real roots?

(4.8) Reliability is the probability of a device performing its purpose adequately for the period of time intended under the operating conditions encountered.

A piece of equipment consists of three components in series: for the equipment to function, all three components must be functioning. Let X1, X2, and X3 be the respective lifetimes of the components 1, 2, and 3 measured in hours. Suppose

If the lifetimes of the components are independent, what is the reliability of the equipment in a mission requiring 4,000 hours?

(4.9) Let X be the lifetime of a component measured in hours, and let its distribution function be

Given that the item has worked successfully for the first 3 hours, what is the probability that it does not fail within the next 1.5 hours?

In general, compute the conditional probability that X > t + s given that X > t.

(4.10) A printing machine capable of printing any of n characters a1, . . ., an is operated by electrical impulses, each character, in theory, being produced by a different impulse. Suppose the machine has probability p(0 < p < 1) of producing the character corresponding to the impulse received, independent of past behavior. If it prints the wrong character, the probabilities that any of the (n – 1) other characters will appear are equal.

(a) Suppose that one of the n impulses is chosen at random and fed into the machine twice, and that the character ai is printed both times. What is the probability that the impulse chosen is the one designed to produce ai?

(b) Suppose now that ai was printed on the first trial and aj(j ≠ i) on the second. What is the probability that the impulse was designed to produce ai?

(c) In (b), what is the probability that the impulse was designed to produce aj?

(d) In (b), suppose that ai is printed on the first trial and that it is known only that some other character (not ai) appeared on the second. Does this change the answer to (b)?

Then said they unto him, Say now Shibboleth: and he said Sibboleth: for he could not frame to pronounce it right. Then they took him, and slew him.

THE BIBLE, THE BOOK OF JUDGES 12:6

Failure to distinguish between an outcome and an event is unlikely to have as severe a consequence as the fate which befell the warrior challenged by the men of ancient Gilead. Yet, probabilistic reasoning is a part of our culture today, and an appreciation of it cannot be acquired without developing a precise feeling for its essential ingredients.

It is important to remember always that an event is a collection of outcomes, while a random variable is a function. A random variable assigns a value to each outcome; a probability measure assigns a value to each event. One talks of the probability of an event, never of the probability of an outcome. A certain amount of confusion is caused by the historical mistakes made while the subject was developing, and the insistence of certain teachers on repeating them.

The present axiomatic foundations of the theory were laid by KOLMOGOROV [1] in 1933. Since then, the progress in probability theory has been very rapid. This progress was especially aided by the discovery of unsuspected applications to pure mathematics on the one hand, and by an ever increasing demand from other scientists and engineers on the other.

† As a reminder, 1 + x + x² + · · · = 1/(1 – x) for x ∈ [0, 1). Also, in Example (2.23), 1 + x + x²/2! + x³/3! + · · · = e^x for any x ∈ ℝ.

CHAPTER 2

Expectations and Independence

Consider a large piece of land whose area we take to be one unit. The land is divided into n lots which, for purposes of buying and selling, are indivisible. Let Ω denote the whole land and P(A) the area of the region A. Let X(ω) denote the price per unit area of the lot containing the point ω. Then X takes only n values, say b1, b2, . . ., bn. The region on which X is equal to bk has area P{X = bk} and hence, the value of the total land is

$$E[X] = \sum_{k=1}^{n} b_k \, P\{X = b_k\}.$$

Note that E[X] is, in a sense, the integral of the function X over the set Ω, and since the total area of Ω is P(Ω) = 1, E[X] is also the average unit price.
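The land computation can be sketched in a few lines of Python. The prices and areas below are hypothetical numbers chosen only for illustration, and `expected_value` is a helper name introduced here, not something from the text.

```python
from fractions import Fraction

def expected_value(pmf):
    """Sum of value * probability over a dict mapping value -> P{X = value}."""
    return sum(value * prob for value, prob in pmf.items())

# Hypothetical data: unit price b_k -> fraction of the total area priced at b_k.
land = {
    Fraction(10): Fraction(1, 2),
    Fraction(20): Fraction(3, 10),
    Fraction(50): Fraction(1, 5),
}

# The areas of the lots must add up to the total area P(Omega) = 1.
assert sum(land.values()) == 1

total_value = expected_value(land)  # 10*(1/2) + 20*(3/10) + 50*(1/5)
print(total_value)  # 21
```

Because the total area is one unit, the same number is both the value of the whole land and the average price per unit area.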

If we think of Ω as a sample space and P as a probability measure (as we may), then X becomes a random variable. The integral of X that we obtain is then called the expected value of X. The justification for the term “expected value” lies in our interpretation of E[X] as the average of X over Ω. This concept of integrating a random variable X over a sample space Ω with respect to a probability measure P extends also to arbitrary probability spaces. The present chapter is devoted to this and related concepts.

In Section 1 we give an account of expectation taking. Then in Section 2 we introduce conditional expectations and list many of their properties. The reader should study Section 1, and read Section 2 once or twice. He is urged not to dwell too long on Section 2 but to pass on to Chapters 3, 4, and 5 instead. In these later chapters there will be many opportunities to observe the workings of conditional expectations; by referring back to the cited theorems of this chapter, the reader will learn and appreciate them. Theorems on expectations and conditional expectations form the grammar of the language of probability, and are indispensable to anyone who wants to get acquainted with that language. But one does not start learning a language by memorizing the rules of grammar.

1. Expected Value

Let Ω be a sample space, P a probability measure, and X a discrete random variable defined on Ω. Let the values X takes be b0, b1, . . . and put Bn = {ω: X(ω) = bn}. Then B0, B1, . . . are disjoint and their union is Ω. The function X is equal to bn on the set Bn whose measure is P(Bn). So the integral of the function X with respect to the measure P is

$$\sum_{n=0}^{\infty} b_n \, P(B_n) \tag{1.1}$$

(we allow it to be + ∞). (See Figure 2.1.1.) Note that the right-hand side divided by 1 = P(Ω) can also be looked upon as the weighted average of the function X with respect to the weight distribution given by P. Replacing P(Bn) by P{X = bn} in (1.1), we now make the following

Figure 2.1.1 Expected value of a discrete random variable is the sum of its values weighted by the corresponding probabilities.

(1.2) DEFINITION. The expected value of a discrete random variable X taking values in the set E ⊂ [0, ∞) is

$$E[X] = \sum_{a \in E} a \, P\{X = a\}.$$

The preceding defines the expected value of X when it is a discrete non-negative random variable. We extend this first to arbitrary non-negative random variables and then to arbitrary random variables.

Suppose X is a non-negative real-valued random variable. Then it is possible to find discrete random variables X1, X2, . . . such that

$$0 \le X_1(\omega) \le X_2(\omega) \le \cdots, \tag{1.3}$$

$$\lim_{n \to \infty} X_n(\omega) = X(\omega) \tag{1.4}$$

for all ω. Since each Xn is discrete, its expected value E[Xn] is well defined by (1.2). By our interpretation of E[Xn] as an integral it is easy to see that

$$E[X_1] \le E[X_2] \le \cdots,$$

and it seems reasonable to make the following

(1.5) DEFINITION. Let X be a non-negative random variable, and let X1, X2, . . . be discrete random variables satisfying (1.3) and (1.4). Then we define the expected value of X to be

$$E[X] = \lim_{n \to \infty} E[X_n].$$

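Finally, if X is an arbitrary real-valued random variable (not necessarily non-negative), and if we define

$$Y(\omega) = \max\{X(\omega), 0\} \tag{1.6}$$

and

$$Z(\omega) = \max\{-X(\omega), 0\} \tag{1.7}$$

for all ω ∈ Ω, then both Y and Z are non-negative random variables, and

$$X = Y - Z.$$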

Definition (1.5) provides the meanings for the expected values of Y and Z, and we now make the following

(1.8) DEFINITION. Let X be an arbitrary random variable with values in ℝ, and let Y and Z be defined by (1.6) and (1.7). Then

$$E[X] = E[Y] - E[Z]$$

provided that at least one of the numbers E[Y] and E[Z] is finite. If E[Y] = E[Z] = + ∞, then X is said to have no expected value.

Definitions (1.2) and (1.8) are quite workable, but (1.5) is not. In fact, we have not even settled the matter of nonambiguity. If {Xn} is a sequence of discrete random variables increasing to X, and if {Yn} is another sequence of discrete random variables also increasing to X, then Definition (1.5) would put E[X] = limn E[Xn] and E[X] = limn E[Yn]. How do we know that these two numbers are the same? Indeed, they are the same, as the proof of the next theorem shows. As a by-product we obtain a nice computational formula.

(1.9) THEOREM. For any non-negative random variable X,

$$E[X] = \int_0^{\infty} P\{X > t\} \, dt. \tag{1.10}$$

Proof. First, suppose X is discrete with values in E. Then using Definition (1.2) and changing the order of summation and integration, we get (see Figure 2.1.2)

This establishes (1.10) for X discrete.

Let X be an arbitrary non-negative random variable, and let X1, X2, . . . be discrete random variables increasing to X. Then (1.11) applies to each

Figure 2.1.2 For non-negative discrete X, E[X] is the shaded area no matter how that is sliced.

On the other hand, since the Xn increase to X, for any t ≥ 0,

Thus, Proposition (1.1.11) applies to give

It follows from (1.12) and (1.13) and the monotonicity of the convergence that, by Definition (1.5), we have

This completes the proof.
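For a random variable with values in {0, 1, 2, . . .}, the function P{X > t} is constant on each interval [n, n + 1), so the integral in Theorem (1.9) becomes the sum of P{X > n} over n. The following sketch checks this against Definition (1.2) with exact fractions. The distribution is a hypothetical one made up for the check, and the function names are introduced here for illustration.

```python
from fractions import Fraction

def mean_by_definition(pmf):
    """E[X] as the sum of n * P{X = n}, as in Definition (1.2)."""
    return sum(n * p for n, p in pmf.items())

def mean_by_tail_sum(pmf):
    """E[X] as the sum over n >= 0 of P{X > n}; terms vanish beyond the largest value."""
    top = max(pmf)
    return sum(
        sum(p for k, p in pmf.items() if k > n)  # P{X > n}
        for n in range(top)
    )

# Hypothetical distribution on {0, 1, 3}.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 3: Fraction(1, 2)}

print(mean_by_definition(pmf), mean_by_tail_sum(pmf))  # 7/4 7/4
```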

We note that, in Definition (1.5), the sequence chosen to approximate X has nothing to do with the value E[X]. Formula (1.10) is in general easy to use if the distribution of X is known (see Figure 2.1.3). In the case of discrete random variables taking integer values 0, 1, 2, . . ., it reduces further to a simpler sum:

$$E[X] = \sum_{n=0}^{\infty} P\{X > n\}. \tag{1.14}$$

Figure 2.1.3 Expected value of a non-negative random variable is the shaded area lying above its distribution function.

Figure 2.1.4. Expected value of the random variable X is the difference E[Y] – E[Z] of the two shaded areas.

In the case of arbitrary random variables, using Theorem (1.9) to compute E[Y] and E[Z] in Definition (1.8), we obtain (see Figure 2.1.4)

(1.15) COROLLARY. For any real-valued random variable X,

$$E[X] = \int_0^{\infty} P\{X > t\} \, dt - \int_{-\infty}^{0} P\{X \le t\} \, dt$$

provided that at least one term on the right is finite.

In the formula given by (1.15), if we integrate by parts we obtain

(1.16) COROLLARY. For any real-valued random variable X with distribution function φ,

$$E[X] = \int_{-\infty}^{\infty} t \, d\varphi(t)$$

provided the integral converges absolutely.

In computing a particular expectation, the choice of one formula over another is largely a matter of convenience. If there is a closed form expression for P{X > t}, in general, it is easier to use Theorem (1.9) and Corollary (1.15). Otherwise, it is easier to use Corollary (1.16) or its discrete equivalent, Definition (1.2).

(1.17) EXAMPLE. The number of arrivals into a store during a specified time interval is a random

Then, using Definition (1.2),

(1.18) EXAMPLE. The lifetime X of an item has the distribution

This is a non-negative random variable; it is easier to compute E[X] by using Theorem (1.9). We obtain

(1.19) EXAMPLE. The intensity X of light falling on a certain surface has a distribution φ given by

This distribution is called “the normal distribution with mean α and variance β².” Using Corollary (1.16),

(1.20) EXAMPLE. A discrete random variable X has the distribution

where p, q > 0, p + q = 1. If we use Definition (1.2),†

for all ; then

(1.21) EXAMPLE. A piece of equipment has two components whose lifetimes X and Y are independent random variables with distributions

The equipment fails if either one of the two components does; that is, the lifetime of the equipment is Z = min(X, Y). To compute E[Z] we use Theorem (1.9). Now, Z > t if and only if both X > t and Y > t. So

$$P\{Z > t\} = P\{X > t, \, Y > t\} = P\{X > t\} \, P\{Y > t\},$$

where the second equality follows from the independence of X and Y (see Definition (1.2.21)). So

If X is a random variable taking values in E, and if f is a function from E into ℝ, then f(X) is a random variable with values in ℝ. Given the distribution of X, one can obtain the distribution of Y = f(X) and, using that, compute the expected value of Y by using the formulae of the preceding propositions. However, it is much easier to think of E[Y] as the integral of the function Y with respect to P and obtain it as in Definition (1.2):

(1.22) PROPOSITION. Let X be a discrete random variable taking values in E, and let f be a function from E into ℝ. Then

$$E[f(X)] = \sum_{a \in E} f(a) \, P\{X = a\}$$

provided the sum is absolutely convergent.

Proof. The random variable Y = f(X) takes the value f(a) on the set {X = a}, whose measure is P{X = a}. The integral of Y therefore is Σ f(a) P{X = a} where the summation is over all a ∈ E.
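The two routes to E[f(X)] described above can be compared directly: summing f(a) P{X = a} as in Proposition (1.22), or first building the distribution of Y = f(X) and then applying Definition (1.2). The distribution and the choice f(a) = a² below are hypothetical, picked only so that f is not one-to-one.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical distribution of X on {-1, 0, 1}.
pmf_x = {-1: Fraction(1, 4), 0: Fraction(1, 4), 1: Fraction(1, 2)}

def f(a):
    return a * a

# Route 1: Proposition (1.22), sum of f(a) * P{X = a}.
direct = sum(f(a) * p for a, p in pmf_x.items())

# Route 2: build the distribution of Y = f(X), then take sum of b * P{Y = b}.
pmf_y = defaultdict(Fraction)
for a, p in pmf_x.items():
    pmf_y[f(a)] += p  # the values a = -1 and a = 1 both land on b = 1
via_distribution = sum(b * p for b, p in pmf_y.items())

print(direct, via_distribution)  # 3/4 3/4
```

The first route avoids the bookkeeping of merging values of X that map to the same value of Y, which is the convenience the proposition is pointing at.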

following

(1.23) PROPOSITION. Let X be a random variable with values in E, and let f be a function from E into ℝ. Then

$$E[f(X)] = \int_E f(a) \, d\varphi(a),$$

where φ is the distribution of X, provided that the integral is absolutely convergent.

Proof is omitted. If instead we had a function of more than one random variable, the preceding two propositions become as follows. Again, we omit the proof.

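(1.24) THEOREM. Let X1, . . ., Xn be random variables taking values in E, and let f be a function from E × · · · × E = E^n into ℝ. Then

$$E[f(X_1, \ldots, X_n)] = \int_{E^n} f(a_1, \ldots, a_n) \, d\varphi(a_1, \ldots, a_n),$$

where φ is the joint distribution of X1, . . ., Xn. In case X1, . . ., Xn are discrete, this becomes

$$E[f(X_1, \ldots, X_n)] = \sum_{a} f(a_1, \ldots, a_n) \, P\{X_1 = a_1, \ldots, X_n = a_n\},$$

where the summation is over all n-tuples a = (a1, . . ., an) with ai ∈ E.

(1.25) COROLLARY.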

(a) E[cX] = cE[X];

(b) E[X + Y] = E[X] + E[Y];

(c) E[c1X1 + · · · + cn Xn] = c1E[X1] + · · · + cn E[Xn].

Proof of (a) is immediate from Proposition (1.23), where we take f(a) = ca and then use Corollary (1.16). Proof of (b) follows from (1.24) by taking f(a, b) = a + b, and (c) is immediate from (a) and (b).

We note that in the preceding corollary, we made no assumption of independence: whether or not the random variables are independent, the expected value of any linear combination of them is equal to the same linear combination of their expectations. The following is the analog for the case of multiplication.
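As a small check that linearity needs no independence, the sketch below takes a pair (X, Y) that is as dependent as possible, with Y = X², and compares both sides of Corollary (1.25). The joint distribution and the helper `expect` are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction

def expect(joint, f):
    """E[f(X, Y)] for a dict mapping (x, y) -> P{X = x, Y = y}."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

# Hypothetical joint distribution: X uniform on {1, 2, 3} and Y = X**2,
# so Y is completely determined by X.
joint = {(x, x * x): Fraction(1, 3) for x in (1, 2, 3)}

e_x = expect(joint, lambda x, y: x)            # 2
e_y = expect(joint, lambda x, y: y)            # 14/3
lhs = expect(joint, lambda x, y: 2 * x + 3 * y)
rhs = 2 * e_x + 3 * e_y

print(lhs, rhs)  # 18 18
```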

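(1.26) PROPOSITION. Let X and Y be two independent random variables taking values in E, and let g and h be two functions from E into ℝ. Then

$$E[g(X)h(Y)] = E[g(X)] \, E[h(Y)].$$

Proof for discrete X, Y. Put f(a, b) = g(a)h(b) in Theorem (1.24). Then

$$E[g(X)h(Y)] = \sum_{a} \sum_{b} g(a) h(b) \, P\{X = a, \, Y = b\}.$$

But the independence of X and Y implies that

$$P\{X = a, \, Y = b\} = P\{X = a\} \, P\{Y = b\}$$

for any a and b. So

$$E[g(X)h(Y)] = \sum_{a} g(a) P\{X = a\} \sum_{b} h(b) P\{Y = b\} = E[g(X)] \, E[h(Y)].$$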

The assumption of independence in this proposition is crucial: if X and Y are not independent, then E[g(X)h(Y)] might differ from E[g(X)]E[h(Y)].
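A two-point example makes the role of independence concrete. In the sketch below, with g and h both taken to be the identity, the product rule fails when Y = X and holds when the joint distribution is the product of the same marginals. Both distributions are hypothetical choices made for illustration.

```python
from fractions import Fraction

def expect(joint, f):
    """E[f(X, Y)] for a dict mapping (x, y) -> P{X = x, Y = y}."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Dependent case: X is -1 or 1 with equal probability and Y = X.
dependent = {(-1, -1): half, (1, 1): half}

# Independent case: same marginals, joint probabilities are products.
independent = {(x, y): quarter for x in (-1, 1) for y in (-1, 1)}

for name, joint in (("dependent", dependent), ("independent", independent)):
    e_xy = expect(joint, lambda x, y: x * y)
    e_x = expect(joint, lambda x, y: x)
    e_y = expect(joint, lambda x, y: y)
    print(name, e_xy, e_x * e_y)
# dependent 1 0
# independent 0 0
```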

For certain special functions f, E[f(X)] is given certain special names. In particular, if f(b) = b^n then E[f(X)] = E[X^n] is called the nth moment of X about the origin. If f(b) = (b − μ)², where μ = E[X], then E[f(X)] = E[(X – μ)²] is called the variance of X and is denoted by Var(X). If X is a non-negative integer-valued random variable, and if f(b) = α^b for some α ∈ [0, 1], then E[f(X)] = E[α^X] is a number between 0 and 1. Considered as a function of α ∈ [0, 1], G(α) = E[α^X] is called the generating function of X. If X is a non-negative random variable, and if f(b) = e^(–αb) for some α ≥ 0, then E[f(X)] is again a number between 0 and 1. Considered as a function of α, F(α) = E[e^(–αX)] is called the Laplace transform of X.

The expected value of X is a rough guide to the value X is likely to be near. The variance of X measures the deviation of X from this likely value E[X]. If the variance is small, then X is more likely to be near E[X]. The following is an estimate that can be used when the distribution of X is not known. It is called Chebyshev’s inequality.

(1.27) PROPOSITION. Let X be a random variable with expectation a and variance b². Then for any ε > 0,

$$P\{|X - a| > \varepsilon\} \le \frac{b^2}{\varepsilon^2}.$$

Proof. Consider the expectation of the non-negative random variable Y = (X − a)²; it is E[Y] = b². Now, E[Y] is the integral of Y over all of Ω, and as such it is at least as large as the integral of Y on the set {Y > ε²}. The measure of that set is P{Y > ε²}, and Y > ε² on that set. So the integral must be at least ε²P{Y > ε²}. That is,

$$b^2 = E[Y] \ge \varepsilon^2 P\{Y > \varepsilon^2\} = \varepsilon^2 P\{|X - a| > \varepsilon\},$$

from which the proposition follows.
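To see how loose or tight the estimate is in a concrete case, the following sketch compares the exact tail probability P{|X − a| > ε} with the bound b²/ε² for a fair die. The die is a hypothetical example chosen here; nothing in the text refers to it.

```python
from fractions import Fraction

# Hypothetical distribution: a fair six-sided die.
die = {k: Fraction(1, 6) for k in range(1, 7)}

a = sum(x * p for x, p in die.items())              # expectation, 7/2
b2 = sum((x - a) ** 2 * p for x, p in die.items())  # variance, 35/12

def tail(eps):
    """Exact value of P{|X - a| > eps}."""
    return sum(p for x, p in die.items() if abs(x - a) > eps)

for eps in (Fraction(1), Fraction(2), Fraction(5, 2)):
    bound = b2 / eps ** 2
    print(eps, tail(eps), bound)
    assert tail(eps) <= bound  # Chebyshev's inequality
```

The bound is crude when ε is small (it can exceed 1) but requires only the mean and the variance, which is the point of the proposition.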

In computing the variance it is usually worth noting that

$$\operatorname{Var}(X) = E[X^2] - (E[X])^2. \tag{1.28}$$

Following are some examples of these computations.

(1.29) EXAMPLE. Consider the random variable X of Example (1.17). We had already computed E[X] = 8. Now, to obtain the variance we use the formula (1.28). To compute E[X²], note that it is easier to compute E[X(X − 1)] first and then use E[X²] = E[X(X − 1)] + E[X]. Now,

Hence,

and

Next we compute its generating function. We have

for any α ∈ [0, 1]. Note that the derivative of G at α = 1 is

and the third derivative is

(1.30) EXAMPLE. Consider the lifetime X of the item discussed in Example (1.18). Its expectation was E[X] = 50. Now,

and hence

Computing the Laplace transform of X, we find

We note that the derivative of F at α = 0 is

and that the second derivative at α = 0 is

we have, for the generating function G of a non-negative integer-valued random variable X,

where G^(k) is the kth derivative of G. Similarly, the results in Example (1.30) concerning the Laplace transform also generalize. For any non-negative random variable X with Laplace transform F,

It is also worth mentioning that a generating function determines the probability distribution associated with it; this is true because

$$G(\alpha) = E[\alpha^X] = \sum_{n=0}^{\infty} \alpha^n \, P\{X = n\},$$

which means that P{X = n} is the coefficient of α^n in the power-series expansion of G(α). Similarly, the Laplace transform determines the associated distribution function.
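When X takes only finitely many values, G is a polynomial whose coefficient list is the distribution itself, and its derivatives at α = 1 give the factorial moments E[X], E[X(X − 1)], and so on. The sketch below works this out for a hypothetical distribution on {0, 1, 2}; the helper names `derivative` and `evaluate` are introduced here for illustration.

```python
from fractions import Fraction

# Hypothetical distribution: P{X = 0}, P{X = 1}, P{X = 2}.
# The list doubles as the coefficients of G(alpha) = sum of alpha**n * P{X = n}.
coeffs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

def derivative(c):
    """Coefficients of the derivative of the polynomial with coefficients c."""
    return [n * c[n] for n in range(1, len(c))]

def evaluate(c, alpha):
    """Value of the polynomial with coefficients c at the point alpha."""
    return sum(cn * alpha ** n for n, cn in enumerate(c))

g1 = evaluate(derivative(coeffs), 1)              # G'(1)  = E[X]
g2 = evaluate(derivative(derivative(coeffs)), 1)  # G''(1) = E[X(X - 1)]
variance = g2 + g1 - g1 ** 2                      # E[X^2] - (E[X])^2

print(evaluate(coeffs, 1), g1, g2, variance)  # 1 1 1/2 1/2
```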

We close this section with two theorems on the expected value of the limit of a sequence of random variables. The first is called the monotone convergence theorem and the second the bounded convergence theorem. The proof of the first is the same as that of (1.9), and we will not repeat it; we also omit the proof of the second.

(1.34) THEOREM. If X1, X2, . . . is a sequence of non-negative random variables increasing to the random variable X, then the expectations E[X1], E[X2], . . . increase to E[X].

(1.35) THEOREM. Let X1, X2, . . . be a sequence of random variables which are bounded in absolute value by a random variable Y such that E[Y] < ∞. If

$$\lim_{n \to \infty} X_n(\omega) = X(\omega)$$

for almost all ω ∈ Ω, then

$$\lim_{n \to \infty} E[X_n] = E[X].$$

2. Conditional Expectations

Let Y be a discrete random variable taking values in ℝ, and let A be an event with P(A) > 0. Then the conditional probability that Y = b given the event A is (see (1.3.1))

$$P\{Y = b \mid A\} = \frac{P(\{Y = b\} \cap A)}{P(A)}. \tag{2.1}$$

As b varies, this is called the conditional distribution of Y given the event A. We define the conditional expectation of Y given the event A as

$$E[Y \mid A] = \sum_{b} b \, P\{Y = b \mid A\}. \tag{2.2}$$

In particular, when A = {X = a} for a discrete random variable X taking values in a set E,

$$E[Y \mid X = a] = \sum_{b} b \, P\{Y = b \mid X = a\} \tag{2.3}$$

is called the conditional expectation of Y given that X = a. As a varies, (2.3) defines a function f on E by

$$f(a) = E[Y \mid X = a]. \tag{2.4}$$

By the conditional expectation of Y given X, written E[Y | X], we mean the random variable f(X); that is,

$$E[Y \mid X] = f(X), \tag{2.5}$$

where f is as defined by (2.4). The following definition is the generalized version of this.

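(2.6) DEFINITION. Let X1, . . ., Xn be discrete random variables taking values in E, and let Y be a discrete random variable with values in ℝ. Then the conditional expectation of Y given X1, . . ., Xn is

$$E[Y \mid X_1, \ldots, X_n] = f(X_1, \ldots, X_n),$$

where for any n-tuple (a1, . . ., an) with ai ∈ E,

$$f(a_1, \ldots, a_n) = \sum_{b} b \, P\{Y = b \mid X_1 = a_1, \ldots, X_n = a_n\}.$$

If Y is not discrete, then a similar definition is given in terms of its conditional distribution P{Y ≤ t | X1 = a1, . . ., Xn = an}. For example, if Y is non-negative,

$$E[Y \mid X_1, \ldots, X_n] = f(X_1, \ldots, X_n),$$

where

$$f(a_1, \ldots, a_n) = \int_0^{\infty} P\{Y > t \mid X_1 = a_1, \ldots, X_n = a_n\} \, dt$$

for all a1, . . ., an ∈ E.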

not) is a random variable. Then we define the conditional probability of A given X1, . . ., Xn as

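The following are some easy properties of conditional expectations. These are analogous to Propositions (1.22), (1.23), (1.24), and (1.25). We omit the proofs.

(2.10) PROPOSITION. Let Y be a discrete random variable with values in E and g a function from E into ℝ. Then

$$E[g(Y) \mid X_1, \ldots, X_n] = \sum_{b \in E} g(b) \, P\{Y = b \mid X_1, \ldots, X_n\}.$$

(2.11) PROPOSITION. Let Y1, . . ., Ym be discrete random variables with values in E, and let g be a function from E^m into ℝ. Then

$$E[g(Y_1, \ldots, Y_m) \mid X_1, \ldots, X_n] = \sum_{b} g(b_1, \ldots, b_m) \, P\{Y_1 = b_1, \ldots, Y_m = b_m \mid X_1, \ldots, X_n\}.$$

(2.12) COROLLARY. If Y1, . . ., Ym take values in ℝ and c1, . . ., cm are constants, then

$$E[c_1 Y_1 + \cdots + c_m Y_m \mid X_1, \ldots, X_n] = c_1 E[Y_1 \mid X_1, \ldots, X_n] + \cdots + c_m E[Y_m \mid X_1, \ldots, X_n].$$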

(2.13) EXAMPLE. Let X and Y be two random variables with

Let f(b) = E[Y | X = b], b = 1, 2. Then

Thus,

for k = 1, . . ., m – 1; m = 2, . . ., n – 1; n = 3, 4, . . ., where 0 < p < 1, p + q = 1. Then for k = 1, . . ., m – 1; m = 2, 3, . . .;

Thus, for k = 1, . . ., m – 1 and m = 2, . . ., n – 1, we have

Hence, for k = 1, . . ., m – 1 and m = 2, 3, . . .,

Thus,

We note that for any bounded function g,

so that

In particular, if g(b) = α^b for some α ∈ [0, 1], then

The next proposition states that if the knowledge of X1, . . ., Xn determines Y completely, then the conditional expectation of Y given X1, . . ., Xn is equal to Y itself. The proof is very easy and we omit it.

(2.15) PROPOSITION. If Y can be written as

$$Y = f(X_1, \ldots, X_n)$$

for some function f, then

$$E[Y \mid X_1, \ldots, X_n] = Y.$$

Next is a very useful result used in situations where E[Y | X1, . . ., Xn] is easy to obtain or known somehow. Since E[Y | X1, . . ., Xn] is a random variable taking real values, we can talk about its expected value. That expected value is the same as the expectation of Y. In words, the expected value of any conditional expectation of Y is equal to the expected value of Y.
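The statement that averaging a conditional expectation recovers E[Y] can be checked by hand on a small joint distribution. In the sketch below the joint probabilities are hypothetical, and `cond_exp` plays the role of the function f(a) = E[Y | X = a] from (2.4).

```python
from fractions import Fraction

# Hypothetical joint distribution: (x, y) -> P{X = x, Y = y}.
joint = {
    (1, 0): Fraction(1, 8),
    (1, 2): Fraction(3, 8),
    (2, 1): Fraction(1, 4),
    (2, 5): Fraction(1, 4),
}

# Marginal distribution of X.
pmf_x = {}
for (x, _), p in joint.items():
    pmf_x[x] = pmf_x.get(x, 0) + p

def cond_exp(a):
    """f(a) = E[Y | X = a] = sum of b * P{Y = b, X = a} / P{X = a}."""
    return sum(y * p for (x, y), p in joint.items() if x == a) / pmf_x[a]

e_of_cond = sum(cond_exp(a) * p for a, p in pmf_x.items())  # E[f(X)]
e_y = sum(y * p for (_, y), p in joint.items())             # E[Y]

print(cond_exp(1), cond_exp(2), e_of_cond, e_y)  # 3/2 3 9/4 9/4
```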

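(2.16) PROPOSITION. E[E[Y | X1, . . ., Xn]] = E[Y].

Proof for discrete Y, X1, . . ., Xn. Let

$$f(a_1, \ldots, a_n) = E[Y \mid X_1 = a_1, \ldots, X_n = a_n];$$

then

$$E[E[Y \mid X_1, \ldots, X_n]] = E[f(X_1, \ldots, X_n)] = \sum_{a} f(a_1, \ldots, a_n) \, P\{X_1 = a_1, \ldots, X_n = a_n\}. \tag{2.17}$$

On the other hand,

$$f(a_1, \ldots, a_n) = \sum_{b} b \, P\{Y = b \mid X_1 = a_1, \ldots, X_n = a_n\}. \tag{2.18}$$

Putting (2.18) into (2.17) and changing the order of summation, noting Definition (1.3.1) of conditional probabilities, we obtain

$$E[E[Y \mid X_1, \ldots, X_n]] = \sum_{b} b \sum_{a} P\{Y = b, \, X_1 = a_1, \ldots, X_n = a_n\} = \sum_{b} b \, P\{Y = b\} = E[Y].$$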

The next result is very important in the theory of stochastic processes. It shows how to obtain the conditional expectation of Y given X1, . . ., Xn when it is easy to obtain the same given X1, . . ., Xn plus some extra information contained in Xn+1, . . ., Xn+m.

(2.19) THEOREM. For any n, m ≥ 1,

$$E[\, E[Y \mid X_1, \ldots, X_{n+m}] \mid X_1, \ldots, X_n] = E[Y \mid X_1, \ldots, X_n].$$

Proof for n = 2, m = 1, X1, X2, X3, Y discrete. Let Z = f(X1, X2, X3) = E[Y | X1, X2, X3]. We need to show that

$$E[Z \mid X_1, X_2] = E[Y \mid X_1, X_2].$$

We have

and

Putting the two computations together, we get

since

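by (1.3.1) and (1.3.2). Noting that (2.21) is the same as (2.20) completes the proof.

(2.22) COROLLARY. If

$$E[Y \mid X_1, \ldots, X_{n+m}] = f(X_1, \ldots, X_n)$$

for some function f, then

$$E[Y \mid X_1, \ldots, X_n] = E[Y \mid X_1, \ldots, X_{n+m}].$$

Proof. By Theorem (2.19) and Proposition (2.15),

$$E[Y \mid X_1, \ldots, X_n] = E[\, E[Y \mid X_1, \ldots, X_{n+m}] \mid X_1, \ldots, X_n] = E[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_n] = f(X_1, \ldots, X_n).$$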

The preceding corollary has the following interpretation. Suppose, given X1, . . ., Xn+m, that the conditional expectation of Y depends only on X1, . . ., Xn. This means that, as far as predicting the value of Y is concerned, knowledge of X1, . . ., Xn makes further knowledge concerning Xn+1, . . ., Xn+m irrelevant. Therefore, the conditional expectation of Y given X1, . . ., Xn is the same as that of Y given X1, . . ., Xn+m.

A particular case of this happens to come up fairly often. Suppose we have E[Y | X1, . . ., Xn] computed; and suppose Y1, . . ., Ym are functions of X1, . . ., Xn, that is, Y1 = g1(X1, . . ., Xn), . . ., Ym = gm(X1, . . ., Xn). Then E[Y | X1, . . ., Xn, Y1, . . ., Ym] = E[Y | X1, . . ., Xn].

Another important concept is contained in the following theorem. Suppose {Y1, . . ., Ym} and {X1, . . ., Xn} are such that knowing the values of one set determines the values of the other. This is especially the case when Y1 = g1(X1, . . ., Xn), . . ., Ym = gm(X1, . . ., Xn) and conversely X1 = f1(Y1, . . ., Ym) . . ., Xn = fn(Y1, . . ., Ym). Then for any random variable Y, the conditional expectation of Y given X1, . . ., Xn is the same as the conditional expectation of Y given Y1, . . ., Ym. This is so since {X1, . . ., Xn} carries the same information as {Y1, . . ., Ym}. The proof is easy and will be omitted.

(2.23) THEOREM. Suppose the collections {X1, . . ., Xn} and {Y1, . . ., Ym} are such that the knowledge of the random variables in one collection determines the values of the random variables in the other. Then for any Y,

$$E[Y \mid X_1, \ldots, X_n] = E[Y \mid Y_1, \ldots, Y_m].$$

We close this section by giving an extension of the concept of independence.

(2.24) DEFINITION. The set of random variables {Y1, . . ., Ym} is said to be independent of {X1, . . ., Xn} if

$$E[g(Y_1, \ldots, Y_m) \mid X_1, \ldots, X_n] = E[g(Y_1, \ldots, Y_m)]$$

for all non-negative functions g. Two stochastic processes {Yt; t ∈ T1} and {Xt; t ∈ T2} are said to be independent of each other if any finite collection {Yt1, . . ., Ytm} from the first is independent of any finite collection {Xs1, . . ., Xsn} from the second.

equivalent to the independence, in the sense of (2.24), of any two subcollections. As such, we will not distinguish between the two.

Next is a new concept, that of conditional independence.

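(2.25) DEFINITION. {Y1, . . ., Ym} is said to be conditionally independent of {Z1, . . ., Zk} given {X1, . . ., Xn} provided that

$$E[g(Y_1, \ldots, Y_m) \mid X_1, \ldots, X_n, Z_1, \ldots, Z_k] = E[g(Y_1, \ldots, Y_m) \mid X_1, \ldots, X_n]$$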

for all non-negative functions g. The collection {Yt; t ∈ T1} is said to be conditionally independent of the collection {Zt; t ∈ T2} given the collection {Xt; t ∈ T3} provided that for any finite collection {Yt1, . . ., Ytm} from the first and any finite collection {Zs1, . . ., Zsn} from the second,

for all non-negative functions g.

In words, {Y1, . . ., Ym} is conditionally independent of {Z1, . . ., Zk} given {X1, . . ., Xn} provided that, as far as predicting the value of any function of Y1, . . ., Ym is concerned, the extra knowledge provided by Z1, . . ., Zk loses all its significance once the values of X1, . . ., Xn are known.

(2.26) EXAMPLE. Consider the random variables X, Y, Z of Example (2.14). We had shown that

The right hand side being independent of X, we see that Z is conditionally independent of X given Y. We also note that Z is not independent of X.

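(2.27) EXAMPLE. Let X1, X2, . . . be a sequence of random variables with E[Xi] = μ for all i. Let N be a non-negative integer-valued random variable independent of X1, X2, . . . with E[N] = λ. For each ω ∈ Ω, let

$$Y(\omega) = \begin{cases} 0 & \text{if } N(\omega) = 0, \\ X_1(\omega) + \cdots + X_{N(\omega)}(\omega) & \text{if } N(\omega) \ge 1. \end{cases}$$

We would like to compute E[Y]. We may think of X1, X2, . . . as the amounts spent by customers 1, 2, . . . and of N as the number of arrivals within the first hour. Then Y is the total revenue within that hour.

By Proposition (2.16),

$$E[Y] = E[\, E[Y \mid N] \,]. \tag{2.28}$$

On the other hand, since N is independent of X1, X2, . . ., for n ≥ 1,

$$E[Y \mid N = n] = E[X_1 + \cdots + X_n \mid N = n] = E[X_1 + \cdots + X_n] = n\mu.$$

Hence

$$E[Y \mid N] = \mu N,$$

and by (2.28)

$$E[Y] = E[\mu N] = \mu E[N] = \mu \lambda.$$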
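The conclusion of this example, E[Y] = μλ, can be confirmed by brute-force enumeration on a small case. The sketch below assumes, beyond what the example requires, that the Xi are independent and identically distributed; the two distributions are hypothetical, and `expected_random_sum` is a name introduced here.

```python
from fractions import Fraction
from itertools import product

# Hypothetical distributions (assumptions for this check only).
x_pmf = {1: Fraction(1, 2), 3: Fraction(1, 2)}                   # amount spent, mu = 2
n_pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}  # arrivals, lambda = 5/4

def mean(pmf):
    return sum(v * p for v, p in pmf.items())

def expected_random_sum(x_pmf, n_pmf):
    """E[X_1 + ... + X_N] by listing every value of N and every tuple (x_1, ..., x_n)."""
    total = Fraction(0)
    for n, p_n in n_pmf.items():
        for xs in product(x_pmf, repeat=n):  # for n = 0 this is the single empty tuple
            prob = p_n
            for x in xs:
                prob *= x_pmf[x]             # independence of N and the X_i
            total += sum(xs) * prob
    return total

print(expected_random_sum(x_pmf, n_pmf), mean(x_pmf) * mean(n_pmf))  # 5/2 5/2
```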

3. Exercises

(3.1) Find the expected value of the random variable X taking the values –5, 1, 4, 8, 10 with probabilities 0.3, 0.2, 0.2, 0.1, 0.2 respectively.

(3.2) Consider the random variable X taking the values –2, 0, 2 with probabilities 0.4, 0.3, 0.3 respectively. Compute the expected values of X, X², 3X² + 5.

(3.3) Compute the variance and the generating function of the random variable in Example (1.20).

(3.4) A random variable X is said to have the uniform distribution over [a, b] if

(a) Compute E[X], Var(X), E[(X – a)/(b – a)].

(b) Find the distribution of Y = (X – a)/(b – a).

(3.5) Compute the variance and the Laplace transform of the lifetime in Example (1.21).

(3.6) Compute the variance of the intensity of light in Example (1.19).

(3.7) The headway X between two vehicles at a fixed instant is a random variable with

Find the expected value and the variance of the headway.

(3.8) Show that for any constants a and b,

for any random variable X.

(3.9) Show that for any two independent random variables X and Y,

(3.10) The lifetime X of a device has the distribution

(a) Show that E[X] = 1/c.

(b) Show that (see also Example (1.3.8))

(3.11) Let X, Y be as defined in Example (1.2.23).

(a) Compute E[X], E[Y – X], E[Y].

(b) Find E[Y – X | X], E[Y | X].

(c) Show by direct computation that E[E[Y | X]] = E[Y].

(3.12) Suppose X1, X2, . . . are independent and identically distributed non-negative random variables with

Let N be a non-negative integer-valued random variable which is independent of {X1, X2, . . .}, and let

Let S0 = 0, S1 = X1, S2 = X1 + X2, . . ., and let Y = SN.

(a) Compute E[Y | N], E[Y² | N].

(b) Compute E[Y], E[Y²], Var(Y).

(c) Show that for any ,

The chapter just finished completes our account of the preliminaries necessary for studying stochastic processes. Especially in view of the monstrous looks of the last section, it seems all too advisable to inquire how the reader’s patience is holding out and to assure him that he will in time come to appreciate the true friendliness of these concepts. For a deeper treatment and for the proofs which we have omitted, we refer the reader to CHUNG [2].

† Note that, for x ∈ (0, 1),

$$\sum_{n=0}^{\infty} x^n = \frac{1}{1 - x}.$$

If we differentiate both sides we get

$$\sum_{n=1}^{\infty} n x^{n-1} = \frac{1}{(1 - x)^2}.$$

CHAPTER 3

Bernoulli Processes and Sums of Independent Random Variables

Consider an experiment consisting of an infinite sequence of trials. Suppose that the trials are independent of each other and that each one has only two possible outcomes: “success” and “failure.” A possible outcome of such an experiment is (S, F, F, S, F, S, . . .), which stands for the outcome where the first trial resulted in success, the second and third in failure, fourth in success, fifth in failure, sixth in success, etc. Thus, the sample space Ω of the experiment consists of all sequences of two letters S and F, that is,

$$\Omega = \{\omega : \omega = (\omega_1, \omega_2, \ldots), \ \omega_i \in \{S, F\}\}.$$

We next describe a probability measure P on all subsets of Ω. Let 0 ≤ p ≤ 1, and define q = 1 – p. We think of p and q as the probabilities of success and failure at any one trial. The probability of the event {ω: ω1 = S, ω2 = S, ω3 = F} should then be p · p · q = p²q, and the probability of the event {ω: ω1 = S, ω2 = F, ω3 = F, ω4 = F, ω5 = S} should be p · q · q · q · p = p²q³. For each n, we consider all events which are specified by the first n trials and define their probabilities in this manner. This, in addition to the conditions in Definition (1.1.6), completely specifies the probability P.

For each ω ∈ Ω and n ∈ {1, 2, . . .}, define Xn(ω) = 1 or 0 according as ωn = S or ωn = F. Then for each n, we have a random variable Xn whose only values are 1 and 0. It follows from the description of P that X1, X2, . . . are independent and identically distributed with P{Xn = 1} = p, P{Xn = 0} = q. In the first three sections of this chapter we will be interested in the properties of stochastic processes such as {Xn; n = 1, 2, . . .} and other processes defined in terms of {Xn}.

The process {Xn} is very simple in nature, and the answers to most problems raised here are easy to obtain. Our object here is to demonstrate some of the questions that are raised in studying a stochastic process, and to provide some experience in using the tools of Chapters 1 and 2.

The processes of Sections 2 and 3 provide two examples of sums of independent and identically distributed random variables. We collect results of a more general nature about such processes in Section 4 along with certain classical limit theorems.

1. Bernoulli Processes

Let Ω be a sample space, P a probability measure on Ω, and {Xn; n = 1, 2, . . .} a sequence of random variables defined on Ω and taking only the two values 0 and 1.

(1.1) DEFINITION. The stochastic process {Xn; n = 1, 2, . . .} is called a Bernoulli process with success probability p provided that

(a) X1, X2, . . . are independent, and

(b) P{Xn = 1} = p, P{Xn = 0} = q = 1 − p for all n.
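A finite stretch of such a process is easy to simulate, which can help in building intuition for the definition. The sketch below uses Python's standard pseudo-random generator as a stand-in for independent trials; the function name, the seed, and the value p = 0.62 are choices made here for illustration.

```python
import random

def bernoulli_process(p, n, rng):
    """Return [X_1, ..., X_n]: independent 0/1 values with P{X_k = 1} = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(2013)  # fixed, arbitrary seed so that runs are repeatable
xs = bernoulli_process(0.62, 10_000, rng)

frequency = sum(xs) / len(xs)
print(xs[:10])    # the first ten trials
print(frequency)  # expected to be close to 0.62, though not exactly equal
```

Each call to `rng.random()` is used once, so condition (a) is modeled by the generator's successive draws and condition (b) by comparing each draw with the same p.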

(1.2) EXAMPLE. Finished products coming off an assembly line are given a routine inspection. If the nth item is “defective” we put Xn = 1, otherwise Xn = 0. If the production process is “under control,” the successive items produced will show only random chance effects, and X1, X2, . . . will be independent. If, further, the defective rate p remains constant over time, then {Xn; n = 1, 2, . . .} will be a Bernoulli process.

(1.3) EXAMPLE. At a certain fork on a road, about 62 percent of the vehicles turn left. We define Xn to be 1 or 0 according as the nth vehicle turns left or right. The random variables X1, X2, . . . are independent of each other if the drivers act independently of each other in choosing their directions to turn. Then {Xn; n = 1, 2, . . .} is a Bernoulli process with probability 0.62 for success.

(1.4) EXAMPLE. Diameters of bearings coming off a production line are measured, and those that do not meet the specifications are rejected. Let Y1, Y2, . . . be the diameters in inches of the first, second, . . . bearings. Let a = 2.994 and b = 3.006 be the lower and the upper tolerance limits. Thus, the nth bearing is not rejected if and only if 2.994 ≤ Yn ≤ 3.006.

Suppose Y1, Y2, . . . are independent and identically distributed with the common distribution function φ given by

$$\varphi(t) = \int_{-\infty}^{t} \frac{1}{0.002\sqrt{2\pi}} \exp\left(-\frac{(y - 3)^2}{2(0.002)^2}\right) dy$$

(this is the normal distribution with mean 3 and variance (0.002)²). For each n let

$$X_n = \begin{cases} 1 & \text{if } Y_n \in B, \\ 0 & \text{otherwise,} \end{cases}$$

where B = [2.994, 3.006]. Then Xn is 1 if the nth bearing meets the specifications and is 0 otherwise. Since Y1, Y2, . . . are independent, X1, X2, . . . are also independent; and since the Yn have a common distribution, the Xn also have a common distribution. Hence, {Xn; n = 1, 2, . . .} is a Bernoulli process. The probability p = P{Xn = 1} is now computed as

where the table for the standard normal distribution is used to obtain the last number (see Figure 3.4.3 on page 65).
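In place of a table, the success probability in Example (1.4) can be evaluated with the error function from Python's standard library. The tolerance limits lie exactly three standard deviations on either side of the mean, so the result should be the familiar three-sigma probability. `normal_cdf` is a helper defined here, not a library function.

```python
import math

def normal_cdf(t, mean, sd):
    """Distribution function of the normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((t - mean) / (sd * math.sqrt(2.0))))

mean, sd = 3.0, 0.002        # parameters stated in Example (1.4)
lower, upper = 2.994, 3.006  # tolerance limits a and b

p = normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)
print(round(p, 4))  # approximately 0.9973
```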

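Let {Xn; n = 1, 2, . . .} be a Bernoulli process with success probability p. Then for any n,

$$E[X_n] = E[X_n^2] = E[X_n^3] = \cdots = p, \tag{1.5}$$

$$\operatorname{Var}(X_n) = E[X_n^2] - (E[X_n])^2 = p - p^2 = pq,$$

$$E[\alpha^{X_n}] = \alpha^0 P\{X_n = 0\} + \alpha^1 P\{X_n = 1\} = q + p\alpha$$

for any α ≥ 0.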

2. Numbers of Successes

Let {Xn; n = 1, 2, . . .} be a Bernoulli process with probability of success p. We think of Xn as the number of successes at the nth trial. Define N0(ω) = 0 and

$$N_n(\omega) = X_1(\omega) + \cdots + X_n(\omega), \qquad n = 1, 2, \ldots,$$

for each ω ∈ Ω. Then Nn is the number of successes in the first n trials, and Nn+m – Nn is the number of successes in the trials numbered n + 1, n + 2, . . ., n + m. In this section we are interested in the stochastic process {Nn; n = 0, 1, . . .}, whose time parameter is discrete and whose state space is discrete and again is {0, 1, . . .}.
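The counting process is just the running sum of the trials, which the sketch below computes for the first six trials of the outcome (S, F, F, S, F, S, . . .) used at the start of the chapter. The helper `successes_between` is a name introduced here, and `accumulate(..., initial=0)` needs Python 3.8 or later.

```python
from itertools import accumulate

# First six trials of the sample outcome (S, F, F, S, F, S, ...).
outcome = ["S", "F", "F", "S", "F", "S"]
xs = [1 if w == "S" else 0 for w in outcome]  # X_1, ..., X_6

# counts[n] is N_n, the number of successes in the first n trials, with N_0 = 0.
counts = list(accumulate(xs, initial=0))

def successes_between(counts, n, m):
    """N_{n+m} - N_n: successes in trials n + 1, ..., n + m."""
    return counts[n + m] - counts[n]

print(counts)                           # [0, 1, 1, 1, 2, 2, 3]
print(successes_between(counts, 2, 3))  # 1  (trials 3, 4, 5 are F, S, F)
```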

We start with simple descriptive quantities. Using the definition of Nn, Corollary (2.1.25), and the result (1.5) above, we obtain

$$E[N_n] = E[X_1] + \cdots + E[X_n] = np.$$
