2 Elements of probability theory

Probability theory provides mathematical models for random phenomena, that is, phenomena which under repeated observation yield different outcomes that cannot be predicted with certainty.

2.1 SAMPLE SPACES

A situation whose outcomes occur randomly is called an experiment. The set of all possible outcomes of an experiment is called the sample space corresponding to the experiment, and is denoted by $\Omega$. A generic element of $\Omega$ is called a sample point, or simply a point, and is denoted by $\omega \in \Omega$.

Example 2.1 A coin is tossed twice and the sequence of heads (H) and tails (T) is recorded. The possible outcomes of this experiment are HH, HT, TH and TT. Hence, the sample space corresponding to this experiment consists of the four points

$$\Omega = \{HH, HT, TH, TT\}.$$

□

A sample space is called finite if it is empty or contains a finite number of points; otherwise it is called infinite. A sample space is called countable if its points can be indexed by the set of positive integers. A sample space that is finite or countable is called discrete.

Example 2.2 A coin is tossed until H is recorded. The sample space corresponding to this experiment is

$$\Omega = \{H, TH, TTH, TTTH, TTTTH, \ldots\}.$$

Thus $\Omega$ contains countably many points. □
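As an illustration, the two sample spaces of Examples 2.1 and 2.2 can be enumerated in a few lines of Python. This is only a sketch: the function names are ours, and the countable sample space is represented by a lazy generator from which we take the first few points.

```python
from itertools import count, islice, product

def two_toss_sample_space():
    """Sample space of Example 2.1: all sequences of two tosses."""
    return ["".join(seq) for seq in product("HT", repeat=2)]

def toss_until_head():
    """Lazily enumerate the countable sample space of Example 2.2."""
    for tails in count(0):
        # k tails followed by the first head
        yield "T" * tails + "H"

omega_two = two_toss_sample_space()
first_points = list(islice(toss_until_head(), 5))
print(omega_two)
print(first_points)
```

The generator never terminates on its own, which mirrors the fact that the points of the second sample space can be indexed by the positive integers.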

Not all sample spaces are discrete. For example, the sample space consisting of all positive real numbers is not discrete, and neither is the sample space consisting of all real numbers in the interval $[0, 1]$.


2.2 RELATIONS AMONG EVENTS

A subset of points in a sample space $\Omega$ is called an event in $\Omega$. An event occurs if and only if one of its points occurs. Viewed as an event, $\Omega$ is called the sure event. In general, events will be defined by certain conditions on the points that compose them.

Because events are just subsets of points in $\Omega$, concepts and results from point set theory apply to events. In particular, if $A$ and $B$ are events in $\Omega$, $A$ implies $B$, written $A \subseteq B$, if and only if all points in $A$ also belong to $B$. The events $A$ and $B$ are identical, written $A = B$, if and only if $A \subseteq B$ and $B \subseteq A$, that is, $A$ and $B$ contain exactly the same points. Other usual operations and relations between sets are listed below:

(union) $A \cup B = \{\omega \in \Omega : \omega \in A \text{ or } \omega \in B\}$,

(intersection) $A \cap B = \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\}$,

(complement) $A^c = \{\omega \in \Omega : \omega \notin A\}$,

(impossible event) $\emptyset = \Omega^c$,

(difference) $A - B = A \cap B^c$,

(symmetric difference) $A \triangle B = (A - B) \cup (B - A)$.

Instead of $A \cap B$ we also write $AB$.

Operations and relations between sets are easy to visualize using Venn diagrams.

If $A \cap B = \emptyset$, then $A$ and $B$ are called disjoint or mutually exclusive events.

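Intersections and unions of a countable collection $A_1, A_2, A_3, \ldots$ of events are denoted by $\bigcap_{i=1}^{\infty} A_i$ and $\bigcup_{i=1}^{\infty} A_i$, respectively. If $\mathcal{A}$ is an arbitrary family of subsets of $\Omega$, we write

$$\bigcup_{A \in \mathcal{A}} A = \{\omega \in \Omega : \omega \in A \text{ for some } A \in \mathcal{A}\}$$

and

$$\bigcap_{A \in \mathcal{A}} A = \{\omega \in \Omega : \omega \in A \text{ for all } A \in \mathcal{A}\}.$$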

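Some basic relationships between events are:

$A \cup \emptyset = A$, $A \cup \Omega = \Omega$, $A \cup A = A$, $\Omega \cup \emptyset = \Omega$,

$A \cap \emptyset = \emptyset$, $A \cap \Omega = A$, $A \cap A = A$, $\Omega \cap \emptyset = \emptyset$,

and

(commutative law) $A \cup B = B \cup A$, $A \cap B = B \cap A$,

(distributive law) $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$, $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$,

(associative law) $(A \cup B) \cup C = A \cup (B \cup C)$, $(A \cap B) \cap C = A \cap (B \cap C)$,

(De Morgan's laws) $(A \cup B)^c = A^c \cap B^c$, $(A \cap B)^c = A^c \cup B^c$.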

De Morgan's laws show that complementation, union and intersection are not independent operations.

The commutative, distributive and associative laws and De Morgan's laws can easily be extended to a countable collection of events $A_1, A_2, A_3, \ldots$.
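The set identities above can be checked mechanically on a small sample space. The following sketch uses the two-toss sample space of Example 2.1; the particular events A, B and C are our own choice for illustration, and checking one example is of course not a proof.

```python
from itertools import product

# Sample space of Example 2.1 and three illustrative events
omega = frozenset("".join(s) for s in product("HT", repeat=2))
A = frozenset({"HH", "HT"})   # first toss is H
B = frozenset({"HH", "TH"})   # second toss is H
C = frozenset({"HH", "TT"})   # both tosses agree

def complement(event):
    """Complement of an event relative to the sample space."""
    return omega - event

# De Morgan's laws
de_morgan_union = complement(A | B) == complement(A) & complement(B)
de_morgan_intersection = complement(A & B) == complement(A) | complement(B)

# Distributive laws
distributive_union = A | (B & C) == (A | B) & (A | C)
distributive_intersection = A & (B | C) == (A & B) | (A & C)

# Symmetric difference built from differences, as in the definition
symmetric_difference = (A - B) | (B - A)
```

Python's built-in operators `|`, `&` and `-` on sets correspond to union, intersection and difference.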

The characteristic function (cf) of an event $A \subseteq \Omega$ is a function $1_A(\cdot)$ defined for all $\omega \in \Omega$ by the relation

$$1_A(\omega) = \begin{cases} 1, & \text{if } \omega \in A, \\ 0, & \text{otherwise.} \end{cases}$$

We also write $1(\omega \in A)$ or, more compactly, $1(A)$.

There is a one-to-one correspondence between sets and their cf's, and all properties of sets and set operations can be expressed in terms of cf's. For example, if $C = A^c$ then $1_C = 1 - 1_A$, if $C = A \cup B$ then $1_C = \max(1_A, 1_B)$, and if $C = A \cap B$ then $1_C = 1_A \cdot 1_B$.
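The correspondence between set operations and operations on characteristic functions can be verified pointwise. A minimal sketch, assuming the two-toss sample space and two events of our own choosing; the helper `indicator` is a hypothetical name, not notation from the text.

```python
omega = ["HH", "HT", "TH", "TT"]
A = {"HH", "HT"}
B = {"HH", "TH"}

def indicator(event):
    """Return the characteristic function of an event as a Python function."""
    return lambda point: 1 if point in event else 0

one_A, one_B = indicator(A), indicator(B)
one_complement = indicator(set(omega) - A)
one_union = indicator(A | B)
one_intersection = indicator(A & B)

# Check the three identities at every point of the sample space
complement_ok = all(one_complement(w) == 1 - one_A(w) for w in omega)
union_ok = all(one_union(w) == max(one_A(w), one_B(w)) for w in omega)
intersection_ok = all(one_intersection(w) == one_A(w) * one_B(w) for w in omega)
```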

2.3 PROBABILITY

How can we attach probabilities to events? The first and easiest case is an experiment with a finite sample space $\Omega$ consisting of $N$ points. Suppose that, because of the nature of the experiment (e.g. tossing a fair coin), all points in $\Omega$ are equiprobable, that is, equally likely, and let $A$ be some event in $\Omega$. We define the probability of $A$, written $P(A)$, as the ratio

$$P(A) = \frac{N(A)}{N}, \tag{2.1}$$

where $N(A)$ denotes the number of points in $A$. For any $A \subseteq \Omega$, we have

$$0 \le P(A) \le 1, \qquad P(\Omega) = \frac{N(\Omega)}{N} = 1, \qquad P(\emptyset) = \frac{N(\emptyset)}{N} = 0.$$

Further, if $A$ and $B$ are disjoint events in $\Omega$, then

$$P(A \cup B) = \frac{N(A \cup B)}{N} = \frac{N(A)}{N} + \frac{N(B)}{N} = P(A) + P(B).$$


Despite its simplicity, formula (2.1) can lead to nontrivial calculations. In order to use it in a given problem, we need to determine: (i) the number $N$ of all equiprobable outcomes, and (ii) the number $N(A)$ of all those outcomes leading to the occurrence of $A$.
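For small sample spaces, both counts in formula (2.1) can be obtained by brute-force enumeration. The sketch below assumes three tosses of a fair coin and the event "at least two heads"; both the experiment and the function name are chosen here for illustration only.

```python
from fractions import Fraction
from itertools import product

def classical_probability(event, omega):
    """P(A) = N(A) / N for a finite sample space with equiprobable points."""
    favourable = len([w for w in omega if w in event])
    return Fraction(favourable, len(omega))

# (i) all equiprobable outcomes of three tosses, N = 8
omega = ["".join(s) for s in product("HT", repeat=3)]

# (ii) the outcomes leading to the occurrence of the event
at_least_two_heads = {w for w in omega if w.count("H") >= 2}

p = classical_probability(at_least_two_heads, omega)
print(p)
```

Using `Fraction` keeps the ratio exact instead of introducing floating-point rounding.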

A second case is when a basic experiment can be repeated under exactly the same conditions any number $n$ of times. We call this situation the case of independent trials under identical conditions. In this case, we can give a precise meaning to the concept of probability.

In each trial a particular event $A$ may or may not occur. Let $n(A)$ be the number of trials in which $A$ occurs. The relative frequency of the event $A$ in the given series of $n$ trials is defined as

$$f_n(A) = \frac{n(A)}{n}.$$

It is an empirical fact that the $f_n(A)$ observed for different series of trials are virtually the same for large $n$, clustering about a constant value $P(A)$, called the probability of $A$. Roughly speaking, the probability of $A$ equals the fraction of trials leading to the occurrence of $A$ in a large series of trials.
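The clustering of relative frequencies can be illustrated by simulation. The sketch below assumes a fair coin simulated with Python's pseudo-random generator, with $A$ the event "heads"; the seeds and the number of trials are arbitrary choices, and a simulation only mimics, rather than establishes, the empirical fact described above.

```python
import random

def relative_frequency(n, seed):
    """f_n(A) = n(A) / n for A = 'heads' in n simulated tosses of a fair coin."""
    rng = random.Random(seed)          # separate generator per series of trials
    occurrences = sum(1 for _ in range(n) if rng.random() < 0.5)
    return occurrences / n

# Three different series of n = 100000 trials each
frequencies = [relative_frequency(100_000, seed) for seed in range(3)]
print(frequencies)
```

The three values differ from series to series but should all lie close to 0.5.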

2.4 COMBINATORIAL RESULTS

Whenever equal probabilities are assigned to the elements of a finite sample space, computation of probabilities of events reduces to counting the points comprising the events.

Theorem 2.1 Given $n$ elements $a_1, \ldots, a_n$ and $m$ elements $b_1, \ldots, b_m$, there are exactly $nm$ distinct ordered pairs $(a_i, b_j)$ containing one element of each kind.

Thus, if one experiment has $n$ possible outcomes and another experiment has $m$ possible outcomes, there are $nm$ possible outcomes for the two experiments.

More generally we have:

Theorem 2.2 Given $n_1$ elements $a_1, \ldots, a_{n_1}$, $n_2$ elements $b_1, \ldots, b_{n_2}$, etc., up to $n_r$ elements $x_1, \ldots, x_{n_r}$, there are $n_1 n_2 \cdots n_r$ distinct ordered $r$-tuples $(a_{i_1}, b_{i_2}, \ldots, x_{i_r})$ containing one element of each kind.

Thus, if there are $r$ experiments, where the first has $n_1$ possible outcomes, the second $n_2$, ..., and the $r$th $n_r$ possible outcomes, there are a total of $n_1 n_2 \cdots n_r$ possible outcomes for the $r$ experiments.

A permutation is an ordered arrangement of objects. An ordered sample of size $r$ is a permutation of $r$ objects obtained from a set of $n$ elements.

There are two ways of obtaining samples: sampling with replacement and sampling without replacement. Notice that, without replacement, only samples of size $r \le n$ are possible.


Theorem 2.3 Given a set of $n$ elements and sample size $r$, there are $n^r$ different ordered samples with replacement, and

$$n(n-1)(n-2) \cdots (n-r+1) = \frac{n!}{(n-r)!}$$

different ordered samples without replacement.

Theorem 2.3 implies that the number of permutations or orderings of n elements is equal to n!.
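The counts in Theorem 2.3 can be confirmed by listing the samples explicitly. A small sketch, assuming a set of $n = 4$ elements and sample size $r = 2$ (values chosen for illustration); `itertools.product` enumerates ordered samples with replacement and `itertools.permutations` those without.

```python
from itertools import permutations, product
from math import factorial, perm

elements = ["a", "b", "c", "d"]   # n = 4
r = 2

# Ordered samples of size r with replacement: n**r of them
with_replacement = list(product(elements, repeat=r))

# Ordered samples of size r without replacement: n! / (n - r)! of them
without_replacement = list(permutations(elements, r))

# All orderings of the n elements: n! of them
orderings = list(permutations(elements))

print(len(with_replacement), len(without_replacement), len(orderings))
```

`math.perm(n, r)` (available from Python 3.8) returns $n!/(n-r)!$ directly.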

A combination is a set of elements without repetitions and without regard to ordering. For example, $\{a, b\}$ and $\{b, a\}$ are different permutations but only one combination. Thus, a combination is an unordered sample of a given size drawn without replacement from a finite set of objects.

Theorem 2.4 The number of possible combinations of $n$ objects taken $r$ at a time ($r \le n$) is equal to

$$C_r^n = \frac{n!}{r!\,(n-r)!} = \binom{n}{r}.$$

Proof. Since the number of ordered samples is equal to the number of unordered samples times the number of ways to order each sample, we have that

$$\frac{n!}{(n-r)!} = C_r^n \, r!,$$

from which

$$C_r^n = \frac{n!}{r!\,(n-r)!}.$$

□

The number $C_r^n$ is called a binomial coefficient, since it occurs in the binomial expansion

$$(a + b)^n = \sum_{r=0}^{n} \binom{n}{r} a^{n-r} b^r.$$

More generally, simple induction gives the following:

Theorem 2.5 Given a set of $n$ elements, let $n_1, \ldots, n_k$ be positive integers such that $\sum_{i=1}^{k} n_i = n$. Then there are

$$\binom{n}{n_1\, n_2 \cdots n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!} \tag{2.2}$$

ways of partitioning the set into $k$ unordered samples without replacement of sizes $n_1, \ldots, n_k$ respectively.

The numbers (2.2) are called multinomial coefficients.
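Binomial and multinomial coefficients are easy to compute and to cross-check numerically. In the sketch below the helper `multinomial` is our own (hypothetical) function, and the values $n = 5$, $r = 2$, $a = 2$, $b = 3$ are arbitrary.

```python
from itertools import combinations
from math import comb, factorial

def multinomial(*sizes):
    """n! / (n_1! n_2! ... n_k!) with n equal to the sum of the sizes."""
    result = factorial(sum(sizes))
    for size in sizes:
        result //= factorial(size)   # each intermediate quotient is an integer
    return result

n, r = 5, 2

# Unordered samples of size r without replacement
unordered_samples = list(combinations(range(n), r))

# Right-hand side of the binomial expansion of (a + b)^n with a = 2, b = 3
binomial_sum = sum(comb(n, k) * 2 ** (n - k) * 3 ** k for k in range(n + 1))

print(len(unordered_samples), binomial_sum, multinomial(2, 2, 1))
```

With $k = 2$ groups the multinomial coefficient reduces to the binomial coefficient, which the last test-style comparison `multinomial(2, 3) == comb(5, 2)` makes explicit.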


2.5 FINITE PROBABILITY SPACES

The definition of probability in terms of equiprobable events is circular. On the other hand, defining probabilities as limits of relative frequencies in independent trials under identical conditions is far too restrictive. To avoid these problems we shall now present a purely axiomatic treatment of probabilities.

Definition 2.1 A sample space $\Omega$ is called a finite probability space if $\Omega$ is finite and for every event $A \subseteq \Omega$ there is defined a real number $P(A)$, called the probability of the event $A$, such that:

A.1: $P(A) \ge 0$;

A.2: $P(\Omega) = 1$;

A.3: if $A_1$ and $A_2$ are mutually exclusive events in $\Omega$, then $P(A_1 \cup A_2) = P(A_1) + P(A_2)$.

□

It follows from Definition 2.1 that, for any subsets $A$ and $B$ of $\Omega$,

$$0 \le P(A) \le 1, \tag{2.3}$$

$$P(A^c) = 1 - P(A), \tag{2.4}$$

$$P(\emptyset) = 0, \tag{2.5}$$

$$A \subseteq B \implies P(A) \le P(B). \tag{2.6}$$

Further,

$$P(A \cup B) = P(A) + P(B) - P(AB) \quad \text{(Covering theorem)}. \tag{2.7}$$

This implies the following upper bound on $P(A \cup B)$:

$$P(A \cup B) \le P(A) + P(B),$$

with equality if and only if $P(AB) = 0$, in particular when $A$ and $B$ are disjoint.

Notice that $B = AB \cup A^cB$, where $AB$ and $A^cB$ are mutually exclusive events. Hence $P(B) = P(AB) + P(A^cB)$ and therefore $P(B) - P(AB) = P(A^cB)$. Substituting in (2.7) gives

$$P(A \cup B) = P(A) + P(A^cB) \quad \text{(Addition law)}. \tag{2.8}$$

Also notice that, by De Morgan's law and the Covering theorem,

$$1 - P(AB) = P((AB)^c) = P(A^c \cup B^c) \le P(A^c) + P(B^c).$$

This implies

$$P(AB) \ge 1 - P(A^c) - P(B^c) \quad \text{(Bonferroni inequality)}, \tag{2.9}$$

with equality if and only if $P(A^cB^c) = 0$, in particular when $A^c$ and $B^c$ are disjoint.

More generally, if $A_1, \ldots, A_n$ is a finite collection of events in $\Omega$, then $A_1, A_1^c A_2, \ldots, A_1^c A_2^c \cdots A_{n-1}^c A_n$ form a partition of $\bigcup_{i=1}^{n} A_i$, and so

$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = P(A_1) + P(A_1^c A_2) + \cdots + P(A_1^c A_2^c \cdots A_{n-1}^c A_n).$$

This result generalizes the Addition law (2.8). Since $A_1^c A_2^c \cdots A_{n-1}^c A_n \subseteq A_n$ for all $n \ge 1$, we also have

$$P\Big(\bigcup_{i=1}^{n} A_i\Big) \le \sum_{i=1}^{n} P(A_i).$$

This result generalizes the Covering theorem (2.7). Finally, the generalization of the Bonferroni inequality (2.9) is

$$P\Big(\bigcap_{i=1}^{n} A_i\Big) \ge 1 - \sum_{i=1}^{n} P(A_i^c).$$
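The Covering theorem, the Addition law and the Bonferroni inequality can be checked on a concrete finite probability space. The sketch below assumes two fair dice with equiprobable points (an example of our own, not from the text), with $A$ and $B$ the events that the first and the second die show a six.

```python
from fractions import Fraction
from itertools import product

# Two fair dice: 36 equiprobable points such as "16" or "66"
omega = [a + b for a, b in product("123456", repeat=2)]
universe = set(omega)

def prob(event):
    """Probability of an event under equiprobable points."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "6"}   # first die shows 6
B = {w for w in omega if w[1] == "6"}   # second die shows 6

# (2.7) Covering theorem
covering = prob(A | B) == prob(A) + prob(B) - prob(A & B)

# (2.8) Addition law, with A^c B written as (universe - A) & B
addition = prob(A | B) == prob(A) + prob((universe - A) & B)

# (2.9) Bonferroni inequality
bonferroni = prob(A & B) >= 1 - prob(universe - A) - prob(universe - B)

print(prob(A | B))
```

In this example the Bonferroni bound is negative and therefore uninformative; the bound becomes useful when both events have probability close to one.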

2.6 MEASURABLE SPACES AND MEASURES

For infinite sample spaces, some modifications of the axioms A.1-A.3 and some additional concepts of set theory are required. The reason is that some subsets of an infinite sample space may be so irregular that it is not possible to assign a probability to them.

A set whose elements are subsets of $\Omega$ will be called a class of sets in $\Omega$. When a set operation performed on sets in a class $\mathcal{A}$ gives as a result sets which also belong to $\mathcal{A}$, we say that $\mathcal{A}$ is closed under the given operation.

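Definition 2.2 A nonempty class $\mathcal{A}$ of subsets of $\Omega$ is called a field or an algebra on $\Omega$ if it contains $\Omega$ and is closed under complementation and finite unions, that is,

$$\Omega \in \mathcal{A},$$

$$A \in \mathcal{A} \implies A^c \in \mathcal{A}, \tag{2.10}$$

$$A_i \in \mathcal{A},\ i = 1, \ldots, n \implies \bigcup_{i=1}^{n} A_i \in \mathcal{A}. \tag{2.11}$$

□

By De Morgan's laws, (2.10) and (2.11) together imply

$$A_i \in \mathcal{A},\ i = 1, \ldots, n \implies \bigcap_{i=1}^{n} A_i \in \mathcal{A}.$$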


Thus, all standard set operations (union, intersection and complementation) can be performed any finite number of times on the elements of a field $\mathcal{A}$ without obtaining a set not in $\mathcal{A}$.

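Definition 2.3 A field $\mathcal{A}$ on $\Omega$ is a $\sigma$-field or a $\sigma$-algebra if it is closed under countable unions, that is, if

$$A_i \in \mathcal{A},\ i = 1, 2, \ldots \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}. \tag{2.12}$$

□

By De Morgan's laws, (2.10) and (2.12) together imply

$$A_i \in \mathcal{A},\ i = 1, 2, \ldots \implies \bigcap_{i=1}^{\infty} A_i \in \mathcal{A}.$$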

Thus, all standard set operations can be performed any countable number of times on the elements of a $\sigma$-field $\mathcal{A}$ without obtaining a set not in $\mathcal{A}$.

If $\mathcal{A}$ is a class of subsets of $\Omega$, the smallest field ($\sigma$-field) containing $\mathcal{A}$ is called the field ($\sigma$-field) generated by $\mathcal{A}$. It can be verified that the field ($\sigma$-field) generated by $\mathcal{A}$ is equal to the intersection of all fields ($\sigma$-fields) containing $\mathcal{A}$.
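On a finite sample space the generated field can be computed by closing the class under complementation and pairwise unions until nothing new appears. This is only a sketch for the finite case, where every field is automatically a $\sigma$-field; the function name and the example class are our own assumptions.

```python
from itertools import combinations

def generated_field(omega, generators):
    """Smallest field on a finite omega containing the given subsets."""
    omega = frozenset(omega)
    field = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(field)
        # close under complementation
        for a in current:
            if omega - a not in field:
                field.add(omega - a)
                changed = True
        # close under (pairwise, hence finite) unions
        for a, b in combinations(current, 2):
            if a | b not in field:
                field.add(a | b)
                changed = True
    return field

omega = {1, 2, 3, 4}
field = generated_field(omega, [{1}, {2}])
print(sorted(sorted(s) for s in field))
```

The class $\{\{1\}, \{2\}\}$ separates the points 1 and 2 but not 3 and 4, so the generated field has the three atoms $\{1\}$, $\{2\}$, $\{3, 4\}$ and $2^3 = 8$ members.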

If $\mathcal{A}$ is a $\sigma$-field on $\Omega$, the pair $(\Omega, \mathcal{A})$ is called a measurable space. A subset $A$ of $\Omega$ is said to be measurable if $A \in \mathcal{A}$. Given a space $\Omega$, it is generally possible to define many $\sigma$-fields on $\Omega$. To distinguish between them, the members of a given $\sigma$-field $\mathcal{A}$ on $\Omega$ will be called $\mathcal{A}$-measurable sets.

Example 2.3 An important $\sigma$-field on the real line $\mathbb{R}$ is that generated by the class of all bounded semi-closed intervals of the form $(a, b]$, $-\infty < a < b < \infty$. This $\sigma$-field is called the Borel field on $\mathbb{R}$ and denoted by $\mathcal{B}$. Its elements are called the Borel sets. Since $\mathcal{B}$ is a $\sigma$-field, repeated finite and countable set theoretic operations on its elements will never lead outside $\mathcal{B}$. The measurable space $(\mathbb{R}, \mathcal{B})$ is called the Borel line.

Notice that $\mathcal{B}$ would equivalently be generated by all the open half-lines of $\mathbb{R}$, all the open intervals of $\mathbb{R}$, or all the closed intervals of $\mathbb{R}$. □

A set function is a function defined on a class of sets.

Definition 2.4 A measure $\mu$ on a measurable space $(\Omega, \mathcal{A})$ is a nonnegative set function $\mu$ defined for all sets of $\mathcal{A}$ and satisfying:

M.1: $\mu(\emptyset) = 0$;

M.2: (Countable additivity) if $\{A_i\}$ is any countable sequence of disjoint $\mathcal{A}$-measurable sets, then

$$\mu\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} \mu(A_i).$$


□

Clearly, countable additivity implies finite additivity, that is, if $A_1, \ldots, A_n$ is a finite collection of disjoint measurable sets, then

$$\mu\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i=1}^{n} \mu(A_i).$$

Example 2.4 Let $f$ be a nonnegative function of the points of a set $\Omega$. Let the $\sigma$-field $\mathcal{A}$ consist of all countable subsets of $\Omega$. A measure $\mu$ on $(\Omega, \mathcal{A})$ is then defined as

$$\mu(\emptyset) = 0, \qquad \mu(\{\omega_1, \ldots, \omega_n\}) = \sum_{i=1}^{n} f(\omega_i).$$

If $f = 1$, then $\mu$ is called counting measure. □

It is easy to verify that if $\mu$ is a measure on $(\Omega, \mathcal{A})$, then it is monotone, that is, $\mu(A) \le \mu(B)$ whenever $A, B \in \mathcal{A}$ and $A \subset B$.

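Definition 2.5 A measure $\mu$ on $(\Omega, \mathcal{A})$ is called finite if $\mu(\Omega) < \infty$. It is called $\sigma$-finite if there exists a sequence $\{A_i\}$ of sets in $\mathcal{A}$ such that $\bigcup_{i=1}^{\infty} A_i = \Omega$ and $\mu(A_i) < \infty$, $i = 1, 2, \ldots$. □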

Example 2.5 An important $\sigma$-finite measure is the one defined on the Borel line $(\mathbb{R}, \mathcal{B})$ by $\mu((a, b]) = b - a$, the length of the interval $(a, b]$. Such a measure is called Lebesgue measure. It is easy to verify that every countable set is a Borel set of measure zero. □

Definition 2.6 If $\mu$ is a measure on $(\Omega, \mathcal{A})$, the triple $(\Omega, \mathcal{A}, \mu)$ is called a measure space. □

A measure space $(\Omega, \mathcal{A}, \mu)$ is called complete if $\mathcal{A}$ contains all subsets of sets of measure zero, that is, if $A \in \mathcal{A}$, $B \subset A$, and $\mu(A) = 0$, then $B \in \mathcal{A}$. It can be shown that each measure space can be completed by the addition of subsets of sets of measure zero.

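If $\mu$ is a $\sigma$-finite measure defined on $(\Omega, \mathcal{A})$ and $\mathcal{F}(\mathcal{A})$ is the $\sigma$-field generated by $\mathcal{A}$, then it can be shown that there exists a unique measure $\mu^*$ on $(\Omega, \mathcal{F}(\mathcal{A}))$ such that $\mu^*(A) = \mu(A)$ for all $A \in \mathcal{A}$. Further, $\mu^*$ is also $\sigma$-finite. Such a measure is called the extension of $\mu$.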

Definition 2.7 A measure space $(\Omega, \mathcal{A}, P)$ is a probability space if $P$ is a $\sigma$-finite measure with $P(\Omega) = 1$. □

2.7 PROBABILITY SPACES

From Definition 2.7, a probability space is a triple $(\Omega, \mathcal{A}, P)$, where $\Omega$ is the sample space associated with an experiment, $\mathcal{A}$ is a $\sigma$-field on $\Omega$, and the probability measure $P$ is a real-valued function defined for all sets in $\mathcal{A}$ and satisfying:

P.1: $P(A) \ge 0$ for all $A \in \mathcal{A}$;

P.2: $P(\Omega) = 1$;

P.3: (Countable additivity) if $\{A_i\}$ is a countable sequence of disjoint sets in $\mathcal{A}$, then

$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i).$$

If $(\Omega, \mathcal{A}, P)$ is a probability space, then the sets in $\mathcal{A}$ are interpreted as possible events associated with an experiment. For any $A \in \mathcal{A}$, the real number $P(A)$ is called the probability of the event $A$. A support of $P$ is any set $A \in \mathcal{A}$ for which $P(A) = 1$.

If $\Omega$ is a finite sample space and $\mathcal{A}$ is the set of all the events in $\Omega$ (the collection of all subsets of $\Omega$), then properties P.1-P.3 are equivalent to A.1-A.3 that define a finite probability space.

As a consequence of properties P.1-P.3, relationships (2.3)-(2.9) hold for any $A, B \in \mathcal{A}$. Further, their generalizations hold for any finite collection of events in $\mathcal{A}$. Notice that the generalization of the Covering theorem (2.7) can be shown to hold for any countable collection of events in $\mathcal{A}$. Further, if $\{A_i\}$ is a countable collection of events in $\mathcal{A}$ such that $A_1 \subseteq A_2 \subseteq \cdots$, then

$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \lim_{n \to \infty} P(A_n).$$

2.8 CONDITIONAL PROBABILITY

Let $(\Omega, \mathcal{A}, P)$ be a probability space and let $B \in \mathcal{A}$ be an event such that $P(B) > 0$. If we know that $B$ occurred, then the relevant sample space becomes $B$ rather than $\Omega$. This justifies defining the conditional probability of $A$ given $B$ as

$$P(A \mid B) = \frac{P(AB)}{P(B)} \tag{2.13}$$

if $P(B) > 0$, and $P(A \mid B) = 0$ if $P(B) = 0$. It is easy to verify that, when $P(B) > 0$, the function $P(\cdot \mid B)$ defined on $\mathcal{A}$ is a probability measure on $(\Omega, \mathcal{A})$, that is, it satisfies P.1-P.3. We call $P(\cdot \mid B)$ the conditional probability measure given $B$.

Notice that (2.13) can equivalently be written as

$$P(AB) = P(A \mid B)\, P(B).$$

This result, called the Multiplication law, provides a convenient way of finding $P(AB)$ whenever $P(A \mid B)$ and $P(B)$ are easy to find.

The Multiplication law can be generalized to a finite collection of events $A_1, \ldots, A_n$ in $\mathcal{A}$:

$$\begin{aligned} P(A_1 \cdots A_n) &= P(A_n \mid A_{n-1} \cdots A_1)\, P(A_{n-1} \cdots A_1) \\ &= P(A_n \mid A_{n-1} \cdots A_1)\, P(A_{n-1} \mid A_{n-2} \cdots A_1)\, P(A_{n-2} \cdots A_1), \end{aligned}$$

and so on. Thus

$$P(A_1 \cdots A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_2 A_1) \cdots P(A_n \mid A_{n-1} \cdots A_1).$$
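As a worked instance of the generalized Multiplication law, consider drawing three cards without replacement from a standard 52-card deck and let $A_i$ be the event that the $i$-th card is an ace. This example is ours, not the text's; the sketch compares the chain of conditional probabilities with a direct count of combinations.

```python
from fractions import Fraction
from math import comb

# P(A1) P(A2 | A1) P(A3 | A2 A1): 4 aces in 52 cards, then 3 in 51, then 2 in 50
p_chain = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)

# Direct count: unordered triples of aces over unordered triples of cards
p_count = Fraction(comb(4, 3), comb(52, 3))

print(p_chain, p_count)
```

Both routes give the same probability, as the law requires.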

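Now consider a countable collection $\{B_i\}$ of disjoint events in $\mathcal{A}$ such that $P(B_i) > 0$ for every $i$ and $\bigcup_{i=1}^{\infty} B_i = \Omega$. Clearly,

$$P\Big(\bigcup_{i=1}^{\infty} B_i\Big) = \sum_{i=1}^{\infty} P(B_i) = 1.$$

For any $A \in \mathcal{A}$,

$$\begin{aligned} P(A) &= P\Big(A \cap \Big(\bigcup_{i=1}^{\infty} B_i\Big)\Big) + P\Big(A \cap \Big(\bigcup_{i=1}^{\infty} B_i\Big)^c\Big) \\ &= P\Big(A \cap \Big(\bigcup_{i=1}^{\infty} B_i\Big)\Big), \end{aligned}$$

where we used the fact that

$$P\Big(\Big(\bigcup_{i=1}^{\infty} B_i\Big)^c\Big) = 1 - P\Big(\bigcup_{i=1}^{\infty} B_i\Big) = 0.$$

Thus, by the distributive law,

$$P(A) = P\Big(\bigcup_{i=1}^{\infty} AB_i\Big) = \sum_{i=1}^{\infty} P(AB_i),$$

since $AB_i \cap AB_j = \emptyset$ for all $i \ne j$. Therefore

$$P(A) = \sum_{i=1}^{\infty} P(A \mid B_i)\, P(B_i),$$

which is called the Law of total probabilities.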

Now let $A \in \mathcal{A}$ be such that $P(A) > 0$, and consider computing the conditional probability $P(B_j \mid A)$ given knowledge of $\{P(A \mid B_i)\}$ and $\{P(B_i)\}$. By the definition of conditional probability and the Multiplication law,

$$P(B_j \mid A) = \frac{P(B_j A)}{P(A)} = \frac{P(A \mid B_j)\, P(B_j)}{P(A)}$$

for any fixed $j = 1, 2, \ldots$. Therefore, by the Law of total probabilities,

$$P(B_j \mid A) = \frac{P(A \mid B_j)\, P(B_j)}{\sum_{i=1}^{\infty} P(A \mid B_i)\, P(B_i)},$$

which is called Bayes' rule.
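The Law of total probabilities and Bayes' rule translate directly into code when the collection $\{B_i\}$ is finite. The sketch below assumes a hypothetical two-urn experiment: an urn is chosen with probability 1/2 each, and $A$ is the event of drawing a white ball, with $P(A \mid B_1) = 3/4$ and $P(A \mid B_2) = 1/4$. The numbers and function names are illustrative assumptions.

```python
from fractions import Fraction

def total_probability(likelihoods, priors):
    """P(A) = sum_i P(A | B_i) P(B_i) for a finite partition {B_i}."""
    return sum(l * p for l, p in zip(likelihoods, priors))

def bayes(j, likelihoods, priors):
    """P(B_j | A) = P(A | B_j) P(B_j) / sum_i P(A | B_i) P(B_i)."""
    return likelihoods[j] * priors[j] / total_probability(likelihoods, priors)

priors = [Fraction(1, 2), Fraction(1, 2)]        # P(B_1), P(B_2)
likelihoods = [Fraction(3, 4), Fraction(1, 4)]   # P(A | B_1), P(A | B_2)

p_A = total_probability(likelihoods, priors)
posterior = [bayes(j, likelihoods, priors) for j in range(2)]
print(p_A, posterior)
```

The posterior probabilities sum to one, as they must, since $P(\cdot \mid A)$ is itself a probability measure.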


2.9 INDEPENDENCE

Let $A, B \in \mathcal{A}$ be two events with non-zero probability. If knowing that $B$ occurred gives no information about whether or not $A$ occurred, then the probability assigned to $A$ should not be modified by the knowledge that $B$ occurred. Hence

$$P(A \mid B) = P(A),$$

and so

$$P(AB) = P(A)\, P(B). \tag{2.14}$$

Two events $A, B \in \mathcal{A}$ are said to be (pairwise) independent if (2.14) holds.

Notice that this definition of independence is symmetric in $A$ and $B$, and also covers the case when $P(A) = 0$ or $P(B) = 0$. It is easy to show that if $A$ and $B$ are independent, then $A$ and $B^c$ as well as $A^c$ and $B^c$ are independent.

Three events $A, B, C \in \mathcal{A}$ are said to be (mutually) independent if they are pairwise independent and

$$P(ABC) = P(A)\, P(B)\, P(C).$$

This condition is necessary, for pairwise independence does not ensure that, for example, $P((AB)C) = P(AB)\, P(C)$.

It is easy to verify that if $A$, $B$ and $C$ are independent events, then $A \cup B$ and $C$ are independent, and $A \cap B$ and $C$ are independent.
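A standard example shows that pairwise independence does not imply mutual independence. The sketch below assumes two tosses of a fair coin with equiprobable points and takes $A$ = "first toss is H", $B$ = "second toss is H", $C$ = "both tosses agree"; the choice of events is ours.

```python
from fractions import Fraction
from itertools import product

omega = ["".join(s) for s in product("HT", repeat=2)]

def prob(event):
    """Probability of an event under equiprobable points."""
    return Fraction(len(event), len(omega))

A = {"HH", "HT"}   # first toss is H
B = {"HH", "TH"}   # second toss is H
C = {"HH", "TT"}   # both tosses agree

# (2.14) holds for each of the three pairs
pairwise = all(
    prob(X & Y) == prob(X) * prob(Y)
    for X, Y in [(A, B), (A, C), (B, C)]
)

# The additional product condition for three events fails
mutual = pairwise and prob(A & B & C) == prob(A) * prob(B) * prob(C)

print(pairwise, mutual)
```

Here $P(ABC) = 1/4$ while $P(A)\,P(B)\,P(C) = 1/8$, so the three events are pairwise but not mutually independent.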

More generally, a family $\mathcal{A}$ of events is (mutually) independent if, for every finite collection $A_1, \ldots, A_n$ of events in $\mathcal{A}$,

$$P\Big(\bigcap_{i=1}^{n} A_i\Big) = \prod_{i=1}^{n} P(A_i). \tag{2.15}$$
