Introduction to discrete random variables

2.3 Multiple random variables

If X and Y are random variables, we use the shorthand

{X ∈ B, Y ∈ C} := {ω ∈ Ω : X(ω) ∈ B and Y(ω) ∈ C}, which is equal to

{ω ∈ Ω : X(ω) ∈ B} ∩ {ω ∈ Ω : Y(ω) ∈ C}.

Putting all of our shorthand together, we can write

{X ∈ B,Y ∈ C} = {X ∈ B} ∩ {Y ∈ C}.

We also have

P(X ∈ B,Y ∈ C) := P({X ∈ B,Y ∈ C})

= P({X ∈ B} ∩ {Y ∈ C}).

Independence

If the events {X ∈ B} and {Y ∈ C} are independent for all sets B and C, we say that X and Y are independent random variables. In light of this definition and the above shorthand, we see that X and Y are independent random variables if and only if

P(X ∈ B, Y ∈ C) = P(X ∈ B)P(Y ∈ C) (2.5)

for all sets³ B and C.

Example 2.8. On a certain aircraft, the main control circuit on an autopilot fails with probability p. A redundant backup circuit fails independently with probability q. The aircraft can fly if at least one of the circuits is functioning. Find the probability that the aircraft cannot fly.

Solution. We introduce two random variables, X and Y. We set X = 1 if the main circuit fails, and X = 0 otherwise. We set Y = 1 if the backup circuit fails, and Y = 0 otherwise. Then P(X = 1) = p and P(Y = 1) = q. We assume X and Y are independent random variables. Then the event that the aircraft cannot fly is modeled by

{X = 1} ∩ {Y = 1}.

Using the independence of X and Y, P(X = 1, Y = 1) = P(X = 1)P(Y = 1) = pq.

The random variables X and Y of the preceding example are said to be Bernoulli. To indicate the relevant parameters, we write X ∼ Bernoulli(p) and Y ∼ Bernoulli(q). Bernoulli random variables are good for modeling the result of an experiment having two possible outcomes (numerically represented by 0 and 1), e.g., a coin toss, testing whether a certain block on a computer disk is bad, whether a new radar system detects a stealth aircraft, whether a certain Internet packet is dropped due to congestion at a router, etc. The Bernoulli(p) pmf is sketched in Figure 2.6.

Figure 2.6. Bernoulli(p) probability mass function pX(i) with p > 1/2 (mass 1 − p at i = 0 and mass p at i = 1).

Given any finite number of random variables, say X1,...,Xn, we say they are independent if

$$P\Bigl(\bigcap_{j=1}^{n}\{X_j \in B_j\}\Bigr) = \prod_{j=1}^{n} P(X_j \in B_j) \qquad (2.6)$$

for all choices of the sets B1,...,Bn. If X1,X2,... is an infinite sequence of random variables, we say that they are independent if (2.6) holds for every finite n = 1,2,....

If for every B, P(Xj ∈ B) does not depend on j, then we say the Xj are identically distributed. If the Xj are both independent and identically distributed, we say they are i.i.d.

Example 2.9. Let X, Y, and Z be the number of hits at a website on three consecutive days. Assuming they are i.i.d. Poisson(λ) random variables, find the probability that on each day the number of hits is at most n.

Solution. The probability that on each day the number of hits is at most n is

P(X ≤ n, Y ≤ n, Z ≤ n).

By independence, this is equal to

P(X ≤ n)P(Y ≤ n)P(Z ≤ n).

Since the random variables are identically distributed, each factor has the same value. Since the random variables are Poisson(λ), each factor is equal to

$$P(X \le n) = \sum_{k=0}^{n} \frac{\lambda^k e^{-\lambda}}{k!},$$

and so

$$P(X \le n, Y \le n, Z \le n) = \Bigl(\sum_{k=0}^{n} \frac{\lambda^k e^{-\lambda}}{k!}\Bigr)^{3}.$$
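The product of three identical Poisson cumulative probabilities in Example 2.9 is easy to evaluate numerically. The following is a minimal sketch; the values of lambda and n are assumptions chosen only for illustration, and the factorial is computed with gamma(k+1).

```matlab
% Sketch for Example 2.9; lambda and n are assumed, illustrative values.
lambda = 2;                 % assumed mean number of hits per day
n = 3;                      % assumed threshold on the number of hits
k = 0:n;
cdf_n = sum(exp(-lambda)*lambda.^k./gamma(k+1));  % P(X <= n)
prob = cdf_n^3;             % P(X <= n, Y <= n, Z <= n) for i.i.d. X, Y, Z
fprintf('P(X<=n) = %g, P(all three days <= n) = %g\n', cdf_n, prob)
```

Raising the single-day probability to the third power is valid only because the three variables are assumed independent and identically distributed.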

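Example 2.10. A webpage server can handle r requests per day. Find the probability that the server gets more than r requests at least once in n days. Assume that the number of requests on day i is Xi ∼ Poisson(λ) and that X1,...,Xn are independent.

Solution. We need to compute

$$P\Bigl(\bigcup_{i=1}^{n}\{X_i > r\}\Bigr) = 1 - P\Bigl(\bigcap_{i=1}^{n}\{X_i \le r\}\Bigr) = 1 - \prod_{i=1}^{n} P(X_i \le r) = 1 - \Bigl(\sum_{k=0}^{r} \frac{\lambda^k e^{-\lambda}}{k!}\Bigr)^{n}.$$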

Max and min problems

Calculations similar to those in the preceding example can be used to find probabilities involving the maximum or minimum of several independent random variables.

Example 2.11. For i = 1,...,n, let Xi model the yield on the ith production run of an integrated circuit manufacturer. Assume yields on different runs are independent. Find the probability that the highest yield obtained is less than or equal to z, and find the probability that the lowest yield obtained is less than or equal to z.

Solution. We must evaluate

P(max(X1,...,Xn) ≤ z) and P(min(X1,...,Xn) ≤ z).

Observe that max(X1,...,Xn) ≤ z if and only if all of the Xk are less than or equal to z; i.e.,

$$\{\max(X_1,\ldots,X_n) \le z\} = \bigcap_{k=1}^{n} \{X_k \le z\}.$$

It then follows that

$$P(\max(X_1,\ldots,X_n) \le z) = P\Bigl(\bigcap_{k=1}^{n} \{X_k \le z\}\Bigr) = \prod_{k=1}^{n} P(X_k \le z),$$

where the second equation follows by independence.

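For the min problem, observe that min(X1,...,Xn) ≤ z if and only if at least one of the Xi is less than or equal to z; i.e.,

$$\{\min(X_1,\ldots,X_n) \le z\} = \bigcup_{k=1}^{n} \{X_k \le z\}.$$

Hence,

$$\begin{aligned}
P(\min(X_1,\ldots,X_n) \le z) &= P\Bigl(\bigcup_{k=1}^{n} \{X_k \le z\}\Bigr) \\
&= 1 - P\Bigl(\bigcap_{k=1}^{n} \{X_k > z\}\Bigr) \\
&= 1 - \prod_{k=1}^{n} P(X_k > z).
\end{aligned}$$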
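The max and min formulas can be evaluated once a distribution for the Xk is fixed. The sketch below assumes, purely for illustration, that the Xk are i.i.d. Poisson(lambda); the example itself only requires independence, and the values of lambda, n, and z are hypothetical.

```matlab
% Sketch of the max/min formulas; the i.i.d. Poisson model and all
% numerical values are assumptions, not part of Example 2.11.
lambda = 4;  n = 5;  z = 6;
k = 0:z;
F = sum(exp(-lambda)*lambda.^k./gamma(k+1));  % P(X_k <= z), same for every k
pmax = F^n;              % P(max <= z) = product of the P(X_k <= z)
pmin = 1 - (1-F)^n;      % P(min <= z) = 1 - product of the P(X_k > z)
fprintf('P(max<=z) = %g, P(min<=z) = %g\n', pmax, pmin)
```

As expected, the max probability cannot exceed the single-variable probability F, and the min probability cannot fall below it.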

Geometric random variables

For 0 ≤ p < 1, we define two kinds of geometric random variables.

We write X ∼ geometric1(p) if

$$P(X = k) = (1 - p)p^{k-1}, \quad k = 1,2,\ldots.$$

As the example below shows, this kind of random variable arises when we ask how many times an experiment has to be performed until a certain outcome is observed.

We write X ∼ geometric0(p) if

$$P(X = k) = (1 - p)p^{k}, \quad k = 0,1,\ldots.$$

This kind of random variable arises in Chapter 12 as the number of packets queued up at an idealized router with an infinite buffer. A plot of the geometric0(p) pmf is shown in Figure 2.7.

Figure 2.7. The geometric0(p) pmf pX(k) = (1 − p)p^k with p = 0.7, plotted for k = 0,...,9.

By the geometric series formula (Problem 27 in Chapter 1), it is easy to see that the probabilities of both kinds of random variable sum to one (Problem 16).

If we put q = 1 − p, then 0 < q ≤ 1, and we can write P(X = k) = q(1 − q)^(k−1) in the geometric1(p) case and P(X = k) = q(1 − q)^k in the geometric0(p) case.

Example 2.12. When a certain computer accesses memory, the desired data is in the cache with probability p. Find the probability that the first cache miss occurs on the kth memory access. Assume presence in the cache of the requested data is independent for each access.

Solution. Let T = k if the first time a cache miss occurs is on the kth memory access. For i = 1,2,..., let Xi = 1 if the ith memory request is in the cache, and let Xi = 0 otherwise. Then P(Xi = 1) = p and P(Xi = 0) = 1 − p. The key observation is that the first cache miss occurs on the kth access if and only if the first k − 1 accesses result in cache hits and the kth access results in a cache miss. In terms of events,

$$\{T = k\} = \{X_1 = 1\} \cap \cdots \cap \{X_{k-1} = 1\} \cap \{X_k = 0\}.$$

Since the Xi are independent, taking probabilities of both sides yields

$$\begin{aligned}
P(T = k) &= P(\{X_1 = 1\} \cap \cdots \cap \{X_{k-1} = 1\} \cap \{X_k = 0\}) \\
&= P(X_1 = 1) \cdots P(X_{k-1} = 1) \cdot P(X_k = 0) \\
&= p^{k-1}(1 - p).
\end{aligned}$$

Example 2.13. In the preceding example, what is the probability that the first cache miss occurs after the third memory access?

Solution. We need to find

$$P(T > 3) = \sum_{k=4}^{\infty} P(T = k).$$

However, since P(T = k) = 0 for k ≤ 0, a finite series is obtained by writing

$$\begin{aligned}
P(T > 3) &= 1 - P(T \le 3) \\
&= 1 - \sum_{k=1}^{3} P(T = k) \\
&= 1 - (1 - p)[1 + p + p^2].
\end{aligned}$$
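Both routes to P(T > 3) in Example 2.13 can be compared numerically. In the sketch below, the value of p and the truncation point of the infinite series are assumptions made for illustration. Since (1 − p)(1 + p + p²) = 1 − p³, both results should also agree with p³.

```matlab
% Sketch for Example 2.13; p and the truncation point are assumed values.
p = 0.9;                              % assumed cache-hit probability
k = 1:3;
pT = p.^(k-1)*(1-p);                  % P(T = k) for k = 1,2,3
tail_finite = 1 - sum(pT);            % 1 - P(T <= 3)
kbig = 4:2000;                        % truncated version of the sum over k >= 4
tail_series = sum(p.^(kbig-1)*(1-p)); % approximates the infinite series
fprintf('finite: %g, series: %g, p^3: %g\n', tail_finite, tail_series, p^3)
```

The finite form is preferable in practice because it involves no truncation error.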

Joint probability mass functions

The joint probability mass function of X and Y is defined by

$$p_{XY}(x_i, y_j) := P(X = x_i, Y = y_j). \qquad (2.7)$$

An example for integer-valued random variables is sketched in Figure 2.8.

Figure 2.8. Sketch of bivariate probability mass function pXY(i, j).

It turns out that we can extract the marginal probability mass functions pX(xi) and pY(yj) from the joint pmf pXY(xi,yj) using the formulas

$$p_X(x_i) = \sum_{j} p_{XY}(x_i, y_j) \qquad (2.8)$$

and

$$p_Y(y_j) = \sum_{i} p_{XY}(x_i, y_j), \qquad (2.9)$$

which we derive later in the section.

Another important fact that we derive below is that a pair of discrete random variables is independent if and only if their joint pmf factors into the product of their marginal pmfs:

pXY(xi,yj) = pX(xi) pY(yj).

When X and Y take finitely many values, say x1,...,xm and y1,...,yn, respectively, we can arrange the probabilities pXY(xi,yj) in the m × n matrix

$$\begin{bmatrix}
p_{XY}(x_1,y_1) & p_{XY}(x_1,y_2) & \cdots & p_{XY}(x_1,y_n) \\
p_{XY}(x_2,y_1) & p_{XY}(x_2,y_2) & \cdots & p_{XY}(x_2,y_n) \\
\vdots & \vdots & \ddots & \vdots \\
p_{XY}(x_m,y_1) & p_{XY}(x_m,y_2) & \cdots & p_{XY}(x_m,y_n)
\end{bmatrix}.$$

Notice that the sum of the entries in the top row is

$$\sum_{j=1}^{n} p_{XY}(x_1, y_j) = p_X(x_1).$$

In general, the sum of the entries in the ith row is pX(xi), and the sum of the entries in the jth column is pY(yj). Since the sum of either marginal is one, it follows that the sum of all the entries in the matrix is one as well.

When X or Y takes infinitely many values, a little more thought is required.

Example 2.14. Find the marginal probability mass function pX(i) if

$$p_{XY}(i,j) := \begin{cases}
\dfrac{2[i/(i+1)]^{j}}{n(n+1)}, & j \ge 0,\ i = 0,\ldots,n-1, \\
0, & \text{otherwise.}
\end{cases}$$

Solution. For i in the range 0,...,n − 1, write

$$\begin{aligned}
p_X(i) &= \sum_{j=-\infty}^{\infty} p_{XY}(i,j) \\
&= \sum_{j=0}^{\infty} \frac{2[i/(i+1)]^{j}}{n(n+1)} \\
&= \frac{2}{n(n+1)} \cdot \frac{1}{1 - i/(i+1)},
\end{aligned}$$

by the geometric series. This further simplifies to 2(i + 1)/[n(n + 1)]. Thus,

$$p_X(i) = \begin{cases}
\dfrac{2(i+1)}{n(n+1)}, & i = 0,\ldots,n-1, \\
0, & \text{otherwise.}
\end{cases}$$

Remark. Since it is easily checked by induction that $\sum_{i=1}^{n} i = n(n+1)/2$, we can verify that $\sum_{i=0}^{n-1} p_X(i) = 1$.
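The marginal in Example 2.14 can be checked numerically by truncating the infinite sum over j. The sketch below is only an illustration: the value of n and the truncation point are assumptions, and bsxfun is used so that the code does not depend on implicit expansion.

```matlab
% Sketch for Example 2.14; n and the truncation point are assumed values.
n = 4;                                   % assumed value of n
i = (0:n-1)';                            % column vector of i values
j = 0:5000;                              % truncated range of j
r = i./(i+1);                            % ratio i/(i+1), one entry per row
pXY = 2*bsxfun(@power, r, j)/(n*(n+1));  % truncated joint pmf, n-by-5001
pX_numeric = sum(pXY, 2);                % sum over j (along each row)
pX_formula = 2*(i+1)/(n*(n+1));          % closed form derived in the example
disp([pX_numeric pX_formula])
fprintf('sum of marginal = %g\n', sum(pX_formula))
```

Note that for i = 0 the ratio is zero, and MATLAB evaluates 0^0 as 1, so only the j = 0 term contributes to that row.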

Derivation of marginal formulas (2.8) and (2.9)

Since the shorthand in (2.7) can be expanded to

$$p_{XY}(x_i, y_j) = P(\{X = x_i\} \cap \{Y = y_j\}), \qquad (2.10)$$

two applications of the law of total probability as in (1.27) can be used to show that⁵

$$P(X \in B, Y \in C) = \sum_{i:x_i \in B}\ \sum_{j:y_j \in C} p_{XY}(x_i, y_j). \qquad (2.11)$$

Let us now specialize (2.11) to the case that B is the singleton set B = {xk} and C is the biggest set possible, C = ℝ. Then (2.11) becomes

$$P(X = x_k, Y \in \mathbb{R}) = \sum_{i:x_i = x_k}\ \sum_{j:y_j \in \mathbb{R}} p_{XY}(x_i, y_j).$$

To simplify the left-hand side, we use the fact that

{Y ∈ ℝ} := {ω ∈ Ω : Y(ω) ∈ ℝ} = Ω

to write

P(X = xk, Y ∈ ℝ) = P({X = xk} ∩ Ω) = P(X = xk) = pX(xk).

To simplify the double sum on the right, note that the sum over i contains only one term, the term with i = k. Also, the sum over j is unrestricted. Putting this all together yields

$$p_X(x_k) = \sum_{j} p_{XY}(x_k, y_j).$$

This is the same as (2.8) if we change k to i. Thus, the pmf of X can be recovered from the joint pmf of X and Y by summing over all values of Y. The derivation of (2.9) is similar.

Joint PMFs and independence

Recall that X and Y are independent if

P(X ∈ B, Y ∈ C) = P(X ∈ B)P(Y ∈ C) (2.12)

for all sets B and C. In particular, taking B = {xi} and C = {yj} shows that P(X = xi, Y = yj) = P(X = xi)P(Y = yj) or, in terms of pmfs,

pXY(xi,yj) = pX(xi) pY(yj). (2.13)

We now show that the converse is also true; i.e., if (2.13) holds for all i and j, then (2.12) holds for all sets B and C. To see this, write

$$\begin{aligned}
P(X \in B, Y \in C) &= \sum_{i:x_i \in B}\ \sum_{j:y_j \in C} p_{XY}(x_i, y_j), && \text{by (2.11),} \\
&= \sum_{i:x_i \in B}\ \sum_{j:y_j \in C} p_X(x_i)\, p_Y(y_j), && \text{by (2.13),} \\
&= \Bigl[\sum_{i:x_i \in B} p_X(x_i)\Bigr] \Bigl[\sum_{j:y_j \in C} p_Y(y_j)\Bigr] \\
&= P(X \in B)P(Y \in C).
\end{aligned}$$
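The factorization argument can be illustrated with a small numerical case. In the sketch below, the marginals pX and pY and the index sets B and C are hypothetical; the joint pmf is built as an outer product so that (2.13) holds by construction, and the double sum in (2.11) is then compared with the product of the two marginal sums.

```matlab
% Sketch of (2.11)-(2.13); all numbers below are assumed for illustration.
pX = [0.2; 0.5; 0.3];          % assumed marginal pmf of X (column vector)
pY = [0.1 0.4 0.25 0.25];      % assumed marginal pmf of Y (row vector)
P  = pX*pY;                    % joint pmf with P(i,j) = pX(i)*pY(j)
B  = [1 3];                    % assumed indices i with x_i in B
C  = [2 4];                    % assumed indices j with y_j in C
lhs = sum(sum(P(B,C)));        % double sum in (2.11)
rhs = sum(pX(B))*sum(pY(C));   % P(X in B) * P(Y in C)
fprintf('double sum = %g, product of marginal sums = %g\n', lhs, rhs)
```

A joint pmf that is not an outer product of its marginals would generally fail this comparison for some choice of B and C.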

Computing probabilities with MATLAB

Example 2.15. If X ∼ geometric0(p) with p = 0.8, compute the probability that X takes the value of an odd integer between 5 and 13.

Solution. We must compute

(1 − p)[p⁵ + p⁷ + p⁹ + p¹¹ + p¹³].

The straightforward solution is

    p = 0.8;
    s = 0;
    for k = 5:2:13 % loop from 5 to 13 by steps of 2
        s = s + p^k;
    end
    fprintf('The answer is %g\n',(1-p)*s)

However, we can avoid using the for loop with the commandsᵇ

    p = 0.8;
    pvec = (1-p)*p.^[5:2:13];
    fprintf('The answer is %g\n',sum(pvec))

The answer is 0.162. In this script, the expression [5:2:13] generates the vector [5 7 9 11 13]. Next, the “dot notation” p.^[5 7 9 11 13] means that MATLAB should do exponentiation on each component of the vector. In this case, MATLAB computes [p⁵ p⁷ p⁹ p¹¹ p¹³]. Then each component of this vector is multiplied by the scalar 1 − p. This new vector is stored in pvec. Finally, the command sum(pvec) adds up the components of the vector.

ᵇ Because MATLAB programs are usually not compiled but run through the interpreter, loops require a lot of execution time. By using vectorized commands instead of loops, programs run much faster.

Example 2.16. A light sensor uses a photodetector whose output is modeled as a Poisson(λ) random variable X. The sensor triggers an alarm if X > 15. If λ = 10, compute P(X > 15).

Solution. First note that

$$P(X > 15) = 1 - P(X \le 15) = 1 - \sum_{k=0}^{15} \frac{\lambda^k e^{-\lambda}}{k!}.$$

Next, since k! = Γ(k + 1), where Γ is the gamma function, we can compute the required probability with the commands

    lambda = 10;
    k = [0:15]; % k = [ 0 1 2 ... 15 ]
    pvec = exp(-lambda)*lambda.^k./gamma(k+1);
    fprintf('The answer is %g\n',1-sum(pvec))

The answer is 0.0487. Note the operator ./ which computes the quotients of corresponding vector components; thus,

$$\texttt{pvec} = e^{-\lambda}\Bigl[\ \frac{\lambda^{0}}{0!}\ \ \frac{\lambda^{1}}{1!}\ \ \cdots\ \ \frac{\lambda^{15}}{15!}\ \Bigr].$$

We can use MATLAB for more sophisticated calculations such as P(g(X) ≤ y) in many cases in which X is a discrete random variable and g(x) is a function that MATLAB can compute.

Example 2.17. Let X be a uniform random variable on 0,...,100. Assuming that g(x) := cos(2πx/10), compute P(g(X) ≤ 1/2).

Solution. This can be done with the simple script

    p = ones(1,101)/101; % p(i) = P(X=i-1) = 1/101, i = 1,...,101
    k = [0:100];         % values taken by X
    i = find(cos(2*pi*k/10) <= 1/2); % positions where g(k) <= 1/2
    fprintf('The answer is %g\n',sum(p(i)))

The answer is 0.693.

Remark. The MATLAB Statistics Toolbox provides commands for computing several probability mass functions. In particular, we could have used geopdf(k,1-p) for the geometric0(p) pmf and poisspdf(k,lambda) for the Poisson(λ) pmf.

We next use MATLAB for calculations involving pairs of random variables.

Example 2.18. The input X and output Y of a system subject to random perturbations are described probabilistically by the joint pmf pXY(i, j), where i = 1,2,3 and j = 1,2,3,4,5. Let P denote the matrix whose ij entry is pXY(i, j), and suppose that

P = 1

Find the marginal pmfs pX(i) and pY( j).

Solution. The marginal pY(j) is obtained by summing the columns of the matrix. This is exactly what the MATLAB command sum does with a matrix. Thus, if P is already defined, the commands

    format rat % print numbers as ratios of small integers
    pY = sum(P)

yield

    pY =
        13/71 8/71 21/71 15/71 14/71

Similarly, the marginal pX(i) is obtained by summing the rows of P. Since sum computes column sums, the easy way around this is to use the transpose of P instead of P. The apostrophe ' is used to compute transposes. Hence, the command pX = sum(P')' computes column sums on the transpose of P, which yields a row vector; the second transpose operation converts the row into a column. We find that

    pX =
        26/71
        25/71
        20/71

Example 2.19. Let X and Y be as in the previous example, and let g(x,y) be a given function. Find P(g(X,Y) < 6).

Solution. The first step is to create a 3 × 5 matrix with entries g(i, j). We then find those pairs (i, j) with g(i, j) < 6 and then sum the corresponding entries of P. Here is one way to do this, assuming P and the function g are already defined.

    gmat = zeros(3,5); % preallocate the matrix of g values
    for i = 1:3
        for j = 1:5
            gmat(i,j) = g(i,j);
        end
    end
    prob = sum(P(find(gmat<6))) % add the entries of P where g(i,j) < 6

If g(x,y) = xy, the answer is 34/71. A way of computing gmat without loops is given in the problems.
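The calculations of Examples 2.18 and 2.19 can be run end to end once a concrete matrix is supplied. The matrix P below is an assumption: it is not taken from the text of this section, but was chosen because it reproduces the marginals and the value 34/71 quoted above. The outer product used to build gmat is one loop-free option that works for g(x,y) = xy specifically; it is not necessarily the method intended in the problems.

```matlab
% ASSUMPTION: P is a hypothetical matrix consistent with the quoted
% marginals (13 8 21 15 14)/71, (26 25 20)/71 and the answer 34/71.
P = [7 2 8 5 4; 4 2 5 5 9; 2 4 8 5 1]/71;
pY = sum(P);               % column sums give the marginal pmf of Y
pX = sum(P, 2);            % row sums give the marginal pmf of X
gmat = (1:3)'*(1:5);       % gmat(i,j) = i*j, built without loops
prob = sum(P(gmat < 6));   % P(g(X,Y) < 6) via logical indexing
fprintf('71*pY = %s\n', mat2str(round(71*pY)))
fprintf('71*pX = %s\n', mat2str(round(71*pX)'))
fprintf('71*prob = %g\n', 71*prob)
```

Using sum(P, 2) avoids the double transpose, and logical indexing makes the call to find unnecessary.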

2.4 Expectation

The definition of expectation is motivated by the conventional idea of numerical average. Recall that the numerical average of n numbers, say a1,...,an, is

$$\frac{1}{n} \sum_{k=1}^{n} a_k.$$

We use the average to summarize or characterize the entire collection of numbers a1,...,an with a single “typical” value.

Example 2.20. The average of the 10 numbers 5, 2, 3, 2, 5, −2, 3, 2, 5, 2 is

$$\frac{5 + 2 + 3 + 2 + 5 + (-2) + 3 + 2 + 5 + 2}{10} = \frac{27}{10} = 2.7.$$

Notice that in our collection of numbers, −2 occurs once, 2 occurs four times, 3 occurs two times, and 5 occurs three times. In other words, their relative frequencies are

    −2 : 1/10
     2 : 4/10
     3 : 2/10
     5 : 3/10.

We can rewrite the average of the ten numbers in terms of their relative frequencies as

$$-2 \cdot \frac{1}{10} + 2 \cdot \frac{4}{10} + 3 \cdot \frac{2}{10} + 5 \cdot \frac{3}{10} = \frac{27}{10} = 2.7.$$
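The two ways of averaging in Example 2.20 can be reproduced with a few commands. This is a minimal sketch; the use of unique and accumarray to tabulate relative frequencies is one choice among several, and the variable names are illustrative.

```matlab
% Sketch for Example 2.20: direct average versus relative frequencies.
a = [5 2 3 2 5 -2 3 2 5 2];
avg_direct = sum(a)/numel(a);       % ordinary numerical average
[vals, ~, idx] = unique(a);         % distinct values in ascending order
counts = accumarray(idx(:), 1)';    % number of occurrences of each value
relfreq = counts/numel(a);          % relative frequencies
avg_freq = vals*relfreq';           % sum of value times relative frequency
fprintf('direct: %g, via relative frequencies: %g\n', avg_direct, avg_freq)
```

The second computation has the same form as the definition of expectation, with relative frequencies in place of probabilities.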

Since probabilities model relative frequencies, if X is a discrete random variable taking distinct values xi with probabilities P(X = xi), we define the expectation or mean of X by

$$E[X] := \sum_{i} x_i P(X = x_i),$$

or, using the pmf notation pX(xi) = P(X = xi),

$$E[X] = \sum_{i} x_i\, p_X(x_i).$$

Example 2.21. Find the mean of a Bernoulli(p) random variable X.

Solution. Since X takes only the values x0 = 0 and x1 = 1, we can write

$$E[X] = \sum_{i=0}^{1} i\,P(X = i) = 0 \cdot (1 - p) + 1 \cdot p = p.$$

Note that, since X takes only the values 0 and 1, its “typical” value p is never seen (unless p = 0 or p = 1).

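Example 2.22. When light of intensity λ is incident on a photodetector, the number of photoelectrons generated is Poisson with parameter λ. Find the mean number of photoelectrons generated.

Solution. Let X denote the number of photoelectrons generated. We need to calculate E[X]. Since a Poisson random variable takes only nonnegative integer values with positive probability,

$$E[X] = \sum_{n=0}^{\infty} n\,P(X = n).$$

Since the term with n = 0 is zero, it can be dropped. Hence,

$$E[X] = \sum_{n=1}^{\infty} n\,\frac{\lambda^{n} e^{-\lambda}}{n!} = \sum_{n=1}^{\infty} \frac{\lambda^{n} e^{-\lambda}}{(n-1)!} = \lambda e^{-\lambda} \sum_{n=1}^{\infty} \frac{\lambda^{n-1}}{(n-1)!}.$$

Now change the index of summation from n to k = n − 1. This results in

$$E[X] = \lambda e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^{k}}{k!} = \lambda e^{-\lambda} e^{\lambda} = \lambda.$$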

Example 2.23. If X is a discrete random variable taking finitely many values, say x1,...,xn with corresponding probabilities p1,...,pn, then it is easy to compute E[X] with MATLAB. If the value xk is stored in x(k) and its probability pk is stored in p(k), then E[X] is given by x'*p, assuming both x and p are column vectors. If they are both row vectors, then the appropriate expression is x*p'.
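As a concrete instance of Example 2.23, the sketch below computes the mean of a random variable that is uniform on 1,...,6. This choice of pmf is an assumption made only to have numbers to work with.

```matlab
% Sketch for Example 2.23; the uniform pmf on 1,...,6 is an assumed example.
x = (1:6)';                 % values x(k), stored as a column vector
p = ones(6,1)/6;            % probabilities p(k), stored as a column vector
EX_col = x'*p;              % E[X] when x and p are column vectors
xr = x';  pr = p';          % the same data stored as row vectors
EX_row = xr*pr';            % E[X] when x and p are row vectors
fprintf('E[X] = %g (columns), %g (rows)\n', EX_col, EX_row)
```

Either expression is an inner product of the value vector with the probability vector, so the orientation only determines which factor is transposed.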

Example 2.24 (infinite expectation). Zipf random variables arise in the analysis of website popularity and web caching. Here is an example of a Zipf random variable with infinite expectation. Suppose that P(X = k) = C⁻¹/k², k = 1,2,..., whereᶜ

$$C := \sum_{n=1}^{\infty} \frac{1}{n^{2}}.$$

Then

$$E[X] = \sum_{k=1}^{\infty} k\,P(X = k) = C^{-1} \sum_{k=1}^{\infty} \frac{1}{k} = \infty,$$

as shown in Problem 48.

Some care is necessary when computing expectations of signed random variables that take more than finitely many values. It is the convention in probability theory that E[X] should be evaluated as

$$E[X] = \sum_{i:x_i \ge 0} x_i P(X = x_i) + \sum_{i:x_i < 0} x_i P(X = x_i),$$

assuming that at least one of these sums is finite. If the first sum is +∞ and the second one is −∞, then no value is assigned to E[X], and we say that E[X] is undefined.

ᶜ Note that C is finite by Problem 48. This is important since if C = ∞, then C⁻¹ = 0 and the probabilities would sum to zero instead of one.

Example 2.25 (undefined expectation). With C as in the previous example, suppose that

Expectation of a function of a random variable, or the law of the unconscious statistician (LOTUS)

Given a random variable X, we will often have occasion to define a new random variable by Z := g(X), where g(x) is a real-valued function of the real variable x. More precisely, recall that a random variable X is actually a function taking points of the sample space, ω ∈ Ω, into real numbers X(ω). Hence, the notation Z = g(X) is actually shorthand for Z(ω) := g(X(ω)). If we want to compute E[Z], it might seem that we first have to find the pmf of Z. Typically, this requires a detailed analysis of g. However, as we show below, we can compute E[Z] = E[g(X)] without actually finding the pmf of Z. The precise formula is

$$E[g(X)] = \sum_{i} g(x_i)\, p_X(x_i). \qquad (2.14)$$

Because it is so much easier to use (2.14) than to first find the pmf of Z, (2.14) is sometimes called the law of the unconscious statistician (LOTUS) [23]. As a simple example of its use, we can write, for any constant a,

$$E[aX] = \sum_{i} a x_i\, p_X(x_i) = a \sum_{i} x_i\, p_X(x_i) = aE[X].$$

In other words, constant factors can be pulled out of the expectation; technically, we say that expectation is a homogeneous operator. As we show later, expectation is also additive. An operator that is both homogeneous and additive is said to be linear. Thus, expectation is a linear operator.

Derivation of LOTUS

To derive (2.14), we proceed as follows. Let X take distinct values xi. Then Z takes values g(xi). However, the values g(xi) may not be distinct. For example, if g(x) = x², and X takes the four distinct values ±1 and ±2, then Z takes only the two distinct values 1 and 4. In any case, let zk denote the distinct values of Z and observe that

$$P(Z = z_k) = \sum_{i:g(x_i) = z_k} P(X = x_i).$$

We can now write

$$\begin{aligned}
E[Z] &= \sum_{k} z_k P(Z = z_k) \\
&= \sum_{k} z_k \sum_{i:g(x_i) = z_k} P(X = x_i) \\
&= \sum_{k}\ \sum_{i:g(x_i) = z_k} g(x_i) P(X = x_i) \\
&= \sum_{i} g(x_i) P(X = x_i),
\end{aligned}$$

since the last double sum is just a special way of summing over all values of i.
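The derivation can be mirrored numerically for the case g(x) = x² with X taking the values ±1 and ±2. In the sketch below, the pmf of X is an assumption chosen for illustration; the code computes E[g(X)] once with (2.14) and once by first building the pmf of Z = g(X).

```matlab
% Sketch of LOTUS for g(x) = x^2; the pmf pX is an assumed example.
x  = [-2 -1 1 2];                 % distinct values of X
pX = [0.1 0.2 0.3 0.4];           % assumed probabilities P(X = x_i)
gx = x.^2;                        % values g(x_i), not all distinct
E_lotus = gx*pX';                 % formula (2.14)
[z, ~, idx] = unique(gx);         % distinct values z_k of Z = g(X)
pZ = accumarray(idx(:), pX(:))';  % P(Z = z_k): sum of pX over g(x_i) = z_k
E_pmf = z*pZ';                    % expectation computed from the pmf of Z
fprintf('LOTUS: %g, via pmf of Z: %g\n', E_lotus, E_pmf)
```

The grouping done by accumarray is exactly the inner sum over {i : g(xi) = zk} in the derivation.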

Linearity of expectation

The derivation of the law of the unconscious statistician can be generalized to show that if g(x,y) is a real-valued function of two variables x and y, then

$$E[g(X,Y)] = \sum_{i} \sum_{j} g(x_i, y_j)\, p_{XY}(x_i, y_j).$$

In particular, taking g(x,y) = x + y, it is a simple exercise to show that E[X + Y] = E[X] + E[Y]. Thus, expectation is an additive operator. Since we showed earlier that expectation is also homogeneous, it follows that expectation is linear; i.e., for constants a and b,

E[aX + bY] = E[aX] + E[bY] = aE[X] + bE[Y]. (2.15)

Example 2.26. A binary communication link has bit-error probability p. What is the expected number of bit errors in a transmission of n bits?

Solution. For i = 1,...,n, let Xi = 1 if the ith bit is received incorrectly, and let Xi = 0 otherwise. Then Xi ∼ Bernoulli(p), and Y := X1 + ··· + Xn is the total number of errors in the transmission of n bits. We know from Example 2.21 that E[Xi] = p. Hence,

$$E[Y] = E\Bigl[\sum_{i=1}^{n} X_i\Bigr] = \sum_{i=1}^{n} E[X_i] = np.$$

The variance is the average squared deviation of X about its mean. The variance characterizes how likely it is to observe values of the random variable far from its mean. For example, consider the two pmfs shown in Figure 2.9. More probability mass is concentrated near zero in the graph at the left than in the graph at the right.

Figure 2.9. Example 2.27 shows that the random variable X with pmf at the left has a smaller variance than the random variable Y with pmf at the right.

Example 2.27. Let X and Y be the random variables with respective pmfs shown in Figure 2.9. Compute var(X) and var(Y).

Solution. By symmetry, both X and Y have zero mean, and so var(X) = E[X²] and var(Y) = E[Y²]. Write

$$E[X^2] = (-2)^2 \tfrac{1}{6} + (-1)^2 \tfrac{1}{3} + (1)^2 \tfrac{1}{3} + (2)^2 \tfrac{1}{6} = 2,$$

and

$$E[Y^2] = (-2)^2 \tfrac{1}{3} + (-1)^2 \tfrac{1}{6} + (1)^2 \tfrac{1}{6} + (2)^2 \tfrac{1}{3} = 3.$$

Thus, X and Y are both zero-mean random variables taking the values ±1 and ±2. But Y is more likely to take values far from its mean. This is reflected by the fact that var(Y) > var(X).
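The numbers in Example 2.27 can be confirmed with inner products, using the pmf values that appear in the sums above. This is a minimal sketch; the variable names are illustrative.

```matlab
% Sketch for Example 2.27, using the pmf values from the sums above.
v  = [-2 -1 1 2];              % values taken by both X and Y
pX = [1/6 1/3 1/3 1/6];        % pmf of X
pY = [1/3 1/6 1/6 1/3];        % pmf of Y
mX = v*pX';                    % mean of X
mY = v*pY';                    % mean of Y
varX = (v - mX).^2*pX';        % average squared deviation about the mean
varY = (v - mY).^2*pY';
fprintf('var(X) = %g, var(Y) = %g\n', varX, varY)
```

Because both means are zero, the variances coincide with the second moments computed in the example.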

When a random variable does not have zero mean, it is often convenient to use the variance formula,

var(X) = E[X²] − (E[X])², (2.17)

which says that the variance is equal to the second moment minus the square of the first moment. To derive the variance formula, put m := E[X] and write

$$\begin{aligned}
\operatorname{var}(X) &:= E[(X - m)^2] \\
&= E[X^2 - 2mX + m^2] \\
&= E[X^2] - 2mE[X] + m^2, \quad \text{by linearity,} \\
&= E[X^2] - m^2 \\
&= E[X^2] - (E[X])^2.
\end{aligned}$$

The standard deviation of X is defined to be the positive square root of the variance. Since the variance of a random variable is often denoted by the symbol σ², the standard deviation is denoted by σ.

Example 2.28. Find the second moment and the variance of X if X ∼ Bernoulli(p).

Solution. Since X takes only the values 0 and 1, it has the unusual property that X² = X. Hence, E[X²] = E[X] = p. It now follows that

var(X) = E[X²] − (E[X])² = p − p² = p(1 − p).

Example 2.29. An optical communication system employs a photodetector whose output is modeled as a Poisson(λ) random variable X. Find the second moment and the variance of X.

Solution. Observe that E[X(X − 1)] + E[X] = E[X²]. Since we know that E[X] = λ from Example 2.22, it suffices to compute

$$\begin{aligned}
E[X(X-1)] &= \sum_{n=0}^{\infty} n(n-1)\,\frac{\lambda^{n} e^{-\lambda}}{n!} \\
&= \sum_{n=2}^{\infty} \frac{\lambda^{n} e^{-\lambda}}{(n-2)!} \\
&= \lambda^{2} e^{-\lambda} \sum_{n=2}^{\infty} \frac{\lambda^{n-2}}{(n-2)!}.
\end{aligned}$$

Making the change of summation k = n − 2, we have

$$E[X(X-1)] = \lambda^{2} e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^{k}}{k!} = \lambda^{2} e^{-\lambda} e^{\lambda} = \lambda^{2}.$$

It follows that E[X²] = λ² + λ, and

var(X) = E[X²] − (E[X])² = (λ² + λ) − λ² = λ.

Thus, the Poisson(λ) random variable is unusual in that the values of its mean and variance are the same.
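The Poisson moments derived above can be checked by truncating the infinite sums. In the following sketch, the value of lambda and the truncation point n = 100 are assumptions; the truncation is harmless here because the Poisson(10) probabilities beyond 100 are negligible.

```matlab
% Sketch for Examples 2.22 and 2.29; lambda and the truncation are assumed.
lambda = 10;
n = 0:100;                                  % truncated range of values
pmf = exp(-lambda)*lambda.^n./gamma(n+1);   % Poisson(lambda) pmf
m1 = n*pmf';                                % E[X], should be near lambda
f2 = (n.*(n-1))*pmf';                       % E[X(X-1)], near lambda^2
m2 = (n.^2)*pmf';                           % E[X^2], near lambda^2 + lambda
v  = m2 - m1^2;                             % variance formula (2.17)
fprintf('E[X] = %g, E[X(X-1)] = %g, var(X) = %g\n', m1, f2, v)
```

A much larger truncation point would need care, since lambda.^n and gamma(n+1) eventually overflow in double precision.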

A generalization of the variance is the nth central moment of X, which is defined to be E[(X − m)ⁿ]. Hence, the second central moment is the variance. If σ² := var(X), then the skewness of X is defined to be E[(X − m)³]/σ³, and the kurtosis of X is defined to be E[(X − m)⁴]/σ⁴.

Example 2.30. If X has mean m and variance σ², it is sometimes convenient to introduce the normalized random variable

$$Y := \frac{X - m}{\sigma}.$$

It is easy to see that Y has zero mean. Hence,

$$\operatorname{var}(Y) = E[Y^2] = E\Bigl[\Bigl(\frac{X - m}{\sigma}\Bigr)^{2}\Bigr] = \frac{E[(X - m)^2]}{\sigma^2} = 1.$$

Thus, Y always has zero mean and unit variance. Furthermore, the third moment of Y is

$$E[Y^3] = E\Bigl[\Bigl(\frac{X - m}{\sigma}\Bigr)^{3}\Bigr] = \frac{E[(X - m)^3]}{\sigma^3},$$

which is the skewness of X, and similarly, the fourth moment of Y is the kurtosis of X.
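The normalization in Example 2.30 is straightforward to try on a small pmf. In the sketch below, the values and probabilities of X are hypothetical; the code checks that the normalized variable has zero mean and unit variance and that its third moment equals the skewness of X.

```matlab
% Sketch for Example 2.30; the pmf of X is an assumed example.
x = [0 1 2 5];                       % assumed values of X
p = [0.4 0.3 0.2 0.1];               % assumed probabilities
m = x*p';                            % mean of X
sigma = sqrt((x - m).^2*p');         % standard deviation of X
y = (x - m)/sigma;                   % values of the normalized variable Y
meanY  = y*p';                       % should be zero
varY   = y.^2*p' - meanY^2;          % should be one
thirdY = y.^3*p';                    % third moment of Y
skewX  = (x - m).^3*p'/sigma^3;      % skewness of X from its definition
fprintf('mean %g, var %g, E[Y^3] %g, skewness %g\n', meanY, varY, thirdY, skewX)
```

Y takes the value (xi − m)/σ with the same probability that X takes the value xi, which is why the same vector p is reused.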

Indicator functions

Given a set B ⊂ ℝ, there is a very special function of x, denoted by IB(x), for which we will be interested in computing E[IB(X)] for various random variables X. The indicator function of B, denoted by IB(x), is defined by

$$I_B(x) := \begin{cases} 1, & x \in B, \\ 0, & x \notin B. \end{cases}$$

For example, I_[a,b)(x) is shown in Figure 2.10(a), and I_(a,b](x) is shown in Figure 2.10(b).

Figure 2.10. (a) Indicator function I_[a,b)(x). (b) Indicator function I_(a,b](x).

Readers familiar with the unit-step function,

$$u(x) := \begin{cases} 1, & x \ge 0, \\ 0, & x < 0, \end{cases}$$

will note that u(x) = I_[0,∞)(x). However, the indicator notation is often more compact. For example, if a < b, it is easier to write I_[a,b)(x) than u(x − a) − u(x − b). How would you write I_(a,b](x) in terms of the unit step?
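In MATLAB, an indicator function corresponds naturally to a logical mask. The sketch below takes X uniform on 0,...,100 as in Example 2.17 and uses hypothetical endpoints a and b; it compares I_[a,b) with u(x − a) − u(x − b), and evaluates E[I_B(X)] with formula (2.14) alongside a direct sum of the pmf over B.

```matlab
% Sketch of indicator functions; the endpoints a and b are assumed values.
k = 0:100;                       % values of X (uniform, as in Example 2.17)
p = ones(1,101)/101;             % pmf of X
a = 10;  b = 20;                 % assumed endpoints of B = [a,b)
I = (k >= a) & (k < b);          % indicator of [a,b) at each value of X
u = @(x) double(x >= 0);         % unit-step function
step_form = u(k - a) - u(k - b); % same function written with unit steps
E_I = double(I)*p';              % E[I_B(X)] computed with (2.14)
P_B = sum(p(I));                 % sum of the pmf over values lying in B
fprintf('E[I_B(X)] = %g, sum of pmf over B = %g\n', E_I, P_B)
```

The two printed numbers agree because the indicator simply selects the terms of the pmf whose values lie in B.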

Example 2.31 (every probability is an expectation). If X is any random variable and B is any set, then IB(X) is a discrete random variable taking the values zero and one. Thus, IB(X) is a Bernoulli random variable, and

E[IB(X)] = P(IB(X) = 1) = P(X ∈ B).

In document EE-0030 - Probability and Random Processes for Electrical and Computer Engineers (Page 84-112)