Projections - Conditional Probability - Theory of Statistics

1.5 Conditional Probability

1.5.3 Projections

Use of one distribution or one random variable as an approximation of another distribution or random variable is a very useful technique in probability and statistics. It is a basic technique for establishing asymptotic results, and it underlies statistical applications involving regression analysis and prediction. Given two scalar random variables $X$ and $Y$, consider the question of what Borel function $g$ is such that $g(X)$ is closest to $Y$ in some sense. A common way to define closeness of random variables is by use of the expected squared distance, $\mathrm{E}((Y - g(X))^2)$. This leads to the least squares criterion for determining the optimal $g(X)$.

First, we must consider whether, or under what conditions, the problem has a solution under this criterion.

Theorem 1.62

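Let $X$ and $Y$ be scalar random variables over a common measurable space and assume $\mathrm{E}(Y^2) < \infty$. Then there exists a Borel measurable function $g_0$ with $\mathrm{E}((g_0(X))^2) < \infty$ such that

\[
\mathrm{E}((Y - g_0(X))^2) = \inf\{\mathrm{E}((Y - g(X))^2) \mid g(X) \in \mathcal{G}_0\}, \tag{1.246}
\]

where $\mathcal{G}_0 = \{g(X) \mid g : \mathbb{R} \to \mathbb{R} \text{ is Borel measurable and } \mathrm{E}((g(X))^2) < \infty\}$.

Proof. ***fix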

Although Theorem 1.62 is stated in terms of scalar random variables, a similar result holds for vector-valued random variables. The next theorem identifies a $g_0$ that minimizes the $L_2$ norm for vector-valued random variables.

Theorem 1.63

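Let $Y$ be a $d$-variate random variable such that $\mathrm{E}(\|Y\|^2) < \infty$ and let $\mathcal{G}$ be the set of all Borel measurable functions from $\mathbb{R}^k$ into $\mathbb{R}^d$. Let $X$ be a $k$-variate random variable such that $\mathrm{E}(\|\mathrm{E}(Y|X)\|^2) < \infty$. Let $g_0(X) = \mathrm{E}(Y|X)$. Then

\[
g_0(X) = \arg\min_{g \in \mathcal{G}} \mathrm{E}(\|Y - g(X)\|^2). \tag{1.247}
\]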

Proof. Exercise.

Compare this with Theorem 1.13 on page 27, in which the corresponding solution is $g_0(X) = \mathrm{E}(Y|a) = \mathrm{E}(Y)$.
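A minimal numerical sketch of the scalar case of Theorem 1.63 follows. It assumes a hypothetical model, $Y = X^2 + \epsilon$ with $X$ and $\epsilon$ independent standard normal, so that $\mathrm{E}(Y|X) = X^2$. The model, the candidate functions, and the sample size are illustrative choices and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical model (an assumption for illustration): Y = X^2 + eps,
# with X and eps independent N(0, 1), so that E(Y | X) = X^2.
x = rng.standard_normal(n)
y = x**2 + rng.standard_normal(n)

# Candidate Borel functions g; only the first is the conditional mean.
candidates = {
    "g0(x) = x^2 (conditional mean)": lambda t: t**2,
    "g(x) = 1 (the constant E(Y))": lambda t: np.ones_like(t),
    "g(x) = |x|": np.abs,
    "g(x) = x^2 + 0.5 x": lambda t: t**2 + 0.5 * t,
}

# Monte Carlo estimate of E((Y - g(X))^2) for each candidate g.
mse = {name: float(np.mean((y - g(x)) ** 2)) for name, g in candidates.items()}

for name, value in mse.items():
    print(f"{name:32s} estimated E((Y - g(X))^2) = {value:.3f}")
```

Under these assumptions the conditional mean should give the smallest estimated criterion, close to the noise variance of 1.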

By the general definition of projection (Definition 0.0.9 on page 637), we see that conditional expectation can be viewed as a projection in a linear space defined by the square-integrable random variables over a given probability space and the inner product $\langle Y, X\rangle = \mathrm{E}(YX)$ and its induced norm. (In fact, some people define conditional expectation this way instead of the way we have in Definitions 1.44 and 1.45.)
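The projection property can be made explicit with a short derivation. This is a sketch for scalar square-integrable $Y$ and any Borel function $g$ with $\mathrm{E}((g(X))^2) < \infty$, using the tower property of conditional expectation.

```latex
\begin{align*}
\langle Y - \mathrm{E}(Y|X),\, g(X)\rangle
  &= \mathrm{E}\big((Y - \mathrm{E}(Y|X))\,g(X)\big) \\
  &= \mathrm{E}\big(\mathrm{E}\big((Y - \mathrm{E}(Y|X))\,g(X)\,\big|\,X\big)\big) \\
  &= \mathrm{E}\big(g(X)\,(\mathrm{E}(Y|X) - \mathrm{E}(Y|X))\big) = 0 .
\end{align*}
Since $\mathrm{E}(Y|X) - g(X)$ is itself a square-integrable function of $X$,
the cross term in the expansion of the squared distance vanishes, and
\[
\mathrm{E}\big((Y - g(X))^2\big)
  = \mathrm{E}\big((Y - \mathrm{E}(Y|X))^2\big)
  + \mathrm{E}\big((\mathrm{E}(Y|X) - g(X))^2\big)
  \ge \mathrm{E}\big((Y - \mathrm{E}(Y|X))^2\big).
\]
```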

In regression applications in statistics using least squares, as we discuss on page 438, “$\widehat{Y}$”, or the “predicted” $Y$ given $X$, that is, $\mathrm{E}(Y|X)$, is the projection of $Y$ onto $X$. For given fixed values of $Y$ and $X$, the predicted $Y$ given $X$ is the vector projection, in the sense of Definition 0.0.9.
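For fixed data, the vector projection can be checked numerically. The sketch below uses hypothetical simulated data (an intercept column and one regressor, chosen only for illustration) and NumPy's least squares solver. The fitted vector is the projection of the response onto the column space of the design matrix, so the residual should be orthogonal to every column.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Hypothetical fixed data (illustrative only): intercept plus one regressor.
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = 2.0 + 3.0 * X[:, 1] + rng.standard_normal(n)

# Least squares coefficients; y_hat = X @ beta is the projection of y
# onto the column space of X.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
residual = y - y_hat

# The residual is orthogonal to every column of X (up to rounding error).
print("X^T (y - y_hat) =", X.T @ residual)

# Pythagorean identity for a projection: ||y||^2 = ||y_hat||^2 + ||y - y_hat||^2.
print(np.dot(y, y), np.dot(y_hat, y_hat) + np.dot(residual, residual))
```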

We now formally define projection for random variables in a manner analogous to Definition 0.0.9. Note that the random variable space is the range of the functions in $\mathcal{G}$ in Theorem 1.63.

Definition 1.46 (projection of a random variable onto a space of random variables)

Let $Y$ be a random variable and let $\mathcal{X}$ be a random variable space defined on the same probability space. A random variable $X_p \in \mathcal{X}$ such that

\[
\mathrm{E}(\|Y - X_p\|^2) \le \mathrm{E}(\|Y - X\|^2) \quad \forall X \in \mathcal{X} \tag{1.248}
\]

is called a projection of $Y$ onto $\mathcal{X}$.

The most interesting random variable spaces are linear spaces, and in the following we will assume that $\mathcal{X}$ is a linear space, and hence the norm arises from an inner product so that the terms in inequality (1.248) involve variances and covariances.

*** existence, closure of space in second norm (see page 35).

*** treat vector variables differently: $\mathrm{E}(\|Y - \mathrm{E}(Y)\|^2)$ is not the variance **** make this distinction earlier

When $\mathcal{X}$ is a linear space, we have the following result for projections.

Theorem 1.64

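Let $\mathcal{X}$ be a linear space of random variables with finite second moments. Then $X_p$ is a projection of $Y$ onto $\mathcal{X}$ iff $X_p \in \mathcal{X}$ and

\[
\mathrm{E}\big((Y - X_p)^{\mathrm{T}} X\big) = 0 \quad \forall X \in \mathcal{X}. \tag{1.249}
\]

Proof.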

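For any $X, X_p \in \mathcal{X}$ we have

\begin{align*}
\mathrm{E}\big((Y - X)^{\mathrm{T}}(Y - X)\big) ={}& \mathrm{E}\big((Y - X_p)^{\mathrm{T}}(Y - X_p)\big) \\
 &+ 2\mathrm{E}\big((Y - X_p)^{\mathrm{T}}(X_p - X)\big) + \mathrm{E}\big((X_p - X)^{\mathrm{T}}(X_p - X)\big).
\end{align*}

If equation (1.249) holds then the middle term is zero, because $X_p - X \in \mathcal{X}$ when $\mathcal{X}$ is a linear space, and so $\mathrm{E}((Y - X)^{\mathrm{T}}(Y - X)) \ge \mathrm{E}((Y - X_p)^{\mathrm{T}}(Y - X_p))$ $\forall X \in \mathcal{X}$; that is, $X_p$ is a projection of $Y$ onto $\mathcal{X}$.

Now, for any real scalar $a$ and any $X, X_p \in \mathcal{X}$, we have

\[
\mathrm{E}\big((Y - X_p - aX)^{\mathrm{T}}(Y - X_p - aX)\big) - \mathrm{E}\big((Y - X_p)^{\mathrm{T}}(Y - X_p)\big) = -2a\,\mathrm{E}\big((Y - X_p)^{\mathrm{T}} X\big) + a^2\,\mathrm{E}\big(X^{\mathrm{T}} X\big).
\]

If $X_p$ is a projection of $Y$ onto $\mathcal{X}$, the term on the left side of the equation is nonnegative for every $a$, because $X_p + aX \in \mathcal{X}$. But the term on the right side of the equation can be nonnegative for every $a$ only if the orthogonality condition of equation (1.249) holds; hence, we conclude that that is the case.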
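The final step of the proof can be spelled out as follows. The symbols $q$, $c$, and $d$ are shorthand introduced only for this sketch.

```latex
\[
q(a) = -2ac + a^2 d, \qquad
c = \mathrm{E}\big((Y - X_p)^{\mathrm{T}} X\big), \quad
d = \mathrm{E}\big(X^{\mathrm{T}} X\big) \ge 0 .
\]
If $d > 0$, then $q$ is minimized at $a = c/d$, where
\[
q(c/d) = -\frac{c^2}{d},
\]
which is negative unless $c = 0$. If $d = 0$, then $q(a) = -2ac$ takes
negative values unless $c = 0$. Hence $q(a) \ge 0$ for every real $a$
forces $c = 0$, which is the orthogonality condition (1.249).
```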

Because a linear space contains the constants, we have the following corollary.

Corollary 1.64.1

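Let $\mathcal{X}$ be a linear space of random variables with finite second moments, and let $X_p$ be a projection of the random variable $Y$ onto $\mathcal{X}$. Then,

\[
\mathrm{E}(X_p) = \mathrm{E}(Y), \tag{1.250}
\]

\[
\mathrm{Cov}(Y - X_p, X) = 0 \quad \forall X \in \mathcal{X}, \tag{1.251}
\]

and

\[
\mathrm{Cov}(Y, X) = \mathrm{Cov}(X_p, X) \quad \forall X \in \mathcal{X}. \tag{1.252}
\]

***fix ** add uniqueness etc. $\mathrm{E}(Y) = \mathrm{E}(X_p)$ and $\mathrm{Cov}(Y - X_p, X) = 0\ \forall X \in \mathcal{X}$.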
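A sample analogue of the corollary can be checked numerically. In the sketch below, expectations are taken under the empirical distribution of a simulated sample, and the linear space is the span of $1$, $X$, and $X^2$. The data-generating model and the basis are hypothetical choices for illustration. Under the empirical distribution, least squares gives an exact projection, so the three relations should hold up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical setup (illustrative only). Expectations are taken under the
# empirical distribution of the sample, for which least squares gives an
# exact projection onto the span of the basis columns.
x = rng.standard_normal(n)
y = np.sin(x) + 0.5 * rng.standard_normal(n)
basis = np.column_stack([np.ones(n), x, x**2])  # spans the linear space

coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
x_p = basis @ coef  # projection of y onto the span


def cov(u, v):
    """Covariance under the empirical distribution (divisor n)."""
    return float(np.mean((u - u.mean()) * (v - v.mean())))


# An arbitrary element of the linear space.
w = basis @ np.array([0.3, -1.2, 2.0])

print("E(X_p) - E(Y)           =", x_p.mean() - y.mean())    # (1.250)
print("Cov(Y - X_p, W)         =", cov(y - x_p, w))           # (1.251)
print("Cov(Y, W) - Cov(X_p, W) =", cov(y, w) - cov(x_p, w))   # (1.252)
```

The constant column in the basis is what makes the mean relation hold, mirroring the role of the constants in the corollary.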

Definition 1.47 (projection of a function of random variables)

Let $Y_1, \ldots, Y_n$ be a set of random variables. The projection of the statistic $T_n(Y_1, \ldots, Y_n)$ onto the $k_n$ random variables $X_1, \ldots, X_{k_n}$ is

\[
\widetilde{T}_n = \mathrm{E}(T_n) + \sum_{i=1}^{k_n} \big(\mathrm{E}(T_n|X_i) - \mathrm{E}(T_n)\big). \tag{1.253}
\]
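As a worked illustration of equation (1.253), take $T_n$ to be the sample variance $S^2$ of $X_1, \ldots, X_n$ and project it onto those same variables, so that $k_n = n$. This sketch adds assumptions that are not part of the definition: the $X_i$ are iid with mean $\mu$, variance $\sigma^2$, and finite fourth moment. The sample variance is the average of $(X_j - X_k)^2/2$ over all pairs $j < k$, a fraction $2/n$ of which involve $X_i$.

```latex
\begin{align*}
\mathrm{E}(S^2 \mid X_i)
  &= \frac{2}{n}\cdot\frac{(X_i-\mu)^2+\sigma^2}{2}
     + \Big(1-\frac{2}{n}\Big)\sigma^2
   = \sigma^2 + \frac{(X_i-\mu)^2-\sigma^2}{n}, \\
\widetilde{T}_n
  &= \sigma^2 + \sum_{i=1}^{n}\frac{(X_i-\mu)^2-\sigma^2}{n}
   = \frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2 .
\end{align*}
```

Under these assumptions the projection of $S^2$ is the average squared deviation from the true mean, which is a sum of iid terms.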

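An interesting projection is one in which the $Y_1, \ldots, Y_n$ in Definition 1.47 are the same as $X_1, \ldots, X_{k_n}$, so that $k_n = n$. In that case, if the $X_1, \ldots, X_n$ are iid and $T_n$ is a symmetric function of them, then the $\mathrm{E}(T_n|X_i)$ are iid with mean $\mathrm{E}(T_n)$. The residual, $T_n - \widetilde{T}_n$, is often of interest. Writing

\[
T_n - \widetilde{T}_n = T_n - \mathrm{E}(T_n) - \sum_{i=1}^{n} \big(\mathrm{E}(T_n|X_i) - \mathrm{E}(T_n)\big),
\]

we see that

\[
\mathrm{E}(\widetilde{T}_n) = \mathrm{E}(T_n), \tag{1.254}
\]

and if $\mathrm{V}(T_n) < \infty$,

\[
\mathrm{V}(\widetilde{T}_n) = n\,\mathrm{V}(\mathrm{E}(T_n|X_i)) \tag{1.255}
\]

(exercise).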

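If $\mathrm{V}(\mathrm{E}(T_n|X_i)) > 0$, by the central limit theorem, we have

\[
\frac{1}{\sqrt{n\,\mathrm{V}(\mathrm{E}(T_n|X_i))}}\,\big(\widetilde{T}_n - \mathrm{E}(T_n)\big) \xrightarrow{d} \mathrm{N}(0,1).
\]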
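The normal limit can be examined by simulation. The sketch below assumes, for illustration only, that the $X_i$ are iid N(0, 1) and that $T_n$ is the sample variance, for which $\mathrm{E}(T_n|X_i) = 1 + (X_i^2 - 1)/n$ and the projection is the mean of the $X_i^2$. The sample size and number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 20_000

# Illustrative assumption: X_1, ..., X_n iid N(0, 1) and T_n = sample variance,
# whose projection is mean(X_i^2), with E(T_n | X_i) = 1 + (X_i^2 - 1)/n.
x = rng.standard_normal((reps, n))
t_proj = np.mean(x**2, axis=1)

# V(E(T_n | X_i)) = V(X_i^2) / n^2 = 2 / n^2, so n V(E(T_n | X_i)) = 2 / n.
z = (t_proj - 1.0) / np.sqrt(2.0 / n)

print("mean of Z     :", z.mean())
print("variance of Z :", z.var())
print("P(Z <= 1.645) :", np.mean(z <= 1.645), "(the N(0,1) value is about 0.95)")
```

For finite $n$ the standardized projection is slightly right-skewed in this example, so the empirical probability need not match the normal value exactly.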

We also have an interesting relationship between the variances of $T_n$ and $\widetilde{T}_n$, that is, $\mathrm{V}(\widetilde{T}_n) \le \mathrm{V}(T_n)$, as the next theorem shows.

Theorem 1.65

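If $T_n$ is symmetric and $\mathrm{V}(T_n) < \infty$ for every $n$, and $\widetilde{T}_n$ is the projection of $T_n$ onto $X_1, \ldots, X_n$, then

\[
\mathrm{E}\big((T_n - \widetilde{T}_n)^2\big) = \mathrm{V}(T_n) - \mathrm{V}(\widetilde{T}_n).
\]

Proof. Because $\mathrm{E}(T_n) = \mathrm{E}(\widetilde{T}_n)$, we have

\[
\mathrm{E}\big((T_n - \widetilde{T}_n)^2\big) = \mathrm{V}(T_n) + \mathrm{V}(\widetilde{T}_n) - 2\,\mathrm{Cov}(T_n, \widetilde{T}_n). \tag{1.256}
\]

But

\begin{align*}
\mathrm{Cov}(T_n, \widetilde{T}_n) &= \mathrm{E}(T_n \widetilde{T}_n) - (\mathrm{E}(T_n))^2 \\
 &= \mathrm{E}(T_n \mathrm{E}(T_n)) + \mathrm{E}\Big(T_n \sum_{i=1}^{n} \mathrm{E}(T_n|X_i)\Big) - n\,\mathrm{E}(T_n \mathrm{E}(T_n)) - (\mathrm{E}(T_n))^2 \\
 &= n\,\mathrm{E}\big(T_n \mathrm{E}(T_n|X_i)\big) - n\,(\mathrm{E}(T_n))^2 \\
 &= n\,\mathrm{V}(\mathrm{E}(T_n|X_i)) = \mathrm{V}(\widetilde{T}_n),
\end{align*}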

and the desired result follows from equation (1.256) above.
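The identity in Theorem 1.65 can be checked by Monte Carlo. The sketch below assumes, for illustration only, that the $X_i$ are iid N(0, 1) and that $T_n$ is the sample variance, whose projection onto $X_1, \ldots, X_n$ is then the mean of the $X_i^2$. For normal data $\mathrm{V}(T_n) = 2/(n-1)$ and the variance of the projection is $2/n$, so both sides of the identity should be close to $2/(n(n-1))$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 10, 200_000

# Illustrative assumption: X_1, ..., X_n iid N(0, 1); T_n is the sample
# variance and its projection onto X_1, ..., X_n is mean(X_i^2).
x = rng.standard_normal((reps, n))
t_n = x.var(axis=1, ddof=1)
t_proj = np.mean(x**2, axis=1)

lhs = np.mean((t_n - t_proj) ** 2)  # estimates E((T_n - projection)^2)
rhs = t_n.var() - t_proj.var()      # estimates V(T_n) - V(projection)
exact = 2.0 / (n * (n - 1))         # 2/(n-1) - 2/n for normal data

print(f"E((T_n - projection)^2) ~ {lhs:.5f}")
print(f"V(T_n) - V(projection)  ~ {rhs:.5f}")
print(f"value for normal data     {exact:.5f}")
```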

The relevance of these facts is that if we can show that $\widetilde{T}_n \to T_n$ in some appropriate way, then we can work out the asymptotic distribution of $T_n$. (The use of projections of U-statistics beginning on page 413 is an example.)
