
Now, why should we care about linear dependence? Because the inverse of a square matrix exists only if its columns are linearly independent. Since the vector of regression estimates b depends on $(X'X)^{-1}$, the parameter estimates b0, b1, and so on cannot be uniquely determined if some of the columns of X are linearly dependent. That is, if two or more of your predictor variables (the columns of your X matrix) are linearly dependent, or nearly so, you will run into trouble when trying to estimate the regression equation.

For example, suppose for some strange reason we multiplied the predictor variable soap by 2 in the dataset soapsuds.txt. That is, we would have two predictor variables, say soap1 (which is the original soap) and soap2 (which is 2 × the original soap):

If we tried to regress y = suds on x1 = soap1 and x2 = soap2, we see that statistical software spits out trouble:

In short, the first moral of the story is "do not collect your data in such a way that the predictor variables are perfectly correlated." And, the second moral of the story is "if your software package reports an error message concerning high correlation among your predictor variables, then think about linear dependence and how to get rid of it."
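The soap example can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using the soap values listed in Table 3.1 of this unit; the variable names are illustrative only.

```python
import numpy as np

# Soap values as listed in Table 3.1 of this unit; soap2 is exactly 2 x soap1.
soap1 = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0])
soap2 = 2.0 * soap1

# Design matrix: a column of ones (intercept), then the two predictors.
X = np.column_stack([np.ones_like(soap1), soap1, soap2])

# The rank counts the linearly independent columns of X.
rank = np.linalg.matrix_rank(X)
print("columns of X:", X.shape[1], "rank of X:", rank)

# The combination 0*ones + 2*soap1 - 1*soap2 is the zero vector,
# which is exactly what linear dependence of the columns means.
combo = X @ np.array([0.0, 2.0, -1.0])
print("2*soap1 - soap2 =", combo)
```

Because X has three columns but only two of them are linearly independent, X'X is singular and the estimates cannot be uniquely determined.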

3.4 More on the Regression Model in Matrix Form


This model includes the assumption that the $\varepsilon_i$'s are a sample from a population with mean zero and standard deviation σ. In most cases we also assume that this population is normally distributed.

The multiple linear regression model is:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} + \varepsilon_i \quad \text{for } i = 1, 2, 3, \dots, n \tag{3.1.14}$$

This model includes the same assumption about the $\varepsilon_i$'s.

This requires building up our symbols into vectors. Thus:

$$\underset{n \times 1}{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix} \tag{3.1.15}$$

captures the entire dependent variable in a single symbol. The “n × 1” part of the notation is just a shape reminder. These get dropped once the context is clear.

For simple linear regression, we will capture the independent variable through this n × 2 matrix:

$$\underset{n \times 2}{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \tag{3.1.16}$$

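The coefficient vector will be

$$\underset{2 \times 1}{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$

and the noise will be

$$\underset{n \times 1}{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

The simple linear regression model is then written as:

$$\underset{n \times 1}{Y} = \underset{n \times 2}{X}\;\underset{2 \times 1}{\beta} + \underset{n \times 1}{\varepsilon}$$

The product part, meaning $\underset{n \times 2}{X}\,\underset{2 \times 1}{\beta}$, is found through the usual rule for matrix multiplication as: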


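$$\underset{n \times 2}{X}\;\underset{2 \times 1}{\beta} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \beta_0 + \beta_1 x_3 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{pmatrix} \tag{3.1.17}$$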

We usually write the model without the shape reminders as Y = X β + ε. This is shorthand notation for:

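$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} \beta_0 + \beta_1 x_1 + \varepsilon_1 \\ \beta_0 + \beta_1 x_2 + \varepsilon_2 \\ \beta_0 + \beta_1 x_3 + \varepsilon_3 \\ \vdots \\ \beta_0 + \beta_1 x_n + \varepsilon_n \end{pmatrix} \tag{3.1.18}$$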
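As a rough illustration of this shorthand, the sketch below builds the n × 2 matrix X and checks that X β + ε reproduces the row-by-row equations. It assumes NumPy; the x values are the soap values from Table 3.2, while the coefficient values and the noise are hypothetical and chosen only for the demonstration.

```python
import numpy as np

# Predictor values (soap, as listed in Table 3.2 of this unit).
x = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0])
n = x.size

# n x 2 matrix: a column of ones, then the x values.
X = np.column_stack([np.ones(n), x])

# Hypothetical (beta0, beta1) and hypothetical noise with sigma = 1.
beta = np.array([-2.0, 9.0])
rng = np.random.default_rng(0)
eps = rng.normal(loc=0.0, scale=1.0, size=n)

# Matrix form: Y = X beta + eps.
Y_matrix = X @ beta + eps

# Row-by-row form: Y_i = beta0 + beta1 * x_i + eps_i.
Y_rowwise = np.array([beta[0] + beta[1] * x[i] + eps[i] for i in range(n)])

print(X.shape, beta.shape, Y_matrix.shape)
print("same result:", np.allclose(Y_matrix, Y_rowwise))
```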

It is helpful that the multiple regression story with K ≥ 2 predictors leads to the same model expression Y = X β + ε (just with different shapes). As a notational convenience, let p = K + 1. In the multiple regression case, we have:

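$$\underset{n \times p}{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ 1 & x_{31} & x_{32} & \cdots & x_{3k} \\ 1 & x_{41} & x_{42} & \cdots & x_{4k} \\ 1 & x_{51} & x_{52} & \cdots & x_{5k} \\ 1 & x_{61} & x_{62} & \cdots & x_{6k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix} \quad \text{and} \quad \underset{p \times 1}{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \vdots \\ \beta_k \end{pmatrix} \tag{3.1.19}$$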

The detail shown here is to suggest that X is a tall, skinny matrix. We formally require n ≥ p.

In most applications, n is much, much larger than p. The ratio n/p is often in the hundreds. If it happens that n/p is as small as 5, we will worry that we don’t have enough data (reflected in n) to estimate the number of parameters in β (reflected in p).

The multiple regression model is now:


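$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \beta_3 x_{13} + \dots + \beta_k x_{1k} + \varepsilon_1 \\ \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \beta_3 x_{23} + \dots + \beta_k x_{2k} + \varepsilon_2 \\ \beta_0 + \beta_1 x_{31} + \beta_2 x_{32} + \beta_3 x_{33} + \dots + \beta_k x_{3k} + \varepsilon_3 \\ \vdots \\ \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \beta_3 x_{n3} + \dots + \beta_k x_{nk} + \varepsilon_n \end{pmatrix} \tag{3.1.20}$$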

The model form Y = X β + ε is thus completely general.

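The assumptions on the noise terms can be written as E(ε) = 0 and Var(ε) = σ² I. The I here is the n × n identity matrix. That is,

$$I = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix} \tag{3.1.21}$$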

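The variance assumption can be written as

$$\operatorname{Var}(\varepsilon) = \sigma^2 I = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 \end{pmatrix}$$

You may see this expressed as $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = \sigma^2 \delta_{ij}$, where

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$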

We will call b the estimate of the unknown parameter vector β. You will also find the notation $\hat{\beta}$ for the estimate. Once we get b, we can compute the fitted vector $\hat{Y} = Xb$. This fitted vector represents an ex-post guess at the expected value of Y.

The estimate b is found so that the fitted vector $\hat{Y}$ is close to the actual data vector Y.

Closeness is defined in the least squares sense, meaning that we want to minimize the criterion Q, where

$$Q = \sum_{i=1}^{n} \left( Y_i - (Xb)_i \right)^2 \tag{3.1.22}$$

and $(Xb)_i$ denotes the ith entry of the vector Xb.


This can be done by differentiating this quantity p = K + 1 times, once with respect to b0, once with respect to b1, ..., and once with respect to bK, and setting each derivative equal to zero. This is routine in simple regression (K = 1), and it's possible with a lot of messy work in general.

It happens that Q is the squared length of the vector difference Y − Xb. This means that we can write:

$$Q = \underset{1 \times n}{(Y - Xb)'}\;\underset{n \times 1}{(Y - Xb)} \tag{3.1.23}$$

This represents Q as a 1 × 1 matrix, and so we can think of Q as an ordinary number.
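The equivalence of the two expressions for Q can be checked numerically. This is a small sketch assuming NumPy and the soap/suds values from Table 3.2; the trial value of b is arbitrary and is not the least squares estimate.

```python
import numpy as np

# Data as listed in Table 3.2 of this unit.
x = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0])
Y = np.array([33.0, 42.0, 45.0, 51.0, 53.0, 61.0, 62.0])
X = np.column_stack([np.ones_like(x), x])

# An arbitrary trial value of b (hypothetical, for illustration only).
b = np.array([0.0, 9.0])

# Entry-by-entry definition: sum of squared differences Y_i - (Xb)_i.
fitted = X @ b
Q_sum = sum((Y[i] - fitted[i]) ** 2 for i in range(len(Y)))

# Matrix definition: (Y - Xb)'(Y - Xb), the squared length of Y - Xb.
resid = Y - X @ b
Q_inner = resid @ resid

print(Q_sum, Q_inner)
```

Both computations give the same single number, which is why Q can be treated as an ordinary scalar.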

There are several ways to find the b that minimizes Q. The simple solution we will show here (alas) requires knowing the answer and working backward.

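Define the matrix

$$\underset{n \times n}{H} = \underset{n \times p}{X} \left( \underset{p \times n}{X'}\;\underset{n \times p}{X} \right)^{-1} \underset{p \times n}{X'}$$

We will call H the “hat matrix”, and it has some important uses. There are several technical comments about H: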

(1) Finding H requires the ability to get $\left( \underset{p \times n}{X'}\;\underset{n \times p}{X} \right)^{-1}$. This matrix inversion is possible if and only if X has full rank p. Things get very interesting when X almost has full rank p; that’s a longer story for another time.

(2) The matrix H is idempotent. The defining condition for idempotence is this:

The matrix C is idempotent ⇔ C C = C . Only square matrices can be idempotent.

Since H is square (it is n × n), it can be checked for idempotence. You will indeed find that H H = H.

(3) The ith diagonal entry, that in position (i, i), will be identified for later use as the ith leverage value. The notation is usually hi , but you’ll also see hii .
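The three comments above can be illustrated with a short numerical sketch. It assumes NumPy and uses the soap values from Table 3.2 as the single predictor; the check that the leverages sum to p is a standard property of the hat matrix that is not stated in the text above.

```python
import numpy as np

# Single predictor (soap, Table 3.2) plus an intercept column.
x = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# (1) X'X can be inverted because X has full rank p.
full_rank = np.linalg.matrix_rank(X) == p
H = X @ np.linalg.inv(X.T @ X) @ X.T   # n x n hat matrix

# (2) H is symmetric and idempotent (H H = H), up to rounding error.
symmetric = np.allclose(H, H.T)
idempotent = np.allclose(H @ H, H)

# (3) The diagonal entries of H are the leverage values h_i.
leverages = np.diag(H)

print("full rank:", full_rank)
print("symmetric:", symmetric, "idempotent:", idempotent)
print("leverages:", np.round(leverages, 4))
print("sum of leverages:", leverages.sum())
```

The explicit inverse is used here only to mirror the formula; for larger problems a solver-based approach is usually preferred.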

Now write Y in the form H Y + (I – H) Y .

Now let’s develop Q. This will require using the fact that H is symmetric, meaning H′ = H.

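This will also require using the transpose of a matrix product. Specifically, the property will be $(Xb)' = b'X'$.

Writing Y − Xb = {HY − Xb} + (I − H)Y, the criterion expands into four summands:

$$\begin{aligned} Q &= (Y - Xb)'(Y - Xb) \\ &= \{HY - Xb\}'\{HY - Xb\} + \{HY - Xb\}'(I - H)Y + ((I - H)Y)'\{HY - Xb\} + ((I - H)Y)'(I - H)Y \end{aligned}$$

The second and third summands above are zero, as a consequence of: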


$$(I - H)X = X - HX = X - X(X'X)^{-1}X'X = X - X = 0$$

so that

$$Q = \{HY - Xb\}'\{HY - Xb\} + ((I - H)Y)'((I - H)Y) \tag{3.1.24}$$

If this is to be minimized over choices of b, then the minimization can only be done with regard to the first summand, $\{HY - Xb\}'\{HY - Xb\}$, because the second summand does not involve b. It is possible to make the vector HY − Xb equal to 0 by selecting $b = (X'X)^{-1}X'Y$. This is very easy to see, as $H = X(X'X)^{-1}X'$, so that $HY = X(X'X)^{-1}X'Y = Xb$. This $b = (X'X)^{-1}X'Y$ is known as the least squares estimate of β.
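The least squares formula can be tried out directly. The sketch below assumes NumPy and the soap/suds values from Table 3.2; it computes b three ways (the literal formula, a linear solve of the normal equations, and NumPy's `lstsq` routine) and confirms that HY equals Xb.

```python
import numpy as np

# Data as listed in Table 3.2 of this unit.
x = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0])
Y = np.array([33.0, 42.0, 45.0, 51.0, 53.0, 61.0, 62.0])
X = np.column_stack([np.ones_like(x), x])

# Literal formula: b = (X'X)^{-1} X'Y.
b = np.linalg.inv(X.T @ X) @ X.T @ Y

# Same normal equations (X'X) b = X'Y, solved without forming the inverse.
b_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# NumPy's built-in least squares routine, for comparison.
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# With this b, the vector HY - Xb is zero.
H = X @ np.linalg.inv(X.T @ X) @ X.T
hy_equals_xb = np.allclose(H @ Y, X @ b)

print("b =", b)
print("all three agree:", np.allclose(b, b_solve) and np.allclose(b, b_lstsq))
print("H Y equals X b:", hy_equals_xb)
```

In practice the solve or `lstsq` form is generally the safer choice numerically; the explicit inverse appears here only because it matches the formula in the text.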

For the simple linear regression case K = 1, the estimate $b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}$ can be found with relative ease. The slope estimate is $b_1 = \dfrac{s_{xy}}{s_{xx}}$, where

$$s_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} x_i Y_i - n \bar{x} \bar{Y}$$

and where

$$s_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2$$
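These scalar formulas are easy to evaluate by hand or in code. The sketch below assumes NumPy and the soap/suds values from Table 3.2. The intercept line uses the standard relation b0 = Ȳ − b1 x̄, which is not stated in the text above and is included here as an assumption for completeness.

```python
import numpy as np

# Data as listed in Table 3.2 of this unit.
x = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0])
Y = np.array([33.0, 42.0, 45.0, 51.0, 53.0, 61.0, 62.0])
n = len(x)
x_bar, Y_bar = x.mean(), Y.mean()

# s_xy computed both ways: centered products and the shortcut form.
s_xy = np.sum((x - x_bar) * (Y - Y_bar))
s_xy_short = np.sum(x * Y) - n * x_bar * Y_bar

# s_xx computed both ways: centered squares and the shortcut form.
s_xx = np.sum((x - x_bar) ** 2)
s_xx_short = np.sum(x ** 2) - n * x_bar ** 2

b1 = s_xy / s_xx            # slope estimate
b0 = Y_bar - b1 * x_bar     # intercept (standard formula, assumed here)

print("s_xy:", s_xy, s_xy_short)
print("s_xx:", s_xx, s_xx_short)
print("b0, b1:", b0, b1)
```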

For the multiple regression case K ≥ 2, the calculation involves the inversion of the p×p matrix X′ X. This task is best left to computer software.

There is a computational trick, called “mean-centering,” that converts the problem to a simpler one of inverting a K × K matrix.

The matrix notation will allow the proof of two very helpful facts:

* E (b) = β. This means that b is an unbiased estimate of β. This is a good thing, but there are circumstances in which biased estimates will work a little bit better.

* Var (b) = $\sigma^2 (X'X)^{-1}$. This identifies the variances and covariances of the estimated coefficients. It’s critical to note that the separate entries of b are not statistically independent.
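The second fact can be made concrete by evaluating σ²(X′X)⁻¹ for a small design. This is only a sketch: it assumes NumPy, uses the soap values from Table 3.2 as the predictor, and sets σ² to a hypothetical value of 1, since the true σ² is unknown.

```python
import numpy as np

# Single predictor (soap, Table 3.2) plus an intercept column.
x = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0])
X = np.column_stack([np.ones_like(x), x])

sigma2 = 1.0  # hypothetical noise variance, for illustration only

# Var(b) = sigma^2 (X'X)^{-1}: a 2 x 2 matrix for simple regression.
cov_b = sigma2 * np.linalg.inv(X.T @ X)

var_b0, var_b1 = np.diag(cov_b)   # variances of b0 and b1
cov_b0_b1 = cov_b[0, 1]           # covariance between b0 and b1

print(np.round(cov_b, 4))
print("Cov(b0, b1) is nonzero:", not np.isclose(cov_b0_b1, 0.0))
```

The nonzero off-diagonal entry is the numerical counterpart of the remark that the entries of b are not statistically independent.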

SELF ASSESSMENT EXERCISE

What is linear dependence in matrix algebra? And what happens if some of the columns of X are linearly dependent?


4.0 CONCLUSION

Instead of writing regression models in ordinary scalar form, one equation per observation (which is pretty inefficient), we can formulate the linear regression function in matrix form. For simple linear regression, the matrix formulation Y = X β + ε has X as an n × 2 matrix, β as a 2 × 1 column vector, Y as an n × 1 column vector, and ε as an n × 1 column vector. The matrix X and the vector β are multiplied together using matrix multiplication, and the product is added to the vector ε using matrix addition. The least squares estimate in matrix notation is the inverse of X-transpose times X, multiplied by X-transpose times Y: $b = (X'X)^{-1}X'Y$.

5.0 SUMMARY

This unit explained the basic matrix formulation of a regression model. Emphasis was laid on the use of matrix notation to formulate the regression model because of its efficiency. The unit further stressed that, for simple linear regression, the matrix formulation Y = X β + ε has X as an n × 2 matrix, β as a 2 × 1 column vector, Y as an n × 1 column vector, and ε as an n × 1 column vector. The matrix X and the vector β are multiplied together using matrix multiplication, and the product is added to the vector ε using matrix addition. The least squares estimate in matrix notation is the inverse of X-transpose times X, multiplied by X-transpose times Y: $b = (X'X)^{-1}X'Y$.

6.0 TUTOR-MARKED ASSIGNMENT

(1) The following table (Table 3.1) shows suds (Y), soap 1 (X1) and soap 2 (X2).

Table 3.1

Soap 1 (x1)    4     4.5   5     5.5   6     6.5   7
Soap 2 (x2)    8     9     10    11    12    13    14
Suds (y)       33    42    45    51    53    61    62


Note: all variables are deviations from their means.

Use matrix notation to show that there is linear dependence between the columns of X.

(2) Table 3.2 shows suds and soap used.

Soap 1 (x1)    4     4.5   5     5.5   6     6.5   7
Suds (yi)      33    42    45    51    53    61    62

Estimate the regression using both the ordinary (scalar) formulas and matrix notation, and show that both methods produce the same estimates.

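Note:

$$b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = (X'X)^{-1}X'Y; \qquad X'X = \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix} \qquad \text{and} \qquad X'Y = \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix}$$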

7.0 REFERENCES/FURTHER READINGS

Draper, N.R. & Smith, H. (1998). Applied Regression Analysis, Third Edition, John Wiley online; doi: 10.1002/9781118625590.


UNIT 2: VECTOR AUTOREGRESSIVE (VAR) MODELS AND CAUSALITY