Subset Selection in Regression
Analysis
M.Sc. Project Report (Final Stage) submitted in partial fulfilment of the requirements
for the degree of Master of Science
by
Ariful Islam Mondal
(02528019)
under the guidance of
Prof. Alladi Subramanyam
Department of Mathematics
IIT Bombay
Department of Mathematics
INDIAN INSTITUTE OF TECHNOLOGY, BOMBAY
April, 2004
Acknowledgment
I am extremely grateful to my guide, Prof. Alladi Subramanyam, for giving me such a nice project on applied statistics. Without his support and suggestions I could not have finished my project. Indeed, he devoted a lot of time to me. Moreover, this project helped me a lot in getting a good job. Lastly, I would like to thank my guide for everything he did for me, and I would also like to thank my friends and batchmates who spent time with me while I was typing the report.
Contents
1 Introduction
1.0.1 Why Subset Selection?
1.0.2 Uses of regression
1.0.3 Organization of report and Future Plan
2 Multiple Linear Regression
2.1 Multiple Regression Models
2.2 Estimation of the Model Parameters by Least Squares
2.2.1 Least Squares Estimation Method
2.2.2 Geometrical Interpretation of Least Squares
2.2.3 Properties of the Least Squares Estimators
2.2.4 Estimation of σ²
2.3 Confidence Interval in Multiple Regression
2.3.1 Confidence Interval on the Regression Coefficients
2.3.2 Confidence Interval Estimation of the Mean Response
2.4 Gauss-Markov Conditions
2.4.1 Mean and Variance of Estimates Under G-M Conditions
2.4.2 The Gauss-Markov Theorem
2.5 Maximum Likelihood Estimator (MLE)
2.6 Explanatory Power−Goodness of Fit
2.7 Testing of Hypothesis In Multiple Linear Regression
2.7.1 Test for Significance of Regression
2.7.2 Tests on Individual Regression Coefficients
2.7.3 Special Cases of Orthogonal Columns in X
2.7.4 Likelihood Ratio Test
3 Regression Diagnosis and Measures of Model Adequacy
3.1 Residual Analysis
3.1.1 Definition of Residuals
3.1.2 Estimates
3.1.3 Estimates of σ²
3.1.4 Coefficient of Multiple Determination
3.1.5 Methods of Scaling Residuals
4 Subset Selection and Model Building
4.1 Model Building Or Formulation Of The Problem
4.2 Consequences Of Variable Deletion
4.2.1 Properties of β̂_p
4.3 Criteria for Evaluating Subset Regression Models
4.3.1 Coefficient of Multiple Determination
4.3.2 Adjusted R²
4.3.3 Residual Mean Square
4.3.4 Mallows’ Cp-Statistics
4.4 Computational Techniques For Variable Selection
4.4.1 All Possible Regression
4.4.2 Directed Search on t
4.4.3 Stepwise Variable Selection
5 Dealing with Multicollinearity
5.1 Sources of Multicollinearity
5.1.1 Effects Of Multicollinearity
5.2 Multicollinearity Diagnosis
5.2.1 Estimation of the Correlation Matrix
5.2.2 Variance Inflation Factors
5.2.3 Eigensystem Analysis of X'X
5.2.4 Other Diagnostics
5.3 Methods for Dealing with Multicollinearity
5.3.1 Collecting Additional Data
5.3.2 Model Respecification
5.3.3 Ridge Regression
5.3.4 Principal Components Regression
6 Ridge Regression
6.1 Ridge Estimation
6.2 Methods for Choosing k
6.3 Ridge regression and Variable Selection
6.4 Ridge Regression: Some Remarks
7 Better Subset Regression Using the Nonnegative Garrote
7.1 Nonnegative Garrote
7.2 Model Selection
7.2.1 Prediction and Model Error
7.3 Estimation of Error
7.3.1 X-Controlled Estimates
7.3.2 X-Random Estimates
7.4 X-Orthonormal
8 Subset Auto Regression
8.1 Method of Estimation
8.2 Pagano’s Algorithm
8.3 Computation of The Increase in The Residual Variance
8.4 Conclusion
A Data Analysis
A.1 Correlation Coefficients
A.2 Forward Selection for Fitness Data
A.3 Backward Elimination Procedure for Fitness Data
A.4 Stepwise Selection For Fitness Data
A.5 Nonnegative Garrote Estimation of Fitness data
A.6 Ridge Estimation For Fitness Data
Chapter 1
Introduction
The analysis of data aimed at discovering how one or more variables (called independent variables, predictor variables or regressors) affect other variables (called dependent variables or response variables) is known as “regression analysis”. Consider the model:
y = β0+ β1x + ε (1.1)
This is a simple linear regression model, where x is an independent variable and y is a dependent variable or response variable. In general, the response may be related to p regressors x1, x2, · · · , xp, so that
y = β0 + β1x1 + β2x2 + · · · + βpxp + ε (1.2)
This is called a multiple linear regression model. The adjective linear is employed to indicate that the model is linear in the parameters β0, β1, · · · , βp, and not because y is a linear function of the x’s. Many models in which y is related to the x’s in a non-linear fashion can still be treated as linear regression models as long as the equation is linear in the β’s.
An important objective of regression analysis is to estimate the unknown parameters in the regression model. This process is also called fitting the model to the data. The next phase of regression analysis is called model adequacy checking, in which the appropriateness of the model is studied and the quality of the fit ascertained. Through such analysis the usefulness of the regression model may be determined. The outcome of the adequacy checking may indicate either that the model is reasonable or that the original fit must be modified.
1.0.1 Why Subset Selection?
In most practical problems the analyst has a pool of candidate regressors that should include all the influential factors, but the actual subset of regressors that should be used in the model needs to be determined. Finding an appropriate subset of regressors for the model is called the variable selection problem. Building a regression model that includes only a subset of the available regressors involves two conflicting objectives. On the one hand, we would like the model to include as many regressors as possible so that the “information content” in these factors can influence the predicted value of y. On the other hand, we want the model to include as few regressors as possible because the variance of the prediction increases as the number of regressors increases. Also, the more regressors there are in a model, the greater is the cost of data collection and model maintenance. The process of finding a model that compromises between these two objectives is called selecting the “best” regression equation.
1.0.2 Uses of regression
Regression models are used for several purposes, such as
1. data description,
2. parameter estimation and prediction of events,
3. control.
Engineers and scientists frequently use equations to summarize or describe a set of data. Regression analysis is helpful in developing such equations. Sometimes parameter estimation problems can be solved by regression techniques. Many applications of regression involve prediction of the response variable. Regression models may also be used for control purposes, for example in chemical engineering processes.
1.0.3 Organization of report and Future Plan
This report is divided into 8 chapters. The third chapter includes Regression Diagnosis and Measures of Model Adequacy from the first stage report. Chapter 4 deals with subset selection using ordinary subset regression (Forward, Backward and Stepwise methods). Chapter 5 discusses multicollinearity and Chapter 6 ridge regression. Chapter 7 presents the idea of an intermediate selection procedure (the Nonnegative Garrote in subset selection, due to Leo Breiman), and in Chapter 8 we have tried to use subset selection criteria for order selection of an autoregressive scheme. Lastly, the data analysis is provided in the Appendix.
Chapter 2
Multiple Linear Regression
A regression model that involves more than one regressor variable is called a multiple regression model. Fitting and analyzing these models is discussed in this chapter. We shall also discuss measures of model adequacy that are useful in multiple regression.
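2.1 Multiple Regression Models
Suppose that n > p observations are available, and let $y_i$ denote the $i$th observed response and $x_{ij}$ denote the $i$th observation or level of regressor $x_j$, $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, p$. Then the general multiple linear regression model can be written as
$$y = X\beta + \varepsilon \tag{2.1}$$
where
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \quad \text{and} \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
In general, y is an n × 1 vector of the observations, X is an n × (p + 1) matrix of the levels of the regressor variables, β is a (p + 1) × 1 vector of the regression coefficients, and ε is an n × 1 vector of random errors.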
2.2 Estimation of the Model Parameters by Least Squares
2.2.1 Least Squares Estimation Method
Model:
y = Xβ + ε
We assume that the error term ε in the model has E(ε) = 0 and V(ε) = σ²I, and that the errors are uncorrelated.
We wish to find the vector of least squares (LS) estimators, $\hat\beta$, that minimizes
$$S(\beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta). \tag{2.2}$$
The LS estimators $\hat\beta$ must satisfy
$$\frac{\partial S}{\partial \beta} = -2X'y + 2X'X\beta = 0 \tag{2.3}$$
which simplifies to
$$X'X\beta = X'y \tag{2.4}$$
These equations are called the least squares normal equations.
To find the solution of the normal equations we shall consider the following cases:
Case I: X'X is non-singular.
If X'X is non-singular, i.e. if no column of the X matrix is a linear combination of the other columns, i.e. if the regressors are linearly independent, then $(X'X)^{-1}$ exists. Hence the solution of equations (2.4) is unique. Thus the least squares estimator of β is given by
$$\hat\beta = (X'X)^{-1}X'y \tag{2.5}$$
Case II: X'X is singular.
If X'X is singular, then we can solve the normal equations X'Xβ = X'y by using a generalized inverse (g-inverse) as follows:
$$\hat\beta = (X'X)^{-}X'y \tag{2.6}$$
Here $(X'X)^{-}$ denotes a g-inverse of X'X. Since the g-inverse is not unique, the estimates are not unique either.
The predicted value $\hat y$ corresponding to the observed value y is
$$\begin{align} \hat y &= X\hat\beta \tag{2.7} \\ &= X(X'X)^{-1}X'y \tag{2.8} \\ &= Hy \tag{2.9} \end{align}$$
The n × n matrix $H = X(X'X)^{-1}X'$ is usually called the "hat" matrix, because it maps the vector of observed values into the vector of fitted values; its properties play a central role in regression analysis. Let M = I − H. Then $MX = (I - H)X = X - X(X'X)^{-1}X'X = X - X = 0$, and the residuals are
$$\hat e = y - \hat y \tag{2.10}$$
where
$$\hat e = \begin{bmatrix} \hat e_1 \\ \hat e_2 \\ \vdots \\ \hat e_n \end{bmatrix}, \quad \hat y = \begin{bmatrix} \hat y_1 \\ \hat y_2 \\ \vdots \\ \hat y_n \end{bmatrix} = X\hat\beta = X(X'X)^{-1}X'y \tag{2.11}$$
Therefore, we can express the residual $\hat e$ in terms of y or ε as follows:
$$\hat e = y - Hy = My = MX\beta + M\varepsilon = M\varepsilon \tag{2.12}$$
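The quantities defined above can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming a small made-up data set (the numbers and variable names are illustrative and do not come from the report). It solves the normal equations for Case I and forms the hat matrix H, the matrix M = I − H and the residuals.

```python
import numpy as np

# Hypothetical toy data (not from the report): n = 6 observations, p = 2 regressors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

n = y.size
X = np.column_stack([np.ones(n), x1, x2])   # n x (p + 1) design matrix

# Case I: X'X is non-singular, so solve the normal equations X'X b = X'y.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X', M = I - H, fitted values and residuals.
H = X @ np.linalg.solve(XtX, X.T)
M = np.eye(n) - H
y_hat = H @ y          # equals X @ beta_hat
e_hat = M @ y          # equals y - y_hat
```

In this sketch `M @ X` is numerically zero and `H` is idempotent, which mirrors the identities used in (2.12). For Case II, one possible choice of g-inverse is the Moore-Penrose inverse, available as `np.linalg.pinv`.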
2.2.2 Geometrical Interpretation of Least Squares
An intuitive geometrical interpretation of least squares is sometimes helpful. We may think of the vector of observations $y' = [y_1, y_2, \dots, y_n]$ as defining a vector from the origin to the point A in Figure 2.1. Note that $[y_1, y_2, \dots, y_n]$ form the coordinates of an n-dimensional sample space. The sample space in Figure 2.1 is three-dimensional.
The matrix X consists of p + 1 (n × 1) column vectors, namely $\mathbf{1}$ (a column vector of 1's), $x_1, x_2, \dots, x_p$. Each of these columns defines a vector from the origin in the sample space. These p + 1 vectors span a (p + 1)-dimensional subspace called the estimation space. The figure shows a 2-dimensional estimation space. We may represent any point in this subspace by a linear combination of the vectors $\mathbf{1}, x_1, x_2, \dots, x_p$. Thus any point in the estimation space is of the form Xβ. Let the vector Xβ determine the point B in Figure 2.1. The squared distance from B to A is just
$$S(\beta) = (y - X\beta)'(y - X\beta).$$
Therefore minimizing the squared distance from the point A defined by the observation vector y to the estimation space requires finding the point in the estimation space that is closest to A. The squared distance will be a minimum when the point in the estimation space is the foot of the line from A normal (or perpendicular) to the estimation space. This is point C in Figure 2.1.
Figure 2.1: A geometrical representation of least squares
This point is defined by the vector $\hat y = X\hat\beta$. Therefore, since $y - \hat y = y - X\hat\beta$ is perpendicular to the estimation space, we may write
$$X'(y - X\hat\beta) = 0 \quad \text{or} \quad X'X\hat\beta = X'y$$
which we recognize as the least squares normal equations.
2.2.3 Properties of the Least Squares Estimators
The statistical properties of the least squares estimator $\hat\beta$ may be easily demonstrated. Consider first bias:
$$E(\hat\beta) = E[(X'X)^{-1}X'y] = E[(X'X)^{-1}X'(X\beta + \varepsilon)] = E[(X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon] = \beta$$
since E(ε) = 0. Thus $\hat\beta$ is an unbiased estimator of β. The covariance matrix of $\hat\beta$ is given by
$$Cov(\hat\beta) = \sigma^2 (X'X)^{-1}$$
Therefore if we let $C = (X'X)^{-1}$, the variance of $\hat\beta_j$ is $\sigma^2 C_{jj}$ and the covariance between $\hat\beta_i$ and $\hat\beta_j$ is $\sigma^2 C_{ij}$.
The least squares estimator $\hat\beta$ is the best linear unbiased estimator of β (the Gauss-Markov theorem). If we further assume that the errors $\varepsilon_i$ are normally distributed, then $\hat\beta$ is also the maximum likelihood estimator (MLE) of β. The maximum likelihood estimator is the minimum variance unbiased estimator of β.
2.2.4 Estimation of σ²
As in simple linear regression, we may develop an estimator of σ² from the residual sum of squares
$$SS_E = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} \hat e_i^2 = \hat e'\hat e$$
Substituting $\hat e = y - X\hat\beta$, we have
$$SS_E = (y - X\hat\beta)'(y - X\hat\beta) = y'y - 2\hat\beta'X'y + \hat\beta'X'X\hat\beta$$
Since $X'X\hat\beta = X'y$, this last equation becomes
$$SS_E = y'y - \hat\beta'X'y \tag{2.13}$$
The residual sum of squares has n − p − 1 degrees of freedom associated with it, since p + 1 parameters are estimated in the regression model. The residual mean square is
$$MS_E = \frac{SS_E}{n - p - 1} \tag{2.14}$$
We can show that the expected value of $MS_E$ is σ², so an unbiased estimator of σ² is given by
$$\hat\sigma^2 = MS_E \tag{2.15}$$
This estimator of σ² is model dependent.
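As a small numerical check of (2.13) to (2.15), the sketch below computes the residual sum of squares in two ways and then the residual mean square. It is written in Python with NumPy and uses made-up data; the variable names are illustrative assumptions, not notation from the report.

```python
import numpy as np

# Hypothetical toy data (not from the report): n = 6, p = 2.
X = np.column_stack([
    np.ones(6),
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

n, k = X.shape                      # k = p + 1 estimated parameters
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual sum of squares, directly from the residuals ...
e_hat = y - X @ beta_hat
sse_direct = e_hat @ e_hat
# ... and through the shortcut SS_E = y'y - beta_hat' X'y of (2.13).
sse_shortcut = y @ y - beta_hat @ (X.T @ y)

# Residual mean square (2.14), used as the estimate of sigma^2 in (2.15).
mse = sse_direct / (n - k)
```

The two values of the residual sum of squares agree up to rounding error. The shortcut form subtracts two numbers of similar size, so the direct form is usually the safer one numerically.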
2.3 Confidence Interval in Multiple Regression
Confidence intervals on individual regression coefficients and confidence intervals on the mean response given specific levels of the regressors play the same important role in multiple regression that they do in simple linear regression. This section develops the one-at-a-time confidence intervals for these cases.
2.3.1 Confidence Interval on the Regression Coefficients
To construct confidence interval estimates for the regression coefficients $\beta_j$, we must assume that the errors $\varepsilon_i$ are normally and independently distributed with mean zero and variance σ². Therefore the observations $y_i$ are normally and independently distributed with mean $\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$ and variance σ². Since the least squares estimator $\hat\beta$ is a linear combination of the observations, it follows that $\hat\beta$ is normally distributed with mean vector β and covariance matrix $\sigma^2 (X'X)^{-1}$. This implies that the marginal distribution of any regression coefficient $\hat\beta_j$ is normal with mean $\beta_j$ and variance $\sigma^2 C_{jj}$, where $C_{jj}$ is the jth diagonal element of the $(X'X)^{-1}$ matrix. Consequently each of the statistics
$$\frac{\hat\beta_j - \beta_j}{\sqrt{\hat\sigma^2 C_{jj}}}, \quad j = 0, 1, \dots, p \tag{2.16}$$
is distributed as t with n − p − 1 degrees of freedom, where $\hat\sigma^2$ is the estimate of the error variance obtained from (2.15). Therefore a 100(1 − α)% confidence interval for the regression coefficient $\beta_j$, j = 0, 1, ..., p, is
$$\hat\beta_j - t_{\alpha/2,\,n-p-1}\sqrt{\hat\sigma^2 C_{jj}} \le \beta_j \le \hat\beta_j + t_{\alpha/2,\,n-p-1}\sqrt{\hat\sigma^2 C_{jj}} \tag{2.17}$$
We usually call the quantity
$$se(\hat\beta_j) = \sqrt{\hat\sigma^2 C_{jj}} \tag{2.18}$$
the standard error of the regression coefficient $\hat\beta_j$.
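The interval (2.17) can be sketched in a few lines of Python with NumPy. The function name `coefficient_intervals`, the toy data and the choice to pass the critical value `t_crit` as an argument (read from a t table) are assumptions made for this illustration only.

```python
import numpy as np

def coefficient_intervals(X, y, t_crit):
    """One-at-a-time confidence intervals for the regression coefficients.

    t_crit is the tabulated value t_{alpha/2, n-p-1}; it is supplied by the
    caller so that the sketch needs nothing beyond NumPy.
    """
    n, k = X.shape                           # k = p + 1 estimated parameters
    C = np.linalg.inv(X.T @ X)               # C = (X'X)^{-1}
    beta_hat = C @ (X.T @ y)
    resid = y - X @ beta_hat
    sigma2_hat = (resid @ resid) / (n - k)   # MS_E, the estimate of sigma^2
    se = np.sqrt(sigma2_hat * np.diag(C))    # standard errors sqrt(sigma2_hat * C_jj)
    return beta_hat, se, beta_hat - t_crit * se, beta_hat + t_crit * se

# Hypothetical toy data (not from the report): n = 6, p = 2, so 3 d.f.
X = np.column_stack([
    np.ones(6),
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# t_{0.025, 3} = 3.182 from a standard t table (95% intervals).
beta_hat, se, lower, upper = coefficient_intervals(X, y, t_crit=3.182)
```

Each interval is centred at the corresponding estimate and has half-width `t_crit * se`, so a coefficient whose interval contains zero would not be rejected by the corresponding two-sided t test at the same level.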
2.3.2 Confidence Interval Estimation of the Mean Response
We may construct a confidence interval on the mean response at a particular point, such as $x_{01}, x_{02}, \dots, x_{0p}$. Define the vector $x_0$ as
$$x_0 = \begin{bmatrix} 1 \\ x_{01} \\ x_{02} \\ \vdots \\ x_{0p} \end{bmatrix}$$
The fitted value at this point is
$$\hat y_0 = x_0'\hat\beta \tag{2.19}$$
This is an unbiased estimator of the mean response $y_0$ at this point, since $E(\hat y_0) = x_0'\beta = y_0$, and the variance of $\hat y_0$ is
$$V(\hat y_0) = \sigma^2 x_0'(X'X)^{-1}x_0 \tag{2.20}$$
Therefore a 100(1 − α)% confidence interval on the mean response at the point $x_{01}, x_{02}, \dots, x_{0p}$ is
$$\hat y_0 - t_{\alpha/2,\,n-p-1}\sqrt{\hat\sigma^2 x_0'(X'X)^{-1}x_0} \le y_0 \le \hat y_0 + t_{\alpha/2,\,n-p-1}\sqrt{\hat\sigma^2 x_0'(X'X)^{-1}x_0} \tag{2.21}$$
2.4 Gauss-Markov Conditions
In order for estimates of β to have some desirable statistical properties, we need the following assumptions, called Gauss-Markov (G-M) conditions, which have already been introduced:
$$E(\varepsilon) = 0 \tag{2.22}$$
$$E(\varepsilon\varepsilon') = \sigma^2 I \tag{2.23}$$
We shall use these conditions repeatedly in the sequel. Note that G-M conditions imply that
$$E(y) = X\beta \tag{2.24}$$
and
$$Cov(y) = E[(y - X\beta)(y - X\beta)'] = E(\varepsilon\varepsilon') = \sigma^2 I \tag{2.25}$$
It also follows that (see (2.12))
$$E(\hat e\hat e') = M E[\varepsilon\varepsilon'] M = \sigma^2 M \tag{2.26}$$
since M = I − H is symmetric and idempotent. Therefore,
$$Var(\hat e_i) = \sigma^2 m_{ii} = \sigma^2 [1 - h_{ii}] \tag{2.27}$$
where $m_{ij}$ and $h_{ij}$ are the ij-th elements of M and H respectively. Because a variance is non-negative and a covariance matrix is at least positive semi-definite, it follows that $h_{ii} \le 1$ and M is at least positive semi-definite.
2.4.1 Mean and Variance of Estimates Under G-M Conditions
From equation (2.22),
$$E(\hat\beta) = \beta \tag{2.28}$$
Now we know that, if for any parameter θ, its estimate T has the property that E(T) = θ, then T is an unbiased estimator of θ. Thus under G-M conditions, $\hat\beta$ is an unbiased estimator of β. Note that we only use the first G-M condition (2.22) to prove this. Therefore violation of condition (2.23) will not lead to bias. Further, under the G-M condition (2.23),
$$Cov(\hat\beta) = \sigma^2 (X'X)^{-1} \tag{2.29}$$
2.4.2 The Gauss-Markov Theorem
In most applications of regression we are interested in estimates of some linear function Lβ or l'β of β, where l is a vector and L is a matrix. Estimates of this type include the predicted values $\hat y_i$, the estimate $\hat y_0$ of a future observation, $\hat y$ and even $\hat\beta$ itself. We consider here l'β.
Although there may be several possible estimators, we shall confine ourselves to linear estimators, i.e., estimators that are linear functions of $y_1, y_2, \dots, y_n$, say c'y. We also require that these linear functions be unbiased estimators of l'β and assume that such linear unbiased estimators for l'β exist; l'β is then called estimable.
In the following theorem we show that among all linear unbiased estimators, the least squares estimator $l'\hat\beta = l'(X'X)^{-1}X'y$, which is also a linear function of $y_1, y_2, \dots, y_n$ and which is unbiased for l'β, has the smallest variance. That is, $Var(l'\hat\beta) \le Var(c'y)$ for all c such that E(c'y) = l'β. Such an estimator is called a best linear unbiased estimator (BLUE).
Theorem 2.4.1 (Gauss-Markov) Let $\hat\beta = (X'X)^{-1}X'y$ and y = Xβ + ε. Then under G-M conditions, the estimator $l'\hat\beta$ of the estimable function l'β is BLUE.
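Before proving the G-M theorem, let $X_1, X_2, \dots, X_n$ be a random sample of size n from the distribution f(x, θ), where $\theta \in R^k$. Now let
$$T = \begin{bmatrix} T_1 \\ T_2 \\ \vdots \\ T_k \end{bmatrix} = \begin{bmatrix} T_1(X_1, X_2, \dots, X_n) \\ T_2(X_1, X_2, \dots, X_n) \\ \vdots \\ T_k(X_1, X_2, \dots, X_n) \end{bmatrix}$$
If E(T) = θ, then we say that T is an unbiased estimator for θ. Again, let T and S be two unbiased estimators for θ. Then we say that T is better than S if $\Sigma^{S}_{\theta} - \Sigma^{T}_{\theta}$ is non-negative definite for all $\theta \in \Omega$, and that T is best if this holds for every unbiased S, where $\Sigma^{S}_{\theta}$ and $\Sigma^{T}_{\theta}$ are the variance-covariance matrices of the estimators S and T respectively.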
Proof of G-M Theorem: Consider the model
$$y = X\beta + \varepsilon \tag{2.30}$$
with E(ε) = 0 and V(ε) = σ²I. We shall concentrate only on linear combinations of y. Then (i) a'y is an unbiased estimator of a'Xβ and (ii) X'y is an unbiased estimator of X'Xβ. Again, let M'y also be an unbiased estimator of X'Xβ, where M here denotes an arbitrary matrix of the same order as X (not the matrix I − H of Section 2.2). Therefore
$$M'X\beta = X'X\beta \;\; \forall \beta \;\Rightarrow\; M'X = X'X$$
Now M'y can be written as
$$M'y = M'y - X'y + X'y$$
Let u = M'y − X'y. Since $Cov(u, X'y) = \sigma^2 (M' - X')X = \sigma^2 (M'X - X'X) = 0$, the variance-covariance matrices satisfy
$$\Sigma^{M'y}_{\theta} = \Sigma^{u}_{\theta} + \Sigma^{X'y}_{\theta} \;\Rightarrow\; \Sigma^{M'y}_{\theta} - \Sigma^{X'y}_{\theta} = \Sigma^{u}_{\theta}$$
which is non-negative definite. Hence X'y is the optimal unbiased estimator of X'Xβ. If X'X is non-singular, then
$$\hat\beta = (X'X)^{-1}X'y$$
Suppose u'X'y and v'X'y are two BLUEs of λ'β, where u and v now denote vectors; we shall show that the BLUE is unique. Unbiasedness gives
$$E(u'X'y) = \lambda'\beta = E(v'X'y) \;\Rightarrow\; u'X'X\beta = \lambda'\beta = v'X'X\beta \;\Rightarrow\; u'X'X = \lambda' = v'X'X$$
i.e., E((u' − v')X'y) = 0. Now since u'X'y is BLUE, $V(u'X'y) \le V(v'X'y)$ for all $v \ne u$. Again, since v'X'y is BLUE, $V(v'X'y) \le V(u'X'y)$ for all $u \ne v$. Therefore V(v'X'y) = V(u'X'y) and hence the BLUE is unique.
Alternative Proof of G-M theorem: Let c'y be another linear unbiased estimator of the (estimable) function l'β. Since c'y is an unbiased estimator of l'β, we have l'β = E(c'y) = c'Xβ for all β, and hence
$$c'X = l' \tag{2.31}$$
Now,
$$Var(c'y) = c'\,Cov(y)\,c = c'(\sigma^2 I)c = \sigma^2 c'c,$$
and
$$Var(l'\hat\beta) = l'\,Cov(\hat\beta)\,l = \sigma^2 l'(X'X)^{-1}l = \sigma^2 c'X(X'X)^{-1}X'c$$
from (2.29) and (2.31). Therefore
$$Var(c'y) - Var(l'\hat\beta) = \sigma^2 [c'c - c'X(X'X)^{-1}X'c] = \sigma^2 c'[I - X(X'X)^{-1}X']c \ge 0$$
since $I - X(X'X)^{-1}X' = M$ is positive semi-definite (see Section 2.4). This proves the theorem.
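The inequality in the alternative proof can be illustrated numerically. The sketch below, in Python with NumPy, builds a competing linear unbiased estimator c'y by adding a vector Mz to the least squares weights and compares the two variances. The simulated design matrix, the vector l, the value of σ² and all variable names are assumptions chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed so the sketch is reproducible

# Hypothetical design: n = 8 observations, p = 2 regressors plus intercept.
n, p = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
l = np.array([1.0, 2.0, -1.0])       # linear function l'beta to be estimated
sigma2 = 1.5                         # assumed error variance

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
M = np.eye(n) - H

# Least squares estimator written as c_ls' y, with c_ls = X (X'X)^{-1} l.
c_ls = X @ XtX_inv @ l

# A competing linear estimator: c_alt = c_ls + M z still satisfies c'X = l'
# because MX = 0, so c_alt' y is also unbiased for l'beta.
z = rng.normal(size=n)
c_alt = c_ls + M @ z

var_ls = sigma2 * (c_ls @ c_ls)      # Var(l' beta_hat) = sigma^2 l'(X'X)^{-1} l
var_alt = sigma2 * (c_alt @ c_alt)   # Var(c'y) = sigma^2 c'c
gap = var_alt - var_ls               # equals sigma^2 c_alt' M c_alt >= 0
```

Here `gap` equals `sigma2 * (z @ M @ z)`, which is non-negative because M is positive semi-definite; it is zero only when Mz = 0, that is, when the competing estimator coincides with least squares.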
A slight generalization of the Gauss-Markov theorem is the following:
Theorem 2.4.2 Under G-M conditions, the estimator $L\hat\beta$ of the estimable function Lβ is BLUE in the sense that
$$Cov(Cy) - Cov(L\hat\beta)$$
is positive semi-definite, where L is an arbitrary matrix and Cy is another unbiased linear estimator of Lβ.
This theorem implies that if we wish to estimate several (possibly related) linear functions of the $\beta_j$'s, we cannot do better (in a BLUE sense) than use least squares estimates.
Proof: As in the proof of the Gauss-Markov theorem, the unbiasedness of Cy yields Lβ = CE(y) = CXβ for all β, whence L = CX, and since $Cov(Cy) = \sigma^2 CC'$ and
$$Cov(L\hat\beta) = \sigma^2 L(X'X)^{-1}L' = \sigma^2 CX(X'X)^{-1}X'C',$$
it follows that
$$Cov(Cy) - Cov(L\hat\beta) = \sigma^2 C[I - X(X'X)^{-1}X']C',$$
which is positive semi-definite since, as shown before, the matrix $[I - X(X'X)^{-1}X'] = M$ is positive semi-definite.
2.5 Maximum Likelihood Estimator (MLE)
Model:
$$y = X\beta + \varepsilon$$
We assume that the Gauss-Markov conditions hold and that the errors $\varepsilon_i$ are independently and normally distributed with mean zero and variance σ², i.e. $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, i.e. $\varepsilon \sim N(0, \sigma^2 I)$. Then the probability density function of $y_1, y_2, \dots, y_n$ is given by
$$(2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right]. \tag{2.32}$$
The same probability density function, when considered as a function of β and σ² given the observations $y_1, y_2, \dots, y_n$, i.e. $f(\beta, \sigma^2 \mid y)$, is called the likelihood function and is denoted by $L(\beta, \sigma^2 \mid y)$. The maximum likelihood estimates of β and σ² are obtained by maximizing $L(\beta, \sigma^2 \mid y)$ with respect to β and σ². Since log[z] is an increasing function of z, the same maximum likelihood estimates can be found by maximizing the logarithm of L.
Since maximizing (2.32) with respect to β is equivalent to minimizing $(y - X\beta)'(y - X\beta)$, the maximum likelihood estimate of β is the same as the least squares estimate; i.e., it is $\hat\beta = (X'X)^{-1}X'y$. The maximum likelihood estimate of σ², obtained by equating to zero the derivative of the log of the likelihood function with respect to σ² after substituting β by $\hat\beta$, is
$$\frac{1}{n}(y - X\hat\beta)'(y - X\hat\beta) = \frac{1}{n}\hat e'\hat e \tag{2.33}$$
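The step leading to (2.33) can be written out explicitly. The following short derivation is a sketch using the notation of this section; the symbols $\ell$ for the log-likelihood and $\hat\sigma^2_{ML}$ for the resulting estimate are introduced here only for illustration.

```latex
\begin{align*}
\ell(\beta, \sigma^2)
  &= \log L(\beta, \sigma^2 \mid y)
   = -\frac{n}{2}\log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta), \\
\left.\frac{\partial \ell}{\partial \sigma^2}\right|_{\beta = \hat\beta}
  &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\,\hat e'\hat e = 0, \\
\Rightarrow\quad \hat\sigma^2_{ML}
  &= \frac{1}{n}\,\hat e'\hat e = \frac{n - p - 1}{n}\, MS_E .
\end{align*}
```

The last equality shows that the maximum likelihood estimate of σ² is slightly smaller than the unbiased estimate $MS_E$ of Section 2.2.4, and that the two agree closely when n is large relative to p.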
To obtain the maximum likelihood estimate of β under the constraints Cβ = γ, we need to maximize (2.32) subject to Cβ = γ. This is equivalent to minimizing $(y - X\beta)'(y - X\beta)$ subject to Cβ − γ = 0.
2.6 Explanatory Power−Goodness of Fit
In this section we discuss a measure of how well our model explains the data, that is, some measure of goodness of fit. One way to do this would be to see what a good model would look like. A good model should make no mistakes. Hence
$$y \approx X\hat\beta$$
Therefore, our estimated errors or residuals can provide a useful measure of how well our model approximates the data. It would also be useful if we could scale this measure so that the value associated with the goodness of fit has some meaning.
To see this, let's examine our estimated equation and break it up into two parts: the explained portion $X\hat\beta$ and the unexplained portion $\hat e$. Ideally, we would like the unexplained portion to be negligible. Hence our estimated model is
$$y = \underbrace{X\hat\beta}_{\text{explained portion}} + \underbrace{\hat e}_{\text{unexplained portion}} \tag{2.34}$$
To gauge how well our model fits, we start from the least squares fit and consider the sum of squares of our dependent variable y, which leads us to the variation of the explained and unexplained portions of our regression. So,
$$y'y = (X\hat\beta + \hat e)'(X\hat\beta + \hat e)$$
This gives
$$y'y = \hat\beta'X'X\hat\beta + \hat e'X\hat\beta + \hat\beta'X'\hat e + \hat e'\hat e \tag{2.35}$$
But from the normal equations, we know $X'\hat e = 0$, so its transpose $\hat e'X$ must also be zero. Therefore
$$y'y = \hat\beta'X'X\hat\beta + \hat e'\hat e \tag{2.36}$$
which has a good intuitive meaning:
$$\underbrace{y'y}_{SS_T} = \underbrace{\hat\beta'X'X\hat\beta}_{SS_R} + \underbrace{\hat e'\hat e}_{SS_E} \tag{2.37}$$
where $SS_T$ stands for the total sum of squares, which gives us an idea of the total variation; $SS_R$ stands for the regression sum of squares, which gives us an idea of how much of the variation comes from the explained portion; and $SS_E$ stands for the error sum of squares, which gives us an idea of how much of the variation comes from the unexplained portion. If we divide through by $SS_T$, we get
$$1 = \frac{\hat\beta'X'X\hat\beta}{y'y} + \frac{\hat e'\hat e}{y'y} \tag{2.38}$$
so that $SS_R/SS_T$ (the proportion of variation of y explained by X) and $SS_E/SS_T$ (the proportion of variation of y that is unexplained) must sum to one. Hence, if we define a statistic R² to measure how well our model fits, i.e., the proportion of variation of y explained by X, we get
$$R^2 = 1 - \frac{\hat e'\hat e}{y'y} = \frac{\hat\beta'X'X\hat\beta}{y'y} \tag{2.39}$$
So a high R², say .95, says that just about all variation in y is being explained by X, whereas a low R², say .05, says that we cannot explain much of the variation of y.
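The decomposition (2.37) and the statistic (2.39) can be sketched as follows in Python with NumPy. The function name `sums_of_squares` and the toy data are assumptions made for illustration. Note that this follows the definition used in this section, with y'y as the total sum of squares; many software packages instead report an R² based on deviations of y from its mean, so the values need not coincide.

```python
import numpy as np

def sums_of_squares(X, y):
    """Return (SS_R, SS_E, SS_T, R^2) with SS_T = y'y as in this section."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares fit
    e_hat = y - X @ beta_hat
    ss_t = y @ y                                # total sum of squares y'y
    ss_e = e_hat @ e_hat                        # unexplained portion
    ss_r = beta_hat @ (X.T @ X) @ beta_hat      # explained portion
    return ss_r, ss_e, ss_t, 1.0 - ss_e / ss_t

# Hypothetical toy data (not from the report).
X = np.column_stack([
    np.ones(6),
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

ss_r, ss_e, ss_t, r2 = sums_of_squares(X, y)
```

In this sketch `ss_r + ss_e` reproduces `ss_t` up to rounding, and `r2` equals `ss_r / ss_t`, which are the two forms of (2.39).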
2.7 Testing of Hypothesis In Multiple Linear Regression
In multiple regression problems certain tests of hypotheses about the model parameters are useful in measuring model adequacy. In this section we shall describe several important hypothesis-testing procedures. We shall continue to require the normality assumption on the errors introduced in previous sections.
2.7.1 Test for Significance of Regression
The test for significance of regression is a test to determine if there is a linear relationship between the response y and any of the regressor variables $x_1, x_2, \dots, x_p$. The appropriate hypotheses are
$$H_0 : \beta_0 = \beta_1 = \cdots = \beta_p = 0 \tag{2.40}$$
against
$$H_1 : \beta_j \ne 0 \text{ for at least one } j$$
Rejection of $H_0$ implies that at least one of the regressors $x_1, x_2, \dots, x_p$ contributes significantly to the model.
Test Procedure:
The total sum of squares $SS_T$ is partitioned into a sum of squares due to regression and a residual sum of squares:
$$SS_T = SS_R + SS_E \tag{2.41}$$
and if $H_0$ is true, then $SS_R/\sigma^2 \sim \chi^2_{p+1}$, where the number of degrees of freedom (d.f.) for χ² is equal to the number of regressor variables including the constant, or in other words equal to the number of parameters to be estimated. Also we can show that $SS_E/\sigma^2 \sim \chi^2_{n-p-1}$ and that $SS_E$ and $SS_R$ are independent. The test procedure for $H_0$ is to compute
$$F_0 = \frac{SS_R/(p + 1)}{SS_E/(n - p - 1)} = \frac{MS_R}{MS_E} \tag{2.42}$$
and reject $H_0$ if $F_0 > F_{\alpha,\,p+1,\,n-p-1}$. The procedure is usually summarized in an analysis-of-variance table such as Table 2.1.
Table 2.1: Analysis of Variance for Significance of Regression in Multiple Regression
Source of Variation   d.f.        Sum of squares   Mean squares   F0
Regression            p + 1       SS_R             MS_R           MS_R/MS_E
Residuals             n - p - 1   SS_E             MS_E           -
Total                 n           SS_T             -              -
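The entries of the analysis-of-variance table can be computed as in the sketch below, written in Python with NumPy. The function name `anova_significance` and the toy data are illustrative assumptions. The sketch follows the convention of this section, in which the regression sum of squares has p + 1 degrees of freedom and the total sum of squares is y'y; the computed $F_0$ would then be compared with a tabulated value $F_{\alpha,\,p+1,\,n-p-1}$.

```python
import numpy as np

def anova_significance(X, y):
    """Return (SS_R, SS_E, MS_R, MS_E, F0) for the test of significance."""
    n, k = X.shape                         # k = p + 1 parameters
    Xty = X.T @ y
    beta_hat = np.linalg.solve(X.T @ X, Xty)
    ss_r = beta_hat @ Xty                  # regression sum of squares, k d.f.
    ss_e = y @ y - ss_r                    # residual sum of squares, n - k d.f.
    ms_r = ss_r / k
    ms_e = ss_e / (n - k)
    return ss_r, ss_e, ms_r, ms_e, ms_r / ms_e

# Hypothetical toy data (not from the report).
X = np.column_stack([
    np.ones(6),
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

ss_r, ss_e, ms_r, ms_e, f0 = anova_significance(X, y)
```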
2.7.2 Tests on Individual Regression Coefficients
The testing of hypotheses on the individual regression coefficients is helpful in determining the value of each of the regressors in the model. For example, the model might be more effective with the inclusion of additional regressors or perhaps with the deletion of one or more regressors presently in the model.
Adding a variable to a regression model always causes the sum of squares for regression to increase and the residual sum of squares to decrease. We must decide whether the increase in the regression sum of squares is sufficient to warrant using the additional regressor in the model. The addition of a regressor also increases the variance of the fitted value $\hat y$, so we must be careful to include only regressors that are of real value in explaining the response. Furthermore, adding an unimportant regressor may increase the residual mean square, which may decrease the usefulness of the model.
The hypotheses for testing the significance of any individual regression coefficient, such as $\beta_j$, are
$$H_0 : \beta_j = 0 \quad \text{against} \quad H_1 : \beta_j \ne 0 \tag{2.43}$$
If $H_0 : \beta_j = 0$ is not rejected, then this indicates that the regressor $x_j$ can be deleted from the model. The test statistic for this hypothesis is
$$t_0 = \frac{\hat\beta_j}{\sqrt{\hat\sigma^2 C_{jj}}} = \frac{\hat\beta_j}{se(\hat\beta_j)} \tag{2.44}$$
where $C_{jj}$ is the diagonal element of $(X'X)^{-1}$ corresponding to $\hat\beta_j$. The null hypothesis $H_0 : \beta_j = 0$ is rejected if $|t_0| > t_{\alpha/2,\,n-p-1}$. Note that this is really a partial or marginal test because the regression coefficient $\hat\beta_j$ depends on all the other regressor variables $x_i$ ($i \ne j$) that are in the model. Thus this is a test of the contribution of $x_j$ given the other regressors in the model.
We may also directly determine the contribution to the regression sum of squares of a regressor, for example $x_j$, given that the other regressors $x_i$ ($i \ne j$) are included in the model, by using the "extra-sum-of-squares" method. This procedure can also be used to investigate the contribution of a subset of the regressor variables to the model. Consider the regression model with p regressors
$$y = X\beta + \varepsilon \tag{2.45}$$
where y is an n × 1 vector of the observations, X is an n × (p + 1) matrix of the levels of the regressor variables, β is a (p + 1) × 1 vector of the regression coefficients, and ε is an n × 1 vector of random errors. We would like to determine if some subset of r < p regressors contributes significantly to the regression model. Let the vector of regression coefficients be partitioned as follows:
$$\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$$
where $\beta_1$ is (p − r + 1) × 1 and $\beta_2$ is r × 1. We want to test the hypotheses
$$H_0 : \beta_2 = 0 \quad \text{against} \quad H_1 : \beta_2 \ne 0 \tag{2.46}$$
The model can be written as
$$y = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon \tag{2.47}$$
where the n × (p − r + 1) matrix $X_1$ represents the columns of X associated with $\beta_1$ and the n × r matrix $X_2$ represents the columns of X associated with $\beta_2$. This is called the full model.
For the full model, we know that $\hat\beta = (X'X)^{-1}X'y$. The regression sum of squares for this model is
$$SS_R(\beta) = \hat\beta'X'y \quad (p + 1 \text{ d.f.})$$
and
$$MS_E = \frac{y'y - \hat\beta'X'y}{n - p - 1}$$
To find the contribution of the terms in $\beta_2$ to the regression, fit the model assuming that the null hypothesis $H_0 : \beta_2 = 0$ is true. This reduced model is
$$y = X_1\beta_1 + \varepsilon \tag{2.48}$$
The least squares estimator of $\beta_1$ in the reduced model is $\hat\beta_1 = (X_1'X_1)^{-1}X_1'y$. The regression sum of squares is
$$SS_R(\beta_1) = \hat\beta_1'X_1'y \quad (p - r + 1 \text{ d.f.}) \tag{2.49}$$
The regression sum of squares due to $\beta_2$ given that $\beta_1$ is already in the model is
$$SS_R(\beta_2 \mid \beta_1) = SS_R(\beta) - SS_R(\beta_1) \tag{2.50}$$
with p + 1 − (p − r + 1) = r degrees of freedom. This sum of squares is called the extra sum of squares due to $\beta_2$ because it measures the increase in the regression sum of squares that results from adding the regressors $x_{p-r+1}, x_{p-r+2}, \dots, x_p$ to a model that already contains $x_1, x_2, \dots, x_{p-r}$. Now $SS_R(\beta_2 \mid \beta_1)$ is independent of $MS_E$, and the null hypothesis $H_0 : \beta_2 = 0$ may be tested by the statistic
$$F_0 = \frac{SS_R(\beta_2 \mid \beta_1)/r}{MS_E} \tag{2.51}$$
If $F_0 > F_{\alpha,\,r,\,n-p-1}$, we reject $H_0$, concluding that at least one of the parameters in $\beta_2$ is not zero, and consequently at least one of the regressors $x_{p-r+1}, x_{p-r+2}, \dots, x_p$ in $X_2$ contributes significantly to the regression model.
Some authors call the above test a partial F-test because it measures the contribution of the regressors in $X_2$ given that the other regressors in $X_1$ are in the model. To illustrate the usefulness of this procedure, consider the model
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon \tag{2.52}$$
The sums of squares
$$SS_R(\beta_1 | \beta_0, \beta_2, \beta_3), \quad SS_R(\beta_2 | \beta_0, \beta_1, \beta_3) \quad \text{and} \quad SS_R(\beta_3 | \beta_0, \beta_1, \beta_2)$$
are single-degree-of-freedom sums of squares that measure the contribution of each regressor $x_j$, $j = 1, 2, 3$, to the model given that all other regressors were already in the model. That is, we are assessing the value of adding $x_j$ to a model that did not include this regressor. In general, we could find
$$SS_R(\beta_j | \beta_0, \beta_1, \cdots, \beta_{j-1}, \beta_{j+1}, \cdots, \beta_p), \quad 1 \le j \le p$$
which is the increase in the regression sum of squares due to adding $x_j$ to a model that already contains $x_1, \cdots, x_{j-1}, x_{j+1}, \cdots, x_p$. Some find it helpful to think of this as measuring the contribution of $x_j$ as if it were the last variable added to the model.
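The extra sum of squares computation behind the partial F-test can be sketched numerically. The following is only an illustration under stated assumptions: it uses NumPy, simulated data, and the hypothetical helper names `regression_ss` and `partial_f`, none of which come from the report. The critical value $F_{\alpha, r, n-p-1}$ is not computed; it would be looked up separately.

```python
import numpy as np

def regression_ss(X, y):
    """Uncorrected regression sum of squares beta_hat' X' y of a least squares fit."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(beta_hat @ X.T @ y)

def partial_f(X1, X2, y):
    """Extra sum of squares and partial F statistic for H0: beta_2 = 0."""
    X = np.hstack([X1, X2])
    n, k = X.shape                       # k = p + 1 columns (intercept sits in X1)
    r = X2.shape[1]                      # number of regressors being tested
    ss_full = regression_ss(X, y)
    ss_reduced = regression_ss(X1, y)
    extra_ss = ss_full - ss_reduced      # SS_R(beta_2 | beta_1), eq. (2.50)
    ms_e = (float(y @ y) - ss_full) / (n - k)   # residual mean square, full model
    return extra_ss, (extra_ss / r) / ms_e      # eq. (2.51)

# Simulated data (an assumption for illustration): only x1 truly matters.
rng = np.random.default_rng(0)
n = 40
x = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=n)
X1 = np.column_stack([np.ones(n), x[:, 0]])   # intercept and x1
X2 = x[:, 1:]                                 # x2 and x3, the tested subset
extra_ss, f0 = partial_f(X1, X2, y)
print(f"extra SS = {extra_ss:.4f}, F0 = {f0:.4f}")
```

The statistic `f0` would then be compared with the tabulated $F_{\alpha, r, n-p-1}$ value, here with $r = 2$ and $n - p - 1 = 36$.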
2.7.3 Special Cases of Orthogonal Columns in X
Consider the model (2.47),
$$\underline{y} = X\underline{\beta} + \underline{\varepsilon} = X_1\underline{\beta}_1 + X_2\underline{\beta}_2 + \underline{\varepsilon}.$$
The extra sum of squares method allows us to measure the effect of the regressors in $X_2$ conditional on those in $X_1$ by computing $SS_R(\underline{\beta}_2 | \underline{\beta}_1)$. In general, we cannot talk about finding the sum of squares due to $\underline{\beta}_2$, $SS_R(\underline{\beta}_2)$, without accounting for the dependency of this quantity on the regressors in $X_1$. However, if the columns in $X_1$ are orthogonal to the columns in $X_2$, we can determine a sum of squares due to $\underline{\beta}_2$ that is free of any dependency on the regressors in $X_1$.
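To demonstrate this, form the normal equations $X'X\hat{\underline{\beta}} = X'\underline{y}$ for the above model. The normal equations are
$$\begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix} \begin{bmatrix} \hat{\underline{\beta}}_1 \\ \hat{\underline{\beta}}_2 \end{bmatrix} = \begin{bmatrix} X_1'\underline{y} \\ X_2'\underline{y} \end{bmatrix}$$
Now if the columns of $X_1$ are orthogonal to the columns of $X_2$, then $X_1'X_2 = 0$ and $X_2'X_1 = 0$. The normal equations become
$$X_1'X_1\hat{\underline{\beta}}_1 = X_1'\underline{y}, \qquad X_2'X_2\hat{\underline{\beta}}_2 = X_2'\underline{y}$$
with solutions
$$\hat{\underline{\beta}}_1 = (X_1'X_1)^{-1}X_1'\underline{y}, \qquad \hat{\underline{\beta}}_2 = (X_2'X_2)^{-1}X_2'\underline{y}$$
Note that the least squares estimator of $\underline{\beta}_1$ is $\hat{\underline{\beta}}_1$ regardless of whether or not $X_2$ is in the model, and the least squares estimator of $\underline{\beta}_2$ is $\hat{\underline{\beta}}_2$ regardless of whether or not $X_1$ is in the model.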
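The regression sum of squares for the full model is
$$SS_R(\underline{\beta}) = \hat{\underline{\beta}}'X'\underline{y} = \begin{bmatrix} \hat{\underline{\beta}}_1' & \hat{\underline{\beta}}_2' \end{bmatrix} \begin{bmatrix} X_1'\underline{y} \\ X_2'\underline{y} \end{bmatrix} = \hat{\underline{\beta}}_1'X_1'\underline{y} + \hat{\underline{\beta}}_2'X_2'\underline{y} = \underline{y}'X_1(X_1'X_1)^{-1}X_1'\underline{y} + \underline{y}'X_2(X_2'X_2)^{-1}X_2'\underline{y} \tag{2.53}$$
However, the normal equations form two sets, and for each set we note that
$$SS_R(\underline{\beta}_1) = \hat{\underline{\beta}}_1'X_1'\underline{y}, \qquad SS_R(\underline{\beta}_2) = \hat{\underline{\beta}}_2'X_2'\underline{y} \tag{2.54}$$
which implies
$$SS_R(\underline{\beta}) = SS_R(\underline{\beta}_1) + SS_R(\underline{\beta}_2) \tag{2.55}$$
Therefore
$$SS_R(\underline{\beta}_1 | \underline{\beta}_2) = SS_R(\underline{\beta}) - SS_R(\underline{\beta}_2) = SS_R(\underline{\beta}_1)$$
and
$$SS_R(\underline{\beta}_2 | \underline{\beta}_1) = SS_R(\underline{\beta}) - SS_R(\underline{\beta}_1) = SS_R(\underline{\beta}_2)$$
Consequently, $SS_R(\underline{\beta}_1)$ measures the contribution of the regressors in $X_1$ to the model unconditionally, and $SS_R(\underline{\beta}_2)$ measures the contribution of the regressors in $X_2$ to the model unconditionally. Because we can unambiguously determine the effect of each regressor when regressors are orthogonal, data collection experiments are often designed to have orthogonal variables.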
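The additivity of the regression sums of squares under orthogonality can be checked numerically. The sketch below is an illustration only: it assumes NumPy, builds two mutually orthogonal blocks of columns from a QR factorization of random numbers, and uses a hypothetical helper `regression_ss` that is not part of the report.

```python
import numpy as np

def regression_ss(X, y):
    """Uncorrected regression sum of squares beta_hat' X' y."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(beta_hat @ X.T @ y)

rng = np.random.default_rng(1)
n = 30
Q, _ = np.linalg.qr(rng.normal(size=(n, 4)))   # Q has orthonormal columns
X1, X2 = Q[:, :2], Q[:, 2:]                    # hence X1' X2 = 0
y = rng.normal(size=n)                         # arbitrary simulated response

ss_full = regression_ss(np.hstack([X1, X2]), y)
ss_1 = regression_ss(X1, y)
ss_2 = regression_ss(X2, y)

# Coefficients of X1 should not change when X2 enters the model.
b_full = np.linalg.lstsq(np.hstack([X1, X2]), y, rcond=None)[0]
b_1 = np.linalg.lstsq(X1, y, rcond=None)[0]

print("SS_R(full) - SS_R(1) - SS_R(2) =", ss_full - ss_1 - ss_2)
print("max change in beta_1 estimates  =", np.max(np.abs(b_full[:2] - b_1)))
```

Both printed quantities should be zero up to rounding error, in line with (2.55) and with the remark that each block's estimator does not depend on the other block.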
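2.7.4 Likelihood Ratio Test
Model:
$$\underline{y} = X\underline{\beta} + \underline{\varepsilon} \tag{2.56}$$
Assumptions: $\underline{\varepsilon} \sim N(\underline{0}, \sigma^2 I)$, i.e. the errors are i.i.d. normal. The MLEs of $\underline{\beta}$ and $\sigma^2$ are
$$\hat{\underline{\beta}} = (X'X)^{-1}X'\underline{y} \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n}(\underline{y} - X\hat{\underline{\beta}})'(\underline{y} - X\hat{\underline{\beta}}) = \frac{1}{n}\hat{\underline{e}}'\hat{\underline{e}}$$
Let us partition $\underline{\beta}$ as follows:
$$\underline{\beta} = \begin{bmatrix} \underline{\beta}_{(1)} \\ \underline{\beta}_{(2)} \end{bmatrix}$$
where $\underline{\beta}_{(1)}$ is a vector of order $q + 1$ and $\underline{\beta}_{(2)}$ is a vector of order $p - q$. Our aim is to test
$$H_0 : \underline{\beta}_{(2)} = \underline{0} \quad \text{against} \quad H_1 : \underline{\beta}_{(2)} \neq \underline{0}$$
Now write $X = [X_{(1)} : X_{(2)}]$. Then the model becomes
$$\underline{y} = X_{(1)}\underline{\beta}_{(1)} + X_{(2)}\underline{\beta}_{(2)} + \underline{\varepsilon} \tag{2.57}$$
and under $H_0$, the model becomes
$$\underline{y} = X_{(1)}\underline{\beta}_{(1)} + \underline{\varepsilon} \tag{2.58}$$
Let
$$R_0^2(X) = (\underline{y} - X\hat{\underline{\beta}})'(\underline{y} - X\hat{\underline{\beta}}) \quad \text{and} \quad R_0^2(X_{(1)}) = (\underline{y} - X_{(1)}\hat{\underline{\beta}}_{(1)})'(\underline{y} - X_{(1)}\hat{\underline{\beta}}_{(1)})$$
where $\hat{\underline{\beta}}_{(1)} = (X_{(1)}'X_{(1)})^{-1}X_{(1)}'\underline{y}$. Also, let $D = R_0^2(X_{(1)}) - R_0^2(X)$ be the extra sum of squares.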
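Result 2.7.1 Let $\underline{y} = X\underline{\beta} + \underline{\varepsilon}$, $\underline{\varepsilon} \sim N_n(\underline{0}, \sigma^2 I_n)$ and $\underline{\beta}' = [\underline{\beta}_{(1)}' : \underline{\beta}_{(2)}']$. Then the likelihood ratio (LR) test for testing $H_0 : \underline{\beta}_{(2)} = \underline{0}$ is to reject $H_0$ if
$$F = \frac{D}{(p - q)S^2} = \frac{R_0^2(X_{(1)}) - R_0^2(X)}{(p - q)S^2} > F_{p-q,\, n-p-1}(\alpha)$$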
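Proof:
The likelihood function is
$$L(\underline{\beta}, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}(\underline{y} - X\underline{\beta})'(\underline{y} - X\underline{\beta})\right] \tag{2.59}$$
Therefore,
$$\max_{\underline{\beta}, \sigma^2} L(\underline{\beta}, \sigma^2) = L(\hat{\underline{\beta}}, \hat{\sigma}^2) = (2\pi\hat{\sigma}^2)^{-n/2} e^{-n/2} \tag{2.60}$$
Now under $H_0$,
$$\hat{\underline{\beta}}_{(1)} = (X_{(1)}'X_{(1)})^{-1}X_{(1)}'\underline{y} \quad \text{and} \quad \hat{\sigma}_{(1)}^2 = \frac{1}{n}(\underline{y} - X_{(1)}\hat{\underline{\beta}}_{(1)})'(\underline{y} - X_{(1)}\hat{\underline{\beta}}_{(1)})$$
Therefore,
$$\max_{H_0} L(\underline{\beta}, \sigma^2) = L(\hat{\underline{\beta}}_{(1)}, \hat{\sigma}_{(1)}^2) = (2\pi\hat{\sigma}_{(1)}^2)^{-n/2} e^{-n/2} \tag{2.61}$$
Hence the LR test statistic is given by
$$\lambda = \frac{\max_{H_0} L(\underline{\beta}, \sigma^2)}{\max_{\underline{\beta}, \sigma^2} L(\underline{\beta}, \sigma^2)} = \left(\frac{\hat{\sigma}_{(1)}^2}{\hat{\sigma}^2}\right)^{-n/2} = \left(1 + \frac{\hat{\sigma}_{(1)}^2 - \hat{\sigma}^2}{\hat{\sigma}^2}\right)^{-n/2}$$
Therefore, $\lambda$ is small when $(\hat{\sigma}_{(1)}^2 - \hat{\sigma}^2)/\hat{\sigma}^2$ is very large, or equivalently when
$$\frac{n\hat{\sigma}_{(1)}^2 - n\hat{\sigma}^2}{n\hat{\sigma}^2} \ge c_1, \quad \text{i.e.,} \quad \frac{R_0^2(X_{(1)}) - R_0^2(X)}{R_0^2(X)} \ge c_1$$
or
$$F = \frac{\left(R_0^2(X_{(1)}) - R_0^2(X)\right)/(p - q)}{R_0^2(X)/(n - p - 1)} > c_2$$
where $c_1, c_2$ are constants. Therefore, writing $S^2 = R_0^2(X)/(n - p - 1)$, we reject $H_0$ if $F = D/\left((p - q)S^2\right) > c_2$.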
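The proof shows that $\lambda$ is a decreasing function of $F$; explicitly, $\lambda = \left(1 + (p - q)F/(n - p - 1)\right)^{-n/2}$. A small numerical check of this relation is sketched below. It assumes NumPy and simulated data, and the helper name `rss` is hypothetical.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares R_0^2(X) of the least squares fit."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return float(resid @ resid)

# Simulated data (an assumption for illustration) generated under H0.
rng = np.random.default_rng(2)
n, p, q = 50, 4, 2
Z = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), Z])      # n x (p + 1) full design matrix
X_1 = X[:, : q + 1]                       # intercept and the first q regressors
y = X_1 @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

r_full, r_red = rss(X, y), rss(X_1, y)
lam = (r_red / r_full) ** (-n / 2)        # sigma_hat^2 = R_0^2 / n, so n cancels
s2 = r_full / (n - p - 1)
F = (r_red - r_full) / ((p - q) * s2)     # statistic of Result 2.7.1
lam_from_F = (1 + (p - q) * F / (n - p - 1)) ** (-n / 2)

print(f"lambda = {lam:.6f}, lambda from F = {lam_from_F:.6f}, F = {F:.4f}")
```

Because the two expressions agree, rejecting for small $\lambda$ is the same as rejecting for large $F$, which is why the LR test and the partial F-test coincide here.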
Chapter 3
Regression Diagnosis and Measures of Model Adequacy
Evaluating model adequacy is an important part of a multiple regression problem. This section will present several methods for measuring model adequacy. Many of these techniques are extensions of those used in simple linear regression.
The major assumptions that we have made so far in our study of regression analysis are as follows:
1. The relationship between y and X is linear, or at least it is well approximated by a straight line.
2. The error term ε has zero mean.
3. The error term εi has constant variance σ2 for all i.
4. The errors are uncorrelated.
5. The errors are normally distributed.
We should always consider the validity of these assumptions to be doubtful and conduct analyses to examine the adequacy of the model we have tentatively entertained. Gross violations of the assumptions may yield an unstable model in the sense that a different sample could lead to a totally different model with opposite conclusions. We usually cannot detect departures from the underlying assumptions by examination of the standard summary statistics, such as the t- or F-statistics or $R^2$. These are "global" model properties, and as such they do not ensure model adequacy.
In this chapter we present several methods useful for diagnosing and treating violations of the basic regression assumptions.
3.1 Residual Analysis
3.1.1 Definition of Residuals
We have defined the residuals as
$$\hat{\underline{e}} = \underline{y} - \hat{\underline{y}} = \underline{y} - X\hat{\underline{\beta}} \tag{3.1}$$
viewed as the deviation between the data and the fit. This is a measure of the variability not explained by the regression model, and departures from the underlying assumptions on the errors should show up in the residuals. Analysis of residuals is an effective method for discovering several types of model deficiencies.
Properties
1. $E(\hat{e}_i) = 0$ for all $i$.
2. The approximate average variance of the residuals is estimated by
$$\frac{\hat{\underline{e}}'\hat{\underline{e}}}{n - p - 1} = \frac{\sum_{i=1}^{n} \hat{e}_i^2}{n - p - 1} = \frac{SS_E}{n - p - 1} = MS_E,$$
where the $\hat{e}_i$ are not independent (as we shall see later in this section).
3. Sometimes it is useful to work with the standardized residuals
$$d_i = \frac{\hat{e}_i}{\sqrt{MS_E}}, \quad i = 1, 2, \cdots, n \tag{3.2}$$
where $E(d_i) = 0$ and $\mathrm{var}(d_i) \approx 1$.
The above equation scales the residuals by dividing them by their average standard deviation. In some regression data sets the residuals may have standard deviations that differ greatly. In simple linear regression,
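$$V(\hat{e}_i) = V(y_i - \hat{y}_i) = V(y_i) + V(\hat{y}_i) - 2\,\mathrm{Cov}(y_i, \hat{y}_i) = \sigma^2 + \sigma^2\left[\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right] - 2\,\mathrm{Cov}(y_i, \hat{y}_i)$$
Now we can show that
$$\mathrm{Cov}(y_i, \hat{y}_i) = \mathrm{Cov}\left(y_i,\ \bar{y} + \frac{S_{xy}}{S_{xx}}(x_i - \bar{x})\right) = \sigma^2\left[\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right] \neq 0$$
Substituting, $V(\hat{e}_i) = \sigma^2\left[1 - \left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right)\right]$, which depends on $x_i$. Therefore the $\hat{e}_i$ do not share a common variance, and they are not independent.
The studentized residuals are defined as
$$r_i = \frac{\hat{e}_i}{\sqrt{MS_E\left[1 - \left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right)\right]}}, \quad i = 1, 2, \cdots, n \tag{3.3}$$
Notice that in equation (3.3) the ordinary least squares residual $\hat{e}_i$ has been divided by its exact standard error, rather than by the average value as in the standardized residuals (see (3.2)). Studentized residuals are extremely useful in diagnostics. In small data sets the studentized residuals are often more appropriate than the standardized residuals because the differences in residual variances will be more dramatic.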
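3.1.2 Estimates
$$\hat{\underline{y}} = X\hat{\underline{\beta}} = X(X'X)^{-1}X'\underline{y} = H\underline{y}$$
where $\hat{\underline{\beta}} = (X'X)^{-1}X'\underline{y}$ and $H = X(X'X)^{-1}X'$ is a symmetric and idempotent matrix. We have seen that $E(\hat{\underline{\beta}}) = \underline{\beta}$ and $\mathrm{Cov}(\hat{\underline{\beta}}) = \sigma^2(X'X)^{-1}$.
3.1.3 Estimates of σ²
The residual sum of squares is
$$SS_E = \hat{\underline{e}}'\hat{\underline{e}} = (\underline{y} - X\hat{\underline{\beta}})'(\underline{y} - X\hat{\underline{\beta}}) = \underline{y}'\underline{y} - \hat{\underline{\beta}}'X'\underline{y}, \quad \text{d.f. } n - p - 1,$$
and the residual mean square is
$$MS_E = \frac{SS_E}{n - p - 1}. \tag{3.4}$$
Then $E(MS_E) = \sigma^2$, so $\hat{\sigma}^2 = MS_E$ is an unbiased estimator of $\sigma^2$.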
3.1.4 Coefficient of Multiple Determination
The coefficient of multiple determination $R^2$ is defined as
$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T} \tag{3.5}$$
It is customary to think of $R^2$ as a measure of the reduction in the variability of $\underline{y}$ obtained by using $x_1, x_2, \cdots, x_p$. As in the simple linear regression case, we must have $0 \le R^2 \le 1$. However, a large value of $R^2$ does not necessarily imply that the regression model is a good one. Adding a regressor to the model never decreases $R^2$, regardless of whether or not the additional regressor contributes to the model. Thus it is possible for models that have large values of $R^2$ to perform poorly in prediction or estimation.
The positive square root of $R^2$ is the multiple correlation coefficient between $\underline{y}$ and the set of regressor variables $x_1, x_2, \cdots, x_p$. That is, $R$ is a measure of the linear association between $\underline{y}$ and $x_1, x_2, \cdots, x_p$. We may also show that $R^2$ is the square of the correlation between $\underline{y}$ and the vector of fitted values $\hat{\underline{y}}$.
Adjusted R²
Some analysts prefer to use an adjusted $R^2$ statistic because the ordinary $R^2$ defined above never decreases when a new term is added to the regression model. We shall see that in variable selection and model building procedures, it will be helpful to have a procedure that can guard against overfitting the model, that is, adding terms that are unnecessary. The adjusted $R^2$ penalizes the analyst who includes unnecessary variables in the model. We define the adjusted $R^2$, $R_a^2$, by replacing $SS_E$ and $SS_T$ in equation (3.5) by the corresponding mean squares; that is,
$$R_a^2 = 1 - \frac{SS_E/(n - p - 1)}{SS_T/(n - 1)} = 1 - \frac{n - 1}{n - p - 1}(1 - R^2) \tag{3.6}$$
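The behaviour of $R^2$ and adjusted $R^2$ when an irrelevant regressor is added can be illustrated with a short sketch. It assumes NumPy, simulated data, a hypothetical function name `r2_and_adjusted`, and the usual $n - 1$ degrees of freedom for the corrected total sum of squares.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2); X must contain the intercept column."""
    n, k = X.shape                                # k = p + 1 parameters
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ss_e = float(resid @ resid)
    ss_t = float(np.sum((y - y.mean()) ** 2))     # corrected total sum of squares
    r2 = 1 - ss_e / ss_t
    r2_adj = 1 - (ss_e / (n - k)) / (ss_t / (n - 1))
    return r2, r2_adj

# Simulated data (an assumption for illustration).
rng = np.random.default_rng(3)
n = 25
x1 = rng.normal(size=n)
noise_regressor = rng.normal(size=n)              # unrelated to y by construction
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, noise_regressor])
r2_small, adj_small = r2_and_adjusted(X_small, y)
r2_big, adj_big = r2_and_adjusted(X_big, y)

print(f"x1 only:        R2 = {r2_small:.4f}, adjusted R2 = {adj_small:.4f}")
print(f"x1 plus noise:  R2 = {r2_big:.4f}, adjusted R2 = {adj_big:.4f}")
```

The ordinary $R^2$ cannot go down when the extra column is added, whereas the adjusted value may fall, which is the guard against overfitting described above.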
3.1.5 Methods of Scaling Residuals
I. Standardized and Studentized Residuals:
We have already introduced two types of scaled residuals, the standardized residuals
$$d_i = \frac{\hat{e}_i}{\sqrt{MS_E}}, \quad i = 1, 2, \cdots, n$$
and the studentized residuals. We now give a general development of the studentized residual scaling. Recall that
$$\hat{\underline{e}} = (I - H)\underline{y} \tag{3.7}$$
Since $H$ is symmetric ($H' = H$) and idempotent ($HH = H$), the matrix $(I - H)$ is also symmetric and idempotent. Substituting $\underline{y} = X\underline{\beta} + \underline{\varepsilon}$ into the above equation yields
$$\hat{\underline{e}} = (I - H)(X\underline{\beta} + \underline{\varepsilon}) = X\underline{\beta} - HX\underline{\beta} + (I - H)\underline{\varepsilon} = (I - H)\underline{\varepsilon} \tag{3.8}$$
since $HX = X$. Thus the residuals are the same linear transformation of the observations $\underline{y}$ and the errors $\underline{\varepsilon}$.
The covariance matrix of the residuals is
$$V(\hat{\underline{e}}) = V[(I - H)\underline{\varepsilon}] = (I - H)V(\underline{\varepsilon})(I - H)' = \sigma^2(I - H) \tag{3.9}$$
since $V(\underline{\varepsilon}) = \sigma^2 I$ and $(I - H)$ is symmetric and idempotent. The matrix $(I - H)$ is generally not diagonal, so the residuals have different variances and they are correlated. The variance of the i-th residual is
$$V(\hat{e}_i) = \sigma^2(1 - h_{ii}) \tag{3.10}$$
where $h_{ii}$ is the i-th diagonal element of $H$. Since $0 \le h_{ii} \le 1$, using the residual mean square $MS_E$ to estimate the variance of the residuals actually overestimates $V(\hat{e}_i)$. Furthermore, since $h_{ii}$ is a measure of the location of the i-th point in x-space, the variance of $\hat{e}_i$ depends upon where the point $\underline{x}_i$ lies. Generally residuals at points near the center of the x-space have larger variance (poorer least squares fit) than residuals at more remote locations. Violations of the model assumptions are more likely at remote points, and these violations may be hard to detect from inspection of $\hat{e}_i$ (or $d_i$) because their residuals will usually be smaller.
Several authors (Behnken and Draper [1972], Davies and Hutton [1975], and Huber [1975]) suggest taking this inequality of variance into account when scaling the residuals. They recommend plotting the "studentized" residuals
$$r_i = \frac{\hat{e}_i}{\sqrt{MS_E(1 - h_{ii})}}, \quad i = 1, 2, \cdots, n \tag{3.11}$$
instead of $\hat{e}_i$ (or $d_i$). The studentized residuals have constant variance $V(r_i) = 1$ regardless of the location of $\underline{x}_i$ when the form of the model is correct. In many situations the variance of the residuals stabilizes, particularly for large data sets. In these cases there may be little difference between the standardized and studentized residuals. Thus standardized and studentized residuals often convey equivalent information. However, since any point with a large residual and a large $h_{ii}$ is potentially highly influential on the least squares fit, examination of the studentized residuals is generally recommended.
The covariance between $\hat{e}_i$ and $\hat{e}_j$ is
$$\mathrm{Cov}(\hat{e}_i, \hat{e}_j) = -\sigma^2 h_{ij} \tag{3.12}$$
so another approach to scaling the residuals is to transform the $n$ dependent residuals into $n - p - 1$ orthogonal functions of the errors $\underline{\varepsilon}$ (the rank of $I - H$ when the model has $p + 1$ parameters). These transformed residuals are normally and independently distributed with constant variance $\sigma^2$. Several procedures have been proposed to investigate departures from the underlying assumptions using transformed residuals. These procedures are not widely used in practice because it is difficult to make specific inferences about the transformed residuals, such as the interpretation of outliers. Furthermore, dependence between the residuals does not affect interpretation of the usual residual plots unless $p$ is large relative to $n$.
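The hat matrix, the leverages $h_{ii}$ and the two scalings $d_i$ and $r_i$ can be computed directly, as in the sketch below. It is an illustration under stated assumptions: NumPy, simulated simple linear regression data, and one deliberately remote point placed at $x = 6$.

```python
import numpy as np

# Simulated data (an assumption for illustration).
rng = np.random.default_rng(4)
n = 20
x = rng.normal(size=n)
x[0] = 6.0                                   # one remote point in x-space
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.8 * x + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix H = X (X'X)^{-1} X'
h = np.diag(H)                               # leverages h_ii
e = y - H @ y                                # ordinary residuals (I - H) y
ms_e = float(e @ e) / (n - X.shape[1])       # residual mean square
d = e / np.sqrt(ms_e)                        # standardized residuals, eq. (3.2)
r = e / np.sqrt(ms_e * (1 - h))              # studentized residuals, eq. (3.11)

# In simple linear regression h_ii reduces to 1/n + (x_i - xbar)^2 / S_xx,
# which is how (3.3) and (3.11) agree.
h_slr = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print("largest leverage at index", int(np.argmax(h)), "with h_ii =", round(float(h.max()), 3))
print("max |r_i| / |d_i| =", round(float(np.max(np.abs(r) / np.abs(d))), 3))
```

Since $1 - h_{ii} < 1$, every studentized residual is at least as large in magnitude as the corresponding standardized residual, and the gap is widest at the remote point.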
II. Prediction Error Sum of Squares Residuals:
The prediction error sum of squares (PRESS) proposed by Allen [1971b, 1974] provides a useful residual scaling. To calculate PRESS, select an observation, say i. Fit the regression model to the remaining $n - 1$ observations and use this equation to predict the withheld observation $y_i$. Denoting this predicted value by $\hat{y}_{(i)}$, we may find the prediction error for point i as $\hat{e}_{(i)} = y_i - \hat{y}_{(i)}$. The prediction error is often called the i-th PRESS residual. This procedure is repeated for each observation $i = 1, 2, \cdots, n$, producing a set of $n$ PRESS residuals $\hat{e}_{(1)}, \hat{e}_{(2)}, \cdots, \hat{e}_{(n)}$. Then the PRESS statistic is defined as the sum of squares of the $n$ PRESS residuals:
$$PRESS = \sum_{i=1}^{n} \hat{e}_{(i)}^2 = \sum_{i=1}^{n} \left(y_i - \hat{y}_{(i)}\right)^2 \tag{3.13}$$
Thus PRESS uses each possible subset of $n - 1$ observations as the estimation data set, and every observation in turn is used to form the prediction data set.
It would initially seem that calculating PRESS requires fitting $n$ different regressions. However, it is possible to calculate PRESS from the results of a single least squares fit to all $n$ observations. It turns out that the i-th PRESS residual is
$$\hat{e}_{(i)} = \frac{\hat{e}_i}{1 - h_{ii}} \tag{3.14}$$
Thus, since PRESS is just the sum of the squares of the PRESS residuals, a simple computing formula is
$$PRESS = \sum_{i=1}^{n} \left(\frac{\hat{e}_i}{1 - h_{ii}}\right)^2 \tag{3.15}$$
From (3.14) it is easy to see that the PRESS residual is just the ordinary residual weighted according to the diagonal elements $h_{ii}$ of the hat matrix. Residuals associated with points for which $h_{ii}$ is large will have large PRESS residuals. These points will generally be high influence points. Generally, a large difference between the ordinary residual and the PRESS residual will indicate a point where the model fits the data well, but a model built without that point predicts poorly.
Finally note that the variance of the i-th PRESS residual is
$$V(\hat{e}_{(i)}) = V\left(\frac{\hat{e}_i}{1 - h_{ii}}\right) = \frac{1}{(1 - h_{ii})^2}\,\sigma^2(1 - h_{ii}) = \frac{\sigma^2}{1 - h_{ii}}$$
so that a standardized PRESS residual is
$$\frac{\hat{e}_{(i)}}{\sqrt{V(\hat{e}_{(i)})}} = \frac{\hat{e}_i/(1 - h_{ii})}{\sqrt{\sigma^2/(1 - h_{ii})}} = \frac{\hat{e}_i}{\sqrt{\sigma^2(1 - h_{ii})}}$$
which, if we use $MS_E$ to estimate $\sigma^2$, is just the studentized residual discussed previously.
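The single-fit shortcut (3.14) can be compared with the brute-force definition of the PRESS residuals. The following sketch assumes NumPy and simulated data; it refits the model $n$ times only to confirm that both routes give the same numbers.

```python
import numpy as np

# Simulated data (an assumption for illustration).
rng = np.random.default_rng(5)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y                                  # ordinary residuals
press_residuals = e / (1 - np.diag(H))         # shortcut, eq. (3.14)
press = float(np.sum(press_residuals ** 2))    # eq. (3.15)

# Brute force: withhold each observation in turn and predict it.
loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo[i] = y[i] - X[i] @ b_i

print("max |shortcut - brute force| =", float(np.max(np.abs(loo - press_residuals))))
print("PRESS =", round(press, 4), " SS_E =", round(float(e @ e), 4))
```

PRESS is never smaller than $SS_E$, because each ordinary residual is divided by $1 - h_{ii} \le 1$ before squaring.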
III. R-Student:
The studentized residual $r_i$ discussed above is often considered an outlier diagnostic. It is customary to use $MS_E$ as an estimate of $\sigma^2$ in computing $r_i$. This is referred to as internal scaling of the residual because $MS_E$ is an internally generated estimate of $\sigma^2$ obtained from fitting the model to all $n$ observations. Another approach would be to use an estimate of $\sigma^2$ based on a data set with the i-th observation removed. Denote the estimate of $\sigma^2$ so obtained by $S_{(i)}^2$. We can show that
$$S_{(i)}^2 = \frac{(n - p - 1)MS_E - \hat{e}_i^2/(1 - h_{ii})}{n - p - 2} \tag{3.16}$$
Here $(n - p - 1)MS_E = SS_E$ by (3.4), and deleting one observation leaves $n - p - 2$ residual degrees of freedom. The estimate of $\sigma^2$ in (3.16) is used instead of $MS_E$ to produce an externally studentized residual, usually called R-student, given by
$$t_i = \frac{\hat{e}_i}{\sqrt{S_{(i)}^2(1 - h_{ii})}}, \quad i = 1, 2, \cdots, n \tag{3.17}$$
In many situations $t_i$ will differ little from the studentized residual $r_i$. However, if the i-th observation is influential, then $S_{(i)}^2$ can differ significantly from $MS_E$, and thus the R-student statistic will be more sensitive to this point. Furthermore, under the standard assumptions $t_i$ follows the $t_{n-p-2}$ distribution. Thus R-student offers a more formal procedure for outlier detection via hypothesis testing. Furthermore, detection of outliers needs to be considered simultaneously with the detection of influential observations.
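A sketch of the R-student computation follows. It assumes NumPy and simulated data with one planted outlier, and it writes the formulas in terms of `k`, the number of fitted parameters (intercept included), so that the deleted variance estimate has $n - k - 1$ degrees of freedom. One deleted estimate is also recomputed by actually removing the observation, as a check on the shortcut.

```python
import numpy as np

# Simulated data (an assumption for illustration).
rng = np.random.default_rng(6)
n = 18
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, 1.0]) + rng.normal(size=n)
y[3] += 5.0                                   # plant one outlier

k = X.shape[1]                                # number of fitted parameters
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
ss_e = float(e @ e)
ms_e = ss_e / (n - k)

s2_del = (ss_e - e ** 2 / (1 - h)) / (n - k - 1)   # deleted variance estimates
r = e / np.sqrt(ms_e * (1 - h))                    # internally studentized residuals
t = e / np.sqrt(s2_del * (1 - h))                  # R-student

# Direct check for observation 3: refit without it and estimate sigma^2.
keep = np.arange(n) != 3
res = y[keep] - X[keep] @ np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
s2_direct = float(res @ res) / (n - 1 - k)

print("shortcut S2_(3) =", round(float(s2_del[3]), 4), " direct =", round(s2_direct, 4))
print("r_3 =", round(float(r[3]), 3), " t_3 =", round(float(t[3]), 3))
```

The two statistics are linked by $t_i = r_i\sqrt{(n - k - 1)/(n - k - r_i^2)}$, so $t_i$ exceeds $r_i$ in magnitude exactly when $|r_i| > 1$.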
IV. Estimation of Pure Error:
This procedure involves partitioning the error (or residual) sum of squares into a sum of squares due to "pure error" and a sum of squares due to "lack of fit",
$$SS_E = SS_{PE} + SS_{LOF}$$
where $SS_{PE}$ is computed using responses at repeated observations at the same level of $\underline{x}$. This yields a model-independent estimate of $\sigma^2$. The calculation of $SS_{PE}$ requires repeated observations on $\underline{y}$ at the same set of levels of the regressor variables $x_1, x_2, \cdots, x_p$, i.e., some of the rows of the $X$ matrix must be identical. However, repeated observations do not often occur in multiple regression.
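The partition $SS_E = SS_{PE} + SS_{LOF}$ can be sketched for a small replicated design. The numbers below are made up purely for illustration (they are not data from the report), and NumPy is assumed; the responses were chosen to curve upward so that a straight line shows visible lack of fit.

```python
import numpy as np

# Made-up replicated design: three responses at each of four levels of x.
x = np.repeat([1.0, 2.0, 3.0, 4.0], 3)
y = np.array([2.1, 1.9, 2.3, 4.2, 3.8, 4.1, 8.9, 9.3, 9.0, 16.2, 15.8, 16.1])

# Residual sum of squares of the straight-line fit.
X = np.column_stack([np.ones_like(x), x])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
ss_e = float(resid @ resid)

# Pure error: variation of the responses about their own level means.
ss_pe = 0.0
for level in np.unique(x):
    group = y[x == level]
    ss_pe += float(np.sum((group - group.mean()) ** 2))

ss_lof = ss_e - ss_pe        # lack of fit: what the straight line fails to capture

print(f"SS_E = {ss_e:.4f}, SS_PE = {ss_pe:.4f}, SS_LOF = {ss_lof:.4f}")
```

With 12 observations at 4 distinct levels, $SS_{PE}$ carries $12 - 4 = 8$ degrees of freedom and $SS_{LOF}$ carries $4 - 2 = 2$, so $SS_{PE}/8$ would serve as the model-independent estimate of $\sigma^2$ in this sketch.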
3.1.6 Residual Plots
The residuals $\hat{e}_i$ from the multiple linear regression model play an important role in judging model adequacy, just as they do in simple linear regression. Specifically, we often find it instructive to plot the following:
1. Residuals on normal probability paper.
2. Residuals versus each regressor $x_j$, $j = 1, 2, \cdots, p$.
3. Residuals versus fitted values $\hat{y}_i$, $i = 1, 2, \cdots, n$.
4. Residuals in time sequence (if known).
The plots are used to detect departures from normality, outliers, inequality of variance and the wrong functional specification for a regressor. There are several other residual plots useful in multiple regression analysis, some of which are as follows:
1. Plot of residuals against regressors omitted from the model.
2. Partial residual plots (the i-th partial residual for regressor $x_j$ is $\hat{e}^*_{ij} = y_i - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_{j-1} x_{i,j-1} - \hat{\beta}_{j+1} x_{i,j+1} - \cdots - \hat{\beta}_p x_{ip}$).
3. Partial regression plots: plots of residuals from which the linear dependency of $\underline{y}$ on all regressors other than $x_j$ has been removed, against regressor $x_j$ with its linear dependency on the other regressors removed.
4. Plot of regressor $x_j$ against $x_i$ (checking multicollinearity): if two or more regressors are highly correlated, we say that multicollinearity is present in the data. Multicollinearity can seriously disturb the least squares fit and in some situations render the regression model almost useless.
We shall discuss some of them here.
Normal Probability Plot
A very simple method for checking the normality assumption is to plot the residuals on normal probability paper. This graph paper is designed so that the cumulative normal distribution will plot as a straight line.
Let $e_{[1]} < e_{[2]} < \cdots < e_{[n]}$ be the residuals ranked in increasing order. Plot $e_{[i]}$ against the cumulative probabilities $P_i = (i - \frac{1}{2})/n$, $i = 1, 2, \cdots, n$. The resulting points should lie approximately on a straight line. The straight line is usually determined visually, with emphasis on the central values (e.g., the .33 and .67 cumulative probability points) rather than the extremes. Substantial departures from a straight line indicate that the distribution is not normal.
Figure 3.1(a) displays an "idealized" normal probability plot. Notice that the points lie approximately along a straight line. Figures 3.1(b)-(e) represent other typical problems. Figure 3.1(b) shows a sharp upward and downward curve at both extremes, indicating that the tails of the distribution are too heavy for it to be considered normal. Conversely, Figure 3.1(c) shows flattening at the extremes, which is a pattern typical of samples from a distribution with thinner tails than the normal. Figures 3.1(d)-(e) exhibit patterns associated with positive and negative skew, respectively. Andrews [1970] and Gnanadesikan [1977] note that normal probability plots often exhibit no unusual behavior even if the errors $\varepsilon_i$ are not normally distributed. This problem occurs because the residuals are not a simple random sample; they are the remnants of a parameter estimation process. The residuals can be shown to be linear combinations of the model errors (the $\varepsilon_i$). Thus fitting the parameters tends to destroy the evidence of non-normality in the residuals, and consequently we cannot always rely on the normal probability plot to detect departures from normality.
A common defect that shows up on the normal probability plot is the occurrence of one or two large residuals. Sometimes this is an indication that the corresponding observations are outliers.
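The coordinates of a normal probability plot can be generated without special graph paper by pairing the ranked residuals with the normal quantiles of $P_i = (i - \frac{1}{2})/n$. The sketch below assumes NumPy and the standard library class `statistics.NormalDist`; the function name `normal_plot_points` is hypothetical, and simulated normal values stand in for the residuals.

```python
import numpy as np
from statistics import NormalDist

def normal_plot_points(residuals):
    """Return (theoretical normal quantiles, ranked residuals)."""
    e_sorted = np.sort(np.asarray(residuals, dtype=float))
    n = e_sorted.size
    probs = (np.arange(1, n + 1) - 0.5) / n          # P_i = (i - 1/2) / n
    quantiles = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return quantiles, e_sorted

# Simulated stand-in for residuals (an assumption for illustration).
rng = np.random.default_rng(7)
q, e_sorted = normal_plot_points(rng.normal(size=200))

# A rough numerical summary of straightness: correlation of the plotted points.
straightness = float(np.corrcoef(q, e_sorted)[0, 1])
print("correlation between quantiles and ranked residuals:", round(straightness, 4))
```

Plotting `e_sorted` against `q` gives the same picture as normal probability paper; a correlation close to 1 corresponds to points lying near a straight line, although the visual check described above remains the primary tool.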
Residual Plot against ŷ_i
Figure 3.2: Patterns for residual plots: (a) satisfactory; (b) funnel; (c) double bow; (d) nonlinear.
A plot of the residuals $\hat{e}_i$ (or the scaled residuals $d_i$ or $r_i$) versus the corresponding fitted values $\hat{y}_i$ is useful for detecting several common types of model inadequacies.¹ If this plot resembles Figure 3.2(a), which indicates that the residuals can be contained in a horizontal band, then there are no obvious model defects. Plots of $\hat{e}_i$ against $\hat{y}_i$ that resemble any of the patterns in Figures 3.2(b)-(d) are symptomatic of model deficiencies.
∗ ∗ ∗ ∗ ∗
¹ The residuals should be plotted against the fitted values $\hat{y}_i$ and not the observed values $y_i$ because the $\hat{e}_i$ and the $\hat{y}_i$ are uncorrelated while the $\hat{e}_i$ and the $y_i$ are usually correlated.
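The footnote's claim can be verified numerically: in a least squares fit with an intercept, the residuals have zero sample correlation with the fitted values but a positive correlation with the observed responses. The sketch assumes NumPy and simulated data.

```python
import numpy as np

# Simulated data (an assumption for illustration).
rng = np.random.default_rng(8)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=n)

y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]    # fitted values
e = y - y_hat                                       # residuals

corr_e_fitted = float(np.corrcoef(e, y_hat)[0, 1])  # zero up to rounding error
corr_e_observed = float(np.corrcoef(e, y)[0, 1])    # equals sqrt(1 - R^2)

ss_t = float(np.sum((y - y.mean()) ** 2))
print("corr(e, y_hat) =", corr_e_fitted)
print("corr(e, y)     =", round(corr_e_observed, 4),
      " sqrt(SS_E / SS_T) =", round(float(np.sqrt(e @ e / ss_t)), 4))
```

A plot of residuals against $y_i$ would therefore show an upward trend even for a perfectly adequate model, which is why the fitted values are used on the horizontal axis.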
Chapter 4
Subset Selection and Model Building
So far we have assumed that the variables that go into the regression equation were chosen in advance. Our analysis involved examining the equation to see whether the functional specification was correct and whether the underlying assumptions about the error term were valid. The analysis presupposed that the regressor variables included in the model are known to be influential. However, in most practical applications the analyst has a pool of candidate regressors that should include all the influential factors, but the actual subset of regressors that should be used in the model needs to be determined. Finding an appropriate subset of regressors for the model is called the variable selection problem.
Building a regression model that includes only a subset of the available regressors involves two conflicting objectives.
1. We would like the model to include as many regressors as possible so that the "information content" in these factors can influence the predicted value of y.
2. We want the model to include as few regressors as possible because the variance of the prediction ŷ increases as the number of regressors increases. Also, the more regressors there are in a model, the greater the costs of data collection and model maintenance.
The process of finding a model that is a compromise between these two objectives is called selecting the "best" regression equation. Unfortunately there is no unique definition of best. Furthermore there are several algorithms that can be used for variable selection, and these procedures frequently specify different subsets of the candidate regressors as best.
The variable selection problem is often discussed in an idealized setting. It is usually assumed that the correct functional specification of the regressors is known, and that no outliers or influential observations are present. In practice, these assumptions are rarely met. Investigation of model adequacy is linked to the variable selection problem. Although ideally these problems should be solved simultaneously, an iterative approach is often employed, in which
1. a particular variable selection strategy is employed and then
2. the resulting subset model is checked for correct functional specification, outliers, influential observations.
Several iterations may be required to produce an adequate model.
4.1 Model Building Or Formulation Of The Problem
Suppose we have a response variable $Y$ and $q$ predictor variables $X_1, X_2, \cdots, X_q$. A linear model that represents $Y$ in terms of the $q$ variables is
$$y_i = \beta_0 + \sum_{j=1}^{q} \beta_j x_{ij} + \varepsilon_i \tag{4.1}$$
where the $\beta_j$ are parameters and $\varepsilon_i$ represents random disturbances. Instead of dealing with the full set of variables (particularly when $q$ is large), we might delete a number of variables and construct an equation with a subset of variables. This chapter is concerned with determining which variables are to be retained in the equation. Let us denote the set of variables retained by $X_1, X_2, \cdots, X_p$ and those deleted by $X_{p+1}, X_{p+2}, \cdots, X_q$. Let us examine the effect of variable deletion under two general conditions:
1. The model that connects $Y$ to the $X$'s has all $\beta$'s ($\beta_0, \beta_1, \cdots, \beta_q$) nonzero.
2. The model has $\beta_0, \beta_1, \cdots, \beta_p$ nonzero, but $\beta_{p+1}, \beta_{p+2}, \cdots, \beta_q$ zero.
Suppose that instead of fitting (4.1) we fit the subset model
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i \tag{4.2}$$
We will examine the effect of deletion of variables on the estimates of parameters and the predicted values of $Y$. The solution to the problem of variable selection becomes a little clearer once the effects of retaining unessential variables or deleting essential variables in an equation are known.
4.2 Consequences Of Variable Deletion
To provide motivation for variable selection, we will briefly review the consequences of incorrect model specification. Assume that there are $K$ candidate regressors $x_1, x_2, \cdots, x_K$ and $n \ge K + 1$ observations on these regressors and the response $y$. The full model, containing all $K$ regressors, is
$$y_i = \beta_0 + \sum_{j=1}^{K} \beta_j x_{ij} + \varepsilon_i, \quad i = 1, 2, \cdots, n \tag{4.3}$$
or equivalently
$$\underline{y} = X\underline{\beta} + \underline{\varepsilon} \tag{4.4}$$
We assume that the list of candidate regressors contains all the influential variables and all equations include an intercept term. Let r be the number of regressors that are deleted from (4.4). Then the number of variables that are retained is p = K + 1 − r. Since the intercept is included, the subset model contains p − 1 = K − r of the original regressors.
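The model (4.4) can be written as
$$\underline{y} = X_p\underline{\beta}_p + X_r\underline{\beta}_r + \underline{\varepsilon} \tag{4.5}$$
where the $X$ matrix has been partitioned into $X_p$, an $n \times p$ matrix whose columns represent the intercept and the $p - 1$ regressors to be retained in the subset model, and $X_r$, an $n \times r$ matrix whose columns represent the regressors to be deleted from the full model. Let $\underline{\beta}$ be partitioned conformably into $\underline{\beta}_p$ and $\underline{\beta}_r$. For the full model the least squares estimate of $\underline{\beta}$ is
$$\hat{\underline{\beta}}^* = (X'X)^{-1}X'\underline{y} \tag{4.6}$$
and an estimate of the residual variance $\sigma^2$ is
$$\hat{\sigma}_*^2 = \frac{\underline{y}'\underline{y} - \hat{\underline{\beta}}^{*\prime}X'\underline{y}}{n - K - 1} = \frac{\underline{y}'\left[I - X(X'X)^{-1}X'\right]\underline{y}}{n - K - 1} \tag{4.7}$$
The components of $\hat{\underline{\beta}}^*$ are denoted by $\hat{\underline{\beta}}_p^*$ and $\hat{\underline{\beta}}_r^*$, and $\hat{y}_i^*$ are the fitted values. For the subset model
$$\underline{y} = X_p\underline{\beta}_p + \underline{\varepsilon} \tag{4.8}$$
the least squares estimate of $\underline{\beta}_p$ is
$$\hat{\underline{\beta}}_p = (X_p'X_p)^{-1}X_p'\underline{y} \tag{4.9}$$
the estimate of the residual variance is
$$\hat{\sigma}^2 = \frac{\underline{y}'\underline{y} - \hat{\underline{\beta}}_p'X_p'\underline{y}}{n - p} = \frac{\underline{y}'\left[I - X_p(X_p'X_p)^{-1}X_p'\right]\underline{y}}{n - p} \tag{4.10}$$
and the fitted values are $\hat{y}_i$.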
4.2.1 Properties of β̂_p
The properties of the estimates $\hat{\underline{\beta}}_p$ and $\hat{\sigma}^2$ from the subset model have been investigated. The results are summarized as follows.
1. Bias in $\hat{\underline{\beta}}_p$
$$E(\hat{\underline{\beta}}_p) = \underline{\beta}_p + (X_p'X_p)^{-1}X_p'X_r\underline{\beta}_r = \underline{\beta}_p + A\underline{\beta}_r$$
where $A = (X_p'X_p)^{-1}X_p'X_r$. Thus $\hat{\underline{\beta}}_p$ is a biased estimate of $\underline{\beta}_p$ unless the regression coefficients corresponding to the deleted variables ($\underline{\beta}_r$) are zero or the retained variables are orthogonal to the deleted variables ($X_p'X_r = 0$).
2. Variance of $\hat{\underline{\beta}}_p$
The variances of $\hat{\underline{\beta}}_p$ and $\hat{\underline{\beta}}^*$ are $V(\hat{\underline{\beta}}_p) = \sigma^2(X_p'X_p)^{-1}$ and $V(\hat{\underline{\beta}}^*) = \sigma^2(X'X)^{-1}$, respectively. Also the matrix $V(\hat{\underline{\beta}}_p^*) - V(\hat{\underline{\beta}}_p)$ is positive semidefinite; that is, the variances of the least squares estimates of the parameters in the full model are greater than or equal to the variances of the corresponding parameters in the subset model. Consequently deleting variables never increases the variances of the estimates of the remaining parameters.
3. Precision of the Parameter Estimates
Since $\hat{\underline{\beta}}_p$ is a biased estimate of $\underline{\beta}_p$ and $\hat{\underline{\beta}}_p^*$ is not, it is more reasonable to compare the precision of the parameter estimates from the full and subset models in terms of mean square error. The mean square error of $\hat{\underline{\beta}}_p$ is
$$MSE(\hat{\underline{\beta}}_p) = \sigma^2(X_p'X_p)^{-1} + A\underline{\beta}_r\underline{\beta}_r'A'$$
If the matrix $V(\hat{\underline{\beta}}_r^*) - \underline{\beta}_r\underline{\beta}_r'$ is positive semidefinite, the matrix $V(\hat{\underline{\beta}}_p^*) - MSE(\hat{\underline{\beta}}_p)$ is positive semidefinite. This means that the least squares estimates of the parameters in the subset model have smaller mean square error than the corresponding parameter estimates from the full model when the deleted variables have regression coefficients that are smaller than the standard errors of their estimates in the full model.
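The bias and variance results in items 1 and 2 can be illustrated numerically. The sketch below assumes NumPy and a simulated design with one retained regressor and one deleted regressor that is correlated with it. A noise-free response is used so that the fitted subset coefficients equal their expectations exactly; the variance comparison is made on the factors multiplying $\sigma^2$.

```python
import numpy as np

# Simulated design (an assumption for illustration).
rng = np.random.default_rng(9)
n = 40
X_p = np.column_stack([np.ones(n), rng.normal(size=n)])        # retained: intercept, x1
X_r = (0.7 * X_p[:, 1] + rng.normal(size=n)).reshape(-1, 1)    # deleted, correlated with x1
beta_p = np.array([1.0, 2.0])
beta_r = np.array([3.0])

# Noise-free response, so least squares returns E(beta_hat_p) exactly.
y_mean = X_p @ beta_p + X_r @ beta_r

A = np.linalg.solve(X_p.T @ X_p, X_p.T @ X_r)      # A = (X_p'X_p)^{-1} X_p'X_r
b_subset = np.linalg.solve(X_p.T @ X_p, X_p.T @ y_mean)

# Variance factors (multiples of sigma^2) for the retained coefficients.
X = np.hstack([X_p, X_r])
v_full = np.linalg.inv(X.T @ X)[:2, :2]            # block of (X'X)^{-1} for beta_p
v_subset = np.linalg.inv(X_p.T @ X_p)
eigenvalues = np.linalg.eigvalsh(v_full - v_subset)

print("subset estimate      :", b_subset)
print("beta_p + A beta_r    :", beta_p + A @ beta_r)
print("eigenvalues of V(full) - V(subset), in units of sigma^2:", eigenvalues)
```

The first two printed vectors coincide, showing the bias $A\underline{\beta}_r$, and the eigenvalues are nonnegative up to rounding, showing that deleting the regressor does not increase the variances of the retained estimates.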
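4. Precision in Prediction
Suppose we wish to predict the response at the point $\underline{x}' = [\underline{x}_p', \underline{x}_r']$. If we use the full model, the predicted value is $\hat{y}^* = \underline{x}'\hat{\underline{\beta}}^*$, with mean $\underline{x}'\underline{\beta}$ and prediction variance
$$V(\hat{y}^*) = \sigma^2\left[1 + \underline{x}'(X'X)^{-1}\underline{x}\right]$$
However, if the subset model is used, $\hat{y} = \underline{x}_p'\hat{\underline{\beta}}_p$, with mean
$$E(\hat{y}) = \underline{x}_p'\underline{\beta}_p + \underline{x}_p'A\underline{\beta}_r$$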