2.7 Testing of Hypothesis In Multiple Linear Regression
Adjusted R2
Some analysts prefer to use an adjusted $R^2$ statistic because the ordinary $R^2$ defined above never decreases (and usually increases) when a new term is added to the regression model. We shall see that in variable selection and model building procedures it is helpful to have a procedure that guards against overfitting the model, that is, adding terms that are unnecessary. The adjusted $R^2$ penalizes the analyst who includes unnecessary variables in the model.
We define the adjusted $R^2$, denoted $R^2_a$, by replacing SSE and SST in equation (3.5) by the corresponding mean squares; that is,
$$R^2_a = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{n-1}{n-p-1}\,(1 - R^2) \tag{3.6}$$
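The penalty can be seen numerically. The following is a minimal sketch, not part of the original notes: it assumes p counts the regressors excluding the intercept, takes the mean squares as SSE/(n − p − 1) and SST/(n − 1), and uses simulated data. The function name `adjusted_r2` and the variable names are hypothetical.

```python
import numpy as np

def adjusted_r2(y, X):
    """Return (R^2, adjusted R^2) for a least squares fit with intercept.

    X is an n x p array of regressors WITHOUT the intercept column
    (assumption: p counts regressors only).
    """
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least squares estimate
    resid = y - X1 @ beta
    sse = resid @ resid                            # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)              # corrected total sum of squares
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (sse / (n - p - 1)) / (sst / (n - 1))
    return r2, r2_adj

# Simulated data (illustrative only): y depends on x1, not on `noise`.
rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
noise = rng.normal(size=n)                         # an unnecessary regressor
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

r2_small, adj_small = adjusted_r2(y, x1[:, None])
r2_big, adj_big = adjusted_r2(y, np.column_stack([x1, noise]))
print(f"one regressor : R2={r2_small:.4f}  adjusted={adj_small:.4f}")
print(f"two regressors: R2={r2_big:.4f}  adjusted={adj_big:.4f}")
```

The ordinary $R^2$ cannot go down when the unrelated regressor is added, whereas the adjusted value may go down, which is the behaviour the adjustment is meant to expose.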
3.1.5 Methods of Scaling Residuals
I. Standardized and Studentized Residuals:
We have already introduced two types of scaled residuals: the standardized residuals
$$d_i = \frac{\hat{e}_i}{\sqrt{MSE}}, \quad i = 1, 2, \cdots, n$$
and the studentized residuals. We now give a general development of the studentized residual scaling. Recall that
$$\hat{\mathbf{e}} = (I - H)\mathbf{y} \tag{3.7}$$
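The hat matrix $H$ is symmetric ($H' = H$) and idempotent ($HH = H$). Similarly, the matrix $(I - H)$ is symmetric and idempotent. Substituting $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ into the above equation yields
$$\hat{\mathbf{e}} = (I - H)(X\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = X\boldsymbol{\beta} - HX\boldsymbol{\beta} + (I - H)\boldsymbol{\varepsilon} = (I - H)\boldsymbol{\varepsilon} \tag{3.8}$$
since $HX = X$.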
Thus the residuals are the same linear transformation of the observations $\mathbf{y}$ and of the errors $\boldsymbol{\varepsilon}$.
The covariance matrix of the residuals is
$$V(\hat{\mathbf{e}}) = V[(I - H)\boldsymbol{\varepsilon}] = (I - H)\,V(\boldsymbol{\varepsilon})\,(I - H)' = \sigma^2 (I - H) \tag{3.9}$$
since $V(\boldsymbol{\varepsilon}) = \sigma^2 I$ and $(I - H)$ is symmetric and idempotent. The matrix $(I - H)$ is generally not diagonal, so the residuals have different variances and are correlated.
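These matrix properties are easy to check numerically. The sketch below is an illustration only, assuming a simulated design matrix `X` with an intercept column and hypothetical coefficient values; it checks that $(I - H)$ is symmetric and idempotent, that it is not diagonal, and that applying it to $\mathbf{y}$ or to $\boldsymbol{\varepsilon}$ gives the same residual vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 12, 2                                        # n observations, k regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + regressors

# Hat matrix H = X (X'X)^{-1} X', computed with a linear solve instead of an inverse
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H                                   # the matrix (I - H)

beta = np.array([1.0, -2.0, 0.5])                   # hypothetical true coefficients
eps = rng.normal(size=n)                            # simulated errors
y = X @ beta + eps

e_from_y = M @ y                                    # residuals as (I - H) y
e_from_eps = M @ eps                                # residuals as (I - H) eps

print("symmetric :", np.allclose(M, M.T))
print("idempotent:", np.allclose(M @ M, M))
print("same residuals from y and eps:", np.allclose(e_from_y, e_from_eps))
print("diagonal of (I - H):", np.round(np.diag(M), 3))
```

The unequal diagonal entries printed at the end are the factors $(1 - h_{ii})$ that make the residual variances differ from point to point.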
The variance of the i-th residual is
$$V(\hat{e}_i) = \sigma^2 (1 - h_{ii}) \tag{3.10}$$
where $h_{ii}$ is the i-th diagonal element of $H$. Since $0 \le h_{ii} \le 1$, using the residual mean square $MSE$ to estimate the variance of the residuals actually overestimates $V(\hat{e}_i)$.
Furthermore, since $h_{ii}$ is a measure of the location of the i-th point in x-space, the variance of $\hat{e}_i$ depends upon where the point $\mathbf{x}_i$ lies. Generally, residuals near the center of the x-space have larger variance (poorer least squares fit) than residuals at more remote locations. Violations of the model assumptions are more likely at remote points, and these violations may be hard to detect from inspection of $\hat{e}_i$ (or $d_i$) because their residuals will usually be smaller.
Several authors (Behnken and Draper [1972], Davies and Hutton [1975], and Huber [1975]) suggest taking this inequality of variance into account when scaling the residuals. They recommend plotting the "studentized" residuals
$$r_i = \frac{\hat{e}_i}{\sqrt{MSE\,(1 - h_{ii})}}, \quad i = 1, 2, \ldots, n \tag{3.11}$$
instead of $\hat{e}_i$ (or $d_i$). The studentized residuals have constant variance $V(r_i) = 1$ regardless of the location of $\mathbf{x}_i$ when the form of the model is correct. In many situations the variance of the residuals stabilizes, particularly for large data sets. In these cases there may be little difference between the standardized and studentized residuals. Thus standardized and studentized residuals often convey equivalent information. However, since any point with a large residual and a large $h_{ii}$ is potentially highly influential on the least squares fit, examination of the studentized residuals is generally recommended.
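The two scalings can be compared on a small example. This is a minimal sketch under stated assumptions: the data are simulated, the last x value is deliberately placed far from the others to create a high-leverage point, and p here counts the columns of `X` including the intercept, so that $MSE = SSE/(n - p)$. The helper name `scaled_residuals` is hypothetical.

```python
import numpy as np

def scaled_residuals(X, y):
    """Return ordinary residuals e, leverages h, standardized d, studentized r.

    X must already contain the intercept column; p = number of columns of X.
    """
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    h = np.diag(H)                          # leverages h_ii
    e = y - H @ y                           # ordinary residuals
    mse = e @ e / (n - p)                   # residual mean square
    d = e / np.sqrt(mse)                    # standardized residuals
    r = e / np.sqrt(mse * (1.0 - h))        # studentized residuals, as in (3.11)
    return e, h, d, r

# Simulated data: nine clustered x values and one remote point (illustrative only)
rng = np.random.default_rng(2)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 25], dtype=float)
y = 1.0 + 2.0 * x + rng.normal(size=x.size)
X = np.column_stack([np.ones_like(x), x])

e, h, d, r = scaled_residuals(X, y)
for i in range(x.size):
    print(f"x={x[i]:5.1f}  h_ii={h[i]:.3f}  d_i={d[i]:7.3f}  r_i={r[i]:7.3f}")
```

Because $0 \le 1 - h_{ii} \le 1$, each $|r_i|$ is at least as large as $|d_i|$, and the gap is widest at the remote point where $h_{ii}$ is largest.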
The covariance between $\hat{e}_i$ and $\hat{e}_j$, $i \ne j$, is
$$\mathrm{Cov}(\hat{e}_i, \hat{e}_j) = -\sigma^2 h_{ij} \tag{3.12}$$
so another approach to scaling the residuals is to transform the n dependent residuals into n − p orthogonal functions of the errors $\boldsymbol{\varepsilon}$. These transformed residuals are normally and independently distributed with constant variance $\sigma^2$. Several procedures have been proposed to investigate departures from the underlying assumptions using transformed residuals. These procedures are not widely used in practice because it is difficult to make specific inferences about the transformed residuals, such as the interpretation of outliers. Furthermore, dependence between the residuals does not affect interpretation of the usual residual plots unless p is large relative to n.
II. Prediction Error Sum of Squares Residuals:
The prediction error sum of squares (PRESS) proposed by Allen [1971b, 1974] provides a useful residual scaling. To calculate PRESS, select an observation, for example i. Fit the regression model to the remaining n − 1 observations and use this equation to predict the withheld observation $y_i$. Denoting this predicted value by $\hat{y}_{(i)}$, we may find the prediction error for point i as $\hat{e}_{(i)} = y_i - \hat{y}_{(i)}$. The prediction error is often called the i-th PRESS residual. This procedure is repeated for each observation i = 1, 2, ..., n, producing a set of n PRESS residuals $\hat{e}_{(1)}, \hat{e}_{(2)}, \cdots, \hat{e}_{(n)}$. Then the PRESS statistic is defined as the sum of squares of the n PRESS residuals, as in
$$PRESS = \sum_{i=1}^{n} \hat{e}_{(i)}^2 = \sum_{i=1}^{n} \left[ y_i - \hat{y}_{(i)} \right]^2 \tag{3.13}$$
Thus PRESS uses each possible subset of n − 1 observations as the estimation data set, and every observation in turn is used to form the prediction data set.
It would initially seem that calculating PRESS requires fitting n different regressions. However, it is possible to calculate PRESS from the results of a single least squares fit to all n observations. It turns out that the i-th PRESS residual is
$$\hat{e}_{(i)} = \frac{\hat{e}_i}{1 - h_{ii}} \tag{3.14}$$
Thus, since PRESS is just the sum of the squares of the PRESS residuals, a simple computing formula is
$$PRESS = \sum_{i=1}^{n} \left( \frac{\hat{e}_i}{1 - h_{ii}} \right)^2 \tag{3.15}$$
From (3.14) it is easy to see that the PRESS residual is just the ordinary residual weighted according to the diagonal elements $h_{ii}$ of the hat matrix. Residuals associated with points for which $h_{ii}$ is large will have large PRESS residuals. These points will generally be high influence points. Generally, a large difference between the ordinary residual and the PRESS residual will indicate a point where the model fits the data well, but a model built without that point predicts poorly.
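The equivalence between the n leave-one-out fits and the single-fit formula can be verified directly. The sketch below is illustrative only: it uses simulated data, and the function names `press_shortcut` and `press_refit` are hypothetical. The first computes the PRESS residuals as $\hat{e}_i/(1 - h_{ii})$ from one fit; the second refits the model n times, each time withholding one observation.

```python
import numpy as np

def press_shortcut(X, y):
    """PRESS residuals and PRESS from a single fit, using e_i / (1 - h_ii)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    e = y - H @ y                           # ordinary residuals
    press_resid = e / (1.0 - np.diag(H))
    return press_resid, np.sum(press_resid ** 2)

def press_refit(X, y):
    """PRESS residuals and PRESS by explicitly refitting n times."""
    n = len(y)
    press_resid = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i            # drop observation i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        press_resid[i] = y[i] - X[i] @ b    # prediction error for withheld y_i
    return press_resid, np.sum(press_resid ** 2)

# Simulated data (illustrative only)
rng = np.random.default_rng(3)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

res_fast, press_fast = press_shortcut(X, y)
res_slow, press_slow = press_refit(X, y)
print("PRESS (single fit):", round(press_fast, 6))
print("PRESS (n refits)  :", round(press_slow, 6))
print("residuals agree   :", np.allclose(res_fast, res_slow))
```

The single-fit version needs only the leverages and ordinary residuals, so it is the form one would normally use in practice.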
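Finally, note that the variance of the i-th PRESS residual is
$$V[\hat{e}_{(i)}] = V\!\left[ \frac{\hat{e}_i}{1 - h_{ii}} \right] = \frac{1}{(1 - h_{ii})^2}\,\sigma^2 (1 - h_{ii}) = \frac{\sigma^2}{1 - h_{ii}}$$
so that a standardized PRESS residual is
$$\frac{\hat{e}_{(i)}}{\sqrt{V[\hat{e}_{(i)}]}} = \frac{\hat{e}_i/(1 - h_{ii})}{\sqrt{\sigma^2/(1 - h_{ii})}} = \frac{\hat{e}_i}{\sqrt{\sigma^2 (1 - h_{ii})}}$$
which, if we use $MSE$ to estimate $\sigma^2$, is just the studentized residual discussed previously.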
III. R-Student:
The studentized residual ri discussed above is often considered an outlier diagnostic.
It is customary to use $MSE$ as an estimate of $\sigma^2$ in computing $r_i$. This is referred to as internal scaling of the residual because $MSE$ is an internally generated estimate of $\sigma^2$ obtained from fitting the model to all n observations. Another approach would be to use an estimate of $\sigma^2$ based on a data set with the i-th observation removed. Denote the estimate of $\sigma^2$ so obtained by $S_{(i)}^2$. We can show that
$$S_{(i)}^2 = \frac{(n - p)\,MSE - \hat{e}_i^2/(1 - h_{ii})}{n - p - 1} \tag{3.16}$$
The estimate of $\sigma^2$ in (3.16) is used instead of $MSE$ to produce an externally studentized residual, usually called R-student, given by
$$t_i = \frac{\hat{e}_i}{\sqrt{S_{(i)}^2\,(1 - h_{ii})}}, \quad i = 1, 2, \cdots, n \tag{3.17}$$
In many situations $t_i$ will differ little from the studentized residual $r_i$. However, if the i-th observation is influential, then $S_{(i)}^2$ can differ significantly from $MSE$, and thus the R-student statistic will be more sensitive to this point. Furthermore, under the standard assumptions $t_i$ follows the $t_{n-p-1}$ distribution. Thus R-student offers a more formal procedure for outlier detection via hypothesis testing. Detection of outliers also needs to be considered simultaneously with the detection of influential observations.
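The deleted-variance formula (3.16) and the R-student statistic (3.17) can be sketched as follows. This is an illustration under stated assumptions: p is taken to be the number of columns of `X` including the intercept, so that $MSE = SSE/(n - p)$; the data are simulated, with one response shifted to mimic an outlier; and the function names `r_student` and `s2_deleted_refit` are hypothetical. The second function recomputes $S_{(i)}^2$ by actually deleting each observation, as a check on (3.16).

```python
import numpy as np

def r_student(X, y):
    """Return internally studentized r, externally studentized t, and S_(i)^2."""
    n, p = X.shape                           # p includes the intercept column
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)
    e = y - H @ y
    mse = e @ e / (n - p)
    s2_i = ((n - p) * mse - e ** 2 / (1.0 - h)) / (n - p - 1)   # equation (3.16)
    r = e / np.sqrt(mse * (1.0 - h))         # internal scaling
    t = e / np.sqrt(s2_i * (1.0 - h))        # external scaling, equation (3.17)
    return r, t, s2_i

def s2_deleted_refit(X, y):
    """S_(i)^2 obtained by refitting without observation i (check only)."""
    n, p = X.shape
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        res = y[keep] - X[keep] @ b
        out[i] = res @ res / (n - 1 - p)     # residual mean square on n - 1 points
    return out

# Simulated data with one shifted response (illustrative only)
rng = np.random.default_rng(4)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)
y[5] += 6.0                                  # artificial outlier

r, t, s2_i = r_student(X, y)
s2_check = s2_deleted_refit(X, y)
print("S_(i)^2 matches refit:", np.allclose(s2_i, s2_check))
print("observation 5: r =", round(r[5], 3), " t =", round(t[5], 3))
```

Each $t_i$ could then be compared with a quantile of the $t_{n-p-1}$ distribution; when many points are screened at once, some allowance for multiple testing would normally be made.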
IV. Estimation of Pure Error:
This procedure involves partitioning the error (or residual) sum of squares into a sum of squares due to "pure error" and a sum of squares due to "lack of fit",
$$SSE = SS_{PE} + SS_{LOF}$$
where $SS_{PE}$ is computed using responses at repeated observations at the same level of $\mathbf{x}$. Dividing $SS_{PE}$ by its degrees of freedom gives a model-independent estimate of $\sigma^2$. The calculation of $SS_{PE}$ requires repeated observations on $y$ at the same set of levels of the regressor variables $x_1, x_2, \cdots, x_p$, i.e., some of the rows of the $X$ matrix must be the same. However, repeated observations do not often occur in multiple regression.
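The partition can be illustrated with a single regressor observed at repeated levels. The sketch below is not from the original notes: the numbers are made up for illustration, and a straight-line model is assumed. $SS_{PE}$ is computed from the deviations of each response about the mean of its own x level, and $SS_{LOF}$ is obtained by subtraction.

```python
import numpy as np

# Made-up data with replicates at x = 1, 3 and 5 (illustrative only)
x = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 3.0, 4.0, 5.0, 5.0])
y = np.array([2.1, 1.9, 4.2, 5.8, 6.1, 6.3, 8.2, 9.7, 10.4])

# Straight-line least squares fit and its residual sum of squares
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta) ** 2)

# Pure error: variation of y about its own group mean at each distinct x level
levels, inverse = np.unique(x, return_inverse=True)
group_means = np.array([y[inverse == g].mean() for g in range(len(levels))])
ss_pe = np.sum((y - group_means[inverse]) ** 2)

ss_lof = sse - ss_pe                     # lack of fit, by subtraction
df_pe = len(y) - len(levels)             # n minus number of distinct levels
df_lof = len(levels) - X.shape[1]        # distinct levels minus fitted parameters

print(f"SSE    = {sse:.4f}")
print(f"SS_PE  = {ss_pe:.4f}  (df = {df_pe})")
print(f"SS_LOF = {ss_lof:.4f}  (df = {df_lof})")
print(f"pure-error estimate of sigma^2 = {ss_pe / df_pe:.4f}")
```

With several regressors the same idea applies, but the grouping would be over identical rows of $X$ rather than over a single x value.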