Methods of Scaling Residuals - Testing of Hypothesis In Multiple Linear Regression


Adjusted R²

Some analysts prefer to use an adjusted R² statistic because the ordinary R² defined above never decreases when a new term is added to the regression model. We shall see that in variable selection and model building procedures it is helpful to have a procedure that guards against overfitting the model, that is, adding terms that are unnecessary. The adjusted R² penalizes the analyst who includes unnecessary variables in the model.

We define the adjusted R², R²_a, by replacing SS_E and SS_T in equation (3.5) by the corresponding mean squares; that is,

$$R_a^2 = 1 - \frac{SS_E/(n-p-1)}{SS_T/(n-1)} = 1 - \frac{n-1}{n-p-1}\,(1 - R^2) \tag{3.6}$$
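As an illustration, the following is a minimal sketch of how R² and the adjusted R² could be computed with NumPy. It assumes a model with an intercept and p regressors, and it takes the total mean square as SS_T/(n − 1), the usual convention. The function name `r2_and_adjusted` and the simulated data are hypothetical and serve only to show that adding an unrelated regressor cannot lower the ordinary R².

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R2, adjusted R2) for a least squares fit with an intercept.

    X is an (n, p) array of regressors WITHOUT the intercept column
    (an assumption of this sketch).
    """
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # least squares estimate
    resid = y - Xd @ beta
    ss_e = float(resid @ resid)                    # residual sum of squares
    ss_t = float(np.sum((y - y.mean()) ** 2))      # total sum of squares
    r2 = 1.0 - ss_e / ss_t
    r2_adj = 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)
    return r2, r2_adj

# Hypothetical simulated data: y depends on x1 only.
rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                          # regressor unrelated to y
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

r2_small, adj_small = r2_and_adjusted(x1[:, None], y)
r2_big, adj_big = r2_and_adjusted(np.column_stack([x1, junk]), y)

print(f"one regressor : R2={r2_small:.4f}, adjusted R2={adj_small:.4f}")
print(f"two regressors: R2={r2_big:.4f}, adjusted R2={adj_big:.4f}")
```

The ordinary R² of the larger model is never below that of the smaller one, whereas the adjusted R² may go down when the extra term contributes little.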

3.1.5 Methods of Scaling Residuals

I. Standardized and Studentized Residuals:

We have already introduced two types of scaled residuals, the standardized residuals

$$d_i = \frac{\hat e_i}{\sqrt{MS_E}}, \quad i = 1, 2, \ldots, n$$

and the studentized residuals. We now give a general development of the studentized residual scaling. Recall that

$$\hat{\mathbf e} = (\mathbf I - \mathbf H)\,\mathbf y \tag{3.7}$$

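The matrix H is symmetric (H′ = H) and idempotent (HH = H), and so is the matrix (I − H). Substituting $\mathbf y = \mathbf X\boldsymbol\beta + \boldsymbol\varepsilon$ into the above equation yields

$$\hat{\mathbf e} = (\mathbf I - \mathbf H)(\mathbf X\boldsymbol\beta + \boldsymbol\varepsilon) = \mathbf X\boldsymbol\beta - \mathbf H\mathbf X\boldsymbol\beta + (\mathbf I - \mathbf H)\boldsymbol\varepsilon = (\mathbf I - \mathbf H)\boldsymbol\varepsilon \tag{3.8}$$

where the last step uses HX = X. Thus the residuals are the same linear transformation of the observations $\mathbf y$ and of the errors $\boldsymbol\varepsilon$.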

The covariance matrix of the residuals is

$$V(\hat{\mathbf e}) = V[(\mathbf I - \mathbf H)\boldsymbol\varepsilon] = (\mathbf I - \mathbf H)\,V(\boldsymbol\varepsilon)\,(\mathbf I - \mathbf H)' = \sigma^2(\mathbf I - \mathbf H) \tag{3.9}$$

since V(ε) = σ²I and (I − H) is symmetric and idempotent. The matrix (I − H) is generally not diagonal, so the residuals have different variances and are correlated.

The variance of the i-th residual is

$$V(\hat e_i) = \sigma^2(1 - h_{ii}) \tag{3.10}$$

where h_ii is the i-th diagonal element of H. Since 0 ≤ h_ii ≤ 1, using the residual mean square MS_E to estimate the variance of the residuals actually overestimates V(ê_i).

Furthermore, since h_ii is a measure of the location of the i-th point in x-space, the variance of ê_i depends upon where the point x_i lies. Generally, residuals at points near the center of the x-space have larger variance (poorer least squares fit) than residuals at more remote locations. Violations of the model assumptions are more likely at remote points, and these violations may be hard to detect from inspection of ê_i (or d_i) because their residuals will usually be smaller.
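The properties used above can be checked numerically. The sketch below builds a hat matrix for a small simulated design (the data, seed and variable names are hypothetical) and verifies that (I − H) is symmetric and idempotent, that applying it to y and to ε gives the same residuals, and that the point farthest from the center of the x-space has the smallest residual variance factor 1 − h_ii. It assumes a model with an intercept, for which h_ii equals 1/n plus a squared distance of x_i from the centroid.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12
Z = rng.normal(size=(n, 2))                 # two hypothetical regressors
X = np.column_stack([np.ones(n), Z])        # model matrix with intercept

# Hat matrix H = X (X'X)^{-1} X' and residual-forming matrix I - H
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H
h = np.diag(H)                              # leverages h_ii

# Simulated response y = X beta + eps (beta chosen arbitrarily)
beta = np.array([1.0, 2.0, -0.5])
eps = rng.normal(size=n)
y = X @ beta + eps

resid_from_y = M @ y                        # (I - H) y
resid_from_eps = M @ eps                    # (I - H) eps, same vector

# Squared distance of each point from the centroid, in the metric
# of the centered cross-product matrix
Zc = Z - Z.mean(axis=0)
d2 = np.einsum("ij,jk,ik->i", Zc, np.linalg.inv(Zc.T @ Zc), Zc)

rel_var = 1.0 - h                           # V(e_i) / sigma^2

print("symmetric :", np.allclose(M, M.T))
print("idempotent:", np.allclose(M @ M, M))
print("most remote point:", int(np.argmax(d2)),
      "smallest variance factor at:", int(np.argmin(rel_var)))
```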

Several authors (Behnken and Draper [1972], Davies and Hutton [1975], and Huber [1975]) suggest taking this inequality of variance into account when scaling the residuals. They recommend plotting the "studentized" residuals

$$r_i = \frac{\hat e_i}{\sqrt{MS_E\,(1 - h_{ii})}}, \quad i = 1, 2, \ldots, n \tag{3.11}$$

instead of ê_i (or d_i). The studentized residuals have constant variance V(r_i) = 1 regardless of the location of x_i when the form of the model is correct. In many situations the variance of the residuals stabilizes, particularly for large data sets. In these cases there may be little difference between the standardized and studentized residuals, so the two often convey equivalent information. However, since any point with a large residual and a large h_ii is potentially highly influential on the least squares fit, examination of the studentized residuals is generally recommended.
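A minimal sketch of the two scalings follows. It assumes that X already contains the intercept column and that MS_E = SS_E/(n − p), with p the number of columns of X; the function name `scaled_residuals` and the simulated data, including the deliberately remote point, are hypothetical.

```python
import numpy as np

def scaled_residuals(X, y):
    """Return ordinary (e), standardized (d) and studentized (r)
    residuals together with the leverages h_ii.

    Assumes X includes the intercept column and has full column rank.
    """
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    h = np.diag(H)
    e = y - H @ y                           # ordinary residuals
    ms_e = float(e @ e) / (n - p)           # residual mean square
    d = e / np.sqrt(ms_e)                   # standardized residuals
    r = e / np.sqrt(ms_e * (1.0 - h))       # studentized residuals (3.11)
    return e, d, r, h

# Hypothetical data with one remote point in x-space
rng = np.random.default_rng(2)
n = 20
x = rng.uniform(0.0, 10.0, size=n)
x[-1] = 30.0                                # far from the other x values
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(size=n)

e, d, r, h = scaled_residuals(X, y)
inflation = 1.0 / np.sqrt(1.0 - h)          # factor by which |r_i| exceeds |d_i|

print("leverage of remote point:", round(float(h[-1]), 3))
print("largest inflation factor at index:", int(np.argmax(inflation)))
```

For points with small h_ii the two residuals nearly coincide, while at the remote point the studentized residual is noticeably larger in magnitude than the standardized one.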

The covariance between ê_i and ê_j is

$$\mathrm{Cov}(\hat e_i, \hat e_j) = -\sigma^2 h_{ij} \tag{3.12}$$

so another approach to scaling the residuals is to transform the n dependent residuals into n − p orthogonal functions of the errors ε. These transformed residuals are normally and independently distributed with constant variance σ². Several procedures have been proposed to investigate departures from the underlying assumptions using transformed residuals. These procedures are not widely used in practice because it is difficult to make specific inferences about the transformed residuals, such as the interpretation of outliers. Furthermore, dependence between the residuals does not affect interpretation of the usual residual plots unless p is large relative to n.

II. Prediction Error Sum of Squares Residuals:

The prediction error sum of squares (PRESS) proposed by Allen [1971b, 1974] provides a useful residual scaling. To calculate PRESS, select an observation, for example i. Fit the regression model to the remaining n − 1 observations and use this equation to predict the withheld observation y_i. Denoting this predicted value by ŷ_(i), we may find the prediction error for point i as ê_(i) = y_i − ŷ_(i). This prediction error is often called the i-th PRESS residual. The procedure is repeated for each observation i = 1, 2, ..., n, producing a set of n PRESS residuals ê_(1), ê_(2), ..., ê_(n). Then the PRESS statistic is defined as the sum of squares of the n PRESS residuals:

$$PRESS = \sum_{i=1}^{n} \hat e_{(i)}^2 = \sum_{i=1}^{n} \left[\,y_i - \hat y_{(i)}\right]^2 \tag{3.13}$$

Thus PRESS uses each possible subset of n − 1 observations as the estimation data set, and every observation in turn is used to form the prediction data set.

It would initially seem that calculating PRESS requires fitting n different regressions. However, it is possible to calculate PRESS from the results of a single least squares fit to all n observations. It turns out that the i-th PRESS residual is

$$\hat e_{(i)} = \frac{\hat e_i}{1 - h_{ii}} \tag{3.14}$$

Thus, since PRESS is just the sum of the squares of the PRESS residuals, a simple computing formula is

$$PRESS = \sum_{i=1}^{n} \left(\frac{\hat e_i}{1 - h_{ii}}\right)^2 \tag{3.15}$$

From (3.14) it is easy to see that the PRESS residual is just the ordinary residual weighted according to the diagonal elements h_ii of the hat matrix. Residuals associated with points for which h_ii is large will have large PRESS residuals. These points will generally be high influence points. Generally, a large difference between the ordinary residual and the PRESS residual indicates a point where the model fits the data well, but a model built without that point predicts poorly.
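The equivalence between the leave-one-out definition and the single-fit formula (3.14) can be checked directly. The sketch below computes the PRESS residuals both ways on simulated data; the function names `press_loo` and `press_shortcut` and the data are hypothetical, and X is assumed to include the intercept column.

```python
import numpy as np

def press_loo(X, y):
    """PRESS residuals by refitting the model n times (definition)."""
    n = X.shape[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i            # drop observation i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[i] = y[i] - X[i] @ beta_i       # prediction error for point i
    return out

def press_shortcut(X, y):
    """PRESS residuals from one fit, using e_i / (1 - h_ii)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    e = y - H @ y
    return e / (1.0 - np.diag(H))

# Hypothetical simulated data
rng = np.random.default_rng(3)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

e_loo = press_loo(X, y)
e_fast = press_shortcut(X, y)
press = float(np.sum(e_fast ** 2))          # PRESS statistic (3.15)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_e = float(np.sum((y - X @ beta_full) ** 2))

print("both routes agree:", np.allclose(e_loo, e_fast))
print(f"SS_E = {ss_e:.4f}, PRESS = {press:.4f}")
```

Because each PRESS residual is the ordinary residual divided by 1 − h_ii ≤ 1, PRESS can never be smaller than SS_E.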

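Finally, note that the variance of the i-th PRESS residual is

$$V\left[\hat e_{(i)}\right] = V\left[\frac{\hat e_i}{1 - h_{ii}}\right] = \frac{1}{(1 - h_{ii})^2}\,\sigma^2(1 - h_{ii}) = \frac{\sigma^2}{1 - h_{ii}}$$

so that a standardized PRESS residual is

$$\frac{\hat e_{(i)}}{\sqrt{V\left[\hat e_{(i)}\right]}} = \frac{\hat e_i/(1 - h_{ii})}{\sqrt{\sigma^2/(1 - h_{ii})}} = \frac{\hat e_i}{\sqrt{\sigma^2(1 - h_{ii})}}$$

which, if we use MS_E to estimate σ², is just the studentized residual discussed previously.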

III. R-Student:

The studentized residual r_i discussed above is often considered an outlier diagnostic. It is customary to use MS_E as an estimate of σ² in computing r_i. This is referred to as internal scaling of the residual because MS_E is an internally generated estimate of σ² obtained from fitting the model to all n observations. Another approach would be to use an estimate of σ² based on a data set with the i-th observation removed. Denote the estimate of σ² so obtained by S²_(i). We can show that

$$S_{(i)}^2 = \frac{(n - p)\,MS_E - \hat e_i^2/(1 - h_{ii})}{n - p - 1} \tag{3.16}$$

The estimate of σ² in (3.16) is used instead of MS_E to produce an externally studentized residual, usually called R-student, given by

$$t_i = \frac{\hat e_i}{\sqrt{S_{(i)}^2\,(1 - h_{ii})}}, \quad i = 1, 2, \ldots, n \tag{3.17}$$

In many situations t_i will differ little from the studentized residual r_i. However, if the i-th observation is influential, then S²_(i) can differ significantly from MS_E, and thus the R-student statistic will be more sensitive to this point. Furthermore, under the standard assumptions t_i follows the t distribution with n − p − 1 degrees of freedom. Thus R-student offers a more formal procedure for outlier detection via hypothesis testing. The detection of outliers also needs to be considered simultaneously with the detection of influential observations.
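The following sketch computes R-student from a single fit using (3.16) and (3.17), and checks the deleted variance estimate against an explicit refit without observation i. It assumes X includes the intercept column and that p is the number of columns of X; the function names, the simulated data and the planted shift in the first response are all hypothetical.

```python
import numpy as np

def r_student(X, y):
    """Return studentized residuals r, R-student t and S^2_(i)."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)
    e = y - H @ y
    ms_e = float(e @ e) / (n - p)
    # Deleted variance estimate, equation (3.16)
    s2_del = ((n - p) * ms_e - e ** 2 / (1.0 - h)) / (n - p - 1)
    r = e / np.sqrt(ms_e * (1.0 - h))       # internal scaling
    t = e / np.sqrt(s2_del * (1.0 - h))     # external scaling, (3.17)
    return r, t, s2_del

def s2_by_refit(X, y, i):
    """Residual mean square of the fit with observation i removed."""
    keep = np.arange(len(y)) != i
    Xi, yi = X[keep], y[keep]
    beta, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
    res = yi - Xi @ beta
    return float(res @ res) / (Xi.shape[0] - Xi.shape[1])

# Hypothetical simulated data with a shifted first response
rng = np.random.default_rng(4)
n, p = 18, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 1.5, -1.0]) + rng.normal(size=n)
y[0] += 6.0                                 # planted unusual observation

r, t, s2_del = r_student(X, y)
s2_check = np.array([s2_by_refit(X, y, i) for i in range(n)])

print("formula (3.16) matches refits:", np.allclose(s2_del, s2_check))
print("r[0] =", round(float(r[0]), 3), " t[0] =", round(float(t[0]), 3))
```

Combining (3.11), (3.16) and (3.17) gives t_i² = r_i²(n − p − 1)/(n − p − r_i²), so R-student exceeds the studentized residual in magnitude exactly when |r_i| > 1, which is why it reacts more strongly to unusual points.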

IV. Estimation of Pure Error:

This procedure involves partitioning the error (or residual) sum of squares into a sum of squares due to "pure error" and a sum of squares due to "lack of fit",

$$SS_E = SS_{PE} + SS_{LOF}$$

where SS_PE is computed using responses at repeated observations at the same level of x. This leads to a model-independent estimate of σ². The calculation of SS_PE requires repeated observations on y at the same set of levels of the regressor variables x₁, x₂, ..., x_p, i.e., some of the rows of the X matrix must be the same. However, repeated observations do not often occur in multiple regression.
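To make the partition concrete, here is a minimal sketch for the simplest case of a single regressor with replicated levels and a straight-line fit. The data values and the function name `pure_error_split` are hypothetical. SS_PE is taken as the within-level sum of squares about the level means, with n − m degrees of freedom when there are m distinct levels, and SS_LOF is obtained by subtraction.

```python
import numpy as np

def pure_error_split(x, y):
    """Split SS_E of a straight-line fit into pure error and lack of fit.

    x is a 1-D regressor with replicated levels (assumption of this sketch).
    Returns (SS_E, SS_PE, SS_LOF, df_PE).
    """
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    ss_e = float(e @ e)

    # Group observations that share the same x level
    levels, inv = np.unique(x, return_inverse=True)
    counts = np.bincount(inv)
    means = np.bincount(inv, weights=y) / counts     # mean response per level
    ss_pe = float(np.sum((y - means[inv]) ** 2))     # within-level variation
    df_pe = n - len(levels)
    return ss_e, ss_pe, ss_e - ss_pe, df_pe

# Hypothetical data: five x levels, two replicates each
x = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0])
y = np.array([2.1, 1.9, 4.2, 3.8, 5.9, 6.1, 8.3, 7.7, 9.8, 10.2])

ss_e, ss_pe, ss_lof, df_pe = pure_error_split(x, y)
sigma2_pe = ss_pe / df_pe                            # model-independent estimate

print(f"SS_E={ss_e:.4f}  SS_PE={ss_pe:.4f}  SS_LOF={ss_lof:.4f}")
print(f"pure-error estimate of sigma^2: {sigma2_pe:.4f} on {df_pe} df")
```

With several regressors the same idea applies to groups of identical rows of X, which is why the approach is rarely available in multiple regression.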

In document Subset Selection in Regression Analysis (Page 30-34)