Residual Plots - Testing of Hypothesis In Multiple Linear Regression


3.1.6 Residual Plots

The residuals e_i from the multiple linear regression model play an important role in judging model adequacy, just as they do in simple linear regression. Specifically, we often find it instructive to plot the following:

1. Residuals on Normal Probability paper.

2. Residuals versus each regressor x_j, j = 1, 2, · · · , p.

3. Residuals versus fitted values ˆy_i, i = 1, 2, · · · , n.

4. Residuals in time sequence (if known).

These plots are used to detect departures from normality, outliers, inequality of variance, and the wrong functional specification for a regressor. Several other residual plots are also useful in multiple regression analysis, including the following:

1. Plot of residuals against regressors omitted from the model.

2. Partial residual plots (the i-th partial residual for regressor x_j is $\hat{e}^*_{ij} = y_i - \hat\beta_1 x_{i1} - \cdots - \hat\beta_{j-1} x_{i,j-1} - \hat\beta_{j+1} x_{i,j+1} - \cdots - \hat\beta_p x_{ip}$).

3. Partial regression plots: plots of the residuals from which the linear dependency of y on all regressors other than x_j has been removed, against regressor x_j with its linear dependency on the other regressors removed.

4. Plot of regressor x_j against x_i (checking for multicollinearity): if two or more regressors are highly correlated, we say that multicollinearity is present in the data. Multicollinearity can seriously disturb the least squares fit and in some situations render the regression model almost useless.

We shall discuss some of these plots here.
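The quantities behind plots 2 to 4 can be computed directly. The following is a minimal sketch in plain Python, not a definitive implementation: the data, the helper names (`solve`, `ols`, `residuals`, `pearson`) and the choice to include an intercept column in the design matrix and treat it like any other regressor are all assumptions made for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)


def residuals(X, y):
    """Return (coefficients, residuals) of the least-squares fit of y on X."""
    beta = ols(X, y)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return beta, [yi - fi for yi, fi in zip(y, fitted)]


def pearson(u, v):
    """Sample correlation coefficient between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


# Hypothetical data: two regressors and a response.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
y = [4.3, 4.6, 8.9, 9.2, 13.6, 13.9, 17.7, 18.9]
X = [[1.0, a, b] for a, b in zip(x1, x2)]  # column 0 is the intercept

j = 1  # examine regressor x1
beta, e = residuals(X, y)

# Partial residuals: y minus the fitted contribution of every column except j.
partial_resid = [
    yi - sum(beta[k] * row[k] for k in range(len(beta)) if k != j)
    for row, yi in zip(X, y)
]

# Partial regression: residuals of y and of x_j after regressing each on the others.
others = [[v for k, v in enumerate(row) if k != j] for row in X]
_, e_y = residuals(others, y)
_, e_x = residuals(others, [row[j] for row in X])
slope = sum(a * b for a, b in zip(e_x, e_y)) / sum(a * a for a in e_x)

# Pairwise correlation of the regressors as a simple multicollinearity check.
r12 = pearson(x1, x2)

print("beta_j:", beta[j], "partial-regression slope:", slope)
print("corr(x1, x2):", r12)
```

Plotting `partial_resid` against `x1` gives the partial residual plot, and plotting `e_y` against `e_x` gives the partial regression plot. The least-squares line through the origin in the latter has slope equal to the coefficient of x_j in the full model, which is why the plot isolates the marginal contribution of that regressor.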

Normal Probability Plot

A very simple method for checking the normality assumption is to plot the residuals on normal probability paper. This graph paper is designed so that the cumulative normal distribution plots as a straight line.

Let e_[1] < e_[2] < · · · < e_[n] be the residuals ranked in increasing order. Plot e_[i] against the cumulative probabilities P_i = (i − 1/2)/n, i = 1, 2, · · · , n. The resulting points should lie approximately on a straight line. The straight line is usually determined visually, with emphasis on the central values (e.g., the 0.33 and 0.67 cumulative probability points) rather than the extremes. Substantial departures from a straight line indicate that the distribution is not normal.
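Without probability paper, the same plot can be obtained by plotting e_[i] against the standard normal quantile of P_i on ordinary axes. The sketch below uses only the Python standard library; the function name `normal_plot_points` and the residual values are hypothetical and chosen for illustration.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Return (P_i, z_i, e_[i]) triples, where z_i is the N(0, 1) quantile of P_i."""
    ordered = sorted(residuals)          # e_[1] <= e_[2] <= ... <= e_[n]
    n = len(ordered)
    std_normal = NormalDist()            # mean 0, standard deviation 1
    points = []
    for i, e in enumerate(ordered, start=1):
        p = (i - 0.5) / n                # cumulative probability P_i
        points.append((p, std_normal.inv_cdf(p), e))
    return points


# Hypothetical residuals from a fitted model.
residuals = [0.8, -1.1, 0.1, -0.3, 1.6, -0.6, 0.4, -0.9, 0.0]
points = normal_plot_points(residuals)
for p, z, e in points:
    print(f"P = {p:.3f}   z = {z:+.3f}   e = {e:+.2f}")
```

If the residuals are roughly normal, the pairs (z_i, e_[i]) fall close to a straight line.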

Figure 3.1(a) displays an "idealized" normal probability plot. Notice that the points lie approximately along a straight line. Figures 3.1(b)-(e) represent other typical problems. Figure 3.1(b) shows a sharp upward and downward curve at both extremes, indicating that the tails of the distribution are too heavy for it to be considered normal. Conversely, Figure 3.1(c) shows flattening at the extremes, which is a pattern typical of samples from a distribution with thinner tails than the normal. Figures 3.1(d)-(e) exhibit patterns associated with positive and negative skew, respectively.

Figure 3.1: Normal probability plots

Andrews [1970] and Gnanadesikan [1977] note that normal probability plots often exhibit no unusual behavior even if the errors ε_i are not normally distributed. This problem occurs because the residuals are not a simple random sample; they are the remnants of a parameter estimation process. The residuals can be shown to be linear combinations of the model errors (the ε_i). Thus fitting the parameters tends to destroy the evidence of non-normality in the residuals, and consequently we cannot always rely on the normal probability plot to detect departures from normality.
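The statement that the residuals are linear combinations of the errors can be written out with the hat matrix. This is a sketch under the usual assumptions that the model y = Xβ + ε is correctly specified and X has full column rank; the symbols H and h_ik are notation introduced here for illustration.

```latex
\begin{aligned}
H &= X (X^{\top} X)^{-1} X^{\top}, \qquad HX = X, \\
e &= y - \hat{y} = (I - H)\, y = (I - H)(X\beta + \varepsilon) = (I - H)\,\varepsilon, \\
e_i &= \varepsilon_i - \sum_{k=1}^{n} h_{ik}\, \varepsilon_k .
\end{aligned}
```

Each residual is therefore a weighted combination of all n errors, which is why the residuals can look closer to normal than the errors themselves.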

A common defect that shows up on the normal probability plot is the occurrence of one or two large residuals. Sometimes this is an indication that the corresponding observations are outliers.

Residual Plot against ˆy_i

Figure 3.2: Patterns for residual plots: (a) satisfactory; (b) funnel; (c) double bow; (d) nonlinear.

A plot of the residuals e_i (or the scaled residuals d_i or r_i) versus the corresponding fitted values ˆy_i is useful for detecting several common types of model inadequacies.¹ If this plot resembles Figure 3.2(a), which indicates that the residuals can be contained in a horizontal band, then there are no obvious model defects. Plots of e_i against ˆy_i that resemble any of the patterns in Figures 3.2(b)-(d) are symptomatic of model deficiencies.
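The claim in footnote 1, that the residuals are uncorrelated with the fitted values but not with the observed values, can be checked with a short calculation. This is a sketch assuming uncorrelated errors with constant variance σ², so that Var(y) = σ²I; H denotes the hat matrix (notation introduced here), which is symmetric and idempotent.

```latex
\begin{aligned}
H &= X (X^{\top} X)^{-1} X^{\top}, \qquad \hat{y} = H y, \qquad e = (I - H)\, y, \\
\operatorname{Cov}(e, \hat{y}) &= (I - H)\, \sigma^2 I \, H^{\top} = \sigma^2 (H - H^2) = 0, \\
\operatorname{Cov}(e, y) &= (I - H)\, \sigma^2 I = \sigma^2 (I - H) \neq 0 \text{ in general}.
\end{aligned}
```

A plot of e_i against y_i would therefore tend to show a trend even when the model is adequate, whereas a trend in the plot against ˆy_i points to a genuine defect.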

∗ ∗ ∗ ∗ ∗

¹ The residuals should be plotted against the fitted values ˆy_i and not the observed values y_i because the e_i and the ˆy_i are uncorrelated, while the e_i and the y_i are usually correlated.

Chapter 4 Subset Selection and Model Building

So far we have assumed that the variables that go into the regression equation were chosen in advance. Our analysis involved examining the equation to see whether the functional specification was correct and whether the underlying assumptions about the error term were valid. The analysis presupposed that the regressor variables included in the model are known to be influential. However, in most practical applications the analyst has a pool of candidate regressors that should include all the influential factors, but the actual subset of regressors that should be used in the model needs to be determined. Finding an appropriate subset of regressors for the model is called the variable selection problem.

Building a regression model that includes only a subset of the available regressors involves two conflicting objectives.

1. We would like the model to include as many regressors as possible so that the "information content" in these factors can influence the predicted value of y.

2. We want the model to include as few regressors as possible because the variance of the prediction ˆy increases as the number of regressors increases. Also, the more regressors there are in a model, the greater the costs of data collection and model maintenance.
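The variance cost in the second objective can be quantified at the observed design points. This is a sketch, assuming uncorrelated errors with constant variance σ² and a model with k estimated parameters (intercept included) whose n × k design matrix X has full column rank; k and H are notation introduced here for illustration.

```latex
\begin{aligned}
\operatorname{Var}(\hat{y}) &= \sigma^2 H, \qquad H = X (X^{\top} X)^{-1} X^{\top}, \\
\frac{1}{n} \sum_{i=1}^{n} \operatorname{Var}(\hat{y}_i)
  &= \frac{\sigma^2}{n} \operatorname{tr}(H) = \frac{k\, \sigma^2}{n}.
\end{aligned}
```

With n fixed, each added regressor raises the average variance of the fitted values by σ²/n, whether or not that regressor carries information about y.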

The process of finding a model that is a compromise between these two objectives is called selecting the "best" regression equation. Unfortunately, there is no unique definition of "best". Furthermore, there are several algorithms that can be used for variable selection, and these procedures frequently specify different subsets of the candidate regressors as best.

The variable selection problem is often discussed in an idealized setting. It is usually assumed that the correct functional specification of the regressors is known, and that no outliers or influential observations are present. In practice, these assumptions are rarely met. Investigation of model adequacy is linked to the variable selection problem.

Although ideally these problems should be solved simultaneously, an iterative approach is often employed, in which

1. a particular variable selection strategy is employed and then

2. the resulting subset model is checked for correct functional specification, outliers, and influential observations.

Several iterations may be required to produce an adequate model.
