
Residual Plots


3.1.6 Residual Plots

The residuals $e_i$ from the multiple linear regression model play an important role in judging model adequacy, just as they do in simple linear regression. Specifically, we often find it instructive to plot the following:

1. Residuals on Normal Probability paper.

2. Residuals versus each regressor $x_j$, $j = 1, 2, \ldots, p$.

3. Residuals versus fitted values $\hat{y}_i$, $i = 1, 2, \ldots, n$.

4. Residuals in time sequence (if known).

The plots are used to detect departures from normality, outliers, inequality of variance, and the wrong functional specification for a regressor. Several other residual plots are useful in multiple regression analysis, some of which are as follows:

1. Plot of residuals against regressors omitted from the model.

2. Partial residual plots (the $i$-th partial residual for regressor $x_j$ is $\hat{e}_{ij} = y_i - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_{j-1} x_{i,j-1} - \hat{\beta}_{j+1} x_{i,j+1} - \cdots - \hat{\beta}_p x_{ip}$).

3. Partial regression plots: plots of the residuals from which the linear dependency of $y$ on all regressors other than $x_j$ has been removed, against regressor $x_j$ with its linear dependency on the other regressors removed.

4. Plot of regressor $x_j$ against $x_i$ (checking multicollinearity): If two or more regressors are highly correlated, we say that multicollinearity is present in the data. Multicollinearity can seriously disturb the least squares fit and in some situations render the regression model almost useless.

We shall discuss some of these plots here.
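The partial residual, partial regression and multicollinearity checks listed above can be sketched numerically. The following is a minimal illustration on simulated data, not the notes' own procedure: the variable names, the simulated coefficients and the use of NumPy are all assumptions, and the intercept is treated as one of the "other" terms removed in the partial residual.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from a least squares fit of y on X (X already holds an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Simulated data (assumption): two correlated regressors and a linear response.
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
resid_full = y - X_full @ beta_full

# Partial residuals for x2: y minus every fitted term except the x2 term.
partial_resid = y - beta_full[0] * ones - beta_full[1] * x1

# Partial regression plot for x2: both axes have the other regressors removed.
X_other = np.column_stack([ones, x1])
y_given_others = ols_residuals(X_other, y)     # vertical axis
x2_given_others = ols_residuals(X_other, x2)   # horizontal axis

# Slope of the line through the origin in the partial regression plot.
slope = (x2_given_others @ y_given_others) / (x2_given_others @ x2_given_others)

# Pairwise correlation as a quick numerical companion to the x_j versus x_i plot.
corr_x1_x2 = np.corrcoef(x1, x2)[0, 1]

print("full-model coefficient of x2:", beta_full[2])
print("partial regression slope:    ", slope)
print("corr(x1, x2):                ", corr_x1_x2)
```

In this sketch the partial regression slope reproduces the full-model coefficient of x2, and the partial residuals equal the ordinary residuals plus the fitted x2 term, which is why both plots isolate the contribution of a single regressor.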

Normal Probability Plot

A very simple method for checking the normality assumption is to plot the residuals on normal probability paper. This graph paper is designed so that the cumulative normal distribution plots as a straight line.

Let $e_{[1]} < e_{[2]} < \cdots < e_{[n]}$ be the residuals ranked in increasing order. Plot $e_{[i]}$ against the cumulative probabilities $P_i = (i - \tfrac{1}{2})/n$, $i = 1, 2, \ldots, n$. The resulting points should lie approximately on a straight line. The straight line is usually determined visually, with emphasis on the central values (e.g., the 0.33 and 0.67 cumulative probability points) rather than the extremes. Substantial departures from a straight line indicate that the distribution is not normal.
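Without probability paper, the same plot can be obtained by pairing each ranked residual with the standard normal quantile of its $P_i$. The sketch below uses only the Python standard library; the function names, the simulated samples and the use of a correlation coefficient as a rough straightness summary are illustrative assumptions, not part of the notes.

```python
import random
from statistics import NormalDist, mean

def normal_plot_points(residuals):
    """Return (P_i, normal quantile of P_i, ranked residual e_[i]) as three lists."""
    ranked = sorted(residuals)                          # e_[1] <= ... <= e_[n]
    n = len(ranked)
    probs = [(i - 0.5) / n for i in range(1, n + 1)]    # P_i = (i - 1/2)/n
    quantiles = [NormalDist().inv_cdf(p) for p in probs]
    return probs, quantiles, ranked

def pearson(u, v):
    """Sample correlation, used here as a rough summary of straightness."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

# Simulated stand-ins for residuals (assumption): one normal, one positively skewed.
rng = random.Random(0)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
skewed_sample = [rng.expovariate(1.0) for _ in range(200)]

probs, z_norm, e_norm = normal_plot_points(normal_sample)
_, z_skew, e_skew = normal_plot_points(skewed_sample)

r_normal = pearson(z_norm, e_norm)
r_skewed = pearson(z_skew, e_skew)
print("straightness, normal sample:", round(r_normal, 4))
print("straightness, skewed sample:", round(r_skewed, 4))
```

Plotting the quantiles against the ranked values gives the normal probability plot itself; the correlation is only a numerical companion to the visual judgment described above.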

Figure 3.1: Normal probability plots

Figure 3.1(a) displays an "idealized" normal probability plot. Notice that the points lie approximately along a straight line. Figures 3.1(b)-(e) represent other typical problems. Figure 3.1(b) shows a sharp upward and downward curve at both extremes, indicating that the tails of the distribution are too heavy for it to be considered normal. Conversely, Figure 3.1(c) shows flattening at the extremes, which is a pattern typical of samples from a distribution with thinner tails than the normal. Figures 3.1(d)-(e) exhibit patterns associated with positive and negative skew, respectively.

Andrews [1970] and Gnanadesikan [1977] note that normal probability plots often exhibit no unusual behavior even if the errors $\varepsilon_i$ are not normally distributed. This problem occurs because the residuals are not a simple random sample; they are the remnants of a parameter estimation process. The residuals can be shown to be linear combinations of the model errors (the $\varepsilon_i$). Thus fitting the parameters tends to destroy the evidence of non-normality in the residuals, and consequently we cannot always rely on the normal probability plot to detect departures from normality.
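The statement that the residuals are linear combinations of the model errors can be made explicit. The derivation below assumes the usual matrix form $y = X\beta + \varepsilon$ and the hat matrix $H$ with elements $h_{ik}$; this notation is an assumption here, since it is not introduced in this section.

```latex
\begin{aligned}
e &= y - \hat{y} = (I - H)\,y, \qquad H = X(X^{\top}X)^{-1}X^{\top},\\
  &= (I - H)(X\beta + \varepsilon) = (I - H)\,\varepsilon
     \quad\text{since } (I - H)X = 0,\\
e_i &= (1 - h_{ii})\,\varepsilon_i - \sum_{k \neq i} h_{ik}\,\varepsilon_k .
\end{aligned}
```

Each residual is therefore a weighted sum of all $n$ errors, which is one way to see why the residuals can look closer to normal than the errors themselves.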

A common defect that shows up on the normal probability plot is the occurrence of one or two large residuals. Sometimes this is an indication that the corresponding observations are outliers.

Residual Plot against $\hat{y}_i$

Figure 3.2: Patterns for residual plots: (a) satisfactory; (b) funnel; (c) double bow; (d) nonlinear.

A plot of the residuals $e_i$ (or the scaled residuals $d_i$ or $r_i$) versus the corresponding fitted values $\hat{y}_i$ is useful for detecting several common types of model inadequacies.[1] If this plot resembles Figure 3.2(a), which indicates that the residuals can be contained in a horizontal band, then there are no obvious model defects. Plots of $e_i$ against $\hat{y}_i$ that resemble any of the patterns in Figures 3.2(b)-(d) are symptomatic of model deficiencies.
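The reason for plotting against the fitted values rather than the observed values, given in the footnote, can be checked numerically. The sketch below fits a least squares line with an intercept to simulated data; the data, the variable names and the use of NumPy are assumptions made only for illustration.

```python
import numpy as np

# Simulated data (assumption): a straight-line model with constant error variance.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.8 * x + rng.normal(scale=1.0, size=n)

# Least squares fit with an intercept column.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Sample correlations of the residuals with the fitted and the observed values.
corr_e_fitted = np.corrcoef(resid, fitted)[0, 1]
corr_e_y = np.corrcoef(resid, y)[0, 1]

# Coefficient of determination, for comparison with corr(e, y).
r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print("corr(e, fitted):", corr_e_fitted)
print("corr(e, y):     ", corr_e_y)
print("sqrt(1 - R^2):  ", np.sqrt(1.0 - r_squared))
```

With an intercept in the model, the correlation between residuals and fitted values is zero up to rounding, while the correlation between residuals and observed values equals $\sqrt{1 - R^2}$, so a plot against $y_i$ would show a trend even when the model is adequate.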

∗ ∗ ∗ ∗ ∗

[1] The residuals should be plotted against the fitted values $\hat{y}_i$ and not the observed values $y_i$, because the $e_i$ and the $\hat{y}_i$ are uncorrelated while the $e_i$ and the $y_i$ are usually correlated.

Chapter 4

Subset Selection and Model Building

So far we have assumed that the variables that go into the regression equation were chosen in advance. Our analysis involved examining the equation to see whether the functional specification was correct and whether the underlying assumptions about the error term were valid. The analysis presupposed that the regressor variables included in the model are known to be influential. However, in most practical applications the analyst has a pool of candidate regressors that should include all the influential factors, but the actual subset of regressors that should be used in the model needs to be determined. Finding an appropriate subset of regressors for the model is called the variable selection problem.

Building a regression model that includes only a subset of the available regressors involves two conflicting objectives.

1. We would like the model to include as many regressors as possible so that the "information content" in these factors can influence the predicted value of $y$.

2. We want the model to include as few regressors as possible because the variance of the prediction $\hat{y}$ increases as the number of regressors increases. Also, the more regressors there are in a model, the greater the costs of data collection and model maintenance.
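The second objective can be illustrated at the observed data points. Under the standard assumptions of a correct model and constant error variance $\sigma^2$, the variance of a fitted value is $\sigma^2 h_{ii}$, where $h_{ii}$ is the $i$-th diagonal element of the hat matrix. The sketch below is an illustration on simulated regressors, with hypothetical names, showing how these leverages grow as columns are added.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X', computed from a QR decomposition."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

# Simulated pool of six candidate regressors (assumption).
rng = np.random.default_rng(2)
n = 50
candidates = rng.normal(size=(n, 6))
ones = np.ones((n, 1))

leverage_by_size = []
for k in range(0, 7):
    # Intercept plus the first k candidate regressors.
    X = np.hstack([ones, candidates[:, :k]])
    leverage_by_size.append(leverages(X))

# Average of Var(fitted value) / sigma^2 for each model size.
avg_var_ratio = [h.mean() for h in leverage_by_size]
for k, ratio in enumerate(avg_var_ratio):
    print(f"{k} regressors: average Var(y_hat)/sigma^2 = {ratio:.3f}")
```

The average equals the number of fitted parameters divided by $n$, and no individual leverage decreases when a regressor is added, which is the sense in which extra regressors inflate the variance of the prediction.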

The process of finding a model that is a compromise between these two objectives is called selecting the "best" regression equation. Unfortunately, there is no unique definition of "best". Furthermore, there are several algorithms that can be used for variable selection, and these procedures frequently specify different subsets of the candidate regressors as best.

The variable selection problem is often discussed in an idealized setting. It is usually assumed that the correct functional specification of the regressors is known, and that no outliers or influential observations are present. In practice, these assumptions are rarely met. Investigation of model adequacy is linked to the variable selection problem.

Although ideally these problems should be solved simultaneously, an iterative approach is often employed, in which

1. a particular variable selection strategy is employed and then

2. the resulting subset model is checked for correct functional specification, outliers, and influential observations.

Several iterations may be required to produce an adequate model.
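As a concrete picture of step 1, the sketch below enumerates every non-empty subset of a small candidate pool and scores each one. The choice of adjusted $R^2$ as the criterion, the exhaustive search, the simulated data and all names are assumptions made for illustration; the text above does not prescribe a particular strategy or criterion.

```python
from itertools import combinations
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of a least squares fit; X includes the intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))

# Simulated pool of four candidates (assumption); only the first two affect y.
rng = np.random.default_rng(3)
n = 120
pool = rng.normal(size=(n, 4))
y = 2.0 + 1.5 * pool[:, 0] - 2.0 * pool[:, 1] + rng.normal(scale=0.5, size=n)

ones = np.ones((n, 1))
scores = {}
for size in range(1, 5):
    for subset in combinations(range(4), size):
        X = np.hstack([ones, pool[:, list(subset)]])
        scores[subset] = adjusted_r2(X, y)

# Step 1 of the iteration: pick the subset favoured by the chosen criterion.
best_subset = max(scores, key=scores.get)
print("subset with the largest adjusted R^2:", best_subset)
```

Step 2 would then examine the residuals of the chosen subset model with the plots from the previous chapter, and the search would be repeated if that check suggests a transformation or the removal of an outlier.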
