Exercise 1: Soft Drink Consumption Revisited

R-squared measures how well the behavior of the independent variable explains the behavior of the dependent variable. R-squared is the ratio of the Regression Sum of Squares to the Total Sum of Squares. As such, it tells us what proportion of the total variation in the dependent variable is explained by its linear relationship with the independent variable.

Residual Analysis

Although the regression line is the line that best fits the observed data, the data points typically do not fall precisely on the line. Collectively, the vertical distances from the data to the line (the errors) measure how well the line fits the data. These errors are also known as residuals.

A careful study of the residuals can tell us a lot about a regression analysis and the validity of the assumptions we base it on.

For example, when we run a regression, we assume that a straight line best describes the relationship between our two variables. In fact, sometimes the relationship may be better described by a curve.

In this graph, we can clearly see a negative trend. If we run a regression on these data, we find a relatively high R-squared. How do we use the residuals to check our assumption that the relationship is linear?

First, we measure the residuals: the vertical distance from each data point to the regression line. Then we plot the residuals against the values of the independent variable. This graph, called a residual plot, helps us identify patterns in the residuals.

We can recognize a pattern in the residual plot: a curve. This pattern strongly indicates that a straight line is not the best way to express the relationship between the variables: a curve would be a much better fit.
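The two steps described above (measure the residuals, then view them against the independent variable) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set is hypothetical (a negative, curved relationship, y = 100 - x^2), and the helper name fit_line is ours, not part of any tool used in this course.

```python
# Hypothetical data: a negative but curved relationship, y = 100 - x^2.
xs = list(range(11))
ys = [100 - x ** 2 for x in xs]


def fit_line(xs, ys):
    """Least-squares slope and intercept for one independent variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]

# Residual = observed value minus the value predicted by the line.
residuals = [y - f for y, f in zip(ys, fitted)]

# R-squared = Regression Sum of Squares / Total Sum of Squares.
y_bar = sum(ys) / len(ys)
ssr = sum((f - y_bar) ** 2 for f in fitted)
sst = sum((y - y_bar) ** 2 for y in ys)
r_squared = ssr / sst

print(f"R-squared = {r_squared:.3f}")
# A text-only stand-in for the residual plot: residual at each x value.
for x, r in zip(xs, residuals):
    print(f"x = {x:2d}   residual = {r:7.2f}")
```

With these made-up numbers the straight line still earns an R-squared above 0.9, yet the residuals are negative at both ends of the range and positive in the middle. That arch is the kind of curved pattern a residual plot makes visible.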

A residual plot often is better than the original scatter plot for recognizing patterns because it isolates the errors from the general trend in the data. Residual plots are critical for studying error patterns in more advanced regressions with multiple independent variables.

If the only pattern in the dependent variable is accounted for by a linear relationship with the independent variable, then we should see no systematic pattern in the residual plot. The residuals should be spread randomly around the horizontal axis.

In fact, the residuals should be normally distributed, with mean zero and a fixed variance. Residuals are called homoskedastic if their distributions have the same variance at every value of the independent variable.

If we see a pattern in the distribution of the residuals, then we can infer that there is more to the behavior of the dependent variable than what is explained by our linear regression. Other factors may be influencing the dependent variable, or the assumption that the relationship is linear may be unwarranted.

We've already seen the pattern of a curved relationship. What other patterns might we see? Let's look at this scatter diagram and its corresponding residual plot. The residuals appear to be getting larger for higher values of the independent variable. This phenomenon is known as heteroskedasticity.

Residual analysis reveals that the distribution of the residuals changes with the independent variable: the variance increases as the independent variable increases. Since the variance of the residuals — which contributes to the variation of the dependent variable — is affected by the behavior of the independent variable, we can conclude that there must be more to the story than just the linear relationship.
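One rough way to see this numerically is to compare the spread of the residuals for low and high values of the independent variable. The sketch below is an informal check, not a formal statistical test. The data are hypothetical (points scattered around y = 3 + 2x, with the scatter growing in proportion to x), and the names fit_line and mean_square are our own.

```python
# Hypothetical data: the scatter around y = 3 + 2x widens as x grows.
xs = list(range(1, 11))
noise = [0.5 * x if x % 2 else -0.5 * x for x in xs]
ys = [3 + 2 * x + e for x, e in zip(xs, noise)]


def fit_line(xs, ys):
    """Least-squares slope and intercept for one independent variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_square(values):
    """Average squared residual: a simple measure of spread around zero."""
    return sum(v * v for v in values) / len(values)


slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Order the residuals by the independent variable, then compare the halves.
ordered = [r for _, r in sorted(zip(xs, residuals))]
half = len(ordered) // 2
low_spread = mean_square(ordered[:half])
high_spread = mean_square(ordered[half:])

print(f"spread for low x values:  {low_spread:.2f}")
print(f"spread for high x values: {high_spread:.2f}")
```

For this made-up data set the spread in the upper half is several times the spread in the lower half, which is what a fan-shaped residual plot looks like in numbers. If the residuals were homoskedastic, we would expect the two figures to be roughly equal.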

There are a number of other assumptions about regression whose validity can be tested by performing a residual analysis. Although interesting, these uses of residual analysis are beyond the scope of this course.

Summary

A complete regression analysis should include a careful inspection of the residuals. Plot the residuals against the independent variable to reveal patterns in the distribution of the residuals.

Graphing Regression Lines and Residuals in Excel 2007

to perform residual analysis using the regression tool. However, we suggest you read through the instructions to learn how Excel's regression tool works, so you can perform residual analysis in the future, when you do have access to the Data Analysis Toolpak.

To study patterns in residuals and other regression data, visual representations can be very helpful. To plot a regression line we first generate a scatter diagram of the data we are studying.

Once we have the scatter diagram we add the regression line by right-clicking on any one of the data points. A menu will pop up. From that menu, we select "Add Trendline."

In the Trendline Options Menu, we select "Linear" under Trend/Regression Type.

If we wish to display the regression equation and R-squared value on the same scatter plot, we can select the check boxes next to those items at the bottom and press "Close."

The scatter diagram will now have been augmented by the regression line, the regression equation, and the R-squared value. This is a quick way to perform a simple regression analysis from a scatter plot, though it doesn't provide all of the output we'll want to thoroughly review the results.
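For readers who want to check the two numbers the chart displays, the following Python sketch computes a least-squares regression equation and R-squared by hand. It illustrates the underlying arithmetic only and makes no claim about how Excel computes them internally. The five data points and the function name trendline_summary are hypothetical.

```python
def trendline_summary(xs, ys):
    """Return (slope, intercept, r_squared) for a least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # R-squared = Regression Sum of Squares / Total Sum of Squares.
    fitted = [intercept + slope * x for x in xs]
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    sst = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, ssr / sst


# Hypothetical data, small enough to verify with a calculator.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept, r_squared = trendline_summary(xs, ys)
print(f"y = {slope:.4f}x + {intercept:.4f}")
print(f"R-squared = {r_squared:.4f}")
```

For these five points the slope is 0.6, the intercept is 2.2, and R-squared is 0.6, so the line accounts for 60 percent of the total variation in the dependent variable.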

Residual analysis is an option in Excel's Regression tool. To calculate the residuals and generate the residual plot, we select "Residual Plots" in the Residuals section of the