Using Lines for Prediction - Statistics in Action

There are two main reasons why you would want to fit a line to a set of data:

• to find a summary, or model, that describes the relationship between the two variables

• to use the line to predict the value of y when you know the value of x. In cases where it makes sense to do this, the variable on the x-axis is called the predictor or explanatory variable, and the variable on the y-axis is called the predicted or response variable.

In the previous example, the equation

y = −195.20 + 0.10x

models the rise in the minimum wage for the years 1960 through 2005. Knowing this equation enables you to make a general statement about the minimum wage throughout these years: “The minimum wage went up roughly $0.10 per year.”

You might instead want to use the line to predict the minimum wage in one of the years for which no amount is given or for years before 1960 or after 2005.

Example: Predicting the Minimum Wage

Use the equation y = −195.20 + 0.10x to predict the minimum wage in the years 2003 and 1950.

Solution

The predicted minimum wage for 2003 is

y = −195.20 + 0.10x = −195.20 + 0.10(2003) = 5.10

Assuming the linear trend continues back to earlier years, the predicted minimum wage for 1950 is

y = −195.20 + 0.10x = −195.20 + 0.10(1950) = −0.20

The predicted minimum wage for 2003 is very close to the actual minimum wage of $5.15 per hour. But the actual minimum wage in 1950 was $0.75 per hour, not a negative number! As you can see, making the assumption that the linear trend continues can be risky. This type of prediction, making a prediction when the value of x falls outside the range of the actual data, is called extrapolation. Interpolation—making a prediction when the value of x falls inside the range of the data, as does 2003—is safer.
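The two predictions above can be reproduced with a short Python sketch. The function names and the `DATA_RANGE` constant are illustrative choices, not part of the text; the coefficients and the 1960 through 2005 range are taken from the example.

```python
# Fitted line from the text: y = -195.20 + 0.10x, fitted to the years 1960 through 2005.
INTERCEPT = -195.20
SLOPE = 0.10
DATA_RANGE = (1960, 2005)  # range of x values used to fit the line, per the text


def predict_minimum_wage(year):
    """Return the predicted minimum wage (dollars per hour) for a given year."""
    return INTERCEPT + SLOPE * year


def prediction_type(year, data_range=DATA_RANGE):
    """Label a prediction as interpolation (x inside the data range) or extrapolation."""
    low, high = data_range
    return "interpolation" if low <= year <= high else "extrapolation"


for year in (2003, 1950):
    wage = predict_minimum_wage(year)
    print(f"{year}: predicted {wage:.2f} ({prediction_type(year)})")
```

The sketch only labels a prediction as extrapolation; it does not say how far the linear trend can be trusted, which remains a judgment call.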

Suppose you know the value of x and use a line to predict the corresponding value of y. You know that your prediction for y won’t be exact, but you hope that the error will be small. The prediction error is the difference between the observed value of y and the predicted value of y, or y − ŷ. You usually don’t know what that error is. If you did, you wouldn’t need to use the line to predict the value of y. You do, however, know the errors for the points used to construct the line. These differences are called residuals:

residual = observed value of y − predicted value of y = y − ŷ

Lines are summaries and can be used to predict.

ŷ is read “y-hat” and may be called the “predicted” value or the “fitted” value.

Lesson http://acr.keypress.com/KeyPressPortalV3.0/Viewer/Lesson.htm

The geometric interpretation of the residual is shown in Display 3.16. A residual is the signed vertical distance from an observed data point to the regression line. The residual is positive if the point is above the line and negative if the point is below the line.

Display 3.16 Residual y − ŷ
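The sign convention for residuals can be sketched in Python as follows. The function names are hypothetical, and the numbers in the usage lines are made-up values chosen only to show each case.

```python
def residual(observed, predicted):
    """Residual = observed value of y minus predicted value of y."""
    return observed - predicted


def position_relative_to_line(observed, predicted):
    """Describe where a data point sits relative to the fitted line."""
    r = residual(observed, predicted)
    if r > 0:
        return "above the line"
    if r < 0:
        return "below the line"
    return "on the line"


# Hypothetical values, not taken from any data set in the text.
print(residual(12, 9), position_relative_to_line(12, 9))    # 3 above the line
print(residual(7, 10), position_relative_to_line(7, 10))    # -3 below the line
print(residual(5, 5), position_relative_to_line(5, 5))      # 0 on the line
```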

Example: Finding Residuals

Display 3.17 shows the mean net income (after expenses and before taxes, in thousands of dollars) for doctors who were board-certified in family practice and working during the years 1990–1998 and 2001.

Display 3.17 Mean net income for family practitioners, 1990–2001. [Source: U.S. Census Bureau, Statistical Abstract of the United States, 2004–2005.]

The equation of the fitted line is ŷ = −8300.6 + 4.2248x, where x is the year and ŷ is the predicted income in thousands of dollars.

Graph the fitted line with the data points. What is the residual for the year 1996?

Solution

You can use a graphing calculator to graph a scatterplot with a summary line. [See Calculator Note 3B.] The residual is the signed vertical distance of the data point from the line.

The actual net income value for 1996 was $139,000. Using the equation of the fitted line, the prediction for 1996 is

ŷ = −8300.6 + 4.2248x = −8300.6 + 4.2248(1996) = 132.1008, or $132,101

You also can use your calculator to calculate a predicted value quickly. [See Calculator Note 3C.]

To find the residual, subtract the predicted value from the observed value:

y − ŷ = 139 − 132.1008 = 6.8992

or about $6899. The residual is positive because the observed value is higher than the predicted value. That is, the point lies above the line.
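As a check on the arithmetic, the 1996 prediction and residual can be recomputed in Python. This is a minimal sketch: the names are illustrative, and only the fitted-line coefficients and the 1996 observed value come from the example, since the full data set in Display 3.17 is not reproduced here.

```python
# Fitted line from the example; income is measured in thousands of dollars.
INTERCEPT = -8300.6
SLOPE = 4.2248


def predicted_income(year):
    """Return the predicted mean net income (thousands of dollars) for a year."""
    return INTERCEPT + SLOPE * year


observed_1996 = 139.0                      # observed value given in the example
predicted_1996 = predicted_income(1996)    # value on the fitted line
residual_1996 = observed_1996 - predicted_1996

print(f"predicted: {predicted_1996:.4f}")  # 132.1008
print(f"residual:  {residual_1996:.4f}")   # 6.8992
```

Given the observed incomes for the other years, the same subtraction applied to each data point would produce the full set of residuals.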

You can use a calculator to calculate residuals for all points in a data set simultaneously. [See Calculator Note 3D.]

Using Lines for Prediction

D5. Test how well you understand residuals.

a. If a residual is large and negative, where is the point located with respect to the line? Draw a diagram to illustrate. What does it mean if the residual is 0?

b. If someone said that they had fit a line to a set of data points and all their residuals were positive, what would you say to them?

c. Interpret the y-intercept of the regression line in the previous example. Does this make sense?

D6. What do you think of the arithmetic and the reasoning in this passage from Mark Twain’s Life on the Mississippi?

In the space of one hundred and seventy-six years the Lower Mississippi has shortened itself two hundred and forty-two miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the Old Oölitic Silurian Period, just a million years ago next November, the Lower Mississippi River was upwards of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing rod.

And by the same token any person can see that seven hundred and forty-two years from now the Lower Mississippi will be only a mile and three quarters long, and Cairo and New Orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen. There is something fascinating about science. One gets such wholesale returns of conjecture out of such a trifling investment of fact. [Source: James R. Osgood and Company, 1883, p. 208.]

Given that the Mississippi/Missouri river system was about 3710 mi long in the year 2000, write an equation that Twain would say gives the length of the river in terms of the year.
