When you are working with real data, the best way to get the least squares line is by computer or calculator. Display 3.18 shows typical computer output for the minimum wage data in Display 3.14 on page 118.
Display 3.18 Data Desk output giving the equation of the least
squares line for the minimum wage data.
© 2008 Key Curriculum Press
Lesson http://acr.keypress.com/KeyPressPortalV3.0/Viewer/Lesson.htm
128 Chapter 3 Relationships Between Two Quantitative Variables
You can ignore most of the output for now. You will learn how to interpret it in Chapter 11. For the time being, focus on the first two columns in the last three rows, which are reproduced in Display 3.19.
Display 3.19 The lower-left corner of the computer output gives
the y-intercept and slope.
The y-intercept is the coefficient in the row labeled “Constant” and is −196.977. The slope is the coefficient of the predictor variable “Year” and is 0.100909. The SSE for the regression line is found in the “Residual” row and is 0.354545.
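As a rough illustration of how the two coefficients read from the output combine into a prediction, here is a minimal Python sketch. The function name `predicted_value` and the input year 2000 are assumptions made for illustration; only the two coefficients come from the output described above, and the sketch does not check whether that year falls within the range of the data.

```python
# Coefficients copied from the "Constant" and "Year" rows of the output.
INTERCEPT = -196.977   # y-intercept (row labeled "Constant")
SLOPE = 0.100909       # slope (coefficient of the predictor "Year")


def predicted_value(year):
    """Return the value on the least squares line for a given year."""
    return INTERCEPT + SLOPE * year


# Hypothetical input year, chosen only to show the arithmetic.
print(round(predicted_value(2000), 3))
```

The slope says that the predicted response rises by about 0.10 for each additional year, so predictions for two consecutive years differ by exactly the slope.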
Reading Computer Output
D8. Doctors’ incomes. Display 3.20 shows the mean net income y of family practitioners versus year x (from page 121), with Data Desk computer output for the least squares line.
Display 3.20 Scatterplot of mean net income (in thousands of dollars) of doctors board-certified in family practice, 1990–2001, and Data Desk output for the regression.
3.2 Getting a Line on the Pattern 129
a. What is the equation of the least squares line? Estimate the SSE from the scatterplot in Display 3.20, and then find it in the computer output.
b. The Minitab software output for this regression is shown in Display 3.21. How is it different from the Data Desk output?
Display 3.21 Minitab output for the regression of family
practitioners’ income versus year.
Summary 3.2: Getting a Line on the Pattern
For many quantitative relationships, it makes sense to use one variable, x, called the predictor or explanatory variable, to predict values of the other variable, y, called the predicted or response variable. When the data are roughly linear, you can use a fitted line, called the least squares regression line, as a summary or model that describes the relationship between the two variables. You might also use it to predict an unknown value of y when you know the value of x.
Interpolation (using a fitted relationship to predict a response value when the predictor value falls within the range of the data) generally is much more trustworthy than extrapolation (predicting response values based on the assumption that a fitted relationship applies outside the range of the observed data).
Each residual from a fitted line measures the vertical distance from a data point to the line:
residual = observed value − predicted value = $y - \hat{y}$
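The two ideas above can be sketched in a few lines of Python. This is only an illustration: the line with intercept 3 and slope 2, the observed x-values, and the helper names `predict`, `residual`, and `is_extrapolation` are all hypothetical and do not come from any display in this section.

```python
def predict(b0, b1, x):
    """Predicted response on the line with intercept b0 and slope b1."""
    return b0 + b1 * x


def residual(b0, b1, x, y):
    """Observed value minus predicted value for the point (x, y)."""
    return y - predict(b0, b1, x)


def is_extrapolation(x, observed_x):
    """True when x lies outside the range of the observed predictor values."""
    return x < min(observed_x) or x > max(observed_x)


# Hypothetical fitted line and predictor values, for illustration only.
b0, b1 = 3.0, 2.0
observed_x = [1, 2, 4, 7, 10]

print(residual(b0, b1, 4, 12.5))         # point lies above the line, so positive
print(is_extrapolation(5, observed_x))   # within the data range
print(is_extrapolation(15, observed_x))  # outside the data range
```

A positive residual means the point lies above the line, and a negative residual means it lies below.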
The least squares regression line for a set of pairs (x, y) is the line for which the sum of squared errors, or SSE, is as small as possible. For this line, these properties hold:
• The sum (and mean) of the residuals is 0.
• The line contains the point of averages, $(\bar{x}, \bar{y})$.
• The variation in the residuals is as small as possible.
• The line has slope $b_1$, where
$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
To find the equation of the regression line,
• compute $\bar{x}$ and $\bar{y}$.
• find the slope using the formula for $b_1$.
• compute the y-intercept: $b_0 = \bar{y} - b_1 \bar{x}$.
The equation is $\hat{y} = b_0 + b_1 x$. Remember to use a hat, ^, to indicate a predicted value of y.
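The steps in this summary can be sketched in Python as follows. This is a minimal illustration, not the output of any particular statistics package: it uses the standard least squares slope formula (the sum of products of deviations divided by the sum of squared x-deviations), and the four data points and the function names are made up for the example.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n                      # mean of x
    y_bar = sum(ys) / n                      # mean of y
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx                           # slope
    b0 = y_bar - b1 * x_bar                  # y-intercept
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared errors for the line with intercept b0 and slope b1."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen so the arithmetic is easy to check by hand.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

b0, b1 = least_squares_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print("slope:", b1, "intercept:", b0)
print("sum of residuals:", sum(residuals))
print("SSE:", sse(xs, ys, b0, b1))
print("SSE with a slightly steeper slope:", sse(xs, ys, b0, b1 + 0.1))
```

For these points the slope is 1.9, the intercept is 0 and the residuals sum to 0, up to floating-point rounding. Changing the slope or the intercept makes the SSE larger, which is what "least squares" means.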
Lines as Summaries
P3. Display 3.22 shows the weight of a student’s pink eraser, in grams, plotted against the number of days into the school year. Estimate the slope of the line drawn on the graph. Interpret the slope in the context of the situation. [Source: Zach’s Eraser, CMC ComMuniCator, 28 (June 2004): 28.]
Display 3.22 Weight of pink eraser.
P4. Display 3.23 shows the hand width of 383 students plotted against hand length. The line drawn on the plot is the least squares line.
Display 3.23 Hand width and hand length, in
inches, for 383 students.
a. Estimate the slope of the line.
b. What does the slope tell you?
c. Estimate the equation of the line.
d. Students were instructed to measure their “hand width” with their fingers spread apart as far as possible. The scatterplot shows a smaller cloud of points below the main one. Why do you think that is the case? What would happen to the regression line if those points were removed?
Using Lines for Prediction
P5. If you attend a university where class sizes tend to be small, are you more likely to give to your alumni fund after you graduate than if you graduate from a university with large classes? Display 3.24 shows a scatterplot of a sample of 40 universities. Each university appears as a point. The vertical coordinate, y, tells the percentage of alumni who gave money. Each x-coordinate tells the student/faculty ratio (number of students per faculty member). The equation of the fitted line is approximately $\hat{y} = 55 - 2x$.
a. Which is the explanatory variable and which is the response variable?
b. Explain how you can see from the graph that an increase of five students per faculty member corresponds to a decrease of about 10 percentage points in the giving rate. Explain how you can see this from the equation of the fitted line.
c. Does the y-intercept have a useful interpretation in this situation?
Practice
d. Use the regression line to predict the giving rate for a university with a student/faculty ratio of 16. When you use the regression line to predict the giving rate, would you expect a rather large error or a relatively small error in your prediction?
e. Use the plot to estimate the residual for the university with the highest student/faculty ratio and for the university with the highest giving rate.
f. The university with the lowest student/faculty ratio, 6 to 1, had a giving rate of 32%. Use the equation of the fitted line to find the residual for that university.
g. Suppose the Alumni Association at Piranha State University boasts a giving rate of 80%. Without knowing the student/faculty ratio at PSU, can you tell whether the prediction error will be positive or negative?
Display 3.24 Percentage of alumni giving to the alumni fund versus the student/faculty ratio for 40 highly rated U.S. universities.
Least Squares Regression Line
P6. The fat and calorie contents of 5 oz of three kinds of pizza are represented by the data points (9, 305), (11, 309), and (13, 316).
a. Plot the points.
b. Compute the equation of the least squares regression line by hand, and draw the line on your plot.
c. Interpret the slope and y-intercept in the context of this situation.
d. Verify that the least squares regression line goes through the point of averages.
e. Verify that the sum of the residuals is 0.
P7. Use the statistical functions of your calculator to make a scatterplot, find the regression equation for predicting percentage on-time arrivals from mishandled baggage, and compute residuals for the airline data from P2 on page 110. [See Calculator Notes 3A, 3G, and 3D.] The data values are given in Display 3.25.
Display 3.25 Comparison, by airline, of mishandled
baggage and on-time arrival rate.
[Source: U.S. Department of Transportation, Air
Travel Consumer Report, October 2005.]
Reading Computer Output
P8. The JMP-IN computer output in Display 3.26 is for the pizza data in P6. Does it give the same results that you computed by hand? Where in the output is the SSE found?
Display 3.26 JMP-IN computer output for pizza data.
Exercises
E9. Display 3.27 shows cost in dollars per hour versus number of seats for three aircraft models. Five lines, labeled A–E, are shown on the plot. Their equations, listed below, are labeled I–V.
a. Match each line (A–E) with its equation (I–V).
I. cost = −290 + 15.8 seats
II. cost = 400 + 15.8 seats
III. cost = 1000 + 15.8 seats
IV. cost = 370 + 25 seats
V. cost = 900 + 10 seats
b. Match each line (A–E) with the appropriate verbal description (I–V):
I. This line overestimates cost.
II. This line underestimates cost.
III. This line overestimates cost for the smallest plane and underestimates cost for the largest plane.
IV. This line underestimates cost for the smallest plane and overestimates cost for the largest plane.
V. On balance, this line gives a better fit than the other lines.
Display 3.27 Cost in dollars per hour versus number of seats for three aircraft models.
E10. Examine the scatterplot in Display 3.28.
Display 3.28 Calories versus fat, per 5-oz serving,
for seven kinds of pizza. [Source: Consumer Reports, July 2003.]
a. Which two kinds of pizza in Display 3.28 have the fewest calories? Which two have the least fat? Which region of the graph has the pizzas with the most fat?
b. Display 3.29 shows the data again, with five possible summary lines. Match each equation (I–V) with the appropriate line (A–E).
I. calories = 70 + 15 fat
II. calories = 10 + 25 fat
III. calories = 150 + 15 fat
IV. calories = 110 + 15 fat
V. calories = 170 + 10 fat
Display 3.29 Five possible fitted lines for the
pizza data.
c. Consider the possible summary lines in Display 3.29.
i. Which line gives predicted values for calorie content that are too high? How can you tell this from the plot?
ii. Which line tends to give predicted calorie values that are too low?
iii. Which line tends to overestimate calorie content for lower-fat pizzas and underestimate calorie content for higher-fat pizzas?
iv. Which line has the opposite problem, underestimating calorie content when fat content is lower and overestimating calorie content when fat content is higher?
v. Which line fits the data best overall?
E11. Heights of boys. The scatterplot in Display 3.30 shows the median height, in inches, for boys ages 2 through 14 years.
Display 3.30 Median height versus age for boys.
[Source: National Health and Nutrition Examination Survey (NHANES), 2002, www.cdc.gov.]
a. Estimate the slope of the line that summarizes the relationship between age and median height.
b. Explain the meaning of the slope with respect to boys and their median height.
c. Write the equation of the line using the slope from part a and a point on the line.
d. Interpret the y-intercept. Does the interpretation make sense in this context?
E12. Pizza again. Display 3.31 shows the calorie and fat content of 5 oz of various kinds of pizza.
Display 3.31 Calories and fat content per 5-oz
serving, for seven kinds of pizza. [Source:
Consumer Reports, January 2002.]
a. Use the line on the scatterplot to predict the calorie content of a pizza with 10.5 g of fat. Then use the line to predict the calorie content of a pizza with 15 g of fat.
b. Use the two predictions in part a to estimate the slope of the line. Write the equation of the line using this slope and a point on the line.
c. There are 9 calories in a gram of fat. How is your estimated slope related to this number?
E13. Stopping on a dime? In an emergency, the typical driver requires about 0.75 second to get his or her foot onto the brake pedal. The distance the car travels during this reaction time is called the reaction distance. Display 3.32 shows the reaction distances for cars traveling at various speeds.
Display 3.32 Reaction distance at various speeds.
a. Plot reaction distance versus speed, with
speed on the horizontal axis. Describe the shape of the plot.
b. What should the y-intercept be?
c. Find the slope of the line of best fit by calculating the change in y per unit change in x. What does the slope represent in this situation?
d. Write the equation of the line that fits these data.
e. Use the equation of the line in part d to predict the reaction distance for a car traveling at a speed of 55 mi/h and at 75 mi/h.
f. How would the equation change if it actually took 1 second, instead of 0.75 second, for drivers to react?
E14. The scatterplot in Display 3.33 shows
operating cost (in dollars per hour) versus
fuel consumption (in gallons per hour) for a sample of commercial aircraft.
Display 3.33 Operating cost versus fuel
consumption for commercial aircraft.
a. Which is the explanatory variable and which is the response variable?
b. Estimate the slope of the regression line from the graph, and interpret it in the context of this situation.
c. The y-intercept is 470. Does this value have a reasonable interpretation in this situation?
d. Use the line to predict the cost per hour for a plane that consumes 1500 gal/h of fuel.
E15. Arsenic is a potent poison sometimes found in groundwater. Long-term exposure to arsenic in drinking water can cause cancer. How much arsenic a person has absorbed can be measured from a toenail clipping. The plot in Display 3.34 shows the arsenic concentrations in the toenails of 21 people who used water from their private wells plotted against the arsenic concentration in their well water. Both measurements are in parts per million.
Display 3.34 Arsenic concentrations. [Source: M. R.
Karagas et al. Toenail Samples as an Indicator of Drinking Water Arsenic Exposure, Cancer
Epidemiology, Biomarkers and Prevention 5 (1996):
849–52.]
a. What is the predictor variable, and what is the response variable?
b. Describe the relationship.
c. Estimate the residual for the person with the highest concentration of arsenic in the well water.
d. Find the person on the plot with the largest residual. What was the concentration of arsenic in that person’s toenails?
e. The World Health Organization has set a standard that the concentration of arsenic
in drinking water should be less than 0.01 mg/L. (1 mg/L = 1 ppm.) Is this standard exceeded in any of these wells?
[Source: www.who.int.]
E16. More pizza. Refer to the pizza data in E12.
a. The least squares residuals for the pizza data are, in order from smallest to largest, −40.58, −17.66, −15.95, −1.03, 14.28, 26.44, and 34.50. Match each residual with its pizza.
b. What does the residual for Pizza Hut’s Pan pizza tell you about the pizza’s number of calories versus fat content?
c. For Pizza Hut’s Hand Tossed and Domino’s Deep Dish, are the residuals positive or negative? How can you tell this from the scatterplot in Display 3.31?
E17. The level of air pollution is indicated by a measure called the air quality index (AQI). An AQI greater than 100 means the air quality is unhealthy for sensitive groups such as children. The table and plot in Display 3.35 show the number of days in Detroit that the AQI was greater than 100 for the years 2001, 2002, and 2003.
Display 3.35 Air quality index for 2001–2003.
[Source: U.S. Environmental Protection Agency, www.epa.gov.]
a. By hand, compute the equation of the least squares line.
b. Interpret the slope in the context of this situation.
c. Which year has the largest residual? What is this residual?
d. Compute the SSE for this line.
e. Verify that the sum of the residuals is 0.
f. Find the SSE for the line that has the same slope as the least squares line but passes through the point for 2002. Is this SSE larger or smaller than the SSE for the least squares line? According to the least squares approach, which line fits better?
g. Find the slope of the line that passes through the points for 2001 and 2003. Then find the fitted value for 2002. Finally, find all three residuals and the value of the SSE for this line.
h. The least squares line doesn’t pass through any of the points, and yet judging by the SSE that line fits better than the one in part g. Do you agree that the least squares line fits better than the lines in parts f and g? Explain why or why not.
E18. Even more pizza. Refer again to the table and scatterplot in Display 3.31 on page 134.
a. By hand, compute the equation of the least squares regression line for using fat to predict calories. How close was your estimate of the equation in E12?
b. Which of these values must be the SSE for this regression? Explain your answer.
0    29.3    861.4    4307
E19. Heights of girls. Display 3.36 gives the median height in inches for girls ages 2–14.
a. Practice using your calculator by making a scatterplot, finding the equation of the least squares line for median height versus age, and graphing the equation on the plot.
Display 3.36 Median height for girls ages 2–14.
b. Judging from the plot, is the residual for 11-year-olds positive or negative? Compute this residual to check your answer.
c. Verify that the line contains the point of averages, $(\bar{x}, \bar{y})$.
d. How does the regression line for girls compare to the line for boys in E11?
E20. Sum of residuals. In this exercise, you will show that the sum of the residuals is equal to 0 if and only if the regression line passes through the point of averages, $(\bar{x}, \bar{y})$.
a. Show that for a horizontal line the sum of the residuals will be 0 if and only if the line passes through the point of averages.
b. Show that no matter what the slope of the line is, the sum of the residuals will be 0 if and only if the line passes through the point of averages.
c. Why isn’t it good enough to define the regression line as the line that makes the sum of the residuals equal 0?
E21. Height versus age. Display 3.37 shows a standard computer printout for the median
height versus age data of E11.
Display 3.37 Computer output of median height
versus age data.
a. Write the equation of the regression line. How does it compare to your estimate of the equation in E11?
b. What is the SSE for this least squares line? Does its value seem reasonable given the scatterplot in Display 3.30 on page 133?
E22. Part of a printout for the percentage of alumni who give to their colleges versus the student/faculty ratio is shown in Display 3.38. (These are the data in the scatterplot shown in Display 3.24 on page 131.)
Display 3.38 Computer output: regression analysis
of percentage giving to alumni fund versus student/faculty ratio.
a. What equation is given in the printout for the least squares regression line?
b. Examine the table of unusual observations. What is the student/faculty ratio at the college with the largest residual (in absolute value)? Find this college in Display 3.24 on page 131.
c. Verify that the fit and the value of the largest residual were computed correctly.
d. Locate the SSE on the printout. Why is this value so large?
E23. For the least squares regression line you found in E19, calculate the residuals for girls ages 2, 8, and 14. What does this suggest about the pattern of growth beyond what is summarized in the equation of the regression line?
E24. More about slope.
a. You and three friends, one right after the other, each buy the same kind of gas at the same pump. Then you make a scatterplot of your data, with one point per person, plotting the number of