In most cases, no line will pass exactly through all the points in a scatterplot. Different people will draw different lines by eye. We need a way to draw a regression line that doesn’t depend on our guess as to where the line should go. Because we use the line to predict y from x, the prediction errors we make are errors in y, the vertical direction in the scatterplot. A good regression line makes the vertical distances of the points from the line as small as possible.
Figure 5.2 illustrates the idea. This plot shows three of the points from Figure 5.1, along with the line, on an expanded scale. The line passes above one of the points and below two of them. The three prediction errors appear as vertical line segments. For example, one subject had x = −57, a decrease of 57 calories in NEA. The line predicts a fat gain of 3.7 kilograms, but the actual fat gain for this subject was 3.0 kilograms. The prediction error is

error = observed response − predicted response = 3.0 − 3.7 = −0.7 kilogram

c05Regression.indd Page 128 8/17/11 7:27:29 PM user-s163
There are many ways to make the collection of vertical distances “as small as possible.” The most common is the least-squares method.
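The arithmetic of a single prediction error can be sketched in a few lines of Python. This is only an illustration: the intercept and slope are copied from the Excel output for these data in Figure 5.3, and the function names are made up for the example.

```python
# Intercept and slope as reported in the Excel output of Figure 5.3.
INTERCEPT = 3.505122916   # kilograms of fat gain when NEA change is 0
SLOPE = -0.003441487      # kilograms of fat gain per calorie of NEA change


def predict_fat_gain(nea_change):
    """Predicted fat gain (kg) for a given NEA change (calories)."""
    return INTERCEPT + SLOPE * nea_change


def prediction_error(observed, nea_change):
    """Observed response minus predicted response, measured vertically."""
    return observed - predict_fat_gain(nea_change)


# The subject discussed above: NEA change of -57 calories, observed gain 3.0 kg.
predicted = predict_fat_gain(-57)
error = prediction_error(3.0, -57)
print(f"predicted = {predicted:.1f} kg, error = {error:.1f} kg")
# prints: predicted = 3.7 kg, error = -0.7 kg
```

A negative error means the point lies below the line: the line predicted more fat gain than was observed.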
The Least-Squares Regression Line 129

FIGURE 5.2
The least-squares idea. For each observation, find the vertical distance of each point on the scatterplot from a regression line. The least-squares regression line makes the sum of the squares of these distances as small as possible.
[Plot: fat gain (kilograms) versus nonexercise activity change (calories). Annotations: this subject had NEA = −57; predicted response 3.7; observed response 3.0.]
LEAST-SQUARES REGRESSION LINE
The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
One reason for the popularity of the least-squares regression line is that the problem of finding the line has a simple answer. We can give the equation for the least-squares line in terms of the means and standard deviations of the two variables and the correlation between them.
130 CHAPTER 5 • Regression

We write $\hat{y}$ (read “y hat”) in the equation of the regression line to emphasize that the line gives a predicted response $\hat{y}$ for any x. Because of the scatter of points about the line, the predicted response will usually not be exactly the same as the actually observed response y. In practice, you don’t need to calculate the means, standard deviations, and correlation first. Software or your calculator will give the slope b and intercept a of the least-squares line from the values of the variables x and y. You can then concentrate on understanding and using the regression line.
USING TECHNOLOGY
Least-squares regression is one of the most common statistical procedures. Any technology you use for statistical calculations will give you the least-squares line and related information. Figure 5.3 displays the regression output for the data of Examples 5.1 and 5.2 from a graphing calculator, two statistical programs, and a spreadsheet program. Each output records the slope and intercept of the least-squares line. The software also provides information that we do not yet need, although we will use much of it later. (In fact, we left out part of the Minitab and Excel outputs.) Be sure that you can locate the slope and intercept on all four outputs. Once you understand the statistical ideas, you can read and work with almost any software output.
EQUATION OF THE LEAST-SQUARES REGRESSION LINE
We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means $\bar{x}$ and $\bar{y}$ and the standard deviations $s_x$ and $s_y$ of the two variables, and their correlation r. The least-squares regression line is the line

$$\hat{y} = a + bx$$

with slope

$$b = r \frac{s_y}{s_x}$$

and intercept

$$a = \bar{y} - b\bar{x}$$

APPLY YOUR KNOWLEDGE
5.3 Coral reefs. Exercises 4.2 and 4.10 discuss a study in which scientists examined data on mean sea surface temperatures (in degrees Celsius) and mean coral growth (in millimeters per year) over a several-year period at locations in the Red Sea. Here are the data:²

CORAL
Sea surface temperature (°C)   29.68   29.87   30.16   30.22   30.48   30.65   30.90
Growth (mm per year)            2.63    2.58    2.60    2.48    2.26    2.38    2.26
FIGURE 5.3
Least-squares regression for the nonexercise activity data: output from a graphing calculator, two statistical programs, and a spreadsheet program.

Texas Instruments Graphing Calculator
[calculator screen image]

Minitab
Regression Analysis: fat versus nea
The regression equation is
fat = 3.51 - 0.00344 nea

Predictor   Coef         SE Coef     T       P
Constant    3.5051       0.3036      11.54   0.000
nea         -0.0034415   0.0007414   -4.64   0.000

S = 0.739853   R-Sq = 60.6%   R-Sq (adj) = 57.8%

Microsoft Excel
SUMMARY OUTPUT
Regression statistics
Multiple R          0.778555846
R Square            0.606149205
Adjusted R Square   0.578017005
Standard Error      0.739852874
Observations        16

            Coefficients   Standard Error   t Stat     P-value
Intercept   3.505122916    0.303616403      11.54458   1.53E-08
nea         -0.003441487   0.00074141       -4.64182   0.000381

CrunchIt!
Results - Simple Linear Regression
Fitted Equation:
Fat = 3.505 - 0.003441 * NEA

              Estimate    Std. Error   t value   Pr(>t)
(Intercept)   3.505       0.3036       11.54     <0.0001
NEA           -0.003441   0.0007414    -4.642    0.0003810

r-Squared: 0.6061   Adjusted r-Squared: 0.5780   sigma: 0.7399
(a) Use your calculator to find the mean and standard deviation of both sea surface temperature x and growth y and the correlation r between x and y. Use these basic measures to find the equation of the least-squares line for predicting y from x.
(b) Enter the data into your software or calculator and use the regression function to find the least-squares line. The result should agree with your work in (a) up to roundoff error.
5.4 Do heavier people burn more energy? We have data on the lean body mass and resting metabolic rate for 12 women who are subjects in a study of dieting. Lean body mass, given in kilograms, is a person’s weight leaving out all fat. Metabolic rate, in calories burned per 24 hours, is the rate at which the body consumes energy.

METABOLIC
Mass   36.1   54.6   48.5   42.0   50.6   42.0   40.3   33.1   42.4   34.5   51.1   41.2
Rate    995   1425   1396   1418   1502   1256   1189    913   1124   1052   1347   1204
(a) Make a scatterplot that shows how metabolic rate depends on body mass. There is a quite strong linear relationship, with correlation r = 0.876.
(b) Find the least-squares regression line for predicting metabolic rate from body mass. Add this line to your scatterplot.
(c) Explain in words what the slope of the regression line tells us.
(d) Another woman has a lean body mass of 45 kilograms. What is her predicted metabolic rate?