One reason for the popularity of least-squares regression lines is that they have many convenient properties. Here are some facts about least-squares regression lines.
Fact 1. The distinction between explanatory and response variables is essential in regression. Least-squares regression makes the distances of the data points from the line small only in the y direction. If we reverse the roles of the two variables, we get a different least-squares regression line.
EXAMPLE 5.3
Predicting fat gain, predicting change in NEA
Figure 5.4 repeats the scatterplot of the NEA data in Figure 5.1, but with two least-squares regression lines. The solid line is the regression line for predicting fat gain from change in NEA. This is the line that appeared in Figure 5.1. We might also use the data on these 16 subjects to predict the change in NEA for another subject from that subject’s fat gain when overfed for 8 weeks. Now the roles of the variables are reversed: fat gain is the explanatory variable and change in NEA is the response variable. The dashed line in Figure 5.4 is the least-squares line for predicting NEA change from fat gain. The two regression lines are not the same.
In the regression setting, you must know clearly which variable is explanatory. ■ c05Regression.indd Page 132 8/17/11 7:27:31 PM user-s163
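As a minimal sketch of Fact 1, the following Python snippet fits both regression lines to a small made-up data set. The numbers are hypothetical and are not the NEA measurements; the variable names are illustrative only. It assumes NumPy is available.

```python
import numpy as np

# Hypothetical illustrative data, NOT the values from the NEA study.
x = np.array([-50.0, 100.0, 250.0, 400.0, 550.0, 700.0])  # explanatory variable
y = np.array([4.0, 3.1, 3.4, 2.2, 2.5, 1.0])              # response variable

# Least-squares line for predicting y from x (small vertical distances).
b_yx, a_yx = np.polyfit(x, y, 1)

# Least-squares line for predicting x from y (roles reversed).
b_xy, a_xy = np.polyfit(y, x, 1)

# Rewrite the second line as y against x so both can be compared on one plot:
# x = a_xy + b_xy * y  is the same line as  y = (x - a_xy) / b_xy.
slope_reversed = 1.0 / b_xy
intercept_reversed = -a_xy / b_xy

print("y on x: slope", b_yx, "intercept", a_yx)
print("x on y, drawn on the same axes: slope", slope_reversed,
      "intercept", intercept_reversed)
```

Unless the points fall exactly on a line, the two printed slopes differ, which is why the choice of explanatory variable matters.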
Fact 2. There is a close connection between correlation and the slope of the least-squares line. The slope is
$$b = r \frac{s_y}{s_x}$$
You see that the slope and the correlation always have the same sign. For example, if a scatterplot shows a positive association, then both b and r are positive. The formula for the slope b says more: along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y. When the variables are perfectly correlated ($r = 1$ or $r = -1$), the change in the predicted response $\hat{y}$ is the same (in standard deviation units) as the change in x. Otherwise, because $-1 \le r \le 1$, the change in $\hat{y}$ (in standard deviation units) is less than the change in x. As the correlation grows less strong, the prediction $\hat{y}$ moves less in response to changes in x.
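The slope formula in Fact 2 can be checked numerically. The sketch below uses made-up data (hypothetical values, not from any study in this chapter) and compares the slope from a least-squares fit with r times the ratio of the standard deviations, assuming NumPy is available.

```python
import numpy as np

# Hypothetical illustrative data.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
y = np.array([1.5, 2.9, 2.4, 4.8, 4.1, 6.3])

r = np.corrcoef(x, y)[0, 1]   # correlation
s_x = np.std(x, ddof=1)       # sample standard deviation of x
s_y = np.std(y, ddof=1)       # sample standard deviation of y

b_formula = r * s_y / s_x         # slope from Fact 2
b_fit = np.polyfit(x, y, 1)[0]    # slope from a least-squares fit

# Moving x by one standard deviation moves the prediction by r standard deviations.
change_in_sd_units = b_fit * s_x / s_y

print(b_formula, b_fit, change_in_sd_units, r)
```

The two slopes agree up to rounding, and the last two printed numbers are equal.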
Fact 3. The least-squares regression line always passes through the point $(\bar{x}, \bar{y})$ on the graph of y against x. This is a consequence of the equation of the least-squares regression line (box on page 130). In Exercise 5.48 we ask you to confirm this.
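One way to see Fact 3, sketched under the assumption that the equation box writes the line as $\hat{y} = a + bx$ with intercept $a = \bar{y} - b\bar{x}$ (the usual least-squares form; the box itself is not reproduced in this section), is to substitute $x = \bar{x}$:

$$\hat{y} = a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}$$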
Fact 4. The correlation r describes the strength of a straight-line relationship. In the regression setting, this description takes a specific form: the square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
FIGURE 5.4
Two least-squares regression lines for the nonexercise activity data, for Example 5.3. The solid line predicts fat gain from change in nonexercise activity. The dashed line predicts change in nonexercise activity from fat gain. (Horizontal axis: nonexercise activity change, in calories. Vertical axis: fat gain, in kilograms.)
The idea is that when there is a linear relationship, some of the variation in y is accounted for by the fact that as x changes, y changes along with it. Look again at Figure 5.1 (page 126), the scatterplot of the NEA data. The variation in y appears as the spread of fat gains from 0.4 to 4.2 kg. Some of this variation is explained by the fact that x (change in NEA) varies from a loss of 94 calories to a gain of 690 calories. As x changes from −94 to 690, y changes along the line. You would predict a smaller fat gain for a subject whose NEA increased by 600 calories than for someone with 0 change in NEA. But the straight-line tie of y to x doesn’t explain all of the variation in y. The remaining variation appears as the scatter of points above and below the line.
Although we won’t do the algebra, it is possible to break the variation in the observed values of y into two parts. One part measures the variation in $\hat{y}$ along the least-squares regression line as x varies. The other measures the vertical scatter of the data points above and below the line. The squared correlation $r^2$ is the first of these as a fraction of the whole:
$$r^2 = \frac{\text{variation in } \hat{y} \text{ along the regression line as } x \text{ varies}}{\text{total variation in observed values of } y}$$
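The split of the variation into two parts can be illustrated numerically. This is a sketch on made-up data (hypothetical values), measuring "variation" as a sum of squared deviations from the mean of y, which is one common way to make the idea concrete. It assumes NumPy is available.

```python
import numpy as np

# Hypothetical illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.3, 2.9, 4.4, 4.1, 5.8, 6.1, 7.5])

b, a = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = a + b * x            # predicted values along the line

total_variation = np.sum((y - y.mean()) ** 2)        # variation in observed y
along_the_line = np.sum((y_hat - y.mean()) ** 2)     # variation in y-hat as x varies
scatter_about_line = np.sum((y - y_hat) ** 2)        # vertical scatter about the line

r = np.corrcoef(x, y)[0, 1]

print("fraction explained:", along_the_line / total_variation)
print("r squared:         ", r ** 2)
```

The two parts add up to the total variation, and the printed fraction matches the squared correlation.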
EXAMPLE 5.4
Using $r^2$
For the NEA data, $r = -0.7786$ and $r^2 = (-0.7786)^2 = 0.6062$. About 61% of the variation in fat gained is accounted for by the linear relationship with change in NEA. The other 39% is individual variation among subjects that is not explained by the linear relationship.
Figure 4.2 (page 103) shows a stronger linear relationship between boat registrations in Florida and manatees killed by boats. The correlation is $r = 0.951$ and $r^2 = (0.951)^2 = 0.904$. Slightly more than 90% of the year-to-year variation in number of manatees killed by boats is explained by regression on number of boats registered. Only about 10% is variation among years with similar numbers of boats registered. ■
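The arithmetic in Example 5.4 is easy to reproduce. The short sketch below squares the two correlations quoted in the example. The NEA correlation is negative because fat gain goes down as NEA goes up, and squaring removes the sign. The variable names are illustrative only.

```python
# Correlations quoted in Example 5.4.
r_nea = -0.7786      # fat gain versus change in NEA
r_manatee = 0.951    # manatees killed versus boats registered

# Square each correlation to get the fraction of variation explained.
r2_nea = round(r_nea ** 2, 4)
r2_manatee = round(r_manatee ** 2, 3)

print(r2_nea)      # 0.6062, about 61%
print(r2_manatee)  # 0.904, slightly more than 90%
```

Storing the correlation in a variable before squaring avoids a common slip: in Python, the expression `-0.7786 ** 2` is read as `-(0.7786 ** 2)` and gives a negative number.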
You can find a regression line for any relationship between two quantitative variables, but the usefulness of the line for prediction depends on the strength of the linear relationship. So $r^2$ is almost as important as the equation of the line in reporting a regression. All the outputs in Figure 5.3 (page 131) include $r^2$, either in decimal form or as a percent. When you see a correlation, square it to get a better feel for the strength of the association. Perfect correlation ($r = -1$ or $r = 1$) means the points lie exactly on a line. Then $r^2 = 1$ and all the variation in one variable is accounted for by the linear relationship with the other variable. If $r = -0.7$ or $r = 0.7$, $r^2 = 0.49$ and about half the variation is accounted for by the linear relationship. In the $r^2$ scale, correlation $\pm 0.7$ is about halfway between 0 and 1.
Facts 2, 3, and 4 are special properties of least-squares regression. They are not true for other methods of fitting a line to data.
RESIDUALS
One of the first principles of data analysis is to look for an overall pattern and also for striking deviations from the pattern. A regression line describes the overall pattern of a linear relationship between an explanatory variable and a response variable. We see deviations from this pattern by looking at the scatter of the data points about the regression line. The vertical distances from the points to the least-squares regression line are as small as possible, in the sense that they have the smallest possible sum of squares. Because they represent “leftover” variation in the response after fitting the regression line, these distances are called residuals.
APPLY YOUR KNOWLEDGE
5.5 How useful is regression? Figure 4.8 (page 115) displays the relationship between golfers’ scores on the first and second rounds of the 2010 Masters Tournament. The correlation is $r = 0.347$. Exercise 4.30 gives data on solar radiation (SRD) and concentration of dimethyl sulfide (DMS) over a region of the Mediterranean. The correlation is $r = 0.969$. Explain in simple language why knowing only these correlations enables you to say that prediction of DMS from SRD by a regression line will be much more accurate than prediction of a golfer’s second-round score from his first-round score.
5.6 Feed the birds. Exercise 4.32 (page 118) gives data from a study in which canary parents cared for both their own babies and those of other parents. Investigators looked at how the growth rate of the foster babies relative to the growth rate of the natural babies changed as the begging intensity for food by the foster babies increased over the begging intensity of the natural babies. If begging intensity is the main factor determining food received, with higher intensity leading to more food, one would expect the relative growth rate to increase as the difference in begging intensity increases. However, if both begging intensity and a preference for their own babies determine the amount of food received (and hence the relative growth rate), we might expect growth rate to increase initially as begging intensity increases but then to level off (or even decrease) as the parents begin to ignore further increases in begging by the foster babies. (Data set: CANARIES)
(a) Make a scatterplot of the data. Find the least-squares regression line for predicting relative growth rate of the foster brood from the difference in begging intensity between the foster brood and the actual babies of the parents and add this line to your plot. Should we not use the regression line for prediction in this setting?
(b) What is $r^2$? What does this value say about the success of the regression line in predicting relative growth rate?
RESIDUALS
A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is, a residual is the prediction error that remains after we have chosen the regression line:
$$\text{residual} = \text{observed } y - \text{predicted } y = y - \hat{y}$$
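As a minimal sketch of the definition, the snippet below fits a least-squares line to made-up data (hypothetical values, not from this chapter) and computes one residual per observation as observed y minus predicted y. It assumes NumPy is available.

```python
import numpy as np

# Hypothetical illustrative data.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
y = np.array([0.8, 1.1, 2.0, 1.9, 3.2, 3.0])

b, a = np.polyfit(x, y, 1)     # least-squares slope and intercept
predicted = a + b * x          # predicted y for each observed x
residuals = y - predicted      # residual = observed y - predicted y

for xi, yi, res in zip(x, y, residuals):
    # A positive residual means the point lies above the line.
    print(xi, yi, round(res, 3))
```

A point above the line has a positive residual and a point below the line has a negative one.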
Figure 5.5 is a scatterplot, with empathy score as the explanatory variable x and brain activity as the response variable y. The plot shows a positive association. That is,
Subject             1      2      3      4      5      6      7      8
Empathy score      38     53     41     55     56     61     62     48
Brain activity  0.120  0.392  0.005  0.369  0.016  0.415  0.107  0.506

Subject             9     10     11     12     13     14     15     16
Empathy score      43     47     56     65     19     61     32    105
Brain activity  0.153  0.745  0.255  0.574  0.210  0.722  0.358  0.779