Figures 5.5 and 5.6 show one unusual observation. Subject 16 is an outlier in the x direction, with an empathy score 40 points higher than any other subject. Because of its extreme position on the empathy scale, this point has a strong influence on the correlation. Dropping Subject 16 reduces the correlation from r = 0.515 to r = 0.331.
You can see that this point extends the linear pattern in Figure 5.5 and so increases the correlation. We say that Subject 16 is influential for calculating the correlation.
Speed     Fuel   Residual
   10    21.00      10.09
   20    13.00       2.24
   30    10.00      −0.62
   40     8.00      −2.47
   50     7.00      −3.33
   60     5.90      −4.28
   70     6.30      −3.73
   80     6.95      −2.94
   90     7.57      −2.17
  100     8.27      −1.32
  110     9.03      −0.42
  120     9.87       0.57
  130    10.79       1.64
  140    11.77       2.76
  150    12.83       3.97
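The residual column can be checked directly. The sketch below assumes that the residuals come from the least-squares line for predicting fuel from speed (the table itself does not say so); the helper name `least_squares` is ours. Under that assumption the fitted line has intercept about 11.058 and slope about −0.01466, and the computed residuals agree in size with the tabulated ones.

```python
# Speed and fuel values copied from the table above.
speed = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
fuel = [21.00, 13.00, 10.00, 8.00, 7.00, 5.90, 6.30, 6.95,
        7.57, 8.27, 9.03, 9.87, 10.79, 11.77, 12.83]


def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = least_squares(speed, fuel)

# Residual = observed fuel minus the fuel predicted by the line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(speed, fuel)]

for xi, yi, ri in zip(speed, fuel, residuals):
    print(f"{xi:5d} {yi:8.2f} {ri:8.2f}")
```

The residuals are positive at both ends of the speed range and negative in the middle, and they sum to zero, as least-squares residuals always do.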
INFLUENTIAL OBSERVATIONS
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation.
The result of a statistical calculation may be of little practical use if it depends strongly on a few influential observations.
Points that are outliers in either the x or the y direction of a scatterplot are often influential for the correlation. Points that are outliers in the x direction are often influential for the least-squares regression line.
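The definition suggests a direct check: remove each observation in turn and see how much the result changes. The sketch below does this for the correlation. The data are hypothetical (they are not the empathy data), and the function names `correlation` and `leave_one_out_changes` are ours.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation r of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def leave_one_out_changes(x, y):
    """Change in r caused by removing each observation in turn."""
    r_all = correlation(x, y)
    changes = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        changes.append(correlation(x_rest, y_rest) - r_all)
    return changes


# Hypothetical data: eight clustered points plus one outlier in x (the last).
x = [1, 2, 3, 4, 5, 6, 7, 8, 30]
y = [3, 1, 4, 2, 5, 3, 6, 4, 20]

changes = leave_one_out_changes(x, y)
for i, change in enumerate(changes):
    print(f"remove observation {i + 1}: r changes by {change:+.3f}")
```

For these made-up data, removing the last observation lowers r far more than removing any other point, so that observation is the influential one for the correlation.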
EXAMPLE 5.6
An influential observation? Subject 16 in Example 5.5 is influential for the correlation between empathy score and brain activity because removing it reduces r from 0.515 to 0.331. Calculating that r = 0.515 is not a very useful description of the data, because the value depends so strongly on just one of the 16 subjects.
Is this observation also influential for the least-squares line? Figure 5.7 shows that it is not. The regression line calculated without Subject 16 (dashed) differs little from the line that uses all the observations (solid). The reason that the outlier has little influence on the regression line is that it lies close to the dashed regression line calculated from the other observations. ■
FIGURE 5.7
Scatterplot of brain activity against empathy score, with Subject 16 labeled and the annotation “Removing Subject 16 moves the regression line only a little.” Subject 16 is an outlier in the x direction. The outlier is not influential for least-squares regression, because removing it moves the regression line only a little.
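A small numerical sketch shows how one point can be influential for the correlation yet not for the least-squares line. The data are hypothetical (not the empathy data), and to keep the arithmetic transparent we assume the outlier lies exactly on the line fitted to the other points, the extreme case of lying close to it. The function name `summarize` is ours.

```python
from math import sqrt


def summarize(x, y):
    """Return (intercept, slope, r) for the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy / sqrt(sxx * syy)


# Hypothetical data: ten points with moderate scatter about a rising line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.0, 1.0, 3.5, 2.5, 4.0, 2.5, 5.5, 4.0, 5.0, 6.0]

a0, b0, r0 = summarize(x, y)

# Assumption: an outlier at x = 30 placed exactly on the line fitted above.
x_out = 30
y_out = a0 + b0 * x_out

a1, b1, r1 = summarize(x + [x_out], y + [y_out])

print(f"without outlier: intercept {a0:.3f}, slope {b0:.3f}, r {r0:.3f}")
print(f"with outlier:    intercept {a1:.3f}, slope {b1:.3f}, r {r1:.3f}")
```

The intercept and slope are unchanged because the new point has residual zero, while r rises because the point extends the linear pattern.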
To see why points that are outliers in the x direction are often influential for regression, let’s try an experiment. Suppose that Subject 16’s point in the scatterplot moves straight down. What happens to the regression line? Figure 5.8 gives the answer. The dashed line is the regression line with the outlier in its new, lower position. Because there are no other points with similar x-values, the line chases the outlier. The Correlation and Regression applet allows you to try this experiment yourself; see Exercise 5.9. An outlier in x pulls the least-squares line toward itself. If the outlier does not lie close to the line calculated from the other observations, it will be influential.
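The experiment can also be tried numerically. The sketch below uses hypothetical data (not the empathy data) with one outlier at x = 30. It lowers the outlier in steps and, for comparison, lowers a point near the middle of the x-values by the same amounts. The names `least_squares` and `slope_after_drop` are ours.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: ten clustered points plus an outlier in x (the last).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
y = [2.0, 1.0, 3.5, 2.5, 4.0, 2.5, 5.5, 4.0, 5.0, 6.0, 16.0]

OUTLIER = len(x) - 1  # index of the point at x = 30
MIDDLE = 7            # index of the point at x = 8, near the mean of x
DROPS = (0, 4, 8, 12)  # how far the chosen point is moved straight down


def slope_after_drop(index, drop):
    """Slope of the least-squares line after lowering one point by `drop`."""
    moved = list(y)
    moved[index] -= drop
    return least_squares(x, moved)[1]


outlier_slopes = [slope_after_drop(OUTLIER, d) for d in DROPS]
middle_slopes = [slope_after_drop(MIDDLE, d) for d in DROPS]

for d, s_out, s_mid in zip(DROPS, outlier_slopes, middle_slopes):
    print(f"drop {d:2d}: slope {s_out:.3f} (outlier moved), "
          f"{s_mid:.3f} (middle point moved)")
```

Moving the outlier changes the slope steadily, while moving the middle point by the same amount barely changes it. A point's pull on the slope is proportional to its distance from the mean of x, which is why an isolated point far out in x has so much leverage.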
We did not need the distinction between outliers and influential observations in Chapter 2. A single high salary that pulls up the mean salary x̄ for a group of workers is an outlier because it lies far above the other salaries. It is also influential, because the mean changes when it is removed. In the regression setting, however, not all outliers are influential.
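The salary case takes only a few lines to check. The figures below are hypothetical, chosen only to show one high salary pulling up the mean.

```python
from statistics import mean

# Hypothetical salaries: five ordinary values and one very high one (the last).
salaries = [42_000, 45_000, 47_000, 50_000, 52_000, 400_000]

mean_all = mean(salaries)
mean_without = mean(salaries[:-1])  # drop the high salary

print(f"mean with the high salary:    {mean_all:,.0f}")
print(f"mean without the high salary: {mean_without:,.0f}")
```

With one variable there is no line for the point to sit close to, so an outlier always moves the mean when it is removed.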
FIGURE 5.8
Scatterplot of brain activity against empathy score, with the annotations “Move the outlier down ...” and “... and the least-squares line chases it down.” An outlier in the x direction pulls the least-squares line to itself because there are no other observations with similar values of x to hold the line in place. When the outlier moves down, the regression line chases it down. The original regression line is solid, and the final position of the regression line is dashed.
APPLY YOUR KNOWLEDGE
5.9 Influence in regression. The Correlation and Regression applet allows you to animate Figure 5.8. Click to create a group of 10 points in the lower-left corner of the scatterplot with a strong straight-line pattern (correlation about 0.9). Click the “Show least-squares line” box to display the regression line.
(a) Add 1 point at the upper right that is far from the other 10 points but exactly on the regression line. Why does this outlier have no effect on the line even though it changes the correlation?
(b) Now use the mouse to drag this last point straight down. You see that one end of the least-squares line chases this single point, while the other end remains near the middle of the original group of 10. What makes the last point so influential?
5.10 Do heavier people burn more energy? Return to the data of Exercise 5.4 (page 132) on body mass and metabolic rate. We will use these data to illustrate influence.
(a) Make a scatterplot of the data that is suitable for predicting metabolic rate from body mass, with two new points added. Point A: mass 42 kilograms, metabolic rate 1500 calories. Point B: mass 70 kilograms, metabolic rate 1400 calories. In which direction is each of these points an outlier?
(b) Add three least-squares regression lines to your plot: for the original 12 women, for the original women plus Point A, and for the original women plus Point B. Which new point is more influential for the regression line? Explain in simple language why each new point moves the line in the way your graph shows. METABOLIC2
5.11 Outsourcing by airlines. Exercise 4.5 (page 101) gives data for 12 airlines on the percent of major maintenance outsourced and the percent of flight delays blamed on the airline.
(a) Make a scatterplot with outsourcing percent as x and delay percent as y. Would you consider Hawaiian Airlines to be influential?
(b) Find the correlation r with and without Hawaiian Airlines. How influential is the outlier for correlation?
(c) Find the least-squares line for predicting y from x with and without Hawaiian Airlines. Draw both lines on your scatterplot. Use both lines to predict the percent of delays blamed on an airline that has outsourced 74.1% of its major maintenance. How influential is the outlier for the least-squares line? AIRLINES