Which Points Have the Influence?

In document Statistics in Action (Page 171-176)

Not all data points are created equal. You saw in the calculation of the correlation in Display 3.42 on page 142 that some points make large contributions and some small. Some make positive contributions and some negative. Your goal is to learn to recognize the points in a data set that might have an unusually large influence on where the regression line goes or on the size and sign of the correlation.

Just as among people, some data points have more influence than others.

© 2008 Key Curriculum Press

Lesson http://acr.keypress.com/KeyPressPortalV3.0/Viewer/Lesson.htm

3.4 Diagnostics: Looking for Features That the Summaries Miss 163

Near and Far

What you’ll need: an open area in which to step off distances

In this activity, you compare the actual distance to an object with what the distance appears to be.

1. Go to an open area, such as the hall or lawn of your school, and pick a spot as your origin. Choose six objects at various distances from the origin. Five of the objects should be within 10 to 20 paces, and the other should be a long way away (at least 100 paces).

2. For each of the six objects, estimate the number of paces from the origin to the object. Record your estimates.

3. From your origin, walk to each of your objects and count the actual number of regular paces it takes you to get there. Record this number beside your estimate.

4. Plot your data on a scatterplot, with your estimated value on the x-axis and the actual value on the y-axis. Does the plot show a linear trend?

5. Determine the equation of the regression line, and calculate the correlation.

6. Delete the point for the object that is farthest away from the origin. Determine the equation of the regression line and calculate the correlation for the reduced data set.

7. Did the extreme point have any influence on the regression line? On the correlation? Explain.

In Chapter 2, you learned about outliers for distributions—values that are separated from the bulk of the data. Outliers are atypical cases, and they can exert more than their share of influence on the mean and standard deviation. For scatterplots, as you will soon see, working with two variables together means that there can be outliers of various kinds. Different kinds of outliers can have different types of influence on the least squares line and the correlation. Unfortunately, there is no rule you can use to identify outliers in bivariate data. Just look for points surrounded by white space.

Judging a Point’s Influence

Points separated from the bulk of the data by white space are outliers and are potentially influential. To judge a point’s influence, compare the regression equation and correlation computed first with and then without the point in question.
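The with-and-without comparison can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this sketch, and the paces data are hypothetical values shaped like the Near and Far activity (five nearby objects and one far away), not measurements from the text.

```python
from statistics import mean

def fit_line_and_correlation(xs, ys):
    """Return (intercept, slope, r) for the least squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return intercept, slope, r

def compare_with_and_without(xs, ys, index):
    """Fit once with every point, then again with the point at `index` removed."""
    full = fit_line_and_correlation(xs, ys)
    reduced = fit_line_and_correlation(
        xs[:index] + xs[index + 1:], ys[:index] + ys[index + 1:]
    )
    return full, reduced

# Hypothetical data: estimated paces (x) and actual paces (y) for six objects.
estimated = [10, 12, 15, 18, 20, 110]
actual = [12, 11, 17, 16, 22, 120]

# The last point (index 5) is the far-away object.
full, reduced = compare_with_and_without(estimated, actual, 5)
print("with point:    intercept=%.2f slope=%.2f r=%.3f" % full)
print("without point: intercept=%.2f slope=%.2f r=%.3f" % reduced)
```

Reporting both lines of output, rather than only one, follows the advice in the box: the reader sees how much the summaries depend on the single questionable point.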

To see these ideas in action, turn to the data on mammal longevity in Display 2.24 on page 43 and think about how to summarize the relationship between maximum and average longevity.

164 Chapter 3 Relationships Between Two Quantitative Variables

Example: Influential Mammals

The average elephant lives 35 years. The oldest elephant on record lived 70 years. The average hippo lives 41 years—longer than the average elephant—but the record-holding hippo lived only 54 years. The oldest-known beaver lived 50 years, almost as long as the champion hippo, but the average beaver cashes in his wood chips after only 5 short years of making them. Other mammals, however, are more predictable. If you look at the entire sample, shown in Display 3.62, it turns out that the elephant (E), hippo (H), and beaver (B) are the oddballs of the bunch. For the rest, there’s an almost linear relationship between average longevity and maximum longevity. The least squares line for the entire sample has the equation

M̂ = 10.53 + 1.58A

where M̂, or “M-hat,” stands for predicted maximum longevity and A stands for observed average longevity. For every increase of 1 year in average longevity, the model predicts a 1.58-year increase in maximum longevity. The correlation for the relationship between these two variables is 0.77. How much influence do the oddballs have on these summaries?

Display 3.62 Maximum longevity versus average longevity.

Solution

The hippo has the effect of pulling the right end of the regression line downward (like putting a heavy weight on one end of a seesaw), as you can see in Display 3.63. When the hippo is removed, that end of the regression line will “spring upward” and the slope will increase. Because one large residual has been removed and many of the remaining residuals have been reduced in size, the correlation will increase. The new slope is 1.96, and the new correlation is 0.80. The hippo has considerable influence on the slope and some influence on the correlation.

Now envision the scatterplot with just the elephant, E, missing. Because E is close to the straight line fit to the data, it produces a small residual. Thus, you would expect that removing E should not change the slope of the regression line much (not nearly as much as removing H did) and should reduce the correlation just a bit. In fact, the correlation does decrease some, to 0.72 from 0.77. However, the new slope is 1.53. It turns out that removing the elephant gives the hippo even more influence, and the slope decreases.

Points surrounded by white space might have strong influence.

Display 3.63 Regression lines for maximum longevity versus average longevity, with and without the hippo.

Finally, envision the scatterplot with just the beaver, B, removed. B produces a large, positive residual close to the left end of the regression line. Thus, removing B should allow the left end of the line to drop, increasing the slope, and removing a large residual should increase the correlation. The new slope is 1.69 (an increase from 1.58), and the new correlation is 0.83 (an increase from 0.77). The beaver also has considerable influence on both slope and correlation.
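The sizes of the three residuals discussed in this solution can be checked from the full-sample equation, maximum = 10.53 + 1.58(average), and the longevities quoted at the start of the example. The short Python sketch below does only that arithmetic. The function names are invented for illustration, and because the coefficients are rounded, the residuals are approximate.

```python
# Coefficients of the least squares line for the entire sample, as quoted in the text.
INTERCEPT = 10.53
SLOPE = 1.58

def predicted_max(average):
    """Predicted maximum longevity (years) for a given average longevity."""
    return INTERCEPT + SLOPE * average

def residual(average, maximum):
    """Observed maximum minus predicted maximum."""
    return maximum - predicted_max(average)

# (average longevity, maximum longevity) in years, taken from the example text.
mammals = {"elephant": (35, 70), "hippo": (41, 54), "beaver": (5, 50)}

for name, (avg, mx) in mammals.items():
    print(f"{name}: predicted {predicted_max(avg):.2f}, "
          f"residual {residual(avg, mx):+.2f}")
```

A residual near zero means the point lies close to the line. A large negative residual puts the point well below the line, and a large positive one puts it well above.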

With a little practice, you often can anticipate the influence of certain points in a scatterplot, as in the previous example, but it is difficult to state general rules. The best rule is the one given in the box on page 163: Fit the line with and without the questionable point and see what happens. Then report all the results, with appropriate explanations.

Why the Anscombe Data Sets Are Important

Display 3.64 shows four scatterplots. These plots, known as “the Anscombe data” after their inventor, are arguably the most famous set of scatterplots in all of statistics. The questions that follow invite you to figure out why statistics books refer to them so often. In the process, you’ll learn more about what a summary doesn’t tell you about a data set.

Display 3.64 Four regression data sets invented by Francis J. Anscombe. [Source: Francis J. Anscombe, “Graphs in Statistical Analysis,” American Statistician 27 (1973): 17–21.]

D26. For each plot in Display 3.64, first give a short verbal description of the pattern in the plot. Then

a. either fit a line by eye and estimate its slope or tell why you think a line is not a good summary

b. either estimate the correlation by eye or tell why you think a correlation is not an appropriate summary

D27. Display 3.65 shows a computer output for one of the four Anscombe data set plots. Can you tell which one? If so, tell how you know. If not, explain why you can’t tell.

D28. Display 3.66 lists values for the Anscombe plots.

Display 3.65 Regression analysis for one of the Anscombe data sets.

Display 3.66 Anscombe plot data values.

a. Which plot has a point that is highly influential both with respect to the slope of the regression line and with respect to the correlation?

b. Compared to the other points in the plot, does the influential point lie far from the least squares line or close to it?

c. How would the slope and correlation change if you were to remove this point? Discuss this first without actually performing the calculations. Then carry out the calculations to verify your conjectures.
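For the "carry out the calculations" step in part c, a sketch along the following lines may help. The data values typed in below are the ones commonly reproduced from Anscombe's 1973 article. Treat them as an assumption and check them against Display 3.66 before relying on the results. The names `ANSCOMBE`, `summarize`, and `drop_point` are made up for this illustration.

```python
from statistics import mean

# Assumed values: Anscombe's four data sets as commonly reproduced.
# Verify each list against Display 3.66 before use.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return (slope, intercept, r) of the least squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar, sxy / (sxx * syy) ** 0.5

def drop_point(xs, ys, index):
    """Return copies of xs and ys with the point at `index` removed."""
    return xs[:index] + xs[index + 1:], ys[:index] + ys[index + 1:]

# Summaries for each complete data set.
for name, (xs, ys) in ANSCOMBE.items():
    print(name, "slope=%.3f intercept=%.2f r=%.3f" % summarize(xs, ys))

# Arbitrary example: refit plot I without its first listed point.
print("I without point 0:",
      "slope=%.3f intercept=%.2f r=%.3f" % summarize(*drop_point(*ANSCOMBE["I"], 0)))
```

To check your conjecture, call `drop_point` with the plot and the index of the point you chose in part a, then compare the new summary with the original. If every x-value that remains is the same, the slope and correlation are undefined and `summarize` will raise a `ZeroDivisionError`.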
