As you can see from the Anscombe plots, there are many features of the shape of a scatterplot that you can’t learn from the standard set of summary numbers. Only when the cloud of points is elliptical, as in Display 3.41 on page 140, does the least squares line, together with the correlation, give a good summary of the relationship described by the plot. If the cloud of points isn’t elliptical, these summaries aren’t appropriate. How can you decide?
© 2008 Key Curriculum Press
Lesson http://acr.keypress.com/KeyPressPortalV3.0/Viewer/Lesson.htm
168 Chapter 3 Relationships Between Two Quantitative Variables
A special kind of scatterplot, called a residual plot, often can help you see more clearly what’s going on. For some data sets, a residual plot can even show you patterns you might otherwise have overlooked completely. Statisticians use residual plots the way a doctor uses a microscope or an X ray, to get a better look at less obvious aspects of a situation. (Plots you use in this way are called “diagnostic plots” because of the parallel with medical diagnosis.) Push the analogy just a little. You’re the doctor, and data sets are your patients. Sets with elliptical clouds of points are the “healthy” ones; they don’t need special attention.
A residual plot is a scatterplot of residuals, y − ŷ, versus predictor values, x (or, sometimes, versus predicted values, ŷ).
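The definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five (x, y) pairs are made-up numbers, not data from this chapter, and the helper names `least_squares_line` and `residuals` are hypothetical. The sketch fits the least squares line, subtracts each predicted value from the observed value, and pairs each residual with its x to give the points of a residual plot.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys):
    """Return observed minus predicted for every point."""
    intercept, slope = least_squares_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up illustrative data (an assumption, not taken from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

res = residuals(xs, ys)

# The points of the residual plot: x on the horizontal axis,
# residual on the vertical axis.
residual_plot_points = list(zip(xs, res))
for x, r in residual_plot_points:
    print(f"x = {x}, residual = {r:+.2f}")
```

The pairs in `residual_plot_points` can be handed to any plotting tool. Residuals from a least squares line with an intercept always sum to zero, which is why 0 sits in the middle of the vertical scale.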
Example: Constructing a Residual Plot
Return to the data on percentage of on-time arrivals versus mishandled baggage for airlines, introduced in P2 on page 110. Calculate the residuals and make a plot.
Solution
Visualize each residual, the difference between the observed value of y and the predicted value, ŷ, as a vertical segment on the scatterplot in Display 3.67.
Display 3.67 Scatterplot of airline data.
The calculated residuals are shown in Display 3.68, with the list of carriers ordered from smallest to largest on the x-scale. This allows the size of the residuals in the far right column to appear in the same order as in Display 3.67. Alaska produces a negative residual of modest size, whereas US Airways produces a large positive residual.
The residual plot, Display 3.69, is simply a scatterplot of the residuals versus the original x-variable, mishandled baggage. Note that 0 is at the middle of the residuals on the vertical scale.
Residual plots may uncover more detailed patterns.
3.4 Diagnostics: Looking for Features That the Summaries Miss 169
Display 3.68 Table showing residuals for the airline data.
Display 3.69 Residual plot for the airline data.
The residual plot shows nearly random scatter, with no obvious trends. This is the ideal shape for a residual plot, because it indicates that a straight line is a reasonable model for the trend in the original data. [You can use your calculator to create residual plots. See Calculator Note 3I.]
Residual Plots
D29. In Display 3.69, identify which residual belongs to Delta and which to Northwest.
D30. To see how residual plots magnify departures from the regression line, compare the Anscombe plots in Display 3.64 with Display 3.70, which shows the four corresponding residual plots in scrambled order.
Display 3.70 Residual plots for the four Anscombe data sets.
a. Match each of the original scatterplots in Display 3.64 with its corresponding residual plot in Display 3.70.
b. Describe the overall difference between the original scatterplots and the residual plots. What do the scatterplots show that the residual plots don’t? What do the residual plots show that the scatterplots don’t?
What to Look For in a Residual Plot
A careful data analyst always looks at a residual plot.
If the original cloud of points is elliptical, so that a line is an appropriate summary, the residual plot will look like a random scatter of points.
Use residual plots to check for systematic departures from constant slope (linear trend) and constant strength (same vertical spread). Look in particular for plots that are curved or fan-shaped. It’s true that for data sets with only one predictor variable (like those in this chapter), you often can get a good idea of what the residual plot will look like by carefully inspecting the original scatterplot. Once in a while, however, you get a surprise.
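A small Python sketch can show what a curved residual plot looks like in numbers. The data here are an assumption chosen for illustration (y = x² at x = 0, 1, 2, 3, 4), and the helper name `line_residuals` is hypothetical. A straight line fitted to these points leaves positive residuals at both ends and negative residuals in the middle, the kind of systematic pattern that signals curvature.

```python
def line_residuals(xs, ys):
    """Fit the least squares line and return the residuals, observed minus predicted."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up curved data (an assumption for illustration): y = x squared.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

res = line_residuals(xs, ys)

# Read the signs from left to right; a random scatter would not
# show one long run of the same sign in the middle.
signs = "".join("+" if r > 0 else "-" for r in res)
print(res)
print(signs)
```

For these numbers the sign pattern is `+---+`: the line sits below the data at the ends and above it in the middle.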
Example: Interpreting a Residual Plot
E19 on page 136 introduced data on median height versus age for young girls. Display 3.71 shows the scatterplot of these data, with the regression line. The overall average growth rate for the 12-year period is the slope of the regression line.
Residual plots sometimes yield surprises.
The plot looks nearly linear, but is a line a suitable model?
Display 3.71 Median height versus age for young girls.
Solution
The residual plot, shown in Display 3.72, quite dramatically reveals that the trend is not as linear as first imagined. The curvature in the residual plot mimics the curvature in the original scatterplot, which is harder to see. A line is not a good model for these data.
Display 3.72 Residual plot of median height versus age for young girls.
Statistical software often plots residuals against the predicted values, ŷ, rather than against the predictor values, x. For simple linear regression, both plots have exactly the same shape as long as the slope of the regression line is positive.
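The claim about the two kinds of residual plot can be checked with a short Python sketch. The data are made-up numbers and the function names are hypothetical. Because each predicted value is a linear function of x, sorting the points by predicted value keeps the left-to-right order of the residuals when the slope is positive and reverses it when the slope is negative.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals_left_to_right(xs, ys):
    """Return the slope and the residuals read left to right in each kind of plot."""
    intercept, slope = fit_line(xs, ys)
    fitted = [intercept + slope * x for x in xs]
    res = [y - f for y, f in zip(ys, fitted)]
    by_x = [r for _, r in sorted(zip(xs, res), key=lambda p: p[0])]
    by_fitted = [r for _, r in sorted(zip(fitted, res), key=lambda p: p[0])]
    return slope, by_x, by_fitted


# Made-up illustrative data (an assumption, not from the text).
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 5, 4, 5]          # positive slope
ys_down = [-y for y in ys_up]    # same pattern flipped, negative slope

slope_up, up_by_x, up_by_fitted = residuals_left_to_right(xs, ys_up)
slope_down, down_by_x, down_by_fitted = residuals_left_to_right(xs, ys_down)

print(slope_up > 0, up_by_x == up_by_fitted)
print(slope_down < 0, down_by_fitted == list(reversed(down_by_x)))
```

The horizontal spacing of the points is also stretched or shrunk by the size of the slope, but that only rescales the axis and does not change the pattern.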
Types of Residual Plots
D31. Display 3.73 shows a scatterplot and two residual plots for the “data set” consisting of these three ordered pairs (x, y): (0, 1), (1, 0), and (2, 2). One residual plot plots residuals versus predictor values, x, the sort of plot you get from graphing calculators. The other plots residuals versus predicted (fitted) y-values, or ŷ, the sort of plot you get from computer software packages. Explain how the residual plots were produced and how you can tell which residual plot is which. The equation of the least squares line is ŷ = 0.5 + 0.5x.
Residuals sometimes are plotted against the predicted values, ŷ.
Display 3.73 A scatterplot and two residual plots.
Summary 3.4: Diagnostics: Looking for Features That the Summaries Miss
For the simplest clouds of data points—elliptical in shape, with linear trend and no outliers—you can summarize all the main features of a scatterplot with just a few numbers, mainly the slope of the fitted line, y-intercept, and correlation. Not all plots are this simple, however, and a good statistician always does diagnostic checks for outliers and influential points and for departures from constant slope or constant strength.
• Points separated from the bulk of the data by white space are outliers and potentially influential.
• To judge a point’s influence, fit a line to the data and compute a regression equation and a correlation first with and then without the point in question. If the change in the regression equation and correlation is meaningful in your situation, report both sets of summary statistics.
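The with-and-without comparison in the second bullet can be sketched in Python. The five points below are made-up numbers chosen so that one point sits far from the rest, and the helper name `summarize` is hypothetical. The sketch computes the slope, intercept, and correlation twice, once with every point and once with the suspect point left out.

```python
from math import sqrt


def summarize(points):
    """Return (slope, intercept, correlation) for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Made-up illustrative data (an assumption): four points on a line
# plus one point separated from the rest by white space.
points = [(1, 2), (2, 3), (3, 4), (4, 5), (10, 2)]
suspect = (10, 2)

with_point = summarize(points)
without_point = summarize([p for p in points if p != suspect])

for label, (slope, intercept, r) in [("with", with_point), ("without", without_point)]:
    print(f"{label:>7}: slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.2f}")
```

In this made-up case the single point turns a perfect positive linear relationship into a slightly negative one, so both sets of summary statistics would be worth reporting.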
For some data sets, a residual plot can show patterns you might otherwise overlook. A residual plot is a scatterplot of residuals, y − ŷ, versus predictor values, x. A residual plot also can be constructed as a scatterplot of residuals, y − ŷ, versus fitted values, ŷ. Use residual plots to check for systematic departures from linearity and for constant variability in y across the values of x. If the data aren’t linear, the residual plot doesn’t look random. If the data have nonconstant variability, the residual plot is fan-shaped.
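A fan shape can also be seen numerically. The following Python sketch uses made-up data in which the vertical spread grows with x; the data and the helper name `line_residuals` are assumptions for illustration only. It groups the residuals by x and reports the range of the residuals in each group, which widens from left to right when the residual plot is fan-shaped.

```python
def line_residuals(xs, ys):
    """Fit the least squares line and return the residuals, observed minus predicted."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up illustrative data (an assumption): two y-values at each x,
# spaced farther apart as x increases.
xs = [1, 1, 2, 2, 3, 3, 4, 4]
ys = [1.1, 0.9, 2.2, 1.8, 3.3, 2.7, 4.4, 3.6]

res = line_residuals(xs, ys)

# Collect the residuals at each x, then measure their vertical spread.
groups = {}
for x, r in zip(xs, res):
    groups.setdefault(x, []).append(r)

spread_by_x = {x: max(rs) - min(rs) for x, rs in sorted(groups.items())}
for x, spread in spread_by_x.items():
    print(f"x = {x}: spread of residuals = {spread:.1f}")
```

With real data there are rarely repeated x-values, so in practice you would compare the spread across bands of x rather than at single values; the grouping here is a simplification.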
Practice
Which Points Have the Influence?
P22. The data in Display 3.74 show some interesting patterns in the relationship between domestic and international gross income from the ten movies with the highest domestic gross ticket sales.
a. Construct a scatterplot suitable for predicting international sales from domestic sales. Describe the pattern in the data.
b. Find the least squares line and the correlation for these data.
c. Remove the most influential data point and recalculate the least squares line and correlation. Describe the influence of the removed point.
Display 3.74 Ticket sales for the ten highest-grossing domestic (United States and Canada) movies of all time. [Source: Internet Movie Database, us.imdb.com, September 12, 2006.]
P23. A data table and scatterplot of one student’s results from Activity 3.4a are shown in Display 3.75.
a. How well did the student do in estimating the number of paces?
b. Which point appears to be most influential?
c. Calculate the slope of the regression line and the correlation with and without this point. Describe the influence of this point.
Display 3.75 Sample data from Activity 3.4a.
Residual Plots
P24. For the set of (x, y) pairs (0, 0), (0, 1), (1, 1), and (3, 2), the equation of the least squares line is ŷ = 0.5 + 0.5x.
a. Plot the data and graph the least squares line.
b. Next complete a table for the predicted values and residuals, like the table in Display 3.68 on page 169.
c. Using the values in your table, plot residuals versus predictor, x.
d. How does the residual plot differ from the scatterplot?
P25. Display 3.76 shows four scatterplots (A–D) for the data from a sample of commercial aircraft. Display 3.77 shows four corresponding residual plots (I–IV).
a. Match the residual plots to the scatterplots.
b. Using scatterplots A–D as examples, describe how you can identify each of these in a scatterplot from the residual plot.
i. a curve with increasing slope
ii. unequal variation in the responses
iii. a curve with decreasing slope
iv. two linear patterns with different slopes
c. For one of the plots, two line segments joined together seem to give a better fit than either a single line or a curve. Which plot is this? Is this pattern easier to see in the original scatterplot or in the residual plot?
Display 3.76 Four scatterplots for the sample of commercial aircraft.
Display 3.77 Four residual plots corresponding to the scatterplots in Display 3.76.
Exercises
E43. Extreme temperatures. The data in Display 3.78 provide the maximum and minimum temperatures ever recorded on each continent.
Display 3.78 Maximum and minimum recorded temperatures for the continents.
[Source: National Climatic Data Center, 2005, www.ncdc.noaa.gov .]
a. Construct a scatterplot of the data suitable for predicting the minimum temperature from a given maximum temperature. Is a straight line a good model for these points? Explain.
b. Fit a least squares line to the points and calculate the correlation, even if you thought in part a that a straight line was not a good model.
c. Explain, in words and numbers, what influence Antarctica has on the slope of the regression line and on the correlation. How could an account of these data be misleading if it were not accompanied by a plot?
Two climbers stand on Mount Erebus, Antarctica, 12,500 ft above sea level.
E44. The data and plot in Display 3.79 are from E15 on page 135. They show the arsenic concentrations in the toenails of 21 people who used water from their private wells. Both measurements are in parts per million.
Display 3.79 Arsenic concentrations.
a. Which point do you think has the most influence on the slope and correlation? What would be the effect of removing this point? Perform the calculations to see if your intuition is correct.
b. Find a point that you think has almost no influence on the slope and correlation. Perform the calculations to see if your intuition is correct.
c. Find a point whose removal you think would make the correlation increase. Perform the calculations to see if your intuition is correct.
E45. How effective is a disinfectant? The data in Display 3.80 show (coded) bacteria colony counts on skin samples before and after a disinfectant is applied.
Display 3.80 Coded bacteria colony counts before (x) and after (y) treatment. [Source: Snedecor and Cochran, Statistical Methods (Iowa State University Press, 1967), p. 422.]
a. Plot the data, fit a regression line to them, and complete a copy of the table, filling in the predicted values and residuals.
b. Plot the residuals versus x, the count before the treatment. Comment on the pattern.
c. Use the residual plot to determine for which skin sample the disinfectant was unusually effective and for which skin sample it was not very effective.
E46. Textbook prices. Display 3.81 compares recent prices at a college bookstore to those of a large online bookstore.
a. The equation of the regression line is online = −3.57 + 1.03 college. Interpret this equation in terms of textbook prices.
Display 3.81 Prices for a sample of textbooks at a college bookstore and an online bookstore.
b. Construct a residual plot. Interpret it and point out any interesting features.
c. In comparing the prices of the textbooks, you might be more interested in a different line: y = x. Draw this line on a copy of the scatterplot in Display 3.81. What does it mean if a point lies above this line? Below it? On it?
d. A boxplot of the differences college price − online price is shown in Display 3.82. Interpret this boxplot.
Display 3.82 A boxplot of the differences between the college price and the online price for various textbooks.
E47. Pizzas, again. Display 3.83 shows the pizza data from E12 on page 134, with its regression line.
Display 3.83 Calories versus fat, per 5-oz serving, for seven kinds of pizza.
a. Estimate the residuals from the graph, and use your estimates to sketch a rough version of a residual plot for this data set.
b. Which pizza has the largest positive residual? The largest negative residual? Are any of the residuals so extreme as to suggest that those pizzas should be regarded as exceptions?
c. Is any one of the pizzas a highly influential data point? If so, specify which one(s), and describe the effect on the slope of the fitted line and the correlation of removing the influential point or points from the analysis.
E48. Aircraft. Look again at Display 3.76 on page 174, which shows a scatterplot of flight length versus number of seats.
a. Does the slope of the pattern increase, decrease, or stay roughly constant as you move from left to right across the plot?
b. Focusing on the variation (spread) in flight length, y, for planes with roughly the same seating capacity, compare the spreads for planes with few seats, a moderate number of seats, and a large number of seats. As you move from left to right across the plot, how does the spread change, if at all?
c. Suppose a friend chose a plane from the sample at random and told you the approximate number of seats. Could you guess its flight length to within 500 miles if the number of seats was between 50 and 150? If it was between 200 and 300? Explain.
d. What is the relationship between your answer in part b and residual plot I in Display 3.77?
e. Give an explanation for why the variation in flight length shows the pattern it does.
E49. Match each scatterplot (A–D) in Display 3.84 with its residual plot (I–IV) in Display 3.85. For which plots is a linear regression appropriate?
Display 3.84 Four scatterplots.
Display 3.85 Four residual plots.
E50. Can either of the plots in Display 3.86 be a residual plot? Explain your reasoning.
Display 3.86 Residual Plots?
E51. Display 3.87 gives the data set for the three passenger jets from the example on page 123, along with a scatterplot showing the least squares line. (Values have been rounded.)
a. Use the equation of the line to find predicted values and residuals to complete the table in Display 3.87.
b. Use your numbers from part a to construct two residual plots, one with the predictor, x, on the horizontal axis and the other with the predicted value, ŷ, on the horizontal axis. How do the two plots differ?
Display 3.87 Cost per hour versus number of seats for three models of passenger aircraft.
E52. Explain why a residual plot of (x, residual) and a plot of (predicted value, residual) have exactly the same shape if the slope of the regression line is positive. What changes if the slope is negative?
E53. Can you recapture the scatterplot from the residual plot? The residual plot in Display 3.88 was calculated from data showing the recommended weight (in pounds) for men at various heights over 64 in. The fitted weights ranged from 145 lb to 187 lb. Make a rough sketch of the scatterplot of these data.
Display 3.88 Residuals of recommended weight versus height for men.
E54. The plot in Display 3.89 shows the residuals resulting from fitting a line to the data for female life expectancy (life exp) versus gross national product (GNP, in thousands of dollars per capita) for a sample of countries from around the world. The regression equation for the sample data was
life exp = 67.00 + 0.63 GNP
Sketch the scatterplot of life exp versus GNP.
Display 3.89 Residuals of female life expectancy versus gross national product.
3.5 Shape-Changing Transformations 179