Box–Cox Transformations

4. The error term ε is a normally distributed random variable. In other words, the values of the error term εi are independent normal random variables with mean zero and variance σ².

In the regression model, when β1 = 0, the regression equation becomes y = β0 + ε, so there no longer exists a linear relationship between x and y. On the other hand, if β1 takes on any conceivable value other than zero, a linear relationship of some kind exists between the response and the predictor. We may use this key idea to apply regression-based inference. For example, the t-test tests directly whether β1 = 0, with the null hypothesis representing the claim that no linear relationship exists. We may also construct a confidence interval for the true slope of the regression line. If the confidence interval includes zero, we cannot rule out β1 = 0, so the data do not provide evidence of a linear relationship at that confidence level.
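As a rough illustration of the slope inference described above, the following sketch computes the least-squares estimates, the t-statistic for the null hypothesis β1 = 0, and a confidence interval for the slope. The data are invented, the function name `slope_inference` is hypothetical, and the critical value is assumed to be looked up in a t table rather than computed.

```python
import math


def slope_inference(x, y, t_crit):
    """Fit y = b0 + b1*x by least squares and summarize inference on the slope.

    t_crit is the two-sided critical value of the t distribution with
    n - 2 degrees of freedom; it is passed in rather than computed here.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx                 # estimated slope
    b0 = y_bar - b1 * x_bar        # estimated intercept

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))   # standard error of the estimate
    se_b1 = s / math.sqrt(sxx)     # standard error of the slope

    t_stat = b1 / se_b1            # test statistic for H0: beta1 = 0
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return {"b0": b0, "b1": b1, "se_b1": se_b1, "t": t_stat, "ci": ci}


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

T_CRIT_95_DF3 = 3.182  # t(0.025) with 3 degrees of freedom, from a t table

result = slope_inference(x, y, T_CRIT_95_DF3)
reject_h0 = abs(result["t"]) > T_CRIT_95_DF3
ci_low, ci_high = result["ci"]
print(f"b1 = {result['b1']:.3f}, t = {result['t']:.2f}, reject H0: {reject_h0}")
print(f"95% CI for slope: ({ci_low:.3f}, {ci_high:.3f})")
```

For these made-up points the interval lies entirely above zero, which agrees with the t-test rejecting β1 = 0. The two procedures always agree when they use the same critical value.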

SPH

JWDD006-02 JWDD006-Larose November 23, 2005 14:50 Char Count= 0

86 CHAPTER 2 REGRESSION MODELING

Point estimates for values of the response variable for a given value of the predictor variable may be obtained by an application of the estimated regression equation ŷ = b0 + b1x. Unfortunately, these kinds of point estimates do not provide a probability statement regarding their accuracy. We may therefore construct two types of intervals: (1) a confidence interval for the mean value of y given x, and (2) a prediction interval for the value of a randomly chosen y given x. The prediction interval is always wider, since predicting an individual value must allow for the scatter of individual observations about the mean as well as the uncertainty in the estimated mean.
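The two intervals can be compared with a short sketch. It uses the standard simple-regression formulas, in which the prediction interval adds a 1 under the square root. The data are invented, the function name `intervals_at` is hypothetical, and the t critical value is again assumed to come from a table.

```python
import math


def intervals_at(x, y, x0, t_crit):
    """Return the fit at x0, a CI for the mean of y at x0, and a PI for one new y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    y_hat = b0 + b1 * x0
    core = 1 / n + (x0 - x_bar) ** 2 / sxx
    half_ci = t_crit * s * math.sqrt(core)       # uncertainty in the mean response
    half_pi = t_crit * s * math.sqrt(1 + core)   # adds scatter of a single observation
    ci = (y_hat - half_ci, y_hat + half_ci)
    pi = (y_hat - half_pi, y_hat + half_pi)
    return y_hat, ci, pi


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

y_hat, ci, pi = intervals_at(x, y, x0=3, t_crit=3.182)  # t(0.025), 3 df
print(f"fit = {y_hat:.2f}")
print(f"95% CI for mean y:   ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI for single y: ({pi[0]:.2f}, {pi[1]:.2f})")
```

Both intervals are centered on the same fitted value. Only the half-widths differ, and the extra 1 inside the square root guarantees that the prediction interval is the wider of the two.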

All of the inferential methods just described depend on the adherence of the data to the regression assumptions outlined earlier. The two main graphical methods used to verify regression assumptions are (1) a normal probability plot of the residuals, and (2) a plot of the standardized residuals against the fitted (predicted) values. A normal probability plot is a quantile–quantile plot of the quantiles of a particular distribution against the quantiles of the standard normal distribution, for the purpose of determining whether the specified distribution deviates from normality. In a normality plot, the values observed for the distribution of interest are compared against the same number of values that would be expected from the normal distribution. If the distribution is normal, the bulk of the points in the plot should fall on a straight line; systematic deviations from linearity in this plot indicate nonnormality. We evaluate the validity of the regression assumptions by observing whether certain patterns exist in the plot of the residuals versus fits, in which case one of the assumptions has been violated, or whether no such discernible pattern exists, in which case the assumptions remain intact.
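The construction of a normal probability plot can be sketched without any plotting library: sort the values, pair each with the normal quantile expected at its rank, and check how close the pairs are to a straight line. The plotting position (rank − 0.375)/(n + 0.25) is one common convention and is an assumption here, as are the function names and both samples.

```python
import math
from statistics import NormalDist


def normal_plot_points(values):
    """Pair each sorted value with the standard normal quantile expected at its rank."""
    n = len(values)
    std_normal = NormalDist()
    points = []
    for rank, value in enumerate(sorted(values), start=1):
        p = (rank - 0.375) / (n + 0.25)   # plotting position (one common convention)
        points.append((std_normal.inv_cdf(p), value))
    return points


def straightness(points):
    """Pearson correlation between theoretical and observed quantiles."""
    n = len(points)
    mean_q = sum(q for q, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sqq = sum((q - mean_q) ** 2 for q, _ in points)
    svv = sum((v - mean_v) ** 2 for _, v in points)
    sqv = sum((q - mean_q) * (v - mean_v) for q, v in points)
    return sqv / math.sqrt(sqq * svv)


# Two hypothetical samples of 20 values standing in for residuals.
n = 20
z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
normal_like = [10 + 2 * zi for zi in z]      # shaped exactly like a normal sample
right_skewed = [math.exp(zi) for zi in z]    # shaped like a lognormal sample

r_normal = straightness(normal_plot_points(normal_like))
r_skewed = straightness(normal_plot_points(right_skewed))
print(f"normal-shaped sample: r = {r_normal:.4f}")
print(f"right-skewed sample:  r = {r_skewed:.4f}")
```

The correlation is only a numerical stand-in for looking at the plot. A value close to 1 corresponds to points lying along a straight line, while the skewed sample bends away from the line and scores noticeably lower.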

If these graphs indicate violations of the assumptions, we may apply a transformation to the response variable y, such as the ln (natural log, log to the base e) transformation. Transformations may also be called for if the relationship between the predictor and the response variables is not linear. We may use either Mosteller and Tukey’s ladder of re-expressions or a Box–Cox transformation.
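A minimal sketch of the Box–Cox family follows, assuming the usual definition (y^λ − 1)/λ for λ ≠ 0 and ln y for λ = 0. The helper names and the exponential example data are invented for illustration. Choosing λ by maximum likelihood, as is done in practice, is not shown.

```python
import math


def box_cox(values, lam):
    """Box-Cox transformation of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]          # the ln transformation
    return [(v ** lam - 1) / lam for v in values]


def pearson_r(a, b):
    """Pearson correlation, used here as a rough measure of linearity."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return sab / math.sqrt(saa * sbb)


# Hypothetical response that grows exponentially with the predictor.
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(0.5 * xi) for xi in x]

r_raw = pearson_r(x, box_cox(y, 1))   # lambda = 1 only shifts y, so the curve remains
r_log = pearson_r(x, box_cox(y, 0))   # lambda = 0 straightens exponential growth
print(f"lambda = 1: r = {r_raw:.4f}")
print(f"lambda = 0: r = {r_log:.4f}")
```

In this constructed case the ln transformation makes the relationship exactly linear. With real data one would compare several values of λ and recheck the residual plots after refitting.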

REFERENCES

1. Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, Hoboken, NJ, 2005.

2. Cereals data set, in Data and Story Library, http://lib.stat.cmu.edu/DASL/. Also available at the book series Web site.

3. Norman Draper and Harry Smith, Applied Regression Analysis, Wiley, New York, 1998.

4. California data set, U.S. Census Bureau, http://www.census.gov/. Also available at the book series Web site.

5. Frederick Mosteller and John Tukey, Data Analysis and Regression, Addison-Wesley, Reading, MA, 1977.

6. G. E. P. Box and D. R. Cox, An analysis of transformations, Journal of the Royal Statistical Society, Series B, Vol. 26, pp. 211–243, 1964.

EXERCISES

Clarifying the Concepts

2.1. Determine whether the following statements are true or false. If a statement is false, explain why and suggest how one might alter the statement to make it true.

(a) The least-squares line is that line which minimizes the sum of the residuals.

(b) If all the residuals equal zero, SST=SSR.

(c) If the value of the correlation coefficient is negative, this indicates that the variables are negatively correlated.

(d) The value of the correlation coefficient can be calculated given the value of r² alone.

(e) Outliers are influential observations.

(f) If the residual for an outlier is positive, we may say that the observed y-value is higher than the regression estimate, given the x-value.

(g) An observation may be influential even though it is neither an outlier nor a high leverage point.

(h) The best way of determining whether an observation is influential is to see whether its Cook’s distance exceeds 1.0.

(i) If one is interested in using regression analysis in a strictly descriptive manner, with no inference and no model building, one need not worry quite so much about assumption validation.

(j) In a normality plot, if the distribution is normal, the bulk of the points should fall on a straight line.

(k) The chi-square distribution is left-skewed.

(l) Small p-values for the Anderson–Darling test statistic indicate that the data are right-skewed.

(m) A funnel pattern in the plot of residuals versus fits indicates a violation of the independence assumption.

2.2. Describe the difference between the estimated regression line and the true regression line.

2.3. Calculate the estimated regression equation for the orienteering example using the data in Table 2.3. Use either the formulas or software of your choice.

2.4. Where would a data point be situated which has the smallest possible leverage?

2.5. Calculate the values for leverage, standardized residual, and Cook’s distance for the hard-core hiker example in the text.

2.6. Calculate the values for leverage, standardized residual, and Cook’s distance for the eleventh hiker who had hiked for 10 hours and traveled 23 kilometers. Show that although it is neither an outlier nor of high leverage, it is nevertheless influential.

2.7. Match each of the following regression terms with its definition.

Term                         Definition

(a) Influential observation  Measures the typical difference between the predicted and actual response values.
(b) SSE                      Represents the total variability in the values of the response variable alone, without reference to the predictor.
(d) Residual                 Measures the strength of the linear relationship between two quantitative variables, with values ranging from −1 to 1.
(e) s                        An observation that alters the regression parameters significantly based on its presence or absence in the data set.
(f) High leverage point      Measures the level of influence of an observation by taking into account both the size of the residual and the amount of leverage for that observation.
(g) r                        Represents an overall measure of the error in prediction resulting from the use of the estimated regression equation.
(h) SST                      An observation that is extreme in the predictor space, without reference to the response variable.
(i) Outlier                  Measures the overall improvement in prediction accuracy when using the regression as opposed to ignoring the predictor information.
(j) SSR                      The vertical distance between the response predicted and the actual response.
(k) Cook’s distance          The proportion of the variability in the response that is explained by the linear relationship between the predictor and response variables.

2.8. Explain in your own words the implications of the regression assumptions for the behavior of the response variable y.

2.9. Explain what statistics from Table 2.11 indicate to us that there may indeed be a linear relationship between x and y in this example, even though the value for r² is less than 1%.

2.10. Which values of the slope parameter indicate that no linear relationship exists between the predictor and response variables? Explain how this works.

2.11. Explain what information is conveyed by the value of the standard error of the slope estimate.

2.12. Describe the criterion for rejecting the null hypothesis when using the p-value method for hypothesis testing. Who chooses the value of the level of significance, α? Make up a situation (one p-value and two different values of α) where the very same data could lead to two different conclusions of the hypothesis test. Comment.

2.13. (a) Explain why an analyst may prefer a confidence interval to a hypothesis test.

(b) Describe how a confidence interval may be used to assess significance.

2.14. Explain the difference between a confidence interval and a prediction interval. Which interval is always wider? Why? Which interval is probably, depending on the situation, more useful to a data miner? Why?

2.15. Clearly explain the correspondence between an original scatter plot of the data and a plot of the residuals versus fitted values.

2.16. What recourse do we have if the residual analysis indicates that the regression assumptions have been violated? Describe three different rules, heuristics, or families of functions that will help us.

2.17. A colleague would like to use linear regression to predict whether or not customers will make a purchase based on some predictor variable. What would you explain to your colleague?
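For the exercises above that ask for leverage, standardized residuals, and Cook’s distance, the following sketch shows one way to organize the calculation for a single-predictor regression. It assumes the standard textbook formulas (leverage 1/n + (x − x̄)²/Σ(x − x̄)², residual scaled by s√(1 − h), and Cook’s distance with m + 1 = 2 parameters). The data are invented and are not the hiker data from the text, and the function name `influence_measures` is hypothetical.

```python
import math


def influence_measures(x, y):
    """Leverage, standardized residual, and Cook's distance for each observation
    in a simple (one-predictor) least-squares regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

    rows = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_bar) ** 2 / sxx                 # leverage
        std_resid = e / (s * math.sqrt(1 - h))              # standardized residual
        cook = (e ** 2 / (2 * s ** 2)) * h / (1 - h) ** 2   # m + 1 = 2 parameters
        rows.append({"x": xi, "leverage": h, "std_resid": std_resid, "cook": cook})
    return rows


# Hypothetical predictor and response values, invented for illustration only.
x = [2, 2, 3, 4, 4, 5, 6, 7, 8, 9]
y = [10, 11, 12, 13, 14, 15, 20, 18, 22, 25]

rows = influence_measures(x, y)
for row in rows:
    print(f"x = {row['x']:>2}  leverage = {row['leverage']:.3f}  "
          f"std resid = {row['std_resid']:+.2f}  Cook's D = {row['cook']:.3f}")
```

Cook’s distance combines the other two quantities: it equals the squared standardized residual divided by 2, multiplied by h/(1 − h). A moderate residual paired with moderate leverage can therefore still produce a sizable value.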

Working with the Data

2.18. Based on the scatter plot of attendance at football games versus winning percentage of the home team shown in Figure E2.18, answer the following questions.

(a) Describe any correlation between the variables, and estimate the value of the correlation coefficient r.

(b) Estimate as best you can the values of the regression coefficients b0 and b1.

(c) Will the p-value for the hypothesis test for the existence of a linear relationship between the variables be small or large? Explain.

(d) Will the confidence interval for the slope parameter include zero? Explain.

(e) Will the value of s be closer to 10, 100, 1000, or 10,000? Why?

(f) Is there an observation that may look as though it is an outlier?

Figure E2.18  Scatter plot of Attendance (vertical axis) versus Winning Percent (horizontal axis).

2.19. Use the regression output (shown in Table E2.19) to verify your responses from Exercise 2.18.


TABLE E2.19

The regression equation is
Attendance = 11067 + 77.2 Winning Percent

Predictor           Coef  SE Coef      T      P
Constant         11066.8    793.3  13.95  0.000
Winning Percent    77.22    12.00   6.44  0.000

S = 1127.51   R-Sq = 74.7%   R-Sq(adj) = 72.9%

Analysis of Variance

Source            DF        SS        MS      F      P
Regression         1  52675342  52675342  41.43  0.000
Residual Error    14  17797913   1271280
Total             15  70473255

Unusual Observations

     Winning
Obs  Percent  Attendance    Fit  SE Fit  Residual  St Resid
 10       76       19593  16936     329      2657     2.46R

R denotes an observation with a large standardized residual.
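The entries in Table E2.19 are tied together by a few simple identities, which the following sketch checks using only the numbers printed in the table. The variable names are illustrative, and small discrepancies are expected because the printed values are rounded.

```python
import math

# Values copied from Table E2.19.
coef_const, se_const = 11066.8, 793.3
coef_slope, se_slope = 77.22, 12.00
ssr, sse, sst = 52675342, 17797913, 70473255
df_reg, df_err, df_tot = 1, 14, 15
obs_x, obs_y = 76, 19593             # unusual observation 10

t_const = coef_const / se_const      # T = Coef / SE Coef
t_slope = coef_slope / se_slope
mse = sse / df_err                   # MS = SS / DF
s = math.sqrt(mse)                   # S is the square root of the error mean square
f_stat = (ssr / df_reg) / mse        # F = MS regression / MS error
r_sq = ssr / sst                     # R-Sq = SSR / SST
r_sq_adj = 1 - mse / (sst / df_tot)  # R-Sq(adj) adjusts for degrees of freedom
fit = coef_const + coef_slope * obs_x
residual = obs_y - fit

print(f"T (constant) = {t_const:.2f}, T (slope) = {t_slope:.2f}")
print(f"S = {s:.2f}, F = {f_stat:.2f}")
print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}")
print(f"Obs 10: fit = {fit:.0f}, residual = {residual:.0f}")
```

The same relationships hold for any simple regression output of this form. In particular, with a single predictor the F statistic equals the square of the slope's t-statistic, up to rounding.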

2.20. Based on the scatter plot shown in Figure E2.20, answer the following questions.

(a) Is it appropriate to perform linear regression? Why or why not?

(b) What type of transformation or transformations are called for? Use the bulging rule.

Figure E2.20  Scatter plot.

2.21. Based on the regression output shown in Table E2.21 (from the churn data set), answer the following questions.

(a) Is there evidence of a linear relationship between z vmail messages (z-scores of the number of voice mail messages) and z day calls (z-scores of the number of day calls made)? Explain.

(b) Since it has been standardized, the response z vmail messages has a standard deviation of 1.0. What would be the typical error in predicting z vmail messages if we simply used the sample mean response and no information about day calls? Now, from the printout, what is the typical error in predicting z vmail messages given z day calls? Comment.

TABLE E2.21

The regression equation is
z vmail messages = 0.0000 - 0.0095 z day calls

Predictor         Coef  SE Coef      T      P
Constant       0.00000  0.01732   0.00  1.000
z day calls   -0.00955  0.01733  -0.55  0.582

S = 1.00010   R-Sq = 0.0%   R-Sq(adj) = 0.0%

Analysis of Variance

Source              DF        SS     MS     F      P
Regression           1     0.304  0.304  0.30  0.582
Residual Error    3331  3331.693  1.000
Total             3332  3331.997

Hands-on Analysis

2.22. Open the baseball data set, which is available at the book series Web site. Subset the data so that we are working with batters who have at least 100 at bats.

(a) We are interested in investigating whether there is a linear relationship between the number of times a player has been caught stealing and the number of stolen bases the player has. Construct a scatter plot with caught as the response. Is there evidence of a linear relationship?

(b) Based on the scatter plot, is a transformation to linearity called for? Why or why not?

(c) Perform the regression of the number of times a player has been caught stealing versus the number of stolen bases the player has.

(d) Find and interpret the statistic which tells you how well the data fit the model.

(e) What is the typical error in predicting the number of times a player is caught stealing given his number of stolen bases?

(f) Interpret the y-intercept. Does this make sense? Why or why not?

(g) Inferentially, is there a significant relationship between the two variables? What tells you this?


(h) Calculate and interpret the correlation coefficient.

(i) Clearly interpret the meaning of the slope coefficient.

(j) Suppose someone said that knowing the number of stolen bases a player has explains most of the variability in the number of times the player gets caught stealing. What would you say?

2.23. Open the cereals data set, which is available at the book series Web site.

(a) We are interested in predicting nutrition rating based on sodium content. Construct the appropriate scatter plot.

(b) Based on the scatter plot, is there strong evidence of a linear relationship between the variables? Discuss. Characterize their relationship, if any.

(c) Perform the appropriate regression.

(d) Which cereal is an outlier? Explain why this cereal is an outlier.

(e) What is the typical error in predicting rating based on sodium content?

(f) Interpret the y-intercept. Does this make any sense? Why or why not?

(g) Inferentially, is there a significant relationship between the two variables? What tells you this?

(h) Calculate and interpret the correlation coefficient.

(i) Clearly interpret the meaning of the slope coefficient.

(j) Construct and interpret a 95% confidence interval for the true mean nutrition rating for all cereals with a sodium content of 100.

(k) Construct and interpret a 95% prediction interval for the nutrition rating for a randomly chosen cereal with sodium content of 100.

2.24. Open the California data set, which is available at the book series Web site.

(a) Recapitulate the analysis performed within the chapter.

(b) Set aside the military outliers and proceed with the analysis with the remaining 848 records. Apply whatever data transformations are necessary to construct your best regression model.

CHAPTER

3

MULTIPLE REGRESSION AND

In document Data Mining Methods And Models Larose DT (2006) pdf (Page 103-111)