
Recommendation for Checking the Utility of a Multiple Regression Model

1. First, conduct a test of overall model adequacy using the F-test, that is, test

H0: β1 = β2 = · · · = βk = 0

If the model is deemed adequate (i.e., if you reject H0), then proceed to step 2. Otherwise, you should hypothesize and fit another model. The new model may include more independent variables or higher-order terms.

2. Conduct t-tests on those β parameters in which you are particularly interested (i.e., the ‘‘most important’’ β’s). These usually involve only the β’s associated with higher-order terms (x², x1x2, etc.). However, it is a safe practice to limit the number of β’s that are tested. Conducting a series of t-tests leads to a high overall Type I error rate α.
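The warning about a series of t-tests can be made concrete with a small calculation. The sketch below is only an idealization: it assumes the c tests are independent, which t-tests on β's from the same fitted model generally are not, so the numbers should be read as a rough guide. The function name `familywise_error_rate` is hypothetical and not part of the text.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one Type I error in num_tests tests.

    ASSUMPTION: the tests are independent and each is run at level alpha.
    t-tests on coefficients of one regression model are usually correlated,
    so this is an illustration, not an exact result.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Show how the overall error rate grows as more t-tests are run.
    for c in (1, 3, 5, 10):
        rate = familywise_error_rate(0.05, c)
        print(f"{c:2d} tests at alpha = .05 -> overall rate about {rate:.3f}")
```

Under these assumptions, ten tests at α = .05 give roughly a 40% chance of at least one false rejection, which is why limiting the number of β's tested is a safe practice.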

We conclude this section with a final caution about conducting t-tests on individual β parameters in a model.

Caution

Extreme care should be exercised when conducting t-tests on the individual β parameters in a first-order linear model for the purpose of determining which independent variables are useful for predicting y and which are not. If you fail to reject H0: βi = 0, several conclusions are possible:

1. There is no relationship between y and xi.

2. A straight-line relationship between y and xi exists (holding the other x’s in the model fixed), but a Type II error occurred.

3. A relationship between y and xi (holding the other x’s in the model fixed) exists, but is more complex than a straight-line relationship (e.g., a curvilinear relationship may be appropriate).

The most you can say about a β parameter test is that there is either sufficient (if you reject H0: βi = 0) or insufficient (if you do not reject H0: βi = 0) evidence of a linear (straight-line) relationship between y and xi.

4.8 Multiple Coefficients of Determination: R2 and R2a

Recall from Chapter 3 that the coefficient of determination, r2, is a measure of how well a straight-line model fits a data set. To measure how well a multiple regression model fits a set of data, we compute the multiple regression equivalent of r2, called the multiple coefficient of determination and denoted by the symbol R2.

Definition 4.1 The multiple coefficient of determination, R2, is defined as

$$R^2 = 1 - \frac{SSE}{SS_{yy}}, \qquad 0 \le R^2 \le 1$$

where $SSE = \sum (y_i - \hat{y}_i)^2$, $SS_{yy} = \sum (y_i - \bar{y})^2$, and $\hat{y}_i$ is the predicted value of $y_i$ for the multiple regression model.
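The definition translates directly into a few lines of code. The following is a minimal sketch, with a hypothetical function name `r_squared` and made-up illustrative numbers rather than data from the text. It assumes the predicted values come from a least squares fit with an intercept, which is what keeps the result between 0 and 1.

```python
def r_squared(y, y_hat):
    """Multiple coefficient of determination, R2 = 1 - SSE / SSyy.

    y     : observed responses
    y_hat : predicted values from the fitted model (same length as y)

    The bounds 0 <= R2 <= 1 hold for least squares fits that include an
    intercept; arbitrary predictions can produce a negative value.
    """
    if len(y) != len(y_hat):
        raise ValueError("y and y_hat must have the same length")
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)             # total sample variation
    if ss_yy == 0:
        raise ValueError("SSyy is zero: all y values are identical")
    return 1.0 - sse / ss_yy


if __name__ == "__main__":
    # Illustrative (made-up) values, not taken from the text.
    y_obs = [2.0, 4.0, 6.0, 8.0]
    y_pred = [2.5, 3.5, 6.5, 7.5]
    print(r_squared(y_obs, y_pred))
```

Predicting every observation by the sample mean gives SSE = SSyy and hence R2 = 0, while predictions that reproduce every y exactly give SSE = 0 and R2 = 1.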

Just as for the simple linear model, R2 represents the fraction of the sample variation of the y-values (measured by SSyy) that is explained by the least squares regression model. Thus, R2 = 0 implies a complete lack of fit of the model to the data, and R2 = 1 implies a perfect fit, with the model passing through every data point. In general, the closer the value of R2 is to 1, the better the model fits the data.

To illustrate, consider the first-order model for the grandfather clock auction price presented in Examples 4.1–4.4. A portion of the SPSS printout of the analysis is shown in Figure 4.7. The value R2 = .892 is highlighted on the printout. This relatively high value of R2 implies that using the independent variables age and number of bidders in a first-order model explains 89.2% of the total sample variation (measured by SSyy) in auction price y. Thus, R2 is a sample statistic that tells how well the model fits the data and thereby represents a measure of the usefulness of the entire model.

Figure 4.7 A portion of the SPSS regression output for the auction price model

A large value of R2 computed from the sample data does not necessarily mean that the model provides a good fit to all of the data points in the population. For example, a first-order linear model that contains three parameters will provide a perfect fit to a sample of three data points and R2 will equal 1. Likewise, you will always obtain a perfect fit (R2 = 1) to a set of n data points if the model contains exactly n parameters. Consequently, if you want to use the value of R2 as a measure of how useful the model will be for predicting y, it should be based on a sample that contains substantially more data points than the number of parameters in the model.

Caution

In a multiple regression analysis, use the value of R2 as a measure of how useful a linear model will be for predicting y only if the sample contains substantially more data points than the number of β parameters in the model.
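The perfect-fit phenomenon behind this caution is easy to reproduce. The sketch below fits the three-parameter model E(y) = β0 + β1x1 + β2x2 to three made-up data points by solving the three equations exactly (Cramer's rule is used here only for brevity; it is not the text's method), and then evaluates R2. All names and data values are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve_three_parameters(x1, x2, y):
    """Solve b0 + b1*x1 + b2*x2 = y exactly for three data points."""
    design = [[1.0, x1[j], x2[j]] for j in range(3)]
    d = det3(design)
    if d == 0:
        raise ValueError("the three points do not determine a unique model")
    betas = []
    for col in range(3):
        # Cramer's rule: replace one column of the design matrix by y.
        m = [row[:] for row in design]
        for j in range(3):
            m[j][col] = y[j]
        betas.append(det3(m) / d)
    return betas


def r_squared(y, y_hat):
    """R2 = 1 - SSE / SSyy."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / ss_yy


# Made-up sample with n = 3 points and a model with 3 parameters.
x1 = [1.0, 2.0, 4.0]
x2 = [3.0, 1.0, 2.0]
y = [10.0, 7.0, 15.0]

b0, b1, b2 = solve_three_parameters(x1, x2, y)
y_hat = [b0 + b1 * u + b2 * v for u, v in zip(x1, x2)]
perfect_r2 = r_squared(y, y_hat)
print(b0, b1, b2, perfect_r2)
```

Whatever y-values are chosen (as long as the three points determine a unique model), the fitted surface passes through all of them, so R2 equals 1 up to rounding and says nothing about how well the model would predict new observations.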

As an alternative to using R2 as a measure of model adequacy, the adjusted multiple coefficient of determination, denoted R2a, is often reported. The formula for R2a is shown in the box.

Definition 4.2 The adjusted multiple coefficient of determination is given by

$$R_a^2 = 1 - \left[\frac{n-1}{n-(k+1)}\right]\left(\frac{SSE}{SS_{yy}}\right) = 1 - \left[\frac{n-1}{n-(k+1)}\right](1 - R^2)$$

Note: R2a ≤ R2 and, for poor-fitting models, R2a may be negative.

R2 and R2a have similar interpretations. However, unlike R2, R2a takes into account (‘‘adjusts’’ for) both the sample size n and the number of β parameters in the model. R2a will always be smaller than R2, and more importantly, cannot be ‘‘forced’’ to 1 by simply adding more and more independent variables to the model. Consequently, analysts prefer the more conservative R2a when choosing a measure of model adequacy. The value of R2a is also highlighted in Figure 4.7. Note that R2a = .885, a value only slightly smaller than R2.
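The second form of the formula in Definition 4.2 needs only R2, n, and k, so it can be checked against the printout values. The sketch below uses the hypothetical name `adjusted_r_squared`. The sample size n = 32 is an assumption about the clock data set, since it is not stated in this section; k = 2 corresponds to the two independent variables, age and number of bidders.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R2 = 1 - [(n - 1) / (n - (k + 1))] * (1 - R2).

    r2 : multiple coefficient of determination
    n  : sample size
    k  : number of independent-variable terms (beta parameters excluding beta0)
    """
    if n <= k + 1:
        raise ValueError("need more data points than beta parameters")
    return 1.0 - ((n - 1) / (n - (k + 1))) * (1.0 - r2)


if __name__ == "__main__":
    # R2 = .892 is the value quoted in the text; n = 32 is an ASSUMED
    # sample size for the clock data, not given in this section.
    print(round(adjusted_r_squared(0.892, n=32, k=2), 3))

    # A poorly fitting model with few observations can give a negative value.
    print(adjusted_r_squared(0.02, n=10, k=3))
```

Under the assumed n = 32, the result rounds to the .885 reported on the printout. The second call illustrates the note in Definition 4.2 that R2a may be negative for poor-fitting models.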

Despite their utility, R2 and R2a are only sample statistics. Consequently, it is dangerous to judge the usefulness of the model based solely on these values. A prudent analyst will use the analysis-of-variance F-test for testing the global utility of the multiple regression model. Once the model has been deemed ‘‘statistically’’ useful with the F-test, the more conservative value of R2a is used to describe the

4.8 Exercises

4.1 Degrees of freedom. How is the number of degrees of freedom available for estimating σ², the variance of ε, related to the number of independent variables in a regression model?

4.2 Accounting and Machiavellianism. Refer to the Behavioral Research in Accounting (January 2008) study of Machiavellian traits (e.g., manipulation, cunning, duplicity, deception, and bad faith) in accountants, Exercise 1.47 (p. 41). Recall that a Machiavellian (‘‘Mach’’) rating score was determined for each in a sample of accounting alumni of a large southwestern university. For one portion of the study, the researcher modeled an accountant’s Mach score (y) as a function of age (x1), gender (x2), education (x3), and income (x4). Data on n = 198 accountants yielded the results shown in the table.

INDEPENDENT VARIABLE    t-VALUE FOR H0: βi = 0    p-VALUE
Age (x1)                 0.10                     > .10
Gender (x2)             −0.55                     > .10
Education (x3)           1.95                     < .01
Income (x4)              0.52                     > .10

Overall model: R2 = .13, F = 4.74 (p-value < .01)

(a) Write the equation of the hypothesized model relating y to x1, x2, x3, and x4.

(b) Conduct a test of overall model utility. Use α = .05.

(c) Interpret the coefficient of determination, R2.

(d) Is there sufficient evidence (at α= .05) to say that income is a statistically useful predictor of Mach score?

4.3 Study of adolescents with ADHD. Children with attention-deficit/hyperactivity disorder (ADHD) were monitored to evaluate their risk for substance (e.g., alcohol, tobacco, illegal drug) use (Journal of Abnormal Psychology, August 2003). The following data were collected on 142 adolescents diagnosed with ADHD:

y = frequency of marijuana use the past 6 months
x1 = severity of inattention (5-point scale)
x2 = severity of impulsivity–hyperactivity (5-point scale)
x3 = level of oppositional–defiant and conduct disorder (5-point scale)

(a) Write the equation of a first-order model for E(y).

(b) The coefficient of determination for the model is R2 = .08. Interpret this value.

(c) The global F-test for the model yielded a p-value less than .01. Interpret this result.

(d) The t-test for H0: β1 = 0 resulted in a p-value less than .01. Interpret this result.

(e) The t-test for H0: β2 = 0 resulted in a p-value greater than .05. Interpret this result.

(f) The t-test for H0: β3 = 0 resulted in a p-value greater than .05. Interpret this result.

4.4 Characteristics of lead users. During new product development, companies often involve ‘‘lead users’’ (i.e., creative individuals who are on the leading edge of an important market trend). Creativity and Innovation Management (February 2008) published an article on identifying the social network characteristics of lead users of children’s computer games. Data were collected for n = 326 children and the following variables measured: lead-user rating (y, measured on a 5-point scale), gender (x1 = 1 if female, 0 if male), age (x2, years), degree of centrality (x3, measured as the number of direct ties to other peers in the network), and betweenness centrality (x4, measured as the number of shortest paths between peers). A first-order model for y was fit to the data, yielding the following least squares prediction equation:

ŷ = 3.58 + .01x1 − .06x2 − .01x3 + .42x4

(a) Give two properties of the errors of prediction that result from using the method of least squares to obtain the parameter estimates.

(b) Give a practical interpretation of the estimate of β4 in the model.

(c) A test of H0: β4 = 0 resulted in a two-tailed p-value of .002. Make the appropriate conclusion at α = .05.

4.5 Runs scored in baseball. In Chance (Fall 2000), statistician Scott Berry built a multiple regression model for predicting total number of runs scored by a Major League Baseball team during a season. Using data on all teams over a 9-year period (a sample of n = 234), the results in the table below were obtained.

INDEPENDENT VARIABLE    β ESTIMATE    STANDARD ERROR
Intercept                 3.70         15.00
Walks (x1)                 .34           .02
Singles (x2)               .49           .03
Doubles (x3)               .72           .05
Triples (x4)              1.14           .19
Home Runs (x5)            1.51           .05
Stolen Bases (x6)          .26           .05
Caught Stealing (x7)      −.14           .14
Strikeouts (x8)           −.10           .01
Outs (x9)                 −.10           .01

Source: Berry, S. M. ‘‘A statistician reads the sports pages: Modeling offensive ability in baseball,’’ Chance, Vol. 13, No. 4, Fall 2000 (Table 2).

(a) Write the least squares prediction equation for y = total number of runs scored by a team in a season.

(b) Conduct a test of H0: β7 = 0 against Ha: β7 < 0 at α = .05. Interpret the results.

(c) Form a 95% confidence interval for β5. Interpret the results.

(d) Predict the number of runs scored by your favorite Major League Baseball team last year. How close is the predicted value to the actual number of runs scored by your team? (Note: You can find data on your favorite team on the Internet at www.mlb.com.)

4.6 Earnings of Mexican street vendors. Detailed interviews were conducted with over 1,000 street vendors in the city of Puebla, Mexico, in order to study the factors influencing vendors’ incomes (World Development, February 1998). Vendors were defined as individuals working in the street, and included vendors with carts and stands on wheels and excluded beggars, drug dealers, and prostitutes. The researchers collected data on gender, age, hours worked per day, annual earnings, and education level. A subset of these data appears in the accompanying table.

(a) Write a first-order model for mean annual earnings, E(y), as a function of age (x1) and hours worked (x2).

SAS output for Exercise 4.6

STREETVEN

VENDOR NUMBER    ANNUAL EARNINGS y    AGE x1    HOURS WORKED PER DAY x2
 21              $2841                29        12
 53               1876                21         8
 60               2934                62        10
184               1552                18        10
263               3065                40        11
281               3670                50        11
354               2005                65         5
401               3215                44         8
515               1930                17         8
633               2010                70         6
677               3111                20         9
710               2882                29         9
800               1683                15         5
914               1817                14         7
997               4066                33        12

Source: Adapted from Smith, P. A., and Metzger, M. R. ‘‘The return to education: Street vendors in Mexico,’’ World Development, Vol. 26, No. 2, Feb. 1998, pp. 289–296.

(b) The model was fit to the data using SAS. Find the least squares prediction equation on the printout shown below.

(c) Interpret the estimated β coefficients in your model.

(d) Conduct a test of the global utility of the model (at α= .01). Interpret the result.

(e) Find and interpret the value of R2a.

(f) Find and interpret s, the estimated standard deviation of the error term.

(g) Is age (x1) a statistically useful predictor of annual earnings? Test using α = .01.

(h) Find a 95% confidence interval for β2. Inter-

4.7 Urban population estimation using satellite images. Can the population of an urban area be estimated without taking a census? In Geographical Analysis (January 2007) geography professors at the University of Wisconsin–Milwaukee and Ohio State University demonstrated the use of satellite image maps for estimating urban population. A portion of Columbus, Ohio, was partitioned into n = 125 census block groups and satellite imagery was obtained. For each census block, the following variables were measured: population density (y), proportion of block with low-density residential areas (x1), and proportion of block with high-density residential areas (x2). A first-order model for y was fit to the data with the following results:

ŷ = −.0304 + 2.006x1 + 5.006x2, R2 = .686

(a) Give a practical interpretation of each β-estimate in the model.

(b) Give a practical interpretation of the coefficient of determination, R2.

(c) State H0 and Ha for a test of overall model adequacy.

(d) Refer to part c. Compute the value of the test statistic.

(e) Refer to parts c and d. Make the appropriate conclusion at α= .01.

4.8 Novelty of a vacation destination. Many tourists choose a vacation destination based on the newness or uniqueness (i.e., the novelty) of the itinerary. Texas A&M University professor J. Petrick investigated the relationship between novelty and vacationing golfers’ demographics (Annals of Tourism Research, Vol. 29, 2002). Data were obtained from a mail survey of 393 golf vacationers to a large coastal resort in the southeastern United States. Several measures of novelty level (on a numerical scale) were obtained for each vacationer, including ‘‘change from routine,’’ ‘‘thrill,’’ ‘‘boredom-alleviation,’’ and ‘‘surprise.’’ The researcher employed four independent variables in a regression model to predict each of the novelty measures. The independent variables were x1 = number of rounds of golf per year, x2 = total number of golf vacations taken, x3 = number of years played golf, and x4 = average golf score.

(a) Give the hypothesized equation of a first-order model for y = change from routine.

(b) A test of H0: β3 = 0 versus Ha: β3 < 0 yielded a p-value of .005. Interpret this result if α = .01.

(c) The estimate of β3 was found to be negative. Based on this result (and the result of part b), the researcher concluded that ‘‘those who have played golf for more years are less apt to seek change from their normal routine in their golf vacations.’’ Do you agree with this statement? Explain.

(d) The regression results for the three other dependent novelty measures are summarized in the table below. Give the null hypothesis for testing the overall adequacy of each first-order regression model.

DEPENDENT VARIABLE     F-VALUE    p-VALUE    R2
Thrill                 5.56       < .001     .055
Boredom-alleviation    3.02       .018       .030
Surprise               3.33       .011       .023

Source: Reprinted from Annals of Tourism Research, Vol. 29, Issue 2, J. F. Petrick, ‘‘An examination of golf vacationers’ novelty,’’ Copyright © 2002, with permission from Elsevier.

(e) Give the rejection region for the test, part d, using α= .01.

(f) Use the test statistics reported in the table and the rejection region from part e to conduct the test for each of the dependent measures of novelty.

(g) Verify that the p-values in the table support your conclusions in part f.

(h) Interpret the values of R2 reported in the table.

4.9 Highway crash data analysis. Researchers at Montana State University have written a tutorial on an empirical method for analyzing before and after highway crash data (Montana Department of Transportation, Research Report, May 2004). The initial step in the methodology is to develop a Safety Performance Function (SPF), a mathematical model that estimates crash occurrence for a given roadway segment. Using data collected for over 100 roadway segments, the researchers fit the model, E(y) = β0 + β1x1 + β2x2, where y = number of crashes per 3 years, x1 = roadway length (miles), and x2 = AADT (average annual daily traffic) (number of vehicles). The results are shown in the following tables.

Interstate Highways

VARIABLE       PARAMETER ESTIMATE    STANDARD ERROR    t-VALUE
Intercept      1.81231               .50568            3.58
Length (x1)     .10875               .03166            3.44
AADT (x2)       .00017               .00003            5.19

Non-Interstate Highways

VARIABLE       PARAMETER ESTIMATE    STANDARD ERROR    t-VALUE
Intercept      1.20785               .28075            4.30
Length (x1)     .06343               .01809            3.51