• No results found

Property 3. Consider a regression model and let Assumptions F1–F5hold.

3.4 Statistical Inference for a Single Coefficient 1 The t-Test

3.4.3 Added Variable Plots

To represent multivariate data graphically, we have seen that a scatterplot matrix is a useful device. However, the major shortcoming of the scatterplot matrix is that it captures relationships only between pairs of variables. When the data can be summarized using a regression model, a graphical device that does not have this shortcoming is an added variable plot. The added variable plot is also called a partial regression plot because, as we will see, it is constructed in terms of residuals from certain regression fits. We will also see that the added variable plot can be summarized in terms of a partial correlation coefficient, thus providing a link between correlation and regression. To introduce these ideas, we work in the context of the following example:

3.4 Statistical Inference for a Single Coefficient 89

Table 3.7 Summary Statistics for Each Variable for 37 Refrigerators

Standard

Variable Mean Median Deviation Minimum Maximum

ECOST 70.51 68.00 9.14 60.00 94.00 RSIZE 13.400 13.200 0.600 12.600 14.700 FSIZE 5.184 5.100 0.938 4.100 7.400 SHELVES 2.514 2.000 1.121 1.000 5.000 FEATURES 3.459 3.000 2.512 1.000 12.000 PRICE 626.4 590.0 139.8 460.0 1200.0

Source: Consumer Reports, July 1992. “Refrigerators:A Comprehensive Guide to the Big White Box.”

Table 3.8 Matrix of Correlation Coefficients

ECOST RSIZE FSIZE SHELVES FEATURES

RSIZE 0.333 FSIZE 0.855 0.235 SHELVES 0.188 0.363 0.251 FEATURES 0.334 0.096 0.439 0.160 PRICE 0.522 0.024 0.720 0.400 0.697 R  Empirical Filename is “Refrigerator”

Example: Refrigerator Prices. What characteristics of a refrigerator are impor-

tant in determining its price (PRICE)? We consider here several characteristics of a refrigerator, including the size of the refrigerator in cubic feet (RSIZE), the size of the freezer compartment in cubic feet (FSIZE), the average amount of money spent per year to operate the refrigerator (ECOST, for “energy cost”), the number of shelves in the refrigerator and freezer doors (SHELVES), and the number of features (FEATURES). The features variable includes shelves for cans, see-through crispers, ice makers, egg racks, and so on.

Both consumers and manufacturers are interested in models of refrigerator prices. Other things equal, consumers generally prefer larger refrigerators with lower energy costs that have more features. Because of forces of supply and demand, we would expect consumers to pay more for such refrigerators. A larger refrigerator with lower energy costs that has more features at the similar price is considered a bargain to the consumer. How much extra would the consumer be willing to pay for the additional space? A model of prices for refrigerators on the market provides some insight to this question.

To this end, we analyze data from n= 37 refrigerators. Table3.7provides the basic summary statistics for the response variable PRICE and the five explanatory variables. From this table, we see that the average refrigerator price is y=

$626.40, with standard deviation sy = $139.80. Similarly, the average annual

amount to operate a refrigerator, or average ECOST, is $70.51.

To analyze relationships among pairs of variables, Table3.8provides a matrix of correlation coefficients. From the table, we see that there are strong linear relationships between PRICE and each of freezer space (FSIZE) and the number

Table 3.9 Fitted Refrigerator Price

Model CoefficientEstimate StandardError t-Ratio

Intercept 798 271.4 −2.9 ECOST −6.96 2.275 −3.1 RSIZE 76.5 19.44 3.9 FSIZE 137 23.76 5.8 SHELVES 37.9 9.886 3.8 FEATURES 23.8 4.512 5.3

of FEATURES. Surprisingly, there is also a strong positive correlation between PRICE and ECOST. Recall that ECOST is the energy cost; one might expect that higher-priced refrigerators should enjoy lower energy costs.

A regression model was fit to the data. The fitted regression equation appears in Table3.9, with s= 60.65 and R2= 83.8%.

From Table 3.9, the explanatory variables seem to be useful predictors of refrigerator prices. Together, the variables account for 83.8% of the variability. To understand prices, the typical error has dropped from sy = $139.80 to s = $60.65.

The t-ratios for each of the explanatory variables exceeds 2 in absolute value, indicating that each variable is important on an individual basis.

What is surprising about the regression fit is the negative coefficient associated with energy cost. Remember, we can interpret bECOST= −6.96 to mean that, for

each dollar increase in ECOST, we expect the PRICE to decrease by $6.96. This negative relationship conforms to our economic intuition. However, it is surprising that the same dataset has shown us that there is a positive relationship between PRICE and ECOST. This seeming anomaly is because correlation measures relationships only between pairs of variables, though the regression fit can account for several variables simultaneously. To provide more insight into this seeming anomaly, we now introduce the added variable plot.

Producing an Added Variable Plot

The added variable plot provides additional links between the regression method- ology and more fundamental tools such as scatter plots and correlations. We work in the context of the refrigerator price example to demonstrate the construction of this plot.

Procedure for producing an added variable plot.

(i) Run a regression of PRICE on RSIZE, FSIZE, SHELVES, and FEA- TURES, omitting ECOST. Compute the residuals from this regression, which we label e1.

(ii) Run a regression of ECOST on RSIZE, FSIZE, SHELVES, and FEATURES. Compute the residuals from this regression, which we label e2.

(iii) Plot e1 versus e2. This is the added variable plot of PRICE versus

ECOST, controlling for the effects of the RSIZE, FSIZE, SHELVES, and FEATURES. This plot appears in Figure3.4.

3.4 Statistical Inference for a Single Coefficient 91 5 0 5 10 100 50 0 50 100 150 e1 e2 Figure 3.4 An added

variable plot. The residuals from the regression of PRICE on the explanatory variables, omitting ECOST, are on the vertical axis. On the horizontal axis are the residuals from the regression fit of ECOST on the other explanatory variables. The correlation coefficient is−0.48.

The error ε can be interpreted as the natural variation in a sample. In many situations, this natural variation is small compared to the patterns evident in the nonrandom regression component. Thus, it is useful to think of the error,

εi = yi− (β0+ β1xi1+ · · · + βkxik), as the response after controlling for the

effects of the explanatory variables. In Section 3.3, we saw that a random error can be approximated by a residual, ei = yi− (b0+ b1xi1+ · · · + bkxik). Thus,

in the same way, we may think of a residual as the response after “controlling for”the effects of the explanatory variables.

With this in mind, we can interpret the vertical axis of Figure3.4as the refriger- ator PRICE controlled for effects of RSIZE, FSIZE, SHELVES, and FEATURES. Similarly, we can interpret the horizontal axis as the ECOST controlled for effects of RSIZE, FSIZE, SHELVES, and FEATURES. The plot then provides a graph- ical representation of the relation between PRICE and ECOST, after controlling for the other explanatory variables. For comparison, a scatter plot of PRICE and ECOST (not shown here) does not control for other explanatory variables. Thus, it is possible that the positive relationship between PRICE and ECOST is due not to a causal relationship but rather to one or more additional variables that cause both variables to be large.

For example, from Table3.7, we see that the freezer size (FSIZE) is positively correlated with both ECOST and PRICE. It certainly seems reasonable that increasing the size of a freezer would cause both the energy cost and the price to increase. Rather, the positive correlation may be because large values of FSIZE mean large values of both ECOST and PRICE.

Variables left out of a regression are called omitted variables. This omission could cause a serious problem in a regression model fit; regression coefficients could be not only strongly significant when they should not be but also of the incorrect sign. Selecting the proper set of variables to be included in the regression model is an important task; it is the subject of Chapters 5 and 6.