Linear Regression with Multiple Regressors 6.1 Multiple Choice

1) In the multiple regression model, the adjusted R2, R2 A) cannot be negative.

B) will never be greater than the regression R2. C) equals the square of the correlation coefficient r.

D) cannot decrease when an additional explanatory variable is added. Answer: B

2) Under imperfect multicollinearity

A) the OLS estimator cannot be computed.

B) two or more of the regressors are highly correlated. C) the OLS estimator is biased even in samples of n> 100. D) the error terms are highly, but not perfectly, correlated. Answer: B

3) When there are omitted variables in the regression, which are determinants of the dependent variable, then A) you cannot measure the effect of the omitted variable, but the estimator of your included variable(s) is

(are) unaffected.

B) this has no effect on the estimator of your included variable because the other variable is not included. C) this will always bias the OLS estimator of the included variable.

D) the OLS estimator is biased if the omitted variable is correlated with the included variable. Answer: D

4) Imagine you regressed earnings of individuals on a constant, a binary variable (“ Male”) which takes on the value 1 for males and is 0 otherwise, and another binary variable (“Female”) which takes on the value 1 for females and is 0 otherwise. Because females typically earn less than males, you would expect

A) the coefficient for Male to have a positive sign, and for Female a negative sign.

B) both coefficients to be the same distance from the constant, one above and the other below. C) none of the OLS estimators to exist because there is perfect multicollinearity.

D) this to yield a difference in means statistic. Answer: C

5) When you have an omitted variable problem, the assumption that E(ui Xi) = 0 is violated. This implies that A) the sum of the residuals is no longer zero.

B) there is another estimator called weighted least squares, which is BLUE.

C) the sum of the residuals times any of the explanatory variables is no longer zero. D) the OLS estimator is no longer consistent.

Answer: D

6) If you had a two regressor regression model, then omitting one variable which is relevant

A) will have no effect on the coefficient of the included variable if the correlation between the excluded and the included variable is negative.

B) will always bias the coefficient of the included variable upwards.

C) can result in a negative value for the coefficient of the included variable, even though the coefficient will have a significant positive effect on Y if the omitted variable were included.

D) makes the sum of the product between the included variable and the residuals different from 0. Answer: C

7) (Requires Calculus) In the multiple regression model you estimate the effect on Yi of a unit change in one of the Xi while holding all other regressors constant. This

A) makes little sense, because in the real world all other variables change. B) corresponds to the economic principle of mutatis mutandis.

C) leaves the formula for the coefficient in the single explanatory variable case unaffected. D) corresponds to taking a partial derivative in mathematics.

Answer: D

8) You have to worry about perfect multicollinearity in the multiple regression model because A) many economic variables are perfectly correlated.

B) the OLS estimator is no longer BLUE.

C) the OLS estimator cannot be computed in this situation. D) in real life, economic variables change together all the time. Answer: C

9) In a two regressor regression model, if you exclude one of the relevant variables then A) it is no longer reasonable to assume that the errors are homoskedastic.

B) OLS is no longer unbiased, but still consistent.

C) you are no longer controlling for the influence of the other variable. D) the OLS estimator no longer exists.

Answer: C

10) The intercept in the multiple regression model

A) should be excluded if one explanatory variable has negative values. B) determines the height of the regression line.

C) should be excluded because the population regression function does not go through the origin. D) is statistically significant if it is larger than 1.96.

Answer: B

11) In the multiple regression model, the least squares estimator is derived by A) minimizing the sum of squared prediction mistakes.

B) setting the sum of squared errors equal to zero. C) minimizing the absolute difference of the residuals.

D) forcing the smallest distance between the actual and fitted values. Answer: A

12) The sample regression line estimated by OLS A) has an intercept that is equal to zero.

B) is the same as the population regression line. C) cannot have negative and positive slopes.

D) is the line that minimizes the sum of squared prediction mistakes. Answer: D

14) Under the least squares assumptions for the multiple regression problem (zero conditional mean for the error term, all Xi and Yi being i.i.d., all Xi and ui having finite fourth moments, no perfect multicollinearity), the OLS estimators for the slopes and intercept

A) have an exact normal distribution for n > 25. B) are BLUE.

C) have a normal distribution in small samples as long as the errors are homoskedastic. D) are unbiased and consistent.

Answer: D

15) The main advantage of using multiple regression analysis over differences in means testing is that the regression technique

A) allows you to calculate p-values for the significance of your results. B) provides you with a measure of your goodness of fit.

C) gives you quantitative estimates of a unit change in X.

D) assumes that the error terms are generated from a normal distribution. Answer: C

16) In a multiple regression framework, the slope coefficient on the regressor X2i A) takes into account the scale of the error term.

B) is measured in the units of Yi divided by units of X2i. C) is usually positive.

D) is larger than the coefficient on X1i. Answer: B

17) One of the least squares assumptions in the multiple regression model is that you have random variables which are “i.i.d.” This stands for

A) initially indeterminate differences. B) irregularly integrated dichotomies. C) identically initiated deltas (as in changes). D) independently and identically distributed. Answer: D

18) Omitted variable bias

A) will always be present as long as the regression R2 < 1.

B) is always there but is negligible in almost all economic examples.

C) exists if the omitted variable is correlated with the included regressor but is not a determinant of the dependent variable.

D) exists if the omitted variable is correlated with the included regressor and is a determinant of the dependent variable.

Answer: D

19) The following OLS assumption is most likely violated by omitted variables bias: A) E(ui Xi) = 0

B) (Xi, Yi) i=1,..., n are i.i.d draws from their joint distribution C) there are no outliers for Xi, ui

D) there is heteroskedasticity Answer: A

20) The population multiple regression model when there are two regressors, X1i and X2i can be written as follows, with the exception of:

A) Yi = 0 + 1X1i + 2X2i + ui, i = 1,..., n

B) Yi = 0X0i + 1X1i + 2X2i + ui, X0i = 1, i = 1,..., n C) Yi =

j=0

j Xji + ui, i = 1,..., n

D) Yi = 0 + 1X1i + 2X2i + ... + kXki + ui, i = 1,..., n Answer: D

21) In the multiple regression model Yi = 0 + 1X1i+ 2 X2i + ... + kXki + ui, i = 1,..., n, the OLS estimators are obtained by minimizing the sum of

A) squared mistakes in n i=1 Yi - b0 - b1X1i - ... - bkXki2 B) squared mistakes in n i=1 Yi - b0 - b1X1i - ... - bkXki - ui2 C) absolute mistakes in n i=1 Yi - b0 - b1X1i - ... - bkXki D) squared mistakes in n i=1 Yi - b0 - b1Xi2 Answer: A

22) In the multiple regression model, the SER is given by A) 1 n-2 n i=1 u^i B) 1 n- k -2 n i=1 ui C) 1 n- k-2 n i=1 u^i D) 1 n- k-1 n i=1 u^2_i Answer: D

23) In multiple regression, the R2 increases whenever a regressor is

A) added unless the coefficient on the added regressor is exactly zero. B) added.

24) The adjusted R2, or R2, is given by A) 1- n-2 n - k -1 SSR TSS B) 1- n-2 n - k -1 ESS TSS C) 1- n-1 n - k -1 SSR TSS D) ESS TSS Answer: C

25) Consider the following multiple regression models (a) to (d) below. DFemme = 1 if the individual is a female, and is zero otherwise; DMale is a binary variable which takes on the value one if the individual is male, and is zero otherwise; DMarried is a binary variable which is unity for married individuals and is zero otherwise, and

DSingle is (1-DMarried). Regressing weekly earnings (Earn) on a set of explanatory variables, you will

experience perfect multicollinearity in the following cases unless: A) Earni = 0 + 1 DFemme + 2 Dmale + 3 X3i

B) Earni = 0 + 1 DMarried + 2 DSingle + 3 X3i C) Earni = 0 + 1 DFemme + 3 X3i

D) Earni = 1 DFemme + 2 Dmale + 3 DMarried + 4 DSingle + 5 X3i Answer: C

26) Consider the multiple regression model with two regressors X1and X2, where both variables are determinants

of the dependent variable. When omitting X2from the regression, then there will be omitted variable bias for 1

A) if X1and X2are correlated

B) always

C) if X2is measured in percentages

D) if X2is a dummy variable

Answer: A

27) The dummy variable trap is an example of A) imperfect multicollinearity

B) something that is of theoretical interest only C) perfect multicollinearity

D) something that does not happen to university or college students Answer: C

28) Imperfect multicollinearity

A) is not relevant to the field of economics and business administration B) only occurs in the study of finance

C) means that the least squares estimator of the slope is biased D) means that two or more of the regressors are highly correlated Answer: D

29) Consider the multiple regression model with two regressors X1and X2, where both variables are determinants

of the dependent variable. You first regress Y on X1only and find no relationship. However when regressing Y

on X1and X2, the slope coefficient 1 changes by a large amount. This suggests that your first regression

suffers from

A) heteroskedasticity B) perfect multicollinearity C) omitted variable bias D) dummy variable trap Answer: C

30) Imperfect multicollinearity

A) implies that it will be difficult to estimate precisely one or more of the partial effects using the data at hand

B) violates one of the four Least Squares assumptions in the multiple regression model C) means that you cannot estimate the effect of at least one of the Xs on Y

D) suggests that a standard spreadsheet program does not have enough power to estimate the multiple regression model

Answer: A

6.2 Essays and Longer Questions

1) Females, on average, are shorter and weigh less than males. One of your friends, who is a pre-med student, tells you that in addition, females will weigh less for a given height. To test this hypothesis, you collect height and weight of 29 female and 81 male students at your university. A regression of the weight on a constant, height, and a binary variable, which takes a value of one for females and is zero otherwise, yields the following result:

Studentw= -229.21 – 6.36 × Female + 5.58 × Height , R2=0.50, SER = 20.99

where Studentw is weight measured in pounds and Height is measured in inches. (a) Interpret the results. Does it make sense to have a negative intercept?

(b) You decide that in order to give an interpretation to the intercept you should rescale the height variable. One possibility is to subtract 5 ft. or 60 inches from your Height, because the minimum height in your data set is 62 inches. The resulting new intercept is now 105.58. Can you interpret this number now? Do you thing that the regression R2 has changed? What about the standard error of the regression?

(c) You have learned that correlation does not imply causation. Although this is true mathematically, does this always apply?

Answer: (a) For every additional inch in height, weight increases by roughly 5.5 pounds. Female students weigh approximately 6.5 pounds less than male students, controlling for height. The regression explains 50 percent of the weight variation among students. It does not make sense to interpret the intercept, since there are no observations close to the origin, or, put differently, there are no individuals who are zero inches tall.

2) The cost of attending your college has once again gone up. Although you have been told that education is investment in human capital, which carries a return of roughly 10% a year, you (and your parents) are not pleased. One of the administrators at your university/college does not make the situation better by telling you that you pay more because the reputation of your institution is better than that of others. To investigate this hypothesis, you collect data randomly for 100 national universities and liberal arts colleges from the 2000 -2001

U.S. News and World Report annual rankings. Next you perform the following regression Cost= 7,311.17 + 3,985.20 × Reputation – 0.20 × Size

+ 8,406.79 × Dpriv – 416.38 × Dlibart – 2,376.51 × Dreligion

R2=0.72, SER = 3,773.35

where Cost is Tuition, Fees, Room and Board in dollars, Reputation is the index used in U.S. News and World

Report (based on a survey of university presidents and chief academic officers), which ranges from 1 (“marginal

”) to 5 (“distinguished”), Size is the number of undergraduate students, and Dpriv, Dlibart, and Dreligion are binary variables indicating whether the institution is private, a liberal arts college, and has a religious affiliation.

(a) Interpret the results. Do the coefficients have the expected sign?

(b) What is the forecasted cost for a liberal arts college, which has no religious affiliation, a size of 1,500 students and a reputation level of 4.5? (All liberal arts colleges are private.)

(c) To save money, you are willing to switch from a private university to a public university, which has a ranking of 0.5 less and 10,000 more students. What is the effect on your cost? Is it substantial?

(d) Eliminating the Size and Dlibart variables from your regression, the estimation regression becomes

Cost = 5,450.35 + 3,538.84 × Reputation + 10,935.70 × Dpriv – 2,783.31 × Dreligion; R2=0.72, SER = 3,792.68

Why do you think that the effect of attending a private institution has increased now?

(e) What can you say about causation in the above relationship? Is it possible that Cost affects Reputation rather than the other way around?

Answer: (a) An increase in reputation by one category, increases the cost by roughly $3,985. The larger the size of the college/university, the lower the cost. An increase of 10,000 students results in a $2,000 lower cost. Private schools charge roughly $8,406 more than public schools. A school with a religious affiliation is approximately $2,376 cheaper, presumably due to subsidies, and a liberal arts college also charges roughly $416 less. There are no observations close to the origin, so there is no direct interpretation of the intercept. Other than perhaps the coefficient on liberal arts colleges, all coefficients have the expected sign.

(b) $ 32,935.

(c) Roughly $ 12,4.00. Since over the four years of education, this implies approximately $50,000, it is a substantial amount of money for the average household.

(d) Private institutions are smaller, on average, and some of these are liberal arts colleges. Both of these variables had negative coefficients.

(e) It is very possible that the university president and chief academic officer are influenced by the cost variable in answering the U.S. News and World Report survey. If this were the case, then the above equation suffers from simultaneous causality bias, a topic that will be covered in a later chapter. However, this poses a serious threat to the internal validity of the study.

3) In the multiple regression model with two explanatory variables

Yi = 0 + 1X1i + 2X2i + ui

the OLS estimators for the three parameters are as follows (small letters refer to deviations from means as in zi = Zi – Z): ^ 0 = Y –^1X1 –^2X2 ^ 1 = n i=1yix1i n i=1 x2_{2i -} n i=1yix2i n i=1x1ix2i n i=1 x 2_1i n i=1 x 2_{2i - (} n i=1 x1ix2i)2 ^ 2 = n i=1yix2i n i=1 x2_{1i -} n i=1yix1i n i=1x1ix2i n i=1 x 2_1i n i=1 x 2_{2i - (} n i=1x1ix2i )2

You have collected data for 104 countries of the world from the Penn World Tables and want to estimate the effect of the population growth rate (X1i) and the saving rate (X2i) (average investment share of GDP from 1980 to 1990) on GDP per worker (relative to the U.S.) in 1990. The various sums needed to calculate the OLS estimates are given below:

n i=1Yi = 33.33; n i=1 X1i = 2.025; n i=1 X2i = 17.313 n i=1 y2_{i = 8.3103;} n i=1 x2_{1i = .0122;} n i=1 x2_{2i = 0.6422} n i=1 yi x1i = -0.2304; n i=1 yi x2i = 1.5676; n i=1x1i x2i = -0.0520

(a) What are your expected signs for the regression coefficient? Calculate the coefficients and see if their signs correspond to your intuition.

(b) Find the regression R2, and interpret it. What other factors can you think of that might have an influence on productivity?

Answer: (a) You expect ^1 < 0 and^2 > 0 with no prior expectation on the intercept. Substituting the above numbers into the equations for the regression coefficients results in ^1 = -12.95,^2 = 1.39, and^0 = 0.34.

not the country was democratic over the sample period, or had political stability, etc.

4) A subsample from the Current Population Survey is taken, on weekly earnings of individuals, their age, and their gender. You have read in the news that women make 70 cents to the $1 that men earn. To test this hypothesis, you first regress earnings on a constant and a binary variable, which takes on a value of 1 for females and is 0 otherwise. The results were:

Earn = 570.70 – 170.72 × Female, R2=0.084, SER = 282.12.

(a) There are 850 females in your sample and 894 males. What are the mean earnings of males and females in this sample? What is the percentage of average female income to male income?

(b) You decide to control for age (in years) in your regression results because older people, up to a point, earn more on average than younger people. This regression output is as follows:

Earn= 323.70 – 169.78 × Female + 5.15 × Age, R2=0.135, SER = 274.45.

Interpret these results carefully. How much, on average, does a 40-year-old female make per year in your sample? What about a 20-year-old male? Does this represent stronger evidence of discrimination against females?

Answer: (a) Males earn $570.70, females $399.98. Percentage of average female income to male income is 70.1% in the sample.

(b) As individuals become one year older, they earn $5.15 more, on average. Females earn significantly less money on average and for a given age. 13.5 percent of the earnings variation is explained by the regression. A 40-year-old female earns $359.92, while a 20-year-old male makes $426.70. There is somewhat more evidence here, since age has been added as a regressor. However, many attributes, which could potentially explain this difference, are still omitted.

5) You have collected data from Major League Baseball (MLB) to find the determinants of winning. You have a general idea that both good pitching and strong hitting are needed to do well. However, you do not know how much each of these contributes separately. To investigate this problem, you collect data for all MLB during 1999 season. Your strategy is to first regress the winning percentage on pitching quality (“Team ERA”), second to regress the same variable on some measure of hitting (“OPS – On-base Plus Slugging percentage”), and third to regress the winning percentage on both.

Summary of the Distribution of Winning Percentage, On Base plus Slugging Percentage, and Team Earned Run Average for MLB in 1999

Average Standard deviation Percentile 10% 25% 40% 50% (median) 60% 75% 90% Team ERA 4.71 0.53 3.84 4.35 4.72 4.78 4.91 5.06 5.25 OPS 0.778 0.034 0.720 0.754 0.769 0.780 0.790 0.798 0.820 Winning Percentage 0.50 0.08 0.40 0.43 0.46 0.48 0.49 0.59 0.60

Winpct = -0.19 – 0.099 × teamera + 1.490 × ops , R2=0.92, SER = 0.02.

(a) Interpret the multiple regression. What is the effect of a one point increase in team ERA? Given that the Atlanta Braves had the most wins that year, wining 103 games out of 162, do you find this effect important? Next analyze the importance and statistical significance for the OPS coefficient. (The Minnesota Twins had the minimum OPS of 0.712, while the Texas Rangers had the maximum with 0.840.) Since the intercept is negative, and since winning percentages must lie between zero and one, should you rerun the regression through the origin?

(b) What are some of the omitted variables in your analysis? Are they likely to affect the coefficient on Team

ERA and OPS given the size of the R2 and their potential correlation with the included variables?

Answer: (a) A single point increase in team ERA lowers the winning percentage by approximately 10 percent. A 0.1 increase in OPS results roughly in an increase of 15 percent. Given that there are no observations close to the origin, you should not interpret the intercept. The multiple regression explains 92 percent of the variation in winning percentage. The Atlanta Braves only won 63.6 percent of their games. Given that this represents the best record during that season, a 10 percentage point drop is important. Although the intercept cannot be interpreted, it anchors the regression at a certain level and should therefore not be omitted.

(b) The quality of the management and coaching comes to mind, although both may be reflected in the performance statistics, as are salaries. There are other aspects of baseball performance that are missing, such as the fielding percentage of the team.

6) In the process of collecting weight and height data from 29 female and 81 male students at your university, you also asked the students for the number of siblings they have. Although it was not quite clear to you initially what you would use that variable for, you construct a new theory that suggests that children who have more

In document fgghhdh gdd hh (Page 122-154)