351_metrics_psets.pdf


1. Suppose that X is a Bernoulli random variable with $\Pr(X = 1) = p$. Show that $E(X^k) = p$ for any $k > 0$.

2. The table below gives the joint probability distribution between employment status and college graduation among those either employed or unemployed in the working-age US population.

a. Compute $E(Y)$.

b. Compute $E(Y \mid X = 1)$ and $E(Y \mid X = 0)$.

c. A randomly selected member of this population reports being unemployed. What is the probability that this worker is a college graduate?

d. Are educational achievement and employment status independent? Explain.

                           Unemployed (Y=0)   Employed (Y=1)
Non-college grads (X=0)         0.045              0.709
College grads (X=1)             0.005              0.241
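The quantities in Problem 2 can be checked numerically. The following is a minimal Python sketch, assuming the joint probabilities are stored in a dictionary keyed by (X, Y); the helper name `marginal` and the variable names are illustrative and not part of the problem set.

```python
# Joint distribution from the table: keys are (X, Y), with
# X = 1 for college graduates and Y = 1 for employed.
joint = {(0, 0): 0.045, (0, 1): 0.709,
         (1, 0): 0.005, (1, 1): 0.241}

def marginal(joint, axis):
    """Marginal distribution of X (axis=0) or Y (axis=1)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

# E(Y) equals Pr(Y = 1) because Y is binary.
e_y = sum(y * p for y, p in p_y.items())

# E(Y | X = x) = Pr(Y = 1, X = x) / Pr(X = x)
e_y_given_x = {x: joint[(x, 1)] / p_x[x] for x in p_x}

# Pr(X = 1 | Y = 0): share of college graduates among the unemployed.
p_grad_given_unemp = joint[(1, 0)] / p_y[0]

# Independence requires Pr(X = x, Y = y) = Pr(X = x) Pr(Y = y) in every cell.
independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9
                  for (x, y) in joint)
```

Comparing `e_y_given_x[0]` with `e_y_given_x[1]` gives a second route to part (d): under independence the two conditional means would coincide.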

3. X and Y are discrete random variables with the joint distribution given below.

a. Calculate the probability distribution, mean and variance of Y.

b. Calculate the probability distribution, mean and variance of Y given that X=8.

c. Calculate the covariance and correlation between X and Y.

         Y=14   Y=22   Y=30   Y=40   Y=65
X=1      0.02   0.05   0.10   0.03   0.01
X=5      0.17   0.15   0.05   0.02   0.01
X=8      0.02   0.03   0.15   0.10   0.09
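The calculations in Problem 3 follow one pattern: sum the joint table to get marginals, rescale a row to get a conditional distribution, and weight deviations by the joint probabilities for the covariance. A minimal Python sketch of that pattern, with illustrative names (`mean_var`, `joint`) that are not part of the problem set:

```python
from math import sqrt

# Joint probabilities from the table: rows are X values, columns are Y values.
x_vals = [1, 5, 8]
y_vals = [14, 22, 30, 40, 65]
joint = [
    [0.02, 0.05, 0.10, 0.03, 0.01],  # X = 1
    [0.17, 0.15, 0.05, 0.02, 0.01],  # X = 5
    [0.02, 0.03, 0.15, 0.10, 0.09],  # X = 8
]

def mean_var(values, pmf):
    """Mean and variance of a discrete distribution."""
    mean = sum(v * p for v, p in zip(values, pmf))
    var = sum((v - mean) ** 2 * p for v, p in zip(values, pmf))
    return mean, var

# Marginal distributions: sum over the other variable.
p_x = [sum(row) for row in joint]
p_y = [sum(row[j] for row in joint) for j in range(len(y_vals))]
mean_x, var_x = mean_var(x_vals, p_x)
mean_y, var_y = mean_var(y_vals, p_y)

# Conditional distribution of Y given X = 8: rescale that row to sum to 1.
row_8 = joint[x_vals.index(8)]
p_y_given_8 = [p / sum(row_8) for p in row_8]
mean_y_8, var_y_8 = mean_var(y_vals, p_y_given_8)

# Covariance and correlation from the joint distribution.
cov_xy = sum(
    joint[i][j] * (x - mean_x) * (y - mean_y)
    for i, x in enumerate(x_vals)
    for j, y in enumerate(y_vals)
)
corr_xy = cov_xy / sqrt(var_x * var_y)
```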

4. In a given population of two-earner couples, male earnings have a mean of $40,000 per year and a standard deviation of $12,000. Female earnings have a mean of $45,000 per year and a standard deviation of $18,000. The correlation between male and female earnings for a couple is 0.8. Let C denote the combined earnings for a couple.

a. What is the mean of C?

b. What is the covariance between male and female earnings?

c. What is the standard deviation of C?

d. Convert the answers to (a)-(c) from US dollars to Euros.
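Problem 4 rests on the rules for the mean and variance of a sum and on how those quantities rescale under a change of units. The sketch below illustrates them in Python. The problem gives no exchange rate, so `eur_per_usd` is a hypothetical parameter, and the function name `to_euros` is likewise illustrative.

```python
from math import sqrt

mean_m, sd_m = 40_000.0, 12_000.0   # male earnings (USD)
mean_f, sd_f = 45_000.0, 18_000.0   # female earnings (USD)
corr = 0.8

# Mean of a sum is the sum of the means.
mean_c = mean_m + mean_f

# cov(M, F) = corr(M, F) * sd(M) * sd(F)
cov_mf = corr * sd_m * sd_f

# var(M + F) = var(M) + var(F) + 2 cov(M, F)
sd_c = sqrt(sd_m ** 2 + sd_f ** 2 + 2 * cov_mf)

def to_euros(mean, cov, sd, eur_per_usd):
    """Rescale results from dollars to euros.

    Means and standard deviations scale with the rate;
    covariances (and variances) scale with the rate squared.
    The rate itself is an assumption supplied by the caller.
    """
    return mean * eur_per_usd, cov * eur_per_usd ** 2, sd * eur_per_usd
```

For example, `to_euros(mean_c, cov_mf, sd_c, eur_per_usd=0.9)` would apply an assumed rate of 0.9 euros per dollar.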

5. The random variable Y has a mean of 1 and a variance of 4. Find the mean and variance of $\frac{1}{2}(Y - 1)$.


6. Suppose you have $1 to invest and you are planning to put a fraction w into a stock market fund and a fraction 1−w into a bond mutual fund. $1 invested in the stock fund yields R_S after one year and $1 invested in the bond fund yields R_B after one year. R_S is random with mean 0.08 and standard deviation 0.07. R_B is random with mean 0.05 and standard deviation 0.04. The correlation between R_S and R_B is 0.25. When you place a fraction w of your money in the stock fund and fraction 1−w in the bond fund, then the return on your investment is R = wR_S + (1−w)R_B.

a. If w=0.5, compute the mean and standard deviation of R.

b. If w=0.75, compute the mean and standard deviation of R.

c. What value of w maximizes the mean of R?

d. What value of w minimizes the standard deviation of R?
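The portfolio return in Problem 6 is a weighted sum of two correlated random variables, so its mean and variance follow from the usual formulas. A minimal Python sketch, with illustrative names (`portfolio`, `w_min`); the closed-form minimizer comes from setting the derivative of the variance with respect to w to zero:

```python
from math import sqrt

MU_S, SD_S = 0.08, 0.07   # stock fund: mean and standard deviation
MU_B, SD_B = 0.05, 0.04   # bond fund: mean and standard deviation
RHO = 0.25                # correlation between R_S and R_B
COV_SB = RHO * SD_S * SD_B

def portfolio(w):
    """Mean and standard deviation of R = w*R_S + (1 - w)*R_B."""
    mean = w * MU_S + (1 - w) * MU_B
    var = (w ** 2 * SD_S ** 2
           + (1 - w) ** 2 * SD_B ** 2
           + 2 * w * (1 - w) * COV_SB)
    return mean, sqrt(var)

# Variance-minimizing weight from the first-order condition d var / d w = 0.
w_min = (SD_B ** 2 - COV_SB) / (SD_S ** 2 + SD_B ** 2 - 2 * COV_SB)

mean_half, sd_half = portfolio(0.5)
mean_three_q, sd_three_q = portfolio(0.75)
```

Because the mean is linear and increasing in w, it is largest at the upper end of whatever range of w is allowed (w = 1 if w is restricted to lie between 0 and 1).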

7. X and Y are random variables with $X/Y$ independent of $Y$. Show that $E\left(\frac{X}{Y}\right) = \frac{E(X)}{E(Y)}$.

8. Compute the following probabilities.

a. If $Y \sim N(1, 2)$, then calculate $\Pr(Y \le 3)$.

b. If $Y \sim N(3, 3)$, then calculate $\Pr(Y > 0)$.

c. If $Y \sim N(50, 5)$, then calculate $\Pr(40 \le Y \le 52)$.
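Normal probabilities like those in Problem 8 can be checked with Python's standard library. This sketch assumes the second argument of N(·, ·) is the variance; if it is meant as the standard deviation, drop the square root in the hypothetical helper `normal`.

```python
from math import sqrt
from statistics import NormalDist

def normal(mean, variance):
    """Normal distribution parameterized by mean and variance (assumed convention)."""
    return NormalDist(mu=mean, sigma=sqrt(variance))

# (a) Pr(Y <= 3) for Y ~ N(1, 2)
p_a = normal(1, 2).cdf(3)

# (b) Pr(Y > 0) for Y ~ N(3, 3)
p_b = 1 - normal(3, 3).cdf(0)

# (c) Pr(40 <= Y <= 52) for Y ~ N(50, 5)
dist_c = normal(50, 5)
p_c = dist_c.cdf(52) - dist_c.cdf(40)
```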


1. Values of height in inches (X) and weight in pounds (Y) are recorded from a sample of 300 male college students. The resulting summary statistics are $\bar{X} = 70.5$, $\bar{Y} = 158$, $s_x = 1.8$, $s_y = 14.2$, $s_{xy} = 21.73$ and $r = 0.85$. Convert these summary statistics to meters and kilograms.

2. Consider a population with $\mu_Y = 100$ and $\sigma_Y^2 = 43$ (note that this is the variance rather than the standard deviation).

a. For a random sample of size 100, find $\Pr(\bar{Y} < 101)$.

b. For a random sample of size 64, find $\Pr(101 < \bar{Y} < 103)$.

c. For a random sample of size 165, find $\Pr(\bar{Y} > 98)$.
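For samples this large, the sample mean is approximately normal with mean μ_Y and variance σ²_Y / n. A minimal Python sketch under that central-limit approximation; the helper name `sample_mean_dist` is illustrative:

```python
from math import sqrt
from statistics import NormalDist

MU_Y = 100.0
VAR_Y = 43.0   # population variance, not the standard deviation

def sample_mean_dist(n):
    """Approximate sampling distribution of the sample mean for sample size n."""
    return NormalDist(mu=MU_Y, sigma=sqrt(VAR_Y / n))

# (a) n = 100: Pr(Ybar < 101)
p_a = sample_mean_dist(100).cdf(101)

# (b) n = 64: Pr(101 < Ybar < 103)
d_b = sample_mean_dist(64)
p_b = d_b.cdf(103) - d_b.cdf(101)

# (c) n = 165: Pr(Ybar > 98)
p_c = 1 - sample_mean_dist(165).cdf(98)
```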

3. Resistors to be used in a circuit have average resistance of 200 ohms with a standard deviation of 10 ohms. If 25 of these resistors are randomly selected to be used in a circuit, find the probability that the total resistance is less than 5100 ohms.

4. In any year, the weather can inflict storm damage to a home. From year to year, the damage is random. Let Y denote the dollar value of damage in a given year. With probability 0.95, Y is 0. But with probability 0.05, Y is 20,000.

a. Calculate the mean and standard deviation of the damage Y.

b. Consider an insurance pool of 100 people whose homes are far enough apart that damage to each home is independent of damage to any other home. What is the expected value of the average damage $\bar{Y}$?
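The storm-damage problem is a two-point distribution, so its moments can be computed directly. The Python sketch below also reports the standard deviation of the pooled average, which the problem does not ask for but which shows why pooling helps; all variable names are illustrative.

```python
from math import sqrt

# Two-point distribution of annual damage: value -> probability.
damage = {0.0: 0.95, 20_000.0: 0.05}

mean_y = sum(y * p for y, p in damage.items())
var_y = sum((y - mean_y) ** 2 * p for y, p in damage.items())
sd_y = sqrt(var_y)

# Average damage across n independent homes:
# same expected value, standard deviation shrinks by sqrt(n).
n = 100
mean_ybar = mean_y
sd_ybar = sd_y / sqrt(n)
```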


Unit 2.3

1. In a survey of 400 voters, 215 responded that they would vote for the Democrat and 185 responded that they would vote for the Republican. Let p denote the true fraction of voters who intend to vote for the Democrat.

a. Estimate p.

b. Find the standard error of your estimate.

c. What is the p-value for the test $H_0: p = 0.5$ versus $H_a: p \ne 0.5$?

d. What is the p-value for the test $H_0: p = 0.5$ versus $H_a: p > 0.5$?

e. Is there statistically significant evidence that the Democrat is ahead?
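The polling problem is a large-sample test of a proportion. A minimal Python sketch, assuming the standard error is computed from the estimate (as in part b) rather than from the null value of 0.5; using the null value instead gives a slightly different t-statistic.

```python
from math import sqrt
from statistics import NormalDist

n, n_dem = 400, 215

p_hat = n_dem / n
se = sqrt(p_hat * (1 - p_hat) / n)   # standard error based on the estimate

# t-statistic for H0: p = 0.5, compared with a standard normal.
t_stat = (p_hat - 0.5) / se
std_normal = NormalDist()

p_two_sided = 2 * (1 - std_normal.cdf(abs(t_stat)))   # Ha: p != 0.5
p_one_sided = 1 - std_normal.cdf(t_stat)              # Ha: p > 0.5
```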

2. Suppose that a lightbulb manufacturing plant produces bulbs with a mean life of 2000 hours and a standard deviation of 200 hours. An inventor claims to have developed a process that produces bulbs with a longer mean life. The plant manager randomly selects 100 bulbs produced by the new process. She says that she will accept the inventor’s claim if the sample mean life of the new bulbs is greater than 2100 hours.

a. What is the size of the manager’s testing procedure?

b. Suppose that the new process is in fact better and has a mean bulb life of 2150 hours. What is the power of the manager’s testing procedure?

c. What testing procedure should the manager use if she wants the size of the test to be 5%? Be explicit.
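Size and power in the lightbulb problem are both tail probabilities of the sample mean, evaluated under different true means. A minimal Python sketch using the normal approximation and assuming the standard deviation stays at 200 hours under the new process; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

MU_0 = 2000.0      # mean life under the null (old process)
SIGMA = 200.0      # standard deviation, assumed unchanged by the new process
N = 100
CUTOFF = 2100.0    # manager accepts the claim if the sample mean exceeds this

se = SIGMA / sqrt(N)

# Size: probability of accepting the claim when the null is true.
size = 1 - NormalDist(MU_0, se).cdf(CUTOFF)

# Power: probability of accepting the claim when the true mean is 2150.
power = 1 - NormalDist(2150.0, se).cdf(CUTOFF)

# Cutoff that would give a one-sided test of size 5%.
cutoff_5pct = NormalDist(MU_0, se).inv_cdf(0.95)
```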

3. The average APGAR score of newborn babies from a particular city has historically been 5.35. However, you believe that pollution from a new chemical plant is causing problems for expecting mothers, and so APGAR scores have fallen below the historical level. You take a sample of 50 babies and find that their average APGAR score is 4.75. The standard deviation in APGAR scores is 1.88.

a. Define the null and alternative hypotheses for testing your claim.

b. Conduct the hypothesis test at a significance level of 5%. What is your conclusion?

c. Explain what a significance level of 5% means.

d. Determine the p-value of your sample.

e. What is the power of your test under the alternative that the mean APGAR score has fallen to 5.00?


4.

designed to encourage high school students to register to vote. In a sample of 80 students enrolled in the program, 75 registered to vote.

a. Construct a 95% confidence interval for the proportion of students enrolled in the program who register to vote.

b. Historically, 90% of high school students register to vote. Use the data above to test the hypothesis that the new program raises the percentage of students who register to vote. Define the hypotheses explicitly and test the claim. What is your conclusion?

c. Determine the p-value of your sample.

d. What is the power of your test under the alternative that 95% of the students in the program will register to vote?

e. Suppose that you could increase your sample size to 500 students. What now is the power of your hypothesis test under the same alternative as in (d)?


Unit 3.1

1. A researcher uses a sample of 100 classrooms with data on class size (CS) and average test scores (T) to estimate the following regression:

$\widehat{T} = 520.4 - 5.82 \times CS$, $R^2 = 0.08$, $SER = 11.5$

a. A classroom has 22 students. What is the regression’s prediction for the average test score?

b. Last year a classroom had 19 students and this year it has 23 students. What is the regression’s prediction for the change in the classroom average test score?

c. The sample average class size across the 100 classrooms is 21.4. What is the sample average of the test score across the 100 classrooms?

d. What is the sample standard deviation of test scores across the 100 classrooms?
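The class-size questions use only the fitted line and the definitions of SER and R². A minimal Python sketch, assuming SER is computed with the usual n − 2 degrees-of-freedom correction; the function name `predict` is illustrative.

```python
from math import sqrt

B0, B1 = 520.4, -5.82        # estimated intercept and slope
R2, SER, N = 0.08, 11.5, 100

def predict(class_size):
    """Predicted average test score for a given class size."""
    return B0 + B1 * class_size

pred_22 = predict(22)            # part (a)
change_19_to_23 = B1 * (23 - 19) # part (b): only the slope matters

# Part (c): the OLS line passes through the sample means.
mean_t = predict(21.4)

# Part (d): SER^2 = SSR / (n - 2) and R^2 = 1 - SSR / TSS,
# so TSS = SSR / (1 - R^2) and s_T = sqrt(TSS / (n - 1)).
ssr = SER ** 2 * (N - 2)
tss = ssr / (1 - R2)
sd_t = sqrt(tss / (N - 1))
```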

2. Consider the single-variable regression model.

a. A linear regression yields $\hat{\beta}_1 = 0$. Show that $R^2 = 0$.

b. A linear regression yields $R^2 = 0$. Is it necessarily true that $\hat{\beta}_1 = 0$? Explain.

3. Consider the single-variable regression model and suppose you know that $\beta_0 = 0$. Derive an explicit formula for the least squares estimator $\hat{\beta}_1$.

4. This problem uses the data set teachingratings.wf1. One of the characteristics is an index of the professor’s “beauty” as rated by a panel of six judges. In this exercise, you will investigate how course evaluations are related to the professor’s beauty.

a. Generate a scatterplot of course evaluations and beauty. Does there appear to be a relationship?

b. Calculate the mean and standard deviation of course_eval and beauty.

c. Run a regression of average course evaluation on the professor’s beauty. What are the estimated intercept and slope?

d. Explain why the estimated intercept is equal to the sample mean of course_eval.

e. Professor Watson has an average value of beauty, while Professor Stock’s beauty is one standard deviation below average. Predict Professor Stock’s and Professor Watson’s course evaluations.

f. Comment on the size of the regression’s slope. Is the estimated effect of beauty on course evaluations large or small? Explain what you mean.


1. A researcher uses a sample of 100 classrooms with data on class size (CS) and average test scores (T) to estimate the following regression. Note that standard errors appear in parentheses below the estimated coefficient.

$\widehat{T} = \underset{(20.4)}{520.4} - \underset{(2.21)}{5.82} \times CS$, $R^2 = 0.08$, $SER = 11.5$

a. Construct a 95% confidence interval for the true slope $\beta_1$.

b. Calculate the p-value for the two-sided test of the null hypothesis $\beta_1 = 0$. Do you reject the null at the 5% level? What about the 1% level?

c. Calculate the p-value for the two-sided test of the null hypothesis $\beta_1 = -5.6$.

d. Is $-5.6$ contained in the 95% confidence interval for $\beta_1$? Relate this to (c).
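Confidence intervals and two-sided p-values for a slope both come from the estimate and its standard error. A minimal Python sketch using the large-sample normal approximation (an assumption; a t distribution with 98 degrees of freedom would give slightly wider intervals). The helper name `two_sided_p` is illustrative.

```python
from statistics import NormalDist

B1_HAT, SE_B1 = -5.82, 2.21
STD_NORMAL = NormalDist()

# 95% confidence interval: estimate +/- 1.96 standard errors.
z_975 = STD_NORMAL.inv_cdf(0.975)
ci_95 = (B1_HAT - z_975 * SE_B1, B1_HAT + z_975 * SE_B1)

def two_sided_p(null_value):
    """Two-sided p-value for H0: beta_1 = null_value."""
    t_stat = (B1_HAT - null_value) / SE_B1
    return 2 * (1 - STD_NORMAL.cdf(abs(t_stat)))

p_zero = two_sided_p(0.0)
p_minus_5_6 = two_sided_p(-5.6)
```

A hypothesized value lies inside the 95% interval exactly when its two-sided p-value exceeds 0.05, which is the link part (d) asks about.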

2. A researcher estimates the following regression of wage (W) on years of education (EDU) and obtains the following results.

$\widehat{W} = \underset{(0.93)}{-3.13} + \underset{(0.07)}{1.47} \times EDU$, $R^2 = 0.13$, $SER = 8.77$

a. A randomly selected worker reports an education level of 16 years. What is his expected wage?

b. A high school graduate is contemplating whether to obtain a four-year degree. How much is his wage expected to rise?

c. Develop a 95% confidence interval for your answer in (b).


3. Y is regressed on X using a sample with n=250 to obtain the following results.

$\widehat{Y} = \underset{(3.1)}{5.4} + \underset{(1.5)}{3.2}\,X$, $R^2 = 0.26$, $SER = 6.2$

a. Test the null hypothesis that $\beta_1 = 0$ against $\beta_1 \ne 0$ at the 5% level.

b. Construct a 95% confidence interval for $\beta_1$.

c. Would you be surprised if you learned that X and Y were independent? Explain.

d. Suppose that X and Y are independent and that many random samples of size 250 are drawn. In what fraction of the samples would the null hypothesis $\beta_1 = 0$ from (a) be rejected?


1. This problem uses the data set cps04.wf1. Run a regression of average hourly earnings (AHE) on Age and answer the following questions. Be sure to use White standard errors.

a. Is the estimated slope coefficient statistically significant?

b. What is the p-value associated with the slope coefficient’s t-statistic?

c. Construct a 95% confidence interval for the slope coefficient.

d. Estimate the regression using only the data for observations with no college degree.


Unit 4.1

1. Data were collected from a random sample of 220 home sales from a community in 2003. Let P denote the selling price in thousands of US dollars. BDR denotes the number of bedrooms, Bath denotes the number of bathrooms, Hsize denotes the size of the house (in square feet), Lsize denotes the size of the lot (in square feet), Age denotes the age of the house (in years), and Poor is a binary variable set equal to 1 if the house is in poor condition. The estimated regression yields:

$\widehat{P} = 119.2 + 0.485\,BDR + 23.4\,Bath + 0.156\,Hsize + 0.002\,Lsize + 0.090\,Age - 48.8\,Poor$, $\bar{R}^2 = 0.72$, $SER = 41.5$

a. Suppose that a homeowner converts part of an existing family room in her house into a new bathroom. What is the expected increase in the value of the house?

b. Suppose that a homeowner adds a new bathroom to her house, which increases the size of the house by 100 square feet. What is the expected increase in the value of the house?

c. Suppose that a homeowner converts a bedroom into a bathroom. What is the expected increase in the value of the house?

d. What is the loss in value if a homeowner lets his house run down so that its condition becomes poor?

e. Compute $R^2$ for the regression.

2. This problem uses the dataset collegedistance.wf1.

a. Run a regression of years of completed education (ED) on distance to the nearest college (Dist). What is the estimated slope?

b. Run a regression of years of completed education (ED) on distance to the nearest college (Dist), but also include in the regression additional controls for the characteristics of the student, his family and the local labor market. In particular, include Bytest, Female, Black, Hispanic, Incomehi, Ownhome, DadColl, Cue80 and Stwmfg80. What is the estimated effect of Dist on ED?

c. Is the estimated effect of Dist on ED in the regression from (b) substantially different from the regression in (a)? Based on this, does the regression in (a) seem to suffer from omitted variables bias?

d. Compare the fit of the regression in (a) and (b) using $R^2$ and $\bar{R}^2$.

e. Interpret the value of the coefficient on DadColl precisely in words.

f. Explain why Cue80 and Stwmfg80 appear in the regression using economic language. Are the signs of the coefficients what you would have expected?

g. Bob is a black male. His high school was 20 miles from the nearest college and his base-year composite test score was 58. His family income in 1980 was


1. A regression was estimated using a random sample of 4000 full-time workers. College is a binary variable equal to 1 if the worker graduated from college. Female is a binary variable equal to 1 if the worker is female. The following is the estimated regression.

$\widehat{Wage} = \underset{(0.14)}{12.69} + \underset{(0.21)}{5.46}\,College - \underset{(0.20)}{2.64}\,Female$

a. Is the college-high school earnings difference estimated in this regression statistically significant at the 5% level? Construct a 95% confidence interval for this difference.

b. Is the male-female earnings difference estimated in this regression statistically significant at the 5% level? Construct a 95% confidence interval for this difference.

2. Data were collected from a random sample of 220 home sales from a community in 2003. Let P denote the selling price in thousands of US dollars. BDR denotes the number of bedrooms, Bath denotes the number of bathrooms, Hsize denotes the size of the house (in square feet), Lsize denotes the size of the lot (in square feet), Age denotes the age of the house (in years), and Poor is a binary variable set equal to 1 if the house is in poor condition. The estimated regression yields:

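$\widehat{P} = \underset{(23.9)}{119.2} + \underset{(2.61)}{0.485}\,BDR + \underset{(8.94)}{23.4}\,Bath + \underset{(0.011)}{0.156}\,Hsize + \underset{(0.00048)}{0.002}\,Lsize + \underset{(0.311)}{0.090}\,Age - \underset{(10.5)}{48.8}\,Poor$, $\bar{R}^2 = 0.72$, $SER = 41.5$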

a. Is the coefficient on BDR significantly different from zero?

b. Typically, five-bedroom houses sell for much more than two-bedroom houses. Given your answer in (a), how is this consistent with the regression results?

c. A homeowner purchases 2000 square feet from an adjacent lot. Construct a 99% confidence interval for the change in the value of her house.


Unit 4.3

1. This problem uses the dataset growth.wf1.

a. Construct a table that shows the sample mean, standard deviation, maximum value and minimum value for Growth, TradeShare, YearsSchool, Oil, Rev_Coups, Assassinations and RGDP60. Also state the units in which each is measured.

b. Run a regression of Growth on TradeShare, YearsSchool, Rev_Coups, Assassinations and RGDP60. What is the value of the coefficient on Rev_Coups? Interpret the coefficient precisely in words.

c. Use the regression to predict the average annual growth rate for a country with average values for all the regressors.

d. Repeat (c), but with the country’s TradeShare one standard deviation above the mean.

e. Why is Oil omitted from the regression? What would happen if it were included?

2. This problem uses the dataset cps04.wf1.

a. Run a regression of average hourly earnings (AHE) on age. What is the estimated intercept? What is the estimated slope?

b. Run a regression of AHE on Age, gender (Female) and Bachelor. What is the estimated effect of Age on earnings? Construct a 95% confidence interval for this coefficient.

c. Are the results from the regression in (b) substantively different from the results in (a)? Does the specification in (a) seem to suffer from omitted variables bias?

d. Bob is a 26-year-old male worker with a high school diploma. Alexis is a 30-year-old female worker with a college degree. Predict their earnings using the regression in (b).

e. Compare the fit of the regressions in (a) and (b) using $R^2$ and $\bar{R}^2$.

f. Why are $R^2$ and $\bar{R}^2$ so close to each other in regression (b)?

g. Test the null hypothesis that gender is not a determinant of earnings.

h. Test the null hypothesis that Bachelor is not a determinant of earnings.

i. Test the null hypothesis that both Female and Bachelor jointly are not determinants of earnings.


1. A recent study found that the death rate for people who sleep six to seven hours per night is lower than the death rate for people who sleep eight or more hours, and higher than the death rate for people who sleep five or fewer hours. The 1.1 million observations used for this study came from a random sample of Americans aged 30 to 102. Each survey


Unit 5.1

1. In the following regression, College is a binary variable equal to 1 if the individual holds a college degree. Female is a binary variable equal to 1 if the individual is female. Age measures the individual’s age in years. The other four dummy variables designate the four geographical regions in the US:

Ntheast = 1 if the individual resides in the Northeast
Midwest = 1 if the individual resides in the Midwest
South = 1 if the individual resides in the South
West = 1 if the individual resides in the West

The estimated regression is as follows:

$\widehat{Wage} = 3.75 + 5.44\,College - 2.62\,Female + 0.29\,Age + 0.69\,Ntheast + 0.60\,Midwest - 0.27\,South$

$SER = 6.21$, $R^2 = 0.194$

a. Why is the regressor West omitted from the regression? What would happen if it were included?


1. This problem uses the dataset collegedistance.wf1.

a. Run a regression of ED on Dist, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, DadColl, MomColl, Cue80 and Stwmfg80. If Dist increases from 2 to 3, how are years of education expected to change? If Dist increases from 6 to 7, how are years of education expected to change?

b. Run a regression of ED on Dist, $Dist^2$, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, DadColl, MomColl, Cue80 and Stwmfg80. If Dist increases from 2 to 3, how are years of education expected to change? If Dist increases from 6 to 7, how are years of education expected to change?


Unit 5.3

1. This problem looks at the gender gap in earnings at top corporate jobs in the US. The study looks at total compensation among top executives in a large set of public corporations.

a. Let Female be an indicator equal to 1 for females. A regression of the log of earnings onto Female yields:

$\widehat{\ln(Earnings)} = \underset{(0.01)}{6.48} - \underset{(0.05)}{0.44}\,Female$, $SER = 2.65$

i. Interpret the coefficient on Female precisely in words.

ii. Interpret the SER precisely in words.

iii. Does this regression suggest that female executives earn less than male executives? Explain.

iv. Does this regression demonstrate that there is gender discrimination? Explain.

b. We now introduce two new variables to the regression above. Marketvalue is a measure of the firm’s size, in millions of dollars of market value. Return is a measure of firm stock performance, in percentage points.

$\widehat{\ln(Earnings)} = \underset{(0.03)}{3.86} - \underset{(0.04)}{0.28}\,Female + \underset{(0.004)}{0.37}\,\ln(Marketvalue) + \underset{(0.003)}{0.004}\,Return$

i. Interpret the coefficient on ln(Marketvalue) precisely in words.

ii. Explain the difference in the coefficient on Female between this regression and the regression in (a). Considering the regression evidence, give a credible reason for this difference.

iii. Are large firms more likely to have female executives than small firms?


1. A researcher runs several regressions to study determinants of housing prices. Price is the sale price in dollars. Size is the size of the house in square feet. Bedrooms is a count of the number of bedrooms. Pool is a binary variable equal to 1 if the house has a pool. View is a binary variable equal to 1 if the house has a nice view. Condition is a binary variable equal to 1 if the house is in excellent condition.

a. Using the results in column (1), what is the expected change in price resulting from building a 500 square-foot extension to a house? Construct a 95% confidence interval for the change in price. Be careful about units.

b. Comparing column (1) and column (2), is it better to use Size or ln(Size) as a regressor for housing prices?

c. Using column (2), what is the estimated effect of having a pool on the price of a house? Construct a 95% confidence interval for this effect. Be careful about units.

d. The regression in column (3) adds the number of bedrooms to the regression. How large is the estimated effect of an additional bedroom? Is this effect statistically significant? Considering the other variables in the regression, why do you think this is so small?

e. Is the quadratic term $[\ln(Size)]^2$ important?


Dependent variable for all regressions is ln(Price).

Regression      (1)          (2)          (3)          (4)          (5)
Size            0.00042
                (0.000038)
ln(Size)                     0.69         0.68         0.57         0.69
                             (0.054)      (0.087)      (2.03)       (0.055)
[ln(Size)]^2                                           0.0078
                                                       (0.014)
Bedrooms                                  0.0036
                                          (0.037)
Pool            0.082        0.71         0.71         0.71         0.071
                (0.032)      (0.034)      (0.034)      (0.036)      (0.035)
View            0.037        0.027        0.026        0.027        0.027
                (0.029)      (0.028)      (0.026)      (0.029)      (0.030)
Pool*View                                                           0.0022
                                                                    (0.10)
Condition       0.13         0.12         0.12         0.12         0.12
                (0.045)      (0.035)      (0.035)      (0.036)      (0.035)
Intercept       10.97
                (0.069)
SER             0.102        0.098        0.099        0.099        0.099

2


2.

a. Run a regression of ED on Dist, $Dist^2$, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, DadColl, MomColl, Cue80 and Stwmfg80 and also the interaction DadColl*MomColl. Explain what the interaction term measures.

b. Alexis, Bonnie, Chloe and Diana have the same values of Dist, $Dist^2$, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, Cue80 and Stwmfg80. Neither of Alexis’ parents attended college. Bonnie’s father attended college, but her mother did not. Chloe’s mother attended college, but her father did not. Both of Diana’s parents attended college. Use the regression from (a) to answer the following questions.

i. What does the regression predict for the difference between Bonnie’s and Alexis’ years of education?

ii. What does the regression predict for the difference between Chloe’s and Alexis’ years of education?

iii. What does the regression predict for the difference between Diana’s and Alexis’ years of education?

c. Is there any evidence that the effect of Dist on ED depends on the family’s income?


Unit 6.1

1. A researcher is interested in the effect of military service on future wages. He collects data from a random sample of 4000 workers all aged 40 and estimates the OLS regression $Y = \beta_0 + \beta_1 X + U$, where Y is the worker’s annual earnings and X is a dummy variable equal to 1 if the person served in the military.

a. Explain why the OLS estimates are likely to be unreliable.

b. During the Vietnam War, there was a military draft that was determined randomly by a lottery – that is, the lottery outcome determined who was required to serve in the military. Explain how this lottery might be used as an instrument to estimate the effect of military service on earnings.

2. A researcher is interested in studying the effect of management style (X) on employee performance (Y). He collects data on a large sample of firms and estimates the OLS regression $Y = \beta_0 + \beta_1 X + U$. He finds that managers who are more autocratic have less productive employees and that managers who are friendly have more productive employees.

a. Describe the most obvious problem with this study.


1. This problem uses the dataset fertility.wf1. The research question concerns the effect of fertility on the labor supply decisions of married women aged 21-35 with two or more children.

a. Regress weeksworked on the indicator variable morekids using OLS. On average, do women with more than two children work less than women with two children? How much less?

b. Explain why the OLS regression in (a) is inappropriate for estimating the causal effect of fertility on labor supply.

c. The dataset contains the variable samesex, which is equal to 1 if the first two children are of the same sex and equal to 0 if the first two children are of different sexes. Are couples whose first two children are of the same sex more likely to have a third child? Is the effect large? Is it statistically significant?

d. Explain why samesex is a valid instrument for the regression of labor supply on fertility.

e. Is samesex a weak instrument?

f. Estimate the regression of weeksworked on morekids using TSLS with samesex as an instrument. Contrast the estimated effect of fertility on labor supply with the answer you obtained in (a).

2. Consider the regression model

$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i$

W is exogenous, but X is endogenous and correlated with the error u. Which regression assumptions are violated for each of the following choices of instrumental variable Z?

a. $Z_i$ is independent of $(Y_i, X_i, W_i)$

b. $Z_i = W_i$

c. $W_i = 1$ for all i

d. $Z_i = X_i$

3. Consider the instrumental variable regression model Y =β₀+β₁X₁+β₂X₂+u, with Z as an instrument. Data on X₂ are not available and the model is estimated omitting X₂ from the regression.


4. This problem uses the dataset jec.wf1, which contains data on rail transport in the United States from 1880 to 1886. Suppose that the total demand for rail transport in tons (Q) is as follows:

$\ln(Q) = \beta_0 + \beta_1 \ln(P) + \beta_2\,ice + \beta_3\,seas1 + \dots + \beta_{14}\,seas12 + u$

P is the transport price per ton; ice is a dummy variable equal to 1 if the Great Lakes were iced over (the issue is that grain could be shipped by sea instead of by rail if the Great Lakes were navigable); the dummies {seas1, ..., seas12} indicate the month and are designed to control for seasonal variation in demand.

There was a cartel between rail producers in the United States at the time called the Joint Executive Committee (JEC) but it was not always operational since it frequently broke down as a result of cheating. The variable cartel is a dummy equal to 1 if the cartel was operational when the observation was taken.

a. Estimate the demand equation by OLS. What is the estimated price elasticity of demand and the standard error?

b. Explain why the answer from (a) is likely to be biased.

c. Argue that cartel is a valid instrument for the price in the demand equation.

d. Estimate the first-stage regression. Is cartel a weak instrument?

e. Estimate the demand equation by TSLS. What is the estimated price elasticity of demand and the standard error?

f. Does the evidence suggest that the cartel was charging the profit-maximizing price? (Hint: Review your microeconomics notes on the relationship between monopoly pricing and price elasticity).

5. In an instrumental variables model with one regressor X and two instrumental variables $Z_1$ and $Z_2$, the value of the J-statistic is $J = 18.2$.

a. Does this suggest that $E(u \mid Z_1, Z_2) \ne 0$? Explain.
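With two instruments and one endogenous regressor there is one overidentifying restriction, so under the null of exogenous instruments (and assuming homoskedastic errors) the J-statistic is approximately chi-squared with 1 degree of freedom. The Python sketch below computes the 5% critical value and the p-value using the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal; the helper name `chi2_1_pvalue` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-squared(1) variable.

    If Z is standard normal, Z**2 is chi-squared(1), so
    Pr(Z**2 > stat) = 2 * Pr(Z > sqrt(stat)).
    """
    return 2 * (1 - STD_NORMAL.cdf(sqrt(stat)))

J = 18.2
critical_5pct = STD_NORMAL.inv_cdf(0.975) ** 2   # about 3.84
p_value = chi2_1_pvalue(J)
reject_exogeneity = J > critical_5pct
```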


1. This problem uses the dataset smoking.wf1. The research question is whether workplace smoking bans induce smokers to quit.

a. What is the proportion of smokers among all workers? Workers affected by workplace smoking bans? Workers not affected by workplace smoking bans?

b. What is the difference in the probability of smoking between workers affected by a workplace smoking ban and workers not affected by a workplace smoking ban? Use a linear probability model to determine whether this difference is statistically significant.

c. Estimate a linear probability model with smoker as the dependent variable and the following regressors: smkban, female, age, $age^2$, hsdrop, hsgrad, colsome, colgrad, black and hispanic. Compare the estimated effect of a smoking ban with your answer from (b). Suggest a reason for this difference, based on the substance of this regression.

d. Test at 5% significance the hypothesis that the coefficient on smkban is equal to zero.

e. Does the probability of smoking rise or fall with level of education? Test the hypothesis that the probability of smoking does not depend on education. Note that this is a joint test involving multiple coefficients.

f. Is there a nonlinear relationship between age and the probability of smoking? Answer based on the regression in (c), and plot the relationship between probability of smoking and age for a white non-Hispanic male college graduate with no workplace smoking ban. Show in your graph ages between 18 and 65.

g. Estimate a probit model using the same regressors as in (c).

h. Test the hypothesis that the coefficient on smkban is zero in the probit regression. Compare the t-statistic and your conclusion with (d).

i. Mr. A is white, non-hispanic, 20 years old and a high school dropout. Using the probit regression from (g) and assuming that Mr. A is not subject to a workplace smoking ban, estimate the probability that Mr. A is a smoker. Carry out the same calculation if Mr. A is subject to a workplace smoking ban. What is the estimated effect of the smoking ban on his probability of smoking?

j. Repeat (i) for Ms. B, a female, black 40-year-old college graduate.

k. Repeat (i) and (j) using the linear model from part (c).

l. Do the results from the linear and the probit models differ? If they do, which results make more sense?


Unit 6.4

1. Consider the following regression:

$Y_{it} = \beta_0 + \beta_1 X_{it} + \gamma_2 D2_i + \gamma_3 D3_i + \dots + \gamma_n Dn_i + u_{it}$

The set $\{D1, D2, \dots, Dn\}$ are fixed-effects dummies for the n entities.

a. What are the slope and intercept for entity 1 in time period 1?

b. What are the slope and intercept for entity 1 in time period 3?

c. What are the slope and intercept for entity 3 in time period 1?

d. What are the slope and intercept for entity 3 in time period 3?

2. A researcher believes that traffic fatalities increase when roads are icy, so that states with more snow will have more fatalities than other states. Comment on the following methods designed to estimate the effect of snow on fatalities, in the context of a regression with state fixed effects.

a. The researcher collects data on the average snowfall over 10 years in each state and adds AverageSnow_i to his regression for each state i.

b. The researcher collects data on snowfall for each state in each year and adds

it


3.

death for Americans between the ages of 5 and 32. The research question is whether mandatory seat belt laws reduce the number of fatalities. There are two ways that mandatory seat belt laws are enforced. “Primary” enforcement means that a police officer can stop a car and ticket the driver if the officer observes an occupant not wearing a seat belt. “Secondary” enforcement means that a police officer can write a ticket if an occupant is not wearing a seat belt, but must have another reason to stop the car. The dataset contains data from 50 US states plus Washington DC (so 51 entities total), each collected for 15 years.

a. Convert the dataset to EViews format and estimate the effect of seat belt use on fatalities by regressing FatalityRate on sb_usage, speed65, speed70, ba08, drinkage21, ln(income) and age without any fixed effects. Does the estimated regression suggest that increased seatbelt use reduces fatalities?

b. Do the results change when you add state fixed effects?

c. Do the results change when you add state and time fixed effects?

d. Which regression specification – (a), (b) or (c) – is more reliable? Explain.

e. Using the results in (c), discuss the size of the coefficient on sb_usage. Is it large? How many lives would be saved if seat belt use increased from 52% to 90%?

f. Run a regression of sb_usage on primary, secondary, speed65, speed70, ba08, drinkage21, ln(income) and age, including state and time fixed effects. Here, primary is a dummy for primary enforcement and secondary is a dummy for secondary enforcement. Does primary enforcement increase seat belt use? What about secondary enforcement?

g. In 2000, New Jersey changed from secondary enforcement to primary